Significant Performance Gap in Reproduction - Implementation Discrepancies Found
Summary
I attempted to reproduce the SALAD results on MVTec LOCO using the provided code and default hyperparameters. However, the reproduced results show a significant performance gap compared to the paper's reported numbers. After careful analysis, I identified two critical discrepancies between the paper's stated implementation details (Section 4.2) and the actual code that likely explain this gap.
Reproduction Results vs. Paper
Paper's Reported Results (SALAD† with composition maps - Table 1):
| Category | Logical | Structural | Average |
|---|---|---|---|
| breakfast_box | 99.6 | 88.8 | 94.2 |
| juice_bottle | 99.6 | 98.9 | 99.3 |
| pushpins | 99.9 | 98.3 | 99.1 |
| screw_bag | 98.6 | 94.7 | 96.7 |
| splicing_connectors | 95.8 | 98.6 | 97.2 |
| Average | 98.7 | 95.8 | 97.3 |
My Reproduction Results (Using Current Code):
| Category | Logical | Structural | Average | Gap |
|---|---|---|---|---|
| breakfast_box | 91.96 | 79.65 | 85.81 | -8.39 |
| juice_bottle | 99.80 | 99.41 | 99.61 | +0.31 |
| pushpins | 83.97 | 96.48 | 90.23 | -8.87 |
| screw_bag | 81.36 | 93.50 | 87.43 | -9.27 |
| splicing_connectors | 94.90 | 97.54 | 96.22 | -0.98 |
| Average | 90.40 | 93.32 | 91.86 | -5.44 |
Performance Gap Analysis:
- Average gap: -5.44 points (91.86 vs 97.3)
- Largest gap: screw_bag (-9.27 points)
- Smallest gap: juice_bottle (+0.31 points - actually better!)
- Logical anomalies: -8.3 points (90.40 vs 98.7)
- Structural anomalies: -2.48 points (93.32 vs 95.8)
This is a substantial and consistent performance degradation across most categories, with a massive gap in logical anomaly detection (-8.3 points). The gap is especially severe for:
- Screw bag: -9.27 points overall (-17.24 on logical!)
- Pushpins: -8.87 points overall (-15.93 on logical!)
- Breakfast box: -8.39 points overall (-7.64 on logical)
Only juice_bottle performs well, actually slightly exceeding paper results (+0.31 points).
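As a sanity check on the arithmetic, the per-split gaps quoted above can be recomputed directly from the two tables (values copied from this issue):

```python
# Average rows from the paper's table and from my reproduction.
paper = {"logical": 98.7, "structural": 95.8, "average": 97.3}
repro = {"logical": 90.40, "structural": 93.32, "average": 91.86}

# Gap = reproduction minus paper, rounded to two decimals.
gaps = {k: round(repro[k] - paper[k], 2) for k in paper}
print(gaps)  # {'logical': -8.3, 'structural': -2.48, 'average': -5.44}
```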
Observed Issue
After a thorough code analysis, I identified two discrepancies between the code and the paper's stated specifications:

1. UNet Learning Rate Discrepancy

The paper (Section 4.2) states the composition segmentation UNet was trained with `lr=5e-4`. The released code defaults to `lr=1e-4`, a 5× lower learning rate, which plausibly leaves the composition maps under-trained.

2. Training Iterations Mathematical Inconsistency

Paper statements (Section 4.2):

"SALAD follows the training regime from EfficientAD - 70000 iterations with the Adam optimizer."

"Both learning rates were multiplied by 0.1 after 90% (66500) of the iterations."

Mathematical problem: the paper is internally inconsistent.

- 90% of 70,000 iterations = 63,000 iterations
- The paper explicitly states 66,500 iterations
- These numbers are incompatible: for 66,500 to be the 90% mark, training would need 66,500 ÷ 0.9 ≈ 73,889, i.e. roughly 74,000 total iterations

Current code:

File: `argparser.py`, line 13

```python
parser.add_argument('-t', '--train_steps', type=int, default=70000)
```

File: `train_salad.py`, line 179

```python
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, step_size=int(0.9 * config.train_steps), gamma=0.1)
```

With `train_steps=70000`, the LR drops at iteration 63,000, which is 3,500 iterations earlier than the paper's stated 66,500.

Issue: the code implements 70,000 iterations (matching the paper's text), but this means:

- the LR drops at 63,000, not at 66,500 as stated in the paper
- if the paper actually trained for ~74,000 iterations, training is ~4,000 iterations shorter overall
- the reduced-LR fine-tuning phase starts 3,500 iterations earlier in the schedule than the paper describes
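The inconsistency is easy to verify numerically. The sketch below mirrors the `step_size=int(0.9 * config.train_steps)` computation passed to `StepLR` in `train_salad.py`, in plain Python so it runs without torch:

```python
import math

def lr_drop_iteration(train_steps: int) -> int:
    # Mirrors step_size=int(0.9 * train_steps) passed to StepLR in train_salad.py.
    return int(0.9 * train_steps)

# What the released code does with the default of 70,000 steps:
print(lr_drop_iteration(70_000))  # 63000

# Total iterations required for the drop to land at 66,500, as the paper states:
print(math.ceil(66_500 / 0.9))    # 73889, i.e. roughly 74,000 total iterations
```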
Detailed Reproduction Results Breakdown
For transparency, here are the full results by category and detection method:
| Category | All Logical | All Structural | Maha Logical | Maha Structural | Img Logical | Img Structural | Comp Logical | Comp Structural |
|---|---|---|---|---|---|---|---|---|
| breakfast_box | 91.96 | 79.65 | 93.62 | 69.44 | 85.06 | 86.33 | 75.28 | 67.32 |
| juice_bottle | 99.80 | 99.41 | 99.66 | 95.14 | 95.78 | 99.78 | 89.44 | 79.33 |
| pushpins | 83.97 | 96.48 | 84.73 | 97.16 | 74.73 | 95.62 | 63.87 | 69.24 |
| screw_bag | 81.36 | 93.50 | 78.40 | 84.49 | 59.21 | 88.83 | 68.91 | 73.56 |
| splicing_connectors | 94.90 | 97.54 | 86.31 | 80.89 | 94.90 | 98.63 | 80.96 | 86.46 |
| Average | 90.40 | 93.32 | 88.54 | 85.42 | 81.94 | 93.84 | 75.69 | 75.18 |
Key Observation: The composition branch is catastrophically underperforming (75.69 logical vs expected ~95-98), which strongly suggests the composition maps are not properly trained due to the 5× lower UNet learning rate. This is the primary bottleneck causing the -8.3 point gap in logical anomaly detection and particularly devastating performance on screw_bag (-17.24 points) and pushpins (-15.93 points) logical anomalies.
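For context, the per-branch numbers above are image-level AUROC scores. The sketch below shows how such a score is computed from per-image anomaly scores; the labels and scores here are made up for illustration, not actual SALAD outputs:

```python
def auroc(labels, scores):
    """Image-level AUROC: the probability that a randomly chosen anomalous
    image scores higher than a randomly chosen normal one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy test set: label 0 = good image, 1 = anomalous image. A weak branch
# (like the 75.69 composition result) ranks many anomalies below normals,
# as the 0.30-scored anomaly does here.
labels = [0, 0, 0, 1, 1, 1]
scores = [0.10, 0.35, 0.20, 0.80, 0.30, 0.95]
print(round(auroc(labels, scores) * 100, 2))  # 88.89
```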
Questions for Authors
1. UNet Learning Rate Confirmation:
   - Can you confirm the actual UNet training used `lr=5e-4` (as stated in the paper)?
   - The code has `lr=1e-4`: is this a bug, or was the paper description incorrect?
   - Could you share example composition maps to verify quality?
2. Training Iterations Clarification:
   - Was the actual training done with 70,000 or ~74,000 iterations?
   - Which is correct for the LR drop: 66,500 or 63,000?
   - Possible scenarios:
     - a) Paper typo: should say "(63,000)" instead of "(66,500)"
     - b) Code bug: should use ~74,000 iterations to reach 66,500 at 90%
     - c) Text simplification: the actual ~74k was rounded to "70k" in the text
3. Reproducibility:
   - Can you reproduce the paper's results with the current code and default hyperparameters?
   - Were there any additional hyperparameters or training procedures not documented?
   - What hardware was used (GPU type, batch accumulation, etc.)?
4. Expected Performance:
   - What performance should we expect with the current code settings?
   - Is the ~5.4-point gap expected with the current hyperparameters?
Request for Authors
This is an excellent paper and the code release is greatly appreciated! However, to help the research community properly reproduce and build upon your work, could you please:
1. Confirm or correct the hyperparameters:
   - UNet learning rate: 5e-4 or 1e-4?
   - Total training iterations: 70,000 or ~74,000?
   - LR drop point: 63,000 or 66,500?
2. Help with reproduction:
   - Share any additional undocumented settings or procedures
   - Provide guidance on expected performance with the current code
   - Consider updating the README with confirmed hyperparameters
3. Consider updating the code or paper:
   - Update the code to match the paper, or
   - Update the paper to match the code, or
   - Add a note explaining the discrepancy
Reproduction Details
My Setup:
- Hardware: NVIDIA GPU (CUDA 12.1)
- Python: 3.10
- PyTorch: 2.1+
- Dataset: MVTec LOCO (official download)
- Code: Current repository (commit: latest)
- Hyperparameters: All default values from the code
- Random seed: 42 (default)
Training Procedure:
- Generated foreground masks with SAM
- Created pseudo-labels with DINO + SAM-HQ
- Trained composition segmentation UNet (15 epochs, lr=1e-4)
- Trained SALAD models (70,000 iterations)
- Evaluated on test set
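For reference, the exact values I ran with can be captured in one place (a plain dict of my own making, mirroring the code defaults; the key names are mine, not the repo's):

```python
# Hyperparameters used in this reproduction, as taken from the code defaults.
repro_config = {
    "seed": 42,                          # default random seed
    "unet_epochs": 15,                   # composition segmentation UNet
    "unet_lr": 1e-4,                     # code default; paper Section 4.2 says 5e-4
    "train_steps": 70_000,               # argparser.py default for SALAD training
    "lr_drop_step": int(0.9 * 70_000),   # StepLR drop point: 63,000, not 66,500
}
print(repro_config["lr_drop_step"])  # 63000
```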
Results Obtained:
- Average AUROC: 91.86% (vs paper's 97.3%)
- Gap: -5.44 points
- Most affected: logical anomaly detection (-8.3 points)
- Worst categories: screw_bag (-17.24 logical), pushpins (-15.93 logical)
Thank You
Despite these issues, SALAD represents important work in logical anomaly detection. The composition map generation approach is innovative and the overall framework is well-designed. I'm confident that with clarification on these hyperparameters, the community can properly reproduce and build upon this excellent research.
Thank you for your time and for contributing to open science! 🙏