Significant Performance Gap in Reproduction - Implementation Discrepancies Found #9

@LengSenghak

Summary

I attempted to reproduce the SALAD results on MVTec LOCO using the provided code and default hyperparameters. However, the reproduced results show a significant performance gap compared to the paper's reported numbers. After careful analysis, I identified two critical discrepancies between the paper's stated implementation details (Section 4.2) and the actual code that likely explain this gap.

Reproduction Results vs. Paper

Paper's Reported Results (SALAD† with composition maps - Table 1):

| Category | Logical | Structural | Average |
| --- | --- | --- | --- |
| breakfast_box | 99.6 | 88.8 | 94.2 |
| juice_bottle | 99.6 | 98.9 | 99.3 |
| pushpins | 99.9 | 98.3 | 99.1 |
| screw_bag | 98.6 | 94.7 | 96.7 |
| splicing_connectors | 95.8 | 98.6 | 97.2 |
| Average | 98.7 | 95.8 | 97.3 |

My Reproduction Results (Using Current Code):

| Category | Logical | Structural | Average | Gap |
| --- | --- | --- | --- | --- |
| breakfast_box | 91.96 | 79.65 | 85.81 | -8.39 |
| juice_bottle | 99.80 | 99.41 | 99.61 | +0.31 |
| pushpins | 83.97 | 96.48 | 90.23 | -8.87 |
| screw_bag | 81.36 | 93.50 | 87.43 | -9.27 |
| splicing_connectors | 94.90 | 97.54 | 96.22 | -0.98 |
| Average | 90.40 | 93.32 | 91.86 | -5.44 |

Performance Gap Analysis:

  • Average gap: -5.44 points (91.86 vs 97.3)
  • Largest gap: screw_bag (-9.27 points)
  • Smallest gap: juice_bottle (+0.31 points - actually better!)
  • Logical anomalies: -8.3 points (90.40 vs 98.7)
  • Structural anomalies: -2.48 points (93.32 vs 95.8)

This is a substantial and consistent performance degradation across most categories, with a massive gap in logical anomaly detection (-8.3 points). The gap is especially severe for:

  • Screw bag: -9.27 points overall (-17.24 on logical!)
  • Pushpins: -8.87 points overall (-15.93 on logical!)
  • Breakfast box: -8.39 points overall (-7.64 on logical)

Only juice_bottle performs well, actually slightly exceeding paper results (+0.31 points).
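For anyone double-checking, the headline averages and gaps quoted above follow directly from the reproduction table (plain Python; the per-category numbers are copied from the table, and 97.3 / 98.7 / 95.8 are the paper's reported averages):

```python
# Per-category reproduction numbers, copied from the table above.
logical    = [91.96, 99.80, 83.97, 81.36, 94.90]
structural = [79.65, 99.41, 96.48, 93.50, 97.54]
overall    = [85.81, 99.61, 90.23, 87.43, 96.22]

def avg(xs):
    return round(sum(xs) / len(xs), 2)

assert avg(overall) == 91.86                      # reproduced average
assert round(avg(overall) - 97.3, 2) == -5.44     # gap vs. paper average
assert round(avg(logical) - 98.7, 2) == -8.30     # gap on logical anomalies
assert round(avg(structural) - 95.8, 2) == -2.48  # gap on structural anomalies
```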


Observed Issue

After a thorough code analysis, I identified two places where the code does not match the paper's specifications: (1) the training-iteration count and LR-drop schedule, detailed below, and (2) the UNet learning rate (the paper states lr=5e-4, while the code trains the composition UNet with lr=1e-4; see the Questions section).

Mathematical Inconsistency in Training Iterations

Paper Statements (Section 4.2):

"SALAD follows the training regime from EfficientAD - 70000 iterations with the Adam optimizer."

"Both learning rates were multiplied by 0.1 after 90% (66500) of the iterations."

Mathematical Problem:

The paper contains an internal inconsistency:

  • 90% of 70,000 iterations = 63,000 iterations
  • Paper explicitly states: 66,500 iterations
  • These numbers are mathematically incompatible!
  • To get 66,500 at 90%: 66,500 ÷ 0.9 = 73,889 ≈ 74,000 total iterations

Current Code:

File: argparser.py, Line 13

```python
parser.add_argument('-t', '--train_steps', type=int, default=70000)
```

File: train_salad.py, Line 179

```python
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, step_size=int(0.9 * config.train_steps), gamma=0.1)
```

With train_steps=70000, the LR drops at iteration 63,000, which is 3,500 iterations earlier than the paper's stated 66,500.

Issue:

The code implements 70,000 iterations (matching the paper's text), but this means:

  • the LR drops at iteration 63,000, not at 66,500 as stated in the paper
  • if the paper's actual run used ~74,000 iterations (so that 90% lands at 66,500), the released code trains ~4,000 iterations fewer
  • either way, the learning-rate drop occurs ~3,500 iterations earlier than the paper's stated point
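The two scenarios are easy to check numerically. A minimal sketch mirroring the repo's `int(0.9 * config.train_steps)` step-size computation (pure Python, no torch needed; `lr_drop_iteration` is an illustrative helper, not a function from the repo):

```python
def lr_drop_iteration(train_steps: int) -> int:
    # Mirrors StepLR(optimizer, step_size=int(0.9 * train_steps), gamma=0.1):
    # the learning rate is first multiplied by 0.1 at this iteration.
    return int(0.9 * train_steps)

# Current code: 70,000 total iterations -> drop at 63,000, not the paper's 66,500.
assert lr_drop_iteration(70_000) == 63_000

# How many total iterations would put the drop at 66,500?
total_needed = round(66_500 / 0.9)
assert total_needed == 73_889            # i.e. ~74,000 iterations
assert lr_drop_iteration(73_889) == 66_500
```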

Detailed Reproduction Results Breakdown

For transparency, here are the full results by category and detection method:

| Category | All Logical | All Structural | Maha Logical | Maha Structural | Img Logical | Img Structural | Comp Logical | Comp Structural |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| breakfast_box | 91.96 | 79.65 | 93.62 | 69.44 | 85.06 | 86.33 | 75.28 | 67.32 |
| juice_bottle | 99.80 | 99.41 | 99.66 | 95.14 | 95.78 | 99.78 | 89.44 | 79.33 |
| pushpins | 83.97 | 96.48 | 84.73 | 97.16 | 74.73 | 95.62 | 63.87 | 69.24 |
| screw_bag | 81.36 | 93.50 | 78.40 | 84.49 | 59.21 | 88.83 | 68.91 | 73.56 |
| splicing_connectors | 94.90 | 97.54 | 86.31 | 80.89 | 94.90 | 98.63 | 80.96 | 86.46 |
| Average | 90.40 | 93.32 | 88.54 | 85.42 | 81.94 | 93.84 | 75.69 | 75.18 |

Key Observation: The composition branch is catastrophically underperforming (75.69 logical vs expected ~95-98), which strongly suggests the composition maps are not properly trained due to the 5× lower UNet learning rate. This is the primary bottleneck causing the -8.3 point gap in logical anomaly detection and particularly devastating performance on screw_bag (-17.24 points) and pushpins (-15.93 points) logical anomalies.
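If the paper's stated value is what was actually used, the user-side workaround is a one-line change when constructing the UNet optimizer. A hedged sketch (the `torch.optim.Adam` call is the standard PyTorch API; the stand-in module and variable names are illustrative, not the repo's actual code):

```python
import torch

# Stand-in for the composition-segmentation UNet; illustrative only.
unet = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU())

# Released code default: lr=1e-4.  Paper, Section 4.2: lr=5e-4 (5x higher).
optimizer = torch.optim.Adam(unet.parameters(), lr=5e-4)

assert optimizer.param_groups[0]["lr"] == 5e-4
```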


Questions for Authors

  1. UNet Learning Rate Confirmation:
    • Can you confirm the actual UNet training used lr=5e-4 (as stated in the paper)?
    • The code has lr=1e-4 - is this a bug, or was the paper description incorrect?
    • Could you share example composition maps to verify quality?
  2. Training Iterations Clarification:
    • Was the actual training done with 70,000 or ~74,000 iterations?
    • Which is correct for the LR drop: "66,500" or "63,000"?
    • Possible scenarios:
      • a) Paper typo: should say "(63,000)" instead of "(66,500)"
      • b) Code bug: should use ~74,000 iterations to get 66,500 at 90%
      • c) Text simplification: the actual ~74k was rounded to "70k" in the text
  3. Reproducibility:
    • Can you reproduce the paper results with the current code and default hyperparameters?
    • Were there any additional hyperparameters or training procedures not documented?
    • What hardware was used? (GPU type, batch accumulation, etc.)
  4. Expected Performance:
    • What performance should we expect with the current code settings?
    • Is the ~5.4-point gap expected with the current hyperparameters?

Request for Authors

This is an excellent paper and the code release is greatly appreciated! However, to help the research community properly reproduce and build upon your work, could you please:

  1. Confirm or correct the hyperparameters:
    • UNet learning rate: 5e-4 or 1e-4?
    • Total training iterations: 70,000 or ~74,000?
    • LR drop point: 63,000 or 66,500?
  2. Help with reproduction:
    • Share any additional undocumented settings or procedures
    • Provide guidance on expected performance with the current code
    • Consider updating the README with confirmed hyperparameters
  3. Consider updating the code/paper:
    • Update the code to match the paper, OR
    • Update the paper to match the code, OR
    • Add a note explaining the discrepancy

Reproduction Details

My Setup:

  • Hardware: NVIDIA GPU (CUDA 12.1)
  • Python: 3.10
  • PyTorch: 2.1+
  • Dataset: MVTec LOCO (official download)
  • Code: Current repository (commit: latest)
  • Hyperparameters: All default values from the code
  • Random seed: 42 (default)
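As a side note for other reproducers: pinning every RNG source up front rules out run-to-run variance as the cause of a gap this large. A standard seeding helper (these are the usual PyTorch/NumPy calls, not code taken from the SALAD repo):

```python
import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    """Pin all common RNG sources so repeated runs are comparable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines

seed_everything(42)
```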

Training Procedure:

  1. Generated foreground masks with SAM
  2. Created pseudo-labels with DINO + SAM-HQ
  3. Trained composition segmentation UNet (15 epochs, lr=1e-4)
  4. Trained SALAD models (70,000 iterations)
  5. Evaluated on test set

Results Obtained:

  • Average AUROC: 91.86% (vs paper's 97.3%)
  • Gap: -5.44 points
  • Most affected: logical anomaly detection (-8.3 points)
  • Worst categories: screw_bag (-17.24 logical), pushpins (-15.93 logical)

Thank You

Despite these issues, SALAD represents important work in logical anomaly detection. The composition map generation approach is innovative and the overall framework is well-designed. I'm confident that with clarification on these hyperparameters, the community can properly reproduce and build upon this excellent research.

Thank you for your time and for contributing to open science! 🙏
