Significant Performance Gap in Reproduction - Implementation Discrepancies Found #9

@LengSenghak

Summary

I attempted to reproduce the SALAD results on MVTec LOCO using the provided code and default hyperparameters. However, the reproduced results show a significant performance gap compared to the paper's reported numbers. After careful analysis, I identified two critical discrepancies between the paper's stated implementation details (Section 4.2) and the actual code that likely explain this gap.

Reproduction Results vs. Paper

Paper's Reported Results (SALAD† with composition maps - Table 1):

| Category | Logical | Structural | Average |
| --- | --- | --- | --- |
| breakfast_box | 99.6 | 88.8 | 94.2 |
| juice_bottle | 99.6 | 98.9 | 99.3 |
| pushpins | 99.9 | 98.3 | 99.1 |
| screw_bag | 98.6 | 94.7 | 96.7 |
| splicing_connectors | 95.8 | 98.6 | 97.2 |
| Average | 98.7 | 95.8 | 97.3 |

My Reproduction Results (Using Current Code):

| Category | Logical | Structural | Average | Gap |
| --- | --- | --- | --- | --- |
| breakfast_box | 91.96 | 79.65 | 85.81 | -8.39 |
| juice_bottle | 99.80 | 99.41 | 99.61 | +0.31 |
| pushpins | 83.97 | 96.48 | 90.23 | -8.87 |
| screw_bag | 81.36 | 93.50 | 87.43 | -9.27 |
| splicing_connectors | 94.90 | 97.54 | 96.22 | -0.98 |
| Average | 90.40 | 93.32 | 91.86 | -5.44 |

Performance Gap Analysis:

  • Average gap: -5.44 points (91.86 vs 97.3)
  • Largest gap: screw_bag (-9.27 points)
  • Smallest gap: juice_bottle (+0.31 points - actually better!)
  • Logical anomalies: -8.3 points (90.40 vs 98.7)
  • Structural anomalies: -2.48 points (93.32 vs 95.8)

This is a substantial and consistent performance degradation across most categories, with a massive gap in logical anomaly detection (-8.3 points). The gap is especially severe for:

  • Screw bag: -9.27 points overall (-17.24 on logical!)
  • Pushpins: -8.87 points overall (-15.93 on logical!)
  • Breakfast box: -8.39 points overall (-7.64 on logical)

Only juice_bottle performs well, actually slightly exceeding paper results (+0.31 points).
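For anyone double-checking, the headline averages and gaps quoted above follow directly from the reproduction table (plain Python; the per-category numbers are copied from the table, and 97.3 / 98.7 / 95.8 are the paper's reported averages):

```python
# Per-category reproduction numbers, copied from the table above.
logical    = [91.96, 99.80, 83.97, 81.36, 94.90]
structural = [79.65, 99.41, 96.48, 93.50, 97.54]
overall    = [85.81, 99.61, 90.23, 87.43, 96.22]

def avg(xs):
    return round(sum(xs) / len(xs), 2)

assert avg(overall) == 91.86                      # reproduced average
assert round(avg(overall) - 97.3, 2) == -5.44     # gap vs. paper average
assert round(avg(logical) - 98.7, 2) == -8.30     # gap on logical anomalies
assert round(avg(structural) - 95.8, 2) == -2.48  # gap on structural anomalies
```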


Observed Issue

After a thorough code analysis, I identified two places where the code does not match the paper's specifications: (1) the training-iteration count and LR-drop schedule, detailed below, and (2) the UNet learning rate (the paper states lr=5e-4, while the code trains the composition UNet with lr=1e-4; see the Questions section).

Mathematical Inconsistency in Training Iterations

Paper Statements (Section 4.2):

"SALAD follows the training regime from EfficientAD - 70000 iterations with the Adam optimizer."

"Both learning rates were multiplied by 0.1 after 90% (66500) of the iterations."

Mathematical Problem:

The paper contains an internal inconsistency:

  • 90% of 70,000 iterations = 63,000 iterations
  • Paper explicitly states: 66,500 iterations
  • These numbers are mathematically incompatible!
  • To get 66,500 at 90%: 66,500 ÷ 0.9 = 73,889 ≈ 74,000 total iterations

Current Code:

File: argparser.py, Line 13

```python
parser.add_argument('-t', '--train_steps', type=int, default=70000)
```

File: train_salad.py, Line 179

```python
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, step_size=int(0.9 * config.train_steps), gamma=0.1)
```

With train_steps=70000, the LR drops at iteration 63,000, which is 3,500 iterations earlier than the paper's stated 66,500.

Issue:

The code implements 70,000 iterations (matching the paper's text), but this means:

  • the LR drops at iteration 63,000, not at 66,500 as stated in the paper
  • if the paper's actual run used ~74,000 iterations (so that 90% lands at 66,500), the released code trains ~4,000 iterations fewer
  • either way, the learning-rate drop occurs ~3,500 iterations earlier than the paper's stated point
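The two scenarios are easy to check numerically. A minimal sketch mirroring the repo's `int(0.9 * config.train_steps)` step-size computation (pure Python, no torch needed; `lr_drop_iteration` is an illustrative helper, not a function from the repo):

```python
def lr_drop_iteration(train_steps: int) -> int:
    # Mirrors StepLR(optimizer, step_size=int(0.9 * train_steps), gamma=0.1):
    # the learning rate is first multiplied by 0.1 at this iteration.
    return int(0.9 * train_steps)

# Current code: 70,000 total iterations -> drop at 63,000, not the paper's 66,500.
assert lr_drop_iteration(70_000) == 63_000

# How many total iterations would put the drop at 66,500?
total_needed = round(66_500 / 0.9)
assert total_needed == 73_889            # i.e. ~74,000 iterations
assert lr_drop_iteration(73_889) == 66_500
```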

Detailed Reproduction Results Breakdown

For transparency, here are the full results by category and detection method:

| Category | All Logical | All Structural | Maha Logical | Maha Structural | Img Logical | Img Structural | Comp Logical | Comp Structural |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| breakfast_box | 91.96 | 79.65 | 93.62 | 69.44 | 85.06 | 86.33 | 75.28 | 67.32 |
| juice_bottle | 99.80 | 99.41 | 99.66 | 95.14 | 95.78 | 99.78 | 89.44 | 79.33 |
| pushpins | 83.97 | 96.48 | 84.73 | 97.16 | 74.73 | 95.62 | 63.87 | 69.24 |
| screw_bag | 81.36 | 93.50 | 78.40 | 84.49 | 59.21 | 88.83 | 68.91 | 73.56 |
| splicing_connectors | 94.90 | 97.54 | 86.31 | 80.89 | 94.90 | 98.63 | 80.96 | 86.46 |
| Average | 90.40 | 93.32 | 88.54 | 85.42 | 81.94 | 93.84 | 75.69 | 75.18 |

Key Observation: The composition branch is catastrophically underperforming (75.69 logical vs expected ~95-98), which strongly suggests the composition maps are not properly trained due to the 5× lower UNet learning rate. This is the primary bottleneck causing the -8.3 point gap in logical anomaly detection and particularly devastating performance on screw_bag (-17.24 points) and pushpins (-15.93 points) logical anomalies.
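If the paper's stated value is what was actually used, the user-side workaround is a one-line change when constructing the UNet optimizer. A hedged sketch (the `torch.optim.Adam` call is the standard PyTorch API; the stand-in module and variable names are illustrative, not the repo's actual code):

```python
import torch

# Stand-in for the composition-segmentation UNet; illustrative only.
unet = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU())

# Released code default: lr=1e-4.  Paper, Section 4.2: lr=5e-4 (5x higher).
optimizer = torch.optim.Adam(unet.parameters(), lr=5e-4)

assert optimizer.param_groups[0]["lr"] == 5e-4
```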


Questions for Authors

  1. UNet Learning Rate Confirmation:
    • Can you confirm the actual UNet training used lr=5e-4 (as stated in the paper)?
    • The code has lr=1e-4 - is this a bug, or was the paper description incorrect?
    • Could you share example composition maps to verify quality?
  2. Training Iterations Clarification:
    • Was the actual training done with 70,000 or ~74,000 iterations?
    • Which is correct for the LR drop: "66,500" or "63,000"?
    • Possible scenarios:
      • a) Paper typo: should say "(63,000)" instead of "(66,500)"
      • b) Code bug: should use ~74,000 iterations to get 66,500 at 90%
      • c) Text simplification: the actual ~74k was rounded to "70k" in the text
  3. Reproducibility:
    • Can you reproduce the paper results with the current code and default hyperparameters?
    • Were there any additional hyperparameters or training procedures not documented?
    • What hardware was used? (GPU type, batch accumulation, etc.)
  4. Expected Performance:
    • What performance should we expect with the current code settings?
    • Is the ~5.4-point gap expected with the current hyperparameters?

Request for Authors

This is an excellent paper and the code release is greatly appreciated! However, to help the research community properly reproduce and build upon your work, could you please:

  1. Confirm or correct the hyperparameters:
    • UNet learning rate: 5e-4 or 1e-4?
    • Total training iterations: 70,000 or ~74,000?
    • LR drop point: 63,000 or 66,500?
  2. Help with reproduction:
    • Share any additional undocumented settings or procedures
    • Provide guidance on expected performance with the current code
    • Consider updating the README with confirmed hyperparameters
  3. Consider updating the code/paper:
    • Update the code to match the paper, OR
    • Update the paper to match the code, OR
    • Add a note explaining the discrepancy

Reproduction Details

My Setup:

  • Hardware: NVIDIA GPU (CUDA 12.1)
  • Python: 3.10
  • PyTorch: 2.1+
  • Dataset: MVTec LOCO (official download)
  • Code: Current repository (commit: latest)
  • Hyperparameters: All default values from the code
  • Random seed: 42 (default)
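As a side note for other reproducers: pinning every RNG source up front rules out run-to-run variance as the cause of a gap this large. A standard seeding helper (these are the usual PyTorch/NumPy calls, not code taken from the SALAD repo):

```python
import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    """Pin all common RNG sources so repeated runs are comparable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines

seed_everything(42)
```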

Training Procedure:

  1. Generated foreground masks with SAM
  2. Created pseudo-labels with DINO + SAM-HQ
  3. Trained composition segmentation UNet (15 epochs, lr=1e-4)
  4. Trained SALAD models (70,000 iterations)
  5. Evaluated on test set

Results Obtained:

  • Average AUROC: 91.86% (vs paper's 97.3%)
  • Gap: -5.44 points
  • Most affected: logical anomaly detection (-8.3 points)
  • Worst categories: screw_bag (-17.24 logical), pushpins (-15.93 logical)

Thank You

Despite these issues, SALAD represents important work in logical anomaly detection. The composition map generation approach is innovative and the overall framework is well-designed. I'm confident that with clarification on these hyperparameters, the community can properly reproduce and build upon this excellent research.

Thank you for your time and for contributing to open science! 🙏
