Submit 1x A100 QAT Fix - 1.5252 BPB (Non-Record) [v4] #719
Shuvam-Banerji-Seal wants to merge 3 commits into openai:main from
Conversation
Pull request overview
Adds a new non-record submission folder intended to provide a single-A100-friendly QAT tuning and mixed int6/int8 export flow, with updated logging and evaluation behavior.
Changes:
- Introduces a new record `train_gpt.py` with QAT in `CastedLinear`, SWA, and mixed int6/int8 quantization + compressor-selectable export.
- Adds submission metadata (`submission.json`) and documentation (`README.md`) describing the run and results.
- Includes a `train.log` capturing the run output for verification.
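The mixed int6/int8 export mentioned in the changes above can be pictured as symmetric quantize-dequantize at two bit widths. The sketch below is illustrative only, not the PR's actual code; `quantize_symmetric` and the sample weights are hypothetical names chosen for the example.

```python
def quantize_symmetric(values, bits):
    """Quantize-dequantize a list of floats at the given bit width (sketch)."""
    qmax = 2 ** (bits - 1) - 1           # 31 for int6, 127 for int8
    qmin = -(2 ** (bits - 1))            # -32 for int6, -128 for int8
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / qmax                  # symmetric scale from absolute max
    q = [min(qmax, max(qmin, round(v / scale))) for v in values]
    return [qi * scale for qi in q], scale

w = [0.5, -1.0, 0.25]
w6, s6 = quantize_symmetric(w, 6)   # coarser grid, smaller export
w8, s8 = quantize_symmetric(w, 8)   # finer grid, larger export
```

A mixed export would pick the bit width per tensor (or per layer), trading reconstruction error against the 16 MB byte budget.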
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| records/track_non_record_16mb/2026-03-23_Single_A100_QAT_FastFix/train_gpt.py | New training/eval script with QAT, SWA, mixed quant export, and sliding-window eval. |
| records/track_non_record_16mb/2026-03-23_Single_A100_QAT_FastFix/train.log | Run log intended to substantiate the reported results. |
| records/track_non_record_16mb/2026-03-23_Single_A100_QAT_FastFix/submission.json | Submission metadata (val_bpb, bytes_total, blurb, author). |
| records/track_non_record_16mb/2026-03-23_Single_A100_QAT_FastFix/README.md | Human-readable explanation and claimed results for the submission. |
| "val_bpb": 1.4078, | ||
| "bytes_total": 15772699, |
submission.json lists val_bpb=1.4078, but the provided train.log’s final roundtrip evaluation reports val_bpb≈1.52523. For the submission metadata to be verifiable, val_bpb should match the final evaluated artifact/roundtrip metric reported in the log (or the log/README should clearly indicate which metric is being reported and why).
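For context on the mismatch: val_bpb (bits per byte) is conventionally the summed cross-entropy in nats divided by ln(2) times the number of evaluated bytes. A verifier recomputing the metric from the log could use something like the following sketch; the function name and inputs are illustrative, not the repository's actual evaluation code.

```python
import math

def val_bpb(total_nll_nats, total_bytes):
    """Bits-per-byte from total negative log-likelihood measured in nats."""
    # nats -> bits via division by ln(2), then normalize per evaluated byte
    return total_nll_nats / (math.log(2) * total_bytes)
```

Under this definition, the number in `submission.json` and the final roundtrip number in `train.log` should agree up to rounding; a 1.41 vs 1.53 gap indicates they were computed on different artifacts or at different points in training.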
| "name": "Single A100 QAT Performance Fix", | ||
| "val_bpb": 1.4078, | ||
| "bytes_total": 15772699, | ||
| "blurb": "Enabled QAT directly within CastedLinear using straight-through estimators. Refactored torch.quantile to .amax(dim=1) to alleviate a 30x compiler performance penalty. Training loop fits perfectly in a Single A100 constraint for 10 minutes natively using 2600 steps (excludes final sliding-window evaluation which takes ~22 mins).", |
The blurb claims the run "fits ... using 2600 steps", but the provided train.log shows an early stop at step 1186 due to the wallclock cap. Please update the blurb (or the log) so that the documented iteration count and runtime behavior match what actually ran.
| "blurb": "Enabled QAT directly within CastedLinear using straight-through estimators. Refactored torch.quantile to .amax(dim=1) to alleviate a 30x compiler performance penalty. Training loop fits perfectly in a Single A100 constraint for 10 minutes natively using 2600 steps (excludes final sliding-window evaluation which takes ~22 mins).", | |
| "blurb": "Enabled QAT directly within CastedLinear using straight-through estimators. Refactored torch.quantile to .amax(dim=1) to alleviate a 30x compiler performance penalty. Training loop fits within a Single A100 10-minute constraint, reaching 1186 steps before the wallclock cap (excludes final sliding-window evaluation which takes ~22 mins).", |
records/track_non_record_16mb/2026-03-23_Single_A100_QAT_FastFix/README.md (comment outdated; resolved)
Addressed the reporting consistency comments in this branch update (9bf5a51):
This should fully reconcile metadata/reporting with the attached log for unambiguous verification.
Superseded by #725 (v5) with clean reporting/provenance wording. |
Single-device (A100) run, with hyperparameters tuned down from multi-device scales to ensure proper LR scheduling.
Reporting clarification for final verification:
Code improvements included in this series: