
Submit 1x A100 QAT Fix - 1.5252 BPB (Non-Record) [v4]#719

Closed
Shuvam-Banerji-Seal wants to merge 3 commits into openai:main from Shuvam-Banerji-Seal:submit-single-device-qat-v4

Conversation


@Shuvam-Banerji-Seal Shuvam-Banerji-Seal commented Mar 25, 2026

Single-device (A100) run that tunes hyperparameters down from their multi-device scales so the LR schedule behaves correctly.

Reporting clarification for final verification:

  • Attached run is measured on 1x A100 under 600s wallclock cap and stops at step 1186.
  • Train-time checkpoint metric at stop: val_bpb=1.4078.
  • Submission metric (submission.json val_bpb) is the final post-export sliding-window roundtrip metric: 1.52523098.
  • End-to-end runtime in attached log is ~33 minutes including final sliding-window evaluation.
  • H100 completion expectation is not used as a claimed metric in this submission; only measured A100 values are reported.
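For reference, the val_bpb numbers above follow the usual bits-per-byte convention: summed cross-entropy over the validation set, converted from nats to bits and divided by the byte count. A minimal sketch (the repo's actual eval code may differ):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) over a
    validation set into bits per byte (BPB)."""
    return total_nll_nats / (math.log(2) * total_bytes)
```

So a lower BPB means the model (plus any compressor round-trip) predicts the validation bytes with fewer bits each.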

Code improvements included in this series:

  • Swaps torch.quantile for w.abs().amax(dim=1).clamp_min to avoid a large Triton compilation slowdown.
  • Fixes bigram embedding guard for small vocab edge cases.
  • Makes compressor-dependent labels and final-roundtrip labels explicit in training logs.
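The quantile-to-amax swap in the first bullet can be sketched as follows. This is a hedged illustration of per-row abs-max scales for symmetric int8 quantization, not the record's exact code; `row_scales_amax` and its parameters are hypothetical names.

```python
import torch

def row_scales_amax(w: torch.Tensor, n_bits: int = 8, eps: float = 1e-8) -> torch.Tensor:
    """Per-output-row symmetric quantization scales via abs-max.
    Replaces a torch.quantile-based scale computation, which can hit a
    slow Triton compilation path under torch.compile."""
    qmax = 2 ** (n_bits - 1) - 1                     # 127 for int8
    absmax = w.abs().amax(dim=1).clamp_min(eps)      # one scale per row
    return absmax / qmax

# Usage: quantize, then dequantize with the same scales.
w = torch.randn(4, 16)
scale = row_scales_amax(w)                           # shape (4,)
q = torch.clamp(torch.round(w / scale[:, None]), -127, 127)
```

Unlike a quantile-based scale, abs-max clips nothing (every weight stays in range), at the cost of being sensitive to outlier weights.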

Copilot AI review requested due to automatic review settings March 25, 2026 13:43
Contributor

Copilot AI left a comment


Pull request overview

Adds a new non-record submission folder intended to provide a single-A100-friendly QAT tuning and mixed int6/int8 export flow, with updated logging and evaluation behavior.

Changes:

  • Introduces a new record train_gpt.py with QAT in CastedLinear, SWA, and mixed int6/int8 quantization + compressor-selectable export.
  • Adds submission metadata (submission.json) and documentation (README.md) describing the run and results.
  • Includes a train.log capturing the run output for verification.
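The QAT-in-CastedLinear idea described above (straight-through-estimator fake quantization during training) can be sketched like this. `FakeQuantLinear` is a hypothetical stand-in for illustration, not the record's actual `CastedLinear`:

```python
import torch
import torch.nn as nn

class FakeQuantLinear(nn.Linear):
    """Linear layer with straight-through-estimator (STE) fake quantization:
    the forward pass uses int8-rounded weights, while the backward pass
    treats the rounding as identity so gradients flow to the full-precision
    weights."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Per-row symmetric int8 scales via abs-max (see the quantile swap).
        scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127
        w_q = torch.clamp(torch.round(w / scale), -127, 127) * scale
        # STE: detach the quantization error so autograd sees identity.
        w_ste = w + (w_q - w).detach()
        return nn.functional.linear(x, w_ste, self.bias)
```

Training against the quantized weights this way lets the exported int6/int8 model see weights it was actually optimized for, rather than quantizing only at export time.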

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.

File Description
records/track_non_record_16mb/2026-03-23_Single_A100_QAT_FastFix/train_gpt.py New training/eval script with QAT, SWA, mixed quant export, and sliding-window eval.
records/track_non_record_16mb/2026-03-23_Single_A100_QAT_FastFix/train.log Run log intended to substantiate the reported results.
records/track_non_record_16mb/2026-03-23_Single_A100_QAT_FastFix/submission.json Submission metadata (val_bpb, bytes_total, blurb, author).
records/track_non_record_16mb/2026-03-23_Single_A100_QAT_FastFix/README.md Human-readable explanation and claimed results for the submission.


Comment on lines +3 to +4
"val_bpb": 1.4078,
"bytes_total": 15772699,

Copilot AI Mar 25, 2026


submission.json lists val_bpb=1.4078, but the provided train.log’s final roundtrip evaluation reports val_bpb≈1.52523. For the submission metadata to be verifiable, val_bpb should match the final evaluated artifact/roundtrip metric reported in the log (or the log/README should clearly indicate which metric is being reported and why).

"name": "Single A100 QAT Performance Fix",
"val_bpb": 1.4078,
"bytes_total": 15772699,
"blurb": "Enabled QAT directly within CastedLinear using straight-through estimators. Refactored torch.quantile to .amax(dim=1) to alleviate a 30x compiler performance penalty. Training loop fits perfectly in a Single A100 constraint for 10 minutes natively using 2600 steps (excludes final sliding-window evaluation which takes ~22 mins).",

Copilot AI Mar 25, 2026


The blurb claims the run "fits ... using 2600 steps", but the provided train.log shows an early stop at step 1186 due to the wallclock cap. Please update the blurb (or the log) so that the documented iteration count and runtime behavior match what actually ran.

Suggested change
"blurb": "Enabled QAT directly within CastedLinear using straight-through estimators. Refactored torch.quantile to .amax(dim=1) to alleviate a 30x compiler performance penalty. Training loop fits perfectly in a Single A100 constraint for 10 minutes natively using 2600 steps (excludes final sliding-window evaluation which takes ~22 mins).",
"blurb": "Enabled QAT directly within CastedLinear using straight-through estimators. Refactored torch.quantile to .amax(dim=1) to alleviate a 30x compiler performance penalty. Training loop fits within a Single A100 10-minute constraint, reaching 1186 steps before the wallclock cap (excludes final sliding-window evaluation which takes ~22 mins).",

@Shuvam-Banerji-Seal Shuvam-Banerji-Seal changed the title Submit 1x A100 QAT Fix - 1.4078 BPB (Non-Record) [v4] Submit 1x A100 QAT Fix - 1.5252 BPB (Non-Record) [v4] Mar 25, 2026
@Shuvam-Banerji-Seal
Author

Addressed the reporting consistency comments in this branch update (9bf5a51):

  • submission.json keeps val_bpb=1.52523 (final post-export sliding-window roundtrip metric from attached train.log).
  • submission.json blurb now explicitly distinguishes train-time wallclock-stop metric (step 1186, val_bpb=1.4078) vs final submission roundtrip metric (val_bpb=1.52523).
  • README.md now reports both metrics clearly and labels which one is used for submission metadata.
  • README.md includes explicit measured-runtime provenance and avoids using H100 expectation as a claimed metric.

This should fully reconcile the metadata and reporting with the attached log, making verification unambiguous.

@Shuvam-Banerji-Seal
Author

Superseded by #725 (v5) with clean reporting/provenance wording.
