Submit 1x A100 QAT Fix - 1.5252 BPB (Non-Record) [v4] #719
Shuvam-Banerji-Seal wants to merge 3 commits into openai:main from
Conversation
Pull request overview
Adds a new non-record submission folder intended to provide a single-A100-friendly QAT tuning and mixed int6/int8 export flow, with updated logging and evaluation behavior.
Changes:
- Introduces a new record `train_gpt.py` with QAT in `CastedLinear`, SWA, and mixed int6/int8 quantization + compressor-selectable export.
- Adds submission metadata (`submission.json`) and documentation (`README.md`) describing the run and results.
- Includes a `train.log` capturing the run output for verification.
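The mixed int6/int8 export mentioned in the changes above can be pictured as symmetric quantize-dequantize at two bit widths. The sketch below is illustrative only, not the PR's actual code; `quantize_symmetric` and the sample weights are hypothetical names chosen for the example.

```python
def quantize_symmetric(values, bits):
    """Quantize-dequantize a list of floats at the given bit width (sketch)."""
    qmax = 2 ** (bits - 1) - 1           # 31 for int6, 127 for int8
    qmin = -(2 ** (bits - 1))            # -32 for int6, -128 for int8
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / qmax                  # symmetric scale from absolute max
    q = [min(qmax, max(qmin, round(v / scale))) for v in values]
    return [qi * scale for qi in q], scale

w = [0.5, -1.0, 0.25]
w6, s6 = quantize_symmetric(w, 6)   # coarser grid, smaller export
w8, s8 = quantize_symmetric(w, 8)   # finer grid, larger export
```

A mixed export would pick the bit width per tensor (or per layer), trading reconstruction error against the 16 MB byte budget.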
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| records/track_non_record_16mb/2026-03-23_Single_A100_QAT_FastFix/train_gpt.py | New training/eval script with QAT, SWA, mixed quant export, and sliding-window eval. |
| records/track_non_record_16mb/2026-03-23_Single_A100_QAT_FastFix/train.log | Run log intended to substantiate the reported results. |
| records/track_non_record_16mb/2026-03-23_Single_A100_QAT_FastFix/submission.json | Submission metadata (val_bpb, bytes_total, blurb, author). |
| records/track_non_record_16mb/2026-03-23_Single_A100_QAT_FastFix/README.md | Human-readable explanation and claimed results for the submission. |
| "val_bpb": 1.4078, | ||
| "bytes_total": 15772699, |
submission.json lists val_bpb=1.4078, but the provided train.log’s final roundtrip evaluation reports val_bpb≈1.52523. For the submission metadata to be verifiable, val_bpb should match the final evaluated artifact/roundtrip metric reported in the log (or the log/README should clearly indicate which metric is being reported and why).
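For context on the mismatch: val_bpb (bits per byte) is conventionally the summed cross-entropy in nats divided by ln(2) times the number of evaluated bytes. A verifier recomputing the metric from the log could use something like the following sketch; the function name and inputs are illustrative, not the repository's actual evaluation code.

```python
import math

def val_bpb(total_nll_nats, total_bytes):
    """Bits-per-byte from total negative log-likelihood measured in nats."""
    # nats -> bits via division by ln(2), then normalize per evaluated byte
    return total_nll_nats / (math.log(2) * total_bytes)
```

Under this definition, the number in `submission.json` and the final roundtrip number in `train.log` should agree up to rounding; a 1.41 vs 1.53 gap indicates they were computed on different artifacts or at different points in training.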
| "name": "Single A100 QAT Performance Fix", | ||
| "val_bpb": 1.4078, | ||
| "bytes_total": 15772699, | ||
| "blurb": "Enabled QAT directly within CastedLinear using straight-through estimators. Refactored torch.quantile to .amax(dim=1) to alleviate a 30x compiler performance penalty. Training loop fits perfectly in a Single A100 constraint for 10 minutes natively using 2600 steps (excludes final sliding-window evaluation which takes ~22 mins).", |
The blurb claims the run "fits ... using 2600 steps", but the provided train.log shows an early stop at step 1186 due to the wallclock cap. Please update the blurb (or the log) so that the documented iteration count and runtime behavior match what actually ran.
| "blurb": "Enabled QAT directly within CastedLinear using straight-through estimators. Refactored torch.quantile to .amax(dim=1) to alleviate a 30x compiler performance penalty. Training loop fits perfectly in a Single A100 constraint for 10 minutes natively using 2600 steps (excludes final sliding-window evaluation which takes ~22 mins).", | |
| "blurb": "Enabled QAT directly within CastedLinear using straight-through estimators. Refactored torch.quantile to .amax(dim=1) to alleviate a 30x compiler performance penalty. Training loop fits within a Single A100 10-minute constraint, reaching 1186 steps before the wallclock cap (excludes final sliding-window evaluation which takes ~22 mins).", |
records/track_non_record_16mb/2026-03-23_Single_A100_QAT_FastFix/README.md (comment outdated; resolved)
Addressed the reporting consistency comments in this branch update (9bf5a51):
This should fully reconcile metadata/reporting with the attached log for unambiguous verification.
Superseded by #725 (v5) with clean reporting/provenance wording. |
Single-device (A100) run, with hyperparameters tuned down from multi-device scales to ensure proper LR scheduling.
Reporting clarification for final verification:
Code improvements included in this series: