Submit 1x A100 QAT Fix - 1.4078 BPB (Non-Record) [v2] #707
Shuvam-Banerji-Seal wants to merge 1 commit into openai:main from
Conversation
Pull request overview
Adds a new non-record track_non_record_16mb submission tuned for single-A100 QAT runs, focusing on correct LR scheduling at 1-GPU scale and avoiding a torch.quantile-related compile slowdown in the QAT path.
Changes:
- Introduces a full `train_gpt.py` training/eval + int6/int8 mixed-quant export pipeline for the submission.
- Adds run artifacts (`train.log`, `submission.json`, `README.md`) documenting results and configuration.
- Updates QAT weight clipping inside `CastedLinear` to use `amax(...).clamp_min(...)` instead of `torch.quantile` in the hot path.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| records/track_non_record_16mb/2026-03-23_Single_A100_QAT_FastFix/train_gpt.py | Main training script implementing single-A100 hyperparams, QAT in CastedLinear, and export/roundtrip eval. |
| records/track_non_record_16mb/2026-03-23_Single_A100_QAT_FastFix/train.log | Captured training + final sliding-window eval output for the submission. |
| records/track_non_record_16mb/2026-03-23_Single_A100_QAT_FastFix/submission.json | Submission metadata (name, val_bpb, size, blurb, author/date). |
| records/track_non_record_16mb/2026-03-23_Single_A100_QAT_FastFix/README.md | Human-readable summary of the rationale and reported results. |
```python
code_bytes = len(code.encode("utf-8"))
log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes")
log0(f"Total submission size int8+zlib: {quant_file_bytes + code_bytes} bytes")
```
The log message hard-codes "int8+zlib" for total submission size, but this script supports multiple compressors (zstd/lz4/zlib) and the size being reported is for the chosen _COMPRESSOR. This makes logs (and any downstream parsing) misleading when COMPRESSOR != zlib; consider using _COMPRESSOR in the message (and/or aligning the "int6/int8" label with the actual quantization).
```python
def bigram_hash(self, tokens: Tensor) -> Tensor:
    t = tokens.to(torch.int32)
    mod = self.bigram_vocab_size - 1
    out = torch.empty_like(t)
    out[..., 0] = mod
    out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
    return out.long()
```
bigram_hash uses mod = bigram_vocab_size - 1 and then does % mod; if BIGRAM_VOCAB_SIZE is set to 1 (or 0), this will divide/modulo by 0 and crash. Add an explicit validation that bigram_vocab_size >= 2 when enabling bigram embeddings, or handle these small values safely.
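One way to add the suggested guard, as a standalone sketch (the function name and message wording are illustrative, not taken from the PR):

```python
def validate_bigram_vocab_size(bigram_vocab_size: int) -> None:
    # bigram_hash computes `% (bigram_vocab_size - 1)`, so any value below 2
    # produces a modulo by zero (or a negative modulus) at runtime.
    if bigram_vocab_size < 2:
        raise ValueError(
            f"bigram_vocab_size must be >= 2 when bigram embeddings are "
            f"enabled, got {bigram_vocab_size}"
        )
```

Calling this once at model construction fails fast with a clear message instead of crashing deep inside the hashing hot path.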
Single-device (A100) run, tuning hyperparameters down from multi-device scales to ensure proper LR scheduling.

Swaps `torch.quantile` for `w.amax().clamp_min` to evade a 30x compiler performance penalty in Triton.

Closes #527