Submit 1x A100 QAT Fix - 1.4078 BPB (Non-Record) [v2] #707
Shuvam-Banerji-Seal wants to merge 1 commit into openai:main from
Conversation
Pull request overview
Adds a new non-record track_non_record_16mb submission tuned for single-A100 QAT runs, focusing on correct LR scheduling at 1-GPU scale and avoiding a torch.quantile-related compile slowdown in the QAT path.
Changes:
- Introduces a full `train_gpt.py` training/eval + int6/int8 mixed-quant export pipeline for the submission.
- Adds run artifacts (`train.log`, `submission.json`, `README.md`) documenting results and configuration.
- Updates QAT weight clipping inside `CastedLinear` to use `amax(...).clamp_min(...)` instead of `torch.quantile` in the hot path.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| records/track_non_record_16mb/2026-03-23_Single_A100_QAT_FastFix/train_gpt.py | Main training script implementing single-A100 hyperparams, QAT in CastedLinear, and export/roundtrip eval. |
| records/track_non_record_16mb/2026-03-23_Single_A100_QAT_FastFix/train.log | Captured training + final sliding-window eval output for the submission. |
| records/track_non_record_16mb/2026-03-23_Single_A100_QAT_FastFix/submission.json | Submission metadata (name, val_bpb, size, blurb, author/date). |
| records/track_non_record_16mb/2026-03-23_Single_A100_QAT_FastFix/README.md | Human-readable summary of the rationale and reported results. |
```python
code_bytes = len(code.encode("utf-8"))
log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes")
log0(f"Total submission size int8+zlib: {quant_file_bytes + code_bytes} bytes")
```
The log message hard-codes "int8+zlib" for total submission size, but this script supports multiple compressors (zstd/lz4/zlib) and the size being reported is for the chosen _COMPRESSOR. This makes logs (and any downstream parsing) misleading when COMPRESSOR != zlib; consider using _COMPRESSOR in the message (and/or aligning the "int6/int8" label with the actual quantization).
```python
def bigram_hash(self, tokens: Tensor) -> Tensor:
    t = tokens.to(torch.int32)
    mod = self.bigram_vocab_size - 1
    out = torch.empty_like(t)
    out[..., 0] = mod
    out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
    return out.long()
```
bigram_hash uses mod = bigram_vocab_size - 1 and then does % mod; if BIGRAM_VOCAB_SIZE is set to 1 (or 0), this will divide/modulo by 0 and crash. Add an explicit validation that bigram_vocab_size >= 2 when enabling bigram embeddings, or handle these small values safely.
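One way to add the suggested guard, as a standalone sketch (the function name and message wording are illustrative, not taken from the PR):

```python
def validate_bigram_vocab_size(bigram_vocab_size: int) -> None:
    # bigram_hash computes `% (bigram_vocab_size - 1)`, so any value below 2
    # produces a modulo by zero (or a negative modulus) at runtime.
    if bigram_vocab_size < 2:
        raise ValueError(
            f"bigram_vocab_size must be >= 2 when bigram embeddings are "
            f"enabled, got {bigram_vocab_size}"
        )
```

Calling this once at model construction fails fast with a clear message instead of crashing deep inside the hashing hot path.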
Single-device (A100) run, tuning hyperparameters down from multi-device scales to ensure proper LR scheduling.

Swaps `torch.quantile` for `w.amax().clamp_min` to evade a 30x compiler performance penalty in Triton.

Closes #527