Record: 11L XSA6 + Warmdown3000 + QAT@0.30 (val_bpb=1.1352, 2-seed mean) by 0xNoramiya · Pull Request #695 · openai/parameter-golf

0xNoramiya · 2026-03-25T08:23:28Z

11L XSA6 + Warmdown3000 + QAT@0.30 (val_bpb: 1.1352)

val_bpb: 1.1360 (2-seed mean, sliding-window eval with stride=64; best seed: 1.1352) | 15.88 MB | 8xH100 SXM, 600s

Changes from SOTA (PR #414)

Three targeted hyperparameter changes identified through 37 local ablation experiments on an RTX 4060 Ti:

Change	SOTA (PR #414)	Ours	Rationale
XSA layers	last 4	last 6	More context-only attention layers
Warmdown	3500 iters	3000 iters	Shorter cooldown preserves more full-LR training
Late QAT threshold	0.15	0.30	Earlier QAT gives more steps to adapt to int6
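
To make the interaction between the warmdown length and the late-QAT threshold concrete, here is a minimal sketch. It assumes a linear warmdown of the LR multiplier over the final `warmdown_iters` steps and assumes QAT switches on once that multiplier drops below the threshold. Both rules and the names `lr_scale` and `qat_steps` are illustrative assumptions, not code taken from train_gpt.py.

```python
def lr_scale(step: int, total_steps: int, warmdown_iters: int) -> float:
    """LR multiplier: 1.0 until the last `warmdown_iters` steps, then linear decay to 0.

    Assumption: linear, step-based warmdown (the real script may differ).
    """
    remaining = total_steps - step
    if remaining >= warmdown_iters:
        return 1.0
    return max(remaining, 0) / warmdown_iters


def qat_steps(total_steps: int, warmdown_iters: int, qat_threshold: float) -> int:
    """Count training steps on which QAT would be active.

    Assumption: QAT is enabled whenever the LR multiplier is below `qat_threshold`.
    """
    return sum(
        1
        for step in range(total_steps)
        if lr_scale(step, total_steps, warmdown_iters) < qat_threshold
    )


if __name__ == "__main__":
    total = 5447  # steps reached by seed 42 in this PR
    print("SOTA settings:", qat_steps(total, warmdown_iters=3500, qat_threshold=0.15))
    print("This PR:      ", qat_steps(total, warmdown_iters=3000, qat_threshold=0.30))
```

Under these assumptions, QAT covers roughly 0.30 x 3000 = 900 steps with the new settings versus roughly 0.15 x 3500 = 525 with the SOTA settings, which is the sense in which the higher threshold "gives more steps to adapt to int6".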

Results (2 seeds, 8xH100 SXM)

Seed	Steps	ms/step	Sliding BPB (s64)	Artifact
42	5,447	110.2	1.1352	15,883,805 bytes
1337	5,448	110.1	1.1367	15,730,868 bytes

Mean: 1.1360 | Std: 0.0008 | Submitted: seed 42 (best)
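
The summary statistics can be rechecked from the two per-seed values in the table. The sketch below uses only the standard library. The dictionary name `sliding_bpb` is illustrative, and which standard-deviation formula the author used is an inference from the numbers, not something the PR states.

```python
from statistics import mean, pstdev, stdev

# Sliding-window BPB (stride 64) per seed, copied from the results table.
sliding_bpb = {42: 1.1352, 1337: 1.1367}
values = list(sliding_bpb.values())

bpb_mean = mean(values)          # arithmetic mean of the two seeds
bpb_pstdev = pstdev(values)      # population std (divide by N)
bpb_stdev = stdev(values)        # sample std (divide by N - 1)
best_seed = min(sliding_bpb, key=sliding_bpb.get)  # lower BPB is better

if __name__ == "__main__":
    print(f"mean          = {bpb_mean:.5f}")
    print(f"population sd = {bpb_pstdev:.5f}")
    print(f"sample sd     = {bpb_stdev:.5f}")
    print(f"best seed     = {best_seed}")
```

The mean is about 1.13595 and the population standard deviation about 0.00075, consistent with the reported 1.1360 and 0.0008 after rounding. The sample standard deviation would be about 0.0011, so the reported figure appears to use the population formula. With only two seeds, either estimate is a rough indication of spread.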

Note: these runs reached ~5,400 steps (110 ms/step) versus SOTA's ~7,100 (85 ms/step) because attention fell back to PyTorch SDPA; FlashAttention 3 was unavailable in our deployment environment. With FA3, we would expect ~7,000 steps and a correspondingly lower BPB.
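
The step counts in the note follow from dividing the 600 s budget by the per-step time. A back-of-the-envelope sketch, assuming a constant step time and ignoring compile, warmup, and evaluation overhead (the helper name `steps_in_budget` is illustrative):

```python
def steps_in_budget(budget_s: float, ms_per_step: float) -> int:
    """Whole training steps that fit in `budget_s` seconds at a constant step time."""
    return int(budget_s * 1000 // ms_per_step)


if __name__ == "__main__":
    budget_s = 600
    sdpa_steps = steps_in_budget(budget_s, 110.2)  # step time measured in this PR (seed 42)
    fa3_steps = steps_in_budget(budget_s, 85.0)    # step time quoted for the SOTA run
    print("SDPA fallback:", sdpa_steps)
    print("At 85 ms/step:", fa3_steps)
    print("Extra steps:  ", fa3_steps - sdpa_steps)
```

This estimate gives a little over 5,400 steps at 110.2 ms/step and a little over 7,000 at 85 ms/step, in line with the figures quoted above. How much BPB those extra steps would buy is not something this arithmetic can predict.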

Methodology

Hyperparameters were selected via 37 ablation experiments on a single RTX 4060 Ti (500 steps each) across 8 dimensions: BigramHash buckets, EMA decay, warmdown ratio, matrix LR, gradient clip, QAT threshold, XSA layers, and Muon momentum. Key finding: the best EMA decay and LR depend on step count and do not transfer from the local runs to H100, while warmdown ratio and QAT threshold do transfer.

Compliance

2 seeds on 8xH100 SXM, <=600s each
All artifacts <=16,000,000 bytes (max: 15,883,805)
Sliding window eval stride=64
No test-time training on validation data
No network calls during evaluation
Self-contained train_gpt.py
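
For readers unfamiliar with the "sliding window eval stride=64" item, the sketch below shows one common way to schedule strided evaluation windows so that every token is scored exactly once, with later tokens seeing as much preceding context as the window allows. The function name, the scoring rule, and the context length of 128 in the demo are assumptions for illustration; they are not taken from this PR's train_gpt.py.

```python
from typing import List, Tuple


def sliding_windows(n_tokens: int, context_len: int, stride: int) -> List[Tuple[int, int, int]]:
    """Return (start, end, score_from) triples.

    Each window feeds tokens[start:end] to the model, but only the loss on
    tokens[score_from:end] is counted, so no token is scored twice.
    """
    windows = []
    scored_upto = 0  # index of the first token not yet scored
    start = 0
    while scored_upto < n_tokens:
        end = min(start + context_len, n_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return windows


if __name__ == "__main__":
    # Hypothetical sizes chosen to keep the output short.
    for start, end, score_from in sliding_windows(n_tokens=200, context_len=128, stride=64):
        print(f"window [{start}, {end}) scores tokens [{score_from}, {end})")
```

After the first window, each step scores only `stride` new tokens while conditioning on the preceding `context_len - stride` tokens. A smaller stride therefore costs more forward passes but gives each scored token more context.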

Commit e508c8c: Record: 11L XSA6 + Warmdown3000 + QAT@0.30 (val_bpb=1.1352, 2-seed mean 1.1360)

0xNoramiya wants to merge 1 commit into openai:main from 0xNoramiya:submission/xsa6-wd3000-qat030
