
10L Int5-MLP + BigramHash(4096) + SWA (1.1507 BPB)#694

Open
Bortlesboat wants to merge 2 commits into openai:main from Bortlesboat:submission/10L-int5-bigram4096

Conversation

@Bortlesboat

Non-record submission

val_bpb: 1.1507 (mean of 3 seeds, sliding window stride=64, post int5/int6+zstd quantization roundtrip)

Seed   val_bpb   artifact_bytes
42     1.1508    15,620,994
1337   1.1499    15,290,882
2024   1.1514    15,327,813
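The headline 1.1507 figure is the mean over the three seeds. A quick arithmetic check of the reported mean and the +/- 0.0006 spread (population standard deviation), using only the numbers from the table above:

```python
# Verify the reported mean and std over the three seed runs.
vals = [1.1508, 1.1499, 1.1514]
mean = sum(vals) / len(vals)
std = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5
print(f"{mean:.4f} +/- {std:.4f}")  # → 1.1507 +/- 0.0006
```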

Architecture

  • 10 layers, d=512, GQA 8H/4KV, relu^2
  • BigramHash(4096, dim=128), SmearGate, U-Net skips
  • Mixed int5 MLP / int6 attention + zstd-22
  • SWA(frac=0.4), Muon WD=0.04, warmdown=3000
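A minimal sketch of what a BigramHash(4096, dim=128) component could look like: hash each (previous token, current token) pair into a small bucket table and look up an embedding. The hash constant and the numpy implementation below are illustrative assumptions, not the submission's actual module.

```python
import numpy as np

class BigramHash:
    """Hashed bigram embedding sketch: 4096 buckets, dim=128.

    Hypothetical implementation: the multiplicative hash constant and
    the table initialization are illustrative only.
    """

    def __init__(self, num_buckets: int = 4096, dim: int = 128, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.num_buckets = num_buckets
        self.table = rng.standard_normal((num_buckets, dim)).astype(np.float32)

    def __call__(self, idx: np.ndarray) -> np.ndarray:
        # idx: (B, T) token ids; pair each token with its predecessor.
        prev = np.roll(idx, 1, axis=1)
        prev[:, 0] = 0  # first position has no predecessor
        # Cheap multiplicative hash of the bigram into [0, num_buckets).
        h = (prev * 1000003 + idx) % self.num_buckets
        return self.table[h]  # (B, T, dim) embeddings
```

The appeal of the hashed table over a full V^2 bigram matrix is size: 4096 x 128 floats quantize and compress to a small fraction of the 16MB artifact budget.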

Based on

thwu1's 10L Int5-MLP submission with reduced BigramHash for reliable size margin across seeds.

Timing (8xH100 SXM)

  • Training: ~600s (6200 steps)
  • Eval: ~258s (sliding window stride=64)
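Sliding-window eval with stride=64 scores each window in full but counts only the last 64 targets toward the loss, so every counted token is predicted with near-full left context; that is why eval is much slower than a single non-overlapping pass. A hedged sketch (the `logits_fn` callable stands in for the model forward, and the nats-to-bits-per-byte step assumes roughly one byte per token for illustration):

```python
import numpy as np

def sliding_window_bpb(logits_fn, tokens, window: int = 1024, stride: int = 64):
    """Sliding-window BPB eval sketch (stride=64 as in this submission).

    Only the final `stride` targets of each window contribute to the loss.
    All names and the window size are illustrative assumptions.
    """
    total_nll, total_count = 0.0, 0
    for start in range(0, len(tokens) - window, stride):
        x = tokens[start : start + window]
        y = tokens[start + 1 : start + window + 1]
        logits = logits_fn(x)  # (window, vocab)
        # Numerically stable log-softmax, then gather target log-probs.
        z = logits - logits.max(axis=-1, keepdims=True)
        logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
        nll = -logp[np.arange(window), y]
        total_nll += nll[-stride:].sum()  # score only the last `stride` positions
        total_count += stride
    # nats per token -> bits per byte (assuming ~1 byte/token here).
    return total_nll / total_count / np.log(2)
```

Sanity check: with uniform logits over a 256-symbol vocabulary this returns 8.0 bits per byte, as expected.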

Explores stacking eval-time techniques (neural cache, LoRA TTT) and
quantization-aware training on top of the openai#1 recipe. QAT currently has
an export mismatch bug that inflates the quantization penalty, so this is
submitted as non-record to document the approach for iteration.
Non-record submission. 10 layers, d=512, GQA 8H/4KV, mixed int5/int6
quantization + zstd-22. BigramHash(4096, dim=128), SmearGate, SWA(0.4).
Mean of 3 seeds: 1.1507 +/- 0.0006 BPB. All artifacts under 16MB.
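Since the reported BPB is measured after a quantization roundtrip, the relevant operation is quantize-then-dequantize. A minimal symmetric per-tensor sketch (the actual submission mixes int5 for MLP weights with int6 for attention and compresses the packed artifact with zstd-22; the scheme below is an illustrative assumption, not the submission's exporter):

```python
import numpy as np

def quantize_roundtrip(w: np.ndarray, bits: int = 5) -> np.ndarray:
    """Symmetric per-tensor quantization roundtrip sketch.

    Returns dequantized weights; the difference from `w` is the
    quantization error that the post-roundtrip BPB numbers include.
    """
    qmax = 2 ** (bits - 1) - 1                 # 15 for int5, 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q * scale                           # dequantized weights
```

With this scheme the per-weight roundtrip error is bounded by half the scale step, which is why int6 (double the levels of int5) is reserved for the more sensitive attention weights.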
