
Record: PR549 + MiLe decay + 8-bit Muon + 1.04x LR + Cache+Backout — val_bpb 1.1176 #703

Open

Gusanidas wants to merge 1 commit into openai:main from Gusanidas:alejandro/pr-clean

Conversation

@Gusanidas

Summary

Four orthogonal improvements on PR #549 (LeakyReLU² + Legal TTT + Parallel Muon):

  • MiLe loss — Entropy-weighted token loss ((1-exp(-entropy))^γ) with γ=1.1 decaying to 0 during warmdown. Focuses training on harder tokens early, then reverts to standard CE.
  • 8-bit Muon momentum — Blockwise symmetric int8 quantization (block_size=256) of Muon first-moment buffers. ~62% memory reduction, lossless.
  • 1.04x LR boost — All learning rates scaled by 1.04x.
  • Cache+Backout — After layer 7, cache hidden states. Layers 8-10 attention reads from cached (clean) context. Post-decoder: x = x - λ·x_cache where λ is a learned scalar (init 0.1).
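The MiLe weighting described above can be sketched as follows. This is an illustrative reconstruction from the formula in the summary, not the actual diff; the function name and the `gamma` schedule argument are assumptions.

```python
import torch
import torch.nn.functional as F

def mile_loss(logits, targets, gamma):
    """Entropy-weighted token loss: each token's CE is scaled by
    (1 - exp(-entropy))**gamma. With gamma == 0 (end of warmdown)
    every weight is 1 and this reduces to standard cross-entropy.
    Sketch only; names and details are not from the PR diff."""
    log_probs = F.log_softmax(logits, dim=-1)
    # Per-token predictive entropy, H = -sum p * log p
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
    weight = (1.0 - torch.exp(-entropy)).clamp_min(1e-6) ** gamma
    ce = F.nll_loss(log_probs, targets, reduction="none")
    # Detach the weight so it only reweights, not backprops through entropy.
    return (weight.detach() * ce).mean()
```

Since the weight is in (0, 1) for γ > 0, confident (low-entropy) tokens are down-weighted early in training, and the loss smoothly becomes plain CE as γ decays to 0.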

Tested on 4xB200 with a 903s wall clock to match equivalent 8xH100 compute.
Not yet tested on 8xH100 SXM.
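For reference, blockwise symmetric int8 quantization of a momentum buffer (as in the 8-bit Muon bullet above) looks roughly like this. A minimal NumPy sketch assuming the buffer length is a multiple of `block_size`; function names are illustrative, not the PR's.

```python
import numpy as np

def quantize_int8_blockwise(m, block_size=256):
    """Symmetric int8 quantization per block of `block_size` elements.
    Each block stores one float32 scale = max|x| / 127, so the per-block
    storage is block_size int8 values plus one scale."""
    blocks = m.reshape(-1, block_size)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)  # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(blocks / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_int8_blockwise(q, scale):
    """Inverse map back to float32 (quantization error <= scale/2 per element)."""
    return (q.astype(np.float32) * scale).reshape(-1)
```

Storing one int8 plus 4/256 bytes of scale per element in place of a full-precision buffer is where the quoted memory saving comes from.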

Test plan

  • Verify on 8xH100 SXM with 600s wall clock
  • Confirm 3-seed mean BPB
  • Check artifact size < 16MB
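The Cache+Backout mechanism can be sketched as a thin wrapper around the decoder stack. Everything here is a hypothetical reconstruction from the summary (the layer interface `layer(x, ctx)`, class name, and wiring are assumptions); only the cache point, the cached-context attention, and the `x - λ·x_cache` backout come from the PR text.

```python
import torch
from torch import nn

class CacheBackout(nn.Module):
    """Sketch: cache the hidden state after layer `cache_after`; later layers
    attend over the cached ("clean") context instead of the evolving state;
    after the stack, subtract backout_lambda * cache from the output."""
    def __init__(self, layers, cache_after=7, backout_init=0.1):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.cache_after = cache_after
        # Learned scalar lambda, initialized to 0.1 per the PR summary.
        self.backout_lambda = nn.Parameter(torch.tensor(backout_init))

    def forward(self, x):
        x_cache = None
        for i, layer in enumerate(self.layers):
            # Before the cache point the context is the current state;
            # after it, attention reads the frozen cached state.
            ctx = x_cache if x_cache is not None else x
            x = layer(x, ctx)
            if i == self.cache_after:
                x_cache = x
        return x - self.backout_lambda * x_cache
```

The backout term lets the model partially remove the mid-stack representation from the final residual stream, with λ learned rather than fixed.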

🤖 Generated with Claude Code

…val_bpb 1.1176

Four orthogonal improvements on PR openai#549 (LeakyReLU² + Legal TTT + Parallel Muon):

- MiLe loss: entropy-weighted token loss with γ=1.1 decaying to 0 during warmdown
- 8-bit Muon momentum: blockwise symmetric int8 quantization of momentum buffers
- 1.04x LR boost: all learning rates scaled by 1.04x
- Cache+Backout: cache layer 7 state, late attention reads cached context,
  subtract backout_lambda * cache from final output

Tested on 4xB200 with wall clock 903s to match equivalent 8xH100 compute.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
