Record: Depth Recurrence (layers 4 and 5 repeated): val_bpb 1.1182#686

Open
msisovic wants to merge 8 commits into openai:main from msisovic:submission/2026-03-25_RecurLayers_TTT

Conversation


@msisovic msisovic commented Mar 25, 2026

Summary

Building on PR #549, I explored two directions for improving val_bpb: width scaling (MODEL_DIM=576) and depth scaling (adding layers). Width scaling to dim=576 caused a regression. Depth scaling to 12 independent layers at dim=512 reached 1.1126 post-TTT, a significant improvement, so I pursued the depth direction.

This led me to depth recurrence: re-executing mid-network layers with independent learnable block scalars, getting the depth benefit without the parameter/size cost. Layers 4 and 5 are each executed twice in sequence (pattern: 0,1,2,3,4,5,4,5,6,7,8,9,10), producing 13 virtual layers from 11 physical. Only ~2K block scalar params are added. Dual recurrence recovers ~70% of the independent 12-layer gain while keeping the artifact well under budget at ~15.9MB.
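As a rough illustration of the idea (a hypothetical sketch, not the actual train_gpt.py code), the virtual layer schedule and per-occurrence block scalars could look like:

```python
# Sketch only: hypothetical reconstruction of depth recurrence with
# per-occurrence block scalars; names and structure are assumptions.

RECUR_LAYERS = [4, 5]   # mid-network layers to repeat
N_PHYSICAL = 11

# Build the virtual execution order: after the repeated span runs once,
# run it a second time -> 0,1,2,3,4,5,4,5,6,7,8,9,10 (13 virtual layers).
schedule = list(range(N_PHYSICAL))
insert_at = max(RECUR_LAYERS) + 1
schedule[insert_at:insert_at] = RECUR_LAYERS
assert schedule == [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]

def forward(x, blocks, scalars):
    # blocks: the 11 physical layers (weights shared across repeats);
    # scalars: one learnable scale per *virtual* position, so the two
    # passes through layers 4 and 5 can be weighted independently.
    # Only the scalars are new parameters.
    for pos, layer_idx in enumerate(schedule):
        x = x + scalars[pos] * blocks[layer_idx](x)
    return x
```

The key point is that `blocks` holds only 11 physical layers while the loop executes 13 residual updates, so depth grows without duplicating layer weights in the artifact.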

I also confirmed that tied TTT (no weight untying for recurrent layers) performs equivalently to untied, and that the TTT gain (~0.0025 BPB) is consistent regardless of recurrence config. Everything else (TTT, int6 quantization, SWA, bigram embeddings, value embeddings, Muon optimizer) is inherited from #549.
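To make the tied-vs-untied distinction concrete, a toy sketch (hypothetical object model, not the repository's code):

```python
import copy

class Layer:
    """Stand-in for a transformer block; only object identity matters here."""
    def __init__(self):
        self.weights = [0.0]

physical = [Layer() for _ in range(11)]
schedule = [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]

# Tied (TTT_UNTIE=0): both passes through layer 4 reuse the same object,
# so test-time-training updates to one occurrence affect the other.
tied = [physical[i] for i in schedule]
assert tied[4] is tied[6]   # virtual positions 4 and 6 share weights

# Untied (TTT_UNTIE=1): each repeat gets an independent copy for TTT,
# which the experiments above found performs no better than staying tied.
untied = [copy.deepcopy(physical[i]) for i in schedule]
assert untied[4] is not untied[6]
```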

| Config | Params | Artifact | Post-TTT val_bpb |
| --- | --- | --- | --- |
| PR #549 baseline (11L) | ~24M | ~19.5MB | 1.1194 |
| Full 12L (over budget) | ~29M | ~17.3MB | 1.1126 |
| Recur L5 (11→12 virtual) | ~27M | ~15.9MB | 1.1180 |
| Recur L4,5 (11→13 virtual) | ~27M | ~15.9MB | 1.1182 |

Reproducibility

| Seed | val_loss | val_bpb |
| --- | --- | --- |
| 1337 | 1.88749538 | 1.11788404 |
| 2025 | 1.88948575 | 1.11906285 |
| 2024 | 1.88811812 | 1.11825287 |
| Mean | 1.88836642 | 1.11839992 |
| Std | 0.00083132 | 0.00049235 |
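The reported mean and std of val_bpb can be reproduced with Python's statistics module (note the std matches the population, not sample, standard deviation):

```python
import statistics

# Per-seed val_bpb from the reproducibility table.
val_bpb = {1337: 1.11788404, 2025: 1.11906285, 2024: 1.11825287}

mean = statistics.mean(val_bpb.values())
std = statistics.pstdev(val_bpb.values())  # population std over the 3 runs

print(f"Mean {mean:.8f}  Std {std:.8f}")  # Mean 1.11839992  Std 0.00049235
```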

Run Commands

```shell
# Seed 1337 (default)
ITERATIONS=9000 RECUR_LAYERS=4,5 TTT_ENABLED=1 TTT_UNTIE=0 \
  torchrun --nproc_per_node=8 train_gpt.py

# Seed 2025
ITERATIONS=9000 RECUR_LAYERS=4,5 TTT_ENABLED=1 TTT_UNTIE=0 SEED=2025 \
  torchrun --nproc_per_node=8 train_gpt.py

# Seed 2024
ITERATIONS=9000 RECUR_LAYERS=4,5 TTT_ENABLED=1 TTT_UNTIE=0 SEED=2024 \
  torchrun --nproc_per_node=8 train_gpt.py
```

msisovic and others added 6 commits March 25, 2026 01:43
Seed 1337 complete (val_bpb=1.1179). Seeds 42 and 2024 need rerun after
GPU restart (stale CUDA contexts blocking clean runs).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@msisovic msisovic changed the title Submission/2026 03 25 recur layers ttt Record: Depth Recurrence (layers 4 and 5 repeated): val_bpb 1.1182 Mar 25, 2026
Previous run accidentally used 8000 iterations. Reran with 9000 to match
other seeds. Mean val_bpb: 1.1184 (was 1.1182), std: 0.00049 (was 0.00076).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@msisovic
Copy link
Author

Note: one of the three runs was accidentally run with ITERATIONS=8000 instead of 9000; it has since been rerun and fixed. No actual changes were made since the submission.
