Byte-Level Tokenizer-Free Transformer: 1.2151 BPB (beats baseline 1.2244)#705
seanward wants to merge 2 commits into openai:main
Conversation
…ts baseline) Co-authored-by: Sean Ward <seanmmward@gmail.com>
Great experiment! Was it based on Meta's BLT paper? Have you tried the more complex approaches, like the patching scheme the BLT paper presents?
@CiprianFlorin-Ifrim Thanks! Yes, BLT (Byte Latent Transformer) was one of the architectural directions we investigated in depth.

Short answer on BLT-style patching: we tried it extensively, and it doesn't work at this scale. We tested four approaches; the first, naive byte-level autoregressive modeling (no patching), is what the submission uses: each position predicts the next byte.

The throughput paradox: patching should help (fewer positions means cheaper attention), but FlashAttention 3 on an H100 makes quadratic attention over 4096 positions fast enough (83 ms/step). SSM layers, despite their linear complexity, are 2-7x slower per layer than FA3 on H100 due to kernel optimization gaps. The architecture that benefits most from optimized hardware kernels wins, not the one with the best theoretical complexity.

We also explored Mamba2 SSM + attention hybrids (31 versions), GLA (Gated Linear Attention), and JEPA-style latent prediction; all are documented in the README.
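For readers unfamiliar with the tokenizer-free setup: a byte-level model needs no learned tokenizer at all, because the UTF-8 bytes of the text serve directly as token IDs over a fixed vocabulary of 256. A minimal sketch (function names are illustrative, not from the submission):

```python
def encode_bytes(text: str) -> list[int]:
    # Tokenizer-free encoding: each UTF-8 byte is a token ID in [0, 255]
    return list(text.encode("utf-8"))


def decode_bytes(ids: list[int]) -> str:
    # Inverse mapping: byte IDs back to text
    return bytes(ids).decode("utf-8")


# Example: encode_bytes("hi") -> [104, 105]
```

Every string round-trips losslessly, which is the main appeal over a learned subword vocabulary.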
…search log Co-authored-by: Sean Ward <seanmmward@gmail.com>
@seanward Thanks for the reply! I'll give the README a proper look; I've worked a lot on researching small BLTs and whether they're doable at very small scales (25k to 2M params), so I'm always interested in anything that adds to that.
Byte-Level Tokenizer-Free Transformer
First tokenizer-free byte-level model to beat the sp1024 baseline in Parameter Golf.
Architecture
Results (4-seed significance test)
Mean sliding BPB: 1.2151 vs baseline 1.2244
99% CI: [1.2080, 1.2223] — baseline 1.2244 is outside the CI.
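A sketch of how such a confidence interval is typically computed from per-seed BPB values, using a Student's t interval for small n (the seed values below are hypothetical placeholders, not the actual run results):

```python
import statistics


def bpb_confidence_interval(bpbs, t_crit=5.841):
    # t_crit is the two-sided 99% t critical value for df = n - 1 = 3
    # (n = 4 seeds); sample stdev over sqrt(n) gives the standard error.
    mean = statistics.mean(bpbs)
    sem = statistics.stdev(bpbs) / len(bpbs) ** 0.5
    return mean, (mean - t_crit * sem, mean + t_crit * sem)


# Hypothetical per-seed BPB values, for illustration only:
mean, (lo, hi) = bpb_confidence_interval([1.212, 1.218, 1.214, 1.216])
```

The baseline is declared beaten when its BPB falls above the upper end of the interval.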
JEPA Auxiliary Loss Study
We also tested JEPA-style latent prediction as an auxiliary training objective. At both weight=0.1 and weight=0.01, the auxiliary loss hurt BPB due to gradient competition at this model scale.
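For reference, an auxiliary objective of this kind usually enters training as a weighted sum with the main loss, and byte-level BPB follows directly from the cross-entropy in nats (one prediction per byte). A minimal sketch assuming that standard form; the submission's actual loss code may differ:

```python
import math


def bits_per_byte(ce_nats: float) -> float:
    # A byte-level model makes one prediction per byte, so
    # BPB = cross-entropy (nats per byte) / ln(2)
    return ce_nats / math.log(2)


def training_loss(ce_nats: float, jepa_nats: float, aux_weight: float) -> float:
    # Weighted multi-task objective; aux_weight values of 0.1 and 0.01
    # were tested. Both terms backpropagate into shared parameters,
    # which is the source of the gradient competition described above.
    return ce_nats + aux_weight * jepa_nats
```

With `aux_weight = 0.0` the objective reduces to the plain autoregressive loss, which is the configuration the final submission uses.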
Included Files
- `train_byte_model.py` — Complete training script
- `convert_to_bytes.py` — Data conversion (sp1024 → raw bytes)
- `requirements.txt` — Python dependencies
- `submission.json` — Leaderboard metadata
- `README.md` — Full documentation
- `research_log.md` — Comprehensive research log documenting all experiments
- `train_seed{1337,42,2025,7}.txt` — Training logs for all 4 seeds
- `train_jepa_k4_w{01,001}.txt` — JEPA experiment logs

Created by Maestro on behalf of Sean Ward