
Byte-Level Tokenizer-Free Transformer: 1.2151 BPB (beats baseline 1.2244) #705

Open
seanward wants to merge 2 commits into openai:main from seanward:byte-level-tokenizer-free

Conversation


@seanward seanward commented Mar 25, 2026

Byte-Level Tokenizer-Free Transformer

First tokenizer-free byte-level model to beat the sp1024 baseline in Parameter Golf.

Architecture

  • 13-layer pure self-attention transformer operating directly on raw UTF-8 bytes (vocab=256)
  • No tokenizer, no BPE, no SentencePiece — raw byte input
  • LeakyReLU² activation, SmearGate, hashed byte-bigram embeddings (4096 buckets, 32 dim)
  • U-Net style skip connections, tied embeddings, logit softcap
  • 27.6M parameters, trained for 10 min on 8×H100
  • Sliding window evaluation (stride=512, seq_len=4096)
  • Quantized with int6 + zstd-22
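The hashed byte-bigram embedding above can be sketched as follows. This is a minimal NumPy illustration, not the submission's code: the bucket count (4096) and bigram width (32) come from the description, while the hash function, the 64-dim byte embedding width, and the concatenation scheme are assumptions for illustration.

```python
import numpy as np

NUM_BUCKETS = 4096   # hashed bigram buckets (from the PR description)
BIGRAM_DIM = 32      # bigram embedding width (from the PR description)

rng = np.random.default_rng(0)
byte_emb = rng.normal(size=(256, 64))              # per-byte embedding (width 64 is illustrative)
bigram_emb = rng.normal(size=(NUM_BUCKETS, BIGRAM_DIM))

def bigram_bucket(prev_byte: int, cur_byte: int) -> int:
    """Hash a (prev, cur) byte pair into one of NUM_BUCKETS buckets.
    The odd multiplier is an arbitrary mixing constant, not from the submission."""
    return ((prev_byte * 257 + cur_byte) * 0x9E3779B1) % NUM_BUCKETS

def embed(seq: bytes) -> np.ndarray:
    """Concatenate each position's byte embedding with its hashed-bigram embedding."""
    out = []
    for i, b in enumerate(seq):
        prev = seq[i - 1] if i > 0 else 0          # pad the first position with byte 0
        out.append(np.concatenate([byte_emb[b], bigram_emb[bigram_bucket(prev, b)]]))
    return np.stack(out)                            # shape: (len(seq), 64 + 32)

x = embed("hello".encode("utf-8"))
print(x.shape)  # (5, 96)
```

The bigram table gives the model a cheap two-byte context signal at the input without enlarging the 256-entry vocabulary.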

Results (4-seed significance test)

Seed   Sliding BPB   Artifact   Under 16 MiB
1337   1.2146        15.53 MB   Yes
42     1.2120        15.80 MB   Yes
2025   1.2174        16.45 MB   Yes
7      1.2166        15.46 MB   Yes

Mean sliding BPB: 1.2151 vs baseline 1.2244

Comparison                        Δ nats   p (one-sided)
vs Official baseline (1.2244)     0.0064   0.0024
vs Post-quant baseline (1.2269)   0.0081   0.0012

99% CI: [1.2080, 1.2223] — baseline 1.2244 is outside the CI.
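The significance numbers above can be reproduced from the four per-seed results with a one-sample, one-sided t-test and a t-based confidence interval. This sketch assumes that is the test used; last-digit rounding may differ slightly from the quoted values.

```python
import numpy as np
from scipy import stats

seed_bpb = np.array([1.2146, 1.2120, 1.2174, 1.2166])  # per-seed sliding BPB
baseline = 1.2244                                       # official baseline

mean = seed_bpb.mean()
# One-sided test: is the mean BPB significantly *below* the baseline?
t_stat, p_value = stats.ttest_1samp(seed_bpb, baseline, alternative="less")

# 99% confidence interval on the mean (t distribution, df = n - 1)
sem = stats.sem(seed_bpb)
ci_low, ci_high = stats.t.interval(0.99, df=len(seed_bpb) - 1, loc=mean, scale=sem)

print(f"mean={mean:.4f}  p={p_value:.4f}  99% CI=[{ci_low:.4f}, {ci_high:.4f}]")
```

With only 4 seeds the interval is wide, which is why the baseline falling entirely outside the 99% CI is the meaningful claim here.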

JEPA Auxiliary Loss Study

We also tested JEPA-style latent prediction as an auxiliary training objective. At both weight=0.1 and weight=0.01, the auxiliary loss hurt BPB due to gradient competition at this model scale.
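Schematically, the objective tested was a weighted sum of the next-byte cross-entropy and a JEPA-style latent-prediction term. The sketch below is hypothetical NumPy (the actual predictor/target-encoder details live in the linked experiment logs); the loss values are illustrative, not measured numbers.

```python
import numpy as np

def jepa_aux_loss(pred_latent: np.ndarray, target_latent: np.ndarray) -> float:
    """Mean-squared error between predictor outputs and (stop-gradient)
    target-encoder latents, as in JEPA-style objectives."""
    return float(np.mean((pred_latent - target_latent) ** 2))

def total_loss(ce: float, pred: np.ndarray, tgt: np.ndarray, aux_weight: float) -> float:
    """Next-byte cross-entropy plus the weighted auxiliary term."""
    return ce + aux_weight * jepa_aux_loss(pred, tgt)

rng = np.random.default_rng(0)
pred = rng.normal(size=(8, 64))   # predicted latents for 8 positions (illustrative)
tgt = rng.normal(size=(8, 64))    # target-encoder latents (no gradient in practice)
for w in (0.1, 0.01):             # the two weights tested in the study
    print(f"weight={w}: total={total_loss(0.85, pred, tgt, w):.3f}")
```

Both weights add gradient pressure from the auxiliary term; per the study above, even the 0.01 setting degraded BPB at this model scale.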

Included Files

  • train_byte_model.py — Complete training script
  • convert_to_bytes.py — Data conversion (sp1024 → raw bytes)
  • requirements.txt — Python dependencies
  • submission.json — Leaderboard metadata
  • README.md — Full documentation
  • research_log.md — Comprehensive research log documenting all experiments
  • train_seed{1337,42,2025,7}.txt — Training logs for all 4 seeds
  • train_jepa_k4_w{01,001}.txt — JEPA experiment logs


Created by Maestro on behalf of Sean Ward


…ts baseline)

Co-authored-by: Sean Ward <seanmmward@gmail.com>
@CiprianFlorin-Ifrim
Contributor

Great experiment! Was it based on Meta's BLT paper? Have you tried the more complex stuff like the patches that the BLT paper presents?

@seanward
Author

@CiprianFlorin-Ifrim Thanks! Yes, BLT (Byte Latent Transformer) was one of the architectural directions investigated in depth.

Short answer on BLT-style patching: We tried it extensively and it doesn't work at this scale. Here's the progression:

We tested four approaches:

1. No patching (this submission) — naive byte-level autoregressive; each position predicts the next byte at seq_len=4096 → 1.2776 BPB pre-quant
2. Byte Patch K=4 (independent) — stride-4 conv encodes 4 bytes into one patch vector, then all 4 bytes are predicted independently → 2.3945 BPB
3. Byte Patch K=4 (causal conv decoder) — depthwise conv within each 4-byte window for autoregressive intra-patch prediction → 2.1533 BPB
4. Byte Patch K=2 (causal conv) — smaller patches → 1.3832 BPB
Why patching fails at 16MB: The patch encoder compresses K bytes into 1 position. The decoder must reconstruct K bytes using only the patch latent + local context. Cross-patch byte dependencies are lost. At BLT's 8B params, the encoder learns rich enough representations. At ~25M params under a 16MB artifact constraint, the compression is too lossy.
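The shape bookkeeping behind that argument fits in a tiny sketch: a stride-K encoder maps K bytes to one position, so the sequence (and attention cost) shrinks by K, but each patch latent must then carry enough information to decode all K bytes. This NumPy illustration shows the grouping only, not the submission's conv encoder.

```python
import numpy as np

def patchify(byte_ids: np.ndarray, k: int) -> np.ndarray:
    """Group a byte sequence into non-overlapping patches of k bytes.
    Sequence length drops by a factor of k; cross-patch byte dependencies
    must now be recovered from the patch latents alone."""
    n = len(byte_ids) - len(byte_ids) % k        # truncate to a multiple of k
    return byte_ids[:n].reshape(-1, k)           # shape: (n // k, k)

seq = np.frombuffer(b"byte-level models", dtype=np.uint8)
patches = patchify(seq, k=4)
print(patches.shape)  # 17 bytes -> 4 patches of 4 bytes (last byte dropped)
```

At BLT scale the encoder can afford rich patch latents; under a 16 MB artifact budget the same compression step is what the numbers above show to be too lossy.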

The throughput paradox: Patching should help (fewer positions = cheaper attention). But FlashAttention 3 on H100 makes quadratic attention at 4096 positions fast enough (83ms/step). SSM layers (linear complexity) are 2-7× slower per layer than FA3 on H100 due to kernel optimization gaps. The architecture that benefits most from optimized hardware kernels wins, not the one with the best theoretical complexity.

We also explored Mamba2 SSM + attention hybrids (31 versions), GLA (Gated Linear Attention), and JEPA-style latent prediction — all documented in the README.

…search log

Co-authored-by: Sean Ward <seanmmward@gmail.com>
@CiprianFlorin-Ifrim
Contributor

@seanward Thanks for the reply! I'll give the README a proper look, as I've worked a lot on researching small BLTs and whether they're doable at very small scales (25k to 2M params), so I'm always interested in anything that adds to that.

