
Byte-Level Tokenizer-Free Transformer: 1.2151 BPB (beats baseline 1.2244) #705

Open
seanward wants to merge 2 commits into openai:main from seanward:byte-level-tokenizer-free

Conversation


@seanward seanward commented Mar 25, 2026

Byte-Level Tokenizer-Free Transformer

First tokenizer-free byte-level model to beat the sp1024 baseline in Parameter Golf.

Architecture

  • 13-layer pure self-attention transformer operating directly on raw UTF-8 bytes (vocab=256)
  • No tokenizer, no BPE, no SentencePiece — raw byte input
  • LeakyReLU² activation, SmearGate, hashed byte-bigram embeddings (4096 buckets, 32 dim)
  • U-Net style skip connections, tied embeddings, logit softcap
  • 27.6M parameters, trained for 10 min on 8×H100
  • Sliding window evaluation (stride=512, seq_len=4096)
  • Quantized with int6 + zstd-22
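The hashed byte-bigram embedding above can be sketched as follows. This is a minimal NumPy illustration, not the submission's code: the bucket count (4096) and bigram width (32) come from the description, while the hash function, the 64-dim byte embedding width, and the concatenation scheme are assumptions for illustration.

```python
import numpy as np

NUM_BUCKETS = 4096   # hashed bigram buckets (from the PR description)
BIGRAM_DIM = 32      # bigram embedding width (from the PR description)

rng = np.random.default_rng(0)
byte_emb = rng.normal(size=(256, 64))              # per-byte embedding (width 64 is illustrative)
bigram_emb = rng.normal(size=(NUM_BUCKETS, BIGRAM_DIM))

def bigram_bucket(prev_byte: int, cur_byte: int) -> int:
    """Hash a (prev, cur) byte pair into one of NUM_BUCKETS buckets.
    The odd multiplier is an arbitrary mixing constant, not from the submission."""
    return ((prev_byte * 257 + cur_byte) * 0x9E3779B1) % NUM_BUCKETS

def embed(seq: bytes) -> np.ndarray:
    """Concatenate each position's byte embedding with its hashed-bigram embedding."""
    out = []
    for i, b in enumerate(seq):
        prev = seq[i - 1] if i > 0 else 0          # pad the first position with byte 0
        out.append(np.concatenate([byte_emb[b], bigram_emb[bigram_bucket(prev, b)]]))
    return np.stack(out)                            # shape: (len(seq), 64 + 32)

x = embed("hello".encode("utf-8"))
print(x.shape)  # (5, 96)
```

The bigram table gives the model a cheap two-byte context signal at the input without enlarging the 256-entry vocabulary.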

Results (4-seed significance test)

Seed   Sliding BPB   Artifact   Under 16 MiB
1337   1.2146        15.53 MB   Yes
42     1.2120        15.80 MB   Yes
2025   1.2174        16.45 MB   Yes
7      1.2166        15.46 MB   Yes

Mean sliding BPB: 1.2151 vs baseline 1.2244

Comparison                        Δ nats   p (one-sided)
vs Official baseline (1.2244)     0.0064   0.0024
vs Post-quant baseline (1.2269)   0.0081   0.0012

99% CI: [1.2080, 1.2223] — baseline 1.2244 is outside the CI.
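The significance numbers above can be reproduced from the four per-seed results with a one-sample, one-sided t-test and a t-based confidence interval. This sketch assumes that is the test used; last-digit rounding may differ slightly from the quoted values.

```python
import numpy as np
from scipy import stats

seed_bpb = np.array([1.2146, 1.2120, 1.2174, 1.2166])  # per-seed sliding BPB
baseline = 1.2244                                       # official baseline

mean = seed_bpb.mean()
# One-sided test: is the mean BPB significantly *below* the baseline?
t_stat, p_value = stats.ttest_1samp(seed_bpb, baseline, alternative="less")

# 99% confidence interval on the mean (t distribution, df = n - 1)
sem = stats.sem(seed_bpb)
ci_low, ci_high = stats.t.interval(0.99, df=len(seed_bpb) - 1, loc=mean, scale=sem)

print(f"mean={mean:.4f}  p={p_value:.4f}  99% CI=[{ci_low:.4f}, {ci_high:.4f}]")
```

With only 4 seeds the interval is wide, which is why the baseline falling entirely outside the 99% CI is the meaningful claim here.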

JEPA Auxiliary Loss Study

We also tested JEPA-style latent prediction as an auxiliary training objective. At both weight=0.1 and weight=0.01, the auxiliary loss hurt BPB due to gradient competition at this model scale.
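Schematically, the objective tested was a weighted sum of the next-byte cross-entropy and a JEPA-style latent-prediction term. The sketch below is hypothetical NumPy (the actual predictor/target-encoder details live in the linked experiment logs); the loss values are illustrative, not measured numbers.

```python
import numpy as np

def jepa_aux_loss(pred_latent: np.ndarray, target_latent: np.ndarray) -> float:
    """Mean-squared error between predictor outputs and (stop-gradient)
    target-encoder latents, as in JEPA-style objectives."""
    return float(np.mean((pred_latent - target_latent) ** 2))

def total_loss(ce: float, pred: np.ndarray, tgt: np.ndarray, aux_weight: float) -> float:
    """Next-byte cross-entropy plus the weighted auxiliary term."""
    return ce + aux_weight * jepa_aux_loss(pred, tgt)

rng = np.random.default_rng(0)
pred = rng.normal(size=(8, 64))   # predicted latents for 8 positions (illustrative)
tgt = rng.normal(size=(8, 64))    # target-encoder latents (no gradient in practice)
for w in (0.1, 0.01):             # the two weights tested in the study
    print(f"weight={w}: total={total_loss(0.85, pred, tgt, w):.3f}")
```

Both weights add gradient pressure from the auxiliary term; per the study above, even the 0.01 setting degraded BPB at this model scale.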

Included Files

  • train_byte_model.py — Complete training script
  • convert_to_bytes.py — Data conversion (sp1024 → raw bytes)
  • requirements.txt — Python dependencies
  • submission.json — Leaderboard metadata
  • README.md — Full documentation
  • research_log.md — Comprehensive research log documenting all experiments
  • train_seed{1337,42,2025,7}.txt — Training logs for all 4 seeds
  • train_jepa_k4_w{01,001}.txt — JEPA experiment logs


Created by Maestro on behalf of Sean Ward


…ts baseline)

Co-authored-by: Sean Ward <seanmmward@gmail.com>
@CiprianFlorin-Ifrim
Contributor

Great experiment! Was it based on Meta's BLT paper? Have you tried the more complex stuff like the patches that the BLT paper presents?

@seanward
Author

@CiprianFlorin-Ifrim Thanks! Yes, BLT (Byte Latent Transformer) was one of the architectural directions investigated in depth.

Short answer on BLT-style patching: We tried it extensively and it doesn't work at this scale. Here's the progression:

We tested four approaches:

1. No patching (this submission) — naive byte-level autoregressive; each position predicts the next byte at seq_len=4096 → 1.2776 BPB pre-quant
2. Byte Patch K=4 (independent) — stride-4 conv encodes 4 bytes into one patch vector, then all 4 bytes are predicted independently → 2.3945 BPB
3. Byte Patch K=4 (causal conv decoder) — depthwise conv within each 4-byte window for autoregressive intra-patch prediction → 2.1533 BPB
4. Byte Patch K=2 (causal conv) — smaller patches → 1.3832 BPB
Why patching fails at 16MB: The patch encoder compresses K bytes into 1 position. The decoder must reconstruct K bytes using only the patch latent + local context. Cross-patch byte dependencies are lost. At BLT's 8B params, the encoder learns rich enough representations. At ~25M params under a 16MB artifact constraint, the compression is too lossy.
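The shape bookkeeping behind that argument fits in a tiny sketch: a stride-K encoder maps K bytes to one position, so the sequence (and attention cost) shrinks by K, but each patch latent must then carry enough information to decode all K bytes. This NumPy illustration shows the grouping only, not the submission's conv encoder.

```python
import numpy as np

def patchify(byte_ids: np.ndarray, k: int) -> np.ndarray:
    """Group a byte sequence into non-overlapping patches of k bytes.
    Sequence length drops by a factor of k; cross-patch byte dependencies
    must now be recovered from the patch latents alone."""
    n = len(byte_ids) - len(byte_ids) % k        # truncate to a multiple of k
    return byte_ids[:n].reshape(-1, k)           # shape: (n // k, k)

seq = np.frombuffer(b"byte-level models", dtype=np.uint8)
patches = patchify(seq, k=4)
print(patches.shape)  # 17 bytes -> 4 patches of 4 bytes (last byte dropped)
```

At BLT scale the encoder can afford rich patch latents; under a 16 MB artifact budget the same compression step is what the numbers above show to be too lossy.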

The throughput paradox: Patching should help (fewer positions = cheaper attention). But FlashAttention 3 on H100 makes quadratic attention at 4096 positions fast enough (83ms/step). SSM layers (linear complexity) are 2-7× slower per layer than FA3 on H100 due to kernel optimization gaps. The architecture that benefits most from optimized hardware kernels wins, not the one with the best theoretical complexity.

We also explored Mamba2 SSM + attention hybrids (31 versions), GLA (Gated Linear Attention), and JEPA-style latent prediction — all documented in the README.

…search log

Co-authored-by: Sean Ward <seanmmward@gmail.com>
@CiprianFlorin-Ifrim
Contributor

@seanward Thanks for the reply! I'll give the README a proper look, as I've worked a lot on researching small BLTs and whether they're doable at very small scales (25k to 2M params), so I'm always interested in anything that adds to that.

