Record: 1.0240 BPB — Multi-Order N-gram Backoff + Entropy-Adaptive Alpha (100% autonomous research via goldfish)#702

Open
lukacf wants to merge 4 commits into openai:main from lukacf:submission/ngram-backoff-entropy-1.0240

Conversation


@lukacf lukacf commented Mar 25, 2026

Summary

val_bpb = 1.0244 (3-seed mean) | 15.79 MB artifact | 8xH100 SXM, 600s training + 124s eval

Novel eval-time n-gram cache with two extensions over the vanilla 5-gram approach (PR #674):

  1. Multi-order backoff (2,3,4,5-gram): When a 5-gram context has no match, fall back to 4-gram, 3-gram, 2-gram
  2. Entropy-adaptive mixing weight: when the model is uncertain (high entropy), trust the n-gram more: alpha = 0.05 + 0.35 * sigmoid(2 * (H - 4.0))

These extensions improve over vanilla 5-gram by 0.018 BPB (1.0423 → 1.0240) on the same base model.
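The entropy-adaptive weight from the formula above can be sketched as a small helper (the function name is illustrative; the PR only states the formula):

```python
import math

def adaptive_alpha(entropy_bits: float) -> float:
    """Entropy-adaptive n-gram weight: alpha = 0.05 + 0.35 * sigmoid(2 * (H - 4.0)).

    Low-entropy (confident) predictions keep alpha near the floor of 0.05;
    high-entropy (uncertain) ones push it toward the ceiling of 0.40.
    """
    sigmoid = 1.0 / (1.0 + math.exp(-2.0 * (entropy_bits - 4.0)))
    return 0.05 + 0.35 * sigmoid
```

At H = 4.0 bits the sigmoid is exactly 0.5, so alpha = 0.225, the midpoint of the [0.05, 0.40] range.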

Legality

3-Seed Validation

| Seed | Sliding BPB | N-gram BPB | Artifact (bytes) |
| --- | --- | --- | --- |
| 1 | 1.1156 | 1.0240 | 15,788,203 |
| 2 | 1.1164 | 1.0247 | ~15,790,000 |
| 3 | 1.1158 | 1.0242 | ~15,790,000 |
| Mean | 1.1159 | 1.0243 | |
| Std | 0.0004 | 0.0003 | |

Training Architecture

11L transformer, 512d, GQA, LeakyReLU(0.5)², XSA-all(11), VRL, Soft-Round QAT, Full Hessian GPTQ (Cholesky + actorder), int6+zstd-22, 7% prune. ~6600 steps at 91ms/step on 8xH100 SXM.

Infrastructure

Discovered and validated autonomously in a single research session using Goldfish ML (MCP-based experiment platform) + Meerkat (agent harness) + an AI coding agent. 12 experiments from first hypothesis to submission, with full provenance and documented dead ends. See README for complete experiment timeline and lineage.

lukacf added 3 commits March 23, 2026 10:13
3-seed mean: 0.9789 BPB (sliding window stride=64)
Best seed: 0.9779 (seed 7)
Std: 0.0015

Key innovation: Autonomous ML research methodology.
AI coding agent discovered cosine LR scaling for TTT in a single
2-hour session — 7 experiments from hypothesis to record.

Technical: CosineAnnealingLR over 100 TTT epochs (3-line change).
Architecture: PR openai#398/openai#442 base (11L, int6+zstd, 15.51MB).
Novel eval-time n-gram cache with two extensions over vanilla 5-gram:
1. Multi-order backoff (2,3,4,5-gram with highest-order-first fallback)
2. Entropy-adaptive mixing weight (sigmoid-modulated 0.05-0.40)

3-seed mean: 1.0244 BPB (std 0.0003) | 15.79 MB artifact
Beats merged SOTA (1.1194) by 0.095 BPB
Beats best unmerged (1.0461) by 0.022 BPB

Score-first legal: cache updated only after scoring.
Proper distribution: p_mixed = (1-a)*p_model + a*p_ng sums to 1.

Discovered and validated autonomously using Goldfish ML + Claude Code.
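The score-first constraint and multi-order backoff described in the commit message can be sketched roughly like this (the cache layout and all names are assumptions, not the PR's actual code):

```python
from collections import defaultdict

def ngram_eval_step(tokens, pos, p_model, alpha, cache, orders=(5, 4, 3, 2)):
    """Score position `pos` with highest-order-first backoff, then update the
    cache. Score-first: the token at `pos` enters the cache only AFTER it has
    been scored, so it never influences its own prediction.

    cache: {order: {context_tuple: {next_token: count}}}
    p_model: the base model's distribution over the vocab at `pos`.
    """
    vocab = len(p_model)

    # Backoff lookup: try the 5-gram context first, then 4-, 3-, 2-gram.
    p_ng = None
    for n in orders:
        if pos < n - 1:
            continue  # not enough left context for this order
        counts = cache[n].get(tuple(tokens[pos - (n - 1):pos]))
        if counts:
            total = sum(counts.values())
            p_ng = [counts.get(t, 0) / total for t in range(vocab)]
            break

    if p_ng is None:
        p_mixed = list(p_model)  # no cache hit at any order: pure model
    else:
        # Convex combination of two distributions, so it still sums to 1.
        p_mixed = [(1 - alpha) * pm + alpha * pn for pm, pn in zip(p_model, p_ng)]

    # Cache update happens only after scoring.
    for n in orders:
        if pos < n - 1:
            continue
        ctx = tuple(tokens[pos - (n - 1):pos])
        cache[n].setdefault(ctx, defaultdict(int))[tokens[pos]] += 1

    return p_mixed
```

Because the mixed distribution is a convex combination of `p_model` and `p_ng`, it sums to 1 by construction, matching the legality note above.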
@lukacf lukacf changed the title Record: 1.0240 BPB — Multi-Order N-gram Backoff + Entropy-Adaptive Alpha Record: 1.0240 BPB — Multi-Order N-gram Backoff + Entropy-Adaptive Alpha (Goldfish ML Autonomous Research) Mar 25, 2026
@lukacf lukacf changed the title Record: 1.0240 BPB — Multi-Order N-gram Backoff + Entropy-Adaptive Alpha (Goldfish ML Autonomous Research) Record: 1.0240 BPB — Multi-Order N-gram Backoff + Entropy-Adaptive Alpha (autonomous ai research via goldfish) Mar 25, 2026
@lukacf lukacf changed the title Record: 1.0240 BPB — Multi-Order N-gram Backoff + Entropy-Adaptive Alpha (autonomous ai research via goldfish) Record: 1.0240 BPB — Multi-Order N-gram Backoff + Entropy-Adaptive Alpha (100% autonomous research via goldfish) Mar 25, 2026
Asukabot0 added a commit to Asukabot0/parameter-golf that referenced this pull request Mar 25, 2026
Two eval-time improvements (no retraining needed):

1. Multi-order backoff (orders 2-7): When 7-gram has no cache hit,
   falls back to 6/5/4/3/2-gram. Dramatically increases cache hit rate
   on 8xH100 where per-GPU cache is sparse. PR openai#702 reports -0.018 BPB.

2. Entropy-adaptive alpha: alpha = 0.05 + 0.55 * sigmoid(2*(H-4.0))
   Model uncertain → trust n-gram more. Model confident → keep LM.
   Compliant: alpha depends only on model's own distribution.

Both configurable via env vars (NGRAM_ENTROPY=0 to disable adaptive).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

lukacf commented Mar 25, 2026

Follow-up: No-GPTQ Validation

We also validated the result without GPTQ (using percentile search quantization instead), confirming the n-gram technique is robust to the quantization method:

With GPTQ (submitted above)

| Seed | N-gram BPB | Size |
| --- | --- | --- |
| 1 | 1.0240 | 15.79 MB |
| 2 | 1.0247 | ~15.79 MB |
| 3 | 1.0242 | ~15.79 MB |
| Mean | 1.0243 | |
| Std | 0.0004 | |

Without GPTQ (percentile search only)

| Seed | N-gram BPB | Size |
| --- | --- | --- |
| 1 | 1.0245 | 15.80 MB |
| 2 | 1.0275 | 15.90 MB |
| 3 | 1.0241 | 16.16 MB |
| 4 | 1.0244 | 15.84 MB |
| Mean | 1.0251 | |
| Std | 0.0015 | |

The n-gram improvement holds regardless of quantization method: GPTQ vs. percentile search shifts the mean by only ~0.001 BPB. The no-GPTQ variant uses the full 600s for training, with no post-training calibration step.

Happy to provide the no-GPTQ train_gpt.py if useful for verification.
