@@ -0,0 +1,64 @@
# Record: Curriculum Learning + LeakyReLU(0.9)^2 + 7-gram Backoff (val_bpb=0.9633)

**val_bpb = 0.9633** (seed 42, additional seeds pending compute grant) | **15.56 MB** | 8xH100 SXM, 600s

## Approach

Built on PR #753 (Podracing II) with two additions:

### 1. Curriculum Learning (Shard Reordering)

Training shards are reordered by model perplexity, hardest shards first. Based on PR #650 by @abaybektursun, which demonstrated -0.003 BPB from shard ordering alone. No code change, just an environment variable.
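The reordering itself is trivial. A minimal sketch, assuming per-shard perplexities from a quick scoring pass (the function name and the example perplexities are hypothetical, not from PR #650):

```python
def shard_order_by_difficulty(ppl_by_shard):
    """Return a SHARD_ORDER string: shard ids sorted hardest-first
    (highest held-out perplexity first)."""
    order = sorted(ppl_by_shard, key=ppl_by_shard.get, reverse=True)
    return ",".join(str(i) for i in order)

# Hypothetical per-shard perplexities
print(shard_order_by_difficulty({0: 1.18, 1: 3.42, 2: 2.07}))  # → "1,2,0"
```

The resulting string is what `SHARD_ORDER` in `run.sh` carries; the data loader only has to consume shards in that order.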

### 2. LeakyReLU(0.9)^2 Slope Optimization

Following @MatoTeziTanka's controlled slope sweep in issue #140, the standard slope=0.5 was replaced with slope=0.9. The sweep showed monotonic improvement from 0.1 to 0.9, with slope=0.9 giving -0.013 BPB vs. slope=0.5 on the same stack.
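For concreteness, here is a literal scalar reading of the LeakyReLU(slope)² activation: LeakyReLU with the given negative-side slope, then an elementwise square (in PyTorch this would presumably be `F.leaky_relu(x, 0.9).square()`; how the sign of the negative branch is handled is an assumption here, as the square maps it to positive values):

```python
def leaky_relu_squared(x, slope=0.9):
    """LeakyReLU(slope)^2 on a scalar: negative inputs are scaled by
    `slope`, then the result is squared. Note the square makes the
    negative side positive as well."""
    y = x if x >= 0 else slope * x
    return y * y

print(leaky_relu_squared(2.0))   # → 4.0
print(leaky_relu_squared(-1.0))  # → 0.81 with slope=0.9
```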

## Results

| Metric | Value |
|--------|-------|
| Sliding window (stride=64) | 1.1216 |
| **Sliding + 7-gram backoff** | **0.9633** |
| Legal TTT (score-first, 3ep) | 1.1216 |
| Artifact | 15,560,351 bytes |
| Steps | 6,647 at 90.3ms/step |
| Training time | 600s |

## Architecture (from PR #753)

- 11L, 512d, GQA 8/4, MLP 3x
- LeakyReLU(0.9)^2 activation
- XSA on all 11 layers
- BigramHash, SmearGate, SWA, EMA
- Int6 QAT + GPTQ (within training budget, issue #677 compliant)
- 7-gram backoff eval cache (backward-looking, no weight updates)

## Eval-time Techniques

**7-gram backoff cache** (from PR #753): Multi-order n-gram model built from already-scored tokens. Linear interpolation with entropy-adaptive alpha. Fully backward-looking — each token scored before its statistics enter the cache.
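A toy sketch of the score-then-update invariant and the multi-order interpolation (the class, the fixed 0.5 interpolation weight, and the uniform base distribution are illustrative assumptions; PR #753's entropy-adaptive alpha is not reproduced here):

```python
import math
from collections import defaultdict

class BackoffCache:
    """Backward-looking n-gram cache: each token is scored from counts of
    previously scored tokens only, then its own statistics are inserted."""
    def __init__(self, max_order=7, vocab=256):
        self.max_order = max_order
        self.vocab = vocab
        # counts[k]: length-k context tuple -> {next_token: count}
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(max_order)]
        self.history = []

    def prob(self, token):
        # Interpolate from a uniform base up through longer contexts.
        p = 1.0 / self.vocab
        for k in range(1, self.max_order):
            if len(self.history) < k:
                continue
            ctx = tuple(self.history[-k:])
            if ctx not in self.counts[k]:
                continue
            dist = self.counts[k][ctx]
            p_k = dist.get(token, 0) / sum(dist.values())
            p = 0.5 * p + 0.5 * p_k  # fixed weight; PR uses adaptive alpha
        return max(p, 1e-12)

    def score_then_update(self, token):
        # Score strictly BEFORE this token's statistics enter the cache.
        bits = -math.log2(self.prob(token))
        for k in range(1, self.max_order):
            if len(self.history) >= k:
                self.counts[k][tuple(self.history[-k:])][token] += 1
        self.history.append(token)
        return bits
```

On a repetitive stream, per-token bits drop quickly as the cache fills, while every score remains a function of strictly earlier tokens.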

**Legal score-first TTT** (from PR #753): SGD with 3 epochs, freeze last 2 blocks. Every token scored under inference_mode before any weight update.
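The legality argument reduces to an ordering invariant: every chunk is scored with the weights as they stood before any gradient step touched that chunk. A framework-free sketch of that control flow (`score_fn` and `update_fn` are hypothetical stand-ins for the model's inference-mode scoring pass and the frozen-block SGD step):

```python
def score_first_ttt(chunks, score_fn, update_fn, epochs=3):
    """Legal test-time training: each chunk is scored BEFORE any
    weight update derived from it is applied."""
    losses = []
    for chunk in chunks:
        losses.append(score_fn(chunk))  # inference only, no weight updates
        for _ in range(epochs):         # then adapt on the scored chunk
            update_fn(chunk)
    return losses
```

In the real run, `score_fn` would wrap the forward pass in `torch.inference_mode()` and `update_fn` would step SGD on the unfrozen blocks only.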

## Reproduction

```bash
SEED=42 bash run.sh
```

Environment variables set in run.sh:
- `SHARD_ORDER=44,63,65,42,...` (curriculum learning)
- `MLP_LEAKY_SLOPE=0.9`
- `NGRAM_EVAL_ORDER=7`

## Acknowledgments

- @newjordan (PR #753, Podracing II base)
- @abaybektursun (PR #650, curriculum learning / shard reordering)
- @MatoTeziTanka (LeakyReLU slope sweep, issue #140)
- @Asukabot0 (PR #715/#727, n-gram backoff technique)

## Status

1 seed submitted. 2 additional seeds pending OpenAI compute grant ($1000 applied).
Previously PR #486 (formerly #2 on leaderboard, TrigramHash originator). $339 personal compute spent.
@@ -0,0 +1,22 @@
#!/bin/bash
set -euo pipefail
export PYTHONUNBUFFERED=1

SEED="${SEED:-42}"

# Production-ready: PR #753 base + curriculum learning
export SEED
export SHARD_ORDER="${SHARD_ORDER:-44,63,65,42,18,67,30,69,61,3,13,19,50,49,56,45,73,79,57,32,28,68,66,34,46,38,17,77,0,14,26,74,59,62,41,9,58,22,78,4,48,8,12,27,75,36,16,43,52,15,33,47,25,55,54,23,37,51,31,21,60,1,20,72,24,53,39,35,71,76,40,5,10,2,7,6,70,11,64,29}"
# N-gram backoff defaults from PR #753
export NGRAM_EVAL_ORDER="${NGRAM_EVAL_ORDER:-7}"
# LeakyReLU slope 0.9 > 0.5 (MatoTeziTanka sweep, -0.013 BPB)
export MLP_LEAKY_SLOPE="${MLP_LEAKY_SLOPE:-0.9}"

NGPU=$(nvidia-smi -L 2>/dev/null | wc -l)
echo "GPUs: $NGPU | Seed: $SEED | Ngram: $NGRAM_EVAL_ORDER | Shard order: ${SHARD_ORDER:+yes}"

if [ "$NGPU" -gt 1 ]; then
torchrun --standalone --nproc_per_node="$NGPU" train_gpt.py
else
python train_gpt.py
fi
@@ -0,0 +1,17 @@
{
"name": "Curriculum Learning + LeakyReLU(0.9)² + 7-gram Backoff",
"author": "ndokutovich",
"github_id": "ndokutovich",
"val_bpb": 0.9633,
"val_loss": 1.6265,
"bytes_total": 15560351,
"artifact_bytes": 15560351,
"training_time_seconds": 600,
"eval_time_seconds": 131,
"hardware": "8xH100 SXM",
"seed": 42,
"num_seeds": 1,
"date": "2026-03-25",
"blurb": "PR #753 base + curriculum learning (hardest-first shard reorder, PR #650) + LeakyReLU(0.9)² slope optimization (MatoTeziTanka sweep) + 7-gram backoff eval cache. 1 seed, 2 additional pending compute grant.",
"notes": "Previously PR #486 (formerly #2 on leaderboard, TrigramHash originator). $360 personal compute spent."
}