2,083 changes: 2,083 additions & 0 deletions concepts/f1_sota_garage/car02_speed_lane/train_gpt.py


@@ -0,0 +1,36 @@
# Podracing: 5-gram Eval + LeakyReLU² + GPTQ

## Results

| Seed | Sliding BPB | 5-gram BPB | Artifact |
|------|-------------|-----------|----------|
| 1337 | 1.1190 | **1.0451** | 15.63 MB |
| 42 | 1.1217 | **1.0471** | 15.59 MB |
| 2045 | 1.1200 | **1.0460** | 15.64 MB |
| **Mean** | **1.1202** | **1.0461** | — |
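The summary row can be sanity-checked from the per-seed 5-gram BPB column; the std=0.0010 quoted in the metadata blurb matches the sample (n-1) standard deviation:

```python
# Recompute mean and sample std of the per-seed 5-gram BPB values
# from the table above (seeds 1337, 42, 2045).
bpbs = [1.0451, 1.0471, 1.0460]
mean = sum(bpbs) / len(bpbs)
var = sum((x - mean) ** 2 for x in bpbs) / (len(bpbs) - 1)  # sample variance
std = var ** 0.5
print(round(mean, 4), round(std, 4))  # matches the reported 1.0461 / 0.0010
```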

## Architecture

11L/512d U-Net, 8H/4KV, LeakyReLU² (slope 0.5), XSA last 4, BigramHash 1536,
VE128 on layers 9-10, partial RoPE (24/64 dims), tied embeddings. 26.93M params.
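A note on the activation: one plausible reading of "LeakyReLU² (slope 0.5)" is the squared leaky ReLU, generalizing the ReLU² activation common in speedrun MLPs. This is an assumption; the exact form lives in train_gpt.py (`MLP_ACT=leaky_relu_sq`). A scalar sketch:

```python
def leaky_relu_sq(x, slope=0.5):
    # Assumed form: square of a leaky ReLU with negative slope 0.5,
    # i.e. the torch equivalent F.leaky_relu(x, 0.5).square().
    # Check train_gpt.py for the definition actually used.
    y = x if x >= 0 else slope * x
    return y * y
```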

## 5-gram Eval (score-first, legal)

Fixed-weight hashed n-gram interpolation during sliding window eval:
- Cache built from already-scored tokens only (backward-looking)
- Fixed alpha=0.20: `p_final = 0.80 * p_model + 0.20 * p_ngram`
- No safety gate, no target-aware selection, no min-NLL comparison
- Hashed count-min sketch (4M buckets), min_count=2
- N-gram concept credited to @deanbrr (PR #659)
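The bullets above can be sketched as follows. `NGramCache`, its hash, and the single-row count table are illustrative stand-ins (the real eval uses a count-min sketch and the hyperparameters listed above), not the implementation in train_gpt.py:

```python
class NGramCache:
    """Backward-looking hashed n-gram counts, built only from
    tokens that have already been scored (no target leakage)."""

    def __init__(self, order=5, buckets=4_194_304, min_count=2):
        self.order = order
        self.buckets = buckets
        self.min_count = min_count
        self.ctx_counts = [0] * buckets  # occurrences of each hashed (order-1)-gram context
        self.pair_counts = {}            # (ctx_bucket, next_token) -> count

    def _bucket(self, ctx):
        # Stand-in for the real hash; any stable hash into `buckets` works.
        return hash(tuple(ctx)) % self.buckets

    def update(self, tokens):
        """Add already-scored tokens to the cache."""
        n = self.order
        for i in range(len(tokens) - n + 1):
            b = self._bucket(tokens[i:i + n - 1])
            self.ctx_counts[b] += 1
            key = (b, tokens[i + n - 1])
            self.pair_counts[key] = self.pair_counts.get(key, 0) + 1

    def prob(self, ctx, token):
        """P_ngram(token | last order-1 tokens), or None below min_count."""
        b = self._bucket(ctx[-(self.order - 1):])
        c = self.pair_counts.get((b, token), 0)
        if c < self.min_count:
            return None
        return c / self.ctx_counts[b]


def interpolate(p_model, p_ngram, alpha=0.20):
    # Fixed-weight mixture: p_final = 0.80 * p_model + 0.20 * p_ngram.
    # No gate or target-aware selection; fall back to the model alone
    # only when the cache has no confident estimate.
    if p_ngram is None:
        return p_model
    return (1 - alpha) * p_model + alpha * p_ngram
```

Because the cache sees only already-scored tokens and the mixture weight is fixed rather than chosen per-position, the interpolation cannot peek at the target it is about to score.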

## Reproduce

```bash
SEED=2045 MLP_ACT=leaky_relu_sq MLP_LEAKY_SLOPE=0.5 \
XSA_LAST_N=4 BIGRAM_VOCAB_SIZE=1536 ROPE_DIMS=24 \
NGRAM_EVAL_ORDER=5 NGRAM_EVAL_ALPHA=0.20 \
NGRAM_EVAL_MIN_COUNT=2 NGRAM_EVAL_BUCKETS=4194304 \
torchrun --nproc_per_node=8 train_gpt.py
```

Hardware: 8xH100 SXM. ~600 s training (wallclock cap) + ~190 s eval.
@@ -0,0 +1,11 @@
{
"author": "Frosty40",
"github_id": "newjordan",
"name": "Podracing: 5-gram Eval + LeakyReLU² + GPTQ",
"blurb": "11L/512d U-Net with legal score-first hashed 5-gram eval interpolation (fixed alpha=0.20). 3-seed mean val_bpb=1.0461 (std=0.0010). N-gram cache concept credited to @deanbrr (PR #659).",
"date": "2026-03-25T03:45:00Z",
"val_loss": 1.7661,
"val_bpb": 1.0461,
"bytes_total": 15631465,
"bytes_code": 102835
}


@@ -0,0 +1,103 @@
W0325 02:56:21.776000 303290 torch/distributed/run.py:803]
W0325 02:56:21.776000 303290 torch/distributed/run.py:803] *****************************************
W0325 02:56:21.776000 303290 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0325 02:56:21.776000 303290 torch/distributed/run.py:803] *****************************************
logs/f1_car02_iso_var_t2_rope24_ngram5_s1337_20260325_025620.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:26928220
f1_corr:rank=0 params=0 est_int6_bytes~0
mlp_act:leaky_relu_sq mlp_leaky_slope:0.5
XSA:last_4 world_size:8 grad_accum_steps:1
num_heads:8 num_kv_heads:4 embed_lr:0.035 matrix_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
compile:enabled=1 fullgraph=0
seed:1337
ngram_eval:order=5 alpha=0.2 min_count=2 buckets=4194304
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9317 val_bpb:4.1054 train_time:0ms step_avg:0.02ms
step:1/20000 train_loss:6.9343 train_time:143ms step_avg:143.26ms
step:2/20000 train_loss:8.7867 train_time:225ms step_avg:112.32ms
step:3/20000 train_loss:7.9649 train_time:310ms step_avg:103.33ms
step:4/20000 train_loss:7.2290 train_time:396ms step_avg:98.99ms
step:5/20000 train_loss:6.9872 train_time:482ms step_avg:96.38ms
step:6/20000 train_loss:6.9480 train_time:568ms step_avg:94.60ms
step:7/20000 train_loss:6.8041 train_time:654ms step_avg:93.37ms
step:8/20000 train_loss:6.6817 train_time:739ms step_avg:92.43ms
step:9/20000 train_loss:6.3606 train_time:825ms step_avg:91.66ms
step:10/20000 train_loss:6.0478 train_time:910ms step_avg:91.00ms
step:500/20000 train_loss:2.3734 train_time:43806ms step_avg:87.61ms
step:1000/20000 train_loss:2.2560 train_time:87852ms step_avg:87.85ms
step:1500/20000 train_loss:2.2085 train_time:131828ms step_avg:87.89ms
step:2000/20000 train_loss:2.0473 train_time:175779ms step_avg:87.89ms
step:2500/20000 train_loss:2.1515 train_time:219751ms step_avg:87.90ms
step:3000/20000 train_loss:2.1464 train_time:263722ms step_avg:87.91ms
step:3500/20000 train_loss:2.1659 train_time:307675ms step_avg:87.91ms
step:4000/20000 train_loss:1.9575 train_time:351636ms step_avg:87.91ms
step:4000/20000 val_loss:2.0468 val_bpb:1.2122 train_time:351640ms step_avg:87.91ms
step:4500/20000 train_loss:2.1031 train_time:395591ms step_avg:87.91ms
step:5000/20000 train_loss:2.0869 train_time:439535ms step_avg:87.91ms
late_qat:enabled step:5076 scale:0.5000
step:5500/20000 train_loss:1.9999 train_time:483471ms step_avg:87.90ms
step:6000/20000 train_loss:1.9245 train_time:527392ms step_avg:87.90ms
swa:start step:6150
step:6500/20000 train_loss:2.0633 train_time:571582ms step_avg:87.94ms
step:6822/20000 val_loss:1.9217 val_bpb:1.1382 train_time:600026ms step_avg:87.95ms
stopping_early: wallclock_cap train_time:600026ms step:6822/20000
peak memory allocated: 20672 MiB reserved: 20718 MiB
ema:applying EMA weights
DIAGNOSTIC post_ema val_loss:1.9201 val_bpb:1.1372 eval_time:2042ms
Serialized model: 106047497 bytes
Code size: 102835 bytes
gptq:calibrating with training data...
gptq:calibrated 68 layers in 3.5s
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
Serialized model int6+zstd: 15528630 bytes
Total submission size int6+zstd: 15631465 bytes
Total submission size int8+zlib: 15631465 bytes
final_int6_roundtrip val_loss:1.9293 val_bpb:1.1426 eval_time:36893ms
final_int6_roundtrip_exact val_loss:1.92926834 val_bpb:1.14262138
final_int6_sliding_window val_loss:1.8894 val_bpb:1.1190 stride:64 eval_time:97693ms
final_int6_sliding_window_exact val_loss:1.88940527 val_bpb:1.11901519
final_int8_zlib_roundtrip_exact val_loss:1.88940527 val_bpb:1.11901519
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.109128 t=47s
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.091968 t=48s
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.089283 t=48s
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.096762 t=48s
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.085093 t=48s
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.084128 t=48s
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.105740 t=48s
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.071534 t=48s
final_int6_sliding_window_ngram5 val_loss:1.7646 val_bpb:1.0451 eval_time:91757ms
final_int6_sliding_window_ngram5_exact val_loss:1.76457796 val_bpb:1.04508523
Connection to 100.65.33.119 closed.
@@ -0,0 +1,103 @@
W0325 03:31:35.065000 435589 torch/distributed/run.py:803]
W0325 03:31:35.065000 435589 torch/distributed/run.py:803] *****************************************
W0325 03:31:35.065000 435589 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0325 03:31:35.065000 435589 torch/distributed/run.py:803] *****************************************
logs/f1_car02_iso_var_t2_rope24_ngram5_s2045_20260325_033133.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:26928220
f1_corr:rank=0 params=0 est_int6_bytes~0
mlp_act:leaky_relu_sq mlp_leaky_slope:0.5
XSA:last_4 world_size:8 grad_accum_steps:1
num_heads:8 num_kv_heads:4 embed_lr:0.035 matrix_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
compile:enabled=1 fullgraph=0
seed:2045
ngram_eval:order=5 alpha=0.2 min_count=2 buckets=4194304
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9303 val_bpb:4.1045 train_time:0ms step_avg:0.01ms
step:1/20000 train_loss:6.9322 train_time:144ms step_avg:144.47ms
step:2/20000 train_loss:8.7644 train_time:225ms step_avg:112.59ms
step:3/20000 train_loss:7.8957 train_time:311ms step_avg:103.69ms
step:4/20000 train_loss:7.1978 train_time:396ms step_avg:99.12ms
step:5/20000 train_loss:6.9515 train_time:481ms step_avg:96.30ms
step:6/20000 train_loss:6.9438 train_time:567ms step_avg:94.49ms
step:7/20000 train_loss:6.8028 train_time:653ms step_avg:93.25ms
step:8/20000 train_loss:6.6956 train_time:739ms step_avg:92.37ms
step:9/20000 train_loss:6.3689 train_time:825ms step_avg:91.63ms
step:10/20000 train_loss:6.0672 train_time:910ms step_avg:91.01ms
step:500/20000 train_loss:2.3814 train_time:43815ms step_avg:87.63ms
step:1000/20000 train_loss:2.2559 train_time:87771ms step_avg:87.77ms
step:1500/20000 train_loss:2.2061 train_time:131708ms step_avg:87.81ms
step:2000/20000 train_loss:2.0501 train_time:175692ms step_avg:87.85ms
step:2500/20000 train_loss:2.1548 train_time:219671ms step_avg:87.87ms
step:3000/20000 train_loss:2.1471 train_time:263638ms step_avg:87.88ms
step:3500/20000 train_loss:2.1656 train_time:307599ms step_avg:87.89ms
step:4000/20000 train_loss:1.9575 train_time:351557ms step_avg:87.89ms
step:4000/20000 val_loss:2.0474 val_bpb:1.2126 train_time:351562ms step_avg:87.89ms
step:4500/20000 train_loss:2.1057 train_time:395598ms step_avg:87.91ms
step:5000/20000 train_loss:2.0869 train_time:439521ms step_avg:87.90ms
late_qat:enabled step:5077 scale:0.4998
step:5500/20000 train_loss:2.0005 train_time:483438ms step_avg:87.90ms
step:6000/20000 train_loss:1.9250 train_time:527343ms step_avg:87.89ms
swa:start step:6150
step:6500/20000 train_loss:2.0644 train_time:571490ms step_avg:87.92ms
step:6823/20000 val_loss:1.9229 val_bpb:1.1389 train_time:600022ms step_avg:87.94ms
stopping_early: wallclock_cap train_time:600022ms step:6823/20000
peak memory allocated: 20672 MiB reserved: 20718 MiB
ema:applying EMA weights
DIAGNOSTIC post_ema val_loss:1.9213 val_bpb:1.1379 eval_time:2043ms
Serialized model: 106047497 bytes
Code size: 102835 bytes
gptq:calibrating with training data...
gptq:calibrated 68 layers in 3.6s
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
Serialized model int6+zstd: 15537990 bytes
Total submission size int6+zstd: 15640825 bytes
Total submission size int8+zlib: 15640825 bytes
final_int6_roundtrip val_loss:1.9309 val_bpb:1.1436 eval_time:37187ms
final_int6_roundtrip_exact val_loss:1.93088788 val_bpb:1.14358056
final_int6_sliding_window val_loss:1.8910 val_bpb:1.1200 stride:64 eval_time:98043ms
final_int6_sliding_window_exact val_loss:1.89104663 val_bpb:1.11998730
final_int8_zlib_roundtrip_exact val_loss:1.89104663 val_bpb:1.11998730
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.110070 t=47s
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.084993 t=48s
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.072233 t=48s
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.086047 t=48s
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.092938 t=48s
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.106821 t=48s
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.097940 t=48s
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.089815 t=49s
final_int6_sliding_window_ngram5 val_loss:1.7661 val_bpb:1.0460 eval_time:92842ms
final_int6_sliding_window_ngram5_exact val_loss:1.76610289 val_bpb:1.04598838
Connection to 100.65.33.119 closed.
@@ -0,0 +1,103 @@
W0325 03:13:59.013000 369417 torch/distributed/run.py:803]
W0325 03:13:59.013000 369417 torch/distributed/run.py:803] *****************************************
W0325 03:13:59.013000 369417 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0325 03:13:59.013000 369417 torch/distributed/run.py:803] *****************************************
logs/f1_car02_iso_var_t2_rope24_ngram5_s42_20260325_031357.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:26928220
f1_corr:rank=0 params=0 est_int6_bytes~0
mlp_act:leaky_relu_sq mlp_leaky_slope:0.5
XSA:last_4 world_size:8 grad_accum_steps:1
num_heads:8 num_kv_heads:4 embed_lr:0.035 matrix_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
compile:enabled=1 fullgraph=0
seed:42
ngram_eval:order=5 alpha=0.2 min_count=2 buckets=4194304
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9293 val_bpb:4.1039 train_time:0ms step_avg:0.01ms
step:1/20000 train_loss:6.9307 train_time:145ms step_avg:144.74ms
step:2/20000 train_loss:8.6833 train_time:227ms step_avg:113.25ms
step:3/20000 train_loss:7.9060 train_time:312ms step_avg:104.16ms
step:4/20000 train_loss:7.2570 train_time:398ms step_avg:99.58ms
step:5/20000 train_loss:7.0185 train_time:484ms step_avg:96.80ms
step:6/20000 train_loss:6.8703 train_time:570ms step_avg:94.98ms
step:7/20000 train_loss:6.7342 train_time:655ms step_avg:93.62ms
step:8/20000 train_loss:6.6461 train_time:741ms step_avg:92.61ms
step:9/20000 train_loss:6.3717 train_time:826ms step_avg:91.82ms
step:10/20000 train_loss:6.0673 train_time:913ms step_avg:91.26ms
step:500/20000 train_loss:2.3771 train_time:43835ms step_avg:87.67ms
step:1000/20000 train_loss:2.2588 train_time:87833ms step_avg:87.83ms
step:1500/20000 train_loss:2.2073 train_time:131823ms step_avg:87.88ms
step:2000/20000 train_loss:2.0545 train_time:175840ms step_avg:87.92ms
step:2500/20000 train_loss:2.1549 train_time:219846ms step_avg:87.94ms
step:3000/20000 train_loss:2.1506 train_time:263922ms step_avg:87.97ms
step:3500/20000 train_loss:2.1715 train_time:307898ms step_avg:87.97ms
step:4000/20000 train_loss:1.9611 train_time:351877ms step_avg:87.97ms
step:4000/20000 val_loss:2.0500 val_bpb:1.2142 train_time:351882ms step_avg:87.97ms
step:4500/20000 train_loss:2.1072 train_time:395852ms step_avg:87.97ms
step:5000/20000 train_loss:2.0883 train_time:439818ms step_avg:87.96ms
late_qat:enabled step:5072 scale:0.4998
step:5500/20000 train_loss:2.0034 train_time:483779ms step_avg:87.96ms
step:6000/20000 train_loss:1.9276 train_time:527740ms step_avg:87.96ms
swa:start step:6150
step:6500/20000 train_loss:2.0657 train_time:571958ms step_avg:87.99ms
step:6817/20000 val_loss:1.9253 val_bpb:1.1403 train_time:600008ms step_avg:88.02ms
stopping_early: wallclock_cap train_time:600008ms step:6817/20000
peak memory allocated: 20672 MiB reserved: 20718 MiB
ema:applying EMA weights
DIAGNOSTIC post_ema val_loss:1.9237 val_bpb:1.1393 eval_time:2041ms
Serialized model: 106047497 bytes
Code size: 102835 bytes
gptq:calibrating with training data...
gptq:calibrated 68 layers in 3.6s
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
gptq_quantize: 66 GPTQ layers, 0 naive layers
Serialized model int6+zstd: 15487548 bytes
Total submission size int6+zstd: 15590383 bytes
Total submission size int8+zlib: 15590383 bytes
final_int6_roundtrip val_loss:1.9335 val_bpb:1.1451 eval_time:39681ms
final_int6_roundtrip_exact val_loss:1.93345159 val_bpb:1.14509894
final_int6_sliding_window val_loss:1.8939 val_bpb:1.1217 stride:64 eval_time:100629ms
final_int6_sliding_window_exact val_loss:1.89385941 val_bpb:1.12165319
final_int8_zlib_roundtrip_exact val_loss:1.89385941 val_bpb:1.12165319
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.091392 t=48s
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.099295 t=48s
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.086410 t=48s
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.108322 t=48s
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.094407 t=48s
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.087306 t=48s
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.111408 t=48s
ngram_eval:progress windows=64032/121136 (52.9%) bpb=1.073701 t=49s
final_int6_sliding_window_ngram5 val_loss:1.7680 val_bpb:1.0471 eval_time:92720ms
final_int6_sliding_window_ngram5_exact val_loss:1.76796876 val_bpb:1.04709346
Connection to 100.65.33.119 closed.