Summary
After PR #68 (recompile-elimination + device_put fixes) was merged, untuned `torch.compile(vmap(single_step), fullgraph=True)` throughput on H200 regressed relative to the prior README numbers for several envs. Tuned numbers have not been re-measured yet, and per-env root causes have not been identified.
Observed vs prior README (H200, float64, 1000 steps, 7 batch sizes)
| env/B | prior (steps/s) | new (steps/s) | delta |
| --- | --- | --- | --- |
| humanoid/32768 | 2.02M | 1.21M | -40% |
| humanoid/65536 | ~1.97M | 1.16M | -41% |
| humanoid/131072 | ~1.86M | 1.12M | -40% |
| halfcheetah/32768 | ~3.72M | 1.20M | -68% |
Compile time is also ~2-4x longer per env.
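
For reference, the delta column can be reproduced from the prior and new throughput values in the table. The sketch below takes delta to mean the relative change from prior to new, rounded to a whole percent. The values marked `~` in the table are approximate, so those deltas are approximate too.

```python
# Throughput in millions of steps/s, copied from the table above.
# Values marked "~" in the table are approximate.
rows = {
    "humanoid/32768": (2.02, 1.21),
    "humanoid/65536": (1.97, 1.16),
    "humanoid/131072": (1.86, 1.12),
    "halfcheetah/32768": (3.72, 1.20),
}


def pct_delta(prior: float, new: float) -> int:
    """Relative change from prior to new, rounded to a whole percent."""
    return round((new - prior) / prior * 100)


deltas = {name: pct_delta(prior, new) for name, (prior, new) in rows.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+d}%")
```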
Hypotheses to check
- Inductor tuning regression: try `--tuned` (coordinate-descent + aggressive fusion) and see if the gap closes.
- Graph structure: compare `TORCH_LOGS=inductor,graph_breaks` traces against the last-known-good commit.
- New overhead from the `torch.compiler.is_compiling()` guards (multiple new conditional branches inside `step()`).
- Extra host-side work in `make_data` (warm caches, precomp).
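
Each hypothesis above comes down to an A/B throughput comparison (with and without `--tuned`, or pre-PR vs post-PR commit). A minimal timing harness for that might look like the sketch below. It assumes steps/s means `batch_size * n_steps / elapsed`, which this issue does not confirm. `measure_steps_per_sec` and `relative_change` are hypothetical helper names, not functions from the repo. Warmup calls are excluded from the timing because the first calls to a `torch.compile`d function include compilation.

```python
import time
from typing import Callable, Optional


def measure_steps_per_sec(
    step_fn: Callable[[], object],
    batch_size: int,
    n_steps: int = 1000,
    warmup: int = 3,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Time n_steps calls of step_fn and return batch_size * n_steps / elapsed.

    `sync` is an optional barrier (for GPU work, e.g. torch.cuda.synchronize)
    called before starting and before stopping the clock, so that queued
    asynchronous kernels are included in the measurement.
    """
    for _ in range(warmup):
        step_fn()  # untimed: absorbs compilation and cache warmup
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    if sync is not None:
        sync()
    elapsed = time.perf_counter() - start
    return batch_size * n_steps / elapsed


def relative_change(baseline: float, candidate: float) -> float:
    """Fractional change of candidate vs baseline (negative means slower)."""
    return (candidate - baseline) / baseline
```

Running it once per variant with the same `batch_size` and `n_steps`, then passing both rates to `relative_change`, gives a number directly comparable to the delta column in the table.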
Data
JSONL outputs from the 305299 sweep are in `~/bench_all_305299/` on steve.
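
A short sketch for loading the sweep outputs and reducing them to one throughput value per env and batch size follows. The record schema is not described in this issue, so the field names `env`, `batch_size`, and `steps_per_sec` are assumptions and are exposed as parameters. The sketch also assumes the files use a `.jsonl` extension.

```python
import json
from pathlib import Path


def load_records(directory):
    """Read every *.jsonl file in `directory` into a list of dicts."""
    records = []
    for path in sorted(Path(directory).expanduser().glob("*.jsonl")):
        with path.open() as fh:
            for line in fh:
                line = line.strip()
                if line:  # skip blank lines
                    records.append(json.loads(line))
    return records


def best_throughput(
    records,
    env_key="env",  # assumed field name
    batch_key="batch_size",  # assumed field name
    rate_key="steps_per_sec",  # assumed field name
):
    """Map (env, batch size) to the highest throughput seen in the records."""
    best = {}
    for rec in records:
        key = (rec[env_key], rec[batch_key])
        best[key] = max(best.get(key, 0.0), float(rec[rate_key]))
    return best
```

For example, `best_throughput(load_records("~/bench_all_305299/"))` would yield a dict keyed by `(env, batch_size)`, provided the assumed field names are adjusted to match the actual records.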
Not in scope
Immediate fix: a follow-up agent will investigate with `TORCH_LOGS=graph_breaks,recompiles,inductor` and compare against the pre-PR commit.