Untuned compile/torch throughput regressed 30-60% vs prior README on humanoid/halfcheetah #70

@vmoens

Summary

After the merge of PR #68 (recompile-elimination + device_put fixes), untuned `torch.compile(vmap(single_step), fullgraph=True)` throughput on H200 regressed relative to the prior README numbers for several envs. Tuned numbers have not been re-measured yet, and the per-env root cause has not been identified.

Observed vs prior README (H200, float64, 1000 steps, 7 batch sizes)

| env/B | prior (steps/s) | new (steps/s) | delta |
|---|---|---|---|
| humanoid/32768 | 2.02M | 1.21M | -40% |
| humanoid/65536 | ~1.97M | 1.16M | -41% |
| humanoid/131072 | ~1.86M | 1.12M | -40% |
| halfcheetah/32768 | ~3.72M | 1.20M | -68% |

Compile time is also ~2-4x longer per env.
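
For reference, the delta column can be recomputed from the two throughput columns. The sketch below treats the approximate (`~`) prior values as exact, which is an assumption, so the results are only good to the nearest percent.

```python
# Throughput figures copied from the table above, in millions of steps/s.
# Prior values marked "~" in the table are treated as exact here (assumption).
rows = [
    ("humanoid", 32768, 2.02, 1.21),
    ("humanoid", 65536, 1.97, 1.16),
    ("humanoid", 131072, 1.86, 1.12),
    ("halfcheetah", 32768, 3.72, 1.20),
]


def relative_delta_pct(prior, new):
    """Relative change from prior to new, in percent (negative = slower)."""
    return (new - prior) / prior * 100.0


# Round to whole percent, matching the precision used in the table.
deltas = {(env, b): round(relative_delta_pct(p, n)) for env, b, p, n in rows}

for (env, b), d in deltas.items():
    print(f"{env}/{b}: {d:+d}%")
```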

Hypotheses to check

  • Inductor tuning regression: try `--tuned` (coordinate-descent + aggressive fusion) and see if the gap closes.
  • Graph structure: compare `TORCH_LOGS=inductor,graph_breaks` traces against the last-known-good commit.
  • New overhead from the `torch.compiler.is_compiling()` guards (multiple new conditional branches inside `step()`).
  • Extra host-side work in `make_data` (cache warming, precomputation).
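
Checking any of these against the last-known-good commit needs the same measurement on both commits, with compile time kept separate from steady-state throughput. A minimal sketch of such a harness follows. It assumes throughput is defined as `batch_size * n_steps / wall time` (the issue does not state the definition), and `step_fn` and `sync` are hypothetical names, not part of the benchmark script. On GPU, `sync` would typically be `torch.cuda.synchronize`, so that queued kernels are included in the timing.

```python
import time


def measure(step_fn, batch_size, n_steps=1000, sync=lambda: None):
    """Return (first_call_seconds, steady_state_steps_per_second).

    step_fn: hypothetical zero-argument callable that advances all
             batch_size environments by one step.
    sync:    blocks until queued device work has finished; the default
             is a no-op, which is only appropriate on CPU.
    """
    # First call: for a compiled function this includes compilation.
    t0 = time.perf_counter()
    step_fn()
    sync()
    first_call_s = time.perf_counter() - t0

    # Steady state: time n_steps calls, synchronizing once at the end.
    t0 = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    sync()
    elapsed = time.perf_counter() - t0

    return first_call_s, batch_size * n_steps / elapsed
```

Running this on both commits with identical `batch_size` and `n_steps` gives two pairs of numbers that can be compared directly: a gap in the first value points at compile time, a gap in the second at the generated code or per-step overhead.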

Data

JSONL outputs from the 305299 sweep are in `~/bench_all_305299/` on steve.
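
Once a baseline sweep exists, the two sets of JSONL files can be compared mechanically. The sketch below is illustrative only: the field names `env`, `batch_size`, and `steps_per_s` are hypothetical, since the actual schema of the sweep output is not described here, and the 10% threshold is an arbitrary choice.

```python
import json


def load_throughput(lines):
    """Map (env, batch_size) -> steps/s from an iterable of JSONL lines.

    The keys "env", "batch_size" and "steps_per_s" are assumed names;
    adjust them to the real schema of the sweep output.
    """
    out = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        out[(rec["env"], rec["batch_size"])] = rec["steps_per_s"]
    return out


def regressions(baseline, candidate, threshold_pct=-10.0):
    """List ((env, batch_size), delta_pct) for configs at or below threshold."""
    found = []
    # Only compare configurations present in both sweeps.
    for key in sorted(baseline.keys() & candidate.keys()):
        delta = (candidate[key] - baseline[key]) / baseline[key] * 100.0
        if delta <= threshold_pct:
            found.append((key, delta))
    return found
```

Both functions accept any iterable of lines, so an open file handle can be passed to `load_throughput` directly.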

Not in scope

An immediate fix is out of scope for this issue. A follow-up agent will investigate with `TORCH_LOGS=graph_breaks,recompiles,inductor` and compare against the pre-PR commit.
