torch-cudagraph-debug is a PyTorch extension for inspecting tensor values from
CUDA Graph replay. A tensor probe returns its input tensor unchanged while
inserting graph-captured debug work:
- device-to-host copy into per-invocation pinned host staging memory;
- optional CUDA host callback;
- native print or compare logic on CPU memory.
The first domain is tensor_debug:
- TensorPrint: print a compact tensor summary and sample values.
- TensorRecord: keep the latest CPU snapshot for each logical probe slot without a host callback.
- TensorCompare: compare replay values against CPU or NumPy ground truth.
This is a v0.1 source-built package for Linux CUDA environments. It targets PyTorch 2.6+ and builds against the CUDA-enabled PyTorch already installed in the runtime where you install it. Prebuilt wheels are intentionally out of scope for the first release.
Install CUDA-enabled PyTorch first, then install the v0.1.0 release tag from GitHub with build isolation disabled so the extension builds against that exact PyTorch:
pip install --no-build-isolation \
    "git+https://github.com/buptzyb/torch-cudagraph-debug.git@v0.1.0"
To test the latest development branch instead, install main:
pip install --no-build-isolation \
    "git+https://github.com/buptzyb/torch-cudagraph-debug.git@main"
From a source checkout:
cd torch-cudagraph-debug
pip install --no-build-isolation .
Enabled probes require a native extension built against CUDA-enabled PyTorch. Source-tree imports and all-disabled probes can run without the extension, but normal source installation for runtime use should happen in the target CUDA environment.
- API reference: signatures, method semantics, execution modes, multi-action behavior, examples, and troubleshooting.
- Release checklist: validation steps for source releases and GPU smoke tests.
- Changelog: release notes and compatibility notes.
import torch
from torch_cudagraph_debug.tensor_debug import (
    CudaGraphTensorProbe,
    TensorCompare,
    TensorPrint,
    TensorRecord,
)
static_x = torch.ones(4, device="cuda")
expected = torch.full((4,), 3.0, device="cpu")
probe = CudaGraphTensorProbe(
    "mid",
    actions=[
        TensorPrint(max_items=8),
        TensorRecord(),
        TensorCompare([expected], rtol=1e-5, atol=1e-8),
    ],
)
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    mid = static_x + 2
    mid = probe(mid)
g.replay()
torch.cuda.synchronize()
snapshots = probe.records()
probe.assert_ok()
probe.close()
CudaGraphTensorProbe.__call__ always returns the original tensor object. By
default, eager and warmup calls are transparent no-ops; debug side effects are
installed only while the current CUDA stream is being captured.
TensorPrint is useful when you need a quick value check from graph replay:
probe = CudaGraphTensorProbe("activation", [TensorPrint(max_items=16, every=10)])
TensorRecord keeps the latest CPU snapshot for each logical probe slot:
probe = CudaGraphTensorProbe("activation", [TensorRecord()])
snapshots = probe.records()
latest_cpu_tensor = snapshots[0].tensor
All actions use retained pinned host memory per logical slot. TensorRecord exposes
those slots as latest records and does not capture a host callback when it is
the only action. Snapshot replay_index is then 0 because no
callback runs to count graph replays. A probe is a single-capture object: all
active calls must happen in the same capture session, and repeated calls in that
capture get logical slots in capture-call order. Slots may have different
shapes, dtypes, and sizes. Create a new probe for another graph capture. Read
records after graph.replay() and torch.cuda.synchronize() to get meaningful
values. CUDA graph capture records the D2H node but does not execute it, so
fresh record slots read before first replay contain zero-initialized bytes
rather than graph tensor values.
For a replay-indexed time series, collect TensorRecord snapshots in the Python
replay loop:
snapshots_by_replay = []
for replay_index in range(1, 5):
    graph.replay()
    torch.cuda.synchronize()
    for snapshot in probe.records():
        snapshots_by_replay.append(
            (replay_index, snapshot.invocation_index, snapshot.tensor.clone())
        )
Each TensorSnapshot contains probe_name, replay_index,
invocation_index, shape, dtype, device, and the recorded CPU tensor.
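Once collected, the flat list is often easier to inspect when grouped per logical slot. The following is a minimal sketch, assuming the (replay_index, invocation_index, tensor) tuple layout from the loop above. group_by_slot is a hypothetical helper, not part of the package, and plain floats stand in for the cloned CPU tensors:

```python
from collections import defaultdict


def group_by_slot(snapshots_by_replay):
    """Group (replay_index, invocation_index, value) tuples by invocation_index.

    Hypothetical helper: returns {invocation_index: [(replay_index, value), ...]}
    with each slot's history sorted by replay_index.
    """
    series = defaultdict(list)
    for replay_index, invocation_index, value in snapshots_by_replay:
        series[invocation_index].append((replay_index, value))
    for history in series.values():
        history.sort(key=lambda item: item[0])
    return dict(series)


# Stand-in data: two slots observed over two replays.
demo = [(1, 0, 0.5), (1, 1, 1.5), (2, 0, 0.25), (2, 1, 1.25)]
grouped = group_by_slot(demo)
```

Each value in the returned mapping is then a per-slot time series that can be plotted or diffed independently.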
TensorCompare checks replay values against a list of CPU tensors or NumPy
arrays, one per invocation_index. A mismatch sets a sticky failure flag; call
probe.assert_ok() after replay and synchronization.
probe = CudaGraphTensorProbe(
    "activation",
    [TensorCompare([expected_cpu_tensor], rtol=1e-4, atol=1e-5)],
)
If one probe instance is called multiple times in a captured graph and each
logical slot has a different expected value, pass the expected tensors in
invocation_index order:
expected_by_slot = [expected0, expected1, expected2]
probe = CudaGraphTensorProbe("layer.hidden", [TensorCompare(expected_by_slot)])
Every action accepts enabled=False for config-driven toggles. Disabled actions
are not installed into the native probe. If every action is disabled, the probe
is a pure no-op: it returns the input tensor unchanged, exposes no records, and
does not require the native extension to be loaded.
Use probe.attach_grad(tensor) to register an autograd hook that probes a
tensor's backward gradient. It returns the original tensor, so activation probes
can stay inline:
hidden = activation_value_probe(hidden)
hidden = activation_grad_probe.attach_grad(hidden)
For parameter gradients, prefer side-effect style so the code does not look like it replaces module state:
weight_grad_probe.attach_grad(module.weight)
This parameter hook observes the gradient when autograd produces it. If you want
the final .grad buffer after backward(), probe that buffer explicitly:
loss.backward()
if module.weight.grad is not None:
    final_weight_grad_probe(module.weight.grad)
If you need to remove a long-lived hook, request the PyTorch hook handle:
_, handle = weight_grad_probe.attach_grad(module.weight, return_handle=True)
handle.remove()
For a module-level example that inserts a probe into an internal hidden tensor
and controls actions from a small config object, see
examples/transformer_block_probe.py.
For the common forward/gradient probe patterns, see
examples/grad_probe_patterns.py.
For repeated calls of one probe in a single graph, see
examples/multiple_invocations_record_compare.py.
For TensorBoard summaries from recorded snapshots, see
examples/tensorboard_export_records.py.
Custom Python actions are intentionally not executed inside CUDA host callbacks.
Use TensorRecord to bring graph replay values back to CPU, then consume or
clone them after synchronization:
probe = CudaGraphTensorProbe("hidden", [TensorRecord()])
# Capture a graph that calls probe(...) here, then replay it.
graph.replay()
torch.cuda.synchronize()
for snapshot in probe.records():
    my_custom_action(snapshot.tensor)
If you need a replay-by-replay time series, clone the CPU tensors before the next replay overwrites the latest-record slots.
The default execution mode is mode="capture":
probe = CudaGraphTensorProbe("activation", [TensorRecord()])
In this mode, probe(tensor) is completely transparent during eager warmup: it
does not validate the input tensor, allocate staging memory, enqueue D2H copies,
launch callbacks, print, record, or compare. When the same call
runs inside torch.cuda.graph(...), the probe captures the debug D2H copy and
captures a host callback only when an action needs one (TensorPrint or
TensorCompare).
mode="always" is an explicit escape hatch:
probe = CudaGraphTensorProbe(
    "activation",
    [TensorRecord()],
    mode="always",
)
Keep this mode for cases where eager side effects are intentional: testing the probe without writing a CUDA graph, recording eager and graph values with the same probe machinery, debugging non-graph CUDA stream code, or covering native enqueue behavior in this package's tests. It is not the default because it can pollute warmup records, print during warmup, or set compare failures before graph replay. Eager calls in this mode do not claim CUDA graph capture ownership; the first graph capture that uses the probe still owns the probe.
A single CudaGraphTensorProbe instance may be called multiple times in one
captured graph, for example inside a repeated layer stack. These calls define
logical slots with invocation_index values in capture order: 0, 1, 2,
and so on. A probe called three times in one graph and replayed twice records:
replay_index:     1 1 1 2 2 2
invocation_index: 0 1 2 0 1 2
A probe is bound to the first CUDA graph capture session that uses it. Reusing
the same probe in another graph capture, including recapturing the same Python
code, is an error. Create a new probe for each graph capture. Slots within that
one capture may have different tensor metadata. TensorCompare takes a list of
expected tensors and uses expected[invocation_index] for each slot. Callback
replay_index values are counted independently per slot. A missing expected
item for an observed invocation is an error; extra expected items are ignored.
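The indexing rules above can be modeled in plain Python. This is an illustrative sketch for a single-stream capture, not the package's native implementation. callback_order and expected_for_slot are hypothetical names, and the IndexError is an assumed stand-in for whatever error the package actually raises:

```python
def callback_order(num_invocations, num_replays):
    """Yield (replay_index, invocation_index) pairs in replay order.

    replay_index starts at 1 and is counted per slot; invocation_index
    follows capture-call order starting at 0.
    """
    for replay_index in range(1, num_replays + 1):
        for invocation_index in range(num_invocations):
            yield replay_index, invocation_index


def expected_for_slot(expected, invocation_index):
    """Pick expected[invocation_index]; missing items are errors, extras are ignored."""
    if invocation_index >= len(expected):
        # Assumed error type, chosen only for this illustration.
        raise IndexError(f"no expected value for invocation_index {invocation_index}")
    return expected[invocation_index]


# Three probe calls in one graph, replayed twice.
pairs = list(callback_order(num_invocations=3, num_replays=2))
```

Passing a list longer than the number of captured invocations is harmless under this rule, while a shorter list fails as soon as the first uncovered slot is observed.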
Each logical slot owns its own pinned host staging buffer. This keeps callback-backed actions from reading data overwritten by another invocation, including invocations captured on different streams in the same graph. Pinned host memory is roughly the sum of the largest tensor byte size observed for each invocation slot:
sum(slot_nbytes for slot in captured_probe_invocations)
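As a rough worked example, the staging footprint can be estimated from the shapes and dtypes probed at each call site. This sketch assumes one buffer per slot sized to the tensor's byte size and ignores any allocator rounding. ELEMENT_SIZE and estimate_pinned_bytes are hypothetical names, not package APIs:

```python
from math import prod

# Bytes per element for the dense dtypes this document lists as supported.
ELEMENT_SIZE = {
    "float64": 8, "float32": 4, "float16": 2, "bfloat16": 2,
    "int64": 8, "int32": 4, "int16": 2, "int8": 1, "uint8": 1, "bool": 1,
}


def estimate_pinned_bytes(slots):
    """Sum numel * element size over (shape, dtype_name) pairs, one per slot."""
    return sum(prod(shape) * ELEMENT_SIZE[dtype] for shape, dtype in slots)


# Three probe invocations captured in one graph.
slots = [((8, 4096), "float16"), ((8, 4096), "float32"), ((4096,), "bool")]
total = estimate_pinned_bytes(slots)  # 65536 + 131072 + 4096 bytes
```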
The default non-contiguous policy is fail-fast when debug work is actually installed:
probe = CudaGraphTensorProbe("view", [TensorRecord()])
with torch.cuda.graph(g):
    probe(non_contiguous_tensor)  # raises
In the default mode="capture", eager warmup calls are no-ops, so this policy is
checked during capture. If you explicitly allow it, the probe creates an
internal contiguous CUDA copy only for the debug path, captures the copy before
the D2H node, keeps that internal storage alive with the captured probe context,
and still returns the original tensor:
probe = CudaGraphTensorProbe(
    "view",
    [TensorRecord()],
    non_contiguous="copy",
)
The copy policy costs roughly tensor.numel() * tensor.element_size() additional
CUDA graph-pool memory for each captured non-contiguous probe site. In tight
memory captures, prefer probing a smaller slice or making the debug copy explicit
in your model code so the memory cost is visible.
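To judge whether a probe site would hit the non-contiguous policy and what the copy would cost, the contiguity test and the byte estimate can be worked out from shape and strides alone. This is a simplified sketch, assuming a row-major layout with strides counted in elements, as PyTorch reports them. The function names are hypothetical:

```python
from math import prod


def is_row_major_contiguous(shape, strides):
    """Return True if the strides describe a dense row-major layout."""
    if 0 in shape:
        return True  # an empty tensor has no elements to misplace
    expected = 1
    for size, stride in zip(reversed(shape), reversed(strides)):
        if size == 1:
            continue  # the stride of a size-1 dimension is irrelevant
        if stride != expected:
            return False
        expected *= size
    return True


def debug_copy_bytes(shape, element_size):
    """Approximate extra graph-pool bytes for the copy: numel * element_size."""
    return prod(shape) * element_size


# A (3, 4) float32 tensor has strides (4, 1). Its transpose is a (4, 3)
# view with strides (1, 4), which is not contiguous.
dense = is_row_major_contiguous((3, 4), (4, 1))
transposed = is_row_major_contiguous((4, 3), (1, 4))
extra = debug_copy_bytes((4, 3), 4)
```

Here the transposed view would need the copy policy, at a cost of about 48 bytes for this small tensor.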
Probe nodes are part of the captured graph dependency chain. Even though
probe(tensor) returns the original tensor, the D2H copy and CUDA host callback
are enqueued on the captured stream, so later dependent graph work waits for
that callback to finish. This can create large GPU bubbles and make performance
traces look much worse than the uninstrumented model. Per-invocation staging
also means large tensors probed at many call sites can retain substantial pinned
host memory. Treat this package as a correctness debugging aid, not as a
low-overhead profiling tool; remove or gate probes before measuring performance.
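One way to gate probes is to derive the enabled flag from an environment variable, so that a profiling run leaves every action disabled. This is a minimal sketch. The variable name TCGD_PROBES and the helper probes_enabled are assumptions for illustration, not something the package defines:

```python
import os


def probes_enabled(env=None, var="TCGD_PROBES"):
    """Return True only when the (hypothetical) variable holds a truthy value."""
    env = os.environ if env is None else env
    return env.get(var, "0").strip().lower() in {"1", "true", "yes", "on"}


# The result would be passed as enabled=... to each action, for example
# TensorRecord(enabled=probes_enabled()). With every action disabled the
# probe is a pure no-op, as described earlier in this document.
on = probes_enabled({"TCGD_PROBES": "1"})
off = probes_enabled({})
```

Defaulting to disabled keeps performance measurements clean unless debugging is requested explicitly.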
- Keep the probe alive for at least as long as any captured torch.cuda.CUDAGraph that contains it can replay.
- Call probe.close() only after those graphs will never replay again.
- Call torch.cuda.synchronize() before reading records or asserting compare status if replay was launched asynchronously.
- Read TensorRecord values only after at least one graph replay; capture alone does not execute the captured D2H copy.
- Host callbacks do not call Python or CUDA APIs. Print and compare are native CPU operations on pinned staging memory.
- Linux/CUDA only.
- One probe is not designed for concurrent replay on multiple streams.
- One probe can be captured by only one CUDA graph capture session; create new probes for recapture or for separate graph captures.
- Inputs must use supported dense dtypes: float64, float32, float16, bfloat16, int64, int32, int16, int8, uint8, or bool.
- Compare expected values must be a non-empty sequence of CPU tensors or NumPy arrays, with each item matching the corresponding invocation's shape and dtype.
The v0.1.0 release gate validates the public GitHub install path in
nvcr.io/nvidia/pytorch:26.03-py3 on Computelab GPU nodes. The gate builds the
native extension from source, checks that native_extension_available is True, runs the
full CUDA pytest suite, and covers Python-side replay snapshot collection,
gradient probes, TensorBoard export, single-capture rejection, and
repeated-invocation examples.
The package is intended to build against CUDA-enabled PyTorch 2.6+ installations, but each CUDA/PyTorch/container combination should be verified in the target environment before relying on it in a larger training workflow.
Local metadata and Python checks:
python -m py_compile $(find src tests examples -name '*.py')
python -m build --sdist --no-isolation
GPU validation requires a CUDA-enabled PyTorch environment:
pip install --no-build-isolation --no-deps .
rm -rf /tmp/tcgd-tests
cp -r tests /tmp/tcgd-tests
cd /tmp
pytest -q /tmp/tcgd-tests
Run the GPU test suite from outside the source checkout after installation.
Running pytest directly in the source tree can import the unbuilt src/
package and skip native CUDA tests.