Port tracing capabilities to Gluon IrisDeviceCtx API #473
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com> Agent-Logs-Url: https://github.com/ROCm/iris/sessions/3068c30a-5cdf-47fd-b543-fe83d4d7c613
@copilot Review: Bugs found while integrating gluon tracing with the all-gather kernel
I tested this PR by integrating it with the gluon all-gather kernel on MI355X hardware. Several bugs prevented it from working correctly with 1.
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com> Agent-Logs-Url: https://github.com/ROCm/iris/sessions/c02aed93-76db-4b7d-9e2e-9807ec681118
Fixed in d8fba1d. Here's what was addressed:
The |
mawad-amd left a comment
Review: Several issues to fix + needs hardware testing
Critical Issues
- Type mismatches on stores: `cur_rank` is loaded as int64 from the context tensor (`gl.load` returns int64) but gets stored to `buf_cur_rank`, which is `pointer_type(tl.int32)`. The same applies to other int32 buffers receiving int64 values. Add an explicit `tl.cast(value, tl.int32)` before storing to int32 buffer pointers. Affected stores in `record_event_start`:
  - `self.rank` → `buf_cur_rank` (int64 → int32)
  - `target_rank` → `buf_target_rank` (likely int64 → int32)
  - `pid_m`, `pid_n` → their buffers (check types)
  - `gl.program_id(0)` → `buf_pid` (check if `gl.program_id` returns int32 or int64)
- Missing `.item()` on bounds check: The Triton reference `DeviceTracing` uses `event_idx.item() < self.max_events.item()` for scalar comparison in the `if` guard. Your version uses `event_idx < self.max_events`, which may produce a tensor bool instead of a Python-level scalar bool, causing the `if` branch to behave incorrectly. Check how Gluon handles this; you may need `.item()` or explicit scalar extraction.
- `max_events` type inconsistency: In the enabled path, `max_events` is loaded as int64 from the context tensor. In the disabled path, you create `max_events_zero = tl.cast(0, tl.int32)`. The aggregate field `max_events` has inconsistent types between the two paths. Either cast the loaded value to int32 or use int64 consistently.
- `tl.cast(0, tl.int32)` vs `tl.full((), 0, dtype=tl.int32)`: The Triton reference uses `tl.full((), 0, dtype=...)` for creating scalar zero values. Your version uses `tl.cast(0, ...)`. While likely equivalent, match the reference pattern for consistency: `tl.full((), 0, dtype=tl.int32)`.
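The first three issues describe one pattern: reserve an event slot, guard the stores with a scalar bounds check, and narrow every value to the buffer's dtype before writing. Below is a minimal host-side Python model of that pattern, not Gluon code. `TraceBufferModel` and `to_int32` are hypothetical names introduced for illustration, and the counter increment only stands in for a device-side atomic add.

```python
import ctypes

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def to_int32(value: int) -> int:
    """Narrow a Python int to int32, refusing values that would not fit."""
    if not INT32_MIN <= value <= INT32_MAX:
        raise OverflowError(f"{value} does not fit in int32")
    return ctypes.c_int32(value).value


class TraceBufferModel:
    """Hypothetical host-side stand-in for the device trace buffers."""

    def __init__(self, max_events: int) -> None:
        # Keep max_events in one dtype on both the enabled and disabled paths.
        self.max_events = to_int32(max_events)
        self.counter = 0  # models the atomic event counter
        self.buf_cur_rank = (ctypes.c_int32 * max_events)()
        self.buf_target_rank = (ctypes.c_int32 * max_events)()

    def record_event_start(self, cur_rank: int, target_rank: int) -> int:
        event_idx = self.counter  # models an atomic fetch-and-add
        self.counter += 1
        # Plain scalar comparison guards every store.
        if event_idx < self.max_events:
            # Values arrive as wide integers; narrow them before storing.
            self.buf_cur_rank[event_idx] = to_int32(cur_rank)
            self.buf_target_rank[event_idx] = to_int32(target_rank)
        return event_idx
```

With `max_events=0` the model behaves like the disabled path: the counter still advances, but no store is ever executed.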
Medium Issues
- Tests only verify counter, never check event data: `test_device_context_gluon_tracing_1d_address` only asserts `trace_counter.item() >= 1`. It should also read back trace buffers and verify:
  - `event_id` matches `TraceEvent().put`
  - `cur_rank` matches `source_rank`
  - `target_rank` matches `(source_rank + 1) % num_ranks`
  - `timestamp` is non-zero
  - `duration_cycles` (end timestamp) >= `timestamp` (start timestamp)
  - `payload_size` = `BLOCK_SIZE * 8` (4 elements × 8 bytes for int64)
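The field checks listed above could be collected in one helper that the test calls after reading the trace buffers back to the host. This is a sketch under assumptions: `TraceRecord` and `check_put_record` are hypothetical names, and the expected event id is passed in by the caller (for example the value of `TraceEvent().put`) rather than looked up here.

```python
from dataclasses import dataclass


@dataclass
class TraceRecord:
    """Hypothetical host-side view of one decoded trace entry."""
    event_id: int
    cur_rank: int
    target_rank: int
    timestamp: int
    duration_cycles: int
    payload_size: int


def check_put_record(rec, *, put_event_id, source_rank, num_ranks,
                     block_size, elem_bytes=8):
    """Return a list of failure messages; an empty list means the record passed."""
    failures = []
    if rec.event_id != put_event_id:
        failures.append(f"event_id: expected {put_event_id}, got {rec.event_id}")
    if rec.cur_rank != source_rank:
        failures.append(f"cur_rank: expected {source_rank}, got {rec.cur_rank}")
    expected_target = (source_rank + 1) % num_ranks
    if rec.target_rank != expected_target:
        failures.append(f"target_rank: expected {expected_target}, got {rec.target_rank}")
    if rec.timestamp == 0:
        failures.append("timestamp: must be non-zero")
    if rec.duration_cycles < rec.timestamp:
        failures.append("duration_cycles: end timestamp precedes start timestamp")
    expected_payload = block_size * elem_bytes
    if rec.payload_size != expected_payload:
        failures.append(f"payload_size: expected {expected_payload}, got {rec.payload_size}")
    return failures
```

In a pytest test this would reduce to a single `assert check_put_record(...) == []`, so a failing run reports every mismatched field at once.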
Hardware Testing Required
You are running on an AMD GPU CI runner (MI325X). Please:
- Run the existing test: `python -m pytest tests/unittests/test_device_context_gluon.py -v` and confirm it passes on real hardware.
- Write and run a no-overhead test: Create a simple Gluon kernel that does a basic operation (e.g., vector add or copy). Run it with `tracing=False` and `tracing=True` (with tracing disabled on the host, i.e. don't call `shmem.tracing.enable()`). Compare execution times over 100+ iterations. With `tracing=False` (constexpr), the tracing code should be dead-code-eliminated by the compiler; confirm there is zero overhead.
- Write and run a tracing correctness test: Enable tracing, run a kernel that does a known operation, export the trace, and verify the exported JSON contains the expected events with correct fields.
Fix the critical issues first, then run the tests to confirm everything works on hardware.
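For the no-overhead comparison, a small timing harness along these lines might be enough. It is a generic sketch, not part of the PR: `median_runtime` and `relative_overhead` are hypothetical helpers, and each `fn` passed in is assumed to launch the kernel and synchronize the device before returning, since otherwise only the launch cost is measured.

```python
import statistics
import time


def median_runtime(fn, iters=100, warmup=10):
    """Median wall-clock seconds per call of fn, after warmup calls."""
    for _ in range(warmup):
        fn()  # warmup absorbs one-time costs such as JIT compilation
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()  # assumed to block until the device work has finished
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def relative_overhead(baseline_s, candidate_s):
    """Fractional slowdown of candidate relative to baseline (0.05 means 5%)."""
    return (candidate_s - baseline_s) / baseline_s
```

The median is used rather than the mean so that a few slow iterations do not dominate the result. A measured overhead will rarely be exactly zero, so a test would typically compare `relative_overhead` against a small noise tolerance chosen for the runner.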
… test assertions Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com> Agent-Logs-Url: https://github.com/ROCm/iris/sessions/0326dad4-e25d-48f2-a6a8-4492a710ae36
Fixed in 4729f60. Here's what was addressed: Critical fixes:
Test enhancement: |
`IrisDeviceCtx` (Gluon API) lacked the tracing support available in the Triton `DeviceContext`, making it impossible to profile Gluon kernels with the same instrumentation.

Changes

New: `GluonDeviceTracing` device-side aggregate (`iris/experimental/iris_gluon.py`)
- Ports `DeviceTracing` using `@gluon.jit` / `@gluon.constexpr_function`
- `device_utils.read_realtime()`, `get_xcc_id()`, `get_cu_id()` for hardware timestamps

Updated: `IrisDeviceCtx.initialize()`
- `tracing: gl.constexpr = False` parameter
- `tracing=True`: decodes tracing buffer pointers from the context tensor via runtime pointer arithmetic (no constexpr `num_ranks` required; the offset is computed from the loaded value)
- `tracing=False`: constructs a disabled `GluonDeviceTracing` with dummy pointers (zero overhead, dead code eliminated at compile time)

Updated: `IrisGluon` host class
- `self.tracing = Tracing(self)` reuses the existing host-side `Tracing` class (buffer allocation, `reset()`, `export()`)
- `_build_device_context()` now encodes tracing buffer pointers when enabled, using the same layout as `Iris._build_device_context()`

New: `tests/unittests/test_device_context_gluon.py`
- Ports `test_device_context_tracing_1d_address` to Gluon
- `tracing.enable()`

Usage
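The runtime pointer decoding described for `IrisDeviceCtx.initialize()` can be illustrated with a host-side sketch. The layout used here (a two-word header, one heap base per rank, then the tracing fields) and all field names are assumptions made for illustration, not the actual layout produced by `_build_device_context()`. The point is only that the tracing offset is derived from the `num_ranks` value read out of the context itself.

```python
HEADER_WORDS = 2  # assumed header: [cur_rank, num_ranks]
TRACE_FIELDS = ("enabled", "max_events", "counter_ptr", "buf_event_id_ptr")


def encode_context(cur_rank, heap_bases, trace=None):
    """Flatten the context into a list of integers (hypothetical layout)."""
    words = [cur_rank, len(heap_bases), *heap_bases]
    if trace is not None:
        # Tracing fields follow the per-rank heap bases.
        words.extend(trace[name] for name in TRACE_FIELDS)
    return words


def decode_tracing(words):
    """Locate the tracing fields using num_ranks loaded from the context."""
    num_ranks = words[1]                 # runtime value, not a compile-time constant
    offset = HEADER_WORDS + num_ranks    # skip the header and the heap bases
    return {name: words[offset + i] for i, name in enumerate(TRACE_FIELDS)}
```

Because the offset depends only on a value stored in the context, the same decoder works for any rank count without being specialized per `num_ranks`.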