
[harness] Device-specific kernels for Triton and Helion #57

Merged
adam-smnk merged 1 commit into main from backends-cuda on Mar 5, 2026
Conversation

Collaborator

@adam-smnk adam-smnk commented Mar 3, 2026

Adds device-specific subdirectories to the Triton and Helion backends.
Refactors the CI benchmark options to use a 'device x backend' grid.
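
For illustration, a 'device x backend' grid can be read as the Cartesian product of the two option lists. The sketch below is not the project's CI configuration; the names `DEVICES`, `BACKENDS`, and `benchmark_grid` are hypothetical, and the device and backend values are only those mentioned in this conversation.

```python
from itertools import product

# Device and backend names as mentioned in this PR's discussion; the real CI
# configuration may list different or additional entries.
DEVICES = ("xpu", "cuda")
BACKENDS = ("pytorch", "triton", "helion")


def benchmark_grid(devices=DEVICES, backends=BACKENDS):
    """Return every (device, backend) pair, one per benchmark job."""
    return list(product(devices, backends))


if __name__ == "__main__":
    for device, backend in benchmark_grid():
        print(f"{device} x {backend}")
```

Expressing the options as a product means adding a device or a backend extends the grid without listing each combination by hand.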

The deeper nesting allows for separate kernel implementations that may encode target-specific optimizations.
For maintenance simplicity and readability, the split is chosen over maintaining multiple kernel implementations in a single file.

The separate files act as entry points for the runner. In the future, truly universal kernels can be stored in a separate location, and the backend file structure might then only provide simple redirection to them.

While these backends support running the same kernel on different devices, encoding target-specific details can improve performance. The baseline PyTorch backend still relies on a single implementation thanks to its higher abstraction.
Future backends should pick the most suitable structure for their needs.
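
A minimal sketch of the layout described above, assuming the device type becomes one extra path component for Triton and Helion while PyTorch keeps a single directory. The function `kernel_root`, the set `DEVICE_SPECIFIC_BACKENDS`, and the `kernels` base directory are hypothetical names used only for illustration, not the repository's actual code.

```python
from pathlib import Path

# Hypothetical: backends that keep one kernel directory per device type.
DEVICE_SPECIFIC_BACKENDS = {"triton", "helion"}


def kernel_root(base: Path, backend: str, device: str) -> Path:
    """Return the directory holding kernel entry points for a backend/device."""
    root = base / backend
    if backend in DEVICE_SPECIFIC_BACKENDS:
        # Deeper nesting: one subdirectory per device type.
        root = root / device
    return root


if __name__ == "__main__":
    base = Path("kernels")  # assumed base directory name
    print(kernel_root(base, "triton", "cuda"))
    print(kernel_root(base, "pytorch", "cuda"))
```

Keeping the rule in one place lets a future backend opt in or out of the per-device split without touching the callers.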

@adam-smnk adam-smnk requested a review from sandlbn March 3, 2026 15:42
Collaborator

@sandlbn sandlbn left a comment


LGTM overall. One thing worth a quick sanity check: benchmark_compare.py builds kernel paths directly as runner.kernels / level / f"{kernel_name}.py". Since runner.kernels is now set in KernelBenchRunner.__init__ to include the device-type subdirectory, this should work fine, but it is worth confirming end-to-end for Triton/Helion on CUDA before merging.
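
The path composition quoted in this comment can be sanity-checked in isolation along these lines. This is only a sketch: `StubRunner`, `kernel_path`, and `check_kernel_exists` are hypothetical stand-ins, not the real `KernelBenchRunner` or `benchmark_compare.py`, and the assumption that the device subdirectory is appended in the constructor comes from the comment itself.

```python
import tempfile
from pathlib import Path


class StubRunner:
    """Stand-in for the real runner; only the `kernels` attribute is modeled."""

    def __init__(self, base: Path, backend: str, device: str):
        # Assumption: the device-type subdirectory is appended here.
        self.kernels = base / backend / device


def kernel_path(runner, level: str, kernel_name: str) -> Path:
    """Compose the kernel file path the same way the comment quotes it."""
    return runner.kernels / level / f"{kernel_name}.py"


def check_kernel_exists(runner, level: str, kernel_name: str) -> bool:
    """Report whether the composed path points at an existing file."""
    return kernel_path(runner, level, kernel_name).is_file()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        runner = StubRunner(Path(tmp), "triton", "cuda")
        target = kernel_path(runner, "level1", "1_Square_matrix_multiplication_")
        # Create a placeholder kernel file in the temporary tree.
        target.parent.mkdir(parents=True)
        target.touch()
        print(check_kernel_exists(runner, "level1", "1_Square_matrix_multiplication_"))
```

A check of this shape only covers path resolution; it does not replace running the comparison itself on the target device.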

@adam-smnk
Collaborator Author

Just double-checked: the compare run invoked as
ai-bench-compare --problem level1/1_Square_matrix_multiplication_ --backends pytorch helion triton --cuda
works as expected.
Individual benchmark runs have been validated to work correctly on CI for both XPU and CUDA.

But I definitely need to add tests to cover the bench compare module.

@adam-smnk adam-smnk merged commit 96ccc32 into main Mar 5, 2026
24 checks passed
@adam-smnk adam-smnk deleted the backends-cuda branch April 7, 2026 09:39