[harness] Device-specific kernels for Triton and Helion by adam-smnk · Pull Request #57 · libxsmm/AI-bench

adam-smnk · 2026-03-03T15:42:13Z

Adds device-specific subdirectories to the Triton and Helion backends.
Refactors the CI benchmark options to use a 'device x backend' grid.
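
For illustration only, a 'device x backend' grid can be read as the Cartesian product of the two option lists. The sketch below is not the CI configuration from this PR. The device and backend names are taken from the discussion in this thread, while `benchmark_grid` and the job dictionary layout are hypothetical.

```python
from itertools import product

# Names taken from this PR's discussion; the real CI configuration may use
# different identifiers or exclude some combinations.
DEVICES = ["xpu", "cuda"]
BACKENDS = ["pytorch", "triton", "helion"]


def benchmark_grid(devices, backends):
    """Return one job description per (device, backend) pair (hypothetical helper)."""
    return [{"device": d, "backend": b} for d, b in product(devices, backends)]


jobs = benchmark_grid(DEVICES, BACKENDS)
for job in jobs:
    print(f"{job['device']} x {job['backend']}")
```

Expressing the options as a product keeps the two axes independent, so adding a device or a backend extends the grid without touching the other list.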

The deeper nesting allows for separate kernel implementations that may encode target-specific optimizations.
For maintenance simplicity and readability, the split is chosen over maintaining multiple kernel implementations in a single file.

The separate files act as entry points for the runner. In the future, truly universal kernels could be stored in a separate location, and the backend file structure might then offer only simple redirection.

While these backends support running the same kernel on different devices, encoding target-specific details can improve performance. The baseline PyTorch backend still relies on a single implementation thanks to its higher abstraction.
Future backends should pick the most suitable structure for their needs.
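
As a rough sketch of the layout described above, the runner-side lookup could append a device subdirectory only for backends that keep device-specific kernels. Everything here is an assumption made for illustration: the `kernels_dir` helper, the `DEVICE_SPECIFIC_BACKENDS` set, and the `kernels` root directory are hypothetical and not taken from the repository.

```python
from pathlib import Path

# Hypothetical: backends that keep per-device kernel subdirectories.
DEVICE_SPECIFIC_BACKENDS = {"triton", "helion"}


def kernels_dir(root: Path, backend: str, device: str) -> Path:
    """Return the directory holding kernels for a backend/device pair.

    Backends with a single shared implementation (e.g. the PyTorch
    baseline) use the backend directory directly.
    """
    base = root / backend
    if backend in DEVICE_SPECIFIC_BACKENDS:
        return base / device
    return base


triton_cuda = kernels_dir(Path("kernels"), "triton", "cuda")
pytorch_cuda = kernels_dir(Path("kernels"), "pytorch", "cuda")
print(triton_cuda)
print(pytorch_cuda)
```

With this shape, a future backend can opt in to per-device directories by joining the set, or stay flat by default.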


sandlbn

LGTM overall. One thing worth a quick sanity check: benchmark_compare.py builds kernel paths directly via runner.kernels / level / f"{kernel_name}.py". Since runner.kernels is now set in KernelBenchRunner.__init__ to include the device-type subdirectory, this should work fine, but it is worth confirming end-to-end for Triton/Helion on CUDA before merging.
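
A minimal sketch of the kind of check suggested here, assuming only the path expression quoted above (kernels directory / level / f"{kernel_name}.py"). The helpers `kernel_path` and `missing_kernels` are hypothetical, and the demo runs on a temporary directory tree rather than the real repository layout.

```python
import tempfile
from pathlib import Path


def kernel_path(kernels: Path, level: str, kernel_name: str) -> Path:
    """Compose the kernel file path the same way as the quoted expression."""
    return kernels / level / f"{kernel_name}.py"


def missing_kernels(kernels: Path, level: str, names):
    """Return the kernel names whose files do not exist under `kernels`."""
    return [n for n in names if not kernel_path(kernels, level, n).is_file()]


# Demo on a throwaway tree that mimics a device-specific subdirectory.
with tempfile.TemporaryDirectory() as tmp:
    kernels = Path(tmp) / "triton" / "cuda"
    (kernels / "level1").mkdir(parents=True)
    name = "1_Square_matrix_multiplication_"
    kernel_path(kernels, "level1", name).touch()
    missing = missing_kernels(kernels, "level1", [name, "not_a_kernel"])

print(missing)
```

A check like this only confirms that the composed paths resolve to files. It does not replace an end-to-end compare run on the target device.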

adam-smnk · 2026-03-05T16:16:00Z

Just double-checked: the compare run invoked as
ai-bench-compare --problem level1/1_Square_matrix_multiplication_ --backends pytorch helion triton --cuda
works as expected.
Individual benchmark runs have been validated to work correctly on CI for both XPU and CUDA.

But I definitely need to add tests to cover the bench compare module.

adam-smnk requested a review from sandlbn March 3, 2026 15:42

sandlbn approved these changes Mar 5, 2026


adam-smnk merged commit 96ccc32 into main Mar 5, 2026
24 checks passed

adam-smnk deleted the backends-cuda branch April 7, 2026 09:39

adam-smnk merged 1 commit into main from backends-cuda
