Summary
This issue serves as a tracker for Week 1 (Reductions) submissions.
Participants are expected to contribute reduction-related kernels and/or optimizations, along with minimal correctness and performance evidence. Multiple PRs can link back to this issue for coordination and review.
Motivation / Use Case
Reductions are fundamental building blocks for many ML and HPC workloads (e.g., layernorm and softmax).
Week 1 focuses on exploring CuTe DSL implementations of reduction ops and building intuition around tiling, memory traffic, and parallelization strategies.
This tracker centralizes:
- Submission PRs
- Optimization discussions
The goal is to accelerate iteration and knowledge sharing across different reduction variants.
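For the memory-traffic side of this, one common way to summarize a reduction's performance is effective bandwidth: the bytes the kernel must read and write, divided by the measured time. The helper below is only a sketch of that arithmetic. The function name, its parameters, and the assumption that the input is read exactly once and the output written exactly once are illustrative, not a required reporting format.

```python
def effective_bandwidth_gbps(rows, cols, in_bytes, out_bytes, seconds, axis=1):
    """Estimate effective bandwidth (GB/s) for a 2-D reduction.

    Assumes (hypothetically) one full read of the input and one write of the
    output; kernels that make extra passes move more bytes than this counts.
    """
    # Reducing along axis=1 leaves one value per row; axis=0 leaves one per column.
    out_elems = rows if axis == 1 else cols
    bytes_moved = rows * cols * in_bytes + out_elems * out_bytes
    return bytes_moved / seconds / 1e9


# Example inputs (made up for illustration): a 4096 x 4096 fp16 tensor reduced
# along axis 1 into fp32 outputs, with a hypothetical measured time of 100 us.
bw = effective_bandwidth_gbps(4096, 4096, in_bytes=2, out_bytes=4, seconds=1e-4)
print(f"{bw:.1f} GB/s")
```

Comparing this figure against the device's peak memory bandwidth gives a rough sense of how close a memory-bound reduction is to its limit.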
Proposed Solution
Example kernel scope (not exhaustive):
- Reduction ops (e.g., sum / max / mean) over configurable axes
- Supported shapes: 2-D tensors
- Supported dtypes: fp16 / bf16 / fp32 (as applicable)
- Variants across different optimization approaches
- Optional benchmark + pytest additions for each submission
We will use this tracker to discuss optimizations, compare approaches, and guide follow-up improvements.
Contributions that go beyond the baseline scope (e.g., ndim > 2, non-contiguous layouts, or other extended functionality) are very welcome. Please highlight any such additions in your submission.
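As a rough illustration of the kind of correctness evidence a submission could include, here is a minimal pure-Python sketch that compares a two-stage (tiled) reduction against a naive reference for sum / max / mean over either axis of a 2-D input. It does not use CuTe DSL. The names `reduce_reference`, `reduce_tiled`, and `allclose`, the `tile` parameter, and the tolerances are all hypothetical, and rows are assumed to be non-empty.

```python
import math


def _as_rows(x, axis):
    """View a 2-D list-of-lists so that the reduced axis is the inner one."""
    return x if axis == 1 else [list(col) for col in zip(*x)]


def reduce_reference(x, op, axis):
    """Naive reference: reduce each row (axis=1) or column (axis=0) in one pass."""
    rows = _as_rows(x, axis)
    if op == "sum":
        return [math.fsum(r) for r in rows]
    if op == "max":
        return [max(r) for r in rows]
    if op == "mean":
        return [math.fsum(r) / len(r) for r in rows]
    raise ValueError(f"unsupported op: {op}")


def reduce_tiled(x, op, axis, tile=4):
    """Two-stage reduction: reduce each tile to a partial, then combine partials."""
    if op not in ("sum", "max", "mean"):
        raise ValueError(f"unsupported op: {op}")
    combine = max if op == "max" else sum
    out = []
    for r in _as_rows(x, axis):
        partials = [combine(r[i:i + tile]) for i in range(0, len(r), tile)]
        total = combine(partials)
        out.append(total / len(r) if op == "mean" else total)
    return out


def allclose(a, b, rel_tol=1e-6, abs_tol=1e-9):
    """Element-wise tolerance check; tolerances here are placeholder values."""
    return len(a) == len(b) and all(
        math.isclose(p, q, rel_tol=rel_tol, abs_tol=abs_tol) for p, q in zip(a, b)
    )


x = [[1.0, 2.0, 3.0, 4.0, 5.0], [6.0, 7.0, 8.0, 9.0, 10.0]]
for op in ("sum", "max", "mean"):
    for axis in (0, 1):
        assert allclose(reduce_tiled(x, op, axis), reduce_reference(x, op, axis))
```

In an actual submission the tiled function would be replaced by the kernel under test, and the tolerance would need to be loosened for fp16 / bf16 inputs, since the accumulation order affects rounding.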
Scope Alignment
v0.1 scope (Weeks 0-2)
Alternatives Considered
No response
Additional Context
No response