Week1: reduce ops #20

@Tcc0403

Summary

This issue serves as a tracker for Week 1 (Reductions) submissions.

Participants are expected to contribute reduction-related kernels and/or optimizations, along with minimal correctness and performance evidence. Multiple PRs can link back to this issue for coordination and review.
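One possible way to report performance evidence is effective memory bandwidth, since reductions are usually limited by memory traffic. The sketch below is an assumption about what a submission might include, not a required format: `effective_bandwidth_gbs` is a hypothetical helper, and the elapsed time must come from your own measurement.

```python
def effective_bandwidth_gbs(rows, cols, itemsize, axis, seconds):
    """Estimate effective bandwidth (GB/s) of a 2-D reduction.

    Assumes every input element is read once and every output element is
    written once; real kernels may move more data than this lower bound.
    """
    if axis not in (0, 1):
        raise ValueError("axis must be 0 or 1 for a 2-D tensor")
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    in_bytes = rows * cols * itemsize
    # Reducing over axis 1 leaves one value per row; over axis 0, one per column.
    out_elems = rows if axis == 1 else cols
    out_bytes = out_elems * itemsize
    return (in_bytes + out_bytes) / seconds / 1e9
```

Comparing this figure against the device's peak memory bandwidth gives a rough sense of how much headroom a kernel has left.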

Motivation / Use Case

Reductions are fundamental building blocks for many ML and HPC workloads (e.g., layernorm and softmax).

Week 1 focuses on exploring CuTe DSL implementations of reduction ops and building intuition around tiling, memory traffic, and parallelization strategies.
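To make the tiling and parallelization ideas concrete, here is a host-side model in plain Python. It is not CuTe DSL code, and the names `tree_reduce` and `tiled_reduce` are hypothetical. It assumes the combine function is associative, which is what lets each tile be reduced independently before the partial results are merged.

```python
import operator


def tree_reduce(vals, combine):
    """Pairwise, log-depth reduction; each pass halves the number of live values."""
    vals = list(vals)
    if not vals:
        raise ValueError("cannot reduce an empty sequence")
    while len(vals) > 1:
        half = (len(vals) + 1) // 2
        # Slot i absorbs slot i + half when that partner exists.
        vals = [
            combine(vals[i], vals[i + half]) if i + half < len(vals) else vals[i]
            for i in range(half)
        ]
    return vals[0]


def tiled_reduce(row, tile, combine):
    """Stage 1: reduce each tile to a partial. Stage 2: reduce the partials."""
    if tile <= 0:
        raise ValueError("tile must be positive")
    partials = [
        tree_reduce(row[i:i + tile], combine) for i in range(0, len(row), tile)
    ]
    return tree_reduce(partials, combine)


total = tiled_reduce(list(range(1, 11)), tile=4, combine=operator.add)
largest = tiled_reduce(list(range(1, 11)), tile=4, combine=max)
```

On a GPU, stage 1 would roughly correspond to work done per thread or per warp and stage 2 to the cross-thread merge, but that mapping is a design choice each submission makes for itself.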

This tracker centralizes:

  • Submission PRs
  • Optimization discussions

The goal is to accelerate iteration and knowledge sharing across different reduction variants.

Proposed Solution

Example kernel scope (not exhaustive):

  • Reduction ops (e.g., sum / max / mean) over configurable axes
  • Supported shapes: 2-D tensors
  • Supported dtypes: fp16 / bf16 / fp32 (as applicable)
  • Variants across different optimization approaches
  • Optional benchmark + pytest additions for each submission
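For correctness checks, a submission needs some reference to compare against. The following is a dependency-free sketch of one, under stated assumptions: `reduce_2d`, `round_fp16`, and `sum_fp16_accumulator` are hypothetical helpers, the input is a list of lists, and the reference accumulates in Python floats (fp64). The last helper illustrates why a low-precision accumulator can be a poor oracle for fp16 inputs.

```python
import struct


def round_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


_OPS = {
    "sum": sum,
    "max": max,
    "mean": lambda lane: sum(lane) / len(lane),
}


def reduce_2d(x, op="sum", axis=1):
    """Reference reduction over one axis of a 2-D list of lists."""
    if axis not in (0, 1):
        raise ValueError("axis must be 0 or 1 for a 2-D tensor")
    # axis=1 reduces within each row; axis=0 reduces within each column.
    lanes = x if axis == 1 else list(zip(*x))
    return [_OPS[op](lane) for lane in lanes]


def sum_fp16_accumulator(vals):
    """Sum while rounding the running total to fp16 after every add."""
    acc = 0.0
    for v in vals:
        acc = round_fp16(acc + v)
    return acc


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
row_sums = reduce_2d(x, "sum", axis=1)
col_max = reduce_2d(x, "max", axis=0)
row_mean = reduce_2d(x, "mean", axis=1)

ones = [1.0] * 4096
narrow = sum_fp16_accumulator(ones)  # stalls at 2048.0: 2049 is not an fp16 value
wide = round_fp16(sum(ones))         # 4096.0: accumulate wide, round once at the end
```

In a pytest file, the same comparison would typically use a tolerance chosen per dtype rather than exact equality; the right tolerances are left to each submission.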

We will use this tracker to discuss optimizations, compare approaches, and guide follow-up improvements.

Contributions that go beyond the baseline scope (e.g., ndim > 2, non-contiguous layouts, or other extended functionality) are very welcome. Please highlight any such additions in your submission.
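One common way to extend a 2-D kernel to ndim > 2 is to view the tensor as (outer, reduce, inner) around the reduced axis. The sketch below assumes a contiguous row-major layout; `collapse_for_axis` is a hypothetical helper, and non-contiguous inputs would need stride handling that this does not cover.

```python
from math import prod


def collapse_for_axis(shape, axis):
    """Collapse an N-D shape to (outer, reduce_len, inner) around `axis`.

    For a contiguous row-major tensor, element (o, k, i) of the collapsed
    view lives at flat offset (o * reduce_len + k) * inner + i.
    """
    ndim = len(shape)
    if not -ndim <= axis < ndim:
        raise ValueError("axis out of range")
    axis %= ndim  # normalize negative axes
    outer = prod(shape[:axis])       # prod of an empty slice is 1
    inner = prod(shape[axis + 1:])
    return outer, shape[axis], inner
```

When `inner` is 1 the reduction runs along the fastest-varying dimension (the familiar row-wise case); when `outer` is 1 it runs along the slowest one (the column-wise case).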

Scope Alignment

v0.1 scope (Weeks 0-2)

Alternatives Considered

No response

Additional Context

No response

Metadata

Labels

enhancement (New feature or request)
