B200 Cross Entropy Backward Performance #68

@Jokeren

The cross entropy backward kernel seems to be slower than the torch.compile-generated code on B200. Is this expected? I'm using quack/main and triton 3.6.0.

=== Cross Entropy Backward Pass Benchmark ===
Tensor dimensions: [8192, 16384]
Input Data type: BFloat16
Input tensor shapes:
x: torch.Size([8192, 16384]), dtype: torch.bfloat16
target: torch.Size([8192]), dtype: torch.int64
Kernel execution time: 0.4894 ms
Mem throughput: 1097.00 GB/s
Ref kernel execution time: 0.2265 ms
Ref mem throughput: 2371.00 GB/s
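
For reference, the reported throughput numbers can be roughly reproduced from the tensor shape and the timings above. This is only a sketch: it assumes the backward pass reads x once and writes a gradient of the same shape and dtype once (2 bytes per bfloat16 element), and it ignores the small target tensor. The benchmark script's actual formula is not shown here, and the helper name `estimated_throughput_gbps` is made up for illustration.

```python
def estimated_throughput_gbps(rows, cols, time_ms, bytes_per_elem=2):
    """Estimate memory throughput in GB/s (hypothetical helper).

    Assumption: the kernel reads x once and writes dx once, each of
    rows * cols elements; the target tensor is ignored as negligible.
    """
    traffic_bytes = 2 * rows * cols * bytes_per_elem
    return traffic_bytes / (time_ms * 1e-3) / 1e9


# Shape and timings copied from the benchmark output above.
rows, cols = 8192, 16384
kernel_ms, ref_ms = 0.4894, 0.2265

kernel_gbps = estimated_throughput_gbps(rows, cols, kernel_ms)
ref_gbps = estimated_throughput_gbps(rows, cols, ref_ms)
slowdown = kernel_ms / ref_ms  # how many times slower than the reference

print(f"kernel: {kernel_gbps:.0f} GB/s")
print(f"ref:    {ref_gbps:.0f} GB/s")
print(f"slowdown: {slowdown:.2f}x")
```

Under that assumption the estimates land within about 1 GB/s of the reported values, and the kernel comes out a bit more than 2x slower than the reference.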

cc @tridao @lezcano
