B200 Cross Entropy Backward Performance

The cross entropy backward kernel seems to be slower than the `torch.compile`-generated code on B200, by roughly 2.2x in the run below. Is this expected? I'm using quack/main and triton 3.6.0.

```
=== Cross Entropy Backward Pass Benchmark ===
Tensor dimensions: [8192, 16384]
Input Data type: BFloat16
Input tensor shapes:
x: torch.Size([8192, 16384]), dtype: torch.bfloat16
target: torch.Size([8192]), dtype: torch.int64
Kernel execution time: 0.4894 ms
Mem throughput: 1097.00 GB/s
Ref kernel execution time: 0.2265 ms
Ref mem throughput: 2371.00 GB/s
```
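For context on the throughput numbers, here is a rough sketch of how they can be reproduced from the shapes and timings above. It assumes the benchmark counts one read of `x` plus one write of `dx` in bf16 and ignores the small `target` and per-row tensors; that accounting is my assumption, not something the log states, and `effective_bandwidth_gbs` is a hypothetical helper rather than part of quack.

```python
def effective_bandwidth_gbs(rows, cols, bytes_per_elem, time_ms):
    """Approximate memory throughput in GB/s (1 GB = 1e9 bytes)."""
    # Assumption: traffic is one read of x plus one write of dx.
    # The int64 target and any per-row tensors are ignored as negligible.
    bytes_moved = 2 * rows * cols * bytes_per_elem
    return bytes_moved / (time_ms * 1e-3) / 1e9


# Shapes and timings copied from the benchmark output (bf16 = 2 bytes).
kernel_gbs = effective_bandwidth_gbs(8192, 16384, 2, 0.4894)
ref_gbs = effective_bandwidth_gbs(8192, 16384, 2, 0.2265)
slowdown = 0.4894 / 0.2265

print(f"kernel: {kernel_gbs:.1f} GB/s")
print(f"ref:    {ref_gbs:.1f} GB/s")
print(f"kernel is {slowdown:.2f}x slower than ref")
```

Under that assumption the results land within about 1 GB/s of the reported figures, so the gap looks like a real difference in achieved bandwidth rather than a difference in how the two numbers are computed.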

cc @tridao @lezcano
