=== Cross Entropy Backward Pass Benchmark ===
Tensor dimensions: [8192, 16384]
Input Data type: BFloat16
Input tensor shapes:
x: torch.Size([8192, 16384]), dtype: torch.bfloat16
target: torch.Size([8192]), dtype: torch.int64
Kernel execution time: 0.4894 ms
Mem throughput: 1097.00 GB/s
Ref kernel execution time: 0.2265 ms
Ref mem throughput: 2371.00 GB/s
Seems like it's roughly 2.2x slower than the torch.compile-generated code (0.4894 ms vs. 0.2265 ms). Is this expected? I'm using quack/main and triton 3.6.0.

cc @tridao @lezcano
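
For reference, here is a quick sanity check of the throughput numbers in the log. It is only a sketch, and it assumes the benchmark counts one read of `x` plus one write of the gradient, ignoring the small `target` tensor. That accounting is my guess, not something confirmed from the benchmark source. The names `throughput_gbps`, `kernel_gbps`, `ref_gbps`, and `slowdown` are just for illustration.

```python
# Sanity check of the reported memory throughput.
# Assumption (not confirmed from the benchmark source): bytes moved =
# one read of x + one write of dx; the int64 target tensor is ignored.
rows, cols = 8192, 16384
bytes_per_elem = 2  # bfloat16


def throughput_gbps(time_ms: float) -> float:
    """Effective throughput in GB/s (1 GB = 1e9 bytes) for a given time."""
    total_bytes = 2 * rows * cols * bytes_per_elem  # read x, write dx
    return total_bytes / (time_ms * 1e-3) / 1e9


kernel_gbps = throughput_gbps(0.4894)  # time reported for the kernel
ref_gbps = throughput_gbps(0.2265)     # time reported for the reference
slowdown = 0.4894 / 0.2265             # kernel time relative to reference

print(f"kernel: {kernel_gbps:.1f} GB/s")
print(f"ref:    {ref_gbps:.1f} GB/s")
print(f"slowdown: {slowdown:.2f}x")
```

Under that assumption the computed values land close to the reported 1097.00 GB/s and 2371.00 GB/s (the small gap on the reference side is likely rounding of the printed time), so both runs appear to be measured with the same byte count and the gap comes from execution time alone.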