Only one matrix size tested (M=N=K=2048)

Every kernel is benchmarked at exactly one point. This causes two concrete problems:

a) Non-multiple sizes are never tested. All kernels assume M, N, K are exact multiples of their tile dimensions, and none of them performs bounds checking. Running M=N=K=3000 would produce wrong results or crash. This limitation is undocumented.

b) The memory wall looks different at different scales. At M=N=K=2048 the total FP16 input data is only A+B = 16 MB. The T4 can load this in 16 MB / 320 GB/s = 0.05 ms. But v08 takes 5.3 ms and reads 662 MB, 41× the minimum. At M=N=K=4096 the inputs grow 4× to 64 MB while the FLOP count grows 8×, so the ratio of compute to data doubles; the kernel might behave very differently.
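
The arithmetic in (b) can be reproduced for any size with a few lines of Python. This is a rough sketch, not part of the repo: the function names are made up for illustration, the 320 GB/s bandwidth is the figure quoted above, and only the A and B reads are counted (writes to C are ignored).

```python
FP16_BYTES = 2
T4_BANDWIDTH_GB_S = 320.0  # bandwidth figure quoted in this issue


def min_input_bytes(m, n, k, elem_bytes=FP16_BYTES):
    """Bytes needed to read A (m x k) and B (k x n) exactly once."""
    return (m * k + k * n) * elem_bytes


def min_load_ms(nbytes, bandwidth_gb_s=T4_BANDWIDTH_GB_S):
    """Lower bound on the time to stream nbytes at the given bandwidth."""
    return nbytes / (bandwidth_gb_s * 1e9) * 1e3


def flops_per_input_byte(m, n, k, elem_bytes=FP16_BYTES):
    """GEMM FLOPs (2*m*n*k) divided by the minimum input traffic."""
    return (2 * m * n * k) / min_input_bytes(m, n, k, elem_bytes)


if __name__ == "__main__":
    for size in (2048, 4096):
        nbytes = min_input_bytes(size, size, size)
        print(
            f"M=N=K={size}: inputs {nbytes / 2**20:.0f} MB, "
            f"min load {min_load_ms(nbytes):.3f} ms, "
            f"{flops_per_input_byte(size, size, size):.0f} FLOP per input byte"
        )
```

For square problems the FLOP-per-byte figure grows linearly with the size, which is why a single benchmark point says little about where a kernel sits relative to the memory wall.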
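
For (a), the usual remedy is to round the grid up with a ceiling division and guard every tile edge. The sketch below models that pattern on the CPU in plain Python so it can double as a correctness reference for odd sizes. It is an assumption about what a fix could look like, not the repo's kernels: `ceil_div`, `tiled_matmul`, `naive_matmul` and the tile size of 4 are hypothetical.

```python
def ceil_div(a, b):
    """Number of tiles of width b needed to cover a elements."""
    return (a + b - 1) // b


def tiled_matmul(A, B, m, n, k, tile=4):
    """Tiled C = A @ B that stays correct when m, n, k are not multiples of tile.

    The min(...) bounds play the role of the `if (row < M && col < N)` guards
    a GPU kernel would need on its edge tiles.
    """
    C = [[0.0] * n for _ in range(m)]
    for bi in range(ceil_div(m, tile)):
        for bj in range(ceil_div(n, tile)):
            for bk in range(ceil_div(k, tile)):
                for i in range(bi * tile, min((bi + 1) * tile, m)):
                    for j in range(bj * tile, min((bj + 1) * tile, n)):
                        acc = C[i][j]
                        for p in range(bk * tile, min((bk + 1) * tile, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C


def naive_matmul(A, B, m, n, k):
    """Untiled reference used to check the tiled version."""
    return [
        [sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


if __name__ == "__main__":
    m, n, k = 5, 7, 6  # deliberately not multiples of the tile size
    A = [[(i + p) % 5 for p in range(k)] for i in range(m)]
    B = [[(p * 2 + j) % 3 for j in range(n)] for p in range(k)]
    print(tiled_matmul(A, B, m, n, k) == naive_matmul(A, B, m, n, k))
```

With a hypothetical 128-wide tile, `ceil_div(3000, 128)` gives 24 tiles, so the last tile in each dimension is only partly filled and is exactly where unguarded loads and stores would go out of bounds.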
