Every kernel is benchmarked at exactly one problem size. This causes two concrete problems:
a) Non-multiple sizes are never tested. All kernels assume M, N, K are exact multiples of their tile dimensions. No bounds checking exists. Running M=N=K=3000 would produce wrong results or crash. This is undocumented.
b) The memory wall looks different at different scales. At M=N=K=2048 the FP16 inputs total only A+B = 16 MB, which the T4 could load in 16 MB / 320 GB/s = 0.05 ms. Yet v08 takes 5.3 ms and reads 662 MB, 41× the minimum. At M=N=K=4096 the inputs grow 4× to 64 MB while the FLOP count grows 8×, so the balance between memory traffic and compute shifts and the kernel might behave very differently.
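The arithmetic behind both points can be checked with a small script. This is only a sketch: the helper names and the 128×128×32 tile shape are assumptions made for illustration and are not taken from the kernels, and the 320 GB/s figure is the peak bandwidth quoted above, not a measured value.

```python
# Sketch only: helper names and the default tile shape are illustrative
# assumptions, not values taken from the benchmarked kernels.

BYTES_FP16 = 2
T4_BANDWIDTH_BYTES_PER_S = 320e9  # peak bandwidth figure quoted in the text
MIB = 2**20


def is_tile_aligned(m, n, k, tile=(128, 128, 32)):
    """Return True if M, N, K are exact multiples of the (assumed) tile dims."""
    tile_m, tile_n, tile_k = tile
    return m % tile_m == 0 and n % tile_n == 0 and k % tile_k == 0


def min_input_bytes(m, n, k):
    """Bytes needed to read A (M x K) and B (K x N) exactly once in FP16."""
    return (m * k + k * n) * BYTES_FP16


def min_load_ms(m, n, k, bandwidth=T4_BANDWIDTH_BYTES_PER_S):
    """Lower bound on the time to stream A and B once at peak bandwidth."""
    return min_input_bytes(m, n, k) / bandwidth * 1e3


def flops(m, n, k):
    """Multiply-add count of a GEMM, counted as 2 FLOPs per term."""
    return 2 * m * n * k


if __name__ == "__main__":
    for size in (2048, 3000, 4096):
        mib = min_input_bytes(size, size, size) / MIB
        ms = min_load_ms(size, size, size)
        # FLOPs per input byte: the ideal arithmetic intensity at this size.
        intensity = flops(size, size, size) / min_input_bytes(size, size, size)
        aligned = is_tile_aligned(size, size, size)
        print(f"M=N=K={size}: aligned={aligned}, inputs={mib:.1f} MiB, "
              f"min load={ms:.3f} ms, FLOP/byte={intensity:.0f}")
```

Running the same sweep against measured kernel times and DRAM traffic at several sizes, including a non-aligned one such as 3000, would show whether the 41× overhead seen at 2048 holds, shrinks, or grows elsewhere.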