
Only one matrix size tested (M=N=K=2048) #2

Description

@drakempham

Every kernel is benchmarked at exactly one point. This causes two concrete problems:

a) Non-multiple sizes are never tested. All kernels assume M, N, K are exact multiples of their tile dimensions, and no bounds checking exists. Running M=N=K=3000 would produce wrong results or crash. This limitation is undocumented.
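To make the non-multiple case concrete, here is a minimal host-side sketch in Python. The tile sizes (128×128×32) and the helper names `ceil_div` and `launch_plan` are assumptions for illustration, not values taken from the kernels in this repo. It shows how a size like 3000 leaves partial tiles that an unguarded kernel would not handle correctly.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, as used to size a CUDA grid."""
    return (a + b - 1) // b


def launch_plan(m: int, n: int, k: int,
                tile_m: int = 128, tile_n: int = 128, tile_k: int = 32) -> dict:
    """Describe the launch for an M x N x K GEMM with hypothetical tile sizes.

    Returns the grid shape, the leftover ("tail") elements in each dimension,
    whether bounds guards are needed, and the sizes after padding up to the
    next tile multiple.
    """
    tails = {"m": m % tile_m, "n": n % tile_n, "k": k % tile_k}
    return {
        "grid": (ceil_div(n, tile_n), ceil_div(m, tile_m)),
        "tails": tails,
        "needs_guard": any(t != 0 for t in tails.values()),
        "padded": (ceil_div(m, tile_m) * tile_m,
                   ceil_div(n, tile_n) * tile_n,
                   ceil_div(k, tile_k) * tile_k),
    }


if __name__ == "__main__":
    # 2048 divides evenly; 3000 does not for these tile sizes.
    for size in (2048, 3000):
        print(size, launch_plan(size, size, size))
```

With these assumed tiles, 2048 needs no guards, while 3000 leaves a partial tile in every dimension. A benchmark sweep that includes at least one such size would expose the missing bounds checks, and either guarding the tail tiles or padding the inputs to the `padded` sizes would be one way to handle them.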

b) The memory wall looks different at different scales. At M=N=K=2048 the FP16 inputs total only A+B = 16 MB, which the T4 can load in 16 MB / 320 GB/s ≈ 0.05 ms. Yet v08 takes 5.3 ms and reads 662 MB, 41× the minimum. At M=N=K=4096 the inputs grow 4× (to 64 MB) while the FLOP count grows 8×, so the balance between compute and memory traffic shifts and the kernel might behave very differently.
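The arithmetic in (b) can be reproduced with a short Python sketch. The function names are hypothetical. The 320 GB/s bandwidth and the 662 MB measured read volume come from the numbers above, and the sketch assumes the 662 MB figure uses the same unit as the 16 MB input size. It counts only the A and B inputs, as the issue does.

```python
MIB = 2 ** 20


def min_input_bytes(m: int, n: int, k: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to read A (M x K) and B (K x N) exactly once; FP16 by default."""
    return (m * k + k * n) * bytes_per_elem


def min_load_time_ms(num_bytes: int, bandwidth_gb_s: float = 320.0) -> float:
    """Lower bound on load time at the given DRAM bandwidth (GB = 1e9 bytes)."""
    return num_bytes / (bandwidth_gb_s * 1e9) * 1e3


def arithmetic_intensity(m: int, n: int, k: int) -> float:
    """GEMM FLOPs (2*M*N*K) per byte of minimum input traffic."""
    return 2 * m * n * k / min_input_bytes(m, n, k)


if __name__ == "__main__":
    for size in (2048, 4096):
        b = min_input_bytes(size, size, size)
        print(f"M=N=K={size}: inputs {b / MIB:.0f} MiB, "
              f"min load {min_load_time_ms(b):.3f} ms, "
              f"intensity {arithmetic_intensity(size, size, size):.0f} FLOP/byte")

    # Over-read factor for v08 at 2048, using the 662 MB figure from the issue.
    measured = 662 * MIB
    print(f"over-read: {measured / min_input_bytes(2048, 2048, 2048):.1f}x")
```

Doubling the size doubles the FLOPs available per byte of minimum traffic, which is why a single benchmark point at 2048 cannot show how the kernel's memory behavior scales.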
