Description
Hi,
I am trying to understand why latency increases by roughly 4× when going from batch_size = 1 to batch_size = 8 at 512 tokens, while I see no latency increase from batch_size = 1 to batch_size = 8 at 128 tokens.
The GPU traces below were captured with NVIDIA Nsight.
In this context, batch_size means that the client sends a single batched request and the server has dynamic batching enabled.
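To be concrete about what "the client sends a batched request" means here, a rough sketch of the client side, assuming a Triton-style HTTP client (the model name, tensor name, and dtype below are placeholders, not my exact configuration):

```python
import numpy as np
import tritonclient.http as httpclient

# Sketch only: model/tensor names and dtype are placeholders.
client = httpclient.InferenceServerClient(url="localhost:8000")

batch_size, seq_len = 8, 512
input_ids = np.zeros((batch_size, seq_len), dtype=np.int32)

# One request carrying all 8 sequences, i.e. shape (8, 512),
# rather than 8 separate requests of shape (1, 512).
inp = httpclient.InferInput("input_ids", list(input_ids.shape), "INT32")
inp.set_data_from_numpy(input_ids)

result = client.infer(model_name="my_model", inputs=[inp])
```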
Given that SM utilization reaches almost 90% at batch_size = 8 with 512 tokens, does this mean the GPU is already fully utilized?
What should I look at to determine whether the GPU is truly fully utilized, in terms of both SM usage and DRAM bandwidth?
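One rough way to watch both counters while the benchmark runs is NVML polling; a minimal sketch with pynvml (pip package nvidia-ml-py), with the important caveats in the comments:

```python
# Caveat: NVML's "utilization.gpu" is the fraction of time at least one
# kernel was resident on the GPU, NOT the fraction of SMs doing work,
# and "utilization.memory" is the fraction of time the memory controller
# was busy, NOT the fraction of peak DRAM bandwidth achieved.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0; adjust as needed

try:
    for _ in range(50):  # sample for ~5 s while the benchmark runs
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"gpu_busy={util.gpu}%  mem_ctrl_busy={util.memory}%")
        time.sleep(0.1)
finally:
    pynvml.nvmlShutdown()

# For a per-kernel answer ("is this kernel compute- or bandwidth-bound?"),
# Nsight Compute's speed-of-light metrics are more direct, e.g.:
#   ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,dram__throughput.avg.pct_of_peak_sustained_elapsed <app>
```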
512 tokens, batch_size=1:
p90 latency: 12 ms, rps: 82.49
GPU trace: (screenshot)

512 tokens, batch_size=8:
p90 latency: 39 ms, rps: 24.92
GPU trace: (screenshot)

However, when I look at 128 tokens, the latency does not increase:
128 tokens, batch_size=1:
p90 latency: 11 ms, rps: 91
GPU trace: (screenshot)

128 tokens, batch_size=8:
p90 latency: 7 ms, rps: 146
GPU trace: (screenshot)
cc @yuanyao-nv @brb-nv @lix19937 @YouSenRong @zhenhuaw-me @JuncFang-git @ttyio @pranavm-nvidia
