Description
Hi,
I am trying to understand why latency increases by roughly 4× when going from batch_size = 1 to batch_size = 8 at 512 tokens, while I see no latency increase from batch_size = 1 to batch_size = 8 at 128 tokens.
The GPU traces below were captured with NVIDIA Nsight.
In this context, batch_size means that the client sends a single batched request and the server has dynamic batching enabled.
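To be concrete about what "the client sends a batched request" means here, a rough sketch of the client side, assuming a Triton-style HTTP client (the model name, tensor name, and dtype below are placeholders, not my exact configuration):

```python
import numpy as np
import tritonclient.http as httpclient

# Sketch only: model/tensor names and dtype are placeholders.
client = httpclient.InferenceServerClient(url="localhost:8000")

batch_size, seq_len = 8, 512
input_ids = np.zeros((batch_size, seq_len), dtype=np.int32)

# One request carrying all 8 sequences, i.e. shape (8, 512),
# rather than 8 separate requests of shape (1, 512).
inp = httpclient.InferInput("input_ids", list(input_ids.shape), "INT32")
inp.set_data_from_numpy(input_ids)

result = client.infer(model_name="my_model", inputs=[inp])
```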
Given that SM utilization reaches almost 90% at batch_size = 8 with 512 tokens, does this mean the GPU is already fully utilized?
What should I look at to determine whether the GPU is truly fully utilized, in terms of both SM usage and DRAM bandwidth?
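One rough way to watch both counters while the benchmark runs is NVML polling; a minimal sketch with pynvml (pip package nvidia-ml-py), with the important caveats in the comments:

```python
# Caveat: NVML's "utilization.gpu" is the fraction of time at least one
# kernel was resident on the GPU, NOT the fraction of SMs doing work,
# and "utilization.memory" is the fraction of time the memory controller
# was busy, NOT the fraction of peak DRAM bandwidth achieved.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0; adjust as needed

try:
    for _ in range(50):  # sample for ~5 s while the benchmark runs
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"gpu_busy={util.gpu}%  mem_ctrl_busy={util.memory}%")
        time.sleep(0.1)
finally:
    pynvml.nvmlShutdown()

# For a per-kernel answer ("is this kernel compute- or bandwidth-bound?"),
# Nsight Compute's speed-of-light metrics are more direct, e.g.:
#   ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,dram__throughput.avg.pct_of_peak_sustained_elapsed <app>
```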
512 tokens, batch_size=1:
p90 latency: 12 ms, rps: 82.49
GPU trace: (screenshot)

512 tokens, batch_size=8:
p90 latency: 39 ms, rps: 24.92
GPU trace: (screenshot)

However, when I look at 128 tokens, the latency does not increase:
128 tokens, batch_size=1:
p90 latency: 11 ms, rps: 91
GPU trace: (screenshot)

128 tokens, batch_size=8:
p90 latency: 7 ms, rps: 146
GPU trace: (screenshot)
cc @yuanyao-nv @brb-nv @lix19937 @YouSenRong @zhenhuaw-me @JuncFang-git @ttyio @pranavm-nvidia
