
analyze performance for batching. #4680

@geraldstanje

hi,

I am trying to understand why latency increases by roughly 4× when going from batch_size = 1 to batch_size = 8 for 512 tokens, but I don't see any increase in latency for 128 tokens over the same batch-size range.
The GPU traces below were captured with NVIDIA Nsight.
In this context, batch_size means the client sends a batched request and the server has dynamic batching enabled.
Given that SM utilization reaches almost 90% at batch_size = 8 for 512 tokens, does this mean the GPU is already fully utilized?
What should I look at to determine whether the GPU is truly fully utilized, i.e. both SM throughput and DRAM bandwidth?
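
For a quick first pass, one option is to poll NVML while the benchmark runs. One caveat: the "GPU utilization" that NVML (and nvidia-smi) reports is the fraction of time at least one kernel was resident on the device, not the fraction of SMs doing useful work, so a high number there does not by itself prove compute saturation. A minimal sketch, assuming `pynvml` is installed and the model is served from GPU 0:

```python
# Minimal NVML polling sketch (assumes `pynvml` is installed and the
# model runs on GPU 0 -- adjust the index for your setup).
# Caveat: util.gpu is the % of time at least ONE kernel was resident,
# not the % of SMs that were busy; util.memory is the % of time device
# memory was being read or written.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    for _ in range(50):  # ~5 s of samples while the benchmark runs
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"kernel-active: {util.gpu:3d}%  dram-active: {util.memory:3d}%")
        time.sleep(0.1)
finally:
    pynvml.nvmlShutdown()
```

For a per-kernel answer, Nsight Compute's Speed of Light section reports SM throughput and DRAM throughput as a percentage of peak; if SM throughput is near peak while DRAM throughput is not, that kernel is compute-bound at that batch size, and vice versa.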

512 tokens, batch_size=1:
p90 latency: 12ms, rps: 82.49
gpu trace: (screenshot attached)

512 tokens, batch_size=8:
p90 latency: 39ms, rps: 24.92
gpu trace: (screenshot attached)

more zoomed-out overview: (screenshot attached)

But when I look at 128 tokens, the latency does not increase at all:

128 tokens, batch_size=1:
p90 latency: 11ms, rps: 91
gpu trace: (screenshot attached)

128 tokens, batch_size=8:
p90 latency: 7ms, rps: 146
gpu trace: (screenshot attached)
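
To put the numbers above side by side, it may help to convert the reported requests/s into sequences/s. A back-of-envelope sketch, assuming each client request carries exactly batch_size sequences (an assumption; if rps already counts individual sequences, the picture changes):

```python
# Back-of-envelope: convert the reported requests/s into sequences/s,
# assuming one request = `batch` sequences (assumption, not confirmed:
# the server's dynamic batching may regroup requests differently).
runs = [
    ("512 tok, bs=1", 1, 82.49, 12),
    ("512 tok, bs=8", 8, 24.92, 39),
    ("128 tok, bs=1", 1, 91.00, 11),
    ("128 tok, bs=8", 8, 146.00, 7),
]
for name, batch, rps, p90_ms in runs:
    print(f"{name}: {rps * batch:7.1f} seq/s at p90 {p90_ms} ms")
# 512 tokens: 82.5 -> 199.4 seq/s (~2.4x more throughput) while p90 grows
# 12 -> 39 ms (~3.3x): batching helps less than linearly, which is what
# you would expect if the SMs are already close to saturated per request.
```

Under that reading, the latency jump at 512 tokens is consistent with a compute-bound regime, while at 128 tokens there is still enough idle SM capacity for the larger batch to run essentially for free.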

cc @yuanyao-nv @brb-nv @lix19937 @YouSenRong @zhenhuaw-me @JuncFang-git @ttyio @pranavm-nvidia
