Parallelism setups might not be correct in GPT-OSS 120B vLLM inference recipes #161

@lepan-google

Description

When I ran the benchmark following these recipes:

I found that only one model copy was downloaded to the node from GCS, even though tp=2 was set.

Upon further inspection of the logs, I noticed that data_parallelism switches from 4 to 1 in the vLLM engine.

I would like to understand this behavior better. Specifically, based on the current configurations, how many model copies should reside on a single node, and how many copies are actually downloaded from GCS? Does the workload download the model multiple times, or is it downloaded once and then shared across the chips?
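To make the expectation concrete, here is a minimal sketch of the arithmetic behind it. It assumes 8 accelerator chips per node (not stated in the recipe excerpt above; chosen because it is consistent with tp=2 and the data_parallelism of 4 seen in the logs) and assumes each data-parallel replica holds one full copy of the weights sharded across its tp chips. The function name `expected_replicas` is hypothetical and is not part of vLLM.

```python
def expected_replicas(chips_per_node: int, tensor_parallel: int) -> int:
    """Return how many full model copies fit on a node.

    Assumes each replica is sharded across `tensor_parallel` chips and
    that every chip on the node is used.
    """
    if tensor_parallel <= 0 or chips_per_node % tensor_parallel != 0:
        raise ValueError("chips_per_node must be a positive multiple of tensor_parallel")
    return chips_per_node // tensor_parallel


CHIPS_PER_NODE = 8  # assumption: not confirmed by the recipe
TP = 2              # from the recipe configuration described above

replicas = expected_replicas(CHIPS_PER_NODE, TP)
print(f"tp={TP}, chips={CHIPS_PER_NODE}: expected model copies per node = {replicas}")
```

Under these assumptions the node would hold 4 model copies, so a data_parallelism of 1 in the engine logs would leave most chips without a replica. Whether each replica triggers its own download from GCS is a separate question that this arithmetic does not answer.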

cc @karan
