When I ran the benchmark following these recipes:
I found that only one model copy was downloaded to the node from GCS, even though tp=2 was set.
Upon further inspection of the logs, I noticed that data_parallelism drops from 4 to 1 inside the vLLM engine.
I would like to understand this behavior better. Specifically, based on the current configurations, how many model copies should reside on a single node, and how many copies are actually downloaded from GCS? Does the workload download the model multiple times, or is it downloaded once and then shared across the chips?
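For reference, here is my mental model of the layout, sketched in Python. The numbers are assumptions (8 chips per node; dp inferred as chips/tp), not values read from the actual recipes or logs, so please correct me if the topology differs:

```python
# Hypothetical sketch: how chips might map to model replicas under
# tensor parallelism (tp) and data parallelism (dp).
# ASSUMPTIONS: 8 chips per node, tp=2 from the recipe, and dp derived
# as chips // tp. None of these are confirmed by the recipes.

CHIPS_PER_NODE = 8          # assumed accelerator count per node
tp = 2                      # tensor-parallel degree set in the recipe
dp = CHIPS_PER_NODE // tp   # 4 replicas if every chip is used

# Each dp replica holds one logical copy of the weights, sharded
# 1/tp per chip. Whether that implies dp separate downloads from GCS
# depends on whether replicas share an on-disk cache.
logical_copies_on_node = dp
downloads_with_shared_cache = 1      # one fetch, reused by all replicas
downloads_without_shared_cache = dp  # one fetch per replica

print(logical_copies_on_node,
      downloads_with_shared_cache,
      downloads_without_shared_cache)
```

Under those assumptions I would expect 4 logical copies per node but only 1 GCS download if the replicas share a cache, which would match what I observed, except that the dp=4→1 switch in the logs suggests something else is going on.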
cc @karan