When I ran the benchmark following these recipes:
I found that only one model copy was downloaded to the node from GCS, even though tp=2 was set.
Upon further inspection of the logs, I noticed that data_parallelism drops from 4 to 1 inside the vLLM engine.
I would like to understand this behavior better. Specifically, based on the current configurations, how many model copies should reside on a single node, and how many copies are actually downloaded from GCS? Does the workload download the model multiple times, or is it downloaded once and then shared across the chips?
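For reference, here is my mental model of the layout, sketched in Python. The numbers are assumptions (8 chips per node; dp inferred as chips/tp), not values read from the actual recipes or logs, so please correct me if the topology differs:

```python
# Hypothetical sketch: how chips might map to model replicas under
# tensor parallelism (tp) and data parallelism (dp).
# ASSUMPTIONS: 8 chips per node, tp=2 from the recipe, and dp derived
# as chips // tp. None of these are confirmed by the recipes.

CHIPS_PER_NODE = 8          # assumed accelerator count per node
tp = 2                      # tensor-parallel degree set in the recipe
dp = CHIPS_PER_NODE // tp   # 4 replicas if every chip is used

# Each dp replica holds one logical copy of the weights, sharded
# 1/tp per chip. Whether that implies dp separate downloads from GCS
# depends on whether replicas share an on-disk cache.
logical_copies_on_node = dp
downloads_with_shared_cache = 1      # one fetch, reused by all replicas
downloads_without_shared_cache = dp  # one fetch per replica

print(logical_copies_on_node,
      downloads_with_shared_cache,
      downloads_without_shared_cache)
```

Under those assumptions I would expect 4 logical copies per node but only 1 GCS download if the replicas share a cache, which would match what I observed, except that the dp=4→1 switch in the logs suggests something else is going on.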
cc @karan