345 changes: 193 additions & 152 deletions inference/trillium/vLLM/Qwen2.5-32B/README.md
@@ -1,105 +1,126 @@
# Serve Qwen2.5-32B with vLLM on TPU VMs.

In this guide, we show how to serve Qwen2.5-32B ([Qwen/Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B)).

## Step 0: Install `gcloud cli`

You can reproduce this experiment from your dev environment (e.g. your laptop). You need to install `gcloud` locally to complete this tutorial.

To install `gcloud cli` please follow this guide: [Install the gcloud CLI](https://cloud.google.com/sdk/docs/install#mac)

Once it is installed, you can log in to GCP from your terminal with this command: `gcloud auth login`.

## Step 1: Create a v6e TPU instance

We create a single VM with 4 Trillium chips. If you need a different number of chips, set a different value for `--topology`, such as `1x1` or `2x4`.

To learn more about topologies: [v6e VM Types](https://cloud.google.com/tpu/docs/v6e#vm-types).

```bash
export TPU_NAME=your-tpu-name
export ZONE=your-tpu-zone
export PROJECT=your-tpu-project

# this command creates a tpu vm with 4 Trillium (v6e) chips - adjust it to suit your needs
gcloud alpha compute tpus tpu-vm create $TPU_NAME \
--type v6e --topology 2x2 \
--project $PROJECT --zone $ZONE --version v2-alpha-tpuv6e
```

## Step 2: ssh to the instance

```bash
gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE
```

## Step 3: Use the latest vllm docker image for TPU
This guide uses the `vllm/vllm-tpu:nightly` image to get the latest TPU nightly build. For reproducible results, you can substitute a pinned image tag instead.

```bash
export DOCKER_URI=vllm/vllm-tpu:nightly
```

## Step 4: Run the docker container in the TPU instance

```bash
sudo docker run -t --rm --name $USER-vllm --privileged --net=host -v /dev/shm:/dev/shm --shm-size 10gb -p 8000:8000 --entrypoint /bin/bash -it ${DOCKER_URI}
```

## Step 5: Set up env variables
Export your Hugging Face token along with other environment variables inside the container.

```bash
export HF_HOME=/dev/shm
export HF_TOKEN="<your HF token>"
```

## Step 6: Serve the model

Now we start the vLLM server. Keep this terminal open for the entire duration of the experiment.

```bash
export MAX_MODEL_LEN=4096
export TP=4 # number of chips
# export RATIO=0.8
# export PREFIX_LEN=0

VLLM_USE_V1=1 vllm serve Qwen/Qwen2.5-32B --seed 42 --disable-log-requests --gpu-memory-utilization 0.95 --max-num-batched-tokens 512 --max-num-seqs 512 --tensor-parallel-size $TP --max-model-len $MAX_MODEL_LEN
```

The server takes a few minutes to start, depending on the model size. Once the following lines appear in the logs, the server is ready to serve requests or run benchmarks:

```bash
INFO: Started server process [7]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```

## Step 7: Prepare the test environment

Open a new terminal to test the server and run the benchmark (keep the previous terminal open).

First, SSH into the TPU VM from the new terminal:

```bash
export TPU_NAME=your-tpu-name
export ZONE=your-tpu-zone
export PROJECT=your-tpu-project

gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE
```

## Step 8: Access the running container

```bash
sudo docker exec -it $USER-vllm bash
```

## Step 9: Test the server.

Let's submit a test request to the server. This confirms that the server launched properly and that the model returns a sensible response.

# Serve Qwen/Qwen2.5-32B with vLLM on Cloud TPU v6e (Trillium) VMs

This guide provides step-by-step instructions to serve the Qwen/Qwen2.5-32B model using vLLM on Cloud TPU v6e (Trillium) virtual machines. By following this recipe, you will deploy the model and be able to send inference requests to it.

The target audience for this guide is ML engineers familiar with Google Cloud, Linux, and Docker.

## Prerequisites

Before you begin, ensure you have the following:

* **Google Cloud SDK:** You need the `gcloud` command-line tool installed and authenticated. For installation instructions, refer to [Installing the Google Cloud SDK](https://cloud.google.com/sdk/docs/install). After installation, log in to your Google Cloud account by running `gcloud auth login`.
* **Hugging Face Token:** A Hugging Face token is required to download the model from the Hugging Face Hub. Ensure you have one available. You can find more information on creating tokens on the [Hugging Face website](https://huggingface.co/docs/hub/security-tokens).
* **Basic Linux and Docker knowledge:** You should be familiar with basic Linux commands and Docker concepts for navigating directories, running scripts, and managing containers.

## Step 1: Create a Cloud TPU v6e instance

This step guides you through creating a Cloud TPU v6e virtual machine (VM). You will set the environment variables `TPU_NAME`, `ZONE`, and `PROJECT` in your local shell where you run `gcloud` commands.

1. Define environment variables for your Cloud TPU v6e instance configuration:
```bash
export TPU_NAME="your-tpu-name"
export ZONE="your-tpu-zone"
export PROJECT="your-gcp-project"
```
Replace `your-tpu-name`, `your-tpu-zone`, and `your-gcp-project` with your desired TPU name, zone, and Google Cloud project ID, respectively.

2. Execute the command to create the Cloud TPU v6e VM. The following command creates a VM with 4 Trillium chips by requesting `--accelerator-type v6e-4`, which corresponds to a `2x2` topology (the accelerator type already fixes the topology, so `--topology` is not passed separately). If you need a different configuration, adjust this value. For more details on available topologies, refer to [Cloud TPU v6e VM Types](https://cloud.google.com/tpu/docs/v6e#vm-types):
```bash
gcloud alpha compute tpus tpu-vm create ${TPU_NAME} \
--accelerator-type v6e-4 \
--project ${PROJECT} \
--zone ${ZONE} \
--version v2-alpha-tpuv6e
```
This command may take a few minutes to complete. Upon success, `gcloud` will display details of the created Cloud TPU v6e VM.

3. Once the instance is created, connect to it using SSH. The following command uses the environment variables you defined previously:
```bash
gcloud compute tpus tpu-vm ssh ${TPU_NAME} --project ${PROJECT} --zone ${ZONE}
```
You are now connected to your Cloud TPU v6e VM's shell.
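
If the SSH connection fails because the VM is still provisioning, you can check its state first. The following is a minimal sketch rather than part of the original recipe: `tpu_state` is a hypothetical helper name, and it assumes `TPU_NAME`, `ZONE`, and `PROJECT` are set in your local shell as in this step. It wraps `gcloud compute tpus tpu-vm describe` and prints the `state` field, which should read `READY` once the VM is usable.

```bash
# Hypothetical helper: print the lifecycle state of the TPU VM
# (for example CREATING or READY).
# Run it from your local shell, not from inside the VM.
tpu_state() {
  gcloud compute tpus tpu-vm describe "${TPU_NAME}" \
    --project "${PROJECT}" \
    --zone "${ZONE}" \
    --format="value(state)"
}

# Usage (uncomment to run): tpu_state
```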

## Step 2: Set up the environment and launch vLLM

Execute the following commands on the Cloud TPU v6e VM you connected to in the previous step.

1. Define the Docker image to use and run the Docker container. You can use `vllm/vllm-tpu:nightly` for the latest vLLM TPU nightly image or specify a pinned image if you require a specific version:
```bash
export DOCKER_URI="vllm/vllm-tpu:nightly" # Or your preferred pinned image, e.g., vllm/vllm-tpu:20240701
sudo docker run -t --rm --name "${USER}-vllm" --privileged --net=host -v /dev/shm:/dev/shm --shm-size 10gb -p 8000:8000 --entrypoint /bin/bash -it "${DOCKER_URI}"
```
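This command starts a Docker container in privileged mode on the host network and mounts `/dev/shm` for shared memory. Because `--net=host` shares the VM's network stack, the server's port `8000` is reachable on the VM directly; Docker ignores the `-p 8000:8000` mapping in host network mode. You will now be inside the Docker container's shell.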

2. Inside the container, export your Hugging Face token and set the `HF_HOME` environment variable. The `HF_TOKEN` is necessary for downloading the model.
```bash
export HF_HOME="/dev/shm" # Specifies the cache directory for Hugging Face assets
export HF_TOKEN="<your HF token>" # Replace <your HF token> with your actual Hugging Face token
```

## Step 3: Start the vLLM server

Execute the following commands inside the Docker container on the Cloud TPU v6e VM.

1. Define environment variables for the vLLM server configuration:
```bash
export MAX_MODEL_LEN=4096 # Maximum sequence length the model can handle
export TP=4 # Tensor parallelism: number of TPU chips (should match topology, e.g., 4 for 2x2)
# export RATIO=0.8 # Optional: for benchmark configuration (share of random prompts)
# export PREFIX_LEN=0 # Optional: for benchmark configuration (fixed prefix length for prompts)
```

2. Start the vLLM server with the Qwen/Qwen2.5-32B model:
```bash
VLLM_USE_V1=1 vllm serve Qwen/Qwen2.5-32B \
--seed 42 \
--disable-log-requests \
--gpu-memory-utilization 0.95 \
--max-num-batched-tokens 512 \
--max-num-seqs 512 \
--tensor-parallel-size ${TP} \
--max-model-len ${MAX_MODEL_LEN}
```
This command starts the vLLM server. Key parameters include:
* `--tensor-parallel-size ${TP}`: Distributes the model across the specified number of TPU chips.
* `--max-model-len ${MAX_MODEL_LEN}`: Sets the maximum sequence length for the model.
* `--gpu-memory-utilization 0.95`: Instructs vLLM to use 95% of the available TPU memory.
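
The relationship between topology and `TP` in this single-VM setup is simple arithmetic: the chip count is the product of the topology dimensions. As a small sketch (the helper name `chips_from_topology` is an assumption for illustration, not part of vLLM or `gcloud`), you could derive `TP` instead of hard-coding it:

```bash
# Hypothetical helper: multiply the dimensions of a topology string such as "2x2".
chips_from_topology() {
  echo "$1" | awk -Fx '{ n = 1; for (i = 1; i <= NF; i++) n *= $i; print n }'
}

TP="$(chips_from_topology 2x2)"
echo "TP=${TP}"   # prints TP=4
```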

The server may take a few minutes to download the model and initialize. Keep this terminal open. When the server is ready to accept requests, you will see messages similar to the following in the logs:
```text
INFO: Started server process [7]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
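
Instead of watching the logs, you can poll the server from a second shell inside the container (Step 4 shows how to open one). The sketch below assumes the server listens on `localhost:8000` and uses the `/health` endpoint of vLLM's OpenAI-compatible server; `wait_for_server` and its default timeout and interval are hypothetical choices, not values from this recipe.

```bash
# Hypothetical helper: poll a URL until it responds successfully or a timeout elapses.
# Arguments: URL, timeout in seconds (default 900), poll interval in seconds (default 10).
wait_for_server() {
  wfs_url="$1"
  wfs_timeout="${2:-900}"
  wfs_interval="${3:-10}"
  wfs_waited=0
  until curl --silent --fail --output /dev/null "${wfs_url}"; do
    if [ "${wfs_waited}" -ge "${wfs_timeout}" ]; then
      echo "server not ready after ${wfs_timeout}s" >&2
      return 1
    fi
    sleep "${wfs_interval}"
    wfs_waited=$((wfs_waited + wfs_interval))
  done
  echo "server ready"
}

# Usage (uncomment to run): wait_for_server http://localhost:8000/health
```

The function returns a non-zero status on timeout, so it can gate a later command such as the benchmark.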

## Step 4: Prepare the testing environment

To test the server and run benchmarks, open a new, separate terminal window on your local machine. Do not close the terminal running the vLLM server.

1. In the new terminal, define the same environment variables for your Cloud TPU v6e instance as in Step 1, if they are not already set in your current shell session:
```bash
export TPU_NAME="your-tpu-name" # Ensure these are the same as in Step 1
export ZONE="your-tpu-zone"
export PROJECT="your-gcp-project"
```

2. SSH into the Cloud TPU v6e VM again from this new terminal:
```bash
gcloud compute tpus tpu-vm ssh ${TPU_NAME} --project ${PROJECT} --zone ${ZONE}
```

3. Once connected to the Cloud TPU v6e VM in the new terminal, access the running Docker container using `docker exec`:
```bash
sudo docker exec -it "${USER}-vllm" bash
```
You are now inside the same Docker container where the vLLM server is running, but in a new shell session.

## Step 5: Test the server endpoint

Execute this command inside the Docker container (from the new terminal session you opened in Step 4).

Send a completion request to the running vLLM server using `curl`:
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
@@ -110,61 +131,81 @@ curl http://localhost:8000/v1/completions \
"temperature": 0
}'
```

## Step 9: Preparing the test image

You might need to install datasets as it's not available in the base vllm image.

```bash
pip install datasets
```
You should receive a JSON response from the model containing the completed text. The `id`, `created` timestamp, and exact `text` will vary:
```json
{
"id": "cmpl-xxxxxxxxxxxxxxxxxxxxxxxx",
"object": "text_completion",
"created": 1678886400,
"model": "Qwen/Qwen2.5-32B",
"choices": [
{
"index": 0,
"text": " they are a new beginning for me. I can start fresh and new each day...",
"logprobs": null,
"finish_reason": "length"
}
],
"usage": {
"prompt_tokens": 7,
"total_tokens": 207,
"completion_tokens": 200
}
}
```
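
To read only the generated text rather than the full JSON, you can pipe the response through a small filter. This is a sketch under assumptions: `extract_completion_text` is a hypothetical helper name, and it relies on `python3` being available in the container.

```bash
# Hypothetical helper: read a /v1/completions JSON response on stdin
# and print the text of the first choice.
extract_completion_text() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["text"])'
}

# Usage: append "| extract_completion_text" to the curl command from this step
# (adding --silent to curl hides its progress meter).
```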

## Step 10: Run the benchmarking

Finally, we are ready to run the benchmark:

```bash
export MAX_INPUT_LEN=1800
export MAX_OUTPUT_LEN=128
export HF_TOKEN="<your HF token>"

cd /workspace/vllm

python benchmarks/benchmark_serving.py \
--backend vllm \
--model "Qwen/Qwen2.5-32B" \
--dataset-name random \
--num-prompts 1000 \
--random-input-len=$MAX_INPUT_LEN \
--random-output-len=$MAX_OUTPUT_LEN \
--seed 100
# --random-range-ratio=$RATIO \
# --random-prefix-len=$PREFIX_LEN
```

The snippet below is what you’d expect to see - the numbers vary based on the vllm version, the model size and the TPU instance type/size.

## Step 6: Run the benchmark

Execute these commands inside the Docker container (from the new terminal session opened in Step 4), where you also ran the `curl` test.

1. Install the `datasets` Python library, which is used by the benchmark script and might not be included in the base vLLM Docker image:
```bash
pip install datasets
```

2. Define environment variables for the benchmark configuration. If your `HF_TOKEN` is not already set in this specific shell session within the container, uncomment and set it:
```bash
export MAX_INPUT_LEN=1800
export MAX_OUTPUT_LEN=128
# export HF_TOKEN="<your HF token>" # Uncomment and replace if HF_TOKEN is not set in this session
```

3. Navigate to the vLLM workspace (the path might vary depending on the Docker image structure; `/workspace/vllm` is a common location) and run the benchmark script:
```bash
cd /workspace/vllm # Adjust if your vLLM benchmarks are in a different directory

python benchmarks/benchmark_serving.py \
--backend vllm \
--model "Qwen/Qwen2.5-32B" \
--dataset-name random \
--num-prompts 1000 \
--random-input-len ${MAX_INPUT_LEN} \
--random-output-len ${MAX_OUTPUT_LEN} \
--seed 100
# --random-range-ratio=${RATIO} # Optional, ensure RATIO is set if used (defined in Step 3.1)
# --random-prefix-len=${PREFIX_LEN} # Optional, ensure PREFIX_LEN is set if used (defined in Step 3.1)
```
The benchmark script will send requests to the server for some time. Upon completion, it will display results showing metrics like throughput (requests per second, tokens per second) and latency (time to first token, time per output token). The exact numbers will vary based on your vLLM version, model size, and specific Cloud TPU v6e instance configuration:
```text
============ Serving Benchmark Result ============
Successful requests: xxxxxxx
Benchmark duration (s): xxxxxxx
Total input tokens: xxxxxxx
Total generated tokens: xxxxxxx
Request throughput (req/s): xxxxxxx
Output token throughput (tok/s): xxxxxxx
Total Token throughput (tok/s): xxxxxxx
Successful requests: 1000
Benchmark duration (s): 123.45
Total input tokens: 1800000
Total generated tokens: 128000
Request throughput (req/s): 8.10
Output token throughput (tok/s): 1036.86
Total Token throughput (tok/s): 15616.04
---------------Time to First Token----------------
Mean TTFT (ms): xxxxxxx
Median TTFT (ms): xxxxxxx
P99 TTFT (ms): xxxxxxx
Mean TTFT (ms): 100.00
Median TTFT (ms): 99.00
P99 TTFT (ms): 110.00
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): xxxxxxx
Median TPOT (ms): xxxxxxx
P99 TPOT (ms): xxxxxxx
Mean TPOT (ms): 0.90
Median TPOT (ms): 0.89
P99 TPOT (ms): 1.00
---------------Inter-token Latency----------------
Mean ITL (ms): xxxxxxx
Median ITL (ms): xxxxxxx
P99 ITL (ms): xxxxxxx
Mean ITL (ms): 0.95
Median ITL (ms): 0.94
P99 ITL (ms): 1.05
==================================================
```
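
The token totals and throughput figures in a result like the one above can be sanity-checked against the benchmark settings. As a sketch, assuming every random prompt uses exactly the requested input and output lengths (which may not hold if `--random-range-ratio` varies them), the expected values follow from simple arithmetic. The `throughput` helper and the variable `NUM_PROMPTS` are illustrative names, not part of the benchmark script.

```bash
NUM_PROMPTS=1000
MAX_INPUT_LEN=1800
MAX_OUTPUT_LEN=128

# Expected token totals when every request uses the full requested lengths.
total_input=$((NUM_PROMPTS * MAX_INPUT_LEN))
total_output=$((NUM_PROMPTS * MAX_OUTPUT_LEN))
echo "expected input tokens:  ${total_input}"    # 1800000
echo "expected output tokens: ${total_output}"   # 128000

# Hypothetical helper: throughput = count / duration, to two decimal places.
throughput() {
  awk -v count="$1" -v duration="$2" 'BEGIN { printf "%.2f\n", count / duration }'
}

throughput "${NUM_PROMPTS}" 123.45     # requests per second for a 123.45 s run
throughput "${total_output}" 123.45    # output tokens per second
```

If the reported totals differ noticeably from these estimates, some requests probably failed or were generated with varying lengths.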