Merged
4 changes: 4 additions & 0 deletions models/Qwen/Qwen3-Coder-480B-A35B-Instruct.yaml
@@ -14,6 +14,7 @@ meta:
mi300x: verified
mi325x: verified
mi355x: verified
ironwood: verified

model:
model_id: "Qwen/Qwen3-Coder-480B-A35B-Instruct"
@@ -92,6 +93,8 @@ guide: |

[Qwen3-Coder](https://github.com/QwenLM/Qwen3-Coder) is an advanced large language model created by the Qwen team. `Qwen3-Coder-480B-A35B-Instruct` is the flagship coder MoE with 480B total / 35B active parameters. vLLM supports it including tool calling; the guide below covers BF16 and FP8 serving on NVIDIA and AMD GPUs.

TPU support is provided through [vLLM TPU](https://github.com/vllm-project/tpu-inference) with a recipe for [Ironwood](https://github.com/AI-Hypercomputer/tpu-recipes/tree/main/inference/ironwood/vLLM/Qwen3-Coder-480B-A35B). The Ironwood docker command rendered by the hardware picker uses the `vllm/vllm-tpu` image; pin to the tag specified by the upstream recipe.

## Prerequisites

### CUDA
@@ -188,3 +191,4 @@ guide: |
- [FP8 checkpoint](https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8)
- [Qwen3-Coder GitHub](https://github.com/QwenLM/Qwen3-Coder)
- [EvalPlus](https://github.com/evalplus/evalplus)
- [TPU recipes: Ironwood](https://github.com/AI-Hypercomputer/tpu-recipes/tree/main/inference/ironwood/vLLM/Qwen3-Coder-480B-A35B)
18 changes: 18 additions & 0 deletions models/meta-llama/Llama-3.3-70B-Instruct.yaml
@@ -15,6 +15,7 @@ meta:
h200: verified
b200: verified
gb200: verified
trillium: verified

model:
model_id: "meta-llama/Llama-3.3-70B-Instruct"
@@ -92,13 +93,29 @@ guide: |
and Blackwell (B200/GB200) GPUs. FP4 is Blackwell-only and provides the best
VRAM efficiency.

TPU support is provided through [vLLM TPU](https://github.com/vllm-project/tpu-inference) with a recipe for [Trillium](https://github.com/AI-Hypercomputer/tpu-recipes/tree/main/inference/trillium/vLLM/Llama3.x).

## Prerequisites

- Hardware: 1x H100/H200 (FP8), 1x B200 (FP4), or 2x GPUs for BF16
- vLLM >= 0.12.0
- CUDA Driver >= 575
- Docker with NVIDIA Container Toolkit (recommended)

### Docker (Cloud TPU — Trillium)
TPU uses the separate `vllm/vllm-tpu` image (no pip wheel). Pull the tag specified by the upstream [Trillium recipe](https://github.com/AI-Hypercomputer/tpu-recipes/tree/main/inference/trillium/vLLM/Llama3.x), then run:
```bash
docker run -itd --name llama33-tpu \
--privileged --network host --shm-size 16G \
-v /dev/shm:/dev/shm -e HF_TOKEN=$HF_TOKEN \
vllm/vllm-tpu:latest \
--model meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 8 \
--max-model-len 16384 \
--host 0.0.0.0 --port 8000
```

Review comment (severity: medium) on the `vllm/vllm-tpu:latest \` line:

The documentation on line 106 correctly advises the reader to 'Pull the tag specified by the upstream Trillium recipe'. However, the example `docker run` command on line 111 uses the `vllm/vllm-tpu:latest` tag. Using `:latest` can lead to non-reproducible behavior and may break when the upstream image changes. To align with the documentation's recommendation and avoid potential issues for users, please consider using a placeholder to indicate that a specific tag from the recipe should be used.

    vllm/vllm-tpu:<tag-from-recipe> \

Trillium requires a 4-chip slice minimum.
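
The review's concern about `:latest` could also be checked mechanically before a recipe is published. The following is a minimal sketch, not part of this PR: the helper name `image_tag_is_pinned` and the tag `example-tag` are hypothetical, and the check only looks at the text of the image reference (it does not contact a registry or validate that the tag exists).

```python
def image_tag_is_pinned(image_ref: str) -> bool:
    """Return True when an image reference names a digest or an explicit non-`latest` tag.

    Hypothetical helper for illustration; it inspects the string only.
    """
    # A digest reference (name@sha256:...) always identifies one image.
    if "@" in image_ref:
        return True
    # Only the last path component can carry a tag; a colon earlier in the
    # reference belongs to a registry port such as localhost:5000/image.
    last_component = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_component:
        return False  # No tag: Docker falls back to `latest`.
    tag = last_component.rsplit(":", 1)[1]
    return tag not in ("", "latest")


if __name__ == "__main__":
    # `example-tag` is a made-up tag, not one taken from the Trillium recipe.
    for ref in ("vllm/vllm-tpu:latest", "vllm/vllm-tpu", "vllm/vllm-tpu:example-tag"):
        print(ref, image_tag_is_pinned(ref))
```

Note that a literal placeholder such as `<tag-from-recipe>` would pass this check, so it is only useful once the placeholder has been replaced with the tag from the upstream recipe.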

## Client Usage

```python
@@ -125,3 +142,4 @@ guide: |
- [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)
- [NVIDIA FP8 variant](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP8)
- [NVIDIA FP4 variant](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP4)
- [TPU recipes: Trillium](https://github.com/AI-Hypercomputer/tpu-recipes/tree/main/inference/trillium/vLLM/Llama3.x)