Run accuracy + perf workloads against vLLM, defined by small YAML recipes in workloads/.
Each recipe is one (model, hardware, set of tasks) combination. The Buildkite pipeline picks recipes up automatically — to ship a new run, you write a YAML file, push it, and trigger a build.
workloads/    one YAML per (model, hardware) recipe
lib/          orchestrator (run.sh), helpers, GPU profiles
.buildkite/   pipeline bootstrap and step generator
CLAUDE.md     agent conventions and detailed Buildkite workflow
- Copy an existing workload that targets the same GPU; workloads/qwen3_5_h200.yaml is a small, complete example.
- Name the file <model>_<hardware>.yaml. Keep hardware variants in separate files.
- Edit the fields to match your model and tasks. Set nightly: true if it should run in the nightly schedule; leave it off for opt-in recipes.
- Open a PR. The pipeline auto-discovers workloads/*.yaml, so no Buildkite YAML edits are needed.
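The auto-discovery step can be pictured as a plain glob over the workloads/ directory. The sketch below is illustrative only: the function name discover_workloads is hypothetical, and the real logic lives in the .buildkite/ step generator, which may differ.

```python
from pathlib import Path


def discover_workloads(root: str = "workloads") -> list[str]:
    """Return the stem of every *.yaml file directly under root, sorted.

    Hypothetical helper: it mirrors the "workloads/*.yaml" pattern
    described above, not the pipeline's actual implementation.
    """
    return sorted(p.stem for p in Path(root).glob("*.yaml"))


if __name__ == "__main__":
    # Prints the recipe stems found in ./workloads, e.g. qwen3_5_h200.
    for stem in discover_workloads():
        print(stem)
```

Because discovery is just a filename pattern, a new recipe is picked up as soon as the file is on the branch being built.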
A recipe has top-level metadata, a required vllm: block, and up to three eval blocks:
- vllm: how the server runs. Defines what model to serve and how (model, serve_args, optional image/env overrides). Required.
- lm_eval: what accuracy to measure. Lists lm-evaluation-harness tasks to run against the live server (e.g. gsm8k, aime25). Each task's score is saved under results/<name>/<task-name>/. Optional.
- vllm_bench: what perf to measure. Lists vllm bench serve configs (input/output lengths, concurrency, dataset). Raw JSON is saved and ingested into the perf dashboard. Optional.
- bfcl: function-calling eval. Runs BFCL test categories against the live server. Some models need --enable-auto-tool-choice and --tool-call-parser in serve_args. Results are transformed to lm_eval format and ingested as bfcl_<category> tasks. Optional.
Include one or more of lm_eval: / vllm_bench: / bfcl: depending on what you want out of this recipe.
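As a rough illustration of that shape rule (a required vllm: block plus at least one eval block), here is a minimal check over an already-parsed recipe. The function name check_recipe_blocks and the error messages are made up for this sketch; lib/parse_workload.py remains the source of truth for the real validation.

```python
# The three optional eval blocks named above.
EVAL_BLOCKS = ("lm_eval", "vllm_bench", "bfcl")


def check_recipe_blocks(recipe: dict) -> list[str]:
    """Return the eval blocks present in a parsed recipe.

    Raises ValueError if the required vllm block is missing or if no
    eval block is present. Hypothetical helper, for illustration only.
    """
    if "vllm" not in recipe:
        raise ValueError("recipe is missing the required vllm: block")
    present = [block for block in EVAL_BLOCKS if block in recipe]
    if not present:
        raise ValueError("recipe needs at least one of: " + ", ".join(EVAL_BLOCKS))
    return present


recipe = {
    "name": "qwen3_5-h200",
    "vllm": {"model": "Qwen/Qwen3.5-397B-A17B-FP8"},
    "lm_eval": {"tasks": [{"name": "gsm8k", "num_fewshot": 5}]},
}
print(check_recipe_blocks(recipe))  # ['lm_eval']
```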
name: qwen3_5-h200          # used in container name and results/<name>/
gpu: H200                   # picks queue/image/HF cache from lib/gpu_profiles.yaml
num_gpus: 8
nightly: true               # include in the nightly schedule (default: false)

vllm:                       # how the server is brought up
  model: Qwen/Qwen3.5-397B-A17B-FP8
  image: vllm/vllm-openai:nightly   # optional; falls back to VLLM_IMAGE / VLLM_COMMIT / latest
  env:                      # optional; merged over the GPU profile's env
    SOME_VAR: value
  serve_args: >-            # appended to `vllm serve <model>`; word-split
    -dp 8 --enable-expert-parallel
    --trust-remote-code

lm_eval:                    # accuracy tasks (optional)
  model_args:               # workload-level defaults, merged into every task
    tokenized_requests: false
    timeout: 6000
  tasks:
    - name: gsm8k           # must match an lm-eval task name
      num_fewshot: 5
      model_args:           # per-task overrides (merged on top of workload defaults)
        num_concurrent: 1024
        max_length: 40960
    - name: aime25
      num_fewshot: 0

bfcl:                       # function-calling eval (optional)
  test_categories:          # BFCL test categories to run
    - simple_python
    - multiple
    - parallel
  num_threads: 8            # optional, default 8
  temperature: 0.001        # optional, default 0.001

vllm_bench:                 # perf runs (optional); fed to the perf dashboard
  configs:
    - name: 1k-in-1k-out-conc-256
      dataset: random       # or speed_bench
      input_len: 1024
      output_len: 1024
      num_prompts: 500
      max_concurrency: 256

A few things worth knowing:
- gpu must match a key in lib/gpu_profiles.yaml. The profile sets the Buildkite queue, default image, HF cache path, and baseline env vars.
- nightly controls only the nightly schedule. Recipes with nightly: false (or omitted) are still triggerable explicitly via the WORKLOADS env var.
- lm_eval.tasks is a list because each entry runs as a separate lm_eval invocation: --num_fewshot is a single global flag, so different shot counts need separate runs. Each task's results land in results/<name>/<task-name>/.
- vllm_bench runs before lm_eval when both blocks are present, so perf-pipeline bugs surface quickly instead of waiting on a full lm-eval pass.
- bfcl may need tool-call serve args. Some models require --enable-auto-tool-choice and --tool-call-parser for function-calling; the parser warns if --tool-call-parser is absent. Each category runs as a separate generate + evaluate pass; scores appear on the eval dashboard as bfcl_<category> tasks.
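The model_args layering in the example recipe (workload-level defaults, with per-task overrides merged on top) can be sketched as a two-step dictionary merge, one result per task. This is an assumption about the merge semantics based on the recipe comments; the helper name merged_model_args is hypothetical and the real behavior is defined in lib/parse_workload.py.

```python
def merged_model_args(lm_eval_block: dict, task: dict) -> dict:
    """Combine workload-level model_args with one task's overrides.

    Per-task keys win on conflict. The workload-level dict is copied
    first, so the shared defaults are never mutated.
    """
    merged = dict(lm_eval_block.get("model_args", {}))
    merged.update(task.get("model_args", {}))
    return merged


# Mirrors the lm_eval block of the example recipe.
lm_eval_block = {
    "model_args": {"tokenized_requests": False, "timeout": 6000},
    "tasks": [
        {
            "name": "gsm8k",
            "num_fewshot": 5,
            "model_args": {"num_concurrent": 1024, "max_length": 40960},
        },
        {"name": "aime25", "num_fewshot": 0},
    ],
}

# One entry per task, matching the one-invocation-per-task rule above.
for task in lm_eval_block["tasks"]:
    print(task["name"], task["num_fewshot"], merged_model_args(lm_eval_block, task))
```

Here gsm8k ends up with all four keys, while aime25 sees only the two workload-level defaults.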
For everything else (the full set of supported fields, defaults, validation rules), the existing files in workloads/ are the working reference and lib/parse_workload.py is the source of truth.
The pipeline is vllm/perf-eval. With no extra config, a build runs every workload that has nightly: true.
From the UI: open the pipeline → New Build → pick branch and commit (must be pushed to GitHub) → optionally fill Environment Variables to scope the run → Create Build.
Required env vars — both must be set on every build:
- VLLM_COMMIT: vLLM commit SHA being tested. Used to tag results and track which vLLM version produced them.
- VLLM_IMAGE: full Docker image URI (e.g. vllm/vllm-openai:nightly-abc1234). This is the image that gets pulled and run.
Optional env vars:
- WORKLOADS: comma- or newline-separated list of workload paths or stems. Runs exactly those instead of the default nightly: true set.
- NIGHTLY: set to 1 to tag every ingested row with nightly: true. The dashboard's /nightly view filters on this to pair adjacent nightly builds; only the scheduled nightly cron should set it.
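To make the WORKLOADS format concrete, the sketch below splits a value on commas or newlines, normalizes paths and stems to bare stems, and falls back to the nightly: true set when the variable is empty. Both function names are hypothetical, and the exact normalization the pipeline applies is an assumption here.

```python
import re
from pathlib import PurePosixPath
from typing import Optional


def parse_workloads(value: str) -> list[str]:
    """Split on commas or newlines, drop blanks, and reduce each entry to a stem.

    Accepts either a path (workloads/qwen3_5_h200.yaml) or a bare stem
    (qwen3_5_h200). Only a trailing ".yaml" is stripped, so dots inside
    a stem are preserved.
    """
    entries = (entry.strip() for entry in re.split(r"[,\n]", value))
    return [PurePosixPath(e).name.removesuffix(".yaml") for e in entries if e]


def select_workloads(recipes: dict[str, dict], workloads_env: Optional[str]) -> list[str]:
    """Pick the stems to run: the WORKLOADS entries if set, else nightly recipes."""
    if workloads_env and workloads_env.strip():
        return parse_workloads(workloads_env)
    return [stem for stem, recipe in recipes.items() if recipe.get("nightly", False)]


# Hypothetical parsed recipes keyed by file stem.
recipes = {"qwen3_5_h200": {"nightly": True}, "opt_in_h200": {}}
print(select_workloads(recipes, None))                          # ['qwen3_5_h200']
print(select_workloads(recipes, "workloads/opt_in_h200.yaml"))  # ['opt_in_h200']
```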
Example — trigger a build from the Buildkite UI:
- Open the vllm/perf-eval pipeline → New Build.
- Pick the branch and commit (must already be pushed to GitHub).
- Set the environment variables:
  VLLM_COMMIT=abc1234def5678
  VLLM_IMAGE=vllm/vllm-openai:nightly-abc1234def5678
  WORKLOADS=qwen3_5_h200
- Click Create Build.
This runs the qwen3_5_h200 workload against the specified vLLM nightly image. Omit WORKLOADS to run all nightly: true workloads.
From an agent: see CLAUDE.md for the Buildkite MCP workflow (don't shell out to curl or bk).
A real run needs a GPU host with Docker, vLLM, and lm-eval available:
./lib/run.sh workloads/qwen3_5_h200.yaml

Locally, you can smoke-test recipe changes without a GPU; see CLAUDE.md for the parser stub and shell-syntax checks.
CLAUDE.md has conventions for AI agents working in this repo: smoke-testing changes, launching Buildkite builds for a chosen branch/commit, and the AI-assistance disclosure rule for PRs and commits.