This project runs benchmark jobs in a RunPod serverless worker.
Request and response examples live in `docs/requests.md`. For local benchmarking against OpenRouter instead of RunPod, use `run_openrouter_benchmarks.py`.
Each job request provides:
- `model_id`: the model under test
- `benchmarks`: a list of benchmark ids that map to `benchmarks/<id>.jsonl`
- `judge_model_id`: optional judge model, default `Qwen/Qwen3.5-9B`
- `batch_size`: optional prompt batch size per GPU pass, default `5`
- `generation_config`: optional `temperature`, `top_p`, and `max_tokens`
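The defaults above might be applied to an incoming request along these lines. This is a minimal sketch: the helper name `apply_defaults` is an assumption for illustration, not the worker's actual code.

```python
def apply_defaults(request: dict) -> dict:
    """Fill in the documented defaults for optional job fields.

    Hypothetical helper; only the default values themselves come from
    this README.
    """
    job = dict(request)
    job.setdefault("judge_model_id", "Qwen/Qwen3.5-9B")
    job.setdefault("batch_size", 5)
    # temperature / top_p / max_tokens stay optional inside generation_config
    job.setdefault("generation_config", {})
    return job
```

Required fields (`model_id`, `benchmarks`) are passed through untouched.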
Execution is sequential and GPU-safe for a single endpoint:
- Load the target model.
- Run the current benchmark in small prompt batches, defaulting to 5 prompts per vLLM call.
- Unload the target model if the judge model is different.
- Load the judge model and score that benchmark in the same fixed-size batches.
- Write results and move to the next benchmark.
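The steps above can be sketched as a driver loop. Only the batching helper is concrete; `load_model`, `load_cases`, `run_batch`, `score_batch`, `unload`, and `write_results` are placeholder names for illustration, not real functions in this repo.

```python
from itertools import islice

def batched(items, size=5):
    """Yield fixed-size batches; the last batch may be smaller."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

def run_job(model_id, benchmarks, judge_model_id, batch_size=5):
    """Hypothetical driver showing the sequential, GPU-safe flow."""
    for bench in benchmarks:
        target = load_model(model_id)            # 1. load the target model
        outputs = []
        for batch in batched(load_cases(bench), batch_size):
            outputs += run_batch(target, batch)  # 2. small vLLM batches
        if judge_model_id != model_id:
            unload(target)                       # 3. free the GPU for the judge
        judge = load_model(judge_model_id)
        scores = []
        for batch in batched(outputs, batch_size):
            scores += score_batch(judge, batch)  # 4. judge in same-size batches
        unload(judge)
        write_results(bench, scores)             # 5. write, then next benchmark
```

Keeping at most one model loaded at a time is what makes this safe on a single GPU endpoint.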
Artifacts are written under `RUNPOD_VOLUME_ROOT/jobs/<job_id>/`.
Current benchmark ids in this repo:
- `gsm8k`
- `aime_2026`
- `mmlu_pro`
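Each id resolves to `benchmarks/<id>.jsonl`. A loading sketch, assuming the per-line schema described later in this README (the helper itself is hypothetical):

```python
import json
from pathlib import Path

def load_cases(benchmark_id: str, root: str = "benchmarks"):
    """Read benchmarks/<id>.jsonl; every non-empty line must be a JSON
    object with at least "id", "input", and "reference" keys."""
    path = Path(root) / f"{benchmark_id}.jsonl"
    cases = []
    for lineno, line in enumerate(path.read_text().splitlines(), 1):
        if not line.strip():
            continue
        case = json.loads(line)
        missing = {"id", "input", "reference"} - case.keys()
        if missing:
            raise ValueError(f"{path}:{lineno} missing keys {sorted(missing)}")
        cases.append(case)
    return cases
```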
To refresh the benchmark JSONL files from Hugging Face:

```bash
python3 -m pip install datasets
python3 scripts/convert_hf_benchmarks.py
```

Each line in `benchmarks/<id>.jsonl` must be valid JSON with:

```json
{
  "id": "case-1",
  "input": "What is 2 + 2?",
  "reference": "4",
  "metadata": {
    "topic": "arithmetic"
  }
}
```

To set up a local environment, run the tests, and start the handler:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
PYTHONPATH=src pytest
python handler.py
```

To benchmark against OpenRouter instead, set `OPENROUTER_API_KEY` and run:
```bash
PYTHONPATH=src python3 scripts/run_openrouter_benchmarks.py \
  --model-id qwen/qwen3.5-2b \
  --judge-model-id qwen/qwen3.5-9b \
  --benchmarks aime_2026 \
  --batch-size 5
```

This writes artifacts under `artifacts/openrouter/jobs/<job_id>/`.
An example request body:

```json
{
  "input": {
    "model_id": "meta-llama/Llama-3.1-8B-Instruct",
    "benchmarks": ["gsm8k"],
    "judge_model_id": "Qwen/Qwen3.5-9B",
    "batch_size": 5,
    "generation_config": {
      "temperature": 0,
      "top_p": 1,
      "max_tokens": 256
    }
  }
}
```

Judge output tokens are currently capped separately at 16384.
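A sketch of how that separate cap might be applied when building the judge's sampling parameters. The 16384 figure comes from this README; the function, and the assumption that the judge reuses the request's other sampling settings, are illustrative only.

```python
JUDGE_MAX_TOKENS = 16384  # judge output cap noted above

def judge_generation_config(request_gen_config: dict) -> dict:
    """Build the judge's generation settings.

    Assumption for illustration: the judge reuses temperature/top_p from
    the request but always gets its own, separate max_tokens cap.
    """
    cfg = dict(request_gen_config)
    cfg["max_tokens"] = JUDGE_MAX_TOKENS
    return cfg
```

The target model's generation uses the request's `generation_config` unchanged.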