TeichAI/serverless-benchmark

Repository files navigation

RunPod Benchmarks

This project runs benchmark jobs in a RunPod serverless worker.

Request and response examples live in docs/requests.md. For local benchmarking against OpenRouter instead of RunPod, use run_openrouter_benchmarks.py.

Each job request provides:

  • model_id: the model under test
  • benchmarks: a list of benchmark ids, each mapping to benchmarks/<id>.jsonl
  • judge_model_id: optional judge model; defaults to Qwen/Qwen3.5-9B
  • batch_size: optional number of prompts per GPU pass; defaults to 5
  • generation_config: optional temperature, top_p, and max_tokens overrides
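
The handler's actual parsing code is not shown here, but applying the defaults above might look like the following sketch (the function name and structure are illustrative, not the repo's real code):

```python
# Hypothetical sketch of job-input normalization; the real handler may differ.
DEFAULT_JUDGE_MODEL_ID = "Qwen/Qwen3.5-9B"
DEFAULT_BATCH_SIZE = 5

def normalize_job_input(payload: dict) -> dict:
    """Fill in the documented defaults for an incoming job request."""
    if "model_id" not in payload or "benchmarks" not in payload:
        raise ValueError("model_id and benchmarks are required")
    return {
        "model_id": payload["model_id"],
        "benchmarks": list(payload["benchmarks"]),
        "judge_model_id": payload.get("judge_model_id", DEFAULT_JUDGE_MODEL_ID),
        "batch_size": payload.get("batch_size", DEFAULT_BATCH_SIZE),
        "generation_config": payload.get("generation_config", {}),
    }
```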

Execution is sequential and GPU-safe for a single endpoint:

  1. Load the target model.
  2. Run the current benchmark in small prompt batches, defaulting to 5 prompts per vLLM call.
  3. Unload the target model if the judge model is different.
  4. Load the judge model and score that benchmark in the same fixed-size batches.
  5. Write results and move to the next benchmark.
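
The fixed-size batching in steps 2 and 4 amounts to slicing the prompt list into chunks of batch_size, one chunk per vLLM call. A minimal sketch (the helper name is illustrative):

```python
from typing import Iterator

def batched(prompts: list[str], batch_size: int = 5) -> Iterator[list[str]]:
    """Yield successive fixed-size batches of prompts, one per vLLM call.

    The final batch may be shorter when len(prompts) is not a
    multiple of batch_size.
    """
    for start in range(0, len(prompts), batch_size):
        yield prompts[start:start + batch_size]
```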

Artifacts are written under RUNPOD_VOLUME_ROOT/jobs/<job_id>/.

Benchmarks

Current benchmark ids in this repo:

  • gsm8k
  • aime_2026
  • mmlu_pro

To refresh the benchmark JSONL files from Hugging Face:

python3 -m pip install datasets
python3 scripts/convert_hf_benchmarks.py

Benchmark format

Each line in benchmarks/<id>.jsonl must be valid JSON with:

{
  "id": "case-1",
  "input": "What is 2 + 2?",
  "reference": "4",
  "metadata": {
    "topic": "arithmetic"
  }
}
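
A quick way to sanity-check a benchmark file against this schema is to parse each line and verify the four documented fields; the snippet below is an illustrative helper, not part of the repo:

```python
import json

# Fields required by the documented benchmark line format.
REQUIRED_KEYS = {"id", "input", "reference", "metadata"}

def validate_benchmark_line(line: str) -> dict:
    """Parse one JSONL line and check the documented required fields."""
    case = json.loads(line)  # raises json.JSONDecodeError on invalid JSON
    missing = REQUIRED_KEYS - case.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return case
```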

Local run

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
PYTHONPATH=src pytest
python handler.py

Local OpenRouter Run

Set OPENROUTER_API_KEY and run:

PYTHONPATH=src python3 scripts/run_openrouter_benchmarks.py \
  --model-id qwen/qwen3.5-2b \
  --judge-model-id qwen/qwen3.5-9b \
  --benchmarks aime_2026 \
  --batch-size 5

This writes artifacts under artifacts/openrouter/jobs/<job_id>/.

Request payload

{
  "input": {
    "model_id": "meta-llama/Llama-3.1-8B-Instruct",
    "benchmarks": ["gsm8k"],
    "judge_model_id": "Qwen/Qwen3.5-9B",
    "batch_size": 5,
    "generation_config": {
      "temperature": 0,
      "top_p": 1,
      "max_tokens": 256
    }
  }
}

Judge output is capped separately at 16384 tokens, independent of the request's max_tokens.
