LiteLLM Benchmarking System

Purpose

This project provides a local-first benchmarking system for comparing provider, model, harness, and harness-configuration performance through a shared LiteLLM proxy.

The system is built for interactive terminal agents and IDE agents that can be pointed at a custom inference base URL. The benchmark application does not own the harness runtime. It owns session registration, correlation, collection, normalization, storage, reporting, and dashboards.

What the system answers

The completed system should make it easy to answer questions such as:

- Which provider and model combination is fastest for the same task card and harness?
- How does Claude Code compare with Codex, OpenCode, OpenHands, Gemini-oriented clients, or other agent harnesses when routed through the same local proxy?
- Does a harness configuration change improve TTFT, total latency, output throughput, error rate, or cache behavior?
- Does a provider-specific routing change improve session-level performance?
- How much variance exists between repeated sessions of the same benchmark variant?
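
The variance question can be made concrete with a small calculation. The following is a minimal sketch, assuming each session has already been rolled up into a single median TTFT value; the `SessionSummary` type and its field names are hypothetical and are not part of the project's data model.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionSummary:
    """Hypothetical per-session rollup; field names are illustrative only."""
    session_id: str
    variant: str
    median_ttft_ms: float


def variant_spread(sessions: list[SessionSummary], variant: str) -> dict[str, float]:
    """Summarize run-to-run spread of median TTFT for one benchmark variant."""
    values = [s.median_ttft_ms for s in sessions if s.variant == variant]
    if len(values) < 2:
        raise ValueError("need at least two sessions to estimate variance")
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation (n - 1)
    # The coefficient of variation makes spread comparable across variants
    # whose absolute latencies differ.
    return {"n": len(values), "mean_ms": mean, "stdev_ms": stdev, "cv": stdev / mean}


sessions = [
    SessionSummary("s1", "variant-a", 100.0),
    SessionSummary("s2", "variant-a", 110.0),
    SessionSummary("s3", "variant-a", 120.0),
    SessionSummary("s4", "variant-b", 300.0),
]
spread = variant_spread(sessions, "variant-a")
```

A high coefficient of variation for a variant suggests that differences between variants should be read with caution until more sessions are collected.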

Recommended local stack

Use Docker Compose for infrastructure and uv for the benchmark application.

Infrastructure services:

- LiteLLM proxy
- PostgreSQL
- Prometheus
- Grafana

Benchmark application capabilities:

- config loading and validation
- experiment, variant, and session registry
- session credential issuance
- harness env rendering
- LiteLLM request collection and normalization
- Prometheus metric collection and rollups
- query API and exports
- dashboards and reports
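
To illustrate the request normalization capability, here is a minimal sketch, assuming a raw telemetry record arrives as a dictionary. The key names (`ttft_ms`, `total_latency_ms`, `status`, and so on) and the `NormalizedRequest` type are assumptions made for this example; they are not LiteLLM's log schema or the project's actual contract.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class NormalizedRequest:
    """Hypothetical canonical request row; not the project's actual schema."""
    session_id: str
    provider: str
    model: str
    ttft_ms: Optional[float]
    total_latency_ms: float
    output_tokens: int
    output_tokens_per_s: Optional[float]
    error: bool


def normalize_request(raw: dict[str, Any]) -> NormalizedRequest:
    """Map a raw telemetry record (assumed key names) onto the canonical shape."""
    total_ms = float(raw["total_latency_ms"])
    output_tokens = int(raw.get("output_tokens") or 0)
    ttft_ms = float(raw["ttft_ms"]) if raw.get("ttft_ms") is not None else None

    # Measure output throughput over the generation window (after the first
    # token) when TTFT is known, otherwise over the whole request.
    window_ms = total_ms - ttft_ms if ttft_ms is not None else total_ms
    tokens_per_s = (
        output_tokens / (window_ms / 1000.0)
        if window_ms > 0 and output_tokens > 0
        else None
    )

    # Prompt and response text are deliberately not copied, mirroring the
    # content-off default described in this README.
    return NormalizedRequest(
        session_id=str(raw["session_id"]),
        provider=str(raw["provider"]),
        model=str(raw["model"]),
        ttft_ms=ttft_ms,
        total_latency_ms=total_ms,
        output_tokens=output_tokens,
        output_tokens_per_s=tokens_per_s,
        error=raw.get("status") != "success",
    )


row = normalize_request({
    "session_id": "s1", "provider": "example-provider", "model": "example-model",
    "ttft_ms": 200, "total_latency_ms": 1200, "output_tokens": 50,
    "status": "success", "prompt": "not stored",
})
```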

Core design choices

- LiteLLM is the single shared proxy and routing layer.
- Every interactive benchmark session gets a benchmark-owned session ID.
- Session correlation is built around a session-scoped proxy credential plus benchmark tags.
- The project stores canonical benchmark records in a project-owned database.
- LiteLLM and Prometheus are telemetry sources, not the canonical query model.
- Capture of prompt and response content is disabled by default.
- The benchmark application stays harness-agnostic in its core path.
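
The session ID and tag choices above can be sketched as follows. This is a minimal illustration, assuming a UUID is an acceptable session ID and that tags are plain `key:value` strings; the `BenchmarkSession` type, the `create_session` helper, and the tag format are hypothetical. How tags and the session-scoped credential are attached to proxy requests is left out of scope here.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class BenchmarkSession:
    """Hypothetical registry record owned by the benchmark application."""
    session_id: str
    experiment: str
    variant: str
    task_card: str
    harness: str
    created_at: datetime

    def tags(self) -> list[str]:
        """Benchmark tags used to correlate proxy traffic with this session."""
        return [
            f"session:{self.session_id}",
            f"experiment:{self.experiment}",
            f"variant:{self.variant}",
            f"task_card:{self.task_card}",
            f"harness:{self.harness}",
        ]


def create_session(experiment: str, variant: str, task_card: str, harness: str) -> BenchmarkSession:
    """Register a session under an ID generated by the benchmark app, not the harness."""
    return BenchmarkSession(
        session_id=str(uuid.uuid4()),
        experiment=experiment,
        variant=variant,
        task_card=task_card,
        harness=harness,
        created_at=datetime.now(timezone.utc),
    )


first = create_session("exp-1", "variant-a", "card-1", "example-harness")
second = create_session("exp-1", "variant-a", "card-1", "example-harness")
```

Because the ID is generated by the benchmark application, two sessions of the same variant and task card remain distinguishable even when the harness itself exposes no session concept.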

Primary workflow

- Define providers, harness profiles, variants, experiments, and task cards in versioned config files.
- Create a benchmark session for a chosen variant and task card.
- The session manager issues a session-scoped proxy credential and renders the exact environment snippet for the selected harness.
- Launch the harness manually and use it interactively against the local LiteLLM proxy.
- LiteLLM emits request data and Prometheus metrics while the benchmark app captures benchmark metadata.
- Collectors normalize request- and session-level data into the project database.
- Reports and dashboards compare sessions, variants, providers, models, and harnesses.
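
The environment snippet step in the workflow above could look like the following. This is a minimal sketch, assuming a harness profile declares which environment variable carries the base URL and which carries the API key. The `render_env` helper, the example variable names, the proxy address, and the key value are all assumptions for illustration; the actual names depend on each harness.

```python
import shlex


def render_env(base_url: str, api_key: str, var_names: dict[str, str]) -> str:
    """Render shell export lines for one harness.

    var_names maps a role ("base_url" or "api_key") to the environment
    variable that the harness reads for that role.
    """
    values = {"base_url": base_url, "api_key": api_key}
    lines = []
    for role, env_name in var_names.items():
        if role not in values:
            raise KeyError(f"unknown role in harness profile: {role}")
        # shlex.quote keeps the snippet safe to paste into a POSIX shell.
        lines.append(f"export {env_name}={shlex.quote(values[role])}")
    return "\n".join(lines)


snippet = render_env(
    base_url="http://localhost:4000",   # assumed local proxy address
    api_key="sk-session-example",       # placeholder session-scoped credential
    var_names={"base_url": "OPENAI_BASE_URL", "api_key": "OPENAI_API_KEY"},
)
print(snippet)
```

Keeping the variable names in the harness profile rather than in code is one way to keep the core path harness-agnostic: adding a harness becomes a config change.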

Repository layout

.
├── AGENTS.md
├── README.md
├── pyproject.toml
├── Makefile
├── docker-compose.yml
├── .env.example
├── configs/
│   ├── litellm/
│   ├── prometheus/
│   ├── grafana/
│   ├── providers/
│   ├── harnesses/
│   ├── variants/
│   ├── experiments/
│   └── task-cards/
├── dashboards/
├── docs/
│   ├── architecture.md
│   ├── benchmark-methodology.md
│   ├── config-and-contracts.md
│   ├── data-model-and-observability.md
│   ├── implementation-plan.md
│   ├── references.md
│   └── security-and-operations.md
├── skills/
│   └── convert-tasks-to-linear/
│       └── SKILL.md
├── src/
│   ├── benchmark_core/
│   ├── cli/
│   ├── collectors/
│   ├── reporting/
│   └── api/
└── tests/

Documentation map

AGENTS.md
- persistent project context for coding agents
- architectural invariants
- delivery and testing rules
docs/architecture.md
- system components
- data flow
- deployment boundaries
docs/benchmark-methodology.md
- how to run comparable interactive benchmark sessions
- metric definitions and confounder controls
docs/config-and-contracts.md
- config schemas
- session and CLI contracts
- normalization contracts
docs/data-model-and-observability.md
- canonical entities
- storage model
- derived metrics
docs/security-and-operations.md
- local security posture
- redaction, retention, and secrets
- operator safeguards
docs/implementation-plan.md
- parent issues and sub-issues
- Definition of Ready information
- acceptance criteria and test plans
docs/references.md
- external references that shaped the design
skills/convert-tasks-to-linear/SKILL.md
- reusable instructions for converting a markdown implementation plan into Linear parent issues and sub-issues

MVP success criteria

The MVP is complete when a developer can:

- start LiteLLM, Postgres, Prometheus, and Grafana locally with one command
- validate provider, harness profile, variant, experiment, and task-card configs
- create a session for a specific benchmark variant
- receive a session-specific environment snippet for a chosen harness
- run the harness interactively against the proxy
- collect and normalize request- and session-level data into the benchmark database
- view live metrics in Grafana and historical comparisons in the benchmark app
- export structured comparison results for providers, models, harnesses, and harness configurations
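
As an illustration of the last criterion, a structured comparison export might be produced along these lines. This is a minimal sketch, assuming normalized request rows are available as dictionaries with `variant`, `total_latency_ms`, and `error` keys; these key names, the `compare_variants` helper, and the JSON shape are hypothetical and do not describe the project's export format.

```python
import json
import statistics
from collections import defaultdict
from typing import Any


def compare_variants(rows: list[dict[str, Any]]) -> str:
    """Group normalized request rows by variant and emit a JSON comparison."""
    grouped: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for row in rows:
        grouped[row["variant"]].append(row)

    summary = []
    for variant in sorted(grouped):  # stable ordering for diff-friendly exports
        items = grouped[variant]
        summary.append({
            "variant": variant,
            "requests": len(items),
            "median_latency_ms": statistics.median(r["total_latency_ms"] for r in items),
            "error_rate": sum(1 for r in items if r["error"]) / len(items),
        })
    return json.dumps(summary, indent=2)


rows = [
    {"variant": "a", "total_latency_ms": 100.0, "error": False},
    {"variant": "a", "total_latency_ms": 300.0, "error": True},
    {"variant": "b", "total_latency_ms": 200.0, "error": False},
]
exported = compare_variants(rows)
```

The same grouping could be keyed by provider, model, or harness instead of variant.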
