test-llama

test-llama is a small CLI for running and comparing local GGUF models through llama.cpp and turboquant.

Backend repos:

The goal is practical comparison on your own hardware. We care about which backend feels better to use, how well it handles our prompts, and how it behaves at long context lengths. This is intentionally not a perplexity/KL-divergence project.

Install

cd test-llama.cpp
uv sync

If you just want to run it without a local install:

uv run test-llama <command>

Configure

test-llama looks for config in test_llama.yaml in the repo root first, then in ~/.config/test_llama.yaml. You can also pass a specific file with --config on any command.

Create a starter config with:

test-llama init

The important paths are:

model_dir: where your GGUF models live
llama_cpp_dir: path to your local llama.cpp checkout
turbo_dir: path to your local llama-cpp-turboquant checkout
cuda_root: CUDA install root used for builds

Example:

model_dir: ~/ollama
llama_cpp_dir: ~/Projects/llama.cpp
turbo_dir: ~/Projects/llama-cpp-turboquant
cuda_root: /opt/cuda-13.1

These values are used across the CLI:

list, ctxinfo, start, run, and batch use model_dir to discover local models
start, run, and batch use llama_cpp_dir or turbo_dir to launch the selected backend
build uses llama_cpp_dir, turbo_dir, and cuda_root
ctxinfo uses the family metadata in the config to estimate context limits

Run Models

The main workflow is:

pick a model
choose a backend
start the server
run the tests
stop the server

Examples:

# List models discovered in model_dir
test-llama list

# Show which models are currently running
test-llama running

# Show estimated max context for each model
test-llama ctxinfo

# Start a model with llama.cpp defaults
test-llama start qwen27b

# Start the same model on TurboQuant
test-llama start qwen27b --turbo3
test-llama start qwen27b --turbo4

# Start with a custom context size
test-llama start qwen27b --ctx 65536

# Stop a running server
test-llama stop 11435

# Start, test, and stop in one command
test-llama run qwen27b --turbo3

Backends

The two backends we compare are:

llama.cpp
turboquant

The whole point of the repo is to compare them on actual local inference behavior. That means the same model, the same prompt, and the same test harness, so we can see which backend is more useful in practice.

Benchmarks

There are three different ways to evaluate a model in this repo.

1. Task Tests

This is the main benchmark.

test-llama test 11435

This runs the prompt suite in tests/ against a running model and scores the responses. The current suite focuses on coding and debugging tasks, with the harness checking whether responses actually work.

This is the best tool for comparing llama.cpp vs turboquant, because it measures usefulness instead of just raw probability metrics.

You can also run it end-to-end with:

test-llama run qwen27b --turbo3

2. Context Stress Tests

test-llama context 11435 file1.txt,file2.txt

This is a long-context stress test. It builds prompts that contain a hidden token inside a large text block, then checks whether the model can still recover that token at different context sizes.

This mode is useful for answering questions like:

does the model stay coherent at long context?
does it start looping or repeating itself?
does it still find the important detail buried in the prompt?

3. Throughput Bench

test-llama bench 11435

This is a simple speed check for a running server. It measures tokens per second for a synthetic prompt and generation length.

This is useful when you care about throughput, but it is not the main quality benchmark.

Batch Runs

Batch mode is for comparing many model/backend combinations in one go.

test-llama batch batch_template.yaml

The batch runner:

starts each model
runs the task tests
saves JSON and CSV results in results/
stops the model before moving to the next item

Example batch config:

- model: qwen27b
- model: qwen27b
  turbo3: true
- model: Qwen3.6-35B-A3B
  ctk: turbo3
  ctv: turbo3
- model: Qwen3.6-35B-A3B-UD-Q6_K_XL
  fit: true
  no_offload_kv: true
  ctx: 8192

Test Suite

The prompts live in tests/.

code/ tests coding tasks
debugging/ tests bug-fixing tasks

The harness in test_llama/testing.py runs those prompts locally against a model server and evaluates the responses. For reproducibility, the test path uses deterministic settings like temperature=0 and a fixed seed.

Build

You can build either backend directly from the CLI:

test-llama build llama.cpp
test-llama build llama.cpp --cpu-only
test-llama build turboquant
test-llama build all --force

Useful Commands

test-llama tests
test-llama status 11435
test-llama perplexity qwen27b path/to/text.txt

perplexity is still available, but it is mainly a lower-level diagnostic. The core benchmark story in this repo is the task suite and the context stress suite.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.serena		.serena
.vscode		.vscode
test_llama		test_llama
tests		tests
.codex		.codex
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
batch_template.yaml		batch_template.yaml
batch_test copy.yaml		batch_test copy.yaml
batch_test.yaml		batch_test.yaml
pyproject.toml		pyproject.toml
test_llama.yaml		test_llama.yaml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

test-llama

Install

Configure

Run Models

Backends

Benchmarks

1. Task Tests

2. Context Stress Tests

3. Throughput Bench

Batch Runs

Test Suite

Build

Useful Commands

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

test-llama

Install

Configure

Run Models

Backends

Benchmarks

1. Task Tests

2. Context Stress Tests

3. Throughput Bench

Batch Runs

Test Suite

Build

Useful Commands

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages