LLM Eval Suite

A pytest-based offline evaluation suite for vLLM-served models. Tests cover sanity, determinism, reasoning, tool calling, Arabic dialect, edge cases, output format, and regression.

Requirements

Python 3.10+
CUDA-capable GPU (tested on RTX 4090)
CUDA 12.6+

Installation

pip install -r requirements.txt

For quantized models (int4/GPTQ/AWQ) you may also need:
pip install auto-round auto-gptq

Configuration

Before running, open test_eval.py and update the three places marked with comments:

1. Model ID (top of file):

MODEL_ID = "Intel/Qwen3.5-9B-int4-AutoRound"  # ← change this

2. vLLM engine args (inside the llm fixture):

return LLM(
    model=MODEL_ID,
    quantization="auto",        # remove for full-precision models
    dtype="bfloat16",
    gpu_memory_utilization=0.90,
    max_model_len=32768,
    trust_remote_code=True,
    reasoning_parser="qwen3",   # Qwen3-specific — remove for other models
    enable_reasoning=True,       # Qwen3-specific — remove for other models
    enable_auto_tool_choice=True,
    tool_call_parser="qwen3_coder",  # Qwen3-specific — change or remove
)

3. Thinking-block stripping (inside generate()):

# Qwen3 uses <think>...</think> — adjust or remove for other models
out = re.sub(r'<think>.*?</think>', '', out, flags=re.DOTALL)
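
As a rough sketch, the stripping step could be wrapped in a small helper so it is easy to adjust per model. `strip_thinking` is a hypothetical name and is not defined in test_eval.py; the regex is the one shown above, and the trailing `.strip()` is an assumption about what the tests want.

```python
import re

# Hypothetical helper; test_eval.py performs this step inline inside generate().
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove every closed <think>...</think> block, then trim whitespace."""
    return THINK_BLOCK.sub("", text).strip()


raw = "<think>\n2 + 2 is 4\n</think>\n\nThe answer is 4."
print(strip_thinking(raw))
```

Note that a block which is opened but never closed (for example, when generation hits the token limit mid-thought) is left untouched by this pattern.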

Usage

# Run all 31 tests
pytest test_eval.py -v

# Run a specific category
pytest test_eval.py -v -k "sanity"
pytest test_eval.py -v -k "saudi"
pytest test_eval.py -v -k "tool or format"
pytest test_eval.py -v -k "reasoning"

# Stop on first failure
pytest test_eval.py -x

# Run with shortened tracebacks on failure
pytest test_eval.py -v --tb=short

Test Categories

Category	Tests	Description
`sanity`	4	Basic response, arithmetic, geography, latency
`deterministic`	2	Same output across runs at temperature=0
`reasoning`	3	Chain-of-thought, syllogism, boxed math
`tool_calling`	3	JSON tool dispatch, schema validation, RAG
`saudi_dialect`	11	Arabic dialect responses, sentiment, translation, summarisation, RAG, tool calling
`edge_case`	3	Empty input, malformed prompt, long context
`output_format`	3	JSON, Markdown table, Python function
`regression`	2	English sentiment, Spanish translation
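
To illustrate what a `deterministic` check amounts to, here is a minimal sketch. `all_identical` and `fake_generate` are hypothetical names used only for this example; in the real suite the callable would presumably wrap `generate(llm, ...)` with temperature=0.

```python
from typing import Callable, List


def all_identical(produce: Callable[[], str], runs: int = 3) -> bool:
    """Call `produce` `runs` times and report whether every output matched."""
    outputs: List[str] = [produce() for _ in range(runs)]
    return len(set(outputs)) == 1


# Stand-in for a greedy (temperature=0) model call; always returns the same text.
def fake_generate() -> str:
    return "Riyadh"


assert all_identical(fake_generate)
```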

Example Output

collected 31 items

test_eval.py::test_sanity_greeting            PASSED
test_eval.py::test_sanity_arithmetic          PASSED
test_eval.py::test_sanity_capital             PASSED
test_eval.py::test_sanity_latency             PASSED
test_eval.py::test_deterministic_repeat       PASSED
...
test_eval.py::test_regression_translation_es  PASSED

============================== 31 passed in 184.32s ==============================

Adding New Tests

All tests follow the same pattern — add a function prefixed with test_ anywhere in the file:

def test_my_new_case(llm):
    out = generate(llm, [{"role": "user", "content": "Your prompt here"}])
    assert "expected output" in out.lower()

The llm fixture is session-scoped so the model loads once regardless of how many tests you add.
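
When a new test needs to check structured output rather than a substring, something like the following could help. This is a sketch only: `extract_json` is a hypothetical helper, not part of test_eval.py, and it assumes the model emits at most one JSON object, possibly surrounded by extra prose.

```python
import json
from typing import Any


def extract_json(text: str) -> Any:
    """Parse the span from the first '{' to the last '}' in `text`.

    Raises ValueError if no such span exists or it is not valid JSON.
    """
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start : end + 1])


sample = 'Sure, here is the result: {"city": "Jeddah", "temp_c": 31} Hope that helps.'
data = extract_json(sample)
assert data["city"] == "Jeddah"
```

Inside a test, the string passed to `extract_json` would be the value returned by `generate(llm, ...)`.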

Environment Variables

Variable	Description
`HF_TOKEN`	HuggingFace token for gated models
`MODEL_PATH`	Local path to model directory (skips HF download)
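
One way the model location could be resolved from these variables is sketched below. How test_eval.py actually reads them is not shown in this README, so `resolve_model` and the fallback order are assumptions; only the variable name `MODEL_PATH` and the default model ID come from the text above.

```python
import os
from typing import Mapping, Optional

MODEL_ID = "Intel/Qwen3.5-9B-int4-AutoRound"  # default from the Configuration section


def resolve_model(env: Optional[Mapping[str, str]] = None) -> str:
    """Prefer MODEL_PATH when it is set and non-empty, else fall back to MODEL_ID."""
    env = os.environ if env is None else env
    return env.get("MODEL_PATH") or MODEL_ID


print(resolve_model({"MODEL_PATH": "/models/qwen"}))  # hypothetical local path
print(resolve_model({}))
```

Passing the environment as an argument keeps the helper easy to exercise without mutating `os.environ`.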
