A pytest-based offline evaluation suite for vLLM-served models. Tests cover sanity, determinism, reasoning, tool calling, Arabic dialect, edge cases, output format, and regression.
- Python 3.10+
- CUDA-capable GPU (tested on RTX 4090)
- CUDA 12.6+
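A quick way to check the first two requirements before installing anything is a small script like the one below. This is a minimal sketch, not part of the suite: `check_prerequisites` is a hypothetical helper, and finding `nvidia-smi` on `PATH` is only a rough proxy for a working CUDA setup (it does not verify the CUDA version).

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10)):
    """Return a list of problems found; an empty list means the basic checks passed."""
    problems = []
    # Compare only (major, minor) against the required minimum.
    if sys.version_info[:2] < tuple(min_python):
        found = ".".join(str(part) for part in sys.version_info[:2])
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ required, found {found}")
    # Assumption: an NVIDIA driver install puts nvidia-smi on PATH.
    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi not found on PATH; the NVIDIA driver may be missing")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```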
```bash
pip install -r requirements.txt
```

For quantized models (int4/GPTQ/AWQ) you may also need:

```bash
pip install autoround auto-gptq
```
Before running, open `test_eval.py` and update the three places marked with comments:

1. Model ID (top of file):

   ```python
   MODEL_ID = "Intel/Qwen3.5-9B-int4-AutoRound"  # ← change this
   ```

2. vLLM engine args (inside the `llm` fixture):

   ```python
   return LLM(
       model=MODEL_ID,
       quantization="auto",  # remove for full-precision models
       dtype="bfloat16",
       gpu_memory_utilization=0.90,
       max_model_len=32768,
       trust_remote_code=True,
       reasoning_parser="qwen3",  # Qwen3-specific: remove for other models
       enable_reasoning=True,  # Qwen3-specific: remove for other models
       enable_auto_tool_choice=True,
       tool_call_parser="qwen3_coder",  # Qwen3-specific: change or remove
   )
   ```

3. Thinking-block stripping (inside `generate()`):

   ```python
   # Qwen3 uses <think>...</think>; adjust or remove for other models
   out = re.sub(r'<think>.*?</think>', '', out, flags=re.DOTALL)
   ```

```bash
# Run all 31 tests
pytest test_eval.py -v

# Run a specific category
pytest test_eval.py -v -k "sanity"
pytest test_eval.py -v -k "saudi"
pytest test_eval.py -v -k "tool or format"
pytest test_eval.py -v -k "reasoning"

# Stop on first failure
pytest test_eval.py -x

# Run with a summary report
pytest test_eval.py -v --tb=short
```

| Category | Tests | Description |
|---|---|---|
| `sanity` | 4 | Basic response, arithmetic, geography, latency |
| `deterministic` | 2 | Same output across runs at temperature=0 |
| `reasoning` | 3 | Chain-of-thought, syllogism, boxed math |
| `tool_calling` | 3 | JSON tool dispatch, schema validation, RAG |
| `saudi_dialect` | 11 | Arabic dialect responses, sentiment, translation, summarisation, RAG, tool calling |
| `edge_case` | 3 | Empty input, malformed prompt, long context |
| `output_format` | 3 | JSON, Markdown table, Python function |
| `regression` | 2 | English sentiment, Spanish translation |
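The `deterministic` category checks that repeated runs at temperature=0 return the same text. The suite's own implementation is not shown here; the sketch below is one way such a check could be written. `assert_deterministic` is a hypothetical helper, and it takes a zero-argument callable so it can be tried with a stub instead of a loaded model.

```python
from typing import Callable


def assert_deterministic(generate_once: Callable[[], str], runs: int = 3) -> str:
    """Call generate_once `runs` times and fail if any output differs from the first."""
    first = generate_once()
    for i in range(1, runs):
        again = generate_once()
        assert again == first, f"run {i + 1} differed from run 1"
    return first


# Stub standing in for a model call; always returns the same text.
print(assert_deterministic(lambda: "4"))
```

Inside a test, the callable would wrap the real call, for example `lambda: generate(llm, messages)`, assuming `generate()` samples at temperature=0.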
```text
collected 31 items

test_eval.py::test_sanity_greeting PASSED
test_eval.py::test_sanity_arithmetic PASSED
test_eval.py::test_sanity_capital PASSED
test_eval.py::test_sanity_latency PASSED
test_eval.py::test_deterministic_repeat PASSED
...
test_eval.py::test_regression_translation_es PASSED

============================== 31 passed in 184.32s ==============================
```
All tests follow the same pattern. To add one, define a function prefixed with `test_` anywhere in the file:

```python
def test_my_new_case(llm):
    out = generate(llm, [{"role": "user", "content": "Your prompt here"}])
    assert "expected output" in out.lower()
```

The `llm` fixture is session-scoped, so the model loads once regardless of how many tests you add.
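Assertions in a new test run against whatever `generate()` returns, which, per step 3 of the configuration, has the `<think>` block removed. When adapting that pattern for another model, it can help to try it on sample text without loading anything. The sketch below is a standalone version of the same regex; `strip_thinking` is a hypothetical name, and the trailing `.strip()` is an addition that the original line does not have.

```python
import re

# Non-greedy match so each <think>...</think> pair is removed on its own;
# re.DOTALL lets "." span the newlines inside a reasoning block.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove closed <think> blocks and trim surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


sample = "<think>\nThe user asks for a capital.\n</think>\n\nRiyadh"
print(strip_thinking(sample))  # prints: Riyadh
```

Note that an unclosed `<think>` tag, as can happen when generation is cut off mid-reasoning, does not match this pattern and is left in the text.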
| Variable | Description |
|---|---|
| `HF_TOKEN` | HuggingFace token for gated models |
| `MODEL_PATH` | Local path to model directory (skips HF download) |
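How `test_eval.py` reads these variables is not shown here. One plausible pattern, assumed purely for illustration, is to prefer `MODEL_PATH` when it is set and fall back to the Hugging Face model ID otherwise. `resolve_model` and `DEFAULT_MODEL_ID` are hypothetical names.

```python
import os

# Assumption: mirrors the MODEL_ID constant at the top of test_eval.py.
DEFAULT_MODEL_ID = "Intel/Qwen3.5-9B-int4-AutoRound"


def resolve_model(env=None):
    """Return MODEL_PATH if it is set and non-empty, else the default model ID."""
    env = os.environ if env is None else env
    return env.get("MODEL_PATH") or DEFAULT_MODEL_ID


print(resolve_model({"MODEL_PATH": "/models/qwen"}))  # prints: /models/qwen
```

Taking the environment as a parameter keeps the function easy to test without modifying `os.environ`.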