mohameddmansurr/llm-eval-suite

LLM Eval Suite

A pytest-based offline evaluation suite for vLLM-served models. Tests cover sanity, determinism, reasoning, tool calling, Arabic dialect, edge cases, output format, and regression.

Requirements

  • Python 3.10+
  • CUDA-capable GPU (tested on RTX 4090)
  • CUDA 12.6+

Installation

pip install -r requirements.txt

For quantized models (int4/GPTQ/AWQ) you may also need:

pip install autoround auto-gptq

Configuration

Before running, open test_eval.py and update the three places marked with comments:

1. Model ID (top of file):

MODEL_ID = "Intel/Qwen3.5-9B-int4-AutoRound"  # ← change this

2. vLLM engine args (inside the llm fixture):

return LLM(
    model=MODEL_ID,
    quantization="auto",        # remove for full-precision models
    dtype="bfloat16",
    gpu_memory_utilization=0.90,
    max_model_len=32768,
    trust_remote_code=True,
    reasoning_parser="qwen3",   # Qwen3-specific — remove for other models
    enable_reasoning=True,       # Qwen3-specific — remove for other models
    enable_auto_tool_choice=True,
    tool_call_parser="qwen3_coder",  # Qwen3-specific — change or remove
)

3. Thinking-block stripping (inside generate()):

# Qwen3 uses <think>...</think> — adjust or remove for other models
out = re.sub(r'<think>.*?</think>', '', out, flags=re.DOTALL)
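The stripping step can be tried in isolation, without loading a model. The sketch below wraps the same regular expression in a helper; the name strip_thinking and the trailing .strip() call are illustrative additions, not part of test_eval.py.

```python
import re

# Matches one complete <think>...</think> block; DOTALL lets "." span newlines.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove every complete <think>...</think> block, then trim whitespace.

    Hypothetical helper: the original code calls re.sub inline and does not trim.
    """
    return THINK_BLOCK.sub("", text).strip()


raw = "<think>\nThe user wants 2 + 2.\n</think>\n\nThe answer is 4."
print(strip_thinking(raw))  # The answer is 4.
```

The non-greedy .*? matters when a response contains more than one thinking block: each block is removed separately, and the text between them is kept. An unclosed <think> tag (for example, when generation is cut off by the token limit) is not matched and stays in the output.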

Usage

# Run all 31 tests
pytest test_eval.py -v

# Run a specific category
pytest test_eval.py -v -k "sanity"
pytest test_eval.py -v -k "saudi"
pytest test_eval.py -v -k "tool or format"
pytest test_eval.py -v -k "reasoning"

# Stop on first failure
pytest test_eval.py -x

# Run with shorter tracebacks on failure
pytest test_eval.py -v --tb=short

Test Categories

| Category | Tests | Description |
| --- | --- | --- |
| sanity | 4 | Basic response, arithmetic, geography, latency |
| deterministic | 2 | Same output across runs at temperature=0 |
| reasoning | 3 | Chain-of-thought, syllogism, boxed math |
| tool_calling | 3 | JSON tool dispatch, schema validation, RAG |
| saudi_dialect | 11 | Arabic dialect responses, sentiment, translation, summarisation, RAG, tool calling |
| edge_case | 3 | Empty input, malformed prompt, long context |
| output_format | 3 | JSON, Markdown table, Python function |
| regression | 2 | English sentiment, Spanish translation |
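As an illustration of what the deterministic category checks, the sketch below compares repeated outputs for a single prompt. The helper is_deterministic and the stand-in fake_greedy_model are hypothetical and not taken from test_eval.py; in the real suite the callable would wrap generate() with temperature=0.

```python
from typing import Callable


def is_deterministic(generate_fn: Callable[[str], str], prompt: str, runs: int = 3) -> bool:
    """Return True if generate_fn produces identical text on every run."""
    # A set collapses duplicates, so one element means all runs agreed.
    outputs = {generate_fn(prompt) for _ in range(runs)}
    return len(outputs) == 1


def fake_greedy_model(prompt: str) -> str:
    """Stand-in for a greedy (temperature=0) model call, for illustration only."""
    return prompt.upper()


print(is_deterministic(fake_greedy_model, "hello"))  # True
```

Comparing exact strings is the strictest form of this check. If a backend turns out not to be bit-for-bit reproducible even with greedy decoding, a looser comparison may be needed.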

Example Output

collected 31 items

test_eval.py::test_sanity_greeting            PASSED
test_eval.py::test_sanity_arithmetic          PASSED
test_eval.py::test_sanity_capital             PASSED
test_eval.py::test_sanity_latency             PASSED
test_eval.py::test_deterministic_repeat       PASSED
...
test_eval.py::test_regression_translation_es  PASSED

============================== 31 passed in 184.32s ==============================

Adding New Tests

All tests follow the same pattern: add a function prefixed with test_ anywhere in the file and take the llm fixture as its argument:

def test_my_new_case(llm):
    out = generate(llm, [{"role": "user", "content": "Your prompt here"}])
    assert "expected output" in out.lower()

The llm fixture is session-scoped so the model loads once regardless of how many tests you add.

Environment Variables

| Variable | Description |
| --- | --- |
| HF_TOKEN | HuggingFace token for gated models |
| MODEL_PATH | Local path to model directory (skips HF download) |
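The README does not show how test_eval.py reads these variables. One way such a lookup could work is sketched below, assuming MODEL_PATH takes precedence over the hard-coded model ID; resolve_model and has_hf_token are hypothetical names.

```python
import os

# Default taken from the Configuration section above.
DEFAULT_MODEL_ID = "Intel/Qwen3.5-9B-int4-AutoRound"


def resolve_model(default_id: str = DEFAULT_MODEL_ID) -> str:
    """Prefer a local directory from MODEL_PATH; otherwise fall back to the hub ID.

    Hypothetical helper: the precedence shown here is an assumption.
    """
    local_path = os.environ.get("MODEL_PATH", "").strip()
    return local_path or default_id


def has_hf_token() -> bool:
    """Report whether HF_TOKEN is set to a non-empty value."""
    return bool(os.environ.get("HF_TOKEN"))


print(resolve_model())
```

With a helper like this, the value passed as model= in the llm fixture would be resolve_model() rather than MODEL_ID directly.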
