LLM Benchmark — Customer Support Chatbot

A structured pipeline for benchmarking multiple LLM backends on a customer support chatbot task. Each model is tested against the same input dataset, outputs are scored by an LLM judge (Claude), and results are visualized with charts and summary tables.

📐 Architecture

data/dataset.csv
      │
      ├──► src/models/deepseek_transformers.py  ──► results/output_DeepSeek-R1-Distill-Llama-8B.csv
      ├──► src/models/deepseek_unsloth.py        ──► results/output_DeepSeek-...-unsloth-bnb-4bit.csv
      ├──► src/models/deepseek_vllm.py           ──► results/output_DeepSeek-...-vllm.csv
      └──► src/models/phi4_unsloth.py            ──► results/output_phi-4-unsloth-bnb-4bit.csv
                                                              │
                                              src/utils/csv_utils.merge_results()
                                                              │
                                              results/benchmark.csv
                                                              │
                                        src/evaluation/evaluator.py  (Claude judge)
                                                              │
                                        results/evaluated_responses.csv
                                                              │
                                        src/evaluation/visualize.py
                                                              │
                                        results/plots/{mean_scores, score_distribution,
                                                       score_heatmap, inference_time}.png
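
The merge step in the diagram can be pictured as joining the per-model output files on the shared question column. The following is a minimal sketch using only the standard library. It assumes each per-model CSV has a `USER INPUT` column plus one model-specific response column; the actual schema and the real signature of `merge_results()` are not shown in this README, so `merge_model_outputs` is a hypothetical stand-in.

```python
import csv
import io


def merge_model_outputs(files):
    """Join per-model CSV files on the shared 'USER INPUT' column.

    `files` is a list of open text file objects. Assumed (hypothetical) schema:
    every file has a 'USER INPUT' column and one or more model-specific columns.
    """
    merged = {}  # question -> combined row; dicts preserve insertion order
    for handle in files:
        for row in csv.DictReader(handle):
            key = row["USER INPUT"]
            merged.setdefault(key, {"USER INPUT": key}).update(row)
    return list(merged.values())


# Two tiny in-memory files standing in for results/output_*.csv
file_a = io.StringIO("USER INPUT,model_a\nHow do I reset my password?,Click 'Forgot Password'.\n")
file_b = io.StringIO("USER INPUT,model_b\nHow do I reset my password?,Use the reset link.\n")

rows = merge_model_outputs([file_a, file_b])
print(rows[0])
```

Keying on the question text keeps the sketch short, but it would collapse duplicate questions into one row; a row index would be a safer join key if the dataset can contain repeats.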

🗂️ Project Structure

llm-benchmark/
├── config.py                        # Central config (paths, model IDs, settings)
├── prompts/
│   └── system_prompt.txt            # System prompt for the chatbot
├── data/
│   └── dataset.csv                  # Input CSV (USER INPUT, TRUE OUTPUT, prompt)
├── src/
│   ├── models/
│   │   ├── deepseek_transformers.py # HuggingFace Transformers backend
│   │   ├── deepseek_unsloth.py      # Unsloth 4-bit backend
│   │   ├── deepseek_vllm.py         # vLLM backend
│   │   └── phi4_unsloth.py          # Phi-4 via Unsloth
│   ├── evaluation/
│   │   ├── evaluator.py             # LLM-as-judge scorer (Claude API)
│   │   └── visualize.py             # Charts & summary tables
│   └── utils/
│       └── csv_utils.py             # Shared CSV read/write helpers
├── scripts/
│   ├── run_inference.py             # CLI: run one or all models
│   └── run_evaluation.py            # CLI: score + visualize
├── notebooks/
│   ├── 01_deepseek_transformers.ipynb
│   └── 05_evaluation_and_visualization.ipynb
├── requirements/
│   ├── transformers.txt
│   ├── unsloth.txt
│   └── vllm.txt
└── results/                         # Auto-created; gitignored except sample data

📊 Dataset Format (`data/dataset.csv`)

| Column | Description |
|---|---|
| `USER INPUT` | The user's question to the chatbot |
| `TRUE OUTPUT` | A known-good reference response (ground truth) |
| `prompt` | Full chat template as a Python list of `{role, content}` dicts |

Example row:

USER INPUT: "How do I reset my password?"
TRUE OUTPUT: "Go to the login page, click 'Forgot Password', and follow the email link."
prompt: [{"role": "system", "content": "..."}, {"role": "user", "content": "How do I reset my password?"}]
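
Because the `prompt` column stores a list of dicts as text, it has to be parsed back into Python objects before it can be passed to a model. The sketch below shows one way to do that, assuming the cell holds a Python literal as in the example row; `parse_prompt` is a hypothetical helper, not a function from this repository.

```python
import ast
import csv
import io


def parse_prompt(cell):
    """Turn the text in the 'prompt' column back into a list of message dicts."""
    messages = ast.literal_eval(cell)  # evaluates literals only, never arbitrary code
    if not isinstance(messages, list):
        raise ValueError("prompt must be a list of messages")
    for message in messages:
        if set(message) != {"role", "content"}:
            raise ValueError(f"unexpected message keys: {sorted(message)}")
    return messages


# Build a one-row dataset in memory that mirrors the example row above
sample = io.StringIO()
writer = csv.DictWriter(sample, fieldnames=["USER INPUT", "TRUE OUTPUT", "prompt"])
writer.writeheader()
writer.writerow({
    "USER INPUT": "How do I reset my password?",
    "TRUE OUTPUT": "Go to the login page, click 'Forgot Password', and follow the email link.",
    "prompt": str([
        {"role": "system", "content": "..."},
        {"role": "user", "content": "How do I reset my password?"},
    ]),
})
sample.seek(0)

row = next(csv.DictReader(sample))
messages = parse_prompt(row["prompt"])
print(messages[-1]["content"])
```

`ast.literal_eval` is used instead of `eval` so that a malformed or malicious cell cannot execute code.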

🚀 Quickstart

1. Install dependencies

Pick the backend(s) you need:

pip install -r requirements/transformers.txt   # HuggingFace pipeline
pip install -r requirements/unsloth.txt        # Unsloth 4-bit (GPU required)
pip install -r requirements/vllm.txt           # vLLM (GPU required)

2. Set your API key (for evaluation)

export ANTHROPIC_API_KEY=sk-ant-...
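
A script that depends on this variable can fail early with a clear message instead of partway through scoring. The sketch below shows such a check; `require_api_key` is a hypothetical helper, and whether the repository's scripts perform a similar check is not stated here.

```python
import os


def require_api_key(env=None):
    """Return the Anthropic API key, or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError("ANTHROPIC_API_KEY is not set; export it before running evaluation")
    return key


# A plain dict stands in for the real environment in this demo
print(require_api_key({"ANTHROPIC_API_KEY": "sk-ant-example"}))
```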

3. Run inference

# Run a single model (add --limit N for a quick test)
python scripts/run_inference.py --model deepseek_transformers --limit 5

# Run all models and merge results
python scripts/run_inference.py --model all --merge

Available model keys: deepseek_transformers, deepseek_unsloth_4bit, deepseek_vllm, phi4_unsloth_4bit
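
The flags used above (`--model`, `--limit`, `--merge`) and the four model keys come from this README. How `scripts/run_inference.py` parses them is not shown, so the following is a hypothetical reconstruction with `argparse`, not the script's actual code. In particular, making `--model` required is an assumption.

```python
import argparse

# Model keys as listed in this README; "all" runs every backend.
MODEL_KEYS = [
    "deepseek_transformers",
    "deepseek_unsloth_4bit",
    "deepseek_vllm",
    "phi4_unsloth_4bit",
]


def build_parser():
    parser = argparse.ArgumentParser(description="Run inference for one or all models")
    parser.add_argument("--model", required=True, choices=MODEL_KEYS + ["all"])
    parser.add_argument("--limit", type=int, default=None,
                        help="only process the first N rows (quick test)")
    parser.add_argument("--merge", action="store_true",
                        help="merge per-model outputs after inference")
    return parser


def selected_models(args):
    """Expand 'all' into the full list of model keys."""
    return list(MODEL_KEYS) if args.model == "all" else [args.model]


args = build_parser().parse_args(["--model", "deepseek_transformers", "--limit", "5"])
print(selected_models(args), args.limit, args.merge)
```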

4. Evaluate and visualize

# Score all model outputs and generate plots
python scripts/run_evaluation.py --visualize

# Only plot (if you already have evaluated_responses.csv)
python scripts/run_evaluation.py --only-visualize
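
The scoring step follows the LLM-as-judge pattern: the judge sees the question, the reference answer, and a model's answer, and replies with a score. The sketch below covers only the two pieces around the API call, building the judge prompt and parsing the reply. The rubric wording, the 1 to 10 scale, and both function names are assumptions for illustration; the prompt and scale used by `evaluator.py` are not documented in this README.

```python
import re

# Hypothetical rubric: the real wording and score range may differ.
JUDGE_TEMPLATE = (
    "You are grading a customer support reply.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Reply with a single integer score from {low} to {high}."
)


def build_judge_prompt(question, reference, candidate, low=1, high=10):
    """Fill the rubric template for one (question, reference, candidate) triple."""
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate, low=low, high=high
    )


def parse_score(reply, low=1, high=10):
    """Extract the first integer from the judge's reply; None if absent or out of range."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    score = int(match.group())
    return score if low <= score <= high else None


judge_prompt = build_judge_prompt(
    "How do I reset my password?",
    "Go to the login page, click 'Forgot Password', and follow the email link.",
    "Use the reset link on the login page.",
)
# In the real pipeline the reply would come from the Claude API; a literal stands in here.
print(parse_score("Score: 8"))
```

Returning `None` for unparseable replies lets the caller decide whether to retry or drop the row, rather than silently recording a wrong score.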

📈 Output Visualizations

| Chart | Description |
|---|---|
| `mean_scores.png` | Bar chart of mean score ± std dev per model |
| `score_distribution.png` | Box plot of score spread per model |
| `score_heatmap.png` | Heatmap: score for each question × model |
| `inference_time.png` | Horizontal bar chart of average inference time |
| `summary_table.csv` | Mean, median, min, max, avg time per model |
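
The statistics in `summary_table.csv` are simple per-model aggregates. The sketch below shows how one row could be computed with the standard library. The function name, the dictionary keys, and the numbers are made up for illustration and are not results from this benchmark.

```python
from statistics import mean, median


def summarize(scores, times):
    """Aggregate one model's judge scores and per-question inference times."""
    return {
        "mean": mean(scores),
        "median": median(scores),
        "min": min(scores),
        "max": max(scores),
        "avg_time": mean(times),
    }


# Illustrative values only, not real benchmark output
summary = summarize(scores=[7, 8, 9, 6], times=[1.0, 2.0, 3.0, 2.0])
print(summary)
```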

⚙️ Configuration (`config.py`)

| Key | Default | Description |
|---|---|---|
| `INPUT_CSV` | `data/dataset.csv` | Input dataset path |
| `OUTPUT_DIR` | `results/` | Where CSVs are written |
| `MAX_NEW_TOKENS` | `256` | Token budget per response |
| `MAX_SEQ_LENGTH` | `2048` | Max context for Unsloth models |
| `LOAD_IN_4BIT` | `True` | 4-bit quantization for Unsloth |
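
For orientation, the defaults above could be written as module-level constants like this. The sketch uses the keys and values from the table, while the use of `pathlib` and the helper `output_path_for` are assumptions; the helper only mirrors the `results/output_<model>.csv` naming visible in the architecture diagram.

```python
from pathlib import Path

# Defaults as documented in the table above
INPUT_CSV = Path("data/dataset.csv")
OUTPUT_DIR = Path("results/")
MAX_NEW_TOKENS = 256
MAX_SEQ_LENGTH = 2048
LOAD_IN_4BIT = True


def output_path_for(model_name):
    """Hypothetical helper: build the per-model output path under OUTPUT_DIR."""
    return OUTPUT_DIR / f"output_{model_name}.csv"


print(output_path_for("phi-4-unsloth-bnb-4bit").as_posix())
```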

🔧 Adding a New Model

1. Create `src/models/my_model.py` following the pattern of the existing modules:
   - `load_model()` → returns the model/tokenizer/pipeline
   - `build_inference_fn(...)` → returns `fn(user_input, prompt) → str`
   - `run(input_csv, output_csv, limit)` → calls `write_results(...)`
2. Add an entry to `REGISTRY` in `scripts/run_inference.py`.
3. Add the column name to `EVAL_MODEL_COLS` in `config.py`.
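
The three functions listed above can be sketched as a skeleton module. This is an illustration under stated assumptions, not a copy of an existing backend: the echo "model", the `my_model` column name, and the local `write_results` stand-in (the real helper lives in `src/utils/csv_utils.py` and its signature is not shown here) are all hypothetical.

```python
import csv
import os
import tempfile

MODEL_COLUMN = "my_model"  # hypothetical column name; would also go in EVAL_MODEL_COLS


def write_results(output_csv, rows, fieldnames):
    """Stand-in for the project's write_results helper (its real signature may differ)."""
    with open(output_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def load_model():
    """Return the model object; a trivial echo function stands in for a real backend."""
    return lambda text: f"[echo] {text}"


def build_inference_fn(model):
    """Wrap the model in the fn(user_input, prompt) -> str shape described above."""
    def infer(user_input, prompt):
        return model(user_input)  # a real backend would feed `prompt` to its chat template
    return infer


def run(input_csv, output_csv, limit=None):
    """Read the dataset, generate one response per row, and write the output CSV."""
    infer = build_inference_fn(load_model())
    with open(input_csv, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fieldnames = list(reader.fieldnames or []) + [MODEL_COLUMN]
        rows = list(reader)[:limit]  # limit=None keeps every row
    for row in rows:
        row[MODEL_COLUMN] = infer(row["USER INPUT"], row["prompt"])
    write_results(output_csv, rows, fieldnames)


# Demo on a throwaway one-row dataset
workdir = tempfile.mkdtemp()
input_csv = os.path.join(workdir, "dataset.csv")
output_csv = os.path.join(workdir, "output_my_model.csv")
write_results(
    input_csv,
    [{"USER INPUT": "Hi", "TRUE OUTPUT": "Hello!", "prompt": "[]"}],
    ["USER INPUT", "TRUE OUTPUT", "prompt"],
)
run(input_csv, output_csv, limit=1)
with open(output_csv, newline="", encoding="utf-8") as handle:
    result = list(csv.DictReader(handle))
print(result[0][MODEL_COLUMN])
```

Keeping model loading, inference, and CSV handling in separate functions means a new backend only has to replace `load_model()` and the body of `infer`.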

📋 Requirements

- Python 3.10+
- NVIDIA GPU (required for Unsloth and vLLM backends)
- `ANTHROPIC_API_KEY` environment variable (for evaluation scoring)

📄 License

MIT
