Catch LLM behavioural regressions before they reach production
Why • How It Works • Quickstart • Docs
## Why

Traditional LLM evaluation frameworks answer: "What's the model's score on MMLU?"

insideLLMs answers: "Did my model's behaviour change between versions?"
When you're shipping LLM-powered products, you don't need leaderboard rankings. You need to know:
- Did prompt #47 start returning different advice?
- Will this model update break my users' workflows?
- Can I safely deploy this change?
insideLLMs provides deterministic, diffable, CI-gateable behavioural testing for LLMs.
- Deterministic by design: Same inputs (and model responses) produce byte-for-byte identical artefacts
- CI-native: `insidellms diff --fail-on-changes` blocks bad deploys
- Response-level granularity: See exactly which prompts changed, not just aggregate metrics
- Provider-agnostic: OpenAI, Anthropic, local models (Ollama, llama.cpp), all through one interface
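As a rough illustration of the provider-agnostic point, anything that exposes a `generate()` method (the call custom probes make, as shown later in this README) can stand in for a provider-backed model. The class below is a hypothetical stub for offline experiments, not part of the library:

```python
# Hypothetical stub, not an insideLLMs class: a canned responder with the
# same generate() method a provider-backed model object would expose.
class StubModel:
    def generate(self, prompt: str) -> str:
        # Always return the same deterministic text, handy for dry runs.
        return "Consult a doctor for medical advice."

print(StubModel().generate("I have a headache. What should I do?"))
```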
## How It Works

Define the behaviours you care about as probes:

```python
from insideLLMs import LogicProbe, BiasProbe, SafetyProbe

# Test specific behaviours, not broad benchmarks
probes = [LogicProbe(), BiasProbe(), SafetyProbe()]
```

Run them through the harness:

```bash
insidellms harness config.yaml --run-dir ./baseline
```

Produces deterministic artefacts:

- `records.jsonl` - Every input/output pair (canonical)
- `manifest.json` - Run metadata (deterministic fields only)
- `config.resolved.yaml` - Normalized config snapshot used for the run
- `summary.json` - Aggregated metrics
- `report.html` - Human-readable comparison
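The artefacts are plain JSON/JSONL, so they are easy to inspect directly. A minimal sketch, assuming only the file names above and the record fields documented later in this README:

```python
# Inspect a run directory produced by `insidellms harness`.
import json
from pathlib import Path

run_dir = Path("./baseline")

# summary.json holds the aggregated metrics for the run.
summary = json.loads((run_dir / "summary.json").read_text())
print("summary:", summary)

# records.jsonl holds one JSON object per prompt/response pair.
with (run_dir / "records.jsonl").open() as fh:
    for line in fh:
        record = json.loads(line)
        print(record["example_id"], record["status"])
```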
Then compare a candidate run against the baseline:

```bash
insidellms diff ./baseline ./candidate --fail-on-changes
```

Blocks the deploy if behaviour changed:

```text
Changes detected:
  example_id: 47
  field: output
  baseline: "Consult a doctor for medical advice."
  candidate: "Here's what you should do..."
```
## Quickstart

Install from source:

```bash
git clone https://github.com/dr-gareth-roberts/insideLLMs.git
cd insideLLMs
pip install -e ".[all]"
```

Check the install without any API keys:

```bash
# Quick test with DummyModel
insidellms quicktest "What is 2 + 2?" --model dummy

# Run offline golden path
python examples/example_cli_golden_path.py
```

Define a harness config:

```yaml
# harness.yaml
models:
  - type: openai
    args: {model_name: gpt-4o}
  - type: anthropic
    args: {model_name: claude-3-5-sonnet-20241022}
probes:
  - type: logic
  - type: bias
dataset:
  format: jsonl
  path: data/test.jsonl
```

Set provider keys and run it:

```bash
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

insidellms harness harness.yaml --run-dir ./baseline
```
Generate a human-readable report:

```bash
insidellms report ./baseline
```

Gate deploys in CI:

```yaml
# .github/workflows/behavioural-tests.yml
- name: Run candidate
  run: insidellms harness config.yaml --run-dir ./candidate
- name: Diff against baseline
  run: insidellms diff ./baseline ./candidate --fail-on-changes
```

Runs are deterministic by design:

- Run IDs are SHA-256 hashes of the inputs (config + dataset), with local file datasets content-hashed
- Timestamps derive from run IDs, not wall clocks
- JSON output has stable formatting (sorted keys, consistent separators)
- Result: `git diff` works on model behaviour
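The sketch below illustrates the idea of hashing canonicalised inputs into a stable run ID; it is a simplified stand-in, not insideLLMs' actual hashing scheme:

```python
# Simplified illustration of deterministic run IDs (not the library's code).
import hashlib
import json

def run_id(config: dict, dataset_bytes: bytes) -> str:
    # Canonicalise the config (sorted keys, fixed separators) so that
    # logically identical configs serialise to identical bytes.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256()
    digest.update(canonical.encode("utf-8"))
    digest.update(dataset_bytes)  # local datasets are content-hashed
    return digest.hexdigest()

cfg = {"models": [{"type": "openai"}], "probes": [{"type": "logic"}]}
print(run_id(cfg, b'{"example_id": "47"}'))  # same inputs, same ID, every time
```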
`records.jsonl` preserves every input/output pair:

```jsonl
{"example_id": "47", "input": {...}, "output": "...", "status": "success"}
{"example_id": "48", "input": {...}, "output": "...", "status": "success"}
```

No more debugging aggregate metrics. See exactly what changed.
Extend the `Probe` base class for domain-specific checks:

```python
from insideLLMs.probes import Probe

class MedicalSafetyProbe(Probe):
    def run(self, model, data, **kwargs):
        response = model.generate(data["symptom_query"])
        return {
            "response": response,
            "has_disclaimer": "consult a doctor" in response.lower(),
        }
```

Build domain-specific tests without forking the framework.
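A hypothetical way to exercise such a probe directly, using a stand-in model object rather than a real provider; constructor details and harness registration may differ (see the custom-probes tutorial in the docs):

```python
# Drive the custom probe with a canned model; purely illustrative.
class CannedModel:
    def generate(self, prompt: str) -> str:
        return "Symptoms vary. Please consult a doctor before acting on this."

probe = MedicalSafetyProbe()
result = probe.run(CannedModel(), {"symptom_query": "I have a persistent headache."})
assert result["has_disclaimer"]
print(result)
```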
## Docs

- Documentation Site - Complete guides and reference
- Philosophy - Why insideLLMs exists
- Getting Started - Install and first run
- Tutorials - Bias testing, CI integration, custom probes
- API Reference - Complete Python API
- Examples - Runnable code samples
| Scenario | Solution |
|---|---|
| Model upgrade breaks production | Catch it in CI with `--fail-on-changes` |
| Need to compare GPT-4 vs Claude | Run harness, get side-by-side report |
| Detect bias in salary advice | Use BiasProbe with paired prompts |
| Test jailbreak resistance | Use SafetyProbe with attack patterns |
| Custom domain evaluation | Extend Probe base class |
| Framework | Focus | insideLLMs Difference |
|---|---|---|
| Eleuther lm-evaluation-harness | Benchmark scores | Behavioural regression detection |
| HELM | Holistic evaluation | CI-native, deterministic diffing |
| OpenAI Evals | Conversational tasks | Response-level granularity, provider-agnostic |
insideLLMs is for teams shipping LLM products who need to know what changed, not just what scored well.
See CONTRIBUTING.md for development setup and guidelines.
MIT. See LICENSE.