Catch LLM behavioural regressions before they reach production
Why • How It Works • Quickstart • Docs
## Why

Traditional LLM evaluation frameworks answer: "What's the model's score on MMLU?"

insideLLMs answers: "Did my model's behaviour change between versions?"
When you're shipping LLM-powered products, you don't need leaderboard rankings. You need to know:
- Did prompt #47 start returning different advice?
- Will this model update break my users' workflows?
- Can I safely deploy this change?
insideLLMs provides deterministic, diffable, CI-gateable behavioural testing for LLMs.
- Deterministic by design: Same inputs (and model responses) produce byte-for-byte identical artefacts
- CI-native: `insidellms diff --fail-on-changes` blocks bad deploys
- Response-level granularity: See exactly which prompts changed, not just aggregate metrics
- Provider-agnostic: OpenAI, Anthropic, local models (Ollama, llama.cpp), all through one interface
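As a rough illustration of the provider-agnostic point, anything that exposes a `generate()` method (the call custom probes make, as shown later in this README) can stand in for a provider-backed model. The class below is a hypothetical stub for offline experiments, not part of the library:

```python
# Hypothetical stub, not an insideLLMs class: a canned responder with the
# same generate() method a provider-backed model object would expose.
class StubModel:
    def generate(self, prompt: str) -> str:
        # Always return the same deterministic text, handy for dry runs.
        return "Consult a doctor for medical advice."

print(StubModel().generate("I have a headache. What should I do?"))
```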
## How It Works

Define the behaviours you care about as probes:

```python
from insideLLMs import LogicProbe, BiasProbe, SafetyProbe

# Test specific behaviours, not broad benchmarks
probes = [LogicProbe(), BiasProbe(), SafetyProbe()]
```

Run them through the harness:

```bash
insidellms harness config.yaml --run-dir ./baseline
```

Produces deterministic artefacts:

- `records.jsonl` - Every input/output pair (canonical)
- `manifest.json` - Run metadata (deterministic fields only)
- `config.resolved.yaml` - Normalized config snapshot used for the run
- `summary.json` - Aggregated metrics
- `report.html` - Human-readable comparison
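The artefacts are plain JSON/JSONL, so they are easy to inspect directly. A minimal sketch, assuming only the file names above and the record fields documented later in this README:

```python
# Inspect a run directory produced by `insidellms harness`.
import json
from pathlib import Path

run_dir = Path("./baseline")

# summary.json holds the aggregated metrics for the run.
summary = json.loads((run_dir / "summary.json").read_text())
print("summary:", summary)

# records.jsonl holds one JSON object per prompt/response pair.
with (run_dir / "records.jsonl").open() as fh:
    for line in fh:
        record = json.loads(line)
        print(record["example_id"], record["status"])
```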
Then compare a candidate run against the baseline:

```bash
insidellms diff ./baseline ./candidate --fail-on-changes
```

Blocks the deploy if behaviour changed:

```text
Changes detected:
  example_id: 47
  field: output
  baseline: "Consult a doctor for medical advice."
  candidate: "Here's what you should do..."
```
## Quickstart

Install from source:

```bash
git clone https://github.com/dr-gareth-roberts/insideLLMs.git
cd insideLLMs
pip install -e ".[all]"
```

Check the install without any API keys:

```bash
# Quick test with DummyModel
insidellms quicktest "What is 2 + 2?" --model dummy

# Run offline golden path
python examples/example_cli_golden_path.py
```

Define a harness config:

```yaml
# harness.yaml
models:
  - type: openai
    args: {model_name: gpt-4o}
  - type: anthropic
    args: {model_name: claude-3-5-sonnet-20241022}
probes:
  - type: logic
  - type: bias
dataset:
  format: jsonl
  path: data/test.jsonl
```

Set provider keys and run it:

```bash
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

insidellms harness harness.yaml --run-dir ./baseline
```
Generate a human-readable report:

```bash
insidellms report ./baseline
```

Gate deploys in CI:

```yaml
# .github/workflows/behavioural-tests.yml
- name: Run candidate
  run: insidellms harness config.yaml --run-dir ./candidate
- name: Diff against baseline
  run: insidellms diff ./baseline ./candidate --fail-on-changes
```

Runs are deterministic by design:

- Run IDs are SHA-256 hashes of the inputs (config + dataset), with local file datasets content-hashed
- Timestamps derive from run IDs, not wall clocks
- JSON output has stable formatting (sorted keys, consistent separators)
- Result: `git diff` works on model behaviour
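The sketch below illustrates the idea of hashing canonicalised inputs into a stable run ID; it is a simplified stand-in, not insideLLMs' actual hashing scheme:

```python
# Simplified illustration of deterministic run IDs (not the library's code).
import hashlib
import json

def run_id(config: dict, dataset_bytes: bytes) -> str:
    # Canonicalise the config (sorted keys, fixed separators) so that
    # logically identical configs serialise to identical bytes.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256()
    digest.update(canonical.encode("utf-8"))
    digest.update(dataset_bytes)  # local datasets are content-hashed
    return digest.hexdigest()

cfg = {"models": [{"type": "openai"}], "probes": [{"type": "logic"}]}
print(run_id(cfg, b'{"example_id": "47"}'))  # same inputs, same ID, every time
```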
`records.jsonl` preserves every input/output pair:

```jsonl
{"example_id": "47", "input": {...}, "output": "...", "status": "success"}
{"example_id": "48", "input": {...}, "output": "...", "status": "success"}
```

No more debugging aggregate metrics. See exactly what changed.
Extend the `Probe` base class for domain-specific checks:

```python
from insideLLMs.probes import Probe

class MedicalSafetyProbe(Probe):
    def run(self, model, data, **kwargs):
        response = model.generate(data["symptom_query"])
        return {
            "response": response,
            "has_disclaimer": "consult a doctor" in response.lower(),
        }
```

Build domain-specific tests without forking the framework.
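A hypothetical way to exercise such a probe directly, using a stand-in model object rather than a real provider; constructor details and harness registration may differ (see the custom-probes tutorial in the docs):

```python
# Drive the custom probe with a canned model; purely illustrative.
class CannedModel:
    def generate(self, prompt: str) -> str:
        return "Symptoms vary. Please consult a doctor before acting on this."

probe = MedicalSafetyProbe()
result = probe.run(CannedModel(), {"symptom_query": "I have a persistent headache."})
assert result["has_disclaimer"]
print(result)
```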
## Docs

- Documentation Site - Complete guides and reference
- Philosophy - Why insideLLMs exists
- Getting Started - Install and first run
- Tutorials - Bias testing, CI integration, custom probes
- API Reference - Complete Python API
- Examples - Runnable code samples
| Scenario | Solution |
|---|---|
| Model upgrade breaks production | Catch it in CI with `--fail-on-changes` |
| Need to compare GPT-4 vs Claude | Run harness, get side-by-side report |
| Detect bias in salary advice | Use BiasProbe with paired prompts |
| Test jailbreak resistance | Use SafetyProbe with attack patterns |
| Custom domain evaluation | Extend Probe base class |
| Framework | Focus | insideLLMs Difference |
|---|---|---|
| Eleuther lm-evaluation-harness | Benchmark scores | Behavioural regression detection |
| HELM | Holistic evaluation | CI-native, deterministic diffing |
| OpenAI Evals | Conversational tasks | Response-level granularity, provider-agnostic |
insideLLMs is for teams shipping LLM products who need to know what changed, not just what scored well.
See CONTRIBUTING.md for development setup and guidelines.
MIT. See LICENSE.