Sub-second RAG regression testing for production pipelines
"Did my last commit break retrieval?" — now you know in seconds.
LongProbe is a sub-second RAG regression harness. Define your Golden Questions once, run `longprobe check` on every commit, and get an exact diff of which document chunks were lost in your latest change — before your users notice.
Think `pytest --watch` for your RAG pipeline.
Full RAG regression testing workflow: auto-capture golden questions, run tests, save baseline, detect regressions.
Detailed quality monitoring with Python API and comprehensive results.
Baseline comparison and regression detection with deployment verdict.
Every RAG developer faces the same silent killer: you refactor your chunking strategy, upgrade LangChain, or add a new document — and retrieval silently degrades. DeepEval and RAGChecker are heavyweight evaluation frameworks built for batch analysis, not for fast regression checks in a dev loop.
LongProbe gives you instant feedback:
- ⚡ Sub-second checks on small golden sets
- 🔍 Exact diffs showing which chunks were lost/gained
- 📊 Recall scores with per-question breakdown
- 💾 Baseline tracking to catch regressions over time
- 🧪 pytest integration for existing test suites
- 🔌 Pluggable adapters for any vector store
LongProbe is part of the EnDevSols Long Suite of RAG tools:
- LongParser - Document ingestion and chunking
- LongTrainer - RAG chatbot framework
- LongTracer - Hallucination detection
- LongProbe - Retrieval regression testing ← You are here
Together they cover the full RAG pipeline from ingestion to production monitoring.
- ⚡ Sub-second checks on small golden sets
- 📋 Golden Questions + Required Chunks defined in simple YAML
- 🔍 Three match modes: exact ID, text substring, semantic similarity
- 📊 Recall Score with per-question breakdown
- 🔄 Regression diff: exactly which chunks were lost/gained
- 💾 SQLite baseline store: compare against any previous run
- 🧪 pytest plugin: integrate into existing test suites
- 🔌 Pluggable adapters: LangChain, LlamaIndex, Chroma, Pinecone, Qdrant
- 🖥️ Beautiful CLI with Rich tables, JSON, and GitHub Actions output
- 👀 Watch mode: auto re-run on file changes
- 🏗️ CI/CD ready: fails pipeline on regression
```bash
# Install with UV (recommended)
uv pip install longprobe

# Install with pip
pip install longprobe

# Install with optional dependencies
uv pip install longprobe[chroma]   # ChromaDB support
uv pip install longprobe[openai]   # OpenAI embeddings
uv pip install longprobe[all]      # Everything
```

Initialize a project:

```bash
longprobe init
```

This creates:

- `.longprobe/` — directory for baseline storage
- `goldens.yaml` — example golden questions
- `longprobe.yaml` — configuration file
Edit `goldens.yaml` with your test cases:

```yaml
name: "my-rag-golden-set"
version: "1.0"
questions:
  - id: "q1"
    question: "What is the termination clause?"
    match_mode: "id"              # exact chunk ID match
    required_chunks:
      - "contracts_chunk_42"
      - "contracts_chunk_43"
    top_k: 5
    tags: ["contracts", "critical"]

  - id: "q2"
    question: "What are the payment terms?"
    match_mode: "text"            # substring match
    required_chunks:
      - "net 30 days from invoice"
    top_k: 5

  - id: "q3"
    question: "Who can sign contracts?"
    match_mode: "semantic"        # embedding similarity
    semantic_threshold: 0.80
    required_chunks:
      - "The following officers are authorized to sign"
    top_k: 10
```

Then edit `longprobe.yaml`:
```yaml
retriever:
  type: "chroma"
  chroma:
    persist_directory: "./chroma_db"
    collection: "my_documents"

embedder:
  provider: "local"
  model: "text-embedding-3-small"

scoring:
  recall_threshold: 0.8
  fail_on_regression: true

baseline:
  db_path: ".longprobe/baselines.db"
  auto_compare: true
```

```bash
# Run against live vector store
longprobe check --goldens goldens.yaml

# Override settings
longprobe check --threshold 0.9 --top-k 10

# JSON output for automation
longprobe check --output json

# GitHub Actions annotations
longprobe check --output github
```

| Command | Description |
|---|---|
| `longprobe init` | Create starter configuration files |
| `longprobe check` | Run probes against the golden set |
| `longprobe diff` | Compare current results against baseline |
| `longprobe baseline save` | Save current results as baseline |
| `longprobe baseline list` | List all saved baselines |
| `longprobe watch` | Watch golden file and re-run on changes |
| `longprobe generate` | Auto-generate Golden Questions from documents |
| `longprobe capture` | Build `goldens.yaml` by querying your retriever |
```bash
# Initialize project
longprobe init

# Run checks with custom config
longprobe check -g goldens.yaml -c longprobe.yaml

# Save baseline for comparison
longprobe baseline save --label v1.0

# Compare against baseline
longprobe diff --baseline v1.0

# Watch mode for development
longprobe watch --interval 2

# Generate questions from documents
longprobe generate ./docs --capture --auto
```

Basic usage from Python:

```python
from longprobe import LongProbe
from longprobe.adapters import create_adapter

# Create adapter for your vector store
adapter = create_adapter(
    "chroma",
    collection_name="my_documents",
    persist_directory="./chroma_db",
)

# Create and run probe
probe = LongProbe(
    adapter=adapter,
    goldens_path="goldens.yaml",
    config_path="longprobe.yaml",
)
report = probe.run()
print(f"Overall Recall: {report.overall_recall:.2%}")
print(f"Pass Rate: {report.pass_rate:.2%}")
```

Baseline comparison and regression detection:

```python
from longprobe import LongProbe
from longprobe.adapters import create_adapter

adapter = create_adapter("chroma", collection_name="docs", persist_directory="./db")
probe = LongProbe(adapter=adapter, goldens_path="goldens.yaml")

# Run and save baseline
report = probe.run()
probe.save_baseline(label="v1.0")

# After making changes...
report2 = probe.run()

# Compare against baseline
diff = probe.diff(baseline_label="v1.0")
print(f"Regressions: {len(diff['regressions'])}")
print(f"Improvements: {len(diff['improvements'])}")
```

With an existing LangChain retriever:

```python
from longprobe import LongProbe
from longprobe.adapters import LangChainRetrieverAdapter

# Wrap your existing LangChain retriever
adapter = LangChainRetrieverAdapter(your_langchain_retriever)
probe = LongProbe(adapter=adapter, goldens_path="goldens.yaml")
report = probe.run()
assert report.overall_recall >= 0.85, f"Recall too low: {report.overall_recall}"
```

With a LlamaIndex retriever:

```python
from longprobe import LongProbe
from longprobe.adapters import LlamaIndexRetrieverAdapter

adapter = LlamaIndexRetrieverAdapter(your_llamaindex_retriever)
probe = LongProbe(adapter=adapter, goldens_path="goldens.yaml")
report = probe.run()
```

For pytest integration, define a fixture in `conftest.py`:

```python
# conftest.py
import pytest
from longprobe import LongProbe
from longprobe.adapters import create_adapter

@pytest.fixture
def probe():
    adapter = create_adapter(
        "chroma",
        collection_name="test_docs",
        persist_directory="./test_db",
    )
    return LongProbe(
        adapter=adapter,
        goldens_path="tests/goldens.yaml",
        recall_threshold=0.85,
    )
```

Then write tests against it:

```python
def test_retrieval_recall(probe):
    report = probe.run()
    assert report.overall_recall >= 0.85, (
        f"Recall dropped to {report.overall_recall:.2f}"
    )

def test_no_regression_vs_baseline(probe):
    report = probe.run()
    assert not report.regression_detected, (
        f"Regression detected! Delta: {report.recall_delta}"
    )
```

LongProbe supports multiple vector stores and retrieval frameworks:
| Adapter | Type | Configuration |
|---|---|---|
| ChromaDB | Direct | `type: chroma` |
| Pinecone | Direct | `type: pinecone` |
| Qdrant | Direct | `type: qdrant` |
| HTTP API | Direct | `type: http` |
| LangChain | Programmatic | `LangChainRetrieverAdapter` |
| LlamaIndex | Programmatic | `LlamaIndexRetrieverAdapter` |

Direct adapters are configured in `longprobe.yaml`:

```yaml
retriever:
  type: chroma
  collection: my_collection
  persist_directory: ./chroma_db
```

Any HTTP retrieval endpoint can be probed with the `http` adapter:

```yaml
retriever:
  type: http
  url: "http://localhost:8000/api/retrieve"
  method: "POST"
  body_template: '{"query": "{question}"}'
  response_mapping:
    results_path: "data.chunks"
    text_field: "content"
```

A sample GitHub Actions workflow that fails the build on regression:

```yaml
name: RAG Regression Check

on: [push, pull_request]

jobs:
  rag-probe:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
      - run: uv pip install longprobe[chroma]
      - name: Run RAG regression check
        run: longprobe check --goldens goldens.yaml --output github
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

LongProbe supports three match modes:

- **`id`** — Exact string match on chunk/document IDs. Best when you control the IDs in your vector store.
- **`text`** — Case-insensitive substring matching. Checks whether the required text appears anywhere in the retrieved documents.
- **`semantic`** — Word-frequency cosine similarity. Useful when the exact text may vary but the meaning should be preserved.
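The semantic mode's idea can be illustrated with a small sketch — a simplified stand-in, not LongProbe's actual implementation: build word-frequency vectors for the required text and for each retrieved chunk, then compare them with cosine similarity against the configured threshold.

```python
import math
import re
from collections import Counter

def word_freq_vector(text: str) -> Counter:
    """Lowercase word counts: the simplest bag-of-words representation."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def semantic_match(required: str, retrieved: list[str], threshold: float = 0.80) -> bool:
    """True if any retrieved chunk is similar enough to the required text."""
    req_vec = word_freq_vector(required)
    return any(
        cosine_similarity(req_vec, word_freq_vector(chunk)) >= threshold
        for chunk in retrieved
    )
```

Because the vectors are word frequencies rather than embeddings, this mode tolerates reordering and small wording changes, but not full paraphrases.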
```bash
# Install for development
git clone https://github.com/ENDEVSOLS/LongProbe.git
cd LongProbe
uv sync --dev

# Run tests
uv run pytest tests/unit/ -v
uv run pytest tests/ -v --run-integration

# Lint and format
uv run ruff check src/
uv run ruff format src/
```

```
goldens.yaml → GoldenLoader → QueryEmbedder → RetrieverAdapter → RecallScorer
                                                                      ↓
                                                     BaselineStore → DiffReporter
```
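The pluggable-adapter design in the pipeline above can be sketched as a small protocol. This is an illustrative shape only — the `retrieve` signature and the `InMemoryAdapter` toy are assumptions, not LongProbe's real adapter interface:

```python
from typing import Protocol

class RetrieverAdapter(Protocol):
    """Anything that can return the top-K chunk IDs/texts for a query."""
    def retrieve(self, query: str, top_k: int) -> list[str]:
        ...

class InMemoryAdapter:
    """Toy adapter over a dict of chunk_id -> text, ranked by word overlap."""
    def __init__(self, chunks: dict[str, str]):
        self.chunks = chunks

    def retrieve(self, query: str, top_k: int) -> list[str]:
        terms = set(query.lower().split())
        ranked = sorted(
            self.chunks,
            key=lambda cid: len(terms & set(self.chunks[cid].lower().split())),
            reverse=True,
        )
        return ranked[:top_k]
```

Any object with a matching `retrieve` method could be scored the same way, which is what lets one harness cover Chroma, Pinecone, LangChain, and plain HTTP backends.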
1. Define your Golden Questions + Required Fact Chunks in YAML
2. Embed each question using your configured embedding model
3. Retrieve from your live vector store using the pluggable adapter
4. Score each question by checking whether required chunks appear in the Top-K results
5. Compare against saved baselines to detect regressions
6. Report a Recall Score and a diff of lost chunks, and optionally fail CI/CD
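The scoring and diff steps above reduce to simple set arithmetic. A minimal sketch with hypothetical helper names (not LongProbe's API): recall is the fraction of required chunks found in the top-K results, and the regression diff is the set difference between the baseline's hits and the current run's hits.

```python
def recall_at_k(required: set[str], retrieved_top_k: list[str]) -> float:
    """Fraction of required chunks that appear among the top-K retrieved IDs."""
    if not required:
        return 1.0
    found = required & set(retrieved_top_k)
    return len(found) / len(required)

def diff_hits(baseline_hits: set[str], current_hits: set[str]) -> dict[str, set[str]]:
    """Chunks lost since the baseline (regressions) and newly gained ones."""
    return {
        "lost": baseline_hits - current_hits,
        "gained": current_hits - baseline_hits,
    }

# Example: q1 requires two contract chunks; the retriever returns top-5 IDs
required = {"contracts_chunk_42", "contracts_chunk_43"}
top_k = ["contracts_chunk_42", "intro_chunk_1", "contracts_chunk_90",
         "contracts_chunk_43", "appendix_chunk_7"]
print(recall_at_k(required, top_k))  # 1.0 — both required chunks retrieved
print(diff_hits({"contracts_chunk_42", "contracts_chunk_43"},
                {"contracts_chunk_42"}))  # chunk_43 was lost
```

Averaging `recall_at_k` over every golden question gives the overall Recall Score, and a non-empty `lost` set is exactly what `fail_on_regression` turns into a failing exit code.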
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
For security issues, please see SECURITY.md.
MIT License — see LICENSE for details.


