Most teams evaluate LLM responses with another LLM. That doesn't survive an audit. Groundlens replaces the judge with a geometric scorer: every response gets a deterministic, sub-second grounding score so humans review the bottom percentile, not everything. No second LLM in the loop. Designed for production deployment in regulated industries.
Groundlens is a triage tool: it directs immediate attention to the responses that need it most. It does not classify responses as right or wrong. It ranks them by grounding signal, exposes the riskiest first, and stays out of the human reviewer's way for the rest.
| Problem with LLM-as-judge | The triage approach |
|---|---|
| Non-deterministic — same response, different verdict next month | Same input → same score, byte-identical, indefinitely. Reproducible across years for audit. |
| Circular — the judge LLM has the same failure modes as the LLM it judges | Geometric scorer over embedding space. No second LLM in the loop. |
| Doesn't scale — $0.05–$0.20 per call × 1M responses/month = $50K–$200K/month just to validate | Sub-second per response, marginal cost ~$0 post-deployment. Score every output, not a sample. |
| Binary verdicts hide nuance, force arbitrary thresholds | Continuous score for ranking. Review the bottom percentile of your batch — the threshold is operational, not metaphysical. |
| Comparable historicals are lost when you upgrade the judge | Pure embedding geometry. Method does not change with model upgrades. Time-series analysis stays valid. |
| Agentic pipelines are black boxes | LangGraph callback auto-scores every LLM call with per-node triage and structured traces. |
SGI: Semantic Grounding Index (with context) | DGI: Directional Grounding Index (without context)
| Goal | Start here |
|---|---|
| Triage my RAG pipeline outputs | SGI quick start · RAG triage guide |
| Score chat responses without context | DGI quick start · DGI deep dive |
| Score every LLM call in my LangGraph agent | LangGraph quick start · LangGraph docs |
| Rank a batch of outputs for review | Batch evaluation · Batch guide |
| Wrap my LLM provider with auto-scoring | Provider guard · Providers docs |
| Integrate with LangChain / CrewAI / etc. | Integrations · Integration docs |
| Improve accuracy for my domain | Domain calibration · Calibration guide |
| Comply with the EU AI Act | EU AI Act guide |
| Understand the math | How it works · Research papers |
| Understand what it can and cannot detect | Hallucination taxonomy |
| Check my environment is set up correctly | groundlens doctor |
| Contribute | CONTRIBUTING.md · CLAUDE.md · AGENTS.md |
pip install groundlens
With LLM provider support:
pip install "groundlens[openai]" # OpenAI
pip install "groundlens[anthropic]" # Anthropic
pip install "groundlens[google]" # Google Generative AI
pip install "groundlens[providers]" # All providers
With framework integrations:
pip install "groundlens[langgraph]" # LangGraph (agentic pipelines)
pip install "groundlens[langchain]" # LangChain
pip install "groundlens[crewai]" # CrewAI
pip install "groundlens[semantic-kernel]" # Semantic Kernel
pip install "groundlens[autogen]" # AutoGen
pip install "groundlens[all]" # Everything
Requirements: Python 3.10+, numpy, sentence-transformers.
SGI (Semantic Grounding Index) measures whether a response engaged with the provided context or stayed anchored to the question. It requires three inputs.
from groundlens import compute_sgi
result = compute_sgi(
    question="What is the capital of France?",
    context="France is in Western Europe. Its capital is Paris.",
    response="The capital of France is Paris.",
)
print(result.value) # 1.23 (ratio of distances)
print(result.normalized) # 0.61 (mapped to [0, 1])
print(result.flagged) # False (above review threshold)
print(result.explanation) # "SGI=1.230 — strong context engagement (pass)"
Interpretation: SGI is a continuous score, not a verdict. Higher values mean the response engaged more with the source material than with the question. Use SGI for ranking and triage; default thresholds (flagged, review, trusted) are provided as interpretation aids but can be tuned to your operating point.
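One way to tune the operating point is to skip fixed thresholds and review a fixed share of each batch. The sketch below is a plain-Python illustration, not part of the groundlens API: `bottom_percentile` is a hypothetical helper and the score list is made-up data standing in for a batch of `result.value` readings.

```python
import math


def bottom_percentile(scores, percent=5.0):
    """Return the indices of the lowest-scoring `percent` of items, lowest first.

    Hypothetical helper for illustration; it is not a groundlens function.
    """
    if not scores:
        return []
    # Always surface at least one item so a small batch still gets a review.
    k = max(1, math.floor(len(scores) * percent / 100))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return ranked[:k]


# Illustrative SGI values only, not real measurements.
sgi_values = [1.31, 0.88, 1.05, 1.42, 0.97, 1.18, 1.27, 0.91, 1.36, 1.22]
review = bottom_percentile(sgi_values, percent=20)
print(review)  # indices of the two lowest scores in this batch
```

Because the cut is relative to the batch, the review workload stays constant even if the score distribution shifts between domains.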
DGI (Directional Grounding Index) detects hallucinations without requiring source context. It checks whether the question-to-response displacement vector aligns with the characteristic direction of verified grounded responses.
from groundlens import compute_dgi
result = compute_dgi(
    question="What causes seasons on Earth?",
    response="Seasons are caused by Earth's 23.5-degree axial tilt.",
)
print(result.value) # 0.42 (cosine similarity to reference direction)
print(result.normalized) # 0.71 (mapped to [0, 1])
print(result.flagged) # False (above pass threshold of 0.30)
Domain calibration improves DGI accuracy from AUROC ~0.8 with a basic calibration to 0.90-0.99 with domain-specific calibration:
from groundlens import compute_dgi
result = compute_dgi(
    question="What is the statute of limitations for breach of contract in California?",
    response="Four years under California Code of Civil Procedure Section 337.",
    reference_csv="legal_calibration_pairs.csv",
)
LangGraph agents chain multiple LLM calls through tool-use, retrieval, and reasoning nodes. Groundlens intercepts every LLM call, auto-selects SGI or DGI based on available context, and builds a structured trace with per-node triage labels.
from langgraph.graph import StateGraph
from groundlens.integrations.langgraph import GroundlensLangGraphCallback
# Attach the callback to your LangGraph agent
# (`graph` is the StateGraph you have already built and populated)
callback = GroundlensLangGraphCallback()
app = graph.compile()
result = app.invoke({"messages": [...]}, config={"callbacks": [callback]})
# Get the structured trace
trace = callback.get_trace()
print(trace.summary())
# "3 steps | 2 trusted, 1 flagged | retriever [SGI=1.30 trusted], ..."
# Triage: which nodes need review?
for step in trace.steps:
    if step.triage == "flagged":
        print(f" {step.node_name}: {step.method}={step.score.value:.3f}")
# Export an interactive HTML report
trace.to_html(path="triage_report.html")
# Or get structured data for logging
trace.to_json() # JSON string
trace.to_dict() # Python dict
The callback hooks into LangGraph's lifecycle events. When a tool produces output, it becomes the context for the next LLM call (scored with SGI). When no tool output is available, the LLM call is scored with DGI. Each step gets a triage label so the reviewer goes straight to the nodes that matter.
In a multi-step agent, a hallucination in step 2 can cascade through steps 3-5 and produce a confidently wrong final answer. Groundlens gives you per-node visibility so you can catch problems where they originate, not after they compound.
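The routing rule described above (tool output becomes context for the next LLM call, otherwise fall back to DGI) can be pictured with a small state machine. This is a sketch under stated assumptions, not the callback's actual implementation: `RoutingSketch`, `Step`, and the two method names are hypothetical, and the choice to clear the context after one LLM call is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    node_name: str
    method: str  # "sgi" when context was available, otherwise "dgi"
    context: Optional[str]


@dataclass
class RoutingSketch:
    """Hypothetical stand-in for the SGI/DGI selection logic."""

    pending_context: Optional[str] = None
    steps: List[Step] = field(default_factory=list)

    def on_tool_output(self, output: str) -> None:
        # The latest tool output is held as context for the next LLM call.
        self.pending_context = output

    def on_llm_call(self, node_name: str) -> Step:
        context = self.pending_context
        method = "sgi" if context is not None else "dgi"
        # Assumption: each tool output is consumed by exactly one LLM call.
        self.pending_context = None
        step = Step(node_name, method, context)
        self.steps.append(step)
        return step


router = RoutingSketch()
planner = router.on_llm_call("planner")         # no tool output yet
router.on_tool_output("Retrieved document text")
answerer = router.on_llm_call("answerer")       # tool output available
print([(s.node_name, s.method) for s in router.steps])
```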
The evaluate() function picks the right method automatically: SGI when context is provided, DGI when it is not.
from groundlens import evaluate
# With context -> SGI
score = evaluate(
    question="What is X?",
    response="X is Y.",
    context="According to the manual, X is Y.",
)
assert score.method == "sgi"
# Without context -> DGI
score = evaluate(
    question="What is X?",
    response="X is Y.",
)
assert score.method == "dgi"
from groundlens import evaluate_batch
items = [
    {"question": "Q1?", "response": "A1.", "context": "Source."},
    {"question": "Q2?", "response": "A2."},
    {"question": "Q3?", "response": "A3.", "context": "Reference."},
]
results = evaluate_batch(items)
# Triage: sort by raw score (lowest first), review the bottom percentile
sorted_results = sorted(results, key=lambda r: r.score.value)
to_review = sorted_results[:max(1, len(sorted_results) // 20)] # bottom 5%
print(f"{len(to_review)}/{len(results)} flagged for human review")
# Check environment health
groundlens doctor
# Single response check
groundlens check \
    --question "What is the capital of France?" \
    --response "The capital of France is Paris." \
    --context "France is in Western Europe. Its capital is Paris."
# Batch CSV evaluation
groundlens evaluate input.csv --output results.csv
# Domain calibration
groundlens calibrate --pairs domain_pairs.csv --output calibration.json
# Run the confabulation benchmark
groundlens benchmark
from groundlens.providers.openai import OpenAIProvider
provider = OpenAIProvider(model="gpt-4o")
response = provider.complete(
    prompt="Summarize this document.",
    context="The document text here...",
)
# Surface low-grounding responses for review instead of returning them directly
if response.groundlens_score and response.groundlens_score.flagged:
    print("Low-grounding response: surface for human review.")
else:
    print(response.text)
Not all hallucinations are the same. Groundlens is built on a geometric taxonomy (arXiv:2602.13224) that classifies hallucinations by their geometric signature in embedding space, which determines whether they are detectable and which scoring method applies.
Every text maps to a point on the hypersphere $S^{d-1}$. The question $q$ and context $c$ define a geodesic arc. Grounded responses (blue) fall inside the plausibility region $\mathcal{P}_q$. Type I (purple) stays near $q$: the response ignored the context. Type II (red) deviates far from both $q$ and $c$: invented content. Type III (pink) lands inside $\mathcal{P}_q$ alongside the correct answer: same vocabulary and structure, wrong facts, geometrically indistinguishable.
| Type | What happens | Example | Triage signal |
|---|---|---|---|
| Type I — Unfaithfulness | Response ignores the provided source and defaults to the question | RAG system returns an answer from memory instead of from the retrieved document | SGI (distance ratio) |
| Type II — Confabulation | Response invents content outside the topic's vocabulary | Asked about CRISPR gene editing, the model describes protein-folding correction instead | DGI (displacement direction) |
| Type III — Within-frame error | Response uses the right vocabulary and structure but gets the facts wrong | "The capital of Australia is Canberra" vs. "The capital of Australia is Sydney" — same frame, wrong city | Undetectable by geometry |
Why Type III is undetectable: Sentence embeddings encode distributional similarity (vocabulary, syntax, co-occurrence), not truth value. Two responses that share the same words, entities, and syntactic frame land in the same region of embedding space regardless of which one is correct. This is not a limitation of groundlens — it is a property of the distributional hypothesis (Harris, 1954) that constrains every embedding-based method, including NLI (which inverts to AUROC 0.311 on TruthfulQA, actively favoring false answers over truthful ones).
Implications: Groundlens is triage — it ranks the hallucination types that leave geometric traces (Types I and II), which are the most common and most damaging in production. For Type III errors in high-stakes domains (medical, legal, financial), complement groundlens with claim-level fact-checking tools on the outputs that pass geometric triage. See Complementary Tools for Type III.
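The Type III argument can be made concrete with a deliberately crude stand-in for an embedding. The sketch below uses bag-of-words cosine similarity, which is not what groundlens uses (it relies on sentence-transformers); the point is only that a purely distributional representation places the correct and incorrect sentence close together because they share almost every token. `bow_cosine` is a hypothetical helper.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors (toy stand-in)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


correct = "The capital of Australia is Canberra"
wrong = "The capital of Australia is Sydney"
off_topic = "Seasons are caused by axial tilt"

same_frame = bow_cosine(correct, wrong)      # five of six tokens shared
different = bow_cosine(correct, off_topic)   # no tokens shared
print(f"same frame: {same_frame:.3f}, off topic: {different:.3f}")
```

The two same-frame sentences differ in truth value but not in distributional profile, which is why a claim-level fact-checker is the suggested complement.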
Each scoring method targets a specific hallucination type from the taxonomy above.
When context is available, SGI measures whether the response engaged with the source or stayed anchored to the question:
SGI = dist(phi(response), phi(question)) / dist(phi(response), phi(context))
| Score | Interpretation |
|---|---|
| SGI > 1.20 | Strong context engagement (high-trust band) |
| 0.95 < SGI < 1.20 | Partial engagement (review recommended) |
| SGI < 0.95 | Weak engagement (low-trust band — possible Type I) |
Thresholds are interpretation aids. For production triage, sort by raw result.value and surface the bottom percentile of your batch.
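The ratio and the bands above can be sketched in a few lines. This is an illustration under assumptions, not the library's code: the text does not say which distance `dist` denotes, so Euclidean distance is assumed here, the 2-D vectors are toy stand-ins for `phi(...)` embeddings, and scores exactly on a boundary (0.95 or 1.20) are placed in the review band by choice.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def sgi(response_vec, question_vec, context_vec):
    """Distance to the question divided by distance to the context."""
    d_question = euclidean(response_vec, question_vec)
    d_context = euclidean(response_vec, context_vec)
    if d_context == 0:
        return math.inf  # response coincides with the context
    return d_question / d_context


def sgi_band(value):
    """Map a raw SGI value to the interpretation bands from the table."""
    if value > 1.20:
        return "high-trust"
    if value < 0.95:
        return "low-trust"
    return "review"  # assumption: boundary values land here


# Toy 2-D vectors, not real embeddings.
question, context = (1.0, 0.0), (0.0, 1.0)
near_context = (0.2, 0.9)
near_question = (0.9, 0.2)
midway = (0.5, 0.5)

for name, vec in [("near_context", near_context),
                  ("near_question", near_question),
                  ("midway", midway)]:
    print(name, sgi_band(sgi(vec, question, context)))
```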
When no context is available, DGI checks whether the question-to-response displacement aligns with a learned "grounded direction":
delta = phi(response) - phi(question)
DGI = dot(delta / ||delta||, mu_hat)
| Score | Interpretation |
|---|---|
| DGI > 0.30 | Aligns with grounded patterns (high-trust band) |
| 0.00 < DGI < 0.30 | Weak alignment (low-trust band — possible Type II) |
| DGI < 0.00 | Opposes grounded direction (highest priority for review) |
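The two DGI formulas translate directly into code. The sketch below is a plain-Python illustration with toy 2-D vectors standing in for `phi(...)` embeddings and for the learned direction `mu_hat`; the function names are hypothetical, and placing boundary values (exactly 0.00 or 0.30) in the low-trust band is an assumption.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dgi(response_vec, question_vec, mu_hat):
    """Cosine between the unit displacement and the reference direction."""
    delta = [r - q for r, q in zip(response_vec, question_vec)]
    unit_delta = normalize(delta)
    return sum(d * m for d, m in zip(unit_delta, mu_hat))


def dgi_band(value):
    """Map a raw DGI value to the interpretation bands from the table."""
    if value > 0.30:
        return "high-trust"
    if value < 0.0:
        return "opposes grounded direction"
    return "low-trust"  # assumption: boundary values land here


# Toy vectors: the reference direction points along the second axis.
mu_hat = [0.0, 1.0]
question = [1.0, 0.0]
aligned = dgi([1.0, 1.0], question, mu_hat)    # displacement along mu_hat
sideways = dgi([2.0, 0.1], question, mu_hat)   # mostly orthogonal
opposed = dgi([1.0, -1.0], question, mu_hat)   # displacement against mu_hat

for value in (aligned, sideways, opposed):
    print(f"{value:+.3f} -> {dgi_band(value)}")
```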
| Component | Install extra | Description |
|---|---|---|
| LangGraph | langgraph | Callback handler for agentic pipelines: auto-scores every LLM call, structured traces, HTML triage reports |
| OpenAI | openai | Wraps openai SDK with automatic scoring |
| Anthropic | anthropic | Wraps anthropic SDK with automatic scoring |
| Google | google | Wraps google-generativeai with automatic scoring |
| LangChain | langchain | Evaluator + callback handler |
| CrewAI | crewai | Tool for agent pipelines |
| Semantic Kernel | semantic-kernel | Function calling filter |
| AutoGen | autogen | Agent chat checker |
Generic DGI uses a bundled reference direction that achieves AUROC ~0.8 with a basic calibration. For production use, a domain-specific calibration can be applied (a minimum of 200 queries recommended):
from groundlens import calibrate
result = calibrate(csv_path="my_domain_pairs.csv")
print(f"Concentration: {result.concentration:.2f}")
result.save("calibration.json")
Domain-specific calibration typically reaches AUROC 0.90-0.99. The confabulation benchmark (arXiv:2603.13259) reports DGI AUROC 0.958 with domain calibration.
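To give a feel for what calibration might be doing, here is one plausible reading, stated as an assumption rather than as the behavior of `calibrate()`: take verified grounded question/response pairs, average their unit displacement vectors to get a reference direction, and report the length of that average as a concentration measure (close to 1 when the displacements agree). The function name and the toy vectors are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def estimate_direction(pairs):
    """Estimate a reference direction from (question_vec, response_vec) pairs.

    Returns (mu_hat, concentration). Assumed method: mean of unit
    displacements; concentration is the length of that mean, in (0, 1].
    """
    units = [unit([r - q for q, r in zip(q_vec, r_vec)])
             for q_vec, r_vec in pairs]
    dim = len(units[0])
    mean = [sum(u[i] for u in units) / len(units) for i in range(dim)]
    length = math.sqrt(sum(x * x for x in mean))
    if length == 0:
        raise ValueError("displacements cancel out; no direction to estimate")
    return [x / length for x in mean], length


# Toy 2-D pairs standing in for embeddings of verified grounded answers.
pairs = [
    ([0.0, 0.0], [0.0, 2.0]),
    ([1.0, 0.0], [1.0, 3.0]),
    ([0.0, 0.0], [1.0, 1.0]),
]
mu_hat, concentration = estimate_direction(pairs)
print(f"mu_hat={mu_hat}, concentration={concentration:.2f}")
```

Under this reading, a low concentration would suggest the calibration pairs do not share a consistent direction, which is one reason a larger sample is recommended.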
┌─────────────────────────────────────────────┐
│ Public API (evaluate) │
├──────────────────┬──────────────────────────┤
│ SGI (sgi.py) │ DGI (dgi.py) │
├──────────────────┴──────────────────────────┤
│ _internal (geometry, embeddings) │
├─────────────────────────────────────────────┤
│ sentence-transformers (all-MiniLM-L6-v2) │
└─────────────────────────────────────────────┘
▲ ▲
│ │
┌──────┴──────┐ ┌────────┴─────────┐
│ Providers │ │ Integrations │
│ (OpenAI, │ │ (LangGraph, │
│ Anthropic, │ │ LangChain, │
│ Google) │ │ CrewAI, SK, │
│ │ │ AutoGen) │
└─────────────┘ └──────────────────┘
See AGENTS.md for detailed file-by-file documentation. See CLAUDE.md for AI-assisted development guidelines.
groundlens implements the methods described in three research papers:
- Semantic Grounding Index (SGI): Marin, J. (2025). Semantic Grounding Index for LLM Hallucination Detection. arXiv:2512.13771
- Hallucination Taxonomy and Directional Grounding Index (DGI): Marin, J. (2026). A Geometric Taxonomy of Hallucinations in Large Language Models. arXiv:2602.13224
- Mechanistic Interpretability: Marin, J. (2026). Rotational Dynamics of Factual Constraint Processing in Large Language Models. arXiv:2603.13259
