Groundlens

Triage for LLM outputs. Geometric, deterministic, auditable.


Most teams evaluate LLM responses with another LLM. That doesn't survive an audit. Groundlens replaces the judge with a geometric scorer: every response gets a deterministic, sub-second grounding score so humans review the bottom percentile, not everything. No second LLM in the loop. Designed for production deployment in regulated industries.

Groundlens is triage: it directs reviewer attention to the highest-priority items first. It does not classify responses as right or wrong. It ranks them by grounding signal, exposes the riskiest first, and stays out of the human reviewer's way for the rest.

Why triage instead of an LLM judge?

Problem with LLM-as-judge	The triage approach
Non-deterministic — same response, different verdict next month	Same input → same score, byte-identical, indefinitely. Reproducible across years for audit.
Circular — the judge LLM has the same failure modes as the LLM it judges	Geometric scorer over embedding space. No second LLM in the loop.
Doesn't scale — $0.05–$0.20 per call × 1M responses/month = $50K–$200K/month just to validate	Sub-second per response, marginal cost ~$0 post-deployment. Score every output, not a sample.
Binary verdicts hide nuance, force arbitrary thresholds	Continuous score for ranking. Review the bottom percentile of your batch — the threshold is operational, not metaphysical.
Comparable historicals are lost when you upgrade the judge	Pure embedding geometry. Method does not change with model upgrades. Time-series analysis stays valid.
Agentic pipelines are black boxes	LangGraph callback auto-scores every LLM call with per-node triage and structured traces.

SGI: Semantic Grounding Index (with context) | DGI: Directional Grounding Index (without context)

I want to...

Goal	Start here
Triage my RAG pipeline outputs	SGI quick start · RAG triage guide
Score chat responses without context	DGI quick start · DGI deep dive
Score every LLM call in my LangGraph agent	LangGraph quick start · LangGraph docs
Rank a batch of outputs for review	Batch evaluation · Batch guide
Wrap my LLM provider with auto-scoring	Provider guard · Providers docs
Integrate with LangChain / CrewAI / etc.	Integrations · Integration docs
Improve accuracy for my domain	Domain calibration · Calibration guide
Comply with the EU AI Act	EU AI Act guide
Understand the math	How it works · Research papers
Understand what it can and cannot detect	Hallucination taxonomy
Check my environment is set up correctly	`groundlens doctor`
Contribute	CONTRIBUTING.md · CLAUDE.md · AGENTS.md

Installation

pip install groundlens

With LLM provider support:

pip install "groundlens[openai]"          # OpenAI
pip install "groundlens[anthropic]"       # Anthropic
pip install "groundlens[google]"          # Google Generative AI
pip install "groundlens[providers]"       # All providers

With framework integrations:

pip install "groundlens[langgraph]"       # LangGraph (agentic pipelines)
pip install "groundlens[langchain]"       # LangChain
pip install "groundlens[crewai]"          # CrewAI
pip install "groundlens[semantic-kernel]" # Semantic Kernel
pip install "groundlens[autogen]"         # AutoGen
pip install "groundlens[all]"             # Everything

Requirements: Python 3.10+, numpy, sentence-transformers.

Quick start

SGI -- with context (RAG verification)

SGI (Semantic Grounding Index) measures whether a response engaged with the provided context or stayed anchored to the question. It requires three inputs: the question, the context, and the response.

from groundlens import compute_sgi

result = compute_sgi(
    question="What is the capital of France?",
    context="France is in Western Europe. Its capital is Paris.",
    response="The capital of France is Paris.",
)

print(result.value)        # 1.23 — ratio of distances
print(result.normalized)   # 0.61 — mapped to [0, 1]
print(result.flagged)      # False — above review threshold
print(result.explanation)  # "SGI=1.230 — strong context engagement (pass)"

Interpretation: SGI is a continuous score, not a verdict. Higher values mean the response engaged more with the source material than with the question. Use SGI for ranking and triage; default thresholds (flagged, review, trusted) are provided as interpretation aids but can be tuned to your operating point.

DGI -- without context

DGI (Directional Grounding Index) detects hallucinations without requiring source context. It checks whether the question-to-response displacement vector aligns with the characteristic direction of verified grounded responses.

from groundlens import compute_dgi

result = compute_dgi(
    question="What causes seasons on Earth?",
    response="Seasons are caused by Earth's 23.5-degree axial tilt.",
)

print(result.value)        # 0.42 — cosine similarity to reference direction
print(result.normalized)   # 0.71 — mapped to [0, 1]
print(result.flagged)      # False — above pass threshold (0.30)

Domain-specific calibration raises DGI AUROC from roughly 0.8 (the bundled general calibration) to 0.90-0.99:

from groundlens import compute_dgi

result = compute_dgi(
    question="What is the statute of limitations for breach of contract in California?",
    response="Four years under California Code of Civil Procedure Section 337.",
    reference_csv="legal_calibration_pairs.csv",
)
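
The AUROC figures quoted above can be read as the probability that a randomly chosen grounded response scores higher than a randomly chosen hallucinated one. A minimal sketch of that pairwise definition follows. The scores and labels are made up for illustration (they are not benchmark data), and `auroc` is an illustrative helper, not a groundlens function.

```python
def auroc(scores, labels):
    """Pairwise AUROC. labels: 1 = grounded, 0 = hallucinated. Ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    # Count grounded/hallucinated pairs where the grounded item scores higher.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical DGI-like scores: three grounded, three hallucinated.
scores = [0.62, 0.48, 0.35, 0.41, 0.10, -0.05]
labels = [1, 1, 1, 0, 0, 0]
print(round(auroc(scores, labels), 3))
```

Here 8 of the 9 grounded/hallucinated pairs are ordered correctly, so the result is about 0.89. An AUROC of 0.5 would mean the score carries no ranking signal.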

LangGraph -- agentic pipeline scoring

LangGraph agents chain multiple LLM calls through tool-use, retrieval, and reasoning nodes. Groundlens intercepts every LLM call, auto-selects SGI or DGI based on available context, and builds a structured trace with per-node triage labels.

from langgraph.graph import StateGraph
from groundlens.integrations.langgraph import GroundlensLangGraphCallback

# Attach the callback to your LangGraph agent
# (`graph` is your existing StateGraph, built elsewhere with its nodes and edges)
callback = GroundlensLangGraphCallback()
app = graph.compile()
result = app.invoke({"messages": [...]}, config={"callbacks": [callback]})

# Get the structured trace
trace = callback.get_trace()
print(trace.summary())
# "3 steps | 2 trusted, 1 flagged | retriever [SGI=1.30 trusted], ..."

# Triage: which nodes need review?
for step in trace.steps:
    if step.triage == "flagged":
        print(f"  {step.node_name}: {step.method}={step.score.value:.3f}")

# Export an interactive HTML report
trace.to_html(path="triage_report.html")

# Or get structured data for logging
trace.to_json()  # JSON string
trace.to_dict()  # Python dict

The callback hooks into LangGraph's lifecycle events. When a tool produces output, it becomes the context for the next LLM call (scored with SGI). When no tool output is available, the LLM call is scored with DGI. Each step gets a triage label so the reviewer goes straight to the nodes that matter.

In a multi-step agent, a hallucination in step 2 can cascade through steps 3-5 and produce a confidently wrong final answer. Groundlens gives you per-node visibility so you can catch problems where they originate, not after they compound.
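
The routing rule described above (tool output becomes the context for the next LLM call, and an LLM call with no tool output is scored with DGI) can be sketched as a small state machine. This illustrates the rule only and is not the callback's actual implementation. The event tuples, the `route_steps` helper, and the choice to clear the context after one LLM call are all assumptions.

```python
def route_steps(events):
    """Decide SGI vs DGI for each LLM call in a stream of agent events.

    events: list of (kind, node_name, text), kind is "tool" or "llm".
    Returns a list of (node_name, method) for the LLM calls only.
    """
    pending_context = None
    routed = []
    for kind, node_name, text in events:
        if kind == "tool":
            # Tool output is held as context for the next LLM call.
            pending_context = text
        elif kind == "llm":
            method = "SGI" if pending_context else "DGI"
            routed.append((node_name, method))
            # Assumption: context is consumed by a single LLM call.
            pending_context = None
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return routed

events = [
    ("llm", "planner", "I should look up the refund policy."),
    ("tool", "retriever", "Policy text: refunds within 30 days."),
    ("llm", "answerer", "Refunds are available within 30 days."),
]
print(route_steps(events))
# [('planner', 'DGI'), ('answerer', 'SGI')]
```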

evaluate() -- auto-select

The evaluate() function picks the right method automatically: SGI when context is provided, DGI when it is not.

from groundlens import evaluate

# With context -> SGI
score = evaluate(
    question="What is X?",
    response="X is Y.",
    context="According to the manual, X is Y.",
)
assert score.method == "sgi"

# Without context -> DGI
score = evaluate(
    question="What is X?",
    response="X is Y.",
)
assert score.method == "dgi"

Batch evaluation

from groundlens import evaluate_batch

items = [
    {"question": "Q1?", "response": "A1.", "context": "Source."},
    {"question": "Q2?", "response": "A2."},
    {"question": "Q3?", "response": "A3.", "context": "Reference."},
]

results = evaluate_batch(items)

# Triage: sort by score (ascending), review the bottom 5%
sorted_results = sorted(results, key=lambda r: r.score.value)
to_review = sorted_results[:max(1, len(sorted_results) // 20)]   # bottom 5%
print(f"{len(to_review)}/{len(results)} flagged for human review")
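
SGI is a distance ratio and DGI is a cosine, so the two are on different scales. When a batch mixes both methods, as the one above does, it may be safer to rank within each method rather than across the whole batch. The sketch below shows that idea on plain tuples. The `bottom_fraction` helper and the sample records are hypothetical and do not use the groundlens result types.

```python
import math
from collections import defaultdict

def bottom_fraction(records, fraction=0.05):
    """Pick the lowest-scoring items per method.

    records: list of (item_id, method, value).
    Returns item ids to review: the lowest `fraction` of each method's
    scores, rounded up, with at least one item per method.
    """
    by_method = defaultdict(list)
    for item_id, method, value in records:
        by_method[method].append((value, item_id))

    to_review = []
    for method in sorted(by_method):
        ranked = sorted(by_method[method])  # ascending by score
        k = max(1, math.ceil(len(ranked) * fraction))
        to_review.extend(item_id for _, item_id in ranked[:k])
    return to_review

# Hypothetical scores: three SGI results and two DGI results.
records = [
    ("a", "sgi", 1.31), ("b", "sgi", 0.88), ("c", "sgi", 1.05),
    ("d", "dgi", 0.42), ("e", "dgi", -0.07),
]
print(bottom_fraction(records))  # ['e', 'b']
```

Ranking across the whole batch would compare a DGI cosine of 0.42 against an SGI ratio of 0.88 directly, which has no clear meaning.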

CLI

# Check environment health
groundlens doctor

# Single response check
groundlens check \
  --question "What is the capital of France?" \
  --response "The capital of France is Paris." \
  --context "France is in Western Europe. Its capital is Paris."

# Batch CSV evaluation
groundlens evaluate input.csv --output results.csv

# Domain calibration
groundlens calibrate --pairs domain_pairs.csv --output calibration.json

# Run the confabulation benchmark
groundlens benchmark

LLM provider guard

from groundlens.providers.openai import OpenAIProvider

provider = OpenAIProvider(model="gpt-4o")
response = provider.complete(
    prompt="Summarize this document.",
    context="The document text here...",
)

if response.groundlens_score and response.groundlens_score.flagged:
    print("Low-grounding response — surface for human review.")
else:
    print(response.text)

Taxonomy of LLM hallucinations

Not all hallucinations are the same. Groundlens is built on a geometric taxonomy (arXiv:2602.13224) that classifies hallucinations by their geometric signature in embedding space — which determines whether they are detectable and which scoring method applies.

Figure: Hallucination taxonomy on the unit hypersphere. Every text maps to a point on the hypersphere $S^{d-1}$. The question q and context c define a geodesic arc. Grounded responses (blue) fall inside the plausibility region 𝒫_q. Type I (purple) stays near q: the response ignored the context. Type II (red) deviates far from both q and c: invented content. Type III (pink) lands inside 𝒫_q alongside the correct answer: same vocabulary and structure, wrong facts, geometrically indistinguishable.

Type	What happens	Example	Triage signal
Type I — Unfaithfulness	Response ignores the provided source and defaults to the question	RAG system returns an answer from memory instead of from the retrieved document	SGI (distance ratio)
Type II — Confabulation	Response invents content outside the topic's vocabulary	Asked about CRISPR gene editing, the model describes protein-folding correction instead	DGI (displacement direction)
Type III — Within-frame error	Response uses the right vocabulary and structure but gets the facts wrong	"The capital of Australia is Canberra" vs. "The capital of Australia is Sydney" — same frame, wrong city	Undetectable by geometry

Why Type III is undetectable: Sentence embeddings encode distributional similarity (vocabulary, syntax, co-occurrence), not truth value. Two responses that share the same words, entities, and syntactic frame land in the same region of embedding space regardless of which one is correct. This is not a limitation of groundlens — it is a property of the distributional hypothesis (Harris, 1954) that constrains every embedding-based method, including NLI (which inverts to AUROC 0.311 on TruthfulQA, actively favoring false answers over truthful ones).
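
A toy illustration of this point uses bag-of-words count vectors as a crude stand-in for sentence embeddings. That substitution is an assumption made only so the example runs without a model: real sentence embeddings are dense and learned, though they share the distributional character described above. Swapping one entity leaves the two sentences almost identical, and nothing in either vector says which city is correct.

```python
import math
from collections import Counter

def bow_cosine(a, b):
    """Cosine similarity between word-count vectors of two sentences."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b)

true_answer = "The capital of Australia is Canberra"
false_answer = "The capital of Australia is Sydney"
off_topic = "Protein folding is corrected by chaperones"

# Same frame, different city: five of six words shared.
print(round(bow_cosine(true_answer, false_answer), 3))  # 0.833
# Different topic: only "is" is shared.
print(round(bow_cosine(true_answer, off_topic), 3))     # 0.167
```

The off-topic sentence separates cleanly, which is the kind of signal geometry can use. The true and false answers do not.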

Implications: Groundlens is triage — it ranks the hallucination types that leave geometric traces (Types I and II), which are the most common and most damaging in production. For Type III errors in high-stakes domains (medical, legal, financial), complement groundlens with claim-level fact-checking tools on the outputs that pass geometric triage. See Complementary Tools for Type III.

Scoring methods

Each scoring method targets a specific hallucination type from the taxonomy above.

SGI (Semantic Grounding Index) — surfaces Type I

When context is available, SGI measures whether the response engaged with the source or stayed anchored to the question:

SGI = dist(phi(response), phi(question)) / dist(phi(response), phi(context))

Score	Interpretation
SGI > 1.20	Strong context engagement (high-trust band)
0.95 < SGI < 1.20	Partial engagement (review recommended)
SGI < 0.95	Weak engagement (low-trust band — possible Type I)

Thresholds are interpretation aids. For production triage, sort by raw result.value and surface the bottom percentile of your batch.
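
The ratio above can be reproduced on toy vectors. This is a minimal sketch, assuming Euclidean distance between unit-normalized vectors (the formula does not say which `dist` is used) and hand-picked 3-d vectors standing in for `phi(...)`. The functions `sgi` and `sgi_band` are illustrative and not part of the groundlens API, and the handling of values exactly at 0.95 or 1.20 is a guess, since the table leaves the boundaries open.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dist(a, b):
    """Euclidean distance (an assumption; the source only writes dist)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sgi(question_vec, context_vec, response_vec):
    """SGI = dist(response, question) / dist(response, context)."""
    q, c, r = (normalize(v) for v in (question_vec, context_vec, response_vec))
    return dist(r, q) / dist(r, c)

def sgi_band(value):
    """Map a raw SGI value onto the interpretation bands in the table above."""
    if value > 1.20:
        return "high-trust"
    if value < 0.95:
        return "low-trust"
    return "review"  # boundary values land here by assumption

# Toy stand-ins for phi(question) and phi(context).
question = [1.0, 0.0, 0.0]
context = [0.0, 1.0, 0.0]
near_context = [0.2, 0.9, 0.1]   # response that moved toward the context
near_question = [0.9, 0.2, 0.1]  # response that stayed near the question

engaged = sgi(question, context, near_context)
anchored = sgi(question, context, near_question)
print(sgi_band(engaged), sgi_band(anchored))  # high-trust low-trust
```

A response far from the question and close to the context gives a large numerator and a small denominator, hence a high SGI. The anchored response gives the reverse.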

DGI (Directional Grounding Index) — surfaces Type II

When no context is available, DGI checks whether the question-to-response displacement aligns with a learned "grounded direction":

delta = phi(response) - phi(question)
DGI = dot(delta / ||delta||, mu_hat)

Score	Interpretation
DGI > 0.30 $^1$	Aligns with grounded patterns (high-trust band)
0.00 < DGI < 0.30	Weak alignment (low-trust band — possible Type II)
DGI < 0.00	Opposes grounded direction (highest priority for review)

$^1$ This score corresponds to a general calibration. In domain-specific calibrations the score can vary.
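
The two DGI lines above translate directly into code. The sketch below uses hand-picked 3-d vectors in place of `phi(...)` and an arbitrary `mu_hat`; in practice the reference direction comes from calibration. The `dgi` function is an illustrative helper, not the library's implementation.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero-length vector has no direction")
    return [x / norm for x in v]

def dgi(question_vec, response_vec, mu_hat):
    """DGI = dot(delta / ||delta||, mu_hat), with delta = response - question."""
    delta = [r - q for r, q in zip(response_vec, question_vec)]
    return sum(d * m for d, m in zip(unit(delta), unit(mu_hat)))

question = [1.0, 0.0, 0.0]
mu_hat = [0.0, 1.0, 0.0]      # toy "grounded direction"

aligned = [0.8, 0.6, 0.2]     # displacement mostly along mu_hat
orthogonal = [1.0, 0.0, 1.0]  # displacement perpendicular to mu_hat
opposed = [1.0, -1.0, 0.0]    # displacement against mu_hat

for name, resp in [("aligned", aligned), ("orthogonal", orthogonal), ("opposed", opposed)]:
    print(name, round(dgi(question, resp, mu_hat), 3))
```

Because both vectors are unit length, the result is a cosine in [-1, 1]: the aligned response lands well above 0.30, the orthogonal one at 0, and the opposed one at -1.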

Providers and integrations

Component	Install extra	Description
LangGraph	`langgraph`	Callback handler for agentic pipelines — auto-scores every LLM call, structured traces, HTML triage reports
OpenAI	`openai`	Wraps `openai` SDK with automatic scoring
Anthropic	`anthropic`	Wraps `anthropic` SDK with automatic scoring
Google	`google`	Wraps `google-generativeai` with automatic scoring
LangChain	`langchain`	Evaluator + callback handler
CrewAI	`crewai`	Tool for agent pipelines
Semantic Kernel	`semantic-kernel`	Function calling filter
AutoGen	`autogen`	Agent chat checker

Domain calibration

Generic DGI uses a bundled reference direction that reaches AUROC ~0.8. For production use, apply a domain-specific calibration (at least 200 queries are recommended):

from groundlens import calibrate

result = calibrate(csv_path="my_domain_pairs.csv")
print(f"Concentration: {result.concentration:.2f}")
result.save("calibration.json")

Domain-specific calibration typically reaches AUROC 0.90-0.99. The confabulation benchmark (arXiv:2603.13259) reports DGI AUROC 0.958 with domain calibration.
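
This README does not spell out how the reference direction is fit. One plausible reading, offered as an assumption rather than as the library's method, is that `mu_hat` is the normalized mean of unit displacement vectors from verified question/response pairs, with a concentration statistic describing how tightly those displacements cluster. The sketch below uses the mean resultant length, a standard directional-statistics quantity in [0, 1], as a stand-in for that statistic. `fit_direction` is a hypothetical helper and the pairs are toy data.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def fit_direction(pairs):
    """Fit a reference direction from (question_vec, response_vec) pairs.

    Returns (mu_hat, mean_resultant_length). A length near 1 means the
    displacements point the same way; near 0 means they are scattered.
    """
    displacements = [
        unit([r - q for q, r in zip(q_vec, r_vec)])
        for q_vec, r_vec in pairs
    ]
    dim = len(displacements[0])
    mean = [sum(d[i] for d in displacements) / len(displacements) for i in range(dim)]
    resultant = math.sqrt(sum(x * x for x in mean))
    mu_hat = [x / resultant for x in mean]
    return mu_hat, resultant

# Toy verified pairs whose displacements all point roughly along the y axis.
pairs = [
    ([1.0, 0.0, 0.0], [1.0, 1.0, 0.1]),
    ([0.0, 0.0, 1.0], [0.1, 1.0, 1.0]),
    ([0.5, 0.5, 0.0], [0.5, 1.5, -0.1]),
]
mu_hat, resultant = fit_direction(pairs)
print([round(x, 3) for x in mu_hat], round(resultant, 3))
```

Under this reading, a low concentration on a calibration set would suggest the pairs do not share a consistent grounded direction, and DGI scores against that `mu_hat` would be correspondingly less informative.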

Architecture

┌─────────────────────────────────────────────┐
│             Public API (evaluate)           │
├──────────────────┬──────────────────────────┤
│   SGI (sgi.py)   │       DGI (dgi.py)       │
├──────────────────┴──────────────────────────┤
│       _internal (geometry, embeddings)      │
├─────────────────────────────────────────────┤
│   sentence-transformers (all-MiniLM-L6-v2)  │
└─────────────────────────────────────────────┘
         ▲                       ▲
         │                       │
  ┌──────┴──────┐       ┌────────┴─────────┐
  │  Providers  │       │   Integrations   │
  │  (OpenAI,   │       │   (LangGraph,    │
  │  Anthropic, │       │   LangChain,     │
  │  Google)    │       │   CrewAI, SK,    │
  │             │       │   AutoGen)       │
  └─────────────┘       └──────────────────┘

See AGENTS.md for detailed file-by-file documentation. See CLAUDE.md for AI-assisted development guidelines.

Research

groundlens implements the methods described in three research papers:

Semantic Grounding Index (SGI): Marin, J. (2025). Semantic Grounding Index for LLM Hallucination Detection. arXiv:2512.13771
Hallucination taxonomy and Directional Grounding Index (DGI): Marin, J. (2026). A Geometric Taxonomy of Hallucinations in Large Language Models. arXiv:2602.13224
Mechanistic interpretability: Marin, J. (2026). Rotational Dynamics of Factual Constraint Processing in Large Language Models. arXiv:2603.13259
