Open benchmark and corpus for evaluating whether AI-generated legal citations remain attributable, verifiable, and reconstructable across jurisdictions.
Dali evaluates whether the evidence behind an AI-generated legal citation can be independently reconstructed, verified, and re-evaluated under a fixed policy version. A citation checker asks whether a citation exists. Dali asks whether the workflow that produced it can be audited and defended.
Every Dali run produces a deterministic, policy-versioned, hash-sealed CitationIntegrityResult artifact. The deterministic Tier 1 evaluator runs offline; CI re-verifies replay equality on every pull request.
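One way to picture a hash-sealed, policy-versioned artifact is sketched below. This is an illustration under stated assumptions, not Dali's implementation: the field names (`citation`, `verdict`, `policy_version`, `seal`), the helpers `seal_result` and `verify_seal`, and the choice of canonical JSON with SHA-256 are all hypothetical. The actual CitationIntegrityResult schema is defined in the repository.

```python
import hashlib
import json


def canonical_bytes(payload: dict) -> bytes:
    """Serialize with sorted keys and fixed separators so equal content yields equal bytes."""
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return text.encode("utf-8")


def seal_result(citation: str, verdict: str, policy_version: str) -> dict:
    """Build a hypothetical result record and attach a SHA-256 seal over its body."""
    body = {"citation": citation, "verdict": verdict, "policy_version": policy_version}
    return {**body, "seal": hashlib.sha256(canonical_bytes(body)).hexdigest()}


def verify_seal(artifact: dict) -> bool:
    """Recompute the digest over everything except the seal and compare."""
    body = {key: value for key, value in artifact.items() if key != "seal"}
    return hashlib.sha256(canonical_bytes(body)).hexdigest() == artifact.get("seal")


# Placeholder inputs, not taken from the Dali corpus.
first = seal_result("Example v. Placeholder", "unverified", "policy-v1")
replayed = seal_result("Example v. Placeholder", "unverified", "policy-v1")
tampered = {**first, "verdict": "verified"}
```

In this sketch, replay equality means `first == replayed`, and any edit to a sealed field (as in `tampered`) makes `verify_seal` return `False`. Including the policy version in the sealed body means the same evidence evaluated under a different policy produces a different seal.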
The legal industry lacks shared benchmarks, public corpora, and reproducible evidence standards for studying AI-generated citation failures. Court-documented incidents have continued since Mata v. Avianca (2023), including United States v. Cohen and Park v. Kim, which anchor the Tier 1 canonical corpus in data/benchmark/tier1/corpus/citation_failure_cases.json. Dali consolidates that missing public infrastructure into one MIT-licensed, deterministically replayable artifact, with reproducibility defined through cryptographic lineage and the public methodology.
- 524 citations evaluated across 3 OpenAI models and 5 jurisdiction tracks.
- GPT-4.1: 23% of generated citation URLs return HTTP 404; on adversarial citation-trap prompts the model took the bait 76% of the time.
- Portuguese civil-law citations verified at 3%; UK common-law citations at 76%. Same models, same task, different legal system.
Full per-model leaderboard, jurisdictional breakdown, methodology, and reproducible run instructions: data/results/v0.2/ and LEADERBOARD.md. Narrative writeups of the three Tier 1 cases: CASE-STUDIES.md.
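The per-track percentages above can be read as the share of evaluated citations in a track that passed verification. That reading is an assumption here; METHODOLOGY.md is the authority on how the figures are computed. A minimal sketch of the arithmetic, using a hypothetical `verified_rate_by_track` helper and toy records that are not Dali results:

```python
from collections import defaultdict


def verified_rate_by_track(records):
    """Return {track: verified / evaluated} for (track, is_verified) pairs."""
    evaluated = defaultdict(int)
    verified = defaultdict(int)
    for track, is_verified in records:
        evaluated[track] += 1
        if is_verified:
            verified[track] += 1
    return {track: verified[track] / evaluated[track] for track in evaluated}


# Toy records for illustration only; track names and outcomes are made up.
toy_records = [
    ("track-a", True),
    ("track-a", False),
    ("track-b", False),
    ("track-b", False),
]
rates = verified_rate_by_track(toy_records)
```

For the real numbers, use the run instructions in data/results/v0.2/ rather than this sketch.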
Choose the path that matches your role:
- AI researcher / eval engineer: docs/for-researchers.md
- Legal researcher / practitioner: docs/for-legal-practitioners.md
- Software engineer: docs/for-engineers.md
- Methodology reviewer: docs/reviewer-guide.md
```bash
git clone https://github.com/yenk/Dali && cd Dali
pip install -r requirements.txt
python -m tools.cli replay
```

The Tier 1 evaluator runs entirely offline with no API keys or network access required. Every evaluation verifies replay determinism through Dali's cryptographic lineage chain.
Standalone setup guide: docs/quickstart.md.
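To make the idea of replay verification through a lineage chain concrete, here is a minimal hash-chain sketch. It is not Dali's actual lineage format, which is described in docs/cryptographic-lineage.md. The entry layout (`payload`, `prev`, `hash`), the all-zero genesis value, and the helper names are assumptions made for the example.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first link


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous link together with a canonical encoding of the payload."""
    data = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + data).hexdigest()


def build_chain(payloads: list) -> list:
    """Link each payload to its predecessor so later entries commit to earlier ones."""
    chain, prev = [], GENESIS
    for payload in payloads:
        current = entry_hash(prev, payload)
        chain.append({"payload": payload, "prev": prev, "hash": current})
        prev = current
    return chain


def verify_chain(chain: list) -> bool:
    """Check every link's back-pointer and recomputed hash."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry_hash(prev, entry["payload"]) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


def replay_equal(recorded: list, payloads: list) -> bool:
    """A replay matches when the recorded chain is intact and a rebuild reproduces it."""
    return verify_chain(recorded) and build_chain(payloads) == recorded


# Hypothetical step payloads, for illustration only.
steps = [{"step": "load_corpus"}, {"step": "evaluate"}, {"step": "seal"}]
recorded = build_chain(steps)
```

Because each hash covers the previous one, changing any earlier step invalidates every later link, which is what lets a CI job detect a non-deterministic or altered run by rebuilding the chain and comparing.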
Dali exposes the same contributor workflow through both the CLI and MCP:
| Action | Command |
|---|---|
| Validate a corpus record | lint |
| Run the evaluator | score |
| Verify replay determinism | replay |
| Validate a prompt | probe |
| Create a prompt template | draft |
| Bundle prompts | pack |
Use them locally through the CLI, for example `python -m tools.cli replay`.
Or use them from AI-native editors and assistants through MCP; see tools/mcp/README.md.
Dali is designed so researchers, developers, legal professionals, and AI practitioners can contribute evidence, benchmarks, and evaluation artifacts through a consistent, reproducible workflow.
For contribution rules, taxonomy, labels, and the PR checklist, see CONTRIBUTING.md. For methodology and scoring, see METHODOLOGY.md and docs/policy-versioning.md. For cryptographic lineage, see docs/cryptographic-lineage.md. For a deeper repo tour, see tools/cli/README.md and tools/mcp/README.md.
See CITATION.cff, or:
```bibtex
@software{dali-2026,
  title = {Dali: Evidentiary Infrastructure for Legal AI},
  author = {Kha, Yen},
  year = {2026},
  version = {1.0.0},
  organization = {GammaLex AI Inc.},
  url = {https://github.com/yenk/Dali},
  note = {Open benchmark for citation integrity, provenance, and evidence reconstructability in legal AI}
}
```

MIT. See LICENSE.
