Dali

Open benchmark and corpus for evaluating whether AI-generated legal citations remain attributable, verifiable, and reconstructable across jurisdictions.

Dali v0.2 Evidence Reconstructability Benchmark

What is Dali?

Dali evaluates whether the evidence behind an AI-generated legal citation can be independently reconstructed, verified, and re-evaluated under a fixed policy version. A citation checker asks whether a citation exists. Dali asks whether the workflow that produced it can be audited and defended.

Every Dali run produces a deterministic, policy-versioned, hash-sealed CitationIntegrityResult artifact. The deterministic Tier 1 evaluator runs offline; CI re-verifies replay equality on every pull request.
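A minimal sketch of what "deterministic, policy-versioned, hash-sealed" can mean in practice, assuming a JSON-serializable result. The function names (seal, verify) and the field names (policy_version, seal) are hypothetical and are not taken from Dali's code; the actual artifact schema is defined by the repository.

```python
import hashlib
import json


def _canonical_bytes(body: dict) -> bytes:
    # Sorted keys and fixed separators make equal content serialize
    # to identical bytes, which is what makes the hash reproducible.
    text = json.dumps(body, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return text.encode("utf-8")


def seal(result: dict, policy_version: str) -> dict:
    """Return a copy of `result` tagged with a policy version and a SHA-256 seal.

    Hypothetical helper: field names are illustrative, not Dali's schema.
    """
    body = {**result, "policy_version": policy_version}
    digest = hashlib.sha256(_canonical_bytes(body)).hexdigest()
    return {**body, "seal": digest}


def verify(sealed: dict) -> bool:
    """Recompute the seal over everything except the seal field itself."""
    body = {key: value for key, value in sealed.items() if key != "seal"}
    expected = hashlib.sha256(_canonical_bytes(body)).hexdigest()
    return expected == sealed.get("seal")


if __name__ == "__main__":
    artifact = seal({"citation": "Example v. Placeholder", "verdict": "unverified"}, "v0.2")
    print(verify(artifact))
```

Because the policy version is hashed together with the result, the same evidence evaluated under a different policy version yields a different seal, so the two artifacts cannot be confused.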

Why does it matter?

The legal industry has no shared benchmarks, public corpora, or reproducible evidence standards for studying AI-generated citation failures. Courts have continued to document such incidents since Mata v. Avianca (2023), including United States v. Cohen and Park v. Kim; these cases anchor the Tier 1 canonical corpus in data/benchmark/tier1/corpus/citation_failure_cases.json. Dali supplies that missing public infrastructure as a single MIT-licensed, deterministically replayable artifact, with reproducibility defined through cryptographic lineage and the public methodology.

What did we find?

  • 524 citations evaluated across 3 OpenAI models and 5 jurisdiction tracks.
  • GPT-4.1: 23% of generated citation URLs return HTTP 404; on adversarial citation-trap prompts the model took the bait 76% of the time.
  • Portuguese civil-law citations verified at 3%; UK common-law citations at 76%. Same models, same task, different legal system.

Full per-model leaderboard, jurisdictional breakdown, methodology, and reproducible run instructions: data/results/v0.2/ and LEADERBOARD.md. Narrative writeups of the three Tier 1 cases: CASE-STUDIES.md.
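The per-jurisdiction figures above are rates of the form verified / total within each track. The sketch below shows that arithmetic on toy records; the record fields (track, verified) and the function name are hypothetical, and the sample data is illustrative only, not drawn from data/results/v0.2/.

```python
from collections import defaultdict


def verified_rate_by_track(records):
    """Return {track: fraction of records marked verified}.

    Hypothetical record shape: {"track": str, "verified": bool}.
    """
    totals = defaultdict(int)
    verified = defaultdict(int)
    for record in records:
        totals[record["track"]] += 1
        if record["verified"]:
            verified[record["track"]] += 1
    return {track: verified[track] / totals[track] for track in totals}


if __name__ == "__main__":
    # Toy data for illustration only; not benchmark results.
    toy_records = [
        {"track": "track-a", "verified": True},
        {"track": "track-a", "verified": False},
        {"track": "track-b", "verified": False},
        {"track": "track-b", "verified": False},
    ]
    for track, rate in sorted(verified_rate_by_track(toy_records).items()):
        print(f"{track}: {rate:.0%}")
```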

How do I contribute?

Choose the path that matches your role:

Quick start

git clone https://github.com/yenk/Dali && cd Dali
pip install -r requirements.txt
python -m tools.cli replay

The Tier 1 evaluator runs entirely offline with no API keys or network access required. Every evaluation verifies replay determinism through Dali's cryptographic lineage chain.
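One way to picture replay determinism over a lineage chain: each record's hash covers both its payload and its parent's hash, and a replay passes only if re-running the evaluator reproduces every hash. This is a sketch under assumptions, not Dali's implementation; GENESIS, build_chain, chain_is_valid, replay_matches, and the toy evaluator are all hypothetical names. The real design is described in docs/cryptographic-lineage.md.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical parent hash for the first link


def _digest(payload: dict, parent: str) -> str:
    # Hash the payload together with the parent hash so that changing
    # any earlier link changes every later hash.
    canonical = json.dumps(
        {"parent": parent, "payload": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_chain(payloads):
    """Link payloads in order; each link stores its parent's hash."""
    chain, parent = [], GENESIS
    for payload in payloads:
        digest = _digest(payload, parent)
        chain.append({"parent": parent, "payload": payload, "hash": digest})
        parent = digest
    return chain


def chain_is_valid(chain) -> bool:
    """Recompute every hash and check that parents line up."""
    parent = GENESIS
    for link in chain:
        if link["parent"] != parent or link["hash"] != _digest(link["payload"], parent):
            return False
        parent = link["hash"]
    return True


def replay_matches(recorded_chain, evaluate, inputs) -> bool:
    """Re-run `evaluate` on the inputs and compare hashes link by link."""
    replayed = build_chain([evaluate(item) for item in inputs])
    recorded_hashes = [link["hash"] for link in recorded_chain]
    replayed_hashes = [link["hash"] for link in replayed]
    return chain_is_valid(recorded_chain) and recorded_hashes == replayed_hashes


if __name__ == "__main__":
    def toy_evaluate(text):
        # Stand-in for a deterministic evaluator.
        return {"input": text, "length": len(text)}

    inputs = ["first citation", "second citation"]
    recorded = build_chain([toy_evaluate(item) for item in inputs])
    print(replay_matches(recorded, toy_evaluate, inputs))
```

The check fails if the evaluator is non-deterministic, if the inputs differ, or if any recorded link was altered after the fact.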

Standalone setup guide: docs/quickstart.md.

Dali exposes the same contributor workflow through both the CLI and MCP:

| Action | Command |
|---|---|
| Validate a corpus record | lint |
| Run the evaluator | score |
| Verify replay determinism | replay |
| Validate a prompt | probe |
| Create a prompt template | draft |
| Bundle prompts | pack |

Run them locally through the CLI (for example, python -m tools.cli replay) or from AI-native editors and assistants through MCP; see tools/cli/README.md and tools/mcp/README.md.

Dali is designed so researchers, developers, legal professionals, and AI practitioners can contribute evidence, benchmarks, and evaluation artifacts through a consistent, reproducible workflow.

For contribution rules, taxonomy, labels, and the PR checklist, see CONTRIBUTING.md. For methodology and scoring, see METHODOLOGY.md and docs/policy-versioning.md. For cryptographic lineage, see docs/cryptographic-lineage.md. For a deeper repo tour, see tools/cli/README.md and tools/mcp/README.md.

How to cite

See CITATION.cff, or:

@software{dali-2026,
  title        = {Dali: Evidentiary Infrastructure for Legal AI},
  author       = {Kha, Yen},
  year         = {2026},
  version      = {1.0.0},
  organization = {GammaLex AI Inc.},
  url          = {https://github.com/yenk/Dali},
  note         = {Open benchmark for citation integrity, provenance, and evidence reconstructability in legal AI}
}

License

MIT. See LICENSE.
