AuditTrace is an independent portfolio prototype for synthetic AI audit reliability. It focuses on versioned question sets, evidence-backed findings, fail-closed validation, audit logs, and eval metrics.
- Versioned question sets, so audit behavior is traceable to a specific criteria release.
- Hybrid deterministic and mock narrative checks, without external model calls.
- Evidence validation, so model-like findings must cite source text.
- Fail-closed downgrades to `insufficient_evidence` when evidence does not validate.
- Eval metrics and audit log timelines, so reliability is observable instead of assumed.
- Open the dashboard.
- Open a seeded synthetic document.
- Run a mock audit with a selected question-set version.
- Inspect evidence-backed findings and the real audit log timeline.
- Run the adversarial demo case to show `insufficient_evidence`.
- Open Evals and inspect reliability metrics.
Backend pytest: 18 passed
Frontend typecheck: passed
Frontend build: passed
One-command verifier: ./scripts/verify.sh passed

AuditTrace uses synthetic data only. It is not medical advice, not a clinical decision system, not affiliated with Brellium, does not use real PHI, and does not claim HIPAA compliance.
The project is intentionally scoped to AI audit reliability mechanics. It is not a healthcare platform, EMR, billing system, patient management system, or production compliance product.
Clinical-documentation audit workflows are not simple summarization tasks. Criteria can be objective, narrative, versioned, and difficult to verify. AI-style outputs are only useful if they are grounded in source evidence and measured over time.
AuditTrace demonstrates that reliability layer:
Synthetic note
-> question-set version
-> deterministic checks
-> mock narrative checks
-> evidence-span validation
-> fail-closed normalized findings
-> persisted audit logs
-> eval metrics
-> minimal review UI

The central rule is: model-like output is proposed, not trusted, until the cited evidence validates against the source note.
AuditTrace does not stop at generating a plausible answer. It persists criteria versions, validates evidence spans, records fail-closed downgrades, writes an audit log timeline, and runs evals against seeded expectations. The mock narrative runner is intentionally boring and deterministic; the interesting part is the reliability harness around it.
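
The fail-closed rule above can be sketched in a few lines. This is a minimal illustration, not the project's `evidence_validator.py`: the `EvidenceSpan` and `Finding` types, the `normalize` function, and the sample note are all hypothetical, and the sketch assumes every finding requires evidence and that a span validates only on an exact text match at the cited offsets.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class EvidenceSpan:
    quote: str  # text the finding claims to cite
    start: int  # claimed start offset into the source note
    end: int    # claimed end offset (exclusive)


@dataclass(frozen=True)
class Finding:
    question_id: str
    status: str  # e.g. "pass", "fail", "insufficient_evidence"
    evidence: Optional[EvidenceSpan] = None
    downgraded: bool = False


def span_is_valid(note: str, span: EvidenceSpan) -> bool:
    """Offsets must be in range and the quote must match the note exactly."""
    if not span.quote or not (0 <= span.start < span.end <= len(note)):
        return False
    return note[span.start:span.end] == span.quote


def normalize(note: str, proposed: Finding) -> Finding:
    """Fail closed: keep the proposed status only if its evidence validates."""
    if proposed.evidence is not None and span_is_valid(note, proposed.evidence):
        return proposed
    return replace(
        proposed, status="insufficient_evidence", evidence=None, downgraded=True
    )


# Hypothetical synthetic note and proposed findings.
note = "Session note: caregiver training was provided. No rationale documented."
quote = "No rationale documented."
start = note.index(quote)

good = Finding("q1", "fail", EvidenceSpan(quote, start, start + len(quote)))
bad = Finding("q2", "fail", EvidenceSpan("Rationale was clearly stated.", 0, 29))

print(normalize(note, good).status)  # fail
print(normalize(note, bad).status)   # insufficient_evidence
```

The design point is that the downgrade path is the default: a finding keeps its proposed status only by passing validation, so a missing or fabricated citation can never be trusted by omission.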
| Layer | Technology |
|---|---|
| Frontend | Next.js, TypeScript, plain CSS |
| Backend | FastAPI, Python, Pydantic |
| Database | SQLite local fallback, SQLAlchemy models |
| Tests | pytest, TypeScript typecheck, Next.js build |
| AI path | Mock structured narrative runner only, no external API calls |
- Versioned question sets, including `aba-97155-v1` and `aba-97155-v2`.
- Synthetic clinical-style seed notes with no real PHI.
- Hybrid audit runner:
  - deterministic checks for objective criteria
  - mock structured narrative checks for LLM-compatible criteria
- Evidence-span validator that records source character offsets.
- Fail-closed downgrade to `insufficient_evidence` when required evidence is missing or invalid.
- Persisted audit runs, findings, evidence spans, and audit logs.
- Eval harness that runs seeded cases through the same audit path.
- Minimal UI for dashboard, documents, audit detail, question sets, and evals.
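
As a rough sketch of how a versioned question set and a hybrid runner might fit together, the example below tags each run with its question-set version and appends log events as each phase finishes. The types, function names, question IDs, and check functions are hypothetical and do not reflect the actual `audit_runner.py`; evidence validation and its log event are omitted to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Question:
    question_id: str
    kind: str                    # "deterministic" or "narrative"
    check: Callable[[str], str]  # returns a proposed status for the note


@dataclass(frozen=True)
class QuestionSet:
    version_id: str
    questions: tuple


@dataclass
class AuditRun:
    question_set_version: str
    findings: dict = field(default_factory=dict)
    log: list = field(default_factory=list)


def run_audit(note: str, question_set: QuestionSet) -> AuditRun:
    """Run deterministic checks first, then mock narrative checks."""
    run = AuditRun(question_set_version=question_set.version_id)
    run.log.append("audit_started")
    phases = (
        ("deterministic", "deterministic_checks_completed"),
        ("narrative", "narrative_checks_completed"),
    )
    for kind, event in phases:
        for question in question_set.questions:
            if question.kind == kind:
                run.findings[question.question_id] = question.check(note)
        run.log.append(event)
    run.log.append("audit_completed")
    return run


# Hypothetical checks: one objective rule, one deterministic narrative stand-in.
def has_signature(note: str) -> str:
    return "pass" if "Signed:" in note else "fail"


def mock_rationale(note: str) -> str:
    return "pass" if "rationale" in note.lower() else "fail"


v1 = QuestionSet(
    "aba-97155-v1",
    (
        Question("signature_present", "deterministic", has_signature),
        Question("rationale_documented", "narrative", mock_rationale),
    ),
)
run = run_audit("Protocol modified during session. Signed: BCBA", v1)
print(run.question_set_version, run.findings, run.log)
```

Because the version ID is stored on the run itself, any finding can be traced back to the exact criteria release that produced it.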
See docs/architecture.md.
Short version:
FastAPI services own the reliability logic.
Next.js renders the review UI from real backend APIs.
SQLite stores synthetic docs, question sets, audits, evidence, evals, and logs.

Important backend services:
- `deterministic_checks.py`
- `llm_auditor.py`
- `evidence_validator.py`
- `audit_runner.py`
- `eval_runner.py`

cd apps/api
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python3 -m audittrace_api.seed
uvicorn audittrace_api.main:app --reload

Useful backend URLs:
http://localhost:8000/health
http://localhost:8000/documents
http://localhost:8000/question-sets

cd apps/web
cp .env.example .env.local
npm install
npm run dev

The frontend reads:
NEXT_PUBLIC_API_BASE_URL=http://localhost:8000

Open:
http://localhost:3000

Documents and question sets:
GET /documents
GET /documents/{document_id}
GET /question-sets
GET /question-sets/{version_id}

Audits:
POST /audits/run
GET /audits/{audit_id}
GET /documents/{document_id}/audits

Audit detail includes real audit logs:
audit_started
deterministic_checks_completed
narrative_checks_completed
evidence_validation_completed
audit_completed

Evals:
POST /evals/run
GET /evals/{eval_run_id}
GET /evals/latest

Latest local eval output from the mock runner and synthetic seed corpus:
{
"status": "completed",
"metrics": {
"total_cases": 38,
"total_questions_evaluated": 42,
"critical_issue_recall": 1.0,
"false_positive_rate": 0.0,
"unsupported_finding_rate": 0.0,
"evidence_span_match_rate": 0.8211,
"insufficient_evidence_rate": 0.1023,
"regression_failures_by_question_set_version": {},
"average_audit_latency_ms": 4.9474
},
"result_count": 38,
"failures_preview": []
}

Definitions:
- `total_cases`: eval cases selected and executed.
- `total_questions_evaluated`: expected finding rows compared across selected cases.
- `critical_issue_recall`: expected critical failures found divided by expected critical failures.
- `false_positive_rate`: unexpected fail findings divided by explicit non-fail expectations.
- `unsupported_finding_rate`: required-evidence findings with invalid or missing evidence that were trusted instead of downgraded, divided by findings requiring evidence.
- `evidence_span_match_rate`: valid produced evidence spans divided by all produced evidence spans.
- `insufficient_evidence_rate`: insufficient-evidence findings divided by all audit findings.
- `regression_failures_by_question_set_version`: failed eval comparisons grouped by question-set version ID.
- `average_audit_latency_ms`: average audit-run latency across eval cases.

Metrics with a zero denominator return null.
These scores are high because the current runner is deterministic and mock-only. They show that the local reliability loop and seeded expectations are aligned; they do not imply clinical correctness.
More detail: docs/eval_results.md.
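
The ratio definitions above, including the zero-denominator rule, can be illustrated with a small helper. This is a sketch only: `safe_ratio`, `summarize`, the count keys, and the sample counts are hypothetical and are not taken from `eval_runner.py` or from the eval output shown earlier. Rounding to four decimal places is an assumption based on the format of the reported values.

```python
import json
from typing import Optional


def safe_ratio(numerator: int, denominator: int, digits: int = 4) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    if denominator == 0:
        return None
    return round(numerator / denominator, digits)


def summarize(counts: dict) -> dict:
    """Map raw eval counts (hypothetical key names) to ratio metrics."""
    return {
        "critical_issue_recall": safe_ratio(
            counts["critical_found"], counts["critical_expected"]
        ),
        "false_positive_rate": safe_ratio(
            counts["unexpected_fails"], counts["non_fail_expectations"]
        ),
        "evidence_span_match_rate": safe_ratio(
            counts["valid_spans"], counts["produced_spans"]
        ),
        "insufficient_evidence_rate": safe_ratio(
            counts["insufficient_findings"], counts["total_findings"]
        ),
    }


# Illustrative counts only; no evidence spans were produced in this example.
metrics = summarize(
    {
        "critical_found": 3,
        "critical_expected": 3,
        "unexpected_fails": 0,
        "non_fail_expectations": 5,
        "valid_spans": 0,
        "produced_spans": 0,
        "insufficient_findings": 1,
        "total_findings": 8,
    }
)
print(json.dumps(metrics, indent=2))
```

With these counts, `evidence_span_match_rate` serializes as `null` in JSON rather than `0.0`, which keeps "nothing to measure" distinct from "measured and found zero".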
Recommended demo cases are listed in docs/demo_cases.md.
- Seed and start the backend.
- Start the frontend.
- Open the dashboard.
- Open Documents.
- Select `ABA 97155 missing rationale synthetic note 001`.
- Run audit with `aba-97155-v1`.
- Inspect the audit detail page:
  - source note
  - real audit log timeline
  - evidence-backed finding
  - evidence validation status
- Select `Adversarial invalid evidence synthetic note 001`.
- Run audit with `aba-97155-v1`.
- Show `Downgraded: insufficient evidence`.
- Open Evals.
- Run eval in mock mode.
- Inspect unsupported finding rate, evidence span match rate, and failures preview.
Demo scripts:
Run everything:
./scripts/verify.sh

Or run checks separately:
cd apps/api
python3 -m pytest

cd apps/web
npm run typecheck
npm run build

Current local result:
Backend pytest: 18 passed
Frontend typecheck: passed
Frontend build: passed

Reviewer screenshots are committed as PNG files under docs/screenshots and embedded near the top of this README. To refresh them from the local seeded app, follow docs/screenshots/README.md.
See docs/tradeoffs.md.
Short version:
- Synthetic data only.
- Mock narrative runner only.
- No external LLM calls yet.
- No auth or RBAC.
- No EMR integration.
- No clinical correctness claim.
- No HIPAA compliance claim.
- SQLite local fallback rather than production database operations.
- Add optional real structured-output model mode behind the same evidence validator.
- Expand the synthetic labeled eval corpus.
- Add question-set import/export and review workflows.
- Add async audit/eval jobs for longer runs.
- Add observability for model latency, evidence rejection, and regression trends.
- Add production-grade migrations and deployment config.
- Add auth and access control only if moving beyond portfolio/demo scope.






