Executable backend scaffold for the Decision Evidence Maturity Model cross-regime benchmark.
This repository is separate from:
- operational-evidence-plane, which is a reference implementation substrate;
- anchor-level-reconstructability-pilot, which is a frozen anchor-level reproducibility artifact.
The purpose here is narrower and more demanding: generate reproducible benchmark result artifacts for a cross-regime decision-evidence sufficiency study.
The repository currently includes:
- a case manifest schema in Python dataclasses;
- native AER, MAT, IEEC, DCC/HDP, PROV, LLM Audit Trails, AEGIS-NTC, and Dynamic Capabilities fixture adapters for exercising the manifest boundary;
- deterministic implementations for the five non-LLM container-presence baselines;
- optional imported pinned-output support for adding an LLM-judge external/model-run artifact;
- imported baseline output validation for case coverage before metric use;
- Overclaim Rate and Property Sufficiency Accuracy helpers;
- candidate scorer output import and property-level evaluation;
- candidate scorer output validation for case and property coverage;
- summary slices by regime, degradation condition, and question family;
- corpus inventory counts by regime, question family, degradation condition, property category, and strict sufficiency;
- a balanced 64-case draft corpus generator for exercising manuscript-scale mechanics without promoting synthetic outputs as evidence;
- a deterministic construction-derived oracle for manuscript property labels
based on explicit degradation-condition rules stored in
data/oracle/construction_oracle_v1.yaml;
- a Cohen kappa helper for mechanical paired-oracle self-consistency diagnostics;
- smoke label calibration for mechanics checks;
- row-level label review exports for adjudication workflow checks;
- fill-in adjudication override templates for disagreement rows;
- adjudicated label promotion into case manifests;
- one-command result package assembly for corpus validation, label calibration, label review, adjudication, scorer evaluation, baselines, run manifest, and readiness report;
- result package artifact checksum validation;
- a result-readiness report that distinguishes mechanics from manuscript-ready evidence;
- an actionable readiness-gap report that maps blockers to artifact areas and next steps;
- manuscript-facing CSV exports for gate status, readiness blockers, and package artifact inventory;
- a smoke corpus manifest and corpus validator;
- a manuscript result runbook and corpus manifest template;
- a label-leakage audit for scorer-facing manuscript artifacts;
- a deterministic redacted-input candidate scorer writer for no-human manuscript package assembly;
- a smoke fixture and tests.
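The Cohen kappa helper in the list above computes a standard agreement statistic. A minimal sketch of that statistic follows; the function name `cohen_kappa` and the handling of the degenerate single-label case are assumptions for illustration, not the repository's implementation.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of positions where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Assumption: both raters used one identical label throughout,
        # which this sketch reports as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Two identical label sequences give a kappa of 1.0, which is what a paired run of a deterministic oracle against itself would be expected to produce.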
The default manuscript package path is no-human and no-LLM: construction labels
come from the executable oracle, the candidate scorer output is generated from
redacted scorer-facing fields, and the package uses five deterministic
baselines. LLM-judge can still be added as an optional pinned artifact by
setting MANUSCRIPT_BASELINES="trace_present ledger_present schema_present container_checklist source_specific_validator llm_judge" and supplying
data/baselines/llm_judge_outputs.jsonl.
The benchmark requires:
- Eight regime adapters: AER, MAT, IEEC, DCC/HDP, PROV, LLM Audit Trails, AEGIS-NTC, and Dynamic Capabilities.
- Five deterministic default baselines: trace-present, ledger-present, schema-present, container-checklist, and source-specific validator. LLM-judge is an optional pinned external/model-run baseline.
- One property-level candidate scorer interface for Decision Trace Reconstructor outputs.
- Labels over Decision Event Schema properties: actor identity, principal authority, action boundary, policy basis, decision basis, data/resource touch, lifecycle context, and verification strength.
- Metrics: Property Sufficiency Accuracy and Overclaim Rate, followed by the secondary metrics defined by the benchmark protocol.
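The two primary metrics can be illustrated with a short sketch. The definitions used here are assumptions: Property Sufficiency Accuracy is taken as the share of (case, property) labels a scorer predicts correctly, and Overclaim Rate as the share of strictly insufficient cases that a baseline nevertheless marks sufficient. The benchmark protocol remains the authoritative definition, and the function names and data shapes below are hypothetical.

```python
def property_sufficiency_accuracy(gold, predicted):
    """Share of (case, property) labels predicted correctly.

    gold / predicted: {case_id: {property_name: bool}} (assumed shape).
    """
    total = 0
    correct = 0
    for case_id, gold_props in gold.items():
        for prop, label in gold_props.items():
            total += 1
            correct += int(predicted[case_id][prop] == label)
    return correct / total


def overclaim_rate(strict_sufficient, baseline_claims):
    """Share of strictly insufficient cases that a baseline marks sufficient.

    strict_sufficient / baseline_claims: {case_id: bool} (assumed shape).
    """
    insufficient = [cid for cid, ok in strict_sufficient.items() if not ok]
    if not insufficient:
        return 0.0
    overclaims = sum(1 for cid in insufficient if baseline_claims[cid])
    return overclaims / len(insufficient)
```

Under these assumed definitions, a container-presence baseline that marks every case sufficient would have an Overclaim Rate of 1.0 on any corpus containing insufficient cases.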
DEMM-Bench is intentionally lightweight and deterministic (no stochastic component, no LLM call on the critical path, no GPU). The reference 64-case manuscript package runs end-to-end in under 60 seconds per case on a consumer laptop (adapter translation + degradation + scoring + readiness gate). The full 178-test suite completes in approximately seven seconds.
- Python: 3.11, 3.12, 3.13, or 3.14 (verified)
- RAM: under 512 MB working set
- Disk: under 250 MB including dependencies, tracked data, and reference results
- GPU: not required
- Network: not required after the initial pip install
- Determinism: bit-exact reproducible (mean PSA = 0.5625, kappa = 1.0, baseline overclaim rates 0.75 / 0.50 / 0.00) across runs, platforms, and supported Python versions
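One way to spot-check the determinism claim locally is to hash the artifacts from two independent runs and compare the digests. The sketch below assumes each run wrote its outputs to its own directory; the helper names are hypothetical, and any artifact that embeds a timestamp or host detail would need to be excluded from the comparison (whether such artifacts exist is not stated here).

```python
import hashlib
from pathlib import Path


def digest_tree(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def differing_files(run_a, run_b):
    """Relative paths that are missing from one run or differ in content."""
    a = digest_tree(run_a)
    b = digest_tree(run_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty list from `differing_files` means the two result trees are byte-identical.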
If you intend to enable the optional LLM-judge baseline (--baseline llm_judge),
expect cost and latency commensurate with the model family chosen; see paper
§4.3 and the LLM-judge workbook export/import flow under scripts/.
python3 -m venv .venv
. .venv/bin/activate
python -m pip install -e ".[dev]"
make verify
Run the smoke benchmark:
decision-evidence-benchmark run \
--cases data/fixtures/smoke_cases.jsonl \
--out data/results/smoke_results.jsonl \
--summary data/results/smoke_summary.json \
--supporting-input data/results/smoke_corpus_validation.json \
--supporting-input data/results/smoke_label_calibration.json \
--supporting-input data/results/smoke_scorer_summary.json \
--run-manifest data/results/smoke_run_manifest.json
For manuscript-scale runs, the default strict package uses five deterministic
baselines and treats llm_judge as optional. To include llm_judge, supply a
pinned JSONL artifact:
make export-manuscript-llm-judge-workbook
make import-manuscript-llm-judge-workbook
The import target expects a reviewed
data/results/manuscript_llm_judge_workbook.reviewed.csv and writes
data/baselines/llm_judge_outputs.jsonl only after coverage and metadata
validation pass.
decision-evidence-benchmark validate-baseline-predictions \
--cases data/cases/manuscript_cases.jsonl \
--baseline llm_judge \
--predictions data/baselines/llm_judge_outputs.jsonl \
--out data/results/manuscript_llm_judge_validation.json
decision-evidence-benchmark run \
--cases data/cases/manuscript_cases.jsonl \
--llm-judge-predictions data/baselines/llm_judge_outputs.jsonl \
--out data/results/manuscript_baseline_results.jsonl \
--summary data/results/manuscript_baseline_summary.json
Validate the smoke corpus manifest:
decision-evidence-benchmark validate-corpus \
--manifest data/corpus/smoke_corpus.yaml \
--out data/results/smoke_corpus_validation.json
Validate and evaluate the smoke candidate scorer fixture:
For the default no-human manuscript candidate scorer output, generate redacted scorer inputs and deterministic scorer predictions:
make write-manuscript-deterministic-scorer
For an externally reviewed scorer run instead, use the redacted scorer-input path and guarded workbook import before running validation:
make write-manuscript-scorer-input
make export-manuscript-scorer-workbook
make import-manuscript-scorer-workbook
The redaction target writes
data/cases/manuscript_scorer_input_cases.jsonl,
data/cases/manuscript_scorer_input_case_id_map.jsonl, and
data/results/manuscript_scorer_input_redaction.json. Exported scorer-facing
workbooks use opaque case-000001 style IDs and omit degradation conditions,
labels, and source refs. The private case-id map is used only by the import
targets to map reviewed predictions back to original evaluation case IDs.
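The redaction step described above can be sketched as follows. This is a minimal illustration, assuming each case is a dictionary with a `case_id` key; the dropped field names (`degradation_condition`, `labels`, `source_refs`) and the ID-map keys are hypothetical stand-ins for the real manifest schema.

```python
def redact_cases(cases, drop_fields=("degradation_condition", "labels", "source_refs")):
    """Return (redacted_cases, private_id_map) using opaque case-000001 style IDs."""
    redacted = []
    id_map = []
    for index, case in enumerate(cases, start=1):
        opaque_id = f"case-{index:06d}"
        # Keep only scorer-facing fields; the original ID is replaced below.
        public = {
            key: value
            for key, value in case.items()
            if key not in drop_fields and key != "case_id"
        }
        public["case_id"] = opaque_id
        redacted.append(public)
        id_map.append({"opaque_case_id": opaque_id, "case_id": case["case_id"]})
    return redacted, id_map


def restore_case_ids(predictions, id_map):
    """Map predictions keyed by opaque IDs back to original evaluation case IDs."""
    lookup = {row["opaque_case_id"]: row["case_id"] for row in id_map}
    return [{**pred, "case_id": lookup[pred["case_id"]]} for pred in predictions]
```

Keeping the ID map in a separate private file means the scorer-facing artifact carries no link back to the degradation condition unless the map itself is exposed.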
Inspect manuscript authoring progress and Section 7 gate blockers:
make manuscript-authoring-status
This writes data/results/manuscript_authoring_status.json. The report is
diagnostic only; it does not import reviewed workbooks or create result
artifacts.
After construction-oracle annotations, adjudicated cases, and candidate scorer outputs exist, promote the corpus manifest template:
make write-manuscript-corpus-manifest
The target refuses to write data/corpus/manuscript_corpus.yaml until required
gate inputs exist.
decision-evidence-benchmark validate-scorer-predictions \
--cases data/fixtures/smoke_cases.jsonl \
--predictions data/fixtures/smoke_scorer_outputs.jsonl \
--out data/results/smoke_scorer_validation.json
decision-evidence-benchmark evaluate-scorer \
--cases data/fixtures/smoke_cases.jsonl \
--predictions data/fixtures/smoke_scorer_outputs.jsonl \
--out data/results/smoke_scorer_results.jsonl \
--summary data/results/smoke_scorer_summary.json
Compute smoke label calibration diagnostics:
decision-evidence-benchmark calibrate-labels \
--cases data/fixtures/smoke_cases.jsonl \
--annotations data/annotations/smoke_annotations.jsonl \
--summary data/results/smoke_label_calibration.json
Export row-level smoke label review details:
decision-evidence-benchmark review-labels \
--cases data/fixtures/smoke_cases.jsonl \
--annotations data/annotations/smoke_annotations.jsonl \
--out data/results/smoke_label_review.json \
--csv-out data/results/smoke_label_review.csv
Write a fill-in override template when review rows contain disagreements:
decision-evidence-benchmark write-adjudication-overrides-template \
--cases data/fixtures/smoke_cases.jsonl \
--annotations data/annotations/smoke_annotations.jsonl \
--out data/results/smoke_adjudication_overrides.template.jsonl \
--adjudicator-id reviewer_1
Promote smoke labels into adjudicated case manifests:
decision-evidence-benchmark adjudicate-labels \
--cases data/fixtures/smoke_cases.jsonl \
--annotations data/annotations/smoke_annotations.jsonl \
--out-cases data/results/smoke_adjudicated_cases.jsonl \
--report data/results/smoke_label_adjudication.json
Generate the balanced 64-case draft corpus scaffold:
decision-evidence-benchmark generate-draft-corpus
The generated draft corpus is for pipeline exercise only. Its corpus, annotation, and candidate-scorer statuses are readiness blockers by default.
Exercise the generated 64-case draft package mechanics:
make verify-draft-package
This writes draft package, validation, readiness-gap, and table-export outputs
under data/results/. These outputs are still blocked from manuscript use by
their fixture statuses.
Use docs/manuscript_artifact_plan.md for the concrete replacement sequence
from the 64-case draft package to manuscript-candidate inputs.
Start case authoring from the 64-cell source template:
make write-manuscript-case-source-template
Create the dedicated manuscript-corpus evidence source root:
make init-manuscript-evidence-source
This writes data/sources/manuscript_corpus/source_manifest.json and
data/sources/manuscript_corpus/case_evidence_sources.jsonl as a 64-row
source-review skeleton. The rows are not reviewed evidence until their source
refs, provenance notes, reviewer metadata, and container flags are filled and
the source manifest is explicitly promoted from template scope.
Export that source root as a spreadsheet-friendly review workbook:
make export-manuscript-source-review-workbook
After filling and reviewing
data/results/manuscript_source_review_workbook.reviewed.csv, import it back
into the source root:
make import-manuscript-source-review-workbook
The importer updates data/sources/manuscript_corpus only if all 64 rows have
reviewed status, non-placeholder source refs, provenance notes, reviewer
metadata, and concrete container flags.
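The importer's all-or-nothing gate can be sketched as a pure validation function that returns every problem found, so that nothing is written unless the result is empty. The column names (`review_status`, `source_ref`, `provenance_note`, `reviewer_id`, `container_flags`) and the placeholder tokens are assumptions for illustration; the real workbook schema may differ.

```python
REQUIRED_TEXT_FIELDS = ("source_ref", "provenance_note", "reviewer_id")
PLACEHOLDERS = {"", "TODO", "TBD", "PLACEHOLDER"}  # hypothetical placeholder tokens


def row_problems(row):
    """List the reasons a single workbook row is not importable."""
    problems = []
    if row.get("review_status") != "reviewed":
        problems.append("review_status is not 'reviewed'")
    for field in REQUIRED_TEXT_FIELDS:
        value = str(row.get(field) or "").strip().upper()
        if value in PLACEHOLDERS:
            problems.append(f"{field} is empty or a placeholder")
    flags = row.get("container_flags") or {}
    if not flags or any(not isinstance(v, bool) for v in flags.values()):
        problems.append("container_flags must be concrete booleans")
    return problems


def gate_import(rows, expected_rows=64):
    """Return {case_id: problems}; an empty dict means the import may proceed."""
    errors = {}
    if len(rows) != expected_rows:
        errors["__workbook__"] = [f"expected {expected_rows} rows, found {len(rows)}"]
    for row in rows:
        problems = row_problems(row)
        if problems:
            errors[row.get("case_id", "<unknown>")] = problems
    return errors
```

Collecting all problems before failing lets a reviewer fix the whole workbook in one pass instead of rerunning the import after each rejected row.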
Audit the local OEP and pilot source roots before marking rows reviewed:
make audit-manuscript-source-roots
Export the audited source-root refs as a review candidate inventory:
make export-manuscript-source-candidates
This writes data/results/manuscript_source_candidates.csv and .jsonl.
Rows whose source roots remain scoped as demo or anchor artifacts are labelled
as reference context only; they must not be copied into reviewed case rows until
a manuscript-corpus evidence source exists.
Generate one source-review packet per case when filling the workbook:
make export-manuscript-source-review-packets
This writes data/results/manuscript_source_review_packets/ plus an index CSV
and summary JSON. Packets repeat the case taxonomy, missing source-review
fields, required manuscript source-root refs, and advisory OEP / pilot refs, but
they do not update the source root or promote evidence.
Generate the guarded 8-case review slice before filling the full workbook:
make export-manuscript-source-review-slice
This writes data/results/manuscript_source_review_slice.csv and .json.
The default slice selects one case per regime and leaves review cells empty on
purpose. It is not a complete 64-row workbook and cannot promote evidence by
itself.
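The slice selection described above, one case per regime with empty review cells, can be sketched like this. Choosing the first case by sorted `case_id` within each regime is an assumption made so the sketch is deterministic; the field and column names are hypothetical.

```python
REVIEW_COLUMNS = ("source_ref", "provenance_note", "reviewer_id")  # hypothetical columns


def one_case_per_regime(cases):
    """Pick the first case (by sorted case_id) for each regime."""
    chosen = {}
    for case in sorted(cases, key=lambda c: c["case_id"]):
        chosen.setdefault(case["regime"], case)
    return [chosen[regime] for regime in sorted(chosen)]


def review_slice_rows(cases):
    """Build slice rows whose review cells are deliberately left empty."""
    return [
        {"case_id": case["case_id"], "regime": case["regime"],
         **{column: "" for column in REVIEW_COLUMNS}}
        for case in one_case_per_regime(cases)
    ]
```

With eight regimes in the corpus this yields eight rows, which matches the size of the review slice.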
After filling the slice, save it as
data/results/manuscript_source_review_slice.reviewed.csv and validate it:
make validate-manuscript-source-review-slice
To merge a valid reviewed slice into a full workbook scaffold, run:
make merge-manuscript-source-review-slice
This writes data/results/manuscript_source_review_workbook.merged_from_slice.csv.
It still does not import data/sources/manuscript_corpus; the full 64-row
workbook must pass the stricter source-root importer before evidence promotion.
Build a row-level evidence intake report from the template and source audit:
make build-manuscript-evidence-intake
Export the intake rows as a CSV/JSONL authoring queue:
make export-manuscript-evidence-workqueue
After filling and reviewing data/results/manuscript_evidence_workqueue.reviewed.csv,
import it into reviewed source rows:
make import-manuscript-evidence-workqueue
After reviewed source rows are saved as
data/cases/manuscript_case_sources.reviewed.jsonl, convert them into
unadjudicated case manifests:
make convert-manuscript-case-sources
Write deterministic construction-derived labels and adjudicated manuscript cases:
make write-manuscript-construction-oracle
This writes data/annotations/manuscript_annotations.jsonl and
data/cases/manuscript_cases.jsonl from explicit
degradation-condition-to-property rules in
data/oracle/construction_oracle_v1.yaml. The report records the oracle spec
SHA-256 plus input/output artifact hashes. It refuses to overwrite existing
outputs unless run with MANUSCRIPT_CONSTRUCTION_ORACLE_FORCE=1. The output is
ground-truth label construction only; it is not human annotation, not a Gemini
or LLM judgment, and not a Decision Trace Reconstructor run.
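The idea of a construction-derived oracle, where labels follow mechanically from the degradation condition, can be sketched as below. The rule table is purely illustrative: the condition names and their property mappings are hypothetical and are not taken from data/oracle/construction_oracle_v1.yaml. The digest helper hashes a canonical JSON form of the in-memory rules as a stand-in for hashing the spec file itself.

```python
import hashlib
import json

PROPERTIES = (
    "actor_identity", "principal_authority", "action_boundary", "policy_basis",
    "decision_basis", "data_resource_touch", "lifecycle_context", "verification_strength",
)

# Hypothetical rules: degradation condition -> properties rendered insufficient.
RULES = {
    "intact": [],
    "missing_policy": ["policy_basis"],
    "missing_actor": ["actor_identity", "principal_authority"],
}


def oracle_labels(degradation_condition, rules=RULES):
    """Derive per-property sufficiency labels from the degradation condition alone."""
    if degradation_condition not in rules:
        raise KeyError(f"no oracle rule for condition {degradation_condition!r}")
    degraded = set(rules[degradation_condition])
    return {prop: prop not in degraded for prop in PROPERTIES}


def spec_sha256(rules=RULES):
    """SHA-256 over a canonical JSON form of the rules (stand-in for the spec file hash)."""
    canonical = json.dumps(rules, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()
```

Because the labels depend only on the rule table, recording the spec digest alongside the outputs is enough to tell later whether two label sets were built from the same rules.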
Audit scorer-facing artifacts for obvious label leakage:
make audit-manuscript-label-leakage
The audit fails when a scorer or optional LLM-judge artifact exposes
degradation_condition, embedded property labels, or case IDs/source refs that
encode degradation-condition tokens. A passing audit is not proof of semantic
independence, but a failing audit is a concrete reproducibility blocker.
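The three leakage checks named above can be sketched as a recursive scan over one scorer-facing record. The forbidden key names, the `case_id` and `source_refs` fields, and the token list are assumptions for illustration; the real audit may inspect different fields.

```python
FORBIDDEN_KEYS = {"degradation_condition", "labels"}  # hypothetical field names


def leakage_findings(record, condition_tokens):
    """Return human-readable findings for one scorer-facing record."""
    findings = []

    def walk(value, path):
        # Recurse through nested dicts and lists looking for forbidden keys.
        if isinstance(value, dict):
            for key, child in value.items():
                if key in FORBIDDEN_KEYS:
                    findings.append(f"{path}{key}: forbidden field present")
                walk(child, f"{path}{key}.")
        elif isinstance(value, list):
            for index, child in enumerate(value):
                walk(child, f"{path}{index}.")

    walk(record, "")
    # Identifiers must not encode degradation-condition tokens.
    identifiers = [str(record.get("case_id", ""))]
    identifiers += [str(ref) for ref in record.get("source_refs", [])]
    for identifier in identifiers:
        for token in condition_tokens:
            if token.lower() in identifier.lower():
                findings.append(f"identifier {identifier!r} encodes token {token!r}")
    return findings
```

A substring scan like this only catches obvious leaks, consistent with the caveat that a passing audit is not proof of semantic independence.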
After those deterministic artifacts exist, run the strict manuscript package gate:
make verify-manuscript-package
The target generates the deterministic scorer output, runs the label-leakage
audit, promotes the corpus manifest, then starts check-manuscript-inputs.
The preflight writes data/results/manuscript_input_preflight.json and fails
before package assembly if required manuscript inputs are missing. LLM-judge
inputs are optional unless llm_judge is added to MANUSCRIPT_BASELINES.
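The preflight behaviour described above, including the conditional LLM-judge requirement, can be sketched as follows. The function signature and the report fields (`status`, `checked`, `missing`) are hypothetical; only the general behaviour of writing a JSON report and failing on missing inputs comes from the description.

```python
import json
from pathlib import Path


def preflight(required, optional_if_enabled, baselines, report_path):
    """Check that input files exist and write a JSON preflight report.

    required: paths that must always exist.
    optional_if_enabled: {baseline_name: [paths]} required only when that
        baseline is listed in `baselines`.
    """
    needed = list(required)
    for baseline, paths in optional_if_enabled.items():
        if baseline in baselines:
            needed.extend(paths)
    missing = [path for path in needed if not Path(path).is_file()]
    report = {
        "status": "pass" if not missing else "fail",
        "checked": needed,
        "missing": missing,
    }
    Path(report_path).write_text(json.dumps(report, indent=2) + "\n", encoding="utf-8")
    return report
```

A caller would stop before package assembly whenever the returned `status` is `fail`, for example when `llm_judge` is among the baselines but data/baselines/llm_judge_outputs.jsonl has not been supplied.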
Assemble the smoke result-readiness report:
decision-evidence-benchmark readiness-report \
--corpus-validation data/results/smoke_corpus_validation.json \
--label-calibration data/results/smoke_label_calibration.json \
--label-review data/results/smoke_label_review.json \
--scorer-summary data/results/smoke_scorer_summary.json \
--baseline-summary data/results/smoke_summary.json \
--run-manifest data/results/smoke_run_manifest.json \
--out data/results/smoke_readiness_report.json
Explain readiness blockers as artifact-level gaps:
decision-evidence-benchmark readiness-gaps \
--readiness-report data/results/smoke_readiness_report.json \
--out data/results/smoke_readiness_gaps.json
Build a full smoke result package:
decision-evidence-benchmark build-result-package \
--corpus-manifest data/corpus/smoke_corpus.yaml \
--cases data/fixtures/smoke_cases.jsonl \
--annotations data/annotations/smoke_annotations.jsonl \
--scorer-predictions data/fixtures/smoke_scorer_outputs.jsonl \
--out-dir data/results \
--prefix smoke_package
The package command writes *_label_adjudication.json,
*_adjudicated_cases.jsonl, *_adjudicated_corpus.yaml, and
*_readiness_gaps.json automatically. Add --adjudication-overrides when
annotation disagreements have explicit adjudication rows. Add
--llm-judge-predictions only when including a pinned optional LLM-judge
artifact.
Validate the package manifest and referenced artifact checksums:
decision-evidence-benchmark validate-result-package \
--manifest data/results/smoke_package_package_manifest.json \
--out data/results/smoke_package_validation.json
Export manuscript-facing tables from the package:
decision-evidence-benchmark export-manuscript-tables \
--package-manifest data/results/smoke_package_package_manifest.json \
--out-dir data/results \
--prefix smoke_package
Convert the native DCC/HDP smoke fixture into a case manifest:
decision-evidence-benchmark adapt \
--regime dcc_hdp \
--input data/cases/dcc_hdp/missing_policy_001.native.json \
--out data/results/dcc_hdp_case.jsonl
Do not use smoke output as empirical evidence. It exists only to prove that the benchmark mechanics run end to end. Publication-ready results require the full case set, labels, scorer outputs, slice summaries, and aggregation manifest.
Use docs/manuscript_runbook.md for the manuscript-scale execution sequence and
data/corpus/manuscript_corpus.template.yaml as the starting corpus manifest.
If you use DEMM-Bench in your research, please cite both the software artifact and the accompanying paper. The concept DOI 10.5281/zenodo.20408699 resolves to the latest version; cite the specific version DOI for reproducibility.
@software{solozobov_demm_bench_2026,
author = {Solozobov, Oleg},
title = {{DEMM-Bench: A Decision Evidence Maturity Benchmark
for Agent-Runtime Decisions Across Eight Evidence Regimes}},
year = 2026,
publisher = {Zenodo},
version = {v0.1.0},
doi = {10.5281/zenodo.20408700},
url = {https://doi.org/10.5281/zenodo.20408700}
}
See CITATION.cff for machine-readable citation metadata.