Intrinsical-AI · Apr 27, 2026
diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
 # Stateful RAG Platform: A Port & Adapters Modular Approach
 
-[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
+[![Python 3.11-3.12](https://img.shields.io/badge/python-3.11--3.12-blue.svg)](https://www.python.org/downloads/)
 [![FastAPI](https://img.shields.io/badge/FastAPI-0.124+-green.svg)](https://fastapi.tiangolo.com)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 [![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)

diff --git a/docs/ROADMAP.md b/docs/ROADMAP.md
@@ -2,7 +2,7 @@
 
 > Scope: prioritized delivery roadmap derived from the current technical-debt register and the current codebase
 >
-> Status: current as of 2026-04-08
+> Status: current as of 2026-04-26
 
 This roadmap is execution-oriented and intentionally atomic. Each item should be implementable and reviewable on its own, without bundling large architectural rewrites into one change.
 
@@ -11,9 +11,9 @@ This roadmap is execution-oriented and intentionally atomic. Each item should be
 - `transport`: align canonical import semantics between CLI and HTTP
   - unify `replace_scope` defaults
   - cover the behavior explicitly in transport tests
-- `eval`: preserve retriever scores in offline evaluation
+- [x] `eval`: preserve retriever scores in offline evaluation
   - change the eval callback contract from `Sequence[str]` to `(external_id, score)` pairs
-- `eval`: stop silently filtering unknown retrieved IDs
+- [x] `eval`: stop silently filtering unknown retrieved IDs
   - either keep them as non-relevant results or emit explicit evaluator anomalies
 - `ux/dx`: unify CLI payload validation with DTOs / use cases for:
   - `rag-mutate-docs`
@@ -25,9 +25,9 @@ This roadmap is execution-oriented and intentionally atomic. Each item should be
 - `mutation`: extract a typed mutation runtime/config object from `Settings`
 - `mutation`: remove duplicated profile resolution and strategy branching from `MutationCoordinator`
 - `mutation`: split `_mutation_saga_executor.py` into explicit phases without changing behavior
-- `eval`: emit per-query outputs for compare mode
+- [x] `eval`: emit per-query outputs for compare mode
   - this is the prerequisite for stronger statistical comparison later
-- `ux/dx`: reduce `rag-eval-compare` complexity with profiles or spec-file support
+- [x] `ux/dx`: reduce `rag-eval-compare` complexity with spec-file support
 
 ## P2
 
@@ -40,7 +40,7 @@ This roadmap is execution-oriented and intentionally atomic. Each item should be
 
 ## P3
 
-- `eval`: extend dataset schema to support optional graded relevance
+- [x] `eval`: extend dataset schema to support optional graded relevance
 - `eval`: add paired significance testing for compare mode once per-query outputs exist
 - `canonical import`: decide whether scope replacement belongs inside canonical mutation or a dedicated use case
 - `ux/dx`: review evaluation help texts and flag naming for consistency and scanability

diff --git a/docs/ROADMAP_UX_DX.md b/docs/ROADMAP_UX_DX.md
@@ -18,9 +18,9 @@ This roadmap isolates the UX / DX work so it can move independently from mutatio
 
 ## P1
 
-- Reduce `rag-eval-compare` complexity:
-  - support compare profiles or spec files, not only expanded flag matrices
-  - make common compare flows easier than bespoke flag composition
+- [x] Reduce `rag-eval-compare` complexity:
+  - use a canonical `--spec` file with baseline, candidate, and thresholds
+  - remove expanded baseline/candidate flag matrices
 - Homogenize CLI exit codes and success/error output:
   - consistent distinction between contract errors, runtime errors, and failed evaluation gates
   - consistent success summaries across commands

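To make the `--spec` shape concrete, here is a minimal sketch of how a gate could read the `thresholds` block and map the outcome to the documented exit codes `0` and `1`. The threshold key names come from the USAGE.md example in this diff; everything else is an assumption: the function name `gate_exit_code`, the metric dictionary keys, and the comparison directions (a delta must be at least `min_delta_*`, a regression must be at most `max_regression_*`). Omitted thresholds default to `0.0`, as the USAGE.md hunk states. Exit code `2` (configuration or environment errors) is not modelled here.

```python
import json

# Assumed mapping between metric names and the threshold keys in the spec file.
MIN_DELTA_KEYS = {"ndcg": "min_delta_ndcg", "map": "min_delta_map", "mrr": "min_delta_mrr"}
MAX_REGRESSION_KEYS = {
    "precision": "max_regression_precision",
    "recall": "max_regression_recall",
}


def gate_exit_code(spec: dict, baseline: dict[str, float], candidate: dict[str, float]) -> int:
    """Return 0 when the candidate passes the gate, 1 otherwise."""
    thresholds = spec.get("thresholds", {})
    for metric, key in MIN_DELTA_KEYS.items():
        delta = candidate[metric] - baseline[metric]
        if delta < thresholds.get(key, 0.0):
            return 1  # not enough improvement
    for metric, key in MAX_REGRESSION_KEYS.items():
        regression = baseline[metric] - candidate[metric]
        if regression > thresholds.get(key, 0.0):
            return 1  # critical metric degraded too much
    return 0


spec = json.loads(
    '{"k": 1, "baseline": {"retrieval_mode": "sparse"},'
    ' "candidate": {"retrieval_mode": "sparse"},'
    ' "thresholds": {"min_delta_ndcg": 0.1}}'
)
same = {"ndcg": 1.0, "map": 1.0, "mrr": 1.0, "precision": 1.0, "recall": 1.0}
exit_code = gate_exit_code(spec, baseline=same, candidate=same)
```

With identical baseline and candidate metrics, a positive `min_delta_ndcg` cannot be met, which mirrors the FAIL case in `scripts/test_rag_eval_compare_e2e.sh`.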
diff --git a/docs/TECH_DEBT.md b/docs/TECH_DEBT.md
@@ -55,7 +55,7 @@ Interpretation:
 | Mutation saga | One module still owns profile validation, journal lifecycle, before-image, SQL apply, vector apply, rollback, recovery, and reconcile | [`src/local_rag_backend/core/use_cases/_mutation_saga_executor.py`](../src/local_rag_backend/core/use_cases/_mutation_saga_executor.py) | High blast radius for write-path changes | High |
 | Mutation coordinator | Strategy selection, profile resolution, batching, and settings-derived behavior remain concentrated in one coordinator | [`src/local_rag_backend/core/use_cases/docs_mutation.py`](../src/local_rag_backend/core/use_cases/docs_mutation.py) | Coordination logic still spreads across layers | High |
 | Canonical import path | Scope-sync still relies on dynamic repo capabilities outside the coordinator; transport validation is now shared, but the write semantics still depend on repo-specific delete hooks | [`src/local_rag_backend/core/use_cases/docs_import_canonical.py`](../src/local_rag_backend/core/use_cases/docs_import_canonical.py), [`src/local_rag_backend/cli_commands/docs/docs_import_canonical.py`](../src/local_rag_backend/cli_commands/docs/docs_import_canonical.py) | Same business action can still depend on repo capabilities outside the core coordinator | High |
-| Evaluation methodology | Unknown retrieved IDs are silently filtered; original retriever scores are discarded; compare gate uses aggregate deltas only | [`src/local_rag_backend/core/services/evaluation.py`](../src/local_rag_backend/core/services/evaluation.py), [`src/local_rag_backend/core/use_cases/evaluation.py`](../src/local_rag_backend/core/use_cases/evaluation.py) | Retrieval defects can be masked; candidate quality can be overstated | High |
+| Evaluation methodology | Score/unknown-ID handling is fixed; compare gate still needs statistical testing beyond aggregate deltas | [`src/local_rag_backend/core/services/evaluation.py`](../src/local_rag_backend/core/services/evaluation.py), [`src/local_rag_backend/core/use_cases/evaluation.py`](../src/local_rag_backend/core/use_cases/evaluation.py) | Candidate quality can still be overstated without paired significance tests | Medium |
 | CLI / DX contract consistency | Evaluation flags are still powerful and cognitively expensive; some command surfaces remain manually shaped rather than spec-driven | [`src/local_rag_backend/cli_commands/docs/docs_mutate.py`](../src/local_rag_backend/cli_commands/docs/docs_mutate.py), [`src/local_rag_backend/cli_commands/eval.py`](../src/local_rag_backend/cli_commands/eval.py) | Users still have to learn a wide CLI surface | Medium |
 | Maintenance | Two multi-store delete flows are still near-mirror implementations | [`src/local_rag_backend/core/services/maintenance.py`](../src/local_rag_backend/core/services/maintenance.py) | Partial fixes and telemetry drift | Medium |
 | Ingestion planner | Planning, stale detection, batching, mutation execution, and terminal output still live in one module; `items` remains `Any` | [`src/local_rag_backend/cli_commands/docs/_ingestion_planner.py`](../src/local_rag_backend/cli_commands/docs/_ingestion_planner.py) | Reuse is limited; contracts remain implicit | Medium |
@@ -107,15 +107,14 @@ Validated points:
 
 - Dataset parsing and aggregate metric calculation are correctly centralized in [`load_eval_dataset`](../src/local_rag_backend/core/services/evaluation.py) and `ir_measures`.
 - The evaluation workspace is isolated through [`prepare_eval_workspace`](../src/local_rag_backend/core/use_cases/evaluation.py), so this is not a simple "production index accidentally reused" story.
-- The core evaluator still filters retrieved IDs not present in the dataset corpus:
-  `normalized_external_id not in known_doc_ids` in [`run_retrieval_eval`](../src/local_rag_backend/core/services/evaluation.py).
-- The evaluator also discards original retrieval scores because the callback contract is `Sequence[str]`, then invents rank-based scores via `k - rank + 1`.
-- [`compare_eval_results`](../src/local_rag_backend/core/services/evaluation.py) still gates on aggregate deltas only, with no per-query distribution or significance testing.
+- The core evaluator now keeps retrieved IDs not present in the dataset corpus as non-relevant results and emits explicit anomalies.
+- The evaluator now accepts ranked `(external_id, score)` style results while keeping backward compatibility for ID-only callbacks.
+- [`compare_eval_results`](../src/local_rag_backend/core/services/evaluation.py) still gates on aggregate deltas; detailed reports now expose per-query deltas, but significance testing remains future work.
 
 Why this matters:
 
-- Filtering unknown IDs can hide retrieval-state defects or corpus leakage instead of surfacing them as degraded precision.
-- Rank-only callback contracts make the evaluator lossy and block better future analysis.
+- Unknown retrieved IDs now degrade metrics and are visible in report/anomaly outputs.
+- Score-preserving callback contracts unblock better future analysis.
 - Aggregate-only gates are acceptable as operational guardrails, but not as evidence of statistical superiority.
 
 Nuance:
@@ -125,19 +124,17 @@ Nuance:
 
 Recommended direction:
 
-- Change the retrieval callback contract to return `(external_id, score)` pairs.
-- Stop silently filtering unknown IDs; either keep them as non-relevant results or surface them as explicit evaluator anomalies.
-- Extend the dataset format to support optional graded relevance.
-- Add per-query outputs and, later, paired significance testing for compare mode.
+- Add paired significance testing for compare mode.
+- Expand external benchmark adapters on top of the internal `EvalDataset`/`EvalRun`/`EvalReport` model.
 
 ### 4. CLI / DX contract consistency is now first-order debt
 
 Validated points:
 
 - CLI mutation and canonical-import commands still parse JSON manually instead of reusing shared typed validation.
 - `replace_scope` semantics are aligned and the typed validation path is now shared across CLI, HTTP, and MCP.
-- `rag-eval`, `rag-eval-batch`, and `rag-eval-compare` expose powerful workflows, but the option surface is large and uneven:
-  flags such as `--candidate-candidate-k`, `--baseline-dual-candidate-k`, and JSON batch specs create a high cognitive load.
+- `rag-eval-compare` now uses a canonical `--spec` file instead of expanded baseline/candidate flag matrices.
+  `rag-eval` and `rag-eval-batch` still need continued payload-validation cleanup.
 - `_ingestion_planner.py` still emits terminal output directly, which keeps planning logic coupled to CLI behavior.
 
 Why this matters:
@@ -150,7 +147,7 @@ Recommended direction:
 
 - Reuse DTOs / use-case input models for CLI payload validation.
 - Align visible defaults across CLI and HTTP.
-- Simplify evaluation entrypoints with profiles or spec-driven compare flows, not only flag expansion.
+- Continue simplifying evaluation entrypoints by reusing the shared eval config validation path.
 - Standardize exit codes and success/error output shape across commands.
 
 ### 5. Maintenance still has cheap-to-fix duplication
@@ -230,8 +227,8 @@ Interpretation:
 - Extract a typed mutation runtime/config object from `Settings`.
 - Split mutation saga internals by phase without changing external behavior.
 - Remove duplicate profile resolution and strategy branching where possible from `MutationCoordinator`.
-- Add per-query outputs to evaluation compare mode.
-- Reduce `eval-compare` flag complexity with profiles or spec-file support.
+- [x] Add per-query outputs to evaluation compare mode.
+- [x] Reduce `eval-compare` flag complexity with spec-file support.
 - Homogenize CLI exit codes and success/error output shape.
 
 ### P2

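Paired significance testing remains open in this register. As one possible direction, and not something this patch implements, the per-query deltas now exposed by the detailed report could feed a paired sign-flip randomization test. The sketch below assumes a hypothetical helper name (`paired_randomization_p_value`) and takes a plain sequence of per-query `candidate - baseline` deltas for a single metric; the choice of test, resample count, and seed are all illustrative.

```python
import random
from collections.abc import Sequence


def paired_randomization_p_value(
    deltas: Sequence[float], n_resamples: int = 2000, seed: int = 0
) -> float:
    """Two-sided sign-flip randomization test on per-query deltas."""
    if not deltas:
        raise ValueError("need at least one per-query delta")
    rng = random.Random(seed)
    observed = abs(sum(deltas))
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis, each delta is equally likely to carry either sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in deltas)
        if abs(flipped) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


p_no_change = paired_randomization_p_value([0.0] * 5)
p_consistent_gain = paired_randomization_p_value([0.1] * 12)
```

A gate built on this would complement the aggregate-delta thresholds, not replace them: the thresholds stay as operational guardrails, while the p-value speaks to whether an observed gain is distinguishable from noise.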
diff --git a/docs/USAGE.md b/docs/USAGE.md
@@ -396,12 +396,16 @@ rag-eval --retrieval-mode sparse
 rag-eval --retrieval-mode dense --candidate-k 20
 rag-eval --retrieval-mode dual --dual-candidate-k 50
 rag-eval --retrieval-mode hybrid --hybrid-alpha 0.5
-rag-eval-compare --candidate-mode dual --candidate-dual-candidate-k 50
+rag-eval --retrieval-mode sparse --run-out /tmp/run.jsonl --report-out /tmp/report.json
+rag-eval-compare --spec /tmp/rag-eval-compare-spec.json
 rag-eval-batch --specs /tmp/rag-eval-batch-specs.json
 ```
 
 Default dataset: `datasets/rag_eval_v1.jsonl` (or `eval_dataset_path` in `config.yaml`).
 The command reports standard IR metrics at `@k` (`nDCG`, `MAP`, `MRR`, `P`, `Recall`).
+The evaluator preserves retriever scores when they are available. Retrieved IDs
+outside the corpus are no longer silently filtered: they count as non-relevant and appear as
+anomalies in `--report-out` / `--anomalies-out`.
 Evaluation uses an isolated local runtime under `<data_dir>/_eval_workspaces/`; it neither reuses nor mutates the main index.
 That runtime can, however, reuse an already persisted dense evaluation index when the following match:
 
@@ -412,6 +416,12 @@ Ese runtime sí puede reutilizar un índice denso de evaluación ya persistido c
 If you change the embedding model, the vector backend, or any input of the dense manifest, evaluation invalidates that workspace and rebuilds the isolated index.
 The dense rebuild runs in bounded batches to reduce memory peaks on large corpora, but it is still a full rebuild of the evaluation workspace when drift is detected.
 The dataset is validated strictly: duplicate IDs, empty relevant sets, or relevant IDs outside the corpus fail at load time.
+Schema v1 (`relevant_external_ids`) is still supported. Schema v2 adds graded qrels:
+
+```json
+{"type":"query","query":"alpha","qrels":[{"external_id":"doc:1","relevance":3}]}
+```
+
 Mode overrides are explicit:
 - `--candidate-k` only for `dense`
 - `--dual-candidate-k` only for `dual`
@@ -428,20 +438,30 @@ Ese env var sobrescribe `perf_metrics_out_path` en tiempo de carga de settings s
 Baseline-vs-candidate comparison (“RAG Placebo Detector”):
 
 ```bash
+cat > /tmp/rag-eval-compare-spec.json <<'JSON'
+{
+  "k": 3,
+  "baseline": {"retrieval_mode": "sparse"},
+  "candidate": {"retrieval_mode": "dual", "dual_candidate_k": 50},
+  "thresholds": {
+    "min_delta_ndcg": 0.02,
+    "min_delta_map": 0.02,
+    "min_delta_mrr": 0.02,
+    "max_regression_precision": 0.01,
+    "max_regression_recall": 0.01
+  }
+}
+JSON
+
 rag-eval-compare \
-  --candidate-mode dual \
-  --candidate-dual-candidate-k 50 \
-  --min-delta-ndcg 0.02 \
-  --min-delta-map 0.02 \
-  --min-delta-mrr 0.02 \
-  --max-regression-precision 0.01 \
-  --max-regression-recall 0.01 \
-  --json-out /tmp/rag-eval-compare.json
+  --spec /tmp/rag-eval-compare-spec.json \
+  --json-out /tmp/rag-eval-compare.json \
+  --report-out /tmp/rag-eval-compare-report.json
 ```
 
 Behavior:
-- default baseline: `sparse` without reranker
-- candidate: the configuration you want to validate
+- `baseline` and `candidate` are declared in `--spec`
+- `thresholds` holds the gate thresholds; any omitted value defaults to `0.0`
 - exit code `0`: the gate passes
 - exit code `1`: the candidate does not improve enough or degrades critical metrics
 - exit code `2`: configuration, dependency, or environment error
@@ -451,6 +471,9 @@ El JSON de salida incluye:
 - `candidate`
 - `delta`
 
+The detailed report adds per-query metrics and anomaly counters, intended for auditing,
+pooling, and future RAGAS/BEIR/MTEB/LLM-judge adapters.
+
 Reproducible e2e smoke test:
 
 ```bash

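The coexistence of schema v1 (`relevant_external_ids`) and schema v2 (`qrels` with graded `relevance`) suggests a small normalization step when loading query records. A minimal sketch follows, using the two query lines that appear in this diff as inputs. The helper name `query_qrels` is hypothetical, and mapping v1 relevant IDs to grade `1` is an assumption of this sketch; the real `load_eval_dataset` may handle this differently and performs stricter validation.

```python
import json


def query_qrels(record: dict) -> dict[str, int]:
    """Map a query record (v1 or v2) to {external_id: relevance grade}."""
    if "qrels" in record:
        # Schema v2: explicit graded judgments.
        return {q["external_id"]: int(q["relevance"]) for q in record["qrels"]}
    # Schema v1: binary relevance, represented here as grade 1 (assumption).
    return {external_id: 1 for external_id in record["relevant_external_ids"]}


v1_line = '{"type":"query","query":"alpha","relevant_external_ids":["doc:alpha"]}'
v2_line = '{"type":"query","query":"alpha","qrels":[{"external_id":"doc:1","relevance":3}]}'

v1_qrels = query_qrels(json.loads(v1_line))
v2_qrels = query_qrels(json.loads(v2_line))
```

Normalizing both schemas into the same mapping keeps downstream metric code indifferent to which schema version a dataset was written in.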
diff --git a/pyproject.toml b/pyproject.toml
@@ -3,7 +3,7 @@ name = "rag-prototype"
 version = "1.3.0"
 description = "Experimental-grade RAG prototype with FastAPI, FAISS, and clean hexagonal architecture"
 readme = { file = "README.md", content-type = "text/markdown" }
-requires-python = ">=3.11"
+requires-python = ">=3.11,<3.13"
 license = { text = "MIT" }
 authors = [{ name = "Intrinsical-AI", email = "intrinsicalai@proton.me" }]
 maintainers = [{ name = "Intrinsical-AI", email = "intrinsicalai@proton.me" }]

diff --git a/scripts/test_rag_eval_compare_e2e.sh b/scripts/test_rag_eval_compare_e2e.sh
@@ -8,25 +8,47 @@ trap 'rm -rf "$TMP_DIR"' EXIT
 DATASET_PATH="$TMP_DIR/rag_eval_compare.jsonl"
 PASS_JSON="$TMP_DIR/pass.json"
 FAIL_JSON="$TMP_DIR/fail.json"
+PASS_SPEC="$TMP_DIR/pass-spec.json"
+FAIL_SPEC="$TMP_DIR/fail-spec.json"
 
 cat >"$DATASET_PATH" <<'EOF'
 {"type":"meta","dataset_id":"rag_eval_compare_smoke","schema_version":1}
 {"type":"doc","external_id":"doc:alpha","source_id":"smoke","content":"alpha alpha alpha"}
 {"type":"query","query":"alpha","relevant_external_ids":["doc:alpha"]}
 EOF
 
+cat >"$PASS_SPEC" <<'EOF'
+{
+  "k": 1,
+  "baseline": {"retrieval_mode": "sparse"},
+  "candidate": {"retrieval_mode": "sparse"},
+  "thresholds": {
+    "min_delta_ndcg": 0.0,
+    "min_delta_map": 0.0,
+    "min_delta_mrr": 0.0,
+    "max_regression_precision": 0.0,
+    "max_regression_recall": 0.0
+  }
+}
+EOF
+
+cat >"$FAIL_SPEC" <<'EOF'
+{
+  "k": 1,
+  "baseline": {"retrieval_mode": "sparse"},
+  "candidate": {"retrieval_mode": "sparse"},
+  "thresholds": {
+    "min_delta_ndcg": 0.1
+  }
+}
+EOF
+
 cd "$ROOT_DIR"
 
 echo "[1/2] Expect PASS with identical sparse baseline/candidate and zero delta thresholds"
 env DEBUG=false UV_CACHE_DIR="${UV_CACHE_DIR:-.uv_cache}" uv run rag-eval-compare \
   --dataset "$DATASET_PATH" \
-  --k 1 \
-  --candidate-mode sparse \
-  --min-delta-ndcg 0.0 \
-  --min-delta-map 0.0 \
-  --min-delta-mrr 0.0 \
-  --max-regression-precision 0.0 \
-  --max-regression-recall 0.0 \
+  --spec "$PASS_SPEC" \
   --json-out "$PASS_JSON"
 
 python - <<'PY' "$PASS_JSON"
@@ -44,9 +66,7 @@ echo "[2/2] Expect FAIL when requiring an impossible positive delta from an iden
 set +e
 env DEBUG=false UV_CACHE_DIR="${UV_CACHE_DIR:-.uv_cache}" uv run rag-eval-compare \
   --dataset "$DATASET_PATH" \
-  --k 1 \
-  --candidate-mode sparse \
-  --min-delta-ndcg 0.1 \
+  --spec "$FAIL_SPEC" \
   --json-out "$FAIL_JSON"
 rc=$?
 set -e

diff --git a/src/local_rag_backend/cli_commands/docs/docs_bootstrap.py b/src/local_rag_backend/cli_commands/docs/docs_bootstrap.py
@@ -13,14 +13,14 @@ def bootstrap_cmd() -> None:
         from local_rag_backend.scripts.sample_data_ingestion import run_sample_data_ingestion
 
         click.echo("[INFO] Bootstrapping database with sample data...")
+        bar: ProgressBar[int]
         with click.progressbar(length=1, label="Bootstrapping") as bar:
-            typed_bar: ProgressBar[int] = bar
             run_cli_mutation(
                 run_sample_data_ingestion,
                 use_lock=False,
                 ensure_schema=False,
             )
-            typed_bar.update(1)
+            bar.update(1)
         click.echo("[OK] Bootstrap completed successfully!")
     except Exception as e:
         click.echo(f"[ERROR] Error bootstrapping: {e}", err=True)