cafitac · May 1, 2026
diff --git a/.dev/status/current-handoff.md b/.dev/status/current-handoff.md
@@ -1,7 +1,7 @@
 # agent-memory current handoff
 
 Status: AI-authored draft. Not yet human-approved.
-Last updated: 2026-05-01 11:10 KST
+Last updated: 2026-05-01 11:37 KST
 
 ## Trigger for the next session
 
@@ -16,7 +16,7 @@ read this file first. Do not ask the user to restate context. Verify repo state,
 
 ## Ready-to-say answer
 
-agent-memory has completed release and Hermes QA through v0.1.39. Within Priority 5 dogfood/noise monitoring, a slice is now in progress that uses the v0.1.39 dogfood results to make the JSON contract of `observations review-candidates` more operations-friendly. The branch is `feat/observation-review-temporal` and the worktree is `/Users/reddit/Project/agent-memory/.worktrees/observation-review-temporal`. The goal is to add a top-level count, a per-ref observation window, and a fact status-history summary to the review-candidates output so that historical injections are easier to distinguish from the current lifecycle state. Automatic cleanup/mutation is still not performed.
+agent-memory has completed release and Hermes QA through v0.1.40. Within Priority 5 dogfood/noise monitoring, a read-only slice is now in progress that strengthens diagnostics for empty retrieval/high empty ratio. The branch is `feat/empty-retrieval-diagnostics` and the worktree is `/Users/reddit/Project/agent-memory/.worktrees/empty-retrieval-diagnostics`. The goal is to add `observations empty-diagnostics`, which groups empty-heavy observations by surface/scope/status filter so that a human can safely judge scope mismatches or missing approved memory coverage. Automatic cleanup/mutation is still not performed.
 
 ## Current repo state
 
@@ -32,17 +32,17 @@ Expected GitHub identity:
 
 Verified before this slice:
 
-- latest completed release: `v0.1.39`
-- v0.1.39 added read-only `agent-memory observations review-candidates` and completed published smoke/Hermes runtime QA.
-- local Hermes hook uses `/Users/reddit/.agent-memory/runtime/v0.1.39/.venv/bin/python -m agent_memory.api.cli hermes-pre-llm-hook ...` against `/Users/reddit/.agent-memory/memory.db`.
+- latest completed release: `v0.1.40`
+- v0.1.40 added observation windows/counts/status-history summaries to review-candidates and completed published smoke/Hermes runtime QA.
+- local Hermes hook uses `/Users/reddit/.agent-memory/runtime/v0.1.40/.venv/bin/python -m agent_memory.api.cli hermes-pre-llm-hook ...` against `/Users/reddit/.agent-memory/memory.db`.
 - root checkout was clean on `main...origin/main` except local-only untracked state.
 - open PRs were `[]`.
 
 Active slice/worktree:
 
-- branch: `feat/observation-review-temporal`
-- worktree: `/Users/reddit/Project/agent-memory/.worktrees/observation-review-temporal`
-- intended release after merge: likely `v0.1.40`
+- branch: `feat/empty-retrieval-diagnostics`
+- worktree: `/Users/reddit/Project/agent-memory/.worktrees/empty-retrieval-diagnostics`
+- intended release after merge: likely `v0.1.41`
 
 Expected local untracked artifacts to preserve in the root checkout:
 
@@ -54,63 +54,76 @@ Expected local untracked artifacts to preserve in the root checkout:
 
 Do not delete or commit these unless the user explicitly asks.
 
-## Current slice: observation review temporal summaries
+## Current slice: empty retrieval diagnostics
 
 Goal:
 
 - Keep dogfood/noise monitoring read-only.
-- Make `observations review-candidates` easier to consume from local dogfood output.
-- Add compact count/window/history summaries without exposing raw user queries and without mutating memory.
+- Make high empty retrieval ratio actionable without storing or emitting raw user queries.
+- Diagnose empty-heavy segments by surface, preferred scope, and retrieval status filter before changing rankers or adding graph traversal.
 
 Implemented so far in the active worktree:
 
-- `observations audit` top refs now include `observation_window`:
-  - `first_observation_id`
-  - `first_observed_at`
-  - `latest_observation_id`
-  - `latest_observed_at`
-- `observations review-candidates` now includes top-level:
+- New CLI command:
+  - `agent-memory observations empty-diagnostics <db_path> --limit 200 --top 10 --high-empty-threshold 0.5`
+- Output contract:
+  - `kind: retrieval_empty_diagnostics`
+  - `read_only: true`
   - `observation_count`
-  - `candidate_count`
-- Each review candidate now includes:
-  - the propagated `observation_window`
-  - `status_history_summary.transition_count`
-  - `status_history_summary.latest_transition`
+  - `empty_retrieval_count`
+  - `empty_retrieval_ratio`
+  - `quality_warnings`
+  - top-level `observation_window`
+  - `empty_by_surface[]`
+  - `empty_by_preferred_scope[]`
+  - `empty_by_status_filter[]`
+  - `suggested_next_steps`
+- Segment entries include:
+  - segment key (`surface`, `preferred_scope`, or `statuses`)
+  - `total_count`
+  - `empty_count`
+  - `empty_ratio`
+  - `signals`, currently `high_empty_segment` when the segment has at least one empty retrieval and its empty ratio is at or above the threshold
+  - `sample_observation_ids`
+  - `observation_window`
+- Secret-safety preserved:
+  - no raw query text
+  - no query previews
+  - no prompt content
 - Docs updated:
   - `README.md`
   - `docs/hermes-dogfood.md`
 - Tests updated in `tests/test_cli.py`:
-  - audit regression asserts per-ref observation window.
-  - review-candidates regression asserts top-level counts and status history summary.
+  - new regression asserts empty diagnostics segment grouping, read-only shape, next-step hints, and no secret leakage from raw query strings.
 
 Verification so far:
 
 - RED confirmed:
-  - focused tests failed on missing `observation_window` and top-level `observation_count`.
+  - focused test initially failed because `empty-diagnostics` parser choice was missing.
 - GREEN focused:
-  - `TMPDIR=$PWD/.tmp-test uv run pytest tests/test_cli.py::test_python_module_cli_observations_audit_reports_frequent_and_stale_refs_without_raw_queries tests/test_cli.py::test_python_module_cli_observations_review_candidates_explains_top_refs_without_mutation_or_raw_queries -q`
-  - `2 passed`
+  - `TMPDIR=$PWD/.tmp-test uv run pytest tests/test_cli.py::test_python_module_cli_observations_empty_diagnostics_groups_empty_segments_without_raw_queries -q`
+  - `1 passed`
 
 Remaining before PR:
 
-1. Run broader/full local verification:
-   - focused CLI tests around audit/review-candidates
+1. Run broader focused CLI tests around observations audit/review-candidates/empty-diagnostics.
+2. Run full local verification:
    - `uv run pytest tests/ -q`
    - `uv run python scripts/check_release_metadata.py`
    - `uv run python scripts/smoke_release_readiness.py`
    - `npm pack --dry-run`
    - `git diff --check`
    - `node --check bin/agent-memory.js`
-2. Run real local DB smoke for `observations review-candidates` and verify the new fields exist.
-3. Run static diff secret scan.
-4. Create PR, watch CI, merge, follow release-sync/publish/published smoke/Hermes QA.
-5. After v0.1.40 install, repeat Hermes hook doctor and installed `observations review-candidates` against the existing local DB.
+3. Run real local DB smoke for `observations empty-diagnostics` and verify no raw query fields appear.
+4. Run static diff secret scan.
+5. Create PR, watch CI, merge, follow release-sync/publish/published smoke/Hermes QA.
+6. After v0.1.41 install, repeat Hermes hook doctor and installed `observations empty-diagnostics` against the existing local DB.
 
 ## Next natural slice after this one
 
-After the review-candidates contract is released and dogfooded, continue Priority 5 by either:
+After empty retrieval diagnostics are released and dogfooded, continue Priority 5 by either:
 
-1. improving retrieval diagnostics for empty retrieval/high empty ratio, or
-2. adding an explicit human review cadence/checklist around candidate reports.
+1. adding an explicit human review cadence/checklist around audit/review-candidates/empty-diagnostics, or
+2. improving candidate report UX further by bundling suggested follow-up commands into a richer read-only triage report.
 
-Avoid automatic cleanup/deprecation until the review candidate workflow has been used on real local data for a while.
+Avoid automatic cleanup/deprecation until the review and diagnostics workflow has been used on real local data for a while.
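
Remaining steps 3 and 4 above call for confirming that no raw query fields appear in the `empty-diagnostics` output. A minimal sketch of such a check follows. It is not the project's actual test: the forbidden key substrings, the helper names, and the sample payloads are all assumptions made for illustration.

```python
import json
from typing import Any, Iterator

# Hypothetical substrings that should never appear in a key name.
# Adjust to the real output schema before relying on this.
FORBIDDEN_KEY_PARTS = ("query", "prompt")


def iter_keys(node: Any, path: str = "") -> Iterator[str]:
    """Yield the dotted path of every dict key in a decoded JSON document."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else str(key)
            yield child
            yield from iter_keys(value, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_keys(value, f"{path}[{index}]")


def find_suspicious_keys(report: dict[str, Any]) -> list[str]:
    """Return key paths whose last segment mentions a forbidden substring."""
    hits = []
    for key_path in iter_keys(report):
        leaf = key_path.rsplit(".", 1)[-1].lower()
        if any(part in leaf for part in FORBIDDEN_KEY_PARTS):
            hits.append(key_path)
    return hits


def leaks_marker(report: dict[str, Any], marker: str) -> bool:
    """True if a known seeded query string shows up anywhere in the serialized report."""
    return marker in json.dumps(report, ensure_ascii=False)


# Illustrative payloads only; real reports come from the CLI.
clean_report = {
    "kind": "retrieval_empty_diagnostics",
    "read_only": True,
    "empty_by_surface": [{"surface": "cli", "empty_count": 2}],
}
leaky_report = {"empty_by_surface": [{"surface": "cli", "query_preview": "seeded-secret"}]}

clean_hits = find_suspicious_keys(clean_report)
leaky_hits = find_suspicious_keys(leaky_report)
```

In a smoke run, the report would be the decoded stdout of the CLI command, and the marker would be a query string seeded into the database beforehand.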
diff --git a/README.md b/README.md
@@ -109,10 +109,11 @@ For local dogfood and noise monitoring, retrievals can leave a secret-safe obser
 agent-memory retrieve "$DB" "How should I install agent-memory?" --preferred-scope user:default --observe cli
 agent-memory observations list "$DB" --limit 20
 agent-memory observations audit "$DB" --limit 200 --top 10 --frequent-threshold 3
+agent-memory observations empty-diagnostics "$DB" --limit 200 --top 10 --high-empty-threshold 0.5
 agent-memory observations review-candidates "$DB" --limit 200 --top 10 --frequent-threshold 3
 ```
 
-Use the observation log and audit report to spot frequently injected or surprising memories before changing retrieval behavior. The audit output is read-only JSON with surface/scope counts, empty-retrieval count and ratio, quality warnings such as `low_observation_count` or `high_empty_retrieval_ratio`, top injected memory refs, current status for known refs, per-ref observation windows, and simple signals such as `frequently_injected` and `current_status_not_approved`. `observations review-candidates` is also read-only; it turns the top audit refs into forensic candidates with top-level `observation_count`/`candidate_count`, fact review explanations, status-history summaries, replacement-chain hints, graph-neighborhood summaries, and copy-paste follow-up commands such as `review explain`, `review replacements`, and `graph inspect`. Treat these reports as local operator telemetry, not a synced analytics feature or an automatic cleanup workflow.
+Use the observation log and audit report to spot frequently injected or surprising memories before changing retrieval behavior. The audit output is read-only JSON with surface/scope counts, empty-retrieval count and ratio, quality warnings such as `low_observation_count` or `high_empty_retrieval_ratio`, top injected memory refs, current status for known refs, per-ref observation windows, and simple signals such as `frequently_injected` and `current_status_not_approved`. `observations empty-diagnostics` is read-only and focuses specifically on empty retrievals: it groups empty-heavy observations by surface, preferred scope, and status filter with segment ratios, sample observation ids, observation windows, and next-step hints for checking scope mismatches or missing approved memory coverage before changing rankers. `observations review-candidates` is also read-only; it turns the top audit refs into forensic candidates with top-level `observation_count`/`candidate_count`, fact review explanations, status-history summaries, replacement-chain hints, graph-neighborhood summaries, and copy-paste follow-up commands such as `review explain`, `review replacements`, and `graph inspect`. Treat these reports as local operator telemetry, not a synced analytics feature or an automatic cleanup workflow.
 
 ## Hermes quickstart
 

diff --git a/docs/hermes-dogfood.md b/docs/hermes-dogfood.md
@@ -48,11 +48,14 @@ Hermes pre-LLM hook retrievals write a secret-safe local observation row to the
 ```bash
 agent-memory observations list ~/.agent-memory/memory.db --limit 20
 agent-memory observations audit ~/.agent-memory/memory.db --limit 200 --top 10 --frequent-threshold 3
+agent-memory observations empty-diagnostics ~/.agent-memory/memory.db --limit 200 --top 10 --high-empty-threshold 0.5
 agent-memory observations review-candidates ~/.agent-memory/memory.db --limit 200 --top 10 --frequent-threshold 3
 ```
 
 Use this before tuning ranking or adding broader graph traversal: first confirm which memories are frequently injected, which scopes are active, whether retrieval is often empty, and whether any frequently injected refs are now deprecated/disputed/missing. The audit command is read-only and summarizes local observation rows without emitting raw query text or query previews. Keep this data local unless you intentionally export it.
 
+When `empty_retrieval_ratio` is high, run `observations empty-diagnostics` before changing rankers. It is a read-only, secret-safe segment report for empty observations. It groups empty-heavy rows by surface, preferred scope, and status filter; includes each segment's total count, empty count, empty ratio, sample observation ids, and observation window; and suggests operator checks such as scope mismatch review or adding/approving durable memories only after confirming the misses are real user needs. It does not emit raw query text, query previews, or prompt content.
+
 `observations review-candidates` is the next read-only step after audit. It keeps the same secret-safe observation summary, then expands each top ref into a forensic candidate:
 
 - fact refs include the same lifecycle explanation as `agent-memory review explain fact ...`.
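
The segment report described in this docs change could be consumed along the following lines. This is a sketch under assumptions: the field names come from the output contract in this PR, while the `flagged_segments` helper and the sample values are hypothetical.

```python
from typing import Any

# Report list name -> key that identifies a segment inside that list.
SEGMENT_LISTS = {
    "empty_by_surface": "surface",
    "empty_by_preferred_scope": "preferred_scope",
    "empty_by_status_filter": "statuses",
}


def flagged_segments(report: dict[str, Any]) -> list[tuple[str, Any, float]]:
    """Collect (dimension, segment value, empty ratio) for high_empty_segment entries."""
    flagged = []
    for list_name, key in SEGMENT_LISTS.items():
        for segment in report.get(list_name, []):
            if "high_empty_segment" in segment.get("signals", []):
                flagged.append((key, segment.get(key), segment["empty_ratio"]))
    # Highest empty ratio first; ties keep their original order.
    return sorted(flagged, key=lambda row: -row[2])


# Hypothetical report fragment with made-up counts.
report = {
    "kind": "retrieval_empty_diagnostics",
    "empty_by_surface": [
        {"surface": "hermes", "total_count": 8, "empty_count": 6,
         "empty_ratio": 0.75, "signals": ["high_empty_segment"]},
        {"surface": "cli", "total_count": 4, "empty_count": 1,
         "empty_ratio": 0.25, "signals": []},
    ],
    "empty_by_preferred_scope": [
        {"preferred_scope": None, "total_count": 3, "empty_count": 3,
         "empty_ratio": 1.0, "signals": ["high_empty_segment"]},
    ],
    "empty_by_status_filter": [],
}

flagged = flagged_segments(report)
```

A `preferred_scope` of `None` reaching the top of such a list would be one hint to check for scope mismatches before touching rankers.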

diff --git a/src/agent_memory/api/cli.py b/src/agent_memory/api/cli.py
@@ -267,6 +267,140 @@ def _audit_retrieval_observations(
     }
 
 
+def _observation_window(observations) -> dict[str, Any] | None:
+    if not observations:
+        return None
+    first = min(observations, key=lambda observation: observation.id)
+    latest = max(observations, key=lambda observation: observation.id)
+    return {
+        "first_observation_id": first.id,
+        "first_observed_at": first.created_at,
+        "latest_observation_id": latest.id,
+        "latest_observed_at": latest.created_at,
+    }
+
+
+def _empty_diagnostic_segment_payload(
+    *,
+    segment_name: str,
+    segment_value: Any,
+    observations,
+    high_empty_threshold: float,
+) -> dict[str, Any]:
+    empty_observations = [observation for observation in observations if not observation.retrieved_memory_refs]
+    total_count = len(observations)
+    empty_count = len(empty_observations)
+    empty_ratio = empty_count / total_count if total_count else 0.0
+    signals = []
+    if empty_ratio >= high_empty_threshold and empty_count > 0:
+        signals.append("high_empty_segment")
+    return {
+        segment_name: segment_value,
+        "total_count": total_count,
+        "empty_count": empty_count,
+        "empty_ratio": round(empty_ratio, 4),
+        "signals": signals,
+        "sample_observation_ids": [observation.id for observation in empty_observations[:5]],
+        "observation_window": _observation_window(observations),
+    }
+
+
+def _empty_retrieval_diagnostics(
+    db_path: Path,
+    *,
+    limit: int,
+    top: int,
+    high_empty_threshold: float,
+) -> dict[str, Any]:
+    if limit < 1:
+        raise ValueError("observations empty-diagnostics limit must be >= 1")
+    if top < 1:
+        raise ValueError("observations empty-diagnostics top must be >= 1")
+    if not 0 <= high_empty_threshold <= 1:
+        raise ValueError("observations empty-diagnostics high empty threshold must be between 0 and 1")
+
+    observations = list_retrieval_observations(db_path, limit=limit)
+    empty_observations = [observation for observation in observations if not observation.retrieved_memory_refs]
+    empty_retrieval_ratio = len(empty_observations) / len(observations) if observations else 0.0
+
+    observations_by_surface: dict[str, list[Any]] = defaultdict(list)
+    observations_by_scope: dict[str | None, list[Any]] = defaultdict(list)
+    observations_by_statuses: dict[tuple[str, ...], list[Any]] = defaultdict(list)
+    for observation in observations:
+        observations_by_surface[observation.surface].append(observation)
+        observations_by_scope[observation.preferred_scope].append(observation)
+        observations_by_statuses[tuple(observation.statuses)].append(observation)
+
+    def sort_segments(items):
+        return sorted(
+            items,
+            key=lambda item: (-item["empty_count"], -item["empty_ratio"], str(next(iter(item.values())))),
+        )[:top]
+
+    empty_by_surface = sort_segments(
+        [
+            _empty_diagnostic_segment_payload(
+                segment_name="surface",
+                segment_value=surface,
+                observations=segment_observations,
+                high_empty_threshold=high_empty_threshold,
+            )
+            for surface, segment_observations in observations_by_surface.items()
+        ]
+    )
+    empty_by_preferred_scope = sort_segments(
+        [
+            _empty_diagnostic_segment_payload(
+                segment_name="preferred_scope",
+                segment_value=preferred_scope,
+                observations=segment_observations,
+                high_empty_threshold=high_empty_threshold,
+            )
+            for preferred_scope, segment_observations in observations_by_scope.items()
+        ]
+    )
+    empty_by_status_filter = sort_segments(
+        [
+            _empty_diagnostic_segment_payload(
+                segment_name="statuses",
+                segment_value=list(statuses),
+                observations=segment_observations,
+                high_empty_threshold=high_empty_threshold,
+            )
+            for statuses, segment_observations in observations_by_statuses.items()
+        ]
+    )
+
+    quality_warnings = []
+    if not observations:
+        quality_warnings.append("no_observations")
+    if 0 < len(observations) < 10:
+        quality_warnings.append("low_observation_count")
+    if empty_retrieval_ratio >= high_empty_threshold and observations:
+        quality_warnings.append("high_empty_retrieval_ratio")
+
+    return {
+        "kind": "retrieval_empty_diagnostics",
+        "read_only": True,
+        "observation_count": len(observations),
+        "limit": limit,
+        "top": top,
+        "high_empty_threshold": high_empty_threshold,
+        "empty_retrieval_count": len(empty_observations),
+        "empty_retrieval_ratio": round(empty_retrieval_ratio, 4),
+        "quality_warnings": quality_warnings,
+        "observation_window": _observation_window(observations),
+        "empty_by_surface": empty_by_surface,
+        "empty_by_preferred_scope": empty_by_preferred_scope,
+        "empty_by_status_filter": empty_by_status_filter,
+        "suggested_next_steps": [
+            "Run observations audit to compare empty vs non-empty retrieval surfaces.",
+            "Check preferred scope values for scope mismatches before changing ranking.",
+            "Add or approve memories only after confirming the missing queries represent durable user needs.",
+        ],
+    }
+
+
 def _review_candidates_from_observations(
     db_path: Path,
     *,
@@ -654,6 +788,14 @@ def _build_parser() -> argparse.ArgumentParser:
     observations_audit_parser.add_argument("--limit", type=int, default=200)
     observations_audit_parser.add_argument("--top", type=int, default=10)
     observations_audit_parser.add_argument("--frequent-threshold", type=int, default=3)
+    observations_empty_diagnostics_parser = observations_subparsers.add_parser(
+        "empty-diagnostics",
+        help="Build a read-only diagnostic report for empty retrieval observations.",
+    )
+    observations_empty_diagnostics_parser.add_argument("db_path", type=Path)
+    observations_empty_diagnostics_parser.add_argument("--limit", type=int, default=200)
+    observations_empty_diagnostics_parser.add_argument("--top", type=int, default=10)
+    observations_empty_diagnostics_parser.add_argument("--high-empty-threshold", type=float, default=0.5)
     observations_review_candidates_parser = observations_subparsers.add_parser(
         "review-candidates",
         help="Build a read-only forensic review report from top retrieval observation refs.",
@@ -1059,6 +1201,19 @@ def main() -> None:
                 )
             )
             return
+        if args.observations_action == "empty-diagnostics":
+            print(
+                json.dumps(
+                    _empty_retrieval_diagnostics(
+                        args.db_path,
+                        limit=args.limit,
+                        top=args.top,
+                        high_empty_threshold=args.high_empty_threshold,
+                    ),
+                    indent=2,
+                )
+            )
+            return
         if args.observations_action == "review-candidates":
             print(
                 json.dumps(
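
For reviewers, the threshold semantics of the segment payload in this diff can be illustrated with a reduced, standalone sketch. It mirrors only the surface grouping. `FakeObservation`, `empty_ratio_by_surface`, and the `fact:1`-style refs are hypothetical stand-ins, not the project's types or ref format.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FakeObservation:
    """Hypothetical stand-in for an observation row; only the fields this sketch reads."""
    id: int
    surface: str
    retrieved_memory_refs: list[str] = field(default_factory=list)


def empty_ratio_by_surface(observations, threshold):
    """Group by surface and flag segments with >= 1 empty row and ratio >= threshold."""
    groups = defaultdict(list)
    for observation in observations:
        groups[observation.surface].append(observation)
    result = {}
    for surface, rows in groups.items():
        empty = [row for row in rows if not row.retrieved_memory_refs]
        ratio = len(empty) / len(rows)
        result[surface] = {
            "total_count": len(rows),
            "empty_count": len(empty),
            "empty_ratio": round(ratio, 4),
            "signals": ["high_empty_segment"] if empty and ratio >= threshold else [],
        }
    return result


rows = [
    FakeObservation(1, "cli", ["fact:1"]),
    FakeObservation(2, "cli"),
    FakeObservation(3, "hermes"),
    FakeObservation(4, "hermes"),
    FakeObservation(5, "api", ["fact:2"]),
]
summary = empty_ratio_by_surface(rows, threshold=0.5)
```

Two properties are worth noting. A segment sitting exactly on the threshold is flagged because the comparison is inclusive. A segment with no empty rows is never flagged, even with a threshold of `0.0`, because of the `empty_count > 0` guard.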