87 changes: 50 additions & 37 deletions .dev/status/current-handoff.md
@@ -1,7 +1,7 @@
# agent-memory current handoff

Status: AI-authored draft. Not yet human-approved.
Last updated: 2026-05-01 11:10 KST
Last updated: 2026-05-01 11:37 KST

## Trigger for the next session

@@ -16,7 +16,7 @@ read this file first. Do not ask the user to restate context. Verify repo state,

## Ready-to-say answer

agent-memory has completed release and Hermes QA through v0.1.39. Under Priority 5 dogfood/noise monitoring, the current slice uses the v0.1.39 dogfood results to make the JSON contract of `observations review-candidates` more operator-friendly. The branch is `feat/observation-review-temporal` and the worktree is `/Users/reddit/Project/agent-memory/.worktrees/observation-review-temporal`. The goal is to add top-level counts, per-ref observation windows, and fact status-history summaries to the review-candidates output so that historical injections are easier to tell apart from the current lifecycle state. Automatic cleanup/mutation is still not performed.
agent-memory has completed release and Hermes QA through v0.1.40. Under Priority 5 dogfood/noise monitoring, the current slice is a read-only one that strengthens diagnostics for empty retrieval/high empty ratio. The branch is `feat/empty-retrieval-diagnostics` and the worktree is `/Users/reddit/Project/agent-memory/.worktrees/empty-retrieval-diagnostics`. The goal is to add `observations empty-diagnostics`, which groups empty-heavy observations by surface/scope/status filter so that a human can safely judge scope mismatches or insufficient approved-memory coverage. Automatic cleanup/mutation is still not performed.

## Current repo state

@@ -32,17 +32,17 @@ Expected GitHub identity:

Verified before this slice:

- latest completed release: `v0.1.39`
- v0.1.39 added read-only `agent-memory observations review-candidates` and completed published smoke/Hermes runtime QA.
- local Hermes hook uses `/Users/reddit/.agent-memory/runtime/v0.1.39/.venv/bin/python -m agent_memory.api.cli hermes-pre-llm-hook ...` against `/Users/reddit/.agent-memory/memory.db`.
- latest completed release: `v0.1.40`
- v0.1.40 added observation windows/counts/status-history summaries to review-candidates and completed published smoke/Hermes runtime QA.
- local Hermes hook uses `/Users/reddit/.agent-memory/runtime/v0.1.40/.venv/bin/python -m agent_memory.api.cli hermes-pre-llm-hook ...` against `/Users/reddit/.agent-memory/memory.db`.
- root checkout was clean on `main...origin/main` except local-only untracked state.
- open PRs were `[]`.

Active slice/worktree:

- branch: `feat/observation-review-temporal`
- worktree: `/Users/reddit/Project/agent-memory/.worktrees/observation-review-temporal`
- intended release after merge: likely `v0.1.40`
- branch: `feat/empty-retrieval-diagnostics`
- worktree: `/Users/reddit/Project/agent-memory/.worktrees/empty-retrieval-diagnostics`
- intended release after merge: likely `v0.1.41`

Expected local untracked artifacts to preserve in the root checkout:

@@ -54,63 +54,76 @@ Expected local untracked artifacts to preserve in the root checkout:

Do not delete or commit these unless the user explicitly asks.

## Current slice: observation review temporal summaries
## Current slice: empty retrieval diagnostics

Goal:

- Keep dogfood/noise monitoring read-only.
- Make `observations review-candidates` easier to consume from local dogfood output.
- Add compact count/window/history summaries without exposing raw user queries and without mutating memory.
- Make high empty retrieval ratio actionable without storing or emitting raw user queries.
- Diagnose empty-heavy segments by surface, preferred scope, and retrieval status filter before changing rankers or adding graph traversal.

Implemented so far in the active worktree:

- `observations audit` top refs now include `observation_window`:
  - `first_observation_id`
  - `first_observed_at`
  - `latest_observation_id`
  - `latest_observed_at`
- `observations review-candidates` now includes top-level:
- New CLI command:
  - `agent-memory observations empty-diagnostics <db_path> --limit 200 --top 10 --high-empty-threshold 0.5`
- Output contract:
  - `kind: retrieval_empty_diagnostics`
  - `read_only: true`
  - `observation_count`
  - `candidate_count`
- Each review candidate now includes:
  - the propagated `observation_window`
  - `status_history_summary.transition_count`
  - `status_history_summary.latest_transition`
  - `empty_retrieval_count`
  - `empty_retrieval_ratio`
  - `quality_warnings`
  - top-level `observation_window`
  - `empty_by_surface[]`
  - `empty_by_preferred_scope[]`
  - `empty_by_status_filter[]`
  - `suggested_next_steps`
- Segment entries include:
  - segment key (`surface`, `preferred_scope`, or `statuses`)
  - `total_count`
  - `empty_count`
  - `empty_ratio`
  - `signals`, currently `high_empty_segment` when the segment's empty ratio is at or above the threshold and it has at least one empty observation
  - `sample_observation_ids`
  - `observation_window`
- Secret-safety preserved:
  - no raw query text
  - no query previews
  - no prompt content
- Docs updated:
  - `README.md`
  - `docs/hermes-dogfood.md`
- Tests updated in `tests/test_cli.py`:
  - audit regression asserts per-ref observation window.
  - review-candidates regression asserts top-level counts and status history summary.
  - new regression asserts empty diagnostics segment grouping, read-only shape, next-step hints, and no secret leakage from raw query strings.
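
The segment entries listed above can be illustrated with a small standalone sketch. It mirrors the ratio and signal rule from this PR's `cli.py` change, but the `Observation` dataclass and the `segment_summary` name are hypothetical stand-ins for illustration, not the project's real types, and `observation_window` is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    # Hypothetical stand-in for a stored retrieval observation row.
    id: int
    surface: str
    retrieved_memory_refs: list = field(default_factory=list)


def segment_summary(surface, observations, high_empty_threshold=0.5):
    """Summarize one surface segment: counts, empty ratio, and signals."""
    empty = [o for o in observations if not o.retrieved_memory_refs]
    total = len(observations)
    ratio = len(empty) / total if total else 0.0
    signals = []
    # Flag only when at least one retrieval was empty and the ratio
    # is at or above the threshold.
    if empty and ratio >= high_empty_threshold:
        signals.append("high_empty_segment")
    return {
        "surface": surface,
        "total_count": total,
        "empty_count": len(empty),
        "empty_ratio": round(ratio, 4),
        "signals": signals,
        "sample_observation_ids": [o.id for o in empty[:5]],
    }


rows = [
    Observation(1, "cli", ["fact:1"]),
    Observation(2, "cli"),
    Observation(3, "cli"),
]
summary = segment_summary("cli", rows)
```

With two of three retrievals empty, the ratio rounds to `0.6667`, which meets the default `0.5` threshold, so the segment carries `high_empty_segment` and samples observation ids `2` and `3`.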

Verification so far:

- RED confirmed:
  - focused tests failed on missing `observation_window` and top-level `observation_count`.
  - focused test initially failed because `empty-diagnostics` parser choice was missing.
- GREEN focused:
  - `TMPDIR=$PWD/.tmp-test uv run pytest tests/test_cli.py::test_python_module_cli_observations_audit_reports_frequent_and_stale_refs_without_raw_queries tests/test_cli.py::test_python_module_cli_observations_review_candidates_explains_top_refs_without_mutation_or_raw_queries -q`
  - `2 passed`
  - `TMPDIR=$PWD/.tmp-test uv run pytest tests/test_cli.py::test_python_module_cli_observations_empty_diagnostics_groups_empty_segments_without_raw_queries -q`
  - `1 passed`

Remaining before PR:

1. Run broader/full local verification:
   - focused CLI tests around audit/review-candidates
1. Run broader focused CLI tests around observations audit/review-candidates/empty-diagnostics.
2. Run full local verification:
   - `uv run pytest tests/ -q`
   - `uv run python scripts/check_release_metadata.py`
   - `uv run python scripts/smoke_release_readiness.py`
   - `npm pack --dry-run`
   - `git diff --check`
   - `node --check bin/agent-memory.js`
2. Run real local DB smoke for `observations review-candidates` and verify the new fields exist.
3. Run static diff secret scan.
4. Create PR, watch CI, merge, follow release-sync/publish/published smoke/Hermes QA.
5. After v0.1.40 install, repeat Hermes hook doctor and installed `observations review-candidates` against the existing local DB.
3. Run real local DB smoke for `observations empty-diagnostics` and verify no raw query fields appear.
4. Run static diff secret scan.
5. Create PR, watch CI, merge, follow release-sync/publish/published smoke/Hermes QA.
6. After v0.1.41 install, repeat Hermes hook doctor and installed `observations empty-diagnostics` against the existing local DB.
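
The "verify no raw query fields appear" smoke step could be approximated with a recursive key scan over the report JSON. This is only a sketch: the deny-list below is an assumption (the handoff does not name the forbidden keys), and `find_forbidden_keys` is a hypothetical helper, not part of agent-memory.

```python
import json

# Assumed deny-list for illustration; adjust to whatever key names
# would indicate raw query or prompt content in a real report.
FORBIDDEN_KEYS = {"query", "query_text", "query_preview", "prompt"}


def find_forbidden_keys(node, path="$"):
    """Return the JSON paths of any keys whose name is in FORBIDDEN_KEYS."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key in FORBIDDEN_KEYS:
                hits.append(child)
            hits.extend(find_forbidden_keys(value, child))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits.extend(find_forbidden_keys(value, f"{path}[{index}]"))
    return hits


clean_report = json.loads(
    '{"kind": "retrieval_empty_diagnostics", "empty_by_surface": [{"surface": "cli"}]}'
)
leaky_report = {"empty_by_surface": [{"surface": "cli", "query_preview": "How should I"}]}

clean_hits = find_forbidden_keys(clean_report)
leaky_hits = find_forbidden_keys(leaky_report)
```

A key-name scan only catches leaks that arrive under a recognizable field name; it does not detect query text embedded inside other string values, so it complements rather than replaces the static diff secret scan.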

## Next natural slice after this one

After the review-candidates contract is released and dogfooded, continue Priority 5 by either:
After empty retrieval diagnostics are released and dogfooded, continue Priority 5 by either:

1. improving retrieval diagnostics for empty retrieval/high empty ratio, or
2. adding an explicit human review cadence/checklist around candidate reports.
1. adding an explicit human review cadence/checklist around audit/review-candidates/empty-diagnostics, or
2. improving candidate report UX further by bundling suggested follow-up commands into a richer read-only triage report.

Avoid automatic cleanup/deprecation until the review candidate workflow has been used on real local data for a while.
Avoid automatic cleanup/deprecation until the review and diagnostics workflow has been used on real local data for a while.
3 changes: 2 additions & 1 deletion README.md
@@ -109,10 +109,11 @@ For local dogfood and noise monitoring, retrievals can leave a secret-safe obser
agent-memory retrieve "$DB" "How should I install agent-memory?" --preferred-scope user:default --observe cli
agent-memory observations list "$DB" --limit 20
agent-memory observations audit "$DB" --limit 200 --top 10 --frequent-threshold 3
agent-memory observations empty-diagnostics "$DB" --limit 200 --top 10 --high-empty-threshold 0.5
agent-memory observations review-candidates "$DB" --limit 200 --top 10 --frequent-threshold 3
```

Use the observation log and audit report to spot frequently injected or surprising memories before changing retrieval behavior. The audit output is read-only JSON with surface/scope counts, empty-retrieval count and ratio, quality warnings such as `low_observation_count` or `high_empty_retrieval_ratio`, top injected memory refs, current status for known refs, per-ref observation windows, and simple signals such as `frequently_injected` and `current_status_not_approved`. `observations review-candidates` is also read-only; it turns the top audit refs into forensic candidates with top-level `observation_count`/`candidate_count`, fact review explanations, status-history summaries, replacement-chain hints, graph-neighborhood summaries, and copy-paste follow-up commands such as `review explain`, `review replacements`, and `graph inspect`. Treat these reports as local operator telemetry, not a synced analytics feature or an automatic cleanup workflow.
Use the observation log and audit report to spot frequently injected or surprising memories before changing retrieval behavior. The audit output is read-only JSON with surface/scope counts, empty-retrieval count and ratio, quality warnings such as `low_observation_count` or `high_empty_retrieval_ratio`, top injected memory refs, current status for known refs, per-ref observation windows, and simple signals such as `frequently_injected` and `current_status_not_approved`. `observations empty-diagnostics` is read-only and focuses specifically on empty retrievals: it groups empty-heavy observations by surface, preferred scope, and status filter with segment ratios, sample observation ids, observation windows, and next-step hints for checking scope mismatches or missing approved memory coverage before changing rankers. `observations review-candidates` is also read-only; it turns the top audit refs into forensic candidates with top-level `observation_count`/`candidate_count`, fact review explanations, status-history summaries, replacement-chain hints, graph-neighborhood summaries, and copy-paste follow-up commands such as `review explain`, `review replacements`, and `graph inspect`. Treat these reports as local operator telemetry, not a synced analytics feature or an automatic cleanup workflow.
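
As a rough illustration of consuming the `empty-diagnostics` report, the sketch below collects the segments that carry the `high_empty_segment` signal across the three grouping lists. The `flagged_segments` helper and the sample values (such as the `hermes` surface) are hypothetical; only the report key names come from the output contract in this PR.

```python
def flagged_segments(report):
    """Collect (dimension, segment value, empty_ratio) for flagged segments."""
    dimensions = {
        "empty_by_surface": "surface",
        "empty_by_preferred_scope": "preferred_scope",
        "empty_by_status_filter": "statuses",
    }
    found = []
    for list_key, segment_key in dimensions.items():
        for segment in report.get(list_key, []):
            if "high_empty_segment" in segment.get("signals", []):
                found.append((segment_key, segment[segment_key], segment["empty_ratio"]))
    return found


# Hand-written sample shaped like the report; the values are made up.
sample_report = {
    "empty_by_surface": [
        {"surface": "hermes", "empty_ratio": 0.8, "signals": ["high_empty_segment"]},
        {"surface": "cli", "empty_ratio": 0.1, "signals": []},
    ],
    "empty_by_preferred_scope": [
        {"preferred_scope": None, "empty_ratio": 0.75, "signals": ["high_empty_segment"]},
    ],
    "empty_by_status_filter": [],
}

flagged = flagged_segments(sample_report)
```

Because the command prints its report as JSON, the same function could be fed from `json.load(sys.stdin)` when the CLI output is piped into a script.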

## Hermes quickstart

3 changes: 3 additions & 0 deletions docs/hermes-dogfood.md
@@ -48,11 +48,14 @@ Hermes pre-LLM hook retrievals write a secret-safe local observation row to the
```bash
agent-memory observations list ~/.agent-memory/memory.db --limit 20
agent-memory observations audit ~/.agent-memory/memory.db --limit 200 --top 10 --frequent-threshold 3
agent-memory observations empty-diagnostics ~/.agent-memory/memory.db --limit 200 --top 10 --high-empty-threshold 0.5
agent-memory observations review-candidates ~/.agent-memory/memory.db --limit 200 --top 10 --frequent-threshold 3
```

Use this before tuning ranking or adding broader graph traversal: first confirm which memories are frequently injected, which scopes are active, whether retrieval is often empty, and whether any frequently injected refs are now deprecated/disputed/missing. The audit command is read-only and summarizes local observation rows without emitting raw query text or query previews. Keep this data local unless you intentionally export it.

When `empty_retrieval_ratio` is high, run `observations empty-diagnostics` before changing rankers. It is a read-only, secret-safe segment report for empty observations. It groups empty-heavy rows by surface, preferred scope, and status filter; includes each segment's total count, empty count, empty ratio, sample observation ids, and observation window; and suggests operator checks such as scope mismatch review or adding/approving durable memories only after confirming the misses are real user needs. It does not emit raw query text, query previews, or prompt content.
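
The "run `empty-diagnostics` when the ratio is high" guidance can be sketched as a small gate over an audit report. This assumes the audit JSON exposes `empty_retrieval_ratio` and `quality_warnings` as top-level keys, which the prose suggests but this diff does not show; `should_run_empty_diagnostics` is a hypothetical helper.

```python
def should_run_empty_diagnostics(audit_report, threshold=0.5):
    """Return True when the audit suggests empty retrievals deserve a closer look."""
    # Assumed top-level keys; missing keys fall back to "nothing to report".
    warnings = audit_report.get("quality_warnings", [])
    ratio = audit_report.get("empty_retrieval_ratio", 0.0)
    return "high_empty_retrieval_ratio" in warnings or ratio >= threshold


quiet_audit = {"empty_retrieval_ratio": 0.1, "quality_warnings": []}
noisy_audit = {"empty_retrieval_ratio": 0.62, "quality_warnings": ["high_empty_retrieval_ratio"]}

run_for_quiet = should_run_empty_diagnostics(quiet_audit)
run_for_noisy = should_run_empty_diagnostics(noisy_audit)
```

The gate stays advisory: it only tells an operator which read-only report to run next and never triggers cleanup or ranking changes.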

`observations review-candidates` is the next read-only step after audit. It keeps the same secret-safe observation summary, then expands each top ref into a forensic candidate:

- fact refs include the same lifecycle explanation as `agent-memory review explain fact ...`.
155 changes: 155 additions & 0 deletions src/agent_memory/api/cli.py
@@ -267,6 +267,140 @@ def _audit_retrieval_observations(
    }


def _observation_window(observations) -> dict[str, Any] | None:
    if not observations:
        return None
    first = min(observations, key=lambda observation: observation.id)
    latest = max(observations, key=lambda observation: observation.id)
    return {
        "first_observation_id": first.id,
        "first_observed_at": first.created_at,
        "latest_observation_id": latest.id,
        "latest_observed_at": latest.created_at,
    }


def _empty_diagnostic_segment_payload(
    *,
    segment_name: str,
    segment_value: Any,
    observations,
    high_empty_threshold: float,
) -> dict[str, Any]:
    empty_observations = [observation for observation in observations if not observation.retrieved_memory_refs]
    total_count = len(observations)
    empty_count = len(empty_observations)
    empty_ratio = empty_count / total_count if total_count else 0.0
    signals = []
    if empty_ratio >= high_empty_threshold and empty_count > 0:
        signals.append("high_empty_segment")
    return {
        segment_name: segment_value,
        "total_count": total_count,
        "empty_count": empty_count,
        "empty_ratio": round(empty_ratio, 4),
        "signals": signals,
        "sample_observation_ids": [observation.id for observation in empty_observations[:5]],
        "observation_window": _observation_window(observations),
    }


def _empty_retrieval_diagnostics(
    db_path: Path,
    *,
    limit: int,
    top: int,
    high_empty_threshold: float,
) -> dict[str, Any]:
    if limit < 1:
        raise ValueError("observations empty-diagnostics limit must be >= 1")
    if top < 1:
        raise ValueError("observations empty-diagnostics top must be >= 1")
    if high_empty_threshold < 0 or high_empty_threshold > 1:
        raise ValueError("observations empty-diagnostics high empty threshold must be between 0 and 1")

    observations = list_retrieval_observations(db_path, limit=limit)
    empty_observations = [observation for observation in observations if not observation.retrieved_memory_refs]
    empty_retrieval_ratio = len(empty_observations) / len(observations) if observations else 0.0

    observations_by_surface: dict[str, list[Any]] = defaultdict(list)
    observations_by_scope: dict[str | None, list[Any]] = defaultdict(list)
    observations_by_statuses: dict[tuple[str, ...], list[Any]] = defaultdict(list)
    for observation in observations:
        observations_by_surface[observation.surface].append(observation)
        observations_by_scope[observation.preferred_scope].append(observation)
        observations_by_statuses[tuple(observation.statuses)].append(observation)

    def sort_segments(items):
        return sorted(
            items,
            key=lambda item: (-item["empty_count"], -item["empty_ratio"], str(next(iter(item.values())))),
        )[:top]

    empty_by_surface = sort_segments(
        [
            _empty_diagnostic_segment_payload(
                segment_name="surface",
                segment_value=surface,
                observations=segment_observations,
                high_empty_threshold=high_empty_threshold,
            )
            for surface, segment_observations in observations_by_surface.items()
        ]
    )
    empty_by_preferred_scope = sort_segments(
        [
            _empty_diagnostic_segment_payload(
                segment_name="preferred_scope",
                segment_value=preferred_scope,
                observations=segment_observations,
                high_empty_threshold=high_empty_threshold,
            )
            for preferred_scope, segment_observations in observations_by_scope.items()
        ]
    )
    empty_by_status_filter = sort_segments(
        [
            _empty_diagnostic_segment_payload(
                segment_name="statuses",
                segment_value=list(statuses),
                observations=segment_observations,
                high_empty_threshold=high_empty_threshold,
            )
            for statuses, segment_observations in observations_by_statuses.items()
        ]
    )

    quality_warnings = []
    if not observations:
        quality_warnings.append("no_observations")
    if 0 < len(observations) < 10:
        quality_warnings.append("low_observation_count")
    if empty_retrieval_ratio >= high_empty_threshold and observations:
        quality_warnings.append("high_empty_retrieval_ratio")

    return {
        "kind": "retrieval_empty_diagnostics",
        "read_only": True,
        "observation_count": len(observations),
        "limit": limit,
        "top": top,
        "high_empty_threshold": high_empty_threshold,
        "empty_retrieval_count": len(empty_observations),
        "empty_retrieval_ratio": round(empty_retrieval_ratio, 4),
        "quality_warnings": quality_warnings,
        "observation_window": _observation_window(observations),
        "empty_by_surface": empty_by_surface,
        "empty_by_preferred_scope": empty_by_preferred_scope,
        "empty_by_status_filter": empty_by_status_filter,
        "suggested_next_steps": [
            "Run observations audit to compare empty vs non-empty retrieval surfaces.",
            "Check preferred scope values for scope mismatches before changing ranking.",
            "Add or approve memories only after confirming the missing queries represent durable user needs.",
        ],
    }


def _review_candidates_from_observations(
    db_path: Path,
    *,
@@ -654,6 +788,14 @@ def _build_parser() -> argparse.ArgumentParser:
    observations_audit_parser.add_argument("--limit", type=int, default=200)
    observations_audit_parser.add_argument("--top", type=int, default=10)
    observations_audit_parser.add_argument("--frequent-threshold", type=int, default=3)
    observations_empty_diagnostics_parser = observations_subparsers.add_parser(
        "empty-diagnostics",
        help="Build a read-only diagnostic report for empty retrieval observations.",
    )
    observations_empty_diagnostics_parser.add_argument("db_path", type=Path)
    observations_empty_diagnostics_parser.add_argument("--limit", type=int, default=200)
    observations_empty_diagnostics_parser.add_argument("--top", type=int, default=10)
    observations_empty_diagnostics_parser.add_argument("--high-empty-threshold", type=float, default=0.5)
    observations_review_candidates_parser = observations_subparsers.add_parser(
        "review-candidates",
        help="Build a read-only forensic review report from top retrieval observation refs.",
@@ -1059,6 +1201,19 @@ def main() -> None:
                )
            )
            return
        if args.observations_action == "empty-diagnostics":
            print(
                json.dumps(
                    _empty_retrieval_diagnostics(
                        args.db_path,
                        limit=args.limit,
                        top=args.top,
                        high_empty_threshold=args.high_empty_threshold,
                    ),
                    indent=2,
                )
            )
            return
        if args.observations_action == "review-candidates":
            print(
                json.dumps(