diff --git a/.dev/status/current-handoff.md b/.dev/status/current-handoff.md index a5fe1a4..5eaafa6 100644 --- a/.dev/status/current-handoff.md +++ b/.dev/status/current-handoff.md @@ -1,7 +1,7 @@ # agent-memory current handoff Status: AI-authored draft. Not yet human-approved. -Last updated: 2026-05-01 00:20 KST +Last updated: 2026-05-01 01:10 KST ## Trigger for the next session @@ -16,7 +16,7 @@ read this file first. Do not ask the user to restate context. Verify repo state, ## Ready-to-say answer -agent-memory has completed release and Hermes QA through v0.1.34, and work is now underway on the v0.1.35 candidate, the first slice of Priority 5 dogfood/noise monitoring. The current branch is `feat/retrieval-observation-log`; the goal is to record which memories Hermes/CLI retrieval injected in a secret-safe local observation log, laying the groundwork for a later noisy memory audit. +agent-memory has completed release and Hermes QA through v0.1.36, and work is now underway on the read-only observation audit, the next slice of Priority 5 dogfood/noise monitoring. The current branch is `feat/observations-audit` and the worktree is `/Users/reddit/Project/agent-memory/.worktrees/observations-audit`; the goal is to add an `agent-memory observations audit` CLI that builds on the existing retrieval observation log to summarize frequently injected memory refs, surface/scope distribution, empty retrievals, and deprecated/disputed/missing ref signals without raw queries. ## Current repo state @@ -32,15 +32,15 @@ Expected GitHub identity: Verified base before this slice: -- latest completed release: `v0.1.34` -- v0.1.34 included published smoke propagation retry/backoff, release-sync PR CI dispatch, and read-only relation graph inspect CLI. -- local Hermes hook uses `/Users/reddit/.agent-memory/runtime/v0.1.34/.venv/bin/agent-memory` against `/Users/reddit/.agent-memory/memory.db`. +- latest completed release: `v0.1.36` +- v0.1.36 included secret-safe local retrieval observation logging and lazy migration for existing DBs without `retrieval_observations`. +- local Hermes hook uses `/Users/reddit/.agent-memory/runtime/v0.1.36/.venv/bin/agent-memory` against `/Users/reddit/.agent-memory/memory.db`. 
Active slice/worktree: -- branch: `feat/retrieval-observation-log` -- worktree: `/Users/reddit/Project/agent-memory/.worktrees/retrieval-observation-log` -- intended release after merge: likely `v0.1.35` +- branch: `feat/observations-audit` +- worktree: `/Users/reddit/Project/agent-memory/.worktrees/observations-audit` +- intended release after merge: likely `v0.1.37` Expected local untracked artifacts to preserve in the root checkout: @@ -52,107 +52,68 @@ Expected local untracked artifacts to preserve in the root checkout: Do not delete or commit these unless the user explicitly asks. -## What is complete through v0.1.34 +## Current slice: read-only retrieval observation audit -### Distribution and release automation - -- npm package and PyPI package are published from the same versioned source. -- npm-first user install path is documented and verified. -- Publish workflow gates GitHub Release creation on `published-install-smoke` after npm/PyPI publish. -- Published smoke uploads JSON diagnostics artifacts. -- v0.1.34 distinguishes normal retry budget from propagation/transient resolver failure budget and adds registry probe diagnostics. -- Protected `main` fallback is automated and rerun-idempotent. -- release-sync fallback now dispatches `ci.yml` on the bot-created release-sync branch and comments/step-summarizes that handoff. - -### Runtime adapter readiness - -- Hermes bootstrap/doctor/install flow exists and defaults to the conservative preset. -- This local Hermes setup has agent-memory enabled via `/Users/reddit/.agent-memory/runtime/v0.1.34/.venv/bin/agent-memory`. -- Hermes hook fails closed: unavailable DB/schema returns `{}` and exit 0 instead of breaking prompt flow. -- Conservative preset remains default: small prompt budgets, one top memory, no alternative-memory detail, no reason-code noise. -- `--preset balanced` is explicit opt-in for more context/noise. 
- -### Truth lifecycle, eval, and graph foundation - -- Normal retrieval is approved-only by default. -- Candidate/disputed/deprecated facts remain available only behind explicit forensic/review surfaces. -- `memory_status_transitions` records status changes. -- `review history`, `review supersede`, `review replacements`, and `review explain` exist. -- Retrieval eval calls the real retrieval path but suppresses retrieval bookkeeping writes. -- `agent-memory graph inspect --depth N --limit N` traverses stored `Relation` edges read-only and does not mutate memory state. +Goal: -## Current slice: local retrieval observation log +- Add a local-only, secret-safe, read-only audit report over `retrieval_observations`. +- Summarize dogfood/noise signals before changing ranking, graph traversal, or mutating memory cleanup. -Goal: +Implemented so far in the active worktree: -- Build a local-only, secret-safe observation log that records what retrieval injected during real dogfood use. -- This is the first Priority 5 dogfood/noise monitoring slice and should feed later noisy-memory audit commands. - -Implemented so far: - -- New SQLite table `retrieval_observations`. -- New model `RetrievalObservation`. -- New storage APIs: - - `record_retrieval_observation(...)` - - `list_retrieval_observations(...)` -- `retrieve_memory_packet(...)` accepts: - - `observation_surface` - - `observation_metadata` -- `agent-memory retrieve ... --observe ` records an opt-in observation. -- Hermes pre-LLM hook records an observation automatically with surface `hermes-pre-llm-hook`. 
- New CLI: - - `agent-memory observations list --limit 50` + - `agent-memory observations audit --limit 200 --top 10 --frequent-threshold 3` +- JSON output includes: + - `kind: retrieval_observation_audit` + - `read_only: true` + - `observation_count` + - `surface_counts` + - `preferred_scope_counts` + - `empty_retrieval_count` + - `top_memory_refs[]` with `memory_ref`, `injection_count`, `current_status`, `signals`, and sample observation ids +- Current signals: + - `frequently_injected` + - `current_status_not_approved` +- Storage helper added: + - `get_memory_status(db_path, memory_type=..., memory_id=...)` +- Docs updated: + - `README.md` + - `docs/hermes-dogfood.md` Secret-safety contract: -- raw query text is not stored. -- stores `query_sha256` and a short redacted preview. -- redacts secret-like assignments such as password/token/api_key/secret/credential/connection_string. -- stores selected memory refs, top memory ref, response mode, statuses, preferred scope, and small metadata. - -Files changed: - -- `src/agent_memory/core/models.py` -- `src/agent_memory/storage/schema.sql` -- `src/agent_memory/storage/sqlite.py` -- `src/agent_memory/core/retrieval.py` -- `src/agent_memory/integrations/hermes_hooks.py` -- `src/agent_memory/api/cli.py` -- `tests/test_cli.py` -- `README.md` -- `docs/hermes-dogfood.md` -- `.dev/status/current-handoff.md` - -Current focused verification already passed: - -```bash -uv run pytest tests/test_cli.py::test_python_module_cli_retrieve_observe_records_secret_safe_local_observation tests/test_cli.py::test_python_module_cli_hermes_pre_llm_hook_outputs_context_for_hermes_shell_hook_payload -q -# 2 passed - -uv run pytest tests/test_cli.py tests/test_retrieval_evaluation.py -q -# 83 passed -``` - -## Remaining work for this slice - -1. Run real smoke for observation CLI and Hermes hook from the worktree. -2. 
Run full verification: - ```bash - uv run pytest tests/ -q - uv run python scripts/check_release_metadata.py - uv run python scripts/smoke_release_readiness.py - npm pack --dry-run - git diff --check - node --check bin/agent-memory.js - ``` -3. Run static diff secret scan and confirm finding_count 0. -4. Commit branch and open PR. -5. Watch PR CI, merge when green. -6. Verify auto-release/release-sync/publish for likely v0.1.35. -7. Verify GitHub Release/npm/PyPI/published smoke artifact. -8. Install pinned Hermes runtime v0.1.35 and run Hermes QA. -9. Cleanup worktree/branch and update durable memory. - -## Next likely slice after this - -After observation logging is released and dogfooded, build a read-only noisy memory audit command over `retrieval_observations`, for example frequently injected memory refs, surprising scopes, high hidden-alternative counts, and stale/deprecated-nearby risks. +- audit uses existing observation rows and does not read or emit raw query text. +- output contains counts, memory refs, statuses, and observation ids only. +- keep this data local unless intentionally exported. + +Verification so far: + +- RED confirmed before implementation: + - `agent-memory observations audit` failed with argparse invalid choice. +- GREEN focused: + - `uv run pytest tests/test_cli.py::test_python_module_cli_observations_audit_reports_frequent_and_stale_refs_without_raw_queries -q` + - `1 passed` +- Focused regression group: + - `uv run pytest tests/test_cli.py::test_python_module_cli_observations_audit_reports_frequent_and_stale_refs_without_raw_queries tests/test_cli.py::test_python_module_cli_retrieve_observe_records_secret_safe_local_observation tests/test_cli.py::test_python_module_cli_observations_list_migrates_existing_database_without_observation_table -q` + - `3 passed` +- CLI help smoke: + - `uv run python -m agent_memory.api.cli observations audit --help` + - `uv run python -m agent_memory.api.cli observations list --help` + - both exit 0. 
+ +Remaining before PR: + +1. Run full local verification: + - `uv run pytest tests/ -q` + - `uv run python scripts/check_release_metadata.py` + - `uv run python scripts/smoke_release_readiness.py` + - `npm pack --dry-run` + - `git diff --check` + - `node --check bin/agent-memory.js` +2. Run real smoke for `observations audit` on a temp DB and confirm no raw secret-like query text appears. +3. Run static diff secret scan. +4. Create PR, watch CI, merge, follow release-sync/publish/published smoke/Hermes QA. + +## Next natural slice after this one + +After this audit slice is released and Hermes QA passes, the next likely Priority 5 step is dogfood cadence refinement: use the audit report over real Hermes observations to decide whether ranking/scope filters need adjustment. Avoid mutating cleanup or broad graph retrieval until the read-only signals have been observed in real use. diff --git a/README.md b/README.md index fb83e8f..8060f24 100644 --- a/README.md +++ b/README.md @@ -108,9 +108,10 @@ For local dogfood and noise monitoring, retrievals can leave a secret-safe obser ```bash agent-memory retrieve "$DB" "How should I install agent-memory?" --preferred-scope user:default --observe cli agent-memory observations list "$DB" --limit 20 +agent-memory observations audit "$DB" --limit 200 --top 10 --frequent-threshold 3 ``` -Use the observation log to spot frequently injected or surprising memories before changing retrieval behavior. Treat it as local operator telemetry, not a synced analytics stream. +Use the observation log and audit report to spot frequently injected or surprising memories before changing retrieval behavior. The audit output is read-only JSON with surface/scope counts, empty-retrieval count, top injected memory refs, current status for known refs, and simple signals such as `frequently_injected` and `current_status_not_approved`. Treat it as local operator telemetry, not a synced analytics stream. 
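Reviewer note: as a rough illustration of how the audit report described above might be consumed, the sketch below picks out the refs that carry the `current_status_not_approved` signal. The field names (`top_memory_refs`, `memory_ref`, `signals`, and so on) are taken from the JSON output listed in this change; the helper name `refs_needing_review` and every value in `sample_payload` are hypothetical and exist only for the example.

```python
from typing import Any

# Hypothetical payload shaped like the documented audit output.
# All values below are invented for illustration.
sample_payload: dict[str, Any] = {
    "kind": "retrieval_observation_audit",
    "read_only": True,
    "observation_count": 4,
    "empty_retrieval_count": 1,
    "top_memory_refs": [
        {
            "memory_ref": "fact:7",
            "injection_count": 3,
            "current_status": "deprecated",
            "signals": ["frequently_injected", "current_status_not_approved"],
            "sample_observation_ids": [1, 2, 4],
        },
        {
            "memory_ref": "fact:9",
            "injection_count": 1,
            "current_status": "approved",
            "signals": [],
            "sample_observation_ids": [3],
        },
    ],
}


def refs_needing_review(payload: dict[str, Any]) -> list[str]:
    """Return memory refs whose current status is no longer approved."""
    return [
        entry["memory_ref"]
        for entry in payload["top_memory_refs"]
        if "current_status_not_approved" in entry["signals"]
    ]


print(refs_needing_review(sample_payload))
```

In practice the payload would presumably come from `json.loads` over the command's stdout, as the test in this change does; the sketch avoids calling the CLI so it stays self-contained.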
## Hermes quickstart diff --git a/docs/hermes-dogfood.md b/docs/hermes-dogfood.md index 55f842d..45e7d91 100644 --- a/docs/hermes-dogfood.md +++ b/docs/hermes-dogfood.md @@ -37,6 +37,7 @@ Capture these observations for each dogfood run: - whether unrelated scopes stay out of the prompt - whether failure paths fail closed with no broken prompt text - whether `agent-memory observations list ~/.agent-memory/memory.db --limit 20` shows the expected memory refs without raw query text or secrets +- whether `agent-memory observations audit ~/.agent-memory/memory.db --limit 200 --top 10` highlights frequently injected or no-longer-approved refs before any retrieval tuning A good conservative smoke has low latency, at most one surfaced memory, no noisy reason codes, no workflow-blocking error if the memory DB is missing, and a local observation entry that explains what memory was injected. @@ -46,9 +47,10 @@ Hermes pre-LLM hook retrievals write a secret-safe local observation row to the ```bash agent-memory observations list ~/.agent-memory/memory.db --limit 20 +agent-memory observations audit ~/.agent-memory/memory.db --limit 200 --top 10 --frequent-threshold 3 ``` -Use this before tuning ranking or adding broader graph traversal: first confirm which memories are frequently injected, which scopes are active, and whether the top memory is surprising. Keep this data local unless you intentionally export it. +Use this before tuning ranking or adding broader graph traversal: first confirm which memories are frequently injected, which scopes are active, whether retrieval is often empty, and whether any frequently injected refs are now deprecated/disputed/missing. The audit command is read-only and summarizes local observation rows without emitting raw query text. Keep this data local unless you intentionally export it. 
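Reviewer note: the dogfood questions above (how often retrieval is empty, which scope is most active) can be read off the audit payload with a few lines of Python. This is a minimal sketch under stated assumptions: the keys `observation_count`, `empty_retrieval_count`, and `preferred_scope_counts` come from the audit output in this change, while the helper names and the sample numbers are hypothetical.

```python
from typing import Any, Optional


def empty_retrieval_ratio(payload: dict[str, Any]) -> float:
    """Fraction of audited observations that injected no memory at all."""
    total = payload["observation_count"]
    # Guard against an empty observation log to avoid dividing by zero.
    return payload["empty_retrieval_count"] / total if total else 0.0


def most_active_scope(payload: dict[str, Any]) -> Optional[str]:
    """Scope with the highest count; ties resolve to the alphabetically first."""
    counts = payload["preferred_scope_counts"]
    if not counts:
        return None
    return max(sorted(counts), key=counts.get)


# Hypothetical numbers, invented for the example.
example = {
    "observation_count": 4,
    "empty_retrieval_count": 1,
    "preferred_scope_counts": {"user:default": 1, "project:demo": 2},
}
print(empty_retrieval_ratio(example), most_active_scope(example))
```

Both helpers only read the payload, which matches the read-only intent of the audit command.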
## Fallback and rollback diff --git a/src/agent_memory/api/cli.py b/src/agent_memory/api/cli.py index f8da927..f1a72d1 100644 --- a/src/agent_memory/api/cli.py +++ b/src/agent_memory/api/cli.py @@ -3,6 +3,7 @@ import argparse import json import sys +from collections import Counter, defaultdict from pathlib import Path from typing import Any @@ -42,6 +43,7 @@ ) from agent_memory.storage.sqlite import ( get_fact, + get_memory_status, initialize_database, list_candidate_episodes, list_candidate_facts, @@ -123,6 +125,79 @@ def _status_counts_for_facts(facts) -> dict[str, int]: return counts +def _current_status_for_memory_ref(db_path: Path, memory_ref: str) -> str | None: + memory_type, separator, raw_id = memory_ref.partition(":") + if separator != ":" or not raw_id.isdigit() or memory_type not in {"fact", "procedure", "episode"}: + return None + try: + return get_memory_status(db_path, memory_type=memory_type, memory_id=int(raw_id)) + except ValueError: + return "missing" + + +def _audit_retrieval_observations( + db_path: Path, + *, + limit: int, + top: int, + frequent_threshold: int, +) -> dict[str, Any]: + if limit < 1: + raise ValueError("observations audit limit must be >= 1") + if top < 1: + raise ValueError("observations audit top must be >= 1") + if frequent_threshold < 1: + raise ValueError("observations audit frequent threshold must be >= 1") + + observations = list_retrieval_observations(db_path, limit=limit) + surface_counts = Counter(observation.surface for observation in observations) + preferred_scope_counts = Counter( + observation.preferred_scope for observation in observations if observation.preferred_scope is not None + ) + memory_ref_counts: Counter[str] = Counter() + sample_observation_ids_by_ref: dict[str, list[int]] = defaultdict(list) + empty_retrieval_count = 0 + for observation in observations: + if not observation.retrieved_memory_refs: + empty_retrieval_count += 1 + for memory_ref in observation.retrieved_memory_refs: + 
memory_ref_counts[memory_ref] += 1 + sample_ids = sample_observation_ids_by_ref[memory_ref] + if len(sample_ids) < 5: + sample_ids.append(observation.id) + + top_memory_refs = [] + for memory_ref, injection_count in sorted(memory_ref_counts.items(), key=lambda item: (-item[1], item[0]))[:top]: + current_status = _current_status_for_memory_ref(db_path, memory_ref) + signals = [] + if injection_count >= frequent_threshold: + signals.append("frequently_injected") + if current_status is not None and current_status != "approved": + signals.append("current_status_not_approved") + top_memory_refs.append( + { + "memory_ref": memory_ref, + "injection_count": injection_count, + "current_status": current_status, + "signals": signals, + "sample_observation_ids": sample_observation_ids_by_ref[memory_ref], + } + ) + + return { + "kind": "retrieval_observation_audit", + "read_only": True, + "observation_count": len(observations), + "limit": limit, + "top": top, + "frequent_threshold": frequent_threshold, + "surface_counts": dict(sorted(surface_counts.items())), + "preferred_scope_counts": dict(sorted(preferred_scope_counts.items())), + "empty_retrieval_count": empty_retrieval_count, + "top_memory_refs": top_memory_refs, + } + + def _inspect_relation_graph(db_path: Path, *, start_ref: str, depth: int, limit: int) -> dict[str, Any]: if depth < 0: raise ValueError("graph inspect depth must be >= 0") @@ -428,6 +503,11 @@ def _build_parser() -> argparse.ArgumentParser: observations_list_parser = observations_subparsers.add_parser("list") observations_list_parser.add_argument("db_path", type=Path) observations_list_parser.add_argument("--limit", type=int, default=50) + observations_audit_parser = observations_subparsers.add_parser("audit") + observations_audit_parser.add_argument("db_path", type=Path) + observations_audit_parser.add_argument("--limit", type=int, default=200) + observations_audit_parser.add_argument("--top", type=int, default=10) + 
observations_audit_parser.add_argument("--frequent-threshold", type=int, default=3) graph_parser = subparsers.add_parser("graph") graph_subparsers = graph_parser.add_subparsers(dest="graph_action", required=True) @@ -846,6 +926,19 @@ def main() -> None: ) ) return + if args.observations_action == "audit": + print( + json.dumps( + _audit_retrieval_observations( + args.db_path, + limit=args.limit, + top=args.top, + frequent_threshold=args.frequent_threshold, + ), + indent=2, + ) + ) + return raise ValueError(f"Unsupported observations action: {args.observations_action}") if args.command == "graph": diff --git a/src/agent_memory/storage/sqlite.py b/src/agent_memory/storage/sqlite.py index fd6fd69..8a0b739 100644 --- a/src/agent_memory/storage/sqlite.py +++ b/src/agent_memory/storage/sqlite.py @@ -369,6 +369,15 @@ def insert_relation( return relation_from_row(row) +def get_memory_status(db_path: Path | str, *, memory_type: MemoryType, memory_id: int) -> MemoryStatus: + table_name = TABLE_NAME_BY_MEMORY_TYPE[memory_type] + with connect(db_path) as connection: + row = connection.execute(f"SELECT status FROM {table_name} WHERE id = ?", (memory_id,)).fetchone() + if row is None: + raise ValueError(f"No {memory_type} memory found with id {memory_id}") + return row["status"] + + def get_fact(db_path: Path | str, *, fact_id: int) -> Fact: with connect(db_path) as connection: row = connection.execute("SELECT * FROM facts WHERE id = ?", (fact_id,)).fetchone() diff --git a/tests/test_cli.py b/tests/test_cli.py index 87d4577..f18b30e 100644 --- a/tests/test_cli.py +++ b/tests/test_cli.py @@ -8,7 +8,7 @@ from agent_memory.core.curation import approve_fact, create_candidate_fact from agent_memory.core.ingestion import ingest_source_text from agent_memory.integrations.hermes_hooks import scope_from_cwd -from agent_memory.storage.sqlite import initialize_database, insert_relation +from agent_memory.storage.sqlite import initialize_database, insert_relation, update_memory_status def 
test_python_module_cli_graph_inspect_returns_read_only_relation_neighborhood(tmp_path: Path) -> None: @@ -166,6 +166,99 @@ def test_python_module_cli_retrieve_observe_records_secret_safe_local_observatio assert "abc123" not in list_result.stdout +def test_python_module_cli_observations_audit_reports_frequent_and_stale_refs_without_raw_queries(tmp_path: Path) -> None: + db_path = tmp_path / "observation-audit.db" + initialize_database(db_path) + source = ingest_source_text( + db_path=db_path, + source_type="transcript", + content="Noisy audit target phrase appears in curated memory records.", + metadata={"project": "observation-audit"}, + ) + fact = create_candidate_fact( + db_path=db_path, + subject_ref="Noisy audit", + predicate="target_phrase", + object_ref_or_value="AUDIT_OK", + evidence_ids=[source.id], + scope="project:observation-audit", + confidence=0.95, + ) + approve_fact(db_path=db_path, fact_id=fact.id) + + env = {**os.environ, "PYTHONPATH": "src"} + for secret_query in ( + "What is the noisy audit target phrase? 
password=SUPERSECRET", + "Repeat the noisy audit target phrase token=abc123", + ): + retrieve_result = subprocess.run( + [ + sys.executable, + "-m", + "agent_memory.api.cli", + "retrieve", + str(db_path), + secret_query, + "--preferred-scope", + "project:observation-audit", + "--observe", + "cli-test", + ], + cwd=Path(__file__).resolve().parents[1], + env=env, + capture_output=True, + text=True, + ) + assert retrieve_result.returncode == 0, retrieve_result.stderr + + update_memory_status( + db_path, + memory_type="fact", + memory_id=fact.id, + status="deprecated", + reason="audit regression smoke", + actor="test", + ) + + audit_result = subprocess.run( + [ + sys.executable, + "-m", + "agent_memory.api.cli", + "observations", + "audit", + str(db_path), + "--limit", + "50", + "--top", + "5", + "--frequent-threshold", + "2", + ], + cwd=Path(__file__).resolve().parents[1], + env=env, + capture_output=True, + text=True, + ) + + assert audit_result.returncode == 0, audit_result.stderr + payload = json.loads(audit_result.stdout) + assert payload["kind"] == "retrieval_observation_audit" + assert payload["read_only"] is True + assert payload["observation_count"] == 2 + assert payload["surface_counts"] == {"cli-test": 2} + assert payload["preferred_scope_counts"] == {"project:observation-audit": 2} + assert payload["empty_retrieval_count"] == 0 + top_ref = payload["top_memory_refs"][0] + assert top_ref["memory_ref"] == f"fact:{fact.id}" + assert top_ref["injection_count"] == 2 + assert top_ref["current_status"] == "deprecated" + assert top_ref["signals"] == ["frequently_injected", "current_status_not_approved"] + assert top_ref["sample_observation_ids"] + assert "SUPERSECRET" not in audit_result.stdout + assert "abc123" not in audit_result.stdout + + def test_python_module_cli_observations_list_migrates_existing_database_without_observation_table(tmp_path: Path) -> None: db_path = tmp_path / "legacy-observation.db" initialize_database(db_path)