Merged
169 changes: 65 additions & 104 deletions .dev/status/current-handoff.md
@@ -1,7 +1,7 @@
# agent-memory current handoff

Status: AI-authored draft. Not yet human-approved.
Last updated: 2026-05-01 00:20 KST
Last updated: 2026-05-01 01:10 KST

## Trigger for the next session

@@ -16,7 +16,7 @@ read this file first. Do not ask the user to restate context. Verify repo state,

## Ready-to-say answer

agent-memory has completed release and Hermes QA through v0.1.34, and work is now on the v0.1.35 candidate, the first Priority 5 dogfood/noise monitoring slice. The current branch is `feat/retrieval-observation-log`, and the goal is to record which memories Hermes/CLI retrieval injected in a secret-safe local observation log, laying the foundation for a later noisy-memory audit.
agent-memory has completed release and Hermes QA through v0.1.36, and the next Priority 5 dogfood/noise monitoring slice, a read-only observation audit, is now in progress. The current branch is `feat/observations-audit` with worktree `/Users/reddit/Project/agent-memory/.worktrees/observations-audit`, and the goal is to add an `agent-memory observations audit` CLI that builds on the existing retrieval observation log to summarize frequently injected memory refs, surface/scope distribution, empty retrievals, and deprecated/disputed/missing ref signals without raw queries.

## Current repo state

@@ -32,15 +32,15 @@ Expected GitHub identity:

Verified base before this slice:

- latest completed release: `v0.1.34`
- v0.1.34 included published smoke propagation retry/backoff, release-sync PR CI dispatch, and read-only relation graph inspect CLI.
- local Hermes hook uses `/Users/reddit/.agent-memory/runtime/v0.1.34/.venv/bin/agent-memory` against `/Users/reddit/.agent-memory/memory.db`.
- latest completed release: `v0.1.36`
- v0.1.36 included secret-safe local retrieval observation logging and lazy migration for existing DBs without `retrieval_observations`.
- local Hermes hook uses `/Users/reddit/.agent-memory/runtime/v0.1.36/.venv/bin/agent-memory` against `/Users/reddit/.agent-memory/memory.db`.

Active slice/worktree:

- branch: `feat/retrieval-observation-log`
- worktree: `/Users/reddit/Project/agent-memory/.worktrees/retrieval-observation-log`
- intended release after merge: likely `v0.1.35`
- branch: `feat/observations-audit`
- worktree: `/Users/reddit/Project/agent-memory/.worktrees/observations-audit`
- intended release after merge: likely `v0.1.37`

Expected local untracked artifacts to preserve in the root checkout:

@@ -52,107 +52,68 @@ Expected local untracked artifacts to preserve in the root checkout:

Do not delete or commit these unless the user explicitly asks.

## What is complete through v0.1.34
## Current slice: read-only retrieval observation audit

### Distribution and release automation

- npm package and PyPI package are published from the same versioned source.
- npm-first user install path is documented and verified.
- Publish workflow gates GitHub Release creation on `published-install-smoke` after npm/PyPI publish.
- Published smoke uploads JSON diagnostics artifacts.
- v0.1.34 distinguishes normal retry budget from propagation/transient resolver failure budget and adds registry probe diagnostics.
- Protected `main` fallback is automated and rerun-idempotent.
- release-sync fallback now dispatches `ci.yml` on the bot-created release-sync branch and comments/step-summarizes that handoff.

### Runtime adapter readiness

- Hermes bootstrap/doctor/install flow exists and defaults to the conservative preset.
- This local Hermes setup has agent-memory enabled via `/Users/reddit/.agent-memory/runtime/v0.1.34/.venv/bin/agent-memory`.
- Hermes hook fails closed: unavailable DB/schema returns `{}` and exit 0 instead of breaking prompt flow.
- Conservative preset remains default: small prompt budgets, one top memory, no alternative-memory detail, no reason-code noise.
- `--preset balanced` is explicit opt-in for more context/noise.

### Truth lifecycle, eval, and graph foundation

- Normal retrieval is approved-only by default.
- Candidate/disputed/deprecated facts remain available only behind explicit forensic/review surfaces.
- `memory_status_transitions` records status changes.
- `review history`, `review supersede`, `review replacements`, and `review explain` exist.
- Retrieval eval calls the real retrieval path but suppresses retrieval bookkeeping writes.
- `agent-memory graph inspect <db_path> <start_ref> --depth N --limit N` traverses stored `Relation` edges read-only and does not mutate memory state.
Goal:

## Current slice: local retrieval observation log
- Add a local-only, secret-safe, read-only audit report over `retrieval_observations`.
- Summarize dogfood/noise signals before changing ranking, graph traversal, or mutating memory cleanup.

Goal:
Implemented so far in the active worktree:

- Build a local-only, secret-safe observation log that records what retrieval injected during real dogfood use.
- This is the first Priority 5 dogfood/noise monitoring slice and should feed later noisy-memory audit commands.

Implemented so far:

- New SQLite table `retrieval_observations`.
- New model `RetrievalObservation`.
- New storage APIs:
- `record_retrieval_observation(...)`
- `list_retrieval_observations(...)`
- `retrieve_memory_packet(...)` accepts:
- `observation_surface`
- `observation_metadata`
- `agent-memory retrieve ... --observe <surface>` records an opt-in observation.
- Hermes pre-LLM hook records an observation automatically with surface `hermes-pre-llm-hook`.
- New CLI:
- `agent-memory observations list <db_path> --limit 50`
- `agent-memory observations audit <db_path> --limit 200 --top 10 --frequent-threshold 3`
- JSON output includes:
- `kind: retrieval_observation_audit`
- `read_only: true`
- `observation_count`
- `surface_counts`
- `preferred_scope_counts`
- `empty_retrieval_count`
- `top_memory_refs[]` with `memory_ref`, `injection_count`, `current_status`, `signals`, and sample observation ids
- Current signals:
- `frequently_injected`
- `current_status_not_approved`
- Storage helper added:
- `get_memory_status(db_path, memory_type=..., memory_id=...)`
- Docs updated:
- `README.md`
- `docs/hermes-dogfood.md`

Secret-safety contract:

- raw query text is not stored.
- stores `query_sha256` and a short redacted preview.
- redacts secret-like assignments such as password/token/api_key/secret/credential/connection_string.
- stores selected memory refs, top memory ref, response mode, statuses, preferred scope, and small metadata.
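
The contract above can be illustrated with a minimal sketch. The `observe_query` helper name, the regex, and the preview length are assumptions for illustration, not the repo's actual implementation:

```python
import hashlib
import re

# Hypothetical sketch of the secret-safety contract: hash the raw query,
# keep only a short preview with secret-like assignments redacted.
SECRET_ASSIGNMENT = re.compile(
    r"(?i)\b(password|token|api_key|secret|credential|connection_string)\b\s*[=:]\s*\S+"
)


def observe_query(raw_query: str, preview_length: int = 80) -> dict:
    redacted = SECRET_ASSIGNMENT.sub(lambda m: m.group(1) + "=[REDACTED]", raw_query)
    return {
        "query_sha256": hashlib.sha256(raw_query.encode("utf-8")).hexdigest(),
        "query_preview": redacted[:preview_length],
    }
```

The key property is that the raw query never reaches storage: only the hash (for deduplication/correlation) and a redacted, truncated preview do.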

Files changed:

- `src/agent_memory/core/models.py`
- `src/agent_memory/storage/schema.sql`
- `src/agent_memory/storage/sqlite.py`
- `src/agent_memory/core/retrieval.py`
- `src/agent_memory/integrations/hermes_hooks.py`
- `src/agent_memory/api/cli.py`
- `tests/test_cli.py`
- `README.md`
- `docs/hermes-dogfood.md`
- `.dev/status/current-handoff.md`

Current focused verification already passed:

```bash
uv run pytest tests/test_cli.py::test_python_module_cli_retrieve_observe_records_secret_safe_local_observation tests/test_cli.py::test_python_module_cli_hermes_pre_llm_hook_outputs_context_for_hermes_shell_hook_payload -q
# 2 passed

uv run pytest tests/test_cli.py tests/test_retrieval_evaluation.py -q
# 83 passed
```

## Remaining work for this slice

1. Run real smoke for observation CLI and Hermes hook from the worktree.
2. Run full verification:
```bash
uv run pytest tests/ -q
uv run python scripts/check_release_metadata.py
uv run python scripts/smoke_release_readiness.py
npm pack --dry-run
git diff --check
node --check bin/agent-memory.js
```
3. Run static diff secret scan and confirm finding_count 0.
4. Commit branch and open PR.
5. Watch PR CI, merge when green.
6. Verify auto-release/release-sync/publish for likely v0.1.35.
7. Verify GitHub Release/npm/PyPI/published smoke artifact.
8. Install pinned Hermes runtime v0.1.35 and run Hermes QA.
9. Cleanup worktree/branch and update durable memory.

## Next likely slice after this

After observation logging is released and dogfooded, build a read-only noisy-memory audit command over `retrieval_observations` covering, for example, frequently injected memory refs, surprising scopes, high hidden-alternative counts, and stale/deprecated-nearby risks.
- audit uses existing observation rows and does not read or emit raw query text.
- output contains counts, memory refs, statuses, and observation ids only.
- keep this data local unless intentionally exported.
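
To make the contract concrete, here is an illustrative report matching the documented keys; the values are made up, but the shape mirrors the JSON contract described above:

```python
# Illustrative shape of the read-only audit report. Values are invented;
# keys follow the documented retrieval_observation_audit contract.
example_report = {
    "kind": "retrieval_observation_audit",
    "read_only": True,
    "observation_count": 12,
    "limit": 200,
    "top": 10,
    "frequent_threshold": 3,
    "surface_counts": {"cli": 2, "hermes-pre-llm-hook": 10},
    "preferred_scope_counts": {"user:default": 11},
    "empty_retrieval_count": 1,
    "top_memory_refs": [
        {
            "memory_ref": "fact:7",
            "injection_count": 5,
            "current_status": "deprecated",
            "signals": ["frequently_injected", "current_status_not_approved"],
            "sample_observation_ids": [1, 2, 3, 4, 5],
        }
    ],
}

# Counts, refs, statuses, and observation ids only -- no raw query text.
assert all("raw" not in key for key in example_report)
```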

Verification so far:

- RED confirmed before implementation:
- `agent-memory observations audit` failed with argparse invalid choice.
- GREEN focused:
- `uv run pytest tests/test_cli.py::test_python_module_cli_observations_audit_reports_frequent_and_stale_refs_without_raw_queries -q`
- `1 passed`
- Focused regression group:
- `uv run pytest tests/test_cli.py::test_python_module_cli_observations_audit_reports_frequent_and_stale_refs_without_raw_queries tests/test_cli.py::test_python_module_cli_retrieve_observe_records_secret_safe_local_observation tests/test_cli.py::test_python_module_cli_observations_list_migrates_existing_database_without_observation_table -q`
- `3 passed`
- CLI help smoke:
- `uv run python -m agent_memory.api.cli observations audit --help`
- `uv run python -m agent_memory.api.cli observations list --help`
- both exit 0.

Remaining before PR:

1. Run full local verification:
- `uv run pytest tests/ -q`
- `uv run python scripts/check_release_metadata.py`
- `uv run python scripts/smoke_release_readiness.py`
- `npm pack --dry-run`
- `git diff --check`
- `node --check bin/agent-memory.js`
2. Run real smoke for `observations audit` on a temp DB and confirm no raw secret-like query text appears.
3. Run static diff secret scan.
4. Create PR, watch CI, merge, follow release-sync/publish/published smoke/Hermes QA.

## Next natural slice after this one

After this audit slice is released and Hermes QA passes, the next likely Priority 5 step is dogfood cadence refinement: use the audit report over real Hermes observations to decide whether ranking/scope filters need adjustment. Avoid mutating cleanup or broad graph retrieval until the read-only signals have been observed in real use.
3 changes: 2 additions & 1 deletion README.md
@@ -108,9 +108,10 @@ For local dogfood and noise monitoring, retrievals can leave a secret-safe obser
```bash
agent-memory retrieve "$DB" "How should I install agent-memory?" --preferred-scope user:default --observe cli
agent-memory observations list "$DB" --limit 20
agent-memory observations audit "$DB" --limit 200 --top 10 --frequent-threshold 3
```

Use the observation log to spot frequently injected or surprising memories before changing retrieval behavior. Treat it as local operator telemetry, not a synced analytics stream.
Use the observation log and audit report to spot frequently injected or surprising memories before changing retrieval behavior. The audit output is read-only JSON with surface/scope counts, empty-retrieval count, top injected memory refs, current status for known refs, and simple signals such as `frequently_injected` and `current_status_not_approved`. Treat it as local operator telemetry, not a synced analytics stream.

## Hermes quickstart

4 changes: 3 additions & 1 deletion docs/hermes-dogfood.md
Expand Up @@ -37,6 +37,7 @@ Capture these observations for each dogfood run:
- whether unrelated scopes stay out of the prompt
- whether failure paths fail closed with no broken prompt text
- whether `agent-memory observations list ~/.agent-memory/memory.db --limit 20` shows the expected memory refs without raw query text or secrets
- whether `agent-memory observations audit ~/.agent-memory/memory.db --limit 200 --top 10` highlights frequently injected or no-longer-approved refs before any retrieval tuning

A good conservative smoke has low latency, at most one surfaced memory, no noisy reason codes, no workflow-blocking error if the memory DB is missing, and a local observation entry that explains what memory was injected.

@@ -46,9 +47,10 @@ Hermes pre-LLM hook retrievals write a secret-safe local observation row to the

```bash
agent-memory observations list ~/.agent-memory/memory.db --limit 20
agent-memory observations audit ~/.agent-memory/memory.db --limit 200 --top 10 --frequent-threshold 3
```

Use this before tuning ranking or adding broader graph traversal: first confirm which memories are frequently injected, which scopes are active, and whether the top memory is surprising. Keep this data local unless you intentionally export it.
Use this before tuning ranking or adding broader graph traversal: first confirm which memories are frequently injected, which scopes are active, whether retrieval is often empty, and whether any frequently injected refs are now deprecated/disputed/missing. The audit command is read-only and summarizes local observation rows without emitting raw query text. Keep this data local unless you intentionally export it.

## Fallback and rollback

93 changes: 93 additions & 0 deletions src/agent_memory/api/cli.py
@@ -3,6 +3,7 @@
import argparse
import json
import sys
from collections import Counter, defaultdict
from pathlib import Path
from typing import Any

@@ -42,6 +43,7 @@
)
from agent_memory.storage.sqlite import (
get_fact,
get_memory_status,
initialize_database,
list_candidate_episodes,
list_candidate_facts,
@@ -123,6 +125,79 @@ def _status_counts_for_facts(facts) -> dict[str, int]:
return counts


def _current_status_for_memory_ref(db_path: Path, memory_ref: str) -> str | None:
memory_type, separator, raw_id = memory_ref.partition(":")
if separator != ":" or not raw_id.isdigit() or memory_type not in {"fact", "procedure", "episode"}:
return None
try:
return get_memory_status(db_path, memory_type=memory_type, memory_id=int(raw_id))
except ValueError:
return "missing"
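
The guard above accepts only `type:id` refs for known memory types. As a standalone sketch of the accepted and rejected shapes (the `parse_memory_ref` name is hypothetical, not part of the repo):

```python
def parse_memory_ref(memory_ref: str):
    # Accept only "<type>:<digits>" where type is fact, procedure, or episode.
    memory_type, separator, raw_id = memory_ref.partition(":")
    if separator != ":" or not raw_id.isdigit() or memory_type not in {"fact", "procedure", "episode"}:
        return None
    return memory_type, int(raw_id)
```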


def _audit_retrieval_observations(
db_path: Path,
*,
limit: int,
top: int,
frequent_threshold: int,
) -> dict[str, Any]:
if limit < 1:
raise ValueError("observations audit limit must be >= 1")
if top < 1:
raise ValueError("observations audit top must be >= 1")
if frequent_threshold < 1:
raise ValueError("observations audit frequent threshold must be >= 1")

observations = list_retrieval_observations(db_path, limit=limit)
surface_counts = Counter(observation.surface for observation in observations)
preferred_scope_counts = Counter(
observation.preferred_scope for observation in observations if observation.preferred_scope is not None
)
memory_ref_counts: Counter[str] = Counter()
sample_observation_ids_by_ref: dict[str, list[int]] = defaultdict(list)
empty_retrieval_count = 0
for observation in observations:
if not observation.retrieved_memory_refs:
empty_retrieval_count += 1
for memory_ref in observation.retrieved_memory_refs:
memory_ref_counts[memory_ref] += 1
sample_ids = sample_observation_ids_by_ref[memory_ref]
if len(sample_ids) < 5:
sample_ids.append(observation.id)

top_memory_refs = []
for memory_ref, injection_count in sorted(memory_ref_counts.items(), key=lambda item: (-item[1], item[0]))[:top]:
current_status = _current_status_for_memory_ref(db_path, memory_ref)
signals = []
if injection_count >= frequent_threshold:
signals.append("frequently_injected")
if current_status is not None and current_status != "approved":
signals.append("current_status_not_approved")
top_memory_refs.append(
{
"memory_ref": memory_ref,
"injection_count": injection_count,
"current_status": current_status,
"signals": signals,
"sample_observation_ids": sample_observation_ids_by_ref[memory_ref],
}
)

return {
"kind": "retrieval_observation_audit",
"read_only": True,
"observation_count": len(observations),
"limit": limit,
"top": top,
"frequent_threshold": frequent_threshold,
"surface_counts": dict(sorted(surface_counts.items())),
"preferred_scope_counts": dict(sorted(preferred_scope_counts.items())),
"empty_retrieval_count": empty_retrieval_count,
"top_memory_refs": top_memory_refs,
}
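
The ranking above sorts by descending injection count with the memory ref itself as a deterministic tie-break, so equal-count refs always come out in the same order. In isolation:

```python
from collections import Counter

# Same sort key as _audit_retrieval_observations: most-injected first,
# lexicographic memory ref as the tie-break for equal counts.
memory_ref_counts = Counter({"fact:2": 3, "fact:10": 3, "episode:1": 5})
ranked = sorted(memory_ref_counts.items(), key=lambda item: (-item[1], item[0]))
# ranked == [("episode:1", 5), ("fact:10", 3), ("fact:2", 3)]
```

The string tie-break means `"fact:10"` sorts before `"fact:2"` (lexicographic, not numeric), which is fine for stable output but worth knowing when reading the report.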


def _inspect_relation_graph(db_path: Path, *, start_ref: str, depth: int, limit: int) -> dict[str, Any]:
if depth < 0:
raise ValueError("graph inspect depth must be >= 0")
@@ -428,6 +503,11 @@ def _build_parser() -> argparse.ArgumentParser:
observations_list_parser = observations_subparsers.add_parser("list")
observations_list_parser.add_argument("db_path", type=Path)
observations_list_parser.add_argument("--limit", type=int, default=50)
observations_audit_parser = observations_subparsers.add_parser("audit")
observations_audit_parser.add_argument("db_path", type=Path)
observations_audit_parser.add_argument("--limit", type=int, default=200)
observations_audit_parser.add_argument("--top", type=int, default=10)
observations_audit_parser.add_argument("--frequent-threshold", type=int, default=3)

graph_parser = subparsers.add_parser("graph")
graph_subparsers = graph_parser.add_subparsers(dest="graph_action", required=True)
@@ -846,6 +926,19 @@ def main() -> None:
)
)
return
if args.observations_action == "audit":
print(
json.dumps(
_audit_retrieval_observations(
args.db_path,
limit=args.limit,
top=args.top,
frequent_threshold=args.frequent_threshold,
),
indent=2,
)
)
return
raise ValueError(f"Unsupported observations action: {args.observations_action}")

if args.command == "graph":
9 changes: 9 additions & 0 deletions src/agent_memory/storage/sqlite.py
@@ -369,6 +369,15 @@ def insert_relation(
return relation_from_row(row)


def get_memory_status(db_path: Path | str, *, memory_type: MemoryType, memory_id: int) -> MemoryStatus:
table_name = TABLE_NAME_BY_MEMORY_TYPE[memory_type]
with connect(db_path) as connection:
row = connection.execute(f"SELECT status FROM {table_name} WHERE id = ?", (memory_id,)).fetchone()
if row is None:
raise ValueError(f"No {memory_type} memory found with id {memory_id}")
return row["status"]


def get_fact(db_path: Path | str, *, fact_id: int) -> Fact:
with connect(db_path) as connection:
row = connection.execute("SELECT * FROM facts WHERE id = ?", (fact_id,)).fetchone()