110 changes: 64 additions & 46 deletions .dev/status/current-handoff.md
@@ -1,7 +1,7 @@
# agent-memory current handoff

Status: AI-authored draft. Not yet human-approved.
Last updated: 2026-05-01 01:10 KST
Last updated: 2026-05-01 01:55 KST

## Trigger for the next session

@@ -16,7 +16,7 @@ read this file first. Do not ask the user to restate context. Verify repo state,

## Ready-to-say answer

agent-memory has been released and has passed Hermes QA through v0.1.36, and work is now on the read-only observation audit, the next slice of Priority 5 dogfood/noise monitoring. The current branch is `feat/observations-audit` and the worktree is `/Users/reddit/Project/agent-memory/.worktrees/observations-audit`. The goal is to add an `agent-memory observations audit` CLI that uses the existing retrieval observation log to summarize frequently injected memory refs, the surface/scope distribution, empty retrievals, and deprecated/disputed/missing ref signals, without raw queries.
agent-memory has been released and has passed Hermes QA through v0.1.37, and the current slice fixes observation data quality issues found during real dogfood QA. The branch is `fix/observation-dogfood-quality` and the worktree is `/Users/reddit/Project/agent-memory/.worktrees/observation-dogfood-quality`. The goals are to remove the query preview, keep the synthetic pre-LLM payload from `hermes hooks doctor/test` from polluting dogfood observations, add insufficient-data and empty-retrieval quality warnings to the audit, and make approve/review lazily migrate existing DBs that lack the `memory_status_transitions` table. An E2E check also confirmed that real Hermes uses information retrieved from agent-memory in its answers.

## Current repo state

@@ -32,15 +32,15 @@ Expected GitHub identity:

Verified base before this slice:

- latest completed release: `v0.1.36`
- v0.1.36 included secret-safe local retrieval observation logging and lazy migration for existing DBs without `retrieval_observations`.
- local Hermes hook uses `/Users/reddit/.agent-memory/runtime/v0.1.36/.venv/bin/agent-memory` against `/Users/reddit/.agent-memory/memory.db`.
- latest completed release: `v0.1.37`
- v0.1.37 added read-only `agent-memory observations audit` and was published to GitHub/npm/PyPI.
- local Hermes hook uses `/Users/reddit/.agent-memory/runtime/v0.1.37/.venv/bin/agent-memory` against `/Users/reddit/.agent-memory/memory.db`.

Active slice/worktree:

- branch: `feat/observations-audit`
- worktree: `/Users/reddit/Project/agent-memory/.worktrees/observations-audit`
- intended release after merge: likely `v0.1.37`
- branch: `fix/observation-dogfood-quality`
- worktree: `/Users/reddit/Project/agent-memory/.worktrees/observation-dogfood-quality`
- intended release after merge: likely `v0.1.38`

Expected local untracked artifacts to preserve in the root checkout:

@@ -52,68 +52,86 @@ Expected local untracked artifacts to preserve in the root checkout:

Do not delete or commit these unless the user explicitly asks.

## Current slice: read-only retrieval observation audit
## Current slice: observation dogfood data quality

Goal:

- Add a local-only, secret-safe, read-only audit report over `retrieval_observations`.
- Summarize dogfood/noise signals before changing ranking, graph traversal, or mutating memory cleanup.
- Keep observation telemetry useful for real dogfood QA.
- Avoid storing prompt-like query previews.
- Avoid synthetic hook doctor/test payloads polluting observation audits.
- Make audit explicitly report low-signal data states.
- Ensure existing DBs lazily migrate missing lifecycle tables encountered during real local QA.

Implemented so far in the active worktree:

- New CLI:
  - `agent-memory observations audit <db_path> --limit 200 --top 10 --frequent-threshold 3`
- JSON output includes:
  - `kind: retrieval_observation_audit`
  - `read_only: true`
  - `observation_count`
  - `surface_counts`
  - `preferred_scope_counts`
  - `empty_retrieval_count`
  - `top_memory_refs[]` with `memory_ref`, `injection_count`, `current_status`, `signals`, and sample observation ids
- Current signals:
  - `frequently_injected`
  - `current_status_not_approved`
- Storage helper added:
  - `get_memory_status(db_path, memory_type=..., memory_id=...)`
- `record_retrieval_observation` now writes `query_preview = None` for new observations.
- Hermes pre-LLM hook detects the deterministic `hermes hooks doctor/test` payload:
  - session_id `test-session`
  - user_message `What is the weather?`
  - empty conversation_history
  - is_first_turn true
  - model `gpt-4`
  - platform `cli`
- Synthetic doctor/test payloads still exercise hook context injection but do not write dogfood observation rows.
- `observations audit` now returns:
  - `empty_retrieval_ratio`
  - `quality_warnings`
    - `no_observations`
    - `low_observation_count`
    - `high_empty_retrieval_ratio`
- `memory_status_transitions` now has lazy/idempotent schema ensure used by initialize, status update, and status history paths.
- Docs updated:
  - `README.md`
  - `docs/hermes-dogfood.md`

Secret-safety contract:

- audit uses existing observation rows and does not read or emit raw query text.
- output contains counts, memory refs, statuses, and observation ids only.
- keep this data local unless intentionally exported.
- Tests added/updated in `tests/test_cli.py`:
  - query preview is absent from observation list output
  - audit reports low-signal empty retrievals
  - approve-fact migrates existing DBs missing `memory_status_transitions`
  - Hermes hook synthetic doctor payload skips observation write
  - Hermes hook context includes retrieved memory content when line budget allows
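
The lazy, idempotent schema ensure for `memory_status_transitions` can be pictured with a minimal sketch. It assumes a plain `sqlite3` connection; the helper names `ensure_status_transitions_table` and `record_transition` and the column layout are hypothetical and are not taken from the agent-memory source.

```python
import sqlite3


def ensure_status_transitions_table(conn: sqlite3.Connection) -> None:
    """Create the table only if it is missing, so calling it repeatedly is safe."""
    # Hypothetical columns, for illustration only.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS memory_status_transitions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            memory_type TEXT NOT NULL,
            memory_id INTEGER NOT NULL,
            from_status TEXT,
            to_status TEXT NOT NULL,
            reason TEXT,
            created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    conn.commit()


def record_transition(conn, memory_type, memory_id, from_status, to_status, reason=None):
    """Write one transition row, ensuring the schema first."""
    # Without this call, an older DB that lacks the table would raise
    # sqlite3.OperationalError on the INSERT below.
    ensure_status_transitions_table(conn)
    conn.execute(
        "INSERT INTO memory_status_transitions "
        "(memory_type, memory_id, from_status, to_status, reason) "
        "VALUES (?, ?, ?, ?, ?)",
        (memory_type, memory_id, from_status, to_status, reason),
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stands in for an existing DB without the table
    record_transition(conn, "fact", 2, "approved", "deprecated", "live E2E QA cleanup")
    print(conn.execute("SELECT COUNT(*) FROM memory_status_transitions").fetchone()[0])
```

Calling the ensure step from every write and read path, rather than only from `init`, is what would let an older database keep working without a separate migration command.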

Verification so far:

- RED confirmed before implementation:
  - `agent-memory observations audit` failed with argparse invalid choice.
- RED confirmed:
  - query_preview still present
  - synthetic doctor payload wrote observation rows
  - audit lacked `empty_retrieval_ratio`/`quality_warnings`
  - existing DB without `memory_status_transitions` failed approve with sqlite OperationalError
- GREEN focused:
  - `uv run pytest tests/test_cli.py::test_python_module_cli_observations_audit_reports_frequent_and_stale_refs_without_raw_queries -q`
  - `1 passed`
- Focused regression group:
  - `uv run pytest tests/test_cli.py::test_python_module_cli_observations_audit_reports_frequent_and_stale_refs_without_raw_queries tests/test_cli.py::test_python_module_cli_retrieve_observe_records_secret_safe_local_observation tests/test_cli.py::test_python_module_cli_observations_list_migrates_existing_database_without_observation_table -q`
  - `3 passed`
- CLI help smoke:
  - `uv run python -m agent_memory.api.cli observations audit --help`
  - `uv run python -m agent_memory.api.cli observations list --help`
  - both exit 0.
  - `uv run pytest tests/test_cli.py::test_python_module_cli_approve_fact_migrates_existing_database_without_status_transition_table tests/test_cli.py::test_python_module_cli_retrieve_observe_records_secret_safe_local_observation tests/test_cli.py::test_python_module_cli_observations_audit_reports_low_signal_empty_retrievals tests/test_cli.py::test_python_module_cli_hermes_pre_llm_hook_skips_synthetic_doctor_observation tests/test_cli.py::test_python_module_cli_hermes_pre_llm_hook_injects_retrieved_memory_context -q`
  - `5 passed`

Live local Hermes QA already confirmed on v0.1.37 runtime before this patch:

- Created a temporary approved fact in `/Users/reddit/.agent-memory/memory.db` with marker `AM_LIVE_E2E_1777567838` scoped to `/Users/reddit/Project/agent-memory`.
- Direct hook check confirmed:
  - `direct_hook_contains_marker=True`
  - `direct_hook_contains_agent_memory_context=True`
  - `direct_hook_contains_retrieved_fact=True`
- Actual Hermes command confirmed the model used injected memory:
  - `hermes --accept-hooks -z "What is the Hermes live E2E QA marker? Return only the marker and nothing else."`
  - output contained `AM_LIVE_E2E_1777567838`
- Cleanup done:
  - test fact id 2 deprecated with reason `live E2E QA cleanup`
  - `review explain` showed `visible_in_default_retrieval: false`
- During live QA, an existing DB migration gap was discovered:
  - approve failed until `agent-memory init ~/.agent-memory/memory.db` created `memory_status_transitions`
  - this is now covered by the new lazy migration test/fix.

Remaining before PR:

1. Run full local verification:
1. Run broader focused group and full local verification:
   - `uv run pytest tests/ -q`
   - `uv run python scripts/check_release_metadata.py`
   - `uv run python scripts/smoke_release_readiness.py`
   - `npm pack --dry-run`
   - `git diff --check`
   - `node --check bin/agent-memory.js`
2. Run real smoke for `observations audit` on a temp DB and confirm no raw secret-like query text appears.
2. Run real smoke for observation list/audit on a temp DB and confirm query_preview is null and no raw secret-like text appears.
3. Run static diff secret scan.
4. Create PR, watch CI, merge, follow release-sync/publish/published smoke/Hermes QA.
5. After v0.1.38 install, repeat Hermes hook doctor and one real E2E check with the new runtime.
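
The smoke in step 2 could be automated along the following lines. This is a sketch under stated assumptions: it takes observation rows that have already been parsed from the list output into Python dicts, the row keys other than `query_preview` are illustrative, and both the `SECRET_PATTERNS` regexes and the `check_observations` helper are hypothetical rather than part of agent-memory.

```python
import json
import re

# Hypothetical patterns for secret-like text; extend them for the secrets you care about.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{16,}"),
    re.compile(r"(?i)(api[_-]?key|password|token)\s*[:=]\s*\S+"),
]


def check_observations(rows: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the smoke passed."""
    problems = []
    for index, row in enumerate(rows):
        # New observations are expected to carry a null query preview.
        if row.get("query_preview") is not None:
            problems.append(f"row {index}: query_preview is not null")
        # Scan the whole serialized row so nested metadata is covered too.
        serialized = json.dumps(row, ensure_ascii=False)
        for pattern in SECRET_PATTERNS:
            if pattern.search(serialized):
                problems.append(f"row {index}: matches {pattern.pattern}")
    return problems


if __name__ == "__main__":
    rows = [{"query_hash": "ab12", "query_preview": None, "selected_memory_refs": ["fact:1"]}]
    print(check_observations(rows) or "smoke passed")
```

A regex scan like this only catches the patterns it lists, so it complements the static diff secret scan in step 3 rather than replacing it.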

## Next natural slice after this one

After this audit slice is released and Hermes QA passes, the next likely Priority 5 step is dogfood cadence refinement: use the audit report over real Hermes observations to decide whether ranking/scope filters need adjustment. Avoid mutating cleanup or broad graph retrieval until the read-only signals have been observed in real use.
After this data-quality fix is released and Hermes QA passes, continue dogfood/noise monitoring using the cleaner audit data. Avoid mutating cleanup or broader graph retrieval until there are enough real, non-synthetic observations to justify ranking/scope changes.
4 changes: 2 additions & 2 deletions README.md
@@ -103,15 +103,15 @@ agent-memory graph inspect "$DB" fact:1 --depth 2 --limit 50

The JSON output includes the start ref, visited node refs, relation edges, traversal depth per edge, and a `read_only: true` marker. It is intended as a safe graph-foundation slice before enabling any broader graph traversal in default retrieval.

For local dogfood and noise monitoring, retrievals can leave a secret-safe observation log. Normal `retrieve` only records an observation when explicitly asked; the Hermes pre-LLM hook records one automatically in the local SQLite DB. Observations store a query hash, a redacted short preview, selected memory refs, top memory ref, response mode, scope, and surface. They do not store the raw query text.
For local dogfood and noise monitoring, retrievals can leave a secret-safe observation log. Normal `retrieve` only records an observation when explicitly asked; the Hermes pre-LLM hook records one automatically in the local SQLite DB for real turns. Observations store a query hash, selected memory refs, top memory ref, response mode, scope, and surface. They do not store the raw query text or a query preview. Deterministic `hermes hooks doctor/test` pre-LLM payloads exercise context injection but are skipped as dogfood observations so synthetic weather prompts do not pollute the audit.

```bash
agent-memory retrieve "$DB" "How should I install agent-memory?" --preferred-scope user:default --observe cli
agent-memory observations list "$DB" --limit 20
agent-memory observations audit "$DB" --limit 200 --top 10 --frequent-threshold 3
```

Use the observation log and audit report to spot frequently injected or surprising memories before changing retrieval behavior. The audit output is read-only JSON with surface/scope counts, empty-retrieval count, top injected memory refs, current status for known refs, and simple signals such as `frequently_injected` and `current_status_not_approved`. Treat it as local operator telemetry, not a synced analytics stream.
Use the observation log and audit report to spot frequently injected or surprising memories before changing retrieval behavior. The audit output is read-only JSON with surface/scope counts, empty-retrieval count and ratio, quality warnings such as `low_observation_count` or `high_empty_retrieval_ratio`, top injected memory refs, current status for known refs, and simple signals such as `frequently_injected` and `current_status_not_approved`. Treat it as local operator telemetry, not a synced analytics stream.
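
As an illustration of reading the audit JSON, the sketch below picks out the memory refs that carry a given signal. The field names `top_memory_refs`, `memory_ref`, `injection_count`, `current_status`, and `signals` follow the audit output described for this release; the `refs_with_signal` helper and the sample values are hypothetical.

```python
def refs_with_signal(audit: dict, signal: str) -> list[str]:
    """Return the memory refs in the audit report that carry the given signal."""
    return [
        entry["memory_ref"]
        for entry in audit.get("top_memory_refs", [])
        if signal in entry.get("signals", [])
    ]


# Hypothetical report, shaped like the documented audit output.
sample_audit = {
    "kind": "retrieval_observation_audit",
    "read_only": True,
    "observation_count": 12,
    "empty_retrieval_count": 2,
    "top_memory_refs": [
        {
            "memory_ref": "fact:1",
            "injection_count": 6,
            "current_status": "approved",
            "signals": ["frequently_injected"],
        },
        {
            "memory_ref": "fact:7",
            "injection_count": 4,
            "current_status": "deprecated",
            "signals": ["frequently_injected", "current_status_not_approved"],
        },
    ],
}

if __name__ == "__main__":
    print(refs_with_signal(sample_audit, "current_status_not_approved"))
```

Because the audit is read-only, a helper like this only reports candidates for review; any status change would still go through the normal review commands.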

## Hermes quickstart

14 changes: 10 additions & 4 deletions docs/hermes-dogfood.md
@@ -36,21 +36,27 @@ Capture these observations for each dogfood run:
- whether returned context includes only approved memory
- whether unrelated scopes stay out of the prompt
- whether failure paths fail closed with no broken prompt text
- whether `agent-memory observations list ~/.agent-memory/memory.db --limit 20` shows the expected memory refs without raw query text or secrets
- whether `agent-memory observations audit ~/.agent-memory/memory.db --limit 200 --top 10` highlights frequently injected or no-longer-approved refs before any retrieval tuning
- whether `agent-memory observations list ~/.agent-memory/memory.db --limit 20` shows the expected memory refs without raw query text, query previews, or secrets
- whether `agent-memory observations audit ~/.agent-memory/memory.db --limit 200 --top 10` highlights frequently injected or no-longer-approved refs, low observation counts, and high empty-retrieval ratios before any retrieval tuning

A good conservative smoke has low latency, at most one surfaced memory, no noisy reason codes, no workflow-blocking error if the memory DB is missing, and a local observation entry that explains what memory was injected.

## Local observation log

Hermes pre-LLM hook retrievals write a secret-safe local observation row to the SQLite DB. The row is intended for dogfood/noise review and stores the surface, query hash, redacted query preview, selected memory refs, top memory ref, response mode, scope, and small metadata. It does not store the raw query text.
Hermes pre-LLM hook retrievals write a secret-safe local observation row to the SQLite DB for real turns. The row is intended for dogfood/noise review and stores the surface, query hash, selected memory refs, top memory ref, response mode, scope, and small metadata. It does not store the raw query text or a query preview. `hermes hooks doctor` / `hermes hooks test pre_llm_call` still exercise hook context injection, but their deterministic synthetic weather payload is skipped as dogfood observation data.
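
To make the "hash, not text" idea concrete, here is a minimal sketch of building such a row. It assumes SHA-256 over the UTF-8 query; the actual hash algorithm, any normalization, and the `build_observation_row` helper are assumptions and are not confirmed by the agent-memory source.

```python
import hashlib


def build_observation_row(query: str, surface: str, selected_refs: list[str]) -> dict:
    """Build an observation row that keeps a query hash but no query text."""
    # Assumption: SHA-256 of the UTF-8 bytes; the real implementation may differ.
    query_hash = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return {
        "surface": surface,
        "query_hash": query_hash,
        "query_preview": None,  # no preview is stored for new observations
        "selected_memory_refs": selected_refs,
        "top_memory_ref": selected_refs[0] if selected_refs else None,
    }


if __name__ == "__main__":
    row = build_observation_row(
        "How should I install agent-memory?", "hermes-pre-llm-hook", ["fact:1"]
    )
    print(row["query_hash"][:12], row["top_memory_ref"])
```

A hash still lets repeated queries be grouped in an audit while keeping prompt text out of the database.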

```bash
agent-memory observations list ~/.agent-memory/memory.db --limit 20
agent-memory observations audit ~/.agent-memory/memory.db --limit 200 --top 10 --frequent-threshold 3
```

Use this before tuning ranking or adding broader graph traversal: first confirm which memories are frequently injected, which scopes are active, whether retrieval is often empty, and whether any frequently injected refs are now deprecated/disputed/missing. The audit command is read-only and summarizes local observation rows without emitting raw query text. Keep this data local unless you intentionally export it.
Use this before tuning ranking or adding broader graph traversal: first confirm which memories are frequently injected, which scopes are active, whether retrieval is often empty, and whether any frequently injected refs are now deprecated/disputed/missing. The audit command is read-only and summarizes local observation rows without emitting raw query text or query previews. Keep this data local unless you intentionally export it.

When the audit reports `quality_warnings`, treat them as QA signals rather than cleanup instructions:

- `no_observations`: Hermes has not produced dogfood observation data yet; check hook install/allowlist and run a real prompt.
- `low_observation_count`: keep dogfooding before drawing ranking conclusions.
- `high_empty_retrieval_ratio`: memory retrieval is often returning no approved refs; check scopes, approved memory coverage, and query wording before changing rankers.
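
The guidance above can be turned into a small triage helper. This is only a sketch: the warning names come from the audit output, while the `NEXT_STEPS` table and the `triage` function are hypothetical conveniences.

```python
# Suggested next step per warning, paraphrasing the list above.
NEXT_STEPS = {
    "no_observations": "check hook install/allowlist and run a real prompt",
    "low_observation_count": "keep dogfooding before drawing ranking conclusions",
    "high_empty_retrieval_ratio": "check scopes, approved memory coverage, and query wording",
}


def triage(audit: dict) -> list[str]:
    """Map each quality warning in an audit report to a suggested QA step."""
    return [
        f"{warning}: {NEXT_STEPS.get(warning, 'unrecognized warning; inspect manually')}"
        for warning in audit.get("quality_warnings", [])
    ]


if __name__ == "__main__":
    report = {"quality_warnings": ["low_observation_count", "high_empty_retrieval_ratio"]}
    for line in triage(report):
        print(line)
```

None of the suggested steps mutate memory, which matches treating the warnings as QA signals rather than cleanup instructions.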

## Fallback and rollback

11 changes: 11 additions & 0 deletions src/agent_memory/api/cli.py
@@ -184,6 +184,15 @@ def _audit_retrieval_observations(
}
)

    empty_retrieval_ratio = empty_retrieval_count / len(observations) if observations else 0.0
    quality_warnings = []
    if not observations:
        quality_warnings.append("no_observations")
    if 0 < len(observations) < 10:
        quality_warnings.append("low_observation_count")
    if empty_retrieval_ratio >= 0.5 and observations:
        quality_warnings.append("high_empty_retrieval_ratio")

    return {
        "kind": "retrieval_observation_audit",
        "read_only": True,
@@ -194,6 +203,8 @@ def _audit_retrieval_observations(
        "surface_counts": dict(sorted(surface_counts.items())),
        "preferred_scope_counts": dict(sorted(preferred_scope_counts.items())),
        "empty_retrieval_count": empty_retrieval_count,
        "empty_retrieval_ratio": round(empty_retrieval_ratio, 4),
        "quality_warnings": quality_warnings,
        "top_memory_refs": top_memory_refs,
    }

22 changes: 21 additions & 1 deletion src/agent_memory/integrations/hermes_hooks.py
@@ -35,6 +35,25 @@ def scope_from_cwd(cwd: str | Path | None) -> str | None:
def resolve_effective_preferred_scope(payload: "HermesShellHookPayload", options: "HermesPreLlmHookOptions") -> str | None:
    return options.preferred_scope or scope_from_cwd(payload.cwd)


def is_synthetic_hermes_doctor_payload(payload: "HermesShellHookPayload") -> bool:
    """Detect the deterministic pre-LLM payload used by `hermes hooks doctor/test`.

    Doctor/test payloads should still exercise hook context injection, but they
    should not be counted as dogfood retrieval observations because they look like
    real user turns and otherwise pollute noisy-memory audits.
    """
    return (
        payload.hook_event_name == "pre_llm_call"
        and payload.session_id == "test-session"
        and payload.extra.get("user_message") == "What is the weather?"
        and payload.extra.get("conversation_history") == []
        and payload.extra.get("is_first_turn") is True
        and payload.extra.get("model") == "gpt-4"
        and payload.extra.get("platform") == "cli"
    )


class HermesShellHookPayload(BaseModel):
    hook_event_name: str
    tool_name: str | None = None
@@ -369,13 +388,14 @@ def build_pre_llm_hook_context(
return {}

    effective_preferred_scope = resolve_effective_preferred_scope(payload, options)
    observation_surface = None if is_synthetic_hermes_doctor_payload(payload) else "hermes-pre-llm-hook"
    try:
        packet = retrieve_memory_packet(
            db_path=options.db_path,
            query=user_message,
            limit=options.limit,
            preferred_scope=effective_preferred_scope,
            observation_surface="hermes-pre-llm-hook",
            observation_surface=observation_surface,
            observation_metadata={"hook_event_name": payload.hook_event_name},
        )
        context = prepare_hermes_memory_context(