50 changes: 27 additions & 23 deletions .dev/status/current-handoff.md
@@ -1,7 +1,7 @@
# agent-memory current handoff

Status: AI-authored draft. Not yet human-approved.
Last updated: 2026-04-29 01:57 KST
Last updated: 2026-04-30 06:43 KST

## Trigger for the next session

@@ -17,17 +17,17 @@ then read this file first and answer from the "Ready-to-say answer" section belo

## Ready-to-say answer

What needs to happen right now is to start the retrieval evaluation fixture/harness in `agent-memory`.
What needs to happen right now is to make the `agent-memory` retrieval evaluation results fast for a human to read and triage.

KB M1/M1+ and the v0.1.8 release/smoke are done; the next step is to build a minimal evaluation loop that measures retrieval quality before adding complexity like embeddings/reranking.
Runtime adapters, the retrieval fixture harness, npm/PyPI distribution, and main-merge auto-release are all validated. The next step, before adding retrieval complexity like embeddings/reranking, is to polish the report surface so a human can read fixture results at a glance and judge which memory type / task type is weak.

The sequence is:
1. Check the state of `~/Project/agent-memory`
2. Write `.dev/kb/retrieval-eval-m1-implementation-plan.md` based on `.dev/kb/retrieval-evaluation-v0.md`
3. Add `tests/test_retrieval_evaluation.py` first, via TDD
4. Implement `agent-memory eval retrieval <db_path> <fixtures_dir>`, or at minimum the core API
5. Validate expected memory IDs / drift / counts against the fixtures
6. Update the README only briefly, after validation
2. Review the current structure of the retrieval eval report/CLI
3. Add tests for `--format text` (or an equivalent human summary surface) via TDD
4. Implement the compact text report while keeping the JSON contract intact
5. Update the README and this handoff from the validated results
6. After PR/CI/merge, confirm that the main auto-release publishes a new patch

Start with this task; a minimal command sketch follows.
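
A minimal command sketch for these steps, using only paths and commands already named in this handoff:

```bash
# Quick start for the first steps (paths from this handoff; adjust to your checkout).
cd ~/Project/agent-memory
git status
uv run pytest tests/test_retrieval_evaluation.py -q
uv run agent-memory eval retrieval ~/.agent-memory/memory.db tests/fixtures/retrieval_eval --baseline-mode lexical --format text
```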

@@ -39,23 +39,25 @@ Canonical repo path:

Current branch/release state at this handoff:

- branch: `main`
- remote: `origin` -> `git@github.com-cafitac:cafitac/agent-memory.git`
- git status at last check: clean, `main...origin/main`
- latest commit: `750ef36 chore: release v0.1.8`
- latest validated release: `v0.1.8`
- npm: `@cafitac/agent-memory@0.1.8`
- PyPI: `cafitac-agent-memory==0.1.8`
- GitHub Release: `https://github.com/cafitac/agent-memory/releases/tag/v0.1.8`
- branch: `main` (before the current report-summary branch was cut)
- remote: `origin` -> `https://github.com/cafitac/agent-memory.git` in the local checkout, after the gh HTTPS push repair
- git status at last check: tracked files clean on main; pre-existing untracked local agent/dev state remains intentionally preserved
- latest commit before this branch: `67653e9 chore: release v0.1.11 [skip release]`
- latest validated release: `v0.1.11`
- npm: `@cafitac/agent-memory@0.1.11`
- PyPI: `cafitac-agent-memory==0.1.11`
- GitHub Release: `https://github.com/cafitac/agent-memory/releases/tag/v0.1.11`
- main-merge auto-release is active: `auto-release.yml` bumps the patch metadata, commits with `[skip release]`, tags, and then dispatches `publish.yml` directly, because tag pushes made with `GITHUB_TOKEN` do not reliably trigger downstream tag workflows
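
As an illustration of that dispatch shape only (this is not the literal contents of `auto-release.yml`; the tag value and token wiring below are assumptions):

```bash
# Sketch of the auto-release dispatch step. Assumes publish.yml exposes a
# workflow_dispatch trigger and gh is authenticated with a token that is
# allowed to start downstream workflows.
NEW_TAG="v0.1.12"  # hypothetical next patch tag
git commit -am "chore: release ${NEW_TAG#v} [skip release]"
git tag "$NEW_TAG"
git push origin main "$NEW_TAG"
# Explicit dispatch, because GITHUB_TOKEN tag pushes do not reliably trigger
# tag-driven workflows.
gh workflow run publish.yml --ref "$NEW_TAG"
```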

Important run IDs:

- `25065915434` — CI success for `b468166 feat: enrich KB export provenance`
- `25066123570` — main CI success for `750ef36 chore: release v0.1.8`
- `25066195998` — publish workflow success for `v0.1.8`
- `25066196035` — tag CI success for `v0.1.8`
- `25134278544` — first auto-release main run, created `v0.1.10` but exposed the bot-created tag dispatch gap
- `25134636398` — fixed auto-release main run for PR #5, successfully bumped/tagged `v0.1.11` and dispatched publish
- `25134684685` — publish workflow success for `v0.1.11`
- `25134830075` — manual repair publish workflow success for the earlier `v0.1.10` tag
- `25133706803` — publish workflow success for `v0.1.9`

Published install smoke for `v0.1.8` was completed after registry propagation:
Published install smoke for `v0.1.11` was completed after registry propagation:

- npm global install path passed
- npm wrapper `agent-memory kb export --help` passed
@@ -67,7 +69,7 @@ Published install smoke for `v0.1.8` was completed after registry propagation:
- uv tool `kb export --help`, `bootstrap`, `doctor` passed
- final smoke output: `published install smoke ok`

Note: the first npm smoke attempt for `v0.1.8` failed because the npm launcher correctly pinned `cafitac-agent-memory==0.1.8` before uv/PyPI simple-index resolution had caught up. A retry after propagation succeeded. This is the same known registry-propagation behavior seen in v0.1.7 and is not currently a code blocker.
Note: registry metadata and delegated installer resolvers can briefly disagree right after a publish. This happened again during `v0.1.10`/`v0.1.11` validation; retrying after propagation time, or with `uvx --refresh`, confirmed the published packages. This is not currently a code blocker.
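
A hedged retry sketch for that propagation window (the loop bounds and sleep are arbitrary; only `uvx --refresh` and the `doctor` command come from this handoff):

```bash
# Re-check the published package until registry propagation catches up.
for attempt in 1 2 3; do
  uvx --refresh --from cafitac-agent-memory agent-memory doctor && break
  echo "attempt ${attempt}: resolver not caught up yet; waiting" >&2
  sleep 60
done
```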

## What is complete

@@ -78,7 +80,7 @@ Note: the first npm smoke attempt for `v0.1.8` failed because the npm launcher c
- npm is the shortest onboarding surface; PyPI is the canonical Python runtime.
- npm thin launcher pins the delegated Python package to the npm package version.
- GitHub Actions CI/publish flow is validated.
- Actual published install smoke is validated through `v0.1.8`.
- Actual published install smoke is validated through `v0.1.11`.

### Hermes integration

@@ -180,6 +182,7 @@ Current verified behavior:
- optional `--fail-on-baseline-regression-memory-type {facts,procedures,episodes}` can scope baseline-relative gating down to selected primary task types instead of failing on every current<baseline task; lexical-global and source-global comparisons still only gate when current is worse, not when the comparator is worse
- optional `--fail-on-regression` exits nonzero when any current task has `pass=false`
- optional `--fail-on-baseline-regression` exits nonzero only when current retrieval is worse than the chosen baseline for at least one task
- optional `--format text` prints a compact terminal summary with current/baseline/delta totals, primary-task-type rollups, failed task IDs, and advisory messages, while preserving JSON as the default machine-readable contract (an illustrative report shape is sketched after this list)
- symbolic fixture selectors now support richer matching such as `searchable_text_contains`, `step_contains`, and `tags_include`
- current fact retrieval now suppresses lower-priority cross-scope fact drift when exact-scope fact matches exist and hides lower-ranked conflicting facts in the same subject/predicate/scope slot from surfaced results
- checked-in fixture families are now directly runnable against a suitably seeded DB and also covered by regression tests in `tests/test_retrieval_evaluation.py`; the seeded family now includes a branch-only adversarial staleness case plus procedure-, episode-, and source-global-oriented stale-fact/stale-source guardrails where current retrieval passes and at least one comparator baseline still fails
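
For orientation, the text report produced by `render_retrieval_eval_text_report` (shown later in this diff) has roughly the following shape; the counts, task ID, and advisory text here are invented:

```text
Retrieval evaluation: 11/12 tasks passed
current: failures=1 missing=2 avoid=0 expected_hits=31
baseline lexical: 7/12 tasks passed
delta: pass_count=+4 expected_hits=+6 missing=-5 avoid=-1
by primary task type:
  facts: 5/5 passed, missing=0, avoid=0
  procedures: 4/4 passed, missing=0, avoid=0
  episodes: 2/3 passed, missing=2, avoid=0
failed tasks:
  - episode-staleness-branch-only
advisories:
  - baseline_regression_threshold: 1 task regressed against the lexical baseline
```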
@@ -202,6 +205,7 @@ uv run pytest tests/test_cli.py -q
uv run pytest tests/ -q
uv run agent-memory eval retrieval ~/.agent-memory/memory.db tests/fixtures/retrieval_eval
uv run agent-memory eval retrieval ~/.agent-memory/memory.db tests/fixtures/retrieval_eval --baseline-mode lexical
uv run agent-memory eval retrieval ~/.agent-memory/memory.db tests/fixtures/retrieval_eval --baseline-mode lexical --format text
uv run agent-memory eval retrieval ~/.agent-memory/memory.db tests/fixtures/retrieval_eval --baseline-mode source-lexical
uv run agent-memory eval retrieval ~/.agent-memory/memory.db tests/fixtures/retrieval_eval --baseline-mode source-global
uv run agent-memory eval retrieval ~/.agent-memory/memory.db tests/fixtures/retrieval_eval --baseline-mode lexical --fail-on-baseline-regression
8 changes: 7 additions & 1 deletion README.md
@@ -147,6 +147,12 @@ Or include a simple lexical baseline for side-by-side comparison:
uv run agent-memory eval retrieval ~/.agent-memory/memory.db tests/fixtures/retrieval_eval --baseline-mode lexical
```

For a compact terminal-oriented summary instead of the full JSON payload, use the text format:

```bash
uv run agent-memory eval retrieval ~/.agent-memory/memory.db tests/fixtures/retrieval_eval --baseline-mode lexical --format text
```

Or fail the command when any current task regresses:

```bash
@@ -190,7 +196,7 @@ uv run agent-memory eval retrieval ~/.agent-memory/memory.db tests/fixtures/retr
uv run agent-memory eval retrieval ~/.agent-memory/memory.db tests/fixtures/retrieval_eval --baseline-mode lexical --warn-on-baseline-regression-threshold 0
```

The retrieval evaluator accepts either one JSON fixture file or a fixture directory. Directory input is recursive, so fixture families can live under nested folders such as `scope/`, `procedure/`, `drift/`, `staleness/`, and `episode/`. Fixtures may use direct numeric IDs or top-level symbolic `references` that resolve against approved memories in the target database, which makes checked-in fixture families directly runnable from the CLI. Symbolic selectors now also support richer matching such as `searchable_text_contains`, `step_contains`, and `tags_include` when exact field equality is too brittle for checked-in fixtures. Each task may also carry optional human-authored `rationale` text and `notes` arrays; these are preserved verbatim in the JSON report so fixture reviews can explain why a hit matters without introducing LLM judging. The evaluator runs `retrieve_memory_packet` for each task and prints JSON with fixture paths, per-task rationale/notes, retrieved IDs, expected hits, missing expected IDs, avoid/drift hits, a derived per-task `pass` flag, any non-fatal soft-gate `advisories`, and an aggregate summary. Summary objects now also include top-level task counts (`total_tasks`, `passed_tasks`, `failed_tasks`), `by_memory_type` rollups for facts/procedures/episodes, and `by_primary_task_type` rollups keyed by each task's main target surface so regressions can be reviewed both by memory-slice participation and by per-task intent; the per-type summaries expose the same task counts plus hit/miss/avoid totals. With `--baseline-mode lexical`, the same output also includes per-task baseline metrics, per-task delta fields (`expected_hit_delta`, `missing_expected_delta`, `avoid_hit_delta`, `pass_changed`), plus baseline and delta summaries using a simpler lexical-only retrieval path scoped to the same preferred scope; `--baseline-mode source-lexical` keeps that preferred-scope restriction but scores approved memories by lexical overlap in their linked source content instead of normalized memory text; `--baseline-mode source-global` uses the same source-linked lexical scoring while ignoring preferred scope; and `--baseline-mode lexical-global` keeps normalized-text lexical scoring but ignores preferred scope. Soft-gate thresholds never change the per-task `pass` semantics or process exit code on their own; they only populate `advisories` when the observed current or baseline-relative regression count exceeds the requested threshold.
The retrieval evaluator accepts either one JSON fixture file or a fixture directory. Directory input is recursive, so fixture families can live under nested folders such as `scope/`, `procedure/`, `drift/`, `staleness/`, and `episode/`. Fixtures may use direct numeric IDs or top-level symbolic `references` that resolve against approved memories in the target database, which makes checked-in fixture families directly runnable from the CLI. Symbolic selectors now also support richer matching such as `searchable_text_contains`, `step_contains`, and `tags_include` when exact field equality is too brittle for checked-in fixtures. Each task may also carry optional human-authored `rationale` text and `notes` arrays; these are preserved verbatim in the JSON report so fixture reviews can explain why a hit matters without introducing LLM judging. The evaluator runs `retrieve_memory_packet` for each task and prints JSON by default with fixture paths, per-task rationale/notes, retrieved IDs, expected hits, missing expected IDs, avoid/drift hits, a derived per-task `pass` flag, any non-fatal soft-gate `advisories`, and an aggregate summary. Use `--format text` when you want a short human-readable terminal report with pass counts, current/baseline/delta totals, primary-task-type rollups, failed task IDs, and advisory messages. Summary objects now also include top-level task counts (`total_tasks`, `passed_tasks`, `failed_tasks`), `by_memory_type` rollups for facts/procedures/episodes, and `by_primary_task_type` rollups keyed by each task's main target surface so regressions can be reviewed both by memory-slice participation and by per-task intent; the per-type summaries expose the same task counts plus hit/miss/avoid totals. With `--baseline-mode lexical`, the same output also includes per-task baseline metrics, per-task delta fields (`expected_hit_delta`, `missing_expected_delta`, `avoid_hit_delta`, `pass_changed`), plus baseline and delta summaries using a simpler lexical-only retrieval path scoped to the same preferred scope; `--baseline-mode source-lexical` keeps that preferred-scope restriction but scores approved memories by lexical overlap in their linked source content instead of normalized memory text; `--baseline-mode source-global` uses the same source-linked lexical scoring while ignoring preferred scope; and `--baseline-mode lexical-global` keeps normalized-text lexical scoring but ignores preferred scope. Soft-gate thresholds never change the per-task `pass` semantics or process exit code on their own; they only populate `advisories` when the observed current or baseline-relative regression count exceeds the requested threshold.
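
For quick triage of that JSON payload, a small sketch (the `summary`, `results`, `task_id`, and `pass` field names are assumed from the serialized report described above; `jq` is not part of this repo):

```bash
uv run agent-memory eval retrieval ~/.agent-memory/memory.db tests/fixtures/retrieval_eval --baseline-mode lexical \
  | jq '{passed: .summary.passed_tasks, total: .summary.total_tasks,
         failed: [.results[] | select(.pass == false) | .task_id]}'
```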

Export approved memories as a human-readable KB draft:

12 changes: 10 additions & 2 deletions src/agent_memory/api/cli.py
@@ -35,7 +35,11 @@
from agent_memory.core.ingestion import ingest_source_text
from agent_memory.core.kb_export import export_kb_markdown
from agent_memory.core.retrieval import retrieve_memory_packet
from agent_memory.core.retrieval_eval import RetrievalEvalRegressionError, evaluate_retrieval_fixtures
from agent_memory.core.retrieval_eval import (
RetrievalEvalRegressionError,
evaluate_retrieval_fixtures,
render_retrieval_eval_text_report,
)
from agent_memory.storage.sqlite import (
initialize_database,
list_candidate_episodes,
@@ -255,6 +259,7 @@ def _build_parser() -> argparse.ArgumentParser:
eval_retrieval_parser.add_argument("db_path", type=Path)
eval_retrieval_parser.add_argument("fixtures_path", type=Path)
eval_retrieval_parser.add_argument("--baseline-mode", choices=["lexical", "lexical-global", "source-lexical", "source-global"])
eval_retrieval_parser.add_argument("--format", choices=["json", "text"], default="json")
eval_retrieval_parser.add_argument("--fail-on-regression", action="store_true")
eval_retrieval_parser.add_argument("--warn-on-regression-threshold", type=int)
eval_retrieval_parser.add_argument("--fail-on-baseline-regression", action="store_true")
@@ -535,7 +540,10 @@ def main() -> None:
except RetrievalEvalRegressionError as exc:
print(str(exc), file=sys.stderr)
raise SystemExit(1) from exc
print(result.model_dump_json(indent=2, by_alias=True))
if args.format == "text":
print(render_retrieval_eval_text_report(result))
else:
print(result.model_dump_json(indent=2, by_alias=True))
return
raise ValueError(f"Unsupported eval action: {args.eval_action}")

59 changes: 59 additions & 0 deletions src/agent_memory/core/retrieval_eval.py
@@ -486,6 +486,65 @@ def _build_summary(task_metrics: list[tuple[str, RetrievalEvalRunMetrics]]) -> R
return summary


def _signed_delta(value: int) -> str:
return f"{value:+d}"


def _format_summary_line(prefix: str, summary: RetrievalEvalSummary) -> str:
return (
f"{prefix}: failures={summary.failed_tasks} "
f"missing={summary.total_missing_expected} "
f"avoid={summary.total_avoid_hits} "
f"expected_hits={summary.total_expected_hits}"
)


def _format_type_summary(memory_type: str, summary: RetrievalEvalMemoryTypeSummary) -> str:
return (
f" {memory_type}: {summary.passed_tasks}/{summary.total_tasks} passed, "
f"missing={summary.total_missing_expected}, avoid={summary.total_avoid_hits}"
)


def render_retrieval_eval_text_report(result_set: RetrievalEvalResultSet) -> str:
summary = result_set.summary
lines = [
f"Retrieval evaluation: {summary.passed_tasks}/{summary.total_tasks} tasks passed",
_format_summary_line("current", summary),
]

if result_set.baseline_summary is not None:
baseline = result_set.baseline_summary
lines.append(f"baseline {baseline.mode}: {baseline.passed_tasks}/{baseline.total_tasks} tasks passed")
if result_set.delta_summary is not None:
delta = result_set.delta_summary
lines.append(
"delta: "
f"pass_count={_signed_delta(delta.total_pass_count_delta)} "
f"expected_hits={_signed_delta(delta.total_expected_hit_delta)} "
f"missing={_signed_delta(delta.total_missing_expected_delta)} "
f"avoid={_signed_delta(delta.total_avoid_hit_delta)}"
)

lines.append("by primary task type:")
for memory_type in _MEMORY_TYPES:
type_summary = summary.by_primary_task_type.get(memory_type, RetrievalEvalMemoryTypeSummary())
lines.append(_format_type_summary(memory_type, type_summary))

failed_task_ids = [task.task_id for task in result_set.results if not task.pass_]
if failed_task_ids:
lines.append("failed tasks:")
lines.extend(f" - {task_id}" for task_id in failed_task_ids)
else:
lines.append("failed tasks: none")

if result_set.advisories:
lines.append("advisories:")
lines.extend(f" - {advisory.code}: {advisory.message}" for advisory in result_set.advisories)

return "\n".join(lines)


def evaluate_retrieval_fixtures(
db_path: Path | str,
fixtures_path: Path | str,