
research(compression): probe-based evaluation of context compaction quality #2164

@bug-ops

Description

Source

Factory.ai (2025): Evaluating Context Compression for AI Agents
https://factory.ai/news/evaluating-compression

Summary

Replace opaque compression quality metrics (ROUGE/embedding similarity) with functional probes run after each compaction:

  • Recall probes: did specific facts survive?
  • Artifact probes: does the agent know which files/tools it used?
  • Continuation probes: can it pick up mid-task?
  • Decision probes: are past reasoning traces intact?

The agent's ability to answer these probes correctly is the quality signal.
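The four probe categories above could be modeled as structured data rather than free-form LLM questions. A minimal sketch in Python (all names here are illustrative, not Zeph's actual API): each probe carries its category, a question, and the ground-truth answer extracted from the pre-compaction context.

```python
# Hypothetical sketch of structured probe categories; the enum values and
# Probe fields are assumptions, not Zeph's actual types.
from dataclasses import dataclass
from enum import Enum


class ProbeCategory(Enum):
    RECALL = "recall"              # did specific facts survive?
    ARTIFACT = "artifact"          # does the agent know which files/tools it used?
    CONTINUATION = "continuation"  # can it pick up mid-task?
    DECISION = "decision"          # are past reasoning traces intact?


@dataclass
class Probe:
    category: ProbeCategory
    question: str
    expected: str  # ground truth taken from the pre-compaction context


# Example probes generated from a hypothetical pre-compaction transcript.
probes = [
    Probe(ProbeCategory.RECALL, "What port does the dev server listen on?", "8080"),
    Probe(ProbeCategory.ARTIFACT, "Which file defines the retry policy?", "net/retry.rs"),
]
```

Keeping the expected answer alongside each question is what turns the probe into a functional test: after compaction, the agent's answer can be graded against it instead of relying on opaque similarity metrics.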

Applicability to Zeph

Relevance: HIGH. Zeph's summarization quality is currently opaque. Zeph already has a compaction probe ([memory.compression.probe]), but the current probe uses generic LLM-generated questions. Structured probe categories (recall/artifact/continuation/decision) would surface silent information loss more reliably.

Implementation sketch

  • Extend CompactionProbe to generate probes per category (currently generates generic questions)
  • After compaction, run each category with different prompt templates
  • Score by category; log per-category breakdown in debug dump
  • Expose per-category scores in the TUI metrics panel (issue #448: feat: display filter metrics in TUI dashboard)

Complexity: LOW-MEDIUM

Probe prompts are simple; main work is categorizing probe generation and updating the scoring logic.
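The per-category scoring step could be as small as an aggregation over pass/fail probe results. A sketch, assuming probe results arrive as (category, passed) pairs (this shape is an assumption, not Zeph's actual scoring interface):

```python
# Hypothetical per-category aggregation for the debug-dump breakdown.
from collections import defaultdict


def score_by_category(results):
    """Aggregate pass/fail probe results into a per-category pass rate.

    `results` is a list of (category, passed) pairs produced by running
    each probe against the post-compaction context.
    """
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for category, passed in results:
        totals[category][0] += int(passed)
        totals[category][1] += 1
    return {cat: passed / total for cat, (passed, total) in totals.items()}


scores = score_by_category([
    ("recall", True), ("recall", False),
    ("artifact", True),
])
# scores == {"recall": 0.5, "artifact": 1.0}
```

A per-category breakdown like this is what surfaces silent information loss: an overall score can look fine while one category (say, decision probes) collapses.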

Metadata

Assignees

No one assigned

    Labels

    P3 (Research — medium-high complexity), research (Research-driven improvement)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
