[Priority 1] Implement Context Evaluator with validation pipeline #59

@marknutter

Description

Problem

We have no systematic quality control for recall results. Hallucinations, incomplete answers, and low-confidence recalls go undetected. Human knowledge (corrections, domain expertise) can only be stored via manual rlm remember calls.

Proposal (from AIGNE paper analysis)

Add a Context Evaluator that runs after every recall session and:

  1. Validates output against source context
  2. Computes confidence score based on:
    • Coverage: did we find entries for all query terms?
    • Coherence: do subagent findings agree or contradict?
    • Completeness: any obvious gaps?
  3. Triggers human review when confidence < 0.7
  4. Stores human corrections as new memory entries tagged human-verified
  5. Logs validation outcomes to performance.jsonl
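The scoring in step 2 could be sketched as follows. The dimension names (coverage, coherence, completeness) and the 0.7 review threshold come from this proposal; the equal weighting and the dataclass shape are assumptions, not a settled design:

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.7  # from the proposal: trigger human review below this


@dataclass
class Confidence:
    coverage: float      # did we find entries for all query terms? (0-1)
    coherence: float     # do subagent findings agree? (0-1)
    completeness: float  # any obvious gaps? (0-1, 1.0 = no gaps)

    def score(self) -> float:
        # Equal weighting is an assumption; weights could later be
        # tuned from validation outcomes logged to performance.jsonl.
        return (self.coverage + self.coherence + self.completeness) / 3

    def needs_review(self) -> bool:
        return self.score() < REVIEW_THRESHOLD


c = Confidence(coverage=0.9, coherence=0.8, completeness=0.4)
print(round(c.score(), 2), c.needs_review())  # → 0.7 False
```

A learned weighting (e.g. regressing weights against human-verified outcomes) would slot in naturally once enough validation data accumulates.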

Implementation

  1. Create rlm/evaluator.py with validate_recall_results()
  2. Integrate into recall pipeline (after synthesis, before return)
  3. Add user prompt for verification when confidence low
  4. Store human annotations with provenance
  5. Log all validation metrics

Impact

  • Quality control — catch/correct errors before they propagate
  • Trust — users see confidence scores
  • Self-improvement — validation failures inform learned patterns
  • Human knowledge capture — corrections become first-class memories

Effort

2-3 days

Related

  • Context Evaluator from 'Everything is Context' paper (arXiv 2512.05470)
  • Self-improving strategies (learned_patterns.md)

Metadata

Assignees

No one assigned

Labels

enhancement (New feature or request)

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests