Feat: Embedding Benchmark Framework for Medical Quality Assessment #40

@iberi22

Problem

isar_agent_memory currently has no way to measure embedding quality for medical-domain retrieval. We need:

  1. Benchmark harness for retrieval quality metrics
  2. Medical-specific test corpora in Spanish
  3. Latency, memory, and accuracy metrics per embedding backend
  4. Comparison against the existing RAG competition framework in cerebro-flutter
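
As a rough illustration of item 3, per-backend latency could be collected along the following lines. This is only a sketch under stated assumptions: `EmbedFn`, `BackendLatency`, and `measureLatency` are hypothetical names that do not exist in isar_agent_memory, and treating a throwing backend as "unavailable" is one possible reading of how a missing backend should be handled.

```dart
/// Hypothetical backend signature (assumption): text in, embedding vector out.
typedef EmbedFn = Future<List<double>> Function(String text);

/// Latency samples for one backend. All names here are illustrative.
class BackendLatency {
  BackendLatency(this.name, this.samplesMicros, {this.unavailableReason});

  final String name;
  final List<int> samplesMicros;

  /// Non-null when the backend failed and the run was cut short.
  final String? unavailableReason;

  bool get available => unavailableReason == null;

  /// Nearest-rank percentile of the recorded samples, in microseconds.
  /// Returns 0 when no samples were recorded.
  int percentile(double p) {
    if (samplesMicros.isEmpty) return 0;
    final sorted = [...samplesMicros]..sort();
    final rank =
        (p / 100 * sorted.length).ceil().clamp(1, sorted.length).toInt();
    return sorted[rank - 1];
  }
}

/// Times one embed call per query. If the backend throws, the samples
/// gathered so far are kept and the backend is marked unavailable instead
/// of aborting the whole benchmark.
Future<BackendLatency> measureLatency(
  String name,
  EmbedFn embed,
  List<String> queries,
) async {
  final samples = <int>[];
  final watch = Stopwatch();
  try {
    for (final query in queries) {
      watch
        ..reset()
        ..start();
      await embed(query);
      watch.stop();
      samples.add(watch.elapsedMicroseconds);
    }
  } catch (e) {
    return BackendLatency(name, samples, unavailableReason: e.toString());
  }
  return BackendLatency(name, samples);
}
```

Reporting percentiles (for example p50 and p95) rather than a mean keeps a single slow first call, such as model loading, from dominating the comparison between backends.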

Proposed Solution

Create lib/benchmark/medical_embeddings_benchmark.dart:

  1. MedicalEmbeddingsTestCorpus:

    • 100 Spanish medical question-answer pairs
    • Categories: symptoms, medications, lab interpretation, appointments
    • Ground truth relevance labels
  2. EmbeddingBenchmarkRunner:

    • Tests all available backends (TFLite, ONNX, Gemini)
    • Measures: recall@k, MRR, latency, memory usage
    • Supports a degraded mode when a backend is unavailable
  3. MedicalRetrievalQualityScorer:

    • Calculates retrieval quality metrics
    • Generates JSON + Markdown reports in .cache/benchmark/
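
The two ranking metrics named above, recall@k and MRR, can be sketched as plain functions over ranked document IDs and ground-truth labels. This is a minimal sketch, not the proposed implementation: `LabelledQuery` and the function names are hypothetical, and it assumes each query's ground truth is a set of relevant document IDs.

```dart
/// Hypothetical shape of one labelled corpus entry (assumption):
/// a query plus the IDs of the documents judged relevant to it.
class LabelledQuery {
  const LabelledQuery(this.query, this.relevantIds);

  final String query;
  final Set<String> relevantIds;
}

/// recall@k: fraction of the relevant documents found in the top [k]
/// of [ranked]. Returns 0 when there are no relevant documents.
double recallAtK(List<String> ranked, Set<String> relevant, int k) {
  if (relevant.isEmpty) return 0.0;
  final topK = ranked.take(k).toSet();
  return topK.intersection(relevant).length / relevant.length;
}

/// Reciprocal of the 1-based rank of the first relevant result,
/// or 0 when no relevant document was retrieved.
double reciprocalRank(List<String> ranked, Set<String> relevant) {
  for (var i = 0; i < ranked.length; i++) {
    if (relevant.contains(ranked[i])) return 1.0 / (i + 1);
  }
  return 0.0;
}

/// MRR: mean of the per-query reciprocal ranks. [rankings] and
/// [relevant] are parallel lists, one entry per query.
double meanReciprocalRank(
  List<List<String>> rankings,
  List<Set<String>> relevant,
) {
  if (rankings.isEmpty) return 0.0;
  if (rankings.length != relevant.length) {
    throw ArgumentError('rankings and relevant must have the same length');
  }
  var sum = 0.0;
  for (var i = 0; i < rankings.length; i++) {
    sum += reciprocalRank(rankings[i], relevant[i]);
  }
  return sum / rankings.length;
}
```

Keeping the metrics independent of any embedding backend means the same scorer can be reused for every backend's ranked output.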

Reference

This already exists in cerebro-flutter: agent-docs/sessions/rag-audit-session-20260305.md describes a cross-repo RAG benchmark framework. We need to port it to isar_agent_memory as a reusable package.

The EmbeddingTelemetryRecorder already exists in lib/embedding_telemetry.dart; it needs to be extended with a benchmark mode.

Labels: enhancement
