Problem
isar_agent_memory currently has no way to measure embedding quality for medical-domain retrieval. We need:
- Benchmark harness for retrieval quality metrics
- Medical-specific test corpora in Spanish
- Latency, memory, and accuracy metrics per embedding backend
- Comparison against existing RAG competition framework from cerebro-flutter
Proposed Solution
Create lib/benchmark/medical_embeddings_benchmark.dart:
- MedicalEmbeddingsTestCorpus:
  - 100 Spanish medical question-answer pairs
  - Categories: symptoms, medications, lab interpretation, appointments
  - Ground truth relevance labels
- EmbeddingBenchmarkRunner:
  - Tests all available backends (TFLite, ONNX, Gemini)
  - Measures: recall@k, MRR, latency, memory usage
  - Supports a degradation mode when a backend is unavailable
- MedicalRetrievalQualityScorer:
  - Calculates retrieval quality metrics
  - Generates JSON + Markdown reports in .cache/benchmark/
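To make the two accuracy metrics concrete, here is a minimal sketch of how recall@k and MRR could be computed from ranked retrieval results and ground-truth labels. The LabeledQuery class and the function names are hypothetical, chosen for illustration; the issue does not define the corpus schema or the scorer API, and latency and memory measurement are left out.

```dart
/// One benchmark query with its ground-truth relevant document ids.
/// Hypothetical shape: the corpus schema is not specified in this issue.
class LabeledQuery {
  final String question;
  final String category; // e.g. 'symptoms', 'medications'
  final Set<String> relevantIds;

  const LabeledQuery(this.question, this.category, this.relevantIds);
}

/// recall@k: fraction of the relevant documents found in the top k results.
double recallAtK(List<String> rankedIds, Set<String> relevantIds, int k) {
  if (relevantIds.isEmpty) return 0.0;
  final topK = rankedIds.take(k).toSet();
  return topK.intersection(relevantIds).length / relevantIds.length;
}

/// Reciprocal rank: 1 / (1-based position of the first relevant result),
/// or 0 when no relevant document was retrieved.
double reciprocalRank(List<String> rankedIds, Set<String> relevantIds) {
  for (var i = 0; i < rankedIds.length; i++) {
    if (relevantIds.contains(rankedIds[i])) return 1.0 / (i + 1);
  }
  return 0.0;
}

/// MRR: mean of the reciprocal ranks over all queries.
/// rankings[i] is the ranked result list for the query whose
/// ground truth is relevant[i].
double meanReciprocalRank(
  List<List<String>> rankings,
  List<Set<String>> relevant,
) {
  if (rankings.isEmpty || rankings.length != relevant.length) {
    throw ArgumentError(
      'rankings and relevant must be non-empty and of equal length',
    );
  }
  var sum = 0.0;
  for (var i = 0; i < rankings.length; i++) {
    sum += reciprocalRank(rankings[i], relevant[i]);
  }
  return sum / rankings.length;
}
```

A runner would call these once per backend with the ids each backend returns for every LabeledQuery, so the scores are comparable across TFLite, ONNX, and Gemini. Keeping the metrics as pure functions, independent of any backend, also makes them easy to unit test.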
Reference
This already exists in cerebro-flutter: agent-docs/sessions/rag-audit-session-20260305.md describes a cross-repo RAG benchmark framework. We need to port it to isar_agent_memory as a reusable package.
EmbeddingTelemetryRecorder already exists in lib/embedding_telemetry.dart; it needs to be extended for benchmark mode.