Evaluate AI agents with Unix-style pipeline commands. Schema-driven adapters for any CLI agent, trajectory capture, pass@k metrics, and multi-run comparison.
TypeScript · Updated May 4, 2026
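The pass@k metric mentioned above has a standard unbiased estimator, popularized by the HumanEval paper: given n sampled attempts of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch in TypeScript; the function name is illustrative, not this repo's API.

```typescript
/**
 * Unbiased pass@k estimator: the probability that at least one of k
 * samples drawn without replacement from n attempts passes, given
 * that c of the n attempts passed.
 *
 *   pass@k = 1 - C(n - c, k) / C(n, k)
 */
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1.0; // every size-k draw contains a passing sample
  let failAll = 1.0; // probability that all k drawn samples fail
  for (let i = n - c + 1; i <= n; i++) {
    failAll *= 1.0 - k / i; // numerically stable product form
  }
  return 1.0 - failAll;
}

// e.g. 10 attempts, 3 passing, k = 1:
console.log(passAtK(10, 3, 1)); // 0.3 (up to floating-point error)
```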
LLM-powered clinical extraction + structured evals. Prompt strategies, hallucination detection, and per-field F1 scoring.
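Per-field F1 scoring for structured extraction usually means comparing predicted and gold values field by field and tallying true/false positives per field. A hedged sketch, with the record shape assumed for illustration:

```typescript
type ExtractionRecord = { [field: string]: string | null };

/**
 * Per-field precision/recall/F1 over (gold, predicted) record pairs.
 * A field counts as a true positive when the prediction exactly matches
 * a non-null gold value; normalization and fuzzy matching are omitted.
 */
function perFieldF1(pairs: Array<{ gold: ExtractionRecord; pred: ExtractionRecord }>) {
  const stats: Record<string, { tp: number; fp: number; fn: number }> = {};
  for (const { gold, pred } of pairs) {
    for (const f of new Set([...Object.keys(gold), ...Object.keys(pred)])) {
      const s = (stats[f] ??= { tp: 0, fp: 0, fn: 0 });
      const g = gold[f] ?? null;
      const p = pred[f] ?? null;
      if (p !== null && p === g) s.tp++;
      else {
        if (p !== null) s.fp++; // hallucinated or wrong value
        if (g !== null) s.fn++; // missed (or mismatched) gold value
      }
    }
  }
  return Object.fromEntries(
    Object.entries(stats).map(([f, { tp, fp, fn }]) => {
      const precision = tp / Math.max(tp + fp, 1);
      const recall = tp / Math.max(tp + fn, 1);
      const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
      return [f, { precision, recall, f1 }];
    })
  );
}
```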
A confabulation-resistant synthetic corpus generator for customer intelligence evaluation. Currently in design phase.
Verification-native local coding agent runtime with eval gates, memory, subagents, and model profiles.
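An eval gate in this sense is typically a check that runs a suite and refuses to proceed unless a pass-rate threshold is met. A minimal sketch; every name here is hypothetical, not this runtime's API:

```typescript
interface EvalResult { id: string; passed: boolean }

/**
 * Run each eval case and throw unless the pass rate clears the gate,
 * so downstream steps (commit, merge, deploy) never see a failing build.
 */
async function evalGate(
  runCase: (id: string) => Promise<boolean>,
  caseIds: string[],
  threshold = 0.9
): Promise<EvalResult[]> {
  const results: EvalResult[] = [];
  for (const id of caseIds) {
    results.push({ id, passed: await runCase(id) });
  }
  const passRate = results.filter(r => r.passed).length / results.length;
  if (passRate < threshold) {
    throw new Error(`eval gate failed: ${(passRate * 100).toFixed(1)}% < ${threshold * 100}%`);
  }
  return results;
}
```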
Compare OpenClaw setups against the same scenario suite. Run prompts across multiple configurations, capture answers, latency, token usage, tool calls, and file reads, then generate a single comparison report.
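A comparison report like this usually boils down to one record per (configuration, scenario) run, aggregated into one row per configuration. A sketch of plausible data shapes; none of these names are OpenClaw's actual API:

```typescript
interface RunRecord {
  config: string;    // which setup produced this run
  scenario: string;  // which prompt/scenario was exercised
  answer: string;
  latencyMs: number;
  tokensIn: number;
  tokensOut: number;
  toolCalls: number;
  filesRead: number;
}

/** Collapse raw run records into one comparison row per configuration. */
function summarize(records: RunRecord[]) {
  const byConfig = new Map<string, RunRecord[]>();
  for (const r of records) {
    const bucket = byConfig.get(r.config) ?? [];
    bucket.push(r);
    byConfig.set(r.config, bucket);
  }
  return [...byConfig.entries()].map(([config, runs]) => ({
    config,
    scenarios: runs.length,
    meanLatencyMs: runs.reduce((s, r) => s + r.latencyMs, 0) / runs.length,
    totalTokens: runs.reduce((s, r) => s + r.tokensIn + r.tokensOut, 0),
    meanToolCalls: runs.reduce((s, r) => s + r.toolCalls, 0) / runs.length,
  }));
}
```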
Production-style LLM evaluation harness for structured clinical extraction — compares prompt strategies across accuracy, cost, and hallucination.
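A common hallucination check for extraction tasks is grounding: flag any extracted value that cannot be located in the source document. A deliberately crude substring sketch; real harnesses typically add normalization, alias tables, or an LLM judge:

```typescript
/**
 * Flag extracted field values that never appear in the source text.
 * Substring grounding is coarse but catches outright fabrications.
 */
function ungroundedFields(
  source: string,
  extracted: { [field: string]: string | null }
): string[] {
  const haystack = source.toLowerCase();
  return Object.entries(extracted)
    .filter(([, v]) => v !== null && !haystack.includes(v.toLowerCase()))
    .map(([field]) => field);
}

// e.g. ungroundedFields(noteText, { dose: "10 mg", drug: "warfarin" })
// returns the names of fields whose values cannot be found in noteText.
```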
A lightweight workbench for dataset-driven agent and LLM evaluation.
YAML-driven evaluation harness for WhatsApp RAG bots
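A YAML-driven harness typically declares cases as data and leaves the model call pluggable. A sketch of a loader using the js-yaml package; the case schema is invented for illustration:

```typescript
import yaml from "js-yaml";
import { readFileSync } from "node:fs";

// Hypothetical case schema; the actual repo's YAML format may differ.
// cases.yaml (example):
//   cases:
//     - name: refund-policy
//       prompt: "What is the refund window?"
//       expect: { contains: "30 days" }
interface EvalCase {
  name: string;
  prompt: string;
  expect: { contains: string }; // simplest assertion: answer must contain a string
}

const config = yaml.load(readFileSync("cases.yaml", "utf8")) as { cases: EvalCase[] };

/** Run every declared case against a pluggable ask() function. */
async function run(ask: (prompt: string) => Promise<string>) {
  for (const c of config.cases) {
    const answer = await ask(c.prompt);
    const ok = answer.includes(c.expect.contains);
    console.log(`${ok ? "PASS" : "FAIL"} ${c.name}`);
  }
}
```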
Form ADV Part 2A intelligence + peer benchmarking — LangGraph, hybrid retrieval, eval harness
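Hybrid retrieval commonly fuses a lexical ranking such as BM25 with a vector ranking; reciprocal rank fusion is one standard way to merge them. A sketch (the repo may well use a different fusion scheme):

```typescript
/**
 * Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank(d)).
 * `rankings` are lists of document ids, best first; k = 60 is the
 * conventional constant from the original RRF paper.
 */
function rrf(rankings: string[][], k = 60): Array<[string, number]> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((doc, i) => {
      scores.set(doc, (scores.get(doc) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}

// e.g. fuse a BM25 ranking with a dense-embedding ranking:
// rrf([["d3", "d1", "d7"], ["d1", "d3", "d9"]])
```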