Evaluate AI agents with Unix-style pipeline commands. Schema-driven adapters for any CLI agent, trajectory capture, pass@k metrics, and multi-run comparison.
TypeScript · Updated May 4, 2026
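The pass@k metric mentioned above has a standard unbiased estimator, popularized by the HumanEval paper: given n sampled attempts of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch in TypeScript; the function name is illustrative, not this repo's API.

```typescript
/**
 * Unbiased pass@k estimator: the probability that at least one of k
 * samples drawn without replacement from n attempts passes, given
 * that c of the n attempts passed.
 *
 *   pass@k = 1 - C(n - c, k) / C(n, k)
 */
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1.0; // every size-k draw contains a passing sample
  let failAll = 1.0; // probability that all k drawn samples fail
  for (let i = n - c + 1; i <= n; i++) {
    failAll *= 1.0 - k / i; // numerically stable product form
  }
  return 1.0 - failAll;
}

// e.g. 10 attempts, 3 passing, k = 1:
console.log(passAtK(10, 3, 1)); // 0.3 (up to floating-point error)
```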
LLM-powered clinical extraction + structured evals. Prompt strategies, hallucination detection, and per-field F1 scoring.
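Per-field F1 scoring for structured extraction usually means comparing predicted and gold values field by field and tallying true/false positives per field. A hedged sketch, with the record shape assumed for illustration:

```typescript
type ExtractionRecord = { [field: string]: string | null };

/**
 * Per-field precision/recall/F1 over (gold, predicted) record pairs.
 * A field counts as a true positive when the prediction exactly matches
 * a non-null gold value; normalization and fuzzy matching are omitted.
 */
function perFieldF1(pairs: Array<{ gold: ExtractionRecord; pred: ExtractionRecord }>) {
  const stats: Record<string, { tp: number; fp: number; fn: number }> = {};
  for (const { gold, pred } of pairs) {
    for (const f of new Set([...Object.keys(gold), ...Object.keys(pred)])) {
      const s = (stats[f] ??= { tp: 0, fp: 0, fn: 0 });
      const g = gold[f] ?? null;
      const p = pred[f] ?? null;
      if (p !== null && p === g) s.tp++;
      else {
        if (p !== null) s.fp++; // hallucinated or wrong value
        if (g !== null) s.fn++; // missed (or mismatched) gold value
      }
    }
  }
  return Object.fromEntries(
    Object.entries(stats).map(([f, { tp, fp, fn }]) => {
      const precision = tp / Math.max(tp + fp, 1);
      const recall = tp / Math.max(tp + fn, 1);
      const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
      return [f, { precision, recall, f1 }];
    })
  );
}
```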
A confabulation-resistant synthetic corpus generator for customer intelligence evaluation. Currently in design phase.
Verification-native local coding agent runtime with eval gates, memory, subagents, and model profiles.
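An eval gate in this sense is typically a check that runs a suite and refuses to proceed unless a pass-rate threshold is met. A minimal sketch; every name here is hypothetical, not this runtime's API:

```typescript
interface EvalResult { id: string; passed: boolean }

/**
 * Run each eval case and throw unless the pass rate clears the gate,
 * so downstream steps (commit, merge, deploy) never see a failing build.
 */
async function evalGate(
  runCase: (id: string) => Promise<boolean>,
  caseIds: string[],
  threshold = 0.9
): Promise<EvalResult[]> {
  const results: EvalResult[] = [];
  for (const id of caseIds) {
    results.push({ id, passed: await runCase(id) });
  }
  const passRate = results.filter(r => r.passed).length / results.length;
  if (passRate < threshold) {
    throw new Error(`eval gate failed: ${(passRate * 100).toFixed(1)}% < ${threshold * 100}%`);
  }
  return results;
}
```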
Compare OpenClaw setups against the same scenario suite. Run prompts across multiple configurations, capture answers, latency, token usage, tool calls, and file reads, then generate a single comparison report.
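A comparison report like this usually boils down to one record per (configuration, scenario) run, aggregated into one row per configuration. A sketch of plausible data shapes; none of these names are OpenClaw's actual API:

```typescript
interface RunRecord {
  config: string;    // which setup produced this run
  scenario: string;  // which prompt/scenario was exercised
  answer: string;
  latencyMs: number;
  tokensIn: number;
  tokensOut: number;
  toolCalls: number;
  filesRead: number;
}

/** Collapse raw run records into one comparison row per configuration. */
function summarize(records: RunRecord[]) {
  const byConfig = new Map<string, RunRecord[]>();
  for (const r of records) {
    const bucket = byConfig.get(r.config) ?? [];
    bucket.push(r);
    byConfig.set(r.config, bucket);
  }
  return [...byConfig.entries()].map(([config, runs]) => ({
    config,
    scenarios: runs.length,
    meanLatencyMs: runs.reduce((s, r) => s + r.latencyMs, 0) / runs.length,
    totalTokens: runs.reduce((s, r) => s + r.tokensIn + r.tokensOut, 0),
    meanToolCalls: runs.reduce((s, r) => s + r.toolCalls, 0) / runs.length,
  }));
}
```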
Production-style LLM evaluation harness for structured clinical extraction — compares prompt strategies across accuracy, cost, and hallucination.
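A common hallucination check for extraction tasks is grounding: flag any extracted value that cannot be located in the source document. A deliberately crude substring sketch; real harnesses typically add normalization, alias tables, or an LLM judge:

```typescript
/**
 * Flag extracted field values that never appear in the source text.
 * Substring grounding is coarse but catches outright fabrications.
 */
function ungroundedFields(
  source: string,
  extracted: { [field: string]: string | null }
): string[] {
  const haystack = source.toLowerCase();
  return Object.entries(extracted)
    .filter(([, v]) => v !== null && !haystack.includes(v.toLowerCase()))
    .map(([field]) => field);
}

// e.g. ungroundedFields(noteText, { dose: "10 mg", drug: "warfarin" })
// returns the names of fields whose values cannot be found in noteText.
```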
A lightweight workbench for dataset-driven agent and LLM evaluation.
YAML-driven evaluation harness for WhatsApp RAG bots
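A YAML-driven harness typically declares cases as data and leaves the model call pluggable. A sketch of a loader using the js-yaml package; the case schema is invented for illustration:

```typescript
import yaml from "js-yaml";
import { readFileSync } from "node:fs";

// Hypothetical case schema; the actual repo's YAML format may differ.
// cases.yaml (example):
//   cases:
//     - name: refund-policy
//       prompt: "What is the refund window?"
//       expect: { contains: "30 days" }
interface EvalCase {
  name: string;
  prompt: string;
  expect: { contains: string }; // simplest assertion: answer must contain a string
}

const config = yaml.load(readFileSync("cases.yaml", "utf8")) as { cases: EvalCase[] };

/** Run every declared case against a pluggable ask() function. */
async function run(ask: (prompt: string) => Promise<string>) {
  for (const c of config.cases) {
    const answer = await ask(c.prompt);
    const ok = answer.includes(c.expect.contains);
    console.log(`${ok ? "PASS" : "FAIL"} ${c.name}`);
  }
}
```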
Form ADV Part 2A intelligence + peer benchmarking — LangGraph, hybrid retrieval, eval harness
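Hybrid retrieval commonly fuses a lexical ranking such as BM25 with a vector ranking; reciprocal rank fusion is one standard way to merge them. A sketch (the repo may well use a different fusion scheme):

```typescript
/**
 * Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank(d)).
 * `rankings` are lists of document ids, best first; k = 60 is the
 * conventional constant from the original RRF paper.
 */
function rrf(rankings: string[][], k = 60): Array<[string, number]> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((doc, i) => {
      scores.set(doc, (scores.get(doc) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}

// e.g. fuse a BM25 ranking with a dense-embedding ranking:
// rrf([["d3", "d1", "d7"], ["d1", "d3", "d9"]])
```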