The open-source MultiAgentOps evaluation and verification harness for any industry business workflow.
Detecting Relational Boundary Erosion in AI systems. A framework for testing whether models maintain honest, calibrated, and appropriate boundaries.
AI content engine using an anxiety-indexed behavioral science KB, a multi-stage LangGraph pipeline, and a calibrated LLM-as-judge evaluation harness.
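One plausible reading of "calibrated LLM-as-judge" is fitting a monotone map from raw judge scores to the observed human-agreement rate, then applying it to new judgments. The sketch below illustrates that reading only; the labeled data is fabricated for illustration, and how this repository actually calibrates its judge is an assumption.

```python
# Calibrate raw 1-5 judge scores against human pass/fail labels with an
# isotonic (monotone) regression, so a raw score reads as a pass probability.
from sklearn.isotonic import IsotonicRegression

raw_scores = [1, 2, 2, 3, 3, 4, 4, 5, 5, 5]   # illustrative judge outputs
human_pass = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]   # illustrative human labels

calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(raw_scores, human_pass)

# A raw score of 3 now maps to an estimated pass probability, not a label.
print(calibrator.predict([3])[0])
```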
A minimal, code-first retrieval observability harness that measures why RAG systems fail to surface relevant evidence, without changing retrieval or generation.
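A minimal sketch of the kind of read-only diagnostic such a harness could compute from logged retrieval results: whether each query's gold evidence was retrieved at all, and at what rank. The field names and the recall cutoff k are illustrative assumptions, not this repository's schema.

```python
# Given logged retrieval runs and gold evidence ids, report per-query whether
# the evidence landed in the top k and where it first appeared. Nothing in the
# retrieval or generation path is modified; this only reads logs.
def diagnose(logged_runs, k=10):
    report = []
    for run in logged_runs:
        ranks = [i for i, doc_id in enumerate(run["retrieved_ids"], start=1)
                 if doc_id in run["gold_ids"]]
        report.append({
            "query": run["query"],
            "hit_at_k": bool(ranks and ranks[0] <= k),
            "first_relevant_rank": ranks[0] if ranks else None,  # None = missed
        })
    return report

runs = [{"query": "refund policy",
         "retrieved_ids": ["d9", "d2", "d7"],
         "gold_ids": {"d7"}}]
print(diagnose(runs, k=3))
```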
Offline classifier evaluation harness — dataset loader, confusion matrices, LLM-as-judge with cost accounting, regression gates for CI, Phoenix/Langfuse exporters. Built for intent classifiers but works on any classification task.
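A regression gate for CI typically recomputes a metric on a frozen eval set and fails the build if it drops below a stored baseline. The sketch below shows that shape, assuming a macro-F1 metric, a `baseline.json` file, and a tolerance value; none of these are confirmed details of this repository.

```python
# Fail the CI job (nonzero exit) if macro-F1 regresses past the tolerance.
import json
import sys

from sklearn.metrics import f1_score

def regression_gate(y_true, y_pred, baseline_path="baseline.json", tol=0.01):
    current = f1_score(y_true, y_pred, average="macro")
    with open(baseline_path) as f:
        baseline = json.load(f)["macro_f1"]
    if current < baseline - tol:
        print(f"FAIL: macro-F1 {current:.3f} vs baseline {baseline:.3f} (tol {tol})")
        sys.exit(1)  # nonzero exit fails the CI job
    print(f"PASS: macro-F1 {current:.3f} (baseline {baseline:.3f})")
```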
Runnable benchmark toolkit for monophonic ABC melody generation and editing.
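Before scoring melodies, a benchmark like this might run a cheap structural check that the output is well-formed monophonic ABC. The rules in this sketch (require `X:` and `K:` headers, treat bracketed groups as chords) are simplifying assumptions, not the toolkit's actual validator.

```python
# Crude validity check for monophonic ABC output.
import re

def is_monophonic_abc(tune: str) -> bool:
    has_index = re.search(r"^X:\s*\d+", tune, re.MULTILINE) is not None
    has_key = re.search(r"^K:\s*\S+", tune, re.MULTILINE) is not None
    body = tune.split("K:", 1)[-1]
    has_chords = "[" in body  # crude: also catches inline fields like [M:3/4]
    return has_index and has_key and not has_chords

tune = "X:1\nT:Example\nM:4/4\nK:D\nDEFG ABcd|"
print(is_monophonic_abc(tune))  # True
```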
frontier-evals-harness is a lightweight framework for benchmarking frontier language models. It provides deterministic suite versioning, modular adapters, standardized scoring, and paired statistical comparisons with confidence intervals. Built for regression tracking and analysis, it enables reproducible evaluation without dedicated evaluation infrastructure.
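A paired comparison scores both models on the same items and resamples the per-item score differences to get a confidence interval on the mean gap. The sketch below uses a percentile bootstrap with common defaults (10,000 resamples, 95% CI); these are assumptions, not necessarily what frontier-evals-harness uses.

```python
# Percentile bootstrap CI on the mean per-item score difference between two
# models evaluated on the same items.
import random

def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / len(diffs), (lo, hi)

a = [0.9, 0.8, 1.0, 0.7, 0.9, 0.6, 0.8, 1.0]
b = [0.8, 0.8, 0.9, 0.6, 0.7, 0.7, 0.8, 0.9]
print(paired_bootstrap_ci(a, b))  # mean gap with its 95% CI
```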
DoE Project
Controlled experiment isolating reranking as a first-class RAG system boundary, measuring how evidence priority—not recall—changes retrieval outcomes.
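The isolation idea can be sketched as: hold the candidate pool fixed and compare where the gold passage ranks under the recall-stage scores versus the reranker's scores, so any change in top-k hits is attributable to evidence priority alone. The scoring functions and field names below are stand-ins for a real retriever and reranker, not this experiment's code.

```python
# Same candidate pool, two orderings: recall-stage score vs. reranker score.
def rank_of_gold(candidates, gold_id, score_fn):
    ordered = sorted(candidates, key=score_fn, reverse=True)
    return 1 + [c["id"] for c in ordered].index(gold_id)

candidates = [
    {"id": "d1", "bm25": 9.1, "rerank": 0.20},
    {"id": "d2", "bm25": 7.4, "rerank": 0.90},  # gold: low recall-stage score
    {"id": "d3", "bm25": 8.0, "rerank": 0.40},
]
before = rank_of_gold(candidates, "d2", lambda c: c["bm25"])
after = rank_of_gold(candidates, "d2", lambda c: c["rerank"])
print(f"gold rank: {before} -> {after}")  # recall unchanged, priority changed
```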
Production-shaped DV agent evaluation harness with simulator adapter boundary, trajectory scoring, reward decomposition, and JSONL trace persistence.
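JSONL trace persistence usually means one self-describing JSON object per step, appended to a log file; reward decomposition means storing the step reward as named components rather than a single scalar. The schema fields and component names in this sketch are assumptions.

```python
# Append one trace record per trajectory step, with the reward broken into
# named components alongside the scalar total.
import json
import time

def append_step(path, episode_id, step, action, reward_components):
    record = {
        "ts": time.time(),
        "episode": episode_id,
        "step": step,
        "action": action,
        "reward": sum(reward_components.values()),
        "reward_components": reward_components,  # e.g. progress vs. penalties
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

append_step("traces.jsonl", "ep-001", 0, "open_dashboard",
            {"task_progress": 0.5, "latency_penalty": -0.1})
```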