The open-source MultiAgentOps evaluation and verification harness for any industry business workflow.
Detecting Relational Boundary Erosion in AI systems. A framework for testing whether models maintain honest, calibrated, and appropriate boundaries.
AI content engine using an anxiety-indexed behavioral science KB, a multi-stage LangGraph pipeline, and a calibrated LLM-as-judge evaluation harness.
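One plausible reading of "calibrated LLM-as-judge" is fitting a monotone map from raw judge scores to the observed human-agreement rate, then applying it to new judgments. The sketch below illustrates that reading only; the labeled data is fabricated for illustration, and how this repository actually calibrates its judge is an assumption.

```python
# Calibrate raw 1-5 judge scores against human pass/fail labels with an
# isotonic (monotone) regression, so a raw score reads as a pass probability.
from sklearn.isotonic import IsotonicRegression

raw_scores = [1, 2, 2, 3, 3, 4, 4, 5, 5, 5]   # illustrative judge outputs
human_pass = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]   # illustrative human labels

calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(raw_scores, human_pass)

# A raw score of 3 now maps to an estimated pass probability, not a label.
print(calibrator.predict([3])[0])
```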
A minimal, code-first retrieval observability harness that measures why RAG systems fail to surface relevant evidence, without changing retrieval or generation.
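A minimal sketch of the kind of read-only diagnostic such a harness could compute from logged retrieval results: whether each query's gold evidence was retrieved at all, and at what rank. The field names and the recall cutoff k are illustrative assumptions, not this repository's schema.

```python
# Given logged retrieval runs and gold evidence ids, report per-query whether
# the evidence landed in the top k and where it first appeared. Nothing in the
# retrieval or generation path is modified; this only reads logs.
def diagnose(logged_runs, k=10):
    report = []
    for run in logged_runs:
        ranks = [i for i, doc_id in enumerate(run["retrieved_ids"], start=1)
                 if doc_id in run["gold_ids"]]
        report.append({
            "query": run["query"],
            "hit_at_k": bool(ranks and ranks[0] <= k),
            "first_relevant_rank": ranks[0] if ranks else None,  # None = missed
        })
    return report

runs = [{"query": "refund policy",
         "retrieved_ids": ["d9", "d2", "d7"],
         "gold_ids": {"d7"}}]
print(diagnose(runs, k=3))
```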
Offline classifier evaluation harness — dataset loader, confusion matrices, LLM-as-judge with cost accounting, regression gates for CI, Phoenix/Langfuse exporters. Built for intent classifiers but works on any classification task.
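A regression gate for CI typically recomputes a metric on a frozen eval set and fails the build if it drops below a stored baseline. The sketch below shows that shape, assuming a macro-F1 metric, a `baseline.json` file, and a tolerance value; none of these are confirmed details of this repository.

```python
# Fail the CI job (nonzero exit) if macro-F1 regresses past the tolerance.
import json
import sys

from sklearn.metrics import f1_score

def regression_gate(y_true, y_pred, baseline_path="baseline.json", tol=0.01):
    current = f1_score(y_true, y_pred, average="macro")
    with open(baseline_path) as f:
        baseline = json.load(f)["macro_f1"]
    if current < baseline - tol:
        print(f"FAIL: macro-F1 {current:.3f} vs baseline {baseline:.3f} (tol {tol})")
        sys.exit(1)  # nonzero exit fails the CI job
    print(f"PASS: macro-F1 {current:.3f} (baseline {baseline:.3f})")
```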
Runnable benchmark toolkit for monophonic ABC melody generation and editing.
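Before scoring melodies, a benchmark like this might run a cheap structural check that the output is well-formed monophonic ABC. The rules in this sketch (require `X:` and `K:` headers, treat bracketed groups as chords) are simplifying assumptions, not the toolkit's actual validator.

```python
# Crude validity check for monophonic ABC output.
import re

def is_monophonic_abc(tune: str) -> bool:
    has_index = re.search(r"^X:\s*\d+", tune, re.MULTILINE) is not None
    has_key = re.search(r"^K:\s*\S+", tune, re.MULTILINE) is not None
    body = tune.split("K:", 1)[-1]
    has_chords = "[" in body  # crude: also catches inline fields like [M:3/4]
    return has_index and has_key and not has_chords

tune = "X:1\nT:Example\nM:4/4\nK:D\nDEFG ABcd|"
print(is_monophonic_abc(tune))  # True
```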
frontier-evals-harness is a lightweight framework for benchmarking frontier language models. It provides deterministic suite versioning, modular adapters, standardized scoring, and paired statistical comparisons with confidence intervals. Built for regression tracking and analysis, it enables reproducible evaluation without dedicated evaluation infrastructure.
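A paired comparison scores both models on the same items and resamples the per-item score differences to get a confidence interval on the mean gap. The sketch below uses a percentile bootstrap with common defaults (10,000 resamples, 95% CI); these are assumptions, not necessarily what frontier-evals-harness uses.

```python
# Percentile bootstrap CI on the mean per-item score difference between two
# models evaluated on the same items.
import random

def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / len(diffs), (lo, hi)

a = [0.9, 0.8, 1.0, 0.7, 0.9, 0.6, 0.8, 1.0]
b = [0.8, 0.8, 0.9, 0.6, 0.7, 0.7, 0.8, 0.9]
print(paired_bootstrap_ci(a, b))  # mean gap with its 95% CI
```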
DoE Project
Controlled experiment isolating reranking as a first-class RAG system boundary, measuring how evidence priority—not recall—changes retrieval outcomes.
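The isolation idea can be sketched as: hold the candidate pool fixed and compare where the gold passage ranks under the recall-stage scores versus the reranker's scores, so any change in top-k hits is attributable to evidence priority alone. The scoring functions and field names below are stand-ins for a real retriever and reranker, not this experiment's code.

```python
# Same candidate pool, two orderings: recall-stage score vs. reranker score.
def rank_of_gold(candidates, gold_id, score_fn):
    ordered = sorted(candidates, key=score_fn, reverse=True)
    return 1 + [c["id"] for c in ordered].index(gold_id)

candidates = [
    {"id": "d1", "bm25": 9.1, "rerank": 0.20},
    {"id": "d2", "bm25": 7.4, "rerank": 0.90},  # gold: low recall-stage score
    {"id": "d3", "bm25": 8.0, "rerank": 0.40},
]
before = rank_of_gold(candidates, "d2", lambda c: c["bm25"])
after = rank_of_gold(candidates, "d2", lambda c: c["rerank"])
print(f"gold rank: {before} -> {after}")  # recall unchanged, priority changed
```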
Production-shaped DV agent evaluation harness with simulator adapter boundary, trajectory scoring, reward decomposition, and JSONL trace persistence.
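JSONL trace persistence usually means one self-describing JSON object per step, appended to a log file; reward decomposition means storing the step reward as named components rather than a single scalar. The schema fields and component names in this sketch are assumptions.

```python
# Append one trace record per trajectory step, with the reward broken into
# named components alongside the scalar total.
import json
import time

def append_step(path, episode_id, step, action, reward_components):
    record = {
        "ts": time.time(),
        "episode": episode_id,
        "step": step,
        "action": action,
        "reward": sum(reward_components.values()),
        "reward_components": reward_components,  # e.g. progress vs. penalties
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

append_step("traces.jsonl", "ep-001", 0, "open_dashboard",
            {"task_progress": 0.5, "latency_penalty": -0.1})
```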