Ian Parent edited this page Mar 19, 2026 · 1 revision

# Iris — The Agent Eval Standard for MCP

Iris scores every AI agent output against 12 eval rules. Catch PII leaks, hallucinations, injection attacks, and cost overruns before your users do.

Website | Playground | npm


## Quick Start

```sh
npx @iris-eval/mcp-server@latest
```

Open http://localhost:6920 to see the eval dashboard.
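To have an MCP client launch Iris automatically, a Claude Desktop–style configuration entry might look like the following. The `"iris"` key is an arbitrary label chosen for this example; check the project docs for the exact registration Iris supports.

```json
{
  "mcpServers": {
    "iris": {
      "command": "npx",
      "args": ["@iris-eval/mcp-server@latest"]
    }
  }
}
```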

## Why Agent Eval?

Traditional monitoring tells you whether your agent ran. Iris tells you how well it performed.

| What monitoring catches | What Iris catches |
| --- | --- |
| Agent crashed | Agent leaked a credit card number |
| Latency > 5s | Agent hallucinated instead of answering |
| Error rate spike | Agent repeated an injection attack in output |
| Cost per request | Cost 4.7x over budget for a simple query |

## Architecture Overview

Iris is an MCP server that intercepts agent traces and evaluates them against configurable rules:

```
MCP Client (Claude, etc.)
    ↓ traces
Iris MCP Server
    ├── 12 Eval Rules (PII, hallucination, injection, cost, ...)
    ├── SQLite Storage (traces, evals, metrics)
    └── Web Dashboard (localhost:6920)
```
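Conceptually, each intercepted trace fans out to every rule and produces one result per rule. The sketch below illustrates that flow in TypeScript; the interface and field names are assumptions for illustration, not Iris's actual schema or API.

```typescript
// Hypothetical shape of an intercepted trace (field names assumed).
interface AgentTrace {
  id: string;
  prompt: string;
  output: string;
  costUsd: number;
  latencyMs: number;
}

// One result per rule per trace.
interface EvalResult {
  traceId: string;
  rule: string; // e.g. "no_pii"
  pass: boolean;
  detail?: string;
}

// A rule is just a function from trace to result.
type EvalRule = (trace: AgentTrace) => EvalResult;

// Evaluate a trace against every configured rule.
function evaluate(trace: AgentTrace, rules: EvalRule[]): EvalResult[] {
  return rules.map((rule) => rule(trace));
}
```

This shape makes rules independently configurable: adding a thirteenth check is just appending another function to the list.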

## The 12 Eval Rules

| Rule | What It Catches |
| --- | --- |
| `topic_consistency` | Output drifts from the prompt topic |
| `expected_coverage` | Key topics missing from response |
| `response_complete` | Truncated or incomplete answers |
| `no_hallucination_markers` | "As an AI...", hedging, punt phrases |
| `no_blocklist_words` | Profanity, competitor names, banned terms |
| `no_pii` | Credit cards, SSNs, emails, phone numbers |
| `no_injection_patterns` | Prompt injection repeated in output |
| `sentiment_appropriate` | Tone mismatch for the context |
| `language_match` | Response in wrong language |
| `output_format_valid` | JSON/XML format violations |
| `cost_under_threshold` | Query cost exceeds budget |
| `latency_under_threshold` | Response time too slow |
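To make a rule like `no_pii` concrete, here is a minimal TypeScript sketch of a regex-based PII scan. The patterns and function names are illustrative assumptions, not Iris's actual implementation, and real PII detection needs far more robust patterns (e.g. Luhn validation for card numbers).

```typescript
// Illustrative PII patterns (assumed, deliberately simplistic).
const PII_PATTERNS: Record<string, RegExp> = {
  credit_card: /\b(?:\d[ -]?){13,16}\b/, // 13–16 digits, optional separators
  ssn: /\b\d{3}-\d{2}-\d{4}\b/,
  email: /\b[\w.+-]+@[\w-]+\.[\w.]+\b/,
  phone: /\b\d{3}[-.]\d{3}[-.]\d{4}\b/,
};

// Scan an agent output and report which PII categories were found.
function checkNoPii(output: string): { pass: boolean; hits: string[] } {
  const hits = Object.entries(PII_PATTERNS)
    .filter(([, re]) => re.test(output))
    .map(([name]) => name);
  return { pass: hits.length === 0, hits };
}
```

Returning the matched category names (rather than a bare boolean) lets the dashboard show *why* a trace failed without logging the sensitive value itself.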

## License

MIT License. Copyright (c) 2026 Ian Parent.