Merged
49 changes: 14 additions & 35 deletions README.md
@@ -33,34 +33,16 @@ agentevals scores performance and inference quality from OpenTelemetry traces. N

## What is agentevals?

agentevals is a framework-agnostic evaluation solution that scores AI agent behavior directly from [OpenTelemetry](https://opentelemetry.io/) traces. Record your agent's actions once, then evaluate as many times as you want. No re-runs, no guesswork.
agentevals is a framework-agnostic evaluation solution that scores AI agent behavior directly from [OpenTelemetry](https://opentelemetry.io/) traces. Record your agent's actions once, then evaluate as many times as you want without re-executing or burning extra tokens.

It works with any OTel-instrumented framework (LangChain, Strands, Google ADK, OpenAI Agents SDK, and others), supports Jaeger JSON and native OTLP trace formats, and ships with built-in evaluators, custom evaluator support, and LLM-based judges.

- **CLI** for scripting and CI pipelines
- **Web UI** for visual inspection and local developer experience
- **Kubernetes and OTel support** so you can deploy right next to your agents; works natively in your OpenTelemetry pipeline
- **MCP server** so MCP clients can run evaluations from a conversation

## Why agentevals?

Most evaluation tools require you to **re-execute your agent** for every test, burning tokens, time, and money on duplicate LLM calls. agentevals takes a different approach:

- **No re-execution**: score agents from existing traces without replaying expensive LLM calls
- **Framework-agnostic**: works with any agent framework that emits OpenTelemetry spans
- **Golden eval sets**: compare actual behavior against defined expected behaviors for deterministic pass/fail gating
- **Custom evaluators**: write scoring logic in Python, JavaScript, or any language, or offload scoring to OpenAI Eval API
- **CI/CD ready**: gate deployments on quality thresholds directly in your pipeline
- **Local-first**: no cloud dependency required; everything runs on your machine

## How It Works

agentevals follows three simple steps:

1. **Collect traces**: Instrument your agent with OpenTelemetry (or export traces from your tracing backend). Point the OTLP exporter at the agentevals receiver, or load trace files directly.
2. **Define eval sets**: Create golden evaluation sets that describe expected agent behavior: which tools should be called, in what order, and what the output should look like.
3. **Run evaluations**: Use the CLI, Web UI, or MCP server to score traces against your eval sets. Get per-metric scores, pass/fail results, and detailed span-level breakdowns.
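
The expected behaviors in step 2 lend themselves to deterministic checks. A toy sketch follows; the actual agentevals eval-set file format is not shown in this README, so every field name below is an assumption:

```python
# Hypothetical golden eval set as a plain dict. The real agentevals
# eval-set schema may differ; these field names are illustrative only.
eval_set = {
    "id": "weather-agent-happy-path",
    "expected_tools": ["search_location", "get_forecast"],  # expected call order
    "expected_output_contains": ["forecast"],
}

def tools_called_in_order(actual: list[str], expected: list[str]) -> bool:
    """Deterministic pass/fail: expected tools appear as an ordered subsequence."""
    remaining = iter(actual)
    # `in` on an iterator consumes it, so each expected tool must appear
    # after the previous match -- that is what makes the check order-sensitive.
    return all(tool in remaining for tool in expected)
```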

- **Multiple interfaces**: CLI for scripting and CI, Web UI for visual inspection, MCP server for conversational evaluation, Helm chart for Kubernetes environments

> [!IMPORTANT]
> This project is under active development. Expect breaking changes.
@@ -69,7 +51,7 @@ agentevals follows three simple steps:

- [Installation](#installation)
- [Quick Start](#quick-start)
- [Integration](#integration)
- [Use-cases and Integrations](#use-cases-and-integrations)
- [CLI](#cli)
- [Custom Evaluators](#custom-evaluators)
- [Web UI](#web-ui)
@@ -168,14 +150,14 @@
```bash
agentevals serve
# opens http://localhost:8001
```

You can also point any OTel-instrumented agent directly at the built-in receiver (`OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318`). The UI streams tool calls, inputs, and outputs live as your agent runs. For production setups, the same receiver slots into a Kubernetes OTel Collector pipeline as an exporter destination. See [Integration](#integration) and the [Kubernetes example](examples/kubernetes/README.md) for walkthroughs.
You can also point any OTel-instrumented agent directly at the built-in receiver (`OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318`). The UI streams tool calls, inputs, and outputs live as your agent runs. For production setups, the same receiver slots into a Kubernetes OTel Collector pipeline as an exporter destination. See [Use-cases and Integrations](#use-cases-and-integrations) and the [Kubernetes example](examples/kubernetes/README.md) for walkthroughs.

**Next steps:**

- `agentevals evaluator list` to see all built-in and community evaluators
- [Custom Evaluators](#custom-evaluators) to write your own scoring logic

## Use-cases and integrations
## Use-cases and Integrations

### Zero-Code (Recommended)

@@ -217,7 +199,7 @@ with app.session(eval_set_id="my-eval"):

Requires `pip install "agentevals-cli[streaming]"`. See [examples/sdk_example/](examples/sdk_example/) for framework-specific patterns.

## CLI for local testing, and CI pipelines
## CLI

```bash
# Multiple traces, JSON output
```

@@ -280,12 +262,13 @@ A `Dockerfile` is included at the project root. The image bundles the API, web U

```bash
docker build -t agentevals .
docker run -p 8001:8001 -p 4318:4318 agentevals
docker run -p 8001:8001 -p 4317:4317 -p 4318:4318 agentevals
```

| Port | Purpose |
|------|---------|
| 8001 | Web UI and REST API |
| 4317 | OTLP gRPC receiver (traces and logs) |
| 4318 | OTLP HTTP receiver (traces and logs) |
| 8080 | MCP (Streamable HTTP) |

@@ -363,31 +346,27 @@ See [DEVELOPMENT.md](DEVELOPMENT.md) for build tiers, Makefile targets, and Nix

**Do I need a database or any infrastructure to run agentevals?**

No. agentevals is a single `pip install` with no database, no message queue, and no external services. The CLI evaluates trace files directly from disk. The web UI and live streaming use in-memory session state. You can go from zero to scored traces in under a minute.
No. agentevals is a single `pip install` with no database, no message queue, and no external services. The CLI evaluates trace files directly from disk. The web UI and live streaming use in-memory session state.

**Does the CLI require a running server?**

No. `agentevals run` evaluates trace files entirely offline. The server (`agentevals serve`) is only needed for the web UI, live OTLP streaming, and server-dependent MCP tools like `list_sessions`.

**Can I use agentevals in CI/CD?**

Yes. The CLI is designed for pipeline use: pass trace files and an eval set, set a threshold, and let the exit code gate your deployment. Combine it with `--output json` for machine-readable results. No server process needed.
Yes. Pass trace files and an eval set, set a threshold, and let the exit code gate your deployment. Combine with `--output json` for machine-readable results. No server process needed.

**What if I switch agent frameworks?**

Because agentevals uses OpenTelemetry as its universal interface, switching frameworks (e.g., from LangChain to Strands, or from ADK to OpenAI Agents) does not require changing your evaluation setup. As long as your new framework emits OTel spans, the same eval sets and metrics work as before.
Because agentevals uses OpenTelemetry as its universal interface, switching frameworks does not require changing your evaluation setup. As long as your new framework emits OTel spans, the same eval sets and metrics work as before.

**Can I write evaluators in my own language?**

Yes. A custom evaluator is any program that reads JSON from stdin and writes a score to stdout. Python and JavaScript have first-class scaffolding support (`agentevals evaluator init`), but any language works. If your evaluator has a `requirements.txt`, agentevals manages a cached virtual environment automatically.
Yes. A custom evaluator is any program that reads JSON from stdin and writes a score to stdout. Python and JavaScript have first-class scaffolding support (`agentevals evaluator init`), but any language works.
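
A minimal sketch of that stdin/stdout contract — the field names used below (`"spans"`, `"status"`, `"score"`) are assumptions, not the real schema; run `agentevals evaluator init` to scaffold the actual contract:

```python
#!/usr/bin/env python3
# Custom-evaluator sketch: JSON in on stdin, a score out on stdout.
# Field names ("spans", "status", "score") are assumed, not the real schema.
import json
import sys

def score_trace(payload: dict) -> float:
    """Toy metric: fraction of spans that finished without an error status."""
    spans = payload.get("spans", [])
    if not spans:
        return 0.0
    ok = sum(1 for span in spans if span.get("status") != "ERROR")
    return ok / len(spans)

# When invoked as an evaluator process, wire stdin to stdout:
# json.dump({"score": score_trace(json.load(sys.stdin))}, sys.stdout)
```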

**Can I plug agentevals into an existing OTel pipeline?**

Yes. The OTLP receiver on port 4318 accepts standard `http/protobuf` and `http/json` trace exports, so it slots into any OpenTelemetry pipeline as just another exporter destination. If your pipeline uses gRPC (port 4317), place an [OTel Collector](https://opentelemetry.io/docs/collector/) in front to bridge gRPC to HTTP. The [Kubernetes example](examples/kubernetes/README.md) shows this exact pattern.

**Can I deploy agentevals on Kubernetes?**

Yes. A Dockerfile and a [Helm chart](charts/agentevals/) are included. A single pod exposes the web UI (8001), OTLP receivers (4317 gRPC, 4318 HTTP), and MCP server (8080). See the [Kubernetes example](examples/kubernetes/README.md) for a full walkthrough deploying agentevals alongside kagent and an OTel Collector.
Yes. The OTLP receiver on port 4318 accepts standard `http/protobuf` and `http/json` trace exports, and the gRPC receiver on port 4317 accepts standard OTLP/gRPC exports, so agentevals slots into any OpenTelemetry pipeline as just another exporter destination. The [Kubernetes example](examples/kubernetes/README.md) shows this pattern.

**How does this compare to ADK's evaluations?**

@@ -399,7 +378,7 @@ However, if you're iterating on your agents locally, you can point your agents t

AgentCore's evaluation integration (via `strands-agents-evals`) also couples agent execution with evaluation. It re-invokes the agent for each test case, converts the resulting OTel spans to AWS's ADOT format, and scores them against 4 built-in evaluators (Helpfulness, Accuracy, Harmfulness, Relevance) via a cloud API call. This means you need an AWS account, valid credentials, and network access for every evaluation.

agentevals takes a different approach: it scores pre-recorded traces locally without re-running anything. It works with standard Jaeger JSON and OTLP formats from any framework, supports open-ended metrics (tool trajectory matching, LLM-based judges, custom scorers), and ships with a CLI, web UI, and MCP server. No cloud dependency required, though we do include all ADK's GCP-based evals as of now.
agentevals scores pre-recorded traces locally without re-running anything. It works with standard Jaeger JSON and OTLP formats from any framework, supports open-ended metrics (tool trajectory matching, LLM-based judges, custom scorers), and ships with a CLI, web UI, and MCP server. No cloud dependency is required, though agentevals does currently bundle ADK's GCP-based evals.

**How does this compare to LangSmith?**

114 changes: 104 additions & 10 deletions docs/otel-compatibility.md
@@ -8,7 +8,9 @@ agentevals consumes OpenTelemetry traces to evaluate AI agents. This document co

The [GenAI semantic conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/) define standard span attributes for LLM interactions. agentevals auto-detects this format when spans contain `gen_ai.request.model` or `gen_ai.input.messages`.

Supported attributes:
This format works with LangChain, Strands, OpenAI instrumentation, Anthropic instrumentation, and any framework that follows the GenAI semantic conventions.

#### Core attributes

| Attribute | Description |
|-----------|-------------|
@@ -18,9 +20,51 @@ Supported attributes:
| `gen_ai.response.finish_reasons` | Why the model stopped generating |
| `gen_ai.usage.input_tokens` | Input token count |
| `gen_ai.usage.output_tokens` | Output token count |
| `gen_ai.system` | AI system identifier (e.g. `openai`, `anthropic`) |

This format works with LangChain, Strands, OpenAI instrumentation, Anthropic instrumentation, and any framework that follows the GenAI semantic conventions.
#### Provider and response metadata (v1.37.0+)

| Attribute | Description |
|-----------|-------------|
| `gen_ai.provider.name` | LLM provider (e.g. `openai`, `anthropic`). Replaces the deprecated `gen_ai.system`. |
| `gen_ai.response.model` | Model name returned in the response |
| `gen_ai.response.id` | Unique response identifier |

#### Request parameters (v1.40.0)

| Attribute | Description |
|-----------|-------------|
| `gen_ai.request.temperature` | Temperature sampling parameter |
| `gen_ai.request.max_tokens` | Maximum output tokens limit |
| `gen_ai.request.top_p` | Top-P (nucleus) sampling parameter |
| `gen_ai.request.top_k` | Top-K sampling parameter |

#### Cache token usage

| Attribute | Description |
|-----------|-------------|
| `gen_ai.usage.cache_creation.input_tokens` | Tokens spent creating a prompt cache entry |
| `gen_ai.usage.cache_read.input_tokens` | Tokens served from an existing cache entry |

These are relevant for providers that support prompt caching (Anthropic, OpenAI). agentevals aggregates these across LLM spans and displays them in the performance summary.
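
The aggregation is straightforward to picture. A sketch of summing usage attributes across a list of span-attribute dicts (an illustrative helper, not agentevals' actual implementation):

```python
# Sum GenAI token-usage attributes across LLM spans, the way the
# performance summary totals them. Illustrative helper only.
USAGE_KEYS = [
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
    "gen_ai.usage.cache_creation.input_tokens",
    "gen_ai.usage.cache_read.input_tokens",
]

def total_usage(spans: list[dict]) -> dict:
    """Aggregate token counts over span attribute dicts; missing keys count as 0."""
    return {
        key: sum(span.get("attributes", {}).get(key, 0) for span in spans)
        for key in USAGE_KEYS
    }
```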

#### Agent and tool metadata (v1.31.0+)

| Attribute | Description |
|-----------|-------------|
| `gen_ai.agent.id` | Unique agent identifier |
| `gen_ai.agent.description` | Agent description |
| `gen_ai.tool.description` | Tool description |
| `gen_ai.tool.type` | Tool type classification |

#### Opt-in attributes (v1.37.0+)

These may contain large payloads and are typically gated behind instrumentation flags:

| Attribute | Description |
|-----------|-------------|
| `gen_ai.system_instructions` | System prompt text |
| `gen_ai.tool.definitions` | Tool schema definitions (JSON) |
| `gen_ai.output.type` | Classification of output content |

### Google ADK (framework-native)

Expand All @@ -30,9 +74,33 @@ Google ADK emits spans under the `gcp.vertex.agent` OTel scope with proprietary

Format detection is automatic. When a trace contains both ADK and GenAI attributes, ADK takes priority because it provides richer structured data. The detection logic lives in `src/agentevals/converter.py` (`get_extractor()`).

## Message Formats

GenAI message content (`gen_ai.input.messages`, `gen_ai.output.messages`) can use two JSON schemas. agentevals supports both and normalizes them internally.

### Content-based format

Used by OpenAI and LangChain instrumentors (v2):

```json
{"role": "user", "content": "Hello"}
{"role": "assistant", "content": "...", "tool_calls": [{"type": "function", "function": {"name": "get_weather", "arguments": "{\"city\": \"NYC\"}"}}]}
```

### Parts-based format (v1.36.0+)

Used by newer instrumentors that follow the GenAI semconv parts schema:

```json
{"role": "user", "parts": [{"type": "text", "content": "Hello"}]}
{"role": "assistant", "parts": [{"type": "tool_call", "name": "get_weather", "arguments": {"city": "NYC"}}]}
```

Both formats are auto-detected per message. Tool calls are normalized to `{name, id, arguments}` regardless of source format.
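
A sketch of how that per-message detection and normalization could look, using the two sample messages above (illustrative, not the converter's actual code):

```python
import json

def extract_tool_calls(message: dict) -> list[dict]:
    """Normalize tool calls from either message schema to {name, id, arguments}."""
    calls = []
    if "parts" in message:
        # Parts-based format (GenAI semconv v1.36.0+): arguments are
        # already structured JSON.
        for part in message["parts"]:
            if part.get("type") == "tool_call":
                calls.append({
                    "name": part.get("name"),
                    "id": part.get("id"),
                    "arguments": part.get("arguments"),
                })
    else:
        # Content-based format (OpenAI / LangChain v2 instrumentors):
        # arguments arrive as a JSON-encoded string and must be parsed.
        for call in message.get("tool_calls", []):
            function = call.get("function", {})
            arguments = function.get("arguments")
            if isinstance(arguments, str):
                arguments = json.loads(arguments)
            calls.append({
                "name": function.get("name"),
                "id": call.get("id"),
                "arguments": arguments,
            })
    return calls
```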

## Message Content Delivery

GenAI message content (`gen_ai.input.messages`, `gen_ai.output.messages`) can arrive through three mechanisms. agentevals supports all of them:
GenAI message content can arrive through three mechanisms. agentevals supports all of them:

### 1. Span attributes (simplest)

@@ -80,18 +148,44 @@ If you maintain an OTel-instrumented agent framework and want to align with the

## OTLP Receiver

agentevals runs:
agentevals runs two OTLP receivers:

- **gRPC** on port 4317 (standard OTLP gRPC port, configurable via `--otlp-grpc-port`)
- **HTTP** on port 4318 (standard OTLP HTTP port)

- OTLP HTTP receiver on port 4318 (standard OTLP HTTP port)
- OTLP gRPC receiver on port 4317 (standard OTLP gRPC port)
Both accept traces and logs and feed into the same session manager.

OTLP HTTP accepts:
### OTLP HTTP

| Endpoint | Content Types |
|----------|--------------|
| `/v1/traces` | `application/json`, `application/x-protobuf` |
| `/v1/logs` | `application/json`, `application/x-protobuf` |
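
The `/v1/traces` endpoint can also be exercised without an SDK. A hand-rolled OTLP/JSON export using only the standard library (field names follow the OTLP JSON encoding; in practice you would use an OpenTelemetry exporter instead, and the span IDs below are placeholder values):

```python
import json
import urllib.request

def build_otlp_request(endpoint: str, span_name: str) -> urllib.request.Request:
    """Build a minimal OTLP/JSON trace-export POST for the HTTP receiver."""
    payload = {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": "demo-agent"}}
            ]},
            "scopeSpans": [{
                "scope": {"name": "demo"},
                "spans": [{
                    # Placeholder hex IDs; real exporters generate these.
                    "traceId": "0af7651916cd43dd8448eb211c80319c",
                    "spanId": "b7ad6b7169203331",
                    "name": span_name,
                    "kind": 1,
                    "startTimeUnixNano": "1700000000000000000",
                    "endTimeUnixNano": "1700000001000000000",
                }],
            }],
        }]
    }
    return urllib.request.Request(
        f"{endpoint}/v1/traces",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# req = build_otlp_request("http://localhost:4318", "agent.step")
# urllib.request.urlopen(req)  # streams the span into agentevals
```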

Point OTLP/HTTP exporters at `http://localhost:4318`.
Point OTLP/gRPC exporters at `localhost:4317` with `OTEL_EXPORTER_OTLP_PROTOCOL=grpc`.
### OTLP gRPC

Implements the standard `TraceService/Export` and `LogsService/Export` RPCs. Configuration:

| Setting | Default |
|---------|---------|
| Max message size | 8 MB |
| Max concurrent RPCs | 32 |
| Compression | gzip |
| TLS | off (insecure) |

### Client configuration

For HTTP exporters:

```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
```

For gRPC exporters:

```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
```

Traces and logs stream into agentevals automatically. See [examples/README.md](../examples/README.md) for zero-code setup instructions.