diff --git a/layouts/index.html b/layouts/index.html
index c2cd807..73a8820 100644
--- a/layouts/index.html
+++ b/layouts/index.html
@@ -1,224 +1,141 @@
-{{ define "main" }}
+{{ .Site.Title }}
+{{ partial "head.html" . }}
-

Ship Agents Reliably

-

Benchmark your agents before they hit production. AgentEvals scores performance and inference quality from OpenTelemetry traces — no re-runs, no guesswork.

-
-
-

Why AgentEvals?

-

Evaluate agent behavior from real traces, not synthetic replays.

-
-
-
-
🔍
-

Trace-Based Evaluation

-

Parse OTLP streams and Jaeger JSON traces to evaluate agent behavior directly from production or test telemetry data.

-
-
-
-
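To make the trace-based approach concrete, here is a minimal stand-alone sketch (not the AgentEvals API) that recovers an agent's step trajectory from a Jaeger JSON export. The field layout (`data`, `spans`, `operationName`, `startTime`) follows Jaeger's JSON trace format; the span names and timestamps are invented for the example.

```python
import json

# Sketch only: extract an ordered trajectory from a Jaeger JSON export.
# The trace content below is fabricated for illustration.
jaeger_export = json.loads("""
{
  "data": [{
    "traceID": "abc123",
    "spans": [
      {"spanID": "s2", "operationName": "tool:search", "startTime": 2000},
      {"spanID": "s1", "operationName": "llm:plan",    "startTime": 1000},
      {"spanID": "s3", "operationName": "llm:answer",  "startTime": 3000}
    ]
  }]
}
""")

def trajectory(export: dict) -> list[str]:
    """Return operation names of the first trace, in start-time order."""
    spans = export["data"][0]["spans"]
    return [s["operationName"] for s in sorted(spans, key=lambda s: s["startTime"])]

print(trajectory(jaeger_export))  # ['llm:plan', 'tool:search', 'llm:answer']
```

Because the trace already records what the agent did, a list like this can be scored without touching the agent or its LLM backend again.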

No Re-Running Required

-

Score agent behavior from existing traces. No need to replay expensive LLM calls or wait for agent re-execution.

-
-
-
🎯
-

Golden Eval Sets

-

Define expected behaviors as golden eval sets and score traces against them using ADK's evaluation framework.

-
-
-
📊
-

Trajectory Matching

-

Compare agent trajectories with strict, unordered, subset, or superset matching modes for flexible evaluation.

-
-
-
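The four matching modes named above can be sketched in plain Python. This is one plausible reading of their semantics, written from scratch for illustration; the actual evaluator's behavior may differ in details.

```python
from collections import Counter

def match(expected: list[str], actual: list[str], mode: str = "strict") -> bool:
    """Hypothetical sketch of trajectory matching modes (not the real API)."""
    if mode == "strict":      # same steps, same order
        return actual == expected
    if mode == "unordered":   # same steps, any order (multiset equality)
        return Counter(actual) == Counter(expected)
    if mode == "subset":      # every expected step occurs somewhere in the run
        return not (Counter(expected) - Counter(actual))
    if mode == "superset":    # the run contains no steps beyond the expected set
        return not (Counter(actual) - Counter(expected))
    raise ValueError(f"unknown mode: {mode}")

expected = ["plan", "search", "answer"]
print(match(expected, ["search", "plan", "answer"], "unordered"))      # True
print(match(expected, ["plan", "search", "answer", "log"], "subset"))  # True
```

Strict mode suits regression tests with deterministic tools; the looser modes tolerate reordering or extra housekeeping steps.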
🤖
-

LLM-as-Judge

-

Use LLM-powered evaluation for nuanced scoring of agent behavior without requiring reference trajectories.

-
-
-
🛠
-

CI/CD Integration

-

Run evaluations in your pipeline with the CLI. Gate deployments on agent behavior quality scores.

-
-
-
🧩
-

Custom Evaluators

-

Write custom scoring logic in Python, JavaScript, or any language. Share and discover evaluators through the community registry.

-
-
-
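A custom evaluator can be as small as a function from parsed spans to a score. The shape below is a hypothetical illustration, not AgentEvals' real plugin interface; the `kind` and `status` fields are assumed for the example.

```python
# Hypothetical evaluator sketch: score 1.0 when every tool-call span
# succeeded, otherwise the success ratio. Field names are assumptions.
def no_failed_tool_calls(spans: list[dict]) -> float:
    tool_spans = [s for s in spans if s.get("kind") == "tool"]
    if not tool_spans:
        return 1.0  # vacuously passes when the agent used no tools
    ok = sum(1 for s in tool_spans if s.get("status") == "OK")
    return ok / len(tool_spans)

spans = [
    {"kind": "llm",  "status": "OK"},
    {"kind": "tool", "status": "OK"},
    {"kind": "tool", "status": "ERROR"},
]
print(no_failed_tool_calls(spans))  # 0.5
```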
-
-
-

How It Works

-

Three steps from traces to scores.

-
-
-
-
1
-

Collect Traces

-

Instrument your agent with OpenTelemetry or export Jaeger JSON traces from your observability platform.

+
+
+
+ Open source • Python SDK • OpenTelemetry native
-
-
2
-

Define Eval Sets

-

Create golden evaluation sets that describe expected agent behaviors, tool calls, and trajectories.

+

Score your AI agents' behavior from traces.

+

+ AgentEvals is the open-source Python framework for scoring AI agent performance and behavior
+ from OpenTelemetry traces. Test prompts, tools, memory, and workflows without re-running your agents.

-
3
-

Score & Report

-

Run evaluations via CLI or Web UI. Get detailed scores and pass/fail results.

+
+ CLI
+ Custom Evaluators
+ Web UI
+ CI/CD
-
-
-
-
-

Three Ways to Evaluate

-

Choose the interface that fits your workflow.

-
-
-
-
-

CLI

-

Script evaluations and integrate into CI/CD pipelines. Pipe in traces, get scores out. Built for automation.

-
-
-
🖥
-

Web UI

-

Visually inspect traces and interactively evaluate agent behavior. Browse results, compare runs, and drill into details.

+
+
+

Evaluation that matches how agents actually run.

+

Traditional evals re-run entire workflows. AgentEvals scores the traces you already collect, so you can measure behavior in realistic conditions.

-
-
-
-
-
-
-

Build Your Own Evaluators

-

Write custom scoring logic in Python, JavaScript, or any language. Share it with the community through our evaluator registry.

+
+
+
+

Trace-native evaluation

+

Built on OpenTelemetry traces so you can evaluate real production-like runs without replaying agent execution.

+
+
+
+

Flexible scoring

+

Combine built-in evaluators with custom Python logic to measure correctness, tool usage, memory behavior, and more.

+
+
+
+

Works in your workflow

+

Run locally with the CLI, automate in CI/CD, or explore results visually in the web UI.

+
-
+

From traces to scores in three steps.

-
-
-

Get Started

-

Up and running in seconds.

-
-
-
-
+
+ 01

Collect traces

+

Instrument your agent with OpenTelemetry and emit traces for prompts, tool calls, memory operations, and outputs.

+
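Once instrumented, each step of the agent shows up as a span that evaluators can consume. The record below is illustrative only: the attribute names are modeled loosely on OpenTelemetry's GenAI semantic conventions and are assumptions, not AgentEvals requirements, and all values are invented.

```python
# Illustrative span shape for an instrumented agent step (all values invented;
# gen_ai.* attribute names are assumptions modeled on OTel GenAI conventions).
span = {
    "name": "llm:plan",
    "attributes": {
        "gen_ai.request.model": "example-model",
        "gen_ai.prompt": "Find flights to Lisbon",      # what the agent was asked
        "gen_ai.completion": "call search_flights",     # what it decided to do
    },
    "events": [
        {"name": "tool_call", "attributes": {"tool.name": "search_flights"}},
    ],
}
print(span["events"][0]["attributes"]["tool.name"])  # search_flights
```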
+ 02

Define evaluators

+

Choose built-in evaluators or create your own to score the behaviors that matter for your agent.

+
+ 03

Run evaluations

+

Score trace datasets through the CLI or web UI and compare results across prompts, models, or tool strategies.

- terminal
-
-
-# Install from release wheel
-pip install agentevals-<version>-py3-none-any.whl
-
-# Run an evaluation against a trace
-agentevals run samples/helm.json \
-  --eval-set samples/eval_set_helm.json \
-  -m tool_trajectory_avg_score
-
-# Start the web UI
-agentevals serve
+

Start with the path that fits your workflow.

+
+ {{ range where .Site.RegularPages "Section" "docs" }}
+

{{ .Title }}

+

{{ .Description }}

+
+ {{ end }}
-
-

Start Evaluating Your Agents

-

Open source. Trace-driven. No re-runs needed.

-
+
+

Two ways to evaluate.

+

Use the CLI for fast, scriptable scoring or the Web UI for visual exploration of evaluation results.

+
+
+

CLI

+

Run evaluations locally or in CI with straightforward commands and structured outputs.

+
agentevals eval run config.yaml
+
+
+

Web UI

+

Inspect trace datasets, compare runs, and review evaluator outputs in a visual interface.

+
agentevals ui
+
+
+
-{{ end }}
+

Bring evaluation into your agent development loop.

+

Install AgentEvals, connect your traces, and start measuring how your agent behaves in the real world.

+