Trace-native evaluation
Built on OpenTelemetry traces so you can evaluate real production-like runs without replaying agent execution.
Benchmark your agents before they hit production. AgentEvals scores performance and inference quality from OpenTelemetry traces: no re-runs, no guesswork.
Evaluate agent behavior from real traces, not synthetic replays.
Parse OTLP streams and Jaeger JSON traces to evaluate agent behavior directly from production or test telemetry data.
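As a concrete sketch of what trace parsing involves, the snippet below pulls tool-call names out of a Jaeger JSON export. The `data`/`spans`/`tags` layout follows Jaeger's UI export format, but the `tool.name` tag convention is an assumption for illustration, not AgentEvals' actual parser.

```python
def tool_calls_from_jaeger(trace_json):
    """Return tool names from a Jaeger JSON export, in span start order.

    Assumes the Jaeger UI export shape ({"data": [{"spans": [...]}]}) and,
    as an illustrative convention, that tool-call spans carry a "tool.name"
    tag. Real instrumentation conventions vary.
    """
    calls = []
    for trace in trace_json.get("data", []):
        for span in trace.get("spans", []):
            tags = {t["key"]: t["value"] for t in span.get("tags", [])}
            if "tool.name" in tags:
                calls.append((span["startTime"], tags["tool.name"]))
    # Sort by start time so the result reflects the agent's trajectory.
    return [name for _, name in sorted(calls)]

# Usage: trace_json = json.load(open("trace.json")) after a Jaeger export.
```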
Score agent behavior from existing traces. No need to replay expensive LLM calls or wait for agent re-execution.
Define expected behaviors as golden eval sets and score traces against them using ADK's evaluation framework.
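A golden eval case pairs an input with the behavior you expect to see in the trace. The shape below is purely illustrative; the actual eval-set schema used by ADK's evaluation framework will differ, so treat every field name here as an assumption.

```python
# Illustrative golden eval case. Field names are assumptions, not the
# real ADK eval-set schema; consult the framework docs for the actual format.
golden_case = {
    "name": "refund_request",
    "input": "I want a refund for order 1234",
    "expected_tool_calls": ["lookup_order", "issue_refund"],
    "expected_response_contains": "refund",
}
```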
Compare agent trajectories with strict, unordered, subset, or superset matching modes for flexible evaluation.
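The four matching modes can be pictured with a small comparison function. This is a behavioral sketch of what the mode names suggest, not AgentEvals' implementation or API.

```python
from collections import Counter

def trajectories_match(expected, actual, mode="strict"):
    """Compare two tool-call trajectories under a matching mode.

    strict:    identical calls in identical order
    unordered: identical calls, order ignored
    subset:    every expected call occurs in actual (actual may do more)
    superset:  every actual call occurs in expected (actual may do less)
    These semantics are an illustrative reading of the mode names.
    """
    if mode == "strict":
        return expected == actual
    if mode == "unordered":
        return Counter(expected) == Counter(actual)
    if mode == "subset":
        return not (Counter(expected) - Counter(actual))
    if mode == "superset":
        return not (Counter(actual) - Counter(expected))
    raise ValueError(f"unknown mode: {mode!r}")
```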
Use LLM-powered evaluation for nuanced scoring of agent behavior without requiring reference trajectories.
Run evaluations in your pipeline with the CLI. Gate deployments on agent behavior quality scores.
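In a pipeline, gating can be as simple as running the CLI and checking its exit status. The helper below sketches that; the assumption that a failing evaluation exits non-zero is mine, so verify it against the CLI's documented behavior before relying on it.

```python
import subprocess
import sys

def evaluation_passes(cmd):
    """Run an evaluation command and report success via its exit code.

    Assumes the CLI exits non-zero when the evaluation fails; check that
    behavior against the actual agentevals CLI before gating deploys on it.
    """
    return subprocess.run(cmd).returncode == 0

# CI usage sketch:
#   if not evaluation_passes(["agentevals", "run", "samples/helm.json",
#                             "--eval-set", "samples/eval_set_helm.json",
#                             "-m", "tool_trajectory_avg_score"]):
#       sys.exit("evaluation failed; blocking deployment")
```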
-Write custom scoring logic in Python, JavaScript, or any language. Share and discover evaluators through the community registry.
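In Python, a custom evaluator can be an ordinary function that maps a trace to a score. Everything below (the trace shape, the `final_response` key, the 0-to-1 score convention) is a hypothetical illustration, not the registry's required interface.

```python
def concise_response_evaluator(trace):
    """Hypothetical custom evaluator: reward a non-empty final response
    under 2000 characters.

    The "final_response" key is an assumed trace shape for illustration,
    not a documented AgentEvals interface.
    """
    response = trace.get("final_response", "")
    return 1.0 if 0 < len(response) < 2000 else 0.0
```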
Three steps from traces to scores.
Instrument your agent with OpenTelemetry or export Jaeger JSON traces from your observability platform.
Create golden evaluation sets that describe expected agent behaviors, tool calls, and trajectories.
AgentEvals is the open-source Python framework for scoring AI agent performance and behavior from OpenTelemetry traces. Test prompts, tools, memory, and workflows without re-running your agents.
Run evaluations via CLI or Web UI. Get detailed scores and pass/fail results.
Choose the interface that fits your workflow.
Script evaluations and integrate into CI/CD pipelines. Pipe in traces, get scores out. Built for automation.
Visually inspect traces and interactively evaluate agent behavior. Browse results, compare runs, and drill into details.
Traditional evals re-run entire workflows. AgentEvals scores the traces you already collect, so you can measure behavior in realistic conditions.
Combine built-in evaluators with custom Python logic to measure correctness, tool usage, memory behavior, and more.
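Combining evaluators can be sketched as a weighted average over per-trace scores. The interface below is invented for illustration; AgentEvals' own aggregation may work differently.

```python
def combined_score(trace, weighted_evaluators):
    """Weighted average of evaluator scores for one trace.

    weighted_evaluators: list of (evaluator_fn, weight) pairs, where each
    evaluator_fn maps a trace to a score in [0, 1]. This aggregation
    scheme is an illustration, not AgentEvals' API.
    """
    total_weight = sum(w for _, w in weighted_evaluators)
    return sum(fn(trace) * w for fn, w in weighted_evaluators) / total_weight
```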
Run locally with the CLI, automate in CI/CD, or explore results visually in the web UI.
Up and running in seconds.
Instrument your agent with OpenTelemetry and emit traces for prompts, tool calls, memory operations, and outputs.
Choose built-in evaluators or create your own to score the behaviors that matter for your agent.
Score trace datasets through the CLI or web UI and compare results across prompts, models, or tool strategies.
# Install from release wheel
pip install agentevals-<version>-py3-none-any.whl

# Run an evaluation against a trace
agentevals run samples/helm.json \
    --eval-set samples/eval_set_helm.json \
    -m tool_trajectory_avg_score

# Start the web UI
agentevals serve

Start with the path that fits your workflow.
Open source. Trace-driven. No re-runs needed.
Use the CLI for fast, scriptable scoring or the Web UI for visual exploration of evaluation results.
Run evaluations locally or in CI with straightforward commands and structured outputs.

agentevals eval run config.yaml

Inspect trace datasets, compare runs, and review evaluator outputs in a visual interface.

agentevals ui

Install AgentEvals, connect your traces, and start measuring how your agent behaves in the real world.