1 change: 1 addition & 0 deletions README.md
@@ -117,6 +117,7 @@ The report grows a `Comparison vs Baseline` section with per-metric deltas.

- [Quickstart tutorial](https://github.com/Azure/agentops/blob/main/docs/tutorial-quickstart.md) — bootstrap a workspace and run one evaluation.
- [End-to-end tutorial](https://github.com/Azure/agentops/blob/main/docs/tutorial-end-to-end.md) — full do-it-yourself tour: Foundry hosted agent, baseline comparison, GitFlow CI/CD, watchdog.
- [Copilot skills tutorial](https://github.com/Azure/agentops/blob/main/docs/tutorial-copilot-skills.md) — use AgentOps skills to have Copilot configure, run, explain, and wire evals into CI.
- Per-scenario tutorials:
- [Foundry hosted agent](https://github.com/Azure/agentops/blob/main/docs/tutorial-basic-foundry-agent.md)
- [Model-direct](https://github.com/Azure/agentops/blob/main/docs/tutorial-model-direct.md)
12 changes: 6 additions & 6 deletions docs/ci-github-actions.md
@@ -220,18 +220,18 @@ agentops workflow generate --dir <path> # different repo root

## Customisation tips

- **Tighten thresholds for QA / PROD** copy `.agentops/run.yaml` to
`.agentops/run-qa.yaml` / `.agentops/run-prod.yaml` and tighten
thresholds in the bundle. Update the `inputs.config` default in the
- **Tighten thresholds for QA / PROD** - copy `agentops.yaml` to
`agentops-qa.yaml` / `agentops-prod.yaml` and tighten the
`thresholds:` block. Update the `inputs.config` default in the
matching workflow file.
- **Scheduled runs** — add a `schedule:` entry in `agentops-pr.yml` (or
a new file) to evaluate against `main` nightly.
- **Matrix per scenario** if you have multiple `runs/*.yaml`, extend
- **Matrix per scenario** - if you have multiple AgentOps config files, extend
the eval job with `strategy.matrix.config:` and reference
`${{ matrix.config }}` in the eval step.
- **Regression baseline** wire deploy templates to download the
- **Regression baseline** - wire deploy templates to download the
previous run's `results.json` artifact and call
`agentops eval compare` between the two.
`agentops eval run --baseline <results.json>`.

## Migration from the older 3-template layout

179 changes: 76 additions & 103 deletions docs/concepts.md
@@ -1,32 +1,31 @@
# Concepts

This page explains the core building blocks of AgentOps and how they fit together. For the full schema reference and architecture details, see [how-it-works.md](how-it-works.md).
This page explains the core AgentOps building blocks. For the full schema
reference and architecture details, see [how-it-works.md](how-it-works.md).

## How an Evaluation Works

```mermaid
flowchart TD
run["run.yaml<br/><i>what, where, how to eval</i>"]
bundle["Bundle<br/><i>evaluators + thresholds</i>"]
config["agentops.yaml<br/><i>target, dataset, thresholds</i>"]
dataset["Dataset<br/><i>JSONL rows: input, expected</i>"]
runner(["Runner<br/><i>resolves backend</i>"])
runner(["Runner<br/><i>resolves target kind</i>"])
foundry["Foundry<br/>Backend"]
http["HTTP<br/>Backend"]
local["Local<br/>Adapter"]
model["Model-direct<br/>Backend"]
evals(["Evaluators<br/><i>score each response</i>"])
results[/"results.json<br/>(machine)"/]
report[/"report.md<br/>(human)"/]

run --> bundle
run --> dataset
bundle --> runner
config --> dataset
config --> runner
dataset --> runner
runner --> foundry
runner --> http
runner --> local
runner --> model
foundry --> evals
http --> evals
local --> evals
model --> evals
evals --> results
evals --> report
```
@@ -37,122 +36,91 @@ flowchart TD

### Workspace

The `.agentops/` directory inside your project root. Created by `agentops init`, it holds all evaluation configuration: run configs, bundles, datasets, data files, and results.
The workspace is created by `agentops init`. The evaluation config lives
in the flat `agentops.yaml` file at the project root; `.agentops/` stores
seed data, run history, and optional supporting files.

```
```text
agentops.yaml # flat config: agent, dataset, thresholds
.agentops/
├── config.yaml # workspace defaults
├── run.yaml # default run config
├── bundles/ # evaluation policies
├── datasets/ # dataset definitions (YAML)
├── data/ # dataset rows (JSONL)
└── results/ # run outputs + latest/ pointer
├── data/ # dataset rows (JSONL)
└── results/ # run outputs + latest/ pointer
```

### Run Config

A YAML file (typically `run.yaml`) that connects **what** to evaluate, **how** to reach it, and **which evaluators** to apply. It references one bundle and one dataset.

A run config has three key dimensions:
### AgentOps Config

| Dimension | Values | Purpose |
|---|---|---|
| `target.type` | `agent`, `model` | What is being evaluated |
| `target.execution_mode` | `local`, `remote` | How AgentOps reaches the target |
| `target.endpoint.kind` | `foundry_agent`, `http` | Remote endpoint type (when remote) |
A YAML file named `agentops.yaml` that connects **what** to evaluate,
**which dataset** to use, and **which thresholds** gate the run.

Minimal example:
A minimal config is:

```yaml
version: 1
target:
type: agent
hosting: foundry
execution_mode: remote
endpoint:
kind: foundry_agent
agent_id: my-agent:1
model: gpt-4o
project_endpoint_env: AZURE_AI_FOUNDRY_PROJECT_ENDPOINT
bundle:
name: rag_quality_baseline
dataset:
name: smoke-rag
agent: "my-agent:1"
dataset: .agentops/data/smoke.jsonl
```

See [how-it-works.md](how-it-works.md) for the full schema, all fields, and validation rules.

### Bundle
Common `agent:` values:

A YAML file that defines **which evaluators** to run and **what thresholds** to enforce. Bundles are reusable — the same bundle can evaluate different targets across environments.
| Agent value | Target kind |
|---|---|
| `"support-bot:1"` | Foundry prompt agent (`name:version`) |
| `"https://api.example.com/chat"` | HTTP/JSON agent |
| `"model:gpt-4o-mini"` | Direct model deployment |

Each bundle contains:
- A list of evaluators (AI-assisted or local metrics)
- Threshold rules that determine pass/fail

```yaml
# .agentops/bundles/model_quality_baseline.yaml
evaluators:
- name: SimilarityEvaluator
source: foundry
enabled: true
thresholds:
- metric: SimilarityEvaluator
operator: ">="
value: 3.0
```

See [bundles.md](bundles.md) for the full bundle authoring guide.
HTTP targets can add top-level mapping fields such as `request_field`,
`response_field`, `tool_calls_field`, `auth_header_env`, and
`extra_fields`.

### Dataset

A YAML config that points to a JSONL file containing evaluation rows. Each row has an `input` (the prompt) and an `expected` (the reference answer). Some scenarios add extra fields like `context` (RAG) or `tool_calls` (agent workflows).

```yaml
# .agentops/datasets/smoke-model-direct.yaml
source:
type: file
path: ../data/smoke-model-direct.jsonl
format:
type: jsonl
input_field: input
expected_field: expected
```
A JSONL file containing evaluation rows. Each row has an `input` prompt
and usually an `expected` reference answer. Some scenarios add extra
fields like `context` (RAG), `tool_definitions`, or `tool_calls` (agent
workflows).

```json
{"id": "1", "input": "What is Python?", "expected": "Python is a programming language."}
```
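
As a rough illustration of the row shape described above, the sketch below parses JSONL text and checks that every row carries an `input`. The `load_rows` helper is hypothetical and is not an AgentOps API; treating `input` as the only required field is an assumption based on the description above, where `expected` is only "usually" present.

```python
import json


def load_rows(jsonl_text):
    """Parse JSONL text into row dicts (hypothetical helper, not an AgentOps API)."""
    rows = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between rows
        row = json.loads(line)
        # Assumption: `input` is required, every other field is optional.
        if "input" not in row:
            raise ValueError(f"row {line_no} is missing the 'input' field")
        rows.append(row)
    return rows


sample = '{"id": "1", "input": "What is Python?", "expected": "Python is a programming language."}\n'
rows = load_rows(sample)
print(rows[0]["input"])
```

A check like this can run before an evaluation to catch malformed rows early, independently of whatever validation AgentOps itself performs.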

### Evaluator

A scoring function that measures one aspect of the target's response. Evaluators can be:
A scoring function that measures one aspect of the target response.
Evaluators can be:

- **AI-assisted** (Foundry) — use a judge model to score responses on criteria like coherence, fluency, or groundedness (1-5 scale)
- **Local metrics** — computed without a model, such as `F1ScoreEvaluator` or `avg_latency_seconds`
- **AI-assisted** (Foundry) — use a judge model to score responses on
criteria like coherence, fluency, similarity, or groundedness.
- **Local metrics** — computed without a judge model, such as
`F1ScoreEvaluator` or `avg_latency_seconds`.

Evaluators are configured inside bundles. See [foundry-evaluation-sdk-built-in-evaluators.md](foundry-evaluation-sdk-built-in-evaluators.md) for the complete evaluator reference.
AgentOps auto-selects evaluators from the target kind and dataset shape.
Use `evaluators:` in `agentops.yaml` only when you need to override that
selection. See
[foundry-evaluation-sdk-built-in-evaluators.md](foundry-evaluation-sdk-built-in-evaluators.md)
for the complete evaluator reference.

### Backend
### Target resolver

The execution engine that sends dataset rows to the target and collects responses. The runner automatically selects the backend based on the run config:
The execution engine sends dataset rows to the target and collects
responses. AgentOps automatically selects the target kind from the shape
of the `agent:` value.

| Execution Mode | Endpoint Kind | Backend | Use case |
|---|---|---|---|
| `remote` | `foundry_agent` | Foundry Backend | Foundry agents and models |
| `remote` | `http` | HTTP Backend | LangGraph, LangChain, ACA, custom REST |
| `local` | | Local Adapter | In-process Python functions or subprocess |
| `agent:` shape | Target kind | Use case |
|---|---|---|
| `name:version` | Foundry prompt agent | Foundry Agent Service agents |
| `https://...` | HTTP/JSON endpoint | LangGraph, Agent Framework, ACA, AKS, custom REST |
| `model:<deployment>` | Model-direct | Raw model deployment checks |
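
The mapping in the table can be sketched as a small classifier. This is an illustration only, not the AgentOps resolver: the function name, the returned labels, the order of the checks, and the acceptance of plain `http://` URLs are all assumptions.

```python
def resolve_target_kind(agent):
    """Map an `agent:` value to a target kind label (illustrative, not AgentOps code)."""
    # URLs are checked first because they also contain a colon.
    if agent.startswith(("https://", "http://")):  # accepting http:// is an assumption
        return "http"
    if agent.startswith("model:"):
        return "model_direct"
    name, sep, version = agent.partition(":")
    if sep and name and version:
        return "foundry_agent"  # `name:version` shape
    raise ValueError(f"unrecognised agent value: {agent!r}")


for value in ("support-bot:1", "https://api.example.com/chat", "model:gpt-4o-mini"):
    print(value, "->", resolve_target_kind(value))
```

The ordering matters in this sketch: a URL or a `model:` prefix would otherwise match the looser `name:version` check.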

## Evaluation Scenarios

AgentOps ships starter bundles for common evaluation patterns. Each bundle pairs specific evaluators with default thresholds:
AgentOps auto-selects the evaluation scenario from the dataset fields and, where noted, the target kind:

| Scenario | Bundle | Key Evaluators | When to use |
| Scenario | Dataset signal | Key evaluators | When to use |
|---|---|---|---|
| **Model Quality** | `model_quality_baseline` | Similarity, Coherence, Fluency, F1Score | Direct model deployment checks |
| **RAG** | `rag_quality_baseline` | Groundedness, Relevance, Retrieval, ResponseCompleteness | RAG pipelines with context retrieval |
| **Conversational** | `conversational_agent_baseline` | Coherence, Fluency, Relevance, Similarity | Chatbots, Q&A assistants |
| **Agent Workflow** | `agent_workflow_baseline` | TaskCompletion, ToolCallAccuracy, IntentResolution, ToolSelection | Agents with tool calling |
| **Content Safety** | `safe_agent_baseline` | Violence, Sexual, SelfHarm, HateUnfairness, ProtectedMaterial | Responsible AI checks |
| **Model Quality** | `input`, `expected` on `model:<deployment>` | Similarity, Coherence, Fluency, F1Score | Direct model deployment checks |
| **RAG** | `context` | Groundedness, Relevance, Retrieval, ResponseCompleteness | RAG pipelines with context retrieval |
| **Conversational** | `input`, `expected` on an agent | Coherence, Fluency, Similarity/F1 where applicable | Chatbots, Q&A assistants |
| **Agent Workflow** | `tool_calls`, `tool_definitions` | ToolCallAccuracy, IntentResolution, TaskAdherence | Agents with tool calling |
| **Content Safety** | Explicit safety evaluators | Violence, Sexual, SelfHarm, HateUnfairness, ProtectedMaterial | Responsible AI checks |
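
One way to picture the dataset signals in the table is the sketch below. It is a guess at the selection logic rather than AgentOps code: `detect_scenario`, its labels, and in particular the precedence applied when a row carries several signals are assumptions. Content safety is left out because the table says it needs explicitly configured evaluators.

```python
def detect_scenario(row, target_kind):
    """Pick a scenario label from one dataset row (illustrative; precedence is assumed)."""
    # Assumption: tool fields win over `context` when both are present.
    if "tool_calls" in row or "tool_definitions" in row:
        return "agent_workflow"
    if "context" in row:
        return "rag"
    if "input" in row and "expected" in row:
        # The same row shape maps to a different scenario depending on the target.
        return "model_quality" if target_kind == "model_direct" else "conversational"
    return "unknown"


row = {"id": "1", "input": "What is Python?", "expected": "Python is a programming language."}
print(detect_scenario(row, "model_direct"))
print(detect_scenario(row, "http"))
```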

Each scenario has a dedicated tutorial:

@@ -165,16 +133,21 @@ Each scenario has a dedicated tutorial:

## Configuration Model

Run configs use an orthogonal target model. The three key dimensions — `type`, `execution_mode`, and `endpoint.kind` — are independent. Additional optional fields:
`agentops.yaml` is the single source of truth. Keep it small and add only
the fields your target needs:

| Field | Values | When to use |
|---|---|---|
| `target.hosting` | `local`, `foundry`, `aks`, `containerapps` | Metadata: where the target runs |
| `target.framework` | `agent_framework`, `langgraph`, `custom` | Agent targets only |
| `target.agent_mode` | `prompt`, `hosted` | Foundry agents only |
```yaml
version: 1
agent: "https://api.example.com/chat"
dataset: .agentops/data/support.jsonl

request_field: message
response_field: text

**Bundle and dataset references** support two resolution modes:
- `name` — convention-based: resolves to `.agentops/bundles/<name>.yaml` or `.agentops/datasets/<name>.yaml`
- `path` — explicit relative path to the YAML file
thresholds:
coherence: ">=3"
avg_latency_seconds: "<=2"
```
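
To make the threshold strings concrete, here is a minimal sketch of how a rule such as `">=3"` could be checked against a measured value. It is not the AgentOps implementation: the helper name, the operators other than `>=` and `<=`, and the metric values are made up for illustration.

```python
import operator
import re

# Operators beyond >= and <= are an assumption; only those two appear in the docs.
_OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt, "==": operator.eq}
_RULE = re.compile(r"^\s*(>=|<=|==|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def check_threshold(rule, value):
    """Return True when `value` satisfies a rule string such as '>=3' (hypothetical helper)."""
    match = _RULE.match(rule)
    if match is None:
        raise ValueError(f"cannot parse threshold rule: {rule!r}")
    op, limit = match.groups()
    return _OPS[op](value, float(limit))


thresholds = {"coherence": ">=3", "avg_latency_seconds": "<=2"}
metrics = {"coherence": 4.2, "avg_latency_seconds": 2.5}  # made-up results
results = {name: check_threshold(rule, metrics[name]) for name, rule in thresholds.items()}
print(results)  # {'coherence': True, 'avg_latency_seconds': False}
```

With these made-up numbers the coherence gate passes and the latency gate fails, which is the kind of outcome that would fail a gated run.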

See [how-it-works.md](how-it-works.md) for the full schema, all endpoint fields, validation rules, and more configuration examples.
See [how-it-works.md](how-it-works.md) for the full schema, endpoint
fields, validation rules, and more examples.
4 changes: 2 additions & 2 deletions docs/how-it-works.md
@@ -46,7 +46,7 @@ src/
│ ├── invocations.py # Per-row agent / model invocation strategies
│ ├── thresholds.py # Threshold pass/fail evaluation
│ ├── reporter.py # Markdown report generation
│ ├── comparison.py # `eval compare` two runs
│ ├── comparison.py # Baseline delta rendering for `eval run --baseline`
│ ├── publisher.py # Classic Foundry publish (OneDP upload of metrics)
│ └── cloud_publisher.py # New Foundry publish (server-side via OpenAI Evals API)
@@ -108,7 +108,7 @@ When you run `agentops eval run`, the following happens step by step:
|---|---|---|
| `agentops init [--path DIR]` | Scaffold `.agentops/` workspace with starter config, bundles, datasets, and data. Also installs coding agent skills. | Available |
| `agentops eval run` | Execute an evaluation (main command) | Available |
| `agentops eval compare --runs ID1,ID2` | Compare two past evaluation runs | Available |
| `agentops eval run --baseline <results.json>` | Run an eval and add a comparison against a previous result | Available |
| `agentops skills install` | Install AgentOps coding agent skills (Copilot, Claude) into the target project | Available |
| `agentops run list\|show` | List or inspect past runs | Planned (stub) |
| `agentops run view <id> [--entry N]` | Deep-inspect a run | Planned (stub) |
18 changes: 9 additions & 9 deletions docs/tutorial-agent-workflow.md
@@ -17,8 +17,8 @@ both of these row fields:
When AgentOps sees `tool_calls` (or `tool_definitions`) in the
dataset rows, it auto-selects the **agent workflow** evaluators:
TaskCompletion, ToolCallAccuracy, IntentResolution, TaskAdherence,
plus the conversational baseline (Coherence, Fluency, Similarity,
F1Score, latency).
plus the conversational baseline metrics that apply to the target
(Coherence, Fluency, latency, and any explicitly configured text metric).

## 1. Bootstrap

@@ -44,11 +44,11 @@ body:
```yaml
version: 1
agent: "https://aca-weather-bot.example.com/"
http:
request_field: message
response_field: text
tool_calls_field: tool_calls
dataset: .agentops/data/tools.jsonl

request_field: message
response_field: text
tool_calls_field: tool_calls
```

`tool_calls_field` tells AgentOps where in the response JSON to find
@@ -61,9 +61,9 @@ the structured tool calls (dot-path notation supported).
{"id":"2","input":"How is the weather in Tokyo, Japan?","expected":"Calls get_weather with location='Tokyo, Japan'.","tool_calls":[{"type":"function_call","name":"get_weather","arguments":{"location":"Tokyo, Japan"}}]}
```

You can additionally include `tool_definitions` to give the evaluator
the schema of every tool the agent should know about. This sharpens
the **ToolSelectionEvaluator** judgement.
Include `tool_definitions` when you evaluate tool-call accuracy. The
evaluator needs the schema of every tool the agent should know about;
repeat the catalogue on each JSONL row so every row is self-contained.

## 4. Run

2 changes: 1 addition & 1 deletion docs/tutorial-basic-foundry-agent.md
@@ -219,4 +219,4 @@ The RAG scenario uses GroundednessEvaluator instead of SimilarityEvaluator becau
- [Model-Direct Tutorial](tutorial-model-direct.md) — evaluate a model without agents
- [RAG Tutorial](tutorial-rag.md) — evaluate retrieval-augmented responses
- [Baseline Comparison Tutorial](tutorial-baseline-comparison.md) — compare runs and detect regressions
- [Copilot Skills Tutorial](tutorial-copilot-skills.md) — install skills for AI-assisted guidance
- [Copilot Skills Tutorial](tutorial-copilot-skills.md) — use the installed AgentOps skills to build an eval workflow with Copilot
10 changes: 5 additions & 5 deletions docs/tutorial-conversational-agent.md
@@ -48,10 +48,10 @@ different field names, override them:
```yaml
version: 1
agent: "https://api.example.com/chat"
http:
request_field: prompt
response_field: choices.0.message.content
dataset: .agentops/data/chat.jsonl

request_field: prompt
response_field: choices.0.message.content
```
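
A dot-path such as `choices.0.message.content` can be read as a walk through the response JSON. The sketch below shows one plausible interpretation, assuming numeric segments index into lists; `get_by_dot_path` and the sample response are hypothetical and do not come from AgentOps.

```python
def get_by_dot_path(payload, path):
    """Follow a dot-separated path through nested dicts and lists (illustrative helper)."""
    current = payload
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]  # assumption: numeric segments are list indexes
        else:
            current = current[part]
    return current


# Made-up response shaped like a chat-completion style payload.
response = {"choices": [{"message": {"content": "Hello!"}}]}
print(get_by_dot_path(response, "choices.0.message.content"))
```

Trying a path against a captured response this way is a quick check that `response_field` points at the text you expect before running a full evaluation.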

## 3. Dataset shape (`chat.jsonl`)
@@ -67,8 +67,8 @@ auto-selects the **conversational baseline** evaluators: Coherence,
Fluency, Similarity, F1Score, average latency.

> Want to test multi-turn behaviour explicitly? Have your service
> accept a `history` field, then add `extra_fields: [history]` under
> `http:` and include a `history` array in each JSONL row.
> accept a `history` field, then add `extra_fields: [history]` to
> `agentops.yaml` and include a `history` array in each JSONL row.

## 4. Run
