Azure · placerda · May 8, 2026
diff --git a/README.md b/README.md
@@ -117,6 +117,7 @@ The report grows a `Comparison vs Baseline` section with per-metric deltas.
 
 - [Quickstart tutorial](https://github.com/Azure/agentops/blob/main/docs/tutorial-quickstart.md) — bootstrap a workspace and run one evaluation.
 - [End-to-end tutorial](https://github.com/Azure/agentops/blob/main/docs/tutorial-end-to-end.md) — full do-it-yourself tour: Foundry hosted agent, baseline comparison, GitFlow CI/CD, watchdog.
+- [Copilot skills tutorial](https://github.com/Azure/agentops/blob/main/docs/tutorial-copilot-skills.md) — use AgentOps skills to have Copilot configure, run, explain, and wire evals into CI.
 - Per-scenario tutorials:
   - [Foundry hosted agent](https://github.com/Azure/agentops/blob/main/docs/tutorial-basic-foundry-agent.md)
   - [Model-direct](https://github.com/Azure/agentops/blob/main/docs/tutorial-model-direct.md)

diff --git a/docs/ci-github-actions.md b/docs/ci-github-actions.md
@@ -220,18 +220,18 @@ agentops workflow generate --dir <path>        # different repo root
 
 ## Customisation tips
 
-- **Tighten thresholds for QA / PROD** — copy `.agentops/run.yaml` to
-  `.agentops/run-qa.yaml` / `.agentops/run-prod.yaml` and tighten
-  thresholds in the bundle. Update the `inputs.config` default in the
+- **Tighten thresholds for QA / PROD** - copy `agentops.yaml` to
+  `agentops-qa.yaml` / `agentops-prod.yaml` and tighten the
+  `thresholds:` block. Update the `inputs.config` default in the
   matching workflow file.
 - **Scheduled runs** — add a `schedule:` entry in `agentops-pr.yml` (or
   a new file) to evaluate against `main` nightly.
-- **Matrix per scenario** — if you have multiple `runs/*.yaml`, extend
+- **Matrix per scenario** - if you have multiple AgentOps config files, extend
   the eval job with `strategy.matrix.config:` and reference
   `${{ matrix.config }}` in the eval step.
-- **Regression baseline** — wire deploy templates to download the
+- **Regression baseline** - wire deploy templates to download the
   previous run's `results.json` artifact and call
-  `agentops eval compare` between the two.
+  `agentops eval run --baseline <results.json>`.
 
 ## Migration from the older 3-template layout
 

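The "Tighten thresholds for QA / PROD" tip in `docs/ci-github-actions.md` can be illustrated with a small sketch. It assumes threshold expressions take the `">=3"` / `"<=2"` form shown in the `thresholds:` example in `docs/concepts.md`. The helper names `check_threshold` and `failing_metrics` are hypothetical, and this is not how AgentOps itself evaluates thresholds.

```python
import operator
import re

# Assumed operator set; the docs only show ">=" and "<=".
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}
_PATTERN = re.compile(r"^\s*(>=|<=|==|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def check_threshold(expression: str, observed: float) -> bool:
    """Return True when `observed` satisfies an expression such as ">=3"."""
    match = _PATTERN.match(expression)
    if match is None:
        raise ValueError(f"unsupported threshold expression: {expression!r}")
    symbol, limit = match.groups()
    return _OPS[symbol](observed, float(limit))


def failing_metrics(thresholds: dict, metrics: dict) -> list:
    """List metric names that are missing or violate their threshold."""
    return sorted(
        name
        for name, expression in thresholds.items()
        if name not in metrics or not check_threshold(expression, metrics[name])
    )


# Hypothetical base and QA threshold blocks, plus one set of observed scores.
base = {"coherence": ">=3", "avg_latency_seconds": "<=2"}
qa = {"coherence": ">=4", "avg_latency_seconds": "<=1.5"}
observed = {"coherence": 3.6, "avg_latency_seconds": 1.2}

print(failing_metrics(base, observed))
print(failing_metrics(qa, observed))
```

With these illustrative numbers the same run passes the base thresholds but fails the tighter QA `coherence` gate, which is the effect the per-environment config copies are meant to produce.
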
diff --git a/docs/concepts.md b/docs/concepts.md
@@ -1,32 +1,31 @@
 # Concepts
 
-This page explains the core building blocks of AgentOps and how they fit together. For the full schema reference and architecture details, see [how-it-works.md](how-it-works.md).
+This page explains the core AgentOps building blocks. For the full schema
+reference and architecture details, see [how-it-works.md](how-it-works.md).
 
 ## How an Evaluation Works
 
 ```mermaid
 flowchart TD
-    run["run.yaml<br/><i>what, where, how to eval</i>"]
-    bundle["Bundle<br/><i>evaluators + thresholds</i>"]
+    config["agentops.yaml<br/><i>target, dataset, thresholds</i>"]
     dataset["Dataset<br/><i>JSONL rows: input, expected</i>"]
-    runner(["Runner<br/><i>resolves backend</i>"])
+    runner(["Runner<br/><i>resolves target kind</i>"])
     foundry["Foundry<br/>Backend"]
     http["HTTP<br/>Backend"]
-    local["Local<br/>Adapter"]
+    model["Model-direct<br/>Backend"]
     evals(["Evaluators<br/><i>score each response</i>"])
     results[/"results.json<br/>(machine)"/]
     report[/"report.md<br/>(human)"/]
 
-    run --> bundle
-    run --> dataset
-    bundle --> runner
+    config --> dataset
+    config --> runner
     dataset --> runner
     runner --> foundry
     runner --> http
-    runner --> local
+    runner --> model
     foundry --> evals
     http --> evals
-    local --> evals
+    model --> evals
     evals --> results
     evals --> report
 ```
@@ -37,122 +36,91 @@ flowchart TD
 
 ### Workspace
 
-The `.agentops/` directory inside your project root. Created by `agentops init`, it holds all evaluation configuration: run configs, bundles, datasets, data files, and results.
+`agentops init` creates the workspace. The evaluation config lives in
+the flat `agentops.yaml` file at the project root; `.agentops/` stores
+seed data, run history, and optional supporting files.
 
-```
+```text
+agentops.yaml          # flat config: agent, dataset, thresholds
 .agentops/
-├── config.yaml          # workspace defaults
-├── run.yaml             # default run config
-├── bundles/             # evaluation policies
-├── datasets/            # dataset definitions (YAML)
-├── data/                # dataset rows (JSONL)
-└── results/             # run outputs + latest/ pointer
+├── data/              # dataset rows (JSONL)
+└── results/           # run outputs + latest/ pointer
 ```
 
-### Run Config
-
-A YAML file (typically `run.yaml`) that connects **what** to evaluate, **how** to reach it, and **which evaluators** to apply. It references one bundle and one dataset.
-
-A run config has three key dimensions:
+### AgentOps Config
 
-| Dimension | Values | Purpose |
-|---|---|---|
-| `target.type` | `agent`, `model` | What is being evaluated |
-| `target.execution_mode` | `local`, `remote` | How AgentOps reaches the target |
-| `target.endpoint.kind` | `foundry_agent`, `http` | Remote endpoint type (when remote) |
+A YAML file named `agentops.yaml` that connects **what** to evaluate,
+**which dataset** to use, and **which thresholds** gate the run.
 
-Minimal example:
+A minimal config:
 
 ```yaml
 version: 1
-target:
-  type: agent
-  hosting: foundry
-  execution_mode: remote
-  endpoint:
-    kind: foundry_agent
-    agent_id: my-agent:1
-    model: gpt-4o
-    project_endpoint_env: AZURE_AI_FOUNDRY_PROJECT_ENDPOINT
-bundle:
-  name: rag_quality_baseline
-dataset:
-  name: smoke-rag
+agent: "my-agent:1"
+dataset: .agentops/data/smoke.jsonl
 ```
 
-See [how-it-works.md](how-it-works.md) for the full schema, all fields, and validation rules.
-
-### Bundle
+Common `agent:` values:
 
-A YAML file that defines **which evaluators** to run and **what thresholds** to enforce. Bundles are reusable — the same bundle can evaluate different targets across environments.
+| Agent value | Target kind |
+|---|---|
+| `"support-bot:1"` | Foundry prompt agent (`name:version`) |
+| `"https://api.example.com/chat"` | HTTP/JSON agent |
+| `"model:gpt-4o-mini"` | Direct model deployment |
 
-Each bundle contains:
-- A list of evaluators (AI-assisted or local metrics)
-- Threshold rules that determine pass/fail
-
-```yaml
-# .agentops/bundles/model_quality_baseline.yaml
-evaluators:
-  - name: SimilarityEvaluator
-    source: foundry
-    enabled: true
-thresholds:
-  - metric: SimilarityEvaluator
-    operator: ">="
-    value: 3.0
-```
-
-See [bundles.md](bundles.md) for the full bundle authoring guide.
+HTTP targets can add top-level mapping fields such as `request_field`,
+`response_field`, `tool_calls_field`, `auth_header_env`, and
+`extra_fields`.
 
 ### Dataset
 
-A YAML config that points to a JSONL file containing evaluation rows. Each row has an `input` (the prompt) and an `expected` (the reference answer). Some scenarios add extra fields like `context` (RAG) or `tool_calls` (agent workflows).
-
-```yaml
-# .agentops/datasets/smoke-model-direct.yaml
-source:
-  type: file
-  path: ../data/smoke-model-direct.jsonl
-format:
-  type: jsonl
-  input_field: input
-  expected_field: expected
-```
+A JSONL file containing evaluation rows. Each row has an `input` prompt
+and usually an `expected` reference answer. Some scenarios add extra
+fields like `context` (RAG), `tool_definitions`, or `tool_calls` (agent
+workflows).
 
 ```json
 {"id": "1", "input": "What is Python?", "expected": "Python is a programming language."}
 ```
 
 ### Evaluator
 
-A scoring function that measures one aspect of the target's response. Evaluators can be:
+A scoring function that measures one aspect of the target response.
+Evaluators can be:
 
-- **AI-assisted** (Foundry) — use a judge model to score responses on criteria like coherence, fluency, or groundedness (1-5 scale)
-- **Local metrics** — computed without a model, such as `F1ScoreEvaluator` or `avg_latency_seconds`
+- **AI-assisted** (Foundry) — use a judge model to score responses on
+  criteria like coherence, fluency, similarity, or groundedness.
+- **Local metrics** — computed without a judge model, such as
+  `F1ScoreEvaluator` or `avg_latency_seconds`.
 
-Evaluators are configured inside bundles. See [foundry-evaluation-sdk-built-in-evaluators.md](foundry-evaluation-sdk-built-in-evaluators.md) for the complete evaluator reference.
+AgentOps auto-selects evaluators from the target kind and dataset shape.
+Use `evaluators:` in `agentops.yaml` only when you need to override that
+selection. See
+[foundry-evaluation-sdk-built-in-evaluators.md](foundry-evaluation-sdk-built-in-evaluators.md)
+for the complete evaluator reference.
 
-### Backend
+### Target resolver
 
-The execution engine that sends dataset rows to the target and collects responses. The runner automatically selects the backend based on the run config:
+The execution engine sends dataset rows to the target and collects
+responses. AgentOps automatically selects the target kind from `agent:`.
 
-| Execution Mode | Endpoint Kind | Backend | Use case |
-|---|---|---|---|
-| `remote` | `foundry_agent` | Foundry Backend | Foundry agents and models |
-| `remote` | `http` | HTTP Backend | LangGraph, LangChain, ACA, custom REST |
-| `local` | — | Local Adapter | In-process Python functions or subprocess |
+| `agent:` shape | Target kind | Use case |
+|---|---|---|
+| `name:version` | Foundry prompt agent | Foundry Agent Service agents |
+| `https://...` | HTTP/JSON endpoint | LangGraph, Agent Framework, ACA, AKS, custom REST |
+| `model:<deployment>` | Model-direct | Raw model deployment checks |
 
 ## Evaluation Scenarios
 
-AgentOps ships starter bundles for common evaluation patterns. Each bundle pairs specific evaluators with default thresholds:
+AgentOps auto-selects common evaluation patterns from the dataset:
 
-| Scenario | Bundle | Key Evaluators | When to use |
+| Scenario | Dataset signal | Key evaluators | When to use |
 |---|---|---|---|
-| **Model Quality** | `model_quality_baseline` | Similarity, Coherence, Fluency, F1Score | Direct model deployment checks |
-| **RAG** | `rag_quality_baseline` | Groundedness, Relevance, Retrieval, ResponseCompleteness | RAG pipelines with context retrieval |
-| **Conversational** | `conversational_agent_baseline` | Coherence, Fluency, Relevance, Similarity | Chatbots, Q&A assistants |
-| **Agent Workflow** | `agent_workflow_baseline` | TaskCompletion, ToolCallAccuracy, IntentResolution, ToolSelection | Agents with tool calling |
-| **Content Safety** | `safe_agent_baseline` | Violence, Sexual, SelfHarm, HateUnfairness, ProtectedMaterial | Responsible AI checks |
+| **Model Quality** | `input`, `expected` on `model:<deployment>` | Similarity, Coherence, Fluency, F1Score | Direct model deployment checks |
+| **RAG** | `context` | Groundedness, Relevance, Retrieval, ResponseCompleteness | RAG pipelines with context retrieval |
+| **Conversational** | `input`, `expected` on an agent | Coherence, Fluency, Similarity/F1 where applicable | Chatbots, Q&A assistants |
+| **Agent Workflow** | `tool_calls`, `tool_definitions` | ToolCallAccuracy, IntentResolution, TaskAdherence | Agents with tool calling |
+| **Content Safety** | Explicit safety evaluators | Violence, Sexual, SelfHarm, HateUnfairness, ProtectedMaterial | Responsible AI checks |
 
 Each scenario has a dedicated tutorial:
 
@@ -165,16 +133,21 @@ Each scenario has a dedicated tutorial:
 
 ## Configuration Model
 
-Run configs use an orthogonal target model. The three key dimensions — `type`, `execution_mode`, and `endpoint.kind` — are independent. Additional optional fields:
+`agentops.yaml` is the single source of truth. Keep it small and add only
+the fields your target needs:
 
-| Field | Values | When to use |
-|---|---|---|
-| `target.hosting` | `local`, `foundry`, `aks`, `containerapps` | Metadata: where the target runs |
-| `target.framework` | `agent_framework`, `langgraph`, `custom` | Agent targets only |
-| `target.agent_mode` | `prompt`, `hosted` | Foundry agents only |
+```yaml
+version: 1
+agent: "https://api.example.com/chat"
+dataset: .agentops/data/support.jsonl
+
+request_field: message
+response_field: text
 
-**Bundle and dataset references** support two resolution modes:
-- `name` — convention-based: resolves to `.agentops/bundles/<name>.yaml` or `.agentops/datasets/<name>.yaml`
-- `path` — explicit relative path to the YAML file
+thresholds:
+  coherence: ">=3"
+  avg_latency_seconds: "<=2"
+```
 
-See [how-it-works.md](how-it-works.md) for the full schema, all endpoint fields, validation rules, and more configuration examples.
+See [how-it-works.md](how-it-works.md) for the full schema, endpoint
+fields, validation rules, and more examples.
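
The "Target resolver" table in `docs/concepts.md` maps the shape of the `agent:` value to a target kind. A minimal sketch of that mapping follows, under stated assumptions: the function name `classify_agent` and the returned labels are hypothetical, plain `http://` URLs are assumed to be treated like `https://`, and the real resolver may apply different or stricter rules.

```python
def classify_agent(agent: str) -> str:
    """Map an `agent:` value to a target kind, following the docs table."""
    value = agent.strip()

    # URLs are checked first because they also contain a colon.
    if value.lower().startswith(("http://", "https://")):
        return "http"

    # `model:<deployment>` must be checked before the generic name:version rule,
    # since "model:gpt-4o-mini" would otherwise look like name:version.
    if value.startswith("model:") and len(value) > len("model:"):
        return "model-direct"

    # Anything else of the form name:version is treated as a Foundry prompt agent.
    name, separator, version = value.rpartition(":")
    if separator and name and version:
        return "foundry-prompt-agent"

    raise ValueError(f"cannot infer a target kind from agent value: {agent!r}")


for example in ("support-bot:1", "https://api.example.com/chat", "model:gpt-4o-mini"):
    print(example, "->", classify_agent(example))
```

The order of the checks matters here: all three shapes contain a colon, so the most specific prefixes have to be tested before the generic `name:version` rule.
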
diff --git a/docs/how-it-works.md b/docs/how-it-works.md
@@ -46,7 +46,7 @@ src/
     │   ├── invocations.py     # Per-row agent / model invocation strategies
     │   ├── thresholds.py      # Threshold pass/fail evaluation
     │   ├── reporter.py        # Markdown report generation
-    │   ├── comparison.py      # `eval compare` two runs
+    │   ├── comparison.py      # Baseline delta rendering for `eval run --baseline`
     │   ├── publisher.py       # Classic Foundry publish (OneDP upload of metrics)
     │   └── cloud_publisher.py # New Foundry publish (server-side via OpenAI Evals API)
     │
@@ -108,7 +108,7 @@ When you run `agentops eval run`, the following happens step by step:
 |---|---|---|
 | `agentops init [--path DIR]` | Scaffold `.agentops/` workspace with starter config, bundles, datasets, and data. Also installs coding agent skills. | Available |
 | `agentops eval run` | Execute an evaluation (main command) | Available |
-| `agentops eval compare --runs ID1,ID2` | Compare two past evaluation runs | Available |
+| `agentops eval run --baseline <results.json>` | Run an eval and add a comparison against a previous result | Available |
 | `agentops skills install` | Install AgentOps coding agent skills (Copilot, Claude) into the target project | Available |
 | `agentops run list\|show` | List or inspect past runs | Planned (stub) |
 | `agentops run view <id> [--entry N]` | Deep-inspect a run | Planned (stub) |

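The `--baseline` flow replaces `eval compare`, and the README describes the resulting report section as carrying per-metric deltas. The layout of `results.json` is not shown in this diff, so the sketch below takes two plain metric mappings instead of reading that file. `metric_deltas` is a hypothetical helper, not the code in `comparison.py`.

```python
def metric_deltas(baseline: dict, current: dict) -> dict:
    """Return current minus baseline for every metric present in both runs."""
    shared = sorted(set(baseline) & set(current))
    # Rounding keeps floating-point noise out of the rendered deltas.
    return {name: round(current[name] - baseline[name], 6) for name in shared}


# Illustrative numbers only; real values would come from two evaluation runs.
baseline_metrics = {"coherence": 3.5, "avg_latency_seconds": 1.8}
current_metrics = {"coherence": 4.0, "avg_latency_seconds": 1.5, "fluency": 4.2}

for name, delta in metric_deltas(baseline_metrics, current_metrics).items():
    print(f"{name}: {delta:+}")
```

Metrics that exist in only one of the two runs (here `fluency`) are skipped in this sketch; whether the real report lists them is not stated in the diff.
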
diff --git a/docs/tutorial-agent-workflow.md b/docs/tutorial-agent-workflow.md
@@ -17,8 +17,8 @@ both of these row fields:
 When AgentOps sees `tool_calls` (or `tool_definitions`) in the
 dataset rows, it auto-selects the **agent workflow** evaluators:
 TaskCompletion, ToolCallAccuracy, IntentResolution, TaskAdherence,
-plus the conversational baseline (Coherence, Fluency, Similarity,
-F1Score, latency).
+plus the conversational baseline metrics that apply to the target
+(Coherence, Fluency, latency, and any explicitly configured text metric).
 
 ## 1. Bootstrap
 
@@ -44,11 +44,11 @@ body:
 ```yaml
 version: 1
 agent: "https://aca-weather-bot.example.com/"
-http:
-  request_field: message
-  response_field: text
-  tool_calls_field: tool_calls
 dataset: .agentops/data/tools.jsonl
+
+request_field: message
+response_field: text
+tool_calls_field: tool_calls
 ```
 
 `tool_calls_field` tells AgentOps where in the response JSON to find
@@ -61,9 +61,9 @@ the structured tool calls (dot-path notation supported).
 {"id":"2","input":"How is the weather in Tokyo, Japan?","expected":"Calls get_weather with location='Tokyo, Japan'.","tool_calls":[{"type":"function_call","name":"get_weather","arguments":{"location":"Tokyo, Japan"}}]}
 ```
 
-You can additionally include `tool_definitions` to give the evaluator
-the schema of every tool the agent should know about. This sharpens
-the **ToolSelectionEvaluator** judgement.
+Include `tool_definitions` when you evaluate tool-call accuracy. The
+evaluator needs the schema of every tool the agent should know about;
+repeat the catalogue on each JSONL row so every row is self-contained.
 
 ## 4. Run
 

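`docs/tutorial-agent-workflow.md` notes that `tool_calls_field` supports dot-path notation, and the conversational tutorial uses the same style in `response_field: choices.0.message.content`. The sketch below shows one way such a path could be resolved against a response body. `resolve_dot_path` is a hypothetical helper, and treating numeric segments as list indexes is an assumption based on the `choices.0` example.

```python
from typing import Any


def resolve_dot_path(payload: Any, path: str) -> Any:
    """Walk `payload` one dot-separated segment at a time."""
    current = payload
    for part in path.split("."):
        if isinstance(current, list):
            # Assumed: a segment applied to a list is a zero-based index.
            current = current[int(part)]
        else:
            current = current[part]
    return current


# Hypothetical response bodies shaped like the tutorial configs expect.
chat_response = {"choices": [{"message": {"content": "Sunny, 22 C"}}]}
tool_response = {
    "text": "Checking the weather.",
    "tool_calls": [
        {
            "type": "function_call",
            "name": "get_weather",
            "arguments": {"location": "Tokyo, Japan"},
        }
    ],
}

print(resolve_dot_path(chat_response, "choices.0.message.content"))
print(resolve_dot_path(tool_response, "tool_calls")[0]["name"])
```

A missing key or index raises `KeyError` or `IndexError` in this sketch; how AgentOps reports an unresolved path is not covered by the diff.
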
diff --git a/docs/tutorial-basic-foundry-agent.md b/docs/tutorial-basic-foundry-agent.md
@@ -219,4 +219,4 @@ The RAG scenario uses GroundednessEvaluator instead of SimilarityEvaluator becau
 - [Model-Direct Tutorial](tutorial-model-direct.md) — evaluate a model without agents
 - [RAG Tutorial](tutorial-rag.md) — evaluate retrieval-augmented responses
 - [Baseline Comparison Tutorial](tutorial-baseline-comparison.md) — compare runs and detect regressions
-- [Copilot Skills Tutorial](tutorial-copilot-skills.md) — install skills for AI-assisted guidance
+- [Copilot Skills Tutorial](tutorial-copilot-skills.md) — use the installed AgentOps skills to build an eval workflow with Copilot
diff --git a/docs/tutorial-conversational-agent.md b/docs/tutorial-conversational-agent.md
@@ -48,10 +48,10 @@ different field names, override them:
 ```yaml
 version: 1
 agent: "https://api.example.com/chat"
-http:
-  request_field: prompt
-  response_field: choices.0.message.content
 dataset: .agentops/data/chat.jsonl
+
+request_field: prompt
+response_field: choices.0.message.content
 ```
 
 ## 3. Dataset shape (`chat.jsonl`)
@@ -67,8 +67,8 @@ auto-selects the **conversational baseline** evaluators: Coherence,
 Fluency, Similarity, F1Score, average latency.
 
 > Want to test multi-turn behaviour explicitly? Have your service
-> accept a `history` field, then add `extra_fields: [history]` under
-> `http:` and include a `history` array in each JSONL row.
+> accept a `history` field, then add `extra_fields: [history]` to
+> `agentops.yaml` and include a `history` array in each JSONL row.
 
 ## 4. Run