diff --git a/README.md b/README.md
index 11459a57..821b0478 100644
--- a/README.md
+++ b/README.md
@@ -117,6 +117,7 @@ The report grows a `Comparison vs Baseline` section with per-metric deltas.
 
 - [Quickstart tutorial](https://github.com/Azure/agentops/blob/main/docs/tutorial-quickstart.md) — bootstrap a workspace and run one evaluation.
 - [End-to-end tutorial](https://github.com/Azure/agentops/blob/main/docs/tutorial-end-to-end.md) — full do-it-yourself tour: Foundry hosted agent, baseline comparison, GitFlow CI/CD, watchdog.
+- [Copilot skills tutorial](https://github.com/Azure/agentops/blob/main/docs/tutorial-copilot-skills.md) — use AgentOps skills to have Copilot configure, run, explain, and wire evals into CI.
 - Per-scenario tutorials:
   - [Foundry hosted agent](https://github.com/Azure/agentops/blob/main/docs/tutorial-basic-foundry-agent.md)
   - [Model-direct](https://github.com/Azure/agentops/blob/main/docs/tutorial-model-direct.md)
diff --git a/docs/ci-github-actions.md b/docs/ci-github-actions.md
index f454ab0b..88dc91a9 100644
--- a/docs/ci-github-actions.md
+++ b/docs/ci-github-actions.md
@@ -220,18 +220,18 @@ agentops workflow generate --dir <path>        # different repo root
 
 ## Customisation tips
 
-- **Tighten thresholds for QA / PROD** — copy `.agentops/run.yaml` to
-  `.agentops/run-qa.yaml` / `.agentops/run-prod.yaml` and tighten
-  thresholds in the bundle. Update the `inputs.config` default in the
+- **Tighten thresholds for QA / PROD** - copy `agentops.yaml` to
+  `agentops-qa.yaml` / `agentops-prod.yaml` and tighten the
+  `thresholds:` block. Update the `inputs.config` default in the
   matching workflow file.
 - **Scheduled runs** — add a `schedule:` entry in `agentops-pr.yml` (or
   a new file) to evaluate against `main` nightly.
-- **Matrix per scenario** — if you have multiple `runs/*.yaml`, extend
+- **Matrix per scenario** - if you have multiple AgentOps config files, extend
   the eval job with `strategy.matrix.config:` and reference
   `${{ matrix.config }}` in the eval step.
-- **Regression baseline** — wire deploy templates to download the
+- **Regression baseline** - wire deploy templates to download the
   previous run's `results.json` artifact and call
-  `agentops eval compare` between the two.
+  `agentops eval run --baseline <results.json>`.
 
 ## Migration from the older 3-template layout
 
diff --git a/docs/concepts.md b/docs/concepts.md
index ffe692f4..8598fefa 100644
--- a/docs/concepts.md
+++ b/docs/concepts.md
@@ -1,32 +1,31 @@
 # Concepts
 
-This page explains the core building blocks of AgentOps and how they fit together. For the full schema reference and architecture details, see [how-it-works.md](how-it-works.md).
+This page explains the core AgentOps building blocks. For the full schema
+reference and architecture details, see [how-it-works.md](how-it-works.md).
 
 ## How an Evaluation Works
 
 ```mermaid
 flowchart TD
-    run["run.yaml<br/><i>what, where, how to eval</i>"]
-    bundle["Bundle<br/><i>evaluators + thresholds</i>"]
+    config["agentops.yaml<br/><i>target, dataset, thresholds</i>"]
     dataset["Dataset<br/><i>JSONL rows: input, expected</i>"]
-    runner(["Runner<br/><i>resolves backend</i>"])
+    runner(["Runner<br/><i>resolves target kind</i>"])
     foundry["Foundry<br/>Backend"]
     http["HTTP<br/>Backend"]
-    local["Local<br/>Adapter"]
+    model["Model-direct<br/>Backend"]
     evals(["Evaluators<br/><i>score each response</i>"])
     results[/"results.json<br/>(machine)"/]
     report[/"report.md<br/>(human)"/]
 
-    run --> bundle
-    run --> dataset
-    bundle --> runner
+    config --> dataset
+    config --> runner
     dataset --> runner
     runner --> foundry
     runner --> http
-    runner --> local
+    runner --> model
     foundry --> evals
     http --> evals
-    local --> evals
+    model --> evals
     evals --> results
     evals --> report
 ```
@@ -37,87 +36,48 @@ flowchart TD
 
 ### Workspace
 
-The `.agentops/` directory inside your project root. Created by `agentops init`, it holds all evaluation configuration: run configs, bundles, datasets, data files, and results.
+`agentops init` creates the workspace. The evaluation config lives in the
+flat `agentops.yaml` file at the project root; `.agentops/` stores seed
+data, run history, and optional supporting files.
 
-```
+```text
+agentops.yaml          # flat config: agent, dataset, thresholds
 .agentops/
-├── config.yaml          # workspace defaults
-├── run.yaml             # default run config
-├── bundles/             # evaluation policies
-├── datasets/            # dataset definitions (YAML)
-├── data/                # dataset rows (JSONL)
-└── results/             # run outputs + latest/ pointer
+├── data/              # dataset rows (JSONL)
+└── results/           # run outputs + latest/ pointer
 ```
 
-### Run Config
-
-A YAML file (typically `run.yaml`) that connects **what** to evaluate, **how** to reach it, and **which evaluators** to apply. It references one bundle and one dataset.
-
-A run config has three key dimensions:
+### AgentOps Config
 
-| Dimension | Values | Purpose |
-|---|---|---|
-| `target.type` | `agent`, `model` | What is being evaluated |
-| `target.execution_mode` | `local`, `remote` | How AgentOps reaches the target |
-| `target.endpoint.kind` | `foundry_agent`, `http` | Remote endpoint type (when remote) |
+A YAML file named `agentops.yaml` that connects **what** to evaluate,
+**which dataset** to use, and **which thresholds** gate the run.
 
-Minimal example:
+The minimum is:
 
 ```yaml
 version: 1
-target:
-  type: agent
-  hosting: foundry
-  execution_mode: remote
-  endpoint:
-    kind: foundry_agent
-    agent_id: my-agent:1
-    model: gpt-4o
-    project_endpoint_env: AZURE_AI_FOUNDRY_PROJECT_ENDPOINT
-bundle:
-  name: rag_quality_baseline
-dataset:
-  name: smoke-rag
+agent: "my-agent:1"
+dataset: .agentops/data/smoke.jsonl
 ```
 
-See [how-it-works.md](how-it-works.md) for the full schema, all fields, and validation rules.
-
-### Bundle
+Common `agent:` values:
 
-A YAML file that defines **which evaluators** to run and **what thresholds** to enforce. Bundles are reusable — the same bundle can evaluate different targets across environments.
+| Agent value | Target kind |
+|---|---|
+| `"support-bot:1"` | Foundry prompt agent (`name:version`) |
+| `"https://api.example.com/chat"` | HTTP/JSON agent |
+| `"model:gpt-4o-mini"` | Direct model deployment |
 
-Each bundle contains:
-- A list of evaluators (AI-assisted or local metrics)
-- Threshold rules that determine pass/fail
-
-```yaml
-# .agentops/bundles/model_quality_baseline.yaml
-evaluators:
-  - name: SimilarityEvaluator
-    source: foundry
-    enabled: true
-thresholds:
-  - metric: SimilarityEvaluator
-    operator: ">="
-    value: 3.0
-```
-
-See [bundles.md](bundles.md) for the full bundle authoring guide.
+HTTP targets can add top-level mapping fields such as `request_field`,
+`response_field`, `tool_calls_field`, `auth_header_env`, and
+`extra_fields`.
 
 ### Dataset
 
-A YAML config that points to a JSONL file containing evaluation rows. Each row has an `input` (the prompt) and an `expected` (the reference answer). Some scenarios add extra fields like `context` (RAG) or `tool_calls` (agent workflows).
-
-```yaml
-# .agentops/datasets/smoke-model-direct.yaml
-source:
-  type: file
-  path: ../data/smoke-model-direct.jsonl
-format:
-  type: jsonl
-  input_field: input
-  expected_field: expected
-```
+A JSONL file containing evaluation rows. Each row has an `input` prompt
+and usually an `expected` reference answer. Some scenarios add extra
+fields like `context` (RAG), `tool_definitions`, or `tool_calls` (agent
+workflows).
 
 ```json
 {"id": "1", "input": "What is Python?", "expected": "Python is a programming language."}
@@ -125,34 +85,42 @@ format:
 
 ### Evaluator
 
-A scoring function that measures one aspect of the target's response. Evaluators can be:
+A scoring function that measures one aspect of the target response.
+Evaluators can be:
 
-- **AI-assisted** (Foundry) — use a judge model to score responses on criteria like coherence, fluency, or groundedness (1-5 scale)
-- **Local metrics** — computed without a model, such as `F1ScoreEvaluator` or `avg_latency_seconds`
+- **AI-assisted** (Foundry) — use a judge model to score responses on
+  criteria like coherence, fluency, similarity, or groundedness.
+- **Local metrics** — computed without a judge model, such as
+  `F1ScoreEvaluator` or `avg_latency_seconds`.
 
-Evaluators are configured inside bundles. See [foundry-evaluation-sdk-built-in-evaluators.md](foundry-evaluation-sdk-built-in-evaluators.md) for the complete evaluator reference.
+AgentOps auto-selects evaluators from the target kind and dataset shape.
+Use `evaluators:` in `agentops.yaml` only when you need to override that
+selection. See
+[foundry-evaluation-sdk-built-in-evaluators.md](foundry-evaluation-sdk-built-in-evaluators.md)
+for the complete evaluator reference.
 
-### Backend
+### Target resolver
 
-The execution engine that sends dataset rows to the target and collects responses. The runner automatically selects the backend based on the run config:
+The execution engine sends dataset rows to the target and collects
+responses. AgentOps automatically selects the target kind from `agent:`.
 
-| Execution Mode | Endpoint Kind | Backend | Use case |
-|---|---|---|---|
-| `remote` | `foundry_agent` | Foundry Backend | Foundry agents and models |
-| `remote` | `http` | HTTP Backend | LangGraph, LangChain, ACA, custom REST |
-| `local` | — | Local Adapter | In-process Python functions or subprocess |
+| `agent:` shape | Target kind | Use case |
+|---|---|---|
+| `name:version` | Foundry prompt agent | Foundry Agent Service agents |
+| `https://...` | HTTP/JSON endpoint | LangGraph, Agent Framework, ACA, AKS, custom REST |
+| `model:<deployment>` | Model-direct | Raw model deployment checks |
 
 ## Evaluation Scenarios
 
-AgentOps ships starter bundles for common evaluation patterns. Each bundle pairs specific evaluators with default thresholds:
+AgentOps auto-selects common evaluation patterns from the target kind and dataset shape:
 
-| Scenario | Bundle | Key Evaluators | When to use |
+| Scenario | Dataset signal | Key evaluators | When to use |
 |---|---|---|---|
-| **Model Quality** | `model_quality_baseline` | Similarity, Coherence, Fluency, F1Score | Direct model deployment checks |
-| **RAG** | `rag_quality_baseline` | Groundedness, Relevance, Retrieval, ResponseCompleteness | RAG pipelines with context retrieval |
-| **Conversational** | `conversational_agent_baseline` | Coherence, Fluency, Relevance, Similarity | Chatbots, Q&A assistants |
-| **Agent Workflow** | `agent_workflow_baseline` | TaskCompletion, ToolCallAccuracy, IntentResolution, ToolSelection | Agents with tool calling |
-| **Content Safety** | `safe_agent_baseline` | Violence, Sexual, SelfHarm, HateUnfairness, ProtectedMaterial | Responsible AI checks |
+| **Model Quality** | `input`, `expected` on `model:<deployment>` | Similarity, Coherence, Fluency, F1Score | Direct model deployment checks |
+| **RAG** | `context` | Groundedness, Relevance, Retrieval, ResponseCompleteness | RAG pipelines with context retrieval |
+| **Conversational** | `input`, `expected` on an agent | Coherence, Fluency, Similarity/F1 where applicable | Chatbots, Q&A assistants |
+| **Agent Workflow** | `tool_calls`, `tool_definitions` | ToolCallAccuracy, IntentResolution, TaskAdherence | Agents with tool calling |
+| **Content Safety** | Explicit safety evaluators | Violence, Sexual, SelfHarm, HateUnfairness, ProtectedMaterial | Responsible AI checks |
 
 Each scenario has a dedicated tutorial:
 
@@ -165,16 +133,21 @@ Each scenario has a dedicated tutorial:
 
 ## Configuration Model
 
-Run configs use an orthogonal target model. The three key dimensions — `type`, `execution_mode`, and `endpoint.kind` — are independent. Additional optional fields:
+`agentops.yaml` is the single source of truth. Keep it small and add only
+the fields your target needs:
 
-| Field | Values | When to use |
-|---|---|---|
-| `target.hosting` | `local`, `foundry`, `aks`, `containerapps` | Metadata: where the target runs |
-| `target.framework` | `agent_framework`, `langgraph`, `custom` | Agent targets only |
-| `target.agent_mode` | `prompt`, `hosted` | Foundry agents only |
+```yaml
+version: 1
+agent: "https://api.example.com/chat"
+dataset: .agentops/data/support.jsonl
+
+request_field: message
+response_field: text
 
-**Bundle and dataset references** support two resolution modes:
-- `name` — convention-based: resolves to `.agentops/bundles/<name>.yaml` or `.agentops/datasets/<name>.yaml`
-- `path` — explicit relative path to the YAML file
+thresholds:
+  coherence: ">=3"
+  avg_latency_seconds: "<=2"
+```
 
-See [how-it-works.md](how-it-works.md) for the full schema, all endpoint fields, validation rules, and more configuration examples.
+See [how-it-works.md](how-it-works.md) for the full schema, endpoint
+fields, validation rules, and more examples.
diff --git a/docs/how-it-works.md b/docs/how-it-works.md
index 4f2a5624..d0684473 100644
--- a/docs/how-it-works.md
+++ b/docs/how-it-works.md
@@ -46,7 +46,7 @@ src/
     │   ├── invocations.py     # Per-row agent / model invocation strategies
     │   ├── thresholds.py      # Threshold pass/fail evaluation
     │   ├── reporter.py        # Markdown report generation
-    │   ├── comparison.py      # `eval compare` two runs
+    │   ├── comparison.py      # Baseline delta rendering for `eval run --baseline`
     │   ├── publisher.py       # Classic Foundry publish (OneDP upload of metrics)
     │   └── cloud_publisher.py # New Foundry publish (server-side via OpenAI Evals API)
     │
@@ -108,7 +108,7 @@ When you run `agentops eval run`, the following happens step by step:
 |---|---|---|
 | `agentops init [--path DIR]` | Scaffold `.agentops/` workspace with starter config, bundles, datasets, and data. Also installs coding agent skills. | Available |
 | `agentops eval run` | Execute an evaluation (main command) | Available |
-| `agentops eval compare --runs ID1,ID2` | Compare two past evaluation runs | Available |
+| `agentops eval run --baseline <results.json>` | Run an eval and add a comparison against a previous result | Available |
 | `agentops skills install` | Install AgentOps coding agent skills (Copilot, Claude) into the target project | Available |
 | `agentops run list\|show` | List or inspect past runs | Planned (stub) |
 | `agentops run view <id> [--entry N]` | Deep-inspect a run | Planned (stub) |
diff --git a/docs/tutorial-agent-workflow.md b/docs/tutorial-agent-workflow.md
index bb1b0d96..3cf86d6d 100644
--- a/docs/tutorial-agent-workflow.md
+++ b/docs/tutorial-agent-workflow.md
@@ -17,8 +17,8 @@ both of these row fields:
 When AgentOps sees `tool_calls` (or `tool_definitions`) in the
 dataset rows, it auto-selects the **agent workflow** evaluators:
 TaskCompletion, ToolCallAccuracy, IntentResolution, TaskAdherence,
-plus the conversational baseline (Coherence, Fluency, Similarity,
-F1Score, latency).
+plus the conversational baseline metrics that apply to the target
+(Coherence, Fluency, latency, and any explicitly configured text metric).
 
 ## 1. Bootstrap
 
@@ -44,11 +44,11 @@ body:
 ```yaml
 version: 1
 agent: "https://aca-weather-bot.example.com/"
-http:
-  request_field: message
-  response_field: text
-  tool_calls_field: tool_calls
 dataset: .agentops/data/tools.jsonl
+
+request_field: message
+response_field: text
+tool_calls_field: tool_calls
 ```
 
 `tool_calls_field` tells AgentOps where in the response JSON to find
@@ -61,9 +61,9 @@ the structured tool calls (dot-path notation supported).
 {"id":"2","input":"How is the weather in Tokyo, Japan?","expected":"Calls get_weather with location='Tokyo, Japan'.","tool_calls":[{"type":"function_call","name":"get_weather","arguments":{"location":"Tokyo, Japan"}}]}
 ```
 
-You can additionally include `tool_definitions` to give the evaluator
-the schema of every tool the agent should know about. This sharpens
-the **ToolSelectionEvaluator** judgement.
+Include `tool_definitions` when you evaluate tool-call accuracy. The
+evaluator needs the schema of every tool the agent should know about;
+repeat the catalogue on each JSONL row so every row is self-contained.
 
 ## 4. Run
 
diff --git a/docs/tutorial-basic-foundry-agent.md b/docs/tutorial-basic-foundry-agent.md
index 93b6ec26..0aab0543 100644
--- a/docs/tutorial-basic-foundry-agent.md
+++ b/docs/tutorial-basic-foundry-agent.md
@@ -219,4 +219,4 @@ The RAG scenario uses GroundednessEvaluator instead of SimilarityEvaluator becau
 - [Model-Direct Tutorial](tutorial-model-direct.md) — evaluate a model without agents
 - [RAG Tutorial](tutorial-rag.md) — evaluate retrieval-augmented responses
 - [Baseline Comparison Tutorial](tutorial-baseline-comparison.md) — compare runs and detect regressions
-- [Copilot Skills Tutorial](tutorial-copilot-skills.md) — install skills for AI-assisted guidance
+- [Copilot Skills Tutorial](tutorial-copilot-skills.md) — use the installed AgentOps skills to build an eval workflow with Copilot
diff --git a/docs/tutorial-conversational-agent.md b/docs/tutorial-conversational-agent.md
index b2080f3e..3f79048e 100644
--- a/docs/tutorial-conversational-agent.md
+++ b/docs/tutorial-conversational-agent.md
@@ -48,10 +48,10 @@ different field names, override them:
 ```yaml
 version: 1
 agent: "https://api.example.com/chat"
-http:
-  request_field: prompt
-  response_field: choices.0.message.content
 dataset: .agentops/data/chat.jsonl
+
+request_field: prompt
+response_field: choices.0.message.content
 ```
 
 ## 3. Dataset shape (`chat.jsonl`)
@@ -67,8 +67,8 @@ auto-selects the **conversational baseline** evaluators: Coherence,
 Fluency, Similarity, F1Score, average latency.
 
 > Want to test multi-turn behaviour explicitly? Have your service
-> accept a `history` field, then add `extra_fields: [history]` under
-> `http:` and include a `history` array in each JSONL row.
+> accept a `history` field, then add `extra_fields: [history]` to
+> `agentops.yaml` and include a `history` array in each JSONL row.
 
 ## 4. Run
 
diff --git a/docs/tutorial-copilot-skills.md b/docs/tutorial-copilot-skills.md
new file mode 100644
index 00000000..7816c44c
--- /dev/null
+++ b/docs/tutorial-copilot-skills.md
@@ -0,0 +1,292 @@
+# Tutorial — Copilot-assisted AgentOps workflow
+
+This tutorial shows how to use the AgentOps coding-agent skills as a
+guided development workflow. Instead of memorizing the AgentOps schema,
+you let Copilot inspect the project, generate the config and dataset, run
+the eval, explain the report, and create the CI/CD workflow.
+
+You can still follow the tutorial without guessing: each Copilot
+prompt is followed by the concrete file or command you should expect.
+
+## What you will build
+
+- A small HTTP support agent that answers three customer-service
+  questions.
+- Installed AgentOps skills under `.github/skills/`.
+- A flat `agentops.yaml` generated from project context.
+- A JSONL dataset generated for the agent's behavior.
+- One passing local evaluation and a readable `report.md`.
+- GitHub Actions workflow files generated from the skill-guided flow.
+
+## Prerequisites
+
+- Python 3.11 or later.
+- GitHub Copilot Chat or Copilot CLI with repository context.
+- Azure CLI login and a judge-model deployment for AI-assisted evaluators.
+
+```powershell
+python -m venv .venv
+.\.venv\Scripts\Activate.ps1
+python -m pip install -U pip
+python -m pip install "agentops-toolkit @ git+https://github.com/Azure/agentops.git@develop"
+
+az login
+$env:AZURE_AI_FOUNDRY_PROJECT_ENDPOINT = "https://<resource>.services.ai.azure.com/api/projects/<project>"
+$env:AZURE_OPENAI_ENDPOINT             = "https://<resource>.openai.azure.com"
+$env:AZURE_OPENAI_DEPLOYMENT           = "gpt-4o-mini"
+```
+
+> If you are testing unreleased AgentOps changes locally, install from
+> your checkout instead:
+>
+> ```powershell
+> python -m pip install -e "C:\path\to\agentops[foundry,agent]"
+> ```
+
+## 1. Create the sample agent
+
+Create `support_agent.py`:
+
+```python
+from http.server import BaseHTTPRequestHandler, HTTPServer
+import json
+
+
+RESPONSES = {
+    "Where is my order ORD-12345?": "Order ORD-12345 is in transit and expected to arrive tomorrow.",
+    "Can I return a damaged headset from ORD-77821?": "Yes. Start a return for ORD-77821 and choose damaged item as the reason.",
+    "How do I contact a human support agent?": "I can connect you to a human support agent for account or order issues.",
+}
+
+
+class Handler(BaseHTTPRequestHandler):
+    def do_POST(self):
+        length = int(self.headers.get("content-length", "0"))
+        body = json.loads(self.rfile.read(length) or b"{}")  # tolerate an empty request body
+        message = body.get("message", "")
+        text = RESPONSES.get(message, "I can help with order status, returns, and support escalation.")
+
+        payload = json.dumps({"text": text}).encode("utf-8")
+        self.send_response(200)
+        self.send_header("content-type", "application/json")
+        self.send_header("content-length", str(len(payload)))
+        self.end_headers()
+        self.wfile.write(payload)
+
+
+HTTPServer(("127.0.0.1", 8790), Handler).serve_forever()
+```
+
+Start it in a second terminal:
+
+```powershell
+.\.venv\Scripts\Activate.ps1
+python support_agent.py
+```
+
+## 2. Initialize AgentOps and install skills
+
+```powershell
+agentops init
+agentops skills install --platform copilot --force
+```
+
+You should now have:
+
+```text
+agentops.yaml
+.agentops/data/smoke.jsonl
+.github/skills/
+  agentops-config/SKILL.md
+  agentops-dataset/SKILL.md
+  agentops-eval/SKILL.md
+  agentops-report/SKILL.md
+  agentops-workflow/SKILL.md
+```
+
+The skills are workflow instructions for Copilot. They tell Copilot how
+to inspect the workspace, which AgentOps files to create, which commands
+are valid, and when to ask for missing values instead of inventing them.
+
+## 3. Ask Copilot to configure AgentOps
+
+In Copilot Chat, ask:
+
+```text
+Use the agentops-config skill. Inspect this project and create an
+AgentOps config for the local HTTP support agent on port 8790.
+```
+
+Expected `agentops.yaml`:
+
+```yaml
+version: 1
+agent: "http://127.0.0.1:8790/"
+dataset: .agentops/data/support-agent.jsonl
+
+request_field: message
+response_field: text
+
+thresholds:
+  coherence: ">=3"
+  fluency: ">=3"
+  similarity: ">=3"
+  avg_latency_seconds: "<=2"
+```
+
+Why this is the right config:
+
+- `agent` is the local HTTP endpoint.
+- `request_field` matches `body.get("message")` in `support_agent.py`.
+- `response_field` matches the returned JSON key `{ "text": ... }`.
+- The thresholds are intentionally simple for the first smoke gate.
+
+## 4. Ask Copilot to generate the dataset
+
+In Copilot Chat, ask:
+
+```text
+Use the agentops-dataset skill. Generate a small deterministic JSONL
+dataset for the support agent behavior in support_agent.py.
+```
+
+Expected `.agentops/data/support-agent.jsonl`:
+
+```jsonl
+{"input":"Where is my order ORD-12345?","expected":"Order ORD-12345 is in transit and expected to arrive tomorrow."}
+{"input":"Can I return a damaged headset from ORD-77821?","expected":"The customer can start a return for ORD-77821 and choose damaged item as the reason."}
+{"input":"How do I contact a human support agent?","expected":"The assistant can connect the customer to a human support agent for account or order issues."}
+```
+
+The dataset uses the exact prompts that the sample app recognizes. That
+makes the first run a configuration smoke test: if it fails, you likely
+have a field mapping, endpoint, auth, or environment problem rather than
+a prompt-quality problem.
+
+## 5. Ask Copilot to run the eval
+
+In Copilot Chat, ask:
+
+```text
+Use the agentops-eval skill. Run the evaluation and explain any failure.
+```
+
+Expected command:
+
+```powershell
+agentops eval run
+```
+
+Expected outputs:
+
+```text
+.agentops/results/<timestamp>/results.json
+.agentops/results/<timestamp>/report.md
+.agentops/results/latest/results.json
+.agentops/results/latest/report.md
+```
+
+Exit code `0` means the config, dataset, HTTP agent, and thresholds all
+worked. Exit code `2` means the run completed but one or more thresholds
+failed. Exit code `1` means a runtime/configuration error.
+
+## 6. Ask Copilot to interpret the report
+
+In Copilot Chat, ask:
+
+```text
+Use the agentops-report skill. Read the latest report and summarize the
+strongest rows, weakest rows, and next improvement.
+```
+
+A useful answer should not just say "pass" or "fail". It should point to:
+
+- the threshold table in `.agentops/results/latest/report.md`;
+- the lowest-scoring row or metric;
+- whether latency comes from agent runtime or evaluator overhead;
+- a concrete next change, such as improving an answer or tightening a
+  threshold after repeated passing runs.
+
+## 7. Ask Copilot to add the PR gate
+
+In Copilot Chat, ask:
+
+```text
+Use the agentops-workflow skill. Generate the GitHub Actions workflow
+files and tell me which GitHub environment variables are required.
+```
+
+Expected command:
+
+```powershell
+agentops workflow generate
+```
+
+Expected workflow files:
+
+```text
+.github/workflows/agentops-pr.yml
+.github/workflows/agentops-deploy-dev.yml
+.github/workflows/agentops-deploy-qa.yml
+.github/workflows/agentops-deploy-prod.yml
+```
+
+For this HTTP tutorial, the PR gate needs the Azure OIDC login values
+plus the same evaluator-model values you used locally:
+
+| GitHub variable | Purpose |
+|---|---|
+| `AZURE_CLIENT_ID` | OIDC identity used by `azure/login`. |
+| `AZURE_TENANT_ID` | Tenant for the OIDC login. |
+| `AZURE_SUBSCRIPTION_ID` | Azure subscription for the login. |
+| `AZURE_AI_FOUNDRY_PROJECT_ENDPOINT` | Foundry project used by AI-assisted evaluators. |
+| `AZURE_OPENAI_ENDPOINT` | Azure OpenAI endpoint for the judge model. |
+| `AZURE_OPENAI_DEPLOYMENT` | Judge model deployment, for example `gpt-4o-mini`. |
+
+If your HTTP agent is remote and protected, also add the token variable
+referenced by `auth_header_env`.
+
+Because this tutorial starts the sample agent on `127.0.0.1`, GitHub
+Actions must start that process before `agentops eval run`. For this
+sample repo, add this step between **Install AgentOps Toolkit** and
+**Run AgentOps eval** in `agentops-pr.yml`:
+
+```yaml
+      - name: Start local tutorial agent
+        run: |
+          python support_agent.py &
+          sleep 2
+```
+
+For a deployed ACA/AKS/App Service endpoint, skip that step and point
+`agent:` at the public or private URL your runner can reach.
+
+## 8. Push the tutorial repo
+
+```powershell
+git init -b main
+git add .
+git commit -m "feat: add Copilot-assisted AgentOps eval"
+gh repo create "agentops-copilot-skills-<suffix>" --public --source=. --push
+```
+
+The first PR against `main` or `develop` will run `agentops-pr.yml`.
+When it finishes, open the workflow artifact or PR comment to view the
+same `report.md` you inspected locally.
+
+## What Copilot should have learned
+
+The skills keep Copilot inside the AgentOps contract:
+
+- `agentops-config` creates a flat `agentops.yaml`, not legacy
+  `run.yaml` / bundle / dataset config files.
+- `agentops-dataset` creates rows tailored to the app instead of generic
+  trivia.
+- `agentops-eval` runs `agentops eval run` and respects exit codes.
+- `agentops-report` turns metrics into actionable insights.
+- `agentops-workflow` generates the standard GitFlow workflow scaffold
+  without inventing unsupported flags or commands.
+
+That is the intended AgentOps development loop: Copilot accelerates the
+file creation and interpretation, while AgentOps supplies the repeatable
+evaluation contract.
diff --git a/docs/tutorial-end-to-end.md b/docs/tutorial-end-to-end.md
index ecdb4b9c..22d01ee8 100644
--- a/docs/tutorial-end-to-end.md
+++ b/docs/tutorial-end-to-end.md
@@ -84,7 +84,7 @@ Set the project endpoint up front so every command picks it up.
 
 ```powershell
 $env:AZURE_AI_FOUNDRY_PROJECT_ENDPOINT = "https://<your-project>.services.ai.azure.com/api/projects/<project-name>"
-$env:AZURE_OPENAI_ENDPOINT             = "https://<your-project>.services.ai.azure.com"
+$env:AZURE_OPENAI_ENDPOINT             = "https://<your-project>.openai.azure.com"
 $env:AZURE_OPENAI_DEPLOYMENT           = "gpt-4o-mini"
 ```
 
@@ -92,20 +92,19 @@ $env:AZURE_OPENAI_DEPLOYMENT           = "gpt-4o-mini"
 
 ```bash
 export AZURE_AI_FOUNDRY_PROJECT_ENDPOINT="https://<your-project>.services.ai.azure.com/api/projects/<project-name>"
-export AZURE_OPENAI_ENDPOINT="https://<your-project>.services.ai.azure.com"
+export AZURE_OPENAI_ENDPOINT="https://<your-project>.openai.azure.com"
 export AZURE_OPENAI_DEPLOYMENT="gpt-4o-mini"
 ```
 
-> **Watch out for two endpoint shapes.** On a Foundry "AI Services"
-> account, both env vars start with the same hostname but the
-> project endpoint includes `/api/projects/<project-name>` while
-> `AZURE_OPENAI_ENDPOINT` is **only** the hostname (no path). If you
-> paste the project URL into `AZURE_OPENAI_ENDPOINT` the evaluators
-> fail with `BadRequest: API version not supported`. AgentOps
-> defaults the API version to a release that works against both
-> New Foundry and classic Azure OpenAI; override with
-> `AZURE_OPENAI_API_VERSION` only if your resource needs a specific
-> version.
+> **Watch out for two endpoint shapes.** The Foundry project endpoint
+> uses the `*.services.ai.azure.com/api/projects/<project-name>` shape.
+> The evaluator model endpoint is the Azure OpenAI data-plane host,
+> usually `*.openai.azure.com`, with **no path**. If you paste the
+> project URL into `AZURE_OPENAI_ENDPOINT`, evaluators can fail with
+> `BadRequest: API version not supported`. AgentOps defaults the API
+> version to a release that works against both New Foundry and classic
+> Azure OpenAI; override with `AZURE_OPENAI_API_VERSION` only if your
+> resource needs a specific version.
 
 > The remaining shell snippets in this tutorial are written for
 > **PowerShell** (the default on Windows). bash / zsh users can
@@ -204,8 +203,8 @@ You get:
     └── agentops-*/SKILL.md
 ```
 
-Open `.agentops/agentops.yaml` and configure it for the support
-agent:
+Open `agentops.yaml` at the project root and configure it for the
+support agent:
 
 ```yaml
 version: 1
@@ -223,8 +222,9 @@ thresholds:
   coherence: ">=3"
   fluency: ">=3"
   similarity: ">=3"
-  # Latency budget.
-  avg_latency_seconds: "<=10"
+  # Lab-safe latency budget. Tool-calling Foundry agents can have
+  # occasional cold-start / orchestration spikes during a tutorial run.
+  avg_latency_seconds: "<=90"
 ```
 
 The `agent: "name:version"` shape is recognised as a **Foundry hosted
@@ -275,6 +275,13 @@ quality stack. When AgentOps loads the dataset it picks:
 | `CoherenceEvaluator` / `FluencyEvaluator` / `SimilarityEvaluator` / `F1ScoreEvaluator` | Standard text quality. |
 | `avg_latency_seconds` | End-to-end latency budget. |
 
+> **Why is the latency budget 90 seconds?** The point of this first gate
+> is to prove tool behavior, not to fail a learner because one Foundry
+> row hit a transient cold-start or service-queue spike. Keep this
+> tutorial gate broad, then tighten latency for your own production
+> agent after you have baseline data. Step 9 shows how to use
+> Application Insights and Watchdog for stricter p95 latency monitoring.
+
 ## 5. Run your first evaluation
 
 ```powershell
@@ -308,8 +315,10 @@ The report has four sections you will revisit often:
   debugging false-positive tool calls.
 - **Aggregate metrics** — averages across rows.
 - **Thresholds** — every rule from `agentops.yaml` with measured
-  value. With v1 you should see all the tool-calling thresholds in
-  the green.
+  value. With v1 you should see the tool-calling and text-quality
+  thresholds in the green. If latency is high but below the lab-safe
+  budget, keep going; you will inspect production-style p95 latency
+  with Watchdog later.
 
 The exit code is `0` (all thresholds passed) or `2` (one or more
 failed). `1` means a runtime error.
@@ -441,9 +450,28 @@ git push -u origin develop
 
 ### Wire the GitHub Environments
 
+At this point the eval works on your machine because your local Azure
+login has access to Foundry and to the evaluator model. GitHub Actions is
+a different machine, so you must give the workflow its own identity and
+permissions.
+
 The four workflows (`pr`, `deploy-dev`, `deploy-qa`, `deploy-prod`)
-expect a GitHub **environment** per stage, each populated with the same
-six variables and a federated credential so Azure trusts GitHub OIDC.
+expect one GitHub **environment** per stage. Each environment stores the
+variables the workflow needs and maps to one trusted Azure identity.
+
+| Piece | Why you need it |
+|---|---|
+| App registration + service principal | The Azure identity that GitHub Actions will impersonate. |
+| GitHub environment variables | Non-secret configuration such as tenant, subscription, Foundry endpoint, and evaluator model endpoint. |
+| Federated credential | The trust rule that allows GitHub OIDC tokens from this repo/environment to become Azure tokens. |
+| Azure role assignments | The actual permissions to read the Foundry agent and call the Azure OpenAI judge model. |
+
+Think of the setup in two layers:
+
+1. **Authentication:** GitHub proves "this workflow is running from your
+   `support-bot-*` repo in the `dev`, `qa`, or `prod` environment".
+2. **Authorization:** Azure checks whether that identity has roles on the
+   Foundry and Azure OpenAI resources.
 
 The next four snippets create everything end-to-end. Run them in order
 from the same PowerShell session you used above (so `$suffix` is still
@@ -451,6 +479,17 @@ in scope).
 
 #### 1. Create the app registration GitHub will impersonate
 
+This creates the Azure identity used by the workflows. There is no client
+secret in this tutorial: GitHub will authenticate with OIDC instead of a
+stored password.
+
+The command prints three values you will store as GitHub environment
+variables:
+
+- `AZURE_CLIENT_ID` — which app registration GitHub should impersonate.
+- `AZURE_TENANT_ID` — which Microsoft Entra tenant owns the app.
+- `AZURE_SUBSCRIPTION_ID` — which Azure subscription the workflow should use.
+
 ```powershell
 $app    = az ad app create --display-name "support-bot-ci-$suffix" | ConvertFrom-Json
 az ad sp create --id $app.appId | Out-Null
@@ -475,6 +514,23 @@ Write-Host "AZURE_SUBSCRIPTION_ID = $sub"
 
 #### 2. Create the three environments and push the variables
 
+GitHub environments give each stage its own variable scope and its own
+OIDC subject (`environment:dev`, `environment:qa`, `environment:prod`).
+The PR gate intentionally runs in `dev`, so it reuses the same variables
+and identity as the first deployment stage.
+
+This snippet creates the environments and stores the values the generated
+workflows read through `vars.*`:
+
+| Variable | Where it comes from | Used for |
+|---|---|---|
+| `AZURE_TENANT_ID` | `az account show` | Tells `azure/login` which Entra tenant to authenticate against. |
+| `AZURE_SUBSCRIPTION_ID` | `az account show` | Selects the Azure subscription for the workflow. |
+| `AZURE_CLIENT_ID` | The app registration from step 1 | Tells `azure/login` which identity GitHub should impersonate. |
+| `AZURE_AI_FOUNDRY_PROJECT_ENDPOINT` | Your local env var | Tells AgentOps where the hosted support agent lives. |
+| `AZURE_OPENAI_ENDPOINT` | Your local env var | Tells evaluators where the judge model endpoint is. |
+| `AZURE_OPENAI_DEPLOYMENT` | The deployment name, e.g. `gpt-4o-mini` | Tells evaluators which judge model deployment to call. |
+
 ```powershell
 $foundry = $env:AZURE_AI_FOUNDRY_PROJECT_ENDPOINT
 $aoai    = $env:AZURE_OPENAI_ENDPOINT
@@ -495,16 +551,25 @@ foreach ($envName in @("dev","qa","prod")) {
 
 > **Prefer the portal?** Open your repo on github.com → **Settings →
 > Environments → New environment** and create `dev`, `qa`, and `prod`.
-> For each one, click **Add variable** and add the six rows from the
-> table at the top of this section.
+> For each one, click **Add variable** and add the six variables listed
+> above.
 
 #### 3. Add federated credentials so Azure trusts GitHub OIDC
 
-One credential per environment. The PR gate workflow runs **inside the
-`dev` environment** (so it inherits the same `dev` variables and OIDC
-subject) — no separate `pull_request` credential is needed. The JSON is
-written to a temp file because `az` does not parse inline JSON reliably
-under PowerShell:
+The variables above tell GitHub which Azure identity to use, but Azure
+still needs to trust this repository. A federated credential is that trust
+rule.
+
+Each credential says: "Accept tokens issued by GitHub for this exact repo
+and this exact environment." That is why the `subject` values include
+`environment:dev`, `environment:qa`, and `environment:prod`.
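To make the trust rule concrete, here is a minimal sketch of the payload shape a federated credential carries for a GitHub environment. The issuer URL, the `api://AzureADTokenExchange` audience, and the `repo:<owner>/<repo>:environment:<name>` subject format follow the GitHub OIDC conventions; the credential `name` and the `contoso/support-bot-demo` repository are placeholders chosen for this example.

```python
import json

GITHUB_OIDC_ISSUER = "https://token.actions.githubusercontent.com"
AZURE_AUDIENCE = "api://AzureADTokenExchange"


def federated_credential(repo: str, environment: str) -> dict:
    """Build the trust rule for one repo/environment pair."""
    return {
        "name": f"github-{environment}",  # placeholder naming scheme
        "issuer": GITHUB_OIDC_ISSUER,
        # Azure accepts a token only if its subject matches this string exactly.
        "subject": f"repo:{repo}:environment:{environment}",
        "audiences": [AZURE_AUDIENCE],
    }


# "contoso/support-bot-demo" is an illustrative repository name.
for env_name in ("dev", "qa", "prod"):
    print(json.dumps(federated_credential("contoso/support-bot-demo", env_name)))
```

Because the subject is matched exactly, a workflow job that does not declare `environment: dev` (or `qa` / `prod`) presents a different subject and is rejected.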
+
+The PR gate workflow runs **inside the `dev` environment**, so it inherits
+the same `dev` variables and OIDC subject — no separate `pull_request`
+credential is needed.
+
+The JSON is written to a temp file because `az` does not parse inline JSON
+reliably under PowerShell:
 
 ```powershell
 $subjects = @{
@@ -538,6 +603,19 @@ foreach ($name in $subjects.Keys) {
 
 #### 4. Grant the app the roles it needs
 
+OIDC only proves the workflow's identity; it does not grant access by
+itself. This step assigns least-privilege Azure roles to the service
+principal:
+
+| Scope | Role | Why |
+|---|---|---|
+| Foundry account/project resource | `Azure AI User` | Lets AgentOps read and invoke the hosted support agent. |
+| Azure OpenAI account | `Cognitive Services OpenAI User` | Lets the evaluators call the judge model deployment. |
+
+The endpoint URLs contain the Azure resource names, but role assignments
+need full Azure resource IDs. The first half of the script extracts those
+names and resolves them to IDs; the second half assigns the roles.
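The name-extraction half of that logic is small enough to sketch on its own. The function below is an illustration in Python, not the tutorial's PowerShell: it assumes the resource name is the first label of the endpoint's hostname, which holds for the endpoint shapes shown in this tutorial.

```python
from urllib.parse import urlparse


def resource_name_from_endpoint(endpoint: str) -> str:
    """Return the first hostname label, e.g. 'contoso' for https://contoso.openai.azure.com."""
    host = urlparse(endpoint).hostname
    if not host:
        raise ValueError(f"not an absolute URL: {endpoint!r}")
    return host.split(".")[0]


# Placeholder endpoints; substitute your own environment variable values.
print(resource_name_from_endpoint(
    "https://contoso-foundry.services.ai.azure.com/api/projects/support"
))
print(resource_name_from_endpoint("https://contoso-aoai.openai.azure.com"))
```

The name alone is not enough for `az role assignment create`; it still has to be resolved to a full resource ID, which is what the Azure CLI lookup in the script does.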
+
 ```powershell
 $spId = az ad sp show --id $client --query id -o tsv
 
@@ -586,51 +664,158 @@ The `agentops-pr.yml` workflow runs. When it finishes you will see:
 - A green or red check on the PR.
 - A bot comment with the verdict, threshold table (including the
   tool-call metrics), and a link to the full `report.md` artifact.
+  The tutorial's latency threshold is intentionally broad; after a few
+  real runs, tighten it in `agentops.yaml` or enforce p95 latency with
+  Watchdog in step 9.
 
 Merge the PR. `agentops-deploy-dev.yml` triggers, runs an eval against
 the dev environment, and deploys if it passes.
 
 ## 9. Run the Watchdog
 
-The watchdog reads your accumulated run history and (optionally)
-queries Application Insights and the Foundry control plane to flag
-drifts that a single eval cannot see — repeated regressions, latency
-trends, error spikes, safety findings.
+The watchdog is only useful if it has real signals to inspect. In this
+tutorial those signals are:
+
+1. `.agentops/results/*/results.json` from the evals you already ran.
+2. Application Insights telemetry emitted by a new eval run.
+3. Foundry control-plane metadata for the hosted support agent.
+
+If you run `agentops agent analyze` without Application Insights
+configured, the report can only say `azure_monitor: skipped`, which
+leaves no live telemetry to inspect. The next commands create Application
+Insights, send telemetry into it, and then run the watchdog against the
+live data.
+
+### 9.1 Create Application Insights for the tutorial
 
 ```powershell
-pip install "agentops-toolkit[agent] @ git+https://github.com/Azure/agentops.git@develop"
-agentops agent analyze
+# Reuse the same resource group/location as the Foundry account.
+$foundryName = (($env:AZURE_AI_FOUNDRY_PROJECT_ENDPOINT -split "//")[1] -split "\.")[0]
+$foundry = az resource list `
+  --name $foundryName `
+  --resource-type "Microsoft.CognitiveServices/accounts" `
+  --query "[0]" | ConvertFrom-Json
+
+if (-not $foundry) { throw "Could not resolve Foundry resource '$foundryName'" }
+
+$resourceGroup = ($foundry.id -split "/resourceGroups/")[1].Split("/")[0]
+$location      = $foundry.location
+$workspaceName = "law-support-bot-$suffix"
+$appiName      = "appi-support-bot-$suffix"
+
+az extension add -n application-insights --upgrade | Out-Null
+az monitor log-analytics workspace create `
+  --resource-group $resourceGroup `
+  --workspace-name $workspaceName `
+  --location $location | Out-Null
+
+$workspaceId = az monitor log-analytics workspace show `
+  --resource-group $resourceGroup `
+  --workspace-name $workspaceName `
+  --query id -o tsv
+
+az monitor app-insights component create `
+  --app $appiName `
+  --location $location `
+  --resource-group $resourceGroup `
+  --workspace $workspaceId `
+  --application-type web | Out-Null
+
+$appInsightsId = az monitor app-insights component show `
+  --app $appiName `
+  --resource-group $resourceGroup `
+  --query id -o tsv
+
+$appInsightsConnectionString = az monitor app-insights component show `
+  --app $appiName `
+  --resource-group $resourceGroup `
+  --query connectionString -o tsv
 ```
 
-This produces `.agentops/agent/report.md`. With no `agent.yaml`
-present, only the local results-history source is active and Azure
-Monitor / Foundry control plane appear as `skipped` in the
-diagnostics block. That is enough for the basic regression and
-latency checks across all your previous runs.
+What this creates:
+
+- A **Log Analytics workspace** that stores the telemetry tables.
+- A workspace-based **Application Insights component** that receives
+  AgentOps spans and exposes them to Azure Monitor queries.
+- Two local variables:
+  - `$appInsightsId` — used by the watchdog to query telemetry.
+  - `$appInsightsConnectionString` — used by `agentops eval run` to emit
+    telemetry.
+
+### 9.2 Let the CI identity read telemetry
 
-To pull production telemetry, drop a starter `agent.yaml` into the
-workspace and edit it:
+Locally, your signed-in Azure user can usually query the resource because
+you created it. For GitHub Actions, grant the same OIDC app a read role
+so scheduled watchdog runs can query Application Insights too:
 
 ```powershell
-$tpl = python -c "import agentops, pathlib; print(pathlib.Path(agentops.__file__).parent / 'templates' / 'agent.yaml')"
-Copy-Item $tpl .agentops/agent.yaml
+$repo = gh repo view --json nameWithOwner -q .nameWithOwner
+$client = gh variable get AZURE_CLIENT_ID --env dev --repo $repo
+$spId = az ad sp show --id $client --query id -o tsv
+
+az role assignment create `
+  --assignee-object-id $spId `
+  --assignee-principal-type ServicePrincipal `
+  --role "Monitoring Reader" `
+  --scope $appInsightsId | Out-Null
 ```
 
-```yaml
+### 9.3 Configure the watchdog
+
+Now write `.agentops/agent.yaml`. This is the file that tells the
+watchdog which signal sources to use:
+
+```powershell
+@"
+version: 1
+lookback_days: 7
+
 sources:
   results_history:
     enabled: true
+    path: .agentops/results
+    lookback_runs: 10
   azure_monitor:
     enabled: true
-    app_insights_resource_id: /subscriptions/<sub>/resourceGroups/<rg>/providers/microsoft.insights/components/<ai>
+    app_insights_resource_id: $appInsightsId
   foundry_control:
     enabled: true
     project_endpoint_env: AZURE_AI_FOUNDRY_PROJECT_ENDPOINT
+checks:
+  latency:
+    p95_threshold_seconds: 5.0
+  errors:
+    rate_threshold: 0.05
+"@ | Set-Content .agentops/agent.yaml -Encoding utf8
 ```
 
-Re-run `agentops agent analyze`. The findings table now mixes signals
-from your eval history (including the v1 → v2 tool-call regression)
-with live telemetry from the deployed agent.
+### 9.4 Generate telemetry, then analyze it
+
+Install both the Foundry runtime and the watchdog extras, set the
+Application Insights connection string, and run one more eval. AgentOps
+will emit OpenTelemetry spans for each dataset row and agent invocation.
+
+```powershell
+python -m pip install "agentops-toolkit[foundry,agent] @ git+https://github.com/Azure/agentops.git@develop"
+
+$env:APPLICATIONINSIGHTS_CONNECTION_STRING = $appInsightsConnectionString
+agentops eval run
+
+# Azure Monitor ingestion is asynchronous. Wait about 90 seconds so the new spans have time to be indexed.
+Start-Sleep -Seconds 90
+
+agentops agent analyze
+code .agentops/agent/report.md
+```
+
+The report should now show `azure_monitor` as `ok`, not `skipped`. The
+watchdog can combine:
+
+- eval-history regressions from `.agentops/results`;
+- live p95 latency and error-rate signals from Application Insights;
+- Foundry control-plane metadata from `AZURE_AI_FOUNDRY_PROJECT_ENDPOINT`.
+
+If the findings table is empty, that means the configured checks passed;
+the **Sources** table still proves which signal sources were queried.
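For intuition about the two checks configured in `agent.yaml`, the sketch below computes a nearest-rank p95 and an error rate and compares them with `p95_threshold_seconds: 5.0` and `rate_threshold: 0.05`. It is an assumption-laden illustration: the watchdog's actual percentile method is not described here, and the sample numbers are made up.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile: the smallest value with at least 95% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def error_rate(total: int, failed: int) -> float:
    """Fraction of requests that failed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total


# Made-up latencies (seconds) and request counts for illustration.
latencies = [1.2, 0.8, 2.5, 6.1, 1.9]
print("p95 breach:", p95(latencies) > 5.0)
print("error-rate breach:", error_rate(total=40, failed=1) > 0.05)
```

With only five samples the p95 is simply the slowest request, which is why a single cold-start spike can trip a strict p95 rule on a small dataset.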
 
 > **Optional — WAF-AI security audit.** The watchdog can also run a
 > read-only audit of your Foundry resource group against the
diff --git a/docs/tutorial-http-agent.md b/docs/tutorial-http-agent.md
index aa8fb625..e9fa7f14 100644
--- a/docs/tutorial-http-agent.md
+++ b/docs/tutorial-http-agent.md
@@ -1,209 +1,252 @@
-# Tutorial: HTTP Agent Evaluation (Agent Framework / ACA)
+# Tutorial: HTTP Agent Evaluation
 
-This tutorial shows how to evaluate an AI agent deployed as an HTTP endpoint — for example, a [Microsoft Agent Framework](https://learn.microsoft.com/azure/ai-agent-service/) application running on Azure Container Apps (ACA). No Foundry Agent Service is required.
+This tutorial shows how to evaluate an agent that is exposed as an
+HTTP/JSON endpoint. That endpoint can be a local development server,
+Azure Container Apps, AKS, App Service, FastAPI, Express, Microsoft Agent
+Framework, LangGraph, or any service that accepts a prompt and returns a
+text response.
 
-The HTTP backend sends each dataset row as a JSON POST request to your agent endpoint, extracts the response, runs local and AI-assisted evaluators, and produces the standard `results.json` and `report.md` outputs.
+AgentOps runs the same pipeline for HTTP agents as for Foundry agents:
+it loads JSONL rows, POSTs one row at a time, extracts the answer, runs
+evaluators, and writes `results.json` plus `report.md`.
 
-## When HTTP backend makes sense
+## What you will build
 
-Use `type: http` when:
+- A tiny local HTTP agent so you can run the tutorial without deploying
+  anything.
+- A flat `agentops.yaml` that points to the HTTP URL.
+- A JSONL dataset with deterministic support-style questions.
+- One `agentops eval run` producing a passing report.
 
-- Your agent is **deployed outside Foundry Agent Service** — for example, a multi-agent orchestrator on ACA or a custom FastAPI service.
-- You use **Microsoft Agent Framework** (or any other framework) and expose an HTTP chat endpoint.
-- You want **CI/CD gating** for any HTTP-accessible agent without Foundry dependency.
-- You need to evaluate a **local development server** before deploying.
-
-The HTTP backend works for multi-agent scenarios transparently — evaluation always hits the orchestrator endpoint; internal agent routing and tool calls are invisible to AgentOps at this level.
+To evaluate a real deployed agent later, keep the same pattern and change
+only the `agent:` URL and the field mapping.
 
 ## Prerequisites
 
-- Python 3.11+
-- An agent running and accessible via HTTP (local or remote).
-- *(Optional)* Azure CLI for AI-assisted evaluators (`az login`).
-- `pip install "agentops-toolkit @ git+https://github.com/Azure/agentops.git@develop"`
-
-## Part 1: Set up
+```powershell
+python -m venv .venv
+.\.venv\Scripts\Activate.ps1
+python -m pip install -U pip
+python -m pip install "agentops-toolkit @ git+https://github.com/Azure/agentops.git@develop"
+```
 
-### 1) Initialize the workspace
+If you use AI-assisted evaluators such as Similarity or Fluency, also set
+the judge model and sign in to Azure:
 
-```bash
-agentops init
+```powershell
+az login
+$env:AZURE_AI_FOUNDRY_PROJECT_ENDPOINT = "https://<resource>.services.ai.azure.com/api/projects/<project>"
+$env:AZURE_OPENAI_ENDPOINT             = "https://<resource>.openai.azure.com"
+$env:AZURE_OPENAI_DEPLOYMENT           = "gpt-4o-mini"
 ```
 
-This creates `.agentops/` with all starter files, including the HTTP scenario templates:
+## 1. Create a local HTTP agent
 
-```
-.agentops/
-├── run-http-model.yaml                  ← HTTP run config
-├── bundles/model_quality_baseline.yaml  ← baseline evaluators
-├── datasets/smoke-model-direct.yaml     ← smoke dataset config
-└── data/smoke-model-direct.jsonl        ← 5 generic Q&A rows
-```
+Create `http_agent.py`:
 
-### 2) Set the agent URL
+```python
+from http.server import BaseHTTPRequestHandler, HTTPServer
+import json
 
-The recommended approach is to set an environment variable so the URL stays out of your run config:
 
-PowerShell:
-```powershell
-$env:AGENT_HTTP_URL = "https://your-agent.region.azurecontainerapps.io/chat"
-```
+ANSWERS = {
+    "Where is my order ORD-12345?": {
+        "text": "Order ORD-12345 is in transit and expected to arrive tomorrow.",
+        "tool_calls": [{"type": "tool_call", "tool_call_id": "c1", "name": "lookup_order", "arguments": {"order_id": "ORD-12345"}}],
+    },
+    "I want a refund for ORD-77821, it arrived broken.": {
+        "text": "I started a refund for ORD-77821 because it arrived broken.",
+        "tool_calls": [{"type": "tool_call", "tool_call_id": "c2", "name": "refund_order", "arguments": {"order_id": "ORD-77821", "reason": "arrived broken"}}],
+    },
+    "Hi there!": {
+        "text": "Hello! I can help with order status, refunds, or connecting you to a human support agent.",
+        "tool_calls": [],
+    },
+}
+
 
-Bash/zsh:
-```bash
-export AGENT_HTTP_URL="https://your-agent.region.azurecontainerapps.io/chat"
+class Handler(BaseHTTPRequestHandler):
+    def do_POST(self):
+        length = int(self.headers.get("content-length", "0"))
+        body = json.loads(self.rfile.read(length))
+        message = body.get("message", "")
+        response = ANSWERS.get(message, {"text": "I do not know yet.", "tool_calls": []})
+
+        payload = json.dumps(response).encode("utf-8")
+        self.send_response(200)
+        self.send_header("content-type", "application/json")
+        self.send_header("content-length", str(len(payload)))
+        self.end_headers()
+        self.wfile.write(payload)
+
+
+HTTPServer(("127.0.0.1", 8787), Handler).serve_forever()
 ```
 
-For a local agent running during development:
-```bash
-export AGENT_HTTP_URL="http://localhost:8080/chat"
+Start it in a second terminal:
+
+```powershell
+.\.venv\Scripts\Activate.ps1
+python http_agent.py
 ```
 
-### 3) *(Optional)* Configure AI-assisted evaluators
+Why this local server? It lets you prove the AgentOps HTTP contract before
+you involve Container Apps, auth, networking, or deployment variables.
+When this passes locally, a remote HTTP target is just a URL swap.
+
+## 2. Initialize AgentOps
 
-If your bundle includes `SimilarityEvaluator` or other AI-assisted evaluators, set the judge model:
+Back in your first terminal:
 
-```bash
-export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
-export AZURE_AI_MODEL_DEPLOYMENT_NAME="gpt-4o"
+```powershell
+agentops init
+```
+
+This creates:
+
+```text
+agentops.yaml
+.agentops/
+  data/smoke.jsonl
+  results/
+.github/skills/
 ```
 
-Run `az login` if you are using `DefaultAzureCredential` locally.
+AgentOps 1.0 uses one flat config file at the project root. You do not
+need legacy `run-http.yaml`, bundle YAML, or dataset YAML files.
 
-## Part 2: Customize the run config
+## 3. Configure the HTTP endpoint
 
-Open `.agentops/run-http-model.yaml`. The starter config already points at the baseline bundle and smoke dataset:
+Replace `agentops.yaml` with:
 
 ```yaml
 version: 1
-target:
-  type: model
-  hosting: aks
-  execution_mode: remote
-  endpoint:
-    kind: http
-    url_env: AGENT_HTTP_URL      # reads the URL from your environment
-    request_field: message        # JSON field to send the prompt in
-    response_field: text          # JSON field to extract the response from
-bundle:
-  name: model_quality_baseline
-dataset:
-  name: smoke-model-direct
-execution:
-  timeout_seconds: 60
-output:
-  write_report: true
+agent: "http://127.0.0.1:8787/"
+dataset: .agentops/data/http-support.jsonl
+
+request_field: message
+response_field: text
+tool_calls_field: tool_calls
+
+thresholds:
+  coherence: ">=3"
+  fluency: ">=3"
+  tool_call_accuracy: ">=0.8"
+  intent_resolution: ">=3"
+  task_adherence: ">=0.8"
+  avg_latency_seconds: "<=2"
 ```
 
-### Adapting to your agent's API
+The HTTP field mapping controls the JSON protocol:
 
-Every agent has its own request/response format. Adjust these fields:
-
-| Field | Default | Description |
-|---|---|---|
-| `request_field` | `message` | JSON key for the prompt text |
-| `response_field` | `text` | JSON key for the response (supports dot-path) |
-| `auth_header_env` | — | Env var containing a Bearer token |
-| `headers` | `{}` | Static extra headers |
+| Config field | Meaning |
+|---|---|
+| `request_field: message` | AgentOps sends `{"message": "<row input>"}`. |
+| `response_field: text` | AgentOps reads the final answer from `response.text`. Dot paths such as `output.text` are supported. |
+| `tool_calls_field: tool_calls` | AgentOps reads structured tool calls from `response.tool_calls` so tool metrics can run. |
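The mapping in the table can be sketched in a few lines. The two functions below are hypothetical illustrations of the contract (they are not AgentOps internals): one builds the request body from `request_field`, the other follows a dot path such as `output.text` into a nested JSON response.

```python
from typing import Any


def build_request(request_field: str, prompt: str) -> dict:
    """Wrap the dataset row's input in the configured request field."""
    return {request_field: prompt}


def extract_field(payload: Any, path: str) -> Any:
    """Follow a dot path ('text', 'output.text') through nested dicts."""
    current = payload
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            raise KeyError(f"Response field {path!r} not found")
        current = current[key]
    return current


print(build_request("message", "Hi there!"))

# A nested response shape, as a deployed service might return it.
nested = {"output": {"text": "Hello!", "tool_calls": []}}
print(extract_field(nested, "output.text"))
print(extract_field(nested, "output.tool_calls"))
```

Checking a captured response against your chosen paths this way is a quick test of `response_field` and `tool_calls_field` before running a full evaluation.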
 
-**Examples:**
+For a deployed endpoint that requires a Bearer token, add:
 
-Agent that expects `{"query": "..."}` and returns `{"answer": "..."}`: 
 ```yaml
-target:
-  endpoint:
-    kind: http
-    url_env: AGENT_HTTP_URL
-    request_field: query
-    response_field: answer
+auth_header_env: AGENT_TOKEN
 ```
 
-Agent that returns `{"output": {"text": "..."}}` (nested):
-```yaml
-target:
-  endpoint:
-    kind: http
-    url_env: AGENT_HTTP_URL
-    response_field: output.text   # dot-path into nested object
-```
+Then set `$env:AGENT_TOKEN` before running the eval.
 
-Agent requiring Bearer token authentication:
-```yaml
-target:
-  endpoint:
-    kind: http
-    url_env: AGENT_HTTP_URL
-    auth_header_env: AGENT_TOKEN    # reads Bearer token from env
+## 4. Create the dataset
+
+Create `.agentops/data/http-support.jsonl`:
+
+```jsonl
+{"input":"Where is my order ORD-12345?","expected":"Order ORD-12345 is in transit and expected to arrive tomorrow.","tool_definitions":[{"type":"function","name":"lookup_order","description":"Look up an order.","parameters":{"type":"object","properties":{"order_id":{"type":"string"}},"required":["order_id"]}},{"type":"function","name":"refund_order","description":"Refund an order.","parameters":{"type":"object","properties":{"order_id":{"type":"string"},"reason":{"type":"string"}},"required":["order_id","reason"]}}],"tool_calls":[{"type":"tool_call","tool_call_id":"c1","name":"lookup_order","arguments":{"order_id":"ORD-12345"}}]}
+{"input":"I want a refund for ORD-77821, it arrived broken.","expected":"A refund is started for ORD-77821 because it arrived broken.","tool_definitions":[{"type":"function","name":"lookup_order","description":"Look up an order.","parameters":{"type":"object","properties":{"order_id":{"type":"string"}},"required":["order_id"]}},{"type":"function","name":"refund_order","description":"Refund an order.","parameters":{"type":"object","properties":{"order_id":{"type":"string"},"reason":{"type":"string"}},"required":["order_id","reason"]}}],"tool_calls":[{"type":"tool_call","tool_call_id":"c2","name":"refund_order","arguments":{"order_id":"ORD-77821","reason":"arrived broken"}}]}
+{"input":"Hi there!","expected":"The assistant replies with a clear greeting and offers support options without calling a tool.","tool_definitions":[{"type":"function","name":"lookup_order","description":"Look up an order.","parameters":{"type":"object","properties":{"order_id":{"type":"string"}},"required":["order_id"]}},{"type":"function","name":"refund_order","description":"Refund an order.","parameters":{"type":"object","properties":{"order_id":{"type":"string"},"reason":{"type":"string"}},"required":["order_id","reason"]}}],"tool_calls":[]}
 ```
 
-Banking assistant (Agent Framework default):
-```yaml
-target:
-  endpoint:
-    kind: http
-    url_env: AGENT_HTTP_URL
-    request_field: message
-    response_field: text
-    auth_header_env: AGENT_TOKEN
+Each row has:
 
-## Part 3: Prepare the dataset
+- `input` — what AgentOps sends to the HTTP service.
+- `expected` — the reference answer for text-quality metrics.
+- `tool_calls` — the expected structured tool behavior. Omit this field
+  if your HTTP endpoint does not expose tool calls.
+- `tool_definitions` — the function-tool schema available to the agent.
+  Tool-call accuracy evaluators need this catalogue on each row.
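Because each row is one long JSON object on a single line, a small pre-flight check can catch typos before an eval run. This is a sketch under the assumptions listed above (`input` and `expected` are non-empty strings, and `tool_calls` should come with `tool_definitions`); `validate_rows` is a hypothetical helper, not an AgentOps command.

```python
import json

REQUIRED_TEXT_FIELDS = ("input", "expected")


def validate_rows(lines: list[str]) -> list[str]:
    """Return human-readable problems found in JSONL lines (empty list means OK)."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(row, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        for field in REQUIRED_TEXT_FIELDS:
            value = row.get(field)
            if not isinstance(value, str) or not value:
                problems.append(f"line {number}: missing or empty '{field}'")
        if "tool_calls" in row and "tool_definitions" not in row:
            problems.append(f"line {number}: 'tool_calls' given without 'tool_definitions'")
    return problems


sample = [
    '{"input": "Hi there!", "expected": "A greeting.", "tool_definitions": [], "tool_calls": []}',
    '{"input": "Where is my order?", "tool_calls": []}',
    "{not json}",
]
for problem in validate_rows(sample):
    print(problem)
```

To check the tutorial dataset, read `.agentops/data/http-support.jsonl` with `Path(...).read_text(encoding="utf-8").splitlines()` and pass the result to `validate_rows`.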
 
-The smoke dataset has 5 generic Q&A rows. For real evaluations, replace `data/smoke-http.jsonl` with domain-specific queries:
+## 5. Run the evaluation
 
-```json
-{"id":"1","input":"What is the balance on account 12345?","expected":"The balance on account 12345 is $1,234.56."}
-{"id":"2","input":"What are the last 3 transactions on my savings account?","expected":"The last 3 transactions are: ..."}
+```powershell
+agentops eval run
 ```
 
-Update `datasets/smoke-http.yaml` to point at your file:
+The CLI should print a passing threshold summary and write:
 
-```yaml
-source:
-  type: file
-  path: ../data/your-dataset.jsonl
+```text
+.agentops/results/<timestamp>/results.json
+.agentops/results/<timestamp>/report.md
+.agentops/results/latest/
 ```
 
-## Part 4: Run the evaluation
+Open the Markdown report:
 
-```bash
-agentops eval run --config .agentops/run-http.yaml
+```powershell
+code .agentops/results/latest/report.md
 ```
 
-The backend:
-1. Loads the dataset rows from the JSONL file.
-2. POSTs each row to your agent via HTTP.
-3. Extracts the response text.
-4. Runs evaluators (`SimilarityEvaluator`, `avg_latency_seconds`).
-5. Writes `backend_metrics.json`, then `results.json` and `report.md`.
-
-Output lands in `.agentops/results/<timestamp>/` and is mirrored to `.agentops/results/latest/`. Pass `--output <dir>` to write the run only to that path instead.
+The report shows the aggregate metrics, threshold table, and per-row
+details. For the first two rows, the per-row section should include the
+tool calls returned by the HTTP server.
 
-## Part 5: Review results
+## 6. Point it at a real service
 
-**Console:** AgentOps prints a summary with pass/fail per threshold.
+When you deploy the agent, keep the dataset and thresholds but change the
+URL and field mapping:
 
-**Report:** Open the report in VS Code with `code .agentops/results/latest/report.md` and press `Ctrl+Shift+V` to render the Markdown.
+```yaml
+version: 1
+agent: "https://your-agent.region.azurecontainerapps.io/chat"
+dataset: .agentops/data/http-support.jsonl
 
-**JSON:** Parse `.agentops/results/latest/results.json` for machine-readable scores.
+request_field: message
+response_field: output.text
+tool_calls_field: output.tool_calls
+auth_header_env: AGENT_TOKEN
+```
 
-## Troubleshooting
+Run the same command:
 
-**`connection refused` / `URL error`** — Your agent is not reachable. Check that `AGENT_HTTP_URL` is correct and the server is running.
+```powershell
+agentops eval run
+```
 
-**`Response field 'text' not found`** — Your agent returns a different key. Inspect the raw response and update `response_field` in your run config.
+If the local server passed but the remote service fails, the issue is
+usually deployment reachability, auth, or a response-field mismatch rather
+than evaluator logic.
 
-**`SimilarityEvaluator` fails** — Set `AZURE_OPENAI_ENDPOINT` and `AZURE_AI_MODEL_DEPLOYMENT_NAME`, then run `az login`.
+## Troubleshooting
 
-**All rows error, exit code 1** — Check `.agentops/results/latest/backend.stderr.log` for per-row error details.
+| Symptom | What to check |
+|---|---|
+| `connection refused` | The server is not running or the URL/port is wrong. |
+| `Response field 'text' not found` | Update `response_field` to match your JSON response shape. |
+| `tool_call_accuracy` is missing | Add `tool_calls_field` and make sure the response includes structured tool calls. |
+| AI evaluator auth error | Run `az login` and set the Azure OpenAI / Foundry environment variables. |
 
 ## Exit codes
 
 | Code | Meaning |
 |---|---|
-| `0` | All rows succeeded and all thresholds passed |
-| `2` | Evaluation succeeded but one or more thresholds failed |
-| `1` | Runtime error (HTTP failure, config error) |
+| `0` | Evaluation succeeded and all thresholds passed. |
+| `2` | Evaluation succeeded but at least one threshold failed. |
+| `1` | Runtime or configuration error. |
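A wrapper script can act on these codes directly. The sketch below maps the three documented exit codes to messages and decides whether a gate should block. `describe_exit_code`, `should_block`, and `run_eval` are hypothetical helper names, and treating undocumented codes as errors is an assumption of this example.

```python
import subprocess

# Meanings copied from the exit-code table above.
EXIT_MEANINGS = {
    0: "Evaluation succeeded and all thresholds passed.",
    2: "Evaluation succeeded but at least one threshold failed.",
    1: "Runtime or configuration error.",
}


def describe_exit_code(code: int) -> str:
    """Translate an exit code into the documented meaning."""
    # Assumption: anything outside the table is treated as an unexpected error.
    return EXIT_MEANINGS.get(code, f"Unexpected exit code {code}.")


def should_block(code: int) -> bool:
    """A gate passes only on exit code 0."""
    return code != 0


def run_eval() -> int:
    """Run the eval and return its exit code (requires agentops on PATH)."""
    return subprocess.run(["agentops", "eval", "run"]).returncode


for code in (0, 2, 1):
    print(code, should_block(code), describe_exit_code(code))
```

Distinguishing `2` from `1` matters in practice: a `2` is a quality regression to review, while a `1` usually points at credentials, networking, or configuration.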
 
 ## CI/CD integration
 
-See [docs/ci-github-actions.md](ci-github-actions.md) for how to gate on the exit code in a GitHub Actions workflow. The HTTP backend works identically to other backends from a CI perspective.
+After the local run passes, generate workflow files with:
+
+```powershell
+agentops workflow generate
+```
+
+The generated PR workflow uses the same `agentops eval run` exit codes to
+gate pull requests. See [ci-github-actions.md](ci-github-actions.md) for
+the GitHub environment and OIDC setup.
diff --git a/docs/tutorial-model-direct.md b/docs/tutorial-model-direct.md
index 1db40627..fe652c1b 100644
--- a/docs/tutorial-model-direct.md
+++ b/docs/tutorial-model-direct.md
@@ -34,10 +34,18 @@ the deployment, and skips agent infrastructure entirely.
 `.agentops/data/smoke.jsonl` (one JSON object per line):
 
 ```jsonl
-{"id":"1","input":"What is the capital of France?","expected":"Paris is the capital of France."}
-{"id":"2","input":"Which planet is known as the Red Planet?","expected":"Mars is the Red Planet."}
+{"id":"1","input":"Answer with exactly this sentence: Paris is the capital of France and one of Europe's major cultural centers.","expected":"Paris is the capital of France and one of Europe's major cultural centers."}
+{"id":"2","input":"Answer with exactly this sentence: Mars is known as the Red Planet because iron-rich dust gives its surface a reddish color.","expected":"Mars is known as the Red Planet because iron-rich dust gives its surface a reddish color."}
+{"id":"3","input":"Answer with exactly this sentence: Water has the chemical formula H2O because each molecule contains two hydrogen atoms and one oxygen atom.","expected":"Water has the chemical formula H2O because each molecule contains two hydrogen atoms and one oxygen atom."}
 ```
 
+The first model-direct smoke test intentionally uses short factual
+sentences with exact-answer instructions. That makes the default
+Similarity, F1, and Fluency thresholds meaningful: if this fails, you
+likely have a configuration/auth problem rather than a subjective-answer
+mismatch. Once the loop is working, replace these rows with realistic
+prompts for your application.
+
 The dataset has only `input` and `expected`, so AgentOps auto-selects
 the **model quality** evaluators: Coherence, Fluency, Similarity,
 F1Score, plus average latency.
diff --git a/docs/tutorial-quickstart.md b/docs/tutorial-quickstart.md
index 2169401c..53313b33 100644
--- a/docs/tutorial-quickstart.md
+++ b/docs/tutorial-quickstart.md
@@ -43,7 +43,8 @@ agentops init
 This creates two files:
 
 - `agentops.yaml` — your evaluation config (3 lines + comments).
-- `.agentops/data/smoke.jsonl` — a 3-row seed dataset.
+- `.agentops/data/smoke.jsonl` — a 3-row seed dataset with short,
+  deterministic factual answers.
 
 ## 3. Configure your agent
 
@@ -91,6 +92,12 @@ To view the report rendered (tables, ✅/❌), open it in VS Code and press `Ctr
 code .agentops/results/latest/report.md
 ```
 
+The seed dataset asks the target to answer with exact short factual
+sentences. That keeps the first run focused on proving the AgentOps loop
+works instead of debugging subjective wording differences. After the
+smoke test passes, replace the rows with domain-specific examples for
+your agent.
+
 The CLI prints `Threshold status: PASSED` (exit code `0`) or `FAILED` (exit code `2`) so you can wire it into CI directly.
 
 ## 5. Compare against a baseline
diff --git a/plugins/agentops/skills/agentops-eval/SKILL.md b/plugins/agentops/skills/agentops-eval/SKILL.md
index b6042523..5ba5eae4 100644
--- a/plugins/agentops/skills/agentops-eval/SKILL.md
+++ b/plugins/agentops/skills/agentops-eval/SKILL.md
@@ -60,8 +60,9 @@ agentops report generate --in <results.json>
 ```
 
 Open `.agentops/results/latest/report.md`. To compare two runs, hand both
-`results.json` files to the user and walk them through metric deltas;
-AgentOps does not ship a separate `eval compare` command.
+`results.json` files to the user, or run the next eval with
+`agentops eval run --baseline <previous-results.json>` so AgentOps adds a
+**Comparison vs Baseline** section to the report.
 
 ## Step 5 — (Optional) Publish to Foundry Evaluations
 
diff --git a/plugins/agentops/skills/agentops-report/SKILL.md b/plugins/agentops/skills/agentops-report/SKILL.md
index 72ed2bd4..a9593b0c 100644
--- a/plugins/agentops/skills/agentops-report/SKILL.md
+++ b/plugins/agentops/skills/agentops-report/SKILL.md
@@ -59,9 +59,9 @@ exit code of the original run reflects the gate:
   suggest concrete prompt or retrieval changes.
 - For latency regressions, look at `run_metrics.avg_latency_seconds` and
   per-row latency.
-- To compare two runs, diff the two `results.json` files at the metric
-  level and surface the deltas; AgentOps does not ship a separate
-  comparison CLI.
+- To compare a new run against a previous one, re-run with
+  `agentops eval run --baseline <previous-results.json>` and explain the
+  generated **Comparison vs Baseline** section.
 
 ## Guardrails
 
diff --git a/plugins/agentops/skills/agentops-workflow/SKILL.md b/plugins/agentops/skills/agentops-workflow/SKILL.md
index 8f8f77b2..d8e569f7 100644
--- a/plugins/agentops/skills/agentops-workflow/SKILL.md
+++ b/plugins/agentops/skills/agentops-workflow/SKILL.md
@@ -31,7 +31,8 @@ and have them generate `--kinds pr,dev,prod`.
 ## Step 0 — Prerequisites
 
 1. `pip install "agentops-toolkit @ git+https://github.com/Azure/agentops.git@develop"` if `agentops` is missing.
-2. `.agentops/run.yaml` exists and `agentops eval run` works locally.
+2. `agentops.yaml` exists at the project root and `agentops eval run`
+   works locally.
 3. The user's repo follows GitFlow (or is willing to). If not, ask which
    branches map to dev/qa/prod and adjust the `on:` triggers after
    generation.
@@ -126,18 +127,18 @@ This makes the eval gate a hard merge requirement.
 
 Common follow-ups:
 
-- **Tighten thresholds for QA/PROD** — copy `.agentops/run.yaml` to
-  `.agentops/run-qa.yaml` / `.agentops/run-prod.yaml` and tighten the
-  bundle thresholds. Point each workflow at its own config via the
+- **Tighten thresholds for QA/PROD** — copy `agentops.yaml` to
+  `agentops-qa.yaml` / `agentops-prod.yaml` and tighten the
+  `thresholds:` block. Point each workflow at its own config via the
   `inputs.config` default.
 - **Scheduled runs** — add a `schedule:` entry in `agentops-pr.yml` (or a
   new `agentops-nightly.yml`) to evaluate against `main` nightly.
-- **Matrix per scenario** — if the user has multiple `runs/*.yaml` files,
+- **Matrix per scenario** — if the user has multiple AgentOps config files,
   extend the eval job with `strategy.matrix.config:` and reference
   `${{ matrix.config }}`.
 - **Regression baseline** — wire the deploy templates to download the
   previous run's `results.json` artifact and call
-  `agentops eval compare`.
+  `agentops eval run --baseline <previous-results.json>`.
 
 ## Guardrails
 
diff --git a/pyproject.toml b/pyproject.toml
index 77b42ecf..bf4f32b7 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -29,6 +29,7 @@ agent = [
   "httpx>=0.27",
   "cryptography>=42",
   "azure-monitor-query>=1.3",
+  "azure-monitor-opentelemetry>=1.6",
   "azure-identity>=1.17",
   "azure-mgmt-cognitiveservices>=13.5",
   "azure-mgmt-monitor>=6.0",
diff --git a/src/agentops/agent/sources/azure_monitor.py b/src/agentops/agent/sources/azure_monitor.py
index 0a5e757f..9089ed6c 100644
--- a/src/agentops/agent/sources/azure_monitor.py
+++ b/src/agentops/agent/sources/azure_monitor.py
@@ -28,7 +28,7 @@ class AzureMonitorPayload:
 
 
 _REQUESTS_KQL = """
-requests
+union isfuzzy=true requests, dependencies
 | where timestamp > ago({lookback_days}d)
 | summarize
     request_count = count(),
diff --git a/src/agentops/pipeline/orchestrator.py b/src/agentops/pipeline/orchestrator.py
index cbb2ddab..eb264bad 100644
--- a/src/agentops/pipeline/orchestrator.py
+++ b/src/agentops/pipeline/orchestrator.py
@@ -34,6 +34,7 @@
 )
 from agentops.pipeline import comparison as comparison_module
 from agentops.pipeline import invocations, publisher, reporter, runtime, thresholds
+from agentops.utils import telemetry
 from agentops.utils.colors import style
 
 logger = logging.getLogger("agentops.pipeline")
@@ -65,6 +66,19 @@ def run_evaluation(
     options: RunOptions,
 ) -> RunResult:
     """Run a full evaluation and persist artifacts. Returns the RunResult."""
+    telemetry.init_tracing()
+    try:
+        return _run_evaluation(config, options=options)
+    finally:
+        telemetry.shutdown()
+
+
+def _run_evaluation(
+    config: AgentOpsConfig,
+    *,
+    options: RunOptions,
+) -> RunResult:
+    """Run a full evaluation after optional telemetry has been initialized."""
     started_at = datetime.now(timezone.utc)
     started_perf = time.perf_counter()
 
@@ -106,26 +120,40 @@ def run_evaluation(
         f"{_friendly_target_kind(target.kind)}: {style(target.raw, 'bold')}."
     )
 
-    rows: List[RowResult] = []
-    rules_by_metric = {rule.metric: rule for rule in threshold_rules}
-    for index, row in enumerate(dataset_rows):
-        rows.append(
-            _evaluate_row(
-                row=row,
-                index=index,
-                total=total,
-                target=target,
-                config=config,
-                evaluators=evaluator_runtimes,
-                timeout=options.timeout_seconds,
-                progress=progress,
-                rules_by_metric=rules_by_metric,
+    with telemetry.eval_run_span(
+        bundle_name=options.config_path.stem,
+        dataset_name=dataset_path.name,
+        backend_type=target.kind,
+        target=target.raw,
+        model=target.deployment,
+        agent_id=target.raw if target.kind.startswith("foundry") else None,
+    ) as run_span:
+        rows: List[RowResult] = []
+        rules_by_metric = {rule.metric: rule for rule in threshold_rules}
+        for index, row in enumerate(dataset_rows):
+            rows.append(
+                _evaluate_row(
+                    row=row,
+                    index=index,
+                    total=total,
+                    target=target,
+                    config=config,
+                    evaluators=evaluator_runtimes,
+                    timeout=options.timeout_seconds,
+                    progress=progress,
+                    rules_by_metric=rules_by_metric,
+                )
             )
-        )
 
-    aggregate = _aggregate_metrics(rows)
-    threshold_results = thresholds.evaluate(threshold_rules, aggregate)
-    summary = _summarize(rows, threshold_results)
+        aggregate = _aggregate_metrics(rows)
+        threshold_results = thresholds.evaluate(threshold_rules, aggregate)
+        summary = _summarize(rows, threshold_results)
+        telemetry.set_eval_run_result(
+            run_span,
+            passed=summary.overall_passed,
+            items_total=summary.items_total,
+            items_passed=summary.items_passed_all,
+        )
 
     finished_at = datetime.now(timezone.utc)
     duration = time.perf_counter() - started_perf
@@ -357,6 +385,24 @@ def _iter_dataset(path: Path) -> Iterable[Dict[str, Any]]:
 # ---------------------------------------------------------------------------
 
 
+def _metric_passes(rule: Threshold, value: float) -> bool:
+    if rule.value is None or rule.criteria in {"true", "false"}:
+        return True
+    target_v = float(rule.value)
+    c = rule.criteria
+    if c == ">=":
+        return value >= target_v
+    if c == ">":
+        return value > target_v
+    if c == "<=":
+        return value <= target_v
+    if c == "<":
+        return value < target_v
+    if c == "==":
+        return value == target_v
+    return True
+
+
 def _evaluate_row(
     *,
     row: Dict[str, Any],
@@ -374,57 +420,84 @@ def _evaluate_row(
     if len(preview) > 80:
         preview = preview[:77] + "..."
     progress(f"{label} invoking target: {preview!r}")
+    expected = row.get("expected")
+    expected_text = str(expected) if expected is not None else None
 
-    try:
-        invocation = invocations.invoke(target, config, row, timeout=timeout)
-    except Exception as exc:  # noqa: BLE001
-        logger.warning("row %d invocation failed: %s", index, exc)
-        progress(f"{label} {style('invocation FAILED', 'bold', 'red')}: {exc}")
-        return RowResult(
-            row_index=index,
-            input=str(row.get("input", "")),
-            expected=row.get("expected"),
-            response="",
-            context=row.get("context"),
-            error=str(exc),
+    with telemetry.eval_item_span(
+        row_index=index,
+        input_text=str(row.get("input", "")),
+        expected_text=expected_text,
+    ) as item_span:
+        try:
+            with telemetry.agent_invoke_span(
+                target="agent" if target.kind.startswith("foundry") else "model",
+                model=target.deployment,
+                agent_id=target.raw if target.kind.startswith("foundry") else None,
+                agent_name=target.name,
+                agent_version=target.version,
+            ) as invoke_span:
+                invocation = invocations.invoke(target, config, row, timeout=timeout)
+                telemetry.set_agent_invoke_result(
+                    invoke_span,
+                    response_model=target.deployment,
+                )
+        except Exception as exc:  # noqa: BLE001
+            telemetry.set_eval_item_result(item_span, passed=False)
+            logger.warning("row %d invocation failed: %s", index, exc)
+            progress(f"{label} {style('invocation FAILED', 'bold', 'red')}: {exc}")
+            return RowResult(
+                row_index=index,
+                input=str(row.get("input", "")),
+                expected=row.get("expected"),
+                response="",
+                context=row.get("context"),
+                error=str(exc),
+            )
+
+        tool_count = len(invocation.tool_calls) if invocation.tool_calls else 0
+        progress(
+            f"{label} replied in {style(f'{invocation.latency_seconds:.2f}s', 'cyan')} "
+            f"({tool_count} tool call(s)); scoring..."
         )
 
-    tool_count = len(invocation.tool_calls) if invocation.tool_calls else 0
-    progress(
-        f"{label} replied in {style(f'{invocation.latency_seconds:.2f}s', 'cyan')} "
-        f"({tool_count} tool call(s)); scoring..."
-    )
+        metrics: List[RowMetric] = []
+        for evaluator in evaluators:
+            metric = runtime.run_evaluator(
+                evaluator,
+                row=row,
+                response=invocation.response,
+                latency_seconds=invocation.latency_seconds,
+                actual_tool_calls=invocation.tool_calls,
+            )
+            metrics.append(metric)
 
-    metrics: List[RowMetric] = []
-    for evaluator in evaluators:
-        metric = runtime.run_evaluator(
-            evaluator,
-            row=row,
-            response=invocation.response,
-            latency_seconds=invocation.latency_seconds,
-            actual_tool_calls=invocation.tool_calls,
+            rule = (rules_by_metric or {}).get(metric.name)
+            metric_passed = (
+                None
+                if metric.value is None or rule is None
+                else _metric_passes(rule, float(metric.value))
+            )
+            telemetry.record_evaluator_span(
+                evaluator_name=evaluator.preset.name,
+                builtin_name=metric.name,
+                source=(
+                    "local"
+                    if evaluator.preset.class_name == "_latency"
+                    else "azure-ai-evaluation"
+                ),
+                score=float(metric.value) if metric.value is not None else 0.0,
+                threshold=rule.value if rule is not None else None,
+                criteria=rule.criteria if rule is not None else None,
+                passed=metric_passed,
+            )
+
+        telemetry.set_eval_item_result(
+            item_span,
+            passed=all(metric.error is None for metric in metrics),
         )
-        metrics.append(metric)
 
     rules = rules_by_metric or {}
 
-    def _passes(rule: Threshold, value: float) -> bool:
-        if rule.value is None or rule.criteria in {"true", "false"}:
-            return True
-        target_v = float(rule.value)
-        c = rule.criteria
-        if c == ">=":
-            return value >= target_v
-        if c == ">":
-            return value > target_v
-        if c == "<=":
-            return value <= target_v
-        if c == "<":
-            return value < target_v
-        if c == "==":
-            return value == target_v
-        return True
-
     def _format_metric(m: RowMetric) -> str:
         if isinstance(m.value, (int, float)):
             rule = rules.get(m.name)
@@ -433,7 +506,7 @@ def _format_metric(m: RowMetric) -> str:
                 # No user threshold for this metric: keep value neutral
                 # so the line stays readable.
                 return f"{m.name}={text}"
-            color = "green" if _passes(rule, float(m.value)) else "red"
+            color = "green" if _metric_passes(rule, float(m.value)) else "red"
             return f"{m.name}={style(text, color)}"
         if m.error:
             return f"{m.name}={style('ERR', 'red')}"
diff --git a/src/agentops/templates/skills/agentops-eval/SKILL.md b/src/agentops/templates/skills/agentops-eval/SKILL.md
index b6042523..5ba5eae4 100644
--- a/src/agentops/templates/skills/agentops-eval/SKILL.md
+++ b/src/agentops/templates/skills/agentops-eval/SKILL.md
@@ -60,8 +60,9 @@ agentops report generate --in <results.json>
 ```
 
 Open `.agentops/results/latest/report.md`. To compare two runs, hand both
-`results.json` files to the user and walk them through metric deltas;
-AgentOps does not ship a separate `eval compare` command.
+`results.json` files to the user, or run the next eval with
+`agentops eval run --baseline <previous-results.json>` so AgentOps adds a
+**Comparison vs Baseline** section to the report.
 
 ## Step 5 — (Optional) Publish to Foundry Evaluations
 
diff --git a/src/agentops/templates/skills/agentops-report/SKILL.md b/src/agentops/templates/skills/agentops-report/SKILL.md
index 72ed2bd4..a9593b0c 100644
--- a/src/agentops/templates/skills/agentops-report/SKILL.md
+++ b/src/agentops/templates/skills/agentops-report/SKILL.md
@@ -59,9 +59,9 @@ exit code of the original run reflects the gate:
   suggest concrete prompt or retrieval changes.
 - For latency regressions, look at `run_metrics.avg_latency_seconds` and
   per-row latency.
-- To compare two runs, diff the two `results.json` files at the metric
-  level and surface the deltas; AgentOps does not ship a separate
-  comparison CLI.
+- To compare a new run against a previous one, re-run with
+  `agentops eval run --baseline <previous-results.json>` and explain the
+  generated **Comparison vs Baseline** section.
 
 ## Guardrails
 
diff --git a/src/agentops/templates/skills/agentops-workflow/SKILL.md b/src/agentops/templates/skills/agentops-workflow/SKILL.md
index 8f8f77b2..d8e569f7 100644
--- a/src/agentops/templates/skills/agentops-workflow/SKILL.md
+++ b/src/agentops/templates/skills/agentops-workflow/SKILL.md
@@ -31,7 +31,8 @@ and have them generate `--kinds pr,dev,prod`.
 ## Step 0 — Prerequisites
 
 1. `pip install "agentops-toolkit @ git+https://github.com/Azure/agentops.git@develop"` if `agentops` is missing.
-2. `.agentops/run.yaml` exists and `agentops eval run` works locally.
+2. `agentops.yaml` exists at the project root and `agentops eval run`
+   works locally.
 3. The user's repo follows GitFlow (or is willing to). If not, ask which
    branches map to dev/qa/prod and adjust the `on:` triggers after
    generation.
@@ -126,18 +127,18 @@ This makes the eval gate a hard merge requirement.
 
 Common follow-ups:
 
-- **Tighten thresholds for QA/PROD** — copy `.agentops/run.yaml` to
-  `.agentops/run-qa.yaml` / `.agentops/run-prod.yaml` and tighten the
-  bundle thresholds. Point each workflow at its own config via the
+- **Tighten thresholds for QA/PROD** — copy `agentops.yaml` to
+  `agentops-qa.yaml` / `agentops-prod.yaml` and tighten the
+  `thresholds:` block. Point each workflow at its own config via the
   `inputs.config` default.
 - **Scheduled runs** — add a `schedule:` entry in `agentops-pr.yml` (or a
   new `agentops-nightly.yml`) to evaluate against `main` nightly.
-- **Matrix per scenario** — if the user has multiple `runs/*.yaml` files,
+- **Matrix per scenario** — if the user has multiple AgentOps config files,
   extend the eval job with `strategy.matrix.config:` and reference
   `${{ matrix.config }}`.
 - **Regression baseline** — wire the deploy templates to download the
   previous run's `results.json` artifact and call
-  `agentops eval compare`.
+  `agentops eval run --baseline <previous-results.json>`.
 
 ## Guardrails
 
diff --git a/src/agentops/templates/smoke.jsonl b/src/agentops/templates/smoke.jsonl
index b2246374..c28695b8 100644
--- a/src/agentops/templates/smoke.jsonl
+++ b/src/agentops/templates/smoke.jsonl
@@ -1,3 +1,3 @@
-{"input": "What is AgentOps?", "expected": "AgentOps is a CLI for evaluating Foundry agents."}
-{"input": "Which formats does it produce?", "expected": "It writes results.json and report.md."}
-{"input": "How do I configure thresholds?", "expected": "Use the 'thresholds' map in agentops.yaml."}
+{"input": "Answer with exactly this sentence: Paris is the capital of France and one of Europe's major cultural centers.", "expected": "Paris is the capital of France and one of Europe's major cultural centers."}
+{"input": "Answer with exactly this sentence: Mars is known as the Red Planet because iron-rich dust gives its surface a reddish color.", "expected": "Mars is known as the Red Planet because iron-rich dust gives its surface a reddish color."}
+{"input": "Answer with exactly this sentence: Water has the chemical formula H2O because each molecule contains two hydrogen atoms and one oxygen atom.", "expected": "Water has the chemical formula H2O because each molecule contains two hydrogen atoms and one oxygen atom."}
diff --git a/src/agentops/utils/telemetry.py b/src/agentops/utils/telemetry.py
index c9769c5f..09f20d3b 100644
--- a/src/agentops/utils/telemetry.py
+++ b/src/agentops/utils/telemetry.py
@@ -1,8 +1,9 @@
 """Optional OpenTelemetry instrumentation for AgentOps evaluation runs.
 
 All OpenTelemetry imports are **lazy** — they only happen when tracing is
-enabled via the ``AGENTOPS_OTLP_ENDPOINT`` environment variable.  When the
-variable is unset, every public function in this module is a no-op.
+enabled via ``APPLICATIONINSIGHTS_CONNECTION_STRING`` or
+``AGENTOPS_APPLICATIONINSIGHTS_CONNECTION_STRING`` (Azure Monitor), or via
+``AGENTOPS_OTLP_ENDPOINT``. When none is set, every public function is a no-op.
 
 Schema design follows three OTel semantic convention layers:
 https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/
@@ -26,12 +27,12 @@
 
 
 def is_enabled() -> bool:
-    """Return True when OTLP tracing has been initialised."""
+    """Return True when tracing has been initialised."""
     return _tracing_enabled
 
 
 def init_tracing() -> None:
-    """Initialise the OTLP exporter if ``AGENTOPS_OTLP_ENDPOINT`` is set.
+    """Initialise tracing when Azure Monitor or OTLP export is configured.
 
     Safe to call multiple times; only the first call has an effect.
     """
@@ -40,12 +41,37 @@ def init_tracing() -> None:
     if _tracing_enabled:
         return
 
-    endpoint = os.getenv("AGENTOPS_OTLP_ENDPOINT")
-    if not endpoint:
+    appinsights_connection_string = os.getenv(
+        "APPLICATIONINSIGHTS_CONNECTION_STRING"
+    ) or os.getenv("AGENTOPS_APPLICATIONINSIGHTS_CONNECTION_STRING")
+    otlp_endpoint = os.getenv("AGENTOPS_OTLP_ENDPOINT")
+    if not appinsights_connection_string and not otlp_endpoint:
         return
 
     try:
         from opentelemetry import trace
+    except ImportError:
+        # opentelemetry not installed — tracing stays disabled
+        return
+
+    if appinsights_connection_string:
+        try:
+            from azure.monitor.opentelemetry import configure_azure_monitor
+
+            configure_azure_monitor(
+                connection_string=appinsights_connection_string,
+            )
+            _tracer = trace.get_tracer("agentops")
+            _tracing_enabled = True
+            return
+        except ImportError:
+            # Azure Monitor exporter not installed — try OTLP below if configured.
+            pass
+
+    if not otlp_endpoint:
+        return
+
+    try:
         from opentelemetry.exporter.otlp.proto.http.trace_exporter import (
             OTLPSpanExporter,
         )
@@ -63,14 +89,14 @@ def init_tracing() -> None:
         )
 
         provider = TracerProvider(resource=resource)
-        exporter = OTLPSpanExporter(endpoint=endpoint + "/v1/traces")
+        exporter = OTLPSpanExporter(endpoint=otlp_endpoint + "/v1/traces")
         provider.add_span_processor(BatchSpanProcessor(exporter))
         trace.set_tracer_provider(provider)
 
         _tracer = trace.get_tracer("agentops")
         _tracing_enabled = True
     except ImportError:
-        # opentelemetry not installed — tracing stays disabled
+        # OTLP exporter not installed — tracing stays disabled
         pass
 
 
@@ -172,7 +198,7 @@ def eval_item_span(
         yield None
         return
 
-    from opentelemetry.trace import SpanKind
+    from opentelemetry.trace import SpanKind, StatusCode
 
     _label = f"eval_item {row_index}"
     if input_text:
@@ -183,7 +209,7 @@ def eval_item_span(
 
     with _tracer.start_as_current_span(
         _label,
-        kind=SpanKind.INTERNAL,
+        kind=SpanKind.SERVER,
     ) as span:
         # CICD task attributes
         span.set_attribute("cicd.pipeline.task.name", "eval_item")
@@ -196,17 +222,27 @@ def eval_item_span(
         if expected_text:
             span.set_attribute("agentops.eval.item.expected", expected_text)
 
-        yield span
+        try:
+            yield span
+        except Exception as exc:
+            span.set_attribute("cicd.pipeline.task.run.result", "failure")
+            span.set_attribute("agentops.eval.item.passed", False)
+            span.set_status(StatusCode.ERROR, str(exc))
+            span.record_exception(exc)
+            raise
 
 
 def set_eval_item_result(span: Any, *, passed: bool) -> None:
     """Set final result on an eval item span."""
     if span is None:
         return
+    from opentelemetry.trace import StatusCode
+
     span.set_attribute(
         "cicd.pipeline.task.run.result", "success" if passed else "failure"
     )
     span.set_attribute("agentops.eval.item.passed", passed)
+    span.set_status(StatusCode.OK if passed else StatusCode.ERROR)
 
 
 @contextmanager
diff --git a/tests/unit/test_telemetry.py b/tests/unit/test_telemetry.py
index cec0bd22..24fec39f 100644
--- a/tests/unit/test_telemetry.py
+++ b/tests/unit/test_telemetry.py
@@ -3,10 +3,18 @@
 from __future__ import annotations
 
 import os
+import sys
+import types
+from pathlib import Path
 from unittest.mock import MagicMock, patch
 
 import pytest
 
+from agentops.agent.config import AzureMonitorSourceConfig
+from agentops.agent.sources import azure_monitor
+from agentops.core.agentops_config import AgentOpsConfig
+from agentops.pipeline.orchestrator import RunOptions, run_evaluation
+from agentops.utils import telemetry
 from agentops.utils.telemetry import (
     eval_item_span,
     eval_run_span,
@@ -265,3 +273,153 @@ def test_eval_run_span_name(self) -> None:
         self.mock_tracer.start_as_current_span.assert_called_once()
         span_name = self.mock_tracer.start_as_current_span.call_args.args[0]
         assert span_name == "RUN my_bundle"
+
+
+def test_application_insights_connection_string_initializes_azure_monitor(
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    calls: dict[str, str] = {}
+
+    trace_module = types.ModuleType("opentelemetry.trace")
+    trace_module.get_tracer = lambda name: ("tracer", name)  # type: ignore[attr-defined]
+
+    opentelemetry_module = types.ModuleType("opentelemetry")
+    opentelemetry_module.trace = trace_module  # type: ignore[attr-defined]
+
+    azure_module = types.ModuleType("azure")
+    azure_monitor_module = types.ModuleType("azure.monitor")
+    azure_monitor_otel_module = types.ModuleType("azure.monitor.opentelemetry")
+
+    def configure_azure_monitor(*, connection_string: str) -> None:
+        calls["connection_string"] = connection_string
+
+    setattr(
+        azure_monitor_otel_module,
+        "configure_azure_monitor",
+        configure_azure_monitor,
+    )
+
+    monkeypatch.setitem(sys.modules, "opentelemetry", opentelemetry_module)
+    monkeypatch.setitem(sys.modules, "opentelemetry.trace", trace_module)
+    monkeypatch.setitem(sys.modules, "azure", azure_module)
+    monkeypatch.setitem(sys.modules, "azure.monitor", azure_monitor_module)
+    monkeypatch.setitem(
+        sys.modules, "azure.monitor.opentelemetry", azure_monitor_otel_module
+    )
+    monkeypatch.setattr(telemetry, "_tracer", None)
+    monkeypatch.setattr(telemetry, "_tracing_enabled", False)
+    monkeypatch.setenv(
+        "APPLICATIONINSIGHTS_CONNECTION_STRING",
+        "InstrumentationKey=00000000-0000-0000-0000-000000000000",
+    )
+    monkeypatch.delenv("AGENTOPS_OTLP_ENDPOINT", raising=False)
+
+    init_tracing()
+
+    assert calls == {
+        "connection_string": "InstrumentationKey=00000000-0000-0000-0000-000000000000"
+    }
+    assert is_enabled() is True
+
+
+def test_azure_monitor_queries_requests_and_dependencies(
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    captured: dict[str, str | None] = {}
+
+    azure_module = types.ModuleType("azure")
+    identity_module = types.ModuleType("azure.identity")
+    monitor_module = types.ModuleType("azure.monitor")
+    query_module = types.ModuleType("azure.monitor.query")
+
+    class DefaultAzureCredential:
+        def __init__(self, **_kwargs: object) -> None:
+            pass
+
+    class LogsQueryStatus:
+        FAILURE = "Failure"
+
+    class Column:
+        def __init__(self, name: str) -> None:
+            self.name = name
+
+    class Table:
+        columns = [
+            Column("request_count"),
+            Column("error_count"),
+            Column("avg_duration_ms"),
+            Column("p95_duration_ms"),
+        ]
+        rows = [[2, 1, 1000.0, 2500.0]]
+
+    class Response:
+        status = "Success"
+        tables = [Table()]
+
+    class LogsQueryClient:
+        def __init__(self, _credential: object) -> None:
+            pass
+
+        def query_resource(
+            self,
+            *,
+            resource_id: str,
+            query: str,
+            timespan: object,
+        ) -> Response:
+            captured["resource_id"] = resource_id
+            captured["query"] = query
+            captured["timespan"] = str(timespan)
+            return Response()
+
+    identity_module.DefaultAzureCredential = DefaultAzureCredential  # type: ignore[attr-defined]
+    query_module.LogsQueryClient = LogsQueryClient  # type: ignore[attr-defined]
+    query_module.LogsQueryStatus = LogsQueryStatus  # type: ignore[attr-defined]
+
+    monkeypatch.setitem(sys.modules, "azure", azure_module)
+    monkeypatch.setitem(sys.modules, "azure.identity", identity_module)
+    monkeypatch.setitem(sys.modules, "azure.monitor", monitor_module)
+    monkeypatch.setitem(sys.modules, "azure.monitor.query", query_module)
+
+    payload = azure_monitor.collect_azure_monitor(
+        AzureMonitorSourceConfig(
+            enabled=True,
+            app_insights_resource_id=(
+                "/subscriptions/000/resourceGroups/rg/providers/"
+                "Microsoft.Insights/components/appi"
+            ),
+        ),
+        lookback_days=7,
+    )
+
+    assert "union isfuzzy=true requests, dependencies" in str(captured["query"])
+    assert payload.diagnostics["status"] == "ok"
+    assert payload.request_count == 2
+    assert payload.error_count == 1
+    assert payload.error_rate == 0.5
+    assert payload.avg_duration_seconds == 1.0
+    assert payload.p95_duration_seconds == 2.5
+
+
+def test_run_evaluation_flushes_telemetry_on_error(
+    monkeypatch: pytest.MonkeyPatch,
+    tmp_path: Path,
+) -> None:
+    events: list[str] = []
+    monkeypatch.setattr(telemetry, "init_tracing", lambda: events.append("init"))
+    monkeypatch.setattr(telemetry, "shutdown", lambda: events.append("shutdown"))
+
+    config = AgentOpsConfig(
+        version=1,
+        agent="model:gpt-4o-mini",
+        dataset=tmp_path / "missing.jsonl",
+    )
+    options = RunOptions(
+        config_path=tmp_path / "agentops.yaml",
+        output_dir=tmp_path / "out",
+    )
+
+    with pytest.raises(FileNotFoundError):
+        run_evaluation(config, options=options)
+
+    assert events == ["init", "shutdown"]
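
Review note: the exporter precedence that `init_tracing()` implements in this change can be hard to follow from the hunks alone. The sketch below restates it as a pure function, as read from the diff. `select_exporter` and its three `*_installed` flags are hypothetical names introduced here for illustration only; they are not part of AgentOps, and the flags stand in for the `ImportError` branches in the real code.

```python
from __future__ import annotations

from typing import Mapping, Optional


def select_exporter(
    env: Mapping[str, str],
    *,
    otel_installed: bool = True,           # hypothetical: `opentelemetry` importable
    azure_monitor_installed: bool = True,  # hypothetical: `azure.monitor.opentelemetry` importable
    otlp_installed: bool = True,           # hypothetical: OTLP HTTP exporter importable
) -> Optional[str]:
    """Return "azure_monitor", "otlp", or None when tracing stays disabled."""
    # Either connection-string variable enables the Azure Monitor path.
    connection_string = env.get("APPLICATIONINSIGHTS_CONNECTION_STRING") or env.get(
        "AGENTOPS_APPLICATIONINSIGHTS_CONNECTION_STRING"
    )
    otlp_endpoint = env.get("AGENTOPS_OTLP_ENDPOINT")

    # Nothing configured, or the core OpenTelemetry package is missing.
    if not connection_string and not otlp_endpoint:
        return None
    if not otel_installed:
        return None

    # Azure Monitor wins when both are configured and its package is present.
    if connection_string and azure_monitor_installed:
        return "azure_monitor"

    # Otherwise fall back to OTLP, but only if an endpoint was configured.
    if otlp_endpoint and otlp_installed:
        return "otlp"
    return None


if __name__ == "__main__":
    both = {
        "APPLICATIONINSIGHTS_CONNECTION_STRING": "InstrumentationKey=example",
        "AGENTOPS_OTLP_ENDPOINT": "http://localhost:4318",
    }
    print(select_exporter(both))                                 # azure_monitor
    print(select_exporter(both, azure_monitor_installed=False))  # otlp
    print(select_exporter({}))                                   # None
```

One consequence worth checking in review: with only a connection string set and the Azure Monitor package missing, tracing is silently disabled rather than reported.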