diff --git a/README.md b/README.md index 11459a57..821b0478 100644 --- a/README.md +++ b/README.md @@ -117,6 +117,7 @@ The report grows a `Comparison vs Baseline` section with per-metric deltas. - [Quickstart tutorial](https://github.com/Azure/agentops/blob/main/docs/tutorial-quickstart.md) — bootstrap a workspace and run one evaluation. - [End-to-end tutorial](https://github.com/Azure/agentops/blob/main/docs/tutorial-end-to-end.md) — full do-it-yourself tour: Foundry hosted agent, baseline comparison, GitFlow CI/CD, watchdog. +- [Copilot skills tutorial](https://github.com/Azure/agentops/blob/main/docs/tutorial-copilot-skills.md) — use AgentOps skills to have Copilot configure, run, explain, and wire evals into CI. - Per-scenario tutorials: - [Foundry hosted agent](https://github.com/Azure/agentops/blob/main/docs/tutorial-basic-foundry-agent.md) - [Model-direct](https://github.com/Azure/agentops/blob/main/docs/tutorial-model-direct.md) diff --git a/docs/ci-github-actions.md b/docs/ci-github-actions.md index f454ab0b..88dc91a9 100644 --- a/docs/ci-github-actions.md +++ b/docs/ci-github-actions.md @@ -220,18 +220,18 @@ agentops workflow generate --dir # different repo root ## Customisation tips -- **Tighten thresholds for QA / PROD** — copy `.agentops/run.yaml` to - `.agentops/run-qa.yaml` / `.agentops/run-prod.yaml` and tighten - thresholds in the bundle. Update the `inputs.config` default in the +- **Tighten thresholds for QA / PROD** - copy `agentops.yaml` to + `agentops-qa.yaml` / `agentops-prod.yaml` and tighten the + `thresholds:` block. Update the `inputs.config` default in the matching workflow file. - **Scheduled runs** — add a `schedule:` entry in `agentops-pr.yml` (or a new file) to evaluate against `main` nightly. 
-- **Matrix per scenario** — if you have multiple `runs/*.yaml`, extend +- **Matrix per scenario** - if you have multiple AgentOps config files, extend the eval job with `strategy.matrix.config:` and reference `${{ matrix.config }}` in the eval step. -- **Regression baseline** — wire deploy templates to download the +- **Regression baseline** - wire deploy templates to download the previous run's `results.json` artifact and call - `agentops eval compare` between the two. + `agentops eval run --baseline `. ## Migration from the older 3-template layout diff --git a/docs/concepts.md b/docs/concepts.md index ffe692f4..8598fefa 100644 --- a/docs/concepts.md +++ b/docs/concepts.md @@ -1,32 +1,31 @@ # Concepts -This page explains the core building blocks of AgentOps and how they fit together. For the full schema reference and architecture details, see [how-it-works.md](how-it-works.md). +This page explains the core AgentOps building blocks. For the full schema +reference and architecture details, see [how-it-works.md](how-it-works.md). ## How an Evaluation Works ```mermaid flowchart TD - run["run.yaml
what, where, how to eval"] - bundle["Bundle
evaluators + thresholds"] + config["agentops.yaml
target, dataset, thresholds"] dataset["Dataset
JSONL rows: input, expected"] - runner(["Runner
resolves backend"]) + runner(["Runner
resolves target kind"]) foundry["Foundry
Backend"] http["HTTP
Backend"] - local["Local
Adapter"] + model["Model-direct
Backend"] evals(["Evaluators
score each response"]) results[/"results.json
(machine)"/] report[/"report.md
(human)"/] - run --> bundle - run --> dataset - bundle --> runner + config --> dataset + config --> runner dataset --> runner runner --> foundry runner --> http - runner --> local + runner --> model foundry --> evals http --> evals - local --> evals + model --> evals evals --> results evals --> report ``` @@ -37,87 +36,48 @@ flowchart TD ### Workspace -The `.agentops/` directory inside your project root. Created by `agentops init`, it holds all evaluation configuration: run configs, bundles, datasets, data files, and results. +Created by `agentops init`. The evaluation config lives in the flat +`agentops.yaml` file at the project root; `.agentops/` stores seed data, +run history, and optional supporting files. -``` +```text +agentops.yaml # flat config: agent, dataset, thresholds .agentops/ -├── config.yaml # workspace defaults -├── run.yaml # default run config -├── bundles/ # evaluation policies -├── datasets/ # dataset definitions (YAML) -├── data/ # dataset rows (JSONL) -└── results/ # run outputs + latest/ pointer +├── data/ # dataset rows (JSONL) +└── results/ # run outputs + latest/ pointer ``` -### Run Config - -A YAML file (typically `run.yaml`) that connects **what** to evaluate, **how** to reach it, and **which evaluators** to apply. It references one bundle and one dataset. - -A run config has three key dimensions: +### AgentOps Config -| Dimension | Values | Purpose | -|---|---|---| -| `target.type` | `agent`, `model` | What is being evaluated | -| `target.execution_mode` | `local`, `remote` | How AgentOps reaches the target | -| `target.endpoint.kind` | `foundry_agent`, `http` | Remote endpoint type (when remote) | +A YAML file named `agentops.yaml` that connects **what** to evaluate, +**which dataset** to use, and **which thresholds** gate the run. 
-Minimal example: +The minimum is: ```yaml version: 1 -target: - type: agent - hosting: foundry - execution_mode: remote - endpoint: - kind: foundry_agent - agent_id: my-agent:1 - model: gpt-4o - project_endpoint_env: AZURE_AI_FOUNDRY_PROJECT_ENDPOINT -bundle: - name: rag_quality_baseline -dataset: - name: smoke-rag +agent: "my-agent:1" +dataset: .agentops/data/smoke.jsonl ``` -See [how-it-works.md](how-it-works.md) for the full schema, all fields, and validation rules. - -### Bundle +Common `agent:` values: -A YAML file that defines **which evaluators** to run and **what thresholds** to enforce. Bundles are reusable — the same bundle can evaluate different targets across environments. +| Agent value | Target kind | +|---|---| +| `"support-bot:1"` | Foundry prompt agent (`name:version`) | +| `"https://api.example.com/chat"` | HTTP/JSON agent | +| `"model:gpt-4o-mini"` | Direct model deployment | -Each bundle contains: -- A list of evaluators (AI-assisted or local metrics) -- Threshold rules that determine pass/fail - -```yaml -# .agentops/bundles/model_quality_baseline.yaml -evaluators: - - name: SimilarityEvaluator - source: foundry - enabled: true -thresholds: - - metric: SimilarityEvaluator - operator: ">=" - value: 3.0 -``` - -See [bundles.md](bundles.md) for the full bundle authoring guide. +HTTP targets can add top-level mapping fields such as `request_field`, +`response_field`, `tool_calls_field`, `auth_header_env`, and +`extra_fields`. ### Dataset -A YAML config that points to a JSONL file containing evaluation rows. Each row has an `input` (the prompt) and an `expected` (the reference answer). Some scenarios add extra fields like `context` (RAG) or `tool_calls` (agent workflows). - -```yaml -# .agentops/datasets/smoke-model-direct.yaml -source: - type: file - path: ../data/smoke-model-direct.jsonl -format: - type: jsonl - input_field: input - expected_field: expected -``` +A JSONL file containing evaluation rows. 
Each row has an `input` prompt +and usually an `expected` reference answer. Some scenarios add extra +fields like `context` (RAG), `tool_definitions`, or `tool_calls` (agent +workflows). ```json {"id": "1", "input": "What is Python?", "expected": "Python is a programming language."} @@ -125,34 +85,42 @@ format: ### Evaluator -A scoring function that measures one aspect of the target's response. Evaluators can be: +A scoring function that measures one aspect of the target response. +Evaluators can be: -- **AI-assisted** (Foundry) — use a judge model to score responses on criteria like coherence, fluency, or groundedness (1-5 scale) -- **Local metrics** — computed without a model, such as `F1ScoreEvaluator` or `avg_latency_seconds` +- **AI-assisted** (Foundry) — use a judge model to score responses on + criteria like coherence, fluency, similarity, or groundedness. +- **Local metrics** — computed without a judge model, such as + `F1ScoreEvaluator` or `avg_latency_seconds`. -Evaluators are configured inside bundles. See [foundry-evaluation-sdk-built-in-evaluators.md](foundry-evaluation-sdk-built-in-evaluators.md) for the complete evaluator reference. +AgentOps auto-selects evaluators from the target kind and dataset shape. +Use `evaluators:` in `agentops.yaml` only when you need to override that +selection. See +[foundry-evaluation-sdk-built-in-evaluators.md](foundry-evaluation-sdk-built-in-evaluators.md) +for the complete evaluator reference. -### Backend +### Target resolver -The execution engine that sends dataset rows to the target and collects responses. The runner automatically selects the backend based on the run config: +The execution engine sends dataset rows to the target and collects +responses. AgentOps automatically selects the target kind from `agent:`. 
-| Execution Mode | Endpoint Kind | Backend | Use case | -|---|---|---|---| -| `remote` | `foundry_agent` | Foundry Backend | Foundry agents and models | -| `remote` | `http` | HTTP Backend | LangGraph, LangChain, ACA, custom REST | -| `local` | — | Local Adapter | In-process Python functions or subprocess | +| `agent:` shape | Target kind | Use case | +|---|---|---| +| `name:version` | Foundry prompt agent | Foundry Agent Service agents | +| `https://...` | HTTP/JSON endpoint | LangGraph, Agent Framework, ACA, AKS, custom REST | +| `model:` | Model-direct | Raw model deployment checks | ## Evaluation Scenarios -AgentOps ships starter bundles for common evaluation patterns. Each bundle pairs specific evaluators with default thresholds: +AgentOps auto-selects common evaluation patterns from the dataset: -| Scenario | Bundle | Key Evaluators | When to use | +| Scenario | Dataset signal | Key evaluators | When to use | |---|---|---|---| -| **Model Quality** | `model_quality_baseline` | Similarity, Coherence, Fluency, F1Score | Direct model deployment checks | -| **RAG** | `rag_quality_baseline` | Groundedness, Relevance, Retrieval, ResponseCompleteness | RAG pipelines with context retrieval | -| **Conversational** | `conversational_agent_baseline` | Coherence, Fluency, Relevance, Similarity | Chatbots, Q&A assistants | -| **Agent Workflow** | `agent_workflow_baseline` | TaskCompletion, ToolCallAccuracy, IntentResolution, ToolSelection | Agents with tool calling | -| **Content Safety** | `safe_agent_baseline` | Violence, Sexual, SelfHarm, HateUnfairness, ProtectedMaterial | Responsible AI checks | +| **Model Quality** | `input`, `expected` on `model:` | Similarity, Coherence, Fluency, F1Score | Direct model deployment checks | +| **RAG** | `context` | Groundedness, Relevance, Retrieval, ResponseCompleteness | RAG pipelines with context retrieval | +| **Conversational** | `input`, `expected` on an agent | Coherence, Fluency, Similarity/F1 where applicable | Chatbots, 
Q&A assistants | +| **Agent Workflow** | `tool_calls`, `tool_definitions` | ToolCallAccuracy, IntentResolution, TaskAdherence | Agents with tool calling | +| **Content Safety** | Explicit safety evaluators | Violence, Sexual, SelfHarm, HateUnfairness, ProtectedMaterial | Responsible AI checks | Each scenario has a dedicated tutorial: @@ -165,16 +133,21 @@ Each scenario has a dedicated tutorial: ## Configuration Model -Run configs use an orthogonal target model. The three key dimensions — `type`, `execution_mode`, and `endpoint.kind` — are independent. Additional optional fields: +`agentops.yaml` is the single source of truth. Keep it small and add only +the fields your target needs: -| Field | Values | When to use | -|---|---|---| -| `target.hosting` | `local`, `foundry`, `aks`, `containerapps` | Metadata: where the target runs | -| `target.framework` | `agent_framework`, `langgraph`, `custom` | Agent targets only | -| `target.agent_mode` | `prompt`, `hosted` | Foundry agents only | +```yaml +version: 1 +agent: "https://api.example.com/chat" +dataset: .agentops/data/support.jsonl + +request_field: message +response_field: text -**Bundle and dataset references** support two resolution modes: -- `name` — convention-based: resolves to `.agentops/bundles/.yaml` or `.agentops/datasets/.yaml` -- `path` — explicit relative path to the YAML file +thresholds: + coherence: ">=3" + avg_latency_seconds: "<=2" +``` -See [how-it-works.md](how-it-works.md) for the full schema, all endpoint fields, validation rules, and more configuration examples. +See [how-it-works.md](how-it-works.md) for the full schema, endpoint +fields, validation rules, and more examples. 
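The `agent:` shapes listed in the Target resolver table of `docs/concepts.md` can be sketched as a small classifier. This is a hypothetical illustration, not AgentOps's actual resolver: the function name `resolve_target_kind` and the returned labels are assumptions, and only the three shapes from the table are covered. Plain `http://` is accepted as well, on the assumption that local endpoints such as `http://127.0.0.1:8790/` resolve the same way as `https://` ones.

```python
import re

# Hypothetical labels; AgentOps may name its target kinds differently.
FOUNDRY = "foundry_prompt_agent"
HTTP = "http_json"
MODEL = "model_direct"

# name:version, with no slashes, whitespace, or extra colons.
_NAME_VERSION = re.compile(r"^[^:/\s]+:[^:/\s]+$")


def resolve_target_kind(agent: str) -> str:
    """Classify an `agent:` value by its shape (illustrative only)."""
    value = agent.strip()
    # Check the `model:` prefix first, because "model:gpt-4o-mini"
    # also matches the generic name:version pattern below.
    if value.startswith("model:") and len(value) > len("model:"):
        return MODEL
    if value.startswith(("http://", "https://")):
        return HTTP
    if _NAME_VERSION.match(value):
        return FOUNDRY
    raise ValueError(f"Unrecognised agent value: {agent!r}")
```

The order of the checks matters in this sketch: a `model:` value would otherwise be misread as a Foundry `name:version` agent.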
diff --git a/docs/how-it-works.md b/docs/how-it-works.md index 4f2a5624..d0684473 100644 --- a/docs/how-it-works.md +++ b/docs/how-it-works.md @@ -46,7 +46,7 @@ src/ │ ├── invocations.py # Per-row agent / model invocation strategies │ ├── thresholds.py # Threshold pass/fail evaluation │ ├── reporter.py # Markdown report generation - │ ├── comparison.py # `eval compare` two runs + │ ├── comparison.py # Baseline delta rendering for `eval run --baseline` │ ├── publisher.py # Classic Foundry publish (OneDP upload of metrics) │ └── cloud_publisher.py # New Foundry publish (server-side via OpenAI Evals API) │ @@ -108,7 +108,7 @@ When you run `agentops eval run`, the following happens step by step: |---|---|---| | `agentops init [--path DIR]` | Scaffold `.agentops/` workspace with starter config, bundles, datasets, and data. Also installs coding agent skills. | Available | | `agentops eval run` | Execute an evaluation (main command) | Available | -| `agentops eval compare --runs ID1,ID2` | Compare two past evaluation runs | Available | +| `agentops eval run --baseline ` | Run an eval and add a comparison against a previous result | Available | | `agentops skills install` | Install AgentOps coding agent skills (Copilot, Claude) into the target project | Available | | `agentops run list\|show` | List or inspect past runs | Planned (stub) | | `agentops run view [--entry N]` | Deep-inspect a run | Planned (stub) | diff --git a/docs/tutorial-agent-workflow.md b/docs/tutorial-agent-workflow.md index bb1b0d96..3cf86d6d 100644 --- a/docs/tutorial-agent-workflow.md +++ b/docs/tutorial-agent-workflow.md @@ -17,8 +17,8 @@ both of these row fields: When AgentOps sees `tool_calls` (or `tool_definitions`) in the dataset rows, it auto-selects the **agent workflow** evaluators: TaskCompletion, ToolCallAccuracy, IntentResolution, TaskAdherence, -plus the conversational baseline (Coherence, Fluency, Similarity, -F1Score, latency). 
+plus the conversational baseline metrics that apply to the target +(Coherence, Fluency, latency, and any explicitly configured text metric). ## 1. Bootstrap @@ -44,11 +44,11 @@ body: ```yaml version: 1 agent: "https://aca-weather-bot.example.com/" -http: - request_field: message - response_field: text - tool_calls_field: tool_calls dataset: .agentops/data/tools.jsonl + +request_field: message +response_field: text +tool_calls_field: tool_calls ``` `tool_calls_field` tells AgentOps where in the response JSON to find @@ -61,9 +61,9 @@ the structured tool calls (dot-path notation supported). {"id":"2","input":"How is the weather in Tokyo, Japan?","expected":"Calls get_weather with location='Tokyo, Japan'.","tool_calls":[{"type":"function_call","name":"get_weather","arguments":{"location":"Tokyo, Japan"}}]} ``` -You can additionally include `tool_definitions` to give the evaluator -the schema of every tool the agent should know about. This sharpens -the **ToolSelectionEvaluator** judgement. +Include `tool_definitions` when you evaluate tool-call accuracy. The +evaluator needs the schema of every tool the agent should know about; +repeat the catalogue on each JSONL row so every row is self-contained. ## 4. 
Run diff --git a/docs/tutorial-basic-foundry-agent.md b/docs/tutorial-basic-foundry-agent.md index 93b6ec26..0aab0543 100644 --- a/docs/tutorial-basic-foundry-agent.md +++ b/docs/tutorial-basic-foundry-agent.md @@ -219,4 +219,4 @@ The RAG scenario uses GroundednessEvaluator instead of SimilarityEvaluator becau - [Model-Direct Tutorial](tutorial-model-direct.md) — evaluate a model without agents - [RAG Tutorial](tutorial-rag.md) — evaluate retrieval-augmented responses - [Baseline Comparison Tutorial](tutorial-baseline-comparison.md) — compare runs and detect regressions -- [Copilot Skills Tutorial](tutorial-copilot-skills.md) — install skills for AI-assisted guidance +- [Copilot Skills Tutorial](tutorial-copilot-skills.md) — use the installed AgentOps skills to build an eval workflow with Copilot diff --git a/docs/tutorial-conversational-agent.md b/docs/tutorial-conversational-agent.md index b2080f3e..3f79048e 100644 --- a/docs/tutorial-conversational-agent.md +++ b/docs/tutorial-conversational-agent.md @@ -48,10 +48,10 @@ different field names, override them: ```yaml version: 1 agent: "https://api.example.com/chat" -http: - request_field: prompt - response_field: choices.0.message.content dataset: .agentops/data/chat.jsonl + +request_field: prompt +response_field: choices.0.message.content ``` ## 3. Dataset shape (`chat.jsonl`) @@ -67,8 +67,8 @@ auto-selects the **conversational baseline** evaluators: Coherence, Fluency, Similarity, F1Score, average latency. > Want to test multi-turn behaviour explicitly? Have your service -> accept a `history` field, then add `extra_fields: [history]` under -> `http:` and include a `history` array in each JSONL row. +> accept a `history` field, then add `extra_fields: [history]` to +> `agentops.yaml` and include a `history` array in each JSONL row. ## 4. 
Run diff --git a/docs/tutorial-copilot-skills.md b/docs/tutorial-copilot-skills.md new file mode 100644 index 00000000..7816c44c --- /dev/null +++ b/docs/tutorial-copilot-skills.md @@ -0,0 +1,292 @@ +# Tutorial — Copilot-assisted AgentOps workflow + +This tutorial shows how to use the AgentOps coding-agent skills as a +guided development workflow. Instead of memorizing the AgentOps schema, +you let Copilot inspect the project, generate the config and dataset, run +the eval, explain the report, and create the CI/CD workflow. + +The tutorial is still fully executable without guessing: each Copilot +prompt is followed by the concrete file or command you should expect. + +## What you will build + +- A small HTTP support agent that answers three customer-service + questions. +- Installed AgentOps skills under `.github/skills/`. +- A flat `agentops.yaml` generated from project context. +- A JSONL dataset generated for the agent's behavior. +- One passing local evaluation and a readable `report.md`. +- GitHub Actions workflow files generated from the skill-guided flow. + +## Prerequisites + +- Python 3.11 or later. +- GitHub Copilot Chat or Copilot CLI with repository context. +- Azure CLI login and a judge-model deployment for AI-assisted evaluators. + +```powershell +python -m venv .venv +.\.venv\Scripts\Activate.ps1 +python -m pip install -U pip +python -m pip install "agentops-toolkit @ git+https://github.com/Azure/agentops.git@develop" + +az login +$env:AZURE_AI_FOUNDRY_PROJECT_ENDPOINT = "https://.services.ai.azure.com/api/projects/" +$env:AZURE_OPENAI_ENDPOINT = "https://.openai.azure.com" +$env:AZURE_OPENAI_DEPLOYMENT = "gpt-4o-mini" +``` + +> If you are testing unreleased AgentOps changes locally, install from +> your checkout instead: +> +> ```powershell +> python -m pip install -e "C:\path\to\agentops[foundry,agent]" +> ``` + +## 1. 
Create the sample agent + +Create `support_agent.py`: + +```python +from http.server import BaseHTTPRequestHandler, HTTPServer +import json + + +RESPONSES = { + "Where is my order ORD-12345?": "Order ORD-12345 is in transit and expected to arrive tomorrow.", + "Can I return a damaged headset from ORD-77821?": "Yes. Start a return for ORD-77821 and choose damaged item as the reason.", + "How do I contact a human support agent?": "I can connect you to a human support agent for account or order issues.", +} + + +class Handler(BaseHTTPRequestHandler): + def do_POST(self): + length = int(self.headers.get("content-length", "0")) + body = json.loads(self.rfile.read(length)) + message = body.get("message", "") + text = RESPONSES.get(message, "I can help with order status, returns, and support escalation.") + + payload = json.dumps({"text": text}).encode("utf-8") + self.send_response(200) + self.send_header("content-type", "application/json") + self.send_header("content-length", str(len(payload))) + self.end_headers() + self.wfile.write(payload) + + +HTTPServer(("127.0.0.1", 8790), Handler).serve_forever() +``` + +Start it in a second terminal: + +```powershell +.\.venv\Scripts\Activate.ps1 +python support_agent.py +``` + +## 2. Initialize AgentOps and install skills + +```powershell +agentops init +agentops skills install --platform copilot --force +``` + +You should now have: + +```text +agentops.yaml +.agentops/data/smoke.jsonl +.github/skills/ + agentops-config/SKILL.md + agentops-dataset/SKILL.md + agentops-eval/SKILL.md + agentops-report/SKILL.md + agentops-workflow/SKILL.md +``` + +The skills are workflow instructions for Copilot. They tell Copilot how +to inspect the workspace, which AgentOps files to create, which commands +are valid, and when to ask for missing values instead of inventing them. + +## 3. Ask Copilot to configure AgentOps + +In Copilot Chat, ask: + +```text +Use the agentops-config skill. 
Inspect this project and create an +AgentOps config for the local HTTP support agent on port 8790. +``` + +Expected `agentops.yaml`: + +```yaml +version: 1 +agent: "http://127.0.0.1:8790/" +dataset: .agentops/data/support-agent.jsonl + +request_field: message +response_field: text + +thresholds: + coherence: ">=3" + fluency: ">=3" + similarity: ">=3" + avg_latency_seconds: "<=2" +``` + +Why this is the right config: + +- `agent` is the local HTTP endpoint. +- `request_field` matches `body.get("message")` in `support_agent.py`. +- `response_field` matches the returned JSON key `{ "text": ... }`. +- The thresholds are intentionally simple for the first smoke gate. + +## 4. Ask Copilot to generate the dataset + +In Copilot Chat, ask: + +```text +Use the agentops-dataset skill. Generate a small deterministic JSONL +dataset for the support agent behavior in support_agent.py. +``` + +Expected `.agentops/data/support-agent.jsonl`: + +```jsonl +{"input":"Where is my order ORD-12345?","expected":"Order ORD-12345 is in transit and expected to arrive tomorrow."} +{"input":"Can I return a damaged headset from ORD-77821?","expected":"The customer can start a return for ORD-77821 and choose damaged item as the reason."} +{"input":"How do I contact a human support agent?","expected":"The assistant can connect the customer to a human support agent for account or order issues."} +``` + +The dataset uses exact intents that the sample app implements. That makes +the first run a configuration smoke test: if it fails, you likely have a +field mapping, endpoint, auth, or environment problem rather than a +prompt-quality problem. + +## 5. Ask Copilot to run the eval + +In Copilot Chat, ask: + +```text +Use the agentops-eval skill. Run the evaluation and explain any failure. 
+``` + +Expected command: + +```powershell +agentops eval run +``` + +Expected outputs: + +```text +.agentops/results//results.json +.agentops/results//report.md +.agentops/results/latest/results.json +.agentops/results/latest/report.md +``` + +Exit code `0` means the config, dataset, HTTP agent, and thresholds all +worked. Exit code `2` means the run completed but one or more thresholds +failed. Exit code `1` means a runtime/configuration error. + +## 6. Ask Copilot to interpret the report + +In Copilot Chat, ask: + +```text +Use the agentops-report skill. Read the latest report and summarize the +strongest rows, weakest rows, and next improvement. +``` + +A useful answer should not just say "pass" or "fail". It should point to: + +- the threshold table in `.agentops/results/latest/report.md`; +- the lowest-scoring row or metric; +- whether latency is agent runtime or evaluator overhead; +- a concrete next change, such as improving an answer or tightening a + threshold after repeated passing runs. + +## 7. Ask Copilot to add the PR gate + +In Copilot Chat, ask: + +```text +Use the agentops-workflow skill. Generate the GitHub Actions workflow +files and tell me which GitHub environment variables are required. +``` + +Expected command: + +```powershell +agentops workflow generate +``` + +Expected workflow files: + +```text +.github/workflows/agentops-pr.yml +.github/workflows/agentops-deploy-dev.yml +.github/workflows/agentops-deploy-qa.yml +.github/workflows/agentops-deploy-prod.yml +``` + +For this HTTP tutorial, the PR gate needs the same evaluator-model values +you used locally: + +| GitHub variable | Purpose | +|---|---| +| `AZURE_CLIENT_ID` | OIDC identity used by `azure/login`. | +| `AZURE_TENANT_ID` | Tenant for the OIDC login. | +| `AZURE_SUBSCRIPTION_ID` | Azure subscription for the login. | +| `AZURE_AI_FOUNDRY_PROJECT_ENDPOINT` | Foundry project used by AI-assisted evaluators. | +| `AZURE_OPENAI_ENDPOINT` | Azure OpenAI endpoint for the judge model. 
| +| `AZURE_OPENAI_DEPLOYMENT` | Judge model deployment, for example `gpt-4o-mini`. | + +If your HTTP agent is remote and protected, also add the token variable +referenced by `auth_header_env`. + +Because this tutorial starts the sample agent on `127.0.0.1`, GitHub +Actions must start that process before `agentops eval run`. For this +sample repo, add this step between **Install AgentOps Toolkit** and +**Run AgentOps eval** in `agentops-pr.yml`: + +```yaml + - name: Start local tutorial agent + run: | + python support_agent.py & + sleep 2 +``` + +For a deployed ACA/AKS/App Service endpoint, skip that step and point +`agent:` at the public or private URL your runner can reach. + +## 8. Push the tutorial repo + +```powershell +git init -b main +git add . +git commit -m "feat: add Copilot-assisted AgentOps eval" +gh repo create "agentops-copilot-skills-" --public --source=. --push +``` + +The first PR against `main` or `develop` will run `agentops-pr.yml`. +When it finishes, open the workflow artifact or PR comment to view the +same `report.md` you inspected locally. + +## What Copilot should have learned + +The skills keep Copilot inside the AgentOps contract: + +- `agentops-config` creates a flat `agentops.yaml`, not legacy + `run.yaml` / bundle / dataset config files. +- `agentops-dataset` creates rows tailored to the app instead of generic + trivia. +- `agentops-eval` runs `agentops eval run` and respects exit codes. +- `agentops-report` turns metrics into actionable insights. +- `agentops-workflow` generates the standard GitFlow workflow scaffold + without inventing unsupported flags or commands. + +That is the intended AgentOps development loop: Copilot accelerates the +file creation and interpretation, while AgentOps supplies the repeatable +evaluation contract. 
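The exit-code contract from step 5 of the Copilot skills tutorial (`0` passed, `2` thresholds failed, `1` runtime or configuration error) can be sketched as a thin wrapper for scripts that call the CLI. This is a minimal sketch under stated assumptions: the helper names `describe_exit_code` and `run_eval` and the message strings are hypothetical, and only the `agentops eval run` command and the three exit codes come from the tutorial.

```python
import subprocess

# Exit codes as described in the tutorial; the wording is illustrative.
EXIT_MEANINGS = {
    0: "passed: config, dataset, agent, and thresholds all worked",
    2: "failed: run completed but one or more thresholds failed",
    1: "error: runtime or configuration problem",
}


def describe_exit_code(code: int) -> str:
    """Translate an `agentops eval run` exit code into a short message."""
    return EXIT_MEANINGS.get(code, f"unknown exit code {code}")


def run_eval() -> str:
    """Run `agentops eval run` in the current directory and describe the result.

    check=False keeps subprocess from raising on exit code 2, so a
    threshold failure is reported instead of treated as a crash.
    """
    completed = subprocess.run(["agentops", "eval", "run"], check=False)
    return describe_exit_code(completed.returncode)
```

Treating `2` separately from `1` lets a CI step report a quality regression differently from a broken setup.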
diff --git a/docs/tutorial-end-to-end.md b/docs/tutorial-end-to-end.md index ecdb4b9c..22d01ee8 100644 --- a/docs/tutorial-end-to-end.md +++ b/docs/tutorial-end-to-end.md @@ -84,7 +84,7 @@ Set the project endpoint up front so every command picks it up. ```powershell $env:AZURE_AI_FOUNDRY_PROJECT_ENDPOINT = "https://.services.ai.azure.com/api/projects/" -$env:AZURE_OPENAI_ENDPOINT = "https://.services.ai.azure.com" +$env:AZURE_OPENAI_ENDPOINT = "https://.openai.azure.com" $env:AZURE_OPENAI_DEPLOYMENT = "gpt-4o-mini" ``` @@ -92,20 +92,19 @@ $env:AZURE_OPENAI_DEPLOYMENT = "gpt-4o-mini" ```bash export AZURE_AI_FOUNDRY_PROJECT_ENDPOINT="https://.services.ai.azure.com/api/projects/" -export AZURE_OPENAI_ENDPOINT="https://.services.ai.azure.com" +export AZURE_OPENAI_ENDPOINT="https://.openai.azure.com" export AZURE_OPENAI_DEPLOYMENT="gpt-4o-mini" ``` -> **Watch out for two endpoint shapes.** On a Foundry "AI Services" -> account, both env vars start with the same hostname but the -> project endpoint includes `/api/projects/` while -> `AZURE_OPENAI_ENDPOINT` is **only** the hostname (no path). If you -> paste the project URL into `AZURE_OPENAI_ENDPOINT` the evaluators -> fail with `BadRequest: API version not supported`. AgentOps -> defaults the API version to a release that works against both -> New Foundry and classic Azure OpenAI; override with -> `AZURE_OPENAI_API_VERSION` only if your resource needs a specific -> version. +> **Watch out for two endpoint shapes.** The Foundry project endpoint +> uses the `*.services.ai.azure.com/api/projects/` shape. +> The evaluator model endpoint is the Azure OpenAI data-plane host, +> usually `*.openai.azure.com`, with **no path**. If you paste the +> project URL into `AZURE_OPENAI_ENDPOINT`, evaluators can fail with +> `BadRequest: API version not supported`. 
AgentOps defaults the API +> version to a release that works against both New Foundry and classic +> Azure OpenAI; override with `AZURE_OPENAI_API_VERSION` only if your +> resource needs a specific version. > The remaining shell snippets in this tutorial are written for > **PowerShell** (the default on Windows). bash / zsh users can @@ -204,8 +203,8 @@ You get: └── agentops-*/SKILL.md ``` -Open `.agentops/agentops.yaml` and configure it for the support -agent: +Open `agentops.yaml` at the project root and configure it for the +support agent: ```yaml version: 1 @@ -223,8 +222,9 @@ thresholds: coherence: ">=3" fluency: ">=3" similarity: ">=3" - # Latency budget. - avg_latency_seconds: "<=10" + # Lab-safe latency budget. Tool-calling Foundry agents can have + # occasional cold-start / orchestration spikes during a tutorial run. + avg_latency_seconds: "<=90" ``` The `agent: "name:version"` shape is recognised as a **Foundry hosted @@ -275,6 +275,13 @@ quality stack. When AgentOps loads the dataset it picks: | `CoherenceEvaluator` / `FluencyEvaluator` / `SimilarityEvaluator` / `F1ScoreEvaluator` | Standard text quality. | | `avg_latency_seconds` | End-to-end latency budget. | +> **Why is the latency budget 90 seconds?** The point of this first gate +> is to prove tool behavior, not to fail a learner because one Foundry +> row hit a transient cold-start or service-queue spike. Keep this +> tutorial gate broad, then tighten latency for your own production +> agent after you have baseline data. Step 9 shows how to use +> Application Insights and Watchdog for stricter p95 latency monitoring. + ## 5. Run your first evaluation ```powershell @@ -308,8 +315,10 @@ The report has four sections you will revisit often: debugging false-positive tool calls. - **Aggregate metrics** — averages across rows. - **Thresholds** — every rule from `agentops.yaml` with measured - value. With v1 you should see all the tool-calling thresholds in - the green. + value. 
With v1 you should see the tool-calling and text-quality + thresholds in the green. If latency is high but below the lab-safe + budget, keep going; you will inspect production-style p95 latency + with Watchdog later. The exit code is `0` (all thresholds passed) or `2` (one or more failed). `1` means a runtime error. @@ -441,9 +450,28 @@ git push -u origin develop ### Wire the GitHub Environments +At this point the eval works on your machine because your local Azure +login has access to Foundry and to the evaluator model. GitHub Actions is +a different machine, so you must give the workflow its own identity and +permissions. + The three workflows (`pr`, `deploy-dev`, `deploy-qa`, `deploy-prod`) -expect a GitHub **environment** per stage, each populated with the same -six variables and a federated credential so Azure trusts GitHub OIDC. +expect one GitHub **environment** per stage. Each environment stores the +variables the workflow needs and maps to one trusted Azure identity. + +| Piece | Why you need it | +|---|---| +| App registration + service principal | The Azure identity that GitHub Actions will impersonate. | +| GitHub environment variables | Non-secret configuration such as tenant, subscription, Foundry endpoint, and evaluator model endpoint. | +| Federated credential | The trust rule that allows GitHub OIDC tokens from this repo/environment to become Azure tokens. | +| Azure role assignments | The actual permissions to read the Foundry agent and call the Azure OpenAI judge model. | + +Think of the setup in two layers: + +1. **Authentication:** GitHub proves "this workflow is running from your + `support-bot-*` repo in the `dev`, `qa`, or `prod` environment". +2. **Authorization:** Azure checks whether that identity has roles on the + Foundry and Azure OpenAI resources. The next four snippets create everything end-to-end. Run them in order from the same PowerShell session you used above (so `$suffix` is still @@ -451,6 +479,17 @@ in scope). #### 1. 
Create the app registration GitHub will impersonate +This creates the Azure identity used by the workflows. There is no client +secret in this tutorial: GitHub will authenticate with OIDC instead of a +stored password. + +The command prints three values you will store as GitHub environment +variables: + +- `AZURE_CLIENT_ID` — which app registration GitHub should impersonate. +- `AZURE_TENANT_ID` — which Microsoft Entra tenant owns the app. +- `AZURE_SUBSCRIPTION_ID` — which Azure subscription the workflow should use. + ```powershell $app = az ad app create --display-name "support-bot-ci-$suffix" | ConvertFrom-Json az ad sp create --id $app.appId | Out-Null @@ -475,6 +514,23 @@ Write-Host "AZURE_SUBSCRIPTION_ID = $sub" #### 2. Create the three environments and push the variables +GitHub environments give each stage its own variable scope and its own +OIDC subject (`environment:dev`, `environment:qa`, `environment:prod`). +The PR gate intentionally runs in `dev`, so it reuses the same variables +and identity as the first deployment stage. + +This snippet creates the environments and stores the values the generated +workflows read through `vars.*`: + +| Variable | Where it comes from | Used for | +|---|---|---| +| `AZURE_TENANT_ID` | `az account show` | Tells `azure/login` which Entra tenant to authenticate against. | +| `AZURE_SUBSCRIPTION_ID` | `az account show` | Selects the Azure subscription for the workflow. | +| `AZURE_CLIENT_ID` | The app registration from step 1 | Tells `azure/login` which identity GitHub should impersonate. | +| `AZURE_AI_FOUNDRY_PROJECT_ENDPOINT` | Your local env var | Tells AgentOps where the hosted support agent lives. | +| `AZURE_OPENAI_ENDPOINT` | Your local env var | Tells evaluators where the judge model endpoint is. | +| `AZURE_OPENAI_DEPLOYMENT` | The deployment name, e.g. `gpt-4o-mini` | Tells evaluators which judge model deployment to call. 
| + ```powershell $foundry = $env:AZURE_AI_FOUNDRY_PROJECT_ENDPOINT $aoai = $env:AZURE_OPENAI_ENDPOINT @@ -495,16 +551,25 @@ foreach ($envName in @("dev","qa","prod")) { > **Prefer the portal?** Open your repo on github.com → **Settings → > Environments → New environment** and create `dev`, `qa`, and `prod`. -> For each one, click **Add variable** and add the six rows from the -> table at the top of this section. +> For each one, click **Add variable** and add the six variables listed +> above. #### 3. Add federated credentials so Azure trusts GitHub OIDC -One credential per environment. The PR gate workflow runs **inside the -`dev` environment** (so it inherits the same `dev` variables and OIDC -subject) — no separate `pull_request` credential is needed. The JSON is -written to a temp file because `az` does not parse inline JSON reliably -under PowerShell: +The variables above tell GitHub which Azure identity to use, but Azure +still needs to trust this repository. A federated credential is that trust +rule. + +Each credential says: "Accept tokens issued by GitHub for this exact repo +and this exact environment." That is why the `subject` values include +`environment:dev`, `environment:qa`, and `environment:prod`. + +The PR gate workflow runs **inside the `dev` environment**, so it inherits +the same `dev` variables and OIDC subject — no separate `pull_request` +credential is needed. + +The JSON is written to a temp file because `az` does not parse inline JSON +reliably under PowerShell: ```powershell $subjects = @{ @@ -538,6 +603,19 @@ foreach ($name in $subjects.Keys) { #### 4. Grant the app the roles it needs +OIDC only proves the workflow's identity; it does not grant access by +itself. This step assigns least-privilege Azure roles to the service +principal: + +| Scope | Role | Why | +|---|---|---| +| Foundry account/project resource | `Azure AI User` | Lets AgentOps read and invoke the hosted support agent. 
| +| Azure OpenAI account | `Cognitive Services OpenAI User` | Lets the evaluators call the judge model deployment. | + +The endpoint URLs contain the Azure resource names, but role assignments +need full Azure resource IDs. The first half of the script extracts those +names and resolves them to IDs; the second half assigns the roles. + ```powershell $spId = az ad sp show --id $client --query id -o tsv @@ -586,51 +664,158 @@ The `agentops-pr.yml` workflow runs. When it finishes you will see: - A green or red check on the PR. - A bot comment with the verdict, threshold table (including the tool-call metrics), and a link to the full `report.md` artifact. + The tutorial's latency threshold is intentionally broad; after a few + real runs, tighten it in `agentops.yaml` or enforce p95 latency with + Watchdog in step 9. Merge the PR. `agentops-deploy-dev.yml` triggers, runs an eval against the dev environment, and deploys if it passes. ## 9. Run the Watchdog -The watchdog reads your accumulated run history and (optionally) -queries Application Insights and the Foundry control plane to flag -drifts that a single eval cannot see — repeated regressions, latency -trends, error spikes, safety findings. +The watchdog is only useful if it has real signals to inspect. In this +tutorial those signals are: + +1. `.agentops/results/*/results.json` from the evals you already ran. +2. Application Insights telemetry emitted by a new eval run. +3. Foundry control-plane metadata for the hosted support agent. + +If you run `agentops agent analyze` without Application Insights +configured, the report can only say `azure_monitor: skipped`. That is not +an observability tutorial. The next commands create Application Insights, +send telemetry into it, and then run the watchdog against the live data. 
+ +### 9.1 Create Application Insights for the tutorial ```powershell -pip install "agentops-toolkit[agent] @ git+https://github.com/Azure/agentops.git@develop" -agentops agent analyze +# Reuse the same resource group/location as the Foundry account. +$foundryName = (($env:AZURE_AI_FOUNDRY_PROJECT_ENDPOINT -split "//")[1] -split "\.")[0] +$foundry = az resource list ` + --name $foundryName ` + --resource-type "Microsoft.CognitiveServices/accounts" ` + --query "[0]" | ConvertFrom-Json + +if (-not $foundry) { throw "Could not resolve Foundry resource '$foundryName'" } + +$resourceGroup = ($foundry.id -split "/resourceGroups/")[1].Split("/")[0] +$location = $foundry.location +$workspaceName = "law-support-bot-$suffix" +$appiName = "appi-support-bot-$suffix" + +az extension add -n application-insights --upgrade | Out-Null +az monitor log-analytics workspace create ` + --resource-group $resourceGroup ` + --workspace-name $workspaceName ` + --location $location | Out-Null + +$workspaceId = az monitor log-analytics workspace show ` + --resource-group $resourceGroup ` + --workspace-name $workspaceName ` + --query id -o tsv + +az monitor app-insights component create ` + --app $appiName ` + --location $location ` + --resource-group $resourceGroup ` + --workspace $workspaceId ` + --application-type web | Out-Null + +$appInsightsId = az monitor app-insights component show ` + --app $appiName ` + --resource-group $resourceGroup ` + --query id -o tsv + +$appInsightsConnectionString = az monitor app-insights component show ` + --app $appiName ` + --resource-group $resourceGroup ` + --query connectionString -o tsv ``` -This produces `.agentops/agent/report.md`. With no `agent.yaml` -present, only the local results-history source is active and Azure -Monitor / Foundry control plane appear as `skipped` in the -diagnostics block. That is enough for the basic regression and -latency checks across all your previous runs. 
+What this creates: + +- A **Log Analytics workspace** that stores the telemetry tables. +- A workspace-based **Application Insights component** that receives + AgentOps spans and exposes them to Azure Monitor queries. +- Two local variables: + - `$appInsightsId` — used by the watchdog to query telemetry. + - `$appInsightsConnectionString` — used by `agentops eval run` to emit + telemetry. + +### 9.2 Let the CI identity read telemetry -To pull production telemetry, drop a starter `agent.yaml` into the -workspace and edit it: +Locally, your signed-in Azure user can usually query the resource because +you created it. For GitHub Actions, grant the same OIDC app a read role +so scheduled watchdog runs can query Application Insights too: ```powershell -$tpl = python -c "import agentops, pathlib; print(pathlib.Path(agentops.__file__).parent / 'templates' / 'agent.yaml')" -Copy-Item $tpl .agentops/agent.yaml +$repo = gh repo view --json nameWithOwner -q .nameWithOwner +$client = gh variable get AZURE_CLIENT_ID --env dev --repo $repo +$spId = az ad sp show --id $client --query id -o tsv + +az role assignment create ` + --assignee-object-id $spId ` + --assignee-principal-type ServicePrincipal ` + --role "Monitoring Reader" ` + --scope $appInsightsId | Out-Null ``` -```yaml +### 9.3 Configure the watchdog + +Now write `.agentops/agent.yaml`. This is the file that tells the +watchdog which signal sources to use: + +```powershell +@" +version: 1 +lookback_days: 7 + sources: results_history: enabled: true + path: .agentops/results + lookback_runs: 10 azure_monitor: enabled: true - app_insights_resource_id: /subscriptions//resourceGroups//providers/microsoft.insights/components/ + app_insights_resource_id: $appInsightsId foundry_control: enabled: true project_endpoint_env: AZURE_AI_FOUNDRY_PROJECT_ENDPOINT +checks: + latency: + p95_threshold_seconds: 5.0 + errors: + rate_threshold: 0.05 +"@ | Set-Content .agentops/agent.yaml -Encoding utf8 ``` -Re-run `agentops agent analyze`. 
The findings table now mixes signals -from your eval history (including the v1 → v2 tool-call regression) -with live telemetry from the deployed agent. +### 9.4 Generate telemetry, then analyze it + +Install both the Foundry runtime and the watchdog extras, set the +Application Insights connection string, and run one more eval. AgentOps +will emit OpenTelemetry spans for each dataset row and agent invocation. + +```powershell +python -m pip install "agentops-toolkit[foundry,agent] @ git+https://github.com/Azure/agentops.git@develop" + +$env:APPLICATIONINSIGHTS_CONNECTION_STRING = $appInsightsConnectionString +agentops eval run + +# Azure Monitor ingestion is asynchronous. Give it a short moment to index. +Start-Sleep -Seconds 90 + +agentops agent analyze +code .agentops/agent/report.md +``` + +The report should now show `azure_monitor` as `ok`, not `skipped`. The +watchdog can combine: + +- eval-history regressions from `.agentops/results`; +- live p95 latency and error-rate signals from Application Insights; +- Foundry control-plane metadata from `AZURE_AI_FOUNDRY_PROJECT_ENDPOINT`. + +If the findings table is empty, that means the configured checks passed; +the **Sources** table still proves which signal sources were queried. > **Optional — WAF-AI security audit.** The watchdog can also run a > read-only audit of your Foundry resource group against the diff --git a/docs/tutorial-http-agent.md b/docs/tutorial-http-agent.md index aa8fb625..e9fa7f14 100644 --- a/docs/tutorial-http-agent.md +++ b/docs/tutorial-http-agent.md @@ -1,209 +1,252 @@ -# Tutorial: HTTP Agent Evaluation (Agent Framework / ACA) +# Tutorial: HTTP Agent Evaluation -This tutorial shows how to evaluate an AI agent deployed as an HTTP endpoint — for example, a [Microsoft Agent Framework](https://learn.microsoft.com/azure/ai-agent-service/) application running on Azure Container Apps (ACA). No Foundry Agent Service is required. 
+This tutorial shows how to evaluate an agent that is exposed as an +HTTP/JSON endpoint. That endpoint can be a local development server, +Azure Container Apps, AKS, App Service, FastAPI, Express, Microsoft Agent +Framework, LangGraph, or any service that accepts a prompt and returns a +text response. -The HTTP backend sends each dataset row as a JSON POST request to your agent endpoint, extracts the response, runs local and AI-assisted evaluators, and produces the standard `results.json` and `report.md` outputs. +AgentOps treats HTTP agents the same way it treats Foundry agents after +the call succeeds: it loads JSONL rows, POSTs one row at a time, extracts +the answer, runs evaluators, and writes `results.json` plus `report.md`. -## When HTTP backend makes sense +## What you will build -Use `type: http` when: +- A tiny local HTTP agent so you can run the tutorial without deploying + anything. +- A flat `agentops.yaml` that points to the HTTP URL. +- A JSONL dataset with deterministic support-style questions. +- One `agentops eval run` producing a passing report. -- Your agent is **deployed outside Foundry Agent Service** — for example, a multi-agent orchestrator on ACA or a custom FastAPI service. -- You use **Microsoft Agent Framework** (or any other framework) and expose an HTTP chat endpoint. -- You want **CI/CD gating** for any HTTP-accessible agent without Foundry dependency. -- You need to evaluate a **local development server** before deploying. - -The HTTP backend works for multi-agent scenarios transparently — evaluation always hits the orchestrator endpoint; internal agent routing and tool calls are invisible to AgentOps at this level. +Use the same pattern later by changing only the `agent:` URL and field +mapping for your real deployed agent. ## Prerequisites -- Python 3.11+ -- An agent running and accessible via HTTP (local or remote). -- *(Optional)* Azure CLI for AI-assisted evaluators (`az login`). 
-- `pip install "agentops-toolkit @ git+https://github.com/Azure/agentops.git@develop"` - -## Part 1: Set up +```powershell +python -m venv .venv +.\.venv\Scripts\Activate.ps1 +python -m pip install -U pip +python -m pip install "agentops-toolkit @ git+https://github.com/Azure/agentops.git@develop" +``` -### 1) Initialize the workspace +If you use AI-assisted evaluators such as Similarity or Fluency, also set +the judge model and sign in to Azure: -```bash -agentops init +```powershell +az login +$env:AZURE_AI_FOUNDRY_PROJECT_ENDPOINT = "https://.services.ai.azure.com/api/projects/" +$env:AZURE_OPENAI_ENDPOINT = "https://.openai.azure.com" +$env:AZURE_OPENAI_DEPLOYMENT = "gpt-4o-mini" ``` -This creates `.agentops/` with all starter files, including the HTTP scenario templates: +## 1. Create a local HTTP agent -``` -.agentops/ -├── run-http-model.yaml ← HTTP run config -├── bundles/model_quality_baseline.yaml ← baseline evaluators -├── datasets/smoke-model-direct.yaml ← smoke dataset config -└── data/smoke-model-direct.jsonl ← 5 generic Q&A rows -``` +Create `http_agent.py`: -### 2) Set the agent URL +```python +from http.server import BaseHTTPRequestHandler, HTTPServer +import json -The recommended approach is to set an environment variable so the URL stays out of your run config: -PowerShell: -```powershell -$env:AGENT_HTTP_URL = "https://your-agent.region.azurecontainerapps.io/chat" -``` +ANSWERS = { + "Where is my order ORD-12345?": { + "text": "Order ORD-12345 is in transit and expected to arrive tomorrow.", + "tool_calls": [{"type": "tool_call", "tool_call_id": "c1", "name": "lookup_order", "arguments": {"order_id": "ORD-12345"}}], + }, + "I want a refund for ORD-77821, it arrived broken.": { + "text": "I started a refund for ORD-77821 because it arrived broken.", + "tool_calls": [{"type": "tool_call", "tool_call_id": "c2", "name": "refund_order", "arguments": {"order_id": "ORD-77821", "reason": "arrived broken"}}], + }, + "Hi there!": { + "text": "Hello! 
I can help with order status, refunds, or connecting you to a human support agent.", + "tool_calls": [], + }, +} + -Bash/zsh: -```bash -export AGENT_HTTP_URL="https://your-agent.region.azurecontainerapps.io/chat" +class Handler(BaseHTTPRequestHandler): + def do_POST(self): + length = int(self.headers.get("content-length", "0")) + body = json.loads(self.rfile.read(length)) + message = body.get("message", "") + response = ANSWERS.get(message, {"text": "I do not know yet.", "tool_calls": []}) + + payload = json.dumps(response).encode("utf-8") + self.send_response(200) + self.send_header("content-type", "application/json") + self.send_header("content-length", str(len(payload))) + self.end_headers() + self.wfile.write(payload) + + +HTTPServer(("127.0.0.1", 8787), Handler).serve_forever() ``` -For a local agent running during development: -```bash -export AGENT_HTTP_URL="http://localhost:8080/chat" +Start it in a second terminal: + +```powershell +.\.venv\Scripts\Activate.ps1 +python http_agent.py ``` -### 3) *(Optional)* Configure AI-assisted evaluators +Why this local server? It lets you prove the AgentOps HTTP contract before +you involve Container Apps, auth, networking, or deployment variables. +When this passes locally, a remote HTTP target is just a URL swap. + +## 2. Initialize AgentOps -If your bundle includes `SimilarityEvaluator` or other AI-assisted evaluators, set the judge model: +Back in your first terminal: -```bash -export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/" -export AZURE_AI_MODEL_DEPLOYMENT_NAME="gpt-4o" +```powershell +agentops init +``` + +This creates: + +```text +agentops.yaml +.agentops/ + data/smoke.jsonl + results/ +.github/skills/ ``` -Run `az login` if you are using `DefaultAzureCredential` locally. +AgentOps 1.0 uses one flat config file at the project root. You do not +need legacy `run-http.yaml`, bundle YAML, or dataset YAML files. -## Part 2: Customize the run config +## 3. 
Configure the HTTP endpoint -Open `.agentops/run-http-model.yaml`. The starter config already points at the baseline bundle and smoke dataset: +Replace `agentops.yaml` with: ```yaml version: 1 -target: - type: model - hosting: aks - execution_mode: remote - endpoint: - kind: http - url_env: AGENT_HTTP_URL # reads the URL from your environment - request_field: message # JSON field to send the prompt in - response_field: text # JSON field to extract the response from -bundle: - name: model_quality_baseline -dataset: - name: smoke-model-direct -execution: - timeout_seconds: 60 -output: - write_report: true +agent: "http://127.0.0.1:8787/" +dataset: .agentops/data/http-support.jsonl + +request_field: message +response_field: text +tool_calls_field: tool_calls + +thresholds: + coherence: ">=3" + fluency: ">=3" + tool_call_accuracy: ">=0.8" + intent_resolution: ">=3" + task_adherence: ">=0.8" + avg_latency_seconds: "<=2" ``` -### Adapting to your agent's API +The HTTP field mapping controls the JSON protocol: -Every agent has its own request/response format. Adjust these fields: - -| Field | Default | Description | -|---|---|---| -| `request_field` | `message` | JSON key for the prompt text | -| `response_field` | `text` | JSON key for the response (supports dot-path) | -| `auth_header_env` | — | Env var containing a Bearer token | -| `headers` | `{}` | Static extra headers | +| Config field | Meaning | +|---|---| +| `request_field: message` | AgentOps sends `{"message": ""}`. | +| `response_field: text` | AgentOps reads the final answer from `response.text`. Dot paths such as `output.text` are supported. | +| `tool_calls_field: tool_calls` | AgentOps reads structured tool calls from `response.tool_calls` so tool metrics can run. 
| -**Examples:** +For a deployed endpoint that requires a Bearer token, add: -Agent that expects `{"query": "..."}` and returns `{"answer": "..."}`: ```yaml -target: - endpoint: - kind: http - url_env: AGENT_HTTP_URL - request_field: query - response_field: answer +auth_header_env: AGENT_TOKEN ``` -Agent that returns `{"output": {"text": "..."}}` (nested): -```yaml -target: - endpoint: - kind: http - url_env: AGENT_HTTP_URL - response_field: output.text # dot-path into nested object -``` +Then set `$env:AGENT_TOKEN` before running the eval. -Agent requiring Bearer token authentication: -```yaml -target: - endpoint: - kind: http - url_env: AGENT_HTTP_URL - auth_header_env: AGENT_TOKEN # reads Bearer token from env +## 4. Create the dataset + +Create `.agentops/data/http-support.jsonl`: + +```jsonl +{"input":"Where is my order ORD-12345?","expected":"Order ORD-12345 is in transit and expected to arrive tomorrow.","tool_definitions":[{"type":"function","name":"lookup_order","description":"Look up an order.","parameters":{"type":"object","properties":{"order_id":{"type":"string"}},"required":["order_id"]}},{"type":"function","name":"refund_order","description":"Refund an order.","parameters":{"type":"object","properties":{"order_id":{"type":"string"},"reason":{"type":"string"}},"required":["order_id","reason"]}}],"tool_calls":[{"type":"tool_call","tool_call_id":"c1","name":"lookup_order","arguments":{"order_id":"ORD-12345"}}]} +{"input":"I want a refund for ORD-77821, it arrived broken.","expected":"A refund is started for ORD-77821 because it arrived broken.","tool_definitions":[{"type":"function","name":"lookup_order","description":"Look up an order.","parameters":{"type":"object","properties":{"order_id":{"type":"string"}},"required":["order_id"]}},{"type":"function","name":"refund_order","description":"Refund an 
order.","parameters":{"type":"object","properties":{"order_id":{"type":"string"},"reason":{"type":"string"}},"required":["order_id","reason"]}}],"tool_calls":[{"type":"tool_call","tool_call_id":"c2","name":"refund_order","arguments":{"order_id":"ORD-77821","reason":"arrived broken"}}]} +{"input":"Hi there!","expected":"The assistant replies with a clear greeting and offers support options without calling a tool.","tool_definitions":[{"type":"function","name":"lookup_order","description":"Look up an order.","parameters":{"type":"object","properties":{"order_id":{"type":"string"}},"required":["order_id"]}},{"type":"function","name":"refund_order","description":"Refund an order.","parameters":{"type":"object","properties":{"order_id":{"type":"string"},"reason":{"type":"string"}},"required":["order_id","reason"]}}],"tool_calls":[]} ``` -Banking assistant (Agent Framework default): -```yaml -target: - endpoint: - kind: http - url_env: AGENT_HTTP_URL - request_field: message - response_field: text - auth_header_env: AGENT_TOKEN +Each row has: -## Part 3: Prepare the dataset +- `input` — what AgentOps sends to the HTTP service. +- `expected` — the reference answer for text-quality metrics. +- `tool_calls` — the expected structured tool behavior. Omit this field + if your HTTP endpoint does not expose tool calls. +- `tool_definitions` — the function-tool schema available to the agent. + Tool-call accuracy evaluators need this catalogue on each row. -The smoke dataset has 5 generic Q&A rows. For real evaluations, replace `data/smoke-http.jsonl` with domain-specific queries: +## 5. 
Run the evaluation -```json -{"id":"1","input":"What is the balance on account 12345?","expected":"The balance on account 12345 is $1,234.56."} -{"id":"2","input":"What are the last 3 transactions on my savings account?","expected":"The last 3 transactions are: ..."} +```powershell +agentops eval run ``` -Update `datasets/smoke-http.yaml` to point at your file: +The CLI should print a passing threshold summary and write: -```yaml -source: - type: file - path: ../data/your-dataset.jsonl +```text +.agentops/results//results.json +.agentops/results//report.md +.agentops/results/latest/ ``` -## Part 4: Run the evaluation +Open the Markdown report: -```bash -agentops eval run --config .agentops/run-http.yaml +```powershell +code .agentops/results/latest/report.md ``` -The backend: -1. Loads the dataset rows from the JSONL file. -2. POSTs each row to your agent via HTTP. -3. Extracts the response text. -4. Runs evaluators (`SimilarityEvaluator`, `avg_latency_seconds`). -5. Writes `backend_metrics.json`, then `results.json` and `report.md`. - -Output lands in `.agentops/results//` and is mirrored to `.agentops/results/latest/`. Pass `--output ` to write the run only to that path instead. +The report shows the aggregate metrics, threshold table, and per-row +details. For the first two rows, the per-row section should include the +tool calls returned by the HTTP server. -## Part 5: Review results +## 6. Point it at a real service -**Console:** AgentOps prints a summary with pass/fail per threshold. +When you deploy the agent, keep the dataset and thresholds but change the +URL and field mapping: -**Report:** Open the report in VS Code with `code .agentops/results/latest/report.md` and press `Ctrl+Shift+V` to render the Markdown. +```yaml +version: 1 +agent: "https://your-agent.region.azurecontainerapps.io/chat" +dataset: .agentops/data/http-support.jsonl -**JSON:** Parse `.agentops/results/latest/results.json` for machine-readable scores. 
+request_field: message +response_field: output.text +tool_calls_field: output.tool_calls +auth_header_env: AGENT_TOKEN +``` -## Troubleshooting +Run the same command: -**`connection refused` / `URL error`** — Your agent is not reachable. Check that `AGENT_HTTP_URL` is correct and the server is running. +```powershell +agentops eval run +``` -**`Response field 'text' not found`** — Your agent returns a different key. Inspect the raw response and update `response_field` in your run config. +If the local server passed but the remote service fails, the issue is +usually deployment reachability, auth, or a response-field mismatch rather +than evaluator logic. -**`SimilarityEvaluator` fails** — Set `AZURE_OPENAI_ENDPOINT` and `AZURE_AI_MODEL_DEPLOYMENT_NAME`, then run `az login`. +## Troubleshooting -**All rows error, exit code 1** — Check `.agentops/results/latest/backend.stderr.log` for per-row error details. +| Symptom | What to check | +|---|---| +| `connection refused` | The server is not running or the URL/port is wrong. | +| `Response field 'text' not found` | Update `response_field` to match your JSON response shape. | +| `tool_call_accuracy` is missing | Add `tool_calls_field` and make sure the response includes structured tool calls. | +| AI evaluator auth error | Run `az login` and set the Azure OpenAI / Foundry environment variables. | ## Exit codes | Code | Meaning | |---|---| -| `0` | All rows succeeded and all thresholds passed | -| `2` | Evaluation succeeded but one or more thresholds failed | -| `1` | Runtime error (HTTP failure, config error) | +| `0` | Evaluation succeeded and all thresholds passed. | +| `2` | Evaluation succeeded but at least one threshold failed. | +| `1` | Runtime or configuration error. | ## CI/CD integration -See [docs/ci-github-actions.md](ci-github-actions.md) for how to gate on the exit code in a GitHub Actions workflow. The HTTP backend works identically to other backends from a CI perspective. 
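As a rough illustration of gating on these exit codes outside a generated workflow, the sketch below wraps a command in Python and maps its return code to the three outcomes in the exit-code table. The helper names `describe_exit_code` and `run_gate` are hypothetical and not part of AgentOps; only the `agentops eval run` command and the `0` / `2` / `1` meanings come from this tutorial.

```python
import subprocess

# Exit-code meanings as documented in the exit-code table.
EXIT_MEANINGS = {
    0: "passed",             # evaluation succeeded, all thresholds passed
    2: "thresholds_failed",  # evaluation succeeded, a threshold failed
    1: "runtime_error",      # runtime or configuration error
}


def describe_exit_code(code: int) -> str:
    """Map a process exit code to a gate verdict (hypothetical helper)."""
    return EXIT_MEANINGS.get(code, "unknown")


def run_gate(command: list[str]) -> str:
    """Run a command and return the verdict for its exit code."""
    # check=False: a non-zero exit is an expected outcome, not an exception.
    completed = subprocess.run(command, check=False)
    return describe_exit_code(completed.returncode)
```

A CI step could call `run_gate(["agentops", "eval", "run"])` and fail the job unless the verdict is `passed`. Keeping `thresholds_failed` distinct from `runtime_error` makes it possible to report "the agent regressed" separately from "the pipeline is broken".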
+After the local run passes, generate workflow files with: + +```powershell +agentops workflow generate +``` + +The generated PR workflow uses the same `agentops eval run` exit codes to +gate pull requests. See [ci-github-actions.md](ci-github-actions.md) for +the GitHub environment and OIDC setup. diff --git a/docs/tutorial-model-direct.md b/docs/tutorial-model-direct.md index 1db40627..fe652c1b 100644 --- a/docs/tutorial-model-direct.md +++ b/docs/tutorial-model-direct.md @@ -34,10 +34,18 @@ the deployment, and skips agent infrastructure entirely. `.agentops/data/smoke.jsonl` (one JSON object per line): ```jsonl -{"id":"1","input":"What is the capital of France?","expected":"Paris is the capital of France."} -{"id":"2","input":"Which planet is known as the Red Planet?","expected":"Mars is the Red Planet."} +{"id":"1","input":"Answer with exactly this sentence: Paris is the capital of France and one of Europe's major cultural centers.","expected":"Paris is the capital of France and one of Europe's major cultural centers."} +{"id":"2","input":"Answer with exactly this sentence: Mars is known as the Red Planet because iron-rich dust gives its surface a reddish color.","expected":"Mars is known as the Red Planet because iron-rich dust gives its surface a reddish color."} +{"id":"3","input":"Answer with exactly this sentence: Water has the chemical formula H2O because each molecule contains two hydrogen atoms and one oxygen atom.","expected":"Water has the chemical formula H2O because each molecule contains two hydrogen atoms and one oxygen atom."} ``` +The first model-direct smoke test intentionally uses short factual +sentences with exact-answer instructions. That makes the default +Similarity, F1, and Fluency thresholds meaningful: if this fails, you +likely have a configuration/auth problem rather than a subjective-answer +mismatch. Once the loop is working, replace these rows with realistic +prompts for your application. 
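To see why exact-answer prompts make an overlap metric easy to interpret, here is a minimal sketch of a token-level F1 score. It assumes plain lowercase whitespace tokenisation; the tokenisation and normalisation that the AgentOps F1 evaluator actually applies are not described in this tutorial and may differ, and `token_f1` is a hypothetical helper name.

```python
from collections import Counter


def token_f1(response: str, expected: str) -> float:
    """Token-overlap F1 between a response and a reference answer."""
    response_tokens = response.lower().split()
    expected_tokens = expected.lower().split()
    if not response_tokens or not expected_tokens:
        # Both empty counts as a perfect match; only one empty as no match.
        return float(response_tokens == expected_tokens)

    # Multiset intersection: each shared token counts at most as often
    # as it appears in both strings.
    overlap = sum((Counter(response_tokens) & Counter(expected_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(response_tokens)
    recall = overlap / len(expected_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this definition a verbatim answer scores `1.0`, while a correct but differently worded answer scores lower. That wording noise is what the exact-answer smoke rows are meant to remove from the first run.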
+ The dataset has only `input` and `expected`, so AgentOps auto-selects the **model quality** evaluators: Coherence, Fluency, Similarity, F1Score, plus average latency. diff --git a/docs/tutorial-quickstart.md b/docs/tutorial-quickstart.md index 2169401c..53313b33 100644 --- a/docs/tutorial-quickstart.md +++ b/docs/tutorial-quickstart.md @@ -43,7 +43,8 @@ agentops init This creates two files: - `agentops.yaml` — your evaluation config (3 lines + comments). -- `.agentops/data/smoke.jsonl` — a 3-row seed dataset. +- `.agentops/data/smoke.jsonl` — a 3-row seed dataset with short, + deterministic factual answers. ## 3. Configure your agent @@ -91,6 +92,12 @@ To view the report rendered (tables, ✅/❌), open it in VS Code and press `Ctr code .agentops/results/latest/report.md ``` +The seed dataset asks the target to answer with exact short factual +sentences. That keeps the first run focused on proving the AgentOps loop +works instead of debugging subjective wording differences. After the +smoke test passes, replace the rows with domain-specific examples for +your agent. + The CLI prints `Threshold status: PASSED` (exit code `0`) or `FAILED` (exit code `2`) so you can wire it into CI directly. ## 5. Compare against a baseline diff --git a/plugins/agentops/skills/agentops-eval/SKILL.md b/plugins/agentops/skills/agentops-eval/SKILL.md index b6042523..5ba5eae4 100644 --- a/plugins/agentops/skills/agentops-eval/SKILL.md +++ b/plugins/agentops/skills/agentops-eval/SKILL.md @@ -60,8 +60,9 @@ agentops report generate --in ``` Open `.agentops/results/latest/report.md`. To compare two runs, hand both -`results.json` files to the user and walk them through metric deltas; -AgentOps does not ship a separate `eval compare` command. +`results.json` files to the user or run the next eval with +`--baseline ` so AgentOps adds a **Comparison vs +Baseline** section to the report. 
## Step 5 — (Optional) Publish to Foundry Evaluations diff --git a/plugins/agentops/skills/agentops-report/SKILL.md b/plugins/agentops/skills/agentops-report/SKILL.md index 72ed2bd4..a9593b0c 100644 --- a/plugins/agentops/skills/agentops-report/SKILL.md +++ b/plugins/agentops/skills/agentops-report/SKILL.md @@ -59,9 +59,9 @@ exit code of the original run reflects the gate: suggest concrete prompt or retrieval changes. - For latency regressions, look at `run_metrics.avg_latency_seconds` and per-row latency. -- To compare two runs, diff the two `results.json` files at the metric - level and surface the deltas; AgentOps does not ship a separate - comparison CLI. +- To compare a new run against a previous one, re-run with + `agentops eval run --baseline ` and explain the + generated **Comparison vs Baseline** section. ## Guardrails diff --git a/plugins/agentops/skills/agentops-workflow/SKILL.md b/plugins/agentops/skills/agentops-workflow/SKILL.md index 8f8f77b2..d8e569f7 100644 --- a/plugins/agentops/skills/agentops-workflow/SKILL.md +++ b/plugins/agentops/skills/agentops-workflow/SKILL.md @@ -31,7 +31,8 @@ and have them generate `--kinds pr,dev,prod`. ## Step 0 — Prerequisites 1. `pip install "agentops-toolkit @ git+https://github.com/Azure/agentops.git@develop"` if `agentops` is missing. -2. `.agentops/run.yaml` exists and `agentops eval run` works locally. +2. `agentops.yaml` exists at the project root and `agentops eval run` + works locally. 3. The user's repo follows GitFlow (or is willing to). If not, ask which branches map to dev/qa/prod and adjust the `on:` triggers after generation. @@ -126,18 +127,18 @@ This makes the eval gate a hard merge requirement. Common follow-ups: -- **Tighten thresholds for QA/PROD** — copy `.agentops/run.yaml` to - `.agentops/run-qa.yaml` / `.agentops/run-prod.yaml` and tighten the - bundle thresholds. 
Point each workflow at its own config via the
+- **Tighten thresholds for QA/PROD** — copy `agentops.yaml` to
+  `agentops-qa.yaml` / `agentops-prod.yaml` and tighten the
+  `thresholds:` block. Point each workflow at its own config via the
   `inputs.config` default.
 - **Scheduled runs** — add a `schedule:` entry in `agentops-pr.yml` (or a new `agentops-nightly.yml`) to evaluate against `main` nightly.
-- **Matrix per scenario** — if the user has multiple `runs/*.yaml` files,
+- **Matrix per scenario** — if the user has multiple AgentOps config files,
   extend the eval job with `strategy.matrix.config:` and reference `${{ matrix.config }}`.
 - **Regression baseline** — wire the deploy templates to download the previous run's `results.json` artifact and call
-  `agentops eval compare`.
+  `agentops eval run --baseline `.

 ## Guardrails
diff --git a/pyproject.toml b/pyproject.toml
index 77b42ecf..bf4f32b7 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -29,6 +29,7 @@ agent = [
     "httpx>=0.27",
     "cryptography>=42",
     "azure-monitor-query>=1.3",
+    "azure-monitor-opentelemetry>=1.6",
     "azure-identity>=1.17",
     "azure-mgmt-cognitiveservices>=13.5",
     "azure-mgmt-monitor>=6.0",
diff --git a/src/agentops/agent/sources/azure_monitor.py b/src/agentops/agent/sources/azure_monitor.py
index 0a5e757f..9089ed6c 100644
--- a/src/agentops/agent/sources/azure_monitor.py
+++ b/src/agentops/agent/sources/azure_monitor.py
@@ -28,7 +28,7 @@ class AzureMonitorPayload:
 _REQUESTS_KQL = """
-requests
+union isfuzzy=true requests, dependencies
 | where timestamp > ago({lookback_days}d)
 | summarize
     request_count = count(),
diff --git a/src/agentops/pipeline/orchestrator.py b/src/agentops/pipeline/orchestrator.py
index cbb2ddab..eb264bad 100644
--- a/src/agentops/pipeline/orchestrator.py
+++ b/src/agentops/pipeline/orchestrator.py
@@ -34,6 +34,7 @@
 )
 from agentops.pipeline import comparison as comparison_module
 from agentops.pipeline import invocations, publisher, reporter, runtime, thresholds
+from
agentops.utils import telemetry
 from agentops.utils.colors import style

 logger = logging.getLogger("agentops.pipeline")
@@ -65,6 +66,19 @@ def run_evaluation(
     options: RunOptions,
 ) -> RunResult:
     """Run a full evaluation and persist artifacts. Returns the RunResult."""
+    telemetry.init_tracing()
+    try:
+        return _run_evaluation(config, options=options)
+    finally:
+        telemetry.shutdown()
+
+
+def _run_evaluation(
+    config: AgentOpsConfig,
+    *,
+    options: RunOptions,
+) -> RunResult:
+    """Run a full evaluation after optional telemetry has been initialized."""
     started_at = datetime.now(timezone.utc)
     started_perf = time.perf_counter()

@@ -106,26 +120,40 @@ def run_evaluation(
         f"{_friendly_target_kind(target.kind)}: {style(target.raw, 'bold')}."
     )

-    rows: List[RowResult] = []
-    rules_by_metric = {rule.metric: rule for rule in threshold_rules}
-    for index, row in enumerate(dataset_rows):
-        rows.append(
-            _evaluate_row(
-                row=row,
-                index=index,
-                total=total,
-                target=target,
-                config=config,
-                evaluators=evaluator_runtimes,
-                timeout=options.timeout_seconds,
-                progress=progress,
-                rules_by_metric=rules_by_metric,
+    with telemetry.eval_run_span(
+        bundle_name=options.config_path.stem,
+        dataset_name=dataset_path.name,
+        backend_type=target.kind,
+        target=target.raw,
+        model=target.deployment,
+        agent_id=target.raw if target.kind.startswith("foundry") else None,
+    ) as run_span:
+        rows: List[RowResult] = []
+        rules_by_metric = {rule.metric: rule for rule in threshold_rules}
+        for index, row in enumerate(dataset_rows):
+            rows.append(
+                _evaluate_row(
+                    row=row,
+                    index=index,
+                    total=total,
+                    target=target,
+                    config=config,
+                    evaluators=evaluator_runtimes,
+                    timeout=options.timeout_seconds,
+                    progress=progress,
+                    rules_by_metric=rules_by_metric,
+                )
             )
-        )

-    aggregate = _aggregate_metrics(rows)
-    threshold_results = thresholds.evaluate(threshold_rules, aggregate)
-    summary = _summarize(rows, threshold_results)
+        aggregate = _aggregate_metrics(rows)
+        threshold_results =
thresholds.evaluate(threshold_rules, aggregate)
+        summary = _summarize(rows, threshold_results)
+        telemetry.set_eval_run_result(
+            run_span,
+            passed=summary.overall_passed,
+            items_total=summary.items_total,
+            items_passed=summary.items_passed_all,
+        )
     finished_at = datetime.now(timezone.utc)
     duration = time.perf_counter() - started_perf
@@ -357,6 +385,24 @@ def _iter_dataset(path: Path) -> Iterable[Dict[str, Any]]:
 # ---------------------------------------------------------------------------


+def _metric_passes(rule: Threshold, value: float) -> bool:
+    if rule.value is None or rule.criteria in {"true", "false"}:
+        return True
+    target_v = float(rule.value)
+    c = rule.criteria
+    if c == ">=":
+        return value >= target_v
+    if c == ">":
+        return value > target_v
+    if c == "<=":
+        return value <= target_v
+    if c == "<":
+        return value < target_v
+    if c == "==":
+        return value == target_v
+    return True
+
+
 def _evaluate_row(
     *,
     row: Dict[str, Any],
@@ -374,57 +420,84 @@ def _evaluate_row(
     if len(preview) > 80:
         preview = preview[:77] + "..."
     progress(f"{label} invoking target: {preview!r}")
+    expected = row.get("expected")
+    expected_text = str(expected) if expected is not None else None

-    try:
-        invocation = invocations.invoke(target, config, row, timeout=timeout)
-    except Exception as exc:  # noqa: BLE001
-        logger.warning("row %d invocation failed: %s", index, exc)
-        progress(f"{label} {style('invocation FAILED', 'bold', 'red')}: {exc}")
-        return RowResult(
-            row_index=index,
-            input=str(row.get("input", "")),
-            expected=row.get("expected"),
-            response="",
-            context=row.get("context"),
-            error=str(exc),
+    with telemetry.eval_item_span(
+        row_index=index,
+        input_text=str(row.get("input", "")),
+        expected_text=expected_text,
+    ) as item_span:
+        try:
+            with telemetry.agent_invoke_span(
+                target="agent" if target.kind.startswith("foundry") else "model",
+                model=target.deployment,
+                agent_id=target.raw if target.kind.startswith("foundry") else None,
+                agent_name=target.name,
+                agent_version=target.version,
+            ) as invoke_span:
+                invocation = invocations.invoke(target, config, row, timeout=timeout)
+                telemetry.set_agent_invoke_result(
+                    invoke_span,
+                    response_model=target.deployment,
+                )
+        except Exception as exc:  # noqa: BLE001
+            telemetry.set_eval_item_result(item_span, passed=False)
+            logger.warning("row %d invocation failed: %s", index, exc)
+            progress(f"{label} {style('invocation FAILED', 'bold', 'red')}: {exc}")
+            return RowResult(
+                row_index=index,
+                input=str(row.get("input", "")),
+                expected=row.get("expected"),
+                response="",
+                context=row.get("context"),
+                error=str(exc),
+            )
+
+        tool_count = len(invocation.tool_calls) if invocation.tool_calls else 0
+        progress(
+            f"{label} replied in {style(f'{invocation.latency_seconds:.2f}s', 'cyan')} "
+            f"({tool_count} tool call(s)); scoring..."
         )

-    tool_count = len(invocation.tool_calls) if invocation.tool_calls else 0
-    progress(
-        f"{label} replied in {style(f'{invocation.latency_seconds:.2f}s', 'cyan')} "
-        f"({tool_count} tool call(s)); scoring..."
-    )
+        metrics: List[RowMetric] = []
+        for evaluator in evaluators:
+            metric = runtime.run_evaluator(
+                evaluator,
+                row=row,
+                response=invocation.response,
+                latency_seconds=invocation.latency_seconds,
+                actual_tool_calls=invocation.tool_calls,
+            )
+            metrics.append(metric)

-    metrics: List[RowMetric] = []
-    for evaluator in evaluators:
-        metric = runtime.run_evaluator(
-            evaluator,
-            row=row,
-            response=invocation.response,
-            latency_seconds=invocation.latency_seconds,
-            actual_tool_calls=invocation.tool_calls,
+            rule = (rules_by_metric or {}).get(metric.name)
+            metric_passed = (
+                None
+                if metric.value is None or rule is None
+                else _metric_passes(rule, float(metric.value))
+            )
+            telemetry.record_evaluator_span(
+                evaluator_name=evaluator.preset.name,
+                builtin_name=metric.name,
+                source=(
+                    "local"
+                    if evaluator.preset.class_name == "_latency"
+                    else "azure-ai-evaluation"
+                ),
+                score=float(metric.value) if metric.value is not None else 0.0,
+                threshold=rule.value if rule is not None else None,
+                criteria=rule.criteria if rule is not None else None,
+                passed=metric_passed,
+            )
+
+        telemetry.set_eval_item_result(
+            item_span,
+            passed=all(metric.error is None for metric in metrics),
         )
-        metrics.append(metric)

     rules = rules_by_metric or {}

-    def _passes(rule: Threshold, value: float) -> bool:
-        if rule.value is None or rule.criteria in {"true", "false"}:
-            return True
-        target_v = float(rule.value)
-        c = rule.criteria
-        if c == ">=":
-            return value >= target_v
-        if c == ">":
-            return value > target_v
-        if c == "<=":
-            return value <= target_v
-        if c == "<":
-            return value < target_v
-        if c == "==":
-            return value == target_v
-        return True
-
     def _format_metric(m: RowMetric) -> str:
         if isinstance(m.value, (int, float)):
             rule = rules.get(m.name)
@@ -433,7 +506,7 @@ def _format_metric(m: RowMetric) -> str:
                 # No user threshold for this metric: keep value neutral
                 # so the line stays readable.
                 return f"{m.name}={text}"
-            color = "green" if _passes(rule, float(m.value)) else "red"
+            color = "green" if _metric_passes(rule, float(m.value)) else "red"
             return f"{m.name}={style(text, color)}"
         if m.error:
             return f"{m.name}={style('ERR', 'red')}"
diff --git a/src/agentops/templates/skills/agentops-eval/SKILL.md b/src/agentops/templates/skills/agentops-eval/SKILL.md
index b6042523..5ba5eae4 100644
--- a/src/agentops/templates/skills/agentops-eval/SKILL.md
+++ b/src/agentops/templates/skills/agentops-eval/SKILL.md
@@ -60,8 +60,9 @@ agentops report generate --in
 ```

 Open `.agentops/results/latest/report.md`. To compare two runs, hand both
-`results.json` files to the user and walk them through metric deltas;
-AgentOps does not ship a separate `eval compare` command.
+`results.json` files to the user or run the next eval with
+`--baseline ` so AgentOps adds a **Comparison vs
+Baseline** section to the report.

 ## Step 5 — (Optional) Publish to Foundry Evaluations
diff --git a/src/agentops/templates/skills/agentops-report/SKILL.md b/src/agentops/templates/skills/agentops-report/SKILL.md
index 72ed2bd4..a9593b0c 100644
--- a/src/agentops/templates/skills/agentops-report/SKILL.md
+++ b/src/agentops/templates/skills/agentops-report/SKILL.md
@@ -59,9 +59,9 @@ exit code of the original run reflects the gate:
   suggest concrete prompt or retrieval changes.
 - For latency regressions, look at `run_metrics.avg_latency_seconds` and per-row latency.
-- To compare two runs, diff the two `results.json` files at the metric
-  level and surface the deltas; AgentOps does not ship a separate
-  comparison CLI.
+- To compare a new run against a previous one, re-run with
+  `agentops eval run --baseline ` and explain the
+  generated **Comparison vs Baseline** section.
 ## Guardrails
diff --git a/src/agentops/templates/skills/agentops-workflow/SKILL.md b/src/agentops/templates/skills/agentops-workflow/SKILL.md
index 8f8f77b2..d8e569f7 100644
--- a/src/agentops/templates/skills/agentops-workflow/SKILL.md
+++ b/src/agentops/templates/skills/agentops-workflow/SKILL.md
@@ -31,7 +31,8 @@ and have them generate `--kinds pr,dev,prod`.
 ## Step 0 — Prerequisites
 1. `pip install "agentops-toolkit @ git+https://github.com/Azure/agentops.git@develop"` if `agentops` is missing.
-2. `.agentops/run.yaml` exists and `agentops eval run` works locally.
+2. `agentops.yaml` exists at the project root and `agentops eval run`
+   works locally.
 3. The user's repo follows GitFlow (or is willing to). If not, ask which branches map to dev/qa/prod and adjust the `on:` triggers after generation.
@@ -126,18 +127,18 @@ This makes the eval gate a hard merge requirement.
 Common follow-ups:
-- **Tighten thresholds for QA/PROD** — copy `.agentops/run.yaml` to
-  `.agentops/run-qa.yaml` / `.agentops/run-prod.yaml` and tighten the
-  bundle thresholds. Point each workflow at its own config via the
+- **Tighten thresholds for QA/PROD** — copy `agentops.yaml` to
+  `agentops-qa.yaml` / `agentops-prod.yaml` and tighten the
+  `thresholds:` block. Point each workflow at its own config via the
   `inputs.config` default.
 - **Scheduled runs** — add a `schedule:` entry in `agentops-pr.yml` (or a new `agentops-nightly.yml`) to evaluate against `main` nightly.
-- **Matrix per scenario** — if the user has multiple `runs/*.yaml` files,
+- **Matrix per scenario** — if the user has multiple AgentOps config files,
   extend the eval job with `strategy.matrix.config:` and reference `${{ matrix.config }}`.
 - **Regression baseline** — wire the deploy templates to download the previous run's `results.json` artifact and call
-  `agentops eval compare`.
+  `agentops eval run --baseline `.
 ## Guardrails
diff --git a/src/agentops/templates/smoke.jsonl b/src/agentops/templates/smoke.jsonl
index b2246374..c28695b8 100644
--- a/src/agentops/templates/smoke.jsonl
+++ b/src/agentops/templates/smoke.jsonl
@@ -1,3 +1,3 @@
-{"input": "What is AgentOps?", "expected": "AgentOps is a CLI for evaluating Foundry agents."}
-{"input": "Which formats does it produce?", "expected": "It writes results.json and report.md."}
-{"input": "How do I configure thresholds?", "expected": "Use the 'thresholds' map in agentops.yaml."}
+{"input": "Answer with exactly this sentence: Paris is the capital of France and one of Europe's major cultural centers.", "expected": "Paris is the capital of France and one of Europe's major cultural centers."}
+{"input": "Answer with exactly this sentence: Mars is known as the Red Planet because iron-rich dust gives its surface a reddish color.", "expected": "Mars is known as the Red Planet because iron-rich dust gives its surface a reddish color."}
+{"input": "Answer with exactly this sentence: Water has the chemical formula H2O because each molecule contains two hydrogen atoms and one oxygen atom.", "expected": "Water has the chemical formula H2O because each molecule contains two hydrogen atoms and one oxygen atom."}
diff --git a/src/agentops/utils/telemetry.py b/src/agentops/utils/telemetry.py
index c9769c5f..09f20d3b 100644
--- a/src/agentops/utils/telemetry.py
+++ b/src/agentops/utils/telemetry.py
@@ -1,8 +1,9 @@
 """Optional OpenTelemetry instrumentation for AgentOps evaluation runs.

 All OpenTelemetry imports are **lazy** — they only happen when tracing is
-enabled via the ``AGENTOPS_OTLP_ENDPOINT`` environment variable. When the
-variable is unset, every public function in this module is a no-op.
+enabled via ``APPLICATIONINSIGHTS_CONNECTION_STRING`` (Azure Monitor) or
+the ``AGENTOPS_OTLP_ENDPOINT`` environment variable. When neither variable
+is set, every public function in this module is a no-op.
 Schema design follows three OTel semantic convention layers:
 https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/
@@ -26,12 +27,12 @@
 def is_enabled() -> bool:
-    """Return True when OTLP tracing has been initialised."""
+    """Return True when tracing has been initialised."""
     return _tracing_enabled


 def init_tracing() -> None:
-    """Initialise the OTLP exporter if ``AGENTOPS_OTLP_ENDPOINT`` is set.
+    """Initialise tracing when Azure Monitor or OTLP export is configured.

     Safe to call multiple times; only the first call has an effect.
     """
@@ -40,12 +41,37 @@ def init_tracing() -> None:
     if _tracing_enabled:
         return

-    endpoint = os.getenv("AGENTOPS_OTLP_ENDPOINT")
-    if not endpoint:
+    appinsights_connection_string = os.getenv(
+        "APPLICATIONINSIGHTS_CONNECTION_STRING"
+    ) or os.getenv("AGENTOPS_APPLICATIONINSIGHTS_CONNECTION_STRING")
+    otlp_endpoint = os.getenv("AGENTOPS_OTLP_ENDPOINT")
+    if not appinsights_connection_string and not otlp_endpoint:
         return

     try:
         from opentelemetry import trace
+    except ImportError:
+        # opentelemetry not installed — tracing stays disabled
+        return
+
+    if appinsights_connection_string:
+        try:
+            from azure.monitor.opentelemetry import configure_azure_monitor
+
+            configure_azure_monitor(
+                connection_string=appinsights_connection_string,
+            )
+            _tracer = trace.get_tracer("agentops")
+            _tracing_enabled = True
+            return
+        except ImportError:
+            # Azure Monitor exporter not installed — try OTLP below if configured.
+            pass
+
+    if not otlp_endpoint:
+        return
+
+    try:
         from opentelemetry.exporter.otlp.proto.http.trace_exporter import (
             OTLPSpanExporter,
         )
@@ -63,14 +89,14 @@ def init_tracing() -> None:
         )
         provider = TracerProvider(resource=resource)
-        exporter = OTLPSpanExporter(endpoint=endpoint + "/v1/traces")
+        exporter = OTLPSpanExporter(endpoint=otlp_endpoint + "/v1/traces")
         provider.add_span_processor(BatchSpanProcessor(exporter))
         trace.set_tracer_provider(provider)

         _tracer = trace.get_tracer("agentops")
         _tracing_enabled = True
     except ImportError:
-        # opentelemetry not installed — tracing stays disabled
+        # OTLP exporter not installed — tracing stays disabled
         pass
@@ -172,7 +198,7 @@ def eval_item_span(
         yield None
         return

-    from opentelemetry.trace import SpanKind
+    from opentelemetry.trace import SpanKind, StatusCode

     _label = f"eval_item {row_index}"
     if input_text:
@@ -183,7 +209,7 @@ def eval_item_span(
     with _tracer.start_as_current_span(
         _label,
-        kind=SpanKind.INTERNAL,
+        kind=SpanKind.SERVER,
     ) as span:
         # CICD task attributes
         span.set_attribute("cicd.pipeline.task.name", "eval_item")
@@ -196,17 +222,27 @@ def eval_item_span(
         if expected_text:
             span.set_attribute("agentops.eval.item.expected", expected_text)

-        yield span
+        try:
+            yield span
+        except Exception as exc:
+            span.set_attribute("cicd.pipeline.task.run.result", "failure")
+            span.set_attribute("agentops.eval.item.passed", False)
+            span.set_status(StatusCode.ERROR, str(exc))
+            span.record_exception(exc)
+            raise


 def set_eval_item_result(span: Any, *, passed: bool) -> None:
     """Set final result on an eval item span."""
     if span is None:
         return
+    from opentelemetry.trace import StatusCode
+
     span.set_attribute(
         "cicd.pipeline.task.run.result", "success" if passed else "failure"
     )
     span.set_attribute("agentops.eval.item.passed", passed)
+    span.set_status(StatusCode.OK if passed else StatusCode.ERROR)


 @contextmanager
diff --git a/tests/unit/test_telemetry.py b/tests/unit/test_telemetry.py
index cec0bd22..24fec39f 100644
---
a/tests/unit/test_telemetry.py
+++ b/tests/unit/test_telemetry.py
@@ -3,10 +3,18 @@
 from __future__ import annotations

 import os
+import sys
+import types
+from pathlib import Path
 from unittest.mock import MagicMock, patch

 import pytest

+from agentops.agent.config import AzureMonitorSourceConfig
+from agentops.agent.sources import azure_monitor
+from agentops.core.agentops_config import AgentOpsConfig
+from agentops.pipeline.orchestrator import RunOptions, run_evaluation
+from agentops.utils import telemetry
 from agentops.utils.telemetry import (
     eval_item_span,
     eval_run_span,
@@ -265,3 +273,153 @@ def test_eval_run_span_name(self) -> None:
         self.mock_tracer.start_as_current_span.assert_called_once()
         span_name = self.mock_tracer.start_as_current_span.call_args.args[0]
         assert span_name == "RUN my_bundle"
+
+
+def test_application_insights_connection_string_initializes_azure_monitor(
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    calls: dict[str, str] = {}
+
+    trace_module = types.ModuleType("opentelemetry.trace")
+    trace_module.get_tracer = lambda name: ("tracer", name)  # type: ignore[attr-defined]
+
+    opentelemetry_module = types.ModuleType("opentelemetry")
+    opentelemetry_module.trace = trace_module  # type: ignore[attr-defined]
+
+    azure_module = types.ModuleType("azure")
+    azure_monitor_module = types.ModuleType("azure.monitor")
+    azure_monitor_otel_module = types.ModuleType("azure.monitor.opentelemetry")
+
+    def configure_azure_monitor(*, connection_string: str) -> None:
+        calls["connection_string"] = connection_string
+
+    setattr(
+        azure_monitor_otel_module,
+        "configure_azure_monitor",
+        configure_azure_monitor,
+    )
+
+    monkeypatch.setitem(sys.modules, "opentelemetry", opentelemetry_module)
+    monkeypatch.setitem(sys.modules, "opentelemetry.trace", trace_module)
+    monkeypatch.setitem(sys.modules, "azure", azure_module)
+    monkeypatch.setitem(sys.modules, "azure.monitor", azure_monitor_module)
+    monkeypatch.setitem(
+        sys.modules,
"azure.monitor.opentelemetry", azure_monitor_otel_module
+    )
+    monkeypatch.setattr(telemetry, "_tracer", None)
+    monkeypatch.setattr(telemetry, "_tracing_enabled", False)
+    monkeypatch.setenv(
+        "APPLICATIONINSIGHTS_CONNECTION_STRING",
+        "InstrumentationKey=00000000-0000-0000-0000-000000000000",
+    )
+    monkeypatch.delenv("AGENTOPS_OTLP_ENDPOINT", raising=False)
+
+    init_tracing()
+
+    assert calls == {
+        "connection_string": "InstrumentationKey=00000000-0000-0000-0000-000000000000"
+    }
+    assert is_enabled() is True
+
+
+def test_azure_monitor_queries_requests_and_dependencies(
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    captured: dict[str, str | None] = {}
+
+    azure_module = types.ModuleType("azure")
+    identity_module = types.ModuleType("azure.identity")
+    monitor_module = types.ModuleType("azure.monitor")
+    query_module = types.ModuleType("azure.monitor.query")
+
+    class DefaultAzureCredential:
+        def __init__(self, **_kwargs: object) -> None:
+            pass
+
+    class LogsQueryStatus:
+        FAILURE = "Failure"
+
+    class Column:
+        def __init__(self, name: str) -> None:
+            self.name = name
+
+    class Table:
+        columns = [
+            Column("request_count"),
+            Column("error_count"),
+            Column("avg_duration_ms"),
+            Column("p95_duration_ms"),
+        ]
+        rows = [[2, 1, 1000.0, 2500.0]]
+
+    class Response:
+        status = "Success"
+        tables = [Table()]
+
+    class LogsQueryClient:
+        def __init__(self, _credential: object) -> None:
+            pass
+
+        def query_resource(
+            self,
+            *,
+            resource_id: str,
+            query: str,
+            timespan: object,
+        ) -> Response:
+            captured["resource_id"] = resource_id
+            captured["query"] = query
+            captured["timespan"] = str(timespan)
+            return Response()
+
+    identity_module.DefaultAzureCredential = DefaultAzureCredential  # type: ignore[attr-defined]
+    query_module.LogsQueryClient = LogsQueryClient  # type: ignore[attr-defined]
+    query_module.LogsQueryStatus = LogsQueryStatus  # type: ignore[attr-defined]
+
+    monkeypatch.setitem(sys.modules, "azure", azure_module)
+
monkeypatch.setitem(sys.modules, "azure.identity", identity_module)
+    monkeypatch.setitem(sys.modules, "azure.monitor", monitor_module)
+    monkeypatch.setitem(sys.modules, "azure.monitor.query", query_module)
+
+    payload = azure_monitor.collect_azure_monitor(
+        AzureMonitorSourceConfig(
+            enabled=True,
+            app_insights_resource_id=(
+                "/subscriptions/000/resourceGroups/rg/providers/"
+                "Microsoft.Insights/components/appi"
+            ),
+        ),
+        lookback_days=7,
+    )
+
+    assert "union isfuzzy=true requests, dependencies" in str(captured["query"])
+    assert payload.diagnostics["status"] == "ok"
+    assert payload.request_count == 2
+    assert payload.error_count == 1
+    assert payload.error_rate == 0.5
+    assert payload.avg_duration_seconds == 1.0
+    assert payload.p95_duration_seconds == 2.5
+
+
+def test_run_evaluation_flushes_telemetry_on_error(
+    monkeypatch: pytest.MonkeyPatch,
+    tmp_path: Path,
+) -> None:
+    events: list[str] = []
+    monkeypatch.setattr(telemetry, "init_tracing", lambda: events.append("init"))
+    monkeypatch.setattr(telemetry, "shutdown", lambda: events.append("shutdown"))
+
+    config = AgentOpsConfig(
+        version=1,
+        agent="model:gpt-4o-mini",
+        dataset=tmp_path / "missing.jsonl",
+    )
+    options = RunOptions(
+        config_path=tmp_path / "agentops.yaml",
+        output_dir=tmp_path / "out",
+    )
+
+    with pytest.raises(FileNotFoundError):
+        run_evaluation(config, options=options)
+
+    assert events == ["init", "shutdown"]
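
Note on the last test: it pins down the wrapper contract that `run_evaluation` gains in this change. Tracing is initialised, the run body executes, and shutdown still happens when the body raises. The sketch below isolates that pattern so it can be read without the rest of the pipeline. It is a minimal illustration, not the AgentOps implementation: `run_with_telemetry`, `failing_body`, and the `init`/`shutdown` callables are hypothetical stand-ins for `telemetry.init_tracing`, `telemetry.shutdown`, and `_run_evaluation`.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def run_with_telemetry(
    body: Callable[[], T],
    *,
    init: Callable[[], None],
    shutdown: Callable[[], None],
) -> T:
    """Run ``body`` between ``init`` and ``shutdown`` (hypothetical helper)."""
    init()
    try:
        return body()
    finally:
        # Runs on success and on failure, so cleanup is never skipped.
        shutdown()


events: List[str] = []


def failing_body() -> None:
    # Stands in for a run that fails early, e.g. a missing dataset file.
    raise FileNotFoundError("missing.jsonl")


try:
    run_with_telemetry(
        failing_body,
        init=lambda: events.append("init"),
        shutdown=lambda: events.append("shutdown"),
    )
except FileNotFoundError:
    # The original exception still reaches the caller after cleanup.
    events.append("error propagated")

print(events)
```

Using `try`/`finally` rather than `except` keeps the caller's error handling unchanged: the exception type and traceback reach the CLI as before, and only the cleanup step is added.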