Azure · placerda · May 8, 2026
diff --git a/docs/ci-github-actions.md b/docs/ci-github-actions.md
diff --git a/docs/tutorial-agent-workflow.md b/docs/tutorial-agent-workflow.md
@@ -1,110 +1,291 @@
-# Tutorial — agent workflow with tool calling
+# Tutorial: Build and evaluate a real tool-calling agent
 
-Evaluate an agent that calls **tools** (function calls / actions).
-AgentOps grades both the **final natural-language answer** *and* the
-**tool selection / arguments** the agent chose along the way.
+This tutorial is the tool-calling companion to the HTTP tutorial. You
+will build an agent that chooses between support tools, deploy it to
+Azure Container Apps, evaluate both the final answer and the tool trace,
+and add a CI gate.
 
-## Required dataset shape
+Use this tutorial when you care about questions such as:
 
-What turns a regular dataset into a tool-calling dataset is one or
-both of these row fields:
+- Did the agent call the right tool?
+- Did it pass the right arguments?
+- Did it avoid tools when the user only said hello?
+- Did tool quality regress in a pull request?
 
-| Field | What it is |
+## How AgentOps grades tool workflows
+
+AgentOps uses normal answer-quality metrics plus tool-specific metrics
+when the dataset includes `tool_calls` or `tool_definitions`.
+
+| Dataset field | Purpose |
 |---|---|
-| `tool_definitions` | The tools the agent has access to (OpenAI tool-call schema). |
-| `tool_calls` | The expected tool calls (name + arguments). |
+| `tool_definitions` | Tool catalogue available to the agent. Include it on every JSONL row so each row is self-contained. |
+| `tool_calls` | Expected tool trace: tool name, call ID, and arguments. |
+| `input` | User message sent to the agent. |
+| `expected` | Reference final answer. |
 
-When AgentOps sees `tool_calls` (or `tool_definitions`) in the
-dataset rows, it auto-selects the **agent workflow** evaluators:
-TaskCompletion, ToolCallAccuracy, IntentResolution, TaskAdherence,
-plus the conversational baseline metrics that apply to the target
-(Coherence, Fluency, latency, and any explicitly configured text metric).
+For HTTP agents, the response also needs a field that contains the
+actual tool trace. In this tutorial that field is `tool_calls`.
 
-## 1. Bootstrap
+## 1. Create the support-agent project
 
-```bash
-pip install "agentops-toolkit @ git+https://github.com/Azure/agentops.git@develop"
-agentops init
-export AZURE_AI_FOUNDRY_PROJECT_ENDPOINT="https://<resource>.services.ai.azure.com/api/projects/<project>"
+```powershell
+mkdir support-tools-agent
+Set-Location support-tools-agent
+
+python -m venv .venv
+.\.venv\Scripts\Activate.ps1
+python -m pip install -U pip
+python -m pip install "agentops-toolkit[foundry,agent] @ git+https://github.com/Azure/agentops.git@develop"
 ```
 
-## 2. Edit `agentops.yaml`
+Create the same FastAPI tool-calling agent used by the HTTP tutorial:
 
-For a Foundry prompt agent that already has tools registered:
+```powershell
+@'
+from __future__ import annotations
 
-```yaml
-version: 1
-agent: "weather-bot:2"
-dataset: .agentops/data/tools.jsonl
+from fastapi import FastAPI
+from pydantic import BaseModel
+
+
+app = FastAPI(title="AgentOps Support Tools Agent")
+
+
+class ChatRequest(BaseModel):
+    message: str
+
+
+def lookup_order(order_id: str) -> dict[str, str]:
+    status = {
+        "ORD-12345": "in transit and expected to arrive tomorrow",
+        "ORD-99001": "shipped yesterday and is waiting for carrier pickup",
+    }.get(order_id, "not found")
+    return {"order_id": order_id, "status": status}
+
+
+def refund_order(order_id: str, reason: str) -> dict[str, str]:
+    return {"order_id": order_id, "status": "refund_started", "reason": reason}
+
+
+@app.get("/health")
+def health() -> dict[str, str]:
+    return {"status": "ok"}
+
+
+@app.post("/chat")
+def chat(request: ChatRequest) -> dict[str, object]:
+    message = request.message
+
+    if "ORD-12345" in message or "ORD-99001" in message:
+        order_id = "ORD-12345" if "ORD-12345" in message else "ORD-99001"
+        result = lookup_order(order_id)
+        return {
+            "text": f"Order {order_id} is {result['status']}.",
+            "tool_calls": [
+                {
+                    "type": "tool_call",
+                    "tool_call_id": "lookup_1",
+                    "name": "lookup_order",
+                    "arguments": {"order_id": order_id},
+                }
+            ],
+        }
+
+    if "refund" in message.lower() and "ORD-77821" in message:
+        result = refund_order("ORD-77821", "arrived broken")
+        return {
+            "text": "I started a refund for ORD-77821 because it arrived broken.",
+            "tool_calls": [
+                {
+                    "type": "tool_call",
+                    "tool_call_id": "refund_1",
+                    "name": "refund_order",
+                    "arguments": {
+                        "order_id": result["order_id"],
+                        "reason": result["reason"],
+                    },
+                }
+            ],
+        }
+
+    return {
+        "text": "Hello! I can help with order status, refunds, or connecting you to support.",
+        "tool_calls": [],
+    }
+'@ | Set-Content app.py -Encoding utf8
+
+@'
+fastapi==0.115.14
+uvicorn[standard]==0.35.0
+pydantic==2.11.9
+'@ | Set-Content requirements.txt -Encoding utf8
+
+@'
+FROM python:3.11-slim
+
+WORKDIR /app
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+COPY app.py .
+
+EXPOSE 8000
+CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
+'@ | Set-Content Dockerfile -Encoding utf8
 ```
 
-For an HTTP-deployed agent that returns tool calls in its response
-body:
+The implementation is small enough to read in one sitting, yet it exposes
+the contract a production agent would: a JSON request, business tools, a
+final answer, and a structured tool trace.
 
-```yaml
+## 2. Deploy the agent to Azure
+
+```powershell
+az login
+
+$env:AZURE_LOCATION = "eastus2"
+$env:AZURE_RESOURCE_GROUP = "rg-agentops-tools-tutorial"
+$env:ACA_NAME = "agentops-tools-$((Get-Date).ToString('MMddHHmm'))"
+
+az group create `
+  --name $env:AZURE_RESOURCE_GROUP `
+  --location $env:AZURE_LOCATION
+
+az containerapp up `
+  --name $env:ACA_NAME `
+  --resource-group $env:AZURE_RESOURCE_GROUP `
+  --location $env:AZURE_LOCATION `
+  --source . `
+  --target-port 8000 `
+  --ingress external
+
+$fqdn = az containerapp show `
+  --name $env:ACA_NAME `
+  --resource-group $env:AZURE_RESOURCE_GROUP `
+  --query properties.configuration.ingress.fqdn `
+  -o tsv
+
+$agentUrl = "https://$fqdn/chat"
+Invoke-RestMethod -Uri "https://$fqdn/health"
+```
+
+This step matters because a tool workflow eval should exercise the same
+HTTP boundary your production clients use, not a local-only shortcut.
+
+## 3. Initialize AgentOps
+
+```powershell
+agentops init
+
+$env:AZURE_AI_FOUNDRY_PROJECT_ENDPOINT = "https://<resource>.services.ai.azure.com/api/projects/<project>"
+$env:AZURE_OPENAI_ENDPOINT             = "https://<resource>.openai.azure.com"
+$env:AZURE_OPENAI_DEPLOYMENT           = "gpt-4o-mini"
+$env:AZURE_AI_MODEL_DEPLOYMENT_NAME    = "gpt-4o-mini"
+```
+
+## 4. Write `agentops.yaml`
+
+```powershell
+@"
 version: 1
-agent: "https://aca-weather-bot.example.com/"
-dataset: .agentops/data/tools.jsonl
+agent: "$agentUrl"
+dataset: .agentops/data/support-tools.jsonl
 
 request_field: message
 response_field: text
 tool_calls_field: tool_calls
+
+thresholds:
+  coherence: ">=3"
+  fluency: ">=3"
+  tool_call_accuracy: ">=0.8"
+  intent_resolution: ">=3"
+  task_adherence: ">=0.6"
+  avg_latency_seconds: "<=30"
+"@ | Set-Content agentops.yaml -Encoding utf8
 ```
 
-`tool_calls_field` tells AgentOps where in the response JSON to find
-the structured tool calls (dot-path notation supported).
+Why each threshold exists:
+
+| Threshold | What it protects |
+|---|---|
+| `coherence`, `fluency` | The final answer remains readable. |
+| `tool_call_accuracy` | The tool name and arguments match the expected trace. |
+| `intent_resolution` | The agent understood the user's task. |
+| `task_adherence` | The agent did not drift away from the requested action. |
+| `avg_latency_seconds` | The deployed endpoint stays responsive. |
 
-## 3. Dataset shape (`tools.jsonl`)
+## 5. Create the dataset
 
-```jsonl
-{"id":"1","input":"What's the weather in Paris, France?","expected":"Calls get_weather with location='Paris, France'.","tool_calls":[{"type":"function_call","name":"get_weather","arguments":{"location":"Paris, France"}}]}
-{"id":"2","input":"How is the weather in Tokyo, Japan?","expected":"Calls get_weather with location='Tokyo, Japan'.","tool_calls":[{"type":"function_call","name":"get_weather","arguments":{"location":"Tokyo, Japan"}}]}
+```powershell
+New-Item -ItemType Directory -Force .agentops/data | Out-Null
+@'
+{"input":"Where is my order ORD-12345?","expected":"Order ORD-12345 is in transit and expected to arrive tomorrow.","tool_definitions":[{"type":"function","name":"lookup_order","description":"Look up an order.","parameters":{"type":"object","properties":{"order_id":{"type":"string"}},"required":["order_id"]}},{"type":"function","name":"refund_order","description":"Refund an order.","parameters":{"type":"object","properties":{"order_id":{"type":"string"},"reason":{"type":"string"}},"required":["order_id","reason"]}}],"tool_calls":[{"type":"tool_call","tool_call_id":"lookup_1","name":"lookup_order","arguments":{"order_id":"ORD-12345"}}]}
+{"input":"I want a refund for ORD-77821, it arrived broken.","expected":"A refund is started for ORD-77821 because it arrived broken.","tool_definitions":[{"type":"function","name":"lookup_order","description":"Look up an order.","parameters":{"type":"object","properties":{"order_id":{"type":"string"}},"required":["order_id"]}},{"type":"function","name":"refund_order","description":"Refund an order.","parameters":{"type":"object","properties":{"order_id":{"type":"string"},"reason":{"type":"string"}},"required":["order_id","reason"]}}],"tool_calls":[{"type":"tool_call","tool_call_id":"refund_1","name":"refund_order","arguments":{"order_id":"ORD-77821","reason":"arrived broken"}}]}
+{"input":"Hi there!","expected":"The assistant replies with a clear greeting and offers support options without calling a tool.","tool_definitions":[{"type":"function","name":"lookup_order","description":"Look up an order.","parameters":{"type":"object","properties":{"order_id":{"type":"string"}},"required":["order_id"]}},{"type":"function","name":"refund_order","description":"Refund an order.","parameters":{"type":"object","properties":{"order_id":{"type":"string"},"reason":{"type":"string"}},"required":["order_id","reason"]}}],"tool_calls":[]}
+'@ | Set-Content .agentops/data/support-tools.jsonl -Encoding utf8
 ```
 
-Include `tool_definitions` when you evaluate tool-call accuracy. The
-evaluator needs the schema of every tool the agent should know about;
-repeat the catalogue on each JSONL row so every row is self-contained.
+The third row matters as much as the first two: its empty `tool_calls`
+list asserts that a greeting must not trigger a business tool.
 
-## 4. Run
+## 6. Run and inspect the eval
 
-```bash
+```powershell
 agentops eval run
+code .agentops/results/latest/report.md
 ```
 
-The report's per-row block shows:
+The report should include:
 
-- The agent's final text response
-- The structured tool calls the agent emitted
-- ToolCallAccuracy / IntentResolution / TaskAdherence scores
+- Aggregate metric values.
+- Threshold pass/fail status.
+- Per-row tool traces.
+- The latency of calls to the deployed Container App.
 
-## 5. CI gate
+If the tool-call metrics fail, inspect the failing row in the report
+before changing thresholds. The usual cause is an incorrect tool name, a
+missing argument, or a `tool_calls_field` that does not match the response.
 
-In a PR check, fail when tool quality regresses. After your first
-run, diff every subsequent run against it:
+## 7. Add a PR gate
 
-```bash
-agentops eval run --baseline .agentops/results/latest/results.json
+```powershell
+agentops workflow generate --kinds pr --force
 ```
 
-AgentOps loads the baseline into memory before refreshing `latest/`,
-so `latest/results.json` is shorthand for "the run before this one".
-For CI, commit a stable baseline file (see
-[tutorial-baseline-comparison.md](tutorial-baseline-comparison.md)).
+Start with the PR workflow only. Generate DEV/QA/PROD deploy workflows only
+after you have configured GitHub Environments, OIDC federated credentials,
+and real build/deploy commands. Otherwise a push to `main` triggers a
+failing workflow that says nothing about agent quality.
 
-## Build a real tool-calling agent
+Configure the `dev` environment variables and OIDC credential as shown in
+[tutorial-http-agent.md](tutorial-http-agent.md#8-add-a-pr-evaluation-gate).
 
-The repo's E2E test deploys a real Microsoft Agent Framework agent
-(FastAPI on Container Apps) with a `get_weather` tool. See:
+## 8. Run Watchdog
 
-- `infra/e2e/agent-app/app.py` — minimal Agent Framework + FastAPI app
-- `infra/e2e/perrun.bicep` — per-run ACA deployment
-- `scripts/e2e_data/tools.jsonl` — the dataset used to grade it
+```powershell
+agentops agent analyze --severity-fail critical
+code .agentops/agent/report.md
+```
+
+Watchdog reads `.agentops/results/*/results.json` and looks for quality,
+latency, error, and safety findings. If you configure
+`.agentops/agent.yaml` with an Application Insights resource ID, it also
+queries Azure Monitor. The coding-agent skill `agentops-agent` is only a
+guided way to run these commands; it is not the runtime analyzer itself.
+
+## 9. Expand the scenario
 
-That same setup is what `tutorial-http-agent.md` walks through.
+After this tutorial passes, make the dataset closer to production:
 
-## See also
+- Add a row for an unknown order and expect a safe escalation.
+- Add a refund row without an order id and expect no `refund_order` call.
+- Add negative rows where the user asks for unrelated help.
+- Save one passing `results.json` as a baseline and compare future runs
+  with `agentops eval run --baseline <path>`.
 
-- [tutorial-conversational-agent.md](tutorial-conversational-agent.md) — same shape, no tools
-- [tutorial-http-agent.md](tutorial-http-agent.md) — deploying an HTTP agent
-- [tutorial-rag.md](tutorial-rag.md) — RAG instead of tools
-- [foundry-evaluation-sdk-built-in-evaluators.md](foundry-evaluation-sdk-built-in-evaluators.md) — full evaluator reference
+## Cleanup
+
+```powershell
+az group delete --name $env:AZURE_RESOURCE_GROUP --yes --no-wait
+```
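
The threshold table in step 4 describes `tool_call_accuracy` as "the tool name and arguments match the expected trace". As a rough illustration of that idea, the sketch below runs a strict exact-match check between a dataset row's `tool_calls` and the `tool_calls` in an agent response. It is not the AgentOps evaluator, whose scoring may well differ; the helper names `tool_calls_match` and `exact_match_rate` are hypothetical, and ignoring `type` and `tool_call_id` is an assumption made only for this example.

```python
from __future__ import annotations

from typing import Any


def tool_calls_match(expected: list[dict[str, Any]], actual: list[dict[str, Any]]) -> bool:
    """Return True when both traces hold the same (name, arguments) pairs, in order.

    Assumption for this sketch: "type" and "tool_call_id" are ignored, so only
    the tool name and its arguments are compared.
    """
    if len(expected) != len(actual):
        return False
    return all(
        e.get("name") == a.get("name") and e.get("arguments") == a.get("arguments")
        for e, a in zip(expected, actual)
    )


def exact_match_rate(rows: list[dict[str, Any]], responses: list[dict[str, Any]]) -> float:
    """Fraction of dataset rows whose expected trace matches the paired response."""
    if not rows:
        return 0.0
    hits = sum(
        tool_calls_match(row.get("tool_calls", []), resp.get("tool_calls", []))
        for row, resp in zip(rows, responses)
    )
    return hits / len(rows)


# Two dataset rows shaped like the tutorial's JSONL (other fields omitted).
rows = [
    {"tool_calls": [{"type": "tool_call", "tool_call_id": "lookup_1",
                     "name": "lookup_order", "arguments": {"order_id": "ORD-12345"}}]},
    {"tool_calls": []},  # greeting row: no tool expected
]

# Hypothetical /chat responses: the first is correct, the second wrongly calls a tool.
responses = [
    {"text": "Order ORD-12345 is in transit.",
     "tool_calls": [{"name": "lookup_order", "arguments": {"order_id": "ORD-12345"}}]},
    {"text": "Hello!",
     "tool_calls": [{"name": "lookup_order", "arguments": {"order_id": "ORD-1"}}]},
]

rate = exact_match_rate(rows, responses)
print(rate)
```

A check like this can help when reading a failing row in the report: if the strict comparison already fails, the tool name, an argument, or the response mapping is the likely culprit rather than the threshold.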