Azure · Dongbumlee · Apr 29, 2026
diff --git a/docs/telemetry.md b/docs/telemetry.md
@@ -49,18 +49,20 @@ A **trace** represents a single end-to-end operation. In AgentOps, one evaluatio
 A **span** is a unit of work with a start time, end time, a name, and key-value attributes. Spans nest inside each other to form a tree. Example:
 
 ```
-RUN conversational_agent_baseline          ← root span (the whole run)
-├── eval_item 0                            ← child span (one dataset row)
-│   ├── invoke_agent my-agent              ← grandchild (the agent call)
-│   ├── evaluator builtin.similarity       ← grandchild (scoring)
-│   └── evaluator builtin.coherence        ← grandchild (scoring)
-├── eval_item 1
+RUN conversational_agent_baseline                    ← root span (the whole run)
+├── eval_item 1 - 'What is 2+2?'                     ← child span (one dataset row)
+│   ├── invoke_agent my-agent                        ← grandchild (the agent call)
+│   ├── evaluator builtin.similarity                 ← grandchild (scoring)
+│   └── evaluator builtin.coherence                  ← grandchild (scoring)
+├── eval_item 2 - 'Capital of France?'
 │   ├── invoke_agent my-agent
 │   ├── evaluator builtin.similarity
 │   └── evaluator builtin.coherence
 └── ...
 ```
 
+Item indices are **1-based**, and each `eval_item` span name includes a short snippet of the row input for easy scanning in trace UIs.
+
 Each span records **attributes** — structured key-value pairs like `agentops.eval.evaluator.score = 0.87`.
 
 ### What Is OTLP?
@@ -196,11 +198,12 @@ RUN <bundle_name>                             kind=SERVER
 │     agentops.eval.model = <deployment>          (if applicable)
 │     agentops.eval.agent_id = <agent_id>         (if applicable)
 │
-├── eval_item 0                                kind=INTERNAL
+├── eval_item 1 - 'What is 2+2?'              kind=INTERNAL
 │   │   cicd.pipeline.task.name = "eval_item"
-│   │   agentops.eval.item.index = 0
-│   │   agentops.eval.item.input = "..."
-│   │   agentops.eval.item.expected = "..."
+│   │   cicd.pipeline.task.run.id = "1"
+│   │   cicd.pipeline.task.run.result = "success"
+│   │   agentops.eval.item.index = 1
+│   │   agentops.eval.item.input = "What is 2+2?"
 │   │   agentops.eval.item.passed = true
 │   │
 │   ├── invoke_agent my-agent                  kind=CLIENT
@@ -222,7 +225,7 @@ RUN <bundle_name>                             kind=SERVER
 │         agentops.eval.evaluator.score = 0.85
 │         ...
 │
-├── eval_item 1
+├── eval_item 2 - 'Capital of France?'
 │   └── ...
 │
 └── (final attributes on root span)
@@ -265,8 +268,8 @@ Follows the [OTel GenAI semantic conventions](https://opentelemetry.io/docs/spec
 
 | Attribute | Example | Description |
 |---|---|---|
-| `gen_ai.operation.name` | `invoke_agent` / `chat` | Operation type |
-| `gen_ai.provider.name` | `azure.ai.inference` | Provider |
+| `gen_ai.operation.name` | `invoke_agent` / `chat` | Operation type — `invoke_agent` for agent targets, `chat` for model targets |
+| `gen_ai.provider.name` | `azure.ai.inference` / `local.callable` | Provider — varies by backend (e.g. `azure.ai.inference` for Foundry, `local.callable` for the local adapter backend) |
 | `gen_ai.request.model` | `gpt-4o` | Requested model deployment |
 | `gen_ai.response.model` | `gpt-4o-2024-08-06` | Actual model version |
 | `gen_ai.agent.id` | `my-agent:3` | Foundry agent identifier |
@@ -289,9 +292,8 @@ Custom attributes for evaluation-specific data that has no standard equivalent.
 | `agentops.eval.items_total` | `10` | Total rows evaluated |
 | `agentops.eval.items_passed` | `9` | Rows passing thresholds |
 | `agentops.eval.pass_rate` | `0.9` | Pass rate |
-| `agentops.eval.item.index` | `0` | Row index |
+| `agentops.eval.item.index` | `1` | Row index (1-based) |
 | `agentops.eval.item.input` | `"What is 2+2?"` | Input text |
-| `agentops.eval.item.expected` | `"4"` | Expected answer |
 | `agentops.eval.item.passed` | `true` | Row pass/fail |
 | `agentops.eval.evaluator.name` | `SimilarityEvaluator` | Class name |
 | `agentops.eval.evaluator.builtin` | `builtin.similarity` | Builtin name |
@@ -345,9 +347,9 @@ Jaeger shows spans as horizontal bars on a timeline:
 
 ## Sending Traces to Azure Monitor
 
-For production, you may want traces in Azure Monitor / Application Insights instead of local Jaeger.
+For production, you may want traces in Azure Monitor / Application Insights instead of local Jaeger. The recommended path is the **OpenTelemetry Collector** running locally (or as a sidecar) with the Azure Monitor exporter.
 
-### Option A: Use the OTel Collector as a Proxy
+### Use the OTel Collector as a Proxy
 
 Run the [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) with an Azure Monitor exporter:
 
@@ -372,15 +374,121 @@ service:
 
 Then set `AGENTOPS_OTLP_ENDPOINT=http://localhost:4318`.
 
-### Option B: Use Azure Monitor's OTLP Endpoint Directly
+### Why not export from AgentOps directly?
 
-Azure Monitor now supports OTLP ingestion natively. Set the endpoint to your Application Insights OTLP ingestion URL:
+AgentOps ships a vanilla `OTLPSpanExporter` that POSTs `application/x-protobuf` to `<endpoint>/v1/traces` with no Authorization header. This is fine for any plain OTLP/HTTP backend (Jaeger, Tempo, the Collector, etc.), but it is **not** sufficient for Azure Monitor:
 
-```bash
-export AGENTOPS_OTLP_ENDPOINT=https://<region>.applicationinsights.azure.com
+- The official Azure Monitor OpenTelemetry distro for Python (see [Microsoft Learn — OpenTelemetry configuration](https://learn.microsoft.com/azure/azure-monitor/app/opentelemetry-configuration?tabs=python)) requires a **connection string** and is enabled by calling `configure_azure_monitor()`; it does not accept a raw OTLP endpoint.
+- Application Insights also has a preview feature (`Microsoft.Insights/OtlpApplicationInsights`) that exposes per-resource OTLP ingestion URLs, but it requires **Entra ID Bearer-token authentication** (scope `https://monitor.azure.com/.default`), which AgentOps's exporter does not currently inject.
+
+The Collector proxy avoids both issues: AgentOps speaks plain OTLP/HTTP to the Collector, and the Collector handles authentication to Azure Monitor.
+
+---
+
+## Querying Traces in Azure Monitor (KQL)
+
+Once eval traces land in Application Insights via the Collector, you can query them directly in **Application Insights > Logs** using KQL. All span attributes are stored as JSON keys in the `customDimensions` column.
+
+### Table Mapping
+
+AgentOps spans map to App Insights tables based on their OpenTelemetry span kind:
+
+| Span | App Insights Table | Span Kind |
+|---|---|---|
+| `RUN <bundle>` (root eval run) | `requests` | `SERVER` |
+| `eval_item N - '<input snippet>'` (per-row evaluation) | `dependencies` | `INTERNAL` |
+| `invoke_agent` / `chat` (agent/model call) | `dependencies` | `CLIENT` |
+| `evaluator <name>` (individual evaluator) | `dependencies` | `INTERNAL` |
+
+### Query 1: Slowest Evaluation Rows
+
+Find the top 10 slowest evaluation rows to identify performance bottlenecks.
+
+```kql
+dependencies
+| where customDimensions["cicd.pipeline.task.name"] == "eval_item"
+| extend
+    rowIndex = toint(customDimensions["agentops.eval.item.index"]),
+    input = tostring(customDimensions["agentops.eval.item.input"]),
+    passed = tostring(customDimensions["agentops.eval.item.passed"])
+| project timestamp, rowIndex, input, passed, duration, operation_Id
+| top 10 by duration desc
+```
+
+### Query 2: Failed Evaluators
+
+List all evaluator executions that failed their threshold, with scores and thresholds.
+
+```kql
+dependencies
+| where tostring(customDimensions["agentops.eval.evaluator.passed"]) =~ "false"
+| extend
+    evaluator = tostring(customDimensions["agentops.eval.evaluator.builtin"]),
+    score = toreal(customDimensions["agentops.eval.evaluator.score"]),
+    threshold = toreal(customDimensions["agentops.eval.evaluator.threshold"]),
+    criteria = tostring(customDimensions["agentops.eval.evaluator.criteria"])
+| project timestamp, evaluator, score, threshold, criteria, operation_Id
+| order by timestamp desc
 ```
 
-Refer to the [Azure Monitor OTLP documentation](https://learn.microsoft.com/azure/azure-monitor/app/opentelemetry-configuration) for details.
+### Query 3: Pass Rate Over Time
+
+Track overall evaluation pass rate trends from root spans.
+
+```kql
+requests
+| where name startswith "RUN "
+| extend
+    passRate = toreal(customDimensions["agentops.eval.pass_rate"]),
+    bundle = tostring(customDimensions["cicd.pipeline.name"]),
+    dataset = tostring(customDimensions["agentops.eval.dataset"]),
+    itemsTotal = toint(customDimensions["agentops.eval.items_total"]),
+    itemsPassed = toint(customDimensions["agentops.eval.items_passed"])
+| project timestamp, bundle, dataset, passRate, itemsPassed, itemsTotal
+| order by timestamp asc
+| render timechart with (ycolumns=passRate, title="Evaluation Pass Rate Over Time")
+```
+
+### Query 4: Token Usage Per Run
+
+Sum input and output tokens across all agent/model invocations within each eval run.
+
+```kql
+dependencies
+| where customDimensions["gen_ai.operation.name"] in ("invoke_agent", "chat")
+| extend
+    inputTokens = toint(customDimensions["gen_ai.usage.input_tokens"]),
+    outputTokens = toint(customDimensions["gen_ai.usage.output_tokens"]),
+    model = tostring(customDimensions["gen_ai.request.model"])
+| summarize
+    totalInputTokens = sum(inputTokens),
+    totalOutputTokens = sum(outputTokens),
+    totalTokens = sum(inputTokens) + sum(outputTokens),
+    invocations = count()
+    by operation_Id, model
+| order by totalTokens desc
+```
+
+### Query 5: Evaluator Score Distribution
+
+View the distribution of scores grouped by evaluator name to identify evaluators whose scores are consistently low.
+
+```kql
+dependencies
+| where isnotempty(customDimensions["agentops.eval.evaluator.score"])
+| extend
+    evaluator = tostring(customDimensions["agentops.eval.evaluator.builtin"]),
+    score = toreal(customDimensions["agentops.eval.evaluator.score"])
+| summarize
+    avgScore = avg(score),
+    minScore = min(score),
+    maxScore = max(score),
+    p50 = percentile(score, 50),
+    p90 = percentile(score, 90),
+    executions = count()
+    by evaluator
+| order by avgScore asc
+```
 
 ---