From 2fafd681c3d99623b3122aaff1098632a8346988 Mon Sep 17 00:00:00 2001
From: DB Lee <donlee@microsoft.com>
Date: Wed, 29 Apr 2026 11:56:09 -0700
Subject: [PATCH 1/3] docs: add KQL query library section to telemetry.md

Adds a new 'Querying Traces in Azure Monitor (KQL)' section with 5 ready-to-use KQL queries for users who send eval traces to Azure Monitor / Application Insights:

1. Slowest evaluation rows (top N eval_item spans by duration)
2. Failed evaluators (filter by passed == false with scores and thresholds)
3. Pass rate over time (trend from root spans with timechart render)
4. Token usage per run (sum input + output tokens by operation_Id)
5. Evaluator score distribution (stats by evaluator name)

Includes a table mapping AgentOps spans to the App Insights tables they land in (requests vs. dependencies). All attribute names were verified against the telemetry.py source code.

Closes #89

Generated with [Amplifier](https://github.com/microsoft/amplifier)

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
---
 docs/telemetry.md | 107 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 107 insertions(+)
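
Note (not part of the commit message): the Table Mapping section added below amounts to a lookup from span kind to table. A minimal Python sketch of that lookup, assuming only the three span kinds listed in the table; `TABLE_BY_SPAN_KIND` and `app_insights_table` are hypothetical names used for illustration and are not taken from telemetry.py.

```python
# Hypothetical helper: mirrors the rows of the Table Mapping section, nothing more.
TABLE_BY_SPAN_KIND = {
    "SERVER": "requests",        # RUN <bundle> root span
    "INTERNAL": "dependencies",  # eval_item and evaluator spans
    "CLIENT": "dependencies",    # invoke_agent / chat spans
}


def app_insights_table(span_kind: str) -> str:
    """Return the App Insights table that holds spans of the given kind."""
    try:
        return TABLE_BY_SPAN_KIND[span_kind.upper()]
    except KeyError:
        # Span kinds outside the documented table are deliberately not guessed.
        raise ValueError(f"span kind not covered by the table: {span_kind!r}") from None


if __name__ == "__main__":
    for kind in ("SERVER", "INTERNAL", "CLIENT"):
        print(kind, "->", app_insights_table(kind))
```

This is why Query 3 reads from `requests` while the other four queries read from `dependencies`.
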
diff --git a/docs/telemetry.md b/docs/telemetry.md
index 04802ee9..341ba887 100644
--- a/docs/telemetry.md
+++ b/docs/telemetry.md
@@ -384,6 +384,113 @@ Refer to the [Azure Monitor OTLP documentation](https://learn.microsoft.com/azur
 
 ---
 
+## Querying Traces in Azure Monitor (KQL)
+
+Once eval traces land in Application Insights (via either option above), you can query them directly in **Application Insights > Logs** using KQL. All span attributes are stored as JSON keys in the `customDimensions` column.
+
+### Table Mapping
+
+AgentOps spans map to App Insights tables based on their OpenTelemetry span kind:
+
+| Span | App Insights Table | Span Kind |
+|---|---|---|
+| `RUN <bundle>` (root eval run) | `requests` | `SERVER` |
+| `eval_item N` (per-row evaluation) | `dependencies` | `INTERNAL` |
+| `invoke_agent` / `chat` (agent/model call) | `dependencies` | `CLIENT` |
+| `evaluator <name>` (individual evaluator) | `dependencies` | `INTERNAL` |
+
+### Query 1: Slowest Evaluation Rows
+
+Find the 10 slowest evaluation rows to identify performance bottlenecks. In the `dependencies` table, `duration` is reported in milliseconds.
+
+```kql
+dependencies
+| where customDimensions["cicd.pipeline.task.name"] == "eval_item"
+| extend
+    rowIndex = toint(customDimensions["agentops.eval.item.index"]),
+    input = tostring(customDimensions["agentops.eval.item.input"]),
+    passed = tostring(customDimensions["agentops.eval.item.passed"])
+| project timestamp, rowIndex, input, passed, duration, operation_Id
+| top 10 by duration desc
+```
+
+### Query 2: Failed Evaluators
+
+List all evaluator executions that failed their threshold, with scores and thresholds.
+
+```kql
+dependencies
+| where tostring(customDimensions["agentops.eval.evaluator.passed"]) =~ "false"
+| extend
+    evaluator = tostring(customDimensions["agentops.eval.evaluator.builtin"]),
+    score = toreal(customDimensions["agentops.eval.evaluator.score"]),
+    threshold = toreal(customDimensions["agentops.eval.evaluator.threshold"]),
+    criteria = tostring(customDimensions["agentops.eval.evaluator.criteria"])
+| project timestamp, evaluator, score, threshold, criteria, operation_Id
+| order by timestamp desc
+```
+
+### Query 3: Pass Rate Over Time
+
+Track overall evaluation pass rate trends from root spans.
+
+```kql
+requests
+| where name startswith "RUN "
+| extend
+    passRate = toreal(customDimensions["agentops.eval.pass_rate"]),
+    bundle = tostring(customDimensions["cicd.pipeline.name"]),
+    dataset = tostring(customDimensions["agentops.eval.dataset"]),
+    itemsTotal = toint(customDimensions["agentops.eval.items_total"]),
+    itemsPassed = toint(customDimensions["agentops.eval.items_passed"])
+| project timestamp, bundle, dataset, passRate, itemsPassed, itemsTotal
+| order by timestamp asc
+| render timechart with (ycolumns=passRate, title="Evaluation Pass Rate Over Time")
+```
+
+### Query 4: Token Usage Per Run
+
+Sum input and output tokens across all agent/model invocations within each eval run.
+
+```kql
+dependencies
+| where tostring(customDimensions["gen_ai.operation.name"]) in ("invoke_agent", "chat")
+| extend
+    inputTokens = toint(customDimensions["gen_ai.usage.input_tokens"]),
+    outputTokens = toint(customDimensions["gen_ai.usage.output_tokens"]),
+    model = tostring(customDimensions["gen_ai.request.model"])
+| summarize
+    totalInputTokens = sum(inputTokens),
+    totalOutputTokens = sum(outputTokens),
+    totalTokens = sum(inputTokens) + sum(outputTokens),
+    invocations = count()
+    by operation_Id, model
+| order by totalTokens desc
+```
+
+### Query 5: Evaluator Score Distribution
+
+View score statistics (average, minimum, maximum, median, and 90th percentile) grouped by builtin evaluator name to spot the evaluators on which results are consistently low.
+
+```kql
+dependencies
+| where isnotempty(customDimensions["agentops.eval.evaluator.score"])
+| extend
+    evaluator = tostring(customDimensions["agentops.eval.evaluator.builtin"]),
+    score = toreal(customDimensions["agentops.eval.evaluator.score"])
+| summarize
+    avgScore = avg(score),
+    minScore = min(score),
+    maxScore = max(score),
+    p50 = percentile(score, 50),
+    p90 = percentile(score, 90),
+    count = count()
+    by evaluator
+| order by avgScore asc
+```
+
+---
+
 ## Evaluation Tracing vs. Agent Execution Tracing
 
 It is important to understand that AgentOps telemetry covers **evaluation observability** — not agent execution tracing. These are two different things:

From fae2d41463bebf5603585de9706ef0ee41105130 Mon Sep 17 00:00:00 2001
From: DB Lee <donlee@microsoft.com>
Date: Wed, 29 Apr 2026 13:44:22 -0700
Subject: [PATCH 2/3] docs(telemetry): correct trace tree and attribute table

Validated end-to-end against Jaeger and Azure Monitor (OTel Collector
proxy + App Insights KQL queries 1-5). Adjustments:

- Use 1-based eval_item indices and include the input snippet that the
  runner actually puts in the span name (eval_item N - '<input>').
- Add cicd.pipeline.task.run.id / .run.result to the eval_item example;
  these are emitted by telemetry.py but were missing from the doc.
- Remove agentops.eval.item.expected from the trace tree and the
  attribute table; the attribute is never populated because the runner
  does not pass expected_text into eval_item_span.
- Clarify that gen_ai.provider.name varies by backend
  (azure.ai.inference for Foundry, local.callable for local adapter).
- Note that item.index is 1-based.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 docs/telemetry.md | 32 +++++++++++++++++---------------
 1 file changed, 17 insertions(+), 15 deletions(-)
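
Note (not part of the commit message): the span-name shape described above, `eval_item N - '<input>'` with 1-based indices, can be sketched in Python as follows. This is an illustration only, not the runner's code: `eval_item_span_name` is a hypothetical name, and the 40-character limit and the `...` truncation marker are assumptions, since the patch only says the name carries a short snippet of the row input.

```python
def eval_item_span_name(index: int, input_text: str, max_len: int = 40) -> str:
    """Build a span name shaped like: eval_item N - '<input>'.

    max_len and the "..." marker are assumed values for illustration.
    """
    if index < 1:
        raise ValueError("item indices are 1-based")
    if len(input_text) <= max_len:
        snippet = input_text
    else:
        snippet = input_text[:max_len] + "..."
    return f"eval_item {index} - '{snippet}'"


if __name__ == "__main__":
    rows = ["What is 2+2?", "Capital of France?"]
    # enumerate(..., start=1) yields the 1-based indices used in the docs.
    for i, text in enumerate(rows, start=1):
        print(eval_item_span_name(i, text))
```

Because the index is 1-based, `agentops.eval.item.index` for the first dataset row is `1`, which matches the updated attribute table.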

diff --git a/docs/telemetry.md b/docs/telemetry.md
index 341ba887..75048bc3 100644
--- a/docs/telemetry.md
+++ b/docs/telemetry.md
@@ -49,18 +49,20 @@ A **trace** represents a single end-to-end operation. In AgentOps, one evaluatio
 A **span** is a unit of work with a start time, end time, a name, and key-value attributes. Spans nest inside each other to form a tree. Example:
 
 ```
-RUN conversational_agent_baseline          ← root span (the whole run)
-├── eval_item 0                            ← child span (one dataset row)
-│   ├── invoke_agent my-agent              ← grandchild (the agent call)
-│   ├── evaluator builtin.similarity       ← grandchild (scoring)
-│   └── evaluator builtin.coherence        ← grandchild (scoring)
-├── eval_item 1
+RUN conversational_agent_baseline                    ← root span (the whole run)
+├── eval_item 1 - 'What is 2+2?'                     ← child span (one dataset row)
+│   ├── invoke_agent my-agent                        ← grandchild (the agent call)
+│   ├── evaluator builtin.similarity                 ← grandchild (scoring)
+│   └── evaluator builtin.coherence                  ← grandchild (scoring)
+├── eval_item 2 - 'Capital of France?'
 │   ├── invoke_agent my-agent
 │   ├── evaluator builtin.similarity
 │   └── evaluator builtin.coherence
 └── ...
 ```
 
+Item indices are **1-based**, and each `eval_item` span name includes a short snippet of the row input for easy scanning in trace UIs.
+
 Each span records **attributes** — structured key-value pairs like `agentops.eval.evaluator.score = 0.87`.
 
 ### What Is OTLP?
@@ -196,11 +198,12 @@ RUN <bundle_name>                             kind=SERVER
 │     agentops.eval.model = <deployment>          (if applicable)
 │     agentops.eval.agent_id = <agent_id>         (if applicable)
 │
-├── eval_item 0                                kind=INTERNAL
+├── eval_item 1 - 'What is 2+2?'              kind=INTERNAL
 │   │   cicd.pipeline.task.name = "eval_item"
-│   │   agentops.eval.item.index = 0
-│   │   agentops.eval.item.input = "..."
-│   │   agentops.eval.item.expected = "..."
+│   │   cicd.pipeline.task.run.id = "1"
+│   │   cicd.pipeline.task.run.result = "success"
+│   │   agentops.eval.item.index = 1
+│   │   agentops.eval.item.input = "What is 2+2?"
 │   │   agentops.eval.item.passed = true
 │   │
 │   ├── invoke_agent my-agent                  kind=CLIENT
@@ -222,7 +225,7 @@ RUN <bundle_name>                             kind=SERVER
 │         agentops.eval.evaluator.score = 0.85
 │         ...
 │
-├── eval_item 1
+├── eval_item 2 - 'Capital of France?'
 │   └── ...
 │
 └── (final attributes on root span)
@@ -265,8 +268,8 @@ Follows the [OTel GenAI semantic conventions](https://opentelemetry.io/docs/spec
 
 | Attribute | Example | Description |
 |---|---|---|
-| `gen_ai.operation.name` | `invoke_agent` / `chat` | Operation type |
-| `gen_ai.provider.name` | `azure.ai.inference` | Provider |
+| `gen_ai.operation.name` | `invoke_agent` / `chat` | Operation type — `invoke_agent` for agent targets, `chat` for model targets |
+| `gen_ai.provider.name` | `azure.ai.inference` / `local.callable` | Provider — varies by backend (e.g. `azure.ai.inference` for Foundry, `local.callable` for the local adapter backend) |
 | `gen_ai.request.model` | `gpt-4o` | Requested model deployment |
 | `gen_ai.response.model` | `gpt-4o-2024-08-06` | Actual model version |
 | `gen_ai.agent.id` | `my-agent:3` | Foundry agent identifier |
@@ -289,9 +292,8 @@ Custom attributes for evaluation-specific data that has no standard equivalent.
 | `agentops.eval.items_total` | `10` | Total rows evaluated |
 | `agentops.eval.items_passed` | `9` | Rows passing thresholds |
 | `agentops.eval.pass_rate` | `0.9` | Pass rate |
-| `agentops.eval.item.index` | `0` | Row index |
+| `agentops.eval.item.index` | `1` | Row index (1-based) |
 | `agentops.eval.item.input` | `"What is 2+2?"` | Input text |
-| `agentops.eval.item.expected` | `"4"` | Expected answer |
 | `agentops.eval.item.passed` | `true` | Row pass/fail |
 | `agentops.eval.evaluator.name` | `SimilarityEvaluator` | Class name |
 | `agentops.eval.evaluator.builtin` | `builtin.similarity` | Builtin name |

From a0d6098e198b5170ed515ff737f3441fd0dfab32 Mon Sep 17 00:00:00 2001
From: DB Lee <donlee@microsoft.com>
Date: Wed, 29 Apr 2026 13:52:10 -0700
Subject: [PATCH 3/3] docs(telemetry): drop misleading 'Option B' Azure Monitor
 path

Option B told users to set AGENTOPS_OTLP_ENDPOINT directly to a
'https://<region>.applicationinsights.azure.com' URL, but our exporter
sends plain OTLP/HTTP with no Authorization header. App Insights does
not accept that:

- The Azure Monitor OpenTelemetry distro
  (https://learn.microsoft.com/azure/azure-monitor/app/opentelemetry-configuration?tabs=python)
  requires a connection string and configure_azure_monitor(), not a
  raw OTLP endpoint.
- The preview 'Microsoft.Insights/OtlpApplicationInsights' direct OTLP
  ingestion requires Entra ID Bearer-token auth (scope
  https://monitor.azure.com/.default), which telemetry.py does not
  inject today.

Replace the two-option layout with a single recommended path (the
Collector proxy, validated end-to-end against App Insights) and an
explanatory subsection covering why direct export from AgentOps is
not supported.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 docs/telemetry.md | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)
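
Note (not part of the commit message): a small Python sketch of the request shape discussed above, to make the gap concrete. It does not reproduce how `OTLPSpanExporter` is implemented; `traces_url` and `otlp_headers` are hypothetical helpers, and the token value in the usage lines is a placeholder.

```python
from typing import Dict, Optional


def traces_url(endpoint: str) -> str:
    """Append the OTLP/HTTP traces path to a base endpoint."""
    return endpoint.rstrip("/") + "/v1/traces"


def otlp_headers(bearer_token: Optional[str] = None) -> Dict[str, str]:
    """Headers for an OTLP/HTTP protobuf POST.

    Without a token this matches the plain request described in the commit
    message (no Authorization header). With a token it shows the extra header
    that a Bearer-token-protected endpoint would expect.
    """
    headers = {"Content-Type": "application/x-protobuf"}
    if bearer_token:
        headers["Authorization"] = f"Bearer {bearer_token}"
    return headers


if __name__ == "__main__":
    # What the Collector on localhost receives: no auth header needed.
    print(traces_url("http://localhost:4318"), otlp_headers())
    # What a token-protected endpoint would additionally require (placeholder token).
    print(otlp_headers("<entra-id-token>"))
```

With the Collector in between, only the first form is needed on the AgentOps side; the Collector's exporter takes care of authenticating to Azure Monitor.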

diff --git a/docs/telemetry.md b/docs/telemetry.md
index 75048bc3..146c3a5c 100644
--- a/docs/telemetry.md
+++ b/docs/telemetry.md
@@ -347,9 +347,9 @@ Jaeger shows spans as horizontal bars on a timeline:
 
 ## Sending Traces to Azure Monitor
 
-For production, you may want traces in Azure Monitor / Application Insights instead of local Jaeger.
+For production, you may want traces in Azure Monitor / Application Insights instead of local Jaeger. The recommended path is the **OpenTelemetry Collector** running locally (or as a sidecar) with the Azure Monitor exporter.
 
-### Option A: Use the OTel Collector as a Proxy
+### Use the OTel Collector as a Proxy
 
 Run the [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) with an Azure Monitor exporter:
 
@@ -374,21 +374,20 @@ service:
 
 Then set `AGENTOPS_OTLP_ENDPOINT=http://localhost:4318`.
 
-### Option B: Use Azure Monitor's OTLP Endpoint Directly
+### Why Not Export from AgentOps Directly?
 
-Azure Monitor now supports OTLP ingestion natively. Set the endpoint to your Application Insights OTLP ingestion URL:
+AgentOps ships a vanilla `OTLPSpanExporter` that POSTs `application/x-protobuf` to `<endpoint>/v1/traces` with no Authorization header. This is fine for any plain OTLP/HTTP backend (Jaeger, Tempo, the Collector, etc.), but it is **not** sufficient for Azure Monitor:
 
-```bash
-export AGENTOPS_OTLP_ENDPOINT=https://<region>.applicationinsights.azure.com
-```
+- The official Azure Monitor OpenTelemetry distro for Python (see [Microsoft Learn — OpenTelemetry configuration](https://learn.microsoft.com/azure/azure-monitor/app/opentelemetry-configuration?tabs=python)) requires a **connection string** and is invoked via `configure_azure_monitor()`, not a raw OTLP endpoint.
+- Application Insights also has a preview feature (`Microsoft.Insights/OtlpApplicationInsights`) that exposes per-resource OTLP ingestion URLs, but it requires **Entra ID Bearer-token authentication** (scope `https://monitor.azure.com/.default`), which AgentOps's exporter does not currently inject.
 
-Refer to the [Azure Monitor OTLP documentation](https://learn.microsoft.com/azure/azure-monitor/app/opentelemetry-configuration) for details.
+The Collector proxy avoids both issues: AgentOps speaks plain OTLP/HTTP to the Collector, and the Collector handles authentication to Azure Monitor.
 
 ---
 
 ## Querying Traces in Azure Monitor (KQL)
 
-Once eval traces land in Application Insights (via either option above), you can query them directly in **Application Insights > Logs** using KQL. All span attributes are stored as JSON keys in the `customDimensions` column.
+Once eval traces land in Application Insights via the Collector, you can query them directly in **Application Insights > Logs** using KQL. All span attributes are stored as JSON keys in the `customDimensions` column.
 
 ### Table Mapping