From 2fafd681c3d99623b3122aaff1098632a8346988 Mon Sep 17 00:00:00 2001
From: DB Lee
Date: Wed, 29 Apr 2026 11:56:09 -0700
Subject: [PATCH 1/3] docs: add KQL query library section to telemetry.md

Adds a new 'Querying Traces in Azure Monitor (KQL)' section with 5
ready-to-use KQL queries for users who send eval traces to Azure Monitor /
Application Insights:

1. Slowest evaluation rows (top N eval_item spans by duration)
2. Failed evaluators (filter by passed == false with scores and thresholds)
3. Pass rate over time (trend from root spans with timechart render)
4. Token usage per run (sum input + output tokens by operation_Id)
5. Evaluator score distribution (stats by evaluator name)

Includes a table mapping explaining which AgentOps spans land in which
App Insights tables (requests vs dependencies).

All attribute names verified against telemetry.py source code.

Closes #89

Generated with [Amplifier](https://github.com/microsoft/amplifier)

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
---
 docs/telemetry.md | 107 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 107 insertions(+)

diff --git a/docs/telemetry.md b/docs/telemetry.md
index 04802ee9..341ba887 100644
--- a/docs/telemetry.md
+++ b/docs/telemetry.md
@@ -384,6 +384,113 @@ Refer to the [Azure Monitor OTLP documentation](https://learn.microsoft.com/azur
 
 ---
 
+## Querying Traces in Azure Monitor (KQL)
+
+Once eval traces land in Application Insights (via either option above), you can query them directly in **Application Insights > Logs** using KQL. All span attributes are stored as JSON keys in the `customDimensions` column.
+
+### Table Mapping
+
+AgentOps spans map to App Insights tables based on their OpenTelemetry span kind:
+
+| Span | App Insights Table | Span Kind |
+|---|---|---|
+| `RUN ` (root eval run) | `requests` | `SERVER` |
+| `eval_item N` (per-row evaluation) | `dependencies` | `INTERNAL` |
+| `invoke_agent` / `chat` (agent/model call) | `dependencies` | `CLIENT` |
+| `evaluator ` (individual evaluator) | `dependencies` | `INTERNAL` |
+
+### Query 1: Slowest Evaluation Rows
+
+Find the top 10 slowest evaluation rows to identify performance bottlenecks.
+
+```kql
+dependencies
+| where customDimensions["cicd.pipeline.task.name"] == "eval_item"
+| extend
+    rowIndex = toint(customDimensions["agentops.eval.item.index"]),
+    input = tostring(customDimensions["agentops.eval.item.input"]),
+    passed = tostring(customDimensions["agentops.eval.item.passed"])
+| project timestamp, rowIndex, input, passed, duration, operation_Id
+| top 10 by duration desc
+```
+
+### Query 2: Failed Evaluators
+
+List all evaluator executions that failed their threshold, with scores and thresholds.
+
+```kql
+dependencies
+| where customDimensions["agentops.eval.evaluator.passed"] == "false"
+| extend
+    evaluator = tostring(customDimensions["agentops.eval.evaluator.builtin"]),
+    score = toreal(customDimensions["agentops.eval.evaluator.score"]),
+    threshold = toreal(customDimensions["agentops.eval.evaluator.threshold"]),
+    criteria = tostring(customDimensions["agentops.eval.evaluator.criteria"])
+| project timestamp, evaluator, score, threshold, criteria, operation_Id
+| order by timestamp desc
+```
+
+### Query 3: Pass Rate Over Time
+
+Track overall evaluation pass rate trends from root spans.
+
+```kql
+requests
+| where name startswith "RUN "
+| extend
+    passRate = toreal(customDimensions["agentops.eval.pass_rate"]),
+    bundle = tostring(customDimensions["cicd.pipeline.name"]),
+    dataset = tostring(customDimensions["agentops.eval.dataset"]),
+    itemsTotal = toint(customDimensions["agentops.eval.items_total"]),
+    itemsPassed = toint(customDimensions["agentops.eval.items_passed"])
+| project timestamp, bundle, dataset, passRate, itemsPassed, itemsTotal
+| order by timestamp asc
+| render timechart with (ycolumns=passRate, title="Evaluation Pass Rate Over Time")
+```
+
+### Query 4: Token Usage Per Run
+
+Sum input and output tokens across all agent/model invocations within each eval run.
+
+```kql
+dependencies
+| where customDimensions["gen_ai.operation.name"] in ("invoke_agent", "chat")
+| extend
+    inputTokens = toint(customDimensions["gen_ai.usage.input_tokens"]),
+    outputTokens = toint(customDimensions["gen_ai.usage.output_tokens"]),
+    model = tostring(customDimensions["gen_ai.request.model"])
+| summarize
+    totalInputTokens = sum(inputTokens),
+    totalOutputTokens = sum(outputTokens),
+    totalTokens = sum(inputTokens) + sum(outputTokens),
+    invocations = count()
+    by operation_Id, model
+| order by totalTokens desc
+```
+
+### Query 5: Evaluator Score Distribution
+
+View the distribution of scores grouped by evaluator name to identify consistently low-performing evaluators.
+
+```kql
+dependencies
+| where isnotempty(customDimensions["agentops.eval.evaluator.score"])
+| extend
+    evaluator = tostring(customDimensions["agentops.eval.evaluator.builtin"]),
+    score = toreal(customDimensions["agentops.eval.evaluator.score"])
+| summarize
+    avgScore = avg(score),
+    minScore = min(score),
+    maxScore = max(score),
+    p50 = percentile(score, 50),
+    p90 = percentile(score, 90),
+    count = count()
+    by evaluator
+| order by avgScore asc
+```
+
+---
+
 ## Evaluation Tracing vs. Agent Execution Tracing
 
 It is important to understand that AgentOps telemetry covers **evaluation observability** — not agent execution tracing. These are two different things:

From fae2d41463bebf5603585de9706ef0ee41105130 Mon Sep 17 00:00:00 2001
From: DB Lee
Date: Wed, 29 Apr 2026 13:44:22 -0700
Subject: [PATCH 2/3] docs(telemetry): correct trace tree and attribute table

Validated end-to-end against Jaeger and Azure Monitor (OTel Collector
proxy + App Insights KQL queries 1-5). Adjustments:

- Use 1-based eval_item indices and include the input snippet that the
  runner actually puts in the span name (eval_item N - '').
- Add cicd.pipeline.task.run.id / .run.result to the eval_item example;
  these are emitted by telemetry.py but were missing from the doc.
- Remove agentops.eval.item.expected from the trace tree and the
  attribute table; the attribute is never populated because the runner
  does not pass expected_text into eval_item_span.
- Clarify that gen_ai.provider.name varies by backend
  (azure.ai.inference for Foundry, local.callable for local adapter).
- Note that item.index is 1-based.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 docs/telemetry.md | 32 +++++++++++++++++---------------
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/docs/telemetry.md b/docs/telemetry.md
index 341ba887..75048bc3 100644
--- a/docs/telemetry.md
+++ b/docs/telemetry.md
@@ -49,18 +49,20 @@ A **trace** represents a single end-to-end operation. In AgentOps, one evaluatio
 A **span** is a unit of work with a start time, end time, a name, and key-value attributes. Spans nest inside each other to form a tree. Example:
 
 ```
-RUN conversational_agent_baseline       ← root span (the whole run)
-├── eval_item 0                         ← child span (one dataset row)
-│   ├── invoke_agent my-agent           ← grandchild (the agent call)
-│   ├── evaluator builtin.similarity    ← grandchild (scoring)
-│   └── evaluator builtin.coherence     ← grandchild (scoring)
-├── eval_item 1
+RUN conversational_agent_baseline             ← root span (the whole run)
+├── eval_item 1 - 'What is 2+2?'              ← child span (one dataset row)
+│   ├── invoke_agent my-agent                 ← grandchild (the agent call)
+│   ├── evaluator builtin.similarity          ← grandchild (scoring)
+│   └── evaluator builtin.coherence           ← grandchild (scoring)
+├── eval_item 2 - 'Capital of France?'
 │   ├── invoke_agent my-agent
 │   ├── evaluator builtin.similarity
 │   └── evaluator builtin.coherence
 └── ...
 ```
 
+Item indices are **1-based**, and each `eval_item` span name includes a short snippet of the row input for easy scanning in trace UIs.
+
 Each span records **attributes** — structured key-value pairs like `agentops.eval.evaluator.score = 0.87`.
 
 ### What Is OTLP?
@@ -196,11 +198,12 @@ RUN kind=SERVER
 │     agentops.eval.model = (if applicable)
 │     agentops.eval.agent_id = (if applicable)
 │
-├── eval_item 0                               kind=INTERNAL
+├── eval_item 1 - 'What is 2+2?'              kind=INTERNAL
 │   │     cicd.pipeline.task.name = "eval_item"
-│   │     agentops.eval.item.index = 0
-│   │     agentops.eval.item.input = "..."
-│   │     agentops.eval.item.expected = "..."
+│   │     cicd.pipeline.task.run.id = "1"
+│   │     cicd.pipeline.task.run.result = "success"
+│   │     agentops.eval.item.index = 1
+│   │     agentops.eval.item.input = "What is 2+2?"
 │   │     agentops.eval.item.passed = true
 │   │
 │   ├── invoke_agent my-agent                 kind=CLIENT
@@ -222,7 +225,7 @@ RUN kind=SERVER
 │             agentops.eval.evaluator.score = 0.85
 │             ...
 │
-├── eval_item 1
+├── eval_item 2 - 'Capital of France?'
 │   └── ...
 │
 └── (final attributes on root span)
@@ -265,8 +268,8 @@ Follows the [OTel GenAI semantic conventions](https://opentelemetry.io/docs/spec
 
 | Attribute | Example | Description |
 |---|---|---|
-| `gen_ai.operation.name` | `invoke_agent` / `chat` | Operation type |
-| `gen_ai.provider.name` | `azure.ai.inference` | Provider |
+| `gen_ai.operation.name` | `invoke_agent` / `chat` | Operation type — `invoke_agent` for agent targets, `chat` for model targets |
+| `gen_ai.provider.name` | `azure.ai.inference` / `local.callable` | Provider — varies by backend (e.g. `azure.ai.inference` for Foundry, `local.callable` for the local adapter backend) |
 | `gen_ai.request.model` | `gpt-4o` | Requested model deployment |
 | `gen_ai.response.model` | `gpt-4o-2024-08-06` | Actual model version |
 | `gen_ai.agent.id` | `my-agent:3` | Foundry agent identifier |
@@ -289,9 +292,8 @@ Custom attributes for evaluation-specific data that has no standard equivalent.
 | `agentops.eval.items_total` | `10` | Total rows evaluated |
 | `agentops.eval.items_passed` | `9` | Rows passing thresholds |
 | `agentops.eval.pass_rate` | `0.9` | Pass rate |
-| `agentops.eval.item.index` | `0` | Row index |
+| `agentops.eval.item.index` | `1` | Row index (1-based) |
 | `agentops.eval.item.input` | `"What is 2+2?"` | Input text |
-| `agentops.eval.item.expected` | `"4"` | Expected answer |
 | `agentops.eval.item.passed` | `true` | Row pass/fail |
 | `agentops.eval.evaluator.name` | `SimilarityEvaluator` | Class name |
 | `agentops.eval.evaluator.builtin` | `builtin.similarity` | Builtin name |

From a0d6098e198b5170ed515ff737f3441fd0dfab32 Mon Sep 17 00:00:00 2001
From: DB Lee
Date: Wed, 29 Apr 2026 13:52:10 -0700
Subject: [PATCH 3/3] docs(telemetry): drop misleading 'Option B' Azure Monitor path

Option B told users to set AGENTOPS_OTLP_ENDPOINT directly to a
'https://.applicationinsights.azure.com' URL, but our exporter sends
plain OTLP/HTTP with no Authorization header.
App Insights does not accept that:

- The Azure Monitor OpenTelemetry distro
  (https://learn.microsoft.com/azure/azure-monitor/app/opentelemetry-configuration?tabs=python)
  requires a connection string and configure_azure_monitor(), not a raw
  OTLP endpoint.
- The preview 'Microsoft.Insights/OtlpApplicationInsights' direct OTLP
  ingestion requires Entra ID Bearer-token auth (scope
  https://monitor.azure.com/.default), which telemetry.py does not
  inject today.

Replace the two-option layout with a single recommended path (the
Collector proxy, validated end-to-end against App Insights) and an
explanatory subsection covering why direct export from AgentOps is not
supported.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 docs/telemetry.md | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/docs/telemetry.md b/docs/telemetry.md
index 75048bc3..146c3a5c 100644
--- a/docs/telemetry.md
+++ b/docs/telemetry.md
@@ -347,9 +347,9 @@ Jaeger shows spans as horizontal bars on a timeline:
 
 ## Sending Traces to Azure Monitor
 
-For production, you may want traces in Azure Monitor / Application Insights instead of local Jaeger.
+For production, you may want traces in Azure Monitor / Application Insights instead of local Jaeger. The recommended path is the **OpenTelemetry Collector** running locally (or as a sidecar) with the Azure Monitor exporter.
 
-### Option A: Use the OTel Collector as a Proxy
+### Use the OTel Collector as a Proxy
 
 Run the [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) with an Azure Monitor exporter:
 
@@ -374,21 +374,20 @@ service:
 
 Then set `AGENTOPS_OTLP_ENDPOINT=http://localhost:4318`.
 
-### Option B: Use Azure Monitor's OTLP Endpoint Directly
+### Why not export from AgentOps directly?
 
-Azure Monitor now supports OTLP ingestion natively. Set the endpoint to your Application Insights OTLP ingestion URL:
+AgentOps ships a vanilla `OTLPSpanExporter` that POSTs `application/x-protobuf` to `/v1/traces` with no Authorization header. This is fine for any plain OTLP/HTTP backend (Jaeger, Tempo, the Collector, etc.), but it is **not** sufficient for Azure Monitor:
 
-```bash
-export AGENTOPS_OTLP_ENDPOINT=https://.applicationinsights.azure.com
-```
+- The official Azure Monitor OpenTelemetry distro for Python (see [Microsoft Learn — OpenTelemetry configuration](https://learn.microsoft.com/azure/azure-monitor/app/opentelemetry-configuration?tabs=python)) requires a **connection string** and is invoked via `configure_azure_monitor()`, not a raw OTLP endpoint.
+- Application Insights also has a preview feature (`Microsoft.Insights/OtlpApplicationInsights`) that exposes per-resource OTLP ingestion URLs, but it requires **Entra ID Bearer-token authentication** (scope `https://monitor.azure.com/.default`), which AgentOps's exporter does not currently inject.
 
-Refer to the [Azure Monitor OTLP documentation](https://learn.microsoft.com/azure/azure-monitor/app/opentelemetry-configuration) for details.
+The Collector proxy avoids both issues: AgentOps speaks plain OTLP/HTTP to the Collector, and the Collector handles authentication to Azure Monitor.
 
 ---
 
 ## Querying Traces in Azure Monitor (KQL)
 
-Once eval traces land in Application Insights (via either option above), you can query them directly in **Application Insights > Logs** using KQL. All span attributes are stored as JSON keys in the `customDimensions` column.
+Once eval traces land in Application Insights via the Collector, you can query them directly in **Application Insights > Logs** using KQL. All span attributes are stored as JSON keys in the `customDimensions` column.
 
 ### Table Mapping
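
To sanity-check the aggregation in Query 4 (patch 1/3) without an Application Insights workspace, the same grouping can be approximated offline. The following is a minimal Python sketch, assuming exported `dependencies` rows are available as dicts with an `operation_Id` key and a `customDimensions` dict of string values. That export shape, the helper name `summarize_tokens`, and the sample rows are all hypothetical; only the attribute names come from the patch.

```python
from collections import defaultdict

# Operation names that Query 4 filters on (taken from the patch).
GEN_AI_OPERATIONS = ("invoke_agent", "chat")


def summarize_tokens(rows):
    """Hypothetical helper: total tokens per (operation_Id, model), like Query 4."""
    totals = defaultdict(lambda: {"input": 0, "output": 0, "invocations": 0})
    for row in rows:
        dims = row.get("customDimensions") or {}
        if dims.get("gen_ai.operation.name") not in GEN_AI_OPERATIONS:
            continue  # skip eval_item and evaluator spans
        key = (row.get("operation_Id"), dims.get("gen_ai.request.model"))
        bucket = totals[key]
        # Values are assumed to arrive as strings; a missing count adds 0,
        # mirroring how KQL sum() skips nulls.
        bucket["input"] += int(dims.get("gen_ai.usage.input_tokens") or 0)
        bucket["output"] += int(dims.get("gen_ai.usage.output_tokens") or 0)
        bucket["invocations"] += 1
    return dict(totals)


# Made-up rows for illustration only; real exports may be shaped differently.
SAMPLE_ROWS = [
    {"operation_Id": "run-1", "customDimensions": {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": "gpt-4o",
        "gen_ai.usage.input_tokens": "10",
        "gen_ai.usage.output_tokens": "5"}},
    {"operation_Id": "run-1", "customDimensions": {
        "gen_ai.operation.name": "invoke_agent",
        "gen_ai.request.model": "gpt-4o",
        "gen_ai.usage.input_tokens": "20",
        "gen_ai.usage.output_tokens": "7"}},
    {"operation_Id": "run-1", "customDimensions": {
        "agentops.eval.evaluator.score": "0.85"}},
]

if __name__ == "__main__":
    for (run_id, model), t in summarize_tokens(SAMPLE_ROWS).items():
        print(run_id, model, t["input"] + t["output"], t["invocations"])
```

Grouping by the `(operation_Id, model)` pair follows the `by operation_Id, model` clause in the query, so a run that calls two model deployments would produce two entries rather than one.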