diff --git a/docs/telemetry.md b/docs/telemetry.md
index 04802ee9..146c3a5c 100644
--- a/docs/telemetry.md
+++ b/docs/telemetry.md
@@ -49,18 +49,20 @@ A **trace** represents a single end-to-end operation. In AgentOps, one evaluatio
 A **span** is a unit of work with a start time, end time, a name, and key-value attributes. Spans nest inside each other to form a tree. Example:
 
 ```
-RUN conversational_agent_baseline       ← root span (the whole run)
-├── eval_item 0                         ← child span (one dataset row)
-│   ├── invoke_agent my-agent           ← grandchild (the agent call)
-│   ├── evaluator builtin.similarity    ← grandchild (scoring)
-│   └── evaluator builtin.coherence     ← grandchild (scoring)
-├── eval_item 1
+RUN conversational_agent_baseline       ← root span (the whole run)
+├── eval_item 1 - 'What is 2+2?'        ← child span (one dataset row)
+│   ├── invoke_agent my-agent           ← grandchild (the agent call)
+│   ├── evaluator builtin.similarity    ← grandchild (scoring)
+│   └── evaluator builtin.coherence     ← grandchild (scoring)
+├── eval_item 2 - 'Capital of France?'
 │   ├── invoke_agent my-agent
 │   ├── evaluator builtin.similarity
 │   └── evaluator builtin.coherence
 └── ...
 ```
 
+Item indices are **1-based**, and each `eval_item` span name includes a short snippet of the row input for easy scanning in trace UIs.
+
 Each span records **attributes** — structured key-value pairs like `agentops.eval.evaluator.score = 0.87`.
 
 ### What Is OTLP?
@@ -196,11 +198,12 @@ RUN kind=SERVER
 │   agentops.eval.model = (if applicable)
 │   agentops.eval.agent_id = (if applicable)
 │
-├── eval_item 0 kind=INTERNAL
+├── eval_item 1 - 'What is 2+2?' kind=INTERNAL
 │   │   cicd.pipeline.task.name = "eval_item"
-│   │   agentops.eval.item.index = 0
-│   │   agentops.eval.item.input = "..."
-│   │   agentops.eval.item.expected = "..."
+│   │   cicd.pipeline.task.run.id = "1"
+│   │   cicd.pipeline.task.run.result = "success"
+│   │   agentops.eval.item.index = 1
+│   │   agentops.eval.item.input = "What is 2+2?"
 │   │   agentops.eval.item.passed = true
 │   │
 │   ├── invoke_agent my-agent kind=CLIENT
@@ -222,7 +225,7 @@ RUN kind=SERVER
 │       agentops.eval.evaluator.score = 0.85
 │       ...
 │
-├── eval_item 1
+├── eval_item 2 - 'Capital of France?'
 │   └── ...
 │
 └── (final attributes on root span)
@@ -265,8 +268,8 @@ Follows the [OTel GenAI semantic conventions](https://opentelemetry.io/docs/spec
 
 | Attribute | Example | Description |
 |---|---|---|
-| `gen_ai.operation.name` | `invoke_agent` / `chat` | Operation type |
-| `gen_ai.provider.name` | `azure.ai.inference` | Provider |
+| `gen_ai.operation.name` | `invoke_agent` / `chat` | Operation type — `invoke_agent` for agent targets, `chat` for model targets |
+| `gen_ai.provider.name` | `azure.ai.inference` / `local.callable` | Provider — varies by backend (e.g. `azure.ai.inference` for Foundry, `local.callable` for the local adapter backend) |
 | `gen_ai.request.model` | `gpt-4o` | Requested model deployment |
 | `gen_ai.response.model` | `gpt-4o-2024-08-06` | Actual model version |
 | `gen_ai.agent.id` | `my-agent:3` | Foundry agent identifier |
@@ -289,9 +292,8 @@ Custom attributes for evaluation-specific data that has no standard equivalent.
 | `agentops.eval.items_total` | `10` | Total rows evaluated |
 | `agentops.eval.items_passed` | `9` | Rows passing thresholds |
 | `agentops.eval.pass_rate` | `0.9` | Pass rate |
-| `agentops.eval.item.index` | `0` | Row index |
+| `agentops.eval.item.index` | `1` | Row index (1-based) |
 | `agentops.eval.item.input` | `"What is 2+2?"` | Input text |
-| `agentops.eval.item.expected` | `"4"` | Expected answer |
 | `agentops.eval.item.passed` | `true` | Row pass/fail |
 | `agentops.eval.evaluator.name` | `SimilarityEvaluator` | Class name |
 | `agentops.eval.evaluator.builtin` | `builtin.similarity` | Builtin name |
@@ -345,9 +347,9 @@ Jaeger shows spans as horizontal bars on a timeline:
 
 ## Sending Traces to Azure Monitor
 
-For production, you may want traces in Azure Monitor / Application Insights instead of local Jaeger.
+For production, you may want traces in Azure Monitor / Application Insights instead of local Jaeger. The recommended path is the **OpenTelemetry Collector** running locally (or as a sidecar) with the Azure Monitor exporter.
 
-### Option A: Use the OTel Collector as a Proxy
+### Use the OTel Collector as a Proxy
 
 Run the [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) with an Azure Monitor exporter:
 
@@ -372,15 +374,121 @@ service:
 
 Then set `AGENTOPS_OTLP_ENDPOINT=http://localhost:4318`.
 
-### Option B: Use Azure Monitor's OTLP Endpoint Directly
+### Why not export from AgentOps directly?
 
-Azure Monitor now supports OTLP ingestion natively. Set the endpoint to your Application Insights OTLP ingestion URL:
+AgentOps ships a vanilla `OTLPSpanExporter` that POSTs `application/x-protobuf` to `/v1/traces` with no Authorization header. This is fine for any plain OTLP/HTTP backend (Jaeger, Tempo, the Collector, etc.), but it is **not** sufficient for Azure Monitor:
 
-```bash
-export AGENTOPS_OTLP_ENDPOINT=https://.applicationinsights.azure.com
+- The official Azure Monitor OpenTelemetry distro for Python (see [Microsoft Learn — OpenTelemetry configuration](https://learn.microsoft.com/azure/azure-monitor/app/opentelemetry-configuration?tabs=python)) requires a **connection string** and is invoked via `configure_azure_monitor()`, not a raw OTLP endpoint.
+- Application Insights also has a preview feature (`Microsoft.Insights/OtlpApplicationInsights`) that exposes per-resource OTLP ingestion URLs, but it requires **Entra ID Bearer-token authentication** (scope `https://monitor.azure.com/.default`), which AgentOps's exporter does not currently inject.
+
+The Collector proxy avoids both issues: AgentOps speaks plain OTLP/HTTP to the Collector, and the Collector handles authentication to Azure Monitor.
+
+---
+
+## Querying Traces in Azure Monitor (KQL)
+
+Once eval traces land in Application Insights via the Collector, you can query them directly in **Application Insights > Logs** using KQL. All span attributes are stored as JSON keys in the `customDimensions` column.
+
+### Table Mapping
+
+AgentOps spans map to App Insights tables based on their OpenTelemetry span kind:
+
+| Span | App Insights Table | Span Kind |
+|---|---|---|
+| `RUN ` (root eval run) | `requests` | `SERVER` |
+| `eval_item N` (per-row evaluation) | `dependencies` | `INTERNAL` |
+| `invoke_agent` / `chat` (agent/model call) | `dependencies` | `CLIENT` |
+| `evaluator ` (individual evaluator) | `dependencies` | `INTERNAL` |
+
+### Query 1: Slowest Evaluation Rows
+
+Find the top 10 slowest evaluation rows to identify performance bottlenecks.
+
+```kql
+dependencies
+| where customDimensions["cicd.pipeline.task.name"] == "eval_item"
+| extend
+    rowIndex = toint(customDimensions["agentops.eval.item.index"]),
+    input = tostring(customDimensions["agentops.eval.item.input"]),
+    passed = tostring(customDimensions["agentops.eval.item.passed"])
+| project timestamp, rowIndex, input, passed, duration, operation_Id
+| top 10 by duration desc
+```
+
+### Query 2: Failed Evaluators
+
+List all evaluator executions that failed their threshold, with scores and thresholds.
+
+```kql
+dependencies
+| where customDimensions["agentops.eval.evaluator.passed"] == "false"
+| extend
+    evaluator = tostring(customDimensions["agentops.eval.evaluator.builtin"]),
+    score = toreal(customDimensions["agentops.eval.evaluator.score"]),
+    threshold = toreal(customDimensions["agentops.eval.evaluator.threshold"]),
+    criteria = tostring(customDimensions["agentops.eval.evaluator.criteria"])
+| project timestamp, evaluator, score, threshold, criteria, operation_Id
+| order by timestamp desc
 ```
 
-Refer to the [Azure Monitor OTLP documentation](https://learn.microsoft.com/azure/azure-monitor/app/opentelemetry-configuration) for details.
+### Query 3: Pass Rate Over Time
+
+Track overall evaluation pass rate trends from root spans.
+
+```kql
+requests
+| where name startswith "RUN "
+| extend
+    passRate = toreal(customDimensions["agentops.eval.pass_rate"]),
+    bundle = tostring(customDimensions["cicd.pipeline.name"]),
+    dataset = tostring(customDimensions["agentops.eval.dataset"]),
+    itemsTotal = toint(customDimensions["agentops.eval.items_total"]),
+    itemsPassed = toint(customDimensions["agentops.eval.items_passed"])
+| project timestamp, bundle, dataset, passRate, itemsPassed, itemsTotal
+| order by timestamp asc
+| render timechart with (ycolumns=passRate, title="Evaluation Pass Rate Over Time")
+```
+
+### Query 4: Token Usage Per Run
+
+Sum input and output tokens across all agent/model invocations within each eval run.
+
+```kql
+dependencies
+| where customDimensions["gen_ai.operation.name"] in ("invoke_agent", "chat")
+| extend
+    inputTokens = toint(customDimensions["gen_ai.usage.input_tokens"]),
+    outputTokens = toint(customDimensions["gen_ai.usage.output_tokens"]),
+    model = tostring(customDimensions["gen_ai.request.model"])
+| summarize
+    totalInputTokens = sum(inputTokens),
+    totalOutputTokens = sum(outputTokens),
+    totalTokens = sum(inputTokens) + sum(outputTokens),
+    invocations = count()
+    by operation_Id, model
+| order by totalTokens desc
+```
+
+### Query 5: Evaluator Score Distribution
+
+View the distribution of scores grouped by evaluator name to identify consistently low-performing evaluators.
+
+```kql
+dependencies
+| where isnotempty(customDimensions["agentops.eval.evaluator.score"])
+| extend
+    evaluator = tostring(customDimensions["agentops.eval.evaluator.builtin"]),
+    score = toreal(customDimensions["agentops.eval.evaluator.score"])
+| summarize
+    avgScore = avg(score),
+    minScore = min(score),
+    maxScore = max(score),
+    p50 = percentile(score, 50),
+    p90 = percentile(score, 90),
+    count = count()
+    by evaluator
+| order by avgScore asc
+```
 
 ---
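
As a rough local cross-check of Query 5, the same grouping can be sketched in plain Python over rows exported from the query results. The row shape (`evaluator` and `score` keys), the `summarize_scores` helper, and the sample values are assumptions for illustration and are not part of AgentOps. Percentiles are left out because KQL's `percentile()` returns an estimate that may not match an exact computation.

```python
from collections import defaultdict
from statistics import mean


def summarize_scores(rows):
    """Group scores by evaluator; return avg/min/max/count, lowest average first.

    `rows` is assumed to be an iterable of dicts with `evaluator` and `score`
    keys (a hypothetical export shape, not an AgentOps API).
    """
    grouped = defaultdict(list)
    for row in rows:
        score = row.get("score")
        if score is None:  # mirrors the isnotempty() filter in Query 5
            continue
        grouped[row["evaluator"]].append(float(score))

    summary = [
        {
            "evaluator": name,
            "avgScore": mean(scores),
            "minScore": min(scores),
            "maxScore": max(scores),
            "count": len(scores),
        }
        for name, scores in grouped.items()
    ]
    # Same ordering as `order by avgScore asc`.
    return sorted(summary, key=lambda item: item["avgScore"])


# Hypothetical sample rows, not real evaluation results.
sample_rows = [
    {"evaluator": "builtin.similarity", "score": 0.5},
    {"evaluator": "builtin.similarity", "score": 1.0},
    {"evaluator": "builtin.coherence", "score": 0.25},
    {"evaluator": "builtin.coherence", "score": None},
]
result = summarize_scores(sample_rows)
```

Rows without a score are skipped rather than counted as zero, so an evaluator that did not run cannot drag an average down.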