152 changes: 130 additions & 22 deletions docs/telemetry.md
@@ -49,18 +49,20 @@ A **trace** represents a single end-to-end operation. In AgentOps, one evaluatio
A **span** is a unit of work with a start time, end time, a name, and key-value attributes. Spans nest inside each other to form a tree. Example:

```
RUN conversational_agent_baseline ← root span (the whole run)
├── eval_item 0 ← child span (one dataset row)
│ ├── invoke_agent my-agent ← grandchild (the agent call)
│ ├── evaluator builtin.similarity ← grandchild (scoring)
│ └── evaluator builtin.coherence ← grandchild (scoring)
├── eval_item 1
RUN conversational_agent_baseline ← root span (the whole run)
├── eval_item 1 - 'What is 2+2?' ← child span (one dataset row)
│ ├── invoke_agent my-agent ← grandchild (the agent call)
│ ├── evaluator builtin.similarity ← grandchild (scoring)
│ └── evaluator builtin.coherence ← grandchild (scoring)
├── eval_item 2 - 'Capital of France?'
│ ├── invoke_agent my-agent
│ ├── evaluator builtin.similarity
│ └── evaluator builtin.coherence
└── ...
```

Item indices are **1-based**, and each `eval_item` span name includes a short snippet of the row input for easy scanning in trace UIs.
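
As a rough illustration of that naming scheme, the sketch below builds span names from dataset rows. The helper `eval_item_span_name` and its `max_len` cap are hypothetical, chosen for this example; they are not taken from the AgentOps source, which may truncate or quote the snippet differently.

```python
def eval_item_span_name(index: int, input_text: str, max_len: int = 40) -> str:
    """Build a span name like "eval_item 1 - 'What is 2+2?'".

    `index` is 1-based. `max_len` is an assumed snippet cap, not a
    documented AgentOps setting.
    """
    if index < 1:
        raise ValueError("item indices are 1-based")
    # Collapse whitespace so multi-line inputs stay on one line in trace UIs.
    snippet = " ".join(input_text.split())
    if len(snippet) > max_len:
        snippet = snippet[: max_len - 3] + "..."
    return f"eval_item {index} - '{snippet}'"


rows = ["What is 2+2?", "Capital of France?"]
# enumerate(..., start=1) gives the 1-based indices described above.
names = [eval_item_span_name(i, text) for i, text in enumerate(rows, start=1)]
```

Keeping the snippet short matters because most trace UIs show the span name in a narrow column; the full input remains available in `agentops.eval.item.input`.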

Each span records **attributes** — structured key-value pairs like `agentops.eval.evaluator.score = 0.87`.

### What Is OTLP?
@@ -196,11 +198,12 @@ RUN <bundle_name> kind=SERVER
│ agentops.eval.model = <deployment> (if applicable)
│ agentops.eval.agent_id = <agent_id> (if applicable)
├── eval_item 0 kind=INTERNAL
├── eval_item 1 - 'What is 2+2?' kind=INTERNAL
│ │ cicd.pipeline.task.name = "eval_item"
│ │ agentops.eval.item.index = 0
│ │ agentops.eval.item.input = "..."
│ │ agentops.eval.item.expected = "..."
│ │ cicd.pipeline.task.run.id = "1"
│ │ cicd.pipeline.task.run.result = "success"
│ │ agentops.eval.item.index = 1
│ │ agentops.eval.item.input = "What is 2+2?"
│ │ agentops.eval.item.passed = true
│ │
│ ├── invoke_agent my-agent kind=CLIENT
@@ -222,7 +225,7 @@ RUN <bundle_name> kind=SERVER
│ agentops.eval.evaluator.score = 0.85
│ ...
├── eval_item 1
├── eval_item 2 - 'Capital of France?'
│ └── ...
└── (final attributes on root span)
@@ -265,8 +268,8 @@ Follows the [OTel GenAI semantic conventions](https://opentelemetry.io/docs/spec

| Attribute | Example | Description |
|---|---|---|
| `gen_ai.operation.name` | `invoke_agent` / `chat` | Operation type |
| `gen_ai.provider.name` | `azure.ai.inference` | Provider |
| `gen_ai.operation.name` | `invoke_agent` / `chat` | Operation type — `invoke_agent` for agent targets, `chat` for model targets |
| `gen_ai.provider.name` | `azure.ai.inference` / `local.callable` | Provider — varies by backend (e.g. `azure.ai.inference` for Foundry, `local.callable` for the local adapter backend) |
| `gen_ai.request.model` | `gpt-4o` | Requested model deployment |
| `gen_ai.response.model` | `gpt-4o-2024-08-06` | Actual model version |
| `gen_ai.agent.id` | `my-agent:3` | Foundry agent identifier |
@@ -289,9 +292,8 @@ Custom attributes for evaluation-specific data that has no standard equivalent.
| `agentops.eval.items_total` | `10` | Total rows evaluated |
| `agentops.eval.items_passed` | `9` | Rows passing thresholds |
| `agentops.eval.pass_rate` | `0.9` | Pass rate |
| `agentops.eval.item.index` | `0` | Row index |
| `agentops.eval.item.index` | `1` | Row index (1-based) |
| `agentops.eval.item.input` | `"What is 2+2?"` | Input text |
| `agentops.eval.item.expected` | `"4"` | Expected answer |
| `agentops.eval.item.passed` | `true` | Row pass/fail |
| `agentops.eval.evaluator.name` | `SimilarityEvaluator` | Class name |
| `agentops.eval.evaluator.builtin` | `builtin.similarity` | Builtin name |
@@ -345,9 +347,9 @@ Jaeger shows spans as horizontal bars on a timeline:

## Sending Traces to Azure Monitor

For production, you may want traces in Azure Monitor / Application Insights instead of local Jaeger.
For production, you may want traces in Azure Monitor / Application Insights instead of local Jaeger. The recommended path is the **OpenTelemetry Collector** running locally (or as a sidecar) with the Azure Monitor exporter.

### Option A: Use the OTel Collector as a Proxy
### Use the OTel Collector as a Proxy

Run the [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) with an Azure Monitor exporter:

@@ -372,15 +374,121 @@ service:

Then set `AGENTOPS_OTLP_ENDPOINT=http://localhost:4318`.

### Option B: Use Azure Monitor's OTLP Endpoint Directly
### Why not export from AgentOps directly?

Azure Monitor now supports OTLP ingestion natively. Set the endpoint to your Application Insights OTLP ingestion URL:
AgentOps ships a vanilla `OTLPSpanExporter` that POSTs `application/x-protobuf` to `<endpoint>/v1/traces` with no Authorization header. This is fine for any plain OTLP/HTTP backend (Jaeger, Tempo, the Collector, etc.), but it is **not** sufficient for Azure Monitor:

```bash
export AGENTOPS_OTLP_ENDPOINT=https://<region>.applicationinsights.azure.com
```
- The official Azure Monitor OpenTelemetry distro for Python (see [Microsoft Learn — OpenTelemetry configuration](https://learn.microsoft.com/azure/azure-monitor/app/opentelemetry-configuration?tabs=python)) requires a **connection string** and is invoked via `configure_azure_monitor()`, not a raw OTLP endpoint.
- Application Insights also has a preview feature (`Microsoft.Insights/OtlpApplicationInsights`) that exposes per-resource OTLP ingestion URLs, but it requires **Entra ID Bearer-token authentication** (scope `https://monitor.azure.com/.default`), which AgentOps's exporter does not currently inject.

The Collector proxy avoids both issues: AgentOps speaks plain OTLP/HTTP to the Collector, and the Collector handles authentication to Azure Monitor.

---

## Querying Traces in Azure Monitor (KQL)

Once eval traces land in Application Insights via the Collector, you can query them directly in **Application Insights > Logs** using KQL. All span attributes are stored as JSON keys in the `customDimensions` column.

### Table Mapping

AgentOps spans map to App Insights tables based on their OpenTelemetry span kind:

| Span | App Insights Table | Span Kind |
|---|---|---|
| `RUN <bundle>` (root eval run) | `requests` | `SERVER` |
| `eval_item N` (per-row evaluation) | `dependencies` | `INTERNAL` |
| `invoke_agent` / `chat` (agent/model call) | `dependencies` | `CLIENT` |
| `evaluator <name>` (individual evaluator) | `dependencies` | `INTERNAL` |

### Query 1: Slowest Evaluation Rows

Find the top 10 slowest evaluation rows to identify performance bottlenecks.

```kql
dependencies
| where customDimensions["cicd.pipeline.task.name"] == "eval_item"
| extend
    rowIndex = toint(customDimensions["agentops.eval.item.index"]),
    input = tostring(customDimensions["agentops.eval.item.input"]),
    passed = tostring(customDimensions["agentops.eval.item.passed"])
| project timestamp, rowIndex, input, passed, duration, operation_Id
| top 10 by duration desc
```

### Query 2: Failed Evaluators

List all evaluator executions that failed their threshold, with scores and thresholds.

```kql
dependencies
// =~ is case-insensitive, so this matches both "false" and "False".
| where tostring(customDimensions["agentops.eval.evaluator.passed"]) =~ "false"
| extend
    evaluator = tostring(customDimensions["agentops.eval.evaluator.builtin"]),
    score = toreal(customDimensions["agentops.eval.evaluator.score"]),
    threshold = toreal(customDimensions["agentops.eval.evaluator.threshold"]),
    criteria = tostring(customDimensions["agentops.eval.evaluator.criteria"])
| project timestamp, evaluator, score, threshold, criteria, operation_Id
| order by timestamp desc
```

Refer to the [Azure Monitor OTLP documentation](https://learn.microsoft.com/azure/azure-monitor/app/opentelemetry-configuration) for details.
### Query 3: Pass Rate Over Time

Track overall evaluation pass rate trends from root spans.

```kql
requests
| where name startswith "RUN "
| extend
    passRate = toreal(customDimensions["agentops.eval.pass_rate"]),
    bundle = tostring(customDimensions["cicd.pipeline.name"]),
    dataset = tostring(customDimensions["agentops.eval.dataset"]),
    itemsTotal = toint(customDimensions["agentops.eval.items_total"]),
    itemsPassed = toint(customDimensions["agentops.eval.items_passed"])
| project timestamp, bundle, dataset, passRate, itemsPassed, itemsTotal
| order by timestamp asc
| render timechart with (ycolumns=passRate, title="Evaluation Pass Rate Over Time")
```

### Query 4: Token Usage Per Run

Sum input and output tokens across all agent/model invocations within each eval run.

```kql
dependencies
| where tostring(customDimensions["gen_ai.operation.name"]) in ("invoke_agent", "chat")
| extend
    inputTokens = toint(customDimensions["gen_ai.usage.input_tokens"]),
    outputTokens = toint(customDimensions["gen_ai.usage.output_tokens"]),
    model = tostring(customDimensions["gen_ai.request.model"])
| summarize
    totalInputTokens = sum(inputTokens),
    totalOutputTokens = sum(outputTokens),
    totalTokens = sum(inputTokens) + sum(outputTokens),
    invocations = count()
    by operation_Id, model
| order by totalTokens desc
```
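
To sanity-check the numbers this query returns, the same grouping can be reproduced offline. The sketch below assumes a hypothetical export format (one dict per `dependencies` record, with `customDimensions` already parsed into a dict of strings); the function name `token_usage_per_run` and the sample rows are illustrative only.

```python
from collections import defaultdict


def token_usage_per_run(rows):
    """Sum token counts per (operation_Id, model), as Query 4 does.

    `rows` is a hypothetical export format: one dict per `dependencies`
    record, with `customDimensions` already parsed into a dict of strings.
    """
    totals = defaultdict(lambda: {"input": 0, "output": 0, "invocations": 0})
    for row in rows:
        dims = row.get("customDimensions", {})
        if dims.get("gen_ai.operation.name") not in ("invoke_agent", "chat"):
            continue  # skip eval_item and evaluator spans
        key = (row["operation_Id"], dims.get("gen_ai.request.model", ""))
        bucket = totals[key]
        # Missing counts are treated as 0, matching how sum() skips nulls in KQL.
        bucket["input"] += int(dims.get("gen_ai.usage.input_tokens", 0))
        bucket["output"] += int(dims.get("gen_ai.usage.output_tokens", 0))
        bucket["invocations"] += 1
    return {
        key: {**value, "total": value["input"] + value["output"]}
        for key, value in totals.items()
    }


# Made-up sample rows: two chat calls and one eval_item span in the same run.
sample_rows = [
    {"operation_Id": "run-a", "customDimensions": {
        "gen_ai.operation.name": "chat", "gen_ai.request.model": "gpt-4o",
        "gen_ai.usage.input_tokens": "120", "gen_ai.usage.output_tokens": "30"}},
    {"operation_Id": "run-a", "customDimensions": {
        "gen_ai.operation.name": "chat", "gen_ai.request.model": "gpt-4o",
        "gen_ai.usage.input_tokens": "80", "gen_ai.usage.output_tokens": "20"}},
    {"operation_Id": "run-a", "customDimensions": {
        "cicd.pipeline.task.name": "eval_item"}},
]
usage = token_usage_per_run(sample_rows)
```

Grouping by `operation_Id` works because every span in one eval run shares the trace ID of the root `RUN` span.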

### Query 5: Evaluator Score Distribution

View the distribution of scores grouped by evaluator name to identify consistently low-performing evaluators.

```kql
dependencies
| where isnotempty(customDimensions["agentops.eval.evaluator.score"])
| extend
    evaluator = tostring(customDimensions["agentops.eval.evaluator.builtin"]),
    score = toreal(customDimensions["agentops.eval.evaluator.score"])
| summarize
    avgScore = avg(score),
    minScore = min(score),
    maxScore = max(score),
    p50 = percentile(score, 50),
    p90 = percentile(score, 90),
    count = count()
    by evaluator
| order by avgScore asc
```

---
