Merged
124 changes: 108 additions & 16 deletions docs/tutorial-agent-watchdog.md
@@ -11,6 +11,9 @@ sources:
queried via KQL.
3. **Foundry control plane** — agent metadata and recent runs read
through `azure-ai-projects`.
4. **Azure resource posture** — a read-only WAF-AI Security pillar audit
of the Cognitive Services / Azure OpenAI account that hosts the agent
and judge model.

The agent runs the same checks (regression, latency, errors, safety)
in three form factors:
@@ -67,14 +70,20 @@ Exit codes are CI-friendly:
- `2` — a finding meets the configured `--severity-fail` floor
- `1` — runtime / configuration error
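
For scripted callers, the exit codes listed above can be mapped to named outcomes. The sketch below is illustrative only: the `Outcome` enum and the `classify_exit_code` helper are hypothetical names, not part of the toolkit, and any code other than `1` or `2` is reported as not covered by this excerpt rather than assumed to mean success.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels for the exit codes documented in this excerpt."""

    FINDINGS = "a finding meets the configured --severity-fail floor"
    ERROR = "runtime / configuration error"
    OTHER = "exit code not described in this excerpt"


def classify_exit_code(code: int) -> Outcome:
    """Map a process exit code to one of the outcomes listed above."""
    if code == 2:
        return Outcome.FINDINGS
    if code == 1:
        return Outcome.ERROR
    # Deliberately no assumption about what other codes mean.
    return Outcome.OTHER
```

A CI wrapper could pass the `returncode` of `subprocess.run(["agentops", "agent", "analyze"])` to `classify_exit_code` and decide whether to fail the job or only annotate it.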

-## 2b. Security posture audit (WAF-AI)
+## 3. Security posture audit (WAF-AI)

The watchdog can also run a **read-only audit of the Azure footprint**
hosting your agent against the [Microsoft Well-Architected Framework
for AI workloads — Security pillar][waf-ai]. This is opt-in: the
findings live in their own `security` category and are skipped unless
both the `azure_resources` source and the `posture` check are enabled.

Why is this opt-in? The telemetry checks use App Insights and Foundry
metadata that you already configured in the previous step. Security
posture requires management-plane reads against the Azure resource group,
so the tutorial asks for the subscription, resource group, and Cognitive
Services account explicitly instead of guessing them.
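
The opt-in condition (both the `azure_resources` source and the `posture` check enabled) can be expressed as a small predicate over the parsed `agent.yaml`. This is a sketch under assumptions: `posture_audit_enabled` is a hypothetical helper, the config is assumed to be already loaded into a plain dict, and a missing `enabled` key is assumed to mean disabled. The toolkit's own gating logic is not shown in this diff.

```python
from typing import Any, Dict


def posture_audit_enabled(config: Dict[str, Any]) -> bool:
    """Return True only when both opt-in switches are on.

    Looks at sources.azure_resources.enabled and checks.posture.enabled,
    the two keys the tutorial says must both be enabled.
    """
    sources = config.get("sources") or {}
    checks = config.get("checks") or {}
    source_on = bool((sources.get("azure_resources") or {}).get("enabled"))
    check_on = bool((checks.get("posture") or {}).get("enabled"))
    return source_on and check_on
```

Treating a missing key as disabled keeps the predicate consistent with the opt-in description: nothing runs unless both switches are set explicitly.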

The audit runs five high-impact rules against the Cognitive Services /
Azure OpenAI account:

@@ -86,30 +95,113 @@ Azure OpenAI account:
| `waf.security.diagnostic_settings` | warning | Diagnostic logs flowing to Log Analytics / storage / event hub |
| `waf.security.content_filter` | critical | Every model deployment has a RAI policy applied |

-Required RBAC: **Reader** on the resource group (or on each
-individual resource), granted to whoever runs `agentops agent analyze`
-(your local identity locally, or the OIDC-federated identity in CI).
+Required RBAC: **Reader** on the resource group (or on each individual
+resource), granted to whoever runs `agentops agent analyze` (your local
+identity locally, or the OIDC-federated identity in CI).

Find the account to audit:

```powershell
$env:AZURE_SUBSCRIPTION_ID = az account show --query id -o tsv
$resourceGroup = "<your-agent-resource-group>"

az cognitiveservices account list `
--resource-group $resourceGroup `
--query "[].{name:name,kind:kind,location:location,disableLocalAuth:properties.disableLocalAuth,publicNetworkAccess:properties.publicNetworkAccess}" `
-o table
```

Pick the account that hosts your Azure OpenAI / AI Services deployment:

```powershell
$cognitiveAccount = "<ai-services-or-azure-openai-account-name>"
```

Enable in `.agentops/agent.yaml`:

```yaml
```powershell
@"
version: 1
lookback_days: 7

sources:
results_history:
enabled: true
path: .agentops/results
lookback_runs: 10
azure_monitor:
enabled: true
app_insights_resource_id: $appInsightsId
foundry_control:
enabled: true
project_endpoint_env: AZURE_AI_FOUNDRY_PROJECT_ENDPOINT
azure_resources:
enabled: true
-subscription_id_env: AZURE_SUBSCRIPTION_ID # or set subscription_id directly
-resource_group: rg-myproject
-cognitive_services_account: ai-services-myproject
+subscription_id_env: AZURE_SUBSCRIPTION_ID
+resource_group: $resourceGroup
+cognitive_services_account: $cognitiveAccount

checks:
latency:
p95_threshold_seconds: 10.0
errors:
rate_threshold: 0.05
posture:
enabled: true
pillar: security
# Skip individual rules without disabling the whole check, e.g.
# exclude_rules:
# - waf.security.diagnostic_settings
exclude_rules: []
"@ | Set-Content .agentops/agent.yaml -Encoding utf8
```

Run only the security category first:

```powershell
agentops agent analyze --categories security --severity-fail critical
code .agentops/agent/report.md
```

In the test run for this tutorial, `azure_resources` changed from
`disabled` to `ok` and the report produced two WAF-AI findings:

```text
## Verdict: ⚠️ Warnings found

| Category | Count |
|---|---|
| Security posture (WAF-AI — Security pillar) | 2 |

| Source | Status | Detail |
|---|---|---|
| azure_resources | ok |

| Severity | ID | Title | Source |
|---|---|---|---|
| warning | waf.security.diagnostic_settings | Diagnostic settings are missing or incomplete | azure_resources |
| warning | waf.security.public_network_access | Public network access is open and unrestricted | azure_resources |
```
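
To confirm from a script that a source such as `azure_resources` moved to `ok`, the Sources table in the report can be read back. The helper below is a minimal sketch that assumes the table layout matches the excerpt above (a header row starting with `Source | Status`); `parse_source_status` is a hypothetical name and the report format may differ between versions.

```python
from typing import Dict


def parse_source_status(report_text: str) -> Dict[str, str]:
    """Extract {source: status} from the Markdown Sources table."""
    statuses: Dict[str, str] = {}
    in_table = False
    for line in report_text.splitlines():
        stripped = line.strip()
        cells = [cell.strip() for cell in stripped.strip("|").split("|")]
        if cells[:2] == ["Source", "Status"]:
            in_table = True  # header row found; data rows follow
            continue
        if not in_table:
            continue
        if not stripped.startswith("|"):
            in_table = False  # a blank or non-table line ends the table
            continue
        if len(cells) < 2 or set(cells[0]) <= {"-"}:
            continue  # skip the |---|---| separator and malformed rows
        statuses[cells[0]] = cells[1]
    return statuses
```

Calling `parse_source_status(open(".agentops/agent/report.md", encoding="utf-8").read())` would return a dict that a CI step could check for the expected status.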

The evidence blocks in that run showed:

```json
{
"account": "aif-agentops-exp",
"diagnostic_settings": []
}
```

```json
{
"account": "aif-agentops-exp",
"public_network_access": "Enabled",
"private_endpoint_count": 0,
"network_acls_default_action": "Allow"
}
```

Those are real management-plane findings: the account had Entra-only
authentication enabled, but it still needed diagnostic settings and a
network restriction plan.
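
One way to read the two evidence blocks is as simple predicates over their fields. The functions below are a hedged illustration, not the rule implementation: `public_access_unrestricted` and `diagnostics_missing` are hypothetical names, and the conditions are inferred only from the field values shown above.

```python
from typing import Any, Dict


def public_access_unrestricted(evidence: Dict[str, Any]) -> bool:
    """True when public access is on with no private endpoint and an Allow default."""
    return (
        evidence.get("public_network_access") == "Enabled"
        and evidence.get("private_endpoint_count", 0) == 0
        and evidence.get("network_acls_default_action") == "Allow"
    )


def diagnostics_missing(evidence: Dict[str, Any]) -> bool:
    """True when no diagnostic settings are attached to the account."""
    return not evidence.get("diagnostic_settings")
```

Under this reading, adding a private endpoint or switching the network ACL default action away from `Allow` would clear the first predicate, and attaching at least one diagnostic setting would clear the second.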

Run only the security category, or skip a specific rule from the CLI:

```bash
@@ -127,12 +219,12 @@ agentops agent analyze --exclude-rules waf.security.diagnostic_settings,waf.secu
```

The Markdown report groups findings by category, so security findings
-appear under their own `### 🔐 Security` heading with a footer link
-back to the WAF-AI guidance.
+appear under their own `### Security posture (WAF-AI — Security pillar)`
+heading with a footer link back to the WAF-AI guidance.
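
When the analyzer is driven from a script, the flags used in this section can be assembled into an argument list. This is a sketch only: `build_analyze_command` is a hypothetical helper, it uses just the flags that appear in this tutorial (`--categories`, `--severity-fail`, `--exclude-rules`), and it takes a single category because passing several is not shown here.

```python
from typing import List, Optional, Sequence


def build_analyze_command(
    category: Optional[str] = None,
    severity_fail: Optional[str] = None,
    exclude_rules: Sequence[str] = (),
) -> List[str]:
    """Build the argv for `agentops agent analyze` from optional settings."""
    command = ["agentops", "agent", "analyze"]
    if category:
        command += ["--categories", category]
    if severity_fail:
        command += ["--severity-fail", severity_fail]
    if exclude_rules:
        # Rule IDs are joined with commas, as in the CLI example above.
        command += ["--exclude-rules", ",".join(exclude_rules)]
    return command
```

The resulting list can be handed to `subprocess.run` without shell quoting concerns, since each flag and value is a separate element.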

[waf-ai]: https://learn.microsoft.com/azure/well-architected/ai/security

-## 3. CI scheduled run
+## 4. CI scheduled run

Pair the analyzer with a GitHub Actions schedule:

@@ -161,7 +253,7 @@ jobs:
path: .agentops/agent/report.md
```

-## 4. Copilot Chat extension (local)
+## 5. Copilot Chat extension (local)

```bash
pip install "agentops-toolkit[agent] @ git+https://github.com/Azure/agentops.git@develop"
@@ -173,7 +265,7 @@ Then point a GitHub App's Copilot Extension webhook at
local-only** — never expose that endpoint publicly without signature
validation.

-## 5. Hosted Copilot Extension on Azure Container Apps
+## 6. Hosted Copilot Extension on Azure Container Apps

The repo ships a minimal scaffold:

55 changes: 43 additions & 12 deletions docs/tutorial-end-to-end.md
@@ -762,9 +762,21 @@ az role assignment create `
### 9.3 Configure the watchdog

Now write `.agentops/agent.yaml`. This is the file that tells the
-watchdog which signal sources to use:
+watchdog which signal sources to use. In addition to eval history,
+Application Insights, and Foundry metadata, this tutorial enables the
+read-only WAF-AI security posture audit for the Azure AI account:

```powershell
$env:AZURE_SUBSCRIPTION_ID = az account show --query id -o tsv
$cognitiveAccount = az cognitiveservices account list `
--resource-group $resourceGroup `
--query "[?kind=='AIServices' || kind=='OpenAI'].name | [0]" `
-o tsv

if (-not $cognitiveAccount) {
throw "No AIServices/OpenAI account found in resource group $resourceGroup"
}

@"
version: 1
lookback_days: 7
@@ -780,14 +792,32 @@ sources:
foundry_control:
enabled: true
project_endpoint_env: AZURE_AI_FOUNDRY_PROJECT_ENDPOINT
azure_resources:
enabled: true
subscription_id_env: AZURE_SUBSCRIPTION_ID
resource_group: $resourceGroup
cognitive_services_account: $cognitiveAccount
checks:
latency:
p95_threshold_seconds: 5.0
errors:
rate_threshold: 0.05
posture:
enabled: true
pillar: security
exclude_rules: []
"@ | Set-Content .agentops/agent.yaml -Encoding utf8
```

If your resource group or account name is different, list candidates with:

```powershell
az cognitiveservices account list `
--resource-group $resourceGroup `
--query "[].{name:name,kind:kind,location:location,disableLocalAuth:properties.disableLocalAuth,publicNetworkAccess:properties.publicNetworkAccess}" `
-o table
```

### 9.4 Generate telemetry, then analyze it

Install both the Foundry runtime and the watchdog extras, set the
@@ -805,26 +835,27 @@ Start-Sleep -Seconds 90

agentops agent analyze
code .agentops/agent/report.md

# Optional: focus only on WAF-AI security posture.
agentops agent analyze --categories security --severity-fail critical
```

-The report should now show `azure_monitor` as `ok`, not `skipped`. The
-watchdog can combine:
+The report should now show `azure_monitor` and `azure_resources` as `ok`,
+not `skipped`. The watchdog can combine:

- eval-history regressions from `.agentops/results`;
- live p95 latency and error-rate signals from Application Insights;
-- Foundry control-plane metadata from `AZURE_AI_FOUNDRY_PROJECT_ENDPOINT`.
+- Foundry control-plane metadata from `AZURE_AI_FOUNDRY_PROJECT_ENDPOINT`;
+- WAF-AI security posture findings from the Cognitive Services / Azure
+  OpenAI account.

If the findings table is empty, that means the configured checks passed;
the **Sources** table still proves which signal sources were queried.

-> **Optional — WAF-AI security audit.** The watchdog can also run a
-> read-only audit of your Foundry resource group against the
-> [Well-Architected Framework for AI workloads — Security pillar][waf-ai].
-> Enable the `azure_resources` source and the `posture` check in
-> `agent.yaml` (commented stanzas are included), grant your identity
-> `Reader` on the resource group, and re-run with
-> `agentops agent analyze --categories security`. Full walkthrough:
-> [`tutorial-agent-watchdog.md`](tutorial-agent-watchdog.md#2b-security-posture-audit-waf-ai).
+In the tutorial test environment, the posture-only run produced two
+warnings: missing diagnostic settings and unrestricted public network
+access on the AI Services account. Full walkthrough:
+[`tutorial-agent-watchdog.md`](tutorial-agent-watchdog.md#3-security-posture-audit-waf-ai).

For deeper integration (Copilot Chat extension, ACA deploy), see
[`tutorial-agent-watchdog.md`](tutorial-agent-watchdog.md).