Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 13 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,19 @@
All notable changes to this project will be documented in this file.
This format follows [Keep a Changelog](https://keepachangelog.com/) and adheres to [Semantic Versioning](https://semver.org/).

## [0.1.7] - 2026-04-21

### Added
- **Single source of truth for skills (closes #87)** — `src/agentops/templates/skills/` is now the canonical location. Added `scripts/sync-skills.sh` and `scripts/sync-skills.ps1` to propagate changes to `plugins/agentops/skills/`. CI test `test_skills_sync.py` fails if the two directories diverge.
- **Optional unit test generation** — `agentops-eval` skill (Step 1) now offers to generate unit tests for agent code when no existing tests are detected. Generates `pytest` + `unittest.mock` tests covering endpoint handlers, response parsing, and error handling. Opt-in only — skips silently if tests already exist or user declines.

### Changed
- **Cross-platform subprocess handling in generated scripts** — `agentops-eval` and `agentops-dataset` skills now instruct generated `rag_context.py` scripts to use `shutil.which()` + `shell=(sys.platform == "win32")` when calling external CLIs, preventing `FileNotFoundError` on Windows.
- **Auth detection carrythrough to callable adapter** — `agentops-eval` skill Step 5.5 now explicitly wires the auth pattern detected in Step 2 into the adapter using generic `AGENT_AUTH_HEADER` and `AGENT_AUTH_TOKEN` env vars. Updated `callable_adapter.py` template to use the same generic auth mechanism. Prevents 401 errors on first smoke test.
- **azd environment validation** — `agentops-eval` (Step 4) and `agentops-config` (Step 3) skills now validate azd environments before trusting `.azure/<env>/.env` values: checks `azd env list`, verifies resource group exists via `az group exists`, and warns on stale environments.
- **Enhanced smoke test diagnostics** — `agentops-eval` skill Step 6 smoke test now checks for empty responses, response length, response format mismatches (JSON vs SSE), unexpected prefixes (UUIDs), and HTML error pages. Expanded troubleshooting table with specific remediation steps.
- **Updated CONTRIBUTING.md** — added single-source-of-truth rule for skills and sync script instructions.

## [0.1.6] - 2026-04-15

### Changed
Expand Down
15 changes: 15 additions & 0 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -81,6 +81,21 @@ These rules are critical to maintaining the project's architecture. PRs that vio
| Add a new CLI command | `cli/app.py` (keep it thin — delegate to `services/`) |
| Add a new workflow/service | `services/` (new file) |
| Add starter templates | `templates/` + update `pyproject.toml` package-data |
| Edit a coding agent skill | `src/agentops/templates/skills/<name>/SKILL.md` (single source of truth) → run `scripts/sync-skills.sh` or `.ps1` to propagate to `plugins/agentops/skills/` |

### Skills: Single Source of Truth

`src/agentops/templates/skills/` is the **canonical location** for all SKILL.md files. The VS Code extension directory `plugins/agentops/skills/` is a copy. After editing any skill file, run the sync script:

```bash
# Linux / macOS
bash scripts/sync-skills.sh

# Windows
.\scripts\sync-skills.ps1
```

CI will fail if the two directories diverge (`test_skills_sync.py`).

### Testing Expectations

Expand Down
12 changes: 12 additions & 0 deletions plugins/agentops/skills/agentops-config/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -76,6 +76,18 @@ Search these locations **in order** — stop as soon as each value is found:
3. `.azure/<env>/.env` files (azd environments) — also read `AZURE_RESOURCE_GROUP`, `AZURE_SUBSCRIPTION_ID`
4. `.azure/config.json` for `defaultEnvironment` to pick the right env folder

### Validate azd environment (if using `.azure/<env>/.env`)

Before trusting values from `.azure/<env>/.env`, verify the environment is still valid:

1. **Check the environment is current** — run `azd env list` and confirm the selected environment appears. If multiple environments exist, list them and ask the user which to use.
2. **Verify the resource group exists**:
```bash
az group exists --name $RG --subscription $SUB
```
If this returns `false`, warn: *"Resource group '$RG' no longer exists. Your azd environment may be outdated."*
3. **If validation fails**, ask the user for correct values or to select a different environment.

If values are **not found** in any file, run Azure CLI discovery:
```bash
# 1. Confirm auth and get subscription
Expand Down
11 changes: 10 additions & 1 deletion plugins/agentops/skills/agentops-dataset/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -90,8 +90,17 @@ If the scenario is **RAG** and the generated JSONL has no `context` field:
2. **Build a retrieval script** at `.agentops/rag_context.py` (**never** in `src/`) that:
- Reads the project's own retrieval config (env vars, endpoint, index name) from whatever the project uses
- For each row in the JSONL, queries the retrieval backend with `row["input"]` and writes the result into `row["context"]`
- Uses only stdlib (`urllib.request`, `json`, `os`) — no third-party dependencies
- Uses only stdlib (`urllib.request`, `json`, `os`, `subprocess`, `sys`, `shutil`) — no third-party dependencies
- Accepts the JSONL file path as a CLI argument: `python .agentops/rag_context.py .agentops/data/data.jsonl`
- **Must be cross-platform** (Windows + Linux/macOS) — when calling external CLIs (e.g. `az`), use:
```python
import shutil, subprocess, sys
def _run_cli(args: list[str], **kwargs) -> subprocess.CompletedProcess:
exe = shutil.which(args[0])
if exe is None:
raise FileNotFoundError(f"'{args[0]}' not found in PATH.")
return subprocess.run([exe] + args[1:], **kwargs, shell=(sys.platform == "win32"))
```

3. Verify: each JSONL row now has a `context` field.
4. Update dataset YAML to include `context_field: context` under `format:`.
Expand Down
150 changes: 134 additions & 16 deletions plugins/agentops/skills/agentops-eval/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,8 @@ Analyze the codebase holistically to understand the agent's **primary purpose**:

5. **Responsible AI (optional)**: Ask *"Do you also want to include safety evaluators (violence, hate/unfairness, self-harm, protected material)? These can be added alongside your main bundle."* If yes, add the safety evaluators from `safe_agent_baseline` to the selected bundle.

6. **Unit tests (optional)**: Only ask this if **all** of the following are true: (a) the codebase has testable agent code in Python, JavaScript, or TypeScript (endpoint handlers, tool definitions, orchestration logic), (b) no existing test directory or test files are detected (e.g., `tests/`, `test_*.py`, `*_test.py`, `*.test.ts`, `*.test.js`, `__tests__/`). If both conditions are met, ask: *"Would you also like me to generate unit tests for your agent code? (e.g., mocked HTTP calls, response parsing, error handling)"*. If the user declines or if conditions are not met, skip silently. See the **Unit Test Generation** section at the end of this skill for details.

## Step 2 — Detect endpoint type

| Search for | `endpoint.kind` | `hosting` | `execution_mode` |
Expand Down Expand Up @@ -117,8 +119,29 @@ If the scenario is **RAG** and the dataset has no `context` field:
2. **Build a retrieval script** at `.agentops/rag_context.py` (**never** in `src/`) that:
- Reads the project's own retrieval config (env vars, endpoint, index name) from whatever the project uses
- For each row in the JSONL, queries the retrieval backend with `row["input"]` and writes the result into `row["context"]`
- Uses only stdlib (`urllib.request`, `json`, `os`) — no third-party dependencies
- Uses only stdlib (`urllib.request`, `json`, `os`, `subprocess`, `sys`, `shutil`) — no third-party dependencies
- Accepts the JSONL file path as a CLI argument: `python .agentops/rag_context.py .agentops/data/data.jsonl`
- **Must be cross-platform** (Windows + Linux/macOS) — when calling external CLIs (e.g. `az`), use the following pattern:
```python
import shutil
import subprocess
import sys

def _run_cli(args: list[str], **kwargs) -> subprocess.CompletedProcess:
"""Run an external CLI command, cross-platform."""
exe = shutil.which(args[0])
if exe is None:
raise FileNotFoundError(
f"'{args[0]}' not found in PATH. "
"Make sure it is installed and available."
)
return subprocess.run(
[exe] + args[1:],
**kwargs,
shell=(sys.platform == "win32"),
)
```
- This avoids `FileNotFoundError` on Windows where `subprocess.run(["az", ...])` fails without `shell=True`

3. Update dataset YAML to include `context_field: context` under `format:`.
4. Now `rag_quality_baseline` with GroundednessEvaluator and RetrievalEvaluator can be used.
Expand All @@ -134,6 +157,18 @@ Search these locations in order — stop as soon as each value is found:
3. `.azure/<env>/.env` (azd environments) — also read `AZURE_RESOURCE_GROUP`, `AZURE_SUBSCRIPTION_ID`
4. `.azure/config.json` for `defaultEnvironment` to pick the right env folder

### Validate azd environment (if using `.azure/<env>/.env`)

Before trusting values from `.azure/<env>/.env`, verify the environment is still valid:

1. **Check if the environment is current** — run `azd env list` and confirm the selected environment appears in the output. If multiple environments exist, list them and ask the user which one to use.
2. **Verify the resource group exists** — after reading `AZURE_RESOURCE_GROUP` and `AZURE_SUBSCRIPTION_ID` from the env file, run:
```bash
az group exists --name $RG --subscription $SUB
```
If this returns `false`, the environment is stale (resources were deleted). Warn the user: *"The resource group '$RG' no longer exists. Your azd environment may be outdated. Please re-run `azd up` or provide current Azure values."*
3. **If validation fails**, do not silently proceed with stale values — ask the user for correct values or to select a different environment.

If values are **not found** in files, use Azure CLI to discover them:

```bash
Expand Down Expand Up @@ -282,13 +317,17 @@ import os
import urllib.request

ENDPOINT = os.environ["AGENT_HTTP_URL"]
AUTH_TOKEN = os.environ.get("APP_API_TOKEN", "")
# Auth: set AGENT_AUTH_HEADER and AGENT_AUTH_TOKEN env vars if your endpoint requires auth.
# Example: AGENT_AUTH_HEADER=dapr-api-token AGENT_AUTH_TOKEN=dev-token
# AGENT_AUTH_HEADER=X-API-KEY AGENT_AUTH_TOKEN=my-key
AUTH_HEADER = os.environ.get("AGENT_AUTH_HEADER", "")
AUTH_TOKEN = os.environ.get("AGENT_AUTH_TOKEN", "")

def run_evaluation(input_text: str, context: dict) -> dict:
body = json.dumps({"message": input_text}).encode()
headers = {"Content-Type": "application/json"}
if AUTH_TOKEN:
headers["dapr-api-token"] = AUTH_TOKEN
if AUTH_HEADER and AUTH_TOKEN:
headers[AUTH_HEADER] = AUTH_TOKEN
req = urllib.request.Request(ENDPOINT, data=body, headers=headers, method="POST")
with urllib.request.urlopen(req, timeout=120) as resp:
data = json.loads(resp.read())
Expand All @@ -302,13 +341,15 @@ import os
import urllib.request

ENDPOINT = os.environ["AGENT_HTTP_URL"]
AUTH_TOKEN = os.environ.get("APP_API_TOKEN", "")
# Auth: set AGENT_AUTH_HEADER and AGENT_AUTH_TOKEN env vars if your endpoint requires auth.
AUTH_HEADER = os.environ.get("AGENT_AUTH_HEADER", "")
AUTH_TOKEN = os.environ.get("AGENT_AUTH_TOKEN", "")

def run_evaluation(input_text: str, context: dict) -> dict:
body = json.dumps({"message": input_text}).encode()
headers = {"Content-Type": "application/json"}
if AUTH_TOKEN:
headers["dapr-api-token"] = AUTH_TOKEN
if AUTH_HEADER and AUTH_TOKEN:
headers[AUTH_HEADER] = AUTH_TOKEN
req = urllib.request.Request(ENDPOINT, data=body, headers=headers, method="POST")
chunks = []
try:
Expand Down Expand Up @@ -340,11 +381,20 @@ def run_evaluation(input_text: str, context: dict) -> dict:
```

Customize the adapter:
- **Dapr auth** (`dapr-api-token` / `APP_API_TOKEN` found in code or `.env`) → keep the auth lines above.
- **API key** (`X-API-KEY` / `api_key` / `API_KEY` found in code or `.env`) → change header to `headers["X-API-KEY"] = AUTH_TOKEN` and env var to `API_KEY`.
- **Bearer token** (`Authorization: Bearer` found in code) → recommend using `http` backend with `auth_header_env` instead of callable.
- **No auth found** → remove the `AUTH_TOKEN` lines entirely.
- **Apply the auth pattern detected in Step 2.** Use the table below to wire the correct header and env var into the adapter:

| Auth detected in Step 2 | Adapter env var | Header line in adapter |
|---|---|---|
| `dapr-api-token` / `APP_API_TOKEN` | `AGENT_AUTH_TOKEN` (tell user to set it to their Dapr token) | `headers["dapr-api-token"] = AUTH_TOKEN` |
| `X-API-KEY` / `api_key` / `API_KEY` | `AGENT_AUTH_TOKEN` (tell user to set it to their API key) | `headers["X-API-KEY"] = AUTH_TOKEN` |
| `Authorization: Bearer` | Recommend HTTP backend with `auth_header_env` instead of callable adapter | N/A |
| No auth detected | Remove `AUTH_TOKEN` and auth header lines entirely | N/A |

**Important**: Do NOT generate the adapter with auth lines commented out or using hardcoded tokens. If auth was detected, the adapter must include the correct header from the start — otherwise the smoke test will fail with 401.

- **Choose the right template:** If the agent code uses `yield`, `StreamingResponse`, `EventSourceResponse`, or `text/event-stream` content type, use the **SSE/streaming adapter** template. Otherwise use the **standard JSON adapter**.
- **Customize the request field:** If the agent expects a different key than `"message"` (e.g. `"ask"`, `"question"`, `"input"`), change the `json.dumps({"message": ...})` line to match.
- **Customize the response extraction:** If the agent returns a different key than `"text"` or `"response"`, update the `.get()` call accordingly.

### Context sanitization (RAG scenarios)

Expand Down Expand Up @@ -424,17 +474,29 @@ import sys; sys.path.insert(0, '.agentops')
from callable_adapter import run_evaluation
result = run_evaluation('hello', {})
assert 'response' in result, f'Missing response key: {result}'
assert not result['response'].startswith('ERROR:'), f'Adapter error: {result[\"response\"]}'
resp = result['response']
assert not resp.startswith('ERROR:'), f'Adapter error: {resp}'
assert len(resp.strip()) > 0, 'Empty response — check endpoint and request format'
print('Smoke test PASSED')
print('Response preview:', result['response'][:120])
print(f'Response length: {len(resp)} chars')
print('Response preview:', resp[:200])
"
```

If the smoke test fails:
- **Connection refused** → the agent endpoint is not running. Start it first.
- **401 Unauthorized** → auth token is missing or wrong. Check the env var.
- **400/422** → the request body format doesn't match the endpoint. Check `request_field`.
- **401 Unauthorized** → auth token is missing or wrong. Check `AGENT_AUTH_HEADER` and `AGENT_AUTH_TOKEN` env vars.
- **400/422** → the request body format doesn't match the endpoint. Check the `json.dumps({"message": ...})` field name in the adapter — the endpoint may expect a different key (e.g. `"ask"`, `"question"`, `"input"`).
- **Response starts with `ERROR:`** → the adapter caught an exception. Read the error message.
- **Empty response** → the endpoint returned successfully but the adapter extracted no text. Check `response_field` / `.get()` key in the adapter.
- **Response contains unexpected prefix** (UUID, metadata, HTML) → add a post-processing step to the adapter to strip it. Common pattern: `re.sub(r'^[0-9a-f-]{36}\s*', '', response_text)` for UUID prefixes.

### Smoke test response format verification

After the basic smoke test passes, verify the response format matches expectations:
1. If the response contains HTML tags (`<html>`, `<div>`, etc.) but the adapter expects plain text → the endpoint may be returning an error page, not agent output.
2. If the response is very short (< 10 chars) for a conversational prompt like "hello" → warn the user: *"Response seems unusually short. Verify the endpoint is returning the full agent response."*
3. If the response starts with `data:` or contains SSE markers but the adapter uses the standard JSON template → switch to the SSE/streaming adapter template.

Do NOT proceed to Step 7 until the smoke test passes.

Expand Down Expand Up @@ -482,11 +544,67 @@ agentops report generate [--in results.json] # Regenerate report
- **NEVER** fabricate `agent_id`, model names, or endpoint URLs.
- **NEVER** edit `.agentops/` template files (`run-callable.yaml`, `run-http-rag.yaml`, etc.) — always update `.agentops/run.yaml`.
- **NEVER** use dotted import paths like `.agentops.callable_adapter` — they fail.
- **NEVER** create files outside `.agentops/` — all generated artifacts (adapters, datasets, configs, scripts) belong in `.agentops/`.
- **NEVER** create files outside `.agentops/` — all generated artifacts (adapters, datasets, configs, scripts) belong in `.agentops/`. Exception: unit tests go in the project's existing test directory.
- **NEVER** try `az login` automatically — ask the user to authenticate.
- **NEVER** use `requests` or `httpx` in callable adapters — use only stdlib (`urllib.request`, `json`, `os`).
- If a bundle uses SDK-version-dependent evaluators, verify availability before running (Step 4.5). Don't block on this — if verification is hard, proceed and fix on failure.
- Always update `.agentops/run.yaml` — do not create custom-named files except for multi-model benchmarks.
- Use generic file names: `dataset.yaml`, `data.jsonl` — not project-specific prefixes.
- Use plain language in questions — not technical jargon ("callable adapter", "SSE", "POST").
- Always run pre-flight (Step 6) before executing. Fix all issues first.

## Unit Test Generation (Optional)

This section is only executed if the user accepted the unit test offer in Step 1.

### When to generate

- The codebase has Python, JavaScript, or TypeScript agent code with testable logic (endpoint handlers, tool definitions, response parsing, orchestration).
- No existing test files or test directories were detected.

### What to generate

Create tests in the project's conventional test directory (e.g. `tests/test_agent.py` for Python, `__tests__/agent.test.ts` for TypeScript). Use only standard testing libraries — no extra dependencies.

**For Python agents**, generate `pytest` tests using `unittest.mock`:

1. **Endpoint handler test** — mock the HTTP framework (FastAPI `TestClient`, Flask `test_client`) and verify the handler returns expected response format.
2. **Response parsing test** — if the agent has response parsing logic (JSON extraction, SSE chunk assembly, UUID stripping), test it with known inputs/outputs.
3. **Error handling test** — verify the agent handles timeouts, 4xx/5xx from downstream services, and malformed inputs gracefully.
4. **Tool schema test** (if applicable) — if the agent defines tools with schemas, validate the schema structure is correct (required fields, types).

**Template pattern** (adapt to the detected code):
```python
"""Unit tests for agent endpoint — generated by AgentOps."""
import json
from unittest.mock import MagicMock, patch

import pytest


class TestAgentEndpoint:
"""Tests for the agent's HTTP endpoint handler."""

def test_returns_valid_response_format(self):
# Mock the downstream model/service call
# Call the endpoint handler directly
# Assert response has expected keys and types
...

def test_handles_empty_input(self):
# Verify the agent handles empty or whitespace-only input
...

def test_handles_downstream_timeout(self):
# Mock the downstream call to raise a timeout
# Assert the agent returns an error response (not a crash)
...
```

### Rules for generated tests

- Tests must run **without** Azure credentials or live services — all external calls must be mocked.
- Do not generate tests that duplicate what AgentOps evaluations already cover (response quality, groundedness, coherence).
- Focus on **functional correctness**: does the code do what it's supposed to do?
- Place tests in the project's existing test directory structure, not in `.agentops/`.
- If the project uses a specific test runner or framework (detected via `pyproject.toml`, `package.json`, `conftest.py`), follow its conventions.
Loading