Azure · Dongbumlee · Apr 22, 2026 · Apr 21, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,19 @@
 All notable changes to this project will be documented in this file.
 This format follows [Keep a Changelog](https://keepachangelog.com/) and adheres to [Semantic Versioning](https://semver.org/).
 
+## [0.1.7] - 2026-04-21
+
+### Added
+- **Single source of truth for skills (closes #87)** — `src/agentops/templates/skills/` is now the canonical location. Added `scripts/sync-skills.sh` and `scripts/sync-skills.ps1` to propagate changes to `plugins/agentops/skills/`. CI test `test_skills_sync.py` fails if the two directories diverge.
+- **Optional unit test generation** — `agentops-eval` skill (Step 1) now offers to generate unit tests for agent code when no existing tests are detected. Generates `pytest` + `unittest.mock` tests covering endpoint handlers, response parsing, and error handling. Opt-in only — skips silently if tests already exist or user declines.
+
+### Changed
+- **Cross-platform subprocess handling in generated scripts** — `agentops-eval` and `agentops-dataset` skills now instruct generated `rag_context.py` scripts to use `shutil.which()` + `shell=(sys.platform == "win32")` when calling external CLIs, preventing `FileNotFoundError` on Windows.
+- **Auth detection carrythrough to callable adapter** — `agentops-eval` skill Step 5.5 now explicitly wires the auth pattern detected in Step 2 into the adapter using generic `AGENT_AUTH_HEADER` and `AGENT_AUTH_TOKEN` env vars. Updated `callable_adapter.py` template to use the same generic auth mechanism. Prevents 401 errors on first smoke test.
+- **azd environment validation** — `agentops-eval` (Step 4) and `agentops-config` (Step 3) skills now validate azd environments before trusting `.azure/<env>/.env` values: checks `azd env list`, verifies resource group exists via `az group exists`, and warns on stale environments.
+- **Enhanced smoke test diagnostics** — `agentops-eval` skill Step 6 smoke test now checks for empty responses, response length, response format mismatches (JSON vs SSE), unexpected prefixes (UUIDs), and HTML error pages. Expanded troubleshooting table with specific remediation steps.
+- **Updated CONTRIBUTING.md** — added single-source-of-truth rule for skills and sync script instructions.
+
 ## [0.1.6] - 2026-04-15
 
 ### Changed

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -81,6 +81,21 @@ These rules are critical to maintaining the project's architecture. PRs that vio
 | Add a new CLI command | `cli/app.py` (keep it thin — delegate to `services/`) |
 | Add a new workflow/service | `services/` (new file) |
 | Add starter templates | `templates/` + update `pyproject.toml` package-data |
+| Edit a coding agent skill | `src/agentops/templates/skills/<name>/SKILL.md` (single source of truth) → run `scripts/sync-skills.sh` or `.ps1` to propagate to `plugins/agentops/skills/` |
+
+### Skills: Single Source of Truth
+
+`src/agentops/templates/skills/` is the **canonical location** for all SKILL.md files. The VS Code extension directory `plugins/agentops/skills/` is a copy. After editing any skill file, run the sync script:
+
+```bash
+# Linux / macOS
+bash scripts/sync-skills.sh
+
+# Windows
+.\scripts\sync-skills.ps1
+```
+
+CI will fail if the two directories diverge (`test_skills_sync.py`).
 
 ### Testing Expectations
 

diff --git a/plugins/agentops/skills/agentops-config/SKILL.md b/plugins/agentops/skills/agentops-config/SKILL.md
@@ -76,6 +76,18 @@ Search these locations **in order** — stop as soon as each value is found:
 3. `.azure/<env>/.env` files (azd environments) — also read `AZURE_RESOURCE_GROUP`, `AZURE_SUBSCRIPTION_ID`
 4. `.azure/config.json` for `defaultEnvironment` to pick the right env folder
 
+### Validate azd environment (if using `.azure/<env>/.env`)
+
+Before trusting values from `.azure/<env>/.env`, verify the environment is still valid:
+
+1. **Check the environment is current** — run `azd env list` and confirm the selected environment appears. If multiple environments exist, list them and ask the user which to use.
+2. **Verify the resource group exists**:
+   ```bash
+   az group exists --name $RG --subscription $SUB
+   ```
+   If this returns `false`, warn: *"Resource group '$RG' no longer exists. Your azd environment may be outdated."*
+3. **If validation fails**, ask the user for correct values or to select a different environment.
+
 If values are **not found** in any file, run Azure CLI discovery:
 ```bash
 # 1. Confirm auth and get subscription

diff --git a/plugins/agentops/skills/agentops-dataset/SKILL.md b/plugins/agentops/skills/agentops-dataset/SKILL.md
@@ -90,8 +90,17 @@ If the scenario is **RAG** and the generated JSONL has no `context` field:
 2. **Build a retrieval script** at `.agentops/rag_context.py` (**never** in `src/`) that:
    - Reads the project's own retrieval config (env vars, endpoint, index name) from whatever the project uses
    - For each row in the JSONL, queries the retrieval backend with `row["input"]` and writes the result into `row["context"]`
-   - Uses only stdlib (`urllib.request`, `json`, `os`) — no third-party dependencies
+   - Uses only stdlib (`urllib.request`, `json`, `os`, `subprocess`, `sys`, `shutil`) — no third-party dependencies
    - Accepts the JSONL file path as a CLI argument: `python .agentops/rag_context.py .agentops/data/data.jsonl`
+   - **Must be cross-platform** (Windows + Linux/macOS) — when calling external CLIs (e.g. `az`), use:
+     ```python
+     import shutil, subprocess, sys
+     def _run_cli(args: list[str], **kwargs) -> subprocess.CompletedProcess:
+         exe = shutil.which(args[0])
+         if exe is None:
+             raise FileNotFoundError(f"'{args[0]}' not found in PATH.")
+         return subprocess.run([exe] + args[1:], **kwargs, shell=(sys.platform == "win32"))
+     ```
 
 3. Verify: each JSONL row now has a `context` field.
 4. Update dataset YAML to include `context_field: context` under `format:`.

diff --git a/plugins/agentops/skills/agentops-eval/SKILL.md b/plugins/agentops/skills/agentops-eval/SKILL.md
@@ -40,6 +40,8 @@ Analyze the codebase holistically to understand the agent's **primary purpose**:
 
 5. **Responsible AI (optional)**: Ask *"Do you also want to include safety evaluators (violence, hate/unfairness, self-harm, protected material)? These can be added alongside your main bundle."* If yes, add the safety evaluators from `safe_agent_baseline` to the selected bundle.
 
+6. **Unit tests (optional)**: Only ask this if **all** of the following are true: (a) the codebase has testable agent code in Python, JavaScript, or TypeScript (endpoint handlers, tool definitions, orchestration logic), (b) no existing test directory or test files are detected (e.g., `tests/`, `test_*.py`, `*_test.py`, `*.test.ts`, `*.test.js`, `__tests__/`). If both conditions are met, ask: *"Would you also like me to generate unit tests for your agent code? (e.g., mocked HTTP calls, response parsing, error handling)"*. If the user declines or if conditions are not met, skip silently. See the **Unit Test Generation** section at the end of this skill for details.
+
 ## Step 2 — Detect endpoint type
 
 | Search for | `endpoint.kind` | `hosting` | `execution_mode` |
@@ -117,8 +119,29 @@ If the scenario is **RAG** and the dataset has no `context` field:
 2. **Build a retrieval script** at `.agentops/rag_context.py` (**never** in `src/`) that:
    - Reads the project's own retrieval config (env vars, endpoint, index name) from whatever the project uses
    - For each row in the JSONL, queries the retrieval backend with `row["input"]` and writes the result into `row["context"]`
-   - Uses only stdlib (`urllib.request`, `json`, `os`) — no third-party dependencies
+   - Uses only stdlib (`urllib.request`, `json`, `os`, `subprocess`, `sys`, `shutil`) — no third-party dependencies
    - Accepts the JSONL file path as a CLI argument: `python .agentops/rag_context.py .agentops/data/data.jsonl`
+   - **Must be cross-platform** (Windows + Linux/macOS) — when calling external CLIs (e.g. `az`), use the following pattern:
+     ```python
+     import shutil
+     import subprocess
+     import sys
+
+     def _run_cli(args: list[str], **kwargs) -> subprocess.CompletedProcess:
+         """Run an external CLI command, cross-platform."""
+         exe = shutil.which(args[0])
+         if exe is None:
+             raise FileNotFoundError(
+                 f"'{args[0]}' not found in PATH. "
+                 "Make sure it is installed and available."
+             )
+         return subprocess.run(
+             [exe] + args[1:],
+             **kwargs,
+             shell=(sys.platform == "win32"),
+         )
+     ```
+   - This avoids `FileNotFoundError` on Windows where `subprocess.run(["az", ...])` fails without `shell=True`
 
 3. Update dataset YAML to include `context_field: context` under `format:`.
 4. Now `rag_quality_baseline` with GroundednessEvaluator and RetrievalEvaluator can be used.
@@ -134,6 +157,18 @@ Search these locations in order — stop as soon as each value is found:
 3. `.azure/<env>/.env` (azd environments) — also read `AZURE_RESOURCE_GROUP`, `AZURE_SUBSCRIPTION_ID`
 4. `.azure/config.json` for `defaultEnvironment` to pick the right env folder
 
+### Validate azd environment (if using `.azure/<env>/.env`)
+
+Before trusting values from `.azure/<env>/.env`, verify the environment is still valid:
+
+1. **Check if the environment is current** — run `azd env list` and confirm the selected environment appears in the output. If multiple environments exist, list them and ask the user which one to use.
+2. **Verify the resource group exists** — after reading `AZURE_RESOURCE_GROUP` and `AZURE_SUBSCRIPTION_ID` from the env file, run:
+   ```bash
+   az group exists --name $RG --subscription $SUB
+   ```
+   If this returns `false`, the environment is stale (resources were deleted). Warn the user: *"The resource group '$RG' no longer exists. Your azd environment may be outdated. Please re-run `azd up` or provide current Azure values."*
+3. **If validation fails**, do not silently proceed with stale values — ask the user for correct values or to select a different environment.
+
 If values are **not found** in files, use Azure CLI to discover them:
 
 ```bash
@@ -282,13 +317,17 @@ import os
 import urllib.request
 
 ENDPOINT = os.environ["AGENT_HTTP_URL"]
-AUTH_TOKEN = os.environ.get("APP_API_TOKEN", "")
+# Auth: set AGENT_AUTH_HEADER and AGENT_AUTH_TOKEN env vars if your endpoint requires auth.
+# Example: AGENT_AUTH_HEADER=dapr-api-token  AGENT_AUTH_TOKEN=dev-token
+#          AGENT_AUTH_HEADER=X-API-KEY        AGENT_AUTH_TOKEN=my-key
+AUTH_HEADER = os.environ.get("AGENT_AUTH_HEADER", "")
+AUTH_TOKEN = os.environ.get("AGENT_AUTH_TOKEN", "")
 
 def run_evaluation(input_text: str, context: dict) -> dict:
     body = json.dumps({"message": input_text}).encode()
     headers = {"Content-Type": "application/json"}
-    if AUTH_TOKEN:
-        headers["dapr-api-token"] = AUTH_TOKEN
+    if AUTH_HEADER and AUTH_TOKEN:
+        headers[AUTH_HEADER] = AUTH_TOKEN
     req = urllib.request.Request(ENDPOINT, data=body, headers=headers, method="POST")
     with urllib.request.urlopen(req, timeout=120) as resp:
         data = json.loads(resp.read())
@@ -302,13 +341,15 @@ import os
 import urllib.request
 
 ENDPOINT = os.environ["AGENT_HTTP_URL"]
-AUTH_TOKEN = os.environ.get("APP_API_TOKEN", "")
+# Auth: set AGENT_AUTH_HEADER and AGENT_AUTH_TOKEN env vars if your endpoint requires auth.
+AUTH_HEADER = os.environ.get("AGENT_AUTH_HEADER", "")
+AUTH_TOKEN = os.environ.get("AGENT_AUTH_TOKEN", "")
 
 def run_evaluation(input_text: str, context: dict) -> dict:
     body = json.dumps({"message": input_text}).encode()
     headers = {"Content-Type": "application/json"}
-    if AUTH_TOKEN:
-        headers["dapr-api-token"] = AUTH_TOKEN
+    if AUTH_HEADER and AUTH_TOKEN:
+        headers[AUTH_HEADER] = AUTH_TOKEN
     req = urllib.request.Request(ENDPOINT, data=body, headers=headers, method="POST")
     chunks = []
     try:
@@ -340,11 +381,20 @@ def run_evaluation(input_text: str, context: dict) -> dict:
 ```
 
 Customize the adapter:
-- **Dapr auth** (`dapr-api-token` / `APP_API_TOKEN` found in code or `.env`) → keep the auth lines above.
-- **API key** (`X-API-KEY` / `api_key` / `API_KEY` found in code or `.env`) → change header to `headers["X-API-KEY"] = AUTH_TOKEN` and env var to `API_KEY`.
-- **Bearer token** (`Authorization: Bearer` found in code) → recommend using `http` backend with `auth_header_env` instead of callable.
-- **No auth found** → remove the `AUTH_TOKEN` lines entirely.
+- **Apply the auth pattern detected in Step 2.** Use the table below to wire the correct header and env var into the adapter:
+
+| Auth detected in Step 2 | Adapter env var | Header line in adapter |
+|---|---|---|
+| `dapr-api-token` / `APP_API_TOKEN` | `AGENT_AUTH_TOKEN` (tell user to set it to their Dapr token) | `headers["dapr-api-token"] = AUTH_TOKEN` |
+| `X-API-KEY` / `api_key` / `API_KEY` | `AGENT_AUTH_TOKEN` (tell user to set it to their API key) | `headers["X-API-KEY"] = AUTH_TOKEN` |
+| `Authorization: Bearer` | Recommend HTTP backend with `auth_header_env` instead of callable adapter | N/A |
+| No auth detected | Remove `AUTH_TOKEN` and auth header lines entirely | N/A |
+
+  **Important**: Do NOT generate the adapter with auth lines commented out or using hardcoded tokens. If auth was detected, the adapter must include the correct header from the start — otherwise the smoke test will fail with 401.
+
 - **Choose the right template:** If the agent code uses `yield`, `StreamingResponse`, `EventSourceResponse`, or `text/event-stream` content type, use the **SSE/streaming adapter** template. Otherwise use the **standard JSON adapter**.
+- **Customize the request field:** If the agent expects a different key than `"message"` (e.g. `"ask"`, `"question"`, `"input"`), change the `json.dumps({"message": ...})` line to match.
+- **Customize the response extraction:** If the agent returns a different key than `"text"` or `"response"`, update the `.get()` call accordingly.
 
 ### Context sanitization (RAG scenarios)
 
@@ -424,17 +474,29 @@ import sys; sys.path.insert(0, '.agentops')
 from callable_adapter import run_evaluation
 result = run_evaluation('hello', {})
 assert 'response' in result, f'Missing response key: {result}'
-assert not result['response'].startswith('ERROR:'), f'Adapter error: {result[\"response\"]}'
+resp = result['response']
+assert not resp.startswith('ERROR:'), f'Adapter error: {resp}'
+assert len(resp.strip()) > 0, 'Empty response — check endpoint and request format'
 print('Smoke test PASSED')
-print('Response preview:', result['response'][:120])
+print(f'Response length: {len(resp)} chars')
+print('Response preview:', resp[:200])
 "
 ```
 
 If the smoke test fails:
 - **Connection refused** → the agent endpoint is not running. Start it first.
-- **401 Unauthorized** → auth token is missing or wrong. Check the env var.
-- **400/422** → the request body format doesn't match the endpoint. Check `request_field`.
+- **401 Unauthorized** → auth token is missing or wrong. Check `AGENT_AUTH_HEADER` and `AGENT_AUTH_TOKEN` env vars.
+- **400/422** → the request body format doesn't match the endpoint. Check the `json.dumps({"message": ...})` field name in the adapter — the endpoint may expect a different key (e.g. `"ask"`, `"question"`, `"input"`).
 - **Response starts with `ERROR:`** → the adapter caught an exception. Read the error message.
+- **Empty response** → the endpoint returned successfully but the adapter extracted no text. Check `response_field` / `.get()` key in the adapter.
+- **Response contains unexpected prefix** (UUID, metadata, HTML) → add a post-processing step to the adapter to strip it. Common pattern: `re.sub(r'^[0-9a-f-]{36}\s*', '', response_text)` for UUID prefixes.
+
+### Smoke test response format verification
+
+After the basic smoke test passes, verify the response format matches expectations:
+1. If the response contains HTML tags (`<html>`, `<div>`, etc.) but the adapter expects plain text → the endpoint may be returning an error page, not agent output.
+2. If the response is very short (< 10 chars) for a conversational prompt like "hello" → warn the user: *"Response seems unusually short. Verify the endpoint is returning the full agent response."*
+3. If the response starts with `data:` or contains SSE markers but the adapter uses the standard JSON template → switch to the SSE/streaming adapter template.
 
 Do NOT proceed to Step 7 until the smoke test passes.
 
@@ -482,11 +544,67 @@ agentops report generate [--in results.json]            # Regenerate report
 - **NEVER** fabricate `agent_id`, model names, or endpoint URLs.
 - **NEVER** edit `.agentops/` template files (`run-callable.yaml`, `run-http-rag.yaml`, etc.) — always update `.agentops/run.yaml`.
 - **NEVER** use dotted import paths like `.agentops.callable_adapter` — they fail.
-- **NEVER** create files outside `.agentops/` — all generated artifacts (adapters, datasets, configs, scripts) belong in `.agentops/`.
+- **NEVER** create files outside `.agentops/` — all generated artifacts (adapters, datasets, configs, scripts) belong in `.agentops/`. Exception: unit tests go in the project's existing test directory.
 - **NEVER** try `az login` automatically — ask the user to authenticate.
 - **NEVER** use `requests` or `httpx` in callable adapters — use only stdlib (`urllib.request`, `json`, `os`).
 - If a bundle uses SDK-version-dependent evaluators, verify availability before running (Step 4.5). Don't block on this — if verification is hard, proceed and fix on failure.
 - Always update `.agentops/run.yaml` — do not create custom-named files except for multi-model benchmarks.
 - Use generic file names: `dataset.yaml`, `data.jsonl` — not project-specific prefixes.
 - Use plain language in questions — not technical jargon ("callable adapter", "SSE", "POST").
 - Always run pre-flight (Step 6) before executing. Fix all issues first.
+
+## Unit Test Generation (Optional)
+
+This section is only executed if the user accepted the unit test offer in Step 1.
+
+### When to generate
+
+- The codebase has Python, JavaScript, or TypeScript agent code with testable logic (endpoint handlers, tool definitions, response parsing, orchestration).
+- No existing test files or test directories were detected.
+
+### What to generate
+
+Create tests in the project's conventional test directory (e.g. `tests/test_agent.py` for Python, `__tests__/agent.test.ts` for TypeScript). Use only standard testing libraries — no extra dependencies.
+
+**For Python agents**, generate `pytest` tests using `unittest.mock`:
+
+1. **Endpoint handler test** — mock the HTTP framework (FastAPI `TestClient`, Flask `test_client`) and verify the handler returns expected response format.
+2. **Response parsing test** — if the agent has response parsing logic (JSON extraction, SSE chunk assembly, UUID stripping), test it with known inputs/outputs.
+3. **Error handling test** — verify the agent handles timeouts, 4xx/5xx from downstream services, and malformed inputs gracefully.
+4. **Tool schema test** (if applicable) — if the agent defines tools with schemas, validate the schema structure is correct (required fields, types).
+
+**Template pattern** (adapt to the detected code):
+```python
+"""Unit tests for agent endpoint — generated by AgentOps."""
+import json
+from unittest.mock import MagicMock, patch
+
+import pytest
+
+
+class TestAgentEndpoint:
+    """Tests for the agent's HTTP endpoint handler."""
+
+    def test_returns_valid_response_format(self):
+        # Mock the downstream model/service call
+        # Call the endpoint handler directly
+        # Assert response has expected keys and types
+        ...
+
+    def test_handles_empty_input(self):
+        # Verify the agent handles empty or whitespace-only input
+        ...
+
+    def test_handles_downstream_timeout(self):
+        # Mock the downstream call to raise a timeout
+        # Assert the agent returns an error response (not a crash)
+        ...
+```
+
+### Rules for generated tests
+
+- Tests must run **without** Azure credentials or live services — all external calls must be mocked.
+- Do not generate tests that duplicate what AgentOps evaluations already cover (response quality, groundedness, coherence).
+- Focus on **functional correctness**: does the code do what it's supposed to do?
+- Place tests in the project's existing test directory structure, not in `.agentops/`.
+- If the project uses a specific test runner or framework (detected via `pyproject.toml`, `package.json`, `conftest.py`), follow its conventions.