docs(skill): require Cognitive Services OpenAI User as prereq RBAC role (#203) (#204)

placerda · Copilot · web-flow · commit 690db389f514 · 2026-05-29T05:21:41.000-03:00
Foundry `azure_ai_evaluator` graders impersonate the OIDC principal
to call OpenAI; without `Cognitive Services OpenAI User` on the
underlying AI Services account the graders fail with a 401
PermissionDenied and every cloud eval metric returns null. Verified
end-to-end on placerda/agentops-prompt-quickstart: after granting the
role, the first PR run goes green from scratch.

- agentops-workflow SKILL.md: pre-dispatch checks now list both Foundry
  User (Foundry project) AND Cognitive Services OpenAI User (AI
  Services account), with role ids and az role assignment create
  commands for each.
- tutorial-prompt-agent-quickstart.md: step 12's Copilot prompt and the
  workflow-skill walkthrough list both roles.
- tutorial-end-to-end.md: both workflow-skill prompts list both roles.
- docs/ci-github-actions.md: prerequisite section lists both roles with
  the OpenAI graders' failure mode spelled out.
- plugins/agentops/skills/agentops-workflow/SKILL.md: synced from src/.
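
For reference, a minimal dry-run sketch of the two grants described above.
It only prints the `az role assignment create` commands so they can be
reviewed before anyone runs them. The helper name `print_role_cmds` and the
variable names are hypothetical and not part of AgentOps; the role ids and
flags are the ones listed in SKILL.md, and the three arguments are
placeholders to fill in.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the role-assignment commands, does not call Azure.

FOUNDRY_USER_ROLE_ID="53ca6127-db72-4b80-b1b0-d745d6d5456d"   # Foundry User
OPENAI_USER_ROLE_ID="5e0bd9bd-7b93-4f28-af87-19fc36ad61bd"    # Cognitive Services OpenAI User

# Hypothetical helper: args are the service principal object id, the Foundry
# project (or resource) scope, and the AI Services account scope.
print_role_cmds() {
  local sp_object_id="$1" foundry_scope="$2" ai_account_scope="$3"
  printf 'az role assignment create --assignee-object-id %s --assignee-principal-type ServicePrincipal --role %s --scope %s\n' \
    "$sp_object_id" "$FOUNDRY_USER_ROLE_ID" "$foundry_scope"
  printf 'az role assignment create --assignee-object-id %s --assignee-principal-type ServicePrincipal --role %s --scope %s\n' \
    "$sp_object_id" "$OPENAI_USER_ROLE_ID" "$ai_account_scope"
}

# Placeholder values; replace them with real ids and scopes before use.
print_role_cmds "<sp-object-id>" "<foundry-scope>" "<ai-services-account-scope>"
```

Printing instead of executing matches the skill's rule to show the plan and
ask for approval before changing Azure RBAC.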

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,6 +5,20 @@ This format follows [Keep a Changelog](https://keepachangelog.com/) and adheres
 
 ## [Unreleased]
 
+### Changed
+- **Skill + tutorial guidance now require `Cognitive Services OpenAI User` as a prerequisite RBAC role.**
+  The `agentops-workflow` skill, `tutorial-prompt-agent-quickstart.md`,
+  `tutorial-end-to-end.md`, and `docs/ci-github-actions.md` now instruct users
+  to grant the OIDC/CI service principal **both** Foundry User on the Foundry
+  project **and** Cognitive Services OpenAI User on the underlying Azure AI
+  Services account that hosts the evaluator model deployment. Foundry
+  `azure_ai_evaluator` graders impersonate the OIDC principal to call OpenAI;
+  without the OpenAI User role they fail with a 401 `PermissionDenied` and
+  every cloud eval metric returns `null`, blocking the first PR run. The skill
+  now emits the matching `az role assignment create` commands for both roles
+  (role ids `53ca6127-db72-4b80-b1b0-d745d6d5456d` and
+  `5e0bd9bd-7b93-4f28-af87-19fc36ad61bd`) before dispatching the workflow.
+
 ### Fixed
 - **Cloud eval surfaces grader execution errors instead of silent nulls.**
   When a Foundry `azure_ai_evaluator` grader fails to execute (most
diff --git a/docs/ci-github-actions.md b/docs/ci-github-actions.md
@@ -125,9 +125,23 @@ from GitHub Actions runs. See
 [Microsoft's WIF docs](https://learn.microsoft.com/azure/active-directory/workload-identities/workload-identity-federation-create-trust?pivots=identity-wif-apps-methods-azp).
 
 For Foundry prompt-agent gates, the same app registration / service principal
-also needs **Foundry User** on the Foundry project or Foundry resource. Azure
-`Reader` is not enough because the eval step calls Foundry data-plane APIs such
-as `Microsoft.CognitiveServices/accounts/AIServices/agents/read`.
+needs **two** Azure RBAC roles before the first workflow run. Both are required
+and the eval step fails silently (every metric returns `null`) if only one is
+in place:
+
+- **Foundry User** on the Foundry project or Foundry resource. Azure `Reader`
+  is not enough because the eval step calls Foundry data-plane APIs such as
+  `Microsoft.CognitiveServices/accounts/AIServices/agents/read`.
+- **Cognitive Services OpenAI User** on the underlying Azure AI Services
+  account that hosts the evaluator model deployment. Foundry `azure_ai_evaluator`
+  graders impersonate the OIDC principal to call OpenAI; without this role
+  they fail with a 401 `PermissionDenied` on
+  `Microsoft.CognitiveServices/accounts/OpenAI/deployments/chat/completions/action`
+  and every metric returns `null` in the cloud eval report. AgentOps lifts that
+  error into `results.json` and the orchestrator's "0 usable metric scores"
+  warning so you can see the cause in CI logs, but the workflow still fails the
+  gate. The role ids are `53ca6127-db72-4b80-b1b0-d745d6d5456d` (Foundry User)
+  and `5e0bd9bd-7b93-4f28-af87-19fc36ad61bd` (Cognitive Services OpenAI User).
 
 The generated eval and doctor workflows install AgentOps telemetry support.
 When `AZURE_AI_FOUNDRY_PROJECT_ENDPOINT` is set, AgentOps first tries to
diff --git a/docs/tutorial-end-to-end.md b/docs/tutorial-end-to-end.md
@@ -418,8 +418,11 @@ this Foundry prompt-agent repo.
 Create or connect the GitHub repo if needed, create the `dev` environment, wire
 Azure OIDC, set AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini as a GitHub `dev`
 environment variable or equivalent Azure DevOps pipeline variable, verify the
-OIDC principal has Foundry User access, and show me the plan before changing
-GitHub or Azure.
+OIDC principal has **both** Foundry User access on the dev Foundry project
+**and** Cognitive Services OpenAI User access on the underlying Azure AI
+Services account that hosts the evaluator model (both are required — without
+the OpenAI User role, every cloud eval metric returns null), and show me the
+plan before changing GitHub or Azure.
 ```
 
 That value is not an `agentops init` answer. It tells the Foundry cloud eval
@@ -568,10 +571,13 @@ workflows running for this Foundry agent repo.
 
 Extend the PR/dev setup if it already exists, wire Azure OIDC for the `qa` and
 `production` environments, confirm required Actions variables such as
-AZURE_OPENAI_DEPLOYMENT, verify the OIDC principals have Foundry User access,
-and keep deploy placeholders unless this repo already has an azd deployment
-path. Show me the plan before changing GitHub or Azure, and call out anything
-that needs owner/admin permission.
+AZURE_OPENAI_DEPLOYMENT, verify the OIDC principals have **both** Foundry User
+access on each Foundry project **and** Cognitive Services OpenAI User on the
+underlying AI Services account hosting the evaluator model (both are required
+— without the OpenAI User role, every cloud eval metric returns null), and
+keep deploy placeholders unless this repo already has an azd deployment path.
+Show me the plan before changing GitHub or Azure, and call out anything that
+needs owner/admin permission.
 ```
 
 Use this moment in the video to connect the four repos: Foundry Toolkit creates
diff --git a/docs/tutorial-prompt-agent-quickstart.md b/docs/tutorial-prompt-agent-quickstart.md
@@ -569,9 +569,12 @@ This may be a brand-new folder with no Git repo or GitHub remote yet.
 Keep the scope to the PR gate and dev deploy only: create or connect the
 GitHub repo if needed, wire Azure OIDC and required Actions
 variables/secrets, create only the `dev` environment, verify the OIDC
-principal has Foundry User access on the **dev** Foundry project, and
-do not set up `qa`, `production`, scheduled Doctor, or hosted
-deployment workflows yet.
+principal has **both** Foundry User access on the **dev** Foundry project
+**and** Cognitive Services OpenAI User on the underlying Azure AI Services
+account that hosts the evaluator model (both roles are required — without
+the OpenAI User role, the Foundry cloud graders fail with a 401 and every
+metric comes back null), and do not set up `qa`, `production`, scheduled
+Doctor, or hosted deployment workflows yet.
 
 The dev Foundry project endpoint is in `.azure/dev/.env`; the sandbox
 endpoint is local-only and must not be added to CI.
@@ -589,9 +592,19 @@ it skips:
 - Set Actions variables `AZURE_TENANT_ID`, `AZURE_SUBSCRIPTION_ID`,
   `AZURE_CLIENT_ID`, `AZURE_AI_FOUNDRY_PROJECT_ENDPOINT` (the dev
   endpoint), and `APPLICATIONINSIGHTS_CONNECTION_STRING` if available.
-- Verify the OIDC principal has **Foundry User** access on the dev
-  Foundry project. Reader alone is not enough for the data-plane calls
-  the prompt-agent staging and eval steps make.
+- Verify the OIDC principal has **two** Azure RBAC roles before the first
+  run. Both are required and the eval step fails silently (every metric
+  returns `null`) if only one is in place:
+  - **Foundry User** on the dev Foundry project — Reader alone is not
+    enough for the data-plane calls the prompt-agent staging and eval steps
+    make.
+  - **Cognitive Services OpenAI User** on the underlying Azure AI Services
+    account that hosts the evaluator model deployment. Foundry
+    `azure_ai_evaluator` graders impersonate the OIDC principal to call
+    OpenAI; without this role they fail with a 401 `PermissionDenied`. The
+    AgentOps cloud-results parser lifts that error into `results.json` so
+    you can see the cause in the artifact, but the workflow still fails
+    the gate.
 
 ## 13. First green PR → merge → dev deploy
 
diff --git a/plugins/agentops/skills/agentops-workflow/SKILL.md b/plugins/agentops/skills/agentops-workflow/SKILL.md
@@ -100,22 +100,40 @@ by discovering the whole Azure subscription.
    `repo:<owner>/<repo>:environment:dev`. Do not assume branch or
    `pull_request` subjects without reading the workflow.
 9. Before triggering a Foundry prompt-agent workflow, make sure the OIDC app /
-   service principal has Foundry data-plane access. It needs **Foundry User**
-   (role id `53ca6127-db72-4b80-b1b0-d745d6d5456d`, formerly Azure AI User) at
-   the Foundry project scope, or at the Foundry resource scope if that is the
-   team's standard. Azure **Reader** is not enough; without this role the eval
-   step fails on
-   `Microsoft.CognitiveServices/accounts/AIServices/agents/read`.
-10. If the Foundry RBAC assignment is missing, do not run the workflow yet.
-   Show the exact GitHub OIDC client ID / service principal, desired role, and
-   target Foundry scope, then ask the user to approve the role assignment or
+   service principal has **two** RBAC assignments. Both are required; the eval
+   step fails silently (every metric returns `null`) if only one is in place.
+   1. **Foundry User** on the Foundry project (or the Foundry resource scope
+      if that is the team's standard). Role id
+      `53ca6127-db72-4b80-b1b0-d745d6d5456d` (formerly Azure AI User). Without
+      this the candidate-staging step fails on
+      `Microsoft.CognitiveServices/accounts/AIServices/agents/read`.
+   2. **Cognitive Services OpenAI User** on the underlying Azure AI Services
+      account that hosts the evaluator model deployment
+      (typically the parent account of the Foundry project). Role id
+      `5e0bd9bd-7b93-4f28-af87-19fc36ad61bd`. Without this the Foundry
+      `azure_ai_evaluator` graders fail with a 401 `PermissionDenied` on
+      `Microsoft.CognitiveServices/accounts/OpenAI/deployments/chat/completions/action`
+      and every metric comes back `null` in the cloud eval report. AgentOps now
+      lifts that error into `results.json` and the orchestrator's "0 usable
+      metric scores" warning so the cause is visible in CI logs, but the
+      workflow still fails the gate. Grant this role **before** the first run.
+   Azure **Reader** is not enough for either step.
+10. If either RBAC assignment is missing, do not run the workflow yet.
+   Show the exact GitHub OIDC client ID / service principal, desired role, and
+   target scope (project for Foundry User, AI Services account for Cognitive
+   Services OpenAI User), then ask the user to approve the role assignment or
    get an Azure/Foundry admin to grant it. After assignment, read it back or ask
    the user to confirm before dispatching the workflow.
-   When the user approves and you know the Foundry scope, use the role id to
-   avoid rename drift:
+   When the user approves and you know the scopes, use the role ids to avoid
+   rename drift:
    - `az ad sp show --id <AZURE_CLIENT_ID> --query id -o tsv`
    - `az role assignment list --assignee <sp-object-id> --scope <foundry-scope> --include-inherited`
    - `az role assignment create --assignee-object-id <sp-object-id> --assignee-principal-type ServicePrincipal --role 53ca6127-db72-4b80-b1b0-d745d6d5456d --scope <foundry-scope>`
+   - `az role assignment create --assignee-object-id <sp-object-id> --assignee-principal-type ServicePrincipal --role 5e0bd9bd-7b93-4f28-af87-19fc36ad61bd --scope <ai-services-account-scope>`
+   The AI Services account scope looks like
+   `/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.CognitiveServices/accounts/<ai-account-name>`
+   and can be derived from
+   `az cognitiveservices account list --resource-group <foundry-project-rg> --query "[?kind=='AIServices'].id" -o tsv`.
 11. Ask before creating or updating GitHub repos, GitHub environments,
    variables/secrets, Entra app registrations/service principals, federated
    credentials, managed identities, or Azure RBAC assignments.
@@ -304,11 +322,21 @@ Then configure Workload Identity Federation on the Azure side
 environment** the workflows will run from. See
 `docs/ci-github-actions.md` for the exact `az` commands.
 
-Also grant the same app registration / service principal **Foundry User** on the
-Foundry project or Foundry resource before the first workflow run. The PR gate
-uses Foundry data-plane APIs to read prompt agents; Azure `Reader` only proves
-ARM access and will still fail the eval step with
-`Microsoft.CognitiveServices/accounts/AIServices/agents/read`.
+Also grant the same app registration / service principal **two** Azure
+RBAC roles before the first workflow run; both are required and the eval
+step fails silently (every metric returns `null`) if only one is in place:
+
+1. **Foundry User** on the Foundry project or Foundry resource. The PR gate
+   uses Foundry data-plane APIs to read prompt agents; Azure `Reader` only
+   proves ARM access and will still fail the eval step with
+   `Microsoft.CognitiveServices/accounts/AIServices/agents/read`.
+2. **Cognitive Services OpenAI User** on the underlying Azure AI Services
+   account that hosts the evaluator model deployment. Without this, Foundry
+   `azure_ai_evaluator` graders fail with a 401 `PermissionDenied` on the
+   OpenAI `chat/completions/action` data action and every metric returns
+   `null` in the cloud eval report. AgentOps surfaces that error in
+   `results.json` and the orchestrator's "0 usable metric scores" warning,
+   but the workflow still fails the gate — fix the role before the run.
 
 Tell the user that CI evals emit `agentops.eval.*` telemetry and scheduled
 Doctor runs emit `agentops.agent.finding.*` telemetry when App Insights is
@@ -319,7 +347,11 @@ Monitor deep links.
 
 Already done in Step 2 - the `agentops-azure` service connection
 handles auth. Make sure the underlying service principal or managed
-identity has the **Foundry User** role on the Foundry project or resource.
+identity has **both** the **Foundry User** role on the Foundry project (or
+Foundry resource) **and** the **Cognitive Services OpenAI User** role on the
+underlying Azure AI Services account that hosts the evaluator model. Both
+are required; without the OpenAI User role the Foundry graders fail with a
+401 `PermissionDenied` and every cloud eval metric returns `null`.
 
 ## Step 4 - Use azd for deployment
 
diff --git a/src/agentops/templates/skills/agentops-workflow/SKILL.md b/src/agentops/templates/skills/agentops-workflow/SKILL.md
@@ -100,22 +100,40 @@ by discovering the whole Azure subscription.
    `repo:<owner>/<repo>:environment:dev`. Do not assume branch or
    `pull_request` subjects without reading the workflow.
 9. Before triggering a Foundry prompt-agent workflow, make sure the OIDC app /
-   service principal has Foundry data-plane access. It needs **Foundry User**
-   (role id `53ca6127-db72-4b80-b1b0-d745d6d5456d`, formerly Azure AI User) at
-   the Foundry project scope, or at the Foundry resource scope if that is the
-   team's standard. Azure **Reader** is not enough; without this role the eval
-   step fails on
-   `Microsoft.CognitiveServices/accounts/AIServices/agents/read`.
-10. If the Foundry RBAC assignment is missing, do not run the workflow yet.
-   Show the exact GitHub OIDC client ID / service principal, desired role, and
-   target Foundry scope, then ask the user to approve the role assignment or
+   service principal has **two** RBAC assignments. Both are required; the eval
+   step fails silently (every metric returns `null`) if only one is in place.
+   1. **Foundry User** on the Foundry project (or the Foundry resource scope
+      if that is the team's standard). Role id
+      `53ca6127-db72-4b80-b1b0-d745d6d5456d` (formerly Azure AI User). Without
+      this the candidate-staging step fails on
+      `Microsoft.CognitiveServices/accounts/AIServices/agents/read`.
+   2. **Cognitive Services OpenAI User** on the underlying Azure AI Services
+      account that hosts the evaluator model deployment
+      (typically the parent account of the Foundry project). Role id
+      `5e0bd9bd-7b93-4f28-af87-19fc36ad61bd`. Without this the Foundry
+      `azure_ai_evaluator` graders fail with a 401 `PermissionDenied` on
+      `Microsoft.CognitiveServices/accounts/OpenAI/deployments/chat/completions/action`
+      and every metric comes back `null` in the cloud eval report. AgentOps now
+      lifts that error into `results.json` and the orchestrator's "0 usable
+      metric scores" warning so the cause is visible in CI logs, but the
+      workflow still fails the gate. Grant this role **before** the first run.
+   Azure **Reader** is not enough for either step.
+10. If either RBAC assignment is missing, do not run the workflow yet.
+   Show the exact GitHub OIDC client ID / service principal, desired role, and
+   target scope (project for Foundry User, AI Services account for Cognitive
+   Services OpenAI User), then ask the user to approve the role assignment or
    get an Azure/Foundry admin to grant it. After assignment, read it back or ask
    the user to confirm before dispatching the workflow.
-   When the user approves and you know the Foundry scope, use the role id to
-   avoid rename drift:
+   When the user approves and you know the scopes, use the role ids to avoid
+   rename drift:
    - `az ad sp show --id <AZURE_CLIENT_ID> --query id -o tsv`
    - `az role assignment list --assignee <sp-object-id> --scope <foundry-scope> --include-inherited`
    - `az role assignment create --assignee-object-id <sp-object-id> --assignee-principal-type ServicePrincipal --role 53ca6127-db72-4b80-b1b0-d745d6d5456d --scope <foundry-scope>`
+   - `az role assignment create --assignee-object-id <sp-object-id> --assignee-principal-type ServicePrincipal --role 5e0bd9bd-7b93-4f28-af87-19fc36ad61bd --scope <ai-services-account-scope>`
+   The AI Services account scope looks like
+   `/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.CognitiveServices/accounts/<ai-account-name>`
+   and can be derived from
+   `az cognitiveservices account list --resource-group <foundry-project-rg> --query "[?kind=='AIServices'].id" -o tsv`.
 11. Ask before creating or updating GitHub repos, GitHub environments,
    variables/secrets, Entra app registrations/service principals, federated
    credentials, managed identities, or Azure RBAC assignments.
@@ -304,11 +322,21 @@ Then configure Workload Identity Federation on the Azure side
 environment** the workflows will run from. See
 `docs/ci-github-actions.md` for the exact `az` commands.
 
-Also grant the same app registration / service principal **Foundry User** on the
-Foundry project or Foundry resource before the first workflow run. The PR gate
-uses Foundry data-plane APIs to read prompt agents; Azure `Reader` only proves
-ARM access and will still fail the eval step with
-`Microsoft.CognitiveServices/accounts/AIServices/agents/read`.
+Also grant the same app registration / service principal **two** Azure
+RBAC roles before the first workflow run; both are required and the eval
+step fails silently (every metric returns `null`) if only one is in place:
+
+1. **Foundry User** on the Foundry project or Foundry resource. The PR gate
+   uses Foundry data-plane APIs to read prompt agents; Azure `Reader` only
+   proves ARM access and will still fail the eval step with
+   `Microsoft.CognitiveServices/accounts/AIServices/agents/read`.
+2. **Cognitive Services OpenAI User** on the underlying Azure AI Services
+   account that hosts the evaluator model deployment. Without this, Foundry
+   `azure_ai_evaluator` graders fail with a 401 `PermissionDenied` on the
+   OpenAI `chat/completions/action` data action and every metric returns
+   `null` in the cloud eval report. AgentOps surfaces that error in
+   `results.json` and the orchestrator's "0 usable metric scores" warning,
+   but the workflow still fails the gate — fix the role before the run.
 
 Tell the user that CI evals emit `agentops.eval.*` telemetry and scheduled
 Doctor runs emit `agentops.agent.finding.*` telemetry when App Insights is
@@ -319,7 +347,11 @@ Monitor deep links.
 
 Already done in Step 2 - the `agentops-azure` service connection
 handles auth. Make sure the underlying service principal or managed
-identity has the **Foundry User** role on the Foundry project or resource.
+identity has **both** the **Foundry User** role on the Foundry project (or
+Foundry resource) **and** the **Cognitive Services OpenAI User** role on the
+underlying Azure AI Services account that hosts the evaluator model. Both
+are required; without the OpenAI User role the Foundry graders fail with a
+401 `PermissionDenied` and every cloud eval metric returns `null`.
 
 ## Step 4 - Use azd for deployment