Skip to content

ARM API Reviewer agent: mine human reviewer comments for skill gaps (iteration 1, recurring) #43471

@ravimeda

Description

@ravimeda

Background

The ARM API Reviewer agent's rule set is curated by hand from the instruction files in .github/instructions/ and the references in .github/skills/azure-api-review/. Human ARM reviewers (razvanbadea-msft, markcowl, sandipsh, mentat9, nachoalonsoportillo, and others) continue to catch real, recurring classes of issues on the same PRs the agent reviews. Each such gap is a concrete signal that the agent's instructions are missing a rule, missing a checklist item, or missing a detection heuristic.

This issue should recur every 4 to 6 weeks. The proposal is to make "mine reviewer comments for agent skill gaps" a standing periodic task. This issue is the first occurrence of that recurrence. When the suggestions below are addressed (or triaged), the closer should open the next iteration of this issue with a fresh corpus from the intervening weeks.

Method (reproducible)

For the next iteration, run the same process:

  1. Enumerate PRs where the agent has commented in the analysis window by searching for the substring posted-by: arm-api-reviewer-agent in PR review comments. This substring matches both the structured marker format introduced in PR #42824 (<!-- posted-by: arm-api-reviewer-agent | rule: ... | severity: ... | classification: ... -->) and the legacy simple marker (<!-- posted-by: arm-api-reviewer-agent -->) that predates it. The search must capture both formats so older PRs are not missed.
  2. For each such PR, fetch all pulls/{n}/comments and filter to comments authored by known human ARM reviewers (current top reviewers: razvanbadea-msft, sandipsh, markcowl, mentat9, psah434, nachoalonsoportillo, mohitsinha-ms, pmoana-ms, tadelesh).
  3. Exclude comments that re-post the agent's own findings (same **[NEW] Blocking [RULE-ID]** template format) by reading the comment body, not just the author.
  4. Categorize the remaining substantive reviewer comments into themes. Any theme that recurs across more than one PR or more than one reviewer is a candidate skill gap.
  5. For each gap, propose a concrete instruction-file change, rule ID, severity policy, and (where possible) eval fixture.

Current corpus

  • Window: 2026-02-23 through 2026-05-23 (past 3 months).
  • PRs analyzed: 30 PRs where the agent had at least one comment.
  • Reviewer comments scanned: 282 from 9 known human ARM reviewers.
  • Themes that recurred: 10 (listed below).

Findings - skill gaps to close

Each finding lists representative reviewer comments and a concrete proposal.

1. EX-EXAMPLE-URL-HOST - non-prod hosts in example URLs

The agent does not flag dogfood, integration, or legacy management endpoints in example payloads.

  • psah434 on #42226: "dogfood URLs in LRO headers. Replace api-dogfood.resources.windows-int.net with management.azure.com."
  • razvanbadea-msft on #42226: "Multiple new examples use the legacy https://management.windowsazure.com/... endpoint in 202 response LRO headers. This MUST be https://management.azure.com/...."

Proposal: add an EX-EXAMPLE-URL-HOST rule to openapi-review.instructions.md Section 22 with an explicit host denylist: api-dogfood.*, *.windows-int.net, *.windows-ppe.net, management.windowsazure.com. Severity: blocking for new examples.

2. EX-RESOURCE-ID-CASING - ARM resource-ID segment casing in examples

  • psah434 on #42226: "resourcegroups should be resourceGroups in the ARM resource ID."

Proposal: add a rule under EX-* that validates the canonical casing of ARM ID segments (subscriptions, resourceGroups, providers, Microsoft.X) in example request URLs and response bodies.

3. TYPESPEC-DESIGN-SMELLS - prefer idiomatic TypeSpec over hand-rolled patterns

Multiple markcowl comments push design refactors the agent never raises:

  • #42717: "Use ProxyResource<T> here, which removes the need for the private decorator."
  • #41694: "Using openapi decoration for LROs ensures client SDKs will not understand them."
  • #41694: "This is more complicated than using a template for this operation."
  • #41694: "Make these parameters that appear a lot into an alias and reuse the alias multiple times."

Proposal: add a "TypeSpec design smells" section to typespec-review.instructions.md that flags:

  • Hand-rolled proxy/tracked resource definitions when ProxyResource<T> / TrackedResource<T> would suffice.
  • @useFinalStateVia and other OpenAPI decorators on LROs where an Azure.Core LRO template would be simpler and SDK-friendly.
  • Repeated parameter blocks (three or more occurrences of the same ...ApiVersionParameter; ...SubscriptionIdParameter; ...ResourceGroupParameter pattern) that should be extracted to an alias.

These are warnings, not blocking - the agent should suggest the refactor.

4. SCHEMA-EXTENSIBILITY - forward-compatible types

  • razvanbadea-msft on #42266: "why not replace it with an enum property with values like ApplyNow, Reschedule. This will help adding a new value later in a non breaking way."
  • mentat9 on #42226: "Don't use unrestricted string for discriminator type. This should be changed to a union and the suppression removed."

Proposal: add two rules to openapi-review.instructions.md Section 5:

  • SCHEMA-BOOLEAN-FOR-ENUM: warning when a new boolean property's name suggests a state machine (enable*, is*Mode, action, mode, type).
  • SCHEMA-DISCRIMINATOR-MUST-BE-UNION: blocking when a discriminator value type is an unrestricted string instead of a closed union/enum.

5. TERMINOLOGY-REFRESH-ON-NEW-VERSION - Azure AD -> Microsoft Entra ID

  • razvanbadea-msft on #42280: "All client-facing descriptions in this file still use 'Azure Active Directory' instead of 'Microsoft Entra ID'. Since this is a new preview version, this is the right time to update the terminology - carrying forward [legacy term] is a missed opportunity."

Proposal: add a TERMINOLOGY-REFRESH-V1 rule that fires only on new API version folders (not on patches to existing versions) when descriptions still use:

  • "Azure Active Directory" / "AAD" -> "Microsoft Entra ID"
  • "AAD application" -> "Microsoft Entra application"

Severity: warning. Out of scope: rewording inside existing versions (breaking change risk).

6. RPC-Get-V1-01 tightening - GET returning 202

Already in the rulebook, but missed by the agent on multiple operations in #41486.

  • sandipsh on #41486: "WorkspaceDiscoveries_GetLatest is currently modeled as GET returning 202 with body/headers. That conflicts with ARM GET contract expectations (RPC-Get-V1-01: GET should be a normal read response)."
  • Similar follow-up on #41486 for WorkspaceEvaluations_GetLatest.

Proposal: strengthen the per-operation checklist in armapi-review.instructions.md to explicitly enumerate response-code requirements for GET (200, default only; no 202; no LRO headers) and add an eval fixture seeded with a GET-returns-202 operation. Verify the agent flags it.

7. LRO-LEAST-SPECIFICATION - over-specified final-state-via

  • sandipsh on #42602: "Please remove final-state-via: location for this standard ARM DELETE LRO unless there is a confirmed non-standard polling contract that requires it. Keeping it without need increases SDK behavior divergence risk."

Proposal: add a rule that flags x-ms-long-running-operation-options.final-state-via on standard DELETE / PUT LROs unless the description justifies a non-default polling contract. Severity: warning.

8. CI-FAILURE-CORRELATION - lead with concrete CI blockers

  • sandipsh and razvanbadea-msft on #42331: "Blocking CI: Swagger Avocado - Analyze Code is failing with MULTIPLE_API_VERSION on tag: package-2025-07-01. Please update the tag configuration so the default package does not mix API versions."

Proposal: add a pre-review step (new Step 2.5 in arm-api-reviewer.agent.md) that calls gh api /repos/{owner}/{repo}/commits/{sha}/check-runs, filters to failing checks (Swagger Avocado, LintDiff, BreakingChange), and includes the failure summary in the agent's report so the human reviewer leads with concrete blockers instead of duplicating CI noise.

9. SUPPRESSION-JUSTIFICATION-DENYLIST - reject boilerplate justifications

  • mentat9 on #42226: "please always replace the boilerplate comment with a real, legitimate justification."
  • sandipsh on #41475: "any remaining arm-resource-provisioning-state suppressions with generic FIXME justification..."

Proposal: tighten the existing "new suppression must have a clear reason" check by adding a denylist of low-information justifications: FIXME, TODO, legacy, existing pattern, will fix later, bare quote of the rule name, single-word reasons. Severity: blocking for security-related rules, warning otherwise.

10. DUPLICATE-COLLECTION-MODEL - inline child collection + child-resource endpoint

  • sandipsh on #42331: "InterconnectGroupPropertiesFormat.subgroups remains as Subgroup[]. Can you clarify why inline child resources are required here despite separate /subgroups GET/List operations?"

Proposal: add a cross-check that flags an inline collection property (Foo[] typed as a child-resource model) when a sibling child-resource endpoint (/foos, /foos/{name}) also exists. Severity: warning. This is a modeling-clarity flag, not a correctness flag.

References

Requested work

  1. Triage each of the 10 findings above. For each, decide: (a) add a new rule, (b) tighten an existing rule, (c) defer with rationale, or (d) close as out-of-scope.
  2. For findings that proceed, file a focused follow-up issue per rule (matching the structure of #43349, #43384) so each rule change ships with its own eval fixtures and PR.
  3. Update the instruction and skill files per the proposals above.
  4. Add eval fixtures under .github/skills/evals/arm-api-reviewer/fixtures/ for each new rule, with both a positive control (should not flag) and a negative control (should flag), and [NEW] / [EXISTING] variants where the severity policy depends on classification.
  5. Run the full eval suite locally and confirm new stimuli pass.
  6. Schedule the next recurrence. When this issue is closed, open the next iteration with title "ARM API Reviewer agent: mine reviewer comments for skill gaps (iteration N+1)", set the analysis window to the period since this issue's created_at, and follow the same method.

Acceptance criteria

  • Each of the 10 findings is either addressed (rule change shipped, eval fixture added) or has a recorded triage decision (deferred / out-of-scope / superseded by an existing issue).
  • A follow-up "iteration 2" issue is filed before this issue is closed, with a fresh analysis window covering the time since this issue was created.
  • The recurrence cadence (every 4 to 6 weeks) is documented either in arm-api-reviewer.agent.md (a new "Maintenance" section) or in the README of .github/skills/evals/arm-api-reviewer/, so anyone can pick up the next iteration without context from this thread.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No fields configured for Task.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions