test(evals): Migrate evals to vitest harnesses by dcramer · Pull Request #1049 · getsentry/sentry-mcp

dcramer · 2026-06-05T04:13:09Z

Move the eval suite onto the current vitest-evals harness API and remove the legacy scorer/task-runner utilities. The eval specs now use shared helpers for prediction, full MCP tool-call runs, and embedded search-agent runs, which keeps traces, usage data, and report metadata centralized.

GitHub Eval Reporting

The eval workflow now writes JSON/JUnit output and publishes the Evaluation Results check with getsentry/vitest-evals@v0. Root shortcuts also serve the last JSON result with the local report UI.

Harness Documentation

Refresh the eval README, testing guide, adding-tools guide, and PR guidance so new evals stay harness-first and avoid reintroducing legacy boilerplate.

Upgrade vitest-evals and replace legacy scorer utilities with shared harnesses for tool prediction, MCP tool-call, and embedded-agent suites. Keep usage data and traces on the harness path while reducing per-spec boilerplate. Wire CI to publish the GitHub eval check and add local report UI shortcuts. Co-Authored-By: GPT-5 Codex <noreply@openai.com>

Use an explicit JSON-valued schema for predicted tool arguments so OpenAI structured output receives typed additionalProperties. Normalize embedded-agent runs through fallback sessions when AI SDK steps do not expose model metadata, preserving captured tool calls and usage without crashing the eval harness. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
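
A minimal sketch of what a "JSON-valued" argument type could look like, assuming a hand-written guard rather than the schema library the PR actually uses. `JsonValue`, `isJsonValue`, `PredictedToolCall`, and `parsePredictedToolCall` are hypothetical names for illustration and are not taken from the repository.

```typescript
// Hypothetical sketch: every predicted argument must be a plain JSON value,
// so a structured-output schema can give additionalProperties a concrete type.
type JsonValue =
  | string
  | number
  | boolean
  | null
  | JsonValue[]
  | { [key: string]: JsonValue };

function isJsonValue(value: unknown): value is JsonValue {
  if (value === null) return true;
  switch (typeof value) {
    case "string":
    case "boolean":
      return true;
    case "number":
      // NaN and Infinity have no JSON representation.
      return Number.isFinite(value);
    case "object": {
      if (Array.isArray(value)) return value.every((item) => isJsonValue(item));
      // Accept only plain objects, not Date, Map, class instances, etc.
      const proto = Object.getPrototypeOf(value);
      if (proto !== Object.prototype && proto !== null) return false;
      return Object.values(value as Record<string, unknown>).every((item) =>
        isJsonValue(item),
      );
    }
    default:
      // undefined, function, symbol, bigint
      return false;
  }
}

interface PredictedToolCall {
  name: string;
  arguments: { [key: string]: JsonValue };
}

// Returns null instead of throwing so a caller can record a failed prediction.
function parsePredictedToolCall(raw: unknown): PredictedToolCall | null {
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) return null;
  const candidate = raw as Record<string, unknown>;
  if (typeof candidate.name !== "string") return null;
  const args = candidate.arguments;
  if (typeof args !== "object" || args === null || Array.isArray(args)) return null;
  if (!isJsonValue(args)) return null;
  return { name: candidate.name, arguments: args as { [key: string]: JsonValue } };
}
```

With a guard like this, an argument object containing `undefined` or a function is rejected up front instead of reaching the model API as an untyped value.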

Default search-agent evals to the same 0.6 judge threshold used by the other migrated eval helpers. Remove the now-redundant threshold override from the issue-events agent suite while keeping its timeout override. Co-Authored-By: GPT-5 Codex <noreply@openai.com>

Build the tool-prediction prompt from generated stable tool definitions instead of the experimental mock stdio surface. Update the stale tag eval to target the current get_issue_tag_values tool and cover stable prompt names in unit tests. Co-Authored-By: GPT-5 Codex <noreply@openai.com>

Preserve legacy prediction-suite calibration by including expected tool calls in the prediction prompt while still normalizing the harness output. Normalize full MCP tool-call runs through fallback sessions when AI SDK steps do not expose model metadata, and relax embedded-agent checks around incidental tool argument shapes and valid output variants. Co-Authored-By: GPT-5 Codex <noreply@openai.com>

Keep fuzzy argument comparison enabled for search-agent eval tool calls so expected dataset inputs still catch regressions. Remove empty argument expectations from no-input helper tools, which keeps those cases focused on whether the resolver tool was called. Co-Authored-By: GPT-5 Codex <noreply@openai.com>

Preserve the legacy prediction suite contract by using the model-provided score while recording deterministic tool-call comparison details in metadata. Harden full MCP eval prompts around Sentry organization phrasing, allow missing assistant text in tool-call runs, and relax brittle search-agent expectations observed in CI. Co-Authored-By: GPT-5 Codex <noreply@openai.com>

sentry-warden

Eval workflow never triggers on changes inside packages/mcp-core/src/tools/ subdirectories

The path filter packages/mcp-core/src/tools* uses a bare * which GitHub Actions does not allow to match /, so edits to any file under packages/mcp-core/src/tools/ (e.g. tools/support/search-events/agent.ts) will silently skip the eval run; change the filter to packages/mcp-core/src/tools/**.

Evidence

GitHub Actions path filters use glob patterns in which * matches any character except /, so packages/mcp-core/src/tools* matches only paths whose final segment begins with tools directly in src/ (e.g. a hypothetical src/toolsHelpers.ts), not src/tools/....
packages/mcp-core/src/tools/ contains support/, search-events/, search-issues/, etc.; all agent files live one or more levels below tools/.
This PR itself changes packages/mcp-core/src/tools/support/search-events/agent.ts, which would not match the filter and would not trigger the eval workflow.
No other workflow in .github/workflows/ has a path filter covering packages/mcp-core/src/tools/**, so these changes go completely unchecked by evals on push/PR.
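
The difference between the two filters can be illustrated with a small matcher. This is a simplified sketch that only models `*` and `**` as described in the evidence above. It is not GitHub's actual filter implementation, and `globToRegExp` is a hypothetical helper.

```typescript
// Simplified glob translation: `*` stops at `/`, `**` may cross it.
// Other glob features (`?`, `!`, character classes) are deliberately ignored.
function globToRegExp(pattern: string): RegExp {
  let source = "";
  for (let i = 0; i < pattern.length; i++) {
    const char = pattern.charAt(i);
    if (char === "*") {
      if (pattern.charAt(i + 1) === "*") {
        source += ".*"; // `**` matches any characters, including `/`
        i++;
      } else {
        source += "[^/]*"; // a single `*` never matches `/`
      }
    } else {
      // Escape regex metacharacters so they match literally.
      source += char.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    }
  }
  return new RegExp(`^${source}$`);
}

const changedFile = "packages/mcp-core/src/tools/support/search-events/agent.ts";
const bareStar = globToRegExp("packages/mcp-core/src/tools*");
const doubleStar = globToRegExp("packages/mcp-core/src/tools/**");

console.log(bareStar.test(changedFile)); // the original filter
console.log(doubleStar.test(changedFile)); // the suggested filter
```

Under these assumptions the bare-`*` pattern rejects the changed agent file while the `/**` pattern accepts it, which is the behavior the review comment describes.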

Identified by Warden code-review

Record step-level tool calls before falling back to top-level calls so full MCP evals judge the complete agent trace. Fix eval workflow path filters for tool subdirectories, preserve deterministic matches when legacy prediction models underrate expected calls, and relax brittle search-agent output variants seen in CI. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
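
The "steps first, then top-level" ordering can be sketched as below. The result shape is an assumption loosely modeled on a multi-step agent result; `ToolCallRecord`, `RunResultLike`, and `collectToolCalls` are hypothetical names, not the harness's real types.

```typescript
// Hypothetical shapes: a run may expose per-step tool calls, a flat
// top-level list, or both, depending on what the SDK reports.
interface ToolCallRecord {
  toolName: string;
  input: unknown;
}

interface StepLike {
  toolCalls?: ToolCallRecord[];
}

interface RunResultLike {
  steps?: StepLike[];
  toolCalls?: ToolCallRecord[];
}

// Prefer the step-level trace so every call in a multi-step run is judged.
// Fall back to the top-level list only when no step recorded a call.
function collectToolCalls(result: RunResultLike): ToolCallRecord[] {
  const fromSteps: ToolCallRecord[] = [];
  for (const step of result.steps ?? []) {
    for (const call of step.toolCalls ?? []) {
      fromSteps.push(call);
    }
  }
  if (fromSteps.length > 0) return fromSteps;
  return result.toolCalls ?? [];
}
```

If the top-level list only reflects the last step, reading it first would hide earlier calls such as a discovery step, which is why this sketch consults the steps before the fallback.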

Route full MCP evals through an inner AI SDK harness with the MCP tools installed so tool calls are intercepted at execution time. Drop raw SDK steps from that inner result so the normalized session is built from runtime-captured calls instead of the CI-visible last-step shape. Co-Authored-By: GPT-5 Codex <noreply@openai.com>

Wrap MCP dynamic tools directly so migrated harness runs preserve the full tool sequence and usage counts in normalized sessions. Keep the slow search-events agent eval timeout scoped to that suite. Co-Authored-By: GPT-5 Codex <noreply@openai.com>

Make the full MCP eval harness choose search_tools before execute_tool so catalog-discovery suites exercise the intended contract reliably. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
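
One way to check the intended contract on a recorded trace is sketched below. It only verifies ordering after the fact and does not show how the harness forces the first tool choice; `discoveryPrecedesExecution` is a hypothetical helper.

```typescript
// Returns true when every execute_tool call is preceded by at least one
// search_tools call. A trace with no execute_tool call passes vacuously.
function discoveryPrecedesExecution(toolNames: string[]): boolean {
  let sawSearch = false;
  for (const name of toolNames) {
    if (name === "search_tools") {
      sawSearch = true;
    } else if (name === "execute_tool" && !sawSearch) {
      return false;
    }
  }
  return true;
}
```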

Make ToolPredictionJudge use the ToolCallJudge match score for pass/fail so inflated model self-scores cannot mask wrong tool predictions. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
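
The idea of trusting a deterministic match over the model's self-reported score can be sketched as follows. This is an illustration under stated assumptions: matching is by tool name only, the 0.6 threshold is the default mentioned earlier in this PR, and `matchScore` and `judgePrediction` are hypothetical functions, not the real `ToolPredictionJudge` or `ToolCallJudge`.

```typescript
interface PredictedCall {
  name: string;
  arguments: Record<string, unknown>;
}

interface Judgement {
  score: number;
  passed: boolean;
  metadata: { modelScore: number };
}

// Fraction of expected calls found among the predictions (by name only here;
// a real judge would presumably compare arguments as well).
function matchScore(expected: PredictedCall[], predicted: PredictedCall[]): number {
  if (expected.length === 0) return predicted.length === 0 ? 1 : 0;
  const remaining = predicted.map((call) => call.name);
  let matched = 0;
  for (const call of expected) {
    const index = remaining.indexOf(call.name);
    if (index !== -1) {
      matched++;
      remaining.splice(index, 1); // each prediction can satisfy one expectation
    }
  }
  return matched / expected.length;
}

// Pass/fail comes from the deterministic score. The model's self-score is
// kept as metadata only, so an inflated value cannot flip the outcome.
function judgePrediction(
  expected: PredictedCall[],
  predicted: PredictedCall[],
  modelScore: number,
  threshold = 0.6,
): Judgement {
  const score = matchScore(expected, predicted);
  return { score, passed: score >= threshold, metadata: { modelScore } };
}
```

In this sketch a prediction that names only one of two expected tools scores 0.5 and fails, even if the model rated itself 1.0.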

Add a test:ci script for the eval package so root CI executes the migrated harness unit tests and publishes JUnit output. Co-Authored-By: GPT-5 Codex <noreply@openai.com>

Relax brittle eval expectations for equivalent Sentry query syntax and discovery searches while preserving required filters and tool execution checks. Co-Authored-By: GPT-5 Codex <noreply@openai.com>

Accept valid direct Sentry shorthand, nullable issue sort, and timestamp sort variants observed in migrated search-agent eval outputs. Co-Authored-By: GPT-5 Codex <noreply@openai.com>

Document breadcrumbs-by-resource-id usage for get_sentry_resource so prediction evals and assistants see the supported call shape. Keep the generated tool and skill definitions in sync with the core tool description. Co-Authored-By: GPT-5 Codex <noreply@openai.com>

Teach the search events agent that gen_ai.request.temperature is a numeric span field and add a concrete high-temperature LLM call example. This keeps the eval strict while nudging the embedded agent toward Sentry numeric comparison syntax instead of wildcard approximations. Co-Authored-By: GPT-5 Codex <noreply@openai.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 417ecf4.

Remove expectedTools from the tool prediction model prompt so the suite predicts from the user task and catalog alone. Keep expectedTools only in deterministic judge metadata, and add a regression test for the prompt contract. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
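
A sketch of the prompt contract described here, assuming a plain string prompt. `PredictionCase`, `CatalogTool`, and `buildPredictionPrompt` are hypothetical names, and the prompt wording is invented for illustration.

```typescript
interface PredictionCase {
  input: string;
  // Used only by the deterministic judge, never by the prompt builder.
  expectedTools: { name: string; arguments: Record<string, unknown> }[];
}

interface CatalogTool {
  name: string;
  description: string;
}

// Builds the prompt from the user task and the tool catalog alone.
// expectedTools is intentionally never read in this function.
function buildPredictionPrompt(testCase: PredictionCase, catalog: CatalogTool[]): string {
  const toolLines = catalog.map((tool) => `- ${tool.name}: ${tool.description}`);
  return [
    "Available tools:",
    ...toolLines,
    "",
    `User task: ${testCase.input}`,
    "",
    "List the tool calls you would make, with arguments.",
  ].join("\n");
}
```

A regression test for this contract can assert that a value appearing only in `expectedTools` never shows up in the generated prompt.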

sentry-warden · 2026-06-05T13:54:04Z

@@ -129,7 +129,7 @@
      },
      {
        "name": "get_sentry_resource",


get_sentry_resource breadcrumbs description claims event ID support the implementation does not provide

The tool surface advertises - breadcrumbs: issue shortId or event ID (and the resourceId param description repeats "issue shortId or event ID for breadcrumbs"), but the breadcrumbs path only accepts an issue identifier. resolveResourceParams sets issueId: resourceId.toUpperCase() and the handler calls fetchAndFormatBreadcrumbs(...) -> getLatestEventForIssue({ issueId }), which hits /organizations/<org>/issues/<issueId>/events/latest/. An event ID placed in the <issueId> path position will not resolve, and even when a specific event URL is supplied the handler always fetches the issue's latest event. The description therefore overstates supported inputs, which can mislead the model into passing event IDs that fail. This is a tool-definition/metadata accuracy issue, not a functional security defect.

Evidence

resolveResourceParams breadcrumbs case (get-sentry-resource.ts:164-169) sets issueId: resourceId.toUpperCase(); there is no event-ID branch or issue lookup from an event ID.

Handler routes to fetchAndFormatBreadcrumbs(apiService, organizationSlug, resolved.issueId!) (get-sentry-resource.ts ~692).

fetchAndFormatBreadcrumbs calls getLatestEventForIssue({ organizationSlug, issueId }) (breadcrumbs.ts:11) which delegates to getEventForIssue with eventId: "latest", path /organizations/${org}/issues/${issueId}/events/latest/ (client.ts:2124,2205-2210) — issueId must be an issue shortId/numeric ID.

Description text lives at get-sentry-resource.ts:540 and :587 and is mirrored into the generated wire surface (toolDefinitions.json:931,957; skillDefinitions.json) which all repeat "issue shortId or event ID" for breadcrumbs.
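
The mismatch can be illustrated with a simplified stand-in for the resolution logic described in the evidence. `breadcrumbsEventPath` is a hypothetical function that condenses the cited steps (uppercasing the resource ID, then building the latest-event path). It is not the repository's code, and the IDs below are made up.

```typescript
// Condensed illustration: whatever resourceId is passed for breadcrumbs
// ends up, uppercased, in the issue slot of the latest-event URL.
function breadcrumbsEventPath(organizationSlug: string, resourceId: string): string {
  const issueId = resourceId.toUpperCase();
  return `/organizations/${organizationSlug}/issues/${issueId}/events/latest/`;
}

// An issue shortId lands where the API expects an issue identifier.
console.log(breadcrumbsEventPath("my-org", "project-123"));

// An event ID is placed in the same issue slot, so the lookup would target
// a non-existent issue rather than the event the caller meant.
console.log(breadcrumbsEventPath("my-org", "a1b2c3d4e5f60718293a4b5c6d7e8f90"));
```

Because there is no branch that treats the value as an event ID, narrowing the description to issue identifiers (or adding an event lookup) would bring the advertised inputs in line with the behavior.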

Identified by Warden mcp-audit · 5H6-MUT

dcramer had a problem deploying to Actions June 5, 2026 04:13 — with GitHub Actions Failure

dcramer had a problem deploying to Actions June 5, 2026 04:26 — with GitHub Actions Failure

cursor Bot reviewed Jun 5, 2026

Comment thread packages/mcp-server-evals/src/evals/utils/describe.ts Outdated

dcramer had a problem deploying to Actions June 5, 2026 04:31 — with GitHub Actions Failure

cursor Bot reviewed Jun 5, 2026

Comment thread packages/mcp-server-evals/src/evals/utils/mcpClient.ts Outdated

dcramer had a problem deploying to Actions June 5, 2026 06:16 — with GitHub Actions Failure

dcramer had a problem deploying to Actions June 5, 2026 06:41 — with GitHub Actions Failure

cursor Bot reviewed Jun 5, 2026

Comment thread packages/mcp-server-evals/src/evals/utils/describe.ts

dcramer had a problem deploying to Actions June 5, 2026 06:46 — with GitHub Actions Failure

dcramer had a problem deploying to Actions June 5, 2026 09:09 — with GitHub Actions Failure

sentry-warden Bot reviewed Jun 5, 2026

Comment thread packages/mcp-server-evals/src/evals/utils/mcpToolCallHarness.ts Outdated

dcramer had a problem deploying to Actions June 5, 2026 09:31 — with GitHub Actions Failure

dcramer had a problem deploying to Actions June 5, 2026 09:56 — with GitHub Actions Failure

dcramer had a problem deploying to Actions June 5, 2026 10:30 — with GitHub Actions Failure

cursor Bot reviewed Jun 5, 2026

Comment thread packages/mcp-server-evals/src/evals/utils/toolPredictionHarness.ts Outdated

fix(evals): Force MCP discovery step

3971db3

dcramer had a problem deploying to Actions June 5, 2026 10:47 — with GitHub Actions Failure

fix(evals): Trust deterministic prediction scoring

c929258

dcramer had a problem deploying to Actions June 5, 2026 10:48 — with GitHub Actions Failure

cursor Bot reviewed Jun 5, 2026

Comment thread packages/mcp-server-evals/package.json

test(evals): Run harness unit tests in CI

16ea37c

dcramer had a problem deploying to Actions June 5, 2026 10:54 — with GitHub Actions Failure

fix(evals): Accept valid search query variants

b96e9ee

dcramer had a problem deploying to Actions June 5, 2026 11:10 — with GitHub Actions Failure

fix(evals): Stabilize search agent alternates

491f2c5

dcramer had a problem deploying to Actions June 5, 2026 11:23 — with GitHub Actions Failure

dcramer had a problem deploying to Actions June 5, 2026 11:39 — with GitHub Actions Failure

dcramer had a problem deploying to Actions June 5, 2026 12:53 — with GitHub Actions Failure

cursor Bot reviewed Jun 5, 2026

Comment thread packages/mcp-server-evals/src/evals/utils/toolPredictionHarness.ts Outdated

dcramer had a problem deploying to Actions June 5, 2026 13:40 — with GitHub Actions Failure

sentry-warden Bot reviewed Jun 5, 2026

test(evals): Migrate evals to vitest harnesses #1049
dcramer wants to merge 18 commits into main from codex/vitest-evals-harness-migration
