test(evals): Migrate evals to vitest harnesses#1049
Conversation
Upgrade vitest-evals and replace legacy scorer utilities with shared harnesses for tool prediction, MCP tool-call, and embedded-agent suites. Keep usage data and traces on the harness path while reducing per-spec boilerplate. Wire CI to publish the GitHub eval check and add local report UI shortcuts. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
Use an explicit JSON-valued schema for predicted tool arguments so OpenAI structured output receives typed additionalProperties. Normalize embedded-agent runs through fallback sessions when AI SDK steps do not expose model metadata, preserving captured tool calls and usage without crashing the eval harness. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
Default search-agent evals to the same 0.6 judge threshold used by the other migrated eval helpers. Remove the now-redundant threshold override from the issue-events agent suite while keeping its timeout override. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
Build the tool-prediction prompt from generated stable tool definitions instead of the experimental mock stdio surface. Update the stale tag eval to target the current get_issue_tag_values tool and cover stable prompt names in unit tests. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
Preserve legacy prediction-suite calibration by including expected tool calls in the prediction prompt while still normalizing the harness output. Normalize full MCP tool-call runs through fallback sessions when AI SDK steps do not expose model metadata, and relax embedded-agent checks around incidental tool argument shapes and valid output variants. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
Keep fuzzy argument comparison enabled for search-agent eval tool calls so expected dataset inputs still catch regressions. Remove empty argument expectations from no-input helper tools, which keeps those cases focused on whether the resolver tool was called. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
Preserve the legacy prediction suite contract by using the model-provided score while recording deterministic tool-call comparison details in metadata. Harden full MCP eval prompts around Sentry organization phrasing, allow missing assistant text in tool-call runs, and relax brittle search-agent expectations observed in CI. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
There was a problem hiding this comment.
Eval workflow never triggers on changes inside packages/mcp-core/src/tools/ subdirectories
The path filter packages/mcp-core/src/tools* uses a bare * which GitHub Actions does not allow to match /, so edits to any file under packages/mcp-core/src/tools/ (e.g. tools/support/search-events/agent.ts) will silently skip the eval run; change the filter to packages/mcp-core/src/tools/**.
Evidence
- GitHub Actions path filters use minimatch semantics:
*matches any char except/, sopackages/mcp-core/src/tools*matches only paths whose name begins withtoolsdirectly insrc/(e.g. a hypotheticalsrc/toolsHelpers.ts), notsrc/tools/.... packages/mcp-core/src/tools/containssupport/,search-events/,search-issues/, etc.; all agent files live one or more levels belowtools/.- This PR itself changes
packages/mcp-core/src/tools/support/search-events/agent.ts, which would not match the filter and would not trigger the eval workflow. - No other workflow in
.github/workflows/has a path filter coveringpackages/mcp-core/src/tools/**, so these changes go completely unchecked by evals on push/PR.
Identified by Warden code-review
Record step-level tool calls before falling back to top-level calls so full MCP evals judge the complete agent trace. Fix eval workflow path filters for tool subdirectories, preserve deterministic matches when legacy prediction models underrate expected calls, and relax brittle search-agent output variants seen in CI. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
Route full MCP evals through an inner AI SDK harness with the MCP tools installed so tool calls are intercepted at execution time. Drop raw SDK steps from that inner result so the normalized session is built from runtime-captured calls instead of the CI-visible last-step shape. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
Wrap MCP dynamic tools directly so migrated harness runs preserve the full tool sequence and usage counts in normalized sessions. Keep the slow search-events agent eval timeout scoped to that suite. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
Make the full MCP eval harness choose search_tools before execute_tool so catalog-discovery suites exercise the intended contract reliably. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
Make ToolPredictionJudge use the ToolCallJudge match score for pass/fail so inflated model self-scores cannot mask wrong tool predictions. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
Add a test:ci script for the eval package so root CI executes the migrated harness unit tests and publishes JUnit output. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
Relax brittle eval expectations for equivalent Sentry query syntax and discovery searches while preserving required filters and tool execution checks. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
Accept valid direct Sentry shorthand, nullable issue sort, and timestamp sort variants observed in migrated search-agent eval outputs. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
Document breadcrumbs-by-resource-id usage for get_sentry_resource so prediction evals and assistants see the supported call shape. Keep the generated tool and skill definitions in sync with the core tool description. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
Teach the search events agent that gen_ai.request.temperature is a numeric span field and add a concrete high-temperature LLM call example. This keeps the eval strict while nudging the embedded agent toward Sentry numeric comparison syntax instead of wildcard approximations. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 417ecf4. Configure here.
Remove expectedTools from the tool prediction model prompt so the suite predicts from the user task and catalog alone. Keep expectedTools only in deterministic judge metadata, and add a regression test for the prompt contract. Co-Authored-By: GPT-5 Codex <noreply@openai.com>
| @@ -129,7 +129,7 @@ | |||
| }, | |||
| { | |||
| "name": "get_sentry_resource", | |||
There was a problem hiding this comment.
get_sentry_resource breadcrumbs description claims event ID support the implementation does not provide
The tool surface advertises - breadcrumbs: issue shortId or event ID (and the resourceId param description repeats "issue shortId or event ID for breadcrumbs"), but the breadcrumbs path only accepts an issue identifier. resolveResourceParams sets issueId: resourceId.toUpperCase() and the handler calls fetchAndFormatBreadcrumbs(...) -> getLatestEventForIssue({ issueId }), which hits /organizations/<org>/issues/<issueId>/events/latest/. An event ID placed in the <issueId> path position will not resolve, and even when a specific event URL is supplied the handler always fetches the issue's latest event. The description therefore overstates supported inputs, which can mislead the model into passing event IDs that fail. This is a tool-definition/metadata accuracy issue, not a functional security defect.
Evidence
resolveResourceParamsbreadcrumbs case (get-sentry-resource.ts:164-169) setsissueId: resourceId.toUpperCase(); there is no event-ID branch or issue lookup from an event ID.- Handler routes to
fetchAndFormatBreadcrumbs(apiService, organizationSlug, resolved.issueId!)(get-sentry-resource.ts ~692). fetchAndFormatBreadcrumbscallsgetLatestEventForIssue({ organizationSlug, issueId })(breadcrumbs.ts:11) which delegates togetEventForIssuewitheventId: "latest", path/organizations/${org}/issues/${issueId}/events/latest/(client.ts:2124,2205-2210) —issueIdmust be an issue shortId/numeric ID.- Description text lives at get-sentry-resource.ts:540 and :587 and is mirrored into the generated wire surface (toolDefinitions.json:931,957; skillDefinitions.json) which all repeat "issue shortId or event ID" for breadcrumbs.
Identified by Warden mcp-audit · 5H6-MUT

Move the eval suite onto the current vitest-evals harness API and remove the legacy scorer/task-runner utilities. The eval specs now use shared helpers for prediction, full MCP tool-call runs, and embedded search-agent runs, which keeps traces, usage data, and report metadata centralized.
GitHub Eval Reporting
The eval workflow now writes JSON/JUnit output and publishes the
Evaluation Resultscheck withgetsentry/vitest-evals@v0. Root shortcuts also serve the last JSON result with the local report UI.Harness Documentation
Refresh the eval README, testing guide, adding-tools guide, and PR guidance so new evals stay harness-first and avoid reintroducing legacy boilerplate.