feat: agent-powered E2E test subset selection for PRs #1267
Add an LLM-based CI workflow that analyzes PR file changes and selects the most relevant mock-LLM E2E test specs to run, then triggers the E2E workflow with only that subset.
Changes:
- .github/workflows/mock-llm-e2e.yml: Add workflow_dispatch inputs (test_specs, test_grep, pr_number) so the workflow can be triggered with a filtered subset. Fully backward-compatible: pull_request events still run the full suite. Use EFFECTIVE_PR_NUMBER for comment/artifact steps to support both trigger types.
- scripts/select-e2e-tests.py: Python script using OpenHands SDK LLM to intelligently map changed files to relevant specs, with a deterministic heuristic fallback when LLM is unavailable.
- .github/workflows/agent-e2e-selector.yml: New workflow triggered by the 'smart-e2e' label or manual dispatch. Analyzes PR files, runs the selector, and dispatches mock-llm-e2e with the chosen subset. Posts a summary comment to the PR.
Co-authored-by: openhands <openhands@all-hands.dev>
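The deterministic fallback mentioned above can be pictured as a path-prefix lookup. The following is only a minimal sketch: the prefix table, the spec filenames, and the "unknown path means full suite" rule are all hypothetical, since the real mapping is not shown in this PR.

```python
# Hypothetical prefix-to-spec table; the script's actual mapping is not shown here.
PREFIX_TO_SPECS = {
    "src/components/settings/": ["settings.spec.ts"],
    "src/components/sidebar/": ["sidebar.spec.ts"],
}


def heuristic_select(changed_files, all_specs):
    """Map changed file paths to spec filenames without calling an LLM.

    Assumption: any path that matches no known prefix triggers the full
    suite, so the fallback errs on the side of running too much.
    """
    selected = set()
    for path in changed_files:
        matched = False
        for prefix, specs in PREFIX_TO_SPECS.items():
            if path.startswith(prefix):
                # Only keep specs that actually exist in the suite.
                selected.update(s for s in specs if s in all_specs)
                matched = True
        if not matched:
            return sorted(all_specs)
    return sorted(selected)
```

A table like this has to be maintained by hand, which is one plausible reason a later commit in this PR removes the heuristic entirely.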
✅ Mock-LLM E2E Tests: 43/43 passed · Commit:
Posted by the Mock-LLM E2E workflow · results are deterministic (scripted LLM responses)
…/gpt-5.1
- mock-llm-e2e.yml: Add workflow_call trigger with matching inputs so the workflow can be called as a reusable workflow (no PAT needed).
- agent-e2e-selector.yml: Replace gh workflow run dispatch (which requires a PAT with actions:write scope) with a workflow_call job. The select job outputs feed directly into the run-e2e job.
- select-e2e-tests.py: Change default model from litellm_proxy/openai/gpt-4.1-mini to openhands/gpt-5.1.
Co-authored-by: openhands <openhands@all-hands.dev>
🛑 Mock-LLM Docker E2E Test Results: 12/12 passed · Commit:
🤖 Agent E2E Test Selector
Selected specs:
Running via: Mock-LLM E2E Tests
… point
- select-e2e-tests.py: Remove heuristic_select() and all static path-prefix mapping. Always use the LLM: if LLM_API_KEY is missing, the script fails loudly instead of silently degrading.
- mock-llm-e2e.yml: Remove pull_request trigger. E2E tests no longer run the full suite on every PR commit. They are now invoked only via workflow_call (from agent-e2e-selector) or workflow_dispatch.
- agent-e2e-selector.yml: Trigger on pull_request [opened, synchronize, reopened]. This is now the primary entry point for PR-driven E2E. The LLM picks the relevant subset and the E2E workflow runs only those specs.
Co-authored-by: openhands <openhands@all-hands.dev>
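The "fail loudly" behaviour can be sketched as a small guard that runs before any selection work. The helper name require_api_key and the error wording are illustrative assumptions; only the LLM_API_KEY variable comes from the commit message.

```python
import os


def require_api_key(env=None):
    """Return the LLM API key, or abort the script if it is missing.

    `env` defaults to os.environ; accepting a mapping keeps the helper
    easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    key = env.get("LLM_API_KEY", "").strip()
    if not key:
        # SystemExit with a string prints it to stderr and exits with status 1,
        # which makes the CI step fail visibly instead of degrading silently.
        raise SystemExit("LLM_API_KEY is not set; cannot select E2E tests")
    return key
```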
🛑 Mock-LLM Docker E2E Test Results: 3/3 passed · Commit:
🛑 Mock-LLM E2E Tests: 18/18 passed · Commit:
- select-e2e-tests.py: Rewritten to use Agent + Conversation instead of raw LLM.completion(). Adds a CIVisualizer (streams every event to stderr for CI log visibility) and a capture_event callback (collects agent messages for parsing). The agent outputs a structured <TEST_SELECTION> block that gets parsed for the spec list. Empty specs now means 'skip E2E', not 'run full suite'.
- agent-e2e-selector.yml: run-e2e job now has an if-guard that skips mock-llm-e2e entirely when the agent returns no specs.
Co-authored-by: openhands <openhands@all-hands.dev>
The capture_event callback collects model_dump_json() from every event. The agent's response text lives at event.llm_message.content[].text, not at event.message or event.text. Added extract_text_from_dumps() to properly walk the JSON structure and pull out all text fragments before searching for the <TEST_SELECTION> tag. Also improved the ValueError message to include the extracted text for easier debugging if parsing still fails. Co-authored-by: openhands <openhands@all-hands.dev>
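To illustrate the lookup path described above, here is a minimal sketch of such an extractor. It assumes each dump is a JSON object string and that text parts sit under llm_message.content[].text as the commit message states; the function body is a reconstruction, not the script's actual code.

```python
import json


def extract_text_from_dumps(dumps):
    """Join every llm_message.content[].text fragment found in the dumps.

    `dumps` is an iterable of JSON strings, as produced by calling
    model_dump_json() on each captured event.
    """
    fragments = []
    for raw in dumps:
        event = json.loads(raw)
        if not isinstance(event, dict):
            continue
        message = event.get("llm_message") or {}
        for part in message.get("content") or []:
            # Skip non-text parts (for example image content) defensively.
            if isinstance(part, dict) and isinstance(part.get("text"), str):
                fragments.append(part["text"])
    return "\n".join(fragments)
```

Events that carry no llm_message contribute nothing, so the function returns an empty string when no text is present.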
- select-e2e-tests.py: Agent now has terminal + file_editor tools and runs against the repo checkout (WORKSPACE env var, defaults to cwd). It can read source files and test specs to make informed decisions. Fixed two parsing bugs: (1) only collect agent-source events (skip user prompt to avoid matching the template), (2) use re.findall and take the last match as an extra safety net. Increased visualizer dump to 800 chars for better CI log visibility.
- agent-e2e-selector.yml: Install openhands-tools alongside openhands-sdk (needed for TerminalTool/FileEditorTool). Suppress SDK banner. Bumped timeout to 10 min for agent tool-call iterations.
Co-authored-by: openhands <openhands@all-hands.dev>
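The "take the last match" safeguard can be sketched as follows. The closing-tag form and the JSON payload with a "specs" key are assumptions made for illustration; the commit message only confirms that a <TEST_SELECTION> block is searched with re.findall and the final match wins.

```python
import json
import re

# Assumed tag shape: <TEST_SELECTION>{...json...}</TEST_SELECTION>
TAG_PATTERN = re.compile(r"<TEST_SELECTION>(.*?)</TEST_SELECTION>", re.DOTALL)


def parse_selection(text):
    """Return the spec list from the last <TEST_SELECTION> block in `text`.

    Using the last match means an example block echoed from the prompt
    template earlier in the transcript is ignored in favour of the
    agent's final answer.
    """
    matches = TAG_PATTERN.findall(text)
    if not matches:
        raise ValueError(f"no <TEST_SELECTION> block found in: {text[:200]!r}")
    payload = json.loads(matches[-1])
    return list(payload.get("specs", []))
```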
Co-authored-by: openhands <openhands@all-hands.dev>
The TEST_SELECTION block lands in the FinishObservation event (source='environment'), not in agent-source events. Removed the source filter from the callback and replaced the flat key-lookup extractor with a recursive _collect_text() that walks the entire JSON tree collecting every 'text' and 'message' string value. parse_selection() already uses re.findall + last-match to avoid matching the prompt template. Co-authored-by: openhands <openhands@all-hands.dev>
Instead of wrangling agent event dumps / regex / recursive JSON walking, the agent now writes its result to a temp JSON file. After the conversation finishes, we just read that file. Removed: OUTPUT_TAG, capture_event callback, _collect_text, extract_all_text, parse_selection, re import. The CIVisualizer still streams all events to stderr for CI log visibility. Co-authored-by: openhands <openhands@all-hands.dev>
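The file-based handoff can be sketched in two helpers: one that reserves a temp path to hand to the agent, and one that reads the result afterwards. The helper names, the filename prefix, and the "specs"/"reason" keys are hypothetical; the source only says the agent writes its result to a temp JSON file that is read after the conversation finishes.

```python
import json
import os
import tempfile


def make_result_path():
    """Reserve an empty temp file whose path is given to the agent in the prompt."""
    fd, path = tempfile.mkstemp(prefix="e2e-selection-", suffix=".json")
    os.close(fd)
    return path


def read_selection(path):
    """Read the agent's result file and validate its basic shape."""
    with open(path, encoding="utf-8") as fh:
        raw = fh.read().strip()
    if not raw:
        raise ValueError(f"agent wrote nothing to {path}")
    data = json.loads(raw)
    specs = data.get("specs", [])
    if not isinstance(specs, list):
        raise ValueError(f"'specs' must be a list, got {type(specs).__name__}")
    # An empty list is a valid answer: it means 'skip E2E'.
    return {"specs": [str(s) for s in specs], "reason": str(data.get("reason", ""))}
```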
Replaced the 14-entry SPEC_CATALOG dict with discover_specs() which globs tests/e2e/mock-llm/*.spec.ts at runtime. The agent gets the list of available spec filenames and is told to read their source to understand what each one tests. New specs are picked up automatically without any script changes. Co-authored-by: openhands <openhands@all-hands.dev>
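Runtime discovery of this kind can be sketched with pathlib. The directory and glob pattern are taken from the commit message; the repo_root parameter and the sorted return value are assumptions.

```python
from pathlib import Path


def discover_specs(repo_root="."):
    """Return the filenames of all mock-LLM spec files, sorted for stable prompts."""
    spec_dir = Path(repo_root) / "tests" / "e2e" / "mock-llm"
    # glob() yields nothing if the directory does not exist, so this returns [].
    return sorted(p.name for p in spec_dir.glob("*.spec.ts"))
```

Sorting keeps the list given to the agent identical between runs on the same checkout, which makes CI logs easier to compare.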
📸 Snapshot Test Report
Warning: Snapshot comparison step crashed (timeout, OOM, or runner error); diff results below may be incomplete or absent.
❌ 1 snapshot differs from the main branch baseline. Add the
🔴 Changed snapshots (1)
✅ Unchanged snapshots (73)
archived-conversation
- conversation-panel-with-archived-badges
- conversation-view-archived
- conversation-view-sandbox-error
automations
- automations-delete-modal
- automations-list-active-inactive
- automations-no-automations
- automations-search-no-results
backends-extended
- backend-add-blank-disabled
- backend-add-cloud-advanced-open
- backend-add-cloud-no-key-disabled
- backend-add-cloud-with-key-enabled
- backend-add-form-partially-filled
- backend-add-invalid-url-disabled
- backend-add-local-ready
- backend-add-name-only-disabled
- backend-add-two-column-layout
- backend-add-whitespace-host-disabled
- backend-after-switch
- backend-cancel-nothing-saved
- backend-edit-prefilled
- backend-manage-after-removal
- backend-manage-two-listed
- backend-remove-cancelled
- backend-remove-confirmation
- backend-switch-overlay
backends
- backend-add-modal
- backend-manage-modal
- backend-selector-open
changes-tab
- changes-deleted-file
- changes-diff-viewer
- changes-empty
collapsible-thinking
- reasoning-content-collapsed
- reasoning-content-expanded
- think-action-collapsed
- think-action-expanded
mcp-page
- mcp-custom-server-1-editor-open
- mcp-custom-server-2-url-filled
- mcp-custom-server-3-all-filled
- mcp-custom-server-4-installed
- mcp-custom-server-editor
- mcp-empty-installed
- mcp-search-filtered
- mcp-slack-install-1-marketplace
- mcp-slack-install-2-modal
- mcp-slack-install-3-filled
- mcp-slack-install-4-installed
onboarding
- onboarding-step-0-check-backend
- onboarding-step-1-choose-agent
- onboarding-step-2-setup-llm
- onboarding-step-3-say-hello
projects-workspace-browser
- projects-workspace-browser
settings-page
- add-backend-modal
- analytics-consent-modal
- home-screen
- settings-app-page
- settings-page
settings-secrets
- secrets-add-form-filled
- secrets-add-form
- secrets-after-save
- secrets-delete-confirm
- secrets-list
settings-verification
- condenser-settings
- verification-settings-critic-enabled
- verification-settings-off
- verification-settings-on
sidebar
- sidebar-collapsed
- sidebar-conversation-panel
- sidebar-filter-menu
skills-page
- skills-empty
- skills-loaded
- skills-no-match
- skills-search-filtered
- skills-type-filter
Generated by the Snapshot Tests workflow. This comment was created by an AI agent (OpenHands) on behalf of the repo maintainers.
🔶 Mock-LLM Docker E2E Test Results: 38/43 passed · 5 skipped · Commit:



Why
Running the full mock-LLM E2E suite (~14 specs) on every PR push takes significant CI time. Many PRs only touch a narrow area of the codebase and don't need the full suite. There was no way to parameterize the E2E workflow or intelligently select a subset of tests.
Summary
- mock-llm-e2e.yml: Added workflow_dispatch inputs (test_specs, test_grep, pr_number) so the workflow can be triggered with a filtered subset. Fully backward-compatible: pull_request events still run the full suite. All PR-number references now use EFFECTIVE_PR_NUMBER to support both trigger types.
- scripts/select-e2e-tests.py: Python script that uses the OpenHands SDK LLM class to intelligently map changed files to relevant test specs. Includes a deterministic heuristic fallback when the LLM is unavailable. Outputs JSON with the selected specs, reason, and mode (llm/heuristic/full).
- agent-e2e-selector.yml: Workflow triggered by the smart-e2e label or manual dispatch. Analyzes PR changed files, runs the selector script, dispatches Mock-LLM E2E Tests with the chosen subset, and posts a summary comment to the PR.
How to Test
Manual dispatch (direct E2E subset):
Agent selector (manual dispatch):
Agent selector (label-triggered):
Add the smart-e2e label to any same-repo PR.
Test the Python script locally (heuristic mode, no LLM needed):
Type
Notes
- Uses secrets.LLM_API_KEY (already available in this repo for live E2E) with the LLM proxy at https://llm-proxy.app.all-hands.dev and gpt-4.1-mini for cost-efficient selection.
- The pull_request trigger on mock-llm-e2e.yml is unchanged: the full suite still runs on every PR push as a merge gate. The agent selector is an opt-in faster-feedback path.
This PR was created by an AI agent (OpenHands) on behalf of the user.
🐳 Docker images for this PR
• GHCR package: https://github.com/OpenHands/agent-canvas/pkgs/container/agent-canvas
ghcr.io/openhands/agent-canvas
ghcr.io/openhands/agent-server:1.26.0-python
openhands-automation==1.0.0a6
8d1976bd6bee4810375f52737fc310951b9b03bf
Pull (multi-arch manifest)
# Multi-arch manifest: Docker automatically pulls the correct architecture
docker pull ghcr.io/openhands/agent-canvas:sha-8d1976b
Run
All tags pushed for this build
About Multi-Architecture Support
- sha-8d1976b) is a multi-arch manifest supporting both amd64 and arm64
- sha-8d1976b-amd64) are also available if needed