Summary
Define the experimentation strategy and phased rollout plan for the Helix Reporter Job (HRJ). The goal is to validate the design in a controlled environment, measure its impact on pipeline reliability and cost, and progressively roll it out to all .NET pipelines that use Helix.
Motivation
This change fundamentally alters how test results flow through the CI/CD system for 100+ .NET repositories. A careful, phased rollout is essential to avoid disrupting the ecosystem. We need to:
- Validate correctness (all test results still appear, pass/fail semantics are preserved).
- Measure cost savings (agent time freed up).
- Identify edge cases (retry behavior, timing issues, pool capacity).
- Build confidence before broad adoption.
Rollout Phases
Phase 1: Internal Prototype
Scope: A single low-risk internal pipeline (e.g., Arcade's own CI or a small dotnet repo).
Steps:
- Deploy the Helix endpoint (Issue 1) to a staging/test environment.
- Add the HRJ to one or more stages in the target pipeline with HelixReporterJobEnabled: true.
- Run both old (wait-for-results) and new (HRJ) paths in parallel:
  - Keep the existing wait-for-results behavior active.
  - Add the HRJ as an additional job in each stage that also uploads results.
- Compare results from both paths — they should match exactly.
- Validate retry scenarios manually.
- Verify that multiple HRJ instances (one per stage) can run concurrently without interference.
Phase 2: Dual-Mode on Key Pipelines
Scope: 2–3 high-volume pipelines (e.g., dotnet/runtime, dotnet/sdk) in dual-mode.
Steps:
- Enable dual-mode: existing agent-based upload + HRJ upload running side by side across all stages with Helix work.
- Add an automated comparison job that flags any discrepancies between the two upload paths.
- Monitor for 2–4 weeks.
- Test retry scenarios (manual and automated).
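The comparison step could look roughly like the sketch below. It is only an illustration, assuming each path's results have already been collected into a mapping from test name to outcome string; the names compare_results and ComparisonReport and the outcome values are hypothetical and not part of Helix or the Arcade SDK.

```python
from dataclasses import dataclass, field


@dataclass
class ComparisonReport:
    """Discrepancies between the agent-based upload and the HRJ upload."""
    missing_from_hrj: list = field(default_factory=list)
    missing_from_agent: list = field(default_factory=list)
    outcome_mismatches: list = field(default_factory=list)

    @property
    def matches(self) -> bool:
        # The two paths agree only if nothing was flagged.
        return not (
            self.missing_from_hrj
            or self.missing_from_agent
            or self.outcome_mismatches
        )


def compare_results(agent_results: dict, hrj_results: dict) -> ComparisonReport:
    """Compare two {test_name: outcome} mappings (hypothetical data shape)."""
    report = ComparisonReport()
    for name in sorted(agent_results.keys() | hrj_results.keys()):
        if name not in hrj_results:
            report.missing_from_hrj.append(name)
        elif name not in agent_results:
            report.missing_from_agent.append(name)
        elif agent_results[name] != hrj_results[name]:
            # Record (test, agent outcome, HRJ outcome) for triage.
            report.outcome_mismatches.append(
                (name, agent_results[name], hrj_results[name])
            )
    return report
```

A dual-mode run would fail the comparison job whenever report.matches is false and log the three lists, which covers both "all test results still appear" and "pass/fail semantics are preserved".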
Phase 3: Switch Over (Agent Upload Disabled)
Scope: Same 2–3 pipelines from Phase 2, now running HRJ-only.
Steps:
- Disable agent-based test result upload (but keep the code — just skip the upload step).
- HRJ is now the sole source of test results in each stage.
- Monitor for 2–4 weeks.
Phase 4: Broad Rollout
Scope: All .NET pipelines using Helix + Arcade SDK.
Steps:
- Enable HelixReporterJobEnabled as the default in the Arcade SDK templates (opt-out available).
- Communicate the change via the usual channels (dotnet/arcade announcements, engineering updates).
- Provide documentation on:
  - How the HRJ works (one per stage that submits Helix work).
  - How retries work (must re-run the HRJ or the stage).
  - How to opt out if issues arise.
- Monitor for issues across the ecosystem for 4+ weeks.
Phase 5: Cleanup
Transition to Issue 5 — remove the agent-based test result upload code path entirely.
Rollback Plan
At any phase, rollback is straightforward:
- Set HelixReporterJobEnabled: false (or remove the variable) to revert to the old wait-for-results behavior.
- The Helix SDK's default behavior (wait + upload from agents) remains unchanged until explicitly switched off.
- The HRJ YAML template can be removed from pipeline stages without affecting other jobs.
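To make the rollback semantics concrete, here is a minimal sketch of how the switch might be resolved. It assumes pipeline variables arrive as a dictionary of strings; the function hrj_enabled and its default parameter are hypothetical and do not describe how the Arcade SDK actually reads the variable.

```python
def hrj_enabled(variables: dict, default: bool = False) -> bool:
    """Decide whether the HRJ path is active for a pipeline run.

    `default` models the template default: False during Phases 1-3,
    True once Phase 4 makes the HRJ opt-out (an assumption of this sketch).
    """
    raw = variables.get("HelixReporterJobEnabled")
    if raw is None or str(raw).strip() == "":
        # Variable removed or empty: fall back to the template default.
        return default
    return str(raw).strip().lower() == "true"
```

In this sketch, removing the variable reverts to the old behavior only while the default is still false. Once the default flips in Phase 4, an explicit HelixReporterJobEnabled: false would be the way to opt out.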