Experimentation & Rollout #16731

@premun

Summary

Define the experimentation strategy and phased rollout plan for the Helix Reporter Job (HRJ). The goal is to validate the design in a controlled environment, measure impact on pipeline reliability and cost, and progressively roll out to all .NET pipelines that use Helix.

Motivation

This change fundamentally alters how test results flow through the CI/CD system for 100+ .NET repositories. A careful, phased rollout is essential to avoid disrupting the ecosystem. We need to:

  • Validate correctness (all test results still appear, pass/fail semantics are preserved).
  • Measure cost savings (agent time freed up).
  • Identify edge cases (retry behavior, timing issues, pool capacity).
  • Build confidence before broad adoption.
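
As a rough illustration of the cost-savings measurement, the sketch below computes net agent minutes freed for one pipeline run. It assumes, hypothetically, that the saving is the time build agents currently spend waiting on Helix, minus whatever agent time the HRJ itself consumes; the function name and the sample numbers are illustrative, not measured data.

```python
def agent_minutes_freed(wait_minutes_per_job, hrj_minutes_per_stage):
    """Net agent minutes freed in one pipeline run (hypothetical model).

    wait_minutes_per_job: minutes each agent job spends idle, waiting for
        Helix results, under the current wait-for-results behavior.
    hrj_minutes_per_stage: minutes each per-stage HRJ occupies an agent
        (assumption: the HRJ itself runs on an agent).
    """
    return sum(wait_minutes_per_job) - sum(hrj_minutes_per_stage)


# Made-up example: three jobs waiting 40, 55 and 30 minutes, two HRJs of 5 minutes each.
freed = agent_minutes_freed([40, 55, 30], [5, 5])
print(f"Net agent minutes freed: {freed}")
```

If the HRJ turns out not to occupy an agent while it runs, the second term drops out and the saving is simply the total wait time.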

Rollout Phases

Phase 1: Internal Prototype

Scope: A single low-risk internal pipeline (e.g., Arcade's own CI or a small dotnet repo).

Steps:

  1. Deploy the Helix endpoint (Issue 1) to a staging/test environment.
  2. Add the HRJ to one or more stages in the target pipeline with HelixReporterJobEnabled: true.
  3. Run both old (wait-for-results) and new (HRJ) paths in parallel:
    • Keep the existing wait-for-results behavior active.
    • Add the HRJ as an additional job in each stage that also uploads results.
    • Compare results from both paths — they should match exactly.
  4. Validate retry scenarios manually.
  5. Verify that multiple HRJ instances (one per stage) can run concurrently without interference.

Phase 2: Dual-Mode on Key Pipelines

Scope: 2–3 high-volume pipelines (e.g., dotnet/runtime, dotnet/sdk) in dual-mode.

Steps:

  1. Enable dual-mode: run the existing agent-based upload and the HRJ upload side by side in every stage that has Helix work.
  2. Add an automated comparison job that flags any discrepancies between the two upload paths.
  3. Monitor for 2–4 weeks.
  4. Test retry scenarios (manual and automated).
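
A minimal sketch of what the comparison step could look like, assuming both upload paths can be exported as flat lists of results keyed by work item and test name. The `ResultRecord` shape and the `compare_results` helper are hypothetical and do not reflect an existing Helix or Azure DevOps API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResultRecord:
    # Hypothetical minimal shape of one uploaded test result.
    work_item: str  # Helix work item the test ran in
    test_name: str  # fully qualified test name
    outcome: str    # e.g. "Passed", "Failed", "Skipped"


def compare_results(agent_results, hrj_results):
    """Return a list of discrepancies between the two upload paths.

    Results are keyed by (work_item, test_name). If a key appears more than
    once in one path (for example after a retry), the last record wins.
    """
    agent = {(r.work_item, r.test_name): r.outcome for r in agent_results}
    hrj = {(r.work_item, r.test_name): r.outcome for r in hrj_results}

    discrepancies = []
    for key in sorted(agent.keys() | hrj.keys()):
        if key not in hrj:
            discrepancies.append(f"missing from HRJ upload: {key}")
        elif key not in agent:
            discrepancies.append(f"missing from agent upload: {key}")
        elif agent[key] != hrj[key]:
            discrepancies.append(
                f"outcome mismatch for {key}: agent={agent[key]}, hrj={hrj[key]}"
            )
    return discrepancies


if __name__ == "__main__":
    agent_side = [ResultRecord("wi1", "Tests.A", "Passed")]
    hrj_side = [ResultRecord("wi1", "Tests.A", "Failed")]
    for line in compare_results(agent_side, hrj_side):
        print(line)
```

An empty list means the two paths agree exactly; a comparison job could fail the build or raise an alert whenever the list is non-empty.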

Phase 3: Switch Over (Agent Upload Disabled)

Scope: Same 2–3 pipelines from Phase 2, now running HRJ-only.

Steps:

  1. Disable agent-based test result upload: keep the code in place and only skip the upload step.
  2. HRJ is now the sole source of test results in each stage.
  3. Monitor for 2–4 weeks.

Phase 4: Broad Rollout

Scope: All .NET pipelines using Helix + Arcade SDK.

Steps:

  1. Enable HelixReporterJobEnabled as the default in the Arcade SDK templates (opt-out available).
  2. Communicate the change via the usual channels (dotnet/arcade announcements, engineering updates).
  3. Provide documentation on:
    • How the HRJ works (one per stage that submits Helix work).
    • How retries work (must re-run the HRJ or the stage).
    • How to opt out if issues arise.
  4. Monitor for issues across the ecosystem for 4+ weeks.

Phase 5: Cleanup

Transition to Issue 5 — remove the agent-based test result upload code path entirely.

Rollback Plan

At any phase, rollback is straightforward:

  • Set HelixReporterJobEnabled: false (or remove the variable) to revert to the old wait-for-results behavior.
  • The Helix SDK's default behavior (wait + upload from agents) remains unchanged until explicitly switched off.
  • The HRJ YAML template can be removed from pipeline stages without affecting other jobs.
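
The rollback switch can be pictured as a small resolution function. This is only a sketch of the intended semantics: the `upload_mode` helper and the `default_enabled` parameter are hypothetical, and only the `HelixReporterJobEnabled` name comes from this issue.

```python
def upload_mode(variables, default_enabled=False):
    """Decide which path uploads test results (hypothetical helper).

    variables: mapping of pipeline variable names to values.
    default_enabled: what the templates do when the variable is absent.
        Assumption: False through Phase 3, True once Phase 4 flips the default.
    """
    raw = variables.get("HelixReporterJobEnabled")
    if raw is None:
        return "hrj" if default_enabled else "agent"
    # Accept both booleans and strings such as "true" / "False".
    return "hrj" if str(raw).strip().lower() == "true" else "agent"


print(upload_mode({}))                                     # agent
print(upload_mode({"HelixReporterJobEnabled": "true"}))    # hrj
print(upload_mode({"HelixReporterJobEnabled": False}))     # agent
print(upload_mode({}, default_enabled=True))               # hrj
```

Under this sketch's assumption, removing the variable reverts to the old behavior only while the default is still off; after Phase 4 an explicit `HelixReporterJobEnabled: false` would be needed to opt out.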
