Reduce machine time spent waiting on Helix job completion #16727

@premun

Description

Context

Today, Azure DevOps build agents remain allocated (and incur cost) while waiting for Helix work items to complete and test results to be uploaded. The goal is to decouple build jobs from Helix test execution so that build agents are released as soon as they submit Helix work items, without losing test result visibility, retry capabilities, or pipeline pass/fail semantics.

Current Flow

  1. A build job compiles code, then submits Helix work items via the Helix SDK.
  2. The build job waits for all submitted work items to finish.
  3. Helix agents execute work items. On completion, each Helix agent parses test result XMLs and uploads them to Azure DevOps using the system access token passed in the Helix payload.
  4. The build job reports success/failure based on Helix results.

Key Constraint

The system access token is scoped to the originating build job and expires shortly after the job ends (roughly job timeout + 5 minutes). If build jobs finish before Helix work items complete, the token cannot be reliably used for uploading test results.


Approaches Considered

Managed Identity (MI) on Helix Agents

Replace the system access token with a Managed Identity for uploading test results directly from Helix agents.

Pros:

  • Minimal architectural change — Helix agents continue uploading results as today.

Cons:

  • Azure DevOps throttles API calls per identity. Today's system access tokens are not throttled because they lack a persistent identity. Switching to MIs would likely trigger throttling.
  • Multiple MIs exist across different infrastructure (Arc machines, Helix queues, certificate-based physical devices) — each would need rate-limit exemptions from the Azure DevOps Ops team.
  • Every new MI would require a manual request to increase limits — an ongoing maintenance burden.

Decision: Not pursued as primary approach due to throttling risk and operational overhead.

Single Agentless Job (Callback-Based)

A single Azure DevOps agentless job starts with the pipeline and uses a callback from Helix to determine when all work items complete.

Pros:

  • No machine cost — truly agentless.
  • No additional runtime added to the pipeline.

Cons:

  • Agentless jobs cannot poll; they support only a single REST API call with a ~15-second timeout, then require a callback to complete.
  • Helix would need to know about Azure DevOps pipeline state to decide when to fire the callback (e.g., what if no Helix jobs are ever submitted due to a compilation error?). This couples Helix to Azure DevOps in undesirable ways.
  • No output — agentless jobs produce no console logs. Inline progress (links to failed work items, console logs) would be impossible.
  • The Azure DevOps Ops team expressed concerns about agentless jobs being used for long-running orchestration scenarios.

Decision: Not pursued due to output limitations, callback complexity, and Azure DevOps Ops team concerns.

Machine-Backed Helix Reporter — Lightweight Long-Running Machine Job ✅ (Preferred)

A dedicated build job runs on a minimal machine (tiny Linux container or a dedicated small pool) within each stage that submits Helix work items. It polls Helix for the status of all work items submitted by that stage, downloads and processes test result files, and uploads them to Azure DevOps using its own system access token.

Decision: Preferred approach. Described in detail below.


Proposed Design: The machine-backed Helix reporter job

A long-running build job — referred to as the Helix Reporter — is added to each pipeline stage that submits Helix work items. Each stage gets its own Helix Reporter instance that monitors only the Helix jobs submitted by the other jobs in that stage. It runs on a lightweight dedicated pool (e.g., a small Linux container) and is responsible for:

  1. Monitoring stage state via the Azure DevOps REST API.
  2. Polling Helix for the status of all work items submitted by jobs in its stage.
  3. Downloading test result files from Helix storage as work items complete.
  4. Parsing and uploading test results to Azure DevOps using its own system access token.
  5. Reporting final pass/fail status for the stage's Helix work.

Helix Reporter Lifecycle

Phase 1: Wait for Stage Completion (with Incremental Processing)

The Helix Reporter starts at the beginning of its stage (in parallel with the other jobs in the stage). It enters a polling loop:

  1. Query Azure DevOps REST API: Are all non-Helix Reporter jobs in the stage finished (completed, failed, or cancelled)?
  2. Query Helix API: Are there any completed Helix jobs associated with this build/stage that have not yet been processed?
    • If yes → download test result XMLs, parse, and upload to Azure DevOps.
    • Log progress (jobs completed, jobs remaining, links to failed work items).
  3. If non-Helix Reporter jobs in the stage are still running → sleep and retry from step 1.
  4. If all non-Helix Reporter jobs in the stage are finished → proceed to Phase 2.
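The Phase 1 loop above can be sketched in Python. The job-record shape, the helper names, and the poll interval are illustrative assumptions; a real implementation would call the Azure DevOps and Helix REST APIs behind these callables.

```python
import time

def stage_jobs_finished(jobs, reporter_name="HelixReporter"):
    """True once every job in the stage other than the reporter itself
    has reached a terminal state (completed, failed, or cancelled)."""
    terminal = {"completed", "failed", "cancelled"}
    return all(j["state"] in terminal
               for j in jobs if j["name"] != reporter_name)

def wait_for_stage(get_stage_jobs, process_completed_helix_jobs,
                   poll_seconds=60):
    """Phase 1: poll until all other jobs in the stage are finished,
    processing any newly completed Helix jobs incrementally along the way."""
    while True:
        process_completed_helix_jobs()   # download/parse/upload any new results
        if stage_jobs_finished(get_stage_jobs()):
            return                       # proceed to Phase 2
        time.sleep(poll_seconds)
```

The reporter never needs to be told when other jobs finish; it derives that from the stage's job states on every iteration.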

Phase 2: Wait for All Helix Work Items

Once all other jobs in the stage have completed, Helix jobs may still be running:

  1. Query Helix API: Get all Helix jobs associated with this build/stage (identified by build/pipeline/stage metadata submitted by the Helix SDK).
  2. For each completed Helix job not yet processed:
    • Download test result XML files from Helix storage (publicly/anonymously accessible for public builds).
    • Parse test results (existing Python logic currently in the Helix Machines repo).
    • Upload parsed results to Azure DevOps via the test results REST API using the Helix Reporter's own system access token.
    • Record the job as processed (in memory).
  3. If any Helix jobs are still running → sleep and retry.
  4. Once all Helix jobs are complete and results uploaded → proceed to Phase 3.
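The Phase 2 body can be sketched as a single pass over the stage's Helix jobs. `handle_job` stands in for the download/parse/upload step; the job-record shape and the in-memory `processed` set are assumptions for illustration.

```python
def process_new_results(helix_jobs, processed, handle_job):
    """Handle every completed Helix job exactly once; report whether any
    jobs are still running (i.e., whether another polling pass is needed)."""
    still_running = False
    for job in helix_jobs:
        if job["state"] != "completed":
            still_running = True
        elif job["id"] not in processed:
            handle_job(job)           # download XMLs, parse, upload to AzDO
            processed.add(job["id"])  # in-memory only; re-derivable on retry
    return still_running
```

Because `processed` is only an optimization, losing it (e.g., on a crash) is safe: a re-run re-derives what was uploaded from the APIs, per the Statelessness section below.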

Phase 3: Report Status and Exit

  • If all Helix work items passed → exit green.
  • If any work item failed → exit red.
  • Output a summary of processed jobs, pass/fail counts, and links to failed work items in the job console log.
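Phase 3 reduces to computing an exit code from the processed work items; the work-item record shape here is an illustrative assumption.

```python
def final_status(work_items):
    """Summarize results and choose the reporter job's exit code:
    0 (green) only if every Helix work item passed."""
    failed = [w for w in work_items if not w["passed"]]
    print(f"Processed {len(work_items)} work items; "
          f"{len(work_items) - len(failed)} passed, {len(failed)} failed.")
    for w in failed:
        print(f"  failed: {w['name']} -> {w['console_log']}")
    return 1 if failed else 0
```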

Statelessness

A critical design goal is that the Helix Reporter should be stateless — it must be able to derive its current state entirely from:

  • The Azure DevOps REST API (pipeline/build state, job statuses, attempts).
  • The Helix API (job statuses, work item results, associated build metadata).

This means:

  • No dependency on pipeline artifacts for state tracking.
  • Safe to re-run at any time — it will re-derive what has been processed and what remains.
  • On retry (new attempt), it can compare previously uploaded results against the current set of Helix jobs and only process the delta.
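Assuming uploaded test runs can be matched back to Helix jobs by name (a convention this design would need to establish, not an existing guarantee), the delta computation on retry reduces to a set difference:

```python
def unprocessed_jobs(helix_job_ids, uploaded_run_names):
    """Derive remaining work purely from the two APIs: all Helix jobs for
    the stage minus those that already have an uploaded test run in
    Azure DevOps (matched by a naming convention)."""
    return sorted(set(helix_job_ids) - set(uploaded_run_names))
```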

Retry Handling

Retries are a key complexity area. The design handles them as follows:

  • A Helix work item fails: The Helix Reporter goes red. The user re-runs the Helix Reporter (or the entire stage). It queries Helix for all jobs from the stage, identifies unprocessed ones (including resubmitted work items), processes them, and reports the new result.
  • A build job fails (compilation error): No Helix jobs are submitted from that job. The Helix Reporter detects the build job failure and reflects it in its final status. The user can re-run the failed build job or the entire stage; once the build job succeeds, a re-run Helix Reporter picks up the newly submitted Helix jobs.
  • The Helix Reporter itself crashes: The user re-runs the Helix Reporter. Because it is stateless, it re-derives the current state from the APIs and picks up where it left off.
  • The user re-runs a build job but not the Helix Reporter: The stage stays red (because the Helix Reporter is red). New Helix jobs are submitted. The user then re-runs the Helix Reporter, which discovers the new unprocessed jobs and uploads their results.

Work Item Resubmission

When the Helix Reporter is re-run after a failure:

  1. It queries Helix for all jobs associated with the build/stage (across all attempts).
  2. It identifies which work items have already had results uploaded (by checking Azure DevOps test run data or by convention).
  3. It processes only the new/unprocessed work items.

Note: If a user re-runs a build job but forgets to re-run the Helix Reporter, the stage will remain red (since the Helix Reporter is the only red job). This is by design — it forces the user to also re-run the Helix Reporter to get updated test results. As a safeguard, the Helix SDK could detect whether the Helix Reporter is running and warn/fail if it is not. Additionally, build jobs may programmatically trigger the Helix Reporter via the Azure DevOps REST API (Build Analysis already uses this API for retries).
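Programmatically triggering the Helix Reporter could use the Build "Stages - Update" REST call. The sketch below builds that PATCH request with Python's standard library; the api-version, body shape, and placeholder organization/stage names are assumptions to verify against current Azure DevOps documentation before use.

```python
import json
from urllib import request

def stage_retry_request(org, project, build_id, stage_ref_name, token):
    """Build a PATCH request asking Azure DevOps to retry a stage's jobs
    (Build REST API "Stages - Update"); verify the api-version and body
    against current documentation."""
    url = (f"https://dev.azure.com/{org}/{project}/_apis/build/builds/"
           f"{build_id}/stages/{stage_ref_name}?api-version=7.1-preview.1")
    body = json.dumps({"state": "retry"}).encode()
    req = request.Request(url, data=body, method="PATCH")
    req.add_header("Content-Type", "application/json")
    req.add_header("Authorization", f"Bearer {token}")
    return req

# To execute (requires a valid token):
#   request.urlopen(stage_retry_request("org", "project", 123, "Stage", token))
```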

Incremental Test Result Reporting

Unlike the agentless job approach, the Helix Reporter runs on a real machine and can produce console output. This enables:

  • Progress reporting: As Helix jobs complete, the Helix Reporter logs which jobs finished, how many remain, and provides links to console logs for failed work items.
  • Incremental uploads: Test results appear in Azure DevOps as each Helix job completes, rather than only after everything finishes.

Test Result Processing

The existing test result processing logic (currently in the Helix Machines repo) handles:

  • Multiple test result formats (xUnit, NUnit, etc.).
  • Locating test result XMLs by convention (e.g., testResults.xml in a known directory).
  • Parsing and converting to Azure DevOps test result format.
  • Uploading via the Azure DevOps test results REST API.

This logic would be reused within the Helix Reporter. The same Python code that runs on Helix agents today would run inside the Helix Reporter, processing results centrally per stage rather than distributed across individual Helix agents.
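For illustration, a minimal xUnit v2 reader with an outcome mapping might look like the following. This is a stand-in sketch, not the actual code from the Helix Machines repo; the outcome names target the Azure DevOps test result schema.

```python
import xml.etree.ElementTree as ET

# xUnit result -> Azure DevOps test outcome (illustrative mapping)
AZDO_OUTCOME = {"Pass": "Passed", "Fail": "Failed", "Skip": "NotExecuted"}

def parse_xunit_results(xml_text):
    """Extract (test name, xUnit result) pairs from an xUnit v2 results
    file (<assemblies>/<assembly>/<collection>/<test> structure)."""
    root = ET.fromstring(xml_text)
    return [(t.get("name"), t.get("result")) for t in root.iter("test")]
```

The parsed results would then be posted to Azure DevOps via the test results REST API (create a test run, add results to it), exactly as the per-agent upload does today, just from one machine per stage.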

Authentication

The Helix Reporter uses its own system access token to upload test results to Azure DevOps. This token is:

  • Valid for the lifetime of the Helix Reporter (which runs for the duration of its stage).
  • Not subject to identity-based throttling (same behavior as today's per-job tokens).
  • Scoped to the pipeline and requires no MI exemptions.

No system access tokens need to be passed to Helix agents for test result uploads.

Infrastructure Requirements

  • Dedicated pool: A small pool of lightweight machines (e.g., tiny Linux containers) dedicated to running Helix Reporters. Must be highly available; if no machine is available, the stage has no test result reporter.
  • Pool sizing: Must accommodate concurrent pipelines and stages. A backed-up pool delays test result reporting but does not block builds from completing.
  • Helix API: Must support querying jobs by build/pipeline/stage metadata. The Helix SDK already submits build metadata; stage metadata must also be included.
  • Helix storage: Test result XMLs must be accessible from the Helix Reporter. For public builds, these are anonymously accessible; for internal builds, authentication may be needed.
