Fix existing PET telemetry fields before adding new diagnostics #478

Description

@eleanorjboyd

Problem

Recent rollout diagnostics show several existing PET/setup telemetry fields are missing or misclassified, which blocks investigation of Windows rpc_refresh_timeout restarts and setup hangs.

Existing telemetry fixes needed

  1. Move numeric fields marked as measurements out of properties and into measures:

    • pet.refresh: envCount, unresolvedCount, workspaceDirCount, searchPathCount, attempt
    • pet.configure: workspaceDirCount, envDirCount, retryCount
    • pet.process_restart: attempt
  2. Fix setup.hang_detected so stageDuration arrives alongside duration.

  3. Stop classifying PET RPC timeouts as spawn_timeout; use rpc_timeout or method-specific values like rpc_refresh_timeout.

  4. Ensure pet.refresh timeout/error events include every context field the success path already reports: envCount, unresolvedCount, attempt, and workspaceDirCount/searchPathCount when available.

  5. Improve PET version/build stamping reliability so that a single early info timeout does not leave later slow/timeout events stamped with an unknown version.
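For item 1, a minimal sketch of the properties/measures split, assuming the telemetry sender accepts string properties and numeric measures as two separate maps. `splitTelemetryFields` and `TelemetryPayload` are hypothetical names for illustration, not the extension's actual API, and the `result` field is an assumed example of a non-numeric property:

```typescript
// Hypothetical payload shape: string-valued properties and numeric measures.
type TelemetryPayload = {
    properties: Record<string, string>;
    measures: Record<string, number>;
};

// Routes finite numbers to measures and everything else to properties.
// Undefined values are dropped so "when available" fields are simply omitted.
function splitTelemetryFields(
    fields: Record<string, string | number | boolean | undefined>,
): TelemetryPayload {
    const properties: Record<string, string> = {};
    const measures: Record<string, number> = {};
    for (const [key, value] of Object.entries(fields)) {
        if (value === undefined) {
            continue;
        }
        if (typeof value === "number" && Number.isFinite(value)) {
            measures[key] = value;
        } else {
            // Strings, booleans, and non-finite numbers are stringified.
            properties[key] = String(value);
        }
    }
    return { properties, measures };
}

// Example with the pet.refresh fields listed above; searchPathCount is unavailable here.
const refreshPayload = splitTelemetryFields({
    result: "timeout", // assumed property name, for illustration only
    envCount: 12,
    unresolvedCount: 3,
    workspaceDirCount: 2,
    searchPathCount: undefined,
    attempt: 1,
});
```

Building the payload through one helper like this would keep success, timeout, and error paths from diverging on which fields land in measures.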

Why this matters

Most Windows PET restarts are triggered by rpc_refresh_timeout. Affected sessions correlate strongly with setup hangs and very slow manager registration, but the missing and misclassified fields make the root cause hard to prove.

Acceptance criteria

  • setup.hang_detected contains both duration and stageDuration.
  • PET refresh RPC timeouts are distinguishable from process spawn timeouts.
  • Timeout/error refresh events preserve available context needed for correlation.
  • PET version/build/commit is not reported as unknown after a recoverable info timeout.
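The first two criteria could be checked with something like the sketch below. It assumes the timeout site knows whether it was waiting on process spawn or on an RPC call, and that duration and stageDuration are both sent as measures. `classifyTimeout` and `missingHangFields` are hypothetical helpers; only spawn_timeout, rpc_timeout, and rpc_refresh_timeout are values named in this issue, and any other method-specific value it produces is an assumption:

```typescript
// Where the timeout fired: waiting for the PET process to start, or waiting on an RPC reply.
type TimeoutPhase = "spawn" | "rpc";

// Returns a reason string that keeps RPC timeouts distinct from spawn timeouts.
// The method name (e.g. "refresh") is optional; without it the generic value is used.
function classifyTimeout(phase: TimeoutPhase, method?: string): string {
    if (phase === "spawn") {
        return "spawn_timeout";
    }
    return method ? `rpc_${method}_timeout` : "rpc_timeout";
}

// Lists which of the required setup.hang_detected measures are absent.
function missingHangFields(measures: Record<string, number>): string[] {
    return ["duration", "stageDuration"].filter((name) => !(name in measures));
}

const refreshReason = classifyTimeout("rpc", "refresh");
const genericReason = classifyTimeout("rpc");
const spawnReason = classifyTimeout("spawn");
const missing = missingHangFields({ duration: 30000 });
```

Here `missing` would contain only `stageDuration`, which is the gap item 2 describes.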