Fix Live Activity fanout stalls from APNs timeouts and ExpiredToken cleanup #41

Description

@mithileshchellappan

Summary

During the Canada qualifying live activity window, Pushboy accepted admin-api job requests quickly, but live activity fanout fell 20-40 minutes behind. The main observed contributors were APNs live activity send timeouts/retries and stale APNs live activity tokens that were not being invalidated on ExpiredToken.

Production evidence

Window inspected: 2026-05-23 19:45:00 UTC to 2026-05-23 21:45:00 UTC.

Live activity job:

  • activity_id: 5_montreal_qualifying_2026
  • topic_id: broadcast
  • Pushboy HTTP acceptance for /v1/live-activity/jobs: 154 successful 202 responses, p50 24ms, p95 260ms
  • This means admin-api submission was not the slow part.

Dispatch latency:

  • Start dispatch: 30,156 tokens, completed in 1,270s (~21 minutes)
  • Update dispatches: 153
  • Update avg fanout size: 12,139 tokens
  • Update latency: p50 1,338s (~22 minutes), p95 2,315s (~39 minutes), max 2,342s (~39 minutes)

Provider/error mix from Pushboy logs:

  • 29,278 APNs live activity failures
  • 22,912 APNs live activity failures classified as timeout/client-header waits
  • 4,268 APNs ExpiredToken
  • FCM live activity failures were not material in this window (4)

Suspected causes

  1. ExpiredToken is not treated as an invalid live activity token.

    Current invalid-token handling covers BadDeviceToken, Unregistered, and FCM registration-token-not-registered, but not APNs ExpiredToken. Those stale LA tokens keep participating in fanout until passive expiry or another invalidation path removes them.

  2. Timeout retries can monopolize sender workers.

    APNs sends use a 10s HTTP client timeout and retry timeout-like errors up to 3 times. A single bad or stalled APNs send can therefore occupy a sender for roughly 40s (the initial attempt plus 3 retries, each waiting out the 10s timeout, before any backoff). With ~7k-13k APNs LA tokens per update and updates arriving roughly every 30s, dispatches can queue faster than workers complete them.

  3. Observability makes this hard to detect early.

    /v1/live-activity/jobs returning 202 quickly can look healthy while dispatch completion is badly delayed. We need dispatch/fanout metrics to separate acceptance latency, enqueue latency, and provider-send completion latency.

Proposed fixes

  • Treat APNs ExpiredToken as invalid for live activity tokens and mark those tokens invalid in ApplyLAOutcomeBatch.
  • Revisit APNs LA timeout/retry behavior so timeout storms cannot hold the whole sender pool for tens of minutes.
  • Add/log metrics for:
    • LA dispatch created to completed duration
    • total/success/failure/unresolved counts by dispatch
    • provider error counts by reason
    • current in-flight/backlog count for LA dispatches
  • Consider provider-specific concurrency/backpressure so APNs timeout storms do not starve other live activity work.

Acceptance criteria

  • ExpiredToken LA failures invalidate the corresponding live activity token.
  • A timeout-heavy APNs window no longer causes 20-40 minute LA update completion delays.
  • Operators can see whether slowness is in job acceptance, token fanout/enqueue, or provider sends.
