Summary
During the Canada qualifying live activity window, Pushboy accepted admin-api job requests quickly, but live activity fanout fell 20-40 minutes behind. The main observed contributors were APNs live activity send timeouts/retries and stale APNs live activity tokens that were not being invalidated on ExpiredToken.
Production evidence
Window inspected: 2026-05-23 19:45:00 UTC to 2026-05-23 21:45:00 UTC.
Live activity job:
- activity_id: 5_montreal_qualifying_2026
- topic_id: broadcast
- Pushboy HTTP acceptance for /v1/live-activity/jobs: 154 successful 202 responses, p50 24ms, p95 260ms
- This means admin-api submission was not the slow part.
Dispatch latency:
- Start dispatch: 30,156 tokens, completed in 1,270s (~21 minutes)
- Update dispatches: 153
- Update avg fanout size: 12,139 tokens
- Update latency: p50 1,338s, p95 2,315s, max 2,342s (~39 minutes)
Provider/error mix from Pushboy logs:
- 29,278 APNs live activity failures
- 22,912 APNs live activity failures classified as timeout/client-header waits
- 4,268 APNs ExpiredToken
- FCM live activity failures were not material in this window (4)
Suspected causes
- ExpiredToken is not treated as an invalid live activity token. Current invalid-token handling covers BadDeviceToken, Unregistered, and FCM registration-token-not-registered, but not APNs ExpiredToken. Those stale LA tokens keep participating in fanout until passive expiry or another invalidation path removes them.
- Timeout retries can monopolize sender workers. APNs sends use a 10s HTTP client timeout and retry retryable timeout-like errors up to 3 times. A single bad/stalled APNs send can therefore occupy a sender for roughly 40s (one initial attempt plus three retries, each running to the 10s timeout). With ~7k-13k APNs LA tokens per update and updates arriving around every 30s, dispatches can queue faster than workers complete them.
- Observability gaps make this hard to detect early. /v1/live-activity/jobs returning 202 quickly can look healthy while dispatch completion is badly delayed. We need dispatch/fanout metrics to separate acceptance latency, enqueue latency, and provider-send completion latency.
Proposed fixes
- Treat APNs ExpiredToken as invalid for live activity tokens and mark those tokens invalid in ApplyLAOutcomeBatch.
- Revisit APNs LA timeout/retry behavior so timeout storms cannot hold the whole sender pool for tens of minutes.
- Add/log metrics for:
  - LA dispatch created to completed duration
  - total/success/failure/unresolved counts by dispatch
  - provider error counts by reason
  - current in-flight/backlog count for LA dispatches
- Consider provider-specific concurrency/backpressure so APNs timeout storms do not starve other live activity work.
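A minimal sketch of the ExpiredToken fix, assuming outcomes carry a provider name and an error reason string. The Outcome type, the invalidLAReasons table, and tokensToInvalidate are hypothetical names for illustration; the actual signature and internals of ApplyLAOutcomeBatch are not shown in this report and may differ.

```go
package main

import "fmt"

// Outcome is a hypothetical per-token send result; the real type used by
// ApplyLAOutcomeBatch is not shown in the report.
type Outcome struct {
	Token    string
	Provider string // "apns" or "fcm"
	Reason   string // provider error reason; empty on success
}

// invalidLAReasons lists the error reasons that should invalidate a live
// activity token. ExpiredToken is the proposed addition for APNs.
var invalidLAReasons = map[string]map[string]bool{
	"apns": {
		"BadDeviceToken": true,
		"Unregistered":   true,
		"ExpiredToken":   true, // proposed: previously not treated as invalid
	},
	"fcm": {
		"registration-token-not-registered": true,
	},
}

// tokensToInvalidate returns the tokens in the batch whose failure reason
// marks them as permanently invalid. Unknown providers and reasons
// (including timeouts) are left untouched.
func tokensToInvalidate(batch []Outcome) []string {
	var invalid []string
	for _, o := range batch {
		if invalidLAReasons[o.Provider][o.Reason] {
			invalid = append(invalid, o.Token)
		}
	}
	return invalid
}

func main() {
	batch := []Outcome{
		{Token: "a", Provider: "apns", Reason: "ExpiredToken"},
		{Token: "b", Provider: "apns", Reason: "timeout"},
		{Token: "c", Provider: "fcm", Reason: "registration-token-not-registered"},
		{Token: "d", Provider: "apns"},
	}
	fmt.Println(tokensToInvalidate(batch)) // [a c]
}
```

Keeping the reasons in one table per provider means a future permanent-failure reason can be added without touching the batch loop.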
Acceptance criteria
- ExpiredToken LA failures invalidate the corresponding live activity token.
- A timeout-heavy APNs window no longer causes 20-40 minute LA update completion delays.
- Operators can see whether slowness is in job acceptance, token fanout/enqueue, or provider sends.