Stuck Cloudflare build prevented new site updates #36

@rishikeshsreehari

The Cloudflare Pages build for npsnav.in got stuck in the “Building…” state for several hours.
During this time, all subsequent deployments were queued and never executed.

GitHub Actions successfully updated data.json in the repository, but the site did not update because Cloudflare never completed the deployment.
The issue was resolved only after manually deleting the stuck deployment from the Cloudflare Pages dashboard.

This incident also coincided with a Cloudflare-wide outage, which may have contributed to the build becoming stuck.


Impact

  • Website served stale NAV data even though the repo contained the latest updates
  • Automated daily updates silently failed
  • All new deployments were blocked behind the stuck one
  • Required manual intervention
  • High chance of recurrence without a safeguard

Proposed Fix

Introduce a “watchdog” mechanism to detect and recover from stuck Cloudflare Pages builds.
Instead of running on a schedule, the watchdog can be triggered only after a new deployment should have started (i.e., after our GitHub Action pushes changes).

Option A — Trigger Watchdog After Each Push / Deployment Attempt

After the NAV update workflow completes:

  • Query Cloudflare Pages API for the latest deployment
  • If the deployment has been queued or building for too long (e.g., >10 minutes) → cancel it automatically
  • Trigger a new deployment
  • Optional: send alerts (Slack/Discord/etc.)

This approach runs only when needed and avoids unnecessary scheduled jobs.
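
A rough sketch of what the watchdog could look like, in Python with only the standard library. The endpoint paths, the `latest_stage.status` values, and the use of DELETE to clear a stuck deployment reflect my reading of the Cloudflare Pages API and should be checked against the current docs before relying on them. `PAGES_PROJECT` and `PAGES_DEPLOY_HOOK_URL` are hypothetical variable names, and the re-trigger step assumes a Pages deploy hook has been created for the project. The sketch checks every listed deployment rather than only the latest, since a freshly queued deployment can sit behind an older stuck one.

```python
import json
import os
import urllib.request
from datetime import datetime, timedelta, timezone

API = "https://api.cloudflare.com/client/v4"
# Assumed terminal values of latest_stage.status; anything else counts as in progress.
FINISHED = {"success", "failure", "canceled"}


def parse_ts(value):
    """Parse a timestamp such as 2025-01-01T00:00:00.123456Z, assuming UTC."""
    base = value.split(".")[0].rstrip("Z")  # drop fractional seconds and the Z
    return datetime.strptime(base, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)


def is_stuck(deployment, now, limit=timedelta(minutes=10)):
    """True if the deployment has not finished and is older than `limit`."""
    stage = deployment.get("latest_stage") or {}
    if stage.get("status") in FINISHED:
        return False
    return now - parse_ts(deployment["created_on"]) > limit


def call(method, url, token=None):
    """Send a request and decode the JSON response."""
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    data = b"" if method == "POST" else None
    req = urllib.request.Request(url, data=data, method=method, headers=headers)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def main():
    token = os.environ["CLOUDFLARE_API_TOKEN"]
    account = os.environ["CLOUDFLARE_ACCOUNT_ID"]
    project = os.environ["PAGES_PROJECT"]          # hypothetical variable name
    hook = os.environ.get("PAGES_DEPLOY_HOOK_URL")  # hypothetical variable name
    base = f"{API}/accounts/{account}/pages/projects/{project}/deployments"

    now = datetime.now(timezone.utc)
    stuck = [d for d in call("GET", base, token)["result"] if is_stuck(d, now)]
    for deployment in stuck:
        # Mirrors the manual fix: remove the stuck deployment.
        call("DELETE", f"{base}/{deployment['id']}", token)
        print(f"Removed stuck deployment {deployment['id']}")
    if stuck and hook:
        call("POST", hook)  # ask Pages to start a fresh build
        print("Triggered a new deployment via deploy hook")


if __name__ == "__main__":
    main()
```

Run as a final step of the NAV update workflow, after a delay long enough for a normal build to finish, this would only act when something is actually stuck. An alert (Slack, Discord, etc.) could be added next to the `print` calls.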


Option B — Health Check After Deployment

After GitHub Actions finishes pushing updates:

  • Fetch https://npsnav.in/data.json
  • Validate the timestamp
  • If the deployed data is stale → retry or alert

This catches the case where Cloudflare marks a build as “successful” but the live site did not actually update.
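
The health check might look roughly like the sketch below. It assumes that data.json is a JSON object carrying a timestamp under a key called `last_updated`; that key name is hypothetical and would need to match the real schema. Rather than parsing the timestamp, it compares the live value with the one in the freshly committed data.json, retrying for a while to give the deployment time to finish.

```python
import json
import sys
import time
import urllib.request

LIVE_URL = "https://npsnav.in/data.json"
TIMESTAMP_KEY = "last_updated"  # hypothetical key; adjust to the real schema


def fetch_json(url):
    """Download and decode JSON, adding a query string to dodge cached copies."""
    req = urllib.request.Request(f"{url}?t={int(time.time())}")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def is_current(live, committed, key=TIMESTAMP_KEY):
    """True if the live file carries the same timestamp as the committed one."""
    if not isinstance(live, dict) or not isinstance(committed, dict):
        return False
    return key in committed and live.get(key) == committed[key]


def wait_until_current(committed, attempts=10, delay=60,
                       fetch=fetch_json, sleep=time.sleep):
    """Poll the live site until it matches the committed data or attempts run out."""
    for _ in range(attempts):
        try:
            if is_current(fetch(LIVE_URL), committed):
                return True
        except (OSError, ValueError):
            pass  # network error or invalid JSON; try again after the delay
        sleep(delay)
    return False


if __name__ == "__main__":
    # Assumes the script runs from the repository root after the push.
    with open("data.json", encoding="utf-8") as fh:
        committed = json.load(fh)
    if not wait_until_current(committed):
        print("Live data.json is stale", file=sys.stderr)
        sys.exit(1)  # fail the workflow so the problem is visible
    print("Live data.json matches the repository")
```

Exiting with a non-zero status makes the workflow run fail, which already gives a basic alert through GitHub's own notifications; a retry or a chat notification could be hooked in at the same point.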


Option C — Optional Pipeline Change

Switch to building the site inside GitHub Actions and let Cloudflare Pages only handle hosting (Direct Upload).
This avoids Cloudflare’s build system entirely, but requires modifying the current deployment setup.
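
For illustration only, a Direct Upload step could be driven from a small Python script that GitHub Actions runs after checkout. The build command, the output directory `public`, and the project name `npsnav` below are placeholders, not the project's actual settings. The upload itself uses `wrangler pages deploy`, which reads `CLOUDFLARE_API_TOKEN` and `CLOUDFLARE_ACCOUNT_ID` from the environment.

```python
import os
import subprocess

# Placeholders: replace with the site's real build command and output folder.
BUILD_COMMAND = ["npm", "run", "build"]
OUTPUT_DIR = "public"
PROJECT_NAME = "npsnav"


def deploy_command(output_dir, project_name):
    """Build the wrangler invocation that uploads a prebuilt directory."""
    return ["npx", "wrangler", "pages", "deploy", output_dir,
            f"--project-name={project_name}"]


def main():
    for var in ("CLOUDFLARE_API_TOKEN", "CLOUDFLARE_ACCOUNT_ID"):
        if var not in os.environ:
            raise SystemExit(f"Missing environment variable: {var}")
    # Build on the GitHub runner, then upload the result to Pages.
    subprocess.run(BUILD_COMMAND, check=True)
    subprocess.run(deploy_command(OUTPUT_DIR, PROJECT_NAME), check=True)


if __name__ == "__main__":
    main()
```

With `check=True`, a failed build or upload fails the workflow run immediately instead of leaving a deployment hanging on Cloudflare's side.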


Next Steps

  • Decide which recovery approach to implement
  • If Option A is chosen: implement watchdog logic triggered after the NAV update workflow
  • If Option B is chosen: add health-check logic
  • Optional: add fallback scheduled watchdog for resilience during Cloudflare outages

Labels: bug, good first issue, help wanted
