
fix: self-healing downloads + per-user cancel ref-counting (#31, #32) #69

Merged
windoze95 merged 2 commits into main from fix/backend-download-lifecycle
Jun 27, 2026


@windoze95


Two backend download-reliability fixes (roadmap NEXT tier).

Fixes #31
Fixes #32

#31 — Self-healing downloads

Root cause: the stdout read loop only called process.wait(timeout) after it exited, so a silent hang (process alive, socket dead, no output, no EOF) blocked the worker forever and the per-line cancel check couldn't fire. A crashed worker also stranded the row in DOWNLOADING indefinitely.

  • Watchdog thread (download_manager.py) runs beside the reader and kills the process group on: no output for NO_OUTPUT_TIMEOUT_SECONDS (300s), runtime past OVERALL_DEADLINE_SECONDS (4h), or cancel_check() → True (so cancel now works mid-hang). Killing closes the pipe, unblocking the reader, which raises RuntimeError (stall/deadline) or DownloadCancelled. A finally always reaps the child.
  • Celery soft_time_limit/time_limit sit just above the watchdog as a coarse backstop.
  • Heartbeat + reaper: new videos.download_heartbeat_at column (stamped atomically with the DOWNLOADING transition, refreshed ~every 30s from yt-dlp output); reap_stuck_downloads resets rows whose heartbeat is stale > 30min (DOWNLOADING→PENDING + re-enqueue, CANCELLING→CATALOGED). Runs on worker_ready and every 5min via beat, so a crashed worker self-heals.
  • Migration 007_download_heartbeat.
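
The watchdog pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in download_manager.py: `run_with_watchdog`, `poll_interval`, and the reason strings are hypothetical names, the timeouts are passed in rather than read from `NO_OUTPUT_TIMEOUT_SECONDS` / `OVERALL_DEADLINE_SECONDS`, and the process-group kill assumes a POSIX host.

```python
import os
import signal
import subprocess
import sys
import threading
import time


class DownloadCancelled(Exception):
    """Raised when the watchdog killed the process because of a cancel."""


def run_with_watchdog(cmd, no_output_timeout, overall_deadline,
                      cancel_check=lambda: False, poll_interval=0.05):
    """Run cmd, collect stdout lines, and kill its process group if it stalls."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True,
                            start_new_session=True)  # own process group (POSIX)
    started = time.monotonic()
    last_output = [started]  # mutable cell shared with the watchdog thread
    kill_reason = []         # filled in by the watchdog just before it kills
    done = threading.Event()

    def watchdog():
        # Wakes every poll_interval seconds until the reader signals `done`.
        while not done.wait(poll_interval):
            now = time.monotonic()
            if cancel_check():
                reason = "cancelled"
            elif now - last_output[0] > no_output_timeout:
                reason = "stalled"
            elif now - started > overall_deadline:
                reason = "deadline"
            else:
                continue
            kill_reason.append(reason)
            try:
                os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
            except ProcessLookupError:
                pass  # the process is already gone
            return

    thread = threading.Thread(target=watchdog, daemon=True)
    thread.start()
    lines = []
    try:
        # Killing the group closes the pipe, so this loop sees EOF and exits.
        for line in proc.stdout:
            last_output[0] = time.monotonic()
            lines.append(line.rstrip("\n"))
    finally:
        done.set()
        thread.join()
        proc.wait()  # always reap the child
        proc.stdout.close()
    if kill_reason:
        if kill_reason[0] == "cancelled":
            raise DownloadCancelled()
        raise RuntimeError(f"download {kill_reason[0]}")
    if proc.returncode != 0:
        raise RuntimeError(f"exit code {proc.returncode}")
    return lines
```

Because the kill decision lives in a separate thread, a child that stays alive but never writes or closes stdout can no longer pin the reader: `run_with_watchdog([sys.executable, "-c", "import time; time.sleep(30)"], no_output_timeout=0.3, overall_deadline=10)` raises `RuntimeError` after roughly the stall timeout instead of blocking for the full 30 seconds.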

#32 — Per-user cancel/re-trigger ref-counting

Cancel previously mutated the shared Video.status with no ownership check, so one user's Cancel killed a download others still wanted. Built on the existing UserVideoRef (it is the reference count — no new table):

  • cancel_download: drops only the caller's active ref; if any active refs remain, returns {stopped: false} without touching Video.status (covers the scheduler, which downloads on subscribers' behalf); only when none remain does it truly cancel (PENDING→CATALOGED, DOWNLOADING→CANCELLING). Keeps the CANCELLING escape-hatch and orphan cleanup.
  • trigger_download: registers the caller's ref up front (idempotent upsert), symmetric with cancel.
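
A toy in-memory model of the ref-counting rule may make the two bullets above concrete. It is only a sketch under stated assumptions: the real code works on `UserVideoRef` rows and `Video.status` in the database, whereas here the active refs are a plain set of user ids, the `Video` dataclass is hypothetical, and the `{"stopped": True}` return value for the last-ref case is assumed (the description only states the `{stopped: false}` case).

```python
from dataclasses import dataclass, field


@dataclass
class Video:
    status: str = "PENDING"
    # Stand-in for active UserVideoRef rows: the ids of users who want the file.
    active_refs: set = field(default_factory=set)


def trigger_download(video, user_id):
    """Register the caller's ref up front; adding to a set is idempotent."""
    video.active_refs.add(user_id)


def cancel_download(video, user_id):
    """Drop only the caller's ref; stop the shared download on the last drop."""
    video.active_refs.discard(user_id)
    if video.active_refs:
        # Someone else still wants this download: leave Video.status alone.
        return {"stopped": False}
    if video.status == "PENDING":
        video.status = "CATALOGED"
    elif video.status == "DOWNLOADING":
        video.status = "CANCELLING"
    return {"stopped": True}  # assumed shape for the "truly cancelled" case
```

With two subscribers on a `DOWNLOADING` video, the first `cancel_download` call leaves the status untouched and the second moves it to `CANCELLING`, which mirrors the two-subscriber test listed under Verification.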

Did not modify community PRs #56/#59/#66 (they add no ref-counting).

Verification (local)

  • pytest -q: 86 passed (+6 new in test_download_lifecycle.py: watchdog kills a silent hang; cancel honored mid-hang; terminal stall → FAILED; reaper resets stranded rows; two-subscriber cancel keeps B's download alive then truly cancels on the last ref drop; trigger registers a ref).
  • alembic upgrade head / downgrade -1 round-trip on a throwaway DB.
  • ruff check / ruff format --check clean.

🤖 Generated with Claude Code

https://claude.ai/code/session_01RXMKM1rDWn8wNh93MMUtxY

…l ref-counting (#32)

Task 1 - self-healing downloads (#31):
- download_manager: a watchdog thread now runs alongside the stdout reader and
  kills yt-dlp on a silent hang (no output for 300s), past a 4h overall
  deadline, or on cancellation. Previously process.wait(timeout) only ran AFTER
  the read loop, so a silent hang (stdout open, no output) blocked the worker
  forever and the per-line cancel check could never fire. Cancellation now works
  mid-hang. download_video also exposes a heartbeat_callback and always reaps the
  child process in a finally block.
- download_video_task: Celery soft/hard time limits as a coarse backstop above
  the watchdog deadline; stamps videos.download_heartbeat_at on the DOWNLOADING
  transition and refreshes it (throttled, own session) as output flows.
- download_reaper + reap_stuck_downloads_task: reset rows stranded by a crashed
  worker - DOWNLOADING -> PENDING (and re-enqueue), CANCELLING -> CATALOGED -
  detected via a stale heartbeat. Runs on worker startup (worker_ready) and every
  5 min via Celery beat. Reset-to-PENDING clears the start guard so retries run.
- migration 007_download_heartbeat: add nullable videos.download_heartbeat_at
  (down_revision 006_hotpath_indexes).
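
As a rough illustration of the reaper rule described above, the sketch below applies the stale-heartbeat check to plain in-memory rows. `VideoRow`, the `enqueue` callback, and the handling of a NULL heartbeat (treated as stale here) are assumptions for the example; the 30-minute threshold comes from the PR description, and the "start guard" reset is not modeled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, List, Optional

STALE_AFTER = timedelta(minutes=30)  # threshold stated in the PR description


@dataclass
class VideoRow:
    id: int
    status: str
    download_heartbeat_at: Optional[datetime]


def reap_stuck_downloads(rows: List[VideoRow], now: datetime,
                         enqueue: Callable[[int], None]) -> List[int]:
    """Reset rows whose heartbeat is stale and return the ids that were reset."""
    reset_ids = []
    for row in rows:
        if row.status not in ("DOWNLOADING", "CANCELLING"):
            continue
        heartbeat = row.download_heartbeat_at
        # Assumption: a missing heartbeat counts as stale.
        if heartbeat is not None and now - heartbeat <= STALE_AFTER:
            continue  # a live worker is still refreshing this row
        if row.status == "DOWNLOADING":
            row.status = "PENDING"
            enqueue(row.id)  # hand the download back to the queue
        else:
            row.status = "CATALOGED"
        reset_ids.append(row.id)
    return reset_ids
```

Passing `now` and `enqueue` in as arguments keeps the rule testable without a database or a Celery broker; a real task would presumably query only candidate rows and commit the transitions in one transaction.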

Task 2 - per-user cancel/re-trigger ref-counting (#32):
- cancel_download drops ONLY the caller's active UserVideoRef and tears down the
  shared download solely when no active ref remains, so one user's cancel can no
  longer kill a download another subscriber (or the scheduler, which downloads on
  subscribers' behalf) still wants. Preserves the CANCELLING escape hatch and
  cleans up orphaned files via check_and_delete_orphan.
- trigger_download registers/reactivates the caller's ref up front so download
  intent is ref-counted symmetrically with cancel.

Tests (tests/test_download_lifecycle.py): watchdog kills a silently hung process
and raises; cancel honored during a silent hang; task marks FAILED on a terminal
stall; reaper resets stranded DOWNLOADING/CANCELLING rows while leaving a live
download untouched; two-subscriber cancel keeps the other subscriber's download
running, then truly cancels when the last ref drops.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RXMKM1rDWn8wNh93MMUtxY

The watchdog's `self._stop = threading.Event()` shadowed `threading.Thread._stop`,
an internal method that `Thread.join()` -> `_wait_for_tstate_lock()` calls on some
CPython versions. On Python 3.12 (CI) this raised "'Event' object is not callable"
during the `finally: watchdog.join()` in download_video; 3.13 (local) doesn't hit
that path, which is why it passed locally but failed in CI. Renaming the attribute
restores the inherited method.
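
A small sketch of the safer naming, assuming a `threading.Thread` subclass like the one described: the stop flag lives in `_stop_event` (a hypothetical name, the renamed attribute is not given above), and `shadowed_thread_attrs` is an illustrative helper that reports any instance attribute added by a subclass that hides a name defined on `threading.Thread`.

```python
import threading


class Watchdog(threading.Thread):
    """Polling thread whose stop flag does not collide with Thread internals."""

    def __init__(self, interval=0.01):
        super().__init__(daemon=True)
        self._stop_event = threading.Event()  # not `_stop`
        self.interval = interval
        self.ticks = 0

    def run(self):
        # Event.wait returns True once stop() has been called.
        while not self._stop_event.wait(self.interval):
            self.ticks += 1

    def stop(self):
        self._stop_event.set()


def shadowed_thread_attrs(thread):
    """Names set by a subclass instance that hide an attribute of Thread."""
    baseline = set(vars(threading.Thread()))  # attributes every Thread has
    own = set(vars(thread)) - baseline
    return sorted(own & set(dir(threading.Thread)))
```

On an interpreter where `Thread` defines `_stop`, a subclass that assigns `self._stop = threading.Event()` would show up as `["_stop"]` in this check, while the `Watchdog` above reports an empty list and joins cleanly.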

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RXMKM1rDWn8wNh93MMUtxY
@windoze95 windoze95 merged commit cc43a58 into main Jun 27, 2026
5 checks passed
@windoze95 windoze95 deleted the fix/backend-download-lifecycle branch June 27, 2026 20:48

Development

Successfully merging this pull request may close these issues.

Bug: Canceling a Download Affects All Users Globally
Bug: Subprocess Deadlock in yt-dlp Download Wrapper
