fix: self-healing downloads + per-user cancel ref-counting (#31, #32)#69
Merged
Conversation
…l ref-counting (#32) Task 1 - self-healing downloads (#31): - download_manager: a watchdog thread now runs alongside the stdout reader and kills yt-dlp on a silent hang (no output for 300s), past a 4h overall deadline, or on cancellation. Previously process.wait(timeout) only ran AFTER the read loop, so a silent hang (stdout open, no output) blocked the worker forever and the per-line cancel check could never fire. Cancellation now works mid-hang. download_video also exposes a heartbeat_callback and always reaps the child process in a finally block. - download_video_task: Celery soft/hard time limits as a coarse backstop above the watchdog deadline; stamps videos.download_heartbeat_at on the DOWNLOADING transition and refreshes it (throttled, own session) as output flows. - download_reaper + reap_stuck_downloads_task: reset rows stranded by a crashed worker - DOWNLOADING -> PENDING (and re-enqueue), CANCELLING -> CATALOGED - detected via a stale heartbeat. Runs on worker startup (worker_ready) and every 5 min via Celery beat. Reset-to-PENDING clears the start guard so retries run. - migration 007_download_heartbeat: add nullable videos.download_heartbeat_at (down_revision 006_hotpath_indexes). Task 2 - per-user cancel/re-trigger ref-counting (#32): - cancel_download drops ONLY the caller's active UserVideoRef and tears down the shared download solely when no active ref remains, so one user's cancel can no longer kill a download another subscriber (or the scheduler, which downloads on subscribers' behalf) still wants. Preserves the CANCELLING escape hatch and cleans up orphaned files via check_and_delete_orphan. - trigger_download registers/reactivates the caller's ref up front so download intent is ref-counted symmetrically with cancel. Tests (tests/test_download_lifecycle.py): watchdog kills a silently hung process and raises; cancel honored during a silent hang; task marks FAILED on a terminal stall; reaper resets stranded DOWNLOADING/CANCELLING rows while leaving a live download untouched; two-subscriber cancel keeps the other subscriber's download running, then truly cancels when the last ref drops. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RXMKM1rDWn8wNh93MMUtxY
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
The watchdog's `self._stop = threading.Event()` shadowed `threading.Thread._stop`, an internal method that `Thread.join()` -> `_wait_for_tstate_lock()` calls on some CPython versions. On Python 3.12 (CI) this raised "'Event' object is not callable" during the `finally: watchdog.join()` in download_video; 3.13 (local) doesn't hit that path, which is why it passed locally but failed in CI. Renaming the attribute restores the inherited method. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RXMKM1rDWn8wNh93MMUtxY
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Two backend download-reliability fixes (roadmap NEXT tier).
Fixes #31
Fixes #32
#31 — Self-healing downloads
Root cause: the stdout read loop only called
process.wait(timeout)after it exited, so a silent hang (process alive, socket dead, no output, no EOF) blocked the worker forever and the per-line cancel check couldn't fire. A crashed worker also stranded the row in DOWNLOADING indefinitely.download_manager.py) runs beside the reader and kills the process group on: no output forNO_OUTPUT_TIMEOUT_SECONDS(300s), runtime pastOVERALL_DEADLINE_SECONDS(4h), orcancel_check()→ True (so cancel now works mid-hang). Killing closes the pipe, unblocking the reader, which raisesRuntimeError(stall/deadline) orDownloadCancelled. Afinallyalways reaps the child.soft_time_limit/time_limitsit just above the watchdog as a coarse backstop.videos.download_heartbeat_atcolumn (stamped atomically with the DOWNLOADING transition, refreshed ~every 30s from yt-dlp output);reap_stuck_downloadsresets rows whose heartbeat is stale > 30min (DOWNLOADING→PENDING + re-enqueue, CANCELLING→CATALOGED). Runs onworker_readyand every 5min via beat, so a crashed worker self-heals.007_download_heartbeat.#32 — Per-user cancel/re-trigger ref-counting
Cancel previously mutated the shared
Video.statuswith no ownership check, so one user's Cancel killed a download others still wanted. Built on the existingUserVideoRef(it is the reference count — no new table):{stopped: false}without touchingVideo.status(covers the scheduler, which downloads on subscribers' behalf); only when none remain does it truly cancel (PENDING→CATALOGED, DOWNLOADING→CANCELLING). Keeps the CANCELLING escape-hatch and orphan cleanup.Did not modify community PRs #56/#59/#66 (they add no ref-counting).
Verification (local)
pytest -q→ 86 passed (+6 new intest_download_lifecycle.py: watchdog kills a silent hang; cancel honored mid-hang; terminal stall → FAILED; reaper resets stranded rows; two-subscriber cancel keeps B's download alive then truly cancels on the last ref drop; trigger registers a ref).alembic upgrade head/downgrade -1round-trip on a throwaway DB.ruff check/ruff format --checkclean.🤖 Generated with Claude Code
https://claude.ai/code/session_01RXMKM1rDWn8wNh93MMUtxY