fix: retry only transient Celery errors with exponential backoff (#664) by AmSach · Pull Request #670 · param20h/PDF-Assistant-RAG

AmSach · 2026-06-23T02:01:35Z

Fixes #664.

Problem

The process_document Celery task was configured with autoretry_for=(Exception,) and default_retry_delay=30. This retried every error (including ValidationException, ValueError, KeyError, etc.) on a flat 30s schedule with no backoff and no jitter.

In practice this meant a bad PDF, a missing field, or a programming bug would:

- get retried 3 more times, each one spending 30s and another round-trip to an external API,
- still fail at the end (because the same bad input was re-submitted unchanged),
- and only on the last attempt did the document row get marked as failed, so the UI showed a misleading processing status throughout the wasted retries.

What this PR changes

`backend/app/tasks.py`

Define an explicit TRANSIENT_ERRORS tuple containing only error types worth retrying:
- ExternalServiceException, RateLimitException (custom app errors)
- ConnectionError, TimeoutError, OSError (stdlib)
- httpx.ConnectError, ReadTimeout, WriteTimeout, PoolTimeout, ConnectTimeout, RemoteProtocolError, NetworkError (the HTTP client the rest of the app already uses)
ValidationException, NotFoundException, UnauthorizedException, ForbiddenException, ConflictException, UnsafePromptException, ValueError, KeyError, TypeError and other programming bugs are intentionally excluded - re-running them just wastes worker time and external API quota.
Switch the retry schedule from a flat default_retry_delay=30 to:
- retry_backoff=True
- retry_backoff_max=600
- retry_jitter=True
So retries use exponential backoff with jitter, capped at 10 minutes.
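
To make the schedule concrete, here is a minimal sketch of how the delay ceiling grows. It assumes Celery's documented behavior: `retry_backoff=True` uses a factor of 1 second, and `retry_jitter=True` treats the computed delay as an upper bound and picks a random value between zero and it. The helper names `backoff_delay` and `backoff_ceilings` are hypothetical and not part of this PR or of Celery.

```python
import random


def backoff_delay(retries, factor=1, maximum=600, jitter=True):
    """Approximate the countdown (in seconds) before retry number `retries` + 1.

    Assumption: delay = factor * 2**retries, capped at `maximum`;
    with jitter, a random integer in [0, delay] is used instead.
    """
    delay = min(maximum, factor * (2 ** retries))
    if jitter:
        delay = random.randrange(delay + 1)
    return delay


def backoff_ceilings(max_retries=3, factor=1, maximum=600):
    """Upper bound of the delay before each retry, ignoring jitter."""
    return [
        backoff_delay(n, factor=factor, maximum=maximum, jitter=False)
        for n in range(max_retries)
    ]


print(backoff_ceilings())  # ceilings for the three retries
```

Under these assumptions the ceilings for `max_retries=3` are 1, 2 and 4 seconds, so the 600-second cap would only come into play with more retries or a larger backoff factor.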
Fix the 'mark doc as failed' bookkeeping so it matches the new retry semantics:
- Non-transient errors (e.g. ValidationException) are marked failed on the first attempt - no point waiting for retries that will never help.
- Transient errors (e.g. ExternalServiceException) only get marked failed once retries are exhausted, so the UI doesn't show a misleading failed status before the retry path has had a chance.
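
The classification and bookkeeping rules above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `backend/app/tasks.py`: the exception classes are stand-ins for the app's own, the tuple omits the httpx classes to stay dependency-free, and `is_transient` and `should_mark_failed` are hypothetical helper names.

```python
# Stand-ins for the app's custom exceptions (the real classes live in the app).
class ExternalServiceException(Exception):
    pass


class RateLimitException(Exception):
    pass


class ValidationException(Exception):
    pass


# Assumption: reduced version of the tuple; the PR also lists httpx errors.
# Note that ConnectionError and TimeoutError are OSError subclasses in Python 3.
TRANSIENT_ERRORS = (
    ExternalServiceException,
    RateLimitException,
    ConnectionError,
    TimeoutError,
    OSError,
)

MAX_RETRIES = 3  # mirrors max_retries=3 from the decorator config


def is_transient(exc):
    """True if the error is one that a retry might plausibly fix."""
    return isinstance(exc, TRANSIENT_ERRORS)


def should_mark_failed(exc, retries_so_far, max_retries=MAX_RETRIES):
    """Decide whether the document row should be flipped to 'failed' now.

    Non-transient errors fail immediately; transient errors fail only
    once every retry has been used up.
    """
    if not is_transient(exc):
        return True
    return retries_so_far >= max_retries
```

For example, `should_mark_failed(ValidationException("bad pdf"), 0)` is true on the first attempt, while `should_mark_failed(ExternalServiceException("503"), 0)` stays false until `retries_so_far` reaches `max_retries`.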

`backend/tests/test_celery_ingestion.py`

Add three regression tests:

- test_non_transient_error_marks_failed_immediately - ValidationException → failed on first attempt, no autoretry, error traceback recorded.
- test_transient_error_does_not_mark_failed_before_retries_exhausted - ExternalServiceException → task ends in FAILURE for the current attempt, but the document row is not marked failed while retries remain.
- test_task_uses_exponential_backoff_with_jitter - locks in the new decorator settings (retry_backoff=True, retry_backoff_max=600, retry_jitter=True; autoretry_for excludes bare Exception and includes ExternalServiceException and RateLimitException).

How I tested

- python3 -c "import ast; ast.parse(...)" syntax check on both edited files.
- Static check of the decorator kwargs and TRANSIENT_ERRORS contents against the issue's spec.
- End-to-end behavioral verification with Celery 5.6 in EAGER mode, in-memory broker, and stubbed app modules:
  - Decorator config: max_retries=3, autoretry_for=TRANSIENT_ERRORS (no bare Exception), retry_backoff=True, retry_backoff_max=600, retry_jitter=True.
  - Classification: ValidationException is non-transient; ExternalServiceException, RateLimitException, ConnectionError, TimeoutError are transient.
  - Behavior: ValidationException from _ingest_document → task FAILURE, document marked failed, last_error_traceback populated.
- The existing test_process_document_runs_real_ingestion_pipeline and test_process_document_marks_failed_when_no_text_extracted tests are unaffected (they don't raise exceptions through the retry path).
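
For reference, the kind of `ast.parse` syntax check mentioned in the first step could look roughly like the sketch below. The helper name `has_valid_syntax` is hypothetical, and the sample strings are made up for illustration; the exact command used in this PR is not shown beyond the elided one-liner.

```python
import ast


def has_valid_syntax(source, filename="<string>"):
    """Return True if `source` parses as Python, without executing it."""
    try:
        ast.parse(source, filename=filename)
    except SyntaxError:
        return False
    return True


print(has_valid_syntax("x = 1\n"))
print(has_valid_syntax("def broken(:\n"))
```

A check like this only catches syntax errors; it says nothing about imports or runtime behavior, which is why the behavioral verification in EAGER mode is listed separately.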

Notes

requests is not in requirements.txt, so the HTTP-client retries use httpx (which the app already depends on) instead of requests.exceptions.
Targeted at dev per the repo's contributing guidelines.

Issue param20h#664: autoretry_for=(Exception,) retried every error (including validation errors) with a fixed 30s delay and no backoff.

Replace it with an explicit TRANSIENT_ERRORS tuple containing only upstream/IO classes that are actually worth retrying:
- ExternalServiceException, RateLimitException (custom app errors)
- ConnectionError, TimeoutError, OSError (stdlib)
- httpx.ConnectError / ReadTimeout / WriteTimeout / PoolTimeout / ConnectTimeout / RemoteProtocolError / NetworkError

ValidationException, NotFoundException, ValueError, KeyError, TypeError and other programming bugs are intentionally excluded: re-running them just wastes worker time and hits external APIs.

Switch the retry schedule from default_retry_delay=30 to:
- retry_backoff=True
- retry_backoff_max=600
- retry_jitter=True

So retries use exponential backoff with jitter, capped at 10 minutes.

Also fix the 'mark doc as failed' bookkeeping: non-transient errors are now marked failed on the very first attempt (no point waiting for retries that will never help), while transient errors only get marked failed once retries are exhausted. Previously, transient errors also flipped the doc to 'failed' early, which confused users who saw a 'failed' status before the retry path had even had a chance.

Add regression tests covering:
- non-transient error -> failed on first attempt
- transient error -> not marked failed while retries remain
- task decorator applies the new retry policy

AmSach requested a review from param20h as a code owner June 23, 2026 02:01

AmSach wants to merge 1 commit into param20h:dev from AmSach:fix/celery-retry-transient-only
