fix: retry only transient Celery errors with exponential backoff (#664) #670

Open
AmSach wants to merge 1 commit into param20h:dev from AmSach:fix/celery-retry-transient-only

@AmSach AmSach commented Jun 23, 2026


Fixes #664.

Problem

The process_document Celery task was configured with autoretry_for=(Exception,) and default_retry_delay=30. This retried every error (including ValidationException, ValueError, KeyError, etc.) on a flat 30s schedule with no backoff and no jitter.

In practice this meant a bad PDF, a missing field, or a programming bug would:

  • get retried 3 more times, each one spending 30s and another round-trip to an external API,
  • still fail at the end (because the same bad input was re-submitted unchanged),
  • and only on the last attempt did the document row get marked as failed, so the UI showed a misleading processing status throughout the wasted retries.

What this PR changes

backend/app/tasks.py

  1. Define an explicit TRANSIENT_ERRORS tuple containing only error types worth retrying:

    • ExternalServiceException, RateLimitException (custom app errors)
    • ConnectionError, TimeoutError, OSError (stdlib)
    • httpx.ConnectError, ReadTimeout, WriteTimeout, PoolTimeout, ConnectTimeout, RemoteProtocolError, NetworkError (the HTTP client the rest of the app already uses)

    ValidationException, NotFoundException, UnauthorizedException, ForbiddenException, ConflictException, UnsafePromptException, ValueError, KeyError, TypeError and other programming bugs are intentionally excluded - re-running them just wastes worker time and external API quota.

  2. Switch the retry schedule from a flat default_retry_delay=30 to:

    • retry_backoff=True
    • retry_backoff_max=600
    • retry_jitter=True

    So retries use exponential backoff with jitter, capped at 10 minutes.

  3. Fix the 'mark doc as failed' bookkeeping so it matches the new retry semantics:

    • Non-transient errors (e.g. ValidationException) are marked failed on the first attempt - no point waiting for retries that will never help.
    • Transient errors (e.g. ExternalServiceException) only get marked failed once retries are exhausted, so the UI doesn't show a misleading failed status before the retry path has had a chance.
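
The classification in item 1 and the bookkeeping rule in item 3 can be sketched together as below. This is a minimal illustration, not the code in backend/app/tasks.py: the exception classes are hypothetical stand-ins for the app's own, `is_transient` and `should_mark_failed` are names invented for this sketch, and `retries_so_far` is assumed to correspond to the retry count Celery exposes on a bound task's request. The httpx classes are added only if httpx is importable.

```python
# Hypothetical stand-ins for the app's custom exceptions; the real classes
# live in the application and may derive from a different base class.
class ExternalServiceException(Exception):
    pass


class RateLimitException(Exception):
    pass


class ValidationException(Exception):
    pass


try:
    import httpx

    _HTTPX_TRANSIENT = (
        httpx.ConnectError,
        httpx.ReadTimeout,
        httpx.WriteTimeout,
        httpx.PoolTimeout,
        httpx.ConnectTimeout,
        httpx.RemoteProtocolError,
        httpx.NetworkError,
    )
except ImportError:  # keep the sketch runnable without httpx installed
    _HTTPX_TRANSIENT = ()

# Only upstream/IO failures are considered worth retrying.
TRANSIENT_ERRORS = (
    ExternalServiceException,
    RateLimitException,
    ConnectionError,
    TimeoutError,
    OSError,
) + _HTTPX_TRANSIENT

MAX_RETRIES = 3  # matches max_retries=3 in the decorator config


def is_transient(exc: BaseException) -> bool:
    """Return True if the error belongs to a class listed in TRANSIENT_ERRORS."""
    return isinstance(exc, TRANSIENT_ERRORS)


def should_mark_failed(exc: BaseException, retries_so_far: int,
                       max_retries: int = MAX_RETRIES) -> bool:
    """Decide whether the document row should be flipped to 'failed' now."""
    if not is_transient(exc):
        # Retrying cannot help, so fail on the first attempt.
        return True
    # Transient: only fail once the retry budget is used up.
    return retries_so_far >= max_retries
```

Note that in the standard library ConnectionError and TimeoutError are both subclasses of OSError, so listing OSError already covers them; naming them separately mainly documents intent.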

backend/tests/test_celery_ingestion.py

Add three regression tests:

  • test_non_transient_error_marks_failed_immediately - ValidationException → failed on first attempt, no autoretry, error traceback recorded.
  • test_transient_error_does_not_mark_failed_before_retries_exhausted - ExternalServiceException → task ends in FAILURE for the current attempt, but the document row is not marked failed while retries remain.
  • test_task_uses_exponential_backoff_with_jitter - locks in the new decorator settings (retry_backoff=True, retry_backoff_max=600, retry_jitter=True, Exception ∉ autoretry_for, includes ExternalServiceException and RateLimitException).
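
To make the schedule implied by those decorator settings concrete, here is a rough stdlib-only sketch of the delay calculation as the Celery documentation describes it: with retry_backoff=True the base delay is 1 second and doubles per retry, retry_backoff_max caps it, and retry_jitter=True treats the result as an upper bound and picks a random value between zero and that bound. `backoff_delay` is a name made up for this illustration and is not part of Celery's API.

```python
import random


def backoff_delay(retries: int, factor: int = 1, maximum: int = 600,
                  jitter: bool = True) -> int:
    """Approximate the delay (in seconds) before the next retry.

    retries is the number of retries already performed (0 for the first retry).
    factor=1 corresponds to retry_backoff=True; maximum to retry_backoff_max.
    """
    ceiling = min(maximum, factor * (2 ** retries))
    if jitter:
        # Full jitter: any whole number of seconds from 0 up to the ceiling.
        return random.randrange(ceiling + 1)
    return ceiling


# Upper bounds for the three retries allowed by max_retries=3.
schedule = [backoff_delay(n, jitter=False) for n in range(3)]
```

Under these assumptions the three retries wait at most 1, 2 and 4 seconds, so the 600-second cap would only matter with a larger max_retries or a numeric retry_backoff factor.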

How I tested

  1. python3 -c "import ast; ast.parse(...)" syntax check on both edited files.
  2. Static check of the decorator kwargs and TRANSIENT_ERRORS contents against the issue's spec.
  3. End-to-end behavioral verification with Celery 5.6 in EAGER mode, in-memory broker, and stubbed app modules:
    • Decorator config: max_retries=3, autoretry_for=TRANSIENT_ERRORS (no bare Exception), retry_backoff=True, retry_backoff_max=600, retry_jitter=True.
    • Classification: ValidationException is non-transient; ExternalServiceException, RateLimitException, ConnectionError, TimeoutError are transient.
    • Behavior: ValidationException from _ingest_document → task FAILURE, document marked failed, last_error_traceback populated.
  4. The existing test_process_document_runs_real_ingestion_pipeline and test_process_document_marks_failed_when_no_text_extracted tests are unaffected (they don't raise exceptions through the retry path).
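
The syntax check in step 1 can be reproduced with something like the following. It is a sketch of the general approach, not the exact command that was run; `syntax_error_in` is a hypothetical helper name.

```python
import ast
from typing import Optional


def syntax_error_in(source: str, filename: str = "<string>") -> Optional[str]:
    """Return a short description of the first syntax error, or None if it parses."""
    try:
        ast.parse(source, filename=filename)
    except SyntaxError as exc:
        return f"{filename}:{exc.lineno}: {exc.msg}"
    return None
```

For the two edited files this would be called with the contents of backend/app/tasks.py and backend/tests/test_celery_ingestion.py, for example via `pathlib.Path(path).read_text()`. A clean parse only rules out syntax errors; it does not import the modules or exercise any behavior.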

Notes

  • requests is not in requirements.txt, so the HTTP-client entries in TRANSIENT_ERRORS come from httpx (which the app already depends on) rather than requests.exceptions.
  • Targeted at dev per the repo's contributing guidelines.

Issue param20h#664: autoretry_for=(Exception,) retried every error (including
validation errors) with a fixed 30s delay and no backoff. Replace it with
an explicit TRANSIENT_ERRORS tuple containing only upstream/IO classes
that are actually worth retrying:

- ExternalServiceException, RateLimitException (custom app errors)
- ConnectionError, TimeoutError, OSError (stdlib)
- httpx.ConnectError / ReadTimeout / WriteTimeout / PoolTimeout /
  ConnectTimeout / RemoteProtocolError / NetworkError

ValidationException, NotFoundException, ValueError, KeyError, TypeError
and other programming bugs are intentionally excluded: re-running them
just wastes worker time and hits external APIs.

Switch the retry schedule from default_retry_delay=30 to:
  retry_backoff=True
  retry_backoff_max=600
  retry_jitter=True

So retries use exponential backoff with jitter, capped at 10 minutes.

Also fix the 'mark doc as failed' bookkeeping: non-transient errors are
now marked failed on the very first attempt (no point waiting for
retries that will never help), while transient errors only get marked
failed once retries are exhausted. Previously, transient errors also
flipped the doc to 'failed' early, which confused users who saw a
'failed' status before the retry path had even had a chance.

Add regression tests covering: non-transient error -> failed on first
attempt; transient error -> not marked failed while retries remain;
task decorator applies the new retry policy.
@AmSach AmSach requested a review from param20h as a code owner June 23, 2026 02:01