fix: retry only transient Celery errors with exponential backoff (#664)#670
Open
AmSach wants to merge 1 commit into
Open
fix: retry only transient Celery errors with exponential backoff (#664)#670AmSach wants to merge 1 commit into
AmSach wants to merge 1 commit into
Conversation
Issue param20h#664: autoretry_for=(Exception,) retried every error (including validation errors) with a fixed 30s delay and no backoff. Replace it with an explicit TRANSIENT_ERRORS tuple containing only upstream/IO classes that are actually worth retrying: - ExternalServiceException, RateLimitException (custom app errors) - ConnectionError, TimeoutError, OSError (stdlib) - httpx.ConnectError / ReadTimeout / WriteTimeout / PoolTimeout / ConnectTimeout / RemoteProtocolError / NetworkError ValidationException, NotFoundException, ValueError, KeyError, TypeError and other programming bugs are intentionally excluded: re-running them just wastes worker time and hits external APIs. Switch the retry schedule from default_retry_delay=30 to: retry_backoff=True retry_backoff_max=600 retry_jitter=True So retries use exponential backoff with jitter, capped at 10 minutes. Also fix the 'mark doc as failed' bookkeeping: non-transient errors are now marked failed on the very first attempt (no point waiting for retries that will never help), while transient errors only get marked failed once retries are exhausted. Previously, transient errors also flipped the doc to 'failed' early, which confused users who saw a 'failed' status before the retry path had even had a chance. Add regression tests covering: non-transient error -> failed on first attempt; transient error -> not marked failed while retries remain; task decorator applies the new retry policy.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #664.
Problem
The
process_documentCelery task was configured withautoretry_for=(Exception,)anddefault_retry_delay=30. This retried every error (includingValidationException,ValueError,KeyError, etc.) on a flat 30s schedule with no backoff and no jitter.In practice this meant a bad PDF, a missing field, or a programming bug would:
failed, so the UI showed a misleadingprocessingstatus throughout the wasted retries.What this PR changes
backend/app/tasks.pyDefine an explicit
TRANSIENT_ERRORStuple containing only error types worth retrying:ExternalServiceException,RateLimitException(custom app errors)ConnectionError,TimeoutError,OSError(stdlib)httpx.ConnectError,ReadTimeout,WriteTimeout,PoolTimeout,ConnectTimeout,RemoteProtocolError,NetworkError(the HTTP client the rest of the app already uses)ValidationException, NotFoundException, UnauthorizedException, ForbiddenException, ConflictException, UnsafePromptException, ValueError, KeyError, TypeError and other programming bugs are intentionally excluded - re-running them just wastes worker time and external API quota.
Switch the retry schedule from a flat
default_retry_delay=30to:retry_backoff=Trueretry_backoff_max=600retry_jitter=TrueSo retries use exponential backoff with jitter, capped at 10 minutes.
Fix the
mark doc as failedbookkeeping so it matches the new retry semantics:ValidationException) are markedfailedon the first attempt - no point waiting for retries that will never help.ExternalServiceException) only get markedfailedonce retries are exhausted, so the UI doesn't show a misleadingfailedstatus before the retry path has had a chance.backend/tests/test_celery_ingestion.pyAdd three regression tests:
test_non_transient_error_marks_failed_immediately-ValidationException→failedon first attempt, no autoretry, error traceback recorded.test_transient_error_does_not_mark_failed_before_retries_exhausted-ExternalServiceException→ task ends in FAILURE for the current attempt, but the document row is not markedfailedwhile retries remain.test_task_uses_exponential_backoff_with_jitter- locks in the new decorator settings (retry_backoff=True,retry_backoff_max=600,retry_jitter=True,Exception ∉ autoretry_for, includesExternalServiceExceptionandRateLimitException).How I tested
python3 -c "import ast; ast.parse(...)"syntax check on both edited files.TRANSIENT_ERRORScontents against the issue's spec.max_retries=3,autoretry_for=TRANSIENT_ERRORS(no bareException),retry_backoff=True,retry_backoff_max=600,retry_jitter=True.ValidationExceptionis non-transient;ExternalServiceException,RateLimitException,ConnectionError,TimeoutErrorare transient.ValidationExceptionfrom_ingest_document→ task FAILURE, document markedfailed,last_error_tracebackpopulated.test_process_document_runs_real_ingestion_pipelineandtest_process_document_marks_failed_when_no_text_extractedtests are unaffected (they don't raise exceptions through the retry path).Notes
requestsis not inrequirements.txt, so the HTTP-client retries usehttpx(which the app already depends on) instead ofrequests.exceptions.devper the repo's contributing guidelines.