Description of the Bug
File: backend/app/services/cleanup.py (cleanup_stale_documents, cleanup_old_deleted_documents, cleanup_inactive_active_documents)
All three scheduled cleanup functions paginate with the same pattern: loop, fetch .filter(...).limit(_CLEANUP_BATCH_SIZE).offset(offset).all(), mutate every row in the batch (either set status to "failed" or call db.delete(doc)), then advance offset += _CLEANUP_BATCH_SIZE for the next iteration. Nothing is committed per batch: the surrounding get_db_session() context manager (per backend/app/database.py) commits only once, when the entire with block exits.
Because the SQLAlchemy session autoflushes by default, the pending UPDATE/DELETE statements from each batch are flushed to the database before the next query executes within the same transaction. Each row that was just mutated or deleted therefore drops out of the next iteration's filter (e.g. a doc just marked status = "failed" no longer matches Document.status == "processing"; a deleted row no longer matches at all). The matching result set shrinks by up to one full batch every iteration, but offset still advances by a full batch, so every iteration after the first skips past rows that are still eligible, and the skipped range grows by one batch each time. None of the three queries has an explicit ORDER BY either, so result ordering across calls within the mutating transaction isn't guaranteed to be stable.
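The skipping can be reproduced without the application at all. The following is a minimal sketch using the standard-library sqlite3 module rather than SQLAlchemy. The documents table, the BATCH_SIZE value and the run_offset_cleanup helper are hypothetical stand-ins for the real model, _CLEANUP_BATCH_SIZE and the cleanup functions. It also assumes the real loop stops on an empty batch, and it adds ORDER BY id (which the real queries lack) only to make the outcome deterministic.

```python
import sqlite3

BATCH_SIZE = 3  # hypothetical stand-in for _CLEANUP_BATCH_SIZE


def run_offset_cleanup(total_rows: int) -> int:
    """Run the advancing-offset loop once; return rows still 'processing'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO documents (status) VALUES (?)",
        [("processing",)] * total_rows,
    )

    offset = 0
    while True:
        # Same connection, same open transaction: this SELECT already sees
        # the UPDATEs from earlier iterations, like an autoflushed session.
        batch = conn.execute(
            "SELECT id FROM documents WHERE status = 'processing' "
            "ORDER BY id LIMIT ? OFFSET ?",
            (BATCH_SIZE, offset),
        ).fetchall()
        if not batch:
            break  # assumed termination condition
        conn.executemany(
            "UPDATE documents SET status = 'failed' WHERE id = ?", batch
        )
        offset += BATCH_SIZE  # the bug: the matching set has already shrunk

    remaining = conn.execute(
        "SELECT COUNT(*) FROM documents WHERE status = 'processing'"
    ).fetchone()[0]
    conn.close()
    return remaining


if __name__ == "__main__":
    print(run_offset_cleanup(12))  # rows left untouched after one run
```

With 12 eligible rows and a batch size of 3, the first pass handles ids 1-3, the second pass skips ids 4-6 and handles 7-9, and the third pass skips all six remaining rows and finds nothing, so half the rows are left behind.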
Steps to Reproduce
Proposed fix approach: don't advance offset across iterations at all. Since each batch's mutation removes those rows from the matching set, re-querying with offset=0 each time converges correctly without tracking a manually advancing offset. Better still, order by primary key and commit per batch, so the next SELECT ... LIMIT simply picks up the rows that remain.
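A rough sketch of that approach, again with sqlite3 instead of SQLAlchemy and with hypothetical names (documents table, BATCH_SIZE, run_fixed_cleanup) standing in for the real code. It illustrates the idea and is not the project's actual patch.

```python
import sqlite3

BATCH_SIZE = 3  # hypothetical stand-in for _CLEANUP_BATCH_SIZE


def run_fixed_cleanup(total_rows: int) -> int:
    """Re-query from the start each time; return rows still 'processing'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO documents (status) VALUES (?)",
        [("processing",)] * total_rows,
    )
    conn.commit()

    while True:
        # No OFFSET: rows handled earlier no longer match the filter, so the
        # first BATCH_SIZE matches are always the next rows to process.
        batch = conn.execute(
            "SELECT id FROM documents WHERE status = 'processing' "
            "ORDER BY id LIMIT ?",
            (BATCH_SIZE,),
        ).fetchall()
        if not batch:
            break
        conn.executemany(
            "UPDATE documents SET status = 'failed' WHERE id = ?", batch
        )
        conn.commit()  # commit per batch to keep each transaction short

    remaining = conn.execute(
        "SELECT COUNT(*) FROM documents WHERE status = 'processing'"
    ).fetchone()[0]
    conn.close()
    return remaining


if __name__ == "__main__":
    print(run_fixed_cleanup(12))
```

One caveat with this shape: the loop only terminates because every processed row stops matching the filter. If a mutation could ever leave a row in the matching set, the same batch would be fetched forever, so a guard such as a maximum iteration count may be worth adding.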
Expected Behavior
Net effect: with N eligible rows where N > _CLEANUP_BATCH_SIZE, a single scheduled run processes roughly half of them and silently leaves the rest unprocessed. Stale "processing" documents that should be marked failed remain stuck, and old soft-deleted/inactive documents that should be purged (per DOC_CLEANUP_MAX_AGE_DAYS / DOC_CLEANUP_INACTIVE_DAYS) are retained past their configured retention window. This contradicts the docstring's stated retention guarantee and lets storage grow whenever eligible rows accumulate faster than successive runs clear them.
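The "roughly half" figure can be checked with plain arithmetic, no database needed. This small model assumes stable row ordering and a loop that stops on the first empty batch. The function name rows_processed_in_one_run is made up for illustration.

```python
def rows_processed_in_one_run(eligible: int, batch_size: int) -> int:
    """Model the advancing-offset loop over a matching set that shrinks."""
    processed = 0
    offset = 0
    while True:
        still_matching = eligible - processed
        # OFFSET skips `offset` of the still-matching rows; LIMIT caps the rest.
        fetched = max(0, min(batch_size, still_matching - offset))
        if fetched == 0:
            break
        processed += fetched
        offset += batch_size
    return processed


if __name__ == "__main__":
    for n in (100, 250, 1000):
        print(n, rows_processed_in_one_run(n, 100))
```

For a batch size of 100, this gives 100 of 100, 150 of 250, and 500 of 1000 rows processed, so anything above one batch leaves a backlog for later runs.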
Screenshots / Logs
No response
Environment
Windows
GSSoC '26