[BUG] : Cleanup jobs silently skip roughly half of eligible documents due to OFFSET pagination over a self-mutating query #638

@hariom888

Description of the Bug

File: backend/app/services/cleanup.py (cleanup_stale_documents, cleanup_old_deleted_documents, cleanup_inactive_active_documents)

All three scheduled cleanup functions paginate with the same pattern. Each loop iteration fetches .filter(...).limit(_CLEANUP_BATCH_SIZE).offset(offset).all(), mutates every row in the batch (either flipping status to "failed" or calling db.delete(doc)), then advances offset += _CLEANUP_BATCH_SIZE for the next iteration. Nothing is committed per batch: the surrounding get_db_session() context manager (per backend/app/database.py) commits only once, when the entire with block exits.

Because the SQLAlchemy session autoflushes by default, the pending UPDATE/DELETE statements from each batch are flushed to the database before the next query executes in the same transaction. Every row that was just mutated or deleted therefore drops out of the next iteration's filter: a doc just marked status = "failed" no longer matches Document.status == "processing", and a deleted row no longer matches at all. The matching result set shrinks by one full batch per iteration while offset still advances by a full batch, so each iteration after the first skips an entire batch's worth of still-eligible rows. None of the three queries has an explicit ORDER BY either, so row ordering across calls within the mutating transaction is not guaranteed to be stable.
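The skipping described above can be reproduced without the project code. The following is a minimal sketch using the standard-library sqlite3 module rather than SQLAlchemy. The table layout, the BATCH_SIZE value, and the buggy_cleanup name are assumptions made for illustration, not taken from cleanup.py. An ORDER BY id is added only to make the demonstration deterministic, and the loop is assumed to stop on the first empty batch.

```python
import sqlite3

BATCH_SIZE = 10  # hypothetical stand-in for _CLEANUP_BATCH_SIZE


def make_db(n_rows):
    """Create an in-memory table with n_rows documents stuck in 'processing'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO documents (status) VALUES (?)",
        [("processing",)] * n_rows,
    )
    conn.commit()
    return conn


def buggy_cleanup(conn):
    """Advance OFFSET while mutating the very rows the filter matches."""
    offset = 0
    processed = 0
    while True:
        rows = conn.execute(
            "SELECT id FROM documents WHERE status = 'processing' "
            "ORDER BY id LIMIT ? OFFSET ?",
            (BATCH_SIZE, offset),
        ).fetchall()
        if not rows:
            break
        for (doc_id,) in rows:
            # Uncommitted, but visible to the next SELECT on this connection,
            # which mirrors the effect of autoflush inside one transaction.
            conn.execute(
                "UPDATE documents SET status = 'failed' WHERE id = ?", (doc_id,)
            )
        processed += len(rows)
        offset += BATCH_SIZE  # the bug: the matching set already shrank
    conn.commit()  # single commit at the end, as in the described pattern
    return processed


conn = make_db(100)
processed = buggy_cleanup(conn)
remaining = conn.execute(
    "SELECT COUNT(*) FROM documents WHERE status = 'processing'"
).fetchone()[0]
print(processed, remaining)  # prints: 50 50
```

With 100 eligible rows and a batch size of 10, the loop handles ids 1-10, 21-30, 41-50, 61-70, and 81-90, then sees an empty page at offset 50 and exits with half the rows untouched.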

Steps to Reproduce

Proposed fix approach: don't advance offset across iterations at all. Each batch's mutation already removes those rows from the matching set, so re-querying with offset=0 each time converges correctly without tracking a manually advancing offset. Better still, order by primary key and commit per batch, so that the next SELECT ... LIMIT simply picks up the rows that remain.
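A rough sketch of the proposed approach, again using the standard-library sqlite3 module as a stand-in for the SQLAlchemy session. The schema, BATCH_SIZE, and the fixed_cleanup name are hypothetical. In the real service the equivalent change would presumably be to drop .offset(...), add an ordering on the primary key, and commit inside the loop.

```python
import sqlite3

BATCH_SIZE = 10  # hypothetical stand-in for _CLEANUP_BATCH_SIZE


def fixed_cleanup(conn):
    """Always read the first page of what still matches; commit per batch."""
    processed = 0
    while True:
        rows = conn.execute(
            "SELECT id FROM documents WHERE status = 'processing' "
            "ORDER BY id LIMIT ?",
            (BATCH_SIZE,),
        ).fetchall()
        if not rows:
            break
        # Each row is a 1-tuple (id,), which executemany accepts directly.
        conn.executemany(
            "UPDATE documents SET status = 'failed' WHERE id = ?", rows
        )
        conn.commit()  # per-batch commit keeps each transaction small
        processed += len(rows)
    return processed


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO documents (status) VALUES (?)", [("processing",)] * 95
)
conn.commit()

processed = fixed_cleanup(conn)
remaining = conn.execute(
    "SELECT COUNT(*) FROM documents WHERE status = 'processing'"
).fetchone()[0]
print(processed, remaining)  # prints: 95 0
```

This loop terminates only because every processed row stops matching the filter. If some rows could fail to change state, a keyset cursor (filtering on id greater than the last id seen) would be a safer way to guarantee forward progress.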

Expected Behavior

Net effect: with N eligible rows where N > _CLEANUP_BATCH_SIZE, a single scheduled run processes roughly half of them and silently leaves the rest unprocessed. Stale "processing" documents that should be marked failed remain stuck. Old soft-deleted or inactive documents that should be purged (per DOC_CLEANUP_MAX_AGE_DAYS / DOC_CLEANUP_INACTIVE_DAYS) are retained beyond their configured retention window, which contradicts the docstring's stated retention guarantee and causes unbounded storage growth.

Screenshots / Logs

No response

Environment

Windows

GSSoC '26

  • Yes, I am participating in GirlScript Summer of Code and would like to fix this.

Metadata

Labels

bug (Something isn't working), gssoc (GirlScript Summer of Code 2026 issue/PR)
