Skip to content

fix(cleanup): commit each batch independently to prevent rollback of already-processed rows on partial job failure#650

Open
hariom888 wants to merge 1 commit into
param20h:devfrom
hariom888:fix/cleanup-batch-commit-pagination
Open

fix(cleanup): commit each batch independently to prevent rollback of already-processed rows on partial job failure#650
hariom888 wants to merge 1 commit into
param20h:devfrom
hariom888:fix/cleanup-batch-commit-pagination

Conversation

@hariom888

@hariom888 hariom888 commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes a data-integrity bug in backend/app/services/cleanup.py related to #638. All three scheduled cleanup functions (cleanup_stale_documents, cleanup_old_deleted_documents, cleanup_inactive_active_documents) previously ran their entire batch loop inside one with get_db_session() block, committing only once at the very end, while tracking a manually advancing offset across iterations.

What I found

I started from the offset-pagination-skips-rows mechanism described in #638 (autoflush propagating each batch's mutation into the same transaction, shrinking the result set faster than offset advances). I could not reproduce that specific mechanism in this codebase: SessionLocal in backend/app/database.py is explicitly configured with autoflush=False, so a batch's UPDATE/DELETE is never flushed to the DB before the next iteration's SELECT runs within the same transaction. I confirmed this with a standalone repro seeding 250 rows against the actual session config -- the original code processed all 250 correctly.

While tracing through the same code path, I found a more severe bug with the same root pattern (one long-lived transaction across all batches):

If any unhandled exception occurs partway through a large run, the entire job's transaction rolls back -- discarding every batch already processed in that call, not just the batch that failed. For cleanup_old_deleted_documents and cleanup_inactive_active_documents, this is a correctness/data-integrity problem, not just a lost-progress problem: _hard_delete_document performs irreversible side effects (vector-store deletion, file removal) before db.delete(doc). If an error elsewhere in the loop causes a rollback after several batches have already had their files/vectors permanently deleted, those documents' Document rows reappear in the database -- now orphaned, pointing at vector embeddings and files that no longer exist.

I verified this with fault injection: seeded 250 eligible rows (2.5x the batch size of 100), forced an unhandled error on the 220th document processed (well into the third batch). Before this fix, all 250 rows reappeared in the table after rollback, despite ~200 already having their underlying data permanently deleted. After this fix, the 200 rows from the two already-committed batches stay correctly purged; only the uncommitted remainder survives.

Fix

  • Removed the manually advancing offset from all three functions. Each iteration re-runs the same filtered query from the top with only LIMIT applied (no OFFSET) -- once a batch is committed, those rows no longer match the filter, so they naturally drop out of the next query's results without needing to track how many rows to skip.
  • Added .order_by(Document.id) to all three queries for deterministic, repeatable pagination across iterations (none of the queries had an explicit order before).
  • Moved with get_db_session() inside the while True: loop so each batch commits independently, instead of one commit at the very end of the whole job.
  • Each function now returns the total number of rows processed (was implicitly returning None), which is useful for logging/monitoring and is exercised directly in the new tests.

Tests

Added backend/tests/test_cleanup.py:

  • Seeds 2.5x the batch size (250 rows with _CLEANUP_BATCH_SIZE = 100) for each of the three cleanup functions and asserts a single call processes all of them.
  • A dedicated test (test_cleanup_old_deleted_documents_preserves_earlier_committed_batches_on_failure) reproduces the partial-failure corruption scenario above via fault injection on _hard_delete_document, asserting that rows from already-committed batches survive a later batch's unhandled failure.
  • A test for the DOC_CLEANUP_ENABLED=False early-return path.

I ran these against the real cleanup.py / database.py / models.py code paths using a minimal standalone harness, but was not able to run them through this repo's full pytest/conftest.py stack in my environment (missing several heavy dependencies unrelated to this change, e.g. prometheus_client via app.main). Please run pytest tests/test_cleanup.py -v locally before merging -- happy to fix up any fixture/mocking mismatches with the actual conftest if something doesn't line up.

Why not just re-add commits inside the old offset loop?

Per-batch commits alone don't fix anything if offset is still tracked manually, since a committed batch still needs to stop matching the filter for the next page to be correct without an offset. Re-querying from the top each time, with no offset, is the simpler and more robust fix recommended in the original issue thread, and it composes correctly with per-batch commits.

Closes #638
GSSoC'26

… transaction

Removes the manually advancing offset from all three scheduled cleanup functions (cleanup_stale_documents, cleanup_old_deleted_documents, cleanup_inactive_active_documents) and re-queries from the top of the filter on every iteration instead, with an explicit ORDER BY Document.id for deterministic pagination.

Each batch is now committed independently rather than accumulated inside one long-lived transaction for the whole job. This fixes a data-integrity bug: if any unhandled exception occurred partway through a large cleanup run, the entire job's transaction rolled back, discarding every batch already processed in that call -- including batches where _hard_delete_document had already performed irreversible side effects (vector-store deletes, file removal). The DB rows for those documents would reappear after rollback, now orphaned and pointing at data that no longer existed.

Verified via fault injection: seeding 250 eligible rows (2.5x the 100-row batch size) and forcing an unhandled error on row param20h#220 -- before this fix, all 250 rows reappeared after rollback despite ~200 already having had their vectors/files deleted; after this fix, the 200 rows from the first two committed batches correctly stay purged.

Note: investigated the originally suspected mechanism (SQLAlchemy autoflush propagating in-batch mutations into the next iteration's query within one shared transaction) and could not reproduce it -- this codebase's SessionLocal is configured with autoflush=False, so that specific failure mode does not occur here. The offset-removal and per-batch-commit change is still correct and necessary for the partial-failure durability issue described above, and is also more robust against any future change to session config.
@hariom888 hariom888 requested a review from param20h as a code owner June 20, 2026 08:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] : Cleanup jobs silently skip roughly half of eligible documents due to OFFSET pagination over a self-mutating query

2 participants