fix(cleanup): commit each batch independently to prevent rollback of already-processed rows on partial job failure#650
Open
hariom888 wants to merge 1 commit into
Conversation
… transaction Removes the manually advancing offset from all three scheduled cleanup functions (cleanup_stale_documents, cleanup_old_deleted_documents, cleanup_inactive_active_documents) and re-queries from the top of the filter on every iteration instead, with an explicit ORDER BY Document.id for deterministic pagination. Each batch is now committed independently rather than accumulated inside one long-lived transaction for the whole job. This fixes a data-integrity bug: if any unhandled exception occurred partway through a large cleanup run, the entire job's transaction rolled back, discarding every batch already processed in that call -- including batches where _hard_delete_document had already performed irreversible side effects (vector-store deletes, file removal). The DB rows for those documents would reappear after rollback, now orphaned and pointing at data that no longer existed. Verified via fault injection: seeding 250 eligible rows (2.5x the 100-row batch size) and forcing an unhandled error on row param20h#220 -- before this fix, all 250 rows reappeared after rollback despite ~200 already having had their vectors/files deleted; after this fix, the 200 rows from the first two committed batches correctly stay purged. Note: investigated the originally suspected mechanism (SQLAlchemy autoflush propagating in-batch mutations into the next iteration's query within one shared transaction) and could not reproduce it -- this codebase's SessionLocal is configured with autoflush=False, so that specific failure mode does not occur here. The offset-removal and per-batch-commit change is still correct and necessary for the partial-failure durability issue described above, and is also more robust against any future change to session config.
param20h
approved these changes
Jun 21, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a data-integrity bug in
backend/app/services/cleanup.pyrelated to #638. All three scheduled cleanup functions (cleanup_stale_documents,cleanup_old_deleted_documents,cleanup_inactive_active_documents) previously ran their entire batch loop inside onewith get_db_session()block, committing only once at the very end, while tracking a manually advancingoffsetacross iterations.What I found
I started from the offset-pagination-skips-rows mechanism described in #638 (autoflush propagating each batch's mutation into the same transaction, shrinking the result set faster than offset advances). I could not reproduce that specific mechanism in this codebase:
SessionLocalinbackend/app/database.pyis explicitly configured withautoflush=False, so a batch'sUPDATE/DELETEis never flushed to the DB before the next iteration'sSELECTruns within the same transaction. I confirmed this with a standalone repro seeding 250 rows against the actual session config -- the original code processed all 250 correctly.While tracing through the same code path, I found a more severe bug with the same root pattern (one long-lived transaction across all batches):
If any unhandled exception occurs partway through a large run, the entire job's transaction rolls back -- discarding every batch already processed in that call, not just the batch that failed. For
cleanup_old_deleted_documentsandcleanup_inactive_active_documents, this is a correctness/data-integrity problem, not just a lost-progress problem:_hard_delete_documentperforms irreversible side effects (vector-store deletion, file removal) beforedb.delete(doc). If an error elsewhere in the loop causes a rollback after several batches have already had their files/vectors permanently deleted, those documents'Documentrows reappear in the database -- now orphaned, pointing at vector embeddings and files that no longer exist.I verified this with fault injection: seeded 250 eligible rows (2.5x the batch size of 100), forced an unhandled error on the 220th document processed (well into the third batch). Before this fix, all 250 rows reappeared in the table after rollback, despite ~200 already having their underlying data permanently deleted. After this fix, the 200 rows from the two already-committed batches stay correctly purged; only the uncommitted remainder survives.
Fix
offsetfrom all three functions. Each iteration re-runs the same filtered query from the top with onlyLIMITapplied (noOFFSET) -- once a batch is committed, those rows no longer match the filter, so they naturally drop out of the next query's results without needing to track how many rows to skip..order_by(Document.id)to all three queries for deterministic, repeatable pagination across iterations (none of the queries had an explicit order before).with get_db_session()inside thewhile True:loop so each batch commits independently, instead of one commit at the very end of the whole job.None), which is useful for logging/monitoring and is exercised directly in the new tests.Tests
Added
backend/tests/test_cleanup.py:2.5xthe batch size (250 rows with_CLEANUP_BATCH_SIZE = 100) for each of the three cleanup functions and asserts a single call processes all of them.test_cleanup_old_deleted_documents_preserves_earlier_committed_batches_on_failure) reproduces the partial-failure corruption scenario above via fault injection on_hard_delete_document, asserting that rows from already-committed batches survive a later batch's unhandled failure.DOC_CLEANUP_ENABLED=Falseearly-return path.I ran these against the real
cleanup.py/database.py/models.pycode paths using a minimal standalone harness, but was not able to run them through this repo's fullpytest/conftest.pystack in my environment (missing several heavy dependencies unrelated to this change, e.g.prometheus_clientviaapp.main). Please runpytest tests/test_cleanup.py -vlocally before merging -- happy to fix up any fixture/mocking mismatches with the actual conftest if something doesn't line up.Why not just re-add commits inside the old offset loop?
Per-batch commits alone don't fix anything if
offsetis still tracked manually, since a committed batch still needs to stop matching the filter for the next page to be correct without an offset. Re-querying from the top each time, with no offset, is the simpler and more robust fix recommended in the original issue thread, and it composes correctly with per-batch commits.Closes #638
GSSoC'26