fix(cleanup): commit each batch independently to prevent rollback of already-processed rows on partial job failure by hariom888 · Pull Request #650 · param20h/PDF-Assistant-RAG

hariom888 · 2026-06-20T08:36:28Z

Summary

Fixes a data-integrity bug in backend/app/services/cleanup.py related to #638. All three scheduled cleanup functions (cleanup_stale_documents, cleanup_old_deleted_documents, cleanup_inactive_active_documents) previously ran their entire batch loop inside one with get_db_session() block, committing only once at the very end, while tracking a manually advancing offset across iterations.

What I found

I started from the offset-pagination-skips-rows mechanism described in #638 (autoflush propagating each batch's mutation into the same transaction, shrinking the result set faster than offset advances). I could not reproduce that specific mechanism in this codebase: SessionLocal in backend/app/database.py is explicitly configured with autoflush=False, so a batch's UPDATE/DELETE is never flushed to the DB before the next iteration's SELECT runs within the same transaction. I confirmed this with a standalone repro seeding 250 rows against the actual session config -- the original code processed all 250 correctly.

While tracing through the same code path, I found a more severe bug with the same root pattern (one long-lived transaction across all batches):

If any unhandled exception occurs partway through a large run, the entire job's transaction rolls back -- discarding every batch already processed in that call, not just the batch that failed. For cleanup_old_deleted_documents and cleanup_inactive_active_documents, this is a correctness/data-integrity problem, not just a lost-progress problem: _hard_delete_document performs irreversible side effects (vector-store deletion, file removal) before db.delete(doc). If an error elsewhere in the loop causes a rollback after several batches have already had their files/vectors permanently deleted, those documents' Document rows reappear in the database -- now orphaned, pointing at vector embeddings and files that no longer exist.

I verified this with fault injection: seeded 250 eligible rows (2.5x the batch size of 100), forced an unhandled error on the 220th document processed (well into the third batch). Before this fix, all 250 rows reappeared in the table after rollback, despite ~200 already having their underlying data permanently deleted. After this fix, the 200 rows from the two already-committed batches stay correctly purged; only the uncommitted remainder survives.

Fix

Removed the manually advancing offset from all three functions. Each iteration re-runs the same filtered query from the top with only LIMIT applied (no OFFSET) -- once a batch is committed, those rows no longer match the filter, so they naturally drop out of the next query's results without needing to track how many rows to skip.
Added .order_by(Document.id) to all three queries for deterministic, repeatable pagination across iterations (none of the queries had an explicit order before).
Moved with get_db_session() inside the while True: loop so each batch commits independently, instead of one commit at the very end of the whole job.
Each function now returns the total number of rows processed (was implicitly returning None), which is useful for logging/monitoring and is exercised directly in the new tests.

Tests

Added backend/tests/test_cleanup.py:

Seeds 2.5x the batch size (250 rows with _CLEANUP_BATCH_SIZE = 100) for each of the three cleanup functions and asserts a single call processes all of them.
A dedicated test (test_cleanup_old_deleted_documents_preserves_earlier_committed_batches_on_failure) reproduces the partial-failure corruption scenario above via fault injection on _hard_delete_document, asserting that rows from already-committed batches survive a later batch's unhandled failure.
A test for the DOC_CLEANUP_ENABLED=False early-return path.

I ran these against the real cleanup.py / database.py / models.py code paths using a minimal standalone harness, but was not able to run them through this repo's full pytest/conftest.py stack in my environment (missing several heavy dependencies unrelated to this change, e.g. prometheus_client via app.main). Please run pytest tests/test_cleanup.py -v locally before merging -- happy to fix up any fixture/mocking mismatches with the actual conftest if something doesn't line up.

Why not just re-add commits inside the old offset loop?

Per-batch commits alone don't fix anything if offset is still tracked manually, since a committed batch still needs to stop matching the filter for the next page to be correct without an offset. Re-querying from the top each time, with no offset, is the simpler and more robust fix recommended in the original issue thread, and it composes correctly with per-batch commits.

Closes #638
GSSoC'26

… transaction Removes the manually advancing offset from all three scheduled cleanup functions (cleanup_stale_documents, cleanup_old_deleted_documents, cleanup_inactive_active_documents) and re-queries from the top of the filter on every iteration instead, with an explicit ORDER BY Document.id for deterministic pagination. Each batch is now committed independently rather than accumulated inside one long-lived transaction for the whole job. This fixes a data-integrity bug: if any unhandled exception occurred partway through a large cleanup run, the entire job's transaction rolled back, discarding every batch already processed in that call -- including batches where _hard_delete_document had already performed irreversible side effects (vector-store deletes, file removal). The DB rows for those documents would reappear after rollback, now orphaned and pointing at data that no longer existed. Verified via fault injection: seeding 250 eligible rows (2.5x the 100-row batch size) and forcing an unhandled error on row param20h#220 -- before this fix, all 250 rows reappeared after rollback despite ~200 already having had their vectors/files deleted; after this fix, the 200 rows from the first two committed batches correctly stay purged. Note: investigated the originally suspected mechanism (SQLAlchemy autoflush propagating in-batch mutations into the next iteration's query within one shared transaction) and could not reproduce it -- this codebase's SessionLocal is configured with autoflush=False, so that specific failure mode does not occur here. The offset-removal and per-batch-commit change is still correct and necessary for the partial-failure durability issue described above, and is also more robust against any future change to session config.

hariom888 requested a review from param20h as a code owner June 20, 2026 08:36

param20h approved these changes Jun 21, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cleanup): commit each batch independently to prevent rollback of already-processed rows on partial job failure#650

fix(cleanup): commit each batch independently to prevent rollback of already-processed rows on partial job failure#650
hariom888 wants to merge 1 commit into
param20h:devfrom
hariom888:fix/cleanup-batch-commit-pagination

hariom888 commented Jun 20, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

hariom888 commented Jun 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What I found

Fix

Tests

Why not just re-add commits inside the old offset loop?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hariom888 commented Jun 20, 2026 •

edited

Loading