
Add incremental preprocessing checkpoints with resume support#63

Open
AnshuPriya-1 wants to merge 7 commits into AOSSIE-Org:main from AnshuPriya-1:feature/checkpoint-preprocessing

Conversation


@AnshuPriya-1 AnshuPriya-1 commented Mar 10, 2026

Addressed Issues:

Fixes #54

Screenshots/Recordings:

Additional Notes:

Adds checkpoint support to the preprocessing pipeline so large Wikipedia dump processing can resume from the last processed page instead of restarting from the beginning.

A checkpoint is saved every 1000 pages and removed after successful completion.
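The intended flow — load checkpoint, skip already-processed pages, save a checkpoint every 1000 pages, delete it on success — can be sketched roughly as follows. This is a minimal illustration with hypothetical names, not the exact code in openverifiablellm/utils.py:

```python
import json
from pathlib import Path

CHECKPOINT_EVERY = 1000

def process_pages(pages, checkpoint_path, handle_page):
    """Process pages in order, resuming from a JSON checkpoint if one exists."""
    checkpoint_file = Path(checkpoint_path)
    start_page = 0
    if checkpoint_file.exists():
        try:
            data = json.loads(checkpoint_file.read_text())
            start_page = data.get("last_processed_page", 0)
        except (json.JSONDecodeError, OSError):
            start_page = 0  # corrupted checkpoint: restart from the beginning

    for page_index, page in enumerate(pages, start=1):
        if page_index <= start_page:
            continue  # already handled in a previous run
        handle_page(page)
        if page_index % CHECKPOINT_EVERY == 0:
            checkpoint_file.write_text(
                json.dumps({"last_processed_page": page_index})
            )

    # checkpoint is removed after successful completion
    if checkpoint_file.exists():
        checkpoint_file.unlink()
```

A resumed run then processes only the pages after `last_processed_page`.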

Checklist

  • My code follows the project's code style and conventions
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

  • New Features

    • Resume capability for interrupted text extraction so resumed runs append instead of restarting.
    • Manifest generation moved to run after preprocessing finishes to ensure final outputs are consistent.
  • Improvements

    • Automatic checkpoints saved regularly (every 1,000 pages) with safer swap/update semantics.
    • Robust handling of corrupted/missing checkpoint data and cleanup of temporary checkpoint files.
    • Skips already-processed pages and writes per-page output during runs.
  • Bug Fixes

    • Known issue: unresolved merge conflict remnants may appear in extraction flow and need resolution.


coderabbitai bot commented Mar 10, 2026

Walkthrough

Adds incremental checkpointing and resume support to extract_text_from_xml, writes checkpoints periodically and removes them on success, defers manifest generation until after preprocessing, and introduces unresolved merge conflict markers in extract_text_from_xml and the __main__ section.

Changes

  • Checkpointing & resume — openverifiablellm/utils.py: Adds loading of data/checkpoint.json (handling JSONDecodeError/OSError), sets start_page from the checkpoint, skips already-processed pages, chooses output mode "w" or "a" based on resume state, writes atomic .tmp checkpoints every 1000 pages, deletes the checkpoint on successful completion, and moves generate_manifest(input_path, output_path) to run after preprocessing.
  • Merge conflict remnants — openverifiablellm/utils.py (extract_text_from_xml and the __main__ section): Unresolved merge conflict markers were introduced; these need resolution before merge.

Sequence Diagram(s)

sequenceDiagram
    participant User as User/Process
    participant Func as extract_text_from_xml()
    participant Check as data/checkpoint.json
    participant Proc as PageProcessor
    participant Manifest as generate_manifest()

    User->>Func: start(input_path, output_path)
    Func->>Check: read checkpoint (if exists)
    alt checkpoint present
        Check-->>Func: last_processed_page
        Func->>Proc: set start_page = last_processed_page
    else no checkpoint
        Func->>Proc: set start_page = 0
    end

    Func->>Proc: begin processing pages
    loop for each page
        Proc->>Proc: page_index++
        alt page_index <= start_page
            Proc-->>Proc: skip page
        else
            Proc->>Proc: process and write output (append or new)
        end
        alt every 1000 pages
            Proc->>Check: write checkpoint (atomic .tmp then replace)
        end
    end

    Proc-->>Func: processing complete
    Func->>Check: delete checkpoint file
    Func->>Manifest: generate_manifest(input_path, output_path)
    Manifest-->>Func: manifest ready
    Func-->>User: done

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Python Lang

Poem

🐰 I hop through dumps and leave a trace,

A tiny file to mark the place.
Each thousand hops I blink and store,
So next time round I skip before,
Resume intact — I bound once more!

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

  • Linked Issues check — ⚠️ Warning. Explanation: the PR implements core features from issue #54 (checkpoints, resume capability) but does not include checkpoint integrity verification (hashes), time-remaining estimates, or graceful failure handling. Resolution: add checkpoint integrity verification using hashes, implement time-remaining estimates, and ensure graceful failure handling as specified in the issue #54 requirements.
  • Out of Scope Changes check — ⚠️ Warning. Explanation: the PR introduces unresolved merge conflict markers in the extract_text_from_xml and __main__ sections, which are out-of-scope anomalies; these conflict remnants should not be present in a production-ready changeset. Resolution: resolve all merge conflict markers before merging; review the conflicting sections and complete the merge resolution process.
✅ Passed checks (2 passed)
  • Description check — ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed. The title accurately summarizes the main change: adding checkpoint support with resume capability for preprocessing. It is concise, clear, and directly reflects the primary objective of the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@github-actions github-actions bot added size/S and removed size/S labels Mar 10, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
openverifiablellm/utils.py (1)

158-246: 🧹 Nitpick | 🔵 Trivial

Missing features from Issue #54 requirements.

The implementation partially addresses the checkpoint requirements but is missing:

  1. Checkpoint integrity verification - No hash stored to verify output file state matches checkpoint
  2. Time remaining estimates - No progress tracking or ETA calculation

These are listed as objectives in Issue #54.

Would you like me to help implement these missing features or open separate issues to track them?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 158 - 246, The extract_text_from_xml
function currently writes a checkpoint with only last_processed_page; update the
checkpoint logic to store an integrity hash of the output file (e.g., SHA256 of
output_path) and state needed for ETA (total_pages if known or processed_pages
and elapsed_seconds) so you can verify the output file matches the checkpoint on
resume and compute progress/time remaining; specifically, when writing
checkpoint_file in extract_text_from_xml include fields {"last_processed_page":
page_index, "output_hash": <sha256_hex_of_output_path>, "processed_bytes":
<output_path.stat().st_size>, "timestamp": <now_epoch>} and on load verify
output_path exists and its hash and size match the checkpoint before resuming
(if mismatch, either abort or rebuild from scratch), and while iterating update
elapsed time and estimate ETA using page_index-start_page over elapsed seconds
and (total_pages or an estimated remaining count) and surface this via
logger.info; ensure generate_manifest is called only after successful hash
verification and final checkpoint removal.
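As a sketch of what the richer checkpoint described in the prompt above could look like — the field names follow the prompt, but the helper itself is hypothetical and not part of the current utils.py:

```python
import hashlib
import json
import time
from pathlib import Path

def write_rich_checkpoint(checkpoint_path, output_path, page_index,
                          start_time, start_page, total_pages=None):
    """Write a checkpoint that also records an output-file hash and the
    data needed to estimate time remaining (hypothetical helper)."""
    output_path = Path(output_path)
    checkpoint = {
        "last_processed_page": page_index,
        "output_hash": hashlib.sha256(output_path.read_bytes()).hexdigest(),
        "processed_bytes": output_path.stat().st_size,
        "timestamp": time.time(),
    }
    Path(checkpoint_path).write_text(json.dumps(checkpoint))

    # naive ETA: pages/second so far, extrapolated over the remaining pages
    elapsed = time.time() - start_time
    if total_pages and elapsed > 0 and page_index > start_page:
        rate = (page_index - start_page) / elapsed
        return (total_pages - page_index) / rate
    return None
```

On resume, the stored hash and size can be compared against the actual output file before trusting `last_processed_page`.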
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 230-237: The checkpoint write currently uses a non-atomic
json.dump to checkpoint_file when page_index % 1000 == 0; change this to perform
an atomic write by serializing the checkpoint dict (e.g.,
{"last_processed_page": page_index}) to a temporary file (e.g., checkpoint_file
+ ".tmp"), fsync the file, close it, then atomically replace the original using
os.replace (or pathlib.Path.replace) in the same directory; additionally compute
a checksum (SHA256) of the serialized checkpoint and store it together in the
checkpoint JSON (e.g., {"last_processed_page": ..., "checksum": "<hex>"}), and
update the checkpoint loading logic to verify the stored checksum matches the
content before trusting last_processed_page, raising/handling a verification
error if mismatched.
- Around line 207-220: The output is opened with "w" which truncates existing
output on resume; change the file open mode to "a" (append) when start_page > 0
to avoid losing previously written pages and keep "w" when starting fresh
(start_page == 0); locate the block using variables/output_path, start_page,
page_index and the loop over context (for _, elem in context:) and open the file
conditionally (mode = "a" if start_page > 0 else "w") preserving encoding and
newline handling so resumed runs append new pages instead of truncating the
file.
- Around line 199-202: Wrap the json.load call that reads checkpoint_file in a
try/except that catches json.JSONDecodeError (and fallback exceptions) so a
malformed checkpoint does not crash; on exception set start_page = 0 and emit a
warning that includes the checkpoint_file name and the exception, then continue.
Also remove the redundant "r" mode from open(), keep the existing with
open(checkpoint_file) context, and ensure you reference checkpoint_file,
json.load, and start_page in the fix so the handler replaces the current direct
json.load usage.

---

Outside diff comments:
In `@openverifiablellm/utils.py`:
- Around line 158-246: The extract_text_from_xml function currently writes a
checkpoint with only last_processed_page; update the checkpoint logic to store
an integrity hash of the output file (e.g., SHA256 of output_path) and state
needed for ETA (total_pages if known or processed_pages and elapsed_seconds) so
you can verify the output file matches the checkpoint on resume and compute
progress/time remaining; specifically, when writing checkpoint_file in
extract_text_from_xml include fields {"last_processed_page": page_index,
"output_hash": <sha256_hex_of_output_path>, "processed_bytes":
<output_path.stat().st_size>, "timestamp": <now_epoch>} and on load verify
output_path exists and its hash and size match the checkpoint before resuming
(if mismatch, either abort or rebuild from scratch), and while iterating update
elapsed time and estimate ETA using page_index-start_page over elapsed seconds
and (total_pages or an estimated remaining count) and surface this via
logger.info; ensure generate_manifest is called only after successful hash
verification and final checkpoint removal.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e3e31401-8ba9-4461-aebe-65146df2a7e5

📥 Commits

Reviewing files that changed from the base of the PR and between a2dfd7a and d80a0a7.

📒 Files selected for processing (1)
  • openverifiablellm/utils.py

Comment on lines +199 to +202
    if checkpoint_file.exists():
        with open(checkpoint_file, "r") as cp:
            data = json.load(cp)
            start_page = data.get("last_processed_page", 0)

⚠️ Potential issue | 🟠 Major

Add error handling for corrupted checkpoint files.

If the checkpoint file is malformed or corrupted (e.g., partial write from a previous crash), json.load() will raise a JSONDecodeError and abort. Per the PR objectives, failures should be handled gracefully. Consider catching exceptions and falling back to start_page = 0 with a warning.

Also, the "r" mode argument is unnecessary (it's the default).

🛡️ Proposed fix
     # load checkpoint if exists
     if checkpoint_file.exists():
-        with open(checkpoint_file, "r") as cp:
-            data = json.load(cp)
-            start_page = data.get("last_processed_page", 0)
+        try:
+            with open(checkpoint_file) as cp:
+                data = json.load(cp)
+                start_page = data.get("last_processed_page", 0)
+                logger.info("Resuming from checkpoint at page %d", start_page)
+        except (json.JSONDecodeError, KeyError) as e:
+            logger.warning("Corrupted checkpoint file, starting from beginning: %s", e)
+            start_page = 0
🧰 Tools
🪛 Ruff (0.15.5)

[warning] 200-200: Unnecessary mode argument

Remove mode argument

(UP015)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 199 - 202, Wrap the json.load call
that reads checkpoint_file in a try/except that catches json.JSONDecodeError
(and fallback exceptions) so a malformed checkpoint does not crash; on exception
set start_page = 0 and emit a warning that includes the checkpoint_file name and
the exception, then continue. Also remove the redundant "r" mode from open(),
keep the existing with open(checkpoint_file) context, and ensure you reference
checkpoint_file, json.load, and start_page in the fix so the handler replaces
the current direct json.load usage.

Comment on lines +230 to +237
                    # save checkpoint every 1000 pages
                    if page_index % 1000 == 0:
                        with open(checkpoint_file, "w") as cp:
                            json.dump(
                                {"last_processed_page": page_index},
                                cp
                            )


⚠️ Potential issue | 🟠 Major

Checkpoint write is not atomic and lacks integrity verification.

If the process is terminated while writing the checkpoint, the file could be left in a corrupted state (partial JSON). Consider using atomic write pattern (write to temp file, then rename).

Additionally, the PR objectives mention verifying checkpoint integrity with hashes, but no hash is stored to validate the output file state matches the checkpoint.

🛡️ Proposed fix for atomic writes
                     # save checkpoint every 1000 pages
                     if page_index % 1000 == 0:
-                        with open(checkpoint_file, "w") as cp:
-                            json.dump(
-                                {"last_processed_page": page_index},
-                                cp
-                            )
+                        # Atomic write: write to temp file then rename
+                        temp_checkpoint = checkpoint_file.with_suffix(".tmp")
+                        with open(temp_checkpoint, "w") as cp:
+                            json.dump(
+                                {"last_processed_page": page_index},
+                                cp
+                            )
+                        temp_checkpoint.rename(checkpoint_file)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 230 - 237, The checkpoint write
currently uses a non-atomic json.dump to checkpoint_file when page_index % 1000
== 0; change this to perform an atomic write by serializing the checkpoint dict
(e.g., {"last_processed_page": page_index}) to a temporary file (e.g.,
checkpoint_file + ".tmp"), fsync the file, close it, then atomically replace the
original using os.replace (or pathlib.Path.replace) in the same directory;
additionally compute a checksum (SHA256) of the serialized checkpoint and store
it together in the checkpoint JSON (e.g., {"last_processed_page": ...,
"checksum": "<hex>"}), and update the checkpoint loading logic to verify the
stored checksum matches the content before trusting last_processed_page,
raising/handling a verification error if mismatched.
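Combining the two ideas in the prompt above, an atomic-write helper with a self-checksum might look like this — a sketch under the prompt's assumptions, not the code actually merged in this PR:

```python
import hashlib
import json
import os
from pathlib import Path

def save_checkpoint_atomic(checkpoint_path, page_index):
    """Write the checkpoint to a .tmp file, fsync, then atomically replace."""
    payload = {"last_processed_page": page_index}
    body = json.dumps(payload, sort_keys=True)
    payload["checksum"] = hashlib.sha256(body.encode()).hexdigest()
    tmp = Path(str(checkpoint_path) + ".tmp")
    with open(tmp, "w") as cp:
        json.dump(payload, cp)
        cp.flush()
        os.fsync(cp.fileno())
    os.replace(tmp, checkpoint_path)  # atomic within the same directory

def load_checkpoint(checkpoint_path):
    """Return last_processed_page, or 0 if the file is missing or untrusted."""
    try:
        data = json.loads(Path(checkpoint_path).read_text())
        stored = data.pop("checksum")
        body = json.dumps(data, sort_keys=True)
        if hashlib.sha256(body.encode()).hexdigest() != stored:
            return 0  # checksum mismatch: do not trust this checkpoint
        return data.get("last_processed_page", 0)
    except (OSError, json.JSONDecodeError, KeyError):
        return 0
```

Because os.replace is atomic on the same filesystem, a crash mid-write leaves either the old checkpoint or the new one, never a half-written file.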


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
openverifiablellm/utils.py (1)

199-206: ⚠️ Potential issue | 🟡 Minor

Fix inconsistent indentation in checkpoint loading block.

The indentation in this block is inconsistent (extra leading spaces before try, mixed indentation inside the block). Also, the "r" mode argument on Line 201 is unnecessary as it's the default.

🧹 Proposed fix
     # load checkpoint if exists
     if checkpoint_file.exists():
-            try:
-              with open(checkpoint_file, "r") as cp:
-                 data = json.load(cp)
-                 start_page = data.get("last_processed_page", 0)
-            except (json.JSONDecodeError, OSError):
-              logger.warning("Checkpoint file corrupted. Restarting from beginning.")
-              start_page = 0
+        try:
+            with open(checkpoint_file) as cp:
+                data = json.load(cp)
+                start_page = data.get("last_processed_page", 0)
+        except (json.JSONDecodeError, OSError):
+            logger.warning("Checkpoint file corrupted. Restarting from beginning.")
+            start_page = 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 199 - 206, Normalize the indentation
of the checkpoint loading block so all lines under the if
checkpoint_file.exists(): use the same indentation level (remove the extra
leading spaces before try and ensure with/open/json.load/except align
consistently), and remove the unnecessary explicit "r" mode argument in
open(checkpoint_file, "r") (use open(checkpoint_file) instead); keep the same
exception handling (json.JSONDecodeError, OSError) and behavior of setting
start_page from data.get("last_processed_page", 0) or falling back to start_page
= 0 and logging via logger.warning.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 236-242: The checkpoint write is inverted: the code currently
opens and writes to checkpoint_file and then calls
tmp_checkpoint.replace(checkpoint_file), which fails or corrupts state; change
it to write to tmp_checkpoint first (open tmp_checkpoint and json.dump
{"last_processed_page": page_index} into it, flush/close), then call
tmp_checkpoint.replace(checkpoint_file) so the atomic rename replaces the target
with the newly written temp file; update references to tmp_checkpoint,
checkpoint_file, and page_index accordingly.
- Around line 210-211: The code computes mode = "a" when resuming but then
always opens the file with "w", truncating previous output; update the open call
that uses output_path to use the computed mode (i.e., open(output_path, mode,
encoding="utf-8")) so resumed runs append instead of overwrite, and keep the
existing encoding argument; ensure the variable name mode is the one referenced
in the open call (function/method using output_path and mode).

---

Duplicate comments:
In `@openverifiablellm/utils.py`:
- Around line 199-206: Normalize the indentation of the checkpoint loading block
so all lines under the if checkpoint_file.exists(): use the same indentation
level (remove the extra leading spaces before try and ensure
with/open/json.load/except align consistently), and remove the unnecessary
explicit "r" mode argument in open(checkpoint_file, "r") (use
open(checkpoint_file) instead); keep the same exception handling
(json.JSONDecodeError, OSError) and behavior of setting start_page from
data.get("last_processed_page", 0) or falling back to start_page = 0 and logging
via logger.warning.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 450ebfb8-8984-446e-b758-b2a387f90299

📥 Commits

Reviewing files that changed from the base of the PR and between d80a0a7 and efc879a.

📒 Files selected for processing (1)
  • openverifiablellm/utils.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 219-224: The checkpoint logic can cause duplicated output because
start_page is used to skip pages while the output file is opened in append mode;
to fix, when saving a checkpoint also flush the output file and record its
current byte offset (use file.tell()) into the checkpoint metadata, and on
resume (when load_checkpoint provides start_page) open the output file and
truncate it to that recorded offset before continuing; update the functions that
save/load checkpoints (where start_page is stored) to include an output_offset
field and ensure you call file.flush() and os.ftruncate(file.fileno(), offset)
before appending to prevent duplication (references: start_page, page_index,
elem.clear).
- Around line 247-251: Add follow-up tracking for the two unimplemented features
instead of changing behavior: create two issues (or PR tasks) titled "Verify
checkpoint integrity with hashes" and "Provide time-remaining estimates during
processing", and add concise TODO comments in the code near the checkpoint
cleanup and preprocessing logic referencing checkpoint_file (the
exists()/unlink() block) and generate_manifest so future work can find where to
add verification and ETA instrumentation; also link the new issue IDs in a brief
comment above the checkpoint cleanup to make the follow-up explicit.
- Around line 199-206: Fix the inconsistent indentation in the
checkpoint-reading block around checkpoint_file so the try/except aligns with
the if body and avoids IndentationError; open the file without explicitly
passing mode "r" (default) when calling open(checkpoint_file) before json.load,
set start_page = data.get("last_processed_page", 0) in the try, and in the
success path call logger.info to state the resumed start_page (e.g.,
logger.info(..., start_page)) while keeping the except handling that warns and
sets start_page = 0; update references to checkpoint_file, start_page, logger,
and json.load accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7a7b8a18-0bd7-4301-9dc5-ff6f7f154bef

📥 Commits

Reviewing files that changed from the base of the PR and between efc879a and b90a674.

📒 Files selected for processing (1)
  • openverifiablellm/utils.py

Comment on lines +199 to +206
    if checkpoint_file.exists():
            try:
              with open(checkpoint_file, "r") as cp:
                 data = json.load(cp)
                 start_page = data.get("last_processed_page", 0)
            except (json.JSONDecodeError, OSError):
              logger.warning("Checkpoint file corrupted. Restarting from beginning.")
              start_page = 0

⚠️ Potential issue | 🔴 Critical

Critical: Inconsistent indentation will cause IndentationError at runtime.

The try/except block has inconsistent indentation that will prevent the code from running. Python requires consistent whitespace within blocks. Additionally, consider adding an info log when successfully resuming (as suggested in past review).

🐛 Proposed fix for indentation and logging
     # load checkpoint if exists
     if checkpoint_file.exists():
-            try:
-              with open(checkpoint_file, "r") as cp:
-                 data = json.load(cp)
-                 start_page = data.get("last_processed_page", 0)
-            except (json.JSONDecodeError, OSError):
-              logger.warning("Checkpoint file corrupted. Restarting from beginning.")
-              start_page = 0
+        try:
+            with open(checkpoint_file) as cp:
+                data = json.load(cp)
+                start_page = data.get("last_processed_page", 0)
+                logger.info("Resuming from checkpoint at page %d", start_page)
+        except (json.JSONDecodeError, OSError):
+            logger.warning("Checkpoint file corrupted. Restarting from beginning.")
+            start_page = 0

Note: The "r" mode argument is also unnecessary (it's the default) per static analysis hint UP015.

🧰 Tools
🪛 Ruff (0.15.5)

[warning] 201-201: Unnecessary mode argument

Remove mode argument

(UP015)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 199 - 206, Fix the inconsistent
indentation in the checkpoint-reading block around checkpoint_file so the
try/except aligns with the if body and avoids IndentationError; open the file
without explicitly passing mode "r" (default) when calling open(checkpoint_file)
before json.load, set start_page = data.get("last_processed_page", 0) in the
try, and in the success path call logger.info to state the resumed start_page
(e.g., logger.info(..., start_page)) while keeping the except handling that
warns and sets start_page = 0; update references to checkpoint_file, start_page,
logger, and json.load accordingly.

Comment on lines +219 to +224
                page_index += 1

                # skip already processed pages
                if page_index <= start_page:
                    elem.clear()
                    continue

⚠️ Potential issue | 🟠 Major

Data duplication risk between checkpoint intervals.

If processing crashes between checkpoint saves (e.g., at page 1500 after checkpoint at page 1000), pages 1001–1500 are already written to the output file. On resume, start_page=1000 causes these pages to be skipped, but since the file opens in append mode, pages 1001–1500 from the new run will be appended again, resulting in duplicate content.

Consider one of:

  1. Flush and record output file position in checkpoint, truncate to that position on resume
  2. Save checkpoint more frequently (e.g., every 100 pages)
  3. Document this limitation and recommend users delete the output file if crash occurred mid-interval

Also applies to: 234-241

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 219 - 224, The checkpoint logic can
cause duplicated output because start_page is used to skip pages while the
output file is opened in append mode; to fix, when saving a checkpoint also
flush the output file and record its current byte offset (use file.tell()) into
the checkpoint metadata, and on resume (when load_checkpoint provides
start_page) open the output file and truncate it to that recorded offset before
continuing; update the functions that save/load checkpoints (where start_page is
stored) to include an output_offset field and ensure you call file.flush() and
os.ftruncate(file.fileno(), offset) before appending to prevent duplication
(references: start_page, page_index, elem.clear).
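Option 1 above could be sketched like this. The helper names and the output_offset field are hypothetical; the current utils.py does not record an offset:

```python
import json
from pathlib import Path

def save_checkpoint_with_offset(checkpoint_path, out_file, page_index):
    """Checkpoint the page index together with the output byte offset."""
    out_file.flush()
    Path(checkpoint_path).write_text(json.dumps({
        "last_processed_page": page_index,
        "output_offset": out_file.tell(),
    }))

def resume_output(checkpoint_path, output_path):
    """Reopen the output for append, truncated back to the checkpointed
    offset so pages written after the last checkpoint are not duplicated."""
    data = json.loads(Path(checkpoint_path).read_text())
    f = open(output_path, "a", encoding="utf-8")
    f.truncate(data["output_offset"])  # discard the partial tail
    return data["last_processed_page"], f
```

Truncating to the recorded offset discards any pages written after the last checkpoint, so re-processing them on resume no longer duplicates output.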

Comment on lines +247 to +251
    # remove checkpoint after successful completion
    if checkpoint_file.exists():
        checkpoint_file.unlink()

    generate_manifest(input_path, output_path)

@coderabbitai coderabbitai bot Mar 10, 2026


🧹 Nitpick | 🔵 Trivial

Checkpoint cleanup and manifest generation look good.

The checkpoint is properly removed after successful completion, and manifest generation is correctly deferred until preprocessing finishes.

Note: The PR objectives (Issue #54) mention two features not yet implemented:

  1. Verify checkpoint integrity with hashes
  2. Provide estimates of time remaining during processing

Consider tracking these as follow-up tasks if not planned for this PR.

Would you like me to open issues to track the missing checkpoint integrity verification and time-remaining estimates?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 247 - 251, Add follow-up tracking
for the two unimplemented features instead of changing behavior: create two
issues (or PR tasks) titled "Verify checkpoint integrity with hashes" and
"Provide time-remaining estimates during processing", and add concise TODO
comments in the code near the checkpoint cleanup and preprocessing logic
referencing checkpoint_file (the exists()/unlink() block) and generate_manifest
so future work can find where to add verification and ETA instrumentation; also
link the new issue IDs in a brief comment above the checkpoint cleanup to make
the follow-up explicit.

@AnshuPriya-1 (Author)

All requested issues have been addressed.

  • Added safe checkpoint loading
  • Fixed append mode resume logic
  • Implemented atomic checkpoint writes
  • Resolved merge conflicts with upstream main

@coderabbitai coderabbitai bot (Contributor)

@AnshuPriya-1, thanks for the updates! The code improvements you listed (safe checkpoint loading, append mode resume, atomic writes) look good. However, there are two things still outstanding based on the current file content:


1. Unresolved merge conflict marker still present

The __main__ block at the bottom of openverifiablellm/utils.py still contains a raw merge conflict:

if __name__ == "__main__":
    ...
    extract_text_from_xml(
        sys.argv[1],
        write_manifest="--no-manifest" not in sys.argv[2:],
    )
<<<<<<< HEAD
    extract_text_from_xml(sys.argv[1], write_manifest=True)
=======
>>>>>>> upstream/main

The duplicate extract_text_from_xml(...) call inside the conflict block needs to be removed, leaving only the resolved version above it. This will cause a SyntaxError at runtime if not cleaned up.


2. Two Issue #54 features still unimplemented

My original note flagged that Issue #54 calls for:

  • Checkpoint integrity verification (e.g., hashing the checkpoint file contents)
  • Time-remaining estimates during processing

These aren't addressed by the checkpoint loading/atomic write changes. Would you like me to open two GitHub issues to track these as follow-up work so they aren't forgotten?

@github-actions github-actions bot added size/M and removed size/S labels Mar 10, 2026
@github-actions github-actions bot added size/M and removed size/M labels Mar 10, 2026
@github-actions github-actions bot added size/M and removed size/M labels Mar 10, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
openverifiablellm/utils.py (1)

199-207: 🧹 Nitpick | 🔵 Trivial

Remove unnecessary "r" mode and add resume logging.

The "r" mode argument is unnecessary (it's the default for open()). Additionally, when successfully resuming from a checkpoint, it would be helpful to log the starting page for debugging and monitoring.

♻️ Proposed fix
     # load checkpoint if exists
     if checkpoint_file.exists():
         try:
-            with open(checkpoint_file, "r") as cp:
+            with open(checkpoint_file) as cp:
                 data = json.load(cp)
                 start_page = data.get("last_processed_page", 0)
+                logger.info("Resuming from checkpoint at page %d", start_page)
         except (json.JSONDecodeError, OSError):
             logger.warning("Checkpoint file corrupted. Restarting from beginning.")
             start_page = 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 199 - 207, The checkpoint loading
block unnecessarily passes the "r" mode to open() and lacks a resume log; change
open(checkpoint_file, "r") to open(checkpoint_file) and after setting start_page
= data.get("last_processed_page", 0) add a logger.info (or debug) call that logs
the resumed start_page (e.g., "Resuming from checkpoint, starting at page
{start_page}") so developers can see when and where processing resumes; keep the
existing exception handling for json.JSONDecodeError and OSError unchanged and
reference checkpoint_file, start_page, and logger when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 433-436: Remove the unresolved Git conflict markers and ensure
extract_text_from_xml is called only once: delete the lines with <<<<<<<,
=======, and >>>>>>> and remove the duplicated call to
extract_text_from_xml(sys.argv[1], write_manifest=True) so that only the
intended invocation (the one already present in the module flow) remains; verify
the function extract_text_from_xml and any surrounding logic are left intact and
the file parses without conflict markers.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8bfa7b9a-971d-41b6-85ea-0183643add18

📥 Commits

Reviewing files that changed from the base of the PR and between b90a674 and 7033f67.

📒 Files selected for processing (1)
  • openverifiablellm/utils.py

Comment on lines +433 to +436
<<<<<<< HEAD
extract_text_from_xml(sys.argv[1], write_manifest=True)
=======
>>>>>>> upstream/main
⚠️ Potential issue | 🔴 Critical

Critical: Unresolved merge conflict markers will cause SyntaxError.

The file contains unresolved Git merge conflict markers. This will cause a syntax error when the module is executed, completely breaking the preprocessing functionality.

Additionally, if the HEAD version (line 434) were kept, it would call extract_text_from_xml twice—once at lines 429-432 and again at line 434—causing double processing.

🐛 Proposed fix: Remove merge conflict markers
     extract_text_from_xml(
         sys.argv[1],
         write_manifest="--no-manifest" not in sys.argv[2:],
     )
-<<<<<<< HEAD
-    extract_text_from_xml(sys.argv[1], write_manifest=True)
-=======
->>>>>>> upstream/main
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 433 - 436, Remove the unresolved Git
conflict markers and ensure extract_text_from_xml is called only once: delete
the lines with <<<<<<<, =======, and >>>>>>> and remove the duplicated call to
extract_text_from_xml(sys.argv[1], write_manifest=True) so that only the
intended invocation (the one already present in the module flow) remains; verify
the function extract_text_from_xml and any surrounding logic are left intact and
the file parses without conflict markers.

@github-actions github-actions bot added size/S and removed size/M labels Mar 11, 2026
@AnshuPriya-1
Copy link
Author

The leftover merge conflict markers in the __main__ block have been removed and the fix has been pushed in commit 612c1e1. Please let me know if any further changes are needed.


Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE]: Incremental Preprocessing with Checkpoints to avoid halfway fail-restart

1 participant