Add incremental preprocessing checkpoints with resume support #63
AnshuPriya-1 wants to merge 7 commits into AOSSIE-Org:main
Conversation
Walkthrough
Adds incremental checkpointing and resume support to extract_text_from_xml, writes checkpoints periodically and removes them on success, defers manifest generation until after preprocessing, and introduces unresolved merge conflict markers in the extract_text_from_xml and main sections.
Sequence Diagram(s)

sequenceDiagram
participant User as User/Process
participant Func as extract_text_from_xml()
participant Check as data/checkpoint.json
participant Proc as PageProcessor
participant Manifest as generate_manifest()
User->>Func: start(input_path, output_path)
Func->>Check: read checkpoint (if exists)
alt checkpoint present
Check-->>Func: last_processed_page
Func->>Proc: set start_page = last_processed_page
else no checkpoint
Func->>Proc: set start_page = 0
end
Func->>Proc: begin processing pages
loop for each page
Proc->>Proc: page_index++
alt page_index <= start_page
Proc-->>Proc: skip page
else
Proc->>Proc: process and write output (append or new)
end
alt every 1000 pages
Proc->>Check: write checkpoint (atomic .tmp then replace)
end
end
Proc-->>Func: processing complete
Func->>Check: delete checkpoint file
Func->>Manifest: generate_manifest(input_path, output_path)
Manifest-->>Func: manifest ready
Func-->>User: done
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Poem
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed (2 warnings)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
openverifiablellm/utils.py (1)
158-246: 🧹 Nitpick | 🔵 Trivial
Missing features from Issue #54 requirements.
The implementation partially addresses the checkpoint requirements but is missing:
- Checkpoint integrity verification - No hash stored to verify output file state matches checkpoint
- Time remaining estimates - No progress tracking or ETA calculation
These are listed as objectives in Issue #54. Would you like me to help implement these missing features or open separate issues to track them?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 158 - 246, The extract_text_from_xml function currently writes a checkpoint with only last_processed_page; update the checkpoint logic to store an integrity hash of the output file (e.g., SHA256 of output_path) and state needed for ETA (total_pages if known or processed_pages and elapsed_seconds) so you can verify the output file matches the checkpoint on resume and compute progress/time remaining; specifically, when writing checkpoint_file in extract_text_from_xml include fields {"last_processed_page": page_index, "output_hash": <sha256_hex_of_output_path>, "processed_bytes": <output_path.stat().st_size>, "timestamp": <now_epoch>} and on load verify output_path exists and its hash and size match the checkpoint before resuming (if mismatch, either abort or rebuild from scratch), and while iterating update elapsed time and estimate ETA using page_index-start_page over elapsed seconds and (total_pages or an estimated remaining count) and surface this via logger.info; ensure generate_manifest is called only after successful hash verification and final checkpoint removal.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 230-237: The checkpoint write currently uses a non-atomic
json.dump to checkpoint_file when page_index % 1000 == 0; change this to perform
an atomic write by serializing the checkpoint dict (e.g.,
{"last_processed_page": page_index}) to a temporary file (e.g., checkpoint_file
+ ".tmp"), fsync the file, close it, then atomically replace the original using
os.replace (or pathlib.Path.replace) in the same directory; additionally compute
a checksum (SHA256) of the serialized checkpoint and store it together in the
checkpoint JSON (e.g., {"last_processed_page": ..., "checksum": "<hex>"}), and
update the checkpoint loading logic to verify the stored checksum matches the
content before trusting last_processed_page, raising/handling a verification
error if mismatched.
- Around line 207-220: The output is opened with "w" which truncates existing
output on resume; change the file open mode to "a" (append) when start_page > 0
to avoid losing previously written pages and keep "w" when starting fresh
(start_page == 0); locate the block using variables/output_path, start_page,
page_index and the loop over context (for _, elem in context:) and open the file
conditionally (mode = "a" if start_page > 0 else "w") preserving encoding and
newline handling so resumed runs append new pages instead of truncating the
file.
- Around line 199-202: Wrap the json.load call that reads checkpoint_file in a
try/except that catches json.JSONDecodeError (and fallback exceptions) so a
malformed checkpoint does not crash; on exception set start_page = 0 and emit a
warning that includes the checkpoint_file name and the exception, then continue.
Also remove the redundant "r" mode from open(), keep the existing with
open(checkpoint_file) context, and ensure you reference checkpoint_file,
json.load, and start_page in the fix so the handler replaces the current direct
json.load usage.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: e3e31401-8ba9-4461-aebe-65146df2a7e5
📒 Files selected for processing (1)
openverifiablellm/utils.py
openverifiablellm/utils.py (Outdated)
    if checkpoint_file.exists():
        with open(checkpoint_file, "r") as cp:
            data = json.load(cp)
            start_page = data.get("last_processed_page", 0)
Add error handling for corrupted checkpoint files.
If the checkpoint file is malformed or corrupted (e.g., partial write from a previous crash), json.load() will raise a JSONDecodeError and abort. Per the PR objectives, failures should be handled gracefully. Consider catching exceptions and falling back to start_page = 0 with a warning.
Also, the "r" mode argument is unnecessary (it's the default).
🛡️ Proposed fix
# load checkpoint if exists
if checkpoint_file.exists():
- with open(checkpoint_file, "r") as cp:
- data = json.load(cp)
- start_page = data.get("last_processed_page", 0)
+ try:
+ with open(checkpoint_file) as cp:
+ data = json.load(cp)
+ start_page = data.get("last_processed_page", 0)
+ logger.info("Resuming from checkpoint at page %d", start_page)
+ except (json.JSONDecodeError, KeyError) as e:
+ logger.warning("Corrupted checkpoint file, starting from beginning: %s", e)
+            start_page = 0

🧰 Tools
🪛 Ruff (0.15.5)
[warning] 200-200: Unnecessary mode argument
Remove mode argument
(UP015)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/utils.py` around lines 199 - 202, Wrap the json.load call
that reads checkpoint_file in a try/except that catches json.JSONDecodeError
(and fallback exceptions) so a malformed checkpoint does not crash; on exception
set start_page = 0 and emit a warning that includes the checkpoint_file name and
the exception, then continue. Also remove the redundant "r" mode from open(),
keep the existing with open(checkpoint_file) context, and ensure you reference
checkpoint_file, json.load, and start_page in the fix so the handler replaces
the current direct json.load usage.
            # save checkpoint every 1000 pages
            if page_index % 1000 == 0:
                with open(checkpoint_file, "w") as cp:
                    json.dump(
                        {"last_processed_page": page_index},
                        cp
                    )
Checkpoint write is not atomic and lacks integrity verification.
If the process is terminated while writing the checkpoint, the file could be left in a corrupted state (partial JSON). Consider using atomic write pattern (write to temp file, then rename).
Additionally, the PR objectives mention verifying checkpoint integrity with hashes, but no hash is stored to validate the output file state matches the checkpoint.
🛡️ Proposed fix for atomic writes
# save checkpoint every 1000 pages
if page_index % 1000 == 0:
- with open(checkpoint_file, "w") as cp:
- json.dump(
- {"last_processed_page": page_index},
- cp
- )
+ # Atomic write: write to temp file then rename
+ temp_checkpoint = checkpoint_file.with_suffix(".tmp")
+ with open(temp_checkpoint, "w") as cp:
+ json.dump(
+ {"last_processed_page": page_index},
+ cp
+ )
+                temp_checkpoint.rename(checkpoint_file)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/utils.py` around lines 230 - 237, The checkpoint write
currently uses a non-atomic json.dump to checkpoint_file when page_index % 1000
== 0; change this to perform an atomic write by serializing the checkpoint dict
(e.g., {"last_processed_page": page_index}) to a temporary file (e.g.,
checkpoint_file + ".tmp"), fsync the file, close it, then atomically replace the
original using os.replace (or pathlib.Path.replace) in the same directory;
additionally compute a checksum (SHA256) of the serialized checkpoint and store
it together in the checkpoint JSON (e.g., {"last_processed_page": ...,
"checksum": "<hex>"}), and update the checkpoint loading logic to verify the
stored checksum matches the content before trusting last_processed_page,
raising/handling a verification error if mismatched.
…ut, atomic checkpoint writes
Actionable comments posted: 2
♻️ Duplicate comments (1)
openverifiablellm/utils.py (1)
199-206: ⚠️ Potential issue | 🟡 Minor
Fix inconsistent indentation in checkpoint loading block.
The indentation in this block is inconsistent (extra leading spaces before try, mixed indentation inside the block). Also, the "r" mode argument on line 201 is unnecessary as it's the default.
🧹 Proposed fix
 # load checkpoint if exists
 if checkpoint_file.exists():
-      try:
-        with open(checkpoint_file, "r") as cp:
-            data = json.load(cp)
-            start_page = data.get("last_processed_page", 0)
-      except (json.JSONDecodeError, OSError):
-        logger.warning("Checkpoint file corrupted. Restarting from beginning.")
-        start_page = 0
+    try:
+        with open(checkpoint_file) as cp:
+            data = json.load(cp)
+            start_page = data.get("last_processed_page", 0)
+    except (json.JSONDecodeError, OSError):
+        logger.warning("Checkpoint file corrupted. Restarting from beginning.")
+        start_page = 0

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 199 - 206, Normalize the indentation of the checkpoint loading block so all lines under the if checkpoint_file.exists(): use the same indentation level (remove the extra leading spaces before try and ensure with/open/json.load/except align consistently), and remove the unnecessary explicit "r" mode argument in open(checkpoint_file, "r") (use open(checkpoint_file) instead); keep the same exception handling (json.JSONDecodeError, OSError) and behavior of setting start_page from data.get("last_processed_page", 0) or falling back to start_page = 0 and logging via logger.warning.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 236-242: The checkpoint write is inverted: the code currently
opens and writes to checkpoint_file and then calls
tmp_checkpoint.replace(checkpoint_file), which fails or corrupts state; change
it to write to tmp_checkpoint first (open tmp_checkpoint and json.dump
{"last_processed_page": page_index} into it, flush/close), then call
tmp_checkpoint.replace(checkpoint_file) so the atomic rename replaces the target
with the newly written temp file; update references to tmp_checkpoint,
checkpoint_file, and page_index accordingly.
- Around line 210-211: The code computes mode = "a" when resuming but then
always opens the file with "w", truncating previous output; update the open call
that uses output_path to use the computed mode (i.e., open(output_path, mode,
encoding="utf-8")) so resumed runs append instead of overwrite, and keep the
existing encoding argument; ensure the variable name mode is the one referenced
in the open call (function/method using output_path and mode).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 450ebfb8-8984-446e-b758-b2a387f90299
📒 Files selected for processing (1)
openverifiablellm/utils.py
Actionable comments posted: 3
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7a7b8a18-0bd7-4301-9dc5-ff6f7f154bef
📒 Files selected for processing (1)
openverifiablellm/utils.py
openverifiablellm/utils.py (Outdated)
    if checkpoint_file.exists():
          try:
            with open(checkpoint_file, "r") as cp:
                data = json.load(cp)
                start_page = data.get("last_processed_page", 0)
          except (json.JSONDecodeError, OSError):
            logger.warning("Checkpoint file corrupted. Restarting from beginning.")
            start_page = 0
Critical: Inconsistent indentation will cause IndentationError at runtime.
The try/except block has inconsistent indentation that will prevent the code from running. Python requires consistent whitespace within blocks. Additionally, consider adding an info log when successfully resuming (as suggested in past review).
🐛 Proposed fix for indentation and logging
# load checkpoint if exists
if checkpoint_file.exists():
- try:
- with open(checkpoint_file, "r") as cp:
- data = json.load(cp)
- start_page = data.get("last_processed_page", 0)
- except (json.JSONDecodeError, OSError):
- logger.warning("Checkpoint file corrupted. Restarting from beginning.")
- start_page = 0
+ try:
+ with open(checkpoint_file) as cp:
+ data = json.load(cp)
+ start_page = data.get("last_processed_page", 0)
+ logger.info("Resuming from checkpoint at page %d", start_page)
+ except (json.JSONDecodeError, OSError):
+ logger.warning("Checkpoint file corrupted. Restarting from beginning.")
+            start_page = 0

Note: The "r" mode argument is also unnecessary (it's the default) per static analysis hint UP015.
🧰 Tools
🪛 Ruff (0.15.5)
[warning] 201-201: Unnecessary mode argument
Remove mode argument
(UP015)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/utils.py` around lines 199 - 206, Fix the inconsistent
indentation in the checkpoint-reading block around checkpoint_file so the
try/except aligns with the if body and avoids IndentationError; open the file
without explicitly passing mode "r" (default) when calling open(checkpoint_file)
before json.load, set start_page = data.get("last_processed_page", 0) in the
try, and in the success path call logger.info to state the resumed start_page
(e.g., logger.info(..., start_page)) while keeping the except handling that
warns and sets start_page = 0; update references to checkpoint_file, start_page,
logger, and json.load accordingly.
            page_index += 1

            # skip already processed pages
            if page_index <= start_page:
                elem.clear()
                continue
Data duplication risk between checkpoint intervals.
If processing crashes between checkpoint saves (e.g., at page 1500 after checkpoint at page 1000), pages 1001–1500 are already written to the output file. On resume, start_page=1000 causes these pages to be skipped, but since the file opens in append mode, pages 1001–1500 from the new run will be appended again, resulting in duplicate content.
Consider one of:
- Flush and record output file position in checkpoint, truncate to that position on resume
- Save checkpoint more frequently (e.g., every 100 pages)
- Document this limitation and recommend users delete the output file if crash occurred mid-interval
Also applies to: 234-241
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/utils.py` around lines 219 - 224, The checkpoint logic can
cause duplicated output because start_page is used to skip pages while the
output file is opened in append mode; to fix, when saving a checkpoint also
flush the output file and record its current byte offset (use file.tell()) into
the checkpoint metadata, and on resume (when load_checkpoint provides
start_page) open the output file and truncate it to that recorded offset before
continuing; update the functions that save/load checkpoints (where start_page is
stored) to include an output_offset field and ensure you call file.flush() and
os.ftruncate(file.fileno(), offset) before appending to prevent duplication
(references: start_page, page_index, elem.clear).
openverifiablellm/utils.py (Outdated)
    # remove checkpoint after successful completion
    if checkpoint_file.exists():
        checkpoint_file.unlink()

    generate_manifest(input_path, output_path)
🧹 Nitpick | 🔵 Trivial
Checkpoint cleanup and manifest generation look good.
The checkpoint is properly removed after successful completion, and manifest generation is correctly deferred until preprocessing finishes.
Note: The PR objectives (Issue #54) mention two features not yet implemented:
- Verify checkpoint integrity with hashes
- Provide estimates of time remaining during processing
Consider tracking these as follow-up tasks if not planned for this PR.
Would you like me to open issues to track the missing checkpoint integrity verification and time-remaining estimates?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/utils.py` around lines 247 - 251, Add follow-up tracking
for the two unimplemented features instead of changing behavior: create two
issues (or PR tasks) titled "Verify checkpoint integrity with hashes" and
"Provide time-remaining estimates during processing", and add concise TODO
comments in the code near the checkpoint cleanup and preprocessing logic
referencing checkpoint_file (the exists()/unlink() block) and generate_manifest
so future work can find where to add verification and ETA instrumentation; also
link the new issue IDs in a brief comment above the checkpoint cleanup to make
the follow-up explicit.
All requested issues have been addressed.
- Added safe checkpoint loading
- Fixed append mode resume logic
- Implemented atomic checkpoint writes
- Resolved merge conflicts with upstream main
@AnshuPriya-1, thanks for the updates! The code improvements you listed (safe checkpoint loading, append mode resume, atomic writes) look good. However, there are two things still outstanding based on the current file content:
1. Unresolved merge conflict marker still present
The __main__ block at the bottom of openverifiablellm/utils.py still contains a raw merge conflict:
if __name__ == "__main__":
...
extract_text_from_xml(
sys.argv[1],
write_manifest="--no-manifest" not in sys.argv[2:],
)
<<<<<<< HEAD
extract_text_from_xml(sys.argv[1], write_manifest=True)
=======
>>>>>>> upstream/main

The duplicate extract_text_from_xml(...) call inside the conflict block needs to be removed, leaving only the resolved version above it. This will cause a SyntaxError at runtime if not cleaned up.
2. Two Issue #54 features still unimplemented
My original note flagged that Issue #54 calls for:
- Checkpoint integrity verification (e.g., hashing the checkpoint file contents)
- Time-remaining estimates during processing
These aren't addressed by the checkpoint loading/atomic write changes. Would you like me to open two GitHub issues to track these as follow-up work so they aren't forgotten?
Actionable comments posted: 1
♻️ Duplicate comments (1)
openverifiablellm/utils.py (1)
199-207: 🧹 Nitpick | 🔵 Trivial
Remove unnecessary "r" mode and add resume logging.
The "r" mode argument is unnecessary (it's the default for open()). Additionally, when successfully resuming from a checkpoint, it would be helpful to log the starting page for debugging and monitoring.
♻️ Proposed fix
 # load checkpoint if exists
 if checkpoint_file.exists():
     try:
-        with open(checkpoint_file, "r") as cp:
+        with open(checkpoint_file) as cp:
             data = json.load(cp)
             start_page = data.get("last_processed_page", 0)
+            logger.info("Resuming from checkpoint at page %d", start_page)
     except (json.JSONDecodeError, OSError):
         logger.warning("Checkpoint file corrupted. Restarting from beginning.")
         start_page = 0

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 199 - 207, The checkpoint loading block unnecessarily passes the "r" mode to open() and lacks a resume log; change open(checkpoint_file, "r") to open(checkpoint_file) and after setting start_page = data.get("last_processed_page", 0) add a logger.info (or debug) call that logs the resumed start_page (e.g., "Resuming from checkpoint, starting at page {start_page}") so developers can see when and where processing resumes; keep the existing exception handling for json.JSONDecodeError and OSError unchanged and reference checkpoint_file, start_page, and logger when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8bfa7b9a-971d-41b6-85ea-0183643add18
📒 Files selected for processing (1)
openverifiablellm/utils.py
openverifiablellm/utils.py (Outdated)
<<<<<<< HEAD
    extract_text_from_xml(sys.argv[1], write_manifest=True)
=======
>>>>>>> upstream/main
Critical: Unresolved merge conflict markers will cause SyntaxError.
The file contains unresolved Git merge conflict markers. This will cause a syntax error when the module is executed, completely breaking the preprocessing functionality.
Additionally, if the HEAD version (line 434) were kept, it would call extract_text_from_xml twice—once at lines 429-432 and again at line 434—causing double processing.
🐛 Proposed fix: Remove merge conflict markers
extract_text_from_xml(
sys.argv[1],
write_manifest="--no-manifest" not in sys.argv[2:],
)
-<<<<<<< HEAD
- extract_text_from_xml(sys.argv[1], write_manifest=True)
-=======
->>>>>>> upstream/main🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/utils.py` around lines 433 - 436, Remove the unresolved Git
conflict markers and ensure extract_text_from_xml is called only once: delete
the lines with <<<<<<<, =======, and >>>>>>> and remove the duplicated call to
extract_text_from_xml(sys.argv[1], write_manifest=True) so that only the
intended invocation (the one already present in the module flow) remains; verify
the function extract_text_from_xml and any surrounding logic are left intact and
the file parses without conflict markers.
The leftover merge conflict markers in the main block have been removed and the fix has been pushed in commit 612c1e1. Please let me know if any further changes are needed. |
Addressed Issues:
Fixes #54
Screenshots/Recordings:
Additional Notes:
Adds checkpoint support to the preprocessing pipeline so large Wikipedia dump processing can resume from the last processed page instead of restarting from the beginning.
A checkpoint is saved every 1000 pages and removed after successful completion.
Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.