Handle malformed XML in extract_text_from_xml and add edge case test by tharunkumar4562 · Pull Request #67 · AOSSIE-Org/OpenVerifiableLLM

tharunkumar4562 · 2026-03-11T10:02:33Z

Addressed Issues:

Fixes #48

Screenshots/Recordings:

Before fix: malformed XML raised a low-level ParseError without context.

After fix: the error now includes a clear message indicating which
XML dump failed to parse.

Additional Notes:

While running the test suite locally, I noticed that extract_text_from_xml does not provide useful context when encountering malformed XML.

Changes in this PR:

Added error handling around ET.iterparse to catch xml.etree.ElementTree.ParseError.
Log a descriptive error message including the input file path.
Re-raise the error with additional context to improve debugging.
Added a new unit test test_extract_text_from_xml_malformed to verify behaviour when malformed XML is encountered.

This improves robustness of the XML extraction step and prevents unclear crashes when corrupted Wikipedia dumps are processed.

All tests pass locally.

Checklist

My code follows the project's code style and conventions
I have made corresponding changes to the documentation
My changes generate no new warnings or errors
I have joined the Discord server and I will share a link to this PR with the project maintainers there
I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

Bug Fixes
- Safer XML extraction with atomic temporary-file replacement, improved memory handling, and clearer contextual parse errors that avoid leaving partial outputs.
- Verification now records manifest chunk-size as a passing report field unconditionally (avoids prior mismatch check).
Public API
- Hashing call now requires keyword-only input for data or file (behavior preserved).
Tests
- Added tests for malformed XML, midstream parse failures, and temporary-file cleanup.

coderabbitai · 2026-03-11T10:02:52Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

extract_text_from_xml now writes to a temporary file and parses atomically with try/except that cleans temp artifacts and re-raises contextual ET.ParseError; compute_sha256 signature requires keyword-only data or file_path; verify_preprocessing records manifest_chunk_size_bytes as a PASS without comparing to reproduced manifest; tests added for malformed and midstream XML failures.

Changes

Cohort / File(s)	Summary
Utilities `openverifiablellm/utils.py`	`extract_text_from_xml`: write to `.tmp` then atomically replace on success; catch `ET.ParseError`, log context, remove temp file on failure, re-raise preserving parse info; clear elements during parsing. `compute_sha256` signature changed to keyword-only: `def compute_sha256(*, data: Optional[Union[bytes, bytearray]] = None, file_path: Optional[Union[str, Path]] = None) -> str`.
Verification `openverifiablellm/verify.py`	In `verify_preprocessing`, remove equality check against reproduced manifest for `manifest_chunk_size_bytes`; instead always record a PASS `CheckResult` containing the manifest value.
Tests `tests/test_util.py`	Add tests for malformed XML and midstream failure (`test_extract_text_from_xml_malformed`, `test_extract_text_from_xml_midstream_failure`), assert no leftover temp/processed files on failure; add/adjust success-path test to verify `.txt.tmp` removal; import `defusedxml.ElementTree as ET`.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Fix NameError in verify_preprocessing when validating chunk_size_bytes #43: Similar changes around manifest_chunk_size_bytes handling in verify_preprocessing — related validation logic adjustments.
Improve XML extraction logic and update tests #29: Changes to extract_text_from_xml (temp-file handling / ParseError behavior) — closely related to utils updates.
Deterministic Preprocessing Verification Mode that re-runs dataset preprocessing and validates that all generated artifacts #25: Touches compute_sha256 usage and verification flow — relevant to the public API/signature change.

Suggested labels

Python Lang

Suggested reviewers

Archit381

Poem

🐇 I hopped through files and left a temp trail,
Cleared dusty tags so parsers won't fail,
Wrapped parsing tight and moved the good home,
No stray crumbs left where broken bits roam,
A tidy hop — the pipeline can sail.

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Out of Scope Changes check	⚠️ Warning	The PR includes an out-of-scope change: modifying verify_preprocessing to remove manifest validation logic and record manifest_chunk_size_bytes as unconditional PASS, which is unrelated to the issue `#48` malformed XML handling.	Remove the manifest validation changes from openverifiablellm/verify.py or address them in a separate PR that corresponds to a different issue.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main changes: handling malformed XML in extract_text_from_xml and adding corresponding edge case tests.
Linked Issues check	✅ Passed	The PR successfully implements all coding requirements from issue `#48`: adds ParseError handling with logging, wraps errors with context, creates test_extract_text_from_xml_malformed to verify ParseError propagation, prevents silent failures, and prevents unintended behavior changes.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 192-212: The code currently opens output_path before the XML parse
completes, leaving a truncated wiki_clean.txt if parsing fails; modify the logic
in the block using open_func, ET.iterparse, and clean_wikitext to write to a
temporary file (e.g., via tempfile.NamedTemporaryFile or a temp path) and only
atomically replace output_path with os.replace(temp_path, output_path) after the
parse loop finishes successfully, ensuring any ET.ParseError logs the input_path
(using logger.error) and the temp file is removed on exceptions; keep
ET.ParseError handling but do not write to the final output until the atomic
rename succeeds.

In `@tests/test_util.py`:
- Around line 156-174: The test should assert the specific exception type thrown
by utils.extract_text_from_xml instead of any Exception: change the
pytest.raises to expect xml.etree.ElementTree.ParseError (ET.ParseError) and
verify the exception message includes the input file path (input_file) or the
"Failed to parse XML" prefix to ensure the implementation's contextual error is
preserved; target the test function test_extract_text_from_xml_malformed and the
utils.extract_text_from_xml call when updating the assertion.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5e43f95c-39b7-404f-b9cb-6c09abc7b12f

📥 Commits

Reviewing files that changed from the base of the PR and between a2dfd7a and 013ee70.

📒 Files selected for processing (2)

openverifiablellm/utils.py
tests/test_util.py

openverifiablellm/utils.py

tests/test_util.py

… output on XML parse failure

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 217-222: The cleanup block currently swallows all exceptions from
temp_output_path.unlink(); change the broad except to only catch OSError (e.g.
except OSError as e:) and emit a warning or logger.warn including the
temp_output_path and exception details so the leftover .tmp is diagnosable,
while still not suppressing the original parse error (do not swallow other
exception types).
- Around line 223-224: The current re-raise creates a fresh ET.ParseError losing
parser metadata; instead, preserve the original exception's structured fields by
creating a new ET.ParseError with your contextual message and then copy
e.position and any other metadata (e.g., e.code) from the caught exception
before raising it (use the caught variable name e and ET.ParseError and include
msg as the new message). Ensure you copy attributes defensively (use getattr(e,
"position", None) and getattr(e, "code", None)) so downstream handlers still see
the original position/code on the raised exception.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: cc2aa65f-4f7f-4d36-8880-1cbd10195694

📥 Commits

Reviewing files that changed from the base of the PR and between 013ee70 and f979df7.

📒 Files selected for processing (1)

openverifiablellm/utils.py

openverifiablellm/utils.py

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 212-215: The except block catching ET.ParseError should log the
full traceback so failures are diagnosable: replace the current
logger.error(msg) call in the ET.ParseError handler (the variables input_path
and msg are built there) with logger.exception(msg) or logger.error(msg,
exc_info=True) so the original stack/exception context is preserved when parsing
fails in that code path.
- Around line 235-236: The current code logs "Preprocessing complete" before
calling generate_manifest, which can produce misleading success logs if manifest
generation fails; move the logger.info call so it occurs after
generate_manifest(input_path, output_path) completes (or split into two logs:
one for extraction completion and another for successful manifest creation).
Specifically, update the sequence around generate_manifest, logger.info, and the
variables input_path/output_path so that logger.info("Preprocessing complete.
Output saved to %s", output_path) executes only after
generate_manifest(input_path, output_path) returns successfully.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9202e810-90b2-4631-948e-248db2253280

📥 Commits

Reviewing files that changed from the base of the PR and between f979df7 and 6a79ac4.

📒 Files selected for processing (1)

openverifiablellm/utils.py

openverifiablellm/utils.py

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_util.py`:
- Around line 206-222: The test's failing_iterparse currently raises before a
page is completed, so change it to raise only after the first page has been
fully processed by detecting the page end event: iterate over original_iter
inside failing_iterparse, yield events normally, track when you see an event
where elem.tag.endswith("page") (or the corresponding end event) and only then
raise a _ET.ParseError to simulate a midstream failure after one page has been
written; this ensures utils.extract_text_from_xml actually writes partial data
to tmp_file (and then the test can assert cleanup/partial-write behavior).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: df6352f4-b8ea-4736-aaa9-cf315616ce0d

📥 Commits

Reviewing files that changed from the base of the PR and between 6a79ac4 and f2f54df.

📒 Files selected for processing (2)

openverifiablellm/utils.py
tests/test_util.py

tests/test_util.py

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/utils.py`:
- Around line 236-238: The success path in extract_text_from_xml references an
undefined variable write_manifest causing a NameError; fix by removing the
conditional guard and always calling generate_manifest(input_path, output_path)
at the end of extract_text_from_xml (or alternatively add a parameter
write_manifest with a default and use it), but prefer the safer change:
unguarded call to generate_manifest so extract_text_from_xml calls
generate_manifest(input_path, output_path) unconditionally.

In `@openverifiablellm/verify.py`:
- Around line 288-296: The current code unconditionally adds a PASS CheckResult
for "manifest_chunk_size_bytes" using report.add(CheckResult(...
name="manifest_chunk_size_bytes" ...)), which falsely signals success before any
comparison; remove or change this synthetic PASS and instead defer checking to
the reproduced-manifest comparison code: do not call report.add(...) here, and
in the reproduced-manifest block use the existing helper
_check_field(original_manifest.get("chunk_size_bytes"),
regenerated_manifest.get("chunk_size_bytes"), name="manifest_chunk_size_bytes")
(or call _check_field with the correct symbol names) so the stored
chunk_size_bytes is compared to the regenerated value and only marked PASS/FAIL
based on that comparison. Ensure the check uses the same field name
"manifest_chunk_size_bytes" so the report entry is produced by _check_field
rather than pre-marked.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0ece45bb-2460-4911-aa81-f1be8ee5f47a

📥 Commits

Reviewing files that changed from the base of the PR and between f2f54df and b1a6fbb.

📒 Files selected for processing (2)

openverifiablellm/utils.py
openverifiablellm/verify.py

coderabbitai · 2026-03-11T18:58:05Z

openverifiablellm/utils.py

+    # manifest generation must succeed before we declare the pipeline complete
+    if write_manifest:
+        generate_manifest(input_path, output_path)


⚠️ Potential issue | 🔴 Critical

write_manifest is undefined on the success path.

Line 237 references write_manifest, but extract_text_from_xml does not define or accept it. Any successful parse now ends in a NameError before completion; tests/test_util.py:121 still calls this function with only input_path.

Suggested fix

-def extract_text_from_xml(input_path): +def extract_text_from_xml(input_path, *, write_manifest: bool = True):

If manifest writing is not meant to be optional, the safer fix is to remove the guard and always call generate_manifest(...).

🧰 Tools

🪛 Ruff (0.15.5)

[error] 237-237: Undefined name write_manifest

(F821)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 236 - 238, The success path in extract_text_from_xml references an undefined variable write_manifest causing a NameError; fix by removing the conditional guard and always calling generate_manifest(input_path, output_path) at the end of extract_text_from_xml (or alternatively add a parameter write_manifest with a default and use it), but prefer the safer change: unguarded call to generate_manifest so extract_text_from_xml calls generate_manifest(input_path, output_path) unconditionally.

openverifiablellm/verify.py

…rison check

Handle malformed XML in extract_text_from_xml and add edge case test

013ee70

github-actions bot added backend python size/M first-time-contributor pending-coderabbit-review and removed size/M labels Mar 11, 2026

coderabbitai bot requested changes Mar 11, 2026

View reviewed changes

openverifiablellm/utils.py Outdated Show resolved Hide resolved

tests/test_util.py Outdated Show resolved Hide resolved

github-actions bot added size/M and removed size/M labels Mar 11, 2026

Write parsed output to temp file before finalizing to avoid truncated…

f979df7

… output on XML parse failure

github-actions bot added size/M and removed size/M labels Mar 11, 2026

coderabbitai bot requested changes Mar 11, 2026

View reviewed changes

openverifiablellm/utils.py Outdated Show resolved Hide resolved

openverifiablellm/utils.py Outdated Show resolved Hide resolved

Preserve ParseError metadata when wrapping XML parse exceptions

6a79ac4

github-actions bot added size/M and removed size/M labels Mar 11, 2026

coderabbitai bot requested changes Mar 11, 2026

View reviewed changes

openverifiablellm/utils.py Outdated Show resolved Hide resolved

openverifiablellm/utils.py Outdated Show resolved Hide resolved

Use ET.ParseError in malformed XML test instead of generic Exception

edbcd03

github-actions bot removed the size/M label Mar 11, 2026

github-actions bot added the size/M label Mar 11, 2026

Improve error logging and temp file cleanup in XML preprocessing

f2f54df

github-actions bot added size/M and removed size/M labels Mar 11, 2026

coderabbitai bot requested changes Mar 11, 2026

View reviewed changes

tests/test_util.py Outdated Show resolved Hide resolved

Fix merge conflicts; improve XML preprocessing logging/cleanup

b1a6fbb

github-actions bot added size/M and removed size/M labels Mar 11, 2026

coderabbitai bot requested changes Mar 11, 2026

View reviewed changes

Resolve merge conflicts; improve XML parsing logging/cleanup

6c886aa

github-actions bot added size/M and removed size/M labels Mar 11, 2026

Remove synthetic PASS for manifest_chunk_size_bytes and rely on compa…

dc01a67

…rison check

github-actions bot added size/M and removed size/M labels Mar 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Handle malformed XML in extract_text_from_xml and add edge case test#67

Handle malformed XML in extract_text_from_xml and add edge case test#67
tharunkumar4562 wants to merge 8 commits intoAOSSIE-Org:mainfrom
tharunkumar4562:fix-malformed-xml

tharunkumar4562 commented Mar 11, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Mar 11, 2026 •

edited

Loading

Reviews paused

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Mar 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

tharunkumar4562 commented Mar 11, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Addressed Issues:

Screenshots/Recordings:

Additional Notes:

Checklist

⚠️ AI Notice - Important!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tharunkumar4562 commented Mar 11, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 11, 2026 •

edited

Loading