Skip to content

Add malformed XML test and fix verify.py regression#49

Open
Shubhamx404 wants to merge 4 commits intoAOSSIE-Org:mainfrom
Shubhamx404:fix-malformed-xml-test
Open

Add malformed XML test and fix verify.py regression#49
Shubhamx404 wants to merge 4 commits intoAOSSIE-Org:mainfrom
Shubhamx404:fix-malformed-xml-test

Conversation

@Shubhamx404
Copy link
Contributor

@Shubhamx404 Shubhamx404 commented Mar 8, 2026

Addressed Issues:

Fixes #48

Summary

Added a missing edge case test for the preprocessing pipeline to ensure that
extract_text_from_xml fails fast when it encounters malformed XML. In the process, a regression in the verification test suite (
openverifiablellm/verify.py) was also fixed to allow all pipeline checks to pass successfully.

Problem

Currently, there are no tests enforcing the failure state for corrupted XML inputs. If a anyone accidentally suppresses the ET.ParseError during a refactor, it will silently pass also , feeding corrupted data into the verify.py pipeline causes a native Python crash, rather than gracefully reporting a pipeline failure.
he pipeline would silently ignore corrupted files and produce incomplete outputs

Problem in verify.py

when verify.py was used against manifest or when the pipeline hit corrupted data, the script crashed with UnboundLocalError instead of produce Verification report.

Solution

  • Created test_extract_text_from_xml_malformed_xmlthis genrate a broken XML file and feeds it to the pipeline while strictly asserting pytest.raises(ET.ParseError).

  • Fixed verification crash for file verify.py

Screenshots/Recordings:

  • here is demonstration of implemention
image
  • here pipeline first ran on the valid Wikipedia dataset, it created dataset_manifest.json in the data/ folder. The Expected values are read from this manifest, while the Actual values are computed in real time from the file currently passed to the script (e.g., invalid_wikipedia_dump.xml)

Checklist

  • [ * ] My code follows the project's code style and conventions
  • [ * ] I have made corresponding changes to the documentation
  • [ * ] My changes generate no new warnings or errors
  • [ * ] I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • [ * ] I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

  • Bug Fixes

    • Verification now skips the chunk-size check for older manifests that lack that field, avoiding false failures and preserving existing pass/skip semantics.
  • Tests

    • Added a unit test to ensure XML extraction raises a parse error on malformed XML.
    • Updated verification tests to expect the legacy chunk-size check to be reported as skipped with an explanatory note.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 8, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7b813047-a631-411b-85c7-8a204df7e89f

📥 Commits

Reviewing files that changed from the base of the PR and between 0d0626a and 573ad01.

📒 Files selected for processing (2)
  • openverifiablellm/verify.py
  • tests/test_verify.py

Walkthrough

Make manifest_chunk_size_bytes comparison conditional in verification (skip with note when absent in legacy manifests) and add a unit test asserting utils.extract_text_from_xml raises ET.ParseError for malformed XML; update verification test expectations for legacy manifests.

Changes

Cohort / File(s) Summary
Manifest chunk-size check
openverifiablellm/verify.py
Change unconditional manifest_chunk_size_bytes comparison to run only when chunk_size_bytes exists in the manifest; add SKIP result with explanatory detail when the field is absent.
XML parse error test
tests/test_util.py
Add unit test that writes malformed XML and asserts defusedxml.ElementTree.ParseError is raised by utils.extract_text_from_xml.
Verification tests adjusted
tests/test_verify.py
Update legacy-manifest test to expect a SKIP entry for manifest_chunk_size_bytes with detail noting the field is absent in older manifests.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

Python Lang

Suggested reviewers

  • Archit381

Poem

"🐰 I hopped through manifests, soft and spry,
I check chunk sizes only when they lie,
When XML breaks and parsing cries,
My test will catch it with bright little eyes,
A nibble, a hop — then off to the sky! 🥕"

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main changes: adding a malformed XML test and fixing a verify.py regression, both of which are reflected in the changeset.
Linked Issues check ✅ Passed The PR meets the objectives from issue #48: it adds a unit test (test_extract_text_from_xml_malformed_xml) that verifies extract_text_from_xml raises ET.ParseError for malformed XML input.
Out of Scope Changes check ✅ Passed All changes are within scope. The malformed XML test directly addresses issue #48, and the verify.py fix ensures the verification pipeline can complete successfully without UnboundLocalError when encountering older manifests.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions bot added size/S and removed size/S labels Mar 8, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/verify.py`:
- Around line 413-419: The manifest check for "chunk_size_bytes" should mirror
the behavior used for other optional fields: if "chunk_size_bytes" is present
call _check_field as currently done; otherwise add a CheckStatus.SKIP entry to
the report (use the same report key "manifest_chunk_size_bytes" and a brief
detail like "Merkle chunk size not present in manifest") so the absence is
explicitly recorded; update the code around the _check_field call that
references manifest["chunk_size_bytes"],
reproduced_manifest.get("chunk_size_bytes"), and the report to include this SKIP
branch.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 291635d0-780c-4aa3-a6ed-a177b1b9e64b

📥 Commits

Reviewing files that changed from the base of the PR and between c258e31 and d741214.

📒 Files selected for processing (2)
  • openverifiablellm/verify.py
  • tests/test_util.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@github-actions github-actions bot added size/S and removed size/S labels Mar 8, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/verify.py`:
- Around line 420-425: The new SKIP branch for manifest_chunk_size_bytes is
untested; update the existing test that checks
raw_merkle_root/processed_merkle_root to also assert that report contains a
CheckResult with name "manifest_chunk_size_bytes", status CheckStatus.SKIP, and
detail "Field absent from manifest (older version)". Locate where the test
inspects the report (the test calling verify and examining report.results) and
add an assertion that at least one entry matches CheckResult.name ==
"manifest_chunk_size_bytes" and CheckResult.status == CheckStatus.SKIP and
CheckResult.detail contains "Field absent from manifest (older version)" so
regressions on this legacy path are caught.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4eab8d5c-5c9b-4641-82c9-24be9e520269

📥 Commits

Reviewing files that changed from the base of the PR and between d741214 and e923a03.

📒 Files selected for processing (1)
  • openverifiablellm/verify.py

@Shubhamx404
Copy link
Contributor Author

hey @Archit381 can you please review this pr

@github-actions github-actions bot added size/S and removed size/S labels Mar 8, 2026
@github-actions github-actions bot added size/S and removed size/S labels Mar 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] : Add edge case test for extract_text_from_xml to handle malformed XML

1 participant