feat: add schema validation for LLM extracted fields #213
utkarshqz wants to merge 1 commit into fireform-core:main
Conversation
@utkarshqz It is great that you opened this PR for the issue, but there was already an open PR for it. If you want to improve the implementation, please push your code to my PR branch — creating a new PR is redundant and would increase the load on the maintainers.
Hey @Acuspeedster — thanks for pointing that out and for working on PR #117. I agree that we should avoid creating unnecessary load for the maintainers.

The reason I opened a separate PR (#213) instead of pushing to your branch is mainly the difference in scope. From what I observed, PR #117 introduces several additional components beyond the JSON validation itself — including the faster-whisper transcription pipeline, new ML dependencies, and Docker-related updates. Those are substantial architectural additions.

For Issue #114 specifically, I wanted to keep the implementation minimal and focused by isolating only the API JSON/schema validation logic and hallucination-defense checks (such as formatting validation and repetition detection). This keeps the PR smaller, easier to review, and directly aligned with the requirements of the issue.

Since the scope and implementation approach differ quite a bit, I thought it would be helpful to keep this as a separate, focused PR so maintainers can review it independently as an alternative implementation for Issue #114.
Summary
This PR adds field-level schema validation to the LLM extraction pipeline, directly addressing GSoC Expected Outcome #1 which requires "improved AI extraction accuracy through schema validation".
After Mistral extracts values from the transcript, each value is now automatically validated against expected patterns for its field type before being written to the PDF. Validation issues are reported as structured warnings — never as hard failures — ensuring the pipeline remains robust while giving developers visibility into extraction quality.
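The warn-don't-fail behavior described above could be sketched roughly like this (function and field names here are hypothetical, for illustration only — the actual method lives in `src/llm.py`):

```python
import re

def validate_extracted_fields(fields: dict) -> list[str]:
    """Validate extracted values against expected patterns.

    Returns a list of warning strings; never raises, so the
    PDF-writing pipeline keeps running regardless of issues.
    """
    warnings = []
    for name, value in fields.items():
        if value is None:
            continue  # empty fields are skipped — no false positives
        # Illustrative check: an email-ish field should contain "@" and a dot
        if "email" in name and not re.search(r"@.+\.", str(value)):
            warnings.append(f"{name}: {value!r} does not look like an email")
    return warnings  # callers log these as structured warnings

# A malformed value produces a warning, but nothing fails hard
print(validate_extracted_fields({"email": "johndoe", "phone": None}))
```

The key design point is that validation output is data (a list of warnings) rather than control flow (an exception), which is what keeps the pipeline robust.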
Closes / Fixes
Closes #114
Addresses #173 — hallucination detection catches repeated values across fields
Addresses #186 — LLM test coverage now at 40 tests (was 0)
Type of change
What changed and why
1. 🔍 `validate_extracted_fields()` — new method in `src/llm.py`

Called automatically inside `main_loop()` after every extraction. Runs 5 checks:

- Phone values that don't match a phone pattern (e.g. `"not-a-phone"`) → warning
- Email values missing an `@` and a domain (e.g. `"johndoe"`) → warning
- Date values not in an accepted format (`DD/MM/YYYY`, `YYYY-MM-DD`, etc.) → warning
- Vague date values (e.g. `"yesterday"`) → warning
- Identical values repeated across fields (e.g. `{"f1": "John", "f2": "John", "f3": "John"}`) → warning

Design decisions:

- `None` values are skipped — no false positives for empty fields
- Warnings are exposed via `get_validation_warnings()`

Real output example (from local testing):
Or when issues are found:
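As a rough sketch of how checks along these lines could be implemented (the regex patterns, helper names, and repetition threshold here are assumptions for illustration, not the exact code from this PR):

```python
import re
from collections import Counter

# Assumed patterns — illustrative, not the PR's actual regexes
PHONE_RE = re.compile(r"^\+?[\d\s().-]{7,}$")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
DATE_RES = [re.compile(r"^\d{2}/\d{2}/\d{4}$"),   # DD/MM/YYYY
            re.compile(r"^\d{4}-\d{2}-\d{2}$")]   # YYYY-MM-DD

def check_fields(fields: dict) -> list[str]:
    warnings = []
    for name, value in fields.items():
        if value is None:
            continue  # skip empty fields
        v = str(value)
        if "phone" in name and not PHONE_RE.match(v):
            warnings.append(f"{name}: {v!r} is not a valid phone number")
        if "email" in name and not EMAIL_RE.match(v):
            warnings.append(f"{name}: {v!r} is missing an @ or domain")
        if "date" in name and not any(p.match(v) for p in DATE_RES):
            warnings.append(f"{name}: {v!r} is not a recognized date format")
    # Hallucination defense: the same value echoed across many fields
    counts = Counter(str(v) for v in fields.values() if v is not None)
    for v, n in counts.items():
        if n >= 3:  # threshold is an assumption
            warnings.append(f"value {v!r} repeated across {n} fields (possible hallucination)")
    return warnings

# The repeated-value example from above flags the repetition
print(check_fields({"f1": "John", "f2": "John", "f3": "John"}))
```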
2. 🧪 5 new unit tests — `tests/test_llm.py::TestSchemaValidation`

- `test_valid_fields_return_no_warnings`
- `test_invalid_email_flagged` — value without `@` → warning produced
- `test_repeated_values_flagged_as_hallucination`
- `test_null_values_skipped` — `None` values → no false positive warnings
- `test_warnings_stored_on_instance` — `get_validation_warnings()` returns correct data

3. 📚
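Two of the tests above might look roughly like this (sketched against a minimal stand-in class, since the real tests exercise the method on the LLM class in `src/llm.py`):

```python
# Sketch of tests/test_llm.py — FakeLLM is a hypothetical stand-in
# mirroring the interface described in this PR, not the real class.

class FakeLLM:
    def __init__(self):
        self._warnings = []

    def validate_extracted_fields(self, fields):
        # Flag values that appear in 3+ fields (threshold is an assumption)
        self._warnings = [f"{k}: repeated value {v!r}"
                          for k, v in fields.items()
                          if v is not None and list(fields.values()).count(v) >= 3]
        return self._warnings

    def get_validation_warnings(self):
        return self._warnings

def test_repeated_values_flagged_as_hallucination():
    llm = FakeLLM()
    warnings = llm.validate_extracted_fields({"f1": "John", "f2": "John", "f3": "John"})
    assert len(warnings) == 3          # every offending field is flagged
    assert llm.get_validation_warnings() == warnings  # stored on the instance

def test_null_values_skipped():
    llm = FakeLLM()
    assert llm.validate_extracted_fields({"f1": None, "f2": None, "f3": None}) == []
```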
`docs/TESTING.md` — updated with a `TestSchemaValidation` section describing all 5 new test cases

How Has This Been Tested?
Ran `python -m pytest tests/ -v`: 57 passed, 14 warnings in 0.35s. Verified validation runs via the automatic `main_loop()` call ✅

Test Configuration:
Checklist