feat: add schema validation for LLM extracted fields #213
utkarshqz wants to merge 1 commit into fireform-core:main
Conversation
@utkarshqz It is great that you opened this PR for the issue, but there was already an open PR for it. If you want to improve the implementation, please push your code to my PR branch — creating a new PR is redundant and would increase the load on the maintainers.
Hey @Acuspeedster — thanks for pointing that out and for working on PR #117. I agree that we should avoid creating unnecessary load for the maintainers.

The reason I opened a separate PR (#213) instead of pushing to your branch is mainly the difference in scope. From what I observed, PR #117 introduces several additional components beyond the JSON validation itself — including the faster-whisper transcription pipeline, new ML dependencies, and Docker-related updates. Those are substantial architectural additions.

For Issue #114 specifically, I wanted to keep the implementation minimal and focused by isolating only the API JSON/schema validation logic and hallucination-defense checks (such as formatting validation and repetition detection). This keeps the PR smaller, easier to review, and directly aligned with the requirements of the issue.

Since the scope and implementation approach differ quite a bit, I thought it would be helpful to keep this as a separate, focused PR so maintainers can review it independently as an alternative implementation for Issue #114.
Summary
This PR adds field-level schema validation to the LLM extraction pipeline, directly addressing GSoC Expected Outcome #1 which requires "improved AI extraction accuracy through schema validation".
After Mistral extracts values from the transcript, each value is now automatically validated against expected patterns for its field type before being written to the PDF. Validation issues are reported as structured warnings — never as hard failures — ensuring the pipeline remains robust while giving developers visibility into extraction quality.
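The warn-don't-fail behavior described above could be sketched roughly like this (function and field names here are hypothetical, for illustration only — the actual method lives in `src/llm.py`):

```python
import re

def validate_extracted_fields(fields: dict) -> list[str]:
    """Validate extracted values against expected patterns.

    Returns a list of warning strings; never raises, so the
    PDF-writing pipeline keeps running regardless of issues.
    """
    warnings = []
    for name, value in fields.items():
        if value is None:
            continue  # empty fields are skipped — no false positives
        # Illustrative check: an email-ish field should contain "@" and a dot
        if "email" in name and not re.search(r"@.+\.", str(value)):
            warnings.append(f"{name}: {value!r} does not look like an email")
    return warnings  # callers log these as structured warnings

# A malformed value produces a warning, but nothing fails hard
print(validate_extracted_fields({"email": "johndoe", "phone": None}))
```

The key design point is that validation output is data (a list of warnings) rather than control flow (an exception), which is what keeps the pipeline robust.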
Closes / Fixes
Closes #114
Addresses #173 — hallucination detection catches repeated values across fields
Addresses #186 — LLM test coverage now at 40 tests (was 0)
Type of change
What changed and why
1. 🔍 `validate_extracted_fields()` — new method in `src/llm.py`

Called automatically inside `main_loop()` after every extraction. Runs 5 checks:

- Phone values that don't match a phone pattern (e.g. `"not-a-phone"`) → warning
- Email values missing an `@` and a domain (e.g. `"johndoe"`) → warning
- Date values not in an accepted format (`DD/MM/YYYY`, `YYYY-MM-DD`, etc.) → warning
- Vague date values (e.g. `"yesterday"`) → warning
- Identical values repeated across fields (e.g. `{"f1": "John", "f2": "John", "f3": "John"}`) → warning

Design decisions:

- `None` values are skipped — no false positives for empty fields
- Warnings are exposed via `get_validation_warnings()`

Real output example (from local testing):
Or when issues are found:
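As a rough sketch of how checks along these lines could be implemented (the regex patterns, helper names, and repetition threshold here are assumptions for illustration, not the exact code from this PR):

```python
import re
from collections import Counter

# Assumed patterns — illustrative, not the PR's actual regexes
PHONE_RE = re.compile(r"^\+?[\d\s().-]{7,}$")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
DATE_RES = [re.compile(r"^\d{2}/\d{2}/\d{4}$"),   # DD/MM/YYYY
            re.compile(r"^\d{4}-\d{2}-\d{2}$")]   # YYYY-MM-DD

def check_fields(fields: dict) -> list[str]:
    warnings = []
    for name, value in fields.items():
        if value is None:
            continue  # skip empty fields
        v = str(value)
        if "phone" in name and not PHONE_RE.match(v):
            warnings.append(f"{name}: {v!r} is not a valid phone number")
        if "email" in name and not EMAIL_RE.match(v):
            warnings.append(f"{name}: {v!r} is missing an @ or domain")
        if "date" in name and not any(p.match(v) for p in DATE_RES):
            warnings.append(f"{name}: {v!r} is not a recognized date format")
    # Hallucination defense: the same value echoed across many fields
    counts = Counter(str(v) for v in fields.values() if v is not None)
    for v, n in counts.items():
        if n >= 3:  # threshold is an assumption
            warnings.append(f"value {v!r} repeated across {n} fields (possible hallucination)")
    return warnings

# The repeated-value example from above flags the repetition
print(check_fields({"f1": "John", "f2": "John", "f3": "John"}))
```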
2. 🧪 5 new unit tests — `tests/test_llm.py::TestSchemaValidation`

- `test_valid_fields_return_no_warnings`
- `test_invalid_email_flagged` — value without `@` → warning produced
- `test_repeated_values_flagged_as_hallucination`
- `test_null_values_skipped` — `None` values → no false positive warnings
- `test_warnings_stored_on_instance` — `get_validation_warnings()` returns correct data

3. 📚
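Two of the tests above might look roughly like this (sketched against a minimal stand-in class, since the real tests exercise the method on the LLM class in `src/llm.py`):

```python
# Sketch of tests/test_llm.py — FakeLLM is a hypothetical stand-in
# mirroring the interface described in this PR, not the real class.

class FakeLLM:
    def __init__(self):
        self._warnings = []

    def validate_extracted_fields(self, fields):
        # Flag values that appear in 3+ fields (threshold is an assumption)
        self._warnings = [f"{k}: repeated value {v!r}"
                          for k, v in fields.items()
                          if v is not None and list(fields.values()).count(v) >= 3]
        return self._warnings

    def get_validation_warnings(self):
        return self._warnings

def test_repeated_values_flagged_as_hallucination():
    llm = FakeLLM()
    warnings = llm.validate_extracted_fields({"f1": "John", "f2": "John", "f3": "John"})
    assert len(warnings) == 3          # every offending field is flagged
    assert llm.get_validation_warnings() == warnings  # stored on the instance

def test_null_values_skipped():
    llm = FakeLLM()
    assert llm.validate_extracted_fields({"f1": None, "f2": None, "f3": None}) == []
```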
`docs/TESTING.md` — updated with a `TestSchemaValidation` section describing all 5 new test cases

How Has This Been Tested?
Ran `python -m pytest tests/ -v`: 57 passed, 14 warnings in 0.35s. Verified validation runs via the automatic `main_loop()` call ✅

Test Configuration:
Checklist