luzbery1593/passport-data-extractor

Passport Data Extractor

Image: Clean passport review interface with autosave and CSV/JSON export.

A passport OCR and MRZ extraction system for processing passport PDFs, scanned documents, image-based documents, and multi-page passport PDFs.

The project extracts structured passport data, validates MRZ fields, supports manual review, saves processed records internally, and exports reviewed data as CSV or JSON.

Overview

Passport Data Extractor is designed for passport-focused document automation workflows.

It processes uploaded passport files, extracts MRZ and OCR data, validates key fields, and provides a clean review interface where extracted data can be corrected and saved automatically.

The backend supports both single-document extraction and multi-page PDF extraction. When a PDF contains multiple passport pages, each page is processed as an individual passport record with its own document ID.

Image: Docker Compose running backend and frontend services.

Features

  • Upload passport documents as PDF or image files.
  • Process multi-page passport PDFs page by page.
  • Support digital PDFs, scanned PDFs, and image-based documents.
  • Extract MRZ data using PassportEye.
  • Use OCR fallback for scanned or difficult documents.
  • Apply image preprocessing before MRZ extraction.
  • Retry MRZ extraction on enhanced image variants.
  • Validate passport numbers using MRZ check digits.
  • Parse passport holder name, passport number, date of birth, nationality, and expiration date.
  • Resolve international country codes using ISO alpha-3 codes.
  • Store uploaded documents with unique document IDs.
  • Store each page from a multi-page passport PDF as an individual record.
  • Persist extraction results as JSON.
  • Retrieve processed passport records.
  • Manually review and correct extracted fields.
  • Autosave reviewed data.
  • Export all processed records as CSV or JSON.
  • Include test coverage for MRZ parsing and basic API endpoints.
  • Provide Docker development setup.
  • Include guidance for safe synthetic sample assets.

Supported File Types

  • PDF
  • JPG / JPEG
  • PNG
  • TIFF / TIF
  • BMP
  • WEBP
  • HEIC / HEIF (planned, not yet supported)

Extracted Fields

  • Full name
  • Passport number
  • Date of birth
  • Nationality
  • Expiration date
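
As an illustration of where these five fields sit in a passport MRZ, here is a minimal sketch that slices them out of the two 44-character lines of a TD3 (passport booklet) MRZ, following the ICAO 9303 layout. The `PassportFields` class and `parse_td3` function are hypothetical names for this example and are not taken from the project's code; dates are left as raw YYMMDD strings.

```python
from dataclasses import dataclass


@dataclass
class PassportFields:
    """Hypothetical container for the five extracted fields."""
    full_name: str
    passport_number: str
    date_of_birth: str    # raw YYMMDD as printed in the MRZ
    nationality: str      # three-letter code as printed in the MRZ
    expiration_date: str  # raw YYMMDD as printed in the MRZ


def parse_td3(line1: str, line2: str) -> PassportFields:
    """Slice the fields out of a two-line TD3 MRZ (44 characters per line)."""
    if len(line1) != 44 or len(line2) != 44:
        raise ValueError("TD3 MRZ lines must be exactly 44 characters long")

    # Line 1: document code (2), issuing state (3), then SURNAME<<GIVEN<NAMES.
    surname, _, given = line1[5:].partition("<<")
    surname = surname.replace("<", " ").strip()
    given = given.replace("<", " ").strip()

    return PassportFields(
        full_name=f"{given} {surname}".strip(),
        passport_number=line2[0:9].replace("<", ""),
        date_of_birth=line2[13:19],
        nationality=line2[10:13],
        expiration_date=line2[21:27],
    )
```

With the ICAO specimen MRZ (fictitious holder, fictitious state code `UTO`), `parse_td3` yields the name `ANNA MARIA ERIKSSON` and the passport number `L898902C3`.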

Backend Stack

  • Python
  • FastAPI
  • PassportEye
  • PyMuPDF
  • OpenCV
  • Tesseract OCR
  • Pytesseract
  • Pydantic
  • Pycountry
  • Pytest

Frontend Stack

  • React
  • TypeScript
  • Vite
  • Lucide React
  • CSS

API Endpoints

Health Check

GET /health

Extract Passport Data

POST /api/documents/extract

Accepts a passport PDF or image and returns extracted text, structured fields, confidence values, extraction source, and processing method.
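
A client can call this endpoint with a multipart upload. The sketch below builds such a request using only the Python standard library. The form field name `file` is an assumption (check the interactive API docs for the actual parameter name), and `build_extract_request` is a hypothetical helper, not part of the project.

```python
import urllib.request
import uuid


def build_extract_request(
    base_url: str,
    filename: str,
    content: bytes,
    content_type: str = "application/pdf",
    field_name: str = "file",  # assumed form field name
) -> urllib.request.Request:
    """Build (but do not send) a multipart POST for the extract endpoint."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")

    return urllib.request.Request(
        url=f"{base_url}/api/documents/extract",
        data=head + content + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

With the backend running locally, the request could then be sent with `urllib.request.urlopen(build_extract_request("http://127.0.0.1:8000", "sample.pdf", pdf_bytes))` and the JSON response read from the returned object.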

Extract Multi-Page Passport PDF

POST /api/documents/extract-pdf-pages

Accepts one PDF file, renders each page, processes every page as a passport document, and stores each page result with its own document ID.

List Processed Records

GET /api/documents

Returns all processed passport records stored internally.

Get Passport Result

GET /api/documents/{document_id}

Returns one processed passport result by document ID.

Update Extracted Fields

PATCH /api/documents/{document_id}/fields

Allows manual review and correction of extracted fields. Corrected fields are saved with their source set to manual.
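
The effect of a correction can be pictured as follows. This is only a sketch: the record shape (a `fields` mapping whose entries carry `value` and `source`) is an assumption made for illustration, and `apply_manual_corrections` is a hypothetical helper rather than the backend's actual update logic.

```python
from copy import deepcopy


def apply_manual_corrections(record: dict, corrections: dict) -> dict:
    """Return a copy of the record with corrected fields marked as manual."""
    updated = deepcopy(record)
    for name, value in corrections.items():
        # Every corrected field is re-tagged so its origin stays traceable.
        updated["fields"][name] = {"value": value, "source": "manual"}
    return updated
```

Fields that are not mentioned in the corrections keep their original value and source.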

Export JSON

GET /api/documents/export/json

Exports all processed passport records as JSON.

Export CSV

GET /api/documents/export/csv

Exports all processed passport records as CSV.
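
For readers who want to picture the CSV output, the following sketch flattens a list of records into CSV text with one row per record. The column names and the flat record shape are assumptions chosen to mirror the extracted fields listed earlier; the real export may use different headers.

```python
import csv
import io

# Assumed column order; the actual export may differ.
COLUMNS = [
    "document_id",
    "full_name",
    "passport_number",
    "date_of_birth",
    "nationality",
    "expiration_date",
]


def records_to_csv(records: list) -> str:
    """Serialize flat record dicts to CSV text, leaving missing fields empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow({column: record.get(column, "") for column in COLUMNS})
    return buffer.getvalue()
```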

Extraction Strategy

The backend uses a layered passport extraction workflow:

  1. PassportEye attempts MRZ extraction from the original document.
  2. Image variants are generated with contrast, scaling, denoising, and thresholding.
  3. PassportEye retries MRZ extraction on enhanced image variants.
  4. OCR fallback extracts text from PDFs and images when needed.
  5. MRZ parsing validates passport numbers with check digits.
  6. Structured fields are returned with confidence values and extraction sources.
  7. Users can manually correct extracted fields through the review interface.
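
Step 5 relies on the standard MRZ check digit defined in ICAO 9303: each character is mapped to a number (digits keep their value, A to Z map to 10 to 35, the filler `<` counts as 0), multiplied by the repeating weights 7, 3, 1, and the sum is taken modulo 10. The sketch below shows that calculation; the function names are illustrative and not taken from the project's code.

```python
def mrz_char_value(ch: str) -> int:
    """Map one MRZ character to its numeric value per ICAO 9303."""
    if ch in "0123456789":
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"invalid MRZ character: {ch!r}")


def mrz_check_digit(field: str) -> int:
    """Compute the check digit using the repeating weights 7, 3, 1."""
    weights = (7, 3, 1)
    total = sum(mrz_char_value(ch) * weights[i % 3] for i, ch in enumerate(field))
    return total % 10


def passport_number_is_valid(line2: str) -> bool:
    """Compare the computed digit with position 10 of a TD3 second line."""
    return str(mrz_check_digit(line2[0:9])) == line2[9]
```

For the ICAO specimen passport number `L898902C3`, the weighted sum is 316, so the check digit is 6, which matches the tenth character of the specimen's second MRZ line.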

For multi-page PDFs:

  1. The PDF is rendered page by page.
  2. Each page is converted to an image.
  3. Each rendered page is processed as an independent passport document.
  4. Every page receives its own document ID.
  5. Each page result is saved and included in CSV/JSON exports.
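
The per-page bookkeeping in these steps can be sketched as below. It assumes the pages have already been rendered to image bytes (the project lists PyMuPDF in its stack for PDF handling); the record keys and the use of `uuid4` for document IDs are assumptions for illustration only.

```python
import uuid


def build_page_records(source_filename: str, page_images: list) -> list:
    """Create one independent record per rendered page, each with its own ID."""
    records = []
    for page_number, image_bytes in enumerate(page_images, start=1):
        records.append(
            {
                "document_id": uuid.uuid4().hex,  # assumed ID scheme
                "source_filename": source_filename,
                "page_number": page_number,
                "image_size_bytes": len(image_bytes),
            }
        )
    return records
```

Because every page gets its own document ID, each page can later be fetched, corrected, and exported independently of the PDF it came from.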

Extraction Sources

The API may return fields from:

  • passporteye
  • mrz_check_digit
  • mrz
  • rules
  • manual

Project Structure

backend/
  app/
    api/
    core/
    schemas/
    services/
  storage/
    uploads/
    results/
  tests/
  Dockerfile
  requirements.txt

frontend/
  src/
    services/
  Dockerfile
  package.json

samples/
  README.md
  synthetic-passport-mrz.txt

docs/
docker-compose.yml
README.md

Local Development

Backend

cd backend
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload

Backend docs:

http://127.0.0.1:8000/docs

Frontend

cd frontend
npm install
npm run dev

Frontend app:

http://localhost:5173

Docker Development

From the project root:

docker compose build
docker compose up

Backend docs:

http://127.0.0.1:8000/docs

Frontend app:

http://localhost:5173

To stop containers:

Press Ctrl+C in the terminal where Docker Compose is running.

Or, if running detached:

docker compose down

Running Tests

From the backend directory:

source .venv/bin/activate
pytest

Current tests cover:

  • MRZ name parsing.
  • MRZ date parsing.
  • MRZ country code resolution.
  • MRZ check digit validation.
  • Passport number OCR correction.
  • Basic API endpoints.
  • CSV and JSON export endpoints.
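
To give a sense of the kind of logic the date parsing tests exercise, here is a minimal sketch of turning an MRZ `YYMMDD` value into a full date. MRZ dates carry only two year digits, so the century has to be inferred. The rule used here (birth dates that would fall in the future are moved back 100 years, expiration dates are assumed to be in the 2000s) is an assumption for illustration and may differ from the project's parser.

```python
from datetime import date
from typing import Optional


def parse_mrz_date(
    yymmdd: str, *, is_birth_date: bool, today: Optional[date] = None
) -> date:
    """Convert an MRZ YYMMDD string to a date, inferring the century."""
    if len(yymmdd) != 6 or not yymmdd.isdigit():
        raise ValueError(f"expected six digits, got {yymmdd!r}")
    today = today or date.today()
    year = 2000 + int(yymmdd[0:2])
    month, day = int(yymmdd[2:4]), int(yymmdd[4:6])

    # A birth date cannot lie in the future, so fall back to the 1900s.
    if is_birth_date and date(year, month, day) > today:
        year -= 100
    return date(year, month, day)
```

Passing `today` explicitly keeps such a function deterministic under test.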

System Dependencies

This project requires Tesseract OCR installed on the system.

On Ubuntu:

sudo apt update
sudo apt install tesseract-ocr tesseract-ocr-spa

Docker setup installs backend system dependencies inside the backend container.

Storage Behavior

Uploaded documents and generated extraction results are stored locally:

backend/storage/uploads
backend/storage/results

Single uploads are saved as one record.

Multi-page PDFs are rendered page by page. Each page is saved as an individual passport record with its own document ID.

Storage files are ignored by Git to avoid committing sensitive or generated data.
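
A file-per-record layout like this can be sketched as follows. The `<document_id>.json` naming scheme and the helper names are assumptions for illustration; only the use of JSON files under the results directory comes from this README.

```python
import json
from pathlib import Path


def save_result(results_dir: Path, document_id: str, result: dict) -> Path:
    """Write one extraction result to <results_dir>/<document_id>.json."""
    results_dir.mkdir(parents=True, exist_ok=True)
    path = results_dir / f"{document_id}.json"
    path.write_text(
        json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return path


def load_results(results_dir: Path) -> list:
    """Load every stored result, sorted by file name for stable ordering."""
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(results_dir.glob("*.json"))
    ]
```

Keeping one JSON file per document ID makes listing, exporting, and resetting records a matter of plain directory operations, at the cost of the querying features a database would offer.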

Reset Local Demo Data

From the project root:

rm -f backend/storage/results/*.json
rm -f backend/storage/uploads/*
touch backend/storage/uploads/.gitkeep backend/storage/results/.gitkeep

This resets local processed records and uploaded files while keeping storage folders in the repository.

Sample Data

The samples folder is reserved for synthetic passport samples and demo assets.

Do not place real passport scans, identity documents, or personal data in this repository.

Recommended safe demo assets:

  • Synthetic passport images.
  • Synthetic passport PDFs.
  • Fake MRZ text files.
  • Screenshots with fictitious identity data.

Privacy Notes

This project is designed for synthetic or test passport documents.

Do not commit real passport files, personal documents, or sensitive identity data. Uploaded files and generated results are stored locally and ignored by Git.

Project Status

MVP complete.

Current capabilities include passport upload, multi-page PDF passport extraction, MRZ extraction, OCR fallback, manual review, autosave, internal record storage, CSV/JSON export, enhanced PassportEye preprocessing attempts, Docker setup, synthetic sample guidance, MRZ parser tests, and basic API endpoint tests.

Roadmap

  • Dedicated document history view.
  • Stronger MRZ region detection.
  • More OCR fallback engines.
  • PDF report generation.
  • Production database integration.
  • Authentication and user-specific workspaces.
  • Additional endpoint tests.
  • Synthetic passport image/PDF generator for demos.
