Passport Data Extractor

Screenshot: Clean passport review interface with autosave and CSV/JSON export.

A passport OCR and MRZ extraction system for processing passport PDFs, scanned documents, image-based documents, and multi-page passport PDFs.

The project extracts structured passport data, validates MRZ fields, supports manual review, saves processed records internally, and exports reviewed data as CSV or JSON.

Overview

Passport Data Extractor is designed for passport-focused document automation workflows.

It processes uploaded passport files, extracts MRZ and OCR data, validates key fields, and provides a clean review interface where extracted data can be corrected and saved automatically.

The backend supports both single-document extraction and multi-page PDF extraction. When a PDF contains multiple passport pages, each page is processed as an individual passport record with its own document ID.

Screenshot: Docker Compose running backend and frontend services.

Features

Upload passport documents as PDF or image files.
Process multi-page passport PDFs page by page.
Support digital PDFs, scanned PDFs, and image-based documents.
Extract MRZ data using PassportEye.
Use OCR fallback for scanned or difficult documents.
Apply image preprocessing before MRZ extraction.
Retry MRZ extraction on enhanced image variants.
Validate passport numbers using MRZ check digits.
Parse passport holder name, passport number, date of birth, nationality, and expiration date.
Resolve MRZ country codes against ISO 3166-1 alpha-3 codes.
Store uploaded documents with unique document IDs.
Store each page from a multi-page passport PDF as an individual record.
Persist extraction results as JSON.
Retrieve processed passport records.
Manually review and correct extracted fields.
Autosave reviewed data.
Export all processed records as CSV or JSON.
Include test coverage for MRZ parsing and basic API endpoints.
Provide Docker development setup.
Include guidance for safe synthetic sample assets.

Supported File Types

PDF
JPG / JPEG
PNG
TIFF / TIF
BMP
WEBP
HEIC / HEIF (planned)

Extracted Fields

Full name
Passport number
Date of birth
Nationality
Expiration date
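
As an illustration of where these five fields sit in a passport MRZ, here is a minimal sketch that slices a two-line TD3 MRZ at the positions defined by ICAO Doc 9303. It is not the project's parser: `PassportFields`, `parse_name`, `parse_date`, and `parse_td3` are hypothetical names, the "given names first" ordering is an assumption, and so is the century rule for two-digit years. The sample lines are the widely used ICAO specimen MRZ.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PassportFields:
    full_name: str
    passport_number: str
    date_of_birth: date
    nationality: str
    expiration_date: date


def parse_name(name_field: str) -> str:
    """Turn 'SURNAME<<GIVEN<NAMES<<<' into 'GIVEN NAMES SURNAME'."""
    surname, _, given = name_field.partition("<<")
    parts = given.replace("<", " ").split() + surname.replace("<", " ").split()
    return " ".join(parts)


def parse_date(yymmdd: str, *, birth: bool, today: date) -> date:
    """Resolve a YYMMDD value.

    Assumption: expiry dates fall in 20YY, and birth dates use the latest
    century that does not place the date after `today`.
    """
    yy, mm, dd = int(yymmdd[:2]), int(yymmdd[2:4]), int(yymmdd[4:6])
    year = 2000 + yy
    if birth and (year, mm, dd) > (today.year, today.month, today.day):
        year -= 100
    return date(year, mm, dd)


def parse_td3(line1: str, line2: str, today: date) -> PassportFields:
    """Slice a TD3 MRZ (two lines of 44 characters) into structured fields."""
    if len(line1) != 44 or len(line2) != 44:
        raise ValueError("TD3 MRZ lines must be 44 characters long")
    return PassportFields(
        full_name=parse_name(line1[5:44]),
        passport_number=line2[0:9].rstrip("<"),
        date_of_birth=parse_date(line2[13:19], birth=True, today=today),
        nationality=line2[10:13].rstrip("<"),
        expiration_date=parse_date(line2[21:27], birth=False, today=today),
    )


# ICAO specimen MRZ (fictitious holder, fictitious state code UTO).
LINE1 = "P<UTOERIKSSON<<ANNA<MARIA".ljust(44, "<")
LINE2 = "L898902C36UTO7408122F1204159ZE184226B<<<<<10"
fields = parse_td3(LINE1, LINE2, today=date(2024, 1, 1))
print(fields.full_name, fields.passport_number, fields.date_of_birth)
```

Passing `today` explicitly keeps the century decision deterministic, which makes this kind of helper easier to test.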

Backend Stack

Python
FastAPI
PassportEye
PyMuPDF
OpenCV
Tesseract OCR
Pytesseract
Pydantic
Pycountry
Pytest

Frontend Stack

React
TypeScript
Vite
Lucide React
CSS

API Endpoints

Health Check

GET /health

Extract Passport Data

POST /api/documents/extract

Accepts a passport PDF or image and returns extracted text, structured fields, confidence values, extraction source, and processing method.

Extract Multi-Page Passport PDF

POST /api/documents/extract-pdf-pages

Accepts one PDF file, renders each page, processes every page as a passport document, and stores each page result with its own document ID.

List Processed Records

GET /api/documents

Returns all processed passport records stored internally.

Get Passport Result

GET /api/documents/{document_id}

Returns one processed passport result by document ID.

Update Extracted Fields

PATCH /api/documents/{document_id}/fields

Allows manual review and correction of extracted fields. Corrected fields are saved with their source set to manual.
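
The effect of a correction on a stored record might look like the sketch below. The record layout (a `fields` mapping whose entries carry `value`, `confidence`, and `source`) and the function name `apply_manual_corrections` are assumptions made for illustration; the actual request and response schema is defined by the backend and is not documented here.

```python
from copy import deepcopy


def apply_manual_corrections(record: dict, corrections: dict) -> dict:
    """Return a copy of `record` with corrected values marked as manual.

    Assumed layout: record["fields"][name] = {"value", "confidence", "source"}.
    Only the value and source are changed; other keys are left untouched.
    """
    updated = deepcopy(record)
    fields = updated.setdefault("fields", {})
    for name, value in corrections.items():
        entry = fields.setdefault(name, {})
        entry["value"] = value
        entry["source"] = "manual"
    return updated


# A record where OCR read the letter O instead of the digit 0.
stored = {
    "document_id": "demo-1",
    "fields": {
        "passport_number": {
            "value": "L8989O2C3",
            "confidence": 0.6,
            "source": "passporteye",
        }
    },
}
reviewed = apply_manual_corrections(stored, {"passport_number": "L898902C3"})
print(reviewed["fields"]["passport_number"]["source"])
```

Working on a copy keeps the original extraction available for comparison until the corrected record is written back.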

Export JSON

GET /api/documents/export/json

Exports all processed passport records as JSON.

Export CSV

GET /api/documents/export/csv

Exports all processed passport records as CSV.
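
To show roughly what a CSV export involves, the following sketch flattens stored records into one row per document using Python's standard `csv` module. The column names and the record layout (`fields[name]["value"]`) are assumptions chosen for illustration; the columns of the real export may differ.

```python
import csv
import io

# Hypothetical column order for the export.
COLUMNS = [
    "document_id",
    "full_name",
    "passport_number",
    "date_of_birth",
    "nationality",
    "expiration_date",
]


def records_to_csv(records: list) -> str:
    """Flatten records into CSV text with one row per document."""
    buffer = io.StringIO()
    # Missing fields become empty cells; unknown fields are ignored.
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        row = {"document_id": record.get("document_id", "")}
        for name, entry in record.get("fields", {}).items():
            row[name] = entry.get("value", "")
        writer.writerow(row)
    return buffer.getvalue()


csv_text = records_to_csv(
    [
        {
            "document_id": "doc-1",
            "fields": {
                "full_name": {"value": "ANNA MARIA ERIKSSON"},
                "passport_number": {"value": "L898902C3"},
            },
        }
    ]
)
print(csv_text)
```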

Extraction Strategy

The backend uses a layered passport extraction workflow:

PassportEye attempts MRZ extraction from the original document.
Image variants are generated with contrast, scaling, denoising, and thresholding.
PassportEye retries MRZ extraction on enhanced image variants.
OCR fallback extracts text from PDFs and images when needed.
MRZ parsing validates passport numbers with check digits.
Structured fields are returned with confidence values and extraction sources.
Users can manually correct extracted fields through the review interface.
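
The check digit step can be illustrated with the algorithm from ICAO Doc 9303: each character is mapped to a number, multiplied by the repeating weights 7, 3, 1, and the sum is taken modulo 10. This is a minimal sketch with illustrative function names, not the project's own implementation; the sample values come from the ICAO specimen MRZ.

```python
DIGITS = "0123456789"


def mrz_char_value(ch: str) -> int:
    """Map one MRZ character to its numeric value (ICAO Doc 9303)."""
    if ch in DIGITS:
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10  # A=10 ... Z=35
    if ch == "<":
        return 0  # filler character
    raise ValueError(f"invalid MRZ character: {ch!r}")


def mrz_check_digit(field: str) -> int:
    """Weighted sum with repeating weights 7, 3, 1, taken modulo 10."""
    weights = (7, 3, 1)
    total = sum(mrz_char_value(ch) * weights[i % 3] for i, ch in enumerate(field))
    return total % 10


def is_valid_field(field: str, check_char: str) -> bool:
    """Compare the computed check digit with the one printed in the MRZ."""
    if len(check_char) != 1 or check_char not in DIGITS:
        return False
    return mrz_check_digit(field) == int(check_char)


# Passport number, date of birth, and expiry date from the ICAO specimen.
print(mrz_check_digit("L898902C3"))   # 6
print(is_valid_field("740812", "2"))  # True
print(is_valid_field("120415", "9"))  # True
```

A field that fails this check is a natural candidate for the OCR fallback or for manual review.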

For multi-page PDFs:

The PDF is rendered page by page.
Each page is converted to an image.
Each rendered page is processed as an independent passport document.
Every page receives its own document ID.
Each page result is saved and included in CSV/JSON exports.

Extraction Sources

The API may return fields from:

passporteye
mrz_check_digit
mrz
rules
manual

Project Structure

backend/
  app/
    api/
    core/
    schemas/
    services/
  storage/
    uploads/
    results/
  tests/
  Dockerfile
  requirements.txt

frontend/
  src/
    services/
  Dockerfile
  package.json

samples/
  README.md
  synthetic-passport-mrz.txt

docs/
docker-compose.yml
README.md

Local Development

Backend

cd backend
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload

Backend docs:

http://127.0.0.1:8000/docs

Frontend

cd frontend
npm install
npm run dev

Frontend app:

http://localhost:5173

Docker Development

From the project root:

docker compose build
docker compose up

Backend docs:

http://127.0.0.1:8000/docs

Frontend app:

http://localhost:5173

To stop containers:

Press Ctrl+C in the terminal where docker compose up is running.

Or, if running detached:

docker compose down

Running Tests

From the backend directory:

source .venv/bin/activate
pytest

Current tests cover:

MRZ name parsing.
MRZ date parsing.
MRZ country code resolution.
MRZ check digit validation.
Passport number OCR correction.
Basic API endpoints.
CSV and JSON export endpoints.
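
The passport number OCR correction covered by these tests is not described in this README. One plausible approach, sketched below purely as an assumption, is to swap characters that OCR commonly confuses and keep the variant whose MRZ check digit matches. The confusion table and the function names are hypothetical and may not match what the project does.

```python
from itertools import product
from typing import Optional

# Assumed pairs of characters that OCR engines commonly mix up.
CONFUSIONS = {
    "O": "0", "0": "O",
    "I": "1", "1": "I",
    "S": "5", "5": "S",
    "B": "8", "8": "B",
    "Z": "2", "2": "Z",
}


def check_digit(field: str) -> int:
    """ICAO Doc 9303 check digit: weights 7, 3, 1, sum modulo 10."""
    total = 0
    for i, ch in enumerate(field):
        if ch in "0123456789":
            value = int(ch)
        elif "A" <= ch <= "Z":
            value = ord(ch) - ord("A") + 10
        elif ch == "<":
            value = 0
        else:
            raise ValueError(f"invalid MRZ character: {ch!r}")
        total += value * (7, 3, 1)[i % 3]
    return total % 10


def correct_passport_number(number: str, check_char: str) -> Optional[str]:
    """Return the variant with the fewest substitutions that passes the check.

    Returns None when nothing matches or when the best match is ambiguous.
    """
    if len(check_char) != 1 or check_char not in "0123456789":
        return None
    expected = int(check_char)
    # Each position keeps its character and may also try its look-alike.
    options = [(ch, CONFUSIONS[ch]) if ch in CONFUSIONS else (ch,) for ch in number]
    matches = []
    for candidate in product(*options):
        text = "".join(candidate)
        if check_digit(text) == expected:
            changes = sum(a != b for a, b in zip(text, number))
            matches.append((changes, text))
    if not matches:
        return None
    matches.sort()
    fewest = matches[0][0]
    best = [text for changes, text in matches if changes == fewest]
    return best[0] if len(best) == 1 else None


# The letter O was read in place of the digit 0; the check digit is 6.
print(correct_passport_number("L8989O2C3", "6"))  # L898902C3
```

Returning None on ambiguous matches leaves the decision to the manual review step instead of silently guessing.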

System Dependencies

This project requires Tesseract OCR installed on the system.

On Ubuntu:

sudo apt update
sudo apt install tesseract-ocr tesseract-ocr-spa

Docker setup installs backend system dependencies inside the backend container.

Storage Behavior

Uploaded documents and generated extraction results are stored locally:

backend/storage/uploads
backend/storage/results

Single uploads are saved as one record.

Multi-page PDFs are rendered page by page. Each page is saved as an individual passport record with its own document ID.
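
A minimal sketch of this storage pattern, assuming one JSON file per record named after a UUID-based document ID, is shown below. The helper names, the `page_number` key, and the file naming are illustrative assumptions; the backend's actual layout inside backend/storage/results may differ.

```python
import json
import tempfile
import uuid
from pathlib import Path
from typing import Optional


def save_result(results_dir: Path, fields: dict, page_number: Optional[int] = None) -> str:
    """Write one record as <document_id>.json and return the new ID."""
    document_id = uuid.uuid4().hex
    record = {"document_id": document_id, "page_number": page_number, "fields": fields}
    results_dir.mkdir(parents=True, exist_ok=True)
    path = results_dir / f"{document_id}.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return document_id


def load_result(results_dir: Path, document_id: str) -> dict:
    """Read a stored record back by its document ID."""
    path = results_dir / f"{document_id}.json"
    return json.loads(path.read_text(encoding="utf-8"))


def save_pdf_pages(results_dir: Path, page_fields: list) -> list:
    """Store each page of a multi-page PDF as its own record (pages start at 1)."""
    return [
        save_result(results_dir, fields, page_number=number)
        for number, fields in enumerate(page_fields, start=1)
    ]


# Demo with a throwaway directory and fictitious values.
demo_dir = Path(tempfile.mkdtemp())
page_ids = save_pdf_pages(
    demo_dir,
    [{"passport_number": "L898902C3"}, {"passport_number": "X00000000"}],
)
print(len(page_ids), "records written to", demo_dir)
```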

Storage files are ignored by Git to avoid committing sensitive or generated data.

Reset Local Demo Data

From the project root:

rm -f backend/storage/results/*.json
rm -f backend/storage/uploads/*
touch backend/storage/uploads/.gitkeep backend/storage/results/.gitkeep

This resets local processed records and uploaded files while keeping storage folders in the repository.

Sample Data

The samples folder is reserved for synthetic passport samples and demo assets.

Do not place real passport scans, identity documents, or personal data in this repository.

Recommended safe demo assets:

Synthetic passport images.
Synthetic passport PDFs.
Fake MRZ text files.
Screenshots with fictitious identity data.

Privacy Notes

This project is designed for synthetic or test passport documents.

Do not commit real passport files, personal documents, or sensitive identity data. Uploaded files and generated results are stored locally and ignored by Git.

Project Status

MVP complete.

Current capabilities include passport upload, multi-page PDF passport extraction, MRZ extraction, OCR fallback, manual review, autosave, internal record storage, CSV/JSON export, enhanced PassportEye preprocessing attempts, Docker setup, synthetic sample guidance, MRZ parser tests, and basic API endpoint tests.

Roadmap

Dedicated document history view.
Stronger MRZ region detection.
More OCR fallback engines.
PDF report generation.
Production database integration.
Authentication and user-specific workspaces.
Additional endpoint tests.
Synthetic passport image/PDF generator for demos.
