#Clean passport review interface with autosave and CSV/JSON export.
A passport OCR and MRZ extraction system for single-page and multi-page passport PDFs, scanned documents, and image files.
The project extracts structured passport data, validates MRZ fields, supports manual review, saves processed records internally, and exports reviewed data as CSV or JSON.
Passport Data Extractor is designed for passport-focused document automation workflows.
It processes uploaded passport files, extracts MRZ and OCR data, validates key fields, and provides a clean review interface where extracted data can be corrected and saved automatically.
The backend supports both single-document extraction and multi-page PDF extraction. When a PDF contains multiple passport pages, each page is processed as an individual passport record with its own document ID.
#Docker Compose running backend and frontend services.
- Upload passport documents as PDF or image files.
- Process multi-page passport PDFs page by page.
- Support digital PDFs, scanned PDFs, and image-based documents.
- Extract MRZ data using PassportEye.
- Use OCR fallback for scanned or difficult documents.
- Apply image preprocessing before MRZ extraction.
- Retry MRZ extraction on enhanced image variants.
- Validate passport numbers using MRZ check digits.
- Parse passport holder name, passport number, date of birth, nationality, and expiration date.
- Resolve MRZ country codes against ISO 3166-1 alpha-3 codes.
- Store uploaded documents with unique document IDs.
- Store each page from a multi-page passport PDF as an individual record.
- Persist extraction results as JSON.
- Retrieve processed passport records.
- Manually review and correct extracted fields.
- Autosave reviewed data.
- Export all processed records as CSV or JSON.
- Include test coverage for MRZ parsing and basic API endpoints.
- Provide Docker development setup.
- Include guidance for safe synthetic sample assets.
- JPG / JPEG
- PNG
- TIFF / TIF
- BMP
- WEBP
- HEIC / HEIF (planned)
- Full name
- Passport number
- Date of birth
- Nationality
- Expiration date
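The five fields listed above sit at fixed positions in a two-line TD3 (passport booklet) MRZ. The following is a minimal sketch of how they could be sliced out, not the project's actual parser: `PassportFields`, `parse_mrz_name`, and `parse_td3` are hypothetical names, dates are left in the raw `YYMMDD` form printed in the MRZ, and the sample lines use the fictitious ICAO specimen identity.

```python
from dataclasses import dataclass


@dataclass
class PassportFields:
    full_name: str
    passport_number: str
    date_of_birth: str    # raw YYMMDD, as printed in the MRZ
    nationality: str      # three-letter code
    expiration_date: str  # raw YYMMDD, as printed in the MRZ


def parse_mrz_name(name_field: str) -> str:
    """Turn 'SURNAME<<GIVEN<NAMES<<<' into 'GIVEN NAMES SURNAME'."""
    surname, _, given = name_field.partition("<<")
    surname = surname.replace("<", " ").strip()
    given = given.replace("<", " ").strip()
    return " ".join(part for part in (given, surname) if part)


def parse_td3(line1: str, line2: str) -> PassportFields:
    """Slice the fixed-width fields out of a 2 x 44 character TD3 MRZ."""
    if len(line1) != 44 or len(line2) != 44:
        raise ValueError("TD3 MRZ lines must be 44 characters each")
    return PassportFields(
        full_name=parse_mrz_name(line1[5:44]),       # after 'P<' and issuing state
        passport_number=line2[0:9].replace("<", ""),
        nationality=line2[10:13],
        date_of_birth=line2[13:19],
        expiration_date=line2[21:27],
    )


# Fictitious specimen data, safe to keep in a repository.
line1 = "P<UTOERIKSSON<<ANNA<MARIA".ljust(44, "<")
line2 = "L898902C36UTO7408122F1204159ZE184226B<<<<<10"
fields = parse_td3(line1, line2)
```

Check digits are deliberately ignored here; they occupy positions 10, 20, and 28 of the second line and would be validated in a separate step.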
- Python
- FastAPI
- PassportEye
- PyMuPDF
- OpenCV
- Tesseract OCR
- Pytesseract
- Pydantic
- Pycountry
- Pytest
- React
- TypeScript
- Vite
- Lucide React
- CSS
GET /health
POST /api/documents/extract
Accepts a passport PDF or image and returns extracted text, structured fields, confidence values, extraction source, and processing method.
POST /api/documents/extract-pdf-pages
Accepts one PDF file, renders each page, processes every page as a passport document, and stores each page result with its own document ID.
GET /api/documents
Returns all processed passport records stored internally.
GET /api/documents/{document_id}
Returns one processed passport result by document ID.
PATCH /api/documents/{document_id}/fields
Allows manual review and correction of extracted fields. Corrected fields are saved with their source set to manual.
GET /api/documents/export/json
Exports all processed passport records as JSON.
GET /api/documents/export/csv
Exports all processed passport records as CSV.
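As an illustration of what the CSV export might do internally, the sketch below flattens stored records into one row per document using only the standard library. The record layout (`document_id` plus a `fields` mapping of `{"value": ...}` entries) and the column names are assumptions made for this example, not the API's documented schema.

```python
import csv
import io

# Hypothetical column set; the real export may include more columns.
CSV_COLUMNS = [
    "document_id",
    "full_name",
    "passport_number",
    "date_of_birth",
    "nationality",
    "expiration_date",
]


def records_to_csv(records: list[dict]) -> str:
    """Serialize records to CSV text, leaving missing fields empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CSV_COLUMNS)
    writer.writeheader()
    for record in records:
        fields = record.get("fields", {})
        row = {"document_id": record.get("document_id", "")}
        for name in CSV_COLUMNS[1:]:
            row[name] = fields.get(name, {}).get("value", "")
        writer.writerow(row)
    return buffer.getvalue()


sample_records = [
    {
        "document_id": "abc",
        "fields": {
            "full_name": {"value": "ANNA MARIA ERIKSSON"},
            "passport_number": {"value": "L898902C3"},
        },
    }
]
csv_text = records_to_csv(sample_records)
```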
The backend uses a layered passport extraction workflow:
- PassportEye attempts MRZ extraction from the original document.
- Image variants are generated with contrast, scaling, denoising, and thresholding.
- PassportEye retries MRZ extraction on enhanced image variants.
- OCR fallback extracts text from PDFs and images when needed.
- MRZ parsing validates passport numbers with check digits.
- Structured fields are returned with confidence values and extraction sources.
- Users can manually correct extracted fields through the review interface.
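The layered workflow above can be sketched as a simple fallback chain. This is only an outline under stated assumptions: `mrz_extractor`, `make_variants`, and `ocr_fallback` are hypothetical callables standing in for the PassportEye call, the image preprocessing step, and the OCR path, and the `method` labels are invented for the example.

```python
from typing import Callable, Iterable, Optional

MrzExtractor = Callable[[bytes], Optional[dict]]


def extract_with_fallback(
    image: bytes,
    mrz_extractor: MrzExtractor,
    make_variants: Callable[[bytes], Iterable[bytes]],
    ocr_fallback: Callable[[bytes], dict],
) -> dict:
    """Try MRZ extraction on the original, then on enhanced variants, then OCR."""
    result = mrz_extractor(image)
    if result is not None:
        return {"method": "mrz_original", **result}

    # Retry on each enhanced variant (contrast, scaling, denoising, thresholding).
    for variant in make_variants(image):
        result = mrz_extractor(variant)
        if result is not None:
            return {"method": "mrz_enhanced", **result}

    # Nothing yielded an MRZ: fall back to plain OCR text extraction.
    return {"method": "ocr_fallback", **ocr_fallback(image)}


# Stub callables to show the control flow without real image processing.
def fake_mrz(image: bytes) -> Optional[dict]:
    return {"passport_number": "L898902C3"} if image == b"enhanced" else None


outcome = extract_with_fallback(
    b"original",
    mrz_extractor=fake_mrz,
    make_variants=lambda img: [b"blurry", b"enhanced"],
    ocr_fallback=lambda img: {"text": ""},
)
```

Keeping each stage behind a plain callable makes it straightforward to add another OCR engine later without touching the control flow.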
For multi-page PDFs:
- The PDF is rendered page by page.
- Each page is converted to an image.
- Each rendered page is processed as an independent passport document.
- Every page receives its own document ID.
- Each page result is saved and included in CSV/JSON exports.
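The per-page steps above could look roughly like the following. `render_pdf_pages` uses PyMuPDF (imported lazily as `fitz`) to rasterize each page to PNG bytes; `process_pages`, the record keys, and the 200 DPI default are assumptions for illustration rather than the project's actual code.

```python
import uuid
from typing import Callable, Iterable, Iterator


def render_pdf_pages(pdf_bytes: bytes, dpi: int = 200) -> Iterator[bytes]:
    """Yield each PDF page as PNG bytes (requires PyMuPDF)."""
    import fitz  # PyMuPDF; imported here so the rest works without it

    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page in doc:
            yield page.get_pixmap(dpi=dpi).tobytes("png")


def process_pages(
    page_images: Iterable[bytes],
    extract_page: Callable[[bytes], dict],
) -> list[dict]:
    """Treat every rendered page as an independent passport record."""
    records = []
    for page_number, image in enumerate(page_images, start=1):
        records.append(
            {
                "document_id": str(uuid.uuid4()),  # one ID per page
                "page_number": page_number,
                "result": extract_page(image),
            }
        )
    return records


# Usage with placeholder page images and a stub extractor.
page_records = process_pages(
    [b"page-one", b"page-two"],
    extract_page=lambda image: {"bytes_seen": len(image)},
)
```

With a real PDF, the two pieces would be combined as `process_pages(render_pdf_pages(pdf_bytes), extract_page)`.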
The API may return fields from the following extraction sources:
- passporteye
- mrz_check_digit
- mrz
- rules
- manual
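To show how the `manual` source might be applied during review, here is a minimal sketch that overwrites corrected fields and tags them accordingly. The record shape (a `fields` mapping with `value` and `source` keys) and the function name `apply_manual_corrections` are assumptions, not the backend's confirmed schema.

```python
import copy


def apply_manual_corrections(record: dict, corrections: dict[str, str]) -> dict:
    """Return a copy of the record with corrected fields marked as manual."""
    updated = copy.deepcopy(record)  # leave the stored original untouched
    fields = updated.setdefault("fields", {})
    for name, value in corrections.items():
        fields[name] = {"value": value, "source": "manual"}
    return updated


stored = {
    "document_id": "abc",
    "fields": {
        "passport_number": {"value": "L8989O2C3", "source": "rules"},
        "nationality": {"value": "UTO", "source": "mrz"},
    },
}
reviewed = apply_manual_corrections(stored, {"passport_number": "L898902C3"})
```

Fields that the reviewer does not touch keep their original source, so an export can still distinguish machine-extracted values from corrected ones.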

    backend/
        app/
            api/
            core/
            schemas/
            services/
        storage/
            uploads/
            results/
        tests/
        Dockerfile
        requirements.txt
    frontend/
        src/
            services/
        Dockerfile
        package.json
    samples/
        README.md
        synthetic-passport-mrz.txt
    docs/
    docker-compose.yml
    README.md


    cd backend
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
    uvicorn app.main:app --reload

Backend docs:

    cd frontend
    npm install
    npm run dev

Frontend app:
From the project root:

    docker compose build
    docker compose up

Backend docs:
Frontend app:
To stop containers:
CTRL + C
Or, if running detached:
docker compose down
From the backend directory:

    source .venv/bin/activate
    pytest

Current tests cover:
- MRZ name parsing.
- MRZ date parsing.
- MRZ country code resolution.
- MRZ check digit validation.
- Passport number OCR correction.
- Basic API endpoints.
- CSV and JSON export endpoints.
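For reference, a check digit test along the lines listed above might look like this. The algorithm is the standard ICAO 9303 scheme (repeating weights 7, 3, 1; digits keep their value, `A` to `Z` map to 10 to 35, the filler `<` counts as 0; the sum is taken modulo 10). The function names are hypothetical and the values come from the fictitious ICAO specimen passport.

```python
def mrz_char_value(char: str) -> int:
    """Map one MRZ character to its numeric value."""
    if "0" <= char <= "9":
        return int(char)
    if "A" <= char <= "Z":
        return ord(char) - ord("A") + 10
    if char == "<":
        return 0
    raise ValueError(f"invalid MRZ character: {char!r}")


def compute_check_digit(field: str) -> int:
    """Weighted sum with repeating weights 7, 3, 1, modulo 10."""
    weights = (7, 3, 1)
    total = sum(
        mrz_char_value(char) * weights[index % 3]
        for index, char in enumerate(field)
    )
    return total % 10


def has_valid_check_digit(field: str, check_char: str) -> bool:
    """Compare the printed check character with the computed digit."""
    return check_char == str(compute_check_digit(field))


def test_specimen_check_digits():
    assert compute_check_digit("L898902C3") == 6  # passport number
    assert compute_check_digit("740812") == 2     # date of birth
    assert compute_check_digit("120415") == 9     # expiration date
    assert not has_valid_check_digit("L8989O2C3", "6")  # letter O misread for 0
```

The last assertion shows why check digits are useful for OCR correction: a single misread character changes the weighted sum and is detected.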
This project requires Tesseract OCR to be installed on the system.
On Ubuntu:

    sudo apt update
    sudo apt install tesseract-ocr tesseract-ocr-spa

Docker setup installs backend system dependencies inside the backend container.
Uploaded documents and generated extraction results are stored locally:

    backend/storage/uploads
    backend/storage/results

Single uploads are saved as one record.
Multi-page PDFs are rendered page by page. Each page is saved as an individual passport record with its own document ID.
Storage files are ignored by Git to avoid committing sensitive or generated data.
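A file-based store like the one described here can be sketched in a few lines. This assumes one `<document_id>.json` file per record inside the results folder; the naming scheme and the helper names `save_result` and `load_all_results` are illustrative guesses, not the project's confirmed layout.

```python
import json
import tempfile
import uuid
from pathlib import Path


def save_result(result: dict, results_dir: Path) -> str:
    """Write one extraction result to its own JSON file and return its ID."""
    document_id = str(uuid.uuid4())
    results_dir.mkdir(parents=True, exist_ok=True)
    payload = {"document_id": document_id, **result}
    path = results_dir / f"{document_id}.json"
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return document_id


def load_all_results(results_dir: Path) -> list[dict]:
    """Read every stored record back, for listing or export."""
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(results_dir.glob("*.json"))
    ]


# Usage against a temporary directory instead of backend/storage/results.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "results"
    first_id = save_result({"fields": {"nationality": "UTO"}}, demo_dir)
    second_id = save_result({"fields": {"nationality": "UTO"}}, demo_dir)
    stored_records = load_all_results(demo_dir)
```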
From the project root:

    rm -f backend/storage/results/*.json
    rm -f backend/storage/uploads/*
    touch backend/storage/uploads/.gitkeep backend/storage/results/.gitkeep

This resets local processed records and uploaded files while keeping storage folders in the repository.
The samples folder is reserved for synthetic passport samples and demo assets.
Do not place real passport scans, identity documents, or personal data in this repository.
Recommended safe demo assets:
- Synthetic passport images.
- Synthetic passport PDFs.
- Fake MRZ text files.
- Screenshots with fictitious identity data.
This project is designed for synthetic or test passport documents.
Do not commit real passport files, personal documents, or sensitive identity data. Uploaded files and generated results are stored locally and ignored by Git.
MVP complete.
Current capabilities include passport upload, multi-page PDF passport extraction, MRZ extraction, OCR fallback, manual review, autosave, internal record storage, CSV/JSON export, enhanced PassportEye preprocessing attempts, Docker setup, synthetic sample guidance, MRZ parser tests, and basic API endpoint tests.
- Dedicated document history view.
- Stronger MRZ region detection.
- More OCR fallback engines.
- PDF report generation.
- Production database integration.
- Authentication and user-specific workspaces.
- Additional endpoint tests.
- Synthetic passport image/PDF generator for demos.