AI-powered OCR and data extraction system for Brazilian real estate registration documents (matrículas de imóveis).
Complete processing pipeline:
PDF → pdf2image → OpenCV Preprocessing → Google Vision OCR →
Text Preprocessing (Regex/NER) → GPT-4o mini (Extraction) → Pydantic Validation → Supabase
- PDF/Image Processing: Converts PDFs to images using pdf2image
- Image Enhancement: OpenCV preprocessing for optimal OCR quality
- Advanced OCR: Google Cloud Vision API for text extraction
- Intelligent Extraction: GPT-4o mini for proofreading and structured data extraction
- Data Validation: Pydantic models for type safety and validation
- Database: Supabase (PostgreSQL) for storage
- REST API: FastAPI with async support
- Background Processing: Async task queue for document processing
- Comprehensive Logging: Structured logging with Loguru
- Python 3.10+
- Supabase account
- Google Cloud Vision API credentials
- OpenAI API key
- Poppler (for PDF processing)
  - Download from: https://github.com/oschwartz10612/poppler-windows/releases/
  - Extract and add the bin folder to PATH
  - Or install via conda: conda install -c conda-forge poppler
- Clone the repository
    cd real-estate-doc-processor-server
- Create virtual environment
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies
    pip install -r requirements.txt
- Set up environment variables
    cp .env.example .env
  Edit .env with your credentials:
  - SUPABASE_URL and SUPABASE_KEY
  - GOOGLE_APPLICATION_CREDENTIALS path
  - OPENAI_API_KEY
- Set up Supabase database
  - Go to your Supabase project
  - Open SQL Editor
  - Run the script from database/schema.sql
- Create required directories
    mkdir -p uploads processed logs
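Run the application:
    python -m src.main
Or with uvicorn (auto-reload, suited to development):
    uvicorn src.main:app --reload --host 0.0.0.0 --port 8000
With multiple workers (suited to production):
    uvicorn src.main:app --host 0.0.0.0 --port 8000 --workers 4
The API will be available at: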
- API: http://localhost:8000
- Docs: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
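As a rough illustration of addressing the API at the base URL above, the sketch below builds request URLs with the standard library only. The routes and the `page`, `per_page`, and `status` query parameters come from the endpoint list in this README; the helper names `list_documents_url` and `document_url` are hypothetical and not part of the project.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "http://localhost:8000"  # default address from this README


def list_documents_url(page=1, per_page=100, status=None, base_url=BASE_URL):
    """Build the URL for GET /api/documents/ with pagination and an optional status filter."""
    params = {"page": page, "per_page": per_page}
    if status is not None:
        params["status"] = status
    return urljoin(base_url, "/api/documents/") + "?" + urlencode(params)


def document_url(document_id, resource="status", base_url=BASE_URL):
    """Build the URL of a per-document resource: status, full-text or extracted-data."""
    return urljoin(base_url, f"/api/documents/{document_id}/{resource}")
```

The resulting strings can be passed to any HTTP client; the response format is not documented here, so it is left out of the sketch.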
POST /api/documents/upload
Content-Type: multipart/form-data
file: <PDF or Image file>

GET /api/documents/?page=1&per_page=100&status=completed
GET /api/documents/{document_id}/status
GET /api/documents/{document_id}/full-text
GET /api/documents/{document_id}/extracted-data
POST /api/documents/{document_id}/reprocess
DELETE /api/documents/{document_id}
GET /health

The system extracts the following information:
- Matrícula number
- CNM (Código Nacional de Matrícula)
- Cartório (Registry office)
- Comarca
- Registration date
- Official name and title
- City and state
- Property type (Apartamento, Casa, Terreno, etc.)
- Full address
- Private area (área privativa)
- Total area (área total)
- Building/condominium name
- Unit number and floor
- Number of bedrooms
- Parking spaces
- Current owner name
- CPF/CNPJ
- Previous owner
- Ownership validation metadata
- Referenced matrícula
- Transaction type
- Creditor, debtor, guarantor
- CNPJ information
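One way an extracted CPF could be sanity-checked before storage is the standard two-check-digit rule, sketched below. This README does not say how the project validates CPF values, so the function `is_valid_cpf` is an assumption for illustration, not the project's code.

```python
import re


def is_valid_cpf(value: str) -> bool:
    """Return True if `value` holds 11 digits whose two check digits are consistent."""
    digits = [int(ch) for ch in re.sub(r"\D", "", value)]
    if len(digits) != 11 or len(set(digits)) == 1:
        # Wrong length, or a repeated-digit sequence such as 111.111.111-11
        return False
    for position in (9, 10):
        # Weights run from position + 1 down to 2 over the preceding digits.
        weights = range(position + 1, 1, -1)
        total = sum(d * w for d, w in zip(digits[:position], weights))
        # Equivalent to 11 - (total % 11), with results of 10 or 11 mapped to 0.
        expected = (total * 10) % 11 % 10
        if digits[position] != expected:
            return False
    return True
```

A check like this catches single-digit OCR misreads in most cases, since changing one digit usually breaks at least one of the two check digits.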
- PDF to Image (pdf2image)
  - Converts each PDF page to a 300 DPI PNG image
  - Supports multi-page documents
- Image Preprocessing (OpenCV)
  - Grayscale conversion
  - Denoising
  - Contrast enhancement (CLAHE)
  - Adaptive thresholding
  - Deskewing
  - Morphological operations
- OCR (Google Vision)
  - Document text detection
  - Confidence scoring
  - Language detection
  - Structured data extraction
- Text Preprocessing
  - Text normalization
  - OCR error correction
  - Regex-based pattern matching
  - Named entity recognition
- AI Extraction (GPT-4o mini)
  - Proofreading
  - Structured data extraction
  - Data validation
  - Confidence scoring
- Database Storage (Supabase)
  - Document metadata
  - Full OCR text
  - Extracted structured data
  - Processing history
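The text normalization and regex-based pattern matching in the text preprocessing step might look roughly like the sketch below. The patterns cover only the formatted CPF and CNPJ layouts; the function names and the sample text in the usage note are assumptions, and the project's actual rules in text_preprocessor.py are not shown in this README.

```python
import re
import unicodedata

# Formatted CPF (000.000.000-00) and CNPJ (00.000.000/0000-00) layouts.
CPF_RE = re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b")
CNPJ_RE = re.compile(r"\b\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}\b")


def normalize_text(raw: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", raw)
    return re.sub(r"\s+", " ", text).strip()


def find_identifiers(text: str) -> dict:
    """Return the formatted CPF and CNPJ numbers found in `text`."""
    return {"cpf": CPF_RE.findall(text), "cnpj": CNPJ_RE.findall(text)}
```

For example, `find_identifiers(normalize_text(ocr_text))` returns a dict of candidate identifiers that a later stage can cross-check against the model's output.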
Poppler not found:
    # Install via conda
    conda install -c conda-forge poppler
    # Or add to PATH manually

Google Cloud Vision errors:
- Check GOOGLE_APPLICATION_CREDENTIALS path
- Verify service account has Vision API enabled
- Ensure billing is enabled on GCP project

Timeouts:
- Adjust PROCESSING_TIMEOUT_SECONDS in .env
- Implement exponential backoff (already included)

Database connection errors:
- Verify Supabase credentials
- Check network connectivity
- Ensure database schema is created
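The exponential backoff mentioned above is described as already included, but its code is not shown in this README. The following is a generic sketch of the technique, not the project's implementation; `retry_with_backoff` and its parameters are hypothetical names.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call `func` until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% random jitter so concurrent clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in production code it would usually be better to catch only the timeout and rate-limit errors of the client being wrapped, rather than every exception.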
| Variable | Description | Required |
|---|---|---|
| SUPABASE_URL | Supabase project URL | Yes |
| SUPABASE_KEY | Supabase anon key | Yes |
| SUPABASE_SERVICE_KEY | Supabase service role key | Yes |
| GOOGLE_APPLICATION_CREDENTIALS | Path to GCP service account JSON | Yes |
| GOOGLE_CLOUD_PROJECT_ID | GCP project ID | Yes |
| OPENAI_API_KEY | OpenAI API key | Yes |
| OPENAI_MODEL | Model to use (default: gpt-4o-mini) | No |
| MAX_UPLOAD_SIZE_MB | Max file size (default: 10) | No |
| HOST | Server host (default: 0.0.0.0) | No |
| PORT | Server port (default: 8000) | No |
| DEBUG | Debug mode (default: True) | No |
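A minimal sketch of reading the variables above, assuming plain `os.environ` access. The variable names and defaults come from the table; the `load_settings` helper is hypothetical, and the project's own code under src/config may work differently.

```python
import os

REQUIRED = (
    "SUPABASE_URL",
    "SUPABASE_KEY",
    "SUPABASE_SERVICE_KEY",
    "GOOGLE_APPLICATION_CREDENTIALS",
    "GOOGLE_CLOUD_PROJECT_ID",
    "OPENAI_API_KEY",
)
DEFAULTS = {
    "OPENAI_MODEL": "gpt-4o-mini",
    "MAX_UPLOAD_SIZE_MB": "10",
    "HOST": "0.0.0.0",
    "PORT": "8000",
    "DEBUG": "True",
}


def load_settings(env=None):
    """Return a settings dict, raising if a required variable is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        settings[name] = env.get(name, default)
    # Convert the typed values from their string form.
    settings["MAX_UPLOAD_SIZE_MB"] = int(settings["MAX_UPLOAD_SIZE_MB"])
    settings["PORT"] = int(settings["PORT"])
    settings["DEBUG"] = str(settings["DEBUG"]).strip().lower() in ("1", "true", "yes")
    return settings
```

Failing at startup with the full list of missing names tends to be easier to debug than a credential error raised in the middle of processing a document.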
- API keys stored in environment variables
- Row Level Security (RLS) enabled in Supabase
- CORS configured for frontend origins
- File upload validation
- SQL injection protection via parameterized queries
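The file upload validation listed above could be approached as in the sketch below, which checks the extension, the size limit, and the leading bytes of the file. The 10 MB default comes from MAX_UPLOAD_SIZE_MB; the accepted image formats and the `validate_upload` helper are assumptions, since this README only says "PDF or Image file".

```python
from pathlib import Path

MAX_UPLOAD_SIZE_MB = 10  # default from the configuration table
# Leading bytes of each accepted format; the accepted set itself is an assumption.
MAGIC_BYTES = {
    ".pdf": (b"%PDF-",),
    ".png": (b"\x89PNG\r\n\x1a\n",),
    ".jpg": (b"\xff\xd8\xff",),
    ".jpeg": (b"\xff\xd8\xff",),
}


def validate_upload(filename: str, content: bytes, max_mb: int = MAX_UPLOAD_SIZE_MB) -> None:
    """Raise ValueError if the upload is too large or is not a recognised PDF or image."""
    suffix = Path(filename).suffix.lower()
    if suffix not in MAGIC_BYTES:
        raise ValueError(f"Unsupported file type: {suffix or 'none'}")
    if len(content) > max_mb * 1024 * 1024:
        raise ValueError(f"File exceeds {max_mb} MB")
    if not content.startswith(MAGIC_BYTES[suffix]):
        raise ValueError("File content does not match its extension")
```

Checking the leading bytes as well as the extension guards against renamed files, which would otherwise fail later inside pdf2image or the OCR step.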
real-estate-doc-processor-server/
├── src/
│   ├── api/              # FastAPI routes
│   ├── config/           # Configuration settings
│   ├── models/           # Pydantic models
│   ├── services/         # Business logic
│   │   ├── database.py
│   │   ├── pdf_processor.py
│   │   ├── image_preprocessor.py
│   │   ├── ocr_service.py
│   │   ├── text_preprocessor.py
│   │   ├── gpt_extractor.py
│   │   └── document_processor.py
│   ├── utils/            # Utilities
│   └── main.py           # Application entry point
├── database/
│   └── schema.sql        # Supabase database schema
├── uploads/              # Uploaded files
├── processed/            # Processed images
├── logs/                 # Application logs
├── .env                  # Environment variables
├── requirements.txt      # Python dependencies
└── README.md
# Dockerfile example
FROM python:3.11-slim
# ... deployment configuration

heroku create your-app-name
git push heroku main

Deploy using your preferred cloud platform with Python 3.10+ support.
MIT License
Contributions welcome! Please open an issue or submit a pull request.
Built with ❤️ for Brazilian real estate professionals