Real Estate Document Processor - Backend

AI-powered OCR and data extraction system for Brazilian real estate registration documents (matrículas de imóveis).

🏗️ Architecture

Complete processing pipeline:

PDF → pdf2image → OpenCV Preprocessing → Google Vision OCR →
Text Preprocessing (Regex/NER) → GPT-4o mini (Extraction) → Pydantic Validation → Supabase

🚀 Features

PDF/Image Processing: Converts PDFs to images using pdf2image
Image Enhancement: OpenCV preprocessing for optimal OCR quality
Advanced OCR: Google Cloud Vision API for text extraction
Intelligent Extraction: GPT-4o mini for proofreading and structured data extraction
Data Validation: Pydantic models for type safety and validation
Database: Supabase (PostgreSQL) for storage
REST API: FastAPI with async support
Background Processing: Async task queue for document processing
Comprehensive Logging: Structured logging with Loguru

📋 Prerequisites

Python 3.10+
Supabase account
Google Cloud Vision API credentials
OpenAI API key
Poppler (for PDF processing)

Install Poppler (Windows)

Download from: https://github.com/oschwartz10612/poppler-windows/releases/
Extract and add bin folder to PATH
Or install via conda: conda install -c conda-forge poppler

🔧 Installation

Clone the repository
```
cd real-estate-doc-processor-server
```

Create virtual environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies
```
pip install -r requirements.txt
```
Set up environment variables
```
cp .env.example .env
```
Edit .env with your credentials:
- SUPABASE_URL and SUPABASE_KEY
- GOOGLE_APPLICATION_CREDENTIALS path
- OPENAI_API_KEY
Set up Supabase database
- Go to your Supabase project
- Open SQL Editor
- Run the script from database/schema.sql
Create required directories
```
mkdir -p uploads processed logs
```

🎯 Running the Server

Development

python -m src.main

Or with uvicorn:

uvicorn src.main:app --reload --host 0.0.0.0 --port 8000

Production

uvicorn src.main:app --host 0.0.0.0 --port 8000 --workers 4

The API will be available at:

📡 API Endpoints

Document Upload

POST /api/documents/upload
Content-Type: multipart/form-data

file: <PDF or Image file>

Get Documents

GET /api/documents/?page=1&per_page=100&status=completed

Get Document Status

GET /api/documents/{document_id}/status

Get Full OCR Text

GET /api/documents/{document_id}/full-text

Get Extracted Data

GET /api/documents/{document_id}/extracted-data

Reprocess Document

POST /api/documents/{document_id}/reprocess

Delete Document

DELETE /api/documents/{document_id}

Health Check

GET /health

📊 Data Extraction

The system extracts the following information:

Registry Information

Matrícula number
CNM (Código Nacional de Matrícula)
Cartório (Registry office)
Comarca
Registration date
Official name and title
City and state

Property Information

Property type (Apartamento, Casa, Terreno, etc.)
Full address
Private area (área privativa)
Total area (área total)
Building/condominium name
Unit number and floor
Number of bedrooms
Parking spaces

Owner Information

Current owner name
CPF/CNPJ
Previous owner
Ownership validation metadata

Transaction Information

Referenced matrícula
Transaction type
Creditor, debtor, guarantor
CNPJ information

🔄 Processing Pipeline Details

PDF to Image (pdf2image)
- Converts each PDF page to 300 DPI PNG image
- Supports multi-page documents
Image Preprocessing (OpenCV)
- Grayscale conversion
- Denoising
- Contrast enhancement (CLAHE)
- Adaptive thresholding
- Deskewing
- Morphological operations
OCR (Google Vision)
- Document text detection
- Confidence scoring
- Language detection
- Structured data extraction
Text Preprocessing
- Text normalization
- OCR error correction
- Regex-based pattern matching
- Named entity recognition
AI Extraction (GPT-4o mini)
- Proofreading
- Structured data extraction
- Data validation
- Confidence scoring
Database Storage (Supabase)
- Document metadata
- Full OCR text
- Extracted structured data
- Processing history

🐛 Troubleshooting

Poppler not found

# Install via conda
conda install -c conda-forge poppler

# Or add to PATH manually

Google Vision errors

Check GOOGLE_APPLICATION_CREDENTIALS path
Verify service account has Vision API enabled
Ensure billing is enabled on GCP project

OpenAI rate limits

Adjust PROCESSING_TIMEOUT_SECONDS in .env
Implement exponential backoff (already included)

Database connection issues

Verify Supabase credentials
Check network connectivity
Ensure database schema is created

📝 Environment Variables

Variable	Description	Required
`SUPABASE_URL`	Supabase project URL	Yes
`SUPABASE_KEY`	Supabase anon key	Yes
`SUPABASE_SERVICE_KEY`	Supabase service role key	Yes
`GOOGLE_APPLICATION_CREDENTIALS`	Path to GCP service account JSON	Yes
`GOOGLE_CLOUD_PROJECT_ID`	GCP project ID	Yes
`OPENAI_API_KEY`	OpenAI API key	Yes
`OPENAI_MODEL`	Model to use (default: gpt-4o-mini)	No
`MAX_UPLOAD_SIZE_MB`	Max file size (default: 10)	No
`HOST`	Server host (default: 0.0.0.0)	No
`PORT`	Server port (default: 8000)	No
`DEBUG`	Debug mode (default: True)	No

🔐 Security

API keys stored in environment variables
Row Level Security (RLS) enabled in Supabase
CORS configured for frontend origins
File upload validation
SQL injection protection via parameterized queries

📦 Project Structure

real-estate-doc-processor-server/
├── src/
│   ├── api/              # FastAPI routes
│   ├── config/           # Configuration settings
│   ├── models/           # Pydantic models
│   ├── services/         # Business logic
│   │   ├── database.py
│   │   ├── pdf_processor.py
│   │   ├── image_preprocessor.py
│   │   ├── ocr_service.py
│   │   ├── text_preprocessor.py
│   │   ├── gpt_extractor.py
│   │   └── document_processor.py
│   ├── utils/            # Utilities
│   └── main.py           # Application entry point
├── database/
│   └── schema.sql        # Supabase database schema
├── uploads/              # Uploaded files
├── processed/            # Processed images
├── logs/                 # Application logs
├── .env                  # Environment variables
├── requirements.txt      # Python dependencies
└── README.md

🚀 Deployment

Docker (Coming Soon)

# Dockerfile example
FROM python:3.11-slim
# ... deployment configuration

Heroku

heroku create your-app-name
git push heroku main

AWS/GCP

Deploy using your preferred cloud platform with Python 3.10+ support.

📄 License

MIT License

🤝 Contributing

Contributions welcome! Please open an issue or submit a pull request.

Built with ❤️ for Brazilian real estate professionals

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
Cities		Cities
database		database
processed		processed
src		src
uploads		uploads
.env		.env
.env.example		.env.example
.gitignore		.gitignore
CREDENTIALS_SETUP.md		CREDENTIALS_SETUP.md
PROJECT_SUMMARY.md		PROJECT_SUMMARY.md
QUICK_START.md		QUICK_START.md
README.md		README.md
SETUP.md		SETUP.md
VOCABULARY_CORRECTION.md		VOCABULARY_CORRECTION.md
requirements.txt		requirements.txt
run.py		run.py
test_vocabulary_correction.py		test_vocabulary_correction.py

Folders and files

Latest commit

History

Repository files navigation

Real Estate Document Processor - Backend

🏗️ Architecture

🚀 Features

📋 Prerequisites

Install Poppler (Windows)

🔧 Installation

🎯 Running the Server

Development

Production

📡 API Endpoints

Document Upload

Get Documents

Get Document Status

Get Full OCR Text

Get Extracted Data

Reprocess Document

Delete Document

Health Check

📊 Data Extraction

Registry Information

Property Information

Owner Information

Transaction Information

🔄 Processing Pipeline Details

🐛 Troubleshooting

Poppler not found

Google Vision errors

OpenAI rate limits

Database connection issues

📝 Environment Variables

🔐 Security

📦 Project Structure

🚀 Deployment

Docker (Coming Soon)

Heroku

AWS/GCP

📄 License

🤝 Contributing

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages