GreenCode668/real-estate-doc-processor-server

Real Estate Document Processor - Backend

AI-powered OCR and data extraction system for Brazilian real estate registration documents (matrículas de imóveis).

🏗️ Architecture

Complete processing pipeline:

PDF → pdf2image → OpenCV Preprocessing → Google Vision OCR →
Text Preprocessing (Regex/NER) → GPT-4o mini (Extraction) → Pydantic Validation → Supabase
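
As a rough illustration of how these stages chain together, the sketch below runs a list of named stages in order and reports which one failed. The stage names loosely mirror the diagram; the `run_pipeline` helper and the placeholder stage functions are hypothetical and are not taken from the project's code.

```python
from typing import Any, Callable

# Hypothetical stage type: a name plus a function that takes the previous output.
Stage = tuple[str, Callable[[Any], Any]]


def run_pipeline(stages: list[Stage], payload: Any) -> dict[str, Any]:
    """Run stages in order; stop at the first failure and record where it happened."""
    completed: list[str] = []
    for name, func in stages:
        try:
            payload = func(payload)
        except Exception as exc:  # report the failing stage instead of crashing
            return {"status": "failed", "failed_stage": name,
                    "completed": completed, "error": str(exc)}
        completed.append(name)
    return {"status": "completed", "completed": completed, "result": payload}


# Placeholder stages standing in for the real services (assumptions, not project code).
stages: list[Stage] = [
    ("pdf_to_images", lambda path: [f"{path}#page1"]),
    ("preprocess_images", lambda pages: pages),
    ("ocr", lambda pages: "MATRICULA 12345"),
    ("preprocess_text", lambda text: text.strip()),
    ("extract_fields", lambda text: {"matricula": text.split()[-1]}),
]

outcome = run_pipeline(stages, "example.pdf")
```

Recording the failing stage name is one way to make a stuck document easier to diagnose; the actual service may track progress differently.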

🚀 Features

  • PDF/Image Processing: Converts PDFs to images using pdf2image
  • Image Enhancement: OpenCV preprocessing for optimal OCR quality
  • Advanced OCR: Google Cloud Vision API for text extraction
  • Intelligent Extraction: GPT-4o mini for proofreading and structured data extraction
  • Data Validation: Pydantic models for type safety and validation
  • Database: Supabase (PostgreSQL) for storage
  • REST API: FastAPI with async support
  • Background Processing: Async task queue for document processing
  • Comprehensive Logging: Structured logging with Loguru

📋 Prerequisites

  • Python 3.10+
  • Supabase account
  • Google Cloud Vision API credentials
  • OpenAI API key
  • Poppler (for PDF processing)

Install Poppler (Windows)

  1. Download from: https://github.com/oschwartz10612/poppler-windows/releases/
  2. Extract and add bin folder to PATH
  3. Or install via conda: conda install -c conda-forge poppler

🔧 Installation

  1. Clone the repository

    git clone <repository-url>
    cd real-estate-doc-processor-server
  2. Create virtual environment

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Set up environment variables

    cp .env.example .env

    Edit .env with your credentials:

    • SUPABASE_URL, SUPABASE_KEY, and SUPABASE_SERVICE_KEY
    • GOOGLE_APPLICATION_CREDENTIALS (path to the service account JSON) and GOOGLE_CLOUD_PROJECT_ID
    • OPENAI_API_KEY
  5. Set up Supabase database

    • Go to your Supabase project
    • Open SQL Editor
    • Run the script from database/schema.sql
  6. Create required directories

    mkdir -p uploads processed logs

🎯 Running the Server

Development

python -m src.main

Or with uvicorn:

uvicorn src.main:app --reload --host 0.0.0.0 --port 8000

Production

uvicorn src.main:app --host 0.0.0.0 --port 8000 --workers 4

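With the commands above, the API will be available at http://localhost:8000. FastAPI also serves interactive documentation at /docs unless the application disables it.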

📡 API Endpoints

Document Upload

POST /api/documents/upload
Content-Type: multipart/form-data

file: <PDF or Image file>
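
A minimal client-side sketch of this request, using only the Python standard library. The form field name `file` comes from the description above. The base URL `http://localhost:8000`, the sample filename, and the `build_upload_request` helper are assumptions for illustration. The response format is not documented here, so the sketch stops at building the request.

```python
import urllib.request
import uuid


def build_upload_request(base_url: str, filename: str, content: bytes,
                         content_type: str = "application/pdf") -> urllib.request.Request:
    """Build a multipart/form-data POST with a single form field named 'file'."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        f"Content-Type: {content_type}\r\n\r\n".encode(),
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        url=f"{base_url}/api/documents/upload",
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


# Assumed local server address and a tiny stand-in for real PDF bytes.
request = build_upload_request("http://localhost:8000", "matricula.pdf", b"%PDF-1.4 ...")

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode())
```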

Get Documents

GET /api/documents/?page=1&per_page=100&status=completed

Get Document Status

GET /api/documents/{document_id}/status
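
Because processing runs in the background, a client will typically poll this endpoint until the document reaches a final state. The sketch below shows one way to do that. `fetch_status` stands in for the HTTP call and response parsing, whose shape is not documented here. The value `completed` appears in the list filter above, while `failed`, `pending`, and `processing` are assumed names.

```python
import time
from typing import Callable


def wait_for_document(fetch_status: Callable[[], str],
                      interval_seconds: float = 2.0,
                      max_attempts: int = 30,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the status is terminal or the attempt budget is used up."""
    terminal = {"completed", "failed"}  # "failed" is an assumed status value
    status = fetch_status()
    for _ in range(max_attempts - 1):
        if status in terminal:
            break
        sleep(interval_seconds)
        status = fetch_status()
    return status


# Simulated status sequence instead of real HTTP calls.
fake_responses = iter(["pending", "processing", "completed"])
final = wait_for_document(lambda: next(fake_responses), sleep=lambda _: None)
```

Passing `sleep` as a parameter keeps the helper easy to test without waiting in real time.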

Get Full OCR Text

GET /api/documents/{document_id}/full-text

Get Extracted Data

GET /api/documents/{document_id}/extracted-data

Reprocess Document

POST /api/documents/{document_id}/reprocess

Delete Document

DELETE /api/documents/{document_id}

Health Check

GET /health

📊 Data Extraction

The system extracts the following information:

Registry Information

  • Matrícula number
  • CNM (Código Nacional de Matrícula)
  • Cartório (Registry office)
  • Comarca
  • Registration date
  • Official name and title
  • City and state

Property Information

  • Property type (Apartamento, Casa, Terreno, etc.)
  • Full address
  • Private area (área privativa)
  • Total area (área total)
  • Building/condominium name
  • Unit number and floor
  • Number of bedrooms
  • Parking spaces

Owner Information

  • Current owner name
  • CPF/CNPJ
  • Previous owner
  • Ownership validation metadata

Transaction Information

  • Referenced matrícula
  • Transaction type
  • Creditor, debtor, guarantor
  • CNPJ information

🔄 Processing Pipeline Details

  1. PDF to Image (pdf2image)

    • Converts each PDF page to 300 DPI PNG image
    • Supports multi-page documents
  2. Image Preprocessing (OpenCV)

    • Grayscale conversion
    • Denoising
    • Contrast enhancement (CLAHE)
    • Adaptive thresholding
    • Deskewing
    • Morphological operations
  3. OCR (Google Vision)

    • Document text detection
    • Confidence scoring
    • Language detection
    • Structured data extraction
  4. Text Preprocessing

    • Text normalization
    • OCR error correction
    • Regex-based pattern matching
    • Named entity recognition
  5. AI Extraction (GPT-4o mini)

    • Proofreading
    • Structured data extraction
    • Data validation
    • Confidence scoring
  6. Database Storage (Supabase)

    • Document metadata
    • Full OCR text
    • Extracted structured data
    • Processing history
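
To make the regex matching in step 4 and the validation in step 5 more concrete, here is a small sketch that finds CPF candidates in OCR text and keeps only those whose check digits are correct. It illustrates the general technique and is not the project's implementation. The function names and the sample text are invented for the example.

```python
import re

# CPF candidates: 11 digits, optionally formatted as 000.000.000-00.
CPF_PATTERN = re.compile(r"\b\d{3}\.?\d{3}\.?\d{3}-?\d{2}\b")


def cpf_check_digit(digits: list[int]) -> int:
    """Compute one CPF check digit from the preceding digits (weights count down to 2)."""
    weights = range(len(digits) + 1, 1, -1)
    remainder = (sum(d * w for d, w in zip(digits, weights)) * 10) % 11
    return 0 if remainder == 10 else remainder


def is_valid_cpf(raw: str) -> bool:
    """Validate a CPF by length, the repeated-digit rule, and both check digits."""
    digits = [int(ch) for ch in raw if ch.isdigit()]
    if len(digits) != 11 or len(set(digits)) == 1:
        return False
    return (cpf_check_digit(digits[:9]) == digits[9]
            and cpf_check_digit(digits[:10]) == digits[10])


def find_valid_cpfs(text: str) -> list[str]:
    """Return digits-only CPFs found in the text that pass the check-digit test."""
    found: list[str] = []
    for match in CPF_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if is_valid_cpf(digits) and digits not in found:
            found.append(digits)
    return found


# Invented sample: the first number has valid check digits, the second does not.
sample = "Proprietário: João da Silva, CPF 529.982.247-25, antes CPF 123.456.789-00."
cpfs = find_valid_cpfs(sample)
```

Filtering by check digits is a cheap way to discard numbers that OCR has misread before they reach the database.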

🐛 Troubleshooting

Poppler not found

# Install via conda
conda install -c conda-forge poppler

# Or add Poppler's bin folder to PATH manually (see "Install Poppler" under Prerequisites)

Google Vision errors

  • Check GOOGLE_APPLICATION_CREDENTIALS path
  • Verify that the Vision API is enabled for the GCP project and that the service account can access it
  • Ensure billing is enabled on GCP project

OpenAI rate limits

  • Adjust PROCESSING_TIMEOUT_SECONDS in .env
  • Retries with exponential backoff are already included, so no extra retry logic is needed
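
For readers unfamiliar with the technique, exponential backoff can be sketched as follows. This is a generic illustration, not the project's retry code. `call_with_backoff`, its parameters, and the `flaky` stand-in function are all hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(func: Callable[[], T], max_attempts: int = 5,
                      base_delay: float = 1.0, max_delay: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry func, doubling the wait after each failure and adding random jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:  # real code would catch only the rate-limit error
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("max_attempts must be at least 1")


# Stand-in for an API call that fails twice before succeeding.
attempts = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


waits: list[float] = []
result = call_with_backoff(flaky, sleep=waits.append)
```

The jitter spreads retries from concurrent workers so they do not all hit the API at the same moment.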

Database connection issues

  • Verify Supabase credentials
  • Check network connectivity
  • Ensure database schema is created

📝 Environment Variables

Variable                         Description                           Required
SUPABASE_URL                     Supabase project URL                  Yes
SUPABASE_KEY                     Supabase anon key                     Yes
SUPABASE_SERVICE_KEY             Supabase service role key             Yes
GOOGLE_APPLICATION_CREDENTIALS   Path to GCP service account JSON      Yes
GOOGLE_CLOUD_PROJECT_ID          GCP project ID                        Yes
OPENAI_API_KEY                   OpenAI API key                        Yes
OPENAI_MODEL                     Model to use (default: gpt-4o-mini)   No
MAX_UPLOAD_SIZE_MB               Max file size in MB (default: 10)     No
HOST                             Server host (default: 0.0.0.0)        No
PORT                             Server port (default: 8000)           No
DEBUG                            Debug mode (default: True)            No
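
The sketch below shows one way to read these variables at startup, failing early when a required one is missing and applying the listed defaults otherwise. It uses only the standard library. The project's own configuration module may work differently, and the `Settings` class and `load_settings` function are illustrative names.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Variables marked "Yes" in the list above.
REQUIRED = (
    "SUPABASE_URL", "SUPABASE_KEY", "SUPABASE_SERVICE_KEY",
    "GOOGLE_APPLICATION_CREDENTIALS", "GOOGLE_CLOUD_PROJECT_ID", "OPENAI_API_KEY",
)


@dataclass(frozen=True)
class Settings:
    openai_model: str
    max_upload_size_mb: int
    host: str
    port: int
    debug: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Check required variables, then read optional ones with their documented defaults."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    return Settings(
        openai_model=env.get("OPENAI_MODEL", "gpt-4o-mini"),
        max_upload_size_mb=int(env.get("MAX_UPLOAD_SIZE_MB", "10")),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),
        debug=env.get("DEBUG", "True").strip().lower() in {"1", "true", "yes"},
    )


# Example with placeholder values instead of real credentials.
example_env = {name: "placeholder" for name in REQUIRED}
settings = load_settings(example_env)
```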

🔐 Security

  • API keys stored in environment variables
  • Row Level Security (RLS) enabled in Supabase
  • CORS configured for frontend origins
  • File upload validation
  • SQL injection protection via parameterized queries
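
As an illustration of what file upload validation can involve, the sketch below checks the extension, the size limit, and the PDF header. The allowed extension list is an assumption based on "PDF or Image file", the 10 MB limit is the documented default of MAX_UPLOAD_SIZE_MB, and `validate_upload` is a hypothetical helper rather than the project's code.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg"}  # assumed list
MAX_UPLOAD_SIZE_MB = 10  # documented default of MAX_UPLOAD_SIZE_MB


def validate_upload(filename: str, content: bytes,
                    max_size_mb: int = MAX_UPLOAD_SIZE_MB) -> list[str]:
    """Return a list of problems; an empty list means the upload looks acceptable."""
    problems: list[str] = []
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or 'none'}")
    if len(content) > max_size_mb * 1024 * 1024:
        problems.append(f"file exceeds {max_size_mb} MB")
    if extension == ".pdf" and not content.startswith(b"%PDF-"):
        problems.append("file does not start with a PDF header")
    return problems


ok = validate_upload("matricula.pdf", b"%PDF-1.7 example")
rejected = validate_upload("notes.exe", b"MZ")
```

Checking the leading bytes as well as the extension catches files that were simply renamed to `.pdf`.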

📦 Project Structure

real-estate-doc-processor-server/
├── src/
│   ├── api/              # FastAPI routes
│   ├── config/           # Configuration settings
│   ├── models/           # Pydantic models
│   ├── services/         # Business logic
│   │   ├── database.py
│   │   ├── pdf_processor.py
│   │   ├── image_preprocessor.py
│   │   ├── ocr_service.py
│   │   ├── text_preprocessor.py
│   │   ├── gpt_extractor.py
│   │   └── document_processor.py
│   ├── utils/            # Utilities
│   └── main.py           # Application entry point
├── database/
│   └── schema.sql        # Supabase database schema
├── uploads/              # Uploaded files
├── processed/            # Processed images
├── logs/                 # Application logs
├── .env                  # Environment variables
├── requirements.txt      # Python dependencies
└── README.md

🚀 Deployment

Docker (Coming Soon)

# Dockerfile example
FROM python:3.11-slim
# ... deployment configuration

Heroku

heroku create your-app-name
git push heroku main

AWS/GCP

Deploy using your preferred cloud platform with Python 3.10+ support.

📄 License

MIT License

🀝 Contributing

Contributions welcome! Please open an issue or submit a pull request.


Built with ❤️ for Brazilian real estate professionals
