A production-ready Django REST API application for extracting text from images and PDF files using Tesseract OCR with intelligent caching, database persistence, and comprehensive audit logging.
🚀 Ready to deploy? Check out QUICK_DEPLOY.md for a 5-minute deployment guide to Railway or Render!
- 🖼️ Image OCR: Extract text from PNG, JPEG, TIFF, BMP, GIF, WebP images
- 📄 PDF Processing: Extract text from both text-based and scanned PDFs
- 🔍 Smart Detection: Automatically detects if PDF is text-based or scanned
- 📑 Page Selection: Extract specific pages from PDFs
- 🌍 Multi-language Support: 100+ languages including English, Arabic, Chinese, Japanese, etc.
- 💾 Database Persistence: Optional document and result storage
- ⚡ Intelligent Caching: SHA256-based duplicate detection and result caching
- 📊 Processing Metrics: Track processing time, confidence scores, word/character counts
- 🔐 UUID Public IDs: Secure document identification without enumeration risks
- 🗑️ Soft Delete: Data retention with archival capabilities
- 📝 Audit Logging: Comprehensive processing logs with structured JSON details
- 🎯 Service Layer: Clean separation of concerns with business logic in services
- 🧪 Comprehensive Testing: 44+ integration and unit tests
- 🎨 Beautiful UI: Modern, responsive frontend with gradient design
- 📋 Copy to Clipboard: Easy one-click copy of extracted text
- 🔌 REST API: Clean REST API endpoints for integration
- 🔄 Multi-format Support: Auto-detects file type (image or PDF)
- ⚙️ Configurable Storage: Control database saving via API parameter
- Prerequisites
- Installation
- Quick Start
- Architecture
- API Documentation
- Database Schema
- Testing
- CI/CD Pipeline
- Configuration
- Deployment
- Development
- Contributing
- License
- Python 3.10+ (Django 5.0 does not support earlier versions)
- Tesseract OCR
Ubuntu/Debian:
sudo apt-get update
sudo apt-get install tesseract-ocr
sudo apt-get install tesseract-ocr-ara # For Arabic support
macOS:
brew install tesseract
Windows: Download and install from: https://github.com/UB-Mannheim/tesseract/wiki
For additional language support:
# Ubuntu/Debian
sudo apt-get install tesseract-ocr-ara # Arabic
sudo apt-get install tesseract-ocr-chi-sim # Chinese Simplified
sudo apt-get install tesseract-ocr-jpn # Japanese
sudo apt-get install tesseract-ocr-rus # Russian
sudo apt-get install tesseract-ocr-spa # Spanish
sudo apt-get install tesseract-ocr-fra # French
# List all available language packs
apt-cache search tesseract-ocr-
- Clone the repository
cd /home/ahadjon/work/fergani/fergani-ocr
- Install Python dependencies
pip install -r requirements.txt
- Run migrations
cd fergani
python manage.py migrate
- Create a superuser (optional, for admin panel access)
python manage.py createsuperuser
- Start the development server
python manage.py runserver 8001
- Access the application:
- Web UI: http://127.0.0.1:8001
- API Root: http://127.0.0.1:8001/api/
- Admin Panel: http://127.0.0.1:8001/admin/
The project follows a layered architecture with clear separation of concerns:
┌─────────────────────────────────────────────────────────┐
│ Views Layer (HTTP) │
│ - Request/Response handling │
│ - Validation with DRF serializers │
│ - Error handling and formatting │
└──────────────────────┬──────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────┐
│ Service Layer (Business Logic) │
│ - OCRService: Image processing logic │
│ - PDFService: PDF extraction logic │
│ - MultiFormatService: Multi-format handling │
│ - Database persistence and caching │
└──────────────────────┬──────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────┐
│ Utility Layer (Core Processing) │
│ - OCRProcessor: Tesseract integration │
│ - PDFProcessor: PDF text extraction │
│ - Image preprocessing and optimization │
└──────────────────────┬──────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────┐
│ Data Layer (Models) │
│ - OCRDocument: File storage and metadata │
│ - OCRResult: Extraction results │
│ - PDFPageResult: Per-page PDF results │
│ - OCRProcessingLog: Audit trail │
└─────────────────────────────────────────────────────────┘
Business logic is encapsulated in service classes (ocr/services.py):
- OCRService: Handles image OCR processing
- PDFService: Handles PDF text extraction
- MultiFormatService: Auto-detects and processes multiple formats
Benefits:
- Views remain thin and focused on HTTP concerns
- Business logic is reusable and testable
- Easy to add features without modifying views
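As a rough illustration of why thin views keep the business logic testable, the sketch below wires a stand-in processor into a small service and calls it from a view-like function. `FakeProcessor`, `ImageService`, and `handle_post` are hypothetical names used only here; the project's real classes live in ocr/services.py and ocr/utils.py and may be structured differently.

```python
class FakeProcessor:
    """Stand-in for the utility layer (hypothetical; the real one calls Tesseract)."""

    @staticmethod
    def extract_text(image_bytes, language):
        return f"[{language}] {len(image_bytes)} bytes"


class ImageService:
    """Hypothetical service: owns the business rules and knows nothing about HTTP."""

    def __init__(self, processor):
        self.processor = processor

    def process(self, image_bytes, language="eng"):
        if not image_bytes:
            return {"success": False, "error": "Empty file"}
        text = self.processor.extract_text(image_bytes, language)
        return {
            "success": True,
            "text": text,
            "language": language,
            "character_count": len(text),
            "word_count": len(text.split()),
        }


def handle_post(post_data, service):
    """Thin 'view': turn request data into a service call and pick a status code."""
    result = service.process(post_data.get("image", b""), post_data.get("language", "eng"))
    return (200 if result["success"] else 400), result
```

Because the service takes its processor as an argument in this sketch, a test can pass a fake and never touch Tesseract or HTTP.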
Database operations are abstracted through Django models:
- Clean query interface
- Built-in validation
- Automatic migration management
SHA256-based file hashing for intelligent caching:
- Duplicate files return cached results instantly
- Saves processing time and resources
- Configurable per-request via the save_to_db parameter
The project uses Django REST Framework's APIView for better customization:
OCRExtractTextView (/api/ocr/extract/)
- POST: Process image and extract text
- GET: Return API information
- Delegates to OCRService.process_image_extraction()
PDFExtractTextView (/api/pdf/extract/)
- POST: Process PDF and extract text
- GET: Return API information
- Delegates to PDFService.process_pdf_extraction()
MultiFormatExtractView (/api/extract/)
- POST: Auto-detect format and extract text
- GET: Return API information
- Delegates to MultiFormatService.process_file()
OCRHealthCheckView (/api/ocr/health/)
- GET: Check service health and Tesseract installation
SupportedLanguagesView (/api/ocr/languages/)
- GET: List all supported OCR languages
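A health check of this kind can be approximated without any OCR dependency by looking for the Tesseract binary and asking it for its version. The following is a minimal sketch, not the project's `get_health_status()`: the function body, the `binary` parameter, and the "unhealthy" status value are assumptions made for illustration.

```python
import shutil
import subprocess


def get_health_status(binary="tesseract"):
    """Report whether the Tesseract binary is on PATH and which version it prints."""
    status = {"status": "unhealthy", "tesseract_installed": False, "tesseract_version": None}
    path = shutil.which(binary)
    if path is None:
        return status
    status["tesseract_installed"] = True
    try:
        proc = subprocess.run([path, "--version"], capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return status
    # Some Tesseract builds print the version banner to stderr instead of stdout.
    output = (proc.stdout or proc.stderr).strip()
    if proc.returncode == 0 and output:
        status["status"] = "healthy"
        # The banner's first line looks like "tesseract 5.3.0"; keep the last token.
        status["tesseract_version"] = output.splitlines()[0].split()[-1]
    return status
```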
http://127.0.0.1:8001/api/
Endpoint: POST /api/ocr/extract/
Description: Extract text from uploaded image using Tesseract OCR with optional database persistence
Request:
- Method: POST
- Content-Type: multipart/form-data
- Body:
  - image: Image file (PNG, JPEG, TIFF, BMP, GIF, WebP) - Required
  - language: Language code (default: 'eng') - Optional
  - save_to_db: Save to database (default: true, accepts: true/false/1/0/yes/no) - Optional
Response:
{
"success": true,
"text": "Extracted text content...",
"filename": "example.png",
"file_size": 123456,
"image_dimensions": "800x600",
"image_format": "PNG",
"language": "eng",
"character_count": 150,
"word_count": 25,
"processing_time_ms": 245,
"document_id": "550e8400-e29b-41d4-a716-446655440000",
"cached": false,
"confidence": {
"average_confidence": 92.5,
"min_confidence": 85,
"max_confidence": 98,
"total_words": 25
}
}
Response Fields:
- document_id: UUID of the saved document (only if save_to_db=true)
- cached: Whether the result was returned from cache (duplicate file)
- processing_time_ms: Time taken to process, in milliseconds
- confidence: OCR confidence scores (if available)
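Since `save_to_db` arrives as a form string and decides whether `document_id` appears in the response, it helps to see how the accepted spellings (true/false/1/0/yes/no) could map to a boolean. This is a sketch under assumptions: `parse_save_to_db` is a hypothetical helper, and both the case-insensitive matching and the error on unknown values are choices made for this example, not confirmed behavior of the API.

```python
TRUE_VALUES = {"true", "1", "yes"}
FALSE_VALUES = {"false", "0", "no"}


def parse_save_to_db(raw, default=True):
    """Map a form value such as 'yes' or '0' to a bool; fall back to the default."""
    if raw is None or str(raw).strip() == "":
        return default
    value = str(raw).strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    # Assumption: reject anything else rather than guessing.
    raise ValueError(f"Unrecognized save_to_db value: {raw!r}")
```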
Example using cURL:
# With database saving (default)
curl -X POST http://127.0.0.1:8001/api/ocr/extract/ \
-F "image=@/path/to/image.png" \
-F "language=eng"
# Without database saving (stateless)
curl -X POST http://127.0.0.1:8001/api/ocr/extract/ \
-F "image=@/path/to/image.png" \
-F "language=eng" \
-F "save_to_db=false"
Example using Python requests:
import requests

url = 'http://127.0.0.1:8001/api/ocr/extract/'
data = {
    'language': 'eng',
    'save_to_db': 'true'  # or 'false' for stateless processing
}
# Open the image in a context manager so the handle is closed after the upload
with open('image.png', 'rb') as image_file:
    response = requests.post(url, files={'image': image_file}, data=data)
result = response.json()
if result.get('cached'):
    print("Result returned from cache!")
print(f"Document ID: {result.get('document_id')}")
print(f"Extracted text: {result['text']}")
Get API Information:
curl http://127.0.0.1:8001/api/ocr/extract/
Endpoint: POST /api/pdf/extract/
Description: Extract text from PDF files (text-based or scanned) with optional page selection
Request:
- Method: POST
- Content-Type: multipart/form-data
- Body:
  - file: PDF file - Required
  - language: Language code (default: 'eng') - Optional
  - use_ocr: Use OCR for scanned PDFs (default: true) - Optional
  - pages: Comma-separated page numbers (e.g., "1,3,5") - Optional
  - save_to_db: Save to database (default: true) - Optional
Response:
{
"success": true,
"text": "Extracted text from all pages...",
"filename": "document.pdf",
"file_size": 2456789,
"total_pages": 10,
"pages_extracted": 10,
"method": "text_extraction",
"language": "eng",
"character_count": 5000,
"word_count": 850,
"processing_time_ms": 1250,
"document_id": "550e8400-e29b-41d4-a716-446655440001",
"cached": false,
"pages": [
{
"page_number": 1,
"text": "Page 1 text...",
"word_count": 85,
"character_count": 500,
"method": "text_extraction"
}
]
}
Method Values:
- text_extraction: Text was extracted from text-based PDF
- ocr: Text was extracted using OCR (scanned PDF)
- mixed: Some pages used text extraction, others used OCR
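The `pages` parameter is a comma-separated string of 1-indexed page numbers, so the server has to turn it into a validated list before extraction. The sketch below shows one way that could work; `parse_pages` is a hypothetical helper, and de-duplicating, sorting, and rejecting out-of-range pages are assumptions rather than documented behavior.

```python
def parse_pages(raw, total_pages):
    """Parse a string like '1,3,5' into a sorted list of unique 1-indexed pages."""
    if raw is None or not raw.strip():
        # No selection given: extract every page.
        return list(range(1, total_pages + 1))
    pages = set()
    for part in raw.split(","):
        part = part.strip()
        if not part.isdecimal():
            raise ValueError(f"Invalid page number: {part!r}")
        number = int(part)
        if not 1 <= number <= total_pages:
            raise ValueError(f"Page {number} is outside 1-{total_pages}")
        pages.add(number)
    return sorted(pages)
```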
Example using cURL:
# Extract all pages
curl -X POST http://127.0.0.1:8001/api/pdf/extract/ \
-F "file=@/path/to/document.pdf" \
-F "language=eng" \
-F "use_ocr=true"
# Extract specific pages only
curl -X POST http://127.0.0.1:8001/api/pdf/extract/ \
-F "file=@/path/to/document.pdf" \
-F "pages=1,3,5"
# Stateless processing (no database)
curl -X POST http://127.0.0.1:8001/api/pdf/extract/ \
-F "file=@/path/to/document.pdf" \
-F "save_to_db=false"
Endpoint: POST /api/extract/
Description: Auto-detect file type (image or PDF) and extract text
Request:
- Method: POST
- Content-Type: multipart/form-data
- Body:
  - file: Image or PDF file - Required
  - language: Language code (default: 'eng') - Optional
  - save_to_db: Save to database (default: true) - Optional
Response: Same format as image or PDF endpoint depending on file type
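One common way to auto-detect the file type is to inspect the first bytes of the upload rather than trust its extension. The sketch below illustrates that approach for the formats this API lists; it is not taken from `MultiFormatService`. `detect_file_type` and `validate_upload` are hypothetical names, and reusing the 10MB and 50MB view-level limits for this endpoint is an assumption.

```python
IMAGE_MAX_BYTES = 10 * 1024 * 1024  # image limit quoted in the Configuration section
PDF_MAX_BYTES = 50 * 1024 * 1024    # PDF limit quoted in the Configuration section

# Leading "magic" bytes for the supported formats.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "image"),  # PNG
    (b"\xff\xd8\xff", "image"),       # JPEG
    (b"GIF87a", "image"),
    (b"GIF89a", "image"),
    (b"BM", "image"),                 # BMP
    (b"II*\x00", "image"),            # TIFF, little-endian
    (b"MM\x00*", "image"),            # TIFF, big-endian
]


def detect_file_type(header):
    """Return 'pdf', 'image', or None based on the first bytes of a file."""
    for magic, kind in SIGNATURES:
        if header.startswith(magic):
            return kind
    # WebP is a RIFF container with 'WEBP' at byte offset 8.
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "image"
    return None


def validate_upload(header, size):
    """Reject unknown formats and files above the per-type size limit."""
    kind = detect_file_type(header)
    if kind is None:
        raise ValueError("Unsupported file type")
    limit = PDF_MAX_BYTES if kind == "pdf" else IMAGE_MAX_BYTES
    if size > limit:
        raise ValueError(f"{kind} exceeds the {limit // (1024 * 1024)}MB limit")
    return kind
```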
Example:
curl -X POST http://127.0.0.1:8001/api/extract/ \
-F "file=@/path/to/file.png" \
-F "language=eng"
Health Check
Endpoint: GET /api/ocr/health/
Description: Check if OCR service is working properly
Response:
{
"status": "healthy",
"tesseract_installed": true,
"tesseract_version": "5.3.0",
"supported_languages": 14
}
Example:
curl http://127.0.0.1:8001/api/ocr/health/
Endpoint: GET /api/ocr/languages/
Description: Get list of all supported OCR languages installed on the system
Response:
{
"success": true,
"count": 14,
"languages": {
"eng": "English",
"ara": "Arabic",
"spa": "Spanish",
"fra": "French",
"deu": "German",
"rus": "Russian",
"chi_sim": "Chinese (Simplified)",
"jpn": "Japanese"
}
}
Example:
curl http://127.0.0.1:8001/api/ocr/languages/
The application uses 4 main models for document management and audit logging. For complete details, see DATABASE_ARCHITECTURE.md.
OCRDocument stores uploaded documents with metadata and processing status.
Key Fields:
- uuid: Public UUID identifier (prevents enumeration)
- file: Uploaded file (stored in media/ocr_documents/YYYY/MM/DD/)
- file_hash: SHA256 hash for duplicate detection
- file_type: 'image' or 'pdf'
- status: 'pending', 'processing', 'completed', 'failed', 'archived'
- language: OCR language code
- is_deleted: Soft delete flag
Methods:
- soft_delete(): Mark as deleted without removing from database
- archive(): Move to archived status
- calculate_file_hash(content): Generate SHA256 hash
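To make the soft-delete and archive semantics concrete, here is a plain-Python stand-in that mirrors a few OCRDocument fields. It is deliberately not a Django model: `DocumentSketch` and `upload_path` are hypothetical names, and the real methods may also save the row, write timestamps, or log the change.

```python
from dataclasses import dataclass, field
from datetime import date
from uuid import UUID, uuid4


@dataclass
class DocumentSketch:
    """Plain-Python stand-in for a few OCRDocument fields (not the Django model)."""

    filename: str
    status: str = "pending"
    is_deleted: bool = False
    uuid: UUID = field(default_factory=uuid4)  # random public ID, not a sequential key

    def soft_delete(self):
        # Flag the record instead of removing it, so results and logs stay linked.
        self.is_deleted = True

    def archive(self):
        self.status = "archived"


def upload_path(filename, today):
    """Build a media-relative path in the ocr_documents/YYYY/MM/DD/ layout."""
    return f"ocr_documents/{today:%Y/%m/%d}/{filename}"
```

Soft delete and archive are independent in this sketch: one flips a flag, the other changes the status.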
OCRResult stores text extraction results with metrics.
Key Fields:
- document: Foreign key to OCRDocument
- extracted_text: The extracted text content
- extraction_method: 'ocr', 'text_extraction', or 'mixed'
- word_count: Automatically calculated
- character_count: Automatically calculated
- processing_time_ms: Processing duration in milliseconds
- confidence_score: Average OCR confidence (0-100)
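The derived metrics on a result (word count, character count, confidence summary) can be computed from the extracted text and Tesseract's per-word confidences. The sketch below assumes whitespace-separated words, a character count that includes whitespace, and that entries of -1 (which Tesseract's TSV output uses for boxes that are not words) are ignored. `summarize_result` is a hypothetical name, and the project's own calculation may differ.

```python
def summarize_result(text, word_confidences):
    """Compute word/character counts and a confidence summary for one result."""
    valid = [c for c in word_confidences if c >= 0]  # drop the -1 "not a word" entries
    confidence = None
    if valid:
        confidence = {
            "average_confidence": round(sum(valid) / len(valid), 2),
            "min_confidence": min(valid),
            "max_confidence": max(valid),
            "total_words": len(valid),
        }
    return {
        "word_count": len(text.split()),
        "character_count": len(text),
        "confidence": confidence,
    }
```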
PDFPageResult stores per-page results for PDF documents.
Key Fields:
- result: Foreign key to OCRResult
- page_number: Page number (1-indexed)
- extracted_text: Text from this page
- word_count: Words on this page
- extraction_method: Method used for this page
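The document-level method ('text_extraction', 'ocr', or 'mixed') follows from the per-page extraction_method values. A minimal sketch of that roll-up, with `overall_method` as a hypothetical helper name:

```python
def overall_method(page_methods):
    """Collapse per-page methods into 'text_extraction', 'ocr', or 'mixed'."""
    methods = set(page_methods)
    if not methods:
        raise ValueError("At least one page is required")
    unknown = methods - {"text_extraction", "ocr"}
    if unknown:
        raise ValueError(f"Unknown extraction method(s): {sorted(unknown)}")
    if len(methods) == 1:
        return methods.pop()  # every page used the same method
    return "mixed"
```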
OCRProcessingLog is the audit trail for all processing events.
Key Fields:
- document: Foreign key to OCRDocument
- level: 'info', 'warning', 'error'
- message: Human-readable message
- details: JSON field for structured data
Example Log Entry:
{
"level": "info",
"message": "Successfully extracted text from example.pdf",
"details": {
"total_pages": 10,
"method": "text_extraction",
"word_count": 850,
"processing_time_ms": 1250
}
}
The system uses SHA256 file hashing to detect duplicates:
- When a file is uploaded, its hash is calculated
- System checks for existing documents with same hash
- If found and save_to_db=true:
  - Returns cached result instantly
  - Response includes "cached": true
  - No reprocessing needed
- If not found:
  - Processes file normally
  - Saves to database for future cache hits
Benefits:
- Faster responses for duplicate files
- Reduced server load
- Cost savings on processing resources
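The duplicate-detection flow described above can be sketched in a few lines. Here an in-memory dict stands in for the database lookup, and `extract_with_cache` and `run_ocr` are hypothetical names; the project's services store results in OCRDocument and OCRResult rows instead.

```python
import hashlib
import io


def calculate_file_hash(fileobj, chunk_size=65536):
    """Return the SHA256 hex digest of a binary file-like object."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    fileobj.seek(0)  # rewind so the caller can still process the file
    return digest.hexdigest()


# In-memory stand-in for "existing documents with the same hash".
_results_by_hash = {}


def extract_with_cache(fileobj, run_ocr, save_to_db=True):
    """Return a cached result for a known hash, otherwise process and store."""
    file_hash = calculate_file_hash(fileobj)
    if save_to_db and file_hash in _results_by_hash:
        return {**_results_by_hash[file_hash], "cached": True}
    result = {"text": run_ocr(fileobj), "cached": False}
    if save_to_db:
        _results_by_hash[file_hash] = result
    return result
```

Reading in chunks keeps memory use flat for large PDFs. With `save_to_db=False` the sketch neither reads nor writes the cache, which matches the stateless mode.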
The project includes comprehensive test coverage with 44+ tests.
tests/
├── README.md # Testing documentation
└── ocr/
├── __init__.py
├── test_ocr_extract.py # Image OCR integration tests (7 tests)
├── test_health_check.py # Health endpoint tests (4 tests)
├── test_supported_languages.py # Language listing tests (5 tests)
├── test_pdf_extract.py # PDF extraction tests (6 tests)
├── test_multiformat_extract.py # Multi-format tests (9 tests)
└── test_services.py # Service layer unit tests (13 tests)
# Run all tests
cd fergani
python manage.py test
# Run specific test file
python manage.py test tests.ocr.test_ocr_extract
# Run with coverage
coverage run --source='ocr' manage.py test
coverage report
coverage html # Generate HTML report
Integration Tests (tests/ocr/test_*.py)
- Test API endpoints end-to-end
- Verify request/response formats
- Check error handling
- Test file uploads and processing
Unit Tests (tests/ocr/test_services.py)
- Test service layer methods in isolation
- Mock external dependencies
- Verify business logic
- Test edge cases
Example Test:
import io

from PIL import Image
from rest_framework.test import APITestCase


class OCRExtractTests(APITestCase):
    def test_extract_text_from_image(self):
        # Create a blank in-memory PNG to upload
        image = Image.new('RGB', (100, 100), color='white')
        buffer = io.BytesIO()
        image.save(buffer, format='PNG')
        buffer.seek(0)
        buffer.name = 'test.png'  # gives the multipart upload a filename
        # Make API request
        response = self.client.post('/api/ocr/extract/', {
            'image': buffer,
            'language': 'eng',
            'save_to_db': 'true'
        }, format='multipart')
        # Verify response
        self.assertEqual(response.status_code, 200)
        self.assertTrue(response.data['success'])
        self.assertIn('document_id', response.data)
For complete testing documentation, see tests/README.md.
The project includes a comprehensive CI/CD pipeline using GitHub Actions.
Every PR to the develop branch automatically:
✅ Checks code formatting (Black)
✅ Validates import sorting (isort)
✅ Runs linting (Flake8)
✅ Verifies migrations (no model changes without a migration)
✅ Runs all tests (44+ tests)
✅ Measures coverage (minimum 70% required)
Every push to main branch:
🚀 Runs deployment checks
🚀 Auto-deploys to Railway/Render
🚀 Monitors deployment status
feature/branch → PR → develop (CI checks) → main (auto-deploy)
- develop: Development branch, all PRs merge here
- main: Production branch, auto-deploys to live server
Install development tools:
pip install black flake8 isort coverage
Install pre-commit hook (automatically runs checks before each commit):
chmod +x pre-commit-hook.sh
cp pre-commit-hook.sh .git/hooks/pre-commit
Manual checks:
cd fergani
# Format code
black .
isort .
# Run linting
flake8 .
# Check migrations
python manage.py makemigrations --check --dry-run
# Run tests with coverage
coverage run --source='ocr' manage.py test tests/
coverage report
| Tool | Purpose | Configuration |
|---|---|---|
| Black | Code formatting | 127 char line length |
| isort | Import sorting | Black-compatible |
| Flake8 | Linting | Max complexity: 10 |
| Coverage | Test coverage | Minimum: 70% |
- Quick Setup: SETUP_CI.md
- Detailed Guide: CI_CD.md
- Workflow Files: .github/workflows/
Note: Replace YOUR_USERNAME with your GitHub username to see live status badges.
Key settings in fergani/settings.py:
# Media files (uploaded documents)
MEDIA_URL = '/media/'
MEDIA_ROOT = BASE_DIR / 'media'
# Maximum request body size, excluding file uploads (Django)
DATA_UPLOAD_MAX_MEMORY_SIZE = 52428800 # 50MB
# View-level max sizes for uploaded files
OCRExtractTextView.MAX_FILE_SIZE = 10 * 1024 * 1024 # 10MB for images
PDFExtractTextView.MAX_FILE_SIZE = 50 * 1024 * 1024 # 50MB for PDFs
For production, use environment variables:
# .env file
SECRET_KEY=your-secret-key-here
DEBUG=False
ALLOWED_HOSTS=yourdomain.com,api.yourdomain.com
DATABASE_URL=postgresql://user:pass@localhost/dbname
# Tesseract configuration
TESSERACT_CMD=/usr/bin/tesseract # Path to tesseract binary
Access the admin panel at /admin/ to:
- View all uploaded documents
- Monitor processing status
- View extraction results
- Check audit logs
- Perform bulk actions (archive, soft delete)
- View colored status indicators
Admin Features:
- Colored Status Badges: Visual indicators for document status
- File Size Formatting: Human-readable file sizes
- Bulk Actions: Archive or soft delete multiple documents
- Processing Metrics: View confidence scores and processing times
- Audit Trail: Read-only access to all processing logs
- Search & Filters: Find documents by name, status, date, etc.
- Environment Configuration
export DEBUG=False
export SECRET_KEY='generate-strong-secret-key'
export ALLOWED_HOSTS='yourdomain.com'
- Database Setup
# Use PostgreSQL in production
pip install psycopg2-binary
python manage.py migrate
python manage.py createsuperuser
- Static Files
python manage.py collectstatic --no-input
- Web Server
Use Gunicorn + Nginx:
# Install Gunicorn
pip install gunicorn
# Run Gunicorn
gunicorn fergani.wsgi:application --bind 0.0.0.0:8000 --workers 4
- Nginx Configuration
server {
    listen 80;
    server_name yourdomain.com;
    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
    location /media/ {
        alias /path/to/fergani-ocr/fergani/media/;
    }
    location /static/ {
        alias /path/to/fergani-ocr/fergani/staticfiles/;
    }
}
- Security
- Enable HTTPS with SSL certificates (Let's Encrypt)
- Set secure cookie flags
- Configure CORS properly
- Regular security updates
- Backup database regularly
Create Dockerfile:
FROM python:3.11-slim
# Install Tesseract and dependencies
RUN apt-get update && apt-get install -y \
tesseract-ocr \
tesseract-ocr-eng \
tesseract-ocr-ara \
poppler-utils \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY fergani/ /app/
RUN python manage.py collectstatic --no-input
EXPOSE 8000
CMD ["gunicorn", "fergani.wsgi:application", "--bind", "0.0.0.0:8000"]
fergani-ocr/
├── README.md # This file
├── DATABASE_ARCHITECTURE.md # Database design documentation
├── requirements.txt # Python dependencies
├── fergani/ # Django project root
│ ├── manage.py # Django management script
│ ├── fergani/ # Project configuration
│ │ ├── __init__.py
│ │ ├── settings.py # Django settings
│ │ ├── urls.py # URL routing
│ │ ├── asgi.py # ASGI config
│ │ └── wsgi.py # WSGI config
│ ├── ocr/ # Main OCR application
│ │ ├── __init__.py
│ │ ├── admin.py # Admin panel configuration
│ │ ├── apps.py # App configuration
│ │ ├── models.py # Database models (4 models)
│ │ ├── views.py # API views (image OCR)
│ │ ├── pdf_views.py # PDF and multi-format views
│ │ ├── services.py # Service layer (business logic)
│ │ ├── serializers.py # DRF serializers
│ │ ├── utils.py # OCR and PDF processors
│ │ ├── migrations/ # Database migrations
│ │ │ ├── __init__.py
│ │ │ └── 0001_initial.py # Initial schema
│ │ └── templates/ # Frontend templates
│ │ └── ocr/
│ │ └── index.html # Web UI
│ ├── tests/ # Test suite
│ │ ├── README.md # Test documentation
│ │ └── ocr/ # OCR app tests
│ │ ├── __init__.py
│ │ ├── test_ocr_extract.py
│ │ ├── test_health_check.py
│ │ ├── test_supported_languages.py
│ │ ├── test_pdf_extract.py
│ │ ├── test_multiformat_extract.py
│ │ └── test_services.py
│ └── media/ # Uploaded files (created at runtime)
│ └── ocr_documents/ # Organized by date (YYYY/MM/DD/)
└── .gitignore
- Backend Framework: Django 5.0.1
- API Framework: Django REST Framework 3.14.0+
- OCR Engine: Tesseract OCR (via pytesseract)
- Image Processing: Pillow 10.0+
- PDF Processing: PyPDF2, pdf2image
- Database: SQLite (dev), PostgreSQL (recommended for production)
- Testing: Django TestCase, DRF APITestCase
Service Layer (ocr/services.py)
- OCRService: Image OCR business logic
  - process_image_extraction(): Main image processing
  - get_ocr_api_info(): API metadata
  - get_health_status(): Service health check
  - get_supported_languages(): Language listing
- PDFService: PDF extraction business logic
  - process_pdf_extraction(): Main PDF processing
  - validate_pdf_file(): File validation
  - get_pdf_api_info(): API metadata
- MultiFormatService: Multi-format handling
  - process_file(): Auto-detect and process
  - validate_file(): File validation
  - get_api_info(): API metadata
Utilities (ocr/utils.py)
- OCRProcessor: Core Tesseract operations
  - extract_text(): Extract text from PIL Image
  - get_confidence_scores(): OCR confidence metrics
  - get_image_info(): Image metadata
  - is_tesseract_installed(): Check Tesseract availability
  - get_supported_languages(): Query installed languages
- PDFProcessor: Core PDF operations
  - extract_text_from_pdf(): PDF text extraction
  - extract_specific_pages(): Page selection
  - is_text_based_pdf(): Detect PDF type
- Separation of Concerns
  - Views: HTTP handling only
  - Services: Business logic
  - Utils: Core processing
  - Models: Data persistence
- Single Responsibility
  - Each class has one clear purpose
  - Methods are focused and testable
  - Easy to maintain and extend
- DRY (Don't Repeat Yourself)
  - Shared logic in service layer
  - Reusable utility functions
  - Consistent error handling
- Testability
  - Service layer methods are pure functions
  - Easy to mock dependencies
  - Comprehensive test coverage
Example: Add new image preprocessing option
- Add utility function in utils.py:
class OCRProcessor:
    @staticmethod
    def preprocess_image_denoise(image):
        """Remove noise from image"""
        # Placeholder: apply the denoising step here
        processed_image = image
        return processed_image
- Add service method in services.py:
class OCRService:
    @staticmethod
    def process_with_denoise(image_file, language='eng'):
        image = Image.open(image_file)
        image = OCRProcessor.preprocess_image_denoise(image)
        return OCRProcessor.extract_text(image, language)
- Add view endpoint in views.py:
class OCRDenoiseView(APIView):
    def post(self, request):
        # Handle request
        result = OCRService.process_with_denoise(
            request.FILES['image'],
            request.data.get('language', 'eng')
        )
        return Response(result)
- Add tests in tests/ocr/test_denoise.py:
class DenoiseTests(APITestCase):
    def test_denoise_extraction(self):
        # Test implementation
        pass
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Write tests for your changes
- Ensure all tests pass (python manage.py test)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
The application supports 100+ languages through Tesseract. Common languages include:
- eng - English
- ara - Arabic
- spa - Spanish
- fra - French
- deu - German
- rus - Russian
- chi_sim - Chinese (Simplified)
- chi_tra - Chinese (Traditional)
- jpn - Japanese
- kor - Korean
- hin - Hindi
- por - Portuguese
- ita - Italian
- tur - Turkish
- pol - Polish
Note: Language packs must be installed separately. Use GET /api/ocr/languages/ to see which languages are currently available on your system.
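Because the available languages depend on which packs are installed, a listing like the one returned by the languages endpoint can be derived from the output of `tesseract --list-langs`. The sketch below parses that output and attaches the display names from the list above. `parse_list_langs` and `LANGUAGE_NAMES` are hypothetical names, and the project may query Tesseract through pytesseract instead.

```python
# Display names for the common codes listed above; unknown codes fall back to the code.
LANGUAGE_NAMES = {
    "eng": "English", "ara": "Arabic", "spa": "Spanish", "fra": "French",
    "deu": "German", "rus": "Russian", "chi_sim": "Chinese (Simplified)",
    "chi_tra": "Chinese (Traditional)", "jpn": "Japanese", "kor": "Korean",
    "hin": "Hindi", "por": "Portuguese", "ita": "Italian", "tur": "Turkish",
    "pol": "Polish",
}


def parse_list_langs(output):
    """Turn the text printed by `tesseract --list-langs` into {code: name}."""
    languages = {}
    for line in output.splitlines():
        code = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not code or code.lower().startswith("list of available languages"):
            continue
        languages[code] = LANGUAGE_NAMES.get(code, code)
    return languages
```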
# Check if Tesseract is installed
tesseract --version
# If not found, install it
sudo apt-get install tesseract-ocr # Ubuntu/Debian
brew install tesseract # macOS
# List installed languages
tesseract --list-langs
# Install additional language
sudo apt-get install tesseract-ocr-ara # Arabic
# Install PDF dependencies
pip install PyPDF2 pdf2image
sudo apt-get install poppler-utils # Ubuntu/Debian
brew install poppler # macOS
# Reset migrations (development only!)
python manage.py migrate ocr zero
python manage.py migrate
# Or start fresh
rm db.sqlite3
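python manage.py migrate
Check these settings in settings.py:
DATA_UPLOAD_MAX_MEMORY_SIZE = 52428800 # 50MB request body limit, excluding file uploads
FILE_UPLOAD_MAX_MEMORY_SIZE = 52428800 # Uploads above 50MB are streamed to disk instead of held in memory
The SHA256-based caching system provides: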
- 10-100x faster responses for duplicate files
- No OCR work for cached results (only the file hash and a database lookup)
- Reduced database load through intelligent deduplication
# settings.py
# Use Redis for session/cache
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.redis.RedisCache',
        'LOCATION': 'redis://127.0.0.1:6379/1',
    }
}
# Use Celery for async processing (optional)
CELERY_BROKER_URL = 'redis://localhost:6379/0'
CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'
# Persistent database connections
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'CONN_MAX_AGE': 600,  # Reuse each connection for up to 600 seconds
    }
}
We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines.
# Fork and clone
git clone https://github.com/YOUR_USERNAME/fergani-ocr.git
# Install dev tools
pip install black flake8 isort coverage
# Set up pre-commit hook
cp pre-commit-hook.sh .git/hooks/pre-commit
chmod +x .git/hooks/pre-commit
# Create feature branch
git checkout -b feature/awesome-feature
# Make changes, commit, and push
git commit -m "Add awesome feature"
git push origin feature/awesome-feature
# Open PR to 'develop' branch
See CONTRIBUTING.md for:
- Code quality standards
- Testing guidelines
- PR process
- Branch strategy
- Commit message format
MIT License
Copyright (c) 2026 Fergani OCR Project
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Built with ❤️ using Django, Tesseract OCR, and modern Python best practices.