A Flask-based web app for uploading, processing, and searching court PDFs using OCR, NLP, and semantic search with Elasticsearch.
- 🧠 Uses `sentence-transformers` for embeddings
- 🕵️ Extracts metadata like Title, Date, and Full Text
- 🔍 Supports both keyword and semantic vector search
- 🧾 PDF-to-text via `pdfplumber` + OCR fallback with `pytesseract`
- ⚡ Powered by Elasticsearch (set up locally via Docker)
- Python 3.10+
- Docker
- Tesseract OCR (`brew install tesseract` on macOS)
- Elasticsearch & Kibana (set up using Elastic's official script)
```bash
git clone https://github.com/yourusername/document-digitizer.git
cd document-digitizer
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Dependencies include:
- Flask
- pdfplumber
- pytesseract
- Pillow
- elasticsearch
- sentence-transformers
- numpy
```bash
brew install tesseract
curl -fsSL https://elastic.co/start-local | sh
```

✅ After successful setup:
- Kibana: http://localhost:5601
- Elasticsearch API: http://localhost:9200
- Username: `elastic`
- Password: `R3DeJHu` (or the generated one shown after setup)
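For scripting against the Elasticsearch API outside the official client, requests need an HTTP Basic auth header built from these credentials. A small stdlib sketch (the password shown is whatever your setup generated):

```python
import base64

def basic_auth_header(user: str, password: str) -> str:
    """Build the value for an HTTP `Authorization: Basic ...` header."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"

# e.g. pass basic_auth_header("elastic", "R3DeJHu") as the Authorization header
```

The official `elasticsearch` Python client accepts the credentials directly, so this is mainly useful for raw `curl`-style debugging.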
The script creates a folder `elastic-start-local/` with:

- `.env`, `docker-compose.yml`
- `start`, `stop`, `uninstall` scripts

To restart services:

```bash
./elastic-start-local/start
```

```
document-digitizer/
├── app.py                  # Flask app
├── processor.py            # PDF OCR + NLP + Elasticsearch
├── templates/
│   └── index.html          # Basic UI
├── pdfs/                   # Uploaded PDFs
└── elastic-start-local/    # Elastic + Kibana (auto-created)
```
- Upload a court PDF via browser or API.
- Extract text using `pdfplumber`; if that fails, fall back to OCR via Tesseract.
- Extract:
  - Title (based on text heuristics)
  - Date (regex + natural formats)
  - Full content
- Generate a semantic embedding with `sentence-transformers`
- Store everything in Elasticsearch
- Search via keyword OR semantic (vector similarity)
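The metadata-extraction step above can be sketched. A minimal, hypothetical version of the title and date heuristics — the real logic lives in `processor.py` and also handles natural-language date formats:

```python
import re

# ISO-style dates, e.g. 2018-07-14 (the real extractor covers more formats)
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def guess_title(text: str) -> str:
    """Take the first non-empty line as the title (simple heuristic)."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            return line
    return "Untitled"

def guess_date(text: str):
    """Return the first ISO-format date found, or None."""
    match = DATE_RE.search(text)
    return match.group(0) if match else None
```

Both helpers are illustrative assumptions, not the actual `processor.py` implementation.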
Renders the upload form.
Upload a PDF to index.
Request: `multipart/form-data` with a key `pdf`

Response:

```json
{
  "message": "Document indexed successfully",
  "title": "Some Legal Title"
}
```

Search documents.
Request:

```json
{
  "query": "rigorous imprisonment",
  "semantic": true,
  "date_from": "2010-01-01",
  "date_to": "2023-12-31"
}
```

Response:
```json
[
  {
    "title": "Court Judgement on XYZ",
    "date": "2018-07-14",
    "score": 0.923,
    ...
  }
]
```

To process all PDFs in the `pdfs/` folder via the command line:
```bash
python processor.py
```

To test the Elasticsearch connection:

```python
from processor import test_elasticsearch_connection
test_elasticsearch_connection()
```

- PDF text empty? → OCR kicks in if `ocr_enabled=True`
- Semantic search returns all docs? → Ensure `embedding` is populated and normalized
- Elasticsearch 400/500 errors? → Might be a mapping mismatch. Try deleting and reindexing:
  ```bash
  curl -X DELETE http://localhost:9200/court_documents
  ```
- Embedding norms close to 0? → Debug text length and vector normalization
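For the embedding-related items, here is a quick way to check norms and similarities by hand — a plain-Python sketch (the app itself presumably uses `numpy` for this):

```python
import math

def l2_normalize(vec, eps=1e-8):
    """L2-normalize an embedding vector.

    A near-zero norm usually means the extracted text was empty or garbled,
    which is worth catching before indexing the document.
    """
    norm = math.sqrt(sum(x * x for x in vec))
    if norm < eps:
        raise ValueError("embedding norm is ~0; inspect the extracted text")
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """For already-normalized vectors, cosine similarity is the dot product."""
    return sum(x * y for x, y in zip(a, b))
```

If `cosine_similarity` returns nearly identical scores for every document, the stored `embedding` fields were likely never normalized (or were all generated from empty text).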
MIT License — feel free to use, remix, and adapt. Just don't give it to your shady lawyer uncle 👨‍⚖️.