A Flask-based web app for uploading, processing, and searching court PDFs using OCR, NLP, and semantic search with Elasticsearch.
- 🧠 Uses `sentence-transformers` for embeddings
- 🕵️ Extracts metadata like Title, Date, and Full Text
- 🔍 Supports both keyword and semantic vector search
- 🧾 PDF-to-text via `pdfplumber` + OCR fallback with `pytesseract`
- ⚡ Powered by Elasticsearch (set up locally via Docker)
- Python 3.10+
- Docker
- Tesseract OCR (`brew install tesseract` on macOS)
- Elasticsearch & Kibana (set up using Elastic's official script)
```bash
git clone https://github.com/yourusername/document-digitizer.git
cd document-digitizer
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Dependencies include:
- Flask
- pdfplumber
- pytesseract
- Pillow
- elasticsearch
- sentence-transformers
- numpy
```bash
brew install tesseract
curl -fsSL https://elastic.co/start-local | sh
```

✅ After successful setup:
- Kibana: http://localhost:5601
- Elasticsearch API: http://localhost:9200
- Username: `elastic`
- Password: `R3DeJHu` (or the generated one shown after setup)
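For scripting against the Elasticsearch API outside the official client, requests need an HTTP Basic auth header built from these credentials. A small stdlib sketch (the password shown is whatever your setup generated):

```python
import base64

def basic_auth_header(user: str, password: str) -> str:
    """Build the value for an HTTP `Authorization: Basic ...` header."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"

# e.g. pass basic_auth_header("elastic", "R3DeJHu") as the Authorization header
```

The official `elasticsearch` Python client accepts the credentials directly, so this is mainly useful for raw `curl`-style debugging.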
The script creates a folder `elastic-start-local/` with:

- `.env`, `docker-compose.yml`
- `start`, `stop`, `uninstall` scripts

To restart services:

```bash
./elastic-start-local/start
```

```
document-digitizer/
├── app.py                  # Flask app
├── processor.py            # PDF OCR + NLP + Elasticsearch
├── templates/
│   └── index.html          # Basic UI
├── pdfs/                   # Uploaded PDFs
└── elastic-start-local/    # Elastic + Kibana (auto-created)
```
- Upload a court PDF via browser or API.
- Extract text using `pdfplumber`; if that fails, fall back to OCR via Tesseract.
- Extract:
  - Title (based on text heuristics)
  - Date (regex + natural formats)
  - Full content
- Generate a semantic embedding with `sentence-transformers`
- Store everything in Elasticsearch
- Search via keyword OR semantic (vector similarity)
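The metadata-extraction step above can be sketched. A minimal, hypothetical version of the title and date heuristics — the real logic lives in `processor.py` and also handles natural-language date formats:

```python
import re

# ISO-style dates, e.g. 2018-07-14 (the real extractor covers more formats)
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def guess_title(text: str) -> str:
    """Take the first non-empty line as the title (simple heuristic)."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            return line
    return "Untitled"

def guess_date(text: str):
    """Return the first ISO-format date found, or None."""
    match = DATE_RE.search(text)
    return match.group(0) if match else None
```

Both helpers are illustrative assumptions, not the actual `processor.py` implementation.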
Renders the upload form.
Upload a PDF to index.
Request: `multipart/form-data` with a key `pdf`

Response:

```json
{
  "message": "Document indexed successfully",
  "title": "Some Legal Title"
}
```

Search documents.
Request:

```json
{
  "query": "rigorous imprisonment",
  "semantic": true,
  "date_from": "2010-01-01",
  "date_to": "2023-12-31"
}
```

Response:
```json
[
  {
    "title": "Court Judgement on XYZ",
    "date": "2018-07-14",
    "score": 0.923,
    ...
  }
]
```

To process all PDFs in the `pdfs/` folder via the command line:
```bash
python processor.py
```

To test the Elasticsearch connection:

```python
from processor import test_elasticsearch_connection
test_elasticsearch_connection()
```

- PDF text empty? → OCR kicks in if `ocr_enabled=True`
- Semantic search returns all docs? → Ensure `embedding` is populated and normalized
- Elasticsearch 400/500 errors? → Might be a mapping mismatch. Try deleting and reindexing:
  ```bash
  curl -X DELETE http://localhost:9200/court_documents
  ```
- Embedding norms close to 0? → Debug text length and vector normalization
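For the embedding-related items, here is a quick way to check norms and similarities by hand — a plain-Python sketch (the app itself presumably uses `numpy` for this):

```python
import math

def l2_normalize(vec, eps=1e-8):
    """L2-normalize an embedding vector.

    A near-zero norm usually means the extracted text was empty or garbled,
    which is worth catching before indexing the document.
    """
    norm = math.sqrt(sum(x * x for x in vec))
    if norm < eps:
        raise ValueError("embedding norm is ~0; inspect the extracted text")
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """For already-normalized vectors, cosine similarity is the dot product."""
    return sum(x * y for x, y in zip(a, b))
```

If `cosine_similarity` returns nearly identical scores for every document, the stored `embedding` fields were likely never normalized (or were all generated from empty text).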
MIT License — feel free to use, remix, and adapt. Just don't give it to your shady lawyer uncle 👨‍⚖️.