feat: add OCR-based searchable PDF converter by upasana-2006 · Pull Request #368 · Durgeshwar-AI/pdfToPng

upasana-2006 · 2026-06-17T16:29:15Z

🔀 Pull Request

📌 Issue Reference

Closes #366

📝 Summary

This PR introduces a new OCR-Based Searchable PDF Converter feature that enables users to convert scanned or image-based PDF documents into searchable PDFs.

Many scanned PDFs contain only images and do not allow users to search, copy, highlight, or extract text. This feature leverages OCR (Optical Character Recognition) to recognize text from scanned PDF pages and generate a new searchable PDF while preserving the original document appearance.
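As a rough illustration of the OCR step described above, the sketch below turns already-rendered page images into one searchable single-page PDF per page. It relies on `pytesseract.image_to_pdf_or_hocr`, which returns PDF bytes containing the page image with an invisible text layer. The function name `ocr_pages_to_pdf_bytes` and the injectable `ocr` parameter are hypothetical and not taken from this PR. Rendering PDF pages to images and merging the per-page results are left out because the PR text does not say which libraries handle them.

```python
from typing import Callable, Iterable, List, Optional


def ocr_pages_to_pdf_bytes(
    pages: Iterable,
    lang: str = "eng",
    ocr: Optional[Callable] = None,
) -> List[bytes]:
    """Return one searchable single-page PDF (as bytes) per input page image.

    `pages` yields images that pytesseract accepts (PIL images or NumPy arrays).
    `ocr` is a hypothetical hook so the loop can be exercised without Tesseract.
    """
    if ocr is None:
        import pytesseract  # imported lazily; needs the Tesseract binary at runtime

        def ocr(image, lang):
            # extension="pdf" asks Tesseract for a PDF with a hidden text layer
            return pytesseract.image_to_pdf_or_hocr(image, lang=lang, extension="pdf")

    return [ocr(page, lang) for page in pages]
```

Keeping the OCR call behind a parameter is only a convenience for testing; the actual blueprint may call pytesseract directly.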

Problem Solved

Scanned PDFs are not searchable.
Text cannot be copied or extracted from image-based documents.
Accessibility and document indexing are limited.
Users often need searchable versions of invoices, notes, reports, forms, and scanned documents.

Changes Made

Backend

Added a new Flask blueprint:
- backend/blueprints/searchable_pdf_ocr.py
Implemented OCR-based PDF processing pipeline.
Added image preprocessing support using OpenCV:
- No preprocessing
- Light denoising
- Balanced OCR cleanup
- Strong thresholding
Added support for configurable OCR languages.
Added validation for uploaded PDF files.
Added generation of downloadable searchable PDF outputs.
Registered the new blueprint in the Flask application.
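The upload validation and the four preprocessing modes listed above could look roughly like the sketch below. The mode keys (`none`, `light`, `balanced`, `strong`), the function names, and every numeric parameter are assumptions chosen for illustration; the PR text does not state the actual values used in `searchable_pdf_ocr.py`.

```python
# Hypothetical helpers; names and thresholds are illustrative, not taken from the PR.
PDF_MAGIC = b"%PDF-"
PREPROCESS_MODES = ("none", "light", "balanced", "strong")


def looks_like_pdf(filename: str, head: bytes) -> bool:
    """Cheap upload check: .pdf extension plus the PDF magic bytes at the start."""
    return filename.lower().endswith(".pdf") and head.startswith(PDF_MAGIC)


def preprocess(image, mode: str = "balanced"):
    """Return a cleaned-up version of a BGR page image (NumPy array) for OCR."""
    if mode not in PREPROCESS_MODES:
        raise ValueError(f"unknown preprocessing mode: {mode!r}")
    if mode == "none":
        return image  # hand the rendered page to OCR untouched

    import cv2  # imported lazily so mode validation works without OpenCV

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    if mode == "light":
        # Light denoising: non-local means with filter strength h=10
        return cv2.fastNlMeansDenoising(gray, None, 10)
    if mode == "balanced":
        # Balanced cleanup: small median blur, then a local (adaptive) threshold
        blurred = cv2.medianBlur(gray, 3)
        return cv2.adaptiveThreshold(
            blurred, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 15
        )
    # Strong thresholding: global Otsu binarization
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```

A stricter validator would also try to open the file with a PDF library, since the magic-byte check only rejects obviously wrong uploads.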

Dependencies

Added:
- pytesseract
- opencv-python-headless

Frontend

Added a dedicated page:
- frontend/src/pages/PdfSearchableOCR.jsx
Added OCR settings interface:
- Language selection
- Preprocessing mode selection
Added route registration in App.jsx.
Added tool listing entry in toolsData.jsx.

Benefits

Converts scanned PDFs into searchable documents.
Improves accessibility and usability.
Supports multi-page PDF processing.
Preserves original document appearance.
Uses local OCR processing without relying on external APIs.

📸 Screenshots (if applicable)

New Tool Interface

OCR Language Selection
Image Preprocessing Options
Searchable PDF Download Generation

(Screenshots will be added after review/testing.)

✅ Checklist

My code follows the project's coding conventions
I have tested all impacted features
I have updated or added necessary documentation

🔗 Related Issues / PRs

Related Issue: #<issue_number>

🏅 Open Source Program Participation

Program Name: GSSoC 2026

💬 Additional Notes

The implementation performs OCR locally and does not depend on external OCR APIs.
The feature is designed to support scanned, image-only PDFs.
Generated PDFs preserve the visual appearance of the original document while adding searchable text functionality.
The solution is modular and can be extended in the future with additional OCR languages, batch processing, confidence scoring, or advanced document enhancement techniques.
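Since additional OCR languages are mentioned as a future extension, here is a minimal sketch of how user-selected language codes might be normalized into the `eng+hin` form that Tesseract accepts for multi-language OCR. The allow-list `SUPPORTED_LANGS` and the helper `build_lang_arg` are hypothetical; the codes that actually work depend on which Tesseract language data files are installed on the server.

```python
from typing import Iterable

# Assumed allow-list; adjust to the traineddata files installed with Tesseract.
SUPPORTED_LANGS = {"eng", "hin", "ben", "fra", "deu", "spa"}


def build_lang_arg(requested: Iterable[str], default: str = "eng") -> str:
    """Join valid codes with '+', dropping unknown or duplicate entries."""
    selected = []
    for code in requested:
        code = code.strip().lower()
        if code in SUPPORTED_LANGS and code not in selected:
            selected.append(code)
    # Fall back to the default when nothing usable was requested
    return "+".join(selected) if selected else default
```

Filtering against an allow-list also keeps arbitrary user input out of the string passed to Tesseract.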

vercel · 2026-06-17T16:29:19Z

@upasana-2006 is attempting to deploy a commit to the Durgeshwar's projects Team on Vercel.

A member of the Team first needs to authorize it.

Durgeshwar-AI · 2026-06-18T07:22:14Z

@upasana-2006 please update requirements.txt; otherwise the feature does not work.

upasana-2006 · 2026-06-18T15:21:05Z

Updated backend/requirements.txt with the required OCR dependencies:

pytesseract>=0.3.10
opencv-python-headless>=4.10.0

The latest commit on this PR branch includes these changes. Please re-check once the workflow/deployment is approved.
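One point that may be relevant when re-checking: pytesseract is only a wrapper, so the Tesseract executable has to be installed on the machine separately from `requirements.txt`. A small startup check along the following lines could make a missing dependency easier to spot. The helper name `missing_ocr_dependencies` and its injectable parameters are hypothetical and not part of this PR.

```python
import importlib.util
import shutil

# Import name -> pip package name (opencv-python-headless is imported as cv2).
REQUIRED_MODULES = {
    "pytesseract": "pytesseract",
    "cv2": "opencv-python-headless",
}


def missing_ocr_dependencies(find_spec=importlib.util.find_spec, which=shutil.which):
    """Return a list of OCR dependencies that cannot be found in this environment."""
    missing = [
        package
        for module, package in REQUIRED_MODULES.items()
        if find_spec(module) is None
    ]
    # pip cannot install the OCR engine itself; look for it on PATH
    if which("tesseract") is None:
        missing.append("tesseract (system binary)")
    return missing


if __name__ == "__main__":
    print(missing_ocr_dependencies() or "all OCR dependencies found")
```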

Durgeshwar-AI · 2026-06-20T07:47:32Z

Can you please check this error? Also, the tool has not been added to toolsData.jsx.

Sorry it took some time for the review.

Commit 69e9dd6: feat: add OCR-based searchable PDF converter

upasana-2006 wants to merge 1 commit into Durgeshwar-AI:main from upasana-2006:feat/searchable-pdf-ocr
