Skip to content

feat: add OCR-based searchable PDF converter#368

Open
upasana-2006 wants to merge 1 commit into
Durgeshwar-AI:mainfrom
upasana-2006:feat/searchable-pdf-ocr
Open

feat: add OCR-based searchable PDF converter#368
upasana-2006 wants to merge 1 commit into
Durgeshwar-AI:mainfrom
upasana-2006:feat/searchable-pdf-ocr

Conversation

@upasana-2006

Copy link
Copy Markdown

🔀 Pull Request

📌 Issue Reference

Closes #366


📝 Summary

This PR introduces a new OCR-Based Searchable PDF Converter feature that enables users to convert scanned or image-based PDF documents into searchable PDFs.

Many scanned PDFs contain only images and do not allow users to search, copy, highlight, or extract text. This feature leverages OCR (Optical Character Recognition) to recognize text from scanned PDF pages and generate a new searchable PDF while preserving the original document appearance.

Problem Solved

  • Scanned PDFs are not searchable.
  • Text cannot be copied or extracted from image-based documents.
  • Accessibility and document indexing are limited.
  • Users often need searchable versions of invoices, notes, reports, forms, and scanned documents.

Changes Made

Backend

  • Added a new Flask blueprint:

    • backend/blueprints/searchable_pdf_ocr.py
  • Implemented OCR-based PDF processing pipeline.

  • Added image preprocessing support using OpenCV:

    • No preprocessing
    • Light denoising
    • Balanced OCR cleanup
    • Strong thresholding
  • Added support for configurable OCR languages.

  • Added validation for uploaded PDF files.

  • Added generation of downloadable searchable PDF outputs.

  • Registered the new blueprint in the Flask application.

Dependencies

  • Added:

    • pytesseract
    • opencv-python-headless

Frontend

  • Added a dedicated page:

    • frontend/src/pages/PdfSearchableOCR.jsx
  • Added OCR settings interface:

    • Language selection
    • Preprocessing mode selection
  • Added route registration in App.jsx.

  • Added tool listing entry in toolsData.jsx.

Benefits

  • Converts scanned PDFs into searchable documents.
  • Improves accessibility and usability.
  • Supports multi-page PDF processing.
  • Preserves original document appearance.
  • Uses local OCR processing without relying on external APIs.

📸 Screenshots (if applicable)

New Tool Interface

  • OCR Language Selection
  • Image Preprocessing Options
  • Searchable PDF Download Generation

(Screenshots will be added after review/testing.)


✅ Checklist

  • My code follows the project's coding conventions
  • I have tested all impacted features
  • I have updated or added necessary documentation

🔗 Related Issues / PRs

  • Related Issue: #<issue_number>

🏅 Open Source Program Participation

Program Name: GSSoC 2026


💬 Additional Notes

  • The implementation performs OCR locally and does not depend on external OCR APIs.
  • The feature is designed to support scanned, image-only PDFs.
  • Generated PDFs preserve the visual appearance of the original document while adding searchable text functionality.
  • The solution is modular and can be extended in the future with additional OCR languages, batch processing, confidence scoring, or advanced document enhancement techniques.

@vercel

vercel Bot commented Jun 17, 2026

Copy link
Copy Markdown

@upasana-2006 is attempting to deploy a commit to the Durgeshwar's projects Team on Vercel.

A member of the Team first needs to authorize it.

@Durgeshwar-AI

Copy link
Copy Markdown
Owner

@upasana-2006 update the requirements.txt else it is not working.

@upasana-2006

Copy link
Copy Markdown
Author

Updated backend/requirements.txt with the required OCR dependencies:

  • pytesseract>=0.3.10
  • opencv-python-headless>=4.10.0

The latest commit on this PR branch includes these changes. Please re-check once the workflow/deployment is approved.

@Durgeshwar-AI

Durgeshwar-AI commented Jun 20, 2026

Copy link
Copy Markdown
Owner
Screenshot 2026-06-20 131626

Can you please check this error. Also the tool is not added to the toolsData.jsx.

Sorry it took some time for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Add OCR-Based Scanned PDF to Searchable PDF Converter

2 participants