Clean PDF Search Implementation by ReemAbdelazim · Pull Request #24 · adanomad/pdf-highlight-oa

ReemAbdelazim · 2025-07-06T01:49:15Z

Technical Assessment: PDF Semantic Search System
This pull request introduces the foundational work for a semantic search system that allows users to upload PDFs and search them by content. Below is a summary of the proposed implementation, progress made, and next steps.

Upload and Save a PDF into the Database
Upload is implemented: uploaded PDFs are stored as binary (BLOB) data in the pdf_documents table. This part is complete.
Embedding-Based Search
The plan involves extracting text from each page of the PDF, generating an embedding vector for that text, and storing it in the pdf_pages table. These vectors enable semantic search by comparing user queries to stored embeddings.
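To illustrate the storage step, here is a minimal TypeScript sketch that packs an embedding into bytes suitable for a BLOB column and reads it back. It assumes embedding_vector holds raw little-endian float32 values; the PR does not state the column's encoding. vectorToBlob and blobToVector are hypothetical helper names, not functions from sqliteUtils.ts.

```typescript
// Assumption: embeddings are stored as raw little-endian float32 bytes.
// Both helpers are hypothetical and not part of the existing codebase.

// Pack a vector of numbers into 4 bytes per value.
function vectorToBlob(vector: number[]): Uint8Array {
  const bytes = new Uint8Array(vector.length * 4);
  const view = new DataView(bytes.buffer);
  vector.forEach((value, i) => view.setFloat32(i * 4, value, true));
  return bytes;
}

// Read the bytes back into a plain number array.
function blobToVector(blob: Uint8Array): number[] {
  if (blob.byteLength % 4 !== 0) {
    throw new Error("BLOB length is not a multiple of 4 bytes");
  }
  // Respect byteOffset in case the bytes are a view into a larger buffer.
  const view = new DataView(blob.buffer, blob.byteOffset, blob.byteLength);
  const vector: number[] = [];
  for (let offset = 0; offset < blob.byteLength; offset += 4) {
    vector.push(view.getFloat32(offset, true));
  }
  return vector;
}
```

Storing float32 halves the size compared with float64, at the cost of some precision; whether that trade-off is acceptable depends on the embedding model chosen.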

Initially, we attempted to use HuggingFace for embedding generation. However, after spending over 26 hours debugging compatibility issues, it became clear that OpenAI would have been the more efficient and stable option for this context. We began setting up the switch to OpenAI and completed parts of the integration, but time constraints prevented us from finalizing the embedding storage and search comparison logic.

What Remains
If we had more time, the next steps would have been:

1. Finalize OpenAI embedding integration and generate vectors for all pages.
2. Store these vectors in the embedding_vector column of the pdf_pages table.
3. Implement the /api/search route to:
   - Accept a user query.
   - Generate an embedding for the query.
   - Compare it to all stored vectors using similarity scoring.
   - Return a list of matched pages with metadata such as filename, page number, and a content preview.
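The comparison step of the route could be sketched as below. This assumes cosine similarity as the scoring function (the description only says "similarity scoring") and a brute-force scan over all stored page vectors. The PageEmbedding and ScoredPage shapes, the rankPages name, and the topK default are hypothetical.

```typescript
// Hypothetical shapes; field names are assumptions, not taken from the branch.
interface PageEmbedding {
  pdfId: number;
  pageNumber: number;
  vector: number[];
}

interface ScoredPage {
  pdfId: number;
  pageNumber: number;
  score: number;
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every stored page against the query and keep the best topK.
function rankPages(
  queryVector: number[],
  pages: PageEmbedding[],
  topK: number = 5
): ScoredPage[] {
  return pages
    .map((page) => ({
      pdfId: page.pdfId,
      pageNumber: page.pageNumber,
      score: cosineSimilarity(queryVector, page.vector),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

A linear scan like this is likely adequate for a small collection of PDFs; a larger corpus would probably need a vector index instead.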

Page-Level Indexing
The current schema supports one row per page in the pdf_pages table, indexed by pdf_id and page_number. This allows for precise search and navigation at the page level.
Handling PDFs with Images Only
The plan is to use Tesseract.js to run OCR on pages where text extraction fails or returns no usable text. This would allow scanned or image-only PDFs to be converted into searchable text and embedded like any other page.
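One way to decide when a page needs the OCR fallback is a small check on the extracted text, sketched below. The needsOcr name and the 20-character threshold are assumptions for illustration; the description does not say how a failed extraction is detected. The OCR call itself is left out of the sketch.

```typescript
// Hypothetical heuristic: treat a page as image-only when its text layer
// yields fewer than minChars non-whitespace characters.
function needsOcr(
  extractedText: string | null | undefined,
  minChars: number = 20
): boolean {
  if (extractedText === null || extractedText === undefined) {
    return true; // extraction failed outright
  }
  // Ignore whitespace so a page of blank lines does not count as text.
  const visibleChars = extractedText.replace(/\s+/g, "").length;
  return visibleChars < minChars;
}
```

Pages for which needsOcr returns true would then be rendered to an image and passed to the OCR step before embedding.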
Global Search Functionality
The intended user-facing search bar will send a query to /api/search, which will return matches across all uploaded documents. The backend is designed to support this once embeddings are available.
Sidebar Integration
Search results will be returned as structured data, including PDF ID, filename, page number, and a content snippet. These can be wired into the existing sidebar UI to allow users to browse and navigate through relevant results.
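A possible shape for that structured data is sketched below. The field names, the toSearchResult and buildSnippet helpers, and the 160-character preview length are hypothetical. The snippet is simply the start of the page text, since a semantic match has no single keyword position to center on.

```typescript
// Hypothetical result shape based on the fields listed in the description.
interface SearchResult {
  pdfId: number;
  filename: string;
  pageNumber: number;
  snippet: string;
}

// Collapse whitespace and cut the preview at a word boundary.
function buildSnippet(pageText: string, maxLength: number = 160): string {
  const collapsed = pageText.replace(/\s+/g, " ").trim();
  if (collapsed.length <= maxLength) {
    return collapsed;
  }
  const cut = collapsed.slice(0, maxLength);
  const lastSpace = cut.lastIndexOf(" ");
  // Fall back to a hard cut when the text has no space to break on.
  return (lastSpace > 0 ? cut.slice(0, lastSpace) : cut) + "...";
}

// Assemble one result entry for the sidebar.
function toSearchResult(
  pdfId: number,
  filename: string,
  pageNumber: number,
  pageText: string
): SearchResult {
  return { pdfId, filename, pageNumber, snippet: buildSnippet(pageText) };
}
```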
Document Navigation
The application already supports per-page navigation. Clicking on a search result will trigger a "jump to page" action, integrating smoothly with the current viewer.
Extending the Existing Codebase
All changes in this branch are modular extensions of the current codebase. We made additions to sqliteUtils.ts, added routes in app/api/, and reused existing components such as page.tsx, the search bar, and the sidebar. No major rewrites were required.

Summary
This pull request lays the groundwork for a full semantic search experience over uploaded PDFs. Upload and storage are complete. Embedding-based search is partially implemented and would be straightforward to complete using OpenAI. The remaining work is clearly scoped and aligns with the existing architecture.

Rebuild: Clean PDF search feature

c939be5

ReemAbdelazim wants to merge 1 commit into adanomad:main from ReemAbdelazim:clean-pdf-search
