Clean PDF Search Implementation #24
Open
ReemAbdelazim wants to merge 1 commit into
Technical Assessment: PDF Semantic Search System
This pull request introduces the foundational work for a semantic search system that allows users to upload PDFs and search them by content. Below is a summary of the proposed implementation, progress made, and next steps.
Upload and Save a PDF into the Database
Upload is implemented and complete. Uploaded PDFs are stored as binary (BLOB) data in the pdf_documents table.
Embedding-Based Search
The plan involves extracting text from each page of the PDF, generating an embedding vector for that text, and storing it in the pdf_pages table. These vectors enable semantic search by comparing user queries to stored embeddings.
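One way the storage step could work is to pack each embedding into raw bytes before writing it to the embedding_vector column, and unpack it on read. The sketch below is an assumption, not code from this branch: `vectorToBlob` and `blobToVector` are hypothetical helper names, and it assumes the column is a BLOB holding 32-bit floats in the platform's byte order.

```typescript
// Hypothetical helpers; neither exists in this branch.

// Pack an embedding into raw bytes suitable for a SQLite BLOB column.
// Values are stored as 32-bit floats, so float64 inputs lose some precision.
function vectorToBlob(vector: number[]): Uint8Array {
  const floats = Float32Array.from(vector);
  // Copy the underlying bytes so the result owns its own buffer.
  return new Uint8Array(floats.buffer.slice(0));
}

// Unpack bytes read from the BLOB column back into a Float32Array.
function blobToVector(blob: Uint8Array): Float32Array {
  if (blob.byteLength % 4 !== 0) {
    throw new Error("BLOB length is not a multiple of 4 bytes");
  }
  // Copy first: the source may be a view that is not 4-byte aligned.
  const copy = new Uint8Array(blob);
  return new Float32Array(copy.buffer, 0, copy.byteLength / 4);
}
```

Storing floats as bytes keeps each row compact compared with a JSON string, at the cost of the column not being human-readable.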
Initially, we attempted to use HuggingFace for embedding generation. However, after spending over 26 hours debugging compatibility issues, it became clear that OpenAI would have been the more efficient and stable option for this context. We began setting up the switch to OpenAI and completed parts of the integration, but time constraints prevented us from finalizing the embedding storage and search comparison logic.
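As a rough illustration of the remaining OpenAI step, the sketch below builds a request for the OpenAI embeddings endpoint and validates the response shape. It is not the code in this branch. The model name `text-embedding-3-small`, the helper names, and the injected `fetchImpl` parameter are all assumptions made so the example runs without network access.

```typescript
// Assumed model; any OpenAI embedding model could be substituted.
const EMBEDDING_MODEL = "text-embedding-3-small";
const EMBEDDINGS_URL = "https://api.openai.com/v1/embeddings";

interface HttpRequest {
  method: string;
  headers: Record<string, string>;
  body: string;
}

interface HttpResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}

// Minimal fetch-compatible signature (hypothetical) so a fake can be injected in tests.
type FetchLike = (url: string, init: HttpRequest) => Promise<HttpResponse>;

// Build the POST request for a single input string.
function buildEmbeddingRequest(text: string, apiKey: string): HttpRequest {
  return {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({ model: EMBEDDING_MODEL, input: text }),
  };
}

// Pull the first embedding out of the response body, validating its shape.
function parseEmbeddingResponse(payload: unknown): number[] {
  const data = (payload as { data?: Array<{ embedding?: unknown }> } | null)?.data;
  const embedding = data?.[0]?.embedding;
  if (!Array.isArray(embedding) || !embedding.every((v) => typeof v === "number")) {
    throw new Error("Unexpected embeddings response shape");
  }
  return embedding as number[];
}

// Send the request and return the embedding, failing loudly on HTTP errors.
async function embedText(text: string, apiKey: string, fetchImpl: FetchLike): Promise<number[]> {
  const response = await fetchImpl(EMBEDDINGS_URL, buildEmbeddingRequest(text, apiKey));
  if (!response.ok) {
    throw new Error(`Embedding request failed with status ${response.status}`);
  }
  return parseEmbeddingResponse(await response.json());
}
```

In a server route, the runtime's global `fetch` should be passable as `fetchImpl`. The same function would serve both page text at indexing time and the user query at search time.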
What Remains
If we had more time, the next steps would have been:
1. Finalize OpenAI embedding integration and generate vectors for all pages.
2. Store these vectors in the embedding_vector column of the pdf_pages table.
3. Implement the /api/search route to:
   - Accept a user query.
   - Generate an embedding for the query.
   - Compare it to all stored vectors using similarity scoring.
   - Return a list of matched pages with metadata such as filename, page number, and a content preview.
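The comparison and ranking steps listed above can be sketched as follows. This is a minimal illustration, assuming cosine similarity as the scoring function and a brute-force scan over all pages. The `StoredPage` and `SearchMatch` shapes, the function names, and the 200-character preview length are hypothetical and not taken from the branch.

```typescript
// Hypothetical row shape: a page with its embedding already loaded from pdf_pages.
interface StoredPage {
  pdfId: number;
  filename: string;
  pageNumber: number;
  text: string;
  embedding: number[];
}

// Hypothetical response item for /api/search.
interface SearchMatch {
  pdfId: number;
  filename: string;
  pageNumber: number;
  preview: string;
  score: number;
}

// Cosine similarity: dot product divided by the product of the vector norms.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vector dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  // Treat a zero vector as having no similarity rather than dividing by zero.
  return denom === 0 ? 0 : dot / denom;
}

// Score every page against the query and keep the topK best matches.
function rankPages(queryEmbedding: number[], pages: StoredPage[], topK = 5): SearchMatch[] {
  return pages
    .map((page) => ({
      pdfId: page.pdfId,
      filename: page.filename,
      pageNumber: page.pageNumber,
      preview: page.text.slice(0, 200),
      score: cosineSimilarity(queryEmbedding, page.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

A linear scan like this is likely adequate for a small document set. A larger corpus would probably need a vector index, which is outside the scope described here.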
Page-Level Indexing
The current schema supports one row per page in the pdf_pages table, indexed by pdf_id and page_number. This allows for precise search and navigation at the page level.
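For readers unfamiliar with the layout, a schema along these lines would give one row per page. Only the table names, pdf_id, page_number, and embedding_vector come from this description. Every other column name below is an illustrative assumption, and the actual definitions in sqliteUtils.ts may differ.

```typescript
// Assumed schema sketch; columns other than pdf_id, page_number, and
// embedding_vector are illustrative and may not match sqliteUtils.ts.
const CREATE_PDF_DOCUMENTS = `
CREATE TABLE IF NOT EXISTS pdf_documents (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  filename TEXT NOT NULL,
  file_data BLOB NOT NULL
);`;

// The UNIQUE constraint creates an index on (pdf_id, page_number),
// so each page appears once and lookups by that pair stay fast.
const CREATE_PDF_PAGES = `
CREATE TABLE IF NOT EXISTS pdf_pages (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  pdf_id INTEGER NOT NULL REFERENCES pdf_documents(id) ON DELETE CASCADE,
  page_number INTEGER NOT NULL,
  text_content TEXT,
  embedding_vector BLOB,
  UNIQUE (pdf_id, page_number)
);`;

// Parameterized lookup for a single page of a single document.
const SELECT_PAGE = `
SELECT text_content, embedding_vector
FROM pdf_pages
WHERE pdf_id = ? AND page_number = ?;`;
```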
Handling PDFs with Images Only
The plan is to use Tesseract.js to run OCR on pages where text extraction fails, so that scanned or image-only PDFs are also converted into searchable text and embedded.
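The fallback decision could look roughly like the sketch below. It treats a page as image-only when the text layer yields almost no characters, and only then calls OCR. The 20-character threshold, the type names, and the injected `runOcr` callback are assumptions. Wiring `runOcr` to Tesseract.js is deliberately left out.

```typescript
// Hypothetical threshold: pages with fewer non-whitespace characters
// than this are treated as image-only.
const MIN_TEXT_CHARS = 20;

type PageTextSource = "text-layer" | "ocr";

interface PageText {
  text: string;
  source: PageTextSource;
}

// Decide whether the extracted text layer is too sparse to be useful.
function needsOcr(extractedText: string, minChars = MIN_TEXT_CHARS): boolean {
  // Count only non-whitespace characters so blank pages do not pass.
  return extractedText.replace(/\s+/g, "").length < minChars;
}

// Return the text layer when it is usable, otherwise fall back to OCR.
// `runOcr` is injected so the OCR engine is only invoked when needed.
async function getPageText(
  extractedText: string,
  runOcr: () => Promise<string>,
): Promise<PageText> {
  if (!needsOcr(extractedText)) {
    return { text: extractedText.trim(), source: "text-layer" };
  }
  const ocrText = await runOcr();
  return { text: ocrText.trim(), source: "ocr" };
}
```

Recording the `source` alongside the text would make it possible to audit OCR quality later, since OCR output tends to be noisier than an embedded text layer.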
Global Search Functionality
The intended user-facing search bar will send a query to /api/search, which will return matches across all uploaded documents. The backend is designed to support this once embeddings are available.
Sidebar Integration
Search results will be returned as structured data, including PDF ID, filename, page number, and a content snippet. These can be wired into the existing sidebar UI to allow users to browse and navigate through relevant results.
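A sketch of how such results might be shaped for the sidebar follows. The `SearchResult` interface mirrors the fields named above, while `makeSnippet`, `groupByDocument`, and the default 40-character radius are hypothetical. Because semantic matches may not contain the query words literally, the snippet falls back to the start of the page when no literal match is found.

```typescript
// Fields follow the description above; the interface name is an assumption.
interface SearchResult {
  pdfId: number;
  filename: string;
  pageNumber: number;
  snippet: string;
}

// Build a short snippet centered on the first case-insensitive match of `term`.
// Falls back to the start of the text when the term does not occur literally.
function makeSnippet(text: string, term: string, radius = 40): string {
  const index = term.length === 0 ? -1 : text.toLowerCase().indexOf(term.toLowerCase());
  if (index < 0) {
    return text.length > radius * 2 ? text.slice(0, radius * 2) + "..." : text;
  }
  const start = Math.max(0, index - radius);
  const end = Math.min(text.length, index + term.length + radius);
  return (start > 0 ? "..." : "") + text.slice(start, end) + (end < text.length ? "..." : "");
}

// Group results by document so the sidebar can render one section per PDF.
// Map preserves insertion order, so documents appear in ranked order.
function groupByDocument(results: SearchResult[]): Map<number, SearchResult[]> {
  const groups = new Map<number, SearchResult[]>();
  for (const result of results) {
    const group = groups.get(result.pdfId) ?? [];
    group.push(result);
    groups.set(result.pdfId, group);
  }
  return groups;
}
```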
Document Navigation
The application already supports per-page navigation. Clicking on a search result will trigger a "jump to page" action, integrating smoothly with the current viewer.
Extending the Existing Codebase
All changes in this branch are modular extensions of the current codebase. We made additions to sqliteUtils.ts, added routes in app/api/, and reused existing components such as page.tsx, the search bar, and the sidebar. No major rewrites were required.
Summary
This pull request lays the groundwork for a full semantic search experience over uploaded PDFs. Upload and storage are complete. Embedding-based search is partially implemented and would be straightforward to complete using OpenAI. The remaining work is clearly scoped and aligns with the existing architecture.