Clean PDF Search Implementation #24

Open
ReemAbdelazim wants to merge 1 commit into adanomad:main from ReemAbdelazim:clean-pdf-search

Technical Assessment: PDF Semantic Search System
This pull request introduces the foundational work for a semantic search system that allows users to upload PDFs and search them by content. Below is a summary of the proposed implementation, progress made, and next steps.

  1. Upload and Save a PDF into the Database
    Upload is implemented and complete: uploaded PDFs are stored as binary (BLOB) data in the pdf_documents table.

  2. Embedding-Based Search
    The plan is to extract the text of each PDF page, generate an embedding vector for that text, and store the vector in the pdf_pages table. Semantic search then works by embedding the user's query and comparing it against the stored page embeddings.

Initially, we attempted to use HuggingFace for embedding generation. However, after spending over 26 hours debugging compatibility issues, it became clear that OpenAI would have been the more efficient and stable option for this context. We began setting up the switch to OpenAI and completed parts of the integration, but time constraints prevented us from finalizing the embedding storage and search comparison logic.
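To make the planned per-page embedding storage more concrete, here is a minimal sketch of how a vector could be packed into bytes for a BLOB column such as embedding_vector and read back. The helper names are hypothetical and the float32 layout is an assumption; the branch may well choose a different representation (for example JSON text).

```typescript
// Hypothetical helpers, not taken from sqliteUtils.ts.

/** Pack an embedding into little-endian float32 bytes suitable for a BLOB column. */
function encodeEmbedding(vector: number[]): Uint8Array {
  const bytes = new Uint8Array(vector.length * 4);
  const view = new DataView(bytes.buffer);
  vector.forEach((value, i) => view.setFloat32(i * 4, value, true));
  return bytes;
}

/** Read a BLOB produced by encodeEmbedding back into a number array. */
function decodeEmbedding(blob: Uint8Array): number[] {
  if (blob.byteLength % 4 !== 0) {
    throw new Error("embedding blob length must be a multiple of 4 bytes");
  }
  // Respect byteOffset in case the blob is a view into a larger buffer.
  const view = new DataView(blob.buffer, blob.byteOffset, blob.byteLength);
  const vector: number[] = [];
  for (let offset = 0; offset < blob.byteLength; offset += 4) {
    vector.push(view.getFloat32(offset, true));
  }
  return vector;
}
```

Storing float32 halves the size compared with float64 at the cost of some precision, which is usually acceptable for similarity scoring.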

What Remains
If we had more time, the next steps would have been:

  1. Finalize the OpenAI embedding integration and generate vectors for all pages.
  2. Store these vectors in the embedding_vector column of the pdf_pages table.
  3. Implement the /api/search route to:
     - Accept a user query.
     - Generate an embedding for the query.
     - Compare it to all stored vectors using similarity scoring.
     - Return a list of matched pages with metadata such as filename, page number, and a content preview.
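The comparison and ranking step of the planned /api/search route could look roughly like the sketch below. It assumes cosine similarity as the scoring function (the description only says "similarity scoring") and that page embeddings have already been loaded into memory; the type and function names are hypothetical.

```typescript
// Hypothetical shapes: field names are illustrative, not the branch's actual types.
interface StoredPage {
  pdfId: number;
  filename: string;
  pageNumber: number;
  text: string;
  embedding: number[];
}

interface SearchMatch {
  pdfId: number;
  filename: string;
  pageNumber: number;
  preview: string;
  score: number;
}

/** Cosine similarity of two equal-length vectors; returns 0 if either is all zeros. */
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

/** Score every stored page against the query embedding and keep the best matches. */
function rankPages(queryEmbedding: number[], pages: StoredPage[], limit = 10): SearchMatch[] {
  return pages
    .map((page) => ({
      pdfId: page.pdfId,
      filename: page.filename,
      pageNumber: page.pageNumber,
      preview: page.text.slice(0, 200), // simple content preview
      score: cosineSimilarity(queryEmbedding, page.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}
```

A linear scan like this compares the query against every stored page, which matches the "compare it to all stored vectors" step and should be adequate for a small document collection.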

  1. Page-Level Indexing
    The current schema supports one row per page in the pdf_pages table, indexed by pdf_id and page_number. This allows for precise search and navigation at the page level.
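For readers unfamiliar with the layout, a schema along these lines would give one row per page with a lookup by pdf_id and page_number. Only the table names and the pdf_id, page_number, and embedding_vector columns come from this description; every other column, the index name, and the constant name are assumptions for illustration.

```typescript
// Sketch of a possible SQLite schema; columns marked "assumed" are not confirmed by the branch.
const PDF_SEARCH_SCHEMA = `
CREATE TABLE IF NOT EXISTS pdf_documents (
  id        INTEGER PRIMARY KEY AUTOINCREMENT,
  filename  TEXT NOT NULL,   -- assumed column
  data      BLOB NOT NULL    -- raw PDF bytes; column name assumed
);

CREATE TABLE IF NOT EXISTS pdf_pages (
  id               INTEGER PRIMARY KEY AUTOINCREMENT,
  pdf_id           INTEGER NOT NULL REFERENCES pdf_documents(id),
  page_number      INTEGER NOT NULL,
  text_content     TEXT,     -- assumed column for extracted or OCR text
  embedding_vector BLOB      -- stays NULL until embeddings are generated
);

-- One row per (document, page); also serves page-level lookups.
CREATE UNIQUE INDEX IF NOT EXISTS idx_pdf_pages_pdf_page
  ON pdf_pages (pdf_id, page_number);
`;
```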

  2. Handling PDFs with Images Only
    The plan is to run OCR with Tesseract.js on pages where text extraction fails, so that scanned or image-only PDFs are also converted into searchable text and embedded.
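The fallback decision could be sketched as follows. The extractor and OCR step are passed in as functions so the sketch stays library-agnostic; in a real integration the runOcr argument might wrap a Tesseract.js recognition call. The function names and the 20-character threshold are assumptions.

```typescript
// Hypothetical signature: resolves to the text of one page.
type PageTextSource = (pageNumber: number) => Promise<string>;

// Assumed threshold: pages with fewer non-whitespace characters are treated as image-only.
const MIN_TEXT_CHARS = 20;

/** True when the extracted text layer is empty or too short to be useful. */
function needsOcr(extractedText: string): boolean {
  return extractedText.replace(/\s+/g, "").length < MIN_TEXT_CHARS;
}

/** Prefer the PDF text layer; fall back to OCR when extraction fails or yields almost nothing. */
async function getPageText(
  pageNumber: number,
  extractText: PageTextSource,
  runOcr: PageTextSource,
): Promise<{ text: string; source: "text-layer" | "ocr" }> {
  let extracted = "";
  try {
    extracted = await extractText(pageNumber);
  } catch {
    extracted = ""; // treat an extraction error like an empty text layer
  }
  if (!needsOcr(extracted)) {
    return { text: extracted, source: "text-layer" };
  }
  return { text: await runOcr(pageNumber), source: "ocr" };
}
```

Recording which source produced the text makes it easier to spot pages whose search quality depends on OCR accuracy.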

  3. Global Search Functionality
    The intended user-facing search bar will send a query to /api/search, which will return matches across all uploaded documents. The backend is designed to support this once embeddings are available.

  4. Sidebar Integration
    Search results will be returned as structured data, including PDF ID, filename, page number, and a content snippet. These can be wired into the existing sidebar UI to allow users to browse and navigate through relevant results.
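One possible way to produce the content snippet for each result is sketched below: it centers a short window on the first literal occurrence of a query term and falls back to the start of the page when there is none, which can happen with purely semantic matches. The function names and the default window size are assumptions.

```typescript
/** Escape characters that have special meaning in a regular expression. */
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

/** Build a short preview of pageText around the earliest query-term match. */
function buildSnippet(pageText: string, query: string, radius = 60): string {
  const normalized = pageText.replace(/\s+/g, " ").trim();
  let hit = -1;
  for (const term of query.split(/\s+/)) {
    if (term.length === 0) continue;
    const index = normalized.search(new RegExp(escapeRegExp(term), "i"));
    if (index !== -1 && (hit === -1 || index < hit)) hit = index;
  }
  if (hit === -1) hit = 0; // no literal match: preview the start of the page
  const start = Math.max(0, hit - radius);
  const end = Math.min(normalized.length, hit + radius);
  const prefix = start > 0 ? "..." : "";
  const suffix = end < normalized.length ? "..." : "";
  return prefix + normalized.slice(start, end) + suffix;
}
```

For example, `buildSnippet("The quick brown fox jumps over the lazy dog", "fox", 5)` returns `"...rown fox j..."`.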

  5. Document Navigation
    The application already supports per-page navigation. Clicking a search result will trigger a "jump to page" action in the current viewer.

  6. Extending the Existing Codebase
    All changes in this branch are modular extensions of the current codebase. We made additions to sqliteUtils.ts, added routes in app/api/, and reused existing components such as page.tsx, the search bar, and the sidebar. No major rewrites were required.

Summary
This pull request lays the groundwork for a full semantic search experience over uploaded PDFs. Upload and storage are complete. Embedding-based search is partially implemented and would be straightforward to complete using OpenAI. The remaining work is clearly scoped and aligns with the existing architecture.
