
Complete vector search implementation #25

Open
DanaY326 wants to merge 10 commits into adanomad:main from DanaY326:main


DanaY326 commented Jul 28, 2025


This PR completes the PDF viewer so that users can upload multiple PDFs and run vector search over the text and images they contain. Key additions include image extraction, creation and storage of vector embeddings, storage of data for multiple PDFs and all their content, UI improvements, and balanced vector search.

Image extraction is not readily available in JavaScript libraries. To extract images, I used PDF.js to obtain the PDF's operator list, which is what PDF display software reads, and scanned it for the dimensions and contents of the images in the PDF. The images can then be displayed, embedded, and located precisely enough to be highlighted.
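As an illustration of the operator-list scan described above, here is a minimal sketch, not the code in this PR. It assumes the list has the `fnArray`/`argsArray` shape that PDF.js returns from `page.getOperatorList()`, that the opcode values are passed in from `pdfjsLib.OPS`, and that the first argument of `paintImageXObject` is the image's object name. The names `findImagePlacements`, `OpCodes`, and `ImagePlacement` are hypothetical. It tracks the current transformation matrix and reports where each image's unit square lands in PDF user space.

```typescript
// Affine matrix [a, b, c, d, e, f], as used in PDF content streams.
type Matrix = [number, number, number, number, number, number];

// Opcode values are injected (e.g. taken from pdfjsLib.OPS) rather than hardcoded.
interface OpCodes {
  save: number;
  restore: number;
  transform: number;
  paintImageXObject: number;
}

// Assumed shape of a PDF.js operator list.
interface OperatorListLike {
  fnArray: number[];
  argsArray: (unknown[] | null)[];
}

// Hypothetical result type: bounding box in PDF user space (origin bottom-left).
interface ImagePlacement {
  name: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

// Compose two affine matrices: the result applies m2 first, then m1.
function multiply(m1: Matrix, m2: Matrix): Matrix {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

function findImagePlacements(list: OperatorListLike, ops: OpCodes): ImagePlacement[] {
  let ctm: Matrix = [1, 0, 0, 1, 0, 0]; // current transformation matrix
  const stack: Matrix[] = [];
  const found: ImagePlacement[] = [];

  for (let i = 0; i < list.fnArray.length; i++) {
    const fn = list.fnArray[i];
    const args = list.argsArray[i] ?? [];
    if (fn === ops.save) {
      stack.push(ctm);
    } else if (fn === ops.restore) {
      ctm = stack.pop() ?? ctm;
    } else if (fn === ops.transform) {
      ctm = multiply(ctm, args as Matrix);
    } else if (fn === ops.paintImageXObject) {
      // An image fills the unit square; map its four corners through the CTM.
      const [a, b, c, d, e, f] = ctm;
      const xs = [e, a + e, c + e, a + c + e];
      const ys = [f, b + f, d + f, b + d + f];
      const x = Math.min(...xs);
      const y = Math.min(...ys);
      found.push({
        name: String(args[0]),
        x,
        y,
        width: Math.max(...xs) - x,
        height: Math.max(...ys) - y,
      });
    }
  }
  return found;
}
```

The boxes are in PDF user space, so a viewer would still need to apply the page viewport transform before drawing a highlight on screen.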

Once extracted, both the images and the text had to be embedded in a shared space where they could be compared. Many models can compare text with text or images with images, but this feature needs both. I chose OpenAI's CLIP model because it is open source, comes from a well-respected company, and worked well in an interactive demo I found.

However, while CLIP works well when searching only images or only text, I found that when searching both together it heavily favors text results, even over images that are clearly more relevant. I therefore adjusted the cosine similarity ranking to balance this by applying a "handicap" to text results. Further investigation is needed to determine whether this is an inherent limitation of joint text and image embedding models, or whether a different embedding model would do better. For now, one can search semantically across the text and images in multiple PDFs and get results ordered by relevance.
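To make the balancing idea concrete, here is a minimal sketch of a handicapped ranking. It is an assumption about the general shape, not the exact adjustment in this PR: the handicap is modeled as a fixed penalty subtracted from each text result's cosine similarity, and the names `rankBalanced`, `Indexed`, and `textHandicap` are hypothetical.

```typescript
type Modality = "text" | "image";

interface Indexed {
  id: string;
  modality: Modality;
  embedding: number[];
}

interface Scored {
  id: string;
  modality: Modality;
  score: number;
}

// Standard cosine similarity; returns 0 if either vector has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Rank items by similarity to the query, subtracting a fixed penalty
// (the assumed "handicap") from text results only.
function rankBalanced(query: number[], items: Indexed[], textHandicap: number): Scored[] {
  return items
    .map((item) => {
      const raw = cosineSimilarity(query, item.embedding);
      const score = item.modality === "text" ? raw - textHandicap : raw;
      return { id: item.id, modality: item.modality, score };
    })
    .sort((left, right) => right.score - left.score);
}
```

A subtractive penalty is only one option. A multiplicative factor, or normalizing scores within each modality before merging, would be alternatives worth comparing when tuning the balance.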

Some other additions include:

  • UX enhancements (loading messages, detailed information about search results, and an always-available search bar)
  • Vector storage using the sqlite_vec library
  • Scroll-to-result functionality across multiple PDFs
  • Tesseract OCR run with multiple workers to speed up processing
  • Accurate highlight dimensions for all kinds of input (plain text, OCR-returned text, images)
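For the vector storage item, one detail worth illustrating is how an embedding can be turned into bytes for a BLOB column. The sketch below assumes the store accepts vectors as raw little-endian float32 bytes, which is my understanding of the compact binary format sqlite-vec supports alongside JSON. The helper names are hypothetical and this is not the code in this PR.

```typescript
// Pack an embedding into little-endian float32 bytes (4 bytes per dimension).
function serializeFloat32(vector: number[]): Uint8Array {
  const bytes = new Uint8Array(vector.length * 4);
  const view = new DataView(bytes.buffer);
  vector.forEach((value, index) => view.setFloat32(index * 4, value, true));
  return bytes;
}

// Unpack bytes produced by serializeFloat32 back into numbers.
function deserializeFloat32(bytes: Uint8Array): number[] {
  if (bytes.byteLength % 4 !== 0) {
    throw new Error("byte length must be a multiple of 4");
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const out: number[] = [];
  for (let offset = 0; offset < bytes.byteLength; offset += 4) {
    out.push(view.getFloat32(offset, true));
  }
  return out;
}
```

Values are rounded to float32 precision on the way in, so a round trip returns the original numbers only when they are exactly representable as 32-bit floats.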

Working with an existing codebase was useful because much of the code (highlight retrieval, highlight storage, the PDF upload and search bars) could be adapted for core functionality. It took time to understand at the start, but overall the tradeoff was worthwhile. In this implementation I deleted most unused code. This should make the code less confusing for future programmers, and any code we want to restore can be recovered from the Git history.

Possible future improvements:

  • Investigate embedding models that balance text and image search results better, and/or optimize the balancing algorithm
  • General UI improvements and delete functionality
  • Supabase and cloud storage implementation

DanaY326 closed this Jul 28, 2025
DanaY326 reopened this Jul 28, 2025
DanaY326 changed the title from "Merge branch 'main' of https://github.com/DanaY326/pdf-highlight-oa-fork" to "Complete search implementation" Jul 28, 2025
DanaY326 changed the title from "Complete search implementation" to "Complete vector search implementation" Jul 28, 2025