Complete vector search implementation by DanaY326 · Pull Request #25 · adanomad/pdf-highlight-oa

DanaY326 · 2025-07-28T13:08:31Z

This PR completes the PDF viewer so that users can upload multiple PDFs and run vector search over the text and images they contain. Key additions include image extraction, creation and storage of vector embeddings, storage of multiple PDFs and all of their content, UI improvements, and balanced vector search.

Image extraction is not readily available in JavaScript libraries. To extract images, I obtained each PDF's operator list (the sequence of drawing instructions that PDF-rendering software reads) using PDF.js, then scanned it for the dimensions and contents of the images in the PDF. The images can then be displayed, embedded, and located precisely so that they can be highlighted.
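The operator-list scan described above could look roughly like the following. This is a minimal sketch rather than the code in this PR: `findImages`, `OpCodes`, and `ImagePlacement` are hypothetical names, the opcodes are assumed to come from `pdfjsLib.OPS`, and the argument layout of the image operator (`[objId, width, height]`) is an assumption about PDF.js that should be checked against the installed version. It tracks the current transformation matrix through save/restore/transform operators and, for each painted image, maps the unit square through that matrix to get the image's box in PDF user space.

```typescript
// Affine matrix in PDF order [a, b, c, d, e, f].
type Matrix = [number, number, number, number, number, number];

// Minimal shape of what PDF.js `page.getOperatorList()` resolves to.
interface OperatorList {
  fnArray: number[];
  argsArray: unknown[];
}

// Hypothetical holder for opcodes; in real use read them from `pdfjsLib.OPS`.
interface OpCodes {
  save: number;
  restore: number;
  transform: number;
  paintImageXObject: number;
}

interface ImagePlacement {
  objId: string; // assumed key for fetching pixel data, e.g. `page.objs.get(objId)`
  x: number;
  y: number;
  width: number;
  height: number; // all in PDF user-space units, before any viewport scaling
}

// Combine the current matrix with a new one (the new one is applied first).
function concat(m1: Matrix, m2: Matrix): Matrix {
  const [a1, b1, c1, d1, e1, f1] = m1;
  const [a2, b2, c2, d2, e2, f2] = m2;
  return [
    a1 * a2 + c1 * b2,
    b1 * a2 + d1 * b2,
    a1 * c2 + c1 * d2,
    b1 * c2 + d1 * d2,
    a1 * e2 + c1 * f2 + e1,
    b1 * e2 + d1 * f2 + f1,
  ];
}

function isMatrix(v: unknown): v is Matrix {
  return Array.isArray(v) && v.length === 6 && v.every((n) => typeof n === "number");
}

function findImages(list: OperatorList, ops: OpCodes): ImagePlacement[] {
  const found: ImagePlacement[] = [];
  const stack: Matrix[] = [];
  let ctm: Matrix = [1, 0, 0, 1, 0, 0];
  for (let i = 0; i < list.fnArray.length; i++) {
    const fn = list.fnArray[i];
    const args = list.argsArray[i];
    if (fn === ops.save) {
      stack.push(ctm);
    } else if (fn === ops.restore) {
      ctm = stack.pop() ?? ctm;
    } else if (fn === ops.transform && isMatrix(args)) {
      ctm = concat(ctm, args);
    } else if (fn === ops.paintImageXObject && Array.isArray(args)) {
      const objId: unknown = args[0];
      if (typeof objId !== "string") continue;
      // An image fills the unit square, so its box is the matrix applied to the four corners.
      const [a, b, c, d, e, f] = ctm;
      const xs = [e, a + e, c + e, a + c + e];
      const ys = [f, b + f, d + f, b + d + f];
      const x = Math.min(...xs);
      const y = Math.min(...ys);
      found.push({ objId, x, y, width: Math.max(...xs) - x, height: Math.max(...ys) - y });
    }
  }
  return found;
}
```

The resulting boxes are in PDF coordinates, so they would still need to be converted to viewport coordinates before being drawn as highlights.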

Once extracted, the images and text had to be embedded in a shared space where they could be compared. Many models can compare text with text or images with images, but this feature needs both. I chose the CLIP model from OpenAI because it is open source, comes from a well-respected company, and worked well in an interactive demo I found.

CLIP works well when searching only images or only text. When searching both together, however, I found that it heavily favors text results, even over images that are clearly more relevant. I therefore adjusted the cosine similarity search to compensate by applying a "handicap" to text results. Further investigation is needed to determine whether this is an inherent limitation of joint text-and-image embedding models or whether a different embedding model would do better. For now, one can search semantically through the text and images of multiple PDFs and get results ordered by relevance.
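One way such a handicap could be applied is sketched below. The exact form used in this PR is not described here, so this is an assumption: text scores are reduced by a fixed amount before ranking. `balancedSearch`, `Item`, `Scored`, and the default of `0.1` are hypothetical and only for illustration.

```typescript
type Kind = "text" | "image";

interface Item {
  id: string;
  kind: Kind;
  embedding: number[];
}

interface Scored {
  id: string;
  kind: Kind;
  score: number;
}

// Standard cosine similarity; returns 0 if either vector has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Rank items by similarity to the query, subtracting a fixed penalty
// (assumed form of the "handicap") from every text result.
function balancedSearch(query: number[], items: Item[], textHandicap = 0.1): Scored[] {
  return items
    .map((item) => {
      const raw = cosineSimilarity(query, item.embedding);
      const score = item.kind === "text" ? raw - textHandicap : raw;
      return { id: item.id, kind: item.kind, score };
    })
    .sort((p, q) => q.score - p.score);
}
```

A subtractive penalty is only one option; scaling text scores by a factor, or normalizing scores within each modality before merging, would be alternatives worth comparing.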

Some other additions include:

UX enhancements (loading messages, detailed information about search results, and an always-available search bar)
Vector storage using the sqlite_vec library
Scroll-to-result functionality across multiple PDFs
Tesseract OCR run with multiple workers to speed up processing
Accurate highlight dimensions for all kinds of input (plain text, OCR-returned text, images)
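For the vector storage item, sqlite-vec accepts vectors as compact float32 BLOBs, so embeddings need to be packed into bytes before insertion and unpacked after reading. The helpers below are a minimal sketch of that step, not the code in this PR; `toBlob` and `fromBlob` are hypothetical names, and the sketch assumes a little-endian platform, which is what `Float32Array` uses on virtually all current hardware.

```typescript
// Pack an embedding into raw float32 bytes suitable for a SQLite BLOB column.
function toBlob(embedding: number[]): Uint8Array {
  const floats = Float32Array.from(embedding);
  return new Uint8Array(floats.buffer);
}

// Unpack float32 bytes read back from the database into plain numbers.
function fromBlob(blob: Uint8Array): number[] {
  // Copy first so the bytes start at offset 0 of their own buffer,
  // which keeps the Float32Array view correctly aligned.
  const copy = blob.slice();
  const count = Math.floor(copy.byteLength / 4);
  return Array.from(new Float32Array(copy.buffer, 0, count));
}
```

Values are rounded to float32 precision on the way in, so a round trip returns the nearest float32 value rather than the original double.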

Working from an existing codebase was useful because much of the code (highlight retrieval, highlight storage, the PDF upload and search bars) could be adapted for the core functionality. It took time to understand at the start, but overall the tradeoff was worthwhile. For this implementation I deleted most unused code. This makes the code less confusing for future programmers, and anything we want to restore can be recovered from the Git history.

Possible future improvements:

Investigate embedding models that balance text and image search results better, and/or optimize the balancing algorithm
General UI improvements and delete functionality
Supabase and cloud storage implementation

DanaY326 added 2 commits July 28, 2025 08:39

Commit all files

64b7399

Merge branch 'main' of https://github.com/DanaY326/pdf-highlight-oa-fork

e070cd5

DanaY326 closed this Jul 28, 2025

DanaY326 reopened this Jul 28, 2025

DanaY326 changed the title ~~Merge branch 'main' of https://github.com/DanaY326/pdf-highlight-oa-fork~~ Complete search implementation Jul 28, 2025

DanaY326 changed the title ~~Complete search implementation~~ Complete vector search implementation Jul 28, 2025

DanaY326 added 8 commits July 30, 2025 09:11

Update README.md

9807fad

Update README.md

273f26e

Update README.md

ec95345

Update README.md

0be2181

Update README.md

473a928

Update README.md

30411f5

Update README.md

31ee76e

Deleted Windows-specific dependencies

fdf6cdd

DanaY326 wants to merge 10 commits into adanomad:main from DanaY326:main
