Complete vector search implementation#25
Open
DanaY326 wants to merge 10 commits into
This PR completes the PDF viewer so that users can upload multiple PDFs and run vector search over the text and images they contain. Key additions include image extraction, creation and storage of vector embeddings, storage of data for multiple PDFs and all their content, UI improvements, and balanced vector search.
Image extraction isn't readily available in JavaScript libraries. To extract images, I used PDF.js to obtain each PDF's operator list, the sequence of drawing instructions read by software that displays PDFs, and scanned it for the dimensions and contents of the images in the PDF. The images can then be displayed and embedded, and located precisely so that they can be highlighted.
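As a rough illustration of the operator-list scan described above, here is a minimal sketch, not the code in this PR. It assumes the operator list has the `fnArray`/`argsArray` shape that PDF.js returns from `page.getOperatorList()`, that the opcodes would be taken from `pdfjsLib.OPS` (`save`, `restore`, `transform`, `paintImageXObject`), and that transform arguments arrive as a plain six-number array. The names `findImages`, `OpCodes`, and `ImagePlacement` are hypothetical, and nested form XObjects are ignored.

```typescript
// A PDF affine transform [a, b, c, d, e, f].
type Matrix = [number, number, number, number, number, number];

// Hypothetical opcode table; with PDF.js these would presumably be
// OPS.save, OPS.restore, OPS.transform and OPS.paintImageXObject.
interface OpCodes {
  save: number;
  restore: number;
  transform: number;
  paintImage: number;
}

// Assumed shape of an operator list: parallel arrays of opcodes and arguments.
interface OperatorList {
  fnArray: number[];
  argsArray: unknown[];
}

// Hypothetical result: an image name plus its bounding box in PDF user space.
interface ImagePlacement {
  name: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

// Compose two transforms so that m2 is applied first, then m1.
function multiply(m1: Matrix, m2: Matrix): Matrix {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

function findImages(list: OperatorList, ops: OpCodes): ImagePlacement[] {
  let ctm: Matrix = [1, 0, 0, 1, 0, 0]; // current transformation matrix
  const stack: Matrix[] = [];
  const found: ImagePlacement[] = [];

  list.fnArray.forEach((fn, i) => {
    const args = (list.argsArray[i] ?? []) as unknown[];
    if (fn === ops.save) {
      stack.push(ctm);
    } else if (fn === ops.restore) {
      ctm = stack.pop() ?? ctm;
    } else if (fn === ops.transform) {
      ctm = multiply(ctm, args as Matrix);
    } else if (fn === ops.paintImage) {
      // An image is painted into the unit square, so mapping the square's
      // corners through the current matrix gives its position on the page.
      const [a, b, c, d, e, f] = ctm;
      const xs = [e, a + e, c + e, a + c + e];
      const ys = [f, b + f, d + f, b + d + f];
      const x = Math.min(...xs);
      const y = Math.min(...ys);
      found.push({
        name: String(args[0]), // assumed: first argument is the image object id
        x,
        y,
        width: Math.max(...xs) - x,
        height: Math.max(...ys) - y,
      });
    }
  });
  return found;
}
```

The boxes are in PDF user space, so they would still need to be converted to viewport coordinates before being drawn as highlights.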
Once extracted, the images and text needed to be embedded in a shared space where they could be compared with each other. Many models can compare text with text or images with images, but it was vital to be able to do both. I chose the CLIP model from OpenAI because it is open source, comes from a well-respected company, and worked well in an interactive demo I found.
However, while CLIP works well for searching through only images or only text, I eventually discovered that when searching through both together, it heavily favors text results, even over images that are clearly more relevant. I therefore tweaked the cosine similarity search to balance this by applying a "handicap" to the text results. Further investigation would help determine whether this is an inherent limitation of joint text and image embedding models, or whether a different embedding model would work better. For now, one can search semantically through the text and images in multiple PDFs and have the results ordered by relevance.
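The balancing idea can be sketched as follows. This is an illustrative sketch rather than the code in this PR: the names `EmbeddedItem` and `balancedSearch`, the subtractive form of the handicap, and its default value of 0.1 are all assumptions made for the example.

```typescript
type Kind = "text" | "image";

// Hypothetical shape of a stored embedding.
interface EmbeddedItem {
  id: string;
  kind: Kind;
  vector: number[];
}

interface ScoredItem {
  id: string;
  kind: Kind;
  score: number;
}

// Standard cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const ai = a[i] ?? 0;
    const bi = b[i] ?? 0;
    dot += ai * bi;
    normA += ai * ai;
    normB += bi * bi;
  }
  if (normA === 0 || normB === 0) {
    return 0; // treat a zero vector as having no similarity
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank items against the query, subtracting an assumed fixed handicap from
// text scores so that relevant images are not crowded out by text results.
function balancedSearch(
  query: number[],
  items: EmbeddedItem[],
  textHandicap = 0.1, // assumed value; would need tuning on real data
): ScoredItem[] {
  return items
    .map((item) => ({
      id: item.id,
      kind: item.kind,
      score:
        cosineSimilarity(query, item.vector) -
        (item.kind === "text" ? textHandicap : 0),
    }))
    .sort((x, y) => y.score - x.score);
}
```

A multiplicative penalty or per-type score normalization would be alternative ways to achieve the same balance. Whichever form is used, the handicap has to be tuned empirically against real queries.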
Some other additions include:
Working with an existing codebase was useful because much of the code (highlight retrieval, highlight storage, the PDF upload and search bars) could be adapted for core functionality. It took time to understand at the start, but overall the tradeoff was worthwhile. For this implementation I deleted most of the unused code. This should make the code less confusing for future programmers, and any code we want to restore can be recovered from the Git history.
Possible future improvements: