PDF-Highlight-OA submission - Victor Huang by vichua2006 · Pull Request #23 · adanomad/pdf-highlight-oa

vichua2006 · 2025-07-05T20:49:03Z

Setup Instructions!!

Clone the repository
Install dependencies: pnpm install
Set up environment variables: cp .env.example .env
Run the development server: pnpm run dev
Open http://localhost:3000 in your browser

Implementation 🔧

features implemented:

designed db schema to store pdf metadata and embeddings
endpoint to upload & store pdf in cloud (supabase bucket)
extracting and embedding pdf text (per page) and storing them in db
endpoint (only) to search embeddings based on vector similarity

features not implemented 😥:

creating embeddings for images
search bar to connect to vector similarity endpoint
sidebar to display query result & serve stored PDF to user when clicked
highlighting & improving OCR

My Ideal Approach to the Challenge

Start wayyyy earlier (100% my own fault, didn't plan well for exams)
Extending existing application
- The existing viewer and OCR pipeline stay intact; new code is added in modular endpoints/components.

Persist PDFs + embeddings on Supabase

Supabase Bucket: pdfs/ (public) stores raw files.

Postgres schema

CREATE EXTENSION IF NOT EXISTS "vector";

CREATE TABLE pdf (
  id           SERIAL PRIMARY KEY,
  filename     TEXT NOT NULL,
  pages        INT,
  content_hash TEXT,
  uploaded_at  TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE embedding (
  id        SERIAL PRIMARY KEY,
  doc_id    INT REFERENCES pdf(id) ON DELETE CASCADE,
  page_no   INT NOT NULL CHECK (page_no > 0),
  chunk_no  SMALLINT DEFAULT 0,
  text      TEXT,
  text_vec  VECTOR(1536) NOT NULL
);

CREATE INDEX embedding_text_vec_ivf
  ON embedding USING ivfflat (text_vec vector_cosine_ops) WITH (lists = 100);

CREATE INDEX embedding_doc_page_chunk
  ON embedding (doc_id, page_no, chunk_no);

text_vec = OpenAI text-embedding-3-small (1536 dims).

Embeddings: page-level now, smaller chunks later
- One embedding per page (denoted as chunk_no = 0).
- Finer-grained embeddings can be added later as extra rows (chunk_no = 1, 2, ...) or backfilled if needed; no schema change required.
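A minimal sketch of how page-level rows could be shaped before insertion, assuming the page text and its vector are already available. The names `EmbeddingRow`, `EmbeddedPage`, and `buildPageRows` are hypothetical and not taken from the PR; only the column names and the 1536-dimension size come from the schema above.

```typescript
// Hypothetical row shape mirroring the columns of the embedding table.
interface EmbeddingRow {
  doc_id: number;
  page_no: number; // 1-based, to satisfy CHECK (page_no > 0)
  chunk_no: number; // 0 marks a whole-page embedding
  text: string;
  text_vec: number[];
}

// Hypothetical input: one entry per page, in page order.
interface EmbeddedPage {
  text: string;
  vector: number[];
}

const EMBEDDING_DIMS = 1536; // matches VECTOR(1536) in the schema

// Builds one row per page with chunk_no = 0.
// Finer chunks could later reuse the same shape with chunk_no = 1, 2, ...
function buildPageRows(docId: number, pages: EmbeddedPage[]): EmbeddingRow[] {
  return pages.map((page, index) => {
    if (page.vector.length !== EMBEDDING_DIMS) {
      throw new Error(
        `page ${index + 1}: expected ${EMBEDDING_DIMS} dims, got ${page.vector.length}`
      );
    }
    return {
      doc_id: docId,
      page_no: index + 1,
      chunk_no: 0,
      text: page.text,
      text_vec: page.vector,
    };
  });
}
```

The dimension check is only there to fail early in application code; the VECTOR(1536) column would reject a mismatched vector anyway.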
Upload flow
1. Browser uploads PDF → Storage bucket and inserts into pdf table
2. Server function parses pdf asynchronously, embedding text and inserting into embedding table
- Embedding runs asynchronously so the response is faster: the user only has to wait for the PDF upload, not for parsing and embedding.
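The upload flow above can be sketched roughly as follows. This is not the PR's actual handler: `UploadDeps`, `handleUpload`, and every dependency name are hypothetical stand-ins for the storage, database, and embedding calls, injected so the ordering is easy to see.

```typescript
// Hypothetical dependencies; the real code would talk to Supabase and PDF.js.
interface UploadDeps {
  storeFile(filename: string, bytes: Uint8Array): Promise<string>; // returns public URL
  insertPdfRow(filename: string): Promise<number>; // returns pdf.id
  embedPdf(docId: number): Promise<void>; // parse + embed + insert rows
  onEmbedError(docId: number, error: unknown): void;
}

interface UploadResult {
  docId: number;
  publicUrl: string;
}

async function handleUpload(
  filename: string,
  bytes: Uint8Array,
  deps: UploadDeps
): Promise<UploadResult> {
  // Steps the user waits for: store the file, then record it in the pdf table.
  const publicUrl = await deps.storeFile(filename, bytes);
  const docId = await deps.insertPdfRow(filename);

  // Embedding is started but deliberately not awaited, so the response
  // returns as soon as the upload is recorded. Failures are reported
  // through onEmbedError instead of rejecting the upload.
  void deps.embedPdf(docId).catch((error: unknown) => deps.onEmbedError(docId, error));

  return { docId, publicUrl };
}
```

The trade-off is that the caller gets no signal when embedding finishes, so a status column or a retry path would probably be needed in a fuller version.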
Search flow
- Front-end takes user query → calls endpoint → calls postgres function search_embedding(q,k) which executes
  ORDER BY text_vec <=> q LIMIT k using pgvector's approximate-nearest-neighbor (ivfflat) index.
- Returns {doc_id, page_no, score}
- Not implemented beyond this point:
  - Results → sidebar list.
  - Grab public URL from db and jump to the page (#page=n in PDF.js).
  - URL kept public for dev purposes, would ideally switch over to signed URL later.
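To make the search flow concrete, here is a small in-memory sketch of what the ORDER BY text_vec <=> q LIMIT k query computes. In pgvector, <=> is cosine distance (1 minus cosine similarity). This version is an exact brute-force scan rather than the approximate ivfflat lookup, and it assumes that score means cosine similarity, which the PR does not state. All function and type names are hypothetical.

```typescript
interface StoredEmbedding {
  doc_id: number;
  page_no: number;
  text_vec: number[];
}

interface SearchHit {
  doc_id: number;
  page_no: number;
  score: number; // assumption: cosine similarity, higher is better
}

// Cosine distance, the quantity pgvector's <=> operator returns.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Choice made for this sketch only: treat a zero vector as maximally distant.
  if (normA === 0 || normB === 0) return 1;
  return 1 - dot / Math.sqrt(normA * normB);
}

// Exact equivalent of ORDER BY text_vec <=> q LIMIT k over an in-memory list.
function searchEmbedding(q: number[], rows: StoredEmbedding[], k: number): SearchHit[] {
  return rows
    .map((row) => ({ row, distance: cosineDistance(row.text_vec, q) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k)
    .map(({ row, distance }) => ({
      doc_id: row.doc_id,
      page_no: row.page_no,
      score: 1 - distance,
    }));
}

// Builds the link the sidebar would open; PDF.js reads #page=n from the URL.
function pageUrl(publicUrl: string, pageNo: number): string {
  return `${publicUrl}#page=${pageNo}`;
}
```

With the ivfflat index, Postgres may skip some rows, so results can differ slightly from this exact scan.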
Future improvements
- Add CLIP image embeddings later (extra column in the embedding table) and apply weighted scoring for search query.
- content_hash allows deduplication and skip-processing.
- bonuses
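One way the weighted scoring for text plus image embeddings might look, as a sketch only. The function name `combinedScore`, the default weight of 0.7, and the fallback for pages without an image embedding are all assumptions; the PR does not specify any of them.

```typescript
// Blends a text similarity score with an optional image similarity score.
// textWeight is the share given to the text score; the rest goes to the image score.
function combinedScore(
  textScore: number,
  imageScore: number | null,
  textWeight: number = 0.7 // hypothetical default, not from the PR
): number {
  if (textWeight < 0 || textWeight > 1) {
    throw new Error("textWeight must be between 0 and 1");
  }
  // Assumption: a page with no image embedding is ranked on text alone.
  if (imageScore === null) return textScore;
  return textWeight * textScore + (1 - textWeight) * imageScore;
}
```

For example, `combinedScore(0.8, 0.4, 0.75)` weights text at 75% and image at 25%.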

Challenges

properly understanding the codebase and designing the solution (was particularly confused by client-side). Overcame it by listing out MVP that I need to build, and focused only on the parts that touched that list.
getting pdfjs to work properly on server-side, was an annoying version issue, found solution via StackOverflow (and a looot of whacking at it with AI)
querying vector similarity with supabase correctly. For some reason the standard syntax didn't work, had to resort to defining an extra postgres function for the query.
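For the last point, a hedged sketch of calling the extra Postgres function from TypeScript. The `RpcClient` interface below is a hypothetical minimal shape modeled on the `rpc(name, params)` call in supabase-js, which resolves to an object with `data` and `error`; it is declared locally so the example does not depend on the library. The function name `search_embedding` and its parameters `q` and `k` come from the search flow described earlier; `searchViaRpc` and `isSearchHit` are made-up names.

```typescript
interface SearchHit {
  doc_id: number;
  page_no: number;
  score: number;
}

// Hypothetical minimal client shape, modeled on supabase-js's rpc() result.
interface RpcResult {
  data: unknown;
  error: { message: string } | null;
}

interface RpcClient {
  rpc(fn: string, args: Record<string, unknown>): PromiseLike<RpcResult>;
}

// Runtime check, since the shape of rows coming back from the database is not guaranteed.
function isSearchHit(value: unknown): value is SearchHit {
  if (typeof value !== "object" || value === null) return false;
  const row = value as Record<string, unknown>;
  return (
    typeof row["doc_id"] === "number" &&
    typeof row["page_no"] === "number" &&
    typeof row["score"] === "number"
  );
}

// Calls search_embedding(q, k) and returns only well-formed rows.
async function searchViaRpc(client: RpcClient, q: number[], k: number): Promise<SearchHit[]> {
  const { data, error } = await client.rpc("search_embedding", { q, k });
  if (error) throw new Error(`search_embedding failed: ${error.message}`);
  if (!Array.isArray(data)) return [];
  return data.filter(isSearchHit);
}
```

Keeping the ORDER BY text_vec <=> q logic inside the database function means the client never has to express the vector operator itself.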

Summary

I really enjoyed the challenge! It was super fun to design the db schema with extra considerations for embedding type and granularity (tho perhaps premature lol). Also finally got to use Supabase for a project, and honestly it's pretty sweet that it has its own storage bucket AND supports pgvector. Would really have loved to finish the UI components & the highlight storage, but I started a bit late 😔. Overall I learned a whole lot from this challenge, and would love to do more of it 😁

Testing

unfortunately semantic search didn't make it to the UI, but after uploading a couple pdfs, could be tested as such:
using Postman/curl/etc, query http://localhost:3000/api/search with

{
    "query": "the phrase you want to search semantically",
    "maxCount":  <positive integer, maximum query results>
}
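If scripting the check is easier than Postman, the request could be assembled like this. The URL and the body fields `query` and `maxCount` come from the instructions above; the POST method, the JSON content type, and the name `buildSearchRequest` are assumptions, since the PR does not state how the endpoint expects to be called.

```typescript
interface SearchRequest {
  url: string;
  method: "POST"; // assumption: the endpoint accepts a JSON body via POST
  headers: Record<string, string>;
  body: string;
}

// Validates the inputs and builds the request for /api/search.
function buildSearchRequest(
  query: string,
  maxCount: number,
  baseUrl: string = "http://localhost:3000"
): SearchRequest {
  if (query.trim().length === 0) {
    throw new Error("query must not be empty");
  }
  if (!Number.isInteger(maxCount) || maxCount <= 0) {
    throw new Error("maxCount must be a positive integer");
  }
  return {
    url: `${baseUrl}/api/search`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query, maxCount }),
  };
}
```

The result can be passed to fetch as `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`.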

Screenshots

pdf table (Seven Databases in Seven Weeks successfully uploaded and indexed 🎉 ):

embedding table:

Embedding query w postman:


vichua2006 and others added 16 commits July 4, 2025 19:05

chore: light theme for darker font color

e0b0ddd

chore: removed .env from tracking

81028dd

chore: update readme

0d65233

feat: basic embedding utils

223ac63

feat: basic embedding

77e850e

feat: designed schema and applied supabase migration

f64bf15

feat: upload pdf w public url

0d4d5af

feat: async embedding generation

c436668

fix: configure pdfjs on server side

96eb887

chore: using non-ocr'ed pdf for display; apply ocr during embedding l…

eb41513

…ater

chore: comments

d108e28

Merge pull request #1 from vichua2006/feat/upload-embedding

40eb193

Feat: upload pdf and embedding to cloud

feat: search via embedding endpoint

0a36f58

chore: debugging embedding routes, weird bug where nothing shows up

d3ce7ee

Merge pull request #2 from vichua2006/feat/search-embedding

dfd8fe2

Feat/search embedding

fix: adjusted weighting

8c06490

vichua2006 wants to merge 16 commits into adanomad:main from vichua2006:main
