PDF-Highlight-OA submission - Victor Huang #23

Open
vichua2006 wants to merge 16 commits into adanomad:main from vichua2006:main

@vichua2006 vichua2006 commented Jul 5, 2025

Setup Instructions!!

  1. Clone the repository
  2. Install dependencies: pnpm install
  3. Set up environment variables: cp .env.example .env
  4. Run the development server: pnpm run dev
  5. Open http://localhost:3000 in your browser

Implementation 🔧

features implemented:

  • designed db schema to store pdf metadata and embeddings
  • endpoint to upload & store pdf in cloud (supabase bucket)
  • extracting and embedding pdf text (per page) and storing them in db
  • endpoint (only) to search embeddings based on vector similarity

features not implemented 😥:

  • creating embeddings for images
  • search bar to connect to vector similarity endpoint
  • sidebar to display query result & serve stored PDF to user when clicked
  • highlighting & improving OCR

My Ideal Approach to the Challenge

  1. Start wayyyy earlier (100% my own fault, didn't plan well for exams)

  2. Extending existing application

    • The existing viewer and OCR pipeline stay intact; new code is added in modular endpoints/components.
  3. Persist PDFs + embeddings on Supabase

    • Supabase Bucket: pdfs/ (public) stores raw files.
    • Postgres schema
      CREATE EXTENSION IF NOT EXISTS "vector";
      
      CREATE TABLE pdf (
        id           SERIAL PRIMARY KEY,
        filename     TEXT NOT NULL,
        pages        INT,
        content_hash TEXT,
        uploaded_at  TIMESTAMPTZ DEFAULT now()
      );
      
      CREATE TABLE embedding (
        id        SERIAL PRIMARY KEY,
        doc_id    INT REFERENCES pdf(id) ON DELETE CASCADE,
        page_no   INT NOT NULL CHECK (page_no > 0),
        chunk_no  SMALLINT DEFAULT 0,
        text      TEXT,
        text_vec  VECTOR(1536) NOT NULL
      );
      
      CREATE INDEX embedding_text_vec_ivf
        ON embedding USING ivfflat (text_vec vector_cosine_ops) WITH (lists = 100);
      
      CREATE INDEX embedding_doc_page_chunk
        ON embedding (doc_id, page_no, chunk_no);
      • text_vec = OpenAI text-embedding-3-small (1536 dims, matching VECTOR(1536) above).
  4. Embeddings: page-level now, smaller chunks later

    • One embedding per page (denoted as chunk_no = 0).
    • Finer granularity can be added later as extra rows (chunk_no = 1, 2, ...) and backfilled if needed; no schema change is required.
  5. Upload flow

    1. Browser uploads the PDF to the Storage bucket and inserts a row into the pdf table.
    2. A server function parses the PDF asynchronously, embeds the text, and inserts rows into the embedding table.
    • Parsing is async so the response is faster: users only have to wait for the PDF upload itself.
  6. Search flow

    • Front-end takes the user query → calls the endpoint → calls the Postgres function search_embedding(q, k), which executes
      ORDER BY text_vec <=> q LIMIT k using the pgvector approximate-nearest-neighbor index.
    • Returns {doc_id, page_no, score}
    • Not implemented beyond this point:
      • Results are shown in a sidebar list.
      • Clicking a result grabs the public URL from the db and jumps to the page (#page=n in PDF.js).
        • URL kept public for dev purposes; would ideally switch over to a signed URL later.
  7. Future improvements

    • Add CLIP image embeddings later (extra column in the embedding table) and apply weighted scoring for search query.
    • content_hash allows deduplication and skip-processing.
    • bonuses
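
To make the ranking in step 6 concrete, here is a minimal sketch of what ORDER BY text_vec <=> q LIMIT k computes, written as an exact brute-force scan in TypeScript. It is not the code in this PR: the row and hit types, and the choice to report cosine distance as the score, are assumptions made for illustration. The real query runs inside Postgres and uses the ivfflat index, which is approximate.

```typescript
// Hypothetical row shape: mirrors the embedding table columns the search needs.
interface EmbeddingRow {
  doc_id: number;
  page_no: number;
  text_vec: number[];
}

// Hypothetical result shape; score is assumed to be cosine distance
// (0 = same direction, 1 = orthogonal, 2 = opposite).
interface SearchHit {
  doc_id: number;
  page_no: number;
  score: number;
}

// Cosine distance, the quantity pgvector's <=> operator returns.
// Note: a zero-length vector yields NaN here; real code should guard against it.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Exact version of: ORDER BY text_vec <=> q LIMIT k
function searchEmbedding(rows: EmbeddingRow[], q: number[], k: number): SearchHit[] {
  return rows
    .map((r) => ({
      doc_id: r.doc_id,
      page_no: r.page_no,
      score: cosineDistance(r.text_vec, q),
    }))
    .sort((x, y) => x.score - y.score) // smallest distance first
    .slice(0, k);
}
```

Because every page is one row with chunk_no = 0, the same scan would keep working if finer chunks were added later; the hits would just need to be grouped by (doc_id, page_no).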

Challenges

  • properly understanding the codebase and designing the solution (was particularly confused by the client side). Overcame it by listing out the MVP I needed to build and focusing only on the parts that touched that list.
  • getting pdfjs to work properly on the server side. It was an annoying version issue; found the solution via StackOverflow (and a looot of whacking at it with AI)
  • querying vector similarity with supabase correctly. For some reason the standard syntax didn't work, had to resort to defining an extra postgres function for the query.
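
For the last point, a rough sketch of how an endpoint could call the extra Postgres function through an rpc-style client. The function name search_embedding and its parameters q and k come from the description above; everything else (the RpcClient interface, buildSearchArgs, searchPages, and the SearchHit shape) is hypothetical and not taken from this PR. The client is typed structurally so the sketch stays self-contained; a real supabase-js client may need a cast to fit it.

```typescript
// Minimal structural type for the single client method used below (assumption).
interface RpcClient {
  rpc(
    fn: string,
    args: Record<string, unknown>
  ): PromiseLike<{ data: unknown; error: { message: string } | null }>;
}

// Hypothetical result row, following the {doc_id, page_no, score} shape above.
interface SearchHit {
  doc_id: number;
  page_no: number;
  score: number;
}

// Validate the arguments before they reach the database.
function buildSearchArgs(q: number[], k: number): { q: number[]; k: number } {
  if (q.length === 0) {
    throw new Error("query embedding is empty");
  }
  if (!Number.isInteger(k) || k <= 0) {
    throw new Error("k must be a positive integer");
  }
  return { q, k };
}

// Call search_embedding(q, k) and surface database errors as exceptions.
async function searchPages(client: RpcClient, q: number[], k: number): Promise<SearchHit[]> {
  const { data, error } = await client.rpc("search_embedding", buildSearchArgs(q, k));
  if (error) {
    throw new Error(`search_embedding failed: ${error.message}`);
  }
  // The row shape is trusted here; a stricter version would validate it.
  return (data ?? []) as SearchHit[];
}
```

Keeping the similarity query inside one Postgres function also means the endpoint never has to build the <=> expression itself.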

Summary

I really enjoyed the challenge! It was super fun to design the db schema with extra considerations for embedding type and granularity (tho perhaps premature lol). Also finally got to use Supabase for a project, and honestly it's pretty sweet that it has its own bucket AND supports pgvector. Would really have loved to finish the UI components & the highlighting storage but I started a bit late 😔. Overall learned a whole lot from this challenge, and would love to do more of it 😁

Testing

Unfortunately semantic search didn't make it to the UI, but after uploading a couple of PDFs it can be tested like this:
using Postman/curl/etc., query http://localhost:3000/api/search with

{
    "query": "the phrase you want to search semantically",
    "maxCount":  <positive integer, maximum query results>
}
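
As an alternative to Postman, the same request could be sent from a small script. This is only a sketch: the body fields query and maxCount are the ones listed above, but the HTTP method (POST), the JSON content type, and the helper names buildSearchRequest and callSearch are assumptions, and the response is left untyped because its exact shape isn't documented here.

```typescript
// Request body accepted by /api/search, per the description above.
interface SearchRequest {
  query: string;
  maxCount: number;
}

// Build and validate the body before sending it.
function buildSearchRequest(query: string, maxCount: number): SearchRequest {
  const trimmed = query.trim();
  if (trimmed.length === 0) {
    throw new Error("query must not be empty");
  }
  if (!Number.isInteger(maxCount) || maxCount <= 0) {
    throw new Error("maxCount must be a positive integer");
  }
  return { query: trimmed, maxCount };
}

// Send the request; POST with a JSON body is an assumption about the endpoint.
async function callSearch(baseUrl: string, body: SearchRequest): Promise<unknown> {
  const res = await fetch(`${baseUrl}/api/search`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) {
    throw new Error(`search failed with HTTP ${res.status}`);
  }
  return res.json();
}
```

With the dev server running, callSearch("http://localhost:3000", buildSearchRequest("graph databases", 5)) should resolve to whatever the endpoint returns.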

Screenshots

pdf table (Seven Databases in Seven Weeks successfully uploaded and indexed 🎉 ):
image
embedding table:
image
Embedding query w postman:
image
