spike: evaluate SQLite vector storage for desktop mode #310

Open
Monica-CodingWorld wants to merge 11 commits into Abhash-Chakraborty:main from Monica-CodingWorld:feat/sqlite-vec-evaluation

Conversation

Monica-CodingWorld commented Jun 14, 2026
Contributor

Summary

Adds a SQLite + sqlite-vec proof of concept to evaluate whether SQLite can support Find's metadata storage and vector-search requirements for desktop mode without changing the current PostgreSQL + pgvector runtime.

This implementation creates a small evaluation layer that demonstrates metadata storage, vector insertion, similarity search, and gallery-style queries using sqlite-vec.

Fixes #256

Type of change

  • Bug fix
  • Feature
  • Documentation update
  • Refactor
  • CI / tooling

What changed

  • Added sqlite_vec_poc.py to evaluate SQLite + sqlite-vec functionality.
  • Implemented schema creation for metadata and vector storage.
  • Added media insertion with 768-dimensional embeddings.
  • Implemented vector similarity search using sqlite-vec.
  • Added gallery query support for metadata retrieval.
  • Added test coverage validating schema creation, vector insertion, similarity search, and gallery query behavior.

Screenshots / recordings (for UI changes)

Attach before/after screenshots or a short video.

How to test

Install sqlite-vec

pip install sqlite-vec

Run the proof-of-concept tests

pytest tests/test_sqlite_vec_poc.py -v

Manual validation performed:

  • Created SQLite database and schema successfully.
  • Inserted 768-dimensional vectors.
  • Executed similarity search queries.
  • Verified gallery metadata retrieval.
  • Confirmed nearest-neighbor ordering from sqlite-vec search results.
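
One way to sanity-check the nearest-neighbor ordering listed above is to compare it against a brute-force reference in plain Python. The sketch below is an illustration under assumptions, not code from this PR: `l2_distance` and `brute_force_knn` are hypothetical helpers, and it assumes Euclidean (L2) distance, which should be confirmed against how the vec0 table is actually configured.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_knn(query, vectors, k):
    """Return the k closest (media_id, distance) pairs, nearest first.

    `vectors` maps media_id -> embedding (hypothetical in-memory stand-in
    for the rows stored in media_vectors).
    """
    scored = [(media_id, l2_distance(query, emb)) for media_id, emb in vectors.items()]
    # Sort by distance, then by id so ties are deterministic.
    scored.sort(key=lambda pair: (pair[1], pair[0]))
    return scored[:k]
```

Running the same query through the database and through `brute_force_knn` on a handful of small vectors should produce the same id ordering if the distance assumption holds.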

Checklist

  • I linked the related issue
  • I ran required checks from CONTRIBUTING.md
  • I updated docs/env notes if needed
  • My PR is scoped to a single issue
  • I followed commit message conventions
  • I am not committing secrets or local artifacts

GSSoC'26 checklist

  • I requested issue assignment before starting
  • I have meaningful commits (no spam commits)
  • I am ready to explain my implementation in review comments

Summary by CodeRabbit

  • New Features

    • Added a SQLite-based proof-of-concept for storing media embeddings and performing nearest-neighbor similarity search, returning matching media results (including distance) and supporting gallery retrieval.
  • Tests

    • Improved test handling for the optional sqlite-vec dependency, including skip behavior when unavailable and a clear error assertion when it’s missing.
  • Documentation

    • Updated the ADR with a new section describing the SQLite vector spike results and current limitations/follow-ups.

github-actions Bot commented Jun 14, 2026

PR Context Summary

Suggested issue links

  • No strong issue match found yet.

Use Fixes #123 or Closes #123 in the PR body when one of the suggestions is the intended issue.
Manual rerun: Actions > PR Context Triage > Run workflow > set pr_number and force_review=true.

coderabbitai Bot commented Jun 14, 2026
Contributor

Walkthrough

Adds a new proof-of-concept module sqlite_vec_poc.py implementing SQLite-backed vector storage using the sqlite-vec extension. The module provides a connection factory with optional dependency handling, schema creation (media table + vec0 virtual table), media/vector insertion helpers, two similarity search functions, and a high-level wrapper class. A companion test file with dependency detection validates all behaviors through four pytest tests. The desktop runtime ADR is updated with spike result documentation.

Changes

SQLite Vector Storage POC

| Layer / File(s) | Summary |
|---|---|
| Connection factory and schema initialization (backend/src/find_api/core/sqlite_vec_poc.py) | create_connection enables SQLite extension loading and initializes sqlite_vec, raising a descriptive RuntimeError if the dependency is missing. create_schema creates a media table and a media_vectors vec0 virtual table parameterized by embedding_dim. |
| Media and vector insertion helpers (backend/src/find_api/core/sqlite_vec_poc.py) | insert_vector packs a float list into a binary blob and inserts into media_vectors; insert_media inserts a row into media with status defaulting to "indexed"; count_vectors returns a scalar row count from media_vectors. |
| Similarity search operations (backend/src/find_api/core/sqlite_vec_poc.py) | search_vectors runs a WHERE embedding MATCH query on media_vectors returning (media_id, distance); search_media joins media_vectors to media and returns (id, filename, distance) ordered by vector distance. |
| High-level wrapper class API (backend/src/find_api/core/sqlite_vec_poc.py) | SQLiteVecPOC class wraps module functions: stores a connection, exposes create_schema, combines insert_media calling both media and vector insertion, maps search results to dictionaries, and provides gallery_query to fetch all media ordered by id. |
| Optional dependency detection and error handling (backend/tests/test_sqlite_vec_poc.py) | Dependency detection via importlib.util.find_spec. Pytest fixture sqlite_vec_available skips dependent tests when sqlite-vec is not installed. Test validates that SQLiteVecPOC raises RuntimeError matching "sqlite-vec is required" when the dependency is missing. |
| Functional tests (backend/tests/test_sqlite_vec_poc.py) | Four tests cover schema creation (DB file exists), 768-dim vector insertion and gallery retrieval, two-row similarity search ordering with top-match id validation, and exact gallery query shape including status: "indexed". |
| ADR spike result documentation (docs/plans/not-started/desktop-runtime-adr.md) | ADR metadata updated to 2026-06-19. New section documents PoC code locations, local run instructions, test skip behavior, limitations (no Docker replacement, no migration coverage, incomplete validation of hybrid search/filters/clustering/queue), and guidance for implementation after benchmarking and query abstraction design. |
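
The optional-dependency detection summarized in the table could look roughly like the sketch below. It is a minimal illustration rather than the PR's test code: `has_module` is a hypothetical helper name, and the pytest fixture is shown only in comments so the block stays runnable without pytest installed.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


# Assumed usage inside a pytest module (not executed here):
#
# @pytest.fixture
# def sqlite_vec_available():
#     if not has_module("sqlite_vec"):
#         pytest.skip("sqlite-vec is not installed")
```

Using `find_spec` avoids importing the extension at collection time, so a missing optional package leads to skipped tests rather than an import error.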

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 Hopping through bytes of float and blob,
A vec0 table — what a quest!
I pack my embeddings, small and neat,
And search for matches, quite the feat.
When sqlite-vec is absent, loud I cry,
But desktop SQLite lets Postgres say goodbye! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main change: a proof-of-concept spike to evaluate SQLite vector storage for desktop mode, which directly reflects the PR's core objective. |
| Description check | ✅ Passed | The description follows the template with summary, issue link, change type, detailed changes, test instructions, and completed checklist items. All required sections are present and substantively filled. |
| Linked Issues check | ✅ Passed | The PR fulfills issue #256 acceptance criteria by providing schema creation in SQLite [#256], demonstrating vector insert/search for 768-dimensional embeddings [#256], validating gallery/search query patterns [#256], documenting limitations in the ADR [#256], and preserving PostgreSQL defaults [#256]. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to the SQLite + sqlite-vec proof-of-concept: new POC module, comprehensive tests, and documentation updates to the desktop runtime ADR. No unrelated modifications detected. |




macroscopeapp Bot commented Jun 14, 2026
Contributor

Approvability

Verdict: Needs human review

Unable to check for correctness in 38d0d26. This PR introduces a new SQLite vector storage proof-of-concept, which constitutes new feature capability. An unresolved review comment flags a potential SQL injection concern in the schema creation function where embedding_dim is interpolated into SQL without validation.

No code changes detected at 66b4b42. Prior analysis still applies.


@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/src/find_api/core/sqlite_vec_poc.py`:
- Around line 1-129: The entire file sqlite_vec_poc.py has formatting issues
that violate the project's Ruff code style standards. Run the Ruff formatter on
this file using the command `uv run ruff format
backend/src/find_api/core/sqlite_vec_poc.py` to automatically fix all formatting
violations, then commit the formatted result to resolve the CI blocking issue.
- Around line 18-33: The create_schema function interpolates embedding_dim
directly into SQL using an f-string without validation. Since Python type hints
are not enforced at runtime, invalid input could produce malformed DDL. Validate
that embedding_dim is a positive integer before the SQL statement construction,
and raise an appropriate exception (such as ValueError) if the value is invalid,
negative, or not an integer. This ensures the SQL statement is always
well-formed and prevents potential SQL injection or schema corruption issues.

In `@backend/tests/test_sqlite_vec_poc.py`:
- Around line 1-4: The test file imports SQLiteVecPOC class and EMBEDDING_DIM
constant from the implementation, but these do not exist in
backend/src/find_api/core/sqlite_vec_poc.py which currently only provides
function-based helpers. Either add the SQLiteVecPOC class wrapper and
EMBEDDING_DIM constant to the implementation file to match the test's
class-based API expectations (supporting methods like gallery_query and search),
or rewrite all test code throughout the file (lines 10-75) to use the existing
function-based helpers instead of class methods. Choose one approach and ensure
the imports at the top and all test method calls throughout the file are
consistent with whichever API design you select.
- Around line 1-81: The test file contains formatting drift that is causing Ruff
checks to fail. Run the Ruff formatter to automatically fix all formatting
issues in the test file (which contains the test functions test_schema_creation,
test_insert_768_dimension_vector, test_similarity_search, and
test_gallery_query_shape). After formatting, re-run the full check suite to
verify that the formatting changes pass all Ruff checks and tests.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a19c0025-e2cc-4aa0-ba33-83e87f73e838

📥 Commits

Reviewing files that changed from the base of the PR and between 858202e and 81ae551.

📒 Files selected for processing (2)
  • backend/src/find_api/core/sqlite_vec_poc.py
  • backend/tests/test_sqlite_vec_poc.py

Comment thread backend/src/find_api/core/sqlite_vec_poc.py Outdated
Comment on lines +18 to +33
def create_schema(conn, embedding_dim: int):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS media (
            id INTEGER PRIMARY KEY,
            filename TEXT NOT NULL,
            status TEXT NOT NULL
        )
    """)

    conn.execute(f"""
        CREATE VIRTUAL TABLE IF NOT EXISTS media_vectors
        USING vec0(
            media_id INTEGER PRIMARY KEY,
            embedding FLOAT[{embedding_dim}]
        )
    """)

Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate embedding_dim before interpolating it into schema SQL.

embedding_dim is inserted directly into SQL text. Because Python hints are not enforced, malformed input can generate invalid DDL (or worse, altered SQL text). Validate/cast to a positive integer before building the statement.

Suggested fix
-def create_schema(conn, embedding_dim: int):
+def create_schema(conn, embedding_dim: int):
+    if not isinstance(embedding_dim, int) or embedding_dim <= 0:
+        raise ValueError("embedding_dim must be a positive integer")
+
     conn.execute("""
         CREATE TABLE IF NOT EXISTS media (
             id INTEGER PRIMARY KEY,
             filename TEXT NOT NULL,
             status TEXT NOT NULL
         )
     """)

As per coding guidelines, "Keep EMBEDDING_DIM aligned with the configured CLIP/SigLIP model and pgvector columns."
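
A slightly stricter variant of the check above can be sketched as follows. This is an illustration under assumptions, not the PR's code: `build_vec_table_ddl` is a hypothetical helper, and the extra `bool` check is there because `isinstance(True, int)` is true in Python. The dimension is part of the DDL text rather than a bound value, so, as far as I am aware, it cannot be passed as a placeholder, which leaves validation before interpolation as the practical guard.

```python
def build_vec_table_ddl(embedding_dim):
    """Return the vec0 DDL string after validating the dimension.

    Hypothetical helper: the table and column names mirror the snippet
    under review, but this function itself is not part of the PR.
    """
    # bool is a subclass of int, so reject it explicitly.
    if (
        isinstance(embedding_dim, bool)
        or not isinstance(embedding_dim, int)
        or embedding_dim <= 0
    ):
        raise ValueError("embedding_dim must be a positive integer")

    return (
        "CREATE VIRTUAL TABLE IF NOT EXISTS media_vectors "
        "USING vec0(media_id INTEGER PRIMARY KEY, "
        f"embedding FLOAT[{embedding_dim}])"
    )
```

Because only a validated `int` ever reaches the f-string, the interpolated text is limited to decimal digits.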

🧰 Tools
🪛 OpenGrep (1.22.0)

[ERROR] 27-33: SQL query built via f-string passed to execute()/executemany(). Use parameterized queries with placeholders instead.

(coderabbit.sql-injection.python-fstring-execute)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/find_api/core/sqlite_vec_poc.py` around lines 18 - 33, The
create_schema function interpolates embedding_dim directly into SQL using an
f-string without validation. Since Python type hints are not enforced at
runtime, invalid input could produce malformed DDL. Validate that embedding_dim
is a positive integer before the SQL statement construction, and raise an
appropriate exception (such as ValueError) if the value is invalid, negative, or
not an integer. This ensures the SQL statement is always well-formed and
prevents potential SQL injection or schema corruption issues.

Sources: Coding guidelines, Linters/SAST tools

Comment thread backend/tests/test_sqlite_vec_poc.py Outdated
Comment thread backend/tests/test_sqlite_vec_poc.py Outdated
@Monica-CodingWorld

Contributor Author

Hi @Abhash-Chakraborty! I updated the branch to address the CodeRabbit feedback by adding the SQLiteVecPOC wrapper and EMBEDDING_DIM, and I also ran Ruff formatting locally on the files changed in this PR.

My branch is currently clean and only contains:

  • backend/src/find_api/core/sqlite_vec_poc.py
  • backend/tests/test_sqlite_vec_poc.py

However, the CI is still failing. I noticed that ruff format --check . reports formatting issues in a few unrelated files already present in the branch history, and I'm not sure whether the workflow is failing because of those or because of something specific to my changes.

Could you please take a quick look and let me know if there's anything else I should update in this PR? Thanks!

Abhash-Chakraborty added labels on Jun 19, 2026:

  • gssoc26: Related to GirlScript Summer of Code 2026.
  • backend: FastAPI, database, storage, and API work
  • architecture: High-level design decisions and technical direction

Abhash-Chakraborty changed the title from "Feat/sqlite vec evaluation" to "spike: evaluate SQLite vector storage for desktop mode" on Jun 19, 2026

Abhash-Chakraborty added labels on Jun 19, 2026:

  • desktop-app: Windows, macOS, and Linux installed app work
  • performance: Speed, startup, memory, image size, and runtime efficiency
  • local-first: Privacy-preserving local runtime and offline behavior
  • research: Needs investigation, comparison, or design proposal before implementation
  • type:performance: Performance improvement PR. GSSoC type bonus: +15 points.
  • level:advanced: GSSoC difficulty level: advanced. Base contributor points: 55.
  • quality:clean: Clean and maintainable PR. GSSoC contributor multiplier: 1.2x.
  • gssoc:approved: Valid GSSoC contribution approved for scoring.
  • ready-to-merge: Fully approved, tested, and cleared for immediate merging.
@github-actions


@macroscope-app review

Please review this PR against its linked issue, local-first privacy rules, and the current Find repo instructions.
Linked issue(s): #256.
Trigger source: label-gated review (ready-to-merge).

@Abhash-Chakraborty Abhash-Chakraborty left a comment

Owner

Reviewed against #256. I fixed the remaining acceptance gaps by documenting the sqlite-vec spike result/limitations in the desktop runtime ADR and making the POC dependency optional-safe so normal backend installs do not fail when sqlite-vec is not installed.

Tested locally:

  • uv run ruff check src/find_api/core/sqlite_vec_poc.py tests/test_sqlite_vec_poc.py
  • uv run pytest tests/test_sqlite_vec_poc.py -q

Local pytest result is expected on this machine: 1 passed, 4 skipped because sqlite-vec is optional and not installed. With sqlite-vec installed, the vector insert/search tests run. This keeps the PR scoped as a desktop/runtime spike and does not switch production away from PostgreSQL + pgvector.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/src/find_api/core/sqlite_vec_poc.py`:
- Line 67: The insert_media() method is not atomic: an orphaned media record can
be left behind if insert_vector() fails after the insert_media() helper has
already committed. Remove the independent conn.commit() calls from both helper
functions (the one at line 67 in the vector insertion helper and the one at line
83 in the media insertion helper). Then wrap both the insert_media() and
insert_vector() calls within the main insert_media() method in a single
transaction context that commits once after both operations complete
successfully, ensuring atomicity.
- Around line 52-57: Add explicit embedding dimension validation to enforce the
EMBEDDING_DIM = 768 contract in three locations: before the struct.pack() call
in insert_vector, and before any embedding processing in search_vectors and
search_media. For each function, verify that the length of the embedding
parameter matches EMBEDDING_DIM and raise a clear ValueError if it does not,
preventing silent dimension mismatches from causing cryptic sqlite-vec errors
downstream.
- Around line 11-25: In the create_connection function, close the database
connection before raising the RuntimeError when ModuleNotFoundError is caught to
prevent resource leaks. Additionally, immediately after calling
sqlite_vec.load(conn), disable extension loading by calling
conn.enable_load_extension(False) to minimize the attack surface and follow the
principle of least privilege.

In `@docs/plans/not-started/desktop-runtime-adr.md`:
- Around line 88-94: The spike running instructions use plain pip install
instead of the project's uv package manager, which can cause the package to be
installed outside the uv environment and lead to false skips. Replace the line
pip install sqlite-vec with uv pip install sqlite-vec to ensure the package is
installed in the proper uv environment and matches the project's package
management setup.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 93d74ea2-7df8-46e8-8bc7-6256e5394d45

📥 Commits

Reviewing files that changed from the base of the PR and between 81ae551 and 0d5d37f.

📒 Files selected for processing (3)
  • backend/src/find_api/core/sqlite_vec_poc.py
  • backend/tests/test_sqlite_vec_poc.py
  • docs/plans/not-started/desktop-runtime-adr.md

Comment on lines +11 to +25
def create_connection(db_path=":memory:"):
    conn = sqlite3.connect(str(db_path))
    conn.enable_load_extension(True)

    try:
        import sqlite_vec
    except ModuleNotFoundError as exc:
        raise RuntimeError(
            "sqlite-vec is required for this desktop-runtime proof of concept. "
            "Install it manually with `pip install sqlite-vec` before running "
            "the sqlite_vec_poc tests."
        ) from exc

    sqlite_vec.load(conn)
    return conn

Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check if the file exists and view its content
cat -n backend/src/find_api/core/sqlite_vec_poc.py

Repository: Abhash-Chakraborty/Find

Length of output: 5602


🏁 Script executed:

# Search for tests related to this POC
fd -t f "test.*sqlite" . --type f
fd -t f ".*test.*" backend/ --type f | grep -i sqlite

Repository: Abhash-Chakraborty/Find

Length of output: 143


🏁 Script executed:

# Check usage of create_connection across the codebase
rg "create_connection" --type py -B 2 -A 2

Repository: Abhash-Chakraborty/Find

Length of output: 774


🏁 Script executed:

# Check the test file to see if create_connection is tested
cat -n backend/tests/test_sqlite_vec_poc.py

Repository: Abhash-Chakraborty/Find

Length of output: 2909


🏁 Script executed:

# Verify if sqlite_vec is actually a dependency by checking requirements/setup
fd -t f "requirements" backend/ --type f
fd -t f "setup.py" . --type f
fd -t f "pyproject.toml" . --type f

Repository: Abhash-Chakraborty/Find

Length of output: 92


🏁 Script executed:

# Check if there are any other places where conn.enable_load_extension is used
rg "enable_load_extension" --type py -B 2 -A 2

Repository: Abhash-Chakraborty/Find

Length of output: 418


🏁 Script executed:

# Check if there's any cleanup logic in SQLiteVecPOC class or if it handles connection closure
rg "close\(\)" backend/src/find_api/core/sqlite_vec_poc.py -B 2 -A 2
rg "def __del__" backend/src/find_api/core/sqlite_vec_poc.py -B 2 -A 5

Repository: Abhash-Chakraborty/Find

Length of output: 49


🏁 Script executed:

# Check pyproject.toml to see if sqlite_vec is a dependency
cat -n backend/pyproject.toml | head -80

Repository: Abhash-Chakraborty/Find

Length of output: 2357


🏁 Script executed:

# Verify best practices for sqlite3 connection error handling
web_search
python sqlite3 connection cleanup best practices error handling

Repository: Abhash-Chakraborty/Find

Length of output: 206


🌐 Web query:

sqlite3 python connection cleanup best practices exception handling

💡 Result:

When managing SQLite connections in Python, it is critical to distinguish between transaction management and resource cleanup. The built-in context manager for sqlite3 connections handles only transactions, not the closing of the database connection itself [1][2][3][4].

Best practices for cleanup and exception handling:

  1. Use contextlib.closing for automatic cleanup: since the standard sqlite3 connection context manager does not call .close(), wrap sqlite3.connect("database.db") in contextlib.closing so the connection is closed automatically when the block exits, regardless of whether an exception occurs [5][6][7][8]. A nested `with conn:` inside that block handles transaction commit/rollback around statements such as conn.execute("INSERT INTO table_name VALUES (?)", (data,)).
  2. Manual cleanup with try/finally: if you do not use contextlib.closing, you must explicitly call .close() in a finally block to guarantee the connection is released, even if an error occurs [9]. Database errors can be caught with `except sqlite3.Error as e` before the finally block calls conn.close().
  3. Understanding transaction management: the context manager used directly on a connection object (e.g., `with conn:`) automatically commits if the block finishes successfully and rolls back if an exception is raised [1][2][4]. It is intended for managing transaction boundaries, not for closing the connection [7].
  4. Key considerations:
     - Do not rely on garbage collection: explicitly closing connections prevents resource leaks and file locking issues [9][8].
     - Catch specific exceptions: use sqlite3.Error or its subclasses (e.g., sqlite3.OperationalError, sqlite3.IntegrityError) to handle database-related issues specifically [1][2][9].
     - Keep connections scoped: open the connection as late as possible and close it as soon as you are done to minimize the time the database file is locked [9][10].
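
The distinction drawn in the search result, between closing a connection and committing a transaction, can be shown in a few lines. This is a minimal sketch using only the standard library; `count_rows` and its single-table schema are hypothetical and exist only for illustration.

```python
import sqlite3
from contextlib import closing


def count_rows(db_path=":memory:"):
    """Insert one row and return the row count, always closing the connection."""
    # closing(...) guarantees conn.close() runs when the block exits.
    with closing(sqlite3.connect(db_path)) as conn:
        # `with conn:` only manages the transaction: commit on success,
        # rollback if an exception is raised inside the block.
        with conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS media "
                "(id INTEGER PRIMARY KEY, filename TEXT NOT NULL)"
            )
            conn.execute("INSERT INTO media (filename) VALUES (?)", ("a.jpg",))
        return conn.execute("SELECT COUNT(*) FROM media").fetchone()[0]
```

The two context managers are nested on purpose: the outer one owns the connection's lifetime, the inner one owns the transaction.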

Citations:


🏁 Script executed:

# Check if sqlite_vec.load actually needs the extension to remain enabled
# Look for any documentation or pattern in the codebase about extension loading
rg "sqlite_vec" --type py -B 3 -A 3

Repository: Abhash-Chakraborty/Find

Length of output: 4569


🏁 Script executed:

# Check if there are any other POC files or similar patterns in the codebase
fd -t f "poc\|proof" --type f

Repository: Abhash-Chakraborty/Find

Length of output: 49


🏁 Script executed:

# Check if sqlite_vec.load needs the extension to stay enabled after loading
# Look for any documentation or check how sqlite-vec typically works
web_search
sqlite-vec python extension loading best practices enable_load_extension

Repository: Abhash-Chakraborty/Find

Length of output: 167


🏁 Script executed:

# Verify the actual impact of the resource leak by examining test more carefully
# and checking if there are any __del__ or context manager implementations
rg "__enter__|__exit__|__del__" backend/src/find_api/core/sqlite_vec_poc.py -B 2 -A 5

Repository: Abhash-Chakraborty/Find

Length of output: 49


Close connection on missing sqlite_vec import and disable extension loading after load.

On ModuleNotFoundError, the opened DB connection is leaked (line 12). Additionally, extension loading should be disabled immediately after sqlite_vec.load(conn) to minimize attack surface.

Suggested patch
 def create_connection(db_path=":memory:"):
     conn = sqlite3.connect(str(db_path))
     conn.enable_load_extension(True)

     try:
         import sqlite_vec
     except ModuleNotFoundError as exc:
+        conn.close()
         raise RuntimeError(
             "sqlite-vec is required for this desktop-runtime proof of concept. "
             "Install it manually with `pip install sqlite-vec` before running "
             "the sqlite_vec_poc tests."
         ) from exc

     sqlite_vec.load(conn)
+    conn.enable_load_extension(False)
     return conn
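
The same two fixes can also be expressed with try/finally so that cleanup runs on any failure, not only a missing import. The sketch below assumes a hypothetical helper, `open_with_extension`, that receives the loader as a callable; it is not part of the PR and it requires a Python build with SQLite extension loading enabled.

```python
import sqlite3


def open_with_extension(db_path, load_extension):
    """Open a connection, run `load_extension(conn)`, then lock loading again.

    `load_extension` is a caller-supplied callable (hypothetical parameter).
    If anything fails, the connection is closed before the error propagates.
    """
    conn = sqlite3.connect(str(db_path))
    try:
        conn.enable_load_extension(True)
        try:
            load_extension(conn)
        finally:
            # Disable extension loading again whether or not the load succeeded.
            conn.enable_load_extension(False)
    except Exception:
        conn.close()
        raise
    return conn
```

For the PoC this might be called as `open_with_extension(db_path, sqlite_vec.load)`, with the ModuleNotFoundError handling kept at the import site.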
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/find_api/core/sqlite_vec_poc.py` around lines 11 - 25, In the
create_connection function, close the database connection before raising the
RuntimeError when ModuleNotFoundError is caught to prevent resource leaks.
Additionally, immediately after calling sqlite_vec.load(conn), disable extension
loading by calling conn.enable_load_extension(False) to minimize the attack
surface and follow the principle of least privilege.

Comment on lines +52 to +57
def insert_vector(
    conn,
    media_id: int,
    embedding: list[float],
):
    blob = struct.pack(f"{len(embedding)}f", *embedding)

Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

wc -l backend/src/find_api/core/sqlite_vec_poc.py

Repository: Abhash-Chakraborty/Find

Length of output: 115


🏁 Script executed:

cat -n backend/src/find_api/core/sqlite_vec_poc.py

Repository: Abhash-Chakraborty/Find

Length of output: 5602


Enforce embedding dimension contract at write/search boundaries (lines 57, 97-100, 123-126).

Current vector functions (insert_vector, search_vectors, search_media) accept any embedding length and pack it without validating against EMBEDDING_DIM = 768, allowing silent dimension mismatches that only fail later with cryptic sqlite-vec errors. Add explicit dimension validation before struct.pack() for deterministic errors and contract safety.

Suggested patch
 EMBEDDING_DIM = 768

+def _validate_embedding_dim(embedding: list[float], expected_dim: int = EMBEDDING_DIM) -> None:
+    if len(embedding) != expected_dim:
+        raise ValueError(
+            f"embedding must have exactly {expected_dim} dimensions, got {len(embedding)}"
+        )
+
@@
 def insert_vector(
@@
 ):
+    _validate_embedding_dim(embedding)
     blob = struct.pack(f"{len(embedding)}f", *embedding)
@@
 def search_vectors(
@@
 ):
+    _validate_embedding_dim(query_embedding)
     blob = struct.pack(
@@
 def search_media(
@@
 ):
+    _validate_embedding_dim(query_embedding)
     blob = struct.pack(

This aligns with the guideline: "Keep EMBEDDING_DIM aligned with the configured CLIP/SigLIP model and pgvector columns."
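
For reference, the validate-then-pack step can be exercised in isolation. The sketch below is a standalone illustration, assuming a hypothetical `pack_embedding` helper that combines the length check with the same native float32 `struct.pack` format used in the PR.

```python
import struct

EMBEDDING_DIM = 768  # mirrors the constant named in the review


def pack_embedding(embedding, expected_dim=EMBEDDING_DIM):
    """Validate the vector length, then pack it as float32 values."""
    if len(embedding) != expected_dim:
        raise ValueError(
            f"embedding must have exactly {expected_dim} dimensions, "
            f"got {len(embedding)}"
        )
    # "f" is a 4-byte float, so the blob is expected_dim * 4 bytes long.
    return struct.pack(f"{expected_dim}f", *embedding)
```

A 768-dimensional vector therefore produces a 3072-byte blob, and a wrong-length vector fails with a clear ValueError before it reaches the database.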

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 def insert_vector(
     conn,
     media_id: int,
     embedding: list[float],
 ):
+    _validate_embedding_dim(embedding)
     blob = struct.pack(f"{len(embedding)}f", *embedding)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/find_api/core/sqlite_vec_poc.py` around lines 52 - 57, Add
explicit embedding dimension validation to enforce the EMBEDDING_DIM = 768
contract in three locations: before the struct.pack() call in insert_vector, and
before any embedding processing in search_vectors and search_media. For each
function, verify that the length of the embedding parameter matches
EMBEDDING_DIM and raise a clear ValueError if it does not, preventing silent
dimension mismatches from causing cryptic sqlite-vec errors downstream.

Source: Coding guidelines

(media_id, blob),
)

conn.commit()

Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "sqlite_vec_poc.py" | head -5

Repository: Abhash-Chakraborty/Find

Length of output: 113


🏁 Script executed:

wc -l backend/src/find_api/core/sqlite_vec_poc.py

Repository: Abhash-Chakraborty/Find

Length of output: 115


🏁 Script executed:

cat -n backend/src/find_api/core/sqlite_vec_poc.py

Repository: Abhash-Chakraborty/Find

Length of output: 5602


Make media+vector insertion atomic to prevent partial writes.

SQLiteVecPOC.insert_media() (lines 156-172) calls insert_media() and insert_vector() sequentially. Both helper functions perform independent commits (lines 83 and 67 respectively). If insert_vector() fails after insert_media() commits, the database is left with an orphaned media record without a corresponding vector.

Wrap both operations in a transaction context to ensure atomicity:

Suggested patch
 def insert_media(
     self,
     media_id,
     filename,
     embedding,
 ):
-    insert_media(
-        self.conn,
-        media_id,
-        filename,
-    )
-
-    insert_vector(
-        self.conn,
-        media_id,
-        embedding,
-    )
+    with self.conn:
+        insert_media(
+            self.conn,
+            media_id,
+            filename,
+        )
+        insert_vector(
+            self.conn,
+            media_id,
+            embedding,
+        )

Also remove conn.commit() calls from the helper functions at lines 67 and 83.
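
The rollback behaviour this relies on can be sketched with the standard-library `sqlite3` module alone. This is an illustration under assumptions, not the PR's code: the table names, the helper names `insert_media_row` / `insert_vector_row` / `insert_media_with_vector`, and the plain BLOB column standing in for a sqlite-vec table are all hypothetical.

```python
import sqlite3
import struct


def insert_media_row(conn, media_id, filename):
    # No commit here: the caller owns the transaction.
    conn.execute(
        "INSERT INTO media (id, filename) VALUES (?, ?)",
        (media_id, filename),
    )


def insert_vector_row(conn, media_id, embedding):
    # struct.pack raises struct.error if an element is not a number.
    blob = struct.pack(f"{len(embedding)}f", *embedding)
    conn.execute(
        "INSERT INTO media_vectors (media_id, embedding) VALUES (?, ?)",
        (media_id, blob),
    )


def insert_media_with_vector(conn, media_id, filename, embedding):
    # The connection context manager commits if the block succeeds and
    # rolls back if either helper raises.
    with conn:
        insert_media_row(conn, media_id, filename)
        insert_vector_row(conn, media_id, embedding)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE media (id INTEGER PRIMARY KEY, filename TEXT)")
conn.execute(
    "CREATE TABLE media_vectors (media_id INTEGER PRIMARY KEY, embedding BLOB)"
)

insert_media_with_vector(conn, 1, "ok.jpg", [0.1, 0.2, 0.3])

try:
    # The vector insert fails, so the media row for id 2 is rolled back too.
    insert_media_with_vector(conn, 2, "bad.jpg", [0.1, "oops", 0.3])
except struct.error:
    pass

media_ids = [row[0] for row in conn.execute("SELECT id FROM media ORDER BY id")]
vector_ids = [
    row[0]
    for row in conn.execute("SELECT media_id FROM media_vectors ORDER BY media_id")
]
```

If either helper kept its own `conn.commit()`, the first insert would be persisted before the second ran, which is why the suggestion pairs the `with` block with removing those calls.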

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/src/find_api/core/sqlite_vec_poc.py` at line 67, the
SQLiteVecPOC.insert_media() method can leave an orphaned media record if
insert_vector() fails after insert_media() has already committed. Remove the
independent conn.commit() calls from both helper functions (the one at line 67
in the vector insertion helper and the one at line 83 in the media insertion
helper). Then wrap the insert_media() and insert_vector() calls inside
SQLiteVecPOC.insert_media() in a single transaction context that commits once
after both operations succeed and rolls back if either fails, ensuring atomicity.

Comment on lines +88 to +94
To run the spike manually:

```bash
cd backend
pip install sqlite-vec
uv run pytest tests/test_sqlite_vec_poc.py -q
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Check if uv is used in this project
fd -e "uv.lock" -o -e "pyproject.toml" | head -20

Repository: Abhash-Chakraborty/Find

Length of output: 184


🏁 Script executed:

# Check for the file in question
fd "desktop-runtime-adr.md"

Repository: Abhash-Chakraborty/Find

Length of output: 113


🏁 Script executed:

# List backend directory structure
ls -la backend/ 2>/dev/null || echo "backend directory not found"

Repository: Abhash-Chakraborty/Find

Length of output: 880


🏁 Script executed:

# Read the actual file content around lines 88-94
sed -n '80,100p' docs/plans/not-started/desktop-runtime-adr.md

Repository: Abhash-Chakraborty/Find

Length of output: 729


🏁 Script executed:

# Check the backend pyproject.toml to understand the project setup
cat backend/pyproject.toml

Repository: Abhash-Chakraborty/Find

Length of output: 1883


🏁 Script executed:

# Check if there's a root pyproject.toml with uv tool configuration
cat pyproject.toml 2>/dev/null || echo "No root pyproject.toml"

Repository: Abhash-Chakraborty/Find

Length of output: 90


Use uv pip install to match the project's environment.

The repository uses uv for package management (evidenced by backend/uv.lock and [tool.uv.*] sections in backend/pyproject.toml). Using plain pip install sqlite-vec can install the package outside the uv environment, causing readers to hit the skip path even after following these steps.

♻️ Suggested edit
 cd backend
-pip install sqlite-vec
+uv pip install sqlite-vec
 uv run pytest tests/test_sqlite_vec_poc.py -q
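
To see which environment the package actually landed in, one option is to ask the interpreter that will run the tests whether it can locate the module. A minimal sketch, assuming the import name is `sqlite_vec`; the helper `module_available` is hypothetical and not part of the PR.

```python
import importlib.util


def module_available(name):
    # find_spec returns None when a top-level module cannot be located
    # on the current interpreter's import path.
    return importlib.util.find_spec(name) is not None


has_sqlite_vec = module_available("sqlite_vec")
print("sqlite_vec importable:", has_sqlite_vec)
```

Running this through `uv run python` from the backend directory checks the same environment that `uv run pytest` uses.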
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/plans/not-started/desktop-runtime-adr.md` around lines 88 - 94, The
spike running instructions use plain pip install instead of the project's uv
package manager, which can cause the package to be installed outside the uv
environment and lead to false skips. Replace the line pip install sqlite-vec
with uv pip install sqlite-vec to ensure the package is installed in the proper
uv environment and matches the project's package management setup.

Source: Coding guidelines

@Monica-CodingWorld

Contributor Author

Hi @Abhash-Chakraborty,

Thank you for reviewing and approving the PR. I just wanted to check whether there's anything else needed from my side before it can be merged.
I've addressed the review feedback and the branch is up to date. If there's any additional change or clarification required, I'd be happy to help.

Thanks for your time!

Labels

`architecture`: High-level design decisions and technical direction
`backend`: FastAPI, database, storage, and API work
`desktop-app`: Windows, macOS, and Linux installed app work
`gssoc:approved`: Valid GSSoC contribution approved for scoring.
`gssoc26`: Related to GirlScript Summer of Code 2026.
`level:advanced`: GSSoC difficulty level: advanced. Base contributor points: 55.
`local-first`: Privacy-preserving local runtime and offline behavior
`performance`: Speed, startup, memory, image size, and runtime efficiency
`quality:clean`: Clean and maintainable PR. GSSoC contributor multiplier: 1.2x.
`ready-to-merge`: Fully approved, tested, and cleared for immediate merging.
`research`: Needs investigation, comparison, or design proposal before implementation
`type:performance`: Performance improvement PR. GSSoC type bonus: +15 points.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

spike: validate SQLite metadata and vector storage for desktop mode

2 participants