Add DuckDB full-text search provider #40

Merged: mhelleborg merged 4 commits into main from claude/provider-duckdb-k9u9fc on Jul 2, 2026

@mhelleborg

Owner

Implements the DuckDB provider — part of the multi-provider epic #23. DuckDB is embedded (like SQLite), so this is the first new provider whose entire conformance suite runs locally with no container at all — and it's fully green.

What's here

  • Source/SearchLite.DuckDB: SearchManager + SearchIndex<T> on DuckDB.NET.Data.Full (bundles the native lib), mirroring the SQLite provider. One table per index (id VARCHAR PK, document VARCHAR, search_text VARCHAR, last_updated TIMESTAMP), INSERT … ON CONFLICT DO UPDATE upserts, full ISearchIndex<T> surface.
  • Full-text via the fts extension — BM25 scoring through match_bm25. The index is built with stemmer='none', stopwords='none', strip_accents=0, lower=1 for verbatim, case-insensitive, accent-preserving tokenization (needed for the C# / SQL* / café / Über conformance cases).
  • WhereClauseBuilder<T> — every FilterNode<T> operator → DuckDB JSON predicates, with type-aware ORDER BY (resolves the CLR property type so numeric fields cast to DOUBLE instead of sorting lexically, NULLS FIRST/LAST to match LINQ).
  • Tests/SearchLite.DuckDB.Tests — concrete conformance subclass, TableNameTests, WhereClauseTests.
  • DuckDB.NET.Data.Full added to Directory.Packages.props; both projects in SearchLite.sln.
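For reviewers who want a feel for the type-aware ORDER BY described above, here is a minimal sketch in Python rather than the provider's C#. It is not the code in WhereClauseBuilder<T>: the function name, the use of Python types as stand-ins for CLR property types, and the exact SQL shape are assumptions for illustration. Only json_extract_string, the cast to DOUBLE, and the NULLS FIRST/LAST placement come from the description.

```python
def order_by_clause(field: str, clr_type: type, descending: bool = False) -> str:
    """Build an ORDER BY clause over a JSON document column (hypothetical helper).

    Numeric properties are cast to DOUBLE so that 10 sorts after 9; without the
    cast the extracted strings would sort lexically ("10" < "9").
    """
    # Guard against injecting arbitrary SQL through the property name.
    if not field.isidentifier():
        raise ValueError(f"unsupported field name: {field!r}")

    text_expr = f"json_extract_string(document, '$.{field}')"

    # bool is a subclass of int in Python, so exclude it explicitly.
    is_numeric = issubclass(clr_type, (int, float)) and not issubclass(clr_type, bool)
    expr = f"CAST({text_expr} AS DOUBLE)" if is_numeric else text_expr

    # LINQ's OrderBy puts nulls first and OrderByDescending puts them last,
    # so the NULLS placement follows the sort direction.
    direction = "DESC" if descending else "ASC"
    nulls = "NULLS LAST" if descending else "NULLS FIRST"
    return f"ORDER BY {expr} {direction} {nulls}"
```

Calling order_by_clause("price", float) yields a clause that casts the extracted value to DOUBLE, while order_by_clause("name", str, descending=True) sorts on the raw extracted string with nulls last.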

Verification

  • Builds clean in Release across net8.0/net9.0/net10.0.
  • Full conformance suite + unit tests: 144 passed, 0 failed (run locally on net10.0 — no Docker needed).
  • Auto-discovered by the parallel CI matrix.

Implementation notes

  • fts snapshot rebuild: DuckDB's fts index is built over a snapshot and doesn't auto-update, so SearchIndex tracks a _ftsDirty flag set on every write and lazily rebuilds via PRAGMA create_fts_index(…, overwrite=1) before any query — queries always see current data.
  • Candidate matching: match_bm25 is bag-of-words only, so candidate selection matches against a normalized token-delimited projection of search_text (OR-of-tokens when partial matches are on, contiguous phrase when off), while match_bm25 supplies the relevance score for ranking and MinScore.
  • Extension loading: in normal environments the default INSTALL fts; LOAD fts; path works with no config. The test fixture tries that first and only falls back to fetching the official extension binary (published as the duckdb-extension-fts PyPI wheel) when an environment blocks extensions.duckdb.org — the provider also accepts an optional extensionDirectory for fully offline/pre-provisioned setups.
  • Single shared connection with serialized access (DuckDB is embedded single-writer, analogous to the SQLite provider); the concurrent-collections conformance test passes.
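The first two notes above (lazy rebuild behind a dirty flag, and candidate matching over a token-delimited projection) can be sketched with an in-memory stand-in. This is an illustration in Python, not the provider's C# or its SQL: the class and function names are hypothetical, whitespace tokenization is an assumption, and BM25 scoring is left out entirely because the real provider gets it from match_bm25.

```python
def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace; accents and symbols are kept verbatim."""
    return text.lower().split()


def project(text: str) -> str:
    """Token-delimited projection, e.g. ' c# sql* café '.

    The leading and trailing delimiters turn whole-token and contiguous-phrase
    matching into plain substring checks.
    """
    return " " + " ".join(tokenize(text)) + " "


class SnapshotIndex:
    """In-memory stand-in for a table plus a snapshot-based text index."""

    def __init__(self) -> None:
        self._rows: dict[str, str] = {}      # current table contents
        self._snapshot: dict[str, str] = {}  # projections as of the last rebuild
        self._dirty = False
        self.rebuilds = 0                    # exposed only to make the laziness visible

    def upsert(self, doc_id: str, search_text: str) -> None:
        self._rows[doc_id] = search_text
        self._dirty = True  # every write invalidates the snapshot

    def _ensure_fresh(self) -> None:
        # Rebuild only when something changed since the last query.
        if not self._dirty:
            return
        self._snapshot = {doc_id: project(text) for doc_id, text in self._rows.items()}
        self.rebuilds += 1
        self._dirty = False

    def search(self, query: str, partial: bool = True) -> list[str]:
        self._ensure_fresh()
        tokens = tokenize(query)
        if not tokens:
            return []
        if partial:
            # OR-of-tokens: any whole query token may match.
            def matches(haystack: str) -> bool:
                return any(f" {token} " in haystack for token in tokens)
        else:
            # Contiguous phrase: all tokens, adjacent and in order.
            phrase = project(query)

            def matches(haystack: str) -> bool:
                return phrase in haystack
        return sorted(doc_id for doc_id, h in self._snapshot.items() if matches(h))
```

Two writes followed by two searches trigger a single rebuild; a later write marks the snapshot dirty again, so the next search rebuilds before matching. Because accents are preserved, a query for "cafe" does not match a document containing "Café".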

🤖 Generated with Claude Code


claude added 3 commits June 27, 2026 23:32
Implements an embedded DuckDB provider mirroring the SQLite provider:

- SearchLite.DuckDB source project (net8.0;net9.0;net10.0) using the
  DuckDB.NET.Data.Full package (bundles the native library).
- Full ISearchIndex<T> / ISearchEngineManager surface with one table per
  index (id VARCHAR PK, document VARCHAR JSON, search_text, last_updated)
  and INSERT ... ON CONFLICT DO UPDATE upserts.
- Full text via the fts extension: the BM25 index is built over a snapshot
  and does not auto-update on writes, so it is rebuilt lazily (create_fts_index
  with overwrite=1) before any search whenever the table has changed.
  Candidate selection uses a normalized token/phrase projection of search_text
  (so OR-of-terms and phrase semantics work); match_bm25 supplies the score.
- WhereClauseBuilder<T> translating every FilterNode<T> operator into DuckDB
  JSON predicates (json_extract / json_extract_string, list_contains for
  collections, LIKE with ESCAPE for string ops), with JSON-null aware IS NULL
  and type-aware ORDER BY.
- Tests/SearchLite.DuckDB.Tests mirroring the SQLite test project: concrete
  IndexTests subclass, TableNameTests and WhereClauseTests. A fixture supplies
  the official fts extension from the duckdb-extension-fts PyPI wheel into a
  local extension_directory, since the DuckDB extension repository is not
  reachable in this environment.
- Both projects added to SearchLite.sln; DuckDB.NET.Data.Full pinned in
  Directory.Packages.props.

The full inherited conformance suite plus the unit tests pass on net10.0
(144 passed, 0 failed).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01AHZB8AqzqRcBEuzRurzJFf

IndexManyAsync created a fresh command (re-parsing/re-planning the INSERT)
for every row, so bulk-loading 1000 documents took ~5s and tipped the
Performance_ShouldHandleBulkOperations 5s budget over on CI runners.
Reuse a single prepared command across the batch and only mark the FTS
index dirty when at least one row was written. Bulk load drops to ~2s
locally; full conformance suite stays green (144/144).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01AHZB8AqzqRcBEuzRurzJFf

The prepared-command-per-row path still issued one INSERT round-trip per
document, so bulk-loading 1000 rows stayed ~2-5s and intermittently tipped
the Performance_ShouldHandleBulkOperations 5s budget on slower CI runners
(one TFM failed at 5s). DuckDB is columnar, so per-row inserts are slow by
design — switch to the Appender (the native bulk path): stage the batch in
a scratch table via the appender, then upsert from it with INSERT ... SELECT
... ON CONFLICT (the appender itself can't do ON CONFLICT). Bulk load of
1000 docs drops to ~0.2s; full suite stays green (144/144) on net8.0 and
net10.0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01AHZB8AqzqRcBEuzRurzJFf
@mhelleborg

Owner Author

@copilot resolve the merge conflicts in this pull request

Copilot AI commented Jul 2, 2026

Contributor

@copilot resolve the merge conflicts in this pull request

Resolved in commit 91b19af. The only conflict was in SearchLite.sln — both the DuckDB and MariaDB project/configuration entries are now included. Directory.Packages.props merged cleanly. Build passes with 0 errors.

@mhelleborg mhelleborg merged commit 1db517d into main Jul 2, 2026
3 checks passed