Add DuckDB full-text search provider #40
Merged
Implements an embedded DuckDB provider mirroring the SQLite provider:

- SearchLite.DuckDB source project (net8.0;net9.0;net10.0) using the DuckDB.NET.Data.Full package (bundles the native library).
- Full ISearchIndex<T> / ISearchEngineManager surface with one table per index (id VARCHAR PK, document VARCHAR JSON, search_text, last_updated) and INSERT ... ON CONFLICT DO UPDATE upserts.
- Full text via the fts extension: the BM25 index is built over a snapshot and does not auto-update on writes, so it is rebuilt lazily (create_fts_index with overwrite=1) before any search whenever the table has changed. Candidate selection uses a normalized token/phrase projection of search_text (so OR-of-terms and phrase semantics work); match_bm25 supplies the score.
- WhereClauseBuilder<T> translating every FilterNode<T> operator into DuckDB JSON predicates (json_extract / json_extract_string, list_contains for collections, LIKE with ESCAPE for string ops), with JSON-null aware IS NULL and type-aware ORDER BY.
- Tests/SearchLite.DuckDB.Tests mirroring the SQLite test project: concrete IndexTests subclass, TableNameTests and WhereClauseTests. A fixture supplies the official fts extension from the duckdb-extension-fts PyPI wheel into a local extension_directory, since the DuckDB extension repository is not reachable in this environment.
- Both projects added to SearchLite.sln; DuckDB.NET.Data.Full pinned in Directory.Packages.props.

The full inherited conformance suite plus the unit tests pass on net10.0 (144 passed, 0 failed).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01AHZB8AqzqRcBEuzRurzJFf
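The "LIKE with ESCAPE for string ops" item above can be illustrated with a minimal sketch. It is written in Python and executed against the standard-library sqlite3 module purely as a stand-in, since the LIKE ... ESCAPE syntax is shared; the helper names escape_like and contains_predicate are hypothetical and are not taken from WhereClauseBuilder<T>. In the provider the column expression would presumably be a json_extract_string(...) call rather than a plain column.

```python
import sqlite3  # stand-in engine for trying the predicate; the PR itself targets DuckDB


def escape_like(value: str, escape: str = "\\") -> str:
    """Escape the escape character first, then the LIKE wildcards % and _."""
    return (
        value.replace(escape, escape + escape)
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


def contains_predicate(column_sql: str, value: str):
    """Return (sql, params) for a literal 'contains' match on column_sql.

    column_sql is trusted SQL produced by the builder (hypothetical here);
    the user-supplied value only ever travels as a bound parameter.
    """
    sql = f"{column_sql} LIKE ? ESCAPE '\\'"
    return sql, ["%" + escape_like(value) + "%"]
```

Without the escaping step, a search for the literal text 50%_off would also match strings such as "50 percent off", because % and _ would be treated as wildcards.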
IndexManyAsync created a fresh command (re-parsing/re-planning the INSERT) for every row, so bulk-loading 1000 documents took ~5s and pushed Performance_ShouldHandleBulkOperations over its 5s budget on CI runners. Reuse a single prepared command across the batch and only mark the FTS index dirty when at least one row was written. Bulk load drops to ~2s locally; full conformance suite stays green (144/144).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01AHZB8AqzqRcBEuzRurzJFf
The prepared-command-per-row path still issued one INSERT round-trip per document, so bulk-loading 1000 rows stayed at ~2-5s and intermittently pushed Performance_ShouldHandleBulkOperations over its 5s budget on slower CI runners (one TFM failed at 5s). DuckDB is columnar, so per-row inserts are slow by design. Switch to the Appender (the native bulk path): stage the batch in a scratch table via the appender, then upsert from it with INSERT ... SELECT ... ON CONFLICT (the appender itself can't do ON CONFLICT). Bulk load of 1000 docs drops to ~0.2s; full suite stays green (144/144) on net8.0 and net10.0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01AHZB8AqzqRcBEuzRurzJFf
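The stage-then-upsert shape described in this commit can be sketched as follows. This is an illustration only: it uses Python's standard-library sqlite3 as a stand-in engine, executemany stands in for the DuckDB Appender, and the table names docs and staging plus the function bulk_upsert are hypothetical. De-duplicating the batch before staging (last write wins) is an assumption of this sketch, not something the commit message states.

```python
import sqlite3  # stand-in engine; the PR uses DuckDB and its Appender


def bulk_upsert(conn: sqlite3.Connection, rows) -> int:
    """Stage (id, document) pairs in a scratch table, then upsert in one statement."""
    # Assumption: if an id repeats within the batch, the last occurrence wins.
    latest = dict(rows)

    cur = conn.cursor()
    cur.execute("CREATE TEMP TABLE IF NOT EXISTS staging (id TEXT, document TEXT)")
    cur.execute("DELETE FROM staging")

    # Bulk-load the scratch table (the role the Appender plays in the PR).
    cur.executemany(
        "INSERT INTO staging (id, document) VALUES (?, ?)", list(latest.items())
    )

    # One set-based upsert from the scratch table into the real table.
    # SQLite needs a WHERE clause on the SELECT so the parser can tell the
    # upsert's ON CONFLICT apart from a join constraint.
    cur.execute(
        """
        INSERT INTO docs (id, document)
        SELECT id, document FROM staging WHERE 1
        ON CONFLICT (id) DO UPDATE SET document = excluded.document
        """
    )
    conn.commit()
    return len(latest)
```

The point of the pattern is that the per-row work happens in the cheap bulk-load step, while conflict handling runs once as a single set-based statement.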
Owner
Author
@copilot resolve the merge conflicts in this pull request
Contributor
Resolved in commit |
Implements the DuckDB provider, part of the multi-provider epic #23. DuckDB is embedded (like SQLite), so this is the first new provider whose entire conformance suite runs locally with no container at all, and the suite is fully green.
What's here
- Source/SearchLite.DuckDB: SearchManager + SearchIndex<T> on DuckDB.NET.Data.Full (bundles the native lib), mirroring the SQLite provider. One table per index (id VARCHAR PK, document VARCHAR, search_text VARCHAR, last_updated TIMESTAMP), INSERT … ON CONFLICT DO UPDATE upserts, full ISearchIndex<T> surface.
- fts extension: BM25 scoring through match_bm25. The index is built with stemmer='none', stopwords='none', strip_accents=0, lower=1 for verbatim, case-insensitive, accent-preserving tokenization (needed for the C# / SQL* / café / Über conformance cases).
- WhereClauseBuilder<T>: every FilterNode<T> operator → DuckDB JSON predicates, with type-aware ORDER BY (resolves the CLR property type so numeric fields cast to DOUBLE instead of sorting lexically, NULLS FIRST/LAST to match LINQ).
- Tests/SearchLite.DuckDB.Tests: concrete conformance subclass, TableNameTests, WhereClauseTests.
- DuckDB.NET.Data.Full added to Directory.Packages.props; both projects in SearchLite.sln.

Verification
Implementation notes
- SearchIndex tracks a _ftsDirty flag set on every write and lazily rebuilds via PRAGMA create_fts_index(…, overwrite=1) before any query, so queries always see current data.
- match_bm25 is bag-of-words only, so candidate selection matches against a normalized token-delimited projection of search_text (OR-of-tokens when partial matches are on, contiguous phrase when off), while match_bm25 supplies the relevance score for ranking and MinScore.
- The INSTALL fts; LOAD fts; path works with no config. The test fixture tries that first and only falls back to fetching the official extension binary (published as the duckdb-extension-fts PyPI wheel) when an environment blocks extensions.duckdb.org. The provider also accepts an optional extensionDirectory for fully offline/pre-provisioned setups.
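The dirty-flag rebuild and the token-delimited candidate selection described in these notes can be sketched together. This is a rough illustration, not the provider's code: the class LazyIndex, the normalize helper, and the single-space delimiter are all assumptions, the in-memory snapshot merely stands in for the FTS index, and BM25 scoring is left out entirely.

```python
def normalize(text: str) -> str:
    """Lowercase, split on whitespace, and wrap every token in single spaces.

    Accents and symbols are preserved, so "Café" becomes " café " and
    "C#" becomes " c# ". The delimiter choice is an assumption of this sketch.
    """
    return " " + " ".join(text.lower().split()) + " "


class LazyIndex:
    """Hypothetical stand-in for an index whose snapshot is rebuilt lazily."""

    def __init__(self) -> None:
        self._rows = {}       # id -> raw search_text (the "table")
        self._snapshot = {}   # id -> normalized projection (the "index")
        self._dirty = False
        self.rebuilds = 0     # exposed so the lazy behaviour is observable

    def upsert(self, doc_id: str, search_text: str) -> None:
        self._rows[doc_id] = search_text
        self._dirty = True    # every write invalidates the snapshot

    def _ensure_fresh(self) -> None:
        # Rebuild only when a write happened since the last query.
        if self._dirty:
            self._snapshot = {i: normalize(t) for i, t in self._rows.items()}
            self.rebuilds += 1
            self._dirty = False

    def search(self, query: str, partial: bool = True) -> list:
        self._ensure_fresh()
        tokens = query.lower().split()
        if not tokens:
            return []
        if partial:
            # OR-of-tokens: any whole token may match.
            needles = [f" {t} " for t in tokens]
        else:
            # Contiguous phrase: all tokens, in order, adjacent.
            needles = [" " + " ".join(tokens) + " "]
        return sorted(
            i for i, proj in self._snapshot.items() if any(n in proj for n in needles)
        )
```

Because tokens are matched with their delimiters, a query for "c#" cannot accidentally match inside a longer token, and several consecutive searches after one batch of writes trigger only a single rebuild.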
Generated by Claude Code