test(store): cover normalizeFTSQuery edge cases and document FTS5 tokenizer #45

Open
mvanhorn wants to merge 1 commit into steipete:main from mvanhorn:test/9-fts-query-sanitizer-edge-cases

Summary

Adds focused unit and end-to-end tests for normalizeFTSQuery and a one-line doc comment above each FTS5 create virtual table site noting the default unicode61 tokenizer and the input-normalization contract.

Why this matters

Issue #9 raised three asks: parameterize FTS queries, document the tokenizer choice, and add edge-case tests. Two of those are already in place:

  • FTS query input is parameterized: internal/store/query.go:75-76, query.go:427, members_profile.go:138-139 all pass user input via match ?.
  • Operator literalization is handled by normalizeFTSQuery at internal/store/query.go:893, which wraps each whitespace-separated field in double quotes after stripping inner quotes. AND, OR, NOT, NEAR, and * become literal terms rather than FTS5 syntax.
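
The literalization described above can be sketched roughly as follows. This is a minimal illustration based only on the description in this PR, not the code at internal/store/query.go:893; the name normalizeFTSQuerySketch and the choice to drop fields that are empty after quote stripping are assumptions.

```go
package main

import "strings"

// normalizeFTSQuerySketch is a hypothetical stand-in for normalizeFTSQuery.
// It splits the input on whitespace, strips double quotes from each field,
// and wraps what remains in double quotes so FTS5 reads it as a string
// rather than as query syntax.
func normalizeFTSQuerySketch(input string) string {
	fields := strings.Fields(input)
	terms := make([]string, 0, len(fields))
	for _, field := range fields {
		term := strings.ReplaceAll(field, `"`, "")
		if term == "" {
			// Assumption: a field made up only of quotes is dropped.
			continue
		}
		terms = append(terms, `"`+term+`"`)
	}
	return strings.Join(terms, " ")
}
```

Under these assumptions, the input foo AND bar* becomes "foo" "AND" "bar*", so AND is passed to FTS5 as a quoted string instead of a boolean operator. The result would then be bound to the match ? placeholder rather than concatenated into the SQL text.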

What was missing:

  • No test directly exercises normalizeFTSQuery. The closest coverage in store_test.go (TestSearchFallbackFilters, TestStoreReadWriteAndSearch) only uses simple non-operator queries, so operator-as-literal behavior was not asserted.
  • The FTS5 default tokenizer (unicode61) is implicit. A future contributor inspecting the schema would have to consult SQLite docs to see what tokenization rules apply.

This PR locks in the existing sanitizer with regression tests and documents the tokenizer choice. The framing matches the precedent set by #3, which was closed with "that should be a narrower regression-test issue rather than this broad one."

Changes

  • internal/store/store_test.go: appends two tests. TestNormalizeFTSQueryEdgeCases is table-driven and covers empty and whitespace-only input, single- and multi-word queries, AND/OR/NOT/NEAR as terms, embedded double quotes, * as a literal, mixed punctuation, and unicode. TestSearchMessagesTreatsFTSOperatorsAsLiterals is end-to-end: it searches for "AND" and asserts that only messages whose content contains that token match, confirming the input is not parsed as the FTS5 boolean operator.
  • internal/store/store.go (lines ~394 and ~575) and internal/store/members_profile.go (line ~60): one-line comment above each create virtual table ... using fts5(...) block noting the default unicode61 tokenizer and pointing readers at normalizeFTSQuery.

No change to normalizeFTSQuery, no change to the FTS5 schema (no tokenize= clause added), no change to any caller.
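
As an illustration of the kind of comment this PR adds, a schema site without a tokenize= clause could be annotated as below. The constant name, table name, and column are hypothetical and not taken from the repository; the helper exists only to make the "no tokenize= clause" property checkable in the sketch.

```go
package main

import "strings"

// Hypothetical schema site; the table and column names are illustrative.
//
// FTS5 falls back to its default unicode61 tokenizer here because no
// tokenize= option is given. User input is expected to pass through
// normalizeFTSQuery before it is bound to "match ?".
const messageFTSSchemaSketch = `
create virtual table if not exists message_fts using fts5(
	content
);`

// declaresTokenizer reports whether a schema statement names a tokenizer
// explicitly instead of relying on the FTS5 default.
func declaresTokenizer(schema string) bool {
	return strings.Contains(strings.ToLower(schema), "tokenize")
}
```

Keeping the note in a Go comment rather than in the SQL text leaves the statement itself byte-for-byte unchanged, which matches the "no change to the FTS5 schema" claim above.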

Testing

Local CI gate, all green:

  • gofumpt -l . (clean)
  • go vet ./...
  • staticcheck ./...
  • golangci-lint run
  • gosec -exclude=G101,G115,G202,G301,G304 ./...
  • go test -count=1 ./...
  • go test -count=1 -race ./internal/store/...

Diff: 3 files changed, +92 / -0.

Fixes #9

Development

Successfully merging this pull request may close these issues.

search/FTS: injection & tokenizer configuration
