fix: Limit relevant fields in default wildcard search #768
Open
Test Results: 43 files ±0, 43 suites ±0, 2m 52s ⏱️ +28s. Results for commit e0680a9. ± Comparison against base commit d389989. This pull request removes 5 and adds 4 tests. Note that renamed tests count towards both.
NOTE: This is a possible fix to address the issue described in https://sikt.atlassian.net/browse/NP-50823. Because it changes default behavior significantly, it will likely break something unexpected and should be tested carefully.
Problem
The `CROSS_FIELDS` multi-match query in `searchAllWithBoostsQuery` used `"*"` (wildcard) for the default field set, which OpenSearch expands to every field in the index (~433 text fields in prod). Combined with `Operator.AND`, this creates `tokens × fields` boolean clauses.
The `.limit(7)` caps space-separated words, but the standard tokenizer also splits on hyphens, so a query like "Genome-wide association meta-analysis..." (7 words) produces 9+ tokens. With ~433 fields, that's ~3900 clauses for 9 tokens and ~4300 for 10, putting such queries at or over the `maxClauseCount` limit of 4096. Production has more dynamically-mapped fields than test, which is why the same queries work in test but fail in prod.
Fix
Replaced the `"*"` wildcard with an explicit `DEFAULT_SEARCH_ALL_FIELDS` map containing 15 curated fields that are meaningful for free-text publication search.
Clause count: 15 fields × ~10 tokens = ~150 clauses (vs 4000+ before), well within the 4096 limit, even with hyphenated queries.
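The clause arithmetic can be checked with a small sketch. This is only an approximation: `ClauseEstimate` and its methods are hypothetical names, and the whitespace-and-hyphen split is a rough stand-in for the standard tokenizer, which follows Unicode text segmentation and splits on more than this. The query string is only the visible prefix of the example query, not the full one.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the PR: estimates tokens × fields.
class ClauseEstimate {

    // Rough stand-in for the standard tokenizer: split on whitespace and hyphens.
    static long countTokens(String query) {
        return Arrays.stream(query.trim().split("[\\s-]+"))
                .filter(token -> !token.isEmpty())
                .count();
    }

    // With Operator.AND, each token is matched against each field.
    static long estimateClauses(long tokens, int fieldCount) {
        return tokens * fieldCount;
    }

    public static void main(String[] args) {
        // Three space-separated words, but five tokens once hyphens are split.
        long tokens = countTokens("Genome-wide association meta-analysis");
        System.out.println("tokens in prefix: " + tokens);

        // Field counts taken from the description: ~433 in prod, 15 curated.
        System.out.println("10 tokens, wildcard: " + estimateClauses(10, 433));
        System.out.println("10 tokens, curated:  " + estimateClauses(10, 15));
    }
}
```

At 10 tokens the wildcard case comes out above 4096, while the curated list stays in the low hundreds, which matches the figures quoted above.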
When users explicitly specify the `fields` parameter (NODES_SEARCHED), the behavior is unchanged; only the default "search all" case is affected.
Alternative: `copy_to` field
If the curated field list turns out to be too restrictive, a more robust long-term option is to add a `copy_to` directive in the index mapping that copies all searchable fields into a single combined text field (e.g. `_search_all`). The multi-match query would then target just 1 field instead of N, eliminating the clause explosion entirely while still supporting true all-field search. The trade-off is that it requires a mapping change and full re-index.
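As a rough illustration of what such a mapping might look like, the sketch below builds the `properties` section as nested Java maps. Everything here is an assumption for illustration: `SearchAllMappingSketch` is a hypothetical class, `mainTitle` and `abstract` are placeholder field names rather than fields confirmed by this PR, and `_search_all` is simply the example name used above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of the PR.
class SearchAllMappingSketch {

    // Combined field name, copied from the example in the description.
    static final String COMBINED_FIELD = "_search_all";

    // Placeholder source fields; the real list would come from the index mapping.
    static final List<String> SOURCE_FIELDS = List.of("mainTitle", "abstract");

    // Each source field is a text field that copies its value into the combined field.
    static Map<String, Map<String, String>> buildProperties() {
        Map<String, Map<String, String>> properties = new LinkedHashMap<>();
        for (String field : SOURCE_FIELDS) {
            properties.put(field, Map.of("type", "text", "copy_to", COMBINED_FIELD));
        }
        // The combined field itself is a plain text field and the only query target.
        properties.put(COMBINED_FIELD, Map.of("type", "text"));
        return properties;
    }

    static Map<String, Object> buildMapping() {
        return Map.of("properties", buildProperties());
    }

    public static void main(String[] args) {
        System.out.println(buildMapping());
    }
}
```

With a mapping of this shape, the default query would target only the combined field, so the clause count would scale with the number of tokens rather than with tokens × fields.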