Skip to content

Fix wildcard query analyzer match-all leak#21487

Open
baekgyu-kim wants to merge 1 commit into
opensearch-project:mainfrom
baekgyu-kim:21280
Open

Fix wildcard query analyzer match-all leak#21487
baekgyu-kim wants to merge 1 commit into
opensearch-project:mainfrom
baekgyu-kim:21280

Conversation

@baekgyu-kim
Copy link
Copy Markdown

@baekgyu-kim baekgyu-kim commented May 5, 2026

Description

This PR fixes a query_string wildcard query correctness bug where analyzer-stripped wildcard patterns can unintentionally match documents.

Example scenario:

  • A user searches with a wildcard pattern such as *asdf*.
  • The field/search analyzer is configured to keep only digits.
  • The literal part of the pattern, asdf, is stripped during analysis.
  • The remaining wildcard pattern can compile to a match-all automaton.
  • As a result, documents with unrelated indexed terms, such as 1234, can be returned.

Expected behavior:

  • If analysis removes all searchable literal content from the wildcard query, the query should not match documents.
  • User-typed match-all wildcard patterns such as *, **, and *** should continue to behave as match-all patterns.

Fix:

  • Detect wildcard queries that analysis turns into Lucene's match-all automaton (CompiledAutomaton.AUTOMATON_TYPE.ALL).
  • Rewrite those analyzer-induced match-all wildcard queries to match_no_docs.
  • Preserve user-typed match-all wildcard patterns.
  • The guard fires only on AUTOMATON_TYPE.ALL, so ?-based patterns (?, ??, *?*, **?) compile to NORMAL and are unaffected.

Related Issues

Resolves #21280

Check List

  • Functionality includes testing.
    • Added regression tests covering analyzer-stripped wildcard queries, configured search analyzers, and preserved user-typed match-all wildcard patterns.
  • API changes companion pull request created, if applicable.
    • Not applicable because this change does not modify any REST API request or response specification.
  • Public documentation issue/PR created, if applicable.
    • Not applicable because this is a bug fix with no new user-facing API, setting, or documented behavior to add.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions Bot added bug Something isn't working Search Search query, autocomplete ...etc labels May 5, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

PR Reviewer Guide 🔍

(Review updated until commit 3e6b762)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

PR Code Suggestions ✨

Latest suggestions up to 3e6b762

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Handle question mark wildcards

The method checks if all characters are asterisks but doesn't handle the case where
termStr contains question marks (?). Since ? is also a wildcard character that could
be stripped by analyzers, the check should include it to prevent false positives
when users type patterns like ? or ???.

server/src/main/java/org/opensearch/index/search/QueryStringQueryParser.java [763-770]

 private static Query rejectAnalyzerInducedMatchAll(String termStr, Query query) {
     if (query instanceof AutomatonQuery automatonQuery
         && automatonQuery.getCompiled().type == CompiledAutomaton.AUTOMATON_TYPE.ALL
-        && termStr.chars().allMatch(c -> c == '*') == false) {
+        && termStr.chars().allMatch(c -> c == '*' || c == '?') == false) {
         return Queries.newMatchNoDocsQuery("Wildcard pattern normalized to match-all by analyzer");
     }
     return query;
 }
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that ? is also a wildcard character that could be stripped by analyzers. Including it in the check prevents the method from incorrectly rejecting user-typed patterns like **? or ??? that should match all documents. This improves the accuracy of the wildcard handling logic.

Medium

Previous suggestions

Suggestions up to commit e8fd46d
CategorySuggestion                                                                                                                                    Impact
Possible issue
Add empty string validation

The check termStr.chars().allMatch(c -> c == '*') will return true for empty
strings, which could lead to unexpected behavior. Add an explicit check to ensure
termStr is not empty before performing the character validation to prevent edge
cases where an empty term string might bypass the intended logic.

server/src/main/java/org/opensearch/index/search/QueryStringQueryParser.java [763-770]

 private static Query rejectAnalyzerInducedMatchAll(String termStr, Query query) {
     if (query instanceof AutomatonQuery automatonQuery
         && automatonQuery.getCompiled().type == CompiledAutomaton.AUTOMATON_TYPE.ALL
+        && !termStr.isEmpty()
         && termStr.chars().allMatch(c -> c == '*') == false) {
         return Queries.newMatchNoDocsQuery("Wildcard pattern normalized to match-all by analyzer");
     }
     return query;
 }
Suggestion importance[1-10]: 3

__

Why: While technically correct that allMatch returns true for empty streams, this is an edge case that's unlikely in practice since termStr comes from user input and earlier code checks for "*". The suggestion adds defensive programming but has minimal impact on correctness.

Low
Suggestions up to commit ad60310
CategorySuggestion                                                                                                                                    Impact
Possible issue
Handle '?' wildcard in match-all check

The method checks if all characters are '' but doesn't handle '?' wildcards.
According to Lucene wildcard syntax, '?' is also a valid wildcard character that
could create a match-all pattern. Consider including '?' in the check to prevent
false positives when users type patterns like '???' or mixed patterns like '
?*'.

server/src/main/java/org/opensearch/index/search/QueryStringQueryParser.java [763-770]

 private static Query rejectAnalyzerInducedMatchAll(String termStr, Query query) {
     if (query instanceof AutomatonQuery automatonQuery
         && automatonQuery.getCompiled().type == CompiledAutomaton.AUTOMATON_TYPE.ALL
-        && termStr.chars().allMatch(c -> c == '*') == false) {
+        && termStr.chars().allMatch(c -> c == '*' || c == '?') == false) {
         return Queries.newMatchNoDocsQuery("Wildcard pattern normalized to match-all by analyzer");
     }
     return query;
 }
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that ? is also a wildcard character in Lucene syntax. Including it in the check ensures that user-typed patterns like ??? are treated as intentional match-all queries, preventing false rejections. This improves the accuracy of the wildcard handling logic.

Medium
Suggestions up to commit 957f292
CategorySuggestion                                                                                                                                    Impact
General
Handle wildcard-only patterns with '?' characters

The current check only considers as a user-typed match-all pattern, but ?
characters can also produce a match-all automaton when all literal characters are
stripped. Consider also allowing patterns that consist solely of
and ? wildcards
to pass through as intentional match-all queries.

server/src/main/java/org/opensearch/index/search/QueryStringQueryParser.java [766]

-&& termStr.chars().allMatch(c -> c == '*') == false) {
+&& termStr.chars().allMatch(c -> c == '*' || c == '?') == false) {
Suggestion importance[1-10]: 5

__

Why: The suggestion is logically sound - a pattern like ??? consisting only of ? wildcards could also produce a match-all automaton when all literals are stripped, and should be treated as intentional. However, this is an edge case and the primary fix for * patterns is already correct. The improvement is minor but valid.

Low
Improve test accuracy for char_filter stripping simulation

The initReaderForNormalization override returns an empty StringReader
unconditionally, but this method is used for normalization, not for the main
analysis path. To properly simulate a char_filter that strips all characters during
wildcard analysis, the createComponents method should include a filter that produces
no tokens, rather than relying on the normalization path. The current test may not
accurately reflect the bug scenario.

server/src/test/java/org/opensearch/index/query/QueryStringQueryBuilderTests.java [629-632]

 @Override
         protected Reader initReaderForNormalization(String fieldName, Reader reader) {
             return new StringReader("");
         }
 
+        @Override
+        protected TokenStreamComponents createComponents(String fieldName) {
+            KeywordTokenizer tokenizer = new KeywordTokenizer();
+            // Simulate a char_filter that strips all characters, producing an empty token
+            return new TokenStreamComponents(r -> tokenizer.setReader(new StringReader("")), tokenizer);
+        }
+
Suggestion importance[1-10]: 2

__

Why: The improved_code duplicates the createComponents method which already exists in the anonymous class (lines 624-626), making it contradictory. Additionally, the existing test does pass (the initReaderForNormalization path is what triggers the match-all automaton in the wildcard analysis path), so the suggestion's claim that the test "may not accurately reflect the bug scenario" appears incorrect given the test is designed to work with the actual code path.

Low
Suggestions up to commit 7fdad9b
CategorySuggestion                                                                                                                                    Impact
Possible issue
Unwrap boost query before match-all check

The method also needs to handle the case where the query is wrapped in a BoostQuery.
If the wildcard query has a boost applied, query instanceof AutomatonQuery will be
false and the match-all leak will not be detected. Unwrap BoostQuery before checking
the inner query type.

server/src/main/java/org/opensearch/index/search/QueryStringQueryParser.java [763-770]

 private static Query rejectAnalyzerInducedMatchAll(String termStr, Query query) {
-    if (query instanceof AutomatonQuery automatonQuery
+    Query inner = query;
+    if (inner instanceof BoostQuery boostQuery) {
+        inner = boostQuery.getQuery();
+    }
+    if (inner instanceof AutomatonQuery automatonQuery
         && automatonQuery.getCompiled().type == CompiledAutomaton.AUTOMATON_TYPE.ALL
         && termStr.chars().allMatch(c -> c == '*') == false) {
         return Queries.newMatchNoDocsQuery("Wildcard pattern normalized to match-all by analyzer");
     }
     return query;
 }
Suggestion importance[1-10]: 6

__

Why: This is a valid edge case — if a BoostQuery wraps the AutomatonQuery, the match-all detection would be bypassed. The improved_code correctly unwraps the BoostQuery before the instanceof AutomatonQuery check, and BoostQuery is already imported in the file, making this a reasonable defensive improvement.

Low
General
Handle wildcard-only patterns with question marks

The current check only considers as a user-typed match-all pattern, but ? can also
produce a match-all automaton when all literal characters are stripped (e.g., ?asdf?
through a digits-only filter would leave ??, which may also compile to match-all).
Consider also allowing patterns that consist solely of
and ? wildcards to pass
through as user-intended match-all patterns.

server/src/main/java/org/opensearch/index/search/QueryStringQueryParser.java [766]

-&& termStr.chars().allMatch(c -> c == '*') == false) {
+&& termStr.chars().allMatch(c -> c == '*' || c == '?') == false) {
Suggestion importance[1-10]: 5

__

Why: The suggestion has merit since ? wildcards could also produce match-all automata when all literals are stripped. However, ? matches exactly one character (not zero or more), so ?? would not compile to a true match-all automaton the same way ** does. The suggestion is plausible but not clearly correct without deeper analysis of Lucene's automaton behavior for ?-only patterns.

Low

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

✅ Gradle check result for 7fdad9b: SUCCESS

@codecov
Copy link
Copy Markdown

codecov Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 88.88889% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 73.51%. Comparing base (d09b185) to head (e8fd46d).
⚠️ Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
...pensearch/index/search/QueryStringQueryParser.java 88.88% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #21487      +/-   ##
============================================
+ Coverage     73.45%   73.51%   +0.05%     
- Complexity    74591    74618      +27     
============================================
  Files          5980     5980              
  Lines        338779   338784       +5     
  Branches      48848    48849       +1     
============================================
+ Hits         248849   249054     +205     
+ Misses        70079    69871     -208     
- Partials      19851    19859       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@baekgyu-kim baekgyu-kim force-pushed the 21280 branch 2 times, most recently from a815211 to 957f292 Compare May 6, 2026 11:53
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 6, 2026

Persistent review updated to latest commit 957f292

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 6, 2026

✅ Gradle check result for 957f292: SUCCESS

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

Persistent review updated to latest commit ad60310

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

✅ Gradle check result for ad60310: SUCCESS

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 9, 2026

Persistent review updated to latest commit e8fd46d

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 9, 2026

✅ Gradle check result for e8fd46d: SUCCESS

Signed-off-by: Baekgyu <baekgyukim.dev@gmail.com>
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 3e6b762

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 3e6b762: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working Search Search query, autocomplete ...etc

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Null values matched in query_string with asterisks

1 participant