-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[feature](inverted index) support multiple tokenize index in one column #59117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
b2fb894 to
ad74834
Compare
|
run buildall |
1 similar comment
|
run buildall |
d00b070 to
af7b864
Compare
|
run buildall |
TPC-H: Total hot run time: 36319 ms |
TPC-DS: Total hot run time: 179202 ms |
ClickBench: Total hot run time: 27.21 s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
| }; | ||
|
|
||
| if (runtime_state->query_options().enable_profile) { | ||
| if (runtime_state != nullptr && runtime_state->query_options().enable_profile) { |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition checks runtime_state != nullptr before accessing query_options(), but in the original code this check didn't exist. This suggests that runtime_state could be null in some scenarios. However, the query execution path that follows this block doesn't have similar null checks, which could lead to crashes if runtime_state is actually null in practice. Review whether runtime_state can legitimately be null here and add consistent null handling throughout the method if so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 34 out of 34 changed files in this pull request and generated 15 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| public String getAnalyzerIdentity() { | ||
| if (indexType != IndexDefinition.IndexType.INVERTED) { | ||
| return ""; | ||
| } | ||
| return InvertedIndexUtil.buildAnalyzerIdentity(properties); | ||
| } |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing test coverage for the new getAnalyzerIdentity method. This new public method returns analyzer identity for indexes but doesn't appear to be tested. Consider adding unit tests to verify it returns the correct identity for different property configurations.
| std::string column_name = block.get_by_position(arguments[0]).name; | ||
| VLOG_DEBUG << "begin to execute match directly, column_name=" << column_name | ||
| << ", match_query_str=" << match_query_str; | ||
| auto* analyzer_ctx = get_match_analyzer_ctx(context); |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary change removing blank line. This change removes a blank line that was separating the analyzer_ctx initialization from the source column extraction. The blank line improved readability by grouping related operations. Consider keeping it.
| auto* analyzer_ctx = get_match_analyzer_ctx(context); | |
| auto* analyzer_ctx = get_match_analyzer_ctx(context); |
| //waitAnalyzerReady(analyzerStandard) | ||
| //waitAnalyzerReady(analyzerEdge) | ||
| //waitAnalyzerReady(analyzerKeyword) |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented-out code should be removed. These analyzer readiness check calls appear to be intentionally disabled but if they're not needed, they should be deleted rather than commented out. If they're needed for future use, consider documenting why they're disabled.
| /*sql "DROP TABLE IF EXISTS ${tableName}" | ||
| try { | ||
| sql "DROP INVERTED INDEX ANALYZER ${analyzerStandard}" | ||
| } catch (Exception ignored) { | ||
| } | ||
| try { | ||
| sql "DROP INVERTED INDEX ANALYZER ${analyzerEdge}" | ||
| } catch (Exception ignored) { | ||
| } | ||
| try { | ||
| sql "DROP INVERTED INDEX ANALYZER ${analyzerKeyword}" | ||
| } catch (Exception ignored) { | ||
| } | ||
| try { | ||
| sql "DROP INVERTED INDEX TOKENIZER ${tokenizerName}" | ||
| } catch (Exception ignored) { | ||
| }*/ |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Large block of commented-out cleanup code should be removed or uncommented. The finally block contains extensive cleanup logic that is commented out. If this cleanup is necessary, it should be active. If it's not needed (perhaps handled elsewhere), the commented code should be deleted to improve code clarity.
| indexDef.getIndexType() + " index for column (" + columnName + ") with " | ||
| + (isNewIndexAnalyzer ? "analyzed" : "non-analyzed") | ||
| + " type already exists."); |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent error message format. The error message format changes between line 2827-2828 (using "analyzer identity") versus lines 2835-2836 (using "analyzed"/"non-analyzed"). For consistency, consider using the analyzer identity format for both branches, or restructure to have uniform messaging.
| indexDef.getIndexType() + " index for column (" + columnName + ") with " | |
| + (isNewIndexAnalyzer ? "analyzed" : "non-analyzed") | |
| + " type already exists."); | |
| indexDef.getIndexType() + " index for column (" + columnName + ") with analyzer " | |
| + (isNewIndexAnalyzer ? "analyzed" : "non-analyzed") | |
| + " already exists."); |
| #include <memory> | ||
| #include <ostream> | ||
| #include <string> | ||
| #include <string_view> | ||
| #include <type_traits> |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary inclusion of includes in comment. The comment references "cstddef" and "ostream" which are no longer included after the changes. These appear to be removed includes that shouldn't be mentioned in the comment about what changed. Consider removing them from the diff context or clarifying the comment intent.
| auto* column_slot_ref = assert_cast<VSlotRef*>(get_child(0).get()); | ||
| std::string column_name = column_slot_ref->expr_name(); | ||
| auto it = std::find(column_names.begin(), column_names.end(), column_name); | ||
| auto it = std::ranges::find(column_names, column_name); |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent range-based for loop usage. The code changes lines 97, 106 to use "const auto&" for range-based loops but then uses "std::ranges::find" on line 175 which requires C++20 ranges. Consider using consistent modern C++ patterns throughout, or ensure all usages are compatible with the project's C++ standard.
| auto it = std::ranges::find(column_names, column_name); | |
| auto it = std::find(column_names.begin(), column_names.end(), column_name); |
| "disable_auto_compaction" = "true" | ||
| ) | ||
| """ | ||
| sleep(10000) |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hardcoded sleep of 10 seconds without explanation. This appears arbitrary and could make tests unnecessarily slow. Consider either using the existing waitAnalyzerReady mechanism (which is commented out above) or documenting why this specific duration is required.
| protected MatchPredicate(MatchPredicate other) { | ||
| super(other); | ||
| op = other.op; | ||
| invertedIndexParser = other.invertedIndexParser; | ||
| invertedIndexParserMode = other.invertedIndexParserMode; | ||
| invertedIndexCharFilter = other.invertedIndexCharFilter; | ||
| invertedIndexParserLowercase = other.invertedIndexParserLowercase; | ||
| invertedIndexParserStopwords = other.invertedIndexParserStopwords; | ||
| invertedIndexAnalyzerName = other.invertedIndexAnalyzerName; | ||
| explicitAnalyzer = other.explicitAnalyzer; | ||
| } | ||
|
|
||
| /** | ||
| * use for Nereids ONLY | ||
| */ | ||
| public MatchPredicate(Operator op, Expr e1, Expr e2, Type retType, | ||
| NullableMode nullableMode, Index invertedIndex, boolean nullable) { | ||
| this(op, e1, e2, retType, nullableMode, invertedIndex, nullable, null); | ||
| } |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Constructor order is unusual - simpler constructor delegates to more complex one which then delegates to another. Consider reordering: have the most complex constructor first, then simpler ones delegate to it. This would make the delegation chain clearer and follow common Java constructor patterns.
| // Fields for thrift serialization (restored from old version) | ||
| private String invertedIndexParser; | ||
| private String invertedIndexParserMode; | ||
| private Map<String, String> invertedIndexCharFilter; | ||
| private boolean invertedIndexParserLowercase = true; | ||
| private String invertedIndexParserStopwords = ""; | ||
| private String invertedIndexAnalyzerName = ""; | ||
| // Fields for SQL generation |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary comment about code that was moved. The comment "Fields for thrift serialization (restored from old version)" seems to reference implementation history rather than explaining the purpose of these fields. Consider updating to describe what these fields are used for functionally, or remove if self-explanatory.
| // Fields for thrift serialization (restored from old version) | |
| private String invertedIndexParser; | |
| private String invertedIndexParserMode; | |
| private Map<String, String> invertedIndexCharFilter; | |
| private boolean invertedIndexParserLowercase = true; | |
| private String invertedIndexParserStopwords = ""; | |
| private String invertedIndexAnalyzerName = ""; | |
| // Fields for SQL generation | |
| // Inverted index analyzer configuration persisted via Thrift serialization | |
| private String invertedIndexParser; | |
| private String invertedIndexParserMode; | |
| private Map<String, String> invertedIndexCharFilter; | |
| private boolean invertedIndexParserLowercase = true; | |
| private String invertedIndexParserStopwords = ""; | |
| private String invertedIndexAnalyzerName = ""; | |
| // Analyzer name explicitly specified in the SQL MATCH predicate, used for SQL generation |
FE Regression Coverage ReportIncrement line coverage |
|
run buildall |
TPC-H: Total hot run time: 35237 ms |
TPC-DS: Total hot run time: 178191 ms |
ClickBench: Total hot run time: 27.3 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
TPC-H: Total hot run time: 35344 ms |
TPC-DS: Total hot run time: 176358 ms |
ClickBench: Total hot run time: 27.21 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
|
run buildall |
TPC-H: Total hot run time: 31753 ms |
TPC-DS: Total hot run time: 172363 ms |
ClickBench: Total hot run time: 27.41 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
…imization This PR addresses multiple issues in the multi-analyzer inverted index feature: P0 Fixes: - Fix select_best_reader single reader bug: now checks analyzer key match before returning, preventing wrong index usage when specified analyzer index is not built - Unify analyzer key normalization: empty string normalizes to "__default__", "none" stays distinct (critical for correct index selection) - Fix resolveAnalyzerIdentity silent exception: added logging for debugging P1 Fixes: - Eliminate double normalization in inverted_index_iterator.cpp - Simplify MatchPredicate analyzer logic with effectiveAnalyzerName method - Refactor Match subclasses using Template Method Pattern: added createInstance() abstract method, unified withChildren() in base class P2 Fixes: - Extract analyzerSqlFragment to InvertedIndexUtil.buildAnalyzerSqlFragment() - Improve error log level for user-specified analyzer not found Test: - Update regression test for cloud mode compatibility using isCloudMode() Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
run buildall |
Apply clang-format to fix spacing before inline comments. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
run buildall |
Use BUILD INDEX ON table (without index name) in cloud mode since cloud mode doesn't support specifying index name. Both modes now execute the full test including index building and post-build queries. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
run buildall |
TPC-H: Total hot run time: 32219 ms |
TPC-DS: Total hot run time: 172377 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
…alyzer When FE sends analyzer_name = "__default__" (which occurs when there's no index and no explicit analyzer specified), BE's create_analyzer function should treat it the same as an empty analyzer name and use the default builtin analyzer based on parser_type. Previously, this caused "Policy not found with name: __default__" errors because __default__ was being looked up as a custom analyzer policy. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
TPC-H: Total hot run time: 31666 ms |
TPC-DS: Total hot run time: 172462 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
…tive Normalize analyzer, normalizer, and policy names to lowercase for case-insensitive matching between table creation and query time. This ensures that users can use any case when specifying analyzer names (e.g., 'MyAnalyzer' vs 'myanalyzer') and they will match correctly. Changes: - FE: Normalize analyzer/normalizer names in InvertedIndexUtil - FE: Normalize policy names in IndexPolicyMgr using normalizeKey() - FE: Normalize explicit analyzer in MatchPredicate and Match expressions - BE: Add normalize_name() helper in IndexPolicyMgr for consistent lookup Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
This PR implements Multi-Analyzer Inverted Index feature, which allows creating multiple inverted indexes with different analyzers on a single column.
Key Features
MATCH ... USING ANALYZER analyzer_nameUse Cases
Example
Test Plan
inverted_index_parser_test.cpp,inverted_index_iterator_test.cpptest_multi_tokenize_index_not_built.groovyRelated Documentation