Conversation
📝 Walkthrough

Adds a new bigkey_analyzer tool: build integration, CMake target, Chinese README, and a C++ CLI that scans RocksDB instance(s) to report large keys, per-type sizes/TTLs, optional prefix statistics, and console/file output.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant CLI
    participant Analyzer
    participant RocksDB as "RocksDB(s)"
    participant Output as "Console / File"
    Note over CLI,Analyzer: Parse flags (path, min-size, top, prefix, type, out)
    User->>CLI: run bigkey_analyzer
    CLI->>Analyzer: pass Config
    Analyzer->>RocksDB: open instance(s) & obtain CF handles
    RocksDB-->>Analyzer: iterate entries (key, value, TTL/meta)
    Analyzer->>Analyzer: decode keys, compute size/TTL, filter, collect KeyInfo
    Analyzer->>Analyzer: sort results, truncate to top-N, compute prefix stats
    Analyzer->>Output: emit big-key list, prefix statistics, summary
    Output-->>User: display or write file
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 5
🧹 Nitpick comments (4)
tools/bigkey_analyzer/bigkey_analyzer.cc (4)
631-669: Consider using RAII for resource management.

The function uses manual memory management with raw pointers for the database and column family handles. If an exception occurs during analysis (lines 644-663), the cleanup code (lines 666-669) won't execute, leading to memory leaks.
🔎 Suggested approach using RAII
Consider wrapping the database and handles in smart pointers with custom deleters. Note that a capturing lambda cannot bind to a plain function-pointer deleter type, so `std::function` (or a dedicated deleter struct) is needed:

```cpp
// After opening the database successfully
std::unique_ptr<rocksdb::DB, std::function<void(rocksdb::DB*)>> db_ptr(
    db, [&handles](rocksdb::DB* d) {
      for (auto* handle : handles) {
        delete handle;
      }
      delete d;
    });
```

Alternatively, use a simpler scope guard pattern or ensure exception safety through other RAII mechanisms.
67-90: Consider simplifying KeyInfo constructors.

The struct defines five different constructors to handle various combinations of const references, rvalue references, and const char pointers. Modern C++ typically relies on fewer constructors with aggregate initialization or designated initializers.
💡 Simplified alternative
```cpp
struct KeyInfo {
  std::string type;
  std::string key;
  int64_t size;
  int64_t ttl;

  bool operator<(const KeyInfo& other) const {
    return size > other.size;  // Sort in descending order by size
  }
};

// Usage with aggregate initialization:
key_infos.emplace_back(KeyInfo{"string", std::move(display_key), size, ttl});
```
299-312: Consider extracting repeated key parsing logic.

The pattern of extracting the encoded user key from data keys using `SeekUserkeyDelim` is repeated six times across the codebase (in AnalyzeHashes, AnalyzeSets, AnalyzeZsets twice, and AnalyzeLists). Extracting this into a helper function would reduce duplication and improve maintainability.

💡 Example helper function

```cpp
// Returns empty string if extraction fails
std::string ExtractEncodedUserKey(const rocksdb::Slice& data_key) {
  const char* ptr = storage::SeekUserkeyDelim(data_key.data(), data_key.size());
  size_t user_key_len = ptr - data_key.data();
  if (user_key_len == 0 || user_key_len > data_key.size()) {
    return "";
  }
  return std::string(data_key.data(), user_key_len);
}
```

Then use it consistently:

```cpp
std::string encoded_user_key = ExtractEncodedUserKey(data_key);
if (encoded_user_key.empty()) continue;
auto it = hash_info.find(encoded_user_key);
```
37-44: ReplaceAll has potential performance considerations.

The current implementation calls `str.replace` in a loop, which can cause multiple memory reallocations for strings with many occurrences. For the current use case (escaping newlines and spaces in key names), this is likely acceptable. If performance becomes a concern with very large keys or many replacements, consider a single-pass approach:

💡 Alternative single-pass implementation

```cpp
std::string ReplaceAll(const std::string& str, const std::string& from, const std::string& to) {
  if (from.empty()) return str;
  std::string result;
  result.reserve(str.size());  // Pre-allocate approximate size
  size_t start_pos = 0;
  size_t found_pos;
  while ((found_pos = str.find(from, start_pos)) != std::string::npos) {
    result.append(str, start_pos, found_pos - start_pos);
    result.append(to);
    start_pos = found_pos + from.length();
  }
  result.append(str, start_pos, str.length() - start_pos);
  return result;
}
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- tools/CMakeLists.txt
- tools/bigkey_analyzer/CMakeLists.txt
- tools/bigkey_analyzer/README.md
- tools/bigkey_analyzer/bigkey_analyzer.cc
🧰 Additional context used
🧬 Code graph analysis (1)
tools/bigkey_analyzer/bigkey_analyzer.cc (1)
src/storage/include/storage/storage_define.h (1)
SeekUserkeyDelim(122-132)
🪛 GitHub Actions: Pika
tools/bigkey_analyzer/CMakeLists.txt
[error] 20-20: CMake error: Target "rocksdb" of type UTILITY may not be linked into another target. One may link only to INTERFACE, OBJECT, STATIC or SHARED libraries, or to executables with the ENABLE_EXPORTS property set.
[warning] 116-116: couldn't find clang-tidy.
[warning] 126-126: couldn't find clang-apply-replacements.
[warning] 797-797: configure_file called with unknown argument(s): src/pika_cache_load_thread.cc
tools/CMakeLists.txt
[warning] 116-116: couldn't find clang-tidy.
[warning] 126-126: couldn't find clang-apply-replacements.
[warning] 797-797: configure_file called with unknown argument(s): src/pika_cache_load_thread.cc
🪛 markdownlint-cli2 (0.18.1)
tools/bigkey_analyzer/README.md
29-29: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
93-93: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (go)
🔇 Additional comments (2)
tools/CMakeLists.txt (1)
3-3: LGTM!The addition of the bigkey_analyzer subdirectory follows the established pattern and correctly integrates the new tool into the build system.
tools/bigkey_analyzer/bigkey_analyzer.cc (1)
692-722: Well-designed database detection logic.The code properly handles both single database instances and multi-database setups (db/0, db/1, ...) by checking for the standard RocksDB
CURRENTmarker file. The detection logic is clear and includes appropriate error messages.
```cpp
case 'm':
  config.min_size = std::stoll(optarg);
  break;
case 't':
  config.top_n = std::stoi(optarg);
  break;
```
Add exception handling for numeric conversions.
The calls to std::stoll and std::stoi can throw std::invalid_argument or std::out_of_range exceptions when users provide invalid input (e.g., --min-size=abc or --top=xyz). Without exception handling, the program will terminate ungracefully.
🔎 Proposed fix with exception handling
```diff
 switch (opt) {
   case 'm':
-    config.min_size = std::stoll(optarg);
+    try {
+      config.min_size = std::stoll(optarg);
+    } catch (const std::exception& e) {
+      std::cerr << "Error: Invalid value for --min-size: " << optarg << std::endl;
+      return false;
+    }
     break;
   case 't':
-    config.top_n = std::stoi(optarg);
+    try {
+      config.top_n = std::stoi(optarg);
+    } catch (const std::exception& e) {
+      std::cerr << "Error: Invalid value for --top: " << optarg << std::endl;
+      return false;
+    }
     break;
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
case 'm':
  try {
    config.min_size = std::stoll(optarg);
  } catch (const std::exception& e) {
    std::cerr << "Error: Invalid value for --min-size: " << optarg << std::endl;
    return false;
  }
  break;
case 't':
  try {
    config.top_n = std::stoi(optarg);
  } catch (const std::exception& e) {
    std::cerr << "Error: Invalid value for --top: " << optarg << std::endl;
    return false;
  }
  break;
```
🤖 Prompt for AI Agents
In tools/bigkey_analyzer/bigkey_analyzer.cc around lines 143-148, the direct
calls to std::stoll(optarg) and std::stoi(optarg) can throw
std::invalid_argument or std::out_of_range for bad user input; wrap each
conversion in a try-catch that catches these exceptions, print a clear error to
std::cerr (including the offending optarg and which flag failed), and exit with
a non-zero code (or return from main) so the program fails gracefully;
alternatively, validate the string contains only digits (and optional leading
sign) before converting and handle out-of-range by limiting/validating ranges
and reporting the error to the user.
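The validate-before-converting alternative suggested in the prompt could look like the sketch below (`ParseInt64` is a hypothetical helper, not code from this PR; it relies on `strtoll`'s errno and end-pointer reporting instead of exceptions):

```cpp
#include <cerrno>
#include <cstdint>
#include <cstdlib>

// Parses a base-10 integer without throwing. Returns false for null/empty
// input, trailing junk, or out-of-range values (strtoll sets errno to ERANGE).
bool ParseInt64(const char* s, int64_t* out) {
  if (s == nullptr || *s == '\0') return false;
  errno = 0;
  char* end = nullptr;
  long long v = std::strtoll(s, &end, 10);
  if (errno == ERANGE || end == s || *end != '\0') return false;
  *out = static_cast<int64_t>(v);
  return true;
}
```

The option handler would then report `optarg` together with the flag name when `ParseInt64` returns false, keeping error handling local to the switch.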
```cpp
int64_t curtime;
db->GetEnv()->GetCurrentTime(&curtime).ok();
curtime *= 1000;  // Convert to milliseconds
```
Handle GetCurrentTime failure to prevent incorrect TTL calculations.
The return value of GetCurrentTime is checked with .ok() but the result is discarded. If the call fails, curtime remains uninitialized, leading to incorrect TTL calculations throughout the function. This same pattern appears in other analyze functions.
🔎 Proposed fix
```diff
 int64_t curtime;
-db->GetEnv()->GetCurrentTime(&curtime).ok();
+if (!db->GetEnv()->GetCurrentTime(&curtime).ok()) {
+  std::cerr << "Error: Failed to get current time" << std::endl;
+  return;
+}
 curtime *= 1000;  // Convert to milliseconds
```

Apply similar fixes to AnalyzeHashes (line 251), AnalyzeSets (line 333), AnalyzeZsets (line 408), and AnalyzeLists (line 503).
📝 Committable suggestion

```cpp
int64_t curtime;
if (!db->GetEnv()->GetCurrentTime(&curtime).ok()) {
  std::cerr << "Error: Failed to get current time" << std::endl;
  return;
}
curtime *= 1000;  // Convert to milliseconds
```
🤖 Prompt for AI Agents
In tools/bigkey_analyzer/bigkey_analyzer.cc around lines 191-193,
GetCurrentTime()'s Status is being discarded so curtime may remain uninitialized
on failure; change the code to capture the returned Status, check if
status.ok(), and if not, handle the error (e.g., log or return) and set curtime
to a safe default (0) before converting to milliseconds to avoid incorrect TTL
calculations; apply the same pattern to the corresponding calls in AnalyzeHashes
(around line 251), AnalyzeSets (around line 333), AnalyzeZsets (around line
408), and AnalyzeLists (around line 503).
| ``` | ||
| Usage: bigkey_analyzer [OPTIONS] <db_path> | ||
| Options: | ||
| --min-size=SIZE Only show keys larger than SIZE bytes | ||
| --top=N Only show top N largest keys | ||
| --prefix-stat Show statistics by key prefix | ||
| --prefix-delimiter=C Character used to delimit prefix (default: ':') | ||
| --type=TYPE Only analyze specific type (strings|hashes|lists|sets|zsets|all) | ||
| --output=FILE Write output to file instead of stdout | ||
| --help Display this help message | ||
| ``` |
Add language identifier to fenced code block.
The usage text block should specify a language identifier for proper syntax highlighting and better readability.
🔎 Proposed fix
````diff
-```
+```text
 Usage: bigkey_analyzer [OPTIONS] <db_path>
 Options:
````

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ``` | |
| Usage: bigkey_analyzer [OPTIONS] <db_path> | |
| Options: | |
| --min-size=SIZE Only show keys larger than SIZE bytes | |
| --top=N Only show top N largest keys | |
| --prefix-stat Show statistics by key prefix | |
| --prefix-delimiter=C Character used to delimit prefix (default: ':') | |
| --type=TYPE Only analyze specific type (strings|hashes|lists|sets|zsets|all) | |
| --output=FILE Write output to file instead of stdout | |
| --help Display this help message | |
| ``` |
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
29-29: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In tools/bigkey_analyzer/README.md around lines 29 to 39, the fenced code block
with the usage text lacks a language identifier; update the opening fence from
``` to ```text so it becomes ```text and leave the block content unchanged,
ensuring the usage snippet is syntax-highlighted as plain text in renders.
| ``` | ||
| ===== Big Key Analysis ===== | ||
| Type Size Key TTL | ||
| hash 1048576 user:profile:1001 -1 | ||
| zset 524288 ranking:global 3600 | ||
| string 262144 config:settings -1 | ||
| ... | ||
|
|
||
| ===== Key Prefix Statistics ===== | ||
| Prefix Count Total Size Avg Size | ||
| user 100 10485760 104857.6 | ||
| ranking 50 2621440 52428.8 | ||
| config 10 524288 52428.8 | ||
| ... | ||
|
|
||
| ===== Summary ===== | ||
| Total keys analyzed: 160 | ||
| Keys by type: | ||
| hash: 50 keys, 25.0 MB total, 524288.0 bytes avg | ||
| zset: 30 keys, 15.0 MB total, 524288.0 bytes avg | ||
| string: 80 keys, 10.0 MB total, 131072.0 bytes avg | ||
| ``` |
Add language identifier to fenced code block.
The example output block should specify a language identifier for proper syntax highlighting and better readability.
🔎 Proposed fix
````diff
-```
+```text
 ===== Big Key Analysis =====
 Type Size Key TTL
````

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
93-93: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In tools/bigkey_analyzer/README.md around lines 93 to 114, the fenced code block
lacks a language identifier so it doesn’t get proper syntax highlighting; update
the opening fence to include an appropriate language token (for example "text")
— i.e. change ``` to ```text — and ensure the closing fence remains unchanged so
the block renders with the specified language.
Actionable comments posted: 2
🧹 Nitpick comments (2)
tools/bigkey_analyzer/CMakeLists.txt (2)
1-4: Consider modern CMake multi-line string syntax.

While the backslash line continuations are valid, modern CMake style prefers unquoted multi-line arguments without backslashes for improved readability:

```cmake
set(WARNING_FLAGS
  -W -Wextra -Wall -Wsign-compare
  -Wno-unused-parameter -Wno-redundant-decls
  -Wwrite-strings -Wpointer-arith -Wreorder
  -Wswitch -Wsign-promo -Woverloaded-virtual
  -Wnon-virtual-dtor -Wno-missing-field-initializers
)
```
8-9: Prefer explicit source file lists over `aux_source_directory`.

`aux_source_directory` is discouraged in modern CMake because it prevents automatic reconfiguration when files are added/removed and obscures which files contribute to the build.

🔎 Recommended approach

```diff
-set(SRC_DIR .)
-aux_source_directory(${SRC_DIR} BASE_OBJS)
+set(BASE_OBJS
+  bigkey_analyzer.cc
+  # Add other source files explicitly
+)
```

Alternatively, if you must glob, use `CONFIGURE_DEPENDS`:

```cmake
file(GLOB BASE_OBJS CONFIGURE_DEPENDS "*.cc")
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tools/bigkey_analyzer/CMakeLists.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Analyze (go)
- GitHub Check: build_on_macos
- GitHub Check: build_on_centos
- GitHub Check: build_on_ubuntu
```cmake
target_link_libraries(bigkey_analyzer
  storage
  pthread
)
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Check RocksDB linking in storage library and other tools
echo "=== Check storage library's RocksDB dependency linking ==="
find . -path ./tools -prune -o -name "CMakeLists.txt" -type f -print | grep -E "src/storage|storage" | while read f; do
  if [ -f "$f" ]; then
    echo "File: $f"
    cat -n "$f" | grep -A 10 -B 2 "target_link_libraries"
  fi
done
echo -e "\n=== Check CMakeLists.txt files in tools directory ==="
find tools -name "CMakeLists.txt" -type f | while read f; do
  echo -e "\n--- $f ---"
  cat -n "$f" | grep -A 8 -B 1 "target_link_libraries"
done
```

Repository: OpenAtomFoundation/pikiwidb
Length of output: 10408
Add explicit ${ROCKSDB_LIBRARY} link for consistency with other tools.
Although storage provides ${ROCKSDB_LIBRARY} as a PUBLIC dependency, all similar tools in the project—including txt_to_pika, pika_to_txt, benchmark_client, and pika_port—explicitly link ${ROCKSDB_LIBRARY} alongside storage. Update to:
```cmake
target_link_libraries(bigkey_analyzer
  storage
  ${ROCKSDB_LIBRARY}
  pthread
)
```
🤖 Prompt for AI Agents
In tools/bigkey_analyzer/CMakeLists.txt around lines 20 to 23,
target_link_libraries only lists storage and pthread but other tools explicitly
link ${ROCKSDB_LIBRARY}; add ${ROCKSDB_LIBRARY} to the target_link_libraries
call so it becomes: target_link_libraries(bigkey_analyzer storage
${ROCKSDB_LIBRARY} pthread) to match project consistency.
```cmake
set_target_properties(bigkey_analyzer PROPERTIES
  RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}
  CMAKE_COMPILER_IS_GNUCXX TRUE
  COMPILE_FLAGS ${CXXFLAGS}
)
```
🛠️ Refactor suggestion | 🟠 Major
Replace deprecated COMPILE_FLAGS and remove incorrect CMAKE_COMPILER_IS_GNUCXX property.
Two issues:

- `CMAKE_COMPILER_IS_GNUCXX` is a CMake-detected variable, not a settable target property; setting it here has no effect.
- The `COMPILE_FLAGS` property has been deprecated since CMake 3.0 in favor of `target_compile_options()`.
🔎 Recommended fix
```diff
+target_compile_options(bigkey_analyzer PRIVATE ${WARNING_FLAGS})
+target_compile_features(bigkey_analyzer PRIVATE cxx_std_17)
+
 set_target_properties(bigkey_analyzer PROPERTIES
   RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}
-  CMAKE_COMPILER_IS_GNUCXX TRUE
-  COMPILE_FLAGS ${CXXFLAGS}
 )
```

Also remove or simplify the CXXFLAGS variable at line 6 since flags are now applied via target_compile_options and C++17 via target_compile_features.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In tools/bigkey_analyzer/CMakeLists.txt around lines 25-29, the target
properties block incorrectly sets CMAKE_COMPILER_IS_GNUCXX (a CMake-detected
variable that should not be set) and uses the deprecated COMPILE_FLAGS property;
remove the CMAKE_COMPILER_IS_GNUCXX line entirely, drop the COMPILE_FLAGS
property, and instead apply any desired flags with
target_compile_options(bigkey_analyzer PRIVATE ${CXXFLAGS}) and declare required
C++ standard via target_compile_features(bigkey_analyzer PRIVATE cxx_std_17);
also remove or simplify the CXXFLAGS variable at line 6 since flags are now
applied via target_compile_options and C++17 via target_compile_features.
Actionable comments posted: 4
♻️ Duplicate comments (6)
tools/bigkey_analyzer/bigkey_analyzer.cc (6)
144-149: Add exception handling for numeric conversions.

This is a duplicate of a previously raised issue. The calls to `std::stoll` and `std::stoi` can throw `std::invalid_argument` or `std::out_of_range` exceptions for invalid input. Without exception handling, the program will terminate ungracefully.
192-194: Handle GetCurrentTime failure to prevent incorrect TTL calculations.

This is a duplicate of a previously raised issue. If `GetCurrentTime` fails, `curtime` remains uninitialized, leading to incorrect TTL calculations.

251-253: Handle GetCurrentTime failure to prevent incorrect TTL calculations.

This is a duplicate of a previously raised issue. The same GetCurrentTime handling problem exists here.

333-335: Handle GetCurrentTime failure to prevent incorrect TTL calculations.

This is a duplicate of a previously raised issue.

408-410: Handle GetCurrentTime failure to prevent incorrect TTL calculations.

This is a duplicate of a previously raised issue.

504-506: Handle GetCurrentTime failure to prevent incorrect TTL calculations.

This is a duplicate of a previously raised issue.
🧹 Nitpick comments (2)
tools/bigkey_analyzer/bigkey_analyzer.cc (2)
136-140: Remove redundant initialization.

These assignments duplicate the default member initializers already defined in the `Config` struct (lines 107-111). While not harmful, removing them reduces redundancy.

🔎 Proposed cleanup

```diff
 int opt;
 int option_index = 0;
-config.min_size = 0;
-config.top_n = -1;
-config.prefix_stat = false;
-config.prefix_delimiter = ":";
-config.type_filter = "all";
-
 while ((opt = getopt_long(argc, argv, "m:t:pd:y:o:h", long_options, &option_index)) != -1) {
```
187-571: Consider refactoring duplicate analysis logic.

The five Analyze* functions (AnalyzeStrings, AnalyzeHashes, AnalyzeSets, AnalyzeZsets, AnalyzeLists) share substantial structural similarity:
- GetCurrentTime call
- Metadata scanning with type filtering
- Staleness and count checks
- TTL calculation
- Size accumulation
While each type requires slightly different parsing logic and column family access, extracting common patterns into helper functions or templates could reduce duplication and improve maintainability. Given this is a diagnostic tool where clarity is valued, this refactoring is recommended but not essential.
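As a rough illustration of the shape such a refactor could take (a hypothetical `Entry` type and `ScanMeta` helper, simplified far below the PR's actual RocksDB iteration; the per-type differences would be injected as callables):

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Simplified stand-in for per-key results gathered during a scan.
struct Entry {
  std::string key;
  int64_t size;
  int64_t ttl;  // -1 means no expiry
};

// Shared scan skeleton: iterate a source of entries and keep those the
// type-specific predicate accepts. In the real tool the source would be a
// RocksDB iterator and the predicate would encode staleness/min-size checks.
template <typename Source>
std::vector<Entry> ScanMeta(const Source& src,
                            const std::function<bool(const Entry&)>& keep) {
  std::vector<Entry> out;
  for (const Entry& e : src) {
    if (keep(e)) out.push_back(e);
  }
  return out;
}
```

Each Analyze* function would then supply only its own parsing and filtering callable, leaving TTL and size bookkeeping in one place.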
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tools/bigkey_analyzer/bigkey_analyzer.cc
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Analyze (go)
- GitHub Check: build_on_macos
- GitHub Check: build_on_centos
- GitHub Check: build_on_ubuntu
```cpp
const char* ptr = storage::SeekUserkeyDelim(data_key.data(), data_key.size());
size_t user_key_len = ptr - data_key.data();

if (user_key_len == 0 || user_key_len > data_key.size()) continue;
```
Strengthen validation for SeekUserkeyDelim result.
SeekUserkeyDelim has a TODO comment in its implementation about handling invalid formats, and it may return an invalid pointer if the key format is malformed. Line 303 performs pointer arithmetic (ptr - data_key.data()) that could produce undefined behavior if ptr points outside the valid buffer range. While Line 305 checks user_key_len > data_key.size(), this doesn't fully protect against ptr being arbitrary.
Consider adding a check that ptr is within bounds before computing user_key_len:
🔎 Proposed fix with bounds checking
```diff
 const char* ptr = storage::SeekUserkeyDelim(data_key.data(), data_key.size());
+if (ptr < data_key.data() || ptr > data_key.data() + data_key.size()) {
+  continue;  // Invalid pointer, skip this key
+}
 size_t user_key_len = ptr - data_key.data();
 if (user_key_len == 0 || user_key_len > data_key.size()) continue;
```

Apply the same fix to AnalyzeSets (line 377), AnalyzeZsets (lines 454, 474), and AnalyzeLists (line 549).
📝 Committable suggestion
```cpp
const char* ptr = storage::SeekUserkeyDelim(data_key.data(), data_key.size());
if (ptr < data_key.data() || ptr > data_key.data() + data_key.size()) {
  continue;  // Invalid pointer, skip this key
}
size_t user_key_len = ptr - data_key.data();
if (user_key_len == 0 || user_key_len > data_key.size()) continue;
```
🤖 Prompt for AI Agents
In tools/bigkey_analyzer/bigkey_analyzer.cc around lines 302-305, the code
computes user_key_len using ptr returned from SeekUserkeyDelim without first
validating that ptr lies within the data_key buffer, which can cause undefined
behaviour for malformed keys; before doing pointer arithmetic, verify ptr is
non-null and that ptr >= data_key.data() && ptr <= data_key.data() +
data_key.size() (or ptr < end depending on intended delimiter semantics), and if
the check fails continue; apply the same pointer-bounds validation pattern to
the corresponding locations in AnalyzeSets (around line 377), AnalyzeZsets
(around lines 454 and 474), and AnalyzeLists (around line 549).
```cpp
rocksdb::Status status = rocksdb::DB::OpenForReadOnly(db_options, db_path,
                                                      column_families, &handles, &db);

if (!status.ok()) {
  std::cerr << "Error opening database at " << db_path << ": " << status.ToString() << std::endl;
  return;
}

std::cout << "Analyzing database at " << db_path << std::endl;

// Analyze each type
if (config.type_filter == "all" || config.type_filter == "strings") {
  AnalyzeStrings(db, handles[storage::kMetaCF], key_infos, config);
```
Verify column family handles before access.
If the database is missing expected column families, OpenForReadOnly may succeed but return fewer handles than expected. Accessing handles[storage::kMetaCF] or other indexes without verifying handles.size() could cause out-of-bounds access.
🔎 Proposed fix with handle count verification
rocksdb::Status status = rocksdb::DB::OpenForReadOnly(db_options, db_path,
column_families, &handles, &db);
if (!status.ok()) {
std::cerr << "Error opening database at " << db_path << ": " << status.ToString() << std::endl;
return;
}
+
+ if (handles.size() != cf_names.size()) {
+ std::cerr << "Error: Expected " << cf_names.size() << " column families but got "
+ << handles.size() << std::endl;
+ for (auto handle : handles) {
+ delete handle;
+ }
+ delete db;
+ return;
+ }
std::cout << "Analyzing database at " << db_path << std::endl;There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (6)
tools/bigkey_analyzer/bigkey_analyzer.cc (6)
48-52: Add error handling or validation for DecodeUserKey to handle malformed encoded keys.

The `storage::DecodeUserKey` function has no error indication mechanism, and this wrapper at lines 48-52 ignores any potential corruption for malformed input (e.g., missing delimiter). This can lead to silent data corruption if encoded keys lack the expected format. The function is called at lines 212, 321, 396, 493, and 568.
144-149: Add exception handling for numeric conversions.

The calls to `std::stoll` (line 145) and `std::stoi` (line 148) can throw `std::invalid_argument` or `std::out_of_range` exceptions when users provide invalid input (e.g., --min-size=abc or --top=xyz). Without exception handling, the program will terminate ungracefully.

🔎 Proposed fix with exception handling

```diff
 switch (opt) {
   case 'm':
-    config.min_size = std::stoll(optarg);
+    try {
+      config.min_size = std::stoll(optarg);
+    } catch (const std::exception& e) {
+      std::cerr << "Error: Invalid value for --min-size: " << optarg << std::endl;
+      return false;
+    }
     break;
   case 't':
-    config.top_n = std::stoi(optarg);
+    try {
+      config.top_n = std::stoi(optarg);
+    } catch (const std::exception& e) {
+      std::cerr << "Error: Invalid value for --top: " << optarg << std::endl;
+      return false;
+    }
     break;
```
192-194: Handle GetCurrentTime failure to prevent incorrect TTL calculations.

The return value of `GetCurrentTime` is checked with `.ok()` but the result is discarded. If the call fails, `curtime` remains uninitialized, leading to incorrect TTL calculations. This same pattern appears in AnalyzeHashes (line 253), AnalyzeSets (line 336), AnalyzeZsets (line 412), and AnalyzeLists (line 508).

🔎 Proposed fix

```diff
 int64_t curtime;
-db->GetEnv()->GetCurrentTime(&curtime).ok();
+if (!db->GetEnv()->GetCurrentTime(&curtime).ok()) {
+  std::cerr << "Error: Failed to get current time" << std::endl;
+  return;
+}
 curtime *= 1000;  // Convert to milliseconds
```

Apply similar fixes to AnalyzeHashes (line 253), AnalyzeSets (line 336), AnalyzeZsets (line 412), and AnalyzeLists (line 508).
304-307: Strengthen validation for SeekUserkeyDelim result.

`SeekUserkeyDelim` may return an invalid pointer if the key format is malformed (the implementation has a TODO about this). Line 305 performs pointer arithmetic (`ptr - data_key.data()`) that could produce undefined behavior if `ptr` points outside the valid buffer range. While line 307 checks `user_key_len > data_key.size()`, this doesn't fully protect against `ptr` being arbitrary.

🔎 Proposed fix with bounds checking

```diff
 const char* ptr = storage::SeekUserkeyDelim(data_key.data(), data_key.size());
+if (ptr < data_key.data() || ptr > data_key.data() + data_key.size()) {
+  continue;  // Invalid pointer, skip this key
+}
 size_t user_key_len = ptr - data_key.data();
 if (user_key_len == 0 || user_key_len > data_key.size()) continue;
```

Apply the same fix to AnalyzeSets (line 380), AnalyzeZsets (lines 457, 477), and AnalyzeLists (line 552).
639-651: Verify column family handles before access.

If the database is missing expected column families, `OpenForReadOnly` may succeed but return fewer handles than expected. Accessing `handles[storage::kMetaCF]` (line 651) or other indexes without verifying `handles.size()` could cause out-of-bounds access.

🔎 Proposed fix with handle count verification

```diff
 rocksdb::Status status = rocksdb::DB::OpenForReadOnly(db_options, db_path,
                                                       column_families, &handles, &db);
 if (!status.ok()) {
   std::cerr << "Error opening database at " << db_path << ": " << status.ToString() << std::endl;
   return;
 }
+
+if (handles.size() != cf_names.size()) {
+  std::cerr << "Error: Expected " << cf_names.size() << " column families but got "
+            << handles.size() << std::endl;
+  for (auto handle : handles) {
+    delete handle;
+  }
+  delete db;
+  return;
+}
 std::cout << "Analyzing database at " << db_path << std::endl;
```
704-704: CURRENT is a file, not a directory.

Line 704 checks `DirectoryExists(test_path + "/CURRENT")`, but CURRENT is a RocksDB metadata file, not a directory. This check will always fail even when a valid single database exists. The same issue occurs at line 713.

🔎 Proposed fix to check for file existence

First, add a helper function to check file existence:

```cpp
bool FileExists(const std::string& path) {
  struct stat st;
  return stat(path.c_str(), &st) == 0 && S_ISREG(st.st_mode);
}
```

Then fix lines 704 and 713:

```diff
-if (DirectoryExists(test_path + "/CURRENT")) {
+if (FileExists(test_path + "/CURRENT")) {
   // This is a single database instance
   db_paths.push_back(test_path);
   std::cout << "Detected single database instance" << std::endl;
```

```diff
-if (DirectoryExists(db_inst_path) && DirectoryExists(db_inst_path + "/CURRENT")) {
+if (DirectoryExists(db_inst_path) && FileExists(db_inst_path + "/CURRENT")) {
   db_paths.push_back(db_inst_path);
   db_index++;
```
🧹 Nitpick comments (1)
tools/bigkey_analyzer/bigkey_analyzer.cc (1)
296-315: Add iterator status checks after data scans.

While AnalyzeStrings checks iterator status at lines 241-243, the other analyze functions (AnalyzeHashes, AnalyzeSets, AnalyzeZsets, AnalyzeLists) don't verify iterator status after scanning data column families. Missing status checks could hide corruption or I/O errors during iteration.

🔎 Proposed fix pattern

Add status checks after each data iterator loop. For AnalyzeHashes (after line 315):

```diff
  }
+
+ if (!data_iter->status().ok()) {
+   std::cerr << "Error iterating hash data: " << data_iter->status().ToString() << std::endl;
+ }

  // Add results
```

Apply similar checks to:
- AnalyzeSets after line 391
- AnalyzeZsets after line 468 (data_iter) and line 488 (score_iter)
- AnalyzeLists after line 563
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tools/bigkey_analyzer/bigkey_analyzer.cc
🧰 Additional context used
🧬 Code graph analysis (1)
tools/bigkey_analyzer/bigkey_analyzer.cc (1)
src/storage/include/storage/storage_define.h (1)
SeekUserkeyDelim(122-132)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Analyze (go)
- GitHub Check: build_on_ubuntu
- GitHub Check: build_on_macos
- GitHub Check: build_on_centos
Actionable comments posted: 2
♻️ Duplicate comments (6)
tools/bigkey_analyzer/bigkey_analyzer.cc (6)
146-151: Add exception handling for numeric conversions.

The calls to `std::stoll` and `std::stoi` can throw `std::invalid_argument` or `std::out_of_range` exceptions for invalid user input (e.g., `--min-size=abc`). Without try-catch blocks, the program will terminate ungracefully.

🔎 Proposed fix with exception handling

```diff
  switch (opt) {
    case 'm':
-     config.min_size = std::stoll(optarg);
+     try {
+       config.min_size = std::stoll(optarg);
+     } catch (const std::exception& e) {
+       std::cerr << "Error: Invalid value for --min-size: " << optarg << std::endl;
+       return false;
+     }
      break;
    case 't':
-     config.top_n = std::stoi(optarg);
+     try {
+       config.top_n = std::stoi(optarg);
+     } catch (const std::exception& e) {
+       std::cerr << "Error: Invalid value for --top: " << optarg << std::endl;
+       return false;
+     }
      break;
```
194-196: Handle GetCurrentTime failure to prevent incorrect TTL calculations.
`GetCurrentTime`'s return status is checked with `.ok()` but the result is discarded. If the call fails, `curtime` remains uninitialized, leading to incorrect TTL calculations. This pattern appears in all five analyze functions.

🔎 Proposed fix

```diff
  int64_t curtime;
- db->GetEnv()->GetCurrentTime(&curtime).ok();
+ if (!db->GetEnv()->GetCurrentTime(&curtime).ok()) {
+   std::cerr << "Error: Failed to get current time" << std::endl;
+   return;
+ }
  curtime *= 1000;  // Convert to milliseconds
```

Apply this fix to:
- AnalyzeStrings (lines 194-196)
- AnalyzeHashes (lines 254-256)
- AnalyzeSets (lines 337-339)
- AnalyzeZsets (lines 413-415)
- AnalyzeLists (lines 511-513)
Also applies to: 254-256, 337-339, 413-415, 511-513
50-54: DecodeUserKey has no error indication mechanism.
`storage::DecodeUserKey` silently produces incorrect output for malformed encoded keys (e.g., missing delimiter). The wrapper at lines 50-52 returns the potentially corrupted `user_key` to all five call sites (lines 214, 323, 398, 497, 574), which can lead to silent data corruption in the output.
306-309: Strengthen validation for SeekUserkeyDelim result.
`SeekUserkeyDelim` may return an invalid pointer for malformed keys. The pointer arithmetic `ptr - data_key.data()` at lines 307, 383, and 460 could produce undefined behavior if `ptr` points outside the valid buffer range. While the code checks `user_key_len > data_key.size()`, this doesn't protect against `ptr` being arbitrary.

🔎 Proposed fix with bounds checking

```diff
  const char* ptr = storage::SeekUserkeyDelim(data_key.data(), data_key.size());
+ if (ptr < data_key.data() || ptr > data_key.data() + data_key.size()) {
+   continue;  // Invalid pointer, skip this key
+ }
  size_t user_key_len = ptr - data_key.data();
  if (user_key_len == 0 || user_key_len > data_key.size()) continue;
```

Apply to AnalyzeHashes (line 306), AnalyzeSets (line 382), and AnalyzeZsets (line 459).
Also applies to: 382-385, 459-462
645-651: Verify column family handles before access.

If the database is missing expected column families, `OpenForReadOnly` may succeed but return fewer handles than expected. Accessing `handles[storage::kMetaCF]` or other indexes without verifying `handles.size()` could cause out-of-bounds access.

🔎 Proposed fix with handle count verification

```diff
  rocksdb::Status status = rocksdb::DB::OpenForReadOnly(db_options, db_path, column_families, &handles, &db);
  if (!status.ok()) {
    std::cerr << "Error opening database at " << db_path << ": " << status.ToString() << std::endl;
    return;
  }
+
+ if (handles.size() != cf_names.size()) {
+   std::cerr << "Error: Expected " << cf_names.size() << " column families but got "
+             << handles.size() << std::endl;
+   for (auto handle : handles) {
+     delete handle;
+   }
+   delete db;
+   return;
+ }
  std::cout << "Analyzing database at " << db_path << std::endl;
```
710-710: CURRENT is a file, not a directory.

Lines 710 and 719 check `DirectoryExists(path + "/CURRENT")`, but `CURRENT` is a RocksDB metadata file, not a directory. These checks will always fail even when a valid database exists.

🔎 Proposed fix to check for file existence

First, add a helper function after `DirectoryExists`:

```cpp
bool FileExists(const std::string& path) {
  struct stat st;
  return stat(path.c_str(), &st) == 0 && S_ISREG(st.st_mode);
}
```

Then fix lines 710 and 719:

```diff
- if (DirectoryExists(test_path + "/CURRENT")) {
+ if (FileExists(test_path + "/CURRENT")) {
    // This is a single database instance
    db_paths.push_back(test_path);
    std::cout << "Detected single database instance" << std::endl;
```

```diff
- if (DirectoryExists(db_inst_path) && DirectoryExists(db_inst_path + "/CURRENT")) {
+ if (DirectoryExists(db_inst_path) && FileExists(db_inst_path + "/CURRENT")) {
    db_paths.push_back(db_inst_path);
    db_index++;
```

Also applies to: 719-719
🧹 Nitpick comments (4)
tools/bigkey_analyzer/bigkey_analyzer.cc (4)
155-157: Consider validating prefix_delimiter.

Users can specify an empty string as the prefix delimiter via `--prefix-delimiter=""`. At line 584, `GetKeyPrefix` uses `key.find(delimiter)`. An empty delimiter would cause `find` to return 0, making all prefixes empty strings. This might not be the intended behavior.

🔎 Proposed validation

```diff
  case 'd':
+   if (std::string(optarg).empty()) {
+     std::cerr << "Error: prefix-delimiter cannot be empty" << std::endl;
+     return false;
+   }
    config.prefix_delimiter = optarg;
    break;
```
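The empty-delimiter pitfall described above is easy to confirm in isolation. The helper below is a hypothetical stand-in for the tool's `GetKeyPrefix` (name and exact behavior are assumptions based on the review comment, not the tool's actual code):

```cpp
#include <string>

// Hypothetical sketch of a GetKeyPrefix-style helper: return everything
// before the first occurrence of `delimiter`, or the whole key when the
// delimiter is absent.
std::string KeyPrefix(const std::string& key, const std::string& delimiter) {
  size_t pos = key.find(delimiter);
  return pos == std::string::npos ? key : key.substr(0, pos);
}
```

With an empty delimiter, `std::string::find` returns 0, so every key collapses into a single empty-string prefix bucket — which is why rejecting `--prefix-delimiter=""` up front is the safer choice.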
158-160: Validate type_filter values.

Users can specify any string for `--type`, but only "all", "strings", "hashes", "lists", "sets", and "zsets" are valid. An invalid value like `--type=invalid` will silently skip all analysis, producing empty results without any error message.

🔎 Proposed validation

```diff
  case 'y':
+   {
+     const std::vector<std::string> valid_types = {"all", "strings", "hashes", "lists", "sets", "zsets"};
+     if (std::find(valid_types.begin(), valid_types.end(), optarg) == valid_types.end()) {
+       std::cerr << "Error: Invalid type. Must be one of: all, strings, hashes, lists, sets, zsets" << std::endl;
+       return false;
+     }
+   }
    config.type_filter = optarg;
    break;
```
749-782: Check for output stream write failures.

The code writes to `*out` at lines 750-782 but never checks if the writes succeed. If the output file runs out of disk space or encounters an I/O error, writes will silently fail and the stream's error state will be set.

🔎 Proposed error checking

Add a check before returning:

```diff
    *out << "  " << entry.first << ": " << entry.second << " keys, "
         << mb_size << " MB total, " << avg_size << " bytes avg\n";
  }
+ if (out->fail()) {
+   std::cerr << "Error: Failed to write output" << std::endl;
+   return 1;
+ }
+
  return 0;
}
```
138-142: Redundant initialization of Config fields.

Lines 138-142 initialize `config` fields to the same values already specified as default member initializers in the `Config` struct at lines 109-113. This initialization is redundant.

🔎 Suggested simplification

Remove the redundant initialization:

```diff
  int opt;
  int option_index = 0;
- config.min_size = 0;
- config.top_n = -1;
- config.prefix_stat = false;
- config.prefix_delimiter = ":";
- config.type_filter = "all";
-
  while ((opt = getopt_long(argc, argv, "m:t:pd:y:o:h", long_options, &option_index)) != -1) {
```
```cpp
  // Add results
  for (const auto& entry : hash_info) {
    int64_t size = std::get<0>(entry.second);
    if (size >= config.min_size) {
      std::string user_key = DecodeUserKey(entry.first);
      std::string display_key = ReplaceAll(user_key, "\n", "\\n");
      display_key = ReplaceAll(display_key, " ", "\\x20");
      key_infos.emplace_back("hash", std::move(display_key), size, std::get<1>(entry.second));
    }
  }
}
```
Add iterator status checks after iteration.
AnalyzeHashes, AnalyzeSets, AnalyzeZsets, and AnalyzeLists don't check iterator status after completing their loops. If an iteration error occurs (e.g., I/O error reading from RocksDB), it will be silently ignored. AnalyzeStrings correctly checks status at lines 243-245.
🔎 Proposed fix
Add status checks after each iterator completes. For AnalyzeHashes:
```diff
  }
+
+ if (!meta_iter->status().ok()) {
+   std::cerr << "Error iterating hashes metadata: " << meta_iter->status().ToString() << std::endl;
+ }
+
+ if (!data_iter->status().ok()) {
+   std::cerr << "Error iterating hashes data: " << data_iter->status().ToString() << std::endl;
+ }
+
  // Add results
  for (const auto& entry : hash_info) {
```

Apply similar checks in:
- AnalyzeSets after lines 393 (both iterators)
- AnalyzeZsets after lines 470, 492 (all three iterators)
- AnalyzeLists after lines 569 (both iterators)
Also applies to: 393-404, 492-503, 569-580
```cpp
  std::vector<rocksdb::ColumnFamilyHandle*> handles;
  rocksdb::DB* db;
  rocksdb::Status status = rocksdb::DB::OpenForReadOnly(db_options, db_path,
                                                        column_families, &handles, &db);

  if (!status.ok()) {
    std::cerr << "Error opening database at " << db_path << ": " << status.ToString() << std::endl;
    return;
  }

  std::cout << "Analyzing database at " << db_path << std::endl;

  // Analyze each type
  if (config.type_filter == "all" || config.type_filter == "strings") {
    AnalyzeStrings(db, handles[storage::kMetaCF], key_infos, config);
  }

  if (config.type_filter == "all" || config.type_filter == "hashes") {
    AnalyzeHashes(db, handles[storage::kMetaCF], handles[storage::kHashesDataCF], key_infos, config);
  }

  if (config.type_filter == "all" || config.type_filter == "sets") {
    AnalyzeSets(db, handles[storage::kMetaCF], handles[storage::kSetsDataCF], key_infos, config);
  }

  if (config.type_filter == "all" || config.type_filter == "zsets") {
    AnalyzeZsets(db, handles[storage::kMetaCF], handles[storage::kZsetsDataCF],
                 handles[storage::kZsetsScoreCF], key_infos, config);
  }

  if (config.type_filter == "all" || config.type_filter == "lists") {
    AnalyzeLists(db, handles[storage::kMetaCF], handles[storage::kListsDataCF], key_infos, config);
  }

  // Cleanup
  for (auto handle : handles) {
    delete handle;
  }
  delete db;
```
🛠️ Refactor suggestion | 🟠 Major
Use RAII for resource cleanup to prevent leaks.
Lines 678-681 manually delete column family handles and the database. If any of the Analyze* functions (called at lines 657-674) throw an exception, these resources will leak. The try-catch blocks at lines 480-491 and 557-568 show that exceptions are possible.
🔎 Proposed RAII-based fix
Use `std::unique_ptr` with a custom deleter:

```diff
  std::vector<rocksdb::ColumnFamilyHandle*> handles;
  rocksdb::DB* db;
  rocksdb::Status status = rocksdb::DB::OpenForReadOnly(db_options, db_path,
                                                        column_families, &handles, &db);
  if (!status.ok()) {
    std::cerr << "Error opening database at " << db_path << ": " << status.ToString() << std::endl;
    return;
  }
+ // RAII wrapper for cleanup
+ auto cleanup = [&handles, &db]() {
+   for (auto handle : handles) {
+     delete handle;
+   }
+   delete db;
+ };
+ std::unique_ptr<void, decltype(cleanup)> guard(reinterpret_cast<void*>(1), cleanup);
+
  std::cout << "Analyzing database at " << db_path << std::endl;
  // Analyze each type
  ...
- // Cleanup
- for (auto handle : handles) {
-   delete handle;
- }
- delete db;
+ // Cleanup happens automatically via guard destructor
}
```

Alternatively, create a RAII wrapper class for RocksDB resources.
Actionable comments posted: 2
♻️ Duplicate comments (5)
tools/bigkey_analyzer/bigkey_analyzer.cc (5)
146-151: Add exception handling for numeric conversions.

The calls to `std::stoll` and `std::stoi` can throw `std::invalid_argument` or `std::out_of_range` exceptions when users provide invalid input. This issue was flagged in a previous review and should be addressed.
194-196: Handle GetCurrentTime failure to prevent incorrect TTL calculations.

The return value of `GetCurrentTime` is checked but discarded. If the call fails, `curtime` remains uninitialized. This was flagged in a previous review and applies to all Analyze* functions.
317-329: Add iterator status checks after iteration loops.

Both `meta_iter` and `data_iter` lack status checks after their loops complete. I/O errors during iteration would be silently ignored. This was flagged in a previous review.
643-651: Verify column family handles before access and use RAII for cleanup.

Two issues previously flagged:

- If the database is missing expected column families, `handles.size()` may be less than expected, causing out-of-bounds access.
- Manual cleanup at lines 678-681 will leak resources if any Analyze* function throws.
709-709: CURRENT is a file, not a directory.
`DirectoryExists` is used to check for the `CURRENT` file, but `CURRENT` is a RocksDB metadata file, not a directory. This check will always fail even when a valid database exists. This was flagged in a previous review.

Also applies to: 717-717, 730-730
🧹 Nitpick comments (3)
tools/bigkey_analyzer/bigkey_analyzer.cc (3)
5-31: Consider grouping includes by category for better organization.

Standard practice is to group includes: standard library first, then third-party libraries (RocksDB), then project headers. This improves readability and helps identify dependencies at a glance.
🔎 Suggested organization
```diff
-#include <iostream>
-#include <fstream>
-#include <string>
-#include <vector>
-#include <algorithm>
-#include <unordered_map>
-#include <memory>
-#include <ctime>
-#include <sys/stat.h>
-#include <getopt.h>
-
-#include "rocksdb/options.h"
-#include "rocksdb/db.h"
-#include "rocksdb/env.h"
-#include "rocksdb/iterator.h"
-#include "rocksdb/slice.h"
-#include "rocksdb/status.h"
-
-#include "storage/storage_define.h"
-#include "src/storage/src/base_value_format.h"
+// Standard library
+#include <algorithm>
+#include <ctime>
+#include <fstream>
+#include <iostream>
+#include <memory>
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+// System headers
+#include <getopt.h>
+#include <sys/stat.h>
+
+// Third-party (RocksDB)
+#include "rocksdb/db.h"
+#include "rocksdb/env.h"
+#include "rocksdb/iterator.h"
+#include "rocksdb/options.h"
+#include "rocksdb/slice.h"
+#include "rocksdb/status.h"
+
+// Project headers
+#include "storage/storage_define.h"
+#include "src/storage/src/base_value_format.h"
```
488-491: Avoid catching all exceptions with `catch (...)`.

Using `catch (...)` silently swallows all exceptions including system errors. Consider catching specific exception types or at minimum logging that an exception occurred for debugging purposes.

🔎 Proposed improvement

```diff
  try {
    storage::ParsedZSetsScoreKey parsed_key(score_key);
    std::string encoded_user_key = parsed_key.key().ToString();
    auto it = zset_info.find(encoded_user_key);
    if (it != zset_info.end()) {
      std::get<0>(it->second) += score_key.size() + score_value.size();
    }
- } catch (...) {
-   // Skip malformed keys
+ } catch (const std::exception& e) {
+   // Skip malformed keys - could add debug logging here
    continue;
  }
```

Same applies to the catch block in AnalyzeLists (lines 565-568).
684-684: Missing space before opening brace in main signature.

Minor style: `int main(int argc, char *argv[])` should have consistent spacing. Consider `int main(int argc, char* argv[])` for consistency with other pointer declarations in the codebase.
```cpp
  // Output summary
  *out << "\n===== Summary =====\n";
  *out << "Total keys analyzed: " << key_infos.size() << "\n";
```
Summary may be misleading after top_n filtering.
After resize(config.top_n) at line 758, the summary reports "Total keys analyzed" based on the truncated vector. If a user has 10,000 keys but uses --top=100, the summary will say "Total keys analyzed: 100", which doesn't reflect the actual number of keys scanned.
🔎 Proposed fix
Track the original count before truncation:
```diff
+ size_t total_keys_found = key_infos.size();
+
  // Limit to top N if requested
  if (config.top_n > 0 && config.top_n < static_cast<int>(key_infos.size())) {
    key_infos.resize(config.top_n);
  }
  // ...
  // Output summary
  *out << "\n===== Summary =====\n";
- *out << "Total keys analyzed: " << key_infos.size() << "\n";
+ *out << "Total keys found: " << total_keys_found << "\n";
+ if (config.top_n > 0) {
+   *out << "Showing top " << key_infos.size() << " keys\n";
+ }
```

Committable suggestion skipped: line range outside the PR's diff.
Actionable comments posted: 1
♻️ Duplicate comments (11)
tools/bigkey_analyzer/README.md (2)
33-43: Add language identifier to fenced code block.

The usage text block should specify `text` as the language identifier for proper syntax highlighting.

🔎 Proposed fix

````diff
-```
+```text
 Usage: bigkey_analyzer [OPTIONS] <db_path>

 Options:
````
97-118: Add language identifier to fenced code block.

The example output block should specify `text` as the language identifier for proper syntax highlighting.

🔎 Proposed fix

````diff
-```
+```text
 ===== Big Key Analysis =====
 Type       Size       Key       TTL
````

tools/bigkey_analyzer/bigkey_analyzer.cc (9)
50-54: DecodeUserKey lacks error handling for malformed input.

The `storage::DecodeUserKey` function can produce incorrect output for malformed encoded keys, but this wrapper ignores potential issues and returns the result to all call sites.
152-157: Add exception handling for numeric conversions.

The calls to `std::stoll` and `std::stoi` can throw exceptions for invalid input (e.g., `--min-size=abc`), causing ungraceful termination.

🔎 Proposed fix with exception handling

```diff
  switch (opt) {
    case 'm':
-     config.min_size = std::stoll(optarg);
+     try {
+       config.min_size = std::stoll(optarg);
+     } catch (const std::exception& e) {
+       std::cerr << "Error: Invalid value for --min-size: " << optarg << std::endl;
+       return false;
+     }
      break;
    case 't':
-     config.top_n = std::stoi(optarg);
+     try {
+       config.top_n = std::stoi(optarg);
+     } catch (const std::exception& e) {
+       std::cerr << "Error: Invalid value for --top: " << optarg << std::endl;
+       return false;
+     }
      break;
```
201-203: Handle GetCurrentTime failure to prevent incorrect TTL calculations.

If `GetCurrentTime` fails, `curtime` remains uninitialized, leading to incorrect TTL calculations. This pattern appears in all analyze functions.

🔎 Proposed fix

```diff
  int64_t curtime;
- db->GetEnv()->GetCurrentTime(&curtime).ok();
+ if (!db->GetEnv()->GetCurrentTime(&curtime).ok()) {
+   std::cerr << "Error: Failed to get current time" << std::endl;
+   return;
+ }
  curtime *= 1000;
```

Apply to AnalyzeHashes (line 263), AnalyzeSets (line 347), AnalyzeZsets (line 424), and AnalyzeLists (line 523).
314-324: Strengthen validation for SeekUserkeyDelim result and add version checking.

Two issues: (1) `SeekUserkeyDelim` may return an invalid pointer for malformed keys, causing undefined behavior in pointer arithmetic. (2) Data from stale key versions is counted without version validation, inflating size calculations.

🔎 Proposed fix

```diff
  const char* ptr = storage::SeekUserkeyDelim(data_key.data(), data_key.size());
+ if (ptr < data_key.data() || ptr > data_key.data() + data_key.size()) {
+   continue;
+ }
  size_t user_key_len = ptr - data_key.data();
  if (user_key_len == 0 || user_key_len > data_key.size()) continue;

  std::string encoded_user_key(data_key.data(), user_key_len);
  auto it = hash_info.find(encoded_user_key);
  if (it != hash_info.end()) {
+   // Validate version to exclude stale data
+   if (data_key.size() > user_key_len + sizeof(uint64_t)) {
+     uint64_t data_version;
+     memcpy(&data_version, data_key.data() + user_key_len, sizeof(uint64_t));
+     if (data_version != std::get<2>(it->second)) {
+       continue;
+     }
+   }
    std::get<0>(it->second) += data_key.size() + data_value.size();
  }
```

Apply similar fixes to AnalyzeSets (line 391), AnalyzeZsets (lines 469, 489), and AnalyzeLists (line 568).
326-337: Add iterator status checks after iteration.

Missing iterator status checks after data iteration. If an I/O error occurs during iteration, it will be silently ignored.
🔎 Proposed fix
```diff
  }
+ if (!meta_iter->status().ok()) {
+   std::cerr << "Error iterating hashes metadata: " << meta_iter->status().ToString() << std::endl;
+ }
+
+ if (!data_iter->status().ok()) {
+   std::cerr << "Error iterating hashes data: " << data_iter->status().ToString() << std::endl;
+ }
+
  // Add results
  for (const auto& entry : hash_info) {
```

Apply to AnalyzeSets (after line 402), AnalyzeZsets (after lines 480, 502), and AnalyzeLists (after line 580).
655-663: Verify column family handle count before access.

If the database is missing expected column families, `OpenForReadOnly` may succeed but return fewer handles, causing out-of-bounds access at `handles[storage::kMetaCF]`.

🔎 Proposed fix

```diff
  rocksdb::Status status = rocksdb::DB::OpenForReadOnly(db_options, db_path, column_families, &handles, &db);
  if (!status.ok()) {
    std::cerr << "Error opening database at " << db_path << ": " << status.ToString() << std::endl;
    return;
  }
+
+ if (handles.size() != cf_names.size()) {
+   std::cerr << "Error: Expected " << cf_names.size() << " column families but got "
+             << handles.size() << std::endl;
+   for (auto handle : handles) {
+     delete handle;
+   }
+   delete db;
+   return;
+ }
```
689-693: Use RAII for resource cleanup to prevent leaks.

Manual cleanup at lines 690-693 will leak resources if any Analyze* function throws an exception (evidenced by try-catch blocks at lines 490-501 and 568-579).

🔎 RAII-based fix

```diff
+ // RAII wrapper for cleanup
+ auto cleanup = [&handles, &db]() {
+   for (auto handle : handles) {
+     delete handle;
+   }
+   delete db;
+ };
+ std::unique_ptr<void, decltype(cleanup)> guard(reinterpret_cast<void*>(1), cleanup);
+
  std::cout << "Analyzing database at " << db_path << std::endl;
  // Analyze each type
  ...
- // Cleanup
- for (auto handle : handles) {
-   delete handle;
- }
- delete db;
```
721-721: CURRENT is a file, not a directory.

Lines 721, 737, 756, and 770 use `DirectoryExists` to check for the `CURRENT` file, which is a RocksDB metadata file, not a directory. These checks will always fail even for valid databases.

🔎 Proposed fix

Add a `FileExists` helper:

```cpp
bool FileExists(const std::string& path) {
  struct stat st;
  return stat(path.c_str(), &st) == 0 && S_ISREG(st.st_mode);
}
```

Then replace all four checks:

```diff
- if (DirectoryExists(test_path + "/CURRENT")) {
+ if (FileExists(test_path + "/CURRENT")) {

- if (DirectoryExists(partition_path) && DirectoryExists(partition_path + "/CURRENT")) {
+ if (DirectoryExists(partition_path) && FileExists(partition_path + "/CURRENT")) {

- if (DirectoryExists(db_inst_path) && DirectoryExists(db_inst_path + "/CURRENT")) {
+ if (DirectoryExists(db_inst_path) && FileExists(db_inst_path + "/CURRENT")) {
```

Also applies to: 737-737, 756-756, 770-770
819-821: Summary is misleading after top_n filtering.

After `resize(config.top_n)` at line 801, the summary reports "Total keys analyzed" based on the truncated vector. If 10,000 keys are scanned but `--top=100` is used, the summary incorrectly says "Total keys analyzed: 100".

🔎 Proposed fix

```diff
+ size_t total_keys_found = key_infos.size();
+
  // Limit to top N if requested
  if (config.top_n > 0 && config.top_n < static_cast<int>(key_infos.size())) {
    key_infos.resize(config.top_n);
  }
  // ...
  // Output summary
  *out << "\n===== Summary =====\n";
- *out << "Total keys analyzed: " << key_infos.size() << "\n";
+ *out << "Total keys found: " << total_keys_found << "\n";
+ if (config.top_n > 0) {
+   *out << "Showing top " << key_infos.size() << " keys\n";
+ }
```
📒 Files selected for processing (2)

- tools/bigkey_analyzer/README.md
- tools/bigkey_analyzer/bigkey_analyzer.cc

🪛 markdownlint-cli2 (0.18.1)

tools/bigkey_analyzer/README.md

33-33: Fenced code blocks should have a language specified (MD040, fenced-code-language)

97-97: Fenced code blocks should have a language specified (MD040, fenced-code-language)
```cpp
      case 'y':
        config.type_filter = optarg;
        break;
```
Validate type_filter to prevent silent analysis failures.
The type_filter parameter is not validated against allowed values. If a user provides an invalid type like --type=foo, the tool will silently skip all analysis without warning, producing empty results that appear successful.
🔎 Proposed fix with validation
```diff
      case 'y':
        config.type_filter = optarg;
+       if (config.type_filter != "all" && config.type_filter != "strings" &&
+           config.type_filter != "hashes" && config.type_filter != "lists" &&
+           config.type_filter != "sets" && config.type_filter != "zsets") {
+         std::cerr << "Error: Invalid type '" << optarg
+                   << "'. Must be one of: strings, hashes, lists, sets, zsets, all" << std::endl;
+         return false;
+       }
        break;
```
Implementation Principles

Storage Architecture

The PikiwiDB unstable branch adopts a new unified storage architecture:

- `default` (MetaCF): stores metadata for all keys
- `hashes_data_cf`: stores hash field data
- `sets_data_cf`: stores set member data
- `lists_data_cf`: stores list element data
- `zsets_data_cf`: stores zset member data
- `zsets_score_cf`: stores the zset score index
- `streams_data_cf`: stores stream data

Data Encoding Formats
1. Key encoding

User keys are encoded into internal keys by `EncodeUserKey`.

2. Value format in MetaCF

Each data type has corresponding metadata in MetaCF:

- Strings: parsed with `ParsedStringsValue`
- Complex types (Hash/Set/List/ZSet): parsed with `ParsedXxxMetaValue`

3. Key format in the data CFs

Data keys for complex types embed the encoded user key; the `SeekUserkeyDelim` function extracts the `encoded_user_key` portion from a data key.
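As a simplified illustration of that extraction step (assuming the user key is terminated by a single reserved 0x00 byte; the real `SeekUserkeyDelim` in `storage_define.h` additionally handles escaping of delimiter bytes embedded in the key):

```cpp
#include <cstddef>
#include <string>

// Simplified sketch: scan forward to the first reserved delimiter byte.
// The real encoding escapes embedded delimiter bytes, which this ignores.
const char* SeekDelimSketch(const char* data, size_t size, char delim = '\0') {
  const char* p = data;
  const char* end = data + size;
  while (p < end && *p != delim) ++p;
  return p;  // points at the delimiter, or at `end` if none was found
}

// Extract the encoded user-key portion from a data-CF key.
std::string ExtractUserKey(const std::string& data_key) {
  const char* ptr = SeekDelimSketch(data_key.data(), data_key.size());
  return std::string(data_key.data(), static_cast<size_t>(ptr - data_key.data()));
}
```

This is why the analyzer can group all data-CF entries under their owning logical key: every data key starts with the same encoded user key.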
String类型分析
ParsedStringsValue解析获取TTL等信息复杂类型分析(Hash/Set/List/ZSet)
采用两遍扫描策略:
第一遍 - 扫描MetaCF:
ParsedXxxMetaValue解析第二遍 - 扫描DataCF:
SeekUserkeyDelim从data_key中提取encoded_user_keyZSet特殊处理:
zsets_data_cf和zsets_score_cfTTL计算
db->GetEnv()->GetCurrentTime(&curtime) * 1000etime > 0:etime == 0: TTL = -1(永久有效)多DB实例支持
工具支持两种部署模式:
单实例模式:
多实例模式(db_instance_num > 1):
大小计算
每个key的大小包括:
这样计算反映了实际存储占用空间。
性能优化
OpenForReadOnly避免加锁开销注意事项
Summary by CodeRabbit
New Features
Documentation
Chores