Skip to content

Sync with main for including newer commits#3

Closed
shubham-sudo wants to merge 500 commits into
LSMMemoryProfilingfrom
main
Closed

Sync with main for including newer commits#3
shubham-sudo wants to merge 500 commits into
LSMMemoryProfilingfrom
main

Conversation

@shubham-sudo
Copy link
Copy Markdown
Collaborator

No description provided.

hx235 and others added 30 commits August 26, 2025 11:20
Summary:
**Context/Summary:**

DumpStats() do not skip dropped CF and can run into a seg fault like below
```
2025-08-23T06:44:05.0469230Z �[0;32m[ RUN      ] �[mFormatLatest/ColumnFamilyTest.LiveIteratorWithDroppedColumnFamily/0
2025-08-23T06:44:05.0470050Z Received signal 11 (Segmentation fault: 11)
2025-08-23T06:44:05.0470510Z #0   0x7000069305e0
2025-08-23T06:44:05.0471070Z facebook#1   rocksdb::DBImpl::DumpStats() (in librocksdb.10.6.0.dylib) (db_impl.cc:1076)
```

This PR skipped it.

Pull Request resolved: facebook#13900

Test Plan:
- Deterministically repro-ed the seg fault before the fix and ensure it doesn't happen after the fix
```
 diff --git a/db/column_family_test.cc b/db/column_family_test.cc
index 3a2ca06..f57d6f757 100644
 --- a/db/column_family_test.cc
+++ b/db/column_family_test.cc
@@ -2372,11 +2372,17 @@ TEST_P(ColumnFamilyTest, LiveIteratorWithDroppedColumnFamily) {
   int kKeysNum = 10000;
   PutRandomData(1, kKeysNum, 100);
   {
+    ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
+        {{"PostDrop", "BeforeAccessCFD"}, {"PostAccessCFD", "BeforeGo"}});
+
+    ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
     std::unique_ptr<Iterator> iterator(
         db_->NewIterator(ReadOptions(), handles_[1]));
     iterator->SeekToFirst();

     DropColumnFamilies({1});
+    TEST_SYNC_POINT("PostDrop");
+    TEST_SYNC_POINT("BeforeGo");

     // Make sure iterator created can still be used.
     int count = 0;
@@ -2386,6 +2392,9 @@ TEST_P(ColumnFamilyTest, LiveIteratorWithDroppedColumnFamily) {
     }
     ASSERT_OK(iterator->status());
     ASSERT_EQ(count, kKeysNum);
+
+    ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
+    ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
   }

   Reopen();
 diff --git a/db/db_impl/db_impl.cc b/db/db_impl/db_impl.cc
index a8e4f5f..a8a0499c0 100644
 --- a/db/db_impl/db_impl.cc
+++ b/db/db_impl/db_impl.cc
@@ -1073,8 +1073,10 @@ void DBImpl::DumpStats() {
         continue;
       }

-      auto* table_factory =
-          cfd->GetCurrentMutableCFOptions().table_factory.get();
+      TEST_SYNC_POINT("BeforeAccessCFD");
+      auto moptions = cfd->GetCurrentMutableCFOptions();
+      auto* table_factory = moptions.table_factory.get();
+      TEST_SYNC_POINT("PostAccessCFD");
       assert(table_factory != nullptr);
       // FIXME: need to a shared_ptr if/when block_cache is going to be mutable
       Cache* cache =
~
```

Reviewed By: archang19

Differential Revision: D81003739

Pulled By: hx235

fbshipit-source-id: bdf3c4cc45988f43e79ebc191a20af5b70ac289f
…3894)

Summary:
Enable stress testing of deletion-triggered compaction.

Pull Request resolved: facebook#13894

Test Plan:
```
 python3 -u tools/db_crashtest.py --simple whitebox --enable_compaction_on_deletion_trigger=true
```

Reviewed By: jaykorean

Differential Revision: D81175559

Pulled By: nmk70

fbshipit-source-id: c5128b7c1e2d07833b0e9385e04b342bc42c65cf
Summary:
The implementation of parallel compression has historically scaled rather poorly, or perhaps modestly with heavy compression, topping out around 3x throughput vs. serial and incurring big overheads in CPU consumption relative to the throughput.

This change addresses one source of that extra CPU consumption: stashing all the keys of a block for later processing into building index and filter blocks. Historically with parallel compression, the index and filter block updates were handled in the last stage of processing along with writing each data block to the file writer. This was because the index blocks needed to know the BlockHandle of the new data block, which could only be known after every preceeding data block was compressed, to know the starting location for the BlockHandle. And because index and filter partitions were historically coupled (see decouple_partitioned_filters), filter updates had to happen at the same time.

Here we get rid of stashing the keys for later processing and the extra CPU associated with it, by
* Creating a two stage process of adding to index blocks ("prepare" and "finish" each entry; one entry per data block). The two stages must be executable in parallel for separate index entries. NOTE: not yet supported by UserDefinedIndex
* Requiring decouple_partitioned_filters=true for parallel compression, because we now add to filters in the first stage of processing when each key is readily available and we cannot couple that with finalizing index entries in the last stage of processing.

It might seem like adding to filters is something that is expensive (hashing etc.) and should be kept out of the bottle-neck first stage of processing (which includes walking the compaction iterator) but it's probably similar cost to simply stashing the keys away for later processing. (We might be able to reduce a bottle-neck by stashing hashes, but we're not to a point where that is worth the effort.)

And it makes sense to make two more simple public API updates in conjunction with this:
* Set decouple_partitioned_filters=true by default. No signs of problems in production.
* Mark parallel compression as production-ready. It's being thoroughly tested in the crash test, successfully, and in limited production uses.

Follow-up:
* Improve the threading/sychronization model of parallel compression for the next major efficiency improvement
* Consider supporting the parallel-compatible index building APIs with UserDefinedIndex, unless it's considered too dangerous to expect users to safely handle the multi-threading.
* (In a subsequent release) remove all the code associated with coupling filter and index partitions and mark the option as ignored.

Pull Request resolved: facebook#13850

Test Plan:
for correctness, existing tests

## Performance Data

The "before" data here includes revert of facebook#13828 for combined performance measurement of this change and that one.

```
SUFFIX=`tty | sed 's|/|_|g'`; for CT in lz4 zstd lz4; do for PT in 1 2 3 4 6 8; do echo "$CT pt=$PT"; (for I in `seq 1 1`; do BIN=/dev/shm/dbbench${SUFFIX}.bin; rm -f $BIN; cp db_bench $BIN; /usr/bin/time $BIN -db=/dev/shm/dbbench$SUFFIX --benchmarks=fillseq -num=30000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 -compression_type=$CT -compression_parallel_threads=$PT 2>&1 | tail -n 3 | head -n 2; done); done; done
```

To get a sense of the overall performance relative to number of parallel threads, we vary that with popular fast compression and popular heavier weight compression (some noise in this data, don't interpret each data point too strongly)

lz4 pt=1
2107431 -> 2112941 ops/sec (+0.3% - improvement)
(26.51 + 0.75) = 27.26 CPU sec -> (26.63 + 0.79) = 27.42 CPU sec (+0.6% - regression)
lz4 pt=2
1606660 -> 1580333 ops/sec (-1.6% - regression)
(47.10 + 8.37) = 55.47 CPU sec -> (45.05 + 9.23) = 54.28 CPU sec (-2.2% - improvement)
lz4 pt=3
1701353 -> 1889283 ops/sec (+11.1% - improvement)
(47.23 + 8.29) = 55.52 CPU sec -> (43.89 + 8.33) = 52.22 CPU sec (-6.0% - improvement)
lz4 pt=4
1651504 -> 1817890 ops/sec (+10.1% - improvement)
(48.07 + 8.31) = 56.38 CPU sec -> (44.77 + 8.45) = 53.22 CPU sec (-5.6% - improvement)
lz4 pt=6
1716099 -> 1888523 ops/sec (+10.1% - improvement)
(47.50 + 8.45) = 55.95 CPU sec -> (44.25 + 8.73) = 52.98 CPU sec (-5.3% - improvement)
lz4 pt=8
1696840 -> 1797256 ops/sec (+5.9% - improvement)
(48.09 + 8.61) = 56.70 CPU sec -> (45.90 + 8.68) = 54.58 CPU sec (-3.8% - improvement)

Clearly parallel threads do not help with fast compression like LZ4, but it's not as bad as it was before.

zstd pt=1
1214258 -> 1202863 ops/sec (-0.9% - regression)
(38.26 + 0.66) = 38.92 CPU sec -> (39.37 + 0.69) = 40.06 CPU sec (+2.9% - regression)
zstd pt=2
1194673 -> 1152746 ops/sec (-3.5% - regression)
(61.01 + 9.85) = 70.86 CPU sec -> (58.28 + 9.99) = 68.27 CPU sec (-3.7% - improvement)
zstd pt=3
1653661 -> 1825618 ops/sec (+10.4% - improvement)
(60.07 + 8.45) = 68.52 CPU sec -> (56.03 + 8.43) = 64.46 CPU sec (-5.9% - improvement)
zstd pt=4
1691723 -> 1890976 ops/sec (+11.8% - improvement)
(59.72 + 8.46) = 68.18 CPU sec -> (55.96 + 8.27) = 64.23 CPU sec (-5.7% - improvement)
zstd pt=6
1684982 -> 1900002 ops/sec (+12.8% - improvement)
(58.89 + 8.26) = 67.15 CPU sec -> (55.98 + 8.48) = 64.46 CPU sec (-4.0% - improvement)
zstd pt=8
1648282 -> 1892531 ops/sec (+14.8% - improvement)
(59.43 + 8.63) = 68.06 CPU sec -> (56.49 + 8.32) = 64.81 CPU sec (-4.8% - improvement)

The throughput is now able to increase by *more than half* with lots of parallelism, rather than only *about a third*.

Scalability is a bit better with higher compression level, and we still see a benefit from this change. (We've also enabled partitioned indexes and filters here, which sees essentially the same benefits):

zstd pt=1 compression_level=7
595720 -> 597359 ops/sec (+0.3% - improvement)
(63.45 + 0.73) = 64.18 CPU sec -> (63.25 + 0.71) = 63.96 CPU sec (-0.3% - improvement)
zstd pt=4 compression_level=7
1527116 -> 1501779 ops/sec (-1.7% - regression)
(85.00 + 8.14) = 93.14 CPU sec -> (81.85 + 9.02) = 90.87 CPU sec (-2.5% - improvement)
zstd pt=6 compression_level=7
1678239 -> 1956070 ops/sec (+16.5% - improvement)
(83.77 + 8.11) = 91.88 CPU sec -> (79.87 + 7.78) = 87.65 CPU sec (-4.6% - improvement)
zstd pt=8 compression_level=7
1696132 -> 1953041 ops/sec (+15.1% - improvement)
(83.97 + 8.14) = 92.11 CPU sec -> (80.61 + 7.78) = 88.39 CPU sec (-4.1% - improvement)

With more tests, not really seeing any consistent differences with no parallelism (despite some micro-optimizations thrown in)

Reviewed By: hx235

Differential Revision: D79853111

Pulled By: pdillinger

fbshipit-source-id: 7a34fd7811217fb74fa6d3efaea7ffcce72beec7
…3879)

Summary:
**Context/Summary:**
`ProcessKeyValueCompaction()` has grown too long to resonate or add any logic to resume from some key and save progress for resumable compaction. This PR breaks this function into smaller functions. Almost all of them are cosmetic changes, except for one thing pointed out in below PR conversation.

Specially, this PR did the following:
- Added `SubcompactionInternalIterators`, `SubcompactionKeyBoundaries` and `BlobFileResources` to manage the lifetime of the local variables of the original functions to be used across smaller functions
- Moved AutoThreadOperationStageUpdater, some IO stats measurement to a different place that makes more sense

Pull Request resolved: facebook#13879

Test Plan: Existing UT

Reviewed By: jaykorean

Differential Revision: D80216092

Pulled By: hx235

fbshipit-source-id: 515615906e5e5fd5ec191bcdd4126f17d282cac2
Summary:
I am wanting to use std::counting_semaphore for something and the timing seems good to require C++20 support. The internets suggest:

* GCC >= 10 is adequate, >= 11 preferred
* Clang >= 10 is needed
* Visual Studio >= 2019 is adquate

And popular linux distributions look like this:
* CentOS Stream 9 -> GCC 11.2  (CentOS 8 is EOL)
* Ubuntu 22.04 LTS -> GCC 11.x  (Ubuntu 20 just ended standard support)
* Debian 12 (oldstable) -> GCC 12.2
  * (Debian 11 has ended security updates, uses GCC 10.2)

This required generating a new docker image based on Ubuntu 22 for CI using gcc. The existing Ubuntu 20 image works for covering appropriate clang versions (though we should maybe add a much later version as well, in the next increment of our Ubuntu 22 image; however the minimum available clang build from apt.llvm.org for Ubuntu 22 is clang 13).

Update to SetDumpFilter is to quiet a mysterious gcc-13 warning-as-error.

Removed --compile-no-warning-as-error from a cmake command line because cmake in the new docker image is too old for this option.

Pull Request resolved: facebook#13904

Test Plan: CI, one minor unit test added to verify std::counting_semaphor works

Reviewed By: xingbowang

Differential Revision: D81266435

Pulled By: pdillinger

fbshipit-source-id: 26040eeccca7004416e29a6ff4f6ea93f2052684
…cebook#13906)

Summary:
Add a new argument --random_seed to script db_crashtest.py to allow reusing the same random seed to produce exactly same test argument. When the argument is missing, a random seed is used, and printed. When developer wants to reproduce the exactly same setup, they could use the same seed with --random_seed for reproduction. The example below shows running the command without and with the argument. All of the arguments are same, except --db and --expected_values_dir, which does not use python random.

* Without --random_seed, a new seed is generated and printed.
```
[xbw@devvm16622.vll0 ~/workspace/ws1/rocksdb (crashtest)]$ /usr/local/bin/python3 -u tools/db_crashtest.py --stress_cmd=./db_stress --cleanup_cmd='' --cf_consistency blackbox --duration=960 --max_key=2500000
Start with random seed 17953760416546706382
Running blackbox-crash-test with
interval_between_crash=120
total-duration=960

Running db_stress with pid=2957716: ./db_stress --WAL_size_limit_MB=0 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --adm_policy=0 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=0 --allow_setting_blob_options_dynamically=1 --allow_unprepared_value=0 --async_io=1 --atomic_flush=1 --auto_readahead_size=0 --auto_refresh_iterator_with_snapshot=1 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=1000 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --blob_cache_size=2097152 --blob_compaction_readahead_size=4194304 --blob_compression_type=zstd --blob_file_size=1073741824 --blob_file_starting_level=0 --blob_garbage_collection_age_cutoff=0.5 --blob_garbage_collection_force_threshold=0.75 --block_align=1 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=1 --bloom_bits=12 --bottommost_compression_type=none --bottommost_file_compaction_delay=3600 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=1 --checkpoint_one_in=10000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=0 --compaction_readahead_size=1048576 --compaction_style=0 --compaction_ttl=100 --compress_format_version=1 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_manager=none --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc=23:30-03:15 --data_block_index_type=0 --db=/tmp/rocksdb_crashtest_blackboxqishhgdc --db_write_buffer_size=0 --decouple_partitioned_filters=1 --default_temperature=kWarm --default_write_temperature=kCold --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=10000 --disable_wal=1 --dump_malloc_stats=1 --enable_blob_files=1 --enable_blob_garbage_collection=1 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=1 --enable_do_not_compress_roles=1 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=1 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=0 --exclude_wal_from_write_fault_injection=0 --expected_values_dir=/tmp/rocksdb_crashtest_expected_udz8mw68 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --file_temperature_age_thresholds= --fill_cache=0 --flush_one_in=1000 --format_version=4 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=2097152 --high_pri_pool_ratio=0 --index_block_restart_interval=1 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --ingest_wbwi_one_in=500 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=0 --log_file_time_to_roll=0 --log_readahead_size=16777216 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=2 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=2500000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=1048576 --memtable_avg_op_scan_flush_trigger=2 --memtable_insert_hint_per_batch=0 --memtable_max_range_deletions=100 --memtable_op_scan_flush_trigger=10 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=8 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=128 --min_blob_size=16 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_bottom_pri_threads=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=0 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --paranoid_memory_checks=0 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=10000 --periodic_compaction_seconds=1000 --prefix_size=5 --prefixpercent=5 --prepopulate_blob_cache=1 --prepopulate_block_cache=1 --preserve_internal_time_seconds=36000 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=32 --read_fault_one_in=1000 --readahead_size=0 --readpercent=45 --recycle_log_file_num=0 --remote_compaction_worker_threads=0 --reopen=0 --report_bg_io_stats=1 --reset_stats_one_in=1000000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608;enable_custom_split_merge=true --set_options_one_in=1000 --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=2 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --table_cache_numshardbits=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=1 --test_ingest_standalone_range_deletion_one_in=0 --top_level_index_pinning=0 --track_and_verify_wals=0 --uncache_aggressiveness=211 --universal_max_read_amp=4 --universal_reduce_file_locking=0 --unpartitioned_pinning=0 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=1 --use_blob_cache=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=1 --use_multiscan=0 --use_put_entity_one_in=0 --use_shared_block_and_blob_cache=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=0 --verify_db_one_in=10000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=1048576 --write_dbid_to_manifest=0 --write_fault_one_in=0 --write_identity_file=1 --writepercent=35
```

* With --random_seed, the seed specified in the argument is used.
```
[xbw@devvm16622.vll0 ~/workspace/ws1/rocksdb (crashtest)]$ /usr/local/bin/python3 -u tools/db_crashtest.py --stress_cmd=./db_stress --cleanup_cmd='' --cf_consistency blackbox --duration=960 --max_key=2500000 --random_seed=17953760416546706382
Start with random seed 17953760416546706382
Running blackbox-crash-test with
interval_between_crash=120
total-duration=960

Running db_stress with pid=2959006: ./db_stress --WAL_size_limit_MB=0 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --adm_policy=0 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=0 --allow_setting_blob_options_dynamically=1 --allow_unprepared_value=0 --async_io=1 --atomic_flush=1 --auto_readahead_size=0 --auto_refresh_iterator_with_snapshot=1 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=1000 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --blob_cache_size=2097152 --blob_compaction_readahead_size=4194304 --blob_compression_type=zstd --blob_file_size=1073741824 --blob_file_starting_level=0 --blob_garbage_collection_age_cutoff=0.5 --blob_garbage_collection_force_threshold=0.75 --block_align=1 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=1 --bloom_bits=12 --bottommost_compression_type=none --bottommost_file_compaction_delay=3600 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=1 --checkpoint_one_in=10000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=0 --compaction_readahead_size=1048576 --compaction_style=0 --compaction_ttl=100 --compress_format_version=1 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_manager=none --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc=23:30-03:15 --data_block_index_type=0 --db=/tmp/rocksdb_crashtest_blackbox0kxvhzbm --db_write_buffer_size=0 --decouple_partitioned_filters=1 --default_temperature=kWarm --default_write_temperature=kCold --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=10000 --disable_wal=1 --dump_malloc_stats=1 --enable_blob_files=1 --enable_blob_garbage_collection=1 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=1 --enable_do_not_compress_roles=1 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=1 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=0 --exclude_wal_from_write_fault_injection=0 --expected_values_dir=/tmp/rocksdb_crashtest_expected_hhk9kcgo --fifo_allow_compaction=0 --file_checksum_impl=crc32c --file_temperature_age_thresholds= --fill_cache=0 --flush_one_in=1000 --format_version=4 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=2097152 --high_pri_pool_ratio=0 --index_block_restart_interval=1 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --ingest_wbwi_one_in=500 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=0 --log_file_time_to_roll=0 --log_readahead_size=16777216 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=2 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=2500000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=1048576 --memtable_avg_op_scan_flush_trigger=2 --memtable_insert_hint_per_batch=0 --memtable_max_range_deletions=100 --memtable_op_scan_flush_trigger=10 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=8 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=128 --min_blob_size=16 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_bottom_pri_threads=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=0 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --paranoid_memory_checks=0 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=10000 --periodic_compaction_seconds=1000 --prefix_size=5 --prefixpercent=5 --prepopulate_blob_cache=1 --prepopulate_block_cache=1 --preserve_internal_time_seconds=36000 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=32 --read_fault_one_in=1000 --readahead_size=0 --readpercent=45 --recycle_log_file_num=0 --remote_compaction_worker_threads=0 --reopen=0 --report_bg_io_stats=1 --reset_stats_one_in=1000000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608;enable_custom_split_merge=true --set_options_one_in=1000 --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=2 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --table_cache_numshardbits=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=1 --test_ingest_standalone_range_deletion_one_in=0 --top_level_index_pinning=0 --track_and_verify_wals=0 --uncache_aggressiveness=211 --universal_max_read_amp=4 --universal_reduce_file_locking=0 --unpartitioned_pinning=0 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=1 --use_blob_cache=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=1 --use_multiscan=0 --use_put_entity_one_in=0 --use_shared_block_and_blob_cache=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=0 --verify_db_one_in=10000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=1048576 --write_dbid_to_manifest=0 --write_fault_one_in=0 --write_identity_file=1 --writepercent=35
```

Pull Request resolved: facebook#13906

Test Plan: stress test

Reviewed By: hx235

Differential Revision: D81201034

Pulled By: xingbowang

fbshipit-source-id: 0bb4e0cbcdcf2de9b730492342dcfa18f07e93d6
…n pause (facebook#13891)

Summary:
**Context/Summary:**
A small change as titled.

Pull Request resolved: facebook#13891

Test Plan: - Existing UT and rehearsal stress test

Reviewed By: jaykorean

Differential Revision: D80588011

Pulled By: hx235

fbshipit-source-id: 6987e08a4855782305ad742eef6c0196da0d67ca
Summary:
Re-enabling Remote Compaction Stress Test with some changes to stress test feature combo sanitization changes

Pull Request resolved: facebook#13913

Test Plan:
Ran Meta Internal Tests for a few days

# Follow up
- Skip recovering from WAL in remote worker and re-enable WAL
- Investigate and fix races with Integrated BlobDB

Reviewed By: hx235

Differential Revision: D81509225

Pulled By: jaykorean

fbshipit-source-id: 949762c48ece0a25e3d0281e3510f1e7d3fe3667
… Test (facebook#13916)

Summary:
Fixing "Integrated BlobDB is currently incompatible with Remote Compaction" error

https://github.com/facebook/rocksdb/actions/runs/17417658959/job/49449586139

Pull Request resolved: facebook#13916

Test Plan: CI

Reviewed By: anand1976

Differential Revision: D81537676

Pulled By: jaykorean

fbshipit-source-id: f5e2c40cd498a17cb08486a1cb9404ccf1d812e0
Summary:
# Summary

Until we get WAL + Remote Compaction in Stress Test working, temporarily disable this

Pull Request resolved: facebook#13919

Test Plan: Meta Internal CI run

Reviewed By: anand1976

Differential Revision: D81605621

Pulled By: jaykorean

fbshipit-source-id: 6e1f9a0a7a0f27e7465512689b51364b63ef3e2b
Summary:
Add a new option `MultiScanArgs::max_prefetch_size` that limits the memory usage of per file pinning of prefetched blocks. Note that this only accounts for compressed block size. This is intended to be a stopgap until we implement some kind of global prefetch manager that limits the global multiscan memory usage.

Pull Request resolved: facebook#13920

Test Plan: new unit test `./block_based_table_reader_test --gtest_filter="*MultiScanPrefetchSizeLimit/*"`

Reviewed By: xingbowang

Differential Revision: D81630629

Pulled By: cbi42

fbshipit-source-id: 9f66678915242fe1220620531a4b9fd22747cdea
…#13921)

Summary:
If user_defined_index_factory in BlockBasedTableOptions is configured and we try to open an SST file without the corresponding UDI (either during DB open or file ingestion), ignore a failure to load the UDI by default. If fail_if_no_udi_on_open in BlockBasedTableOptions is true, then treat it as a fatal error.

Pull Request resolved: facebook#13921

Test Plan: Update unit tests

Reviewed By: xingbowang

Differential Revision: D81826054

Pulled By: anand1976

fbshipit-source-id: f4fe0b13ccb02b9448622af487680131e349c52b
Summary:
This diff adds logging in various places in the external file ingestion code where we check for non-OK status codes.

Pull Request resolved: facebook#13905

Test Plan: Debugging external file ingestion should be easier with additional logging.

Differential Revision: D81814033

Pulled By: archang19

fbshipit-source-id: 77f8b342cbad892acedc4603c02865c38886f2f4
Summary:
After running stress test over a week, we've identified more failures to fix. While we work on the fix, disable the remote compaction temporarily to reduce noise and avoid these failures hiding other failures.

Pull Request resolved: facebook#13925

Test Plan: CI

Reviewed By: anand1976

Differential Revision: D81934248

Pulled By: jaykorean

fbshipit-source-id: 9ac11926429eebe1aebf7b520a548dc5987b7d76
Summary:
**Context/Summary:** it's for formal testing to cover statistics in our stress test

Pull Request resolved: facebook#13926

Reviewed By: anand1976, jaykorean

Differential Revision: D81943762

Pulled By: hx235

fbshipit-source-id: 4186be0b35839976b7299667492d0cc722128a06
Summary:
A follow-up to facebook#13904 which was incomplete in updating CI jobs to support C++20 because the C++20 usage was only in tests. Here we add subtle C++20 usage in the public API ("using enum" feature in db.h) to force the issue.

A lot of the work for this PR was in updating the Ubuntu22 docker image, for earlier compiler/runtime versions supporting C++20, and generating a new Ubuntu24 docker image, for later compiler/runtime versions. The Ubuntu22 image needed to be updated because there are incompatibilities with clang-13 + c++20 + libstdc++ for gcc 11, seen on these examples

```
#include <chrono>

int main(int argc, char *argv[]) {
  std::chrono::microseconds d = {}; return 0;
}
```

and

```
#include <coroutine>

int main() { return 0; }
```

The second was causing recurring failures in build-linux-clang-13-asan-ubsan-with-folly, now fixed.

So we have to install clang's libc++ to compile with clang-13. I haven't been able to get this to work with some of the libraries like benchmark, glog, and/or gflags, but I'm able to compile core RocksDB with clang-13. On this docker image, an extra compiler parameter is needed to compile with gcc and glog because it's built from source perhaps not perfectly, because the ubuntu package transitively conflicts with libc++.

The Ubuntu24 image seems to be low-drama and generally work for testing out newer compiler versions. The mingw build uses Ubuntu24 because the mingw package on Ubuntu22 uses a gcc version that is too old.

And the mass of other code changes are trying to work around new warnings, mostly from clang-analyze, which I upgraded to clang-18 in CI.

Pull Request resolved: facebook#13915

Test Plan: CI, including temporarily including the nightly jobs in the PR jobs in earlier revisions to test and stabilize

Reviewed By: archang19

Differential Revision: D81933067

Pulled By: pdillinger

fbshipit-source-id: 7e33823006a79d5f3cf5bc1d625f0a3c08a7d74c
…ebook#13731)

Summary:
PointLockManager manages point lock per key. The old implementation partition the per key lock into 16 stripes. Each stripe handles the point lock for a subset of keys. Each stripe have only one conditional variable. This conditional variable is used by all the transactions that are waiting for its turn to acquire a lock of a key that belongs to this stripe.

In production, we notified that when there are multiple transactions trying to write to the same key, all of them will wait on the same conditional variables. When the previous lock holder released the key, all of the transactions are woken up, but only one of them could proceed, and the rest goes back to sleep. This wasted a lot of CPU cycles. In addition, when there are other keys being locked/unlocked on the same lock stripe, the problem becomes even worse.

In order to solve this issue, we implemented a new PerKeyPointLockManager that keeps a transaction waiter queue at per key level. When a transaction could not acquire a lock immediately, it joins the waiter queue of the key and waits on a dedicated conditional variable. When previous lock holder released the lock, it wakes up the next set of transactions that are eligible to acquire the lock from the waiting queue. The queue respect FIFO order, except it prioritizes lock upgrade/downgrade operation.

However, this waiter queue change increases the deadlock detection cost, because the transaction waiting in the queue also needs to be considered during deadlock detection. To resolve this issue, a new deadlock_timeout_us (microseconds) configuration is introduced in transaction option. Essentially, when a transaction is waiting on a lock, it will join the wait queue and wait for the duration configured by deadlock_timeout_us without perform deadlock detection. If the transaction didn't get the lock after the deadlock_timeout_us timeout is reached, it will then perform deadlock detection and wait until lock_timeout is reached. This optimization takes the heuristic where majority of the transaction would be able to get the lock without perform deadlock detection.

The deadlock_timeout_us configuration needs to be tuned for different workload, if the likelihood of deadlock is very low, the deadlock_timeout_us could be configured close to a big higher than the average transaction execution time, so that majority of the transaction would be able to acquire the lock without performing deadlock detection. If the likelihood of deadlock is high, deadlock_timeout_us could be configured with lower value, so that deadlock would get detected faster.

The new PerKeyPointLockManager is disabled by default. It can be enabled by TransactionDBOptions.use_per_key_point_lock_mgr. The deadlock_timeout_us is only effective when PerKeyPointLockManager is used. When deadlock_timeout_us is set to 0, transaction will perform deadlock detection immediately before wait.

Pull Request resolved: facebook#13731

Test Plan:
Unit test.
Stress unit test that validates deadlock detection and exclusive, shared lock guarantee.
A new point_lock_bench binary is created to help perform performance test.

Reviewed By: pdillinger

Differential Revision: D77353607

Pulled By: xingbowang

fbshipit-source-id: 21cf93354f9a367a78c8666596ed14013ac7240b
Summary:
There are some internal use cases that do not map cleanly onto the existing `IOActivity` enums. This PR creates new custom IOActivity types that internal users can use as they see fit.

Pull Request resolved: facebook#13924

Test Plan: Wrote a simple unit test

Reviewed By: pdillinger

Differential Revision: D82029992

Pulled By: archang19

fbshipit-source-id: a3e23c360baa96cd2e9adf570e71c6e43947bfc8
Summary:
Add copyright notice to any_lock_manager_test.h

Pull Request resolved: facebook#13930

Reviewed By: xingbowang

Differential Revision: D82035581

Pulled By: anand1976

fbshipit-source-id: 2275f7c8b41fbd4384bdae011d244bfa117225f7
Summary:
Fix broken build in PointLockManager change with C++20

Pull Request resolved: facebook#13933

Test Plan: CI

Reviewed By: pdillinger

Differential Revision: D82073490

Pulled By: xingbowang

fbshipit-source-id: 0bd4936fe0a27a28db61ca5f23d3bea90bce73ef
Summary:
... and associated statistics, etc. Someone needs it, so here it is.

Pull Request resolved: facebook#13927

Test Plan: Updated / extended / added some unit tests

Reviewed By: cbi42

Differential Revision: D81981469

Pulled By: pdillinger

fbshipit-source-id: 52558c08741890b781310906acbc18d9eb479363
Summary:
Fix uninitialized value complaint in valgrind due to gtest print padded struct.

Pull Request resolved: facebook#13934

Test Plan: CI. Verified that valgrind no longer complains about it.

Reviewed By: pdillinger

Differential Revision: D82124983

Pulled By: xingbowang

fbshipit-source-id: 99eb7bab99726c45affe0a231777e5951844d73b
…cebook#13936)

Summary:
We should add error logging to be able to pinpoint why RocksDB is returning status `NotSupported` for `ReadAsync`.

Pull Request resolved: facebook#13936

Test Plan: Look at logs (and client logs of error status)

Reviewed By: anand1976

Differential Revision: D82141529

Pulled By: archang19

fbshipit-source-id: c71b70967457be35ef5168321d449f96b2b9441d
…ity and set true by default (facebook#13929)

Summary:
**Context/Summary:**
Internally `CompactionJobStats ::num_input_records` is only used for input record count [verification](https://github.com/facebook/rocksdb/blob/1aca60c089a48857930b4191b0c84b6dd98a221c/db/compaction/compaction_job.cc#L2535) and such verification always checks for `CompactionJobStats::has_num_input_records` (now renamed) before using this field. This is needed because the `CompactionJobStats::num_input_records` gets its number from `CompactionIterator::NumInputEntryScanned()` in a subcompaction and this number can be inaccurate purposefully to increase performance, see [CompactionIterator::must_count_input_entries](https://github.com/facebook/rocksdb/pull/13929/files#diff-e6c876f655a21865c0f3dff94b9763f1bd40cf88a8a86f04868201b2e845a890R186-R199) for more.
- This PR renames the `CompactionJobStats::has_num_input_records` to more explicit naming and adds more comments. Not a behavior change.

Also, aggregation of  `CompactionJobStats::has_num_input_records` among all subcompactions is done by [AND](https://github.com/facebook/rocksdb/blob/1aca60c089a48857930b4191b0c84b6dd98a221c/util/compaction_job_stats_impl.cc#L62) operation so it's false if any of the subcompaction has this field being false. The default value of this field should be "true" in order to not mistakenly "false" by default. We are currently fine because `CompactionJobStats::Reset()` that [sets the value to be true](https://github.com/facebook/rocksdb/blob/1aca60c089a48857930b4191b0c84b6dd98a221c/util/compaction_job_stats_impl.cc#L14) is always called before such aggregation.
 - This PR changes the default value to be true.
 - Resumable compaction development plans to set `CompactionJobStats::has_num_input_records` to be false if the previous compaction carries inaccurate records. In order for this not be overwritten by the subsequent progress in [here](https://github.com/facebook/rocksdb/blob/1aca60c089a48857930b4191b0c84b6dd98a221c/db/compaction/compaction_job.cc#L1540-L1543), this PR also changes this = to AND operation and +=. With the default value `CompactionJobStats::has_num_input_records` now to be true (or Reset() already called) and `CompactionJobStats::num_input_records=0` already, this will not a behavior change.

Pull Request resolved: facebook#13929

Test Plan: - Existing UT to test "...changes the default value to be true" is safe.

Reviewed By: jaykorean

Differential Revision: D82014912

Pulled By: hx235

fbshipit-source-id: 6f211c3b2c9eb7d39abf37271d21a4d3f407b934
…cebook#13945)

Summary:
This PR enables Stress Test to fall back to local compaction when a remote compaction fails, allowing the compaction to be retried on the main thread.

If the local compaction succeeds, the stress test will continue without failing. The main thread will log that the remote compaction failed and was retried locally, while detailed failure logs from the remote compaction attempt will still be printed by the worker thread for further investigation.

This approach allows us to keep collecting useful logs for diagnosing remote compaction failures in Stress Test, while ensuring the test continues to run with remote compaction enabled.

Pull Request resolved: facebook#13945

Test Plan:
```
python3 -u tools/db_crashtest.py --cleanup_cmd='' --simple blackbox --remote_compaction_worker_threads=8 --interval=10
```

# Internal Only

https://www.internalfb.com/sandcastle/workflow/1315051091202224133

https://www.internalfb.com/sandcastle/workflow/3382203320165521367

https://www.internalfb.com/sandcastle/workflow/2616591383512372892

https://www.internalfb.com/sandcastle/workflow/4607182418810099066

Reviewed By: hx235

Differential Revision: D82279337

Pulled By: jaykorean

fbshipit-source-id: 6f663ec2eeb642fd4ad885a90efb344432a32f89
…hreads could select the same non-L0 file (facebook#13946)

Summary:
**Context/Summary:**
Fix a race condition (illustrated below) in FIFO size-based compaction where concurrent threads could select the same non-L0 file, causing assertion failures in debug builds or "Cannot delete table file from LSM tree" errors in release builds.
```
Thread 1                           Thread 2
--------                           --------
FIFO size-based compaction
   ↓
Pick L2 file
   ↓
Mark: file.being_compacted = true (file.being_compacted was false)
   ↓
WriteManifestStart (unlock mutex) ─→ FIFO size-based compaction starts
   ↓                                   ↓
Continue manifest write...          Pick SAME L2 file
   ↓                                Mark: file.being_compacted = true  (file.being_compacted was true) ❌
   ↓                                   ↓
   ↓                                Unlock mutex, wait for manifest
   ↓                                   ↓
Lock mutex ←─────────────────────────────────┘
   ↓
Delete L2 file ✅
   ↓
Complete ─────────────────────────────→ Try delete same file ❌
                                        ↓
                                     ERROR: "file not in LSM tree"

🐛 BUG: Both threads pick the same file!
    Thread 2 doesn't properly check file.being_compacted flag
```

**Test**
New test that fails before the fix and passes after

Pull Request resolved: facebook#13946

Reviewed By: xingbowang

Differential Revision: D82279731

Pulled By: hx235

fbshipit-source-id: b426517f2d1b23dd7d4951157822a2d322fe1435
Summary:
we saw some crash test failure at https://github.com/facebook/rocksdb/blob/f46242cef631351a5c8f4a7b0fb0935ec7fa61c8/table/block_based/block_based_table_iterator.cc#L964-L965. This is likely due to timestamp not being considered properly in some places in MultiScan code paths. This PR fixes the issue.

Pull Request resolved: facebook#13938

Test Plan: crash test with timestamp and multiscan: `python3 -u ./tools/db_crashtest.py whitebox --enable_ts --iterpercent=60 --prefix_size=-1 --prefixpercent=0 --readpercent=0 --test_batches_snapshots=0 --use_multiscan=1 --read_fault_one_in=0 --kill_random_test=88888 --interval=60`

Reviewed By: anand1976

Differential Revision: D82175263

Pulled By: cbi42

fbshipit-source-id: 5d40ede1aec15f8faeaa7fd041b939e68611ff73
Summary:
Complete redo of parallel compression in block_based_table_builder.cc to greatly reduce cross-thread hand-off and blocking. A ring buffer of blocks-in-progress is used to essentially bound working memory while enabling high throughput. Unlike before, all threads can participate in compression work, for a kind of work-stealing algorithm that reduces the need for threads to block. This builds on improvements in facebook#13850

Previously, there was either
* parallel_threads==1, the *emit thread* (caller from flush/compaction) doing all the work
* parallel_threads > 1, the emit thread generates uncompressed blocks, `parallel_threads` worker threads compress blocks, and a writer thread writes to the SST file. Total of `parallel_threads + 2` threads participating. (Other bookkeeping in emit and write steps omitted from description for simplicity.)

Now we have either
* parallel_threads==1 (same), the emit thread doing all the work
* parallel_threads > 1, the emit thread generates uncompressed blocks and can take up compression work when the ring buffer is full; `parallel_threads` worker threads have as their top priority to write compressed blocks to the SST file but also take up compression work in priority order of next-to-write. Total of `parallel_threads + 1` threads participating. In some cases, this could result in less throughput than before, but arguably the previous implementation was using more threads than explicitly allowed.

## Future/alternate considerations
Although we could likely have used some framework for micro-work sharing across threads, that could be difficult with the asymmetry of work loads and thread affinity. Specifically, (a) it would be quite challenging to allow emit work in other threads, because it happens in the caller of BlockBasedTableBuilder, (b) async programming is unlikely to pay off until we have an async interface for writing SST files, and (c) this implementation will nevertheless serve as a benchmark for what we lose or gain in such a framework vs. a hand-tuned system.

This implementation still creates and destroys threads for each SST file created. We hope in the future to have more governance and/or pooling of worker threads across various flushes and compactions, but that is not available currently and would require significant design and implementation work.

## More details
* This implementation makes use of semaphores for idling and re-waking threads. `std::counting_semaphore` and `binary_semaphore` offer the best performance (see benchmark results below) but some implementations are known to have correctness bugs. Also, my attempt at upgrading CI for C++20 support (required for these) in facebook#13904 is actually incomplete. Therefore, using these structures is opt-in with `-DROCKSDB_USE_STD_SEMAPHORES` at compile time, and a naive semaphore implementation based on mutex and condvar is used by default. A folly alternative (folly::fibers::Semaphore) was dropped in during development and found to be less efficient than the naive implementation. One CI job is upgraded to test with the new opt-in.
* One of the biggest concerns about correctness/reliability for this implementation is the possibility of hitting a deadlock, in part because that is not well checked in the DB crash test (a challenging problem!). Note also that with the parallel compression improvements in this release, I am calling the feature production-ready, so there is an extra level of confidence needed in the reliability of the feature. Thus, for DEBUG builds including crash test, I have added a watchdog thread to each parallel SST construction that heuristically checks for the most likely kinds of deadlock that could happen, including for the case of buggy semaphore implementations. It periodically verifies that some thread is outside of its "idle" state, and if the watchdog wakes up repeatedly to see all live threads stuck in their idle state (even if wake-up was attempted) then it declares a deadlock. This feature was manually verified for several seeded deadlock bugs. (More details in code comments.)
* For CPU efficiency, this implementation greatly simplifies the logic to estimate the outstanding or "inflight" size not yet written to the SST file. I expect this size to generally be insignificant relative to the full SST file size so is not worth careful engineering. And based on Meta's current needs, landing under-size for an SST file is better than over-size. See comments on `estimated_inflight_size` for details.
* Some other existing atomics in block_based_table_builder.cc modified to use safe atomic wrappers.
* Status handling in BlockBasedTableBuilder was streamlined to get rid of essentially redundant `status`+`io_status` fields and associated code. Made small optimizations to reduce unnecessary IOStatus copies (with StatusOk()) and mark status conditional branches as LIKELY or UNLIKELY.
* Prefer inline field initialization to initialization in constructor.
* Minimize references to the `parallel_threads` configuration parameter for better separation of concerns / sanitization / etc.  For example, use non-nullity of `pc_rep` to indicate that parallel compression is enabled (and active).
* Some other refactoring to aid the new implementation.

Pull Request resolved: facebook#13910

Test Plan:
## Correctness
Already integrated into unit tests and crash test. CI updated for opt-in semaphore implementation. Basic semaphore unit tests added/updated.

As for the tremendous simplification of logic relating to hitting target SST file size, as expected, the new behavior could under-shoot the single-threaded behavior by a small number of blocks, which will typically affect the file size by ~1/1000th or less. I think that's a good trade-off for cutting out unnecessarily complex code with non-trivial CPU cost (FileSizeEstimator).
```
./db_bench -db=/dev/shm/dbbench_filesize_after8 -benchmarks=fillseq,compact -num=10000000 -compression_type=zstd -compression_level=8 -compression_parallel_threads=8
```

Before, PT=8 & PT=1, and After PT=1 the same or very similar
```
-rw-r--r-- 1 peterd users 67474097 Sep 12 15:32 000052.sst
-rw-r--r-- 1 peterd users 67474214 Sep 12 15:32 000053.sst
-rw-r--r-- 1 peterd users 67473834 Sep 12 15:32 000054.sst
-rw-r--r-- 1 peterd users 67473437 Sep 12 15:32 000055.sst
-rw-r--r-- 1 peterd users 67473835 Sep 12 15:32 000056.sst
-rw-r--r-- 1 peterd users 67473204 Sep 12 15:33 000057.sst
-rw-r--r-- 1 peterd users 67473294 Sep 12 15:33 000058.sst
-rw-r--r-- 1 peterd users 67473839 Sep 12 15:33 000059.sst
```

After, PT=8 (worst case here ~0.05% smaller)
```
-rw-r--r-- 1 peterd users 67463189 Sep 12 14:55 000052.sst
-rw-r--r-- 1 peterd users 67465233 Sep 12 14:55 000053.sst
-rw-r--r-- 1 peterd users 67466822 Sep 12 14:55 000054.sst
-rw-r--r-- 1 peterd users 67466221 Sep 12 14:55 000055.sst
-rw-r--r-- 1 peterd users 67441675 Sep 12 14:55 000056.sst
-rw-r--r-- 1 peterd users 67467855 Sep 12 14:55 000057.sst
-rw-r--r-- 1 peterd users 67455132 Sep 12 14:55 000058.sst
-rw-r--r-- 1 peterd users 67458334 Sep 12 14:55 000059.sst
```

## Performance, modest load
We are primarily interested in balancing throughput in building SST files and CPU usage in doing so. (For example, we could maximize throughput by having worker threads only spin waiting for work, but that would likely be extra CPU usage we want to avoid to allow other productive CPU work to be scheduled.) No read path code has been touched.

A benchmark script running "before" and "after" configurations at the same time to minimize random machine load effects:
```
$ SUFFIX=`tty | sed 's|/|_|g'`; for CT in none lz4 zstd; do for PT in 1 2 3 4 6 8; do echo -n "$CT pt=$PT -> "; (for I in `seq 1 10`; do BIN=/tmp/dbbench${SUFFIX}.bin; rm -f $BIN; cp db_bench $BIN; /usr/bin/time $BIN -db=/dev/shm/dbbench$SUFFIX --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 -compression_type=$CT -compression_parallel_threads=$PT 2>&1; done) | awk '/micros.op/ {n++; sum += $5;} /system / { cpu += $1 + $2; } END { print "ops/s: " int(sum/n) " cpu*s: " cpu; }'; done; done
```

Before this change:
```
none pt=1 -> ops/s: 1999603 cpu*s: 72.08
none pt=2 -> ops/s: 1871094 cpu*s: 148.3
none pt=3 -> ops/s: 1882907 cpu*s: 147.7
lz4  pt=1 -> ops/s: 1987858 cpu*s: 94.74
lz4  pt=2 -> ops/s: 1590192 cpu*s: 182.65
lz4  pt=3 -> ops/s: 1896294 cpu*s: 174.7
lz4  pt=4 -> ops/s: 1949174 cpu*s: 172.26
lz4  pt=6 -> ops/s: 1912517 cpu*s: 175.91
lz4  pt=8 -> ops/s: 1930585 cpu*s: 176.71
zstd pt=1 -> ops/s: 1239379 cpu*s: 129.85
zstd pt=2 -> ops/s: 1171742 cpu*s: 226.12
zstd pt=3 -> ops/s: 1832574 cpu*s: 214.21
zstd pt=4 -> ops/s: 1887124 cpu*s: 212.51
zstd pt=6 -> ops/s: 1920936 cpu*s: 211.7
zstd pt=8 -> ops/s: 1885544 cpu*s: 214.87
```

After this change:
```
none pt=1 -> ops/s: 1964361 cpu*s: 72.66
none pt=2 -> ops/s: 1914033 cpu*s: 104.95
none pt=3 -> ops/s: 1978567 cpu*s: 100.24
lz4  pt=1 -> ops/s: 2041703 cpu*s: 92.88
lz4  pt=2 -> ops/s: 1903210 cpu*s: 121.64
lz4  pt=3 -> ops/s: 1973906 cpu*s: 122.22
lz4  pt=4 -> ops/s: 1952605 cpu*s: 123.05
lz4  pt=6 -> ops/s: 1957524 cpu*s: 124.31
lz4  pt=8 -> ops/s: 1986274 cpu*s: 129.06
zstd pt=1 -> ops/s: 1233748 cpu*s: 130.43
zstd pt=2 -> ops/s: 1675226 cpu*s: 158.41
zstd pt=3 -> ops/s: 1929878 cpu*s: 159.77
zstd pt=4 -> ops/s: 1916403 cpu*s: 160.99
zstd pt=6 -> ops/s: 1942526 cpu*s: 166.21
zstd pt=8 -> ops/s: 1966704 cpu*s: 171.56
```

For parallel_threads=1, results are very similar, as expected.

For parallel_threads>1, throughput is usually improved a bit, but cpu consumption is dramatically reduced. For zstd, maximum throughput is essentially achieved with pt=3 rather than the previous roughly pt=4 to 6. And the old used about 30% more CPU.

We can also compare with more expensive compression by raising the compression level.
```
SUFFIX=`tty | sed 's|/|_|g'`; CT=zstd; for CL in 4 6 8; do for PT in 1 4 8; do echo -n "$CT@$CL pt=$PT -> "; (for I in `seq 1 10`; do BIN=/tmp/dbbench${SUFFIX}.bin; rm -f $BIN; cp db_bench $BIN; /usr/bin/time $BIN -db=/dev/shm/dbbench$SUFFIX --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 -compression_type=$CT -compression_parallel_threads=$PT -compression_level=$CL 2>&1; done) | awk '/micros.op/ {n++; sum += $5;} /system / { cpu += $1 + $2; } END { print "ops/s: " int(sum/n) " cpu*s: " cpu; }'; done; done
```

Before:
```
zstd@4 pt=1 -> ops/s:  883630 cpu*s: 161.12
zstd@4 pt=4 -> ops/s: 1878206 cpu*s: 243.25
zstd@4 pt=8 -> ops/s: 1885002 cpu*s: 245.89
zstd@6 pt=1 -> ops/s:  710767 cpu*s: 189.44
zstd@6 pt=4 -> ops/s: 1706377 cpu*s: 277.29
zstd@6 pt=8 -> ops/s: 1866736 cpu*s: 275.07
zstd@8 pt=1 -> ops/s:  529047 cpu*s: 237.87
zstd@8 pt=4 -> ops/s: 1401379 cpu*s: 330.61
zstd@8 pt=8 -> ops/s: 1895601 cpu*s: 321.59
```

After:
```
zstd@4 pt=1 -> ops/s:  889905 cpu*s: 161.03
zstd@4 pt=4 -> ops/s: 1942240 cpu*s: 193.18
zstd@4 pt=8 -> ops/s: 1922367 cpu*s: 205.21
zstd@6 pt=1 -> ops/s:  713870 cpu*s: 188.91
zstd@6 pt=4 -> ops/s: 1832314 cpu*s: 219.66
zstd@6 pt=8 -> ops/s: 1949631 cpu*s: 229.34
zstd@8 pt=1 -> ops/s:  530324 cpu*s: 238.02
zstd@8 pt=4 -> ops/s: 1479767 cpu*s: 271.65
zstd@8 pt=8 -> ops/s: 1949631 cpu*s: 275.6
```

And we can also look at the cumulative effect of this change and  facebook#13850 that will combine for the parallel compression improvements in the upcoming 10.7 release:

Before both:
```
lz4 pt=1 -> ops/s: 1954445 cpu*s: 95.14
lz4 pt=3 -> ops/s: 1687043 cpu*s: 186.62
lz4 pt=5 -> ops/s: 1708196 cpu*s: 188.33
zstd pt=1 -> ops/s: 1220649 cpu*s: 131.2
zstd pt=3 -> ops/s: 1658100 cpu*s: 227.08
zstd pt=5 -> ops/s: 1685074 cpu*s: 226.08
```

After:
```
lz4 pt=1 -> ops/s: 2048214 cpu*s: 93.24
lz4 pt=3 -> ops/s: 1922049 cpu*s: 122.9
lz4 pt=5 -> ops/s: 1980165 cpu*s: 122.49
zstd pt=1 -> ops/s: 1245165 cpu*s: 128.84
zstd pt=3 -> ops/s: 1956961 cpu*s: 158.73
zstd pt=5 -> ops/s: 1970458 cpu*s: 161.02
```

In summary, before with zstd default level, you could see only
* about 38% increase in throughput for about 73% increase in CPU usage

Now you can get
* about 58% increase in throughput for about 25% increase in CPU usage

## Performance, high load
To validate this for usage on remote compaction workers, we also need to test whether it falls over at high load or anything concerning like that. For this I did a lot of testing with concurrent db_bench and zstd compression_level=8 and parallel_thread (PT) in {1,8} trying to observe "bad" behaviors such as stalls due to preempted threads and such. On a 166 core machine where a "job" is a db_bench process running a fillseq benchmark similar to above in parallel with others, I could summarize the results like this:

10 jobs PT=8 vs. PT=1 -> 12% more CPU usage, 75% reduction in wall time, 1.9 jobs/sec (vs. 0.5)
50 jobs PT=8 vs. PT=1 -> 89% more CPU usage, 27% reduction in wall time, 3.1 jobs/sec (vs. 2.3)
100 jobs PT=8 vs. PT=1 -> 24% more CPU usage, 5% reduction in wall time, 3.25 jobs/sec (vs. 3.1)
150 jobs PT=8 vs. PT=1 -> 4% more CPU usage, 2% increase in wall time, 3.3 jobs/sec (vs. 3.4)
500 jobs PT=8 vs. PT=1 -> 1% more CPU usage, insignificant difference in wall time, 3.3 jobs/sec

Even when there are 4000 threads potentially competing for 166 cores, the throughput (3.3 jobs / sec) is still very close to maximum (3.4). Enabling parallel compression didn't result in notably less throughput (based on wall clock time for all jobs to complete) in any case tested above, and much higher throughput for many cases. If parallel compression causes us to tip from comfortably under-saturating to over-saturating the cores (as in the 50 jobs case), the overall CPU usage can be much higher, presumably due to lower CPU cache hit rates and maybe clock throttling, but parallel compression still has the throughput advantage in those cases.

In other words, what would we stand to gain from being able to intelligently share worker threads between compaction jobs? It doesn't seem that much.

Reviewed By: xingbowang

Differential Revision: D81365623

Pulled By: pdillinger

fbshipit-source-id: 5db5151a959b5d25b84dbe185bc208bd188f2d1c
Summary:
add option MultiScanArgs::use_async_io option and implementation for using ReadAsync() for multiscan. Read requests are submitted during Prepare() and polled during actual scanning.

Pull Request resolved: facebook#13932

Test Plan:
- updated existing unit test to use async_io.
- crash test: `python3 -u ./tools/db_crashtest.py whitebox --iterpercent=60 --prefix_size=-1 --prefixpercent=0 --readpercent=0 --test_batches_snapshots=0 --use_multiscan=1 --read_fault_one_in=0 --kill_random_test=88888 --interval=60 --multiscan_use_async_io=1 --mmap_read=0`

Benchmark:
- Default multiscan benchmark:
```
Set up: /db_bench --benchmarks="fillseq,compact" --disable_wal=1 --threads=1 --num_levels=1 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=1000 --write_buffer_size=268435456

Without async IO:
./db_bench --db="/tmp/rocksdbtest-543376/dbbench" --use_existing_db=1 --benchmarks=multiscan --disable_auto_compactions=1 --seek_nexts=100 --threads=32 --duration=10 --statistics=1 --use_direct_reads=1 --multiscan_use_async_io=0

multiscan    :     415.569 micros/op 75805 ops/sec 10.355 seconds 784968 operations; (multscans:24999)
rocksdb.read.async.micros COUNT : 0

With asycn IO:
./db_bench --db="/tmp/rocksdbtest-543376/dbbench" --use_existing_db=1 --benchmarks=multiscan --disable_auto_compactions=1 --seek_nexts=100 --threads=32 --duration=10 --statistics=1 --use_direct_reads=1 --multiscan_use_async_io=1

multiscan    :     413.236 micros/op 76044 ops/sec 10.375 seconds 788968 operations; (multscans:24999)
rocksdb.read.async.micros COUNT : 3916499

Similar performance.
```

- Larger scan, more scans per multiscan, do not coalesce IO so that async IO can progress while scanning, and use one thread:
```
multiscan_stride = 1000
multiscan_size = 100
seek_nexts = 1000

./db_bench --db="/tmp/rocksdbtest-543376/dbbench" --use_existing_db=1 --benchmarks=multiscan --disable_auto_compactions=1 --threads=1 --duration=10 --statistics=0 --use_direct_reads=1  --cache_size=2097152 --multiscan_size=100 --multiscan_stride=1000 --seek_nexts=1000 --seed=1 --multiscan_coalesce_threshold=0  --multiscan_use_async_io=0

Without async IO:
multiscan    :   20495.205 micros/op 48 ops/sec 10.002 seconds 488 operations; (multscans:488)

With async IO:
multiscan    :   18337.883 micros/op 54 ops/sec 10.013 seconds 546 operations; (multscans:546)

~10% improvement in throughput
```

Reviewed By: xingbowang

Differential Revision: D82077818

Pulled By: cbi42

fbshipit-source-id: 66e32cf4039183c4841827409286dfbaa6dfbcd8
Summary:
... reporting false positive double-lock on some of the new parallel compression code. Switching from std::condition_variable to condition_variable_any simply changes the FP from double-lock to lock inversion. In addition, leaking ParallelCompressionRep instances to avoid memory location reuse fails to fix the FP reports. Thus, I've decided to disable the watchdog with GCC+TSAN.

Pull Request resolved: facebook#13958

Test Plan: local crash test runs could reproduce, now don't reproduce. CLANG TSAN doesn't seem to be reporting the same supposed issues

Reviewed By: xingbowang

Differential Revision: D82555968

Pulled By: pdillinger

fbshipit-source-id: 537fbc3a787f917915a6faf0bdedd1449a7f378a
hx235 and others added 29 commits February 13, 2026 10:37
…acebook#14329)

Summary:
**Context/Summary:**

Internal adoption has demonstrated stability and measurable improvements of this feature without much cost so we can turn it on by default. Eventually we'd like to remove this configuration and make this an expected behavior.

Pull Request resolved: facebook#14329

Test Plan: Existing unit test

Reviewed By: mszeszko-meta

Differential Revision: D93210059

Pulled By: hx235

fbshipit-source-id: 04f77954e6624c8e60a2db030eb19eb341dd0fcf
…14325)

Summary:
In follow-up to facebook#14315

Remove obsolete code replaced by new Compressor/Decompressor interface:
* OLD_CompressData and OLD_UncompressData
* Individual compression/decompression functions (Snappy_*, Zlib_*, BZip2_*, LZ4_*, LZ4HC_*, XPRESS_*, ZSTD_Compress, ZSTD_Uncompress)
* CompressionInfo and UncompressionInfo classes
* UncompressionDict class
* compression::PutDecompressedSizeInfo and GetDecompressedSizeInfo

The only small refactoring in this change that is not pure code removal or movement is in blob_file_builder_test.cc.

Move some function implementations etc. from compression.h to compression.cc:
* CompressionTypeToString, CompressionTypeFromString, CompressionOptionsToString
* ZSTD_TrainDictionary (both overloads), ZSTD_FinalizeDictionary
* DecompressorDict::Populate
* Most compression library includes

Also cleaned up other includes of compression.h, which caused some other files to need new includes.

Pull Request resolved: facebook#14325

Test Plan: existing tests

Reviewed By: hx235

Differential Revision: D93120580

Pulled By: pdillinger

fbshipit-source-id: ab5c50db7379c0387a8c0e379642c9ea2799eae5
Summary:
probably something changed, maybe facebook#14311

Full command:
```
git ls-files '*.cc' '*.h' | grep -v '^third-party/' | grep -v 'range_tree' | xargs clang-format -i
```

Pull Request resolved: facebook#14331

Test Plan: CI

Reviewed By: mszeszko-meta

Differential Revision: D93246992

Pulled By: pdillinger

fbshipit-source-id: 6bc5b97978fef8aee52823dadb6daa4bea57343d
…#14247)

Summary:
Interpolation search is an alternative algorithm to binary search, which performs better on uniformly distributed keys. Instead of binary search always computing the mid point of the left and right boundaries, interpolation search "interpolates" the mid point based on the distance to the target. Fortunately, we can re-use existing block format to support interpolation search.

For a given block, we compute the shared_prefix length of the first and last key. Interpolation search is usually done with numerical target values, so for a variable binary length key, we calculate the "value" as the first 8 non-shared bytes. This also means interpolation search would only really be effective for bytewise comparator (guarded via options validations).

#### Fallback to binary search
- if the the val(left_key) == val(right_key) then we fallback to classic binary search (to avoid divide by 0)
- interpolation search is significantly more computationally expensive than binary search, so when the search distance is small, we also fallback to binary search.
- if interpolation search does not make significant progress (i.e. reduces search space by more than half each iteration), we can assume data is non-uniform and fallback.

Interpolation search also performs best when there is minimal shortening, especially shortening of the last block, as it can heavily skew the distribution of the actual keys.

Note that each search algorithm is guaranteed to make progress because at each iteration the search space is guaranteed to be reduce by at least 1.

For now this change only applies to index block seeks, as data block seeks and other blocks do not have as many entries and would not require significant number of search rounds, but it could be easily extended to include that support.

Pull Request resolved: facebook#14247

Test Plan:
Updated unit tests and crash test with new search option

### Benchmark
The default benchmark sets up keys in generally uniform distribution, so it was a good way to test performance improvements.

Setup: `./db_bench -benchmarks=fillseq,compact -index_shortening_mode=1`

#### Before this change
```
./db_bench -use_existing_db=true -benchmarks=readrandom -seed=1

readrandom   :       2.899 micros/op 344973 ops/sec 2.899 seconds 1000000 operations;   38.2 MB/s (1000000 of 1000000 found)
```

#### After this change

Notice how key comparison counts are the same between the two.
```
./db_bench -use_existing_db=true -benchmarks=readrandom -seed=1 -index_search_type=binary_search

readrandom   :       2.881 micros/op 347128 ops/sec 2.881 seconds 1000000 operations;   38.4 MB/s (1000000 of 1000000 found)
```

```
./db_bench -use_existing_db=true -benchmarks=readrandom -seed=1 -index_search_type=interpolation_search

readrandom   :       2.609 micros/op 383209 ops/sec 2.610 seconds 1000000 operations;   42.4 MB/s (1000000 of 1000000 found)
```

With a non-uniform distribution, `i.e. index_shortening_mode=2`

```
./db_bench -use_existing_db=true -benchmarks=readrandom -seed=1 -index_search_type=binary_search

readrandom   :       2.958 micros/op 338075 ops/sec 2.958 seconds 1000000 operations;   37.4 MB/s (1000000 of 1000000 found)
```

```
./db_bench -use_existing_db=true -benchmarks=readrandom -seed=1 -index_search_type=interpolation_search

readrandom   :       5.502 micros/op 181750 ops/sec 5.502 seconds 1000000 operations;   20.1 MB/s (1000000 of 1000000 found)
```

Reviewed By: pdillinger

Differential Revision: D91063163

Pulled By: joshkang97

fbshipit-source-id: 151d6aa76f8713740b714de6e406aff40d28ccbc
Summary:
Add a new kv ratio based compaction picking algorithm in fifo compaction

Pull Request resolved: facebook#14326

Test Plan: Unit test

Reviewed By: pdillinger

Differential Revision: D93257941

Pulled By: xingbowang

fbshipit-source-id: fd2d0e1356c7b54682a1197475a1bd26cb45c9d4
Summary:
Temporarily disable interpolation search.

Pull Request resolved: facebook#14339

Reviewed By: pdillinger

Differential Revision: D93497658

Pulled By: mszeszko-meta

fbshipit-source-id: 6ac826dd3fc354e18af0d928f87ed71e2cef3f14
Summary:
FIFO crash tests fail on DB open when `fifo_compaction_max_data_files_size_mb` is randomly set to 100 or 500 MB, because `max_table_files_size` defaults to 1GB and the validation requires `max_data_files_size` >= `max_table_files_size` when non-zero. Cap `max_table_files_size` to `max_data_files_size` in db_stress when the latter is set. `max_table_files_size` is ignored at runtime when `max_data_files_size` is non-zero, so this only satisfies the validation constraint.

Pull Request resolved: facebook#14341

Reviewed By: pdillinger

Differential Revision: D93503113

Pulled By: mszeszko-meta

fbshipit-source-id: 5c3e7c9b568661244c71c548cb0fe5e55472c0ca
Summary:
Pull Request resolved: facebook#14340

`ReadBe64FromKey` pads its result with `val <<= (8 - len) * 8` to right-align partial reads. When the seek target's user key is shorter than shared_prefix_len, len is 0 and this becomes a shift by 64, which is undefined behavior for uint64_t. On x86 this happens to produce 0 (the correct result), but UBSan rightfully flags it. Guard the shift with `len > 0 && len < 8`.

Reviewed By: joshkang97

Differential Revision: D93435715

fbshipit-source-id: bab128e9a65ea18d401670268cbac77d45e11340
Summary:
Pull Request resolved: facebook#14337

## Context:

D91624185 changed `FilePrefetchBuffer::PollIfNeeded` from `void` to returning `Status`, correctly propagating `Poll` errors instead of silently swallowing them. A side effect is that when io_uring fails to initialize at runtime (e.g., sandcastle seccomp restrictions), `ReadAsync` returns `NotSupported` which now propagates through `PrefetchRemBuffers` and `HandleOverlappingAsyncData`, causing `PrefetchInternal` to return early before executing the synchronous `Read` that the caller actually depends on. This leaves iterators invalid with no recovery path — both in `db_stress` crash tests and in production. The filesystem advertises async IO support (`CheckFSFeatureSupport` passes), so the failure only surfaces at runtime when io_uring initialization fails. The prior behavior silently degraded to sync reads because `PollIfNeeded` swallowed the error.

## Changes

Add a sync fallback in `FilePrefetchBuffer::ReadAsync` — the single chokepoint for all async reads. When `reader->ReadAsync()` returns `NotSupported`, fall back to `reader->Read()` synchronously, populate the buffer inline, and return OK. Since `async_read_in_progress_` stays false, `PollIfNeeded` becomes a no-op (nothing to poll, data is already there). All callers — `PrefetchRemBuffers`, `HandleOverlappingAsyncData`, `PrefetchAsync` — work transparently without any per-site changes.

Reviewed By: archang19

Differential Revision: D93432284

fbshipit-source-id: daef185fc3535e347d182e75dd443ae921eeb495
…acebook#14321)

Summary:
Pull Request resolved: facebook#14321

Add file_checksum and file_checksum_func_name fields to FileOptions so that downstream FileSystem implementations can access per-file checksum metadata when SST files are opened. The fields are populated from FileMetaData at all call sites where SST files are opened via NewRandomAccessFile: TableCache::GetTableReader, Version::GetTableProperties, and CompactionJob::ReadTablePropertiesDirectly. Also fixes the fallback path in TableCache::GetTableReader to use the local fopts (with temperature and checksum) instead of the original file_options.

Added a kNoFileChecksumFuncName which is distinct from  kUnknownFileChecksumFuncName:

 - kUnknownFileChecksumFuncName ("Unknown"): We have FileMetaData for this file, and the metadata says no checksum was computed (no factory was configured when the file was written). This is a property of the file itself.
- kNoFileChecksumFuncName ("Unavailable"): We don't even have FileMetaData — we're opening this file in a context where there's no checksum metadata to propagate at all (e.g., SstFileDumper, SstFileReader, checksum generation). It's a property of the call site, not the file.

So the assertion file_checksum.empty() is correct for both, but for different reasons — one says "the file has no checksum," the other says "we have no idea about this file's checksum."

Reviewed By: pdillinger

Differential Revision: D92728944

fbshipit-source-id: 8fd34ea22ca87090b26d0a55c921f354f97f1ffc
…nputs (facebook#14333)

Summary:
**Context/Summary:**

Stress test recently encountered a one-off failure where input file selection for trivial move did not select all the files it should and left behind one adjacent file to the input file. This violated the clean cut invariant enforced through `ExpandInputsToCleanCut()` and caused `Get()` to return stale data.

While I had no luck reproducing it nor in code inspection to find the root cause, this debug assertion should help in two ways: 1. Fail fast if the invariant is violated, showing us the file boundary in memory 2. If the assertion doesn't trigger yet the same failure occurs, it points to metadata corruption bypassing this check and ExpandInputsToCleanCut() enforcement

Pull Request resolved: facebook#14333

Test Plan:
- Existing unit tests
- Manually trace through and run a stress test command that frequently exercises this check for 10 minutes
```
./db_stress --level0_file_num_compaction_trigger=2 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_setting_blob_options_dynamically=1 --async_io=1 --auto_readahead_size=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --blob_cache_size=1048576 --blob_compaction_readahead_size=4194304 --blob_compression_type=lz4 --blob_file_size=1048576 --blob_file_starting_level=1 --blob_garbage_collection_age_cutoff=1.0 --blob_garbage_collection_force_threshold=0.75 --block_protection_bytes_per_key=2 --block_size=16384 --bloom_before_level=0 --bloom_bits=16 --bottommost_compression_type=zstd --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=1000000 --checksum_type=kXXH3 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_readahead_size=0 --compaction_ttl=2 --compression_checksum=1 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=0 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_wal=0 --enable_blob_files=0 --enable_blob_garbage_collection=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --enable_thread_tracking=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=big --flush_one_in=1000000 --format_version=3 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=3 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --long_running_snapshots=1 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=1000 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=1000 --memtable_max_range_deletions=100 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_blob_size=0 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=1 --prefix_size=8 --prefixpercent=5 --prepopulate_blob_cache=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=32 --readahead_size=0 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --secondary_cache_fault_one_in=32 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608 --set_options_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=1000 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=3 --use_blob_cache=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=5 --use_shared_block_and_blob_cache=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=1000 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=35
```

Reviewed By: mszeszko-meta

Differential Revision: D93300664

Pulled By: hx235

fbshipit-source-id: b56f01c08a7348ba383110dd8f89b5b1b7961c55
Summary:
Fixes blog author display issues on rocksdb.org/blog by:

* Adding missing authors to authors.yml: pdillinger, alanpaxton, akankshamahajan15, anand1976, poojam23
* Standardizing on GitHub usernames: renamed sdong → siying
* Fixing typo in 2016-02-25-rocksdb-ama.markdown: yhchiang → yhciang
* A short note in CLAUDE.md

Authors were not showing on the blog because they were referenced in post frontmatter but not defined in the _data/authors.yml lookup file.

Pull Request resolved: facebook#14342

Test Plan: push & see ;)

Reviewed By: mszeszko-meta

Differential Revision: D93523972

Pulled By: pdillinger

fbshipit-source-id: 757c33e80f3c1d99ff4134a37321f40634d6e294
Summary:
This PR fixes a bug in the interaction between WritePrepared/WriteUnprepared TransactionDB (with two_write_queues=true) and background error recovery. This bug caused crash tests to fail with a "sequence number going backwards" error during DB open.

Root Cause
------------
When two_write_queues=true, sequence numbers are allocated via FetchAddLastAllocatedSequence() before a write completes, but are only published via SetLastSequence() after the write succeeds. If a background error occurs (e.g., a MANIFEST write failure during flush), the error recovery path in DBImpl::ResumeImpl creates new memtables and WAL files. The new WAL's starting sequence number is based on LastSequence() (the published value), which can be lower than already-allocated sequence numbers that were written to the old WAL. On subsequent recovery, RocksDB detects that sequence numbers in the new WAL are lower than those in the old WAL and reports a "sequence number going backwards" corruption error, causing the DB to fail to open.

Fix
 ---
The fix adds a call to a new VersionSet::SyncLastSequenceWithAllocated() method at the beginning of DBImpl::ResumeImpl, before any new memtables or WALs are created. This method advances last_sequence_ to match last_allocated_sequence_ if the latter is higher, ensuring the new WAL starts with a sequence number that is at least as high as any previously allocated one.

Pull Request resolved: facebook#14313

Test Plan:
---------
Add new unit tests in write_prepared_transaction_test_seqno

Reviewed By: pdillinger

Differential Revision: D92746944

Pulled By: anand1976

fbshipit-source-id: 34385fc13fd74435dd1c3283637eb118f45d887e
Summary:
see draft post

Pull Request resolved: facebook#14078

Test Plan: markdown preview (simple post)

Reviewed By: hx235

Differential Revision: D85475518

Pulled By: pdillinger

fbshipit-source-id: d7f7b0d68de3880fcffbdbbef27cd2c14fe51f96
…es (facebook#14332)

Summary:
I'm implementing this intending it to be used for facebook#14287

Refactor the data block footer encoding/decoding to use a struct-based Encode/Decode API (DataBlockFooter), reserving the top 4 bits of the footer for metadata:
- Bit 31: Hash index present (kDataBlockBinaryAndHash) - existing use
- Bits 28-30: Reserved for future features

Comments have some detail for why it is safe to assume no practical existing SST files would use these newly reserved bits. And for forward compatibility, existing versions detect (non-zero) use of these new bits as impossibly large num_restarts and report "bad block contents". Not perfect, but not bad.

Key changes:
- Replace PackIndexTypeAndNumRestarts/UnPackIndexTypeAndNumRestarts with DataBlockFooter::EncodeTo/DecodeFrom methods
- DecodeFrom returns a detailed error when reserved bits are set, enabling graceful failure on newer format versions
- Reduce kMaxNumRestarts from 2^31-1 to 2^28-1 (268M), which is adequate for the maximum possible restarts in a 4GiB block
- Add GetCorruptionStatus() to Block for detailed error messages (Note that we are sensitive to the size of Block objects, so have to avoid adding unnecessary new members.)
- Remove obsolete kMaxBlockSizeSupportedByHashIndex size checks

Pull Request resolved: facebook#14332

Test Plan:
- Existing unit tests and format compatibility test
- Add test for reserved bit detection (ReservedBitInDataBlockFooter)

Reviewed By: joshkang97

Differential Revision: D93293152

Pulled By: pdillinger

fbshipit-source-id: b65a83e96bb09a98fb9b8b2dd9f754653ca7ed4d
…ok#14315) (facebook#14327)

Summary:
After PR facebook#14315 dropped support for block-based table format_version < 2, several code paths became obsolete. This change removes them.

Investigation findings:

1. Table properties are now a hard requirement for block-based SST files:
   - format_version >= 2 guarantees a properties block exists
   - Removed defensive conditionals like `if (rep_->table_properties)`
   - Missing properties block now returns Status::Corruption instead of just logging an error. This is important because some properties affect the semantic interpretation of the file.

2. Index type property (kIndexType) is now required:
   - kIndexType was introduced in Feb 2014 (commit 74939a9), ~11 months BEFORE format_version was introduced in Jan 2015
   - BlockBasedTablePropertiesCollector::Finish() has always written kIndexType unconditionally for all block-based tables
   - Therefore all format_version >= 2 files have this property
   - Now returns Status::Corruption if missing instead of silently defaulting to kBinarySearch

3. Removed SetOldTableOptions() from sst_file_dumper:
   - This fallback handled files without a properties block
   - Dead code since format_version >= 2 guarantees properties exist

4. Removed kPropertiesBlockOldName ("rocksdb.stats") fallback:
   - The properties block was renamed from "rocksdb.stats" to "rocksdb.properties" in RocksDB 2.7 (April 2014)
   - format_version 2 was introduced in RocksDB 3.10 (Oct 2015)
   - All table formats (block-based, plain, cuckoo) were created after the rename, so they all use "rocksdb.properties"
   - The backward compatibility fallback in FindOptionalMetaBlock() was dead code for all supported table formats

5. Removed obsolete assertion about format_version 0 checksum in BlockBasedTableBuilder::WriteFooter()

Pull Request resolved: facebook#14327

Test Plan: some tests updated for updated requirements. Mostly, CI including format compatible test

Reviewed By: mszeszko-meta

Differential Revision: D93124820

Pulled By: pdillinger

fbshipit-source-id: eb12cbdca0e69f34a08051d5160c282384128a4a
…14335)

Summary:
and remove deprecated DB::MaxMemCompactionLevel(). In the process of pushing through a relatively clean refactoring of uses of the old functions, some other minor public APIs are also migrated from raw DB pointers to unique_ptr.

Claude did pretty much all the work, but requiring dozens of prompts to actually push through relatively clean phase out of raw DB pointers from what needed to be touched, and leaving that code in better shape. (Hundreds of `DB*` still remain all over the place even outside C and Java bindings.)

Pull Request resolved: facebook#14335

Test Plan: existing tests; no functional changes intended

Reviewed By: xingbowang, mszeszko-meta

Differential Revision: D93523820

Pulled By: pdillinger

fbshipit-source-id: e4ca22ad81cd2cfe91122d7507d7ca34fe03d043
Summary:
Extending facebook#14323 by testing scenarios for compaction after downgrade. Detail: we shouldn't need to test loading options with compaction, as options file inclusion is mostly a sanity check for "can you open the DB with options file?"

Pull Request resolved: facebook#14344

Test Plan: manual run of SHORT_TEST=1 J=140 tools/check

Reviewed By: xingbowang

Differential Revision: D93553897

Pulled By: pdillinger

fbshipit-source-id: ec08ae2a3d49971e24a215e38df9506fe1133096
…ebook#14346)

Summary:
Remove deprecated option skip_checking_sst_file_sizes_on_db_open

Pull Request resolved: facebook#14346

Test Plan: Unit test

Reviewed By: hx235

Differential Revision: D93602683

Pulled By: xingbowang

fbshipit-source-id: f576825cb107bb0aeb14f4ff29fef0df269b8728
Summary:
Add clang-tidy-comment workflow. This workflow allows pr clang tidy pr job to post the clang-tidy finding directly on the PR page.

Pull Request resolved: facebook#14348

Test Plan: Will be tested with next clang-tidy PR

Reviewed By: joshkang97

Differential Revision: D93670150

Pulled By: xingbowang

fbshipit-source-id: 8245f9d5bde8cf800d88034c4339de9f387c5692
…acebook#14343)

Summary:
There was an edge case missed in the implementation of interpolation search for target keys that had a length smaller than the shared prefix.

E.g. first_key = "aaaaaa", last_key = "aaaaaz", target_key = "aaz". In the existing setup, we will seek to position 0, but in reality is should be seeked to the end.

#### The fix
The solution here was to also do a bounds check on the first search iteration. We utilize memcmp on the target key with the shared_prefix to determine if the target key is outside the bounds. An edge case here is if the target key itself a prefix of the shared prefix (e.g. target = "aaaa"), in this case memcmp return return 0, but the target key is actually smaller.

### Minor optimizations
- cache left,right values so we don't need to re-compute it when left/right boundaries don't change
- In ReadBe64FromKey, utilize memcpy + swap for fast path
- since we have already computed a shared_prefix, every other comparison only needs to compare the non-shared suffixes

Pull Request resolved: facebook#14343

Test Plan:
Added new unit tests to test this case

### Benchmarks

No significant regressions due to additional memcmp.

#### Configuration
- **CPU:** 192 * AMD EPYC-Genoa Processor
- **RocksDB Version:** 10.12.0
- **Compression:** Snappy
- **Entries:** 1,000,000
- **Value Size:** 100 bytes
- **Index Search Type:** interpolation_search
- **Index Shortening Mode:** 1

#### Results

| Benchmark | Params | ops/s (main) | ops/s (feature) | % change |
|-----------|--------|-------------|-----------------|----------|
| readrandom | 16B keys, no prefix | 367,264 | 369,163 | +0.52% |
| readrandom | 100B keys, prefix_size=50 | 376,066 | 371,193 | -1.29% |

Reviewed By: pdillinger

Differential Revision: D93535267

Pulled By: joshkang97

fbshipit-source-id: beda182efce1e914ff587e697b927347cfa42656
…14353)

Summary:
**Summary/Context:**

Remove the `InRange()` virtual method from `SliceTransform` and all its overrides. This method was marked DEPRECATED, never called by RocksDB, and existed only for backward compatibility.

Also removes the `in_range` callback parameter from `rocksdb_slicetransform_create()` in the C API, which is a breaking change appropriate for a major version release.

Pull Request resolved: facebook#14353

Test Plan: Make check

Reviewed By: xingbowang

Differential Revision: D93795070

Pulled By: hx235

fbshipit-source-id: 5eba23f1d038b19c494997a55e5d8ca379fbedcb
Summary:
In my last version bump, I forgot that the next release would be a major release. We can fix that now ahead of release cut.

I'm also updating folly now because I have experience resolving folly issues. Folly commit e04860553 changed libevent to build as static-only so required a change in our build.

Pull Request resolved: facebook#14357

Test Plan: CI

Reviewed By: xingbowang

Differential Revision: D93797666

Pulled By: pdillinger

fbshipit-source-id: 22179da900f9dc6c5544163071079a4701c7c663
facebook#14349)

Summary:
a couple recent failures in this test. Waiting for purge and disabling sync points before Close should resolve the issues.

Also fixing EventListenerTest.BlobDBOnFlushCompleted because it showed up as flaky in CI for this PR

Pull Request resolved: facebook#14349

Test Plan: watch CI

Reviewed By: mszeszko-meta

Differential Revision: D93619322

Pulled By: pdillinger

fbshipit-source-id: bb9fc7d3c0ecaaeaffe4305e1ad403cbcd597484
…cebook#14352)

Summary:
**Context/Summary:**
Remove `SstFileWriter::Add()` (deprecated in favor of `Put()`) and the `skip_filters` parameter from `SstFileWriter` constructors (deprecated in favor of setting `BlockBasedTableOptions::filter_policy` to `nullptr`).

Both APIs have zero active callers. The `skip_filters` field is also removed from `TableBuilderOptions` (write-side only; the read-side `TableReaderOptions::skip_filters` is unchanged).

Pull Request resolved: facebook#14352

Test Plan: make check

Reviewed By: xingbowang

Differential Revision: D93812389

Pulled By: hx235

fbshipit-source-id: 236b36a6e664758ab5ad90e606bc195d0a6de70f
Summary:
Pull Request resolved: facebook#14355

SupportedOps advertised kAsyncIO based only on the IsIOUringEnabled() weak symbol check, without verifying that the constructor's io_uring probe actually succeeded. Add a thread_local_async_read_io_urings_ null check so kAsyncIO is only reported when the probe passed. Also update the constructor to probe with the same IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN flags that ReadAsync and MultiRead use at runtime.

Reviewed By: anand1976

Differential Revision: D93780065

fbshipit-source-id: 6f51f544b267cb39d09b49949a9485f55eeae12e
…refresh_nanos (facebook#14350)

Summary:
**Context/Summary:**
Remove deprecated, unused APIs and options:
- ReadOptions::managed: This option was not used anymore. The functionality it controlled has been removed long ago.
- ColumnFamilyOptions::snap_refresh_nanos: Deprecated and unused option.

Corresponding C API (rocksdb_readoptions_set_managed) and Java API (ReadOptions.managed/setManaged) are also removed. All related checks an db_impl and db_impl_secondary iterators are cleaned up.

Pull Request resolved: facebook#14350

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D93812438

Pulled By: hx235

fbshipit-source-id: e4a9d21c65f83294b6d0878286ba14024f049bac
Summary:
RocksDB has been using clang-tidy for a long time inside Meta. However, it is not efficient for external contributor, as the result from clang-tidy has to be ferried back through internal contributor. This PR added support to run clang-tidy on external github CI. It added .clang-tidy file based on internal version. It run clang-tidy in a separate pr job and a workflow step would post the pr job result to the PR itself. See example below.

Pull Request resolved: facebook#14347

Test Plan: Github CI

Reviewed By: archang19

Differential Revision: D93862467

Pulled By: xingbowang

fbshipit-source-id: bb4330241036894deb619470efd73a7041a8b62f
…ok#14314)

Summary:
Introduce a new V2 serialization format for wide column entities that supports storing individual column values in blob files. The V2 format adds a column type section that marks each column as either inline or blob-index, enabling per-column blob storage for large values.

Pull Request resolved: facebook#14314

Reviewed By: pdillinger

Differential Revision: D92832066

Pulled By: xingbowang

fbshipit-source-id: 13c24347e1f481a059d67eef987d2d2b184b4a51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.