
Add DBZ seed files#6

Open
mugitya03 wants to merge 1 commit into main from add-dbz-seed-files

Conversation

mugitya03 (Owner) commented Nov 17, 2025

Summary by CodeRabbit

  • New Features
    • Multi-threaded compression support with async streaming and parallel worker threads.
    • Advanced dictionary training algorithms (COVER and FASTCOVER methods).
    • Comprehensive benchmarking framework for performance analysis.
    • Entropy encoding optimizations with FSE and Huffman support.
    • Thread pool infrastructure for concurrent task execution.

Note

Cursor Bugbot is generating a summary for commit 4baa257.


cursor Bot commented Nov 17, 2025

You have run out of free Bugbot PR reviews for this billing cycle. This will reset on December 21.

To receive reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.


coderabbitai Bot commented Nov 17, 2025

Walkthrough

Adds comprehensive implementations of core ZSTD compression components: FSE and Huffman entropy encoding/decoding, a multi-threaded compression pipeline, long-distance matching, block pre-splitting, workspace management, dictionary training (COVER/FASTCOVER), suffix sorting, and benchmarking utilities, spanning the library core, compression engine, dictionary builders, and tools.

Changes

Cohort (files): summary

  • FSE and Huffman Entropy (lib/compress/fse_compress.c, lib/compress/huf_compress.c, lib/common/fse_decompress.c): Finite State Entropy and Huffman table construction, encoding, and decoding with workspace-based memory management, normalization, and BMI2 optimization.
  • Compression Core and Sequences (lib/compress/zstd_compress_sequences.c, lib/compress/zstd_compress_superblock.c): sequence encoding logic, FSE table building, selection strategies, and super-block compression with literal/sequence sub-block handling.
  • Internal Compression Structures (lib/compress/zstd_compress_internal.h, lib/compress/zstd_cwksp.h): compression context typedefs (SeqStore, entropy tables, block state), prototypes, and workspace allocation/lifecycle management with phase-based stratification.
  • Advanced Matching and Splitting (lib/compress/zstd_ldm.c, lib/compress/zstd_preSplit.c): long-distance matching with a gear-based rolling hash and sequence generation; block pre-splitting heuristic using fingerprint-based deviation detection.
  • Threading Infrastructure (lib/common/pool.c, lib/compress/zstdmt_compress.c): thread pool with job queuing and synchronization; multi-threaded streaming compression with buffer/sequence/context pools and per-job lifecycle.
  • C++ Multi-threaded Compression (contrib/pzstd/Pzstd.cpp): C++ pipeline for parallel frame-based compression/decompression with async I/O, buffer work queues, and skippable frame handling.
  • Dictionary Training (lib/dictBuilder/cover.c, lib/dictBuilder/fastcover.c, lib/dictBuilder/zdict.c): COVER and FASTCOVER algorithms with parameter optimization, d-mer frequency analysis, segment selection, and dictionary finalization with entropy tables.
  • Suffix Sorting and BWT (lib/dictBuilder/divsufsort.c): suffix array construction and Burrows-Wheeler transform with bucket-based B*-prefix sorting, multi-pass merging, and OpenMP support.
  • Dictionary I/O and Training (programs/dibio.c): multi-file dictionary training orchestration including file loading, memory budgeting, shuffling, and dispatch to ZDICT training backends.
  • Benchmarking Utilities (programs/benchfn.c, programs/benchzstd.c, contrib/largeNbDicts/largeNbDicts.c): function benchmarking framework with timed iterations and state management; ZSTD memory/file benchmark orchestration with multi-level/CRC validation; dictionary decompression stress test with CSV results.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

  • FSE/Huffman implementations (fse_compress.c, fse_decompress.c, huf_compress.c): Require careful validation of normalization logic, table construction correctness, and boundary handling; BMI2 conditional paths add complexity.
  • Multi-threaded compression (pool.c, zstdmt_compress.c, Pzstd.cpp): Synchronization primitives, job lifecycle, buffer pool management, and potential race conditions demand thorough scrutiny; interactions across threads require tracing.
  • Sequence encoding and super-block compression (zstd_compress_sequences.c, zstd_compress_superblock.c): Cost models, encoding type selection, and fallback logic (including decoder-version workarounds) are dense and interdependent.
  • Workspace management (zstd_cwksp.h): Phase-based allocation, alignment guarantees, and ASAN/MSAN interactions require cross-checking against all callers.
  • Dictionary training (cover.c, fastcover.c, divsufsort.c): Sorting correctness, parameter space exploration, and frequency-based selection heuristics need validation; OpenMP parallelism adds concurrency concerns.
  • Interconnected dependencies: Each compression component integrates with FSE/Huffman and workspace, creating a web of interactions that must be validated holistically.

Poem

🐰 Entropy flows like carrot juice,
Threads dance through compression's deuce,
From FSE tables to dictionaries true,
Workspace juggles every phase anew,
LDM matches bloom and bytes shrink in view! 🥕✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Title check: ⚠️ Warning. The PR title 'Add DBZ seed files' does not align with the extensive changeset, which introduces large compression infrastructure (FSE, Huffman, thread pool, LDM, multi-threaded compression, dictionary training, and benchmarking). Resolution: revise the title to reflect the actual scope, e.g. 'Add comprehensive compression infrastructure: FSE, Huffman, thread pool, and benchmarking tools'.
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 46.64%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (1 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch add-dbz-seed-files

Comment @coderabbitai help to get the list of available commands and usage tips.

mugitya03 (Owner, Author) commented:

@cursor review

        totalCSize += cBlockSize;
    }
    return totalCSize;
}

Bug: Memory Leak: Unfreed Compression Context

Memory leak: ZSTD_CCtx* const cctx is created with ZSTD_createCCtx() but never freed with ZSTD_freeCCtx() before returning from compressBlocks(). This causes the compression context to leak on every call.


if (ci->dictNb >= ci->nbDicts) ci->dictNb = 0;

return srcSize;
}

Bug: Buffer Bug Causes Silent Compression Errors

The compress function passes srcSize as the destination capacity to ZSTD_compress2 instead of the provided dstCapacity parameter, and ignores the return value completely. This causes incorrect buffer size checking and silently ignores compression errors.


size_t nbSrcBlocks = 0;
for (size_t ssnb=0; ssnb < srcSlices.nbSlices; ssnb++) {
size_t pos = 0;
while (pos <= srcSlices.capacities[ssnb]) {

Bug: Off-by-One: Buffer Boundary Violation

Off-by-one error in splitSlices: the condition while (pos <= srcSlices.capacities[ssnb]) allows pos to equal the capacity, causing one extra iteration that registers a block starting at the very end of the buffer. This inflates the block count and produces a block descriptor extending past the end of the source slice.


unsigned const ms_per_round = RUN_TIME_DEFAULT_MS;
unsigned const total_time_ms = nbRounds * ms_per_round;

double *const speedPerRound = (double *)malloc(nbRounds * sizeof(double));

Bug: Memory Leak in Benchmarking Function

Memory leak in benchMem: speedPerRound is allocated with malloc(nbRounds * sizeof(double)) but never freed before the function returns.



size_t ssnb = 0;
for (size_t sliceNb=0; sliceNb < nbSlices; ) {
ssnb = (ssnb + 1) % srcSlices.nbSlices;

Bug: Slice Distribution Off-By-One

In splitSlices, ssnb is initialized to 0 then immediately set to (ssnb + 1) % srcSlices.nbSlices which equals 1, causing the first source slice at index 0 to always be skipped when distributing slices.



coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (3)
lib/compress/zstd_compress_superblock.c (2)

48-49: Consider defining named constants for header size bounds.

The magic number 200 and threshold values 1 KB and 16 KB reduce readability. Consider defining these as named constants with explanatory comments.


356-356: Consider documenting the fallback multiplier.

The nbSeq * 10 fallback estimate uses a magic multiplier. A brief comment explaining why 10 is chosen as the worst-case bytes-per-sequence estimate would improve clarity.

programs/benchfn.c (1)

192-193: Consider clarifying the initialization comment.

The comment "hopefully large enough" for the initial fastestRun.nanoSecPerRun value is vague. Consider something like "initialized to a very large value (2000 seconds) to ensure any actual measurement will be smaller".

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6cacd63 and 4baa257.

📒 Files selected for processing (20)
  • contrib/largeNbDicts/largeNbDicts.c (1 hunks)
  • contrib/pzstd/Pzstd.cpp (1 hunks)
  • lib/common/fse_decompress.c (1 hunks)
  • lib/common/pool.c (1 hunks)
  • lib/compress/fse_compress.c (1 hunks)
  • lib/compress/huf_compress.c (1 hunks)
  • lib/compress/zstd_compress_internal.h (1 hunks)
  • lib/compress/zstd_compress_sequences.c (1 hunks)
  • lib/compress/zstd_compress_superblock.c (1 hunks)
  • lib/compress/zstd_cwksp.h (1 hunks)
  • lib/compress/zstd_ldm.c (1 hunks)
  • lib/compress/zstd_preSplit.c (1 hunks)
  • lib/compress/zstdmt_compress.c (1 hunks)
  • lib/dictBuilder/cover.c (1 hunks)
  • lib/dictBuilder/divsufsort.c (1 hunks)
  • lib/dictBuilder/fastcover.c (1 hunks)
  • lib/dictBuilder/zdict.c (1 hunks)
  • programs/benchfn.c (1 hunks)
  • programs/benchzstd.c (1 hunks)
  • programs/dibio.c (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (14)
programs/dibio.c (2)
programs/util.c (1)
  • UTIL_getFileSize (545-558)
lib/dictBuilder/zdict.c (3)
  • ZDICT_trainFromBuffer_legacy (1088-1108)
  • ZDICT_isError (98-98)
  • ZDICT_getErrorName (100-100)
lib/dictBuilder/fastcover.c (4)
lib/compress/zstd_compress_internal.h (17)
  • ZSTD_hash6Ptr (915-915)
  • ZSTD_hash8Ptr (925-925)
  • U32 (584-596)
  • U32 (601-613)
  • U32 (1050-1055)
  • U32 (1061-1064)
  • U32 (1097-1125)
  • U32 (1132-1146)
  • U32 (1395-1407)
  • U32 (1412-1423)
  • size_t (649-659)
  • size_t (661-670)
  • size_t (677-683)
  • size_t (854-873)
  • size_t (879-892)
  • size_t (900-900)
  • size_t (901-901)
lib/dictBuilder/cover.c (12)
  • COVER_sum (278-285)
  • COVER_computeEpochs (732-747)
  • COVER_dictSelectionError (1035-1037)
  • COVER_selectDict (1047-1132)
  • COVER_dictSelectionIsError (1039-1041)
  • COVER_best_finish (980-1024)
  • COVER_dictSelectionFree (1043-1045)
  • COVER_warnOnSmallCorpus (716-730)
  • COVER_best_init (922-931)
  • COVER_best_destroy (950-960)
  • COVER_best_start (966-973)
  • COVER_best_wait (936-945)
lib/dictBuilder/zdict.c (1)
  • ZDICT_finalizeDictionary (862-941)
lib/common/pool.c (6)
  • POOL_create (111-113)
  • POOL_create (326-328)
  • POOL_free (180-189)
  • POOL_free (339-342)
  • POOL_add (286-296)
  • POOL_add (354-357)
lib/compress/zstd_compress_sequences.c (2)
lib/compress/fse_compress.c (5)
  • FSE_optimalTableLog (371-374)
  • FSE_normalizeCount (465-525)
  • FSE_writeNCount (330-340)
  • FSE_buildCTable_rle (528-548)
  • FSE_buildCTable_wksp (68-214)
lib/legacy/zstd_v03.c (3)
  • ZSTD_isError (2373-2373)
  • ERR_isError (525-525)
  • MEM_32bits (107-107)
contrib/largeNbDicts/largeNbDicts.c (3)
programs/util.c (6)
  • UTIL_getFileSize (545-558)
  • UTIL_isDirectory (357-365)
  • UTIL_getTotalFileSize (630-645)
  • UTIL_createExpandedFNT (1379-1423)
  • UTIL_assembleFileNamesTable (811-815)
  • UTIL_freeFileNamesTable (817-823)
lib/compress/zstd_compress.c (13)
  • ZSTD_freeCDict (5740-5751)
  • ZSTD_createCDict_advanced2 (5668-5716)
  • ZSTD_createCCtx (97-100)
  • ZSTD_compressCCtx (5495-5503)
  • ZSTD_compress_usingCDict (5920-5927)
  • ZSTD_CCtx_setParametersUsingCCtxParams (1186-1198)
  • ZSTD_freeCCtx (181-192)
  • ZSTD_CCtx_refCDict (1336-1344)
  • ZSTD_compress2 (6576-6605)
  • ZSTD_sizeof_CDict (5550-5557)
  • ZSTD_createCCtxParams (349-352)
  • ZSTD_CCtxParams_init (366-372)
  • ZSTD_CCtxParams_setParameter (778-1029)
programs/benchfn.c (6)
  • BMK_createTimedFnState (160-166)
  • BMK_benchTimedFn (211-256)
  • BMK_isSuccessful_runOutcome (64-67)
  • BMK_extract_runTime (71-75)
  • BMK_isCompleted_TimedFn (200-203)
  • BMK_freeTimedFnState (168-168)
lib/compress/zstd_ldm.c (3)
lib/compress/zstd_compress_internal.h (18)
  • BYTE (630-644)
  • U32 (584-596)
  • U32 (601-613)
  • U32 (1050-1055)
  • U32 (1061-1064)
  • U32 (1097-1125)
  • U32 (1132-1146)
  • U32 (1395-1407)
  • U32 (1412-1423)
  • size_t (649-659)
  • size_t (661-670)
  • size_t (677-683)
  • size_t (854-873)
  • size_t (879-892)
  • size_t (900-900)
  • size_t (901-901)
  • ZSTD_window_correctOverflow (1159-1230)
  • ZSTD_storeSeq (775-811)
lib/compress/zstd_cwksp.h (3)
  • BYTE (394-397)
  • size_t (206-210)
  • size_t (224-232)
lib/compress/zstd_compress.c (1)
  • ZSTD_selectBlockCompressor (3077-3160)
lib/compress/zstdmt_compress.c (4)
lib/compress/zstd_compress_internal.h (9)
  • U32 (584-596)
  • U32 (601-613)
  • U32 (1050-1055)
  • U32 (1061-1064)
  • U32 (1097-1125)
  • U32 (1132-1146)
  • U32 (1395-1407)
  • U32 (1412-1423)
  • ZSTD_window_update (1354-1390)
lib/compress/zstd_compress.c (9)
  • ZSTD_freeCCtx (181-192)
  • ZSTD_createCCtx_advanced (114-124)
  • ZSTD_referenceExternalSequences (4788-4797)
  • ZSTD_compressBegin_advanced_internal (5293-5309)
  • ZSTD_compressContinue_public (4861-4867)
  • ZSTD_compressEnd_public (5415-5439)
  • ZSTD_checkCParams (1396-1406)
  • ZSTD_compressBound (70-74)
  • ZSTD_writeLastEmptyBlock (4778-4786)
lib/compress/zstd_ldm.c (3)
  • ZSTD_ldm_adjustParameters (135-167)
  • ZSTD_ldm_getMaxNbSeq (179-182)
  • ZSTD_ldm_generateSequences (526-601)
lib/common/pool.c (10)
  • POOL_create_advanced (115-160)
  • POOL_create_advanced (330-337)
  • POOL_free (180-189)
  • POOL_free (339-342)
  • POOL_sizeof (206-211)
  • POOL_sizeof (365-369)
  • POOL_resize (244-253)
  • POOL_resize (349-352)
  • POOL_tryAdd (299-310)
  • POOL_tryAdd (359-363)
programs/benchzstd.c (4)
lib/compress/zstd_compress.c (9)
  • ZSTD_CCtx_reset (1375-1390)
  • ZSTD_CCtx_setParameter (716-776)
  • ZSTD_CCtx_loadDictionary (1329-1333)
  • ZSTD_compress2 (6576-6605)
  • ZSTD_compressBound (70-74)
  • ZSTD_sizeof_CCtx (206-214)
  • ZSTD_createCCtx (97-100)
  • ZSTD_freeCCtx (181-192)
  • ZSTD_maxCLevel (7681-7681)
lib/legacy/zstd_v03.c (3)
  • ZSTD_isError (2373-2373)
  • U64 (126-129)
  • U64 (177-187)
programs/benchfn.c (6)
  • BMK_benchTimedFn (211-256)
  • BMK_isSuccessful_runOutcome (64-67)
  • BMK_extract_runTime (71-75)
  • BMK_isCompleted_TimedFn (200-203)
  • BMK_createTimedFnState (160-166)
  • BMK_freeTimedFnState (168-168)
programs/util.c (3)
  • UTIL_getFileSize (545-558)
  • UTIL_isDirectory (357-365)
  • UTIL_getTotalFileSize (630-645)
lib/compress/huf_compress.c (2)
lib/legacy/zstd_v03.c (9)
  • HUF_isError (1451-1451)
  • U32 (121-124)
  • U32 (161-164)
  • U32 (166-175)
  • BYTE (660-669)
  • BYTE (671-680)
  • HUF_readStats (1477-1553)
  • MEM_32bits (107-107)
  • ERR_isError (525-525)
lib/compress/fse_compress.c (6)
  • FSE_optimalTableLog (371-374)
  • FSE_normalizeCount (465-525)
  • FSE_writeNCount (330-340)
  • FSE_buildCTable_wksp (68-214)
  • FSE_compress_usingCTable (610-620)
  • FSE_optimalTableLog_internal (357-369)
lib/dictBuilder/cover.c (4)
lib/legacy/zstd_v03.c (5)
  • U64 (126-129)
  • U64 (177-187)
  • U32 (121-124)
  • U32 (161-164)
  • U32 (166-175)
lib/dictBuilder/zdict.c (2)
  • ZDICT_finalizeDictionary (862-941)
  • ZDICT_isError (98-98)
lib/compress/zstd_compress.c (6)
  • ZSTD_compressBound (70-74)
  • ZSTD_createCCtx (97-100)
  • ZSTD_createCDict (5718-5727)
  • ZSTD_compress_usingCDict (5920-5927)
  • ZSTD_freeCCtx (181-192)
  • ZSTD_freeCDict (5740-5751)
lib/common/pool.c (6)
  • POOL_create (111-113)
  • POOL_create (326-328)
  • POOL_free (180-189)
  • POOL_free (339-342)
  • POOL_add (286-296)
  • POOL_add (354-357)
lib/compress/zstd_compress_internal.h (2)
lib/legacy/zstd_v03.c (10)
  • U32 (121-124)
  • U32 (161-164)
  • U32 (166-175)
  • ZSTD_isError (2373-2373)
  • BYTE (660-669)
  • BYTE (671-680)
  • ZSTD_wildcopy (2331-2337)
  • MEM_64bits (108-108)
  • U64 (126-129)
  • U64 (177-187)
lib/compress/zstd_compress.c (9)
  • ZSTD_getSeqStore (222-222)
  • ZSTD_seqToCodes (2701-2727)
  • ZSTD_selectBlockCompressor (3077-3160)
  • ZSTD_cParam_getBounds (427-645)
  • ZSTD_loadCEntropy (5069-5158)
  • ZSTD_reset_compressedBlockState (1926-1935)
  • ZSTD_convertBlockSequences (7326-7400)
  • ZSTD_get1BlockSummary (7404-7452)
  • ZSTD_get1BlockSummary (7456-7481)
lib/compress/zstd_compress_superblock.c (4)
lib/compress/huf_compress.c (3)
  • HUF_compress1X_usingCTable (1163-1166)
  • HUF_compress4X_usingCTable (1218-1221)
  • HUF_estimateCompressedSize (794-803)
lib/legacy/zstd_v03.c (3)
  • ERR_isError (525-525)
  • BYTE (660-669)
  • BYTE (671-680)
lib/compress/zstd_compress_sequences.c (3)
  • ZSTD_encodeSequences (419-442)
  • ZSTD_crossEntropyCost (139-154)
  • ZSTD_fseBitCost (104-132)
lib/compress/zstd_compress.c (1)
  • ZSTD_buildBlockEntropyStats (3806-3834)
lib/dictBuilder/zdict.c (5)
lib/legacy/zstd_v03.c (6)
  • ERR_isError (525-525)
  • U32 (121-124)
  • U32 (161-164)
  • U32 (166-175)
  • HUF_isError (1451-1451)
  • FSE_isError (1106-1106)
lib/compress/zstd_compress.c (10)
  • ZSTD_reset_compressedBlockState (1926-1935)
  • ZSTD_loadCEntropy (5069-5158)
  • ZSTD_compressBegin_usingCDict_deprecated (5881-5885)
  • ZSTD_compressBlock_deprecated (4891-4898)
  • ZSTD_getSeqStore (222-222)
  • ZSTD_seqToCodes (2701-2727)
  • ZSTD_createCDict_advanced (5650-5666)
  • ZSTD_createCCtx (97-100)
  • ZSTD_freeCDict (5740-5751)
  • ZSTD_freeCCtx (181-192)
lib/dictBuilder/divsufsort.c (1)
  • divsufsort (1846-1873)
lib/compress/huf_compress.c (2)
  • HUF_buildCTable_wksp (756-792)
  • HUF_writeCTable_wksp (248-289)
lib/compress/fse_compress.c (2)
  • FSE_normalizeCount (465-525)
  • FSE_writeNCount (330-340)
lib/compress/fse_compress.c (3)
lib/legacy/zstd_v03.c (3)
  • FSE_isError (1106-1106)
  • U16 (116-119)
  • U16 (136-145)
lib/legacy/zstd_v04.c (3)
  • FSE_isError (1089-1089)
  • U16 (92-95)
  • U16 (112-121)
lib/legacy/zstd_v05.c (2)
  • U16 (119-122)
  • U16 (149-157)
lib/common/fse_decompress.c (1)
lib/legacy/zstd_v04.c (4)
  • BYTE (839-848)
  • BYTE (850-859)
  • U16 (92-95)
  • U16 (112-121)
🔇 Additional comments (17)
lib/compress/zstd_compress_superblock.c (3)

220-236: LGTM: Well-documented backward compatibility handling.

The workaround for the zstd decoder bug (versions ≤ 1.3.4) is properly documented with a reference to the fix. The condition is appropriately guarded by FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION to avoid issues during fuzzing.


665-688: LGTM: Clean public API with proper error handling.

The function properly builds block entropy statistics and delegates to the multi-block compressor with appropriate error propagation.


649-657: Repcode regeneration logic is correct—no issues found.

The ZSTD_updateRep function correctly handles the ll0 parameter (whether litLength == 0). The function uses this boolean to adjust repcode calculations on line 825: repCode = OFFBASE_TO_REPCODE(offBase) - 1 + ll0. When litLength is 0, the repcode index shifts by 1, which is the correct ZSTD semantics. The superblock code at line 654 properly passes this condition and correctly regenerates repcode state for skipped sequences.

programs/benchfn.c (2)

64-81: LGTM: Accessor functions enforce API contract.

The functions correctly access internal fields marked with never_ever_use_directly, which enforces that users must go through these accessor functions. The CONTROL macro provides appropriate assertions to catch misuse.


232-240: LGTM: Adaptive loop adjustment with overflow protection.

The logic dynamically adjusts nbLoops based on run duration to maintain consistent timing, with proper overflow checking. The 10x multiplier for short runs helps quickly reach the target duration.

contrib/pzstd/Pzstd.cpp (3)

344-349: LGTM: Frame size calculation with documented constraints.

The assertion ensures windowLog ≤ 23 to respect the 4GB skippable frame size limit, which is well-documented in the comment.


540-545: LGTM: Adaptive queue sizing for memory management.

When a non-skippable frame is encountered, the code appropriately limits queue sizes to prevent excessive memory usage during serial decompression fallback.


599-610: LGTM: Skippable frame size validation.

The code correctly validates that the compressed size fits in a 32-bit skippable frame before writing, with an appropriate assertion for the uint64_t cast.

lib/dictBuilder/fastcover.c (4)

149-219: LGTM: Well-structured segment selection algorithm.

The sliding window approach with frequency tracking correctly implements the scoring function described in the header comments. The cleanup loops ensure state consistency between calls.


490-491: Related to frequency table size validation.

This memcpy can copy up to 8GB if f == 31. This reinforces the need to validate f bounds before context initialization (see previous comment on lines 377-382).


566-570: LGTM: Parameter validation before expensive operations.

The code validates parameters (including f bounds via FASTCOVER_checkParameters) before allocating large frequency tables, which addresses the memory concerns raised in earlier comments.


377-382: The f parameter is already validated before reaching the allocation.

The parameter validation occurs in FASTCOVER_checkParameters() at lines 241-243, which ensures 0 < f <= FASTCOVER_MAX_F (where FASTCOVER_MAX_F == 31). This validation is called at lines 566 and 726, before FASTCOVER_ctx_init() is invoked, protecting the subsequent allocation at line 377. No changes are needed.

Likely an incorrect or invalid review comment.

lib/compress/zstd_ldm.c (5)

96-133: LGTM: Efficient rolling hash with batch processing.

The gear hash implementation uses macros for loop unrolling and appropriate branch prediction hints. The batch processing pattern (filling splits array) is well-structured.


493-505: LGTM: Optimization for repeating patterns.

The logic to detect and skip overlapping matches (e.g., repeated bytes) is a significant performance optimization. The hash reset and loop continuation correctly handle this edge case.


526-601: LGTM: Robust chunked sequence generation.

The function properly handles large inputs through chunking, with appropriate overflow correction and distance constraint enforcement. The leftover literal accumulation correctly maintains state across chunks.


616-624: LGTM: Correct handling of shortened matches.

When skipping makes a match too short (below minMatch), the code correctly adds the remaining match bytes as literals to the next sequence before advancing position.


697-704: LGTM: Strategy-appropriate LDM integration.

For optimal parsing strategies (btopt and above), LDM sequences are correctly used as candidates rather than being directly accepted, allowing the optimal parser to make the final decision.

Comment on lines +538 to +568
static size_t compressBlocks(size_t* cSizes, /* optional (can be NULL). If present, must contain at least nbBlocks fields */
                             slice_collection_t dstBlockBuffers,
                             slice_collection_t srcBlockBuffers,
                             ZSTD_CDict* cdict, int cLevel)
{
    size_t const nbBlocks = srcBlockBuffers.nbSlices;
    assert(dstBlockBuffers.nbSlices == srcBlockBuffers.nbSlices);

    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    assert(cctx != NULL);

    size_t totalCSize = 0;
    for (size_t blockNb=0; blockNb < nbBlocks; blockNb++) {
        size_t cBlockSize;
        if (cdict == NULL) {
            cBlockSize = ZSTD_compressCCtx(cctx,
                            dstBlockBuffers.slicePtrs[blockNb], dstBlockBuffers.capacities[blockNb],
                            srcBlockBuffers.slicePtrs[blockNb], srcBlockBuffers.capacities[blockNb],
                            cLevel);
        } else {
            cBlockSize = ZSTD_compress_usingCDict(cctx,
                            dstBlockBuffers.slicePtrs[blockNb], dstBlockBuffers.capacities[blockNb],
                            srcBlockBuffers.slicePtrs[blockNb], srcBlockBuffers.capacities[blockNb],
                            cdict);
        }
        CONTROL(!ZSTD_isError(cBlockSize));
        if (cSizes) cSizes[blockNb] = cBlockSize;
        totalCSize += cBlockSize;
    }
    return totalCSize;
}

⚠️ Potential issue | 🟠 Major

Free the temporary CCtx before returning.

compressBlocks() calls ZSTD_createCCtx() on every invocation but never releases it. In steady state this leaks one compression context per call, which adds up quickly when benchmarking large datasets. Please free the context before returning so the function stays memory-neutral.

 static size_t compressBlocks(...)
 {
     ZSTD_CCtx* const cctx = ZSTD_createCCtx();
     assert(cctx != NULL);

     ...
-    return totalCSize;
+    ZSTD_freeCCtx(cctx);
+    return totalCSize;
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
static size_t compressBlocks(size_t* cSizes, /* optional (can be NULL). If present, must contain at least nbBlocks fields */
slice_collection_t dstBlockBuffers,
slice_collection_t srcBlockBuffers,
ZSTD_CDict* cdict, int cLevel)
{
size_t const nbBlocks = srcBlockBuffers.nbSlices;
assert(dstBlockBuffers.nbSlices == srcBlockBuffers.nbSlices);
ZSTD_CCtx* const cctx = ZSTD_createCCtx();
assert(cctx != NULL);
size_t totalCSize = 0;
for (size_t blockNb=0; blockNb < nbBlocks; blockNb++) {
size_t cBlockSize;
if (cdict == NULL) {
cBlockSize = ZSTD_compressCCtx(cctx,
dstBlockBuffers.slicePtrs[blockNb], dstBlockBuffers.capacities[blockNb],
srcBlockBuffers.slicePtrs[blockNb], srcBlockBuffers.capacities[blockNb],
cLevel);
} else {
cBlockSize = ZSTD_compress_usingCDict(cctx,
dstBlockBuffers.slicePtrs[blockNb], dstBlockBuffers.capacities[blockNb],
srcBlockBuffers.slicePtrs[blockNb], srcBlockBuffers.capacities[blockNb],
cdict);
}
CONTROL(!ZSTD_isError(cBlockSize));
if (cSizes) cSizes[blockNb] = cBlockSize;
totalCSize += cBlockSize;
}
return totalCSize;
}
Suggested replacement (adds ZSTD_freeCCtx before the return):

static size_t compressBlocks(size_t* cSizes, /* optional (can be NULL). If present, must contain at least nbBlocks fields */
slice_collection_t dstBlockBuffers,
slice_collection_t srcBlockBuffers,
ZSTD_CDict* cdict, int cLevel)
{
size_t const nbBlocks = srcBlockBuffers.nbSlices;
assert(dstBlockBuffers.nbSlices == srcBlockBuffers.nbSlices);
ZSTD_CCtx* const cctx = ZSTD_createCCtx();
assert(cctx != NULL);
size_t totalCSize = 0;
for (size_t blockNb=0; blockNb < nbBlocks; blockNb++) {
size_t cBlockSize;
if (cdict == NULL) {
cBlockSize = ZSTD_compressCCtx(cctx,
dstBlockBuffers.slicePtrs[blockNb], dstBlockBuffers.capacities[blockNb],
srcBlockBuffers.slicePtrs[blockNb], srcBlockBuffers.capacities[blockNb],
cLevel);
} else {
cBlockSize = ZSTD_compress_usingCDict(cctx,
dstBlockBuffers.slicePtrs[blockNb], dstBlockBuffers.capacities[blockNb],
srcBlockBuffers.slicePtrs[blockNb], srcBlockBuffers.capacities[blockNb],
cdict);
}
CONTROL(!ZSTD_isError(cBlockSize));
if (cSizes) cSizes[blockNb] = cBlockSize;
totalCSize += cBlockSize;
}
ZSTD_freeCCtx(cctx);
return totalCSize;
}
🤖 Prompt for AI Agents
In contrib/largeNbDicts/largeNbDicts.c around lines 538 to 568, the function
creates a ZSTD_CCtx with ZSTD_createCCtx() but never releases it; add a call to
ZSTD_freeCCtx(cctx) before returning to avoid leaking the compression context
(place it after the for loop and before the final return totalCSize). Also
ensure any future early-return or error paths free cctx as well (e.g., free cctx
before returning on detected errors).

Comment on lines +627 to +636
ZSTD_CCtx_refCDict(ci->cctx, ci->dictionaries.cdicts[ci->dictNb]);
ZSTD_compress2(ci->cctx,
dst, srcSize,
src, srcSize);

ci->dictNb = ci->dictNb + 1;
if (ci->dictNb >= ci->nbDicts) ci->dictNb = 0;

return srcSize;
}

⚠️ Potential issue | 🔴 Critical

Propagate ZSTD_compress2() errors and use the provided capacity.

compress() currently hard-codes dstCapacity to srcSize when calling ZSTD_compress2() and ignores the return value. With incompressible input the function silently reports success while leaving the output buffer untouched, so later consumers misinterpret garbage. Please pass the actual dstCapacity and propagate any compression error before reporting success (while still returning srcSize for throughput accounting).

 size_t compress(..., void* dst, size_t dstCapacity, void* payload)
 {
     compressInstructions* const ci = (compressInstructions*) payload;
     (void)dstCapacity;

     ZSTD_CCtx_refCDict(ci->cctx, ci->dictionaries.cdicts[ci->dictNb]);
-    ZSTD_compress2(ci->cctx,
-            dst, srcSize,
-            src, srcSize);
+    {
+        size_t const cSize = ZSTD_compress2(ci->cctx,
+                                            dst, dstCapacity,
+                                            src, srcSize);
+        if (ZSTD_isError(cSize)) {
+            return cSize;
+        }
+    }

     ci->dictNb = ci->dictNb + 1;
     if (ci->dictNb >= ci->nbDicts) ci->dictNb = 0;

     return srcSize;
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
ZSTD_CCtx_refCDict(ci->cctx, ci->dictionaries.cdicts[ci->dictNb]);
ZSTD_compress2(ci->cctx,
dst, srcSize,
src, srcSize);
ci->dictNb = ci->dictNb + 1;
if (ci->dictNb >= ci->nbDicts) ci->dictNb = 0;
return srcSize;
}
Suggested replacement (uses dstCapacity and checks the result):

ZSTD_CCtx_refCDict(ci->cctx, ci->dictionaries.cdicts[ci->dictNb]);
{
size_t const cSize = ZSTD_compress2(ci->cctx,
dst, dstCapacity,
src, srcSize);
if (ZSTD_isError(cSize)) {
return cSize;
}
}
ci->dictNb = ci->dictNb + 1;
if (ci->dictNb >= ci->nbDicts) ci->dictNb = 0;
return srcSize;
}
🤖 Prompt for AI Agents
In contrib/largeNbDicts/largeNbDicts.c around lines 627 to 636, ZSTD_compress2()
is being called with dstCapacity hard-coded to srcSize and its return value
ignored; change the call to pass the actual dstCapacity variable, capture the
returned size_t result, then check it with ZSTD_isError(result) and if it's an
error return that error code, otherwise return srcSize for throughput
accounting. Also ensure dstCapacity is in scope and used instead of srcSize in
the ZSTD_compress2() call.
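The pattern the review asks for can be sketched in isolation. The snippet below is a minimal, self-contained illustration, not the project's code: `stub_compress`/`stub_isError` are stand-ins for `ZSTD_compress2()`/`ZSTD_isError()` so the error-propagation shape can be shown without linking libzstd, and `STUB_ERROR` plays the role of a zstd error code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for ZSTD_compress2(): "compresses" by copying, and fails when
 * the destination capacity is too small. (size_t)-1 mimics a zstd error code. */
#define STUB_ERROR ((size_t)-1)
static size_t stub_compress(void* dst, size_t dstCapacity,
                            const void* src, size_t srcSize)
{
    if (dstCapacity < srcSize) return STUB_ERROR;
    memcpy(dst, src, srcSize);
    return srcSize;
}
static int stub_isError(size_t code) { return code == STUB_ERROR; }

/* The corrected shape: use the real capacity, check the result, propagate
 * errors, and only report srcSize (for throughput accounting) on success. */
static size_t compress_checked(const void* src, size_t srcSize,
                               void* dst, size_t dstCapacity)
{
    size_t const cSize = stub_compress(dst, dstCapacity, src, srcSize);
    if (stub_isError(cSize)) return cSize;   /* propagate, don't swallow */
    return srcSize;
}
```

With the real API the structure is identical: pass `dstCapacity` through, test the return with `ZSTD_isError()`, and return the error code unchanged so callers can see the failure.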

Comment thread contrib/pzstd/Pzstd.cpp
Comment on lines +118 to +123
{
std::error_code ec;
if (is_directory(inputFile, ec)) {
errorHolder.setError("Output file is a directory -- ignored");
return nullptr;
}

⚠️ Potential issue | 🔴 Critical

Fix incorrect error message.

The error message says "Output file is a directory" but the function is checking if the input file is a directory.

Apply this diff:

-      errorHolder.setError("Output file is a directory -- ignored");
+      errorHolder.setError("Input file is a directory -- ignored");
🤖 Prompt for AI Agents
In contrib/pzstd/Pzstd.cpp around lines 118 to 123, the error message
incorrectly refers to the "Output file" while the code is checking whether the
input file is a directory; change the error text to reference the input file
(e.g., "Input file is a directory -- ignored") so the message accurately
reflects the checked condition.

Comment on lines +201 to +226
#define SEGMENT_SIZE 512
FPStats* const fpstats = (FPStats*)workspace;
Fingerprint* middleEvents = (Fingerprint*)(void*)((char*)workspace + 512 * sizeof(unsigned));
assert(blockSize == (128 << 10));
assert(workspace != NULL);
assert((size_t)workspace % ZSTD_ALIGNOF(FPStats) == 0);
ZSTD_STATIC_ASSERT(ZSTD_SLIPBLOCK_WORKSPACESIZE >= sizeof(FPStats));
assert(wkspSize >= sizeof(FPStats)); (void)wkspSize;

initStats(fpstats);
HIST_add(fpstats->pastEvents.events, blockStart, SEGMENT_SIZE);
HIST_add(fpstats->newEvents.events, (const char*)blockStart + blockSize - SEGMENT_SIZE, SEGMENT_SIZE);
fpstats->pastEvents.nbEvents = fpstats->newEvents.nbEvents = SEGMENT_SIZE;
if (!compareFingerprints(&fpstats->pastEvents, &fpstats->newEvents, 0, 8))
return blockSize;

HIST_add(middleEvents->events, (const char*)blockStart + blockSize/2 - SEGMENT_SIZE/2, SEGMENT_SIZE);
middleEvents->nbEvents = SEGMENT_SIZE;
{ U64 const distFromBegin = fpDistance(&fpstats->pastEvents, middleEvents, 8);
U64 const distFromEnd = fpDistance(&fpstats->newEvents, middleEvents, 8);
U64 const minDistance = SEGMENT_SIZE * SEGMENT_SIZE / 3;
if (abs64((S64)distFromBegin - (S64)distFromEnd) < minDistance)
return 64 KB;
return (distFromBegin > distFromEnd) ? 32 KB : 96 KB;
}
}

⚠️ Potential issue | 🔴 Critical

Avoid aliasing middleEvents with the existing workspace.

middleEvents is derived from (char*)workspace + 512*sizeof(unsigned), which lands inside fpstats->pastEvents.events. Filling middleEvents therefore corrupts pastEvents and the comparison logic that follows. Allocate an independent buffer (e.g. a stack Fingerprint middleEvents; ZSTD_memset(&middleEvents, 0, sizeof(middleEvents));) or carve space after FPStats once the workspace size guarantee includes it. That keeps all three fingerprints disjoint so compareFingerprints() works as intended.
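The "carve after FPStats" variant of the fix can be sketched as below. The `Fingerprint`/`FPStats` layouts here are illustrative assumptions (the real definitions live in the pre-split source, and 1024 event buckets is a guess); the point is only that offsetting by `sizeof(FPStats)` keeps all three fingerprints disjoint, whereas the current `512 * sizeof(unsigned)` offset lands inside `pastEvents.events`.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative layout only -- not the real zstd definitions. */
typedef struct { unsigned events[1024]; size_t nbEvents; } Fingerprint;
typedef struct { Fingerprint pastEvents; Fingerprint newEvents; } FPStats;

/* Carve the third fingerprint AFTER FPStats so it cannot alias
 * pastEvents/newEvents. The caller must guarantee
 * wkspSize >= sizeof(FPStats) + sizeof(Fingerprint). */
static Fingerprint* carveMiddleEvents(void* workspace, size_t wkspSize)
{
    assert(wkspSize >= sizeof(FPStats) + sizeof(Fingerprint));
    return (Fingerprint*)(void*)((char*)workspace + sizeof(FPStats));
}
```

This assumes the workspace-size constant is enlarged to cover the extra `Fingerprint`; the alternative mentioned above, a zero-initialized stack `Fingerprint`, avoids touching the workspace contract at all.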

Comment thread programs/dibio.c
Comment on lines +223 to +229
while (!testmem) {
testmem = malloc((size_t)requiredMem);
requiredMem -= step;
}

free(testmem);
return (size_t)requiredMem;

⚠️ Potential issue | 🔴 Critical

Fix DiB_findMaxMem return value before dividing by memMult

When the first malloc() succeeds, we immediately subtract step, so the function returns one step too small. For small training sets (e.g. < ~0.75 MB) the subsequent division by memMult drops to zero, loadedSize becomes 0, and we load no samples, hitting the “nb of samples too low” exit path. Returning the actual allocated size fixes the regression.

-    free(testmem);
-    return (size_t)requiredMem;
+    free(testmem);
+    return (size_t)(requiredMem + step);
🤖 Prompt for AI Agents
In programs/dibio.c around lines 223 to 229, the loop subtracts step immediately
after a successful malloc so the function returns one step too small; change the
logic so the returned size reflects the actual allocated memory (either subtract
step only when malloc fails, or add step back to requiredMem before returning)
so DiB_findMaxMem returns the true allocated size before the later division by
memMult.
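The "subtract only on failure" variant of the fix can be sketched as a standalone probe loop. This is a shape-level illustration of DiB_findMaxMem, not the project's code; the 8 MB step is an assumption for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Probe for an allocatable size by stepping down until malloc succeeds,
 * and return the size that was actually allocated -- not one step less. */
static size_t findMaxMem(size_t requiredMem)
{
    size_t const step = (size_t)8 << 20;   /* 8 MB, illustrative */
    void* testmem = NULL;

    while (!testmem) {
        testmem = malloc(requiredMem);
        if (!testmem) requiredMem -= step;   /* shrink only on failure */
    }
    free(testmem);
    return requiredMem;   /* the size that actually succeeded */
}
```

Moving the subtraction under the failure branch is equivalent to the suggested `requiredMem + step` correction, and keeps the return value aligned with what was really allocated.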
