
Add DBZ seed files#6

Open
mugitya03 wants to merge 1 commit into main from add-dbz-seed-files

Conversation

mugitya03 (Owner) commented Nov 17, 2025

Summary by CodeRabbit

  • New Features
    • Multi-threaded compression support with async streaming and parallel worker threads.
    • Advanced dictionary training algorithms (COVER and FASTCOVER methods).
    • Comprehensive benchmarking framework for performance analysis.
    • Entropy encoding optimizations with FSE and Huffman support.
    • Thread pool infrastructure for concurrent task execution.

Note

Cursor Bugbot is generating a summary for commit 4baa257.


cursor Bot commented Nov 17, 2025

You have run out of free Bugbot PR reviews for this billing cycle. This will reset on December 21.

To receive reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.


coderabbitai Bot commented Nov 17, 2025

Walkthrough

Adds comprehensive implementations of core ZSTD compression components: FSE and Huffman entropy encoding/decoding, a multi-threaded compression pipeline, long-distance matching, block pre-splitting, workspace management, dictionary training (COVER/FASTCOVER), suffix sorting, and benchmarking utilities, spanning the library core, compression engine, dictionary builders, and tools.

Changes

Cohort (files): summary

  • FSE and Huffman Entropy (lib/compress/fse_compress.c, lib/compress/huf_compress.c, lib/common/fse_decompress.c): Finite State Entropy and Huffman table construction, encoding, and decoding with workspace-based memory management, normalization, and BMI2 optimization.
  • Compression Core and Sequences (lib/compress/zstd_compress_sequences.c, lib/compress/zstd_compress_superblock.c): sequence encoding logic, FSE table building, selection strategies, and super-block compression with literal/sequence sub-block handling.
  • Internal Compression Structures (lib/compress/zstd_compress_internal.h, lib/compress/zstd_cwksp.h): compression context typedefs (SeqStore, entropy tables, block state), prototypes, and workspace allocation/lifecycle management with phase-based stratification.
  • Advanced Matching and Splitting (lib/compress/zstd_ldm.c, lib/compress/zstd_preSplit.c): long-distance matching with a gear-based rolling hash and sequence generation; block pre-splitting heuristic using fingerprint-based deviation detection.
  • Threading Infrastructure (lib/common/pool.c, lib/compress/zstdmt_compress.c): thread pool with job queuing and synchronization; multi-threaded streaming compression with buffer/sequence/context pools and per-job lifecycle.
  • C++ Multi-threaded Compression (contrib/pzstd/Pzstd.cpp): C++ pipeline for parallel frame-based compression/decompression with async I/O, buffer work queues, and skippable frame handling.
  • Dictionary Training (lib/dictBuilder/cover.c, lib/dictBuilder/fastcover.c, lib/dictBuilder/zdict.c): COVER and FASTCOVER algorithms with parameter optimization, d-mer frequency analysis, segment selection, and dictionary finalization with entropy tables.
  • Suffix Sorting and BWT (lib/dictBuilder/divsufsort.c): suffix array construction and Burrows-Wheeler transform with bucket-based B*-prefix sorting, multi-pass merging, and OpenMP support.
  • Dictionary I/O and Training (programs/dibio.c): multi-file dictionary training orchestration including file loading, memory budgeting, shuffling, and dispatch to ZDICT training backends.
  • Benchmarking Utilities (programs/benchfn.c, programs/benchzstd.c, contrib/largeNbDicts/largeNbDicts.c): function benchmarking framework with timed iterations and state management; ZSTD memory/file benchmark orchestration with multi-level/CRC validation; dictionary decompression stress test with CSV results.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

  • FSE/Huffman implementations (fse_compress.c, fse_decompress.c, huf_compress.c): Require careful validation of normalization logic, table construction correctness, and boundary handling; BMI2 conditional paths add complexity.
  • Multi-threaded compression (pool.c, zstdmt_compress.c, Pzstd.cpp): Synchronization primitives, job lifecycle, buffer pool management, and potential race conditions demand thorough scrutiny; interactions across threads require tracing.
  • Sequence encoding and super-block compression (zstd_compress_sequences.c, zstd_compress_superblock.c): Cost models, encoding type selection, and fallback logic (including decoder-version workarounds) are dense and interdependent.
  • Workspace management (zstd_cwksp.h): Phase-based allocation, alignment guarantees, and ASAN/MSAN interactions require cross-checking against all callers.
  • Dictionary training (cover.c, fastcover.c, divsufsort.c): Sorting correctness, parameter space exploration, and frequency-based selection heuristics need validation; OpenMP parallelism adds concurrency concerns.
  • Interconnected dependencies: Each compression component integrates with FSE/Huffman and workspace, creating a web of interactions that must be validated holistically.

Poem

🐰 Entropy flows like carrot juice,
Threads dance through compression's deuce,
From FSE tables to dictionaries true,
Workspace juggles every phase anew,
LDM matches bloom and bytes shrink in view! 🥕✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Title check: ⚠️ Warning. The PR title 'Add DBZ seed files' does not align with the extensive changeset, which introduces large compression infrastructure (FSE, Huffman, thread pool, LDM, multi-threaded compression, dictionary training, and benchmarking). Resolution: revise the title to reflect the actual scope, e.g. 'Add comprehensive compression infrastructure: FSE, Huffman, thread pool, and benchmarking tools'.
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 46.64%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (1 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch add-dbz-seed-files

Comment @coderabbitai help to get the list of available commands and usage tips.

mugitya03 (Owner, Author) commented:

@cursor review

        totalCSize += cBlockSize;
    }
    return totalCSize;
}

Bug: Memory Leak: Unfreed Compression Context

Memory leak: ZSTD_CCtx* const cctx is created with ZSTD_createCCtx() but never freed with ZSTD_freeCCtx() before returning from compressBlocks(). This causes the compression context to leak on every call.


if (ci->dictNb >= ci->nbDicts) ci->dictNb = 0;

return srcSize;
}

Bug: Buffer Bug Causes Silent Compression Errors

The compress function passes srcSize as the destination capacity to ZSTD_compress2 instead of the provided dstCapacity parameter, and ignores the return value completely. This causes incorrect buffer size checking and silently ignores compression errors.


size_t nbSrcBlocks = 0;
for (size_t ssnb=0; ssnb < srcSlices.nbSlices; ssnb++) {
size_t pos = 0;
while (pos <= srcSlices.capacities[ssnb]) {

Bug: Off-by-One: Buffer Boundary Violation

Off-by-one error in splitSlices: the condition while (pos <= srcSlices.capacities[ssnb]) allows pos to equal the capacity, causing one extra iteration that registers a block starting at the very end of the buffer. This inflates the block count and produces a block descriptor extending past the end of the source slice.


unsigned const ms_per_round = RUN_TIME_DEFAULT_MS;
unsigned const total_time_ms = nbRounds * ms_per_round;

double *const speedPerRound = (double *)malloc(nbRounds * sizeof(double));

Bug: Memory Leak in Benchmarking Function

Memory leak in benchMem: speedPerRound is allocated with malloc(nbRounds * sizeof(double)) but never freed before the function returns.



size_t ssnb = 0;
for (size_t sliceNb=0; sliceNb < nbSlices; ) {
ssnb = (ssnb + 1) % srcSlices.nbSlices;

Bug: Slice Distribution Off-By-One

In splitSlices, ssnb is initialized to 0 then immediately set to (ssnb + 1) % srcSlices.nbSlices which equals 1, causing the first source slice at index 0 to always be skipped when distributing slices.



coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (3)
lib/compress/zstd_compress_superblock.c (2)

48-49: Consider defining named constants for header size bounds.

The magic number 200 and threshold values 1 KB and 16 KB reduce readability. Consider defining these as named constants with explanatory comments.


356-356: Consider documenting the fallback multiplier.

The nbSeq * 10 fallback estimate uses a magic multiplier. A brief comment explaining why 10 is chosen as the worst-case bytes-per-sequence estimate would improve clarity.

programs/benchfn.c (1)

192-193: Consider clarifying the initialization comment.

The comment "hopefully large enough" for the initial fastestRun.nanoSecPerRun value is vague. Consider something like "initialized to a very large value (2000 seconds) to ensure any actual measurement will be smaller".

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6cacd63 and 4baa257.

📒 Files selected for processing (20)
  • contrib/largeNbDicts/largeNbDicts.c (1 hunks)
  • contrib/pzstd/Pzstd.cpp (1 hunks)
  • lib/common/fse_decompress.c (1 hunks)
  • lib/common/pool.c (1 hunks)
  • lib/compress/fse_compress.c (1 hunks)
  • lib/compress/huf_compress.c (1 hunks)
  • lib/compress/zstd_compress_internal.h (1 hunks)
  • lib/compress/zstd_compress_sequences.c (1 hunks)
  • lib/compress/zstd_compress_superblock.c (1 hunks)
  • lib/compress/zstd_cwksp.h (1 hunks)
  • lib/compress/zstd_ldm.c (1 hunks)
  • lib/compress/zstd_preSplit.c (1 hunks)
  • lib/compress/zstdmt_compress.c (1 hunks)
  • lib/dictBuilder/cover.c (1 hunks)
  • lib/dictBuilder/divsufsort.c (1 hunks)
  • lib/dictBuilder/fastcover.c (1 hunks)
  • lib/dictBuilder/zdict.c (1 hunks)
  • programs/benchfn.c (1 hunks)
  • programs/benchzstd.c (1 hunks)
  • programs/dibio.c (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (14)
programs/dibio.c (2)
programs/util.c (1)
  • UTIL_getFileSize (545-558)
lib/dictBuilder/zdict.c (3)
  • ZDICT_trainFromBuffer_legacy (1088-1108)
  • ZDICT_isError (98-98)
  • ZDICT_getErrorName (100-100)
lib/dictBuilder/fastcover.c (4)
lib/compress/zstd_compress_internal.h (17)
  • ZSTD_hash6Ptr (915-915)
  • ZSTD_hash8Ptr (925-925)
  • U32 (584-596)
  • U32 (601-613)
  • U32 (1050-1055)
  • U32 (1061-1064)
  • U32 (1097-1125)
  • U32 (1132-1146)
  • U32 (1395-1407)
  • U32 (1412-1423)
  • size_t (649-659)
  • size_t (661-670)
  • size_t (677-683)
  • size_t (854-873)
  • size_t (879-892)
  • size_t (900-900)
  • size_t (901-901)
lib/dictBuilder/cover.c (12)
  • COVER_sum (278-285)
  • COVER_computeEpochs (732-747)
  • COVER_dictSelectionError (1035-1037)
  • COVER_selectDict (1047-1132)
  • COVER_dictSelectionIsError (1039-1041)
  • COVER_best_finish (980-1024)
  • COVER_dictSelectionFree (1043-1045)
  • COVER_warnOnSmallCorpus (716-730)
  • COVER_best_init (922-931)
  • COVER_best_destroy (950-960)
  • COVER_best_start (966-973)
  • COVER_best_wait (936-945)
lib/dictBuilder/zdict.c (1)
  • ZDICT_finalizeDictionary (862-941)
lib/common/pool.c (6)
  • POOL_create (111-113)
  • POOL_create (326-328)
  • POOL_free (180-189)
  • POOL_free (339-342)
  • POOL_add (286-296)
  • POOL_add (354-357)
lib/compress/zstd_compress_sequences.c (2)
lib/compress/fse_compress.c (5)
  • FSE_optimalTableLog (371-374)
  • FSE_normalizeCount (465-525)
  • FSE_writeNCount (330-340)
  • FSE_buildCTable_rle (528-548)
  • FSE_buildCTable_wksp (68-214)
lib/legacy/zstd_v03.c (3)
  • ZSTD_isError (2373-2373)
  • ERR_isError (525-525)
  • MEM_32bits (107-107)
contrib/largeNbDicts/largeNbDicts.c (3)
programs/util.c (6)
  • UTIL_getFileSize (545-558)
  • UTIL_isDirectory (357-365)
  • UTIL_getTotalFileSize (630-645)
  • UTIL_createExpandedFNT (1379-1423)
  • UTIL_assembleFileNamesTable (811-815)
  • UTIL_freeFileNamesTable (817-823)
lib/compress/zstd_compress.c (13)
  • ZSTD_freeCDict (5740-5751)
  • ZSTD_createCDict_advanced2 (5668-5716)
  • ZSTD_createCCtx (97-100)
  • ZSTD_compressCCtx (5495-5503)
  • ZSTD_compress_usingCDict (5920-5927)
  • ZSTD_CCtx_setParametersUsingCCtxParams (1186-1198)
  • ZSTD_freeCCtx (181-192)
  • ZSTD_CCtx_refCDict (1336-1344)
  • ZSTD_compress2 (6576-6605)
  • ZSTD_sizeof_CDict (5550-5557)
  • ZSTD_createCCtxParams (349-352)
  • ZSTD_CCtxParams_init (366-372)
  • ZSTD_CCtxParams_setParameter (778-1029)
programs/benchfn.c (6)
  • BMK_createTimedFnState (160-166)
  • BMK_benchTimedFn (211-256)
  • BMK_isSuccessful_runOutcome (64-67)
  • BMK_extract_runTime (71-75)
  • BMK_isCompleted_TimedFn (200-203)
  • BMK_freeTimedFnState (168-168)
lib/compress/zstd_ldm.c (3)
lib/compress/zstd_compress_internal.h (18)
  • BYTE (630-644)
  • U32 (584-596)
  • U32 (601-613)
  • U32 (1050-1055)
  • U32 (1061-1064)
  • U32 (1097-1125)
  • U32 (1132-1146)
  • U32 (1395-1407)
  • U32 (1412-1423)
  • size_t (649-659)
  • size_t (661-670)
  • size_t (677-683)
  • size_t (854-873)
  • size_t (879-892)
  • size_t (900-900)
  • size_t (901-901)
  • ZSTD_window_correctOverflow (1159-1230)
  • ZSTD_storeSeq (775-811)
lib/compress/zstd_cwksp.h (3)
  • BYTE (394-397)
  • size_t (206-210)
  • size_t (224-232)
lib/compress/zstd_compress.c (1)
  • ZSTD_selectBlockCompressor (3077-3160)
lib/compress/zstdmt_compress.c (4)
lib/compress/zstd_compress_internal.h (9)
  • U32 (584-596)
  • U32 (601-613)
  • U32 (1050-1055)
  • U32 (1061-1064)
  • U32 (1097-1125)
  • U32 (1132-1146)
  • U32 (1395-1407)
  • U32 (1412-1423)
  • ZSTD_window_update (1354-1390)
lib/compress/zstd_compress.c (9)
  • ZSTD_freeCCtx (181-192)
  • ZSTD_createCCtx_advanced (114-124)
  • ZSTD_referenceExternalSequences (4788-4797)
  • ZSTD_compressBegin_advanced_internal (5293-5309)
  • ZSTD_compressContinue_public (4861-4867)
  • ZSTD_compressEnd_public (5415-5439)
  • ZSTD_checkCParams (1396-1406)
  • ZSTD_compressBound (70-74)
  • ZSTD_writeLastEmptyBlock (4778-4786)
lib/compress/zstd_ldm.c (3)
  • ZSTD_ldm_adjustParameters (135-167)
  • ZSTD_ldm_getMaxNbSeq (179-182)
  • ZSTD_ldm_generateSequences (526-601)
lib/common/pool.c (10)
  • POOL_create_advanced (115-160)
  • POOL_create_advanced (330-337)
  • POOL_free (180-189)
  • POOL_free (339-342)
  • POOL_sizeof (206-211)
  • POOL_sizeof (365-369)
  • POOL_resize (244-253)
  • POOL_resize (349-352)
  • POOL_tryAdd (299-310)
  • POOL_tryAdd (359-363)
programs/benchzstd.c (4)
lib/compress/zstd_compress.c (9)
  • ZSTD_CCtx_reset (1375-1390)
  • ZSTD_CCtx_setParameter (716-776)
  • ZSTD_CCtx_loadDictionary (1329-1333)
  • ZSTD_compress2 (6576-6605)
  • ZSTD_compressBound (70-74)
  • ZSTD_sizeof_CCtx (206-214)
  • ZSTD_createCCtx (97-100)
  • ZSTD_freeCCtx (181-192)
  • ZSTD_maxCLevel (7681-7681)
lib/legacy/zstd_v03.c (3)
  • ZSTD_isError (2373-2373)
  • U64 (126-129)
  • U64 (177-187)
programs/benchfn.c (6)
  • BMK_benchTimedFn (211-256)
  • BMK_isSuccessful_runOutcome (64-67)
  • BMK_extract_runTime (71-75)
  • BMK_isCompleted_TimedFn (200-203)
  • BMK_createTimedFnState (160-166)
  • BMK_freeTimedFnState (168-168)
programs/util.c (3)
  • UTIL_getFileSize (545-558)
  • UTIL_isDirectory (357-365)
  • UTIL_getTotalFileSize (630-645)
lib/compress/huf_compress.c (2)
lib/legacy/zstd_v03.c (9)
  • HUF_isError (1451-1451)
  • U32 (121-124)
  • U32 (161-164)
  • U32 (166-175)
  • BYTE (660-669)
  • BYTE (671-680)
  • HUF_readStats (1477-1553)
  • MEM_32bits (107-107)
  • ERR_isError (525-525)
lib/compress/fse_compress.c (6)
  • FSE_optimalTableLog (371-374)
  • FSE_normalizeCount (465-525)
  • FSE_writeNCount (330-340)
  • FSE_buildCTable_wksp (68-214)
  • FSE_compress_usingCTable (610-620)
  • FSE_optimalTableLog_internal (357-369)
lib/dictBuilder/cover.c (4)
lib/legacy/zstd_v03.c (5)
  • U64 (126-129)
  • U64 (177-187)
  • U32 (121-124)
  • U32 (161-164)
  • U32 (166-175)
lib/dictBuilder/zdict.c (2)
  • ZDICT_finalizeDictionary (862-941)
  • ZDICT_isError (98-98)
lib/compress/zstd_compress.c (6)
  • ZSTD_compressBound (70-74)
  • ZSTD_createCCtx (97-100)
  • ZSTD_createCDict (5718-5727)
  • ZSTD_compress_usingCDict (5920-5927)
  • ZSTD_freeCCtx (181-192)
  • ZSTD_freeCDict (5740-5751)
lib/common/pool.c (6)
  • POOL_create (111-113)
  • POOL_create (326-328)
  • POOL_free (180-189)
  • POOL_free (339-342)
  • POOL_add (286-296)
  • POOL_add (354-357)
lib/compress/zstd_compress_internal.h (2)
lib/legacy/zstd_v03.c (10)
  • U32 (121-124)
  • U32 (161-164)
  • U32 (166-175)
  • ZSTD_isError (2373-2373)
  • BYTE (660-669)
  • BYTE (671-680)
  • ZSTD_wildcopy (2331-2337)
  • MEM_64bits (108-108)
  • U64 (126-129)
  • U64 (177-187)
lib/compress/zstd_compress.c (9)
  • ZSTD_getSeqStore (222-222)
  • ZSTD_seqToCodes (2701-2727)
  • ZSTD_selectBlockCompressor (3077-3160)
  • ZSTD_cParam_getBounds (427-645)
  • ZSTD_loadCEntropy (5069-5158)
  • ZSTD_reset_compressedBlockState (1926-1935)
  • ZSTD_convertBlockSequences (7326-7400)
  • ZSTD_get1BlockSummary (7404-7452)
  • ZSTD_get1BlockSummary (7456-7481)
lib/compress/zstd_compress_superblock.c (4)
lib/compress/huf_compress.c (3)
  • HUF_compress1X_usingCTable (1163-1166)
  • HUF_compress4X_usingCTable (1218-1221)
  • HUF_estimateCompressedSize (794-803)
lib/legacy/zstd_v03.c (3)
  • ERR_isError (525-525)
  • BYTE (660-669)
  • BYTE (671-680)
lib/compress/zstd_compress_sequences.c (3)
  • ZSTD_encodeSequences (419-442)
  • ZSTD_crossEntropyCost (139-154)
  • ZSTD_fseBitCost (104-132)
lib/compress/zstd_compress.c (1)
  • ZSTD_buildBlockEntropyStats (3806-3834)
lib/dictBuilder/zdict.c (5)
lib/legacy/zstd_v03.c (6)
  • ERR_isError (525-525)
  • U32 (121-124)
  • U32 (161-164)
  • U32 (166-175)
  • HUF_isError (1451-1451)
  • FSE_isError (1106-1106)
lib/compress/zstd_compress.c (10)
  • ZSTD_reset_compressedBlockState (1926-1935)
  • ZSTD_loadCEntropy (5069-5158)
  • ZSTD_compressBegin_usingCDict_deprecated (5881-5885)
  • ZSTD_compressBlock_deprecated (4891-4898)
  • ZSTD_getSeqStore (222-222)
  • ZSTD_seqToCodes (2701-2727)
  • ZSTD_createCDict_advanced (5650-5666)
  • ZSTD_createCCtx (97-100)
  • ZSTD_freeCDict (5740-5751)
  • ZSTD_freeCCtx (181-192)
lib/dictBuilder/divsufsort.c (1)
  • divsufsort (1846-1873)
lib/compress/huf_compress.c (2)
  • HUF_buildCTable_wksp (756-792)
  • HUF_writeCTable_wksp (248-289)
lib/compress/fse_compress.c (2)
  • FSE_normalizeCount (465-525)
  • FSE_writeNCount (330-340)
lib/compress/fse_compress.c (3)
lib/legacy/zstd_v03.c (3)
  • FSE_isError (1106-1106)
  • U16 (116-119)
  • U16 (136-145)
lib/legacy/zstd_v04.c (3)
  • FSE_isError (1089-1089)
  • U16 (92-95)
  • U16 (112-121)
lib/legacy/zstd_v05.c (2)
  • U16 (119-122)
  • U16 (149-157)
lib/common/fse_decompress.c (1)
lib/legacy/zstd_v04.c (4)
  • BYTE (839-848)
  • BYTE (850-859)
  • U16 (92-95)
  • U16 (112-121)
🔇 Additional comments (17)
lib/compress/zstd_compress_superblock.c (3)

220-236: LGTM: Well-documented backward compatibility handling.

The workaround for the zstd decoder bug (versions ≤ 1.3.4) is properly documented with a reference to the fix. The condition is appropriately guarded by FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION to avoid issues during fuzzing.


665-688: LGTM: Clean public API with proper error handling.

The function properly builds block entropy statistics and delegates to the multi-block compressor with appropriate error propagation.


649-657: Repcode regeneration logic is correct—no issues found.

The ZSTD_updateRep function correctly handles the ll0 parameter (whether litLength == 0). The function uses this boolean to adjust repcode calculations on line 825: repCode = OFFBASE_TO_REPCODE(offBase) - 1 + ll0. When litLength is 0, the repcode index shifts by 1, which is the correct ZSTD semantics. The superblock code at line 654 properly passes this condition and correctly regenerates repcode state for skipped sequences.

programs/benchfn.c (2)

64-81: LGTM: Accessor functions enforce API contract.

The functions correctly access internal fields marked with never_ever_use_directly, which enforces that users must go through these accessor functions. The CONTROL macro provides appropriate assertions to catch misuse.


232-240: LGTM: Adaptive loop adjustment with overflow protection.

The logic dynamically adjusts nbLoops based on run duration to maintain consistent timing, with proper overflow checking. The 10x multiplier for short runs helps quickly reach the target duration.

contrib/pzstd/Pzstd.cpp (3)

344-349: LGTM: Frame size calculation with documented constraints.

The assertion ensures windowLog ≤ 23 to respect the 4GB skippable frame size limit, which is well-documented in the comment.


540-545: LGTM: Adaptive queue sizing for memory management.

When a non-skippable frame is encountered, the code appropriately limits queue sizes to prevent excessive memory usage during serial decompression fallback.


599-610: LGTM: Skippable frame size validation.

The code correctly validates that the compressed size fits in a 32-bit skippable frame before writing, with an appropriate assertion for the uint64_t cast.

lib/dictBuilder/fastcover.c (4)

149-219: LGTM: Well-structured segment selection algorithm.

The sliding window approach with frequency tracking correctly implements the scoring function described in the header comments. The cleanup loops ensure state consistency between calls.


490-491: Related to frequency table size validation.

This memcpy can copy up to 8GB if f == 31. This reinforces the need to validate f bounds before context initialization (see previous comment on lines 377-382).


566-570: LGTM: Parameter validation before expensive operations.

The code validates parameters (including f bounds via FASTCOVER_checkParameters) before allocating large frequency tables, which addresses the memory concerns raised in earlier comments.


377-382: The f parameter is already validated before reaching the allocation.

The parameter validation occurs in FASTCOVER_checkParameters() at lines 241-243, which ensures 0 < f <= FASTCOVER_MAX_F (where FASTCOVER_MAX_F == 31). This validation is called at lines 566 and 726, before FASTCOVER_ctx_init() is invoked, protecting the subsequent allocation at line 377. No changes are needed.

Likely an incorrect or invalid review comment.

lib/compress/zstd_ldm.c (5)

96-133: LGTM: Efficient rolling hash with batch processing.

The gear hash implementation uses macros for loop unrolling and appropriate branch prediction hints. The batch processing pattern (filling splits array) is well-structured.


493-505: LGTM: Optimization for repeating patterns.

The logic to detect and skip overlapping matches (e.g., repeated bytes) is a significant performance optimization. The hash reset and loop continuation correctly handle this edge case.


526-601: LGTM: Robust chunked sequence generation.

The function properly handles large inputs through chunking, with appropriate overflow correction and distance constraint enforcement. The leftover literal accumulation correctly maintains state across chunks.


616-624: LGTM: Correct handling of shortened matches.

When skipping makes a match too short (below minMatch), the code correctly adds the remaining match bytes as literals to the next sequence before advancing position.


697-704: LGTM: Strategy-appropriate LDM integration.

For optimal parsing strategies (btopt and above), LDM sequences are correctly used as candidates rather than being directly accepted, allowing the optimal parser to make the final decision.

Comment on lines +538 to +568
static size_t compressBlocks(size_t* cSizes, /* optional (can be NULL). If present, must contain at least nbBlocks fields */
                             slice_collection_t dstBlockBuffers,
                             slice_collection_t srcBlockBuffers,
                             ZSTD_CDict* cdict, int cLevel)
{
    size_t const nbBlocks = srcBlockBuffers.nbSlices;
    assert(dstBlockBuffers.nbSlices == srcBlockBuffers.nbSlices);

    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    assert(cctx != NULL);

    size_t totalCSize = 0;
    for (size_t blockNb=0; blockNb < nbBlocks; blockNb++) {
        size_t cBlockSize;
        if (cdict == NULL) {
            cBlockSize = ZSTD_compressCCtx(cctx,
                            dstBlockBuffers.slicePtrs[blockNb], dstBlockBuffers.capacities[blockNb],
                            srcBlockBuffers.slicePtrs[blockNb], srcBlockBuffers.capacities[blockNb],
                            cLevel);
        } else {
            cBlockSize = ZSTD_compress_usingCDict(cctx,
                            dstBlockBuffers.slicePtrs[blockNb], dstBlockBuffers.capacities[blockNb],
                            srcBlockBuffers.slicePtrs[blockNb], srcBlockBuffers.capacities[blockNb],
                            cdict);
        }
        CONTROL(!ZSTD_isError(cBlockSize));
        if (cSizes) cSizes[blockNb] = cBlockSize;
        totalCSize += cBlockSize;
    }
    return totalCSize;
}

⚠️ Potential issue | 🟠 Major

Free the temporary CCtx before returning.

compressBlocks() calls ZSTD_createCCtx() on every invocation but never releases it. In steady state this leaks one compression context per call, which adds up quickly when benchmarking large datasets. Please free the context before returning so the function stays memory-neutral.

 static size_t compressBlocks(...)
 {
     ZSTD_CCtx* const cctx = ZSTD_createCCtx();
     assert(cctx != NULL);

     ...
-    return totalCSize;
+    ZSTD_freeCCtx(cctx);
+    return totalCSize;
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
static size_t compressBlocks(size_t* cSizes, /* optional (can be NULL). If present, must contain at least nbBlocks fields */
slice_collection_t dstBlockBuffers,
slice_collection_t srcBlockBuffers,
ZSTD_CDict* cdict, int cLevel)
{
size_t const nbBlocks = srcBlockBuffers.nbSlices;
assert(dstBlockBuffers.nbSlices == srcBlockBuffers.nbSlices);
ZSTD_CCtx* const cctx = ZSTD_createCCtx();
assert(cctx != NULL);
size_t totalCSize = 0;
for (size_t blockNb=0; blockNb < nbBlocks; blockNb++) {
size_t cBlockSize;
if (cdict == NULL) {
cBlockSize = ZSTD_compressCCtx(cctx,
dstBlockBuffers.slicePtrs[blockNb], dstBlockBuffers.capacities[blockNb],
srcBlockBuffers.slicePtrs[blockNb], srcBlockBuffers.capacities[blockNb],
cLevel);
} else {
cBlockSize = ZSTD_compress_usingCDict(cctx,
dstBlockBuffers.slicePtrs[blockNb], dstBlockBuffers.capacities[blockNb],
srcBlockBuffers.slicePtrs[blockNb], srcBlockBuffers.capacities[blockNb],
cdict);
}
CONTROL(!ZSTD_isError(cBlockSize));
if (cSizes) cSizes[blockNb] = cBlockSize;
totalCSize += cBlockSize;
}
return totalCSize;
}
Suggested replacement (adds ZSTD_freeCCtx before the return):

static size_t compressBlocks(size_t* cSizes, /* optional (can be NULL). If present, must contain at least nbBlocks fields */
slice_collection_t dstBlockBuffers,
slice_collection_t srcBlockBuffers,
ZSTD_CDict* cdict, int cLevel)
{
size_t const nbBlocks = srcBlockBuffers.nbSlices;
assert(dstBlockBuffers.nbSlices == srcBlockBuffers.nbSlices);
ZSTD_CCtx* const cctx = ZSTD_createCCtx();
assert(cctx != NULL);
size_t totalCSize = 0;
for (size_t blockNb=0; blockNb < nbBlocks; blockNb++) {
size_t cBlockSize;
if (cdict == NULL) {
cBlockSize = ZSTD_compressCCtx(cctx,
dstBlockBuffers.slicePtrs[blockNb], dstBlockBuffers.capacities[blockNb],
srcBlockBuffers.slicePtrs[blockNb], srcBlockBuffers.capacities[blockNb],
cLevel);
} else {
cBlockSize = ZSTD_compress_usingCDict(cctx,
dstBlockBuffers.slicePtrs[blockNb], dstBlockBuffers.capacities[blockNb],
srcBlockBuffers.slicePtrs[blockNb], srcBlockBuffers.capacities[blockNb],
cdict);
}
CONTROL(!ZSTD_isError(cBlockSize));
if (cSizes) cSizes[blockNb] = cBlockSize;
totalCSize += cBlockSize;
}
ZSTD_freeCCtx(cctx);
return totalCSize;
}
🤖 Prompt for AI Agents
In contrib/largeNbDicts/largeNbDicts.c around lines 538 to 568, the function
creates a ZSTD_CCtx with ZSTD_createCCtx() but never releases it; add a call to
ZSTD_freeCCtx(cctx) before returning to avoid leaking the compression context
(place it after the for loop and before the final return totalCSize). Also
ensure any future early-return or error paths free cctx as well (e.g., free cctx
before returning on detected errors).

Comment on lines +627 to +636
ZSTD_CCtx_refCDict(ci->cctx, ci->dictionaries.cdicts[ci->dictNb]);
ZSTD_compress2(ci->cctx,
dst, srcSize,
src, srcSize);

ci->dictNb = ci->dictNb + 1;
if (ci->dictNb >= ci->nbDicts) ci->dictNb = 0;

return srcSize;
}

⚠️ Potential issue | 🔴 Critical

Propagate ZSTD_compress2() errors and use the provided capacity.

compress() currently hard-codes dstCapacity to srcSize when calling ZSTD_compress2() and ignores the return value. With incompressible input the function silently reports success while leaving the output buffer untouched, so later consumers misinterpret garbage. Please pass the actual dstCapacity and propagate any compression error before reporting success (while still returning srcSize for throughput accounting).

 size_t compress(..., void* dst, size_t dstCapacity, void* payload)
 {
     compressInstructions* const ci = (compressInstructions*) payload;
     (void)dstCapacity;

     ZSTD_CCtx_refCDict(ci->cctx, ci->dictionaries.cdicts[ci->dictNb]);
-    ZSTD_compress2(ci->cctx,
-            dst, srcSize,
-            src, srcSize);
+    {
+        size_t const cSize = ZSTD_compress2(ci->cctx,
+                                            dst, dstCapacity,
+                                            src, srcSize);
+        if (ZSTD_isError(cSize)) {
+            return cSize;
+        }
+    }

     ci->dictNb = ci->dictNb + 1;
     if (ci->dictNb >= ci->nbDicts) ci->dictNb = 0;

     return srcSize;
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
ZSTD_CCtx_refCDict(ci->cctx, ci->dictionaries.cdicts[ci->dictNb]);
ZSTD_compress2(ci->cctx,
dst, srcSize,
src, srcSize);
ci->dictNb = ci->dictNb + 1;
if (ci->dictNb >= ci->nbDicts) ci->dictNb = 0;
return srcSize;
}
Suggested replacement (uses dstCapacity and checks the result):

ZSTD_CCtx_refCDict(ci->cctx, ci->dictionaries.cdicts[ci->dictNb]);
{
size_t const cSize = ZSTD_compress2(ci->cctx,
dst, dstCapacity,
src, srcSize);
if (ZSTD_isError(cSize)) {
return cSize;
}
}
ci->dictNb = ci->dictNb + 1;
if (ci->dictNb >= ci->nbDicts) ci->dictNb = 0;
return srcSize;
}
🤖 Prompt for AI Agents
In contrib/largeNbDicts/largeNbDicts.c around lines 627 to 636, ZSTD_compress2()
is being called with dstCapacity hard-coded to srcSize and its return value
ignored; change the call to pass the actual dstCapacity variable, capture the
returned size_t result, then check it with ZSTD_isError(result) and if it's an
error return that error code, otherwise return srcSize for throughput
accounting. Also ensure dstCapacity is in scope and used instead of srcSize in
the ZSTD_compress2() call.
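The pattern the review asks for can be sketched in isolation. The snippet below is a minimal, self-contained illustration, not the project's code: `stub_compress`/`stub_isError` are stand-ins for `ZSTD_compress2()`/`ZSTD_isError()` so the error-propagation shape can be shown without linking libzstd, and `STUB_ERROR` plays the role of a zstd error code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for ZSTD_compress2(): "compresses" by copying, and fails when
 * the destination capacity is too small. (size_t)-1 mimics a zstd error code. */
#define STUB_ERROR ((size_t)-1)
static size_t stub_compress(void* dst, size_t dstCapacity,
                            const void* src, size_t srcSize)
{
    if (dstCapacity < srcSize) return STUB_ERROR;
    memcpy(dst, src, srcSize);
    return srcSize;
}
static int stub_isError(size_t code) { return code == STUB_ERROR; }

/* The corrected shape: use the real capacity, check the result, propagate
 * errors, and only report srcSize (for throughput accounting) on success. */
static size_t compress_checked(const void* src, size_t srcSize,
                               void* dst, size_t dstCapacity)
{
    size_t const cSize = stub_compress(dst, dstCapacity, src, srcSize);
    if (stub_isError(cSize)) return cSize;   /* propagate, don't swallow */
    return srcSize;
}
```

With the real API the structure is identical: pass `dstCapacity` through, test the return with `ZSTD_isError()`, and return the error code unchanged so callers can see the failure.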

Comment thread contrib/pzstd/Pzstd.cpp
Comment on lines +118 to +123
{
std::error_code ec;
if (is_directory(inputFile, ec)) {
errorHolder.setError("Output file is a directory -- ignored");
return nullptr;
}

⚠️ Potential issue | 🔴 Critical

Fix incorrect error message.

The error message says "Output file is a directory" but the function is checking if the input file is a directory.

Apply this diff:

-      errorHolder.setError("Output file is a directory -- ignored");
+      errorHolder.setError("Input file is a directory -- ignored");
🤖 Prompt for AI Agents
In contrib/pzstd/Pzstd.cpp around lines 118 to 123, the error message
incorrectly refers to the "Output file" while the code is checking whether the
input file is a directory; change the error text to reference the input file
(e.g., "Input file is a directory -- ignored") so the message accurately
reflects the checked condition.

Comment on lines +201 to +226
#define SEGMENT_SIZE 512
FPStats* const fpstats = (FPStats*)workspace;
Fingerprint* middleEvents = (Fingerprint*)(void*)((char*)workspace + 512 * sizeof(unsigned));
assert(blockSize == (128 << 10));
assert(workspace != NULL);
assert((size_t)workspace % ZSTD_ALIGNOF(FPStats) == 0);
ZSTD_STATIC_ASSERT(ZSTD_SLIPBLOCK_WORKSPACESIZE >= sizeof(FPStats));
assert(wkspSize >= sizeof(FPStats)); (void)wkspSize;

initStats(fpstats);
HIST_add(fpstats->pastEvents.events, blockStart, SEGMENT_SIZE);
HIST_add(fpstats->newEvents.events, (const char*)blockStart + blockSize - SEGMENT_SIZE, SEGMENT_SIZE);
fpstats->pastEvents.nbEvents = fpstats->newEvents.nbEvents = SEGMENT_SIZE;
if (!compareFingerprints(&fpstats->pastEvents, &fpstats->newEvents, 0, 8))
return blockSize;

HIST_add(middleEvents->events, (const char*)blockStart + blockSize/2 - SEGMENT_SIZE/2, SEGMENT_SIZE);
middleEvents->nbEvents = SEGMENT_SIZE;
{ U64 const distFromBegin = fpDistance(&fpstats->pastEvents, middleEvents, 8);
U64 const distFromEnd = fpDistance(&fpstats->newEvents, middleEvents, 8);
U64 const minDistance = SEGMENT_SIZE * SEGMENT_SIZE / 3;
if (abs64((S64)distFromBegin - (S64)distFromEnd) < minDistance)
return 64 KB;
return (distFromBegin > distFromEnd) ? 32 KB : 96 KB;
}
}

⚠️ Potential issue | 🔴 Critical

Avoid aliasing middleEvents with the existing workspace.

middleEvents is derived from (char*)workspace + 512*sizeof(unsigned), which lands inside fpstats->pastEvents.events. Filling middleEvents therefore corrupts pastEvents and the comparison logic that follows. Allocate an independent buffer (e.g. a stack Fingerprint middleEvents; ZSTD_memset(&middleEvents, 0, sizeof(middleEvents));) or carve space after FPStats once the workspace size guarantee includes it. That keeps all three fingerprints disjoint so compareFingerprints() works as intended.
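The "carve after FPStats" variant of the fix can be sketched as below. The `Fingerprint`/`FPStats` layouts here are illustrative assumptions (the real definitions live in the pre-split source, and 1024 event buckets is a guess); the point is only that offsetting by `sizeof(FPStats)` keeps all three fingerprints disjoint, whereas the current `512 * sizeof(unsigned)` offset lands inside `pastEvents.events`.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative layout only -- not the real zstd definitions. */
typedef struct { unsigned events[1024]; size_t nbEvents; } Fingerprint;
typedef struct { Fingerprint pastEvents; Fingerprint newEvents; } FPStats;

/* Carve the third fingerprint AFTER FPStats so it cannot alias
 * pastEvents/newEvents. The caller must guarantee
 * wkspSize >= sizeof(FPStats) + sizeof(Fingerprint). */
static Fingerprint* carveMiddleEvents(void* workspace, size_t wkspSize)
{
    assert(wkspSize >= sizeof(FPStats) + sizeof(Fingerprint));
    return (Fingerprint*)(void*)((char*)workspace + sizeof(FPStats));
}
```

This assumes the workspace-size constant is enlarged to cover the extra `Fingerprint`; the alternative mentioned above, a zero-initialized stack `Fingerprint`, avoids touching the workspace contract at all.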

Comment thread programs/dibio.c
Comment on lines +223 to +229
while (!testmem) {
testmem = malloc((size_t)requiredMem);
requiredMem -= step;
}

free(testmem);
return (size_t)requiredMem;

⚠️ Potential issue | 🔴 Critical

Fix DiB_findMaxMem return value before dividing by memMult

When the first malloc() succeeds, we immediately subtract step, so the function returns one step too small. For small training sets (e.g. < ~0.75 MB) the subsequent division by memMult drops to zero, loadedSize becomes 0, and we load no samples, hitting the “nb of samples too low” exit path. Returning the actual allocated size fixes the regression.

-    free(testmem);
-    return (size_t)requiredMem;
+    free(testmem);
+    return (size_t)(requiredMem + step);
🤖 Prompt for AI Agents
In programs/dibio.c around lines 223 to 229, the loop subtracts step immediately
after a successful malloc so the function returns one step too small; change the
logic so the returned size reflects the actual allocated memory (either subtract
step only when malloc fails, or add step back to requiredMem before returning)
so DiB_findMaxMem returns the true allocated size before the later division by
memMult.
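The "subtract only on failure" variant of the fix can be sketched as a standalone probe loop. This is a shape-level illustration of DiB_findMaxMem, not the project's code; the 8 MB step is an assumption for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Probe for an allocatable size by stepping down until malloc succeeds,
 * and return the size that was actually allocated -- not one step less. */
static size_t findMaxMem(size_t requiredMem)
{
    size_t const step = (size_t)8 << 20;   /* 8 MB, illustrative */
    void* testmem = NULL;

    while (!testmem) {
        testmem = malloc(requiredMem);
        if (!testmem) requiredMem -= step;   /* shrink only on failure */
    }
    free(testmem);
    return requiredMem;   /* the size that actually succeeded */
}
```

Moving the subtraction under the failure branch is equivalent to the suggested `requiredMem + step` correction, and keeps the return value aligned with what was really allocated.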
