
Add MLK seed files #7

Open
mugitya03 wants to merge 1 commit into main from add-mlk-seed-files

Conversation


mugitya03 (Owner) commented Nov 17, 2025

Note

Imports legacy ZSTD v0.6/v0.7 code, adds I/O and utility modules (incl. async file I/O), benchmarking tools, a zlib wrapper using ZSTD, and tests for large dictionaries.

  • Legacy Code:
    • Add lib/legacy/zstd_v06.c and lib/legacy/zstd_v07.c with full v0.6/v0.7 compression/decompression implementations.
  • Wrappers:
    • Introduce zlibWrapper/ (gzlib.c, gzread.c, zstd_zlibwrapper.c) to expose ZSTD via zlib-compatible APIs.
  • Programs/Tools:
    • Add benchmarking and CLI utilities: programs/benchfn.c, benchzstd.c, datagen.c, dibio.c, fileio.c, fileio_asyncio.c, util.c.
  • Tests:
    • Add large-dictionary stress tests: tests/bigdict.c, tests/largeDictionary.c.

Written by Cursor Bugbot for commit 14d58f4.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added educational decoder implementation for learning ZSTD decompression internals.
    • Introduced parallel compression framework (pzstd) for multi-threaded compression workflows.
    • Added seekable format support for random access within compressed data.
    • Implemented dictionary training tools with COVER and FastCOVER algorithms.
    • Added zlib compatibility wrapper layer enabling ZSTD as a zlib alternative.
    • Introduced async I/O subsystem with sparse write support.
    • Added legacy ZSTD v0.1 decompression support for backward compatibility.
  • Benchmarking & Tools

    • Added comprehensive benchmarking framework and ZSTD-specific benchmarking utilities.
    • Added data generator for reproducible test scenarios.
    • Added recovery utility for extracting frames from corrupted files.
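The sparse-write support mentioned under the async I/O subsystem usually means detecting all-zero blocks and seeking past them instead of writing, so the filesystem can keep a hole. A standalone sketch of that idea (illustrative only, not the fileio_asyncio implementation; a real version must also extend the file if it ends in a hole):

```c
#include <stdio.h>

/* Write `size` bytes from `buf`, skipping all-zero blocks with a seek.
 * Sketch: block-at-a-time, no handling for a trailing hole. */
static int write_sparse(FILE* f, const char* buf, size_t size, size_t blockSize)
{
    size_t pos = 0;
    while (pos < size) {
        size_t const n = (size - pos < blockSize) ? size - pos : blockSize;
        const char* block = buf + pos;
        size_t i = 0;
        while (i < n && block[i] == 0) i++;
        if (i == n) {                        /* all zeroes: seek, don't write */
            if (fseek(f, (long)n, SEEK_CUR) != 0) return -1;
        } else {
            if (fwrite(block, 1, n, f) != n) return -1;
        }
        pos += n;
    }
    return 0;
}
```

On POSIX filesystems the skipped region reads back as zeros once a later write extends the file past it.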


coderabbitai Bot commented Nov 17, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

This pull request adds comprehensive tooling and infrastructure for ZSTD, including diagnostic utilities, parallel compression support (pzstd), seekable format handlers, dictionary training modules, benchmarking frameworks, file I/O subsystems, educational decompression reference, legacy format support, and gzip wrapper integration across contrib/, lib/, doc/, programs/, tests/, and zlibWrapper/ directories.

Changes

Cohort / File(s) Summary
Diagnostic and Testing Tools
contrib/diagnose_corruption/check_flipped_bits.c, contrib/seqBench/seqBench.c, tests/bigdict.c, tests/largeDictionary.c
Adds utilities to test ZSTD decompression under bit-level perturbations, benchmark sequence-based compression, and validate streaming compression/decompression with large buffers.
Parallel Compression Framework (pzstd)
contrib/pzstd/Pzstd.h, contrib/pzstd/utils/Buffer.h
Introduces multi-threaded compression infrastructure with SharedState context management, resource pools for CStream/DStream, async chunk and frame processing, and a reference-counted Buffer class with split/advance/subtract operations.
Seekable Format Support
contrib/seekable_format/zstdseek_compress.c, contrib/seekable_format/zstdseek_decompress.c
Implements bidirectional seekable ZSTD streams with per-frame logging, seek tables for random access, frame-level metadata tracking, and custom file abstractions for both in-memory and file-based sources.
Dictionary Training Pipeline
lib/dictBuilder/cover.c, lib/dictBuilder/divsufsort.c, lib/dictBuilder/fastcover.c, lib/dictBuilder/zdict.c
Adds complete dictionary training implementations using COVER and FASTCOVER heuristics, suffix-array construction (libdivsufsort-lite), entropy analysis, and dictionary finalization with optional multi-threaded optimization.
Benchmarking Framework
programs/benchfn.c, programs/benchzstd.c
Introduces timed function benchmarking infrastructure (BMK) with state management, result aggregation, compression-level sweeping, file-based benchmarks, synthetic data generation, and detailed performance metrics.
File I/O and Data Utilities
programs/fileio_asyncio.c, programs/util.c, programs/datagen.c, programs/dibio.c
Adds asynchronous I/O subsystem with sparse write support, cross-platform file utilities (stat, chmod, size, directory operations), deterministic random data generation with configurable compressibility, and dictionary training orchestration from files.
Educational Decompression Reference
doc/educational_decoder/harness.c, doc/educational_decoder/zstd_decompress.c
Provides self-contained educational ZSTD decompressor with complete frame/block decoding, Huffman/FSE table handling, sequence execution, dictionary support, and command-line harness for decompression with optional dictionaries.
Legacy Format Support
lib/legacy/zstd_v01.c
Implements full ZSTD v0.1 decompressor with streaming and one-shot APIs, FSE/Huffman decoding, frame/block processing, and frame-size probing for backward compatibility.
External Features
contrib/externalSequenceProducer/main.c, contrib/largeNbDicts/largeNbDicts.c, contrib/recovery/recover_directory.c
Adds example external sequence producer, multi-dictionary benchmarking tool with pool management and CSV logging, and directory recovery utility for extracting multiple concatenated ZSTD frames.
gzip Wrapper Integration
zlibWrapper/gzlib.c, zlibWrapper/gzread.c, zlibWrapper/zstd_zlibwrapper.c
Implements complete gzip compatibility layer with stream lifecycle (open/close/seek), header detection, read/write paths, and transparent ZSTD-or-zlib routing with custom memory management.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant pzstd as pzstd<br/>Framework
    participant SharedState as SharedState<br/>Context
    participant CPool as CStream<br/>Pool
    participant Executor as ThreadPool<br/>Executor

    User->>pzstd: pzstdMain(options)
    pzstd->>SharedState: constructor(options)
    SharedState->>CPool: createResourcePool()
    SharedState->>Executor: initialize logging
    pzstd->>Executor: asyncCompressChunks()
    loop Per Chunk
        Executor->>CPool: acquireResource()
        Executor->>Executor: compress via ZSTD_compressStream()
        Executor->>CPool: releaseResource()
        Executor->>pzstd: push result to queue
    end
    pzstd->>pzstd: writeFile(output queue)
    pzstd->>SharedState: destructor
    SharedState->>CPool: free pools
    SharedState->>SharedState: cleanup
sequenceDiagram
    participant App as Application
    participant Seek as Seekable<br/>Compress
    participant FrameLog as Frame Log
    participant CStream as ZSTD CStream

    App->>Seek: createCStream()
    Seek->>Seek: allocate context
    Seek->>FrameLog: createFrameLog()
    App->>Seek: initCStream()
    App->>Seek: compressStream(input)
    Seek->>CStream: ZSTD_compressStream()
    loop Per Frame
        Seek->>FrameLog: logFrame(cSize,dSize,checksum)
    end
    App->>Seek: endStream()
    Seek->>Seek: writeSeekTable()
    Seek->>Seek: emit seek table footer
    App->>Seek: freeCStream()

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • Dictionary Training (cover.c, fastcover.c, divsufsort.c): Intricate sorting/partitioning algorithms, suffix-array construction with OpenMP paths, and multi-threaded parameter optimization require careful logic review.
  • Seekable Format (zstdseek_compress.c, zstdseek_decompress.c): Frame-level metadata tracking, seek-table serialization/deserialization, and custom file abstractions demand verification of offset correctness and seekability guarantees.
  • pzstd Framework (Pzstd.h, Buffer.h): SharedState lifecycle, resource pool management, and concurrent queue operations need synchronization correctness validation.
  • Async I/O (fileio_asyncio.c): Thread-pool coordination, read/write job queuing, sparse write logic, and condition-variable signaling require concurrency analysis.
  • gzip Wrapper (zstd_zlibwrapper.c): Mode detection, routing between zlib and ZSTD paths, stream state transitions, and error propagation must preserve zlib API semantics.
  • Legacy v01 Decompressor (zstd_v01.c): FSE/Huffman decoding, bitstream handling, and frame structure parsing need validation against v0.1 spec.
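The seek-table correctness concern above boils down to an ordered-offset lookup: map a decompressed position to the frame whose range contains it. A minimal sketch with illustrative names (these are not the actual contrib/seekable_format types):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical seek-table entry: cumulative offsets per frame, as
 * described in the walkthrough. */
typedef struct {
    uint64_t cOffset;   /* cumulative compressed offset of this frame */
    uint64_t dOffset;   /* cumulative decompressed offset of this frame */
} SeekEntry;

/* Binary-search the frame containing decompressed position `pos`.
 * Entries are sorted by dOffset; entry i covers [dOffset[i], dOffset[i+1]). */
static size_t findFrame(const SeekEntry* table, size_t numFrames, uint64_t pos)
{
    size_t lo = 0, hi = numFrames;          /* search window [lo, hi) */
    while (hi - lo > 1) {
        size_t const mid = (lo + hi) / 2;
        if (table[mid].dOffset <= pos) lo = mid; else hi = mid;
    }
    return lo;  /* index of the frame whose range contains pos */
}
```

Verifying the review's "offset correctness" point mostly means checking this boundary condition: a position exactly at a frame's starting dOffset must resolve to that frame, not the previous one.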

Poem

🐰 From diagnostic whiskers to seekable streams so bright,
Dictionary training hops through frames with delight,
Parallel threads compress with pzstd's might,
Benchmarks measure speed while gzip wraps tight,
A warren of features—hooray for the height!

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Title check ⚠️ Warning — The PR title 'Add MLK seed files' does not match the actual changeset, which adds dozens of substantial new modules across contrib, lib, programs, tests, doc, and zlibWrapper directories implementing dictionary training, compression utilities, legacy support, and other features. Resolution: revise the title to accurately describe the primary changes, such as 'Add comprehensive ZSTD utility modules and educational implementations', or be more specific about the dominant change if one exists.
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 41.79%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
  • Description Check ✅ Passed — Check skipped: CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch add-mlk-seed-files

Comment @coderabbitai help to get the list of available commands and usage tips.

ZSTD_freeDDict(stuff->ddict);
free(stuff->dict);
ZSTD_freeDCtx(stuff->dctx);
}

Bug: Forgotten Allocations

Memory leak in free_stuff function. The stuff->perturbed buffer is allocated with malloc in init_stuff but never freed in free_stuff, causing a memory leak when the program exits.

Fix in Cursor Fix in Web

memset(stuff->error_counts, 0, sizeof(stuff->error_counts));

return 1;
}

Bug: Incomplete Cleanup Leaks Memory

Memory leak on error paths in init_stuff. When initialization fails partway through (e.g., after allocating stuff->input and stuff->perturbed but before completing), the function returns 0 without freeing already-allocated resources. The caller in main exits immediately without calling free_stuff, leaking the partially-initialized allocations.

Fix in Cursor Fix in Web
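The incomplete-cleanup pattern flagged here is conventionally fixed with a single goto-based failure path, so every allocation made before the failure point is released. A sketch under assumed field names (the real stuff_t has more members than shown):

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative subset of the struct from the review. */
typedef struct { char* input; char* perturbed; } stuff_t;

/* Single-exit init: on any failure, jump to one cleanup label that
 * releases everything allocated so far. free(NULL) is a no-op, so the
 * label is safe regardless of how far init progressed. */
static int init_stuff_checked(stuff_t* s, size_t size)
{
    memset(s, 0, sizeof(*s));
    s->input = malloc(size);
    if (s->input == NULL) goto fail;
    s->perturbed = malloc(size);
    if (s->perturbed == NULL) goto fail;
    return 1;                      /* success: caller owns both buffers */
fail:
    free(s->perturbed);
    free(s->input);
    memset(s, 0, sizeof(*s));      /* leave the struct safe to free again */
    return 0;
}
```

With this shape, the caller in main can exit on failure without leaking, and a later free_stuff on the zeroed struct is harmless.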

*dict_id = ZSTD_getDictID_fromDDict(ddict);
}
return ddict;
}

Bug: Incomplete Initialization Leaks Resources

Memory leak in readDict when dictionary creation fails. If readFile succeeds but ZSTD_createDDict_advanced fails, the function returns NULL without freeing the buffer allocated by readFile and assigned to *buf. The caller cannot free this memory since it only checks the return value, not the output parameter.

Fix in Cursor Fix in Web
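The readDict issue is an output-parameter ownership problem: on failure the function should release and null *buf before returning NULL, so a NULL return never leaves the caller holding hidden memory. A sketch with stand-in functions (create_handle is hypothetical, standing in for ZSTD_createDDict_advanced; the malloc stands in for readFile):

```c
#include <stdlib.h>

/* Hypothetical second step that can fail, standing in for
 * ZSTD_createDDict_advanced() in the review. */
static void* create_handle(const char* data, size_t size, int should_fail)
{
    (void)data; (void)size;
    return should_fail ? NULL : malloc(1);
}

/* On failure, release the buffer handed out through the output
 * parameter and null it, so callers can rely on the return value alone. */
static void* read_dict_sketch(char** buf, size_t size, int should_fail)
{
    *buf = malloc(size);                 /* stands in for readFile() */
    if (*buf == NULL) return NULL;
    {
        void* const handle = create_handle(*buf, size, should_fail);
        if (handle == NULL) {
            free(*buf);                  /* don't leak on the error path */
            *buf = NULL;
            return NULL;
        }
        return handle;                   /* caller owns handle and *buf */
    }
}
```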


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 35

🧹 Nitpick comments (24)
programs/util.c (1)

99-105: Consider renaming UTIL_realloc to reflect non-standard behavior.

UTIL_realloc has non-standard semantics: it frees the original pointer on failure, unlike standard realloc(). While documented in a comment, this could lead to subtle bugs if developers expect standard realloc behavior.

Consider renaming to make the behavior explicit:

-UTIL_STATIC void* UTIL_realloc(void *ptr, size_t size)
+UTIL_STATIC void* UTIL_reallocOrFree(void *ptr, size_t size)
 {
     void *newptr = realloc(ptr, size);
     if (newptr) return newptr;
     free(ptr);
     return NULL;
 }

Or add a more prominent warning in the function documentation.

contrib/pzstd/utils/Buffer.h (1)

13-13: Remove unused include.

The <array> header is included but not used anywhere in this file.

Apply this diff to remove the unused include:

-#include <array>
 #include <cstddef>
 #include <memory>
contrib/pzstd/Pzstd.h (2)

46-57: Consider logging or propagating errors from the CStream factory.

When ZSTD_initCStream_advanced fails (line 52), the factory lambda returns nullptr without logging the error or setting errorHolder. This makes it difficult to diagnose why stream creation failed.

Apply this diff to add error logging:

             auto zcs = ZSTD_createCStream();
             if (zcs) {
               auto err = ZSTD_initCStream_advanced(
                   zcs, nullptr, 0, parameters, 0);
               if (ZSTD_isError(err)) {
+                this->log(kLogInfo, "ZSTD_initCStream_advanced failed: %s\n", 
+                          ZSTD_getErrorName(err));
                 ZSTD_freeCStream(zcs);
                 return nullptr;
               }
+            } else {
+              this->log(kLogInfo, "ZSTD_createCStream failed\n");
             }

64-74: Consider logging or propagating errors from the DStream factory.

Similar to the CStream factory, when ZSTD_initDStream fails (line 69), the factory returns nullptr without logging the error. This symmetry issue makes debugging difficult.

Apply this diff to add error logging:

             auto zds = ZSTD_createDStream();
             if (zds) {
               auto err = ZSTD_initDStream(zds);
               if (ZSTD_isError(err)) {
+                this->log(kLogInfo, "ZSTD_initDStream failed: %s\n",
+                          ZSTD_getErrorName(err));
                 ZSTD_freeDStream(zds);
                 return nullptr;
               }
+            } else {
+              this->log(kLogInfo, "ZSTD_createDStream failed\n");
             }
programs/benchfn.c (2)

179-179: Use uintptr_t for pointer arithmetic instead of size_t.

Casting a pointer to size_t for alignment checking is not portable. Use uintptr_t (from <stdint.h>) instead, which is specifically designed for pointer-to-integer conversions.

+#include <stdint.h>
+
 BMK_timedFnState_t*
 BMK_initStatic_timedFnState(void* buffer, size_t size, unsigned total_ms, unsigned run_ms)
 {
     typedef char check_size[ 2 * (sizeof(BMK_timedFnState_shell) >= sizeof(struct BMK_timedFnState_s)) - 1];
     typedef struct { check_size c; BMK_timedFnState_t tfs; } tfs_align;
     size_t const tfs_alignment = offsetof(tfs_align, tfs);
     BMK_timedFnState_t* const r = (BMK_timedFnState_t*)buffer;
     if (buffer == NULL) return NULL;
     if (size < sizeof(struct BMK_timedFnState_s)) return NULL;
-    if ((size_t)buffer % tfs_alignment) return NULL;
+    if ((uintptr_t)buffer % tfs_alignment) return NULL;
     BMK_resetTimedFnState(r, total_ms, run_ms);
     return r;
 }

192-192: Consider using a named constant for the "infinity" sentinel value.

The expression (double)TIMELOOP_NANOSEC * 2000000000 evaluates to approximately 63 years, serving as an effective "maximum possible value" for minimum-time tracking. While functionally correct, a named constant would clarify intent.

+#define BMK_TIMELOOP_INFINITY_NS ((double)TIMELOOP_NANOSEC * 2000000000)
+
 void BMK_resetTimedFnState(BMK_timedFnState_t* timedFnState, unsigned total_ms, unsigned run_ms)
 {
     if (!total_ms) total_ms = 1 ;
     if (!run_ms) run_ms = 1;
     if (run_ms > total_ms) run_ms = total_ms;
     timedFnState->timeSpent_ns = 0;
     timedFnState->timeBudget_ns = (PTime)total_ms * TIMELOOP_NANOSEC / 1000;
     timedFnState->runBudget_ns = (PTime)run_ms * TIMELOOP_NANOSEC / 1000;
-    timedFnState->fastestRun.nanoSecPerRun = (double)TIMELOOP_NANOSEC * 2000000000;
+    timedFnState->fastestRun.nanoSecPerRun = BMK_TIMELOOP_INFINITY_NS;
     timedFnState->fastestRun.sumOfReturn = (size_t)(-1LL);
     timedFnState->nbLoops = 1;
     timedFnState->coolTime = UTIL_getTime();
 }
programs/benchzstd.c (5)

67-71: Consider adding a comment to clarify the maxMemory calculation.

The expression 1ULL << ((sizeof(size_t) * 8) - 31) evaluates to 8 GB on 64-bit systems. While correct, a comment explaining the formula (e.g., "2^33 bytes = 8GB") would improve readability.


471-489: Strengthen overflow and allocation checks.

The overflow check at line 474 is good, but the subsequent malloc(decodedSize) at line 480 doesn't verify that the allocation is reasonable or check for malloc failure before use. While line 481 checks for NULL, consider adding a maximum size sanity check.

         {
             size_t const decodedSize = (size_t)totalDSize64;
             assert((U64)decodedSize == totalDSize64); /* check overflow */
             free(*resultBufferPtr);
             if (totalDSize64 > decodedSize) { /* size_t overflow */
                 RETURN_ERROR(
                         32,
                         BMK_benchOutcome_t,
                         "decompressed size is too large for local system");
             }
+            if (decodedSize > maxMemory) {
+                RETURN_ERROR(
+                        32,
+                        BMK_benchOutcome_t,
+                        "decompressed size exceeds maximum memory limit");
+            }
             *resultBufferPtr = malloc(decodedSize);
             if (!(*resultBufferPtr)) {
                 RETURN_ERROR(
                         33,
                         BMK_benchOutcome_t,
                         "allocation error: not enough memory");
             }

785-786: Consider clarifying the XOR-based parameter validation.

The expression !dstBuffer ^ !dstCapacity correctly validates that both parameters are NULL or both are non-NULL, but the intent is not immediately obvious. A more explicit check would improve readability.

-    int const dstParamsError =
-            !dstBuffer ^ !dstCapacity; /* must be both NULL or none */
+    int const dstParamsError = 
+            (dstBuffer == NULL) != (dstCapacity == 0); /* must be both NULL or both non-NULL */

1032-1049: Memory probing approach is functional but crude.

The current implementation probes for available memory by attempting allocations. While this works for a benchmarking utility, be aware that:

  • The probe may be slow on memory-constrained systems
  • Memory availability can change between the probe and actual use
  • Modern systems offer better APIs (e.g., sysconf(_SC_PHYS_PAGES) on POSIX)

For the current use case (benchmarking tool), this approach is acceptable.


1196-1196: Consider adding a comment to explain the memory calculation.

The expression BMK_findMaxMem(totalSizeToLoad * 3) / 3 divides available memory by 3 to account for source, compressed, and decompressed buffers. A brief comment would clarify this intent.

     /* Memory allocation & restrictions */
+    /* Divide available memory by 3 to account for source, compressed, and result buffers */
     benchedSize = BMK_findMaxMem(totalSizeToLoad * 3) / 3;
contrib/recovery/recover_directory.c (3)

60-60: Consider checking the return value of fclose.

While less critical for read operations, checking fclose() return values is good practice as it can still fail in certain scenarios.

-  fclose(file);
+  int closeResult = fclose(file);
+  CHECK(closeResult == 0, "fclose failed");

62-62: Remove redundant cast.

The variable data is already of type char *, so the cast to (char *) is unnecessary.

-  frames.data = (char *)data;
+  frames.data = data;

141-141: Consider checking the return value of fclose.

Unlike the earlier read-only fclose, this one is closing a file opened for writing. The fclose call flushes any buffered data, and this operation can fail (e.g., disk full, I/O errors). Checking the return value ensures data integrity.

-    fclose(outFile);
+    int closeResult = fclose(outFile);
+    CHECK(closeResult == 0, "fclose failed for %s", outFileName);
zlibWrapper/gzread.c (1)

291-365: Clarify the intent of n = -1 in gz_read

n is initialized with n = -1; and then compared/assigned as an unsigned. This is relying on wrap‑around to UINT_MAX, which is correct but slightly opaque.

Consider making it explicit for future readers:

-        n = -1;
+        n = (unsigned)-1;

This keeps the behavior identical while making the intent clearer.

zlibWrapper/zstd_zlibwrapper.c (2)

85-99: Global compression/decompression toggles are not thread-safe

The wrapper uses process‑wide globals:

static int g_ZWRAP_useZSTDcompression = ZWRAP_USE_ZSTD; /* 0 = don't use ZSTD */
void ZWRAP_useZSTDcompression(int turn_on) { g_ZWRAP_useZSTDcompression = turn_on; }

static ZWRAP_decompress_type g_ZWRAPdecompressionType = ZWRAP_AUTO;
void ZWRAP_setDecompressionType(ZWRAP_decompress_type type) { g_ZWRAPdecompressionType = type; }

If these setters are called while other threads are actively compressing/decompressing, you can get data races and non‑deterministic behavior.

If multi‑threaded use with dynamic reconfiguration is a target, consider:

  • Making these flags atomic, or
  • Moving configuration to per‑stream state (preferred), or
  • Documenting them as “set once at process startup, before creating any streams”.

Given this is upstream‑style code, adjusting or at least documenting the threading expectations might be the pragmatic first step.
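Of the options above, making the flags atomic is the smallest change. A sketch using C11 atomics, with names mirroring the wrapper's globals (illustrative, not the actual wrapper code; relaxed ordering is enough because each flag is an independent toggle):

```c
#include <stdatomic.h>

/* Process-wide toggle, now race-free: concurrent set/get from multiple
 * threads is defined behavior, though per-stream configuration remains
 * the better long-term design. */
static _Atomic int g_useZSTDcompression = 0;

void set_useZSTDcompression(int turn_on)
{
    atomic_store_explicit(&g_useZSTDcompression, turn_on, memory_order_relaxed);
}

int get_useZSTDcompression(void)
{
    return atomic_load_explicit(&g_useZSTDcompression, memory_order_relaxed);
}
```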


735-825: Disabled debug logging currently references res instead of result

In the zlib‑stream branch of z_inflate, the debug log uses an undeclared res:

int const result = (flush == Z_INFLATE_SYNC) ? inflateSync(strm) : inflate(strm, flush);
LOG_WRAPPERD("- inflate3 ... res=%d\n", ..., res);

This compiles today only because LOG_WRAPPERD is defined as a no‑op comment macro:

#define LOG_WRAPPERD(...)  /* fprintf(stderr, __VA_ARGS__) */

If you ever re‑enable logging (e.g., change the macro to call fprintf), this will stop compiling. Consider aligning the name now:

-    LOG_WRAPPERD("- inflate3 ... res=%d\n", ..., res);
+    LOG_WRAPPERD("- inflate3 ... res=%d\n", ..., result);

Same pattern appears later in the other inflate logging block.

contrib/seekable_format/zstdseek_decompress.c (2)

137-154: fflush on FILE* after seek may be unnecessary and non-portable

ZSTD_seekable_seek_FILE() uses:

int const ret = LONG_SEEK((FILE*)opaque, offset, origin);
if (ret) return ret;
return fflush((FILE*)opaque);

Calling fflush() on a read-only stream is implementation-defined/unspecified and is generally discouraged; for pure read paths it isn’t needed after fseek/fseeko/_fseeki64. This might behave oddly on some libcs.

Given this routine is used only as a read-side callback, consider simply:

static int ZSTD_seekable_seek_FILE(void* opaque, long long offset, int origin)
{
-    int const ret = LONG_SEEK((FILE*)opaque, offset, origin);
-    if (ret) return ret;
-    return fflush((FILE*)opaque);
+    return LONG_SEEK((FILE*)opaque, offset, origin);
}

Please confirm whether you rely on mixed read/write usage of FILE* here; if so, we can discuss a more targeted solution.


171-195: Defensive handling of unexpected origin in ZSTD_seekable_seek_buff

The default case in the switch(origin) only has an assert(0):

default:
    assert(0);  /* not possible */
}
if (newOffset > buff->size) {
    return -1;
}

If origin is ever outside {SEEK_SET, SEEK_CUR, SEEK_END} in a release build (asserts disabled), newOffset will be uninitialized, leading to undefined behavior.

It’s cheap to add a defensive fallback:

    default:
-        assert(0);  /* not possible */
+        assert(0);  /* not possible */
+        return -1;
    }
contrib/seekable_format/zstdseek_compress.c (3)

233-236: Check return value of ZSTD_CCtx_setParameter

ZSTD_CCtx_setParameter returns a size_t error code, but its result is currently ignored:

assert(zcs->maxFrameSize < INT_MAX);
ZSTD_CCtx_setParameter(zcs->cstream, ZSTD_c_srcSizeHint, (int)zcs->maxFrameSize);

Although ZSTD_c_srcSizeHint is expected to be valid, ignoring errors can hide misconfiguration or version mismatches. Suggested change:

assert(zcs->maxFrameSize < INT_MAX);
{
    size_t const err = ZSTD_CCtx_setParameter(
            zcs->cstream, ZSTD_c_srcSizeHint, (int)zcs->maxFrameSize);
    if (ZSTD_isError(err)) return err;
}

This keeps error propagation consistent with the rest of the code.

Please double-check against your zstd version’s API docs to ensure this handling matches the expected semantics.


66-75: Frame log allocation/free patterns look good; minor safety suggestion

The frame log vector management (ZSTD_seekable_frameLog_allocVec, ZSTD_seekable_frameLog_freeVec, and ZSTD_seekable_logFrame) follows a standard pattern (initial capacity, exponential growth, bounds check on ZSTD_SEEKABLE_MAXFRAMES).

One small optional improvement is to null out fl->entries after free to make accidental double-free safer:

static size_t ZSTD_seekable_frameLog_freeVec(ZSTD_frameLog* fl)
{
-    if (fl != NULL) free(fl->entries);
+    if (fl != NULL) {
+        free(fl->entries);
+        fl->entries = NULL;
+        fl->capacity = 0;
+        fl->size = 0;
+    }
    return 0;
}

Not required for correctness with current usage, but it hardens the API against misuse.

Also applies to: 167-195


353-365: ZSTD_seekable_endStream flow is reasonable; be aware of size-hint semantics

The ZSTD_seekable_endStream() control flow:

  • Finishes the last frame via ZSTD_seekable_endFrame() (which may return a positive size hint until fully flushed).
  • Then emits the seek table in a streaming-friendly way using ZSTD_seekable_writeSeekTable() and zcs->writingSeekTable state.

The pattern of returning the pending size (endFrame + ZSTD_seekable_seekTableSize(&zcs->framelog)) as a hint before the final frame is logged is subtle but consistent. Callers must treat any non-error, non-zero return as “need more output space; call again with a larger buffer” rather than as bytes written.

No changes required, but it’s worth documenting this behavior in the header so downstream users don’t misinterpret these return values.
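The "non-error, non-zero return means call again" contract described above can be exercised with a stub; ZSTD_seekable_endStream() follows the same shape, where such a return value is pending output, not bytes written. A sketch with a hypothetical flush function:

```c
#include <stddef.h>

/* Stub standing in for an endStream-style call: it can flush at most
 * `capacity` bytes per invocation and returns how much is still pending
 * (0 = fully flushed). */
typedef struct { size_t pending; } EndState;

static size_t end_stream_stub(EndState* s, size_t capacity)
{
    size_t const flushed = (s->pending < capacity) ? s->pending : capacity;
    s->pending -= flushed;
    return s->pending;
}

/* Correct caller: keep calling while the return value is non-zero,
 * refreshing the output buffer each time. Returns the number of calls. */
static int drain(EndState* s, size_t capacity)
{
    size_t remaining;
    int calls = 0;
    do {
        remaining = end_stream_stub(s, capacity);
        ++calls;
    } while (remaining != 0);
    return calls;
}
```

Treating the first non-zero return as "bytes written" and stopping early is exactly the misuse the header documentation should warn against.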

contrib/diagnose_corruption/check_flipped_bits.c (2)

31-36: Prefer ZSTD_dictID_t / unsigned for dict IDs to avoid type/format mismatches

dict_id is currently int32_t, but it is:

  • Filled from ZSTD_getDictID_fromFrame() / ZSTD_getDictID_fromDDict(), which return an unsigned dict ID.
  • Printed with %u in snprintf() / fprintf().

Mixing int32_t with %u is technically undefined behavior and can misrepresent large IDs. It also diverges from the ZSTD API’s unsigned dict ID type.

Consider switching to the library’s dict ID typedef and updating signatures accordingly, e.g.:

-  int32_t dict_id;
+  ZSTD_dictID_t dict_id;

and similarly:

-static ZSTD_DDict* readDict(const char* filename, char **buf, size_t* size, int32_t* dict_id) {
+static ZSTD_DDict* readDict(const char* filename, char **buf, size_t* size, ZSTD_dictID_t* dict_id) {

-static ZSTD_DDict* readDictByID(stuff_t *stuff, int32_t dict_id, char **buf, size_t* size) {
+static ZSTD_DDict* readDictByID(stuff_t *stuff, ZSTD_dictID_t dict_id, char **buf, size_t* size) {

This keeps everything consistent with ZSTD’s public API and avoids UB around format specifiers. Based on library context.

Also applies to: 139-140, 144-169, 246-259, 281-289


44-50: Free perturbed and tighten cleanup on early exits

  • stuff->perturbed is allocated in init_stuff() but never freed in free_stuff(), so it’s leaked on the normal success path.
  • When init_stuff() fails or when the original blob already decompresses (the early return after test_decompress()), free_stuff() is not called at all. The OS will reclaim memory on process exit, but explicit cleanup keeps the ownership model clear and future‑proof.

Minimal improvements:

 static void free_stuff(stuff_t* stuff) {
   free(stuff->input);
+  free(stuff->perturbed);
   free(stuff->output);
   ZSTD_freeDDict(stuff->ddict);
   free(stuff->dict);
   ZSTD_freeDCtx(stuff->dctx);
 }

Optionally, zero‑initialize stuff at declaration and always call free_stuff() on all exit paths:

-int main(int argc, char* argv[]) {
-  stuff_t stuff;
+int main(int argc, char* argv[]) {
+  stuff_t stuff = {0};
@@
   if(!init_stuff(&stuff, argc, argv)) {
     fprintf(stderr, "Failed to init.\n");
-    return 1;
+    free_stuff(&stuff);
+    return 1;
   }
@@
   if (test_decompress(&stuff)) {
     fprintf(stderr, "Blob already decompresses successfully!\n");
-    return 1;
+    free_stuff(&stuff);
+    return 1;
   }
@@
-  free_stuff(&stuff);
+  free_stuff(&stuff);
   return 0;

This makes resource ownership explicit and safe even if the code evolves.

Also applies to: 191-218, 378-399

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6cacd63 and 14d58f4.

📒 Files selected for processing (27)
  • contrib/diagnose_corruption/check_flipped_bits.c (1 hunks)
  • contrib/externalSequenceProducer/main.c (1 hunks)
  • contrib/largeNbDicts/largeNbDicts.c (1 hunks)
  • contrib/pzstd/Pzstd.h (1 hunks)
  • contrib/pzstd/utils/Buffer.h (1 hunks)
  • contrib/recovery/recover_directory.c (1 hunks)
  • contrib/seekable_format/zstdseek_compress.c (1 hunks)
  • contrib/seekable_format/zstdseek_decompress.c (1 hunks)
  • contrib/seqBench/seqBench.c (1 hunks)
  • doc/educational_decoder/harness.c (1 hunks)
  • doc/educational_decoder/zstd_decompress.c (1 hunks)
  • lib/dictBuilder/cover.c (1 hunks)
  • lib/dictBuilder/divsufsort.c (1 hunks)
  • lib/dictBuilder/fastcover.c (1 hunks)
  • lib/dictBuilder/zdict.c (1 hunks)
  • lib/legacy/zstd_v01.c (1 hunks)
  • programs/benchfn.c (1 hunks)
  • programs/benchzstd.c (1 hunks)
  • programs/datagen.c (1 hunks)
  • programs/dibio.c (1 hunks)
  • programs/fileio_asyncio.c (1 hunks)
  • programs/util.c (1 hunks)
  • tests/bigdict.c (1 hunks)
  • tests/largeDictionary.c (1 hunks)
  • zlibWrapper/gzlib.c (1 hunks)
  • zlibWrapper/gzread.c (1 hunks)
  • zlibWrapper/zstd_zlibwrapper.c (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (21)
tests/bigdict.c (2)
tests/largeDictionary.c (2)
  • compress (21-61)
  • main (63-128)
programs/datagen.c (1)
  • RDG_genBuffer (144-152)
programs/dibio.c (2)
programs/util.c (1)
  • UTIL_getFileSize (545-558)
lib/dictBuilder/zdict.c (3)
  • ZDICT_trainFromBuffer_legacy (1088-1108)
  • ZDICT_isError (98-98)
  • ZDICT_getErrorName (100-100)
tests/largeDictionary.c (2)
tests/bigdict.c (2)
  • compress (21-61)
  • main (63-133)
programs/datagen.c (1)
  • RDG_genBuffer (144-152)
contrib/pzstd/Pzstd.h (1)
contrib/pzstd/utils/Buffer.h (1)
  • pzstd (17-99)
doc/educational_decoder/zstd_decompress.c (1)
lib/legacy/zstd_v03.c (1)
  • ZSTD_decompress (2905-2910)
contrib/recovery/recover_directory.c (3)
programs/util.c (1)
  • UTIL_getFileSize (545-558)
lib/legacy/zstd_v03.c (3)
  • ZSTD_isError (2373-2373)
  • ZSTD_createDCtx (2981-2987)
  • ZSTD_freeDCtx (2989-2993)
lib/legacy/zstd_v04.c (3)
  • ZSTD_isError (2429-2429)
  • ZSTD_createDCtx (2469-2475)
  • ZSTD_freeDCtx (2477-2481)
zlibWrapper/gzlib.c (1)
zlibWrapper/gzread.c (9)
  • z_size_t (292-366)
  • ZEXPORT (369-398)
  • ZEXPORT (401-425)
  • ZEXPORT (445-470)
  • ZEXPORT (472-474)
  • ZEXPORT (477-531)
  • ZEXPORT (534-591)
  • ZEXPORT (594-609)
  • ZEXPORT (612-637)
zlibWrapper/gzread.c (1)
zlibWrapper/gzlib.c (16)
  • ZEXPORT (265-267)
  • ZEXPORT (270-272)
  • ZEXPORT (275-289)
  • ZEXPORT (293-295)
  • ZEXPORT (299-320)
  • ZEXPORT (323-341)
  • ZEXPORT (344-414)
  • ZEXPORT (417-422)
  • ZEXPORT (425-437)
  • ZEXPORT (440-445)
  • ZEXPORT (448-466)
  • ZEXPORT (469-474)
  • ZEXPORT (477-489)
  • ZEXPORT (492-507)
  • ZEXPORT (510-526)
  • gzFile (90-262)
lib/dictBuilder/fastcover.c (2)
lib/dictBuilder/cover.c (13)
  • COVER_sum (278-285)
  • COVER_computeEpochs (732-747)
  • COVER_dictSelectionError (1035-1037)
  • COVER_selectDict (1047-1132)
  • COVER_dictSelectionIsError (1039-1041)
  • COVER_best_finish (980-1024)
  • COVER_dictSelectionFree (1043-1045)
  • size_t (807-862)
  • size_t (1195-1331)
  • COVER_warnOnSmallCorpus (716-730)
  • COVER_best_init (922-931)
  • COVER_best_destroy (950-960)
  • COVER_best_wait (936-945)
lib/dictBuilder/zdict.c (1)
  • ZDICT_finalizeDictionary (862-941)
lib/dictBuilder/zdict.c (1)
lib/legacy/zstd_v03.c (4)
  • ERR_isError (525-525)
  • U32 (121-124)
  • U32 (161-164)
  • U32 (166-175)
contrib/largeNbDicts/largeNbDicts.c (3)
programs/util.c (6)
  • UTIL_getFileSize (545-558)
  • UTIL_isDirectory (357-365)
  • UTIL_getTotalFileSize (630-645)
  • UTIL_createExpandedFNT (1379-1423)
  • UTIL_assembleFileNamesTable (811-815)
  • UTIL_freeFileNamesTable (817-823)
lib/legacy/zstd_v03.c (3)
  • ZSTD_isError (2373-2373)
  • ZSTD_createDCtx (2981-2987)
  • ZSTD_freeDCtx (2989-2993)
programs/benchfn.c (6)
  • BMK_createTimedFnState (160-166)
  • BMK_benchTimedFn (211-256)
  • BMK_isSuccessful_runOutcome (64-67)
  • BMK_extract_runTime (71-75)
  • BMK_isCompleted_TimedFn (200-203)
  • BMK_freeTimedFnState (168-168)
contrib/diagnose_corruption/check_flipped_bits.c (2)
lib/legacy/zstd_v03.c (3)
  • ZSTD_freeDCtx (2989-2993)
  • ZSTD_createDCtx (2981-2987)
  • ZSTD_isError (2373-2373)
contrib/recovery/recover_directory.c (3)
  • usage (29-38)
  • readFile (47-82)
  • main (88-152)
lib/dictBuilder/cover.c (2)
lib/legacy/zstd_v03.c (12)
  • U64 (126-129)
  • U64 (177-187)
  • U32 (121-124)
  • U32 (161-164)
  • U32 (166-175)
  • size_t (190-196)
  • size_t (325-368)
  • size_t (369-373)
  • size_t (377-381)
  • size_t (388-393)
  • size_t (397-402)
  • size_t (1276-1343)
lib/dictBuilder/zdict.c (2)
  • ZDICT_finalizeDictionary (862-941)
  • ZDICT_isError (98-98)
contrib/seekable_format/zstdseek_decompress.c (2)
lib/legacy/zstd_v04.c (6)
  • ZSTD_isError (2429-2429)
  • U64 (102-105)
  • U64 (154-164)
  • U32 (97-100)
  • U32 (137-140)
  • U32 (142-151)
lib/legacy/zstd_v07.c (6)
  • U64 (283-286)
  • U64 (307-323)
  • U64 (358-364)
  • U32 (278-281)
  • U32 (293-305)
  • U32 (349-355)
contrib/seqBench/seqBench.c (2)
contrib/externalSequenceProducer/main.c (1)
  • main (29-107)
lib/legacy/zstd_v03.c (2)
  • ZSTD_isError (2373-2373)
  • ZSTD_decompress (2905-2910)
contrib/seekable_format/zstdseek_compress.c (2)
lib/legacy/zstd_v03.c (4)
  • ZSTD_isError (2373-2373)
  • U32 (121-124)
  • U32 (161-164)
  • U32 (166-175)
lib/legacy/zstd_v07.c (3)
  • U32 (278-281)
  • U32 (293-305)
  • U32 (349-355)
contrib/externalSequenceProducer/main.c (1)
doc/educational_decoder/zstd_decompress.c (1)
  • ZSTD_decompress (384-391)
doc/educational_decoder/harness.c (1)
doc/educational_decoder/zstd_decompress.c (5)
  • ZSTD_get_decompressed_size (1378-1401)
  • create_dictionary (1405-1411)
  • parse_dictionary (1435-1487)
  • ZSTD_decompress_with_dict (393-409)
  • free_dictionary (1414-1425)
programs/fileio_asyncio.c (1)
contrib/seekable_format/zstdseek_decompress.c (1)
  • LONG_SEEK (91-106)
programs/benchzstd.c (5)
lib/legacy/zstd_v04.c (8)
  • ZSTD_isError (2429-2429)
  • U32 (97-100)
  • U32 (137-140)
  • U32 (142-151)
  • U64 (102-105)
  • U64 (154-164)
  • ZSTD_createDCtx (2469-2475)
  • ZSTD_freeDCtx (2477-2481)
lib/legacy/zstd_v07.c (6)
  • U32 (278-281)
  • U32 (293-305)
  • U32 (349-355)
  • U64 (283-286)
  • U64 (307-323)
  • U64 (358-364)
programs/datagen.c (1)
  • RDG_genBuffer (144-152)
programs/benchfn.c (5)
  • BMK_benchTimedFn (211-256)
  • BMK_isSuccessful_runOutcome (64-67)
  • BMK_extract_runTime (71-75)
  • BMK_createTimedFnState (160-166)
  • BMK_freeTimedFnState (168-168)
programs/util.c (3)
  • UTIL_getFileSize (545-558)
  • UTIL_isDirectory (357-365)
  • UTIL_getTotalFileSize (630-645)
zlibWrapper/zstd_zlibwrapper.c (2)
lib/legacy/zstd_v03.c (2)
  • char (527-532)
  • ZSTD_isError (2373-2373)
lib/legacy/zstd_v04.c (1)
  • ZSTD_isError (2429-2429)
🪛 ast-grep (0.39.9)
zlibWrapper/gzlib.c

[warning] 239-239: This call makes a world-writable file which allows any user on a machine to write to the file. This may allow attackers to influence the behaviour of this process by writing to the file.
Context: 0666
Note: [CWE-732]: Incorrect Permission Assignment for Critical Resource [REFERENCES]
- https://wiki.sei.cmu.edu/confluence/display/c/FIO06-C.+Create+files+with+appropriate+access+permissions

(world-writable-file-c)
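The CWE-732 pattern the finding asks for can be sketched as follows. This is not the gzlib.c code (zlib traditionally passes 0666 and relies on the process umask); `open_for_writing` is a hypothetical helper showing an explicitly restrictive mode on POSIX systems:

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sketch only: create the file owner read/write; group/other get nothing
 * regardless of umask, instead of relying on umask to mask 0666. */
int open_for_writing(const char* path) {
    return open(path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
}
```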

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Cursor Bugbot
🔇 Additional comments (13)
contrib/pzstd/utils/Buffer.h (1)

58-62: The review comment is incorrect—bounds checking already exists in the underlying Range methods.

The subpiece() and advance() methods in Range.h (lines 121-127 and 107-112) both validate boundaries and throw std::out_of_range exceptions if called with out-of-bounds parameters. In splitAt(), the call to advance(n) will throw an exception if n > size(), preventing undefined behavior. No additional validation is needed at the entry point of splitAt().

Likely an incorrect or invalid review comment.

programs/benchfn.c (3)

108-146: LGTM - Well-structured benchmarking logic.

The function properly handles edge cases (minimum loop count), warms buffers before timing, validates results only on the first iteration, and correctly averages timing across loops.


211-256: LGTM - Robust adaptive benchmarking with overflow protection.

The function correctly adjusts loop counts based on runtime measurements, includes overflow protection (line 238), and properly discards measurements with insufficient runtime to avoid timing noise.


27-27: No issues found — SEC_TO_MICRO is properly defined in timefn.h at line 57.

The constant is correctly defined as ((PTime)1000000) and will be accessible via the included header file. The code at line 27 of benchfn.c is valid and will compile without errors.

Likely an incorrect or invalid review comment.

programs/benchzstd.c (2)

1054-1112: LGTM - Robust file loading with proper error handling.

The function correctly handles edge cases (directories, unknown file sizes, and oversized inputs, which are truncated rather than overflowing the buffer) and includes proper resource cleanup. Error paths use appropriate error codes.


891-918: LGTM - Clean wrapper functions for API convenience.

The wrapper functions BMK_benchMem and BMK_benchFiles provide simplified interfaces by using default parameters, following good API design practices.

Also applies to: 1253-1270

contrib/recovery/recover_directory.c (2)

147-151: Resource cleanup looks good.

All allocated resources are properly freed before program exit.


16-16: Include path is correctly configured—no issues found.

The #include "util.h" directive resolves correctly. The Makefile at contrib/recovery/Makefile defines PROGRAMDIR ?= ../../programs (line 12) and includes -I$(PROGRAMDIR) in CFLAGS (line 17), which correctly provides the compiler with the include path to the programs/ directory where util.h resides. The build system is properly configured.

doc/educational_decoder/zstd_decompress.c (1)

82-93: Let the fallback sizing clamp instead of aborting.

When a frame doesn’t embed its decompressed size, we guess MAX_COMPRESSION_RATIO * input.size. For any payload ≥64 MiB, that estimate exceeds the 1 GiB cap and we exit immediately—even though plenty of such frames legitimately decompress to <1 GiB. That makes the harness unusable for large-yet-valid streams (common with zstd streaming output). Please clamp the fallback capacity to MAX_OUTPUT_SIZE (after guarding against overflow) and only hard-fail when an explicit frame size is above the cap.

-    if (out_capacity == (size_t)-1) {
-        out_capacity = MAX_COMPRESSION_RATIO * input.size;
-        fprintf(stderr, "WARNING: Compressed data does not contain "
-                        "decompressed size, going to assume the compression "
-                        "ratio is at most %d (decompressed size of at most "
-                        "%u) \n",
-                MAX_COMPRESSION_RATIO, (unsigned)out_capacity);
-    }
-    if (out_capacity > MAX_OUTPUT_SIZE)
-        ERR_OUT("Required output size too large for this implementation \n");
+    if (out_capacity == (size_t)-1) {
+        if (input.size > SIZE_MAX / MAX_COMPRESSION_RATIO) {
+            ERR_OUT("compressed input too large to size output buffer safely \n");
+        }
+        size_t fallback = MAX_COMPRESSION_RATIO * input.size;
+        if (fallback > MAX_OUTPUT_SIZE) {
+            fprintf(stderr,
+                    "WARNING: Compressed data lacks a decompressed size; "
+                    "assuming ratio %d but capping output buffer at %zu bytes \n",
+                    MAX_COMPRESSION_RATIO, MAX_OUTPUT_SIZE);
+            out_capacity = MAX_OUTPUT_SIZE;
+        } else {
+            out_capacity = fallback;
+        }
+    } else if (out_capacity > MAX_OUTPUT_SIZE) {
+        ERR_OUT("Required output size too large for this implementation \n");
+    }

Likely an incorrect or invalid review comment.

zlibWrapper/gzread.c (1)

88-173: gz_look(): gzip/zstd header handling and buffer setup look correct

Header probing, buffer allocation, and inflateInit2(..., 15 + 16) setup are consistent with zlib’s gzread semantics, and the added zstd magic check (0x28, 0xB5) integrates cleanly with the existing GZIP path. No changes needed here.
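The two-byte probe described above can be sketched in isolation. These helpers are hypothetical (gz_look() does this inline); the zstd frame magic is 0xFD2FB528 little-endian, so the first two bytes on disk are 0x28 0xB5, and the gzip magic is 0x1F 0x8B:

```c
#include <stddef.h>

/* Sketch of gz_look()-style header probing on the first bytes of a file. */
static int looks_like_zstd(const unsigned char* hdr, size_t len) {
    return len >= 2 && hdr[0] == 0x28 && hdr[1] == 0xB5;
}

static int looks_like_gzip(const unsigned char* hdr, size_t len) {
    return len >= 2 && hdr[0] == 0x1F && hdr[1] == 0x8B;
}
```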

zlibWrapper/gzlib.c (1)

75-87: gz_reset() state initialization looks sound

Resetting x.have, eof/past flags, how, seek, error state, position, and strm.avail_in is consistent and matches zlib’s expectations. Good foundation for both read and write modes.

zlibWrapper/zstd_zlibwrapper.c (1)

172-199: CCtx creation/free logic is consistent aside from allocator issue

ZWRAP_createCCtx / ZWRAP_freeCCtx correctly wire the custom allocators into zstd’s ZSTD_customMem, and free both the zstd stream and the wrapper context:

ZWRAP_CCtx* zwc = (ZWRAP_CCtx*)ZWRAP_customCalloc(sizeof(ZWRAP_CCtx), customMem);
...
ZSTD_freeCStream(zwc->zbc);
ZWRAP_customFree(zwc, zwc->customMem);

Once ZWRAP_customCalloc’s null handling is fixed (see previous comment), this lifecycle looks solid.

contrib/diagnose_corruption/check_flipped_bits.c (1)

331-376: Perturbation loops look correct and side‑effect free on stuff->input

Both perturb_bits() and perturb_bytes():

  • Snapshot old_val from stuff->input[pos].
  • Mutate stuff->perturbed[pos] only.
  • Restore stuff->perturbed[pos] = old_val; after the inner loop.

So the original input buffer is never modified, and each perturbation starts from the same baseline. Progress logging every 1000 bytes also keeps output manageable for large inputs. No functional issues spotted in these loops.
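The snapshot/mutate/restore discipline described above can be sketched as a standalone loop. This is not the actual perturb_bytes() code; the trial counter stands in for the decompression attempt each perturbation would trigger:

```c
#include <stddef.h>

/* Sketch: perturb each byte of a scratch copy through all 255 non-identity
 * values, then restore the snapshot so every trial starts from the same
 * baseline. The original input buffer is never written. */
static size_t perturb_bytes_sketch(const unsigned char* input,
                                   unsigned char* perturbed, size_t size) {
    size_t trials = 0;
    for (size_t pos = 0; pos < size; pos++) {
        unsigned char const old_val = input[pos];               /* snapshot */
        for (int delta = 1; delta < 256; delta++) {
            perturbed[pos] = (unsigned char)(old_val + delta);  /* mutate scratch */
            trials++;  /* real code would attempt decompression here */
        }
        perturbed[pos] = old_val;                               /* restore baseline */
    }
    return trials;
}
```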

Comment on lines +75 to +123
static char* readFile(const char* filename, size_t* size) {
struct stat statbuf;
int ret;
FILE* f;
char *buf;
size_t bytes_read;

ret = stat(filename, &statbuf);
if (ret != 0) {
fprintf(stderr, "stat failed: %m\n");
return NULL;
}
if ((statbuf.st_mode & S_IFREG) != S_IFREG) {
fprintf(stderr, "Input must be regular file\n");
return NULL;
}

*size = statbuf.st_size;

f = fopen(filename, "r");
if (f == NULL) {
fprintf(stderr, "fopen failed: %m\n");
return NULL;
}

buf = malloc(*size);
if (buf == NULL) {
fprintf(stderr, "malloc failed\n");
fclose(f);
return NULL;
}

bytes_read = fread(buf, 1, *size, f);
if (bytes_read != *size) {
fprintf(stderr, "failed to read whole file\n");
fclose(f);
free(buf);
return NULL;
}

ret = fclose(f);
if (ret != 0) {
fprintf(stderr, "fclose failed: %m\n");
free(buf);
return NULL;
}

return buf;
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

Open input/dictionary files in binary mode ("rb")

readFile() uses:

f = fopen(filename, "r");

These files are zstd frames / dictionaries and should be treated as binary. Using "r" can cause newline translation on platforms that distinguish text vs binary mode, corrupting the data and breaking decompression or dict ID detection. The recovery tool snippet you provided (contrib/recovery/recover_directory.c, lines 46–81) already uses "rb" for this reason.

Recommend:

-  f = fopen(filename, "r");
+  f = fopen(filename, "rb");

This keeps behavior consistent with other tools and avoids subtle cross‑platform issues. Based on relevant code snippets.


I'll verify the review comment by examining the codebase. Let me gather the necessary information:


🏁 Script executed:

#!/bin/bash
# Check the readFile function in check_flipped_bits.c
echo "=== readFile function in check_flipped_bits.c ==="
cat contrib/diagnose_corruption/check_flipped_bits.c | head -150 | tail -80

Length of output: 1956


🏁 Script executed:

#!/bin/bash
# Check the recovery tool mentioned for comparison
echo "=== Recovery tool file opening pattern ==="
grep -n "fopen" contrib/recovery/recover_directory.c -B2 -A2

Length of output: 487


🏁 Script executed:

#!/bin/bash
# Check how readFile is used and what files it processes
echo "=== Usage of readFile function ==="
grep -n "readFile" contrib/diagnose_corruption/check_flipped_bits.c -B2 -A2

Length of output: 637


🏁 Script executed:

#!/bin/bash
# Search for zstd-related file handling patterns in the codebase
echo "=== Zstd file handling in codebase ==="
rg -n "fopen.*rb" --type=c -B2 -A2 | head -40

Length of output: 1998


Fix file mode from text to binary for zstd data

The review comment is verified as accurate. The readFile() function at line 92 uses fopen(filename, "r") to read binary zstd data, which will be corrupted on Windows due to text-mode newline translation. The entire codebase consistently uses binary mode ("rb") for zstd file operations—confirmed across examples, tests, and programs directories. The recovery tool (contrib/recovery/recover_directory.c, line 54) already follows this pattern. Change line 92 from "r" to "rb".

🤖 Prompt for AI Agents
In contrib/diagnose_corruption/check_flipped_bits.c around lines 75 to 123, the
file is opened with fopen(filename, "r") which will perform text-mode
translations on Windows and corrupt binary zstd data; change the fopen call to
use binary mode ("rb") instead, keeping error handling and the rest of the
function intact.
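A minimal sketch of the fixed loader, assuming nothing beyond the standard library (this is not the actual check_flipped_bits.c code): opening with "rb" means Windows text-mode newline translation can never corrupt binary zstd frames or dictionaries.

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch: read a whole file in binary mode, returning a malloc'd buffer
 * and its size, or NULL on any failure. */
static char* read_file_binary(const char* filename, size_t* size) {
    FILE* const f = fopen(filename, "rb");  /* "rb", never "r", for binary data */
    if (f == NULL) return NULL;
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long const end = ftell(f);
    if (end < 0 || fseek(f, 0, SEEK_SET) != 0) { fclose(f); return NULL; }
    char* const buf = malloc(end > 0 ? (size_t)end : 1);
    if (buf == NULL) { fclose(f); return NULL; }
    if (fread(buf, 1, (size_t)end, f) != (size_t)end) {
        fclose(f);
        free(buf);
        return NULL;
    }
    fclose(f);
    *size = (size_t)end;
    return buf;
}
```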

Comment on lines +125 to +142
static ZSTD_DDict* readDict(const char* filename, char **buf, size_t* size, int32_t* dict_id) {
ZSTD_DDict* ddict;
*buf = readFile(filename, size);
if (*buf == NULL) {
fprintf(stderr, "Opening dictionary file '%s' failed\n", filename);
return NULL;
}

ddict = ZSTD_createDDict_advanced(*buf, *size, ZSTD_dlm_byRef, ZSTD_dct_auto, ZSTD_defaultCMem);
if (ddict == NULL) {
fprintf(stderr, "Failed to create ddict.\n");
return NULL;
}
if (dict_id != NULL) {
*dict_id = ZSTD_getDictID_fromDDict(ddict);
}
return ddict;
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Avoid leaking dictionary buffers when ZSTD_createDDict_advanced() fails

In readDict():

*buf = readFile(filename, size);
...
ddict = ZSTD_createDDict_advanced(*buf, *size, ...);
if (ddict == NULL) {
  fprintf(stderr, "Failed to create ddict.\n");
  return NULL;
}

If ZSTD_createDDict_advanced() fails, the buffer allocated by readFile() and stored in *buf is never freed. This affects:

  • Cached dict load in init_stuff() via readDictByID(stuff, dict_id, &stuff->dict, &stuff->dict_size);
  • Per‑dict loads in test_decompress() via readDictByID(stuff, custom_dict_id, &custom_dict, &custom_dict_size);

test_decompress() only frees custom_dict when custom_ddict != NULL, so failed ddict creations leak per call and can accumulate if the error persists.

A targeted fix inside readDict() keeps callers simple:

  ddict = ZSTD_createDDict_advanced(*buf, *size, ZSTD_dlm_byRef, ZSTD_dct_auto, ZSTD_defaultCMem);
  if (ddict == NULL) {
    fprintf(stderr, "Failed to create ddict.\n");
+    free(*buf);
+    *buf = NULL;
+    if (size) {
+      *size = 0;
+    }
    return NULL;
  }

With this change, failure paths in readDictByID() and test_decompress() no longer leak buffers even if ddict construction fails repeatedly.

Also applies to: 144-189, 276-319


I'll verify the memory leak issue by examining the code structure and checking the callers.


🏁 Script executed:

#!/bin/bash
# First, check if the file exists and examine its structure
fd check_flipped_bits.c --type f

Length of output: 48


🏁 Script executed:

#!/bin/bash
# Get the full content of the file to understand context
cat contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 10376


🏁 Script executed:

#!/bin/bash
# Search for readDictByID function definition
rg -n 'readDictByID' contrib/diagnose_corruption/check_flipped_bits.c -A 10

Length of output: 1368


🏁 Script executed:

#!/bin/bash
# Search for init_stuff function to understand its usage
rg -n 'init_stuff' contrib/diagnose_corruption/check_flipped_bits.c -B 2 -A 10

Length of output: 647


🏁 Script executed:

#!/bin/bash
# Search for test_decompress to see how it handles dictionaries
rg -n 'test_decompress' contrib/diagnose_corruption/check_flipped_bits.c -A 20

Length of output: 2425


🏁 Script executed:

#!/bin/bash
# Look at lines 144-189 to see if similar pattern exists
sed -n '144,189p' contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 1468


🏁 Script executed:

#!/bin/bash
# Look at lines 276-319 to see if similar pattern exists
sed -n '276,319p' contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 1419


🏁 Script executed:

#!/bin/bash
# Check how readFile is defined and if it allocates memory
rg -n 'readFile' contrib/diagnose_corruption/check_flipped_bits.c -B 2 -A 5

Length of output: 721


🏁 Script executed:

#!/bin/bash
# Look for free() calls in the file to see current deallocation patterns
rg -n 'free\(' contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 264


🏁 Script executed:

#!/bin/bash
# Check current error handling in test_decompress
rg -n 'custom_ddict' contrib/diagnose_corruption/check_flipped_bits.c -B 2 -A 2

Length of output: 973


🏁 Script executed:

#!/bin/bash
# Look for where custom_dict is freed or not freed
rg -n 'custom_dict' contrib/diagnose_corruption/check_flipped_bits.c -n

Length of output: 483


🏁 Script executed:

#!/bin/bash
# Understand the full context of how dict_id and related vars are used
rg -n 'dict_id' contrib/diagnose_corruption/check_flipped_bits.c -B 1 -A 1

Length of output: 2068


🏁 Script executed:

#!/bin/bash
# Check for any existing free(*buf) patterns in readDict or similar functions
ast-grep --pattern $'static $_ readDict($$$) {
  $$$
  free($$$);
  $$$
}'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Look for error handling patterns in similar dictionary functions
rg -n 'fprintf.*stderr' contrib/diagnose_corruption/check_flipped_bits.c -B 2 -A 3

Length of output: 5286


🏁 Script executed:

#!/bin/bash
# Check for memory cleanup in failure paths
rg -n 'return NULL' contrib/diagnose_corruption/check_flipped_bits.c -B 5 -A 1

Length of output: 1681


🏁 Script executed:

#!/bin/bash
# Verify readFile returns a dynamically allocated buffer
rg -n 'char\*.*readFile' contrib/diagnose_corruption/check_flipped_bits.c -A 20

Length of output: 515


🏁 Script executed:

#!/bin/bash
# Check file to understand if ZSTD_dlm_byRef means buffer ownership
rg -n 'ZSTD_dlm_byRef\|ZSTD_dlm' contrib/diagnose_corruption/check_flipped_bits.c -B 2 -A 2

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Get exact lines 125-142 to verify snippet
sed -n '125,142p' contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 555


🏁 Script executed:

#!/bin/bash
# Check all calls to readDictByID to see how failures are handled
rg -n 'readDictByID' contrib/diagnose_corruption/check_flipped_bits.c -B 2 -A 8

Length of output: 1461


🏁 Script executed:

#!/bin/bash
# Verify if custom_dict cleanup is conditional on custom_ddict != NULL
sed -n '1,400p' contrib/diagnose_corruption/check_flipped_bits.c | grep -n -A 30 'test_decompress'

Length of output: 2843


🏁 Script executed:

#!/bin/bash
# Search for where custom_dict variable is freed
sed -n '1,500p' contrib/diagnose_corruption/check_flipped_bits.c | grep -n 'custom_dict\|free'

Length of output: 941


🏁 Script executed:

#!/bin/bash
# Full view of function calls around the issue lines
sed -n '110,200p' contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 2460


🏁 Script executed:

#!/bin/bash
# Check what happens after readDictByID in init_stuff and test_decompress
sed -n '1,600p' contrib/diagnose_corruption/check_flipped_bits.c | grep -n -B 5 -A 10 'readDictByID'

Length of output: 1895


🏁 Script executed:

#!/bin/bash
# Find all function definitions to understand scope
rg -n '^[a-zA-Z_].*\(' contrib/diagnose_corruption/check_flipped_bits.c | head -20

Length of output: 628


🏁 Script executed:

#!/bin/bash
# Check size_t* parameter in readDict - is size conditional
sed -n '125,145p' contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 692


🏁 Script executed:

#!/bin/bash
# Understand if 'size' parameter can be NULL
rg -n 'readDict' contrib/diagnose_corruption/check_flipped_bits.c | grep -v Binary

Length of output: 551


🏁 Script executed:

#!/bin/bash
# Check the fix's suggestion to conditionally check size
# Look for similar NULL checks in the file
rg -n 'if.*size' contrib/diagnose_corruption/check_flipped_bits.c -B 1 -A 1

Length of output: 132


🏁 Script executed:

#!/bin/bash
# Verify parameter signature for readDict
sed -n '124,126p' contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 117


🏁 Script executed:

#!/bin/bash
# Full file view to understand everything at once
wc -l contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 52


🏁 Script executed:

#!/bin/bash
# Output entire file in segments (first half)
head -n 400 contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 10376


🏁 Script executed:

#!/bin/bash
# Second half of file
tail -n +400 contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 1


🏁 Script executed:

#!/bin/bash
# Find the exact function signatures and verify parameters
rg -n 'static.*readDict\|static.*readDictByID\|init_stuff\|test_decompress' contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check if size parameter could be NULL (non-pointer parameter)
ast-grep --pattern $'size_t\* size'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Get exact function signature of readDict to verify size is size_t*
sed -n '125,126p' contrib/diagnose_corruption/check_flipped_bits.c && echo "---" && sed -n '130,140p' contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 412


🏁 Script executed:

#!/bin/bash
# Check for calls to free() on size-related variables in the file
rg -n 'free.*size\|free.*buf' contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Ensure we see readFile implementation
rg -n 'char\* readFile' contrib/diagnose_corruption/check_flipped_bits.c -A 30

Length of output: 736


🏁 Script executed:

#!/bin/bash
# Check if ZSTD_dlm_byRef means we own the buffer (not ZSTD)
# Also check for ZSTD API documentation or comments
rg -n 'ZSTD_dlm\|byRef\|memory' contrib/diagnose_corruption/check_flipped_bits.c | head -20

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Verify lines 144-189 and 276-319 are indeed similar functions
echo "=== Lines 144-189 ===" && sed -n '144,189p' contrib/diagnose_corruption/check_flipped_bits.c && echo "" && echo "=== Lines 276-319 ===" && sed -n '276,319p' contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 2933


🏁 Script executed:

#!/bin/bash
# For precise understanding, check function names at those line ranges
sed -n '144p;189p;276p;319p' contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 145


🏁 Script executed:

#!/bin/bash
# Get function start/end for lines 144-189
sed -n '140,195p' contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 1618


🏁 Script executed:

#!/bin/bash
# Get function start/end for lines 276-319
sed -n '270,325p' contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 1622


🏁 Script executed:

#!/bin/bash
# Check how readDictByID uses readDict result and handles errors
sed -n '40,100p' contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 1641


🏁 Script executed:

#!/bin/bash
# Find and display readDictByID complete function
rg -n 'readDictByID' contrib/diagnose_corruption/check_flipped_bits.c -B 5 -A 50

Length of output: 4960


🏁 Script executed:

#!/bin/bash
# Check init_stuff completely
sed -n '190,270p' contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 2021


🏁 Script executed:

#!/bin/bash
# Check test_decompress completely
sed -n '320,400p' contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 1898


🏁 Script executed:

#!/bin/bash
# Final comprehensive check - show lines around the issue plus the callers
echo "=== readDict ===" && sed -n '125,145p' contrib/diagnose_corruption/check_flipped_bits.c && echo "" && echo "=== readDictByID ===" && sed -n '70,90p' contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 1172


🏁 Script executed:

#!/bin/bash
# Show where custom_dict is freed in test_decompress
sed -n '320,450p' contrib/diagnose_corruption/check_flipped_bits.c | grep -n 'custom_dict\|custom_ddict\|free\|readDictByID'

Length of output: 117


🏁 Script executed:

#!/bin/bash
# Check if proposed fix has an issue - size is size_t* not NULL checkable
sed -n '125,126p' contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 116


🏁 Script executed:

#!/bin/bash
# Verify the proposed fix's NULL check for size - this might be a bug in the fix
# If size is size_t* (pointer to size_t), it should never be NULL in this context
rg -n 'if.*size' contrib/diagnose_corruption/check_flipped_bits.c | grep -v Binary

Length of output: 32


🏁 Script executed:

#!/bin/bash
# Check actual function signature carefully
grep -n 'static ZSTD_DDict\* readDict' contrib/diagnose_corruption/check_flipped_bits.c -A 1

Length of output: 271


🏁 Script executed:

#!/bin/bash
# Sanity check: size_t* is always valid (it's passed by reference from callers)
# Verify how readDict is called with size parameter
rg -n 'readDict.*&' contrib/diagnose_corruption/check_flipped_bits.c

Length of output: 354


🏁 Script executed:

#!/bin/bash
# Final check on proposed fix - the 'if (size)' check seems wrong since size_t* is never NULL in this context
# Let's verify the signature one more time and see all calls
echo "=== Function signature ===" && sed -n '125,126p' contrib/diagnose_corruption/check_flipped_bits.c && echo "" && echo "=== Calls to readDict ===" && rg 'readDict\(' contrib/diagnose_corruption/check_flipped_bits.c -B 1 -A 1

Length of output: 549


Fix memory leak in readDict() when ZSTD_createDDict_advanced() fails (lines 125-142, also 144-189, 276-319 affected)

The review is correct. The buffer allocated by readFile() and stored in *buf is indeed leaked when ZSTD_createDDict_advanced() fails at line 134. This occurs in:

  1. Direct impact: readDict() at lines 125-142 fails to clean up after ZSTD_createDDict_advanced() returns NULL
  2. Cascading leaks via readDictByID() (lines 144-189): When readDict() fails inside readDictByID(), the buffer is leaked; callers like init_stuff() (line 249) and test_decompress() (line 288) don't clean it up
  3. Loop accumulation in test_decompress() (lines 276-319): Called repeatedly in perturb_bits() and perturb_bytes(), leaked buffers accumulate on each ddict creation failure

The proposed fix is sound: adding free(*buf); *buf = NULL; inside readDict() ensures cleanup happens regardless of caller context.

Minor note: The proposed if (size) check is unnecessary since size is a size_t* parameter always passed by reference (never NULL), but it's harmless defensive code.

🤖 Prompt for AI Agents
contrib/diagnose_corruption/check_flipped_bits.c around lines 125-142 (and
related callers 144-189, 276-319): readDict() currently leaks the buffer
returned by readFile() when ZSTD_createDDict_advanced() returns NULL; fix by
freeing the allocated buffer and nulling the pointer before returning on failure
(e.g., call free(*buf); *buf = NULL; immediately after detecting ddict == NULL),
and ensure callers do not assume ownership of buf on failure; apply the same
pattern to other failure paths/call sites to prevent repeated accumulation of
leaked buffers.
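The ownership rule the fix enforces can be sketched without linking zstd. Here `create_ddict_stub` is a hypothetical stand-in for ZSTD_createDDict_advanced() (a by-reference constructor that does not take ownership of the buffer), and `read_dict_stub` plays the role of readDict():

```c
#include <stdlib.h>
#include <string.h>

typedef struct { const void* dict; size_t size; } ddict_stub;

/* Stand-in for ZSTD_createDDict_advanced(): fails when size == 0. */
static ddict_stub* create_ddict_stub(const void* buf, size_t size) {
    if (size == 0) return NULL;
    ddict_stub* const d = malloc(sizeof *d);
    if (d) { d->dict = buf; d->size = size; }
    return d;
}

/* Sketch of the fixed readDict(): on constructor failure, free the buffer
 * we allocated and null the caller's pointer so no path leaks it. */
static ddict_stub* read_dict_stub(const char* bytes, size_t len,
                                  char** buf, size_t* size) {
    *buf = malloc(len > 0 ? len : 1);  /* stands in for readFile() */
    if (*buf == NULL) return NULL;
    memcpy(*buf, bytes, len);
    *size = len;
    ddict_stub* const d = create_ddict_stub(*buf, *size);
    if (d == NULL) {
        free(*buf);   /* the fix: release the buffer on failure */
        *buf = NULL;
        *size = 0;
        return NULL;
    }
    return d;        /* success: caller owns both buffer and ddict */
}
```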

Comment on lines +21 to +27
#define CHECK(res) \
do { \
if (ZSTD_isError(res)) { \
printf("ERROR: %s\n", ZSTD_getErrorName(res)); \
return 1; \
} \
} while (0) \

⚠️ Potential issue | 🟠 Major

Resource leaks on error paths when CHECK fails.

The CHECK macro returns early on error but doesn't clean up previously allocated resources, causing memory leaks at Lines 48, 77, and 84.

Consider refactoring to use goto-based cleanup or refactor CHECK to support cleanup:

-#define CHECK(res)                                      \
-do {                                                    \
-    if (ZSTD_isError(res)) {                            \
-        printf("ERROR: %s\n", ZSTD_getErrorName(res));  \
-        return 1;                                       \
-    }                                                   \
-} while (0)                                             \

 int main(int argc, char *argv[]) {
+    ZSTD_CCtx* zc = NULL;
+    char* src = NULL;
+    char* dst = NULL;
+    char* val = NULL;
+    int ret = 0;
+
     if (argc != 2) {
         printf("Usage: externalSequenceProducer <file>\n");
         return 1;
     }
 
-    ZSTD_CCtx* const zc = ZSTD_createCCtx();
+    zc = ZSTD_createCCtx();
+    if (!zc) {
+        printf("ERROR: Failed to create compression context\n");
+        return 1;
+    }
 
     // ... rest of allocations with error checking ...
 
     {
         size_t const res = ZSTD_CCtx_setParameter(zc, ZSTD_c_enableSeqProducerFallback, 1);
-        CHECK(res);
+        if (ZSTD_isError(res)) {
+            printf("ERROR: %s\n", ZSTD_getErrorName(res));
+            ret = 1;
+            goto cleanup;
+        }
     }
 
     // ... similar for other CHECK calls ...
 
+cleanup:
+    ZSTD_freeCCtx(zc);
+    free(src);
+    free(dst);
+    free(val);
+    return ret;
 }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In contrib/externalSequenceProducer/main.c around lines 21 to 27, the CHECK
macro returns immediately on error and leaks resources allocated earlier
(notably at lines ~48, ~77 and ~84); change error handling to perform cleanup
before returning by either (A) refactoring CHECK to accept a label name and do a
goto cleanup on error, or (B) remove the macro and replace each CHECK invocation
with an if (ZSTD_isError(res)) { fprintf(...); goto cleanup; } pattern. Add a
single cleanup: label near function exit that frees any allocated memory, closes
handles and releases resources in reverse allocation order, set an appropriate
non-zero exit code, and ensure all paths (success and error) jump to that
cleanup label.

return 1;
}

ZSTD_CCtx* const zc = ZSTD_createCCtx();

⚠️ Potential issue | 🔴 Critical

Add null check after context creation.

ZSTD_createCCtx() can return NULL on allocation failure. Using the context without validation can cause a segmentation fault.

Apply this diff to add a null check:

 ZSTD_CCtx* const zc = ZSTD_createCCtx();
+if (!zc) {
+    printf("ERROR: Failed to create compression context\n");
+    return 1;
+}
🤖 Prompt for AI Agents
In contrib/externalSequenceProducer/main.c around line 35, the code calls
ZSTD_createCCtx() and assigns to ZSTD_CCtx* const zc without checking for NULL;
add a null check immediately after creation and handle allocation failure (e.g.,
log an error, free any previously allocated resources, and return/exit with an
error code) so the function never dereferences zc when it is NULL; ensure the
success path continues to use zc and that any early-return paths clean up other
state as needed.

Comment on lines +51 to +70
FILE *f = fopen(argv[1], "rb");
assert(f);
{
    int const ret = fseek(f, 0, SEEK_END);
    assert(ret == 0);
}
size_t const srcSize = ftell(f);
{
    int const ret = fseek(f, 0, SEEK_SET);
    assert(ret == 0);
}

char* const src = malloc(srcSize + 1);
assert(src);
{
    size_t const ret = fread(src, srcSize, 1, f);
    assert(ret == 1);
    int const ret2 = fclose(f);
    assert(ret2 == 0);
}

⚠️ Potential issue | 🟠 Major

Replace assert() with proper error handling for file operations.

Using assert() for file I/O error checking is problematic:

  • Assertions are disabled in release builds (-DNDEBUG), leaving no error checking in production
  • Abrupt termination provides poor user experience instead of clean error messages

Additionally, Line 63 allocates srcSize + 1 bytes but only reads srcSize bytes, suggesting possible intent to null-terminate that isn't implemented.

Apply this diff to add proper error handling:

 FILE *f = fopen(argv[1], "rb");
-assert(f);
+if (!f) {
+    printf("ERROR: Failed to open file %s\n", argv[1]);
+    ZSTD_freeCCtx(zc);
+    return 1;
+}
 {
-    int const ret = fseek(f, 0, SEEK_END);
-    assert(ret == 0);
+    if (fseek(f, 0, SEEK_END) != 0) {
+        printf("ERROR: Failed to seek in file\n");
+        fclose(f);
+        ZSTD_freeCCtx(zc);
+        return 1;
+    }
 }
 size_t const srcSize = ftell(f);
 {
-    int const ret = fseek(f, 0, SEEK_SET);
-    assert(ret == 0);
+    if (fseek(f, 0, SEEK_SET) != 0) {
+        printf("ERROR: Failed to seek in file\n");
+        fclose(f);
+        ZSTD_freeCCtx(zc);
+        return 1;
+    }
 }
 
-char* const src = malloc(srcSize + 1);
-assert(src);
+char* const src = malloc(srcSize);
+if (!src) {
+    printf("ERROR: Failed to allocate memory\n");
+    fclose(f);
+    ZSTD_freeCCtx(zc);
+    return 1;
+}
 {
-    size_t const ret = fread(src, srcSize, 1, f);
-    assert(ret == 1);
+    if (fread(src, srcSize, 1, f) != 1) {
+        printf("ERROR: Failed to read file\n");
+        free(src);
+        fclose(f);
+        ZSTD_freeCCtx(zc);
+        return 1;
+    }
-    int const ret2 = fclose(f);
-    assert(ret2 == 0);
+    fclose(f);
 }

Comment thread programs/util.c
Comment on lines +715 to +746
/* Create an array of pointers to the lines in a buffer */
static const char** UTIL_createLinePointers(char* buffer, size_t numLines, size_t bufferSize)
{
    size_t lineIndex = 0;
    size_t pos = 0;
    void* const bufferPtrs = malloc(numLines * sizeof(const char**));
    const char** const linePointers = (const char**)bufferPtrs;
    if (bufferPtrs == NULL) return NULL;

    while (lineIndex < numLines && pos < bufferSize) {
        size_t len = 0;
        linePointers[lineIndex++] = buffer+pos;

        /* Find the next null terminator, being careful not to go past the buffer */
        while ((pos + len < bufferSize) && buffer[pos + len] != '\0') {
            len++;
        }

        /* Move past this string and its null terminator */
        pos += len;
        if (pos < bufferSize) pos++; /* Skip the null terminator if we're not at buffer end */
    }

    /* Verify we processed the expected number of lines */
    if (lineIndex != numLines) {
        /* Something went wrong - we didn't find as many lines as expected */
        free(bufferPtrs);
        return NULL;
    }

    return linePointers;
}

⚠️ Potential issue | 🟠 Major

Fix incorrect sizeof in malloc.

Line 720 uses sizeof(const char**) when it should use sizeof(const char*). While these typically have the same value (pointer size), this is semantically incorrect and could cause issues on unusual architectures.

Apply this fix:

-    void* const bufferPtrs = malloc(numLines * sizeof(const char**));
+    void* const bufferPtrs = malloc(numLines * sizeof(const char*));
     const char** const linePointers = (const char**)bufferPtrs;
🤖 Prompt for AI Agents
In programs/util.c around lines 715 to 746, the malloc call allocates numLines *
sizeof(const char**) which is semantically wrong; change the allocation to use
sizeof(const char*) (i.e. malloc(numLines * sizeof(const char*))) so you
allocate an array of const char* pointers; keep the rest of the function the
same, check for NULL as already done, and free bufferPtrs on error as before.

Comment thread programs/util.c
Comment on lines +1015 to +1024
if (*bufStart + *pos + pathLength >= *bufEnd) {
    ptrdiff_t newListSize = (*bufEnd - *bufStart) + LIST_SIZE_INCREASE;
    assert(newListSize >= 0);
    *bufStart = (char*)UTIL_realloc(*bufStart, (size_t)newListSize);
    if (*bufStart != NULL) {
        *bufEnd = *bufStart + newListSize;
    } else {
        free(path); closedir(dir); return 0;
    }
}

⚠️ Potential issue | 🟠 Major

Validate cast from ptrdiff_t to size_t before using.

The code asserts that newListSize >= 0 before casting to size_t, but assertions are compiled out in release builds. Additionally, the ptrdiff_t addition could theoretically overflow.

Use an explicit check instead:

             if (*bufStart + *pos + pathLength >= *bufEnd) {
-                ptrdiff_t newListSize = (*bufEnd - *bufStart) + LIST_SIZE_INCREASE;
-                assert(newListSize >= 0);
-                *bufStart = (char*)UTIL_realloc(*bufStart, (size_t)newListSize);
-                if (*bufStart != NULL) {
-                    *bufEnd = *bufStart + newListSize;
-                } else {
-                    free(path); closedir(dir); return 0;
-                }
+                ptrdiff_t const curListSize = *bufEnd - *bufStart;
+                ptrdiff_t newListSize;
+                if (curListSize > PTRDIFF_MAX - LIST_SIZE_INCREASE) {
+                    free(path); closedir(dir); return 0;  /* growth would overflow */
+                }
+                newListSize = curListSize + LIST_SIZE_INCREASE;
+                *bufStart = (char*)UTIL_realloc(*bufStart, (size_t)newListSize);
+                if (*bufStart != NULL) {
+                    *bufEnd = *bufStart + newListSize;
+                } else {
+                    free(path); closedir(dir); return 0;
+                }
             }
🤖 Prompt for AI Agents
In programs/util.c around lines 1015 to 1024, the code computes newListSize as a
ptrdiff_t and only uses assert(newListSize >= 0) before casting to size_t, which
is unsafe in release builds and can overflow; change this by checking
newListSize is non-negative and does not exceed SIZE_MAX (or otherwise fits in
size_t) before casting, handle the failure path (free path/closedir and return
0) if the check fails, and then use the safe size_t value for UTIL_realloc and
pointer arithmetic; ensure you also guard the addition that produced newListSize
against overflow (e.g., check *bufEnd - *bufStart <= SIZE_MAX -
LIST_SIZE_INCREASE) before computing the size.

Comment thread programs/util.c
Comment on lines +1470 to +1542
int UTIL_countCores(int logical)
{
    static int numCores = 0;
    if (numCores != 0) return numCores;

    {   LPFN_GLPI glpi;
        BOOL done = FALSE;
        PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer = NULL;
        PSYSTEM_LOGICAL_PROCESSOR_INFORMATION ptr = NULL;
        DWORD returnLength = 0;
        size_t byteOffset = 0;

#if defined(_MSC_VER)
        /* Visual Studio does not like the following cast */
#   pragma warning( disable : 4054 )  /* conversion from function ptr to data ptr */
#   pragma warning( disable : 4055 )  /* conversion from data ptr to function ptr */
#endif
        glpi = (LPFN_GLPI)(void*)GetProcAddress(GetModuleHandle(TEXT("kernel32")),
                                                "GetLogicalProcessorInformation");

        if (glpi == NULL) {
            goto failed;
        }

        while(!done) {
            DWORD rc = glpi(buffer, &returnLength);
            if (FALSE == rc) {
                if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
                    if (buffer)
                        free(buffer);
                    buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION)malloc(returnLength);

                    if (buffer == NULL) {
                        perror("zstd");
                        exit(1);
                    }
                } else {
                    /* some other error */
                    goto failed;
                }
            } else {
                done = TRUE;
        }   }

        ptr = buffer;

        while (byteOffset + sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION) <= returnLength) {

            if (ptr->Relationship == RelationProcessorCore) {
                if (logical)
                    numCores += CountSetBits(ptr->ProcessorMask);
                else
                    numCores++;
            }

            ptr++;
            byteOffset += sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);
        }

        free(buffer);

        return numCores;
    }

failed:
    /* try to fall back on GetSystemInfo */
    {   SYSTEM_INFO sysinfo;
        GetSystemInfo(&sysinfo);
        numCores = sysinfo.dwNumberOfProcessors;
        if (numCores == 0) numCores = 1;  /* just in case */
    }
    return numCores;
}

⚠️ Potential issue | 🟠 Major

Remove exit() calls from core counting utilities.

The core counting functions contain exit(1) calls on errors (lines 1503-1504, 1562-1563, 1670-1671). Library utilities should return error indicators (e.g., -1 or 1 as fallback) rather than terminating the process.

Replace exit calls with fallback values:

                     if (buffer == NULL) {
-                        perror("zstd");
-                        exit(1);
+                        /* allocation failed, fall back to 1 core */
+                        numCores = 1;
+                        return numCores;
                     }

Apply similar changes at lines 1562-1563 and 1670-1671.

Also applies to: 1550-1569, 1648-1686

Comment thread zlibWrapper/gzlib.c
Comment on lines +236 to +240
state.state->fd = fd > -1 ? fd : (
#ifdef WIDECHAR
    fd == -2 ? _wopen(path, oflag, 0666) :
#endif
    open((const char *)path, oflag, 0666));

⚠️ Potential issue | 🟠 Major

open(..., 0666) may create overly permissive files

The gz_open path uses:

state.state->fd = fd > -1 ? fd : (
#ifdef WIDECHAR
    fd == -2 ? _wopen(path, oflag, 0666) :
#endif
    open((const char *)path, oflag, 0666));

Using 0666 means “world‑writable” before umask is applied. On systems with a permissive or zero umask this can create files writable by other users, which is a security concern (CWE‑732).

If you don’t need strict compatibility with upstream zlib behavior, consider tightening this:

-        fd == -2 ? _wopen(path, oflag, 0666) :
-        open((const char *)path, oflag, 0666));
+        fd == -2 ? _wopen(path, oflag, 0600) :
+        open((const char *)path, oflag, 0600));

or gate the stricter mode behind a build‑time option so consumers can choose.

🧰 Tools
🪛 ast-grep (0.39.9)

[warning] 239-239: This call makes a world-writable file which allows any user on a machine to write to the file. This may allow attackers to influence the behaviour of this process by writing to the file.
Context: 0666
Note: [CWE-732]: Incorrect Permission Assignment for Critical Resource [REFERENCES]
- https://wiki.sei.cmu.edu/confluence/display/c/FIO06-C.+Create+files+with+appropriate+access+permissions

(world-writable-file-c)

🤖 Prompt for AI Agents
In zlibWrapper/gzlib.c around lines 236-240, the code calls open/_wopen with
mode 0666 which can produce world-writable files on permissive umasks; change
the created file mode to a stricter default (for example 0644 or 0600 depending
on desired privacy) and/or make the mode configurable via a build-time option or
macro so consumers can opt into the original 0666 behavior; ensure both the open
and _wopen calls use the new mode value and document the chosen default in a
comment.

Comment on lines +128 to +138
static void* ZWRAP_customCalloc(size_t size, ZSTD_customMem customMem)
{
if (customMem.customAlloc) {
/* calloc implemented as malloc+memset;
* not as efficient as calloc, but next best guess for custom malloc */
void* const ptr = customMem.customAlloc(customMem.opaque, size);
memset(ptr, 0, size);
return ptr;
}
return calloc(1, size);
}

⚠️ Potential issue | 🟠 Major

Potential null dereference in ZWRAP_customCalloc when custom allocator fails

Here:

if (customMem.customAlloc) {
    /* calloc implemented as malloc+memset; ... */
    void* const ptr = customMem.customAlloc(customMem.opaque, size);
    memset(ptr, 0, size);
    return ptr;
}
return calloc(1, size);

If customAlloc returns NULL (e.g., out of memory), memset(ptr, 0, size); will dereference a null pointer and likely crash, instead of propagating the allocation failure.

Recommend guarding the memset:

static void* ZWRAP_customCalloc(size_t size, ZSTD_customMem customMem)
{
    if (customMem.customAlloc) {
        void* const ptr = customMem.customAlloc(customMem.opaque, size);
-       memset(ptr, 0, size);
-       return ptr;
+       if (ptr != NULL) {
+           memset(ptr, 0, size);
+       }
+       return ptr;
    }
    return calloc(1, size);
}

Callers already check for a NULL return, so this fix preserves behavior while avoiding the crash.

🤖 Prompt for AI Agents
In zlibWrapper/zstd_zlibwrapper.c around lines 128 to 138, the customCalloc
implementation unconditionally calls memset on the pointer returned by
customMem.customAlloc which can be NULL; update the code to check if ptr is NULL
before calling memset and simply return NULL if allocation failed, otherwise
perform memset and return the pointer, preserving existing behavior for the
non-custom path that calls calloc.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 20

♻️ Duplicate comments (1)
contrib/diagnose_corruption/check_flipped_bits.c (1)

44-50: Tighten memory management in helper and init paths (known leak issues)

There are a few related leaks / incomplete cleanups:

  • free_stuff() (Lines 44–50) never frees stuff->perturbed, even though it is allocated in init_stuff() as a long-lived buffer.
  • readDict() (Lines 125–142) may leak the buffer returned in *buf if ZSTD_createDDict_advanced() fails: it returns NULL without freeing *buf, and the caller only checks the return value.
  • init_stuff() (Lines 191–274) returns 0 on several allocation/initialization failures after some fields are already allocated/initialized; main() exits immediately on failure and never calls free_stuff(), so those partial allocations are leaked.

Minimal fixes:

 static void free_stuff(stuff_t* stuff) {
   free(stuff->input);
+  free(stuff->perturbed);
   free(stuff->output);
   ZSTD_freeDDict(stuff->ddict);
   free(stuff->dict);
   ZSTD_freeDCtx(stuff->dctx);
 }

 static ZSTD_DDict* readDict(const char* filename, char **buf, size_t* size, int32_t* dict_id) {
   ZSTD_DDict* ddict;
   *buf = readFile(filename, size);
   if (*buf == NULL) {
     fprintf(stderr, "Opening dictionary file '%s' failed\n", filename);
     return NULL;
   }

   ddict = ZSTD_createDDict_advanced(*buf, *size, ZSTD_dlm_byRef, ZSTD_dct_auto, ZSTD_defaultCMem);
   if (ddict == NULL) {
     fprintf(stderr, "Failed to create ddict.\n");
+    free(*buf);
+    *buf = NULL;
     return NULL;
   }
   ...
 }

 static int init_stuff(stuff_t* stuff, int argc, char *argv[]) {
   ...
   stuff->input = readFile(input_filename, &stuff->input_size);
   if (stuff->input == NULL) {
     fprintf(stderr, "Failed to read input file.\n");
-    return 0;
+    return 0;
   }

   stuff->perturbed = malloc(stuff->input_size);
   if (stuff->perturbed == NULL) {
     fprintf(stderr, "malloc failed.\n");
-    return 0;
+    free(stuff->input);
+    return 0;
   }
   ...
   stuff->output = malloc(stuff->output_size);
   if (stuff->output == NULL) {
     fprintf(stderr, "malloc failed.\n");
-    return 0;
+    free(stuff->perturbed);
+    free(stuff->input);
+    return 0;
   }
   ...
   if (stuff->dict_file_dir_name) {
     ...
     if (stuff->ddict == NULL) {
       fprintf(stderr, "Failed to create cached ddict.\n");
-      return 0;
+      free(stuff->output);
+      free(stuff->perturbed);
+      free(stuff->input);
+      return 0;
     }
   } else if (stuff->dict_file_name) {
     ...
     if (stuff->ddict == NULL) {
       fprintf(stderr, "Failed to create ddict from '%s'.\n", stuff->dict_file_name);
-      return 0;
+      free(stuff->output);
+      free(stuff->perturbed);
+      free(stuff->input);
+      return 0;
     }
   }
   stuff->dctx = ZSTD_createDCtx();
   if (stuff->dctx == NULL) {
-    return 0;
+    ZSTD_freeDDict(stuff->ddict);
+    free(stuff->output);
+    free(stuff->perturbed);
+    free(stuff->input);
+    return 0;
   }
   ...
 }

And in main(), consider calling free_stuff(&stuff); before returning on early failure paths to ensure consistent cleanup.

Also applies to: 125-142, 191-274

🧹 Nitpick comments (13)
contrib/pzstd/Pzstd.h (2)

19-25: Consider scoping ZSTD_DISABLE_DEPRECATE_WARNINGS to just zstd.h.

You #undef ZSTD_STATIC_LINKING_ONLY after including zstd.h but leave ZSTD_DISABLE_DEPRECATE_WARNINGS defined. If the intent is only to silence deprecation warnings from zstd itself, also #undef it (or wrap in a push/pop pattern) so it doesn’t accidentally affect downstream headers in the same translation unit.


118-128: Fix parameter documentation for asyncDecompressFrames’s executor.

The executor parameter is documented as “the thread pool to run compression jobs in”, but this function is for decompression. Adjusting the wording to “decompression jobs” will avoid confusion when reading the header.

contrib/pzstd/utils/Buffer.h (1)

47-48: Buffer is move-only; confirm this is intentional or add copy operations.

By explicitly defaulting the move constructor and move assignment but not the copy operations, the compiler implicitly deletes the copy ctor/assignment in C++11+, making Buffer move-only even though it wraps a std::shared_ptr. If you intended cheap, copyable “view” semantics over shared storage, consider explicitly defaulting the copy constructor and copy assignment as well.

contrib/externalSequenceProducer/main.c (1)

76-105: Centralize cleanup to avoid leaks on early returns

On error paths (e.g., CHECK(cSize) failure or buffer mismatch) the function returns 1 without freeing src, dst, val, or the ZSTD_CCtx. While the OS will reclaim memory on process exit, it’s easy to avoid leaks by using a single cleanup label.

For example:

-    size_t const cSize = ZSTD_compress2(zc, dst, dstSize, src, srcSize);
-    CHECK(cSize);
+    size_t const cSize = ZSTD_compress2(zc, dst, dstSize, src, srcSize);
+    if (ZSTD_isError(cSize)) {
+        printf("ERROR: %s\n", ZSTD_getErrorName(cSize));
+        goto cleanup;
+    }
@@
-        return 1;
+        goto cleanup_error;
@@
-    ZSTD_freeCCtx(zc);
-    free(src);
-    free(dst);
-    free(val);
-    return 0;
+cleanup:
+    ZSTD_freeCCtx(zc);
+    free(src);
+    free(dst);
+    free(val);
+    return 0;
+
+cleanup_error:
+    ZSTD_freeCCtx(zc);
+    free(src);
+    free(dst);
+    free(val);
+    return 1;

Also consider using %zu (or an explicit cast) for srcSize/cSize in the printf() calls to be strictly portable with size_t.

contrib/seqBench/seqBench.c (2)

8-15: Avoid leaking ZSTD_CCtx on bad CLI usage

ZSTD_CCtx* zc = ZSTD_createCCtx(); is called before checking argc, so on a usage error you return without freeing the context.

A simple fix is to create the context only after validating arguments, e.g.:

-int main(int argc, char *argv[]) {
-    ZSTD_CCtx* zc = ZSTD_createCCtx();
-
-    if (argc != 2) {
+int main(int argc, char *argv[]) {
+    if (argc != 2) {
         printf("Usage: seqBench <file>\n"); // TODO provide the block delim option here
         return 1;
     }
+
+    ZSTD_CCtx* zc = ZSTD_createCCtx();
+    if (zc == NULL) {
+        fprintf(stderr, "ERROR: failed to create ZSTD_CCtx\n");
+        return 1;
+    }

40-52: Add cleanup on exit to avoid leaks and ensure predictable teardown

inBuf, seqs, outBuf, validationBuf, and zc are never freed. Even though this is a one‑shot CLI, it’s straightforward to add a common cleanup section that frees these buffers and the context on both success and failure paths.

For example:

-    if (memcmp(inBuf, validationBuf, inBufSize) == 0) {
-        printf("Compression and decompression were successful!\n");
-    } else {
-        printf("ERROR: input and validation buffers don't match!\n");
-        ...
-    }
-
-    return 0;
+    int exitCode = 0;
+    if (memcmp(inBuf, validationBuf, inBufSize) == 0) {
+        printf("Compression and decompression were successful!\n");
+    } else {
+        printf("ERROR: input and validation buffers don't match!\n");
+        ...
+        exitCode = 1;
+    }
+
+    ZSTD_freeCCtx(zc);
+    free(inBuf);
+    free(seqs);
+    free(outBuf);
+    free(validationBuf);
+    return exitCode;
contrib/largeNbDicts/largeNbDicts.c (1)

682-785: Free speedPerRound and check allocations in benchMem()

benchMem() allocates speedPerRound and a BMK_timedFnState_t but never frees speedPerRound and doesn’t check for allocation failures:

double *const speedPerRound = (double *)malloc(nbRounds * sizeof(double));
BMK_timedFnState_t* const benchState = BMK_createTimedFnState(total_time_ms, ms_per_round);

For a long‑running benchmark process or repeated invocations this accumulates a leak, and on OOM benchState/speedPerRound being NULL will cause crashes.

Consider:

-    double *const speedPerRound = (double *)malloc(nbRounds * sizeof(double));
-
-    BMK_timedFnState_t* const benchState =
-            BMK_createTimedFnState(total_time_ms, ms_per_round);
+    double *const speedPerRound = (double *)malloc(nbRounds * sizeof(double));
+    if (!speedPerRound) {
+        return 1;
+    }
+
+    BMK_timedFnState_t* const benchState =
+            BMK_createTimedFnState(total_time_ms, ms_per_round);
+    if (!benchState) {
+        free(speedPerRound);
+        return 1;
+    }
@@
-    freeDecompressInstructions(di);
-    freeCompressInstructions(ci);
-    BMK_freeTimedFnState(benchState);
-
-    return 0;   /* success */
+    freeDecompressInstructions(di);
+    freeCompressInstructions(ci);
+    BMK_freeTimedFnState(benchState);
+    free(speedPerRound);
+
+    return 0;   /* success */
programs/util.c (1)

1147-1155: Fix trailing separator stripping in convertPathnameToDirName.

The loop currently checks pathname[len], so it never strips trailing separators and won't handle multiple trailing PATH_SEPs as intended.

Suggested fix:

-    /* remove trailing '/' chars */
-    len = strlen(pathname);
-    assert(len > 0);
-    while (pathname[len] == PATH_SEP) {
-        pathname[len] = '\0';
-        len--;
-    }
-    if (len == 0) return;
+    /* remove trailing PATH_SEP chars */
+    len = strlen(pathname);
+    if (len == 0) return;
+    while (len > 0 && pathname[len - 1] == PATH_SEP) {
+        pathname[--len] = '\0';
+    }
+    if (len == 0) return;
tests/bigdict.c (1)

77-81: Route all failure paths through cleanup to avoid leaks in main.

After introducing _exit_code and cleanup, several error paths still use return 1;, leaking buffer, out, roundtrip, cctx, and dctx on failure.

Suggested fix (pattern applied to all error branches):

@@
-    if (!buffer || !out || !roundtrip || !cctx || !dctx) {
-        fprintf(stderr, "Allocation failure\n");
-        _exit_code = 1;
-        goto cleanup;
-    }
+    if (!buffer || !out || !roundtrip || !cctx || !dctx) {
+        fprintf(stderr, "Allocation failure\n");
+        _exit_code = 1;
+        goto cleanup;
+    }
@@
-    if (ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, 31)))
-        return 1;
-    if (ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, 1)))
-        return 1;
-    if (ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_overlapLog, 9)))
-        return 1;
-    if (ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_checksumFlag, 1)))
-        return 1;
-    if (ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_strategy, ZSTD_btopt)))
-        return 1;
-    if (ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_targetLength, 7)))
-        return 1;
-    if (ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_minMatch, 7)))
-        return 1;
-    if (ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_searchLog, 1)))
-        return 1;
-    if (ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_hashLog, 10)))
-        return 1;
-    if (ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_chainLog, 10)))
-        return 1;
-
-    if (ZSTD_isError(ZSTD_DCtx_setParameter(dctx, ZSTD_d_windowLogMax, 31)))
-        return 1;
+    if (ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, 31)) ||
+        ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, 1)) ||
+        ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_overlapLog, 9)) ||
+        ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_checksumFlag, 1)) ||
+        ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_strategy, ZSTD_btopt)) ||
+        ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_targetLength, 7)) ||
+        ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_minMatch, 7)) ||
+        ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_searchLog, 1)) ||
+        ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_hashLog, 10)) ||
+        ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_chainLog, 10)) ||
+        ZSTD_isError(ZSTD_DCtx_setParameter(dctx, ZSTD_d_windowLogMax, 31))) {
+        _exit_code = 1;
+        goto cleanup;
+    }
@@
-            fprintf(stderr, "Compressing 1 GB\n");
-            if (compress(cctx, dctx, out, outSize, buffer, dataSize, roundtrip, ZSTD_e_continue))
-                return 1;
+            fprintf(stderr, "Compressing 1 GB\n");
+            if (compress(cctx, dctx, out, outSize, buffer, dataSize, roundtrip, ZSTD_e_continue)) {
+                _exit_code = 1;
+                goto cleanup;
+            }
@@
-    fprintf(stderr, "Compressing 1 GB\n");
-    if (compress(cctx, dctx, out, outSize, buffer, dataSize, roundtrip, ZSTD_e_end))
-        return 1;
+    fprintf(stderr, "Compressing 1 GB\n");
+    if (compress(cctx, dctx, out, outSize, buffer, dataSize, roundtrip, ZSTD_e_end)) {
+        _exit_code = 1;
+        goto cleanup;
+    }
@@
-    fprintf(stderr, "Success!\n");
-
-    goto cleanup;
+    fprintf(stderr, "Success!\n");
+    _exit_code = 0;
+    goto cleanup;

Also applies to: 83-105, 104-105, 110-121

programs/dibio.c (1)

214-230: Harden DiB_findMaxMem() against pathological malloc failures

DiB_findMaxMem() assumes that at some (possibly small) size, malloc(requiredMem) will succeed. On implementations where malloc(0) may return NULL and very small allocations can also fail, the requiredMem -= step; loop can underflow and spin for a long time.

You can guard against this without changing normal behavior:

 static size_t DiB_findMaxMem(unsigned long long requiredMem)
 {
     size_t const step = 8 MB;
     void* testmem = NULL;

     requiredMem = (((requiredMem >> 23) + 1) << 23);
     requiredMem += step;
     if (requiredMem > g_maxMemory) requiredMem = g_maxMemory;

-    while (!testmem) {
-        testmem = malloc((size_t)requiredMem);
-        requiredMem -= step;
-    }
+    while (!testmem && requiredMem >= step) {
+        testmem = malloc((size_t)requiredMem);
+        requiredMem -= step;
+    }
+    if (!testmem) {
+        /* even the smallest probe failed */
+        EXM_THROW(12, "not enough memory for dictionary training");
+    }

     free(testmem);
     return (size_t)requiredMem;
 }

This preserves the probing behavior while avoiding potential underflow / infinite-loop edge cases.
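The hardened loop can be exercised in isolation. A sketch with the same decrement-after-malloc behavior, where the `malloc_hook` parameter is purely a test convenience (the real `DiB_findMaxMem()` calls `malloc()` directly, and `probe_max_mem` is a hypothetical name):

```c
#include <stdlib.h>

#define STEP ((size_t)8 << 20)   /* 8 MB, matching the step above */

/* Walk downward in fixed steps, but stop at a one-step floor so the
 * unsigned counter cannot underflow and spin. */
static size_t probe_max_mem(size_t requiredMem, void* (*malloc_hook)(size_t))
{
    void* testmem = NULL;
    while (!testmem && requiredMem >= STEP) {
        testmem = malloc_hook(requiredMem);
        requiredMem -= STEP;
    }
    if (!testmem) return 0;   /* caller maps 0 to "not enough memory" */
    free(testmem);
    return requiredMem;       /* last successful size minus one step */
}

/* Simulated allocator that always fails, for exercising the floor. */
static void* always_fail(size_t n) { (void)n; return NULL; }
```

With a working allocator a 32 MB request succeeds on the first probe and reports 24 MB (size minus one step, as in the original); with the failing allocator the loop terminates instead of underflowing.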

tests/largeDictionary.c (1)

63-128: Unify error handling with a cleanup path to avoid leaks on failure

main() returns early on allocation or parameter errors (Lines 76–103, 111–118) without freeing buffer, out, roundtrip, or the ZSTD contexts, unlike tests/bigdict.c which centralizes cleanup via a cleanup: label.

While leaks are limited to this short-lived test process and only occur on failures, you can easily make it consistent and safer:

-int main(int argc, const char** argv)
-{
-    ZSTD_CCtx* cctx = ZSTD_createCCtx();
-    ZSTD_DCtx* dctx = ZSTD_createDCtx();
+int main(int argc, const char** argv)
+{
+    int _exit_code = 1;
+    ZSTD_CCtx* cctx = ZSTD_createCCtx();
+    ZSTD_DCtx* dctx = ZSTD_createDCtx();
@@
-    if (!buffer || !out || !roundtrip || !cctx || !dctx) {
-        fprintf(stderr, "Allocation failure\n");
-        return 1;
-    }
-
-    if (ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, 31)))
-        return 1;
-    if (ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, 1)))
-        return 1;
-    ...
-    if (ZSTD_isError(ZSTD_DCtx_setParameter(dctx, ZSTD_d_windowLogMax, 31)))
-        return 1;
+    if (!buffer || !out || !roundtrip || !cctx || !dctx) {
+        fprintf(stderr, "Allocation failure\n");
+        goto cleanup;
+    }
+
+    if (ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, 31)))
+        goto cleanup;
+    if (ZSTD_isError(ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, 1)))
+        goto cleanup;
+    ...
+    if (ZSTD_isError(ZSTD_DCtx_setParameter(dctx, ZSTD_d_windowLogMax, 31)))
+        goto cleanup;
@@
-    fprintf(stderr, "Success!\n");
-
-    free(roundtrip);
-    free(out);
-    free(buffer);
-    ZSTD_freeDCtx(dctx);
-    ZSTD_freeCCtx(cctx);
-    return 0;
+    fprintf(stderr, "Success!\n");
+    _exit_code = 0;
+
+cleanup:
+    free(roundtrip);
+    free(out);
+    free(buffer);
+    ZSTD_freeDCtx(dctx);
+    ZSTD_freeCCtx(cctx);
+    return _exit_code;
This mirrors the existing pattern in tests/bigdict.c and keeps future changes safer.

lib/dictBuilder/fastcover.c (1)

313-358: Preserve displayLevel inside FASTCOVER_ctx_t

FASTCOVER_ctx_init() assigns ctx->displayLevel = displayLevel; and then immediately calls memset(ctx, 0, sizeof(*ctx)), wiping the value. Later, other functions read ctx->displayLevel to drive DISPLAYLEVEL logging and to propagate zParams.notificationLevel in the optimize path.

It doesn't affect compression correctness, but it silently disables per-context logging and propagates notificationLevel = 0 from these contexts.

A small reordering would fix this:

-    ctx->displayLevel = displayLevel;
-    /* Checks */
-    if (totalSamplesSize < MAX(d, sizeof(U64)) ||
-        totalSamplesSize >= (size_t)FASTCOVER_MAX_SAMPLES_SIZE) {
-        ...
-    }
-    ...
-    /* Zero the context */
-    memset(ctx, 0, sizeof(*ctx));
+    /* Zero the context */
+    memset(ctx, 0, sizeof(*ctx));
+    ctx->displayLevel = displayLevel;
+    /* Checks */
+    if (totalSamplesSize < MAX(d, sizeof(U64)) ||
+        totalSamplesSize >= (size_t)FASTCOVER_MAX_SAMPLES_SIZE) {
+        ...
+    }

If you care about logging here, grep for ctx.displayLevel usage and confirm after the change that messages appear at the expected verbosity.
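The hazard is easy to reproduce in isolation; this toy (struct and function names hypothetical) shows why the assignment has to come after the memset:

```c
#include <string.h>

typedef struct { int displayLevel; unsigned freqs[8]; } toy_ctx_t;

/* Wrong order, as in FASTCOVER_ctx_init(): the memset wipes the
 * field that was just assigned. */
static int init_wrong(int level)
{
    toy_ctx_t ctx;
    ctx.displayLevel = level;
    memset(&ctx, 0, sizeof(ctx));
    return ctx.displayLevel;   /* always 0 */
}

/* Right order: zero the whole struct first, then set the fields
 * that must survive. */
static int init_right(int level)
{
    toy_ctx_t ctx;
    memset(&ctx, 0, sizeof(ctx));
    ctx.displayLevel = level;
    return ctx.displayLevel;
}
```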

programs/benchzstd.c (1)

783-888: Use BMK_benchMemAdvanced’s allocation guard to keep behavior robust

BMK_benchMemAdvanced() already has a comprehensive allocation guard (its allocationincomplete flag) and cleans up all allocations on failure. With the fix above, that remains correct. Just ensure any future changes to the chunking logic don't add new allocations inside BMK_benchMemAdvancedNoAlloc() without matching error handling, and keep the single-exit-point pattern.

If you evolve this further (e.g., more modes), consider extracting a small helper struct for per-chunk metadata to keep the allocation/cleanup patterns centralized.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6cacd63 and 14d58f4.

📒 Files selected for processing (27)
  • contrib/diagnose_corruption/check_flipped_bits.c (1 hunks)
  • contrib/externalSequenceProducer/main.c (1 hunks)
  • contrib/largeNbDicts/largeNbDicts.c (1 hunks)
  • contrib/pzstd/Pzstd.h (1 hunks)
  • contrib/pzstd/utils/Buffer.h (1 hunks)
  • contrib/recovery/recover_directory.c (1 hunks)
  • contrib/seekable_format/zstdseek_compress.c (1 hunks)
  • contrib/seekable_format/zstdseek_decompress.c (1 hunks)
  • contrib/seqBench/seqBench.c (1 hunks)
  • doc/educational_decoder/harness.c (1 hunks)
  • doc/educational_decoder/zstd_decompress.c (1 hunks)
  • lib/dictBuilder/cover.c (1 hunks)
  • lib/dictBuilder/divsufsort.c (1 hunks)
  • lib/dictBuilder/fastcover.c (1 hunks)
  • lib/dictBuilder/zdict.c (1 hunks)
  • lib/legacy/zstd_v01.c (1 hunks)
  • programs/benchfn.c (1 hunks)
  • programs/benchzstd.c (1 hunks)
  • programs/datagen.c (1 hunks)
  • programs/dibio.c (1 hunks)
  • programs/fileio_asyncio.c (1 hunks)
  • programs/util.c (1 hunks)
  • tests/bigdict.c (1 hunks)
  • tests/largeDictionary.c (1 hunks)
  • zlibWrapper/gzlib.c (1 hunks)
  • zlibWrapper/gzread.c (1 hunks)
  • zlibWrapper/zstd_zlibwrapper.c (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (21)
contrib/diagnose_corruption/check_flipped_bits.c (2)
lib/legacy/zstd_v03.c (3)
  • ZSTD_freeDCtx (2989-2993)
  • ZSTD_createDCtx (2981-2987)
  • ZSTD_isError (2373-2373)
lib/legacy/zstd_v04.c (3)
  • ZSTD_freeDCtx (2477-2481)
  • ZSTD_createDCtx (2469-2475)
  • ZSTD_isError (2429-2429)
programs/dibio.c (2)
programs/util.c (1)
  • UTIL_getFileSize (545-558)
lib/dictBuilder/zdict.c (3)
  • ZDICT_trainFromBuffer_legacy (1088-1108)
  • ZDICT_isError (98-98)
  • ZDICT_getErrorName (100-100)
lib/dictBuilder/cover.c (1)
lib/legacy/zstd_v07.c (6)
  • U64 (283-286)
  • U64 (307-323)
  • U64 (358-364)
  • U32 (278-281)
  • U32 (293-305)
  • U32 (349-355)
contrib/largeNbDicts/largeNbDicts.c (2)
programs/util.c (6)
  • UTIL_getFileSize (545-558)
  • UTIL_isDirectory (357-365)
  • UTIL_getTotalFileSize (630-645)
  • UTIL_createExpandedFNT (1379-1423)
  • UTIL_assembleFileNamesTable (811-815)
  • UTIL_freeFileNamesTable (817-823)
programs/benchfn.c (6)
  • BMK_createTimedFnState (160-166)
  • BMK_benchTimedFn (211-256)
  • BMK_isSuccessful_runOutcome (64-67)
  • BMK_extract_runTime (71-75)
  • BMK_isCompleted_TimedFn (200-203)
  • BMK_freeTimedFnState (168-168)
lib/dictBuilder/fastcover.c (2)
lib/dictBuilder/cover.c (14)
  • COVER_sum (278-285)
  • COVER_computeEpochs (732-747)
  • COVER_dictSelectionError (1035-1037)
  • COVER_selectDict (1047-1132)
  • COVER_dictSelectionIsError (1039-1041)
  • COVER_best_finish (980-1024)
  • COVER_dictSelectionFree (1043-1045)
  • size_t (807-862)
  • size_t (1195-1331)
  • COVER_warnOnSmallCorpus (716-730)
  • COVER_best_init (922-931)
  • COVER_best_destroy (950-960)
  • COVER_best_start (966-973)
  • COVER_best_wait (936-945)
lib/dictBuilder/zdict.c (1)
  • ZDICT_finalizeDictionary (862-941)
zlibWrapper/gzlib.c (1)
zlibWrapper/gzread.c (9)
  • z_size_t (292-366)
  • ZEXPORT (369-398)
  • ZEXPORT (401-425)
  • ZEXPORT (445-470)
  • ZEXPORT (472-474)
  • ZEXPORT (477-531)
  • ZEXPORT (534-591)
  • ZEXPORT (594-609)
  • ZEXPORT (612-637)
contrib/pzstd/Pzstd.h (1)
contrib/pzstd/utils/Buffer.h (1)
  • pzstd (17-99)
contrib/seqBench/seqBench.c (2)
contrib/externalSequenceProducer/main.c (1)
  • main (29-107)
doc/educational_decoder/zstd_decompress.c (1)
  • ZSTD_decompress (384-391)
zlibWrapper/gzread.c (1)
zlibWrapper/gzlib.c (16)
  • ZEXPORT (265-267)
  • ZEXPORT (270-272)
  • ZEXPORT (275-289)
  • ZEXPORT (293-295)
  • ZEXPORT (299-320)
  • ZEXPORT (323-341)
  • ZEXPORT (344-414)
  • ZEXPORT (417-422)
  • ZEXPORT (425-437)
  • ZEXPORT (440-445)
  • ZEXPORT (448-466)
  • ZEXPORT (469-474)
  • ZEXPORT (477-489)
  • ZEXPORT (492-507)
  • ZEXPORT (510-526)
  • gzFile (90-262)
doc/educational_decoder/zstd_decompress.c (1)
lib/legacy/zstd_v03.c (1)
  • ZSTD_decompress (2905-2910)
tests/largeDictionary.c (2)
tests/bigdict.c (2)
  • compress (21-61)
  • main (63-133)
programs/datagen.c (1)
  • RDG_genBuffer (144-152)
contrib/recovery/recover_directory.c (2)
programs/util.c (1)
  • UTIL_getFileSize (545-558)
lib/legacy/zstd_v03.c (3)
  • ZSTD_isError (2373-2373)
  • ZSTD_createDCtx (2981-2987)
  • ZSTD_freeDCtx (2989-2993)
tests/bigdict.c (2)
tests/largeDictionary.c (2)
  • compress (21-61)
  • main (63-128)
programs/datagen.c (1)
  • RDG_genBuffer (144-152)
doc/educational_decoder/harness.c (1)
doc/educational_decoder/zstd_decompress.c (5)
  • ZSTD_get_decompressed_size (1378-1401)
  • create_dictionary (1405-1411)
  • parse_dictionary (1435-1487)
  • ZSTD_decompress_with_dict (393-409)
  • free_dictionary (1414-1425)
programs/fileio_asyncio.c (1)
contrib/seekable_format/zstdseek_decompress.c (1)
  • LONG_SEEK (91-106)
contrib/externalSequenceProducer/main.c (1)
contrib/seqBench/seqBench.c (1)
  • main (8-53)
programs/benchzstd.c (5)
lib/legacy/zstd_v03.c (8)
  • ZSTD_isError (2373-2373)
  • U32 (121-124)
  • U32 (161-164)
  • U32 (166-175)
  • U64 (126-129)
  • U64 (177-187)
  • ZSTD_createDCtx (2981-2987)
  • ZSTD_freeDCtx (2989-2993)
lib/legacy/zstd_v04.c (8)
  • ZSTD_isError (2429-2429)
  • U32 (97-100)
  • U32 (137-140)
  • U32 (142-151)
  • U64 (102-105)
  • U64 (154-164)
  • ZSTD_createDCtx (2469-2475)
  • ZSTD_freeDCtx (2477-2481)
programs/datagen.c (1)
  • RDG_genBuffer (144-152)
programs/benchfn.c (6)
  • BMK_benchTimedFn (211-256)
  • BMK_isSuccessful_runOutcome (64-67)
  • BMK_extract_runTime (71-75)
  • BMK_isCompleted_TimedFn (200-203)
  • BMK_createTimedFnState (160-166)
  • BMK_freeTimedFnState (168-168)
programs/util.c (3)
  • UTIL_getFileSize (545-558)
  • UTIL_isDirectory (357-365)
  • UTIL_getTotalFileSize (630-645)
contrib/seekable_format/zstdseek_compress.c (1)
lib/legacy/zstd_v04.c (6)
  • ZSTD_isError (2429-2429)
  • U32 (97-100)
  • U32 (137-140)
  • U32 (142-151)
  • BYTE (839-848)
  • BYTE (850-859)
lib/dictBuilder/zdict.c (1)
lib/legacy/zstd_v03.c (4)
  • ERR_isError (525-525)
  • U32 (121-124)
  • U32 (161-164)
  • U32 (166-175)
contrib/seekable_format/zstdseek_decompress.c (2)
lib/legacy/zstd_v03.c (4)
  • ZSTD_isError (2373-2373)
  • U32 (121-124)
  • U32 (161-164)
  • U32 (166-175)
lib/legacy/zstd_v05.c (2)
  • U32 (124-127)
  • U32 (170-178)
zlibWrapper/zstd_zlibwrapper.c (2)
lib/legacy/zstd_v03.c (8)
  • char (527-532)
  • ZSTD_isError (2373-2373)
  • void (131-134)
  • void (147-159)
  • void (383-386)
  • void (651-658)
  • void (2914-2918)
  • ZSTD_decompress (2905-2910)
zlibWrapper/gzread.c (4)
  • int (32-54)
  • int (63-86)
  • int (97-173)
  • int (180-224)
🪛 ast-grep (0.39.9)
zlibWrapper/gzlib.c

[warning] 239-239: This call makes a world-writable file which allows any user on a machine to write to the file. This may allow attackers to influence the behaviour of this process by writing to the file.
Context: 0666
Note: [CWE-732]: Incorrect Permission Assignment for Critical Resource [REFERENCES]
- https://wiki.sei.cmu.edu/confluence/display/c/FIO06-C.+Create+files+with+appropriate+access+permissions

(world-writable-file-c)

🔇 Additional comments (12)
contrib/pzstd/utils/Buffer.h (1)

50-62: Document the precondition and exception behavior for Buffer::splitAt.

Verification reveals that the underlying Range::subpiece() and Range::advance() methods already validate bounds and throw std::out_of_range if n > size(). Contrary to the original concern, undefined behavior therefore does not occur; exceptions are thrown instead.

However, this exception behavior is not documented in the splitAt method comment. Consider either:

  • (a) Documenting the precondition n <= size() and that std::out_of_range is thrown on violation, or
  • (b) Adding a defensive check/assert in splitAt for consistency with callers' expectations.
zlibWrapper/gzread.c (2)

19-27: Union-based gz_statep usage and core read path look consistent

The shift from pointer casting to a gz_statep union (state.file vs state.state) is applied consistently across helpers and the public gzread/gzfread APIs: every function that accepts a gzFile sets state.file = file before dereferencing state.state, and internal helpers only ever touch state.state->.... The lazily-initialized buffers and stream state in gz_look/gz_read appear faithful to upstream zlib semantics, including the Z_OK/Z_BUF_ERROR handling.

No changes needed here from a correctness perspective.

Also applies to: 75-87, 291-398


136-149: Confirm zstd magic handling matches your wrapper’s inflate semantics

You now treat the 0x28 0xB5 prefix as a “zstd header” and route it through the same GZIP path (resetting the inflate stream and setting how = GZIP). This assumes your zlibWrapper’s inflate* implementation knows how to handle zstd frames under the same entry points.

Please double‑check that:

  • inflateInit2(strm, 15 + 16) is compatible with your zstd-backed inflate, and
  • mixed inputs (pure zstd, gzip, or trailing raw data) behave as expected.

If this code was copied from the upstream zstd zlibWrapper, it’s probably fine, but it’s worth re-verifying against current wrapper behavior and tests.
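For reference, the zstd frame magic is 0xFD2FB528 stored little-endian, so a stream begins with the bytes 0x28 0xB5 0x2F 0xFD; gz_look() only has the first two bytes in hand when it sniffs the format, which is why it keys on the 0x28 0xB5 prefix. A standalone sketch of that sniffing (function names here are illustrative, not from the wrapper):

```c
#include <stddef.h>

/* First two bytes of the little-endian zstd frame magic 0xFD2FB528. */
static int looks_like_zstd(const unsigned char* buf, size_t len)
{
    return len >= 2 && buf[0] == 0x28 && buf[1] == 0xB5;
}

/* Standard gzip magic, for comparison with the path gz_look() takes. */
static int looks_like_gzip(const unsigned char* buf, size_t len)
{
    return len >= 2 && buf[0] == 0x1F && buf[1] == 0x8B;
}
```

Note that two bytes are not a strong discriminator; the real decoder still validates the full frame header after routing.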

zlibWrapper/gzlib.c (3)

75-87: State initialization, reset, and union pattern align with upstream behavior

gz_reset, gz_open, and the buffer configuration (gzbuffer) correctly operate on the union-based gz_statep by consistently using state.state->... after constructing state.state or setting state.file. The initialization sequence (allocate state, parse mode, set defaults, build path, open fd, set start, then gz_reset) matches canonical zlib logic, and gzbuffer enforces the same pre-allocation and overflow constraints.

This module’s core state lifecycle looks sound.

Also applies to: 89-262, 298-341


343-474: Seek/offset APIs and error propagation appear correct and consistent

The seek/tell/offset helpers (gzseek64/gzseek, gztell64/gztell, gzoffset64/gzoffset) and error/EoF handling (gzeof, gzerror, gzclearerr, gz_error) mirror standard zlib behavior: they normalize offsets, manage pending skips, correctly adjust buffer cursors, and preserve Z_BUF_ERROR vs. fatal errors. gz_error also correctly owns and frees dynamically allocated messages, including the special handling for Z_MEM_ERROR.

No changes required here.

Also applies to: 476-526, 528-569


235-241: Review is technically valid but contextual—verify usage and zstd's compatibility constraints

The 0666 file mode is indeed less secure than the temporary-restrictive pattern used elsewhere in zstd (e.g., programs/fileio.c uses 0600 then fchmod). However, this is inherited from upstream zlib behavior and the wrapper is explicitly designed for backward compatibility.

Key findings:

  • zlibWrapper/gzlib.c has "minimal changes required" from upstream zlib; gzip file access is a documented supported feature
  • The same codebase demonstrates better practices (temporary 0600 + post-write fchmod in fileio.c)
  • No callers to gzip file creation functions found outside the wrapper, suggesting limited usage
  • No existing security documentation or config for file permissions in the wrapper

Before deciding to implement the review's suggestions (document umask or add a config hook), verify:

  1. Is this code path actually used? Evidence suggests it may not be in typical zstd workflows.
  2. Are compatibility constraints strict? If the wrapper must maintain zlib API/behavior exactly, the current approach is intentional.
  3. Does zstd have explicit security requirements for this optional feature?

If file creation is used and security is a concern, the review's suggestions are reasonable and non-breaking (documentation + optional stricter default are additive). If rarely used, lower priority.
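If a stricter default were wanted, the restrictive-create-then-widen pattern mentioned above is small. A POSIX-only sketch (the path and function names are hypothetical; fchmod() sets the mode exactly, so the final permissions are not subject to the umask):

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create owner-only (0600), write, then widen to the final mode once
 * the content is in place, mirroring the fileio.c approach described
 * above. */
static int create_restricted(const char* path, mode_t final_mode)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0) return -1;
    /* ... write the file content here ... */
    if (fchmod(fd, final_mode) != 0) { close(fd); return -1; }
    return close(fd);
}

/* Test convenience: create, stat, clean up, report the final mode. */
static int demo_final_mode(void)
{
    const char* path = "/tmp/zstd_perm_demo.tmp";   /* hypothetical path */
    struct stat st;
    unlink(path);                  /* clear leftovers from earlier runs */
    if (create_restricted(path, 0644) != 0) return -1;
    if (stat(path, &st) != 0) return -1;
    unlink(path);
    return (int)(st.st_mode & 0777);
}
```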

programs/datagen.c (1)

144-186: Deterministic data generator looks consistent and safe

RDG_genBuffer() and RDG_genStdout() are well-bounded, seedable, and avoid obvious overflow or aliasing issues. Memory is checked and freed correctly. No changes needed.
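The key property being praised is seed-determinism: the same seed must always yield the same bytes so benchmark runs are reproducible. A minimal generator in that spirit (this uses plain xorshift64 and is NOT RDG's actual algorithm or byte distribution):

```c
#include <stdint.h>
#include <stddef.h>

/* xorshift64 step; the state must never be zero. */
static uint64_t xs64(uint64_t* s)
{
    *s ^= *s << 13; *s ^= *s >> 7; *s ^= *s << 17;
    return *s;
}

/* Fill dst deterministically from seed, like RDG_genBuffer()'s
 * contract (same seed, same bytes). */
static void gen_buffer(unsigned char* dst, size_t size, uint64_t seed)
{
    uint64_t state = seed ? seed : 1;
    size_t i;
    for (i = 0; i < size; i++)
        dst[i] = (unsigned char)(xs64(&state) >> 56);
}

/* Test convenience: checksum of a freshly generated 256-byte buffer. */
static uint64_t gen_checksum(uint64_t seed)
{
    unsigned char buf[256];
    uint64_t sum = 0;
    size_t i;
    gen_buffer(buf, sizeof(buf), seed);
    for (i = 0; i < sizeof(buf); i++) sum = sum * 31u + buf[i];
    return sum;
}
```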

lib/dictBuilder/zdict.c (1)

862-941: Dictionary finalization layout looks consistent and safe

ZDICT_finalizeDictionary() correctly enforces capacity checks, clamps dictContentSize, pads up to the maximum repcode distance, and then lays out [header][padding][content] while handling potential overlap via memmove(). This matches the expectations of the COVER/FASTCOVER callers and ZSTD’s dictionary loader.

If you want to double‑check behavior against your zstd version, you can diff this function with the vendor source to ensure no local drift.

contrib/seekable_format/zstdseek_compress.c (1)

268-351: Seek-table streaming logic appears correct

The ZSTD_seekable_seekTableSize(), ZSTD_stwrite32(), and ZSTD_seekable_writeSeekTable() trio correctly:

  • Compute the per-frame entry size based on checksumFlag.
  • Support partial writes by tracking seekTablePos / seekTableIndex and returning “bytes remaining” when the output buffer is too small.
  • Emit the skippable header, per-frame entries, footer count, SFD byte, and trailer magic in a resumable way.

This matches the intended non-blocking API where the caller can keep calling until the return value reaches 0.

Consider adding a small unit test that repeatedly calls ZSTD_seekable_writeSeekTable() with tiny output buffers (e.g., 1–2 bytes) to validate the resumability contract end-to-end.
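The resumability contract is easy to model and test without the seekable library. This toy writer (all names hypothetical) keeps a cursor, copies as much as fits into the caller's buffer on each call, and returns the bytes remaining; the caller loops until it gets 0, exactly like the intended use of ZSTD_seekable_writeSeekTable():

```c
#include <string.h>
#include <stddef.h>

typedef struct { const char* data; size_t len; size_t pos; } toy_writer_t;

/* Emit up to outCapacity bytes; return bytes still to be written. */
static size_t toy_write(toy_writer_t* w, char* out, size_t outCapacity)
{
    size_t const remaining = w->len - w->pos;
    size_t const n = remaining < outCapacity ? remaining : outCapacity;
    memcpy(out, w->data + w->pos, n);
    w->pos += n;
    return w->len - w->pos;
}

/* Drain a fixed 16-byte payload with a given chunk size (chunk >= 1)
 * and count how many calls it took, including the final one that
 * returns 0. */
static int drain_with_chunks(size_t chunk)
{
    toy_writer_t w = { "0123456789ABCDEF", 16, 0 };  /* 16-byte payload */
    char out[16];
    int calls = 0;
    while (toy_write(&w, out, chunk) > 0) calls++;
    return calls + 1;
}
```

A unit test along these lines, but driving the real ZSTD_seekable_writeSeekTable() with 1-2 byte buffers, would pin down the contract end-to-end.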

doc/educational_decoder/zstd_decompress.c (2)

384-409: Top-level decompression API is straightforward and consistent

ZSTD_decompress() and ZSTD_decompress_with_dict() correctly:

  • Wrap the raw pointers in istream_t / ostream_t for bounds-checked IO.
  • Handle an optional parsed dictionary.
  • Decode exactly one frame via decode_frame() and return the number of bytes written.

For an educational decoder that intentionally exit(1)s on error, this is clean and easy to follow, and matches the documented single-frame assumption.

If you plan to expose this beyond educational use, consider swapping the ERROR() macro to propagate error codes instead of calling exit(1), and adding a small test to validate dictionary vs non-dictionary behavior.


1405-1580: Dictionary parsing & application follow the spec and look safe

The dictionary parsing logic:

  • Differentiates between raw-content and formatted dictionaries via the magic (0xEC30A437).
  • Validates minimum size, recent offsets < dict size, and decodes entropy tables using the same HUF/FSE primitives as the frame decoder.
  • Applies dictionaries in frame_context_apply_dict() only when present and matching the requested dictionary_id, deep-copying tables so the dictionary object and frame context have independent lifetimes.

Given the educational scope, this is a solid and faithful implementation of the spec.

If you plan to fuzz this decoder, be sure to include both raw and formatted dictionaries with edge-case sizes to confirm the bounds checks and offset validations behave as expected.

contrib/seekable_format/zstdseek_decompress.c (1)

256-277: Good defensive checks when cloning the seek table

The added assert(zs != NULL) and if (zs->seekTable.entries == NULL) return NULL; guard against misuse and avoid dereferencing a null entries pointer before cloning. Copying the extra sentinel entry (tableLen + 1) keeps the cloned table consistent with the original sizing logic.

No changes needed here.

Comment on lines +35 to +48
ZSTD_CCtx* const zc = ZSTD_createCCtx();

int simpleSequenceProducerState = 0xdeadbeef;

// Here is the crucial bit of code!
ZSTD_registerSequenceProducer(
zc,
&simpleSequenceProducerState,
simpleSequenceProducer
);

{
size_t const res = ZSTD_CCtx_setParameter(zc, ZSTD_c_enableSeqProducerFallback, 1);
CHECK(res);

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Check CCtx creation and sequence‑producer registration results

ZSTD_createCCtx() is used without checking for NULL, and ZSTD_registerSequenceProducer()’s return value is ignored. On allocation failure or API error this will lead to later dereferences of an invalid context or silently continuing after a failed registration.

Consider validating both and bailing out (with cleanup) on error, e.g.:

-    ZSTD_CCtx* const zc = ZSTD_createCCtx();
+    ZSTD_CCtx* const zc = ZSTD_createCCtx();
+    if (zc == NULL) {
+        fprintf(stderr, "ERROR: failed to create ZSTD_CCtx\n");
+        return 1;
+    }
@@
-    ZSTD_registerSequenceProducer(
-        zc,
-        &simpleSequenceProducerState,
-        simpleSequenceProducer
-    );
+    {
+        size_t const res = ZSTD_registerSequenceProducer(
+            zc,
+            &simpleSequenceProducerState,
+            simpleSequenceProducer);
+        CHECK(res);
+    }



🏁 Script executed:

#!/bin/bash
# Read the file and extract the relevant section
cat contrib/externalSequenceProducer/main.c | head -60 | tail -35

Length of output: 895


🏁 Script executed:

#!/bin/bash
# Get more context: find the function containing these lines and see cleanup
cat contrib/externalSequenceProducer/main.c

Length of output: 2951


🏁 Script executed:

#!/bin/bash
# Check if ZSTD_registerSequenceProducer is used elsewhere in the file
rg -n "registerSequenceProducer" contrib/externalSequenceProducer/main.c

Length of output: 37


🏁 Script executed:

#!/bin/bash
# Check for CHECK macro definition and error handling patterns
rg -n -B2 -A2 "CHECK|ZSTD_createCCtx|ZSTD_registerSequenceProducer" contrib/externalSequenceProducer/main.c | head -80

Length of output: 922


🏁 Script executed:

#!/bin/bash
# Look for resource cleanup: search for free() and ZSTD_freeCCtx() calls
rg -n "free|ZSTD_free" contrib/externalSequenceProducer/main.c

Length of output: 83


🏁 Script executed:

#!/bin/bash
# Verify ZSTD API: check if there's any documentation or header hints
fd -e h "zstd" --search-path contrib/ --search-path "." 2>/dev/null | head -10

Length of output: 337


Add NULL check for ZSTD_createCCtx() and verify ZSTD_registerSequenceProducer() return value

Both issues are confirmed in the code:

  1. Line 35: ZSTD_createCCtx() is called without NULL checking. If allocation fails, the NULL pointer is dereferenced at line 40.

  2. Lines 40-44: ZSTD_registerSequenceProducer() return value is ignored. A registration failure is not detected and silently continues.

Additionally, note that cleanup at lines 102-105 is only reachable on the success path. If any CHECK fails before that point (e.g., line 48), zc will leak.

The proposed diff in the review is incomplete—it wraps registerSequenceProducer in CHECK but omits the NULL check for createCCtx(). Both should be added:

ZSTD_CCtx* const zc = ZSTD_createCCtx();
if (zc == NULL) {
    fprintf(stderr, "ERROR: failed to create ZSTD_CCtx\n");
    return 1;
}

// ... later ...

{
    size_t const res = ZSTD_registerSequenceProducer(
        zc,
        &simpleSequenceProducerState,
        simpleSequenceProducer);
    CHECK(res);
}
🤖 Prompt for AI Agents
In contrib/externalSequenceProducer/main.c around lines 35–48 (and consider
cleanup at 102–105), ZSTD_createCCtx() is not NULL-checked and
ZSTD_registerSequenceProducer() return value is ignored; add an immediate NULL
check after ZSTD_createCCtx() that prints an error and returns non-zero if
allocation fails, wrap the ZSTD_registerSequenceProducer() call and capture its
return value then CHECK() it (or handle the error), and ensure any early returns
free the context (call ZSTD_freeCCtx or equivalent) to avoid leaking zc when
later CHECKs fail.

Comment on lines +51 to +71
FILE *f = fopen(argv[1], "rb");
assert(f);
{
int const ret = fseek(f, 0, SEEK_END);
assert(ret == 0);
}
size_t const srcSize = ftell(f);
{
int const ret = fseek(f, 0, SEEK_SET);
assert(ret == 0);
}

char* const src = malloc(srcSize + 1);
assert(src);
{
size_t const ret = fread(src, srcSize, 1, f);
assert(ret == 1);
int const ret2 = fclose(f);
assert(ret2 == 0);
}


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Don’t rely on assert() for runtime I/O/allocation failures

fopen(), fseek(), ftell(), malloc(), fread(), and fclose() are only guarded by assert(). In release builds with NDEBUG defined these asserts disappear, and failures (e.g., missing file, OOM, short read) will lead to undefined behavior (e.g., calling fseek() on a NULL FILE*, using uninitialized buffers, or mis‑sized allocations).

Prefer explicit checks and graceful exits, for example:

-    FILE *f = fopen(argv[1], "rb");
-    assert(f);
+    FILE *f = fopen(argv[1], "rb");
+    if (!f) {
+        perror("fopen");
+        ZSTD_freeCCtx(zc);
+        return 1;
+    }
@@
-    size_t const srcSize = ftell(f);
+    long const tellRes = ftell(f);
+    if (tellRes < 0) {
+        perror("ftell");
+        fclose(f);
+        ZSTD_freeCCtx(zc);
+        return 1;
+    }
+    size_t const srcSize = (size_t)tellRes;
@@
-    char* const src = malloc(srcSize + 1);
-    assert(src);
+    char* const src = (char*)malloc(srcSize + 1);
+    if (!src) {
+        fclose(f);
+        ZSTD_freeCCtx(zc);
+        return 1;
+    }
@@
-        size_t const ret = fread(src, srcSize, 1, f);
-        assert(ret == 1);
+        size_t const ret = fread(src, 1, srcSize, f);
+        if (ret != srcSize) {
+            fprintf(stderr, "ERROR: short read\n");
+            fclose(f);
+            free(src);
+            ZSTD_freeCCtx(zc);
+            return 1;
+        }

I'll verify the review comment by examining the actual code and context in the repository.


🏁 Script executed:

#!/bin/bash
# Find and examine the file
fd -t f "main.c" | grep -i "externalSequenceProducer"

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Extract the full context around lines 51-71, plus surrounding code
if [ -f "contrib/externalSequenceProducer/main.c" ]; then
  echo "=== Full file context (lines 1-100) ==="
  head -n 100 "contrib/externalSequenceProducer/main.c" | cat -n
  echo ""
  echo "=== Checking for ZSTD_freeCCtx and zc variable ==="
  grep -n "ZSTD_freeCCtx\|zc" "contrib/externalSequenceProducer/main.c" | head -20
fi

Length of output: 3919


🏁 Script executed:

#!/bin/bash
# Check the entire file for assert usage and similar error handling patterns
if [ -f "contrib/externalSequenceProducer/main.c" ]; then
  echo "=== All assert() calls in file ==="
  grep -n "assert" "contrib/externalSequenceProducer/main.c"
fi

Length of output: 252


I need to see the complete file to verify the full cleanup path and ensure the review's suggested fixes are complete.


🏁 Script executed:

#!/bin/bash
# Get the complete file content
wc -l "contrib/externalSequenceProducer/main.c"
echo ""
echo "=== Full file content ==="
cat "contrib/externalSequenceProducer/main.c"

Length of output: 3022


Replace assert()-based error checks with explicit error handling to prevent undefined behavior in release builds.

The code relies entirely on assert() for runtime error handling at lines 52, 55, 60, 64, 67, 69, 74, and 80. With NDEBUG defined (standard in release builds), these asserts vanish, causing undefined behavior:

  • fopen() returning NULL leads to fseek() called on invalid pointer
  • malloc() failures go undetected, using NULL pointers
  • fread() short reads undetected

Additionally, there's a critical resource leak at line 99: the early return 1; on memcmp mismatch skips cleanup of allocated memory and the compression context, even in the current code.

Implement explicit checks with proper cleanup paths for all I/O and allocation operations (lines 51–70), and ensure all error paths properly free resources before returning.
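A self-contained version of the requested pattern, with every I/O and allocation result checked and every error path releasing what it acquired, so behavior is identical with or without NDEBUG (function names are illustrative, not from the PR):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read an entire file into a NUL-terminated buffer.
 * Returns the file size on success, -1 on any failure. */
static long read_whole_file(const char* path, char** out)
{
    FILE* f = fopen(path, "rb");
    long size;
    char* buf;
    *out = NULL;
    if (!f) return -1;
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return -1; }
    size = ftell(f);
    if (size < 0 || fseek(f, 0, SEEK_SET) != 0) { fclose(f); return -1; }
    buf = malloc((size_t)size + 1);          /* +1 keeps room for a NUL */
    if (!buf) { fclose(f); return -1; }
    if (fread(buf, 1, (size_t)size, f) != (size_t)size) {
        free(buf); fclose(f); return -1;
    }
    buf[size] = '\0';
    fclose(f);
    *out = buf;
    return size;
}

/* Test convenience: write a small file, read it back, report length. */
static long roundtrip_len(const char* path, const char* content)
{
    FILE* f = fopen(path, "wb");
    char* buf = NULL;
    long n;
    if (!f) return -1;
    fputs(content, f);
    fclose(f);
    n = read_whole_file(path, &buf);
    if (n >= 0 && strcmp(buf, content) != 0) n = -2;
    free(buf);
    remove(path);
    return n;
}
```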

Comment on lines +621 to +636
/* benched function */
size_t compress(const void* src, size_t srcSize, void* dst, size_t dstCapacity, void* payload)
{
compressInstructions* const ci = (compressInstructions*) payload;
(void)dstCapacity;

ZSTD_CCtx_refCDict(ci->cctx, ci->dictionaries.cdicts[ci->dictNb]);
ZSTD_compress2(ci->cctx,
dst, srcSize,
src, srcSize);

ci->dictNb = ci->dictNb + 1;
if (ci->dictNb >= ci->nbDicts) ci->dictNb = 0;

return srcSize;
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Fix compress() to honor dstCapacity and propagate ZSTD errors

compress() currently ignores the dstCapacity parameter and calls:

ZSTD_compress2(ci->cctx,
        dst, srcSize,
        src, srcSize);
...
return srcSize;

Issues:

  • srcSize is used as the destination capacity, even though the caller computed a larger capacity (dstBlocks.capacities[...]), so you can get ZSTD_error_dstSize_tooSmall despite having reserved enough space.
  • The return value from ZSTD_compress2() is ignored, so BMK_benchFunction’s errorFn (ZSTD_isError) never sees failures.
  • Returning srcSize unconditionally hides compression errors.

A safe pattern that preserves the “bytes processed per run” semantics of BMK_runOutcome_t.sumOfReturn:

 size_t compress(const void* src, size_t srcSize, void* dst, size_t dstCapacity, void* payload)
 {
     compressInstructions* const ci = (compressInstructions*) payload;
-    (void)dstCapacity;
-
-    ZSTD_CCtx_refCDict(ci->cctx, ci->dictionaries.cdicts[ci->dictNb]);
-    ZSTD_compress2(ci->cctx,
-            dst, srcSize,
-            src, srcSize);
+    ZSTD_CCtx_refCDict(ci->cctx, ci->dictionaries.cdicts[ci->dictNb]);
+    {
+        size_t const cSize = ZSTD_compress2(ci->cctx,
+                                            dst, dstCapacity,
+                                            src, srcSize);
+        if (ZSTD_isError(cSize)) {
+            return cSize;  /* let BMK_benchFunction + ZSTD_isError() surface the failure */
+        }
+    }
@@
-    return srcSize;
+    return srcSize;  /* bytes of input processed per run */
 }
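The caller's "larger capacity" is typically derived from ZSTD_compressBound(). For reference, a standalone version of the bound, reproduced here from the upstream ZSTD_COMPRESSBOUND macro as I recall it (verify against the zstd.h you build with):

```c
#include <stddef.h>

/* Worst-case compressed size for a given srcSize: srcSize, plus a
 * 1/256 expansion allowance, plus a small-input margin that shrinks
 * linearly from 64 bytes at srcSize == 0 to 0 at 128 KB. */
static size_t compress_bound(size_t srcSize)
{
    size_t const margin =
        srcSize < (size_t)(128 << 10)
            ? ((size_t)(128 << 10) - srcSize) >> 11
            : 0;
    return srcSize + (srcSize >> 8) + margin;
}
```

Sizing dst with this bound instead of srcSize is what guarantees ZSTD_compress2() can never fail with dstSize_tooSmall on incompressible input.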


---

🏁 Script executed:

```shell
#!/bin/bash
# Verify the compress() function code at lines 621-636
echo "=== Checking compress() function ==="
sed -n '621,636p' contrib/largeNbDicts/largeNbDicts.c

Length of output: 511


🏁 Script executed:

#!/bin/bash
# Find how dstBlocks capacities are allocated/assigned
echo "=== Search for dstBlocks capacity setup ==="
rg -n -A5 -B5 'dstBlocks.*capacit' contrib/largeNbDicts/largeNbDicts.c

Length of output: 434


🏁 Script executed:

#!/bin/bash
# Find compressBlocks() to see how dst buffer capacity is determined
echo "=== Search for compressBlocks function ==="
rg -n -A20 'compressBlocks\s*\(' contrib/largeNbDicts/largeNbDicts.c | head -60

Length of output: 2964


🏁 Script executed:

#!/bin/bash
# Search for BMK_benchFunction definition/usage to understand errorFn
echo "=== Search for BMK_benchFunction calls ==="
rg -n -B3 -A3 'BMK_benchFunction' contrib/largeNbDicts/largeNbDicts.c

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Check how compress() function signature relates to BMK_benchFunction requirements
echo "=== Search for compress function definition ==="
rg -n -B5 'size_t compress\(' contrib/largeNbDicts/largeNbDicts.c

Length of output: 222


🏁 Script executed:

#!/bin/bash
# Find all ZSTD_compress2 calls in the repo to see usage patterns
echo "=== ZSTD_compress2 usage patterns ==="
rg -n -B2 -A2 'ZSTD_compress2\(' --type=c | head -40

Length of output: 3306



---

🏁 Script executed:

```shell
#!/bin/bash
# Find where dstBlocks is created and how its capacities relate to srcSize
echo "=== Search for dstBlocks creation/initialization ==="
rg -n -B10 -A10 'dstBlocks.*=' contrib/largeNbDicts/largeNbDicts.c | grep -A15 -B15 'capacity\|allocate\|create' | head -80
```

Length of output: 831


🏁 Script executed:

```shell
#!/bin/bash
# Look for where buffers are allocated for compression/decompression benchmarking
echo "=== Search for buffer allocation in bench() function ==="
rg -n -A50 'static void bench\(' contrib/largeNbDicts/largeNbDicts.c | head -100
```

Length of output: 56


🏁 Script executed:

```shell
#!/bin/bash
# Check if there's a ZSTD_compress2 wrapper or if errors are supposed to be handled elsewhere
echo "=== Search for decompress function to compare patterns ==="
rg -n -A15 'size_t decompress\(' contrib/largeNbDicts/largeNbDicts.c | head -40
```

Length of output: 672


🏁 Script executed:

```shell
#!/bin/bash
# Search for how compress() is used/registered with the benchmark framework
echo "=== Search for compress function reference/usage ==="
rg -n 'compress' contrib/largeNbDicts/largeNbDicts.c | grep -i 'fn\|ptr\|func\|call'
```

Length of output: 117


Fix compress() to honor dstCapacity and propagate ZSTD errors

The compress() function at lines 621–636 has a critical defect: it ignores the dstCapacity parameter and does not capture the return value from ZSTD_compress2().

Compare with decompress() at lines 639–652, which correctly uses dstCapacity and captures the result. The benchmark framework (line 709) expects errors to be surfaced via errorFn = ZSTD_isError, but compress() unconditionally returns srcSize, masking every compression failure.

Specifically:

  • (void)dstCapacity; discards the caller-provided capacity.
  • ZSTD_compress2(ci->cctx, dst, srcSize, src, srcSize); uses srcSize as the destination capacity instead, risking ZSTD_error_dstSize_tooSmall even when sufficient space was reserved.
  • The return value is ignored, so errors never reach the benchmark framework.

Apply the fix from the review:

```diff
 size_t compress(const void* src, size_t srcSize, void* dst, size_t dstCapacity, void* payload)
 {
     compressInstructions* const ci = (compressInstructions*) payload;
-    (void)dstCapacity;
-
-    ZSTD_CCtx_refCDict(ci->cctx, ci->dictionaries.cdicts[ci->dictNb]);
-    ZSTD_compress2(ci->cctx,
-            dst, srcSize,
-            src, srcSize);
+    ZSTD_CCtx_refCDict(ci->cctx, ci->dictionaries.cdicts[ci->dictNb]);
+    {
+        size_t const cSize = ZSTD_compress2(ci->cctx,
+                                            dst, dstCapacity,
+                                            src, srcSize);
+        if (ZSTD_isError(cSize)) {
+            return cSize;  /* let BMK_benchFunction + ZSTD_isError() surface the failure */
+        }
+    }

     ci->dictNb = ci->dictNb + 1;
     if (ci->dictNb >= ci->nbDicts) ci->dictNb = 0;

     return srcSize;  /* bytes of input processed per run */
 }
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
The original:

```c
/* benched function */
size_t compress(const void* src, size_t srcSize, void* dst, size_t dstCapacity, void* payload)
{
    compressInstructions* const ci = (compressInstructions*) payload;
    (void)dstCapacity;
    ZSTD_CCtx_refCDict(ci->cctx, ci->dictionaries.cdicts[ci->dictNb]);
    ZSTD_compress2(ci->cctx,
            dst, srcSize,
            src, srcSize);
    ci->dictNb = ci->dictNb + 1;
    if (ci->dictNb >= ci->nbDicts) ci->dictNb = 0;
    return srcSize;
}
```

With the fix applied:

```c
/* benched function */
size_t compress(const void* src, size_t srcSize, void* dst, size_t dstCapacity, void* payload)
{
    compressInstructions* const ci = (compressInstructions*) payload;
    ZSTD_CCtx_refCDict(ci->cctx, ci->dictionaries.cdicts[ci->dictNb]);
    {
        size_t const cSize = ZSTD_compress2(ci->cctx,
                                            dst, dstCapacity,
                                            src, srcSize);
        if (ZSTD_isError(cSize)) {
            return cSize;  /* let BMK_benchFunction + ZSTD_isError() surface the failure */
        }
    }
    ci->dictNb = ci->dictNb + 1;
    if (ci->dictNb >= ci->nbDicts) ci->dictNb = 0;
    return srcSize;  /* bytes of input processed per run */
}
```
🤖 Prompt for AI Agents
In contrib/largeNbDicts/largeNbDicts.c around lines 621–636, compress()
currently ignores dstCapacity and discards ZSTD_compress2()'s return value;
change the call to pass the actual destination capacity and capture its return
so errors propagate to the benchmark framework: remove the "(void)dstCapacity;"
line, call ZSTD_compress2(ci->cctx, dst, dstCapacity, src, srcSize) and store
the returned size_t (e.g., cSize), then return that captured cSize (leave the
dictNb increment logic as-is) so ZSTD_isError can detect and handle failures.

Comment thread contrib/pzstd/Pzstd.h
Comment on lines +19 to +29
#define ZSTD_STATIC_LINKING_ONLY
#define ZSTD_DISABLE_DEPRECATE_WARNINGS /* No deprecation warnings, pzstd itself is deprecated
* and uses deprecated functions
*/
#include "zstd.h"
#undef ZSTD_STATIC_LINKING_ONLY

#include <cstddef>
#include <cstdint>
#include <memory>


⚠️ Potential issue | 🟠 Major

Make the header self-contained by including <cstdio> for FILE*.

This header declares several APIs taking FILE* but doesn’t include <cstdio>/<stdio.h>. Including Pzstd.h in a TU that hasn’t already pulled in <cstdio> will fail to compile due to an unknown FILE typedef. Add the appropriate standard header alongside the other includes here.

🤖 Prompt for AI Agents
In contrib/pzstd/Pzstd.h around lines 19 to 29, the header uses FILE* in its API
but does not include <cstdio>/<stdio.h>, so including Pzstd.h in a translation
unit without that header breaks compilation; add an appropriate C stdio include
(e.g., include <cstdio>) alongside the existing includes so FILE is defined and
the header is self-contained.

Comment thread contrib/pzstd/Pzstd.h
Comment on lines +31 to +38
/**
* Runs pzstd with `options` and returns the number of bytes written.
* An error occurred if `errorHandler.hasError()`.
*
* @param options The pzstd options to use for (de)compression
* @returns 0 upon success and non-zero on failure.
*/
int pzstdMain(const Options& options);

⚠️ Potential issue | 🟡 Minor

Clarify pzstdMain’s return-value contract.

The comment says this “returns the number of bytes written”, but the @returns line states “0 upon success and non-zero on failure”. These are contradictory. Please decide whether the function returns a byte count or an error code and update the documentation (and implementation, if needed) so callers can reliably interpret the result.

🤖 Prompt for AI Agents
In contrib/pzstd/Pzstd.h around lines 31 to 38, the documentation for pzstdMain
contradicts itself by saying it "returns the number of bytes written" but the
@returns says "0 upon success and non-zero on failure"; decide which contract we
want (either return a byte count with negative/error sentinel or return an int
error code and provide bytes-written via an out-parameter or errorHandler),
update the function documentation to precisely state the chosen contract, and
modify the implementation and all call sites to match: if choosing byte-count,
make return type convey bytes and return -1 or set errorHandler on failure; if
choosing error-code, change the doc to say returns 0 on success and non-zero
error code and move bytes-written into an out parameter or accessor. Ensure the
header comment, function signature/implementation, and callers are consistent.

Comment thread programs/benchzstd.c
Comment on lines +447 to +529
if (adv->mode == BMK_decodeOnly) {
/* benchmark only decompression : source must be already compressed */
const char* srcPtr = (const char*)srcBuffer;
U64 totalDSize64 = 0;
U32 fileNb;
for (fileNb = 0; fileNb < nbFiles; fileNb++) {
U64 const fSize64 =
ZSTD_findDecompressedSize(srcPtr, fileSizes[fileNb]);
if (fSize64 == ZSTD_CONTENTSIZE_UNKNOWN) {
RETURN_ERROR(
32,
BMK_benchOutcome_t,
"Decompressed size cannot be determined: cannot benchmark");
}
if (fSize64 == ZSTD_CONTENTSIZE_ERROR) {
RETURN_ERROR(
32,
BMK_benchOutcome_t,
"Error while trying to assess decompressed size: data may be invalid");
}
totalDSize64 += fSize64;
srcPtr += fileSizes[fileNb];
}
{
size_t const decodedSize = (size_t)totalDSize64;
assert((U64)decodedSize == totalDSize64); /* check overflow */
free(*resultBufferPtr);
if (totalDSize64 > decodedSize) { /* size_t overflow */
RETURN_ERROR(
32,
BMK_benchOutcome_t,
"decompressed size is too large for local system");
}
*resultBufferPtr = malloc(decodedSize);
if (!(*resultBufferPtr)) {
RETURN_ERROR(
33,
BMK_benchOutcome_t,
"allocation error: not enough memory");
}
cSize = srcSize;
srcSize = decodedSize;
ratio = (double)srcSize / (double)cSize;
}
}

/* Init data chunks */
{
const char* srcPtr = (const char*)srcBuffer;
char* cPtr = (char*)compressedBuffer;
char* resPtr = (char*)(*resultBufferPtr);
U32 fileNb, chunkID;
for (chunkID = 0, fileNb = 0; fileNb < nbFiles; fileNb++) {
size_t remaining = fileSizes[fileNb];
U32 const nbChunksforThisFile = (adv->mode == BMK_decodeOnly)
? 1
: (U32)((remaining + (chunkSizeMax - 1)) / chunkSizeMax);
U32 const chunkIdEnd = chunkID + nbChunksforThisFile;
for (; chunkID < chunkIdEnd; chunkID++) {
size_t const chunkSize = MIN(remaining, chunkSizeMax);
srcPtrs[chunkID] = srcPtr;
srcSizes[chunkID] = chunkSize;
cPtrs[chunkID] = cPtr;
cCapacities[chunkID] = (adv->mode == BMK_decodeOnly)
? chunkSize
: ZSTD_compressBound(chunkSize);
resPtrs[chunkID] = resPtr;
resSizes[chunkID] = (adv->mode == BMK_decodeOnly)
? (size_t)ZSTD_findDecompressedSize(
srcPtr, chunkSize)
: chunkSize;
srcPtr += chunkSize;
cPtr += cCapacities[chunkID];
resPtr += chunkSize;
remaining -= chunkSize;
if (adv->mode == BMK_decodeOnly) {
cSizes[chunkID] = chunkSize;
benchResult.cSize = chunkSize;
}
}
}
nbChunks = chunkID;
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Decode-only bench can write past resultBuffer

In BMK_benchMemAdvancedNoAlloc()’s BMK_decodeOnly path, you:

  • Compute the total decompressed size (decodedSize) and allocate *resultBufferPtr accordingly.
  • In the chunking loop you do:
```c
resPtrs[chunkID]      = resPtr;
resSizes[chunkID]     = (size_t)ZSTD_findDecompressedSize(srcPtr, chunkSize);
...
resPtr               += chunkSize;  /* compressed size */
```

Later, local_defaultDecompress() is called with dstBuffers = resPtrs and dstCapacities = resSizes, so each decompression writes resSizes[chunkID] bytes starting at resPtr that was advanced by the compressed size, not the decompressed size.

If for any file decompressedSize > compressedSize (which is very possible for small or poorly compressible inputs), resPtrs[chunkID] + resSizes[chunkID] can exceed the end of resultBuffer, leading to heap overflow.

You already computed all decompressed sizes up-front, so the layout can be fixed by advancing resPtr by the decompressed size in decode-only mode:

```diff
-                resPtrs[chunkID]       = resPtr;
-                resSizes[chunkID]      = (adv->mode == BMK_decodeOnly)
-                                ? (size_t)ZSTD_findDecompressedSize(
-                                srcPtr, chunkSize)
-                                : chunkSize;
+                resPtrs[chunkID]       = resPtr;
+                if (adv->mode == BMK_decodeOnly) {
+                    size_t const decSize =
+                            (size_t)ZSTD_findDecompressedSize(srcPtr, chunkSize);
+                    resSizes[chunkID] = decSize;
+                    resPtr           += decSize;
+                } else {
+                    resSizes[chunkID] = chunkSize;
+                    resPtr           += chunkSize;
+                }
@@
-                srcPtr += chunkSize;
-                cPtr += cCapacities[chunkID];
-                resPtr += chunkSize;
+                srcPtr += chunkSize;
+                cPtr   += cCapacities[chunkID];
```
(Adjust to avoid double-incrementing resPtr in the non-decode path.)

This keeps each dstBuffer slice within the single resultBuffer of size decodedSize.

You can validate this by constructing a small corpus where compressed size > decompressed size (e.g., very short inputs) and running benchzstd in decode-only mode under ASan/Valgrind to confirm the overflow disappears after the change.


🏁 Script executed:

```shell
#!/bin/bash
# Extract the relevant code section and surrounding context
cat programs/benchzstd.c | head -600 | tail -200
```

Length of output: 7593


🏁 Script executed:

```shell
#!/bin/bash
# Search for where resPtrs and resSizes are used in decompression
rg -n "resPtrs\|resSizes" programs/benchzstd.c -A 2 -B 2
```

Length of output: 0


🏁 Script executed:

```shell
#!/bin/bash
# Search for decompression function calls that use these buffers
rg -n "ZSTD_decompress|dstBuffers|local_defaultDecompress" programs/benchzstd.c -C 3
```

Length of output: 1273


🏁 Script executed:

```shell
#!/bin/bash
# Find the function signature and overall structure
rg -n "BMK_benchMemAdvancedNoAlloc" programs/benchzstd.c -A 5
```

Length of output: 410


Decode-only bench has a confirmed heap overflow caused by misaligned buffer advancement

The issue is real and critical. In BMK_benchMemAdvancedNoAlloc():

  • resultBuffer is allocated to hold the total decompressed size of all files.
  • In decode-only mode, each file gets exactly 1 chunk (nbChunksforThisFile = 1).
  • For each chunk, you set resSizes[chunkID] to that file's decompressed size but advance resPtr by the file's compressed size.

When multiple files exist where decompressedSize > compressedSize, the buffer slices overlap:

Example: File1 (100B compressed → 300B decompressed), File2 (100B compressed → 200B decompressed)

  • Total decompressed = 500B (buffer size)
  • Layout should be: File1 [0–299], File2 [300–499]
  • Actual result: File1 [0–299], resPtr advances by 100 → File2 starts at [100–299] ✗

When local_defaultDecompress() decompresses File2 at address resPtrs[1] with capacity resSizes[1], it writes past the intended boundary into File1's region, causing heap corruption.

The fix is to advance resPtr by decompressed size in decode-only mode:

```diff
-                resSizes[chunkID]      = (adv->mode == BMK_decodeOnly)
-                                ? (size_t)ZSTD_findDecompressedSize(
-                                srcPtr, chunkSize)
-                                : chunkSize;
-                srcPtr += chunkSize;
-                cPtr += cCapacities[chunkID];
-                resPtr += chunkSize;
+                if (adv->mode == BMK_decodeOnly) {
+                    size_t const decSize = (size_t)ZSTD_findDecompressedSize(srcPtr, chunkSize);
+                    resSizes[chunkID] = decSize;
+                    resPtr += decSize;
+                } else {
+                    resSizes[chunkID] = chunkSize;
+                    resPtr += chunkSize;
+                }
+                srcPtr += chunkSize;
+                cPtr += cCapacities[chunkID];
```
📝 Committable suggestion


Suggested change
The original:

if (adv->mode == BMK_decodeOnly) {
/* benchmark only decompression : source must be already compressed */
const char* srcPtr = (const char*)srcBuffer;
U64 totalDSize64 = 0;
U32 fileNb;
for (fileNb = 0; fileNb < nbFiles; fileNb++) {
U64 const fSize64 =
ZSTD_findDecompressedSize(srcPtr, fileSizes[fileNb]);
if (fSize64 == ZSTD_CONTENTSIZE_UNKNOWN) {
RETURN_ERROR(
32,
BMK_benchOutcome_t,
"Decompressed size cannot be determined: cannot benchmark");
}
if (fSize64 == ZSTD_CONTENTSIZE_ERROR) {
RETURN_ERROR(
32,
BMK_benchOutcome_t,
"Error while trying to assess decompressed size: data may be invalid");
}
totalDSize64 += fSize64;
srcPtr += fileSizes[fileNb];
}
{
size_t const decodedSize = (size_t)totalDSize64;
assert((U64)decodedSize == totalDSize64); /* check overflow */
free(*resultBufferPtr);
if (totalDSize64 > decodedSize) { /* size_t overflow */
RETURN_ERROR(
32,
BMK_benchOutcome_t,
"decompressed size is too large for local system");
}
*resultBufferPtr = malloc(decodedSize);
if (!(*resultBufferPtr)) {
RETURN_ERROR(
33,
BMK_benchOutcome_t,
"allocation error: not enough memory");
}
cSize = srcSize;
srcSize = decodedSize;
ratio = (double)srcSize / (double)cSize;
}
}
/* Init data chunks */
{
const char* srcPtr = (const char*)srcBuffer;
char* cPtr = (char*)compressedBuffer;
char* resPtr = (char*)(*resultBufferPtr);
U32 fileNb, chunkID;
for (chunkID = 0, fileNb = 0; fileNb < nbFiles; fileNb++) {
size_t remaining = fileSizes[fileNb];
U32 const nbChunksforThisFile = (adv->mode == BMK_decodeOnly)
? 1
: (U32)((remaining + (chunkSizeMax - 1)) / chunkSizeMax);
U32 const chunkIdEnd = chunkID + nbChunksforThisFile;
for (; chunkID < chunkIdEnd; chunkID++) {
size_t const chunkSize = MIN(remaining, chunkSizeMax);
srcPtrs[chunkID] = srcPtr;
srcSizes[chunkID] = chunkSize;
cPtrs[chunkID] = cPtr;
cCapacities[chunkID] = (adv->mode == BMK_decodeOnly)
? chunkSize
: ZSTD_compressBound(chunkSize);
resPtrs[chunkID] = resPtr;
resSizes[chunkID] = (adv->mode == BMK_decodeOnly)
? (size_t)ZSTD_findDecompressedSize(
srcPtr, chunkSize)
: chunkSize;
srcPtr += chunkSize;
cPtr += cCapacities[chunkID];
resPtr += chunkSize;
remaining -= chunkSize;
if (adv->mode == BMK_decodeOnly) {
cSizes[chunkID] = chunkSize;
benchResult.cSize = chunkSize;
}
}
}
nbChunks = chunkID;
}
With the fix applied:

if (adv->mode == BMK_decodeOnly) {
/* benchmark only decompression : source must be already compressed */
const char* srcPtr = (const char*)srcBuffer;
U64 totalDSize64 = 0;
U32 fileNb;
for (fileNb = 0; fileNb < nbFiles; fileNb++) {
U64 const fSize64 =
ZSTD_findDecompressedSize(srcPtr, fileSizes[fileNb]);
if (fSize64 == ZSTD_CONTENTSIZE_UNKNOWN) {
RETURN_ERROR(
32,
BMK_benchOutcome_t,
"Decompressed size cannot be determined: cannot benchmark");
}
if (fSize64 == ZSTD_CONTENTSIZE_ERROR) {
RETURN_ERROR(
32,
BMK_benchOutcome_t,
"Error while trying to assess decompressed size: data may be invalid");
}
totalDSize64 += fSize64;
srcPtr += fileSizes[fileNb];
}
{
size_t const decodedSize = (size_t)totalDSize64;
assert((U64)decodedSize == totalDSize64); /* check overflow */
free(*resultBufferPtr);
if (totalDSize64 > decodedSize) { /* size_t overflow */
RETURN_ERROR(
32,
BMK_benchOutcome_t,
"decompressed size is too large for local system");
}
*resultBufferPtr = malloc(decodedSize);
if (!(*resultBufferPtr)) {
RETURN_ERROR(
33,
BMK_benchOutcome_t,
"allocation error: not enough memory");
}
cSize = srcSize;
srcSize = decodedSize;
ratio = (double)srcSize / (double)cSize;
}
}
/* Init data chunks */
{
const char* srcPtr = (const char*)srcBuffer;
char* cPtr = (char*)compressedBuffer;
char* resPtr = (char*)(*resultBufferPtr);
U32 fileNb, chunkID;
for (chunkID = 0, fileNb = 0; fileNb < nbFiles; fileNb++) {
size_t remaining = fileSizes[fileNb];
U32 const nbChunksforThisFile = (adv->mode == BMK_decodeOnly)
? 1
: (U32)((remaining + (chunkSizeMax - 1)) / chunkSizeMax);
U32 const chunkIdEnd = chunkID + nbChunksforThisFile;
for (; chunkID < chunkIdEnd; chunkID++) {
size_t const chunkSize = MIN(remaining, chunkSizeMax);
srcPtrs[chunkID] = srcPtr;
srcSizes[chunkID] = chunkSize;
cPtrs[chunkID] = cPtr;
cCapacities[chunkID] = (adv->mode == BMK_decodeOnly)
? chunkSize
: ZSTD_compressBound(chunkSize);
resPtrs[chunkID] = resPtr;
if (adv->mode == BMK_decodeOnly) {
size_t const decSize = (size_t)ZSTD_findDecompressedSize(srcPtr, chunkSize);
resSizes[chunkID] = decSize;
resPtr += decSize;
} else {
resSizes[chunkID] = chunkSize;
resPtr += chunkSize;
}
srcPtr += chunkSize;
cPtr += cCapacities[chunkID];
remaining -= chunkSize;
if (adv->mode == BMK_decodeOnly) {
cSizes[chunkID] = chunkSize;
benchResult.cSize = chunkSize;
}
}
}
nbChunks = chunkID;
}
🤖 Prompt for AI Agents
In programs/benchzstd.c around lines 447 to 529, resPtr is advanced by the
compressed size (chunkSize) in decode-only mode, causing overlapping result
buffers when decompressed sizes > compressed sizes; change the advancement so
resPtr moves by the decompressed size for decode-only chunks (i.e. use
resSizes[chunkID] or the computed decompressed size) instead of chunkSize when
adv->mode == BMK_decodeOnly, keeping the existing logic for other modes and
ensuring resSizes is computed before advancing resPtr.

Comment thread programs/fileio_asyncio.c
Comment on lines +54 to +105
/* avoid int overflow */
if (storedSkips > 1 GB) {
if (LONG_SEEK(file, 1 GB, SEEK_CUR) != 0)
EXM_THROW(91, "1 GB skip error (sparse file support)");
storedSkips -= 1 GB;
}

while (ptrT < bufferTEnd) {
size_t nb0T;

/* adjust last segment if < 32 KB */
size_t seg0SizeT = segmentSizeT;
if (seg0SizeT > bufferSizeT) seg0SizeT = bufferSizeT;
bufferSizeT -= seg0SizeT;

/* count leading zeroes */
for (nb0T=0; (nb0T < seg0SizeT) && (ptrT[nb0T] == 0); nb0T++) ;
storedSkips += (unsigned)(nb0T * sizeof(size_t));

if (nb0T != seg0SizeT) { /* not all 0s */
size_t const nbNon0ST = seg0SizeT - nb0T;
/* skip leading zeros */
if (LONG_SEEK(file, storedSkips, SEEK_CUR) != 0)
EXM_THROW(92, "Sparse skip error ; try --no-sparse");
storedSkips = 0;
/* write the rest */
if (fwrite(ptrT + nb0T, sizeof(size_t), nbNon0ST, file) != nbNon0ST)
EXM_THROW(93, "Write error : cannot write block : %s",
strerror(errno));
}
ptrT += seg0SizeT;
}

{ static size_t const maskT = sizeof(size_t)-1;
if (bufferSize & maskT) {
/* size not multiple of sizeof(size_t) : implies end of block */
const char* const restStart = (const char*)bufferTEnd;
const char* restPtr = restStart;
const char* const restEnd = (const char*)buffer + bufferSize;
assert(restEnd > restStart && restEnd < restStart + sizeof(size_t));
for ( ; (restPtr < restEnd) && (*restPtr == 0); restPtr++) ;
storedSkips += (unsigned) (restPtr - restStart);
if (restPtr != restEnd) {
/* not all remaining bytes are 0 */
size_t const restSize = (size_t)(restEnd - restPtr);
if (LONG_SEEK(file, storedSkips, SEEK_CUR) != 0)
EXM_THROW(92, "Sparse skip error ; try --no-sparse");
if (fwrite(restPtr, 1, restSize, file) != restSize)
EXM_THROW(95, "Write error : cannot write end of decoded block : %s",
strerror(errno));
storedSkips = 0;
} } }

⚠️ Potential issue | 🔴 Critical

Prevent sparse skip accumulator overflow

Line 55 only trims a single 1 GB chunk, so one buffer containing ≥4 GB of zeros drives the unsigned storedSkips past UINT32_MAX before we ever hit Line 79 or Line 95. The counter wraps, the subsequent seek/write targets the wrong offset, and the sparse file is silently corrupted. Please flush in 1 GB chunks every time we cross the threshold—both on entry and after each increment—so storedSkips never exceeds 1 GB.

```diff
-    if (storedSkips > 1 GB) {
-        if (LONG_SEEK(file, 1 GB, SEEK_CUR) != 0)
-        EXM_THROW(91, "1 GB skip error (sparse file support)");
-        storedSkips -= 1 GB;
-    }
+    while (storedSkips > 1 GB) {
+        if (LONG_SEEK(file, 1 GB, SEEK_CUR) != 0)
+            EXM_THROW(91, "1 GB skip error (sparse file support)");
+        storedSkips -= 1 GB;
+    }
@@
-        storedSkips += (unsigned)(nb0T * sizeof(size_t));
+        storedSkips += (unsigned)(nb0T * sizeof(size_t));
+        while (storedSkips > 1 GB) {
+            if (LONG_SEEK(file, 1 GB, SEEK_CUR) != 0)
+                EXM_THROW(91, "1 GB skip error (sparse file support)");
+            storedSkips -= 1 GB;
+        }
@@
-            storedSkips += (unsigned) (restPtr - restStart);
+            storedSkips += (unsigned) (restPtr - restStart);
+            while (storedSkips > 1 GB) {
+                if (LONG_SEEK(file, 1 GB, SEEK_CUR) != 0)
+                    EXM_THROW(91, "1 GB skip error (sparse file support)");
+                storedSkips -= 1 GB;
+            }
```
📝 Committable suggestion


Suggested change
/* avoid int overflow */
if (storedSkips > 1 GB) {
if (LONG_SEEK(file, 1 GB, SEEK_CUR) != 0)
EXM_THROW(91, "1 GB skip error (sparse file support)");
storedSkips -= 1 GB;
}
while (ptrT < bufferTEnd) {
size_t nb0T;
/* adjust last segment if < 32 KB */
size_t seg0SizeT = segmentSizeT;
if (seg0SizeT > bufferSizeT) seg0SizeT = bufferSizeT;
bufferSizeT -= seg0SizeT;
/* count leading zeroes */
for (nb0T=0; (nb0T < seg0SizeT) && (ptrT[nb0T] == 0); nb0T++) ;
storedSkips += (unsigned)(nb0T * sizeof(size_t));
if (nb0T != seg0SizeT) { /* not all 0s */
size_t const nbNon0ST = seg0SizeT - nb0T;
/* skip leading zeros */
if (LONG_SEEK(file, storedSkips, SEEK_CUR) != 0)
EXM_THROW(92, "Sparse skip error ; try --no-sparse");
storedSkips = 0;
/* write the rest */
if (fwrite(ptrT + nb0T, sizeof(size_t), nbNon0ST, file) != nbNon0ST)
EXM_THROW(93, "Write error : cannot write block : %s",
strerror(errno));
}
ptrT += seg0SizeT;
}
{ static size_t const maskT = sizeof(size_t)-1;
if (bufferSize & maskT) {
/* size not multiple of sizeof(size_t) : implies end of block */
const char* const restStart = (const char*)bufferTEnd;
const char* restPtr = restStart;
const char* const restEnd = (const char*)buffer + bufferSize;
assert(restEnd > restStart && restEnd < restStart + sizeof(size_t));
for ( ; (restPtr < restEnd) && (*restPtr == 0); restPtr++) ;
storedSkips += (unsigned) (restPtr - restStart);
if (restPtr != restEnd) {
/* not all remaining bytes are 0 */
size_t const restSize = (size_t)(restEnd - restPtr);
if (LONG_SEEK(file, storedSkips, SEEK_CUR) != 0)
EXM_THROW(92, "Sparse skip error ; try --no-sparse");
if (fwrite(restPtr, 1, restSize, file) != restSize)
EXM_THROW(95, "Write error : cannot write end of decoded block : %s",
strerror(errno));
storedSkips = 0;
} } }
With the fix applied:

/* avoid int overflow */
while (storedSkips > 1 GB) {
if (LONG_SEEK(file, 1 GB, SEEK_CUR) != 0)
EXM_THROW(91, "1 GB skip error (sparse file support)");
storedSkips -= 1 GB;
}
while (ptrT < bufferTEnd) {
size_t nb0T;
/* adjust last segment if < 32 KB */
size_t seg0SizeT = segmentSizeT;
if (seg0SizeT > bufferSizeT) seg0SizeT = bufferSizeT;
bufferSizeT -= seg0SizeT;
/* count leading zeroes */
for (nb0T=0; (nb0T < seg0SizeT) && (ptrT[nb0T] == 0); nb0T++) ;
storedSkips += (unsigned)(nb0T * sizeof(size_t));
while (storedSkips > 1 GB) {
if (LONG_SEEK(file, 1 GB, SEEK_CUR) != 0)
EXM_THROW(91, "1 GB skip error (sparse file support)");
storedSkips -= 1 GB;
}
if (nb0T != seg0SizeT) { /* not all 0s */
size_t const nbNon0ST = seg0SizeT - nb0T;
/* skip leading zeros */
if (LONG_SEEK(file, storedSkips, SEEK_CUR) != 0)
EXM_THROW(92, "Sparse skip error ; try --no-sparse");
storedSkips = 0;
/* write the rest */
if (fwrite(ptrT + nb0T, sizeof(size_t), nbNon0ST, file) != nbNon0ST)
EXM_THROW(93, "Write error : cannot write block : %s",
strerror(errno));
}
ptrT += seg0SizeT;
}
{ static size_t const maskT = sizeof(size_t)-1;
if (bufferSize & maskT) {
/* size not multiple of sizeof(size_t) : implies end of block */
const char* const restStart = (const char*)bufferTEnd;
const char* restPtr = restStart;
const char* const restEnd = (const char*)buffer + bufferSize;
assert(restEnd > restStart && restEnd < restStart + sizeof(size_t));
for ( ; (restPtr < restEnd) && (*restPtr == 0); restPtr++) ;
storedSkips += (unsigned) (restPtr - restStart);
while (storedSkips > 1 GB) {
if (LONG_SEEK(file, 1 GB, SEEK_CUR) != 0)
EXM_THROW(91, "1 GB skip error (sparse file support)");
storedSkips -= 1 GB;
}
if (restPtr != restEnd) {
/* not all remaining bytes are 0 */
size_t const restSize = (size_t)(restEnd - restPtr);
if (LONG_SEEK(file, storedSkips, SEEK_CUR) != 0)
EXM_THROW(92, "Sparse skip error ; try --no-sparse");
if (fwrite(restPtr, 1, restSize, file) != restSize)
EXM_THROW(95, "Write error : cannot write end of decoded block : %s",
strerror(errno));
storedSkips = 0;
} } }

Comment thread programs/util.c
Comment on lines +908 to +933
#ifdef _WIN32
static int UTIL_prepareFileList(const char* dirName,
char** bufStart, size_t* pos,
char** bufEnd, int followLinks)
{
char* path;
size_t dirLength, pathLength;
int nbFiles = 0;
WIN32_FIND_DATAA cFile;
HANDLE hFile;

dirLength = strlen(dirName);
path = (char*) malloc(dirLength + 3);
if (!path) return 0;

memcpy(path, dirName, dirLength);
path[dirLength] = '\\';
path[dirLength+1] = '*';
path[dirLength+2] = 0;

hFile=FindFirstFileA(path, &cFile);
if (hFile == INVALID_HANDLE_VALUE) {
UTIL_DISPLAYLEVEL(1, "Cannot open directory '%s'\n", dirName);
return 0;
}
free(path);

⚠️ Potential issue | 🟡 Minor

Free path on FindFirstFileA failure in UTIL_prepareFileList (Windows leak).

When FindFirstFileA() fails, the function returns 0 without freeing path, leaking a small heap buffer on the error path.

Suggested fix:

```diff
-    hFile=FindFirstFileA(path, &cFile);
-    if (hFile == INVALID_HANDLE_VALUE) {
-        UTIL_DISPLAYLEVEL(1, "Cannot open directory '%s'\n", dirName);
-        return 0;
-    }
-    free(path);
+    hFile = FindFirstFileA(path, &cFile);
+    if (hFile == INVALID_HANDLE_VALUE) {
+        UTIL_DISPLAYLEVEL(1, "Cannot open directory '%s'\n", dirName);
+        free(path);
+        return 0;
+    }
+    free(path);
```
🤖 Prompt for AI Agents
In programs/util.c around lines 908 to 933, the code allocates 'path' then calls
FindFirstFileA and returns 0 on failure without freeing 'path', leaking memory;
modify the failure branch so you free(path) before returning (i.e., call
free(path) immediately after detecting hFile == INVALID_HANDLE_VALUE, then log
and return 0) to ensure the allocated buffer is released on the error path.

Comment on lines +128 to +138
static void* ZWRAP_customCalloc(size_t size, ZSTD_customMem customMem)
{
if (customMem.customAlloc) {
/* calloc implemented as malloc+memset;
* not as efficient as calloc, but next best guess for custom malloc */
void* const ptr = customMem.customAlloc(customMem.opaque, size);
memset(ptr, 0, size);
return ptr;
}
return calloc(1, size);
}

⚠️ Potential issue | 🟠 Major

Guard against NULL from custom allocator in ZWRAP_customCalloc.

If customMem.customAlloc returns NULL (e.g., OOM), memset(ptr, 0, size) dereferences NULL and triggers undefined behavior instead of propagating a clean failure.

Suggested fix:

```diff
 static void* ZWRAP_customCalloc(size_t size, ZSTD_customMem customMem)
 {
     if (customMem.customAlloc) {
         /* calloc implemented as malloc+memset;
          * not as efficient as calloc, but next best guess for custom malloc */
-        void* const ptr = customMem.customAlloc(customMem.opaque, size);
-        memset(ptr, 0, size);
-        return ptr;
+        void* const ptr = customMem.customAlloc(customMem.opaque, size);
+        if (ptr == NULL) {
+            return NULL;
+        }
+        memset(ptr, 0, size);
+        return ptr;
     }
     return calloc(1, size);
 }
```
🤖 Prompt for AI Agents
In zlibWrapper/zstd_zlibwrapper.c around lines 128 to 138, the custom calloc
uses customMem.customAlloc but does not check its return value before calling
memset, which can dereference NULL; change the logic so that after calling
customMem.customAlloc you check if ptr is NULL and if so return NULL
immediately, otherwise call memset(ptr, 0, size) and return ptr; keep the
fallback to return calloc(1, size) when customAlloc is not provided.

Comment on lines +769 to +775
{ int const result = (flush == Z_INFLATE_SYNC) ?
inflateSync(strm) :
inflate(strm, flush);
LOG_WRAPPERD("- inflate3 flush=%d avail_in=%d avail_out=%d total_in=%d total_out=%d res=%d\n",
(int)flush, (int)strm->avail_in, (int)strm->avail_out, (int)strm->total_in, (int)strm->total_out, res);
return result;
} }

⚠️ Potential issue | 🔴 Critical

Fix undeclared variable res in z_inflate logging (compilation error).

Both LOG_WRAPPERD calls use res which is not defined; the local variable is result. This will fail to compile.

Suggested fix:

                {   int const result = (flush == Z_INFLATE_SYNC) ?
                                        inflateSync(strm) :
                                        inflate(strm, flush);
                    LOG_WRAPPERD("- inflate3 flush=%d avail_in=%d avail_out=%d total_in=%d total_out=%d res=%d\n",
-                                 (int)flush, (int)strm->avail_in, (int)strm->avail_out, (int)strm->total_in, (int)strm->total_out, res);
+                                 (int)flush, (int)strm->avail_in, (int)strm->avail_out,
+                                 (int)strm->total_in, (int)strm->total_out, result);
                    return result;
            }   }
                {   int const result = (flush == Z_INFLATE_SYNC) ?
                                       inflateSync(strm) :
                                       inflate(strm, flush);
                    LOG_WRAPPERD("- inflate2 flush=%d avail_in=%d avail_out=%d total_in=%d total_out=%d res=%d\n",
-                                 (int)flush, (int)strm->avail_in, (int)strm->avail_out, (int)strm->total_in, (int)strm->total_out, res);
+                                 (int)flush, (int)strm->avail_in, (int)strm->avail_out,
+                                 (int)strm->total_in, (int)strm->total_out, result);
                    return result;
        }   }   }  /* if ! (zwd->totalInBytes == 0 && strm->avail_in >= ZLIB_HEADERSIZE) */

Also applies to: 820-825

🤖 Prompt for AI Agents
In zlibWrapper/zstd_zlibwrapper.c around lines 769-775 (and also change the
analogous LOG_WRAPPERD at 820-825), the debug log calls reference an undeclared
variable `res` while the actual local variable is named `result`; update the
LOG_WRAPPERD invocations to pass `result` (with the same casts used for other
numeric args, e.g., (int)result) instead of `res` so the code compiles and logs
the correct return value.
