A lossless data compression engine written in C++ using LZ77 and Huffman Coding, built to test custom algorithmic performance against standard zlib.
- Parallel Processing: Uses C++ `std::async` to compress multiple file chunks concurrently, then writes the output in the original chunk order.
- LZ77 Sliding Window: Uses a 32 KB window to find repeated byte sequences and replace them with back-references.
- Fast String Matching: Uses a custom hash function and flat arrays to look up match candidates in O(1) time per position.
- Huffman Coding: Builds a Huffman tree from the symbol frequencies of each 2 MB block so that common symbols are encoded in fewer bits.
- Data Integrity: Uses CRC32 checksums to verify that the extracted file matches the original.
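
The sliding-window and hash-lookup ideas above can be sketched roughly as follows. This is not the project's code: the hash function, the table size, the minimum and maximum match lengths, and the names `hash3` and `find_matches` are all assumptions made for illustration. The sketch also checks only the most recent candidate per hash slot, which keeps each lookup O(1) but can miss longer matches.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kWindowSize = 32 * 1024;  // 32 KB sliding window (from the text)
constexpr std::size_t kMinMatch   = 3;          // assumed minimum useful match length
constexpr std::size_t kMaxMatch   = 258;        // assumed cap, borrowed from DEFLATE
constexpr std::size_t kHashBits   = 15;         // assumed table size: 2^15 slots

struct Match {
    std::size_t distance;  // how far back the match starts (0 = no match)
    std::size_t length;    // number of matching bytes (0 = no match)
};

// Hash the next three bytes into a table index (illustrative multiplicative hash).
inline std::uint32_t hash3(const std::uint8_t* p) {
    const std::uint32_t v = std::uint32_t(p[0]) | (std::uint32_t(p[1]) << 8) |
                            (std::uint32_t(p[2]) << 16);
    return (v * 2654435761u) >> (32 - kHashBits);
}

// For each position, report the match found through a single table lookup.
std::vector<Match> find_matches(const std::vector<std::uint8_t>& data) {
    std::vector<Match> matches(data.size());  // value-initialized: all {0, 0}
    // Flat array: hash -> most recent position with that hash, -1 if empty.
    std::vector<std::int64_t> head(std::size_t{1} << kHashBits, -1);

    for (std::size_t pos = 0; pos + kMinMatch <= data.size(); ++pos) {
        const std::uint32_t h = hash3(&data[pos]);
        const std::int64_t cand = head[h];
        if (cand >= 0 && pos - static_cast<std::size_t>(cand) <= kWindowSize) {
            const std::size_t start = static_cast<std::size_t>(cand);
            const std::size_t limit = std::min(kMaxMatch, data.size() - pos);
            std::size_t len = 0;
            // Verify byte by byte: a hash hit is only a candidate, not a proof.
            while (len < limit && data[start + len] == data[pos + len]) ++len;
            if (len >= kMinMatch) matches[pos] = {pos - start, len};
        }
        head[h] = static_cast<std::int64_t>(pos);  // remember the latest occurrence
    }
    return matches;
}
```

An encoder built on this would walk the positions in order, emit a (distance, length) pair and skip ahead by `length` wherever a match is reported, and emit a literal byte otherwise.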
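Requires a C++17 compiler. Compile with -O3 to enable aggressive optimization, including auto-vectorization of hot loops; -march=native additionally lets the compiler target the host CPU's full instruction set.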
g++ -O3 -march=native main.cpp compress.cpp decompress.cpp -o main.exe
To Compress:
./main.exe compress <file_path>
To Decompress:
./main.exe decompress <file_path>.bin
Evaluated on 131 files (247.6 MB total), including the Silesia Corpus, text, PDFs, and high-entropy media (MP4). zlib was run at its default compression level (6).
| File Category | C++ Space Saved | zlib Space Saved | C++ Mean Time | zlib Mean Time |
|---|---|---|---|---|
| UNKNOWN (Silesia/Binaries) | 63.99% | 64.71% | 0.37s | 1.37s |
| .TXT (Source Code/Text) | 61.71% | 62.67% | 0.07s | 0.03s |
| .PDF (Documents) | 17.32% | 18.27% | 0.05s | 0.02s |
| .MP4 (High-Entropy) | 13.22% | 13.59% | 0.15s | 0.63s |
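
The table does not state how "Space Saved" is computed. Assuming the usual definition (the fraction of the original size removed by compression), the figure could be reproduced with a helper like the one below; `space_saved_percent` is a hypothetical name, not a function from this project.

```cpp
#include <cstdint>

// Percentage of the original size removed by compression (assumed definition).
// Returns a negative value if the "compressed" output grew.
double space_saved_percent(std::uint64_t original_bytes,
                           std::uint64_t compressed_bytes) {
    if (original_bytes == 0) return 0.0;  // avoid dividing by zero on empty input
    return 100.0 * (1.0 - static_cast<double>(compressed_bytes) /
                              static_cast<double>(original_bytes));
}
```

Under this definition, a 4 MB file that compresses to 1 MB has saved 75% of its space.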
Integrity Verification: 100% byte-for-byte accuracy verified across all 131 files after full compress/decompress cycles.
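
How the project computes and stores its checksums is not shown here. As a minimal sketch, assuming the standard CRC-32 (reflected polynomial 0xEDB88320, the same variant zlib uses), a round-trip check might look like this; `crc32_bytes` and `round_trip_ok` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320). A table-driven version
// is faster; this form is kept short for clarity.
std::uint32_t crc32_bytes(const std::uint8_t* data, std::size_t len) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right; XOR in the polynomial when the low bit was set.
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Compare the size and checksum of the original against the decompressed output.
bool round_trip_ok(const std::vector<std::uint8_t>& original,
                   const std::vector<std::uint8_t>& restored) {
    return original.size() == restored.size() &&
           crc32_bytes(original.data(), original.size()) ==
               crc32_bytes(restored.data(), restored.size());
}
```

In a real archive the checksum of the original would normally be written at compression time and compared against the checksum of the decompressed bytes, so the original file does not need to be present. A CRC detects accidental corruption with very high probability, but a full byte-for-byte comparison is the stronger check.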
Python test scripts are included to fetch datasets and verify bitstream integrity.
python fetch_datasets.py
python benchmark.py
python decompress_checker.py