This project implements the Huffman coding algorithm for file compression and decompression. Huffman coding is a lossless data compression algorithm that assigns variable-length codes to input characters based on their frequencies: characters that occur more frequently are assigned shorter codes.
Huffman coding works by creating a binary tree of nodes:
- Frequency Analysis: Count the frequency of each character in the input.
- Build Priority Queue: Create a min-heap (priority queue) where each node contains a character and its frequency.
- Build Huffman Tree:
- Extract two nodes with the lowest frequencies from the queue.
- Create a new internal node with these two nodes as children, with frequency equal to the sum of the two nodes' frequencies.
- Add this new node back to the priority queue.
- Repeat until only one node remains (the root of the Huffman tree).
- Generate Codes: Traverse the tree to assign codes (0 for left edges, 1 for right edges) to each character.
- Encode Data: Replace each character in the input with its corresponding code.
- Read the input file and count frequency of each character.
- Build a Huffman tree based on these frequencies.
- Generate a mapping of characters to their binary codes.
- Write a header to the output file containing:
- Magic number "huff" to identify the file format
- Character frequency table for later decompression
- Encode the input file using the generated codes.
- Pack the bits into bytes and write to the output file.
- Read the header from the compressed file to extract character frequencies.
- Rebuild the same Huffman tree used during compression.
- Read the compressed data and traverse the Huffman tree to decode:
- For each bit, move left (0) or right (1) in the tree.
- When a leaf node is reached, output its character and return to the root.
- Write the decoded text to the output file.
The project is organized into the following components:
- HuffmanNode: Represents a node in the Huffman tree.
- Contains character, frequency, and pointers to left and right children.
- Leaf nodes contain actual characters.
- HuffmanCoding: Static class with methods for Huffman coding operations:
- buildTree: Constructs a Huffman tree from frequency data.
- buildMap: Generates character-to-code mapping from the tree.
- writeHeader: Writes the file header with character frequencies.
- processFile: Encodes the input file and writes to output.
- readHeader: Reads the header from a compressed file.
- readCompressedFile: Decodes compressed data using a Huffman tree.
- destroyTree: Cleans up memory by deleting the tree.
- MaxHeap: Implementation of a max heap for priority queue operations.
- Not directly used in the current implementation (the project uses STL's priority_queue instead).
The main program supports two operations:
- Compression: Convert a text file to a Huffman-encoded file.
- Decompression: Restore a Huffman-encoded file back to its original text.
- C++ compiler with C++11 support
- CMake (version 3.0 or higher)
- Clone the repository:
git clone https://github.com/peacewalker122/huffman-compression.git
cd huffman-compression
- Create a build directory and compile:
mkdir -p build && cd build
cmake ..
make
./out/compression -c -f input.txt -o output.huff
./out/compression -d -f input.huff -o output.txt
Compressed files follow this format:
- Magic Number: 4 bytes containing "huff" to identify the file format.
- Character Count: Size of the frequency table.
- Frequency Table: Each entry contains:
- Character (1 byte)
- Frequency (8 bytes as long long)
- Encoded Data: Bit stream of Huffman codes packed into bytes.
- -c: Compress mode
- -d: Decompress mode
- -f [filename]: Input file
- -o [filename]: Output file
- The current implementation works best with text files.
- The compression ratio depends on the character distribution in the input file.
- Very small files might not benefit from compression, because the stored frequency table is fixed overhead: at 9 bytes per distinct character (1-byte character plus 8-byte frequency) on top of the 4-byte magic number, a file with 60 distinct characters already carries over 500 bytes of header.
As noted in the code comments, the implementation achieved:
- First benchmark: 2.524s for compression
- Second benchmark: 2.157s for compression
Compression efficiency varies based on:
- The size of the input file
- Character frequency distribution
- Implementation optimizations
Potential enhancements to consider:
- Adaptive Huffman coding for better compression of files with changing character distributions
- Block-based compression for handling very large files
- Multi-threading support for faster processing
- Better error handling and recovery mechanisms
- Supporting compression of binary files
- Adding integrity checks for compressed data
- Coding Challenge: Huffman Coding
- Huffman, D. A. (1952). "A Method for the Construction of Minimum-Redundancy Codes". Proceedings of the IRE. 40 (9): 1098–1101.