This project implements the Huffman coding algorithm for file compression and decompression. Huffman coding is a lossless data compression algorithm that assigns variable-length codes to input characters based on their frequencies: characters that occur more frequently are assigned shorter codes.
Huffman coding works by creating a binary tree of nodes:
- Frequency Analysis: Count the frequency of each character in the input.
- Build Priority Queue: Create a min-heap (priority queue) where each node contains a character and its frequency.
- Build Huffman Tree:
- Extract two nodes with the lowest frequencies from the queue.
- Create a new internal node with these two nodes as children, with frequency equal to the sum of the two nodes' frequencies.
- Add this new node back to the priority queue.
- Repeat until only one node remains (the root of the Huffman tree).
- Generate Codes: Traverse the tree to assign codes (0 for left edges, 1 for right edges) to each character.
- Encode Data: Replace each character in the input with its corresponding code.
- Read the input file and count frequency of each character.
- Build a Huffman tree based on these frequencies.
- Generate a mapping of characters to their binary codes.
- Write a header to the output file containing:
- Magic number "huff" to identify the file format
- Character frequency table for later decompression
- Encode the input file using the generated codes.
- Pack the bits into bytes and write to the output file.
- Read the header from the compressed file to extract character frequencies.
- Rebuild the same Huffman tree used during compression.
- Read the compressed data and traverse the Huffman tree to decode:
- For each bit, move left (0) or right (1) in the tree.
- When a leaf node is reached, output its character and return to the root.
- Write the decoded text to the output file.
The project is organized into the following components:
- HuffmanNode: Represents a node in the Huffman tree.
- Contains character, frequency, and pointers to left and right children.
- Leaf nodes contain actual characters.
- HuffmanCoding: Static class with methods for Huffman coding operations:
- buildTree: Constructs a Huffman tree from frequency data.
- buildMap: Generates character-to-code mapping from the tree.
- writeHeader: Writes the file header with character frequencies.
- processFile: Encodes the input file and writes to output.
- readHeader: Reads the header from a compressed file.
- readCompressedFile: Decodes compressed data using a Huffman tree.
- destroyTree: Cleans up memory by deleting the tree.
- MaxHeap: Implementation of a max heap for priority queue operations.
- Not directly used in the current implementation (the project uses STL's priority_queue instead).
The main program supports two operations:
- Compression: Convert a text file to a Huffman-encoded file.
- Decompression: Restore a Huffman-encoded file back to its original text.
- C++ compiler with C++11 support
- CMake (version 3.0 or higher)
- Clone the repository:
git clone https://github.com/peacewalker122/huffman-compression.git
cd huffman-compression
- Create a build directory and compile:
mkdir -p build && cd build
cmake ..
make
./out/compression -c -f input.txt -o output.huff
./out/compression -d -f input.huff -o output.txt
Compressed files follow this format:
- Magic Number: 4 bytes containing "huff" to identify the file format.
- Character Count: Size of the frequency table.
- Frequency Table: Each entry contains:
- Character (1 byte)
- Frequency (8 bytes as long long)
- Encoded Data: Bit stream of Huffman codes packed into bytes.
- -c: Compress mode
- -d: Decompress mode
- -f [filename]: Input file
- -o [filename]: Output file
- The current implementation works best with text files.
- The compression ratio depends on the character distribution in the input file.
- Very small files might not benefit from compression, because the stored frequency table is fixed overhead: at 9 bytes per distinct character (1-byte character plus 8-byte frequency) on top of the 4-byte magic number, a file with 60 distinct characters already carries over 500 bytes of header.
As noted in the code comments, the implementation achieved:
- First benchmark: 2.524s for compression
- Second benchmark: 2.157s for compression
Compression efficiency varies based on:
- The size of the input file
- Character frequency distribution
- Implementation optimizations
Potential enhancements to consider:
- Adaptive Huffman coding for better compression of files with changing character distributions
- Block-based compression for handling very large files
- Multi-threading support for faster processing
- Better error handling and recovery mechanisms
- Supporting compression of binary files
- Adding integrity checks for compressed data
- Coding Challenge: Huffman Coding
- Huffman, D. A. (1952). "A Method for the Construction of Minimum-Redundancy Codes". Proceedings of the IRE. 40 (9): 1098–1101.