GitHub - Manu-sh/huffman: huffman compression with minimal file format example implementation (.hafe)

Huffman

Nice to have / Future or TODOs:

improve bitarray class removing the dipendence from std::vector and use a custom one or a simple block of memory (i need to access the underlying capacity), the memory could be aligned but with posix_memalign or similar to exploit avx etc. this also affect the encode function that make a massive use of bitarray
HuffmanCode buffer could live into the stack since it's at most uint8_t[256] (an aligned 32 byte memory chunk) this affect memory because mostly are smaller but can produce a better performance, using a custom allocator is also a solution but introduce more complexity
reconstruct an huffman tree from symbol table paths stored in hafe file (the hashmap solution for decoding require a bit of extra overhead but still constant), since different huffman impl. can produce differents huffman codes if you want store tree there are 2 possibility: describe how to store tree into the format specs. or reconstruct from the symbol table.
HuffmanNode doesn't need to keep frequency, frequency could be stored into a different structure this reduce memory consumpion but introduce more complexity (IMHO it's bad)
encode() function could accept a non-constant buffer and perform the encoding in place since the encoded result is always shorter than the input sequence (however to me seems unnecessary complex)

# compress
./huffman < ../data/divina_commedia.txt > divina_commedia.txt.hafe

# decompress
./huffman - < divina_commedia.txt.hafe > divina_commedia.txt

# checksum
md5sum divina_commedia.txt ../data/divina_commedia.txt

Example of format

All integers in Little Endian:

Offset	Size	Name	Example	Description
0	4 byte	magic	4D 61 6E 75	magic number
4	16 byte	reserved	1D 1D 1D 1D 1D 1D 1D 1D 1D 1D 1D 1D 1D 1D 1D 1D	reserved bytes
20	4 byte	symbol table length	3C 00 00 00	length in bytes of symbol table 32 bit unsigned integer in little endian
24	variable	symbol table	-	see here
variable	8 byte	bitstream length	-	the length of bitstream in bits except padding bits if none, 64 bit unsigned integer in little endian ~~uint8_t last_byte_garbage_bits stream_size % 2**8, a compact way to express bitstream length since can be very long and it contains garbage at the end in the last byte (padding bits).~~
variable	variable	bitstream	-	the compressed bitstream

Symbol Table On Disk

The symbol table is stored as contiguos block of variable size every block, each block is a prefix table entry that describe a symbol and his prefix code empty entries are not allowed.

uint8_t	uint8_t	uint8_t[]
symbol	the length > 0 of prefix code in bits excluding padding, this value cannot be 0	prefix code (at least 1 byte)

Example, suppose you have to store a 3 bit prefix code 101 entry for the symbol 'c' you will write:

1 byte (the symbol 'c')
1 byte the number of bits that compose the prefix code (3)
1 byte (the prefix code) 101+ 5 bit of garbage

To read this prefix code the decompressor will read (bits / 8.0).ceil() bytes and will consider only the first 3 bits.

You may also find helping the following C-Like pseudo-code:

struct {
  uint8_t symbol;
  uint8_t prefix_len_in_bits = 1; // at least 1 bit, no empty entries allowed
  uint8_t[??] prefix_code;         // unknow but at least 1 element
};

uint8_t magic[4] = { 0x4d, 0x61, 0x6e, 0x75 };
uint8_t reserved[16];
uint32_t sym_table_length;
sym_table_entry_t[??] sym_table;

uint64_t bitstream_length;
uint8_t[??] bitstream;

Maybe a crc32 of uncompressed data will be added later there is some space reserved.
The length of every symbol table entry must be > 0.

run test

mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j`nproc --all`
cp ../test-hafe.sh

# you have to check md5 with your eyes, see (*)below
./test-hafe.sh

(*)below: md5 sum of original data (../data/vergine_delle_rocce.txt) and decompressed data (vergine_delle_rocce.txt) must be the same

...
decompressing vergine_delle_rocce.txt.hafe -> vergine_delle_rocce.txt

real	0m0,013s
user	0m0,008s
sys	0m0,004s
a4288f761135779373b0d1b1747f6e47  vergine_delle_rocce.txt
a4288f761135779373b0d1b1747f6e47  ../data/vergine_delle_rocce.txt

-----------

usually output .hafe files are smaller (every test file should be smaller since are optimal tests). es:

compressing ../data/vergine_delle_rocce.txt vergine_delle_rocce.txt.hafe

real	0m0,011s
user	0m0,009s
sys	0m0,002s

352K	../data/vergine_delle_rocce.txt
196K	vergine_delle_rocce.txt.hafe

-----------

Learning resources

Special Thanks

A special thanks to Bill Bird from the University of Victoria (Canada), who has shared on YouTube his wonderful video playlist on compression algorithms. I loved the clarity and detail and the passion of his explanations.

Thanks, Bill.

Name		Name	Last commit message	Last commit date
Latest commit History 92 Commits
data		data
include		include
tests		tests
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md
main.cpp		main.cpp
test-hafe.sh		test-hafe.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Huffman

Symbol Table On Disk

run test

Learning resources

Other interesting topic related to DEFLATE or LZW (notes 4 me)

Special Thanks

Copyright © 2025, Manu-sh, s3gmentationfault@gmail.com. Released under the MIT license.

About

Uh oh!

Languages

License

Manu-sh/huffman

Folders and files

Latest commit

History

Repository files navigation

Huffman

Symbol Table On Disk

run test

Learning resources

Other interesting topic related to DEFLATE or LZW (notes 4 me)

Special Thanks

Copyright © 2025, Manu-sh, s3gmentationfault@gmail.com. Released under the MIT license.

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Languages