Huffman

Nice to have / Future or TODOs:

  • improve the bitarray class: remove the dependence on std::vector and use a custom container or a plain block of memory (I need access to the underlying capacity); the memory could be aligned (with posix_memalign or similar) to exploit AVX etc. This also affects the encode function, which makes heavy use of bitarray
  • the HuffmanCode buffer could live on the stack, since it is at most uint8_t[256] (an aligned 32-byte memory chunk); this costs some memory, because most codes are smaller, but it can yield better performance. Using a custom allocator is also a solution, but it introduces more complexity
  • reconstruct a Huffman tree from the symbol-table paths stored in the .hafe file (the hashmap solution used for decoding requires a bit of extra overhead, but is still constant time). Since different Huffman implementations can produce different Huffman codes, if you want to store the tree there are 2 possibilities: describe how to store the tree in the format specs, or reconstruct it from the symbol table (see the sketch after this list)
  • HuffmanNode doesn't need to keep the frequency; the frequency could be stored in a different structure. This reduces memory consumption but introduces more complexity (IMHO it's bad)
  • the encode() function could accept a non-constant buffer and perform the encoding in place, since the encoded result is always shorter than the input sequence (however, this seems unnecessarily complex to me)
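
As a rough sketch of the tree-reconstruction idea mentioned in the list above (the Node type, the std::vector<bool> code representation and insert_code are hypothetical names used for illustration, not the actual API of this project):

#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical decoding-tree node: leaves carry a symbol, inner nodes only children.
struct Node {
    std::unique_ptr<Node> child[2]; // child[0] = bit 0, child[1] = bit 1
    int symbol = -1;                // -1 means "not a leaf"
};

// Insert one symbol table entry (symbol + its prefix code as a bit sequence),
// creating inner nodes along the path as needed.
void insert_code(Node &root, uint8_t symbol, const std::vector<bool> &code) {
    Node *cur = &root;
    for (bool bit : code) {
        auto &next = cur->child[bit];
        if (!next) next = std::make_unique<Node>();
        cur = next.get();
    }
    cur->symbol = symbol; // the path from the root now spells the prefix code
}

int main() {
    Node root;
    insert_code(root, 'a', {false});       // 'a' -> 0
    insert_code(root, 'b', {true, false}); // 'b' -> 10
    insert_code(root, 'c', {true, true});  // 'c' -> 11
}

Decoding would then walk the tree bit by bit from the root and emit a symbol (restarting at the root) every time a leaf is reached, trading the per-lookup cost of the hashmap for a one-off rebuild per file.
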
# compress
./huffman < ../data/divina_commedia.txt > divina_commedia.txt.hafe

# decompress
./huffman - < divina_commedia.txt.hafe > divina_commedia.txt

# checksum
md5sum divina_commedia.txt ../data/divina_commedia.txt

Example of format

All integers are stored in little endian:

Offset    Size      Name                 Example                                          Description
0         4 bytes   magic                4D 61 6E 75                                      magic number
4         16 bytes  reserved             1D 1D 1D 1D 1D 1D 1D 1D 1D 1D 1D 1D 1D 1D 1D 1D  reserved bytes
20        4 bytes   symbol table length  3C 00 00 00                                      length in bytes of the symbol table, 32-bit unsigned integer in little endian
24        variable  symbol table         -                                                see "Symbol Table On Disk" below
variable  8 bytes   bitstream length     -                                                length of the bitstream in bits, excluding padding bits if any; 64-bit unsigned integer in little endian
variable  variable  bitstream            -                                                the compressed bitstream

The length in bits is a compact way to express the bitstream size, since the stream can be very long and contains garbage at the end of its last byte (padding bits); the number of garbage bits in the last byte (uint8_t last_byte_garbage_bits) can be derived from it.
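
For illustration, here is a minimal sketch of reading the fixed part of this layout; the read_le helper and the error handling are assumptions made for the example, not code taken from this repository:

#include <cstdint>
#include <cstring>
#include <fstream>
#include <stdexcept>
#include <vector>

// Hypothetical helper: read an unsigned little-endian integer of `nbytes` bytes.
static uint64_t read_le(std::istream &in, int nbytes) {
    uint8_t buf[8] = {0};
    in.read(reinterpret_cast<char *>(buf), nbytes);
    if (!in) throw std::runtime_error("truncated header");
    uint64_t v = 0;
    for (int i = 0; i < nbytes; ++i)
        v |= uint64_t(buf[i]) << (8 * i); // least significant byte first
    return v;
}

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    std::ifstream in(argv[1], std::ios::binary);

    uint8_t magic[4], reserved[16];
    in.read(reinterpret_cast<char *>(magic), sizeof magic);
    in.read(reinterpret_cast<char *>(reserved), sizeof reserved);
    if (!in || std::memcmp(magic, "Manu", sizeof magic) != 0) // 4D 61 6E 75 == "Manu"
        throw std::runtime_error("not a .hafe file");

    const uint32_t sym_table_len = uint32_t(read_le(in, 4));  // bytes
    std::vector<uint8_t> sym_table(sym_table_len);
    in.read(reinterpret_cast<char *>(sym_table.data()), sym_table_len);

    const uint64_t bitstream_len_bits = read_le(in, 8);       // bits, padding excluded
    // ...parse the symbol table entries, then read the remaining
    // ceil(bitstream_len_bits / 8) bytes of compressed bitstream
}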

Symbol Table On Disk

The symbol table is stored as a sequence of contiguous blocks of variable size; each block is a prefix-table entry that describes a symbol and its prefix code. Empty entries are not allowed.

Type       Field        Description
uint8_t    symbol       the symbol
uint8_t    code length  the length (> 0) of the prefix code in bits, excluding padding; this value cannot be 0
uint8_t[]  prefix code  the prefix code itself (at least 1 byte)

Example: suppose you have to store an entry for the symbol 'c' whose prefix code is the 3-bit sequence 101; you will write:

  • 1 byte (the symbol 'c')
  • 1 byte with the number of bits that compose the prefix code (3)
  • 1 byte (the prefix code): 101 + 5 bits of garbage

To read this prefix code the decompressor will read ceil(bits / 8.0) bytes and will consider only the first 3 bits.
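
As a rough sketch of that read path (assuming the prefix code bits are packed most-significant-bit first, which matches the "101 + garbage" example above; SymTableEntry and parse_entry are hypothetical names, not types used by this implementation):

#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical in-memory form of a symbol table entry.
struct SymTableEntry {
    uint8_t symbol;
    uint8_t bit_len;        // length of the prefix code in bits, > 0
    std::vector<bool> code; // the prefix code, one element per bit
};

// Parse one entry starting at `pos` inside the raw symbol-table bytes and
// advance `pos` past it. Bits are assumed to be packed MSB first.
SymTableEntry parse_entry(const std::vector<uint8_t> &buf, size_t &pos) {
    SymTableEntry e;
    e.symbol  = buf.at(pos++);
    e.bit_len = buf.at(pos++);
    if (e.bit_len == 0) throw std::runtime_error("empty entry not allowed");

    const size_t nbytes = (e.bit_len + 7) / 8;                   // ceil(bits / 8)
    for (size_t i = 0; i < e.bit_len; ++i) {
        const uint8_t byte = buf.at(pos + i / 8);
        e.code.push_back(((byte >> (7 - i % 8)) & 1) != 0);      // ignore the padding bits
    }
    pos += nbytes;
    return e;
}

int main() {
    const std::vector<uint8_t> raw = { 'c', 3, 0b10100000 };     // 'c', 3 bits, 101 + 5 garbage bits
    size_t pos = 0;
    SymTableEntry e = parse_entry(raw, pos);                     // e.code == {1, 0, 1}
    (void)e;
}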

You may also find the following C-like pseudo-code helpful:

struct sym_table_entry_t {
  uint8_t symbol;
  uint8_t prefix_len_in_bits;   // at least 1 bit, no empty entries allowed
  uint8_t prefix_code[??];      // unknown size, but at least 1 element
};

uint8_t  magic[4] = { 0x4d, 0x61, 0x6e, 0x75 };
uint8_t  reserved[16];
uint32_t sym_table_length;
sym_table_entry_t sym_table[??];

uint64_t bitstream_length;
uint8_t  bitstream[??];
  • A crc32 of the uncompressed data may be added later; there is some space reserved for it.
  • The length of every symbol table entry must be > 0.

Run the tests

mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j`nproc --all`
cp ../test-hafe.sh .

# you have to check the md5 sums by eye, see (*) below
./test-hafe.sh

(*) below: the md5 sums of the original data (../data/vergine_delle_rocce.txt) and the decompressed data (vergine_delle_rocce.txt) must be the same

...
decompressing vergine_delle_rocce.txt.hafe -> vergine_delle_rocce.txt

real	0m0,013s
user	0m0,008s
sys	0m0,004s
a4288f761135779373b0d1b1747f6e47  vergine_delle_rocce.txt
a4288f761135779373b0d1b1747f6e47  ../data/vergine_delle_rocce.txt

-----------

Usually the output .hafe files are smaller (every test file should come out smaller, since they are favourable test cases), e.g.:

compressing ../data/vergine_delle_rocce.txt vergine_delle_rocce.txt.hafe

real	0m0,011s
user	0m0,009s
sys	0m0,002s

352K	../data/vergine_delle_rocce.txt
196K	vergine_delle_rocce.txt.hafe

-----------
Learning resources
Other interesting topics related to DEFLATE or LZW (notes for me)
Special Thanks

A special thanks to Bill Bird from the University of Victoria (Canada), who has shared on YouTube his wonderful video playlist on compression algorithms. I loved the clarity, detail, and passion of his explanations.

Thanks, Bill.

Copyright © 2025, Manu-sh, s3gmentationfault@gmail.com. Released under the MIT license.

About

huffman compression with minimal file format example implementation (.hafe)
