Nice to have / Future or TODOs:
- improve bitarray class removing the dipendence from std::vector and use a custom one or a simple block of memory (i need to access the underlying capacity), the memory could be aligned but with posix_memalign or similar to exploit avx etc. this also affect the encode function that make a massive use of bitarray
- HuffmanCode buffer could live into the stack since it's at most uint8_t[256] (an aligned 32 byte memory chunk) this affect memory because mostly are smaller but can produce a better performance, using a custom allocator is also a solution but introduce more complexity
- reconstruct an huffman tree from symbol table paths stored in hafe file (the hashmap solution for decoding require a bit of extra overhead but still constant), since different huffman impl. can produce differents huffman codes if you want store tree there are 2 possibility: describe how to store tree into the format specs. or reconstruct from the symbol table.
- HuffmanNode doesn't need to keep frequency, frequency could be stored into a different structure this reduce memory consumpion but introduce more complexity (IMHO it's bad)
- encode() function could accept a non-constant buffer and perform the encoding in place since the encoded result is always shorter than the input sequence (however to me seems unnecessary complex)
# compress
./huffman < ../data/divina_commedia.txt > divina_commedia.txt.hafe
# decompress
./huffman - < divina_commedia.txt.hafe > divina_commedia.txt
# checksum
md5sum divina_commedia.txt ../data/divina_commedia.txtExample of format
All integers in Little Endian:
| Offset | Size | Name | Example | Description |
|---|---|---|---|---|
| 0 | 4 byte | magic | 4D 61 6E 75 | magic number |
| 4 | 16 byte | reserved | 1D 1D 1D 1D 1D 1D 1D 1D 1D 1D 1D 1D 1D 1D 1D 1D | reserved bytes |
| 20 | 4 byte | symbol table length | 3C 00 00 00 | length in bytes of symbol table 32 bit unsigned integer in little endian |
| 24 | variable | symbol table | - | see here |
| variable | 8 byte | bitstream length | - | the length of bitstream in bits except padding bits if none, 64 bit unsigned integer in little endian |
| variable | variable | bitstream | - | the compressed bitstream |
The symbol table is stored as contiguos block of variable size every block, each block is a prefix table entry that describe a symbol and his prefix code empty entries are not allowed.
| uint8_t | uint8_t | uint8_t[] |
|---|---|---|
| symbol | the length > 0 of prefix code in bits excluding padding, this value cannot be 0 | prefix code (at least 1 byte) |
Example, suppose you have to store a 3 bit prefix code 101 entry for the symbol 'c' you will write:
- 1 byte (the symbol
'c') - 1 byte the number of bits that compose the prefix code (3)
- 1 byte (the prefix code)
101+ 5 bit of garbage
To read this prefix code the decompressor will read (bits / 8.0).ceil() bytes and will consider only the first 3 bits.
You may also find helping the following C-Like pseudo-code:
struct {
uint8_t symbol;
uint8_t prefix_len_in_bits = 1; // at least 1 bit, no empty entries allowed
uint8_t[??] prefix_code; // unknow but at least 1 element
};
uint8_t magic[4] = { 0x4d, 0x61, 0x6e, 0x75 };
uint8_t reserved[16];
uint32_t sym_table_length;
sym_table_entry_t[??] sym_table;
uint64_t bitstream_length;
uint8_t[??] bitstream;- Maybe a crc32 of uncompressed data will be added later there is some space reserved.
- The length of every symbol table entry must be > 0.
mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j`nproc --all`
cp ../test-hafe.sh
# you have to check md5 with your eyes, see (*)below
./test-hafe.sh
(*)below: md5 sum of original data (../data/vergine_delle_rocce.txt) and decompressed data (vergine_delle_rocce.txt) must be the same
...
decompressing vergine_delle_rocce.txt.hafe -> vergine_delle_rocce.txt
real 0m0,013s
user 0m0,008s
sys 0m0,004s
a4288f761135779373b0d1b1747f6e47 vergine_delle_rocce.txt
a4288f761135779373b0d1b1747f6e47 ../data/vergine_delle_rocce.txt
-----------
usually output .hafe files are smaller (every test file should be smaller since are optimal tests). es:
compressing ../data/vergine_delle_rocce.txt vergine_delle_rocce.txt.hafe
real 0m0,011s
user 0m0,009s
sys 0m0,002s
352K ../data/vergine_delle_rocce.txt
196K vergine_delle_rocce.txt.hafe
-----------
- https://huffman-coding-online.vercel.app
- https://web.stanford.edu/class/archive/cs/cs106b/cs106b.1234/resources/huffman.html
- https://www.math.unipd.it/~baldan/Alg2/Slide/Lezione2.pdf (italian)
- https://www.youtube.com/playlist?list=PLU4IQLU9e_OrY8oASHx0u3IXAL9TOdidm
- https://create.stephan-brumme.com/length-limited-prefix-codes/#package-merge
- https://experiencestack.co/length-limited-huffman-codes-21971f021d43
- https://aniruddhadeb.com/articles/2024/package-merge/
- https://en.wikipedia.org/wiki/Compress_(software)
- https://it.wikipedia.org/wiki/Graphics_Interchange_Format
A special thanks to Bill Bird from the University of Victoria (Canada), who has shared on YouTube his wonderful video playlist on compression algorithms. I loved the clarity and detail and the passion of his explanations.
Thanks, Bill.