Add opt-in Snappy page compression to reduce database size #1149

tjungblu wants to merge 3 commits into etcd-io:main
Conversation
Introduce a `Compression` option on DB/Options that enables transparent Snappy compression of leaf and branch page data. Compression happens at node spill time, before page allocation, so fewer pages are allocated for compressible data. Decompression is transparent on read via a per-transaction cache using sync.Map for concurrent reader safety.

Key changes:

- New orthogonal CompressedPageFlag (0x20) on page headers; type checks changed from == to bitwise & so the flag coexists with page types
- CompressInodes serializes and compresses node data, only used when it reduces the page count
- Split threshold increased 4x when compression is enabled so nodes accumulate enough data for meaningful compression
- DecompressPage preserves on-disk overflow for correct freelist accounting
- Adds github.com/golang/snappy dependency

This was largely written by Claude Opus.

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
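The change from == to bitwise & can be sketched like this. The leaf flag constant mirrors bbolt's existing page flags, and `isLeaf` is an illustrative helper rather than the PR's actual code:

```go
package main

import "fmt"

const (
	leafPageFlag       uint16 = 0x02 // existing bbolt page type bit
	compressedPageFlag uint16 = 0x20 // new orthogonal flag from this PR
)

// isLeaf checks the page type with a bitwise AND so the compressed
// bit can coexist with the type bit; an == comparison would fail
// once the extra flag is set.
func isLeaf(flags uint16) bool {
	return flags&leafPageFlag != 0
}

func main() {
	compressed := leafPageFlag | compressedPageFlag
	fmt.Println(isLeaf(leafPageFlag))       // true
	fmt.Println(isLeaf(compressed))         // true: & tolerates the extra bit
	fmt.Println(compressed == leafPageFlag) // false: == would misclassify it
}
```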
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tjungblu

The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Also adds a dedicated simulation test.

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
go.mod
```
require (
	github.com/aclements/go-moremath v0.0.0-20210112150236-f10218a38794 // indirect
	github.com/davecgh/go-spew v1.1.1 // indirect
	github.com/golang/snappy v1.0.0 // indirect
```
It's used directly in the Go source code, so why is it still an indirect dependency?
I assume it's due to different domain? golang.org vs. github.com?
I noticed they've moved the repo to archive last year - so I would also be open to other compression algorithms (if license permits).
https://github.com/klauspost/compress (s2 or the snappy drop-in) could be a viable alternative.
Moved it to Klaus's snappy version, which uses s2 underneath.
```go
// Allow nodes to grow up to 4x the page size before splitting.
// Snappy typically achieves 2-4x compression on structured data,
// so this gives the compressor enough data to work with while
// keeping individual nodes at a reasonable size.
splitPageSize *= 4
```
Then it might generate overflow pages if the compression can't fit the data into one page?
That's right, it will overflow if it goes beyond that. I haven't found a nicer way to get better compression ratios across pages, besides setting the page size to 4x the default.
```go
compressedPages := (compressedTotalSize + pageSize - 1) / pageSize

// Only use compression if it actually reduces the page count.
if compressedPages >= uncompressedPages {
```
In which case may compression increase the size?
Imagine a buffer filled with random data: the compression algorithm's overhead (header/checksum/dictionary etc.) will then exceed any savings, so the output is larger than the input.
Have you compared the impact on throughput and latency? At a high level, I feel the implementation is tightly coupled with the existing logic, which may not be ideal from a maintenance perspective. I'm wondering whether we could adopt a pipeline-based approach to minimize the impact on the current design and implementation.
Not yet. I'll create one of those rw heatmaps in etcd and will post them back in a bit. (EDIT: forgot that this takes a whole day to run, will report tomorrow)

Agreed. Where would you see such a pipeline hooking in best?
Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
Maybe we should align on compression algorithms, see kubernetes/kubernetes#136030. The current recommendation for K8s is also Snappy (implemented via s2).
Note that there is a difference between corruption detection for contents and for the btree structure. Both are important.
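For context on the content side: Snappy's framing format checksums each chunk with CRC-32C (Castagnoli), a polynomial Go exposes in the standard library. That covers content corruption, not structural corruption of the btree. A minimal sketch:

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// crc32c computes a CRC-32C (Castagnoli) checksum, the polynomial used
// by Snappy's framing format for per-chunk corruption detection.
func crc32c(data []byte) uint32 {
	return crc32.Checksum(data, crc32.MakeTable(crc32.Castagnoli))
}

func main() {
	// "123456789" is the standard CRC check input; CRC-32C yields 0xe3069283.
	fmt.Printf("%08x\n", crc32c([]byte("123456789")))
}
```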
```go
fs.BoolVar(&o.goBenchOutput, "gobench-output", false, "")
fs.IntVar(&o.pageSize, "page-size", common.DefaultPageSize, "Set page size in bytes.")
fs.IntVar(&o.initialMmapSize, "initial-mmap-size", 0, "Set initial mmap size in bytes for database file.")
fs.BoolVar(&o.compression, "compression", false, "Enables compression.")
```
At the container runtime level, we (containerd) use zstd at the image and filesystem (erofs) level. :)

Thanks for sharing the idea. 👍 We can check the decompression speed for this case.
Sorry for the long wait, I had to retry this several times. Here are the raw results:
Very quick comparison using zstd, as per the recommendation by @fuweid. Code is in a separate branch here: tjungblu@43b5742

bbolt bench

Using the bench comparison script against the above zstd branch with compression enabled by default, we get the following performance numbers:

OpenShift installation

Comparing the storage size with our latest 4.22 nightly, a fully installed OpenShift cluster now only needs 30 MB of storage, down from about 100 MB without compression and 48 MB with Snappy.

Kube Burner

api-intensive

Running the api-intensive workload, we find:

without compression as baseline:
with snappy compression:
with zstd compression:
crd-scale

With crd-scale we have seen:

without compression:
with snappy compression:
with zstd compression:
CPU usage is measured with `sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{pod=~"etcd-ci.*master.*"}) by (pod)`, taking the worst performing node.

So it seems that zstd would be the best bang for the buck here, too.


This PR introduces a `Compression` option on DB/Options that enables transparent Snappy compression of leaf and branch page data. Compression happens at node spill time, before page allocation, so fewer pages are allocated for compressible data. Decompression is transparent on read via a per-transaction cache using sync.Map for concurrent reader safety (required for etcd).

Also, Snappy gives us an easy CRC-32C checksum for free, which we otherwise wanted to introduce for corruption detection.
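A rough sketch of what such a per-transaction decompression cache with sync.Map could look like. The shape here (a `txCache` keyed by page id) is my illustration of the idea, not the PR's exact code:

```go
package main

import (
	"fmt"
	"sync"
)

// txCache holds decompressed page bytes for the lifetime of one
// transaction. sync.Map makes concurrent lookups by multiple readers
// safe without an explicit mutex.
type txCache struct {
	pages sync.Map // page id -> decompressed []byte
}

// get returns the cached bytes for id, invoking decompress at most
// once even if two readers race on the same page.
func (c *txCache) get(id uint64, decompress func() []byte) []byte {
	if v, ok := c.pages.Load(id); ok {
		return v.([]byte)
	}
	// LoadOrStore keeps the first stored value if a concurrent
	// reader beat us to it, so all readers see the same buffer.
	actual, _ := c.pages.LoadOrStore(id, decompress())
	return actual.([]byte)
}

func main() {
	c := &txCache{}
	calls := 0
	dec := func() []byte { calls++; return []byte("payload") }

	fmt.Println(string(c.get(7, dec))) // payload (decompressed)
	fmt.Println(string(c.get(7, dec))) // payload (served from cache)
	fmt.Println(calls)                 // 1
}
```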
Key changes:

- New orthogonal CompressedPageFlag (0x20) on page headers; type checks changed from == to bitwise & so the flag coexists with page types
- CompressInodes serializes and compresses node data, only used when it reduces the page count
- Split threshold increased 4x when compression is enabled so nodes accumulate enough data for meaningful compression
- DecompressPage preserves on-disk overflow for correct freelist accounting
- Adds github.com/golang/snappy dependency
This was largely written by Claude Opus, so please take some extra care when reviewing. I'm sure there are edge cases I haven't thought of. And of course, if you have anything for me to benchmark against, let me know. We will have to add some more comprehensive benchmarking when integrating into etcd.
Benchmark Results
bbolt bench
Using the bench comparison script against this branch with compression enabled by default, we get the following performance numbers:
OpenShift installation
Comparing the storage size with our latest 4.22 nightly, a fully installed OpenShift cluster now only needs 48 MB of storage, down from about 100 MB without compression.
Kube Burner
api-intensive
Running the api-intensive workload, we find:
without compression as baseline:
with compression:
crd-scale
With crd-scale we have seen:
without compression:
with compression:
CPU usage is measured with `sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{pod=~"etcd-ci.*master.*"}) by (pod)`, taking the worst performing node. The size is measured using `etcd_mvcc_db_total_size_in_use_in_bytes`.