Add opt-in Snappy page compression to reduce database size by tjungblu · Pull Request #1149 · etcd-io/bbolt

tjungblu · 2026-02-12T09:04:40Z

This PR introduces a Compression option on DB/Options that enables transparent Snappy compression of leaf and branch page data. Compression happens at node spill time—before page allocation—so fewer pages are allocated for compressible data. Decompression is transparent on read via a per-transaction cache using sync.Map for concurrent reader safety (required for etcd).

Also, Snappy gives us an easy CRC-32C checksum for free, which we otherwise wanted to introduce for corruption detection.

Key changes:

New orthogonal CompressedPageFlag (0x20) on page headers; type checks changed from == to bitwise & so the flag coexists with page types
CompressInodes serializes and compresses node data, only used when it reduces the page count
Split threshold increased 4x when compression is enabled so nodes accumulate enough data for meaningful compression
DecompressPage preserves on-disk overflow for correct freelist accounting

This was largely written by Claude Opus, please take some extra care when reviewing. I'm sure there are edge cases I haven't thought of poking into. And of course, if you have anything for me to benchmark against - let me know. We have to add some more comprehensive benchmarking when integrating into etcd.

Benchmark Results

bbolt bench

Using the bench comparison script against this branch with compression enabled by default, we get the following performance numbers:

        │    BASE     │                HEAD                │
        │   sec/op    │   sec/op     vs base               │
Write     515.8n ± 2%   517.6n ± 1%       ~ (p=0.985 n=10)
Read      10.74n ± 1%   10.83n ± 4%       ~ (p=0.060 n=10)
geomean   74.43n        74.87n       +0.60%

OpenShift installation

Comparing the storage size with our latest 4.22 nightly, a fully installed OpenShift cluster now only needs 48 mb in storage size. Down from about 100mb without compression.

Kube Burner

api-intensive

Running the api-intensive workload, we find:

without compression as baseline:

cpu usage at about 28%
The run takes about 50mb of storage

with compression:

cpu usage at around 35%
The entire run only takes about 25mb of storage size (2x improvement)

crd-scale

With crd-scale we have seen:

without compression:

cpu usage around 30%
uses 20mb of storage

with compression:

cpu usage at around 40%
only 5mb of storage size used (4x improvement)

CPU usage is measured with (sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{pod=~"etcd-ci.*master.*"}) by (pod)), taking the worst performing node. The size by using etcd_mvcc_db_total_size_in_use_in_bytes

Introduce a `Compression` option on DB/Options that enables transparent Snappy compression of leaf and branch page data. Compression happens at node spill time—before page allocation—so fewer pages are allocated for compressible data. Decompression is transparent on read via a per-transaction cache using sync.Map for concurrent reader safety. Key changes: - New orthogonal CompressedPageFlag (0x20) on page headers; type checks changed from == to bitwise & so the flag coexists with page types - CompressInodes serializes and compresses node data, only used when it reduces the page count - Split threshold increased 4x when compression is enabled so nodes accumulate enough data for meaningful compression - DecompressPage preserves on-disk overflow for correct freelist accounting - Adds github.com/golang/snappy dependency This was largely written by Claude Opus. Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>

k8s-ci-robot · 2026-02-12T09:04:44Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tjungblu
Once this PR has been reviewed and has the lgtm label, please assign spzala for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

also adds a dedicated simulation test Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>

ahrtr · 2026-02-12T12:13:40Z

go.mod

 require (
 	github.com/aclements/go-moremath v0.0.0-20210112150236-f10218a38794 // indirect
 	github.com/davecgh/go-spew v1.1.1 // indirect
+	github.com/golang/snappy v1.0.0 // indirect


It's used in golang source code, Why it's still indirect dependency?

I assume it's due to different domain? golang.org vs. github.com?

I noticed they've moved the repo to archive last year - so I would also be open to other compression algorithms (if license permits).

https://github.com/klauspost/compress s2 or the snappy drop-in could be a viable alternative

moved it to Klaus's snappy version using s2 underneath.

ahrtr · 2026-02-12T12:19:06Z

node.go

+		// Allow nodes to grow up to 4x the page size before splitting.
+		// Snappy typically achieves 2-4x compression on structured data,
+		// so this gives the compressor enough data to work with while
+		// keeping individual nodes at a reasonable size.
+		splitPageSize *= 4


then it might generate overflow pages if the compression can't compress/fit the data into one page?

that's right, it will overflow if this goes beyond. I haven't found a nicer way to get better compression ratios across pages, well, besides setting the page size to 4x the default.

ahrtr · 2026-02-12T12:23:33Z

internal/common/compression.go

+	compressedPages := (compressedTotalSize + pageSize - 1) / pageSize
+
+	// Only use compression if it actually reduces the page count.
+	if compressedPages >= uncompressedPages {


In which case, compression may increase the size?

imagine you have a buffer filled with random data, then the compression algorithm overhead (header/checksum/dictionary etc) will exceed the initial data size

ahrtr · 2026-02-12T12:30:50Z

Running the api-intensive workload, we find:

without compression as baseline:

cpu usage at about 28%

The run takes about 50mb of storage

with compression:

cpu usage at around 35%

The entire run only takes about 25mb of storage size (2x improvement)

Have you compared the impact on throughput and latency?

At a high level, I feel the implementation is tightly coupled with the existing logic, which may not be ideal from a maintenance perspective. I’m wondering whether we could adopt a pipeline-based approach to minimize the impact on the current design and implementation.

tjungblu · 2026-02-12T12:51:17Z

Have you compared the impact on throughput and latency?

Not yet. I'll create one of those rwheatmaps in etcd, will post them back in a bit. (EDIT: forgot that this takes a whole day to run, will report tomorrow)

At a high level, I feel the implementation is tightly coupled with the existing logic, which may not be ideal from a maintenance perspective. I’m wondering whether we could adopt a pipeline-based approach to minimize the impact on the current design and implementation.

Agreed. Where would you see such a pipeline hooking in best?

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>

serathius · 2026-02-13T10:29:11Z

Maybe we should align on compression algorithms kubernetes/kubernetes#136030. Current recommendation is also Snappy (implemented via s2) for K8s.

serathius · 2026-02-13T10:31:50Z

Also, Snappy gives us an easy CRC-32C checksum for free, which we otherwise wanted to introduce for corruption detection.

Note that there is a difference between corruption detection for contents and btree structure. Both are important.

fuweid · 2026-02-14T16:01:36Z

cmd/bbolt/command/command_bench.go

 	fs.BoolVar(&o.goBenchOutput, "gobench-output", false, "")
 	fs.IntVar(&o.pageSize, "page-size", common.DefaultPageSize, "Set page size in bytes.")
 	fs.IntVar(&o.initialMmapSize, "initial-mmap-size", 0, "Set initial mmap size in bytes for database file.")
+	fs.BoolVar(&o.compression, "compression", false, "Enables compression.")


In container runtime level, we (containerd) use zstd for image and filesystem (erofs) level. :)
Thanks for sharing the idea. 👍 We can see the decompression speed on this case

tjungblu · 2026-02-23T11:09:11Z

Sorry for the long wait, I had to retry this several times.
I ran the rw-heatmap benchmark script on GCP with a C4 series (8 core, 30gb RAM) with etcd 3.6.5, only with bbolt as the dependency replaced and compression set to true.

read

write

here are the raw results:
baseline_result-202602180756.csv
bbolt_page_compression_result-202602180754.csv

tjungblu · 2026-03-05T14:09:54Z

Very quick comparison using zstd, as per recommendation by @fuweid.

Code is in a separate branch here: tjungblu@43b5742

bbolt bench

Using the bench comparison script against the above zstd branch with compression enabled by default, we get the following performance numbers:

        │    BASE     │                HEAD                 │
        │   sec/op    │   sec/op     vs base                │
Write     520.5n ± 2%   598.0n ± 4%  +14.90% (p=0.019 n=10)
Read      10.77n ± 2%   13.11n ± 4%  +21.68% (p=0.000 n=10)
geomean   74.87n        88.53n       +18.24%

OpenShift installation

Comparing the storage size with our latest 4.22 nightly, a fully installed OpenShift cluster now only needs 30 mb in storage size. Down from about 100mb without compression and 48mb with snappy.

Kube Burner

api-intensive

Running the api-intensive workload, we find:

without compression as baseline:

cpu usage at about 28%
The run takes about 50mb of storage

with snappy compression:

cpu usage at around 35%
The entire run only takes about 25mb of storage size (2x improvement)

with zstd compression:

cpu usage at around 35%
The entire run only takes about 15mb of storage size (3x improvement over no compression, 60% over snappy)

crd-scale

With crd-scale we have seen:

without compression:

cpu usage around 30%
uses 20mb of storage

with snappy compression:

cpu usage at around 40%
only 5mb of storage size used (4x improvement)

with zstd compression:

cpu usage at around 40%
The entire run only takes about 3mb of storage size (6x improvement over no compression, 60% over snappy)

CPU usage is measured with (sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{pod=~"etcd-ci.*master.*"}) by (pod)), taking the worst performing node. The size by using etcd_mvcc_db_total_size_in_use_in_bytes

So it seems that zstd would be the best bang for the buck here, too.
I'd go and create a separate PR (#1159), if you guys fancy to merge this behind a option toggle. Likely needs some more tooling as a follow-up to also decompress again (e.g. on downgrades).

k8s-ci-robot added the size/XXL label Feb 12, 2026

add bench command

31526df

also adds a dedicated simulation test Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>

ahrtr reviewed Feb 12, 2026

View reviewed changes

move to klauspost/compress snappy dependency

4374f29

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>

tjungblu mentioned this pull request Feb 13, 2026

Proposal: Structural pattern-based deduplication for etcd storage optimization etcd-io/etcd#21305

Closed

fuweid reviewed Feb 14, 2026

View reviewed changes

tjungblu mentioned this pull request Mar 5, 2026

Add opt-in zstd page compression to reduce database size #1159

Open

Conversation

tjungblu commented Feb 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark Results

bbolt bench

OpenShift installation

Kube Burner

api-intensive

crd-scale

Uh oh!

k8s-ci-robot commented Feb 12, 2026

Uh oh!

ahrtr Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

tjungblu Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

tjungblu Feb 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ahrtr Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

tjungblu Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

ahrtr Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

tjungblu Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

ahrtr commented Feb 12, 2026

Uh oh!

tjungblu commented Feb 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

serathius commented Feb 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

serathius commented Feb 13, 2026

Uh oh!

fuweid Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

tjungblu commented Feb 23, 2026

Uh oh!

tjungblu commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

bbolt bench

OpenShift installation

Kube Burner

api-intensive

crd-scale

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

5 participants

tjungblu commented Feb 12, 2026 •

edited

Loading

tjungblu Feb 12, 2026 •

edited

Loading

tjungblu commented Feb 12, 2026 •

edited

Loading

serathius commented Feb 13, 2026 •

edited

Loading

tjungblu commented Mar 5, 2026 •

edited

Loading