Add opt-in Snappy page compression to reduce database size#1149

Open
tjungblu wants to merge 3 commits into etcd-io:main from tjungblu:opus_compression
Conversation

@tjungblu
Contributor

@tjungblu tjungblu commented Feb 12, 2026

This PR introduces a Compression option on DB/Options that enables transparent Snappy compression of leaf and branch page data. Compression happens at node spill time—before page allocation—so fewer pages are allocated for compressible data. Decompression is transparent on read via a per-transaction cache using sync.Map for concurrent reader safety (required for etcd).
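A minimal sketch of what such a per-transaction decompression cache could look like (the `decompressCache` type and its method names here are illustrative, not the PR's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

type pgid uint64

// decompressCache is an illustrative sketch of a per-transaction cache:
// many concurrent readers may hit the same compressed page, so the cache
// must be safe without a transaction-wide mutex. sync.Map fits this
// read-mostly access pattern.
type decompressCache struct {
	pages sync.Map // pgid -> []byte (decompressed page payload)
}

// get returns the cached decompressed payload for id, calling decompress
// on a miss. Two racing readers may both decompress the same page, but
// LoadOrStore guarantees they end up sharing a single cached copy.
func (c *decompressCache) get(id pgid, decompress func() []byte) []byte {
	if v, ok := c.pages.Load(id); ok {
		return v.([]byte)
	}
	actual, _ := c.pages.LoadOrStore(id, decompress())
	return actual.([]byte)
}

func main() {
	var c decompressCache
	calls := 0
	load := func() []byte { calls++; return []byte("decompressed payload") }
	c.get(12, load)
	data := c.get(12, load) // second read hits the cache
	fmt.Printf("%s (decompress calls: %d)\n", data, calls)
	// prints: decompressed payload (decompress calls: 1)
}
```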

Also, Snappy gives us an easy CRC-32C checksum for free, which we otherwise wanted to introduce for corruption detection.

Key changes:

  • New orthogonal CompressedPageFlag (0x20) on page headers; type checks changed from == to bitwise & so the flag coexists with page types
  • CompressInodes serializes and compresses node data, only used when it reduces the page count
  • Split threshold increased 4x when compression is enabled so nodes accumulate enough data for meaningful compression
  • DecompressPage preserves on-disk overflow for correct freelist accounting
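The flag-coexistence point can be illustrated with a small sketch (the flag constants mirror bbolt's real page flags, `compressedPageFlag` is the new one proposed here, and the helper names are hypothetical):

```go
package main

import "fmt"

// bbolt's existing page flags for context; compressedPageFlag (0x20) is
// the new, orthogonal flag proposed in this PR.
const (
	branchPageFlag     uint16 = 0x01
	leafPageFlag       uint16 = 0x02
	metaPageFlag       uint16 = 0x04
	freelistPageFlag   uint16 = 0x10
	compressedPageFlag uint16 = 0x20
)

// With an orthogonal flag, type checks must use bitwise AND instead of ==,
// because the flags field can now be e.g. leafPageFlag|compressedPageFlag.
func isLeaf(flags uint16) bool       { return flags&leafPageFlag != 0 }
func isCompressed(flags uint16) bool { return flags&compressedPageFlag != 0 }

func main() {
	f := leafPageFlag | compressedPageFlag
	fmt.Println(isLeaf(f), isCompressed(f)) // true true
	fmt.Println(f == leafPageFlag)          // false: an == check would misclassify this page
}
```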

This was largely written by Claude Opus, so please take some extra care when reviewing; I'm sure there are edge cases I haven't poked into. And of course, if you have anything for me to benchmark against, let me know. We'll have to add more comprehensive benchmarking when integrating into etcd.


Benchmark Results

bbolt bench

Using the bench comparison script against this branch with compression enabled by default, we get the following performance numbers:

        │    BASE     │                HEAD                │
        │   sec/op    │   sec/op     vs base               │
Write     515.8n ± 2%   517.6n ± 1%       ~ (p=0.985 n=10)
Read      10.74n ± 1%   10.83n ± 4%       ~ (p=0.060 n=10)
geomean   74.43n        74.87n       +0.60%

OpenShift installation

Comparing the storage size with our latest 4.22 nightly, a fully installed OpenShift cluster now needs only 48 MB of storage, down from about 100 MB without compression.

Kube Burner

api-intensive

Running the api-intensive workload, we find:

without compression as baseline:

  • CPU usage at about 28%
  • The run takes about 50 MB of storage

with compression:

  • CPU usage at around 35%
  • The entire run takes only about 25 MB of storage (2x improvement)

crd-scale

With crd-scale we have seen:

without compression:

  • CPU usage around 30%
  • uses 20 MB of storage

with compression:

  • CPU usage at around 40%
  • only 5 MB of storage used (4x improvement)

CPU usage is measured with sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{pod=~"etcd-ci.*master.*"}) by (pod), taking the worst-performing node. Database size is taken from the etcd_mvcc_db_total_size_in_use_in_bytes metric.

Introduce a `Compression` option on DB/Options that enables transparent
Snappy compression of leaf and branch page data. Compression happens at
node spill time—before page allocation—so fewer pages are allocated for
compressible data. Decompression is transparent on read via a
per-transaction cache using sync.Map for concurrent reader safety.

Key changes:
- New orthogonal CompressedPageFlag (0x20) on page headers; type checks
  changed from == to bitwise & so the flag coexists with page types
- CompressInodes serializes and compresses node data, only used when it
  reduces the page count
- Split threshold increased 4x when compression is enabled so nodes
  accumulate enough data for meaningful compression
- DecompressPage preserves on-disk overflow for correct freelist accounting
- Adds github.com/golang/snappy dependency

This was largely written by Claude Opus.

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tjungblu
Once this PR has been reviewed and has the lgtm label, please assign spzala for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

also adds a dedicated simulation test

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
go.mod Outdated
require (
github.com/aclements/go-moremath v0.0.0-20210112150236-f10218a38794 // indirect
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/golang/snappy v1.0.0 // indirect
Member

It's used directly in the source code, so why is it still an indirect dependency?

Contributor Author

I assume it's due to the different domain? golang.org vs. github.com?

I noticed they moved the repo to archived status last year, so I'd also be open to other compression algorithms (if the license permits).

Contributor Author

@tjungblu tjungblu Feb 12, 2026

https://github.com/klauspost/compress s2 or the snappy drop-in could be a viable alternative

Moved it to Klaus's snappy package, which uses s2 underneath.

Comment on lines +321 to +325
// Allow nodes to grow up to 4x the page size before splitting.
// Snappy typically achieves 2-4x compression on structured data,
// so this gives the compressor enough data to work with while
// keeping individual nodes at a reasonable size.
splitPageSize *= 4
Member

Then it might generate overflow pages if compression can't fit the data into one page?

Contributor Author

That's right, it will overflow if the compressed data still exceeds one page. I haven't found a nicer way to get better compression ratios across pages, besides setting the page size to 4x the default.
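The overflow accounting amounts to simple rounding-up arithmetic; a sketch (function name is hypothetical):

```go
package main

import "fmt"

// pagesNeeded computes how many pages a payload of the given size
// occupies; everything beyond the first page is an overflow page.
func pagesNeeded(size, pageSize int) int {
	return (size + pageSize - 1) / pageSize
}

func main() {
	pageSize := 4096
	for _, sz := range []int{1000, 4096, 9000} {
		p := pagesNeeded(sz, pageSize)
		fmt.Printf("size=%d pages=%d overflow=%d\n", sz, p, p-1)
	}
	// prints:
	// size=1000 pages=1 overflow=0
	// size=4096 pages=1 overflow=0
	// size=9000 pages=3 overflow=2
}
```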

compressedPages := (compressedTotalSize + pageSize - 1) / pageSize

// Only use compression if it actually reduces the page count.
if compressedPages >= uncompressedPages {
Member

In which cases can compression increase the size?

Contributor Author

Imagine a buffer filled with random data: the compression algorithm's overhead (header/checksum/dictionary, etc.) will then exceed any savings, so the output is larger than the input.

@ahrtr
Member

ahrtr commented Feb 12, 2026

Running the api-intensive workload, we find:

without compression as baseline:

  • CPU usage at about 28%
  • The run takes about 50 MB of storage

with compression:

  • CPU usage at around 35%
  • The entire run takes only about 25 MB of storage (2x improvement)

Have you compared the impact on throughput and latency?

At a high level, I feel the implementation is tightly coupled with the existing logic, which may not be ideal from a maintenance perspective. I’m wondering whether we could adopt a pipeline-based approach to minimize the impact on the current design and implementation.

@tjungblu
Contributor Author

tjungblu commented Feb 12, 2026

Have you compared the impact on throughput and latency?

Not yet. I'll create one of those rw-heatmaps in etcd and post it back here in a bit. (EDIT: forgot that this takes a whole day to run, will report tomorrow)

At a high level, I feel the implementation is tightly coupled with the existing logic, which may not be ideal from a maintenance perspective. I’m wondering whether we could adopt a pipeline-based approach to minimize the impact on the current design and implementation.

Agreed. Where would you see such a pipeline hooking in best?

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
@serathius
Member

serathius commented Feb 13, 2026

Maybe we should align on compression algorithms with kubernetes/kubernetes#136030. The current recommendation for K8s is also Snappy (implemented via s2).

@serathius
Member
Member

Also, Snappy gives us an easy CRC-32C checksum for free, which we otherwise wanted to introduce for corruption detection.

Note that there is a difference between corruption detection for the contents and for the btree structure. Both are important.

fs.BoolVar(&o.goBenchOutput, "gobench-output", false, "")
fs.IntVar(&o.pageSize, "page-size", common.DefaultPageSize, "Set page size in bytes.")
fs.IntVar(&o.initialMmapSize, "initial-mmap-size", 0, "Set initial mmap size in bytes for database file.")
fs.BoolVar(&o.compression, "compression", false, "Enables compression.")
Member

At the container runtime level, we (containerd) use zstd at the image and filesystem (erofs) level. :)
Thanks for sharing the idea. 👍 We can check the decompression speed for this case.

@tjungblu
Contributor Author

Sorry for the long wait; I had to retry this several times.
I ran the rw-heatmap benchmark script on GCP on a C4-series machine (8 cores, 30 GB RAM) with etcd 3.6.5, with only the bbolt dependency replaced and compression set to true.

read
bbolt_compression_read

write
bbolt_compression_write

here are the raw results:
baseline_result-202602180756.csv
bbolt_page_compression_result-202602180754.csv

@tjungblu
Contributor Author

tjungblu commented Mar 5, 2026

Very quick comparison using zstd, as per @fuweid's recommendation.

Code is in a separate branch here: tjungblu@43b5742

bbolt bench

Using the bench comparison script against the above zstd branch with compression enabled by default, we get the following performance numbers:

        │    BASE     │                HEAD                 │
        │   sec/op    │   sec/op     vs base                │
Write     520.5n ± 2%   598.0n ± 4%  +14.90% (p=0.019 n=10)
Read      10.77n ± 2%   13.11n ± 4%  +21.68% (p=0.000 n=10)
geomean   74.87n        88.53n       +18.24%

OpenShift installation

Comparing the storage size with our latest 4.22 nightly, a fully installed OpenShift cluster now needs only 30 MB of storage, down from about 100 MB without compression and 48 MB with Snappy.

Kube Burner

api-intensive

Running the api-intensive workload, we find:

without compression as baseline:

  • CPU usage at about 28%
  • The run takes about 50 MB of storage

with snappy compression:

  • CPU usage at around 35%
  • The entire run takes only about 25 MB of storage (2x improvement)

with zstd compression:

  • CPU usage at around 35%
  • The entire run takes only about 15 MB of storage (3x improvement over no compression; 40% smaller than with Snappy)

crd-scale

With crd-scale we have seen:

without compression:

  • CPU usage around 30%
  • uses 20 MB of storage

with snappy compression:

  • CPU usage at around 40%
  • only 5 MB of storage used (4x improvement)

with zstd compression:

  • CPU usage at around 40%
  • The entire run takes only about 3 MB of storage (6x improvement over no compression; 40% smaller than with Snappy)

CPU usage is measured with sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{pod=~"etcd-ci.*master.*"}) by (pod), taking the worst-performing node. Database size is taken from the etcd_mvcc_db_total_size_in_use_in_bytes metric.


So it seems that zstd would also be the best bang for the buck here.
I'll go and create a separate PR (#1159) if you'd like to merge this behind an option toggle. It will likely need some more tooling as a follow-up to decompress again (e.g. on downgrades).
