Skip to content

oci: Add varlink APIs using "splitdirfdstream"#309

Draft
cgwalters wants to merge 2 commits into
composefs:mainfrom
cgwalters:splitdirfdstream
Draft

oci: Add varlink APIs using "splitdirfdstream"#309
cgwalters wants to merge 2 commits into
composefs:mainfrom
cgwalters:splitdirfdstream

Conversation

@cgwalters

Copy link
Copy Markdown
Collaborator

First, I discovered that actually fd-passing with varlink generally works well, and I was misguided in thinking we needed jsonrpc-fdpass.

Almost: one issue is that varlink doesn't have good support for passing a lot of file descriptors (which jsonrpc-fdpass was designed to handle).

But upon some reflection, I realized we don't need to pass a file descriptor per file, all use cases here are fine with a directory fd plus filename.

So here a new data stream format "splitdirfdstream" is implemented.

We first now use that internally when we're doing a direct pull from containers-storage for reflinking/hardlinkling.

But better: let's expose that data concept over varlink, where a varlink client can both pull or push container image layers that way.

This paves the way to a very clear mechanism for us to integrate with containers-storage or other storage stacks (like containerd) in an agnostic way.

We also now support cfsctl oci copy to copy across composefs repositories which is also implemented this way.

Generated-by: OpenCode (Claude Opus 4.8)

@cgwalters cgwalters requested a review from giuseppe June 5, 2026 16:20
@giuseppe

giuseppe commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator

how will this work with podman-container-tools/container-libs#651 ?

I don't think the container-tools are going to add again varlink as a dependency for this use case only

@cgwalters

Copy link
Copy Markdown
Collaborator Author

One thing to bear in mind is there's two levels to this, the splitdirfdstream (replacing splitfdstream) and the higher level IPC mechanism by which that is passed around as representations of OCI layers. Only the latter involves varlink, and we could choose to keep using jsonrpc-fdpass for the latter in container-tools if you prefer.

Sorry about the proposed strategy change, but as I dug in more it just feels right - and I somehow again just missed the fd passing support in the varlink ecosystem. I guess one thing that changed is zlink is relatively new, and is well maintained and good code. (The varlink/rust project went through a messy time)

Would varlink be heavier than jsonrpc-fdpass? Hmm, let me see...I spent some tokens on this

varlink/go vs jsonrpc-fdpass-go: binary size comparison

🤖 Assisted-by: OpenCode (Claude Sonnet 4.6)

Measured against the containers-storage binary built from origin/main with
-ldflags="-s -w" -tags exclude_graphdriver_btrfs. Each branch adds an
equivalent stub service (listen → accept → receive → reply) in
cmd/containers-storage/rpc_bench_stub.go, guarded by a package-level var so
the linker cannot dead-code-eliminate it. Branches: bench/varlink,
bench/jsonrpc-fdpass.

Dependency Size (KiB) Δ KiB Δ % Δ packages
baseline 8,003.8 244
+varlink/go 8,071.9 +68.1 +0.851% 246
+jsonrpc-fdpass-go 8,083.9 +80.1 +1.001% 245

With a realistic service stub, both libraries are similar in weight (~68–80 KiB).
varlink/go is slightly smaller despite pulling in 2 extra packages vs 1,
likely because jsonrpc-fdpass-go depends on golang.org/x/sys for
syscall.RawConn fd passing whereas varlink/go has no external dependencies.


That said, varlink/go#43 needs doing.

@cgwalters

Copy link
Copy Markdown
Collaborator Author

Only the latter involves varlink, and we could choose to keep using jsonrpc-fdpass for the latter in container-tools if you prefer.

Or, it'd probably work to use gRPC for most metadata/controlplane but pass fds over a separate negotiated socket. I don't have a really strong opinion.


One other thing I'd say here that I think is a big cleanup is that our parsing of containers-storage: layouts now always goes through splitdirfdstream(+varlink) as an intermediary - we've added an abstraction layer there that we're now always testing. This helps pave the way much more clearly to plugging in an external binary.

Also the converse is now true - it should now be straightforward for external tooling to push content into composefs-rs in an efficient way.

@giuseppe

giuseppe commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator

That said, varlink/go#43 needs doing.

This is a blocker for the containers/storage integration. I have no preference whether it is varlink or jsonrpc, but should we hold this until we know we can use it in containers/storage too?

Comment on lines +19 to +20
- **External chunk** — `length` bytes of a file, resolved via
`openat(dirfds[index], filename)` and read from offset 0.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will the producer or consumer take care of metacopy files with overlay?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! It was a bit confusing to me, we don't default to metacopy=on in the overlay driver, but we do set it in containers-common in https://github.com/containers/common/blob/a5ccdae846b629b5ceaefa6ffd5c6511409c3487/rpm/update-config-files.sh#L31

We should now handle that on the producer side.

On the consumer side - we don't need to care about metacopy for writing to composefs(-rs) - we get natural object-content based deduplication.

But I think you're pointing out something interesting in that conceptually splitdirfdstream is "lossy" in that this will just appear there as a full copy up.

So a theoretical consumer side of this in the default overlay driver mode we'll end up with a full file copy, not a metacopy from a lower layer.

But on the other hand - that's also what must happen when pushing to a registry, and so far this protocol is trying to conceptually be "an isolated tar layer, but with ability to reflink content".

So...if we got to the point of using this protocol to fix podman-container-tools/container-libs#144 then we would probably want to add some sort of "hint" that a particular file is just a metacopy?

OTOH, the zstd:chunked code path in containers/storage will naturally handle this (of course this gets into the "convert existing images" issue) - but if we can agree that really the more medium-term direction is a composefs based object store, then we don't need to worry about this in that medium term.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not worried about fully copy vs metacopy as that is only an optimization, but that we get the fd to the correct file and not the 0-filled file. Thinking more of it, it should be handled by the producer and not expose this internal detail.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is fixed now, and is handled by the producer.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so do we need to pass one fd for each of the underlying layers (plus the current one)?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another problem I thought of:

Having an open file descriptor only for the parent directory puts us in a race condition. If the layer gets deleted while we're iterating, the files disappear and we have no reference to them. Holding an open file descriptor to each file (or consuming the old-style tar stream) would avoid this problem, since the layer is only locked for the duration of the code running inside the containers/storage library

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that won't work with additional image stores though. We could have layers on a different store. One fd per layer seems like a reasonable compromise to not model it too much around the way c/storage stores its layers

The protocol absolutely supports passing multiple FDs! That's intentional to keep it flexible. It just doesn't support passing more FDs than can fit into a single SCM_CREDENTIALS.

It should work to have one FD per AIS right?

Unless it's more than ~200, which sounds kind of crazy, but...perhaps it is a good idea to not entirely tie our hands to that Linux SCM_CREDENTIALS limit. We can fix this by having multiple replies on the varlink side for FDs.

Note! This is not a limitation of the splitdirfdstream protocol itself because it only refers to abstract integers that point into an externally passed FD set.

Having an open file descriptor only for the parent directory puts us in a race condition. If the layer gets deleted while we're iterating,

I think there's only a race on the last message right? When the client gets the return value from the server but hasn't processed the files there.

Hmm...

Here's an idea, what if the server always passed a pipe() as an additional fd? The protocol is that the client must only close that pipe when it is fully done processing.

Then the server monitors that pipe, and tears down any server-side resources (locks, mounts) only when that pipe is closed.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it feels a bit hacky but I can't imagine any other way that doesn't require some sort of locking

Would it be any risk of a deadlock if anything goes wrong on the receiver side?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be any risk of a deadlock if anything goes wrong on the receiver side?

Yes - but we're at the same risk of deadlock if the client just goes into an infinite loop instead of reading the reply messages, right?

I don't think there's any way at all to avoid this. We could have a (long) timeout, but I think that kind of thing is better done at a higher level.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK! I completely reworked this again per above discussion.

However, there's some more code duplication I need to remove, working on that

@cgwalters

Copy link
Copy Markdown
Collaborator Author

This is a blocker for the containers/storage integration. I have no preference whether it is varlink or jsonrpc, but should we hold this until we know we can use it in containers/storage too?

I generated a PR in varlink/go#44

I think we don't need to block this strictly speaking, it seems in the end better for us to temporarily use a patched varlink/go than to use the custom jsonrpc-fdpass right? Also, if we can take this track to finally replace the current experimental-image-proxy protocol with one thing it'd be overall a large win.

@giuseppe

giuseppe commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

I think we don't need to block this strictly speaking, it seems in the end better for us to temporarily use a patched varlink/go than to use the custom jsonrpc-fdpass right? Also, if we can take this track to finally replace the current experimental-image-proxy protocol with one thing it'd be overall a large win.

@mheon are you fine with that?

@mheon

mheon commented Jun 8, 2026

Copy link
Copy Markdown

Sorry for not following this one closely. Are we talking about doing a hard dependency on Varlink in order to use composefs with c/storage? Do we know what that's going to do to our binary size?

@giuseppe

giuseppe commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Sorry for not following this one closely. Are we talking about doing a hard dependency on Varlink in order to use composefs with c/storage? Do we know what that's going to do to our binary size?

yes, to add it as the RPC to communicate between Go and Rust.

@cgwalters made a comparison here:

#309 (comment)

…timeout

The integration CI jobs were hitting GitHub's 6-hour job limit, which
caused silent cancellation rather than a real test failure. Two issues
combined to cause this:

1. nextest's integration profile had terminate-after = 60, meaning a
   single hung test could run for 1200 × 60 = 72 000 s (20 h) before
   nextest force-killed it, far exceeding the 6-hour GHA limit.

2. The integration job had no explicit timeout-minutes, so a stalled
   job wasn't surfaced as a clear timeout failure.

Fix both: drop terminate-after to 1 (each test gets exactly 20 min before
nextest terminates it, which is generous for VM boot + test execution), and
add timeout-minutes: 90 to the integration job so any build or runner hang
fails cleanly rather than silently burning 6 hours.

Assisted-by: OpenCode (Claude Sonnet 4.6)
Signed-off-by: Colin Walters <walters@verbum.org>
@cgwalters cgwalters requested a review from giuseppe June 9, 2026 12:11
@cgwalters

Copy link
Copy Markdown
Collaborator Author

OK, this one I think is ready for review/merge.

@cgwalters cgwalters enabled auto-merge June 9, 2026 12:46
@mheon

mheon commented Jun 9, 2026

Copy link
Copy Markdown

I don't think we have a fundamental disagreement with Varlink as an IPC protocol, though I am worried about protocol design. What kind of long term stability are we expecting with this? Are we going to have to do a hard pinning of composefs-rs against c/storage versions to ensure both ends are talking the same version of the protocol?

@cgwalters

Copy link
Copy Markdown
Collaborator Author

Are we going to have to do a hard pinning of composefs-rs against c/storage

Right, this is a confusing topic. At the current time, containers-storage uses /usr/bin/mkcomposefs from https://github.com/composefs/composefs - the C implementation.

We have an effort to fully replace that project with this one, including a new Rust mkcomposefs that has landed here that aims to be 100% compatible.

The trajectory is then to ship the composefs package with this project.

But that's not (directly) related to this PR or the PR Giuseppe is working on, which aim to expose varlink APIs on both ends.

I think the trajectory that would help the most is to replace skopeo --experimental-image-proxy (that's used by bootc/ostree) with a varlink-based interface, that's what's going on in podman-container-tools/container-libs#651 - but that needs updating to match this.

In that flow, it's more the other way around again: bootc/composefs(-rs) would be calling into skopeo ➡️ container-libs via varlink.

First, I discovered that actually fd-passing with varlink generally
works well, and I was misguided in thinking we needed jsonrpc-fdpass.

Almost: one issue is that varlink doesn't have good support
for passing *a lot* of file descriptors (which jsonrpc-fdpass was
designed to handle).

But upon some reflection, I realized we don't need to pass a file
descriptor per file, all use cases here are fine with a directory fd
plus filename.

So here a new data stream format "splitdirfdstream" is implemented.

We first now use that *internally* when we're doing a direct pull
from containers-storage for reflinking/hardlinkling.

But better: let's expose that data concept over varlink, where
a varlink client can both pull or push container image layers
that way.

This paves the way to a very clear mechanism for us to integrate
with containers-storage or other storage stacks (like containerd)
in an agnostic way.

We also now support `cfsctl oci copy` to copy across composefs repositories
which is also implemented this way.

Generated-by: OpenCode (Claude Opus 4.8)
Signed-off-by: Colin Walters <walters@verbum.org>
@cgwalters cgwalters marked this pull request as draft June 10, 2026 19:24
auto-merge was automatically disabled June 10, 2026 19:24

Pull request was converted to draft

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants