Cherry pick commits for v1.18.1 #682

Merged: slp merged 10 commits into containers:stable-1.18.x from slp:cherry-pick-1.18.1 on May 20, 2026

slp (Collaborator) commented May 18, 2026

This PR cherry picks some commits from main to be included in v1.18.1.

mz-pdm and others added 9 commits May 18, 2026 17:20
Some applications check for network availability by looking for a
network device configured for Internet access.  When TSI is used, there
is no such device available by default, although Internet is accessible.
Those applications then behave as if no connection were
available.

Let's solve this problem by setting up a dummy network interface.  The
dummy interface is automatically created when CONFIG_DUMMY is enabled in
the kernel or the corresponding kernel module is loaded.  This means a
sufficiently recent libkrunfw version is needed (see
containers/libkrunfw#116).  The dummy interface
is initially down.

In order to make the applications happy, the interface must be brought
up and set up for Internet connections.  This is ensured by setting the
IP address to 10.0.0.1/8 (an arbitrary choice without any special
reason) in init.c if TSI is enabled.  The /8 netmask is a deliberately
modest choice: it doesn't cover the whole IP range, and we cannot set a
default route because that causes problems with TSI, but it is enough
for the tested application.  We can change it if some application has
trouble with it.

TSI availability is determined by checking for the presence of
`tsi_hijack` in the kernel command line, before the `--` delimiter if
one is present.
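
For illustration, the command-line check described above could be sketched as follows. This is written in Rust to match the rest of the code base and is not the C code from init.c; the function name `tsi_enabled` and the decision to match `tsi_hijack` as a whole whitespace-separated token are assumptions.

```rust
/// Hypothetical helper: returns true when `tsi_hijack` appears in the kernel
/// command line before the `--` delimiter (or anywhere, if there is no `--`).
fn tsi_enabled(cmdline: &str) -> bool {
    cmdline
        .split_whitespace()
        // Arguments after `--` are passed to init, so stop scanning there.
        .take_while(|arg| *arg != "--")
        // Assumption: the flag is matched as an exact token.
        .any(|arg| arg == "tsi_hijack")
}
```

With this sketch, `tsi_enabled("reboot=k tsi_hijack -- /bin/sh")` is true, while a `tsi_hijack` that only appears after `--` is ignored.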

The dummy interface simply swallows all packets.  But it is effectively
bypassed by TSI for practical purposes.  Things like ICMP don't work in
either case.

When the kernel support is not available, the device is not present and
init.c cannot set it up.  We skip the configuration silently in such a
case, to not spam users with errors if they use older libkrunfw or
custom kernels.

Fixes: containers#576

Signed-off-by: Milan Zamazal <mzamazal@redhat.com>
(cherry picked from commit 2593acc)
Signed-off-by: Sergio Lopez <slp@redhat.com>

Fixes containers#650.

Signed-off-by: Sven-Hendrik Haase <svenstaro@gmail.com>
(cherry picked from commit e12b9b3)
Signed-off-by: Sergio Lopez <slp@redhat.com>

In commit c0e42fb0e0 ("vsock/virtio: cap TX credit to local buffer
size") the kernel stopped honoring our peer_buf_alloc value, capping
it to its own.

Use the kernel's peer_buf_alloc instead of CONN_TX_BUF_SIZE as a hint
of when we need to send a credit update.
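
As a rough illustration of the idea (not the actual libkrun code), the decision to send a credit update can be keyed to the buffer size the kernel advertises rather than to a local constant. The function name, the parameter names other than `peer_buf_alloc`, and the half-window threshold are all assumptions made for this sketch.

```rust
/// Hypothetical sketch: decide whether a credit update is due.
///
/// `fwd_cnt` is the running count of bytes consumed so far and
/// `last_fwd_cnt_sent` is the value last reported to the peer. Both are
/// free-running 32-bit counters, so the difference uses wrapping arithmetic.
fn needs_credit_update(fwd_cnt: u32, last_fwd_cnt_sent: u32, peer_buf_alloc: u32) -> bool {
    let unreported = fwd_cnt.wrapping_sub(last_fwd_cnt_sent);
    // Assumed threshold: update once half of the kernel's window is unreported.
    unreported >= peer_buf_alloc / 2
}
```

The point of the sketch is only that the threshold scales with the kernel's value; the real threshold used by the device may differ.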

Signed-off-by: Sergio Lopez <slp@redhat.com>
(cherry picked from commit 4b5b451)
Signed-off-by: Sergio Lopez <slp@redhat.com>

We need to bind to the correct socket types (IPv6, Unix) instead of only IPv4.
This fixes UDP and Unix dgram tests hanging while waiting for a reply.

Reported-by: Jan Noha <nohajc@gmail.com>

Signed-off-by: Matej Hrica <mhrica@redhat.com>
(cherry picked from commit 4380b32)
Signed-off-by: Sergio Lopez <slp@redhat.com>

The cross_domain `write` handler only matched
`CrossDomainItem::WaylandWritePipe`, falling into the catch-all for
every other item type after the unconditional `remove()` at the top of
the function had already dropped the entry from the table. PipeWire
(and other clients that share host-created eventfds via SCM_RIGHTS for
per-period wakeups) sends CMD_WRITE on those eventfd identifiers — the
first such write returns InvalidCrossDomainItemType *after* removing
the item, and every subsequent write on the same identifier returns
InvalidCrossDomainItemId, masquerading on the guest as the opaque
VIRTIO_GPU_RESP_ERR_UNSPEC (0x1200).

Reproduced inside a libkrun guest on x86_64 by routing PipeWire's
ALSA-shim audio through a host PipeWire daemon. With a paired BT
speaker as the sink, `speaker-test -D pipewire -c2 -t wav -l 1`
produces, per stream, ~10 entries of:

    [    0.682819] [drm:virtio_gpu_dequeue_ctrl_func] *ERROR* response 0x1200 (command 0x207)
    [    0.723762] [drm:virtio_gpu_dequeue_ctrl_func] *ERROR* response 0x1200 (command 0x207)
    [    0.767615] [drm:virtio_gpu_dequeue_ctrl_func] *ERROR* response 0x1200 (command 0x207)
    [    0.807779] [drm:virtio_gpu_dequeue_ctrl_func] *ERROR* response 0x1200 (command 0x207)
    [    0.852469] [drm:virtio_gpu_dequeue_ctrl_func] *ERROR* response 0x1200 (command 0x207)
    [    0.896552] [drm:virtio_gpu_dequeue_ctrl_func] *ERROR* response 0x1200 (command 0x207)
    [    0.936504] [drm:virtio_gpu_dequeue_ctrl_func] *ERROR* response 0x1200 (command 0x207)
    [    0.980567] [drm:virtio_gpu_dequeue_ctrl_func] *ERROR* response 0x1200 (command 0x207)
    [    1.024476] [drm:virtio_gpu_dequeue_ctrl_func] *ERROR* response 0x1200 (command 0x207)
    [    1.064636] [drm:virtio_gpu_dequeue_ctrl_func] *ERROR* response 0x1200 (command 0x207)

with audio still playing — PipeWire has socket-based fallback timing
that doesn't depend on eventfd ack, so the failures are cosmetic for
playback. They are not cosmetic for clients that strictly require the
eventfd handshake (PipeWire's ALSA shim under heavier loads, and the
buffer-pool wakeup path used by V4L2 capture streams).

Add an Eventfd arm that mirrors the WaylandWritePipe semantics:
`write_volatile` performs the 8-byte counter increment, and the item
is re-inserted into the table unless the guest signaled `hang_up`.
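
The remove-then-re-insert semantics described above can be modeled with a toy table. This is a minimal sketch under stated assumptions: the `Item`, `WriteError`, and `handle_write` names are hypothetical, the items hold plain values instead of host file descriptors, and the counter arithmetic only imitates an eventfd.

```rust
use std::collections::HashMap;

/// Toy stand-ins for the real items, which wrap host file descriptors.
#[derive(Debug, PartialEq)]
enum Item {
    WritePipe(Vec<u8>), // collects the bytes "written" to the pipe
    Eventfd(u64),       // models the eventfd counter
    Other,              // any item type that does not accept writes
}

#[derive(Debug, PartialEq)]
enum WriteError {
    InvalidItemId,
    InvalidItemType,
    InvalidLength,
}

fn handle_write(
    table: &mut HashMap<u32, Item>,
    id: u32,
    data: &[u8],
    hang_up: bool,
) -> Result<(), WriteError> {
    // The entry is removed up front, as in the handler described above.
    let item = table.remove(&id).ok_or(WriteError::InvalidItemId)?;
    let updated = match item {
        Item::WritePipe(mut buf) => {
            buf.extend_from_slice(data);
            Item::WritePipe(buf)
        }
        // The new arm: an 8-byte write increments the counter.
        Item::Eventfd(counter) => {
            if data.len() != 8 {
                // Assumption of this sketch: keep the item on a malformed write.
                table.insert(id, Item::Eventfd(counter));
                return Err(WriteError::InvalidLength);
            }
            let mut raw = [0u8; 8];
            raw.copy_from_slice(data);
            Item::Eventfd(counter.wrapping_add(u64::from_ne_bytes(raw)))
        }
        // Other item types still fail, and the entry stays removed.
        Item::Other => return Err(WriteError::InvalidItemType),
    };
    // Re-insert unless the guest signaled hang_up.
    if !hang_up {
        table.insert(id, updated);
    }
    Ok(())
}
```

In this model, repeated writes to an `Eventfd` entry keep succeeding, which is the behavior the missing arm previously prevented.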

Verified post-fix: zero CMD_WRITE failures, zero `0x1200` entries in
guest dmesg, audio playback unchanged. Camera capture (gst-launch
pipewiresrc → MJPEG) also exercises this path for buffer-pool wakeups
and runs cleanly with valid 98%-non-zero JPEG frames.

Signed-off-by: Adam Ford <adam.ford@anodize.com>
(cherry picked from commit 5835d52)
Signed-off-by: Sergio Lopez <slp@redhat.com>

The unescape_string() function in init.c, which handles JSON escape
sequences when parsing environment variables from .krun_config.json,
had a bug where the pointer 'val' was not advanced past the escape
character after processing it.

When encountering a two-character JSON escape sequence like \n or \",
the switch statement pre-increments val to point at the escape
character (e.g., 'n' or '"') and writes the unescaped byte to the
output. However, it never advances val past that character. On the
next loop iteration, the character is not a backslash, so it gets
copied again as a literal character.

This causes:
 - \n (JSON-escaped newline) to produce a newline followed by a
   literal 'n'
 - \" (JSON-escaped double quote) to produce two double quotes

For example, an environment variable set to a JSON string like:

  {\"key\": \"value\"}

would be rendered inside the krun VM as:

  {""key"": ""value""}

Fix this by adding val++ after writing the unescaped character in
each case of the switch statement. The 'u' (unicode) case already
handles its own pointer arithmetic and is not affected.
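
To make the cursor handling concrete, here is a small Rust analogue of the fixed loop. It is not the C code from init.c: the name `unescape_simple` and the exact set of handled escapes are assumptions, and `\u` sequences are deliberately left out of scope and copied through unchanged.

```rust
/// Hypothetical sketch: unescape two-character JSON escapes such as \n or \".
fn unescape_simple(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'\\' && i + 1 < bytes.len() {
            let mapped = match bytes[i + 1] {
                b'n' => Some(b'\n'),
                b't' => Some(b'\t'),
                b'r' => Some(b'\r'),
                b'"' => Some(b'"'),
                b'\\' => Some(b'\\'),
                b'/' => Some(b'/'),
                _ => None, // unknown escapes (including \u) are copied as-is
            };
            if let Some(byte) = mapped {
                out.push(byte);
                // Skip the backslash AND the escape character. Advancing by
                // only one here would copy the escape character a second time,
                // which is the bug described above.
                i += 2;
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    String::from_utf8_lossy(&out).into_owned()
}
```

Applied to the example from the commit message, `{\"key\": \"value\"}` comes out as `{"key": "value"}` rather than with doubled quotes.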

Fixes: containers#678
Assisted-by: <anthropic/claude-opus-4.6>
Signed-off-by: Dusty Mabe <dusty@dustymabe.com>
(cherry picked from commit 60fe4f6)
Signed-off-by: Sergio Lopez <slp@redhat.com>

Building with musl fails with:

error[E0412]: cannot find type `statx` in crate `libc`
   --> src/devices/src/virtio/fs/linux/passthrough.rs:187:39

The musl_v1_2_3 cfg allows builds targeting musl 1.2.3 or newer to use statx.

rust-lang/libc/src/unix/linux_like/mod.rs#L264-L269:

cfg_if! {
    if #[cfg(any(
        target_env = "gnu",
        target_os = "android",
        all(target_env = "musl", musl_v1_2_3)
    ))] {

Fixes: containers#431
Signed-off-by: Pepper Gray <hello@peppergray.xyz>
(cherry picked from commit c9c92e3)
Signed-off-by: Sergio Lopez <slp@redhat.com>

The previous write() held item_state.lock() across the write_volatile call
that performs the actual syscall, serializing every cross-domain CMD_WRITE
behind any other operation on the items table — including add_item from
process_receive, which fires for every new fd received via SCM_RIGHTS as
guests open additional channels.

For per-period eventfd wakeups (e.g. PipeWire audio streams signaling host
playback every audio period), this means the write completes only after any
in-flight item_state operation finishes. Under stream-create churn — many
guest applications opening streams concurrently, each delivering new fds via
SCM_RIGHTS that hit add_item under the same lock — the wait can exceed the
audio period budget and produce missed-deadline glitches at the host's
audio output.

This change shrinks the critical section to a brief fd dup() under the lock,
performs the syscall lock-free, and only re-acquires the lock if hang_up
indicates the item should be removed. In the common case (hang_up == 0,
e.g. repeated eventfd wakeups for an active stream) the table is no longer
touched per write, eliminating both the contention and the previous
"remove + conditional re-insert" churn.

Behavioral changes vs the old code:
- Common case (hang_up == 0): item stays in the table; we hand out a dup'd
  fd for the write. Net behavior identical, lock hold time bounded by dup().
- hang_up == 1: item removed in a separate phase after the write, instead
  of "removed unconditionally then re-inserted on hang_up == 0". Same
  observed end state.
- Concurrent writes to the same id (no longer serialized): the kernel
  guarantees atomicity for eventfd writes (8 bytes) and pipe writes
  <= PIPE_BUF, which are the only two CrossDomainItem variants this branch
  handles. Each caller dup's its own fd and writes through it independently.
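
The three-phase pattern described above (duplicate under the lock, write without it, re-lock only on hang-up) can be sketched with the standard library alone. This is a simplified model, not the libkrun implementation: `ItemTable` is a hypothetical type, it stores plain `File` values instead of `CrossDomainItem` variants, and `File::try_clone` stands in for the dup() step.

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{self, Write};
use std::sync::Mutex;

/// Hypothetical stand-in for the item table.
struct ItemTable {
    items: Mutex<HashMap<u32, File>>,
}

impl ItemTable {
    fn write(&self, id: u32, data: &[u8], hang_up: bool) -> io::Result<()> {
        // Phase 1: hold the lock only long enough to duplicate the descriptor.
        let items = self.items.lock().unwrap();
        let mut dup = items
            .get(&id)
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, "unknown item id"))?
            .try_clone()?;
        drop(items); // lock released before any I/O happens

        // Phase 2: the potentially slow syscall runs without the lock.
        dup.write_all(data)?;

        // Phase 3: touch the table again only when the guest hangs up.
        if hang_up {
            self.items.lock().unwrap().remove(&id);
        }
        Ok(())
    }
}
```

Because the duplicated descriptor refers to the same open file description, writes through it reach the same pipe or eventfd as the original, while other threads remain free to add or remove table entries during the write.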

Verified with a synthetic reproducer: a sustained 1 kHz sine playing through
a guest PipeWire stream while ~200 short-lived guest streams open and close
concurrently (each issuing SCM_RIGHTS for new eventfds, contending with the
sustained stream's per-period writes for item_state). Capturing the host
sink monitor and counting sample-to-sample deltas exceeding a clean-sine
threshold, the stress workload produced ~10 distinct glitch bursts in an
8-second capture before this change, and zero across five consecutive runs
after.

Signed-off-by: Adam Ford <adam.ford@anodize.com>
(cherry picked from commit d904143)
Signed-off-by: Sergio Lopez <slp@redhat.com>

from_tx_virtq_head assumed a fixed 2-descriptor layout (header + one
data descriptor), which breaks with newer kernels that may combine
header and data in a single descriptor (Linux 6.2+) or split data
across multiple descriptors.

Handle three cases: single combined descriptor (zero-copy), classic
two-descriptor (zero-copy, unchanged), and multi-descriptor data
(copied into an owned contiguous buffer). The RX path already handled
the combined case; this brings the TX path to parity and beyond.
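
The three layouts can be illustrated with a sketch that works on plain byte slices instead of guest memory descriptors. The function name `split_tx_chain` and the slice-based interface are hypothetical, and the sketch assumes that in multi-descriptor chains the first descriptor holds exactly the header; 44 bytes is the size of `struct virtio_vsock_hdr`.

```rust
use std::borrow::Cow;

/// Size of `struct virtio_vsock_hdr` in bytes.
const VSOCK_HDR_LEN: usize = 44;

/// Hypothetical sketch: split a TX chain into header and payload.
/// Returns `None` for layouts this sketch does not understand.
fn split_tx_chain<'a>(descs: &[&'a [u8]]) -> Option<(&'a [u8], Cow<'a, [u8]>)> {
    match descs {
        // Case 1: header and data combined in one descriptor (zero-copy).
        [only] if only.len() >= VSOCK_HDR_LEN => {
            let only: &'a [u8] = *only;
            let (hdr, data) = only.split_at(VSOCK_HDR_LEN);
            Some((hdr, Cow::Borrowed(data)))
        }
        // Case 2: classic header + one data descriptor (zero-copy).
        [hdr, data] if hdr.len() == VSOCK_HDR_LEN => Some((*hdr, Cow::Borrowed(*data))),
        // Case 3: data split across several descriptors; copy it into one
        // owned contiguous buffer.
        [hdr, rest @ ..] if hdr.len() == VSOCK_HDR_LEN => {
            Some((*hdr, Cow::Owned(rest.concat())))
        }
        _ => None,
    }
}
```

Using `Cow` keeps the two zero-copy cases borrowed and pays for an allocation only when the payload is actually fragmented.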

Assisted-by: Claude Code:claude-opus-4.6
Signed-off-by: Sergio Lopez <slp@redhat.com>
(cherry picked from commit 0ecf4d5)
Signed-off-by: Sergio Lopez <slp@redhat.com>

This is a patch version update for the stable-1.18.x series.

Signed-off-by: Sergio Lopez <slp@redhat.com>
slp force-pushed the cherry-pick-1.18.1 branch from b44b1b2 to 3411afd on May 18, 2026 16:23

mtjhrc (Collaborator) left a comment

LGTM!

slp merged commit b7e43f0 into containers:stable-1.18.x on May 20, 2026
14 checks passed
slp deleted the cherry-pick-1.18.1 branch on May 20, 2026 15:54
7 participants