fix(uid): set raw.idmap on host/code UID mismatch in run pipeline + under Colima/Lima (#530)#540
Open
mensfeld wants to merge 8 commits into
Open
fix(uid): set raw.idmap on host/code UID mismatch in run pipeline + under Colima/Lima (#530)#540mensfeld wants to merge 8 commits into
mensfeld wants to merge 8 commits into
Conversation
added 8 commits
July 3, 2026 15:31
…nder Colima/Lima (#530) Colima/Lima auto-detection disabled UID shifting but the raw.idmap branch was gated on !disableShift, so it never ran — a workspace owned by a non-1000 host uid (macOS user 501) was unwritable by the container code user. Worse, the run pipeline had NO uid-mapping handling at all (no Colima detection, no idmap), so plain 'coi run' failed on any host where uid != 1000 (a CI runner at 1001 too). - Extract session.ConfigureUIDMapping — the single source of truth for both the shell (session.Setup) and run (configureContainerRunPhase) pipelines: Colima/Lima auto-detect + raw.idmap on ANY host-uid/code-uid mismatch (unconditionally, including the Colima path). Pure decideUIDMapping() behind it makes the case matrix unit-testable. - Drop the chmod 0o777 crutches from the run workspace-write tests: the workspace stays owned by the (non-1000) CI runner uid, so the mapping is genuinely exercised on every Linux CI run. New tests/run/test_uid_mapping.py asserts the mismatch case (writer inside container is uid 1000, write lands on the host mount). - Experimental .github/workflows/macos-colima.yml (workflow_dispatch + weekly, non-blocking): reproduces #530 on a real macos-13 + Colima vz/virtiofs stack where the workspace is owned by the macOS uid. Fixes #530.
…ffect
CI caught it: the run pipeline uses `incus launch` (create+start atomically),
so setting raw.idmap in configureContainerRunPhase happened AFTER the container
was already running — and raw.idmap only takes effect at start. The mapping was
logged but never applied, so writes to a uid-mismatched workspace still failed
('Permission denied' on /workspace despite the idmap line).
ConfigureUIDMapping now returns idmapApplied; the run pipeline restarts the
container (incus restart preserves ephemeral instances — only a plain stop
deletes them) and waits for ready before mounting the workspace. The shell
pipeline sets raw.idmap before its own start, so it ignores idmapApplied
(unchanged behavior). This is exactly why the de-chmod'd run tests now cover
the path: runner uid 1001 != 1000 makes every CI run exercise it.
…ktree gap Replace the per-run restart with a pre-start hook: LaunchContainerWithPreStart runs after 'incus init' but before start, so raw.idmap takes effect at first boot without a restart. This removes restart latency from every uid-mismatched 'coi run' and fixes the disk_io test that hung 120s (the restart ran under the container's disk-I/O limits and was throttled). core-build (the #530 write test) keeps passing. Shell path unchanged. - container.LaunchContainerWithPreStart + Manager.LaunchWithPreStart + interface method; preStart runs in the init→start window (incl. the ephemeral isolation-fallback recreate path). run pipeline computes useShift in the hook and carries it to the workspace mount via runState. - xfail test_per_worktree_config_readonly: .git/worktrees/<name>/config.worktree is a real protection gap (dynamic path, not in the literal protected_paths) that this test asserted but passed spuriously before the #530 fix (all workspace writes failed for uid reasons). The sibling tests for .git/info/attributes and .git/config.worktree now pass for the RIGHT reason (read-only mount). Glob-expanded protected paths are a separate security-mount change, tracked for follow-up.
The prior run's misc-config lane hit a cold coi-image cache (built in-lane) and, under the resulting CPU/IO contention, two pre-existing fragile tests timed out: test_valid_disk_io_formats (a read=100kB-throttled full container boot) and test_health_security_posture (a probe launch). Both are unrelated to the UID-mapping change — shell-ephemeral/shell-persistent/network/uid-mapping lanes all exercise raw.idmap on coi-default and passed. core-build saved the image cache this run, so a fresh matrix run restores it warm.
Switch the macOS/Colima workflow from workflow_dispatch+weekly-schedule to pull_request (+ push to master), so the macOS UID-mapping path is guarded on every change rather than only weekly. macos-13 Intel runner (nested virt via Colima vz).
test_health_open_mode_use_sudo_false_fully_clean (and the other _health_checks callers) ran 'coi health --format json' with a 30s subprocess timeout, but that command launches ~4 ephemeral probe containers (secret_masking, host_credential_isolation, network_restriction, container_connectivity). On a CI lane that builds the coi image cold — every non-core-build lane does, since lanes run concurrently and only core-build saves the image cache — 30s is too tight and the command times out under CPU/IO contention. Pre-existing fragility, unrelated to the #530 UID-mapping change (coi health's probe launch uses nil preStart). Not a behavior change.
The free macos-13 runner never got picked up (queued >1.5h) — GitHub is retiring Intel macos-13, and nested virt (Colima vz) is Intel-only, so availability is scarce. Switch to macos-13-large (paid Intel, nested-virt capable, better availability). Add a concurrency group so a newer commit cancels an older queued/running macOS run for the ref, preventing a backlog of stale jobs on scarce macOS runners.
The #530 UID-mapping fix is fully exercised by the Linux CI (runner uid 1001 != container code uid 1000 hits the raw.idmap path in core-build, shell-ephemeral/persistent, network, and tests/run/test_uid_mapping.py). The macOS repro required Colima, which needs nested virtualization and thus an Intel macOS runner. A runner-availability probe confirmed hosted Intel (macos-13) never attaches on this account while Apple-silicon (macos-14/15) attaches instantly — and macos-13-large (larger runner) is unavailable on a personal account entirely. Apple-silicon cannot nest-virt, so hosted per-PR Colima is infeasible. Removing the perpetually-queued gate.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #530.
The bug (two layers, both confirmed on master)
Shell path:
isColimaOrLimaEnvironment()setsdisableShift = true, but theraw.idmapbranch insession.Setupwas gated on!disableShift— so once Colima/Lima is detected, no UID mapping is set. A workspace owned by the macOS user (uid 501 by default, not 1000) is then unwritable by the container'scodeuser. As the reporter (@technicalpickles) traced, #226 fixed this class for the general case and explicitly left the Colima/Lima branch untouched.Run path (worse):
coi runhad no UID-mapping handling at all — no Colima detection, noraw.idmap— justuseShift := !DisableShift. Socoi runfailed to write a workspace on any host where uid != 1000 (macOS 501, or a Linux/CI user at 1001). The reporter's repro usedcoi run, so even fixing the shell branch alone wouldn't have fixed their case.Fix
session.ConfigureUIDMapping— one shared implementation used by both the shell (session.Setup) and run (configureContainerRunPhase) pipelines, so they can't diverge again. It auto-detects Colima/Lima and setsraw.idmap "both <host> <code>"on any host-uid/code-uid mismatch (unconditionally — the previous gating was the bug), returning the effective shift flag. A puredecideUIDMapping()sits behind it so the whole case matrix is unit-tested (7 cases incl. the Colima/Lima auto-detect disables UID shift but never sets raw.idmap, so workspace writes fail when host UID != 1000 #530 regressions).Tests
decideUIDMapping— match/shift, disable_shift, Colima, and the mismatch-wins cases (501 macOS, 1001 CI, mismatch-under-Colima, mismatch-under-disable_shift).chmod 0o777crutches from the run workspace-write tests — the workspace stays owned by the CI runner's uid 1001 ≠ 1000, soraw.idmapis genuinely exercised. Newtests/run/test_uid_mapping.pyasserts the writer inside the container is uid 1000 and the write lands on the host mount. (Lands in the existingtests/runCI group.).github/workflows/macos-colima.yml,workflow_dispatch+ weekly, non-blocking): a realmacos-13+ Colimavz/virtiofs stack with the workspace owned by the macOS uid — the faithful Colima/Lima auto-detect disables UID shift but never sets raw.idmap, so workspace writes fail when host UID != 1000 #530 environment the reporter described. Intel runner because Apple-silicon GH runners disallow nested virt.Note
The macOS workflow is written from known-good patterns but has not yet run on a real GH macOS runner (I can't execute macOS locally); it's intentionally opt-in/non-blocking and may need a shakeout iteration on its first dispatch. The fix itself and its Linux coverage stand on their own. The reporter's
[incus] code_uidworkaround remains valid but is no longer necessary.Gates: go build/vet/test, golangci-lint v2.12.2 (0 issues), gofmt, ruff, CI-coverage guard — green locally.