fix(uid): set raw.idmap on host/code UID mismatch in run pipeline + under Colima/Lima (#530) by mensfeld · Pull Request #540 · mensfeld/code-on-incus

mensfeld · 2026-07-03T13:32:16Z

Fixes #530.

The bug (two layers, both confirmed on master)

Shell path: isColimaOrLimaEnvironment() sets disableShift = true, but the raw.idmap branch in session.Setup was gated on !disableShift — so once Colima/Lima is detected, no UID mapping is set. A workspace owned by the macOS user (uid 501 by default, not 1000) is then unwritable by the container's code user. As the reporter (@technicalpickles) traced, #226 fixed this class for the general case and explicitly left the Colima/Lima branch untouched.

Run path (worse): coi run had no UID-mapping handling at all — no Colima detection, no raw.idmap — just useShift := !DisableShift. So coi run failed to write a workspace on any host where uid != 1000 (macOS 501, or a Linux/CI user at 1001). The reporter's repro used coi run, so even fixing the shell branch alone wouldn't have fixed their case.

Fix

session.ConfigureUIDMapping — one shared implementation used by both the shell (session.Setup) and run (configureContainerRunPhase) pipelines, so they can't diverge again. It auto-detects Colima/Lima and sets raw.idmap "both <host> <code>" on any host-uid/code-uid mismatch (unconditionally — the previous gating was the bug), returning the effective shift flag. A pure decideUIDMapping() sits behind it so the whole case matrix is unit-tested (7 cases incl. the Colima/Lima auto-detect disables UID shift but never sets raw.idmap, so workspace writes fail when host UID != 1000 #530 regressions).

Tests

Go unit: decideUIDMapping — match/shift, disable_shift, Colima, and the mismatch-wins cases (501 macOS, 1001 CI, mismatch-under-Colima, mismatch-under-disable_shift).
Linux integration (runs every CI run): removed the chmod 0o777 crutches from the run workspace-write tests — the workspace stays owned by the CI runner's uid 1001 ≠ 1000, so raw.idmap is genuinely exercised. New tests/run/test_uid_mapping.py asserts the writer inside the container is uid 1000 and the write lands on the host mount. (Lands in the existing tests/run CI group.)
Experimental macOS repro (.github/workflows/macos-colima.yml, workflow_dispatch + weekly, non-blocking): a real macos-13 + Colima vz/virtiofs stack with the workspace owned by the macOS uid — the faithful Colima/Lima auto-detect disables UID shift but never sets raw.idmap, so workspace writes fail when host UID != 1000 #530 environment the reporter described. Intel runner because Apple-silicon GH runners disallow nested virt.

Note

The macOS workflow is written from known-good patterns but has not yet run on a real GH macOS runner (I can't execute macOS locally); it's intentionally opt-in/non-blocking and may need a shakeout iteration on its first dispatch. The fix itself and its Linux coverage stand on their own. The reporter's [incus] code_uid workaround remains valid but is no longer necessary.

Gates: go build/vet/test, golangci-lint v2.12.2 (0 issues), gofmt, ruff, CI-coverage guard — green locally.

…nder Colima/Lima (#530) Colima/Lima auto-detection disabled UID shifting but the raw.idmap branch was gated on !disableShift, so it never ran — a workspace owned by a non-1000 host uid (macOS user 501) was unwritable by the container code user. Worse, the run pipeline had NO uid-mapping handling at all (no Colima detection, no idmap), so plain 'coi run' failed on any host where uid != 1000 (a CI runner at 1001 too). - Extract session.ConfigureUIDMapping — the single source of truth for both the shell (session.Setup) and run (configureContainerRunPhase) pipelines: Colima/Lima auto-detect + raw.idmap on ANY host-uid/code-uid mismatch (unconditionally, including the Colima path). Pure decideUIDMapping() behind it makes the case matrix unit-testable. - Drop the chmod 0o777 crutches from the run workspace-write tests: the workspace stays owned by the (non-1000) CI runner uid, so the mapping is genuinely exercised on every Linux CI run. New tests/run/test_uid_mapping.py asserts the mismatch case (writer inside container is uid 1000, write lands on the host mount). - Experimental .github/workflows/macos-colima.yml (workflow_dispatch + weekly, non-blocking): reproduces #530 on a real macos-13 + Colima vz/virtiofs stack where the workspace is owned by the macOS uid. Fixes #530.

…ffect CI caught it: the run pipeline uses `incus launch` (create+start atomically), so setting raw.idmap in configureContainerRunPhase happened AFTER the container was already running — and raw.idmap only takes effect at start. The mapping was logged but never applied, so writes to a uid-mismatched workspace still failed ('Permission denied' on /workspace despite the idmap line). ConfigureUIDMapping now returns idmapApplied; the run pipeline restarts the container (incus restart preserves ephemeral instances — only a plain stop deletes them) and waits for ready before mounting the workspace. The shell pipeline sets raw.idmap before its own start, so it ignores idmapApplied (unchanged behavior). This is exactly why the de-chmod'd run tests now cover the path: runner uid 1001 != 1000 makes every CI run exercise it.

…ktree gap Replace the per-run restart with a pre-start hook: LaunchContainerWithPreStart runs after 'incus init' but before start, so raw.idmap takes effect at first boot without a restart. This removes restart latency from every uid-mismatched 'coi run' and fixes the disk_io test that hung 120s (the restart ran under the container's disk-I/O limits and was throttled). core-build (the #530 write test) keeps passing. Shell path unchanged. - container.LaunchContainerWithPreStart + Manager.LaunchWithPreStart + interface method; preStart runs in the init→start window (incl. the ephemeral isolation-fallback recreate path). run pipeline computes useShift in the hook and carries it to the workspace mount via runState. - xfail test_per_worktree_config_readonly: .git/worktrees/<name>/config.worktree is a real protection gap (dynamic path, not in the literal protected_paths) that this test asserted but passed spuriously before the #530 fix (all workspace writes failed for uid reasons). The sibling tests for .git/info/attributes and .git/config.worktree now pass for the RIGHT reason (read-only mount). Glob-expanded protected paths are a separate security-mount change, tracked for follow-up.

The prior run's misc-config lane hit a cold coi-image cache (built in-lane) and, under the resulting CPU/IO contention, two pre-existing fragile tests timed out: test_valid_disk_io_formats (a read=100kB-throttled full container boot) and test_health_security_posture (a probe launch). Both are unrelated to the UID-mapping change — shell-ephemeral/shell-persistent/network/uid-mapping lanes all exercise raw.idmap on coi-default and passed. core-build saved the image cache this run, so a fresh matrix run restores it warm.

Switch the macOS/Colima workflow from workflow_dispatch+weekly-schedule to pull_request (+ push to master), so the macOS UID-mapping path is guarded on every change rather than only weekly. macos-13 Intel runner (nested virt via Colima vz).

test_health_open_mode_use_sudo_false_fully_clean (and the other _health_checks callers) ran 'coi health --format json' with a 30s subprocess timeout, but that command launches ~4 ephemeral probe containers (secret_masking, host_credential_isolation, network_restriction, container_connectivity). On a CI lane that builds the coi image cold — every non-core-build lane does, since lanes run concurrently and only core-build saves the image cache — 30s is too tight and the command times out under CPU/IO contention. Pre-existing fragility, unrelated to the #530 UID-mapping change (coi health's probe launch uses nil preStart). Not a behavior change.

The free macos-13 runner never got picked up (queued >1.5h) — GitHub is retiring Intel macos-13, and nested virt (Colima vz) is Intel-only, so availability is scarce. Switch to macos-13-large (paid Intel, nested-virt capable, better availability). Add a concurrency group so a newer commit cancels an older queued/running macOS run for the ref, preventing a backlog of stale jobs on scarce macOS runners.

The #530 UID-mapping fix is fully exercised by the Linux CI (runner uid 1001 != container code uid 1000 hits the raw.idmap path in core-build, shell-ephemeral/persistent, network, and tests/run/test_uid_mapping.py). The macOS repro required Colima, which needs nested virtualization and thus an Intel macOS runner. A runner-availability probe confirmed hosted Intel (macos-13) never attaches on this account while Apple-silicon (macos-14/15) attaches instantly — and macos-13-large (larger runner) is unavailable on a personal account entirely. Apple-silicon cannot nest-virt, so hosted per-PR Colima is infeasible. Removing the perpetually-queued gate.

Maciej Mensfeld added 8 commits July 3, 2026 15:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(uid): set raw.idmap on host/code UID mismatch in run pipeline + under Colima/Lima (#530)#540

fix(uid): set raw.idmap on host/code UID mismatch in run pipeline + under Colima/Lima (#530)#540
mensfeld wants to merge 8 commits into
masterfrom
fix/colima-uid-mapping-530

mensfeld commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mensfeld commented Jul 3, 2026

The bug (two layers, both confirmed on master)

Fix

Tests

Note

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant