Skip to content

fix(uid): set raw.idmap on host/code UID mismatch in run pipeline + under Colima/Lima (#530)#540

Open
mensfeld wants to merge 8 commits into
masterfrom
fix/colima-uid-mapping-530
Open

fix(uid): set raw.idmap on host/code UID mismatch in run pipeline + under Colima/Lima (#530)#540
mensfeld wants to merge 8 commits into
masterfrom
fix/colima-uid-mapping-530

Conversation

@mensfeld

@mensfeld mensfeld commented Jul 3, 2026

Copy link
Copy Markdown
Owner

Fixes #530.

The bug (two layers, both confirmed on master)

Shell path: isColimaOrLimaEnvironment() sets disableShift = true, but the raw.idmap branch in session.Setup was gated on !disableShift — so once Colima/Lima is detected, no UID mapping is set. A workspace owned by the macOS user (uid 501 by default, not 1000) is then unwritable by the container's code user. As the reporter (@technicalpickles) traced, #226 fixed this class for the general case and explicitly left the Colima/Lima branch untouched.

Run path (worse): coi run had no UID-mapping handling at all — no Colima detection, no raw.idmap — just useShift := !DisableShift. So coi run failed to write a workspace on any host where uid != 1000 (macOS 501, or a Linux/CI user at 1001). The reporter's repro used coi run, so even fixing the shell branch alone wouldn't have fixed their case.

Fix

Tests

  • Go unit: decideUIDMapping — match/shift, disable_shift, Colima, and the mismatch-wins cases (501 macOS, 1001 CI, mismatch-under-Colima, mismatch-under-disable_shift).
  • Linux integration (runs every CI run): removed the chmod 0o777 crutches from the run workspace-write tests — the workspace stays owned by the CI runner's uid 1001 ≠ 1000, so raw.idmap is genuinely exercised. New tests/run/test_uid_mapping.py asserts the writer inside the container is uid 1000 and the write lands on the host mount. (Lands in the existing tests/run CI group.)
  • Experimental macOS repro (.github/workflows/macos-colima.yml, workflow_dispatch + weekly, non-blocking): a real macos-13 + Colima vz/virtiofs stack with the workspace owned by the macOS uid — the faithful Colima/Lima auto-detect disables UID shift but never sets raw.idmap, so workspace writes fail when host UID != 1000 #530 environment the reporter described. Intel runner because Apple-silicon GH runners disallow nested virt.

Note

The macOS workflow is written from known-good patterns but has not yet run on a real GH macOS runner (I can't execute macOS locally); it's intentionally opt-in/non-blocking and may need a shakeout iteration on its first dispatch. The fix itself and its Linux coverage stand on their own. The reporter's [incus] code_uid workaround remains valid but is no longer necessary.

Gates: go build/vet/test, golangci-lint v2.12.2 (0 issues), gofmt, ruff, CI-coverage guard — green locally.

Maciej Mensfeld added 8 commits July 3, 2026 15:31
…nder Colima/Lima (#530)

Colima/Lima auto-detection disabled UID shifting but the raw.idmap branch was
gated on !disableShift, so it never ran — a workspace owned by a non-1000 host
uid (macOS user 501) was unwritable by the container code user. Worse, the
run pipeline had NO uid-mapping handling at all (no Colima detection, no
idmap), so plain 'coi run' failed on any host where uid != 1000 (a CI runner
at 1001 too).

- Extract session.ConfigureUIDMapping — the single source of truth for both
  the shell (session.Setup) and run (configureContainerRunPhase) pipelines:
  Colima/Lima auto-detect + raw.idmap on ANY host-uid/code-uid mismatch
  (unconditionally, including the Colima path). Pure decideUIDMapping() behind
  it makes the case matrix unit-testable.
- Drop the chmod 0o777 crutches from the run workspace-write tests: the
  workspace stays owned by the (non-1000) CI runner uid, so the mapping is
  genuinely exercised on every Linux CI run. New tests/run/test_uid_mapping.py
  asserts the mismatch case (writer inside container is uid 1000, write lands
  on the host mount).
- Experimental .github/workflows/macos-colima.yml (workflow_dispatch + weekly,
  non-blocking): reproduces #530 on a real macos-13 + Colima vz/virtiofs stack
  where the workspace is owned by the macOS uid.

Fixes #530.
…ffect

CI caught it: the run pipeline uses `incus launch` (create+start atomically),
so setting raw.idmap in configureContainerRunPhase happened AFTER the container
was already running — and raw.idmap only takes effect at start. The mapping was
logged but never applied, so writes to a uid-mismatched workspace still failed
('Permission denied' on /workspace despite the idmap line).

ConfigureUIDMapping now returns idmapApplied; the run pipeline restarts the
container (incus restart preserves ephemeral instances — only a plain stop
deletes them) and waits for ready before mounting the workspace. The shell
pipeline sets raw.idmap before its own start, so it ignores idmapApplied
(unchanged behavior). This is exactly why the de-chmod'd run tests now cover
the path: runner uid 1001 != 1000 makes every CI run exercise it.
…ktree gap

Replace the per-run restart with a pre-start hook: LaunchContainerWithPreStart
runs after 'incus init' but before start, so raw.idmap takes effect at first
boot without a restart. This removes restart latency from every uid-mismatched
'coi run' and fixes the disk_io test that hung 120s (the restart ran under the
container's disk-I/O limits and was throttled). core-build (the #530 write
test) keeps passing. Shell path unchanged.

- container.LaunchContainerWithPreStart + Manager.LaunchWithPreStart +
  interface method; preStart runs in the init→start window (incl. the
  ephemeral isolation-fallback recreate path). run pipeline computes useShift
  in the hook and carries it to the workspace mount via runState.
- xfail test_per_worktree_config_readonly: .git/worktrees/<name>/config.worktree
  is a real protection gap (dynamic path, not in the literal protected_paths)
  that this test asserted but passed spuriously before the #530 fix (all
  workspace writes failed for uid reasons). The sibling tests for
  .git/info/attributes and .git/config.worktree now pass for the RIGHT reason
  (read-only mount). Glob-expanded protected paths are a separate security-mount
  change, tracked for follow-up.
The prior run's misc-config lane hit a cold coi-image cache (built in-lane)
and, under the resulting CPU/IO contention, two pre-existing fragile tests
timed out: test_valid_disk_io_formats (a read=100kB-throttled full container
boot) and test_health_security_posture (a probe launch). Both are unrelated to
the UID-mapping change — shell-ephemeral/shell-persistent/network/uid-mapping
lanes all exercise raw.idmap on coi-default and passed. core-build saved the
image cache this run, so a fresh matrix run restores it warm.
Switch the macOS/Colima workflow from workflow_dispatch+weekly-schedule to
pull_request (+ push to master), so the macOS UID-mapping path is guarded on
every change rather than only weekly. macos-13 Intel runner (nested virt via
Colima vz).
test_health_open_mode_use_sudo_false_fully_clean (and the other _health_checks
callers) ran 'coi health --format json' with a 30s subprocess timeout, but
that command launches ~4 ephemeral probe containers (secret_masking,
host_credential_isolation, network_restriction, container_connectivity). On a
CI lane that builds the coi image cold — every non-core-build lane does, since
lanes run concurrently and only core-build saves the image cache — 30s is too
tight and the command times out under CPU/IO contention. Pre-existing
fragility, unrelated to the #530 UID-mapping change (coi health's probe launch
uses nil preStart). Not a behavior change.
The free macos-13 runner never got picked up (queued >1.5h) — GitHub is
retiring Intel macos-13, and nested virt (Colima vz) is Intel-only, so
availability is scarce. Switch to macos-13-large (paid Intel, nested-virt
capable, better availability). Add a concurrency group so a newer commit
cancels an older queued/running macOS run for the ref, preventing a backlog of
stale jobs on scarce macOS runners.
The #530 UID-mapping fix is fully exercised by the Linux CI (runner uid
1001 != container code uid 1000 hits the raw.idmap path in core-build,
shell-ephemeral/persistent, network, and tests/run/test_uid_mapping.py).

The macOS repro required Colima, which needs nested virtualization and thus
an Intel macOS runner. A runner-availability probe confirmed hosted Intel
(macos-13) never attaches on this account while Apple-silicon (macos-14/15)
attaches instantly — and macos-13-large (larger runner) is unavailable on a
personal account entirely. Apple-silicon cannot nest-virt, so hosted per-PR
Colima is infeasible. Removing the perpetually-queued gate.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Colima/Lima auto-detect disables UID shift but never sets raw.idmap, so workspace writes fail when host UID != 1000

1 participant