Skip to content

fix(provisioner): apt-get update before gated auto-installs (AL-37)#25

Merged
Roo4L merged 8 commits into
masterfrom
worktree-locales-bugfix
May 10, 2026
Merged

fix(provisioner): apt-get update before gated auto-installs (AL-37)#25
Roo4L merged 8 commits into
masterfrom
worktree-locales-bugfix

Conversation

@Roo4L
Copy link
Copy Markdown
Owner

@Roo4L Roo4L commented May 10, 2026

Summary

Fixes AL-37 — provisioner installer aborted with Package locales has no installation candidate (and Package sudo … symmetrically) on hosts with empty /var/lib/apt/lists/ (freshly pulled Ubuntu containers, long-idle hosts).

  • plugin/provisioner/10-agent-user.sh and 20-sudoers.sh — added DEBIAN_FRONTEND=noninteractive apt-get update inside the existing if ! command -v <X>; then auto-install gates, before the existing apt-get install. Mirrors the apt-get update→install ordering at 30-nodejs.sh:33 (gated here on prereq absence; unconditional there because nodejs is always installed by the AgentLinux installer).
  • tests/docker/Dockerfile.dogfood — appended && rm -rf /var/lib/apt/lists/* to the apt-get install RUN; replaced the previous AL-37 tactical-workaround comment block with a positive-framing comment. Converts AL-37 into permanent regression coverage: any future provisioner that re-introduces an auto-install without apt-get update pattern fails dogfood retests immediately.
  • tests/bats/10-installer.bats — added one named negative-assertion @test (lead: INST-05; (AL-37 regression guard) parenthetical, matching the 60-curl-installer.bats:206 precedent for INST-03 + AL-31). Greps the installer log for has no installation candidate | Unable to locate package and __fails with the matched lines if present.

Strict-mode contract preserved: no || true masking on the new apt-get update lines — real apt failures (network, broken sources, GPG expiry) trip the entrypoint's ERR trap with src:line attribution. Sourced-fragment contract preserved: no new strict-mode flags inside the provisioners (they inherit from the entrypoint per their file headers).

Why now

Bug class introduced under AL-30 (the four-bug fix bundle that landed command -v auto-install gates), surfaced by AL-36 (the minimal-prereqs Dockerfile.dogfood that runs the curl-pipe-bash installer against a clean substrate). Earlier dogfood retests masked the bug because the manual setup recipe ran apt-get update before installing curl, side-effecting a populated cache that the AgentLinux installer's later apt-get install rode on.

Test plan

  • End-to-end on the regression scenario. Built agentlinux-dogfood-al37:24.04 from the modified Dockerfile.dogfood. Confirmed image starting state: 0 files in /var/lib/apt/lists/, no locale-gen, no visudo — both auto-install gates will fire. Booted privileged systemd container (--privileged --cgroupns=host -e container=docker -v /sys/fs/cgroup:/sys/fs/cgroup:rw --tmpfs /run --tmpfs /tmp -v "$PWD":/workspace:ro), mounted the worktree, copied to /opt/agentlinux-src, and ran the LOCAL worktree installer. Provisioners 10 / 20 / 30 / 40 all reached done banners. Captured installer log assertions:

    Assertion Result
    Zero has no installation candidate matches 0 / pass
    apt-get update ran (Get / Hit lines after each WARN) 66 matches / pass
    Setting up locales reached present / pass
    locale C.UTF-8 enforced reached present / pass
    Setting up sudo reached present / pass
    wrote /etc/sudoers.d/agentlinux reached present / pass
    10 / 20 / 30 / 40 provisioners all show done 4 / 4 / pass
  • Static gates. bash -n clean on both edited provisioner scripts; shellcheck clean on both; docker build of Dockerfile.dogfood succeeds against ubuntu:24.04; pre-commit run --files <changed> clean (shellcheck + shfmt + secrets + JSON / YAML lint all green) on both pre-rebase and post-rebase.

  • Review feedback loop, three passes.

    • Pass 1bash-engineer + security-engineer + qa-engineer. bash + security: clean, no findings. qa-engineer: two actionable items — (1) add a named negative-assertion @test for the canonical apt failure strings; (2) tighten an imprecise comment in 10-agent-user.sh ("canonical pattern at 30-nodejs.sh:33" → "apt-get update→install ordering at 30-nodejs.sh:33 (gated here, unconditional there)"). Both applied in c70d297.
    • Pass 2qa-engineer + behavior-coverage-auditor on the new bats @test. Auditor: TST-07 phase-close gate greps only v0.3.0 REQ-IDs (BHV / RT / AGT / CLI / CAT / INST / HRN / TST / DOC); AL-XX is Jira-only namespace. Project precedent (60-curl-installer.bats:206) leads with the closest REQ-ID and rides (AL-XX regression guard) as a parenthetical. Renamed in a01cb0e to lead with INST-05 (sibling of the existing no-EACCES gate; both assert "log has no "), AL-37 as parenthetical. Loop converged.
    • Pass 3 (post-rebase, ai-deslop) — the ai-deslop subagent landed in master via b69c211 (AL-35) but isn't yet active in my harness. Ran the rubric inline against the diff; one finding: Dockerfile.dogfood comment block was 10 lines for a 1-line code addition with the last sentence restating information already conveyed earlier. Trimmed to 7 lines in 3c69db0.
  • CI. Will exercise the existing matrix (tests/docker/run.sh for ubuntu-{22.04,24.04,26.04} + nightly tests/qemu/). The new INST-05 (AL-37 regression guard) @test runs as part of tests/bats/10-installer.bats in every tests/docker/run.sh invocation.

Refs

  • AL-37 — this fix
  • AL-30 — introduced the auto-install pattern this exposes
  • AL-36 — added the minimal-prereqs Dockerfile.dogfood that surfaced AL-37 with a clean substrate
  • ADR-012 — agent-user passwordless sudo (broader context for 20-sudoers.sh)

🤖 Generated with Claude Code

Roo4L added 8 commits May 10, 2026 05:46
The locales auto-install branch in 10-agent-user.sh ran `apt-get install
-y locales` without first running `apt-get update`. On hosts where
/var/lib/apt/lists/ is empty — freshly pulled Ubuntu containers, long-idle
real machines — apt has no record of any package availability and the
install fails with "Package locales has no installation candidate".

Mirrors the canonical pattern at 30-nodejs.sh:33: `apt-get update` then
`apt-get install`, both wrapped in DEBIAN_FRONTEND=noninteractive. Both
calls remain inside the existing `if ! command -v locale-gen` gate so
hosts where the package is already installed skip both steps (steady
state for repeat installer runs and non-slim images).

No `|| true` masking — strict-mode propagation must surface real apt
failures (network, broken sources, GPG expiry) as installer ERR-trap
hits with src:line attribution.

Refs: AL-37 (this fix), AL-30 (introduced the auto-install pattern),
AL-36 (dogfood image that surfaced the bug).
Same bug class as 10-agent-user.sh's locales install — the visudo
auto-install gate at 20-sudoers.sh:45 ran `apt-get install -y sudo`
without first running `apt-get update`. On hosts with an empty apt
cache the install fails with "Package sudo has no installation
candidate", aborting the installer before the agent gets passwordless
sudo (BHV-07/INST-06).

Fix lockstep with the locales fix in 10-agent-user.sh: insert
`DEBIAN_FRONTEND=noninteractive apt-get update` before the existing
apt-get install, both inside the `if ! command -v visudo` gate. The
gate's no-op-when-already-installed property is preserved.

The comment block already said "Mirror the pattern used by
10-agent-user.sh's locales install" — updated to reflect the now-
matching apt-get update + apt-get install pair.

Refs: AL-37 (this fix), AL-30 (introduced the auto-install pattern).
…n coverage

AL-37 fixed plugin/provisioner/10-agent-user.sh and 20-sudoers.sh to run
`apt-get update` before any gated auto-install. With the strategic fix
landed in the installer, the dogfood image can — and should — start with
an empty /var/lib/apt/lists/.

This converts AL-37's failure scenario into permanent test coverage: any
future provisioner that re-introduces an "auto-install missing package
without apt-get update" pattern will fail dogfood retests immediately,
because the dogfood image starts in the exact "empty apt cache" state
that triggers the bug class.

Drops the previous tactical workaround comment block that documented why
we couldn't clean lists yet ("Until that lands, keeping
/var/lib/apt/lists/ populated here is the tactical fix"); that now reads
as an active misstatement of intent. Replaced with a positive-framing
comment explaining why we DO clean lists.

Image-size cost is roughly 25 MB recovered from the previously-retained
package lists; not a goal but a small bonus.

Refs: AL-37 (the installer fix that made this safe), AL-36 (this file
was added as the minimal dogfood substrate).
Quick task artifacts: PLAN.md (target/acceptance/pitfalls), SUMMARY.md
(what changed + why + end-to-end Docker validation transcript), and the
STATE.md row under Quick Tasks Completed referencing the three feature
commits f12c94f, 86d5cbf, 94c4064.

Refs: AL-37.
qa-engineer review of AL-37 surfaced two small follow-ups:

1. **Named negative assertion at the bats layer (10-installer.bats).** AL-37
   is now structurally exercised by tests/docker/Dockerfile.dogfood's empty-
   cache start state, but the bats suite did not have a test that names the
   canonical apt failure strings ("Package <name> has no installation
   candidate" / "Unable to locate package"). Added one @test that greps the
   installer log for either string and __fails with the matched lines if
   found. Converts AL-37 from "absence-of-stack-trace" coverage into named
   negative assertion — a future regression is visible in the bats output
   (and in the behavior-coverage-auditor's grep trail) instead of only in
   dogfood log scrollback.

2. **Comment-precision nit (10-agent-user.sh).** Original comment said
   "Mirrors the canonical pattern at 30-nodejs.sh:33" but 30-nodejs runs
   apt-get update unconditionally (Node always installed) while we are
   gated on locale-gen absence. Tightened to "Mirrors the apt-get update→
   install ordering at 30-nodejs.sh:33 (gated here on `command -v
   locale-gen` absence, unconditional there ...)" — accurate without
   adding load-bearing logic.

Both changes are inert on the happy path: the new @test is a refute_output
shape that fails only if the bug regresses; the comment update is text-
only. shellcheck + bash -n still clean.

Refs: AL-37.
Pass-2 review (qa-engineer + behavior-coverage-auditor):
- TST-07 phase-close gate greps only the v0.3.0 REQ-ID prefixes (BHV / RT
  / AGT / CLI / CAT / INST / HRN / TST / DOC). A bare `AL-37:` @test name
  does not break the gate but is invisible to it — losing the auditor-
  visibility benefit pass-1 wanted.
- Project precedent for citing a Jira tag on a bugfix-followup @test is
  to LEAD with the closest REQ-ID and ride the AL-XX as a parenthetical
  regression tag. See `tests/bats/60-curl-installer.bats:206`:
  `INST-03: resolve_version reads first-hop redirect (AL-31 regression
  guard)`.
- INST-05 ("installer log contains no EACCES or 'permission denied'") is
  the natural REQ-ID parent: both bug classes manifest as installer-log
  lines the entrypoint did NOT trip on, both assert "log has no <bad-
  string>".

Renamed @test + updated `__fail` ID arg + added comment block citing the
60-curl-installer precedent so future maintainers see the convention.

Refs: AL-37, INST-05.
Inline ai-deslop review (the project subagent landed in master via b69c211
but isn't yet active in my harness — ran the rubric inline). The comment
block I added in 617c894 was 10 lines for a 1-line code addition; the
final sentence ("it also reproduces the real-world starting state of a
freshly pulled Ubuntu container where /var/lib/apt/lists/ is empty until
something runs apt-get update") restated information already conveyed
earlier in the block. Trimmed to 7 lines, no information loss.

Refs: AL-37.
…mment

ai-deslop pass-3 (now with the registered subagent rather than my inline
rubric run): the trailing parenthetical "(gated here on `command -v
locale-gen` absence, unconditional there because nodejs is always
installed by the AgentLinux installer)" is over-explanation. A reader
following the `30-nodejs.sh:33` cross-reference sees the gated-vs-
unconditional distinction in two seconds.

This walks back part of the qa-engineer pass-1 nit in c70d297, which
asked for the precision because "canonical pattern" overstated the
relationship. The new "Mirrors the pattern at 30-nodejs.sh:33." (one
line) is honest about the cross-reference without claiming canonical-
pattern equivalence and without inlining the difference.

Refs: AL-37.
@Roo4L Roo4L merged commit 7cca699 into master May 10, 2026
12 checks passed
@Roo4L Roo4L deleted the worktree-locales-bugfix branch May 10, 2026 06:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant