[FEA]: Cutover: flip default to Go, delete Python, flatten dir, update docs #222

Description

@rice-riley

Summary

The flip. After #213 through #221 land, this single small PR repoints the production agent image build at the Go binary, deletes the Python source, flattens agent/go/ up one level into agent/, and updates the changelog and docs. The previous issue (#221) has already proven the Go image passes the full operator-agent chainsaw suite, so this is a content swap, not a behavior change.

It is intentionally a single PR so it's easy to revert with one click if a downstream consumer hits something the e2e suite missed.

Motivation

Until now Python and Go have shipped side-by-side: Python in the production agent image, Go in a separate agent-go image (built from the nested agent/go/ source tree) only used in CI. That parallelism is the safety net while the rewrite is in flight; once parity is proven (#221), maintaining two implementations indefinitely would be pure tech debt.

This PR is small but high-impact. Reviewers should focus on:

  • Are we sure no consumer is pinned to the Python wheel rather than the image? (Spoiler: there isn't one — the agent is only ever consumed as a container.)
  • Is the changelog entry honest about the language change?
  • Is every mention of "Python", "hatch", or "wheel" in docs/ updated or removed?

Feature description

A single PR doing four mechanical things:

  1. Repoint the production agent image build at the Go Dockerfile.
  2. Delete the Python source under agent/skyhook-agent/, agent/vendor/, agent/hatch.toml, and the Python-specific bits of agent/Makefile. Also remove agent/.dockerignore (the go/ exclusion it carries no longer makes sense once go/ is the only thing in agent/).
  3. Flatten agent/go/ up one level so its content lives directly at agent/.
  4. Update agent/CHANGELOG.md, agent/README.md, .claude/CLAUDE.md, and any docs/ references to the Python agent.

Proposed direction

1. Image content swap

Two approaches were considered; A is recommended:

  • A (recommended): git mv -f containers/agent-go.Dockerfile containers/agent.Dockerfile. The -f is needed because the destination, the Python Dockerfile, already exists and a plain git mv refuses to overwrite it. The reviewer sees the diff as "old Python build pipeline → new Go build pipeline". Also update the COPY agent/go/ ./ line inside the Dockerfile to COPY . /code/ (or equivalent), because the source no longer lives at agent/go/ after step 3.
  • B: Leave both Dockerfiles in place but change containers/agent.Dockerfile to a one-liner that pulls from agent-go.Dockerfile. Rejected: it adds indirection for no benefit.

Same for the workflow: git mv -f .github/workflows/agent-go-ci.yaml .github/workflows/agent-ci.yaml (overwriting the Python one, hence the -f). Then update the paths: filters: drop the !agent/go/** exclusion that #213 added (the subdir no longer exists) and replace the agent/go/** include from the Go workflow with agent/**.
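
Once the Dockerfile and workflow are renamed, a quick scan for leftover pre-cutover names can catch a missed paths: filter or Makefile target. The following is a minimal sketch, not part of the planned PR: the helper name find_stale_go_paths and the three search locations are assumptions to be adjusted to the real layout.

```shell
# Hypothetical helper: list tracked lines that still mention the pre-cutover
# directory (agent/go/) or the agent-go name. Run from the repository root.
find_stale_go_paths() {
  # The trailing slash in agent/go/ keeps agent/go.mod and agent/go.sum
  # from matching after the flatten.
  git grep -n -E 'agent/go/|agent-go' -- .github/workflows containers Makefile
}

# git grep exits 0 when it finds a match and 1 when it finds none, so:
#   if find_stale_go_paths; then echo "stale references remain"; fi
```

Matching on agent-go also surfaces the root Makefile's agent-go-* fan-out targets, which are due to be renamed as part of the flatten.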

2. Delete Python source

git rm -r agent/skyhook-agent/
git rm -r agent/vendor/
git rm agent/hatch.toml
git rm agent/.dockerignore
git rm agent/Makefile  # the Python Makefile; the Go one replaces it in step 3

agent/Makefile (the Python one) gets deleted in this step. The Go module's Makefile at agent/go/Makefile will end up at agent/Makefile after step 3, so the root Makefile target make -C agent test keeps working (it now hits Go targets instead of hatch ones — that's the intended behavior).

3. Flatten agent/go/ → agent/

After the deletions, agent/ contains only go/, README.md, CHANGELOG.md, and possibly an empty Makefile slot. Then:

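for entry in agent/go/* agent/go/.[!.]*; do
  # Skip a pattern that matched nothing; stop at the first failed move.
  if [ -e "$entry" ]; then git mv "$entry" agent/ || break; fi
done
# Git tracks files, not directories, so remove the emptied directory from disk.
rmdir agent/go

The .[!.]* pattern picks up hidden files like .golangci.yml, if present, without matching the . and .. directory entries. Errors are not silenced, so a collision or an untracked file stops the loop visibly, and rmdir fails if anything was left behind in agent/go/. Verify with git status that no Go file got dropped.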

Because the module path declared in agent/go/go.mod is github.com/NVIDIA/nodewright/agent (#213 deliberately chose this with cutover in mind), no import statement in any .go file changes — only the on-disk path. The root Makefile fan-out targets that #213 added as agent-go-* should be renamed to plain agent-* (or merged with whatever the Python agent/Makefile exposed).
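
The claim that no import changes rests entirely on the module line in go.mod, so it is cheap to check before and after the move. A minimal sketch, assuming the file sits at agent/go/go.mod before the flatten and agent/go.mod after; module_path is a hypothetical helper name.

```shell
# Print the module path declared in the go.mod file given as $1.
module_path() {
  awk '$1 == "module" { print $2; exit }' "$1"
}

# Hypothetical check, run from the repository root after the flatten:
#   [ "$(module_path agent/go.mod)" = "github.com/NVIDIA/nodewright/agent" ]
```

If Go is installed, running go build ./... from agent/ after the move is a stronger confirmation that every import still resolves.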

Important: stage the deletes from step 2 and the moves from step 3 in the same commit so reviewers see "Python deleted, Go landed in same place" as one logical change. Git renders this as a series of file renames from agent/go/X to agent/X plus the deletions of agent/skyhook-agent/**.
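
To see how the combined commit would look to a reviewer before touching the real tree, steps 2 and 3 can be rehearsed in a throwaway repository. This is only a sketch: the function name rehearse_cutover, the file names, and the file contents are placeholders, not the real agent sources.

```shell
# Rehearse steps 2 and 3 in a temporary repository and print what Git stages.
rehearse_cutover() {
  repo="$(mktemp -d)" || return 1
  (
    cd "$repo" || exit 1
    git init -q .
    mkdir -p agent/go agent/skyhook-agent
    # Placeholder files standing in for the pre-cutover layout.
    printf 'module github.com/NVIDIA/nodewright/agent\n' > agent/go/go.mod
    printf 'package main\n\nfunc main() {}\n' > agent/go/main.go
    printf 'run:\n  timeout: 5m\n' > agent/go/.golangci.yml
    printf 'print("placeholder")\n' > agent/skyhook-agent/main.py
    git add -A
    git -c user.name=rehearsal -c user.email=rehearsal@example.invalid \
      commit -q -m 'pre-cutover layout'

    # Step 2: delete the Python tree.
    git rm -r -q agent/skyhook-agent/
    # Step 3: move every entry, dotfiles included, up one level.
    for entry in agent/go/* agent/go/.[!.]*; do
      if [ -e "$entry" ]; then git mv "$entry" agent/; fi
    done

    # -M enables rename detection: R lines are renames, D lines are deletions.
    git --no-pager diff --cached -M --name-status
  )
  rm -rf "$repo"
}
```

Calling rehearse_cutover should list each agent/go/X as a rename to agent/X and the Python file as a deletion, which is the shape the single commit is meant to have.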

4. CHANGELOG entry

Append to agent/CHANGELOG.md:

## [agent/v7.0.0] - YYYY-MM-DD

### BREAKING (build/runtime, not user-visible)

- *(agent)* Rewrote in Go. The agent CLI contract (positional args, env vars,
  log/flag/history/log-file paths, exit codes, log-line format) is unchanged
  per the operator-agent chainsaw suite. The image still runs as root from a
  distroless base.

### Removed

- Python source under `agent/skyhook-agent/`, the hatch toolchain, and the
  vendored Python deps. The agent is now a single static Go binary.

The version bump to v7.0.0 reflects the major build-stack change; CRDs and CLI args are unchanged so users upgrading should see no behavior difference.

5. Docs

Update:

  • agent/README.md — replace Python build instructions with Go ones.
  • .claude/CLAUDE.md — the "Three components, three toolchains" section becomes "operator (Go), agent (Go), chart (Helm)". Update the "Common commands" agent block.
  • Search docs/ for any reference to "Python" or "skyhook-agent wheel" or "hatch" and update.
  • .github/dependabot.yml (if it lists pip ecosystems for the agent) — remove pip entries for agent/.

6. Tests

This PR's "test" is that all existing CI lanes still pass with the swap:

  • The renamed Agent CI workflow runs go test, go build, docker build, and operator-agent-tests, and they all pass.
  • The smoke-test confirms that the published agent:{tag} image, now built from the Go content, has the right entrypoint.
  • No e2e regression vs. the last commit on main before cutover.

Scope boundaries

In scope:

  • Dockerfile swap, workflow swap, Python source deletion, dir rename, CHANGELOG, README, docs touch-ups.

Out of scope:

  • Any agent code change. If something needs fixing, open a follow-up.
  • Operator or chart changes (the contract is unchanged).
  • Behavior tweaks "while we're in here" (post-cutover follow-ups, please).

Acceptance criteria

  • containers/agent.Dockerfile builds the Go agent.
  • .github/workflows/agent-ci.yaml runs Go test + build + e2e.
  • agent/skyhook-agent/, agent/vendor/, agent/hatch.toml, and agent/.dockerignore are gone.
  • agent/go/ no longer exists; its content lives at agent/.
  • All Go imports continue to compile unchanged because the module path was already github.com/NVIDIA/nodewright/agent.
  • agent/CHANGELOG.md has a v7.0.0 entry naming the rewrite.
  • agent/README.md documents the Go build/test workflow.
  • .claude/CLAUDE.md "Three components" section is updated.
  • Operator-agent chainsaw suite still passes (proves nothing slipped in the rename).
  • No reference to python, hatch, wheel, or pip remains under agent/ or docs/.
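
Several of these criteria can be checked mechanically. The sketch below covers the "gone" paths and the leftover-term scan; check_cutover is a hypothetical name. It assumes agent/CHANGELOG.md is exempt from the term scan, since the v7.0.0 entry itself has to name Python and hatch. The issue does not state that exemption, so treat it as an open point.

```shell
# Hypothetical acceptance check; run from the repository root.
# Prints each problem found and returns non-zero if there is any.
check_cutover() {
  status=0
  for gone in agent/skyhook-agent agent/vendor agent/hatch.toml \
              agent/.dockerignore agent/go; do
    if [ -e "$gone" ]; then
      echo "still present: $gone"
      status=1
    fi
  done
  # -w matches whole words only, so "pipeline" is not reported as "pip";
  # the [0-9.]* suffix still catches spellings such as "python3".
  # Assumption: the changelog is excluded (see the note above).
  if grep -rniwE --exclude=CHANGELOG.md 'python[0-9.]*|hatch|wheel|pip' agent docs; then
    status=1
  fi
  return "$status"
}
```

The scan is deliberately coarse: a hit is a prompt for a human to look at the line, not proof that the line is wrong.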

Open questions

  • Should we tag agent/v7.0.0 in the same PR or in a follow-up? Recommend follow-up: PR lands on main, then a release PR bumps the version stamp that #213 added, then tag. Avoids tag/PR coupling.
  • Keep the Python image (agent:6.4.x and agent:latest-python) on GHCR for one minor cycle as a fallback? Recommend yes — don't delete, but don't push new ones. Operators in production may want a roll-back image if something obscure surfaces in the field.
  • Should the operator's Go module pin a minimum agent version that requires the Go image? Probably not — the contract is the same. If the contract were changing, that would be a separate PR.

Alternatives considered

  • Soft cutover: keep both Dockerfiles building two different images for several months. Rejected — extends the maintenance window and benefits only us; users don't see two images.
  • Atomic single commit but no PR: just push to main. Rejected — needs review like any other change.
  • Keep agent/go/ as the nested dir: skip the flatten. Rejected — leaves a vestige of "this used to be Python" in the directory tree forever, and forces every doc/Makefile/CI file to keep saying agent/go/ for the next decade. The flatten is a one-time cost; carrying the nested layout is forever.

Code of Conduct

  • I agree to follow Skyhook's Code of Conduct.

Metadata

Labels

component/agent: Skyhook agent (package executor)