[FEA]: do_interrupt with NodeRestart -15 special case #218

Description

@rice-riley

Summary

Port do_interrupt from agent/skyhook-agent/src/skyhook_agent/controller.py (lines ~478–534) into a new internal/interrupts/exec.go. This is the entrypoint the operator hits when it injects an interrupt pod for the interrupt lifecycle stage — the agent reads the base64 interrupt blob, looks up which command sequence to run (reboot, systemctl restart, etc.) via the Inflate from #213, and runs each command idempotently using the runner from #217.

The single non-obvious rule: when the interrupt is a NodeRestart and the command exits with -15 (SIGTERM from the OS during reboot), preserve the flag file so the next agent invocation does not re-run the reboot.

Depends on #213 (Inflate), #215 (filesystem helpers), #217 (the runner).

Motivation

Interrupts are the agent's most surgical operation. They run outside the normal apply/check loop — there's no config.json to read, just the base64 blob and the SKYHOOK_RESOURCE_ID env. The flag-per-resource-id idempotence is what makes "this skyhook resource has been interrupted, don't do it again" possible across pod restarts. A bug here causes either:

  • Repeated reboots on every reconcile (if flags aren't written), or
  • Permanently skipped interrupts (if a transient failure leaves a stale flag).

The -15 special case captures the documented reality that a reboot ends with userspace processes being sent SIGTERM during shutdown (in practice by the init system, such as systemd, rather than by the kernel itself). We want that to count as "interrupt completed successfully" rather than "interrupt failed and needs retry".
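One porting detail worth checking here: Python's subprocess reports a child killed by SIGTERM as return code -15, whereas Go's `os.ProcessState.ExitCode()` returns -1 for any process terminated by a signal. If the runner from #217 does not already normalize this (an assumption, since its API is not shown in this issue), a helper along these lines could recover the Python-style code. `pythonStyleReturnCode` and `runReturnCode` are hypothetical names, and the sketch assumes a Unix platform.

```go
package main

import (
	"errors"
	"os/exec"
	"syscall"
)

// pythonStyleReturnCode converts the error returned by (*exec.Cmd).Run or
// Wait into the convention used by Python's subprocess module: the exit
// status for a normal exit, and -N when the child was killed by signal N.
// Hypothetical helper; not part of the runner described in #217.
func pythonStyleReturnCode(runErr error) (int, error) {
	if runErr == nil {
		return 0, nil
	}
	var exitErr *exec.ExitError
	if !errors.As(runErr, &exitErr) {
		// The command never started, or I/O failed: not an exit status.
		return 0, runErr
	}
	// On Unix, Sys() holds a syscall.WaitStatus describing how the child ended.
	if ws, ok := exitErr.Sys().(syscall.WaitStatus); ok && ws.Signaled() {
		return -int(ws.Signal()), nil
	}
	return exitErr.ExitCode(), nil
}

// runReturnCode runs a command and reports its Python-style return code.
func runReturnCode(name string, args ...string) (int, error) {
	return pythonStyleReturnCode(exec.Command(name, args...).Run())
}
```

With this mapping, a command that dies from SIGTERM yields -15, so the comparison in the NodeRestart special case keeps the same meaning as in Python.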

Feature description

Add a new Run(ctx context.Context, interruptData, rootMount, copyDir string) (failed bool, err error) in internal/interrupts that mirrors Python's do_interrupt.

Proposed direction

1. Setup

  • Read SKYHOOK_RESOURCE_ID from env.
  • Build a synthetic cfg from the resource id by parsing customer-{uuid}-{batch}_{packageName}_{packageVersion} (Python uses SKYHOOK_RESOURCE_ID.split("_") and takes the last 3 fields). Wrap in a helper MakeConfigDataFromResourceID() (*config.Config, error) matching Python's make_config_data_from_resource_id; the error return covers malformed IDs, as the acceptance criteria require. (This helper is also used by the CLI banner in #219.)
  • MkdirAll(interruptDir) where interruptDir = {skyhookDir}/interrupts/flags/{SKYHOOK_RESOURCE_ID}.
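The resource-id parse could be sketched as follows. This is a minimal illustration, not the actual helper: `PackageRef` and `ParseResourceID` are hypothetical names standing in for whatever fields `*config.Config` really carries, and the sketch takes the strict reading recommended under Open questions (exactly three non-empty underscore-separated segments) rather than reproducing Python's behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// PackageRef is a hypothetical stand-in for the fields of the synthetic
// config that can be recovered from SKYHOOK_RESOURCE_ID.
type PackageRef struct {
	Prefix  string // e.g. "customer-{uuid}-{batch}"
	Name    string
	Version string
}

// ParseResourceID splits an id of the form {prefix}_{name}_{version}.
// Assumption: exactly three non-empty segments; anything else is an error,
// so a package name containing "_" fails loudly instead of mis-parsing.
func ParseResourceID(id string) (PackageRef, error) {
	parts := strings.Split(id, "_")
	if len(parts) != 3 {
		return PackageRef{}, fmt.Errorf(
			"resource id %q: want 3 underscore-separated segments, got %d",
			id, len(parts))
	}
	for _, p := range parts {
		if p == "" {
			return PackageRef{}, fmt.Errorf("resource id %q: empty segment", id)
		}
	}
	return PackageRef{Prefix: parts[0], Name: parts[1], Version: parts[2]}, nil
}
```

A caller would read the id with `os.Getenv("SKYHOOK_RESOURCE_ID")` and treat a parse error as fatal before any flag directory is created.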

2. Inflate the interrupt

Call interrupts.Inflate(interruptData) from #213 to recover the typed Interrupt.

3. NoOp short-circuit

If interrupt.Type() == NoOp.Type():

  • Write {interruptDir}/no_op.complete containing the current Unix timestamp as a string.
  • Return failed=false.
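The flag handling used by the NoOp path (and by the per-command loop) might look like the sketch below. The helper names, the file permissions, and the choice of integer seconds for the timestamp are all assumptions; the real helpers may come from #215, and Python's exact timestamp format should be checked before claiming parity.

```go
package main

import (
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// WriteFlag creates dir if needed and writes {dir}/{name}.complete containing
// unixSeconds as a decimal string. It returns the path of the flag file.
// Callers would normally pass time.Now().Unix(); taking the value as a
// parameter keeps the helper deterministic in tests.
func WriteFlag(dir, name string, unixSeconds int64) (string, error) {
	if err := os.MkdirAll(dir, 0o755); err != nil { // permissions are an assumption
		return "", err
	}
	path := filepath.Join(dir, name+".complete")
	data := []byte(strconv.FormatInt(unixSeconds, 10))
	if err := os.WriteFile(path, data, 0o644); err != nil {
		return "", err
	}
	return path, nil
}

// FlagExists reports whether the flag file can be stat'ed. Any stat error,
// including a permission error, is treated as "not present".
func FlagExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

// ReadFlag returns the timestamp string stored in the flag file.
func ReadFlag(path string) (string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(data)), nil
}

// ClearFlag removes the flag so that the next reconcile retries.
func ClearFlag(path string) error {
	return os.Remove(path)
}
```

For the NoOp case the call would be `WriteFlag(interruptDir, "no_op", time.Now().Unix())`, followed by returning failed=false.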

4. Per-command loop

For each cmd in interrupt.InterruptCmd() indexed i:

  • interruptID = fmt.Sprintf("%s_%d", interrupt.Type(), i)
  • flag = filepath.Join(interruptDir, interruptID + ".complete")
  • If the flag exists, print Skipping interrupt {interruptID} because it was already run for {SKYHOOK_RESOURCE_ID} and continue. (Match Python's exact message.)
  • Otherwise write the flag eagerly with the timestamp. (Yes — written before the command runs, then deleted on failure. Matches Python.)
  • Call the runner from #217: runner.Run(rootMount, cmd, runner.GetLogFile(...), copyDir, runner.WithWriteCmds(true), runner.WithNoChmod(true)).
  • If returncode != 0:
    • Special case: if interrupt.Type() == NodeRestart.Type() AND returncode == -15, do not delete the flag and do not treat as failure — return failed=false. Add a // why: comment naming the constraint:

      // why: NodeRestart gets us SIGTERMed by the node's shutdown sequence mid-reboot.
      // The reboot succeeded; the SIGTERM is the expected delivery mechanism. Preserving
      // the flag ensures the next agent pod doesn't re-attempt the reboot.
    • Otherwise print INTERRUPT FAILED: {cmd} return_code: {rc}, os.Remove(flag) (so the next reconcile retries), and return failed=true.
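The control flow of the loop above can be sketched in isolation. Everything here is illustrative: `Interrupt`, `FlagStore`, `RunFunc`, and `memFlags` are hypothetical stand-ins for the types from #213, the flag files, and the runner from #217; the `"node_restart"` type string is an assumed snake_case value; and removing the flag when the runner itself errors is an assumption not stated in this issue. The in-memory flag store exists only so the sketch runs without touching disk.

```go
package main

import "fmt"

// Interrupt is a hypothetical stand-in for the typed interrupt from #213.
type Interrupt struct {
	Type string     // e.g. "node_restart" (assumed value)
	Cmds [][]string // one argv per command
}

const (
	nodeRestartType = "node_restart" // assumption: snake_case type name
	sigtermCode     = -15            // Python-style "killed by SIGTERM"
)

// FlagStore abstracts the {interruptID}.complete flag files.
type FlagStore interface {
	Exists(id string) bool
	Write(id string) error
	Remove(id string) error
}

// RunFunc stands in for the runner from #217.
type RunFunc func(cmd []string) (returnCode int, err error)

func runCommands(in Interrupt, resourceID string, flags FlagStore, run RunFunc) (failed bool, err error) {
	for i, cmd := range in.Cmds {
		id := fmt.Sprintf("%s_%d", in.Type, i)
		if flags.Exists(id) {
			fmt.Printf("Skipping interrupt %s because it was already run for %s\n", id, resourceID)
			continue
		}
		// Written before the command runs, removed again on failure.
		if err := flags.Write(id); err != nil {
			return true, err
		}
		rc, err := run(cmd)
		if err != nil {
			_ = flags.Remove(id) // assumption: a runner error is retried like a failure
			return true, err
		}
		if rc == 0 {
			continue
		}
		if in.Type == nodeRestartType && rc == sigtermCode {
			// why: SIGTERM during a reboot is the expected outcome, so the
			// flag stays in place and the reboot is not attempted again.
			return false, nil
		}
		// Note: %v does not reproduce Python's list formatting of cmd.
		fmt.Printf("INTERRUPT FAILED: %v return_code: %d\n", cmd, rc)
		if err := flags.Remove(id); err != nil {
			return true, err
		}
		return true, nil
	}
	return false, nil
}

// memFlags is an in-memory FlagStore used only for illustration and tests.
type memFlags map[string]bool

func (m memFlags) Exists(id string) bool  { return m[id] }
func (m memFlags) Write(id string) error  { m[id] = true; return nil }
func (m memFlags) Remove(id string) error { delete(m, id); return nil }
```

Injecting the runner as a function keeps the -15 decision inside the interrupt code, and it lets the ported tests drive each return-code case without spawning processes.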

5. Tests

Port the interrupt cases from agent/skyhook-agent/tests/test_controller.py and any from agent/skyhook-agent/tests/test_interrupts.py that touch execution:

  • NoOp short-circuits and writes its flag.
  • ServiceRestart with two services runs all three commands (daemon-reload + 2 restarts) and writes 3 flags.
  • Re-running a completed interrupt skips all commands.
  • Re-running a partially-completed interrupt skips done flags and runs remaining.
  • Failure deletes the flag.
  • NodeRestart with rc == -15 preserves the flag and returns failed=false.
  • NodeRestart with rc != -15 and != 0 deletes the flag and returns failed=true.

Scope boundaries

In scope:

  • The interrupt execution loop and its idempotent flag handling.
  • The -15 NodeRestart special case.
  • Synthesizing config from SKYHOOK_RESOURCE_ID.

Out of scope:

Acceptance criteria

  • All ported tests pass.
  • The -15 special case has its own // why: comment.
  • The pre-write-then-delete-on-failure pattern is preserved.
  • Skip / log message strings match Python verbatim.
  • MakeConfigDataFromResourceID parses the documented format and returns a useful error on malformed input.

Open questions

  • The SKYHOOK_RESOURCE_ID format {prefix}_{name}_{version} assumes underscores never appear inside name or version. Should we tighten the parse with a regex? Recommendation: yes. Package names containing underscores would silently mis-parse today. Fix forward in this PR by returning an explicit error when the ID has more than three underscore-separated segments and the split is therefore ambiguous.
  • Should interrupt.Type() use snake_case (Python) or kebab-case (Kubernetes-flavored)? #213 already locked snake_case for serialize parity. Confirmed.

References (codebase)

Alternatives considered

  • Move the -15 special case into the runner from #217. Rejected: it's interrupt-specific knowledge ("we expect to be killed by the OS"), not a general runner concern.

Metadata

Labels

component/agent: Skyhook agent (package executor)