Port do_interrupt from agent/skyhook-agent/src/skyhook_agent/controller.py (lines ~478–534) into a new internal/interrupts/exec.go. This is the entrypoint the operator hits when it injects an interrupt pod for the interrupt lifecycle stage — the agent reads the base64 interrupt blob, looks up which command sequence to run (reboot, systemctl restart, etc.) via the Inflate from #213, and runs each command idempotently using the runner from #217.
The single non-obvious rule: when the interrupt is a NodeRestart and the command exits with -15 (SIGTERM from the OS during reboot), preserve the flag file so the next agent invocation does not re-run the reboot.
Depends on #213 (Inflate), #215 (filesystem helpers), #217 (the runner).
Motivation
Interrupts are the agent's most surgical operation. They run outside the normal apply/check loop — there's no config.json to read, just the base64 blob and the SKYHOOK_RESOURCE_ID env. The flag-per-resource-id idempotence is what makes "this skyhook resource has been interrupted, don't do it again" possible across pod restarts. A bug here causes either:
Repeated reboots on every reconcile (if flags aren't written), or
Permanently skipped interrupts (if a transient failure leaves a stale flag).
The -15 special case captures the documented reality that reboot causes the kernel to SIGTERM all userspace processes, and we want that to count as "interrupt completed successfully" rather than "interrupt failed and needs retry".
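One porting detail sits behind the -15 value. Python's subprocess reports a process killed by signal N as return code -N, whereas Go's ProcessState.ExitCode() returns -1 for any signal death. The following is a minimal sketch, assuming a Unix target, of how a Go port might recover the Python-style value. The helper names pythonStyleReturnCode and runShell are hypothetical, and whether the runner from #217 already normalizes this is not stated here.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"syscall"
)

// pythonStyleReturnCode is a hypothetical helper, not something named in this issue.
// It converts the error from (*exec.Cmd).Run into the value Python's subprocess
// would report: the exit status, or -N when the process was killed by signal N.
func pythonStyleReturnCode(err error) (int, error) {
	if err == nil {
		return 0, nil
	}
	var exitErr *exec.ExitError
	if !errors.As(err, &exitErr) {
		return 0, err // the command could not be started or waited on
	}
	// On Unix, Sys() holds a syscall.WaitStatus that records the terminating signal.
	if ws, ok := exitErr.Sys().(syscall.WaitStatus); ok && ws.Signaled() {
		return -int(ws.Signal()), nil
	}
	return exitErr.ExitCode(), nil
}

// runShell runs a shell snippet and reports its Python-style return code.
func runShell(script string) (int, error) {
	return pythonStyleReturnCode(exec.Command("sh", "-c", script).Run())
}

func main() {
	// The shell signals itself, standing in for a reboot terminating the command.
	rc, err := runShell("kill -TERM $$")
	fmt.Println(rc, err)
}
```

Keeping the conversion in one place means the interrupt code can compare against -15 exactly as the Python does, without scattering WaitStatus checks through the port.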
Feature description
A new Run(ctx, interruptData, rootMount, copyDir string) (failed bool, err error) in internal/interrupts that mirrors Python's do_interrupt.
Proposed direction
1. Setup
Read SKYHOOK_RESOURCE_ID from env.
Build a synthetic cfg from the resource id by parsing customer-{uuid}-{batch}_{packageName}_{packageVersion} (Python uses SKYHOOK_RESOURCE_ID.split("_") and takes the last 3 fields). Wrap in a helper MakeConfigDataFromResourceID() *config.Config matching Python's make_config_data_from_resource_id. (This helper is also used by the CLI banner in #219, [FEA]: Controller main / agent_main / SIGTERM + CLI parsing and entrypoint banner.)
MkdirAll(interruptDir) where interruptDir = {skyhookDir}/interrupts/flags/{SKYHOOK_RESOURCE_ID}.
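The resource-id parse can be sketched as follows. This is an illustration under stated assumptions, not the planned implementation: resourceParts is a hypothetical stand-in for whatever fields *config.Config carries, parseResourceID is a hypothetical name, and the strict "exactly three non-empty fields" rule is stricter than Python's take-the-last-three behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// resourceParts is a hypothetical stand-in for the fields the real config would hold.
type resourceParts struct {
	Prefix         string
	PackageName    string
	PackageVersion string
}

// parseResourceID splits an id of the form
// customer-{uuid}-{batch}_{packageName}_{packageVersion}.
// Assumption: ids with more or fewer than three underscore-separated fields,
// or with an empty field, are rejected instead of being silently mis-parsed.
func parseResourceID(id string) (resourceParts, error) {
	fields := strings.Split(id, "_")
	if len(fields) != 3 {
		return resourceParts{}, fmt.Errorf("resource id %q: want 3 underscore-separated fields, got %d", id, len(fields))
	}
	for i, f := range fields {
		if f == "" {
			return resourceParts{}, fmt.Errorf("resource id %q: field %d is empty", id, i)
		}
	}
	return resourceParts{Prefix: fields[0], PackageName: fields[1], PackageVersion: fields[2]}, nil
}

func main() {
	parts, err := parseResourceID("customer-1234-1_mypkg_1.0.0")
	fmt.Printf("%+v %v\n", parts, err)

	// An underscore inside the package name produces an explicit error.
	_, err = parseResourceID("customer-1234-1_my_pkg_1.0.0")
	fmt.Println(err)
}
```

Returning an error on ambiguous input trades strict parity with Python for a failure that is visible at startup.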
2. Inflate the interrupt
Call interrupts.Inflate(interruptData) from #213 to recover the typed Interrupt.
3. NoOp short-circuit
If interrupt.Type() == NoOp.Type():
Write {interruptDir}/no_op.complete containing the current Unix timestamp as a string.
Return failed=false.
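The NoOp branch reduces to writing one flag file. A minimal sketch, assuming hypothetical helper names (writeFlag, markNoOp, demoNoOp) and file modes (0o644, 0o755) that the issue does not specify:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"time"
)

// writeFlag is a hypothetical helper: it writes the current Unix timestamp,
// as a decimal string, to path. The 0o644 mode is an assumption.
func writeFlag(path string) error {
	ts := strconv.FormatInt(time.Now().Unix(), 10)
	return os.WriteFile(path, []byte(ts), 0o644)
}

// markNoOp records a NoOp interrupt as complete and returns the flag path.
func markNoOp(interruptDir string) (string, error) {
	if err := os.MkdirAll(interruptDir, 0o755); err != nil {
		return "", err
	}
	flag := filepath.Join(interruptDir, "no_op.complete")
	return flag, writeFlag(flag)
}

// demoNoOp exercises markNoOp in a throwaway directory and returns
// the flag's file name and contents.
func demoNoOp() (name, content string, err error) {
	root, err := os.MkdirTemp("", "skyhook-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(root)
	flag, err := markNoOp(filepath.Join(root, "interrupts", "flags", "demo-resource"))
	if err != nil {
		return "", "", err
	}
	data, err := os.ReadFile(flag)
	return filepath.Base(flag), string(data), err
}

func main() {
	name, content, err := demoNoOp()
	fmt.Println(name, content, err)
}
```

The same writeFlag shape would serve the per-command flags, since both store the timestamp as a string.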
4. Per-command loop
For each cmd in interrupt.InterruptCmd() indexed i:
interruptID = fmt.Sprintf("%s_%d", interrupt.Type(), i)
flag = filepath.Join(interruptDir, interruptID + ".complete")
If the flag exists, print Skipping interrupt {interruptID} because it was already run for {SKYHOOK_RESOURCE_ID} and continue. (Match Python's exact message.)
Otherwise write the flag eagerly with the timestamp. (Yes — written before the command runs, then deleted on failure. Matches Python.)
Call the runner from #217 ([FEA]: tee streaming and run_step): runner.Run(rootMount, cmd, runner.GetLogFile(...), copyDir, runner.WithWriteCmds(true), runner.WithNoChmod(true)).
If returncode != 0:
Special case: if interrupt.Type() == NodeRestart.Type() AND returncode == -15, do not delete the flag and do not treat as failure — return failed=false. Add a // why: comment naming the constraint:
// why: NodeRestart causes the kernel to SIGTERM us mid-reboot. The reboot succeeded;
// the SIGTERM is the expected delivery mechanism. Preserving the flag ensures the
// next agent pod doesn't re-attempt the reboot.
Otherwise print INTERRUPT FAILED: {cmd} return_code: {rc}, os.Remove(flag) (so the next reconcile retries), and return failed=true.
The pre-write-then-delete-on-failure pattern is preserved.
Skip / log message strings match Python verbatim.
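The per-command loop above can be sketched as follows. This is a hedged illustration, not the real port: runFunc stands in for the runner from #217 (whose actual signature is runner.Run(...) as listed above), the [][]string command type and the "node_restart" type string are assumptions, and simulate exists only to exercise the loop with canned return codes. Go's %v formatting of cmd will also differ from Python's list formatting, so the failure message here is not byte-for-byte identical.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"time"
)

const (
	nodeRestartType   = "node_restart" // assumed snake_case type name
	sigtermReturnCode = -15            // Python-style "killed by SIGTERM"
)

// runFunc is a hypothetical stand-in for the #217 runner; it returns a Python-style return code.
type runFunc func(cmd []string) int

func runInterrupt(interruptDir, resourceID, interruptType string, cmds [][]string, run runFunc) (bool, error) {
	if err := os.MkdirAll(interruptDir, 0o755); err != nil {
		return true, err
	}
	for i, cmd := range cmds {
		interruptID := fmt.Sprintf("%s_%d", interruptType, i)
		flag := filepath.Join(interruptDir, interruptID+".complete")
		if _, err := os.Stat(flag); err == nil {
			fmt.Printf("Skipping interrupt %s because it was already run for %s\n", interruptID, resourceID)
			continue
		} else if !errors.Is(err, os.ErrNotExist) {
			return true, err
		}
		// The flag is written before the command runs and removed again on failure.
		ts := strconv.FormatInt(time.Now().Unix(), 10)
		if err := os.WriteFile(flag, []byte(ts), 0o644); err != nil {
			return true, err
		}
		rc := run(cmd)
		if rc == 0 {
			continue
		}
		if interruptType == nodeRestartType && rc == sigtermReturnCode {
			// why: NodeRestart causes the kernel to SIGTERM us mid-reboot. The reboot succeeded;
			// the SIGTERM is the expected delivery mechanism. Preserving the flag ensures the
			// next agent pod doesn't re-attempt the reboot.
			return false, nil
		}
		fmt.Printf("INTERRUPT FAILED: %v return_code: %d\n", cmd, rc)
		if err := os.Remove(flag); err != nil {
			return true, err
		}
		return true, nil
	}
	return false, nil
}

// simulate runs the loop in a temp dir with canned return codes and counts the flags left behind.
func simulate(interruptType string, rcs []int) (failed bool, flags int, err error) {
	dir, err := os.MkdirTemp("", "interrupt-flags")
	if err != nil {
		return true, 0, err
	}
	defer os.RemoveAll(dir)
	cmds := make([][]string, len(rcs))
	for i := range cmds {
		cmds[i] = []string{"step", strconv.Itoa(i)}
	}
	next := 0
	run := func([]string) int { rc := rcs[next]; next++; return rc }
	failed, err = runInterrupt(dir, "demo-resource", interruptType, cmds, run)
	if err != nil {
		return failed, 0, err
	}
	entries, err := os.ReadDir(dir)
	return failed, len(entries), err
}

func main() {
	fmt.Println(simulate("service_restart", []int{0, 0, 0}))
	fmt.Println(simulate(nodeRestartType, []int{-15}))
	fmt.Println(simulate("service_restart", []int{0, 1, 0}))
}
```

Injecting the runner as a function keeps the -15 decision inside the interrupt code, in line with the choice to not push it into the general runner.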
5. Tests
Port the interrupt cases from agent/skyhook-agent/tests/test_controller.py and any from agent/skyhook-agent/tests/test_interrupts.py that touch execution:
[…] daemon-reload + 2 restarts) and writes 3 flags.
[…] failed=false.
[…] failed=true.
Scope boundaries
In scope:
[…] -15 NodeRestart special case.
[…] SKYHOOK_RESOURCE_ID.
Out of scope:
[…] interrupt arm of the top-level mode dispatch (#219, [FEA]: Controller main / agent_main / SIGTERM + CLI parsing and entrypoint banner, wires Run into controller.Run).
[…] controller.py.
Acceptance criteria
[…] -15 special case has its own // why: comment.
MakeConfigDataFromResourceID parses the documented format and returns a useful error on malformed input.
Open questions
The SKYHOOK_RESOURCE_ID format {prefix}_{name}_{version} assumes underscores never appear inside name or version. Should we tighten the parse with a regex? Recommend yes: package names with underscores would silently mis-parse today. Fix forward in this PR with an explicit error if more than 3 underscore-separated trailing segments parse ambiguously.
[…] interrupt.Type() use snake_case (Python) or kebab-case (Kubernetes-flavored)? #213 ([FEA]: Bootstrap agent/go/ module + port pure-data types) already locked snake_case for serialize parity. Confirmed.
Alternatives considered
Move the -15 special case into the runner from #217 ([FEA]: tee streaming and run_step). Rejected: it's interrupt-specific knowledge ("we expect to be killed by the OS"), not a general runner concern.