Notification system for mission lifecycle events #87

@mlund01

Problem

Squadron emits a rich vocabulary of mission lifecycle events (mission started/completed/failed/stopped, task started/completed/failed, agent reasoning, tool calls, routing decisions) but has no way to forward terminal outcomes — especially failures — to the operators or systems that care. There's no built-in webhook, no email, no PagerDuty, no Slack channel ping, no UI toast. Today an operator finds out a mission failed by going to look.

There's also a smaller gap: the runner transitions to MissionFailed and returns the error to its caller, but never calls MissionHandler.MissionFailed(...) — the handler chain literally never sees a mission-level failure event. So even before notifications, terminal failures don't land in the event store.

Proposal

A unified Notifier interface, with three implementation routes that share one HCL surface area.

Architecture

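// squadron/notify/notifier.go
package notify

import (
    "context"
    "time"
)

// Notifier delivers one pre-rendered event to a single destination.
type Notifier interface {
    Send(ctx context.Context, ev NotificationEvent) error
}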

type NotificationEvent struct {
    Type        string            // "mission_completed" | "mission_failed"
    MissionName string
    MissionID   string
    Title       string            // pre-rendered short subject
    Body        string            // pre-rendered long body
    Severity    string            // "info" | "warning" | "critical"
    Hints       map[string]string // routing hints (e.g. "channel": "#ops")
    Timestamp   time.Time
}

Three sources of Notifier implementations:

  1. Built-in (in-process Go): webhook, email, command_center. Simple HTTP/SMTP/wsbridge transports. No subprocess, no plugin SDK. Compiled into squadron.

  2. Notifier plugin (new SDK squadron-notifier-sdk) — gRPC subprocess via hashicorp/go-plugin, mirroring squadron-sdk and squadron-gateway-sdk. Single RPC: Notify(NotificationEvent). v1 ships the SDK as the extension point but no concrete plugin yet. Designed for community/custom transports (PagerDuty, Datadog, OpsGenie, MS Teams).

  3. Gateway-as-notifier — extend squadron-gateway-sdk with an optional NotifyingGateway interface (one method: OnNotification). Gateways that implement it can be referenced as notifiers via kind = "gateway". Existing gateways without it keep compiling; squadron does an interface assertion before calling. v1: gateway_slack implements it.
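The optional-interface check described in route 3 could look roughly like the sketch below. Everything here is illustrative: the Gateway interface, the OnNotification signature, ErrNotSupported, and both gateway types are assumptions made for the example, not the SDK's actual definitions, and the real call would cross the go-plugin gRPC boundary rather than stay in one process.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// NotificationEvent is trimmed to the fields this sketch needs.
type NotificationEvent struct {
	Type  string
	Title string
}

// Gateway stands in for the existing gateway contract (hypothetical shape).
type Gateway interface {
	Name() string
}

// NotifyingGateway is the optional extension. The proposal names the method
// OnNotification; its exact signature here is an assumption.
type NotifyingGateway interface {
	OnNotification(ctx context.Context, ev NotificationEvent) error
}

// ErrNotSupported is a hypothetical sentinel for gateways without the extension.
var ErrNotSupported = errors.New("gateway does not implement NotifyingGateway")

// notifyViaGateway delivers ev only if gw opts in to the extension.
func notifyViaGateway(ctx context.Context, gw Gateway, ev NotificationEvent) error {
	ng, ok := gw.(NotifyingGateway)
	if !ok {
		return fmt.Errorf("%s: %w", gw.Name(), ErrNotSupported)
	}
	return ng.OnNotification(ctx, ev)
}

// legacyGateway compiles unchanged because it never mentions the new interface.
type legacyGateway struct{}

func (legacyGateway) Name() string { return "legacy" }

// slackGateway opts in by adding the one extra method.
type slackGateway struct{ sent []string }

func (*slackGateway) Name() string { return "slack" }

func (s *slackGateway) OnNotification(_ context.Context, ev NotificationEvent) error {
	s.sent = append(s.sent, ev.Title)
	return nil
}

// demo reports whether the legacy gateway was rejected and how many
// events the Slack stand-in accepted.
func demo() (bool, int) {
	ev := NotificationEvent{Type: "mission_failed", Title: "nightly_etl failed"}
	slack := &slackGateway{}
	errLegacy := notifyViaGateway(context.Background(), legacyGateway{}, ev)
	_ = notifyViaGateway(context.Background(), slack, ev)
	return errors.Is(errLegacy, ErrNotSupported), len(slack.sent)
}

func main() {
	rejected, delivered := demo()
	fmt.Println(rejected, delivered)
}
```

Under these assumptions a config that points kind = "gateway" at a gateway without the extension would surface as a resolution-time error rather than a silent drop; whether squadron should fail or only warn there is left open by the proposal.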

Why hybrid (not pure-plugin)

The gateway SDK already dropped its PagerDuty example (squadron-gateway-sdk#2) because the gateway protocol is Q&A-shaped (buttons, multi-select, free-text), not fire-and-forget. A webhook POST does not need an OS subprocess, SMTP does not need an OS subprocess, and the command-center wsbridge is already in-process — forcing those through a plugin SDK pays the install/release/handshake/process-supervision tax for nothing. Conversely, building a brand-new Slack notifier from scratch when the existing gateway already holds the bot token + channel + websocket is duplicative. The hybrid lets each transport pay only the complexity it actually needs.
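To make the "a webhook POST does not need a subprocess" point concrete, here is a minimal sketch of what a built-in webhook notifier might look like. The JSON field names, the WebhookNotifier struct, and the captureTransport test double are assumptions for illustration; the proposal does not specify a payload format.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// NotificationEvent is a subset of the proposed struct; JSON tags are assumed.
type NotificationEvent struct {
	Type        string    `json:"type"`
	MissionName string    `json:"mission_name"`
	Title       string    `json:"title"`
	Severity    string    `json:"severity"`
	Timestamp   time.Time `json:"timestamp"`
}

// WebhookNotifier POSTs the event as JSON to a fixed URL (hypothetical type).
type WebhookNotifier struct {
	URL     string
	Headers map[string]string
	Client  *http.Client
}

// Send satisfies the proposed Notifier interface.
func (w *WebhookNotifier) Send(ctx context.Context, ev NotificationEvent) error {
	body, err := json.Marshal(ev)
	if err != nil {
		return err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, w.URL, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	for k, v := range w.Headers {
		req.Header.Set(k, v)
	}
	resp, err := w.Client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
		return fmt.Errorf("webhook returned %s", resp.Status)
	}
	return nil
}

// captureTransport records the request instead of touching the network.
type captureTransport struct {
	auth string
	body []byte
}

func (c *captureTransport) RoundTrip(r *http.Request) (*http.Response, error) {
	c.auth = r.Header.Get("Authorization")
	c.body, _ = io.ReadAll(r.Body)
	r.Body.Close()
	return &http.Response{
		StatusCode: http.StatusNoContent,
		Status:     "204 No Content",
		Body:       http.NoBody,
		Header:     make(http.Header),
		Request:    r,
	}, nil
}

// demo sends one event through the stub transport and returns what arrived.
func demo() (auth, eventType string, err error) {
	tr := &captureTransport{}
	n := &WebhookNotifier{
		URL:     "https://hooks.example.com/squadron",
		Headers: map[string]string{"Authorization": "Bearer test-token"},
		Client:  &http.Client{Transport: tr},
	}
	err = n.Send(context.Background(), NotificationEvent{
		Type: "mission_failed", MissionName: "nightly_etl",
		Title: "nightly_etl failed", Severity: "critical", Timestamp: time.Now(),
	})
	var got NotificationEvent
	if jerr := json.Unmarshal(tr.body, &got); jerr != nil && err == nil {
		err = jerr
	}
	return tr.auth, got.Type, err
}

func main() {
	auth, eventType, err := demo()
	fmt.Println(auth, eventType, err)
}
```

The whole transport fits in one method on top of net/http, which is the argument for compiling it in rather than routing it through a plugin handshake.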

HCL surface

# Built-in destinations
notifier "ops_webhook" {
  kind = "webhook"
  url  = "https://hooks.example.com/squadron"
  headers = { Authorization = "Bearer ${vars.hook_token}" }
}

notifier "alerts_email" {
  kind      = "email"
  smtp_host = "smtp.sendgrid.net"
  smtp_port = 587
  from      = "squadron@example.com"
  to        = ["oncall@example.com"]
  username  = vars.smtp_user
  password  = vars.smtp_pass
}

notifier "ui_toast" {
  kind = "command_center"
}

# Gateway-as-notifier (delegates to existing gateway "slack" block)
notifier "ops_slack" {
  kind    = "gateway"
  gateway = "slack"
  channel = "#squadron-ops"
}

# Notifier plugin (post-v1, shape preview)
notifier "pd_critical" {
  kind     = "pagerduty"
  source   = "github.com/foo/notifier_pagerduty"
  version  = "v1.0.0"
  settings = { integration_key = vars.pd_key }
}

# Global subscriptions: apply to ALL missions
notify {
  on      = ["mission_failed"]
  targets = [notifiers.ops_slack, notifiers.alerts_email]
}

notify {
  on      = ["mission_completed"]
  targets = [notifiers.ui_toast]
}

# Mission-level subscription: ADDS to global rules (additive merge, dedupe)
mission "nightly_etl" {
  notify {
    on      = ["mission_failed"]
    targets = [notifiers.pd_critical]
  }
  task "extract" { ... }
}

# Mission with no notify block — receives only the global rules
mission "boring_check" {
  task "check" { ... }
}

Block conventions:

  • notifier "name" {} — top-level destination definition (noun, mirrors gateway "name" {}).
  • notifiers.* — variable namespace for cross-references (mirrors plugins.*, mcp.*).
  • notify { on = []; targets = [] } — subscription (verb). Supported at both global (top-level) and mission level. Additive merge — a mission's effective rules are the union of its own blocks and all global blocks; targets dedupe per delivery. No opt-out from globals in v1.
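The additive-merge rule can be sketched as a small pure function. This is one possible reading of "union of own blocks and all global blocks, targets dedupe per delivery"; the Rule type, the function name, and the globals-first ordering are assumptions, and the repeated ops_slack target in the mission rule is added only to exercise the dedupe.

```go
package main

import "fmt"

// Rule mirrors one notify block: event names in On, notifier names in Targets.
type Rule struct {
	On      []string
	Targets []string
}

// contains reports whether want appears in list.
func contains(list []string, want string) bool {
	for _, s := range list {
		if s == want {
			return true
		}
	}
	return false
}

// effectiveTargets returns the deduplicated targets for one event:
// global rules first, then the mission's own rules, first occurrence wins.
func effectiveTargets(event string, global, mission []Rule) []string {
	seen := make(map[string]bool)
	var out []string
	for _, rules := range [][]Rule{global, mission} {
		for _, r := range rules {
			if !contains(r.On, event) {
				continue
			}
			for _, t := range r.Targets {
				if !seen[t] {
					seen[t] = true
					out = append(out, t)
				}
			}
		}
	}
	return out
}

// Rules taken from the HCL example above.
var globalRules = []Rule{
	{On: []string{"mission_failed"}, Targets: []string{"ops_slack", "alerts_email"}},
	{On: []string{"mission_completed"}, Targets: []string{"ui_toast"}},
}

// nightly_etl's rule, with ops_slack repeated (hypothetical) to show dedupe.
var nightlyRules = []Rule{
	{On: []string{"mission_failed"}, Targets: []string{"pd_critical", "ops_slack"}},
}

func main() {
	// nightly_etl on failure: globals plus its own target, ops_slack only once.
	fmt.Println(effectiveTargets("mission_failed", globalRules, nightlyRules))
	// prints [ops_slack alerts_email pd_critical]

	// boring_check has no notify block, so it receives only the global rules.
	fmt.Println(effectiveTargets("mission_completed", globalRules, nil))
	// prints [ui_toast]
}
```

Because the function only ever adds targets, it has no way to express an opt-out, which matches the v1 constraint that missions cannot skip global rules.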

Resolution. Notifiers load after gateways (so kind = "gateway" can resolve) and before missions (so missions can reference them) in the staged-evaluation pipeline.

Mission failure event-emission gap fix

Add MissionFailed(name string, err error) to MissionHandler and emit at the three runner failure paths immediately before transitioning state. Implement in CLI handler, StoringMissionHandler, wsbridge streamer, and debug logger. Wire payload (EventMissionFailed / MissionFailedData) already exists in squadron-wire/protocol/events.go — currently never produced.
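One way to keep the failure paths from drifting apart is to funnel them through a single helper that emits before it transitions. The sketch below assumes a much-reduced MissionHandler and a toy Runner; only the MissionFailed(name string, err error) signature comes from the proposal, and the real runner's state machine is certainly richer than this.

```go
package main

import (
	"errors"
	"fmt"
)

// MissionHandler is reduced to two methods for the sketch.
type MissionHandler interface {
	MissionStarted(name string)
	MissionFailed(name string, err error)
}

// recordingHandler is a stand-in for StoringMissionHandler and friends.
type recordingHandler struct{ events []string }

func (h *recordingHandler) MissionStarted(name string) {
	h.events = append(h.events, "started:"+name)
}

func (h *recordingHandler) MissionFailed(name string, err error) {
	h.events = append(h.events, "failed:"+name+": "+err.Error())
}

// Runner is a hypothetical, simplified mission runner.
type Runner struct {
	handler MissionHandler
	state   string
}

// fail is the single exit used by every failure path: emit the event first,
// then transition state, then hand the error back to the caller.
func (r *Runner) fail(name string, err error) error {
	r.handler.MissionFailed(name, err)
	r.state = "failed"
	return err
}

// Run executes tasks in order and stops at the first error.
func (r *Runner) Run(name string, tasks []func() error) error {
	r.state = "running"
	r.handler.MissionStarted(name)
	for _, task := range tasks {
		if err := task(); err != nil {
			return r.fail(name, err)
		}
	}
	r.state = "completed"
	return nil
}

// demo runs a mission whose second task fails and returns what the
// handler recorded plus the runner's final state.
func demo() ([]string, string) {
	h := &recordingHandler{}
	r := &Runner{handler: h}
	_ = r.Run("nightly_etl", []func() error{
		func() error { return nil },
		func() error { return errors.New("extract: connection refused") },
	})
	return h.events, r.state
}

func main() {
	events, state := demo()
	fmt.Println(events, state)
}
```

With a helper like fail in place, a test can assert that every path ending in the failed state also produced exactly one handler event.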

v1 scope

  • Internal Notifier interface + dispatcher
  • Mission failure event-emission gap fixed
  • HCL: notifier "name" {} destinations + notify {} subscriptions at global and mission level
  • Built-in destinations: webhook, email, command_center
  • Gateway-as-notifier extension in squadron-gateway-sdk; Slack implements it
  • squadron-notifier-sdk shipped as new module (interface + proto + plugin host glue), no concrete plugin built
  • Events accepted in v1: mission_completed, mission_failed only
  • Defaults-only message rendering (hardcoded title/body templates)
  • Best-effort delivery: 10 s per-call timeout, failures logged, no retry queue
  • Command center: new wire message type + React toast component
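The best-effort delivery bullet (per-call timeout, failures logged, no retry) might translate into a dispatcher along these lines. This is a sketch under assumptions: the Dispatcher type, its fields, and the choice to fan out concurrently are not stated in the proposal, and the demo uses a 20 ms timeout in place of the proposed 10 s so it finishes quickly.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"sync"
	"time"
)

// NotificationEvent is trimmed to the fields this sketch needs.
type NotificationEvent struct {
	Type  string
	Title string
}

// Notifier matches the interface from the proposal.
type Notifier interface {
	Send(ctx context.Context, ev NotificationEvent) error
}

// Dispatcher is a hypothetical best-effort fan-out.
type Dispatcher struct {
	Timeout time.Duration // 10 * time.Second in the proposal
}

// Dispatch sends ev to every target concurrently, each under its own timeout.
// Failures are logged and counted, never retried. It returns the failure count.
func (d Dispatcher) Dispatch(ctx context.Context, ev NotificationEvent, targets map[string]Notifier) int {
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		failed int
	)
	for name, n := range targets {
		wg.Add(1)
		go func(name string, n Notifier) {
			defer wg.Done()
			callCtx, cancel := context.WithTimeout(ctx, d.Timeout)
			defer cancel()
			if err := n.Send(callCtx, ev); err != nil {
				log.Printf("notify: target %q failed for %s: %v", name, ev.Type, err)
				mu.Lock()
				failed++
				mu.Unlock()
			}
		}(name, n)
	}
	wg.Wait()
	return failed
}

// okNotifier succeeds immediately.
type okNotifier struct{}

func (okNotifier) Send(context.Context, NotificationEvent) error { return nil }

// slowNotifier blocks for a second unless its context is cancelled first.
type slowNotifier struct{}

func (slowNotifier) Send(ctx context.Context, _ NotificationEvent) error {
	select {
	case <-time.After(time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demo dispatches to one healthy and one hung target and returns the failure count.
func demo() int {
	d := Dispatcher{Timeout: 20 * time.Millisecond}
	ev := NotificationEvent{Type: "mission_failed", Title: "nightly_etl failed"}
	return d.Dispatch(context.Background(), ev, map[string]Notifier{
		"ui_toast":  okNotifier{},
		"ops_slack": slowNotifier{},
	})
}

func main() {
	fmt.Println("failed targets:", demo())
}
```

Giving each target its own context means a hung destination costs at most one timeout and cannot delay the others, which is about as much isolation as a design without a retry queue can offer.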

Implementation order (suggested PR sequence)

Each step is independently mergeable and testable.

  1. Plumbing fix — add MissionHandler.MissionFailed + emit from runner + update all in-tree handlers. PR in squadron.
  2. Internal Notifier interface + dispatcher + webhook built-in. PR in squadron.
  3. Email built-in. PR in squadron.
  4. NotifyingGateway extension to squadron-gateway-sdk. PR in squadron-gateway-sdk.
  5. Slack gateway implements NotifyingGateway. PR in gateway_slack.
  6. squadron-notifier-sdk skeleton — new module init + initial release tag. Separate repo.
  7. Notifier plugin loader in squadron. PR in squadron.
  8. Command center UI toast — built-in command_center notifier + React toast. PRs in squadron + commander.

Steps 1–3 + 8 alone constitute a usable product even if 4–7 slip.

Touched repos

  • squadron — new notify/ package, config/notifier.go, runner emission fix, plugin loader, command_center notifier, dispatcher wiring
  • squadron-gateway-sdk: NotifyingGateway interface + proto extension
  • squadron-notifier-sdk: new repo/module
  • gateway_slack — implement NotifyingGateway
  • commander — React toast component + (possibly) a new wire message type
  • squadron-sdk — no changes

What's deferred (v2+)

  • Concrete notifier plugins (PagerDuty, Datadog, OpsGenie, MS Teams)
  • Discord-as-notifier (Slack-only in v1)
  • task_failed, mission_stopped, budget_breach, mission_issue event subscriptions
  • User-provided message templates
  • At-least-once delivery / persistent retry queue / outbox table
  • Per-target severity overrides
  • Mute windows / quiet hours
  • Aggregation (e.g. "5 task_failed in 60s → one notification")
  • Mission opt-out from global notify rules (notify { skip_global = true })
