feat(observability): Alertmanager Discord receiver + healthchecks.io Watchdog#206

Merged: jvcorredor merged 4 commits into main from worktree-homelab-181-alertmanager-discord-watchdog on May 16, 2026

Conversation

@jvcorredor

What

Wires Alertmanager's two delivery paths for the ADR-0007 observability stack:

| Route | Receiver | Endpoint | Purpose |
|---|---|---|---|
| default (every alert) | `discord` | Discord channel webhook | Actionable alerts reach the operator's phone |
| `alertname = "Watchdog"` | `watchdog` | healthchecks.io ping URL (`repeat_interval: 1m`) | Dead-man's switch: pages when the pings stop, surviving a fully-dark cluster |
| `alertname = "InfoInhibitor"` | `blackhole` | none (empty receiver) | The chart's non-actionable inhibition helper; dropped so it doesn't spam Discord |

Both endpoint URLs are sourced from files mounted out of ESO-synced K8s Secrets (alertmanagerSpec.secrets), so neither URL enters Git or Terraform state — the same out-of-band credential pattern as grafana-admin-password.
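
For orientation, the routing described above might look roughly like the following Helm values fragment. This is a sketch, not the file from this PR: the Kubernetes Secret names, the `url` key, and the resulting file paths are assumptions chosen for illustration. Only the receiver names, the matchers, and the 1m `repeat_interval` come from the description.

```yaml
# Sketch only: Secret names, the `url` key, and the file paths are assumptions.
alertmanager:
  alertmanagerSpec:
    # Each Secret listed here is mounted at /etc/alertmanager/secrets/<secret-name>/
    secrets:
      - discord-alertmanager-webhook
      - healthchecks-watchdog-url
  config:
    route:
      receiver: discord              # default route: every alert not matched below
      routes:
        - receiver: watchdog
          matchers:
            - alertname = "Watchdog"
          repeat_interval: 1m
        - receiver: blackhole
          matchers:
            - alertname = "InfoInhibitor"
    receivers:
      - name: discord
        discord_configs:
          - webhook_url_file: /etc/alertmanager/secrets/discord-alertmanager-webhook/url
      - name: watchdog
        webhook_configs:
          - url_file: /etc/alertmanager/secrets/healthchecks-watchdog-url/url
      - name: blackhole              # no *_configs entries, so matched alerts are dropped
    # inhibit_rules: restated from the chart default (omitted from this sketch)
```

Reading the URLs through `webhook_url_file` and `url_file` keeps the values out of the rendered Alertmanager config; only the mount paths appear in Git.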

Changes

  • terraform/gcp/main.tf — two empty GSM containers, discord-alertmanager-webhook and healthchecks-watchdog-url, populated by the operator out of band.
  • kube-prometheus-stack/manifests/external-secret-{discord,healthchecks}.yaml — one ExternalSecret per container → K8s Secrets in observability.
  • kube-prometheus-stack/helm-values.yaml — Alertmanager config with the discord/watchdog/blackhole receivers and routes. inhibit_rules are restated verbatim from the chart default, since supplying config replaces it wholesale.
  • kube-prometheus-stack/README.md — out-of-band operator steps (create Discord webhook, create healthchecks.io check, populate both GSM containers) and two new smoke-test steps.
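
As an illustration of the ExternalSecret half of this change, one of the two manifests could be shaped like the sketch below. The store name, refresh interval, and `url` key are hypothetical, and the `apiVersion` depends on the installed External Secrets Operator release. Only the GSM container name and the `observability` namespace are taken from this PR.

```yaml
# Sketch only: store name, refresh interval, and the `url` key are assumptions.
apiVersion: external-secrets.io/v1   # older ESO releases serve v1beta1 instead
kind: ExternalSecret
metadata:
  name: discord-alertmanager-webhook
  namespace: observability
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: gcp-secret-manager         # hypothetical store name
  target:
    name: discord-alertmanager-webhook   # K8s Secret that Alertmanager mounts
  data:
    - secretKey: url
      remoteRef:
        key: discord-alertmanager-webhook   # GSM container created in terraform/gcp
```

The healthchecks.io manifest would follow the same shape, with `healthchecks-watchdog-url` as the remote key.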

Verification

  • tofu validate (terraform/gcp) — passes; tofu fmt -check — clean.
  • scripts/lint-apps.sh kubernetes/apps/kube-prometheus-stack — passes (123 valid resources, 5 manifests valid).
  • helm template of the chart with these values renders the Alertmanager config secret exactly as designed — webhook_url_file / url_file resolve against the alertmanagerSpec.secrets mount paths.

In-cluster acceptance (Discord test alert, healthchecks.io Up/Down) requires the operator to populate the two GSM containers first — documented in the README's "Out-of-band operator steps".

Out of scope

Per the issue: application-level alert rules, and routing to Slack/PagerDuty/email.

Closes: #181

🤖 Generated with Claude Code

…Watchdog

Route Alertmanager's default route to a Discord channel webhook and the
chart's always-firing `Watchdog` alert to a healthchecks.io ping URL on
a 1m repeat_interval — the dead-man's switch that pages on a fully-dark
cluster. `InfoInhibitor`, the chart's non-actionable inhibition helper,
is dropped into an empty `blackhole` receiver so that the default route,
which now points at `discord`, does not forward it to the channel.

Both endpoint URLs are sourced from files mounted out of ESO-synced K8s
Secrets (`alertmanagerSpec.secrets`), so neither URL enters Git or
Terraform state — same out-of-band credential pattern as the Grafana
admin password.

- terraform/gcp/: two empty GSM containers, `discord-alertmanager-webhook`
  and `healthchecks-watchdog-url`, populated by the operator out of band.
- kube-prometheus-stack/manifests/: one ExternalSecret per container,
  syncing into the `observability` namespace.
- kube-prometheus-stack/helm-values.yaml: Alertmanager `config` with the
  discord/watchdog/blackhole receivers and routes; `inhibit_rules`
  restated verbatim from the chart default since supplying `config`
  replaces it wholesale.
- kube-prometheus-stack/README.md: out-of-band operator steps (Discord
  webhook, healthchecks.io check) and two smoke-test steps.

Closes: #181

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 16, 2026

Terraform plan: terraform/gcp/

Changes detected. Review the plan before merging.

Commit: f31645eaa9ca74c0fcb9eb3d87674e2df4bd0ba4 · Job log

Plan output
google_project.lab: Refreshing state... [id=projects/rockingham-homelab]
google_project_service.enabled["cloudresourcemanager.googleapis.com"]: Refreshing state... [id=rockingham-homelab/cloudresourcemanager.googleapis.com]
google_project_service.enabled["dns.googleapis.com"]: Refreshing state... [id=rockingham-homelab/dns.googleapis.com]
google_project_service.enabled["artifactregistry.googleapis.com"]: Refreshing state... [id=rockingham-homelab/artifactregistry.googleapis.com]
google_project_service.enabled["storage.googleapis.com"]: Refreshing state... [id=rockingham-homelab/storage.googleapis.com]
google_project_service.enabled["serviceusage.googleapis.com"]: Refreshing state... [id=rockingham-homelab/serviceusage.googleapis.com]
google_project_service.enabled["secretmanager.googleapis.com"]: Refreshing state... [id=rockingham-homelab/secretmanager.googleapis.com]
google_project_service.enabled["iam.googleapis.com"]: Refreshing state... [id=rockingham-homelab/iam.googleapis.com]
google_service_account.cert_manager: Refreshing state... [id=projects/rockingham-homelab/serviceAccounts/cert-manager-dns01@rockingham-homelab.iam.gserviceaccount.com]
google_secret_manager_secret.arc_installation_id_brazostech: Refreshing state... [id=projects/rockingham-homelab/secrets/arc-installation-id-brazostech]
google_service_account.arc_push: Refreshing state... [id=projects/rockingham-homelab/serviceAccounts/tf-ci-arc-push@rockingham-homelab.iam.gserviceaccount.com]
google_secret_manager_secret.homepage_argocd_token: Refreshing state... [id=projects/rockingham-homelab/secrets/homepage-argocd-token]
google_secret_manager_secret.cloudflare_api_token: Refreshing state... [id=projects/rockingham-homelab/secrets/cloudflare-api-token]
google_service_account.cluster_pull: Refreshing state... [id=projects/rockingham-homelab/serviceAccounts/tf-ci-cluster-pull@rockingham-homelab.iam.gserviceaccount.com]
google_storage_bucket.longhorn_backups: Refreshing state... [id=rockingham-longhorn-backups]
google_secret_manager_secret.grafana_admin_password: Refreshing state... [id=projects/rockingham-homelab/secrets/grafana-admin-password]
google_storage_bucket.tfstate: Refreshing state... [id=rockingham-homelab-tfstate]
google_secret_manager_secret.homepage_adguard_username: Refreshing state... [id=projects/rockingham-homelab/secrets/homepage-adguard-username]
google_service_account.tf_ci: Refreshing state... [id=projects/rockingham-homelab/serviceAccounts/tf-ci-plan@rockingham-homelab.iam.gserviceaccount.com]
google_service_account.eso: Refreshing state... [id=projects/rockingham-homelab/serviceAccounts/external-secrets@rockingham-homelab.iam.gserviceaccount.com]
google_iam_workload_identity_pool.github_actions: Refreshing state... [id=projects/rockingham-homelab/locations/global/workloadIdentityPools/github-actions]
google_secret_manager_secret.longhorn_backup_credentials: Refreshing state... [id=projects/rockingham-homelab/secrets/longhorn-backup-credentials]
google_dns_managed_zone.lab: Refreshing state... [id=projects/rockingham-homelab/managedZones/lab-jackhall-dev]
google_secret_manager_secret.arc_app_private_key: Refreshing state... [id=projects/rockingham-homelab/secrets/arc-app-private-key]
google_service_account.tf_ci_apply: Refreshing state... [id=projects/rockingham-homelab/serviceAccounts/tf-ci-apply@rockingham-homelab.iam.gserviceaccount.com]
google_service_account.longhorn_backup: Refreshing state... [id=projects/rockingham-homelab/serviceAccounts/longhorn-backup@rockingham-homelab.iam.gserviceaccount.com]
google_secret_manager_secret.talos_cluster_secrets: Refreshing state... [id=projects/rockingham-homelab/secrets/talos-cluster-secrets]
google_artifact_registry_repository.projects: Refreshing state... [id=projects/rockingham-homelab/locations/us-east4/repositories/projects]
google_service_account.tf_ci_apply_cloudflare: Refreshing state... [id=projects/rockingham-homelab/serviceAccounts/tf-ci-apply-cloudflare@rockingham-homelab.iam.gserviceaccount.com]
google_secret_manager_secret.homepage_adguard_password: Refreshing state... [id=projects/rockingham-homelab/secrets/homepage-adguard-password]
google_secret_manager_secret.adguard_home_admin: Refreshing state... [id=projects/rockingham-homelab/secrets/adguard-home-admin]
google_secret_manager_secret.cloudflare_tunnel_token: Refreshing state... [id=projects/rockingham-homelab/secrets/cloudflare-tunnel-token]
google_iam_workload_identity_pool.cluster: Refreshing state... [id=projects/rockingham-homelab/locations/global/workloadIdentityPools/cluster]
google_secret_manager_secret.arc_installation_id_raptgroup: Refreshing state... [id=projects/rockingham-homelab/secrets/arc-installation-id-raptgroup]
google_secret_manager_secret.arc_app_id: Refreshing state... [id=projects/rockingham-homelab/secrets/arc-app-id]
google_secret_manager_secret.argocd_repo_ssh_key: Refreshing state... [id=projects/rockingham-homelab/secrets/argocd-repo-ssh-key]
google_storage_bucket.cluster_oidc: Refreshing state... [id=rockingham-homelab-oidc]
google_project_iam_member.tf_ci_viewer: Refreshing state... [id=rockingham-homelab/roles/viewer/serviceAccount:tf-ci-plan@rockingham-homelab.iam.gserviceaccount.com]
google_secret_manager_secret_iam_member.tf_ci_cloudflare_api_token_accessor: Refreshing state... [id=projects/rockingham-homelab/secrets/cloudflare-api-token/roles/secretmanager.secretAccessor/serviceAccount:tf-ci-plan@rockingham-homelab.iam.gserviceaccount.com]
google_project_iam_member.tf_ci_security_reviewer: Refreshing state... [id=rockingham-homelab/roles/iam.securityReviewer/serviceAccount:tf-ci-plan@rockingham-homelab.iam.gserviceaccount.com]
google_project_iam_member.eso_secret_accessor: Refreshing state... [id=rockingham-homelab/roles/secretmanager.secretAccessor/serviceAccount:external-secrets@rockingham-homelab.iam.gserviceaccount.com]
google_iam_workload_identity_pool_provider.github: Refreshing state... [id=projects/rockingham-homelab/locations/global/workloadIdentityPools/github-actions/providers/github]
google_service_account_iam_member.arc_push_wif_user["RaptGroup/zipmenu-public"]: Refreshing state... [id=projects/rockingham-homelab/serviceAccounts/tf-ci-arc-push@rockingham-homelab.iam.gserviceaccount.com/roles/iam.workloadIdentityUser/principalSet://iam.googleapis.com/projects/594695390705/locations/global/workloadIdentityPools/github-actions/attribute.repository/RaptGroup/zipmenu-public]
google_service_account_iam_member.arc_push_wif_user["RaptGroup/homelab"]: Refreshing state... [id=projects/rockingham-homelab/serviceAccounts/tf-ci-arc-push@rockingham-homelab.iam.gserviceaccount.com/roles/iam.workloadIdentityUser/principalSet://iam.googleapis.com/projects/594695390705/locations/global/workloadIdentityPools/github-actions/attribute.repository/RaptGroup/homelab]
google_service_account_iam_member.tf_ci_wif_user: Refreshing state... [id=projects/rockingham-homelab/serviceAccounts/tf-ci-plan@rockingham-homelab.iam.gserviceaccount.com/roles/iam.workloadIdentityUser/principalSet://iam.googleapis.com/projects/594695390705/locations/global/workloadIdentityPools/github-actions/attribute.repository/RaptGroup/homelab]
google_iam_workload_identity_pool_provider.github_arc: Refreshing state... [id=projects/rockingham-homelab/locations/global/workloadIdentityPools/github-actions/providers/github-arc]
google_storage_bucket_iam_member.longhorn_backup_object_admin: Refreshing state... [id=b/rockingham-longhorn-backups/roles/storage.objectAdmin/serviceAccount:longhorn-backup@rockingham-homelab.iam.gserviceaccount.com]
google_dns_managed_zone_iam_member.cert_manager_dns_admin: Refreshing state... [id=projects/rockingham-homelab/managedZones/lab-jackhall-dev/roles/dns.admin/serviceAccount:cert-manager-dns01@rockingham-homelab.iam.gserviceaccount.com]
google_service_account_iam_member.tf_ci_apply_cloudflare_wif_user: Refreshing state... [id=projects/rockingham-homelab/serviceAccounts/tf-ci-apply-cloudflare@rockingham-homelab.iam.gserviceaccount.com/roles/iam.workloadIdentityUser/principalSet://iam.googleapis.com/projects/594695390705/locations/global/workloadIdentityPools/github-actions/attribute.environment/cloudflare]
google_storage_bucket_iam_member.tf_ci_apply_cloudflare_state: Refreshing state... [id=b/rockingham-homelab-tfstate/roles/storage.objectUser/serviceAccount:tf-ci-apply-cloudflare@rockingham-homelab.iam.gserviceaccount.com]
google_secret_manager_secret_iam_member.tf_ci_apply_cloudflare_api_token_accessor: Refreshing state... [id=projects/rockingham-homelab/secrets/cloudflare-api-token/roles/secretmanager.secretAccessor/serviceAccount:tf-ci-apply-cloudflare@rockingham-homelab.iam.gserviceaccount.com]
google_service_account_iam_member.eso_wif_user: Refreshing state... [id=projects/rockingham-homelab/serviceAccounts/external-secrets@rockingham-homelab.iam.gserviceaccount.com/roles/iam.workloadIdentityUser/principal://iam.googleapis.com/projects/594695390705/locations/global/workloadIdentityPools/cluster/subject/system:serviceaccount:external-secrets:external-secrets-gsm]
google_secret_manager_secret_iam_member.tf_ci_apply_cloudflare_tunnel_token_version_adder: Refreshing state... [id=projects/rockingham-homelab/secrets/cloudflare-tunnel-token/roles/secretmanager.secretVersionAdder/serviceAccount:tf-ci-apply-cloudflare@rockingham-homelab.iam.gserviceaccount.com]
google_service_account_iam_member.cluster_pull_wif_user: Refreshing state... [id=projects/rockingham-homelab/serviceAccounts/tf-ci-cluster-pull@rockingham-homelab.iam.gserviceaccount.com/roles/iam.workloadIdentityUser/principal://iam.googleapis.com/projects/594695390705/locations/global/workloadIdentityPools/cluster/subject/system:serviceaccount:ar-canary:ar-canary-puller]
google_service_account_iam_member.cert_manager_wif_user: Refreshing state... [id=projects/rockingham-homelab/serviceAccounts/cert-manager-dns01@rockingham-homelab.iam.gserviceaccount.com/roles/iam.workloadIdentityUser/principal://iam.googleapis.com/projects/594695390705/locations/global/workloadIdentityPools/cluster/subject/system:serviceaccount:cert-manager:cert-manager]
google_storage_bucket_iam_member.cluster_oidc_public_read: Refreshing state... [id=b/rockingham-homelab-oidc/roles/storage.objectViewer/allUsers]
google_service_account_iam_member.tf_ci_apply_wif_user: Refreshing state... [id=projects/rockingham-homelab/serviceAccounts/tf-ci-apply@rockingham-homelab.iam.gserviceaccount.com/roles/iam.workloadIdentityUser/principalSet://iam.googleapis.com/projects/594695390705/locations/global/workloadIdentityPools/github-actions/attribute.environment/gcp]
google_project_iam_member.tf_ci_apply_owner: Refreshing state... [id=rockingham-homelab/roles/owner/serviceAccount:tf-ci-apply@rockingham-homelab.iam.gserviceaccount.com]
google_artifact_registry_repository_iam_member.arc_push_writer: Refreshing state... [id=projects/rockingham-homelab/locations/us-east4/repositories/projects/roles/artifactregistry.writer/serviceAccount:tf-ci-arc-push@rockingham-homelab.iam.gserviceaccount.com]
google_artifact_registry_repository_iam_member.cluster_pull_reader: Refreshing state... [id=projects/rockingham-homelab/locations/us-east4/repositories/projects/roles/artifactregistry.reader/serviceAccount:tf-ci-cluster-pull@rockingham-homelab.iam.gserviceaccount.com]
google_storage_bucket_object.cluster_oidc_discovery: Refreshing state... [id=rockingham-homelab-oidc-.well-known/openid-configuration]
google_iam_workload_identity_pool_provider.cluster_talos: Refreshing state... [id=projects/rockingham-homelab/locations/global/workloadIdentityPools/cluster/providers/talos]

OpenTofu used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create

OpenTofu will perform the following actions:

  # google_secret_manager_secret.discord_alertmanager_webhook will be created
  + resource "google_secret_manager_secret" "discord_alertmanager_webhook" {
      + create_time           = (known after apply)
      + deletion_protection   = false
      + effective_annotations = (known after apply)
      + effective_labels      = {
          + "addon"                      = "kube-prometheus-stack"
          + "goog-terraform-provisioned" = "true"
          + "purpose"                    = "addon-credential"
          + "rotation"                   = "manual"
        }
      + expire_time           = (known after apply)
      + id                    = (known after apply)
      + labels                = {
          + "addon"    = "kube-prometheus-stack"
          + "purpose"  = "addon-credential"
          + "rotation" = "manual"
        }
      + name                  = (known after apply)
      + project               = "rockingham-homelab"
      + secret_id             = "discord-alertmanager-webhook"
      + terraform_labels      = {
          + "addon"                      = "kube-prometheus-stack"
          + "goog-terraform-provisioned" = "true"
          + "purpose"                    = "addon-credential"
          + "rotation"                   = "manual"
        }

      + replication {
          + auto {
            }
        }
    }

  # google_secret_manager_secret.healthchecks_watchdog_url will be created
  + resource "google_secret_manager_secret" "healthchecks_watchdog_url" {
      + create_time           = (known after apply)
      + deletion_protection   = false
      + effective_annotations = (known after apply)
      + effective_labels      = {
          + "addon"                      = "kube-prometheus-stack"
          + "goog-terraform-provisioned" = "true"
          + "purpose"                    = "addon-credential"
          + "rotation"                   = "manual"
        }
      + expire_time           = (known after apply)
      + id                    = (known after apply)
      + labels                = {
          + "addon"    = "kube-prometheus-stack"
          + "purpose"  = "addon-credential"
          + "rotation" = "manual"
        }
      + name                  = (known after apply)
      + project               = "rockingham-homelab"
      + secret_id             = "healthchecks-watchdog-url"
      + terraform_labels      = {
          + "addon"                      = "kube-prometheus-stack"
          + "goog-terraform-provisioned" = "true"
          + "purpose"                    = "addon-credential"
          + "rotation"                   = "manual"
        }

      + replication {
          + auto {
            }
        }
    }

Plan: 2 to add, 0 to change, 0 to destroy.

─────────────────────────────────────────────────────────────────────────────

Saved the plan to: tfplan

To perform exactly these actions, run the following command to apply:
    tofu apply "tfplan"

jvcorredor and others added 3 commits May 15, 2026 21:26
…alertmanager-discord-watchdog

# Conflicts:
#	kubernetes/apps/kube-prometheus-stack/README.md
…r-discord-watchdog' into worktree-homelab-181-alertmanager-discord-watchdog
@jvcorredor jvcorredor merged commit aa3a225 into main May 16, 2026
6 checks passed
@jvcorredor jvcorredor deleted the worktree-homelab-181-alertmanager-discord-watchdog branch May 16, 2026 03:03