[Hackathon] Render Incident Command Center with SuperPlane #5093
Open
metonym wants to merge 7 commits into
List recent deploys for a Render service to support rollback and incident workflows. Adds the shared Render operation mapper and action configuration helpers used by the new incident-response actions.
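A minimal Python sketch of how a rollback step might use the listed deploys. It assumes each deploy is a dict with hypothetical id, status, and createdAt keys, and that "live" marks the current deploy while "deactivated" marks one that was previously live (status names are assumptions, not taken from this PR). It picks the newest previously live deploy that is older than the current one.

```python
from __future__ import annotations

from datetime import datetime


def parse_ts(value: str) -> datetime:
    # Normalize a trailing "Z" so fromisoformat works on older Pythons too.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def pick_rollback_target(deploys: list[dict]) -> dict | None:
    """Return the newest previously live deploy older than the current one (hypothetical helper)."""
    ordered = sorted(deploys, key=lambda d: parse_ts(d["createdAt"]), reverse=True)
    # Assumed: the deploy currently serving traffic has status "live".
    current = next((d for d in ordered if d["status"] == "live"), None)
    if current is None:
        return None
    cutoff = parse_ts(current["createdAt"])
    for d in ordered:
        # Assumed: "deactivated" means the deploy was live before being replaced.
        if d["status"] == "deactivated" and parse_ts(d["createdAt"]) < cutoff:
            return d
    return None
```

Failed builds newer than the live deploy are skipped, so a broken deploy that never went live is never chosen as a rollback target.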
Fetch Render CPU, memory, request, and connection metrics and emit normalized summaries (latest, avg, max, count, unit) alongside the raw series for monitoring and incident-response workflows.
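The normalized summary can be sketched as follows, assuming the raw series is a list of points with hypothetical timestamp (ISO-8601 string, same format throughout) and value keys. MetricSummary and summarize are illustrative names, not the integration's actual types.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class MetricSummary:
    latest: float | None
    avg: float | None
    max: float | None
    count: int
    unit: str


def summarize(points: list[dict], unit: str) -> MetricSummary:
    """Summarize a series of {"timestamp": ..., "value": ...} points (assumed shape)."""
    if not points:
        # An empty series still reports its unit and a zero count.
        return MetricSummary(latest=None, avg=None, max=None, count=0, unit=unit)
    # ISO-8601 strings in one consistent format sort chronologically.
    ordered = sorted(points, key=lambda p: p["timestamp"])
    values = [float(p["value"]) for p in ordered]
    return MetricSummary(
        latest=values[-1],
        avg=sum(values) / len(values),
        max=max(values),
        count=len(values),
        unit=unit,
    )
```

Keeping the raw series alongside this summary lets downstream threshold checks use latest or max without re-parsing the points.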
Query recent Render logs across one or more resources in the configured workspace, with level, type, text, and path filters. Emits log entries plus a count and error count for incident triage.
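A sketch of how the count and error count relate to the filters, assuming entries have already been fetched as flat dicts with hypothetical level, type, message, and path keys. The filters are applied client-side here purely for illustration; triage_logs is a hypothetical name.

```python
from __future__ import annotations


def triage_logs(
    entries: list[dict],
    level: str | None = None,
    log_type: str | None = None,
    text: str | None = None,
    path: str | None = None,
) -> dict:
    """Filter log entries and report count and errorCount (illustrative only)."""

    def keep(e: dict) -> bool:
        if level and e.get("level", "").lower() != level.lower():
            return False
        if log_type and e.get("type") != log_type:
            return False
        # Case-insensitive substring match on the message text.
        if text and text.lower() not in e.get("message", "").lower():
            return False
        # Treat the path filter as a prefix match.
        if path and not e.get("path", "").startswith(path):
            return False
        return True

    logs = [e for e in entries if keep(e)]
    errors = sum(1 for e in logs if e.get("level", "").lower() == "error")
    return {"logs": logs, "count": len(logs), "errorCount": errors}
```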
Update Render service settings (currently only autoDeploy) so workflows can freeze deploys during an incident and thaw them after recovery. Emits serviceId, autoDeploy, and status.
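A sketch of the freeze/thaw request, assuming Render's REST API accepts PATCH https://api.render.com/v1/services/{serviceId} with autoDeploy set to "yes" or "no" (check the exact field values against Render's API reference). The request is only built, not sent, and auto_deploy_request is a hypothetical name.

```python
import json
import urllib.request

RENDER_API = "https://api.render.com/v1"  # assumed base URL


def auto_deploy_request(service_id: str, enabled: bool, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a request that freezes or thaws auto-deploy."""
    # Assumed encoding: "no" freezes deploys, "yes" thaws them.
    body = json.dumps({"autoDeploy": "yes" if enabled else "no"}).encode()
    return urllib.request.Request(
        f"{RENDER_API}/services/{service_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Sending it would be urllib.request.urlopen(req); keeping construction separate makes the freeze and thaw payloads easy to check in isolation.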
Update Render autoscaling settings (enabled, min/max instances, and CPU and memory targets) for a web service. Emits the requested bounds, targets, status, and Render response.
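A sketch of validating the requested bounds and targets before calling Render. The payload shape (enabled, min, max, and per-metric criteria with a percentage) and the 1 to 100 percentage range are assumptions, not confirmed by this PR; autoscaling_payload is a hypothetical name.

```python
from __future__ import annotations


def autoscaling_payload(
    enabled: bool,
    min_instances: int,
    max_instances: int,
    cpu_target: int | None = None,
    memory_target: int | None = None,
) -> dict:
    """Validate autoscaling inputs and build an assumed request body."""
    if min_instances < 1:
        raise ValueError("min_instances must be at least 1")
    if max_instances < min_instances:
        raise ValueError("max_instances must be >= min_instances")
    criteria = {}
    for name, target in (("cpu", cpu_target), ("memory", memory_target)):
        if target is None:
            criteria[name] = {"enabled": False}
        elif 1 <= target <= 100:
            criteria[name] = {"enabled": True, "percentage": target}
        else:
            raise ValueError(f"{name} target must be 1-100, got {target}")
    # Enabled autoscaling with no target would never scale.
    if enabled and not any(c["enabled"] for c in criteria.values()):
        raise ValueError("enable at least one of cpu_target or memory_target")
    return {"enabled": enabled, "min": min_instances, "max": max_instances, "criteria": criteria}
```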
Create a Render one-off job from an existing service and retrieve its status, enabling health checks, queue drains, repair tasks, and operational runbooks.
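The "retrieve its status" step could be a bounded polling loop like the one below. The terminal statuses (succeeded, failed, canceled) are assumed values, fetch_status is a hypothetical callable wrapping the status lookup, and sleep is injectable so the loop can be tested without waiting.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}  # assumed job statuses


def wait_for_job(
    fetch_status: Callable[[], str],
    interval_s: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal status or attempts run out."""
    for attempt in range(max_attempts):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        # Skip the final sleep; we are about to give up anyway.
        if attempt < max_attempts - 1:
            sleep(interval_s)
    raise TimeoutError(f"job not finished after {max_attempts} checks")
```

Bounding the attempts keeps a stuck queue drain or repair job from blocking the rest of the runbook indefinitely.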
This is a Hackathon submission for https://luma.com/3yx41qr9?tk=19uZD7
Background
This project demonstrates an incident response workflow for a multi-service application running on Render. The demo uses SuperPlane as an operational command center that can inspect Render services, collect runtime metrics, check application health, evaluate incident conditions, and run remediation actions from a single canvas.
The Render demo stack includes a web API, background worker, Key Value cache/queue, and Postgres database.
Changes
This PR extends the existing Render integration with the actions needed for this demo.
Flow
The SuperPlane canvas starts with an Open Incident trigger and fans out into service inspection, deploy lookup, CPU and memory metric collection, Render log inspection, API health checks, DB health checks, and queue health checks. After merging those signals, it branches into two practical paths:
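A sketch of the merge step, assuming each fan-out branch reports a dict keyed by a hypothetical branch name and that health checks set a healthy flag. It records which signals are missing or unhealthy so the branch decision does not silently act on partial data.

```python
from __future__ import annotations

# Hypothetical branch names mirroring the fan-out described above.
EXPECTED_SIGNALS = {
    "service", "deploys", "cpu", "memory",
    "logs", "apiHealth", "dbHealth", "queueHealth",
}


def merge_signals(results: dict[str, dict], expected: set[str] = EXPECTED_SIGNALS) -> dict:
    """Combine branch outputs into one incident snapshot."""
    missing = sorted(expected - set(results))
    # Only an explicit healthy=False counts as unhealthy; absent flags are ignored.
    unhealthy = sorted(name for name, r in results.items() if r.get("healthy") is False)
    return {
        "signals": results,
        "missing": missing,
        "unhealthy": unhealthy,
        "complete": not missing,
    }
```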
The canvas also writes incident snapshots and decisions to the render-incidents memory so a Console can show incident status, service state, threshold decisions, failed redeploys, and manual operator actions such as freeze deploys, thaw deploys, purge cache, redeploy, rollback, scale up, and scale down.
Triage flow (no remediation)
[Figure: triage run in which no scale was required.]
Triage flow (remediation triggered)
[Figure: triage run with pendingJobs: 250 and a TRUE condition result.]
In this flow, remediation was triggered and the services were scaled.
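The scale branch can be sketched as one decision function that also produces a record suitable for the render-incidents memory. PENDING_JOBS_THRESHOLD is a hypothetical value (the PR does not state the real threshold), force_scale stands in for the demo's forceScale toggle, and scaling one instance at a time is an assumed policy.

```python
from __future__ import annotations

from datetime import datetime, timezone

PENDING_JOBS_THRESHOLD = 100  # hypothetical; not stated in this PR


def scale_decision(
    pending_jobs: int,
    current_instances: int,
    max_instances: int,
    force_scale: bool = False,
    threshold: int = PENDING_JOBS_THRESHOLD,
) -> dict:
    """Decide whether to scale up and return a decision record."""
    wants_scale = force_scale or pending_jobs > threshold
    # Assumed policy: add one instance, never exceeding the configured max.
    target = min(current_instances + 1, max_instances) if wants_scale else current_instances
    return {
        "decidedAt": datetime.now(timezone.utc).isoformat(),
        "pendingJobs": pending_jobs,
        "threshold": threshold,
        "forceScale": force_scale,
        "scale": wants_scale and target > current_instances,
        "targetInstances": target,
    }
```

Recording the threshold and the force flag next to the outcome lets the Console explain why a scale did or did not happen.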
To simulate this reliably, the Trigger modal has a forceScale toggle (for demo purposes only).
Final canvas.yaml (service IDs are not sensitive)