ops: PowerShell operator scripts for AWS EC2 dev-loop #13

Merged
jieyao-MilestoneHub merged 2 commits into main from ops/aws-ec2-operator-scripts
May 3, 2026

@jieyao-MilestoneHub
Contributor

Summary

Adds scripts/ops/ with four parameterized PowerShell helpers that turn an EC2-hosted llm-gateway box into a low-friction, near-zero-fixed-cost dev environment. All scripts are idempotent, use tag-based instance discovery, and respect the existing llm-gateway-bootstrap / systemd / idle-cron design.

  • setup-ssh.ps1: one-time-per-laptop ed25519 key bootstrap. Opens a transient SG :22 inbound rule scoped to the operator's /32 (the IP returned by checkip.amazonaws.com is validated against an IPv4 regex before the rule is authored), pushes the public key via EC2 Instance Connect (60s TTL), then persists it into ~ubuntu/.ssh/authorized_keys for ongoing use.
  • fix-and-start.ps1: starts the instance, disables the idle alarm/cron, sed-patches the systemd unit (docker compose --no-color → --ansi never for Compose v2.x compatibility; a safe no-op if already correct), waits for "Application startup complete.", then smoke-tests /health, /ready, and /v1/chat/completions from inside the box.
  • restore-idle-protection.ps1: re-enables the CW alarm action and the idle cron; the optional -StopNow switch stops the instance immediately to lock in savings.
  • teardown-ssh.ps1: revokes the transient SG :22 rule when the dev box will not be needed for a while.
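
To make the IP validation step concrete, here is a minimal sketch of how a checkip.amazonaws.com response could be validated and turned into a /32 CIDR before any SG rule is written. It is an illustration under assumptions, not the code in setup-ssh.ps1: the function names `Test-IPv4Address` and `ConvertTo-HostCidr` and the exact regex are hypothetical.

```powershell
Set-StrictMode -Version Latest
$ErrorActionPreference = 'Stop'

# Hypothetical helper: true only for a dotted-quad IPv4 address with octets 0-255.
function Test-IPv4Address {
    param([Parameter(Mandatory)][AllowEmptyString()][string]$Candidate)
    $octet = '(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
    # \A and \z anchor the whole string, so trailing junk or newlines are rejected.
    return $Candidate -match "\A${octet}(\.${octet}){3}\z"
}

# Hypothetical helper: trim the response and build a single-host CIDR, or fail loudly.
function ConvertTo-HostCidr {
    param([Parameter(Mandatory)][string]$Address)
    $trimmed = $Address.Trim()
    if (-not (Test-IPv4Address -Candidate $trimmed)) {
        throw "Refusing to author an SG rule: '$trimmed' is not an IPv4 address."
    }
    return "$trimmed/32"
}

# The service answers with the address plus a trailing newline, simulated here.
ConvertTo-HostCidr -Address "203.0.113.7`n"   # 203.0.113.7/32
```

In a live run the input would come from something like `Invoke-RestMethod -Uri 'https://checkip.amazonaws.com'`. Validating first means an unexpected response (an HTML error page, for example) stops the script instead of reaching the security group call.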

scripts/ops/README.md documents prereqs, tag-based discovery (tag:application=vllm-serving + tag:environment=<env>), the daily flow, and the operator's required IAM permission set.
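
As a rough illustration of tag-based discovery, the two tags above map directly onto `aws ec2 describe-instances --filters` arguments. The sketch below is not taken from the scripts; `Get-GatewayTagFilter` and `Find-GatewayInstanceId` are hypothetical names, and the "exactly one match" rule is an assumption.

```powershell
Set-StrictMode -Version Latest
$ErrorActionPreference = 'Stop'

# Hypothetical helper: build the describe-instances filter strings for one environment.
function Get-GatewayTagFilter {
    param([string]$Environment = 'dev')
    return @(
        'Name=tag:application,Values=vllm-serving'
        "Name=tag:environment,Values=$Environment"
    )
}

# Hypothetical helper: resolve the single matching instance ID via the AWS CLI.
# Assumes the aws executable is on PATH and credentials are already configured.
function Find-GatewayInstanceId {
    param([string]$Environment = 'dev', [string]$Region)
    $awsArgs = @('ec2', 'describe-instances', '--filters') +
        (Get-GatewayTagFilter -Environment $Environment) +
        @('--query', 'Reservations[].Instances[].InstanceId', '--output', 'text')
    if ($Region) { $awsArgs += @('--region', $Region) }
    # Text output separates IDs with whitespace; drop empty fragments.
    $ids = @((& aws @awsArgs) -split '\s+' | Where-Object { $_ })
    if ($ids.Count -ne 1) {
        throw "Expected exactly one instance for '$Environment', found $($ids.Count)."
    }
    return $ids[0]
}

Get-GatewayTagFilter -Environment 'dev'
```

Only `Get-GatewayTagFilter` runs here; calling `Find-GatewayInstanceId` needs real credentials with `ec2:DescribeInstances`. Failing when zero or several instances match is one possible way to avoid acting on the wrong box.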

Top-level README.md gets a short "Ops scripts" section linking to scripts/ops/README.md (right before "License").

Why

These helpers originally lived in a downstream consumer (convilyn). They have zero coupling to any consumer's package layout or domain logic: they only know how to start/stop the gateway box, fix a known systemd-unit issue, and manage the cost guardrails. They are externalized here so any operator running llm-gateway on EC2 can adopt them.

Notes

  • First .ps1 files in the repo. CI is Python-only (ruff / black / pyright / pytest), so these scripts add no new lint surface and are not checked by CI.
  • House style observed: bash-style header docstrings, no per-file copyright header (matches deploy/scripts/*.sh), strict mode ($ErrorActionPreference = 'Stop').
  • Parameterized: -Environment dev (default) drives tag discovery. -InstanceId / -Eip / -Region override for explicit pinning. -AlarmNameContains VLLMIdleBackstop is overridable.
  • Required AWS perms documented in scripts/ops/README.md.
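
The override behaviour described in these notes (an explicit -InstanceId wins, otherwise -Environment drives discovery) could be structured roughly as below. This is a sketch under assumptions rather than the scripts' actual logic; `Resolve-InstanceId` and its `-Discover` parameter are hypothetical, and discovery is injected as a script block so the example runs without AWS access.

```powershell
Set-StrictMode -Version Latest
$ErrorActionPreference = 'Stop'

# Hypothetical helper: prefer an explicitly pinned instance ID, else fall back to discovery.
function Resolve-InstanceId {
    param(
        [string]$InstanceId,
        [string]$Environment = 'dev',
        [Parameter(Mandatory)][scriptblock]$Discover
    )
    # An explicit pin short-circuits discovery entirely.
    if ($InstanceId) { return $InstanceId }
    return & $Discover $Environment
}

# Stand-in for tag-based discovery; a real script would query EC2 here.
$fakeDiscover = { param($envName) "discovered-for-$envName" }

Resolve-InstanceId -InstanceId 'i-0123456789abcdef0' -Discover $fakeDiscover   # i-0123456789abcdef0
Resolve-InstanceId -Discover $fakeDiscover                                     # discovered-for-dev
```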

Test plan

  • Get-ChildItem scripts/ops/*.ps1 | ForEach-Object { Test-Path $_.FullName } returns True four times, once per script
  • .\scripts\ops\setup-ssh.ps1 -Environment dev succeeds against a real vllm-serving instance and pushes an SSH key
  • .\scripts\ops\fix-and-start.ps1 -Environment dev brings the gateway from stopped → ready and the smoke test returns ok
  • .\scripts\ops\restore-idle-protection.ps1 -Environment dev -StopNow re-arms the alarm + cron and stops the instance
  • .\scripts\ops\teardown-ssh.ps1 -Environment dev revokes the SG :22 rule
  • CI green (Python-only, unaffected)
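
The live checks above need a real instance. The "safe no-op if already correct" property claimed for the systemd-unit patch can also be sanity-checked offline. The sketch below mirrors the substitution with PowerShell's `-replace` operator instead of sed; the function name `Update-ComposeColorFlag` and the sample ExecStart line are hypothetical and not taken from the actual unit file.

```powershell
Set-StrictMode -Version Latest
$ErrorActionPreference = 'Stop'

# Hypothetical helper: swap the legacy --no-color flag for the Compose v2 --ansi never form.
function Update-ComposeColorFlag {
    param([Parameter(Mandatory)][string]$UnitText)
    # \b keeps the match from touching a longer flag that merely starts with --no-color.
    return $UnitText -replace '--no-color\b', '--ansi never'
}

# Hypothetical unit line, used only for illustration.
$before = 'ExecStart=/usr/bin/docker compose --no-color up'

$once  = Update-ComposeColorFlag -UnitText $before
$twice = Update-ComposeColorFlag -UnitText $once

$once             # ExecStart=/usr/bin/docker compose --ansi never up
$once -eq $twice  # True: a second pass changes nothing
```

Because the replacement text no longer contains `--no-color`, re-running the patch leaves the unit untouched, which is what makes repeated runs of a start script safe.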

🤖 Generated with Claude Code

jieyao-MilestoneHub and others added 2 commits May 3, 2026 13:49
Adds scripts/ops/ with four parameterized PowerShell helpers for running
the gateway on a single EC2 GPU host without paying for idle time:

- setup-ssh.ps1            one-time-per-laptop key bootstrap (ed25519 +
                           transient SG :22 inbound /32 + EC2 Instance
                           Connect for first-connect key push)
- fix-and-start.ps1        start instance, disable idle alarm/cron,
                           sed-fix systemd unit for compose v2 compat,
                           wait for "Application startup complete.",
                           smoke /health /ready /v1/chat/completions
- restore-idle-protection  re-enable alarm + cron; -StopNow to lock in
                           savings
- teardown-ssh.ps1         revoke the SG :22 rule when done

Tag-based discovery (tag:application=vllm-serving +
tag:environment=<env>) means no hardcoded instance IDs/EIPs - script
params override when needed. checkip.amazonaws.com response is
IPv4-validated before authoring SG rules.

First .ps1 files in the repo; CI is Python-only so no new lint surface.
Used in production by the convilyn dev-loop, externalized here so any
operator running llm-gateway on EC2 can pick them up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drive-by fix to unblock CI on this PR. The previous commit on main
(2c2c48d "fix(schemas): cap messages, tools, and per-message content
length") landed pre-formatted lines that black 26.3.1 wants collapsed
into single lines under the configured line-length=100. Pure formatting,
no semantic change.

Verified: poetry run black --check llm_gateway/ tests/ now reports
"52 files would be left unchanged."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jieyao-MilestoneHub merged commit 2f8c44d into main on May 3, 2026
5 checks passed
jieyao-MilestoneHub deleted the ops/aws-ec2-operator-scripts branch on May 3, 2026 06:02