Skip to content

Feat/webhook levels prompt gurads#4999

Open
403ENDer wants to merge 4 commits into
superplanehq:mainfrom
403ENDer:feat/webhook-levels-prompt-gurads
Open

Feat/webhook levels prompt gurads#4999
403ENDer wants to merge 4 commits into
superplanehq:mainfrom
403ENDer:feat/webhook-levels-prompt-gurads

Conversation

@403ENDer
Copy link
Copy Markdown

Summary

This PR introduces two major workstreams:

  1. Webhook reconciliation and subscription binding enhancements
  2. AI prompt guardrails and execution safety enforcement

Webhook Enhancements

The webhook-related changes introduce:

  • Scope-based reconciliation support
  • Subscription binding persistence
  • Webhook operation tracking
  • Shadow-mode drift detection infrastructure

These changes lay the foundation for scalable and deterministic webhook lifecycle management across integrations.

AI Prompt Guardrails

The guardrails-related changes introduce:

  • Post-interpolation prompt scanning
  • Policy-based enforcement
  • Audit persistence
  • Soft/hard block execution handling
  • Human-review workflow support for AI-powered workflow components

Additional Documentation

Notes

  • Webhook reconciliation features are rollout-gated through environment flags and currently operate in shadow/observe-only mode.
  • Guardrails default to audit-only behavior unless enforcement policies are explicitly configured.
  • Some guardrail-related changes may be split into follow-up PRs based on review feedback.

403ENDer and others added 3 commits May 25, 2026 22:26
Implements a 5-phase prompt guardrail system to detect and block
injection attacks, secret leakage, and unsafe instructions in AI node
prompts before they reach the LLM provider.

Phase 0 – Schema: Two DB migrations adding 5 guardrail tables
(prompt_guardrail_policies, prompt_scan_results,
prompt_classifier_results, prompt_override_approvals,
prompt_guardrail_bypass_tokens) and 2 execution columns. Field metadata
extended with PromptField/SystemPromptField markers.

Phase 1 – Dark-launch rule engine: pkg/guardrails/ package with 8
detection rules (6 secret, 2 injection), audit-only default policy,
and integration in node_executor via FeaturePromptGuardrails flag.

Phase 2 – Warn-only tier: ScanConfiguration returns ScanOutcome;
warn_only executions write guardrail_warning to execution Metadata JSON
without blocking.

Phase 3 – Soft-block + GuardianWorker: GuardrailGuardianWorker polls
blocked executions every 30s; resumes on override_approved or times out
with guardrail_override_timeout after SoftBlockTimeoutSeconds.

Phase 4 – Classifier infrastructure: Classifier interface,
NoOpClassifier, ClassifierWorker polling pending jobs in batches, and
policy service layer (GetOrgPolicy, UpsertOrgPolicy, ListPendingOverrides,
ApproveOverride).

Phase 5 – Full enforcement + admin API: AnthropicClassifier calls
claude-haiku-4-5-20251001 to confirm findings and refine risk scores.
Six admin HTTP handlers under /admin/api for managing org/workflow
policies and approving soft-block overrides.

Enable in production via env vars:
  START_GUARDRAIL_GUARDIAN_WORKER=yes
  START_CLASSIFIER_WORKER=yes
  ANTHROPIC_CLASSIFIER_API_KEY=<key>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@superplanehq-integration
Copy link
Copy Markdown

👋 Commands for maintainers:

  • /sp start - Start an ephemeral machine (takes ~30s)
  • /sp stop - Stop a running machine (auto-executed on pr close)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant