Skip to content

Conversation

@betterclever
Copy link
Contributor

Summary

Comprehensive refactor of the error handling system with rich, structured error representation, retry policies, and a new E2E testing framework using Bun's test runner.

Changes

Error Handling Refactor

Rich Error Types (feat/sdk)):

  • Added standardized error taxonomy: ServiceError, NetworkError, RateLimitError, TimeoutError, AuthenticationError, NotFoundError, ValidationError, ConfigurationError
  • Each error type supports:
    • Structured details object for context-specific data
    • Field errors for ValidationError (e.g., fieldErrors.api_key: [...])
    • Retry behavior configuration per error type
  • Removed ~300+ lines of redundant error handling code across components

Retry Policies:

  • Default retry policy: maxAttempts: 3, exponential backoff (coefficient 2.0)
  • Non-retryable errors: ValidationError, AuthenticationError, NotFoundError, ConfigurationError
  • Component-level retry policy support via retryPolicy in component definition
  • Error-specific policies: errorTypePolicies map for fine-grained control

Typed Error Migration (75 files, ~3000 insertions):

  • ✅ Core components: logic-script, entry-point, file-writer, etc.
  • ✅ Security components: subfinder, nuclei, httpx, virustotal, etc.
  • ✅ AI components: ai-agent, openai/gemini/openrouter providers, LLM components
  • ✅ IT automation: okta-user-offboard, google-workspace-license-unassign
  • ✅ Worker activities: run-component, run-dispatcher, human-input
  • ✅ Destination adapters: s3, artifact storage adapters

E2E Testing Framework

Bun Test Runner Integration (feat(e2e)):

  • tsconfig.e2e.json - Separate TypeScript config for E2E tests
  • e2e-tests/error-handling.test.ts - 4 test scenarios:
    1. Permanent Service Error: Validates max retry policy enforcement
    2. Retryable Success: Validates exponential backoff and success on retry
    3. Validation Error Details: Validates non-retryable error with field errors
    4. Timeout Error: Validates timeout error structure (skipped due to known issue)
  • e2e-tests/justfile - Commands: test, cleanup, list, check-services, logs
  • e2e-tests/cleanup.ts - Removes workflows with "Test:" prefix
  • e2e-tests/README.md - Comprehensive documentation with examples

Test Component:

  • Added test.error.generator component for E2E testing
  • Parameters: mode, errorType, errorMessage, errorDetails, failUntilAttempt, alwaysFail
  • Registered in worker component index
  • Note: Not hidden in production (used for testing only)

UI Improvements

Error Display:

  • Compact error badge on nodes showing only error type name (e.g., "ValidationError", "ServiceError")
  • Expandable ChevronDown icon that rotates on toggle (180° animation)
  • Full error details shown on click (field errors, structured details)
  • Auto-collapse when node status changes away from 'error'
  • Event stream continues to show full error details via ExecutionErrorView

Testing

  • ✅ Typecheck passes (all workspaces)
  • ✅ Error handling E2E tests validate:
    • Error types correctly thrown and propagated
    • Retry policy honored (max attempts, backoff)
    • Non-retryable errors fail immediately
    • Structured error details preserved (field errors, attempt tracking)
  • ✅ Workflow runs complete successfully after expected attempts

Breaking Changes

None - Error types are backwards compatible and components return structured errors.

Checklist

  • Error taxonomy implemented with rich details
  • Retry policies with exponential backoff
  • Typed error migration (worker, SDK, components)
  • E2E test framework with Bun test runner
  • UI improvements for error display
  • Documentation (E2E README, inline comments)
  • Typecheck passes
  • Manual testing completed

Notes

E2E tests require local dev environment running:

pm2 start pm2.config.cjs
just test

See e2e-tests/README.md for full documentation.

This adds a comprehensive error system for component SDK:

Error Types (Transient - Retryable):
- NetworkError: DNS failures, connection timeouts
- RateLimitError: HTTP 429, API quota exceeded
- ServiceError: HTTP 5xx, service unavailable
- TimeoutError: Operation timeouts
- ResourceUnavailableError: Connection pool/queue exhaustion

Error Types (Permanent - Non-Retryable):
- AuthenticationError: HTTP 401/403, invalid credentials
- NotFoundError: HTTP 404, missing resources
- ValidationError: Invalid input, missing fields
- ConfigurationError: Invalid settings
- PermissionError: Access denied
- ContainerError: Docker image issues

Helpers:
- fromHttpResponse(): Auto-classify HTTP errors
- NetworkError.from(): Wrap Node.js errors
- RateLimitError.fromHeaders(): Parse rate limit headers
- isRetryableError(), getRetryDelayMs(), wrapError()

Also adds ComponentRetryPolicy interface for per-component
retry configuration with Temporal integration support.

Signed-off-by: betterclever <paliwal.pranjal83@gmail.com>
Tier 1 component migration to use standardized error types:

notification.slack:
- ConfigurationError for missing webhookUrl/slackToken
- AuthenticationError for invalid_auth/token_revoked
- fromHttpResponse() for webhook HTTP failures
- Added retry policy: 5 attempts, 2s initial, 60s max

core.http-request:
- TimeoutError for request timeouts (AbortError)
- NetworkError for connection failures
- fromHttpResponse() for HTTP error responses
- Added retry policy: 3 attempts, 1s initial, 30s max

Both components now throw semantically typed errors that enable
intelligent retry decisions by the Temporal activity handler.

Signed-off-by: betterclever <paliwal.pranjal83@gmail.com>
- ConfigurationError when storage service is not available
- Added retry policy: 3 attempts, 1s initial, 10s max
- Quick retries for transient I/O issues

Signed-off-by: betterclever <paliwal.pranjal83@gmail.com>
Tier 2 security component migration to standardized errors:

security.virustotal:
- ValidationError for missing indicator
- ConfigurationError for missing API key
- fromHttpResponse() for API failures
- Retry policy: 4 attempts, 2s initial, 120s max

security.dnsx:
- ContainerError for non-docker runner check
- Retry policy: 3 attempts, 1s initial, 10s max

security.subfinder:
- ContainerError for non-docker runner check
- Retry policy: 2 attempts, 5s initial, 30s max (conservative for expensive ops)

security.nuclei:
- ValidationError for template validation failures
- ServiceError for scan execution failures
- Retry policy: 2 attempts, 10s initial (expensive scans)

All security components now throw typed errors with appropriate
retry policies based on their operational characteristics.

Signed-off-by: betterclever <paliwal.pranjal83@gmail.com>
Tier 3 AI component migration to standardized errors:

core.provider.openai:
- ConfigurationError for missing API key
- Retry policy: 1 attempt (no retry for config validation)

core.provider.gemini:
- ConfigurationError for missing API key
- Retry policy: 1 attempt (no retry for config validation)

core.provider.openrouter:
- ConfigurationError for missing API key
- Retry policy: 1 attempt (no retry for config validation)

core.ai.generate-text:
- ConfigurationError for missing API key
- Retry policy: 3 attempts, 2s initial, 30s max
  (handles transient API errors with exponential backoff)

Signed-off-by: betterclever <paliwal.pranjal83@gmail.com>
- Replace generic Error throws with typed errors:
  - ConfigurationError for missing API key
  - ValidationError for input/schema issues with full context
  - fromHttpResponse for MCP request failures
- Add retry policy (3 attempts, 2s-30s exponential backoff)
- Preserve original error context and AI response snippets for debugging

Signed-off-by: betterclever <paliwal.pranjal83@gmail.com>
- manual-selection: ValidationError for empty options
- okta-user-offboard: NotFoundError, ServiceError, ConfigurationError
- google-workspace-user-delete: NotFoundError, ServiceError, ConfigurationError
- github-remove-org-membership: full error taxonomy including
  AuthenticationError, PermissionError, NetworkError, TimeoutError,
  NotFoundError, and fromHttpResponse helper

Added retry policies to all components with appropriate backoff settings
and non-retryable error types configured per component requirements.

Signed-off-by: betterclever <paliwal.pranjal83@gmail.com>
Migrated components:
- trufflehog: ContainerError, ValidationError, retry policy (2 attempts)
- shuffledns-massdns: ContainerError, retry policy (2 attempts)
- httpx: ServiceError for exit codes, retry policy (2 attempts)
- notify: ConfigurationError for missing provider config
- atlassian-offboarding: ValidationError, ConfigurationError, NetworkError
- nuclei: Complete typed error migration for template validation

All components now use the standardized error taxonomy with
appropriate retry policies and non-retryable error types.

Signed-off-by: betterclever <paliwal.pranjal83@gmail.com>
Security components:
- amass: ServiceError, ValidationError, retry policy
- supabase-scanner: ValidationError, retry policy
- prowler-scan: ConfigurationError, ValidationError, ServiceError, retry policy
- nuclei: ValidationError for template validation

Core components:
- array-pick: ValidationError for out-of-bounds
- secret-fetch: ConfigurationError, NotFoundError, ValidationError
- file-writer: ValidationError for missing content

Signed-off-by: betterclever <paliwal.pranjal83@gmail.com>
- entry-point: ValidationError for missing required inputs
- artifact-writer: ConfigurationError for missing artifact service
- text-splitter: ValidationError for missing file content
- text-joiner: ValidationError for missing file content

Signed-off-by: betterclever <paliwal.pranjal83@gmail.com>
- ContainerError for Docker runner fallback scenario
- ValidationError for failed JSON parsing of script results

Signed-off-by: betterclever <paliwal.pranjal83@gmail.com>
Error categorization improvements:
- TimeoutError: Container timeouts with timeout duration in ms
- ContainerError: Docker spawn failures, non-zero exit codes, PTY failures
- ValidationError: Failed to serialize input params to JSON
- ConfigurationError: Unsupported runner type

All errors include contextual details like:
- Docker command arguments (formatted for readability)
- Exit codes and stderr output
- Original cause errors for debugging

Signed-off-by: betterclever <paliwal.pranjal83@gmail.com>
Core execution pipeline:
- run-component.activity.ts: NotFoundError for missing components, ValidationError for missing inputs
- workflow-runner.ts: NotFoundError for missing actions/components
- human-input.activity.ts: ConfigurationError for missing database
- run-dispatcher.activity.ts: ConfigurationError for missing env, ServiceError for API failures

Adapters:
- secrets.adapter.ts: ServiceError for decryption failures
- artifact.adapter.ts: NotFoundError for missing artifacts/files
- file-storage.adapter.ts: NotFoundError for missing files
- kafka-*.adapter.ts: ConfigurationError for missing brokers
- loki-log.adapter.ts: ServiceError for push failures

Utilities:
- isolated-volume.ts: ValidationError for invalid inputs, ConfigurationError for state errors, ContainerError for Docker failures

Destinations:
- registry.ts: ConfigurationError for duplicate adapters, NotFoundError for missing adapters
- s3.ts: ConfigurationError for missing bucket/credentials

Component SDK:
- registry.ts: ConfigurationError for duplicate components
- contracts.ts: ConfigurationError for duplicate contracts
- ports.ts: ValidationError for invalid credential names

Signed-off-by: betterclever <paliwal.pranjal83@gmail.com>
Remaining worker files:
- destinations/adapters/artifact.ts: ConfigurationError for missing artifact service
- temporal/workers/dev.worker.ts: ConfigurationError for missing env vars
- components/test/docker-echo.ts: ContainerError for inline fallback
- temporal/workflows/index.ts: ApplicationFailure with error types for Temporal sandbox

Notes:
- Temporal workflow code uses ApplicationFailure instead of component-sdk errors
  due to sandbox import restrictions
- All error types now include contextual details for debugging

Signed-off-by: betterclever <paliwal.pranjal83@gmail.com>
Propagate detailed error information from worker to frontend:
- Updated TraceEvent in component-sdk and shared packages to support structured error objects
- Enriched NODE_FAILED events with error types (ValidationError, NotFoundError, etc.), stacks, and details
- Migrated backend workflow_traces.error column to jsonb for flexible event storage
- Implemented ExecutionErrorView on frontend for beautiful, category-aware error rendering
- Integrated structured error display into WorkflowNode and EventInspector components

Signed-off-by: betterclever <paliwal.pranjal83@gmail.com>
- Add E2E test suite using Bun's built-in test runner
- Create tsconfig.e2e.json for proper TypeScript configuration
- Implement error-handling.test.ts with 4 test scenarios:
  - Permanent Service Error (validates max retry policy)
  - Retryable Success (validates exponential backoff)
  - Validation Error (validates non-retryable behavior)
  - Timeout Error (validates timeout error structure)
- Add justfile with test, cleanup, and helper commands
- Add cleanup.ts to remove test workflows with "Test:" prefix
- Add comprehensive README.md with test documentation
- Update test-error-generator component with alwaysFail parameter
- Fix WorkflowNode error badge to show expandable arrow
- Register test-error-generator in worker component index

E2E tests require local dev environment (backend + worker running).
Tests run via "just test" or "bun test --config tsconfig.e2e.json".

Signed-off-by: betterclever <paliwal.pranjal83@gmail.com>
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 230 to 234
error: {
message: errorMsg,
type: (error as any)?.type || (error instanceof Error ? error.name : 'UnknownError'),
stack: error instanceof Error ? error.stack : undefined,
details: (error as any)?.details,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve ValidationError field errors in trace payloads

The trace event only serializes details from the thrown error (details: (error as any)?.details), but ValidationError stores field-level information in error.fieldErrors (e.g., core.workflow.entrypoint throws a ValidationError with only fieldErrors). As a result, those field errors are dropped from trace records and the new ExecutionErrorView cannot show them. Consider merging fieldErrors into the trace payload (e.g., details: { ...error.details, fieldErrors: error.fieldErrors }) or adding a dedicated field on TraceError so ValidationError context isn’t lost.

Useful? React with 👍 / 👎.

Comment on lines 64 to +68
id: 'core.http.request',
label: 'HTTP Request',
category: 'transform',
runner: { kind: 'inline' },
retryPolicy: httpRequestRetryPolicy,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Wire component retryPolicy into activity scheduling

Components now declare retryPolicy (e.g., HTTP Request sets maxAttempts/backoff), but the workflow/worker path never reads ComponentDefinition.retryPolicy when scheduling activities (proxyActivities only sets timeouts, and runComponentActivity ignores it). This means per-component retry limits and nonRetryableErrorTypes have no effect despite being configured. Consider applying the component’s retry policy when invoking runComponentActivity or translating it into Temporal activity options so the configuration is honored.

Useful? React with 👍 / 👎.

- Add RUN_E2E environment variable check in e2e test file
- Configure bunfig.toml to exclude e2e-tests from coverage
- Add test:e2e script to package.json for running e2e tests explicitly
- Fix http-request and atlassian-offboarding test assertions to match actual error messages

Signed-off-by: betterclever <paliwal.pranjal83@gmail.com>
@betterclever betterclever force-pushed the feature/error-retry-policies branch from 269d699 to f3a6179 Compare December 29, 2025 11:16
- Add dedicated fieldErrors field to TraceErrorSchema and TraceEvent interface
- Extract fieldErrors from ValidationError instances without using 'any' types
- Update frontend to read from dedicated fieldErrors field instead of details.fieldErrors
- Update backend trace service and OpenAPI schema to handle fieldErrors
- Ensure consistency across activity and workflow-runner error handling paths

Fixes the issue where ValidationError field-level information was being
dropped from trace records, preventing ExecutionErrorView from displaying
field-specific validation errors.

Signed-off-by: betterclever <paliwal.pranjal83@gmail.com>
Signed-off-by: betterclever <paliwal.pranjal83@gmail.com>
Signed-off-by: betterclever <paliwal.pranjal83@gmail.com>
- Add synchronous service availability check before tests are defined
- Use test.skip conditionally based on service availability
- Tests now officially skip (not pass) when RUN_E2E is set but services are down
- Fix TypeScript errors with timeout option in test.skip wrapper
- Include required x-internal-token header in curl health check

Signed-off-by: betterclever <paliwal.pranjal83@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants