Workflow Orchestration Improvements: Analysis and Recommendations #44

@madjin

Executive Summary

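This issue analyzes the knowledge repo's CI/CD workflows against modern workflow orchestration patterns. The current implementation uses GitHub Actions with time-based scheduling. It works, but failures can be masked by continue-on-error, dependencies between workflows are implicit in the schedule, and there is no retry or rollback logic.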

Scores

| Category | Score | Notes |
|---|---|---|
| Reliability | 7/10 | Good error handling, but no automatic retry/recovery |
| Observability | 5/10 | Basic alerts exist, no centralized monitoring |
| Scalability | 6/10 | Works for current load, but tight coupling limits growth |
| Maintainability | 7/10 | Well-documented, but implicit dependencies |
| Resilience | 5/10 | continue-on-error masks failures, no compensation logic |

Priority 1: Critical (This Week)

1.1 Replace continue-on-error with Explicit Handling

File: .github/workflows/sync.yml

The current pattern silently masks failures:

# Problematic
- name: Sync ElizaOS Documentation
  continue-on-error: true  # Failure is hidden!

Recommended:

- name: Sync ElizaOS Documentation
  id: sync-elizaos
  run: |
    # Record the failure as a step output instead of hiding it
    git clone ... || echo "sync_failed=true" >> "$GITHUB_OUTPUT"

- name: Use cached ElizaOS docs on sync failure
  if: steps.sync-elizaos.outputs.sync_failed == 'true'
  run: |
    echo "::warning::Using cached ElizaOS docs"

1.2 Add Workflow Dependency Triggers

Replace time-based gaps with explicit workflow_run triggers:

# aggregate-daily-sources.yml
on:
  schedule:
    - cron: '30 8 * * *'  # Backup schedule
  workflow_run:
    workflows: ["Sync Knowledge Sources"]
    types: [completed]

jobs:
  aggregate:
    if: |
      github.event_name != 'workflow_run' ||
      github.event.workflow_run.conclusion == 'success'
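
To make the gating rule easier to reason about, here is a minimal Python mirror of the `if:` expression above. The function name `should_run_aggregate` is hypothetical and exists only for illustration; it is not part of the repo.

```python
from typing import Optional


def should_run_aggregate(event_name: str, upstream_conclusion: Optional[str] = None) -> bool:
    """Mirror of the job-level `if:` expression (illustrative only).

    Any event other than workflow_run (for example the backup schedule) runs
    unconditionally; a workflow_run event runs only if the upstream succeeded.
    """
    return event_name != "workflow_run" or upstream_conclusion == "success"


if __name__ == "__main__":
    for event, conclusion in [("schedule", None),
                              ("workflow_run", "success"),
                              ("workflow_run", "failure")]:
        print(event, conclusion, should_run_aggregate(event, conclusion))
```

One consequence worth keeping in mind: the backup schedule is not gated on the sync result, so the aggregate job can run twice on a good day and can run on old inputs after a failed sync. The health check in 1.3 is what catches the second case.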

1.3 Add Health Check Step

- name: Verify prerequisites
  run: |
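    # Fail fast with an explicit annotation if the aggregated file is missing
    [ -f "the-council/aggregated/daily.json" ] || { echo "::error::Missing the-council/aggregated/daily.json"; exit 1; }
    # Warn, but continue, if the file is 24 hours old or more (stat -c is GNU coreutils syntax)
    FILE_AGE=$(( $(date +%s) - $(stat -c %Y the-council/aggregated/daily.json) ))
    [ "$FILE_AGE" -lt 86400 ] || echo "::warning::Aggregated data is stale"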

Priority 2: High (This Month)

2.1 Implement Retry Logic for External Calls

- name: Call OpenRouter API with retry
  uses: nick-fields/retry@v2
  with:
    timeout_minutes: 10
    max_attempts: 3
    retry_wait_seconds: 30
    command: python scripts/etl/extract-facts.py ...
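
If retries are preferred inside the Python scripts rather than at the step level, a small helper along these lines could work. This is a sketch, not the repo's code: the `retry` function and its `backoff` parameter are assumptions, with defaults chosen to match `max_attempts: 3` and `retry_wait_seconds: 30` above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(fn: Callable[[], T], max_attempts: int = 3, wait_seconds: float = 30.0,
          backoff: float = 1.0, sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn until it succeeds or max_attempts calls have failed.

    wait_seconds is the delay after the first failure; each later delay is
    multiplied by backoff (1.0 keeps a fixed wait, 2.0 doubles it each time).
    The sleep parameter is injectable so the helper can be tested without waiting.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = wait_seconds
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(delay)
            delay *= backoff
```

Catching every `Exception` is deliberately broad for brevity; in practice it is usually better to retry only on errors that are likely transient, such as timeouts or HTTP 5xx responses.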

2.2 Add Pipeline Status Dashboard

Create .github/workflows/pipeline-status.yml to check all daily outputs exist and alert on missing data.
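
A status workflow like this could delegate the actual check to a short script. The sketch below assumes a caller-supplied list of expected output paths; only `the-council/aggregated/daily.json` appears in this issue, so any other path passed in would be an assumption about the repo layout. The function names are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List


def missing_outputs(root: Path, expected: Iterable[str]) -> List[str]:
    """Return the expected relative paths that do not exist as files under root."""
    return [rel for rel in expected if not (root / rel).is_file()]


def report(missing: List[str]) -> int:
    """Print one GitHub Actions error annotation per missing file.

    Returns a process exit code: 0 if nothing is missing, 1 otherwise.
    """
    for rel in missing:
        print(f"::error::Missing daily output: {rel}")
    return 1 if missing else 0


if __name__ == "__main__":
    expected = ["the-council/aggregated/daily.json"]  # extend with the other daily outputs
    raise SystemExit(report(missing_outputs(Path("."), expected)))
```

Returning a non-zero exit code fails the step, which lets the existing alerting on failed workflows cover the missing-data case without a separate notification path.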

2.3 Add Input Validation

Validate required fields and data freshness before processing.
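
As a rough illustration of what such validation might look like, the sketch below checks that a JSON input exists, is fresh, and contains a set of required top-level fields. The field names are supplied by the caller and are hypothetical, since the issue does not specify the schema. The 24-hour default mirrors the 86400-second threshold in 1.3.

```python
import json
import time
from pathlib import Path
from typing import List, Optional


def validate_input(path: Path, required_fields: List[str],
                   max_age_seconds: int = 86400,
                   now: Optional[float] = None) -> List[str]:
    """Return a list of problems; an empty list means the input looks usable."""
    if not path.is_file():
        return [f"missing file: {path}"]
    problems: List[str] = []
    # Freshness: compare the file's modification time with the current time
    age = (time.time() if now is None else now) - path.stat().st_mtime
    if age >= max_age_seconds:
        problems.append(f"stale data: {age / 3600:.1f}h old")
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return problems + [f"invalid JSON: {exc}"]
    if not isinstance(data, dict):
        return problems + ["top-level JSON value is not an object"]
    problems += [f"missing field: {name}" for name in required_fields if name not in data]
    return problems
```

Collecting all problems instead of stopping at the first one keeps the log useful when several things are wrong at once.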


Priority 3: Medium (This Quarter)

3.1 Consider Migration to Workflow Orchestrator

Recommended: Dagster (Python-native, good for data pipelines)

Benefits:

  • Asset-based dependencies (run when upstream data changes)
  • Built-in retry and backoff
  • Centralized observability
  • Backfill support

| Capability | GitHub Actions | Temporal/Dagster |
|---|---|---|
| Dependency management | Time-based gaps | Explicit DAG |
| Retry logic | Manual | Built-in |
| Compensation/rollback | Not implemented | Native support |
| Observability | Workflow logs | Unified dashboard |
| Cost | Free (public repo) | Self-hosted or cloud |
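
To illustrate what "Explicit DAG" means in practice, here is a standard-library sketch using `graphlib`. It is not Dagster's or Temporal's API. The step names and the edges between them (sync, then aggregate, then extract_facts) are assumptions inferred from the workflow and script names mentioned in this issue.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each step maps to the set of steps it depends on (assumed edges, for illustration).
PIPELINE: Dict[str, Set[str]] = {
    "sync": set(),
    "aggregate": {"sync"},
    "extract_facts": {"aggregate"},
}


def run_order(dag: Dict[str, Set[str]]) -> List[str]:
    """Return the steps in an order where every dependency comes first."""
    return list(TopologicalSorter(dag).static_order())


def runnable_steps(dag: Dict[str, Set[str]], failed: Set[str]) -> List[str]:
    """Return the steps that may still run, skipping failed steps and everything downstream."""
    blocked = set(failed)
    order = run_order(dag)
    for step in order:
        if dag[step] & blocked:
            blocked.add(step)  # a dependency is blocked, so this step is too
    return [step for step in order if step not in blocked]


if __name__ == "__main__":
    print(run_order(PIPELINE))
    print(runnable_steps(PIPELINE, {"aggregate"}))
```

With dependencies declared this way, a failed upstream step blocks its downstream steps by construction, rather than relying on the schedule leaving a large enough gap.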

Current Issues Identified

  1. Silent Failures: continue-on-error: true on 7 steps in sync.yml
  2. Implicit Dependencies: Time-based gaps that break when timing drifts
  3. No Retry Logic: Steps fail permanently on first error
  4. No Circuit Breaker: Repeated failures don't trigger backoff
  5. No Compensation: Mid-pipeline failures have no rollback mechanism
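
For item 4, a circuit breaker can be sketched in a few lines. The class below is a hypothetical illustration rather than existing code. The threshold and cooldown values are placeholders. In GitHub Actions the state would also have to be persisted between runs (for example in a cache or a committed file), which this sketch does not cover.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after `threshold` consecutive failures; permit a trial call after `cooldown` seconds."""

    def __init__(self, threshold: int = 3, cooldown: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a call may be attempted now."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.cooldown

    def record(self, success: bool) -> None:
        """Record the outcome of a call and update the breaker state."""
        if success:
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()  # (re)open and restart the cooldown
```

A caller would check `allow()` before each external request and call `record()` afterwards, so that a run of failures stops further calls until the cooldown has passed.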

Metrics to Track

| Metric | Target |
|---|---|
| Daily pipeline success rate | >99% |
| Average pipeline duration | <45 min |
| Failed runs requiring manual intervention | <2/month |
| Data freshness (hours since last update) | <12h |
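
The first two metrics can be computed from a list of run records. The sketch below assumes a hypothetical `Run` record with a success flag and a duration in minutes; how those records are collected (for example from the GitHub API) is left open.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Run:
    succeeded: bool
    duration_minutes: float


def success_rate(runs: Sequence[Run]) -> float:
    """Percentage of runs that succeeded."""
    if not runs:
        raise ValueError("no runs to evaluate")
    return 100.0 * sum(run.succeeded for run in runs) / len(runs)


def average_duration(runs: Sequence[Run]) -> float:
    """Mean run duration in minutes."""
    if not runs:
        raise ValueError("no runs to evaluate")
    return sum(run.duration_minutes for run in runs) / len(runs)


if __name__ == "__main__":
    sample = [Run(True, 40), Run(True, 50), Run(False, 30), Run(True, 40)]  # made-up data
    print(success_rate(sample), average_duration(sample))
```

Comparing these values against the targets above (>99% and <45 min) in a scheduled job would turn the table into an automated check.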

Generated from workflow orchestration analysis - see docs/workflow-analysis-report.md for full report
