Epic: Distributed Failure Recovery and Fault Propagation in Microservices #45

@KeveenMenezes

Epic Title:
Distributed Failure Recovery and Fault Propagation in Microservices

Problem Statement:
DuckStore is a distributed microservices architecture deployed on AWS across multiple regions. An upstream dependency (e.g., a third-party payment provider) suffers intermittent outages. These failures propagate across both synchronous HTTP and asynchronous messaging boundaries, leading to inconsistent service state, phantom duplicate orders, and delayed event processing.
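One common way to keep an outage in a synchronous dependency from spreading to its callers is a circuit breaker. The following is a minimal sketch, not DuckStore code: it is written in Python for brevity (the services themselves are .NET), and the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `reset_timeout` are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without invoking the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of tying up threads on a dead dependency.
                raise CircuitOpenError("dependency unavailable, failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers would wrap the payment call as `breaker.call(charge, order)` and treat `CircuitOpenError` as a signal to park the order rather than retry immediately.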

Architectural Context:

  • Microservices built with .NET 10
  • HTTP-based synchronous integration and event-driven messaging
  • Multi-region AWS deployment across multiple environments
  • Amazon Cognito enabled; no other infrastructure provisioned

Constraints:

  • No central orchestration or managed workflow solutions
  • Eventual consistency enforced; atomicity is not guaranteed
  • Partial failures (network, service, region) expected

Non-Functional Requirements:

  • Recovery from major incidents in under 10 minutes
  • No data loss for order placement events
  • Distributed tracing of fault domains
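Tracing a fault across services requires every hop to carry the same trace identifier. A minimal sketch of that propagation, assuming the W3C Trace Context `traceparent` header format (`version-traceid-parentid-flags`); the helper names are hypothetical, and a real .NET service would normally rely on its tracing library rather than hand-rolled code like this.

```python
import re
import secrets

# version "00", 32 hex chars of trace id, 16 hex chars of parent id, 2 hex chars of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace at the edge of the system."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def child_traceparent(incoming):
    """Header for an outgoing call: same trace id, fresh span id."""
    match = TRACEPARENT_RE.match(incoming or "")
    if match is None:
        # Missing or malformed header: start a new trace instead of failing.
        return new_traceparent()
    trace_id, _parent_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

The same value would also need to travel in message attributes on the asynchronous path, so that an event consumed minutes later can still be tied to the request that produced it.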

Acceptance Criteria:

  • Fault domains and propagation paths documented
  • Architecture includes recovery playbooks for cascades
  • Integration tests demonstrate recovery from simulated cascading faults
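A test for the last criterion could inject failures deterministically at the dependency boundary and assert on what the upstream services observe. The sketch below is in Python for brevity and everything in it is hypothetical: `FaultInjector`, the `charge` stand-in for the payment call, and the two-hop `checkout` and `place_order` chain. A real test would catch timeouts and connection errors, not a marker exception.

```python
class InjectedFault(Exception):
    """Marks failures introduced deliberately by the test harness."""


class FaultInjector:
    """Wraps a callable and fails the calls whose 1-based index is in fail_on."""

    def __init__(self, target, fail_on):
        self.target = target
        self.fail_on = set(fail_on)
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls in self.fail_on:
            raise InjectedFault(f"injected failure on call {self.calls}")
        return self.target(*args, **kwargs)


def charge(order_id):
    # Stand-in for the third-party payment call.
    return {"order_id": order_id, "status": "CHARGED"}


def place_order(order_id, pay):
    """Order service: parks the order instead of propagating the fault."""
    try:
        return pay(order_id)
    except InjectedFault:
        return {"order_id": order_id, "status": "PENDING_PAYMENT"}


def checkout(order_ids, pay):
    """Edge service calling the order service once per order."""
    return [place_order(oid, pay) for oid in order_ids]


def test_payment_outage_is_contained():
    pay = FaultInjector(charge, fail_on={2})
    results = checkout(["o-1", "o-2", "o-3"], pay)
    statuses = [r["status"] for r in results]
    assert statuses == ["CHARGED", "PENDING_PAYMENT", "CHARGED"]
    assert pay.calls == 3  # one attempt per order, no hidden retries


test_payment_outage_is_contained()
```

Making the injector deterministic (fail call N) rather than random keeps the test reproducible; randomized injection fits better in a separate chaos run.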

Risk Areas:

  • Distributed rollback coordination
  • Messaging deduplication/idempotency
  • Latent race conditions during recovery
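For the deduplication risk, a common mitigation is an idempotent consumer that records the IDs of messages it has already applied, so a redelivered or replayed message becomes a no-op. A minimal in-memory sketch in Python follows; the `OrderConsumer` class and the `message_id`, `order_id`, and `payload` fields are assumptions for illustration, not DuckStore's actual message schema.

```python
class OrderConsumer:
    """Applies each message at most once, keyed by its message_id."""

    def __init__(self):
        # In a real service both structures would live in a durable store
        # and be updated in the same transaction.
        self.processed_ids = set()
        self.orders = {}

    def handle(self, message):
        """Return True if the message was applied, False if it was a duplicate."""
        msg_id = message["message_id"]
        if msg_id in self.processed_ids:
            return False  # redelivery or replay: skip side effects
        self.orders[message["order_id"]] = message["payload"]
        self.processed_ids.add(msg_id)
        return True


consumer = OrderConsumer()
msg = {"message_id": "m-1", "order_id": "o-1", "payload": {"total_cents": 1999}}
first = consumer.handle(msg)    # applied
second = consumer.handle(msg)   # same message delivered again, ignored
```

This only prevents phantom duplicates if the producer assigns a stable `message_id` per logical event; an ID regenerated on every retry would defeat the check.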

Suggested Research Topics:

  • Orchestrated vs. choreographed saga recovery patterns
  • Failure injection and chaos testing in .NET microservices
  • Event replay, dead-letter, and outbox strategies in AWS
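As a starting point for the outbox and dead-letter topics, the sketch below writes the order and its event in one local transaction, then relays pending events and marks them dead after repeated publish failures. It uses Python and SQLite only to stay self-contained; the table layout, the `place_order` and `relay_once` helpers, and the `max_attempts` parameter are all hypothetical, and the `publish` callback stands in for whatever broker is eventually chosen.

```python
import json
import sqlite3


def init_db(conn):
    conn.executescript("""
        CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            attempts INTEGER NOT NULL DEFAULT 0,
            status TEXT NOT NULL DEFAULT 'PENDING'  -- PENDING | SENT | DEAD
        );
    """)


def place_order(conn, order_id, total_cents):
    """Write the order and its event atomically: both commit or neither does."""
    payload = json.dumps({"order_id": order_id, "total_cents": total_cents})
    with conn:  # commits on success, rolls back on exception
        conn.execute("INSERT INTO orders (id, total_cents) VALUES (?, ?)",
                     (order_id, total_cents))
        conn.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                     ("OrderPlaced", payload))


def relay_once(conn, publish, max_attempts=3):
    """Try to publish every pending event; dead-letter after max_attempts failures."""
    rows = conn.execute(
        "SELECT id, event_type, payload, attempts FROM outbox "
        "WHERE status = 'PENDING' ORDER BY id").fetchall()
    for row_id, event_type, payload, attempts in rows:
        attempts += 1
        try:
            publish(event_type, json.loads(payload))
            status = "SENT"
        except Exception:
            status = "DEAD" if attempts >= max_attempts else "PENDING"
        with conn:
            conn.execute("UPDATE outbox SET status = ?, attempts = ? WHERE id = ?",
                         (status, attempts, row_id))
```

Because a crash between `publish` and the status update would cause the event to be sent again on the next pass, this relay is at-least-once and assumes consumers deduplicate. Rows marked `DEAD` remain in the table, so they can be inspected and replayed by resetting their status.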

Difficulty Level: Architect-Level

Metadata

  • Assignees: none
  • Labels: copilot challenge (a challenge proposed by the co-pilot, with simulations of possible production errors)
  • Status: Backlog
  • Milestone: none