Implement Failover Mechanism for Critical Dependencies to Ensure 99% Uptime #113

@SverreNystad

Description

To meet the Availability quality requirement A1, which states: "System uptime must be 99%, with capabilities to handle critical operations around the clock," we need to address the uptime dependencies of TutorAI on our commercial off-the-shelf (COTS) solutions, specifically OpenAI and MongoDB.

Current Issue

  1. OpenAI:

    • Uptime Guarantee: OpenAI does not provide a Service Level Agreement (SLA) guaranteeing any specific uptime.
    • Track Record: OpenAI's observed availability has not consistently reached 99%.
    • Impact: Without a failover mechanism, any downtime from OpenAI directly affects TutorAI's availability.
  2. MongoDB:

    • Uptime Guarantee: MongoDB provides an SLA guaranteeing at least 99% uptime (as per their SLA documentation).
    • Impact: Even with the SLA, any MongoDB downtime would still disrupt core TutorAI functionality.
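
To make the impact concrete, here is a rough sketch of the availability arithmetic. It assumes failures are independent and uses 0.99 as a placeholder figure for each dependency; these are illustrative numbers, not measured values for OpenAI or MongoDB, and the function names are made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def allowed_downtime_hours(availability: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime budget implied by an availability target over a period."""
    return (1.0 - availability) * period_hours


def serial_availability(*availabilities: float) -> float:
    """Availability when every dependency must be up at the same time.

    Assumes independent failures, which is a simplification.
    """
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def with_failover(primary: float, backup: float) -> float:
    """Availability of a dependency that has an independent backup.

    The pair is down only when both primary and backup are down.
    """
    return 1.0 - (1.0 - primary) * (1.0 - backup)


if __name__ == "__main__":
    # A 99% target leaves roughly 87.6 hours of downtime per year.
    print(round(allowed_downtime_hours(0.99), 1))
    # Two required dependencies at 99% each give about 98.01% combined.
    print(round(serial_availability(0.99, 0.99), 4))
    # Adding an independent 99% backup to one of them raises that
    # dependency to about 99.99%.
    print(round(with_failover(0.99, 0.99), 4))
```

Under these assumptions, two required dependencies that each only just meet 99% leave the combined system near 98%, below the A1 target, which is why a failover is proposed for both.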

Proposed Solution

To ensure TutorAI meets its uptime requirement, we must implement a failover mechanism for both OpenAI and MongoDB:

  1. For OpenAI:

    • Develop a failover system that automatically switches API usage to an alternative Large Language Model (LLM) provider such as Gemini, Claude, Llama, or Grok during OpenAI outages.
  2. For MongoDB:

    • Implement a fallback solution for critical database operations. This could involve setting up a secondary database system or utilizing a distributed database architecture to minimize downtime impact.
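
The LLM failover described above could be sketched roughly as follows. This is a minimal illustration, not a proposed final design: `complete_with_failover`, `AllProvidersFailed`, and the stub providers are hypothetical names, and the stubs stand in for real provider SDK calls, which are deliberately not shown.

```python
from typing import Callable, List, Sequence, Tuple

# A provider is a (name, callable) pair; the callable takes a prompt and
# returns the completion text. Real SDK calls would be wrapped to fit this.
Provider = Tuple[str, Callable[[str], str]]


class AllProvidersFailed(Exception):
    """Raised when every configured provider failed for a request."""


def complete_with_failover(prompt: str,
                           providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try providers in priority order; return (provider_name, completion)."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real version should catch narrower errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins used only to demonstrate the switch-over.
def openai_stub(prompt: str) -> str:
    raise ConnectionError("simulated outage")


def backup_stub(prompt: str) -> str:
    return f"echo: {prompt}"


if __name__ == "__main__":
    used, text = complete_with_failover(
        "hi", [("openai", openai_stub), ("backup", backup_stub)]
    )
    print(used, text)
```

Ordering the providers as a priority list keeps the primary in use whenever it is healthy. Timeouts, retries, and differences in prompt format between providers are left out of this sketch and would need to be evaluated per provider.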

Action Items

  • Research and Integration:
    • Evaluate potential LLM providers (Gemini, Claude, Llama, Grok) for compatibility and performance.
    • Develop and test the failover mechanism to switch between LLM providers seamlessly.
  • Database Fallback Solutions:
    • Identify suitable fallback strategies for MongoDB.
    • Implement and test the chosen database failover solution.
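
One possible shape for the database fallback, and for testing it, is sketched below. It assumes a simple key-value view of the critical operations and uses in-memory stand-ins instead of a real MongoDB client; `InMemoryStore`, `FallbackStore`, and `StoreUnavailable` are hypothetical names. Reconciling the two stores after the primary recovers is not handled here.

```python
from typing import Any, Dict, Optional


class StoreUnavailable(Exception):
    """Raised by a store that cannot currently serve requests."""


class InMemoryStore:
    """Stand-in for a database; `available` simulates an outage."""

    def __init__(self) -> None:
        self.data: Dict[str, Any] = {}
        self.available = True

    def put(self, key: str, value: Any) -> None:
        if not self.available:
            raise StoreUnavailable("simulated outage")
        self.data[key] = value

    def get(self, key: str) -> Optional[Any]:
        if not self.available:
            raise StoreUnavailable("simulated outage")
        return self.data.get(key)


class FallbackStore:
    """Mirrors writes to both stores and reads from the secondary when
    the primary is unavailable."""

    def __init__(self, primary: InMemoryStore, secondary: InMemoryStore) -> None:
        self.primary = primary
        self.secondary = secondary

    def put(self, key: str, value: Any) -> None:
        written = 0
        for store in (self.primary, self.secondary):
            try:
                store.put(key, value)
                written += 1
            except StoreUnavailable:
                continue
        if written == 0:
            raise StoreUnavailable("both stores are down")

    def get(self, key: str) -> Optional[Any]:
        try:
            return self.primary.get(key)
        except StoreUnavailable:
            return self.secondary.get(key)


if __name__ == "__main__":
    primary, secondary = InMemoryStore(), InMemoryStore()
    store = FallbackStore(primary, secondary)
    store.put("session:1", {"user": "demo"})
    primary.available = False  # simulate a primary outage
    print(store.get("session:1"))
```

Toggling `available` in a test gives a cheap way to exercise the failover path before pointing the same wrapper at real database clients.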

Conclusion

Implementing these failover mechanisms is crucial to ensuring that TutorAI can achieve the required 99% uptime, thus maintaining reliable operations around the clock despite potential downtime from our COTS dependencies.

Labels: enhancement (New feature or request)
