
feat(rl-env): add health check and error recovery for rollout collection #78

@abrichr

There is no error handling around rollout collection. If the WAA server fails mid-rollout (timeout, crash, or a dialog that cannot be dismissed), the entire training run crashes.

Need:

  • Proactive health_check() before rollouts
  • try/except with retry in collect_rollout
  • VM pool health monitoring endpoint (GET /health -> {"status": "ready"|"busy"|"needs_recovery"})
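
A minimal sketch of how the first two items could fit together, not the project's actual implementation. It assumes the environment object exposes `health_check()` (returning one of the three status strings above) and a hypothetical `recover()` method. The names `Health`, `RolloutError`, `collect_rollout_with_retry`, and the `run_rollout` callable are all illustrative.

```python
import time
from enum import Enum
from typing import Any, Callable, Optional


class Health(str, Enum):
    """Status values mirroring the proposed GET /health payload."""
    READY = "ready"
    BUSY = "busy"
    NEEDS_RECOVERY = "needs_recovery"


class RolloutError(RuntimeError):
    """Raised when a rollout cannot be collected after all attempts."""


def collect_rollout_with_retry(
    env: Any,                             # assumed to have health_check() and recover()
    run_rollout: Callable[[Any], Any],    # hypothetical: performs one rollout on env
    max_attempts: int = 3,
    backoff_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Check health before each attempt; retry with exponential backoff on failure."""
    last_error: Optional[BaseException] = None
    for attempt in range(1, max_attempts + 1):
        try:
            # Proactive check so a known-bad VM is repaired before work starts.
            status = env.health_check()
            if status == Health.NEEDS_RECOVERY:
                env.recover()
            elif status == Health.BUSY:
                raise RuntimeError("environment is busy")
            return run_rollout(env)
        except Exception as exc:
            # Broad catch is deliberate in this sketch: timeouts, crashes, and
            # stuck dialogs may surface as different exception types.
            last_error = exc
            if attempt < max_attempts:
                sleep(backoff_s * 2 ** (attempt - 1))
    raise RolloutError(f"rollout failed after {max_attempts} attempts") from last_error
```

With this shape, the training loop would catch `RolloutError` and skip or reschedule that rollout instead of crashing. Whether a `busy` status should count against the retry budget is a design choice left open here.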

Labels: enhancement
