Skip to content

Verify TraceML runtime failures do not stop user training #24

@abhinavsriva

Description

@abhinavsriva

Description
TraceML is an observability tool and must never crash or stop user training if its runtime fails. We tracks a validation task to confirm failure isolation between:

  • the user training thread, and
  • the TraceML runtime.

The goal is to ensure TraceML fails open (disables itself) rather than impacting training correctness or liveness.

What to test
Intentionally inject failures inside the TraceML runtime, for example:

  • TCP / IO errors when sending data to the aggregator
  • Runtime thread exceptions
  • Serialization failures
  • Forced runtime shutdown during active training

Launching-time failures (e.g. TraceML failing to initialize) are acceptable; runtime failures during training are the focus.

Expected behavior

Training continues uninterrupted. No deadlocks or blocking at step boundaries. Runtime failure is caught, logged, and isolated. TraceML disables itself gracefully after failure

Silent deadlocks or resource leaks

Notes

This is an validation task. If training stops or crashes, a separate bug report should be opened and linked to this issue.

Metadata

Metadata

Assignees

No one assigned

    Labels

    help wantedExtra attention is needed

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions