-
Notifications
You must be signed in to change notification settings - Fork 11
Verify TraceML runtime failures do not stop user training #24
Description
Description
TraceML is an observability tool and must never crash or stop user training if its runtime fails. We tracks a validation task to confirm failure isolation between:
- the user training thread, and
- the TraceML runtime.
The goal is to ensure TraceML fails open (disables itself) rather than impacting training correctness or liveness.
What to test
Intentionally inject failures inside the TraceML runtime, for example:
- TCP / IO errors when sending data to the aggregator
- Runtime thread exceptions
- Serialization failures
- Forced runtime shutdown during active training
Launching-time failures (e.g. TraceML failing to initialize) are acceptable; runtime failures during training are the focus.
Expected behavior
Training continues uninterrupted. No deadlocks or blocking at step boundaries. Runtime failure is caught, logged, and isolated. TraceML disables itself gracefully after failure
Silent deadlocks or resource leaks
Notes
This is an validation task. If training stops or crashes, a separate bug report should be opened and linked to this issue.