Verify TraceML runtime failures do not stop user training #24

**Description**
TraceML is an observability tool and must never crash or stop user training if its runtime fails. This issue tracks a validation task to confirm failure isolation between:
- the user training thread, and
- the TraceML runtime.

The goal is to ensure TraceML fails open (disables itself) rather than impacting training correctness or liveness.

**What to test**
Intentionally inject failures inside the TraceML runtime, for example:
- TCP / IO errors when sending data to the aggregator
- Runtime thread exceptions
- Serialization failures
- Forced runtime shutdown during active training

Launch-time failures (e.g. TraceML failing to initialize) are acceptable; the focus here is on runtime failures that occur while training is already in progress.
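One way to structure such an injection test is sketched below. `FakeRuntime`, `train`, and `run_check` are hypothetical stand-ins written for illustration and are not TraceML's real API. The sketch injects a simulated TCP error into a background runtime thread and then checks that the training loop still finishes within a timeout.

```python
import logging
import queue
import threading

log = logging.getLogger("fake_traceml")


class FakeRuntime:
    """Hypothetical stand-in for the TraceML runtime (not the real API)."""

    def __init__(self, fail_after: int):
        self.fail_after = fail_after
        self.enabled = True
        self._q = queue.Queue(maxsize=8)
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _send(self, n_sent: int, item: int) -> None:
        # Injected failure: simulate a TCP error after `fail_after` sends.
        if n_sent >= self.fail_after:
            raise ConnectionResetError("injected TCP failure")

    def _run(self) -> None:
        sent = 0
        try:
            while True:
                item = self._q.get()
                self._send(sent, item)
                sent += 1
        except Exception:
            # Catch, log, and disable; the error never reaches the training thread.
            log.exception("runtime failed; disabling telemetry")
            self.enabled = False

    def record(self, step: int) -> None:
        """Called from the training thread; must never block or raise."""
        if not self.enabled:
            return
        try:
            self._q.put_nowait(step)
        except queue.Full:
            pass  # drop the sample rather than stall a training step


def train(steps: int, runtime: FakeRuntime) -> int:
    done = 0
    for step in range(steps):
        runtime.record(step)  # telemetry hook at the step boundary
        done += 1             # stands in for the real training step
    return done


def run_check(steps: int = 50, timeout: float = 5.0):
    runtime = FakeRuntime(fail_after=3)
    result = {}
    t = threading.Thread(
        target=lambda: result.update(done=train(steps, runtime)), daemon=True
    )
    t.start()
    t.join(timeout)                # a hang here would indicate a deadlock
    runtime._thread.join(timeout)  # the runtime thread should exit after failing
    return result.get("done"), t.is_alive(), runtime.enabled
```

The same harness shape could be reused for the other failure modes by swapping what `_send` raises (for example a serialization error) or by stopping the runtime thread mid-run.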

**Expected behavior**

- Training continues uninterrupted.
- No deadlocks or blocking at step boundaries.
- Runtime failure is caught, logged, and isolated.
- TraceML disables itself gracefully after failure.

Not acceptable:
- Silent deadlocks or resource leaks
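As an illustration of the "caught, logged, and isolated" behavior, here is a minimal fail-open guard. `FailOpenGuard` and `serialize_metrics` are hypothetical names used only for this sketch; they are assumptions and do not describe how TraceML is actually implemented.

```python
import functools
import logging

log = logging.getLogger("traceml_sketch")


class FailOpenGuard:
    """Hypothetical wrapper: the first failure disables all guarded hooks."""

    def __init__(self):
        self.disabled = False

    def wrap(self, fn):
        @functools.wraps(fn)
        def guarded(*args, **kwargs):
            if self.disabled:
                return None  # fail open: become a no-op after the first failure
            try:
                return fn(*args, **kwargs)
            except Exception:
                # Log once with traceback, then disable so training is unaffected.
                log.exception("telemetry hook %s failed; disabling", fn.__name__)
                self.disabled = True
                return None
        return guarded


guard = FailOpenGuard()
calls = []


@guard.wrap
def serialize_metrics(step):
    # Stand-in hook that always fails, as an injected serialization error.
    calls.append(step)
    raise ValueError("injected serialization failure")


for step in range(3):
    serialize_metrics(step)  # never raises into the training loop
```

The guard catches `Exception` rather than `BaseException`, so `KeyboardInterrupt` and `SystemExit` still propagate and the user can stop training normally.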

**Notes**
This is a validation task. If training stops or crashes, open a separate bug report and link it to this issue.
