Ops: unusable node due to container loading error #240

@lcjohnso

Description

Training jobs have been failing to start since ~Jan 2026. When the node boots, we receive a ContainerInvalidImage error, which leaves the node unusable. The "no space left on device" message suggests we need to debug the container build / load process. Or perhaps libraries are out of date?

See full error details in screenshot:
Image
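
One possible first check, given the "no space left on device" message, is whether the node's disk is actually near full when the image loads. The sketch below is only a suggestion, not something taken from our setup: `disk_report`, the 10% threshold, and the paths in the example are all hypothetical and should be replaced with wherever this node stores and unpacks container images.

```python
import shutil


def disk_report(paths, min_free_fraction=0.10):
    """Return (path, free_fraction, is_low) for each path that can be inspected.

    min_free_fraction is an arbitrary illustrative threshold, not a known limit.
    """
    report = []
    for path in paths:
        try:
            usage = shutil.disk_usage(path)
        except OSError:
            continue  # path missing or unreadable on this node
        if usage.total == 0:
            continue  # pseudo-filesystem reporting no capacity
        free_fraction = usage.free / usage.total
        report.append((path, free_fraction, free_fraction < min_free_fraction))
    return report


if __name__ == "__main__":
    # Hypothetical locations; substitute the node's real image/cache directories.
    for path, free, low in disk_report(["/", "/var/lib/containerd", "/tmp"]):
        flag = " (LOW)" if low else ""
        print(f"{path}: {free:.1%} free{flag}")
```

If every filesystem shows plenty of free space, inode exhaustion would be worth ruling out next, since it produces the same error message.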
