[question] The need for development of BindingConditions in DRANET for reliable RDMA connectivity #209

@ttsuuubasa

Is there a need to integrate BindingConditions into DRANET to improve the reliability of inter-node RDMA communication?

For distributed workloads such as LLM jobs, multiple Pods across nodes must all be ready and correctly configured for RDMA communication, often under gang scheduling. If a Pod on one node succeeds (e.g., RDMA is configured and an IP is assigned) but a Pod on another node fails, the workload cannot make progress.

If some DRANET operations can fail, could those configurations be performed proactively before Pod scheduling using BindingConditions, rather than during Pod startup? Would this improve inter-node communication reliability?
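
To make the question concrete, here is a minimal sketch of the gating logic I have in mind. It is a simplified model, not the Kubernetes or DRANET API. It assumes that binding waits until every listed binding condition is true and is abandoned when any failure condition is true. The class, the function names, and the `example.com/...` condition types are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    BIND = "bind"              # every binding condition is True
    WAIT = "wait"              # preparation still in progress; keep the Pod unbound
    RESCHEDULE = "reschedule"  # a failure condition is True; try another node


@dataclass
class DeviceAllocation:
    """Hypothetical, simplified stand-in for one allocated RDMA device."""
    node: str
    binding_conditions: list
    binding_failure_conditions: list
    # Condition type -> bool, as a driver or controller would report it.
    conditions: dict = field(default_factory=dict)


def decide(alloc: DeviceAllocation) -> Decision:
    """Decide what to do with one Pod's allocation before it is bound."""
    # Failure conditions win over everything else.
    if any(alloc.conditions.get(c, False) for c in alloc.binding_failure_conditions):
        return Decision.RESCHEDULE
    # A condition that has not been reported yet counts as not satisfied.
    if all(alloc.conditions.get(c, False) for c in alloc.binding_conditions):
        return Decision.BIND
    return Decision.WAIT


def gang_can_start(allocs) -> bool:
    """A gang makes progress only if every member is ready to bind."""
    return all(decide(a) is Decision.BIND for a in allocs)


# Hypothetical condition types, for illustration only.
READY = ["example.com/RdmaConfigured", "example.com/IPAssigned"]
FAILED = ["example.com/PrepareFailed"]
```

With this model, a Pod whose IP assignment has not completed stays in `WAIT` instead of starting and failing later, and one `RESCHEDULE` member is enough to keep the whole gang from starting.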

I understand that some aspects of connectivity (e.g., switch-level configuration) are outside the scope of DRANET. However, there may be use cases where dynamic configuration of the network is required after Pod scheduling, such as setting up VLAN-like isolation.

Additionally, the final verification of connectivity may still require actual Pod-to-Pod communication, which cannot be fully covered by BindingConditions alone.

In another issue (#103 (comment)), it was mentioned that BindingConditions can help in failure scenarios (e.g., when an IPAM controller is down):

> I believe this is a generic assumption with DRA itself that the thing on the Node needs to be up and healthy to be able to allocate resources on the Node when a pod gets assigned to the Node. I fully agree BindingConditions helps us solve part of this, but it isn't specific to this use case here.

In this issue, I would like to discuss a more generic pattern: using BindingConditions proactively to improve reliability for inter-node RDMA communication.

Labels: triage/unresolved (indicates an issue that cannot or will not be resolved)