Is there a need to integrate BindingConditions into DRANET to improve the reliability of inter-node RDMA communication?
For distributed LLM workloads, multiple Pods across nodes must all be ready and correctly configured for RDMA communication, often in a gang scheduling context. If a Pod on one node succeeds (e.g., RDMA is configured and an IP is assigned) but a Pod on another node fails, the workload cannot make progress.
If some DRANET operations can fail, could those configurations be performed proactively before Pod scheduling using BindingConditions, rather than during Pod startup? Would this improve inter-node communication reliability?
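To make the question concrete, here is a minimal sketch of the decision logic I have in mind. It is not DRANET code and does not call any Kubernetes API. It only models my understanding of BindingConditions: binding waits until every listed condition is true, and is abandoned if any failure condition is true. The condition names (`example.com/rdma-configured`, `example.com/ip-assigned`, `example.com/config-failed`) and the `gang_decision` helper are hypothetical and exist only for illustration.

```python
from enum import Enum


class Decision(Enum):
    BIND = "bind"   # all binding conditions are met
    WAIT = "wait"   # still waiting for at least one condition
    FAIL = "fail"   # a failure condition was reported


def binding_decision(binding_conditions, failure_conditions, status):
    """Decide for one Pod. `status` maps a condition name to True/False."""
    # A reported failure takes precedence over everything else.
    if any(status.get(c, False) for c in failure_conditions):
        return Decision.FAIL
    # Bind only once every required condition is reported as true.
    if all(status.get(c, False) for c in binding_conditions):
        return Decision.BIND
    return Decision.WAIT


def gang_decision(decisions):
    """All-or-nothing view over the Pods of one gang (hypothetical helper)."""
    decisions = list(decisions)
    if Decision.FAIL in decisions:
        return Decision.FAIL
    if all(d is Decision.BIND for d in decisions):
        return Decision.BIND
    return Decision.WAIT


# Hypothetical condition names, used only in this sketch.
REQUIRED = ["example.com/rdma-configured", "example.com/ip-assigned"]
FAILURES = ["example.com/config-failed"]

node_a = binding_decision(REQUIRED, FAILURES, {
    "example.com/rdma-configured": True,
    "example.com/ip-assigned": True,
})
node_b = binding_decision(REQUIRED, FAILURES, {
    "example.com/rdma-configured": True,  # IP not assigned yet
})
print(node_a, node_b, gang_decision([node_a, node_b]))
```

In this sketch the Pod on node A is ready while the Pod on node B is still waiting, so the gang as a whole is not started. That is the situation I would like to detect before the Pods are bound rather than during Pod startup.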
I understand that some aspects of connectivity (e.g., switch-level configuration) are outside the scope of DRANET. However, there may be use cases where dynamic configuration of the network is required after Pod scheduling, such as setting up VLAN-like isolation.
Additionally, the final verification of connectivity may still require actual Pod-to-Pod communication, which cannot be fully covered by BindingConditions alone.
In another issue (#103 (comment)), it was mentioned that BindingConditions can help in failure scenarios (e.g., when an IPAM controller is down):
I believe this is a generic assumption with DRA itself that the thing on the Node needs to be up and healthy to be able to allocate resources on the Node when a pod gets assigned to the Node. I fully agree BindingConditions helps us solve part of this, but it isn't specific to this use case here.
In this issue, I would like to discuss a more generic pattern: using BindingConditions proactively to improve reliability for inter-node RDMA communication.