feat: use async queue to enable the overlapping of weight loading and RDMA transferring by JensenFire · Pull Request #16 · JD-ETH/slime

JensenFire · 2026-01-18T11:08:19Z

Description

Before this PR, update_weights in RDMA mode looked something like this:

Sequential Execution:
load_weights() -> execute_each() -> load_weights() -> execute_each() -> ...
[===CPU===]     [==WAIT==]      [===CPU===]     [==WAIT==]

However, execute_each() is actually an asynchronous operation. Once load_weights for a model_replica is done, the updated weights can be transferred while the next round of load_weights starts. That is what this PR does: it builds a queue that runs the RDMA transfers asynchronously, which reduces the latency of a single _update_bucket_weights_from_remote() call from 70+ ms to 16 ms and saves 10%-20% of the end-to-end weight-update time.

Before: [image]

After: [image]
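
The overlap described above can be sketched with a plain producer/consumer queue. This is only an illustration under stated assumptions, not the code in this PR: load_weights and rdma_transfer below are hypothetical stand-ins that sleep instead of doing real work, and a single worker thread drains the queue so transfers are still issued in submission order.

```python
import queue
import threading
import time


def load_weights(bucket):
    # Hypothetical stand-in for the CPU-side loading of one bucket.
    time.sleep(0.02)
    return f"weights[{bucket}]"


def rdma_transfer(payload, done):
    # Hypothetical stand-in for waiting on one RDMA write to complete.
    time.sleep(0.02)
    done.append(payload)


def transfer_worker(q, done):
    # Drain the queue in FIFO order until the None sentinel arrives.
    while True:
        payload = q.get()
        if payload is None:
            break
        rdma_transfer(payload, done)


def update_weights_sequential(buckets):
    # Old behaviour: each transfer blocks the next load.
    done = []
    for bucket in buckets:
        rdma_transfer(load_weights(bucket), done)
    return done


def update_weights_overlapped(buckets):
    # New behaviour: the loader keeps going while the worker transfers.
    done = []
    q = queue.Queue()
    worker = threading.Thread(target=transfer_worker, args=(q, done))
    worker.start()
    for bucket in buckets:
        q.put(load_weights(bucket))
    q.put(None)  # sentinel: no more work
    worker.join()  # wait for all pending transfers before returning
    return done
```

Both functions transfer the same payloads in the same order; the overlapped version just hides most of the transfer wait behind the next load. Joining the worker before returning matters, since the caller presumably must not proceed until every transfer has finished.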

JD-ETH · 2026-01-18T17:40:50Z

I just have one concern about the logic:

For q_proj, k_proj, and v_proj, this will call the async task three times, and there is no guarantee that the right weights will be transferred in order. We should only submit those parameters once all the shards are updated. Let's combine the two PRs to achieve this.

Otherwise, if this async queue impl is clearly better, we should remove the old execute_each.

What was blocking, by the way? Is it that the async execute doesn't have its own async loop?
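
One way to address the shard concern raised above (q_proj, k_proj, v_proj) is to gate submission on all shards of a fused parameter having been updated. The sketch below is a hypothetical illustration, not code from either PR: the ShardGate class, the FUSED_SHARDS mapping, and the qkv_proj name are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical mapping from a fused parameter to the shards that fill it.
FUSED_SHARDS = {"qkv_proj": ("q_proj", "k_proj", "v_proj")}


class ShardGate:
    """Submit a fused parameter only after every one of its shards is loaded."""

    def __init__(self, fused_shards, submit):
        # Reverse lookup: shard name -> fused parameter name.
        self._shard_to_fused = {
            shard: fused
            for fused, shards in fused_shards.items()
            for shard in shards
        }
        self._expected = {f: set(s) for f, s in fused_shards.items()}
        self._seen = defaultdict(set)
        self._submit = submit  # e.g. a function that enqueues a transfer

    def on_shard_loaded(self, layer_prefix, shard_name):
        fused = self._shard_to_fused.get(shard_name)
        if fused is None:
            # Not part of a fused parameter: submit right away.
            self._submit(f"{layer_prefix}.{shard_name}")
            return
        key = (layer_prefix, fused)
        self._seen[key].add(shard_name)
        if self._seen[key] == self._expected[fused]:
            # All shards have arrived: submit the fused parameter exactly once.
            del self._seen[key]
            self._submit(f"{layer_prefix}.{fused}")
```

With this gate in front of the queue, the transfer for a fused parameter is enqueued once per layer instead of three times, so a partially updated tensor is never sent.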

JD-ETH

let's discuss tomorrow and try to merge them together, otherwise the results will be wrong.

JD-ETH · 2026-01-18T17:38:56Z

            if self.pipelined_transfer:
-                transfer_bundle.execute_each(updated_name)
+                # Use executable queue for async transfer operations
+                transfer_bundle.execute_each(updated_name, self.executable_queue)


let's default to execute with the queue if it's clearly better

JensenFire · 2026-01-19T02:55:11Z

> I just have one concern about the logic:
>
> For q_proj, k_proj, and v_proj, this will call the async task three times, and there is no guarantee that the right weights will be transferred in order. We should only submit those parameters once all the shards are updated. Let's combine the two PRs to achieve this.

Agreed. The update order does matter, but it should not be a problem here, according to the reply from letian.

> What was blocking, by the way? Is it that the async execute doesn't have its own async loop?

I think so. I'm not sure why self.engine.batch_transfer_async_write() actually executes sequentially.

JD-ETH

good for me to merge right now to get better profiling numbers

JensenFire requested review from JD-ETH and Risc-lt and removed request for JD-ETH January 18, 2026 11:08

JensenFire closed this Jan 18, 2026

JensenFire force-pushed the jsf/async_0118 branch from b4c1717 to 5c1bb4f Compare January 18, 2026 11:10

async

ff2a9f8

JensenFire reopened this Jan 18, 2026

JD-ETH requested changes Jan 18, 2026

JD-ETH approved these changes Jan 21, 2026

JD-ETH merged commit 26173a7 into JD-ETH:jd/rdma-integration Jan 22, 2026
2 checks passed

JD-ETH merged 1 commit into JD-ETH:jd/rdma-integration from JensenFire:jsf/async_0118
