[Design] Etha as a vLLM weight-transfer engine backend #88

@junjzhang

Description

Goal

Land Etha as a first-class vLLM weight-transfer backend, not just an externally-driven example. Today the integration is one-way and example-shaped:

  • examples/vllm_weight_sync drives vLLM through collective_rpc("setup_tensorbus") + collective_rpc("receive_weights") from outside.
  • vLLM has no built-in concept of "Etha as a weight source"; the example monkey-patches in custom RPC names.

The goal is for vLLM to recognize Etha as a registered weight-transfer engine backend, so that RL frameworks (and other callers) don't have to ship their own collective_rpc glue.

Motivation

  1. Production RL needs this contract stable. Right now every RL framework rolls its own version of the examples/vllm_weight_sync/ glue. Pulling it into a vLLM-side abstraction means one converter library, one lifecycle, one error contract.
  2. vLLM-driven placement discovery is already the right design (see transport._get_placements in [feat(examples): land vllm_weight_sync example (currently on examples/vllm-weight-sync branch) #87]). It belongs in vLLM, not in user code that walks vLLM internals.
  3. Avoids the collective_rpc hack. The example only works because vLLM lets us call arbitrary collective_rpc names. That's not an API contract.

Proposed shape (sketch — needs design pass)

Hand vLLM a WeightTransferBackend protocol that owns the receive-side lifecycle:

# inside vLLM, registered like quant backends are today
from __future__ import annotations  # MeshInfo is a placeholder type, not defined yet

from typing import Protocol

import torch.nn as nn


class WeightTransferBackend(Protocol):
    def setup(self, model: nn.Module, mesh_info: MeshInfo) -> None: ...
    def receive(self, weight_version: int) -> None: ...
    def teardown(self) -> None: ...


# Etha provides:
class EthaWeightTransferBackend(WeightTransferBackend):
    # init_pair once, register_tensors per round, drive transport
    ...

vLLM exposes a single endpoint (HTTP or RPC), POST /weights/sync {version}, instead of forcing callers to know about collective_rpc. The backend is selected via vLLM config (weight_transfer_backend: "etha").
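One way the registration and config-driven selection described above could look is sketched below. Everything here is hypothetical: `register_weight_transfer_backend`, `create_backend`, `handle_weights_sync`, and `StubEthaBackend` are illustrative names, not existing vLLM or Etha APIs, and the stub only records calls instead of driving a real transport. Argument types are simplified to `object` so the sketch runs without torch.

```python
from typing import Callable, Dict, Protocol


class WeightTransferBackend(Protocol):
    """Receive-side lifecycle from the sketch above (argument types simplified)."""

    def setup(self, model: object, mesh_info: object) -> None: ...
    def receive(self, weight_version: int) -> None: ...
    def teardown(self) -> None: ...


# Hypothetical registry: backend name -> zero-argument factory.
_BACKENDS: Dict[str, Callable[[], WeightTransferBackend]] = {}


def register_weight_transfer_backend(name: str):
    """Class decorator that registers a backend under `name`."""
    def decorator(cls):
        if name in _BACKENDS:
            raise ValueError(f"backend {name!r} is already registered")
        _BACKENDS[name] = cls
        return cls
    return decorator


def create_backend(name: str) -> WeightTransferBackend:
    """Resolve the configured weight_transfer_backend value to an instance."""
    try:
        factory = _BACKENDS[name]
    except KeyError:
        raise ValueError(
            f"unknown weight_transfer_backend {name!r}; known: {sorted(_BACKENDS)}"
        ) from None
    return factory()


@register_weight_transfer_backend("etha")
class StubEthaBackend:
    """Stand-in for the Etha backend; records lifecycle calls only."""

    def __init__(self) -> None:
        self.calls = []

    def setup(self, model, mesh_info) -> None:
        self.calls.append("setup")

    def receive(self, weight_version: int) -> None:
        self.calls.append(f"receive:{weight_version}")

    def teardown(self) -> None:
        self.calls.append("teardown")


def handle_weights_sync(backend: WeightTransferBackend, body: dict) -> dict:
    """What a POST /weights/sync handler might do with a {"version": N} body."""
    version = int(body["version"])
    backend.receive(version)
    return {"version": version, "status": "ok"}
```

With this shape, a caller only needs the config value and the sync endpoint; `create_backend("etha")` followed by `handle_weights_sync(backend, {"version": 3})` replaces the two custom collective_rpc names used by the example.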

Open design questions:

  • Where does the HF ↔ vLLM converter live — vLLM side (model-specific knowledge already there) or Etha side (current example)?
  • Sync vs async receive: does the engine pause the forward pass while weights arrive, or double-buffer?
  • Failure mode when the trainer-side peer is down: can receive fail cleanly without taking down inference?
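To make the last two questions concrete, here is a minimal sketch of the double-buffer option, where a failed round leaves the serving weights untouched. It illustrates one possible answer and is not a decision. `DoubleBufferedWeights`, `TransferError`, and the `fetch` callable are hypothetical names, and plain dicts of lists stand in for real tensors. Note the trade-off: double-buffering holds two copies of the weights while a round is in flight.

```python
from typing import Callable, Dict, List

Weights = Dict[str, List[float]]


class TransferError(RuntimeError):
    """Raised when a round cannot be completed (e.g. the trainer peer is down)."""


class DoubleBufferedWeights:
    """Serve from `active`; stage each new round separately and swap on success."""

    def __init__(self, initial: Weights, version: int = 0) -> None:
        self.active = initial
        self.version = version

    def receive(self, version: int, fetch: Callable[[], Weights]) -> None:
        try:
            # Staging happens off to the side; `active` is still being served.
            staged = fetch()
        except (ConnectionError, TimeoutError) as exc:
            # Peer is unreachable: report the failure, keep the old weights.
            raise TransferError(f"round {version} failed: {exc}") from exc
        if staged.keys() != self.active.keys():
            # Reject partial or mismatched rounds instead of half-updating.
            raise TransferError(f"round {version} has mismatched parameter names")
        # Swap both fields only after the whole round has been validated.
        self.active, self.version = staged, version
```

A caller would wrap `receive` in a `try/except TransferError` and keep serving on the previous version; the synchronous alternative would instead pause the forward pass and write into the live buffers in place.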

Dependencies

Non-Goals

  • This issue is not a quick refactor of the example. The example stays as the demo; this issue tracks the design + upstream conversation needed to make Etha a registered backend.
  • Not solving multi-vendor weight-transfer (Etha-specific for now; the protocol stays generic so others can implement).

Acceptance Criteria (open)

  • Design doc: protocol surface, lifecycle, failure model
  • Decision: converter location (vLLM-side vs Etha-side)
  • Upstream vLLM RFC or maintainer thread confirming extension point shape
  • Etha-side prototype implementing the protocol against an example vLLM build
