Difficulty: 🟣 Research / open-ended
Scope: TBD; needs a design proposal before implementation.
Subsystems: deployment / launch (mstar/cli/, configs/) · communication (cross-node transport)
Prerequisites: Kubernetes operational experience. This is a great fit for a contributor with a strong k8s background, who would own the design here.
Goal
Make M* deployable on Kubernetes. M* is multi-process by design: API server,
conductor, and one worker per GPU communicate over ZMQ
(communicator.py, today IPC/TCP) and route
tensors over shared memory / TCP / RDMA. A k8s story has to map that topology
onto pods/services.
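As one illustration of what that mapping could mean for the ZMQ layer, here is a minimal sketch of choosing an endpoint string per deployment mode. Every name in it (`zmq_endpoint`, the `mstar-conductor` Service, the `mstar` namespace, the port number, the `/tmp/mstar` socket directory) is hypothetical and not taken from communicator.py. It also assumes the default `cluster.local` cluster domain. It only shows an `ipc://` path being swapped for a `tcp://` address that a connecting socket could resolve through Kubernetes Service DNS.

```python
from typing import Optional


def zmq_endpoint(name: str, port: int, namespace: Optional[str] = None,
                 ipc_dir: str = "/tmp/mstar") -> str:
    """Return a ZMQ endpoint string for a named component (hypothetical helper).

    With no namespace, use a local IPC socket path (single host).
    With a namespace, use the in-cluster DNS name of a Kubernetes Service.
    """
    if namespace is None:
        return f"ipc://{ipc_dir}/{name}.sock"
    # In-cluster Service DNS form: <service>.<namespace>.svc.cluster.local
    # (assumes the default cluster domain).
    return f"tcp://{name}.{namespace}.svc.cluster.local:{port}"


print(zmq_endpoint("conductor", 5555))
print(zmq_endpoint("mstar-conductor", 5555, namespace="mstar"))
```

A DNS name like this is only usable on the connecting side; the binding side would still listen on a local interface inside its pod, which is one of the details the design doc would need to settle.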
This issue is "write the design first"
Because the deployment topology decisions here are far-reaching, start with a
short design doc / RFC (Request for Comments, a written proposal circulated for
feedback before implementation) in the issue, not a PR. Questions to answer:
- Where do the conductor and API server live?
- communicator.py (_endpoint, _tcp_port): k8s needs Services / DNS instead of /tmp IPC sockets.
- How the worker model (one worker per GPU) maps to pod GPU allocation.
- Probes: the readiness handshake already exists (workers send a SETUP_DONE
  message from Worker.run() in worker.py, and the conductor gates on it).
Acceptance criteria (phase 1)
- A reviewed design doc covering the above.
- A minimal working single-node example (manifest or Helm chart) that serves one
model, as a proof of concept, before tackling multi-node/RDMA.
New to M*? Skim How it works and the Contributing guide first.
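For the probes question, one option is to surface the existing handshake state so that a Kubernetes readiness probe can poll it. The sketch below is an assumption about shape only: `ReadinessGate`, `mark_setup_done`, and `http_status` are hypothetical names, not part of worker.py or the conductor. It shows a gate that reports ready once every expected worker has reported SETUP_DONE.

```python
import threading


class ReadinessGate:
    """Tracks which workers have reported SETUP_DONE (hypothetical helper)."""

    def __init__(self, expected_workers: int) -> None:
        self._expected = expected_workers
        self._done = set()
        self._lock = threading.Lock()

    def mark_setup_done(self, worker_id: int) -> None:
        # Called when a worker's SETUP_DONE message arrives; repeats are harmless.
        with self._lock:
            self._done.add(worker_id)

    def is_ready(self) -> bool:
        with self._lock:
            return len(self._done) >= self._expected

    def http_status(self) -> int:
        # 200 signals "ready for traffic"; 503 signals "not yet".
        return 200 if self.is_ready() else 503


gate = ReadinessGate(expected_workers=2)
gate.mark_setup_done(0)
print(gate.http_status())  # 503: one of two workers has reported
gate.mark_setup_done(1)
print(gate.http_status())  # 200: all workers have reported
```

An HTTP readiness probe treats any status from 200 to 399 as success, so a small endpoint returning `http_status()` would be enough for the single-node proof of concept. Whether that endpoint belongs in the conductor or the API server is a question for the design doc.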