Kubernetes deployment support #128

**Difficulty:** 🟣 Research / open-ended

**Scope:** TBD — needs a design proposal before implementation.

**Subsystems:** deployment / launch ([mstar/cli/](https://github.com/mstar-project/mstar/blob/main/mstar/cli/), [configs/](https://github.com/mstar-project/mstar/blob/main/configs/)) · [communication](https://github.com/mstar-project/mstar/blob/main/mstar/communication/communicator.py) (cross-node transport)

**Prerequisites:** Kubernetes operational experience. This is a good fit for a contributor with a strong k8s background, who would own the design.

### Goal

Make M\* deployable on Kubernetes. M\* is multi-process by design: an API server,
a conductor, and one worker per GPU communicate over ZMQ
([communicator.py](https://github.com/mstar-project/mstar/blob/main/mstar/communication/communicator.py), IPC/TCP today) and route
tensors over shared memory / TCP / RDMA. A Kubernetes deployment has to map that
topology onto pods and Services.

### This issue is "write the design first"

Because the deployment topology decisions here are far-reaching, **start with a
short design doc / RFC (Request for Comments — a written proposal circulated for
feedback before implementation) in the issue**, not a PR. Questions to answer:

- [ ] What's the pod topology? (One pod with all workers vs. one pod per worker;
      where do conductor and API server live?)
- [ ] How do components discover each other? Today endpoints are derived in
      [communicator.py](https://github.com/mstar-project/mstar/blob/main/mstar/communication/communicator.py) (`_endpoint`,
      `_tcp_port`) — k8s needs Services / DNS instead of `/tmp` IPC sockets.
- [ ] How are GPUs scheduled (device plugin, resource requests), and how does the
      per-GPU worker model map to pod GPU allocation?
- [ ] How does inter-node tensor transport (RDMA) work under k8s networking?
- [ ] How is it packaged? Container image, Helm chart vs. raw manifests, and
      health/readiness probes. A readiness handshake already exists: workers
      send a `SETUP_DONE` message from `Worker.run()` in
      [worker.py](https://github.com/mstar-project/mstar/blob/main/mstar/worker/worker.py) and the
      conductor gates on it.
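
To make the discovery question concrete, here is a minimal sketch of one possible answer, not a proposal for the final design. It assumes workers run as a StatefulSet behind a headless Service, so each pod gets the stable DNS name Kubernetes assigns in that setup. `K8sTarget`, `pod_dns_name`, `zmq_connect_endpoint`, and the names `mstar-worker` / `mstar-workers` are all hypothetical and do not exist in M\* today; the sketch does not reproduce `_endpoint` or `_tcp_port`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class K8sTarget:
    """Hypothetical description of where a set of M* processes runs."""
    statefulset: str                     # assumed StatefulSet name, e.g. "mstar-worker"
    service: str                         # assumed headless Service governing it
    namespace: str = "default"
    cluster_domain: str = "cluster.local"


def pod_dns_name(target: K8sTarget, ordinal: int) -> str:
    """Stable DNS name of StatefulSet pod number `ordinal` behind a headless Service."""
    if ordinal < 0:
        raise ValueError("ordinal must be non-negative")
    return (
        f"{target.statefulset}-{ordinal}.{target.service}."
        f"{target.namespace}.svc.{target.cluster_domain}"
    )


def zmq_connect_endpoint(
    target: K8sTarget,
    ordinal: int,
    port: int,
    same_pod: bool = False,
    ipc_dir: str = "/tmp",
) -> str:
    """Endpoint string for the *connecting* side of a ZMQ socket.

    Processes sharing a pod can keep using IPC sockets on a shared path;
    anything crossing a pod boundary goes over TCP to the pod's DNS name.
    The binding side would listen on something like "tcp://*:<port>".
    """
    if same_pod:
        # Hypothetical socket file naming; only the ipc:// scheme is fixed by ZMQ.
        return f"ipc://{ipc_dir}/{target.statefulset}-{ordinal}.sock"
    return f"tcp://{pod_dns_name(target, ordinal)}:{port}"


workers = K8sTarget(statefulset="mstar-worker", service="mstar-workers", namespace="inference")
print(zmq_connect_endpoint(workers, ordinal=2, port=5555))
print(zmq_connect_endpoint(workers, ordinal=0, port=5555, same_pod=True))
```

The `same_pod` switch mirrors the pod-topology question above: a single pod holding all workers can keep today's IPC path, while one pod per worker forces every link onto TCP and DNS.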

### Acceptance criteria (phase 1)

- A reviewed design doc covering the above.
- A minimal working single-node example (manifest or Helm chart) that serves one
  model, as a proof of concept, before tackling multi-node/RDMA.
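
As a starting point for the single-node proof of concept, the sketch below builds a one-pod manifest as a Python dict and prints it as JSON, which `kubectl apply -f -` accepts. It assumes the simplest topology (API server, conductor, and all workers in one container), the NVIDIA device plugin's `nvidia.com/gpu` resource, and a memory-backed `/dev/shm` for shared-memory tensor routing. The function `single_node_pod`, the image name, the launch command, port 8000, and the `/readyz` path are placeholders; M\* does not define any of them today.

```python
import json


def single_node_pod(
    image: str,
    launch_cmd: list[str],
    num_gpus: int,
    name: str = "mstar-poc",
    http_port: int = 8000,          # placeholder API port
    ready_path: str = "/readyz",    # placeholder readiness endpoint
) -> dict:
    """Build a Pod manifest that runs every M* process in one container."""
    if num_gpus < 1:
        raise ValueError("need at least one GPU for one worker")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "mstar",
                    "image": image,
                    "command": launch_cmd,
                    "ports": [{"name": "http", "containerPort": http_port}],
                    # One worker per GPU, so the pod requests all GPUs it will use.
                    "resources": {"limits": {"nvidia.com/gpu": str(num_gpus)}},
                    # Would report ready only after the conductor has seen
                    # SETUP_DONE from every worker (endpoint is an assumption).
                    "readinessProbe": {
                        "httpGet": {"path": ready_path, "port": "http"},
                        "periodSeconds": 5,
                    },
                    "volumeMounts": [{"name": "dshm", "mountPath": "/dev/shm"}],
                }
            ],
            # Memory-backed volume so shared-memory transport is not limited
            # by the container runtime's small default /dev/shm.
            "volumes": [{"name": "dshm", "emptyDir": {"medium": "Memory"}}],
        },
    }


pod = single_node_pod(
    image="example.invalid/mstar:dev",        # placeholder image
    launch_cmd=["<mstar launch command>"],    # placeholder, fill in the real CLI call
    num_gpus=2,
)
print(json.dumps(pod, indent=2))
```

Keeping everything in one container sidesteps service discovery entirely, which is why it suits phase 1; the design doc would still need to say whether that topology survives into the multi-node case.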

> _New to M\*? Skim [How it works](https://github.com/mstar-project/mstar/blob/main/README.md#how-it-works) and the [Contributing guide](https://github.com/mstar-project/mstar/blob/main/CONTRIBUTING.md) first._

