Kubernetes deployment support #128

@NSagan271

Description

Difficulty: 🟣 Research / open-ended

Scope: TBD. Needs a design proposal before implementation.

Subsystems: deployment / launch (mstar/cli/, configs/) · communication (cross-node transport)

Prerequisites: Kubernetes operational experience. This is a great fit for a contributor with a strong k8s background, who would own the design here.

Goal

Make M* deployable on Kubernetes. M* is multi-process by design: an API server,
a conductor, and one worker per GPU communicate over ZMQ
(communicator.py, today IPC/TCP) and route
tensors over shared memory / TCP / RDMA. A k8s story has to map that topology
onto pods/services.
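
As a rough illustration of the mapping problem, the sketch below shows how one logical component name could resolve either to a ZMQ IPC endpoint (single host) or to a ZMQ TCP endpoint addressed through cluster DNS. Everything here is an assumption for discussion: `K8sTarget`, `resolve_endpoint`, the Service and pod names, the port, and the default `cluster.local` cluster domain are hypothetical and are not taken from communicator.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class K8sTarget:
    """Where a component lives in the cluster (all fields are illustrative)."""
    service: str               # headless Service name, e.g. "mstar-workers"
    namespace: str             # Kubernetes namespace
    port: int                  # TCP port the ZMQ socket binds to
    pod: Optional[str] = None  # StatefulSet pod name, e.g. "mstar-worker-0"


def ipc_endpoint(name: str, socket_dir: str = "/tmp") -> str:
    """Single-host form: a ZMQ IPC endpoint backed by a Unix socket file."""
    return f"ipc://{socket_dir}/{name}.sock"


def k8s_tcp_endpoint(target: K8sTarget) -> str:
    """Cross-pod form: a ZMQ TCP endpoint addressed by cluster DNS."""
    # Assumes the default cluster domain "cluster.local".
    host = f"{target.service}.{target.namespace}.svc.cluster.local"
    if target.pod is not None:
        # StatefulSet pods behind a headless Service get per-pod DNS records.
        host = f"{target.pod}.{host}"
    return f"tcp://{host}:{target.port}"


def resolve_endpoint(name: str, target: Optional[K8sTarget] = None) -> str:
    """Use IPC when no cluster target is given, otherwise DNS-based TCP."""
    return ipc_endpoint(name) if target is None else k8s_tcp_endpoint(target)
```

With no target, `resolve_endpoint("conductor")` yields an `ipc:///tmp/...` address as on a single machine today; with a `K8sTarget`, the same call yields a `tcp://` address that only works once the corresponding Service exists. Whether workers are addressed per pod or behind one Service is exactly the kind of decision the design doc should settle.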

This issue is "write the design first"

Because the deployment topology decisions here are far-reaching, start with a
short design doc / RFC (Request for Comments: a written proposal circulated for
feedback before implementation) in the issue, not a PR. Questions to answer:

  • What's the pod topology? (One pod with all workers vs. one pod per worker;
    where do conductor and API server live?)
  • How do components discover each other? Today endpoints are derived in
    communicator.py (_endpoint, _tcp_port); k8s needs Services / DNS instead
    of /tmp IPC sockets.
  • GPU scheduling (device plugin, resource requests), and how the per-GPU
    worker model maps to pod GPU allocation.
  • Inter-node tensor transport (RDMA) under k8s networking.
  • Packaging: container image, Helm chart vs. raw manifests, health/readiness
    probes (the readiness handshake exists: workers send a SETUP_DONE
    message from Worker.run() in worker.py and the conductor gates on it).
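
One way to think about the readiness question: the existing SETUP_DONE handshake could feed a small gate that a Kubernetes readiness probe queries. The sketch below is a hypothetical helper, not code from worker.py or the conductor; the class name, the integer worker ids, and the idea of exposing the result over HTTP are all assumptions.

```python
import threading
from typing import Iterable, Set


class ReadinessGate:
    """Tracks which workers have reported SETUP_DONE (hypothetical helper)."""

    def __init__(self, expected_workers: Iterable[int]) -> None:
        self._expected: Set[int] = set(expected_workers)
        self._done: Set[int] = set()
        self._lock = threading.Lock()

    def mark_setup_done(self, worker_id: int) -> None:
        """Record a SETUP_DONE message from one worker."""
        if worker_id not in self._expected:
            raise ValueError(f"unexpected worker id: {worker_id}")
        with self._lock:
            self._done.add(worker_id)

    def pending(self) -> Set[int]:
        """Workers that have not yet reported SETUP_DONE."""
        with self._lock:
            return self._expected - self._done

    def is_ready(self) -> bool:
        return not self.pending()

    def probe_status(self) -> int:
        """HTTP status a readiness endpoint could return: 200 or 503."""
        return 200 if self.is_ready() else 503
```

Kubernetes treats an HTTP probe response with status 200 to 399 as success, so returning 503 until every worker has reported would keep the pod out of Service endpoints while models are still loading.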

Acceptance criteria (phase 1)

  • A reviewed design doc covering the above.
  • A minimal working single-node example (manifest or Helm chart) that serves one
    model, as a proof of concept, before tackling multi-node/RDMA.
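
For the single-node proof of concept, a starting point might look like the sketch below, which builds a one-pod manifest as a Python dict (kubectl accepts JSON manifests as well as YAML). It assumes the "one pod with all workers" layout, the NVIDIA device plugin's `nvidia.com/gpu` resource name, a hypothetical `/readyz` HTTP path, and an image whose entrypoint launches M*; the pod name, image name, port, and path are placeholders, not part of M*.

```python
import json
from typing import Any, Dict


def single_node_pod(image: str, num_gpus: int, http_port: int = 8000) -> Dict[str, Any]:
    """Build a one-pod manifest holding every M* process (illustrative only)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "mstar-single-node", "labels": {"app": "mstar"}},
        "spec": {
            "containers": [{
                "name": "mstar",
                "image": image,
                "ports": [{"containerPort": http_port}],
                # Extended resources such as GPUs are requested under limits.
                "resources": {"limits": {"nvidia.com/gpu": num_gpus}},
                # Hypothetical HTTP readiness path served by the API server.
                "readinessProbe": {
                    "httpGet": {"path": "/readyz", "port": http_port},
                    "periodSeconds": 5,
                },
                # Memory-backed /dev/shm for shared-memory tensor transport.
                "volumeMounts": [{"name": "dshm", "mountPath": "/dev/shm"}],
            }],
            "volumes": [{"name": "dshm", "emptyDir": {"medium": "Memory"}}],
        },
    }


def to_json(manifest: Dict[str, Any]) -> str:
    """Serialize the manifest so it can be piped to `kubectl apply -f -`."""
    return json.dumps(manifest, indent=2)
```

Calling `to_json(single_node_pod("example.registry/mstar:dev", num_gpus=2))` produces a manifest requesting two GPUs in a single container. The memory-backed `/dev/shm` volume is included because container runtimes typically give pods a small default shared-memory segment, which may be too small if tensors are routed over shared memory.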

New to M*? Skim How it works and the Contributing guide first.

Metadata

Assignees

No one assigned

Labels

open ended: This issue requires careful scoping and design