nstogner/megamon

MegaMon

MegaMon provides metrics related to running JobSets on top of Kubernetes.

MegaMon provides metrics at the container and node levels.

Multi Stack Metrics

MegaMon provides granular metrics about when a JobSet is Up. This allows for calculations such as MTBI (Mean Time Between Interruption).

Upness Metrics

MegaMon provides granular metrics about when a JobSet is Down. This allows for calculations such as MTTR (Mean Time To Recovery).

Downness Metrics
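
As a rough illustration of how MTBI and MTTR fall out of up/down timelines, the sketch below computes both from a list of intervals. The Interval type, the function names, and the definitions used (MTBI as total up time divided by the number of up-to-down transitions, MTTR as the mean length of down intervals that end in a recovery) are assumptions made for this example. They are not MegaMon's implementation or its metric names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Interval:
    """One contiguous stretch of a JobSet timeline (hypothetical type)."""
    start: float  # seconds since an arbitrary epoch
    end: float
    up: bool


def mtbi(intervals: List[Interval]) -> Optional[float]:
    """Total up time divided by the number of up -> down transitions."""
    up_time = sum(i.end - i.start for i in intervals if i.up)
    interruptions = sum(
        1 for prev, cur in zip(intervals, intervals[1:]) if prev.up and not cur.up
    )
    return up_time / interruptions if interruptions else None


def mttr(intervals: List[Interval]) -> Optional[float]:
    """Mean length of down intervals that sit between two up intervals.

    A leading down interval is skipped: it is time to first up, not a recovery.
    """
    recoveries = [
        cur.end - cur.start
        for prev, cur, nxt in zip(intervals, intervals[1:], intervals[2:])
        if prev.up and not cur.up and nxt.up
    ]
    return sum(recoveries) / len(recoveries) if recoveries else None


# Example timeline: three up periods separated by two interruptions.
timeline = [
    Interval(0, 3600, up=True),
    Interval(3600, 3900, up=False),
    Interval(3900, 11100, up=True),
    Interval(11100, 11700, up=False),
    Interval(11700, 15300, up=True),
]
print(mtbi(timeline))  # 7200.0
print(mttr(timeline))  # 450.0
```

Both functions return None when there is nothing to average, so a JobSet that was never interrupted does not produce a misleading zero.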

Why MegaMon over Kube State Metrics

MegaMon was created to address shortcomings of using kube-state-metrics:

  • No way to stop publishing a metric the moment a JobSet completes or fails.
  • Difficult or impossible to aggregate metrics across all JobSets when the timeline of each metric is not well defined.
  • Complexity in querying Node metrics (Node labels are published as a separate metric).
  • Difficulty in deriving the expected node count for a node pool.
  • Difficult to derive high-level metrics (like MTTR) when baseline metrics, such as the Up-ness of JobSet containers and nodes, require their own complex queries.
  • Difficult or impossible to derive metrics like Time-to-provisioning and Time-to-first-up with PromQL.
  • Current metrics are very large (every Node must be published as an individual metric and aggregated later).

Errata

  • If a JobSet has multiple TPU topologies, the label value will be the TPU topology of the last replicated job in the JobSet.
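
The "last one wins" behavior described above can be pictured with a small sketch. The dictionary layout, the tpu_topology field name, and the topology_label function are hypothetical and exist only for this illustration. How MegaMon actually reads the topology, and what happens when the last replicated job declares none, is not stated here. The sketch assumes jobs without a topology are ignored.

```python
from typing import Dict, List, Optional


def topology_label(replicated_jobs: List[Dict[str, str]]) -> Optional[str]:
    """Return the topology of the last replicated job that declares one."""
    label = None
    for job in replicated_jobs:
        topology = job.get("tpu_topology")  # hypothetical field name
        if topology:
            label = topology  # later jobs overwrite earlier ones
    return label


# A JobSet with two replicated jobs on different topologies (made-up values).
jobset = [
    {"name": "workers-a", "tpu_topology": "2x2x4"},
    {"name": "workers-b", "tpu_topology": "4x4x4"},
]
print(topology_label(jobset))  # 4x4x4
```

In this example the 2x2x4 topology of the first replicated job never appears in the label, which is the limitation the erratum describes.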

Runtime config

  • Set the log level with the -zap-log-level 3 flag on the manager binary.
  • Set "SliceEnabled" to support slice metrics.
  • Set "EnableSimulation" in the config file to run MegaMon without access to GKE and GCS.
    • GCS Mock: Data is stored in memory and will be lost on restart.
    • GKE Mock: Node pools are inferred from existing nodes using a naming convention ([nodepool_name]-n-n-n). Machine type is hardcoded to tpu7x-standard-4t and disk size to 100 GB.
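
To make the GKE Mock naming convention concrete, here is a minimal sketch of inferring node pools from node names. It assumes each "n" in [nodepool_name]-n-n-n is a non-empty dash-separated segment, and the function names and the shape of the returned dictionary are invented for this example. Only the machine type and disk size values come from the description above.

```python
from typing import Dict, List, Optional

# Values the GKE Mock is described as hardcoding.
MACHINE_TYPE = "tpu7x-standard-4t"
DISK_SIZE_GB = 100


def nodepool_name(node_name: str) -> Optional[str]:
    """Strip the trailing three dash-separated segments from a node name."""
    parts = node_name.rsplit("-", 3)
    if len(parts) != 4 or not all(parts):
        return None  # name does not follow [nodepool_name]-n-n-n
    return parts[0]


def infer_nodepools(node_names: List[str]) -> Dict[str, dict]:
    """Group nodes by inferred pool and attach the hardcoded attributes."""
    pools: Dict[str, dict] = {}
    for name in node_names:
        pool = nodepool_name(name)
        if pool is None:
            continue  # skip nodes that do not match the convention
        entry = pools.setdefault(
            pool,
            {"machine_type": MACHINE_TYPE, "disk_size_gb": DISK_SIZE_GB, "node_count": 0},
        )
        entry["node_count"] += 1
    return pools


# Made-up node names for illustration.
nodes = ["tpu-pool-a-0-0-1", "tpu-pool-a-0-0-2", "tpu-pool-b-1-0-0", "control"]
pools = infer_nodepools(nodes)
print(sorted(pools))                     # ['tpu-pool-a', 'tpu-pool-b']
print(pools["tpu-pool-a"]["node_count"])  # 2
```

Splitting from the right keeps dashes inside the pool name intact, so a pool called tpu-pool-a is recovered whole.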
