MegaMon provides metrics related to running JobSets on top of Kubernetes.
MegaMon provides metrics at the container and node levels.
MegaMon provides granular metrics about when a JobSet is Up. This allows for calculations such as MTBI (Mean Time Between Interruption).
MegaMon provides granular metrics about when a JobSet is Down. This allows for calculations such as MTTR (Mean Time To Recovery).
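To illustrate how Up/Down information leads to MTBI and MTTR, here is a minimal sketch. It does not use MegaMon's actual metric names or API. The `Interval` type, the `mtbi_and_mttr` helper, and the definitions used (total Up time divided by the number of Up-to-Down transitions, and total Down time divided by the number of Down-to-Up transitions) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Interval:
    """A contiguous period during which a JobSet was Up or Down (hypothetical type)."""
    start: float  # seconds since some reference point
    end: float
    up: bool


def mtbi_and_mttr(intervals: List[Interval]) -> Tuple[Optional[float], Optional[float]]:
    """Return (MTBI, MTTR) in seconds for a time-ordered list of intervals.

    Assumed definitions:
      MTBI = total Up time / number of Up -> Down transitions
      MTTR = total Down time / number of Down -> Up transitions
    Either value is None when no such transition was observed.
    """
    total_up = sum(i.end - i.start for i in intervals if i.up)
    total_down = sum(i.end - i.start for i in intervals if not i.up)

    interruptions = 0
    recoveries = 0
    # Compare each interval with its successor to count state changes.
    for prev, cur in zip(intervals, intervals[1:]):
        if prev.up and not cur.up:
            interruptions += 1
        elif not prev.up and cur.up:
            recoveries += 1

    mtbi = total_up / interruptions if interruptions else None
    mttr = total_down / recoveries if recoveries else None
    return mtbi, mttr


# Example timeline: Up 100s, Down 10s, Up 200s, Down 30s, Up 60s.
timeline = [
    Interval(0, 100, True),
    Interval(100, 110, False),
    Interval(110, 310, True),
    Interval(310, 340, False),
    Interval(340, 400, True),
]
print(mtbi_and_mttr(timeline))
```

In practice these values would more likely be derived from the published metrics with a query language. The sketch only shows why well-defined Up and Down periods make the calculation straightforward.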
MegaMon was created to address shortcomings of using kube-state-metrics:
- Lack of ability to stop publishing a metric the moment a JobSet completes/fails.
- Difficult or impossible to aggregate metrics across all JobSets if the timeline of the metrics is not well defined.
- Complexity in querying Node metrics (Node labels are a separate metric).
- Difficulty in deriving the expected node count for a node pool.
- Difficult to derive high-level metrics (like MTTR) when baseline metrics, like the Up-ness of JobSet containers and nodes, require their own complex queries.
- Difficult or impossible to derive metrics like Time-to-provisioning or Time-to-first-up with PromQL.
- Current metrics are very large (they require all Nodes to be published as individual metrics and aggregated later).
- If a JobSet has multiple TPU topologies, the label value will be the TPU topology of the last replicated job in the JobSet.
- Set log level via the `-zap-log-level 3` flag on the manager binary.
- Set "SliceEnabled" to support slice metrics.
- Set "EnableSimulation" in the config file to run MegaMon without access to GKE and GCS.
- GCS Mock: Data is stored in-memory and will be lost on restart.
- GKE Mock: Node pools are inferred from existing nodes using a naming convention (`[nodepool_name]-n-n-n`). Machine type is hardcoded to `tpu7x-standard-4t` and disk size to `100GB`.
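The node pool inference used by the GKE mock could look roughly like the sketch below. This is not MegaMon's implementation. It assumes the `-n-n-n` part of the convention means exactly three trailing hyphen-separated segments, and the function names `infer_node_pool` and `group_by_node_pool` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def infer_node_pool(node_name: str, suffix_parts: int = 3) -> Optional[str]:
    """Strip the last `suffix_parts` hyphen-separated segments from a node name.

    Assumption: node names follow `[nodepool_name]-n-n-n`, where the three
    trailing segments are not part of the node pool name.
    Returns None when the name is too short to match the convention.
    """
    parts = node_name.split("-")
    if len(parts) <= suffix_parts:
        return None
    return "-".join(parts[:-suffix_parts])


def group_by_node_pool(node_names: List[str]) -> Dict[str, List[str]]:
    """Group node names by their inferred node pool, skipping unmatched names."""
    pools: Dict[str, List[str]] = defaultdict(list)
    for name in node_names:
        pool = infer_node_pool(name)
        if pool is not None:
            pools[pool].append(name)
    return dict(pools)


nodes = ["tpu-pool-a-0-1-2", "tpu-pool-a-0-1-3", "cpu-pool-9-9-9", "odd"]
print(group_by_node_pool(nodes))
```

Node pool names may themselves contain hyphens, so the sketch strips segments from the right instead of splitting on the first hyphen.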



