MegaMon

MegaMon provides metrics related to running JobSets on top of Kubernetes.

MegaMon provides metrics at the container and node levels.

MegaMon provides granular metrics about when a JobSet is Up. This allows for calculations such as MTBI (Mean Time Between Interruption).

MegaMon provides granular metrics about when a JobSet is Down. This allows for calculations such as MTTR (Mean Time To Recovery).

Why MegaMon over Kube State Metrics

MegaMon was created to address shortcomings of using kube-state-metrics:

- No way to stop publishing a metric the moment a JobSet completes or fails.
- Difficult or impossible to aggregate metrics across all JobSets when the timeline of the metrics is not well defined.
- Complexity in querying Node metrics (Node labels are a separate metric).
- Difficulty in deriving the expected node count for a node pool.
- Difficult to derive high-level metrics (like MTTR) when baseline metrics, like the Up-ness of JobSet containers and nodes, require their own complex queries.
- Difficult or impossible to derive metrics like Time-to-provisioning and Time-to-first-up with PromQL.
- Current metrics are very large (they require all Nodes to be published as individual metrics and aggregated later).

Errata

If a JobSet has multiple TPU topologies, the label value will be the TPU topology of the last replicated job in the JobSet.

Runtime config

- Set the log level via the -zap-log-level 3 flag on the manager binary.
- Set "SliceEnabled" to support slice metrics.
- Set "EnableSimulation" in the config file to run megamon without access to GKE and GCS.
  - GCS Mock: Data is stored in memory and will be lost on restart.
  - GKE Mock: Node pools are inferred from existing nodes using a naming convention ([nodepool_name]-n-n-n). Machine type is hardcoded to tpu7x-standard-4t and disk size to 100GB.
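The naming convention used by the GKE mock can be illustrated with a short Go sketch. It is an interpretation, not MegaMon's code: it assumes that [nodepool_name]-n-n-n means the node name is the node pool name followed by exactly three hyphen-separated suffix segments, and the function names nodePoolFromNodeName and groupByNodePool are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// nodePoolFromNodeName is a hypothetical helper. It assumes the node name is
// "<nodepool_name>-n-n-n", i.e. the pool name plus three trailing segments,
// and returns false when the name does not fit that shape.
func nodePoolFromNodeName(nodeName string) (string, bool) {
	parts := strings.Split(nodeName, "-")
	if len(parts) < 4 {
		return "", false
	}
	// Everything before the last three segments is treated as the pool name,
	// so pool names that themselves contain hyphens are preserved.
	pool := strings.Join(parts[:len(parts)-3], "-")
	if pool == "" {
		return "", false
	}
	return pool, true
}

// groupByNodePool groups node names by their inferred node pool and skips
// names that do not match the assumed convention.
func groupByNodePool(nodeNames []string) map[string][]string {
	pools := make(map[string][]string)
	for _, name := range nodeNames {
		if pool, ok := nodePoolFromNodeName(name); ok {
			pools[pool] = append(pools[pool], name)
		}
	}
	return pools
}

func main() {
	nodes := []string{"tpu-pool-a-0-1-2", "tpu-pool-a-0-1-3", "cpu-4-5-6", "bad-name"}
	for pool, members := range groupByNodePool(nodes) {
		fmt.Printf("%s: %d node(s)\n", pool, len(members))
	}
}
```

Under this assumption, a node named tpu-pool-a-0-1-2 maps to the node pool tpu-pool-a, while a name with fewer than four segments is ignored.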
