adding IB checks #143

Open
luccabb wants to merge 8 commits into main from ib-fabric-checks

@luccabb luccabb commented May 8, 2026

Contributor

Summary

check-ib-counters.md: https://github.com/facebookresearch/gcm/blob/ib-fabric-checks/website/docs/GCM_Health_Checks/health_checks/check-ib/check-ib-counters.md

check-ib-port-errors.md: https://github.com/facebookresearch/gcm/blob/ib-fabric-checks/website/docs/GCM_Health_Checks/health_checks/check-ib/check-ib-port-errors.md

check-mlxcables.md: https://github.com/facebookresearch/gcm/blob/ib-fabric-checks/website/docs/GCM_Health_Checks/health_checks/check-ib/check-mlxcables.md

check-mlxlink.md: https://github.com/facebookresearch/gcm/blob/ib-fabric-checks/website/docs/GCM_Health_Checks/health_checks/check-ib/check-mlxlink.md

check-sm-status.md: https://github.com/facebookresearch/gcm/blob/ib-fabric-checks/website/docs/GCM_Health_Checks/health_checks/check-ib/check-sm-status.md

check-sm-status implements part of the fourth item from #103.

check-ufm-health.md: https://github.com/facebookresearch/gcm/blob/ib-fabric-checks/website/docs/GCM_Health_Checks/health_checks/check-ib/check-ufm-health.md

Test Plan

$ sudo health_checks check-ib check-mlxcables --sink=stdout test app
[{"node": "node.com", "cluster": "test", "derived_cluster": "test", "health_check": "check ib cable ddm", "type": "app", "result": 1, "_msg": "/dev/mst/mt_0_cable_0: mlxcables failed (rc=3)\n/dev/mst/mt_2_cable_0: mlxcables failed", "job_id": 0, "start_time": 1778544700.9887443, "end_time": 1778544701.0466714}]
$ sudo health_checks check-ib check-ib-counters --sink=stdout --log-folder=/tmp test app
[{"node":"node.com", "cluster":"test", "derived_cluster":"test", "health_check":"check ib counters", "type":"app", "result":0, "_msg":"check_ib_counters - 12 ports checked, 0 with errors, total_errors=0 | 'total_errors'=0;0;100 'ports_with_errors'=0 'ports_checked'=12\n | ...", "job_id":0,"start_time":1778546504.8381689, "end_time":1778546504.8888414}]
$ sudo health_checks check-ib check-sm-status --sink=stdout --log-folder=/tmp test app
[{"node": "node.com", "cluster": "test", "derived_cluster": "test", "health_check": "check ib sm status", "type": "app", "result": 0, "_msg": "SM reachable, state SMINFO_MASTER", "job_id": 0, "start_time": 1778546618.8319952, "end_time": 1778546619.0807354}]
$ health_checks check-ib check-ib-port-errors --sink=stdout test app
[{"node": "ufmnode.com", "cluster": "test", "derived_cluster": "test", "health_check": "check ib port errors", "type": "app", "result": 2, "_msg": "3 port error(s) detected: ...", "job_id": 0, "start_time": 1778540118.1727078, "end_time": 1778540124.216324}]
$ sudo health_checks check-ib check-ufm-health --sink=stdout --log-folder=/tmp test app
[{"node": "ufmnode.com", "cluster": "test", "derived_cluster": "test", "health_check": "check ib ufm health", "type": "app", "result": 0, "_msg": "No unhealthy ports reported by UFM.", "job_id": 0, "start_time": 1778541847.2844133, "end_time": 1778541847.284646}]
$ sudo health_checks check-ib check-mlxlink --sink=stdout --log-folder=/tmp test app
[{"node": "node.com", "cluster": "test", "derived_cluster": "test", "health_check": "check ib module health", "type": "app", "result": 2, "_msg": "4 issue(s) across 12 HCA(s): ...", "job_id": 0, "start_time": 1778554379.163512, "end_time": 1778554382.1345477}]
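
The `--sink=stdout` lines above are JSON arrays of result objects, so they can be checked mechanically. The following is a minimal sketch of how a reviewer might do that, not part of this PR. It assumes that `result` 0 means the check passed and any non-zero value needs attention (the messages above suggest this, but the PR does not state it), and that the quoted `'label'=value` pairs in the check-ib-counters message are Nagios-style performance data. The helper names `load_records`, `failing`, and `perfdata` are hypothetical.

```python
import json
import re

# Three output lines copied verbatim from the test plan above.
SM_STATUS_LINE = r"""[{"node": "node.com", "cluster": "test", "derived_cluster": "test", "health_check": "check ib sm status", "type": "app", "result": 0, "_msg": "SM reachable, state SMINFO_MASTER", "job_id": 0, "start_time": 1778546618.8319952, "end_time": 1778546619.0807354}]"""
MLXLINK_LINE = r"""[{"node": "node.com", "cluster": "test", "derived_cluster": "test", "health_check": "check ib module health", "type": "app", "result": 2, "_msg": "4 issue(s) across 12 HCA(s): ...", "job_id": 0, "start_time": 1778554379.163512, "end_time": 1778554382.1345477}]"""
COUNTERS_LINE = r"""[{"node":"node.com", "cluster":"test", "derived_cluster":"test", "health_check":"check ib counters", "type":"app", "result":0, "_msg":"check_ib_counters - 12 ports checked, 0 with errors, total_errors=0 | 'total_errors'=0;0;100 'ports_with_errors'=0 'ports_checked'=12\n | ...", "job_id":0,"start_time":1778546504.8381689, "end_time":1778546504.8888414}]"""


def load_records(line: str) -> list[dict]:
    """Decode one stdout line into a list of result dictionaries."""
    return json.loads(line)


def failing(records: list[dict]) -> list[dict]:
    """Return records whose result is non-zero (assumed to mean 'not OK')."""
    return [record for record in records if record["result"] != 0]


# Matches quoted labels followed by a numeric value, e.g. 'ports_checked'=12.
# Any ';warn;crit' thresholds after the value are ignored in this sketch.
PERFDATA = re.compile(r"'([^']+)'=([-+]?\d+(?:\.\d+)?)")


def perfdata(msg: str) -> dict[str, float]:
    """Extract 'label'=value pairs from a message string."""
    return {label: float(value) for label, value in PERFDATA.findall(msg)}


if __name__ == "__main__":
    records = []
    for line in (SM_STATUS_LINE, MLXLINK_LINE, COUNTERS_LINE):
        records.extend(load_records(line))

    for record in failing(records):
        print(record["health_check"], record["result"], record["_msg"])

    print(perfdata(load_records(COUNTERS_LINE)[0]["_msg"]))
```

Treating every non-zero `result` the same way avoids guessing what 1 versus 2 means; if the project documents a severity mapping, `failing` could be split accordingly.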

@meta-cla meta-cla Bot added the cla signed label May 8, 2026
github-actions Bot commented May 8, 2026

CI Commands

The following CI workflows run automatically on every push and pull request:

| Workflow | What it runs |
| --- | --- |
| GPU Cluster Monitoring Python CI | lint, tests, typecheck, format, deb build, pyoxidizer builds |
| Go packages CI | shelper tests, format, lint |

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

| Command | Description | Requires approval? |
| --- | --- | --- |
| /metaci tests | Runs Meta internal integration tests (pytest) | Yes: a maintainer must trigger the command and approve the deployment request |
| /metaci integration tests | Same as above (alias) | Yes |

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

meta-codesync Bot commented May 9, 2026

@luccabb has imported this pull request. If you are a Meta employee, you can view this in D104553982.

@luccabb luccabb changed the title from "Ib fabric checks" to "adding new IB checks" May 12, 2026
@luccabb luccabb changed the title from "adding new IB checks" to "adding IB checks" May 12, 2026
@luccabb luccabb marked this pull request as ready for review May 12, 2026 04:47
meta-cla Bot commented May 27, 2026

Hi @luccabb!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
