@shadychan commented Jan 15, 2026

Support multinode testing with JSON format output (`-j`), which fixes the following issues:

- Missing device info for remote nodes in the JSON output
- Unexpected local-node d2d bandwidth values, because the variable `localDevice` was not updated through MPI
$ mpirun -n 8 --npernode 4 -hostfile /opt/tiger/hostfile --mca orte_tmpdir_base /opt/tiger/openmpi_tmp --mca pml ob1 --mca btl ^openib,smcuda --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 nvbandwidth -t multinode_device_to_device_memcpy_write_ce -j
{
        "nvbandwidth" : 
        {
                "CUDA Runtime Version" : 13000,
                "Driver Version" : "580.105.08",
                "GPU Device list" : 
                [
                        "0: NVIDIA GB200 (00000008:06:00): (n179-067-085)",   <== missing device info of remote nodes
                        "1: NVIDIA GB200 (00000009:06:00): (n179-067-085)",
                        "2: NVIDIA GB200 (00000018:06:00): (n179-067-085)",
                        "3: NVIDIA GB200 (00000019:06:00): (n179-067-085)"
                ],
                "git_version" : "v0.8",
                "testcases" : 
                [
                        {
                                "bandwidth_description" : "memcpy CE GPU(row) <- GPU(column) bandwidth (GB/s)",
                                "bandwidth_matrix" : 
                                [
                                        [
                                                "N/A",
                                                "3231.73",  <== unexpected local node d2d bandwidth value
                                                "3235.38",
                                                "3232.94",
                                                "775.055",
                                                "775.195",
                                                "775.125",
                                                "775.125"
                                        ],
                                        [
                                                "3232.94",
                                                "N/A",
                                                "3232.94",
                                                "3232.94",
                                                "775.055",
        ...
}
- `MPI_ABORT` invoked in multinode broadcast testcases, and also when testing with the options `-j -p multinode`
$ mpirun -n 8 --npernode 4 -hostfile /opt/tiger/hostfile --mca orte_tmpdir_base /opt/tiger/openmpi_tmp --mca pml ob1 --mca btl '^openib,smcuda' --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 nvbandwidth -t multinode_device_to_device_broadcast_one_to_all_sm -j
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[n179-067-191:06592] 5 more processes have sent help message help-mpi-api.txt / mpi-abort
[n179-067-191:06592] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Signed-off-by: Shady Chan <chenyulin.shady@bytedance.com>
@shadychan

@ramasubramanianrahul @deepakcu @esitaridi
Please review, thanks.
