Add collective communication preflight environment check #18

Open
ghangz wants to merge 1 commit into MetaX-MACA:main from ghangz:mengz/mccl-preflight-env-check

ghangz commented Jun 10, 2026

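This change adds a preflight environment check for collective communication. In the collective communication testing and result-collation workflow, the relevant information was scattered and manual collation was costly; the check makes routine troubleshooting, verification, and result archiving more direct.

The implementation adds the corresponding tool and script logic while keeping existing usage unchanged as far as possible, to avoid affecting established workflows.

This branch has been verified in practice in a MetaX compute environment, all related checks passed, and it is now submitted for merging.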

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a new bash script, tools/check_mccl_env.sh, to verify the MCCL environment by checking for required paths, tools, and benchmark binaries. The review feedback recommends several improvements: renaming the status variable to exit_code to prevent potential shell conflicts, adding existence checks for library paths (lib and ompi/lib), and expanding the benchmark verification loop to include sendrecv_perf and alltoall_perf for completeness.

Comment thread tools/check_mccl_env.sh
Comment on lines +7 to +17
status=0
check_path() {
    local label="$1"
    local path="$2"
    if [[ -e "$path" ]]; then
        echo "[OK] ${label}: ${path}"
    else
        echo "[MISS] ${label}: ${path}"
        status=1
    fi
}

medium

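Using status as a variable name may be confused with built-in special variables in some shells (such as zsh) or in other contexts. Consider renaming it to something more descriptive, such as exit_code, to improve the script's robustness and readability.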

Suggested change
- status=0
- check_path() {
-     local label="$1"
-     local path="$2"
-     if [[ -e "$path" ]]; then
-         echo "[OK] ${label}: ${path}"
-     else
-         echo "[MISS] ${label}: ${path}"
-         status=1
-     fi
- }
+ exit_code=0
+ check_path() {
+     local label="$1"
+     local path="$2"
+     if [[ -e "$path" ]]; then
+         echo "[OK] ${label}: ${path}"
+     else
+         echo "[MISS] ${label}: ${path}"
+         exit_code=1
+     fi
+ }
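
As a variation on the rename, here is a minimal sketch (not part of the PR) in which check_path reports failure through its return status and the caller folds that into exit_code, so the function no longer writes to a global. The two demo paths are assumptions for illustration only; the second is a hypothetical path assumed not to exist.

```shell
#!/usr/bin/env bash
# Sketch: check_path signals a missing path via its return status;
# the caller aggregates the result into exit_code.
exit_code=0

check_path() {
    local label="$1"
    local path="$2"
    if [[ -e "$path" ]]; then
        echo "[OK] ${label}: ${path}"
    else
        echo "[MISS] ${label}: ${path}"
        return 1
    fi
}

# Demo calls with illustrative paths (assumptions, not taken from the PR).
check_path "root dir" "/" || exit_code=1
check_path "hypothetical missing path" "/nonexistent/mccl-preflight-demo" || exit_code=1

echo "exit_code=${exit_code}"
```

In the real script the final line would presumably remain exit "${exit_code}", as the last suggestion in this review proposes.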

Comment thread tools/check_mccl_env.sh
Comment on lines +19 to +20
check_path "MACA_PATH" "${MACA_PATH}"
check_path "mpirun" "${MACA_PATH}/ompi/bin/mpirun"

medium

In mccl.sh, running the tests depends on the library paths that LD_LIBRARY_PATH points to:

export LD_LIBRARY_PATH=${MACA_PATH}/lib:${MACA_PATH}/ompi/lib

To make sure the dynamic libraries can be loaded at run time, the preflight script should also check that these two library paths exist:

  • ${MACA_PATH}/lib
  • ${MACA_PATH}/ompi/lib

Suggested change
- check_path "MACA_PATH" "${MACA_PATH}"
- check_path "mpirun" "${MACA_PATH}/ompi/bin/mpirun"
+ check_path "MACA_PATH" "${MACA_PATH}"
+ check_path "MACA library path" "${MACA_PATH}/lib"
+ check_path "OMPI library path" "${MACA_PATH}/ompi/lib"
+ check_path "mpirun" "${MACA_PATH}/ompi/bin/mpirun"
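
A possible generalization, sketched under assumptions: rather than listing each library directory by hand, the preflight could walk a colon-separated path string (the same shape as the LD_LIBRARY_PATH value quoted above) and test each entry with -d, which is stricter than -e because it requires a directory. The function name check_lib_dirs is hypothetical and does not appear in the PR.

```shell
#!/usr/bin/env bash
# Sketch: verify that every entry of a colon-separated library path
# is an existing directory. check_lib_dirs is a hypothetical helper.
check_lib_dirs() {
    local lib_path="$1"
    local rc=0
    local dir
    local -a dirs
    # Split on ':' into an array without touching the global IFS.
    IFS=':' read -r -a dirs <<< "$lib_path"
    for dir in "${dirs[@]}"; do
        # Skip empty entries such as those produced by "a::b".
        [[ -z "$dir" ]] && continue
        if [[ -d "$dir" ]]; then
            echo "[OK] library dir: ${dir}"
        else
            echo "[MISS] library dir: ${dir}"
            rc=1
        fi
    done
    return "$rc"
}

# Possible usage inside the preflight script (assumes MACA_PATH is set):
#   check_lib_dirs "${MACA_PATH}/lib:${MACA_PATH}/ompi/lib" || exit_code=1
```

Passing the path string as an argument keeps the check independent of whatever LD_LIBRARY_PATH happens to be in the calling shell.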

Comment thread tools/check_mccl_env.sh
Comment on lines +24 to +26
for bench in all_reduce_perf all_gather_perf reduce_scatter_perf; do
    check_path "${bench}" "${TEST_DIR}/${bench}"
done

medium

In mccl.sh, when the argument is all, the list of benchmarks under test also includes sendrecv_perf and alltoall_perf:

BENCH_NAMES="all_reduce_perf all_gather_perf reduce_scatter_perf sendrecv_perf alltoall_perf"

To keep the preflight check complete, check_mccl_env.sh should cover these two benchmarks as well.

Suggested change
- for bench in all_reduce_perf all_gather_perf reduce_scatter_perf; do
-     check_path "${bench}" "${TEST_DIR}/${bench}"
- done
+ for bench in all_reduce_perf all_gather_perf reduce_scatter_perf sendrecv_perf alltoall_perf; do
+     check_path "${bench}" "${TEST_DIR}/${bench}"
+ done
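
One way to keep the two lists from drifting apart again, offered as a sketch rather than the PR's approach: drive the loop from a single BENCH_NAMES value (copied here from the mccl.sh line quoted above) and require each benchmark to be a regular, executable file instead of merely existing. The function name check_benchmarks is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check that every benchmark binary is present and executable.
# BENCH_NAMES mirrors the list quoted from mccl.sh in this review.
BENCH_NAMES="all_reduce_perf all_gather_perf reduce_scatter_perf sendrecv_perf alltoall_perf"

# check_benchmarks is a hypothetical helper; $1 is the benchmark directory.
check_benchmarks() {
    local test_dir="$1"
    local rc=0
    local bench
    # Unquoted on purpose: BENCH_NAMES is a space-separated list.
    for bench in ${BENCH_NAMES}; do
        if [[ -f "${test_dir}/${bench}" && -x "${test_dir}/${bench}" ]]; then
            echo "[OK] ${bench}: ${test_dir}/${bench}"
        else
            echo "[MISS] ${bench}: ${test_dir}/${bench}"
            rc=1
        fi
    done
    return "$rc"
}

# Possible usage inside the preflight script (assumes TEST_DIR is set):
#   check_benchmarks "${TEST_DIR}" || exit_code=1
```

Whether the list should live in one shared file sourced by both scripts is a design choice for the maintainers; this sketch only shows the loop.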

Comment thread tools/check_mccl_env.sh
    check_path "${bench}" "${TEST_DIR}/${bench}"
done

exit "${status}"

medium

To match the earlier change that renames the status variable to exit_code, this line needs the same update.

Suggested change
- exit "${status}"
+ exit "${exit_code}"
