Add a multi-process hostfile generator for collective communication #13

Open
ghangz wants to merge 1 commit into MetaX-MACA:main from ghangz:mengz/mccl-mpi-hostfile-generator
@ghangz ghangz commented Jun 10, 2026

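This change adds a hostfile generator for multi-process collective communication runs. It addresses a gap in the collective communication testing and result-collection workflow, where the relevant information was scattered and had to be assembled by hand, and it makes routine troubleshooting, verification, and result archiving more direct.

The implementation adds the corresponding tool script and its tests, and keeps existing usage unchanged wherever possible so that current workflows are not affected.

This branch has been validated on a MetaX compute environment and all related checks have passed. It is now submitted for merging.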

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a script tools/make_hostfile.py and its corresponding unit tests to generate OpenMPI hostfiles. The review feedback suggests adding validation to ensure at least one valid host is provided to prevent generating an empty hostfile, and updating the import path in the test file to avoid ModuleNotFoundError when running tests from the repository root.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread tools/make_hostfile.py
Comment on lines +10 to +13
def render_hostfile(hosts: list[str], slots: int) -> str:
    if slots <= 0:
        raise ValueError("slots must be positive")
    return "\n".join(f"{host} slots={slots}" for host in hosts if host.strip()) + "\n"

medium

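When the hosts list is empty, or every host is an empty string (for example, the user mistakenly passes --host ""), render_hostfile returns content consisting only of a newline. This produces an invalid hostfile and leads to hard-to-diagnose errors when OpenMPI is run later.

After filtering out blank entries, validate the number of remaining hosts and raise ValueError if there are none.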

def render_hostfile(hosts: list[str], slots: int) -> str:
    if slots <= 0:
        raise ValueError("slots must be positive")
    valid_hosts = [host.strip() for host in hosts if host.strip()]
    if not valid_hosts:
        raise ValueError("At least one non-empty host must be provided")
    return "\n".join(f"{host} slots={slots}" for host in valid_hosts) + "\n"
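
The suggested validation could be covered by tests along the following lines. This is only a sketch: `render_hostfile` is copied inline from the suggestion so the example is self-contained, and the test class name and host names (`node1`, `node2`) are hypothetical, not taken from the PR's test file.

```python
import unittest


def render_hostfile(hosts: list[str], slots: int) -> str:
    """Inline copy of the suggested implementation (not imported from the PR)."""
    if slots <= 0:
        raise ValueError("slots must be positive")
    # Trim whitespace and drop entries that are blank after trimming.
    valid_hosts = [host.strip() for host in hosts if host.strip()]
    if not valid_hosts:
        raise ValueError("At least one non-empty host must be provided")
    return "\n".join(f"{host} slots={slots}" for host in valid_hosts) + "\n"


class RenderHostfileValidationTest(unittest.TestCase):  # hypothetical name
    def test_renders_one_line_per_host(self):
        # Surrounding whitespace is stripped from each host name.
        self.assertEqual(
            render_hostfile(["node1", " node2 "], 8),
            "node1 slots=8\nnode2 slots=8\n",
        )

    def test_rejects_empty_or_blank_hosts(self):
        # An empty list and all-blank lists should all be rejected.
        for hosts in ([], [""], ["  ", "\t"]):
            with self.subTest(hosts=hosts):
                with self.assertRaises(ValueError):
                    render_hostfile(hosts, 8)

    def test_rejects_non_positive_slots(self):
        with self.assertRaises(ValueError):
            render_hostfile(["node1"], 0)
```

Saved as a test module, this can be run with `python3 -m unittest`. Note that the suggested version also strips whitespace from each host name, which the original implementation did not do.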

@@ -0,0 +1,12 @@
import unittest

from make_hostfile import render_hostfile

medium

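Importing directly from make_hostfile raises ModuleNotFoundError when the tests are run from the repository root (for example with python3 -m unittest tools/test_make_hostfile.py or pytest), because the tools directory is not on sys.path.

Import through the full module path tools.make_hostfile instead, so that the tests run correctly from the repository root.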

Suggested change
-from make_hostfile import render_hostfile
+from tools.make_hostfile import render_hostfile
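
The import behavior described above can be reproduced in isolation with a sketch like the following. It builds a throwaway directory layout that mimics the repository (a `tools/make_hostfile.py` under a root directory) and checks which module name resolves when only the root is on `sys.path`. The helper `can_import`, the stub module body, and the empty `__init__.py` are assumptions of this sketch; the PR does not show whether `tools/` is a regular package.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def can_import(module_name: str, search_root: Path) -> bool:
    """Return True if module_name imports with search_root prepended to sys.path."""
    saved_path = list(sys.path)
    saved_modules = set(sys.modules)
    sys.path.insert(0, str(search_root))
    importlib.invalidate_caches()
    try:
        importlib.import_module(module_name)
        return True
    except ModuleNotFoundError:
        return False
    finally:
        # Restore the interpreter state so the probe leaves no side effects.
        sys.path[:] = saved_path
        for name in set(sys.modules) - saved_modules:
            del sys.modules[name]


with tempfile.TemporaryDirectory() as tmp:
    repo_root = Path(tmp)  # stands in for the repository root
    tools_dir = repo_root / "tools"
    tools_dir.mkdir()
    # Assumption: treat tools/ as a regular package for this sketch.
    (tools_dir / "__init__.py").write_text("")
    # Stub module; the real render_hostfile body is irrelevant here.
    (tools_dir / "make_hostfile.py").write_text(
        "def render_hostfile(hosts, slots):\n    return ''\n"
    )

    bare_import_ok = can_import("make_hostfile", repo_root)
    qualified_import_ok = can_import("tools.make_hostfile", repo_root)

print("from make_hostfile import ...:", bare_import_ok)
print("from tools.make_hostfile import ...:", qualified_import_ok)
```

With only the root directory on the search path, the bare module name does not resolve while the qualified `tools.make_hostfile` name does, which matches the reasoning behind the suggested change.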
