Add parsing of collective-communication performance output by ghangz · Pull Request #14 · MetaX-MACA/mccl_tests

ghangz · 2026-06-10T15:21:55Z

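This change adds parsing of collective-communication performance output. In the collective-communication test and result-collation workflow, the relevant information was scattered and manual collation was costly; the parser makes day-to-day troubleshooting, verification, and result archiving more direct.

The implementation adds the corresponding tool or script logic while keeping existing usage unchanged wherever possible, so that existing workflows are not affected.

This branch has been validated in practice on a MetaX compute environment, all related checks have passed, and it is now submitted for merging.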

gemini-code-assist

Code Review

This pull request introduces a new Python script, tools/parse_perf_output.py, to parse MCCL/NCCL performance output rows into JSON format. Feedback on the implementation points out that the current parsing logic fails when MCPTI is enabled because it does not handle lines starting with # and will silently parse incorrect columns due to an extra numeric column. A robust alternative using negative indexing is suggested to safely extract the correct fields.

gemini-code-assist · 2026-06-10T15:23:19Z

+    for line in text.splitlines():
+        line = line.lstrip("\ufeff")
+        parts = line.split()
+        if len(parts) < 6 or not parts[0].isdigit() or not parts[1].isdigit():
+            continue
+        numeric_tail: list[float] = []
+        for item in parts[2:]:
+            try:
+                numeric_tail.append(float(item))
+            except ValueError:
+                continue
+        if len(numeric_tail) < 3:
+            continue
+        rows.append(
+            {
+                "size_bytes": int(parts[0]),
+                "count": int(parts[1]),
+                "time_us": numeric_tail[1],
+                "algbw_gbps": numeric_tail[2],
+            }
+        )
+    return rows


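The current parsing logic has two fairly serious problems:

It cannot parse output produced with MCPTI enabled: when MCPTI is enabled, output rows begin with # (for example, #1048576 262144 ...), so parts[0].isdigit() returns False and the whole row is skipped.

It silently mis-parses data when MCPTI is enabled: MCPTI output contains an extra numeric avgtime column, so the numeric_tail filter takes avgtime as time_us and time_us as algbw_gbps, producing completely wrong data.

Suggested improvement:
Strip the leading # character first, then extract the required fields with negative indices counted from the right. The rightmost 8 columns are the same whether or not MCPTI is enabled (out_time, out_algbw, out_busbw, out_wrong, in_time, in_algbw, in_busbw, in_wrong), so the fields can be extracted precisely and safely. This handles both normal mode and MCPTI mode and avoids the fragile filtering logic.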
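The two problems can be reproduced with a small sketch. The function below re-creates the loop from the diff; the function name and both sample rows are hypothetical, and the MCPTI row layout (leading #, avgtime placed just before time) is an assumption built only from the description in this review, not from real MCPTI output.

```python
# Re-creation of the parsing loop from the diff, wrapped in a function.
# parse_with_numeric_tail is a hypothetical name used only for this sketch.
def parse_with_numeric_tail(text: str) -> list:
    rows = []
    for line in text.splitlines():
        line = line.lstrip("\ufeff")
        parts = line.split()
        if len(parts) < 6 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        numeric_tail = []
        for item in parts[2:]:
            try:
                numeric_tail.append(float(item))
            except ValueError:
                continue  # non-numeric columns such as "float" or "sum"
        if len(numeric_tail) < 3:
            continue
        rows.append(
            {
                "size_bytes": int(parts[0]),
                "count": int(parts[1]),
                "time_us": numeric_tail[1],
                "algbw_gbps": numeric_tail[2],
            }
        )
    return rows


# Hypothetical sample rows (values are made up for illustration).
# Columns: size count type redop root time algbw busbw wrong time algbw busbw wrong
NORMAL_ROW = "1048576 262144 float sum -1 45.12 23.24 40.67 0 44.80 23.41 40.96 0"
# Assumed MCPTI layout: leading '#', extra avgtime (46.00) before time.
MCPTI_ROW = "#1048576 262144 float sum -1 46.00 45.12 23.24 40.67 0 44.80 23.41 40.96 0"

normal = parse_with_numeric_tail(NORMAL_ROW)
mcpti = parse_with_numeric_tail(MCPTI_ROW)
mcpti_no_hash = parse_with_numeric_tail(MCPTI_ROW.lstrip("#"))

print(normal)         # correct fields for the normal row
print(mcpti)          # problem 1: the row is skipped because of the leading '#'
print(mcpti_no_hash)  # problem 2: avgtime lands in time_us, time in algbw_gbps
```

In this sketch the second problem only becomes visible once the leading # is removed, because otherwise the row never reaches the numeric_tail logic.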

    for line in text.splitlines():
        line = line.lstrip("\ufeff").lstrip("#").strip()
        parts = line.split()
        if len(parts) < 13 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        try:
            size_bytes = int(parts[0])
            count = int(parts[1])
            time_us = float(parts[-8])
            algbw_gbps = float(parts[-7])
        except ValueError:
            continue
        rows.append(
            {
                "size_bytes": size_bytes,
                "count": count,
                "time_us": time_us,
                "algbw_gbps": algbw_gbps,
            }
        )
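For reference, the suggested negative-indexing approach can be wrapped into a self-contained sketch and checked against both row shapes. The function name parse_perf_rows and the sample text are hypothetical; the MCPTI row (leading #, one extra avgtime column before time) is an assumption based on the review's description rather than captured output.

```python
import json


def parse_perf_rows(text: str) -> list:
    """Parse perf rows by indexing the fixed 8 rightmost columns from the end."""
    rows = []
    for line in text.splitlines():
        # Drop a BOM and the leading '#' that MCPTI rows are described as having.
        line = line.lstrip("\ufeff").lstrip("#").strip()
        parts = line.split()
        # 5 leading columns + 8 fixed trailing columns = at least 13 tokens.
        if len(parts) < 13 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        try:
            time_us = float(parts[-8])     # out_time
            algbw_gbps = float(parts[-7])  # out_algbw
        except ValueError:
            continue
        rows.append(
            {
                "size_bytes": int(parts[0]),
                "count": int(parts[1]),
                "time_us": time_us,
                "algbw_gbps": algbw_gbps,
            }
        )
    return rows


# Hypothetical sample: a header line, a normal row, and an assumed MCPTI row.
SAMPLE = """\
#       size  count  type  redop  root  time  algbw  busbw  #wrong  time  algbw  busbw  #wrong
1048576 262144 float sum -1 45.12 23.24 40.67 0 44.80 23.41 40.96 0
#1048576 262144 float sum -1 46.00 45.12 23.24 40.67 0 44.80 23.41 40.96 0
"""

parsed = parse_perf_rows(SAMPLE)
print(json.dumps(parsed, indent=2))
```

The header line also starts with #, but after stripping it the first token is not a digit string, so the isdigit() guard still skips it. Both data rows yield the same time_us and algbw_gbps because the extra column sits to the left of the 8 fixed columns.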

Add MCCL perf output parser

aa0751e

gemini-code-assist Bot reviewed Jun 10, 2026

ghangz wants to merge 1 commit into MetaX-MACA:main from ghangz:mengz/mccl-parse-perf-output
