Add collective communication log failure summary by ghangz · Pull Request #12 · MetaX-MACA/mccl_tests

ghangz · 2026-06-10T15:21:44Z

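This change adds a failure summary for collective communication logs. It addresses the problem that the relevant information in the collective communication testing and result-collation workflow is scattered and costly to collate by hand, so that routine troubleshooting, verification, and result archiving become more direct.

The implementation adds the corresponding tool/script logic together with matching tests, and keeps existing usage unchanged as far as possible so that established workflows are not affected.

This branch has been validated in a real run on the MetaX (沐曦) compute environment, and all related checks have passed. It is now submitted for merging.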

gemini-code-assist

Code Review

This pull request introduces a new utility script, summarize_failures.py, along with its corresponding unit tests, to extract and summarize failure lines from MCCL run logs. The code reviewer pointed out a potential memory issue (OOM) when processing very large log files because the script currently reads the entire file into memory at once. A suggestion was provided to stream the file line-by-line and limit the stored failures to 100 while still maintaining an accurate total failure count.

gemini-code-assist · 2026-06-10T15:22:26Z

+def summarize(path: Path) -> dict[str, object]:
+    failures = []
+    for line_no, line in enumerate(path.read_text(encoding="utf-8", errors="replace").splitlines(), start=1):
+        if FAILURE_RE.search(line):
+            failures.append({"line": line_no, "text": line.strip()})
+    return {"path": str(path), "failure_count": len(failures), "failures": failures[:100]}


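If the log file is very large (for example, a multi-GB collective communication log), path.read_text().splitlines() loads the entire file into memory at once and then splits it, which can cause excessive memory usage or even an OOM. Consider streaming the file line by line, keeping only the first 100 failure records while still counting the total number of failures. This improves memory efficiency and avoids the potential OOM.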

Suggested change

-def summarize(path: Path) -> dict[str, object]:
-    failures = []
-    for line_no, line in enumerate(path.read_text(encoding="utf-8", errors="replace").splitlines(), start=1):
-        if FAILURE_RE.search(line):
-            failures.append({"line": line_no, "text": line.strip()})
-    return {"path": str(path), "failure_count": len(failures), "failures": failures[:100]}
+def summarize(path: Path) -> dict[str, object]:
+    failures = []
+    failure_count = 0
+    with path.open(encoding="utf-8", errors="replace") as f:
+        for line_no, line in enumerate(f, start=1):
+            if FAILURE_RE.search(line):
+                failure_count += 1
+                if len(failures) < 100:
+                    failures.append({"line": line_no, "text": line.strip()})
+    return {"path": str(path), "failure_count": failure_count, "failures": failures}
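The suggestion depends on FAILURE_RE, which this thread does not show. To try the streaming approach in isolation, a minimal self-contained sketch might look like the following. The regular expression, the MAX_STORED_FAILURES constant, the limit parameter, the demo() helper, and the sample log lines are all assumptions made for illustration and are not taken from summarize_failures.py.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical pattern: the PR thread does not show how FAILURE_RE is defined.
FAILURE_RE = re.compile(r"\b(FAILED|ERROR)\b")

# Hypothetical name for the cap of 100 stored records used in the suggestion.
MAX_STORED_FAILURES = 100


def summarize(path: Path, limit: int = MAX_STORED_FAILURES) -> dict[str, object]:
    """Stream the log line by line, counting every failure but storing at most `limit`."""
    failures = []
    failure_count = 0
    with path.open(encoding="utf-8", errors="replace") as f:
        # Iterating the file object reads one line at a time instead of the whole file.
        for line_no, line in enumerate(f, start=1):
            if FAILURE_RE.search(line):
                failure_count += 1
                if len(failures) < limit:
                    failures.append({"line": line_no, "text": line.strip()})
    return {"path": str(path), "failure_count": failure_count, "failures": failures}


def demo() -> dict[str, object]:
    """Write a synthetic log with 150 failing lines and summarize it."""
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "mccl_run.log"
        lines = []
        for i in range(150):
            lines.append(f"rank {i}: all_reduce ok")
            lines.append(f"rank {i}: all_reduce FAILED")
        log.write_text("\n".join(lines) + "\n", encoding="utf-8")
        return summarize(log)


if __name__ == "__main__":
    result = demo()
    print(result["failure_count"], len(result["failures"]))
```

With the synthetic log above, the total count covers all 150 failing lines while only the first 100 records are stored. One behavioural difference may be worth checking before merging: str.splitlines() also breaks on characters such as form feed and vertical tab, whereas iterating a text-mode file splits only on newline sequences, so reported line numbers could differ between the two versions for logs that contain those characters.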

Add MCCL log failure summary tool

400395c

gemini-code-assist Bot reviewed Jun 10, 2026

ghangz wants to merge 1 commit into MetaX-MACA:main from ghangz:mengz/mccl-log-failure-summary
