Add collective communication performance threshold check #17

Open
ghangz wants to merge 1 commit into MetaX-MACA:main from ghangz:mengz/mccl-perf-threshold-check

@ghangz ghangz commented Jun 10, 2026

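This change adds a performance threshold check for collective communication. It addresses a gap in the collective communication testing and result-collation workflow, where the relevant information was not centralized and manual collation was costly, and it makes day-to-day troubleshooting, verification, and result archiving more direct.

On the implementation side, it adds the corresponding tool and script logic along with matching tests, while keeping existing usage unchanged wherever possible to avoid affecting current workflows.

This branch has been validated in a real MetaX compute environment, all related checks have passed, and it is now submitted for merging.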

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new Python utility, check_perf_thresholds.py, along with corresponding unit tests to validate MCCL performance JSON reports against minimum thresholds. The code review identifies three key issues: a potential AttributeError if the parsed JSON payload is a list rather than a dictionary, a potential KeyError if the size_bytes key is missing from a row, and the fact that the script always exits with a status code of 0 even when threshold checks fail, which prevents CI/CD pipelines from detecting failures.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.


def load_rows(path: Path) -> list[dict[str, float | int]]:
    payload = json.loads(path.read_text(encoding="utf-8"))
    rows = payload.get("rows", payload)

high

If the input JSON file is itself a list of rows (for example, [{"size_bytes": 1024, ...}]), payload will be a list object. Calling payload.get("rows", payload) then raises AttributeError: 'list' object has no attribute 'get'.

Check whether payload is a dict before calling .get().

Suggested change
-    rows = payload.get("rows", payload)
+    rows = payload.get("rows", payload) if isinstance(payload, dict) else payload
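The two report shapes discussed in this comment can be handled in one place. The following is a minimal sketch, not the PR's code: `extract_rows` is a hypothetical helper name, and rejecting a dict without a usable "rows" list is an assumption about the desired behavior (the original falls back to the dict itself).

```python
import json
from typing import Any


def extract_rows(payload: Any) -> list[dict[str, Any]]:
    """Return the row list from either {"rows": [...]} or a bare [...] payload."""
    if isinstance(payload, dict):
        # Dict-shaped report: look for the "rows" key, as the original code does.
        rows = payload.get("rows", payload)
    else:
        # List-shaped report: the payload is already the row list.
        rows = payload
    if not isinstance(rows, list):
        # Assumption: anything that is not a list of rows is treated as an error.
        raise ValueError(f"expected a list of rows, got {type(rows).__name__}")
    return rows


wrapped = extract_rows(json.loads('{"rows": [{"size_bytes": 1024}]}'))
bare = extract_rows(json.loads('[{"size_bytes": 1024}]'))
```

Both calls yield the same row list, so later code would not need to know which shape the report file used.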

) -> dict[str, object]:
    failures = []
    for row in rows:
        size_bytes = int(row["size_bytes"])

medium

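The evaluate function reads the value directly with row["size_bytes"]. If a row in the input JSON lacks the "size_bytes" key, the program raises KeyError and crashes.

Consider using row.get("size_bytes") with a default value to make the code more robust.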

Suggested change
-        size_bytes = int(row["size_bytes"])
+        size_bytes = int(row.get("size_bytes", 0))
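The difference between the two lookups can be seen on a small input. This is an illustrative sketch only: `size_of` is a hypothetical helper, and the second row's `algbw_gbps` field name is an assumed example. Whether 0 is a suitable default depends on how evaluate uses the value, which is not shown in this excerpt.

```python
def size_of(row: dict, default: int = 0) -> int:
    """Read size_bytes from a row, falling back to `default` when the key is absent."""
    return int(row.get("size_bytes", default))


rows = [{"size_bytes": "1024"}, {"algbw_gbps": 12.5}]  # second row lacks size_bytes

# With .get() and a default, the malformed row yields 0 instead of crashing.
sizes = [size_of(r) for r in rows]

# With direct indexing, the same row raises KeyError.
try:
    int(rows[1]["size_bytes"])
    raised = False
except KeyError:
    raised = True
```

Note that a default silently turns a missing key into a size of 0, so a malformed row is no longer reported as an error.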

Comment on lines +47 to +61
    text = json.dumps(
        evaluate(
            load_rows(args.report),
            min_algbw_gbps=args.min_algbw_gbps,
            max_time_us=args.max_time_us,
            min_size_bytes=args.min_size_bytes,
        ),
        indent=2,
    )
    if args.output:
        args.output.parent.mkdir(parents=True, exist_ok=True)
        args.output.write_text(text + "\n", encoding="utf-8")
    else:
        print(text)
    return 0

medium

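Currently main() always returns 0, whether or not the performance threshold check passed. A threshold-checking tool usually returns a non-zero status code (for example, 1) when the check fails (that is, when there are failures), so that CI/CD pipelines or automation scripts can catch the failure directly.

Consider deriving the return value from the result of evaluate.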

    result = evaluate(
        load_rows(args.report),
        min_algbw_gbps=args.min_algbw_gbps,
        max_time_us=args.max_time_us,
        min_size_bytes=args.min_size_bytes,
    )
    text = json.dumps(result, indent=2)
    if args.output:
        args.output.parent.mkdir(parents=True, exist_ok=True)
        args.output.write_text(text + "\n", encoding="utf-8")
    else:
        print(text)
    return 0 if result["passed"] else 1
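To show why the return value matters to a caller, here is a small sketch of how a pipeline step would observe the exit status. It does not run the real script: `exit_code_for` and `run_child` are hypothetical names, the child process is a stand-in that only exits with the mapped code, and the result dict shape (a "passed" flag plus a "failures" list) is assumed from the snippets in this review.

```python
import subprocess
import sys


def exit_code_for(result: dict) -> int:
    """Map a result dict to a process exit status: 0 on pass, 1 on failure."""
    return 0 if result.get("passed") else 1


def run_child(result: dict) -> int:
    """Spawn a stand-in child process that exits with the mapped status code."""
    code = exit_code_for(result)
    proc = subprocess.run([sys.executable, "-c", f"import sys; sys.exit({code})"])
    # This return code is what a CI step or shell `&&` chain would check.
    return proc.returncode


passing = run_child({"passed": True, "failures": []})
failing = run_child({"passed": False, "failures": [{"size_bytes": 1024}]})
```

A script that always exits with 0 would make both calls indistinguishable to the caller, which is the situation this comment describes.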
