add rocshmem support #359

Merged
msaroufim merged 5 commits into main from danie/rocshmem on Sep 24, 2025

Conversation

@danielhua23

Collaborator

Description

Derived from #349.

@github-actions


Coverage report

This PR does not seem to contain any modification to coverable code.

@danielhua23

danielhua23 commented Sep 23, 2025

Collaborator Author

@msaroufim @saienduri
I published a new Docker image here: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17933366222/job/50994656849
but whenever I trigger a job, it reports an unexpected error: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17935505394
One of the CI checks also failed.

Could you please take a look?

@msaroufim

Member

@danielhua23 this is not working quite yet, but I found an easier way to test your code. I triggered this action with a script: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17949472670. The workflow run is green, but run_result crashed with this error:

{"success": true, "error": "", "system": {"gpu": "AMD Instinct MI300X VF", "device_count": 1, "cpu": "INTEL(R) XEON(R) PLATINUM 8568Y+", "runtime": "ROCm", "platform": "Linux-6.1.0-35-amd64-x86_64-with-glibc2.35", "torch": "2.10.0.dev20250916+rocm6.3"}, "runs": {"test": {"start": "2025-09-23T14:29:37.617259", "end": "2025-09-23T14:29:46.220622", "compilation": null, "run": {"success": true, "passed": false, "command": "python rocshmem_test.py test /tmp/tmp3d5ku9j1", "stdout": "=== ROCshmem PyTorch Inline Test ===\n[1/2] clang++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=rocshmem_test -DTORCH_API_INCLUDE_EXTENSION_H -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/include/python3.10 -fPIC -std=c++17 -I/opt/rocm/include -I/home/runner/rocshmem/include/rocshmem -I/opt/openmpi/include -c /home/runner/.cache/torch_extensions/py310_cpu/rocshmem_test/main.cpp -o main.o -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -fPIC\nFAILED: [code=1] main.o \nclang++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=rocshmem_test -DTORCH_API_INCLUDE_EXTENSION_H -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/include/python3.10 -fPIC -std=c++17 -I/opt/rocm/include -I/home/runner/rocshmem/include/rocshmem -I/opt/openmpi/include -c /home/runner/.cache/torch_extensions/py310_cpu/rocshmem_test/main.cpp -o main.o -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -fPIC\n/home/runner/.cache/torch_extensions/py310_cpu/rocshmem_test/main.cpp:3:14: fatal error: 'rocshmem.hpp' file not found\n    #include <rocshmem.hpp>\n             ^~~~~~~~~~~~~~\n1 error generated.\nninja: build stopped: subcommand failed.\nROCshmem test failed: Error building extension 'rocshmem_test'\n", "stderr": "", "exit_code": 0, "duration": 8.593523531220853, "result": {}}, "profile": null}}}
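The result above reports "success": true and "exit_code": 0 even though "passed" is false and the build failed, which is why the workflow shows green. A minimal sketch of how such a result could be summarized is below. The helper name `summarize_run` and the abridged sample are illustrative assumptions, not part of the repository; only the field names are taken from the JSON above.

```python
def summarize_run(result: dict, name: str = "test") -> dict:
    """Summarize one run entry of a result shaped like the JSON above."""
    run = result["runs"][name]["run"]
    stdout = run.get("stdout", "")
    # Compiler diagnostics such as "fatal error: ..." contain "error:".
    error_lines = [line.strip() for line in stdout.splitlines() if "error:" in line]
    return {
        "harness_ok": result.get("success", False),  # the harness itself ran
        "exit_code": run.get("exit_code"),
        "passed": run.get("passed", False),          # the real pass/fail signal
        "errors": error_lines,
    }


# Abridged, hand-trimmed version of the result above (illustrative only).
sample = {
    "success": True,
    "error": "",
    "runs": {"test": {"run": {
        "success": True,
        "passed": False,
        "exit_code": 0,
        "stdout": "=== ROCshmem PyTorch Inline Test ===\n"
                  "main.cpp:3:14: fatal error: 'rocshmem.hpp' file not found\n"
                  "ninja: build stopped: subcommand failed.\n",
    }}},
}

summary = summarize_run(sample)
print(summary)
```

Checking "passed" rather than "success" or "exit_code" is the design choice here: in this log, only "passed" reflects the failed build.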

Here's an easier script for you to test stuff out

#!/usr/bin/env python3
"""Test script to trigger AMD workflow with ROCshmem payload"""

import base64
import json
import subprocess
import sys
import uuid
import zlib

def main():
    # Load the test payload
    with open('scripts/rocshmem_test_payload.json', 'r') as f:
        payload_dict = json.load(f)

    # Compress and encode the payload (same as GitHub launcher does)
    payload_json = json.dumps(payload_dict)
    compressed = zlib.compress(payload_json.encode('utf-8'))
    encoded = base64.b64encode(compressed).decode('utf-8')

    print(f"Original payload size: {len(payload_json)} bytes")
    print(f"Compressed size: {len(compressed)} bytes")
    print(f"Encoded size: {len(encoded)} bytes")

    # Generate a unique run ID
    run_id = str(uuid.uuid4())
    
    # Trigger the workflow using gh CLI
    cmd = [
        'gh', 'workflow', 'run', 'amd_workflow.yml',
        '--ref', 'danie/rocshmem',  # Your current branch
        '-f', f'run_id={run_id}',
        '-f', f'payload={encoded}',
        '-f', 'runner=amdgpu-mi300-x86-64'
    ]
    
    print(f"Run ID: {run_id}")
    
    print("\nTriggering workflow with command:")
    print(' '.join(cmd))
    
    result = subprocess.run(cmd, capture_output=True, text=True)
    
    if result.returncode == 0:
        print("\n✓ Workflow triggered successfully!")
        print("\nTo view the run status:")
        print("gh run list --workflow=amd_workflow.yml -L 1")
        print("\nTo watch the run:")
        # `gh run watch` has no --workflow flag; without a run ID it prompts for one
        print("gh run watch")
    else:
        print("\n✗ Failed to trigger workflow:")
        print(result.stderr)
        sys.exit(1)

if __name__ == '__main__':
    main()
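The script compresses the payload with zlib and then base64-encodes it. As a sanity check before triggering a run, the encoding can be reversed locally. The sketch below assumes the receiving side applies the inverse steps in the opposite order (base64 decode, then zlib decompress), which the thread does not show. The function names and payload fields are hypothetical.

```python
import base64
import json
import zlib


def encode_payload(payload: dict) -> str:
    """JSON -> zlib -> base64, mirroring the script above."""
    raw = json.dumps(payload).encode("utf-8")
    return base64.b64encode(zlib.compress(raw)).decode("utf-8")


def decode_payload(encoded: str) -> dict:
    """Assumed inverse: base64 -> zlib -> JSON."""
    raw = zlib.decompress(base64.b64decode(encoded))
    return json.loads(raw.decode("utf-8"))


# Hypothetical payload fields, for illustration only.
payload = {"lang": "py", "mode": "test"}
encoded = encode_payload(payload)
roundtrip = decode_payload(encoded)
print(roundtrip == payload)
```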

@danielhua23

Collaborator Author

This error is expected if the Docker image hasn't been rebuilt from my new Dockerfile. That's why I have been asking how to rebuild the image from the new Dockerfile in this PR and get it working with the test code, lol.
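One way to tell whether a runner is using the rebuilt image is to look for the header in the include directories that appear in the failing compile command. This is a rough sketch: the directory list is copied from the `-I` flags in the log above, while the helper `find_header` is a hypothetical name and not part of the repository.

```python
from pathlib import Path
from typing import Iterable, Optional

# Include directories copied from the -I flags in the failing clang++ command.
INCLUDE_DIRS = [
    "/opt/rocm/include",
    "/home/runner/rocshmem/include/rocshmem",
    "/opt/openmpi/include",
]


def find_header(name: str, include_dirs: Iterable[str]) -> Optional[Path]:
    """Return the first path where the header exists, or None if it is missing."""
    for directory in include_dirs:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None


found = find_header("rocshmem.hpp", INCLUDE_DIRS)
if found is None:
    print("rocshmem.hpp not found; the image may predate the new Dockerfile")
else:
    print(f"rocshmem.hpp found at {found}")
```

Running this inside the container before the test would separate an outdated image from a genuine build problem in the test code.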

@saienduri

Contributor

Can you try now @danielhua23? The test runner has to be manually updated whenever we switch the branch the Docker image is built from.

@msaroufim

Member

@saienduri I'm still seeing the same issue. Do you guys mind coordinating a fix synchronously? We're starting problem 3 soon and we still don't have support for this:

{"success": true, "error": "", "system": {"gpu": "AMD Instinct MI300X VF", "device_count": 1, "cpu": "INTEL(R) XEON(R) PLATINUM 8568Y+", "runtime": "ROCm", "platform": "Linux-6.1.0-35-amd64-x86_64-with-glibc2.35", "torch": "2.10.0.dev20250916+rocm6.3"}, "runs": {"test": {"start": "2025-09-23T16:55:33.158776", "end": "2025-09-23T16:55:41.875478", "compilation": null, "run": {"success": true, "passed": false, "command": "python rocshmem_test.py test /tmp/tmp6586x0u_", "stdout": "=== ROCshmem PyTorch Inline Test ===\n[1/2] clang++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=rocshmem_test -DTORCH_API_INCLUDE_EXTENSION_H -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/include/python3.10 -fPIC -std=c++17 -I/opt/rocm/include -I/home/runner/rocshmem/include/rocshmem -I/opt/openmpi/include -c /home/runner/.cache/torch_extensions/py310_cpu/rocshmem_test/main.cpp -o main.o -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -fPIC\nFAILED: [code=1] main.o \nclang++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=rocshmem_test -DTORCH_API_INCLUDE_EXTENSION_H -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/include/python3.10 -fPIC -std=c++17 -I/opt/rocm/include -I/home/runner/rocshmem/include/rocshmem -I/opt/openmpi/include -c /home/runner/.cache/torch_extensions/py310_cpu/rocshmem_test/main.cpp -o main.o -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -fPIC\n/home/runner/.cache/torch_extensions/py310_cpu/rocshmem_test/main.cpp:3:14: fatal error: 'rocshmem.hpp' file not found\n    #include <rocshmem.hpp>\n             ^~~~~~~~~~~~~~\n1 error generated.\nninja: build stopped: subcommand failed.\nROCshmem test failed: Error building extension 'rocshmem_test'\n", "stderr": "", "exit_code": 0, "duration": 8.706833689007908, "result": {}}, "profile": null}}}

@saienduri

Contributor

@saienduri saienduri self-requested a review September 24, 2025 05:20
@msaroufim

Member

Just kicked off CI again

@msaroufim msaroufim merged commit c613971 into main Sep 24, 2025
5 of 6 checks passed
@msaroufim msaroufim mentioned this pull request Sep 24, 2025
SinatrasC pushed a commit to SinatrasC/kernelbot that referenced this pull request Jun 17, 2026
* add rocshmem support

* pin torch version

* add back iris

* rm space

* Trigger CI

---------

Co-authored-by: Mark Saroufim <marksaroufim@meta.com>
3 participants