rocshmem dependencies #349

Closed
chivatam wants to merge 8 commits into gpu-mode:main from chivatam:main

@chivatam chivatam commented Sep 7, 2025

Description

Added rocSHMEM dependencies to the Dockerfile.

@msaroufim

@chivatam chivatam marked this pull request as draft September 7, 2025 14:29
@chivatam chivatam marked this pull request as ready for review September 7, 2025 14:29
@msaroufim

Member

Could you share a toy user submission as well using rocshmem. Just wanna get a sense of what things will look like e2e

@msaroufim

Member

Also @saienduri to sanity check

@chivatam

chivatam commented Sep 7, 2025

Author

Could you share a toy user submission as well using rocshmem. Just wanna get a sense of what things will look like e2e

import os
from typing import Any

from torch.utils.cpp_extension import load_inline


ROCSHMEM_INSTALL_DIR = os.environ.get("ROCSHMEM_INSTALL_DIR", "/opt/rocshmem")
OMPI_INSTALL_DIR = os.environ.get("OMPI_INSTALL_DIR", "/opt/openmpi")


EXT_NAME = "rocshmem_all2all_ext"

CUDA_SRC = r"""
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include <cstdlib>
#include <stdexcept>  // std::runtime_error
#include <string>     // std::string
#include <vector>

#include <hip/hip_runtime.h>
#include <roc_shmem.hpp>

namespace py = pybind11;

__global__ void all2all_kernel(int* symm, int npes) {
    if (threadIdx.x == 0) {
        int me = roc_shmem_my_pe();

        // Initialize local symmetric buffer
        for (int i = 0; i < npes; ++i) symm[i] = -1;
        roc_shmem_barrier_all();

        // Put my rank into every PE's symmetric buffer at index 'me'
        for (int dst = 0; dst < npes; ++dst) {
            roc_shmem_int_p(symm + me, me, dst);
        }
        roc_shmem_barrier_all();
    }
}

static void hip_check(hipError_t err, const char* where) {
    if (err != hipSuccess) {
        throw std::runtime_error(std::string("HIP error at ") + where + ": " + hipGetErrorString(err));
    }
}

void bind_and_init() {
    // Bind device based on rank
    int dev_count = 0;
    hip_check(hipGetDeviceCount(&dev_count), "hipGetDeviceCount");

    int rank = 0;
    if (const char* s = std::getenv("OMPI_COMM_WORLD_RANK")) {
        rank = std::atoi(s);
    }
    hip_check(hipSetDevice(dev_count == 0 ? 0 : (rank % dev_count)), "hipSetDevice");

    // Initialize rocSHMEM after device selection
    roc_shmem_init();
}

std::vector<int> run_all2all() {
    int me   = roc_shmem_my_pe();
    int npes = roc_shmem_n_pes();

    int* symm = (int*)roc_shmem_malloc(sizeof(int) * npes);
    if (!symm) throw std::runtime_error("roc_shmem_malloc failed");

    // Launch one-thread kernel to do the collective
    all2all_kernel<<<1, 1>>>(symm, npes);
    hip_check(hipDeviceSynchronize(), "hipDeviceSynchronize");

    // Copy local symmetric buffer back to host
    std::vector<int> out(npes, -1);
    hip_check(hipMemcpy(out.data(), symm, sizeof(int) * npes, hipMemcpyDeviceToHost), "hipMemcpy D2H");

    roc_shmem_free(symm);
    return out;
}

void finalize() {
    roc_shmem_finalize();
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("bind_and_init", &bind_and_init);
    m.def("run_all2all",  &run_all2all);
    m.def("finalize",     &finalize);
}
"""


def _build_ext():
    return load_inline(
        name=EXT_NAME,
        # load_inline requires cpp_sources; the bindings live in CUDA_SRC,
        # so the C++ side is left empty.
        cpp_sources="",
        cuda_sources=[CUDA_SRC],
        # `functions` is omitted on purpose: CUDA_SRC already defines
        # PYBIND11_MODULE, and listing functions would generate a second one.
        with_cuda=True,
        extra_cflags=["-std=c++17"],
        extra_cuda_cflags=["-std=c++17"],
        extra_include_paths=[f"{ROCSHMEM_INSTALL_DIR}/include"],
        extra_ldflags=[
            f"-L{ROCSHMEM_INSTALL_DIR}/lib", "-lrocshmem",
            f"-L{OMPI_INSTALL_DIR}/lib", "-lmpi",
            f"-Wl,-rpath,{ROCSHMEM_INSTALL_DIR}/lib:{OMPI_INSTALL_DIR}/lib",
        ],
        verbose=True,
    )


# --- Optional: type-compatible stub for the Python leaderboard pattern ---
def custom_kernel(data: Any):  # input_t -> output_t, toy no-op to fit signature
    return data


def _rank_and_world():
    r = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
    w = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))
    return r, w


if __name__ == "__main__":
    rank, world = _rank_and_world()
    ext = _build_ext()
    ext.bind_and_init()
    out = ext.run_all2all()
    print(f"Rank {rank}/{world} all2all -> {out}")
    ext.finalize()

I vibe-coded this, but a real submission is going to look similar to the HIP kernels in Python.
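
To make the intended result of the toy kernel concrete, here is a minimal pure-Python sketch that simulates the put pattern on the host. It assumes every PE runs the same loop and that the barriers order the writes as intended; `simulate_all2all` is a hypothetical helper for illustration and is not part of rocSHMEM.

```python
def simulate_all2all(npes: int) -> list[list[int]]:
    """Simulate the put pattern of the toy kernel for `npes` processing elements."""
    # One symmetric buffer per PE, initialised to -1 like the kernel does.
    buffers = [[-1] * npes for _ in range(npes)]
    # Each PE `me` writes its own rank into slot `me` of every PE's buffer.
    for me in range(npes):
        for dst in range(npes):
            buffers[dst][me] = me
    return buffers


if __name__ == "__main__":
    for rank, buf in enumerate(simulate_all2all(4)):
        print(f"Rank {rank}/4 all2all -> {buf}")
```

Under these assumptions, every rank should end up holding the full list of ranks, which is what a working run of the extension would be expected to print.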

@msaroufim

@saienduri

Contributor

Looks good to me. Starting a test docker build here to check status: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17545534459.

@chivatam

chivatam commented Sep 8, 2025

Author

ooo! Looks like there is some issue with UCX. I'll debug it today!

@chivatam

chivatam commented Sep 9, 2025

Author

@saienduri I made some changes but I'm not sure they work. Is there a way to test the workflow without approval? I don't have an MI300X to test on 😅

@msaroufim msaroufim requested a review from saienduri September 9, 2025 18:17
@saienduri

saienduri commented Sep 13, 2025

Contributor

Thanks, trying a build here now: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17701378282. You can also try building the Docker image locally to check whether the build passes.

@saienduri

Contributor

Cool, the build passed and a sanity test passed here: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17702258708
I was using this test payload: https://github.com/gpu-mode/discord-cluster-manager/blob/saienduri/fix-payload/scripts/github_test_payload.json
Can you also share a small payload for testing if rocshmem works before we merge this PR?

@chivatam

Author

@saienduri added one, lmk if it works!

@saienduri

Contributor

Hmm getting ValueError: Invalid language cpp (https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17714138325)

@msaroufim

msaroufim commented Sep 14, 2025

Member

You want the example working with load_inline in PyTorch

@chivatam

Author

done but idk if it works 😬

@msaroufim

Member

@saienduri can we test the provided payload example on the server directly? If it's fine then we should be good to merge

@saienduri

saienduri commented Sep 17, 2025

Contributor

ok running the payload in github actions yielded the following (https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17790562194):

"stdout": "=== ROCshmem PyTorch Inline Test ===\nROCshmem test failed: module 'torch.utils' has no attribute 'cpp_extension'\n"

I think it will be the same error on the server itself as well.
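
One plausible reading of this error, assuming the payload imported only the top-level package and then referenced `torch.utils.cpp_extension` as an attribute, is that Python does not bind a submodule on its parent package until the submodule itself is imported. The sketch below shows that behaviour with a throwaway package (`demo_pkg_for_import` is a hypothetical name) so it runs without PyTorch; for the real case the explicit form is `from torch.utils.cpp_extension import load_inline`.

```python
import importlib
import pathlib
import sys
import tempfile


def submodule_visibility() -> tuple[bool, bool]:
    """Return whether `pkg.sub` is reachable before and after importing it explicitly."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = pathlib.Path(tmp) / "demo_pkg_for_import"
        pkg_dir.mkdir()
        # The package's __init__ does not import its submodule.
        (pkg_dir / "__init__.py").write_text("")
        (pkg_dir / "sub.py").write_text("VALUE = 1\n")
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            pkg = importlib.import_module("demo_pkg_for_import")
            before = hasattr(pkg, "sub")  # only the package was imported
            importlib.import_module("demo_pkg_for_import.sub")
            after = hasattr(pkg, "sub")   # the import binds `sub` on the package
            return before, after
        finally:
            # Clean up so the demo leaves no trace in the interpreter.
            sys.path.remove(tmp)
            sys.modules.pop("demo_pkg_for_import.sub", None)
            sys.modules.pop("demo_pkg_for_import", None)


if __name__ == "__main__":
    print(submodule_visibility())
```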

@saienduri

Contributor

Pushed a commit to fix the import issue.
Latest issue after that @chivatam:

ROCshmem test failed: load_inline() got an unexpected keyword argument 'library_dirs'

@chivatam

Author

Ok, I'll test this on runpod and push a working version. Apologies for all the back and forth!

@danielhua23

Collaborator

@chivatam Hi, I don't have permission to push commits directly to your repo, so I corrected your payload below; you can refer to it. Just use extra_ldflags instead of library_dirs.

{
  "lang": "py",
  "sources": {
    "rocshmem_test.py": "import torch\nfrom torch.utils.cpp_extension import load_inline\nimport os\n\ndef test_rocshmem_compilation():\n    \"\"\"Test ROCshmem compilation using PyTorch's load_inline\"\"\"\n    \n    print(\"=== ROCshmem PyTorch Inline Test ===\")\n    \n    # C++ source code for ROCshmem test\n    cpp_source = \"\"\"\n    #include <rocshmem.hpp>\n    #include <iostream>\n    #include <torch/extension.h>\n    \n    void test_rocshmem() {\n        std::cout << \"Testing ROCshmem compilation...\" << std::endl;\n        \n        // Just test that we can compile and link with rocshmem\n        // Don't actually initialize since we may not have proper MPI setup\n        std::cout << \"ROCshmem headers included successfully!\" << std::endl;\n        std::cout << \"Compilation test passed!\" << std::endl;\n    }\n    \n    PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {\n        m.def(\"test_rocshmem\", &test_rocshmem, \"Test ROCshmem compilation\");\n    }\n    \"\"\"\n    \n    # Set up include paths and libraries\n    rocm_path = os.environ.get('ROCM_PATH', '/opt/rocm')\n    rocshmem_path = os.environ.get('ROCSHMEM_INSTALL_DIR', '/opt/rocshmem')\n    ompi_path = os.environ.get('OMPI_INSTALL_DIR', '/opt/openmpi')\n    \n    include_dirs = [\n        f\"{rocm_path}/include\",\n        f\"{rocshmem_path}/include\", \n        f\"{ompi_path}/include\"\n    ]\n    \n    library_dirs = [\n        f\"{rocm_path}/lib\",\n        f\"{rocshmem_path}/lib\",\n        f\"{ompi_path}/lib\"\n    ]\n    \n    libraries = [\n        \"rocshmem\",\n        \"mpi\", \n        \"amdhip64\",\n        \"hsa-runtime64\"\n    ]\n    \n    # Convert library directories to linker flags\n    ldflags = []\n    for lib_dir in library_dirs:\n        ldflags.append(f\"-L{lib_dir}\")\n    \n    # Convert library names to linker flags\n    for lib in libraries:\n        ldflags.append(f\"-l{lib}\")\n    \n    extra_cflags = [\n        \"-fgpu-rdc\",\n        \"-x\", \"hip\"\n    ] + [f\"-I{include_dir}\" for include_dir in include_dirs]\n    \n    extra_ldflags = [\n        \"-fgpu-rdc\",\n        \"--hip-link\"\n    ] + ldflags\n    \n    try:\n        # Use torch.utils.cpp_extension.load_inline to compile\n        rocshmem_module = load_inline(\n            name=\"rocshmem_test\",\n            cpp_sources=cpp_source,\n            extra_cflags=extra_cflags,\n            extra_ldflags=extra_ldflags,\n            verbose=True\n        )\n        \n        print(\"Compilation successful!\")\n        print(\"Linking successful!\")\n        \n        # Run the test\n        rocshmem_module.test_rocshmem()\n        \n        print(\"ROCshmem test completed successfully!\")\n        return True\n        \n    except Exception as e:\n        print(f\"ROCshmem test failed: {e}\")\n        return False\n\nif __name__ == \"__main__\":\n    test_rocshmem_compilation()"
  },
  "main": "rocshmem_test.py",
  "mode": "test"
}
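
Hand-escaping a Python file into a JSON string like the one above is error-prone, so here is a minimal sketch of generating the payload instead. It assumes the four keys shown above (`lang`, `sources`, `main`, `mode`) are all the runner needs; `build_payload` is a hypothetical helper, not part of the cluster manager.

```python
import json


def build_payload(filename: str, source: str, mode: str = "test") -> str:
    """Wrap a Python source string in the payload layout shown above."""
    payload = {
        "lang": "py",
        "sources": {filename: source},
        "main": filename,
        "mode": mode,
    }
    # json.dumps escapes the newlines and quotes in the embedded source.
    return json.dumps(payload, indent=2)


if __name__ == "__main__":
    demo_source = 'print("=== ROCshmem PyTorch Inline Test ===")\n'
    print(build_payload("rocshmem_test.py", demo_source))
```

In practice the `source` argument would be the contents of the test script read from disk, and the resulting string is what gets pasted into the workflow input.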

@danielhua23

Collaborator

@saienduri hi sai, could you pls replace the current one with mine above and trigger test again? Thanks

@msaroufim

Member

@danielhua23 just gave you write access as well

@saienduri

Contributor

Latest log @danielhua23:

fatal error: cannot open file '/opt/rocm/amdgcn/bitcode/ocml.bc': Opaque pointers are only supported in -opaque-pointers mode (Producer: 'LLVM18.0.0git' Reader: 'LLVM 14.0.0')\n1 error generated when compiling for gfx803.\nninja: build stopped: subcommand failed.\nROCshmem test failed: Error building extension 'rocshmem_test'\n"
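
The log says the ROCm bitcode was produced by LLVM 18 but read by LLVM 14, and that the build targeted gfx803. One possible explanation, which is an assumption and not something the log confirms, is that an older clang was picked up instead of the compiler shipped with ROCm. The sketch below pulls the two versions out of the message and lists which compilers are first on PATH; `llvm_versions` is a hypothetical helper.

```python
import re
import shutil
from typing import Optional

LOG_LINE = (
    "fatal error: cannot open file '/opt/rocm/amdgcn/bitcode/ocml.bc': "
    "Opaque pointers are only supported in -opaque-pointers mode "
    "(Producer: 'LLVM18.0.0git' Reader: 'LLVM 14.0.0')"
)


def llvm_versions(log_line: str) -> Optional[tuple[int, int]]:
    """Extract the (producer, reader) LLVM major versions from a bitcode error."""
    match = re.search(
        r"Producer: 'LLVM\s*(\d+)[^']*' Reader: 'LLVM\s*(\d+)[^']*'", log_line
    )
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    versions = llvm_versions(LOG_LINE)
    if versions is not None:
        producer, reader = versions
        print(f"bitcode produced by LLVM {producer}, read by LLVM {reader}")
    # Show which compilers are first on PATH; a system clang older than the
    # one bundled with ROCm would be consistent with the mismatch.
    for tool in ("hipcc", "clang", "clang++"):
        print(f"{tool}: {shutil.which(tool)}")
```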

@saienduri

saienduri commented Sep 22, 2025

Contributor

You can always trigger a run as you make changes like this (make sure to select the same branch and runner name):
You can put the contents of the whole json payload in the "Content of the user submission" and then click the "Run workflow" button.
Here is where you can trigger: https://github.com/gpu-mode/discord-cluster-manager/actions/workflows/amd_workflow.yml

After it runs, you can download the artifacts:
example: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17904251405
After downloading the artifacts, you can see the stdout and see if it is what you expect in the result.json.
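
As a rough sketch of that last step, the snippet below prints every stdout string found in a parsed result. The exact layout of result.json is not shown in this thread, so the code only assumes that the text lives under a key named "stdout" somewhere in the document; `find_stdout` and the nesting in the sample are hypothetical.

```python
import json
from typing import Any, Iterator


def find_stdout(node: Any) -> Iterator[str]:
    """Yield every string stored under a "stdout" key, at any nesting depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "stdout" and isinstance(value, str):
                yield value
            else:
                yield from find_stdout(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_stdout(item)


if __name__ == "__main__":
    # Hypothetical sample; a real run would parse the downloaded file with
    # json.load(open("result.json")) instead.
    sample = json.loads(
        '{"runs": {"test": {"stdout": "=== ROCshmem PyTorch Inline Test ===\\n"}}}'
    )
    for text in find_stdout(sample):
        print(text, end="")
```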

Also, if you have access to an MI3x server, you can pull this Docker image (docker pull ghcr.io/gpu-mode/amd-runner:saienduri-rocshmem) and try it out.

Just want to make sure I'm not slowing y'all down here :)

@danielhua23

Collaborator

Big thanks for your tutorials Sai, I will have a try!

@danielhua23

danielhua23 commented Sep 22, 2025

Collaborator

The new payload with the new Docker image currently works well on my local MI3x machines, but how do I trigger a job with a new image built from the new Dockerfile? I already pinged Sai; if anyone else has a solution, please chime in. Thanks!

@msaroufim

msaroufim commented Sep 22, 2025

Member

@danielhua23 for the Dockerfile, you can publish a new image here: https://github.com/gpu-mode/discord-cluster-manager/actions/workflows/publish_amd_docker.yml. Just link it to your branch; my understanding is that @saienduri's infra should pick it up automatically.

@danielhua23 danielhua23 mentioned this pull request Sep 23, 2025
@msaroufim

Member

This was fixed in #359 - thank you for the initial prototype @chivatam

@msaroufim msaroufim closed this Sep 24, 2025