[Feature]: Add MI300X as a GitHub runner/via SSH #2679

@v01dXYZ

What feature or enhancement are you proposing?

Add instances with a single MI300X GPU to the GitHub runner fleet.

The two providers with cheap single-GPU VMs (1.99 USD/hr) are:

  • DigitalOcean: GPUs are currently unavailable.
  • Hot Aisle: the one I used. It has a CLI client to provision and delete VMs (beware: if you provision and delete too many times, the CLI client stops deleting instances immediately and you have to use the admin panel instead).

The following script initializes the instance. It is complicated because it downloads PyTorch in the background and decompresses the heaviest ROCm debs in parallel while upgrading ROCm. It runs in about 5 minutes (about 15 cents).

set -e

function print_bar() {
    echo "####################################################################################################"
}
function print_between_bar() {
    print_bar
    echo "#  " "$@"
    print_bar
}

####################################################################################################
print_between_bar "Install uv"
####################################################################################################

curl -LsSf https://astral.sh/uv/install.sh | sh

export PATH=$HOME/.local/bin:$PATH

####################################################################################################
print_between_bar "Create venv and download pytorch in background (6GB)"
####################################################################################################

uv venv venv
. venv/bin/activate

torch_pyindex=https://download.pytorch.org/whl/rocm7.2
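# Run uv only long enough to log the wheel URL; grep -m1 closes the pipe at the first match.
torch_download_url=$(uv pip install --force-reinstall --no-deps torch -i "$torch_pyindex" -v 2>&1 | grep -m1 -o "https://.*torch-.*whl" &)
kill $(jobs -l | grep "uv pip install" | cut -f2 -d" ") || echo "no worry, uv killed by SIGPIPE"

sudo apt install -y aria2
# Download the wheel in the background while the rest of the setup runs.
aria2c -j8 -x8 --summary-interval=1 "$torch_download_url" &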

####################################################################################################
print_between_bar "Upgrade"
####################################################################################################

sudo apt update -y
sudo apt-mark hold linux-generic linux-libc-dev
sudo DEBIAN_FRONTEND=noninteractive apt -y upgrade
sudo apt install -y vim emacs clang liblz4-dev libssl-dev libzstd-dev aria2

####################################################################################################
print_between_bar "Install apt-fast"
####################################################################################################

# add-apt-repository is slow with IPv6 enabled, according to
# https://askubuntu.com/questions/1110626/add-apt-repository-command-is-suddenly-very-slow
sudo sysctl net.ipv6.conf.all.disable_ipv6=1
# -y skips the confirmation prompt, which would block an unattended run
sudo add-apt-repository -y ppa:apt-fast/stable
sudo apt-get update -y
sudo DEBIAN_FRONTEND=noninteractive apt-get -y install apt-fast
sudo sysctl net.ipv6.conf.all.disable_ipv6=0 # put it back as it was before
# _APTMGR="apt-get"

####################################################################################################
print_between_bar "Update ROCM to 7.2.1 (copied from rocm docs)"
#  (with ROCM 7.2.0, we get error messages about GPU staying in low power mode)
####################################################################################################

wget https://repo.radeon.com/amdgpu-install/7.2.1/ubuntu/jammy/amdgpu-install_7.2.1.70201-1_all.deb
sudo apt install -y ./amdgpu-install_7.2.1.70201-1_all.deb
sudo apt update -y
sudo apt install -y python3-setuptools python3-wheel
sudo usermod -a -G render,video $LOGNAME # Add the current user to the render and video groups

####################################################################################################
print_between_bar "Update ROCm with some parallel uncompression"
####################################################################################################

mkdir -p "$HOME/apt-cache-rocm"
sudo APTCACHE=$HOME/apt-cache-rocm _MAXNUM=20 _MAXCONPERSERV=40 _SPLITCON=32 apt-fast install -d -y rocm

rm -rf unpacked packed pkgs_links
mkdir pkgs_links/

cd $HOME/apt-cache-rocm/
for i in *.deb; do
    ln -s $HOME/apt-cache-rocm/$i $HOME/pkgs_links
done

cd $HOME

mkdir unpacked packed
pkgs_to_unpack=$(du -hd0 apt-cache-rocm/* | grep "\([0-9]\{3,\}M\|[^ ]\+G\)" | cut -f2 | cut -d/ -f2)

for pkg in $pkgs_to_unpack; do
    echo "Processing $pkg $(date)"
    rm pkgs_links/$pkg
    pkg=$pkg bash -c 'set -e;\
                      dpkg-deb -R {apt-cache-rocm,unpacked}/$pkg ;\
                      echo $pkg UNPACK DONE $(date) ;\
                      dpkg-deb -Znone -b {unpacked,packed}/$pkg ;\
                      echo $pkg DONE $(date) ;\
                      sleep 1 ;\
                      while ! sudo dpkg -i packed/$pkg 2>/dev/null; do sleep 1; done' &
done
sudo dpkg -i pkgs_links/*.deb &

wait

sudo dpkg --configure -a

####################################################################################################
print_between_bar "Upgrade again"
####################################################################################################

sudo apt-mark hold amdgpu-dkms
sudo DEBIAN_FRONTEND=noninteractive apt upgrade -y

####################################################################################################
print_between_bar "Install pytorch"
####################################################################################################

uv pip install ~/torch*.whl -i $torch_pyindex

####################################################################################################
print_between_bar "Clone and install genesis as editable"
####################################################################################################

git clone https://github.com/Genesis-Embodied-AI/genesis -j24
cd ~/genesis
uv pip install -e ".[dev]"

# disabled as normally you should build the wheel on another machine since LLVM is statically linked anyway
if false; then
    ####################################################################################################
    print_between_bar "Clone, build and install quadrants"
    ####################################################################################################

    git clone https://github.com/Genesis-Embodied-AI/quadrants --recurse-submodules -j24
    cd quadrants
    uv pip install --group dev
    uv pip install --group test

    uv pip install -r requirements_test_xdist.txt

    uv pip install pip

    # we put it in a shell script to easily call it later on
    echo 'QUADRANTS_CMAKE_ARGS="-DQD_WITH_AMDGPU=ON" ./build.py' >> build.sh
    chmod +x build.sh

    ./build.sh

    # install quadrants
    uv pip install dist/*.whl
fi
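
One caveat about the script above: it starts several background jobs (the aria2c download, the per-package repack loops, the bulk dpkg -i) and joins them with a single bare `wait`. A bare `wait` returns 0 even when one of those jobs failed, so `set -e` will not stop the script in that case. Below is a minimal sketch of a stricter join, assuming a POSIX-compatible shell; `run_bg`, `wait_all` and `bg_pids` are hypothetical helper names and are not part of the script above.

```shell
# PIDs of the jobs started through run_bg, space-separated.
bg_pids=""

# Start a command in the background and remember its PID.
run_bg() {
    "$@" &
    bg_pids="$bg_pids $!"
}

# Wait for every remembered PID; return 1 if at least one job failed.
wait_all() {
    wait_all_failed=0
    for wait_all_pid in $bg_pids; do
        # `wait PID` returns that job's exit status, unlike a bare `wait`.
        wait "$wait_all_pid" || wait_all_failed=1
    done
    bg_pids=""
    return "$wait_all_failed"
}
```

In the script this would probably mean starting the download with `run_bg aria2c ...` and calling `wait_all` where `wait` is called today, so that a failed download or a failed dpkg run aborts the setup instead of surfacing later.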

With ephemeral instances, the following SSH options can be useful, since every new instance comes with a new host key:

ssh  -o 'UserKnownHostsFile=/dev/null' \
  -o 'StrictHostKeyChecking=no' \
  -i <private_key> \
  hotaisle@$1
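
For repeated use, the same options could be wrapped in a small function. This is only a sketch: `ssh_ephemeral` and the `SSH_BIN` override are hypothetical names I made up for illustration, and the `hotaisle` user is taken from the command above.

```shell
# Hypothetical wrapper around the ssh call above.
# Usage: ssh_ephemeral <private_key> <host> [remote command...]
# SSH_BIN is an assumed override hook (e.g. SSH_BIN=echo for a dry run).
ssh_ephemeral() {
    ssh_key=$1
    ssh_host=$2
    shift 2
    # Host keys are neither stored nor verified: acceptable only for
    # throwaway VMs, since it removes protection against impersonation.
    # -i selects the private key (identity file) to authenticate with.
    ${SSH_BIN:-ssh} \
        -o UserKnownHostsFile=/dev/null \
        -o StrictHostKeyChecking=no \
        -i "$ssh_key" \
        "hotaisle@$ssh_host" "$@"
}
```

For example, `ssh_ephemeral ~/.ssh/id_ed25519 <instance_ip> rocm-smi` would run one command on a fresh instance, and setting `SSH_BIN=echo` first prints the argument list instead of connecting.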

Motivation

Add an ephemeral runner with an MI300X to the GitHub runner fleet.

Potential Benefit

Encourage developers to better target AMDGPU and improve performance on this platform as well. Allow a comparison with Nvidia GPUs and possibly advertise Genesis as cheaper to use than other engines with only Nvidia support.

What is the expected outcome of the implementation work?

  • Catch early anything that is incompatible with AMDGPU or causes drastic performance degradation
  • Speed and memory benchmarks to compare with Nvidia

Additional information

No response
