What feature or enhancement are you proposing?
Add instances with one MI300X GPU to the GitHub runner fleet.
The two providers with cheap single-GPU VMs (1.99 USD/hr) are:
- DigitalOcean: GPUs are currently unavailable
- Hot Aisle: the one I used. It has a CLI client to provision and delete VMs (beware: if you provision and delete too many times, the CLI client stops deleting instances immediately and you have to rely on the admin panel instead).
The following script initializes the instance. It is complicated because it downloads PyTorch in the background while ROCm is being upgraded, and it uncompresses some of the heavy ROCm debs in parallel. It takes about 5 minutes to run (roughly 15 cents).
set -e
function print_bar() {
echo "####################################################################################################"
}
function print_between_bar() {
print_bar
echo "# $*"
print_bar
}
####################################################################################################
print_between_bar "Install uv"
####################################################################################################
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH=$HOME/.local/bin:$PATH
####################################################################################################
print_between_bar "Create venv and download pytorch in background (6GB)"
####################################################################################################
uv venv venv
. venv/bin/activate
torch_pyindex=https://download.pytorch.org/whl/rocm7.2
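# run uv verbosely only to capture the wheel URL; grep -m1 exits at the first match
torch_download_url=$(uv pip install --force-reinstall --no-deps torch -i "$torch_pyindex" -v 2>&1 | grep -m1 -o "https://.*torch-.*whl" &)
kill $(jobs -l | grep "uv pip install" | cut -f2 -d" ") || echo "no worry, uv killed by SIGPIPE"
sudo apt install -y aria2
# download into $HOME, where the "Install pytorch" step looks for the wheel
aria2c -j8 -x8 --summary-interval=1 -d "$HOME" "$torch_download_url" &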
####################################################################################################
print_between_bar "Upgrade"
####################################################################################################
sudo apt update -y
sudo apt-mark hold linux-generic linux-libc-dev
sudo DEBIAN_FRONTEND=noninteractive apt -y upgrade
sudo apt install -y vim emacs clang liblz4-dev libssl-dev libzstd-dev aria2
####################################################################################################
print_between_bar "Install apt-fast"
####################################################################################################
# otherwise add-apt-repository slow according to
# https://askubuntu.com/questions/1110626/add-apt-repository-command-is-suddenly-very-slow
sudo sysctl net.ipv6.conf.all.disable_ipv6=1
sudo add-apt-repository -y ppa:apt-fast/stable
sudo apt-get update -y
sudo DEBIAN_FRONTEND=noninteractive apt-get -y install apt-fast
sudo sysctl net.ipv6.conf.all.disable_ipv6=0 # put it back as it was before
# _APTMGR="apt-get"
####################################################################################################
print_between_bar "Update ROCM to 7.2.1 (copied from rocm docs)"
# (with ROCM 7.2.0, we get error messages about GPU staying in low power mode)
####################################################################################################
wget https://repo.radeon.com/amdgpu-install/7.2.1/ubuntu/jammy/amdgpu-install_7.2.1.70201-1_all.deb
sudo apt install -y ./amdgpu-install_7.2.1.70201-1_all.deb
sudo apt update -y
sudo apt install -y python3-setuptools python3-wheel
sudo usermod -a -G render,video $LOGNAME # Add the current user to the render and video groups
####################################################################################################
print_between_bar "Update ROCm with some parallel uncompression"
####################################################################################################
mkdir -p "$HOME/apt-cache-rocm"
sudo APTCACHE=$HOME/apt-cache-rocm _MAXNUM=20 _MAXCONPERSERV=40 _SPLITCON=32 apt-fast install -d -y rocm
rm -rf unpacked packed pkgs_links
mkdir pkgs_links/
cd $HOME/apt-cache-rocm/
for i in *.deb; do
ln -s $HOME/apt-cache-rocm/$i $HOME/pkgs_links
done
cd $HOME
mkdir unpacked packed
pkgs_to_unpack=$(du -hd0 apt-cache-rocm/* | grep "\([0-9]\{3,\}M\|[^ ]\+G\)" | cut -f2 | cut -d/ -f2)
for pkg in $pkgs_to_unpack; do
echo "Processing $pkg $(date)"
rm pkgs_links/$pkg
pkg=$pkg bash -c 'set -e;\
dpkg-deb -R {apt-cache-rocm,unpacked}/$pkg ;\
echo $pkg UNPACK DONE $(date) ;\
dpkg-deb -Znone -b {unpacked,packed}/$pkg ;\
echo $pkg DONE $(date) ;\
sleep 1 ;\
while ! sudo dpkg -i packed/$pkg 2>/dev/null; do sleep 1; done' &
done
sudo dpkg -i pkgs_links/*.deb &
wait
sudo dpkg --configure -a
####################################################################################################
print_between_bar "Upgrade again"
####################################################################################################
sudo apt-mark hold amdgpu-dkms
sudo DEBIAN_FRONTEND=noninteractive apt upgrade -y
####################################################################################################
print_between_bar "Install pytorch"
####################################################################################################
uv pip install ~/torch*.whl -i $torch_pyindex
####################################################################################################
print_between_bar "Clone and install genesis as editable"
####################################################################################################
git clone https://github.com/Genesis-Embodied-AI/genesis -j24
cd ~/genesis
uv pip install -e .[dev]
# disabled as normally you should build the wheel on another machine since LLVM is statically linked anyway
if false; then
####################################################################################################
print_between_bar "Clone, build and install quadrants"
####################################################################################################
git clone https://github.com/Genesis-Embodied-AI/quadrants --recurse-submodules -j24
cd quadrants
uv pip install --group dev
uv pip install --group test
uv pip install -r requirements_test_xdist.txt
uv pip install pip
# we put it in a shell script to easily call it later on
echo 'QUADRANTS_CMAKE_ARGS="-DQD_WITH_AMDGPU=ON" ./build.py' >> build.sh
chmod +x build.sh
./build.sh
# install quadrants
uv pip install dist/*.whl
fi
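The dpkg retry loop in the script above spins until the dpkg lock is free, but it also spins forever (with stderr discarded) if a package fails to install for any other reason. A bounded variant is sketched below. The `retry` helper and the attempt count in the example are hypothetical and not part of the original script; treat it as one possible shape rather than a tested replacement.

```shell
# Hypothetical helper (not in the original script): run a command until it
# succeeds, but give up after a fixed number of attempts.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  local max_attempts
  local delay
  local attempt
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      # Report the failing command instead of looping silently.
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example with assumed values (300 attempts, 1 second apart):
#   retry 300 1 sudo dpkg -i "packed/$pkg"
```

With a bound in place, a genuinely broken package makes the background job fail visibly instead of hanging the final `wait`.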
With ephemeral instances, the following options could be useful:
ssh -o 'UserKnownHostsFile=/dev/null' \
-o 'StrictHostKeyChecking=no' \
-i <private_key> \
"hotaisle@$1"
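For repeated use, the options above can be wrapped in a small function. This is only a sketch: the name `ssh_ephemeral` and the `SSH_CMD` override (handy for dry runs) are assumptions, and the identity file is passed with ssh's lowercase `-i` option.

```shell
# Hypothetical wrapper: connect to a throwaway VM as the hotaisle user.
# Host keys are neither checked nor stored, since every new VM has a fresh one.
# Usage: ssh_ephemeral PRIVATE_KEY HOST [REMOTE_COMMAND...]
ssh_ephemeral() {
  local key
  local host
  key="$1"
  host="$2"
  shift 2
  # SSH_CMD is an assumed override; it defaults to the real ssh client.
  ${SSH_CMD:-ssh} \
    -o UserKnownHostsFile=/dev/null \
    -o StrictHostKeyChecking=no \
    -i "$key" \
    "hotaisle@$host" "$@"
}

# Example (key path and address are placeholders):
#   ssh_ephemeral ~/.ssh/id_ed25519 203.0.113.10 rocm-smi
```

Skipping host key verification is only reasonable here because the instances are short-lived and recreated often; it should not be carried over to long-lived machines.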
Motivation
Add an ephemeral runner with an MI300X to the GitHub runner fleet.
Potential Benefit
Encourage developers to better target AMDGPU and improve performance on this platform as well. Allow a comparison with Nvidia GPUs and possibly advertise Genesis as cheaper to use than other engines with only Nvidia support.
What is the expected outcome of the implementation work?
- Catch early anything that is incompatible with AMDGPU or causes drastic performance degradation
- Speed and memory benchmarks to compare with Nvidia
Additional information
No response