[Issue]: FA2 tests fail when building the CK/Triton backend on top of an official ROCm Docker image. #143

@DerienFe

Problem Description

I'm testing an LLM training script on 8 MI300X GPUs, but training fails with unexplained loss spikes followed by NaNs. Because the problem reappears at the same point on every restart, it is likely a hardware issue or a bug in lower-level code rather than a random fluctuation.
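To pin down exactly where a run like this goes wrong, it can help to record the first step at which the loss spikes or becomes non-finite. The following is only a minimal sketch, not code from the training script: the function name `first_anomaly`, the `window` and `spike_factor` thresholds, and the loss trace are all hypothetical, and it assumes positive loss values.

```python
import math

def first_anomaly(losses, window=5, spike_factor=3.0):
    """Return (step, kind) for the first spike or non-finite loss, else None.

    Assumes positive loss values; thresholds are illustrative only.
    """
    recent = []  # the last `window` finite, non-spiking losses
    for step, loss in enumerate(losses):
        # NaN and +/-inf are reported immediately.
        if not math.isfinite(loss):
            return step, "non-finite"
        # Only compare against the baseline once the window is full.
        if len(recent) == window:
            baseline = sum(recent) / window
            if loss > spike_factor * baseline:
                return step, "spike"
        recent.append(loss)
        if len(recent) > window:
            recent.pop(0)
    return None

# Hypothetical loss trace, invented for illustration only.
trace = [2.0, 1.9, 1.8, 1.8, 1.7, 1.7, 9.5, float("nan")]
print(first_anomaly(trace))  # (6, 'spike')
```

If the reported step is identical across restarts with the same seed and data order, that supports the observation that the failure is deterministic.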

As mentioned in the title, I went back to check the Docker image setup and found that many tests fail. The problem can be reproduced with the following Dockerfile and command:

FROM docker.io/rocm/pytorch:rocm6.4_ubuntu24.04_py3.12_pytorch_release_2.5.1

WORKDIR /workspace
RUN mkdir /scratch0


# Limit the number of OpenMP/Torch threads in the container; otherwise libgomp raises an error.
ENV OMP_NUM_THREADS=4
ENV TORCH_NUM_THREADS=4

RUN apt-get -y update
# Other Python packages.
RUN pip install pytorch-lightning tqdm numpy biopython pandas matplotlib einops ninja packaging numba scipy


RUN git clone https://github.com/ROCm/flash-attention.git &&\
    cd flash-attention &&\
    GPU_ARCHS=gfx942 python setup.py install


# set working dir
WORKDIR /workspace/flash-attention

After building the image, run the following inside the container:

pytest tests/test_flash_attn_ck.py

Operating System

Ubuntu

CPU

AMD EPYC 9654 96-Core Processor

GPU

MI300X

ROCm Version

6.4.0

ROCm Component

No response

Steps to Reproduce

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
