[CI] Enable RoPE fusion regression tests on iGPU #35252
Self-hosted iGPU runners use Intel proxy with custom Root CA. The CA certificate is mounted into the container but not integrated into the system CA bundle, causing pip to fail with SSL: CERTIFICATE_VERIFY_FAILED during Python setup. Run update-ca-certificates and export PIP_CERT/REQUESTS_CA_BUNDLE before Setup Python step to resolve this.
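The two exported variables can be illustrated with a minimal Python sketch. It assumes the Debian/Ubuntu bundle path `/etc/ssl/certs/ca-certificates.crt`, which is where `update-ca-certificates` writes the merged bundle; the helper name `ca_env` is hypothetical and not part of the workflow.

```python
import os

# Assumption: Debian/Ubuntu location of the bundle that
# update-ca-certificates regenerates after merging local CAs.
SYSTEM_CA_BUNDLE = "/etc/ssl/certs/ca-certificates.crt"


def ca_env(base=None, bundle=SYSTEM_CA_BUNDLE):
    """Return a copy of the environment with pip and requests pointed at `bundle`."""
    env = dict(os.environ if base is None else base)
    env["PIP_CERT"] = bundle            # pip's environment equivalent of --cert
    env["REQUESTS_CA_BUNDLE"] = bundle  # honored by the requests library
    return env
```

A child process started with `env=ca_env()` (for example a `pip install` launched via `subprocess.run`) would then verify TLS against the system bundle, provided the proxy's root CA has already been merged into it.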
setupvars.sh is not found inside the Docker container due to path remapping (/__w/ vs /opt/home/). It is also unnecessary — OpenVINO is installed via pip wheels, matching the CPU RoPE test pattern in job_pytorch_models_tests.yml.
The requirements_rope.txt file was removed from tests/CMakeLists.txt install list in PR openvinotoolkit#34076, which split requirements for model tests. Rope dependencies were moved to pytorch/envs/rope.txt, and the CPU RoPE test was updated accordingly. Sync iGPU RoPE test to install requirements_pytorch plus pytorch/envs/rope.txt, matching the CPU RoPE test pattern.
The Fetch actions step uses sparse-checkout with default clean: true, which wipes untracked files in the workspace — including the previously extracted openvino_tests content (requirements_pytorch, model_hub_tests, etc.). Wheels survive because they are downloaded after the checkout, but the RoPE test fails to find its pip requirements. Set clean: false to keep the extracted artifacts.
Without git in the ubuntu:22.04 container, ababushk/checkout falls back to the REST API download path, which unconditionally deletes the workspace contents and ignores clean: false. Install git before Fetch actions so checkout uses git init/fetch and the previously extracted openvino_tests files survive.
ababushk/checkout unconditionally deletes workspace contents when there is no existing .git directory, regardless of clean: false. Moving Install git + Fetch actions to the very first steps ensures the checkout runs against an already-clean workspace, so subsequent artifact downloads and extraction remain intact through the rest of the job.
GPU plugin from pip-installed OpenVINO wheel discovers 0 devices on iGPU runner while clinfo in the same container sees them. Non-RoPE flow relies on setupvars.sh to set LD_LIBRARY_PATH, which the wheel flow dropped. Export LD_LIBRARY_PATH explicitly to the wheel's openvino/libs directory. Also add a temporary diagnostic block (ldd on GPU plugin, clinfo -l, /etc/OpenCL/vendors and /dev/dri listings, Python smoke that asserts 'GPU' in Core().available_devices) and OV_LOG_LEVEL=5, to be removed in a follow-up once CI is green.
…iffusers The non-RoPE GPU flow sources setupvars.sh, which sets LD_LIBRARY_PATH to install/runtime/lib. The wheel flow does not source it, so the GPU plugin loaded but discovered 0 devices. Point the dynamic linker at site-packages/openvino/libs explicitly. Also add diffusers to the RoPE requirements — the Flux test model is skipped on CPU but runs on GPU and imports FluxTransformer2DModel. Drop the diagnostic block and OV_LOG_LEVEL=5 added in the previous commit now that the LD_LIBRARY_PATH hypothesis is confirmed.
The runner-level clinfo step is informational and returns exit 0 even when 0 OpenCL platforms are present, so a runner that came up without a functional GPU still passes the earlier Verify devices step and all 35 pytest cases then fail with the same cryptic plugin error. Add a one-second Core().available_devices check right before pytest to exit with a clear assertion instead — "iGPU runner did not expose GPU — likely infra flake, rerun the job" — so the source of the failure (infra vs. code) is immediately visible in the log.
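A fail-fast check of this kind might look like the sketch below. It is an illustration, not the workflow's actual script: the helper names `has_gpu`, `check_gpu`, and `main` are hypothetical, the message paraphrases the one quoted above, and accepting indexed names such as `GPU.0` is an assumption made so that multi-GPU hosts also pass.

```python
FAILURE_HINT = "iGPU runner did not expose GPU: likely infra flake, rerun the job"


def has_gpu(devices):
    """Return True if any device name is 'GPU' or an indexed 'GPU.<n>'."""
    return any(name == "GPU" or name.startswith("GPU.") for name in devices)


def check_gpu(devices):
    """Exit with a readable message instead of letting pytest fail cryptically."""
    if not has_gpu(devices):
        raise SystemExit(f"{FAILURE_HINT} (available devices: {list(devices)})")


def main():
    # Imported lazily so the helpers above stay usable without OpenVINO installed.
    from openvino import Core

    check_gpu(Core().available_devices)
```

Calling `main()` right before pytest turns a runner without a usable GPU into a single explicit failure rather than 35 identical plugin errors.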
Drop HF_HUB_VERBOSITY and TRANSFORMERS_VERBOSITY env vars from the "OpenVINO GPU RoPE Tests" step. They were added during the debug cycle and are no longer needed now that the test step is green. LD_LIBRARY_PATH export (root-cause fix for wheel flow) and the fail-fast Core().available_devices smoke (safety-net against iGPU runner infra flakes) are kept as production code.
# to the install's runtime/lib. The wheel flow does not source it,
# so point the dynamic linker at the wheel's openvino/libs
# directory; otherwise the GPU plugin loads but discovers 0 devices.
OV_LIBS_DIR=$(python3 -c "import openvino, os; print(os.path.join(os.path.dirname(openvino.__file__), 'libs'))")
That looks strange. What type of distribution are we testing here (wheels, build)?
We are testing the pip-wheel distribution (matches CPU RoPE in job_pytorch_models_tests.yml). The non-RoPE GPU flow sources setupvars.sh, which sets LD_LIBRARY_PATH to install/runtime/lib. The wheel flow has no equivalent: pip install openvino puts the libs under site-packages/openvino/libs/ and doesn't patch the dynamic linker. Without this export the GPU plugin loads but discovers 0 devices on the iGPU runner. I've reworded the comment to make this explicit.
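The path computation behind that export can be sketched in Python as follows. The helper names are hypothetical; the only facts taken from the PR are that the libraries live in a `libs` directory next to the package's `__init__.py` and that this directory has to be on `LD_LIBRARY_PATH`.

```python
import os


def wheel_libs_dir(package_init_file):
    """Return the 'libs' directory that sits next to the package's __init__.py."""
    return os.path.join(os.path.dirname(package_init_file), "libs")


def prepend_ld_library_path(libs_dir, current=""):
    """Put libs_dir first, keeping any existing entries after it."""
    return f"{libs_dir}:{current}" if current else libs_dir
```

In the workflow the input would be `openvino.__file__`. The resulting value has to be exported before the test process starts, because the dynamic linker reads `LD_LIBRARY_PATH` at process startup; setting it from inside an already running Python process would not affect library lookup.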
volumes:
  - /usr/local/share/ca-certificates:/usr/local/share/ca-certificates:ro # Needed to access CA certificates
  - ${{ github.workspace }}:${{ github.workspace }} # Needed as ${{ github.workspace }} is not working correctly when using Docker
  - /mount:/mount # Needed for HuggingFace model cache
I am not sure if the GPU runners have access to the data shares. They are set up in a different place, not in Azure like the aks-* runners.
All 35 tests passed in the previous green run, so either /mount is reachable from iGPU runners or huggingface_hub silently falls back to the in-container default — tests work either way. @akladiev could you confirm /mount accessibility? If it's truly unavailable, I'll drop the volume mount and the two HF_HUB_CACHE env vars in a follow-up.
blocked on infra confirmation
@gkoscins, from what I remember we should have a server in HRZ available to serve as a shared drive, but I'm not sure whether it actually gets mounted right now. Is that true?
By default I think the cache just writes to the local directory on a host (meaning, each host will have its own HF models cache, probably not exactly what we want to achieve). If we truly need to use a shared cache - it has to be mounted first
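If the choice between a shared and a host-local cache ever needs to be explicit rather than implicit, it could be made along these lines. This is a hypothetical sketch: `pick_hf_cache` is not part of the workflow or of huggingface_hub, and the default path is an assumption based on huggingface_hub's documented default of `~/.cache/huggingface/hub`.

```python
import os
from pathlib import Path


def pick_hf_cache(shared_dir, default_dir=None):
    """Use the shared directory only if it exists and is writable.

    Otherwise fall back to `default_dir`, or to the per-user default
    location when no explicit fallback is given.
    """
    shared = Path(shared_dir)
    if shared.is_dir() and os.access(shared, os.W_OK):
        return shared
    if default_dir is not None:
        return Path(default_dir)
    return Path.home() / ".cache" / "huggingface" / "hub"
```

The result could be exported as `HF_HUB_CACHE` before the tests run, which would make it visible in the log whether a given run used the shared drive or a per-host cache.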
name: openvino_tokenizers_wheel
path: ${{ env.INSTALL_DIR }}

- name: Setup HF cache
See my comment above about the data shares. I believe that they are inaccessible from the GPU runners. @akladiev
Two parts to this:
The Setup HF cache step itself is gone — HF_HUB_CACHE / HUGGINGFACE_HUB_CACHE are now declared in the job-level env: block (see the Setup HF cache thread).
The substantive question — whether /mount is reachable from iGPU runners — is the same one you raised on the volumes: block (line 46). I've pinged @akladiev there. Previous green runs passed with this configuration either way (either /mount is reachable or huggingface_hub falls back to the in-container default). If @akladiev confirms it's unreachable, I'll drop the volume mount + the two env vars in a follow-up.
Tracking under the line-46 thread to avoid duplicate discussion.
Switch container image from ubuntu:22.04 to buildpack-deps:22.04-scm, a Docker Hub image (reachable from iGPU runners, which cannot pull from openvinogithubactions.azurecr.io) that ships git, ca-certificates, update-ca-certificates pre-installed. This removes the Install git step and reduces Update CA certificates to a single update-ca-certificates call.
Other workflow cleanups:
- Move HF_HUB_CACHE, HUGGINGFACE_HUB_CACHE, PIP_CERT, REQUESTS_CA_BUNDLE into the job-level env: section. Drop the Setup HF cache step and the GITHUB_ENV echoes from Update CA certificates.
- Split the RoPE test step into Install RoPE test requirements (pip install) and OpenVINO GPU RoPE Tests (pytest) so the GHA log makes the source of failures (install vs. test) immediately visible.
- Move PYTHONPATH into the step-level env: of OpenVINO GPU RoPE Tests (job-level would affect non-rope test types in the same reusable workflow).
- Reword the LD_LIBRARY_PATH comment to state explicitly that this is the pip-wheel distribution flow.
Two follow-ups to the review-feedback commit:
1. Move update-ca-certificates to the first step of the rope flow. buildpack-deps:22.04-scm ships ca-certificates pre-installed, so the postinst trigger that previously merged the bind-mounted Intel proxy root CA into the system bundle no longer fires. Without this move, Fetch actions hits 'server certificate verification failed'.
2. Move diffusers out of the shared rope.txt and into a GPU-only pip install step. The CPU RoPE flow uses the same requirements file and does not exercise Flux, so installing diffusers there only risks breaking optimum-intel's auto-pipeline registration when pip resolves a diffusers version that pulls in a transformers feature not present in optimum-intel==1.27.0. Pin diffusers to 0.34.0 in the GPU step.
Round 2 (109d0d6) cleared CPU/ARM64 by removing diffusers from the shared rope.txt, but the GPU flow still failed at test collection with 'cannot import name GlmModel from transformers'. Root cause: the iGPU runner's setup-python cached venv ships transformers==4.45.1. optimum-intel==1.27.0 accepts >=4.45,<4.58, so pip does not auto-upgrade. diffusers==0.34.0 has no transformers lower bound, so 4.45.1 stays. diffusers 0.34's cogview4 pipeline imports GlmModel (added to transformers in 4.49), and optimum.intel triggers full auto-pipeline registration on import, so every one of the 35 tests fails during collection. Fix: pin transformers>=4.49,<4.58 alongside diffusers==0.34.0 in the GPU-only install step. The window satisfies both cogview4 and optimum-intel.
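The version arithmetic in this commit message can be checked with a small sketch. The comparison below is a deliberately simplified stand-in for pip's PEP 440 handling (plain dotted integers only), and the names `parse` and `in_window` are hypothetical; the bounds themselves are the ones stated above.

```python
def parse(version, width=3):
    """Turn '4.45.1' into (4, 45, 1), padding short versions with zeros."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def in_window(version, lower, upper):
    """True if lower <= version < upper."""
    return parse(lower) <= parse(version) < parse(upper)


# Bounds quoted in the commit message.
OPTIMUM_INTEL_WINDOW = ("4.45", "4.58")  # accepted by optimum-intel==1.27.0
GPU_STEP_WINDOW = ("4.49", "4.58")       # pinned in the GPU-only install step

cached = "4.45.1"
print(in_window(cached, *OPTIMUM_INTEL_WINDOW))  # satisfies optimum-intel, so no upgrade
print(in_window(cached, *GPU_STEP_WINDOW))       # rejected by the new pin, forcing an upgrade
```

Any version accepted by the GPU-step window is also accepted by optimum-intel's window, since the pin only raises the lower bound to the release that introduced GlmModel.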
Instead of branching the shared job_gpu_tests.yml on test_type == 'rope', use a dedicated reusable workflow for the iGPU RoPE fusion tests.
- Revert job_gpu_tests.yml to its original form: drop the rope steps, the /mount cache volume, and the HF/PIP env added for rope.
- Add job_gpu_rope_tests.yml: a self-contained reusable workflow that downloads the artifacts, installs the OpenVINO wheel, installs the RoPE requirements (diffusers==0.34.0 + transformers>=4.49,<4.58), exports LD_LIBRARY_PATH to the wheel's openvino/libs, asserts a GPU is exposed, and runs the precommit RoPE transformation tests.
- ubuntu_22.yml: point iGPU_RoPE_Tests at the new workflow, drop the test_type input and the '-e HF_TOKEN' option (the HF and *.xethub.hf.co domains are now whitelisted on the iGPU runners, so no token is needed), and add the 'Linux' runner label.
The /mount shared drive is not mounted on the HRZ iGPU hosts, so the HuggingFace cache volume and HF_HUB_CACHE/HUGGINGFACE_HUB_CACHE env vars are removed.
iGPU self-hosted runners can now reach openvinogithubactions.azurecr.io after recent infra changes, so pull buildpack-deps:22.04-scm through the internal ACR Docker Hub mirror instead of directly from Docker Hub. This matches the dockerhub/-prefix convention already used for ubuntu images in this workflow and avoids Docker Hub rate limits. buildpack-deps is kept (rather than plain ubuntu) because it ships git + ca-certificates pre-installed, keeping the in-pipeline Install git step removed. The stale comment claiming the registry was unreachable is corrected.
The internal ACR Docker Hub mirror does not carry buildpack-deps (only ubuntu is mirrored), so the mirrored image openvinogithubactions.azurecr.io/dockerhub/buildpack-deps:22.04-scm returns "manifest unknown" and the GPU_RoPE job fails at container initialization before any test runs. Revert the image to the direct Docker Hub pull buildpack-deps:22.04-scm, which ran green in the previous CI run, and keep the comment corrected so it no longer claims the ACR registry is unreachable from iGPU runners.
Details:
Extends the RoPE fusion regression testing (introduced for CPU in #34288) to run on iGPU self-hosted runners.
Changes:
The test code in test_transformations.py already handles GPU — it skips GPTJ/CodeGen models (RoPEFusionGPTJ and RoPEFusionIOSlicing are disabled on GPU) and enables Flux (RoPEFusionFlux is GPU-only). No changes to test logic or model list.
Tickets: