[CI] Enable RoPE fusion regression tests on iGPU #35252

Open
evkotov wants to merge 17 commits into openvinotoolkit:master from evkotov:CVS-182458

Contributor

@evkotov evkotov commented Apr 10, 2026

Details:

Extends the RoPE fusion regression testing (introduced for CPU in #34288) to run on iGPU self-hosted runners.

Changes:

  • Add rope test type support to the GPU test workflow (job_gpu_tests.yml): Python wheel installation, HuggingFace model cache, pytest execution with TEST_DEVICE=GPU
  • Uncomment and enable iGPU_RoPE_Tests job in ubuntu_22.yml (was blocked by CVS-182443, now resolved)
  • Simplify device mounting to use only renderD128 (card1 is not needed for compute)
  • Add iGPU_RoPE_Tests to Overall_Status dependency list

The test code in test_transformations.py already handles GPU — it skips GPTJ/CodeGen models (RoPEFusionGPTJ and RoPEFusionIOSlicing are disabled on GPU) and enables Flux (RoPEFusionFlux is GPU-only). No changes to test logic or model list.
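The device-dependent selection described above can be pictured with a small sketch. This is not the code in test_transformations.py: the family labels, the `SKIPPED_ON` table, and the `should_run` helper are hypothetical names used only for illustration, and defaulting to CPU when TEST_DEVICE is unset is an assumption.

```python
# Hypothetical illustration of per-device model selection; the real test
# keeps its own model list and skip logic.
SKIPPED_ON = {
    "GPU": {"gptj", "codegen"},  # RoPEFusionGPTJ / RoPEFusionIOSlicing are disabled on GPU
    "CPU": {"flux"},             # RoPEFusionFlux is GPU-only
}


def current_device(env):
    """Read the target device from a TEST_DEVICE-style variable (CPU default is assumed)."""
    return env.get("TEST_DEVICE", "CPU").upper()


def should_run(family, device):
    """Return True if a model family is expected to be exercised on the given device."""
    return family.lower() not in SKIPPED_ON.get(device, set())
```

With `TEST_DEVICE=GPU`, `should_run("flux", current_device(os.environ))` would be true while the GPTJ and CodeGen families would be filtered out.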

Tickets:

  • 182458
  • 182443

@evkotov evkotov self-assigned this Apr 10, 2026
@evkotov evkotov requested a review from a team as a code owner April 10, 2026 08:06
@github-actions github-actions Bot added the category: CI and github_actions labels Apr 10, 2026
evkotov added 8 commits April 20, 2026 15:29
Self-hosted iGPU runners use Intel proxy with custom Root CA.
The CA certificate is mounted into the container but not integrated
into the system CA bundle, causing pip to fail with
SSL: CERTIFICATE_VERIFY_FAILED during Python setup.

Run update-ca-certificates and export PIP_CERT/REQUESTS_CA_BUNDLE
before Setup Python step to resolve this.

setupvars.sh is not found inside the Docker container due to
path remapping (/__w/ vs /opt/home/). It is also unnecessary —
OpenVINO is installed via pip wheels, matching the CPU RoPE test
pattern in job_pytorch_models_tests.yml.

The requirements_rope.txt file was removed from tests/CMakeLists.txt
install list in PR openvinotoolkit#34076, which split requirements for model tests.
Rope dependencies were moved to pytorch/envs/rope.txt, and the CPU
RoPE test was updated accordingly. Sync iGPU RoPE test to install
requirements_pytorch plus pytorch/envs/rope.txt, matching the CPU
RoPE test pattern.

The Fetch actions step uses sparse-checkout with default clean: true,
which wipes untracked files in the workspace — including the
previously extracted openvino_tests content (requirements_pytorch,
model_hub_tests, etc.). Wheels survive because they are downloaded
after the checkout, but the RoPE test fails to find its pip
requirements. Set clean: false to keep the extracted artifacts.

Without git in the ubuntu:22.04 container, ababushk/checkout falls
back to the REST API download path, which unconditionally deletes
the workspace contents and ignores clean: false. Install git before
Fetch actions so checkout uses git init/fetch and the previously
extracted openvino_tests files survive.

ababushk/checkout unconditionally deletes workspace contents when
there is no existing .git directory, regardless of clean: false.
Moving Install git + Fetch actions to the very first steps ensures
the checkout runs against an already-clean workspace, so subsequent
artifact downloads and extraction remain intact through the rest of
the job.

GPU plugin from pip-installed OpenVINO wheel discovers 0 devices on
iGPU runner while clinfo in the same container sees them. Non-RoPE
flow relies on setupvars.sh to set LD_LIBRARY_PATH, which the wheel
flow dropped. Export LD_LIBRARY_PATH explicitly to the wheel's
openvino/libs directory.

Also add a temporary diagnostic block (ldd on GPU plugin, clinfo -l,
/etc/OpenCL/vendors and /dev/dri listings, Python smoke that asserts
'GPU' in Core().available_devices) and OV_LOG_LEVEL=5, to be removed
in a follow-up once CI is green.

…iffusers

The non-RoPE GPU flow sources setupvars.sh, which sets LD_LIBRARY_PATH
to install/runtime/lib. The wheel flow does not source it, so the GPU
plugin loaded but discovered 0 devices. Point the dynamic linker at
site-packages/openvino/libs explicitly.

Also add diffusers to the RoPE requirements — the Flux test model is
skipped on CPU but runs on GPU and imports FluxTransformer2DModel.

Drop the diagnostic block and OV_LOG_LEVEL=5 added in the previous
commit now that the LD_LIBRARY_PATH hypothesis is confirmed.
@evkotov evkotov requested a review from a team as a code owner April 20, 2026 17:04
@github-actions github-actions Bot added the category: PyTorch FE label Apr 20, 2026
evkotov added 2 commits April 21, 2026 14:54
The runner-level clinfo step is informational and returns exit 0 even
when 0 OpenCL platforms are present, so a runner that came up without
a functional GPU still passes the earlier Verify devices step and all
35 pytest cases then fail with the same cryptic plugin error.

Add a one-second Core().available_devices check right before pytest to
exit with a clear assertion instead — "iGPU runner did not expose GPU
— likely infra flake, rerun the job" — so the source of the failure
(infra vs. code) is immediately visible in the log.

Drop HF_HUB_VERBOSITY and TRANSFORMERS_VERBOSITY env vars from the
"OpenVINO GPU RoPE Tests" step. They were added during the debug
cycle and are no longer needed now that the test step is green.

LD_LIBRARY_PATH export (root-cause fix for wheel flow) and the
fail-fast Core().available_devices smoke (safety-net against iGPU
runner infra flakes) are kept as production code.
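
A fail-fast check of the kind these two commits describe might look roughly like the sketch below. It is an illustration rather than the step that landed in the workflow: `require_device` and `check_runner` are hypothetical helper names, and matching on the name before the first dot (so that "GPU.0" counts as a GPU) is an assumption about how the device list is reported.

```python
def require_device(available, wanted="GPU"):
    """Raise AssertionError with a readable message if `wanted` is not in the device list.

    Devices may be reported as "GPU" or with an index suffix such as "GPU.0",
    so only the part before the first dot is compared.
    """
    found = any(name.split(".")[0] == wanted for name in available)
    assert found, (
        f"iGPU runner did not expose {wanted} (available: {list(available)}); "
        "likely infra flake, rerun the job"
    )


def check_runner():
    """Query OpenVINO for its devices and fail early if no GPU is visible."""
    # Imported here so require_device can be used without OpenVINO installed.
    from openvino import Core

    require_device(Core().available_devices)
```

Calling `check_runner()` immediately before pytest would turn a runner without a usable GPU into a single clear assertion instead of 35 identical plugin errors.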
Comment thread .github/workflows/job_gpu_tests.yml Outdated
# to the install's runtime/lib. The wheel flow does not source it,
# so point the dynamic linker at the wheel's openvino/libs
# directory — otherwise the GPU plugin loads but discovers 0 devices.
OV_LIBS_DIR=$(python3 -c "import openvino, os; print(os.path.join(os.path.dirname(openvino.__file__), 'libs'))")
Contributor

That looks strange. What type of distribution are we testing here (wheels, build)?

Contributor Author

We are testing the pip-wheel distribution (matches CPU RoPE in job_pytorch_models_tests.yml). The non-RoPE GPU flow sources setupvars.sh, which sets LD_LIBRARY_PATH to the install's runtime/lib. The wheel flow has no equivalent: pip install openvino puts the libs under site-packages/openvino/libs/ and doesn't patch the dynamic linker. Without this export the GPU plugin loads but discovers 0 devices on the iGPU runner. I've reworded the comment to make this explicit.
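
To make the reply above concrete, here is a minimal sketch of locating an installed package's libs directory and prepending it to LD_LIBRARY_PATH. It is not the workflow's shell step: `package_libs_dir` and `with_library_path` are hypothetical helpers, and the `libs` subdirectory name is taken from the discussion rather than verified against every wheel layout.

```python
import importlib.util
import os


def package_libs_dir(package, subdir="libs"):
    """Return <package dir>/<subdir> without importing the package itself."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"{package} is not installed as a regular package")
    return os.path.join(os.path.dirname(spec.origin), subdir)


def with_library_path(env, libs_dir):
    """Return a copy of `env` with `libs_dir` prepended to LD_LIBRARY_PATH."""
    updated = dict(env)
    current = updated.get("LD_LIBRARY_PATH", "")
    updated["LD_LIBRARY_PATH"] = libs_dir + (os.pathsep + current if current else "")
    return updated
```

The dynamic linker reads LD_LIBRARY_PATH when a process starts, so the resulting mapping would be handed to the process that launches pytest (for example through the `env=` argument of `subprocess.run`) rather than set inside an already running interpreter.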

Contributor

@mryzhov mryzhov left a comment

@ababushk, @akashchi could you please take a look at the PR

Comment thread .github/workflows/job_gpu_tests.yml Outdated
volumes:
- /usr/local/share/ca-certificates:/usr/local/share/ca-certificates:ro # Needed to access CA certificates
- ${{ github.workspace }}:${{ github.workspace }} # Needed as ${{ github.workspace }} is not working correctly when using Docker
- /mount:/mount # Needed for HuggingFace model cache
Contributor

I am not sure if the GPU runners have access to the data shares. They are set up in a different place, not in Azure like the aks-* runners.

Contributor Author

@evkotov evkotov May 11, 2026

All 35 tests passed in the previous green run, so either /mount is reachable from iGPU runners or huggingface_hub silently falls back to the in-container default — tests work either way. @akladiev could you confirm /mount accessibility? If it's truly unavailable, I'll drop the volume mount and the two HF_HUB_CACHE env vars in a follow-up.

Contributor Author

blocked on infra confirmation

Collaborator

@akladiev akladiev May 15, 2026

@gkoscins, we should have a server in HRZ available to serve as a shared drive, from what I remember. But I'm not sure if it gets actually mounted right now. Is that true?

By default, I think the cache just writes to a local directory on the host (meaning each host will have its own HF model cache, which is probably not what we want to achieve). If we truly need a shared cache, it has to be mounted first.
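
The "local directory by default" behaviour can be illustrated with a simplified sketch of how the cache location is commonly resolved. This approximates my understanding of huggingface_hub's lookup order (HF_HUB_CACHE, then the legacy HUGGINGFACE_HUB_CACHE, then a `hub` directory under HF_HOME) and is an assumption to check against the library's documentation; `resolve_hf_cache` is a hypothetical helper and ignores details such as XDG_CACHE_HOME.

```python
import os


def resolve_hf_cache(env, home):
    """Approximate the HuggingFace hub cache directory for a given environment.

    Assumed precedence: HF_HUB_CACHE, then HUGGINGFACE_HUB_CACHE,
    then <HF_HOME>/hub, with HF_HOME defaulting to <home>/.cache/huggingface.
    """
    hf_home = env.get("HF_HOME", os.path.join(home, ".cache", "huggingface"))
    default = os.path.join(hf_home, "hub")
    legacy = env.get("HUGGINGFACE_HUB_CACHE", default)
    return env.get("HF_HUB_CACHE", legacy)
```

Under these assumptions, a job that sets neither variable ends up with a per-host (or per-container) cache under the home directory, which is the situation described in the comment above.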

Comment thread .github/workflows/job_gpu_tests.yml Outdated
name: openvino_tokenizers_wheel
path: ${{ env.INSTALL_DIR }}

- name: Setup HF cache
Contributor

See my comment above about the data shares. I believe that they are inaccessible from the GPU runners. @akladiev

Contributor Author

Two parts to this:

1. The Setup HF cache step itself is gone: HF_HUB_CACHE / HUGGINGFACE_HUB_CACHE are now declared in the job-level env: block (see the Setup HF cache thread).
2. The substantive question, whether /mount is reachable from iGPU runners, is the same one you raised on the volumes: block (line 46). I've pinged @akladiev there. Previous green runs passed with this configuration either way (either /mount is reachable or huggingface_hub falls back to the in-container default). If @akladiev confirms it's unreachable, I'll drop the volume mount + the two env vars in a follow-up.

Tracking under the line-46 thread to avoid duplicate discussion.

evkotov added 2 commits May 11, 2026 11:15
Switch container image from ubuntu:22.04 to buildpack-deps:22.04-scm —
a Docker Hub image (reachable from iGPU runners, which cannot pull
from openvinogithubactions.azurecr.io) that ships git, ca-certificates,
update-ca-certificates pre-installed. This removes the Install git
step and reduces Update CA certificates to a single
update-ca-certificates call.

Other workflow cleanups:
- Move HF_HUB_CACHE, HUGGINGFACE_HUB_CACHE, PIP_CERT, REQUESTS_CA_BUNDLE
  into the job-level env: section. Drop the Setup HF cache step and the
  GITHUB_ENV echoes from Update CA certificates.
- Split the RoPE test step into Install RoPE test requirements (pip
  install) and OpenVINO GPU RoPE Tests (pytest) so the GHA log makes
  the source of failures (install vs. test) immediately visible.
- Move PYTHONPATH into the step-level env: of OpenVINO GPU RoPE Tests
  (job-level would affect non-rope test types in the same reusable
  workflow).
- Reword the LD_LIBRARY_PATH comment to state explicitly that this is
  the pip-wheel distribution flow.

Two follow-ups to the review-feedback commit:

1. Move update-ca-certificates to the first step of the rope flow.
   buildpack-deps:22.04-scm ships ca-certificates pre-installed, so
   the postinst trigger that previously merged the bind-mounted Intel
   proxy root CA into the system bundle no longer fires. Without this
   move, Fetch actions hits 'server certificate verification failed'.

2. Move diffusers out of the shared rope.txt and into a GPU-only pip
   install step. The CPU RoPE flow uses the same requirements file
   and does not exercise Flux, so installing diffusers there only
   risks breaking optimum-intel's auto-pipeline registration when
   pip resolves a diffusers version that pulls in a transformers
   feature not present in optimum-intel==1.27.0. Pin diffusers to
   0.34.0 in the GPU step.
@github-actions github-actions Bot removed the category: PyTorch FE label May 11, 2026
@akashchi akashchi self-requested a review May 11, 2026 15:18
evkotov added 4 commits May 12, 2026 15:07
Round 2 (109d0d6) cleared CPU/ARM64 by removing diffusers from the
shared rope.txt, but the GPU flow still failed at test collection
with 'cannot import name GlmModel from transformers'.

Root cause: the iGPU runner's setup-python cached venv ships
transformers==4.45.1. optimum-intel==1.27.0 accepts >=4.45,<4.58,
so pip does not auto-upgrade. diffusers==0.34.0 has no transformers
lower bound, so 4.45.1 stays. diffusers 0.34's cogview4 pipeline
imports GlmModel (added to transformers in 4.49), and optimum.intel
triggers full auto-pipeline registration on import, so every one of
the 35 tests fails during collection.

Fix: pin transformers>=4.49,<4.58 alongside diffusers==0.34.0 in
the GPU-only install step. The window satisfies both cogview4 and
optimum-intel.
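
The version reasoning in this commit can be checked with a short sketch. The bounds are the ones quoted in the message (optimum-intel==1.27.0 accepting >=4.45,<4.58, and GlmModel arriving in transformers 4.49); the helper names are hypothetical and the parser handles only plain numeric versions, not pre-releases.

```python
def parse(version):
    """Turn a plain numeric version such as '4.45.1' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def in_range(version, lower, upper):
    """True if lower <= version < upper."""
    return parse(lower) <= parse(version) < parse(upper)


OPTIMUM_INTEL_RANGE = ("4.45", "4.58")  # range accepted by optimum-intel==1.27.0, per the commit
GLM_MODEL_MIN = "4.49"                  # first transformers release with GlmModel, per the commit


def satisfies_all(version):
    """True if a transformers version works for both optimum-intel and the GlmModel import."""
    return in_range(version, *OPTIMUM_INTEL_RANGE) and parse(version) >= parse(GLM_MODEL_MIN)
```

Here `in_range("4.45.1", *OPTIMUM_INTEL_RANGE)` holds, which is why pip left the cached version alone, while `satisfies_all("4.45.1")` does not; any version inside the pinned window >=4.49,<4.58 passes both checks.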
Instead of branching the shared job_gpu_tests.yml on test_type == 'rope',
use a dedicated reusable workflow for the iGPU RoPE fusion tests.

- Revert job_gpu_tests.yml to its original form: drop the rope steps,
  the /mount cache volume, and the HF/PIP env added for rope.
- Add job_gpu_rope_tests.yml: a self-contained reusable workflow that
  downloads the artifacts, installs the OpenVINO wheel, installs the
  RoPE requirements (diffusers==0.34.0 + transformers>=4.49,<4.58),
  exports LD_LIBRARY_PATH to the wheel's openvino/libs, asserts a GPU is
  exposed, and runs the precommit RoPE transformation tests.
- ubuntu_22.yml: point iGPU_RoPE_Tests at the new workflow, drop the
  test_type input and the '-e HF_TOKEN' option (the HF and *.xethub.hf.co
  domains are now whitelisted on the iGPU runners, so no token is
  needed), and add the 'Linux' runner label.

The /mount shared drive is not mounted on the HRZ iGPU hosts, so the
HuggingFace cache volume and HF_HUB_CACHE/HUGGINGFACE_HUB_CACHE env vars
are removed.

iGPU self-hosted runners can now reach openvinogithubactions.azurecr.io
after recent infra changes, so pull buildpack-deps:22.04-scm through the
internal ACR Docker Hub mirror instead of directly from Docker Hub. This
matches the dockerhub/-prefix convention already used for ubuntu images
in this workflow and avoids Docker Hub rate limits. buildpack-deps is
kept (rather than plain ubuntu) because it ships git + ca-certificates
pre-installed, keeping the in-pipeline Install git step removed. The
stale comment claiming the registry was unreachable is corrected.

The internal ACR Docker Hub mirror does not carry buildpack-deps (only
ubuntu is mirrored), so the mirrored image
openvinogithubactions.azurecr.io/dockerhub/buildpack-deps:22.04-scm
returns "manifest unknown" and the GPU_RoPE job fails at container
initialization before any test runs.

Revert the image to the direct Docker Hub pull buildpack-deps:22.04-scm,
which ran green in the previous CI run, and keep the comment corrected so
it no longer claims the ACR registry is unreachable from iGPU runners.
@evkotov evkotov requested a review from mryzhov June 3, 2026 13:09
Labels

category: CI, github_actions

4 participants