[ci] refactor: consolidate workflows and reorganize tests#687
[ci] refactor: consolidate workflows and reorganize tests#687FoolPlayer wants to merge 16 commits intomainfrom
Conversation
…ests - gpu_unit_tests.yml: restore `pytest -s -x tests/data` (prior commit left the step as `uv run --froze`, silently disabling data tests on GPU) - npu_unit_tests.yml: drop `if: steps.changes.outputs.qwen3_5_ulysses` — that step never existed in the NPU workflow, so the Qwen3.5 Ulysses SP test was never actually running on NPU - delete tests/e2e/test_e2e_training.py, test_e2e_training_no_reshard.py and exec_scripts.py: they used non-relative `from exec_scripts import` (broken from repo root), were not wired into any CI workflow, and duplicated what test_e2e_parallel.py already covers with toy configs - prune Level-3 E2E training docs in docs/testing.md and the qwen3_omni_moe_example.md onboarding guide to match Made-with: Cursor
- Register `unit`, `e2e`, `gpu_only`, `npu_only`, `v5_only`, `dit`,
`benchmark` markers in pyproject.toml. Each test file declares exactly
one of `unit` / `e2e` via `pytestmark`, so CI can select by marker
instead of enumerating paths
- New tests/conftest.py: CI NFS path fixtures (`ci_models_dir`,
`ci_samples_dir`, `ci_dataset_dir`) and a collection hook that applies
device / transformers-version skips based on the `VEOMNI_TEST_DEVICE`
env var (set by the workflows). Avoids per-file `skipif` boilerplate
- New tests/e2e/conftest.py: hoist the six duplicated
`dummy_*_dataset` session fixtures out of test_e2e_parallel.py so the
planned per-modality split in Step 6 doesn't need to re-plumb them
- Tag out-of-CI ulysses async tests with `@pytest.mark.skip` so marker
inclusion preserves current behavior; tag `test_video_utils.py` to use
`os.environ.get("CI_SAMPLES_DIR", "")` instead of crashing at import
when the env var is unset
- Tag `test_models_logits_equal.py` as `gpu_only` (matches current NPU
workflow skipping it), `test_vlm_trainer.py` / `test_count_flops.py` /
`test_qwen3_5_gated_deltanet_ulysses.py` as `v5_only`, and
`test_npu_setup.py` as `npu_only`
Collection summary: `pytest -m unit` collects 464 tests, `pytest -m e2e`
collects 27 (total 491, no overlap). Device gates verified:
`VEOMNI_TEST_DEVICE=npu pytest test_models_logits_equal.py` skips all 6
cases with reason "gpu_only test skipped on device=npu".
Made-with: Cursor
Replaces the step-by-step hardcoded test invocation in the four top-level test workflows (gpu_unit_tests, gpu_e2e_test, npu_unit_tests, npu_e2e_test) with two reusable workflows that each take a pytest marker + a transformers-version variant and run `pytest -m <marker>` once per requested variant. - _test_runner_gpu.yml: L20-8 container (verl-ci-cn-beijing GPU image, --gpus all --ipc host --shm-size 32g), `VEOMNI_TEST_DEVICE=gpu`, optional `run_dit_step` for the diffusers (`--extra dit`) leg that only gpu_e2e uses today - _test_runner_npu.yml: 910B-8 Ascend container with the full device mount list, the ASCEND_VISIBLE_DEVICES slice-detection step, and `VEOMNI_TEST_DEVICE=npu` The `transformers_variant` input takes `stable` / `v5` / `both`. During the v4 deprecation window callers pass `both`, which runs the marker selection once under transformers-stable and once under `--extra transformers5-exp`. When v4 is fully retired the callers flip to `v5` and the stable-branch steps go dead in one commit, with no matrix infrastructure to untangle. Two separate files (instead of a single device-parametrised workflow) because GHA can't interpolate the `container.volumes` / `container.options` block as a whole expression; Ascend's 16-device mount list and the L20's `--gpus all` option are different enough that a merged runner would need more `if:` gating than it saves in duplication. These are pure workflow_call definitions — nothing runs yet until the top-level workflows are rewired to use them (next commit). Made-with: Cursor
Rewrites gpu_unit_tests / gpu_e2e_test / npu_unit_tests / npu_e2e_test
as ~40-line wrappers around the new reusable _test_runner_{gpu,npu}.yml
workflows (added previous commit). Each caller now declares only:
- which pytest marker to select (`unit` or `e2e`)
- the `both` transformers variant (v4 + v5 during the deprecation
window; flips to `v5` when stable is removed)
- whether to run the extra `--extra dit` leg (gpu_e2e only)
Net: 458 lines removed, 35 added. Container config, runner labels,
dependency sync ordering, and the ASCEND_VISIBLE_DEVICES slice
detection step all live in one place now, so a future image bump or
runner-label change is a single-file edit instead of four.
The previous workflows hardcoded each test path as its own step
(`pytest -s -x tests/parallel/ulysses/test_ulysses.py`, etc.). That
meant (a) a typo or truncation in one step silently disabled the
corresponding tests (see the two bugs fixed earlier in this branch)
and (b) newly added test files never ran in CI until somebody
remembered to append a new step. Using `pytest -m <marker> tests/`
replaces this with marker-based selection: any new test file that
declares `pytestmark = pytest.mark.unit` (or `e2e`) is picked up
automatically.
The path filters on each caller are extended to also trigger when the
reusable workflow it uses is edited, so changes to
_test_runner_{gpu,npu}.yml still invalidate CI correctly.
Made-with: Cursor
Collapse three ubuntu-latest-only CI workflows into a single static_checks.yml with three parallel jobs: - doc_task_paths (was check_doc_task_paths.yml) — verify tasks/*.py paths referenced in docs - device_api (was device_api_check.yml) — lint for raw .cuda / "cuda" / "nccl" usage outside the whitelist - patchgen (was check_patchgen.yml) — transformers patchgen drift detection, now path-gated via `git diff` so it only runs the heavy `uv sync --extra transformers5-exp` when patchgen-relevant files actually changed Kept separate: - check_pr_lint.yml — still a standalone workflow_call reusable so the four GPU/NPU test workflows can depend on it as a gating job without also pulling in the heavier patchgen step - check_pr_title.yml — uses `pull_request.types: [opened, edited, ...]` which would needlessly re-run lint on every title edit if merged Also moved tests/special_sanity/check_device_api_usage.py to scripts/ci/check_device_api_usage.py — it was never a pytest test and the `tests/` tree shouldn't host standalone CI scripts. Updated the script's self-whitelist entry, its docstring invocation examples, and the docs/testing.md tree diagram accordingly. The `tests/special_sanity/` directory is now empty and removed. Made-with: Cursor
The single 417-line file mixed text / VLM / omni / DiT parametrizations with the shared torchrun+compare harness, so every model onboarding diff touched one enormous file. Split into four per-modality files plus a `_harness.py` module that owns `main()`, default tolerances and the v4/v5/dit skip marks. - tests/e2e/_harness.py — shared harness (main, _materialize_weights_dir, DEFAULT_RTOL/ATOL, v4_only / v5_only / dit_only marks). - tests/e2e/test_e2e_parallel_text.py — text LLMs. - tests/e2e/test_e2e_parallel_vlm.py — Qwen2-VL / Qwen3-VL family + Qwen3.5(-MoE) that share the VL dataset fixture. - tests/e2e/test_e2e_parallel_omni.py — Qwen2.5-Omni / Qwen3-Omni-MoE. - tests/e2e/test_e2e_parallel_dit.py — Wan DiT. Carries pytestmark = [e2e, dit] so the GPU e2e runner can select it with `-m "e2e and dit"` in the dedicated `--extra dit` step. Verified `pytest --collect-only -m e2e tests/e2e` collects the same 22 cases as before, and `-m "e2e and dit"` selects only the Wan case. Updated all docs and the v5 migration skill to point at the new paths and import names (`DEFAULT_RTOL`, `v5_only`, etc. from `_harness.py`). Dataset fixtures already live in tests/e2e/conftest.py so no fixture relocation was needed. Made-with: Cursor
…tests/fixtures/ Both directories hold static, checked-in test data (minimal model configs, sample media) that tests reference by path but never regenerate. Grouping them under a single tests/fixtures/ umbrella makes the intent obvious and leaves a natural home for any future static assets (golden outputs, reference JSON, etc.). - tests/toy_config/ -> tests/fixtures/toy_config/ (17 model toy dirs) - tests/testdata/ -> tests/fixtures/testdata/ (1 sample image) Updated 22 files (~94 occurrences) including tests/, veomni/ core, docs/, .agents/knowledge/, .agents/skills/, and .gitignore. Verified `pytest --collect-only tests/` still collects 491 tests and `make quality` passes. Made-with: Cursor
There was a problem hiding this comment.
Code Review
This pull request reorganizes the testing infrastructure by introducing a fixtures/ directory for static data and splitting the monolithic test_e2e_parallel.py into modality-specific files supported by a shared harness. It also centralizes CI path resolution and marker enforcement (e.g., device and version gates) in a top-level conftest.py. Feedback focuses on improving the e2e harness by using tmp_path to prevent potential race conditions and removing redundant sys.path manipulations across several test files.
| test_path = f"./{model_name}" | ||
| os.makedirs(test_path, exist_ok=True) |
There was a problem hiding this comment.
Using a hardcoded relative path (./{model_name}) for the test working directory is a high-risk pattern. It can lead to race conditions when tests are executed in parallel (e.g., via pytest-xdist) if multiple test cases share the same model name. Furthermore, if a test fails, the directory is not removed, potentially leaving stale artifacts that could interfere with future runs.
Recommendation: Use a unique temporary directory for each test invocation. The most robust way is to leverage pytest's tmp_path fixture in the test functions and pass it down to this main helper.
|
|
||
| from ..tools.launch_utils import find_free_port | ||
|
|
||
| sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..")) |
There was a problem hiding this comment.
Manual sys.path manipulation is generally discouraged as it can lead to fragile import logic and environment pollution. Given that tests/conftest.py exists, pytest automatically adds the tests/ directory to sys.path during discovery. This manual insertion is redundant and should be removed to maintain a clean test environment.
| from pathlib import Path | ||
|
|
||
|
|
||
| sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..")) |
There was a problem hiding this comment.
| import sys | ||
|
|
||
|
|
||
| sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..")) |
After switching the NPU reusable workflow to marker-based selection (`pytest -m unit tests/`), collection on NPU fails on `tests/ops/test_fused_moe_split_vs_merged.py` because the module's top-level `from veomni.ops.kernels.moe._kernels.kernel.moe import ...` transitively imports `triton`, which is not installed in the Ascend image. Pre-refactor the NPU workflow avoided this by hard-coding only `tests/ops/test_comp.py` in its pytest invocation. Restore that scoping with a local `tests/ops/conftest.py` whose `collect_ignore` list drops the GPU-backend-only files when `VEOMNI_TEST_DEVICE=npu`. Made-with: Cursor
What does this PR do?
Checklist Before Starting
[{modules}] {type}: {description}format.Test
Local verification on
fp/refactor_tests_and_ci:make qualityclean (ruff check + ruff format --check, 428 files).pytest --collect-only tests/collects 491 tests (same aspre-refactor).
pytest --collect-only -m e2e tests/e2ecollects 22 cases(same as the old monolithic file).
pytest --collect-only -m "e2e and dit" tests/e2ecollects exactlythe single Wan DiT case, which is what the new e2e GPU workflow's
--extra ditstep relies on.workflows themselves — that is the end-to-end validation.
API and Usage Example
N/A — no public API, CLI, or config-field changes. Two internal paths
changed for contributors:
tests/toy_config/<m>_toy/→tests/fixtures/toy_config/<m>_toy/tests/testdata/→tests/fixtures/testdata/tests/e2e/test_e2e_parallel.pyis now four per-modalityfiles. When onboarding a new model, add the
pytest.param(...)tothe matching
test_e2e_parallel_{text,vlm,omni,dit}.py. Seeupdated
docs/testing.mdanddocs/transformers_v5/testing_new_model.md.Design & Code Changes
Commit-by-commit (newest → oldest, each is its own logical unit):
1.
[test] refactor: consolidate tests/toy_config + tests/testdata under tests/fixtures/tests/toy_config/→tests/fixtures/toy_config/(17*_toydirs).tests/testdata/→tests/fixtures/testdata/(1 sample image).tests/,veomni/data/dummy_dataset.py,docs/,.agents/,.gitignore).2.
[test] refactor: split tests/e2e/test_e2e_parallel.py by modalitytests/e2e/_harness.pyownsmain(),_materialize_weights_dir(),DEFAULT_RTOL/ATOL, and thev4_only/v5_only/dit_onlymarks.test_e2e_parallel_text.py,_vlm.py,_omni.py,_dit.py.pytestmark = [e2e, dit]so the GPU e2e runnercan pick it up via
-m "e2e and dit"in a dedicated--extra ditstep.docs/,.agents/skills/veomni-migrate-transformers-v5/SKILL.md, anddocs/transformers_v5/testing_new_model.mdupdated to reflect thenew layout and import names.
3.
[ci] refactor: consolidate static checks + move device-api scriptcheck_doc_task_paths.yml,device_api_check.yml,check_patchgen.yml) into a singlestatic_checks.ymlwith parallel jobs.patchgenjob is now path-gated: it only runs when files underveomni/models/transformers/**change (was always-on before).tests/special_sanity/check_device_api_usage.py→scripts/ci/check_device_api_usage.py(it is a standalone script, not a pytest test). Removed the empty
tests/special_sanity/directory.4.
[ci] refactor: convert 4 top-level test workflows to thin runner callersgpu_unit_tests.yml,npu_unit_tests.yml,gpu_e2e_test.yml,npu_e2e_test.ymlare now each ~40 lines: just declare which markerto run and which transformers variant, then
uses:the reusablerunner. Container image, volumes, NFS mounts, dep-sync logic live in
one place.
5.
[ci] feat: add reusable _test_runner_{gpu,npu}.yml workflowsworkflow_call) own the full GPU/NPUtest execution pipeline.
marker(e.g.unit,e2e),transformers_variant(
stable/v5/both),run_dit_step(GPU only).transformers_variant=bothruns the same selection twice — firstunder the
transformers-stablegroup, then after re-syncing with--extra transformers5-exp. When v4 is fully deprecated, callersswitch to
v5and the v4 branch becomes dead code that can bedropped in one commit.
run_dit_stepadds a third step that syncs--extra ditand runspytest -m "<marker> and dit"— only the GPU e2e caller uses it.6.
[test] feat: add pytest markers + centralized conftest for tests/pyproject.toml:unit,e2e,gpu_only,npu_only,v5_only,dit,benchmark.tests/conftest.py:ci_models_dir,ci_samples_dir,ci_dataset_dirsourced from env vars the reusable workflows set.pytest_collection_modifyitemshook applies device-appropriateskips (
gpu_only/npu_onlybased onVEOMNI_TEST_DEVICE) andtransformers-version skips (
v5_onlybased onis_transformers_version_greater_or_equal_to("5.0.0")).tests/**/*.pywithpytestmark = [pytest.mark.<unit|e2e>, ...]so CI can select by marker instead of file path.
tests/e2e/conftest.pyconsolidating the sixdummy_*_datasetsession fixtures previously duplicated inside themonolithic e2e file.
7.
[ci, test] fix: repair broken CI steps and drop orphan e2e training testsgpu_unit_tests.yml: fixed a prior commit that had truncated thedata-step command to
uv run --froze(missing the rest of the line).npu_unit_tests.yml: removed anif: steps.changes.outputs.qwen3_5_ulysses == 'true'that referenced a step which no longer exists, silently skipping the test.
tests/e2e/test_e2e_training.py,tests/e2e/test_e2e_training_no_reshard.py,tests/e2e/exec_scripts.py— import-broken, never invoked by CI, andsuperseded by
tests/train_scripts/. All doc references updated.Checklist Before Submitting
make qualityclean)docs/testing.md,docs/transformers_v5/*.md,docs/usage/support_new_models/*.md,.agents/knowledge/architecture.md,.agents/skills/*/SKILL.md)tasks/training scripts moved or renamed in this PR-m unit/-m e2eautomatically runs.