Skip to content

test(e2e): cache workers + tp=1 for gpt-oss-20b/Qwen2.5-14B + move responses & chat to 1-GPU#1502

Open
key4ng wants to merge 3 commits into
mainfrom
fix/e2e-2gpu-responses-worker-cache
Open

test(e2e): cache workers + tp=1 for gpt-oss-20b/Qwen2.5-14B + move responses & chat to 1-GPU#1502
key4ng wants to merge 3 commits into
mainfrom
fix/e2e-2gpu-responses-worker-cache

Conversation

@key4ng
Copy link
Copy Markdown
Collaborator

@key4ng key4ng commented May 18, 2026

Summary

The e2e-2gpu-responses CI job hits its 25-min step timeout because the SGLang worker is destroyed and recreated between every test class. Run 25983256186 shows ~10 worker boots (gpt-oss-20b and Qwen2.5-14B-Instruct interleaved) at 1.5–3 min each — ~20 of 24 min spent in startup.

Two root causes and a downstream cleanup:

  • Session-scoped WorkerPool (e2e_test/infra/worker_pool.py): one-slot cache keyed on (engine, model_id, mode, worker_type). setup_backend._setup_local and responses/conftest._start_local_grpc_gateway_with_mcp now pool.acquire(...) and drop their stop_workers(...) teardown — workers live until session end (cleanup via pytest_sessionfinish + atexit). PD prefill/decode bypass the cache.
  • Cluster items by (backend, model) in pytest_collection_modifyitems (fixtures/hooks.py) so consecutive classes hit the cache instead of evicting it.
  • Drop tp: 2 → 1 for Qwen/Qwen2.5-14B-Instruct and openai/gpt-oss-20b in infra/model_specs.py. Qwen2.5-14B BF16 (~28 GB) and gpt-oss-20b MXFP4 (~13 GB) both fit on one H100/80GB; tp=2 was paying NCCL setup on every restart. E2E_MODEL_TP_OVERRIDES still works.

Downstream CI cleanup forced by the tp drop:

  • e2e-2gpu-responsese2e-1gpu-responses (gpu_tier 1, 1-gpu-h100 runner). @pytest.mark.gpu(2)gpu(1) on the 15 affected classes in e2e_test/responses/.
  • Retire e2e-2gpu-chat. The 4 chat_completions classes that used the now-tp=1 models flip to gpu(1) and run under e2e-1gpu-chat. Job-level timeouts bump to 60/45/120 min (sglang/vllm/trtllm); pytest step timeout grows to 45/35/90 via a new test_timeout input on e2e-gpu-job.yml.
  • ci_download_model.sh resolver tp == tiertp <= tier. Without this, e2e-2gpu-pd (PD with tp=1 Llama-3.1-8B on 2 GPUs) hits the same "No models resolved for GPU tier 2" error. PD tests keep gpu(2) because they really do need 2 GPUs at the test level.

Expected effect on the renamed e2e-1gpu-responses: at most one boot per model per session.

Files

  • e2e_test/infra/worker_pool.py (new)WorkerPool + get_pool / cleanup_pool, thread-safe one-slot cache with atexit hook.
  • e2e_test/infra/__init__.py — export pool symbols.
  • e2e_test/infra/model_specs.pytp: 2 → 1 for the two models, with rationale comments.
  • e2e_test/fixtures/setup_backend.py_setup_local acquires from pool; _start_workers_tracked grows a use_pool flag.
  • e2e_test/fixtures/hooks.py_pool_sort_key clusters by (backend, model, nodeid); new pytest_sessionfinish calls cleanup_pool().
  • e2e_test/fixtures/__init__.py, e2e_test/conftest.py — re-export pytest_sessionfinish.
  • e2e_test/responses/conftest.py_start_local_grpc_gateway_with_mcp uses the pool; both gRPC fixtures drop stop_workers on teardown.
  • e2e_test/responses/test_*.py (7 files) — gpu(2)gpu(1) on 15 classes.
  • e2e_test/chat_completions/test_*.py (4 files) — gpu(2)gpu(1) on 4 classes.
  • .github/workflows/pr-test-rust.yml — rename responses job to 1-GPU; delete e2e-2gpu-chat; bump e2e-1gpu-chat timeouts; update finish job's needs/aggregator.
  • .github/workflows/e2e-gpu-job.yml — new test_timeout input (default 25) feeding the pytest step.
  • scripts/ci_download_model.sh — resolver uses tp <= tier.

Test plan

  • ruff check e2e_test/ — passes.
  • ruff format --check on the touched files — passes (pre-existing format drift in untouched files left alone).
  • mypy e2e_test/ on touched files — passes.
  • pre-commit run --files <touched> — passes.
  • Resolver sanity: --gpu-tier 1 → 11 models, --gpu-tier 2 → 11 (same set, including the previously-tp=2 ones), --gpu-tier 4 → 16. Tier 2 no longer empty.
  • Marker sanity: only PD tests in e2e_test/router/ keep gpu(2); the rest of the repo is now gpu(0) (cloud) or gpu(1).
  • Import smoke: conftest.py and responses/conftest.py import cleanly (with smg_client stubbed); _pool_sort_key clusters as expected.
  • Real signal: the renamed e2e-1gpu-responses job finishes under its 30-min cap, log shows ≤ 1 Starting sglang grpc worker for openai/gpt-oss-20b and ≤ 1 for Qwen/Qwen2.5-14B-Instruct. e2e-1gpu-chat matrix finishes under its bumped timeouts. e2e-2gpu-pd keeps downloading its tp=1 model.

Risk: _pool_sort_key reorders pytest collection. Verified no pytest-ordering markers and no implicit cross-test ordering deps in e2e_test/. Risk for e2e-1gpu-chat: now runs strictly more tests; defensive timeout bumps + worker-pool savings should leave headroom, but if any matrix slot busts the 45/35/90-min step cap we'll need a follow-up.

Checklist
  • cargo +nightly fmt passes (no Rust changes)
  • cargo clippy --all-targets --all-features -- -D warnings passes (no Rust changes)
  • ruff / ruff format / mypy / pre-commit pass on touched files
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • New Features

    • Configurable E2E test timeout for CI runs.
  • Improvements

    • Reduced GPU requirements: many E2E tests now run with 1 GPU instead of 2.
    • Introduced session-wide worker pooling to reuse workers across test classes and improve efficiency.
    • Pool cleanup at session teardown to ensure orderly worker shutdown.
  • Changes

    • Consolidated 2-GPU chat coverage into the 1-GPU job and removed the 2-GPU job.

Review Change Stack

… and Qwen2.5-14B

The e2e-2gpu-responses CI job has been timing out at 25 min because
class-scoped fixtures (`setup_backend` and `gateway_with_mock_mcp_grpc_*`)
fully tear down and re-launch the SGLang worker between every test class.
Run 25983256186 spent ~20 of 24 min of "Run E2E tests" on worker startup
alone, alternating between gpt-oss-20b and Qwen2.5-14B in non-grouped
order.

Two changes:

1. Introduce a session-scoped `WorkerPool` (`infra/worker_pool.py`) that
   caches workers keyed on `(engine, model_id, mode, worker_type)`. Class
   fixtures call `pool.acquire(...)` instead of `start_workers(...)` and
   skip `stop_workers` on teardown — workers survive until session end.
   `pytest_sessionfinish` + `atexit` guarantee cleanup. PD prefill/decode
   workers bypass the cache (rare, hold multiple workers).
   `pytest_collection_modifyitems` now sorts items by `(backend, model)`
   so same-model classes cluster, making cache hits the common case.

2. Drop `tp` from 2 → 1 in `model_specs.py` for `Qwen/Qwen2.5-14B-Instruct`
   and `openai/gpt-oss-20b`. Qwen2.5-14B BF16 (~28GB) and gpt-oss-20b
   MXFP4 (~13GB) both fit comfortably on a single H100/80GB. NCCL setup
   was paying for tp=2 on every restart. Can be overridden via the
   existing `E2E_MODEL_TP_OVERRIDES` env var.

Combined effect on e2e-2gpu-responses: instead of ~5 gpt-oss-20b boots
+ ~5 Qwen2.5-14B boots interleaved at ~2-3 min each, the suite should
see one boot per model after sorting.

Signed-off-by: key4ng <rukeyang@gmail.com>
@github-actions github-actions Bot added the tests Test changes label May 18, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6068ff58-9011-4209-82cd-7de10b65b216

📥 Commits

Reviewing files that changed from the base of the PR and between 9220332 and e0147bd.

📒 Files selected for processing (5)
  • e2e_test/fixtures/hooks.py
  • e2e_test/fixtures/setup_backend.py
  • e2e_test/infra/worker_pool.py
  • e2e_test/responses/conftest.py
  • e2e_test/responses/test_image_generation.py

📝 Walkthrough

Walkthrough

Adds a session-scoped WorkerPool with acquire/cleanup, wires pytest_sessionfinish to call cleanup_pool, always sorts collected tests for pool affinity, integrates pooling into backend/response fixtures, updates two model tp values to 1, adjusts CI job timeouts/graph, and changes many test GPU markers from 2→1.

Changes

Worker Pool and Session Lifecycle

Layer / File(s) Summary
Worker Pool Infrastructure
e2e_test/infra/worker_pool.py, e2e_test/infra/__init__.py
New WorkerPool singleton with a single-slot cache keyed by engine/model/mode/worker_type/count, acquire()/cleanup(), and exported WorkerPool/get_pool/cleanup_pool.
Session-Finish Cleanup and Hook Wiring
e2e_test/fixtures/hooks.py, e2e_test/fixtures/__init__.py, e2e_test/conftest.py
Adds pytest_sessionfinish that calls cleanup_pool(), re-exports the hook via fixtures, and re-exports/imports it in conftest.py for pytest discovery.
Test Collection and Ordering
e2e_test/fixtures/hooks.py
pytest_collection_modifyitems always applies _pool_sort_key ordering (backend, model, nodeid) and preserves optional env-var filtering when set.
Backend Fixture Pool Integration
e2e_test/fixtures/setup_backend.py
_start_workers_tracked now acquires regular workers via get_pool().acquire(); _setup_local/backend_router acquire from the pool and no longer stop pooled workers on class teardown.
Response Fixture Pool Integration
e2e_test/responses/conftest.py
_start_local_grpc_gateway_with_mcp and SGLang/vLLM fixtures now acquire pooled workers and avoid stopping pooled workers on teardown; helper ensures gateway.shutdown() on failure.
Model Configuration Updates
e2e_test/infra/model_specs.py
tp changed from 21 for Qwen/Qwen2.5-14B-Instruct and openai/gpt-oss-20b with inline override guidance.
CI Workflow Wiring and Timeouts
.github/workflows/e2e-gpu-job.yml, .github/workflows/pr-test-rust.yml
Adds workflow_call input test_timeout and uses it for step timeout; pr-test-rust rehomes 2-GPU chat coverage to 1-GPU jobs, updates timeouts, runners, and finish-job dependencies/failure gating.
Test GPU Marker Updates
e2e_test/*/test_*.py
Multiple tests updated their @pytest.mark.gpu decorators from 2 to 1 to align with pooled single-GPU runs and model tp changes.
CI Model Resolution Script
scripts/ci_download_model.sh
resolve_models_for_tier now selects models with tp <= tier instead of tp == tier.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • lightseekorg/smg#977: Both adjust e2e-gpu-job timeout handling—this PR makes the pytest step timeout configurable.
  • lightseekorg/smg#569: Related work on marker-resolution helpers used by pytest collection hooks.
  • lightseekorg/smg#643: Overlapping pytest hook changes affecting session teardown and collection behavior.

Suggested reviewers

  • CatherineSue
  • slin1237

Poem

🐰 I tunneled code through CI night,
A pool to keep the workers tight,
Tests line up in ordered rows,
Session ends — the cleanup goes,
Hopping on, the E2E delight.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely summarizes the three main changes: worker caching, tp parameter reduction for two models, and moving tests from 2-GPU to 1-GPU runners.
Docstring Coverage ✅ Passed Docstring coverage is 81.25% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/e2e-2gpu-responses-worker-cache

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread e2e_test/infra/worker_pool.py Outdated
Comment on lines +90 to +97

if self._key == key and len(self._workers) >= count:
logger.info(
"WorkerPool: reusing %d cached worker(s) for %s",
count,
key,
)
return list(self._workers[:count])
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Important: The cache-hit path doesn't verify that cached workers are still alive. If a worker process crashes during a test, every subsequent class sharing that backend key will receive the dead worker from the pool. This defeats --reruns (retries hit the same dead cache) and causes cascading failures across all remaining classes for that backend — the fail-fast counter in _start_workers_tracked never increments because no start_workers call is made.

Worker.is_alive() already exists; a one-line liveness gate before the reuse branch would fix it:

Suggested change
if self._key == key and len(self._workers) >= count:
logger.info(
"WorkerPool: reusing %d cached worker(s) for %s",
count,
key,
)
return list(self._workers[:count])
if self._key == key and len(self._workers) >= count and all(
w.is_alive() for w in self._workers[:count]
):
logger.info(
"WorkerPool: reusing %d cached worker(s) for %s",
count,
key,
)
return list(self._workers[:count])

Comment thread e2e_test/infra/worker_pool.py Outdated
Comment on lines +139 to +143
global _POOL
with _POOL_LOCK:
if _POOL is None or _POOL._closed:
_POOL = WorkerPool()
atexit.register(_POOL.cleanup)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Nit: Each time the pool is closed and then get_pool() is called again, a new atexit handler is registered while the old one still references the (now-closed) previous pool. cleanup() is idempotent so this is harmless at runtime, but the handlers accumulate. Consider unregistering the old handler, or guarding with a module-level flag:

Suggested change
global _POOL
with _POOL_LOCK:
if _POOL is None or _POOL._closed:
_POOL = WorkerPool()
atexit.register(_POOL.cleanup)
if _POOL is None or _POOL._closed:
_POOL = WorkerPool()
atexit.unregister(_POOL.cleanup)
atexit.register(_POOL.cleanup)

(In practice this only matters if something triggers a close-then-reacquire cycle, which normal pytest flow won't hit — so this is cosmetic.)

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a1fdda8f4d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread e2e_test/infra/worker_pool.py Outdated
Comment on lines +91 to +97
if self._key == key and len(self._workers) >= count:
logger.info(
"WorkerPool: reusing %d cached worker(s) for %s",
count,
key,
)
return list(self._workers[:count])
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate cached workers before reusing them

When a cached worker process dies after a previous class (for example from an OOM or backend crash), this branch reuses it solely because the key and count match and returns a dead worker URL. Since the new collection ordering clusters same-model classes, every later class with the same key will keep timing out against the same dead process until a different key evicts it or the session ends; previously each class restarted workers. Check is_alive()/health and evict/restart before returning cached workers.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
e2e_test/fixtures/setup_backend.py (1)

195-202: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Include count in the pool reuse contract.

@pytest.mark.workers(count=...) is still part of this fixture’s public behavior, but the pool contract described in this PR only keys reuse on engine/model_id/mode/worker_type. That means a class asking for count=2 after a count=1 class on the same backend will reuse the 1-worker entry and run with the wrong topology. Please either include count in the pool key or make acquire() reject/evict on count mismatch.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@e2e_test/fixtures/setup_backend.py` around lines 195 - 202, The pool reuse
logic for starting workers currently ignores the requested worker count, causing
_start_workers_tracked to reuse a pool keyed only by
engine/model_id/mode/worker_type and produce wrong topologies; update the pool
key generation (where pools are indexed for reuse) to include the requested
count parameter or, alternatively, modify the pool acquisition path (the
acquire/checkout function used by _start_workers_tracked) to compare the
requested count against the existing pool's count and either reject/evict the
existing entry and recreate a matching pool or raise an error so callers must
recreate; ensure you update any key-building helpers and tests that rely on the
pool identity to reference the new count-inclusive key or the new acquire
behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@e2e_test/infra/worker_pool.py`:
- Around line 85-118: The lock (_lock) is intentionally held across the
long-running start_workers(...) call to keep the cache update of self._key and
self._workers atomic with respect to other callers; add a concise explanatory
comment near the critical section (around the with self._lock: block or
immediately before calling start_workers) stating that the lock is intentionally
held despite start_workers being blocking (CI runs sequentially today) and
noting the potential contention if parallel usage is introduced in future so
maintainers understand this is by design; reference _lock, start_workers,
_evict_locked, _workers, and _key in the comment for clarity.
- Around line 137-144: The current get_pool() registers _POOL.cleanup with
atexit every time a new WorkerPool is created, which can accumulate handlers;
instead register a single module-level atexit handler that calls cleanup_pool()
(or ensure registration happens only once) and remove per-instance
atexit.register(_POOL.cleanup) from get_pool(); modify get_pool(), _POOL, and
the module init to either call atexit.register(cleanup_pool) once (or guard
registration with a module boolean like _ATEXIT_REGISTERED) so subsequent
recreations of _POOL do not add duplicate atexit handlers.

In `@e2e_test/responses/conftest.py`:
- Around line 310-320: Wrap the call to Gateway.start in a try/except (or
try/finally) so that if gateway.start(...) raises after the Gateway instance has
been created, the gateway is cleanly shut down (call the appropriate teardown
method such as gateway.shutdown() / gateway.stop() / gateway.close()) before
re-raising the exception; ensure this interacts correctly with the existing
cleanup that currently runs only when openai.OpenAI(...) raises so no gateway
instance is leaked into subsequent tests.

---

Outside diff comments:
In `@e2e_test/fixtures/setup_backend.py`:
- Around line 195-202: The pool reuse logic for starting workers currently
ignores the requested worker count, causing _start_workers_tracked to reuse a
pool keyed only by engine/model_id/mode/worker_type and produce wrong
topologies; update the pool key generation (where pools are indexed for reuse)
to include the requested count parameter or, alternatively, modify the pool
acquisition path (the acquire/checkout function used by _start_workers_tracked)
to compare the requested count against the existing pool's count and either
reject/evict the existing entry and recreate a matching pool or raise an error
so callers must recreate; ensure you update any key-building helpers and tests
that rely on the pool identity to reference the new count-inclusive key or the
new acquire behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1ec63284-af5b-48c5-bc73-956149e4afc8

📥 Commits

Reviewing files that changed from the base of the PR and between fb1eebf and a1fdda8.

📒 Files selected for processing (8)
  • e2e_test/conftest.py
  • e2e_test/fixtures/__init__.py
  • e2e_test/fixtures/hooks.py
  • e2e_test/fixtures/setup_backend.py
  • e2e_test/infra/__init__.py
  • e2e_test/infra/model_specs.py
  • e2e_test/infra/worker_pool.py
  • e2e_test/responses/conftest.py

Comment thread e2e_test/infra/worker_pool.py
Comment thread e2e_test/infra/worker_pool.py
Comment thread e2e_test/responses/conftest.py Outdated
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a session-scoped worker pool to cache and reuse workers across E2E test classes, optimizing GPU resource usage and reducing startup overhead. It introduces test ordering based on backend and model to maximize cache efficiency and adds a session teardown hook for resource cleanup. Review feedback highlights critical improvements for resource management, including ensuring existing workers are evicted when bypassing the cache, incorporating worker counts into the cache key to prevent process leaks, and refining the lifecycle management of the pool's cleanup handlers.

Comment thread e2e_test/infra/worker_pool.py Outdated
Comment on lines +73 to +83
if worker_type != WorkerType.REGULAR:
# PD prefill/decode bypass the cache (see module docstring).
return start_workers(
model_id=model_id,
engine=engine,
mode=mode,
count=count,
worker_type=worker_type,
timeout=timeout,
log_dir=log_dir,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

When bypassing the cache for non-regular workers (like PD prefill/decode), the pool must still evict any existing cached worker. Since GPU resources are exclusive and the pool is designed as a 'one-slot' cache, failing to stop the cached worker before starting new ones will lead to GPU resource contention, OOM errors, or NCCL initialization failures.

Suggested change
if worker_type != WorkerType.REGULAR:
# PD prefill/decode bypass the cache (see module docstring).
return start_workers(
model_id=model_id,
engine=engine,
mode=mode,
count=count,
worker_type=worker_type,
timeout=timeout,
log_dir=log_dir,
)
if worker_type != WorkerType.REGULAR:
# PD prefill/decode bypass the cache (see module docstring).
# We must still cleanup any existing pooled worker to free GPU resources.
self.cleanup()
return start_workers(
model_id=model_id,
engine=engine,
mode=mode,
count=count,
worker_type=worker_type,
timeout=timeout,
log_dir=log_dir,
)
References
  1. Ensure that all acquired resources are tracked immediately and consistently to prevent resource leaks.

Comment thread e2e_test/fixtures/setup_backend.py Outdated
Comment on lines 61 to 63
if use_pool:
return get_pool().acquire(**kwargs)
return start_workers(**kwargs)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

When use_pool is False, we must still ensure that any existing pooled worker is stopped to avoid GPU resource conflicts. This is critical for tests that bypass the pool (like those using backend_router or PD workers) but still run in the same session.

Suggested change
if use_pool:
return get_pool().acquire(**kwargs)
return start_workers(**kwargs)
if use_pool:
return get_pool().acquire(**kwargs)
cleanup_pool()
return start_workers(**kwargs)
References
  1. Ensure that all acquired resources are tracked immediately and consistently to prevent resource leaks.

Comment thread e2e_test/infra/worker_pool.py Outdated
logger = logging.getLogger(__name__)


_PoolKey = tuple[str, str, ConnectionMode, WorkerType]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The _PoolKey should include the count of workers. Currently, if a test class requests 2 workers and a subsequent class requests 1 worker for the same model/engine, the pool will reuse the 2 workers but only return 1, effectively leaking a worker process and its associated GPU resources for the duration of that test class. Including count in the key ensures that a change in the required worker topology triggers a proper eviction and restart.

Suggested change
_PoolKey = tuple[str, str, ConnectionMode, WorkerType]
_PoolKey = tuple[str, str, ConnectionMode, WorkerType, int]
References
  1. When a function is expected to return a specific number of resources, validate that the exact number of requested resources was obtained.

Comment thread e2e_test/infra/worker_pool.py Outdated
Comment on lines +89 to +97
key: _PoolKey = (engine, model_id, mode, worker_type)

if self._key == key and len(self._workers) >= count:
logger.info(
"WorkerPool: reusing %d cached worker(s) for %s",
count,
key,
)
return list(self._workers[:count])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Update the key generation and reuse logic to include the worker count. This simplifies the check and ensures that we only reuse workers when the requested topology matches exactly.

Suggested change
key: _PoolKey = (engine, model_id, mode, worker_type)
if self._key == key and len(self._workers) >= count:
logger.info(
"WorkerPool: reusing %d cached worker(s) for %s",
count,
key,
)
return list(self._workers[:count])
key: _PoolKey = (engine, model_id, mode, worker_type, count)
if self._key == key:
logger.info(
"WorkerPool: reusing %d cached worker(s) for %s",
count,
key,
)
return list(self._workers)
References
  1. When a function is expected to return a specific number of resources, validate that the exact number of requested resources was obtained.

Comment on lines +137 to +144
def get_pool() -> WorkerPool:
"""Return the session-wide worker pool, creating it on first use."""
global _POOL
with _POOL_LOCK:
if _POOL is None or _POOL._closed:
_POOL = WorkerPool()
atexit.register(_POOL.cleanup)
return _POOL
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Registering the atexit handler inside get_pool can lead to multiple registrations if the pool is closed and recreated (e.g., after a bypass or explicit cleanup). Since atexit holds references to the bound methods, this can also cause a minor memory leak. It is better to register a module-level cleanup function once. Additionally, ensure the cleanup is protected by a lock.

Suggested change
def get_pool() -> WorkerPool:
"""Return the session-wide worker pool, creating it on first use."""
global _POOL
with _POOL_LOCK:
if _POOL is None or _POOL._closed:
_POOL = WorkerPool()
atexit.register(_POOL.cleanup)
return _POOL
def get_pool() -> WorkerPool:
"""Return the session-wide worker pool, creating it on first use."""
global _POOL
with _POOL_LOCK:
if _POOL is None or _POOL._closed:
_POOL = WorkerPool()
return _POOL
def cleanup_pool() -> None:
"""Tear down the session-wide pool if it exists. Called from session-end hook."""
with _POOL_LOCK:
if _POOL is not None:
_POOL.cleanup()
atexit.register(cleanup_pool)
References
  1. When closing a shared resource that can be accessed by other threads, protect the close operation and the nullification of the resource reference with a lock to prevent race conditions during shutdown.

Comment thread e2e_test/fixtures/hooks.py Outdated
callspec = getattr(item, "callspec", None)
if callspec is not None:
params = getattr(callspec, "params", {}) or {}
backend = str(params.get("setup_backend", ""))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The sort key should also consider the backend_router fixture parameter. While backend_router currently bypasses the pool, clustering tests that use it (and thus share a backend/model) still improves overall session stability and makes future pooling of these tests easier.

Suggested change
backend = str(params.get("setup_backend", ""))
backend = str(params.get("setup_backend", params.get("backend_router", "")))

Follow-up to the tp=1 drop on gpt-oss-20b and Qwen2.5-14B-Instruct: the
2-GPU CI jobs that relied on those models can now run on a single H100,
and `ci_download_model.sh --gpu-tier 2` was returning an empty list
because no model in MODEL_SPECS has tp=2 anymore.

Workflow changes:

- Rename `e2e-2gpu-responses` -> `e2e-1gpu-responses` (gpu_tier 1,
  1-gpu-h100 runner). Flip `@pytest.mark.gpu(2)` -> `gpu(1)` on the 15
  affected classes in `e2e_test/responses/`.
- Retire `e2e-2gpu-chat`. Flip `@pytest.mark.gpu(2)` -> `gpu(1)` on the
  4 classes in `e2e_test/chat_completions/` that used the now-tp=1
  models. The merged set runs under `e2e-1gpu-chat`; job-level timeouts
  bumped to 60/45/120 min (sglang/vllm/trtllm).
- `e2e-gpu-job.yml` grows an optional `test_timeout` input (default 25)
  feeding the pytest step's `timeout-minutes`. `e2e-1gpu-chat` passes
  45/35/90 to accommodate the absorbed tests.
- `finish` job's `needs:` and failure aggregator updated to drop the
  retired jobs and add `e2e-1gpu-responses`.

Script:

- `ci_download_model.sh` resolver now uses `tp <= tier` instead of
  `tp == tier`. Tier-N runners have N GPUs and can host any model
  needing <= N. Without this, `e2e-2gpu-pd` (which runs PD with tp=1
  Llama-3.1-8B on 2 GPUs) would also fail with "No models resolved for
  GPU tier 2". PD tests keep their `@pytest.mark.gpu(2)` marker
  because they genuinely need 2 GPUs at the test level.

Signed-off-by: key4ng <rukeyang@gmail.com>
@key4ng key4ng requested a review from gongwei-130 as a code owner May 18, 2026 05:50
@github-actions github-actions Bot added the ci CI/CD configuration changes label May 18, 2026
@key4ng key4ng changed the title test(e2e): cache workers across classes; drop tp=1 for gpt-oss-20b & Qwen2.5-14B test(e2e): cache workers + tp=1 for gpt-oss-20b/Qwen2.5-14B + move responses & chat to 1-GPU May 18, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
e2e_test/responses/test_image_generation.py (1)

520-534: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update stale GPU-routing comment block near these marker changes.

The nearby note still says these classes must stay on gpu(2) and that no 1‑GPU Responses lane exists, which now contradicts the updated markers and CI intent. Please update/remove that block to prevent future misrouting confusion.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@e2e_test/responses/test_image_generation.py` around lines 520 - 534, Update
the stale GPU-routing comment block that contradicts the current pytest markers
around TestImageGenerationGrpcSglang: locate the comment referencing "must stay
on gpu(2)" and "no 1‑GPU Responses lane exists" near the class
TestImageGenerationGrpcSglang and the surrounding pytest.mark.gpu decorators,
and either remove it or revise it to reflect the new markers (gpu(1)) and CI
intent so the comment no longer conflicts with the decorators and won't misroute
future tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/e2e-gpu-job.yml:
- Around line 22-26: Add a runtime guard that validates the declared input
contract: compare the test_timeout input against the job_timeout input and fail
fast with a clear error when test_timeout is not strictly less than job_timeout.
Concretely, add an early workflow step (e.g., a run step named "Validate
timeouts") that reads github.event.inputs.test_timeout and
github.event.inputs.job_timeout (or the corresponding env vars) and exits
non‑zero with an explanatory message if test_timeout >= job_timeout; apply the
same validation where the other test_timeout definition is used. Ensure the step
runs before starting pytest so misconfigured callers get an immediate, readable
failure.

---

Outside diff comments:
In `@e2e_test/responses/test_image_generation.py`:
- Around line 520-534: Update the stale GPU-routing comment block that
contradicts the current pytest markers around TestImageGenerationGrpcSglang:
locate the comment referencing "must stay on gpu(2)" and "no 1‑GPU Responses
lane exists" near the class TestImageGenerationGrpcSglang and the surrounding
pytest.mark.gpu decorators, and either remove it or revise it to reflect the new
markers (gpu(1)) and CI intent so the comment no longer conflicts with the
decorators and won't misroute future tests.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9d635a93-43bc-4457-a6fe-977badf7015c

📥 Commits

Reviewing files that changed from the base of the PR and between a1fdda8 and 9220332.

📒 Files selected for processing (14)
  • .github/workflows/e2e-gpu-job.yml
  • .github/workflows/pr-test-rust.yml
  • e2e_test/chat_completions/test_function_calling.py
  • e2e_test/chat_completions/test_openai_server.py
  • e2e_test/chat_completions/test_structured_output.py
  • e2e_test/chat_completions/test_validation.py
  • e2e_test/responses/test_builtin_tools.py
  • e2e_test/responses/test_image_generation.py
  • e2e_test/responses/test_sampling_params.py
  • e2e_test/responses/test_state_management.py
  • e2e_test/responses/test_streaming_events.py
  • e2e_test/responses/test_structured_output.py
  • e2e_test/responses/test_tools_call.py
  • scripts/ci_download_model.sh

Comment on lines +22 to +26
test_timeout:
required: false
type: number
default: 25
description: "pytest step timeout in minutes (must be < job timeout)"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Enforce the timeout contract in workflow logic.

Line 26 states test_timeout must be lower than job timeout, but there’s no runtime guard. A bad caller value can create confusing timeout behavior.

Proposed fix
       # Run tests
+      - name: Validate timeout configuration
+        run: |
+          python3 - <<'PY'
+          job_timeout = float("${{ inputs.timeout }}")
+          test_timeout = float("${{ inputs.test_timeout }}")
+          if test_timeout >= job_timeout:
+              raise SystemExit(
+                  f"Invalid timeout config: test_timeout ({test_timeout}) must be < timeout ({job_timeout})"
+              )
+          PY
+
       - name: Run E2E tests
         timeout-minutes: ${{ inputs.test_timeout }}
         env:

Also applies to: 121-121

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/e2e-gpu-job.yml around lines 22 - 26, Add a runtime guard
that validates the declared input contract: compare the test_timeout input
against the job_timeout input and fail fast with a clear error when test_timeout
is not strictly less than job_timeout. Concretely, add an early workflow step
(e.g., a run step named "Validate timeouts") that reads
github.event.inputs.test_timeout and github.event.inputs.job_timeout (or the
corresponding env vars) and exits non‑zero with an explanatory message if
test_timeout >= job_timeout; apply the same validation where the other
test_timeout definition is used. Ensure the step runs before starting pytest so
misconfigured callers get an immediate, readable failure.

Addresses unresolved PR review comments on #1502:

- **Liveness gate**: cache reuse now requires every cached worker to pass
  ``Worker.is_alive()``. A dead worker is evicted before restart so a
  process crash doesn't poison every subsequent class in the cluster.
- **Count in key**: ``_PoolKey`` is now
  ``(engine, model_id, mode, worker_type, count)``. A class asking for
  ``workers(count=2)`` after a ``count=1`` class on the same backend
  triggers eviction instead of silently running with the wrong topology.
- **Evict on bypass**: non-REGULAR worker types (PD prefill/decode) now
  go through ``pool.acquire()`` too. The pool's bypass branch evicts any
  cached REGULAR worker first so a PD launch doesn't fight a stale
  cached process for the same GPUs. ``acquire()`` grew a ``gpu_offset``
  parameter forwarded to ``start_workers`` for the decode-after-prefill
  layout. Callers of non-REGULAR workers still own teardown via
  ``stop_workers``.
- **atexit dedup**: cleanup handler is registered once at module import
  instead of per-pool-instance, so close-then-recreate cycles don't
  accumulate handlers.
- **Gateway leak on start**: in
  ``responses/conftest._start_local_grpc_gateway_with_mcp``, wrap
  ``gateway.start()`` + client init in a single try/except so a failure
  from ``start()`` doesn't leak the gateway process.
- **Sort key picks up ``backend_router``**: ``_pool_sort_key`` falls
  back from ``setup_backend`` to ``backend_router`` parametrize value,
  so future function-scope tests cluster too. ``backend_router`` now
  acquires through the pool itself (function-scoped tests would
  otherwise race the class-scope cached worker for GPUs).
- **Docstrings**: refreshed ``setup_backend`` module docstring, dropped
  the stale ``# ``gpu(2)`` on the gRPC classes`` comment in
  ``test_image_generation.py`` that contradicted the new markers.

Signed-off-by: key4ng <rukeyang@gmail.com>
@key4ng
Copy link
Copy Markdown
Collaborator Author

key4ng commented May 18, 2026

Review feedback addressed in e0147bdd

Thread Status Resolution
Liveness gate (claude / codex / gemini @ worker_pool.py:97) ✅ Fixed Reuse branch now requires all(w.is_alive() for w in cached); dead workers trigger eviction + restart. Reason is logged as dead worker vs key mismatch.
count in pool key (gemini @ worker_pool.py:35/97, coderabbit outside-diff @ setup_backend.py:195-202) ✅ Fixed _PoolKey is now (engine, model_id, mode, worker_type, count). A class asking for workers(count=2) after count=1 evicts instead of silently returning a too-small slice.
Evict on bypass for non-REGULAR workers (gemini @ worker_pool.py:83) ✅ Fixed The worker_type != REGULAR branch now calls _evict_locked() before delegating to start_workers, so a PD launch can't race a stale cached worker for the same GPUs. PD prefill/decode go through pool.acquire() (callers still own teardown). acquire() gained a gpu_offset param for PD decode placement.
use_pool=False skipped eviction (gemini @ setup_backend.py:63) ✅ Fixed (by routing through pool) _start_workers_tracked now always goes through pool.acquire(); the pool's own bypass branch handles eviction. The use_pool flag is gone. backend_router also routes through the pool.
atexit handler dedup (claude / coderabbit / gemini @ worker_pool.py:143-144) ✅ Fixed Single module-level atexit.register(cleanup_pool) at import time; get_pool() no longer registers per-instance. Close-then-recreate cycles don't pile up handlers.
Gateway leak on start() failure (coderabbit @ responses/conftest.py:320) ✅ Fixed gateway.start(...) and client init are wrapped in a single try/except that calls gateway.shutdown() on any failure.
Sort key picks up backend_router (gemini @ hooks.py:169) ✅ Fixed _pool_sort_key falls back from setup_backend to backend_router param.
Stale gpu(2) comment in test_image_generation.py:506-516 (coderabbit outside-diff) ✅ Fixed Comment rewritten to reflect gpu(1) + e2e-1gpu-responses.
Lock held across start_workers (coderabbit @ worker_pool.py:118) ✅ Comment-only Added a docstring note explaining the lock is held intentionally; CI is sequential today, contention is a non-issue.
Validate test_timeout < timeout in workflow (coderabbit @ e2e-gpu-job.yml:26) ⏭️ Skipped The contract is documented in the input description; a misconfiguration manifests as a step that gets killed before the job times out, which is visible enough. Not worth a python validator step.

All ruff/format/pre-commit checks pass on the touched files.

Comment on lines +9 to +12
PD-disaggregation paths run through the pool too (so it can evict a
stale cached worker holding their GPUs) but the caller owns teardown of
the prefill/decode workers via ``stop_workers``. The function-scoped
``backend_router`` fixture intentionally bypasses the pool entirely.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Nit: The module docstring says backend_router "intentionally bypasses the pool entirely," but the fixture was changed in this push to route through get_pool().acquire() (line 357). The inline comment at line 354 correctly describes the new behavior. The docstring should match.

Suggested change
PD-disaggregation paths run through the pool too (so it can evict a
stale cached worker holding their GPUs) but the caller owns teardown of
the prefill/decode workers via ``stop_workers``. The function-scoped
``backend_router`` fixture intentionally bypasses the pool entirely.
PD-disaggregation paths run through the pool too (so it can evict a
stale cached worker holding their GPUs) but the caller owns teardown of
the prefill/decode workers via ``stop_workers``. The function-scoped
``backend_router`` fixture also routes through the pool (so it evicts
any cached class-scope worker) but creates its own gateway per test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci CI/CD configuration changes tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant