test(e2e): cache workers + tp=1 for gpt-oss-20b/Qwen2.5-14B + move responses & chat to 1-GPU by key4ng · Pull Request #1502 · lightseekorg/smg

key4ng · 2026-05-18T05:26:01Z

Summary

The e2e-2gpu-responses CI job hits its 25-min step timeout because the SGLang worker is destroyed and recreated between every test class. Run 25983256186 shows ~10 worker boots (gpt-oss-20b and Qwen2.5-14B-Instruct interleaved) at 1.5–3 min each — ~20 of 24 min spent in startup.

Two root causes and a downstream cleanup:

Session-scoped WorkerPool (e2e_test/infra/worker_pool.py): one-slot cache keyed on (engine, model_id, mode, worker_type). setup_backend._setup_local and responses/conftest._start_local_grpc_gateway_with_mcp now pool.acquire(...) and drop their stop_workers(...) teardown — workers live until session end (cleanup via pytest_sessionfinish + atexit). PD prefill/decode bypass the cache.
Cluster items by (backend, model) in pytest_collection_modifyitems (fixtures/hooks.py) so consecutive classes hit the cache instead of evicting it.
Drop tp: 2 → 1 for Qwen/Qwen2.5-14B-Instruct and openai/gpt-oss-20b in infra/model_specs.py. Qwen2.5-14B BF16 (~28 GB) and gpt-oss-20b MXFP4 (~13 GB) both fit on one H100/80GB; tp=2 was paying NCCL setup on every restart. E2E_MODEL_TP_OVERRIDES still works.

Downstream CI cleanup forced by the tp drop:

e2e-2gpu-responses → e2e-1gpu-responses (gpu_tier 1, 1-gpu-h100 runner). @pytest.mark.gpu(2) → gpu(1) on the 15 affected classes in e2e_test/responses/.
Retire e2e-2gpu-chat. The 4 chat_completions classes that used the now-tp=1 models flip to gpu(1) and run under e2e-1gpu-chat. Job-level timeouts bump to 60/45/120 min (sglang/vllm/trtllm); pytest step timeout grows to 45/35/90 via a new test_timeout input on e2e-gpu-job.yml.
ci_download_model.sh resolver tp == tier → tp <= tier. Without this, e2e-2gpu-pd (PD with tp=1 Llama-3.1-8B on 2 GPUs) hits the same "No models resolved for GPU tier 2" error. PD tests keep gpu(2) because they really do need 2 GPUs at the test level.

Expected effect on the renamed e2e-1gpu-responses: at most one boot per model per session.

Files

e2e_test/infra/worker_pool.py (new) — WorkerPool + get_pool / cleanup_pool, thread-safe one-slot cache with atexit hook.
e2e_test/infra/__init__.py — export pool symbols.
e2e_test/infra/model_specs.py — tp: 2 → 1 for the two models, with rationale comments.
e2e_test/fixtures/setup_backend.py — _setup_local acquires from pool; _start_workers_tracked grows a use_pool flag.
e2e_test/fixtures/hooks.py — _pool_sort_key clusters by (backend, model, nodeid); new pytest_sessionfinish calls cleanup_pool().
e2e_test/fixtures/__init__.py, e2e_test/conftest.py — re-export pytest_sessionfinish.
e2e_test/responses/conftest.py — _start_local_grpc_gateway_with_mcp uses the pool; both gRPC fixtures drop stop_workers on teardown.
e2e_test/responses/test_*.py (7 files) — gpu(2) → gpu(1) on 15 classes.
e2e_test/chat_completions/test_*.py (4 files) — gpu(2) → gpu(1) on 4 classes.
.github/workflows/pr-test-rust.yml — rename responses job to 1-GPU; delete e2e-2gpu-chat; bump e2e-1gpu-chat timeouts; update finish job's needs/aggregator.
.github/workflows/e2e-gpu-job.yml — new test_timeout input (default 25) feeding the pytest step.
scripts/ci_download_model.sh — resolver uses tp <= tier.

Test plan

ruff check e2e_test/ — passes.
ruff format --check on the touched files — passes (pre-existing format drift in untouched files left alone).
mypy e2e_test/ on touched files — passes.
pre-commit run --files <touched> — passes.
Resolver sanity: --gpu-tier 1 → 11 models, --gpu-tier 2 → 11 (same set, including the previously-tp=2 ones), --gpu-tier 4 → 16. Tier 2 no longer empty.
Marker sanity: only PD tests in e2e_test/router/ keep gpu(2); the rest of the repo is now gpu(0) (cloud) or gpu(1).
Import smoke: conftest.py and responses/conftest.py import cleanly (with smg_client stubbed); _pool_sort_key clusters as expected.
Real signal: the renamed e2e-1gpu-responses job finishes under its 30-min cap, log shows ≤ 1 Starting sglang grpc worker for openai/gpt-oss-20b and ≤ 1 for Qwen/Qwen2.5-14B-Instruct. e2e-1gpu-chat matrix finishes under its bumped timeouts. e2e-2gpu-pd keeps downloading its tp=1 model.

Risk: _pool_sort_key reorders pytest collection. Verified no pytest-ordering markers and no implicit cross-test ordering deps in e2e_test/. Risk for e2e-1gpu-chat: now runs strictly more tests; defensive timeout bumps + worker-pool savings should leave headroom, but if any matrix slot busts the 45/35/90-min step cap we'll need a follow-up.

Checklist

cargo +nightly fmt passes (no Rust changes)
cargo clippy --all-targets --all-features -- -D warnings passes (no Rust changes)
ruff / ruff format / mypy / pre-commit pass on touched files
(Optional) Documentation updated
(Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

New Features
- Configurable E2E test timeout for CI runs.
Improvements
- Reduced GPU requirements: many E2E tests now run with 1 GPU instead of 2.
- Introduced session-wide worker pooling to reuse workers across test classes and improve efficiency.
- Pool cleanup at session teardown to ensure orderly worker shutdown.
Changes
- Consolidated 2-GPU chat coverage into the 1-GPU job and removed the 2-GPU job.

… and Qwen2.5-14B The e2e-2gpu-responses CI job has been timing out at 25 min because class-scoped fixtures (`setup_backend` and `gateway_with_mock_mcp_grpc_*`) fully tear down and re-launch the SGLang worker between every test class. Run 25983256186 spent ~20 of 24 min of "Run E2E tests" on worker startup alone, alternating between gpt-oss-20b and Qwen2.5-14B in non-grouped order. Two changes: 1. Introduce a session-scoped `WorkerPool` (`infra/worker_pool.py`) that caches workers keyed on `(engine, model_id, mode, worker_type)`. Class fixtures call `pool.acquire(...)` instead of `start_workers(...)` and skip `stop_workers` on teardown — workers survive until session end. `pytest_sessionfinish` + `atexit` guarantee cleanup. PD prefill/decode workers bypass the cache (rare, hold multiple workers). `pytest_collection_modifyitems` now sorts items by `(backend, model)` so same-model classes cluster, making cache hits the common case. 2. Drop `tp` from 2 → 1 in `model_specs.py` for `Qwen/Qwen2.5-14B-Instruct` and `openai/gpt-oss-20b`. Qwen2.5-14B BF16 (~28GB) and gpt-oss-20b MXFP4 (~13GB) both fit comfortably on a single H100/80GB. NCCL setup was paying for tp=2 on every restart. Can be overridden via the existing `E2E_MODEL_TP_OVERRIDES` env var. Combined effect on e2e-2gpu-responses: instead of ~5 gpt-oss-20b boots + ~5 Qwen2.5-14B boots interleaved at ~2-3 min each, the suite should see one boot per model after sorting. Signed-off-by: key4ng <rukeyang@gmail.com>

coderabbitai · 2026-05-18T05:26:14Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6068ff58-9011-4209-82cd-7de10b65b216

📥 Commits

Reviewing files that changed from the base of the PR and between 9220332 and e0147bd.

📒 Files selected for processing (5)

e2e_test/fixtures/hooks.py
e2e_test/fixtures/setup_backend.py
e2e_test/infra/worker_pool.py
e2e_test/responses/conftest.py
e2e_test/responses/test_image_generation.py

📝 Walkthrough

Walkthrough

Adds a session-scoped WorkerPool with acquire/cleanup, wires pytest_sessionfinish to call cleanup_pool, always sorts collected tests for pool affinity, integrates pooling into backend/response fixtures, updates two model tp values to 1, adjusts CI job timeouts/graph, and changes many test GPU markers from 2→1.

Changes

Worker Pool and Session Lifecycle

Layer / File(s)	Summary
Worker Pool Infrastructure `e2e_test/infra/worker_pool.py`, `e2e_test/infra/__init__.py`	New `WorkerPool` singleton with a single-slot cache keyed by engine/model/mode/worker_type/count, `acquire()`/`cleanup()`, and exported `WorkerPool`/`get_pool`/`cleanup_pool`.
Session-Finish Cleanup and Hook Wiring `e2e_test/fixtures/hooks.py`, `e2e_test/fixtures/__init__.py`, `e2e_test/conftest.py`	Adds `pytest_sessionfinish` that calls `cleanup_pool()`, re-exports the hook via `fixtures`, and re-exports/imports it in `conftest.py` for pytest discovery.
Test Collection and Ordering `e2e_test/fixtures/hooks.py`	`pytest_collection_modifyitems` always applies `_pool_sort_key` ordering (backend, model, nodeid) and preserves optional env-var filtering when set.
Backend Fixture Pool Integration `e2e_test/fixtures/setup_backend.py`	`_start_workers_tracked` now acquires regular workers via `get_pool().acquire()`; `_setup_local`/`backend_router` acquire from the pool and no longer stop pooled workers on class teardown.
Response Fixture Pool Integration `e2e_test/responses/conftest.py`	`_start_local_grpc_gateway_with_mcp` and SGLang/vLLM fixtures now acquire pooled workers and avoid stopping pooled workers on teardown; helper ensures `gateway.shutdown()` on failure.
Model Configuration Updates `e2e_test/infra/model_specs.py`	`tp` changed from `2`→`1` for `Qwen/Qwen2.5-14B-Instruct` and `openai/gpt-oss-20b` with inline override guidance.
CI Workflow Wiring and Timeouts `.github/workflows/e2e-gpu-job.yml`, `.github/workflows/pr-test-rust.yml`	Adds `workflow_call` input `test_timeout` and uses it for step timeout; pr-test-rust rehomes 2-GPU chat coverage to 1-GPU jobs, updates timeouts, runners, and finish-job dependencies/failure gating.
Test GPU Marker Updates `e2e_test//test_.py`	Multiple tests updated their `@pytest.mark.gpu` decorators from `2` to `1` to align with pooled single-GPU runs and model tp changes.
CI Model Resolution Script `scripts/ci_download_model.sh`	`resolve_models_for_tier` now selects models with `tp <= tier` instead of `tp == tier`.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

lightseekorg/smg#977: Both adjust e2e-gpu-job timeout handling—this PR makes the pytest step timeout configurable.
lightseekorg/smg#569: Related work on marker-resolution helpers used by pytest collection hooks.
lightseekorg/smg#643: Overlapping pytest hook changes affecting session teardown and collection behavior.

Suggested reviewers

CatherineSue
slin1237

Poem

🐰 I tunneled code through CI night,
A pool to keep the workers tight,
Tests line up in ordered rows,
Session ends — the cleanup goes,
Hopping on, the E2E delight.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and concisely summarizes the three main changes: worker caching, tp parameter reduction for two models, and moving tests from 2-GPU to 1-GPU runners.
Docstring Coverage	✅ Passed	Docstring coverage is 81.25% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/e2e-2gpu-responses-worker-cache

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

claude · 2026-05-18T05:28:40Z

+
+            if self._key == key and len(self._workers) >= count:
+                logger.info(
+                    "WorkerPool: reusing %d cached worker(s) for %s",
+                    count,
+                    key,
+                )
+                return list(self._workers[:count])


🔴 Important: The cache-hit path doesn't verify that cached workers are still alive. If a worker process crashes during a test, every subsequent class sharing that backend key will receive the dead worker from the pool. This defeats --reruns (retries hit the same dead cache) and causes cascading failures across all remaining classes for that backend — the fail-fast counter in _start_workers_tracked never increments because no start_workers call is made.

Worker.is_alive() already exists; a one-line liveness gate before the reuse branch would fix it:

Suggested change

if self._key == key and len(self._workers) >= count:

logger.info(

"WorkerPool: reusing %d cached worker(s) for %s",

count,

key,

)

return list(self._workers[:count])

if self._key == key and len(self._workers) >= count and all(

w.is_alive() for w in self._workers[:count]

):

logger.info(

"WorkerPool: reusing %d cached worker(s) for %s",

count,

key,

)

return list(self._workers[:count])

claude · 2026-05-18T05:28:45Z

+    global _POOL
+    with _POOL_LOCK:
+        if _POOL is None or _POOL._closed:
+            _POOL = WorkerPool()
+            atexit.register(_POOL.cleanup)


🟡 Nit: Each time the pool is closed and then get_pool() is called again, a new atexit handler is registered while the old one still references the (now-closed) previous pool. cleanup() is idempotent so this is harmless at runtime, but the handlers accumulate. Consider unregistering the old handler, or guarding with a module-level flag:

Suggested change

global _POOL

with _POOL_LOCK:

if _POOL is None or _POOL._closed:

_POOL = WorkerPool()

atexit.register(_POOL.cleanup)

if _POOL is None or _POOL._closed:

_POOL = WorkerPool()

atexit.unregister(_POOL.cleanup)

atexit.register(_POOL.cleanup)

(In practice this only matters if something triggers a close-then-reacquire cycle, which normal pytest flow won't hit — so this is cosmetic.)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a1fdda8f4d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-18T05:29:43Z

+            if self._key == key and len(self._workers) >= count:
+                logger.info(
+                    "WorkerPool: reusing %d cached worker(s) for %s",
+                    count,
+                    key,
+                )
+                return list(self._workers[:count])


Validate cached workers before reusing them

When a cached worker process dies after a previous class (for example from an OOM or backend crash), this branch reuses it solely because the key and count match and returns a dead worker URL. Since the new collection ordering clusters same-model classes, every later class with the same key will keep timing out against the same dead process until a different key evicts it or the session ends; previously each class restarted workers. Check is_alive()/health and evict/restart before returning cached workers.

Useful? React with 👍 / 👎.

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

e2e_test/fixtures/setup_backend.py (1)
195-202: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Include count in the pool reuse contract.

@pytest.mark.workers(count=...) is still part of this fixture’s public behavior, but the pool contract described in this PR only keys reuse on engine/model_id/mode/worker_type. That means a class asking for count=2 after a count=1 class on the same backend will reuse the 1-worker entry and run with the wrong topology. Please either include count in the pool key or make acquire() reject/evict on count mismatch.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@e2e_test/fixtures/setup_backend.py` around lines 195 - 202, The pool reuse
logic for starting workers currently ignores the requested worker count, causing
_start_workers_tracked to reuse a pool keyed only by
engine/model_id/mode/worker_type and produce wrong topologies; update the pool
key generation (where pools are indexed for reuse) to include the requested
count parameter or, alternatively, modify the pool acquisition path (the
acquire/checkout function used by _start_workers_tracked) to compare the
requested count against the existing pool's count and either reject/evict the
existing entry and recreate a matching pool or raise an error so callers must
recreate; ensure you update any key-building helpers and tests that rely on the
pool identity to reference the new count-inclusive key or the new acquire
behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@e2e_test/infra/worker_pool.py`:
- Around line 85-118: The lock (_lock) is intentionally held across the
long-running start_workers(...) call to keep the cache update of self._key and
self._workers atomic with respect to other callers; add a concise explanatory
comment near the critical section (around the with self._lock: block or
immediately before calling start_workers) stating that the lock is intentionally
held despite start_workers being blocking (CI runs sequentially today) and
noting the potential contention if parallel usage is introduced in future so
maintainers understand this is by design; reference _lock, start_workers,
_evict_locked, _workers, and _key in the comment for clarity.
- Around line 137-144: The current get_pool() registers _POOL.cleanup with
atexit every time a new WorkerPool is created, which can accumulate handlers;
instead register a single module-level atexit handler that calls cleanup_pool()
(or ensure registration happens only once) and remove per-instance
atexit.register(_POOL.cleanup) from get_pool(); modify get_pool(), _POOL, and
the module init to either call atexit.register(cleanup_pool) once (or guard
registration with a module boolean like _ATEXIT_REGISTERED) so subsequent
recreations of _POOL do not add duplicate atexit handlers.

In `@e2e_test/responses/conftest.py`:
- Around line 310-320: Wrap the call to Gateway.start in a try/except (or
try/finally) so that if gateway.start(...) raises after the Gateway instance has
been created, the gateway is cleanly shut down (call the appropriate teardown
method such as gateway.shutdown() / gateway.stop() / gateway.close()) before
re-raising the exception; ensure this interacts correctly with the existing
cleanup that currently runs only when openai.OpenAI(...) raises so no gateway
instance is leaked into subsequent tests.

---

Outside diff comments:
In `@e2e_test/fixtures/setup_backend.py`:
- Around line 195-202: The pool reuse logic for starting workers currently
ignores the requested worker count, causing _start_workers_tracked to reuse a
pool keyed only by engine/model_id/mode/worker_type and produce wrong
topologies; update the pool key generation (where pools are indexed for reuse)
to include the requested count parameter or, alternatively, modify the pool
acquisition path (the acquire/checkout function used by _start_workers_tracked)
to compare the requested count against the existing pool's count and either
reject/evict the existing entry and recreate a matching pool or raise an error
so callers must recreate; ensure you update any key-building helpers and tests
that rely on the pool identity to reference the new count-inclusive key or the
new acquire behavior.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1ec63284-af5b-48c5-bc73-956149e4afc8

📥 Commits

Reviewing files that changed from the base of the PR and between fb1eebf and a1fdda8.

📒 Files selected for processing (8)

e2e_test/conftest.py
e2e_test/fixtures/__init__.py
e2e_test/fixtures/hooks.py
e2e_test/fixtures/setup_backend.py
e2e_test/infra/__init__.py
e2e_test/infra/model_specs.py
e2e_test/infra/worker_pool.py
e2e_test/responses/conftest.py

gemini-code-assist

Code Review

This pull request implements a session-scoped worker pool to cache and reuse workers across E2E test classes, optimizing GPU resource usage and reducing startup overhead. It introduces test ordering based on backend and model to maximize cache efficiency and adds a session teardown hook for resource cleanup. Review feedback highlights critical improvements for resource management, including ensuring existing workers are evicted when bypassing the cache, incorporating worker counts into the cache key to prevent process leaks, and refining the lifecycle management of the pool's cleanup handlers.

gemini-code-assist · 2026-05-18T05:32:45Z

+        if worker_type != WorkerType.REGULAR:
+            # PD prefill/decode bypass the cache (see module docstring).
+            return start_workers(
+                model_id=model_id,
+                engine=engine,
+                mode=mode,
+                count=count,
+                worker_type=worker_type,
+                timeout=timeout,
+                log_dir=log_dir,
+            )


When bypassing the cache for non-regular workers (like PD prefill/decode), the pool must still evict any existing cached worker. Since GPU resources are exclusive and the pool is designed as a 'one-slot' cache, failing to stop the cached worker before starting new ones will lead to GPU resource contention, OOM errors, or NCCL initialization failures.

Suggested change

if worker_type != WorkerType.REGULAR:

# PD prefill/decode bypass the cache (see module docstring).

return start_workers(

model_id=model_id,

engine=engine,

mode=mode,

count=count,

worker_type=worker_type,

timeout=timeout,

log_dir=log_dir,

)

if worker_type != WorkerType.REGULAR:

# PD prefill/decode bypass the cache (see module docstring).

# We must still cleanup any existing pooled worker to free GPU resources.

self.cleanup()

return start_workers(

model_id=model_id,

engine=engine,

mode=mode,

count=count,

worker_type=worker_type,

timeout=timeout,

log_dir=log_dir,

)

References

Ensure that all acquired resources are tracked immediately and consistently to prevent resource leaks.

gemini-code-assist · 2026-05-18T05:32:45Z

+        if use_pool:
+            return get_pool().acquire(**kwargs)
        return start_workers(**kwargs)


When use_pool is False, we must still ensure that any existing pooled worker is stopped to avoid GPU resource conflicts. This is critical for tests that bypass the pool (like those using backend_router or PD workers) but still run in the same session.

Suggested change

if use_pool:

return get_pool().acquire(**kwargs)

return start_workers(**kwargs)

if use_pool:

return get_pool().acquire(**kwargs)

cleanup_pool()

return start_workers(**kwargs)

References

Ensure that all acquired resources are tracked immediately and consistently to prevent resource leaks.

gemini-code-assist · 2026-05-18T05:32:45Z

+logger = logging.getLogger(__name__)
+
+
+_PoolKey = tuple[str, str, ConnectionMode, WorkerType]


The _PoolKey should include the count of workers. Currently, if a test class requests 2 workers and a subsequent class requests 1 worker for the same model/engine, the pool will reuse the 2 workers but only return 1, effectively leaking a worker process and its associated GPU resources for the duration of that test class. Including count in the key ensures that a change in the required worker topology triggers a proper eviction and restart.

Suggested change

_PoolKey = tuple[str, str, ConnectionMode, WorkerType]

_PoolKey = tuple[str, str, ConnectionMode, WorkerType, int]

References

When a function is expected to return a specific number of resources, validate that the exact number of requested resources was obtained.

gemini-code-assist · 2026-05-18T05:32:45Z

+            key: _PoolKey = (engine, model_id, mode, worker_type)
+
+            if self._key == key and len(self._workers) >= count:
+                logger.info(
+                    "WorkerPool: reusing %d cached worker(s) for %s",
+                    count,
+                    key,
+                )
+                return list(self._workers[:count])


Update the key generation and reuse logic to include the worker count. This simplifies the check and ensures that we only reuse workers when the requested topology matches exactly.

Suggested change

key: _PoolKey = (engine, model_id, mode, worker_type)

if self._key == key and len(self._workers) >= count:

logger.info(

"WorkerPool: reusing %d cached worker(s) for %s",

count,

key,

)

return list(self._workers[:count])

key: _PoolKey = (engine, model_id, mode, worker_type, count)

if self._key == key:

logger.info(

"WorkerPool: reusing %d cached worker(s) for %s",

count,

key,

)

return list(self._workers)

References

When a function is expected to return a specific number of resources, validate that the exact number of requested resources was obtained.

gemini-code-assist · 2026-05-18T05:32:45Z

+def get_pool() -> WorkerPool:
+    """Return the session-wide worker pool, creating it on first use."""
+    global _POOL
+    with _POOL_LOCK:
+        if _POOL is None or _POOL._closed:
+            _POOL = WorkerPool()
+            atexit.register(_POOL.cleanup)
+        return _POOL


Registering the atexit handler inside get_pool can lead to multiple registrations if the pool is closed and recreated (e.g., after a bypass or explicit cleanup). Since atexit holds references to the bound methods, this can also cause a minor memory leak. It is better to register a module-level cleanup function once. Additionally, ensure the cleanup is protected by a lock.

Suggested change

def get_pool() -> WorkerPool:

"""Return the session-wide worker pool, creating it on first use."""

global _POOL

with _POOL_LOCK:

if _POOL is None or _POOL._closed:

_POOL = WorkerPool()

atexit.register(_POOL.cleanup)

return _POOL

def get_pool() -> WorkerPool:

"""Return the session-wide worker pool, creating it on first use."""

global _POOL

with _POOL_LOCK:

if _POOL is None or _POOL._closed:

_POOL = WorkerPool()

return _POOL

def cleanup_pool() -> None:

"""Tear down the session-wide pool if it exists. Called from session-end hook."""

with _POOL_LOCK:

if _POOL is not None:

_POOL.cleanup()

atexit.register(cleanup_pool)

References

When closing a shared resource that can be accessed by other threads, protect the close operation and the nullification of the resource reference with a lock to prevent race conditions during shutdown.

gemini-code-assist · 2026-05-18T05:32:45Z

+    callspec = getattr(item, "callspec", None)
+    if callspec is not None:
+        params = getattr(callspec, "params", {}) or {}
+        backend = str(params.get("setup_backend", ""))


The sort key should also consider the backend_router fixture parameter. While backend_router currently bypasses the pool, clustering tests that use it (and thus share a backend/model) still improves overall session stability and makes future pooling of these tests easier.

Suggested change

backend = str(params.get("setup_backend", ""))

backend = str(params.get("setup_backend", params.get("backend_router", "")))

Follow-up to the tp=1 drop on gpt-oss-20b and Qwen2.5-14B-Instruct: the 2-GPU CI jobs that relied on those models can now run on a single H100, and `ci_download_model.sh --gpu-tier 2` was returning an empty list because no model in MODEL_SPECS has tp=2 anymore. Workflow changes: - Rename `e2e-2gpu-responses` -> `e2e-1gpu-responses` (gpu_tier 1, 1-gpu-h100 runner). Flip `@pytest.mark.gpu(2)` -> `gpu(1)` on the 15 affected classes in `e2e_test/responses/`. - Retire `e2e-2gpu-chat`. Flip `@pytest.mark.gpu(2)` -> `gpu(1)` on the 4 classes in `e2e_test/chat_completions/` that used the now-tp=1 models. The merged set runs under `e2e-1gpu-chat`; job-level timeouts bumped to 60/45/120 min (sglang/vllm/trtllm). - `e2e-gpu-job.yml` grows an optional `test_timeout` input (default 25) feeding the pytest step's `timeout-minutes`. `e2e-1gpu-chat` passes 45/35/90 to accommodate the absorbed tests. - `finish` job's `needs:` and failure aggregator updated to drop the retired jobs and add `e2e-1gpu-responses`. Script: - `ci_download_model.sh` resolver now uses `tp <= tier` instead of `tp == tier`. Tier-N runners have N GPUs and can host any model needing <= N. Without this, `e2e-2gpu-pd` (which runs PD with tp=1 Llama-3.1-8B on 2 GPUs) would also fail with "No models resolved for GPU tier 2". PD tests keep their `@pytest.mark.gpu(2)` marker because they genuinely need 2 GPUs at the test level. Signed-off-by: key4ng <rukeyang@gmail.com>

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

e2e_test/responses/test_image_generation.py (1)
520-534: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update stale GPU-routing comment block near these marker changes.

The nearby note still says these classes must stay on gpu(2) and that no 1‑GPU Responses lane exists, which now contradicts the updated markers and CI intent. Please update/remove that block to prevent future misrouting confusion.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@e2e_test/responses/test_image_generation.py` around lines 520 - 534, Update
the stale GPU-routing comment block that contradicts the current pytest markers
around TestImageGenerationGrpcSglang: locate the comment referencing "must stay
on gpu(2)" and "no 1‑GPU Responses lane exists" near the class
TestImageGenerationGrpcSglang and the surrounding pytest.mark.gpu decorators,
and either remove it or revise it to reflect the new markers (gpu(1)) and CI
intent so the comment no longer conflicts with the decorators and won't misroute
future tests.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/e2e-gpu-job.yml:
- Around line 22-26: Add a runtime guard that validates the declared input
contract: compare the test_timeout input against the job_timeout input and fail
fast with a clear error when test_timeout is not strictly less than job_timeout.
Concretely, add an early workflow step (e.g., a run step named "Validate
timeouts") that reads github.event.inputs.test_timeout and
github.event.inputs.job_timeout (or the corresponding env vars) and exits
non‑zero with an explanatory message if test_timeout >= job_timeout; apply the
same validation where the other test_timeout definition is used. Ensure the step
runs before starting pytest so misconfigured callers get an immediate, readable
failure.

---

Outside diff comments:
In `@e2e_test/responses/test_image_generation.py`:
- Around line 520-534: Update the stale GPU-routing comment block that
contradicts the current pytest markers around TestImageGenerationGrpcSglang:
locate the comment referencing "must stay on gpu(2)" and "no 1‑GPU Responses
lane exists" near the class TestImageGenerationGrpcSglang and the surrounding
pytest.mark.gpu decorators, and either remove it or revise it to reflect the new
markers (gpu(1)) and CI intent so the comment no longer conflicts with the
decorators and won't misroute future tests.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9d635a93-43bc-4457-a6fe-977badf7015c

📥 Commits

Reviewing files that changed from the base of the PR and between a1fdda8 and 9220332.

📒 Files selected for processing (14)

.github/workflows/e2e-gpu-job.yml
.github/workflows/pr-test-rust.yml
e2e_test/chat_completions/test_function_calling.py
e2e_test/chat_completions/test_openai_server.py
e2e_test/chat_completions/test_structured_output.py
e2e_test/chat_completions/test_validation.py
e2e_test/responses/test_builtin_tools.py
e2e_test/responses/test_image_generation.py
e2e_test/responses/test_sampling_params.py
e2e_test/responses/test_state_management.py
e2e_test/responses/test_streaming_events.py
e2e_test/responses/test_structured_output.py
e2e_test/responses/test_tools_call.py
scripts/ci_download_model.sh

coderabbitai · 2026-05-18T05:53:39Z

+      test_timeout:
+        required: false
+        type: number
+        default: 25
+        description: "pytest step timeout in minutes (must be < job timeout)"


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Enforce the timeout contract in workflow logic.

Line 26 states test_timeout must be lower than job timeout, but there’s no runtime guard. A bad caller value can create confusing timeout behavior.

Proposed fix

# Run tests + - name: Validate timeout configuration + run: | + python3 - <<'PY' + job_timeout = float("${{ inputs.timeout }}") + test_timeout = float("${{ inputs.test_timeout }}") + if test_timeout >= job_timeout: + raise SystemExit( + f"Invalid timeout config: test_timeout ({test_timeout}) must be < timeout ({job_timeout})" + ) + PY + - name: Run E2E tests timeout-minutes: ${{ inputs.test_timeout }} env:

Also applies to: 121-121

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/e2e-gpu-job.yml around lines 22 - 26, Add a runtime guard that validates the declared input contract: compare the test_timeout input against the job_timeout input and fail fast with a clear error when test_timeout is not strictly less than job_timeout. Concretely, add an early workflow step (e.g., a run step named "Validate timeouts") that reads github.event.inputs.test_timeout and github.event.inputs.job_timeout (or the corresponding env vars) and exits non‑zero with an explanatory message if test_timeout >= job_timeout; apply the same validation where the other test_timeout definition is used. Ensure the step runs before starting pytest so misconfigured callers get an immediate, readable failure.

Addresses unresolved PR review comments on #1502: - **Liveness gate**: cache reuse now requires every cached worker to pass ``Worker.is_alive()``. A dead worker is evicted before restart so a process crash doesn't poison every subsequent class in the cluster. - **Count in key**: ``_PoolKey`` is now ``(engine, model_id, mode, worker_type, count)``. A class asking for ``workers(count=2)`` after a ``count=1`` class on the same backend triggers eviction instead of silently running with the wrong topology. - **Evict on bypass**: non-REGULAR worker types (PD prefill/decode) now go through ``pool.acquire()`` too. The pool's bypass branch evicts any cached REGULAR worker first so a PD launch doesn't fight a stale cached process for the same GPUs. ``acquire()`` grew a ``gpu_offset`` parameter forwarded to ``start_workers`` for the decode-after-prefill layout. Callers of non-REGULAR workers still own teardown via ``stop_workers``. - **atexit dedup**: cleanup handler is registered once at module import instead of per-pool-instance, so close-then-recreate cycles don't accumulate handlers. - **Gateway leak on start**: in ``responses/conftest._start_local_grpc_gateway_with_mcp``, wrap ``gateway.start()`` + client init in a single try/except so a failure from ``start()`` doesn't leak the gateway process. - **Sort key picks up ``backend_router``**: ``_pool_sort_key`` falls back from ``setup_backend`` to ``backend_router`` parametrize value, so future function-scope tests cluster too. ``backend_router`` now acquires through the pool itself (function-scoped tests would otherwise race the class-scope cached worker for GPUs). - **Docstrings**: refreshed ``setup_backend`` module docstring, dropped the stale ``# ``gpu(2)`` on the gRPC classes`` comment in ``test_image_generation.py`` that contradicted the new markers. Signed-off-by: key4ng <rukeyang@gmail.com>

key4ng · 2026-05-18T06:44:46Z

Review feedback addressed in `e0147bdd`

Thread	Status	Resolution
Liveness gate (claude / codex / gemini @ `worker_pool.py:97`)	✅ Fixed	Reuse branch now requires `all(w.is_alive() for w in cached)`; dead workers trigger eviction + restart. Reason is logged as `dead worker` vs `key mismatch`.
`count` in pool key (gemini @ `worker_pool.py:35/97`, coderabbit outside-diff @ `setup_backend.py:195-202`)	✅ Fixed	`_PoolKey` is now `(engine, model_id, mode, worker_type, count)`. A class asking for `workers(count=2)` after `count=1` evicts instead of silently returning a too-small slice.
Evict on bypass for non-REGULAR workers (gemini @ `worker_pool.py:83`)	✅ Fixed	The `worker_type != REGULAR` branch now calls `_evict_locked()` before delegating to `start_workers`, so a PD launch can't race a stale cached worker for the same GPUs. PD prefill/decode go through `pool.acquire()` (callers still own teardown). `acquire()` gained a `gpu_offset` param for PD decode placement.
`use_pool=False` skipped eviction (gemini @ `setup_backend.py:63`)	✅ Fixed (by routing through pool)	`_start_workers_tracked` now always goes through `pool.acquire()`; the pool's own bypass branch handles eviction. The `use_pool` flag is gone. `backend_router` also routes through the pool.
`atexit` handler dedup (claude / coderabbit / gemini @ `worker_pool.py:143-144`)	✅ Fixed	Single module-level `atexit.register(cleanup_pool)` at import time; `get_pool()` no longer registers per-instance. Close-then-recreate cycles don't pile up handlers.
Gateway leak on `start()` failure (coderabbit @ `responses/conftest.py:320`)	✅ Fixed	`gateway.start(...)` and client init are wrapped in a single `try/except` that calls `gateway.shutdown()` on any failure.
Sort key picks up `backend_router` (gemini @ `hooks.py:169`)	✅ Fixed	`_pool_sort_key` falls back from `setup_backend` to `backend_router` param.
Stale `gpu(2)` comment in `test_image_generation.py:506-516` (coderabbit outside-diff)	✅ Fixed	Comment rewritten to reflect `gpu(1)` + `e2e-1gpu-responses`.
Lock held across `start_workers` (coderabbit @ `worker_pool.py:118`)	✅ Comment-only	Added a docstring note explaining the lock is held intentionally; CI is sequential today, contention is a non-issue.
Validate `test_timeout < timeout` in workflow (coderabbit @ `e2e-gpu-job.yml:26`)	⏭️ Skipped	The contract is documented in the input description; a misconfiguration manifests as a step that gets killed before the job times out, which is visible enough. Not worth a python validator step.

All ruff/format/pre-commit checks pass on the touched files.

claude · 2026-05-18T06:46:49Z

+PD-disaggregation paths run through the pool too (so it can evict a
+stale cached worker holding their GPUs) but the caller owns teardown of
+the prefill/decode workers via ``stop_workers``. The function-scoped
+``backend_router`` fixture intentionally bypasses the pool entirely.


🟡 Nit: The module docstring says backend_router "intentionally bypasses the pool entirely," but the fixture was changed in this push to route through get_pool().acquire() (line 357). The inline comment at line 354 correctly describes the new behavior. The docstring should match.

Suggested change

PD-disaggregation paths run through the pool too (so it can evict a

stale cached worker holding their GPUs) but the caller owns teardown of

the prefill/decode workers via ``stop_workers``. The function-scoped

``backend_router`` fixture intentionally bypasses the pool entirely.

PD-disaggregation paths run through the pool too (so it can evict a

stale cached worker holding their GPUs) but the caller owns teardown of

the prefill/decode workers via ``stop_workers``. The function-scoped

``backend_router`` fixture also routes through the pool (so it evicts

any cached class-scope worker) but creates its own gateway per test.

key4ng requested review from CatherineSue, XinyueZhang369 and slin1237 as code owners May 18, 2026 05:26

github-actions Bot added the tests Test changes label May 18, 2026

claude Bot reviewed May 18, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 18, 2026

View reviewed changes

coderabbitai Bot requested changes May 18, 2026

View reviewed changes

Comment thread e2e_test/infra/worker_pool.py

Comment thread e2e_test/infra/worker_pool.py

Comment thread e2e_test/responses/conftest.py Outdated

gemini-code-assist Bot reviewed May 18, 2026

View reviewed changes

key4ng requested a review from gongwei-130 as a code owner May 18, 2026 05:50

github-actions Bot added the ci CI/CD configuration changes label May 18, 2026

key4ng changed the title ~~test(e2e): cache workers across classes; drop tp=1 for gpt-oss-20b & Qwen2.5-14B~~ test(e2e): cache workers + tp=1 for gpt-oss-20b/Qwen2.5-14B + move responses & chat to 1-GPU May 18, 2026

coderabbitai Bot requested changes May 18, 2026

View reviewed changes

claude Bot reviewed May 18, 2026

View reviewed changes

		logger = logging.getLogger(__name__)


		_PoolKey = tuple[str, str, ConnectionMode, WorkerType]

	backend = str(params.get("setup_backend", ""))
	backend = str(params.get("setup_backend", params.get("backend_router", "")))

Conversation

key4ng commented May 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Files

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

claude Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

key4ng commented May 18, 2026

Review feedback addressed in e0147bdd

Uh oh!

claude Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

key4ng commented May 18, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 18, 2026 •

edited

Loading

Review feedback addressed in `e0147bdd`