test(e2e): cache workers + tp=1 for gpt-oss-20b/Qwen2.5-14B + move responses & chat to 1-GPU#1502
test(e2e): cache workers + tp=1 for gpt-oss-20b/Qwen2.5-14B + move responses & chat to 1-GPU#1502key4ng wants to merge 3 commits into
Conversation
… and Qwen2.5-14B The e2e-2gpu-responses CI job has been timing out at 25 min because class-scoped fixtures (`setup_backend` and `gateway_with_mock_mcp_grpc_*`) fully tear down and re-launch the SGLang worker between every test class. Run 25983256186 spent ~20 of 24 min of "Run E2E tests" on worker startup alone, alternating between gpt-oss-20b and Qwen2.5-14B in non-grouped order. Two changes: 1. Introduce a session-scoped `WorkerPool` (`infra/worker_pool.py`) that caches workers keyed on `(engine, model_id, mode, worker_type)`. Class fixtures call `pool.acquire(...)` instead of `start_workers(...)` and skip `stop_workers` on teardown — workers survive until session end. `pytest_sessionfinish` + `atexit` guarantee cleanup. PD prefill/decode workers bypass the cache (rare, hold multiple workers). `pytest_collection_modifyitems` now sorts items by `(backend, model)` so same-model classes cluster, making cache hits the common case. 2. Drop `tp` from 2 → 1 in `model_specs.py` for `Qwen/Qwen2.5-14B-Instruct` and `openai/gpt-oss-20b`. Qwen2.5-14B BF16 (~28GB) and gpt-oss-20b MXFP4 (~13GB) both fit comfortably on a single H100/80GB. NCCL setup was paying for tp=2 on every restart. Can be overridden via the existing `E2E_MODEL_TP_OVERRIDES` env var. Combined effect on e2e-2gpu-responses: instead of ~5 gpt-oss-20b boots + ~5 Qwen2.5-14B boots interleaved at ~2-3 min each, the suite should see one boot per model after sorting. Signed-off-by: key4ng <rukeyang@gmail.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (5)
📝 WalkthroughWalkthroughAdds a session-scoped WorkerPool with acquire/cleanup, wires pytest_sessionfinish to call cleanup_pool, always sorts collected tests for pool affinity, integrates pooling into backend/response fixtures, updates two model tp values to 1, adjusts CI job timeouts/graph, and changes many test GPU markers from 2→1. ChangesWorker Pool and Session Lifecycle
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
|
||
| if self._key == key and len(self._workers) >= count: | ||
| logger.info( | ||
| "WorkerPool: reusing %d cached worker(s) for %s", | ||
| count, | ||
| key, | ||
| ) | ||
| return list(self._workers[:count]) |
There was a problem hiding this comment.
🔴 Important: The cache-hit path doesn't verify that cached workers are still alive. If a worker process crashes during a test, every subsequent class sharing that backend key will receive the dead worker from the pool. This defeats --reruns (retries hit the same dead cache) and causes cascading failures across all remaining classes for that backend — the fail-fast counter in _start_workers_tracked never increments because no start_workers call is made.
Worker.is_alive() already exists; a one-line liveness gate before the reuse branch would fix it:
| if self._key == key and len(self._workers) >= count: | |
| logger.info( | |
| "WorkerPool: reusing %d cached worker(s) for %s", | |
| count, | |
| key, | |
| ) | |
| return list(self._workers[:count]) | |
| if self._key == key and len(self._workers) >= count and all( | |
| w.is_alive() for w in self._workers[:count] | |
| ): | |
| logger.info( | |
| "WorkerPool: reusing %d cached worker(s) for %s", | |
| count, | |
| key, | |
| ) | |
| return list(self._workers[:count]) |
| global _POOL | ||
| with _POOL_LOCK: | ||
| if _POOL is None or _POOL._closed: | ||
| _POOL = WorkerPool() | ||
| atexit.register(_POOL.cleanup) |
There was a problem hiding this comment.
🟡 Nit: Each time the pool is closed and then get_pool() is called again, a new atexit handler is registered while the old one still references the (now-closed) previous pool. cleanup() is idempotent so this is harmless at runtime, but the handlers accumulate. Consider unregistering the old handler, or guarding with a module-level flag:
| global _POOL | |
| with _POOL_LOCK: | |
| if _POOL is None or _POOL._closed: | |
| _POOL = WorkerPool() | |
| atexit.register(_POOL.cleanup) | |
| if _POOL is None or _POOL._closed: | |
| _POOL = WorkerPool() | |
| atexit.unregister(_POOL.cleanup) | |
| atexit.register(_POOL.cleanup) |
(In practice this only matters if something triggers a close-then-reacquire cycle, which normal pytest flow won't hit — so this is cosmetic.)
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a1fdda8f4d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if self._key == key and len(self._workers) >= count: | ||
| logger.info( | ||
| "WorkerPool: reusing %d cached worker(s) for %s", | ||
| count, | ||
| key, | ||
| ) | ||
| return list(self._workers[:count]) |
There was a problem hiding this comment.
Validate cached workers before reusing them
When a cached worker process dies after a previous class (for example from an OOM or backend crash), this branch reuses it solely because the key and count match and returns a dead worker URL. Since the new collection ordering clusters same-model classes, every later class with the same key will keep timing out against the same dead process until a different key evicts it or the session ends; previously each class restarted workers. Check is_alive()/health and evict/restart before returning cached workers.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
e2e_test/fixtures/setup_backend.py (1)
195-202:⚠️ Potential issue | 🟠 Major | ⚡ Quick winInclude
countin the pool reuse contract.
@pytest.mark.workers(count=...)is still part of this fixture’s public behavior, but the pool contract described in this PR only keys reuse onengine/model_id/mode/worker_type. That means a class asking forcount=2after acount=1class on the same backend will reuse the 1-worker entry and run with the wrong topology. Please either includecountin the pool key or makeacquire()reject/evict on count mismatch.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@e2e_test/fixtures/setup_backend.py` around lines 195 - 202, The pool reuse logic for starting workers currently ignores the requested worker count, causing _start_workers_tracked to reuse a pool keyed only by engine/model_id/mode/worker_type and produce wrong topologies; update the pool key generation (where pools are indexed for reuse) to include the requested count parameter or, alternatively, modify the pool acquisition path (the acquire/checkout function used by _start_workers_tracked) to compare the requested count against the existing pool's count and either reject/evict the existing entry and recreate a matching pool or raise an error so callers must recreate; ensure you update any key-building helpers and tests that rely on the pool identity to reference the new count-inclusive key or the new acquire behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@e2e_test/infra/worker_pool.py`:
- Around line 85-118: The lock (_lock) is intentionally held across the
long-running start_workers(...) call to keep the cache update of self._key and
self._workers atomic with respect to other callers; add a concise explanatory
comment near the critical section (around the with self._lock: block or
immediately before calling start_workers) stating that the lock is intentionally
held despite start_workers being blocking (CI runs sequentially today) and
noting the potential contention if parallel usage is introduced in future so
maintainers understand this is by design; reference _lock, start_workers,
_evict_locked, _workers, and _key in the comment for clarity.
- Around line 137-144: The current get_pool() registers _POOL.cleanup with
atexit every time a new WorkerPool is created, which can accumulate handlers;
instead register a single module-level atexit handler that calls cleanup_pool()
(or ensure registration happens only once) and remove per-instance
atexit.register(_POOL.cleanup) from get_pool(); modify get_pool(), _POOL, and
the module init to either call atexit.register(cleanup_pool) once (or guard
registration with a module boolean like _ATEXIT_REGISTERED) so subsequent
recreations of _POOL do not add duplicate atexit handlers.
In `@e2e_test/responses/conftest.py`:
- Around line 310-320: Wrap the call to Gateway.start in a try/except (or
try/finally) so that if gateway.start(...) raises after the Gateway instance has
been created, the gateway is cleanly shut down (call the appropriate teardown
method such as gateway.shutdown() / gateway.stop() / gateway.close()) before
re-raising the exception; ensure this interacts correctly with the existing
cleanup that currently runs only when openai.OpenAI(...) raises so no gateway
instance is leaked into subsequent tests.
---
Outside diff comments:
In `@e2e_test/fixtures/setup_backend.py`:
- Around line 195-202: The pool reuse logic for starting workers currently
ignores the requested worker count, causing _start_workers_tracked to reuse a
pool keyed only by engine/model_id/mode/worker_type and produce wrong
topologies; update the pool key generation (where pools are indexed for reuse)
to include the requested count parameter or, alternatively, modify the pool
acquisition path (the acquire/checkout function used by _start_workers_tracked)
to compare the requested count against the existing pool's count and either
reject/evict the existing entry and recreate a matching pool or raise an error
so callers must recreate; ensure you update any key-building helpers and tests
that rely on the pool identity to reference the new count-inclusive key or the
new acquire behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1ec63284-af5b-48c5-bc73-956149e4afc8
📒 Files selected for processing (8)
e2e_test/conftest.pye2e_test/fixtures/__init__.pye2e_test/fixtures/hooks.pye2e_test/fixtures/setup_backend.pye2e_test/infra/__init__.pye2e_test/infra/model_specs.pye2e_test/infra/worker_pool.pye2e_test/responses/conftest.py
There was a problem hiding this comment.
Code Review
This pull request implements a session-scoped worker pool to cache and reuse workers across E2E test classes, optimizing GPU resource usage and reducing startup overhead. It introduces test ordering based on backend and model to maximize cache efficiency and adds a session teardown hook for resource cleanup. Review feedback highlights critical improvements for resource management, including ensuring existing workers are evicted when bypassing the cache, incorporating worker counts into the cache key to prevent process leaks, and refining the lifecycle management of the pool's cleanup handlers.
| if worker_type != WorkerType.REGULAR: | ||
| # PD prefill/decode bypass the cache (see module docstring). | ||
| return start_workers( | ||
| model_id=model_id, | ||
| engine=engine, | ||
| mode=mode, | ||
| count=count, | ||
| worker_type=worker_type, | ||
| timeout=timeout, | ||
| log_dir=log_dir, | ||
| ) |
There was a problem hiding this comment.
When bypassing the cache for non-regular workers (like PD prefill/decode), the pool must still evict any existing cached worker. Since GPU resources are exclusive and the pool is designed as a 'one-slot' cache, failing to stop the cached worker before starting new ones will lead to GPU resource contention, OOM errors, or NCCL initialization failures.
| if worker_type != WorkerType.REGULAR: | |
| # PD prefill/decode bypass the cache (see module docstring). | |
| return start_workers( | |
| model_id=model_id, | |
| engine=engine, | |
| mode=mode, | |
| count=count, | |
| worker_type=worker_type, | |
| timeout=timeout, | |
| log_dir=log_dir, | |
| ) | |
| if worker_type != WorkerType.REGULAR: | |
| # PD prefill/decode bypass the cache (see module docstring). | |
| # We must still cleanup any existing pooled worker to free GPU resources. | |
| self.cleanup() | |
| return start_workers( | |
| model_id=model_id, | |
| engine=engine, | |
| mode=mode, | |
| count=count, | |
| worker_type=worker_type, | |
| timeout=timeout, | |
| log_dir=log_dir, | |
| ) |
References
- Ensure that all acquired resources are tracked immediately and consistently to prevent resource leaks.
| if use_pool: | ||
| return get_pool().acquire(**kwargs) | ||
| return start_workers(**kwargs) |
There was a problem hiding this comment.
When use_pool is False, we must still ensure that any existing pooled worker is stopped to avoid GPU resource conflicts. This is critical for tests that bypass the pool (like those using backend_router or PD workers) but still run in the same session.
| if use_pool: | |
| return get_pool().acquire(**kwargs) | |
| return start_workers(**kwargs) | |
| if use_pool: | |
| return get_pool().acquire(**kwargs) | |
| cleanup_pool() | |
| return start_workers(**kwargs) |
References
- Ensure that all acquired resources are tracked immediately and consistently to prevent resource leaks.
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| _PoolKey = tuple[str, str, ConnectionMode, WorkerType] |
There was a problem hiding this comment.
The _PoolKey should include the count of workers. Currently, if a test class requests 2 workers and a subsequent class requests 1 worker for the same model/engine, the pool will reuse the 2 workers but only return 1, effectively leaking a worker process and its associated GPU resources for the duration of that test class. Including count in the key ensures that a change in the required worker topology triggers a proper eviction and restart.
| _PoolKey = tuple[str, str, ConnectionMode, WorkerType] | |
| _PoolKey = tuple[str, str, ConnectionMode, WorkerType, int] |
References
- When a function is expected to return a specific number of resources, validate that the exact number of requested resources was obtained.
| key: _PoolKey = (engine, model_id, mode, worker_type) | ||
|
|
||
| if self._key == key and len(self._workers) >= count: | ||
| logger.info( | ||
| "WorkerPool: reusing %d cached worker(s) for %s", | ||
| count, | ||
| key, | ||
| ) | ||
| return list(self._workers[:count]) |
There was a problem hiding this comment.
Update the key generation and reuse logic to include the worker count. This simplifies the check and ensures that we only reuse workers when the requested topology matches exactly.
| key: _PoolKey = (engine, model_id, mode, worker_type) | |
| if self._key == key and len(self._workers) >= count: | |
| logger.info( | |
| "WorkerPool: reusing %d cached worker(s) for %s", | |
| count, | |
| key, | |
| ) | |
| return list(self._workers[:count]) | |
| key: _PoolKey = (engine, model_id, mode, worker_type, count) | |
| if self._key == key: | |
| logger.info( | |
| "WorkerPool: reusing %d cached worker(s) for %s", | |
| count, | |
| key, | |
| ) | |
| return list(self._workers) |
References
- When a function is expected to return a specific number of resources, validate that the exact number of requested resources was obtained.
| def get_pool() -> WorkerPool: | ||
| """Return the session-wide worker pool, creating it on first use.""" | ||
| global _POOL | ||
| with _POOL_LOCK: | ||
| if _POOL is None or _POOL._closed: | ||
| _POOL = WorkerPool() | ||
| atexit.register(_POOL.cleanup) | ||
| return _POOL |
There was a problem hiding this comment.
Registering the atexit handler inside get_pool can lead to multiple registrations if the pool is closed and recreated (e.g., after a bypass or explicit cleanup). Since atexit holds references to the bound methods, this can also cause a minor memory leak. It is better to register a module-level cleanup function once. Additionally, ensure the cleanup is protected by a lock.
| def get_pool() -> WorkerPool: | |
| """Return the session-wide worker pool, creating it on first use.""" | |
| global _POOL | |
| with _POOL_LOCK: | |
| if _POOL is None or _POOL._closed: | |
| _POOL = WorkerPool() | |
| atexit.register(_POOL.cleanup) | |
| return _POOL | |
| def get_pool() -> WorkerPool: | |
| """Return the session-wide worker pool, creating it on first use.""" | |
| global _POOL | |
| with _POOL_LOCK: | |
| if _POOL is None or _POOL._closed: | |
| _POOL = WorkerPool() | |
| return _POOL | |
| def cleanup_pool() -> None: | |
| """Tear down the session-wide pool if it exists. Called from session-end hook.""" | |
| with _POOL_LOCK: | |
| if _POOL is not None: | |
| _POOL.cleanup() | |
| atexit.register(cleanup_pool) |
References
- When closing a shared resource that can be accessed by other threads, protect the close operation and the nullification of the resource reference with a lock to prevent race conditions during shutdown.
| callspec = getattr(item, "callspec", None) | ||
| if callspec is not None: | ||
| params = getattr(callspec, "params", {}) or {} | ||
| backend = str(params.get("setup_backend", "")) |
There was a problem hiding this comment.
The sort key should also consider the backend_router fixture parameter. While backend_router currently bypasses the pool, clustering tests that use it (and thus share a backend/model) still improves overall session stability and makes future pooling of these tests easier.
| backend = str(params.get("setup_backend", "")) | |
| backend = str(params.get("setup_backend", params.get("backend_router", ""))) |
Follow-up to the tp=1 drop on gpt-oss-20b and Qwen2.5-14B-Instruct: the 2-GPU CI jobs that relied on those models can now run on a single H100, and `ci_download_model.sh --gpu-tier 2` was returning an empty list because no model in MODEL_SPECS has tp=2 anymore. Workflow changes: - Rename `e2e-2gpu-responses` -> `e2e-1gpu-responses` (gpu_tier 1, 1-gpu-h100 runner). Flip `@pytest.mark.gpu(2)` -> `gpu(1)` on the 15 affected classes in `e2e_test/responses/`. - Retire `e2e-2gpu-chat`. Flip `@pytest.mark.gpu(2)` -> `gpu(1)` on the 4 classes in `e2e_test/chat_completions/` that used the now-tp=1 models. The merged set runs under `e2e-1gpu-chat`; job-level timeouts bumped to 60/45/120 min (sglang/vllm/trtllm). - `e2e-gpu-job.yml` grows an optional `test_timeout` input (default 25) feeding the pytest step's `timeout-minutes`. `e2e-1gpu-chat` passes 45/35/90 to accommodate the absorbed tests. - `finish` job's `needs:` and failure aggregator updated to drop the retired jobs and add `e2e-1gpu-responses`. Script: - `ci_download_model.sh` resolver now uses `tp <= tier` instead of `tp == tier`. Tier-N runners have N GPUs and can host any model needing <= N. Without this, `e2e-2gpu-pd` (which runs PD with tp=1 Llama-3.1-8B on 2 GPUs) would also fail with "No models resolved for GPU tier 2". PD tests keep their `@pytest.mark.gpu(2)` marker because they genuinely need 2 GPUs at the test level. Signed-off-by: key4ng <rukeyang@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
e2e_test/responses/test_image_generation.py (1)
520-534:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUpdate stale GPU-routing comment block near these marker changes.
The nearby note still says these classes must stay on
gpu(2)and that no 1‑GPU Responses lane exists, which now contradicts the updated markers and CI intent. Please update/remove that block to prevent future misrouting confusion.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@e2e_test/responses/test_image_generation.py` around lines 520 - 534, Update the stale GPU-routing comment block that contradicts the current pytest markers around TestImageGenerationGrpcSglang: locate the comment referencing "must stay on gpu(2)" and "no 1‑GPU Responses lane exists" near the class TestImageGenerationGrpcSglang and the surrounding pytest.mark.gpu decorators, and either remove it or revise it to reflect the new markers (gpu(1)) and CI intent so the comment no longer conflicts with the decorators and won't misroute future tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/e2e-gpu-job.yml:
- Around line 22-26: Add a runtime guard that validates the declared input
contract: compare the test_timeout input against the job_timeout input and fail
fast with a clear error when test_timeout is not strictly less than job_timeout.
Concretely, add an early workflow step (e.g., a run step named "Validate
timeouts") that reads github.event.inputs.test_timeout and
github.event.inputs.job_timeout (or the corresponding env vars) and exits
non‑zero with an explanatory message if test_timeout >= job_timeout; apply the
same validation where the other test_timeout definition is used. Ensure the step
runs before starting pytest so misconfigured callers get an immediate, readable
failure.
---
Outside diff comments:
In `@e2e_test/responses/test_image_generation.py`:
- Around line 520-534: Update the stale GPU-routing comment block that
contradicts the current pytest markers around TestImageGenerationGrpcSglang:
locate the comment referencing "must stay on gpu(2)" and "no 1‑GPU Responses
lane exists" near the class TestImageGenerationGrpcSglang and the surrounding
pytest.mark.gpu decorators, and either remove it or revise it to reflect the new
markers (gpu(1)) and CI intent so the comment no longer conflicts with the
decorators and won't misroute future tests.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9d635a93-43bc-4457-a6fe-977badf7015c
📒 Files selected for processing (14)
.github/workflows/e2e-gpu-job.yml.github/workflows/pr-test-rust.ymle2e_test/chat_completions/test_function_calling.pye2e_test/chat_completions/test_openai_server.pye2e_test/chat_completions/test_structured_output.pye2e_test/chat_completions/test_validation.pye2e_test/responses/test_builtin_tools.pye2e_test/responses/test_image_generation.pye2e_test/responses/test_sampling_params.pye2e_test/responses/test_state_management.pye2e_test/responses/test_streaming_events.pye2e_test/responses/test_structured_output.pye2e_test/responses/test_tools_call.pyscripts/ci_download_model.sh
| test_timeout: | ||
| required: false | ||
| type: number | ||
| default: 25 | ||
| description: "pytest step timeout in minutes (must be < job timeout)" |
There was a problem hiding this comment.
Enforce the timeout contract in workflow logic.
Line 26 states test_timeout must be lower than job timeout, but there’s no runtime guard. A bad caller value can create confusing timeout behavior.
Proposed fix
# Run tests
+ - name: Validate timeout configuration
+ run: |
+ python3 - <<'PY'
+ job_timeout = float("${{ inputs.timeout }}")
+ test_timeout = float("${{ inputs.test_timeout }}")
+ if test_timeout >= job_timeout:
+ raise SystemExit(
+ f"Invalid timeout config: test_timeout ({test_timeout}) must be < timeout ({job_timeout})"
+ )
+ PY
+
- name: Run E2E tests
timeout-minutes: ${{ inputs.test_timeout }}
env:Also applies to: 121-121
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/e2e-gpu-job.yml around lines 22 - 26, Add a runtime guard
that validates the declared input contract: compare the test_timeout input
against the job_timeout input and fail fast with a clear error when test_timeout
is not strictly less than job_timeout. Concretely, add an early workflow step
(e.g., a run step named "Validate timeouts") that reads
github.event.inputs.test_timeout and github.event.inputs.job_timeout (or the
corresponding env vars) and exits non‑zero with an explanatory message if
test_timeout >= job_timeout; apply the same validation where the other
test_timeout definition is used. Ensure the step runs before starting pytest so
misconfigured callers get an immediate, readable failure.
Addresses unresolved PR review comments on #1502: - **Liveness gate**: cache reuse now requires every cached worker to pass ``Worker.is_alive()``. A dead worker is evicted before restart so a process crash doesn't poison every subsequent class in the cluster. - **Count in key**: ``_PoolKey`` is now ``(engine, model_id, mode, worker_type, count)``. A class asking for ``workers(count=2)`` after a ``count=1`` class on the same backend triggers eviction instead of silently running with the wrong topology. - **Evict on bypass**: non-REGULAR worker types (PD prefill/decode) now go through ``pool.acquire()`` too. The pool's bypass branch evicts any cached REGULAR worker first so a PD launch doesn't fight a stale cached process for the same GPUs. ``acquire()`` grew a ``gpu_offset`` parameter forwarded to ``start_workers`` for the decode-after-prefill layout. Callers of non-REGULAR workers still own teardown via ``stop_workers``. - **atexit dedup**: cleanup handler is registered once at module import instead of per-pool-instance, so close-then-recreate cycles don't accumulate handlers. - **Gateway leak on start**: in ``responses/conftest._start_local_grpc_gateway_with_mcp``, wrap ``gateway.start()`` + client init in a single try/except so a failure from ``start()`` doesn't leak the gateway process. - **Sort key picks up ``backend_router``**: ``_pool_sort_key`` falls back from ``setup_backend`` to ``backend_router`` parametrize value, so future function-scope tests cluster too. ``backend_router`` now acquires through the pool itself (function-scoped tests would otherwise race the class-scope cached worker for GPUs). - **Docstrings**: refreshed ``setup_backend`` module docstring, dropped the stale ``# ``gpu(2)`` on the gRPC classes`` comment in ``test_image_generation.py`` that contradicted the new markers. Signed-off-by: key4ng <rukeyang@gmail.com>
Review feedback addressed in
|
| Thread | Status | Resolution |
|---|---|---|
Liveness gate (claude / codex / gemini @ worker_pool.py:97) |
✅ Fixed | Reuse branch now requires all(w.is_alive() for w in cached); dead workers trigger eviction + restart. Reason is logged as dead worker vs key mismatch. |
count in pool key (gemini @ worker_pool.py:35/97, coderabbit outside-diff @ setup_backend.py:195-202) |
✅ Fixed | _PoolKey is now (engine, model_id, mode, worker_type, count). A class asking for workers(count=2) after count=1 evicts instead of silently returning a too-small slice. |
Evict on bypass for non-REGULAR workers (gemini @ worker_pool.py:83) |
✅ Fixed | The worker_type != REGULAR branch now calls _evict_locked() before delegating to start_workers, so a PD launch can't race a stale cached worker for the same GPUs. PD prefill/decode go through pool.acquire() (callers still own teardown). acquire() gained a gpu_offset param for PD decode placement. |
use_pool=False skipped eviction (gemini @ setup_backend.py:63) |
✅ Fixed (by routing through pool) | _start_workers_tracked now always goes through pool.acquire(); the pool's own bypass branch handles eviction. The use_pool flag is gone. backend_router also routes through the pool. |
atexit handler dedup (claude / coderabbit / gemini @ worker_pool.py:143-144) |
✅ Fixed | Single module-level atexit.register(cleanup_pool) at import time; get_pool() no longer registers per-instance. Close-then-recreate cycles don't pile up handlers. |
Gateway leak on start() failure (coderabbit @ responses/conftest.py:320) |
✅ Fixed | gateway.start(...) and client init are wrapped in a single try/except that calls gateway.shutdown() on any failure. |
Sort key picks up backend_router (gemini @ hooks.py:169) |
✅ Fixed | _pool_sort_key falls back from setup_backend to backend_router param. |
Stale gpu(2) comment in test_image_generation.py:506-516 (coderabbit outside-diff) |
✅ Fixed | Comment rewritten to reflect gpu(1) + e2e-1gpu-responses. |
Lock held across start_workers (coderabbit @ worker_pool.py:118) |
✅ Comment-only | Added a docstring note explaining the lock is held intentionally; CI is sequential today, contention is a non-issue. |
Validate test_timeout < timeout in workflow (coderabbit @ e2e-gpu-job.yml:26) |
⏭️ Skipped | The contract is documented in the input description; a misconfiguration manifests as a step that gets killed before the job times out, which is visible enough. Not worth a python validator step. |
All ruff/format/pre-commit checks pass on the touched files.
| PD-disaggregation paths run through the pool too (so it can evict a | ||
| stale cached worker holding their GPUs) but the caller owns teardown of | ||
| the prefill/decode workers via ``stop_workers``. The function-scoped | ||
| ``backend_router`` fixture intentionally bypasses the pool entirely. |
There was a problem hiding this comment.
🟡 Nit: The module docstring says backend_router "intentionally bypasses the pool entirely," but the fixture was changed in this push to route through get_pool().acquire() (line 357). The inline comment at line 354 correctly describes the new behavior. The docstring should match.
| PD-disaggregation paths run through the pool too (so it can evict a | |
| stale cached worker holding their GPUs) but the caller owns teardown of | |
| the prefill/decode workers via ``stop_workers``. The function-scoped | |
| ``backend_router`` fixture intentionally bypasses the pool entirely. | |
| PD-disaggregation paths run through the pool too (so it can evict a | |
| stale cached worker holding their GPUs) but the caller owns teardown of | |
| the prefill/decode workers via ``stop_workers``. The function-scoped | |
| ``backend_router`` fixture also routes through the pool (so it evicts | |
| any cached class-scope worker) but creates its own gateway per test. |
Summary
The
e2e-2gpu-responsesCI job hits its 25-min step timeout because the SGLang worker is destroyed and recreated between every test class. Run 25983256186 shows ~10 worker boots (gpt-oss-20b and Qwen2.5-14B-Instruct interleaved) at 1.5–3 min each — ~20 of 24 min spent in startup.Two root causes and a downstream cleanup:
WorkerPool(e2e_test/infra/worker_pool.py): one-slot cache keyed on(engine, model_id, mode, worker_type).setup_backend._setup_localandresponses/conftest._start_local_grpc_gateway_with_mcpnowpool.acquire(...)and drop theirstop_workers(...)teardown — workers live until session end (cleanup viapytest_sessionfinish+atexit). PD prefill/decode bypass the cache.(backend, model)inpytest_collection_modifyitems(fixtures/hooks.py) so consecutive classes hit the cache instead of evicting it.tp: 2 → 1forQwen/Qwen2.5-14B-Instructandopenai/gpt-oss-20bininfra/model_specs.py. Qwen2.5-14B BF16 (~28 GB) and gpt-oss-20b MXFP4 (~13 GB) both fit on one H100/80GB; tp=2 was paying NCCL setup on every restart.E2E_MODEL_TP_OVERRIDESstill works.Downstream CI cleanup forced by the tp drop:
e2e-2gpu-responses→e2e-1gpu-responses(gpu_tier 1, 1-gpu-h100 runner).@pytest.mark.gpu(2)→gpu(1)on the 15 affected classes ine2e_test/responses/.e2e-2gpu-chat. The 4 chat_completions classes that used the now-tp=1 models flip togpu(1)and run undere2e-1gpu-chat. Job-level timeouts bump to 60/45/120 min (sglang/vllm/trtllm); pytest step timeout grows to 45/35/90 via a newtest_timeoutinput one2e-gpu-job.yml.ci_download_model.shresolvertp == tier→tp <= tier. Without this,e2e-2gpu-pd(PD with tp=1 Llama-3.1-8B on 2 GPUs) hits the same "No models resolved for GPU tier 2" error. PD tests keepgpu(2)because they really do need 2 GPUs at the test level.Expected effect on the renamed
e2e-1gpu-responses: at most one boot per model per session.Files
e2e_test/infra/worker_pool.py(new) —WorkerPool+get_pool/cleanup_pool, thread-safe one-slot cache with atexit hook.e2e_test/infra/__init__.py— export pool symbols.e2e_test/infra/model_specs.py—tp: 2 → 1for the two models, with rationale comments.e2e_test/fixtures/setup_backend.py—_setup_localacquires from pool;_start_workers_trackedgrows ause_poolflag.e2e_test/fixtures/hooks.py—_pool_sort_keyclusters by(backend, model, nodeid); newpytest_sessionfinishcallscleanup_pool().e2e_test/fixtures/__init__.py,e2e_test/conftest.py— re-exportpytest_sessionfinish.e2e_test/responses/conftest.py—_start_local_grpc_gateway_with_mcpuses the pool; both gRPC fixtures dropstop_workerson teardown.e2e_test/responses/test_*.py(7 files) —gpu(2)→gpu(1)on 15 classes.e2e_test/chat_completions/test_*.py(4 files) —gpu(2)→gpu(1)on 4 classes..github/workflows/pr-test-rust.yml— rename responses job to 1-GPU; deletee2e-2gpu-chat; bumpe2e-1gpu-chattimeouts; updatefinishjob'sneeds/aggregator..github/workflows/e2e-gpu-job.yml— newtest_timeoutinput (default 25) feeding the pytest step.scripts/ci_download_model.sh— resolver usestp <= tier.Test plan
ruff check e2e_test/— passes.ruff format --checkon the touched files — passes (pre-existing format drift in untouched files left alone).mypy e2e_test/on touched files — passes.pre-commit run --files <touched>— passes.--gpu-tier 1→ 11 models,--gpu-tier 2→ 11 (same set, including the previously-tp=2 ones),--gpu-tier 4→ 16. Tier 2 no longer empty.e2e_test/router/keepgpu(2); the rest of the repo is nowgpu(0)(cloud) orgpu(1).conftest.pyandresponses/conftest.pyimport cleanly (withsmg_clientstubbed);_pool_sort_keyclusters as expected.e2e-1gpu-responsesjob finishes under its 30-min cap, log shows ≤ 1Starting sglang grpc worker for openai/gpt-oss-20band ≤ 1 forQwen/Qwen2.5-14B-Instruct.e2e-1gpu-chatmatrix finishes under its bumped timeouts.e2e-2gpu-pdkeeps downloading its tp=1 model.Risk:
_pool_sort_keyreorders pytest collection. Verified nopytest-orderingmarkers and no implicit cross-test ordering deps ine2e_test/. Risk fore2e-1gpu-chat: now runs strictly more tests; defensive timeout bumps + worker-pool savings should leave headroom, but if any matrix slot busts the 45/35/90-min step cap we'll need a follow-up.Checklist
cargo +nightly fmtpasses (no Rust changes)cargo clippy --all-targets --all-features -- -D warningspasses (no Rust changes)ruff/ruff format/mypy/ pre-commit pass on touched filesSummary by CodeRabbit
New Features
Improvements
Changes