qwen3_5_moe: add OpenAI serving entrypoint by mergennachin · Pull Request #20313 · pytorch/executorch

mergennachin · 2026-06-16T21:42:25Z

Add a model-specific OpenAI-compatible serving launcher for Qwen3.5-MoE. The Python process stays as the control plane for HTTP, chat templating, request validation, session affinity, and Qwen tool parsing; model execution stays in the C++ qwen3_5_moe_worker process through the generic examples/llm_server JSONL protocol.

This keeps Qwen-specific serving glue in examples/models/qwen3_5_moe while reusing the generic server runtime. It also keeps the existing C++ runner path intact: the serving entrypoint is a wrapper around the worker/engine path, not a replacement for main.cpp.
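As an illustration of the control-plane/worker split described above, here is a minimal Python sketch of a line-delimited JSON exchange over a child process's stdin/stdout. It is not the examples/llm_server protocol: the stand-in worker, the `id`/`prompt`/`text` fields, and the `roundtrip` helper are hypothetical, chosen only to show one-JSON-object-per-line framing.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for a worker binary: reads one JSON object per line
# from stdin and writes one JSON reply per line to stdout.
_FAKE_WORKER = r"""
import json, sys
for line in sys.stdin:
    req = json.loads(line)
    reply = {"id": req["id"], "text": req["prompt"].upper()}
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()
"""


def roundtrip(requests):
    """Send each request as one JSON line and read one JSON line back."""
    proc = subprocess.Popen(
        [sys.executable, "-c", _FAKE_WORKER],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    replies = []
    try:
        for req in requests:
            proc.stdin.write(json.dumps(req) + "\n")
            proc.stdin.flush()  # the worker only sees complete, flushed lines
            replies.append(json.loads(proc.stdout.readline()))
    finally:
        proc.stdin.close()  # EOF on stdin ends the worker's read loop
        proc.wait(timeout=10)
        proc.stdout.close()
    return replies
```

The parent never imports model code; it only frames requests and parses replies, which is the property the PR description attributes to the Python control plane.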

Add CUDA e2e serving smoke coverage for the Qwen artifact job. The test-model-cuda-e2e job runs in a fresh environment after downloading exported artifacts, so install ExecuTorch in editable mode before invoking python -m executorch.examples.models.qwen3_5_moe.serve.

Validation:

- python -m pytest -q examples/models/qwen3_5_moe/test_serve.py: 8 passed
- python -m py_compile examples/models/qwen3_5_moe/serve.py examples/models/qwen3_5_moe/test_serve.py
- bash -n .ci/scripts/test_model_e2e.sh
- Qwen BFCL serving slice with Pi-style session-affinity headers: 50/56 generated-slice pass rate (89.29%); parallel, parallel_multiple, and irrelevance categories passed 100%; live_multiple was the weakest slice at 4/6.
- Pi integration uses the OpenAI-compatible endpoint plus session_id / x-session-affinity headers. Subagent fanout must run with enough --max-sessions for the concurrent named sessions; otherwise the expected behavior is a 429 capacity_exhausted response instead of silently duplicating model weights.
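The capacity rule in the last item can be sketched as a small admission table. This is an assumption-laden illustration, not the server's implementation: `SessionTable`, `admit`, and the response body shape are hypothetical; only the 429 status and the `capacity_exhausted` name come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class SessionTable:
    """Hypothetical admission table bounded by a max-sessions limit."""

    max_sessions: int
    active: Set[str] = field(default_factory=set)

    def admit(self, session_id: str) -> Tuple[int, Dict[str, str]]:
        # A known session keeps its slot (session affinity).
        if session_id in self.active:
            return 200, {"session": session_id}
        # A new session beyond the limit is rejected rather than
        # triggering a second copy of the model weights.
        if len(self.active) >= self.max_sessions:
            return 429, {"error": "capacity_exhausted"}
        self.active.add(session_id)
        return 200, {"session": session_id}
```

With `max_sessions=2`, two named subagent sessions are admitted and a third distinct session receives 429 until capacity is raised.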

pytorch-bot · 2026-06-16T21:42:30Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20313

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-16T21:43:33Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot

Pull request overview

Adds an OpenAI-compatible serving entrypoint for the Qwen3.5 MoE example model by introducing a model-specific Python launcher (control plane) and a dedicated C++ worker binary (data plane) that speaks the generic examples/llm_server JSONL protocol.

Changes:

- Introduce executorch.examples.models.qwen3_5_moe.serve plus hermetic tests asserting control-plane/model-code separation and correct worker spawn args.
- Add qwen3_5_moe_worker executable target and wire it into Qwen3.5 MoE CMake presets.
- Extend CI to export additional tokenizer files and run a CUDA OpenAI-serving smoke test; document serving usage in the model README.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| examples/models/qwen3_5_moe/test_serve.py | Adds hermetic tests for the serving launcher and separation guarantees. |
| examples/models/qwen3_5_moe/serve.py | New OpenAI-compatible control-plane entrypoint that spawns the worker and builds the FastAPI app. |
| examples/models/qwen3_5_moe/README.md | Documents how to run the server and integrate it with pi. |
| examples/models/qwen3_5_moe/qwen35_moe_worker.cpp | New C++ worker binary for model execution via llm_server JSONL protocol. |
| examples/models/qwen3_5_moe/CMakePresets.json | Adds the worker target to CUDA/Metal build presets. |
| examples/models/qwen3_5_moe/CMakeLists.txt | Defines the `qwen3_5_moe_worker` executable and stripping/link options. |
| .ci/scripts/test_model_e2e.sh | Adds CUDA serving smoke test exercising `/health`, `/v1/models`, and `/v1/chat/completions`. |
| .ci/scripts/export_model_artifact.sh | Exports `tokenizer_config.json` alongside `tokenizer.json` for serving templating. |

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

+    env = dict(os.environ)
+    conda = os.environ.get("CONDA_PREFIX")
+    if conda:
+        env["LD_LIBRARY_PATH"] = f"{conda}/lib:" + env.get("LD_LIBRARY_PATH", "")
+    worker_bin = args.worker_bin or _default_worker_bin()


Gasoonjia · 2026-06-17T22:09:31Z

@claude review this PR plz

Gasoonjia · 2026-06-17T22:09:12Z

@@ -0,0 +1 @@
+../../../examples/llm_server


Is this required?

mergennachin · Jun 17, 2026

Yeah, we have symlinks today:

https://github.com/pytorch/executorch/tree/main/src/executorch

So this is just being consistent.

claude · 2026-06-17T22:10:27Z

Claude finished @Gasoonjia's task in 4m 39s

PR Review: qwen3_5_moe OpenAI serving entrypoint

I read through all 9 changed files plus the surrounding examples/llm_server runtime and setup.py/pyproject.toml packaging. Overall this is clean and the control-plane/data-plane split is well-executed — the worker is a thin model-specific shim over the generic JSONL loop, and the hermetic tests in test_serve.py nicely lock in the "no model code in the control plane" invariant.

Answering @Gasoonjia's question on `src/executorch/examples/llm_server` — is this required?

Yes, it's required, and it's consistent with existing convention. pyproject.toml:117 maps executorch = src/executorch, so in editable installs every executorch.* import resolves under src/executorch/. src/executorch/examples/ is a real directory whose children are committed symlinks back to the top-level examples/ (apple, cuda, models, qualcomm, xnnpack, llm_pte_finetuning already exist this way). serve.py:20-25 imports executorch.examples.llm_server.python.*, so without src/executorch/examples/llm_server -> ../../../examples/llm_server, that import fails in editable mode — which is exactly the mode the new CI smoke test uses (pip install --editable . then python -m executorch.examples.models.qwen3_5_moe.serve). So it's needed and it matches the sibling symlinks. The only thing I'd suggest is making sure this symlink lands in the same PR as whatever first introduced examples/llm_server itself (it may be more naturally owned there than in a Qwen-serving PR), but functionally it belongs.
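The symlink mechanism described above can be reproduced in a throwaway directory. The sketch below assumes a POSIX filesystem and uses a hypothetical package name, `demo_pkg`, in place of `executorch` so it cannot collide with a real install; the relative link target mirrors the one in this PR.

```python
import importlib
import os
import sys
import tempfile
from pathlib import Path


def build_demo_tree(root: Path) -> Path:
    """Create <root>/examples/llm_server plus a symlink to it under src/."""
    real = root / "examples" / "llm_server"
    real.mkdir(parents=True)
    (real / "__init__.py").write_text("MARKER = 'llm_server'\n")

    pkg_examples = root / "src" / "demo_pkg" / "examples"
    pkg_examples.mkdir(parents=True)
    (root / "src" / "demo_pkg" / "__init__.py").write_text("")
    (pkg_examples / "__init__.py").write_text("")

    # Relative targets resolve from the link's own directory:
    # src/demo_pkg/examples -> up three levels -> <root>/examples/llm_server.
    os.symlink("../../../examples/llm_server", pkg_examples / "llm_server")
    return root / "src"


def import_through_symlink() -> str:
    """Import the symlinked subpackage as if <root>/src were the package root."""
    with tempfile.TemporaryDirectory() as tmp:
        src = build_demo_tree(Path(tmp))
        sys.path.insert(0, str(src))
        importlib.invalidate_caches()
        try:
            module = importlib.import_module("demo_pkg.examples.llm_server")
            return module.MARKER
        finally:
            # Undo the path and module-cache changes made for the demo.
            sys.path.remove(str(src))
            for name in [n for n in sys.modules if n.split(".")[0] == "demo_pkg"]:
                del sys.modules[name]
```

Removing the `os.symlink` call makes the import raise `ModuleNotFoundError`, which is the editable-mode failure the review comment describes.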

Findings

1. LD_LIBRARY_PATH trailing-colon footgun — serve.py:64 (Copilot also flagged this, still open)

env["LD_LIBRARY_PATH"] = f"{conda}/lib:" + env.get("LD_LIBRARY_PATH", "")

When LD_LIBRARY_PATH is unset, this yields "{conda}/lib:" with a trailing colon. An empty path element makes the Linux dynamic loader search the current working directory, which is undesirable and a mild security risk. Build without empty elements:

existing = env.get("LD_LIBRARY_PATH", "")
env["LD_LIBRARY_PATH"] = f"{conda}/lib:{existing}" if existing else f"{conda}/lib"
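A slightly more defensive variant of the same idea, written as a standalone helper, is sketched below. The function name `prepend_ld_library_path` is hypothetical and is not part of serve.py; beyond the suggestion above, it also drops empty elements that already exist in the inherited value (for example `"/a::/b:"`).

```python
from typing import Dict, Mapping


def prepend_ld_library_path(env: Mapping[str, str], lib_dir: str) -> Dict[str, str]:
    """Return a copy of env with lib_dir first and no empty path elements."""
    new_env = dict(env)
    # Split on ':' and discard empty strings, since an empty element is
    # treated by the loader as the current working directory.
    existing = [p for p in new_env.get("LD_LIBRARY_PATH", "").split(":") if p]
    new_env["LD_LIBRARY_PATH"] = ":".join([lib_dir, *existing])
    return new_env
```

For instance, `prepend_ld_library_path({}, "/opt/conda/lib")` yields a value with no trailing colon, and an inherited `"/a::/b:"` is normalised to `"/opt/conda/lib:/a:/b"`.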

2. Earlier Copilot comments about parents[3]/parents[2] are stale. The code no longer uses positional parents[...]; serve._repo_root() (serve.py:30-46) walks upward looking for a dir containing both CMakeLists.txt and examples/models/qwen3_5_moe, and test_serve.py:17 reuses it. This is robust to the src/executorch symlink layout. Likewise the "file read twice" comment is stale — test_serve.py:29 now reads once into text. Nothing to do here; just confirming for reviewers.
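A marker-based upward walk of the kind described in finding 2 might look like the following. This is a sketch under assumptions, not the code in serve.py: `find_repo_root` is a hypothetical name, and whether the real `_repo_root()` resolves symlinks first is not stated in the review.

```python
from pathlib import Path
from typing import Optional


def find_repo_root(start: Path) -> Optional[Path]:
    """Walk upward from start until both repository markers are present."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        has_cmake = (candidate / "CMakeLists.txt").is_file()
        has_model_dir = (candidate / "examples" / "models" / "qwen3_5_moe").is_dir()
        # Requiring both markers skips nested directories that merely
        # contain their own CMakeLists.txt.
        if has_cmake and has_model_dir:
            return candidate
    return None
```

Unlike `Path(__file__).parents[3]`, this does not depend on how many directory levels separate the file from the root.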

3. Deprecated FastAPI shutdown hook — serve.py:98. @app.on_event("shutdown") is deprecated in current Starlette/FastAPI in favor of lifespan handlers. It still works, but if build_app is shared infra you may eventually get deprecation noise. Low priority; only worth aligning with whatever the rest of examples/llm_server does.
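For reference, the lifespan style mentioned in finding 3 is an async context manager whose cleanup runs after the `yield`. The sketch below exercises that shape without importing FastAPI; `StubWorker` and `make_lifespan` are hypothetical names, and how `build_app` would actually wire this in is an assumption.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator, List

events: List[str] = []


class StubWorker:
    """Hypothetical stand-in for the spawned worker process handle."""

    def start(self) -> None:
        events.append("start")

    def stop(self) -> None:
        events.append("stop")


def make_lifespan(worker: StubWorker):
    @asynccontextmanager
    async def lifespan(app) -> AsyncIterator[None]:
        worker.start()  # startup work goes before the yield
        try:
            yield
        finally:
            worker.stop()  # replaces an @app.on_event("shutdown") hook

    return lifespan


async def _demo() -> None:
    lifespan = make_lifespan(StubWorker())
    # A framework would enter this context around the app's serving period.
    async with lifespan(None):
        events.append("serving")


def run_demo() -> List[str]:
    events.clear()
    asyncio.run(_demo())
    return list(events)
```

With FastAPI installed, the same callable would be passed as `FastAPI(lifespan=make_lifespan(worker))`.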

4. --num-runners is effectively a no-op flag — serve.py:135-141,165-169. It only ever validates == 1 and otherwise errors. That's a fine way to document the single-weight-load constraint, but consider whether exposing a knob that has exactly one legal value is worth the surface area versus just documenting it. Not blocking.
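The "one legal value" pattern in finding 4 can be sketched with argparse as follows. Only the `--num-runners` flag name and the `== 1` constraint come from the review; the parser name, help text, and error wording are hypothetical.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="serve-sketch")
    parser.add_argument(
        "--num-runners",
        type=int,
        default=1,
        help="Number of worker processes; only 1 is accepted in this sketch.",
    )
    args = parser.parse_args(argv)
    if args.num_runners != 1:
        # parser.error prints usage to stderr and exits with status 2.
        parser.error("--num-runners must be 1 (a single weight load is supported)")
    return args
```

The flag therefore documents the constraint at the CLI rather than changing behaviour, which is the trade-off the finding raises.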

Things that look correct

- qwen35_moe_worker.cpp mirrors the runner's engine construction and correctly delegates to run_worker_stdio_loop; arg validation and max_sessions clamping match worker_loop.h:401 and qwen35_moe_engine.h:36.
- CMakeLists.txt:72-83 adds the worker target consistently (same include dirs, link libs, and -s/gc-sections stripping as the runner).
- _spawn correctly omits --data_path when empty (serve.py:73) and the test covers both the explicit-bin and default-bin paths.
- The e2e smoke test (test_model_e2e.sh:450-549) is well-guarded (CUDA + qwen-only), picks a free port, captures+dumps the server log on failure, and cleans up the subprocess via trap. The editable reinstall before python -m ...serve is the right call for the fresh artifact job.
- export_model_artifact.sh now also copies tokenizer_config.json, which --hf-tokenizer <dir> needs for the chat template; this matches the smoke test passing --hf-tokenizer "${MODEL_DIR}".
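Two of the smoke-test steps named above, picking a free port and waiting for `/health`, can be sketched in Python. The CI script itself is bash; the helper names, timeouts, and the assumption that `/health` answers with HTTP 200 are illustrative only.

```python
import socket
import time
import urllib.request


def pick_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def wait_for_health(port: int, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Poll /health until it returns 200 or the deadline passes."""
    url = f"http://127.0.0.1:{port}/health"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused, timeouts, and HTTP errors all mean
            # "not ready yet" for the purposes of this sketch.
            pass
        time.sleep(interval_s)
    return False
```

There is a small window between releasing the probed port and the server binding it, so a caller should still treat a failed bind as retryable.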

Note on CI

The 7 "new failures" in Dr. CI (arm-backend tosa/ethos-u/vkml/zephyr, qnn-roberta, gemma3-multimodal) are all docker exec ... /exec failed infra errors unrelated to this PR's Python/CMake changes. Worth a rebase onto viable/strict to clear them.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.


+if [ "$DEVICE" = "cuda" ] && [ "$MODEL_NAME" = "qwen3_5_moe" ]; then
+  echo "::group::Run $MODEL_NAME OpenAI serving smoke"
+  pip install -r examples/llm_server/python/requirements.txt "transformers==5.0.0rc1"


+#include <executorch/examples/models/qwen3_5_moe/qwen35_moe_engine.h>
+#include <executorch/runtime/platform/log.h>
+
+#include <cstdint>


Copilot AI review requested due to automatic review settings June 16, 2026 21:42

mergennachin requested review from kirklandsign and larryliu0820 as code owners June 16, 2026 21:42

meta-cla Bot added the CLA Signed label Jun 16, 2026

mergennachin marked this pull request as draft June 16, 2026 21:42

Copilot started reviewing on behalf of mergennachin June 16, 2026 21:42

Copilot AI reviewed Jun 16, 2026

Comment thread examples/models/qwen3_5_moe/serve.py

Comment thread examples/models/qwen3_5_moe/test_serve.py

Comment thread examples/models/qwen3_5_moe/test_serve.py Outdated

mergennachin force-pushed the llm-qwen35-moe-serving branch from 13b2ff0 to 769c740 Compare June 17, 2026 16:11

mergennachin requested review from Gasoonjia, digantdesai and seyeong-han June 17, 2026 21:58

mergennachin marked this pull request as ready for review June 17, 2026 21:58

Copilot AI review requested due to automatic review settings June 17, 2026 21:58

Copilot started reviewing on behalf of mergennachin June 17, 2026 21:58

Copilot AI reviewed Jun 17, 2026

Comment thread examples/models/qwen3_5_moe/serve.py

Comment on lines +61 to +65

    env = dict(os.environ)
    conda = os.environ.get("CONDA_PREFIX")
    if conda:
        env["LD_LIBRARY_PATH"] = f"{conda}/lib:" + env.get("LD_LIBRARY_PATH", "")
    worker_bin = args.worker_bin or _default_worker_bin()

Gasoonjia reviewed Jun 17, 2026

mergennachin force-pushed the llm-qwen35-moe-serving branch from 36cd753 to 6dd1502 Compare June 18, 2026 15:18

Gasoonjia approved these changes Jun 18, 2026

Copilot AI review requested due to automatic review settings June 18, 2026 18:14

mergennachin force-pushed the llm-qwen35-moe-serving branch from 6dd1502 to cb860d8 Compare June 18, 2026 18:14

Copilot started reviewing on behalf of mergennachin June 18, 2026 18:14

mergennachin merged commit c9ef423 into main Jun 18, 2026
418 of 430 checks passed

mergennachin deleted the llm-qwen35-moe-serving branch June 18, 2026 18:16

Copilot AI reviewed Jun 18, 2026
