Skip to content

qwen3_5_moe: add OpenAI serving entrypoint#20313

Merged
mergennachin merged 1 commit into
mainfrom
llm-qwen35-moe-serving
Jun 18, 2026
Merged

qwen3_5_moe: add OpenAI serving entrypoint#20313
mergennachin merged 1 commit into
mainfrom
llm-qwen35-moe-serving

Conversation

@mergennachin

@mergennachin mergennachin commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Add a model-specific OpenAI-compatible serving launcher for Qwen3.5-MoE. The Python process stays as the control plane for HTTP, chat templating, request validation, session affinity, and Qwen tool parsing; model execution stays in the C++ qwen3_5_moe_worker process through the generic examples/llm_server JSONL protocol.

This keeps Qwen-specific serving glue in examples/models/qwen3_5_moe while reusing the generic server runtime. It also keeps the existing C++ runner path intact: the serving entrypoint is a wrapper around the worker/engine path, not a replacement for main.cpp.

Add CUDA e2e serving smoke coverage for the Qwen artifact job. The test-model-cuda-e2e job runs in a fresh environment after downloading exported artifacts, so install ExecuTorch in editable mode before invoking python -m executorch.examples.models.qwen3_5_moe.serve.

Validation:

  • python -m pytest -q examples/models/qwen3_5_moe/test_serve.py: 8 passed

  • python -m py_compile examples/models/qwen3_5_moe/serve.py examples/models/qwen3_5_moe/test_serve.py

  • bash -n .ci/scripts/test_model_e2e.sh

  • Qwen BFCL serving slice with Pi-style session-affinity headers: 50/56 generated-slice pass rate (89.29%); parallel, parallel_multiple, and irrelevance categories passed 100%; live_multiple was the weakest slice at 4/6.

  • Pi integration uses the OpenAI-compatible endpoint plus session_id / x-session-affinity headers. Subagent fanout must run with enough --max-sessions for the concurrent named sessions; otherwise the expected behavior is a 429 capacity_exhausted response instead of silently duplicating model weights.

Copilot AI review requested due to automatic review settings June 16, 2026 21:42
@pytorch-bot

pytorch-bot Bot commented Jun 16, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20313

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 16, 2026
@mergennachin mergennachin marked this pull request as draft June 16, 2026 21:42
@github-actions

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an OpenAI-compatible serving entrypoint for the Qwen3.5 MoE example model by introducing a model-specific Python launcher (control plane) and a dedicated C++ worker binary (data plane) that speaks the generic examples/llm_server JSONL protocol.

Changes:

  • Introduce executorch.examples.models.qwen3_5_moe.serve plus hermetic tests asserting control-plane/model-code separation and correct worker spawn args.
  • Add qwen3_5_moe_worker executable target and wire it into Qwen3.5 MoE CMake presets.
  • Extend CI to export additional tokenizer files and run a CUDA OpenAI-serving smoke test; document serving usage in the model README.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
examples/models/qwen3_5_moe/test_serve.py Adds hermetic tests for the serving launcher and separation guarantees.
examples/models/qwen3_5_moe/serve.py New OpenAI-compatible control-plane entrypoint that spawns the worker and builds the FastAPI app.
examples/models/qwen3_5_moe/README.md Documents how to run the server and integrate it with pi.
examples/models/qwen3_5_moe/qwen35_moe_worker.cpp New C++ worker binary for model execution via llm_server JSONL protocol.
examples/models/qwen3_5_moe/CMakePresets.json Adds the worker target to CUDA/Metal build presets.
examples/models/qwen3_5_moe/CMakeLists.txt Defines the qwen3_5_moe_worker executable and stripping/link options.
.ci/scripts/test_model_e2e.sh Adds CUDA serving smoke test exercising /health, /v1/models, and /v1/chat/completions.
.ci/scripts/export_model_artifact.sh Exports tokenizer_config.json alongside tokenizer.json for serving templating.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread examples/models/qwen3_5_moe/serve.py
Comment thread examples/models/qwen3_5_moe/test_serve.py
Comment thread examples/models/qwen3_5_moe/test_serve.py Outdated
@mergennachin mergennachin temporarily deployed to upload-benchmark-results June 16, 2026 23:01 — with GitHub Actions Inactive
@mergennachin mergennachin force-pushed the llm-qwen35-moe-serving branch from 13b2ff0 to 769c740 Compare June 17, 2026 16:11
@mergennachin mergennachin temporarily deployed to upload-benchmark-results June 17, 2026 18:04 — with GitHub Actions Inactive
@mergennachin mergennachin marked this pull request as ready for review June 17, 2026 21:58
Copilot AI review requested due to automatic review settings June 17, 2026 21:58
@mergennachin mergennachin temporarily deployed to upload-benchmark-results June 17, 2026 21:58 — with GitHub Actions Inactive

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Comment on lines +61 to +65
env = dict(os.environ)
conda = os.environ.get("CONDA_PREFIX")
if conda:
env["LD_LIBRARY_PATH"] = f"{conda}/lib:" + env.get("LD_LIBRARY_PATH", "")
worker_bin = args.worker_bin or _default_worker_bin()
@Gasoonjia

Copy link
Copy Markdown
Contributor

@claude review this PR plz

@@ -0,0 +1 @@
../../../examples/llm_server No newline at end of file

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this required?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we have symlinks today

https://github.com/pytorch/executorch/tree/main/src/executorch

So just being consistent

@claude

claude Bot commented Jun 17, 2026

Copy link
Copy Markdown

Claude finished @Gasoonjia's task in 4m 39s —— View job


PR Review: qwen3_5_moe OpenAI serving entrypoint

I read through all 9 changed files plus the surrounding examples/llm_server runtime and setup.py/pyproject.toml packaging. Overall this is clean and the control-plane/data-plane split is well-executed — the worker is a thin model-specific shim over the generic JSONL loop, and the hermetic tests in test_serve.py nicely lock in the "no model code in the control plane" invariant.

Answering @Gasoonjia's question on src/executorch/examples/llm_serveris this required?

Yes, it's required, and it's consistent with existing convention. pyproject.toml:117 maps executorch = src/executorch, so in editable installs every executorch.* import resolves under src/executorch/. src/executorch/examples/ is a real directory whose children are committed symlinks back to the top-level examples/ (apple, cuda, models, qualcomm, xnnpack, llm_pte_finetuning already exist this way). serve.py:20-25 imports executorch.examples.llm_server.python.*, so without src/executorch/examples/llm_server -> ../../../examples/llm_server, that import fails in editable mode — which is exactly the mode the new CI smoke test uses (pip install --editable . then python -m executorch.examples.models.qwen3_5_moe.serve). So it's needed and it matches the sibling symlinks. The only thing I'd suggest is making sure this symlink lands in the same PR as whatever first introduced examples/llm_server itself (it may be more naturally owned there than in a Qwen-serving PR), but functionally it belongs.

Findings

1. LD_LIBRARY_PATH trailing-colon footgun — serve.py:64 (Copilot also flagged this, still open)

env["LD_LIBRARY_PATH"] = f"{conda}/lib:" + env.get("LD_LIBRARY_PATH", "")

When LD_LIBRARY_PATH is unset, this yields "{conda}/lib:" with a trailing colon. An empty path element makes the Linux dynamic loader search the current working directory, which is undesirable and a mild security risk. Build without empty elements:

existing = env.get("LD_LIBRARY_PATH", "")
env["LD_LIBRARY_PATH"] = f"{conda}/lib:{existing}" if existing else f"{conda}/lib"

Fix this →

2. Earlier Copilot comments about parents[3]/parents[2] are stale. The code no longer uses positional parents[...]; serve._repo_root() (serve.py:30-46) walks upward looking for a dir containing both CMakeLists.txt and examples/models/qwen3_5_moe, and test_serve.py:17 reuses it. This is robust to the src/executorch symlink layout. Likewise the "file read twice" comment is stale — test_serve.py:29 now reads once into text. Nothing to do here; just confirming for reviewers.

3. Deprecated FastAPI shutdown hook — serve.py:98. @app.on_event("shutdown") is deprecated in current Starlette/FastAPI in favor of lifespan handlers. It still works, but if build_app is shared infra you may eventually get deprecation noise. Low priority; only worth aligning with whatever the rest of examples/llm_server does.

4. --num-runners is effectively a no-op flag — serve.py:135-141,165-169. It only ever validates == 1 and otherwise errors. That's a fine way to document the single-weight-load constraint, but consider whether exposing a knob that has exactly one legal value is worth the surface area versus just documenting it. Not blocking.

Things that look correct

  • qwen35_moe_worker.cpp mirrors the runner's engine construction and correctly delegates to run_worker_stdio_loop; arg validation and max_sessions clamping match worker_loop.h:401 and qwen35_moe_engine.h:36.
  • CMakeLists.txt:72-83 adds the worker target consistently (same include dirs, link libs, and -s/gc-sections stripping as the runner).
  • _spawn correctly omits --data_path when empty (serve.py:73) and the test covers both the explicit-bin and default-bin paths.
  • The e2e smoke test (test_model_e2e.sh:450-549) is well-guarded (CUDA + qwen-only), picks a free port, captures+dumps the server log on failure, and cleans up the subprocess via trap. The editable reinstall before python -m ...serve is the right call for the fresh artifact job.
  • export_model_artifact.sh now also copies tokenizer_config.json, which --hf-tokenizer <dir> needs for the chat template — matches the smoke test passing --hf-tokenizer "${MODEL_DIR}".

Note on CI

The 7 "new failures" in Dr. CI (arm-backend tosa/ethos-u/vkml/zephyr, qnn-roberta, gemma3-multimodal) are all docker exec ... /exec failed infra errors unrelated to this PR's Python/CMake changes. Worth a rebase onto viable/strict to clear them.
· branch llm-qwen35-moe-serving

@mergennachin mergennachin force-pushed the llm-qwen35-moe-serving branch from 36cd753 to 6dd1502 Compare June 18, 2026 15:18
@mergennachin mergennachin temporarily deployed to upload-benchmark-results June 18, 2026 16:52 — with GitHub Actions Inactive
Add a model-specific OpenAI-compatible serving launcher for Qwen3.5-MoE. The Python process stays as the control plane for HTTP, chat templating, request validation, session affinity, and Qwen tool parsing; model execution stays in the C++ qwen3_5_moe_worker process through the generic examples/llm_server JSONL protocol.

This keeps Qwen-specific serving glue in examples/models/qwen3_5_moe while reusing the generic server runtime. It also keeps the existing C++ runner path intact: the serving entrypoint is a wrapper around the worker/engine path, not a replacement for main.cpp.

Add CUDA e2e serving smoke coverage for the Qwen artifact job. The test-model-cuda-e2e job runs in a fresh environment after downloading exported artifacts, so install ExecuTorch in editable mode before invoking python -m executorch.examples.models.qwen3_5_moe.serve.

Validation:

- python -m pytest -q examples/models/qwen3_5_moe/test_serve.py: 8 passed

- python -m py_compile examples/models/qwen3_5_moe/serve.py examples/models/qwen3_5_moe/test_serve.py

- bash -n .ci/scripts/test_model_e2e.sh

- Qwen BFCL serving slice with Pi-style session-affinity headers: 50/56 generated-slice pass rate (89.29%); parallel, parallel_multiple, and irrelevance categories passed 100%; live_multiple was the weakest slice at 4/6.

- Pi integration uses the OpenAI-compatible endpoint plus session_id / x-session-affinity headers. Subagent fanout must run with enough --max-sessions for the concurrent named sessions; otherwise the expected behavior is a 429 capacity_exhausted response instead of silently duplicating model weights.
Copilot AI review requested due to automatic review settings June 18, 2026 18:14
@mergennachin mergennachin force-pushed the llm-qwen35-moe-serving branch from 6dd1502 to cb860d8 Compare June 18, 2026 18:14
@mergennachin mergennachin merged commit c9ef423 into main Jun 18, 2026
418 of 430 checks passed
@mergennachin mergennachin deleted the llm-qwen35-moe-serving branch June 18, 2026 18:16

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.


if [ "$DEVICE" = "cuda" ] && [ "$MODEL_NAME" = "qwen3_5_moe" ]; then
echo "::group::Run $MODEL_NAME OpenAI serving smoke"
pip install -r examples/llm_server/python/requirements.txt "transformers==5.0.0rc1"
#include <executorch/examples/models/qwen3_5_moe/qwen35_moe_engine.h>
#include <executorch/runtime/platform/log.h>

#include <cstdint>
@mergennachin mergennachin temporarily deployed to upload-benchmark-results June 18, 2026 19:53 — with GitHub Actions Inactive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants