qwen3_5_moe: add OpenAI serving entrypoint#20313
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20313
Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes. |
This PR needs a
|
There was a problem hiding this comment.
Pull request overview
Adds an OpenAI-compatible serving entrypoint for the Qwen3.5 MoE example model by introducing a model-specific Python launcher (control plane) and a dedicated C++ worker binary (data plane) that speaks the generic examples/llm_server JSONL protocol.
Changes:
- Introduce
executorch.examples.models.qwen3_5_moe.serveplus hermetic tests asserting control-plane/model-code separation and correct worker spawn args. - Add
qwen3_5_moe_workerexecutable target and wire it into Qwen3.5 MoE CMake presets. - Extend CI to export additional tokenizer files and run a CUDA OpenAI-serving smoke test; document serving usage in the model README.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| examples/models/qwen3_5_moe/test_serve.py | Adds hermetic tests for the serving launcher and separation guarantees. |
| examples/models/qwen3_5_moe/serve.py | New OpenAI-compatible control-plane entrypoint that spawns the worker and builds the FastAPI app. |
| examples/models/qwen3_5_moe/README.md | Documents how to run the server and integrate it with pi. |
| examples/models/qwen3_5_moe/qwen35_moe_worker.cpp | New C++ worker binary for model execution via llm_server JSONL protocol. |
| examples/models/qwen3_5_moe/CMakePresets.json | Adds the worker target to CUDA/Metal build presets. |
| examples/models/qwen3_5_moe/CMakeLists.txt | Defines the qwen3_5_moe_worker executable and stripping/link options. |
| .ci/scripts/test_model_e2e.sh | Adds CUDA serving smoke test exercising /health, /v1/models, and /v1/chat/completions. |
| .ci/scripts/export_model_artifact.sh | Exports tokenizer_config.json alongside tokenizer.json for serving templating. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
13b2ff0 to
769c740
Compare
| env = dict(os.environ) | ||
| conda = os.environ.get("CONDA_PREFIX") | ||
| if conda: | ||
| env["LD_LIBRARY_PATH"] = f"{conda}/lib:" + env.get("LD_LIBRARY_PATH", "") | ||
| worker_bin = args.worker_bin or _default_worker_bin() |
|
@claude review this PR plz |
| @@ -0,0 +1 @@ | |||
| ../../../examples/llm_server No newline at end of file | |||
There was a problem hiding this comment.
Yeah, we have symlinks today
https://github.com/pytorch/executorch/tree/main/src/executorch
So just being consistent
|
Claude finished @Gasoonjia's task in 4m 39s —— View job PR Review: qwen3_5_moe OpenAI serving entrypointI read through all 9 changed files plus the surrounding Answering @Gasoonjia's question on
|
36cd753 to
6dd1502
Compare
Add a model-specific OpenAI-compatible serving launcher for Qwen3.5-MoE. The Python process stays as the control plane for HTTP, chat templating, request validation, session affinity, and Qwen tool parsing; model execution stays in the C++ qwen3_5_moe_worker process through the generic examples/llm_server JSONL protocol. This keeps Qwen-specific serving glue in examples/models/qwen3_5_moe while reusing the generic server runtime. It also keeps the existing C++ runner path intact: the serving entrypoint is a wrapper around the worker/engine path, not a replacement for main.cpp. Add CUDA e2e serving smoke coverage for the Qwen artifact job. The test-model-cuda-e2e job runs in a fresh environment after downloading exported artifacts, so install ExecuTorch in editable mode before invoking python -m executorch.examples.models.qwen3_5_moe.serve. Validation: - python -m pytest -q examples/models/qwen3_5_moe/test_serve.py: 8 passed - python -m py_compile examples/models/qwen3_5_moe/serve.py examples/models/qwen3_5_moe/test_serve.py - bash -n .ci/scripts/test_model_e2e.sh - Qwen BFCL serving slice with Pi-style session-affinity headers: 50/56 generated-slice pass rate (89.29%); parallel, parallel_multiple, and irrelevance categories passed 100%; live_multiple was the weakest slice at 4/6. - Pi integration uses the OpenAI-compatible endpoint plus session_id / x-session-affinity headers. Subagent fanout must run with enough --max-sessions for the concurrent named sessions; otherwise the expected behavior is a 429 capacity_exhausted response instead of silently duplicating model weights.
6dd1502 to
cb860d8
Compare
|
|
||
| if [ "$DEVICE" = "cuda" ] && [ "$MODEL_NAME" = "qwen3_5_moe" ]; then | ||
| echo "::group::Run $MODEL_NAME OpenAI serving smoke" | ||
| pip install -r examples/llm_server/python/requirements.txt "transformers==5.0.0rc1" |
| #include <executorch/examples/models/qwen3_5_moe/qwen35_moe_engine.h> | ||
| #include <executorch/runtime/platform/log.h> | ||
|
|
||
| #include <cstdint> |
Add a model-specific OpenAI-compatible serving launcher for Qwen3.5-MoE. The Python process stays as the control plane for HTTP, chat templating, request validation, session affinity, and Qwen tool parsing; model execution stays in the C++ qwen3_5_moe_worker process through the generic examples/llm_server JSONL protocol.
This keeps Qwen-specific serving glue in examples/models/qwen3_5_moe while reusing the generic server runtime. It also keeps the existing C++ runner path intact: the serving entrypoint is a wrapper around the worker/engine path, not a replacement for main.cpp.
Add CUDA e2e serving smoke coverage for the Qwen artifact job. The test-model-cuda-e2e job runs in a fresh environment after downloading exported artifacts, so install ExecuTorch in editable mode before invoking python -m executorch.examples.models.qwen3_5_moe.serve.
Validation:
python -m pytest -q examples/models/qwen3_5_moe/test_serve.py: 8 passed
python -m py_compile examples/models/qwen3_5_moe/serve.py examples/models/qwen3_5_moe/test_serve.py
bash -n .ci/scripts/test_model_e2e.sh
Qwen BFCL serving slice with Pi-style session-affinity headers: 50/56 generated-slice pass rate (89.29%); parallel, parallel_multiple, and irrelevance categories passed 100%; live_multiple was the weakest slice at 4/6.
Pi integration uses the OpenAI-compatible endpoint plus session_id / x-session-affinity headers. Subagent fanout must run with enough --max-sessions for the concurrent named sessions; otherwise the expected behavior is a 429 capacity_exhausted response instead of silently duplicating model weights.