Releases: youssofal/MTPLX
MTPLX v0.3.6
Production patch over v0.3.5 focused on the public release pillars: bounded memory, no silent decode/prefill tradeoff, and clean CLI UX.
Highlights
- Fixes AIME-shaped `max_tokens=65536` memory behavior by bounding initial new-token KV reservation while preserving real prompt-context allocation.
- Avoids retaining full-capacity live cache refs for anonymous one-off sessions.
- Improves OpenCode tool-result turns so stable cached prefixes are reused instead of cold-prefilling the full history.
- Ships Tune in the packaged CLI: `mtplx tune`, `mtplx-tune`, and `mtplx bench tune` (a usage sketch follows this list).
- Fixes verified-default onboarding/model labeling for the installed Optimized Speed/Q4 artifact.
- Adds `bench tune` chip diagnostics with power, frequency, temperature, utilization, fan, and thermal-pressure telemetry; generation-window scope is labeled when available.
- Tightens README claims so paired same-machine speedup is not described as hardware-independent.
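A minimal usage sketch of the Tune entry points shipped in this release; only the command names above come from these notes, and no additional flags are assumed.

```bash
# Tune entry points packaged in v0.3.6; flags are intentionally omitted
# because these notes do not specify them.
mtplx tune          # Tune via the main CLI
mtplx-tune          # standalone console-script form
mtplx bench tune    # Tune through the bench path, with chip diagnostics
```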
Validation
- Local: compileall, ruff, full pytest, twine check, fresh venv smoke, git diff check (a command sketch follows this list).
- CLI UX: `mtplx --version`, OpenCode dry-run, Pi dry-run, `mtplx-tune` dry-run, and `bench tune` dry-run.
- GitHub PR #66: `repository-hygiene`, `wheel`, and `no-mlx-smoke` all passed before merge.
- Release workflow: trusted PyPI publishing workflow passed for v0.3.6.
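For reference, a hedged sketch of the kind of local commands the validation line above refers to; the exact targets and paths are assumptions rather than values from the release.

```bash
# Illustrative local validation pass; paths and arguments are assumptions.
python -m compileall .       # byte-compile check
ruff check .                 # lint
pytest                       # full test suite
python -m build              # build sdist/wheel for the next step
twine check dist/*           # PyPI artifact check
git diff --check             # whitespace/conflict-marker check
```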
Known Non-Claims
MTPLX v0.3.5
What's Changed
- Fixed OpenCode tool-result turns cold-prefilling the full conversation history. Follow-up OpenCode turns now reuse the stable SessionBank prefix instead of sitting at `Thinking...` for minutes.
- Fixed unsafe stream postcommit prefix anchoring so streamed assistant/tool histories do not poison the next cache boundary.
- Locked down the real-world consecutive Qwen XML tool-call regression so back-to-back tool calls stay structured and do not leak raw XML.
Validation
- Targeted server/tool/OpenCode pytest suite passed.
- Built and checked PyPI artifacts with `twine check`.
- Ran real local MTPLX server, streaming API, Android Studio doctor, OpenCode CLI, and Pi CLI smoke tests against the local Optimized Speed model.
- Verified PyPI `mtplx==0.3.5` fresh venv install and Homebrew `youssofal/mtplx/mtplx` 0.3.5 install/test (an install sketch follows this list).
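A hedged sketch of the fresh-install checks above, assuming standard pip and Homebrew usage; the temporary venv location is illustrative.

```bash
# Fresh-venv PyPI install check; the venv location is illustrative.
python -m venv /tmp/mtplx-venv
source /tmp/mtplx-venv/bin/activate
pip install mtplx==0.3.5
mtplx --version

# Homebrew tap install check.
brew install youssofal/mtplx/mtplx
```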
MTPLX v0.3.4
MTPLX v0.3.4 is a patch release for coding-agent UX and serving compatibility.
Highlights:
- Idle SessionBank postcommit is now cooperative and preemptible, so foreground Pi/OpenCode/agent turns do not sit silently behind long background cache work.
- Consecutive Qwen XML tool calls stay structured while streaming as OpenAI delta.tool_calls.
- Swival is now available through the start/integrate handoff flow.
- The locked indirect urllib3 dependency is updated to 2.7.0.
Validation:
- Local focused pytest and CI-mirror subsets passed.
- python -m build and twine check passed for the wheel and sdist.
- Fresh venv no-MLX CLI smoke passed.
- Real max-fan CLI generation with the local Optimized Speed model loaded and answered successfully.
- GitHub Actions build, hygiene, and ci are green on the release commit.
MTPLX v0.3.3
Patch release for OpenAI-compatible serving clients and coding-agent tool UX.
Added
- `mtplx doctor android-studio` for model discovery, nonstream chat, streaming chat, and tool-bearing request smoke.
- Android Studio/OpenAI-compatible request-shape tolerance for `max_completion_tokens`, `stream_options`, `response_format`, `metadata`, and `parallel_tool_calls` (a request sketch follows this list).
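A hedged request sketch showing the tolerated fields against the local `/v1/chat/completions` endpoint; the host, port, API key, and model id are placeholders, not values from this release.

```bash
# Illustrative Android Studio-shaped request; host, port, key, and model id
# are placeholders. The extra fields are the ones now tolerated.
curl -sN http://127.0.0.1:8080/v1/chat/completions \
  -H "Authorization: Bearer sk-local-placeholder" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "optimized-speed",
    "stream": true,
    "stream_options": {"include_usage": true},
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_completion_tokens": 64,
    "response_format": {"type": "text"},
    "metadata": {"client": "android-studio"},
    "parallel_tool_calls": false
  }'
```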
Changed
- Qwen XML tool calls now stream OpenAI `delta.tool_calls` incrementally, so compatible clients can mount file-write/edit cards before the full argument body finishes (a chunk-shape sketch follows this list).
- Pi handoff no longer writes a hidden model-level `maxTokens` cap.
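A hedged sketch of the incremental chunk shape compatible clients now receive, assuming the standard OpenAI SSE format; the function name and argument fragments are illustrative (the `write_file` tool itself is exercised in the QA notes below).

```
data: {"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"name":"write_file","arguments":""}}]}}]}
data: {"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\"path\": \"notes.txt\""}}]}}]}
data: {"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"arguments":", \"content\": \"...\"}"}}]}}]}
data: [DONE]
```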
Fixed
- Fixed Android Studio issue #58 where `/v1/chat/completions` could fail as `500: null`.
- Hardened malformed, unknown, unclosed, or schema-invalid tool-call output so it falls back safely instead of hanging or storing raw XML as successful assistant tool history.
QA
- Local source tests: public CLI, OpenCode/onboarding, OpenAI bridge, server, and streaming tool-call translator passed.
- Built wheel/sdist and passed `twine check` plus fresh no-MLX venv smoke.
- Installed the built wheel in a clean temp environment and verified `mtplx 0.3.3` plus OpenCode dry-run with no hidden `maxTokens`.
- Real local CLI smoke loaded the Optimized Speed model and generated through the MTP path.
- Real server/API smoke verified `/health`, `max_completion_tokens`, OpenAI-shaped invalid-request errors, Android doctor, and incremental streaming `write_file` tool-call deltas.
MTPLX v0.3.2
OpenCode and long-context prefill polish release.
Added
- First-class OpenCode Desktop integration in `mtplx start` with OpenAI-compatible provider config generation.
- OpenCode reasoning configured through `reasoning_content`, tool-call support metadata, long chunk timeout, and no hidden max-token cap.
- `mtplx doctor opencode` diagnostics for provider config, server health, reasoning, and tool-call compatibility (a CLI sketch follows this list).
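A minimal sketch of the flow named above; only `mtplx start` and `mtplx doctor opencode` come from these notes, and the interactive choices inside the start wizard are not shown.

```bash
# Offer OpenCode Desktop integration and generate the OpenAI-compatible
# provider config from the interactive start flow.
mtplx start

# Check provider config, server health, reasoning, and tool-call compatibility.
mtplx doctor opencode
```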
Changed
- OpenCode now forces raw reasoning streaming by default.
- Sustained prefill defaults are restored to 2048-token chunks for both dense and repage paths, matching the safer production CLI/OpenCode/Pi user path.
Fixed
- Improved OpenCode tool prompting, malformed tool-call fallback, and SessionBank/OpenCode prefix-reuse telemetry.
QA
- `py_compile` passed for core runtime/server/CLI/OpenCode/session files.
- Targeted pytest passed for prefill defaults, profiles, CLI/onboarding, OpenCode, server/OpenAI bridge, and model scheduler.
- Local CLI model smoke passed with Optimized Speed, Sustained Max, MTP enabled.
- Package build and `twine check` passed for sdist and wheel.
MTPLX v0.3.1
MTPLX v0.3.1 is a hotfix release for the AR / `--no-mtp` streaming path.

Fixed:
- OpenAI streaming in `--no-mtp` / AR mode now sends final stats and `[DONE]` immediately after the last visible token in the default async SessionBank postcommit mode.
- Retokenized SessionBank postcommit now runs as idle async maintenance instead of holding the client stream open.
- Startup/runtime labels now correctly show Sustained AR for `--no-mtp`.

Validation:
- Real local quickstart `--no-mtp` streaming QA against the Optimized Speed model: 768-token stream, 0.001s last-token-to-DONE tail.
- Built wheel and sdist passed `twine check`.
- Fresh wheel install smoke passed.
- Targeted serving/session/public CLI pytest suite passed.
MTPLX v0.3.0
v0.3.0 is a device-aware default and serving polish release for the public Qwen MTPLX line.
Highlights
- `mtplx start` now uses hardware-aware verified defaults: M1/M2 Apple Silicon selects the FP16 Optimized Speed model, while M3/M4/M5 and unknown hardware stay on the BF16 Optimized Speed model.
- Added the Optimized Quality model option to the start wizard.
- SessionBank/cache-reuse fixes improve long-session reuse, postcommit visibility, and cache-hit/miss observability.
- Foreground generation is protected from idle postcommit/cache-build work through the model-owner scheduling path.
- TurboQuant startup falls back cleanly when optional vLLM Metal external ops are missing.
- `mtplx serve --host 0.0.0.0` now presents wildcard binding and local browser/API URLs clearly (a serving sketch follows this list).
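A hedged serving sketch; `mtplx serve --host 0.0.0.0` and the `/health` endpoint appear elsewhere in these notes, while the port and remote probe shown are assumptions for illustration.

```bash
# Bind the OpenAI-compatible server on all interfaces; the CLI now prints the
# wildcard binding plus local browser/API URLs.
mtplx serve --host 0.0.0.0

# Probe from another machine on the LAN; host and port are placeholders.
curl http://<machine-ip>:<port>/health
```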
QA
Validated before release on the pushed v0.3.0 code path:
- Full local pytest: 649 passed, 4 skipped.
- `ruff`, `git diff --check`, `py_compile`, `python -m build`, `twine check`, and fresh-venv smoke passed.
- GitHub main CI, build, hygiene, and tag artifact build passed on commit `50f842a`.
- Real CLI smoke covered local Speed, Quality, and FP16 model inspection, source `mtplx start cli`, source `mtplx ask`, source `mtplx serve` with auth/API checks, installed-wheel `mtplx --version`, installed-wheel `start cli`, and installed-wheel `serve` with authenticated `/v1/chat/completions`.
MTPLX v0.2.1
v0.2.1 is the current v0.2 release line: it includes the full v0.2.0 fast
prefill and agent-client update, plus an urgent safety fix for the public
server quickstart path.
Highlights
- Sustained long-context prefill is now the default public path for `mtplx start`, `quickstart`, `serve`, and benchmark commands.
- Long-context prompt processing is dramatically faster than v0.1.6 while staying on the bounded-memory Sustained route used for the 32K/64K/128K M5 Max release QA.
- `mtplx start pi` connects MTPLX to Pi: it writes Pi's model config, starts the local OpenAI-compatible MTPLX server, and opens Pi automatically (a handoff sketch follows this list).
- Pi mode keeps the original MTPLX terminal useful with live server controls: `/reasoning`, `/mtp`, `/stats`, and `/help`.
- OpenAI-compatible streaming now works correctly when clients send `tools`: normal text still streams incrementally, and real tool calls stream as structured `delta.tool_calls` instead of raw model markup.
- SessionBank has safer async postcommit behavior for tool-call responses and configurable capacity environment variables.
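A minimal sketch of the Pi handoff described above; the command and the slash controls come from these notes, and everything else about the session is omitted.

```bash
# Write Pi's model config, start the local OpenAI-compatible MTPLX server,
# and open Pi automatically.
mtplx start pi

# The original MTPLX terminal stays live with the server controls listed above:
#   /reasoning   /mtp   /stats   /help
```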
v0.2.1 Hotfix
- `mtplx quickstart --max`, `mtplx serve --max`, and `mtplx start --max` now keep the v0.2 Sustained Max default even when an older `~/.mtplx/config.toml` contains `profile = "performance-cold"`.
- Non-Sustained MTP prefill above 16K prompt tokens now fails with a clear configuration error instead of taking the full hidden/logits path that can allocate hundreds of GB at 64K+ context.
Immediate User Guidance
For long-context server benchmarks, use:
`mtplx config set profile sustained`
`mtplx quickstart --profile sustained --max`

`--profile performance-cold --max` remains available as the short-context Burst lane, but it is not the long-context v0.2 Sustained prefill path.
MTPLX v0.2.0
v0.2.0 is the fast-prefill and agent-client release.
Highlights
- Sustained is now the default long-context public path for `mtplx start`, `quickstart`, `serve`, and benchmark commands unless Burst is explicitly selected.
- Long-context Sustained prompt prefill is substantially faster while keeping bounded memory and zero large-query split fallback in release QA.
- Pi is now a first-class onboarding target: `mtplx start pi` configures Pi, starts the local OpenAI-compatible server, and keeps `/reasoning`, `/mtp`, `/stats`, and `/help` controls available in the MTPLX terminal.
- OpenAI streaming with `tools` present now streams normal text incrementally, and actual tool-call responses emit structured `delta.tool_calls` chunks.
- Async SessionBank postcommit now runs for tool-call responses after the foreground request goes idle, preserving Daniel Farina's PR #17 contribution.
Targeted Issue Fixes
- Fixes #9: streamed tool-call arguments are chunked instead of sent as one complete JSON blob.
- Fixes #13: requests that include a `tools` array but produce normal content continue to stream `delta.content` incrementally.
- Fixes #15: Pi setup and streaming are now covered by first-class CLI onboarding and compatibility QA.
Release QA Snapshot
Fresh M5 Max release QA was run from the v0.2.0 integration worktree with `--profile sustained --max`, 128 generated tokens, and `MTPLX_ASSERT_NO_LARGE_Q_SPLIT_FALLBACK=1` (a reproduction sketch follows below).
| Context | Prompt TPS | Decode TPS | Peak Memory | Fallback Calls |
|---|---|---|---|---|
| 32k | 620.6 tok/s | 39.1 tok/s | 22.1 GB | 0 |
| 64k | 504.3 tok/s | 31.2 tok/s | 27.2 GB | 0 |
| 128k | 372.1 tok/s | 25.3 tok/s | 37.5 GB | 0 |
The JSON artifact is `benchmarks/results/v0.2.0-release-m5max-32k-64k-128k.json`.
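A hedged sketch of how the QA configuration above maps onto commands named elsewhere in these notes; the exact benchmark entry point and generated-token flag are not specified here, so this stops at the quickstart invocation.

```bash
# Assert that no large-query split fallback occurs during the run.
export MTPLX_ASSERT_NO_LARGE_Q_SPLIT_FALLBACK=1

# Long-context Sustained Max path used for the 32k/64k/128k QA.
mtplx quickstart --profile sustained --max
```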
Direct OpenAI-compatible streaming QA passed for tools-present normal content
and forced tool-call streaming. Pi 0.74.0 was also verified against a live local
MTPLX server with incremental message_update / text_delta events.
Release Honesty
- This release does not claim Gemma runtime support.
- This release does not claim continuous batching.
- This release does not claim direct M5 Neural Accelerator use; eligibility is reported separately from proof.
- Long no-fan decode decay remains a future runtime track.
MTPLX v0.1.6
MTPLX v0.1.6 is a small production hotfix over v0.1.5.

Fixed:
- Streamed tool-call responses for OpenAI-compatible agent clients.
- Paged-tail routing for streamed server responses.
- Long-context public benchmark defaults so product suites use Sustained/direct HTTP by default while cold headline runs stay on performance-cold.

No Gemma assistant-pair runtime, model-weight, sampler, or new benchmark-result claims are included in this release.