Releases: youssofal/MTPLX
MTPLX v0.3.6
Production patch over v0.3.5 focused on the public release pillars: bounded memory, no silent decode/prefill tradeoff, and clean CLI UX.
Highlights
- Fixes AIME-shaped `max_tokens=65536` memory behavior by bounding initial new-token KV reservation while preserving real prompt-context allocation.
- Avoids retaining full-capacity live cache refs for anonymous one-off sessions.
- Improves OpenCode tool-result turns so stable cached prefixes are reused instead of cold-prefilling the full history.
- Ships Tune in the packaged CLI: `mtplx tune`, `mtplx-tune`, and `mtplx bench tune` (a usage sketch follows this list).
- Fixes verified-default onboarding/model labeling for the installed Optimized Speed/Q4 artifact.
- Adds `bench tune` chip diagnostics with power, frequency, temperature, utilization, fan, and thermal-pressure telemetry; generation-window scope is labeled when available.
- Tightens README claims so paired same-machine speedup is not described as hardware-independent.
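A minimal usage sketch of the Tune entry points shipped in this release; only the command names above come from these notes, and no additional flags are assumed.

```bash
# Tune entry points packaged in v0.3.6; flags are intentionally omitted
# because these notes do not specify them.
mtplx tune          # Tune via the main CLI
mtplx-tune          # standalone console-script form
mtplx bench tune    # Tune through the bench path, with chip diagnostics
```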
Validation
- Local: compileall, ruff, full pytest, twine check, fresh venv smoke, git diff check (a command sketch follows this list).
- CLI UX: `mtplx --version`, OpenCode dry-run, Pi dry-run, `mtplx-tune` dry-run, and `bench tune` dry-run.
- GitHub PR #66: `repository-hygiene`, `wheel`, and `no-mlx-smoke` all passed before merge.
- Release workflow: trusted PyPI publishing workflow passed for v0.3.6.
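For reference, a hedged sketch of the kind of local commands the validation line above refers to; the exact targets and paths are assumptions rather than values from the release.

```bash
# Illustrative local validation pass; paths and arguments are assumptions.
python -m compileall .       # byte-compile check
ruff check .                 # lint
pytest                       # full test suite
python -m build              # build sdist/wheel for the next step
twine check dist/*           # PyPI artifact check
git diff --check             # whitespace/conflict-marker check
```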
Known Non-Claims
MTPLX v0.3.5
What's Changed
- Fixed OpenCode tool-result turns cold-prefilling the full conversation history. Follow-up OpenCode turns now reuse the stable SessionBank prefix instead of sitting at `Thinking...` for minutes.
- Fixed unsafe stream postcommit prefix anchoring so streamed assistant/tool histories do not poison the next cache boundary.
- Locked down the real-world consecutive Qwen XML tool-call regression so back-to-back tool calls stay structured and do not leak raw XML.
Validation
- Targeted server/tool/OpenCode pytest suite passed.
- Built and checked PyPI artifacts with `twine check`.
- Ran real local MTPLX server, streaming API, Android Studio doctor, OpenCode CLI, and Pi CLI smoke tests against the local Optimized Speed model.
- Verified PyPI `mtplx==0.3.5` fresh venv install and Homebrew `youssofal/mtplx/mtplx` 0.3.5 install/test (an install sketch follows this list).
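A hedged sketch of the fresh-install checks above, assuming standard pip and Homebrew usage; the temporary venv location is illustrative.

```bash
# Fresh-venv PyPI install check; the venv location is illustrative.
python -m venv /tmp/mtplx-venv
source /tmp/mtplx-venv/bin/activate
pip install mtplx==0.3.5
mtplx --version

# Homebrew tap install check.
brew install youssofal/mtplx/mtplx
```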
MTPLX v0.3.4
MTPLX v0.3.4 is a patch release for coding-agent UX and serving compatibility.
Highlights:
- Idle SessionBank postcommit is now cooperative and preemptible, so foreground Pi/OpenCode/agent turns do not sit silently behind long background cache work.
- Consecutive Qwen XML tool calls stay structured while streaming as OpenAI delta.tool_calls.
- Swival is now available through the start/integrate handoff flow.
- The locked indirect urllib3 dependency is updated to 2.7.0.
Validation:
- Local focused pytest and CI-mirror subsets passed.
- python -m build and twine check passed for the wheel and sdist.
- Fresh venv no-MLX CLI smoke passed.
- Real max-fan CLI generation with the local Optimized Speed model loaded and answered successfully.
- GitHub Actions build, hygiene, and ci are green on the release commit.
MTPLX v0.3.3
Patch release for OpenAI-compatible serving clients and coding-agent tool UX.
Added
- `mtplx doctor android-studio` for model discovery, nonstream chat, streaming chat, and tool-bearing request smoke.
- Android Studio/OpenAI-compatible request-shape tolerance for `max_completion_tokens`, `stream_options`, `response_format`, `metadata`, and `parallel_tool_calls` (a request sketch follows this list).
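A hedged request sketch showing the tolerated fields against the local `/v1/chat/completions` endpoint; the host, port, API key, and model id are placeholders, not values from this release.

```bash
# Illustrative Android Studio-shaped request; host, port, key, and model id
# are placeholders. The extra fields are the ones now tolerated.
curl -sN http://127.0.0.1:8080/v1/chat/completions \
  -H "Authorization: Bearer sk-local-placeholder" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "optimized-speed",
    "stream": true,
    "stream_options": {"include_usage": true},
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_completion_tokens": 64,
    "response_format": {"type": "text"},
    "metadata": {"client": "android-studio"},
    "parallel_tool_calls": false
  }'
```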
Changed
- Qwen XML tool calls now stream OpenAI `delta.tool_calls` incrementally, so compatible clients can mount file-write/edit cards before the full argument body finishes (a chunk-shape sketch follows this list).
- Pi handoff no longer writes a hidden model-level `maxTokens` cap.
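A hedged sketch of the incremental chunk shape compatible clients now receive, assuming the standard OpenAI SSE format; the function name and argument fragments are illustrative (the `write_file` tool itself is exercised in the QA notes below).

```
data: {"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"name":"write_file","arguments":""}}]}}]}
data: {"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\"path\": \"notes.txt\""}}]}}]}
data: {"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"arguments":", \"content\": \"...\"}"}}]}}]}
data: [DONE]
```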
Fixed
- Fixed Android Studio issue #58 where `/v1/chat/completions` could fail as `500: null`.
- Hardened malformed, unknown, unclosed, or schema-invalid tool-call output so it falls back safely instead of hanging or storing raw XML as successful assistant tool history.
QA
- Local source tests: public CLI, OpenCode/onboarding, OpenAI bridge, server, and streaming tool-call translator passed.
- Built wheel/sdist and passed `twine check` plus fresh no-MLX venv smoke.
- Installed the built wheel in a clean temp environment and verified `mtplx 0.3.3` plus OpenCode dry-run with no hidden `maxTokens`.
- Real local CLI smoke loaded the Optimized Speed model and generated through the MTP path.
- Real server/API smoke verified `/health`, `max_completion_tokens`, OpenAI-shaped invalid-request errors, Android doctor, and incremental streaming `write_file` tool-call deltas.
MTPLX v0.3.2
OpenCode and long-context prefill polish release.
Added
- First-class OpenCode Desktop integration in `mtplx start` with OpenAI-compatible provider config generation.
- OpenCode reasoning configured through `reasoning_content`, tool-call support metadata, long chunk timeout, and no hidden max-token cap.
- `mtplx doctor opencode` diagnostics for provider config, server health, reasoning, and tool-call compatibility (a CLI sketch follows this list).
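A minimal sketch of the flow named above; only `mtplx start` and `mtplx doctor opencode` come from these notes, and the interactive choices inside the start wizard are not shown.

```bash
# Offer OpenCode Desktop integration and generate the OpenAI-compatible
# provider config from the interactive start flow.
mtplx start

# Check provider config, server health, reasoning, and tool-call compatibility.
mtplx doctor opencode
```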
Changed
- OpenCode now forces raw reasoning streaming by default.
- Sustained prefill defaults are restored to 2048-token chunks for both dense and repage paths, matching the safer production CLI/OpenCode/Pi user path.
Fixed
- Improved OpenCode tool prompting, malformed tool-call fallback, and SessionBank/OpenCode prefix-reuse telemetry.
QA
- `py_compile` passed for core runtime/server/CLI/OpenCode/session files.
- Targeted pytest passed for prefill defaults, profiles, CLI/onboarding, OpenCode, server/OpenAI bridge, and model scheduler.
- Local CLI model smoke passed with Optimized Speed, Sustained Max, MTP enabled.
- Package build and `twine check` passed for sdist and wheel.
MTPLX v0.3.1
MTPLX v0.3.1 is a hotfix release for the AR / `--no-mtp` streaming path.

Fixed:
- OpenAI streaming in `--no-mtp` / AR mode now sends final stats and `[DONE]` immediately after the last visible token in the default async SessionBank postcommit mode.
- Retokenized SessionBank postcommit now runs as idle async maintenance instead of holding the client stream open.
- Startup/runtime labels now correctly show Sustained AR for `--no-mtp`.

Validation:
- Real local quickstart `--no-mtp` streaming QA against the Optimized Speed model: 768-token stream, 0.001s last-token-to-DONE tail.
- Built wheel and sdist passed `twine check`.
- Fresh wheel install smoke passed.
- Targeted serving/session/public CLI pytest suite passed.
MTPLX v0.3.0
v0.3.0 is a device-aware default and serving polish release for the public Qwen MTPLX line.
Highlights
- `mtplx start` now uses hardware-aware verified defaults: M1/M2 Apple Silicon selects the FP16 Optimized Speed model, while M3/M4/M5 and unknown hardware stay on the BF16 Optimized Speed model.
- Added the Optimized Quality model option to the start wizard.
- SessionBank/cache-reuse fixes improve long-session reuse, postcommit visibility, and cache-hit/miss observability.
- Foreground generation is protected from idle postcommit/cache-build work through the model-owner scheduling path.
- TurboQuant startup falls back cleanly when optional vLLM Metal external ops are missing.
- `mtplx serve --host 0.0.0.0` now presents wildcard binding and local browser/API URLs clearly (a serving sketch follows this list).
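A hedged serving sketch; `mtplx serve --host 0.0.0.0` and the `/health` endpoint appear elsewhere in these notes, while the port and remote probe shown are assumptions for illustration.

```bash
# Bind the OpenAI-compatible server on all interfaces; the CLI now prints the
# wildcard binding plus local browser/API URLs.
mtplx serve --host 0.0.0.0

# Probe from another machine on the LAN; host and port are placeholders.
curl http://<machine-ip>:<port>/health
```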
QA
Validated before release on the pushed v0.3.0 code path:
- Full local pytest: 649 passed, 4 skipped.
- `ruff`, `git diff --check`, `py_compile`, `python -m build`, `twine check`, and fresh-venv smoke passed.
- GitHub main CI, build, hygiene, and tag artifact build passed on commit `50f842a`.
- Real CLI smoke covered local Speed, Quality, and FP16 model inspection, source `mtplx start cli`, source `mtplx ask`, source `mtplx serve` with auth/API checks, installed-wheel `mtplx --version`, installed-wheel `start cli`, and installed-wheel `serve` with authenticated `/v1/chat/completions`.
MTPLX v0.2.1
v0.2.1 is the current v0.2 release line: it includes the full v0.2.0 fast
prefill and agent-client update, plus an urgent safety fix for the public
server quickstart path.
Highlights
- Sustained long-context prefill is now the default public path for `mtplx start`, `quickstart`, `serve`, and benchmark commands.
- Long-context prompt processing is dramatically faster than v0.1.6 while staying on the bounded-memory Sustained route used for the 32K/64K/128K M5 Max release QA.
- `mtplx start pi` connects MTPLX to Pi: it writes Pi's model config, starts the local OpenAI-compatible MTPLX server, and opens Pi automatically (a handoff sketch follows this list).
- Pi mode keeps the original MTPLX terminal useful with live server controls: `/reasoning`, `/mtp`, `/stats`, and `/help`.
- OpenAI-compatible streaming now works correctly when clients send `tools`: normal text still streams incrementally, and real tool calls stream as structured `delta.tool_calls` instead of raw model markup.
- SessionBank has safer async postcommit behavior for tool-call responses and configurable capacity environment variables.
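A minimal sketch of the Pi handoff described above; the command and the slash controls come from these notes, and everything else about the session is omitted.

```bash
# Write Pi's model config, start the local OpenAI-compatible MTPLX server,
# and open Pi automatically.
mtplx start pi

# The original MTPLX terminal stays live with the server controls listed above:
#   /reasoning   /mtp   /stats   /help
```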
v0.2.1 Hotfix
- `mtplx quickstart --max`, `mtplx serve --max`, and `mtplx start --max` now keep the v0.2 Sustained Max default even when an older `~/.mtplx/config.toml` contains `profile = "performance-cold"`.
- Non-Sustained MTP prefill above 16K prompt tokens now fails with a clear configuration error instead of taking the full hidden/logits path that can allocate hundreds of GB at 64K+ context.
Immediate User Guidance
For long-context server benchmarks, use:
`mtplx config set profile sustained`
`mtplx quickstart --profile sustained --max`

`--profile performance-cold --max` remains available as the short-context Burst lane, but it is not the long-context v0.2 Sustained prefill path.
MTPLX v0.2.0
v0.2.0 is the fast-prefill and agent-client release.
Highlights
- Sustained is now the default long-context public path for `mtplx start`, `quickstart`, `serve`, and benchmark commands unless Burst is explicitly selected.
- Long-context Sustained prompt prefill is substantially faster while keeping bounded memory and zero large-query split fallback in release QA.
- Pi is now a first-class onboarding target: `mtplx start pi` configures Pi, starts the local OpenAI-compatible server, and keeps `/reasoning`, `/mtp`, `/stats`, and `/help` controls available in the MTPLX terminal.
- OpenAI streaming with `tools` present now streams normal text incrementally, and actual tool-call responses emit structured `delta.tool_calls` chunks.
- Async SessionBank postcommit now runs for tool-call responses after the foreground request goes idle, preserving Daniel Farina's PR #17 contribution.
Targeted Issue Fixes
- Fixes #9: streamed tool-call arguments are chunked instead of sent as one complete JSON blob.
- Fixes #13: requests that include a `tools` array but produce normal content continue to stream `delta.content` incrementally.
- Fixes #15: Pi setup and streaming are now covered by first-class CLI onboarding and compatibility QA.
Release QA Snapshot
Fresh M5 Max release QA was run from the v0.2.0 integration worktree with `--profile sustained --max`, 128 generated tokens, and `MTPLX_ASSERT_NO_LARGE_Q_SPLIT_FALLBACK=1` (a reproduction sketch follows below).
| Context | Prompt TPS | Decode TPS | Peak Memory | Fallback Calls |
|---|---|---|---|---|
| 32k | 620.6 tok/s | 39.1 tok/s | 22.1 GB | 0 |
| 64k | 504.3 tok/s | 31.2 tok/s | 27.2 GB | 0 |
| 128k | 372.1 tok/s | 25.3 tok/s | 37.5 GB | 0 |
The JSON artifact is `benchmarks/results/v0.2.0-release-m5max-32k-64k-128k.json`.
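A hedged sketch of how the QA configuration above maps onto commands named elsewhere in these notes; the exact benchmark entry point and generated-token flag are not specified here, so this stops at the quickstart invocation.

```bash
# Assert that no large-query split fallback occurs during the run.
export MTPLX_ASSERT_NO_LARGE_Q_SPLIT_FALLBACK=1

# Long-context Sustained Max path used for the 32k/64k/128k QA.
mtplx quickstart --profile sustained --max
```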
Direct OpenAI-compatible streaming QA passed for tools-present normal content
and forced tool-call streaming. Pi 0.74.0 was also verified against a live local
MTPLX server with incremental message_update / text_delta events.
Release Honesty
- This release does not claim Gemma runtime support.
- This release does not claim continuous batching.
- This release does not claim direct M5 Neural Accelerator use; eligibility is reported separately from proof.
- Long no-fan decode decay remains a future runtime track.
MTPLX v0.1.6
MTPLX v0.1.6 is a small production hotfix over v0.1.5.

Fixed:
- Streamed tool-call responses for OpenAI-compatible agent clients.
- Paged-tail routing for streamed server responses.
- Long-context public benchmark defaults so product suites use Sustained/direct HTTP by default while cold headline runs stay on performance-cold.

No Gemma assistant-pair runtime, model-weight, sampler, or new benchmark-result claims are included in this release.