Releases: youssofal/MTPLX

MTPLX v0.3.6

14 May 23:52

A production patch over v0.3.5, focused on the public release pillars: bounded memory, no silent decode/prefill tradeoffs, and clean CLI UX.

Highlights

  • Fixes memory behavior for AIME-shaped max_tokens=65536 requests by bounding the initial new-token KV reservation while preserving real prompt-context allocation.
  • Avoids retaining full-capacity live cache refs for anonymous one-off sessions.
  • Improves OpenCode tool-result turns so stable cached prefixes are reused instead of cold-prefilling the full history.
  • Ships Tune in the packaged CLI: mtplx tune, mtplx-tune, and mtplx bench tune (see the sketch after this list).
  • Fixes verified-default onboarding/model labeling for the installed Optimized Speed/Q4 artifact.
  • Adds bench tune chip diagnostics with power, frequency, temperature, utilization, fan, and thermal-pressure telemetry; generation-window scope is labeled when available.
  • Tightens README claims so paired same-machine speedup is not described as hardware-independent.
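
A minimal sketch of the three packaged Tune entry points named above; invocations are shown bare, since no flags are documented in these notes:

# Tune via the main CLI subcommand
mtplx tune
# Tune via the standalone console script
mtplx-tune
# Tune via the bench path, with the chip diagnostics described above
mtplx bench tune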

Validation

  • Local: compileall, ruff, full pytest, twine check, fresh venv smoke, and git diff --check.
  • CLI UX: mtplx --version, OpenCode dry-run, Pi dry-run, mtplx-tune dry-run, and bench tune dry-run.
  • GitHub PR #66: the repository-hygiene, wheel, and no-mlx-smoke checks all passed before merge.
  • Release workflow: trusted PyPI publishing workflow passed for v0.3.6.

Known Non-Claims

  • This release does not claim to fix unrelated open issues #65, #56, or #16.
  • README release claims avoid citing fork/patch proof that is not available.

MTPLX v0.3.5

13 May 18:45

What's Changed

  • Fixed OpenCode tool-result turns cold-prefilling the full conversation history. Follow-up OpenCode turns now reuse the stable SessionBank prefix instead of sitting at Thinking... for minutes.
  • Fixed unsafe stream postcommit prefix anchoring so streamed assistant/tool histories do not poison the next cache boundary.
  • Locked down the real-world consecutive Qwen XML tool-call regression so back-to-back tool calls stay structured and do not leak raw XML.

Validation

  • Targeted server/tool/OpenCode pytest suite passed.
  • Built and checked PyPI artifacts with twine check.
  • Ran real local MTPLX server, streaming API, Android Studio doctor, OpenCode CLI, and Pi CLI smoke tests against the local Optimized Speed model.
  • Verified PyPI mtplx==0.3.5 fresh venv install and Homebrew youssofal/mtplx/mtplx 0.3.5 install/test.

MTPLX v0.3.4

13 May 04:57

MTPLX v0.3.4 is a patch release for coding-agent UX and serving compatibility.

Highlights

  • Idle SessionBank postcommit is now cooperative and preemptible, so foreground Pi/OpenCode/agent turns do not sit silently behind long background cache work.
  • Consecutive Qwen XML tool calls stay structured while streaming as OpenAI delta.tool_calls.
  • Swival is now available through the start/integrate handoff flow.
  • The locked indirect urllib3 dependency is updated to 2.7.0.

Validation

  • Local focused pytest and CI-mirror subsets passed.
  • python -m build and twine check passed for the wheel and sdist.
  • Fresh venv no-MLX CLI smoke passed.
  • A real max-fan CLI generation run loaded the local Optimized Speed model and answered successfully.
  • GitHub Actions build, hygiene, and ci workflows are green on the release commit.

MTPLX v0.3.3

12 May 00:48

Patch release for OpenAI-compatible serving clients and coding-agent tool UX.

Added

  • mtplx doctor android-studio for model discovery, non-streaming chat, streaming chat, and tool-bearing request smoke tests.
  • Android Studio/OpenAI-compatible request-shape tolerance for max_completion_tokens, stream_options, response_format, metadata, and parallel_tool_calls.
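
A hedged sketch of an Android Studio-shaped request exercising the tolerated fields; the host, port, and model name here are illustrative assumptions:

# Streaming chat request carrying the newly tolerated OpenAI-style fields
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "hello"}],
    "max_completion_tokens": 64,
    "stream": true,
    "stream_options": {"include_usage": true},
    "response_format": {"type": "text"},
    "metadata": {},
    "parallel_tool_calls": false
  }'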

Changed

  • Qwen XML tool calls now stream OpenAI delta.tool_calls incrementally, so compatible clients can mount file-write/edit cards before the full argument body finishes.
  • Pi handoff no longer writes a hidden model-level maxTokens cap.

Fixed

  • Fixed Android Studio issue #58 where /v1/chat/completions could fail as 500: null.
  • Hardened malformed, unknown, unclosed, or schema-invalid tool-call output so it falls back safely instead of hanging or storing raw XML as successful assistant tool history.

QA

  • Local source tests: public CLI, OpenCode/onboarding, OpenAI bridge, server, and streaming tool-call translator passed.
  • Built wheel/sdist and passed twine check plus fresh no-MLX venv smoke.
  • Installed the built wheel in a clean temp environment and verified mtplx 0.3.3 plus OpenCode dry-run with no hidden maxTokens.
  • Real local CLI smoke loaded the Optimized Speed model and generated through the MTP path.
  • Real server/API smoke verified /health, max_completion_tokens, OpenAI-shaped invalid-request errors, Android doctor, and incremental streaming write_file tool-call deltas.

MTPLX v0.3.2

11 May 05:24

OpenCode and long-context prefill polish release.

Added

  • First-class OpenCode Desktop integration in mtplx start with OpenAI-compatible provider config generation.
  • OpenCode reasoning configured through reasoning_content, tool-call support metadata, long chunk timeout, and no hidden max-token cap.
  • mtplx doctor opencode diagnostics for provider config, server health, reasoning, and tool-call compatibility.
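
A minimal sketch of the flow, assuming the interactive integration step in mtplx start offers OpenCode Desktop as described above:

# Generate the OpenAI-compatible provider config via the start wizard
mtplx start
# Check provider config, server health, reasoning, and tool-call compatibility
mtplx doctor opencode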

Changed

  • OpenCode now forces raw reasoning streaming by default.
  • Sustained prefill defaults are restored to 2048-token chunks for both dense and repage paths, matching the safer production CLI/OpenCode/Pi user path.

Fixed

  • Improved OpenCode tool prompting, malformed tool-call fallback, and SessionBank/OpenCode prefix-reuse telemetry.

QA

  • py_compile passed for core runtime/server/CLI/OpenCode/session files.
  • Targeted pytest passed for prefill defaults, profiles, CLI/onboarding, OpenCode, server/OpenAI bridge, and model scheduler.
  • Local CLI model smoke passed with Optimized Speed, Sustained Max, and MTP enabled.
  • Package build and twine check passed for sdist and wheel.

MTPLX v0.3.1

10 May 21:17

MTPLX v0.3.1 is a hotfix release for the AR / --no-mtp streaming path.

Fixed

  • OpenAI streaming in --no-mtp / AR mode now sends final stats and [DONE] immediately after the last visible token in the default async SessionBank postcommit mode.
  • Retokenized SessionBank postcommit now runs as idle async maintenance instead of holding the client stream open.
  • Startup/runtime labels now correctly show Sustained AR for --no-mtp.

Validation

  • Real local quickstart --no-mtp streaming QA against the Optimized Speed model: 768-token stream, 0.001s last-token-to-DONE tail.
  • Built wheel and sdist passed twine check.
  • Fresh wheel install smoke passed.
  • Targeted serving/session/public CLI pytest suite passed.

MTPLX v0.3.0

10 May 06:51

v0.3.0 is a device-aware default and serving polish release for the public Qwen MTPLX line.

Highlights

  • mtplx start now uses hardware-aware verified defaults: M1/M2 Apple Silicon selects the FP16 Optimized Speed model, while M3/M4/M5 and unknown hardware stay on the BF16 Optimized Speed model.
  • Added the Optimized Quality model option to the start wizard.
  • SessionBank/cache-reuse fixes improve long-session reuse, postcommit visibility, and cache-hit/miss observability.
  • Foreground generation is protected from idle postcommit/cache-build work through the model-owner scheduling path.
  • TurboQuant startup falls back cleanly when optional vLLM Metal external ops are missing.
  • mtplx serve --host 0.0.0.0 now presents wildcard binding and local browser/API URLs clearly.
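
A minimal sketch of the wildcard-binding form named above; per these notes, startup then presents both the 0.0.0.0 bind and concrete local browser/API URLs:

mtplx serve --host 0.0.0.0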

QA

Validated before release on the pushed v0.3.0 code path:

  • Full local pytest: 649 passed, 4 skipped.
  • ruff, git diff --check, py_compile, python -m build, twine check, and fresh-venv smoke passed.
  • GitHub main CI, build, hygiene, and tag artifact build passed on commit 50f842a.
  • Real CLI smoke covered local Speed, Quality, and FP16 model inspection; source-tree mtplx start cli, mtplx ask, and mtplx serve with auth/API checks; and installed-wheel mtplx --version, start cli, and serve with authenticated /v1/chat/completions.

MTPLX v0.2.1

07 May 21:14

v0.2.1 is the current v0.2 release line: it includes the full v0.2.0
fast-prefill and agent-client update, plus an urgent safety fix for the public
server quickstart path.

Highlights

  • Sustained long-context prefill is now the default public path for mtplx start, quickstart, serve, and benchmark commands.
  • Long-context prompt processing is dramatically faster than v0.1.6 while
    staying on the bounded-memory Sustained route used for the 32K/64K/128K M5
    Max release QA.
  • mtplx start pi connects MTPLX to Pi: it writes Pi's model config, starts the
    local OpenAI-compatible MTPLX server, and opens Pi automatically (see the
    sketch after this list).
  • Pi mode keeps the original MTPLX terminal useful with live server controls:
    /reasoning, /mtp, /stats, and /help.
  • OpenAI-compatible streaming now works correctly when clients send tools:
    normal text still streams incrementally, and real tool calls stream as
    structured delta.tool_calls instead of raw model markup.
  • SessionBank now has safer async postcommit behavior for tool-call responses,
    plus environment variables for configuring its capacity.
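
A minimal sketch of the Pi handoff described above; the slash commands stay
available in the original MTPLX terminal:

# Writes Pi's model config, starts the local OpenAI-compatible server,
# and opens Pi automatically
mtplx start pi
# Then, in the MTPLX terminal: /reasoning, /mtp, /stats, /help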

v0.2.1 Hotfix

  • mtplx quickstart --max, mtplx serve --max, and mtplx start --max now keep
    the v0.2 Sustained Max default even when an older ~/.mtplx/config.toml
    contains profile = "performance-cold".
  • Non-Sustained MTP prefill above 16K prompt tokens now fails with a clear
    configuration error instead of taking the full hidden/logits path that can
    allocate hundreds of GB at 64K+ context.

Immediate User Guidance

For long-context server benchmarks, use:

mtplx config set profile sustained
mtplx quickstart --profile sustained --max

--profile performance-cold --max remains available as the short-context Burst
lane, but it is not the long-context v0.2 Sustained prefill path.

MTPLX v0.2.0

07 May 18:32

v0.2.0 is the fast-prefill and agent-client release.

Highlights

  • Sustained is now the default long-context public path for mtplx start,
    quickstart, serve, and benchmark commands unless Burst is explicitly
    selected.
  • Long-context Sustained prompt prefill is substantially faster while keeping
    bounded memory and zero large-query split fallback in release QA.
  • Pi is now a first-class onboarding target: mtplx start pi configures Pi,
    starts the local OpenAI-compatible server, and keeps /reasoning, /mtp,
    /stats, and /help controls available in the MTPLX terminal.
  • OpenAI streaming with tools present now streams normal text incrementally,
    and actual tool-call responses emit structured delta.tool_calls chunks (see
    the sketch after this list).
  • Async SessionBank postcommit now runs for tool-call responses after the
    foreground request goes idle, preserving Daniel Farina's PR #17 contribution.
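
A hedged sketch of a tools-present streaming request; the host, port, model
name, and example tool are illustrative assumptions. Per these notes, normal
text streams as incremental delta.content, while real tool calls arrive as
structured delta.tool_calls chunks:

curl -sN http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "stream": true,
    "messages": [{"role": "user", "content": "list the files"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "list_files",
        "parameters": {"type": "object", "properties": {}}
      }
    }]
  }'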

Targeted Issue Fixes

  • Fixes #9: streamed tool-call arguments are chunked instead of sent as one
    complete JSON blob.
  • Fixes #13: requests that include a tools array but produce normal content
    continue to stream delta.content incrementally.
  • Fixes #15: Pi setup and streaming are now covered by first-class CLI
    onboarding and compatibility QA.

Release QA Snapshot

Fresh M5 Max release QA was run from the v0.2.0 integration worktree with
--profile sustained --max, 128 generated tokens, and
MTPLX_ASSERT_NO_LARGE_Q_SPLIT_FALLBACK=1.

Context   Prompt TPS     Decode TPS   Peak Memory   Fallback Calls
32k       620.6 tok/s    39.1 tok/s   22.1 GB       0
64k       504.3 tok/s    31.2 tok/s   27.2 GB       0
128k      372.1 tok/s    25.3 tok/s   37.5 GB       0

The JSON artifact is
benchmarks/results/v0.2.0-release-m5max-32k-64k-128k.json.
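
A hedged sketch of reproducing a run of this shape, assuming the quickstart
invocation shown in the v0.2.1 guidance above; the exact benchmark subcommand
and token-count flag used for the QA runs are not documented here:

# Sustained Max with the no-fallback assertion enabled
MTPLX_ASSERT_NO_LARGE_Q_SPLIT_FALLBACK=1 \
  mtplx quickstart --profile sustained --max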

Direct OpenAI-compatible streaming QA passed for tools-present normal content
and forced tool-call streaming. Pi 0.74.0 was also verified against a live local
MTPLX server with incremental message_update / text_delta events.

Release Honesty

  • This release does not claim Gemma runtime support.
  • This release does not claim continuous batching.
  • This release does not claim direct M5 Neural Accelerator use; eligibility is
    reported separately from proof.
  • Long no-fan decode decay remains a future runtime track.

MTPLX v0.1.6

06 May 07:25

MTPLX v0.1.6 is a small production hotfix over v0.1.5.

Fixed

  • Streamed tool-call responses for OpenAI-compatible agent clients.
  • Paged-tail routing for streamed server responses.
  • Long-context public benchmark defaults, so product suites use Sustained/direct HTTP by default while cold headline runs stay on performance-cold.

No Gemma assistant-pair runtime, model-weight, sampler, or new benchmark-result claims are included in this release.