MTP Profiler

Real-world speculative decoding profiler for llama.cpp workloads on Apple Silicon.

MTP Profiler analyzes real llama.cpp inference workloads to help identify optimal speculative decoding (Multi-Token Prediction / MTP) settings for your hardware and usage patterns. Unlike synthetic microbenchmarks, MTP Profiler focuses on long-running real-world sessions such as coding agents, chat workloads, and long-context inference. It profiles throughput degradation, draft-token acceptance rates, and speculative decoding efficiency using telemetry extracted directly from llama.cpp logs.

Why This Exists

Speculative decoding performance depends heavily on:

hardware characteristics,
memory bandwidth,
context length,
model architecture,
workload entropy,
and draft-token acceptance rates.

Synthetic benchmarks often fail to capture real-world behavior, especially during long-running sessions with growing context windows. MTP Profiler helps quantify speculative decoding efficiency using telemetry from actual inference workloads, making it easier to identify the optimal draft-token settings for a specific machine and usage pattern.

What is MTP?

Multi-Token Prediction (MTP) is a speculative decoding technique used in llama.cpp to accelerate inference. Instead of generating one token at a time, a lightweight drafter proposes several tokens ahead, and the full model then verifies them in a single pass, keeping only the tokens it agrees with. The key configuration parameters are:

--mtp-n-max — maximum number of draft tokens to generate per step (e.g. 1, 2, 3, 4)
--mtp-n-min — minimum number of draft tokens (usually 0)
--mtp-p-min — minimum acceptance probability threshold (e.g. 0.70)

Higher n_max values can increase throughput but may reduce acceptance rates or become unstable at long context lengths. The optimal setting depends on your specific model, hardware, and workload.
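A rough back-of-the-envelope model helps explain why a larger n_max does not automatically pay off. The sketch below assumes every draft token is accepted independently with the same probability p, which real workloads do not satisfy, and uses the usual geometric-series estimate for the expected number of tokens emitted per verification pass. The function name is illustrative and not part of MTP Profiler.

```python
def expected_tokens_per_step(p: float, n_max: int) -> float:
    """Expected tokens emitted per verification pass.

    ASSUMPTION: each of the n_max draft tokens is accepted independently
    with probability p, and the full model always contributes one token
    (a correction, or a bonus token after a fully accepted draft).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n_max < 0:
        raise ValueError("n_max must be >= 0")
    # Geometric series: p^0 + p^1 + ... + p^n_max
    return sum(p ** k for k in range(n_max + 1))


if __name__ == "__main__":
    # Illustrative acceptance probability only, not measured data.
    for n in (0, 1, 2, 3, 4):
        print(n, round(expected_tokens_per_step(0.7, n), 3))
```

Under this simplified model each extra draft token adds only p^n expected tokens while drafting cost keeps growing, which is one plausible reason measured throughput can peak at a moderate n_max.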

How to run llama.cpp with MTP

Start the llama.cpp server with your desired MTP configuration:

llama-server \
  --model /path/to/your/model.gguf \
  --mtp-n-max 2 \
  --mtp-n-min 0 \
  --mtp-p-min 0.70

How to collect logs

Run the server with output piped through tee to capture logs to a file:

llama-server ... 2>&1 | tee -a llama.log

Send inference requests (via API, web UI, or benchmark tools) while the server is running. The profiler parses the server log to extract timing data, MTP metrics, and system information.

How to get a comprehensive analysis

A llama-server process runs with one MTP configuration at a time, so comparing settings requires separate runs. To get a comprehensive comparison and recommendation:

Run the server multiple times with different --mtp-n-max values (e.g. 1, 2, 3, 4), sending the same workload each time
Append each run's logs to the same file using tee -a
Point the profiler at the combined log file — it automatically detects server restarts and merges runs by n_max setting

Example workflow:

# Run 1: n_max=1
llama-server --mtp-n-max 1 ... 2>&1 | tee -a llama.log

# Run 2: n_max=2 (append to same file)
llama-server --mtp-n-max 2 ... 2>&1 | tee -a llama.log

# Run 3: n_max=3 (append to same file)
llama-server --mtp-n-max 3 ... 2>&1 | tee -a llama.log

# Analyze all runs together
mtp-profiler profile llama.log -d output/

About this project

This repository and its code were generated entirely by AI agents. The charts, analysis, and recommendations shown in this README were produced from logs collected while the AI agent was actively implementing this very codebase — making it a self-referential profiling exercise. The real-world test data comes from Qwen3.6-35B-A3B-UD-Q4_K_XL on an Apple M3 Pro (36 GB RAM). All code and results were human-reviewed and verified by the repo owner.

Features

Passive log analysis - No synthetic benchmarks, just analyze real llama.cpp server logs
Cross-run merging - Multiple runs with the same MTP setting are merged into one dataset
Multi-run detection - Detects server restarts and separates runs automatically
Apple Silicon aware - Collects chip type, memory, memory pressure, and thread information
Deterministic recommendations - Algorithmic scoring with diminishing returns penalty
Publication-quality charts - Throughput vs context, acceptance rates, stability boxplots
LOWESS smoothing - Optional advanced smoothing for trend lines
MTP internal stats - Parses draft call counts, generation/acceptance durations
Robust parsing - Tolerates ANSI codes, malformed lines, truncated logs

Installation

cd mtp-profiler
python -m venv .venv
source .venv/bin/activate
pip install -e .

Quick Start

Full pipeline (recommended)

mtp-profiler profile llama.log -d output/

This runs all stages automatically:

Parse - Extract telemetry from the log file
Analyze - Compute throughput stats, correlations, MTP comparisons
Recommend - Generate optimal MTP setting recommendation
Plot - Generate charts in output/charts/

Step-by-step

# Stage 1: Parse log
mtp-profiler parse llama.log -o parsed.json

# Stage 2: Analyze
mtp-profiler analyze parsed.json -o analysis.json

# Stage 3: Recommend
mtp-profiler recommend analysis.json -o recommendation.json

# Stage 4: Plot
mtp-profiler plot analysis.json -d charts/

Multi-run analysis

If your log contains multiple server restarts (different MTP configurations), the default behavior merges all runs by n_max setting, showing one line per unique setting. To analyze a specific run:

mtp-profiler profile llama.log -d output/ -r run_2

When running the pipeline step-by-step, the analyze and plot stages accept the same -r option.

Output

The profile command produces:

output/
├── parsed.json          # Raw extracted telemetry
├── analysis.json        # Computed metrics
├── recommendation.json  # Optimal MTP setting
└── charts/
    ├── throughput_and_acceptance.png  # Throughput + acceptance rate charts
    ├── stability_boxplot.png          # Throughput distribution by setting
    └── uplift_vs_baseline.png         # Throughput uplift comparison

Example Charts

Below are example outputs from profiling Qwen3.6-35B-A3B-UD-Q4_K_XL on Apple M3 Pro (36 GB RAM).

Throughput & Acceptance Rate

Each line represents a different MTP n_max setting, with all runs for the same setting merged into one dataset. The top subplot shows generation throughput vs context length, the bottom shows draft acceptance rate.

Stability Boxplot

Shows the distribution of generation throughput for each MTP setting, making it easy to spot unstable configurations.

Throughput Uplift vs Baseline

Compares each MTP setting's average throughput against the baseline (no MTP), with error bars showing standard deviation.
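The uplift figure is a relative change in mean throughput. A minimal sketch, assuming setting 0 is the no-MTP baseline and that throughput samples are already grouped by setting (the function name and input shape are illustrative, not the profiler's API):

```python
from statistics import fmean


def uplift_vs_baseline(
    tps_by_setting: dict[int, list[float]], baseline: int = 0
) -> dict[int, float]:
    """Percentage change in mean generation TPS relative to the baseline setting."""
    base = fmean(tps_by_setting[baseline])
    if base <= 0:
        raise ValueError("baseline throughput must be positive")
    return {
        setting: (fmean(samples) / base - 1.0) * 100.0
        for setting, samples in tps_by_setting.items()
    }
```

By construction the baseline maps to 0.0, matching the "+0.0%" row for setting 0 in the recommendation output.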

Recommendation output

============================================================
  MTP Profiler - Recommendation
============================================================
Recommended MTP setting: 2

Settings compared:
  Setting 0: throughput=+0.0%, long-context=degraded, stability=variable
  Setting 1: throughput=+9.7%, long-context=degraded, stability=variable
  Setting 2: throughput=+33.6%, long-context=degraded, stability=variable <-- recommended
  Setting 3: throughput=+25.9%, long-context=degraded, stability=variable
  Setting 4: throughput=+35.2%, long-context=degraded, stability=variable

Throughput vs baseline: +33.6%
Long-context efficiency: degraded
Stability: variable
Average generation throughput: 19.97 t/s
Average draft acceptance rate: 93.8%
============================================================

Setting 4 shows a slightly higher raw throughput uplift than setting 2, yet setting 2 is recommended: the score also weighs stability, long-context efficiency, and acceptance rate, and deducts points for n_max > 2 (see Recommendation Scoring).

Data Model

Parsed output structure

{
  "runs": [
    {
      "id": "run_1",
      "metadata": {
        "model": "Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf",
        "quantization": "Q4_K_XL",
        "system": {
          "chip": "Apple M3 Pro",
          "chip_type": "M3",
          "unified_memory_mb": 28753,
          "cpu_threads": 8,
          "cpu_total_threads": 11
        },
        "mtp_config": {"n_max": 1, "n_min": 0, "p_min": 0.7}
      },
      "measurements": [
        {
          "n_tokens": 62557,
          "n_decoded": 278,
          "generation_tokens_per_second": 15.65,
          "prompt_tokens_per_second": 193.36,
          "draft_acceptance_rate": 0.975,
          "n_drafts_generated": 447,
          "n_drafts_accepted": 415
        }
      ]
    }
  ]
}

Measurement fields

| Field | Description |
|---|---|
| `n_tokens` | Context length (tokens) |
| `n_decoded` | Number of decoded tokens |
| `generation_tokens_per_second` | Generation throughput (tokens/s) |
| `prompt_tokens_per_second` | Prompt processing throughput (tokens/s) |
| `draft_acceptance_rate` | MTP draft acceptance rate (0-1) |
| `n_drafts_generated` | Number of drafts generated |
| `n_drafts_accepted` | Number of drafts accepted |
| `truncated` | Number of truncated tokens |
| `mtp_calls` | Number of MTP inference calls |
| `mtp_gen_drafts` | Drafts generated by MTP |
| `mtp_acc_drafts` | Drafts accepted by MTP |
| `mtp_gen_tokens` | Tokens generated by MTP |
| `mtp_acc_tokens` | Tokens accepted by MTP |
| `mtp_dur_batch` | Batch processing duration (ms) |
| `mtp_dur_gen` | Generation duration (ms) |
| `mtp_dur_acc` | Acceptance duration (ms) |

Analysis Metrics

The analyze stage computes:

Throughput statistics - avg, std, min, max, median, p10, p90 generation TPS
Context-TPS correlation - Pearson correlation between context length and throughput
Degradation rate - TPS loss per 1000 tokens of context
MTP setting comparisons - Grouped by n_max setting with context ranges
Stability metrics - Coefficient of variation, variance
Long-context behavior - Short vs long context TPS ratio

Recommendation Scoring

The recommendation engine uses a weighted scoring algorithm:

| Factor | Weight | Description |
|---|---|---|
| Throughput | 40% | Average generation TPS |
| Stability | 25% | Low coefficient of variation |
| Long-context efficiency | 20% | TPS ratio (long/short context) |
| Acceptance rate | 15% | Average draft acceptance rate |
| Diminishing returns | Penalty | Deducts points for n_max > 2 |
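The weights above can be combined into a single score roughly as follows. This is a sketch, not the recommender's actual code: the normalization of each factor to the range 0-1 and the size of the diminishing-returns penalty are assumptions, and all names are hypothetical.

```python
from dataclasses import dataclass

WEIGHTS = {"throughput": 0.40, "stability": 0.25, "long_context": 0.20, "acceptance": 0.15}
PENALTY_PER_STEP = 0.05  # HYPOTHETICAL deduction per n_max step above 2


@dataclass
class SettingStats:
    n_max: int
    avg_tps: float           # average generation TPS
    cv: float                # coefficient of variation of TPS
    long_short_ratio: float  # long-context TPS / short-context TPS
    acceptance_rate: float   # average draft acceptance rate (0-1)


def _clamp(x: float) -> float:
    return max(0.0, min(1.0, x))


def score(s: SettingStats, best_tps: float) -> float:
    """Weighted score; ASSUMES each factor is normalized to [0, 1]."""
    factors = {
        "throughput": _clamp(s.avg_tps / best_tps),
        "stability": _clamp(1.0 - s.cv),
        "long_context": _clamp(s.long_short_ratio),
        "acceptance": _clamp(s.acceptance_rate),
    }
    base = sum(WEIGHTS[name] * value for name, value in factors.items())
    return base - PENALTY_PER_STEP * max(0, s.n_max - 2)


def recommend(stats: list[SettingStats]) -> int:
    """Return the n_max with the highest score."""
    best_tps = max(s.avg_tps for s in stats)
    return max(stats, key=lambda s: score(s, best_tps)).n_max
```

Because the scoring is a pure function of the analysis metrics, the same input always yields the same recommendation, and a setting with slightly higher raw throughput can still lose to a lower n_max once the penalty is applied.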

Architecture

mtp_profiler/
├── cli/              # Typer CLI with subcommands
├── models/           # Pydantic data models
├── parser/           # llama.cpp log parser
├── analyzer/         # Analysis engine
├── visualizer/       # Chart generation
├── recommender/      # Deterministic recommendation engine
└── system_info/      # Apple Silicon detection

CLI Reference

`parse`

Extract telemetry from llama.cpp server logs.

mtp-profiler parse <log_file> [-o OUTPUT] [-v]

`analyze`

Compute derived metrics from parsed data.

mtp-profiler analyze <input.json> [-o OUTPUT] [-r RUN_ID] [-v]

`recommend`

Generate MTP setting recommendations.

mtp-profiler recommend <analysis.json> [-o OUTPUT] [-v]

`plot`

Generate publication-quality charts.

mtp-profiler plot <analysis.json> [-d OUTPUT_DIR] [-r RUN_ID] [--lowess] [--lowess-frac FRAC] [-v]

Use --lowess for smoother trend lines (requires statsmodels):

pip install -e ".[lowess]"
mtp-profiler plot analysis.json --lowess --lowess-frac 0.33

`profile`

Full pipeline: parse → analyze → recommend → plot.

mtp-profiler profile <log_file> [-d OUTPUT_DIR] [-r RUN_ID] [-v]

`sysinfo`

Display system information.

mtp-profiler sysinfo

Requirements

Python 3.11+
Apple Silicon (M1/M2/M3)
llama.cpp server logs

Optional:

statsmodels - Required for LOWESS smoothing (pip install -e ".[lowess]")

Development

# Install in development mode
pip install -e ".[dev]"

# Install with LOWESS smoothing support
pip install -e ".[dev,lowess]"

# Run tests
pytest tests/ -v

# Run with real log
mtp-profiler profile llama.log -d test-output/

Future Extensibility

v1 is intentionally focused on llama.cpp log analysis. The architecture supports future extensions:

LM Studio / Ollama / Open WebUI support
Synthetic benchmark harnesses
Live monitoring
Adaptive runtime MTP recommendations

License

MIT
