Real-world speculative decoding profiler for llama.cpp workloads on Apple Silicon.
MTP Profiler analyzes real llama.cpp inference workloads to help identify optimal speculative decoding (Multi-Token Prediction / MTP) settings for your hardware and usage patterns. Unlike synthetic microbenchmarks, MTP Profiler focuses on long-running real-world sessions such as coding agents, chat workloads, and long-context inference. It profiles throughput degradation, draft-token acceptance rates, and speculative decoding efficiency using telemetry extracted directly from llama.cpp logs.
Speculative decoding performance depends heavily on:
- hardware characteristics,
- memory bandwidth,
- context length,
- model architecture,
- workload entropy,
- and draft-token acceptance rates.
Synthetic benchmarks often fail to capture real-world behavior, especially during long-running sessions with growing context windows. MTP Profiler helps quantify speculative decoding efficiency using telemetry from actual inference workloads, making it easier to identify the optimal draft-token settings for a specific machine and usage pattern.
Multi-Token Prediction (MTP) is a speculative decoding technique used in llama.cpp to accelerate inference. Instead of generating one token at a time, MTP uses a smaller "draft" model to predict multiple tokens in parallel, then verifies them against the full model. The key configuration parameters are:
- --mtp-n-max: maximum number of draft tokens to generate per step (e.g. 1, 2, 3, 4)
- --mtp-n-min: minimum number of draft tokens (usually 0)
- --mtp-p-min: minimum acceptance probability threshold (e.g. 0.70)
Higher n_max values can increase throughput but may reduce acceptance rates or become unstable at long context lengths. The optimal setting depends on your specific model, hardware, and workload.
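One way to see why larger n_max values bring diminishing returns is the usual geometric estimate for speculative decoding. The sketch below assumes that each draft token is accepted independently with the same probability and that drafting stops at the first rejection, which real workloads only approximate. The function name is illustrative and is not part of MTP Profiler.

```python
def expected_tokens_per_step(acceptance: float, n_max: int) -> float:
    """Expected tokens emitted per verification step.

    Assumption: each draft token is accepted independently with probability
    `acceptance`, and the first rejection ends the step. The full model
    always contributes one token, so the result is never below 1.
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be in [0, 1]")
    if n_max < 0:
        raise ValueError("n_max must be non-negative")
    # Geometric series: 1 + a + a^2 + ... + a^n_max
    return sum(acceptance ** k for k in range(n_max + 1))


if __name__ == "__main__":
    # Print the estimate for n_max = 0..4 at a 70% acceptance rate.
    for n in range(5):
        print(n, round(expected_tokens_per_step(0.7, n), 3))
```

Under this model each additional draft token adds less than the previous one, while the cost of drafting and verifying it does not shrink, so measured throughput can flatten or drop as n_max grows.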
Start the llama.cpp server with your desired MTP configuration:
llama-server \
--model /path/to/your/model.gguf \
--mtp-n-max 2 \
--mtp-n-min 0 \
--mtp-p-min 0.70
Run the server with output piped through tee to capture logs to a file:
llama-server ... 2>&1 | tee -a llama.log
Send inference requests (via API, web UI, or benchmark tools) while the server is running. The profiler parses the server log to extract timing data, MTP metrics, and system information.
llama.cpp only supports one MTP configuration at a time. To get a comprehensive comparison and recommendation:
- Run the server multiple times with different --mtp-n-max values (e.g. 1, 2, 3, 4), sending the same workload each time
- Append each run's logs to the same file using tee -a
- Point the profiler at the combined log file; it automatically detects server restarts and merges runs by n_max setting
Example workflow:
# Run 1: n_max=1
llama-server --mtp-n-max 1 ... 2>&1 | tee -a llama.log
# Run 2: n_max=2 (append to same file)
llama-server --mtp-n-max 2 ... 2>&1 | tee -a llama.log
# Run 3: n_max=3 (append to same file)
llama-server --mtp-n-max 3 ... 2>&1 | tee -a llama.log
# Analyze all runs together
mtp-profiler profile llama.log -d output/
This repository and its code were generated entirely by AI agents. The charts, analysis, and recommendations shown in this README were produced from logs collected while the AI agent was implementing this codebase, which makes it a self-referential profiling exercise. The real-world test data comes from Qwen3.6-35B-A3B-UD-Q4_K_XL on an Apple M3 Pro (36 GB RAM). All code and results were human-reviewed and verified by the repo owner.
- Passive log analysis - No synthetic benchmarks, just analyze real llama.cpp server logs
- Cross-run merging - Multiple runs with the same MTP setting are merged into one dataset
- Multi-run detection - Detects server restarts and separates runs automatically
- Apple Silicon aware - Collects chip type, memory, memory pressure, and thread information
- Deterministic recommendations - Algorithmic scoring with diminishing returns penalty
- Publication-quality charts - Throughput vs context, acceptance rates, stability boxplots
- LOWESS smoothing - Optional advanced smoothing for trend lines
- MTP internal stats - Parses draft call counts, generation/acceptance durations
- Robust parsing - Tolerates ANSI codes, malformed lines, truncated logs
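As a rough illustration of the robust-parsing point, the sketch below shows one way to normalize raw log bytes before any pattern matching: undecodable bytes are replaced instead of raising, ANSI color sequences are stripped, and empty lines are dropped. It is an assumption about how such cleanup could be done, not the profiler's actual parser, and clean_lines is a hypothetical name.

```python
import re
from collections.abc import Iterator

# Matches ANSI CSI sequences such as "\x1b[32m" (color) or "\x1b[0m" (reset).
ANSI_CSI = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")


def clean_lines(raw: bytes) -> Iterator[str]:
    """Yield non-empty log lines with ANSI escapes removed.

    Invalid UTF-8 is replaced with U+FFFD so that a damaged or truncated
    log never aborts the whole parse.
    """
    text = raw.decode("utf-8", errors="replace")
    for line in text.splitlines():
        line = ANSI_CSI.sub("", line).strip()
        if line:
            yield line
```

A truncated final line is still yielded; a later parsing step would then be expected to skip it if it does not match any known pattern.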
cd mtp-profiler
python -m venv .venv
source .venv/bin/activate
pip install -e .
mtp-profiler profile llama.log -d output/
This runs all stages automatically:
- Parse - Extract telemetry from the log file
- Analyze - Compute throughput stats, correlations, MTP comparisons
- Recommend - Generate optimal MTP setting recommendation
- Plot - Generate charts in output/charts/
# Stage 1: Parse log
mtp-profiler parse llama.log -o parsed.json
# Stage 2: Analyze
mtp-profiler analyze parsed.json -o analysis.json
# Stage 3: Recommend
mtp-profiler recommend analysis.json -o recommendation.json
# Stage 4: Plot
mtp-profiler plot analysis.json -d charts/
If your log contains multiple server restarts (different MTP configurations), the default behavior merges all runs by n_max setting and plots one line per unique setting. To analyze a specific run:
mtp-profiler profile llama.log -d output/ -r run_2
Or analyze every run separately by passing its run ID (-r) to each stage that accepts one.
The profile command produces:
output/
├── parsed.json # Raw extracted telemetry
├── analysis.json # Computed metrics
├── recommendation.json # Optimal MTP setting
└── charts/
├── throughput_and_acceptance.png # Throughput + acceptance rate charts
├── stability_boxplot.png # Throughput distribution by setting
└── uplift_vs_baseline.png # Throughput uplift comparison
Below are example outputs from profiling Qwen3.6-35B-A3B-UD-Q4_K_XL on an Apple M3 Pro (36 GB RAM; the parsed log reports 28,753 MB of unified memory).
Each line represents a different MTP n_max setting, with all runs for the same setting merged into one dataset. The top subplot shows generation throughput vs context length, the bottom shows draft acceptance rate.
Shows the distribution of generation throughput for each MTP setting, making it easy to spot unstable configurations.
Compares each MTP setting's average throughput against the baseline (no MTP), with error bars showing standard deviation.
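The uplift figures can be reproduced from per-setting throughput samples. The sketch below is a minimal version of that arithmetic, assuming setting 0 is the no-MTP baseline and expressing both the uplift and the standard deviation as a percentage of the baseline mean. The function name and the sample numbers in the usage section are hypothetical.

```python
from statistics import mean, stdev


def uplift_vs_baseline(
    tps_by_setting: dict[int, list[float]], baseline: int = 0
) -> dict[int, tuple[float, float]]:
    """Return {setting: (uplift_percent, std_percent)} relative to the baseline mean."""
    base = mean(tps_by_setting[baseline])
    result: dict[int, tuple[float, float]] = {}
    for setting, samples in sorted(tps_by_setting.items()):
        # A single sample has no spread, so report zero instead of raising.
        spread = stdev(samples) if len(samples) > 1 else 0.0
        uplift = (mean(samples) - base) / base * 100.0
        result[setting] = (uplift, spread / base * 100.0)
    return result


if __name__ == "__main__":
    # Hypothetical samples, not measurements from this README.
    samples = {0: [10.0, 10.0], 2: [12.0, 14.0]}
    for setting, (uplift, spread) in uplift_vs_baseline(samples).items():
        print(f"n_max={setting}: {uplift:+.1f}% (std {spread:.1f}%)")
```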
============================================================
MTP Profiler - Recommendation
============================================================
Recommended MTP setting: 2
Settings compared:
Setting 0: throughput=+0.0%, long-context=degraded, stability=variable
Setting 1: throughput=+9.7%, long-context=degraded, stability=variable
Setting 2: throughput=+33.6%, long-context=degraded, stability=variable <-- recommended
Setting 3: throughput=+25.9%, long-context=degraded, stability=variable
Setting 4: throughput=+35.2%, long-context=degraded, stability=variable
Throughput vs baseline: +33.6%
Long-context efficiency: degraded
Stability: variable
Average generation throughput: 19.97 t/s
Average draft acceptance rate: 93.8%
============================================================
{
"runs": [
{
"id": "run_1",
"metadata": {
"model": "Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf",
"quantization": "Q4_K_XL",
"system": {
"chip": "Apple M3 Pro",
"chip_type": "M3",
"unified_memory_mb": 28753,
"cpu_threads": 8,
"cpu_total_threads": 11
},
"mtp_config": {"n_max": 1, "n_min": 0, "p_min": 0.7}
},
"measurements": [
{
"n_tokens": 62557,
"n_decoded": 278,
"generation_tokens_per_second": 15.65,
"prompt_tokens_per_second": 193.36,
"draft_acceptance_rate": 0.975,
"n_drafts_generated": 447,
"n_drafts_accepted": 415
}
]
}
]
}
| Field | Description |
|---|---|
| n_tokens | Context length (tokens) |
| n_decoded | Number of decoded tokens |
| generation_tokens_per_second | Generation throughput (tokens/s) |
| prompt_tokens_per_second | Prompt processing throughput (tokens/s) |
| draft_acceptance_rate | MTP draft acceptance rate (0-1) |
| n_drafts_generated | Number of drafts generated |
| n_drafts_accepted | Number of drafts accepted |
| truncated | Number of truncated tokens |
| mtp_calls | Number of MTP inference calls |
| mtp_gen_drafts | Drafts generated by MTP |
| mtp_acc_drafts | Drafts accepted by MTP |
| mtp_gen_tokens | Tokens generated by MTP |
| mtp_acc_tokens | Tokens accepted by MTP |
| mtp_dur_batch | Batch processing duration (ms) |
| mtp_dur_gen | Generation duration (ms) |
| mtp_dur_acc | Acceptance duration (ms) |
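For readers who want to post-process parsed.json themselves, the sketch below groups measurements by n_max across runs, following the JSON shape shown above. It only illustrates the merging idea: the helper names are hypothetical, and skipping runs that lack an mtp_config entry is an assumption rather than documented profiler behavior.

```python
import json
from collections import defaultdict
from statistics import mean


def merge_by_n_max(parsed: dict) -> dict[int, list[dict]]:
    """Group the measurements of all runs under their n_max setting."""
    merged: dict[int, list[dict]] = defaultdict(list)
    for run in parsed.get("runs", []):
        n_max = run.get("metadata", {}).get("mtp_config", {}).get("n_max")
        if n_max is None:
            continue  # assumption: runs without an MTP config are ignored here
        merged[n_max].extend(run.get("measurements", []))
    return dict(merged)


def mean_generation_tps(merged: dict[int, list[dict]]) -> dict[int, float]:
    """Average generation throughput per n_max setting."""
    return {
        n_max: mean(m["generation_tokens_per_second"] for m in rows)
        for n_max, rows in merged.items()
        if rows
    }


if __name__ == "__main__":
    with open("output/parsed.json", encoding="utf-8") as fh:
        print(mean_generation_tps(merge_by_n_max(json.load(fh))))
```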
The analyze stage computes:
- Throughput statistics - avg, std, min, max, median, p10, p90 generation TPS
- Context-TPS correlation - Pearson correlation between context length and throughput
- Degradation rate - TPS loss per 1000 tokens of context
- MTP setting comparisons - Grouped by n_max setting with context ranges
- Stability metrics - Coefficient of variation, variance
- Long-context behavior - Short vs long context TPS ratio
The recommendation engine uses a weighted scoring algorithm:
| Factor | Weight | Description |
|---|---|---|
| Throughput | 40% | Average generation TPS |
| Stability | 25% | Low coefficient of variation |
| Long-context efficiency | 20% | TPS ratio (long/short context) |
| Acceptance rate | 15% | Average draft acceptance rate |
| Diminishing returns | Penalty | Deducts points for n_max > 2 |
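A weighted score of this kind could be sketched as follows. The four weights come from the table above; everything else is an assumption made for illustration, including how each factor is normalized to the 0-1 range and the size of the penalty per n_max step above 2. SettingStats, score, and recommend are hypothetical names.

```python
from dataclasses import dataclass

WEIGHTS = {"throughput": 0.40, "stability": 0.25, "long_context": 0.20, "acceptance": 0.15}
PENALTY_PER_EXTRA_DRAFT = 0.05  # assumption: the penalty size is not documented


@dataclass(frozen=True)
class SettingStats:
    n_max: int
    avg_tps: float           # average generation TPS
    cv: float                # coefficient of variation of generation TPS
    long_short_ratio: float  # long-context TPS / short-context TPS
    acceptance: float        # average draft acceptance rate, 0-1


def _clip01(value: float) -> float:
    return max(0.0, min(1.0, value))


def score(stats: SettingStats, best_tps: float) -> float:
    """Weighted sum of normalized factors minus a penalty for n_max > 2."""
    factors = {
        "throughput": _clip01(stats.avg_tps / best_tps),
        "stability": _clip01(1.0 - stats.cv),
        "long_context": _clip01(stats.long_short_ratio),
        "acceptance": _clip01(stats.acceptance),
    }
    weighted = sum(WEIGHTS[name] * value for name, value in factors.items())
    return weighted - PENALTY_PER_EXTRA_DRAFT * max(0, stats.n_max - 2)


def recommend(candidates: list[SettingStats]) -> SettingStats:
    """Pick the highest-scoring setting; the result is deterministic for given inputs."""
    best_tps = max(c.avg_tps for c in candidates)
    return max(candidates, key=lambda c: score(c, best_tps))
```

With a penalty like this, a setting with slightly higher raw throughput but a larger n_max can still lose to a smaller setting, which is the kind of outcome a diminishing-returns penalty is meant to produce.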
mtp_profiler/
├── cli/ # Typer CLI with subcommands
├── models/ # Pydantic data models
├── parser/ # llama.cpp log parser
├── analyzer/ # Analysis engine
├── visualizer/ # Chart generation
├── recommender/ # Deterministic recommendation engine
└── system_info/ # Apple Silicon detection
Extract telemetry from llama.cpp server logs.
mtp-profiler parse <log_file> [-o OUTPUT] [-v]
Compute derived metrics from parsed data.
mtp-profiler analyze <input.json> [-o OUTPUT] [-r RUN_ID] [-v]
Generate MTP setting recommendations.
mtp-profiler recommend <analysis.json> [-o OUTPUT] [-v]
Generate publication-quality charts.
mtp-profiler plot <analysis.json> [-d OUTPUT_DIR] [-r RUN_ID] [--lowess] [--lowess-frac FRAC] [-v]
Use --lowess for smoother trend lines (requires statsmodels):
pip install -e ".[lowess]"
mtp-profiler plot analysis.json --lowess --lowess-frac 0.33
Full pipeline: parse → analyze → recommend → plot.
mtp-profiler profile <log_file> [-d OUTPUT_DIR] [-r RUN_ID] [-v]
Display system information.
mtp-profiler sysinfo
- Python 3.11+
- Apple Silicon (M1/M2/M3)
- llama.cpp server logs
Optional:
- statsmodels - Required for LOWESS smoothing (pip install -e ".[lowess]")
# Install in development mode
pip install -e ".[dev]"
# Install with LOWESS smoothing support
pip install -e ".[dev,lowess]"
# Run tests
pytest tests/ -v
# Run with real log
mtp-profiler profile llama.log -d test-output/
v1 is intentionally focused on llama.cpp log analysis. The architecture supports future extensions:
- LM Studio / Ollama / Open WebUI support
- Synthetic benchmark harnesses
- Live monitoring
- Adaptive runtime MTP recommendations
MIT


