Skip to content

dmitryryabkov/mtp-profiler

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MTP Profiler

Real-world speculative decoding profiler for llama.cpp workloads on Apple Silicon.

MTP Profiler analyzes real llama.cpp inference workloads to help identify optimal speculative decoding (Multi-Token Prediction / MTP) settings for your hardware and usage patterns. Unlike synthetic microbenchmarks, MTP Profiler focuses on long-running real-world sessions such as coding agents, chat workloads, and long-context inference. It profiles throughput degradation, draft-token acceptance rates, and speculative decoding efficiency using telemetry extracted directly from llama.cpp logs.

Why This Exists

Speculative decoding performance depends heavily on:

  • hardware characteristics,
  • memory bandwidth,
  • context length,
  • model architecture,
  • workload entropy,
  • and draft-token acceptance rates.

Synthetic benchmarks often fail to capture real-world behavior, especially during long-running sessions with growing context windows. MTP Profiler helps quantify speculative decoding efficiency using telemetry from actual inference workloads, making it easier to identify the optimal draft-token settings for a specific machine and usage pattern.

What is MTP?

Multi-Token Prediction (MTP) is a speculative decoding technique used in llama.cpp to accelerate inference. Instead of generating one token at a time, MTP uses a smaller "draft" model to predict multiple tokens in parallel, then verifies them against the full model. The key configuration parameters are:

  • --mtp-n-max — maximum number of draft tokens to generate per step (e.g. 1, 2, 3, 4)
  • --mtp-n-min — minimum number of draft tokens (usually 0)
  • --mtp-p-min — minimum acceptance probability threshold (e.g. 0.70)

Higher n_max values can increase throughput but may reduce acceptance rates or become unstable at long context lengths. The optimal setting depends on your specific model, hardware, and workload.

How to run llama.cpp with MTP

Start the llama.cpp server with your desired MTP configuration:

llama-server \
  --model /path/to/your/model.gguf \
  --mtp-n-max 2 \
  --mtp-n-min 0 \
  --mtp-p-min 0.70

How to collect logs

Run the server with output piped through tee to capture logs to a file:

llama-server ... 2>&1 | tee -a llama.log

Send inference requests (via API, web UI, or benchmark tools) while the server is running. The profiler parses the server log to extract timing data, MTP metrics, and system information.

How to get a comprehensive analysis

llama.cpp only supports one MTP configuration at a time. To get a comprehensive comparison and recommendation:

  1. Run the server multiple times with different --mtp-n-max values (e.g. 1, 2, 3, 4), sending the same workload each time
  2. Append each run's logs to the same file using tee -a
  3. Point the profiler at the combined log file — it automatically detects server restarts and merges runs by n_max setting

Example workflow:

# Run 1: n_max=1
llama-server --mtp-n-max 1 ... 2>&1 | tee -a llama.log

# Run 2: n_max=2 (append to same file)
llama-server --mtp-n-max 2 ... 2>&1 | tee -a llama.log

# Run 3: n_max=3 (append to same file)
llama-server --mtp-n-max 3 ... 2>&1 | tee -a llama.log

# Analyze all runs together
mtp-profiler profile llama.log -d output/

About this project

This repository and its code were generated entirely by AI agents. The charts, analysis, and recommendations shown in this README were produced from logs collected while the AI agent was actively implementing this very codebase — making it a self-referential profiling exercise. The real-world test data comes from Qwen3.6-35B-A3B-UD-Q4_K_XL on an Apple M3 Pro (36 GB RAM). All code and results were human-reviewed and verified by the repo owner.

Features

  • Passive log analysis - No synthetic benchmarks, just analyze real llama.cpp server logs
  • Cross-run merging - Multiple runs with the same MTP setting are merged into one dataset
  • Multi-run detection - Detects server restarts and separates runs automatically
  • Apple Silicon aware - Collects chip type, memory, memory pressure, and thread information
  • Deterministic recommendations - Algorithmic scoring with diminishing returns penalty
  • Publication-quality charts - Throughput vs context, acceptance rates, stability boxplots
  • LOWESS smoothing - Optional advanced smoothing for trend lines
  • MTP internal stats - Parses draft call counts, generation/acceptance durations
  • Robust parsing - Tolerates ANSI codes, malformed lines, truncated logs

Installation

cd mtp-profiler
python -m venv .venv
source .venv/bin/activate
pip install -e .

Quick Start

Full pipeline (recommended)

mtp-profiler profile llama.log -d output/

This runs all stages automatically:

  1. Parse - Extract telemetry from the log file
  2. Analyze - Compute throughput stats, correlations, MTP comparisons
  3. Recommend - Generate optimal MTP setting recommendation
  4. Plot - Generate charts in output/charts/

Step-by-step

# Stage 1: Parse log
mtp-profiler parse llama.log -o parsed.json

# Stage 2: Analyze
mtp-profiler analyze parsed.json -o analysis.json

# Stage 3: Recommend
mtp-profiler recommend analysis.json -o recommendation.json

# Stage 4: Plot
mtp-profiler plot analysis.json -d charts/

Multi-run analysis

If your log contains multiple server restarts (different MTP configurations), the default behavior merges all runs by n_max setting, showing one line per unique setting. To analyze a specific run:

mtp-profiler profile llama.log -d output/ -r run_2

Or analyze all runs by specifying the run ID for each stage.

Output

The profile command produces:

output/
├── parsed.json          # Raw extracted telemetry
├── analysis.json        # Computed metrics
├── recommendation.json  # Optimal MTP setting
└── charts/
    ├── throughput_and_acceptance.png  # Throughput + acceptance rate charts
    ├── stability_boxplot.png          # Throughput distribution by setting
    └── uplift_vs_baseline.png         # Throughput uplift comparison

Example Charts

Below are example outputs from profiling Qwen3.6-35B-A3B-UD-Q4_K_XL on Apple M3 Pro (28 GB).

Throughput & Acceptance Rate

Throughput & Acceptance Rate

Each line represents a different MTP n_max setting, with all runs for the same setting merged into one dataset. The top subplot shows generation throughput vs context length, the bottom shows draft acceptance rate.

Stability Boxplot

Stability Boxplot

Shows the distribution of generation throughput for each MTP setting, making it easy to spot unstable configurations.

Throughput Uplift vs Baseline

Uplift vs Baseline

Compares each MTP setting's average throughput against the baseline (no MTP), with error bars showing standard deviation.

Recommendation output

============================================================
  MTP Profiler - Recommendation
============================================================
Recommended MTP setting: 2

Settings compared:
  Setting 0: throughput=+0.0%, long-context=degraded, stability=variable
  Setting 1: throughput=+9.7%, long-context=degraded, stability=variable
  Setting 2: throughput=+33.6%, long-context=degraded, stability=variable <-- recommended
  Setting 3: throughput=+25.9%, long-context=degraded, stability=variable
  Setting 4: throughput=+35.2%, long-context=degraded, stability=variable

Throughput vs baseline: +33.6%
Long-context efficiency: degraded
Stability: variable
Average generation throughput: 19.97 t/s
Average draft acceptance rate: 93.8%
============================================================

Data Model

Parsed output structure

{
  "runs": [
    {
      "id": "run_1",
      "metadata": {
        "model": "Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf",
        "quantization": "Q4_K_XL",
        "system": {
          "chip": "Apple M3 Pro",
          "chip_type": "M3",
          "unified_memory_mb": 28753,
          "cpu_threads": 8,
          "cpu_total_threads": 11
        },
        "mtp_config": {"n_max": 1, "n_min": 0, "p_min": 0.7}
      },
      "measurements": [
        {
          "n_tokens": 62557,
          "n_decoded": 278,
          "generation_tokens_per_second": 15.65,
          "prompt_tokens_per_second": 193.36,
          "draft_acceptance_rate": 0.975,
          "n_drafts_generated": 447,
          "n_drafts_accepted": 415
        }
      ]
    }
  ]
}

Measurement fields

Field Description
n_tokens Context length (tokens)
n_decoded Number of decoded tokens
generation_tokens_per_second Generation throughput
prompt_tokens_per_second Prompt processing throughput
draft_acceptance_rate MTP draft acceptance rate (0-1)
n_drafts_generated Number of drafts generated
n_drafts_accepted Number of drafts accepted
truncated Number of truncated tokens
mtp_calls Number of MTP inference calls
mtp_gen_drafts Drafts generated by MTP
mtp_acc_drafts Drafts accepted by MTP
mtp_gen_tokens Tokens generated by MTP
mtp_acc_tokens Tokens accepted by MTP
mtp_dur_batch Batch processing duration (ms)
mtp_dur_gen Generation duration (ms)
mtp_dur_acc Acceptance duration (ms)

Analysis Metrics

The analyze stage computes:

  • Throughput statistics - avg, std, min, max, median, p10, p90 generation TPS
  • Context-TPS correlation - Pearson correlation between context length and throughput
  • Degradation rate - TPS loss per 1000 tokens of context
  • MTP setting comparisons - Grouped by n_max setting with context ranges
  • Stability metrics - Coefficient of variation, variance
  • Long-context behavior - Short vs long context TPS ratio

Recommendation Scoring

The recommendation engine uses a weighted scoring algorithm:

Factor Weight Description
Throughput 40% Average generation TPS
Stability 25% Low coefficient of variation
Long-context efficiency 20% TPS ratio (long/short context)
Acceptance rate 15% Average draft acceptance rate
Diminishing returns Penalty Deducts points for n_max > 2

Architecture

mtp_profiler/
├── cli/              # Typer CLI with subcommands
├── models/           # Pydantic data models
├── parser/           # llama.cpp log parser
├── analyzer/         # Analysis engine
├── visualizer/       # Chart generation
├── recommender/      # Deterministic recommendation engine
└── system_info/      # Apple Silicon detection

CLI Reference

parse

Extract telemetry from llama.cpp server logs.

mtp-profiler parse <log_file> [-o OUTPUT] [-v]

analyze

Compute derived metrics from parsed data.

mtp-profiler analyze <input.json> [-o OUTPUT] [-r RUN_ID] [-v]

recommend

Generate MTP setting recommendations.

mtp-profiler recommend <analysis.json> [-o OUTPUT] [-v]

plot

Generate publication-quality charts.

mtp-profiler plot <analysis.json> [-d OUTPUT_DIR] [-r RUN_ID] [--lowess] [--lowess-frac FRAC] [-v]

Use --lowess for smoother trend lines (requires statsmodels):

pip install -e ".[lowess]"
mtp-profiler plot analysis.json --lowess --lowess-frac 0.33

profile

Full pipeline: parse → analyze → recommend → plot.

mtp-profiler profile <log_file> [-d OUTPUT_DIR] [-r RUN_ID] [-v]

sysinfo

Display system information.

mtp-profiler sysinfo

Requirements

  • Python 3.11+
  • Apple Silicon (M1/M2/M3)
  • llama.cpp server logs

Optional:

  • statsmodels - Required for LOWESS smoothing (pip install -e ".[lowess]")

Development

# Install in development mode
pip install -e ".[dev]"

# Install with LOWESS smoothing support
pip install -e ".[dev,lowess]"

# Run tests
pytest tests/ -v

# Run with real log
mtp-profiler profile llama.log -d test-output/

Future Extensibility

v1 is intentionally focused on llama.cpp log analysis. The architecture supports future extensions:

  • LM Studio / Ollama / Open WebUI support
  • Synthetic benchmark harnesses
  • Live monitoring
  • Adaptive runtime MTP recommendations

License

MIT

About

Real-world speculative decoding profiler for llama.cpp workloads on Apple Silicon. Analyze MTP throughput, draft acceptance rates, and long-context performance.

Topics

Resources

License

Stars

Watchers

Forks

Contributors

Languages