Replicating Apple's LLM benchmarking #634

meghsat · 2025-11-25T22:59:46Z

Hey all,

I came across this article: https://machinelearning.apple.com/research/exploring-llms-mlx-m5, where Apple reports a 2.87 sec TTFT (time to first token) on the MacBook Pro M5 (24GB) for the GPT-OSS-20B-MXFP4-Q4 model using MLX. However, I can't replicate those numbers; I'm getting a TTFT of ~8 sec.
Note: None of the models listed in the article reach the claimed numbers on my machine.
Here’s my benchmarking setup:

To measure TTFT, I had to modify the mlx_lm/generate.py script. Here’s the PR containing those changes: https://github.com/ml-explore/mlx-lm/pull/633/files
Once you add the TTFT logic, please run this code:

import mlx.core as mx
from mlx_lm import load, generate

model, tokenizer = load(
    "mlx-community/gpt-oss-20b-MXFP4-Q4", # You can replace with any model ID mentioned in the article.
    tokenizer_config={"trust_remote_code": True}
)

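# Make sure the weights are materialized before anything is timed.
mx.eval(model.parameters())

vocab_size = tokenizer.vocab_size
prompt_length = 4096

# Build a reproducible prompt of random token IDs.
mx.random.seed(0)
dummy_tokens = mx.random.randint(0, vocab_size, (prompt_length,)).tolist()

# Clear the EOS set so generation always runs for the full max_tokens.
tokenizer._eos_token_ids = set()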

# Warmup run
response = generate(
    model,
    tokenizer,
    prompt=dummy_tokens,
    max_tokens=128,
    verbose=True,
    prefill_step_size=4096,
)

# Actual run
response = generate(
    model,
    tokenizer,
    prompt=dummy_tokens,
    max_tokens=128,
    verbose=True,
    prefill_step_size=4096,
)

It would be great if anyone has observed similar or different results and could share their setup here. Thanks in advance.
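For anyone who would rather not patch generate.py, here is a rough sketch of timing the first token from outside the library. It is not the logic from the PR above: `time_to_first_item`, `implied_prefill_rate`, and `fake_stream` are hypothetical helpers made up for illustration, and the stand-in stream exists only so the snippet runs without a model. The commented-out lines at the bottom assume that mlx-lm's `stream_generate` accepts a token list and `max_tokens`, so check that against your installed version.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def time_to_first_item(stream: Iterable[T]) -> Tuple[float, T, Iterator[T]]:
    """Return (seconds until the first item, that item, the remaining iterator).

    This relies on the stream being lazy (a generator), so that no work
    happens before the first next() call below.
    """
    start = time.perf_counter()
    iterator = iter(stream)
    first = next(iterator)  # blocks until the first item is produced
    return time.perf_counter() - start, first, iterator


def implied_prefill_rate(prompt_tokens: int, ttft_seconds: float) -> float:
    """Rough prompt-processing rate (tokens/sec) implied by a TTFT."""
    return prompt_tokens / ttft_seconds


def fake_stream(prefill_seconds: float, n_tokens: int) -> Iterator[int]:
    """Hypothetical stand-in for a model stream, so the sketch runs anywhere."""
    time.sleep(prefill_seconds)  # pretend this is prompt processing
    for token in range(n_tokens):
        yield token


if __name__ == "__main__":
    ttft, first, rest = time_to_first_item(fake_stream(0.05, 4))
    print(f"TTFT: {ttft:.3f} s, first item: {first}, remaining: {list(rest)}")
    print(f"Implied prefill rate: {implied_prefill_rate(4096, ttft):.0f} tok/s")
    # Assumed usage with mlx-lm (not run here):
    #   from mlx_lm import stream_generate
    #   ttft, first, rest = time_to_first_item(
    #       stream_generate(model, tokenizer, dummy_tokens, max_tokens=128))
```

Measured this way, TTFT includes prompt processing plus the first decode step, so the implied rate is only an approximation of prefill throughput. With the 4096-token prompt from the script above, a TTFT of 8 sec works out to roughly 512 tok/s.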

awni · 2025-11-25T23:05:44Z

Maintainer

What's your hardware / OS? To reproduce those numbers you need the latest MLX (0.30.0) on macOS 26.2 (beta release) on the M5.

2 replies

meghsat Nov 25, 2025
Author

Thank you for pointing it out. I'm on macOS 26.1.

meghsat Nov 26, 2025
Author

I was able to reproduce the numbers on macOS 26.2. Thanks!
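Since the gap here came down to the OS version, a quick environment check before benchmarking may save some time. This is a minimal sketch: the thresholds (MLX 0.30.0, macOS 26.2) are taken from awni's reply above, and `parse_version`, `at_least`, and `installed_mlx_version` are hypothetical helper names. It only compares version numbers and does not check the chip.

```python
import platform
import re
from importlib import metadata
from typing import Optional, Tuple

REQUIRED_MLX = "0.30.0"   # from the maintainer's reply in this thread
REQUIRED_MACOS = "26.2"   # from the maintainer's reply in this thread


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '26.2' or '0.30.0' into a tuple of ints, keeping leading digits only."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def at_least(found: str, required: str) -> bool:
    """True if `found` is the same as or newer than `required`."""
    a, b = parse_version(found), parse_version(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # pad so '0.30' compares equal to '0.30.0'
    b += (0,) * (width - len(b))
    return a >= b


def installed_mlx_version() -> Optional[str]:
    """Version of the `mlx` package, or None if it is not installed."""
    try:
        return metadata.version("mlx")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    macos = platform.mac_ver()[0]  # empty string on non-macOS systems
    mlx = installed_mlx_version()
    print(f"macOS: {macos or 'n/a'} (ok: {bool(macos) and at_least(macos, REQUIRED_MACOS)})")
    print(f"mlx:   {mlx or 'not installed'} (ok: {mlx is not None and at_least(mlx, REQUIRED_MLX)})")
```

The number reported by `platform.mac_ver()` can depend on how the Python interpreter was built, so if it looks wrong, confirm with `sw_vers` in a terminal.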

guruswami-ai · 2026-05-20T06:09:14Z

If you want to compare those numbers to an M3 Ultra (stand-alone or clustered) or run more tests for comparison, I've published a bunch at https://github.com/guruswami-ai/mlx-benchmarks. I haven't got my hands on an M5 system yet, but your results seem impressive. I'm looking forward to building a distributed M5 mesh if/when they are available. So far, the lesson seems to be: fit everything into one node if you can. An M5 Ultra with 512GB RAM would be impressive and would likely narrow the gap to NVIDIA GPU hardware even more.

I need to update my benchmarks (https://github.com/guruswami-ai/mlx-benchmarks/blob/main/docs/APPLE_SILICON_GUIDE.md) and the 'cluster simulator' at https://chakra.guruswami.ai with your M5 results.

0 replies
