Pi0.5 Performance by NSagan271 · Pull Request #103 · mstar-project/mstar

NSagan271 · 2026-06-10T06:15:58Z

1. Openpi API server Benchmarking

Added the openpi API server to the benchmark (with our harness copying from the openpi client), so that there is parity in terms of server / http overhead.

2. Optimized our Pi0.5

Data worker overhead: API server / data worker / tensor transport overhead was 10+ ms per request. Much of this time went to file reads/writes, PNG decoding, and SHM tensor transport between the data worker and the GPU worker. The openpi API server has none of these costs because it receives a numpy array directly. This was alleviated by:

Adding a new "numpy" datatype to the API server, which is not saved to a file but rather kept in memory and converted to a torch tensor at the data worker.
Having the benchmark resize the PNG to 224 x 224 and convert it to numpy before sending it to our API server (this matches what happens on the openpi end).
Keeping the images in uint8 on the data worker end and converting them to float in the ViT encoder's preprocess (this cuts the transported tensor size by 4x, since uint8 is 1 byte per element versus 4 for float32).
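
To make the 4x figure concrete, here is a minimal sketch of the payload arithmetic. The helper name `payload_bytes`, the 2-camera count, and the 224 x 224 resolution are assumptions for illustration, not code from this PR.

```python
from array import array


def payload_bytes(num_cameras, height, width, typecode, channels=3):
    """Bytes needed to ship a (num_cameras, channels, H, W) image tensor.

    typecode follows the stdlib array module: "B" is uint8, "f" is float32.
    """
    elements = num_cameras * channels * height * width
    return elements * array(typecode).itemsize


# Assumed shapes: 2 cameras, 3 channels, 224 x 224 pixels.
uint8_bytes = payload_bytes(2, 224, 224, "B")
float32_bytes = payload_bytes(2, 224, 224, "f")

print(uint8_bytes, float32_bytes, float32_bytes // uint8_bytes)
```

Deferring the float conversion to the encoder's preprocess means only the smaller uint8 payload crosses the data worker to GPU worker boundary.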

Scheduling and conductor overhead: as conceptually nice as it was, having the flow loop be a Loop primitive and having two graph walks with a "back to conductor" in between them was too much overhead.

The total 10-step flow loop is less than 50ms (only 26ms when under a single cuda graph), and only requires attention planning once at the beginning. Replanning attention between flow steps, combined with graph overhead, was too much for speculative scheduling to account for.
Also, I profiled about 4ms of back-to-conductor overhead, so I made our pi0.5 a single graph walk by making Paligemma (prefill) and the action expert (flow) separate nodes that share a KV cache. This is actually conceptually "ok": the two nodes have no overlap in terms of model weights, so putting them in the same "LLM" node was just a design choice rather than something "canonical".

Port over ViT: we were wrapping the transformers siglip implementation, so I moved the code over and added fused Q/K/V projections. Still using SDPA attention because we need float32 precision.
Benchmark parity: made sure that pi0.5 uses the droid yaml by default in the server launch file. openpi also only uses 2 images, so I had pi05_droid.yaml override num_cameras: 2 as well.
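
For readers unfamiliar with fused Q/K/V projections, the idea is to stack the three weight matrices and do one matrix multiply instead of three, then split the result. The sketch below uses plain Python lists and hypothetical names (`matvec`, `fused_qkv`); it is not the ported SigLIP code, only an illustration that the fused and unfused paths agree.

```python
def matvec(weight, x):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def fused_qkv(w_q, w_k, w_v, x):
    """Run one projection over the stacked weights, then split into q, k, v."""
    fused_weight = w_q + w_k + w_v  # stack rows: shape (3 * d_out, d_in)
    out = matvec(fused_weight, x)
    d_out = len(w_q)
    return out[:d_out], out[d_out:2 * d_out], out[2 * d_out:]


# Toy weights and input, chosen only for illustration.
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[2.0, 0.0], [0.0, 2.0]]
w_v = [[0.0, 1.0], [1.0, 0.0]]
x = [3.0, 4.0]

q, k, v = fused_qkv(w_q, w_k, w_v, x)
```

The fused form launches one larger matmul rather than three small ones, which is typically where the savings come from on a GPU.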

Results (JCT, 100 warmup, 100 requests, UW PTC server)

ours

mean=0.046s  p50=0.046s  p95=0.049s  p99=0.054s
mean=0.046s  p50=0.046s  p95=0.049s  p99=0.050s
mean=0.045s  p50=0.044s  p95=0.048s  p99=0.053s
mean=0.047s  p50=0.047s  p95=0.048s  p99=0.049s
mean=0.046s  p50=0.045s  p95=0.047s  p99=0.047s

openpi

mean=0.047s  p50=0.047s  p95=0.048s  p99=0.049s
mean=0.048s  p50=0.048s  p95=0.048s  p99=0.049s
mean=0.048s  p50=0.048s  p95=0.048s  p99=0.049s
mean=0.047s  p50=0.047s  p95=0.047s  p99=0.048s
mean=0.047s  p50=0.047s  p95=0.047s  p99=0.048s
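
To summarize the five runs per system, the per-run means can be averaged with a small script. This is a sketch for reading the numbers above; the helper name `mean_of_means` is hypothetical and the script is not part of the benchmark harness.

```python
import re
from statistics import mean

OURS = """
mean=0.046s  p50=0.046s  p95=0.049s  p99=0.054s
mean=0.046s  p50=0.046s  p95=0.049s  p99=0.050s
mean=0.045s  p50=0.044s  p95=0.048s  p99=0.053s
mean=0.047s  p50=0.047s  p95=0.048s  p99=0.049s
mean=0.046s  p50=0.045s  p95=0.047s  p99=0.047s
"""

OPENPI = """
mean=0.047s  p50=0.047s  p95=0.048s  p99=0.049s
mean=0.048s  p50=0.048s  p95=0.048s  p99=0.049s
mean=0.048s  p50=0.048s  p95=0.048s  p99=0.049s
mean=0.047s  p50=0.047s  p95=0.047s  p99=0.048s
mean=0.047s  p50=0.047s  p95=0.047s  p99=0.048s
"""


def mean_of_means(report):
    """Average the per-run 'mean=' values found in a block of result lines."""
    values = [float(m) for m in re.findall(r"mean=([0-9.]+)s", report)]
    return mean(values)


ours_avg = round(mean_of_means(OURS), 4)
openpi_avg = round(mean_of_means(OPENPI), 4)
print(f"ours={ours_avg}s openpi={openpi_avg}s")
```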

Limitations

The "numpy" datatype is not hooked up to the openai API (not entirely sure of the best way to do that).
There are added vibe-coded timing prints (gated by an environment variable) in the data worker. These can be removed.

merceod · 2026-06-13T11:36:10Z

+            # openpi droid policy only uses the first extra image, so send 2 cameras.
+            _numpy_paths=npy_paths[:2],


The benchmark is not apples-to-apples: M* silently runs on one camera, openpi on two. This may invalidate the comparison your PR is presumably built to demonstrate.

The benchmark uploads 2 cameras as 2 separate .npy blobs (benchmark/dataset.py line 854, request.py lines 756-762), and data_worker.py lines 224-231 loads each as a separate tensor --> image_inputs = [cam0, cam1].

But the encoder consumes only image_inputs[0]: Pi05ViTEncoderSubmodule.prepare_inputs --> self._prepare_one(inputs["image_inputs"][0]) (submodules.py lines 182-184). Camera 1 is silently dropped. openpi's baseline uses both (request.py _build_obs).

The encoder's actual contract is that image_inputs[0] is a stacked (num_cameras,3,H,W) tensor (_prepare_one handles the 4-D case; forward_batched flattens cameras into the token sequence). The benchmark/raw-numpy path violates this by sending a list of separate tensors.

Compounding it: the new num_cameras: 2 in configs/pi05_droid.yaml sizes the ViT CUDA-graph static buffer to (1,2,3,H,W), while runtime feeds (1,1,3,H,W). static_buf[:1].copy_(real_val) (cuda_graph_runner.py line 2169) broadcasts the size-1 camera dim, so the model processes two duplicate copies of camera 0. This does not crash; the output is just silently wrong.
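
One way to turn this silent broadcast into a loud failure would be an exact shape check before the copy. The sketch below is a suggestion under assumptions: `check_static_copy` is a hypothetical helper, and it takes plain shape tuples rather than touching the real cuda_graph_runner.py code.

```python
def check_static_copy(buf_shape, value_shape):
    """Raise if copying value into the buffer slice would rely on broadcasting.

    Both arguments are shape tuples, e.g. (1, 2, 3, 224, 224).
    """
    if tuple(buf_shape) != tuple(value_shape):
        raise ValueError(
            f"static buffer slice has shape {tuple(buf_shape)} but the runtime "
            f"value has shape {tuple(value_shape)}; refusing to broadcast"
        )


# Matching shapes pass silently.
check_static_copy((1, 2, 3, 224, 224), (1, 2, 3, 224, 224))

# The mismatch described above (two-camera buffer, one-camera input) now fails.
try:
    check_static_copy((1, 2, 3, 224, 224), (1, 1, 3, 224, 224))
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```

In the real runner the two arguments would be the shapes of the buffer slice and the incoming tensor just before the copy.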

Also found that the camera labeled "wrist" sent to openpi is actually exterior_image_2_left, and gripper_position is always a padding 0.0 for lerobot/droid_100.

merceod · 2026-06-13T11:38:08Z

+        # A "numpy" upload arrives as "raw_inputs"; Pi0.5 treats it as the image.
+        tensors = kwargs.get("tensors")
+        if tensors is not None and "raw_inputs" in tensors:
+            assert "image_inputs" not in tensors, "got both raw_inputs and image_inputs"


When raw_inputs and image_inputs are both present, the request hangs until timeout. A single client request that uploads both a .png (→image_inputs) and a .npy (→raw_inputs) trips the assertion, and the AssertionError is swallowed by the data worker's broad except Exception, so the request never reaches the conductor and silently hangs until the timeout instead of returning a 4xx. The validation is in the wrong layer (worker thread, not ingress).
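
A minimal sketch of what ingress-side validation could look like, assuming the API server can see the set of upload kinds before dispatching to the data worker. The function name `validate_upload_kinds` and the (status, message) return convention are hypothetical, not the server's actual API.

```python
def validate_upload_kinds(tensor_names):
    """Return an (http_status, message) pair for a request's upload kinds.

    Rejects requests that carry both a raw numpy upload and a decoded image,
    so the client gets an immediate 4xx instead of a hang.
    """
    names = set(tensor_names)
    if "raw_inputs" in names and "image_inputs" in names:
        return 400, "got both raw_inputs and image_inputs; send only one"
    return 200, "ok"


status_both, _ = validate_upload_kinds(["raw_inputs", "image_inputs"])
status_numpy_only, _ = validate_upload_kinds(["raw_inputs"])
print(status_both, status_numpy_only)
```

With a check like this at ingress, the assert in the worker becomes a last-resort invariant rather than the only line of defense.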

merceod · 2026-06-13T11:40:26Z

+                tensors = rid_outputs.get(input_name, None)
+                if tensors is None:


Nitpick: the empty-list change is in shared code and affects every model. It is required for pi05's new action_expert_trigger=[] edge to thread through the speculative path, but it lives in _thread_outputs_to_speculative, which BAGEL/Qwen3-Omni/Orpheus use too. Semantics changed from "empty-list output --> drop & reschedule" to "empty-list output --> valid, thread through". I found no other model that emits [] for a consumed edge, so it's probably benign and arguably more correct, but it's a shared-contract change framed as pi05-only, so it might be worth a comment.
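
The difference between the two semantics comes down to a truthiness check versus an explicit None check. The sketch below assumes the old code used a truthiness test, which is an inference from the behavior described above rather than something shown in the diff; both function names are hypothetical.

```python
def is_ready_old(rid_outputs, input_name):
    """Assumed old behavior: an empty list counts as 'missing'."""
    tensors = rid_outputs.get(input_name, None)
    return bool(tensors)


def is_ready_new(rid_outputs, input_name):
    """New behavior from the diff: only None counts as 'missing'."""
    tensors = rid_outputs.get(input_name, None)
    return tensors is not None


# An edge that legitimately carries an empty list, as pi05's trigger edge does.
outputs = {"action_expert_trigger": []}

old_result = is_ready_old(outputs, "action_expert_trigger")
new_result = is_ready_new(outputs, "action_expert_trigger")
absent_result = is_ready_new(outputs, "some_other_edge")
```

Any model whose consumed edge can emit [] will see the new behavior, which is why a comment at the shared call site seems worthwhile.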

merceod · 2026-06-13T11:43:12Z

There are several stale comments:

configs/pi05_droid.yaml still references Pi05LLMSubmodule.get_cuda_graph_configs ... submodules.py lines 325-329 (the class was split/renamed);
pi05_model.py's _reset_non_persistent_buffers / position_ids "CRITICAL" comment is now inert for SigLIP (the native embeddings compute arange inline, so there is no registered buffer to reset).

NSagan271 added 10 commits June 11, 2026 21:58

benchmark parity for pi05

478b905

use droid yaml by default

94ef0e5

add cache for droid

73d5d29

collapse full flow loop into one forward pass

26eca8f

port over pi05 siglip

c81cd50

single graph walk for pi05 (remove conductor overhead)

692d432

IN PROGRESS remove PNG loading overhead

99118e9

cleanup

03da3ad

port over openpi benchmarking

c36c00d

ruff check fix

8d97e94

NSagan271 force-pushed the pi05-performance-2 branch from 32d9e45 to 8d97e94 Compare June 11, 2026 22:05

NSagan271 added 4 commits June 11, 2026 22:30

update openpi instructions

3880646

refactor mminf -> mstar in a few places

d437bb9

some cleanup

491ff5b

Cleanup stale comments + abstract away timing prints

a471ae8

NSagan271 requested a review from merceod June 11, 2026 23:07

Remove stale Pi05LLMSubmodule references

8954bf0

merceod requested changes Jun 13, 2026

NSagan271 added 2 commits June 14, 2026 04:51

respond to PR comments

0c83059

ruff

41e5b20

Pi0.5 Performance #103
NSagan271 wants to merge 17 commits into main from pi05-performance-2
