Pi0.5 Performance#103

Open
NSagan271 wants to merge 17 commits into main from pi05-performance-2
NSagan271 commented Jun 10, 2026

Collaborator

1. Openpi API server Benchmarking

Added the openpi API server to the benchmark (with our harness copying from the openpi client), so that both sides pay the same server / HTTP overhead.

2. Optimized our Pi0.5

  1. Data worker overhead: API server / data worker / tensor transport overhead was 10+ ms per request. Much of this time went to file reads/writes, PNG decoding, and SHM tensor transport between the data worker and the GPU worker (none of which the openpi API server incurs, since it receives a numpy array directly). This was alleviated by:
  • Adding a new "numpy" datatype to the API server, which is kept in memory rather than saved to a file, and is converted to a torch tensor at the data worker.
  • Having the benchmark resize the PNG to 224 x 224 and convert it to numpy before sending it to our API server (this matches what happens on the openpi end).
  • Keeping the images in uint8 on the data worker end and converting them to float in the ViT encoder's preprocess (this reduces tensor transport overhead by 4x).
  2. Scheduling and conductor overhead: as conceptually nice as it was, having the flow loop be a Loop primitive, with two graph walks and a "back to conductor" in between them, was too much overhead.
  • The total 10-step flow loop takes less than 50ms (only 26ms under a single CUDA graph) and only requires attention planning once, at the beginning. Replanning attention between flow steps, combined with graph overhead, was too much for speculative scheduling to account for.
  • Also, I profiled about 4ms of back-to-conductor overhead, so I made our pi0.5 a single graph walk by making PaliGemma (prefill) and the action expert (flow) separate nodes that share a KV cache. This is actually conceptually "ok": the two nodes have no overlap in terms of model weights, so putting them in the same "LLM" node was just a design choice rather than something "canonical".
  3. Port over ViT: we were wrapping the transformers SigLIP implementation, so I moved the code over and added fused Q/K/V projections. It still uses SDPA attention because we need float32 precision.

3. Benchmark Parity: made sure that pi0.5 uses the droid yaml by default in the server launch file. Also, openpi only uses 2 images, so pi05_droid.yaml now overrides num_cameras: 2 as well.
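
To make the 4x figure for uint8 transport concrete, here is a minimal sketch using numpy as a stand-in for the real tensor transport. The image shape, the `to_float` helper, and its normalization to [-1, 1] are assumptions for illustration, not the PR's actual preprocessing code.

```python
import numpy as np

# Hypothetical batch: 2 cameras, 3 channels, 224 x 224 pixels.
NUM_CAMERAS, CHANNELS, HEIGHT, WIDTH = 2, 3, 224, 224

rng = np.random.default_rng(0)
images_u8 = rng.integers(
    0, 256, size=(NUM_CAMERAS, CHANNELS, HEIGHT, WIDTH), dtype=np.uint8
)


def to_float(images: np.ndarray) -> np.ndarray:
    """Convert uint8 pixels to float32 (the [-1, 1] scaling is assumed)."""
    return images.astype(np.float32) / 127.5 - 1.0


# Option 1: convert on the sender and transport float32.
converted_early = to_float(images_u8)

# Option 2: transport uint8 and convert on the receiver.
received_u8 = images_u8.copy()  # stands in for the SHM hop
converted_late = to_float(received_u8)

bytes_u8 = images_u8.nbytes
bytes_f32 = converted_early.nbytes
ratio = bytes_f32 // bytes_u8
same_result = np.array_equal(converted_early, converted_late)

print(f"uint8: {bytes_u8} bytes, float32: {bytes_f32} bytes, ratio: {ratio}x")
print(f"identical after conversion: {same_result}")
```

Deferring the cast changes only where the conversion happens, so the encoder sees the same values while the transport moves a quarter of the bytes.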

Results (JCT, 100 warmup, 100 requests, UW PTC server)

ours

mean=0.046s  p50=0.046s  p95=0.049s  p99=0.054s
mean=0.046s  p50=0.046s  p95=0.049s  p99=0.050s
mean=0.045s  p50=0.044s  p95=0.048s  p99=0.053s
mean=0.047s  p50=0.047s  p95=0.048s  p99=0.049s
mean=0.046s  p50=0.045s  p95=0.047s  p99=0.047s

openpi

mean=0.047s  p50=0.047s  p95=0.048s  p99=0.049s
mean=0.048s  p50=0.048s  p95=0.048s  p99=0.049s
mean=0.048s  p50=0.048s  p95=0.048s  p99=0.049s
mean=0.047s  p50=0.047s  p95=0.047s  p99=0.048s
mean=0.047s  p50=0.047s  p95=0.047s  p99=0.048s
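
For a quick side-by-side, the five runs per system can be averaged with a short script. This is only a sketch: the parsing regex and the `parse_runs` / `summarize` helpers are illustrative and not part of the benchmark harness, and averaging run-level percentiles is a rough summary rather than a true pooled percentile.

```python
import re
from statistics import mean

OURS = """\
mean=0.046s  p50=0.046s  p95=0.049s  p99=0.054s
mean=0.046s  p50=0.046s  p95=0.049s  p99=0.050s
mean=0.045s  p50=0.044s  p95=0.048s  p99=0.053s
mean=0.047s  p50=0.047s  p95=0.048s  p99=0.049s
mean=0.046s  p50=0.045s  p95=0.047s  p99=0.047s
"""

OPENPI = """\
mean=0.047s  p50=0.047s  p95=0.048s  p99=0.049s
mean=0.048s  p50=0.048s  p95=0.048s  p99=0.049s
mean=0.048s  p50=0.048s  p95=0.048s  p99=0.049s
mean=0.047s  p50=0.047s  p95=0.047s  p99=0.048s
mean=0.047s  p50=0.047s  p95=0.047s  p99=0.048s
"""

# Matches fields such as "p95=0.049s" and captures the name and the value.
FIELD_RE = re.compile(r"(\w+)=([0-9.]+)s")


def parse_runs(text: str) -> list[dict[str, float]]:
    """Turn each non-empty line into a {metric: seconds} dict."""
    return [
        {name: float(value) for name, value in FIELD_RE.findall(line)}
        for line in text.splitlines()
        if line.strip()
    ]


def summarize(runs: list[dict[str, float]]) -> dict[str, float]:
    """Average each metric across runs, rounded to 0.1 ms."""
    return {key: round(mean(run[key] for run in runs), 4) for key in runs[0]}


ours = summarize(parse_runs(OURS))
openpi = summarize(parse_runs(OPENPI))

for key in ours:
    print(f"{key}: ours={ours[key] * 1000:.1f}ms  openpi={openpi[key] * 1000:.1f}ms")
```

Averaged this way, the mean JCT is about 46.0 ms for ours and 47.4 ms for openpi, while the averaged p99 is slightly higher for ours (about 50.6 ms versus 48.6 ms).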

Limitations

  • The "numpy" datatype is not hooked up to the OpenAI API (not entirely sure of the best way to do that).
  • There are added vibe-coded timing prints (gated by an environment variable) in the data worker. These can be removed.

NSagan271 force-pushed the pi05-performance-2 branch from 32d9e45 to 8d97e94 (June 11, 2026 22:05)
NSagan271 requested a review from merceod (June 11, 2026 23:07)
Comment thread benchmark/dataset.py
Comment on lines +853 to +854
# openpi droid policy only uses the first extra image, so send 2 cameras.
_numpy_paths=npy_paths[:2],

Collaborator

The benchmark is not apples-to-apples: M* silently runs on one camera, openpi on two. This may invalidate the comparison the PR is presumably built to demonstrate.

  • The benchmark uploads 2 cameras as 2 separate .npy blobs (benchmark/dataset.py line 854, request.py lines 756-762), and data_worker.py lines 224-231 loads each as a separate tensor --> image_inputs = [cam0, cam1].
  • But the encoder consumes only image_inputs[0]: Pi05ViTEncoderSubmodule.prepare_inputs --> self._prepare_one(inputs["image_inputs"][0]) (submodules.py lines 182-184). Camera 1 is silently dropped. openpi's baseline uses both (request.py _build_obs).
  • The encoder's actual contract is that image_inputs[0] is a stacked (num_cameras,3,H,W) tensor (_prepare_one handles the 4-D case; forward_batched flattens cameras into the token sequence). The benchmark/raw-numpy path violates this by sending a list of separate tensors.
  • Compounding it: the new num_cameras: 2 in configs/pi05_droid.yaml sizes the ViT CUDA-graph static buffer to (1,2,3,H,W), while runtime feeds (1,1,3,H,W). static_buf[:1].copy_(real_val) (cuda_graph_runner.py line 2169) broadcasts the size-1 camera dim, so the model processes two duplicate copies of camera 0. This does not crash; it is just silently wrong.
  • I also found that the camera labeled "wrist" sent to openpi is actually exterior_image_2_left, and gripper_position is always a padding 0.0 for lerobot/droid_100.
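
The broadcast behaviour described above can be reproduced in isolation. The sketch below uses numpy's `np.copyto` as an analog of `static_buf[:1].copy_(real_val)`; both broadcast a size-1 source dimension. The tiny image size and the constant pixel values are assumptions chosen to keep the example readable, and none of this is the actual cuda_graph_runner.py code.

```python
import numpy as np

H = W = 4  # tiny hypothetical image size

cam0 = np.full((3, H, W), 10, dtype=np.uint8)
cam1 = np.full((3, H, W), 20, dtype=np.uint8)

# Static buffer sized for num_cameras=2: (batch, cameras, C, H, W).
static_buf = np.zeros((1, 2, 3, H, W), dtype=np.uint8)

# Failure mode: runtime feeds a single camera, shape (1, 1, 3, H, W).
one_camera = cam0[None, None]
np.copyto(static_buf[:1], one_camera)  # size-1 camera dim is broadcast
duplicated = np.array_equal(static_buf[0, 0], static_buf[0, 1])

# Stacking first matches the (num_cameras, 3, H, W) contract.
stacked = np.stack([cam0, cam1])[None]  # shape (1, 2, 3, H, W)
np.copyto(static_buf[:1], stacked)
distinct = not np.array_equal(static_buf[0, 0], static_buf[0, 1])

print(f"single-camera copy duplicated camera 0: {duplicated}")
print(f"stacked copy keeps cameras distinct: {distinct}")
```

An explicit shape check before the copy (for example, comparing the source shape against `static_buf[:1].shape`) would turn this silent duplication into an immediate error.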

Comment thread mstar/model/pi05/pi05_model.py Outdated
# A "numpy" upload arrives as "raw_inputs"; Pi0.5 treats it as the image.
tensors = kwargs.get("tensors")
if tensors is not None and "raw_inputs" in tensors:
    assert "image_inputs" not in tensors, "got both raw_inputs and image_inputs"

Collaborator

When raw_inputs and image_inputs are both present, the request hangs until timeout. A single client request that uploads both a .png (--> image_inputs) and a .npy (--> raw_inputs) trips the assert, and the AssertionError is swallowed by the data worker's broad except Exception, so the request never reaches the conductor and silently hangs until the timeout instead of returning a 4xx. The validation is in the wrong layer (worker thread, not ingress).
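
One way to move the check to ingress is sketched below. Everything here is hypothetical: `IngressError`, `validate_upload_kinds`, and `handle_request` are illustrative names, not existing M* APIs, and the real server may surface errors differently.

```python
class IngressError(Exception):
    """Hypothetical error type that the API layer maps to an HTTP status."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def validate_upload_kinds(tensor_names) -> set:
    """Reject requests that carry both a raw numpy upload and a decoded image."""
    names = set(tensor_names)
    if {"raw_inputs", "image_inputs"} <= names:
        raise IngressError(
            400, "send either raw_inputs (.npy) or image_inputs (.png), not both"
        )
    return names


def handle_request(tensor_names):
    """Validate before enqueueing, so the client gets an answer immediately."""
    try:
        validate_upload_kinds(tensor_names)
    except IngressError as err:
        return err.status_code, err.detail
    # A real server would hand the request to the data worker here.
    return 200, "accepted"


print(handle_request(["raw_inputs", "image_inputs"]))
print(handle_request(["raw_inputs"]))
```

Raising a dedicated exception type rather than using `assert` also means the check is not affected by a broad `except Exception` in a worker thread, because it runs before the request is handed off.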

Comment thread mstar/worker/worker.py
Comment on lines +1514 to +1515
tensors = rid_outputs.get(input_name, None)
if tensors is None:

Collaborator

Nitpick: The empty-list change is in shared code and affects every model. It is required for pi05's new action_expert_trigger=[] edge to thread through the speculative path, but it lives in _thread_outputs_to_speculative, which is used by BAGEL/Qwen3-Omni/Orpheus too. Semantics changed from "empty-list output --> drop & reschedule" to "empty-list output --> valid, thread through". I found no other model that emits [] for a consumed edge, so it's probably benign and arguably more correct, but it's a shared-contract change framed as pi05-only, so it might be worth a comment.
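
The contract change comes down to the difference between a truthiness check and an explicit `None` check. The sketch below assumes the old code used something equivalent to `if not tensors`; that is a guess at the previous behaviour, and `thread_old` / `thread_new` are illustrative stand-ins rather than the real _thread_outputs_to_speculative.

```python
def thread_old(rid_outputs: dict, input_name: str) -> str:
    """Assumed old contract: a missing OR empty output is dropped."""
    tensors = rid_outputs.get(input_name, None)
    if not tensors:  # True for both None and []
        return "reschedule"
    return "thread"


def thread_new(rid_outputs: dict, input_name: str) -> str:
    """New contract: only a missing output is dropped; [] is a valid value."""
    tensors = rid_outputs.get(input_name, None)
    if tensors is None:
        return "reschedule"
    return "thread"


outputs = {"action_expert_trigger": [], "hidden_states": ["t0"]}

for name in ("action_expert_trigger", "hidden_states", "missing_edge"):
    print(name, thread_old(outputs, name), thread_new(outputs, name))
```

Only the empty-list case differs between the two versions, which is why the change is invisible to models that never emit `[]` on a consumed edge.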

merceod commented Jun 13, 2026

Collaborator

There are several stale comments:

  • configs/pi05_droid.yaml still references Pi05LLMSubmodule.get_cuda_graph_configs ... submodules.py lines 325-329 (the class was split/renamed).
  • pi05_model.py's _reset_non_persistent_buffers / position_ids "CRITICAL" comment is now inert for SigLIP (the native embeddings compute arange inline, so there is no registered buffer to reset).
