Pi0.5 Performance #103
32d9e45 to 8d97e94
| # openpi droid policy only uses the first extra image, so send 2 cameras. | ||
| _numpy_paths=npy_paths[:2], |
The benchmark is not apples-to-apples: M* silently runs on one camera, openpi on two. This may invalidate the comparison your PR is presumably built to demonstrate.
- The benchmark uploads 2 cameras as 2 separate .npy blobs (benchmark/dataset.py line 854, request.py lines 756-762), and data_worker.py lines 224 to 231 loads each as a separate tensor --> image_inputs = [cam0, cam1].
- But the encoder consumes only image_inputs[0]: Pi05ViTEncoderSubmodule.prepare_inputs --> self._prepare_one(inputs["image_inputs"][0]) (submodules.py lines 182-184). Camera 1 is silently dropped. openpi's baseline uses both (request.py _build_obs).
- The encoder's actual contract is that image_inputs[0] is a stacked (num_cameras,3,H,W) tensor (_prepare_one handles the 4-D case; forward_batched flattens cameras into the token sequence). The benchmark/raw-numpy path violates this by sending a list of separate tensors.
- Compounding it: the new num_cameras: 2 in configs/pi05_droid.yaml sizes the vit CUDA-graph static buffer to (1,2,3,H,W), while runtime feeds (1,1,3,H,W). static_buf[:1].copy_(real_val) (cuda_graph_runner.py line 2169) broadcasts the size-1 camera dim, so the model processes two duplicate copies of camera 0. This is not a crash, just silently wrong.
- Also found that the camera labeled "wrist" sent to openpi is actually exterior_image_2_left, and gripper_position is always a padding 0.0 for lerobot/droid_100.
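One way to turn the silent broadcast into a loud failure is an exact-shape check before the copy into the static buffer. This is a minimal sketch, not the runner's actual code: assert_no_broadcast is a hypothetical helper, and H = W = 224 is an assumed size used only for illustration. It works on plain shape tuples, so it could be fed tuple(static_buf[:1].shape) and tuple(real_val.shape).

```python
def assert_no_broadcast(static_shape, real_shape, name="image_inputs"):
    """Raise if copying real_shape into static_shape would need broadcasting.

    Hypothetical helper: accepts any shape-like sequence, e.g. tuple(t.shape).
    """
    static_shape, real_shape = tuple(static_shape), tuple(real_shape)
    if static_shape != real_shape:
        raise ValueError(
            f"{name}: static buffer slice has shape {static_shape} but runtime "
            f"input has shape {real_shape}; refusing to broadcast"
        )


# Assumed shapes for illustration (H = W = 224 is not taken from the PR).
STATIC_SLICE = (1, 2, 3, 224, 224)  # sized from num_cameras: 2
ONE_CAMERA = (1, 1, 3, 224, 224)    # what the raw-numpy path feeds today
TWO_CAMERAS = (1, 2, 3, 224, 224)   # stacked cameras, matching the buffer

assert_no_broadcast(STATIC_SLICE, TWO_CAMERAS)  # shapes match, no error

try:
    assert_no_broadcast(STATIC_SLICE, ONE_CAMERA)
except ValueError as err:
    print(err)  # the mismatch is reported instead of duplicating camera 0
```

With a check like this in place, the one-camera input would fail at the copy site rather than producing two copies of camera 0.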
| # A "numpy" upload arrives as "raw_inputs"; Pi0.5 treats it as the image. | ||
| tensors = kwargs.get("tensors") | ||
| if tensors is not None and "raw_inputs" in tensors: | ||
| assert "image_inputs" not in tensors, "got both raw_inputs and image_inputs" |
If raw_inputs and image_inputs are both present, the request hangs until timeout. A single client request that uploads both a .png (→ image_inputs) and a .npy (→ raw_inputs) trips the assert, and the AssertionError is swallowed by the data-worker's broad except Exception, so the request never reaches the conductor and silently hangs until the timeout instead of returning a 4xx. The validation is in the wrong layer (worker thread, not ingress).
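A rough sketch of what the same check could look like at ingress, so the client gets a 4xx immediately. validate_upload_keys and the (status, message) return convention are hypothetical and not taken from this codebase; only the two key names come from the diff.

```python
from typing import Iterable, Optional, Tuple

# Key names from the diff; treated here as mutually exclusive upload kinds.
EXCLUSIVE_KEYS = ("raw_inputs", "image_inputs")


def validate_upload_keys(keys: Iterable[str]) -> Optional[Tuple[int, str]]:
    """Return (status, message) for a bad request, or None if it is acceptable.

    Hypothetical ingress-side check: it returns an error value instead of
    raising, so a broad `except Exception` further down cannot swallow it.
    """
    key_set = set(keys)
    present = [k for k in EXCLUSIVE_KEYS if k in key_set]
    if len(present) > 1:
        return 400, "request contains both " + " and ".join(present) + "; send only one"
    return None


print(validate_upload_keys(["raw_inputs", "image_inputs"]))
print(validate_upload_keys(["raw_inputs"]))
```

The assert in the worker could then stay as a defensive invariant, since a conflicting request would already have been rejected before reaching it.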
| tensors = rid_outputs.get(input_name, None) | ||
| if tensors is None: |
Nitpick: the empty-list change is in shared code affecting every model. It is required for pi05's new action_expert_trigger=[] edge to thread through the speculative path, but it lives in _thread_outputs_to_speculative, which is used by BAGEL/Qwen3-Omni/Orpheus too. Semantics changed from "empty-list output --> drop & reschedule" to "empty-list output --> valid, thread through". I found no other model that emits [] for a consumed edge, so it's probably benign and arguably more correct, but it's a shared-contract change framed as pi05-only, so it might be worth a comment.
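To make the contract change concrete, here is a small sketch of the difference between the two checks. select_edge_output is a hypothetical stand-in for the relevant lines of _thread_outputs_to_speculative, and the truthiness check shown for comparison is an assumption about what the previous code looked like.

```python
from typing import Any, Dict, List, Optional


def select_edge_output(rid_outputs: Dict[str, Any], input_name: str) -> Optional[List[Any]]:
    """Return the edge output, or None when the edge has produced nothing yet.

    Only a missing or None entry means "not ready"; an empty list is treated
    as a legitimate output and is threaded through.
    """
    tensors = rid_outputs.get(input_name, None)
    if tensors is None:
        return None
    return tensors


outputs = {"action_expert_trigger": []}

# New semantics: [] is valid, a missing edge is not ready.
print(select_edge_output(outputs, "action_expert_trigger") is None)
print(select_edge_output(outputs, "missing_edge") is None)

# A truthiness check (assumed old behaviour) cannot tell the two cases apart.
print(not outputs["action_expert_trigger"], not outputs.get("missing_edge"))
```

A one-line comment at the `is None` check stating that [] is intentionally valid would document this for the other models sharing the path.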
There are several stale comments:
1. Openpi API server Benchmarking
Added the openpi API server to the benchmark (with our harness copying from the openpi client), so that there is parity in terms of server / http overhead.
2. Optimized our Pi0.5
- preprocess (this reduces tensor transport overhead by 4x)
- Loop primitive and having two graph walks with a "back to conductor" in between them was too much overhead.
- Port over ViT: we were wrapping the transformers siglip implementation, so I moved the code over and added fused Q/K/V projections. Still using SDPA attention because we need float32 precision.
- Benchmark Parity: made sure that pi0.5 uses the droid yaml by default in the server launch file. Also openpi only uses 2 images, so had the pi05_droid.yaml override num_cameras: 2 as well.
Results (JCT, 100 warmup, 100 requests, UW PTC server)
ours
openpi
Limitations