spot_image_server: opt-in compressed RGB transport + FPS measurement + format guard (Phases 0-2)#9
Conversation
…angelog, lab explainer) Investigation + plan for low camera FPS during rosbag collection. Root cause: per-source in-flight gate + 100ms timer quantization cliff, amplified by 140 unbatched RPCs/s and RAW bandwidth over WiFi. Plan v2 incorporates Codex RPI Step 3 review (request-map split prerequisite, compression-before-batching reorder, bandwidth-aware criteria, RAW/compressed coexistence). No code changes in this commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds env-gated (SPOT_IMAGE_SERVER_FPS_DEBUG, default OFF) instrumentation to spot_image_server: per-source SDK round-trip latency, achieved Hz, and bytes/s with a windowed summary log. Lets us classify the bottleneck as latency-bound vs WiFi-bandwidth-bound before later phases. Disabled path is a single boolean check; no request/threading/publish behavior changes. Tests in spot_driver/tests/ exercise the real accounting + window logic via a fake self/response (skip where rclpy/bosdyn absent). Ref: docs/plans/PLAN-spot_ros-camera-fps.md (Phase 0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ior change) Prerequisite for Phase 2 (compressed RGB). Splits the shared self.image_requests into publish_requests (periodic timers) vs service_requests (GetImages service, static-TF, source listing) so a later JPEG retarget of the publish map cannot leak into the RAW-only service/TF paths. Both maps are FORMAT_RAW in this phase -> request semantics unchanged. Adds UnsupportedImageFormatError; getImageMsg now rejects JPEG-on-raw- path (the historical malformed-rgb8 bug), unhandled FORMAT_RAW pixel_format, and any other format, instead of emitting a corrupt or half-filled Image. Also fixes the latent .format-vs-pixel_format enum comparison. Both call sites skip-with-log instead of publishing garbage. Tests: dependency-free AST split checks (pass) + format-guard behavior tests (run in ROS env). py_compile clean. Ref: docs/plans/PLAN-spot_ros-camera-fps.md (Phase 1), docs/plans/IMPL-spot_ros-camera-fps.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… review) Codex Step-6 review of Phase 0+1 found one Medium issue: the send timestamp was written AFTER add_done_callback, so an already-resolved future could run _fps_record before the timestamp was stored, dropping or biasing latency samples. Now recorded before the future is created. No publish-path behavior change. Accepted Low (stale _fps_send_times on failed futures; bounded) documented in IMPL doc. Adds docs/plans/VALIDATION-spot_ros-camera-fps.md (verdict: safe to deploy/test on robot for normal RAW operation). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single hand_color_image RGB stream measured at ~7.5 Hz (target 10), latency avg ~75 ms / max ~120-131 ms, ~6.9 MB/s. Confirms the 100 ms quantization cliff on hardware (even one camera can't hold 10 Hz). 14-stream rosbag projects ~50-100 MB/s -> also bandwidth-bound. Compression (Phase 2) fixes both; batching does nothing for the measured single-stream cliff. Closes the deferred decision. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On-robot baseline showed a hard ~6 MB/s aggregate WiFi ceiling (single stream 7.5 Hz; all-camera collapses to ~3 Hz each, ~290 ms latency) -> bandwidth-bound; compression is the only viable fix. SPOT_IMAGE_SERVER_RGB_JPEG (default OFF = RAW, fully backward compatible) + SPOT_IMAGE_SERVER_JPEG_QUALITY (default 75). When on, RGB sources (!= hand_tof) request FORMAT_JPEG and publish sensor_msgs/CompressedImage on <ns>/image/compressed via new getCompressedImageMsg. Depth/GetImages/static-TF stay RAW (Phase 1 split makes this isolated). getImageMsg refactored onto shared _buildTfMsg/_buildCameraInfo (behavior preserved). Raw-when-JPEG-on is the standard image_transport republish (no extra robot->driver WiFi). Tests: 5 dependency-free structural + 2 behavior. Suite 10 passed, 4 skipped. py_compile clean. No behavior change with flag unset. Ref: docs/plans/PLAN/IMPL/measurements. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
HIGH: update_image_task dereferenced CameraPub.image_pub which is None in compressed mode -> every JPEG RGB tick AttributeError, Phase 2 non-functional. Added CameraPub.active_pub property; update_image_task and process_data both use it. Regression test added. MED: CompressedImage.format -> 'rgb8; jpeg compressed bgr8' (conventional compressed_image_transport string) so republish preserves the rgb8 contract instead of falling back to bgr8. MED: startup warning now explicitly states raw <ns>/image is NOT published for RGB and names AprilTag / camera_pointclouds / spot_utils::CameraClient + the republish recovery command. LOW: SPOT_IMAGE_SERVER_JPEG_QUALITY clamped to [1,100] with warning. DRY: getImageMsg now actually calls _buildCameraInfo (removed inline duplicate; IMPL claim now true). Measurements doc splits RGB-compressed vs RGB+depth success cases. Suite 13 passed, 4 skipped. py_compile clean. Review: docs/plans/VALIDATION-spot_ros-camera-fps.md (Phase 2). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…EG / JPEG_QUALITY) Adds a new subsection under 'Set Environment Variables' covering the three opt-in env vars introduced on this branch (all default OFF, so no existing user behavior changes): - SPOT_IMAGE_SERVER_FPS_DEBUG (+ _WINDOW): per-source Hz/latency/MB/s logging for bandwidth-bound episode detection. - SPOT_IMAGE_SERVER_RGB_JPEG: request FORMAT_JPEG and publish CompressedImage on <ns>/image/compressed instead of raw Image; the ~10-20x bandwidth cut needed to keep many cameras within the measured ~6 MB/s WiFi ceiling. Explicitly states the raw RGB topic is NOT published in this mode and gives the standard image_transport republish recipe for consumers that need raw. - SPOT_IMAGE_SERVER_JPEG_QUALITY: integer [1,100], default 75, clamped. Pointers to docs/plans/PLAN, docs/measurements/baseline, and the docs/camera-fps-explainer for the why. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Could you please test and verify that the following functionalities remain unaffected by this PR? They have required a significant amount of effort on the image server side to get working reliably.
ros2 launch spot_bringup camera_pointclouds.launch.py
ros2 launch spot_apriltag spot_apriltag.launch.py
|
alexnavtt
left a comment
There was a problem hiding this comment.
General comments:
- This PR introduces many functions with leading underscores, which is a style currently not used in this code base. These should be changed to align with existing styles.
- Lots of Codex comments left in the code base, referencing Phase1, Phase2, etc. Those should not be included in any PR's, and multi-line comments should be reduced to the bare minimum
- Not sure what to make of all the myriad of docs/... markdown documents added. If those are what Codex generated for the code changes, then I recommend that you follow through them and make sure the changes are appropriate and then delete them
- Lots of tests added but not instructions on how to run them
Overall: I don't really like the Codex gigantic code review. It's super verbose and puts a lot on the plate of the reviewer. The code changes themselves seem to be fine, other than the formatting issues and parameter issues listed. I'll want to give this one another review once all the excess baggage has been trimmed off.
| self._fps_debug = os.environ.get('SPOT_IMAGE_SERVER_FPS_DEBUG', '') \ | ||
| not in ('', '0', 'false', 'False', 'no', 'off') | ||
| if self._fps_debug: | ||
| self._fps_lock = threading.Lock() | ||
| self._fps_send_times: dict[str, float] = {} | ||
| self._fps_stats: dict[str, dict] = defaultdict(self._fps_new_bucket) | ||
| self._fps_window_start = time.monotonic() | ||
| self._fps_window_sec = float( | ||
| os.environ.get('SPOT_IMAGE_SERVER_FPS_DEBUG_WINDOW', '5.0')) | ||
| self.get_logger().warn( | ||
| '[fps-debug] Phase-0 instrumentation ENABLED via ' | ||
| 'SPOT_IMAGE_SERVER_FPS_DEBUG. Logging only; no behavior change. ' | ||
| f'Summary every {self._fps_window_sec:.1f}s.') | ||
| # ---------------------------------------------------------------------- | ||
|
|
||
| # --- Phase 2: opt-in compressed RGB transport (default OFF = RAW) ------ | ||
| # On the robot the bottleneck is the robot->driver gRPC-over-WiFi hop | ||
| # (measured ~6 MB/s aggregate ceiling, RAW). Requesting FORMAT_JPEG | ||
| # makes the robot encode onboard and send ~10-20x fewer bytes. | ||
| # RGB sources (except hand_tof, which is ToF not RGB) then publish a | ||
| # proper sensor_msgs/CompressedImage on <ns>/image/compressed. Depth, | ||
| # the GetImages service, and static-TF stay RAW (Phase 1 split). | ||
| # Consumers needing raw use `image_transport republish compressed raw`. | ||
| self._rgb_jpeg = os.environ.get('SPOT_IMAGE_SERVER_RGB_JPEG', '') \ | ||
| not in ('', '0', 'false', 'False', 'no', 'off') | ||
| try: | ||
| self._jpeg_quality = int(os.environ.get('SPOT_IMAGE_SERVER_JPEG_QUALITY', '75')) | ||
| except ValueError: | ||
| self._jpeg_quality = 75 | ||
| # Clamp to the Spot SDK accepted range [1, 100] (Codex Phase 2: low). | ||
| if not 1 <= self._jpeg_quality <= 100: | ||
| self.get_logger().warn( | ||
| f'[phase2] SPOT_IMAGE_SERVER_JPEG_QUALITY={self._jpeg_quality} ' | ||
| 'out of range; clamping to [1, 100].') | ||
| self._jpeg_quality = min(100, max(1, self._jpeg_quality)) | ||
| if self._rgb_jpeg: |
There was a problem hiding this comment.
Lots of variables being retrieved from env here. These should be made into ROS parameters for the node instead
| if compressed: | ||
| # Phase 2: source requested FORMAT_JPEG -> publish the CORRECT | ||
| # message type. Consumers needing raw use the standard | ||
| # `image_transport republish compressed raw` node (no extra | ||
| # robot->driver WiFi cost). No raw Image publisher here. | ||
| self.image_pub = None | ||
| self.compressed_pub = parent.create_publisher(CompressedImage, '~/' + namespace + '/image/compressed', qos_profile=qos_profile) | ||
| else: | ||
| self.image_pub = parent.create_publisher(Image, '~/' + namespace + '/image', qos_profile=qos_profile) | ||
| self.compressed_pub = None |
There was a problem hiding this comment.
Not seeing a great reason to split this up into image_pub and compressed_pub. Just adds more bookkeeping down the line by having to access through active_pub. Just change the type of image_pub based on the setting and then publish directly through image_pub
| @@ -276,17 +352,7 @@ def getImageMsg(data: ImageResponseProto, lease_manager: SpotLeaseManager) -> Tu | |||
| * CameraInfo: message to define the state and config of the camera that took the image | |||
| * TFMessage: with the transforms necessary to locate the image frames | |||
| """ | |||
| transforms = [] | |||
| for child_frame, transform in data.shot.transforms_snapshot.child_to_parent_edge_map.items(): | |||
| if not transform.parent_frame_name: | |||
| continue | |||
| transforms.append(populateTransformStamped( | |||
| time=TimestampToMsg(lease_manager.robotToLocalTime(data.shot.acquisition_time)), | |||
| parent_frame=transform.parent_frame_name, | |||
| child_frame=child_frame, | |||
| transform=SE3Pose.from_proto(transform.parent_tform_child) | |||
| )) | |||
| tf_msg = TFMessage(transforms=transforms) | |||
| tf_msg = _buildTfMsg(data, lease_manager) | |||
|
|
|||
| image_msg = Image() | |||
| local_time = lease_manager.robotToLocalTime(data.shot.acquisition_time) | |||
| @@ -295,13 +361,17 @@ def getImageMsg(data: ImageResponseProto, lease_manager: SpotLeaseManager) -> Tu | |||
| image_msg.height = data.shot.image.rows | |||
| image_msg.width = data.shot.image.cols | |||
|
|
|||
| # Color/greyscale formats. | |||
| # JPEG format | |||
| source_name = data.source.name | |||
|
|
|||
| # JPEG cannot be a sensor_msgs/Image: an 'rgb8' Image must hold | |||
| # height*step raw bytes, not a compressed bitstream. Publishing JPEG | |||
| # correctly (as sensor_msgs/CompressedImage) is Phase 2; here we reject | |||
| # it so we never emit a corrupt message. | |||
| if data.shot.image.format == image_pb2.Image.FORMAT_JPEG: | |||
| image_msg.encoding = "rgb8" | |||
| image_msg.is_bigendian = True | |||
| image_msg.step = 3 * data.shot.image.cols | |||
| image_msg.data = data.shot.image.data | |||
| raise UnsupportedImageFormatError( | |||
| f"Source '{source_name}' returned FORMAT_JPEG on the raw-Image " | |||
| f"path. JPEG must be published as sensor_msgs/CompressedImage " | |||
| f"(Phase 2); refusing to emit a malformed rgb8 Image.") | |||
There was a problem hiding this comment.
Similarly no need for two separate functions here. Just one getImageMsg function that generates an image based on whether or not the format of data is JPEG or not would be sufficient and would simply code in CameraPub
There was a problem hiding this comment.
I fully agree with Alex's review. Generally, I’m not a fan of a black-box approach to fixing issues. We have been very intentional and meticulous in the design and implementation of this driver, so any proposed changes should come from a strong understanding of why each of those changes are necessary, regardless of agentic AI usage.
This PR also introduces a large number of superfluous .md and test files that we do not want to carry into the repository. Please be mindful of the breadth of changes for a reviewer to look at. Additionally, I would like to ensure that changes of PRs like this have been thoroughly tested against existing related driver functionality, particularly the ones I referenced in my previous comment.
|
Thanks a lot for your feedback @CJans121 @alexnavtt . I understand your concerns regarding the black-box approach to problem solving. I am fully on-board with it. Part of why I believe this fix is in the right direction is because I have been building similar pub-sub systems in my startup for the past 1.5 years and the system recommendations were trivially simple. for example - compressing images and transporting, reducing fan-in by trading off how much to batch. these are the same concepts I had seen in say high-throughput vision pipelines where you have multiple input frames and it needs to pass through async DAG runtime. So, its not black-box, its more grey-box approach. and hence, to make it white-box I would be attaching simple documents with code blocks referenced from this repo that explain what optimizations were performed on the system. Thanks a lot for sharing code reviews. yall are hard-core, makes me super happy and grateful. thanks! As for the .md files - those were the artifacts of adversarial code reviews and plan execution documents. you are right they are not supposed to part of the PR. I will remove them. |
Summary
Adds opt-in (default-OFF) JPEG /
CompressedImagetransport for thespot_image_serverRGB sources, plus an opt-in measurement harness and a hardening of the image-format dispatch. Defaults change nothing — with no new env vars set, the byte-on-the-wire and published-message behavior is identical todevel.Why
On-robot measurement of this branch on the actual Spot (see
docs/measurements/2026-05-19-phase0-baseline.md):hand_color_image: ~7.5 Hz / 10 Hz target, avg latency ~75 ms, max latency 112–131 ms (over the 100 ms timer-tick boundary — the per-sourcedone()gate then quantizes to 5 Hz on those frames; blended ≈ 7.5 Hz).RAW for many simultaneous streams is bandwidth-impossible at the configured 10 Hz; compressed RGB brings per-frame payload from ~MBs to tens of KBs, restoring rate and headroom. The whole root-cause + plan write-up is in
docs/plans/PLAN-spot_ros-camera-fps.md, with a one-pager teaching version indocs/camera-fps-explainer.md.What changes (and what does NOT)
SPOT_IMAGE_SERVER_FPS_DEBUGSPOT_IMAGE_SERVER_FPS_DEBUG_WINDOW5.0SPOT_IMAGE_SERVER_RGB_JPEGhand_tof) requestFORMAT_JPEG; publishsensor_msgs/CompressedImageon~/<source>/image/compressedinstead of rawImage. Depth,~/get_imagesservice, static-TF all stay RAW.SPOT_IMAGE_SERVER_JPEG_QUALITY75[1, 100].README updated with these — see
README.md§ Set Environment Variables → 5. Image Server.When
RGB_JPEGis on, the raw~/<source>/imagetopic for RGB sources is NOT published (depth/raw is unchanged). Consumers that still want raw RGB should runros2 run image_transport republish compressed raw …— the README has the exact recipe. The driver deliberately does not re-request RAW (that would defeat the bandwidth fix) or embed a decoder (couples to CPU/queueing).Structural changes (kept disciplined for review)
self.image_requestssplit. The shared request map is split intopublish_requests(periodic timers) vsservice_requests(GetImages + static-TF + listing). Phase-1 alone keeps both RAW (byte-identical request semantics) — the split is a prerequisite so the JPEG retarget in Phase 2 cannot leak into the RAW-only service / TF paths.getImageMsgformat guard. NewUnsupportedImageFormatError. JPEG-on-raw-path (the historical malformed-rgb8bug atros_helpers.py:300-304) now raises; unhandledFORMAT_RAWpixel_formatraises (was a silent half-filledImage); the latent.format-vs-pixel_format-enum comparison is replaced with a cleanelse. Both call sites skip-with-throttled-warn instead of publishing garbage.getCompressedImageMsgbuilds a properly-typedsensor_msgs/CompressedImagewith the conventionalformat = "rgb8; jpeg compressed bgr8"(soimage_transport republish compressed rawpreserves the rgb8 contract)._buildTfMsg/_buildCameraInfoextracted and reused (no duplicate code between raw and compressed paths).compressed_pubuses the sameQoSProfile(RELIABLE, KEEP_LAST, depth=1)you introduced in devel6e0b838— consistent with the rest of the image publishers and the image→pointcloud-convertor compatibility intent.Tests
Added under
spot_driver/tests/:test_fps_debug_accounting.py— windowed accounting + flush/reset (behavior test, runs withrclpy/bosdynpresent).test_phase1_request_map_split.py— 5 dependency-free AST/source checks proving the split is wired correctly everywhere.test_phase1_format_guard.py— JPEG / unhandled-pixfmt / unknown-format rejection + RAW RGB_U8 / DEPTH_U16 regression.test_phase2_jpeg_transport.py— structural checks for default-off safety,hand_tofexclusion, service/TF maps untouched,active_pubproperty usage (regression for a deploy-blocker Codex flagged), conventionalformatstring, quality clamping; plus behavior checks forgetCompressedImageMsg.Suite:
13 passed, 4 skipped(skips are deps-gated; they run in the ROS env).Process trail (for context, optional reading)
docs/plans/contains the full RPI artifacts — research, plan, two rounds of independent Codex architectural/correctness review, and per-phase implementation log. Codex caught one real deploy-blocker in my Phase 2 draft (animage_pub=Nonedereference inupdate_image_taskwhen compressed mode was on) which is fixed and regression-tested.Status & ask
SPOT_IMAGE_SERVER_RGB_JPEG=1+ RELIABLE-QoS re-baseline pending — that's why this PR is opened as draft. Once those numbers land I'll attach the post-Phase-2[fps-debug]output and undraft.generate_parameter_libraryif you'd prefer), the QoS choice for the compressed publisher (currently mirrors yours), and whether the docs depth (PLAN/IMPL/VALIDATIONetc.) is welcome or you'd rather have it slimmed.