spot_image_server: opt-in compressed RGB transport + FPS measurement + format guard (Phases 0-2) by shubham0704 · Pull Request #9 · UTNuclearRoboticsPublic/spot_ros

shubham0704 · 2026-05-19T23:51:08Z

DRAFT — pending on-robot validation of the JPEG (Phase 2) path and re-baseline under the merged RELIABLE-QoS image publishers. Posted now for early architectural review.

Summary

Adds opt-in (default-OFF) JPEG / CompressedImage transport for the spot_image_server RGB sources, plus an opt-in measurement harness and a hardening of the image-format dispatch. Defaults change nothing — with no new env vars set, the byte-on-the-wire and published-message behavior is identical to devel.

Why

On-robot measurement of this branch on the actual Spot (see docs/measurements/2026-05-19-phase0-baseline.md):

Single RGB stream hand_color_image: ~7.5 Hz / 10 Hz target, avg latency ~75 ms, max latency 112–131 ms (over the 100 ms timer-tick boundary — the per-source done() gate then quantizes to 5 Hz on those frames; blended ≈ 7.5 Hz).
All-camera run: a hard ~5.5–6.0 MB/s aggregate WiFi ceiling that does not scale with stream count — per-stream collapses to ~2.5–3.2 Hz with ~290 ms latency once multiple cameras are subscribed (rosbag triggers this).

RAW for many simultaneous streams is bandwidth-impossible at the configured 10 Hz; compressed RGB brings per-frame payload from ~MBs to tens of KBs, restoring rate and headroom. The whole root-cause + plan write-up is in docs/plans/PLAN-spot_ros-camera-fps.md, with a one-pager teaching version in docs/camera-fps-explainer.md.

What changes (and what does NOT)

Env var	Default	Effect
`SPOT_IMAGE_SERVER_FPS_DEBUG`	OFF	Per-source windowed log (Hz, avg/max SDK round-trip latency, MB/s). Pure logging — no behavior change.
`SPOT_IMAGE_SERVER_FPS_DEBUG_WINDOW`	`5.0`	Window seconds (only if FPS_DEBUG on).
`SPOT_IMAGE_SERVER_RGB_JPEG`	OFF	RGB sources (except `hand_tof`) request `FORMAT_JPEG`; publish `sensor_msgs/CompressedImage` on `~/<source>/image/compressed` instead of raw `Image`. Depth, `~/get_images` service, static-TF all stay RAW.
`SPOT_IMAGE_SERVER_JPEG_QUALITY`	`75`	Clamped to `[1, 100]`.

README updated with these — see README.md § Set Environment Variables → 5. Image Server.

When RGB_JPEG is on, the raw ~/<source>/image topic for RGB sources is NOT published (depth/raw is unchanged). Consumers that still want raw RGB should run ros2 run image_transport republish compressed raw … — the README has the exact recipe. The driver deliberately does not re-request RAW (that would defeat the bandwidth fix) or embed a decoder (couples to CPU/queueing).

Structural changes (kept disciplined for review)

Phase 1 — self.image_requests split. The shared request map is split into publish_requests (periodic timers) vs service_requests (GetImages + static-TF + listing). Phase-1 alone keeps both RAW (byte-identical request semantics) — the split is a prerequisite so the JPEG retarget in Phase 2 cannot leak into the RAW-only service / TF paths.
Phase 1 — getImageMsg format guard. New UnsupportedImageFormatError. JPEG-on-raw-path (the historical malformed-rgb8 bug at ros_helpers.py:300-304) now raises; unhandled FORMAT_RAW pixel_format raises (was a silent half-filled Image); the latent .format-vs-pixel_format-enum comparison is replaced with a clean else. Both call sites skip-with-throttled-warn instead of publishing garbage.
Phase 2 — opt-in JPEG. New getCompressedImageMsg builds a properly-typed sensor_msgs/CompressedImage with the conventional format = "rgb8; jpeg compressed bgr8" (so image_transport republish compressed raw preserves the rgb8 contract). _buildTfMsg / _buildCameraInfo extracted and reused (no duplicate code between raw and compressed paths).
QoS: the new compressed_pub uses the same QoSProfile(RELIABLE, KEEP_LAST, depth=1) you introduced in devel 6e0b838 — consistent with the rest of the image publishers and the image→pointcloud-convertor compatibility intent.

Tests

Added under spot_driver/tests/:

test_fps_debug_accounting.py — windowed accounting + flush/reset (behavior test, runs with rclpy/bosdyn present).
test_phase1_request_map_split.py — 5 dependency-free AST/source checks proving the split is wired correctly everywhere.
test_phase1_format_guard.py — JPEG / unhandled-pixfmt / unknown-format rejection + RAW RGB_U8 / DEPTH_U16 regression.
test_phase2_jpeg_transport.py — structural checks for default-off safety, hand_tof exclusion, service/TF maps untouched, active_pub property usage (regression for a deploy-blocker Codex flagged), conventional format string, quality clamping; plus behavior checks for getCompressedImageMsg.

Suite: 13 passed, 4 skipped (skips are deps-gated; they run in the ROS env).

Process trail (for context, optional reading)

docs/plans/ contains the full RPI artifacts — research, plan, two rounds of independent Codex architectural/correctness review, and per-phase implementation log. Codex caught one real deploy-blocker in my Phase 2 draft (an image_pub=None dereference in update_image_task when compressed mode was on) which is fixed and regression-tested.

Status & ask

✅ Builds, type-checks, all CI-runnable tests pass.
✅ No behavior change with any env var unset (byte-identical to devel).
⏳ On-robot validation of SPOT_IMAGE_SERVER_RGB_JPEG=1 + RELIABLE-QoS re-baseline pending — that's why this PR is opened as draft. Once those numbers land I'll attach the post-Phase-2 [fps-debug] output and undraft.
Looking for architectural feedback on the env-var-vs-ROS-param choice (kept env vars for non-invasive shipping; can promote to generate_parameter_library if you'd prefer), the QoS choice for the compressed publisher (currently mirrors yours), and whether the docs depth (PLAN/IMPL/VALIDATION etc.) is welcome or you'd rather have it slimmed.

…angelog, lab explainer) Investigation + plan for low camera FPS during rosbag collection. Root cause: per-source in-flight gate + 100ms timer quantization cliff, amplified by 140 unbatched RPCs/s and RAW bandwidth over WiFi. Plan v2 incorporates Codex RPI Step 3 review (request-map split prerequisite, compression-before-batching reorder, bandwidth-aware criteria, RAW/compressed coexistence). No code changes in this commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds env-gated (SPOT_IMAGE_SERVER_FPS_DEBUG, default OFF) instrumentation to spot_image_server: per-source SDK round-trip latency, achieved Hz, and bytes/s with a windowed summary log. Lets us classify the bottleneck as latency-bound vs WiFi-bandwidth-bound before later phases. Disabled path is a single boolean check; no request/threading/publish behavior changes. Tests in spot_driver/tests/ exercise the real accounting + window logic via a fake self/response (skip where rclpy/bosdyn absent). Ref: docs/plans/PLAN-spot_ros-camera-fps.md (Phase 0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ior change) Prerequisite for Phase 2 (compressed RGB). Splits the shared self.image_requests into publish_requests (periodic timers) vs service_requests (GetImages service, static-TF, source listing) so a later JPEG retarget of the publish map cannot leak into the RAW-only service/TF paths. Both maps are FORMAT_RAW in this phase -> request semantics unchanged. Adds UnsupportedImageFormatError; getImageMsg now rejects JPEG-on-raw- path (the historical malformed-rgb8 bug), unhandled FORMAT_RAW pixel_format, and any other format, instead of emitting a corrupt or half-filled Image. Also fixes the latent .format-vs-pixel_format enum comparison. Both call sites skip-with-log instead of publishing garbage. Tests: dependency-free AST split checks (pass) + format-guard behavior tests (run in ROS env). py_compile clean. Ref: docs/plans/PLAN-spot_ros-camera-fps.md (Phase 1), docs/plans/IMPL-spot_ros-camera-fps.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… review) Codex Step-6 review of Phase 0+1 found one Medium issue: the send timestamp was written AFTER add_done_callback, so an already-resolved future could run _fps_record before the timestamp was stored, dropping or biasing latency samples. Now recorded before the future is created. No publish-path behavior change. Accepted Low (stale _fps_send_times on failed futures; bounded) documented in IMPL doc. Adds docs/plans/VALIDATION-spot_ros-camera-fps.md (verdict: safe to deploy/test on robot for normal RAW operation). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Single hand_color_image RGB stream measured at ~7.5 Hz (target 10), latency avg ~75 ms / max ~120-131 ms, ~6.9 MB/s. Confirms the 100 ms quantization cliff on hardware (even one camera can't hold 10 Hz). 14-stream rosbag projects ~50-100 MB/s -> also bandwidth-bound. Compression (Phase 2) fixes both; batching does nothing for the measured single-stream cliff. Closes the deferred decision. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

On-robot baseline showed a hard ~6 MB/s aggregate WiFi ceiling (single stream 7.5 Hz; all-camera collapses to ~3 Hz each, ~290 ms latency) -> bandwidth-bound; compression is the only viable fix. SPOT_IMAGE_SERVER_RGB_JPEG (default OFF = RAW, fully backward compatible) + SPOT_IMAGE_SERVER_JPEG_QUALITY (default 75). When on, RGB sources (!= hand_tof) request FORMAT_JPEG and publish sensor_msgs/CompressedImage on <ns>/image/compressed via new getCompressedImageMsg. Depth/GetImages/static-TF stay RAW (Phase 1 split makes this isolated). getImageMsg refactored onto shared _buildTfMsg/_buildCameraInfo (behavior preserved). Raw-when-JPEG-on is the standard image_transport republish (no extra robot->driver WiFi). Tests: 5 dependency-free structural + 2 behavior. Suite 10 passed, 4 skipped. py_compile clean. No behavior change with flag unset. Ref: docs/plans/PLAN/IMPL/measurements. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

HIGH: update_image_task dereferenced CameraPub.image_pub which is None in compressed mode -> every JPEG RGB tick AttributeError, Phase 2 non-functional. Added CameraPub.active_pub property; update_image_task and process_data both use it. Regression test added. MED: CompressedImage.format -> 'rgb8; jpeg compressed bgr8' (conventional compressed_image_transport string) so republish preserves the rgb8 contract instead of falling back to bgr8. MED: startup warning now explicitly states raw <ns>/image is NOT published for RGB and names AprilTag / camera_pointclouds / spot_utils::CameraClient + the republish recovery command. LOW: SPOT_IMAGE_SERVER_JPEG_QUALITY clamped to [1,100] with warning. DRY: getImageMsg now actually calls _buildCameraInfo (removed inline duplicate; IMPL claim now true). Measurements doc splits RGB-compressed vs RGB+depth success cases. Suite 13 passed, 4 skipped. py_compile clean. Review: docs/plans/VALIDATION-spot_ros-camera-fps.md (Phase 2). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…EG / JPEG_QUALITY) Adds a new subsection under 'Set Environment Variables' covering the three opt-in env vars introduced on this branch (all default OFF, so no existing user behavior changes): - SPOT_IMAGE_SERVER_FPS_DEBUG (+ _WINDOW): per-source Hz/latency/MB/s logging for bandwidth-bound episode detection. - SPOT_IMAGE_SERVER_RGB_JPEG: request FORMAT_JPEG and publish CompressedImage on <ns>/image/compressed instead of raw Image; the ~10-20x bandwidth cut needed to keep many cameras within the measured ~6 MB/s WiFi ceiling. Explicitly states the raw RGB topic is NOT published in this mode and gives the standard image_transport republish recipe for consumers that need raw. - SPOT_IMAGE_SERVER_JPEG_QUALITY: integer [1,100], default 75, clamped. Pointers to docs/plans/PLAN, docs/measurements/baseline, and the docs/camera-fps-explainer for the why. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CJans121 · 2026-05-20T02:04:00Z

Could you please test and verify that the following functionalities remain unaffected by this PR? They have required a significant amount of effort on the image server side to get working reliably.

Point cloud generation and visualization in RViz from camera images. Use --show-args to view the available camera arguments.

ros2 launch spot_bringup camera_pointclouds.launch.py

Ensure we can successfully visualize the generated point clouds in RViz.

AprilTag detection:

ros2 launch spot_apriltag spot_apriltag.launch.py

Ensure that tag detection functions correctly across all cameras. The camera list is defined in the launch file.

alexnavtt

General comments:

This PR introduces many functions with leading underscores, which is a style currently not used in this code base. These should be changed to align with existing styles.
Lots of Codex comments left in the code base, referencing Phase1, Phase2, etc. Those should not be included in any PR's, and multi-line comments should be reduced to the bare minimum
Not sure what to make of all the myriad of docs/... markdown documents added. If those are what Codex generated for the code changes, then I recommend that you follow through them and make sure the changes are appropriate and then delete them
Lots of tests added but not instructions on how to run them

Overall: I don't really like the Codex gigantic code review. It's super verbose and puts a lot on the plate of the reviewer. The code changes themselves seem to be fine, other than the formatting issues and parameter issues listed. I'll want to give this one another review once all the excess baggage has been trimmed off.

alexnavtt · 2026-05-20T19:09:19Z

+        self._fps_debug = os.environ.get('SPOT_IMAGE_SERVER_FPS_DEBUG', '') \
+            not in ('', '0', 'false', 'False', 'no', 'off')
+        if self._fps_debug:
+            self._fps_lock = threading.Lock()
+            self._fps_send_times: dict[str, float] = {}
+            self._fps_stats: dict[str, dict] = defaultdict(self._fps_new_bucket)
+            self._fps_window_start = time.monotonic()
+            self._fps_window_sec = float(
+                os.environ.get('SPOT_IMAGE_SERVER_FPS_DEBUG_WINDOW', '5.0'))
+            self.get_logger().warn(
+                '[fps-debug] Phase-0 instrumentation ENABLED via '
+                'SPOT_IMAGE_SERVER_FPS_DEBUG. Logging only; no behavior change. '
+                f'Summary every {self._fps_window_sec:.1f}s.')
+        # ----------------------------------------------------------------------
+
+        # --- Phase 2: opt-in compressed RGB transport (default OFF = RAW) ------
+        # On the robot the bottleneck is the robot->driver gRPC-over-WiFi hop
+        # (measured ~6 MB/s aggregate ceiling, RAW). Requesting FORMAT_JPEG
+        # makes the robot encode onboard and send ~10-20x fewer bytes.
+        # RGB sources (except hand_tof, which is ToF not RGB) then publish a
+        # proper sensor_msgs/CompressedImage on <ns>/image/compressed. Depth,
+        # the GetImages service, and static-TF stay RAW (Phase 1 split).
+        # Consumers needing raw use `image_transport republish compressed raw`.
+        self._rgb_jpeg = os.environ.get('SPOT_IMAGE_SERVER_RGB_JPEG', '') \
+            not in ('', '0', 'false', 'False', 'no', 'off')
+        try:
+            self._jpeg_quality = int(os.environ.get('SPOT_IMAGE_SERVER_JPEG_QUALITY', '75'))
+        except ValueError:
+            self._jpeg_quality = 75
+        # Clamp to the Spot SDK accepted range [1, 100] (Codex Phase 2: low).
+        if not 1 <= self._jpeg_quality <= 100:
+            self.get_logger().warn(
+                f'[phase2] SPOT_IMAGE_SERVER_JPEG_QUALITY={self._jpeg_quality} '
+                'out of range; clamping to [1, 100].')
+            self._jpeg_quality = min(100, max(1, self._jpeg_quality))
+        if self._rgb_jpeg:


Lots of variables being retrieved from env here. These should be made into ROS parameters for the node instead

alexnavtt · 2026-05-20T19:19:39Z

+        if compressed:
+            # Phase 2: source requested FORMAT_JPEG -> publish the CORRECT
+            # message type. Consumers needing raw use the standard
+            # `image_transport republish compressed raw` node (no extra
+            # robot->driver WiFi cost). No raw Image publisher here.
+            self.image_pub = None
+            self.compressed_pub = parent.create_publisher(CompressedImage, '~/' + namespace + '/image/compressed', qos_profile=qos_profile)
+        else:
+            self.image_pub = parent.create_publisher(Image, '~/' + namespace + '/image', qos_profile=qos_profile)
+            self.compressed_pub = None


Not seeing a great reason to split this up into image_pub and compressed_pub. Just adds more bookkeeping down the line by having to access through active_pub. Just change the type of image_pub based on the setting and then publish directly through image_pub

alexnavtt · 2026-05-20T19:24:11Z

@@ -276,17 +352,7 @@ def getImageMsg(data: ImageResponseProto, lease_manager: SpotLeaseManager) -> Tu
            * CameraInfo: message to define the state and config of the camera that took the image
            * TFMessage: with the transforms necessary to locate the image frames
    """
-    transforms = []
-    for child_frame, transform in data.shot.transforms_snapshot.child_to_parent_edge_map.items():
-        if not transform.parent_frame_name:
-            continue
-        transforms.append(populateTransformStamped(
-            time=TimestampToMsg(lease_manager.robotToLocalTime(data.shot.acquisition_time)),
-            parent_frame=transform.parent_frame_name,
-            child_frame=child_frame,
-            transform=SE3Pose.from_proto(transform.parent_tform_child)
-        ))
-    tf_msg = TFMessage(transforms=transforms)
+    tf_msg = _buildTfMsg(data, lease_manager)

    image_msg = Image()
    local_time = lease_manager.robotToLocalTime(data.shot.acquisition_time)
@@ -295,13 +361,17 @@ def getImageMsg(data: ImageResponseProto, lease_manager: SpotLeaseManager) -> Tu
    image_msg.height = data.shot.image.rows
    image_msg.width = data.shot.image.cols

-    # Color/greyscale formats.
-    # JPEG format
+    source_name = data.source.name
+
+    # JPEG cannot be a sensor_msgs/Image: an 'rgb8' Image must hold
+    # height*step raw bytes, not a compressed bitstream. Publishing JPEG
+    # correctly (as sensor_msgs/CompressedImage) is Phase 2; here we reject
+    # it so we never emit a corrupt message.
    if data.shot.image.format == image_pb2.Image.FORMAT_JPEG:
-        image_msg.encoding = "rgb8"
-        image_msg.is_bigendian = True
-        image_msg.step = 3 * data.shot.image.cols
-        image_msg.data = data.shot.image.data
+        raise UnsupportedImageFormatError(
+            f"Source '{source_name}' returned FORMAT_JPEG on the raw-Image "
+            f"path. JPEG must be published as sensor_msgs/CompressedImage "
+            f"(Phase 2); refusing to emit a malformed rgb8 Image.")


Similarly no need for two separate functions here. Just one getImageMsg function that generates an image based on whether or not the format of data is JPEG or not would be sufficient and would simply code in CameraPub

CJans121 · 2026-05-20T21:54:26Z

I fully agree with Alex's review. Generally, I’m not a fan of a black-box approach to fixing issues. We have been very intentional and meticulous in the design and implementation of this driver, so any proposed changes should come from a strong understanding of why each of those changes are necessary, regardless of agentic AI usage.

This PR also introduces a large number of superfluous .md and test files that we do not want to carry into the repository. Please be mindful of the breadth of changes for a reviewer to look at. Additionally, I would like to ensure that changes of PRs like this have been thoroughly tested against existing related driver functionality, particularly the ones I referenced in my previous comment.

shubham0704 · 2026-05-21T03:54:08Z

Thanks a lot for your feedback @CJans121 @alexnavtt . I understand your concerns regarding the black-box approach to problem solving. I am fully on-board with it. Part of why I believe this fix is in the right direction is because I have been building similar pub-sub systems in my startup for the past 1.5 years and the system recommendations were trivially simple. for example - compressing images and transporting, reducing fan-in by trading off how much to batch. these are the same concepts I had seen in say high-throughput vision pipelines where you have multiple input frames and it needs to pass through async DAG runtime. So, its not black-box, its more grey-box approach. and hence, to make it white-box I would be attaching simple documents with code blocks referenced from this repo that explain what optimizations were performed on the system. Thanks a lot for sharing code reviews. yall are hard-core, makes me super happy and grateful. thanks!

As for the .md files - those were the artifacts of adversarial code reviews and plan execution documents. you are right they are not supposed to part of the PR. I will remove them.
Right now, I am in the middle of the CoRL deadline. I will be visiting the lab again soon. I will nail the demo and cleanup the PR properly and share updates on this thread.

djvampire123 and others added 8 commits May 19, 2026 18:48

FabianEP11 marked this pull request as ready for review May 20, 2026 00:00

FabianEP11 requested review from CJans121 and alexnavtt and removed request for CJans121 May 20, 2026 00:00

alexnavtt requested changes May 20, 2026

View reviewed changes

CJans121 requested changes May 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

spot_image_server: opt-in compressed RGB transport + FPS measurement + format guard (Phases 0-2)#9

spot_image_server: opt-in compressed RGB transport + FPS measurement + format guard (Phases 0-2)#9
shubham0704 wants to merge 8 commits into
UTNuclearRoboticsPublic:develfrom
shubham0704:camera-fps-fix

shubham0704 commented May 19, 2026

Uh oh!

CJans121 commented May 20, 2026

Uh oh!

alexnavtt left a comment

Uh oh!

alexnavtt May 20, 2026

Uh oh!

alexnavtt May 20, 2026

Uh oh!

alexnavtt May 20, 2026

Uh oh!

CJans121 May 20, 2026

Uh oh!

shubham0704 commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

shubham0704 commented May 19, 2026

Summary

Why

What changes (and what does NOT)

Structural changes (kept disciplined for review)

Tests

Process trail (for context, optional reading)

Status & ask

Uh oh!

CJans121 commented May 20, 2026

Uh oh!

alexnavtt left a comment

Choose a reason for hiding this comment

Uh oh!

alexnavtt May 20, 2026

Choose a reason for hiding this comment

Uh oh!

alexnavtt May 20, 2026

Choose a reason for hiding this comment

Uh oh!

alexnavtt May 20, 2026

Choose a reason for hiding this comment

Uh oh!

CJans121 May 20, 2026

Choose a reason for hiding this comment

Uh oh!

shubham0704 commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants