Skip to content

spot_image_server: opt-in compressed RGB transport + FPS measurement + format guard (Phases 0-2)#9

Open
shubham0704 wants to merge 8 commits into
UTNuclearRoboticsPublic:develfrom
shubham0704:camera-fps-fix
Open

spot_image_server: opt-in compressed RGB transport + FPS measurement + format guard (Phases 0-2)#9
shubham0704 wants to merge 8 commits into
UTNuclearRoboticsPublic:develfrom
shubham0704:camera-fps-fix

Conversation

@shubham0704
Copy link
Copy Markdown

DRAFT — pending on-robot validation of the JPEG (Phase 2) path and re-baseline under the merged RELIABLE-QoS image publishers. Posted now for early architectural review.

Summary

Adds opt-in (default-OFF) JPEG / CompressedImage transport for the spot_image_server RGB sources, plus an opt-in measurement harness and a hardening of the image-format dispatch. Defaults change nothing — with no new env vars set, the byte-on-the-wire and published-message behavior is identical to devel.

Why

On-robot measurement of this branch on the actual Spot (see docs/measurements/2026-05-19-phase0-baseline.md):

  • Single RGB stream hand_color_image: ~7.5 Hz / 10 Hz target, avg latency ~75 ms, max latency 112–131 ms (over the 100 ms timer-tick boundary — the per-source done() gate then quantizes to 5 Hz on those frames; blended ≈ 7.5 Hz).
  • All-camera run: a hard ~5.5–6.0 MB/s aggregate WiFi ceiling that does not scale with stream count — per-stream collapses to ~2.5–3.2 Hz with ~290 ms latency once multiple cameras are subscribed (rosbag triggers this).

RAW for many simultaneous streams is bandwidth-impossible at the configured 10 Hz; compressed RGB brings per-frame payload from ~MBs to tens of KBs, restoring rate and headroom. The whole root-cause + plan write-up is in docs/plans/PLAN-spot_ros-camera-fps.md, with a one-pager teaching version in docs/camera-fps-explainer.md.

What changes (and what does NOT)

Env var Default Effect
SPOT_IMAGE_SERVER_FPS_DEBUG OFF Per-source windowed log (Hz, avg/max SDK round-trip latency, MB/s). Pure logging — no behavior change.
SPOT_IMAGE_SERVER_FPS_DEBUG_WINDOW 5.0 Window seconds (only if FPS_DEBUG on).
SPOT_IMAGE_SERVER_RGB_JPEG OFF RGB sources (except hand_tof) request FORMAT_JPEG; publish sensor_msgs/CompressedImage on ~/<source>/image/compressed instead of raw Image. Depth, ~/get_images service, static-TF all stay RAW.
SPOT_IMAGE_SERVER_JPEG_QUALITY 75 Clamped to [1, 100].

README updated with these — see README.md § Set Environment Variables → 5. Image Server.

When RGB_JPEG is on, the raw ~/<source>/image topic for RGB sources is NOT published (depth/raw is unchanged). Consumers that still want raw RGB should run ros2 run image_transport republish compressed raw … — the README has the exact recipe. The driver deliberately does not re-request RAW (that would defeat the bandwidth fix) or embed a decoder (couples to CPU/queueing).

Structural changes (kept disciplined for review)

  • Phase 1 — self.image_requests split. The shared request map is split into publish_requests (periodic timers) vs service_requests (GetImages + static-TF + listing). Phase-1 alone keeps both RAW (byte-identical request semantics) — the split is a prerequisite so the JPEG retarget in Phase 2 cannot leak into the RAW-only service / TF paths.
  • Phase 1 — getImageMsg format guard. New UnsupportedImageFormatError. JPEG-on-raw-path (the historical malformed-rgb8 bug at ros_helpers.py:300-304) now raises; unhandled FORMAT_RAW pixel_format raises (was a silent half-filled Image); the latent .format-vs-pixel_format-enum comparison is replaced with a clean else. Both call sites skip-with-throttled-warn instead of publishing garbage.
  • Phase 2 — opt-in JPEG. New getCompressedImageMsg builds a properly-typed sensor_msgs/CompressedImage with the conventional format = "rgb8; jpeg compressed bgr8" (so image_transport republish compressed raw preserves the rgb8 contract). _buildTfMsg / _buildCameraInfo extracted and reused (no duplicate code between raw and compressed paths).
  • QoS: the new compressed_pub uses the same QoSProfile(RELIABLE, KEEP_LAST, depth=1) you introduced in devel 6e0b838 — consistent with the rest of the image publishers and the image→pointcloud-convertor compatibility intent.

Tests

Added under spot_driver/tests/:

  • test_fps_debug_accounting.py — windowed accounting + flush/reset (behavior test, runs with rclpy/bosdyn present).
  • test_phase1_request_map_split.py5 dependency-free AST/source checks proving the split is wired correctly everywhere.
  • test_phase1_format_guard.py — JPEG / unhandled-pixfmt / unknown-format rejection + RAW RGB_U8 / DEPTH_U16 regression.
  • test_phase2_jpeg_transport.py — structural checks for default-off safety, hand_tof exclusion, service/TF maps untouched, active_pub property usage (regression for a deploy-blocker Codex flagged), conventional format string, quality clamping; plus behavior checks for getCompressedImageMsg.

Suite: 13 passed, 4 skipped (skips are deps-gated; they run in the ROS env).

Process trail (for context, optional reading)

docs/plans/ contains the full RPI artifacts — research, plan, two rounds of independent Codex architectural/correctness review, and per-phase implementation log. Codex caught one real deploy-blocker in my Phase 2 draft (an image_pub=None dereference in update_image_task when compressed mode was on) which is fixed and regression-tested.

Status & ask

  • ✅ Builds, type-checks, all CI-runnable tests pass.
  • ✅ No behavior change with any env var unset (byte-identical to devel).
  • ⏳ On-robot validation of SPOT_IMAGE_SERVER_RGB_JPEG=1 + RELIABLE-QoS re-baseline pending — that's why this PR is opened as draft. Once those numbers land I'll attach the post-Phase-2 [fps-debug] output and undraft.
  • Looking for architectural feedback on the env-var-vs-ROS-param choice (kept env vars for non-invasive shipping; can promote to generate_parameter_library if you'd prefer), the QoS choice for the compressed publisher (currently mirrors yours), and whether the docs depth (PLAN/IMPL/VALIDATION etc.) is welcome or you'd rather have it slimmed.

djvampire123 and others added 8 commits May 19, 2026 18:48
…angelog, lab explainer)

Investigation + plan for low camera FPS during rosbag collection.
Root cause: per-source in-flight gate + 100ms timer quantization cliff,
amplified by 140 unbatched RPCs/s and RAW bandwidth over WiFi.
Plan v2 incorporates Codex RPI Step 3 review (request-map split
prerequisite, compression-before-batching reorder, bandwidth-aware
criteria, RAW/compressed coexistence). No code changes in this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds env-gated (SPOT_IMAGE_SERVER_FPS_DEBUG, default OFF) instrumentation
to spot_image_server: per-source SDK round-trip latency, achieved Hz, and
bytes/s with a windowed summary log. Lets us classify the bottleneck as
latency-bound vs WiFi-bandwidth-bound before later phases. Disabled path
is a single boolean check; no request/threading/publish behavior changes.

Tests in spot_driver/tests/ exercise the real accounting + window logic
via a fake self/response (skip where rclpy/bosdyn absent).

Ref: docs/plans/PLAN-spot_ros-camera-fps.md (Phase 0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ior change)

Prerequisite for Phase 2 (compressed RGB). Splits the shared
self.image_requests into publish_requests (periodic timers) vs
service_requests (GetImages service, static-TF, source listing) so a
later JPEG retarget of the publish map cannot leak into the RAW-only
service/TF paths. Both maps are FORMAT_RAW in this phase -> request
semantics unchanged.

Adds UnsupportedImageFormatError; getImageMsg now rejects JPEG-on-raw-
path (the historical malformed-rgb8 bug), unhandled FORMAT_RAW
pixel_format, and any other format, instead of emitting a corrupt or
half-filled Image. Also fixes the latent .format-vs-pixel_format enum
comparison. Both call sites skip-with-log instead of publishing garbage.

Tests: dependency-free AST split checks (pass) + format-guard behavior
tests (run in ROS env). py_compile clean.

Ref: docs/plans/PLAN-spot_ros-camera-fps.md (Phase 1),
docs/plans/IMPL-spot_ros-camera-fps.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… review)

Codex Step-6 review of Phase 0+1 found one Medium issue: the send
timestamp was written AFTER add_done_callback, so an already-resolved
future could run _fps_record before the timestamp was stored, dropping
or biasing latency samples. Now recorded before the future is created.
No publish-path behavior change. Accepted Low (stale _fps_send_times on
failed futures; bounded) documented in IMPL doc.

Adds docs/plans/VALIDATION-spot_ros-camera-fps.md (verdict: safe to
deploy/test on robot for normal RAW operation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single hand_color_image RGB stream measured at ~7.5 Hz (target 10),
latency avg ~75 ms / max ~120-131 ms, ~6.9 MB/s. Confirms the 100 ms
quantization cliff on hardware (even one camera can't hold 10 Hz).
14-stream rosbag projects ~50-100 MB/s -> also bandwidth-bound.
Compression (Phase 2) fixes both; batching does nothing for the
measured single-stream cliff. Closes the deferred decision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On-robot baseline showed a hard ~6 MB/s aggregate WiFi ceiling (single
stream 7.5 Hz; all-camera collapses to ~3 Hz each, ~290 ms latency) ->
bandwidth-bound; compression is the only viable fix.

SPOT_IMAGE_SERVER_RGB_JPEG (default OFF = RAW, fully backward
compatible) + SPOT_IMAGE_SERVER_JPEG_QUALITY (default 75). When on,
RGB sources (!= hand_tof) request FORMAT_JPEG and publish
sensor_msgs/CompressedImage on <ns>/image/compressed via new
getCompressedImageMsg. Depth/GetImages/static-TF stay RAW (Phase 1
split makes this isolated). getImageMsg refactored onto shared
_buildTfMsg/_buildCameraInfo (behavior preserved). Raw-when-JPEG-on is
the standard image_transport republish (no extra robot->driver WiFi).

Tests: 5 dependency-free structural + 2 behavior. Suite 10 passed,
4 skipped. py_compile clean. No behavior change with flag unset.

Ref: docs/plans/PLAN/IMPL/measurements.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
HIGH: update_image_task dereferenced CameraPub.image_pub which is None
in compressed mode -> every JPEG RGB tick AttributeError, Phase 2
non-functional. Added CameraPub.active_pub property; update_image_task
and process_data both use it. Regression test added.

MED: CompressedImage.format -> 'rgb8; jpeg compressed bgr8' (conventional
compressed_image_transport string) so republish preserves the rgb8
contract instead of falling back to bgr8.
MED: startup warning now explicitly states raw <ns>/image is NOT
published for RGB and names AprilTag / camera_pointclouds /
spot_utils::CameraClient + the republish recovery command.
LOW: SPOT_IMAGE_SERVER_JPEG_QUALITY clamped to [1,100] with warning.
DRY: getImageMsg now actually calls _buildCameraInfo (removed inline
duplicate; IMPL claim now true). Measurements doc splits RGB-compressed
vs RGB+depth success cases.

Suite 13 passed, 4 skipped. py_compile clean.
Review: docs/plans/VALIDATION-spot_ros-camera-fps.md (Phase 2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…EG / JPEG_QUALITY)

Adds a new subsection under 'Set Environment Variables' covering the
three opt-in env vars introduced on this branch (all default OFF, so no
existing user behavior changes):

- SPOT_IMAGE_SERVER_FPS_DEBUG (+ _WINDOW): per-source Hz/latency/MB/s
  logging for bandwidth-bound episode detection.
- SPOT_IMAGE_SERVER_RGB_JPEG: request FORMAT_JPEG and publish
  CompressedImage on <ns>/image/compressed instead of raw Image; the
  ~10-20x bandwidth cut needed to keep many cameras within the measured
  ~6 MB/s WiFi ceiling. Explicitly states the raw RGB topic is NOT
  published in this mode and gives the standard image_transport
  republish recipe for consumers that need raw.
- SPOT_IMAGE_SERVER_JPEG_QUALITY: integer [1,100], default 75, clamped.

Pointers to docs/plans/PLAN, docs/measurements/baseline, and the
docs/camera-fps-explainer for the why.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@FabianEP11 FabianEP11 marked this pull request as ready for review May 20, 2026 00:00
@FabianEP11 FabianEP11 requested review from CJans121 and alexnavtt and removed request for CJans121 May 20, 2026 00:00
@CJans121
Copy link
Copy Markdown
Collaborator

Could you please test and verify that the following functionalities remain unaffected by this PR? They have required a significant amount of effort on the image server side to get working reliably.

  • Point cloud generation and visualization in RViz from camera images. Use --show-args to view the available camera arguments.
ros2 launch spot_bringup camera_pointclouds.launch.py
  • Ensure we can successfully visualize the generated point clouds in RViz.
  • AprilTag detection:
ros2 launch spot_apriltag spot_apriltag.launch.py
  • Ensure that tag detection functions correctly across all cameras. The camera list is defined in the launch file.

Copy link
Copy Markdown
Contributor

@alexnavtt alexnavtt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comments:

  • This PR introduces many functions with leading underscores, which is a style currently not used in this code base. These should be changed to align with existing styles.
  • Lots of Codex comments left in the code base, referencing Phase1, Phase2, etc. Those should not be included in any PR's, and multi-line comments should be reduced to the bare minimum
  • Not sure what to make of all the myriad of docs/... markdown documents added. If those are what Codex generated for the code changes, then I recommend that you follow through them and make sure the changes are appropriate and then delete them
  • Lots of tests added but not instructions on how to run them

Overall: I don't really like the Codex gigantic code review. It's super verbose and puts a lot on the plate of the reviewer. The code changes themselves seem to be fine, other than the formatting issues and parameter issues listed. I'll want to give this one another review once all the excess baggage has been trimmed off.

Comment on lines +142 to +177
self._fps_debug = os.environ.get('SPOT_IMAGE_SERVER_FPS_DEBUG', '') \
not in ('', '0', 'false', 'False', 'no', 'off')
if self._fps_debug:
self._fps_lock = threading.Lock()
self._fps_send_times: dict[str, float] = {}
self._fps_stats: dict[str, dict] = defaultdict(self._fps_new_bucket)
self._fps_window_start = time.monotonic()
self._fps_window_sec = float(
os.environ.get('SPOT_IMAGE_SERVER_FPS_DEBUG_WINDOW', '5.0'))
self.get_logger().warn(
'[fps-debug] Phase-0 instrumentation ENABLED via '
'SPOT_IMAGE_SERVER_FPS_DEBUG. Logging only; no behavior change. '
f'Summary every {self._fps_window_sec:.1f}s.')
# ----------------------------------------------------------------------

# --- Phase 2: opt-in compressed RGB transport (default OFF = RAW) ------
# On the robot the bottleneck is the robot->driver gRPC-over-WiFi hop
# (measured ~6 MB/s aggregate ceiling, RAW). Requesting FORMAT_JPEG
# makes the robot encode onboard and send ~10-20x fewer bytes.
# RGB sources (except hand_tof, which is ToF not RGB) then publish a
# proper sensor_msgs/CompressedImage on <ns>/image/compressed. Depth,
# the GetImages service, and static-TF stay RAW (Phase 1 split).
# Consumers needing raw use `image_transport republish compressed raw`.
self._rgb_jpeg = os.environ.get('SPOT_IMAGE_SERVER_RGB_JPEG', '') \
not in ('', '0', 'false', 'False', 'no', 'off')
try:
self._jpeg_quality = int(os.environ.get('SPOT_IMAGE_SERVER_JPEG_QUALITY', '75'))
except ValueError:
self._jpeg_quality = 75
# Clamp to the Spot SDK accepted range [1, 100] (Codex Phase 2: low).
if not 1 <= self._jpeg_quality <= 100:
self.get_logger().warn(
f'[phase2] SPOT_IMAGE_SERVER_JPEG_QUALITY={self._jpeg_quality} '
'out of range; clamping to [1, 100].')
self._jpeg_quality = min(100, max(1, self._jpeg_quality))
if self._rgb_jpeg:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of variables being retrieved from env here. These should be made into ROS parameters for the node instead

Comment on lines +55 to +64
if compressed:
# Phase 2: source requested FORMAT_JPEG -> publish the CORRECT
# message type. Consumers needing raw use the standard
# `image_transport republish compressed raw` node (no extra
# robot->driver WiFi cost). No raw Image publisher here.
self.image_pub = None
self.compressed_pub = parent.create_publisher(CompressedImage, '~/' + namespace + '/image/compressed', qos_profile=qos_profile)
else:
self.image_pub = parent.create_publisher(Image, '~/' + namespace + '/image', qos_profile=qos_profile)
self.compressed_pub = None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not seeing a great reason to split this up into image_pub and compressed_pub. Just adds more bookkeeping down the line by having to access through active_pub. Just change the type of image_pub based on the setting and then publish directly through image_pub

Comment on lines 318 to +374
@@ -276,17 +352,7 @@ def getImageMsg(data: ImageResponseProto, lease_manager: SpotLeaseManager) -> Tu
* CameraInfo: message to define the state and config of the camera that took the image
* TFMessage: with the transforms necessary to locate the image frames
"""
transforms = []
for child_frame, transform in data.shot.transforms_snapshot.child_to_parent_edge_map.items():
if not transform.parent_frame_name:
continue
transforms.append(populateTransformStamped(
time=TimestampToMsg(lease_manager.robotToLocalTime(data.shot.acquisition_time)),
parent_frame=transform.parent_frame_name,
child_frame=child_frame,
transform=SE3Pose.from_proto(transform.parent_tform_child)
))
tf_msg = TFMessage(transforms=transforms)
tf_msg = _buildTfMsg(data, lease_manager)

image_msg = Image()
local_time = lease_manager.robotToLocalTime(data.shot.acquisition_time)
@@ -295,13 +361,17 @@ def getImageMsg(data: ImageResponseProto, lease_manager: SpotLeaseManager) -> Tu
image_msg.height = data.shot.image.rows
image_msg.width = data.shot.image.cols

# Color/greyscale formats.
# JPEG format
source_name = data.source.name

# JPEG cannot be a sensor_msgs/Image: an 'rgb8' Image must hold
# height*step raw bytes, not a compressed bitstream. Publishing JPEG
# correctly (as sensor_msgs/CompressedImage) is Phase 2; here we reject
# it so we never emit a corrupt message.
if data.shot.image.format == image_pb2.Image.FORMAT_JPEG:
image_msg.encoding = "rgb8"
image_msg.is_bigendian = True
image_msg.step = 3 * data.shot.image.cols
image_msg.data = data.shot.image.data
raise UnsupportedImageFormatError(
f"Source '{source_name}' returned FORMAT_JPEG on the raw-Image "
f"path. JPEG must be published as sensor_msgs/CompressedImage "
f"(Phase 2); refusing to emit a malformed rgb8 Image.")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly no need for two separate functions here. Just one getImageMsg function that generates an image based on whether or not the format of data is JPEG or not would be sufficient and would simply code in CameraPub

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fully agree with Alex's review. Generally, I’m not a fan of a black-box approach to fixing issues. We have been very intentional and meticulous in the design and implementation of this driver, so any proposed changes should come from a strong understanding of why each of those changes are necessary, regardless of agentic AI usage.

This PR also introduces a large number of superfluous .md and test files that we do not want to carry into the repository. Please be mindful of the breadth of changes for a reviewer to look at. Additionally, I would like to ensure that changes of PRs like this have been thoroughly tested against existing related driver functionality, particularly the ones I referenced in my previous comment.

@shubham0704
Copy link
Copy Markdown
Author

Thanks a lot for your feedback @CJans121 @alexnavtt . I understand your concerns regarding the black-box approach to problem solving. I am fully on-board with it. Part of why I believe this fix is in the right direction is because I have been building similar pub-sub systems in my startup for the past 1.5 years and the system recommendations were trivially simple. for example - compressing images and transporting, reducing fan-in by trading off how much to batch. these are the same concepts I had seen in say high-throughput vision pipelines where you have multiple input frames and it needs to pass through async DAG runtime. So, its not black-box, its more grey-box approach. and hence, to make it white-box I would be attaching simple documents with code blocks referenced from this repo that explain what optimizations were performed on the system. Thanks a lot for sharing code reviews. yall are hard-core, makes me super happy and grateful. thanks!

As for the .md files - those were the artifacts of adversarial code reviews and plan execution documents. you are right they are not supposed to part of the PR. I will remove them.
Right now, I am in the middle of the CoRL deadline. I will be visiting the lab again soon. I will nail the demo and cleanup the PR properly and share updates on this thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants