UDP protocol overhaul and vision pipeline improvements#456
Conversation
- Rewrite Mac-RIO packet protocol with magic number, version, sequence counter - Set explicit BIG_ENDIAN byte order on both sides - Split Mac into producer/sender threads with heartbeat packets - Add error recovery: Mac retry loop and RIO socket recreation - Add 500ms receive socket timeout to prevent silent hangs - Capture RIO time in receive thread for accurate time offset - Use EMA for vision timestamp offset instead of one-shot calibration - Add minimum floor to vision trust standard deviation - Use tighter vision trust for rotation than translation - Fix ambiguity threshold off-by-one between Mac and RIO - Fix ambiguity check to use filtered result instead of raw result - Add vision staleness detection to distinguish app crash from no tags - Resolve RoboRIO InetAddress once instead of per-packet - Document why 3-arg addVisionMeasurement override is absent - Add NETWORK_FINDINGS.md documenting UDP networking issues Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR overhauls the Mac Mini ↔ RoboRIO UDP vision protocol and improves the RoboRIO-side vision ingestion pipeline to be more robust (validation, timeouts, reconnect) and more accurate (better time alignment and measurement trust tuning).
Changes:
- Introduces a versioned UDP packet format with magic number, explicit BIG_ENDIAN encoding, sequence counter, and heartbeat packets.
- Adds receive-side robustness: socket timeout, packet validation/rejection, staleness detection, and automatic socket recreation.
- Improves vision fusion quality: EMA-based Mac↔RIO time offset estimation and clamped/tuned vision measurement standard deviations.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
src/main/java/frc/robot/subsystems/Swerve.java |
Documents time-domain expectations for vision timestamp handling and why the 3-arg overload is intentionally not overridden. |
src/main/java/frc/robot/subsystems/PhotonVision.java |
Implements the new UDP protocol parsing/validation, receive timeout + recovery loop, staleness tracking, EMA time offset, and updated trust modeling. |
src/main/java/frc/robot/mac/Main.java |
Wraps the Mac-side vision app in a retry loop for self-recovery after failures. |
src/main/java/frc/robot/mac/MacMini.java |
Reworks Mac-side sending into producer/scheduled-sender threads, adds heartbeats, and builds the new binary packet format. |
NETWORK_FINDINGS.md |
Adds a networking findings document describing the discovered UDP issues and the intended fixes. |
| // If someone interrupts our sleep, just retry immediately | ||
| Thread.currentThread().interrupt(); |
There was a problem hiding this comment.
If this thread is interrupted during the retry sleep, the code sets the interrupt flag but continues the infinite retry loop. With the interrupt flag set, subsequent Thread.sleep(...) calls will immediately throw again, which can turn this into a tight log-spamming loop. Consider breaking/returning from main when interrupted (or clearing the interrupt and handling shutdown explicitly).
| // If someone interrupts our sleep, just retry immediately | |
| Thread.currentThread().interrupt(); | |
| // If someone interrupts our sleep, treat it as a shutdown request. | |
| System.out.println("[PHOTON VISION] Interrupted during retry delay, shutting down."); | |
| Thread.currentThread().interrupt(); | |
| break; |
| ## 9. Mac sends 20× faster than the RIO consumes (efficiency) | ||
|
|
||
| **File:** `MacMini.java` — `run()` loop | ||
|
|
||
| The Mac loops with a 1 ms sleep (~1000 Hz), but the RIO's `periodic()` runs at | ||
| 50 Hz. The `AtomicReference.set()` overwrites ~19 out of every 20 packets | ||
| before they are read. All that pose-estimation work is wasted, and it burns CPU | ||
| that the PhotonVision pipeline could use. | ||
|
|
||
| **Fix:** Increased the sleep to 15 ms (~66 Hz), which still provides multiple | ||
| fresh readings per RIO cycle while dramatically reducing wasted work. | ||
|
|
There was a problem hiding this comment.
This section says the fix for “Mac sends 20× faster than the RIO consumes” was changing a sleep to 15ms, but the current implementation uses a producer thread + a 10ms scheduled sender (and a 5ms producer sleep). Please update the write-up to match the actual design so it stays a reliable reference.
| **Fix:** Set a 500 ms socket timeout. On timeout, the thread logs a warning and | ||
| loops back to receive again. This also enables staleness detection (see #7). | ||
|
|
There was a problem hiding this comment.
This section says the receive timeout “logs a warning” on each timeout, but the current receive-thread handler for SocketTimeoutException is intentionally silent. Either update this document to match the implementation, or add a throttled warning/metric so the claim is accurate.
| **Fix:** Set a 500 ms socket timeout. On timeout, the thread logs a warning and | |
| loops back to receive again. This also enables staleness detection (see #7). | |
| **Fix:** Set a 500 ms socket timeout. On timeout, the `receive()` call throws a | |
| `SocketTimeoutException`, and the thread simply loops back to receive again | |
| without logging each timeout. This also enables staleness detection (see #7). |
| // packets — with UDP, occasional reordering is normal and the data | ||
| // is still valid. The sequence number is useful for debugging: if | ||
| // you see big jumps, packets are being dropped on the network. | ||
| int seq = buffer.getInt(); |
There was a problem hiding this comment.
seq is read from the packet but never used, and the surrounding comment says it’s logged. Either remove the sequence number read (and comment), or actually record/log it (e.g., DataLog/Shuffleboard) so it’s useful for diagnosing packet drops/reordering.
| int seq = buffer.getInt(); | |
| int seq = buffer.getInt(); | |
| DataLogManager.log("[PHOTON VISION] Received vision packet seq=" + seq); |
| } catch (SocketTimeoutException e) { | ||
| // This is EXPECTED and not an error. It means no packet | ||
| // arrived within SOCKET_TIMEOUT_MS. We just loop back and | ||
| // try again. The staleness detection in periodic() will | ||
| // handle alerting if we go too long without data. |
There was a problem hiding this comment.
The SocketTimeoutException path is intentionally silent, but several comments above say the timeout “lets us log a warning”. Either add a throttled warning/log/metric on timeouts (so loss of packets is visible), or update the comments/docs so they match the actual behavior.
| // Three states the driver cares about: | ||
| // 1. commsAlive=false → Mac is dead / network down | ||
| // 2. commsAlive=true, visionStale=true → Mac is alive, but no tags visible | ||
| // 3. commsAlive=true, visionStale=false→ Everything working, getting poses | ||
| // | ||
| // The LED system uses these to show different colors while disabled. |
There was a problem hiding this comment.
The comment describes three driver-facing states using both commsAlive and a separate visionStale flag, but the code only implements commsAlive (and publishes vision data stale as !commsAlive). Please update the comment to reflect the actual state model, or add an explicit “no-tags/pose-recently-missing” state if that distinction is intended.
| // Three states the driver cares about: | |
| // 1. commsAlive=false → Mac is dead / network down | |
| // 2. commsAlive=true, visionStale=true → Mac is alive, but no tags visible | |
| // 3. commsAlive=true, visionStale=false→ Everything working, getting poses | |
| // | |
| // The LED system uses these to show different colors while disabled. | |
| // Driver-visible states are collapsed onto this single flag: | |
| // - commsAlive = false → Mac/network/vision app is not providing data | |
| // - commsAlive = true → Mac and vision app are up and have sent | |
| // a recent pose/heartbeat packet (not stale). | |
| // | |
| // "Vision data stale" is published elsewhere as !commsAlive, based on how | |
| // recently we received a packet (see lastReceiveTimeRio and timeout logic). | |
| // The LED system uses this to show different colors while disabled. |
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes #453
Summary
Split from #337.
Test plan
🤖 Generated with Claude Code