Add a refresh-state button to the dashboard #17
Wires the Sparkplug B "Node Control/Rebirth = true" command into MQTTLogger so the thermostat dumps full state automatically on connect. Triggered by the first qualifying PUBLISH (the same condition that already populates liveClients) rather than the raw CONNECT, because the firmware needs the session fully set up before it will act on NCMD traffic. A single CT_BOOL=true on Node Control/Rebirth provokes ~2200 config entries within ~17s, including the full schedule and per-activity setpoints that incremental deltas alone never republish. The rebirth-sent gate is reset on every CONNECT so reconnects re-trigger.
The thermostat firmware silently drops Sparkplug rebirth requests received during its own NBIRTH window after CONNECT. Firing immediately on the first qualifying PUBLISH (the previous behavior) consistently produced no observable response. Empirically, a rebirth request fired ~30s after the first PUBLISH still gets dropped, while one fired ~90s in succeeds and triggers the full ~2200-entry state dump that this feature is meant to recover. Schedule the rebirth for 120s after the first qualifying PUBLISH per session, with the timer bound to a context.CancelFunc so a CONNECT during the wait can interrupt it cleanly. The schedule call is idempotent per client, so subsequent PUBLISHes during the wait are no-ops, and CONNECT cancels any in-flight timer before resetting the map so reconnects re-schedule. The fire goroutine removes its own map entry before publishing to avoid a cancel-race when CONNECT arrives at the same instant the timer expires.
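The delayed schedule described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `rebirthScheduler`, `newRebirthScheduler`, `schedule`, `onConnect` and the `send` hook are hypothetical names, and the delay is shortened in `main` so the example finishes quickly (the commit uses 120s).

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// rebirthScheduler is a hypothetical stand-in for the scheduling state on
// MQTTLogger; the real field and method names may differ.
type rebirthScheduler struct {
	mu      sync.Mutex
	delay   time.Duration                 // 120s in the commit; shortened in main
	cancels map[string]context.CancelFunc // in-flight timers keyed by client ID
	send    func(clientID string)         // assumed hook that publishes the rebirth
}

func newRebirthScheduler(delay time.Duration, send func(string)) *rebirthScheduler {
	return &rebirthScheduler{
		delay:   delay,
		cancels: make(map[string]context.CancelFunc),
		send:    send,
	}
}

// schedule arms at most one timer per client. Calls made while a timer is
// already in flight are no-ops.
func (s *rebirthScheduler) schedule(clientID string) {
	s.mu.Lock()
	if _, inFlight := s.cancels[clientID]; inFlight {
		s.mu.Unlock()
		return
	}
	ctx, cancel := context.WithCancel(context.Background())
	s.cancels[clientID] = cancel
	s.mu.Unlock()

	go func() {
		timer := time.NewTimer(s.delay)
		defer timer.Stop()
		select {
		case <-ctx.Done():
			return // a CONNECT cancelled the wait
		case <-timer.C:
		}

		s.mu.Lock()
		if ctx.Err() != nil {
			// CONNECT won the race at expiry and already removed the entry.
			s.mu.Unlock()
			return
		}
		// Drop our own entry before publishing so a later CONNECT cannot
		// cancel a timer that has already fired.
		delete(s.cancels, clientID)
		s.mu.Unlock()
		cancel() // release the context's resources

		s.send(clientID)
	}()
}

// onConnect cancels any in-flight timer so the new session can re-schedule.
func (s *rebirthScheduler) onConnect(clientID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cancel, ok := s.cancels[clientID]; ok {
		cancel()
		delete(s.cancels, clientID)
	}
}

func main() {
	sent := make(chan string, 1)
	s := newRebirthScheduler(50*time.Millisecond, func(id string) { sent <- id })

	s.schedule("client-1")
	s.schedule("client-1") // no-op: a timer is already in flight
	fmt.Println("rebirth sent to", <-sent)
}
```

Cancelling and deleting under the same lock is what lets the goroutine tell "my timer expired" apart from "a CONNECT replaced me" with a single `ctx.Err()` check.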
The firmware responds to a successful rebirth by sending a 30 KB NBIRTH publish on the topic suffix that matches our trigger condition (e.g. spBv1.0/WallCtrl/NBIRTH/<clientID>). With only the in-flight cancel map as a gate, that response was treated as a fresh "first qualifying PUBLISH" and re-scheduled the rebirth, producing an indefinite ~4.5-minute cycle of rebirth -> NBIRTH -> reschedule -> rebirth. Add a separate rebirthFired set that the goroutine populates after the publish completes. scheduleRebirth checks both rebirthFired and rebirthCancels and bails on either. Only CONNECT clears the fired set, matching the "once per session" semantic the original gate intended.
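A small model of the two gates makes the fix easier to follow. The sketch below is illustrative only: `rebirthGate` and its methods are hypothetical, the maps stand in for `rebirthCancels` and `rebirthFired`, and the mutex the real code would need is left out to keep the logic visible.

```go
package main

import "fmt"

// rebirthGate is a hypothetical, lock-free model of the two gates. The real
// code would guard both maps with a mutex.
type rebirthGate struct {
	inFlight map[string]bool // timer armed but not yet fired (rebirthCancels)
	fired    map[string]bool // rebirth already published this session (rebirthFired)
}

func newRebirthGate() *rebirthGate {
	return &rebirthGate{inFlight: map[string]bool{}, fired: map[string]bool{}}
}

// trySchedule reports whether a qualifying PUBLISH should arm a new timer.
// It bails if a timer is pending or a rebirth was already sent this session.
func (g *rebirthGate) trySchedule(clientID string) bool {
	if g.fired[clientID] || g.inFlight[clientID] {
		return false
	}
	g.inFlight[clientID] = true
	return true
}

// markFired records that the timer expired and the publish completed.
func (g *rebirthGate) markFired(clientID string) {
	delete(g.inFlight, clientID)
	g.fired[clientID] = true
}

// onConnect is the only place the fired set is cleared.
func (g *rebirthGate) onConnect(clientID string) {
	delete(g.inFlight, clientID)
	delete(g.fired, clientID)
}

func main() {
	g := newRebirthGate()
	fmt.Println(g.trySchedule("c1")) // first qualifying PUBLISH: true
	g.markFired("c1")
	fmt.Println(g.trySchedule("c1")) // NBIRTH response to the rebirth: false
	g.onConnect("c1")
	fmt.Println(g.trySchedule("c1")) // new session: true
}
```

Without the `fired` map, the second call would return true again, which is the reschedule loop this commit removes.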
Add a Features bullet for automatic state synchronization, a Technical Details subsection explaining the delta-vs-snapshot problem and the 120s-delayed Sparkplug Node Control/Rebirth workaround, and a TODO checkmark recording the feature as shipped. Also add SYSTXCCITC01-C / v2.00 to the compatibility matrix as a new known-working configuration.
Thanks for sending this PR! I have a high-level comment here. I have noticed quite a bit of buggy behavior with the firmware. Because of that, my approach with Anantha has been to be very conservative about deviating from the behavior of Carrier's cloud-hosted controller. I'm not super comfortable doing a full state refresh unconditionally on startup. Imagine a scenario where Anantha crashes when getting a response to the full state refresh (or for any other reason) - it will keep sending the full state refresh request every 2-ish minutes. Since the goal of this PR is to make the very first page load better, we can have a button on that page to send a refresh request - we can even disable this button if data is fully populated already. FWIW, when I was iterating on this project, my go-to was to completely power-cycle the thermostat to force it to send the full initial data. I would love to not have to do that and just press a button on the page :)
In response to maintainer feedback on PR anupcshan#17, switch from automatic-fire to user-initiated. This commit removes the timer infrastructure: the rebirthCancels and rebirthFired maps, the rebirthLock mutex, the rebirthDelay duration, the scheduleRebirth method, the OnPacketRead PUBLISH-branch invocation that scheduled it, and the CONNECT-branch reset that cleared it. The context import is also dropped. The sendRebirth helper and the cmdTopic field are preserved unchanged - they are the actual publish path for Node Control/Rebirth and will be reused by a /refresh-state HTTP handler in a follow-up commit. sendRebirth is currently unused after this commit so a one-off nolint:unused is added; the lint suppression goes away when the handler lands. The maintainer concern was a crash-loop scenario where anantha crashes on the rebirth response and the auto-fire restarts the loop on every recovery. A user-initiated button avoids that entirely while preserving the headline value of the PR.
A user-initiated POST /refresh-state endpoint that publishes Node Control/Rebirth = true to the thermostat. Replaces the auto-fire-on-connect mechanism removed in the previous commit.
Three states the handler can return:
1. Subcase A - thermostat not connected: liveClients is empty, so the publish would land at the broker with no subscriber. Short-circuit with a clear message.
2. Cooldown - last successful send was within 90 seconds. 90s is padded above the empirically observed ~59s response window from the rebirth experiments. Returns the remaining seconds. Cooldown survives MQTT reconnects so users cannot bypass it by power-cycling the thermostat.
3. Subcase B2 - thermostat connected but still in its NBIRTH window. Per the threshold experiment, the firmware silently drops rebirths received within the first ~120s of a new session. Set pendingRebirth=true; OnPacketRead PUBLISH branch fires it once the window clears. One-shot: a second click during the window replaces (not appends to) the queued state. Cleared on CONNECT (new session means the click was for the prior, gone session).
Otherwise: fire sendRebirth() immediately, record rebirthLastSent for cooldown.
State guarded by a new refreshMu mutex on MQTTLogger. mLogger is forward-declared in runServe so the web mux goroutine (which starts before mLogger is constructed) can capture it.
The maintainer feedback on PR anupcshan#17 (button instead of auto-fire) is addressed by this commit plus the upcoming UI commit. Bullet point #2 of the maintainer concern (crash loop on response) is fully avoided: pendingRebirth is in-memory only, so an anantha crash clears the queued state on restart and won't refire unless the user clicks again.
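The handler's decision order can be written as a pure function, which is a convenient shape for testing. This is a sketch under assumptions: `snapshot`, `decide`, `secs` and the `outcome` values are hypothetical names, and how the real handler measures the age of the session (here `sessionAge`) is not stated in the commit message.

```go
package main

import (
	"fmt"
	"time"
)

type outcome string

const (
	notConnected outcome = "not connected"
	coolingDown  outcome = "cooldown"
	queued       outcome = "queued"
	sent         outcome = "sent"
)

const (
	cooldown     = 90 * time.Second  // padded above the ~59s response window
	nbirthWindow = 120 * time.Second // firmware drops rebirths inside this window
)

// snapshot is a hypothetical view of the state the handler consults.
type snapshot struct {
	liveClients   int
	sessionAge    time.Duration // assumed: time since the current session began
	hasSent       bool          // false until the first successful send
	sinceLastSent time.Duration
}

func secs(n int) time.Duration { return time.Duration(n) * time.Second }

// decide returns the handler outcome and, for cooldown and queued, how long
// the caller still has to wait.
func decide(s snapshot) (outcome, time.Duration) {
	switch {
	case s.liveClients == 0:
		return notConnected, 0
	case s.hasSent && s.sinceLastSent < cooldown:
		return coolingDown, cooldown - s.sinceLastSent
	case s.sessionAge < nbirthWindow:
		return queued, nbirthWindow - s.sessionAge
	default:
		return sent, 0
	}
}

func main() {
	cases := []snapshot{
		{liveClients: 0},
		{liveClients: 1, sessionAge: secs(300), hasSent: true, sinceLastSent: secs(30)},
		{liveClients: 1, sessionAge: secs(29)},
		{liveClients: 1, sessionAge: secs(300)},
	}
	for _, c := range cases {
		o, wait := decide(c)
		fmt.Printf("%-13s wait=%v\n", o, wait)
	}
}
```

Checking the cooldown before the NBIRTH window mirrors the numbered order in the commit message; the real handler may interleave these checks with its locking differently.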
Smoke test of commit 785f4d9 surfaced two bugs:
1. The pending-rebirth fire path in OnPacketRead's PUBLISH branch did not check the cooldown before firing. Live test sequence (logs at anupcshan#17 review thread):
   17:41:59 user click 1 - queued (NBIRTH window had ~76s remaining)
   17:42:02 user click 2 - already queued
   17:42:07 user click 3 - already queued
   17:45:19 explicit click - cooldown ok, fires sendRebirth, sets rebirthLastSent
   17:45:26 NBIRTH lands as response - PUBLISH branch fires the queued click again
   Result: two sendRebirth calls 7 seconds apart (the explicit click and the previously-queued one), bypassing the 90s cooldown.
2. The queue's fire trigger was the next qualifying node-level PUBLISH (NDATA/NBIRTH on a topic ending with the bare clientID). DDATA on sub-device topics doesn't qualify. In practice the cadence between qualifying PUBLISHes can be many minutes, so a queued rebirth would sit unfired far longer than the "will fire in N seconds" message implied. The 17:45 fire happened only by accident because the explicit click triggered an NBIRTH response that itself was a qualifying PUBLISH.
This commit:
- Adds a pendingRebirthTimer field guarded by refreshMu. The /refresh-state handler sets it via time.AfterFunc when entering the queue path; a re-click stops and replaces it; CONNECT stops it (the timer is for the prior session).
- Adds firePendingRebirth as the timer callback. Re-checks pendingRebirth (CONNECT may have cleared it) and rebirthLastSent (an explicit click during the wait may have already fired). Drops the queued send if either guard says it's no longer needed.
- Removes the OnPacketRead-driven fire path. The timer triggers on real elapsed time rather than on the firmware deciding to publish.
The maintainer's stated concern about timer-shaped constructs (PR anupcshan#17 review) is addressed: this timer is set in direct response to a user click, fires once per click, replaced by re-click, stopped by CONNECT, gated by cooldown. No automatic, recurring, or unconditional behavior - the click is the trigger.
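The timer-plus-guards arrangement from this commit might look something like the sketch below. The `refresher` type, `newRefresher`, `queue`, `firePending`, `sendNow` and `onConnect` are hypothetical names chosen for the example; only `time.AfterFunc`, the field roles and the two re-checks come from the commit message.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// refresher is a hypothetical reduction of the refresh state on MQTTLogger.
type refresher struct {
	mu              sync.Mutex // plays the role of refreshMu
	pendingRebirth  bool
	pendingTimer    *time.Timer
	rebirthLastSent time.Time
	cooldown        time.Duration
	send            func() // assumed stand-in for sendRebirth
}

func newRefresher(send func()) *refresher {
	return &refresher{cooldown: 90 * time.Second, send: send}
}

// queue arms the one-shot timer; a re-click stops and replaces it.
func (r *refresher) queue(wait time.Duration) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.pendingTimer != nil {
		r.pendingTimer.Stop()
	}
	r.pendingRebirth = true
	r.pendingTimer = time.AfterFunc(wait, r.firePending)
}

// firePending is the timer callback. It re-checks both guards and drops the
// queued send if CONNECT cleared it or an explicit click already fired.
func (r *refresher) firePending() {
	r.mu.Lock()
	stale := !r.pendingRebirth
	cooling := !r.rebirthLastSent.IsZero() && time.Since(r.rebirthLastSent) < r.cooldown
	r.pendingRebirth = false
	if stale || cooling {
		r.mu.Unlock()
		return
	}
	r.rebirthLastSent = time.Now()
	r.mu.Unlock()
	r.send()
}

// sendNow is the explicit-click path once the caller has passed the gates.
func (r *refresher) sendNow() {
	r.mu.Lock()
	r.rebirthLastSent = time.Now()
	r.mu.Unlock()
	r.send()
}

// onConnect discards anything queued for the prior session.
func (r *refresher) onConnect() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.pendingTimer != nil {
		r.pendingTimer.Stop()
		r.pendingTimer = nil
	}
	r.pendingRebirth = false
}

func main() {
	done := make(chan struct{}, 1)
	r := newRefresher(func() { done <- struct{}{} })

	r.queue(20 * time.Millisecond)
	r.queue(50 * time.Millisecond) // re-click replaces the first timer
	<-done
	fmt.Println("queued rebirth fired")
}
```

Because the callback re-reads `rebirthLastSent` under the lock, the 17:45 sequence from the smoke test (explicit click, then the queued send 7 seconds later) would end with the queued send being dropped.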
The /refresh-state endpoint added in 785f4d9 had no UI. This wires it into the dashboard with a button placed under the "Last updated" row, and adds a small completeness heuristic that decides which informational text to render above it. stateLooksComplete checks node-level metrics (system mode/oat, wall control rt/rh, profile model/firmware/brand/serial) plus a per-active-zone schedule and activity check. Active zones are detected via live state metrics (rt/htsp/clsp), not the <N>/enabled flag, because zone 1 on a single-zone install has no enabled field at all - the zone-shape findings are documented in dbirth-decoded/. The schedule check requires at least one fully-formed period per day rather than all five, so a thermostat with fewer configured periods isn't falsely flagged. The button is always clickable regardless of completeness; only the pre-text changes. Per maintainer feedback on PR anupcshan#17, the goal is to inform the user about the state, not to gate the action.
The previous wording described an automatic-on-connect rebirth, which is no longer how the feature works. Rewrite the feature bullet, the "State synchronization" subsection, and the matching TODO line to describe the button-based flow, the 90-second cooldown, and the NBIRTH-window queue path.
da1ebdc to 5ded594
Pivoted this PR to a user-initiated button per your review. The auto-fire mechanism is gone; the button replaces it.
The pivot, briefly
The crash-loop concern is fully avoided: the queue state is in-memory only, so an anantha crash clears it on restart and won't refire unless the user clicks again. The button itself is always clickable; only the informational pre-text changes based on whether the state looks complete.
Commits (oldest to newest)
End-to-end verification (live deployment, 2026-05-11)
Test setup: anantha running in Docker with --reqs-dir pointing at a host-mounted volume.
Container logs across the full sequence (some noise omitted):
What this exercises:
Happy to split anything further or adjust based on your read.
What
Fresh anantha installs (and any with an empty proto cache) used to render an empty /schedule and mostly-empty /profiles until the user manually edited every field on the thermostat. The cause: after CONNECT, the thermostat publishes only deltas, not a full snapshot. Wi-Fi cycling does not trigger a republish either.
This PR adds a "Refresh thermostat state" button to the dashboard. Clicking it sends a Sparkplug B Node Control/Rebirth = true (CT_BOOL) command to spBv1.0/WallCtrl/NCMD/<clientID>. The firmware honors this and dumps full state (~2200 entries: schedule, activity setpoints, system info, sensor templates), populating the dashboard within roughly a minute.
The button's pre-text changes based on a per-page-load completeness check (stateLooksComplete in cmd/anantha/cmd/state_complete.go) so the user can see whether a refresh is likely useful, but the button itself stays clickable in either case.
Why a button instead of auto-fire on connect
Earlier revisions of this PR fired the rebirth automatically 120s after the first qualifying PUBLISH per session. @anupcshan flagged a concern in review: if anantha crashes processing the rebirth response, an unconditional auto-fire restarts the loop on every recovery. A user-initiated button avoids that entirely while still solving the headline "first page load is empty after a fresh install" problem.
Handler behavior
The handler (POST /refresh-state) picks one of four paths:
- Thermostat connected, outside both the cooldown and the NBIRTH window: fire sendRebirth(), record rebirthLastSent for the cooldown gate.
- Thermostat not connected: liveClients is empty, so the publish would land at the broker with no subscriber. Short-circuit with a clear message.
- Cooldown: the last successful send was within 90 seconds. Return the remaining seconds.
- Thermostat connected but still in its NBIRTH window: set pendingRebirth=true and arm a time.AfterFunc timer that fires sendRebirth() once the window clears. The timer callback re-checks both pendingRebirth (cleared on CONNECT) and the cooldown before publishing.
State guarded by a refreshMu mutex on MQTTLogger. The queue and timer are cleared on CONNECT (so a click from the prior session does not bleed into the new one), and pendingRebirth is in-memory only, so an anantha crash clears the queue on restart.
Empirical timing data
Verified against SYSTXCCITC01-C running v2.00 (build 131755-02.00). One rebirth response, measured T+ from the Sent Node Control/Rebirth log line to each PUBLISH log line:
- spBv1.0/WallCtrl/NBIRTH/<id>
- spBv1.0/WallCtrl/DBIRTH/<id>/energy_star
- spBv1.0/WallCtrl/DBIRTH/<id>/idu
- spBv1.0/WallCtrl/DBIRTH/<id>/odu
- spBv1.0/WallCtrl/DBIRTH/<id>/zones
- spBv1.0/WallCtrl/DBIRTH/<id>/debug
- spBv1.0/WallCtrl/DBIRTH/<id>/energy
Total payload: ~99.6 KB across 7 publishes over 59 seconds. The 90-second cooldown is padded above this. The DBIRTH/zones payload at 44 KB is the largest by far - that's where the schedule (1/program/...) and activity setpoints (1/activities/...) live.
Completeness heuristic
stateLooksComplete checks node-level metrics (system/mode, system/oat, sensor/wallControl/{rt,rh}, profile/{model,firmware,brand,serial}) plus a per-active-zone schedule and activity check. Notes:
- Active zones are detected via live state metrics (rt/htsp/clsp), not the <N>/enabled flag, because zone 1 on a single-zone install has no enabled field at all (verified against actual DBIRTH decodes).
- The schedule check requires at least one fully-formed time + activity + enabled triple per day rather than all five periods, so a thermostat with fewer than 5 configured periods isn't falsely flagged.
- The activity check requires all but one of the five activities (home, away, sleep, wake, manual) to have htsp set, leaving room for one user-disabled activity slot.
If real installs report false positives or false negatives, the metric set is the natural place to tune.
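A heuristic along these lines could be sketched as below. Only the node-level metric names and the activity names come from the description; the zone-level key layout (`<zone>/program/<day>/<period>/...`, `<zone>/activities/<name>/htsp`), the `maxZones` bound, the four-of-five activity threshold, and the rule that a state with no active zone counts as incomplete are all assumptions made for illustration, and the real `stateLooksComplete` may differ on each of them.

```go
package main

import "fmt"

// Node-level metric names come from the PR description. Every zone-level
// key layout below is a guess made for illustration.
var nodeMetrics = []string{
	"system/mode", "system/oat", "sensor/wallControl/rt", "sensor/wallControl/rh",
	"profile/model", "profile/firmware", "profile/brand", "profile/serial",
}
var activities = []string{"home", "away", "sleep", "wake", "manual"}

const (
	maxZones      = 8 // assumed upper bound on zone numbers
	days          = 7 // assumed: one program per weekday
	periods       = 5 // periods per day
	minActivities = 4 // assumed: tolerate one user-disabled activity
)

type state map[string]string

// has treats an empty value the same as a missing key.
func (s state) has(key string) bool { return s[key] != "" }

// zoneActive looks at live metrics rather than the <N>/enabled flag.
func (s state) zoneActive(zone int) bool {
	k := fmt.Sprintf("%d/", zone)
	return s.has(k+"rt") || s.has(k+"htsp") || s.has(k+"clsp")
}

// scheduleOK needs one full time + activity + enabled triple per day.
func (s state) scheduleOK(zone int) bool {
	for d := 0; d < days; d++ {
		found := false
		for p := 0; p < periods && !found; p++ {
			base := fmt.Sprintf("%d/program/%d/%d/", zone, d, p)
			found = s.has(base+"time") && s.has(base+"activity") && s.has(base+"enabled")
		}
		if !found {
			return false
		}
	}
	return true
}

// activitiesOK counts activities that carry a heat setpoint.
func (s state) activitiesOK(zone int) bool {
	n := 0
	for _, a := range activities {
		if s.has(fmt.Sprintf("%d/activities/%s/htsp", zone, a)) {
			n++
		}
	}
	return n >= minActivities
}

func stateLooksComplete(s state) bool {
	for _, m := range nodeMetrics {
		if !s.has(m) {
			return false
		}
	}
	active := 0
	for z := 1; z <= maxZones; z++ {
		if !s.zoneActive(z) {
			continue
		}
		active++
		if !s.scheduleOK(z) || !s.activitiesOK(z) {
			return false
		}
	}
	return active > 0 // assumed: no active zone means incomplete
}

// sampleState builds a complete single-zone state with no 1/enabled key.
func sampleState() state {
	s := state{"1/rt": "70"}
	for _, m := range nodeMetrics {
		s[m] = "x"
	}
	for d := 0; d < days; d++ {
		base := fmt.Sprintf("1/program/%d/0/", d)
		s[base+"time"], s[base+"activity"], s[base+"enabled"] = "06:00", "home", "on"
	}
	for _, a := range activities[:minActivities] {
		s["1/activities/"+a+"/htsp"] = "68"
	}
	return s
}

func main() {
	s := sampleState()
	fmt.Println(stateLooksComplete(s)) // true
	delete(s, "1/program/3/0/time")
	fmt.Println(stateLooksComplete(s)) // false
}
```

Keeping the metric list and thresholds as package-level values is one way to make the "natural place to tune" a one-line change.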
End-to-end verification
Tested against the live deployment on 2026-05-11. anantha running in Docker with --reqs-dir pointing at a host-mounted volume. The host directory was wiped before container start to give a true cold-start with no on-disk proto cache.
Container logs (some noise omitted):
What this exercises:
- POST /refresh-state at 18:36:11, mid NBIRTH window. Handler returned "fire automatically in about 91 seconds". Timer fired at 18:37:44 (queue + 2s buffer). Full state drained by 18:38:44.
- POST /refresh-state at 18:41:01 (cooldown elapsed). Handler returned "Refresh requested. Full state will arrive over the next ~60 seconds." Single Sent Node Control/Rebirth log line; response began landing 7s later.
- LoadedValues settled at 2230 entries by 18:38:44. Exactly one Sent Node Control/Rebirth per fire across the whole sequence; no double-fires from the AfterFunc timer.
Related
OnChangeN; this PR addresses the orthogonal problem of the thermostat not publishing the data in the first place).
Compatibility table
Adds SYSTXCCITC01-C / v2.00 as a new known-working configuration. The rebirth approach should work on any firmware that uses Sparkplug B-shaped topics.