ZMQ Bug: Last Value Caching #445

Merged
astronomerdave merged 1 commit into main from bug/telemetry-timeout
Jun 3, 2026

@astronomerdave
Contributor

This fixes messaging by filling a known gap in ZMQ: it replaces zmq_proxy with a last-value-cache broker. Without this, a daemon that restarts fails to get telemetry. The broker is based on lvcache.c.

By ZMQ's own admission,

"Here are the classic failure cases for pub-sub:

Subscribers join late, so they miss messages the server already sent.
Subscribers can drop off and lose messages while they are away.
Subscribers can crash and restart, and lose whatever data they already received."

And the explicit admission:

"Reliability requires complexity that most of us don't need, most of the time, which is why ZeroMQ does..."

-- https://zguide.zeromq.org/docs/chapter5/

and also here:

"If you've used commercial publish-subscribe systems, you may be used to some features that are missing in the fast and cheerful ØMQ pub-sub model. One of these is last value caching (LVC). This solves the problem of how a new subscriber catches up when it joins the network. The theory is that publishers get notified when a new subscriber joins and subscribes to some specific topics. The publisher can then rebroadcast the last message for those topics.
I've already explained why publishers don't get notified when there are new subscribers: in large pub-sub systems the volumes of data make it pretty much impossible. To make really large-scale pub-sub networks work, you need a protocol like PGM that exploits an upscale Ethernet switch's ability to multicast data to thousands of subscribers. Trying to do a TCP unicast from the publisher to each of thousands of subscribers just doesn't scale."

— ZeroMQ, O'Reilly (Chapter 5), https://www.oreilly.com/library/view/zeromq/9781449334437/ch05s03.html
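For readers who want the idea in code: below is a minimal, socket-free sketch of the last-value-cache logic the quote describes. It is not the broker in this PR. The class and method names are hypothetical, and it assumes the broker sees each new subscription as a frame whose first byte is 1 followed by the topic, which is how an XPUB socket reports subscriptions. Topic matching here is exact, a simplification, since real ZMQ subscriptions are prefix filters.

```python
from typing import Dict, List, Tuple

SUBSCRIBE = 0x01  # first byte of an XPUB subscription event frame


class LastValueCache:
    """Socket-free model of a broker's cache logic (hypothetical names)."""

    def __init__(self) -> None:
        self._last: Dict[bytes, bytes] = {}

    def on_publish(self, topic: bytes, payload: bytes) -> Tuple[bytes, bytes]:
        """Remember the newest payload for the topic, then forward it unchanged."""
        self._last[topic] = payload
        return topic, payload

    def on_subscription_event(self, event: bytes) -> List[Tuple[bytes, bytes]]:
        """Return the cached messages to replay for a new subscription."""
        if not event or event[0] != SUBSCRIBE:
            return []  # unsubscribe (0x00) or empty frame: nothing to replay
        topic = event[1:]
        if topic in self._last:
            return [(topic, self._last[topic])]
        return []
```

In a real broker, `on_publish` would sit on the frontend (publisher-facing) socket and `on_subscription_event` on the backend XPUB socket. Note that libzmq only passes duplicate subscriptions up to the application when `ZMQ_XPUB_VERBOSE` is set, so a broker relying on these events would likely need that option.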


@drriddle left a comment

This is why I use a constant stream of telemetry; there's no question of the current state in that case. An improvement would be to have each client request the current data frame after connecting, so that it is up to date with everything. It may miss what happened while it was gone, but there's no getting around that without downloading the entire database going back X data frames.
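The request-on-connect idea above could look roughly like this. It is a toy model with no sockets; `TelemetrySource`, `Client`, and the `temp` channel are made-up names for illustration, not anything from this codebase.

```python
from typing import Dict, List


class TelemetrySource:
    """Hypothetical publisher that keeps its newest value per channel."""

    def __init__(self) -> None:
        self.current: Dict[str, float] = {}
        self.clients: List["Client"] = []

    def publish(self, channel: str, value: float) -> None:
        # Record the value as current state, then push it to connected clients.
        self.current[channel] = value
        for client in self.clients:
            client.state[channel] = value

    def snapshot(self) -> Dict[str, float]:
        # Copy, so the client cannot mutate the source's state.
        return dict(self.current)


class Client:
    def __init__(self) -> None:
        self.state: Dict[str, float] = {}

    def connect(self, source: TelemetrySource) -> None:
        self.state = source.snapshot()  # catch up on the current frame first
        source.clients.append(self)     # then follow live updates


source = TelemetrySource()
source.publish("temp", -110.0)
source.publish("temp", -109.5)
late = Client()
late.connect(source)  # never saw -110.0, but is current at -109.5
```

The late client ends up with the current state only; the earlier value is gone, which is the limitation noted above.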

@astronomerdave
Contributor Author

That was already being done. All daemons, when they start, publish their status and presence. A daemon publishing his presence was supposed to induce all the publishers that he subscribes to to publish their own status. That way, anyone can come and go as they please and no one gets out of touch. It was a good idea but it was broken.

The problem came because I was using the built-in zmq_proxy(). A PUB/XPUB socket filters its outgoing messages: it only puts a message on the wire for a topic that it knows has at least one subscriber. So in order for a published message to go out, the publishing socket must already know about the subscriber's subscription. What happened to me today was that I restarted a daemon and he published that he was online. This is supposed to induce the daemons that he subscribes to, to publish their status, but they never saw his announcement: his subscriber table was still empty, so the message was dropped before it left.
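To make the failure concrete, here is a toy model of that publisher-side filtering. It is only a sketch of the behavior described above, not the libzmq implementation, and the class name and the `presence` topic are invented for the example.

```python
from typing import List, Set


class PubSocketModel:
    """Toy model of publisher-side filtering; not the libzmq implementation."""

    def __init__(self) -> None:
        self.subscriptions: Set[bytes] = set()  # prefixes learned from peers
        self.wire: List[bytes] = []             # messages that actually left

    def subscribe(self, prefix: bytes) -> None:
        self.subscriptions.add(prefix)

    def send(self, message: bytes) -> bool:
        """Send only if some known subscription prefix matches the message."""
        if any(message.startswith(p) for p in self.subscriptions):
            self.wire.append(message)
            return True
        return False  # silently dropped; the sender gets no error


restarted = PubSocketModel()
dropped = restarted.send(b"presence daemon-a online")    # table still empty
restarted.subscribe(b"presence")                          # subscription arrives late
delivered = restarted.send(b"presence daemon-a online")  # now it goes out
```

The first `send` corresponds to the announcement made right after the restart: nothing fails visibly, the message just never reaches the wire.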

Anyway, it's not actually a bug, it's a feature to limit traffic, and to get it to behave like I want, it needs a smarter broker, which is what this adds.

astronomerdave reopened this Jun 3, 2026
astronomerdave merged commit 62c4a5c into main Jun 3, 2026
3 checks passed