[fix][consumer] Add reconnect failure listener and auto-close on max retry exhaustion #1490
Conversation
Pull request overview
Adds an opt-in notification/auto-close mechanism for consumers that can no longer reconnect to a broker, addressing the “silent broken consumer” failure mode described in #1481.
Changes:
- Extends `ConsumerOptions` with `MaxReconnectToBrokerListener` (callback) and `CloseConsumerOnMaxReconnectToBroker` (auto-close).
- Wires the new options into `partitionConsumer.reconnectToBroker()` and `newPartitionConsumerOpts()`.
- Adds integration-style tests that validate listener invocation and auto-close behavior when reconnect retries are exhausted.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `pulsar/consumer.go` | Adds new public `ConsumerOptions` fields + GoDoc for reconnect exhaustion notification and optional auto-close. |
| `pulsar/consumer_impl.go` | Propagates the new `ConsumerOptions` fields into `partitionConsumerOpts`. |
| `pulsar/consumer_partition.go` | Invokes the listener / triggers async close when reconnect attempts are exhausted. |
| `pulsar/consumer_test.go` | Adds tests using testcontainers to validate listener firing and auto-close on retry exhaustion. |
```go
if maxRetry == 0 || bo.IsMaxBackoffReached() {
	pc.metrics.ConsumersReconnectMaxRetry.Inc()
	if pc.options.maxReconnectToBrokerListener != nil {
		pc.options.maxReconnectToBrokerListener(pc.parentConsumer, err)
	}
```
I confirmed the bug. With `MaxReconnectToBroker` unset, `bo.IsMaxBackoffReached()` stays true and the branch returns `err`, so the retry loop re-enters indefinitely: the listener fires on every iteration and close goroutines pile up.

I didn't take the "make it terminal" option because it would silently change the default contract of retrying forever at the backoff ceiling.

Fix: a closure-local `maxRetryNotified` bool gates only the listener and the close goroutine; the retry flow and the metric are untouched. A plain bool is fine since `opFn` runs sequentially.

Added `TestConsumerMaxReconnectToBrokerListenerFiresOnceWhenBackoffMaxed`: `MaxReconnectToBroker` unset plus an always-maxed backoff; it kills the broker, waits 3 s past the first invocation, and asserts the counter == 1. Passes.
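For readers following along, a standalone sketch of the gating pattern described above (self-contained, not the actual patch; `internal.Retry` re-entering the closure is simulated by a plain loop):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

func main() {
	// In the patch, maxRetryNotified lives in the reconnectToBroker closure.
	// A plain bool suffices: internal.Retry calls opFn sequentially, never
	// concurrently, so there is no race to protect against.
	maxRetryNotified := false
	listener := func(err error) { fmt.Println("listener fired:", err) }

	for i := 0; i < 5; i++ { // stands in for internal.Retry re-entering opFn
		err := errors.New("connection refused")
		backoffMaxed := true // models bo.IsMaxBackoffReached() staying true

		if backoffMaxed {
			// The metric increment stays here, ungated, as in the patch.
			if !maxRetryNotified {
				maxRetryNotified = true
				listener(err)
				// go pc.parentConsumer.Close() would be gated here as well.
			}
		}
		time.Sleep(10 * time.Millisecond)
	}
	// Output shows a single "listener fired" line despite five iterations.
}
```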
```go
if pc.options.closeConsumerOnMaxReconnectToBroker {
	go pc.parentConsumer.Close()
}
```
Addressed by the same patch as the previous thread: `go pc.parentConsumer.Close()` and `maxReconnectToBrokerListener` are both inside one `if !maxRetryNotified { … }` block, so they fire together exactly once per reconnect cycle. I used a plain bool rather than `sync.Once`/atomic because `opFn` is invoked sequentially by `internal.Retry`; there is no concurrent entry into the closure and no race to protect against. Happy to swap to `sync.Once` if you'd prefer the explicit signal.
```go
// (e.g. recreate the consumer). Only fires when MaxReconnectToBroker is set to a finite value
// or when the backoff policy signals IsMaxBackoffReached.
MaxReconnectToBrokerListener func(consumer Consumer, err error)

// CloseConsumerOnMaxReconnectToBroker, when true, automatically closes the consumer after
// exhausting all reconnect attempts. The close happens asynchronously after
// MaxReconnectToBrokerListener (if set) returns. Default: false.
```
Addressed in a new commit instead of committing the suggestion directly, because the suggestion was outdated by the previous commit.
```go
req := testcontainers.ContainerRequest{
	Image:        getPulsarTestImage(),
	ExposedPorts: []string{"6650/tcp", "8080/tcp"},
	WaitingFor:   wait.ForExposedPort(),
	Cmd:          []string{"bin/pulsar", "standalone", "-nfw", "--advertised-address", "localhost"},
}
c, err := testcontainers.GenericContainer(context.Background(), testcontainers.GenericContainerRequest{
	ContainerRequest: req,
	Started:          true,
})
require.NoError(t, err)
endpoint, err := c.PortEndpoint(context.Background(), "6650", "pulsar")
require.NoError(t, err)
```
Addressed in the latest commit: registered `t.Cleanup` right after container creation in all three tests as a best-effort safety net (logs on error), and changed the in-body `_ = c.Terminate(...)` to `require.NoError(t, c.Terminate(...))`, since that call is part of the scenario.
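For reference, a sketch of that pattern against the snippet above (the log wording is illustrative):

```go
c, err := testcontainers.GenericContainer(context.Background(), testcontainers.GenericContainerRequest{
	ContainerRequest: req,
	Started:          true,
})
require.NoError(t, err)

// Best-effort safety net registered immediately after creation: it runs even
// if the test fails before the in-body terminate, and only logs on error
// because the container may already have been terminated as part of the scenario.
t.Cleanup(func() {
	if err := c.Terminate(context.Background()); err != nil {
		t.Logf("cleanup: failed to terminate container: %v", err)
	}
})
```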
```go
MaxReconnectToBroker: &maxRetry,
BackOffPolicyFunc: func() backoff.Policy {
	return newTestBackoffPolicy(100*time.Millisecond, 1*time.Second)
},
```
Already addressed in a follow-up commit: added `TestConsumerMaxReconnectToBrokerListenerFiresOnceWhenBackoffMaxed` with a `maxBackoffReachedPolicy` whose `IsMaxBackoffReached()` always returns true. It asserts the listener fires exactly once even after many additional retry iterations, covering the previously untested exhaustion path.
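A sketch of what such a policy can look like (the actual `maxBackoffReachedPolicy` in the commit may differ; `backoff.Policy` is assumed to be the two-method interface the test hunk above plugs into):

```go
// maxBackoffReachedPolicy reports the backoff ceiling as already reached on
// every call, forcing the reconnect loop into its exhaustion branch
// immediately while still retrying at a short, fixed interval.
type maxBackoffReachedPolicy struct{}

func (maxBackoffReachedPolicy) Next() time.Duration       { return 100 * time.Millisecond }
func (maxBackoffReachedPolicy) IsMaxBackoffReached() bool { return true }
```

It would then be wired in via `BackOffPolicyFunc: func() backoff.Policy { return maxBackoffReachedPolicy{} }`.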
```go
	ContainerRequest: req,
	Started:          true,
})
require.NoError(t, err)
```
Already addressed in the latest commit: `t.Cleanup` is registered immediately after container creation in `TestConsumerMaxReconnectToBrokerAutoClose` (and the other two reconnect tests) as a best-effort terminate that logs on error, with the in-body terminate now checked via `require.NoError`.
Hi @PavelZeger, could you follow up on the Java client logic?
Sure, @nodece.

What's done now

What I left for later

Does this work for you? If you want me to go further on 3 or 4 in this PR, let me know; otherwise I'll open a follow-up for 4 after this PR.

Sources:
Fixes #1481
Motivation
When a `partitionConsumer` exhausts all broker reconnection attempts (controlled by `MaxReconnectToBroker`), the client silently increments a metric and exits the retry loop, leaving the consumer alive but unable to receive messages. There is no way for application code to detect this failure or react to it (e.g. recreate the consumer or alert on-call).
Modifications
Added two opt-in fields to `ConsumerOptions`:

- `MaxReconnectToBrokerListener func(consumer Consumer, err error)`: a callback invoked exactly once, on the same internal goroutine, immediately after the last reconnect attempt fails. The `consumer` argument is the parent `Consumer` the application holds, and `err` is the last connection error. The listener fires whenever `MaxReconnectToBroker` retries are exhausted or when the configured backoff policy signals `IsMaxBackoffReached`.
- `CloseConsumerOnMaxReconnectToBroker bool`: when `true`, automatically closes the consumer after exhausting reconnect attempts. The close runs asynchronously after `MaxReconnectToBrokerListener` (if set) returns. Internally `parentConsumer.Close()` is launched in a goroutine; this cancels the consumer's context, which unblocks the `internal.Retry` loop, allowing `runEventsLoop` to process the close request without deadlocking.

Both fields default to their zero values (`nil`/`false`), so there is no behaviour change for existing consumers. A usage sketch follows.
Why points 3 and 4 from the issue are not implemented in this PR
The original issue suggested two additional fixes:
3. Propagate the error to the consumer's error channel

The `Consumer` interface does not expose an error channel. Adding one would be a breaking API change: every implementation (`consumer`, `consumer_multitopic`, `consumer_regex`, `consumer_zero_queue`) would need a new method, and all existing callers that perform a type assertion or embed the interface would break. This is a larger design decision that warrants its own issue and a deprecation/migration path. The `MaxReconnectToBrokerListener` callback achieves the same observable outcome (application code is notified of the failure) without modifying the public interface.
4. Update consumer state to a terminal "failed" state
There is currently no `consumerFailed` state in the internal state machine (`consumerInit → consumerReady → consumerClosing → consumerClosed`). Introducing a new terminal state would require updating every state guard in `consumer_partition.go` (there are more than a dozen) as well as the multi-topic and regex consumer wrappers. In practice, enabling `CloseConsumerOnMaxReconnectToBroker` already transitions the consumer through `consumerClosing → consumerClosed`, which is the correct terminal state and prevents any further operations on a dead consumer. A separate "failed" state that carries an error cause can be considered as a follow-up if observability tooling needs to distinguish a failed-closed consumer from a normally-closed one.
Verifying this change
This change adds behaviour that requires a running broker to test end-to-end. Unit-level verification can be done by constructing a `partitionConsumerOpts` directly with a `maxReconnectToBroker` of 1 and asserting that the listener fires and the consumer closes. Integration test coverage is tracked as a follow-up.
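A sketch of what such an end-to-end check can look like, with the container creation elided (it yields `c` and `endpoint` as in the snippet quoted in the review thread); the timeout is illustrative:

```go
func TestConsumerListenerFiresOnReconnectExhaustion(t *testing.T) {
	// ... start the standalone broker container as shown above, yielding c and endpoint ...

	client, err := pulsar.NewClient(pulsar.ClientOptions{URL: endpoint})
	require.NoError(t, err)
	defer client.Close()

	fired := make(chan error, 1)
	maxRetry := uint(1)
	_, err = client.Subscribe(pulsar.ConsumerOptions{
		Topic:                "test-reconnect-listener",
		SubscriptionName:     "sub",
		MaxReconnectToBroker: &maxRetry,
		MaxReconnectToBrokerListener: func(_ pulsar.Consumer, err error) {
			fired <- err
		},
		// The consumer closes itself after the listener returns.
		CloseConsumerOnMaxReconnectToBroker: true,
	})
	require.NoError(t, err)

	// Kill the broker so the single reconnect attempt is exhausted.
	require.NoError(t, c.Terminate(context.Background()))

	select {
	case lastErr := <-fired:
		require.Error(t, lastErr) // the last connection error is surfaced
	case <-time.After(30 * time.Second):
		t.Fatal("listener did not fire after reconnect exhaustion")
	}
}
```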
Does this pull request potentially affect one of the following parts:
- The public API: yes, two new `ConsumerOptions` fields.

Documentation
- GoDoc on the new `ConsumerOptions` fields.