
[fix][consumer] Add reconnect failure listener and auto-close on max retry exhaustion #1490

Open
PavelZeger wants to merge 6 commits into apache:master from
PavelZeger:fix/issue-1481-consumer-reconnect-notification

Conversation

@PavelZeger

Fixes #1481

Motivation

When a partitionConsumer exhausts all broker reconnection attempts (controlled by
MaxReconnectToBroker), the client silently increments a metric and exits the retry loop,
leaving the consumer alive but unable to receive messages. There is no way for application
code to detect this failure or react to it (e.g. recreate the consumer or alert on-call).

Modifications

Added two opt-in fields to ConsumerOptions:

  • MaxReconnectToBrokerListener func(consumer Consumer, err error) — a callback invoked
    exactly once, on the same internal goroutine, immediately after the last reconnect attempt
    fails. The consumer argument is the parent Consumer the application holds, and err
    is the last connection error. The listener fires whenever MaxReconnectToBroker retries
    are exhausted or when the configured backoff policy signals IsMaxBackoffReached.

  • CloseConsumerOnMaxReconnectToBroker bool — when true, automatically closes the
    consumer after exhausting reconnect attempts. The close runs asynchronously after
    MaxReconnectToBrokerListener (if set) returns. Internally parentConsumer.Close() is
    launched in a goroutine; this cancels the consumer's context, which unblocks the
    internal.Retry loop, allowing runEventsLoop to process the close request without
    deadlocking.

Both fields default to their zero values (nil / false), so there is no behaviour change
for existing consumers.

Why points 3 and 4 from the issue are not implemented in this PR

The original issue suggested two additional fixes:

3. Propagate the error to the consumer's error channel

The Consumer interface does not expose an error channel. Adding one would be a breaking
API change: every implementation (consumer, consumer_multitopic, consumer_regex,
consumer_zero_queue) would need a new method, and all existing callers that perform a
type-assertion or embed the interface would break. This is a larger design decision that
warrants its own issue and a deprecation / migration path. The MaxReconnectToBrokerListener
callback achieves the same observable outcome (application code is notified of the failure)
without modifying the public interface.

4. Update consumer state to a terminal "failed" state

There is currently no consumerFailed state in the internal state machine
(consumerInit → consumerReady → consumerClosing → consumerClosed). Introducing a new
terminal state would require updating every state guard in consumer_partition.go
(there are more than a dozen) as well as the multi-topic and regex consumer wrappers.
In practice, enabling CloseConsumerOnMaxReconnectToBroker already transitions the
consumer through consumerClosing → consumerClosed, which is the correct terminal state
and prevents any further operations on a dead consumer. A separate "failed" state that
carries an error cause can be considered as a follow-up if observability tooling needs to
distinguish a failed-closed consumer from a normally-closed one.

Verifying this change

  • Make sure that the change passes the CI checks.

This change added behaviour that requires a running broker to test end-to-end. Unit-level
verification can be done by constructing a partitionConsumerOpts directly with a
maxReconnectToBroker of 1 and asserting the listener fires and the consumer closes.
Integration test coverage is tracked as a follow-up.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API: yes — two new opt-in fields added to ConsumerOptions
  • The schema: no
  • The default values of configurations: no
  • The wire protocol: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? GoDocs on the new ConsumerOptions fields

Contributor

Copilot AI left a comment


Pull request overview

Adds an opt-in notification/auto-close mechanism for consumers that can no longer reconnect to a broker, addressing the “silent broken consumer” failure mode described in #1481.

Changes:

  • Extends ConsumerOptions with MaxReconnectToBrokerListener (callback) and CloseConsumerOnMaxReconnectToBroker (auto-close).
  • Wires the new options into partitionConsumer.reconnectToBroker() and newPartitionConsumerOpts().
  • Adds integration-style tests that validate listener invocation and auto-close behavior when reconnect retries are exhausted.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File Description
pulsar/consumer.go Adds new public ConsumerOptions fields + GoDoc for reconnect exhaustion notification and optional auto-close.
pulsar/consumer_impl.go Propagates the new ConsumerOptions fields into partitionConsumerOpts.
pulsar/consumer_partition.go Invokes listener / triggers async close when reconnect attempts are exhausted.
pulsar/consumer_test.go Adds tests using testcontainers to validate listener firing and auto-close on retry exhaustion.


Comment thread pulsar/consumer_partition.go Outdated
Comment on lines +2103 to +2107
if maxRetry == 0 || bo.IsMaxBackoffReached() {
	pc.metrics.ConsumersReconnectMaxRetry.Inc()
	if pc.options.maxReconnectToBrokerListener != nil {
		pc.options.maxReconnectToBrokerListener(pc.parentConsumer, err)
	}
Author


I confirmed the bug. With MaxReconnectToBroker unset, bo.IsMaxBackoffReached() stays true and the branch returns err, so the retry loop re-enters indefinitely — listener fires every iteration, close goroutines pile up.

Didn't take the "make it terminal" option: it would silently change the default contract of retrying forever at the backoff ceiling.

Fix: a closure-local maxRetryNotified bool gates only the listener and the close goroutine; the retry flow and the metric are untouched. A plain bool is fine since opFn runs sequentially.

Added TestConsumerMaxReconnectToBrokerListenerFiresOnceWhenBackoffMaxed: MaxReconnectToBroker unset + always-maxed backoff, kills broker, waits 3 s past first invocation, asserts counter == 1. Passes.

Comment thread pulsar/consumer_partition.go Outdated
Comment on lines +2108 to +2110
if pc.options.closeConsumerOnMaxReconnectToBroker {
	go pc.parentConsumer.Close()
}
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed by the same patch as the previous thread: go pc.parentConsumer.Close() and maxReconnectToBrokerListener are both inside one if !maxRetryNotified { … } block, so they fire together exactly once per reconnect cycle. A plain bool is used rather than sync.Once/atomic because opFn is invoked sequentially by internal.Retry; there is no concurrent entry into the closure and no race to protect against. Happy to swap to sync.Once if you'd prefer the explicit signal.

Comment thread pulsar/consumer.go Outdated
Comment on lines +228 to +234
// (e.g. recreate the consumer). Only fires when MaxReconnectToBroker is set to a finite value
// or when the backoff policy signals IsMaxBackoffReached.
MaxReconnectToBrokerListener func(consumer Consumer, err error)

// CloseConsumerOnMaxReconnectToBroker, when true, automatically closes the consumer after
// exhausting all reconnect attempts. The close happens asynchronously after
// MaxReconnectToBrokerListener (if set) returns. Default: false.
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in a commit rather than by applying the suggestion directly, because the suggestion was outdated by the previous commit.

Comment thread pulsar/consumer_test.go
Comment on lines +5715 to +5727
req := testcontainers.ContainerRequest{
	Image:        getPulsarTestImage(),
	ExposedPorts: []string{"6650/tcp", "8080/tcp"},
	WaitingFor:   wait.ForExposedPort(),
	Cmd:          []string{"bin/pulsar", "standalone", "-nfw", "--advertised-address", "localhost"},
}
c, err := testcontainers.GenericContainer(context.Background(), testcontainers.GenericContainerRequest{
	ContainerRequest: req,
	Started:          true,
})
require.NoError(t, err)
endpoint, err := c.PortEndpoint(context.Background(), "6650", "pulsar")
require.NoError(t, err)
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest commit: registered t.Cleanup right after container creation in all three tests as a best-effort safety net (logs on error), and changed the in-body _ = c.Terminate(...) to require.NoError(t, c.Terminate(...)) since that call is part of the scenario.

Comment thread pulsar/consumer_test.go
Comment on lines +5750 to +5753
MaxReconnectToBroker: &maxRetry,
BackOffPolicyFunc: func() backoff.Policy {
	return newTestBackoffPolicy(100*time.Millisecond, 1*time.Second)
},
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already addressed in a follow-up commit: added TestConsumerMaxReconnectToBrokerListenerFiresOnceWhenBackoffMaxed with a maxBackoffReachedPolicy whose IsMaxBackoffReached() always returns true. It asserts the listener fires exactly once even after many additional retry iterations, covering the previously untested exhaustion path.

Comment thread pulsar/consumer_test.go
	ContainerRequest: req,
	Started:          true,
})
require.NoError(t, err)
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already addressed in the latest commit: t.Cleanup is registered immediately after container creation in TestConsumerMaxReconnectToBrokerAutoClose (and the other two reconnect tests) as a best-effort terminate that logs on error, with the in-body terminate now checked via require.NoError.

@nodece
Member

nodece commented May 9, 2026

Hi @PavelZeger, could you follow up on the Java client logic?

@PavelZeger
Author

PavelZeger commented May 9, 2026

Hi @PavelZeger, could you follow up on the Java client logic?

Sure, @nodece.
I pushed an update that takes the parts of your proposal we can apply without breaking the existing API:

What's done now

  1. Backoff cap no longer ends retries or fires the listener. Dropped the IsMaxBackoffReached check from reconnectToBroker. The consumer now keeps retrying forever at the capped delay, just like Java's ConnectionHandler.reconnectLater().
  2. Non-retriable errors are detected and reported. Added a check for the consumer-relevant errors that Java treats as non-retriable: AuthorizationError, TopicNotFound, TopicTerminatedError, SubscriptionNotFound, IncompatibleSchema, ConsumerBusy, InvalidTopicName, ConsumerAssignError, NotAllowedError. When the broker returns one of these, the listener fires with the error and the retry loop stops - same idea as Java's connectionFailed().

What I left for later

  1. I kept MaxReconnectToBroker as an opt-in cap on runtime reconnects. Issue #1481 ("partitionConsumer reconnectToBroker reaches max retry but no notification/closure to external user") specifically asks for a way to stop the consumer after a bounded number of failed reconnects, so removing it would break users who already set it. Can revisit if maintainers want it removed.
  2. Applying MaxReconnectToBroker (or a new LookupTimeout) at initial subscribe is a different code path. I'd rather file it as a follow-up issue so it gets its own focused review.

Does this work for you? If you want me to go further on 3 or 4 in this PR, let me know — otherwise I'll open a follow-up for 4 after this PR.





Successfully merging this pull request may close these issues.

partitionConsumer reconnectToBroker reaches max retry but no notification/closure to external user
