Skip to content

New failover implementation#8566

Open
pbrezina wants to merge 10 commits intoSSSD:failoverfrom
pbrezina:failover
Open

New failover implementation#8566
pbrezina wants to merge 10 commits intoSSSD:failoverfrom
pbrezina:failover

Conversation

@pbrezina
Copy link
Copy Markdown
Member

This pull request is intended to be a start of a "failover" feature branch where other developers will be able to contribute.

The main failover logic works, compiles and can be tested using a "minimal" provider that is included as an example. The purpose of the "minimal" provider is only to test the failover without the need to port full provider code and itwill be removed prior pushing the contents to the master branch. See how to set it up in minimal-provider-notes.txt and see the switch to new failover in commit minimal: switch to new failover for service lookup and user authentication - this is the minimal set of changes to get it working, but the real port should get and will require more refactoring.

The work is still not finished and there is missing functionality. This functionality, however, can be implemented in small areas of code and should not require larger changes or glues in the whole code base, so this is ready for review. Remaining work is tracked at [1]. Feel free to take any of these tickets and open new tickets when you find something missing.

When reviewing, you can start with src/providers/failover/readme.md that provides high level documentation of the code. And of course do not forget the design page [2].

Thanks, Pavel

pbrezina and others added 10 commits March 31, 2026 11:54
so it can be directly modified
So it can be modified later.
This crafts and implements the new failover interface,
it does not provide complete implementation of the failover
mechanism yet. It brings the code to a state were the public
and private interfaces are stable, working and testable so
the following tasks can be split and work on in parallel.

What is missing at this state:
- server configuration and discovery
  (failover_server_group/batch/vtable_op)
- server selection mechanism (sss_failover_vtable_op_server_next)
- kerberos authentication
- sharing servers between IPA/AD LDAP and KDC
- online/offline callbacks (resolve callback should not be needed)

But especially it is possible to start refactoring SSSD code to start
using the new failover implementation.
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a new failover mechanism for SSSD, introducing prioritized server groups, parallelized candidate server discovery, and a transaction-based API for automated retries. It also provides a minimal provider implementation to demonstrate the new architecture. Critical logic bugs were identified in the server group resolution logic, where duplicate detection causes premature loop exit, and in the address change detection function, which currently returns inverted results.

Comment on lines +338 to +358
for (j = 0; state->group->discovered_servers[j] != NULL; j++, i++) {
found = false;
current = state->group->discovered_servers[j];
for (k = 0; out[k] != NULL; k++) {
if (sss_failover_server_equal(out[k], current)) {
found = true;
break;
}
}

if (found) {
break;
}

out[i] = talloc_reference(out, current);
if (out[i] == NULL) {
DEBUG(SSSDBG_CRIT_FAILURE, "Out of memory!\n");
talloc_free(out);
return ENOMEM;
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

There's a logic error in the loop that merges discovered servers and removes duplicates. The break statement on line 349 will cause the loop to terminate prematurely if a duplicate server is found, preventing any subsequent discovered servers from being added to the list.

This should be a continue, but simply changing it would also be incorrect due to the i++ in the loop's post-increment step, which would create a hole in the out array.

The loop should be refactored to correctly handle duplicates without terminating early.

    for (j = 0; state->group->discovered_servers[j] != NULL; j++) {
        found = false;
        current = state->group->discovered_servers[j];
        for (k = 0; out[k] != NULL; k++) {
            if (sss_failover_server_equal(out[k], current)) {
                found = true;
                break;
            }
        }

        if (found) {
            continue;
        }

        out[i] = talloc_reference(out, current);
        if (out[i] == NULL) {
            DEBUG(SSSDBG_CRIT_FAILURE, "Out of memory!\n");
            talloc_free(out);
            return ENOMEM;
        }
        i++;
    }

Comment on lines +41 to +42
return memcmp(server->addr->binary, hostent->addr_list[0]->ipaddr,
server->addr->binary_len) == 0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The logic in sss_failover_server_resolve_address_changed appears to be inverted. The function name suggests it should return true if the address has changed, but the implementation memcmp(...) == 0 returns true if the addresses are the same. This will cause incorrect behavior where address changes are not detected.

    return memcmp(server->addr->binary, hostent->addr_list[0]->ipaddr,
                  server->addr->binary_len) != 0;


count = talloc_array_length(fctx->groups);

for (slot = 0; fctx->groups[slot] != NULL && slot < count; slot++) {

Check failure

Code scanning / CodeQL

Array offset used before range check High

This use of offset 'slot' should follow the
range check
.
return req;

done:
if (ret == EOK) {

Check warning

Code scanning / CodeQL

Comparison result is always the same Warning

Comparison is always false because ret >= 12.
#include "providers/failover/ldap/failover_ldap.h"

static errno_t
find_password_expiration_attributes(TALLOC_CTX *mem_ctx,

Check warning

Code scanning / CodeQL

Poorly documented large function Warning

Poorly documented function: fewer than 2% comments for a function of 114 lines.
Comment on lines +72 to +90
switch (ar->entry_type & BE_REQ_TYPE_MASK) {
case BE_REQ_SERVICES:
DEBUG(SSSDBG_TRACE_FUNC, "Executing BE_REQ_SERVICES request\n");

subreq = minimal_services_get_send(state, be_ctx->ev, fctx, id_ctx,
sdom, ar->filter_value,
ar->extra_value, ar->filter_type,
noexist_delete);
break;
default: /*fail*/
ret = EINVAL;
state->err = "Invalid request type";
DEBUG(SSSDBG_OP_FAILURE,
"Unexpected request type: 0x%X [%s:%s] in %s\n",
ar->entry_type, ar->filter_value,
ar->extra_value?ar->extra_value:"-",
ar->domain);
goto done;
}

Check notice

Code scanning / CodeQL

No trivial switch statements Note

This switch statement should either handle more cases, or be rewritten as an if statement.
Comment on lines +120 to +128
switch (state->ar->entry_type & BE_REQ_TYPE_MASK) {
case BE_REQ_SERVICES:
err = "Service lookup failed";
ret = minimal_services_get_recv(subreq);
break;
default: /* fail */
ret = EINVAL;
break;
}

Check notice

Code scanning / CodeQL

No trivial switch statements Note

This switch statement should either handle more cases, or be rewritten as an if statement.
Comment on lines +212 to +220
// TODO handle how to yield ERR_SERVER_FAILED
// ret = sdap_id_op_done(state->op, ret, &dp_error);
// if (dp_error == DP_ERR_OK && ret != EOK) {
// /* retry */
// ret = minimal_services_get_retry(req);
// if (ret != EOK) {
// tevent_req_error(req, ret);
// return;
// }

Check notice

Code scanning / CodeQL

Commented-out code Note

This comment appears to contain commented-out code.
Comment on lines +222 to +225
// /* Return to the mainloop to retry */
// return;
// }
// state->sdap_ret = ret;

Check notice

Code scanning / CodeQL

Commented-out code Note

This comment appears to contain commented-out code.
Comment on lines +227 to +232
// /* An error occurred. */
// if (ret && ret != ENOENT) {
// state->dp_error = dp_error;
// tevent_req_error(req, ret);
// return;
//}

Check notice

Code scanning / CodeQL

Commented-out code Note

This comment appears to contain commented-out code.
@alexey-tikhonov alexey-tikhonov self-assigned this Apr 1, 2026
@alexey-tikhonov alexey-tikhonov added Waiting for review no-backport This should go to target branch only. labels Apr 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no-backport This should go to target branch only. Waiting for review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants