Conversation
(cherry picked from commit 0f5f3b6)
so it can be directly modified
So it can be modified later.
This crafts and implements the new failover interface, it does not provide complete implementation of the failover mechanism yet. It brings the code to a state were the public and private interfaces are stable, working and testable so the following tasks can be split and work on in parallel. What is missing at this state: - server configuration and discovery (failover_server_group/batch/vtable_op) - server selection mechanism (sss_failover_vtable_op_server_next) - kerberos authentication - sharing servers between IPA/AD LDAP and KDC - online/offline callbacks (resolve callback should not be needed) But especially it is possible to start refactoring SSSD code to start using the new failover implementation.
There was a problem hiding this comment.
Code Review
This pull request implements a new failover mechanism for SSSD, introducing prioritized server groups, parallelized candidate server discovery, and a transaction-based API for automated retries. It also provides a minimal provider implementation to demonstrate the new architecture. Critical logic bugs were identified in the server group resolution logic, where duplicate detection causes premature loop exit, and in the address change detection function, which currently returns inverted results.
| for (j = 0; state->group->discovered_servers[j] != NULL; j++, i++) { | ||
| found = false; | ||
| current = state->group->discovered_servers[j]; | ||
| for (k = 0; out[k] != NULL; k++) { | ||
| if (sss_failover_server_equal(out[k], current)) { | ||
| found = true; | ||
| break; | ||
| } | ||
| } | ||
|
|
||
| if (found) { | ||
| break; | ||
| } | ||
|
|
||
| out[i] = talloc_reference(out, current); | ||
| if (out[i] == NULL) { | ||
| DEBUG(SSSDBG_CRIT_FAILURE, "Out of memory!\n"); | ||
| talloc_free(out); | ||
| return ENOMEM; | ||
| } | ||
| } |
There was a problem hiding this comment.
There's a logic error in the loop that merges discovered servers and removes duplicates. The break statement on line 349 will cause the loop to terminate prematurely if a duplicate server is found, preventing any subsequent discovered servers from being added to the list.
This should be a continue, but simply changing it would also be incorrect due to the i++ in the loop's post-increment step, which would create a hole in the out array.
The loop should be refactored to correctly handle duplicates without terminating early.
for (j = 0; state->group->discovered_servers[j] != NULL; j++) {
found = false;
current = state->group->discovered_servers[j];
for (k = 0; out[k] != NULL; k++) {
if (sss_failover_server_equal(out[k], current)) {
found = true;
break;
}
}
if (found) {
continue;
}
out[i] = talloc_reference(out, current);
if (out[i] == NULL) {
DEBUG(SSSDBG_CRIT_FAILURE, "Out of memory!\n");
talloc_free(out);
return ENOMEM;
}
i++;
}| return memcmp(server->addr->binary, hostent->addr_list[0]->ipaddr, | ||
| server->addr->binary_len) == 0; |
There was a problem hiding this comment.
The logic in sss_failover_server_resolve_address_changed appears to be inverted. The function name suggests it should return true if the address has changed, but the implementation memcmp(...) == 0 returns true if the addresses are the same. This will cause incorrect behavior where address changes are not detected.
return memcmp(server->addr->binary, hostent->addr_list[0]->ipaddr,
server->addr->binary_len) != 0;|
|
||
| count = talloc_array_length(fctx->groups); | ||
|
|
||
| for (slot = 0; fctx->groups[slot] != NULL && slot < count; slot++) { |
Check failure
Code scanning / CodeQL
Array offset used before range check High
| return req; | ||
|
|
||
| done: | ||
| if (ret == EOK) { |
Check warning
Code scanning / CodeQL
Comparison result is always the same Warning
| #include "providers/failover/ldap/failover_ldap.h" | ||
|
|
||
| static errno_t | ||
| find_password_expiration_attributes(TALLOC_CTX *mem_ctx, |
Check warning
Code scanning / CodeQL
Poorly documented large function Warning
| switch (ar->entry_type & BE_REQ_TYPE_MASK) { | ||
| case BE_REQ_SERVICES: | ||
| DEBUG(SSSDBG_TRACE_FUNC, "Executing BE_REQ_SERVICES request\n"); | ||
|
|
||
| subreq = minimal_services_get_send(state, be_ctx->ev, fctx, id_ctx, | ||
| sdom, ar->filter_value, | ||
| ar->extra_value, ar->filter_type, | ||
| noexist_delete); | ||
| break; | ||
| default: /*fail*/ | ||
| ret = EINVAL; | ||
| state->err = "Invalid request type"; | ||
| DEBUG(SSSDBG_OP_FAILURE, | ||
| "Unexpected request type: 0x%X [%s:%s] in %s\n", | ||
| ar->entry_type, ar->filter_value, | ||
| ar->extra_value?ar->extra_value:"-", | ||
| ar->domain); | ||
| goto done; | ||
| } |
Check notice
Code scanning / CodeQL
No trivial switch statements Note
| switch (state->ar->entry_type & BE_REQ_TYPE_MASK) { | ||
| case BE_REQ_SERVICES: | ||
| err = "Service lookup failed"; | ||
| ret = minimal_services_get_recv(subreq); | ||
| break; | ||
| default: /* fail */ | ||
| ret = EINVAL; | ||
| break; | ||
| } |
Check notice
Code scanning / CodeQL
No trivial switch statements Note
| // TODO handle how to yield ERR_SERVER_FAILED | ||
| // ret = sdap_id_op_done(state->op, ret, &dp_error); | ||
| // if (dp_error == DP_ERR_OK && ret != EOK) { | ||
| // /* retry */ | ||
| // ret = minimal_services_get_retry(req); | ||
| // if (ret != EOK) { | ||
| // tevent_req_error(req, ret); | ||
| // return; | ||
| // } |
Check notice
Code scanning / CodeQL
Commented-out code Note
| // /* Return to the mainloop to retry */ | ||
| // return; | ||
| // } | ||
| // state->sdap_ret = ret; |
Check notice
Code scanning / CodeQL
Commented-out code Note
| // /* An error occurred. */ | ||
| // if (ret && ret != ENOENT) { | ||
| // state->dp_error = dp_error; | ||
| // tevent_req_error(req, ret); | ||
| // return; | ||
| //} |
Check notice
Code scanning / CodeQL
Commented-out code Note
This pull request is intended to be a start of a "failover" feature branch where other developers will be able to contribute.
The main failover logic works, compiles and can be tested using a "minimal" provider that is included as an example. The purpose of the "minimal" provider is only to test the failover without the need to port full provider code and itwill be removed prior pushing the contents to the master branch. See how to set it up in
minimal-provider-notes.txtand see the switch to new failover in commitminimal: switch to new failover for service lookup and user authentication- this is the minimal set of changes to get it working, but the real port should get and will require more refactoring.The work is still not finished and there is missing functionality. This functionality, however, can be implemented in small areas of code and should not require larger changes or glues in the whole code base, so this is ready for review. Remaining work is tracked at [1]. Feel free to take any of these tickets and open new tickets when you find something missing.
When reviewing, you can start with
src/providers/failover/readme.mdthat provides high level documentation of the code. And of course do not forget the design page [2].Thanks, Pavel