Port providers to New Failover WIP#8532
Port providers to New Failover WIP#8532justin-stephenson wants to merge 24 commits intoSSSD:failoverfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a new failover mechanism for SSSD providers, exemplified by a new 'minimal' provider. The changes involve significant refactoring of error handling in existing LDAP and AD providers, removing dp_error parameters and related macros, and integrating a new sss_failover_transaction for connection management. The new failover logic is also integrated into various LDAP and IPA components for tasks like account info, enumeration, and sudo rules. However, several critical issues were identified: test hooks that intentionally trigger server failures are present in production code paths (ldap_id.c, minimal_id_services.c) and should be removed or guarded. Additionally, key failover options, DNS discovery domains, and server URIs are hardcoded (failover.c, failover_group.c, failover_srv.c, ldap_init.c), limiting configurability and dynamic behavior, which is unsuitable for a production-ready feature.
e345d10 to
90d41a1
Compare
|
|
||
| count = talloc_array_length(fctx->groups); | ||
|
|
||
| for (slot = 0; fctx->groups[slot] != NULL && slot < count; slot++) { |
Check failure
Code scanning / CodeQL
Array offset used before range check High
| } | ||
|
|
||
| if (state->sdap_ret == ENOENT && state->noexist_delete == true) { | ||
| if (ret == ENOENT && state->noexist_delete == true) { |
Check warning
Code scanning / CodeQL
Comparison result is always the same Warning
517f6dd to
62796d5
Compare
|
I pushed the failover feature branch, can you please rebase and use it as the base branch? |
pbrezina
left a comment
There was a problem hiding this comment.
Hi Justin, it's a lot of code and you didn't want a full review yet so I just scrolled over it.
It looks good. I have some concerns about the first two patches that converts the interface to a single return code - it looks like some important logic was removed as well at few places.
And the sdap_id_op code is responsible of terminating the retry logic and bringing SSSD offline. I think it should, at some point, return ERR_NO_MORE_SERVER. And the LDAP search code should return ERR_SERVER_FAILURE so it can be picked up by sss_failover_transaction.
| ret = sdap_id_op_done(state->sdap_op, ret, &dp_error); | ||
| ret = sdap_id_op_done(state->sdap_op, ret); | ||
| if (ret != EOK) { | ||
| if (dp_error == DP_ERR_OK) { |
There was a problem hiding this comment.
This looks like some retry logic was accidentally removed?
There was a problem hiding this comment.
I changed this to the following in the commit Port LDAP provider code to new failover
if (ret == ERR_NO_MORE_SERVERS) {
ret = sdap_access_decide_offline(state->cached_access);
} else if (ret == ERR_SERVER_FAILURE) {
goto done;
} else if (ret != EOK) {
DEBUG(SSSDBG_CRIT_FAILURE,
"sdap_id_op_done() returned error [%d][%s]\n",
ret, sss_strerror(ret));
goto done;
}
| dp_err = DP_ERR_FATAL; | ||
| } | ||
|
|
||
| if (dp_err == DP_ERR_OK && retval != EOK) { |
There was a problem hiding this comment.
This file sdap_id_op.c is now removed
src/providers/ldap/ldap_init.c
Outdated
| sdap_account_info_handler_send, sdap_account_info_handler_recv, id_ctx, | ||
| struct sdap_id_ctx, struct dp_id_data, struct dp_reply_std); | ||
| sdap_account_info_handler_send, sdap_account_info_handler_recv, init_ctx, | ||
| struct ldap_init_ctx, struct dp_id_data, struct dp_reply_std); |
There was a problem hiding this comment.
minimal_init_ctx was created to do minimal changes required for the minimal provider, without completely refactoring it. Here, you should refactor the id_ctx to hold the necessary data (failover contexts) and remove old failover bits.
There was a problem hiding this comment.
Updated. I could not yet fully remove sdap_server_opts from sdap_id_ctx due to some refactoring needed but I created a ticket for it.
|
The end goal is to get rid of sdap_id_op code completely. Maybe you could remove the file and avoid building ipa and ad providers for now to make sure we do not depend on anything from sdap_id_op? |
Thank you.
Functions below contain logic checking the values of |
04a8ff0 to
e0e54c0
Compare
|
|
||
| static errno_t sdap_get_groups_next_base(struct tevent_req *req); | ||
| static void sdap_get_groups_ldap_connect_done(struct tevent_req *subreq); | ||
| //static void sdap_get_groups_ldap_connect_done(struct tevent_req *subreq); |
Check notice
Code scanning / CodeQL
Commented-out code Note
| errno_t ret; | ||
| struct tevent_req *req; | ||
| struct tevent_req *subreq; | ||
| //struct tevent_req *subreq; |
Check notice
Code scanning / CodeQL
Commented-out code Note
| //struct tevent_req *subreq; | ||
| struct sdap_get_groups_state *state; | ||
| struct sdap_id_conn_ctx *ldap_conn = NULL; | ||
| //struct sdap_id_conn_ctx *ldap_conn = NULL; |
Check notice
Code scanning / CodeQL
Commented-out code Note
| @@ -1798,6 +1799,7 @@ struct tevent_req *sdap_get_groups_send(TALLOC_CTX *memctx, | |||
| req); | |||
| return req; | |||
| } | |||
| */ | |||
|
|
|||
| ret = sdap_get_groups_next_base(req); | |||
|
|
|||
| @@ -1810,17 +1812,17 @@ struct tevent_req *sdap_get_groups_send(TALLOC_CTX *memctx, | |||
| return req; | |||
| } | |||
|
|
|||
| /* | |||
| static void sdap_get_groups_ldap_connect_done(struct tevent_req *subreq) | |||
| { | |||
| struct tevent_req *req; | |||
| struct sdap_get_groups_state *state; | |||
| int ret; | |||
Check notice
Code scanning / CodeQL
FIXME comment Note
| @@ -1798,6 +1799,7 @@ struct tevent_req *sdap_get_groups_send(TALLOC_CTX *memctx, | |||
| req); | |||
| return req; | |||
| } | |||
| */ | |||
|
|
|||
| ret = sdap_get_groups_next_base(req); | |||
|
|
|||
| @@ -1810,17 +1812,17 @@ struct tevent_req *sdap_get_groups_send(TALLOC_CTX *memctx, | |||
| return req; | |||
| } | |||
|
|
|||
| /* | |||
| static void sdap_get_groups_ldap_connect_done(struct tevent_req *subreq) | |||
| { | |||
| struct tevent_req *req; | |||
| struct sdap_get_groups_state *state; | |||
| int ret; | |||
Check notice
Code scanning / CodeQL
Commented-out code Note
| if (subreq == NULL) { | ||
| ret = ENOMEM; | ||
| goto immediately; | ||
| } | ||
| */ | ||
|
|
||
| tevent_req_set_callback(subreq, | ||
| sdap_ad_tokengroups_initgr_posix_sids_connect_done, | ||
| req); | ||
|
|
||
| return req; | ||
|
|
||
| immediately: | ||
| if (ret == EOK) { | ||
| tevent_req_done(req); | ||
| } else { | ||
| tevent_req_error(req, ret); | ||
| } | ||
| tevent_req_post(req, ev); | ||
|
|
||
| return req; | ||
| } | ||
|
|
||
| static void | ||
| sdap_ad_tokengroups_initgr_posix_sids_connect_done(struct tevent_req *subreq) | ||
| { | ||
| struct sdap_ad_tokengroups_initgr_posix_state *state = NULL; | ||
| struct tevent_req *req = NULL; | ||
| int ret; | ||
| int dp_error = DP_ERR_FATAL; | ||
| //int ret; | ||
|
|
Check notice
Code scanning / CodeQL
Commented-out code Note
| /* | ||
| ret = sdap_id_op_connect_recv(subreq); | ||
| talloc_zfree(subreq); | ||
|
|
Check notice
Code scanning / CodeQL
Commented-out code Note
| subreq = sdap_get_ad_tokengroups_send(state, state->ev, state->opts, | ||
| sdap_id_op_handle(state->op), | ||
| // sdap_id_op_handle(state->op), | ||
| NULL, | ||
| state->username, state->orig_dn, | ||
| state->timeout); | ||
| if (subreq == NULL) { | ||
| tevent_req_error(req, ENOMEM); | ||
| return; | ||
| } | ||
|
|
||
| tevent_req_set_callback(subreq, sdap_ad_tokengroups_initgr_posix_tg_done, | ||
| req); | ||
|
|
||
| return; | ||
| } | ||
|
|
||
| errno_t | ||
| sdap_ad_tokengroups_get_posix_members(TALLOC_CTX *mem_ctx, | ||
| struct sss_domain_info *user_domain, | ||
| size_t num_sids, | ||
| char **sids, | ||
| size_t *_num_missing, | ||
| char ***_missing, | ||
| size_t *_num_valid, | ||
| char ***_valid_groups) | ||
| { | ||
| TALLOC_CTX *tmp_ctx = NULL; | ||
| struct sss_domain_info *domain = NULL; | ||
| struct ldb_message *msg = NULL; | ||
| const char *attrs[] = {SYSDB_NAME, NULL}; | ||
| const char *name = NULL; | ||
| char *sid = NULL; | ||
| char **valid_groups = NULL; | ||
| size_t num_valid_groups; | ||
| char **missing_sids = NULL; | ||
| size_t num_missing_sids; | ||
| size_t i; | ||
| errno_t ret; | ||
|
|
||
| tmp_ctx = talloc_new(NULL); | ||
| if (tmp_ctx == NULL) { | ||
| DEBUG(SSSDBG_CRIT_FAILURE, "talloc_new() failed\n"); | ||
| ret = ENOMEM; | ||
| goto done; | ||
| } | ||
|
|
||
| num_valid_groups = 0; | ||
| valid_groups = talloc_zero_array(tmp_ctx, char*, num_sids + 1); | ||
| if (valid_groups == NULL) { | ||
| ret = ENOMEM; | ||
| goto done; | ||
| } | ||
|
|
||
| num_missing_sids = 0; | ||
| missing_sids = talloc_zero_array(tmp_ctx, char*, num_sids + 1); | ||
| if (missing_sids == NULL) { | ||
| ret = ENOMEM; | ||
| goto done; | ||
| } | ||
|
|
||
| /* For each SID check if it is already present in the cache. If yes, we | ||
| * will get name of the group and update the membership. Otherwise we need | ||
| * to remember the SID and download missing groups one by one. */ | ||
| for (i = 0; i < num_sids; i++) { | ||
| sid = sids[i]; | ||
| DEBUG(SSSDBG_TRACE_LIBS, "Processing membership SID [%s]\n", sid); | ||
|
|
||
| domain = sss_get_domain_by_sid_ldap_fallback(user_domain, sid); | ||
| if (domain == NULL) { | ||
| const char *check_dom; | ||
| const char *check_name; | ||
|
|
||
| ret = well_known_sid_to_name(sid, &check_dom, &check_name); | ||
| if (ret == EOK) { | ||
| DEBUG(SSSDBG_TRACE_FUNC, | ||
| "Skipping SID [%s][%s\\%s] which is " | ||
| "currently not handled by SSSD.\n", | ||
| sid, check_dom, check_name); | ||
| } else { | ||
| DEBUG(SSSDBG_MINOR_FAILURE, "Domain not found for SID %s\n", sid); | ||
| } | ||
| continue; | ||
| } | ||
|
|
||
| ret = sysdb_search_group_by_sid_str(tmp_ctx, domain, sid, attrs, &msg); | ||
| if (ret == EOK) { | ||
| /* we will update membership of this group */ | ||
| name = ldb_msg_find_attr_as_string(msg, SYSDB_NAME, NULL); | ||
| if (name == NULL) { | ||
| DEBUG(SSSDBG_MINOR_FAILURE, | ||
| "Could not retrieve group name from sysdb\n"); | ||
| ret = EINVAL; | ||
| goto done; | ||
| } | ||
|
|
||
| valid_groups[num_valid_groups] = sysdb_group_strdn(valid_groups, | ||
| domain->name, | ||
| name); | ||
| if (valid_groups[num_valid_groups] == NULL) { | ||
| ret = ENOMEM; | ||
| goto done; | ||
| } | ||
| num_valid_groups++; | ||
| } else if (ret == ENOENT) { | ||
| if (_missing != NULL) { | ||
| /* we need to download this group */ | ||
| missing_sids[num_missing_sids] = talloc_steal(missing_sids, | ||
| sid); | ||
| num_missing_sids++; | ||
|
|
||
| DEBUG(SSSDBG_TRACE_FUNC, "Missing SID %s will be downloaded\n", | ||
| sid); | ||
| } | ||
|
|
||
| /* else: We have downloaded missing groups but some of them may | ||
| * remained missing because they are outside of search base. We | ||
| * will just ignore them and continue with the next group. */ | ||
| } else { | ||
| DEBUG(SSSDBG_MINOR_FAILURE, "Could not look up SID %s in sysdb: " | ||
| "[%s]\n", sid, strerror(ret)); | ||
| goto done; | ||
| } | ||
| } | ||
|
|
||
| valid_groups[num_valid_groups] = NULL; | ||
| missing_sids[num_missing_sids] = NULL; | ||
|
|
||
| /* return list of missing groups */ | ||
| if (_missing != NULL) { | ||
| *_missing = talloc_steal(mem_ctx, missing_sids); | ||
| *_num_missing = num_missing_sids; | ||
| } | ||
|
|
||
| /* return list of missing groups */ | ||
| if (_valid_groups != NULL) { | ||
| *_valid_groups = talloc_steal(mem_ctx, valid_groups); | ||
| *_num_valid = num_valid_groups; | ||
| } | ||
|
|
||
| ret = EOK; | ||
|
|
||
| done: | ||
| talloc_free(tmp_ctx); | ||
| return ret; | ||
| } | ||
|
|
||
| static void | ||
| sdap_ad_tokengroups_initgr_posix_tg_done(struct tevent_req *subreq) | ||
| { | ||
| struct sdap_ad_tokengroups_initgr_posix_state *state = NULL; | ||
| struct tevent_req *req = NULL; | ||
| char **sids = NULL; | ||
| size_t num_sids = 0; | ||
| errno_t ret; | ||
|
|
||
| req = tevent_req_callback_data(subreq, struct tevent_req); | ||
| state = tevent_req_data(req, struct sdap_ad_tokengroups_initgr_posix_state); | ||
|
|
||
| ret = sdap_get_ad_tokengroups_recv(state, subreq, &num_sids, &sids); | ||
| talloc_zfree(subreq); | ||
| if (ret != EOK) { | ||
| DEBUG(SSSDBG_CRIT_FAILURE, "Unable to acquire tokengroups [%d]: %s\n", | ||
| ret, strerror(ret)); | ||
| goto done; | ||
| } | ||
|
|
||
| ret = sdap_ad_tokengroups_get_posix_members(state, state->domain, | ||
| num_sids, sids, | ||
| &state->num_missing_sids, | ||
| &state->missing_sids, | ||
| &state->num_cached_groups, | ||
| &state->cached_groups); | ||
| if (ret != EOK) { | ||
| goto done; | ||
| } | ||
|
|
||
| /* download missing SIDs */ | ||
| subreq = sdap_ad_resolve_sids_send(state, state->ev, state->id_ctx, | ||
| state->conn, | ||
| state->opts, state->domain, | ||
| state->missing_sids); | ||
| if (subreq == NULL) { | ||
| ret = ENOMEM; | ||
| goto done; | ||
| } | ||
|
|
||
| tevent_req_set_callback(subreq, sdap_ad_tokengroups_initgr_posix_sids_done, | ||
| req); | ||
|
|
||
| return; | ||
|
|
||
| done: | ||
| if (ret != EOK) { | ||
| tevent_req_error(req, ret); | ||
| return; | ||
| } | ||
|
|
||
| tevent_req_done(req); | ||
| } | ||
|
|
||
| static void | ||
| sdap_ad_tokengroups_initgr_posix_sids_done(struct tevent_req *subreq) | ||
| { | ||
| struct sdap_ad_tokengroups_initgr_posix_state *state = NULL; | ||
| struct tevent_req *req = NULL; | ||
| errno_t ret; | ||
| char **cached_groups; | ||
| size_t num_cached_groups; | ||
|
|
||
| req = tevent_req_callback_data(subreq, struct tevent_req); | ||
| state = tevent_req_data(req, struct sdap_ad_tokengroups_initgr_posix_state); | ||
|
|
||
| ret = sdap_ad_resolve_sids_recv(subreq); | ||
| talloc_zfree(subreq); | ||
| if (ret != EOK) { | ||
| DEBUG(SSSDBG_CRIT_FAILURE, "Unable to resolve missing SIDs " | ||
| "[%d]: %s\n", ret, strerror(ret)); | ||
| goto done; | ||
| } | ||
|
|
||
| ret = sdap_ad_tokengroups_get_posix_members(state, state->domain, | ||
| state->num_missing_sids, | ||
| state->missing_sids, | ||
| NULL, NULL, | ||
| &num_cached_groups, | ||
| &cached_groups); | ||
| if (ret != EOK){ | ||
| DEBUG(SSSDBG_MINOR_FAILURE, | ||
| "sdap_ad_tokengroups_get_posix_members failed [%d]: %s\n", | ||
| ret, strerror(ret)); | ||
| goto done; | ||
| } | ||
|
|
||
| state->cached_groups = concatenate_string_array(state, | ||
| state->cached_groups, | ||
| state->num_cached_groups, | ||
| cached_groups, | ||
| num_cached_groups); | ||
| if (state->cached_groups == NULL) { | ||
| ret = ENOMEM; | ||
| goto done; | ||
| } | ||
|
|
||
| /* update membership of existing groups */ | ||
| ret = sdap_ad_tokengroups_update_members(state->username, | ||
| state->sysdb, state->domain, | ||
| state->cached_groups); | ||
| if (ret != EOK) { | ||
| DEBUG(SSSDBG_MINOR_FAILURE, "Membership update failed [%d]: %s\n", | ||
| ret, strerror(ret)); | ||
| goto done; | ||
| } | ||
|
|
||
| done: | ||
| if (ret != EOK) { | ||
| tevent_req_error(req, ret); | ||
| return; | ||
| } | ||
|
|
||
| tevent_req_done(req); | ||
| } | ||
|
|
||
| static errno_t sdap_ad_tokengroups_initgr_posix_recv(struct tevent_req *req) | ||
| { | ||
| TEVENT_REQ_RETURN_ON_ERROR(req); | ||
|
|
||
| return EOK; | ||
| } | ||
|
|
||
| struct sdap_ad_get_domain_local_groups_state { | ||
| struct tevent_context *ev; | ||
| struct sdap_id_conn_ctx *conn; | ||
| struct sdap_options *opts; | ||
| struct sdap_id_op *op; | ||
| struct sysdb_ctx *sysdb; |
Check notice
Code scanning / CodeQL
Commented-out code Note
| @@ -1244,26 +1251,25 @@ | |||
| struct tevent_req); | |||
| struct sdap_ad_get_domain_local_groups_state *state = tevent_req_data(req, | |||
| struct sdap_ad_get_domain_local_groups_state); | |||
| int dp_error = DP_ERR_FATAL; | |||
| /* | |||
| int ret; | |||
|
|
|||
| ret = sdap_id_op_connect_recv(subreq, &dp_error); | |||
| ret = sdap_id_op_connect_recv(subreq); | |||
| talloc_zfree(subreq); | |||
|
|
|||
| if (ret != EOK) { | |||
Check notice
Code scanning / CodeQL
Commented-out code Note
Check notice
Code scanning / CodeQL
Commented-out code Note
e0e54c0 to
aae68eb
Compare
(cherry picked from commit 0f5f3b6)
so it can be directly modified
So it can be modified later.
This crafts and implements the new failover interface, it does not provide complete implementation of the failover mechanism yet. It brings the code to a state were the public and private interfaces are stable, working and testable so the following tasks can be split and work on in parallel. What is missing at this state: - server configuration and discovery (failover_server_group/batch/vtable_op) - server selection mechanism (sss_failover_vtable_op_server_next) - kerberos authentication - sharing servers between IPA/AD LDAP and KDC - online/offline callbacks (resolve callback should not be needed) But especially it is possible to start refactoring SSSD code to start using the new failover implementation.
Assisted by: Claude code (Sonnet 4.6)
In low level ldap search functions
The data provider handler methods now return a single output argument. Remove 'dp_error/dp_err' and 'error_message' usage across provider code. The getAccountDomain method still needs to return 'domain_name' string. Assisted by: Claude code (Sonnet 4.6)
- Servers are hardcoded
aae68eb to
6e21717
Compare
This will be done differently inside the failover code
To allow system tests to run in upstream PRCI
** Temporary commit as each provider is ported to new failover ** These provider tests will often fail because they have not yet been ported to the new failover code, and they call into ldap functions which are utilizing the new failover. For example crash in : #0 0x00007fdb8d72cb41 in sss_failover_transaction_ex_send () from /usr/lib64/sssd/libsss_ldap_common.so #1 0x00007fdb8d72ccbd in sss_failover_transaction_send () from /usr/lib64/sssd/libsss_ldap_common.so #2 0x00007fdb8d6e367b in groups_by_user_send () from /usr/lib64/sssd/libsss_ldap_common.so #3 0x00007fdb8d6e6a88 in sdap_handle_acct_req_send () from /usr/lib64/sssd/libsss_ldap_common.so #4 0x00007fdb8d799076 in ipa_id_get_account_info_get_original_step () from /usr/lib64/sssd/libsss_ipa.so #5 0x00007fdb8d7a038b in ipa_id_get_account_info_send () from /usr/lib64/sssd/libsss_ipa.so #6 0x00007fdb8d7a2560 in ipa_account_info_send () from /usr/lib64/sssd/libsss_ipa.so #7 0x00007fdb8d7a2877 in ipa_account_info_handler_send () from /usr/lib64/sssd/libsss_ipa.so
* test_failover tests are expected to fail due to failover interface changes. * test_logging 'offline' tests and test_autofs__propagate_offline_status_for_multiple_domains fail because they are asserting offline related log messages which are not in the new failover code. tests/test_autofs.py::test_autofs__propagate_offline_status_for_multiple_domains (ldap) FAILED [ 35%] tests/test_failover.py::test_failover__reactivation_timeout_is_honored[None-31] (ldap) FAILED [ 39%] tests/test_failover.py::test_failover__reactivation_timeout_is_honored[15-31] (ldap) FAILED [ 39%] tests/test_failover.py::test_failover__reactivation_timeout_is_honored[60-60] (ldap) FAILED [ 39%] tests/test_failover.py::test_failover__connect_using_ipv4_second_family (ldap) FAILED [ 40%] tests/test_logging.py::test_logging__default_settings_logs_offline_errors (ldap) FAILED [ 60%] tests/test_logging.py::test_logging__default_settings_logs_to_syslog_when_ldap_is_offline (ldap) FAILED [ 60%]
This commit will be removed when AD provider is ported to new failover.
6e21717 to
fbad3ad
Compare
No description provided.