Credential api use of tokens#1913

Open
yoks wants to merge 11 commits into
NVIDIA:mainfrom
yoks:credential-api-use-of-tokens

@yoks
Contributor

@yoks yoks commented May 23, 2026

Description

First phase of SessionTokens API support.

Enforces GetBmcCredentials to use SessionService tokens, meaning if BMC does not support Session, API will error out.

The API first gets the SPIFFE identifier of the calling service, then tries to rotate the token: if there is a token in the database (a new table stores token IDs), it revokes the old token and issues a new one. If there is no token, it just issues a new one. Clients are expected to call this API to rotate expired tokens themselves (on auth failure).
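
A minimal sketch of the rotation step described above, assuming an in-memory map in place of the new database table and a hypothetical `SessionService` trait in place of the real BMC client. `SessionStore`, `FakeBmc`, and the method names are illustrative and are not taken from this PR.

```rust
use std::collections::HashMap;

/// Key for one stored session: (caller SPIFFE ID, BMC MAC address).
type SessionKey = (String, String);

/// Hypothetical stand-in for the Redfish SessionService of one BMC.
pub trait SessionService {
    /// Creates a session and returns (session_odata_id, token).
    fn create_session(&mut self) -> Result<(String, String), String>;
    /// Deletes the session with the given odata id.
    fn delete_session(&mut self, odata_id: &str) -> Result<(), String>;
}

/// In-memory stand-in for the table of stored token IDs.
#[derive(Default)]
pub struct SessionStore {
    sessions: HashMap<SessionKey, String>, // value: session_odata_id
}

impl SessionStore {
    /// Revokes the caller's previous session (if any), then issues a new token.
    pub fn rotate<S: SessionService>(
        &mut self,
        spiffe_id: &str,
        bmc_mac: &str,
        bmc: &mut S,
    ) -> Result<String, String> {
        let key = (spiffe_id.to_string(), bmc_mac.to_string());
        if let Some(prior_id) = self.sessions.remove(&key) {
            // Best effort: the old session may already have expired on the BMC.
            let _ = bmc.delete_session(&prior_id);
        }
        let (odata_id, token) = bmc.create_session()?;
        self.sessions.insert(key, odata_id);
        Ok(token)
    }
}

/// Fake BMC used only to exercise the sketch.
#[derive(Default)]
pub struct FakeBmc {
    next: u32,
    pub active: Vec<String>,
}

impl SessionService for FakeBmc {
    fn create_session(&mut self) -> Result<(String, String), String> {
        self.next += 1;
        let id = format!("/redfish/v1/SessionService/Sessions/{}", self.next);
        self.active.push(id.clone());
        Ok((id, format!("token-{}", self.next)))
    }

    fn delete_session(&mut self, odata_id: &str) -> Result<(), String> {
        self.active.retain(|s| s != odata_id);
        Ok(())
    }
}
```

Calling `rotate` twice with the same SPIFFE ID and MAC leaves one active session on the fake BMC; a different SPIFFE ID gets its own session for the same BMC.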

Another major change is the beginning of moving the AvoidLockout circuit breaker into this function, since in the future this should be the only place that handles Basic credentials. Auth tokens themselves could cause lockout. This is also why we prefer not to share credentials at all (to consolidate the circuit breaker behavior here).
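
For readers unfamiliar with the pattern, here is a generic lockout-avoiding circuit breaker. It is a sketch under assumptions, not the AvoidLockout implementation from this PR: the thresholds, the `LockoutBreaker` type, and its methods are all hypothetical.

```rust
use std::time::{Duration, Instant};

/// Hypothetical guard that stops Basic-auth attempts after repeated failures.
pub struct LockoutBreaker {
    max_failures: u32,
    cool_down: Duration,
    consecutive_failures: u32,
    opened_at: Option<Instant>,
}

impl LockoutBreaker {
    pub fn new(max_failures: u32, cool_down: Duration) -> Self {
        Self {
            max_failures,
            cool_down,
            consecutive_failures: 0,
            opened_at: None,
        }
    }

    /// Returns true if a Basic-auth attempt may be made at `now`.
    pub fn allow(&mut self, now: Instant) -> bool {
        match self.opened_at {
            // Breaker is open and still cooling down: refuse the attempt.
            Some(opened) if now.duration_since(opened) < self.cool_down => false,
            Some(_) => {
                // Cool-down elapsed: close the breaker and allow a fresh attempt.
                self.opened_at = None;
                self.consecutive_failures = 0;
                true
            }
            None => true,
        }
    }

    /// Resets the failure count after a successful login.
    pub fn record_success(&mut self) {
        self.consecutive_failures = 0;
        self.opened_at = None;
    }

    /// Counts one failed login and opens the breaker at the threshold.
    pub fn record_auth_failure(&mut self, now: Instant) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.max_failures {
            self.opened_at = Some(now);
        }
    }
}
```

Passing `now` explicitly keeps the sketch deterministic; keeping a single breaker next to the only code path that uses Basic credentials is the consolidation the paragraph above argues for.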

Should, in general, work for sharded environments, but it is preferred that specific API instances work with a specific set of BMC MACs, to avoid races/simultaneous refreshes and DB locks.

To get BmcCredentials after this PR is merged, each service is required to have a SPIFFE identifier; this ensures that each service gets its own credentials, per SPIFFE identifier. This also adds a requirement for all sharded services to maintain a proper sharding strategy per SPIFFE identifier (e.g. shards should not overlap in BMCs); otherwise credentials will be rotated repeatedly, which can cause a credential reissue storm.
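
The non-overlap requirement could be checked with something like the following sketch. The input layout (SPIFFE ID, shard name, list of BMC MACs) and the function name are assumptions made for illustration; the PR does not define such a check.

```rust
use std::collections::HashMap;

/// Returns (spiffe_id, bmc_mac) pairs that appear in more than one shard
/// of the same SPIFFE identity. Input triples are (spiffe_id, shard, macs).
pub fn overlapping_bmcs(shards: &[(&str, &str, Vec<&str>)]) -> Vec<(String, String)> {
    // Remembers which shard last claimed each (spiffe_id, mac) pair.
    let mut owner: HashMap<(&str, &str), &str> = HashMap::new();
    let mut overlaps = Vec::new();
    for (spiffe, shard, macs) in shards {
        for mac in macs {
            if let Some(previous_shard) = owner.insert((*spiffe, *mac), *shard) {
                if previous_shard != *shard {
                    overlaps.push((spiffe.to_string(), mac.to_string()));
                }
            }
        }
    }
    overlaps.sort();
    overlaps.dedup();
    overlaps
}
```

Two services with different SPIFFE identifiers may share a BMC, since each holds its own token; only overlap within one identity is reported.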

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Implements a big chunk of: #460

Should finally fix this bug for good: #1292

Breaking Changes

  • This PR contains breaking changes
    The Credentials API no longer returns passwords. It explicitly does not work with BMCs that do not support SessionService. We can add a flag in the future to make an exception for that.

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@yoks yoks requested a review from a team as a code owner May 23, 2026 02:08
@yoks yoks requested a review from Matthias247 May 23, 2026 02:09
yoks added 2 commits May 22, 2026 19:21
Signed-off-by: ianisimov <ianisimov@nvidia.com>
Signed-off-by: ianisimov <ianisimov@nvidia.com>
pub bmc_mac_address: MacAddress,
pub session_odata_id: String,
pub issued_at: DateTime<Utc>,
}
Contributor

nit: It is better to move the definition of StoredSession to the model crate. At least, this is the pattern we use for most data types.

Contributor

@kensimon kensimon left a comment

Questions about some of the main issues, I haven't reviewed the whole PR yet.

Comment thread crates/health/src/discovery/spawn.rs Outdated
let key = endpoint.key();
let endpoint_arc = endpoint.clone();

let credentials = endpoint.credentials().ok_or_else(|| {
Contributor

I think these will be constructed without credentials (or at least I don't see a code path that sets them prior to spawn_collectors_for_endpoint getting called), should we call endpoint.ensure_credentials().await?; first here?

Contributor Author

Yeah, I moved the credentials initialization around a few times, and I think I forgot to call init in spawn on the last move.

Comment thread crates/health/src/api_client.rs Outdated

Self { client }
let credential_provider: Arc<dyn CredentialProvider> = Arc::new(ApiCredentialProvider {
client: client.clone(),
Contributor

(See previous comment) I think the credentials are initialized here, but we don't call endpoint.ensure_credentials().await?; between here and run_discovery_iteration.

.into_iter()
.find(|m| m.raw().odata_id() == &prior_id)
{
if let Err(err) = prior_session.delete().await {
Contributor

Health creates multiple instances of AuthRefreshingBmc for each endpoint (one for each collector), each of which gets a separate Arc<RwLock<BmcCredentials>>, each constructed separately (i.e. not the same Arc). I think this means they'll each be fetching credentials independently for the same endpoint, minting different tokens. If we're rotating the credentials every time we fetch them, wouldn't that result in a sort of "rotation storm" where they keep invalidating each other's credentials?

I think if we're going to do rotation this way we need to make sure the HttpBmc client shares a single Arc<RwLock> for the same endpoint, rather than having health's create_bmc have a new set of credentials for every collector.

Contributor

To elaborate, AuthRefreshingBmc::refresh_credentials sets the credentials via self.endpoint.refresh(), but then its self.inner HttpBmc object gets its own set_credentials() call, with a separate Arc... so when it invalidates the old token, the other AuthRefreshingBmc's inner's are not refreshed with the new token. I think that means the other collector's next BMC call will return a 401 and then it would refresh, invalidating the first one, and it would keep going like that.
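
To illustrate the sharing being asked for, here is a reduced sketch in which every collector for an endpoint holds a clone of one `Arc<RwLock<...>>`. `BmcCredentials` is cut down to a single token field, and `Collector` and `SharedCredentials` are hypothetical names rather than the types in the health crate.

```rust
use std::sync::{Arc, RwLock};

/// Simplified stand-in for the real credentials type.
#[derive(Clone, Debug, PartialEq)]
pub struct BmcCredentials {
    pub token: String,
}

/// One shared handle per endpoint, cloned into every collector.
pub type SharedCredentials = Arc<RwLock<BmcCredentials>>;

/// Hypothetical collector that reads and writes the shared handle.
pub struct Collector {
    credentials: SharedCredentials,
}

impl Collector {
    pub fn new(credentials: SharedCredentials) -> Self {
        Self { credentials }
    }

    /// Token the collector would send with its next BMC request.
    pub fn current_token(&self) -> String {
        self.credentials.read().expect("lock poisoned").token.clone()
    }

    /// Installs refreshed credentials; all collectors sharing the Arc see them.
    pub fn install_refreshed(&self, fresh: BmcCredentials) {
        *self.credentials.write().expect("lock poisoned") = fresh;
    }
}
```

With `Arc::clone` of the same handle passed to two collectors, a refresh installed through one is what the other reads on its next request, so the second collector does not hit a 401 with a revoked token.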

Contributor

@kensimon kensimon left a comment

(Updating my review to Request changes... I'm not sure if I'm right about my feedback but it's probably better if this didn't merge until we discuss)

Comment thread crates/health/src/endpoint/model.rs Outdated
pub(crate) credentials: Arc<RwLock<Option<BmcCredentials>>>,
pub(crate) provider: Arc<dyn CredentialProvider>,
// Neded to ensure only one collector fetches endpoint
pub(crate) fetch_lock: Arc<AsyncMutex<()>>,
Contributor

Instead of an out-of-band mutex on an empty tuple, could we instead lock self.credentials before fetching and it would accomplish the same thing? (We'd need to make self.credentials a tokio::RwLock, but that's it)

That is, instead of doing:

let _guard = self.fetch_lock.lock().await;
let fresh = self.provider.fetch_credentials(&self.addr).await?;
*self.credentials.write().expect("lock poisoned") = Some(fresh.clone());

Couldn't we just guard on self.credentials itself?

let mut credentials = self.credentials.write().await;
let fresh = self.provider.fetch_credentials(&self.addr).await?;
*credentials = Some(fresh.clone());
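
A synchronous, std-only analogue of that second form, for illustration: the write lock is held across the fetch, so refreshes are serialized. The "already replaced" check is an extra assumption that goes beyond the suggestion above, and `EndpointCredentials`, `fetch`, and `refresh_after_rejection` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::RwLock;

/// Hypothetical per-endpoint credential cell guarded by a single lock.
pub struct EndpointCredentials {
    token: RwLock<Option<String>>,
    fetches: AtomicUsize,
}

impl EndpointCredentials {
    pub fn new() -> Self {
        Self {
            token: RwLock::new(None),
            fetches: AtomicUsize::new(0),
        }
    }

    /// Stand-in for provider.fetch_credentials: each call mints a new token.
    fn fetch(&self) -> String {
        let n = self.fetches.fetch_add(1, Ordering::SeqCst) + 1;
        format!("token-{}", n)
    }

    /// Called after `rejected` was refused by the BMC (or with None at startup).
    /// The write lock is held across the fetch, so only one caller rotates.
    pub fn refresh_after_rejection(&self, rejected: Option<&str>) -> String {
        let mut guard = self.token.write().expect("lock poisoned");
        if let Some(current) = guard.as_deref() {
            if Some(current) != rejected {
                // Someone else already replaced the rejected token: reuse theirs.
                return current.to_string();
            }
        }
        let fresh = self.fetch();
        *guard = Some(fresh.clone());
        fresh
    }

    /// Number of times the provider stand-in was called.
    pub fn fetch_count(&self) -> usize {
        self.fetches.load(Ordering::SeqCst)
    }
}
```

If two callers both report the same rejected token, the first one fetches and the second one receives the token the first installed, so the provider is called once for that rejection.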

Contributor Author

All this layered synchronization needs to go. I will try to wrap it all inside the BMC: it has the most important function (set_credentials), and we need to synchronize on it and not leak it outside.

@yoks
Contributor Author

yoks commented May 26, 2026

@kensimon thanks for the review. I think I puzzled myself with several layers of collectors. I was completely focused on running just one BMC collector in all my tests and forgot about this issue (with multiple collectors), so that one slipped through.

I need to rethink how this whole credentials refresh works. It is several layers of historical (pre-token) refreshes, so it is better to rewrite it from scratch.

@Matthias247
Contributor

Enforces GetBmcCredentials to use SessionService tokens, meaning if BMC does not support Session, API will error out.

Do we have to introduce that constraint?

My assumption was that most callers now get credentials from some abstract credential provider per entity, and that provider could then hand out either tokens or username/password, depending on what's available.
In that case it could be up to the provider to check what's available. If tokens are available, manage them (including rotating them) and hand them out. If not, hand out username/password.
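
A rough sketch of that provider behavior, assuming a two-variant credentials enum and a capability flag per BMC. `BmcInfo`, `credentials_for`, and the enum layout are hypothetical and are not the types in this PR.

```rust
/// Hypothetical credentials type with a token variant and a Basic variant.
#[derive(Debug, Clone, PartialEq)]
pub enum BmcCredentials {
    SessionToken(String),
    Basic { username: String, password: String },
}

/// What the provider knows about one BMC (hypothetical).
pub struct BmcInfo {
    pub supports_session_service: bool,
}

/// Hands out a session token when the BMC supports SessionService,
/// otherwise falls back to username/password.
pub fn credentials_for(
    bmc: &BmcInfo,
    issue_token: impl FnOnce() -> Result<String, String>,
    basic: (&str, &str),
) -> Result<BmcCredentials, String> {
    if bmc.supports_session_service {
        // Token issuing (and rotation) stays inside the provider.
        issue_token().map(BmcCredentials::SessionToken)
    } else {
        Ok(BmcCredentials::Basic {
            username: basic.0.to_string(),
            password: basic.1.to_string(),
        })
    }
}
```

Callers match on the returned variant, so they do not need to know ahead of time which kind of credential a given BMC supports.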

@Matthias247
Contributor

Can you add a bit more detail to the description of when sessions are established and tokens are rotated. Eg.

  • session establishment and token rotation is done in site-explorer
  • session establishment and token rotation is done by any callpath in current nico-core which fetches credentials, including site-explorer, state-handler code which interacts with BMCs, fetchBmcCredentials APIs, etc
  • a dedicated new process which is supposed to manage and rotate credentials

I think it probably works either way in the "there is just 1 nico-core instance" case, but for sharding things might become more interesting (because site-explorer sharding would not necessarily match how hw-health is sharded).

@yoks
Contributor Author

yoks commented May 26, 2026

Enforces GetBmcCredentials to use SessionService tokens, meaning if BMC does not support Session, API will error out.

Do we have to introduce that constraint?

My assumption was that most callers now get credentials from some abstract credential provider per entity, and that provider could then hand out either tokens or username/password, depending on what's available. In that case it could be up to the provider to check what's available. If tokens are available, manage them (including rotating them) and hand them out. If not, hand out username/password.

This constraint is artificial. If we hide/enforce it with a config param, would that be OK? The thought is that exposing Basic credentials prevents us from ensuring they are not locked out or abused in any way. It is also easier to add new integrations that would use credentials.

@yoks
Contributor Author

yoks commented May 26, 2026

Can you add a bit more detail to the description of when sessions are established and tokens are rotated. Eg.

  • session establishment and token rotation is done in site-explorer
  • session establishment and token rotation is done by any callpath in current nico-core which fetches credentials, including site-explorer, state-handler code which interacts with BMCs, fetchBmcCredentials APIs, etc
  • a dedicated new process which is supposed to manage and rotate credentials

I think it probably works either way in the "there is just 1 nico-core instance" case, but for sharding things might become more interesting (because site-explorer sharding would not necessarily match how hw-health is sharded).

As long as each shard works with only one BMC it should be fine. Tokens should be issued per entity (in my case, per SPIFFE identifier). So for current NICo it would be a NICo-Core token, but if explorer some day becomes a separate service, it should have its own token.

@yoks
Contributor Author

yoks commented May 27, 2026

@kensimon I removed most of the synchronization logic and made Endpoint own the BMCClient, which is now the only place where auth credentials are rejected and updated/fetched, via the provided credential provider.

Also, for NVUE I modelled it in a similar fashion, with a credentials provider refresh.

@Matthias247
Contributor

should be fine. Tokens should be issued per entity (in my case, per SPIFFE identifier). So for current NICo it would be a NICo-Core token, but if explorer some day becomes a separate service, it should have its own token.

I meant adding these details to the PR description. Right now I'd really need to reverse engineer the code to understand how and when tokens are issued.

4 participants