Conversation

@wking wking commented Jan 22, 2026

Following up on this thread, point out that the criteria for delaying patch releases are much higher, because we don't want to stand between users and any other bugfixes that are ready to ship. Also give folks some pointers at mitigation options that are less severe than "withhold the patch fix". In some cases, there are internal process docs (e.g. for KCS creation), but I dunno if I should be linking those in this public doc?

@openshift-ci-robot

@wking: This pull request explicitly references no jira issue.


In response to this:

Following up on this thread, point out that the criteria for delaying patch releases are much higher, because we don't want to stand between users and any other bugfixes that are ready to ship. Also give folks some pointers at mitigation options that are less severe than "withhold the patch fix". In some cases, there are internal process docs (e.g. for KCS creation), but I dunno if I should be linking those in this public doc?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 22, 2026
@openshift-ci openshift-ci bot requested review from dtantsur and travier January 22, 2026 21:45
openshift-ci bot commented Jan 22, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jwforres for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking wking force-pushed the clarify-z-stream-blocker-conditions branch from 0997d41 to f8e36ad on January 22, 2026 21:47
* [Conditional update risk declarations][4.20-conditional-updates], see [the assessment process](/enhancements/update/update-blocker-lifecycle/README.md) and [the distribution tooling](/enhancements/update/targeted-update-edge-blocking.md).

Both Insights rules and conditional update risk declarations can sometimes be scoped down to the exposed subset of the fleet, to reduce distracting the subset of the fleet that is not exposed.
Both Insights rules and condtional update risk declarations are specific to updating into the risk, and neither provides pre-install warning for users consdering installing an exposed release.
Contributor

Suggested change
Both Insights rules and condtional update risk declarations are specific to updating into the risk, and neither provides pre-install warning for users consdering installing an exposed release.
Both Insights rules and conditional update risk declarations are specific to updating into the risk. There is no warning prior to a fresh install of an exposed release.

Member Author

I pushed the "condtional" -> "conditional" fix with f8e36ad -> 030af07 ; thanks!

There is no warning prior to a fresh install of an exposed release.

I didn't pick up this part. What I'm trying to say in this sentence is that these two options (Insights and conditional updates) aren't relevant for install. I'm not ruling out other mitigations that might be helpful for install (e.g. known issues and KCS could help for install, although they would require a fair bit of motivation to work for pre-install). Is that trying to slice it too fine? Or maybe it can work, but I should add a preceding sentence that points out that some mechanisms are better fits for some situations, to get folks in that headspace before I start getting into per-mechanism limitations? Or spread out with subsections for each mitigation option, instead of trying to cover them all in a list and sentence-or-two? Or...?

Member Author

alternatively, maybe I should give up on each sentence being able to stand alone, and say something like:

Both Insights rules and conditional update risk declarations can sometimes be scoped down to the exposed subset of the fleet, to reduce distracting the subset of the fleet that is not exposed. They are specific to updating into the risk. Neither provides pre-install or post-failed-install warning for users considering installing an exposed release.

?
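For readers following along: a conditional update risk declaration of the kind discussed above is a small YAML file in openshift/cincinnati-graph-data, in the shape defined by the targeted-update-edge-blocking enhancement linked from the diff. The sketch below is illustrative only; the version numbers, bug URL, risk name, message, and provider label are hypothetical stand-ins, not a real declaration.

```yaml
# Hypothetical sketch of a conditional update risk declaration, in the
# blocked-edges shape from openshift/cincinnati-graph-data.  All values
# below are invented for illustration.
to: 4.16.3                  # the release carrying the risk
from: 4[.]16[.].*           # regex matching source versions of the update
url: https://issues.redhat.com/browse/OCPBUGS-00000  # hypothetical bug link
name: ExampleStorageRegression
message: |-
  Clusters using the hypothetical Example storage driver may see
  degraded volumes after updating.  Other clusters are unaffected.
matchingRules:
- type: PromQL              # scopes the risk to the exposed subset
  promql:
    promql: |
      cluster_infrastructure_provider{type="ExampleProvider"}
      or
      0 * cluster_infrastructure_provider
```

The `matchingRules` PromQL is what allows the scoping the diff describes: clusters where the expression evaluates to a non-zero value see the risk, while the `0 *` fallback arm lets unexposed clusters match with a zero value and treat the update as recommended.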

When preparing a new X.Y.Z patch release shipping bug fixes, our ability to delay the release is much more constrained.
If a bug is so terrible that _no cluster_ could possibly benefit from the impacted nightlies, setting `Release Blocker` to `Approved` is appropriate.
But if there are even small subsets of clusters that would not be exposed to the OCPBUG, or if there are steps a cluster-admin could take to mitigate exposure, we prefer shipping the patch release anyway.
This avoids delaying access to other bugfixes for the clusters that can live with the OCPBUG being considered.
Contributor

In this section I presume we're talking about NEW regressions in the .z release we've discovered, and not some bug that's already in the wild, correct?

For a bug already in the wild, I would think we would rarely have to hold a .z for an approved release blocker.

If we discover something has gone wrong with a fix in the .z, and it will break functionality for a subset of customers in a new way, I worry that the mitigations below convey a lack of confidence in our own product and systems that we then make very apparent to customers and expect them to accommodate on every update.

For one, I worry they won't read or do sufficient due diligence to determine if the problem will affect them. However, even if we could induce that behaviour given how rare this situation is likely to arise, we're still expecting them to see an update and not have confidence it's safe to apply unless THEY take some action to find out if it's going to break them. When those updates go out, we should have confidence we're not making things worse.

I'm new to this and happy to discuss more, and I would hope this has been a VERY rare thing historically, but initially I think we should not regress functionality in .z's.

Member Author

In this section I presume we're talking about NEW regressions in the .z release we've discovered, and not some bug that's already in the wild, correct?

I was talking about bugs that were known to affect the current X.Y branch tip, without distinguishing based on whether existing X.Y.Z were exposed. More later in this comment about why I don't think we want to make a distinction there.

...I worry that the mitigations below convey a lack of confidence in our own product and systems that we then make very apparent to customers...

One response to the reputational risk is that, if the rate of regressions within a z stream rises to embarrassing levels, we invest more effort into CI/QE to get the rate back down to something we can stomach. I'm not a fan of ignoring known issues that have already shipped because we're feeling embarrassed; I think we need to own up to those and let users know before more get bit by the known issue. I accept that sometimes our ability to scope a risk declaration to the subset of exposed clusters is small, and we may decide to use other mechanisms to mitigate, depending on how we want to balance the risk of an exposed cluster-admin missing the warning against the distraction of an unexposed cluster-admin having an irrelevant-for-them warning waved around.

And part of the pitch I'm making is that there are going to be other bugs where we have verified fixes. And getting those fixes out faster to the subset of the fleet that can safely benefit from them seems like it would help our reputation for security and stability. And the fact that there are other fixes in the pipe that will benefit the not-exposed-to-the-regression portion of the fleet is why I don't think we want to distinguish between "already wild" and "just regressed" in our z-stream-release-blocker policy.

...expect them to accommodate on every update.

At least for updates, mitigations include Insights and conditional update risks, which can surface issues to the affected subset of cluster-admins without requiring them to manually reach out and hunt for known issues.

...I would hope this has been a VERY rare thing historically, but initially I think we should not regress functionality in .z's.

We're usually pretty good for things in the main CI/QE wheelhouse. But the whole cluster-config space is combinatorially large, exhaustive test coverage is impossible, and things slip through the cracks (mostly for corner-case configs) fairly often. See here for me doing an audit of our public update risk declarations from back in the 4.13/4.14 timeframe.
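As a concrete illustration of the "surface issues to the affected subset" point above: conditional update risks show up directly in the cluster-admin's update workflow via `oc adm upgrade --include-not-recommended`. A sketch, reusing the hypothetical risk from the earlier YAML example (exact output formatting varies by `oc` version):

```console
$ oc adm upgrade --include-not-recommended
...
Supported but not recommended updates:

  Version: 4.16.3
  Image: quay.io/openshift-release-dev/ocp-release@sha256:...
  Recommended: False
  Reason: ExampleStorageRegression
  Message: Clusters using the hypothetical Example storage driver may see
    degraded volumes after updating.  Other clusters are unaffected.
```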


In some cases, a hot fix is about to merge, or is percolating through nightlies, or is otherwise expected to show up very soon in the release pipeline.
Setting `Release Blocker` to `Approved` in those cases is acceptable as a way to attract ART attention and ask them to wait for that fix.
They may wait for the fix, or they may decide to build a release anyway, even without the fix, depending on capacity and how much time remains for final QE testing before the planned Errata-publication time.
Contributor

If ART can ignore `Release Blocker`, we have no mechanism I can see to actually stop a .z from going out. I would propose that `Release Blocker: Approved` means no release goes out, and we work out the details to decide whether it's safe to remove that flag (which will soon require approvals, hopefully).

Member Author

I was trying to leave room for Scott's:

TBH this lever hasn't actually been particularly successful in getting ART to delay promotion...

But I'm ok tightening it up to reflect aspirations instead of current reality, and then working post-merge to try and catch reality up with the aspirations. Is that what you're looking for?

… patch-fixes

Point out that the criteria for delaying patch releases are much
higher, because we don't want to stand between users and any _other_
bugfixes that are ready to ship.  Also give folks some pointers at
mitigation options that are less severe than "withhold the patch fix".
In some cases, there are internal process docs (e.g. for KCS
creation), but I dunno if I should be linking those in this public
doc?
@wking wking force-pushed the clarify-z-stream-blocker-conditions branch from f8e36ad to 030af07 on January 23, 2026 20:09
openshift-ci bot commented Jan 23, 2026

@wking: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
