NO-ISSUE: dev-guide/release-blocker-definition: Distinguish X.Y.0 GA from X.Y.Z patch-fixes #1928
Conversation
@wking: This pull request explicitly references no jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment. The full list of commands accepted by this bot can be found here.
Force-pushed 0997d41 to f8e36ad (compare).
> * [Conditional update risk declarations][4.20-conditional-updates], see [the assessment process](/enhancements/update/update-blocker-lifecycle/README.md) and [the distribution tooling](/enhancements/update/targeted-update-edge-blocking.md).
>
> Both Insights rules and conditional update risk declarations can sometimes be scoped down to the exposed subset of the fleet, to reduce distracting the subset of the fleet that is not exposed.
> Both Insights rules and condtional update risk declarations are specific to updating into the risk, and neither provides pre-install warning for users consdering installing an exposed release.
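For reference, the conditional update risk declarations quoted above are distributed as small YAML files in the cincinnati-graph-data repository's `blocked-edges/` directory. A minimal sketch, assuming the `to`/`from`/`matchingRules` schema from the targeted-update-edge-blocking enhancement; the file name, versions, risk name, bug link, and PromQL are all hypothetical:

```yaml
# blocked-edges/4.16.9-ExampleInfraRisk.yaml (hypothetical declaration)
to: 4.16.9        # release being updated into
from: 4\.15\..*   # regex over source versions exposed to the risk
url: https://issues.redhat.com/browse/OCPBUGS-00000  # placeholder bug link
name: ExampleInfraRisk
message: >-
  Clusters on the affected infrastructure may hit the regression
  described in the linked bug when updating into this release.
matchingRules:
# Ordered list; a PromQL result of 1 marks the cluster as exposed,
# 0 as not exposed.
- type: PromQL
  promql:
    promql: |
      cluster_infrastructure_provider{type="VSphere"}
      or
      0 * cluster_infrastructure_provider
```

The PromQL rule is what allows a declaration to be "scoped down to the exposed subset of the fleet": only clusters whose query evaluates to 1 are treated as exposed.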
Suggested change: replace

> Both Insights rules and condtional update risk declarations are specific to updating into the risk, and neither provides pre-install warning for users consdering installing an exposed release.

with

> Both Insights rules and conditional update risk declarations are specific to updating into the risk. There is no warning prior to a fresh install of an exposed release.
I pushed the "condtional" -> "conditional" fix with f8e36ad -> 030af07; thanks!
> There is no warning prior to a fresh install of an exposed release.
I didn't pick up this part. What I'm trying to say in this sentence is that these two options (Insights and conditional updates) aren't relevant for install. I'm not ruling out other mitigations that might be helpful for install (e.g. known issues and KCS could help for install, although they would require a fair bit of motivation to work for pre-install). Is that trying to slice it too fine? Or maybe it can work, but I should add a preceding sentence that points out that some mechanisms are better fits for some situations, to get folks in that headspace before I start getting into per-mechanism limitations? Or spread out with subsections for each mitigation option, instead of trying to cover them all in a list and sentence-or-two? Or...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, maybe I should give up on each sentence being able to stand alone, and say something like:
> Both Insights rules and conditional update risk declarations can sometimes be scoped down to the exposed subset of the fleet, to reduce distracting the subset of the fleet that is not exposed. They are specific to updating into the risk. Neither provides pre-install or post-failed-install warning for users considering installing an exposed release.
?
> When preparing a new X.Y.Z patch release shipping bug fixes, our ability to delay the release is much more constrained.
> If a bug is so terrible that _no cluster_ could possibly benefit from the impacted nightlies, setting `Release Blocker` to `Approved` is appropriate.
> But if there are even small subsets of clusters that would not be exposed to the OCPBUG, or if there are steps a cluster-admin could take to mitigate exposure, we prefer releasing the patch release anyway.
> This avoids delaying access to other bugfixes in the clusters that could survive with the OCPBUG being considered.
In this section I presume we're talking about NEW regressions we've discovered in the .z release, and not some bug that's already in the wild, correct?

For bugs already in the wild, I would think we would rarely have to hold a .z for an approved release blocker.

If we discover something has gone wrong with a fix in the .z, and it will break functionality for a subset of customers in a new way, I worry that the mitigations below convey a lack of confidence in our own product and systems, which we then make very apparent to customers and expect them to accommodate on every update.

For one, I worry they won't read or do sufficient due diligence to determine if the problem will affect them. But even if we could induce that behaviour, given how rarely this situation is likely to arise, we're still expecting them to see an update and not have confidence that it's safe to apply unless THEY take some action to find out whether it's going to break them. When those updates go out, we should have confidence we're not making things worse.

I'm new to this and happy to discuss more, and I would hope this has been a VERY rare thing historically, but initially I think we should not regress functionality in .z's.
> In this section I presume we're talking about NEW regressions we've discovered in the .z release, and not some bug that's already in the wild, correct?
I was talking about bugs that were known to affect the current X.Y branch tip, without distinguishing based on whether existing X.Y.Z were exposed. More later in this comment about why I don't think we want to make a distinction there.
> ...I worry that the mitigations below convey a lack of confidence in our own product and systems that we then make very apparent to customers...
One response to the reputational risk is that, if the rate of regressions within a z stream rises to embarrassing levels, we invest more effort into CI/QE to get the rate back down to something we can stomach. I'm not a fan of ignoring known issues that have already shipped because we're feeling embarrassed; I think we need to own up to those and let users know before more get bit by the known issue. I accept that sometimes our ability to scope a risk declaration to the subset of exposed clusters is limited, and we may decide to use other mechanisms to mitigate, depending on how we want to balance the risk of a cluster-admin who is exposed missing the warning against the distraction of a cluster-admin who is not exposed having the irrelevant-for-them warning waved around.
And part of the pitch I'm making is that there are going to be other bugs where we have verified fixes. And getting those fixes out faster to the subset of the fleet that can safely benefit from them seems like it would help our reputation for security and stability. And the fact that there are other fixes in the pipe that will benefit the not-exposed-to-the-regression portion of the fleet is why I don't think we want to distinguish between "already wild" and "just regressed" in our z-stream-release-blocker policy.
> ...expect them to accommodate on every update.
Update mitigations, at least, include Insights and conditional update risks, which can surface issues to the affected subset of cluster-admins without requiring them to manually reach out and hunt for known issues.
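To make the "surface issues to the affected subset" point concrete: declared risks end up in each cluster's ClusterVersion status under `conditionalUpdates`, which `oc adm upgrade --include-not-recommended` reads back to the cluster-admin. A rough excerpt, assuming the `config.openshift.io/v1` ClusterVersion schema; the version, risk name, link, and PromQL are hypothetical:

```yaml
# Illustrative excerpt of `oc get clusterversion version -o yaml`
status:
  conditionalUpdates:
  - release:
      version: 4.16.9   # hypothetical target release
    risks:
    - name: ExampleInfraRisk
      url: https://issues.redhat.com/browse/OCPBUGS-00000  # placeholder
      message: >-
        Clusters on the affected infrastructure may hit the regression
        described in the linked bug.
      matchingRules:
      - type: PromQL
        promql:
          promql: |
            cluster_infrastructure_provider{type="VSphere"}
            or
            0 * cluster_infrastructure_provider
    conditions:
    - type: Recommended
      status: "False"           # this cluster matched the risk, so the
      reason: ExampleInfraRisk  # update is not recommended for it
```

Because the `Recommended` condition is evaluated per-cluster against the matching rules, only the exposed subset sees the warning; nobody has to go digging through known-issues pages.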
> ...I would hope this has been a VERY rare thing historically, but initially I think we should not regress functionality in .z's.
We're usually pretty good for things in the main CI/QE wheelhouse. But the whole cluster-config space is combinatorially large, exhaustive test coverage is impossible, and things slip through the cracks (mostly for corner-case configs) fairly often. See here for an audit I did of our public update risk declarations from back in the 4.13/4.14 timeframe.
> In some cases, a hot fix is about to merge, or is percolating through nightlies, or is otherwise expected to show up very soon in the release pipeline.
> Setting `Release Blocker` to `Approved` in those cases is acceptable as a way to attract ART attention and ask them to wait for that fix.
> They may wait for the fix, or they may decide to build a release anyway, even without the fix, depending on capacity and how much time remains for final QE testing before the planned Errata-publication time.
If ART can ignore a release blocker, we have no mechanism I can see to actually stop a .z from going out. I would propose that `Release Blocker` set to `Approved` means no release goes out, and that we work out the details of deciding whether it's safe to remove that flag (which will hopefully soon require approvals).
I was trying to leave room for Scott's:
> TBH this lever hasn't actually been particularly successful in getting ART to delay promotion...
But I'm ok tightening it up to reflect aspirations instead of current reality, and then working post-merge to try and catch reality up with the aspirations. Is that what you're looking for?
… patch-fixes

Point out that the criteria for delaying patch releases are much higher, because we don't want to stand between users and any _other_ bugfixes that are ready to ship. Also give folks some pointers to mitigation options that are less severe than "withhold the patch fix". In some cases, there are internal process docs (e.g. for KCS creation), but I dunno if I should be linking those in this public doc?
Force-pushed f8e36ad to 030af07 (compare).
@wking: all tests passed! Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Following up on this thread, point out that the criteria for delaying patch releases are much higher, because we don't want to stand between users and any other bugfixes that are ready to ship. Also give folks some pointers to mitigation options that are less severe than "withhold the patch fix". In some cases, there are internal process docs (e.g. for KCS creation), but I dunno if I should be linking those in this public doc?