fix: Version mismatch code fix plus manual verification scripts (ARO-26767) #5369
brucebarrera wants to merge 16 commits into
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Removed line numbers as per Copilot suggestion
Line numbers removed
Updated FIXING-VERSION-MISMATCH.md to use proper e2e test execution methods based on test/e2e/README.md. Replaced direct go test commands with aro-hcp-tests binary usage for consistency with project standards.
Updating branch with latest changes from main
…rucebarrera/ARO-HCP into node-pool-version-mismatch-fix merging latest changes
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: brucebarrera. It needs approval from an approver in each of the affected files.
Hi @brucebarrera. Thanks for your PR. I'm waiting for an Azure member to verify that this patch is reasonable to test. Regular contributors should join the org to skip this step.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR aims to prevent OpenShift control plane / node pool version mismatches in E2E tests by aligning version resolution and adding guardrails plus helper scripts for local/CI setup.
Changes:
- Reuse the control plane version for node pools when channel groups match and add a semver-based validation to catch misconfiguration early.
- Add CI/local helper scripts to resolve and set synchronized OpenShift versions (bash + PowerShell) plus a diagnostic script.
- Add documentation describing the mismatch root cause and recommended workflows.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| test/util/framework/deployment_params.go | Align node pool version selection with control plane and add an early validation check. |
| test/scripts/sync-ocp-versions-ci.sh | CI helper to resolve one version and export/emit env vars for CP/NP. |
| test/scripts/set-ocp-versions.sh | Local bash helper to resolve a version and set CP/NP env vars. |
| test/scripts/check-channel-groups.sh | Diagnostic script to highlight channel/version mismatch risks. |
| test/scripts/Set-OcpVersions.ps1 | Local PowerShell helper to resolve a version and set CP/NP env vars. |
| test/scripts/FIXING-VERSION-MISMATCH.md | Documentation explaining the issue and how to use the helpers / new behavior. |
     // Only fetch separately if using a different channel group
     if channelGroup != "stable" {
         var err error
         version, err = GetLatestInstallVersion(context.Background(), channelGroup, DefaultOCPVersionId)
         if err != nil {
             if errors.Is(err, ErrNightlyReleaseStreamNotFound) || errors.Is(err, ErrNoAcceptedNightlyTags) || errors.Is(err, ErrVersionNotFound) {
    -            Skip(fmt.Sprintf("No install version found for %s in %s channel (%s)", version, channelGroup, err.Error()))
    +            Skip(fmt.Sprintf("No install version found for %s in %s channel (%s)", DefaultOCPVersionId, channelGroup, err.Error()))
             } else {
                 Fail(fmt.Sprintf("failed to get latest install version for %s channel: %s", channelGroup, err.Error()))
             }
         }
     } else {
         // For stable channel, also use control plane version to avoid mismatches
         version = DefaultOpenshiftControlPlaneVersionId()
     }

    // Validate that node pool version doesn't exceed control plane version
    // This catches configuration errors early before they reach the API validation
    npVer, npErr := semver.ParseTolerant(npVersion)
    cpVer, cpErr := semver.ParseTolerant(cpVersion)

    if [ -n "${ARO_HCP_OPENSHIFT_CONTROLPLANE_VERSION:-}" ] && [ -n "${ARO_HCP_OPENSHIFT_NODEPOOL_VERSION:-}" ]; then
        CP_VERSION="$ARO_HCP_OPENSHIFT_CONTROLPLANE_VERSION"
        NP_VERSION="$ARO_HCP_OPENSHIFT_NODEPOOL_VERSION"

        echo "Versions explicitly set via environment:"
        echo " Control Plane: $CP_VERSION"
        echo " Node Pool: $NP_VERSION"

        # Validate they match
        if [ "$CP_VERSION" != "$NP_VERSION" ]; then
            echo "ERROR: Control plane and node pool versions differ!" >&2
            echo " This will cause validation errors." >&2
            echo " Either unset both variables or ensure they match." >&2
            exit 1
        fi

        echo "✓ Versions are synchronized"
        exit 0
    fi

    if [ -z "$VERSION" ]; then
        echo "ERROR: Failed to resolve version" >&2
        exit 1
    fi

Fix possible version order error due to PowerShell's default Sort-Object on strings being lexicographic
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
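The ordering pitfall named in this commit message is not specific to PowerShell. A minimal shell sketch, using two sample version strings picked for illustration and assuming GNU `sort` for the `-V` flag, shows how string ordering and version ordering can disagree:

```shell
# Sample data for illustration only: two patch releases of one minor version.
versions=$'4.20.9\n4.20.10'

# Plain string sort compares character by character, so "9" beats "1"
# and 4.20.9 ends up last.
lexicographic_latest=$(printf '%s\n' "$versions" | LC_ALL=C sort | tail -n 1)

# Version sort (GNU sort -V) compares numeric components, so 4.20.10 ends up last.
version_latest=$(printf '%s\n' "$versions" | sort -V | tail -n 1)

echo "lexicographic: $lexicographic_latest"
echo "version-aware: $version_latest"
```

Picking the "latest" entry with the string sort would select 4.20.9 here, which is the kind of misordering the commit describes.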
            return 1
        fi

        # Declare and assign separately so `local` does not mask a failed pipeline
        local version
        version=$(curl --silent --show-error --fail --location --retry 3 --retry-delay 2 --retry-connrefused --max-time 30 "$graph_url" | jq -r '.nodes[].version' | sort -V | tail -1)

    }

    # Get the latest version using semantic version ordering
    $latestVersion = $response.nodes.version | Sort-Object { [version]$_ } | Select-Object -Last 1

    if [ -z "$VERSION" ] || [ "$VERSION" = "null" ]; then
        echo "ERROR: No version found for channel ${CHANNEL_GROUP}-${VERSION_MINOR}" >&2
        echo "Response was: $GRAPH_JSON" >&2

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
    if ! command -v jq &> /dev/null; then
        echo "ERROR: jq is required but not installed" >&2
        echo "Install with: apt-get install jq (or equivalent)" >&2
        exit 1

    # Configuration
    CHANNEL_GROUP="${ARO_HCP_OPENSHIFT_CHANNEL_GROUP:-candidate}"
    VERSION_MINOR="${ARO_HCP_OPENSHIFT_VERSION_MINOR:-4.20}"

    echo "=== OpenShift Version Synchronization for CI ==="
    echo "Channel Group: $CHANNEL_GROUP"
    echo "Version Minor: $VERSION_MINOR"
    echo ""

    # Check if versions are already explicitly set
    if [ -n "${ARO_HCP_OPENSHIFT_CONTROLPLANE_VERSION:-}" ] && [ -n "${ARO_HCP_OPENSHIFT_NODEPOOL_VERSION:-}" ]; then
        CP_VERSION="$ARO_HCP_OPENSHIFT_CONTROLPLANE_VERSION"
        NP_VERSION="$ARO_HCP_OPENSHIFT_NODEPOOL_VERSION"

        echo "Versions explicitly set via environment:"
        echo " Control Plane: $CP_VERSION"
        echo " Node Pool: $NP_VERSION"

        # Validate they match
        if [ "$CP_VERSION" != "$NP_VERSION" ]; then
            echo "ERROR: Control plane and node pool versions differ!" >&2
            echo " This will cause validation errors." >&2
            echo " Either unset both variables or ensure they match." >&2
            exit 1
        fi

        echo "✓ Versions are synchronized"
        exit 0
    fi

    # Fetch latest version from Cincinnati
    echo "Fetching latest version from OpenShift graph API..."

    GRAPH_URL="https://api.openshift.com/api/upgrades_info/v1/graph?channel=${CHANNEL_GROUP}-${VERSION_MINOR}"

    # Use curl with retries for robustness in CI
    MAX_RETRIES=3
    RETRY_DELAY=5

    for i in $(seq 1 $MAX_RETRIES); do
        if GRAPH_JSON=$(curl -s --fail --max-time 30 "$GRAPH_URL" 2>/dev/null); then
            break
        fi

        if [ $i -eq $MAX_RETRIES ]; then
            echo "ERROR: Failed to fetch version after $MAX_RETRIES attempts" >&2
            exit 1
        fi

        echo "Attempt $i failed, retrying in ${RETRY_DELAY}s..."
        sleep $RETRY_DELAY
    done

    # Parse latest version (requires jq)
    if ! command -v jq &> /dev/null; then
        echo "ERROR: jq is required but not installed" >&2
        echo "Install with: apt-get install jq (or equivalent)" >&2
        exit 1
    fi

    VERSION=$(echo "$GRAPH_JSON" | jq -r '.nodes[].version' | sort -V | tail -1)

    if [ -z "$VERSION" ] || [ "$VERSION" = "null" ]; then
        echo "ERROR: No version found for channel ${CHANNEL_GROUP}-${VERSION_MINOR}" >&2
        echo "Response was: $GRAPH_JSON" >&2
        exit 1
    fi

    echo "Resolved Version: $VERSION"
    echo ""

    # Export synchronized versions
    export ARO_HCP_OPENSHIFT_CHANNEL_GROUP="$CHANNEL_GROUP"
    export ARO_HCP_OPENSHIFT_NODEPOOL_CHANNEL_GROUP="$CHANNEL_GROUP"
    export ARO_HCP_OPENSHIFT_CONTROLPLANE_VERSION="$VERSION"
    export ARO_HCP_OPENSHIFT_NODEPOOL_VERSION="$VERSION"

    # For GitHub Actions - output to GITHUB_ENV
    if [ -n "${GITHUB_ENV:-}" ]; then
        echo "Exporting to GitHub Actions environment..."
        {
            echo "ARO_HCP_OPENSHIFT_CHANNEL_GROUP=$CHANNEL_GROUP"
            echo "ARO_HCP_OPENSHIFT_NODEPOOL_CHANNEL_GROUP=$CHANNEL_GROUP"
            echo "ARO_HCP_OPENSHIFT_CONTROLPLANE_VERSION=$VERSION"
            echo "ARO_HCP_OPENSHIFT_NODEPOOL_VERSION=$VERSION"
        } >> "$GITHUB_ENV"
    fi

    echo "✓ Synchronized versions set:"
    echo " Channel Group: $CHANNEL_GROUP"
    echo " Control Plane Version: $VERSION"
    echo " Node Pool Version: $VERSION"
    echo ""

    # Output for other CI systems
    echo "export ARO_HCP_OPENSHIFT_CHANNEL_GROUP=\"$CHANNEL_GROUP\""
    echo "export ARO_HCP_OPENSHIFT_NODEPOOL_CHANNEL_GROUP=\"$CHANNEL_GROUP\""
    echo "export ARO_HCP_OPENSHIFT_CONTROLPLANE_VERSION=\"$VERSION\""
    echo "export ARO_HCP_OPENSHIFT_NODEPOOL_VERSION=\"$VERSION\""

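The script above writes both human-readable banners and `export` statements to stdout, so a caller that wants the variables in its own shell would probably need to filter before evaluating. A minimal sketch, assuming a stand-in function (`emit_sample_output`, hypothetical) in place of the real script and sample version values taken from the PR description:

```shell
# Hypothetical stand-in for the helper's stdout: banner text mixed with exports.
emit_sample_output() {
    echo "=== OpenShift Version Synchronization for CI ==="
    echo "Resolved Version: 4.20.19"
    echo 'export ARO_HCP_OPENSHIFT_CONTROLPLANE_VERSION="4.20.19"'
    echo 'export ARO_HCP_OPENSHIFT_NODEPOOL_VERSION="4.20.19"'
}

# Keep only lines that start with "export ", then evaluate them in this shell.
eval "$(emit_sample_output | grep '^export ')"

echo "CP=$ARO_HCP_OPENSHIFT_CONTROLPLANE_VERSION NP=$ARO_HCP_OPENSHIFT_NODEPOOL_VERSION"
```

Evaluating command output is only reasonable when the producer is trusted; sourcing the script directly, or reading `GITHUB_ENV` in GitHub Actions, avoids the `eval`.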
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
    # Script to set synchronized OpenShift versions for E2E tests
    # This ensures control plane and node pool use the same version

    set -euo pipefail

    echo "To apply these in your current shell, run:"
    echo " source ${BASH_SOURCE[0]} $CHANNEL_GROUP $VERSION_MINOR"

    # This script fetches the version once and exports it for both control plane and node pools
    # Designed for use in CI environments (Prow, GitHub Actions, etc.)

    set -euo pipefail

    if is_sourced; then
        return "$status"
    else
        exit "$status"
    fi

    # Parse latest version (requires jq)
    if ! command -v jq &> /dev/null; then
        echo "ERROR: jq is required but not installed" >&2
        echo "Install with: apt-get install jq (or equivalent)" >&2
        return 1
    fi

    VERSION=$(echo "$GRAPH_JSON" | jq -r '.nodes[].version' | sort -V | tail -1)

    $latestVersion = $response.nodes.version | Sort-Object { [version]$_ } | Select-Object -Last 1
    return $latestVersion

    npVer, npErr := semver.ParseTolerant(npVersion)
    cpVer, cpErr := semver.ParseTolerant(cpVersion)
    if npErr != nil {
        Fail(fmt.Sprintf(
            "Configuration error: failed to parse node pool version %q: %v. "+
                "Check your channel group settings:\n"+
                " ARO_HCP_OPENSHIFT_CHANNEL_GROUP=%s\n"+
                " ARO_HCP_OPENSHIFT_NODEPOOL_CHANNEL_GROUP=%s\n"+
                " ARO_HCP_OPENSHIFT_CONTROLPLANE_VERSION=%s\n"+
                " ARO_HCP_OPENSHIFT_NODEPOOL_VERSION=%s",
            npVersion, npErr,
            DefaultOpenshiftChannelGroup(),
            DefaultOpenshiftNodePoolChannelGroup(),
            os.Getenv("ARO_HCP_OPENSHIFT_CONTROLPLANE_VERSION"),
            os.Getenv("ARO_HCP_OPENSHIFT_NODEPOOL_VERSION"),
        ))
    }
    if cpErr != nil {
        Fail(fmt.Sprintf(
            "Configuration error: failed to parse control plane version %q: %v. "+
                "Check your channel group settings:\n"+
                " ARO_HCP_OPENSHIFT_CHANNEL_GROUP=%s\n"+
                " ARO_HCP_OPENSHIFT_NODEPOOL_CHANNEL_GROUP=%s\n"+
                " ARO_HCP_OPENSHIFT_CONTROLPLANE_VERSION=%s\n"+
                " ARO_HCP_OPENSHIFT_NODEPOOL_VERSION=%s",
            cpVersion, cpErr,
            DefaultOpenshiftChannelGroup(),
            DefaultOpenshiftNodePoolChannelGroup(),
            os.Getenv("ARO_HCP_OPENSHIFT_CONTROLPLANE_VERSION"),
            os.Getenv("ARO_HCP_OPENSHIFT_NODEPOOL_VERSION"),
        ))

Resolved conflicts by:
- Keeping improved semantic version sorting in Set-OcpVersions.ps1
- Using cleaner main() function structure in sync-ocp-versions-ci.sh while preserving shell option restoration for safe sourcing

Both scripts now properly handle pre-release versions and restore shell options when sourced.
PR needs rebase.
     // Only fetch separately when using a different channel group and that
     // channel group is not stable; stable also uses the control plane version.
     if channelGroup != "stable" {
         var err error
         version, err = GetLatestInstallVersion(context.Background(), channelGroup, DefaultOCPVersionId)
         if err != nil {
             if errors.Is(err, ErrNightlyReleaseStreamNotFound) || errors.Is(err, ErrNoAcceptedNightlyTags) || errors.Is(err, ErrVersionNotFound) {
    -            Skip(fmt.Sprintf("No install version found for %s in %s channel (%s)", version, channelGroup, err.Error()))
    +            Skip(fmt.Sprintf("No install version found for %s in %s channel (%s)", DefaultOCPVersionId, channelGroup, err.Error()))
             } else {
                 Fail(fmt.Sprintf("failed to get latest install version for %s channel: %s", channelGroup, err.Error()))
             }
         }
     } else {
         // For stable channel, also use control plane version to avoid mismatches
         version = DefaultOpenshiftControlPlaneVersionId()
     }

    if is_sourced; then
        OLD_SHELL_OPTS=$(set +o)
        # Trap to restore shell options on exit/return, even on error
        trap 'eval "$OLD_SHELL_OPTS"' RETURN ERR EXIT
    fi

    if [ "$SOURCED" = "1" ]; then
        # Save current shell options
        OLD_SHELL_OPTS=$(set +o)
        # Trap to restore shell options on exit/return, even on error
        trap 'eval "$OLD_SHELL_OPTS"' RETURN ERR EXIT
    fi

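The save-and-restore idea in the two snippets above relies on `set +o` printing the commands that recreate the current option state. A minimal sketch of that mechanism, using a hypothetical function name and a direct restore instead of the `trap` used in the PR:

```shell
# Hypothetical stand-in for a script body that is meant to be sourced.
run_with_strict_mode() {
    # "set +o" prints one "set -o name" or "set +o name" line per option,
    # which is enough to recreate the caller's state later.
    local saved_opts
    saved_opts=$(set +o)

    set -euo pipefail
    # Work that benefits from strict mode would go here.
    :

    # Put the caller's options back so strict mode does not leak out.
    eval "$saved_opts"
}

run_with_strict_mode
```

The direct restore only runs when the function reaches its last line; the `trap`-based form in the PR is meant to cover early returns and errors as well.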
    # Try to use SemanticVersion (PowerShell 6+), fall back to custom sorting
    try {
        $latestVersion = $versions | ForEach-Object {
            [PSCustomObject]@{
                Original = $_
                SemVer = [System.Management.Automation.SemanticVersion]::new($_)
            }
        } | Sort-Object -Property SemVer | Select-Object -Last 1 -ExpandProperty Original
    }
    catch {
        # Fallback for Windows PowerShell 5.1 or if SemanticVersion parsing fails
        # Sort by splitting version components and comparing numerically
        $latestVersion = $versions | Sort-Object -Property {
            $parts = $_ -split '[\.-]'
            # Pad each part to ensure numeric comparison
            ($parts[0].PadLeft(10, '0') +
             $parts[1].PadLeft(10, '0') +
             $parts[2].PadLeft(10, '0'))
        } | Select-Object -Last 1
    }

    sort_versions() {
        # Try sort -V (GNU sort, available on Linux and via coreutils on macOS)
        if sort -V /dev/null &>/dev/null; then
            sort -V
        # Try gsort -V (GNU sort via Homebrew coreutils on macOS)
        elif command -v gsort &>/dev/null && gsort -V /dev/null &>/dev/null; then
            gsort -V
        # Fallback: use Python for semantic version sorting
        elif command -v python3 &>/dev/null; then
            python3 -c '
    import sys
    versions = [line.strip() for line in sys.stdin if line.strip()]
    try:
        # Import inside the try so a missing packaging module triggers the fallback
        from packaging import version
        versions.sort(key=version.parse)
    except Exception:
        # Fallback to basic string sort if packaging is unavailable or parsing fails
        versions.sort()
    for v in versions:
        print(v)
    '
        else
            # Last resort: basic alphanumeric sort (not semver-aware, but better than nothing)
            echo "WARNING: No proper version sorting available. Install GNU coreutils or Python with packaging module for accurate results." >&2
            sort
        fi
    }

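For plain `X.Y.Z` versions, one possible alternative to the last-resort string sort is a field-wise numeric sort that uses only POSIX `sort` options. This is a sketch under that assumption, with a hypothetical helper name and sample versions chosen for illustration; it does not attempt to order pre-release suffixes:

```shell
# Hypothetical helper: order X.Y.Z versions without relying on sort -V.
sort_versions_portable() {
    # -t. splits each line into fields on dots;
    # each -kN,Nn key compares field N numerically.
    sort -t. -k1,1n -k2,2n -k3,3n
}

# Sample input; the last line printed is the highest version.
printf '%s\n' 4.20.9 4.19.30 4.20.10 | sort_versions_portable | tail -n 1
```

Because every key is numeric, 4.20.10 sorts after 4.20.9, which a plain string sort would get wrong.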
Jira
Fixes: ARO-26767
Why
This PR fixes an issue where after a release rollout, the default node pool version (4.20.20) is higher than the control plane version (4.20.19) in STG and PROD.
Error: Node pool version '4.20.20' must not be greater than Control Plane version '4.20.19'
Failing Tests:
- Customer should update node pool labels and taints
- Customer should use workload identity via cluster OIDC
- Customer should not perform invalid operations
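The rule behind the error quoted above can be sketched in shell. The function names here are hypothetical and not part of the PR, and the comparison assumes GNU `sort` with `-V`; the PR itself performs this check in Go with `semver.ParseTolerant`:

```shell
# version_lte A B: succeeds when A is not greater than B in version order.
version_lte() {
    # sort -C exits 0 only when its input is already sorted; with -V the
    # comparison is version-aware, and equal lines count as sorted.
    printf '%s\n%s\n' "$1" "$2" | sort -C -V
}

# check_nodepool_version NP CP: mirrors the rule in the error message.
check_nodepool_version() {
    local np_version=$1 cp_version=$2
    if ! version_lte "$np_version" "$cp_version"; then
        echo "ERROR: Node pool version '$np_version' must not be greater than Control Plane version '$cp_version'" >&2
        return 1
    fi
}
```

Calling `check_nodepool_version 4.20.20 4.20.19` returns non-zero, matching the failure described above, while equal versions or an older node pool version pass.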
What
- Enhanced synchronization logic (deployment_params.go)
- Early validation (deployment_params.go): NewDefaultNodePoolParams()
- Import added: github.com/blang/semver/v4 for version comparison
- Environment Variable Synchronization: new scripts are provided if you need a quick workaround:
Check Configuration
For Local Development (Bash)
For Local Development (PowerShell - Windows)
For CI/CD (Prow, GitHub Actions)
Add this to your CI pipeline BEFORE running tests:
Related Files Modified
- test/util/framework/deployment_params.go - Core fix
- test/scripts/check-channel-groups.sh - Diagnostic tool
- test/scripts/set-ocp-versions.sh - Local dev helper (Bash)
- test/scripts/Set-OcpVersions.ps1 - Local dev helper (PowerShell)
- test/scripts/sync-ocp-versions-ci.sh - CI/CD helper
Testing