
Flaky test report: committed-code failures on 2026-05-09 #260

@andrross

Description

Summary

Analysis of gradle-check failures against committed code (Timer and Post Merge Action builds targeting main) in the 24 hours ending 2026-05-09T10:00Z. Found 31 failure records across 7 distinct builds, representing 10 distinct failing tests (excluding class-level classMethod duplicates). Build 76364 had a systemic failure affecting all qa:smoke-test-http tests due to a RestCancellableNodeClient channel tracking issue.

Failing Tests

1. ClusterDisruptionIT (classMethod)

  • Build: 76306
  • Error: SpanData validation failed for validator AllSpansAreEndedProperly — spans from dispatchedShardOperationOnPrimary not ended during cluster disruption
  • Seed: CB09D2F627911882
  • Reproduced locally: ❌ No (passed with seed)
  • First seen: 2024-04-05
  • Total unique builds affected: 103
  • Pattern: Chronic low-rate flake, stable at ~3-7 builds/month since inception. Slight uptick in Apr 2026 (5 builds) but within historical range. This is a telemetry validation issue during disruption scenarios where spans are not properly ended when nodes are disrupted mid-operation.

2. RemoteSplitIndexIT.testCreateSplitIndex

  • Build: 76313
  • Error: expected:<0> but was:<67>
  • Seed: 26B8D7F0BC427F51
  • Reproduced locally: ❌ No (passed with seed)
  • First seen: 2024-04-11
  • Total unique builds affected: 140
  • Pattern: Chronic flake with a major spike in Nov 2025 (42 builds). Otherwise stable at 1-6 builds/month. The Nov 2025 spike suggests a temporary regression that was later fixed. Currently at baseline rate (~1-2/month).

3. ShardIndexingPressureSettingsIT.testShardIndexingPressureEnforcedEnabledDisabledSetting

  • Build: 76328
  • Error: expected:<0> but was:<2> in waitForTwoOutstandingRequests
  • Seed: 8410C0C2683BE2F0
  • Reproduced locally: ❌ No (passed with seed)
  • First seen: 2024-03-26
  • Total unique builds affected: 198
  • Pattern: Chronic high-rate flake. Historically 5-18 builds/month. Notable worsening trend: Apr 2026 had 10 builds, May 2026 already has 11 builds (9 days in). This is a timing-sensitive test that uses assertBusy to wait for outstanding requests, suggesting a race condition in indexing pressure tracking.
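
The race described above can be sketched in plain Java (this is a minimal illustration of the assertBusy polling pattern, not the actual test code; `outstandingRequests` is a hypothetical stand-in for the shard indexing pressure tracker):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of an assertBusy-style wait: poll until the assertion
// passes or the timeout expires. If request completion is delayed past
// the timeout (e.g. under CI load), the test fails with
// expected:<0> but was:<N> — the failure mode seen in build 76328.
public class AssertBusySketch {
    static final AtomicInteger outstandingRequests = new AtomicInteger(2);

    static void assertBusy(Runnable assertion, long timeout, TimeUnit unit) throws Exception {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        AssertionError last = null;
        while (System.nanoTime() < deadline) {
            try {
                assertion.run();
                return; // assertion passed
            } catch (AssertionError e) {
                last = e;
                Thread.sleep(10); // back off briefly before retrying
            }
        }
        throw last != null ? last : new AssertionError("timed out");
    }

    public static void main(String[] args) throws Exception {
        // Simulate requests completing asynchronously on another thread.
        new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) {}
            outstandingRequests.set(0);
        }).start();
        assertBusy(() -> {
            if (outstandingRequests.get() != 0) {
                throw new AssertionError("expected:<0> but was:<" + outstandingRequests.get() + ">");
            }
        }, 5, TimeUnit.SECONDS);
        System.out.println("outstanding=" + outstandingRequests.get());
    }
}
```

The test passes only when the asynchronous completion lands inside the polling window, which is why failures correlate with CI load rather than code changes.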

4. NRTReplicationEngineTests.testAcquireLastIndexCommit

  • Build: 76338
  • Error: expected:<2> but was:<1> at NRTReplicationEngineTests.java:81
  • Seed: CF6EC8EA3A74B5A6
  • Reproduced locally: ✅ Yes (deterministic with seed)
  • First seen: 2025-10-13
  • Total unique builds affected: 14
  • Pattern: Relatively new flake, first seen Oct 2025. Low rate (1-3 builds/month) but worsening — Apr 2026 had 3 builds, May 2026 already has 3 builds (9 days in). The deterministic reproduction with seed suggests a test-logic bug rather than a timing issue.
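
A deterministic failure like this can usually be re-run locally by pinning the randomized-testing seed. A sketch of the invocation (the module path and fully-qualified package name here are illustrative guesses, not taken from the report):

```shell
# Re-run a single test method with a pinned randomized-testing seed.
# Module path and package are assumptions for illustration.
./gradlew ':server:test' \
  --tests "org.opensearch.index.engine.NRTReplicationEngineTests.testAcquireLastIndexCommit" \
  -Dtests.seed=CF6EC8EA3A74B5A6
```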

5. IngestFromKafkaIT.testAllActiveOffsetBasedLag

  • Builds: 76338, 76353
  • Error: java.lang.AssertionError (assertion on lag metrics)
  • Seeds: CF6EC8EA3A74B5A6, AE215E3F25600C15
  • Reproduced locally: ❌ No (passed with seed)
  • First seen: 2025-10-15
  • Total unique builds affected: 31
  • Pattern: New and rapidly worsening. Dormant from Nov 2025–Feb 2026, then exploded: Mar 2026 (8 builds), Apr 2026 (13 builds), May 2026 (8 builds in 9 days). This is the most actively worsening test in this report. The Kafka integration tests use embedded Kafka which is sensitive to timing.

6. IngestFromKafkaIT.testCloseIndex

  • Build: 76348
  • Error: ConditionTimeoutException: Condition was not fulfilled within 1 minutes
  • Seed: 84563A7AEEF383D2
  • Reproduced locally: ❌ No (passed with seed)
  • First seen: 2025-03-24
  • Total unique builds affected: 9
  • Pattern: Low-rate chronic flake (~1 build/month when it appears). Stable. The timeout-based failure suggests environmental sensitivity (CI load, Kafka startup time).

7. SharedClusterSnapshotRestoreIT.testSnapshotFileFailureDuringSnapshot

  • Build: 76355
  • Error: Expected: <0L> but: was <1L>
  • Seed: 5FD3E28C78CB69CC
  • Reproduced locally: ❌ No (passed with seed)
  • First seen: 2024-08-31
  • Total unique builds affected: 95
  • Pattern: Chronic flake, stable at 2-6 builds/month. Uptick in Apr 2026 (10 builds) which correlates with the mid-April CI runner migration to m7a.8xlarge. May be CPU-speed sensitive.

8. IndexingIT.testIndexingWithSegRep

  • Build: 76362
  • Error: expected:<0> but was:<1>
  • Seed: 3DD511F96246C162
  • Reproduced locally: ⚠️ Could not run (requires JAVA21_HOME for BWC build)
  • First seen: 2024-03-25
  • Total unique builds affected: 257
  • Pattern: Chronic high-rate flake, the most-affected test in this report by total build count. Consistently 4-29 builds/month. Recent months show elevated rates (Feb 2026: 16, Mar 2026: 18, May 2026: 10 in 9 days). This is a rolling-upgrade test that exercises segment replication across versions.

9. SearchRestCancellationIT (multiple methods) — RestCancellableNodeClient channel leak

  • Build: 76364
  • Error: 1 channels still being tracked in RestCancellableNodeClient while there should be none expected:<0> but was:<1>
  • Seed: 770632E0BB388172
  • Reproduced locally: ❌ No (passed with seed)
  • First seen: 2024-03-26
  • Total unique builds affected: 427
  • Pattern: Chronic high-rate flake affecting the entire qa:smoke-test-http test suite. This is a teardown-time assertion that a REST channel was not properly cleaned up. In build 76364, it caused all HTTP integration tests to fail (14+ test methods across 7 test classes). Historically 5-47 builds/month. Notable spike in Nov 2025 (47 builds). Currently elevated: Apr 2026 (23 builds), May 2026 (25 builds in 9 days). Worsening.
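
The leak pattern behind this teardown assertion can be sketched as follows (a minimal illustration with a hypothetical `trackedChannels` map standing in for RestCancellableNodeClient's internal state, not the actual implementation):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the channel-tracking pattern: each HTTP channel is
// registered while a request is in flight and removed by a close
// listener. If the close listener never fires (or fires after suite
// teardown), the tracked-channel count is non-zero and every test in
// the suite fails the same teardown check — as in build 76364.
public class ChannelTrackingSketch {
    static final Map<Integer, String> trackedChannels = new ConcurrentHashMap<>();

    static void onRequestStart(int channelId) {
        trackedChannels.put(channelId, "open");
    }

    static void onChannelClosed(int channelId) {
        trackedChannels.remove(channelId); // the cleanup the leak skips
    }

    public static void main(String[] args) {
        onRequestStart(1);
        onChannelClosed(1);  // normal path: entry removed
        onRequestStart(2);   // leaked path: close listener never runs
        // Teardown-style assertion, mirroring the failure message.
        if (!trackedChannels.isEmpty()) {
            System.out.println(trackedChannels.size()
                + " channels still being tracked in RestCancellableNodeClient while there should be none");
        }
    }
}
```

Because the check runs at suite teardown, a single leaked channel fails every test class in qa:smoke-test-http, which is why one root cause produced 14+ distinct failure records.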

10. DetailedErrorsDisabledIT (same root cause as #9)

  • Build: 76364
  • Error: Same RestCancellableNodeClient channel tracking assertion as #9
  • Seed: 770632E0BB388172
  • Reproduced locally: ❌ No (passed with seed)
  • Note: Same root cause as SearchRestCancellationIT. All HTTP tests in build 76364 failed with this same assertion. Listed separately because it's a different test class, but the fix would be the same.

Summary Table

| # | Test | Builds Affected | First Seen | Reproduced | Trend |
|---|------|-----------------|------------|------------|-------|
| 9 | SearchRestCancellationIT (RestCancellableNodeClient) | 427 | 2024-03-26 | ❌ No | ⬆️ Worsening |
| 8 | IndexingIT.testIndexingWithSegRep | 257 | 2024-03-25 | ⚠️ N/A | ⬆️ Worsening |
| 3 | ShardIndexingPressureSettingsIT | 198 | 2024-03-26 | ❌ No | ⬆️ Worsening |
| 2 | RemoteSplitIndexIT.testCreateSplitIndex | 140 | 2024-04-11 | ❌ No | ➡️ Stable |
| 1 | ClusterDisruptionIT (classMethod) | 103 | 2024-04-05 | ❌ No | ➡️ Stable |
| 7 | SharedClusterSnapshotRestoreIT | 95 | 2024-08-31 | ❌ No | ⬆️ Worsening (Apr spike) |
| 5 | IngestFromKafkaIT.testAllActiveOffsetBasedLag | 31 | 2025-10-15 | ❌ No | ⬆️ Rapidly worsening |
| 4 | NRTReplicationEngineTests.testAcquireLastIndexCommit | 14 | 2025-10-13 | ✅ Yes | ⬆️ Worsening |
| 6 | IngestFromKafkaIT.testCloseIndex | 9 | 2025-03-24 | ❌ No | ➡️ Stable |

Notes

  • Build 76364 had a systemic failure: a single RestCancellableNodeClient channel leak caused all 7 HTTP test classes (14+ methods) to fail. This is a single root cause manifesting across many tests.
  • NRTReplicationEngineTests.testAcquireLastIndexCommit is the only test that reproduced deterministically with its seed, suggesting a test-logic bug rather than a timing/environmental issue.
  • IngestFromKafkaIT.testAllActiveOffsetBasedLag is the most rapidly worsening test — it went from dormant to 8-13 failures/month in the span of 3 months.
  • The April 2026 CI runner migration (m5.8xlarge → m7a.8xlarge) correlates with upticks in SharedClusterSnapshotRestoreIT and ShardIndexingPressureSettingsIT.
  • IndexingIT.testIndexingWithSegRep could not be reproduced locally because the rolling-upgrade test requires JAVA21_HOME for building the BWC distribution.
