Flaky test report: committed-code failures on 2026-05-11 #262

@andrross

Summary

One test failure was detected against committed code (Timer/Post Merge Action builds on main) in the past 24 hours (2026-05-10 to 2026-05-11).

Failing Tests

| Test | Build | Builds Affected (Total) | First Failure | Pattern |
| --- | --- | --- | --- | --- |
| RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadWithReducedAllowedNodes | 76481 | 112 | 2024-04-03 | Worsening |

Detailed Findings

RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadWithReducedAllowedNodes

Build: 76481 (Timer, main)

Error:

java.lang.AssertionError: replica shards haven't caught up with primary expected:<25> but was:<22>
  at OpenSearchIntegTestCase.waitForReplication(OpenSearchIntegTestCase.java:2570)
  at RecoveryWhileUnderLoadIT.assertAfterRefreshAndWaitForReplication(RecoveryWhileUnderLoadIT.java:504)
  at RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadWithReducedAllowedNodes(RecoveryWhileUnderLoadIT.java:350)

Seed: C8CCF036B428F9A5:AC55FEE67C7B2DF6

Local reproduction: NOT reproducible. Ran 6 times with the original seed; all passed. The failure is timing-dependent, so fixing the seed alone does not make it deterministic.

Historical pattern (monthly unique builds affected):

  • 2024-04 to 2024-08: Low (1-4/month)
  • 2024-09 to 2025-03: Mostly dormant (0-1/month)
  • 2025-04: 22 (spike begins)
  • 2025-06: 77 (peak)
  • 2025-07: 43
  • 2025-08 to 2026-01: Low (0-14/month)
  • 2026-02: 13 (resurgence)
  • 2026-03: 29
  • 2026-04: 28
  • 2026-05: 19 (11 days in, on pace for ~54/month over the 31 days of May)
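
The "on pace" figure for the partial month is a plain linear extrapolation. A minimal sketch of that arithmetic follows, assuming a constant daily failure rate; the class and method names are hypothetical and not part of any reporting tool. With 19 affected builds in 11 days, a 31-day month projects to about 54 builds and a 30-day month to about 52.

```java
// Hypothetical helper illustrating the run-rate arithmetic; not taken from the report tooling.
final class RunRateProjection {

    /** Linearly extrapolates a partial-month count to a full month, rounded to the nearest build. */
    static long projectMonthly(int buildsSoFar, int daysElapsed, int daysInMonth) {
        if (daysElapsed <= 0 || daysElapsed > daysInMonth) {
            throw new IllegalArgumentException("daysElapsed must be in 1..daysInMonth");
        }
        // Assumes failures arrive at a constant daily rate for the rest of the month.
        return Math.round((double) buildsSoFar * daysInMonth / daysElapsed);
    }

    public static void main(String[] args) {
        System.out.println("31-day month: " + projectMonthly(19, 11, 31));
        System.out.println("30-day month: " + projectMonthly(19, 11, 30));
    }
}
```

The projection is only as good as the constant-rate assumption; the monthly counts above vary widely, so it is best read as a rough indicator.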

Assessment: This is a chronic flaky test with a worsening trend. The recent resurgence began in February 2026, before the CI runner migration to faster m7a.8xlarge instances in mid-April 2026, so the migration cannot account for its onset, although it may amplify timing-sensitive races from April onward. The test exercises segment replication recovery under load with node allocation changes, a scenario where replica catch-up timing is inherently non-deterministic. The assertBusy timeout in waitForReplication appears insufficient for the replicas to catch up under these conditions.
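
To illustrate why a bounded wait is sensitive to timing, here is a minimal, self-contained sketch of an assertBusy-style polling loop. It is a hypothetical stand-in written for this report, not the OpenSearch implementation; `BoundedWait`, `waitUntil`, and the simulated replica counters are assumptions. The doc counts 22 and 25 mirror the assertion message above.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for an assertBusy-style wait; NOT the OpenSearch implementation.
final class BoundedWait {

    /** Polls the condition with capped exponential backoff until it holds or the wait budget is spent. */
    static boolean waitUntil(BooleanSupplier condition, long maxWait, TimeUnit unit) {
        final long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (remainingMillis <= 0) {
                // Budget exhausted and the condition was false on the check just above.
                return false;
            }
            try {
                Thread.sleep(Math.min(sleepMillis, remainingMillis));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
            sleepMillis = Math.min(sleepMillis * 2, 100); // back off, capped at 100 ms
        }
    }

    public static void main(String[] args) {
        final int primaryDocs = 25;

        // Simulated replica that gains one doc per poll: 22 -> 25 within a few polls.
        AtomicInteger replicaDocs = new AtomicInteger(22);
        boolean caughtUp = waitUntil(() -> replicaDocs.incrementAndGet() >= primaryDocs, 5, TimeUnit.SECONDS);
        System.out.println("replica catching up within budget: " + caughtUp);

        // Simulated replica that stays at 22 docs for longer than the wait budget.
        AtomicInteger stalledReplicaDocs = new AtomicInteger(22);
        boolean stalled = waitUntil(() -> stalledReplicaDocs.get() >= primaryDocs, 20, TimeUnit.MILLISECONDS);
        System.out.println("stalled replica within budget: " + stalled);
    }
}
```

In this sketch the outcome depends only on whether the condition becomes true before the budget runs out, so any shift in replica catch-up time relative to the budget flips the result even with an identical seed.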

Other Builds

The remaining Timer builds on main in this period either passed all tests or experienced build-level (non-test) failures:

  • Build 76511: FAILURE, 0 test failures (build infrastructure issue, only 144 tests ran)
  • Build 76493: FAILURE, 0 test failures (build infrastructure issue, only 257 tests ran)
  • Build 76457 (Post Merge Action): FAILURE, 0 test failures (build infrastructure issue)
  • All other Timer builds: Passed all tests
