chore(ci): add RabbitMQ healthcheck and CI wait step to prevent startup race condition #3569

Closed
Sukuna0007Abhi wants to merge 2 commits into augurlabs:main from Sukuna0007Abhi:fix/add-rabbitmq-healthcheck-3

@Sukuna0007Abhi
Contributor

@Sukuna0007Abhi Sukuna0007Abhi commented Jan 12, 2026

I first proposed this on Slack: https://chaoss-workspace.slack.com/archives/C0226ELG6R4/p1768225953312839?thread_ts=1768225953.312839&cid=C0226ELG6R4

Description

Adds a Docker healthcheck for RabbitMQ. On first startup RabbitMQ needs time to finish syncing its Mnesia tables, and this was causing some CI timeouts. The healthcheck gives RabbitMQ enough time to start up correctly and stops premature connection attempts.
Also adds a CI step that waits for RabbitMQ to be healthy before streaming logs and running E2E checks, which prevents flaky E2E failures when services aren't fully initialized. Found in https://github.com/chaoss/augur/actions/runs/20908242825/job/60065967943?pr=3534

Changes

Add healthcheck to rabbitmq in docker-compose.yml (uses rabbitmq-diagnostics ping).
Update build_docker.yml start step to start compose detached, poll RabbitMQ readiness, then stream logs into await_all.py (increased timeout).
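
As a rough illustration of the "poll RabbitMQ readiness" step (not the actual workflow code in this PR), the wait loop could look like the Python sketch below. The ping command is the one listed under Test. The function names, the 120-second timeout and the 2-second interval are assumptions made for illustration only.

```python
import subprocess
import time
from typing import Callable

# Command taken from the "Test" section of this PR. The compose service
# name "rabbitmq" is assumed to match docker-compose.yml.
PING_CMD = ["docker", "compose", "exec", "-T", "rabbitmq",
            "rabbitmq-diagnostics", "-q", "ping"]


def rabbitmq_ping() -> bool:
    """Return True if the ping command exits with status 0."""
    try:
        result = subprocess.run(PING_CMD, capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        # docker is missing, or the command hung: treat as "not ready yet".
        return False
    return result.returncode == 0


def wait_until_ready(check: Callable[[], bool] = rabbitmq_ping,
                     timeout: float = 120.0,   # hypothetical default
                     interval: float = 2.0,    # hypothetical default
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll check() until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A CI step could call `wait_until_ready()` after starting compose and fail the job if it returns `False`. The `check`, `sleep` and `clock` parameters are injectable so the loop can be exercised without Docker.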

Test

Locally: docker compose up --build and docker compose exec -T rabbitmq rabbitmq-diagnostics -q ping should succeed.
CI: E2E should wait for RabbitMQ and run reliably.

Notes for Reviewers

Signed commits

  • Yes, I signed my commits.

…up race

Signed-off-by: Sukuna0007Abhi <appsonly310@gmail.com>
@nexpectArpit nexpectArpit force-pushed the fix/add-rabbitmq-healthcheck-3 branch from 713e4a4 to 470ad45 Compare January 12, 2026 16:12
…bitMQ; avoid -d

Signed-off-by: Sukuna0007Abhi <appsonly310@gmail.com>
@nexpectArpit nexpectArpit force-pushed the fix/add-rabbitmq-healthcheck-3 branch from 58011a9 to 7f31b89 Compare January 12, 2026 16:26
@shlokgilda shlokgilda added the redundant label (PR is submitted in parallel with another mutually exclusive PR) Jan 12, 2026
@shlokgilda
Collaborator

shlokgilda commented Jan 12, 2026

Thanks for the contribution. Did you mean to open this PR for #3548? If not, can you please link this PR to an issue?

@Sukuna0007Abhi
Contributor Author

Actually @shlokgilda, I found this while looking into the CI failure at https://github.com/chaoss/augur/actions/runs/20908242825/job/60065967943?pr=3534, on PR #3544.

I first proposed a fix on Slack (https://chaoss-workspace.slack.com/archives/C0226ELG6R4/p1768225953312839?thread_ts=1768225953.312839&cid=C0226ELG6R4) and then made some further improvements. The changes are similar to the race-condition fix in #3548, but they are not identical to it.

@MoralCode
Collaborator

MoralCode commented Jan 12, 2026

I think this has more to do with augur trying to connect at a time when augur is inexplicably restarting the db?

Stack Trace

augur-db-1      | REFRESH MATERIALIZED VIEW
augur-db-1      | REFRESH MATERIALIZED VIEW
augur-db-1      | REFRESH MATERIALIZED VIEW
augur-db-1      | REFRESH MATERIALIZED VIEW
augur-db-1      | REFRESH MATERIALIZED VIEW
augur-db-1      | REFRESH MATERIALIZED VIEW
augur-db-1      | REFRESH MATERIALIZED VIEW
augur-db-1      | REFRESH MATERIALIZED VIEW
augur-db-1      | REFRESH MATERIALIZED VIEW
augur-db-1      | REFRESH MATERIALIZED VIEW
augur-db-1      | REFRESH MATERIALIZED VIEW
augur-db-1      | REFRESH MATERIALIZED VIEW
augur-db-1      | REFRESH MATERIALIZED VIEW
augur-db-1      | REFRESH MATERIALIZED VIEW
augur-db-1      | 
augur-db-1      | 
augur-db-1      | waiting for server to shut down...2026-01-12 07:24:01.757 UTC [49] LOG:  received fast shutdown request
augur-db-1      | .2026-01-12 07:24:01.758 UTC [49] LOG:  aborting any active transactions
augur-db-1      | 2026-01-12 07:24:01.760 UTC [49] LOG:  background worker "logical replication launcher" (PID 55) exited with exit code 1
augur-db-1      | 2026-01-12 07:24:01.763 UTC [50] LOG:  shutting down
augur-db-1      | 2026-01-12 07:24:01.764 UTC [50] LOG:  checkpoint starting: shutdown immediate
augur-db-1      | 2026-01-12 07:24:01.811 UTC [50] LOG:  checkpoint complete: wrote 1393 buffers (8.5%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.027 s, sync=0.017 s, total=0.049 s; sync files=1082, longest=0.003 s, average=0.001 s; distance=8714 kB, estimate=8714 kB; lsn=0/1D75418, redo lsn=0/1D75418
augur-1         | 
augur-1         | 
augur-1         | augur db create-schema command setup failed
augur-1         | ERROR: connecting to database
augur-1         | HINT: The port is may be incorrectly specified in the AUGUR_DB environment variable
augur-1         | AUGUR_DB=***augur-db:5432/augur
augur-1         | 
augur-db-1      | 2026-01-12 07:24:01.826 UTC [49] LOG:  database system is shut down
augur-db-1      |  done
augur-db-1      | server stopped
augur-db-1      | 
augur-db-1      | PostgreSQL init process complete; ready for start up.
augur-db-1      | 
augur-db-1      | 2026-01-12 07:24:01.885 UTC [1] LOG:  starting PostgreSQL 16.11 (Debian 16.11-1.pgdg13+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 14.2.0-19) 14.2.0, 64-bit
augur-db-1      | 2026-01-12 07:24:01.885 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
augur-db-1      | 2026-01-12 07:24:01.885 UTC [1] LOG:  listening on IPv6 address "::", port 5432
augur-db-1      | 2026-01-12 07:24:01.889 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
augur-db-1      | 2026-01-12 07:24:01.894 UTC [67] LOG:  database system was shut down at 2026-01-12 07:24:01 UTC
augur-db-1      | 2026-01-12 07:24:01.925 UTC [1] LOG:  database system is ready to accept connections

augur-1 exited with code 254

So yeah, this is likely the same underlying race condition.

That said, the contents of this PR just add a health check for rabbitmq too (assuming this health check works and is supported by RabbitMQ's documentation). It won't fix the race condition, but maybe it's still useful?

@MoralCode MoralCode added the containers label (Related to augur in containers, container images, or the compose file, either in podman or in docker) and removed the redundant label (PR is submitted in parallel with another mutually exclusive PR) Jan 12, 2026
@MoralCode MoralCode linked an issue Jan 19, 2026 that may be closed by this pull request
@MoralCode MoralCode added the redundant label (PR is submitted in parallel with another mutually exclusive PR) Jan 19, 2026
@MoralCode
Collaborator

Superseded by #3613

@MoralCode MoralCode closed this Jan 19, 2026

Labels

containers: Related to augur in containers, container images, or the compose file, either in podman or in docker
redundant: PR is submitted in parallel with another mutually exclusive PR

Development

Successfully merging this pull request may close these issues.

Augur db somehow gets patially intialized
