Resolve database/GitHub task status disconnect #2961
m-blaha wants to merge 4 commits into packit:main from
Conversation
Code Review
This pull request effectively addresses the disconnect between the task status stored in the database and the status reported to GitHub. By introducing the reraise_transient_errors flag and reordering status updates and metric recording, the system can now correctly retry operations when transient GitHub API errors occur, ensuring data consistency. The changes are well-implemented and directly tackle the problem described.
# set status in db after successful GitHub reporting
self.build.set_status(BuildStatus.success)
# TODO: probably also on server errors (5xx)
return code == 429
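The check above could be extended to cover the 5xx TODO. A minimal sketch (the function name is hypothetical, not from the PR):

```python
def is_transient_error(status_code: int) -> bool:
    """Return True for errors worth retrying: rate limits and server errors.

    429 mirrors the current PR behaviour; treating 5xx responses as
    transient is the TODO noted in the snippet above.
    """
    return status_code == 429 or 500 <= status_code < 600


assert is_transient_error(429)
assert is_transient_error(503)
assert not is_transient_error(404)
```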
# Only execute the following if GitHub reporting succeeded
self.pushgateway.copr_builds_finished.inc()
if self.build.task_accepted_time:
    copr_build_time = elapsed_seconds(
        begin=self.build.task_accepted_time,
        end=datetime.now(timezone.utc),
    )
    self.pushgateway.copr_build_finished_time.observe(copr_build_time)
Moving the metrics recording (pushgateway.copr_builds_finished.inc() and pushgateway.copr_build_finished_time.observe()) inside this block, after the report_status_to_all_for_chroot call, is a good improvement. It ensures that these metrics are only updated if the GitHub status reporting was successful, aligning with the PR's goal of maintaining consistency between internal state and external reports.
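The reordering described in this comment can be sketched as follows; all names here (`finish_build`, `TransientForgeError`, the dict-based build/metrics stand-ins) are hypothetical simplifications of the real helper objects:

```python
from datetime import datetime, timezone


class TransientForgeError(Exception):
    """Stand-in for a re-raised transient forge error (e.g. HTTP 429)."""


def elapsed_seconds(begin, end):
    return (end - begin).total_seconds()


def finish_build(report, build, metrics):
    """Report to the forge first; update DB state and metrics only on success."""
    report()  # may raise TransientForgeError -> Celery would retry the handler
    build["status"] = "success"   # DB write happens after successful reporting
    metrics["finished"] += 1      # counter incremented once per success
    if build.get("task_accepted_time"):
        metrics["duration"] = elapsed_seconds(
            build["task_accepted_time"], datetime.now(timezone.utc)
        )


def failing_report():
    raise TransientForgeError()


build = {"status": "pending", "task_accepted_time": datetime.now(timezone.utc)}
metrics = {"finished": 0}

# A failing report leaves both the DB status and the counter untouched,
# so a retried handler does not double-count.
try:
    finish_build(failing_report, build, metrics)
except TransientForgeError:
    pass
assert build["status"] == "pending" and metrics["finished"] == 0

# A successful report updates both.
finish_build(lambda: None, build, metrics)
assert build["status"] == "success" and metrics["finished"] == 1
```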
# Report to GitHub - if this fails with a transient error, the exception
# will be re-raised and TaskWithRetry will retry the entire handler
self.testing_farm_job_helper.report_status_to_tests_for_test_target(
    state=status,
    description=summary,
    target=test_run_model.target,
    url=url if url else self.log_url,
    links_to_external_services={"Testing Farm": self.log_url},
)
Build succeeded. ✔️ pre-commit SUCCESS in 1m 49s
lbarcziova left a comment
This approach looks good to me, but I think we should make sure that on the last call(s) in babysit tasks we still update at least the DB (the old behaviour). Maybe by setting reraise_transient_errors to False on the final babysit retry, or a similar approach that ensures the DB status is updated even if GitHub reporting fails after all retries are exhausted.
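One way to implement this "final retry" suggestion, sketched with hypothetical helpers around Celery's `self.request.retries` / `max_retries` counters (none of these names are from the PR):

```python
def is_last_retry(retries: int, max_retries: int) -> bool:
    """True when the current attempt is the final one Celery will run."""
    return retries >= max_retries


def should_reraise_transient_errors(retries: int, max_retries: int) -> bool:
    # Re-raise (and therefore retry) on every attempt except the last,
    # so the final attempt falls back to the old swallow-and-update-DB
    # behaviour and the database status is never left stale.
    return not is_last_retry(retries, max_retries)


assert should_reraise_transient_errors(0, 3)   # first attempt: retry
assert not should_reraise_transient_errors(3, 3)  # last attempt: update DB
```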
    job_config=job_config,
    event=event_dict,
)
# TODO: Consider time-based heuristic instead of always False
What about doing something like here, and using that as the condition for setting the reporter?
Is there such a thing as a "last try" for a babysit task? My impression is that babysit runs indefinitely, until the task status in the db changes from "pending".
Ah, good point, I hadn't realised we do this only for the babysit of an individual copr build, see here. For the global build/test babysit based on the DB, there is a timeout of 7 days. So this becomes trickier. But I fear that if we always set this to False for babysitting, we might still often bump into the database/forge disconnect issue.
I agree that the babysit copr build task does much of the work.
However, I don't think it's easy to solve the problem.
The babysit task is retried for the following reasons:
1. build hasn't started yet
2. exception during SRPM update
3. exception during build update
4. build hasn't ended
Reasons 1 and 4 are the most common, and you can see in this graph how often we retry the babysit Copr build task.
Personally, I would like to be able to spot problems like the transient errors in the above graph, but we can't because of points 1 and 4. For those, shouldn't we instead schedule a fresh run for later?
@majamassarini this applies to only the individual babysit task, right? I agree, it could be fixed, but maybe outside of the scope of this PR?
Looking into the babysitting, I am also thinking we could set reraise_transient_errors=True for the individual babysit and False for the periodic ones. If the individual babysit fails due to a GitHub API error, the periodic babysit runs regularly and will eventually catch any builds still stuck as pending and retry the update. WDYT @m-blaha? Would you like to keep this PR, test it on stg, and implement this as a follow-up?
Yes sure, outside the scope of this PR sounds good to me. And yes, this is something only related to the babysit task.
Actually, it seems that handler.set_status_reporter_reraise_transient_errors(False) here doesn't have any effect. The handler instance is only transient and is destroyed after it goes out of scope. celery_run_async is called with signatures, and we would need to somehow pass reraise_transient_errors to get_signatures(), and then back when celery re-creates the handler from the signature.
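A possible way to thread the flag through the signature round-trip described above; all names here are hypothetical simplifications of `get_signatures()` and the worker-side handler re-creation:

```python
def get_signature_kwargs(event, job_config, reraise_transient_errors):
    """Producer side: everything here must be serializable, because it
    travels inside the Celery signature rather than on the (transient)
    handler instance."""
    return {
        "event": event,
        "job_config": job_config,
        "reraise_transient_errors": reraise_transient_errors,
    }


def recreate_handler(kwargs):
    """Worker side: the re-created handler reads the flag back from its
    kwargs, defaulting to True when the signature predates the flag."""
    return {"reraise": kwargs.get("reraise_transient_errors", True)}


sig = get_signature_kwargs({"id": 1}, {"job": "copr_build"}, False)
assert recreate_handler(sig)["reraise"] is False
```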
    description="SRPM build succeeded. Waiting for RPM build to start...",
    url=url,
)
except GithubAPIException:
Looking further at the code changes in the status reporter, are we going to start with only GitHub (not GitLab)? If yes, could you add a note here on this, so that it's clear?
Yes, I started with GitHub only, but now I'm thinking about making the change more general and implementing it for all forges.
majamassarini left a comment
LGTM, thanks!
I have just one question: I was expecting reraise_transient_errors to also be True for CoprBuildHandler and TestingFarmHandler, since both of them update check statuses.
Introduce an optional `reraise_transient_errors` parameter to the StatusReporter class, which defaults to False, maintaining backward compatibility.
When handling forge errors while setting the status, re-raise transient errors if enabled. Currently this handles rate-limit errors (429), but in the future we can also add 5xx server errors. Implemented for the GitHub and GitLab status reporters; Pagure does not swallow any exception, so it should work there as well.
Set the reraise_transient_errors=True parameter for status reporters in the CoprBuildJobHelper and TestingFarmJobHelper constructors. Also, non-idempotent operations (like incrementing metric counters) were moved after successful GitHub reporting. This change should prevent a disconnect between the status stored in the database and the one presented to the user, since the database is now also updated only after successful GitHub reporting. Fixes packit#2940
Tests that reporter instances re-raise transient API errors when asked to. Tests that the CoprBuildEndHandler and TFResultsHandler handlers set transient-error re-raising.
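The behaviour described in these commit messages can be condensed into a sketch; `ForgeAPIError` and `_set_status_on_forge` are hypothetical stand-ins for the real forge client calls, not the PR's actual code:

```python
class ForgeAPIError(Exception):
    """Stand-in for a forge API exception carrying an HTTP status code."""

    def __init__(self, status_code: int):
        super().__init__(f"forge API error {status_code}")
        self.status_code = status_code


class StatusReporter:
    def __init__(self, reraise_transient_errors: bool = False):
        # Defaults to False: old swallow-everything behaviour is preserved.
        self.reraise_transient_errors = reraise_transient_errors

    @staticmethod
    def _is_transient(e: ForgeAPIError) -> bool:
        return e.status_code == 429  # TODO: probably also 5xx server errors

    def set_status(self, state: str) -> bool:
        try:
            self._set_status_on_forge(state)  # may raise ForgeAPIError
            return True
        except ForgeAPIError as e:
            if self.reraise_transient_errors and self._is_transient(e):
                raise  # let the retry machinery re-run the whole handler
            return False  # old behaviour: swallow the error and continue

    def _set_status_on_forge(self, state: str):
        # Simulate a rate-limited forge for demonstration purposes.
        raise ForgeAPIError(429)


# Default reporter swallows the error (backward compatible).
assert StatusReporter().set_status("success") is False

# Opt-in reporter re-raises the transient error so the task is retried.
raised = False
try:
    StatusReporter(reraise_transient_errors=True).set_status("success")
except ForgeAPIError:
    raised = True
assert raised
```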
Good point, I looked into this. Both
In these lines of
Agreed, the priority would be having the disconnect resolved for the final state.
    f"status for '{check_name}': {e}."
)
raise
self._comment_as_set_status_fallback(e, state, description, check_name, url)
Did we get the comments though? 👀
TODO:
Fixes the disconnect between the task status stored in the packit database and the status presented to the GitHub user. For example, on the GitHub side the task seems to be stuck in the "running" state, but in fact (and in the database) it finished successfully.
The root cause is that StatusReporter.set_status() never raised an exception (even in situations where it actually did not report anything to the user, e.g. due to API rate limits). On the other hand, the database was updated every time.
This PR introduces a new reraise_transient_errors attribute to the StatusReporter class, and based on its value the set_status method re-raises certain transient GitHub exceptions. At the moment, only the rate-limit error.
Consider this more a proof-of-concept PR; I'm still not 100% sure this will work and not break things in other places. Any comments are much appreciated!
Related issue: #2940