
Resolve database/GitHub task status disconnect #2961

Open
m-blaha wants to merge 4 commits into packit:main from m-blaha:status-disconnect-pr

Conversation

@m-blaha m-blaha (Member) commented Jan 22, 2026

TODO:

  • Write new tests or update the old ones to cover new functionality.

Fixes the disconnect between the task status stored in the packit database and the status presented to the GitHub user. For example, on the GitHub side the task seems to be stuck in the "running" state, but in fact (and in the database) it finished successfully.

The root cause is that StatusReporter.set_status() never raised an exception (even in situations where it actually did not report anything to the user, e.g. due to API rate limits). The database, on the other hand, was updated every time.

This PR introduces a new reraise_transient_errors attribute to the StatusReporter class; based on its value, the set_status method re-raises certain transient GitHub exceptions. At the moment, only the rate limit error is handled.

Consider this more a proof-of-concept PR; I'm still not 100% sure this will work and not break things in other places. Any comments are much appreciated!

Related issue: #2940
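
The mechanism described in the PR can be sketched roughly as follows. This is a minimal illustration only; apart from `reraise_transient_errors` and `set_status`, the class and helper names are hypothetical, not the actual packit-service code:

```python
class ForgeAPIError(Exception):
    """Stand-in for a forge API exception carrying an HTTP status code."""

    def __init__(self, status_code: int):
        super().__init__(f"forge API error {status_code}")
        self.status_code = status_code


class StatusReporter:
    def __init__(self, reraise_transient_errors: bool = False):
        # Defaults to False to keep the old "swallow everything" behaviour.
        self.reraise_transient_errors = reraise_transient_errors

    @staticmethod
    def is_transient(status_code: int) -> bool:
        # Currently only the rate-limit error is considered transient.
        return status_code == 429

    def _report_to_forge(self, state, description, check_name):
        # In the real code this talks to the GitHub API.
        raise NotImplementedError

    def set_status(self, state, description, check_name):
        try:
            self._report_to_forge(state, description, check_name)
        except ForgeAPIError as e:
            if self.reraise_transient_errors and self.is_transient(e.status_code):
                raise  # let the task-retry machinery re-run the whole handler
            # old behaviour: log the error and carry on as if nothing happened
            print(f"failed to set status: {e}")
```

With `reraise_transient_errors=False` a rate-limit error is silently swallowed (the pre-PR behaviour); with `True` it propagates so the Celery task can retry before the database is touched.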

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request effectively addresses the disconnect between the task status stored in the database and the status reported to GitHub. By introducing the reraise_transient_errors flag and reordering status updates and metric recording, the system can now correctly retry operations when transient GitHub API errors occur, ensuring data consistency. The changes are well-implemented and directly tackle the problem described.

Comment on lines +481 to +482
# set status in db after successful GitHub reporting
self.build.set_status(BuildStatus.success)

high

Moving self.build.set_status(BuildStatus.success) to occur after the GitHub reporting is a critical correctness fix. This ensures that the internal database state accurately reflects the external status, preventing scenarios where the database shows success but GitHub reporting failed.

Comment on lines +46 to +47
# TODO: probably also on server errors (5xx)
return code == 429

medium

The TODO comment suggests including server errors (5xx) as transient. While not critical for this PR, expanding the definition of transient errors to include 5xx responses would make the retry mechanism more robust against various API issues. Consider implementing this in a follow-up.
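
One possible shape of that follow-up, purely illustrative (the PR itself only treats 429 as transient):

```python
def is_transient(code: int) -> bool:
    """Hypothetical extension of the transient-error check.

    429: rate limited by the API; 5xx: server-side failures that
    may well succeed on a later retry.
    """
    return code == 429 or 500 <= code < 600
```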

Comment on lines +348 to +355
# Only execute the following if GitHub reporting succeeded
self.pushgateway.copr_builds_finished.inc()
if self.build.task_accepted_time:
copr_build_time = elapsed_seconds(
begin=self.build.task_accepted_time,
end=datetime.now(timezone.utc),
)
self.pushgateway.copr_build_finished_time.observe(copr_build_time)

medium

Moving the metrics recording (pushgateway.copr_builds_finished.inc() and pushgateway.copr_build_finished_time.observe()) inside this block, after the report_status_to_all_for_chroot call, is a good improvement. It ensures that these metrics are only updated if the GitHub status reporting was successful, aligning with the PR's goal of maintaining consistency between internal state and external reports.

Comment on lines +372 to +378
self.pushgateway.copr_builds_finished.inc()
if self.build.task_accepted_time:
copr_build_time = elapsed_seconds(
begin=self.build.task_accepted_time,
end=datetime.now(timezone.utc),
)
self.pushgateway.copr_build_finished_time.observe(copr_build_time)

medium

Placing the metrics recording after report_successful_build() and under the comment # Only execute the following if GitHub reporting succeeded is a correct and important change. This ensures that the metrics accurately reflect successful operations that have been communicated externally.

Comment on lines +590 to +598
# Report to GitHub - if this fails with a transient error, the exception
# will be re-raised and TaskWithRetry will retry the entire handler
self.testing_farm_job_helper.report_status_to_tests_for_test_target(
state=status,
description=summary,
target=test_run_model.target,
url=url if url else self.log_url,
links_to_external_services={"Testing Farm": self.log_url},
)

medium

Moving the report_status_to_tests_for_test_target call to occur before the metrics recording is a good change. This ensures that if GitHub reporting fails due to a transient error, the task can be retried without prematurely incrementing metrics, maintaining consistency.
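
The ordering argument can be shown with a tiny simulation (names are made up; this is not the actual handler code): if reporting raises first, the non-idempotent metric increment and the DB write never happen, so a retried task cannot double-count or leave the DB ahead of GitHub.

```python
class Counter:
    """Stand-in for a Prometheus/pushgateway counter."""

    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1


def finish_task(report, metrics: Counter, db: dict):
    # 1) report to the forge first; a transient error propagates
    #    and lets the retry machinery re-run the whole handler
    report()
    # 2) only then record non-idempotent metrics and update the DB,
    #    keeping them consistent with what the user sees on GitHub
    metrics.inc()
    db["status"] = "success"
```

On a failed first attempt the counter stays at 0 and the DB is untouched; a successful retry then performs both exactly once.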

@lbarcziova lbarcziova left a comment


this approach looks good to me, but I think we should make sure that on the last call(s) in babysit tasks, we update at least the DB (old behaviour). Maybe by setting reraise_transient_errors to False on the final babysit retry, or a similar approach that ensures the DB status is updated even if GitHub reporting fails after all retries are exhausted.

@lbarcziova lbarcziova moved this from New to In review in Packit pull requests Jan 23, 2026
@m-blaha m-blaha force-pushed the status-disconnect-pr branch from b0adf65 to 5a91adf Compare January 26, 2026 14:07
@m-blaha m-blaha force-pushed the status-disconnect-pr branch from 5a91adf to e3af034 Compare January 26, 2026 20:32
@m-blaha m-blaha force-pushed the status-disconnect-pr branch from e3af034 to 5a56cc6 Compare January 27, 2026 13:27
@m-blaha m-blaha force-pushed the status-disconnect-pr branch from 5a56cc6 to b18ffac Compare January 27, 2026 15:01
@m-blaha m-blaha force-pushed the status-disconnect-pr branch from b18ffac to af96944 Compare January 27, 2026 17:19
job_config=job_config,
event=event_dict,
)
# TODO: Consider time-based heuristic instead of always False
Member:
what about doing something like here, and using that as a condition for setting the reporter?

Member Author:
Is there such a thing as a "last try" for a babysit task? My impression is that babysit runs indefinitely, until the task status in the db changes from "pending".

Member:
ah, good point, I hadn't realised we do this only for the babysit of an individual Copr build, see here. For the global build/test babysit based on the DB, there is a timeout of 7 days. So this becomes trickier. But I fear that if we always set this to False for babysitting, we might still often bump into the database/forge disconnect issue.

Member:
I agree that the babysit copr build task does much of the work.
However, I don't think it's easy to solve the problem.
The babysit task is retried for the following reasons:

  1. build hasn't started yet
  2. exception during SRPM update
  3. exception during build update
  4. build hasn't ended

Reasons 1 and 4 are the most common, and you can see in this graph how often we retry the babysit Copr build task.

Personally, I would like to be able to spot problems, like the transient errors, in the above graph, but we can't because of points 1 and 4. For those, shouldn't we instead schedule a fresh run for later?

Member:
@majamassarini this applies only to the individual babysit task, right? I agree it could be fixed, but maybe outside the scope of this PR?

Looking into the babysitting, I am also thinking we could set reraise_transient_errors=True for the individual babysit and False for the periodic ones. If the individual babysit fails due to a GitHub API error, the periodic babysit runs regularly and will eventually catch any builds still stuck as pending and retry the update. WDYT @m-blaha? Would you like to keep this PR, test it on stg, and implement this as a follow-up?

Member:

yes, sure, outside the scope of this PR sounds good to me. And yes, this is related only to the babysit task.

Member Author:

Actually, it seems that handler.set_status_reporter_reraise_transient_errors(False) here doesn't have any effect. The handler instance is only transient and is destroyed after it goes out of scope. celery_run_async is called with signatures, and we would need to somehow pass reraise_transient_errors to get_signatures(), and then back when Celery re-creates the handler from the signature.

description="SRPM build succeeded. Waiting for RPM build to start...",
url=url,
)
except GithubAPIException:
Member:

looking further at the code changes in the status reporter — are we going to start with GitHub only (not GitLab)? If yes, could you add a note on this here, so that it's clear?

Member Author:

Yes, I started with GitHub only, but now I'm thinking about making the change more general and implementing it for all forges.

@m-blaha m-blaha force-pushed the status-disconnect-pr branch from 3682810 to 7c72528 Compare January 29, 2026 11:47
@centosinfra-prod-github-app
Copy link
Contributor

@m-blaha m-blaha force-pushed the status-disconnect-pr branch from 7c72528 to 8c1a4a4 Compare January 29, 2026 14:50
@centosinfra-prod-github-app
Copy link
Contributor

@majamassarini majamassarini (Member) left a comment

LGTM thanks!

I have just one question: I was expecting reraise_transient_errors to also be True for CoprBuildHandler and TestingFarmHandler, since both of them update check statuses.

Introduce an optional `reraise_transient_errors` parameter to the
StatusReporter class which defaults to False, maintaining backward
compatibility.
When handling forge errors while setting the status, re-raise
transient errors if enabled. Currently this handles rate limit errors
(429); 5xx server errors can be added in the future.
Implemented for the GitHub and GitLab status reporters; Pagure does not
swallow any exceptions, so it should work there as well.
Set the reraise_transient_errors=True parameter for status reporters in
the CoprBuildJobHelper and TestingFarmJobHelper constructors.

Non-idempotent operations (like incrementing metric counters) were also
moved after successful GitHub reporting.

This change should prevent a disconnect between the status stored in the
database and the one presented to the user, since the database is now
also updated only after successful GitHub reporting.

Fixes packit#2940

Tests that reporter instances re-raise transient API errors when asked
to. Tests that the CoprBuildEndHandler and TFResultsHandler handlers
enable re-raising of transient errors.
m-blaha commented Feb 11, 2026

> LGTM thanks!
>
> I have just one question: I was expecting reraise_transient_errors to also be True for CoprBuildHandler and TestingFarmHandler, since both of them update check statuses.

Good point, I looked into this. Both CoprBuildHandler and TestingFarmHandler do report check statuses, but there is a difference: unlike in CoprBuildEndHandler and TestingFarmResultsHandler, the database status is not updated here, so even if the reporting fails, there would be no mismatch between the DB and GitHub status. But I could be mistaken (the flow is quite complicated and it's possible I don't have the complete picture).

@majamassarini (Member)
> Good point, I looked into this. Both CoprBuildHandler and TestingFarmHandler do report check statuses, but there is a difference. Unlike CoprBuildEndHandler and TestingFarmResultsHandler the database status is not updated here, so even if the reporting fails, there would be no mismatch between DB and github status. But I can be mistaken (the flow is quite complicated and it's possible I do not have a complete picture).

In these lines of CoprBuildHandler, we set the DB and after that we report the state. I think we could have a mismatch here, and retrying on a transient error could probably be useful. However, I agree this could be a low-priority follow-up card. Also, since this is an initial state and it is overwritten later, I don't think it affects users too much.
For Testing Farm the situation is similar.

@lbarcziova (Member)
Agreed, the priority would be having the disconnect resolved for the final state.

@m-blaha m-blaha force-pushed the status-disconnect-pr branch from 8c1a4a4 to 78511dc Compare February 11, 2026 13:25
f"status for '{check_name}': {e}."
)
raise
self._comment_as_set_status_fallback(e, state, description, check_name, url)
Member:

Did we get the comments though? 👀
