Resolve database/GitHub task status disconnect #2961
m-blaha wants to merge 4 commits into packit:main from
Conversation
Code Review
This pull request effectively addresses the disconnect between the task status stored in the database and the status reported to GitHub. By introducing the reraise_transient_errors flag and reordering status updates and metric recording, the system can now correctly retry operations when transient GitHub API errors occur, ensuring data consistency. The changes are well-implemented and directly tackle the problem described.
# set status in db after successful GitHub reporting
self.build.set_status(BuildStatus.success)
# TODO: probably also on server errors (5xx)
return code == 429
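The check above could be extended to cover the 5xx TODO. A minimal sketch (the function name is hypothetical, not from the PR):

```python
def is_transient_error(status_code: int) -> bool:
    """Return True for errors worth retrying: rate limits and server errors.

    429 mirrors the current PR behaviour; treating 5xx responses as
    transient is the TODO noted in the snippet above.
    """
    return status_code == 429 or 500 <= status_code < 600


assert is_transient_error(429)
assert is_transient_error(503)
assert not is_transient_error(404)
```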
# Only execute the following if GitHub reporting succeeded
self.pushgateway.copr_builds_finished.inc()
if self.build.task_accepted_time:
    copr_build_time = elapsed_seconds(
        begin=self.build.task_accepted_time,
        end=datetime.now(timezone.utc),
    )
    self.pushgateway.copr_build_finished_time.observe(copr_build_time)
Moving the metrics recording (pushgateway.copr_builds_finished.inc() and pushgateway.copr_build_finished_time.observe()) inside this block, after the report_status_to_all_for_chroot call, is a good improvement. It ensures that these metrics are only updated if the GitHub status reporting was successful, aligning with the PR's goal of maintaining consistency between internal state and external reports.
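The reordering described in this comment can be sketched as follows; all names here (`finish_build`, `TransientForgeError`, the dict-based build/metrics stand-ins) are hypothetical simplifications of the real helper objects:

```python
from datetime import datetime, timezone


class TransientForgeError(Exception):
    """Stand-in for a re-raised transient forge error (e.g. HTTP 429)."""


def elapsed_seconds(begin, end):
    return (end - begin).total_seconds()


def finish_build(report, build, metrics):
    """Report to the forge first; update DB state and metrics only on success."""
    report()  # may raise TransientForgeError -> Celery would retry the handler
    build["status"] = "success"   # DB write happens after successful reporting
    metrics["finished"] += 1      # counter incremented once per success
    if build.get("task_accepted_time"):
        metrics["duration"] = elapsed_seconds(
            build["task_accepted_time"], datetime.now(timezone.utc)
        )


def failing_report():
    raise TransientForgeError()


build = {"status": "pending", "task_accepted_time": datetime.now(timezone.utc)}
metrics = {"finished": 0}

# A failing report leaves both the DB status and the counter untouched,
# so a retried handler does not double-count.
try:
    finish_build(failing_report, build, metrics)
except TransientForgeError:
    pass
assert build["status"] == "pending" and metrics["finished"] == 0

# A successful report updates both.
finish_build(lambda: None, build, metrics)
assert build["status"] == "success" and metrics["finished"] == 1
```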
# Report to GitHub - if this fails with a transient error, the exception
# will be re-raised and TaskWithRetry will retry the entire handler
self.testing_farm_job_helper.report_status_to_tests_for_test_target(
    state=status,
    description=summary,
    target=test_run_model.target,
    url=url if url else self.log_url,
    links_to_external_services={"Testing Farm": self.log_url},
)
Build succeeded. ✔️ pre-commit SUCCESS in 1m 49s
lbarcziova left a comment
This approach looks good to me, but I think we should make sure that on the last call(s) in babysit tasks we still update at least the DB (the old behaviour). Maybe by setting reraise_transient_errors to False on the final babysit retry, or a similar approach that ensures the DB status is updated even if GitHub reporting fails after all retries are exhausted.
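One way to implement this "final retry" suggestion, sketched with hypothetical helpers around Celery's `self.request.retries` / `max_retries` counters (none of these names are from the PR):

```python
def is_last_retry(retries: int, max_retries: int) -> bool:
    """True when the current attempt is the final one Celery will run."""
    return retries >= max_retries


def should_reraise_transient_errors(retries: int, max_retries: int) -> bool:
    # Re-raise (and therefore retry) on every attempt except the last,
    # so the final attempt falls back to the old swallow-and-update-DB
    # behaviour and the database status is never left stale.
    return not is_last_retry(retries, max_retries)


assert should_reraise_transient_errors(0, 3)   # first attempt: retry
assert not should_reraise_transient_errors(3, 3)  # last attempt: update DB
```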
    job_config=job_config,
    event=event_dict,
)
# TODO: Consider time-based heuristic instead of always False
What about doing something like here, and using that as the condition for setting the reporter?
Is there such a thing as a "last try" for a babysit task? My impression is that babysit runs indefinitely, until the task status in the db changes from "pending".
Ah, good point, I hadn't realised we do this only for the babysit of an individual copr build, see here. For the global build/test babysit based on the DB, there is a timeout of 7 days. So this becomes trickier. But I fear that if we always set this to False for babysitting, we might still often bump into the database/forge disconnect issue.
I agree that the babysit copr build task does much of the work.
However, I don't think it's easy to solve the problem.
The babysit task is retried for the following reasons:
1. build hasn't started yet
2. exception during SRPM update
3. exception during build update
4. build hasn't ended
Reasons 1 and 4 are the most common, and you can see in this graph how often we retry the babysit Copr build task.
Personally, I would like to be able to spot problems like the transient errors in the above graph, but we can't because of points 1 and 4. For those, shouldn't we instead schedule a fresh run for later?
@majamassarini this applies to only the individual babysit task, right? I agree, it could be fixed, but maybe outside of the scope of this PR?
Looking into the babysitting, I am also thinking we could set reraise_transient_errors=True for the individual babysit and False for the periodic ones. If the individual babysit fails due to a GitHub API error, the periodic babysit runs regularly and will eventually catch any builds still stuck as pending and retry the update. WDYT @m-blaha? Would you like to keep this PR, test it on stg, and implement this as a follow-up?
Yes sure, outside the scope of this PR sounds good to me. And yes, this is something only related to the babysit task.
Actually, it seems that handler.set_status_reporter_reraise_transient_errors(False) here doesn't have any effect. The handler instance is only transient and is destroyed after it goes out of scope. celery_run_async is called with signatures, and we would need to somehow pass reraise_transient_errors to get_signatures(), and then back when celery re-creates the handler from the signature.
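A possible way to thread the flag through the signature round-trip described above; all names here are hypothetical simplifications of `get_signatures()` and the worker-side handler re-creation:

```python
def get_signature_kwargs(event, job_config, reraise_transient_errors):
    """Producer side: everything here must be serializable, because it
    travels inside the Celery signature rather than on the (transient)
    handler instance."""
    return {
        "event": event,
        "job_config": job_config,
        "reraise_transient_errors": reraise_transient_errors,
    }


def recreate_handler(kwargs):
    """Worker side: the re-created handler reads the flag back from its
    kwargs, defaulting to True when the signature predates the flag."""
    return {"reraise": kwargs.get("reraise_transient_errors", True)}


sig = get_signature_kwargs({"id": 1}, {"job": "copr_build"}, False)
assert recreate_handler(sig)["reraise"] is False
```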
    description="SRPM build succeeded. Waiting for RPM build to start...",
    url=url,
)
except GithubAPIException:
Looking further at the code changes in the status reporter, are we going to start with only GitHub (not GitLab)? If yes, could you add a note here on this, so that it's clear?
Yes, I started with GitHub only, but now I'm thinking about making the change more general and implementing it for all forges.
majamassarini left a comment
LGTM, thanks!
I have just one question: I was expecting reraise_transient_errors to also be True for CoprBuildHandler and TestingFarmHandler, since both of them update check statuses.
Introduce an optional `reraise_transient_errors` parameter to the StatusReporter class, which defaults to False, maintaining backward compatibility.
When handling forge errors while setting the status, re-raise transient errors if enabled. Currently this handles rate-limit errors (429), but in the future we can also add 5xx server errors. Implemented for the GitHub and GitLab status reporters; Pagure does not swallow any exception, so it should work there as well.
Set the reraise_transient_errors=True parameter for status reporters in the CoprBuildJobHelper and TestingFarmJobHelper constructors. Also, non-idempotent operations (like incrementing metric counters) were moved after successful GitHub reporting. This change should prevent a disconnect between the status stored in the database and the one presented to the user, since the database is now also updated only after successful GitHub reporting. Fixes packit#2940
Tests that reporter instances re-raise transient API errors when asked to. Tests that the CoprBuildEndHandler and TFResultsHandler handlers set transient-error re-raising.
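The behaviour described in these commit messages can be condensed into a sketch; `ForgeAPIError` and `_set_status_on_forge` are hypothetical stand-ins for the real forge client calls, not the PR's actual code:

```python
class ForgeAPIError(Exception):
    """Stand-in for a forge API exception carrying an HTTP status code."""

    def __init__(self, status_code: int):
        super().__init__(f"forge API error {status_code}")
        self.status_code = status_code


class StatusReporter:
    def __init__(self, reraise_transient_errors: bool = False):
        # Defaults to False: old swallow-everything behaviour is preserved.
        self.reraise_transient_errors = reraise_transient_errors

    @staticmethod
    def _is_transient(e: ForgeAPIError) -> bool:
        return e.status_code == 429  # TODO: probably also 5xx server errors

    def set_status(self, state: str) -> bool:
        try:
            self._set_status_on_forge(state)  # may raise ForgeAPIError
            return True
        except ForgeAPIError as e:
            if self.reraise_transient_errors and self._is_transient(e):
                raise  # let the retry machinery re-run the whole handler
            return False  # old behaviour: swallow the error and continue

    def _set_status_on_forge(self, state: str):
        # Simulate a rate-limited forge for demonstration purposes.
        raise ForgeAPIError(429)


# Default reporter swallows the error (backward compatible).
assert StatusReporter().set_status("success") is False

# Opt-in reporter re-raises the transient error so the task is retried.
raised = False
try:
    StatusReporter(reraise_transient_errors=True).set_status("success")
except ForgeAPIError:
    raised = True
assert raised
```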
Good point, I looked into this. Both
In these lines of
Agreed, the priority would be having the disconnect resolved for the final state.
    f"status for '{check_name}': {e}."
)
raise
self._comment_as_set_status_fallback(e, state, description, check_name, url)
Did we get the comments though? 👀
TODO:
Fixes the disconnect between the task status stored in the packit database and the status presented to the GitHub user. For example, on the GitHub side the task seems to be stuck in the "running" state, but in fact (and in the database) it finished successfully.
The root cause is that StatusReporter.set_status() never raised an exception (even in situations where it actually did not report anything to the user, e.g. due to API rate limits). On the other hand, the database was updated every time.
This PR introduces a new reraise_transient_errors attribute to the StatusReporter class, and based on its value the set_status method re-raises certain transient GitHub exceptions. At the moment, only the rate-limit error.
Consider this more a proof-of-concept PR; I'm still not 100% sure this will work and not break things in other places. Any comments are much appreciated!
Related issue: #2940