Remove outer transaction from submit-and-register-private-job#293
Remove outer transaction from submit-and-register-private-job#293
Conversation
The outer transaction caused save-job-submission's inner transaction to become a savepoint rather than a real commit. This meant the jobs row wasn't visible to other DB connections (like app-exposer) when do-jex-submission fired the HTTP call, because the outer transaction hadn't committed yet. Removing the wrapper lets save-job-submission's transaction commit before the HTTP call. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
slr71
left a comment
There was a problem hiding this comment.
After thinking about this for a bit, I can see some risks to to removing this transaction without adding some new ones. I can foresee some cases where, for example, a mixed multistep job (for example a job that has some steps that run in the DE and others that run in TAPIS) have the overall job saved and only some of the job steps creating an inconsistent database state. The result of doing that shouldn't be too severe, but I'd like to avoid having an inconsistent database state whenever possible.
I think that the general approach we're going to need to take is to split job submissions into three phases:
- In a single database transaction, save the job and all of its steps. The transaction should be committed at the end of this phase.
- The job submission should be sent to the downstream execution system. This can be done outside of a transaction as long as we can ensure that no database updates happen during this phase. This phase should be in a try/catch so that we can ensure that we mark the job as failed if it can't be submitted to the execution system.
- The results of the job submission should be recorded in some way. Currently, in the case where the job submission is successful, there's not really anything else to do as far as I can tell. Most of the submission paths in
appsdon't do any database updates after calling the execution system. If the submission fails, however, we'll want to catch that and record the job as failed in the database along with a descriptive message indicating why the job failed. If the job fails, it's probably reasonable to send a notification to the user immediately indicating that it failed.
The unfortunate corollary of this is that either we're going to have to significantly refactor the apps service or we're going to have to carefully examine each path to ensure that transactions are created in the correct places in all cases.
The refactor is fairly large, we'd have to separate the job preparation steps from the job submission steps entirely, which would force us to update the Apps protocol, update all of its implementations accordingly, then finally update all places in the code that use the modified methods. That's a lot of work for this, and it's probably not worthwhile, at least not at this time.
Fortunately, the second option seems to be a little more palatable. In that option, we simply need to ensure that each distinct code path has a transaction around the portions of the code flow that store the jobs in the database, and that this transaction is committed before calling the downstream execution system. We have three cases to consider:
- Jobs with only DE steps
- Jobs with only one TAPIS step
- Jobs with either multiple TAPIS steps or with both DE and TAPIS steps.
Not much has to be done for jobs with only DE steps. The function that records the job already uses a transaction, and the function that submits the job to the downstream execution system (app-exposer) already has a try/catch. The one change that I would like to see is that I'd like to have a transaction added to apps.service.apps.de.jobs/record-submission-failure so that we can ensure that the jobs, job_steps, and job_status_updates tables all end up in a consistent state. (I have to concede the fact that if the job submission fails but we're not able to update the job to indicate that the submission failed then the overall system is still in an inconsistent state even if some tables were updated and others weren't I still like the idea of keeping related database updates in a single transaction, though. Plus, this should be rare, and we'd have to go in later and update the database tables manually to fix the problem with or without the transaction).
Jobs with only TAPIS steps are a little bit different because for single-step jobs, the job isn't actually recorded in the database until after we've attempted to submit it to TAPIS. In both cases, the job and job step are stored in the DE database and the job submission response is formatted. I think that the one change I'd like to see here is that I'd like to see apps.service.apps.tapis.jobs/handle-successful-submission and apps.service.apps.tapis.jobs/handle-failed-submission both modified so that they wrap the calls to store-tapis-job and store-tapis-job-step in a single transaction just to ensure that no jobs are partially recorded in the database.
The third case is a little more involved because it's more complicated. The first change would be to update apps.service.apps.combined.jobs/submit so that it wraps the call to jp/save-multistep-job in a transaction (alternately, it would be just as reasonable to update apps.persistence.jobs/save-multistep-job so that its entire body is wrapped in a transaction). It might also be reasonable to update apps.service.apps.combined.jobs/record-step-submission so that it wraps its body in a transaction (this is just a precaution in case more database updates are added later). I briefly considered a couple of updates to apps.service.apps.de.jobs/submit-step, but upon closer inspection, those updates appear to be superfluous.
So to summarize, I think that these changes are needed:
- Update
apps.service.apps.de.jobs/record-submission-failureto wrap the calls tojp/update-job,jp/update-job-step, andjp/add-job-status-updatein a singe transaction. - Update
apps.service.apps.tapis.jobs/handle-successful-submissionso that the calls tostore-tapis-jobandstore-job-stepare wrapped in a transaction. It might also be helpful to have a try/catch in here so that we can log a message if any database updates fail. - Update
apps.service.apps.tapis.jobs/handle-failed-submissionso that the calls tostore-tapis-jobandstore-job-stepare wrapped in a transaction. It might also be useful to have a try/catch in here so that we can log a message if any database updates fail. - Optionally refactor
apps.service.apps.tapis.jobs/handle-successful-submissionandapps.service.appstapis.jobs/handle-failed-submissionto eliminate some code duplication. The duplication is reasonable with the functions as they are now, but adding the additional transaction and try-catch might cause enough duplication to make refactoring worthwhile. - Update
apps.service.apps.combined.jobs/submitso that the call tojp/save-multistep-jobis wrapped in a transaction. Or, equivalently, updateapps.persistence.jobs/save-multistep-jobso that its body is wrapped in a transaction. - Update
apps.service.apps.combined.jobs/record-step-submissionso that its body is wrapped in a transaction. (This is really just a safety precaution to prevent problems if more database updates are added to this function later.)
Sorry for the wall of text. There's a lot going on in this function despite its apparent simplicity.
After sharing or unsharing a VICE analysis, the apps service now queries the permissions service for the current list of users with access and pushes the full allowed-users list to app-exposer, which routes it to the operator managing the analysis. This updates the permissions ConfigMap that vice-proxy reads for authorization. Only applies to interactive (VICE) analyses — batch jobs are skipped. Failures are logged at error level but don't block the share/unshare operation, since the permissions DB is the source of truth. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The outer transaction caused save-job-submission's inner transaction to become a savepoint rather than a real commit. This meant the jobs row wasn't visible to other DB connections (like app-exposer) when do-jex-submission fired the HTTP call, because the outer transaction hadn't committed yet. Removing the wrapper lets save-job-submission's transaction commit before the HTTP call.