Skip to content

Remove outer transaction from submit-and-register-private-job#293

Open
johnworth wants to merge 2 commits intomainfrom
multi-cluster
Open

Remove outer transaction from submit-and-register-private-job#293
johnworth wants to merge 2 commits intomainfrom
multi-cluster

Conversation

@johnworth
Copy link
Copy Markdown
Contributor

The outer transaction caused save-job-submission's inner transaction to become a savepoint rather than a real commit. This meant the jobs row wasn't visible to other DB connections (like app-exposer) when do-jex-submission fired the HTTP call, because the outer transaction hadn't committed yet. Removing the wrapper lets save-job-submission's transaction commit before the HTTP call.

The outer transaction caused save-job-submission's inner transaction to
become a savepoint rather than a real commit. This meant the jobs row
wasn't visible to other DB connections (like app-exposer) when
do-jex-submission fired the HTTP call, because the outer transaction
hadn't committed yet. Removing the wrapper lets save-job-submission's
transaction commit before the HTTP call.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@slr71 slr71 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking about this for a bit, I can see some risks to to removing this transaction without adding some new ones. I can foresee some cases where, for example, a mixed multistep job (for example a job that has some steps that run in the DE and others that run in TAPIS) have the overall job saved and only some of the job steps creating an inconsistent database state. The result of doing that shouldn't be too severe, but I'd like to avoid having an inconsistent database state whenever possible.

I think that the general approach we're going to need to take is to split job submissions into three phases:

  1. In a single database transaction, save the job and all of its steps. The transaction should be committed at the end of this phase.
  2. The job submission should be sent to the downstream execution system. This can be done outside of a transaction as long as we can ensure that no database updates happen during this phase. This phase should be in a try/catch so that we can ensure that we mark the job as failed if it can't be submitted to the execution system.
  3. The results of the job submission should be recorded in some way. Currently, in the case where the job submission is successful, there's not really anything else to do as far as I can tell. Most of the submission paths in apps don't do any database updates after calling the execution system. If the submission fails, however, we'll want to catch that and record the job as failed in the database along with a descriptive message indicating why the job failed. If the job fails, it's probably reasonable to send a notification to the user immediately indicating that it failed.

The unfortunate corollary of this is that either we're going to have to significantly refactor the apps service or we're going to have to carefully examine each path to ensure that transactions are created in the correct places in all cases.

The refactor is fairly large, we'd have to separate the job preparation steps from the job submission steps entirely, which would force us to update the Apps protocol, update all of its implementations accordingly, then finally update all places in the code that use the modified methods. That's a lot of work for this, and it's probably not worthwhile, at least not at this time.

Fortunately, the second option seems to be a little more palatable. In that option, we simply need to ensure that each distinct code path has a transaction around the portions of the code flow that store the jobs in the database, and that this transaction is committed before calling the downstream execution system. We have three cases to consider:

  1. Jobs with only DE steps
  2. Jobs with only one TAPIS step
  3. Jobs with either multiple TAPIS steps or with both DE and TAPIS steps.

Not much has to be done for jobs with only DE steps. The function that records the job already uses a transaction, and the function that submits the job to the downstream execution system (app-exposer) already has a try/catch. The one change that I would like to see is that I'd like to have a transaction added to apps.service.apps.de.jobs/record-submission-failure so that we can ensure that the jobs, job_steps, and job_status_updates tables all end up in a consistent state. (I have to concede the fact that if the job submission fails but we're not able to update the job to indicate that the submission failed then the overall system is still in an inconsistent state even if some tables were updated and others weren't I still like the idea of keeping related database updates in a single transaction, though. Plus, this should be rare, and we'd have to go in later and update the database tables manually to fix the problem with or without the transaction).

Jobs with only TAPIS steps are a little bit different because for single-step jobs, the job isn't actually recorded in the database until after we've attempted to submit it to TAPIS. In both cases, the job and job step are stored in the DE database and the job submission response is formatted. I think that the one change I'd like to see here is that I'd like to see apps.service.apps.tapis.jobs/handle-successful-submission and apps.service.apps.tapis.jobs/handle-failed-submission both modified so that they wrap the calls to store-tapis-job and store-tapis-job-step in a single transaction just to ensure that no jobs are partially recorded in the database.

The third case is a little more involved because it's more complicated. The first change would be to update apps.service.apps.combined.jobs/submit so that it wraps the call to jp/save-multistep-job in a transaction (alternately, it would be just as reasonable to update apps.persistence.jobs/save-multistep-job so that its entire body is wrapped in a transaction). It might also be reasonable to update apps.service.apps.combined.jobs/record-step-submission so that it wraps its body in a transaction (this is just a precaution in case more database updates are added later). I briefly considered a couple of updates to apps.service.apps.de.jobs/submit-step, but upon closer inspection, those updates appear to be superfluous.

So to summarize, I think that these changes are needed:

  • Update apps.service.apps.de.jobs/record-submission-failure to wrap the calls to jp/update-job, jp/update-job-step, and jp/add-job-status-update in a singe transaction.
  • Update apps.service.apps.tapis.jobs/handle-successful-submission so that the calls to store-tapis-job and store-job-step are wrapped in a transaction. It might also be helpful to have a try/catch in here so that we can log a message if any database updates fail.
  • Update apps.service.apps.tapis.jobs/handle-failed-submission so that the calls to store-tapis-job and store-job-step are wrapped in a transaction. It might also be useful to have a try/catch in here so that we can log a message if any database updates fail.
  • Optionally refactor apps.service.apps.tapis.jobs/handle-successful-submission and apps.service.appstapis.jobs/handle-failed-submission to eliminate some code duplication. The duplication is reasonable with the functions as they are now, but adding the additional transaction and try-catch might cause enough duplication to make refactoring worthwhile.
  • Update apps.service.apps.combined.jobs/submit so that the call to jp/save-multistep-job is wrapped in a transaction. Or, equivalently, update apps.persistence.jobs/save-multistep-job so that its body is wrapped in a transaction.
  • Update apps.service.apps.combined.jobs/record-step-submission so that its body is wrapped in a transaction. (This is really just a safety precaution to prevent problems if more database updates are added to this function later.)

Sorry for the wall of text. There's a lot going on in this function despite its apparent simplicity.

After sharing or unsharing a VICE analysis, the apps service now queries
the permissions service for the current list of users with access and
pushes the full allowed-users list to app-exposer, which routes it to
the operator managing the analysis. This updates the permissions ConfigMap
that vice-proxy reads for authorization.

Only applies to interactive (VICE) analyses — batch jobs are skipped.
Failures are logged at error level but don't block the share/unshare
operation, since the permissions DB is the source of truth.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants