
fix: set job to FAILED on exception in job_pipeline #3661

Open

Abhishek9639 wants to merge 1 commit into intelowlproject:develop from Abhishek9639:stuck

Conversation

@Abhishek9639
Contributor

Closes #3653

Description

When job.execute() throws an exception inside job_pipeline, the except block only marks individual plugin reports as FAILED but never updates the Job object itself. Since execute() sets status = RUNNING before building the pipeline, any failure after that point leaves the job stuck in RUNNING forever.

This adds the missing cleanup to the except block: setting the job status to FAILED, recording the error, setting finished_analysis_time, updating the parent investigation, and sending a WebSocket notification. This mirrors the pattern already used in run_plugin and set_final_status.
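The cleanup pattern described above can be sketched with plain Python stubs. StubJob and its fields are simplified stand-ins for the real Django Job model, not the actual IntelOwl code; the point is only to show the state transition the fix guarantees when the pipeline raises.

```python
from datetime import datetime, timezone

class StubJob:
    """Simplified stand-in for the Django Job model (hypothetical)."""
    def __init__(self):
        self.status = "running"          # execute() sets RUNNING before the pipeline
        self.errors = []
        self.finished_analysis_time = None
        self.saved_fields = None

    def save(self, update_fields=None):
        # the real model would persist these fields to the database
        self.saved_fields = update_fields


def job_pipeline(job):
    try:
        # stands in for job.execute() raising partway through the pipeline
        raise RuntimeError("plugin pipeline failure")
    except Exception as e:
        # ...existing per-report cleanup runs here...
        # the fix: also move the Job itself out of RUNNING
        job.status = "failed"
        job.errors.append(str(e))
        job.finished_analysis_time = datetime.now(timezone.utc)
        job.save(update_fields=["status", "errors", "finished_analysis_time"])


job = StubJob()
job_pipeline(job)
print(job.status)  # → failed
```

Without the added except-block cleanup, the sketch above would end with `job.status` still `"running"` and `finished_analysis_time` still `None`, which is exactly the stuck state from #3653.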

Type of change

  • Bug fix (non-breaking change which fixes an issue).

Checklist

  • I have read and understood the rules about how to Contribute to this project
  • The pull request is for the branch develop.
  • Linters (Ruff) gave 0 errors. If you have correctly installed pre-commit, it does these checks and adjustments on your behalf.
  • I have added tests for the feature/bug I solved (see tests folder). All the tests (new and old ones) gave 0 errors.

@Abhishek9639
Contributor Author

Hi @mlodic,

This PR fixes the stuck job issue reported in #3653. The except block in job_pipeline was only marking plugin reports as failed but never updating the job itself, so it stayed stuck in RUNNING with no WebSocket notification and a NULL finished_analysis_time.

I aligned the except block with the same cleanup pattern that run_plugin and set_final_status already follow. Added a regression test for it too.

Let me know if anything needs changing.
Thanks

Contributor

@gks281263 left a comment


Hey @Abhishek9639, nice work getting this fix in. The core logic is correct and addresses the real gap in job_pipeline. I did a close review and have a few hardening suggestions below to make it more production-ready. Overall the direction is exactly right and aligned with run_plugin / set_final_status.

Comment thread intel_owl/tasks.py
JobConsumer.serialize_and_send_job(job)


@shared_task(base=FailureLoggedTask, name="run_plugin", soft_time_limit=500)
Contributor


Suggestion: Wrap cleanup in its own try/except

If any of these cleanup lines raise (e.g., job.get_root() hits a treebeard MultipleObjectsReturned edge case, or the WebSocket layer has a config issue), the original exception e gets swallowed and replaced with the cleanup exception. The original error would be lost.

Consider:

except Exception as e:
    logger.exception(e)
    # ... existing report cleanup ...
    try:
        job.status = Job.STATUSES.FAILED.value
        job.errors.append(str(e))
        job.finished_analysis_time = now()
        job.save(update_fields=["status", "errors", "finished_analysis_time"])
        if root_investigation := job.get_root().investigation:
            root_investigation.set_correct_status(save=True)
        JobConsumer.serialize_and_send_job(job)
    except Exception as cleanup_err:
        logger.exception(
            f"Failed to clean up job {job_id} after pipeline exception: {cleanup_err}"
        )

This ensures the original error is always logged regardless of cleanup failures.

Comment thread tests/test_crons.py
"api_app.models.Job.objects.get",
return_value=_job,
),
):
Contributor


Unnecessary mock — this can be removed

The Job.objects.get mock isn't needed here since _job already exists in the test DB, so the real Job.objects.get(pk=_job.pk) will work fine. Removing it makes the test more realistic and would catch regressions where someone accidentally switches from .save() to .filter().update() in the fix.

Comment thread tests/test_crons.py
"execute",
side_effect=Exception(error_message),
),
patch("api_app.websocket.JobConsumer.serialize_and_send_job"),
Contributor


Missing assertion: WebSocket notification was actually sent

You're patching JobConsumer.serialize_and_send_job here, but never asserting it was called. Since the whole point of this fix (from the user's perspective) is that the frontend stops showing an infinite spinner, this is worth verifying:

with (
    patch.object(
        _job.__class__,
        "execute",
        side_effect=Exception(error_message),
    ),
    patch("api_app.websocket.JobConsumer.serialize_and_send_job") as mock_ws,
):
    job_pipeline(_job.pk)

mock_ws.assert_called_once()

Comment thread tests/test_crons.py
_job.delete()
an.delete()

def test_job_pipeline_exception_sets_job_to_failed(self):
Contributor


Suggestion: Add a second test case for the Investigation status update path

The root_investigation.set_correct_status(save=True) branch in the fix is currently untested because this job has no parent investigation. Consider adding a subTest (or separate test) that creates a job attached to an Investigation and asserts the investigation status transitions correctly after the pipeline failure. That way both branches of the if root_investigation conditional are covered.
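The suggested second test case could be structured roughly as below. This is a stub-based sketch: StubInvestigation and StubJob are hypothetical simplifications standing in for the real Django models, and `fail_job` condenses just the branch under discussion. A real test would create actual Job and Investigation instances in the test DB instead.

```python
import unittest

class StubInvestigation:
    """Hypothetical stand-in for the Investigation model."""
    def __init__(self):
        self.status = "running"

    def set_correct_status(self, save=True):
        # simplified: a failed job concludes the investigation
        self.status = "concluded"


class StubJob:
    """Hypothetical stand-in for the Job model."""
    def __init__(self, investigation=None):
        self.status = "running"
        self.investigation = investigation

    def get_root(self):
        return self  # a root job is its own tree root


def fail_job(job):
    # mirrors the conditional branch added in this PR
    job.status = "failed"
    if root_investigation := job.get_root().investigation:
        root_investigation.set_correct_status(save=True)


class TestPipelineFailureBranches(unittest.TestCase):
    def test_both_investigation_branches(self):
        for investigation in (None, StubInvestigation()):
            with self.subTest(has_investigation=investigation is not None):
                job = StubJob(investigation=investigation)
                fail_job(job)
                self.assertEqual(job.status, "failed")
                if investigation is not None:
                    self.assertEqual(investigation.status, "concluded")
```

The subTest loop exercises both the no-investigation path and the investigation-update path in a single test method, which matches the coverage goal described above.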
