
Orphaned RunningTasks entries block same-template tasks forever when subprocess dies without normal cleanup #3681

@Gerry9000


Two linked bugs

We hit a production issue traced to two separate gaps in task lifecycle management.
They compound each other but are independently fixable.


Bug 1: Subprocess death does not clean up in-memory RunningTasks (primary blocker)

Severity: Task pool permanently blocked until manual intervention

When a task's subprocess (Ansible/Terraform/etc.) is killed externally (OOM, SIGKILL,
pod eviction) and the Go goroutine running task.run() does not complete normally,
onTaskStop() is never called. The task remains in p.RunningTasks and
p.activeProj[projectID] indefinitely.

The blocks() method checks these in-memory maps:

// services/tasks/TaskPool.go
func (p *TaskPool) blocks(t *TaskRunner) bool {
    // ...
    for _, r := range p.activeProj[t.Task.ProjectID] {
        if r.Task.Status.IsFinished() {
            continue
        }
        if r.Template.ID == t.Task.TemplateID && !r.Template.AllowParallelTasks {
            return true  // <-- permanently true when orphan exists
        }
    }
    // ...
}

With the default AllowParallelTasks = false, any new task for the same template
will be blocked forever once an orphaned entry exists. The log shows:

level=info msg="Task 104 added to queue"

...and then nothing. No "Set resource locker", no "Task 104 started". The task
sits in the Queue indefinitely, blocked by a goroutine that will never clean up.

Root cause: onTaskStop() is only called via the EventTypeFinished event
from inside the task goroutine. If the goroutine hangs (waiting on a dead process,
stuck on I/O, etc.), the event is never sent and the in-memory maps are never updated.

Reproduction:

  1. Start a task. Kill its subprocess externally (SIGKILL, OOM kill, container restart)
    in a way that leaves the goroutine blocked rather than exiting.
  2. Submit a new task using the same template.
  3. Observe: new task logs "added to queue" but never starts.

Verified against source: CreateTaskPool() (lines 68-92) creates empty RunningTasks
and activeProj maps -- there is no DB hydration on startup. All entries in these
maps come from live goroutines. A goroutine that hangs permanently creates an
irrecoverable orphan.


Bug 2: Tasks in status=waiting in the database are not re-queued after restart

Severity: Waiting tasks are silently lost on every restart

TaskPool.Queue is in-memory only. After a pod restart, CreateTaskPool() creates
an empty slice. TaskPool.Run() makes no database queries on startup -- it simply
waits for new register channel messages:

func (p *TaskPool) Run() {
    // ...
    go p.handleQueue()
    go p.handleLogs()
    for {
        select {
        case task := <-p.register:  // only new API/schedule submissions land here
            // ...
        case <-ticker.C:
            // 5-second tick, no DB read
        }
    }
}

Any tasks that were in the in-memory Queue at the time of restart (status=waiting in DB,
not yet started) are permanently abandoned. They accumulate in the DB as
status=waiting entries with no corresponding in-memory runner. A human operator must
manually re-trigger them or change their status in the DB.

This is confirmed by Discussion #1384 and is the companion to the subprocess cleanup
issue: even if Bug 1 is fixed, a restart that was triggered to clear orphaned entries
will leave pending tasks stranded.


What does NOT cause the problem

DB state does not affect the in-memory pool after restart. This is a common
misconception (including in our own incident analysis). CreateTaskPool() is a pure
constructor with no DB reads. Tasks with status=running in the database after a
restart have no effect on the task pool -- new tasks of the same template would start
normally on a fresh pod. The zombie DB entries are confusing but not blocking.


Proposed fixes

1. Subprocess death should trigger cleanup (fixes Bug 1)

Add a goroutine watcher per TaskRunner that detects when the subprocess exits
unexpectedly and ensures onTaskStop() is called:

// In task.run() or via a separate watchdog goroutine:
go func() {
    <-processExited
    if !taskCompletedNormally {
        p.onTaskStop(task)
        task.SetStatus(TaskErrorStatus)
    }
}()

Alternatively, a periodic reaper in handleQueue that removes RunningTasks entries
where the status is a terminal state or where the goroutine is no longer alive.

2. Re-queue waiting tasks on startup (fixes Bug 2)

On startup, query the DB for tasks with status=waiting and add them to the queue:

// In TaskPool.Run() or a new TaskPool.Recover() method:
waitingTasks, err := store.GetTasksByStatus(db.TaskStatusWaiting)
if err != nil {
    log.Error(err)
    return
}
for _, t := range waitingTasks {
    runner := NewTaskRunner(t, p, "", keyInstaller)
    p.register <- runner
}

3. Startup cleanup for stale running entries (partial mitigation)

On startup, mark tasks with status=running AND start < NOW() - threshold as error.
This prevents the DB entries from being misleading, though it does not address the
in-memory goroutine issue directly:

store.UpdateTasksStatus(db.TaskStatusError, db.TaskStatusRunning,
    time.Now().Add(-2*time.Hour))

This was requested in #3101.

4. Task timeout configuration

A task_timeout_minutes config option that kills tasks (and calls cleanup) if they
exceed the limit. This provides a safety net for both the subprocess-death and
long-running-stuck-task scenarios.


Workaround (until fixed)

  1. Identify orphaned entries: look for tasks with status=running and a start time
    from before the last pod restart.
  2. Call POST /api/project/{id}/tasks/{task_id}/stop with header
    Content-Type: application/json and body {"force":true} for each orphaned task.
    (Note: must send a valid JSON body -- an empty request returns 400.)
  3. If /stop fails, update DB directly:
    UPDATE task SET status='error' WHERE status='running' AND start < NOW() - INTERVAL '30 minutes';
  4. Restart Semaphore.
  5. Re-trigger any tasks that were in status=waiting and are now missing from the queue.
  6. Mitigation: set "max_parallel_tasks" in config so N-1 orphaned tasks can coexist
    without blocking all work, and add a CronJob to run the cleanup SQL hourly.

Environment

  • Semaphore v2.16.51
  • Self-hosted on Kubernetes (pod-based)
  • Database: Postgres 18
  • Tasks: OpenTofu (Terraform-compatible) via template
