
Orphaned RunningTasks entries block same-template tasks forever when subprocess dies without normal cleanup #3681

@Gerry9000


Two linked bugs

We hit a production issue traced to two separate gaps in task lifecycle management.
They compound each other but are independently fixable.


Bug 1: Subprocess death does not clean up in-memory RunningTasks (primary blocker)

Severity: Task pool permanently blocked until manual intervention

When a task's subprocess (Ansible/Terraform/etc.) is killed externally (OOM, SIGKILL,
pod eviction) and the Go goroutine running task.run() does not complete normally,
onTaskStop() is never called. The task remains in p.RunningTasks and
p.activeProj[projectID] indefinitely.

The blocks() method checks these in-memory maps:

// services/tasks/TaskPool.go
func (p *TaskPool) blocks(t *TaskRunner) bool {
    // ...
    for _, r := range p.activeProj[t.Task.ProjectID] {
        if r.Task.Status.IsFinished() {
            continue
        }
        if r.Template.ID == t.Task.TemplateID && !r.Template.AllowParallelTasks {
            return true  // <-- permanently true when orphan exists
        }
    }
    // ...
}

With the default AllowParallelTasks = false, any new task for the same template
will be blocked forever once an orphaned entry exists. The log shows:

level=info msg="Task 104 added to queue"

...and then nothing. No "Set resource locker", no "Task 104 started". The task
sits in the Queue indefinitely, blocked by a goroutine that will never clean up.

Root cause: onTaskStop() is only called via the EventTypeFinished event
from inside the task goroutine. If the goroutine hangs (waiting on a dead process,
stuck on I/O, etc.), the event is never sent and the in-memory maps are never updated.

Reproduction:

  1. Start a task. Kill its subprocess externally (SIGKILL, OOM kill, container restart)
    in a way that leaves the goroutine blocked rather than exiting.
  2. Submit a new task using the same template.
  3. Observe: new task logs "added to queue" but never starts.

Verified against source: CreateTaskPool() (lines 68-92) creates empty RunningTasks
and activeProj maps -- there is no DB hydration on startup. All entries in these
maps come from live goroutines. A goroutine that hangs permanently creates an
irrecoverable orphan.


Bug 2: Tasks in status=waiting in the database are not re-queued after restart

Severity: Waiting tasks are silently lost on every restart

TaskPool.Queue is in-memory only. After a pod restart, CreateTaskPool() creates
an empty slice. TaskPool.Run() makes no database queries on startup -- it simply
waits for new register channel messages:

func (p *TaskPool) Run() {
    // ...
    go p.handleQueue()
    go p.handleLogs()
    for {
        select {
        case task := <-p.register:  // only new API/schedule submissions land here
            // ...
        case <-ticker.C:
            // 5-second tick, no DB read
        }
    }
}

Any tasks that were in the in-memory Queue at the time of restart (status=waiting in DB,
not yet started) are permanently abandoned. They accumulate in the DB as
status=waiting entries with no corresponding in-memory runner. A human operator must
manually re-trigger them or change their status in the DB.

This is confirmed by Discussion #1384 and is the companion to the subprocess cleanup
issue: even if Bug 1 is fixed, a restart that was triggered to clear orphaned entries
will leave pending tasks stranded.


What does NOT cause the problem

DB state does not affect the in-memory pool after restart. This is a common
misconception (including in our own incident analysis). CreateTaskPool() is a pure
constructor with no DB reads. Tasks with status=running in the database after a
restart have no effect on the task pool -- new tasks of the same template would start
normally on a fresh pod. The zombie DB entries are confusing but not blocking.


Proposed fixes

1. Subprocess death should trigger cleanup (fixes Bug 1)

Add a goroutine watcher per TaskRunner that detects when the subprocess exits
unexpectedly and ensures onTaskStop() is called:

// In task.run() or via a separate watchdog goroutine:
go func() {
    <-processExited
    if !taskCompletedNormally {
        p.onTaskStop(task)
        task.SetStatus(TaskErrorStatus)
    }
}()

Alternatively, a periodic reaper in handleQueue that removes RunningTasks entries
where the status is a terminal state or where the goroutine is no longer alive.

2. Re-queue waiting tasks on startup (fixes Bug 2)

On startup, query the DB for tasks with status=waiting and add them to the queue:

// In TaskPool.Run() or a new TaskPool.Recover() method:
waitingTasks, err := store.GetTasksByStatus(db.TaskStatusWaiting)
if err != nil {
    log.Error(err)
    return
}
for _, t := range waitingTasks {
    runner := NewTaskRunner(t, p, "", keyInstaller)
    p.register <- runner
}

3. Startup cleanup for stale running entries (partial mitigation)

On startup, mark tasks with status=running AND start < NOW() - threshold as error.
This prevents the DB entries from being misleading, though it does not address the
in-memory goroutine issue directly:

store.UpdateTasksStatus(db.TaskStatusError, db.TaskStatusRunning,
    time.Now().Add(-2*time.Hour))

This was requested in #3101.

4. Task timeout configuration

A task_timeout_minutes config option that kills tasks (and calls cleanup) if they
exceed the limit. This provides a safety net for both the subprocess-death and
long-running-stuck-task scenarios.


Workaround (until fixed)

  1. Identify orphaned entries: look for tasks with status=running and a start time
    from before the last pod restart.
  2. Call POST /api/project/{id}/tasks/{task_id}/stop with header
    Content-Type: application/json and body {"force":true} for each orphaned task.
    (Note: must send a valid JSON body -- an empty request returns 400.)
  3. If /stop fails, update DB directly:
    UPDATE task SET status='error' WHERE status='running' AND start < NOW() - INTERVAL '30 minutes';
  4. Restart Semaphore.
  5. Re-trigger any tasks that were in status=waiting and are now missing from the queue.
  6. Mitigation: set "max_parallel_tasks" in config so N-1 orphaned tasks can coexist
    without blocking all work, and add a CronJob to run the cleanup SQL hourly.

Environment

  • Semaphore v2.16.51
  • Self-hosted on Kubernetes (pod-based)
  • Database: Postgres 18
  • Tasks: OpenTofu (Terraform-compatible) via template
