Two linked bugs
We hit a production issue traced to two separate gaps in task lifecycle management.
They compound each other but are independently fixable.
Bug 1: Subprocess death does not clean up in-memory RunningTasks (primary blocker)
Severity: Task pool permanently blocked until manual intervention
When a task's subprocess (Ansible/Terraform/etc.) is killed externally (OOM, SIGKILL,
pod eviction) and the Go goroutine running task.run() does not complete normally,
onTaskStop() is never called. The task remains in p.RunningTasks and
p.activeProj[projectID] indefinitely.
The blocks() method checks these in-memory maps:

```go
// services/tasks/TaskPool.go
func (p *TaskPool) blocks(t *TaskRunner) bool {
	// ...
	for _, r := range p.activeProj[t.Task.ProjectID] {
		if r.Task.Status.IsFinished() {
			continue
		}
		if r.Template.ID == t.Task.TemplateID && !r.Template.AllowParallelTasks {
			return true // <-- permanently true when an orphan exists
		}
	}
	// ...
}
```

With the default `AllowParallelTasks = false`, any new task for the same template
will be blocked forever once an orphaned entry exists. The log shows:

```
level=info msg="Task 104 added to queue"
```

...and then nothing. No "Set resource locker", no "Task 104 started". The task
sits in the Queue indefinitely, blocked by a goroutine that will never clean up.
Root cause: onTaskStop() is only called via the EventTypeFinished event
from inside the task goroutine. If the goroutine hangs (waiting on a dead process,
stuck on I/O, etc.), the event is never sent and the in-memory maps are never updated.
Reproduction:
- Start a task, then kill its subprocess externally (SIGKILL, OOM kill, container restart) in a way that leaves the goroutine blocked rather than exiting.
- Submit a new task using the same template.
- Observe: the new task logs "added to queue" but never starts.
Verified against source: CreateTaskPool() (lines 68-92) creates empty RunningTasks
and activeProj maps -- there is no DB hydration on startup. All entries in these
maps come from live goroutines. A goroutine that hangs permanently creates an
irrecoverable orphan.
Bug 2: Tasks in status=waiting in the database are not re-queued after restart
Severity: Waiting tasks are silently lost on every restart
TaskPool.Queue is in-memory only. After a pod restart, CreateTaskPool() creates
an empty slice. TaskPool.Run() makes no database queries on startup -- it simply
waits for new register channel messages:

```go
func (p *TaskPool) Run() {
	// ...
	go p.handleQueue()
	go p.handleLogs()
	for {
		select {
		case task := <-p.register: // only new API/schedule submissions land here
			// ...
		case <-ticker.C:
			// 5-second tick, no DB read
		}
	}
}
```

Any tasks that were in the in-memory Queue at the time of restart (status=waiting in DB,
not yet started) are permanently abandoned. They accumulate in the DB as
status=waiting entries with no corresponding in-memory runner. A human operator must
manually re-trigger them or change their status in the DB.
This is confirmed by Discussion #1384 and is the companion to the subprocess cleanup
issue: even if Bug 1 is fixed, a restart that was triggered to clear orphaned entries
will leave pending tasks stranded.
What does NOT cause the problem
DB state does not affect the in-memory pool after restart. This is a common
misconception (including in our own incident analysis). CreateTaskPool() is a pure
constructor with no DB reads. Tasks with status=running in the database after a
restart have no effect on the task pool -- new tasks of the same template would start
normally on a fresh pod. The zombie DB entries are confusing but not blocking.
Proposed fixes
1. Subprocess death should trigger cleanup (fixes Bug 1)
Add a goroutine watcher per TaskRunner that detects when the subprocess exits
unexpectedly and ensures onTaskStop() is called:
```go
// In task.run() or via a separate watchdog goroutine:
go func() {
	<-processExited
	if !taskCompletedNormally {
		p.onTaskStop(task)
		task.SetStatus(TaskErrorStatus)
	}
}()
```

Alternatively, a periodic reaper in handleQueue could remove RunningTasks entries
whose status is a terminal state or whose goroutine is no longer alive.
2. Re-queue waiting tasks on startup (fixes Bug 2)
On startup, query the DB for tasks with status=waiting and add them to the queue:
```go
// In TaskPool.Run() or a new TaskPool.Recover() method:
waitingTasks, _ := store.GetTasksByStatus(db.TaskStatusWaiting)
for _, t := range waitingTasks {
	runner := NewTaskRunner(t, p, "", keyInstaller)
	p.register <- runner
}
```

3. Startup cleanup for stale running entries (partial mitigation)
On startup, mark tasks with status=running AND start < NOW() - threshold as error.
This prevents the DB entries from being misleading, though it does not address the
in-memory goroutine issue directly:
```go
store.UpdateTasksStatus(db.TaskStatusError, db.TaskStatusRunning,
	time.Now().Add(-2*time.Hour))
```

This was requested in #3101.
4. Task timeout configuration
A task_timeout_minutes config option that kills tasks (and calls cleanup) if they
exceed the limit. This provides a safety net for both the subprocess-death and
long-running-stuck-task scenarios.
Workaround (until fixed)
- Identify orphaned entries: look for tasks with `status=running` and a start time from before the last pod restart.
- Call `POST /api/project/{id}/tasks/{task_id}/stop` with header `Content-Type: application/json` and body `{"force":true}` for each orphaned task. (Note: a valid JSON body is required -- an empty request returns 400.)
- If /stop fails, update the DB directly:
  `UPDATE task SET status='error' WHERE status='running' AND start < NOW() - INTERVAL '30 minutes';`
- Restart Semaphore.
- Re-trigger any tasks that were in `status=waiting` and are now missing from the queue.
- Mitigation: set `"max_parallel_tasks"` in the config so N-1 orphaned tasks can coexist without blocking all work, and add a CronJob to run the cleanup SQL hourly.
Environment
- Semaphore v2.16.51
- Self-hosted on Kubernetes (pod-based)
- Database: Postgres 18
- Tasks: OpenTofu (Terraform-compatible) via template