Add managed scientific jobs for long-running simulations #362

@yamamotoseiji

Problem

ScienceSwarm can now drive tutorial-scale molecular dynamics work through the project UI, but real scientific runs should not depend on one long-lived chat or assistant request. MD, docking, structure prediction, simulations, and large analyses often run for hours or days. Keeping an assistant turn open for the entire wall clock couples compute lifetime to chat lifetime, makes cancellation and recovery fragile, and pushes the platform toward overly long runtime authorization windows.

Proposed Direction

Add first-class managed scientific jobs:

  • A project-scoped job record persisted under ScienceSwarm state/gbrain, not only in memory.
  • Launch metadata: command or pipeline, working directory, runtime/env fingerprint, input refs, expected outputs, approval state, and provenance.
  • Runtime execution independent of the chat request, tracked as a process initially and possibly through a backend abstraction later.
  • Heartbeats and a status field with the states queued, running, completed, failed, cancelled, and timed out.
  • Log streaming/tail access for stdout/stderr and domain logs.
  • Reliable cancel and stale-worker recovery.
  • Artifact discovery/import after completion.
  • A gbrain launch manifest saved at job start and a final run log/artifact index saved at completion.
  • Short-lived scoped write leases minted by the local app/job service when a write is needed, rather than day-long ambient runtime tokens.
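
The job record, status set, and heartbeat-based stale detection listed above could be sketched roughly as follows. This is an illustration under assumptions, not ScienceSwarm's actual schema: `JobRecord`, `JobStatus`, `TRANSITIONS`, the field names, and the one-JSON-file-per-job layout are all hypothetical, and the transition table is one plausible reading of the status list.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from enum import Enum
from pathlib import Path
from typing import List, Optional


class JobStatus(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"
    TIMED_OUT = "timed_out"


# Assumed transition table: terminal states have no outgoing edges.
TRANSITIONS = {
    JobStatus.QUEUED: {JobStatus.RUNNING, JobStatus.CANCELLED},
    JobStatus.RUNNING: {
        JobStatus.COMPLETED,
        JobStatus.FAILED,
        JobStatus.CANCELLED,
        JobStatus.TIMED_OUT,
    },
}


@dataclass
class JobRecord:
    """Hypothetical project-scoped job record (field names are assumptions)."""

    job_id: str
    project: str
    command: List[str]
    workdir: str
    status: JobStatus = JobStatus.QUEUED
    created_at: float = field(default_factory=time.time)
    last_heartbeat: Optional[float] = None

    def transition(self, new_status: JobStatus) -> None:
        # Reject moves that the table does not allow, e.g. completed -> running.
        if new_status not in TRANSITIONS.get(self.status, set()):
            raise ValueError(
                f"illegal transition {self.status.value} -> {new_status.value}"
            )
        self.status = new_status

    def heartbeat(self, now: Optional[float] = None) -> None:
        self.last_heartbeat = time.time() if now is None else now

    def is_stale(self, max_silence_s: float, now: Optional[float] = None) -> bool:
        # Only a running job can be stale; silence is measured from the last
        # heartbeat, or from creation if no heartbeat was ever recorded.
        if self.status is not JobStatus.RUNNING:
            return False
        now = time.time() if now is None else now
        last = self.created_at if self.last_heartbeat is None else self.last_heartbeat
        return now - last > max_silence_s

    def save(self, state_dir: Path) -> Path:
        # Write to a temp file, then rename, so readers never see a partial record.
        path = state_dir / f"{self.job_id}.json"
        tmp = state_dir / f"{self.job_id}.json.tmp"
        tmp.write_text(json.dumps(asdict(self), indent=2))
        tmp.replace(path)
        return path
```

A recovery pass could load each persisted record, call `is_stale` with a chosen silence threshold, and move stale jobs to `failed` or `timed_out`; which of the two is appropriate is a policy choice this sketch leaves open.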

Why Not Just Increase Timeouts?

Longer synchronous assistant timeouts are useful as a bridge for tutorial-scale local runs, but they are not the right architecture for real science. A durable job surface lets the assistant launch work, return a job ID, and later interpret completed artifacts in a fresh turn.
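
As a rough illustration of the "launch, return a job ID, inspect later" flow, the sketch below starts a command detached from the caller, writes a small manifest, and reads the log tail in a separate call. The names `launch_job` and `tail_log`, the `output.log` and `manifest.json` file names, and the directory layout are assumptions made for this example, not an existing ScienceSwarm API.

```python
import json
import subprocess
import sys
import time
import uuid
from pathlib import Path


def launch_job(command, workdir, state_dir):
    """Start `command` without waiting for it and return a job ID."""
    job_id = uuid.uuid4().hex
    job_dir = Path(state_dir) / job_id
    job_dir.mkdir(parents=True)
    log_path = job_dir / "output.log"

    # stdout/stderr go to a file so the log outlives the launching request.
    with open(log_path, "wb") as log:
        proc = subprocess.Popen(
            command,
            cwd=workdir,
            stdout=log,
            stderr=subprocess.STDOUT,
            stdin=subprocess.DEVNULL,
            start_new_session=True,  # POSIX: detach from the caller's session
        )

    # Hypothetical launch manifest, saved at job start.
    manifest = {
        "job_id": job_id,
        "command": list(command),
        "workdir": str(workdir),
        "pid": proc.pid,
        "started_at": time.time(),
    }
    (job_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    # A real job service would keep supervising `proc` (reaping, exit code);
    # this sketch returns immediately to show the decoupling.
    return job_id


def tail_log(state_dir, job_id, max_lines=20):
    """Return the last `max_lines` lines of the job's combined log."""
    log_path = Path(state_dir) / job_id / "output.log"
    lines = log_path.read_text(errors="replace").splitlines()
    return lines[-max_lines:]
```

A later turn would only need the job ID, for example `tail_log(state_dir, job_id)`, so nothing about the original request has to stay open while the command runs.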

Acceptance Criteria

  • A user can launch a long-running project command from ScienceSwarm without keeping the chat response open.
  • The project page shows job status, elapsed time, and a log tail.
  • Jobs survive page reloads and can be inspected later.
  • Completed jobs expose generated artifacts for import and interpretation.
  • Cancelling a running job terminates the process, or marks the job cancelled if the process is already gone.
  • Runtime writes use scoped or renewed app-managed authorization and preserve provenance.
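
The cancellation criterion above has two branches, which a minimal POSIX-only sketch can make concrete. Here the job record is a plain dict with hypothetical `pid` and `status` keys, and `cancel_job` is an illustrative name; a real implementation would likely also confirm that the PID still belongs to the job (PIDs can be reused) and escalate if the process ignores `SIGTERM`.

```python
import os
import signal


def cancel_job(record: dict) -> str:
    """Cancel a job described by a dict with 'pid' and 'status' keys.

    Returns "signal_sent" if SIGTERM was delivered, "already_gone" if the
    process no longer exists, or "noop" if the job was not active.
    """
    if record["status"] not in ("queued", "running"):
        return "noop"  # terminal states are left untouched

    try:
        os.kill(record["pid"], signal.SIGTERM)
        outcome = "signal_sent"
    except ProcessLookupError:
        # The process exited on its own (or crashed) before the cancel arrived.
        outcome = "already_gone"

    # Either way, the record ends up cancelled, as the criterion requires.
    record["status"] = "cancelled"
    return outcome
```

Returning the outcome separately from the status lets the UI distinguish "we stopped it" from "it had already stopped" while still showing the job as cancelled in both cases.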
