A resource-aware distributed task scheduler for AI agent workloads on Apple Silicon with multi-machine support via SSH/tmux.
Completed Epics:
- Epic 2: Worker Runtime - Full task lifecycle, leases, retries, graceful shutdown (18 tests)
- Epic 3: Local Mac Execution - Process runner, isolation, test sharding (4 tests)
- Epic 4: Cache + Artifacts - Stable keys, artifact browsing, build caching (3 tests)
- Epic 5: Distribution & Scaling - Network traits, distributed coordinator, persistent recovery, resilience (circuit breaker, timeouts, backpressure, priority scheduling, priority inversion detection) (20 tests)
In Progress:
- Epic 1: Scheduler Core - DAG validation, priority + fairness, resource allocation (0/4 stories)
- Epic 6: Observability + UX - Status CLI, event timelines, tunable config (0/3 stories)
- Epic 7: Reliability & Soak - Soak tests, metrics validation (0/2 stories)
Test Coverage: 63+ tests passing with RUSTFLAGS="-D warnings", zero unsafe code outside required OS interfaces
The Coordinator is containerized for standard Kubernetes deployment:
- Containerization: dockworker.toml at repo root produces a hardened OCI image.
- Packaging: Kustomize manifests in kustomize/ (base + overlays).
- GitOps: Managed via Flux in the lornu.ai ecosystem.
- Probes: HTTP health checks on :8080 (/healthz, /readyz).
Workers remain on bare-metal Apple Silicon hosts to ensure direct NPU/GPU access and macOS process fidelity. They connect to the Coordinator via its K8s Service address (DNS or ClusterIP).
- Rust 1.93+
- macOS (M1/M2/M3/M4 or Intel)
- Pre-commit hooks (optional)
git clone https://github.com/stevedores-org/knittingCrab.git
cd knittingCrab
cargo build --release
# All tests
cargo test --all
# Specific crate
cargo test -p knitting-crab-worker --lib
# With strict warnings
RUSTFLAGS="-D warnings" cargo test --all
# Pre-commit checks
pre-commit run --all-files
crates/
├── core/                          # Shared types, traits, scheduling policies (Epic 1-7)
│   ├── lease.rs                   # Lease state machine
│   ├── retry.rs                   # Exponential backoff policy
│   ├── priority.rs                # Task priorities (Critical/High/Normal/Low)
│   ├── circuit_breaker.rs         # Resilience pattern
│   ├── task_timeout.rs            # Soft + hard timeouts
│   ├── queue_backpressure.rs      # Degradation modes (Normal/Moderate/High/Critical)
│   ├── priority_queue.rs          # Priority-aware task queue
│   ├── time_slice_scheduler.rs    # Weighted round-robin (50/30/15/5)
│   ├── priority_inversion.rs      # Detection for diagnostics
│   ├── event_log.rs               # Memory + SQLite event sinks
│   ├── persistent_lease.rs        # SQLite-backed recovery
│   └── traits.rs                  # Core abstractions (Queue, LeaseStore, EventSink, etc.)
├── worker/                        # Worker runtime (Epic 2 complete + resilience)
│   ├── worker_runtime.rs          # Main task orchestration
│   ├── lease_manager.rs           # Lifecycle management
│   ├── process.rs                 # OS process execution (macOS optimized)
│   ├── cancel_token.rs            # Graceful cancellation
│   └── fake_worker.rs             # Test double
├── scheduler/                     # StubScheduler for testing
├── transport/                     # Wire protocol (Epic 5)
│   ├── framing.rs                 # Length-prefixed JSON + framing
│   ├── messages.rs                # CoordinatorRequest/Response types
│   └── error.rs                   # Transport error handling
├── coordinator/                   # Server-side scheduler state (Epic 5)
│   ├── server.rs                  # TCP listener + request dispatch
│   ├── state.rs                   # CoordinatorState (Arc-wrapped traits)
│   ├── node_registry.rs           # Worker node tracking + health
│   ├── cache_index.rs             # Distributed cache coordination
│   └── error.rs
└── node/                          # Client-side worker integration (Epic 5)
    ├── connection.rs              # SSH/tmux session mgmt + TCP
    ├── network_queue.rs           # Queue trait over network
    ├── network_lease_store.rs     # LeaseStore trait over network
    ├── network_event_sink.rs      # EventSink trait over network
    ├── network_cache.rs           # Cache discovery
    └── worker_node.rs             # Builder pattern for runtime construction
Task Execution:
- WorkerRuntime: Main async orchestration loop (dequeue → acquire lease → execute → emit events)
- LeaseManager: Prevents duplicate execution, handles expiry/renewal via heartbeat
- ProcessHandle: macOS-optimized subprocess spawning with process groups + graceful shutdown
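The lease behaviour described above (one holder per task, renewal by heartbeat, expiry on silence) can be sketched as a small in-memory table. This is an illustration only, not the crate's LeaseManager: `LeaseTable`, its method names, and the explicit `now` parameter are hypothetical, chosen so the logic can be exercised without a real clock.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical in-memory lease table; not the actual LeaseManager API.
struct LeaseTable {
    ttl: Duration,
    // task id -> (owning worker, expiry deadline)
    leases: HashMap<String, (String, Instant)>,
}

impl LeaseTable {
    fn new(ttl: Duration) -> Self {
        Self { ttl, leases: HashMap::new() }
    }

    /// Grants the lease unless a different worker holds an unexpired one.
    fn try_acquire(&mut self, task: &str, worker: &str, now: Instant) -> bool {
        let blocked = self
            .leases
            .get(task)
            .map_or(false, |(owner, deadline)| owner.as_str() != worker && *deadline > now);
        if blocked {
            return false;
        }
        self.leases
            .insert(task.to_string(), (worker.to_string(), now + self.ttl));
        true
    }

    /// Heartbeat: extends the deadline only for the current, unexpired holder.
    fn renew(&mut self, task: &str, worker: &str, now: Instant) -> bool {
        if let Some((owner, deadline)) = self.leases.get_mut(task) {
            if owner.as_str() == worker && *deadline > now {
                *deadline = now + self.ttl;
                return true;
            }
        }
        false
    }
}
```

A worker that stops sending heartbeats simply lets its deadline pass, at which point another worker's `try_acquire` succeeds and the task is recovered.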
Resilience:
- CircuitBreaker: 3-state pattern (Closed → Open → Half-Open) for fault tolerance
- TimeoutPolicy: Soft timeouts (graceful) + hard timeouts (forced kill) with load-aware multipliers
- RetryHandler: Exponential backoff (100ms initial delay, 2.0x multiplier, 30s max) with configurable codes
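To make the retry numbers concrete, here is a minimal sketch of a capped exponential backoff schedule using the values quoted above (100ms, 2.0x, 30s). `BackoffPolicy`, its fields, and `readme_defaults` are hypothetical names for illustration; the real RetryHandler may be shaped differently and may add jitter.

```rust
use std::time::Duration;

/// Hypothetical policy type; field names are illustrative, not the crate's API.
struct BackoffPolicy {
    initial: Duration,
    multiplier: f64,
    max: Duration,
}

impl BackoffPolicy {
    /// Values taken from the README text: 100ms initial, 2.0x growth, 30s cap.
    fn readme_defaults() -> Self {
        Self {
            initial: Duration::from_millis(100),
            multiplier: 2.0,
            max: Duration::from_secs(30),
        }
    }

    /// Delay before retry number `attempt` (0-based), capped at `max`.
    fn delay(&self, attempt: u32) -> Duration {
        // Clamp the exponent so the intermediate float stays finite.
        let factor = self.multiplier.powi(attempt.min(64) as i32);
        let secs = self.initial.as_secs_f64() * factor;
        if secs >= self.max.as_secs_f64() {
            self.max
        } else {
            Duration::from_secs_f64(secs)
        }
    }
}
```

Under these assumptions the schedule runs 100ms, 200ms, 400ms and so on, reaching the 30s cap from the tenth attempt onward.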
Scheduling (Epic 5, Phase 1):
- PriorityQueueManager: 4-queue system respects degradation modes
- TimeSliceScheduler: Deterministic round-robin (50% Critical, 30% High, 15% Normal, 5% Low)
- QueueBackpressureManager: Adaptive degradation (Normal → Moderate → High → Critical)
- PriorityInversionDetector: Tracks lock contention for diagnostics
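One way to get a deterministic 50/30/15/5 split is smooth weighted round-robin, sketched below. This is a substitute technique chosen for illustration; the README does not say which algorithm TimeSliceScheduler uses, and `TimeSlicer` and `next_slot` are hypothetical names.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Priority {
    Critical,
    High,
    Normal,
    Low,
}

/// Slot weights from the README: 50% Critical, 30% High, 15% Normal, 5% Low.
const CLASSES: [(Priority, i64); 4] = [
    (Priority::Critical, 50),
    (Priority::High, 30),
    (Priority::Normal, 15),
    (Priority::Low, 5),
];

/// Hypothetical scheduler state: one running credit counter per class.
struct TimeSlicer {
    current: [i64; 4],
}

impl TimeSlicer {
    fn new() -> Self {
        Self { current: [0; 4] }
    }

    /// Picks the class for the next slot. Each class gains its weight in
    /// credit, the richest class wins (ties go to the higher priority), and
    /// the winner pays back the total weight.
    fn next_slot(&mut self) -> Priority {
        let total: i64 = CLASSES.iter().map(|(_, w)| w).sum();
        let mut best = 0;
        for i in 0..CLASSES.len() {
            self.current[i] += CLASSES[i].1;
            if self.current[i] > self.current[best] {
                best = i;
            }
        }
        self.current[best] -= total;
        CLASSES[best].0
    }
}
```

Compared with handing out 50 consecutive Critical slots, this interleaves the classes, so Low work is never starved for a long run while the overall proportions still hold over every 100 slots.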
Distribution:
- CoordinatorServer: Multi-client TCP server managing global state, task distribution, lease recovery
- NodeRegistry: Worker tracking with stale detection (60s timeout) + automatic recovery
- FramedTransport: Length-prefixed JSON wire protocol (16 MiB max, robust EOF handling)
- NetworkQueue/NetworkLeaseStore/NetworkEventSink: Trait implementations over TCP
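The framing rules above (length prefix, 16 MiB cap, careful EOF handling) can be sketched with the standard library alone. The 4-byte big-endian prefix and the function names `write_frame` and `read_frame` are assumptions for illustration; the README does not specify the prefix width or byte order used by FramedTransport, and the payload here is treated as opaque JSON bytes.

```rust
use std::io::{self, Read, Write};

/// Maximum payload size, matching the 16 MiB limit in the README.
const MAX_FRAME_LEN: usize = 16 * 1024 * 1024;

/// Writes one frame: a 4-byte big-endian length (assumed), then the payload.
fn write_frame<W: Write>(w: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "frame exceeds 16 MiB"));
    }
    w.write_all(&(payload.len() as u32).to_be_bytes())?;
    w.write_all(payload)
}

/// Reads one frame. Returns Ok(None) on a clean EOF at a frame boundary and
/// an UnexpectedEof error if the stream ends partway through a frame.
fn read_frame<R: Read>(r: &mut R) -> io::Result<Option<Vec<u8>>> {
    let mut header = [0u8; 4];
    let mut filled = 0;
    while filled < header.len() {
        match r.read(&mut header[filled..]) {
            Ok(0) if filled == 0 => return Ok(None),
            Ok(0) => return Err(io::ErrorKind::UnexpectedEof.into()),
            Ok(n) => filled += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    let len = u32::from_be_bytes(header) as usize;
    // Reject oversized frames before allocating the buffer.
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame exceeds 16 MiB"));
    }
    let mut payload = vec![0u8; len];
    r.read_exact(&mut payload)?;
    Ok(Some(payload))
}
```

Checking the length before allocating matters: without it, a corrupt or hostile header could make the reader reserve up to 4 GiB for a single frame.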
Remote Execution:
- aivcs-session CLI (planned): SSH/tmux session manager for aivcs.local (Apple Silicon studio)
  - Sanitizes repo names, generates deterministic session IDs
  - Manages tmux session lifecycle (attach-or-create)
  - Scheduler shells out to aivcs-session attach --repo X --work Y --role Z
- 36+ tests passing with zero warnings
- Unit tests: Leases, retries, cancellation
- Component tests: Log streaming, process spawning
- Integration tests: Real subprocess execution
- Pre-commit hooks: Auto-format and lint
- Zero unsafe code (except required for process groups)
- All warnings denied: RUSTFLAGS="-D warnings"
- Clippy strict mode: cargo clippy -- -D warnings
- IMPLEMENTATION.md: Detailed architecture & design decisions
- Inline documentation for all public APIs
- Test examples in crates/worker/tests/
The scheduler can execute tasks on remote Apple Silicon machines (aivcs.local) using SSH + tmux:
aivcs-session attach \
  --repo knittingCrab \
  --work task-12345 \
  --role runner
Guarantees:
- Deterministic session naming: aivcs__knittingcrab__task-12345__runner
- Idempotent: Same inputs attach to the existing session
- Safe: Sanitizes inputs, forbids .. and / in repo names
- Correct directory: Auto-starts in ~/engineering/code/clone-base/$REPO_NAME
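The naming guarantees above can be sketched as a pure function. The exact sanitization rules are not documented here, so the ones below (lowercase ASCII, anything other than letters, digits, and `-` replaced by `-`) are assumptions inferred from the single example name; `session_name` and `sanitize` are hypothetical helpers, not the aivcs-session implementation.

```rust
/// Builds a deterministic tmux session name, rejecting unsafe repo names.
fn session_name(repo: &str, work: &str, role: &str) -> Result<String, String> {
    // The README forbids ".." and "/" in repo names.
    if repo.is_empty() || repo.contains("..") || repo.contains('/') {
        return Err(format!("invalid repo name: {repo:?}"));
    }
    // Roles listed in the README: agent, runner, human.
    if !matches!(role, "agent" | "runner" | "human") {
        return Err(format!("unknown role: {role:?}"));
    }
    Ok(format!("aivcs__{}__{}__{}", sanitize(repo), sanitize(work), role))
}

/// Assumed rule: lowercase ASCII, with every other character mapped to '-'
/// so that the "__" separator can never appear inside a component.
fn sanitize(part: &str) -> String {
    part.chars()
        .map(|c| {
            let c = c.to_ascii_lowercase();
            if c.is_ascii_alphanumeric() || c == '-' {
                c
            } else {
                '-'
            }
        })
        .collect()
}
```

Because the function has no hidden inputs, the same repo, work item, and role always yield the same name, which is what makes attach-or-create idempotent.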
Roles:
- agent: Long-lived (keep for debugging)
- runner: Disposable (auto-kill on completion)
- human: Manual interactive sessions
See ARCHITECTURE.md for implementation details.
Immediate (Phase 2):
- Epic 1: Scheduler Core (DAG validation, priority + fairness, resource allocation)
- Critical path: DAG cycle detection unblocks all other work
- Epic 6: Observability + UX (status CLI, event timelines, causal reasoning)
- Epic 7: Soak testing + metrics validation
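Since DAG cycle detection is on the critical path, here is a sketch of one standard approach, Kahn's algorithm. Epic 1 is not implemented yet, so this is not the project's design: `validate_dag` and the dependency-map shape (task name to the list of tasks it depends on) are hypothetical.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Returns a valid execution order, or Err listing the tasks that sit on
/// (or behind) a dependency cycle. `deps` maps a task to its prerequisites.
fn validate_dag(deps: &BTreeMap<&str, Vec<&str>>) -> Result<Vec<String>, Vec<String>> {
    // indegree[t] = number of prerequisites of t that have not run yet.
    let mut indegree: BTreeMap<&str, usize> = BTreeMap::new();
    // dependents[t] = tasks that are waiting on t.
    let mut dependents: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for (&task, reqs) in deps {
        indegree.entry(task).or_insert(0);
        for &req in reqs {
            indegree.entry(req).or_insert(0);
            *indegree.entry(task).or_insert(0) += 1;
            dependents.entry(req).or_default().push(task);
        }
    }

    // Start from tasks with no prerequisites and peel the graph layer by layer.
    let mut ready: VecDeque<&str> = indegree
        .iter()
        .filter(|(_, d)| **d == 0)
        .map(|(&t, _)| t)
        .collect();
    let mut order = Vec::new();
    while let Some(task) = ready.pop_front() {
        order.push(task.to_string());
        if let Some(nexts) = dependents.get(task) {
            for &next in nexts {
                let d = indegree.entry(next).or_insert(0);
                *d -= 1;
                if *d == 0 {
                    ready.push_back(next);
                }
            }
        }
    }

    if order.len() == indegree.len() {
        Ok(order)
    } else {
        // Anything never released still has unmet prerequisites.
        Err(indegree
            .into_iter()
            .filter(|&(_, d)| d > 0)
            .map(|(t, _)| t.to_string())
            .collect())
    }
}
```

The successful result doubles as a topological order, so the same pass that rejects cyclic graphs could also seed the scheduler's ready queue.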
knittingCrab supports rigorous agent evaluations for the Lornu AI bullpen:
- Evaluation Metadata: Tasks can carry AIVCS run IDs and success rubrics to track promotion cycles.
- Roster Gatekeeper: Enforces node isolation policy. Phase 1/2 (nursery) agents are restricted to development nodes, while Phase 3/4 agents and explicit Promotion Gates are permitted on production hardware.
- AIVCS Integration: Every task event can be recorded in the content-addressed sovereign ledger for auditability.
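The Roster Gatekeeper rule can be read as a small placement predicate. The sketch below encodes only what is stated above; `TaskMeta`, `NodeClass`, and `may_schedule` are hypothetical names, and the choice to let every task run on development nodes is an assumption, since the policy text only restricts production hardware.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum NodeClass {
    Development,
    Production,
}

/// Hypothetical view of the evaluation metadata a task might carry.
struct TaskMeta {
    /// Agent phase, 1 through 4 (phases 1 and 2 are the "nursery").
    agent_phase: u8,
    /// True when the task is an explicit Promotion Gate.
    promotion_gate: bool,
}

/// Decides whether a task may be placed on a node of the given class.
fn may_schedule(task: &TaskMeta, node: NodeClass) -> bool {
    match node {
        // Assumption: development nodes accept any task.
        NodeClass::Development => true,
        // Production accepts Phase 3/4 agents and explicit Promotion Gates.
        NodeClass::Production => task.promotion_gate || matches!(task.agent_phase, 3 | 4),
    }
}
```

Keeping the rule as a pure function of task metadata and node class makes it straightforward to test exhaustively before wiring it into placement.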
See examples/promotion-gate-task.yaml for an example evaluation task.
See LICENSE file.