Complete ops runbook: quick start, CLI reference, platform troubleshooting, skill/module authoring, deployment, monitoring, backup, and recovery. Companion to FRAMEWORK_BLUEPRINT.md (architecture) and ROADMAP.md (future work).

Platform note: Commands below use `python` directly. On Linux/macOS with uv installed, you may prefix with `uv run`. On Windows (Git Bash), always use `python` directly — `uv run` is WSL-only.
```bash
# 1. Start fleet
python fleet/supervisor.py          # Linux/macOS — direct
# Windows: either run inside WSL, or set BIGED_NATIVE_WINDOWS=1

# 2. Start dashboard (web UI on http://localhost:5555)
python fleet/dashboard.py

# 3. Start launcher (tkinter desktop app)
python BigEd/launcher/launcher.py

# 4. Check fleet health
python fleet/lead_client.py status
```

All commands below are run as `python fleet/lead_client.py <command>`.
| Command | Description |
|---|---|
| `status` | Show all agents (name, role, status, last heartbeat) and task counts |
| `detect-cli` | Detect best local CLI, shell, network tools, and bridge for this platform |
| `install-service` | Install fleet as an auto-start service (Task Scheduler / systemd / launchd) |
| `uninstall-service` | Remove the auto-start service |
| Command | Description |
|---|---|
| `task "instruction"` | Submit a natural-language task (parsed by the conductor model). Add `--wait` to block until complete. `--priority N` (1-10, default 5) |
| `task '{"skill":"web_search","query":"..."}'` | Submit a raw JSON task (bypasses the intent parser) |
| `dispatch skill payload` | Dispatch an explicit skill + JSON payload. `--priority N` (default 9), `--assigned-to agent`, `--b64` for base64 payload |
| `result <task_id>` | Fetch status and result of a specific task |
| Command | Description |
|---|---|
| `send agent "msg"` | Direct message to a specific agent. `--channel fleet\|sup\|agent\|pool` |
| `broadcast "msg"` | Broadcast a message to all registered agents. `--channel` same as above |
| `inbox agent` | Check an agent's inbox (unread by default). `--all` for all, `--limit N`, `--channel` filter |
| `notes channel` | Read a channel scratchpad. `--post "json"` to add, `--since ISO`, `--limit N` |
| Command | Description |
|---|---|
| `usage --period day\|week\|month` | Token usage breakdown by skill (calls, input/output tokens, cost USD, cache savings) |
| `usage-delta from_start from_end to_start to_end` | Compare usage between two ISO date ranges (delta %, direction arrows) |
| `budget` | Token budget status per skill (from `[budgets]` in fleet.toml) |
| Command | Description |
|---|---|
| `marathon [session]` | List marathon sessions (last 5). Pass a session ID for the last 3 snapshots |
| `marathon-checkpoint` | Show autoresearch training checkpoints (last 10 `.pt` files, size, modified) |
| Command | Description |
|---|---|
| `logs agent --tail N` | Tail the log file for a specific agent (default 30 lines) |
| Command | Description |
|---|---|
| `secret set KEY value` | Set an API key in `~/.secrets` (atomic write). `--b64` for a base64-encoded value |
| `secret get KEY` | Retrieve a secret value |
| `secret list` | List all secret keys (values masked) |
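The atomic write behind `secret set` can be sketched as follows. This is an illustration of the write-temp-then-rename pattern, not the actual `lead_client.py` implementation; the `set_secret_atomic` helper name and file layout are assumptions based on the `export KEY='value'` format described later.

```python
import os
import tempfile

def set_secret_atomic(path: str, key: str, value: str) -> None:
    """Rewrite the secrets file atomically: write a temp file in the same
    directory, then os.replace() it over the original, so a crash mid-write
    can never leave a half-written secrets file."""
    lines = []
    if os.path.exists(path):
        with open(path) as f:
            # drop any existing line for this key so it gets replaced
            lines = [l for l in f if not l.startswith(f"export {key}=")]
    lines.append(f"export {key}='{value}'\n")
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            f.writelines(lines)
        os.replace(tmp, path)  # atomic on both POSIX and Windows
    finally:
        if os.path.exists(tmp):
            os.remove(tmp)
```

`os.replace` is the key call: the file at `path` is always either the old complete version or the new complete version, never a mix.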
| Issue | Windows | Linux | macOS |
|---|---|---|---|
| Ollama not starting | Check the Windows Ollama installer; verify http://localhost:11434 responds. Restart: `taskkill /f /im ollama.exe && ollama serve` | `systemctl status ollama` or `ollama serve &`. Check `journalctl -u ollama` | `brew services start ollama`. Check `brew services list` |
| Fleet not starting | Run inside WSL: `wsl -d Ubuntu -- bash -c "cd /mnt/c/.../fleet && python supervisor.py"`. Or set BIGED_NATIVE_WINDOWS=1 for native mode | Direct: `python supervisor.py` | Direct: `python supervisor.py` |
| Dashboard won't launch | Ensure Flask is installed: `uv pip install flask`. Check port 5555: `netstat -an \| findstr 5555` | `ss -tlnp \| grep 5555`. Kill the squatter: `fuser -k 5555/tcp` | `lsof -i :5555`. Kill: `kill $(lsof -t -i :5555)` |
| Launcher won't start | Check Python has tkinter: `python -c "import tkinter"`. Usually bundled on Windows | `sudo apt install python3-tk` (Debian/Ubuntu) or `sudo pacman -S tk` (Arch) | `brew install python-tk@3.11` |
| Auto-boot not working | Check Task Scheduler: `schtasks /query /tn BigEdFleet`. Re-run `install-service` if missing | `systemctl --user status biged-fleet`. Check `~/.config/systemd/user/biged-fleet.service` | `launchctl list \| grep biged`. Check `~/Library/LaunchAgents/com.biged.fleet.plist` |
| Issue | Windows | Linux | macOS |
|---|---|---|---|
| GPU not detected | Install nvidia-ml-py: `pip install nvidia-ml-py`. Verify the NVIDIA driver: `nvidia-smi` | Install nvidia-ml-py + CUDA toolkit. Verify: `nvidia-smi` | No NVIDIA GPU on macOS — CPU-only mode. Apple Silicon uses Metal via Ollama automatically |
| pynvml ImportError | `pip install nvidia-ml-py` (not pynvml) | `pip install nvidia-ml-py` | N/A — skip pynvml. hw_supervisor detects its absence and runs CPU-only |
| Training OOM | Ensure Ollama models are fully evicted before training. Check DEPTH in MACHINE_PROFILE.md (max DEPTH=6 for 12 GB VRAM). Run Ollama on CPU during training: `CUDA_VISIBLE_DEVICES=-1 ollama serve &` | Same approach. Use `nvidia-smi` to verify GPU memory is freed before train.py | N/A — no GPU training on macOS. CPU-only autoresearch works but is slow |
| Thermal throttling | Check hw_state.json — if thermal.gpu_temp_c > 75, hw_supervisor auto-downscales the model tier. Increase the fan curve or lower `[thermal] gpu_max_sustained_c` in fleet.toml | Same. Also: `sensors` for CPU temps, `nvidia-smi -l 1` for live GPU temp | N/A — no NVIDIA thermal management. Ollama handles Apple Silicon throttling internally |
| ROCm not found (AMD) | AMD GPUs not supported on Windows for Ollama | Install ROCm per AMD docs. Ollama uses ROCm automatically when available | N/A |
| Issue | Windows | Linux | macOS |
|---|---|---|---|
| WSL networking | Enable mirrored networking in `%USERPROFILE%\.wslconfig`: `[wsl2]` / `networkingMode=mirrored`. Restart WSL: `wsl --shutdown && wsl` | N/A | N/A |
| Dashboard unreachable from browser | Check the firewall allows port 5555. If running in WSL, use localhost:5555 with mirrored networking, or the WSL IP from `wsl hostname -I` | Check `ufw status` — allow 5555 if needed | Check System Preferences > Security > Firewall |
| Ollama unreachable from fleet | Verify fleet.toml `[models] ollama_host` matches. Default: http://localhost:11434. In WSL, may need http://host.docker.internal:11434 or mirrored networking | Verify the host setting. If Docker: expose port 11434 | Same — check the host setting |
| API calls failing (429s) | Fleet auto-throttles at 20% of rate limits with 300ms minimum between requests and exponential backoff. Check `usage --period day` for budget overruns | Same | Same |
| Issue | Windows | Linux | macOS |
|---|---|---|---|
| DB locked / busy timeout | Long-running write or crashed process holding the WAL. Kill stale processes: `wsl -- pkill -f worker.py`, then restart the supervisor | `pkill -f worker.py`, then restart the supervisor | `pkill -f worker.py`, then restart the supervisor |
| Stale tasks stuck in RUNNING | Worker died mid-task. Run: `python -c "import db; db.init_db(); print(db.recover_stale_tasks())"` | Same | Same |
| Training lock stuck | Training process crashed without releasing. Run: `python -c "import db; db.init_db(); db.release_lock('training')"` | Same | Same |
| fleet.db corrupted | Restore from backup: `~/BigEd-backups/<latest>/fleet.db`. Or delete and restart — the supervisor recreates tables on boot | Same | Same |
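The stale-task recovery presumably requeues RUNNING tasks whose worker stopped heartbeating. A sketch of that logic, with an assumed `tasks` schema and staleness threshold — the real `db.recover_stale_tasks()` may differ:

```python
import sqlite3
import time

STALE_AFTER_S = 900  # assumption: RUNNING with no heartbeat for 15 min is stale

def recover_stale_tasks(db_path: str = "fleet.db") -> int:
    """Requeue RUNNING tasks whose worker stopped heartbeating.
    Returns the number of tasks reset to PENDING."""
    con = sqlite3.connect(db_path)
    cutoff = time.time() - STALE_AFTER_S
    cur = con.execute(
        "UPDATE tasks SET status = 'PENDING', assigned_to = NULL "
        "WHERE status = 'RUNNING' AND last_heartbeat < ?",
        (cutoff,),
    )
    con.commit()
    con.close()
    return cur.rowcount
```

The important property is that the update is keyed on both status and heartbeat age, so a healthy worker mid-task is never preempted.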
| Issue | Diagnosis | Fix |
|---|---|---|
| Workers not claiming tasks | `lead_client.py status` — agents show OFFLINE | Restart the supervisor. Check worker logs in `fleet/logs/` |
| Worker stuck on a single task | Skill timeout not triggering (default 600s) | Add an entry in `worker.py:SKILL_TIMEOUTS` for slow skills |
| Skill dispatch hangs | Intent parser model not loaded | Ensure the conductor model is loaded: `ollama list`. Check fleet.toml `[models] conductor_model` |
| Module tab not appearing | Module not in profile or disabled | Check fleet.toml `[launcher] profile` and the `[launcher.tabs]` section |
| Launcher can't find fleet | FLEET_DIR resolution failing | Set env: `BIGED_FLEET_DIR=/path/to/fleet` or verify `fleet/fleet.toml` exists |
| Agent flicker in UI | Widget destroy/recreate pattern (pre-v0.32) | Update to v0.32+, which uses a widget cache + configure pattern |
| Issue | Platform | Fix |
|---|---|---|
| WSL not found | Windows | `wsl --install` or `wsl --install -d Ubuntu`. Reboot after install |
| Gatekeeper blocks launch | macOS | Right-click the app > Open. Or: `xattr -d com.apple.quarantine BigEdCC.app` |
| `python3` not `python` | Linux | Alias in shell profile, or install the `python-is-python3` package (Debian/Ubuntu) |
| Permission denied on scripts | Linux/macOS | `chmod +x scripts/backup.sh fleet/supervisor.py` |
| Git line endings break scripts | Windows | Set `core.autocrlf=input` in git config (or `* text eol=lf` in .gitattributes). Re-clone or `git checkout -- scripts/` |
Every skill is a Python file in `fleet/skills/` with a single `run()` function:

```python
# fleet/skills/my_skill.py
def run(payload: dict, config: dict, log) -> dict:
    """
    payload — JSON parsed from the task's payload_json
    config  — fleet.toml parsed via config.load_config()
    log     — worker logger (use log.info/warning/error instead of print)
    Returns a dict (serialized to result_json by the worker)
    """
    query = payload.get("query", "")
    log.info(f"Processing query: {query}")
    # ... do work ...
    return {"summary": "...", "source": "..."}
```

Import the routing layer — never call Ollama/Claude/Gemini directly:
```python
from skills._models import call_model

result = call_model(prompt, config, provider="local")   # Ollama (default)
result = call_model(prompt, config, provider="claude")  # Claude API
result = call_model(prompt, config, provider="gemini")  # Gemini API
```

If a skill requires internet access, declare it at module level:

```python
REQUIRES_NETWORK = True  # Worker checks before dispatch; rejected in offline/air-gap mode
```

Skills without this flag (or with `REQUIRES_NETWORK = False`) are allowed in all modes.
- Save the file as `fleet/skills/<skill_name>.py`
- Add to the `lead_client.py` intent parser prompt if it should be dispatchable by natural language
- Add to the `[affinity]` section in `fleet.toml` if it should route to specific worker roles
- Add a custom timeout in `worker.py:SKILL_TIMEOUTS` if >600s is needed
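The timeout override in the last step can be sketched as a per-skill lookup table with a default. The skill names and exact shape here are assumptions for illustration; consult `worker.py` for the real structure:

```python
# Assumed shape: a per-skill override table plus a fallback default.
DEFAULT_TIMEOUT_S = 600
SKILL_TIMEOUTS = {
    "deep_research": 1800,  # hypothetical slow skill
    "code_index": 1200,     # hypothetical slow skill
}

def timeout_for(skill: str) -> int:
    """Resolve the timeout the worker enforces for a given skill."""
    return SKILL_TIMEOUTS.get(skill, DEFAULT_TIMEOUT_S)
```

Any skill not listed falls back to the 600s default mentioned in the troubleshooting table.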
```bash
cd fleet

# Verify all skills import cleanly
python smoke_test.py --fast

# Test a single skill directly (run() takes payload, config, and a logger)
python -c "import logging; from skills.my_skill import run; print(run({'query': 'test'}, {}, logging.getLogger()))"

# Test via fleet dispatch (requires a running supervisor)
python lead_client.py dispatch my_skill '{"query": "test"}' --wait
```

Create `BigEd/launcher/modules/mod_<name>.py` with a `Module` class:
```python
class Module:
    NAME = "my_module"        # matches fleet.toml key
    LABEL = "My Module"       # tab label in UI
    VERSION = "0.31"
    DEFAULT_ENABLED = False
    DEPENDS_ON = []           # other module NAMEs required
    DATA_SCHEMA = {
        "table": "my_module_data",
        "fields": {
            "name": {"type": "text", "required": True},
            "status": {"type": "text", "enum": ["active", "inactive"]},
        },
        "retention_days": None,
    }

    def __init__(self, app):
        self.app = app

    def build_tab(self, parent):
        """Build UI widgets into the parent frame."""
        pass

    def on_refresh(self):
        """Called periodically by the launcher timer. Use _db_query_bg for DB work."""
        pass

    def on_close(self):
        """Cleanup on app exit."""
        pass

    def get_settings(self) -> dict:
        return {}

    def apply_settings(self, cfg):
        pass

    def export_data(self) -> list[dict]:
        """Return all records as a list of dicts for data portability."""
        return []

    def validate_record(self, data) -> tuple[bool, str]:
        """Validate a record against DATA_SCHEMA."""
        return True, "OK"
```

- Save as `BigEd/launcher/modules/mod_<name>.py` — auto-discovered by `discover_modules()`
- Add to `manifest.json` (auto-added on first load, but you can pre-populate)
- Add to a profile in `modules/__init__.py:DEPLOYMENT_PROFILES` if it belongs to one
- Enable in `fleet.toml` under `[launcher.tabs]`: `my_module = true`
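A minimal `validate_record` implementation against the `DATA_SCHEMA` shape shown above might look like this. It is a sketch of the documented contract (required fields, enum membership); the real launcher may enforce more:

```python
DATA_SCHEMA = {
    "table": "my_module_data",
    "fields": {
        "name": {"type": "text", "required": True},
        "status": {"type": "text", "enum": ["active", "inactive"]},
    },
}

def validate_record(data: dict) -> tuple[bool, str]:
    """Check required fields and enum membership per DATA_SCHEMA."""
    for field, spec in DATA_SCHEMA["fields"].items():
        if spec.get("required") and not data.get(field):
            return False, f"missing required field: {field}"
        if field in data and "enum" in spec and data[field] not in spec["enum"]:
            return False, f"{field} must be one of {spec['enum']}"
    return True, "OK"
```

Returning `(ok, reason)` rather than raising keeps the UI path simple: the launcher can show the reason string next to the offending form field.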
Never block the UI thread. Use the launcher's background query helper:

```python
def on_refresh(self):
    def _fetch(con):
        return con.execute("SELECT * FROM my_table").fetchall()

    def _render(rows):
        if rows is None:
            return
        # update widgets here

    self.app._db_query_bg(_fetch, _render)
```

Prerequisites: Python 3.11+, Git, WSL2 (fleet runs inside WSL), Ollama
```bash
# 1. Clone
git clone <repo-url> Education && cd Education

# 2. Install launcher deps (Windows native Python)
cd BigEd/launcher
pip install -r requirements.txt

# 3. Install fleet deps (WSL)
wsl
cd /mnt/c/Users/<you>/Projects/Education/fleet
pip install httpx anthropic   # or: uv sync

# 4. Configure
cp fleet/fleet.toml.example fleet/fleet.toml   # if template exists
# Edit fleet.toml:
#   [launcher] profile = "consulting"      # or minimal/research/full
#   [models] local = "qwen3:8b"            # adjust for your GPU VRAM
#   [thermal] gpu_max_sustained_c = 75     # adjust for your card

# 5. Set API keys
python fleet/lead_client.py secret set ANTHROPIC_API_KEY <key>
python fleet/lead_client.py secret set GEMINI_API_KEY <key>

# 6. Pull Ollama models
ollama pull qwen3:8b     # GPU workers (skip if no GPU)
ollama pull qwen3:4b     # conductor (CPU)
ollama pull qwen3:0.6b   # maintainer (CPU, always loaded)

# 7. Verify
cd fleet
python smoke_test.py --fast   # should pass all checks

# 8. Start fleet (WSL)
nohup python supervisor.py >> logs/supervisor.log 2>&1 &

# 9. Start launcher (Windows)
cd BigEd/launcher
python launcher.py
```

Prerequisites: Python 3.11+, python3-tk (system package), Git, Ollama
```bash
# 1. Clone
git clone <repo-url> Education && cd Education

# 2. Install deps (single environment — no WSL needed)
pip install -r BigEd/launcher/requirements.txt
cd fleet && pip install httpx anthropic   # or: uv sync

# 3. Configure (same as Windows)
cp fleet/fleet.toml.example fleet/fleet.toml
# Edit fleet.toml as needed

# 4. Set API keys
python fleet/lead_client.py secret set ANTHROPIC_API_KEY <key>

# 5. Pull Ollama models
ollama pull qwen3:8b

# 6. Verify
cd fleet && python smoke_test.py --fast

# 7. Start fleet (native — no WSL bridge)
nohup python supervisor.py >> logs/supervisor.log 2>&1 &

# 8. Start launcher (same machine)
python BigEd/launcher/launcher.py
```

Key difference: no WSL layer. Fleet and launcher run in the same OS. DirectBridge replaces `wsl()`/`wsl_bg()` calls.
Prerequisites: Python 3.11+ (Homebrew), python-tk (brew), Git, Ollama

Same steps as Linux. Additional notes:

- `brew install python-tk@3.11` if the tkinter import fails
- First launch may trigger Gatekeeper — right-click > Open to bypass
- No NVIDIA GPU support. CPU-only Ollama or Apple Silicon-optimized models
- If using a `.app` bundle: drag to `/Applications/`, launch normally
| GPU VRAM | Recommended local model | Notes |
|---|---|---|
| None | CPU-only (qwen3:4b, `num_gpu: 0`) | Slow but functional |
| 6-8 GB | qwen3:4b (GPU) | Mid-tier, reliable |
| 10-12 GB | qwen3:8b (GPU) | Full quality, thermal management recommended |
| 16+ GB | qwen3:8b + headroom for training | Can run Ollama during autoresearch |
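The tier selection above can be sketched as a simple lookup. This is illustrative only; hw_supervisor's real tiering logic may weigh thermal state and loaded models as well:

```python
def pick_local_model(vram_gb: float) -> dict:
    """Map detected VRAM to the recommended Ollama model tier from the table."""
    if vram_gb >= 10:
        return {"model": "qwen3:8b", "gpu": True}
    if vram_gb >= 6:
        return {"model": "qwen3:4b", "gpu": True}
    # no usable GPU: fall back to CPU-only with num_gpu: 0
    return {"model": "qwen3:4b", "gpu": False, "options": {"num_gpu": 0}}
```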
- Safe ceiling: 10 GB for a 12 GB card (leaves headroom for OS/display)
- Sweet spot: DEPTH=6, ~26M params, ~6.9 GB VRAM
- DEPTH=7+ OOMs on 12 GB cards — never exceed DEPTH=6
- Training + Ollama: run Ollama on CPU during training: `CUDA_VISIBLE_DEVICES=-1 ollama serve &`
- Eco mode (default): CPU-only, ~40% CPU utilization, 0 VRAM
Windows:

```bash
# Start fleet (WSL)
wsl -d Ubuntu -- bash -c "cd /mnt/c/.../fleet && nohup python supervisor.py >> logs/supervisor.log 2>&1 &"

# Start hardware supervisor (WSL, optional — GPU systems only)
wsl -d Ubuntu -- bash -c "cd /mnt/c/.../fleet && nohup python hw_supervisor.py >> logs/hw_supervisor.log 2>&1 &"

# Start dashboard (WSL or Windows)
python fleet/dashboard.py               # default http://localhost:5555
python fleet/dashboard.py --port 8080   # custom port

# Start launcher (Windows native)
python BigEd/launcher/launcher.py

# Stop fleet — graceful
wsl -d Ubuntu -- python lead_client.py broadcast '{"type": "pause"}'

# Stop fleet — force
wsl -d Ubuntu -- pkill -f supervisor.py
```

Linux / macOS:
```bash
# Start fleet (native — no WSL)
cd fleet
nohup python supervisor.py >> logs/supervisor.log 2>&1 &
nohup python hw_supervisor.py >> logs/hw_supervisor.log 2>&1 &   # optional, GPU only

# Start dashboard
python dashboard.py

# Start launcher
python BigEd/launcher/launcher.py

# Stop fleet
pkill -f supervisor.py
# Or gracefully:
python lead_client.py broadcast '{"type": "pause"}'
```

```bash
# Fleet status (agents + task counts)
python lead_client.py status

# Check a specific agent log
python lead_client.py logs researcher --tail 50

# Verify DB integrity
python -c "import db; db.init_db(); print(db.get_fleet_status())"

# Quick smoke test (imports + connectivity)
python smoke_test.py --fast

# Extended soak test
python soak_test.py

# Platform detection (shell, network tools, bridge type)
python lead_client.py detect-cli
```

| Component | Log Path |
|---|---|
| Supervisor | fleet/logs/supervisor.log |
| HW Supervisor | fleet/logs/hw_supervisor.log |
| Workers | fleet/logs/<role>.log (e.g., researcher.log, coder_1.log) |
| Dashboard | stdout (Flask dev server) |
| Launcher | stdout (tkinter) |
| File | Purpose | Managed By |
|---|---|---|
| `fleet/fleet.db` | Task queue, agents, messages, usage tracking | db.py (SQLite WAL mode) |
| `fleet/rag.db` | RAG document embeddings and chunks | RAG skills |
| `fleet/hw_state.json` | GPU temps, VRAM, model tier state (updated every 5s) | hw_supervisor.py |
| `fleet/STATUS.md` | Human-readable fleet snapshot | supervisor.py |
| `fleet/fleet.toml` | Configuration (models, thermal, tabs, budgets, offline mode) | Manual / config.py |
| `fleet/keys_registry.toml` | API key registry metadata | lead_client.py |
| `~/.secrets` | API keys (`export KEY='value'` format) | `lead_client.py secret` |
| `BigEd/launcher/data/tools.db` | Launcher module data (tools, records) | Launcher modules |
| Mode | Config Flag | Behavior |
|---|---|---|
| Normal | (default) | All skills, APIs, Discord/OpenClaw active |
| Eco | `eco_mode = true` | CPU-only Ollama, ~40% CPU, 0 VRAM. Default mode |
| Offline | `offline_mode = true` | External API skills rejected, local Ollama works, Discord/OpenClaw skipped |
| Air-Gap | `air_gap_mode = true` | Implies offline + dashboard disabled, secrets not loaded, deny-by-default skill whitelist |
Per-zone file access control for SOC 2 compliance. Agents and skills are restricted to declared filesystem zones.

Config (fleet.toml `[filesystem]`):

```toml
[filesystem.zones]
knowledge = {path = "fleet/knowledge", access = "read_write"}
skills = {path = "fleet/skills", access = "read"}

[filesystem.overrides]
deploy_skill = {zones = ["skills"], access = "full"}

[filesystem.enterprise]
enforce = false          # true forces hard rejection on all violations
deny_by_default = true   # reject paths not in zones
log_all_access = true    # write SOC 2 audit trail to fs_access.log
```

Usage in skills:

```python
from filesystem_guard import FileSystemGuard

guard = FileSystemGuard(config)
ok, reason = guard.check_access("fleet/skills/my_skill.py", "write", skill="deploy_skill")
```

Common errors:

- `Access denied` — path not in any declared zone. Add an entry under `[filesystem.zones]` or `[filesystem.overrides]`.
- Guard not enforcing — `enforce = false` is the default (log-only). Set `enforce = true` for hard rejection.
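The deny-by-default check can be sketched like this. It illustrates the policy (paths outside every zone are rejected; zone access level gates writes), not the actual filesystem_guard internals:

```python
from pathlib import Path

ZONES = {  # hypothetical zone table mirroring fleet.toml [filesystem.zones]
    "knowledge": {"path": "fleet/knowledge", "access": "read_write"},
    "skills": {"path": "fleet/skills", "access": "read"},
}

def check_access(path: str, mode: str) -> tuple[bool, str]:
    """Allow only paths inside a declared zone whose access level covers mode."""
    target = Path(path)
    for name, zone in ZONES.items():
        try:
            target.relative_to(zone["path"])  # raises ValueError if outside zone
        except ValueError:
            continue
        if mode == "read" or zone["access"] in ("read_write", "full"):
            return True, f"allowed by zone {name}"
        return False, f"zone {name} is {zone['access']}, {mode} denied"
    return False, "Access denied: path not in any declared zone"
```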
Downloads, verifies, and installs modules from the BigEd-ModuleHub GitHub repo.

Config (fleet.toml `[modules]`):

```toml
[modules]
hub_url = "https://github.com/mbachaud/BigEd-ModuleHub"
enterprise_hub_url = ""   # set for a private org hub
verify_checksums = true   # SHA-256 on every download
```

API:

```python
from modules.hub import ModuleHub

hub = ModuleHub(config)
hub.list_available()         # modules in remote registry.json
hub.list_installed()         # local manifest.json
hub.install_module("crm")    # download + verify + write to modules/
hub.uninstall_module("crm")  # delete file + update manifest
hub.get_update_available()   # modules with a newer hub version
```

Common errors:

- `Module not found in hub` — registry.json doesn't list that name. Check the hub repo.
- `Checksum mismatch` — downloaded file corrupted. Retry, or set `verify_checksums = false` temporarily.
- `Download failed` — network unavailable or `hub_url` misconfigured.

Note: after install, manually add the module to `fleet.toml [launcher.tabs]` and restart the launcher. Automatic tab registration is planned for Phase 2 (0.053.01b+).
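The SHA-256 verification step can be sketched as follows — an illustration of fail-before-write, not the actual ModuleHub code:

```python
import hashlib

def verify_checksum(payload: bytes, expected_sha256: str) -> None:
    """Raise on mismatch so a corrupted download is never written to modules/."""
    actual = hashlib.sha256(payload).hexdigest()
    if actual != expected_sha256:
        raise ValueError(f"Checksum mismatch: expected {expected_sha256}, got {actual}")
```

Verifying before writing to disk is what makes `verify_checksums = false` a safe temporary workaround: with it off, a corrupted file lands in `modules/` and fails later at import time instead.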
Identifies and reclaims memory across the fleet — idle models, high-RSS workers, GC fragmentation.

Actions:

| Action | Description |
|---|---|
| `audit` | Snapshot current RSS per worker, idle model list, GC stats |
| `optimize` | Unload idle CPU models, trigger GC on high-RSS workers |
| `compact` | Compact SQLite WAL files, prune stale knowledge artifacts |
| `monitor` | Continuous watch — signals the supervisor to scale down under pressure |

Dispatch:

```bash
python lead_client.py task '{"type": "memory_optimizer", "payload": {"action": "audit"}}'
```

Safety: never kills active tasks; all actions are reversible.
The Fleet Comm launcher tab was redesigned. Operational notes:

- Agent Requests section (top): HITL requests collapse to a 1-line summary and expand on click. The pin button holds the list open. Dynamic scroll when >1 request.
- Manual Chat section (bottom): choose Local (Ollama inline), Claude Code (OAuth → VS Code), or Gemini (OAuth → browser). For OAuth models, task-briefing.md is written to `fleet/knowledge/` before the IDE opens.
- HITL response types: Approve, Reject, Need More Info, Provide Feedback, Open Discussion. "More Info" auto-creates a `research_loop` task; "Discuss" creates a `code_discuss` task.
The dashboard runs on http://localhost:5555 (configurable with `--port`).

Core Status:

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Dashboard web UI (HTML) |
| `/api/status` | GET | Fleet agents and task counts |
| `/api/activity` | GET | Recent task activity log |
| `/api/skills` | GET | Skill execution stats |
| `/api/timeline` | GET | Task execution timeline |
| `/api/fleet/health` | GET | Fleet health summary (agents, uptime, errors) |
| `/api/fleet/uptime` | GET | Uptime and restart history |
| `/api/fleet/idle` | GET | Idle worker detection |
| `/api/fleet/workers` | GET | Individual worker process status |
Knowledge & RAG:

| Endpoint | Method | Description |
|---|---|---|
| `/api/knowledge` | GET | Knowledge artifact inventory |
| `/api/code_stats` | GET | Code review and index statistics |
| `/api/reviews` | GET | Recent code/FMA reviews |
| `/api/discussions` | GET | Active code discussions |
| `/api/rag` | GET | RAG database stats (chunks, sources) |
Hardware & Thermal:

| Endpoint | Method | Description |
|---|---|---|
| `/api/thermal` | GET | GPU temperature, VRAM usage, model tier, thermal state from hw_state.json |
| `/api/training` | GET | Training status (active, checkpoints, VRAM allocation) |
Cost Intelligence:

| Endpoint | Method | Description |
|---|---|---|
| `/api/usage` | GET | Token usage summary (params: `period`, `group_by`) |
| `/api/usage/delta` | GET | Usage comparison between date ranges |
| `/api/usage/budgets` | GET | Budget status per skill |
| `/api/usage/regression` | GET | Cost regression analysis |
Communications:

| Endpoint | Method | Description |
|---|---|---|
| `/api/comms` | GET | Inter-agent message history |
| `/api/alerts` | GET | Active alerts (info/warning/critical) |
| `/api/alerts/ack/<id>` | POST | Acknowledge an alert |
| `/api/resolutions` | GET | Issue resolution tracking |
| `/api/modules` | GET | Launcher module status |
| `/api/data_stats` | GET | Data/record statistics across modules |
Live Streaming:

| Endpoint | Method | Description |
|---|---|---|
| `/api/stream` | GET | Server-Sent Events (SSE) — live task updates, alerts, agent status changes |
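Consuming the SSE stream from a script can be sketched with a minimal event-stream line parser (stdlib only). The event names and JSON payload shape are assumptions — the dashboard's actual stream format may differ:

```python
import json
from typing import Iterable, Iterator

def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Parse text/event-stream lines into events.
    Yields {"event": name, "data": parsed_json_or_str} at each blank-line boundary."""
    event, data = "message", []
    for line in lines:
        line = line.rstrip("\n")
        if not line:  # a blank line terminates one event
            if data:
                raw = "\n".join(data)
                try:
                    payload = json.loads(raw)
                except ValueError:
                    payload = raw
                yield {"event": event, "data": payload}
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
```

In practice you would wrap `urllib.request.urlopen("http://localhost:5555/api/stream")` and feed its decoded lines into `iter_sse_events` as they arrive.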
Process Control:

| Endpoint | Method | Description |
|---|---|---|
| `/api/fleet/start` | POST | Start/restart fleet workers |
| `/api/fleet/stop` | POST | Stop fleet workers |
| `/api/fleet/worker/<name>/restart` | POST | Restart a specific worker |
| `/api/fleet/marathon` | GET | Marathon session data |
| `/api/fleet/checkpoints` | GET | Training checkpoint inventory |
For routine health monitoring, check these in order:

- Fleet status: `lead_client.py status` — all agents should show ONLINE
- Task queue: PENDING count should not grow unbounded; RUNNING should be <= worker count
- Thermal: `curl localhost:5555/api/thermal` — GPU temp below `gpu_max_sustained_c` (default 75C)
- Dashboard health: `curl localhost:5555/api/fleet/health` — no critical errors
- Usage: `lead_client.py usage --period day` — no unexpected cost spikes
- Logs: check `fleet/logs/supervisor.log` for repeated errors or restart loops
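The thermal check in the list above can be scripted. A sketch using only the stdlib — the `thermal.gpu_temp_c` field name is an assumption based on the hw_state.json description, so verify against your actual `/api/thermal` payload:

```python
import json
import urllib.request

def fetch_json(url: str) -> dict:
    """GET a dashboard endpoint and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)

def evaluate(thermal: dict, max_temp_c: float = 75.0) -> list[str]:
    """Return human-readable warnings from an /api/thermal payload."""
    warnings = []
    temp = thermal.get("thermal", {}).get("gpu_temp_c")
    if temp is not None and temp > max_temp_c:
        warnings.append(f"GPU at {temp}C exceeds sustained limit {max_temp_c}C")
    return warnings
```

Usage: `evaluate(fetch_json("http://localhost:5555/api/thermal"))` — an empty list means the thermal check passed.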
```bash
bash scripts/backup.sh
```

This creates a timestamped snapshot in `~/BigEd-backups/<YYYYMMDD_HHMMSS>/` containing:

| File / Directory | Source | Purpose |
|---|---|---|
| `fleet.db` | fleet/fleet.db | Task queue, agents, messages, usage data |
| `rag.db` | fleet/rag.db | RAG document embeddings and chunks |
| `tools.db` | BigEd/launcher/data/tools.db | Launcher module data |
| `knowledge/` | fleet/knowledge/ | All worker artifacts (reviews, discussions, research, drafts) |
| `keys_registry.toml` | fleet/keys_registry.toml | API key registry metadata |

The script automatically prunes old backups, keeping the last 10.
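The prune step relies on the timestamped directory names sorting chronologically. A sketch of the keep-last-10 logic (illustrative; the shipped backup.sh may implement it differently):

```python
import shutil
from pathlib import Path

def prune_backups(backup_root: str, keep: int = 10) -> list[str]:
    """Delete all but the newest `keep` snapshot directories.
    YYYYMMDD_HHMMSS names sort chronologically as plain strings."""
    snapshots = sorted(p for p in Path(backup_root).iterdir() if p.is_dir())
    doomed = snapshots[:-keep] if keep else snapshots
    for d in doomed:
        shutil.rmtree(d)
    return [d.name for d in doomed]
```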
Full fleet reset (clean restart):

```bash
# Stop fleet
pkill -f supervisor.py

# Optionally restore DB from backup
cp ~/BigEd-backups/<latest>/fleet.db fleet/fleet.db

# Restart — supervisor recreates missing tables on boot
python fleet/supervisor.py
```

RAG re-ingestion (after rag.db loss):

```bash
# Restore from backup if available
cp ~/BigEd-backups/<latest>/rag.db fleet/rag.db

# Or re-ingest documents (rebuilds the index)
python lead_client.py task "re-index all knowledge documents"
```

Secrets recovery:

```bash
# Secrets live in ~/.secrets, not in the backup
# Re-set if lost:
python fleet/lead_client.py secret set ANTHROPIC_API_KEY <key>
python fleet/lead_client.py secret set GEMINI_API_KEY <key>
```

Knowledge artifact recovery:

```bash
# Restore knowledge directory from backup
cp -r ~/BigEd-backups/<latest>/knowledge/ fleet/knowledge/
```

| Environment | Frequency | Method |
|---|---|---|
| Development | Before major changes | Manual: `bash scripts/backup.sh` |
| Production (single user) | Daily | Cron: `0 2 * * * bash /path/to/scripts/backup.sh` |
| Production (team) | Every 6 hours | Cron + off-site copy |
Two methods:

- GUI: click "Report Issue" in the launcher sidebar. Fill in a description, optional reproduction steps, and check "Include logs". Click Submit — the report is saved to `reports/debug/`.
- CLI: `python -m biged.debug_report` — generates a snapshot without the GUI. Useful for headless or crashed states.

On unhandled exception, a report is auto-generated and the user is notified.
Reports are JSON files in `reports/debug/debug_<timestamp>.json`:

| Section | What it tells you |
|---|---|
| `platform` | OS, Python version, architecture — rules out platform-specific issues |
| `hardware` | GPU model/VRAM/temp, CPU, RAM — identifies resource constraints |
| `fleet_state` | Agent statuses, task counts, Ollama state, thermal — fleet health snapshot |
| `error` | Exception type/traceback, component, trigger — the actual problem |
| `logs` | Last 50 lines of supervisor/worker logs, last 100 lines of launcher output |
| `config_snapshot` | Active profile, model tier, thermal limits — reproduction context |
| `reproduction_steps` | User-provided steps to reproduce (if manual report) |
- GitHub Issues (automated): if the `gh` CLI is installed, the "Report Issue" dialog offers "Submit to GitHub". Creates an issue with the report summary in the body and the full report attached. Labels: `bug` or `user-report`.
- File export (manual): the report is saved as `.json` to the Desktop. Attach it to an email or GitHub issue manually.

Reports are sanitized before submission:

- API keys are stripped from the config snapshot
- User paths are anonymized (`C:\Users\max\...` becomes `~\...`)
- No file contents are included — only log tails and metadata
- Review the JSON before submitting if in doubt
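The path anonymization rule above can be sketched with two regexes — an illustration of the rewrite, not the shipped sanitizer:

```python
import re

def anonymize_paths(text: str) -> str:
    """Replace Windows and POSIX home-directory prefixes with ~."""
    text = re.sub(r"[A-Za-z]:\\Users\\[^\\\s]+", "~", text)  # C:\Users\max -> ~
    text = re.sub(r"/(?:home|Users)/[^/\s]+", "~", text)     # /home/max -> ~
    return text
```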
See FRAMEWORK_BLUEPRINT.md S10-S11 for full debug report schema and resolution tracking lifecycle.
| Worker | Role | Primary Skills |
|---|---|---|
| researcher | Papers, arxiv, web search | web_search, arxiv_fetch, lead_research |
| coder_1..N | Code review (architect/critic/perf) | code_review, code_discuss, code_index, fma_review, skill_draft, code_quality |
| archivist | Flashcards, knowledge org | summarize, synthesize |
| analyst | Autoresearch results analysis | Analysis skills |
| sales | SMB lead research + outreach | lead_research |
| onboarding | Client onboarding checklists | Onboarding skills |
| implementation | Local AI deployment specs | Implementation skills |
| security | Security audits, pen tests, advisories | security_audit, pen_test, security_review |
| planner | Workload planning (queues 5-500 tasks) | Planning/scheduling |
| legal | Legal document review | Legal skills |
| account_manager | Account management | Account skills |
Coder count is configurable via fleet.toml [workers] coder_count (default 1).
| Skill | Output Directory |
|---|---|
| `code_discuss` | knowledge/code_discussion/ + messages table |
| `code_index` | knowledge/code_index.jsonl |
| `code_review` | knowledge/code_reviews/<file>_review_<date>_<agent>.md |
| `fma_review` | knowledge/fma_reviews/<file>_review_<date>_<agent>.md + discussion |
| `skill_draft` | knowledge/code_drafts/<name>_draft_<date>_<agent>.py |
| `security_review` | knowledge/security/reviews/security_review_<date>.md |
| `code_quality` | knowledge/quality/reviews/quality_review_<date>.md |
Drafts are never auto-deployed — review before copying to skills/.
| Bridge | Config Flag | Status |
|---|---|---|
| Discord (`discord_bot.py`) | `discord_bot_enabled` | Active — routes biged-fleet chat to fleet |
| OpenClaw gateway | `openclaw_enabled` | Installed, disabled by default |

Discord commands: `/aider`, `/claude`, `/gemini`, `/local`, `/status`, `/task`, `/result`, `/help`
| Supervisor | Responsibility | Loop Interval |
|---|---|---|
| `supervisor.py` | Process lifecycle: Ollama start/stop, worker respawn, training detection, Discord/OpenClaw | Continuous |
| `hw_supervisor.py` | Model health: keepalive (~240s), conductor check (~60s), VRAM/thermal scaling, model tier transitions | 5s state writes |

State file: `hw_state.json` — written by hw_supervisor every 5s; read by the supervisor, workers, dashboard, and launcher. Contains: status, model, thermal, models_loaded, conductor status.
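Readers of the state file should tolerate catching it mid-write, since it is rewritten every 5 seconds. A defensive reader sketch (the field names are assumptions based on the description above):

```python
import json
from pathlib import Path

def read_hw_state(path: str = "fleet/hw_state.json") -> dict:
    """Load hw_state.json, returning a safe placeholder instead of crashing
    the caller if the file is missing or caught mid-write."""
    try:
        return json.loads(Path(path).read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {"status": "unknown"}
```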
This document is updated alongside code changes:

- Version bumps — review all sections when completing a roadmap phase
- New skill/module — add to the relevant section in the same commit
- New failure mode — add to the troubleshooting table when discovered
- Deployment changes — update the deployment section if install steps change
- New CLI command — add to the CLI Reference when `lead_client.py` gains a subcommand
- New dashboard endpoint — add to the Monitoring section when `dashboard.py` gains a route
| Level | Description | Response Time | Examples |
|---|---|---|---|
| SEV-1 | Data breach / secret exposure | < 1 hour | API key in task result, PII in knowledge/ |
| SEV-2 | Service degradation | < 4 hours | Ollama OOM, supervisor crash loop, thermal runaway |
| SEV-3 | Operational issue | < 24 hours | Failed skill, stale tasks, agent quarantine |
| SEV-4 | Cosmetic / minor | Next business day | UI glitch, stale docs, non-blocking warning |
- DETECT — watchdog alerts, DLP findings, dashboard alerts, operator report
- TRIAGE — classify severity (SEV-1 through SEV-4)
- CONTAIN —
  - SEV-1: `lead_client.py broadcast "PAUSE" --channel fleet` → all workers pause
  - SEV-2: `lead_client.py gdpr-erase <affected_agent> --confirm` if data compromised
  - Rotate affected API keys: `lead_client.py dispatch secret_rotate '{"action":"rotate","key":"AFFECTED_KEY"}'`
- INVESTIGATE —
  - Check the audit log: `GET /api/audit?type=dlp_alert&limit=50`
  - Generate a debug report: "Report Issue" button or `generate_debug_report()`
  - Review task history: `GET /api/activity`
- REMEDIATE — fix the root cause, deploy a patch, verify via smoke test
- NOTIFY — GDPR Art. 33: supervisory authority within 72 hours if personal data breached
- DOCUMENT — add to `data/resolutions.jsonl` with fix_commit and regression_test
```
Subject: Data Breach Notification — BigEd CC Fleet
Date: [DATE]
Nature of breach: [DESCRIPTION]
Data subjects affected: [COUNT/SCOPE]
Likely consequences: [IMPACT]
Measures taken: [CONTAINMENT + REMEDIATION]
Contact: [DPO/OPERATOR]
```