BigEd CC — Operations Manual

Complete ops runbook: quick start, CLI reference, platform troubleshooting, skill/module authoring, deployment, monitoring, backup, and recovery. Companion to FRAMEWORK_BLUEPRINT.md (architecture) and ROADMAP.md (future work).

Platform note: Commands below use python directly. On Linux/macOS with uv installed, you may prefix with uv run. On Windows (Git Bash), always use python directly — uv run is WSL-only.

Quick Start

# 1. Start fleet
python fleet/supervisor.py              # Linux/macOS — direct
# Windows: either run inside WSL, or set BIGED_NATIVE_WINDOWS=1

# 2. Start dashboard (web UI on http://localhost:5555)
python fleet/dashboard.py

# 3. Start launcher (tkinter desktop app)
python BigEd/launcher/launcher.py

# 4. Check fleet health
python fleet/lead_client.py status

CLI Reference

All commands below are run as python fleet/lead_client.py <command>.

Fleet Status & Control

Command	Description
`status`	Show all agents (name, role, status, last heartbeat) and task counts
`detect-cli`	Detect best local CLI, shell, network tools, and bridge for this platform
`install-service`	Install fleet as auto-start service (Task Scheduler / systemd / launchd)
`uninstall-service`	Remove the auto-start service

Task Management

Command	Description
`task "instruction"`	Submit natural language task (parsed by conductor model). Add `--wait` to block until complete. `--priority N` (1-10, default 5)
`task '{"skill":"web_search","query":"..."}'`	Submit raw JSON task (bypasses intent parser)
`dispatch skill payload`	Dispatch explicit skill + JSON payload. `--priority N` (default 9), `--assigned-to agent`, `--b64` for base64 payload
`result <task_id>`	Fetch status and result of a specific task

Messaging

Command	Description
`send agent "msg"`	Direct message to a specific agent. `--channel fleet\|sup\|agent\|pool`
`broadcast "msg"`	Broadcast message to all registered agents. `--channel` same as above
`inbox agent`	Check agent inbox (unread by default). `--all` for all, `--limit N`, `--channel` filter
`notes channel`	Read channel scratchpad. `--post "json"` to add, `--since ISO`, `--limit N`

Cost Intelligence

Command	Description
`usage --period day\|week\|month`	Token usage breakdown by skill (calls, input/output tokens, cost USD, cache savings)
`usage-delta from_start from_end to_start to_end`	Compare usage between two ISO date ranges (delta %, direction arrows)
`budget`	Token budget status per skill (from `[budgets]` in fleet.toml)

Marathon / Training

Command	Description
`marathon [session]`	List marathon sessions (last 5). Pass session ID for last 3 snapshots
`marathon-checkpoint`	Show autoresearch training checkpoints (last 10 `.pt` files, size, modified)

Logs

Command	Description
`logs agent --tail N`	Tail the log file for a specific agent (default 30 lines)

Secrets

Command	Description
`secret set KEY value`	Set an API key in `~/.secrets` (atomic write). `--b64` for base64-encoded value
`secret get KEY`	Retrieve a secret value
`secret list`	List all secret keys (values masked)

Platform Troubleshooting Matrix (PT-4)

Startup Issues

Issue	Windows	Linux	macOS
Ollama not starting	Check Windows Ollama installer; verify `http://localhost:11434` responds. Restart: `taskkill /f /im ollama.exe && ollama serve`	`systemctl status ollama` or `ollama serve &`. Check `journalctl -u ollama`	`brew services start ollama`. Check `brew services list`
Fleet not starting	Run inside WSL: `wsl -d Ubuntu -- bash -c "cd /mnt/c/.../fleet && python supervisor.py"`. Or set `BIGED_NATIVE_WINDOWS=1` for native mode	Direct: `python supervisor.py`	Direct: `python supervisor.py`
Dashboard won't launch	Ensure Flask installed: `uv pip install flask`. Check port 5555: `netstat -an \| findstr 5555`	`ss -tlnp \| grep 5555`. Kill squatter: `fuser -k 5555/tcp`	`lsof -i :5555`. Kill: `kill $(lsof -t -i :5555)`
Launcher won't start	Check Python has tkinter: `python -c "import tkinter"`. Usually bundled on Windows	`sudo apt install python3-tk` (Debian/Ubuntu) or `sudo pacman -S tk` (Arch)	`brew install python-tk@3.11`
Auto-boot not working	Check Task Scheduler: `schtasks /query /tn BigEdFleet`. Re-run `install-service` if missing	`systemctl --user status biged-fleet`. Check `~/.config/systemd/user/biged-fleet.service`	`launchctl list \| grep biged`. Check `~/Library/LaunchAgents/com.biged.fleet.plist`

GPU / Hardware Issues

Issue	Windows	Linux	macOS
GPU not detected	Install `nvidia-ml-py`: `pip install nvidia-ml-py`. Verify NVIDIA driver: `nvidia-smi`	Install `nvidia-ml-py` + CUDA toolkit. Verify: `nvidia-smi`	No NVIDIA GPU on macOS — CPU-only mode. Apple Silicon uses Metal via Ollama automatically
pynvml ImportError	`pip install nvidia-ml-py` (not `pynvml`)	`pip install nvidia-ml-py`	N/A — skip pynvml. hw_supervisor detects absence and runs CPU-only
Training OOM	Ensure Ollama models fully evicted before training. Check `DEPTH` in `MACHINE_PROFILE.md` (max DEPTH=6 for 12GB VRAM). Run Ollama on CPU during training: `CUDA_VISIBLE_DEVICES=-1 ollama serve &`	Same approach. Use `nvidia-smi` to verify GPU memory freed before `train.py`	N/A — no GPU training on macOS. CPU-only autoresearch works but is slow
Thermal throttling	Check `hw_state.json` — if `thermal.gpu_temp_c > 75`, hw_supervisor auto-downscales model tier. Increase fan curve or lower `[thermal] gpu_max_sustained_c` in fleet.toml	Same. Also: `sensors` for CPU temps, `nvidia-smi -l 1` for live GPU temp	N/A — no NVIDIA thermal management. Ollama handles Apple Silicon throttling internally
ROCm not found (AMD)	AMD GPUs not supported on Windows for Ollama	Install ROCm per AMD docs. Ollama uses ROCm automatically when available	N/A

Networking Issues

Issue	Windows	Linux	macOS
WSL networking	Enable mirrored networking in `%USERPROFILE%\.wslconfig`: `[wsl2]` / `networkingMode=mirrored`. Restart WSL: `wsl --shutdown && wsl`	N/A	N/A
Dashboard unreachable from browser	Check firewall allows port 5555. If running in WSL, use `localhost:5555` with mirrored networking, or WSL IP from `wsl hostname -I`	Check `ufw status` — allow 5555 if needed	Check System Preferences > Security > Firewall
Ollama unreachable from fleet	Verify `fleet.toml [models] ollama_host` matches. Default: `http://localhost:11434`. In WSL, may need `http://host.docker.internal:11434` or mirrored networking	Verify host setting. If Docker: expose port 11434	Same — check host setting
API calls failing (429s)	Fleet auto-throttles at 20% of rate limits with 300ms min between requests and exponential backoff. Check `usage --period day` for budget overruns	Same	Same

Database Issues

Issue	Windows	Linux	macOS
DB locked / busy timeout	Long-running write or crashed process holding WAL. Kill stale processes: `wsl -- pkill -f worker.py`, then restart supervisor	`pkill -f worker.py` then restart supervisor	`pkill -f worker.py` then restart supervisor
Stale tasks stuck in RUNNING	Worker died mid-task. Run: `python -c "import db; db.init_db(); print(db.recover_stale_tasks())"`	Same	Same
Training lock stuck	Training process crashed without releasing. Run: `python -c "import db; db.init_db(); db.release_lock('training')"`	Same	Same
fleet.db corrupted	Restore from backup: `~/BigEd-backups/<latest>/fleet.db`. Or delete and restart — supervisor recreates tables on boot	Same	Same

Skill / Module Issues

Issue	Diagnosis	Fix
Workers not claiming tasks	`lead_client.py status` — agents show OFFLINE	Restart supervisor. Check worker logs in `fleet/logs/`
Worker stuck on single task	Skill timeout not triggering (default 600s)	Add entry in `worker.py:SKILL_TIMEOUTS` for slow skills
Skill dispatch hangs	Intent parser model not loaded	Ensure conductor model loaded: `ollama list`. Check `fleet.toml [models] conductor_model`
Module tab not appearing	Module not in profile or disabled	Check `fleet.toml [launcher] profile` and `[launcher.tabs]` section
Launcher can't find fleet	`FLEET_DIR` resolution failing	Set env: `BIGED_FLEET_DIR=/path/to/fleet` or verify `fleet/fleet.toml` exists
Agent flicker in UI	Widget destroy/recreate pattern (pre-v0.32)	Update to v0.32+ which uses widget cache + configure pattern

Platform-Specific Edge Cases

Issue	Platform	Fix
WSL not found	Windows	`wsl --install` or `wsl --install -d Ubuntu`. Reboot after install
Gatekeeper blocks launch	macOS	Right-click the app > Open. Or: `xattr -d com.apple.quarantine BigEdCC.app`
`python3` not `python`	Linux	Alias in shell profile, or install `python-is-python3` package (Debian/Ubuntu)
Permission denied on scripts	Linux/macOS	`chmod +x scripts/backup.sh fleet/supervisor.py`
Git line endings break scripts	Windows	Set `core.autocrlf=input` in `.gitattributes`. Re-clone or `git checkout -- scripts/`

Skill Authoring

Interface

Every skill is a Python file in fleet/skills/ with a single run() function:

# fleet/skills/my_skill.py
def run(payload: dict, config: dict, log) -> dict:
    """
    payload — JSON parsed from task's payload_json
    config  — fleet.toml parsed via config.load_config()
    log     — worker logger (use log.info/warning/error instead of print)
    Returns a dict (serialized to result_json by the worker)
    """
    query = payload.get("query", "")
    log.info(f"Processing query: {query}")
    # ... do work ...
    return {"summary": "...", "source": "..."}

Using LLM Providers

Import the routing layer — never call Ollama/Claude/Gemini directly:

from skills._models import call_model

result = call_model(prompt, config, provider="local")   # Ollama (default)
result = call_model(prompt, config, provider="claude")  # Claude API
result = call_model(prompt, config, provider="gemini")  # Gemini API

Network Requirements

If a skill requires internet access, declare it at module level:

REQUIRES_NETWORK = True  # Worker checks before dispatch; rejected in offline/air-gap mode

Skills without this flag (or with REQUIRES_NETWORK = False) are allowed in all modes.

Registration

Save file as fleet/skills/<skill_name>.py
Add to lead_client.py intent parser prompt if it should be dispatchable by natural language
Add to [affinity] section in fleet.toml if it should route to specific worker roles
Add custom timeout in worker.py:SKILL_TIMEOUTS if >600s is needed

Testing

cd fleet

# Verify all skills import cleanly
python smoke_test.py --fast

# Test a single skill directly
python -c "from skills.my_skill import run; print(run({'query': 'test'}, {}))"

# Test via fleet dispatch (requires running supervisor)
python lead_client.py dispatch my_skill '{"query": "test"}' --wait

Module Authoring

Interface Contract

Create BigEd/launcher/modules/mod_<name>.py with a Module class:

class Module:
    NAME = "my_module"          # matches fleet.toml key
    LABEL = "My Module"         # tab label in UI
    VERSION = "0.31"
    DEFAULT_ENABLED = False
    DEPENDS_ON = []             # other module NAMEs required

    DATA_SCHEMA = {
        "table": "my_module_data",
        "fields": {
            "name": {"type": "text", "required": True},
            "status": {"type": "text", "enum": ["active", "inactive"]},
        },
        "retention_days": None,
    }

    def __init__(self, app):
        self.app = app

    def build_tab(self, parent):
        """Build UI widgets into parent frame."""
        pass

    def on_refresh(self):
        """Called periodically by launcher timer. Use _db_query_bg for DB work."""
        pass

    def on_close(self):
        """Cleanup on app exit."""
        pass

    def get_settings(self) -> dict:
        return {}

    def apply_settings(self, cfg):
        pass

    def export_data(self) -> list[dict]:
        """Return all records as list of dicts for data portability."""
        return []

    def validate_record(self, data) -> tuple[bool, str]:
        """Validate a record against DATA_SCHEMA."""
        return True, "OK"

Registration

Save as BigEd/launcher/modules/mod_<name>.py — auto-discovered by discover_modules()
Add to manifest.json (auto-added on first load, but you can pre-populate)
Add to profile in modules/__init__.py:DEPLOYMENT_PROFILES if it belongs to a profile
Enable in fleet.toml under [launcher.tabs]: my_module = true

Async DB Pattern

Never block the UI thread. Use the launcher's background query helper:

def on_refresh(self):
    def _fetch(con):
        return con.execute("SELECT * FROM my_table").fetchall()
    def _render(rows):
        if rows is None:
            return
        # update widgets here
    self.app._db_query_bg(_fetch, _render)

Deployment (Fresh Machine)

Windows (Current Production)

Prerequisites: Python 3.11+, Git, WSL2 (fleet runs inside WSL), Ollama

# 1. Clone
git clone <repo-url> Education && cd Education

# 2. Install launcher deps (Windows native Python)
cd BigEd/launcher
pip install -r requirements.txt

# 3. Install fleet deps (WSL)
wsl
cd /mnt/c/Users/<you>/Projects/Education/fleet
pip install httpx anthropic  # or: uv sync

# 4. Configure
cp fleet/fleet.toml.example fleet/fleet.toml   # if template exists
# Edit fleet.toml:
#   [launcher] profile = "consulting"    # or minimal/research/full
#   [models]   local = "qwen3:8b"        # adjust for your GPU VRAM
#   [thermal]  gpu_max_sustained_c = 75  # adjust for your card

# 5. Set API keys
python fleet/lead_client.py secret set ANTHROPIC_API_KEY <key>
python fleet/lead_client.py secret set GEMINI_API_KEY <key>

# 6. Pull Ollama models
ollama pull qwen3:8b     # GPU workers (skip if no GPU)
ollama pull qwen3:4b     # conductor (CPU)
ollama pull qwen3:0.6b   # maintainer (CPU, always loaded)

# 7. Verify
cd fleet
python smoke_test.py --fast   # should pass all checks

# 8. Start fleet (WSL)
nohup python supervisor.py >> logs/supervisor.log 2>&1 &

# 9. Start launcher (Windows)
cd BigEd/launcher
python launcher.py

Linux

Prerequisites: Python 3.11+, python3-tk (system package), Git, Ollama

# 1. Clone
git clone <repo-url> Education && cd Education

# 2. Install deps (single environment — no WSL needed)
pip install -r BigEd/launcher/requirements.txt
cd fleet && pip install httpx anthropic  # or: uv sync

# 3. Configure (same as Windows)
cp fleet/fleet.toml.example fleet/fleet.toml
# Edit fleet.toml as needed

# 4. Set API keys
python fleet/lead_client.py secret set ANTHROPIC_API_KEY <key>

# 5. Pull Ollama models
ollama pull qwen3:8b

# 6. Verify
cd fleet && python smoke_test.py --fast

# 7. Start fleet (native — no WSL bridge)
nohup python supervisor.py >> logs/supervisor.log 2>&1 &

# 8. Start launcher (same machine)
python BigEd/launcher/launcher.py

Key difference: No WSL layer. Fleet and launcher run in the same OS. DirectBridge replaces wsl()/wsl_bg() calls.

macOS

Prerequisites: Python 3.11+ (Homebrew), python-tk (brew), Git, Ollama

Same steps as Linux. Additional notes:

brew install python-tk@3.11 if tkinter import fails
First launch may trigger Gatekeeper — right-click > Open to bypass
No NVIDIA GPU support. CPU-only Ollama or Apple Silicon-optimized models
If using .app bundle: drag to /Applications/, launch normally

Hardware Sizing

GPU VRAM	Recommended `local` model	Notes
None	CPU-only (`qwen3:4b`, `num_gpu: 0`)	Slow but functional
6-8 GB	`qwen3:4b` (GPU)	Mid-tier, reliable
10-12 GB	`qwen3:8b` (GPU)	Full quality, thermal management recommended
16+ GB	`qwen3:8b` + headroom for training	Can run Ollama during autoresearch

VRAM Safety Rules

Safe ceiling: 10 GB for a 12 GB card (leaves headroom for OS/display)
Sweet spot: DEPTH=6, ~26M params, ~6.9 GB VRAM
DEPTH=7+ OOMs on 12 GB cards — never exceed DEPTH=6
Training + Ollama: Run Ollama on CPU during training: CUDA_VISIBLE_DEVICES=-1 ollama serve &
Eco mode (default): CPU-only, ~40% CPU utilization, 0 VRAM

Operations

Start / Stop

Windows:

# Start fleet (WSL)
wsl -d Ubuntu -- bash -c "cd /mnt/c/.../fleet && nohup python supervisor.py >> logs/supervisor.log 2>&1 &"

# Start hardware supervisor (WSL, optional — GPU systems only)
wsl -d Ubuntu -- bash -c "cd /mnt/c/.../fleet && nohup python hw_supervisor.py >> logs/hw_supervisor.log 2>&1 &"

# Start dashboard (WSL or Windows)
python fleet/dashboard.py              # default http://localhost:5555
python fleet/dashboard.py --port 8080  # custom port

# Start launcher (Windows native)
python BigEd/launcher/launcher.py

# Stop fleet — graceful
wsl -d Ubuntu -- python lead_client.py broadcast '{"type": "pause"}'
# Stop fleet — force
wsl -d Ubuntu -- pkill -f supervisor.py

Linux / macOS:

# Start fleet (native — no WSL)
cd fleet
nohup python supervisor.py >> logs/supervisor.log 2>&1 &
nohup python hw_supervisor.py >> logs/hw_supervisor.log 2>&1 &  # optional, GPU only

# Start dashboard
python dashboard.py

# Start launcher
python BigEd/launcher/launcher.py

# Stop fleet
pkill -f supervisor.py
# Or gracefully:
python lead_client.py broadcast '{"type": "pause"}'

Health Checks

# Fleet status (agents + task counts)
python lead_client.py status

# Check specific agent log
python lead_client.py logs researcher --tail 50

# Verify DB integrity
python -c "import db; db.init_db(); print(db.get_fleet_status())"

# Quick smoke test (imports + connectivity)
python smoke_test.py --fast

# Extended soak test
python soak_test.py

# Platform detection (shell, network tools, bridge type)
python lead_client.py detect-cli

Log Locations

Component	Log Path
Supervisor	`fleet/logs/supervisor.log`
HW Supervisor	`fleet/logs/hw_supervisor.log`
Workers	`fleet/logs/<role>.log` (e.g., `researcher.log`, `coder_1.log`)
Dashboard	stdout (Flask dev server)
Launcher	stdout (tkinter)

Key Files (Runtime)

File	Purpose	Managed By
`fleet/fleet.db`	Task queue, agents, messages, usage tracking	`db.py` (SQLite WAL mode)
`fleet/rag.db`	RAG document embeddings and chunks	RAG skills
`fleet/hw_state.json`	GPU temps, VRAM, model tier state (updated every 5s)	`hw_supervisor.py`
`fleet/STATUS.md`	Human-readable fleet snapshot	`supervisor.py`
`fleet/fleet.toml`	Configuration (models, thermal, tabs, budgets, offline mode)	Manual / `config.py`
`fleet/keys_registry.toml`	API key registry metadata	`lead_client.py`
`~/.secrets`	API keys (`export KEY='value'` format)	`lead_client.py secret`
`BigEd/launcher/data/tools.db`	Launcher module data (tools, records)	Launcher modules

Operational Modes

Mode	Config Flag	Behavior
Normal	(default)	All skills, APIs, Discord/OpenClaw active
Eco	`eco_mode = true`	CPU-only Ollama, ~40% CPU, 0 VRAM. Default mode
Offline	`offline_mode = true`	External API skills rejected, local Ollama works, Discord/OpenClaw skipped
Air-Gap	`air_gap_mode = true`	Implies offline + dashboard disabled, secrets not loaded, deny-by-default skill whitelist

New Systems (v0.051.05b – v0.053.01b)

FileSystemGuard (`fleet/filesystem_guard.py`)

Per-zone file access control for SOC 2 compliance. Agents and skills are restricted to declared filesystem zones.

Config (fleet.toml [filesystem]):**

[filesystem.zones]
knowledge = {path = "fleet/knowledge", access = "read_write"}
skills    = {path = "fleet/skills", access = "read"}

[filesystem.overrides]
deploy_skill = {zones = ["skills"], access = "full"}

[filesystem.enterprise]
enforce         = false   # true forces hard rejection on all violations
deny_by_default = true    # reject paths not in zones
log_all_access  = true    # write SOC 2 audit trail to fs_access.log

Usage in skills:

from filesystem_guard import FileSystemGuard
guard = FileSystemGuard(config)
ok, reason = guard.check_access("fleet/skills/my_skill.py", "write", skill="deploy_skill")

Common errors:

Access denied — path not in any declared zone. Add an entry under [filesystem.zones] or [filesystem.overrides].
Guard not enforcing — enforce = false is the default (log-only). Set enforce = true for hard rejection.

Module Hub (`BigEd/launcher/modules/hub.py`)

Downloads, verifies, and installs modules from the BigEd-ModuleHub GitHub repo.

Config (fleet.toml [modules]):**

[modules]
hub_url              = "https://github.com/mbachaud/BigEd-ModuleHub"
enterprise_hub_url   = ""      # set for private org hub
verify_checksums     = true    # SHA-256 on every download

API:

from modules.hub import ModuleHub
hub = ModuleHub(config)
hub.list_available()           # modules in remote registry.json
hub.list_installed()           # local manifest.json
hub.install_module("crm")     # download + verify + write to modules/
hub.uninstall_module("crm")   # delete file + update manifest
hub.get_update_available()    # modules with newer hub version

Common errors:

Module not found in hub — registry.json doesn't list that name. Check hub repo.
Checksum mismatch — downloaded file corrupted. Retry or set verify_checksums = false temporarily.
Download failed — network unavailable or hub_url misconfigured.

Note: After install, manually add the module to fleet.toml [launcher.tabs] and restart the launcher. Automatic tab registration is planned for Phase 2 (0.053.01b+).

Memory Optimizer Skill (`fleet/skills/memory_optimizer.py`)

Identifies and reclaims memory across the fleet — idle models, high-RSS workers, GC fragmentation.

Actions:

Action	Description
`audit`	Snapshot current RSS per worker, idle model list, GC stats
`optimize`	Unload idle CPU models, trigger GC on high-RSS workers
`compact`	Compact SQLite WAL files, prune stale knowledge artifacts
`monitor`	Continuous watch — signals supervisor to scale down under pressure

Dispatch:

python lead_client.py task '{"type": "memory_optimizer", "payload": {"action": "audit"}}'

Safety: never kills active tasks; all actions are reversible.

Fleet Comm UX (v0.051.05b redesign)

The Fleet Comm launcher tab was redesigned. Operational notes:

Agent Requests section (top): HITL requests collapsed to 1-line summary, expand on click. Pin button holds list open. Dynamic scroll when >1 request.
Manual Chat section (bottom): Choose Local (Ollama inline), Claude Code (OAuth → VS Code), or Gemini (OAuth → browser). For OAuth models, task-briefing.md is written to fleet/knowledge/ before IDE opens.
HITL response types: Approve, Reject, Need More Info, Provide Feedback, Open Discussion. "More Info" auto-creates a research_loop task; "Discuss" creates a code_discuss task.

Monitoring

Dashboard Endpoints

The dashboard runs on http://localhost:5555 (configurable with --port).

Core Status:

Endpoint	Method	Description
`/`	GET	Dashboard web UI (HTML)
`/api/status`	GET	Fleet agents and task counts
`/api/activity`	GET	Recent task activity log
`/api/skills`	GET	Skill execution stats
`/api/timeline`	GET	Task execution timeline
`/api/fleet/health`	GET	Fleet health summary (agents, uptime, errors)
`/api/fleet/uptime`	GET	Uptime and restart history
`/api/fleet/idle`	GET	Idle worker detection
`/api/fleet/workers`	GET	Individual worker process status

Knowledge & RAG:

Endpoint	Method	Description
`/api/knowledge`	GET	Knowledge artifact inventory
`/api/code_stats`	GET	Code review and index statistics
`/api/reviews`	GET	Recent code/FMA reviews
`/api/discussions`	GET	Active code discussions
`/api/rag`	GET	RAG database stats (chunks, sources)

Hardware & Thermal:

Endpoint	Method	Description
`/api/thermal`	GET	GPU temperature, VRAM usage, model tier, thermal state from `hw_state.json`
`/api/training`	GET	Training status (active, checkpoints, VRAM allocation)

Cost Intelligence:

Endpoint	Method	Description
`/api/usage`	GET	Token usage summary (params: `period`, `group_by`)
`/api/usage/delta`	GET	Usage comparison between date ranges
`/api/usage/budgets`	GET	Budget status per skill
`/api/usage/regression`	GET	Cost regression analysis

Communications:

Endpoint	Method	Description
`/api/comms`	GET	Inter-agent message history
`/api/alerts`	GET	Active alerts (info/warning/critical)
`/api/alerts/ack/<id>`	POST	Acknowledge an alert
`/api/resolutions`	GET	Issue resolution tracking
`/api/modules`	GET	Launcher module status
`/api/data_stats`	GET	Data/record statistics across modules

Live Streaming:

Endpoint	Method	Description
`/api/stream`	GET	Server-Sent Events (SSE) — live task updates, alerts, agent status changes

Process Control:

Endpoint	Method	Description
`/api/fleet/start`	POST	Start/restart fleet workers
`/api/fleet/stop`	POST	Stop fleet workers
`/api/fleet/worker/<name>/restart`	POST	Restart a specific worker
`/api/fleet/marathon`	GET	Marathon session data
`/api/fleet/checkpoints`	GET	Training checkpoint inventory

Monitoring Checklist

For routine health monitoring, check these in order:

Fleet status: lead_client.py status — all agents should show ONLINE
Task queue: PENDING count should not grow unbounded; RUNNING should be <= worker count
Thermal: curl localhost:5555/api/thermal — GPU temp below gpu_max_sustained_c (default 75C)
Dashboard health: curl localhost:5555/api/fleet/health — no critical errors
Usage: lead_client.py usage --period day — no unexpected cost spikes
Logs: Check fleet/logs/supervisor.log for repeated errors or restart loops

Backup & Recovery

Running a Backup

bash scripts/backup.sh

This creates a timestamped snapshot in ~/BigEd-backups/<YYYYMMDD_HHMMSS>/ containing:

File / Directory	Source	Purpose
`fleet.db`	`fleet/fleet.db`	Task queue, agents, messages, usage data
`rag.db`	`fleet/rag.db`	RAG document embeddings and chunks
`tools.db`	`BigEd/launcher/data/tools.db`	Launcher module data
`knowledge/`	`fleet/knowledge/`	All worker artifacts (reviews, discussions, research, drafts)
`keys_registry.toml`	`fleet/keys_registry.toml`	API key registry metadata

The script automatically prunes old backups, keeping the last 10.

Recovery Procedures

Full fleet reset (clean restart):

# Stop fleet
pkill -f supervisor.py

# Optionally restore DB from backup
cp ~/BigEd-backups/<latest>/fleet.db fleet/fleet.db

# Restart — supervisor recreates missing tables on boot
python fleet/supervisor.py

RAG re-ingestion (after rag.db loss):

# Restore from backup if available
cp ~/BigEd-backups/<latest>/rag.db fleet/rag.db

# Or re-ingest documents (rebuilds the index)
python lead_client.py task "re-index all knowledge documents"

Secrets recovery:

# Secrets live in ~/.secrets, not in the backup
# Re-set if lost:
python fleet/lead_client.py secret set ANTHROPIC_API_KEY <key>
python fleet/lead_client.py secret set GEMINI_API_KEY <key>

Knowledge artifact recovery:

# Restore knowledge directory from backup
cp -r ~/BigEd-backups/<latest>/knowledge/ fleet/knowledge/

Backup Schedule Recommendation

Environment	Frequency	Method
Development	Before major changes	Manual: `bash scripts/backup.sh`
Production (single user)	Daily	Cron: `0 2 * * * bash /path/to/scripts/backup.sh`
Production (team)	Every 6 hours	Cron + off-site copy

Issue Reporting

Generating a Debug Report

Two methods:

GUI: Click "Report Issue" in the launcher sidebar. Fill in description, optional reproduction steps, check "Include logs". Click Submit — report saved to reports/debug/.
CLI: python -m biged.debug_report — generates a snapshot without the GUI. Useful for headless or crashed states.

On unhandled exception, a report is auto-generated and the user is notified.

Reading a Debug Report

Reports are JSON files in reports/debug/debug_<timestamp>.json:

Section	What it tells you
`platform`	OS, Python version, architecture — rules out platform-specific issues
`hardware`	GPU model/VRAM/temp, CPU, RAM — identifies resource constraints
`fleet_state`	Agent statuses, task counts, Ollama state, thermal — fleet health snapshot
`error`	Exception type/traceback, component, trigger — the actual problem
`logs`	Last 50 lines of supervisor/worker logs, last 100 lines of launcher output
`config_snapshot`	Active profile, model tier, thermal limits — reproduction context
`reproduction_steps`	User-provided steps to reproduce (if manual report)

Submitting a Report

GitHub Issues (automated): If gh CLI is installed, the "Report Issue" dialog offers "Submit to GitHub". Creates an issue with report summary in the body and full report attached. Labels: bug or user-report.
File export (manual): Report saved as .json to Desktop. Attach to an email or GitHub issue manually.

Privacy

Reports are sanitized before submission:

API keys are stripped from the config snapshot
User paths are anonymized (C:\Users\max\... becomes ~\...)
No file contents are included — only log tails and metadata
Review the JSON before submitting if in doubt

See FRAMEWORK_BLUEPRINT.md S10-S11 for full debug report schema and resolution tracking lifecycle.

Worker Roles Reference

Worker	Role	Primary Skills
researcher	Papers, arxiv, web search	`web_search`, `arxiv_fetch`, `lead_research`
coder_1..N	Code review (architect/critic/perf)	`code_review`, `code_discuss`, `code_index`, `fma_review`, `skill_draft`, `code_quality`
archivist	Flashcards, knowledge org	`summarize`, `synthesize`
analyst	Autoresearch results analysis	Analysis skills
sales	SMB lead research + outreach	`lead_research`
onboarding	Client onboarding checklists	Onboarding skills
implementation	Local AI deployment specs	Implementation skills
security	Security audits, pen tests, advisories	`security_audit`, `pen_test`, `security_review`
planner	Workload planning (queues 5-500 tasks)	Planning/scheduling
legal	Legal document review	Legal skills
account_manager	Account management	Account skills

Coder count is configurable via fleet.toml [workers] coder_count (default 1).

Skill Output Locations

Skill	Output Directory
`code_discuss`	`knowledge/code_discussion/` + messages table
`code_index`	`knowledge/code_index.jsonl`
`code_review`	`knowledge/code_reviews/<file>_review_<date>_<agent>.md`
`fma_review`	`knowledge/fma_reviews/<file>_review_<date>_<agent>.md` + discussion
`skill_draft`	`knowledge/code_drafts/<name>_draft_<date>_<agent>.py`
`security_review`	`knowledge/security/reviews/security_review_<date>.md`
`code_quality`	`knowledge/quality/reviews/quality_review_<date>.md`

Drafts are never auto-deployed — review before copying to skills/.

Messaging Bridges

Bridge	Config Flag	Status
Discord (`discord_bot.py`)	`discord_bot_enabled`	Active — routes `biged-fleetchat` to fleet
OpenClaw gateway	`openclaw_enabled`	Installed, disabled by default

Discord commands: /aider, /claude, /gemini, /local, /status, /task, /result, /help

Dual Supervisor Architecture

Supervisor	Responsibility	Loop Interval
`supervisor.py`	Process lifecycle: Ollama start/stop, worker respawn, training detection, Discord/OpenClaw	Continuous
`hw_supervisor.py`	Model health: keepalive (~240s), conductor check (~60s), VRAM/thermal scaling, model tier transitions	5s state writes

State file: hw_state.json — written by hw_supervisor every 5s, read by supervisor, workers, dashboard, and launcher. Contains: status, model, thermal, models_loaded, conductor status.

Maintenance Protocol

This document is updated alongside code changes:

Version bumps — Review all sections when completing a roadmap phase
New skill/module — Add to relevant section in the same commit
New failure mode — Add to troubleshooting table when discovered
Deployment changes — Update deployment section if install steps change
New CLI command — Add to CLI Reference when lead_client.py gains a subcommand
New dashboard endpoint — Add to Monitoring section when dashboard.py gains a route

Incident Response SOP (SOC 2 CC8 / GDPR Art. 33)

Severity Classification

Level	Description	Response Time	Examples
SEV-1	Data breach / secret exposure	< 1 hour	API key in task result, PII in knowledge/
SEV-2	Service degradation	< 4 hours	Ollama OOM, supervisor crash loop, thermal runaway
SEV-3	Operational issue	< 24 hours	Failed skill, stale tasks, agent quarantine
SEV-4	Cosmetic / minor	Next business day	UI glitch, stale docs, non-blocking warning

Response Procedure

DETECT — Watchdog alerts, DLP findings, dashboard alerts, operator report
TRIAGE — Classify severity (SEV-1 through SEV-4)
CONTAIN —
- SEV-1: lead_client.py broadcast "PAUSE" --channel fleet → all workers pause
- SEV-2: lead_client.py gdpr-erase <affected_agent> --confirm if data compromised
- Rotate affected API keys: lead_client.py dispatch secret_rotate '{"action":"rotate","key":"AFFECTED_KEY"}'
INVESTIGATE —
- Check audit log: GET /api/audit?type=dlp_alert&limit=50
- Generate debug report: "Report Issue" button or generate_debug_report()
- Review task history: GET /api/activity
REMEDIATE — Fix root cause, deploy patch, verify via smoke test
NOTIFY — GDPR Art. 33: supervisory authority within 72 hours if personal data breach
DOCUMENT — Add to data/resolutions.jsonl with fix_commit and regression_test

Breach Notification Template (GDPR Art. 33)

Subject: Data Breach Notification — BigEd CC Fleet
Date: [DATE]
Nature of breach: [DESCRIPTION]
Data subjects affected: [COUNT/SCOPE]
Likely consequences: [IMPACT]
Measures taken: [CONTAINMENT + REMEDIATION]
Contact: [DPO/OPERATOR]

FilesExpand file tree

OPERATIONS.md

Latest commit

History

OPERATIONS.md

File metadata and controls

BigEd CC — Operations Manual

Quick Start

CLI Reference

Fleet Status & Control

Task Management

Messaging

Cost Intelligence

Marathon / Training

Logs

Secrets

Platform Troubleshooting Matrix (PT-4)

Startup Issues

GPU / Hardware Issues

Networking Issues

Database Issues

Skill / Module Issues

Platform-Specific Edge Cases

Skill Authoring

Interface

Using LLM Providers

Network Requirements

Registration

Testing

Module Authoring

Interface Contract

Registration

Async DB Pattern

Deployment (Fresh Machine)

Windows (Current Production)

Linux

macOS

Hardware Sizing

VRAM Safety Rules

Operations

Start / Stop

Health Checks

Log Locations

Key Files (Runtime)

Operational Modes

New Systems (v0.051.05b – v0.053.01b)

FileSystemGuard (fleet/filesystem_guard.py)

Module Hub (BigEd/launcher/modules/hub.py)

Memory Optimizer Skill (fleet/skills/memory_optimizer.py)

Fleet Comm UX (v0.051.05b redesign)

Monitoring

Dashboard Endpoints

Monitoring Checklist

Backup & Recovery

Running a Backup

Recovery Procedures

Backup Schedule Recommendation

Issue Reporting

Generating a Debug Report

Reading a Debug Report

Submitting a Report

Privacy

Worker Roles Reference

Skill Output Locations

Messaging Bridges

Dual Supervisor Architecture

Maintenance Protocol

Incident Response SOP (SOC 2 CC8 / GDPR Art. 33)

Severity Classification

Response Procedure

Breach Notification Template (GDPR Art. 33)

FileSystemGuard (`fleet/filesystem_guard.py`)

Module Hub (`BigEd/launcher/modules/hub.py`)

Memory Optimizer Skill (`fleet/skills/memory_optimizer.py`)