Update: 1st place award at HackPrinceton 2026 for the Hardware x AI track!
Distributed LLM inference across consumer devices. Phones, tablets, Raspberry Pis, and laptops form a swarm that thinks together.
REVIVE turns everyday devices into a collective AI system. No cloud. No GPUs. Just the devices in the room.
The repo implements two complementary strategies for edge LLM inference, plus a shared build pipeline that feeds them. The full system design, contracts between modules, and deployment flow are in ARCHITECTURE.md.
Strategy 1 — Mixture-of-Agents (MoA) (revive/, android/, rpi/, macos/)
Each device runs a complete small model with a specialized role — Reasoner, Writer, Critic, Factchecker, etc. A coordinator fans queries out in parallel; an aggregator synthesizes the best answer. Only text (kilobytes) crosses the network, so consumer WiFi jitter is tolerated and any device can drop out without breaking the swarm.
Strategy 2 — True Pipeline-Parallel Distribution (true-distribution/)
One Qwen3 model with its transformer layers physically split across multiple devices. Hidden states travel over HTTP; each worker owns only a slice of the weights. A $10 Arduino Uno runs the cluster's management plane (BMC) — it owns the authoritative partition table and refuses to let the coordinator route through a node it has declared dead. Verified token-for-token correct against the reference HuggingFace model. Details in true-distribution/README.md.
Build pipeline (LLM/)
Offline pipeline that takes Qwen3 bases and produces role-specialized, device-tiered GGUFs: expanded data generation (Haiku + local Qwen3-4B distillation), QLoRA fine-tuning, ShortGPT layer pruning, imatrix-calibrated K-quant export at Q2_K / Q3_K_S / Q4_K_M / Q5_K_M. Emits a full role × device-tier matrix with a SHA-256 manifest.json. Currently feeds Strategy 1; Strategy 2 consumes fp16 weights directly from HuggingFace.
Built at HackPrinceton Spring 2026.
- Project Status
- How It Works
- Architecture
- Supported Platforms
- Models
- Quick Start
- The Agent System
- Query Routing
- MoA Aggregation
- Web Dashboard
- API Reference
- Training & Fine-tuning
- Protocol Specification
- Project Structure
- Performance
- FAQ
MoA Runtime — production:
- iOS/iPadOS app with Worker and Coordinator modes, in-process llama.cpp via Metal (
revive/) - Android app with native JNI llama.cpp, foreground service, Android NSD discovery (
android/) - Raspberry Pi mesh aggregator with systemd auto-start (
rpi/) - MacBook CLI (Metal on Apple Silicon, CPU on Intel) with Worker + interactive Coordinator modes (
macos/) - Two web dashboards: PWA bundled with the iOS coordinator (
revive/Coordinator/WebDashboard/) and a standalone mDNS aggregator UI (dashboard/) - mDNS auto-discovery and OpenAI-compatible
POST /v1/chat/completionson every platform
True Pipeline-Parallel Distribution — working end-to-end (true-distribution/):
- Layer-sliced Qwen3 loading with per-slice KV cache
- N-stage HTTP ring with a custom binary wire protocol (fp16 hidden states, ~2 KB/token)
- Heterogeneous layer partitioner (PipeEdge-DP + greedy fallback) that minimizes the slowest pipeline stage
- Arduino Uno BMC firmware (
arduino/revive_bmc.ino) plus a Python simulator (bmc_sim.py) that speaks the identical line protocol over TCP - Multi-BMC HA controller with leader election and replica failover
- Live chaos-engineering dashboard: kill/heal buttons, BMC event log, per-stage latency bars, token streaming
- Benchmark CLI reporting tok/s, prefill vs decode, per-stage latency, wire bandwidth
- 11 self-contained tests (correctness verified token-for-token against single-model HuggingFace reference under greedy sampling)
LLM Build Pipeline — functional end-to-end (LLM/):
- Expanded dataset generation: Claude Haiku + local Qwen3-4B teacher distillation → ~1500 examples/role (
LLM/data/) - Per-role QLoRA fine-tuning for all 8 roles on Qwen3-0.6B and Qwen3-1.7B bases (
LLM/train/) - Role-aware ShortGPT block-importance layer pruning (
LLM/prune/) - imatrix-calibrated K-quant export at Q2_K / Q3_K_S / Q4_K_M / Q5_K_M (
LLM/quantize/) - Role × device-tier matrix export producing up to 32 GGUFs with a SHA-256
manifest.json(LLM/export/) - Per-role evaluation harness against a Qwen3-4B teacher (
LLM/eval/) - AWS EC2 spot-instance launcher that trains end-to-end and uploads to S3 (
LLM/scripts/aws_*.sh)
Research & probes:
llamacpp-stages/tests/probe_embd.py— gate test determining whetherllama-cpp-python'sbatch.embdinput path bypasses only the embedding lookup (needed for a future llama.cpp-backed pipeline-parallel path).
Documentation:
- ARCHITECTURE.md — system-level design of MoA + LLM build pipeline, module contracts, deployment flow, glossary
- PROTOCOL.md — mDNS service type, TXT record fields, HTTP API contract, default ports
- LLM/README.md — build-pipeline internals and quick start
- true-distribution/README.md — pipeline-parallel architecture, BMC protocol, chaos-demo walkthrough, benchmark numbers
On the roadmap:
- MoA (ARCHITECTURE.md §6): on-device tier auto-selection from
manifest.json; heterogeneous prompt formats; swarm-level speculative decoding; modular LoRA adapter distribution. - True Distribution (true-distribution/README.md §Status): pipeline overlap for long-prompt prefill (Jupiter intra-sequence PP); iOS port (Swift worker wrapping a forked llama.cpp with a
llama_decode_layers(start, end)API); GGUF Q4_K_M quantization for 4× memory + bandwidth savings.
This section describes Strategy 1 (MoA). The pipeline-parallel flow is documented separately in true-distribution/README.md.
You type a question
│
▼
┌─────────────────────┐
│ Coordinator (iPad │ ◄── Classifies query type
│ or Raspberry Pi) │ (simple fact? code? creative?)
└────────┬────────────┘
│ Fan-out to matching roles
┌────┼────┬────┬────┐
▼ ▼ ▼ ▼ ▼
📱 📱 📱 📱 💻
iPhone iPhone Pixel Pi MacBook
Reasoner Writer Critic Fact Drafter
│ │ │ │ │
└────┴────┴────┴────┘
│ Collect responses
▼
┌─────────────────────┐
│ Aggregator Model │ ◄── Synthesizes best answer
│ (runs on coordinator)│ from all agents
└─────────────────────┘
│
▼
Final answer
- Discovery: Devices find each other automatically via Bonjour/mDNS on the local WiFi network. Zero configuration.
- Classification: A fast "Spotter" agent classifies the query (complex reasoning, code, math, creative, etc.) and determines which agent roles are needed.
- Fan-out: The coordinator sends the query to all relevant workers in parallel, each running with its specialized system prompt.
- Inference: Each device runs a complete GGUF model locally via llama.cpp. iPhones use Metal GPU acceleration, Android uses ARM NEON, MacBooks use Metal, Raspberry Pis use ARM NEON.
- Aggregation: The coordinator's aggregator model synthesizes the best answer from all agent responses — preferring the Factchecker for accuracy, the Writer for clarity, and the Reasoner for logical structure.
The key insight: many small models with different perspectives, running in parallel on separate devices, produce better answers than any single small model alone. This is the Mixture-of-Agents pattern applied to edge devices.
The sections below describe Strategy 1 (MoA). For the pipeline-parallel internals — layer partitioning, hidden-state wire format, Arduino BMC control plane, chaos-engineering demo — see true-distribution/README.md.
The two strategies answer different questions. Both are implemented in this repo.
MoA (revive/, android/, rpi/, macos/) |
Pipeline-parallel (true-distribution/) |
|
|---|---|---|
| What's distributed | Different models on different devices | Different layers of one model on different devices |
| Wire payload | Text (kilobytes per query) | fp16 hidden states (~2 KB per token, per hop) |
| Failure mode | Graceful degrade — missing role omitted from synthesis | Hard stop — BMC refuses to route through a dead node |
| WiFi sensitivity | Insensitive to jitter (text is small, work is parallel) | Sensitive to per-hop latency (serial dependency between stages) |
| Best for | Many moderate devices answering in parallel with different perspectives | Running one bigger model than any single device could host |
| Ensemble method | Text-level MoA synthesis | None — a single coherent forward pass |
The MoA choice below is optimized for consumer WiFi's 5–50ms latency with unpredictable jitter: each device runs a complete small model independently, only the text outputs (kilobytes) travel over the network, and if a phone overheats or drops off the swarm keeps working. true-distribution/ makes the opposite trade-off and proves it works — the measured bandwidth on a 3-worker Qwen3-0.6B ring is 58 KB/s, 214× under 802.11n WiFi capacity. Pipeline parallelism is compute-bound, not network-bound, once hidden states are fp16.
| Component | Role | Runs On |
|---|---|---|
| Worker | Loads a GGUF model, serves inference via HTTP, advertises via mDNS | Any device |
| Coordinator | Discovers workers, classifies queries, routes to agents, aggregates responses | iPad, Pi, or MacBook |
| Aggregator Model | Fine-tuned to synthesize multi-agent responses into a single best answer | Coordinator device |
| Spotter | Fast classification model that determines query type and needed roles | Any worker |
| Web Dashboard | Browser-based real-time visualization of the swarm | Served by coordinator |
- Discovery: Bonjour/mDNS (
_revive._tcpservice type) - Transport: HTTP/1.1 REST
- API Format: OpenAI-compatible
/v1/chat/completions - Serialization: JSON
- Scope: Local WiFi network (same subnet)
| Platform | Build Method | Hardware Acceleration | Status |
|---|---|---|---|
| iPhone / iPad | Xcode project (xcframework) | Metal GPU + Accelerate + NEON | Production |
| Android | Gradle + NDK (Kotlin/Compose) | ARM NEON (CPU), optionally Vulkan | Production |
| Raspberry Pi | Python + cmake (llama-server) | ARM NEON | Production |
| MacBook | Python + cmake (llama-server) | Metal GPU (Apple Silicon) | Production |
All four platforms speak the same protocol, discover each other via mDNS, and expose the same HTTP API. Any combination works together on the same WiFi network.
REVIVE uses the Qwen3 model family, fine-tuned per role and quantized to GGUF. Qwen3 was chosen because:
- The 1.7B model matches previous-generation 3B quality
- Dual thinking/non-thinking mode maps naturally to role specialization
- GQA (grouped-query attention) halves KV cache memory — critical on phones
- Native ChatML format matches our prompt structure
The LLM/ pipeline produces role-specialized models at four compression levels matched to device RAM:
| Tier | RAM | Quant | Pruning | Example devices |
|---|---|---|---|---|
ewaste |
1–3 GB | Q2_K | Aggressive (−40% layers) | iPhone 6s/7, Pi 3, 3GB Android |
budget |
3–4 GB | Q3_K_S | Moderate (−25% layers) | iPhone 8/X, Pi 4 4GB, 4GB Android |
standard |
4–6 GB | Q4_K_M | None | iPhone 12/13/14, Pi 5 8GB, 6GB Android |
modern |
6 GB+ | Q5_K_M | None | iPhone 15 Pro, Pixel 8, MacBook M-series |
Pruning only applies to simple roles (Spotter: −40%, Drafter/Concise: −25%). Reasoning-heavy roles (Reasoner, Critic, Factchecker, Aggregator) are never pruned.
revive-{role}-qwen3-{size}-{tier}-{quant}.gguf
Examples:
revive-reasoner-qwen3-1.7b-modern-Q5_K_M.gguf ← iPhone 15 Pro
revive-writer-qwen3-1.7b-standard-Q4_K_M.gguf ← iPhone 13/14
revive-spotter-qwen3-0.6b-budget-Q3_K_S.gguf ← Pi 4 4GB
revive-drafter-qwen3-0.6b-ewaste-Q2_K.gguf ← iPhone 7
A manifest.json is emitted alongside all GGUFs listing role, tier, quant, size in bytes, and SHA-256 for every file.
Models are loaded from local device storage. There is no cloud dependency at inference time.
Prerequisites: Xcode 15+, macOS, a clone of llama.cpp
# 1. Clone the repo
git clone https://github.com/your-org/revive.git
cd revive
# 2. Build llama.xcframework (one-time, 15-20 min)
cd /path/to/llama.cpp
bash build-xcframework.sh
# 3. Generate Xcode project
brew install xcodegen # if needed
xcodegen generate
# 4. Open in Xcode
open revive.xcodeproj
# Set your signing team, build and run on deviceOn the device:
- Transfer a
.ggufmodel file to the app's Documents folder (via Finder, Files app, or AirDrop) - Launch REVIVE
- Choose Worker or Coordinator mode:
- Worker: Select a role (Reasoner, Writer, etc.), pick the model file, tap Start
- Coordinator: Auto-discovers workers on the same WiFi, loads aggregator model from Documents
Worker mode starts an HTTP server on port 50001, advertises via Bonjour, and shows live metrics (tok/s, thermal state, battery, request count). The device stays awake via a silent audio session.
Coordinator mode shows a SwiftUI dashboard with worker cards, a chat interface, and mode toggles (Swarm vs Speed). It also serves a web dashboard on port 8080.
Prerequisites: Android Studio, NDK, a clone of llama.cpp next to the project
# 1. Clone llama.cpp alongside the project
git clone --depth 1 https://github.com/ggerganov/llama.cpp
# 2. Open android/ in Android Studio
# 3. Build and run on device (arm64)On the device:
- Place a
.ggufmodel in the app's external files directory (Android/data/com.revive.worker/files/models/) - Launch REVIVE
- Select a model, pick a role, set the port, tap START WORKER
- The app starts a foreground service (persists when backgrounded), advertises via Android NSD, and serves HTTP inference
The Android app uses JNI to call llama.cpp natively — no Termux or Python required.
Termux fallback (for devices where you can't build the app):
# In Termux
cd ~/revive/android
ROLE=drafter PORT=8080 bash setup.sh
bash ~/start_worker.shPrerequisites: Raspberry Pi 4 or 5, Raspberry Pi OS (64-bit), WiFi or Ethernet on the same LAN
# One-command setup
git clone https://github.com/your-org/revive.git
cd revive
bash rpi/setup.shThis installs all dependencies, builds llama.cpp, downloads the appropriate model (auto-selected by available RAM), and creates a systemd service.
# Start the mesh aggregator
sudo systemctl start revive-aggregator
# View logs
journalctl -u revive-aggregator -f
# Access the web dashboard
# Open http://<pi-ip>:9090 in any browserThe Raspberry Pi runs as the mesh aggregator — the central hub of the swarm. It discovers all workers on the LAN via mDNS, routes queries to the right agents, and synthesizes responses using a local aggregator model. The systemd service auto-starts on boot and restarts on failure.
Prerequisites: macOS, Homebrew, Python 3
# One-command setup
cd revive
bash macos/setup.shThis builds llama.cpp with Metal (auto-detected on Apple Silicon), downloads the appropriate model, and creates launch scripts.
# Worker mode (joins the swarm as a specific role)
bash macos/start-worker.sh reasoner
# Coordinator mode (interactive REPL)
bash macos/start-coordinator.shCoordinator interactive mode:
REVIVE Swarm CLI — type your query (Ctrl+C to exit)
> What causes the seasons on Earth?
[Querying swarm...]
[Reasoner] The seasons are caused by Earth's 23.5° axial tilt...
[Critic] A common misconception is that seasons are caused by distance from the Sun...
[Factchecker] Earth's orbital eccentricity is only 0.017, meaning distance varies by ~3.4%...
[SWARM] Earth's seasons are caused by its 23.5° axial tilt relative to its orbital plane...
(3 agents, 4200ms, COMPLEX_REASONING)
> status
Workers: 4
reasoner | ios | 192.168.1.20:50001 | idle
writer | ios | 192.168.1.21:50001 | idle
critic | android | 192.168.1.30:8080 | idle
drafter | macos | 192.168.1.40:8080 | idle
The MacBook CLI auto-detects Apple Silicon and uses Metal GPU acceleration (99 GPU layers offloaded). On Intel Macs, it falls back to CPU.
REVIVE defines 8 agent roles, each with a distinct personality and system prompt:
| Role | Purpose | System Prompt Summary | Recommended Model |
|---|---|---|---|
| Reasoner | Step-by-step logical analysis | "Think step by step. Show your reasoning chain explicitly." | Qwen3-1.7B |
| Writer | Clear, engaging prose | "Write clear, well-structured, engaging responses." | Qwen3-1.7B |
| Concise | Maximum brevity | "Answer in as few words as possible while being complete." | Qwen3-0.6B |
| Critic | Devil's advocate | "Identify flaws, edge cases, counterarguments." | Qwen3-1.7B |
| Factchecker | Verification focus | "Focus on verifiable, accurate information. Flag uncertainty." | Qwen3-1.7B |
| Drafter | Fast first-pass | "Produce a fast first-pass answer. Speed over polish." | Qwen3-0.6B |
| Spotter | Query classification | "Classify the query into one category." | Qwen3-0.6B |
| Aggregator | MoA synthesis | "Synthesize the single best answer from multiple agents." | Fine-tuned Qwen3-1.7B |
Each worker runs exactly one role. The coordinator assigns queries to the appropriate subset of roles based on the query type.
Each worker has a dynamic weight score used for aggregation priority:
weight = min(tokens_per_second / 40, 1.0) × thermal_penalty
Thermal penalties:
- Nominal/Fair: 1.0x
- Serious: 0.5x
- Critical: 0.1x
Higher-weight workers' responses are given more consideration during aggregation.
When a query arrives, the coordinator follows this flow:
The Spotter agent (a fast 0.6B model) classifies the query into one of six types:
| Query Type | Example | Target Roles |
|---|---|---|
SIMPLE_FACT |
"What is the capital of Australia?" | Reasoner, Concise |
COMPLEX_REASONING |
"Should AI be granted legal personhood?" | Reasoner, Writer, Critic, Factchecker, Drafter |
CREATIVE |
"Write a poem about distributed computing" | Writer, Reasoner, Critic |
CODE |
"Explain the CAP theorem" | Reasoner, Factchecker, Critic |
MATH |
"Explain the Monty Hall problem" | Reasoner, Factchecker, Concise |
OPINION |
"Is nuclear power necessary for climate change?" | Writer, Critic, Reasoner |
The coordinator queries all matching workers simultaneously using Swift's TaskGroup (iOS) or asyncio.gather (Pi/Mac). Timeouts are adjusted by complexity:
- Simple facts: 6 second timeout
- Complex queries: 12 second timeout
- Speed mode: 8 second timeout
If the query type needs aggregation and multiple agents responded, the aggregator model synthesizes a final answer. Otherwise, the best single response is returned directly.
Speed mode bypasses full MoA aggregation. It pairs a Drafter (fast speculation) with the highest-weight Verifier (usually a Reasoner). This trades answer quality for lower latency — useful for simple questions or when the swarm is small.
The aggregator uses a ChatML prompt to synthesize responses:
<|im_start|>system
You are the Aggregator of a distributed AI swarm. You receive multiple
responses from specialized agents and synthesize the single best answer
by taking the strongest elements from each, resolving contradictions
(prefer Factchecker and Critic over Drafter), using the Writer's clarity,
and the Reasoner's logic. Output only the final synthesized answer.<|im_end|>
<|im_start|>user
Original question: {query}
Agent responses:
[The Reasoner — qwen3-1.7b-q4_k_m]:
{reasoner_response}
---
[The Critic — qwen3-1.7b-q4_k_m]:
{critic_response}
---
[The Writer — qwen3-0.6b-q4_k_m]:
{writer_response}
Synthesize the single best answer:<|im_end|>
<|im_start|>assistant
The aggregator is fine-tuned specifically for this task using QLoRA on 2000+ synthetic examples (see Training).
The coordinator serves a browser-based dashboard accessible from any device on the LAN.
iOS coordinator: http://<ipad-ip>:8080
Raspberry Pi: http://<pi-ip>:9090
- Live swarm topology canvas — coordinator at center, workers in orbit, animated particles during inference
- Agent cards — role, model, tok/s, status, device metrics, weight bar
- Chat interface — submit queries, see individual agent responses + synthesized answer
- Mode toggle — switch between Swarm and Speed mode
- Impact calculator — estimates power savings if N% of the world's 5.3B phones joined the swarm vs. traditional GPU clusters
- PWA-enabled — add to home screen on mobile devices
- Auto-refresh — polls swarm state every 1.5 seconds
Dark cyberpunk theme (#0a0a0a background, #00ff88 primary green). Canvas renders topology with glowing nodes, pulsing rings during generation, and particle effects showing data flow from workers to coordinator. Each agent role has a distinct color that carries through the entire UI.
All REVIVE nodes expose the same base API. The aggregator adds swarm-specific endpoints.
{"status": "ok", "role": "reasoner", "platform": "ios", "uptime": 3600}OpenAI-compatible inference endpoint.
Request:
{
"model": "qwen3-1.7b-q4_k_m",
"messages": [
{"role": "system", "content": "You are a rigorous analytical thinker."},
{"role": "user", "content": "What causes the seasons?"}
],
"max_tokens": 150,
"temperature": 0.7,
"stream": false
}Response:
{
"choices": [{"message": {"role": "assistant", "content": "..."}}],
"metrics": {
"tokens_generated": 42,
"tokens_per_second": 18.5,
"time_to_first_token_ms": 120,
"total_time_ms": 2300,
"thermal_state": "nominal",
"battery_percent": 85,
"memory_used_mb": 512
}
}Submit a query to the full swarm.
{
"query": "What are the implications of quantum computing on encryption?",
"mode": "swarm",
"timeout_seconds": 12
}Response:
{
"query": "...",
"query_type": "COMPLEX_REASONING",
"agent_responses": [
{"role": "reasoner", "model": "qwen3-1.7b", "content": "..."},
{"role": "critic", "model": "qwen3-0.6b", "content": "..."}
],
"final_answer": "Synthesized answer...",
"mode": "swarm",
"total_time_ms": 4500,
"agents_responded": 3
}Full swarm topology.
{
"aggregator": {"host": "raspberrypi", "platform": "rpi", "uptime": 3600},
"workers": [
{"name": "REVIVE-reasoner-iPhone", "role": "reasoner", "host": "192.168.1.20", "port": 50001, "status": "idle", "platform": "ios", "metrics": {...}}
],
"total_workers": 4,
"total_tps": 72.3
}Manual worker registration for devices that can't do mDNS.
{
"host": "192.168.1.42",
"port": 8080,
"role": "reasoner",
"model": "qwen3-1.7b-q4_k_m",
"platform": "android",
"ram_mb": 6144
}Returns full coordinator state including workers, chat messages, mode, and query status.
Triggers a query from the web dashboard:
{"query": "...", "mode": "swarm"}Switch between swarm and speed mode:
{"mode": "speed"}REVIVE includes a complete fine-tuning pipeline that produces role-specialized, device-tiered GGUF models from Qwen3 base weights.
The LLM/ directory is the current pipeline. It produces a full tier matrix — up to 32 GGUFs covering every role × device tier combination.
Data generation
├── Claude Haiku → 750 examples/role (diverse queries, ~$2 total)
└── Qwen3-4B teacher (local) → 750 examples/role (higher-signal distillation)
↓ merge_datasets.py → ~1500 examples/role
↓
QLoRA fine-tune (Unsloth, r=32, 3 epochs, A10G ~30min/role)
↓ merged 16-bit HF checkpoint
↓
Role-aware pruning (ShortGPT block importance)
Spotter: −40% layers
Drafter/Concise: −25% layers
Reasoner/Writer/Critic/Factchecker/Aggregator: no pruning
↓
imatrix-calibrated quantization per device tier
ewaste → Q2_K │ budget → Q3_K_S
standard→ Q4_K_M │ modern → Q5_K_M
↓
LLM/output/gguf/revive-{role}-qwen3-{size}-{tier}-{quant}.gguf
+ manifest.json
Quick start:
cd LLM
pip install -r requirements.txt
# Minimum viable ship: standard tier only (~6 hrs on g5.xlarge)
ANTHROPIC_API_KEY=sk-... bash scripts/bootstrap.sh
# Full tier matrix: all 32 GGUFs
bash scripts/compress_all.sh
# Validate all roles hit their quality bar
python3 -m LLM.eval.eval_role --allTraining configuration:
| Aggregator | Large roles | Small roles | |
|---|---|---|---|
| Base model | Qwen3-1.7B | Qwen3-1.7B | Qwen3-0.6B |
| LoRA rank | 32 | 32 | 32 |
| Sequence length | 4096 | 2048 | 1024 |
| Epochs | 3 | 3 | 3 |
| Learning rate | 2e-4 | 2e-4 | 2e-4 |
The recommended way to run training is on an EC2 g5.xlarge (A10G 24GB, ~$0.30/hr spot). Three scripts handle the full lifecycle:
Prerequisites: AWS CLI configured (aws configure), an AWS account with EC2 + S3 + IAM permissions.
export ANTHROPIC_API_KEY=sk-...
export S3_BUCKET=my-revive-models # will be created if it doesn't exist
export AWS_REGION=us-east-1 # optional, defaults to us-east-1
# Preview config without launching
bash LLM/scripts/aws_launch.sh --dry-run
# Launch spot instance (~$1.50–6 total, ~6 hrs)
bash LLM/scripts/aws_launch.shThe instance:
- Installs all dependencies (Unsloth, llama.cpp with CUDA, Python packages)
- Downloads the Qwen3-4B teacher GGUF from HuggingFace
- Runs
bootstrap.sh(data generation → training → standard tier export) - Uploads all GGUFs +
manifest.jsontos3://{bucket}/revive-models/ - Self-terminates when done
# When training completes (~6 hrs), pull models locally
S3_BUCKET=my-revive-models bash LLM/scripts/aws_pull.sh
# → syncs to LLM/output/gguf/Monitor progress:
aws ec2 get-console-output --instance-id <id> --region us-east-1Estimated cost: $1.50–6 on spot (g5.xlarge ~$0.30/hr). On-demand fallback: ~$6 if spot capacity is unavailable.
training/ is the original single-tier pipeline (300 examples/role, Q4_K_M only). It stays as a reference and fallback. The LLM/ pipeline imports its prompt banks without modifying it.
# Generate data + train all roles (single Q4_K_M tier)
ANTHROPIC_API_KEY=sk-... bash training/train_all.shSee PROTOCOL.md for the full cross-platform protocol specification, including:
- mDNS service type and TXT record fields
- HTTP API contract for all endpoints
- Port conventions per platform
- Model recommendations per device class
All REVIVE nodes advertise as _revive._tcp with these TXT record fields:
| Key | Description |
|---|---|
role |
Agent role (reasoner, writer, etc.) |
model |
Model filename |
ram |
Device RAM in MB |
port |
HTTP server port |
platform |
ios, android, rpi, macos |
caps |
Hardware capabilities (metal, neon, vulkan, cpu) |
| Platform | Port |
|---|---|
| iOS worker | 50001 |
| iOS web dashboard | 8080 |
| Android worker | 8080 |
| Raspberry Pi aggregator | 9090 |
| MacBook worker/coordinator | 8080 |
revive/
├── LLM/ # Model build pipeline (offline, GPU workstation)
│ ├── train/
│ │ ├── train_qwen_role.py # QLoRA fine-tune per role (Unsloth)
│ │ └── train_all.sh # Train all 8 roles sequentially
│ ├── data/
│ │ ├── generate_expanded_dataset.py # 750 examples/role via Claude Haiku
│ │ ├── distill_from_qwen4b.py # 750 examples/role via local teacher
│ │ ├── merge_datasets.py # Dedup + merge → ~1500/role
│ │ ├── build_calibration.py # Seed imatrix calibration prompts
│ │ └── calibration_prompts/ # Per-role calibration text files
│ ├── prune/
│ │ ├── layer_prune.py # ShortGPT block importance pruning
│ │ ├── heal_lora.py # Optional LoRA heal after pruning
│ │ └── prune_profiles.yaml # Per-role drop fractions
│ ├── quantize/
│ │ ├── imatrix_gen.py # Role-specific importance matrix
│ │ └── quantize_tiers.py # Q2_K/Q3_K_S/Q4_K_M/Q5_K_M export
│ ├── export/
│ │ ├── export_tier_matrix.py # Orchestrates full role × tier matrix
│ │ ├── export_single.sh # Single role/tier export
│ │ └── manifest.py # manifest.json writer
│ ├── eval/
│ │ ├── eval_role.py # Per-role quality metrics vs teacher
│ │ └── eval_tier.py # Sweep quant/prune variants for a role
│ ├── common/
│ │ ├── role_registry.py # Role → base model + seq_len mapping
│ │ ├── device_tiers.yaml # Tier → quant + prune + RAM envelope
│ │ └── gguf_io.py # llama.cpp subprocess helpers
│ ├── scripts/
│ │ ├── bootstrap.sh # End-to-end: data → train → standard tier
│ │ ├── compress_all.sh # Full tier matrix export
│ │ ├── quick_test.sh # Smoke test binaries + spotter export
│ │ ├── aws_launch.sh # Launch EC2 spot instance for training
│ │ ├── aws_user_data.sh # EC2 boot script (installs, trains, uploads)
│ │ └── aws_pull.sh # Sync S3 artifacts → local gguf/
│ └── requirements.txt
│
├── training/ # Legacy single-tier pipeline (reference)
│ ├── generate_dataset.py # Aggregator training data (Claude Haiku)
│ ├── generate_role_dataset.py # Per-role training data
│ ├── train.py # QLoRA aggregator training
│ ├── train_role.py # QLoRA per-role training
│ └── train_all.sh # Full legacy pipeline
│
├── revive/ # iOS app (Swift/SwiftUI)
│ ├── reviveApp.swift # Entry point, mode selection
│ ├── Info.plist # Bonjour permissions, background modes
│ ├── revive.entitlements # Network access (sandbox disabled)
│ ├── Coordinator/
│ │ ├── CoordinatorView.swift # Coordinator UI (swarm overview, chat, metrics)
│ │ ├── CoordinatorWebServer.swift # HTTP server for web dashboard (port 8080)
│ │ ├── SwarmManager.swift # Bonjour discovery, parallel fan-out
│ │ ├── QueryRouter.swift # Query classification + routing
│ │ ├── Aggregator.swift # MoA synthesis on local model
│ │ └── WebDashboard/ # Browser-accessible dashboard
│ │ ├── index.html # PWA-enabled dashboard
│ │ ├── style.css # Cyberpunk dark theme
│ │ └── dashboard.js # Canvas topology, realtime polling, chat
│ ├── Worker/
│ │ ├── WorkerView.swift # Worker setup + telemetry UI
│ │ └── InferenceHandler.swift # HTTP → llama inference bridge
│ └── Shared/
│ ├── Models.swift # AgentRole, QueryType, WorkerInfo, etc.
│ ├── BonjourService.swift # mDNS advertisement + discovery
│ ├── HTTPServer.swift # Minimal HTTP/1.1 server (Network.framework)
│ ├── LibLlama.swift # Swift binding to llama.cpp
│ ├── Metrics.swift # Device telemetry (thermal, battery, memory)
│ └── Extensions.swift # Hex color parser
│
├── android/ # Android app (Kotlin/Jetpack Compose)
│ ├── app/
│ │ ├── build.gradle.kts # Gradle config with NDK + Compose
│ │ └── src/main/
│ │ ├── AndroidManifest.xml # Permissions, foreground service
│ │ ├── cpp/
│ │ │ ├── CMakeLists.txt # Builds llama.cpp via NDK
│ │ │ └── native-lib.cpp # JNI bridge (tokenize, inference)
│ │ ├── java/com/revive/worker/
│ │ │ ├── MainActivity.kt # Compose UI (model picker, role selector)
│ │ │ ├── LlamaJNI.kt # JNI bindings + LlamaEngine wrapper
│ │ │ ├── HttpServer.kt # Ktor server (/health, /v1/chat/completions)
│ │ │ ├── NsdAdvertiser.kt # Android NSD (native mDNS)
│ │ │ ├── WorkerService.kt # Foreground service for background inference
│ │ │ └── DeviceMetrics.kt # Battery, thermal, memory
│ │ └── res/values/themes.xml
│ ├── build.gradle.kts
│ ├── settings.gradle.kts
│ ├── setup.sh # Termux fallback setup
│ └── advertise.py # Termux mDNS advertiser
│
├── rpi/ # Raspberry Pi mesh aggregator
│ ├── aggregator.py # Main coordinator: discovery, routing, MoA, web UI
│ ├── discovery.py # SwarmDiscovery + ServiceAdvertiser (shared)
│ ├── worker.py # llama-server subprocess manager
│ ├── metrics.py # CPU temp, memory, thermal state
│ ├── setup.sh # One-command install + systemd service
│ └── requirements.txt # Python dependencies
│
├── macos/ # MacBook CLI
│ ├── revive-cli.py # Worker or coordinator mode, Metal GPU
│ └── setup.sh # Build llama.cpp, download model
│
├── true-distribution/ # Pipeline-parallel runtime (one model, layers split)
│ ├── pipeline/
│ │ ├── worker.py # Pipeline stage: layer-sliced forward + per-stage KV cache
│ │ ├── coordinator.py # Ring driver: streams tokens, respects BMC's view
│ │ ├── partitioner.py # Greedy + PipeEdge-DP layer assignment
│ │ ├── protocol.py # Tensor wire format (coordinator ↔ worker)
│ │ ├── bmc_protocol.py # Host ↔ Arduino line protocol (v2)
│ │ ├── bmc_sim.py # Python BMC firmware simulator (TCP)
│ │ ├── controller.py # Host-side BMC client (serial or TCP)
│ │ ├── multi_bmc.py # Multi-BMC HA (leader election, replica failover)
│ │ └── net_utils.py
│ ├── arduino/
│ │ ├── revive_bmc.ino # Production BMC firmware for Arduino Uno
│ │ └── README.md # How to flash and verify
│ ├── dashboard/
│ │ ├── server.py # Embeds BMC sim + controller + coordinator
│ │ └── index.html # Topology, chaos buttons, per-stage latency, token stream
│ ├── scripts/
│ │ ├── launch_cluster.sh # Full stack: workers + dashboard (SERIAL= for Arduino)
│ │ ├── launch_local.sh # Spawn N workers for ad-hoc dev
│ │ ├── stop_local.sh # Kill workers
│ │ ├── benchmark.py # Throughput + per-stage latency + bandwidth report
│ │ └── demo_full.py # CLI chaos-engineering demo
│ ├── tests/ # 11 self-contained test runners (no pytest needed)
│ └── README.md # Pipeline-parallel architecture + BMC protocol
│
├── dashboard/ # Standalone mDNS swarm dashboard (runs independently)
│ ├── server.py # aiohttp server with Bonjour discovery, SSE streaming
│ └── index.html # Live worker list + chat
│
├── llamacpp-stages/ # Research: llama.cpp slicing probes
│ └── tests/probe_embd.py # Gate test for batch.embd bypass path
│
├── ARCHITECTURE.md # System design (MoA + LLM build pipeline)
├── PROTOCOL.md # Cross-platform mDNS/HTTP protocol spec
├── project.yml # XcodeGen project config
└── setup.sh # iOS one-time setup (build xcframework)
Measured with Qwen3-1.7B Q4_K_M, 150-token generation:
| Device | tok/s | 150-token time | Hardware |
|---|---|---|---|
| MacBook M2 Pro | 60-80 | 2-3s | Metal GPU |
| iPhone 15 Pro | 25-35 | 4-6s | Metal GPU (A17 Pro) |
| iPhone 14 | 15-20 | 7-10s | Metal GPU (A15) |
| Pixel 8 | 8-15 | 10-18s | ARM NEON (CPU) |
| Raspberry Pi 5 (8GB) | 3-6 | 25-50s | ARM NEON |
With the MoA pattern and 4 workers, total swarm throughput is the sum of individual device speeds — the queries run in parallel. A swarm of 4 iPhones produces ~100 tok/s aggregate throughput.
| Query Type | Agents Used | Typical Latency |
|---|---|---|
| Simple fact | 2 (reasoner + concise) | 3-6s |
| Complex reasoning | 5 (all roles) | 6-12s |
| Code | 3 (reasoner + factchecker + critic) | 5-10s |
| Creative | 3 (writer + reasoner + critic) | 5-10s |
Latency is bounded by the slowest responding agent within the timeout window, not the sum. Faster devices finish first; the coordinator starts aggregation as soon as all responses arrive (or the timeout is reached).
Each phone draws ~2-3W during inference. A swarm of 10 phones uses 20-30W total — compared to 300W+ for a single A100 GPU, while providing comparable throughput for small model inference.
Q: Do all devices need to be on the same WiFi network?
Yes. REVIVE uses Bonjour/mDNS for discovery, which works within a single broadcast domain (same WiFi network or LAN). Devices on different networks won't find each other. You can use manual registration (POST /v1/swarm/register) to add devices across networks if you handle routing.
Q: What happens if a device drops out mid-query? The coordinator uses timeouts. If a worker doesn't respond within the timeout (6-12 seconds depending on query type), its response is simply omitted. The aggregator synthesizes from whatever responses arrived.
Q: Can I use different models on different devices? Yes. Each worker loads its own GGUF model independently. The coordinator doesn't care which model a worker runs — it only cares about the role. A Reasoner running Qwen3-1.7B and a Reasoner running Llama-3.2-1B will both be queried.
Q: Does the coordinator need a model? The coordinator loads an aggregator model for MoA synthesis. Without it, it falls back to returning the highest-weight worker's response directly. For best results, use the fine-tuned aggregator GGUF.
Q: Can I use this without fine-tuning? Yes. The base Qwen3 models work well with the role-based system prompts. Fine-tuning improves aggregation quality by ~15-20% (based on our evaluations) but is not required to get started.
Q: How many devices can join the swarm? There's no hard limit. Bonjour can discover hundreds of services. The coordinator queries workers in parallel using async task groups. Practically, 3-8 devices is the sweet spot — enough for diverse perspectives without excessive aggregation latency.
Q: Can the Raspberry Pi be a worker instead of the aggregator?
Yes. Set REVIVE_PORT=8080 and change the role in the systemd service. But the Pi's low inference speed (3-6 tok/s) makes it more useful as a coordinator that delegates to faster phones.
Q: Does this work offline? Yes, completely. All inference runs locally on each device. The only network dependency is device-to-device communication over local WiFi. No internet connection is needed after the initial model download.
MIT