Your AI. Your hardware. Your rules.
A fully self-contained, private AI stack running on the NVIDIA DGX Spark (128GB unified memory, GB10 Superchip, ~$4,000–$5,000 as of March 2026).
No cloud. No API keys. No rate limits. No surveillance. No subscriptions. No data leaving your machine. Ever.
Brain serves a standard OpenAI-compatible API — any agentic framework that speaks this protocol works out of the box. We test with OpenClaw, but you can plug in LangChain, AutoGen, CrewAI, Open Interpreter, LobeChat, or anything else. The infrastructure layer doesn't care what's on top.
Proprietary frontier models come with strings attached — rate limiting, usage-based pricing, mass data collection, content moderation that blocks legitimate work, and terms of service that change without notice. You don't own anything. You're renting access to someone else's computer, on their terms.
The open-source community has been closing the gap fast. Models available today for private, local use are approaching — and in some benchmarks surpassing — proprietary alternatives. The hardware to run them is now available at consumer price points.
spark-sovereign is the bridge: a working, tested, production-ready setup that takes a DGX Spark from box-open to a fully operational private AI server — with CLI coding, chat, Telegram communication, voice, agentic tool use, multimodal input, web search, memory, and MCP integrations — all running locally.
This is for anyone who wants to own their AI infrastructure instead of renting it.
This setup lets you pick the best available open-weight model, serve it locally on your own 24/7 hardware, and point OpenClaw at it for full agentic capabilities. The only thing outside your control is electricity. Your AI stays up as long as you can pay the power bill. Use it to your heart's content — and as more intelligent, faster models become available, swap them in and instantly gain speed and intelligence boosts, with the VRAM limits of your Nvidia-CUDA optimized hardware.
TLDR: As of April 2026, this setup is a practical replacement for Claude Code and ChatGPT Codex for day-to-day engineering work. CLI coding, agentic tool use, parallel agents, chat, voice, Telegram, MCP integrations — all running locally, 24/7, with zero API dependency. An engineer can go fully off-grid and still get professional work done. Now running Qwen3.6 with 73.4% SWE-bench Verified and 262K native context.
- ~53 tokens/sec sustained inference — no queue, no throttling, no network latency
- 262K context window — long conversations, full codebase analysis, deep reasoning
- Agentic coding — tool calling, code execution, file management, web search
- Parallel agents — OpenClaw spawns multiple workers for complex tasks simultaneously
- Voice I/O — speak to it, it speaks back (local Whisper STT, configurable TTS)
- Telegram bot — message your AI from your phone, send voice notes, images, text
- Persistent memory — remembers across sessions, learns your codebase and preferences
- Multimodal — send images and video, Brain analyzes natively
- MCP tools — git, GitHub, browser, shell, databases, Slack, Stripe, and more
- Auto-start on boot — plug in power, walk away, it's ready in 5 minutes
- 109 of 128 GB VRAM utilized — this setup pushes a single DGX Spark to its limit
| spark-sovereign (Qwen3.6-35B-A3B) | Claude Code (Opus 4.6) | ChatGPT Codex (GPT-5.4) | |
|---|---|---|---|
| Speed | ~53 tok/s sustained, zero latency | Variable — depends on server load and queue | Variable — depends on server load and queue |
| Coding | Strong — handles day-to-day engineering, debugging, refactoring, and generation | Best-in-class for complex multi-step coding | Strong, comparable to Claude on most tasks |
| Hard reasoning | Good for most tasks; frontier models still lead on the hardest problems | Strongest on complex architectural reasoning | Strong, especially on math and long-chain logic |
| Agentic | Full — parallel agents, tool calling, MCP, code execution via OpenClaw | Full — native tool use, computer use | Full — native tool use, code interpreter |
| Context window | 262K tokens | 200K tokens | 128K–1M tokens |
| Chat / conversation | Unlimited — no session limits, no token caps | Session-limited, rate-limited on heavy use | Generous but usage-capped on Pro tier |
| Voice | Local STT + configurable TTS, Telegram voice notes | Not available in CLI | Voice mode available |
| Privacy | 100% local — zero data leaves your machine | Data processed on Anthropic servers | Data processed on OpenAI servers |
| Ownership | You own the hardware, the model, and every byte of output | You own nothing — renting API access | You own nothing — renting API access |
| Rate limits | None — run it 24/7 at full speed | Yes — throttled during peak usage, hard caps on Pro | Yes — usage caps on all tiers |
| Cost after setup | Electricity only (~$5–15/month) | $20–200/month + API overages | $20–200/month + API overages |
| Availability | 24/7 — works offline, no outages, no maintenance windows | Dependent on Anthropic infrastructure | Dependent on OpenAI infrastructure |
| Bans / ToS risk | Zero — no terms of service, no content policy, no account to lose | Subject to Anthropic's acceptable use policy | Subject to OpenAI's usage policies |
| Model upgrades | Swap in newer open-weight models as they release — instant | Automatic but you have no choice or control | Automatic but you have no choice or control |
The honest take: Frontier models like Opus 4.6 and GPT-5.4 still lead on the hardest reasoning tasks — the kind where you need 500B+ active parameters grinding through a complex multi-file refactor or novel algorithm design. But for the vast majority of professional engineering work — writing code, debugging, reviewing PRs, chatting, running agents, using tools — this local setup gets the job done at ~53 tok/s with zero ongoing cost, total privacy, and no one standing between you and your AI.
The gap is closing fast. Every few weeks, a new open-weight model drops that's smarter and faster than the last. This hardware will only get more capable over time.
We tested multiple models to find the best intelligence-to-speed ratio on Spark hardware. The open-source ecosystem moves fast — what was best last month gets surpassed the next.
| Release | Model | Active Params | tok/s | Intelligence | Status |
|---|---|---|---|---|---|
| v1.0 | Qwen3.5-27B-FP8 (dense) | 27B | ~14–30 | High | Too slow — hit memory bandwidth ceiling |
| v2.0 | Nemotron-3-Nano-30B-A3B-FP8 | 3B | ~35–45 | Medium | Fast but weaker on coding/reasoning |
| v3.0 | Qwen3.5-35B-A3B-FP8 | 3B | ~49 | High | Retired — superseded by v4.0 |
| v4.0 | Qwen3.6-35B-A3B-FP8 | 3B | ~53 | High | Current — drop-in upgrade from v3.0 |
The current model (Qwen3.6-35B-A3B-FP8) is a Gated DeltaNet + MoE hybrid that activates only 3B parameters per token while having 35B total params to draw from. The DeltaNet architecture uses linear attention for 3/4 of layers, dramatically reducing KV cache pressure at long contexts — native 262K context vs 131K on the previous Qwen3.5. Scores 73.4% on SWE-bench Verified (+3.4 over v3.0) and 51.5% on Terminal-Bench 2.0 (+11 over v3.0).
For the full build journey and every decision made, see docs/LESSONS.md.
vLLM (Brain) → http://localhost:8000/v1 (OpenAI-compatible API)
|
Agentic layer → OpenClaw, LangChain, AutoGen, CrewAI, or any framework
|
You → Terminal, Telegram, browser UI, CLI, whatever your framework supports
Brain runs as a Docker container serving the model via vLLM on a standard OpenAI-compatible endpoint. Any framework that can call /v1/chat/completions works — tool calling, streaming, multimodal, all supported at the API level.
We test and document with OpenClaw (open source, fully local, no API key). But this is a plug-and-play infrastructure layer — swap in whatever agentic framework fits your workflow.
| Component | Model | Weights | Port | tok/s |
|---|---|---|---|---|
| Brain | Qwen/Qwen3.6-35B-A3B-FP8 | ~35 GB | 8000 | ~53 |
Key specs:
- Gated DeltaNet + MoE hybrid: 35B total, 3B active per token — fast inference, high intelligence
vllm/vllm-openai:cu130-nightly— standard image, no custom builds (requires vLLM >= 0.19.0)qwen3_codertool parser +qwen3reasoning parser- FP8 weights + FP8 KV cache
gpu_memory_utilization: 0.80(~97GB to vLLM — ~35GB weights + ~58GB KV cache, ~24GB left for OS/Docker)- 262K native context — DeltaNet linear attention keeps KV cache manageable
- Prefix caching enabled — fast repeated prompts
This setup uses ~109 of 128 GB — pushing a single DGX Spark close to its limit.
128GB DGX Spark Unified Memory (121.69 GiB visible to CUDA)
===============================================================
Qwen3.6-35B-A3B FP8 (Brain) ~97.4 GB 0.80 util (~35GB weights + ~58GB KV cache)
OS + Docker + vLLM 6.0 GB always-on
OpenClaw + overhead 2.0 GB always-on
---------------------------------------------------------------
TOTAL ALLOCATED (est.) ~109.0 GB
HEADROOM (est.) ~12.7 GB safe — MoE only activates 3B/token
===============================================================
As NVIDIA improves the DGX Spark hardware and the open-source community releases smarter, more efficiently quantized models, these numbers will only get better. The Spark is a long-term investment — the models you run on it next year will be significantly more capable than what's available today, on the same hardware.
The capabilities below depend on your chosen framework. OpenClaw provides all of these out of the box. Other frameworks may offer different subsets or equivalents.
| Capability | OpenClaw | Other Frameworks |
|---|---|---|
| Voice I/O | Speak → transcribe → Brain responds → speaks back | Varies by framework |
| STT (Speech-to-Text) | Local Whisper CLI (GPU-accelerated) or cloud providers | Framework-dependent |
| TTS (Text-to-Speech) | Provider-based (ElevenLabs, Microsoft, OpenAI) | Framework-dependent |
| Image / video | Send photo or video → Brain analyzes natively | Any framework can pass multimodal to the API |
| Memory | Persistent across sessions — learns from every conversation | Framework-dependent |
| Web search | Live search, results fed to Brain | Framework-dependent |
| Telegram | Message your bot → Brain responds. Voice notes, images, text | Varies |
| MCP tools | Files, git, GitHub, browser, HTTP, shell, AWS, Stripe, Slack | Growing MCP ecosystem |
| Agent orchestration | Brain spawns parallel workers for long tasks | LangChain, AutoGen, CrewAI, etc. |
| TUI / Chat | openclaw tui — interactive terminal chat |
Most frameworks include a chat interface |
Three layers, run once, done.
Layer 1: First boot wizard — physical, one time, ~15 min
Layer 2: NVIDIA Sync + SSH — on your laptop, one time, ~10 min
Layer 3: spark-sovereign — on the Spark, via SSH
- There is no power button — plugging in power = immediate boot
- Connect all peripherals before plugging in power
- Keep the Quick Start Guide — hostname and hotspot credentials are on a sticker inside
Headless: Power on → connect to Spark's WiFi hotspot → browser wizard opens → set username/password → connect to home WiFi
With monitor: Same wizard appears on display.
After WiFi connects, Spark downloads updates (~10 min) and reboots.
- Download NVIDIA Sync from
https://build.nvidia.com/spark/connect-to-your-spark/sync - Add Device → enter hostname (
spark-XXXX.local), username, password - Tray → select device → Terminal
Remote access: NVIDIA Sync → Settings → Tailscale → Enable → Add a Device
# One-time setup
sudo usermod -aG docker $USER && newgrp docker
# Clone and configure
git clone https://github.com/thatwonguy/spark-sovereign.git ~/spark-sovereign
cd ~/spark-sovereign
cp .env.example .env
nano .env # set HF_TOKEN at minimum
# Run these scripts in order (idempotent, safe to re-run)
bash scripts/00_first_boot.sh # Tailscale + confirms setup
bash scripts/01_system_prep.sh # Docker config, dirs, Python deps, auto-start service, watchdog timer
bash scripts/02_download_models.sh # Downloads model → /opt/models (~35GB)
bash scripts/03_vllm_servers.sh # Starts Brain on port 8000 — waits until ready
bash scripts/04_voice_stt.sh # Optional — local Whisper STT (~450MB)Then connect your agentic framework of choice to http://localhost:8000/v1.
With OpenClaw (recommended): openclaw onboard → enter http://localhost:8000/v1 as the base URL.
With any other framework: Point it at http://localhost:8000/v1 using the OpenAI-compatible API. Model ID is the served_name from config/models.yml. API key can be any string.
See docs/OPENCLAW_SETUP.md for detailed connection examples (curl, Python, Node.js).
Script 02 automatically prunes old models. Any model directory in
/opt/modelsnot listed inconfig/models.ymlis deleted before the new download.
Script 01 installs a systemd service that starts Brain automatically on every power cycle. No manual intervention needed.
Brain takes 3–5 minutes to load after a cold boot (~35GB of weights loading into memory). OpenClaw reconnects automatically once ready.
systemctl status spark-sovereign
journalctl -u spark-sovereign -fspark-watchdog.timer runs every 2 min after boot and self-heals the Docker containers spark-sovereign manages (searxng, brain, asr-server, tts-server). It is idempotent — healthy services are never touched — and bounded: after 3 consecutive failed recoveries, a service is quarantined to prevent restart loops.
The watchdog is intentionally framework-agnostic — it does not monitor agent layers (OpenClaw, LibreChat, n8n, etc.). Run your agent framework as a systemd user unit with Restart=on-failure (user linger is already enabled). To monitor an additional container, add check_container <name> "docker start <name>" to the tick block at the bottom of scripts/watchdog.sh.
# Live heartbeat — one summary line every 2 min
sudo journalctl -u spark-watchdog -f
# e.g. [watchdog] tick searxng=up brain=up asr-server=absent tts-server=absent
# Inspect state
ls -la /var/lib/spark-sovereign/state/
cat /var/lib/spark-sovereign/state/*.fails
# Clear a quarantine after fixing the underlying issue
sudo rm /var/lib/spark-sovereign/state/<svc>.quarantinedBrain gets a 10-min load grace window — the watchdog will not restart Brain mid-load (see docs/LESSONS.md).
All model config lives in config/models.yml — the single source of truth.
- Edit
config/models.yml— update model fields bash scripts/02_download_models.sh— downloads new, prunes oldbash scripts/start_brain_ad_hoc.sh— restarts Brain- Update OpenClaw model ID →
openclaw gateway restart
Each section in models.yml has commented swap examples. See docs/LESSONS.md for what we've tested and why.
spark-sovereign/
├── config/
│ ├── models.yml ← SINGLE SOURCE OF TRUTH for all models
│ └── mcp_servers.json ← MCP server catalog (copy blocks into OpenClaw)
├── scripts/
│ ├── 00_first_boot.sh ← WiFi setup + NVIDIA Sync + Tailscale
│ ├── 01_system_prep.sh ← Docker config, directories, Python deps, boot service
│ ├── 02_download_models.sh ← Download model from HF → /opt/models (prunes unused)
│ ├── 03_vllm_servers.sh ← Start Brain (port 8000)
│ ├── 04_voice_stt.sh ← Local Whisper STT setup (optional)
│ ├── boot_sequence.sh ← Auto-start on boot (oneshot, runs once at boot)
│ ├── watchdog.sh ← Self-healing tick (every 2 min via systemd timer)
│ ├── start_brain_ad_hoc.sh ← Restart Brain manually
│ └── check_stack.sh ← Health check
├── docs/
│ ├── LESSONS.md ← Full build journey and model decisions
│ ├── OPENCLAW_SETUP.md ← Agentic framework connection guide
│ └── TROUBLESHOOTING.md
├── .env.example ← Copy to .env, fill in HF_TOKEN at minimum
└── .gitignore
Common fixes:
- Brain not loading →
docker logs brain --tail 50 - OOM → reduce
gpu_memory_utilizationinconfig/models.yml - Swap model → edit
config/models.yml, re-run02_download_models.sh+start_brain_ad_hoc.sh - Check auto-start logs →
journalctl -u spark-sovereign -f
spark-sovereign is the brain — your agentic framework is the body it controls. spark-sovereign is the sovereign private intelligence that replaces ChatGPT, Claude, and every other paid API endpoint. It's the brain you own — running on your hardware, serving your model, answering to no one. Your agentic framework is the body — the claws that grip tools, the legs that walk through your filesystem, the nervous system that connects voice, chat, agents, memory, and MCP. The brain thinks, the body acts.
Without spark-sovereign, your framework needs someone else's brain (a cloud API). Without an agentic framework, spark-sovereign is just a model sitting on a port with no way to reach the world. Together, they're a fully autonomous AI that belongs to you.
OpenClaw is open source, requires no API key, and runs fully local — matching spark-sovereign's zero-cloud philosophy. It provides voice, memory, Telegram, MCP tools, and agent orchestration in a single package.
Feature request: openclaw/openclaw#60792 — we've proposed spark-sovereign as a community hardware reference for DGX Spark users.
Any framework that supports OpenAI-compatible endpoints works. Point it at:
Base URL: http://localhost:8000/v1
Model ID: qwen36-35b (or your served_name from config/models.yml)
API key: any string
See docs/OPENCLAW_SETUP.md for connection examples in curl, Python, and Node.js.
Apache License 2.0 — see LICENSE.
Free to use, modify, and distribute with attribution. The models referenced are open-weight and available on HuggingFace under their respective licenses. vLLM and OpenClaw are open source (MIT/Apache 2.0).
Built in public. Own your AI.