spark-sovereign

Your AI. Your hardware. Your rules.

A fully self-contained, private AI stack running on the NVIDIA DGX Spark (128GB unified memory, GB10 Superchip, ~$4,000–$5,000 as of March 2026).

No cloud. No API keys. No rate limits. No surveillance. No subscriptions. No data leaving your machine. Ever.

Brain serves a standard OpenAI-compatible API — any agentic framework that speaks this protocol works out of the box. We test with OpenClaw, but you can plug in LangChain, AutoGen, CrewAI, Open Interpreter, LobeChat, or anything else. The infrastructure layer doesn't care what's on top.

Why This Exists

Proprietary frontier models come with strings attached — rate limiting, usage-based pricing, mass data collection, content moderation that blocks legitimate work, and terms of service that change without notice. You don't own anything. You're renting access to someone else's computer, on their terms.

The open-source community has been closing the gap fast. Models available today for private, local use are approaching — and in some benchmarks surpassing — proprietary alternatives. The hardware to run them is now available at consumer price points.

spark-sovereign is the bridge: a working, tested, production-ready setup that takes a DGX Spark from box-open to a fully operational private AI server — with CLI coding, chat, Telegram communication, voice, agentic tool use, multimodal input, web search, memory, and MCP integrations — all running locally.

This is for anyone who wants to own their AI infrastructure instead of renting it.

This setup lets you pick the best available open-weight model, serve it locally on your own 24/7 hardware, and point OpenClaw at it for full agentic capabilities. The only thing outside your control is electricity. Your AI stays up as long as you can pay the power bill. Use it to your heart's content — and as more intelligent, faster models become available, swap them in and instantly gain speed and intelligence boosts, with the VRAM limits of your Nvidia-CUDA optimized hardware.

What You Get

TLDR: As of April 2026, this setup is a practical replacement for Claude Code and ChatGPT Codex for day-to-day engineering work. CLI coding, agentic tool use, parallel agents, chat, voice, Telegram, MCP integrations — all running locally, 24/7, with zero API dependency. An engineer can go fully off-grid and still get professional work done. Now running Qwen3.6 with 73.4% SWE-bench Verified and 262K native context.

~53 tokens/sec sustained inference — no queue, no throttling, no network latency
262K context window — long conversations, full codebase analysis, deep reasoning
Agentic coding — tool calling, code execution, file management, web search
Parallel agents — OpenClaw spawns multiple workers for complex tasks simultaneously
Voice I/O — speak to it, it speaks back (local Whisper STT, configurable TTS)
Telegram bot — message your AI from your phone, send voice notes, images, text
Persistent memory — remembers across sessions, learns your codebase and preferences
Multimodal — send images and video, Brain analyzes natively
MCP tools — git, GitHub, browser, shell, databases, Slack, Stripe, and more
Auto-start on boot — plug in power, walk away, it's ready in 5 minutes
109 of 128 GB VRAM utilized — this setup pushes a single DGX Spark to its limit

How This Compares (April 2026 — Honest Assessment)

	spark-sovereign (Qwen3.6-35B-A3B)	Claude Code (Opus 4.6)	ChatGPT Codex (GPT-5.4)
Speed	~53 tok/s sustained, zero latency	Variable — depends on server load and queue	Variable — depends on server load and queue
Coding	Strong — handles day-to-day engineering, debugging, refactoring, and generation	Best-in-class for complex multi-step coding	Strong, comparable to Claude on most tasks
Hard reasoning	Good for most tasks; frontier models still lead on the hardest problems	Strongest on complex architectural reasoning	Strong, especially on math and long-chain logic
Agentic	Full — parallel agents, tool calling, MCP, code execution via OpenClaw	Full — native tool use, computer use	Full — native tool use, code interpreter
Context window	262K tokens	200K tokens	128K–1M tokens
Chat / conversation	Unlimited — no session limits, no token caps	Session-limited, rate-limited on heavy use	Generous but usage-capped on Pro tier
Voice	Local STT + configurable TTS, Telegram voice notes	Not available in CLI	Voice mode available
Privacy	100% local — zero data leaves your machine	Data processed on Anthropic servers	Data processed on OpenAI servers
Ownership	You own the hardware, the model, and every byte of output	You own nothing — renting API access	You own nothing — renting API access
Rate limits	None — run it 24/7 at full speed	Yes — throttled during peak usage, hard caps on Pro	Yes — usage caps on all tiers
Cost after setup	Electricity only (~$5–15/month)	$20–200/month + API overages	$20–200/month + API overages
Availability	24/7 — works offline, no outages, no maintenance windows	Dependent on Anthropic infrastructure	Dependent on OpenAI infrastructure
Bans / ToS risk	Zero — no terms of service, no content policy, no account to lose	Subject to Anthropic's acceptable use policy	Subject to OpenAI's usage policies
Model upgrades	Swap in newer open-weight models as they release — instant	Automatic but you have no choice or control	Automatic but you have no choice or control

The honest take: Frontier models like Opus 4.6 and GPT-5.4 still lead on the hardest reasoning tasks — the kind where you need 500B+ active parameters grinding through a complex multi-file refactor or novel algorithm design. But for the vast majority of professional engineering work — writing code, debugging, reviewing PRs, chatting, running agents, using tools — this local setup gets the job done at ~53 tok/s with zero ongoing cost, total privacy, and no one standing between you and your AI.

The gap is closing fast. Every few weeks, a new open-weight model drops that's smarter and faster than the last. This hardware will only get more capable over time.

Model Evolution

We tested multiple models to find the best intelligence-to-speed ratio on Spark hardware. The open-source ecosystem moves fast — what was best last month gets surpassed the next.

Release	Model	Active Params	tok/s	Intelligence	Status
v1.0	Qwen3.5-27B-FP8 (dense)	27B	~14–30	High	Too slow — hit memory bandwidth ceiling
v2.0	Nemotron-3-Nano-30B-A3B-FP8	3B	~35–45	Medium	Fast but weaker on coding/reasoning
v3.0	Qwen3.5-35B-A3B-FP8	3B	~49	High	Retired — superseded by v4.0
v4.0	Qwen3.6-35B-A3B-FP8	3B	~53	High	Current — drop-in upgrade from v3.0

The current model (Qwen3.6-35B-A3B-FP8) is a Gated DeltaNet + MoE hybrid that activates only 3B parameters per token while having 35B total params to draw from. The DeltaNet architecture uses linear attention for 3/4 of layers, dramatically reducing KV cache pressure at long contexts — native 262K context vs 131K on the previous Qwen3.5. Scores 73.4% on SWE-bench Verified (+3.4 over v3.0) and 51.5% on Terminal-Bench 2.0 (+11 over v3.0).

For the full build journey and every decision made, see docs/LESSONS.md.

Architecture

vLLM (Brain)  →  http://localhost:8000/v1  (OpenAI-compatible API)
      |
Agentic layer  →  OpenClaw, LangChain, AutoGen, CrewAI, or any framework
      |
You  →  Terminal, Telegram, browser UI, CLI, whatever your framework supports

Brain runs as a Docker container serving the model via vLLM on a standard OpenAI-compatible endpoint. Any framework that can call /v1/chat/completions works — tool calling, streaming, multimodal, all supported at the API level.

We test and document with OpenClaw (open source, fully local, no API key). But this is a plug-and-play infrastructure layer — swap in whatever agentic framework fits your workflow.

Current Model

Component	Model	Weights	Port	tok/s
Brain	Qwen/Qwen3.6-35B-A3B-FP8	~35 GB	8000	~53

Key specs:

Gated DeltaNet + MoE hybrid: 35B total, 3B active per token — fast inference, high intelligence
vllm/vllm-openai:cu130-nightly — standard image, no custom builds (requires vLLM >= 0.19.0)
qwen3_coder tool parser + qwen3 reasoning parser
FP8 weights + FP8 KV cache
gpu_memory_utilization: 0.80 (~97GB to vLLM — ~35GB weights + ~58GB KV cache, ~24GB left for OS/Docker)
262K native context — DeltaNet linear attention keeps KV cache manageable
Prefix caching enabled — fast repeated prompts

Memory Map

This setup uses ~109 of 128 GB — pushing a single DGX Spark close to its limit.

128GB DGX Spark Unified Memory (121.69 GiB visible to CUDA)
===============================================================
 Qwen3.6-35B-A3B FP8 (Brain)  ~97.4 GB    0.80 util (~35GB weights + ~58GB KV cache)
 OS + Docker + vLLM             6.0 GB    always-on
 OpenClaw + overhead            2.0 GB    always-on
---------------------------------------------------------------
 TOTAL ALLOCATED (est.)       ~109.0 GB
 HEADROOM (est.)               ~12.7 GB   safe — MoE only activates 3B/token
===============================================================

As NVIDIA improves the DGX Spark hardware and the open-source community releases smarter, more efficiently quantized models, these numbers will only get better. The Spark is a long-term investment — the models you run on it next year will be significantly more capable than what's available today, on the same hardware.

What the Agentic Layer Provides

The capabilities below depend on your chosen framework. OpenClaw provides all of these out of the box. Other frameworks may offer different subsets or equivalents.

Capability	OpenClaw	Other Frameworks
Voice I/O	Speak → transcribe → Brain responds → speaks back	Varies by framework
STT (Speech-to-Text)	Local Whisper CLI (GPU-accelerated) or cloud providers	Framework-dependent
TTS (Text-to-Speech)	Provider-based (ElevenLabs, Microsoft, OpenAI)	Framework-dependent
Image / video	Send photo or video → Brain analyzes natively	Any framework can pass multimodal to the API
Memory	Persistent across sessions — learns from every conversation	Framework-dependent
Web search	Live search, results fed to Brain	Framework-dependent
Telegram	Message your bot → Brain responds. Voice notes, images, text	Varies
MCP tools	Files, git, GitHub, browser, HTTP, shell, AWS, Stripe, Slack	Growing MCP ecosystem
Agent orchestration	Brain spawns parallel workers for long tasks	LangChain, AutoGen, CrewAI, etc.
TUI / Chat	`openclaw tui` — interactive terminal chat	Most frameworks include a chat interface

Setup — Box Open to Running

Three layers, run once, done.

Layer 1: First boot wizard   — physical, one time, ~15 min
Layer 2: NVIDIA Sync + SSH   — on your laptop, one time, ~10 min
Layer 3: spark-sovereign     — on the Spark, via SSH

Layer 1 — First Boot (Physical)

There is no power button — plugging in power = immediate boot
Connect all peripherals before plugging in power
Keep the Quick Start Guide — hostname and hotspot credentials are on a sticker inside

Headless: Power on → connect to Spark's WiFi hotspot → browser wizard opens → set username/password → connect to home WiFi

With monitor: Same wizard appears on display.

After WiFi connects, Spark downloads updates (~10 min) and reboots.

Layer 2 — NVIDIA Sync + SSH (On Your Laptop)

Download NVIDIA Sync from https://build.nvidia.com/spark/connect-to-your-spark/sync
Add Device → enter hostname (spark-XXXX.local), username, password
Tray → select device → Terminal

Remote access: NVIDIA Sync → Settings → Tailscale → Enable → Add a Device

Layer 3 — Scripts (Run on the Spark via SSH)

# One-time setup
sudo usermod -aG docker $USER && newgrp docker

# Clone and configure
git clone https://github.com/thatwonguy/spark-sovereign.git ~/spark-sovereign
cd ~/spark-sovereign
cp .env.example .env
nano .env   # set HF_TOKEN at minimum

# Run these scripts in order (idempotent, safe to re-run)
bash scripts/00_first_boot.sh      # Tailscale + confirms setup
bash scripts/01_system_prep.sh     # Docker config, dirs, Python deps, auto-start service, watchdog timer
bash scripts/02_download_models.sh # Downloads model → /opt/models (~35GB)
bash scripts/03_vllm_servers.sh    # Starts Brain on port 8000 — waits until ready
bash scripts/04_voice_stt.sh       # Optional — local Whisper STT (~450MB)

Then connect your agentic framework of choice to http://localhost:8000/v1.

With OpenClaw (recommended): openclaw onboard → enter http://localhost:8000/v1 as the base URL.

With any other framework: Point it at http://localhost:8000/v1 using the OpenAI-compatible API. Model ID is the served_name from config/models.yml. API key can be any string.

See docs/OPENCLAW_SETUP.md for detailed connection examples (curl, Python, Node.js).

Script 02 automatically prunes old models. Any model directory in /opt/models not listed in config/models.yml is deleted before the new download.

Auto-Start on Boot

Script 01 installs a systemd service that starts Brain automatically on every power cycle. No manual intervention needed.

Brain takes 3–5 minutes to load after a cold boot (~35GB of weights loading into memory). OpenClaw reconnects automatically once ready.

systemctl status spark-sovereign
journalctl -u spark-sovereign -f

Self-Healing Watchdog

spark-watchdog.timer runs every 2 min after boot and self-heals the Docker containers spark-sovereign manages (searxng, brain, asr-server, tts-server). It is idempotent — healthy services are never touched — and bounded: after 3 consecutive failed recoveries, a service is quarantined to prevent restart loops.

The watchdog is intentionally framework-agnostic — it does not monitor agent layers (OpenClaw, LibreChat, n8n, etc.). Run your agent framework as a systemd user unit with Restart=on-failure (user linger is already enabled). To monitor an additional container, add check_container <name> "docker start <name>" to the tick block at the bottom of scripts/watchdog.sh.

# Live heartbeat — one summary line every 2 min
sudo journalctl -u spark-watchdog -f
# e.g.  [watchdog] tick searxng=up brain=up asr-server=absent tts-server=absent

# Inspect state
ls -la /var/lib/spark-sovereign/state/
cat /var/lib/spark-sovereign/state/*.fails

# Clear a quarantine after fixing the underlying issue
sudo rm /var/lib/spark-sovereign/state/<svc>.quarantined

Brain gets a 10-min load grace window — the watchdog will not restart Brain mid-load (see docs/LESSONS.md).

Swapping the Model

All model config lives in config/models.yml — the single source of truth.

Edit config/models.yml — update model fields
bash scripts/02_download_models.sh — downloads new, prunes old
bash scripts/start_brain_ad_hoc.sh — restarts Brain
Update OpenClaw model ID → openclaw gateway restart

Each section in models.yml has commented swap examples. See docs/LESSONS.md for what we've tested and why.

Repo Structure

spark-sovereign/
├── config/
│   ├── models.yml          ← SINGLE SOURCE OF TRUTH for all models
│   └── mcp_servers.json    ← MCP server catalog (copy blocks into OpenClaw)
├── scripts/
│   ├── 00_first_boot.sh       ← WiFi setup + NVIDIA Sync + Tailscale
│   ├── 01_system_prep.sh      ← Docker config, directories, Python deps, boot service
│   ├── 02_download_models.sh  ← Download model from HF → /opt/models (prunes unused)
│   ├── 03_vllm_servers.sh     ← Start Brain (port 8000)
│   ├── 04_voice_stt.sh        ← Local Whisper STT setup (optional)
│   ├── boot_sequence.sh       ← Auto-start on boot (oneshot, runs once at boot)
│   ├── watchdog.sh            ← Self-healing tick (every 2 min via systemd timer)
│   ├── start_brain_ad_hoc.sh  ← Restart Brain manually
│   └── check_stack.sh         ← Health check
├── docs/
│   ├── LESSONS.md          ← Full build journey and model decisions
│   ├── OPENCLAW_SETUP.md   ← Agentic framework connection guide
│   └── TROUBLESHOOTING.md
├── .env.example            ← Copy to .env, fill in HF_TOKEN at minimum
└── .gitignore

Troubleshooting

See docs/TROUBLESHOOTING.md.

Common fixes:

Brain not loading → docker logs brain --tail 50
OOM → reduce gpu_memory_utilization in config/models.yml
Swap model → edit config/models.yml, re-run 02_download_models.sh + start_brain_ad_hoc.sh
Check auto-start logs → journalctl -u spark-sovereign -f

Agentic Layer — OpenClaw and Beyond

spark-sovereign is the brain — your agentic framework is the body it controls. spark-sovereign is the sovereign private intelligence that replaces ChatGPT, Claude, and every other paid API endpoint. It's the brain you own — running on your hardware, serving your model, answering to no one. Your agentic framework is the body — the claws that grip tools, the legs that walk through your filesystem, the nervous system that connects voice, chat, agents, memory, and MCP. The brain thinks, the body acts.

Without spark-sovereign, your framework needs someone else's brain (a cloud API). Without an agentic framework, spark-sovereign is just a model sitting on a port with no way to reach the world. Together, they're a fully autonomous AI that belongs to you.

Why we test with OpenClaw

OpenClaw is open source, requires no API key, and runs fully local — matching spark-sovereign's zero-cloud philosophy. It provides voice, memory, Telegram, MCP tools, and agent orchestration in a single package.

Feature request: openclaw/openclaw#60792 — we've proposed spark-sovereign as a community hardware reference for DGX Spark users.

Using a different framework

Any framework that supports OpenAI-compatible endpoints works. Point it at:

Base URL:  http://localhost:8000/v1
Model ID:  qwen36-35b  (or your served_name from config/models.yml)
API key:   any string

See docs/OPENCLAW_SETUP.md for connection examples in curl, Python, and Node.js.

License

Apache License 2.0 — see LICENSE.

Free to use, modify, and distribute with attribution. The models referenced are open-weight and available on HuggingFace under their respective licenses. vLLM and OpenClaw are open source (MIT/Apache 2.0).

Built in public. Own your AI.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

spark-sovereign

Why This Exists

What You Get

How This Compares (April 2026 — Honest Assessment)

Model Evolution

Architecture

Current Model

Memory Map

What the Agentic Layer Provides

Setup — Box Open to Running

Layer 1 — First Boot (Physical)

Layer 2 — NVIDIA Sync + SSH (On Your Laptop)

Layer 3 — Scripts (Run on the Spark via SSH)

Auto-Start on Boot

Self-Healing Watchdog

Swapping the Model

Repo Structure

Troubleshooting

Agentic Layer — OpenClaw and Beyond

Why we test with OpenClaw

Using a different framework

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 163 Commits
.github		.github
agent		agent
config		config
docs		docs
scripts		scripts
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

spark-sovereign

Why This Exists

What You Get

How This Compares (April 2026 — Honest Assessment)

Model Evolution

Architecture

Current Model

Memory Map

What the Agentic Layer Provides

Setup — Box Open to Running

Layer 1 — First Boot (Physical)

Layer 2 — NVIDIA Sync + SSH (On Your Laptop)

Layer 3 — Scripts (Run on the Spark via SSH)

Auto-Start on Boot

Self-Healing Watchdog

Swapping the Model

Repo Structure

Troubleshooting

Agentic Layer — OpenClaw and Beyond

Why we test with OpenClaw

Using a different framework

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages