Multimodal Gemma 4 on Apple Silicon — a chat app, a local API server, and an open MLX runtime in one.
Voice, vision, documents, web research — plus any mlx-community model you bring. All on your Mac.
No cloud. No telemetry. (Two optional API keys.)
- Why GemX
- What it does (at a glance)
- Requirements
- Install
- Models
- Features in depth
- Why `mlx-vlm` specifically
- Architecture
- Tech stack
- Storage & settings
- Privacy
- Building a distributable
- Troubleshooting
- Release notes
- Credits
- License
Local LLMs on Mac aren't new — Ollama and LM Studio have both matured significantly through 2025–2026. So why another one?
GemX is opinionated. It started as a focused chat app built around Gemma 4 and mlx-vlm, the MLX runtime purpose-built for vision-language models, and grew into a small local-AI hub: a polished chat client, an OpenAI-compatible server other apps can build on (in-app or headless), and an open runtime that loads any mlx-community model. It's still Gemma-4-first, and ships everything a research-style assistant needs out of the box: voice input, document parsing, and multi-step web search with inline citations.
- Ollama — open-source backend with an OpenAI-compatible API. Added a native macOS app in July 2025 and an MLX preview for Apple Silicon in March 2026. Multimodal engine landed May 2025 (Llama 3.2 Vision, Qwen3-VL, LLaVA); Gemma 4 multimodal is not in their headline model list. Web-search API exists (Sep 2025) but there is no built-in chat research workflow.
- LM Studio: polished closed-source chat app with `mlx-vlm` support and explicit Gemma 4 multimodal listings. Supports document RAG and image attachments. No voice input, no built-in web search.
- Most MLX chat apps in the ecosystem still use `mlx-lm`, Apple's text-only MLX runtime, which is fine for chat but doesn't load Gemma 4's image encoder.
GemX is the only project that bundles all of the following in one app:
- Multimodal Gemma 4 via `mlx-vlm`: paste an image, the model sees it
- Voice input via on-device Whisper
- Document attachments: PDF, DOCX, PPTX, code, plain text
- Web research with citations: Tavily + DuckDuckGo, multi-step research loop, clickable `[N]` sources
- Local API server (opt-in): an OpenAI-compatible endpoint so editors, scripts, and other apps can use GemX's loaded model as a backend; run it in-app or headless from the terminal (`gemx serve`); loopback-only by default, bearer-token auth, LAN exposure is a separate explicit opt-in
- Bring your own model: load any `mlx-community` repo (vision `mlx-vlm` or text-only `mlx-lm`), auto-probed for runtime / context / thinking, not only the bundled Gemma 4 family
- Custom instructions & personas: global instructions appended to the system prompt, plus named personas you switch per conversation
- Sampling controls — Precise / Balanced / Creative presets, with raw temperature / top-p / top-k / repetition-penalty under Advanced
- Native Mac chat UX — pinned chats, conversation search, ⌘K command palette, smart titles, ⌘B sidebar, context-usage meter, model hot-swap mid-session
- 100 % local inference — your prompts never leave the machine (the only optional outbound calls are to your chosen search backend)
Verified against each project's own documentation in mid-2026. Caveats are listed in-cell instead of glossed over.
| | GemX | Ollama | LM Studio |
|---|---|---|---|
| Form factor | Mac chat app | OSS API server + CLI; native app added Jul 2025 | Mac/Win/Linux chat app + API |
| MLX backend | mlx-vlm (multimodal) | MLX preview (Mar 2026) | mlx-vlm (multimodal, May 2025) |
| Native multimodal Gemma 4 | ✓ via mlx-vlm | ✗ Gemma 4 supported via mlx-lm (text only) | ✓ via mlx-vlm |
| Voice input (Whisper, on-device) | ✓ built-in | ✗ | ✗ |
| Image attachments in chat | ✓ paste / drag / browse | partial: via supported vision models; the mlx-lm path is text only | ✓ via supported vision models |
| Document attachments (PDF / DOCX) | ✓ PDF + DOCX + code | ✓ RAG (text only, not supported for all types) | ✓ RAG (text only) |
| Built-in web search workflow | ✓ multi-step Tavily + DDG | partial: web-search API, no chat loop | ✗ |
| Local OpenAI-compatible API | ✓ opt-in; in-app or headless (`gemx serve`); loopback by default, token auth | ✓ core feature | ✓ |
| Bring your own model | ✓ any mlx-community repo (mlx-vlm or mlx-lm), auto-probed | ✓ pull any model | ✓ download any |
| Inline [N] citations + sources list | ✓ enforced by system prompt | ✗ | ✗ |
| Pinned conversations, smart titles | ✓ | n/a (no built-in chat UI) | partial |
| Open source | ✓ (MIT) | partial: OSS server + CLI; GUI app is free but closed-source | ✗ (free, closed-source) |
- Chat with Gemma 4 in a polished native Mac UI — full Markdown, KaTeX math, syntax-highlighted code blocks, tables etc.
- See — drag or paste images straight into the conversation
- Read — drop a PDF, DOCX, PPTX, or code/text file; math is preserved (DOCX/PPTX→LaTeX, PDF→vision), and big files are indexed locally so the relevant passages are retrieved per question (on-device RAG)
- Listen — click the mic, talk, get a transcript via local Whisper
- Search — agentic web research with Tavily, multi-step fetch loop, clickable citations
- Serve: flip on the local API server (in-app toggle or headless `gemx serve`) and point any OpenAI-compatible client at `http://127.0.0.1:11535/v1`
- Extend: bring your own model by loading any `mlx-community` repo (multimodal or text-only), not just the five bundled Gemma 4 variants
- Shape: custom instructions + per-conversation personas, and Precise/Balanced/Creative sampling presets
- Switch models without restarting — five Gemma 4 variants from ~4 GB to ~18 GB, plus your custom adds
- Stay private — every byte stays on your Mac except (optionally) your chosen search query
- Apple Silicon Mac (M1 or later) — MLX is ARM-only
- macOS 13 Ventura or later
- Python 3.10 – 3.13 (`brew install python@3.13` if missing; 3.14+ not yet supported by `mlx-vlm`)
- Node.js 20+ (only needed when building from source)
- ~5 GB free disk space minimum (smallest model + Python venv + node_modules); up to 20 GB if you run the 31B variant
There are two ways to get GemX running. Most users want Option 1. If you want to hack on the code or build your own .dmg, use Option 2.
Both routes share the same Requirements (Apple Silicon, macOS 13+, Python 3.10–3.13) and converge on the same first-launch flow.
- Go to the Releases page on GitHub
- Download the latest `GemX-<version>.dmg`
- Double-click to mount, then drag GemX.app to your `/Applications` folder
- Eject the disk image and launch GemX from Applications, Launchpad, or Spotlight
- If macOS warns "GemX cannot be opened because the developer cannot be verified", right-click → Open the first time (one-time bypass). As a one-liner alternative, run in Terminal:

xattr -dr com.apple.quarantine /Applications/GemX.app
You do not need Node.js, npm, this repo, or any build tools for this route — just Python 3.10–3.13 on your PATH (brew install python@3.13 if missing). Everything else is provisioned by the app on first launch.
For development or producing your own signed build:
git clone https://github.com/Avaneesh40585/GemX.git
cd GemX
npm install
npm run dev

This runs Electron + Vite in development mode with hot-module reloading. To produce a distributable .dmg yourself, see Building a distributable.
The very first time GemX starts, it will:
1. Detect Python on your `PATH` and create an isolated virtualenv at `~/Library/Application Support/GemX/mlx/venv/`
2. `pip install mlx-vlm` into that venv (cached for future launches)
3. Show the model picker: choose one of the five Gemma 4 variants
4. Download the model weights from HuggingFace into the local cache
5. Boot the MLX inference server on `127.0.0.1:11534`
6. Drop you into the chat
Subsequent launches skip steps 1–4 — the model loads from cache in seconds.
💡 Tip: A free HuggingFace token (added in Settings) speeds up model downloads and avoids rate-limits. Gated models also require it.
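The Python detection at first launch can be approximated as follows. This is a minimal sketch, not GemX's actual code: the function names are hypothetical, and only the supported range (3.10 to 3.13) comes from the Requirements above.

```python
import re
import subprocess

SUPPORTED_MIN = (3, 10)  # from the Requirements list
SUPPORTED_MAX = (3, 13)  # 3.14+ is stated as not yet supported by mlx-vlm


def parse_python_version(output: str):
    """Extract (major, minor) from text like 'Python 3.12.4'; None if absent."""
    match = re.search(r"Python (\d+)\.(\d+)", output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def is_supported(version) -> bool:
    """True when the (major, minor) tuple falls inside the supported range."""
    return version is not None and SUPPORTED_MIN <= version <= SUPPORTED_MAX


def detect_python(executable: str = "python3"):
    """Ask an interpreter on PATH for its version; None if it cannot be run."""
    try:
        result = subprocess.run(
            [executable, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # Older interpreters print the version to stderr, newer ones to stdout.
    return parse_python_version(result.stdout + result.stderr)
```

Calling `is_supported(detect_python())` before creating the virtualenv would let a setup script fail early with a clear message instead of a stalled install.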
All variants are 4-bit quantized via mlx-community and natively multimodal (text + image). Switch between them at any time via the dropdown in the chat header.
Warning
Running a local LLM is resource-intensive. Inference saturates the GPU, holds the entire model in unified memory, and is sensitive to thermal throttling. Before launching a chat, close memory-heavy background apps — browsers with dozens of tabs, IDEs, Docker, Slack, games, video editors. If your Mac is on battery, plug it in. The bigger the model, the less headroom you have for everything else.
The Minimum column below is the smallest unified-memory configuration where the model loads and runs without swap thrashing or OOM kills. Recommended is the configuration where it feels comfortable with a couple of other apps open at the same time. RAM = unified memory on Apple Silicon.
| Model | HF repo | Disk | Default context | Max context | Minimum RAM | Recommended RAM | Notes |
|---|---|---|---|---|---|---|---|
| Gemma 4 E2B | `mlx-community/gemma-4-e2b-it-4bit` | ~4 GB | 8 K | 128 K | 8 GB | 16 GB | Fastest, lowest footprint. Good for quick Q&A. |
| Gemma 4 E4B ★ | `mlx-community/gemma-4-e4b-it-4bit` | ~5.5 GB | 16 K | 128 K | 8 GB | 16 GB | Recommended default. Best balance. |
| Gemma 4 12B | `mlx-community/gemma-4-12B-it-4bit` | ~7 GB | 32 K | 256 K | 12 GB | 16 GB | Dense mid-tier: sharper reasoning than E4B without the MoE's RAM cliff. Comfortable on a 16 GB Mac. |
| Gemma 4 26B MoE | `mlx-community/gemma-4-26b-a4b-it-4bit` | ~16 GB | 64 K | 256 K | 24 GB | 32 GB | Mixture-of-Experts: 26 B total params, 4 B active per token. |
| Gemma 4 31B | `mlx-community/gemma-4-31b-it-4bit` | ~19 GB | 64 K | 256 K | 32 GB | 48–64 GB | Dense, highest raw quality. Comfortable on M3 Max / M4 Pro 36 GB+ or M-series Max/Ultra. |
Important
Disk size is roughly the lower bound on RAM. Add ~2–4 GB for the OS, ~1–6 GB for the KV cache depending on context window, plus headroom for whatever else you're running. On a 16 GB Mac, stick to E2B, E4B, or 12B at a modest context window.
The context window (set per-session in Settings) determines how many tokens of conversation the model can hold at once. Larger windows = longer memory, but they reserve more KV-cache memory upfront — independently of how full the conversation actually is.
A rough rule of thumb: doubling the context window roughly doubles the KV-cache allocation. The cache scales with the model's layer count and hidden size, so the bigger the model, the more each context-window step costs.
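The doubling rule follows from the standard size formula for a full-attention KV cache, sketched below. The layer count, head count, and head size are illustrative placeholders, not Gemma 4's real hyperparameters. Models that mix in sliding-window attention layers grow more slowly than this formula predicts.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Full-attention KV cache: one key and one value vector per layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


# Illustrative shape only; these are NOT Gemma 4's real hyperparameters.
EXAMPLE = dict(layers=32, kv_heads=8, head_dim=128)

for window in (8_192, 16_384, 32_768):
    gib = kv_cache_bytes(window, **EXAMPLE) / 2**30
    print(f"{window:>6} tokens -> {gib:.1f} GiB")
```

Every factor except `tokens` is fixed by the model, so the allocation is linear in the context window. With this example shape, 8 K tokens come to exactly 1 GiB and each doubling adds as much again.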
Practical guidance per model (approximate, in addition to the model weights themselves):
| Model | 8 K | 32 K | 64 K | 128 K | 256 K |
|---|---|---|---|---|---|
| Gemma 4 E2B | ~0.3 GB | ~1 GB | ~2 GB | ~4 GB | n/a (max 128 K) |
| Gemma 4 E4B | ~0.5 GB | ~2 GB | ~4 GB | ~6 GB | n/a (max 128 K) |
| Gemma 4 12B | ~0.8 GB | ~3 GB | ~6 GB | ~10 GB | ~20 GB |
| Gemma 4 26B MoE | ~1 GB | ~3 GB | ~5 GB | ~10 GB | ~18 GB |
| Gemma 4 31B | ~1.5 GB | ~5 GB | ~8 GB | ~14 GB | ~26 GB |
The 256 K column is the model's native max — and as the numbers above make clear, on a typical Mac you'll OOM long before reaching it. Most users want to stay at or below 64 K. Push it higher only if you have an Ultra-class machine with RAM to spare.
If you OOM mid-conversation:
- Open Settings → Context window and step it down (it's a power-of-2 slider — each step halves the KV-cache reservation)
- Close other RAM-hungry apps and reload the conversation
- If it still doesn't fit, switch to a smaller model
- Streaming responses with mid-stream cancel
- Edit the last user message and re-send (regenerates from that point)
- Regenerate the last assistant response with one click
- Copy any message or any individual code block
- Full GFM Markdown: headings, bold, italic, strikethrough, tables, nested lists, blockquotes
- LaTeX math: both inline `$x^2$` and block `$$\int_0^1 f(x)\,dx$$`, rendered via KaTeX
- Syntax-highlighted code blocks with:
  - Language detection from the fence
  - Pretty labels (`cpp` → C++, `ts` → TS, `py` → Python, `rs` → Rust, …)
  - One-click copy
  - Custom Atom-One-Dark-derived theme with no per-line backgrounds
- Stop-and-resume indicators — clear "Response stopped" message when you cancel mid-stream, error blocks instead of raw error text in the bubble
- Empty-response indicator — never silently swallows a turn (see Empty responses, explained below)
- Collapsible thinking blocks: when Thinking is enabled in Settings, Gemma 4's reasoning trace streams into a collapsible block above the answer. The block header shimmers through rotating verbs (`Thinking → Considering → Planning → Pondering → Reasoning → Sketching`) while reasoning is in flight, then collapses to "Thought process" once the answer starts. The body renders with the full markdown stack (syntax-highlighted code blocks, KaTeX math, GFM tables, and target-blank links), so reasoning looks as polished as the answer itself. mlx-vlm's chat template strips past thinking from history automatically so multi-turn context stays clean.
Sometimes the very first turn of a fresh conversation comes back empty — the message bubble appears with just a small "No response" indicator instead of text. This isn't a crash; it's a known quirk of Gemma 4 + mlx-vlm while the model finishes warming up.
What's happening under the hood:
- On the first turn after launch (or after switching models), the MLX server is still bringing weights into unified memory, compiling kernels, and JIT-priming the cache.
- When tools are enabled, GemX also issues the request with a stricter system prompt that tells the model to emit XML tool blocks. Occasionally Gemma 4 returns a stop token before producing any user-visible text, the same pattern that makes the `tools` / `tool_choice` API path unusable with this model.
- Rather than silently swallow the turn (which would leave you staring at a blank bubble wondering if it's still working), GemX detects the zero-token completion and renders the "No response" indicator. The conversation state is preserved.
What to do:
- Click Regenerate on the empty message — second attempts almost always succeed
- If it happens repeatedly on the same turn, try shortening / rephrasing the prompt, or temporarily turn Web Search off in Settings (the tool-enabled system prompt is what most often triggers it)
- After 2–3 successful turns in a chat the quirk effectively disappears — the model is fully warm
This is purely a UI affordance; nothing is logged or sent anywhere. The "No response" state is mutually exclusive with the "Response stopped" indicator (manual cancel) and the red error indicator (actual generation failure), so you can always tell what happened at a glance.
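The three mutually exclusive end states can be pictured as a small classifier. This is a sketch under assumptions: the function name, its arguments, and the precedence (error before cancel before empty) are illustrative, not taken from GemX's source.

```python
from typing import Optional


def classify_turn(text: str, cancelled: bool = False, error: Optional[str] = None) -> str:
    """Map the outcome of one assistant turn to exactly one UI state."""
    if error is not None:
        return "error"        # red error indicator: generation actually failed
    if cancelled:
        return "stopped"      # "Response stopped": the user cancelled mid-stream
    if not text.strip():
        return "no_response"  # zero-token completion: show "No response"
    return "ok"
```

Because the checks run in a fixed order and each returns immediately, a turn can never land in two states at once.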
- Paste (⌘V), drag-and-drop, or browse via the attach button
- Up to 4 images per message
- Client-side resize to 1024 px longest edge, JPEG-encoded — saves memory and tokens
- Thumbnails render in the user bubble; click to view full-size
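The resize rule above amounts to scaling so the longest edge is at most 1024 px while keeping the aspect ratio. A minimal sketch of the arithmetic follows. The function name is hypothetical, and the choice never to upscale small images is an assumption.

```python
def fit_longest_edge(width: int, height: int, max_edge: int = 1024) -> tuple[int, int]:
    """Scale (width, height) down so the longest edge is at most max_edge."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # assumption: small images are left untouched
    return (
        max(1, round(width * max_edge / longest)),
        max(1, round(height * max_edge / longest)),
    )
```

A 4032 × 3024 phone photo, for example, comes out at 1024 × 768, roughly a sixteenth of the original pixel count.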
GemX reads PDF, DOCX, PPTX, and text-based files (code, plaintext, Markdown, JSON/YAML, …) — not arbitrary binaries. Images go to the model as vision input; audio is transcribed (see below).
- DOCX: math preserved as LaTeX. Word stores equations as structured OMML; GemX converts them to LaTeX (`$…$`) instead of dropping them like `mammoth` does, so formulas survive into the model and render via KaTeX. (Falls back to plain text on any odd file.)
- PPTX: slides, math, and notes. PowerPoint is the same OOXML format as Word, so GemX extracts each slide's text + equations (OMML→LaTeX) in display order and includes speaker notes. (Slide images / diagrams aren't captured; text and math only.)
- PDF: math read by the vision model. A PDF stores equations as un-extractable glyphs, so when a multimodal model (e.g. Gemma 4) is active, GemX renders pages to images and attaches the relevant page(s) to your question at retrieval time so the model reads the real math. With a text-only model it falls back to plain `pdf-parse` text.
- Code / plaintext / Markdown read directly: `.js`, `.ts`, `.py`, `.cpp`, `.rs`, `.go`, `.java`, `.md`, `.txt`, `.json`, `.yaml`, …
- Smart retrieval for large files: documents under ~8 KB are inlined whole; bigger files are indexed locally (chunked + embedded with a tiny on-device model, `gte-small` via Transformers.js) and the passages relevant to each question are pulled in per turn. No cloud, no 8 KB truncation, and the document never gets pruned out of a long conversation. A brief "Indexing…" status shows in the composer while it works; vectors (and rendered PDF pages) live on disk under `rag/<conversation>/`, never in the conversation blob.
- Shows as a filename chip above the bubble, never raw `<context>` XML
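The inline-or-retrieve decision for large files can be sketched as below. A toy word-count embedding stands in for `gte-small` so the example runs without a model. The chunk size, overlap, and `top_k` are assumptions; only the ~8 KB threshold comes from the text above.

```python
import math
import re
from collections import Counter

INLINE_LIMIT_BYTES = 8 * 1024  # the "~8 KB" threshold described above


def chunk_text(text: str, chunk_chars: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows."""
    step = chunk_chars - overlap
    return [text[i:i + chunk_chars] for i in range(0, max(len(text) - overlap, 1), step)]


def embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_context(document: str, question: str, top_k: int = 2) -> list[str]:
    """Inline small documents whole; otherwise return the top_k most relevant chunks."""
    if len(document.encode("utf-8")) <= INLINE_LIMIT_BYTES:
        return [document]
    query = embed(question)
    ranked = sorted(chunk_text(document), key=lambda c: cosine(embed(c), query), reverse=True)
    return ranked[:top_k]
```

In a real pipeline the chunk vectors would be computed once at indexing time and stored, so only the question is embedded on each turn.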
- Click the mic button to start recording; live timer; click again to stop
- Local speech-to-text via Whisper `base.en` (no cloud)
- The Whisper model (~150 MB) downloads once on first use and is cached in browser IndexedDB
- WebGPU by default with automatic WASM fallback
- Currently English only
GemX is agentic when it comes to research — it doesn't just summarize from snippets.
- Primary: Tavily API — purpose-built for AI agents, 1 000 free searches/month, no card required
- Fallback: DuckDuckGo HTML scrape — works when DDG isn't bot-blocking; GemX detects the bot-challenge HTML and surfaces a clear message asking you to set a Tavily key
- Two tools are wired up to the model:
  - `<web_search><query>…</query></web_search>` → returns numbered results
  - `<fetch_url><url>…</url></fetch_url>` → returns the page text (≤8 KB); escalates to Tavily extraction for JavaScript-heavy pages, and says so honestly if a page can't be read (instead of inventing it)
- Mandatory multi-step workflow (enforced by both the system prompt and dynamic per-tool follow-up messages):
  1. Search → list of URLs
  2. Fetch top 2–3 URLs (landing-page snippets are NOT enough)
  3. Synthesize: final answer with citations
- Online/offline auto-detection via `navigator.onLine` + `online`/`offline` event listeners; in Auto mode tools deactivate the moment you go offline
- The model emits inline `[1]`, `[2]` markers as it writes
- A required "Sources:" section at the bottom of every research answer lists those references as clickable markdown links
- Clicking a citation opens the source in your default browser: a `will-navigate` interceptor in the main process routes external HTTPS links via `shell.openExternal`, never in-app
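One way to check that every inline marker has a matching entry in the "Sources:" section is sketched below. The function names are hypothetical and the exact layout of a source line (`[1] [Title](https://…)`) is an assumption. The source only states that markers are `[N]` and that sources are markdown links.

```python
import re


def split_answer(answer: str) -> tuple[str, str]:
    """Split an answer into its body and the text after the 'Sources:' heading."""
    body, _, sources = answer.partition("Sources:")
    return body, sources


def cited_markers(body: str) -> list[int]:
    """Collect the distinct [N] markers used in the body, in ascending order."""
    return sorted({int(n) for n in re.findall(r"\[(\d+)\]", body)})


def listed_sources(sources: str) -> dict[int, str]:
    """Parse lines shaped like '[1] [Title](https://example.com)' (assumed layout)."""
    pattern = r"\[(\d+)\]\s*\[[^\]]*\]\((https?://[^)\s]+)\)"
    return {int(n): url for n, url in re.findall(pattern, sources)}


def unresolved_citations(answer: str) -> list[int]:
    """Markers that appear in the body but have no entry in the sources list."""
    body, sources = split_answer(answer)
    listed = listed_sources(sources)
    return [n for n in cited_markers(body) if n not in listed]
```

A non-empty result would flag an answer that cites a source it never listed.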
- Unlimited saved conversations in the sidebar
- Smart auto-generated titles — after your first turn, a background model call writes a 3–5 word title
- Pinned conversations float to a dedicated "Pinned" section at the top
- 3-dot dropdown menu per chat: Pin / Unpin · Rename · Delete
- Inline rename — click rename, type, Enter to commit
- Delete with confirmation
- All conversations live in `localStorage` under `gemx:conversations:v7`
When a conversation grows past 75 % of the model's context window, GemX prunes it gracefully:
- Phase 1: large tool results (`fetch_url` content blocks) are collapsed to short summaries
- Phase 2: if still over the limit, drop the oldest user/assistant turns one by one, always preserving the system prompt and the most recent user message

This prevents "context length exceeded" errors mid-research.
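The two phases can be sketched as follows. This is an illustration rather than GemX's implementation: the four-characters-per-token estimate is a crude assumption, the `tool_result` flag is a hypothetical marker, and truncation stands in for whatever summary GemX actually produces.

```python
def estimate_tokens(messages: list[dict]) -> int:
    """Crude token estimate: about four characters per token (assumption)."""
    return sum(len(m["content"]) // 4 for m in messages)


def prune(messages: list[dict], context_window: int,
          threshold: float = 0.75, summary_chars: int = 200) -> list[dict]:
    """Return a pruned copy of messages that fits within threshold * context_window."""
    budget = int(context_window * threshold)
    pruned = [dict(m) for m in messages]  # copy so the caller's list is untouched
    if estimate_tokens(pruned) <= budget:
        return pruned

    # Phase 1: collapse large tool results to a short stub.
    for m in pruned:
        if m.get("tool_result") and len(m["content"]) > summary_chars:
            m["content"] = m["content"][:summary_chars] + " [truncated]"

    # Phase 2: drop the oldest turns, keeping system prompts and the latest user message.
    while estimate_tokens(pruned) > budget:
        last_user = max((i for i, m in enumerate(pruned) if m["role"] == "user"), default=-1)
        victim = next(
            (i for i, m in enumerate(pruned) if m["role"] != "system" and i != last_user),
            None,
        )
        if victim is None:
            break  # nothing left that may be dropped
        del pruned[victim]
    return pruned
```

Running Phase 1 first means a long research session usually loses bulky fetched pages before it loses any of the actual conversation.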
- Toggle with ⌘B or the icon in the chat header
- Hidden state is persisted across sessions; on first launch the sidebar auto-collapses on small windows (< 640 px wide) so the chat area isn't crowded
- Slim mode below `md` (768 px): `w-52` instead of `w-60`
- Two sections: Pinned (if any) then Chats
The settings modal has a left nav rail (a horizontal tab strip on narrow windows) with five sections:
- Behavior: controls that shape a response:
  - Thinking mode (`Disabled` / `Enabled`): Gemma 4's native reasoning channel; slower but stronger on math/code/multi-step questions
  - Web Search mode (`Disabled` / `Enabled`) with an Online/Offline dot, plus the Tavily API key (recommended; stored at `tokens/tavily-key`) when enabled
  - Context window: a powers-of-2 stepper (1 024 → 262 144) with per-model maximums and a Reset button
  - Response Style: `Precise` / `Balanced` / `Creative` presets, plus an Advanced expander with raw temperature / top-p / top-k / repetition-penalty (editing a value flips the preset to "custom")
- Personalization:
  - Custom Instructions: free text appended to GemX's system prompt on every chat (date/tools/vision keep working)
  - Personas: named, reusable prompts you create here and switch per conversation from a header dropdown
- Models: every cached built-in plus every custom you've added (one-click delete per row; the active model can't be deleted until you switch away from it), and the optional HuggingFace token for gated downloads (stored at `tokens/hf-token`)
- Appearance: Theme `System` / `Light` / `Dark` (`System` follows macOS Appearance live and keeps the titlebar tint + native scrollbars aligned)
- Developer: the Local API server (see below)
Token files have 0o600 permissions (owner read/write only); token inputs support Enter to save, show a brief Saved ✓, and have always-visible Clear buttons.
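Writing a secret with owner-only permissions can be done as in the sketch below. This shows the general POSIX technique, not GemX's own code (which runs in the Electron main process). The helper names are hypothetical.

```python
import os
import stat
from pathlib import Path


def write_secret(path: Path, secret: str) -> None:
    """Write secret to path so that only the owner can read or write it (0o600)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create with 0o600 from the start so the file is never briefly world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.fchmod(fd, 0o600)  # also tighten a pre-existing file with looser permissions
        os.write(fd, secret.encode("utf-8"))
    finally:
        os.close(fd)


def file_mode(path: Path) -> int:
    """Return just the permission bits of a file, e.g. 0o600."""
    return stat.S_IMODE(os.stat(path).st_mode)
```

Passing the mode to `os.open` rather than calling `chmod` afterwards avoids a window in which another local user could read the token.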
- Click the model name in the header → dropdown listing all 5 variants
- Selecting a different model restarts the MLX server internally
- Loading overlay appears during the switch; conversation history is preserved
Turn on Settings ▸ Developer ▸ Local API server to expose an OpenAI-compatible endpoint so editors, scripts, or any OpenAI client can use GemX's loaded model as a backend. It's a thin proxy in front of the MLX server, so it serves whichever model is currently loaded and shares that one process with in-app chats.
Defaults are conservative: off until you enable it, bound to 127.0.0.1 only, and /v1/*-only. The panel shows the base URL (default http://127.0.0.1:11535/v1) with a copy button and a live running/stopped dot.
# List the loaded model
curl http://127.0.0.1:11535/v1/models
# Chat (streaming)
curl http://127.0.0.1:11535/v1/chat/completions \
-H 'content-type: application/json' \
-d '{"model":"gemx","messages":[{"role":"user","content":"hi"}],"stream":true}'

- Bearer token (optional on loopback, required for network access): set one in the panel, then send `-H "Authorization: Bearer <token>"`. Requests without it get `401`.
- LAN exposure is a separate, explicit opt-in (binds `0.0.0.0`) with a visible warning; only enable it on trusted networks, and always with a token.
- The `model` field in your request can be anything: GemX rewrites it to the currently-loaded model before forwarding (the MLX server would otherwise try to load whatever id a client sends). `GET /v1/models` likewise returns exactly the loaded model.
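The proxy behaviour described in these bullets can be pictured with two small functions. This is one reading of the rules, not GemX's code: the names are hypothetical, header keys are assumed to be lowercased, and rejecting token-less requests whenever LAN exposure is on is an interpretation of "required for network access".

```python
import hmac
import json
from typing import Optional


def check_auth(headers: dict, configured_token: Optional[str], lan_exposed: bool) -> int:
    """Return 200 to proceed or 401 to reject, following the rules described above."""
    if not configured_token:
        # No token set: acceptable on loopback only (assumed interpretation).
        return 401 if lan_exposed else 200
    supplied = headers.get("authorization", "")
    expected = f"Bearer {configured_token}"
    # Constant-time comparison avoids leaking the token through timing differences.
    return 200 if hmac.compare_digest(supplied.encode(), expected.encode()) else 401


def rewrite_model(raw_body: bytes, loaded_model: str) -> bytes:
    """Replace whatever model id the client sent with the currently loaded one."""
    payload = json.loads(raw_body)
    payload["model"] = loaded_model
    return json.dumps(payload).encode("utf-8")
```

Rewriting the `model` field at the proxy is what lets clients with a hard-coded model name keep working unchanged.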
You can run the same server headless (Ollama-style), no GUI.
One-time install: put a tiny wrapper script on your PATH. (A plain symlink to the app binary tends to fail with "Unable to find helper app".) If you already made a symlink, remove it first:
which gemx # find it (skip if you never made one)
rm "$(which gemx)" # remove the symlink
sudo tee /usr/local/bin/gemx >/dev/null <<'EOF'
#!/bin/bash
exec "/Applications/GemX.app/Contents/MacOS/GemX" "$@"
EOF
sudo chmod +x /usr/local/bin/gemx

Then:
# Start the server (defaults: port 11535, last-used model, loopback)
gemx serve
gemx serve --port 8080 --model mlx-community/gemma-4-e4b-it-4bit
gemx serve --lan --token my-secret # expose on the LAN with auth
gemx serve --help

`gemx serve` boots the same MLX runtime and proxy with no window, prints the endpoint, and stays in the foreground until you Ctrl-C. The model weights and Python runtime must already be installed, so launch the GemX app once first to set those up. (Building from source? `npm run serve` does the same against a dev build.)
You can't open the GemX window while `gemx serve` is running (and vice versa); this is expected. `gemx serve` is the same app in headless mode, not a separate daemon, so the two share one app instance and one model server (port 11534). While serve holds them, double-clicking GemX.app just reactivates the windowless headless process, so no UI appears.

If you want the UI and the API endpoint at the same time, don't use `gemx serve` at all: just run the app and turn on Settings ▸ Developer ▸ Local API server. That serves the same OpenAI-compatible endpoint on port 11535 from inside the running app. Use `gemx serve` only when you want the server without a GUI (a headless box, an SSH session, a login-item daemon).
GemX emits native OpenAI tool_calls (Gemma 4 via mlx-vlm — verified end-to-end), so agentic harnesses work, not just chat. Setup guides for Cline, Kilo Code, Continue.dev, Zed, JetBrains AI Assistant, Goose, and OpenCode — plus the two tools that can't connect (Codex CLI is Responses-API-only; Cursor can't reach localhost) and why — live in docs/api-clients.md. Short version: point any of them at http://127.0.0.1:11535/v1 with any API key, and prefer Gemma 4 12B with a raised context window for agent work — small models emit valid tool calls but are clumsier on long multi-step tasks.
Click the + Custom button next to the model picker to add any mlx-vlm or mlx-lm model from the mlx-community HuggingFace organisation.
- Only `mlx-community/*` repos are accepted. Other repos may not be MLX-quantised and won't load. Browse the curated catalogue at https://huggingface.co/mlx-community/collections.
- Paste a repo id (e.g. `mlx-community/Qwen3-8B-MLX-4bit`) and GemX probes HuggingFace for runtime (multimodal `mlx-vlm` vs text-only `mlx-lm`), default + max context window, and thinking support, all auto-detected from the model's `config.json` and chat template.
- Runtime is auto-detected and locked at add time: the modal shows it as a read-only chip with no override. Every other field (label, context window, max context, thinking support, description) is editable because `config.json` is sometimes wrong. All settings become immutable after Add; a prominent banner in the modal reminds you that the only way to fix a mistake is to remove and re-add. Default context for new customs is 16 K to keep first-launch RAM modest.
- Custom models appear in the header picker under a Custom subheader (label + runtime).
- Removal is in Settings → Downloaded Models (custom models appear there with a `custom` tag and the same trash icon as built-ins). Removing also clears the cached weights.
- The Settings Thinking toggle auto-disables when the active custom model doesn't expose a reasoning channel; the Context Window slider's upper bound respects the model's Max.
- Image attach is blocked for text-only (`mlx-lm`) models. When a custom `mlx-lm` model is active, drag-and-drop / paste / file-picker of `image/*` files is filtered out with a "This model is text-only — images can't be attached." toast. PDFs, code, and audio attachments still work.
- Runtime-mismatch errors are caught and surfaced clearly. If a model was auto-detected with the wrong runtime (the server fails to load, or a stray image reaches a text-only server), GemX intercepts the cryptic stderr and surfaces an action-oriented advisory pointing you at Settings → Downloaded Models to remove and re-add.
- The first time you select a text-only (`mlx-lm`) custom model, GemX installs `mlx-lm` into the existing venv, a one-time ~30 s step. Subsequent custom-model boots skip it.
- Gated repos require your HuggingFace token in Settings.
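A probe along these lines could derive the runtime and context limits from a downloaded `config.json`. The heuristics here are assumptions, not GemX's documented logic: treating a `vision_config` key as the multimodal signal and reading `max_position_embeddings` are common Hugging Face conventions. Only the 16 K default comes from the text above.

```python
DEFAULT_CONTEXT = 16_384  # "Default context for new customs is 16 K"


def probe_config(config: dict) -> dict:
    """Guess runtime and context limits from a Hugging Face style config.json dict."""
    # Multimodal configs often nest the language-model settings under "text_config".
    text_cfg = config.get("text_config") or config
    runtime = "mlx-vlm" if "vision_config" in config else "mlx-lm"
    max_context = text_cfg.get("max_position_embeddings")
    default_context = min(DEFAULT_CONTEXT, max_context) if max_context else DEFAULT_CONTEXT
    return {
        "runtime": runtime,
        "max_context": max_context,  # None when the config does not say
        "default_context": default_context,
    }
```

Since such heuristics can misfire, it makes sense that the detected values stay editable before Add, as described above.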
Apple's MLX is the framework. Two main runtimes are built on top of it:
- `mlx-lm`: Apple's text-only LLM runtime. Optimized, polished, but doesn't load vision encoders.
- `mlx-vlm`: Vision-language model runtime by Prince Canuma. Knows how to load multimodal architectures (Gemma 4, Pixtral, Qwen-VL, LLaVA, …) end-to-end. Exposes an OpenAI-compatible `/v1/chat/completions` endpoint that accepts `image_url` content blocks alongside text.

GemX uses mlx-vlm because Gemma 4 is a natively multimodal model: its image encoder is part of the architecture, trained jointly with the language model. Stripping that out via a text-only loader would discard half of what the model can do. mlx-vlm preserves and exposes the full stack with the performance benefits of Apple's MLX framework on M-series hardware (unified memory, fused GPU kernels).
┌──────────────────────────────────────────────────────────┐
│ Renderer (React + TS + Vite) │
│ • Chat UI, streaming display, markdown rendering │
│ • Whisper (in-browser via @huggingface/transformers) │
│ • localStorage: conversations + settings + sidebar │
└────────────────────────┬─────────────────────────────────┘
│ contextBridge IPC
┌────────────────────────┴─────────────────────────────────┐
│ Main process (Electron + TS) │
│ • MLX server lifecycle (spawn / abort / restart) │
│ • Tool execution: web_search (Tavily / DDG), fetch_url │
│ • XML tool-call parsing after stream completion │
│ • File I/O: tokens/, .gemx-setup-complete │
│ • Document parsing: pdf-parse, mammoth │
│ • will-navigate interceptor → shell.openExternal │
└────────────────────────┬─────────────────────────────────┘
│ HTTP SSE
┌────────────────────────┴─────────────────────────────────┐
│ mlx-vlm.server (Python subprocess, 127.0.0.1:11534) │
│ • Loads Gemma 4 weights into Apple MLX │
│ • OpenAI-compatible /v1/chat/completions │
│ • Streams tokens back over Server-Sent Events │
└──────────────────────────────────────────────────────────┘
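The bottom hop in the diagram streams tokens as Server-Sent Events. The sketch below reads such a stream in the widely used OpenAI chunk format. It assumes each event is a `data:` line carrying a JSON chunk and that the stream ends with `data: [DONE]`. Whether mlx-vlm emits that exact terminator is not confirmed here.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE chat-completion chunks."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # conventional end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

Joining the yielded fragments reconstructs the full reply, and stopping iteration early corresponds to a mid-stream cancel on the client side.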
GemX doesn't use the OpenAI tools / tool_choice API fields — empirically that path produces empty responses with Gemma 4 + mlx-vlm. Instead it uses a Cline-style pure-XML format in the model's text output:
<web_search>
<query>latest AI news</query>
</web_search>

After each round, the main process:
- Accumulates the streamed text
- Regex-matches for `<web_search>` and `<fetch_url>` blocks
- Executes the tool
- Injects the result back as a `role: 'user'` message (avoiding the OpenAI `tool` role, which Gemma 4 handles poorly via mlx-vlm)
- Sends a tool-specific follow-up instruction; e.g., after `web_search`, the injection demands "You MUST now call `fetch_url` on the most relevant URL. Do NOT summarize from snippets."
Up to 10 rounds per turn — enough for one search + several fetches + the final synthesis.
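The loop described above can be sketched in a few lines. GemX's real implementation lives in the Electron main process. This version only illustrates the parse, execute, and inject cycle, with `generate` and `execute` as hypothetical callables supplied by the caller and an assumed layout for the injected result message.

```python
import re

TOOL_PATTERN = re.compile(
    r"<web_search>\s*<query>(?P<query>.*?)</query>\s*</web_search>"
    r"|<fetch_url>\s*<url>(?P<url>.*?)</url>\s*</fetch_url>",
    re.DOTALL,
)
MAX_ROUNDS = 10  # "Up to 10 rounds per turn"
FOLLOW_UP = {  # follow-up wording quoted from the description above
    "web_search": "You MUST now call fetch_url on the most relevant URL. "
                  "Do NOT summarize from snippets.",
}


def parse_tool_calls(text: str) -> list[tuple[str, str]]:
    """Return (tool_name, argument) pairs found in the model's text output."""
    calls = []
    for match in TOOL_PATTERN.finditer(text):
        if match.group("query") is not None:
            calls.append(("web_search", match.group("query").strip()))
        else:
            calls.append(("fetch_url", match.group("url").strip()))
    return calls


def run_turn(generate, execute, messages: list[dict]) -> str:
    """Alternate between model output and tool execution until a plain answer appears."""
    reply = ""
    for _ in range(MAX_ROUNDS):
        reply = generate(messages)
        calls = parse_tool_calls(reply)
        if not calls:
            return reply  # no tool blocks: this is the final answer
        messages.append({"role": "assistant", "content": reply})
        for name, arg in calls:
            result = f"[{name} result]\n{execute(name, arg)}"
            if name in FOLLOW_UP:
                result += "\n\n" + FOLLOW_UP[name]
            messages.append({"role": "user", "content": result})  # user role, not "tool"
    return reply
```

Capping the loop keeps a confused model from searching forever, while still leaving room for one search, several fetches, and the final synthesis.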
For a contributor-level deep-dive — the full IPC contract, the MLX runtime lifecycle, the chat pipeline, the tool-calling loop, the data model, and the rationale behind the non-obvious design choices — see docs/.
| Layer | Technology | Why |
|---|---|---|
| Shell | Electron 42 | Native Mac app with web tech |
| Bundler | electron-vite (Vite 6) | Sub-second HMR in dev |
| UI | React 19 + TypeScript 5 | Type-safe components |
| Styles | Tailwind CSS 3 | Utility-first, fast iteration. Layout is responsive sm / md / lg, usable down to 700 × 500 |
| Inference | `mlx-vlm` (Python) on `127.0.0.1:11534` | Multimodal MLX runtime |
| Speech | `@huggingface/transformers` | In-browser Whisper, WebGPU |
| Embeddings (RAG) | `@huggingface/transformers` (`gte-small`) | In-browser document retrieval, WebGPU |
| Markdown | `react-markdown` + `remark-gfm` + `remark-math` + `rehype-katex` | Full GFM + LaTeX |
| Code highlight | `react-syntax-highlighter` (Prism) | Custom theme, no token backgrounds |
| Document parsing | `pdf-parse`, `pdfjs-dist`, custom OOXML/OMML→LaTeX (`jszip` + `@xmldom/xmldom`) | PDF text + page rendering; DOCX & PPTX text with math preserved as LaTeX |
| Search | Tavily API + DuckDuckGo fallback | AI-friendly, reliable |
Everything lives under ~/Library/Application Support/GemX/:
| Path | Size | Required? | What |
|---|---|---|---|
| `mlx/venv/` | ~500 MB | yes | Python virtualenv with mlx-vlm installed |
| `mlx/models/` | ~4–18 GB | yes | HuggingFace model cache (Gemma weights) |
| `tokens/hf-token` | <1 KB | optional | HuggingFace token (gated downloads) |
| `tokens/tavily-key` | <1 KB | optional | Tavily API key |
| `.gemx-setup-complete` | 33 B | auto | First-run skip marker |
| `Local Storage/` | ~100 KB | yes (your data) | Conversations + settings (Chromium IndexedDB-style files) |
| `Service Worker/` | ~150 MB | optional | Whisper model cache |
| `Cache/`, `Code Cache/`, `GPUCache/`, etc. | variable | auto | Chromium runtime caches, safe to delete |

localStorage keys (inside `Local Storage/`):
- `gemx:conversations:v7`: all your chats
- `gemx:settings:v1`: tools mode, context override
- `gemx:sidebar-open`: sidebar visibility state
Want a minimum backup? Copy mlx/, tokens/, and Local Storage/. Everything else regenerates.
GemX is local-first by design. Here is every outbound connection it can make:
| When | Where | Why | Optional? |
|---|---|---|---|
| First run only | `huggingface.co` | Download model weights | Required for first install of each model |
| First voice use only | `huggingface.co` (CDN) | Download Whisper base.en (~150 MB) | Required for first voice use |
| Web search query | `api.tavily.com` | Search results (if Tavily key set) | ✓ disable in Settings |
| Web search query (fallback) | `html.duckduckgo.com` | Search results (if no Tavily key) | ✓ disable in Settings |
| `fetch_url` tool | The destination domain | Page content | ✓ model only calls this if you ask it to |
| Connectivity check | none; uses `navigator.onLine` (OS-level) | Online/offline detection | n/a, no network call |
No telemetry, no analytics, no crash reporting, no auto-update pings. Your conversations are never transmitted anywhere.
npm run dev # Dev mode with HMR (Vite + Electron)
npm run typecheck # TypeScript check across main + renderer
npm run build # Compile + bundle
npm run dist # Package into a notarized .dmg installer

`npm run dist` produces `dist/GemX-<version>.dmg`. Drag to /Applications.
| Symptom | Fix |
|---|---|
| First run hangs on "Installing MLX" | Verify python3 --version is 3.10–3.13. Reinstall via brew install python@3.13 if needed. |
| Model download fails / 401 | Add a free HuggingFace token in Settings. |
| "DuckDuckGo blocked this request" | DDG is bot-challenging your IP. Get a free Tavily key (1 000/mo) — set it in Settings. |
| Voice transcription doesn't work | Grant microphone permission in macOS System Settings → Privacy. WebGPU falls back to WASM automatically. |
| OOM on long chats | Lower the Context Window in Settings, or switch to a smaller model. The 26B MoE activates only 4 B params per token but still holds all ~16 GB of weights in memory, so it needs at least 24 GB of RAM. |
The full set of release notes for every version — what changed, what shipped, and the rationale behind each round of work — lives in the Releases section. Each tagged release has the matching .dmg artifact and a complete changelog.
GemX was inspired by Ammaar Reshi's gemma-chat — the project that sparked the idea of a local-first Gemma chat app on Apple Silicon. Thank you.
- Gemma by Google DeepMind — the model family this is built around
- MLX by Apple Machine Learning Research — the inference framework
- mlx-vlm by Prince Canuma — the multimodal runtime that makes this possible
- transformers.js by HuggingFace — Whisper in the browser
- Cline — XML tool-call format inspiration
- Tavily — search API built for AI agents
Created by Avaneesh.
MIT — see LICENSE.
