GemX

Multimodal Gemma 4 on Apple Silicon — a chat app, a local API server, and an open MLX runtime in one.
Voice, vision, documents, web research — plus any mlx-community model you bring. All on your Mac.
No cloud. No telemetry. (Two optional API keys.)

Why GemX

Local LLMs on Mac aren't new — Ollama and LM Studio have both matured significantly through 2025–2026. So why another one?

GemX is opinionated. It started as a focused chat app built around Gemma 4 and mlx-vlm — the MLX runtime purpose-built for vision-language models — and grew into a small local-AI hub: a polished chat client, an OpenAI-compatible server other apps can build on (in-app or headless), and an open runtime that loads any mlx-community model. It's still Gemma-4-first, and it ships what a research-style assistant needs out of the box: voice input, document parsing, and multi-step web search with inline citations.

Ollama — open-source backend with an OpenAI-compatible API. Added a native macOS app in July 2025 and an MLX preview for Apple Silicon in March 2026. Multimodal engine landed May 2025 (Llama 3.2 Vision, Qwen3-VL, LLaVA); Gemma 4 multimodal is not in their headline model list. Web-search API exists (Sep 2025) but there is no built-in chat research workflow.
LM Studio — polished closed-source chat app with mlx-vlm support and explicit Gemma 4 multimodal listings. Supports document RAG and image attachments. No voice input, no built-in web search.
Most MLX chat apps in the ecosystem still use mlx-lm, Apple's text-only MLX runtime — fine for chat, but it doesn't load Gemma 4's image encoder.

GemX is the only project that bundles all of the following in one app:

Multimodal Gemma 4 via mlx-vlm — paste an image, the model sees it
Voice input via on-device Whisper
Document attachments — PDF, DOCX, PPTX, code, plain text
Web research with citations — Tavily + DuckDuckGo, multi-step research loop, clickable [N] sources
Local API server (opt-in) — an OpenAI-compatible endpoint so editors, scripts, and other apps can use GemX's loaded model as a backend; run it in-app or headless from the terminal (gemx serve); loopback-only by default, bearer-token auth, LAN exposure is a separate explicit opt-in
Bring your own model — load any mlx-community repo (vision mlx-vlm or text-only mlx-lm), auto-probed for runtime / context / thinking — not only the bundled Gemma 4 family
Custom instructions & personas — global instructions appended to the system prompt, plus named personas you switch per conversation
Sampling controls — Precise / Balanced / Creative presets, with raw temperature / top-p / top-k / repetition-penalty under Advanced
Native Mac chat UX — pinned chats, conversation search, ⌘K command palette, smart titles, ⌘B sidebar, context-usage meter, model hot-swap mid-session
100 % local inference — your prompts never leave the machine (the only optional outbound calls are to your chosen search backend)

Side-by-side

Verified against each project's own documentation in mid-2026. Caveats are listed in-cell instead of glossed over.

	GemX	Ollama	LM Studio
Form factor	Mac chat app	OSS API server + CLI; native app added Jul 2025	Mac/Win/Linux chat app + API
MLX backend	mlx-vlm (multimodal)	MLX preview (Mar 2026)	mlx-vlm (multimodal, May 2025)
Native multimodal Gemma 4	✓ via mlx-vlm	✗ text only (supported via mlx-lm)	✓ via mlx-vlm
Voice input (Whisper, on-device)	✓ built-in	✗	✗
Image attachments in chat	✓ paste / drag / browse	partial: via supported vision models; the mlx-lm path is text-only	✓ via supported vision models
Document attachments (PDF / DOCX / PPTX)	✓ PDF + DOCX + PPTX + code	✓ RAG (text only, not supported for all types)	✓ RAG (text only)
Built-in web search workflow	✓ multi-step Tavily + DDG	partial — web-search API, no chat loop	✗
Local OpenAI-compatible API	✓ opt-in; in-app or headless (`gemx serve`); loopback by default, token auth	✓ core feature	✓
Bring your own model	✓ any `mlx-community` repo (mlx-vlm or mlx-lm), auto-probed	✓ pull any model	✓ download any
Inline `[N]` citations + sources list	✓ enforced by system prompt	✗	✗
Pinned conversations, smart titles	✓	n/a (no built-in chat UI)	partial
Open source	✓ (MIT)	partial: open-source backend; GUI app is free but closed-source	✗ (free, closed-source)

What it does (at a glance)

Chat with Gemma 4 in a polished native Mac UI — full Markdown, KaTeX math, syntax-highlighted code blocks, and tables
See — drag or paste images straight into the conversation
Read — drop a PDF, DOCX, PPTX, or code/text file; math is preserved (DOCX/PPTX→LaTeX, PDF→vision), and big files are indexed locally so the relevant passages are retrieved per question (on-device RAG)
Listen — click the mic, talk, get a transcript via local Whisper
Search — agentic web research with Tavily, multi-step fetch loop, clickable citations
Serve — flip on the local API server (in-app toggle or headless gemx serve) and point any OpenAI-compatible client at http://127.0.0.1:11535/v1
Extend — bring your own model: load any mlx-community repo (multimodal or text-only), not just the five bundled Gemma 4 variants
Shape — custom instructions + per-conversation personas, and Precise/Balanced/Creative sampling presets
Switch models without restarting — five Gemma 4 variants from ~4 GB to ~19 GB, plus your custom adds
Stay private — every byte stays on your Mac except (optionally) your chosen search query

Requirements

Apple Silicon Mac (M1 or later) — the MLX runtime used here does not run on Intel Macs
macOS 13 Ventura or later
Python 3.10 – 3.13 (brew install python@3.13 if missing; 3.14+ not yet supported by mlx-vlm)
Node.js 20+ (only for building from source; not needed for the prebuilt .dmg)
~5 GB free disk space minimum (smallest model + Python venv + node_modules); up to 20 GB if you run the 31B variant

Install

There are two ways to get GemX running. Most users want Option 1. If you want to hack on the code or build your own .dmg, use Option 2.

Both routes share the same Requirements (Apple Silicon, macOS 13+, Python 3.10–3.13) and converge on the same first-launch flow.

Option 1 — Prebuilt `.dmg` (recommended)

1. Go to the Releases page on GitHub
2. Download the latest GemX-<version>.dmg
3. Double-click to mount, then drag GemX.app to your /Applications folder
4. Eject the disk image and launch GemX from Applications, Launchpad, or Spotlight
5. If macOS warns "GemX cannot be opened because the developer cannot be verified", right-click → Open the first time (one-time bypass). As a one-liner alternative, run in Terminal:
```
xattr -dr com.apple.quarantine /Applications/GemX.app
```

You do not need Node.js, npm, this repo, or any build tools for this route — just Python 3.10–3.13 on your PATH (brew install python@3.13 if missing). Everything else is provisioned by the app on first launch.

Option 2 — Build from source

For development or producing your own signed build:

git clone https://github.com/Avaneesh40585/GemX.git
cd GemX
npm install
npm run dev

This runs Electron + Vite in development mode with hot-module reloading. To produce a distributable .dmg yourself, see Building a distributable.

First launch (either option)

The very first time GemX starts, it will:

1. Detect Python on your PATH and create an isolated virtualenv at ~/Library/Application Support/GemX/mlx/venv/
2. pip install mlx-vlm into that venv (cached for future launches)
3. Show the model picker, where you choose one of the five Gemma 4 variants
4. Download the model weights from HuggingFace into the local cache
5. Boot the MLX inference server on 127.0.0.1:11534
6. Drop you into the chat

Subsequent launches skip steps 1–4 — the model loads from cache in seconds.
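The interpreter check in the first step can be pictured as a plain version-range test. The sketch below is illustrative only and is not GemX's actual code: the function name `is_supported_python` is hypothetical, and the only fact taken from this README is the supported range of 3.10 to 3.13.

```python
import sys

# Supported range from the Requirements section (3.14+ is not yet supported by mlx-vlm).
MIN_SUPPORTED = (3, 10)
MAX_SUPPORTED = (3, 13)


def is_supported_python(version: tuple) -> bool:
    """Return True if (major, minor) falls inside the supported range."""
    return MIN_SUPPORTED <= tuple(version) <= MAX_SUPPORTED


if __name__ == "__main__":
    current = sys.version_info[:2]
    print(f"Python {current[0]}.{current[1]} supported: {is_supported_python(current)}")
```

Running it with the interpreter that is first on your PATH is a quick way to see whether first launch is likely to get past the Python detection step.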

💡 Tip: A free HuggingFace token (added in Settings) speeds up model downloads and avoids rate limits. Gated models also require it.

Models

All variants are 4-bit quantized via mlx-community and natively multimodal (text + image). Switch between them at any time via the dropdown in the chat header.

Warning

Running a local LLM is resource-intensive. Inference saturates the GPU, holds the entire model in unified memory, and is sensitive to thermal throttling. Before launching a chat, close memory-heavy background apps — browsers with dozens of tabs, IDEs, Docker, Slack, games, video editors. If your Mac is on battery, plug it in. The bigger the model, the less headroom you have for everything else.

The Minimum column below is the smallest unified-memory configuration where the model loads and runs without swap thrashing or OOM kills. Recommended is the configuration where it feels comfortable with a couple of other apps open at the same time. RAM = unified memory on Apple Silicon.

Model	HF repo	Disk	Default context	Max context	Minimum RAM	Recommended RAM	Notes
Gemma 4 E2B	`mlx-community/gemma-4-e2b-it-4bit`	~4 GB	8 K	128 K	8 GB	16 GB	Fastest, lowest footprint. Good for quick Q&A.
Gemma 4 E4B ★	`mlx-community/gemma-4-e4b-it-4bit`	~5.5 GB	16 K	128 K	8 GB	16 GB	Recommended default. Best balance.
Gemma 4 12B	`mlx-community/gemma-4-12B-it-4bit`	~7 GB	32 K	256 K	12 GB	16 GB	Dense mid-tier — sharper reasoning than E4B without the MoE's RAM cliff. Comfortable on a 16 GB Mac.
Gemma 4 26B MoE	`mlx-community/gemma-4-26b-a4b-it-4bit`	~16 GB	64 K	256 K	24 GB	32 GB	Mixture-of-Experts: 26 B total params, 4 B active per token.
Gemma 4 31B	`mlx-community/gemma-4-31b-it-4bit`	~19 GB	64 K	256 K	32 GB	48–64 GB	Dense — highest raw quality. Comfortable on M3 Max / M4 Pro 36 GB+ or M-series Max/Ultra.

Important

Disk size is roughly the lower bound on RAM. Add ~2–4 GB for the OS, ~1–6 GB for the KV cache depending on context window, plus headroom for whatever else you're running. On a 16 GB Mac, stick to E2B or E4B, or run 12B at a modest context window with little else open.
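As a rough worked example of that arithmetic, the sketch below adds the three terms together. It is a back-of-the-envelope estimate, not a measurement: the helper name `estimate_ram_gb` is hypothetical, the 2 GB OS figure is simply the low end of the range above, and the weight and KV-cache figures are the approximate values from the tables in this section.

```python
def estimate_ram_gb(weights_gb: float, kv_cache_gb: float, os_gb: float = 2.0) -> float:
    """Floor estimate: weights + OS overhead + KV cache, with no headroom for other apps."""
    return weights_gb + os_gb + kv_cache_gb


# Gemma 4 E4B at 8 K context: ~5.5 GB weights, ~0.5 GB KV cache.
e4b = estimate_ram_gb(5.5, 0.5)
# Gemma 4 26B MoE at 64 K context: ~16 GB weights, ~5 GB KV cache.
moe = estimate_ram_gb(16.0, 5.0)

print(f"E4B @ 8K: ~{e4b:.1f} GB, 26B MoE @ 64K: ~{moe:.1f} GB")
```

Both results land at or just under the Minimum RAM column for those models, which is why the Recommended column leaves several extra gigabytes for everything else you have open.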

How the context window affects RAM

The context window (set per-session in Settings) determines how many tokens of conversation the model can hold at once. Larger windows = longer memory, but they reserve more KV-cache memory upfront — independently of how full the conversation actually is.

A rough rule of thumb: doubling the context window roughly doubles the KV-cache allocation. The cache scales with the model's layer count and hidden size, so the bigger the model, the more each context-window step costs.
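The linear scaling can be seen in the textbook KV-cache size formula, sketched below. This is a generic estimate under stated assumptions, not how mlx-vlm allocates memory: the model shape is hypothetical (it is not a real Gemma 4 config), the cache is assumed to store 16-bit values, and architectures with sliding-window layers or quantized caches will come in lower.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of a full-attention KV cache for one sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Hypothetical model shape, for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (8_192, 16_384, 32_768):
    gib = kv_cache_bytes(context_tokens=ctx, **shape) / 2**30
    print(f"{ctx:>6} tokens -> {gib:.1f} GiB")
```

With this shape, 8,192 tokens cost exactly 1 GiB and every doubling of the window doubles that figure, while a model with more layers or wider heads pays more per step.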

Practical guidance per model (approximate, in addition to the model weights themselves):

Model	8 K	32 K	64 K	128 K	256 K
Gemma 4 E2B	~0.3 GB	~1 GB	~2 GB	~4 GB	n/a (max 128 K)
Gemma 4 E4B	~0.5 GB	~2 GB	~4 GB	~6 GB	n/a (max 128 K)
Gemma 4 12B	~0.8 GB	~3 GB	~6 GB	~10 GB	~20 GB
Gemma 4 26B MoE	~1 GB	~3 GB	~5 GB	~10 GB	~18 GB
Gemma 4 31B	~1.5 GB	~5 GB	~8 GB	~14 GB	~26 GB

The 256 K column is the native max for the three larger models — and as the numbers above make clear, on a typical Mac you'll OOM long before reaching it. Most users want to stay at or below 64 K. Push it higher only if you have an Ultra-class machine with RAM to spare.

If you OOM mid-conversation:

Open Settings → Context window and step it down (it's a power-of-2 slider — each step halves the KV-cache reservation)
Close other RAM-hungry apps and reload the conversation
If it still doesn't fit, switch to a smaller model

Features in depth

Chat

Streaming responses with mid-stream cancel
Edit the last user message and re-send (regenerates from that point)
Regenerate the last assistant response with one click
Copy any message or any individual code block
Full GFM Markdown — headings, bold, italic, ~~strikethrough~~, tables, nested lists, blockquotes
LaTeX math — both inline $x^2$ and block $$\int_0^1 f(x)\,dx$$ rendered via KaTeX
Syntax-highlighted code blocks with:
- Language detection from the fence
- Pretty labels (cpp → C++, ts → TS, py → Python, rs → Rust, …)
- One-click copy
- Custom Atom-One-Dark-derived theme with no per-line backgrounds
Stop-and-resume indicators — clear "Response stopped" message when you cancel mid-stream, error blocks instead of raw error text in the bubble
Empty-response indicator — never silently swallows a turn (see Empty responses, explained below)
Collapsible thinking blocks — when Thinking is enabled in Settings, Gemma 4's reasoning trace streams into a collapsible block above the answer. The block header shimmers through rotating verbs (Thinking → Considering → Planning → Pondering → Reasoning → Sketching) while reasoning is in flight, then collapses to "Thought process" once the answer starts. The body renders with the full markdown stack — syntax-highlighted code blocks, KaTeX math, GFM tables, and target-blank links — so reasoning looks as polished as the answer itself. mlx-vlm's chat template strips past thinking from history automatically so multi-turn context stays clean.

Empty responses, explained

Sometimes the very first turn of a fresh conversation comes back empty — the message bubble appears with just a small "No response" indicator instead of text. This isn't a crash; it's a known quirk of Gemma 4 + mlx-vlm while the model finishes warming up.

What's happening under the hood:

On the first turn after launch (or after switching models), the MLX server is still bringing weights into unified memory, compiling kernels, and JIT-priming the cache.
When tools are enabled, GemX also issues the request with a stricter system prompt that tells the model to emit XML tool blocks. Occasionally Gemma 4 returns a stop token before producing any user-visible text — the same pattern that makes the tools / tool_choice API path unusable with this model.
Rather than silently swallow the turn (which would leave you staring at a blank bubble wondering if it's still working), GemX detects the zero-token completion and renders the "No response" indicator. The conversation state is preserved.

What to do:

Click Regenerate on the empty message — second attempts almost always succeed
If it happens repeatedly on the same turn, try shortening / rephrasing the prompt, or temporarily turn Web Search off in Settings (the tool-enabled system prompt is what most often triggers it)
After 2–3 successful turns in a chat the quirk effectively disappears — the model is fully warm

This is purely a UI affordance; nothing is logged or sent anywhere. The "No response" state is mutually exclusive with the "Response stopped" indicator (manual cancel) and the red error indicator (actual generation failure), so you can always tell what happened at a glance.

Image attachments

Paste (⌘V), drag-and-drop, or browse via the attach button
Up to 4 images per message
Client-side resize to 1024 px longest edge, JPEG-encoded — saves memory and tokens
Thumbnails render in the user bubble; click to view full-size
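The resize step boils down to a small piece of aspect-ratio arithmetic. The sketch below shows one way to compute the target dimensions; it is not GemX's renderer code, the function name `fit_longest_edge` is hypothetical, and leaving smaller images untouched (no upscaling) is an assumption.

```python
def fit_longest_edge(width: int, height: int, max_edge: int = 1024) -> tuple:
    """Scale (width, height) so the longer side is at most max_edge, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # assumption: images that already fit are never upscaled
    new_width = max(1, round(width * max_edge / longest))
    new_height = max(1, round(height * max_edge / longest))
    return new_width, new_height


print(fit_longest_edge(4032, 3024))  # a 12 MP phone photo
print(fit_longest_edge(800, 600))    # already under the limit
```

A 4032 × 3024 photo comes out at 1024 × 768, which is roughly a sixteenth of the original pixel count before JPEG encoding.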

Document attachments

GemX reads PDF, DOCX, PPTX, and text-based files (code, plaintext, Markdown, JSON/YAML, …) — not arbitrary binaries. Images go to the model as vision input; audio is transcribed (see below).

DOCX — math preserved as LaTeX. Word stores equations as structured OMML; GemX converts them to LaTeX ( $…$ ) instead of dropping them like mammoth does, so formulas survive into the model and render via KaTeX. (Falls back to plain text on any odd file.)
PPTX — slides, math, and notes. PowerPoint is the same OOXML format as Word, so GemX extracts each slide's text + equations (OMML→LaTeX) in display order and includes speaker notes. (Slide images / diagrams aren't captured — text and math only.)
PDF — math read by the vision model. A PDF stores equations as un-extractable glyphs, so when a multimodal model (e.g. Gemma 4) is active, GemX renders pages to images and attaches the relevant page(s) to your question at retrieval time so the model reads the real math. With a text-only model it falls back to plain pdf-parse text.
Code / plaintext / Markdown read directly — .js, .ts, .py, .cpp, .rs, .go, .java, .md, .txt, .json, .yaml, …
Smart retrieval for large files — documents under ~8 KB are inlined whole; bigger files are indexed locally (chunked + embedded with a tiny on-device model, gte-small via Transformers.js) and the passages relevant to each question are pulled in per turn. No cloud, no 8 KB truncation, and the document never gets pruned out of a long conversation. A brief "Indexing…" status shows in the composer while it works; vectors (and rendered PDF pages) live on disk under rag/<conversation>/, never in the conversation blob.
Shows as a filename chip above the bubble — never raw <context> XML
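The inline-or-retrieve decision for large files can be sketched as follows. This is a toy illustration, not GemX's implementation: the word-count `embed` function stands in for the real gte-small embeddings, the 500-character chunk size and `top_k` value are arbitrary, and all function names are hypothetical. Only the ~8 KB inline threshold comes from the description above.

```python
import math
import re
from collections import Counter

INLINE_LIMIT = 8 * 1024  # documents under ~8 KB are inlined whole


def chunk(text: str, size: int = 500) -> list:
    """Split text into fixed-size character chunks (assumed chunking strategy)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def context_for(document: str, question: str, top_k: int = 2) -> list:
    """Return the text to hand to the model for this question."""
    if len(document.encode("utf-8")) < INLINE_LIMIT:
        return [document]  # small file: no retrieval needed
    query = embed(question)
    ranked = sorted(chunk(document), key=lambda c: cosine(embed(c), query), reverse=True)
    return ranked[:top_k]
```

A real index would embed each chunk once at attach time and store the vectors on disk, rather than re-embedding on every question as this sketch does.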

Voice input

Click the mic button to start recording; live timer; click again to stop
Local speech-to-text via Whisper base.en (no cloud)
The Whisper model (~150 MB) downloads once on first use and is cached in the embedded browser's Cache Storage (the Service Worker/ folder on disk)
WebGPU by default with automatic WASM fallback
Currently English only

Web search & research loop

GemX is agentic when it comes to research — it doesn't just summarize from snippets.

Primary: Tavily API — purpose-built for AI agents, 1 000 free searches/month, no card required
Fallback: DuckDuckGo HTML scrape — works when DDG isn't bot-blocking; GemX detects the bot-challenge HTML and surfaces a clear message asking you to set a Tavily key
Two tools are wired up to the model:
- <web_search><query>…</query></web_search> → returns numbered results
- <fetch_url><url>…</url></fetch_url> → returns the page text (≤8 KB); escalates to Tavily extraction for JavaScript-heavy pages, and says so honestly if a page can't be read (instead of inventing it)
Mandatory multi-step workflow (enforced by both the system prompt and dynamic per-tool follow-up messages):
1. Search → list of URLs
2. Fetch top 2-3 URLs — landing-page snippets are NOT enough
3. Synthesize — final answer with citations
Online/offline auto-detection via navigator.onLine + online/offline event listeners; in Auto mode tools deactivate the moment you go offline

Citations

The model emits inline [1], [2] markers as it writes
A required "Sources:" section at the bottom of every research answer lists those references as clickable markdown links
Clicking a citation opens the source in your default browser — a will-navigate interceptor in the main process routes external HTTPS links via shell.openExternal, never in-app

Conversation management

Unlimited saved conversations in the sidebar
Smart auto-generated titles — after your first turn, a background model call writes a 3–5 word title
Pinned conversations float to a dedicated "Pinned" section at the top
3-dot dropdown menu per chat: Pin / Unpin · Rename · Delete
Inline rename — click rename, type, Enter to commit
Delete with confirmation
All conversations live in localStorage under gemx:conversations:v7

Smart context pruning

When a conversation grows past 75 % of the model's context window, GemX prunes it gracefully:

Phase 1 — large tool results (fetch_url content blocks) are collapsed to short summaries
Phase 2 — if still over the limit, drop the oldest user/assistant turns one by one, always preserving the system prompt and the most recent user message

The result: no "context length exceeded" errors mid-research.
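The two phases can be sketched as a small function over a message list. This is a simplified illustration under several assumptions, not GemX's code: the 4-characters-per-token estimate, the `is_tool_result` flag, the 200-character cutoff, and truncation as a stand-in for a real summary are all hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Crude assumption: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def total_tokens(messages: list) -> int:
    return sum(estimate_tokens(m["content"]) for m in messages)


def prune(messages: list, context_window: int, threshold: float = 0.75) -> list:
    budget = int(context_window * threshold)
    msgs = [dict(m) for m in messages]  # work on a copy, leave the caller's list intact
    if total_tokens(msgs) <= budget:
        return msgs

    # Phase 1: collapse large tool results (truncation stands in for a summary).
    for m in msgs:
        if m.get("is_tool_result") and len(m["content"]) > 200:
            m["content"] = m["content"][:200] + " [tool result truncated]"

    # Phase 2: drop the oldest turns one by one, always keeping the system
    # prompt and the most recent user message.
    while total_tokens(msgs) > budget:
        last_user = max((i for i, m in enumerate(msgs) if m["role"] == "user"), default=-1)
        victim = next((i for i, m in enumerate(msgs)
                       if m["role"] != "system" and i < last_user), None)
        if victim is None:
            break  # nothing left that is safe to drop
        del msgs[victim]
    return msgs
```

Phase 1 runs first because it is the cheaper loss: a collapsed fetch result costs a few paragraphs of page text, while a dropped turn removes part of the conversation itself.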

Sidebar

Toggle with ⌘B or the icon in the chat header
Hidden state is persisted across sessions; on first launch the sidebar auto-collapses on small windows (< 640 px wide) so the chat area isn't crowded
Slim mode below md (768 px) — w-52 instead of w-60
Two sections: Pinned (if any) then Chats

Settings

The settings modal has a left nav rail (a horizontal tab strip on narrow windows) with five sections:

Behavior — controls that shape a response:
- Thinking mode (Disabled / Enabled) — Gemma 4's native reasoning channel; slower but stronger on math/code/multi-step questions
- Web Search mode (Disabled / Enabled) with an Online/Offline dot, plus the Tavily API key (recommended; stored at tokens/tavily-key) when enabled
- Context window — a powers-of-2 stepper (1 024 → 262 144) with per-model maximums and a Reset button
- Response Style — Precise / Balanced / Creative presets, plus an Advanced expander with raw temperature / top-p / top-k / repetition-penalty (editing a value flips the preset to "custom")
Personalization:
- Custom Instructions — free text appended to GemX's system prompt on every chat (date/tools/vision keep working)
- Personas — named, reusable prompts you create here and switch per conversation from a header dropdown
Models — every cached built-in plus every custom you've added (one-click delete per row; the active model can't be deleted until you switch off it), and the optional HuggingFace token for gated downloads (stored at tokens/hf-token)
Appearance — Theme: System / Light / Dark (System follows macOS Appearance live and keeps the titlebar tint + native scrollbars aligned)
Developer — the Local API server (see below)

Token files have 0o600 permissions (owner read/write only); token inputs support Enter to save, show a brief Saved ✓, and have always-visible Clear buttons.
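For readers unfamiliar with the `0o600` notation, the sketch below shows what creating an owner-only file looks like. It is a Python illustration of the idea rather than GemX's main-process code, and the helper name `write_secret` and the example token value are hypothetical.

```python
import os
import stat
import tempfile


def write_secret(path: str, secret: str) -> None:
    """Create (or overwrite) a file that only its owner can read and write."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.fchmod(fd, 0o600)  # enforce the mode even if the file already existed
        os.write(fd, secret.encode("utf-8"))
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    token_path = os.path.join(tmp, "hf-token")
    write_secret(token_path, "hf_example_token")
    print(oct(stat.S_IMODE(os.stat(token_path).st_mode)))
```

You can check the real files the same way with `ls -l` inside the tokens/ folder: the mode column should read `-rw-------`.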

Model hot-swap

Click the model name in the header → dropdown listing the five built-in variants plus any custom models you've added
Selecting a different model restarts the MLX server internally
Loading overlay appears during the switch; conversation history is preserved

Local API server

Turn on Settings ▸ Developer ▸ Local API server to expose an OpenAI-compatible endpoint so editors, scripts, or any OpenAI client can use GemX's loaded model as a backend. It's a thin proxy in front of the MLX server, so it serves whichever model is currently loaded and shares that one process with in-app chats.

Defaults are conservative: off until you enable it, bound to 127.0.0.1 only, and /v1/*-only. The panel shows the base URL (default http://127.0.0.1:11535/v1) with a copy button and a live running/stopped dot.

# List the loaded model
curl http://127.0.0.1:11535/v1/models

# Chat (streaming)
curl http://127.0.0.1:11535/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{"model":"gemx","messages":[{"role":"user","content":"hi"}],"stream":true}'

Bearer token (optional on loopback, required for network access): set one in the panel, then send -H "Authorization: Bearer <token>". Once a token is set, requests without it get 401.
LAN exposure is a separate, explicit opt-in (binds 0.0.0.0) with a visible warning — only enable it on trusted networks, and always with a token.
The model field in your request can be anything — GemX rewrites it to the currently-loaded model before forwarding (the MLX server would otherwise try to load whatever id a client sends). GET /v1/models likewise returns exactly the loaded model.
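The same calls can be made from a script with nothing but the Python standard library. The sketch below is a minimal client under a few assumptions: the helper names are hypothetical, it sends `"stream": false` (this README only shows the streaming form), and it expects the standard OpenAI chat-completion response shape.

```python
import json
import urllib.request
from typing import Optional


def build_chat_request(prompt: str, base_url: str = "http://127.0.0.1:11535/v1",
                       token: Optional[str] = None) -> urllib.request.Request:
    payload = {
        "model": "gemx",  # any value works; GemX rewrites it to the loaded model
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send one prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server enabled, `chat("hi")` returns the reply as a string, and `chat("hi", token="my-secret")` adds the bearer header once you have set a token in the panel.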

From the terminal, without opening the app

You can run the same server headless (Ollama-style), no GUI.

One-time install — put a tiny wrapper script on your PATH. ⚠️ Not a symlink: macOS Electron locates its bundled Helper apps relative to the executable path, so a symlinked launch crashes with Unable to find helper app. If you already made one, remove it first:

which gemx          # find it (skip if you never made one)
rm "$(which gemx)"  # remove the symlink
sudo tee /usr/local/bin/gemx >/dev/null <<'EOF'
#!/bin/bash
exec "/Applications/GemX.app/Contents/MacOS/GemX" "$@"
EOF
sudo chmod +x /usr/local/bin/gemx

Then:

# Start the server (defaults: port 11535, last-used model, loopback)
gemx serve
gemx serve --port 8080 --model mlx-community/gemma-4-e4b-it-4bit
gemx serve --lan --token my-secret      # expose on the LAN with auth
gemx serve --help

gemx serve boots the same MLX runtime and proxy with no window, prints the endpoint, and stays in the foreground until you Ctrl-C. The model weights and Python runtime must already be installed — launch the GemX app once first so it can set those up. (Building from source? npm run serve does the same against a dev build.)

You can't open the GemX window while gemx serve is running (and vice-versa) — this is expected. gemx serve is the same app in headless mode, not a separate daemon, so the two share one app instance and one model server (port 11534). While serve holds them, double-clicking GemX.app just reactivates the windowless headless process, so no UI appears.

If you want the UI and the API endpoint at the same time, don't use gemx serve at all — just run the app and turn on Settings ▸ Developer ▸ Local API server. That serves the same OpenAI-compatible endpoint on port 11535 from inside the running app. Use gemx serve only when you want the server without a GUI (a headless box, an SSH session, a login-item daemon).

Use it from editors & agent harnesses

GemX emits native OpenAI tool_calls (Gemma 4 via mlx-vlm — verified end-to-end), so agentic harnesses work, not just chat. Setup guides for Cline, Kilo Code, Continue.dev, Zed, JetBrains AI Assistant, Goose, and OpenCode — plus the two tools that can't connect (Codex CLI is Responses-API-only; Cursor can't reach localhost) and why — live in docs/api-clients.md. Short version: point any of them at http://127.0.0.1:11535/v1 with any API key, and prefer Gemma 4 12B with a raised context window for agent work — small models emit valid tool calls but are clumsier on long multi-step tasks.

Bring your own model

Click the + Custom button next to the model picker to add any mlx-vlm or mlx-lm model from the mlx-community HuggingFace organisation.

Only mlx-community/* repos are accepted. Other repos may not be MLX-quantised and won't load. Browse the curated catalogue at https://huggingface.co/mlx-community/collections.
Paste a repo id (e.g. mlx-community/Qwen3-8B-MLX-4bit) and GemX probes HuggingFace for runtime (multimodal mlx-vlm vs text-only mlx-lm), default + max context window, and thinking support — auto-detected from the model's config.json and chat template.
Runtime is auto-detected and locked at add time — the modal shows it as a readonly chip with no override. Every other field (label, context window, max context, thinking support, description) is editable because config.json is sometimes wrong. All settings become immutable after Add — a prominent banner in the modal reminds you that the only way to fix a mistake is to remove and re-add. Default context for new customs is 16 K to keep first-launch RAM modest.
Custom models appear in the header picker under a Custom subheader (label + runtime).
Removal is in Settings → Downloaded Models (custom models appear there with a custom tag and the same trash icon as built-ins). Removing also clears the cached weights.
The Settings Thinking toggle auto-disables when the active custom model doesn't expose a reasoning channel; the Context Window slider's upper bound respects the model's Max.
Image attach is blocked for text-only (mlx-lm) models. When a custom mlx-lm model is active, image/* files added by drag-and-drop, paste, or the file picker are filtered out, and GemX shows a "This model is text-only — images can't be attached." toast. PDFs, code, and audio attachments still work.
Runtime-mismatch errors are caught and surfaced clearly. If a model was auto-detected with the wrong runtime — server fails to load, or a stray image reaches a text-only server — GemX intercepts the cryptic stderr and surfaces an action-oriented advisory pointing you at Settings → Downloaded Models to remove and re-add.
The first time you select a text-only (mlx-lm) custom model, GemX installs mlx-lm into the existing venv — a one-time ~30 s step. Subsequent custom-model boots skip it.
Gated repos require your HuggingFace token in Settings.
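To make the add-time checks concrete, here is a small sketch of a repo-id filter and a runtime guess. It is a guess at the shape of the logic, not GemX's actual probe: the function names are hypothetical, and treating a `vision_config` block in config.json as the multimodal signal is an assumption based on how many HuggingFace vision-language configs are laid out.

```python
import re

# Only mlx-community/* repos are accepted.
REPO_ID = re.compile(r"mlx-community/[A-Za-z0-9][A-Za-z0-9._-]*")


def is_accepted_repo(repo_id: str) -> bool:
    return REPO_ID.fullmatch(repo_id) is not None


def guess_runtime(config: dict) -> str:
    """Pick a runtime from a parsed config.json (heuristic, see assumptions above)."""
    return "mlx-vlm" if "vision_config" in config else "mlx-lm"


print(is_accepted_repo("mlx-community/Qwen3-8B-MLX-4bit"))
print(guess_runtime({"model_type": "qwen3"}))
```

Because a heuristic like this can misfire, the runtime-mismatch advisory described above matters: a wrong guess is recoverable by removing and re-adding the model.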

Why `mlx-vlm` specifically

Apple's MLX is the framework. Two main runtimes are built on top of it:

mlx-lm — Apple's text-only LLM runtime. Optimized, polished, but doesn't load vision encoders.
mlx-vlm — Vision-language model runtime by Prince Canuma. Knows how to load multimodal architectures (Gemma 4, Pixtral, Qwen-VL, LLaVA, …) end-to-end. Exposes an OpenAI-compatible /v1/chat/completions endpoint that accepts image_url content blocks alongside text.

GemX uses mlx-vlm because Gemma 4 is a natively multimodal model — its image encoder is part of the architecture, trained jointly with the language model. Loading it through a text-only runtime would discard the model's vision capability. mlx-vlm preserves and exposes the full stack with the performance benefits of Apple's MLX framework on M-series hardware (unified memory, Metal GPU kernels, lazy evaluation).

Architecture

┌──────────────────────────────────────────────────────────┐
│  Renderer (React + TS + Vite)                            │
│  • Chat UI, streaming display, markdown rendering        │
│  • Whisper (in-browser via @huggingface/transformers)    │
│  • localStorage: conversations + settings + sidebar      │
└────────────────────────┬─────────────────────────────────┘
                         │ contextBridge IPC
┌────────────────────────┴─────────────────────────────────┐
│  Main process (Electron + TS)                            │
│  • MLX server lifecycle (spawn / abort / restart)        │
│  • Tool execution: web_search (Tavily / DDG), fetch_url  │
│  • XML tool-call parsing after stream completion         │
│  • File I/O: tokens/, .gemx-setup-complete               │
│  • Document parsing: pdf-parse, custom OOXML             │
│  • will-navigate interceptor → shell.openExternal        │
└────────────────────────┬─────────────────────────────────┘
                         │ HTTP SSE
┌────────────────────────┴─────────────────────────────────┐
│  mlx-vlm.server (Python subprocess, 127.0.0.1:11534)     │
│  • Loads Gemma 4 weights into Apple MLX                  │
│  • OpenAI-compatible /v1/chat/completions                │
│  • Streams tokens back over Server-Sent Events           │
└──────────────────────────────────────────────────────────┘

Tool calling

GemX doesn't use the OpenAI tools / tool_choice API fields — empirically that path produces empty responses with Gemma 4 + mlx-vlm. Instead it uses a Cline-style pure-XML format in the model's text output:

<web_search>
  <query>latest AI news</query>
</web_search>

After each round, the main process:

Accumulates the streamed text
Regex-matches for <web_search> and <fetch_url> blocks
Executes the tool
Injects the result back as a role: 'user' message (avoiding the OpenAI tool role which Gemma 4 handles poorly via mlx-vlm)
Sends a tool-specific follow-up instruction — e.g., after web_search, the injection demands "You MUST now call fetch_url on the most relevant URL. Do NOT summarize from snippets."

Up to 10 rounds per turn — enough for one search + several fetches + the final synthesis.
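The loop above can be sketched in a few lines. This is a simplified Python rendering of the described flow, not the Electron main-process code: `generate` and `execute` are caller-supplied stand-ins for the model call and the tool runner, the regexes are an assumed way to match the XML blocks, and the per-tool follow-up instruction is omitted.

```python
import re

TOOL_PATTERNS = {
    "web_search": re.compile(r"<web_search>\s*<query>(.*?)</query>\s*</web_search>", re.DOTALL),
    "fetch_url": re.compile(r"<fetch_url>\s*<url>(.*?)</url>\s*</fetch_url>", re.DOTALL),
}
MAX_ROUNDS = 10


def parse_tool_call(text: str):
    """Return (tool_name, argument) for the earliest tool block in text, or None."""
    found = []
    for name, pattern in TOOL_PATTERNS.items():
        match = pattern.search(text)
        if match:
            found.append((match.start(), name, match.group(1).strip()))
    if not found:
        return None
    _, name, arg = min(found)
    return name, arg


def run_turn(generate, execute, messages: list) -> str:
    """generate(messages) -> str and execute(name, arg) -> str are supplied by the caller."""
    reply = ""
    for _ in range(MAX_ROUNDS):
        reply = generate(messages)
        call = parse_tool_call(reply)
        if call is None:
            return reply  # no tool block: this is the final answer
        name, arg = call
        result = execute(name, arg)
        messages.append({"role": "assistant", "content": reply})
        # Tool output goes back as a user message, not an OpenAI "tool" role.
        messages.append({"role": "user", "content": f"[{name} result]\n{result}"})
    return reply  # round limit reached
```

Parsing only after the stream completes keeps the matcher simple, at the cost of not starting a tool until the model has finished its turn.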

Developer documentation

For a contributor-level deep-dive — the full IPC contract, the MLX runtime lifecycle, the chat pipeline, the tool-calling loop, the data model, and the rationale behind the non-obvious design choices — see docs/.

Tech stack

Layer	Technology	Why
Shell	Electron 42	Native Mac app with web tech
Bundler	electron-vite (Vite 6)	Sub-second HMR in dev
UI	React 19 + TypeScript 5	Type-safe components
Styles	Tailwind CSS 3	Utility-first, fast iteration. Layout is responsive `sm` / `md` / `lg` — usable down to 700 × 500
Inference	mlx-vlm (Python) on `127.0.0.1:11534`	Multimodal MLX runtime
Speech	`@huggingface/transformers`	In-browser Whisper, WebGPU
Embeddings (RAG)	`@huggingface/transformers` (`gte-small`)	In-browser document retrieval, WebGPU
Markdown	`react-markdown` + `remark-gfm` + `remark-math` + `rehype-katex`	Full GFM + LaTeX
Code highlight	`react-syntax-highlighter` (Prism)	Custom theme, no token backgrounds
Document parsing	`pdf-parse`, `pdfjs-dist`, custom OOXML/OMML→LaTeX (`jszip` + `@xmldom/xmldom`)	PDF text + page rendering; DOCX & PPTX text with math preserved as LaTeX
Search	Tavily API + DuckDuckGo fallback	AI-friendly, reliable

Storage & settings

Everything lives under ~/Library/Application Support/GemX/:

Path	Size	Required?	What
`mlx/venv/`	~500 MB	yes	Python virtualenv with `mlx-vlm` installed
`mlx/models/`	~4–19 GB per model	yes	HuggingFace model cache (Gemma weights)
`tokens/hf-token`	<1 KB	optional	HuggingFace token (gated downloads)
`tokens/tavily-key`	<1 KB	optional	Tavily API key
`.gemx-setup-complete`	33 B	auto	First-run skip marker
`Local Storage/`	~100 KB	yes (your data)	Conversations + settings (Chromium LevelDB files)
`Service Worker/`	~150 MB	optional	Whisper model cache
`Cache/`, `Code Cache/`, `GPUCache/`, etc.	variable	auto	Chromium runtime caches — safe to delete

localStorage keys (inside Local Storage/):

gemx:conversations:v7 — all your chats
gemx:settings:v1 — tools mode, context override
gemx:sidebar-open — sidebar visibility state
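
The `:vN` suffix on the first two keys suggests a schema version per key. As a rough sketch of how a renderer might read such a key defensively, assuming (the source does not confirm this) that values are stored as JSON: `readVersionedKey`, `KeyValueStore`, and the fallback behaviour are all hypothetical, and the real shape of the stored values is not documented here.

```typescript
// Minimal subset of the Web Storage API, so the helper also runs outside a browser.
interface KeyValueStore {
  getItem(key: string): string | null;
}

// Reads `${name}:v${version}` and parses it as JSON (assumed encoding).
// Returns `fallback` when the key is missing or the stored text is not valid JSON.
function readVersionedKey<T>(
  store: KeyValueStore,
  name: string,
  version: number,
  fallback: T,
): T {
  const raw = store.getItem(`${name}:v${version}`);
  if (raw === null) return fallback;
  try {
    return JSON.parse(raw) as T;
  } catch {
    return fallback;
  }
}

// In the renderer this could be called with the real storage object, e.g.:
// const settings = readVersionedKey(window.localStorage, "gemx:settings", 1, {});
```

Keeping the version in the key name means data written under an older schema is simply not found by newer code, rather than being misread.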

Want a minimum backup? Copy mlx/, tokens/, and Local Storage/. Everything else regenerates.

Privacy

GemX is local-first by design. Here is every outbound connection it can make:

| When | Where | Why | Optional? |
| --- | --- | --- | --- |
| First download of each model | `huggingface.co` | Download model weights | Required for first install of each model |
| First voice use only | `huggingface.co` (CDN) | Download Whisper base.en (~150 MB) | Required for first voice use |
| Web search query | `api.tavily.com` | Search results (if Tavily key set) | ✓ disable in Settings |
| Web search query (fallback) | `html.duckduckgo.com` | Search results (if no Tavily key) | ✓ disable in Settings |
| `fetch_url` tool | The destination domain | Page content | ✓ model only calls this if you ask it to |
| Connectivity check | none (uses `navigator.onLine`, OS-level) | Online/offline detection | n/a (no network call) |

No telemetry, no analytics, no crash reporting, no auto-update pings. Your conversations stay on your Mac; the only data that leaves it is what the table above lists: web search queries and the URLs passed to fetch_url.
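
One way to check the table against reality is to log the hostnames the app contacts and bucket them by row. The sketch below is only an illustration: `classifyOutbound` and its labels are hypothetical, the suffix match for Hugging Face is an assumption (the actual CDN hostnames may differ), and anything unmatched is grouped with `fetch_url` only because that is the one row with an open-ended destination.

```typescript
type OutboundKind =
  | "huggingface"
  | "tavily"
  | "duckduckgo"
  | "fetch_url-or-unexpected";

// Maps a request URL to the row of the privacy table it would fall under.
function classifyOutbound(url: string): OutboundKind {
  const host = new URL(url).hostname.toLowerCase();
  // Assumed: model and Whisper downloads come from huggingface.co or a subdomain.
  if (host === "huggingface.co" || host.endsWith(".huggingface.co")) {
    return "huggingface";
  }
  if (host === "api.tavily.com") return "tavily";
  if (host === "html.duckduckgo.com") return "duckduckgo";
  // Everything else is either a page the model fetched or something to investigate.
  return "fetch_url-or-unexpected";
}
```

With web search disabled and no `fetch_url` calls in a session, every observed host should land in the `huggingface` bucket or not appear at all.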

Building a distributable

npm run dev        # Dev mode with HMR (Vite + Electron)
npm run typecheck  # TypeScript check across main + renderer
npm run build      # Compile + bundle
npm run dist       # Package into a notarized .dmg installer

npm run dist produces dist/GemX-<version>.dmg. Open it and drag GemX to /Applications.

Troubleshooting

| Symptom | Fix |
| --- | --- |
| First run hangs on "Installing MLX" | Check that `python3 --version` reports 3.10–3.13. If it doesn't, install a supported version with `brew install python@3.13`. |
| Model download fails / 401 | Add a free HuggingFace token in Settings. |
| "DuckDuckGo blocked this request" | DDG is bot-challenging your IP. Get a free Tavily key (1,000/mo) and set it in Settings. |
| Voice transcription doesn't work | Grant microphone permission in macOS System Settings → Privacy & Security → Microphone. If WebGPU is unavailable, transcription falls back to WASM automatically. |
| OOM on long chats | Lower the Context Window in Settings. The 26B MoE model is memory-efficient (only 4 B params active); try that if you have ≥ 16 GB. |
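
The version check in the first row can be automated. A minimal sketch, assuming `python3 --version` prints a line of the form `Python 3.12.4`; `isSupportedPython` is a hypothetical helper and not part of GemX.

```typescript
// Supported range taken from the troubleshooting table: Python 3.10 through 3.13.
const MIN_MINOR = 10;
const MAX_MINOR = 13;

// Takes the raw text printed by `python3 --version` and reports whether
// it names a Python 3 release inside the supported minor-version range.
function isSupportedPython(versionOutput: string): boolean {
  const match = /^Python (\d+)\.(\d+)/.exec(versionOutput.trim());
  if (match === null) return false; // unrecognised output, e.g. "command not found"
  const major = Number(match[1]);
  const minor = Number(match[2]);
  return major === 3 && minor >= MIN_MINOR && minor <= MAX_MINOR;
}
```

Parsing the minor version as a number matters here: a plain string comparison would rank `3.9` above `3.10`.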

Release notes

The full set of release notes for every version — what changed, what shipped, and the rationale behind each round of work — lives in the Releases section. Each tagged release has the matching .dmg artifact and a complete changelog.

Credits

GemX was inspired by Ammaar Reshi's gemma-chat — the project that sparked the idea of a local-first Gemma chat app on Apple Silicon. Thank you.

Gemma by Google DeepMind — the model family this is built around
MLX by Apple Machine Learning Research — the inference framework
mlx-vlm by Prince Canuma — the multimodal runtime that makes this possible
transformers.js by HuggingFace — Whisper in the browser
Cline — XML tool-call format inspiration
Tavily — search API built for AI agents

Created by Avaneesh.

License

MIT — see LICENSE.
