GemX

Multimodal Gemma 4 on Apple Silicon — a chat app, a local API server, and an open MLX runtime in one.
Voice, vision, documents, web research — plus any mlx-community model you bring. All on your Mac.
No cloud. No telemetry. (Two optional API keys.)


Why GemX

Local LLMs on Mac aren't new — Ollama and LM Studio have both matured significantly through 2025–2026. So why another one?

GemX is opinionated. It started as a focused chat app built around Gemma 4 and mlx-vlm (the runtime built on Apple's MLX for vision-language models) and grew into a small local-AI hub: a polished chat client, an OpenAI-compatible server other apps can build on (in-app or headless), and an open runtime that loads any mlx-community model. It's still Gemma-4-first, and ships everything a research-style assistant needs out of the box: voice input, document parsing, and multi-step web search with inline citations.

  • Ollama — open-source backend with an OpenAI-compatible API. Added a native macOS app in July 2025 and an MLX preview for Apple Silicon in March 2026. Multimodal engine landed May 2025 (Llama 3.2 Vision, Qwen3-VL, LLaVA); Gemma 4 multimodal is not in their headline model list. Web-search API exists (Sep 2025) but there is no built-in chat research workflow.
  • LM Studio — polished closed-source chat app with mlx-vlm support and explicit Gemma 4 multimodal listings. Supports document RAG and image attachments. No voice input, no built-in web search.
  • Most MLX chat apps in the ecosystem still use mlx-lm, Apple's text-only MLX runtime — fine for chat, but it doesn't load Gemma 4's image encoder.

GemX is the only project that bundles all of the following in one app:

  • Multimodal Gemma 4 via mlx-vlm — paste an image, the model sees it
  • Voice input via on-device Whisper
  • Document attachments — PDF, DOCX, PPTX, code, plain text
  • Web research with citations — Tavily + DuckDuckGo, multi-step research loop, clickable [N] sources
  • Local API server (opt-in) — an OpenAI-compatible endpoint so editors, scripts, and other apps can use GemX's loaded model as a backend; run it in-app or headless from the terminal (gemx serve); loopback-only by default, bearer-token auth, LAN exposure is a separate explicit opt-in
  • Bring your own model — load any mlx-community repo (vision mlx-vlm or text-only mlx-lm), auto-probed for runtime / context / thinking — not only the bundled Gemma 4 family
  • Custom instructions & personas — global instructions appended to the system prompt, plus named personas you switch per conversation
  • Sampling controls — Precise / Balanced / Creative presets, with raw temperature / top-p / top-k / repetition-penalty under Advanced
  • Native Mac chat UX — pinned chats, conversation search, ⌘K command palette, smart titles, ⌘B sidebar, context-usage meter, model hot-swap mid-session
  • 100 % local inference — your prompts never leave the machine (the only optional outbound calls are to your chosen search backend)

Side-by-side

Verified against each project's own documentation in mid-2026. Caveats are listed in-cell instead of glossed over.

| | GemX | Ollama | LM Studio |
|---|---|---|---|
| Form factor | Mac chat app | OSS API server + CLI; native app added Jul 2025 | Mac/Win/Linux chat app + API |
| MLX backend | mlx-vlm (multimodal) | MLX preview (Mar 2026) | mlx-vlm (multimodal, May 2025) |
| Native multimodal Gemma 4 | ✓ via mlx-vlm | ✗ supported via mlx-lm (text only) | ✓ via mlx-vlm |
| Voice input (Whisper, on-device) | ✓ built-in | | |
| Image attachments in chat | ✓ paste / drag / browse | ✗ via supported vision models but mlx-lm is text only | ✓ via supported vision models |
| Document attachments (PDF / DOCX) | ✓ PDF + DOCX + code | ✓ RAG (text only, not supported for all types) | ✓ RAG (text only) |
| Built-in web search workflow | ✓ multi-step Tavily + DDG | partial — web-search API, no chat loop | |
| Local OpenAI-compatible API | ✓ opt-in; in-app or headless (gemx serve); loopback by default, token auth | ✓ core feature | |
| Bring your own model | ✓ any mlx-community repo (mlx-vlm or mlx-lm), auto-probed | ✓ pull any model | ✓ download any |
| Inline [N] citations + sources list | ✓ enforced by system prompt | | |
| Pinned conversations, smart titles | | n/a (no built-in chat UI) | partial |
| Open source | ✓ (MIT) | ✗ GUI app (free, closed-source) | ✗ (free, closed-source) |

What it does (at a glance)

  • Chat with Gemma 4 in a polished native Mac UI — full Markdown, KaTeX math, syntax-highlighted code blocks, tables etc.
  • See — drag or paste images straight into the conversation
  • Read — drop a PDF, DOCX, PPTX, or code/text file; math is preserved (DOCX/PPTX→LaTeX, PDF→vision), and big files are indexed locally so the relevant passages are retrieved per question (on-device RAG)
  • Listen — click the mic, talk, get a transcript via local Whisper
  • Search — agentic web research with Tavily, multi-step fetch loop, clickable citations
  • Serve — flip on the local API server (in-app toggle or headless gemx serve) and point any OpenAI-compatible client at http://127.0.0.1:11535/v1
  • Extend — bring your own model: load any mlx-community repo (multimodal or text-only), not just the five bundled Gemma 4 variants
  • Shape — custom instructions + per-conversation personas, and Precise/Balanced/Creative sampling presets
  • Switch models without restarting — five Gemma 4 variants from ~4 GB to ~19 GB, plus your custom adds
  • Stay private — every byte stays on your Mac except (optionally) your chosen search query

Requirements

  • Apple Silicon Mac (M1 or later) — MLX is ARM-only
  • macOS 13 Ventura or later
  • Python 3.10 – 3.13 (brew install python@3.13 if missing; 3.14+ not yet supported by mlx-vlm)
  • Node.js 20+
  • ~5 GB free disk space minimum (smallest model + Python venv + node_modules); up to 20 GB if you run the 31B variant

Install

There are two ways to get GemX running. Most users want Option 1. If you want to hack on the code or build your own .dmg, use Option 2.

Both routes share the same Requirements (Apple Silicon, macOS 13+, Python 3.10–3.13) and converge on the same first-launch flow.

Option 1 — Prebuilt .dmg (recommended)

  1. Go to the Releases page on GitHub

  2. Download the latest GemX-<version>.dmg

  3. Double-click to mount, then drag GemX.app to your /Applications folder

  4. Eject the disk image and launch GemX from Applications, Launchpad, or Spotlight

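  5. If macOS warns "GemX cannot be opened because the developer cannot be verified", right-click → Open the first time (one-time bypass). On macOS 15 Sequoia and later that shortcut is no longer available; use System Settings → Privacy & Security → Open Anyway instead. As a one-liner alternative, run in Terminal: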

    xattr -dr com.apple.quarantine /Applications/GemX.app

You do not need Node.js, npm, this repo, or any build tools for this route — just Python 3.10–3.13 on your PATH (brew install python@3.13 if missing). Everything else is provisioned by the app on first launch.

Option 2 — Build from source

For development or producing your own signed build:

git clone https://github.com/Avaneesh40585/GemX.git
cd GemX
npm install
npm run dev

This runs Electron + Vite in development mode with hot-module reloading. To produce a distributable .dmg yourself, see Building a distributable.

First launch (either option)

The very first time GemX starts, it will:

  1. Detect Python on your PATH and create an isolated virtualenv at ~/Library/Application Support/GemX/mlx/venv/
  2. pip install mlx-vlm into that venv (cached for future launches)
  3. Show the model picker — choose one of the five Gemma 4 variants
  4. Download the model weights from HuggingFace into the local cache
  5. Boot the MLX inference server on 127.0.0.1:11534
  6. Drop you into the chat

Subsequent launches skip steps 1–4 — the model loads from cache in seconds.

💡 Tip: A free HuggingFace token (added in Settings) speeds up model downloads and avoids rate-limits. Gated models also require it.


Models

All variants are 4-bit quantized via mlx-community and natively multimodal (text + image). Switch between them at any time via the dropdown in the chat header.

Warning

Running a local LLM is resource-intensive. Inference saturates the GPU, holds the entire model in unified memory, and is sensitive to thermal throttling. Before launching a chat, close memory-heavy background apps — browsers with dozens of tabs, IDEs, Docker, Slack, games, video editors. If your Mac is on battery, plug it in. The bigger the model, the less headroom you have for everything else.

The Minimum column below is the smallest unified-memory configuration where the model loads and runs without swap thrashing or OOM kills. Recommended is the configuration where it feels comfortable with a couple of other apps open at the same time. RAM = unified memory on Apple Silicon.

| Model | HF repo | Disk | Default context | Max context | Minimum RAM | Recommended RAM | Notes |
|---|---|---|---|---|---|---|---|
| Gemma 4 E2B | mlx-community/gemma-4-e2b-it-4bit | ~4 GB | 8 K | 128 K | 8 GB | 16 GB | Fastest, lowest footprint. Good for quick Q&A. |
| Gemma 4 E4B | mlx-community/gemma-4-e4b-it-4bit | ~5.5 GB | 16 K | 128 K | 8 GB | 16 GB | Recommended default. Best balance. |
| Gemma 4 12B | mlx-community/gemma-4-12B-it-4bit | ~7 GB | 32 K | 256 K | 12 GB | 16 GB | Dense mid-tier — sharper reasoning than E4B without the MoE's RAM cliff. Comfortable on a 16 GB Mac. |
| Gemma 4 26B MoE | mlx-community/gemma-4-26b-a4b-it-4bit | ~16 GB | 64 K | 256 K | 24 GB | 32 GB | Mixture-of-Experts: 26 B total params, 4 B active per token. |
| Gemma 4 31B | mlx-community/gemma-4-31b-it-4bit | ~19 GB | 64 K | 256 K | 32 GB | 48–64 GB | Dense — highest raw quality. Comfortable on M3 Max / M4 Pro 36 GB+ or M-series Max/Ultra. |

Important

Disk size is roughly the lower bound on RAM. Add ~2–4 GB for the OS, ~1–6 GB for the KV cache depending on context window, plus headroom for whatever else you're running. On a 16 GB Mac, stick to E2B or E4B, or run 12B at a modest context window with little else open.
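
The budgeting rule above can be turned into a quick back-of-the-envelope check. This is only an illustrative sketch: `estimateRamGb`, `fitsIn`, and the field names are hypothetical (they are not part of GemX), and the default OS overhead is simply the mid-point of the range quoted above.

```typescript
// Hypothetical helper, not part of GemX: adds up the rough RAM budget described above.
interface RamBudget {
  weightsGb: number;   // on-disk size of the 4-bit model (lower bound on RAM)
  kvCacheGb: number;   // roughly 1 to 6 GB depending on the context window
  osGb?: number;       // roughly 2 to 4 GB for macOS itself
  headroomGb?: number; // whatever else you keep open
}

function estimateRamGb({ weightsGb, kvCacheGb, osGb = 3, headroomGb = 0 }: RamBudget): number {
  return weightsGb + kvCacheGb + osGb + headroomGb;
}

function fitsIn(totalRamGb: number, budget: RamBudget): boolean {
  return estimateRamGb(budget) <= totalRamGb;
}

// Gemma 4 E4B (~5.5 GB) with a ~2 GB KV cache and 2 GB of other apps on a 16 GB Mac.
const e4bFits = fitsIn(16, { weightsGb: 5.5, kvCacheGb: 2, headroomGb: 2 }); // 12.5 <= 16, so true

// Gemma 4 26B MoE (~16 GB) on the same machine.
const moeFits = fitsIn(16, { weightsGb: 16, kvCacheGb: 3 }); // 22 > 16, so false
```

Treat the result as a first filter only; real usage also depends on thermal state and what macOS decides to keep resident.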

How the context window affects RAM

The context window (set per-session in Settings) determines how many tokens of conversation the model can hold at once. Larger windows = longer memory, but they reserve more KV-cache memory upfront — independently of how full the conversation actually is.

A rough rule of thumb: doubling the context window roughly doubles the KV-cache allocation. The cache scales with the model's layer count and hidden size, so the bigger the model, the more each context-window step costs.
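
To make the scaling concrete, here is a rough estimator for a plain full-attention transformer. It is a sketch under stated assumptions: the shape in the example is made up for illustration (it is not Gemma 4's real configuration), and models that use sliding-window or shared key/value layers will need less than this formula predicts.

```typescript
// Rough KV-cache size for a standard transformer with full attention.
// The factor of 2 is one key tensor plus one value tensor per layer.
interface KvShape {
  layers: number;        // number of transformer layers
  kvHeads: number;       // key/value heads (fewer than query heads under GQA)
  headDim: number;       // dimension per head
  bytesPerValue: number; // 2 for fp16 / bf16
}

function kvCacheBytes(shape: KvShape, contextTokens: number): number {
  return 2 * shape.layers * shape.kvHeads * shape.headDim * shape.bytesPerValue * contextTokens;
}

const GIB = 1024 ** 3;

// Illustrative shape only, NOT Gemma 4's actual config.
const toy: KvShape = { layers: 32, kvHeads: 8, headDim: 128, bytesPerValue: 2 };

const at32k = kvCacheBytes(toy, 32 * 1024) / GIB; // 4 GiB
const at64k = kvCacheBytes(toy, 64 * 1024) / GIB; // 8 GiB: doubling the window doubles the cache
```

Every term is a plain multiplier, which is why the cost of each context-window step grows with the model's depth and head size.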

Practical guidance per model (approximate, in addition to the model weights themselves):

| Model | 8 K | 32 K | 64 K | 128 K | 256 K |
|---|---|---|---|---|---|
| Gemma 4 E2B | ~0.3 GB | ~1 GB | ~2 GB | ~4 GB | n/a (max 128 K) |
| Gemma 4 E4B | ~0.5 GB | ~2 GB | ~4 GB | ~6 GB | n/a (max 128 K) |
| Gemma 4 12B | ~0.8 GB | ~3 GB | ~6 GB | ~10 GB | ~20 GB |
| Gemma 4 26B MoE | ~1 GB | ~3 GB | ~5 GB | ~10 GB | ~18 GB |
| Gemma 4 31B | ~1.5 GB | ~5 GB | ~8 GB | ~14 GB | ~26 GB |

The 256 K column is the native maximum of the three larger models, and as the numbers above make clear, on a typical Mac you'll OOM long before reaching it. Most users want to stay at or below 64 K. Push it higher only if you have an Ultra-class machine with RAM to spare.

If you OOM mid-conversation:

  1. Open Settings → Context window and step it down (it's a power-of-2 slider — each step halves the KV-cache reservation)
  2. Close other RAM-hungry apps and reload the conversation
  3. If it still doesn't fit, switch to a smaller model

Features in depth

Chat

  • Streaming responses with mid-stream cancel
  • Edit the last user message and re-send (regenerates from that point)
  • Regenerate the last assistant response with one click
  • Copy any message or any individual code block
  • Full GFM Markdown — headings, bold, italic, strikethrough, tables, nested lists, blockquotes
  • LaTeX math — both inline $x^2$ and block $$\int_0^1 f(x)\,dx$$ rendered via KaTeX
  • Syntax-highlighted code blocks with:
    • Language detection from the fence
    • Pretty labels (cpp → C++, ts → TS, py → Python, rs → Rust, …)
    • One-click copy
    • Custom Atom-One-Dark-derived theme with no per-line backgrounds
  • Stop-and-resume indicators — clear "Response stopped" message when you cancel mid-stream, error blocks instead of raw error text in the bubble
  • Empty-response indicator — never silently swallows a turn (see Empty responses, explained below)
  • Collapsible thinking blocks — when Thinking is enabled in Settings, Gemma 4's reasoning trace streams into a collapsible block above the answer. The block header shimmers through rotating verbs (Thinking → Considering → Planning → Pondering → Reasoning → Sketching) while reasoning is in flight, then collapses to "Thought process" once the answer starts. The body renders with the full markdown stack — syntax-highlighted code blocks, KaTeX math, GFM tables, and target-blank links — so reasoning looks as polished as the answer itself. mlx-vlm's chat template strips past thinking from history automatically so multi-turn context stays clean.

Empty responses, explained

Sometimes the very first turn of a fresh conversation comes back empty — the message bubble appears with just a small "No response" indicator instead of text. This isn't a crash; it's a known quirk of Gemma 4 + mlx-vlm while the model finishes warming up.

What's happening under the hood:

  1. On the first turn after launch (or after switching models), the MLX server is still bringing weights into unified memory, compiling kernels, and JIT-priming the cache.
  2. When tools are enabled, GemX also issues the request with a stricter system prompt that tells the model to emit XML tool blocks. Occasionally Gemma 4 returns a stop token before producing any user-visible text — the same pattern that makes the tools / tool_choice API path unusable with this model.
  3. Rather than silently swallow the turn (which would leave you staring at a blank bubble wondering if it's still working), GemX detects the zero-token completion and renders the "No response" indicator. The conversation state is preserved.

What to do:

  • Click Regenerate on the empty message — second attempts almost always succeed
  • If it happens repeatedly on the same turn, try shortening / rephrasing the prompt, or temporarily turn Web Search off in Settings (the tool-enabled system prompt is what most often triggers it)
  • After 2–3 successful turns in a chat the quirk effectively disappears — the model is fully warm

This is purely a UI affordance; nothing is logged or sent anywhere. The "No response" state is mutually exclusive with the "Response stopped" indicator (manual cancel) and the red error indicator (actual generation failure), so you can always tell what happened at a glance.

Image attachments

  • Paste (⌘V), drag-and-drop, or browse via the attach button
  • Up to 4 images per message
  • Client-side resize to 1024 px longest edge, JPEG-encoded — saves memory and tokens
  • Thumbnails render in the user bubble; click to view full-size
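
As a small illustration of the resize rule, the sketch below computes the target dimensions for a "longest edge at most 1024 px" downscale. `fitLongestEdge` is a hypothetical helper, and the choice to leave smaller images untouched is an assumption; the actual pixel work would happen in a canvas or similar.

```typescript
// Hypothetical helper: target size for a "longest edge <= maxEdge" resize.
function fitLongestEdge(
  width: number,
  height: number,
  maxEdge = 1024
): { width: number; height: number } {
  const longest = Math.max(width, height);
  if (longest <= maxEdge) return { width, height }; // assumption: never upscale
  const scale = maxEdge / longest;
  return { width: Math.round(width * scale), height: Math.round(height * scale) };
}

const landscape = fitLongestEdge(4032, 3024); // { width: 1024, height: 768 }
const small = fitLongestEdge(800, 600);       // unchanged: { width: 800, height: 600 }
```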

Document attachments

GemX reads PDF, DOCX, PPTX, and text-based files (code, plaintext, Markdown, JSON/YAML, …) — not arbitrary binaries. Images go to the model as vision input; audio is transcribed (see below).

  • DOCX — math preserved as LaTeX. Word stores equations as structured OMML; GemX converts them to LaTeX ($…$) instead of dropping them like mammoth does, so formulas survive into the model and render via KaTeX. (Falls back to plain text on any odd file.)
  • PPTX — slides, math, and notes. PowerPoint is the same OOXML format as Word, so GemX extracts each slide's text + equations (OMML→LaTeX) in display order and includes speaker notes. (Slide images / diagrams aren't captured — text and math only.)
  • PDF — math read by the vision model. A PDF stores equations as un-extractable glyphs, so when a multimodal model (e.g. Gemma 4) is active, GemX renders pages to images and attaches the relevant page(s) to your question at retrieval time so the model reads the real math. With a text-only model it falls back to plain pdf-parse text.
  • Code / plaintext / Markdown read directly — .js, .ts, .py, .cpp, .rs, .go, .java, .md, .txt, .json, .yaml, …
  • Smart retrieval for large files — documents under ~8 KB are inlined whole; bigger files are indexed locally (chunked + embedded with a tiny on-device model, gte-small via Transformers.js) and the passages relevant to each question are pulled in per turn. No cloud, no 8 KB truncation, and the document never gets pruned out of a long conversation. A brief "Indexing…" status shows in the composer while it works; vectors (and rendered PDF pages) live on disk under rag/<conversation>/, never in the conversation blob.
  • Shows as a filename chip above the bubble — never raw <context> XML
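
The retrieval step for large files follows the usual embed-and-rank pattern. The sketch below shows only the ranking half, with hand-made 3-dimensional vectors standing in for real embeddings; the `Chunk` type and `topK` function are hypothetical names, not GemX's internals.

```typescript
// Toy retrieval: rank pre-embedded chunks by cosine similarity to a query vector.
// In a real pipeline the vectors would come from an embedding model such as gte-small.
interface Chunk {
  text: string;
  vector: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosine(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score) // highest similarity first
    .slice(0, k)
    .map((scored) => scored.chunk);
}

// Hand-made vectors stand in for real embeddings.
const index: Chunk[] = [
  { text: "Section 2: installation", vector: [1, 0, 0] },
  { text: "Section 5: KV-cache sizing", vector: [0, 1, 0] },
  { text: "Appendix: licence", vector: [0, 0, 1] },
];

const hits = topK([0.1, 0.9, 0], index, 1);
// hits[0].text === "Section 5: KV-cache sizing"
```

Only the top-ranked passages are added to the prompt each turn, which is what keeps a large document from crowding out the rest of the conversation.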

Voice input

  • Click the mic button to start recording; live timer; click again to stop
  • Local speech-to-text via Whisper base.en (no cloud)
  • The Whisper model (~150 MB) downloads once on first use and is cached by the embedded browser (it shows up under Service Worker/ in the app's data folder)
  • WebGPU by default with automatic WASM fallback
  • Currently English only

Web search & research loop

GemX is agentic when it comes to research — it doesn't just summarize from snippets.

  • Primary: Tavily API — purpose-built for AI agents, 1 000 free searches/month, no card required
  • Fallback: DuckDuckGo HTML scrape — works when DDG isn't bot-blocking; GemX detects the bot-challenge HTML and surfaces a clear message asking you to set a Tavily key
  • Two tools are wired up to the model:
    • <web_search><query>…</query></web_search> → returns numbered results
    • <fetch_url><url>…</url></fetch_url> → returns the page text (≤8 KB); escalates to Tavily extraction for JavaScript-heavy pages, and says so honestly if a page can't be read (instead of inventing it)
  • Mandatory multi-step workflow (enforced by both the system prompt and dynamic per-tool follow-up messages):
    1. Search → list of URLs
    2. Fetch top 2-3 URLs — landing-page snippets are NOT enough
    3. Synthesize — final answer with citations
  • Online/offline auto-detection via navigator.onLine + online/offline event listeners; in Auto mode tools deactivate the moment you go offline

Citations

  • The model emits inline [1], [2] markers as it writes
  • A required "Sources:" section at the bottom of every research answer lists those references as clickable markdown links
  • Clicking a citation opens the source in your default browser — a will-navigate interceptor in the main process routes external HTTPS links via shell.openExternal, never in-app
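
One way to sanity-check an answer against its source list is to collect the inline markers and flag any that point past the end of the list. This is a hypothetical check written for illustration; the source only states that citations are enforced through the system prompt.

```typescript
// Hypothetical check: every inline [N] marker should map to a numbered source.
function citedNumbers(answer: string): number[] {
  const seen: number[] = [];
  const marker = /\[(\d+)\]/g;
  let m: RegExpExecArray | null;
  while ((m = marker.exec(answer)) !== null) {
    const n = Number(m[1]);
    if (seen.indexOf(n) === -1) seen.push(n); // de-duplicate repeated citations
  }
  return seen.sort((a, b) => a - b);
}

// Returns the cited numbers that have no entry in a sources list of the given length.
function missingSources(answer: string, sourceCount: number): number[] {
  return citedNumbers(answer).filter((n) => n < 1 || n > sourceCount);
}

const body = "MLX targets Apple silicon [1]. Gemma is multimodal [2][3].";
const dangling = missingSources(body, 2); // [3]: cited, but the list has only two sources
```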

Conversation management

  • Unlimited saved conversations in the sidebar
  • Smart auto-generated titles — after your first turn, a background model call writes a 3–5 word title
  • Pinned conversations float to a dedicated "Pinned" section at the top
  • 3-dot dropdown menu per chat: Pin / Unpin · Rename · Delete
  • Inline rename — click rename, type, Enter to commit
  • Delete with confirmation
  • All conversations live in localStorage under gemx:conversations:v7

Smart context pruning

When a conversation grows past 75 % of the model's context window, GemX prunes it gracefully:

  1. Phase 1 — large tool results (fetch_url content blocks) are collapsed to short summaries
  2. Phase 2 — if still over the limit, drop the oldest user/assistant turns one by one, always preserving the system prompt and the most recent user message

This prevents "context length exceeded" errors mid-research.
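
The two phases can be sketched as follows. This is a simplified illustration, not GemX's implementation: the `Msg` shape, the four-characters-per-token estimate, and the 200-character summary cut-off are all assumptions, and the last message in the array is assumed to be the most recent user message.

```typescript
type Role = "system" | "user" | "assistant";

interface Msg {
  role: Role;
  content: string;
  isToolResult?: boolean; // e.g. a fetch_url content block
}

// Crude stand-in for a tokenizer: about 4 characters per token.
const approxTokens = (msgs: Msg[]): number =>
  msgs.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);

function prune(msgs: Msg[], contextWindow: number, threshold = 0.75): Msg[] {
  const limit = contextWindow * threshold;
  if (approxTokens(msgs) <= limit) return msgs;

  // Phase 1: collapse large tool results into short stubs.
  let out: Msg[] = msgs.map((m) =>
    m.isToolResult && m.content.length > 200
      ? { ...m, content: m.content.slice(0, 200) + " [truncated]" }
      : m
  );

  // Phase 2: drop the oldest non-system turn until the conversation fits,
  // always keeping the system prompt and the final (most recent) message.
  while (approxTokens(out) > limit) {
    let idx = -1;
    for (let i = 0; i < out.length - 1; i++) {
      if (out[i].role !== "system") {
        idx = i;
        break;
      }
    }
    if (idx === -1) break; // nothing left that may be dropped
    out = out.filter((_, i) => i !== idx);
  }
  return out;
}

const history: Msg[] = [
  { role: "system", content: "You are helpful." },
  { role: "user", content: "What is MLX?" },
  { role: "assistant", content: "An array framework from Apple." },
  { role: "user", content: "And mlx-vlm?" },
];

// 18 estimated tokens against a limit of 15 (75 % of 20): the oldest user turn is dropped.
const kept = prune(history, 20);
// kept roles: system, assistant, user
```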

Sidebar

  • Toggle with ⌘B or the icon in the chat header
  • Hidden state is persisted across sessions; on first launch the sidebar auto-collapses on small windows (< 640 px wide) so the chat area isn't crowded
  • Slim mode below md (768 px) — w-52 instead of w-60
  • Two sections: Pinned (if any) then Chats

Settings

The settings modal has a left nav rail (a horizontal tab strip on narrow windows) with five sections:

  • Behavior — controls that shape a response:
    • Thinking mode (Disabled / Enabled) — Gemma 4's native reasoning channel; slower but stronger on math/code/multi-step questions
    • Web Search mode (Disabled / Enabled) with an Online/Offline dot, plus the Tavily API key (recommended; stored at tokens/tavily-key) when enabled
    • Context window — a powers-of-2 stepper (1 024 → 262 144) with per-model maximums and a Reset button
    • Response Style — Precise / Balanced / Creative presets, plus an Advanced expander with raw temperature / top-p / top-k / repetition-penalty (editing a value flips the preset to "custom")
  • Personalization:
    • Custom Instructions — free text appended to GemX's system prompt on every chat (date/tools/vision keep working)
    • Personas — named, reusable prompts you create here and switch per conversation from a header dropdown
  • Models — every cached built-in plus every custom you've added (one-click delete per row; the active model can't be deleted until you switch off it), and the optional HuggingFace token for gated downloads (stored at tokens/hf-token)
  • Appearance — Theme: System / Light / Dark (System follows macOS Appearance live and keeps the titlebar tint + native scrollbars aligned)
  • Developer — the Local API server (see below)

Token files have 0o600 permissions (owner read/write only); token inputs support Enter to save, show a brief Saved ✓, and have always-visible Clear buttons.

Model hot-swap

  • Click the model name in the header → dropdown listing all five built-in variants, plus any custom models you've added
  • Selecting a different model restarts the MLX server internally
  • Loading overlay appears during the switch; conversation history is preserved

Local API server

Turn on Settings ▸ Developer ▸ Local API server to expose an OpenAI-compatible endpoint so editors, scripts, or any OpenAI client can use GemX's loaded model as a backend. It's a thin proxy in front of the MLX server, so it serves whichever model is currently loaded and shares that one process with in-app chats.

Defaults are conservative: off until you enable it, bound to 127.0.0.1 only, and limited to /v1/* routes. The panel shows the base URL (default http://127.0.0.1:11535/v1) with a copy button and a live running/stopped dot.

# List the loaded model
curl http://127.0.0.1:11535/v1/models

# Chat (streaming)
curl http://127.0.0.1:11535/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{"model":"gemx","messages":[{"role":"user","content":"hi"}],"stream":true}'

  • Bearer token (optional on loopback, required for network access): set one in the panel, then send -H "Authorization: Bearer <token>". Once a token is set, requests without it get 401.
  • LAN exposure is a separate, explicit opt-in (binds 0.0.0.0) with a visible warning — only enable it on trusted networks, and always with a token.
  • The model field in your request can be anything — GemX rewrites it to the currently-loaded model before forwarding (the MLX server would otherwise try to load whatever id a client sends). GET /v1/models likewise returns exactly the loaded model.
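
The token check and the model rewrite are both simple proxy rules. The sketch below is a hypothetical illustration of that behaviour, not GemX's actual code: `ProxyConfig`, `authorize`, and `rewriteModel` are invented names, and a production check would compare tokens in constant time.

```typescript
// Hypothetical sketch of the two proxy rules described above.
interface ProxyConfig {
  token?: string;      // unset means no auth is configured (loopback-only use)
  loadedModel: string; // the model the MLX server currently has in memory
}

// Returns the HTTP status the proxy would answer with: 200 to forward, 401 to reject.
function authorize(cfg: ProxyConfig, authorizationHeader?: string): 200 | 401 {
  if (!cfg.token) return 200;
  return authorizationHeader === "Bearer " + cfg.token ? 200 : 401;
}

// Whatever model id the client sent, forward the one that is actually loaded.
function rewriteModel(body: { model?: string; [key: string]: unknown }, cfg: ProxyConfig) {
  return { ...body, model: cfg.loadedModel };
}

const cfg: ProxyConfig = { token: "my-secret", loadedModel: "mlx-community/gemma-4-e4b-it-4bit" };

const status = authorize(cfg, "Bearer wrong"); // 401
const forwarded = rewriteModel({ model: "gpt-4o", stream: true }, cfg);
// forwarded.model === "mlx-community/gemma-4-e4b-it-4bit"; other fields pass through unchanged
```

Rewriting on the proxy side means clients with a hard-coded model name keep working after you hot-swap models in the app.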

From the terminal, without opening the app

You can run the same server headless (Ollama-style), no GUI.

One-time install — put a tiny wrapper script on your PATH. ⚠️ Not a symlink: macOS Electron locates its bundled Helper apps relative to the executable path, so a symlinked launch crashes with Unable to find helper app. If you already made one, remove it first:

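# 1. Remove an old symlink, if you made one (skip this step otherwise)
which gemx          # find it
rm "$(which gemx)"  # remove the symlink (prefix with sudo if it lives in a root-owned directory)

# 2. Install the wrapper script
sudo mkdir -p /usr/local/bin   # the directory may not exist on a fresh Apple Silicon Mac
sudo tee /usr/local/bin/gemx >/dev/null <<'EOF'
#!/bin/bash
exec "/Applications/GemX.app/Contents/MacOS/GemX" "$@"
EOF
sudo chmod +x /usr/local/bin/gemx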

Then:

# Start the server (defaults: port 11535, last-used model, loopback)
gemx serve
gemx serve --port 8080 --model mlx-community/gemma-4-e4b-it-4bit
gemx serve --lan --token my-secret      # expose on the LAN with auth
gemx serve --help

gemx serve boots the same MLX runtime and proxy with no window, prints the endpoint, and stays in the foreground until you Ctrl-C. The model weights and Python runtime must already be installed — launch the GemX app once first so it can set those up. (Building from source? npm run serve does the same against a dev build.)

You can't open the GemX window while gemx serve is running (and vice-versa) — this is expected. gemx serve is the same app in headless mode, not a separate daemon, so the two share one app instance and one model server (port 11534). While serve holds them, double-clicking GemX.app just reactivates the windowless headless process, so no UI appears.

If you want the UI and the API endpoint at the same time, don't use gemx serve at all — just run the app and turn on Settings ▸ Developer ▸ Local API server. That serves the same OpenAI-compatible endpoint on port 11535 from inside the running app. Use gemx serve only when you want the server without a GUI (a headless box, an SSH session, a login-item daemon).

Use it from editors & agent harnesses

GemX emits native OpenAI tool_calls (Gemma 4 via mlx-vlm — verified end-to-end), so agentic harnesses work, not just chat. Setup guides for Cline, Kilo Code, Continue.dev, Zed, JetBrains AI Assistant, Goose, and OpenCode — plus the two tools that can't connect (Codex CLI is Responses-API-only; Cursor can't reach localhost) and why — live in docs/api-clients.md. Short version: point any of them at http://127.0.0.1:11535/v1 with any API key, and prefer Gemma 4 12B with a raised context window for agent work — small models emit valid tool calls but are clumsier on long multi-step tasks.

Bring your own model

Click the + Custom button next to the model picker to add any mlx-vlm or mlx-lm model from the mlx-community HuggingFace organisation.

  • Only mlx-community/* repos are accepted. Other repos may not be MLX-quantised and won't load. Browse the curated catalogue at https://huggingface.co/mlx-community/collections.
  • Paste a repo id (e.g. mlx-community/Qwen3-8B-MLX-4bit) and GemX probes HuggingFace for runtime (multimodal mlx-vlm vs text-only mlx-lm), default + max context window, and thinking support — auto-detected from the model's config.json and chat template.
  • Runtime is auto-detected and locked at add time — the modal shows it as a readonly chip with no override. Every other field (label, context window, max context, thinking support, description) is editable because config.json is sometimes wrong. All settings become immutable after Add — a prominent banner in the modal reminds you that the only way to fix a mistake is to remove and re-add. Default context for new customs is 16 K to keep first-launch RAM modest.
  • Custom models appear in the header picker under a Custom subheader (label + runtime).
  • Removal is in Settings → Downloaded Models (custom models appear there with a custom tag and the same trash icon as built-ins). Removing also clears the cached weights.
  • The Settings Thinking toggle auto-disables when the active custom model doesn't expose a reasoning channel; the Context Window slider's upper bound respects the model's Max.
  • Image attach is blocked for text-only (mlx-lm) models. When a custom mlx-lm model is active, image/* files arriving by drag-and-drop, paste, or the file picker are filtered out, and a "This model is text-only — images can't be attached." toast explains why. PDFs, code, and audio attachments still work.
  • Runtime-mismatch errors are caught and surfaced clearly. If a model was auto-detected with the wrong runtime — server fails to load, or a stray image reaches a text-only server — GemX intercepts the cryptic stderr and surfaces an action-oriented advisory pointing you at Settings → Downloaded Models to remove and re-add.
  • The first time you select a text-only (mlx-lm) custom model, GemX installs mlx-lm into the existing venv — a one-time ~30 s step. Subsequent custom-model boots skip it.
  • Gated repos require your HuggingFace token in Settings.
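
For a feel of what auto-probing from config.json can look like, here is a hypothetical heuristic. It is not GemX's actual probe: the field names follow common Hugging Face config conventions (`vision_config`, `max_position_embeddings`, nested `text_config`), which individual repos may not honour, and `probeConfig` is an invented name.

```typescript
// Hypothetical heuristic, not GemX's actual probe.
interface HfConfig {
  vision_config?: object;                              // present on many vision-language models
  max_position_embeddings?: number;                    // text-only models usually keep it top-level
  text_config?: { max_position_embeddings?: number };  // multimodal models often nest it here
}

interface Probe {
  runtime: "mlx-vlm" | "mlx-lm";
  maxContext: number;
  defaultContext: number;
}

function probeConfig(cfg: HfConfig, fallbackMax = 16384): Probe {
  const runtime = cfg.vision_config ? "mlx-vlm" : "mlx-lm";
  const maxContext =
    cfg.text_config?.max_position_embeddings ?? cfg.max_position_embeddings ?? fallbackMax;
  // New customs default to 16 K (or the model's max, if smaller) to keep RAM modest.
  return { runtime, maxContext, defaultContext: Math.min(16384, maxContext) };
}

const textOnly = probeConfig({ max_position_embeddings: 40960 });
// { runtime: "mlx-lm", maxContext: 40960, defaultContext: 16384 }

const multimodal = probeConfig({ vision_config: {}, text_config: { max_position_embeddings: 131072 } });
// { runtime: "mlx-vlm", maxContext: 131072, defaultContext: 16384 }
```

Because such heuristics can misfire on unusual repos, it makes sense that the detected values are shown for review before the model is added.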

Why mlx-vlm specifically

Apple's MLX is the framework. Two main runtimes are built on top of it:

  • mlx-lm — Apple's text-only LLM runtime. Optimized, polished, but doesn't load vision encoders.
  • mlx-vlm — Vision-language model runtime by Prince Canuma. Knows how to load multimodal architectures (Gemma 4, Pixtral, Qwen-VL, LLaVA, …) end-to-end. Exposes an OpenAI-compatible /v1/chat/completions endpoint that accepts image_url content blocks alongside text.

GemX uses mlx-vlm because Gemma 4 is a natively multimodal model: its image encoder is part of the architecture, trained jointly with the language model. Stripping that out via a text-only loader would discard half of what the model can do. mlx-vlm preserves and exposes the full stack with the performance benefits of Apple's MLX framework on M-series hardware (unified memory, fused Metal GPU kernels). Note that MLX runs on the GPU and CPU; it does not target the Neural Engine.


Architecture

┌──────────────────────────────────────────────────────────┐
│  Renderer (React + TS + Vite)                            │
│  • Chat UI, streaming display, markdown rendering        │
│  • Whisper (in-browser via @huggingface/transformers)    │
│  • localStorage: conversations + settings + sidebar      │
└────────────────────────┬─────────────────────────────────┘
                         │ contextBridge IPC
┌────────────────────────┴─────────────────────────────────┐
│  Main process (Electron + TS)                            │
│  • MLX server lifecycle (spawn / abort / restart)        │
│  • Tool execution: web_search (Tavily / DDG), fetch_url  │
│  • XML tool-call parsing after stream completion         │
│  • File I/O: tokens/, .gemx-setup-complete               │
│  • Document parsing: pdf-parse, mammoth                  │
│  • will-navigate interceptor → shell.openExternal        │
└────────────────────────┬─────────────────────────────────┘
                         │ HTTP SSE
┌────────────────────────┴─────────────────────────────────┐
│  mlx-vlm.server (Python subprocess, 127.0.0.1:11534)     │
│  • Loads Gemma 4 weights into Apple MLX                  │
│  • OpenAI-compatible /v1/chat/completions                │
│  • Streams tokens back over Server-Sent Events           │
└──────────────────────────────────────────────────────────┘

Tool calling

For its own built-in tools, GemX's chat pipeline doesn't use the OpenAI tools / tool_choice API fields: empirically, that path produces empty responses with Gemma 4 + mlx-vlm. Instead it uses a Cline-style pure-XML format in the model's text output:

<web_search>
  <query>latest AI news</query>
</web_search>

After each round, the main process:

  1. Accumulates the streamed text
  2. Regex-matches for <web_search> and <fetch_url> blocks
  3. Executes the tool
  4. Injects the result back as a role: 'user' message (avoiding the OpenAI tool role which Gemma 4 handles poorly via mlx-vlm)
  5. Sends a tool-specific follow-up instruction — e.g., after web_search, the injection demands "You MUST now call fetch_url on the most relevant URL. Do NOT summarize from snippets."

Up to 10 rounds per turn — enough for one search + several fetches + the final synthesis.
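
The regex-matching step can be sketched like this. It is a minimal illustration that assumes the two tag shapes shown in this document; `parseToolCalls` and the `ToolCall` type are hypothetical names, and the real parser may be stricter or more tolerant.

```typescript
type ToolCall =
  | { tool: "web_search"; query: string }
  | { tool: "fetch_url"; url: string };

// Extracts tool blocks from the accumulated model output, in emission order.
function parseToolCalls(text: string): ToolCall[] {
  const found: { index: number; call: ToolCall }[] = [];
  const searchRe = /<web_search>\s*<query>([\s\S]*?)<\/query>\s*<\/web_search>/g;
  const fetchRe = /<fetch_url>\s*<url>([\s\S]*?)<\/url>\s*<\/fetch_url>/g;

  let m: RegExpExecArray | null;
  while ((m = searchRe.exec(text)) !== null) {
    found.push({ index: m.index, call: { tool: "web_search", query: m[1].trim() } });
  }
  while ((m = fetchRe.exec(text)) !== null) {
    found.push({ index: m.index, call: { tool: "fetch_url", url: m[1].trim() } });
  }

  // Sort by position so mixed blocks come back in the order the model wrote them.
  return found.sort((a, b) => a.index - b.index).map((f) => f.call);
}

const output =
  "Let me look that up.\n<web_search>\n  <query>latest AI news</query>\n</web_search>";
const calls = parseToolCalls(output);
// calls: [{ tool: "web_search", query: "latest AI news" }]
```

Parsing only after the stream completes avoids acting on a half-written block.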

Developer documentation

For a contributor-level deep-dive — the full IPC contract, the MLX runtime lifecycle, the chat pipeline, the tool-calling loop, the data model, and the rationale behind the non-obvious design choices — see docs/.


Tech stack

| Layer | Technology | Why |
|---|---|---|
| Shell | Electron 42 | Native Mac app with web tech |
| Bundler | electron-vite (Vite 6) | Sub-second HMR in dev |
| UI | React 19 + TypeScript 5 | Type-safe components |
| Styles | Tailwind CSS 3 | Utility-first, fast iteration. Layout is responsive sm / md / lg — usable down to 700 × 500 |
| Inference | mlx-vlm (Python) on 127.0.0.1:11534 | Multimodal MLX runtime |
| Speech | @huggingface/transformers | In-browser Whisper, WebGPU |
| Embeddings (RAG) | @huggingface/transformers (gte-small) | In-browser document retrieval, WebGPU |
| Markdown | react-markdown + remark-gfm + remark-math + rehype-katex | Full GFM + LaTeX |
| Code highlight | react-syntax-highlighter (Prism) | Custom theme, no token backgrounds |
| Document parsing | pdf-parse, pdfjs-dist, custom OOXML/OMML→LaTeX (jszip + @xmldom/xmldom) | PDF text + page rendering; DOCX & PPTX text with math preserved as LaTeX |
| Search | Tavily API + DuckDuckGo fallback | AI-friendly, reliable |

Storage & settings

Everything lives under ~/Library/Application Support/GemX/:

| Path | Size | Required? | What |
|---|---|---|---|
| mlx/venv/ | ~500 MB | yes | Python virtualenv with mlx-vlm installed |
| mlx/models/ | ~4–19 GB | yes | HuggingFace model cache (Gemma weights) |
| tokens/hf-token | <1 KB | optional | HuggingFace token (gated downloads) |
| tokens/tavily-key | <1 KB | optional | Tavily API key |
| .gemx-setup-complete | 33 B | auto | First-run skip marker |
| Local Storage/ | ~100 KB | yes (your data) | Conversations + settings (Chromium LevelDB files) |
| Service Worker/ | ~150 MB | optional | Whisper model cache |
| Cache/, Code Cache/, GPUCache/, etc. | variable | auto | Chromium runtime caches — safe to delete |

localStorage keys (inside Local Storage/):

  • gemx:conversations:v7 — all your chats
  • gemx:settings:v1 — tools mode, context override
  • gemx:sidebar-open — sidebar visibility state

Want a minimum backup? Copy mlx/, tokens/, and Local Storage/. Everything else regenerates.


Privacy

GemX is local-first by design. Here is every outbound connection it can make:

| When | Where | Why | Optional? |
|---|---|---|---|
| First run only | huggingface.co | Download model weights | Required for first install of each model |
| First voice use only | huggingface.co (CDN) | Download Whisper base.en (~150 MB) | Required for first voice use |
| Web search query | api.tavily.com | Search results (if Tavily key set) | ✓ disable in Settings |
| Web search query (fallback) | html.duckduckgo.com | Search results (if no Tavily key) | ✓ disable in Settings |
| fetch_url tool | The destination domain | Page content | ✓ model only calls this if you ask it to |
| Connectivity check | none — uses navigator.onLine (OS-level) | Online/offline detection | n/a — no network call |

No telemetry, no analytics, no crash reporting, no auto-update pings. Your conversations are never transmitted anywhere.


Building a distributable

npm run dev        # Dev mode with HMR (Vite + Electron)
npm run typecheck  # TypeScript check across main + renderer
npm run build      # Compile + bundle
npm run dist       # Package into a .dmg installer (notarizing it requires your own Apple Developer ID)

npm run dist produces dist/GemX-<version>.dmg. Mount it and drag GemX.app to /Applications.


Troubleshooting

| Symptom | Fix |
|---|---|
| First run hangs on "Installing MLX" | Verify python3 --version is 3.10–3.13. Reinstall via brew install python@3.13 if needed. |
| Model download fails / 401 | Add a free HuggingFace token in Settings. |
| "DuckDuckGo blocked this request" | DDG is bot-challenging your IP. Get a free Tavily key (1 000/mo) and set it in Settings. |
| Voice transcription doesn't work | Grant microphone permission in macOS System Settings → Privacy. WebGPU falls back to WASM automatically. |
| OOM on long chats | Lower the Context Window in Settings, or switch to a smaller model. The 26B MoE is compute-efficient (only 4 B params active per token), but all 26 B parameters still sit in memory (~16 GB of weights, 24 GB RAM minimum), so it is not a fix for a 16 GB Mac. |

Release notes

The full set of release notes for every version — what changed, what shipped, and the rationale behind each round of work — lives in the Releases section. Each tagged release has the matching .dmg artifact and a complete changelog.

Credits

GemX was inspired by Ammaar Reshi's gemma-chat — the project that sparked the idea of a local-first Gemma chat app on Apple Silicon. Thank you.

  • Gemma by Google DeepMind — the model family this is built around
  • MLX by Apple Machine Learning Research — the inference framework
  • mlx-vlm by Prince Canuma — the multimodal runtime that makes this possible
  • transformers.js by HuggingFace — Whisper in the browser
  • Cline — XML tool-call format inspiration
  • Tavily — search API built for AI agents

Created by Avaneesh.

License

MIT — see LICENSE.
