Jarvis is a powerful, extensible, and privacy-focused AI assistant built natively for macOS. It combines an ultra-fast local voice engine, on-device biometric security, and an online-first LLM brain to deliver an "Iron Man" style assistant experience. Featuring a robust Python backend and a native Swift UI desktop client connected via a Socket API, Jarvis executes highly complex "Nuclear" skills—from autonomously bootstrapping entire codebases to mimicking physical workflows. Your AI, your machine.
Quick Start · View Architecture · Report Issue
- Ultra-Fast Voice Engine: Sub-200ms latency using hybrid Apple Speech Recognition and local Faster Whisper fallback. VAD optimized for instant conversational gaps.
- Privacy & Security First: Online-first processing architecture that can also load the LLM locally, on-demand and fully offline. Features FaceID biometric authentication before executing sensitive commands.
- "Nuclear" Capabilities: Includes Architect Mode (project scaffolding), The Mimic (macro recording/playback), Content Assassin (YouTube/media summarization), and Dead Drop (secure file hand-offs).
- Gesture Control: Built-in webcam hand-gesture tracking to control the macOS cursor.
- Extensible Architecture: Modular "Skill" systems operating through a central Service Registry and Event Manager.
- Features Overview
- Use Cases
- Nuclear Capabilities
- Voice & AI Architecture
- Security: FaceID
- Installation & Quick Start
- Deep Dive Documentation (Phases 1-15)
- System Architecture Flow
- Custom Skill Development Guide
- Complete Configuration Reference
- Troubleshooting & FAQ
| Feature | Description |
|---|---|
| Voice Engine | Hybrid Apple STT + Whisper fallback. <200ms latency with 100ms VAD detection. |
| AI Brain | Cloud-first execution via Groq/OpenRouter (Llama 3.3) with local Ollama fallback. |
| FaceID Auth | Local biometric facial recognition required for secure commands (e.g., shutdown, delete). |
| Smart Web Search | AI-powered sub-500ms intent classifier determines if external factual data is needed. |
| Gesture Cursor | Control your Mac with hand signs (Point, Pinch, Fist, Peace). |
| Swift HUD | Native macOS frontend communicating via TCP port 8492 sockets. |
| Automation | System volume, brightness, Apple Music/Spotify, focus modes, and dynamic audio ducking. |
| Offline Mode | Core functionality runs completely off-grid without internet dependencies. |
- Instant Project Scaffolding: Use Architect Mode to generate entire Python projects, complete with `main.py`, `requirements.txt`, and virtual environments, completely hands-free.
- Rubber Duck Debugging: Talk through complex logic verbally while Jarvis uses the Ollama Llama 3.2 engine to provide local, private code analysis.
- Physical Automation: Use The Mimic to record a complex series of mouse movements and keystrokes (like formatting a spreadsheet), and play it back at 2x speed.
- Media Summarization: Feed Jarvis a YouTube URL; Content Assassin will extract the subtitles and build a comprehensive Markdown summary of the video.
- Zero-Cloud Execution: Keep your data on your machine. Jarvis's core brain and voice engine can operate completely offline.
- Biometric Security: FaceID ensures that destructive commands (like system shutdown or file deletion) are strictly executed only when you are physically recognized by the webcam.
These advanced skills elevate Jarvis beyond a standard voice assistant:
Architect Mode: Autonomously builds entire codebases. Ask Jarvis to "Build a Snake game in Python", and it will:
- Scaffold complete project folders.
- Write functional `main.py` and `requirements.txt`.
- Launch the application programmatically.
The Mimic: Advanced macro recording and playback.
- Command: "Watch this" starts recording mouse clicks, movements, and keystrokes.
- Command: "Mimic recent macro" replays the exact sequence at 1x, 2x, or 0.5x speed.
Content Assassin: Deep video and media intelligence.
- Instantly downloads YouTube subtitles from a provided URL.
- Utilizes the LLM to generate clean, formatted markdown study notes and content summaries.
Dead Drop: Secure, ephemeral data hand-off.
- Uploads specified Finder files to secure ephemeral hosts (Oshi.at, PixelDrain).
- Generates a terminal QR code for instant, seamless mobile downloads.
- Primary: Apple's native macOS Speech Recognition framework (forced on-device processing).
- Fallback: Faster Whisper (optimized for CPU/int8). Loads lazily only if Apple STT fails.
- Audio Ducking: Automatically lowers system media volume while listening and speaking.
- Cloud Priority: Heavy reliance on cloud-hosted Llama models for context and chat, with local Ollama models available on demand.
- Cloud Escalation: High-complexity tasks (code generation, heavy intent routing) are dynamically routed to Groq (`llama-3.3-70b-versatile`) or OpenRouter to ensure maximum intelligence without blocking the main event loop.
Jarvis utilizes an integrated FaceID module for zero-trust local execution.
- Reference Image: Set your face in `data/me.jpg`.
- Trigger: When destructive or high-security commands are detected (e.g., "Lock my screen", "Shut down"), Jarvis activates the webcam.
- Verification: Uses the `face_recognition` library to process biometrics locally. Execution is denied if unrecognized.
- System Control: Manage Spotify, Chrome, Focus Mode, brightness, and volume.
- Scheduling & Alarms: Native macOS Clock integration, Reminders, and Calendar event parsing.
- Communications: Read and send emails, manage Contacts.
- Utilities: Natural language math calculator, real-time weather, news fetching, and text translation.
- OS: macOS 13+ (Apple Silicon recommended for optimal AI performance)
- Runtime: Python 3.10+ (3.11 recommended)
- Tools: Xcode Command Line Tools (`xcode-select --install`)
```bash
git clone https://github.com/Sam-06060/jarvis-assistant.git
cd jarvis-assistant
```

The bootstrap script automatically creates your virtual environment, installs Python dependencies, and prepares directories.
```bash
# Standard runtime
./scripts/bootstrap_macos.sh

# With developer tools
./scripts/bootstrap_macos.sh --dev
```

Copy the template and edit your API keys.
```bash
# Rename template if the script didn't already
cp .env.example .env
```

Edit `.env` and configure:
- `PICOVOICE_API_KEY`: Required for the wake word engine (get it at the Picovoice Console).
- `OPENROUTER_API_KEY` / `GROQ_API_KEY`: Required for advanced cloud LLM queries.
- `REFERENCE_IMAGE_PATH`: Path to your security photo for FaceID.
Run the built-in diagnostic tool to ensure everything is perfect before first boot.
```bash
.venv/bin/python scripts/doctor.py --strict
```

Compile the Swift frontend and fire up the system.
```bash
# Build the native macOS HUD
cd JarvisApp
./build_app.sh
cd ..

# Start the Python Backend + Swift App
./start_jarvis.sh
```

- Language: Python 3.11
- Design Pattern: Service-oriented. Core processes (Speech, Brain, Files) register with a centralized `ServiceRegistry`.
- Concurrency: Threaded listener and isolated worker loops. Heavy IO plugins (like models) use Lazy Loading via `ServiceProxy` to guarantee instant start-up.
- IPC: Communicates with the Frontend via a custom Socket Server on port `8492`.
- Language: Swift 5
- UI: Custom borderless floating window HUD.
- Responsibilities: Receives state telemetry (IDLE, LISTENING, PROCESSING) and displays real-time subtitles and system feedback.
### Phase 1: Core System Architecture & Event Loop
Reference Files: `jarvis.py`, `core/registry.py`, `core/events.py`, `core/proxy.py`
Instead of having one giant main file that controls everything and gets messy over time, Jarvis uses a ServiceRegistry.
- How it works: When a new system part starts up (like the speech engine or the AI brain), it "registers" itself here.
- Why it matters: If the speech engine needs to talk to the AI brain, it doesn't need to be hard-coded to find it. It just asks the registry, "Hey, can you give me the brain?" This makes the code very clean, easy to update, and if a part crashes, Jarvis can just restart that specific part without breaking the whole system.
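The registry pattern described above can be sketched in a few lines. This is an illustrative minimal version, not the project's actual `core/registry.py` API:

```python
class ServiceRegistry:
    """Central lookup table: services register by name, consumers fetch them."""

    def __init__(self):
        self._services = {}

    def register(self, name, service):
        # A restarted service simply re-registers under the same name.
        self._services[name] = service

    def get(self, name):
        # Raises KeyError if the service never registered (or crashed and
        # hasn't come back yet) -- callers can handle that gracefully.
        return self._services[name]


# The speech engine can now find the brain without a hard-coded reference:
registry = ServiceRegistry()
brain = object()  # stand-in for the real AI brain service
registry.register("brain", brain)
found = registry.get("brain")
```

Because lookups go through one table, swapping or restarting a service never requires touching its consumers.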
Jarvis is designed to start up instantly, just like the built-in Mac dictation.
- How it works: To boot up fast, Jarvis doesn't load heavy features (like the Music Controller, Alarms, or Calculator) right away. Instead, it puts a lightweight "Smart Proxy" in their place.
- Why it matters: These proxies secretly load the real, heavy features in the background while Jarvis is already listening to you. If you ask Jarvis to do something before the background loading is finished (like asking for the weather right as you open the app), the proxy kindly tells you: "One moment sir, the Weather Service is still loading." This prevents the app from freezing or crashing.
Jarvis has an internal messaging system called the EventManager.
- How it works: Imagine a radio station. One part of the code can broadcast a message (like "The system health has changed!"), and any other part of the code that cares about system health can tune in and listen.
- Why it matters: This means different parts of the code don't have to be tied together directly. They just listen for announcements, which keeps the system fast and flexible.
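The "radio station" analogy is a classic publish/subscribe bus. A minimal sketch (illustrative only, not the actual `core/events.py`):

```python
from collections import defaultdict

class EventManager:
    """Tiny publish/subscribe bus: broadcasters and listeners never meet directly."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def subscribe(self, topic, callback):
        self._listeners[topic].append(callback)

    def publish(self, topic, payload=None):
        for callback in self._listeners[topic]:
            callback(payload)


events = EventManager()
received = []
# Any part of the code can "tune in" to a topic it cares about...
events.subscribe("system.health", received.append)
# ...and any other part can broadcast without knowing who is listening.
events.publish("system.health", {"status": "degraded"})
```

Adding a new listener never requires changing the broadcaster, which is exactly why the parts stay decoupled.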
The JarvisApp is the main engine running the whole show. It runs in a continuous loop, cycling between IDLE (waiting), LISTENING, and PROCESSING.
A. Getting Ready: When you first start Jarvis, it quickly checks to make sure it has permission to use your Mac's microphone, camera, and other settings. It also starts a connection to talk to the graphical desktop app (the Swift HUD).
B. Waiting for You: While Jarvis is IDLE, it uses almost 0% of your computer's power. It patiently waits for either a typed message from the desktop app or for you to speak the wake word.
- Audio Ducking: The moment Jarvis hears its name, it automatically lowers your Mac's volume and pauses your music so it can hear you clearly.
- FaceID Check: Before it does anything dangerous (like shutting down your Mac), it can quickly check the webcam to make sure it's actually you giving the command.
C. The Safety Net (Interrupt Guard): This is the most important part of the main loop. It allows you to interrupt Jarvis at any time.
- How it works: When Jarvis starts performing a long task (like generating code using cloud AI), it puts that heavy work in the background. Meanwhile, the main system keeps a specialized ear open.
- Why it matters: If you say "Jarvis, stop" or hit the emergency kill switch, Jarvis instantly throws away the background work, stops talking, and immediately goes back to waiting for your next command. It never gets "stuck" thinking.
### Phase 2: Voice Engine Deep Dive
Reference Files: `modules/speech.py`, `utils/audio_manager.py`
Jarvis uses a "Hybrid" approach to listening and speaking to be as fast as possible.
- How it works: Instead of relying entirely on heavy offline models that drain battery, Jarvis secretly uses Apple's built-in macOS Dictation engine to convert your speech to text.
- Why it matters: Apple's engine is incredibly fast (under 200 milliseconds to understand you). However, if Apple's engine fails or doesn't understand your accent, Jarvis has a backup plan: it instantly switches to "Faster Whisper," a completely offline AI model that guarantees it will understand you, even without the internet.
When you talk to most voice assistants, you have to wait a second or two after you finish speaking before they realize you are done. Jarvis uses Voice Activity Detection (VAD).
- How it works: Jarvis is tuned to detect silences as short as 100 milliseconds.
- Why it matters: The millisecond you stop speaking, Jarvis immediately starts processing your command without any awkward pauses, making conversations feel highly natural and snappy. It's smart enough to filter out fans, humming, and other background noises so it doesn't accidentally trigger.
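One common way to implement this kind of end-of-speech detection is an energy threshold over short audio frames. The sketch below is our own illustration of the 100 ms idea, not the project's actual VAD; all numbers are illustrative:

```python
def utterance_finished(frame_energies, frame_ms=10, threshold=0.02, silence_ms=100):
    """Return True once audio energy stays below `threshold` for `silence_ms`.

    `frame_energies` is a stream of per-frame RMS energy values (10 ms frames
    here), so 100 ms of silence means 10 consecutive quiet frames.
    """
    needed = silence_ms // frame_ms
    quiet = 0
    for energy in frame_energies:
        # Loud frames (speech, or noise above the floor) reset the counter.
        quiet = quiet + 1 if energy < threshold else 0
        if quiet >= needed:
            return True
    return False
```

Setting the threshold above the room's noise floor is what filters out fans and humming, while the consecutive-frame requirement prevents a single quiet frame mid-word from cutting you off.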
Nothing is more annoying than an assistant that tries to talk over your loud music. Jarvis handles your Mac's audio dynamically.
- How it works: When you say "Jarvis," `audio_manager.py` instantly scans your Mac to see what is making noise. Is Spotify open? Is Apple Music playing? Is a YouTube video playing in background tabs on Safari or Chrome?
- Why it matters: Jarvis uses hidden AppleScripts and JavaScript to physically turn down the exact app making the noise (`duck_audio()`). It won't lower your Mac's master volume (which would make Jarvis's own voice too quiet to hear), but it will perfectly lower Spotify so you can chat. When the conversation is over, it returns your music to the exact volume it was at before!
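Per-app ducking of this kind typically rides on AppleScript's `sound volume` property, which Spotify and Apple Music expose. The helper below is a hedged sketch of the approach; the function names are ours, not the actual `duck_audio()` implementation:

```python
import subprocess

def build_duck_script(app, volume):
    """Build the AppleScript that sets one app's own volume (0-100)."""
    return f'tell application "{app}" to set sound volume to {int(volume)}'

def duck_audio(app="Spotify", volume=20):
    """Lower a single app's volume without touching the system master volume."""
    script = build_duck_script(app, volume)
    # macOS only: osascript executes the AppleScript snippet.
    subprocess.run(["osascript", "-e", script], check=False)
```

Restoring the music afterwards is the same call in reverse: read the app's current `sound volume` before ducking, then set it back when the conversation ends.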
### Phase 3: The AI Brain & Multi-Model Routing
Reference Files: `modules/brain.py`, `modules/groq_client.py`, `modules/conversation_history.py`
Jarvis uses a "Cloud-First" approach for standard conversations so it never slows down your Mac.
- How it works: When you ask a normal question, Jarvis secretly sends it to ultra-fast cloud servers (like Groq or OpenRouter) running massive AI models like Llama 3.3.
- Why it matters: Your Mac's fans will never turn on. You get the intelligence of a massive supercomputer, but it happens so fast (often under 1 second) that it feels like the brain is running locally.
What happens if your internet goes out or the cloud servers crash?
- How it works: Jarvis has a built-in safety net. It will automatically switch to "Local Mode" and use Ollama (running Llama 3.2 directly on your Mac's chip).
- Why it matters: This ensures Jarvis is always available to help you, even if you are on an airplane with no Wi-Fi. It will gracefully switch back and forth depending on your connection.
AI models traditionally don't remember what you said 5 minutes ago. Jarvis fixes this.
- How it works: Every time you talk to Jarvis, the `ConversationHistory` module saves a tiny text log. Before sending your new question to the AI, Jarvis secretly injects the last 5 things you talked about into the background code (`self._build_system_prompt()`).
- Why it matters: This makes conversations flow naturally. If you say "Who is Elon Musk?" and Jarvis answers, you can follow up with "How old is he?". Because Jarvis injected the memory of the last message, the AI knows exactly who "he" is.
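The memory-injection idea can be sketched with a bounded deque. Names mirror the text but are illustrative; the real `modules/conversation_history.py` may differ:

```python
from collections import deque

class ConversationHistory:
    """Keep the last few exchanges and fold them into the system prompt."""

    def __init__(self, max_turns=5):
        # deque with maxlen silently drops the oldest turn once full.
        self._turns = deque(maxlen=max_turns)

    def add(self, user, assistant):
        self._turns.append((user, assistant))

    def build_system_prompt(self, base="You are Jarvis."):
        lines = [base, "Recent conversation:"]
        for user, assistant in self._turns:
            lines.append(f"User: {user}")
            lines.append(f"Jarvis: {assistant}")
        return "\n".join(lines)


history = ConversationHistory()
history.add("Who is Elon Musk?", "He runs SpaceX and Tesla.")
prompt = history.build_system_prompt()
```

Because the prompt is rebuilt from the deque on every request, a follow-up like "How old is he?" arrives at the model with the previous exchange already in context.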
### Phase 4: Command Processing & Intent Recognition
Reference Files: `modules/commands.py`, `utils/fuzzy_matcher.py`
When you speak to Jarvis, your words don't just go to the AI. They pass through a strict filter of "Skills".
- How it works: Jarvis has a prioritized list of over 20 skills. For example, the `InteractionSkill` (handling commands like "Stop" or "Shut down") is at the very top. The `CalculatorSkill` is in the middle. The `AIBrain` is at the very bottom.
- Why it matters: If you say "Stop listening", the `InteractionSkill` instantly catches it and shuts down the microphone faster than the AI could ever process the text. This hierarchy makes Jarvis incredibly fast for everyday tasks, while still letting it use the AI for complex questions.
Sometimes, it's hard to tell if you are asking a question or giving a command.
- How it works: Before any skill runs, Jarvis passes your words through a fast NLP (Natural Language Processing) engine called the `IntentRouter`. It gives your command a mathematical "confidence score."
- Why it matters: If you say "Build me a new login screen", a standard assistant might just tell you what a login screen is. But Jarvis's Intent Engine recognizes the intent to create software (scoring 90%+ confidence) and automatically routes your command to the powerful "Architect Mode" instead of just answering the question.
Humans make mistakes, and microphones sometimes mishear words. Jarvis is designed to be forgiving.
- How it works: If you say "Open Crome" instead of "Google Chrome", the `FuzzyMatcher` kicks in. It compares your broken command against a massive dictionary of known apps and commands, looking for a match that is at least 70% similar.
- Why it matters: You don't have to speak like a robot. You can mumble "play spotfy" or "open my calender", and Jarvis will automatically fix the typo in the background and execute the correct command without complaining that it didn't understand you.
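The 70%-similarity idea can be reproduced with the standard library's `difflib`. This is our own sketch; the project's `FuzzyMatcher` may use a different algorithm and a much larger dictionary:

```python
import difflib

# Tiny stand-in for the "massive dictionary of known apps and commands".
KNOWN_TARGETS = ["Google Chrome", "Spotify", "Calendar", "Safari"]

def fuzzy_match(heard, targets=KNOWN_TARGETS, cutoff=0.7):
    """Return the closest known target at >= `cutoff` similarity, else None."""
    matches = difflib.get_close_matches(
        heard.lower(), [t.lower() for t in targets], n=1, cutoff=cutoff
    )
    if not matches:
        return None
    # Map the lowercase match back to its canonical capitalization.
    return next(t for t in targets if t.lower() == matches[0])
```

So a mumbled "spotfy" still resolves to Spotify, while genuinely unrecognizable input falls through to the AI brain instead of triggering the wrong app.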
### Phase 5: Security Architecture & FaceID
Reference Files: `modules/security.py`, `utils/permission_checker.py`
Because Jarvis has deep access to your Mac, certain commands (like "Shut Down" or "Restart") are highly dangerous if a guest uses them.
- How it works: When Jarvis detects a dangerous command, it refuses to run it immediately. Instead, it silently snaps a photo using your Mac's webcam and compares it mathematically to a secure reference photo (`data/me.jpg`) you provide.
- Why it matters: If your friend walks into the room and yells "Jarvis, sleep my Mac!", Jarvis will scan their face, realize it's not you, say "Access Denied," and completely ignore them.
FaceID is great, but webcams struggle in the dark.
- How it works: When FaceID turns on, Jarvis instantly checks the "brightness level" of the webcam feed. If the room is completely dark (brightness under 60%), Jarvis uses AppleScripts to instantly crank your Mac's screen brightness to 100% and displays a pure white window to illuminate your face.
- Why it matters: It acts exactly like a smartphone's FaceID flash, ensuring you can authenticate and lock your screen even in a pitch-black room. When the scan finishes, Jarvis safely returns your screen brightness to normal.
To work properly, Jarvis needs 10 different macOS permissions (Microphone, Camera, Accessibility, etc.). Usually, apps spam you with 10 error popups at once.
- How it works: Jarvis has a built-in "Doctor." Every time it turns on, it quietly checks all 10 permissions in the background without triggering any annoying macOS alerts.
- Why it matters: It only bothers you if something is actually broken. If it finds you are missing exactly 1 permission (like Contacts), it prints a beautiful terminal checklist showing 9/10 permissions are good, and then cleanly opens only the specific Apple Settings page you need to fix.
### Phase 6: Nuclear Skill - Architect Mode
Reference Files: `modules/skills/architect_skill.py`
When you ask a normal voice assistant to "build a weather app," it will probably just read you a Wikipedia article about weather apps or give you a short 10-line Python script that barely works. Jarvis doesn't do that.
Jarvis features a "Nuclear Skill" called Architect Mode. When triggered, Jarvis completely shifts from being a conversational assistant into becoming a fully autonomous Software Engineer. It designs the architecture, writes the code, structures the file system, and saves a ready-to-run folder directly to your Desktop.
This module (architect_skill.py) is one of the most complex in the entire system, featuring a robust multi-stage pipeline designed to guarantee that the code it generates actually works, rather than just looking good on paper.
How does Jarvis know when to just answer a question vs. when to build an entire application?
- How it works: If you say "How do I build a website?", Jarvis will simply have a conversation with you. But if you use high-intent phrases like "Build an app," "Make a project," or "Scaffold a dashboard," the NLP Intent Engine instantly routes your request to Architect Mode. It explicitly ignores trigger words if they are mixed with other commands (like "Make an alarm," which goes to the Alarm Skill).
- Why it matters: You don't have to use special robotic wake words like "Enter Developer Mode." You just talk naturally, and Jarvis uses mathematical confidence scores to understand when you want it to act as a programmer.
Standard AI models are notoriously lazy. If you ask them to build a large feature, they will often write `// ... rest of the code here ...` to save time. Jarvis fights this laziness using a rigorous pipeline.
- Stage 1: Strict Formatting (`<file>` tags). Jarvis forces the AI to output exactly one thing: raw XML code. It uses `<file name="index.html">` tags to separate out the HTML, the CSS, and the JavaScript.
- Stage 2: The Stub Check. Before saving anything to your Mac, Jarvis scans the generated code. If it finds any "stubs" (files smaller than 20 characters, or containing the text `...`), the build is immediately paused.
- Stage 3: Multi-Pass Generation. If stubs were detected, Jarvis abandons the single-pass approach. Instead, it reads the names of the files that the AI intended to create, and loops back to the AI, asking it to generate every single file individually. It even passes the previously generated files into the prompt as context, so the CSS perfectly matches the HTML.
- Why it matters: This ensures every single file in your project is 100% complete and ready for production, saving you from having to manually fill in the blanks the AI left behind.
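The stub check in Stage 2 amounts to a simple filter over the generated files. A sketch, using the thresholds quoted above (under 20 characters, or placeholder ellipses); the function name and exact heuristics are ours, not `architect_skill.py`'s:

```python
import re

def find_stubs(files):
    """Return the names of generated files that look like lazy placeholders.

    `files` maps filename -> generated content.
    """
    stubs = []
    for name, content in files.items():
        body = content.strip()
        if len(body) < 20 or "..." in body or re.search(r"rest of the code", body):
            stubs.append(name)
    return stubs


generated = {
    "index.html": "<!doctype html><html><body>Hello</body></html>",
    "app.js": "// ... rest of the code here ...",
    "style.css": "...",
}
flagged = find_stubs(generated)
```

A non-empty result is what flips the build from single-pass into the per-file multi-pass loop of Stage 3.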
Building software is rarely a one-shot process. Usually, you need to make changes after seeing the first version.
- How it works: Every time Jarvis builds a project, it creates a new folder in `~/Desktop/Jarvis_Builds/` (e.g., `Weather_App_20260301`). It remembers this path! If you look at the app and say, "Jarvis, change the background to dark blue and make the text bigger," Jarvis automatically scans that entire folder, reads all the existing HTML/CSS/JS files, and sends them back to the AI as "Previous Context."
- Why it matters: You can build, refine, and polish software entirely through conversation. It feels like pair-programming with a real human who remembers exactly what you are both working on.
What happens if you ask for an update, and the AI makes a mistake that ruins your perfectly good code?
- How it works: Jarvis never overwrites your previous work directly. If you are iterating on a project, Jarvis performs an "Atomic Copy" of the entire folder (e.g., creating `Weather_App_v2`).
- The Strict Snippet Guard: Before writing the new code to the new version folder, Jarvis counts the characters. If the new file is less than 50% the size of the old file, Jarvis assumes the AI made a mistake and tried to give you a "snippet" instead of a full file. It blocks the update, prevents code loss, and saves the broken snippet as a `.snippet` file for your review.
- Why it matters: You can confidently ask Jarvis to make bold changes, knowing that your previous, working version is always safely preserved.
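The snippet guard boils down to a size-ratio comparison before any write. A minimal sketch using the 50% threshold from the text (the function name and return shape are ours):

```python
def guard_update(old_content, new_content, ratio=0.5):
    """Decide whether an AI-updated file is safe to write.

    Returns (allowed, content). When blocked, the caller would save `content`
    as a .snippet file for review instead of overwriting the working version.
    """
    if len(new_content) < len(old_content) * ratio:
        return False, new_content  # suspiciously small: likely a snippet, block it
    return True, new_content


# A file that shrank from 1000 to 300 characters trips the guard:
ok, _ = guard_update("x" * 1000, "x" * 300)
```

Combined with the atomic copy, a bad AI pass can at worst produce a quarantined `.snippet` file, never a destroyed working file.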
We didn't want Jarvis building websites that look like they belong in 1999.
- How it works: When generating UI, Jarvis secretly injects high-end design requirements into the AI's system prompt. It explicitly forbids "fixed widths" (like `width: 400px`) and demands fluid, scalable layouts using `clamp()`, `rem`, `vw`, and CSS Grid. It also automatically pulls in CDNs for modern libraries like Tailwind CSS, GSAP, or FontAwesome.
- Intelligent Asset Replacement: If the AI is lazy and uses empty image tags like `<img src="">` or a broken placeholder, Jarvis intercepts it before saving and automatically replaces it with a beautiful, random image from Unsplash (source.unsplash.com).
- Why it matters: The code doesn't just work: it looks professional, scales to any device size, and is visually engaging from the very first second it opens in your browser.
Jarvis acts like a neat developer tracking its own work.
- How it works: Inside every project folder, Jarvis generates a hidden `jarvis_manifest.json` file. It records the date, the exact voice command you used, and automatically analyzes the files to figure out what "stack" it used (e.g., `web_vanilla`, `node_js`, `python`).
- Why it matters: If you come back to the project a week later and ask for an update, Jarvis reads this manifest. If it sees the project is built in React, it forces the AI to stick to React, ensuring the technology stack doesn't accidentally change halfway through development.
Even the smartest cloud AIs sometimes output invalid code.
- Ollama Repair Protocol: If the cloud AI puts code outside the `<file>` XML tags, standard code parsers would normally crash. Before giving up, Jarvis sends the broken mess to its local brain (Ollama) with the strict instruction: "Fix these XML tags." The local brain repairs the formatting so the build can continue.
- Syntax Validation: Before saving a `.py` Python script, Jarvis runs an AST (Abstract Syntax Tree) check. Before saving `.js` or `.css` files, it checks for basic errors like unbalanced curly braces `{}` or parentheses `()`. If an error is found, Jarvis automatically asks the AI to repair the syntax.
- Why it matters: You consistently get a functioning codebase, even if the AI hallucinates bad syntax on its first try. Jarvis acts as its own Quality Assurance tester.
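The validation pass described above can be sketched with the standard library: `ast.parse` for Python files, and a cheap balanced-delimiter count for JS/CSS. The function is illustrative; the real checks in `architect_skill.py` may be stricter:

```python
import ast

def validate(filename, source):
    """Return True if a generated file passes a basic syntax sanity check."""
    if filename.endswith(".py"):
        try:
            ast.parse(source)  # full Python parse: catches real syntax errors
            return True
        except SyntaxError:
            return False
    if filename.endswith((".js", ".css")):
        # Crude but fast: unbalanced braces/parens usually mean truncated output.
        for open_ch, close_ch in ("{}", "()"):
            if source.count(open_ch) != source.count(close_ch):
                return False
        return True
    return True  # other file types pass through unchecked
```

A failing file is exactly what would be sent back to the model with a "repair the syntax" instruction before anything touches the disk.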
### Phase 7: Nuclear Skill - The Mimic
Reference Files: `modules/mimic.py`
While Architect Mode writes software, The Mimic writes automation. It is a powerful "Nuclear Skill" that allows Jarvis to physically take control of your mouse and keyboard to perform repetitive tasks on your behalf. Rather than relying on complex API integrations or fragile web scrapers, The Mimic operates exactly like a human would: by clicking buttons, dragging the cursor, and typing on the keyboard.
This means Jarvis can automate any application on your Mac—even older, legacy apps, terminal interfaces, or secure websites that strictly forbid developer APIs—simply by watching you do it once.
When you instruct Jarvis to "Watch this," it enters a specialized, heightened observation state.
- How it works: The Mimic utilizes the `pynput` library to instantiate extremely lightweight, non-blocking background listeners for both your mouse and keyboard. Once activated, every single physical action you take is intercepted and recorded into a real-time event array in active memory.
- What it tracks:
  - Mouse Movement (`on_move`): Tracks exact X and Y coordinate translations of your cursor across the screen. To prevent massive file sizes and system memory bloat, it heavily optimizes the capture rate so it doesn't log useless micro-jitters, focusing only on meaningful cursor travel.
  - Mouse Actions (`on_click`, `on_scroll`): Records left, right, and middle clicks, tracking both the exact moment the button is pressed and when it is physically released. It also captures exact scroll wheel deltas, allowing Jarvis to pan up and down webpages exactly as far as you did.
  - Keyboard Strokes (`on_press`, `on_release`): Captures individual alphanumeric characters and special `KeyCode` events (like Shift, Command, or Option). By tracking both the strict press and release of a key, it can perfectly understand and record complex, multi-key keyboard shortcuts (like Cmd+Shift+4).
- Why it matters: Because the event listeners are non-blocking and deeply optimized, your Mac doesn't freeze, stutter, or slow down while you record the macro. You perform the task entirely naturally, and Jarvis quietly builds a mathematical map of your exact physical inputs in the background without you noticing.
A recorded macro is completely useless if the recorded clicks happen too fast for the computer to actually load the next window. Computers execute code in milliseconds, but humans take seconds.
- How it works: Instead of just recording what you clicked, The Mimic engine records when you clicked it. Every time an action is logged (like a click or a keypress), the engine calculates the exact `time.time()` differential since the very previous event. It attaches this precise delay to the event object.
- Why it matters: If you click a login button, wait exactly 3.5 seconds for a slow webpage to finish loading, and then begin typing a password, Jarvis will see that exact visual delay. During the replay, Jarvis will artificially sleep for exactly 3.5 seconds before it starts typing. This perfectly preserves the human cadence of your actions, drastically reducing the chances of a macro failing or mis-clicking due to a slow-loading UI element or a sluggish network connection.
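The delay-stamping idea can be sketched as a recorder that stores the gap since the previous event alongside each action. The structure below is hypothetical, not the exact event format in `modules/mimic.py`:

```python
import time

class Recorder:
    """Log events with the time gap since the previous one, for faithful replay."""

    def __init__(self):
        self.events = []
        self._last = None

    def log(self, kind, data, now=None):
        # `now` is injectable so the timing logic can be tested deterministically.
        now = time.time() if now is None else now
        delay = 0.0 if self._last is None else now - self._last
        self._last = now
        self.events.append({"kind": kind, "data": data, "delay": round(delay, 3)})
```

Storing the delay on the event itself (rather than absolute timestamps) is what makes the recording trivially replayable and rescalable later.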
Once you finish demonstrating the physical task, you explicitly tell Jarvis, "Set this as Server Backup."
- How it works: Jarvis cleanly halts the `pynput` listener threads and immediately runs the captured event array through a custom serialization helper. Because native Python `pynput` `Key` objects cannot be saved directly to a standard text file, the helper converts them back into raw, parsable string formats (differentiating between standard characters and special system keys).
- Storage: The entire physical sequence is structurally dumped into a local folder as `macros/server_backup.json`.
- Why it matters: Your physical automations persist across computer reboots. You can build up an entire local library of incredibly complex workflows, like "Run Morning Server Checks", "Export Video Timeline", or "Format Weekly Report", and they are permanently stored as lightweight, human-readable JSON files. Because they are plain text, you can even open the JSON file and manually edit the delay of a specific click to perfectly optimize the timing without having to re-record the whole sequence!
When it's time to put Jarvis to work, you simply say "Mimic Server Backup."
- How it works: Jarvis reads the saved JSON file, loads the events back into memory, and boots up a dedicated `_replay_thread`. This is a crucial architectural decision: by running the physical macro in an isolated background thread, the main Jarvis voice assistant stays wide awake and completely responsive.
- The Magic of Speed Multipliers: When you call the macro, you don't have to sit there and watch it run at your slow, original human speed. The `execute` function accepts a `speed_multiplier` argument (e.g., `1.5` or `0.5`). Right before executing the `time.sleep()` for the recorded delay, the engine divides the delay by your multiplier.
- Why it matters: You can intentionally record a slow, highly careful data-entry task at 1x speed to ensure you don't make any mistakes. Later, you can tell Jarvis to "Mimic Data Entry at double speed." Jarvis will flawlessly execute your exact clicks and keystrokes twice as fast as you ever could, physically turning hours of tedious clicking into seconds of automated background work.
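The divide-by-multiplier timing can be sketched as below. `sleep_fn` and `perform` are injectable purely so the logic is testable without real waits; the actual `execute()` in `modules/mimic.py` likely has a different signature:

```python
import time

def replay(events, speed_multiplier=1.0, sleep_fn=None, perform=None):
    """Replay recorded events, scaling every recorded delay by the multiplier."""
    sleep_fn = sleep_fn or time.sleep
    for event in events:
        # A 3.5 s recorded pause becomes 1.75 s at 2x, or 7 s at 0.5x.
        sleep_fn(event["delay"] / speed_multiplier)
        if perform:
            perform(event)  # e.g. move the cursor, press the key
```

Dividing the delay (rather than multiplying it) means a multiplier above 1 speeds the macro up, which matches the "at double speed" command described above.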
There is a massive flaw with building Python apps that control your mouse: when a Python script takes control, the active Python terminal window usually jumps to the very front of your screen, physically blocking and covering up the application you are actually trying to automate.
- How it works: Jarvis anticipates and fixes this by digging into Apple's native Objective-C APIs. Directly before executing the macro, The Mimic triggers a hidden `_hide_dock_icon()` system function. It explicitly imports macOS's native `AppKit` and forces the active `NSApplication` instance to switch its activation policy to `NSApplicationActivationPolicyAccessory` (value 1).
- Why it matters: This natively hides the Jarvis Python icon from your macOS Dock and completely removes it from the `Cmd+Tab` App Switcher. Jarvis essentially becomes a "ghost process": it can control your mouse and keyboard smoothly without constantly asserting its own application window to the front. This allows it to seamlessly operate underlying apps like Safari, Microsoft Excel, or Final Cut Pro without any visual interruption or frustrating focus-stealing bugs.
### Phase 8: Nuclear Skills - Content Assassin & Dead Drop
Reference Files: `modules/content_assassin.py`, `modules/dead_drop.py`
Sometimes you don't need Jarvis to write code or move your mouse; you just need it to handle digital logistics instantly. Phase 8 introduces two highly specialized "Nuclear Skills" designed for rapid information extraction and secure file sharing.
Watching a 45-minute YouTube lecture or tutorial is incredibly time-consuming. Content Assassin allows Jarvis to instantly "assassinate" a video and extract its core knowledge payload in seconds.
- The Subtitle Harvester: Instead of a slow web scraper or a heavy headless browser, Jarvis uses the extremely mature `yt_dlp` library, but explicitly suppresses downloading the heavy video and audio streams (`"skip_download": True`). It requests only the raw `.vtt` (Web Video Text Tracks) subtitle file, which makes the download nearly instant.
- The Text Cleaner: Raw YouTube subtitles are incredibly messy: timestamps like `00:01:23.400 --> 00:01:25.100`, metadata headers, HTML tags and entities, and thousands of repeated duplicate lines. The `_clean_vtt` function uses regular expressions to aggressively strip all timing data, flatten the text, and remove duplicate consecutive lines caused by YouTube's auto-generated rolling captions.
- The AI Handshake & Output: Once the text is clean, Jarvis fires the massive block of text into the high-performance cloud brain (Groq). The AI summarizes the text, extracts key takeaways, and formats it. Jarvis then creates a Markdown file (e.g., `Study_Notes_VideoTitle.md`) directly on your Desktop and programmatically opens it.
- Why it matters: You can feed Jarvis an hour-long podcast link, and in about 5 seconds a formatted text file of comprehensive study notes pops open on your screen, without you ever watching the video.
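A minimal sketch of the cleaning step, assuming VTT input like the example above; the real `_clean_vtt` may differ in detail:

```python
import re

# Hedged sketch of a VTT cleaner in the spirit of _clean_vtt. It strips the
# header, cue timings, inline tags, and consecutive duplicate lines.
def clean_vtt(raw: str) -> str:
    lines = []
    for line in raw.splitlines():
        line = line.strip()
        # Drop the header, metadata, and empty lines.
        if not line or line.startswith(("WEBVTT", "Kind:", "Language:")):
            continue
        # Drop cue timing lines like "00:01:23.400 --> 00:01:25.100".
        if re.match(r"^\d{2}:\d{2}:\d{2}\.\d{3} --> ", line):
            continue
        # Strip inline tags like <c> or <00:01:23.400>.
        line = re.sub(r"<[^>]+>", "", line).strip()
        # Rolling captions repeat the previous line; keep one copy.
        if line and (not lines or lines[-1] != line):
            lines.append(line)
    return " ".join(lines)
```
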
Getting a file from your Mac to someone else's phone often involves annoying AirDrop failures, slow email attachments, or logging into heavy cloud drives. Dead Drop solves this using an autonomous, secure terminal upload sequence.
- Zero-Click Finder Integration: When you say "Jarvis, Dead Drop," you don't even have to type a file path. Jarvis runs a hidden `osascript` (AppleScript) snippet that silently asks the macOS Finder which file you currently have selected, and instantly grabs its POSIX path.
- The "Ironclad" cURL Engine: Python's standard `requests` library is great, but it can choke on massive file uploads or unstable networks. Instead, Jarvis shells out to a tuned `curl` command (`run_curl_progress`). It skips SSL verification (`-k`), forces IPv4 (`-4`), and passes the `--http1.1` and `-H "Expect:"` flags, preventing modern HTTP/2 streaming errors and the 100-continue pre-authorization delay.
- The Multi-Provider Waterfall System: Free file hosts go down constantly, so Dead Drop is built for guaranteed delivery. It first uploads your file to `Oshi.at` (the fastest). If that fails, it catches the timeout and seamlessly falls back to `PixelDrain` (the most stable for heavy files). If the file is enormous and PixelDrain also fails, it falls back to `Litterbox` (a 1-hour, 1 GB host).
- The Physical QR Hand-Off: Once the file lands on the cloud server, Jarvis takes the direct download URL, uses the `qrcode` library to print an ASCII QR code directly in your terminal, and also saves a high-res PNG to a temporary folder, popping it open on screen (`open qr_path`).
- Why it matters: Click a 50 MB video file on your Mac, say "Dead Drop," and seconds later a QR code appears on your screen. Your friend scans it with their iPhone camera and the video instantly downloads to their phone. No accounts. No AirDrop.
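The waterfall logic reduces to a simple try-in-order pattern. A hedged sketch; the host names come from the text above, but these uploader stubs are illustrative, not Jarvis's code:

```python
from typing import Callable, Optional

# Hedged sketch of the multi-provider waterfall: try each uploader in order,
# return the first download URL, and swallow per-host failures.
def upload_with_waterfall(path: str,
                          providers: list[tuple[str, Callable[[str], str]]]
                          ) -> Optional[str]:
    for name, upload in providers:
        try:
            return upload(path)
        except Exception:
            continue  # timeout or host outage: fall through to the next host
    return None
```
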
Security and cleanliness are built into Dead Drop.
- How it works: After the QR code is generated and displayed, Jarvis starts a ticking timer: `threading.Timer(120, self.cleanup).start()`.
- Why it matters: Exactly two minutes after the transfer finishes, the system quietly deletes the generated QR code image and destroys the temporary folder footprint (`self.temp_dir`). The cloud hosts (like Litterbox) already auto-delete the online file after 1 to 24 hours. This ensures you never clutter your Mac's hard drive with stale QR codes, and your files don't live online forever.
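The delayed-cleanup pattern is a one-liner around `threading.Timer`. A hedged sketch (Jarvis waits 120 seconds; the helper name is illustrative):

```python
import os
import tempfile
import threading

# Hedged sketch of the Dead Drop cleanup timer: delete a temp artifact
# after a delay without blocking the main thread.
def schedule_cleanup(path: str, delay: float = 120.0) -> threading.Timer:
    def cleanup() -> None:
        if os.path.exists(path):
            os.remove(path)
    timer = threading.Timer(delay, cleanup)
    timer.daemon = True  # never block interpreter shutdown
    timer.start()
    return timer
```

Because `threading.Timer` is a `Thread` subclass, callers can `join()` it or `cancel()` it before it fires.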
Phase 9: Hardware Integrations - Cursor & Gesture Control
Reference Files: modules/cursor_control.py
Voice commands are great, but sometimes it's faster to just point. Jarvis includes a custom-built computer vision engine that turns your Mac's webcam into a spatial tracking device. By reading the physical position and shape of your hand in thin air, Jarvis allows you to control your Mac's mouse cursor exactly like Tom Cruise in Minority Report.
This system goes far beyond basic "webcam mice." It features dedicated jitter-reduction algorithms and a complex state-locking machine to ensure the cursor feels liquid-smooth and completely natural to use.
- How it works: When you say "Enable Cursor," Jarvis boots the `cv2` (OpenCV) camera feed and streams the raw video frames into Google's `MediaPipe Hands` machine learning model. This model drops 21 distinct 3D landmarks (X, Y, and Z coordinates) onto the joints of your hand in real time.
- Gesture Analytics (`get_gesture_refined`): The system doesn't just look for generic blobs; it performs constant trigonometric math on those 21 landmarks. By calculating the 2D distance between your thumb tip (landmark 4) and index fingertip (landmark 8) via `math.hypot(thumb_x - index_x, thumb_y - index_y)`, it detects a precise "Pinch." By checking whether a fingertip's Y-coordinate sits physically above its base knuckle, it independently determines exactly which fingers are raised.
- The Core Gestures:
- POINT (1 Finger Up): Moves the mouse cursor across the screen.
- PINCH (Thumb + Index Touch): Clicks and holds.
- SCROLL_V (2 Fingers Up): Locks X-axis, scrolls webpages up/down.
- SCROLL_H (3 Fingers Up): Locks Y-axis, scrolls timelines left/right.
- FIST (0 Fingers Up): Instantly grabs and drags the active window.
- PEACE (2 Fingers Spread Wide): Triggers the shutdown/exit sequence.
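The pinch check described above can be sketched in a few lines. Landmark indices follow MediaPipe Hands (4 = thumb tip, 8 = index tip), but this helper and its threshold are illustrative, not Jarvis's tuning:

```python
import math

# Hedged sketch of pinch detection over MediaPipe-style landmarks, where
# each landmark is an (x, y) pair normalized to [0, 1].
def is_pinch(landmarks: list[tuple[float, float]],
             threshold: float = 0.05) -> bool:
    tx, ty = landmarks[4]   # thumb tip
    ix, iy = landmarks[8]   # index fingertip
    return math.hypot(tx - ix, ty - iy) < threshold
```
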
The biggest problem with webcam cursor control is that human hands naturally shake, and webcams have noise. Without filtering, the cursor would vibrate violently, making it impossible to click small buttons.
- How it works: Jarvis implements a custom `OneEuroFilter` (a first-order low-pass filter with an adaptive cutoff frequency). On every frame, the filter estimates the velocity of your hand and adapts its smoothing accordingly.
- Dynamic Smoothing: If you move your hand very slowly (low velocity), the filter aggressively smooths the coordinates to eliminate micro-jitter, letting you click tiny close buttons. If you suddenly whip your hand across the screen (high velocity), the filter instantly backs off so the cursor snaps across your monitor with no perceptible lag.
- Why it matters: It bridges the gap between surgical precision for clicking and instantaneous speed for traveling across large dual-monitor setups.
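A minimal, self-contained implementation of the One Euro idea (Casiez et al.), as a sketch of what the text describes; the parameter values are illustrative, not Jarvis's tuning:

```python
import math

# Minimal One Euro filter: exponential smoothing whose cutoff frequency
# grows with the filtered velocity, so slow motion is smoothed hard and
# fast motion passes through with little lag.
class OneEuroFilter:
    def __init__(self, min_cutoff=1.0, beta=0.02, d_cutoff=1.0):
        self.min_cutoff, self.beta, self.d_cutoff = min_cutoff, beta, d_cutoff
        self.x_prev = self.dx_prev = None

    @staticmethod
    def _alpha(cutoff: float, dt: float) -> float:
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau / dt)

    def __call__(self, x: float, dt: float = 1 / 60) -> float:
        if self.x_prev is None:
            self.x_prev, self.dx_prev = x, 0.0
            return x
        # Filtered velocity estimate.
        dx = (x - self.x_prev) / dt
        a_d = self._alpha(self.d_cutoff, dt)
        dx_hat = a_d * dx + (1 - a_d) * self.dx_prev
        # Adaptive cutoff: more smoothing when slow, less when fast.
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff, dt)
        x_hat = a * x + (1 - a) * self.x_prev
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat
```
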
Have you ever tried to scroll on a trackpad, but the page accidentally zooms in or the mouse clicks something instead? Jarvis prevents accidental inputs using strict State Locks.
- How it works: The moment the vision core detects a scrolling gesture (e.g., two fingers up), it enters `SCROLL_LOCK` mode (`self.active_mode = "SCROLL"`). While in this locked state, it literally ignores the other fingers. Even if you accidentally drop a finger for a split second, or your hand slightly changes shape, Jarvis forces the system to stay in scrolling mode.
- The 0.5 s Release Window: To legitimately exit scrolling mode and return to pointing, you must hold a different gesture for longer than `self.lock_duration` (0.5 seconds).
- Why it matters: This prevents "flicker." You can confidently scroll down a massive webpage without accidentally left-clicking links or dragging the browser window as your hand naturally morphs during the movement.
Standard `pyautogui.scroll()` only accepts whole integers (you can't scroll 0.4 pixels). This makes slow, fine scrolling feel incredibly chunky and robotic.
- How it works: Jarvis solves this with a decimal accumulator (`self.scroll_acc_y`). As you slowly move your hand, the system generates fractional pixel deltas (like `0.3` pixels per frame) and adds them to the accumulator. Over successive frames it might read `0.3`, `0.6`, `0.9`, then `1.2`. The moment it crosses `1.0`, Jarvis triggers a real `pyautogui.scroll(1)` and keeps the `0.2` remainder for the next calculation.
- History Buffering: To make it even smoother, it averages your last 5 frames of vertical movement (`sum(self.scroll_history_y) / len(self.scroll_history_y)`) before applying it to the accumulator.
- Why it matters: Combining the 5-frame average with the decimal accumulator creates scrolling that feels like the buttery-smooth inertia of a native Apple Magic Trackpad, rather than the chunky clicking of a cheap plastic scroll wheel.
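The accumulator trick is small enough to show whole. A hedged sketch for integer-only scroll APIs; the attribute name mirrors the text (`scroll_acc_y`), but the class itself is illustrative:

```python
# Hedged sketch of the decimal-accumulator pattern: bank fractional deltas
# and emit a whole scroll tick only when the balance crosses 1.0.
class ScrollAccumulator:
    def __init__(self) -> None:
        self.scroll_acc_y = 0.0

    def add(self, delta: float) -> int:
        """Accumulate a fractional delta; return whole scroll ticks to emit."""
        self.scroll_acc_y += delta
        ticks = int(self.scroll_acc_y)  # truncates toward zero
        self.scroll_acc_y -= ticks      # keep the fractional remainder
        return ticks
```

The caller would pass each returned tick count straight to the integer scroll API (e.g., `pyautogui.scroll(ticks)`).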
Phase 10: System Integrations & Automation
Reference Files: modules/system_info.py, modules/music_controller.py, modules/focus_manager.py
While Jarvis answers questions with a general-purpose LLM, it alters the state of your computer through highly specific macOS hooks. Phase 10 covers the translation layer between natural-language commands and raw macOS system calls.
Jarvis constantly monitors the physical health of your Mac.
- How it works: The `SystemInfo` module uses the `psutil` library to run low-level sweeps of your hardware, tracking CPU usage, available memory (`mem.available`), and remaining disk capacity.
- Real-Time Context: When you ask "How is my battery?", Jarvis doesn't just read a number. It combines the remaining drain time (`battery.secsleft // 3600`) with the `power_plugged` state to provide an accurate estimate (e.g., "75%, 4h 20m remaining").
- Why it matters: Because `system_info.py` runs natively on the machine, Jarvis is aware of its own physical limits. If the `MEMORY_LIMIT_MB` threshold is breached while running a massive AI query, Jarvis can preemptively warn you that your system is running out of RAM before your Mac grinds to a halt.
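A sketch of turning `psutil`-style battery fields into a spoken estimate; in the real module these values would come from `psutil.sensors_battery()`, and the helper name here is illustrative:

```python
# Hedged sketch: format battery percent and seconds-remaining into the
# kind of estimate quoted above ("75%, 4h 20m remaining").
def format_battery(percent: int, secsleft: int, power_plugged: bool) -> str:
    if power_plugged:
        return f"{percent}%, plugged in"
    hours = secsleft // 3600
    minutes = (secsleft % 3600) // 60
    return f"{percent}%, {hours}h {minutes}m remaining"
```
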
Controlling Apple Music and Spotify usually requires complicated OAuth API keys. Jarvis bypasses this entirely by utilizing macOS's native osascript (AppleScript) engine.
- Dynamic Application Routing: Jarvis doesn't assume which app you are using. Every time you ask to play a song, it runs a silent background query (`tell application "System Events" to (name of processes)`) to dynamically check whether Spotify or Apple Music is currently running.
- The "Brute Force" Spotify Connector: Because Spotify famously deprecated its local AppleScript search API, Jarvis uses a workaround. When you say "Play Lofi Beats," Jarvis constructs a raw URI (`spotify:search:Lofi%20Beats`) and forces the app to open it. Then it uses low-level UI scripting (`tell process "Spotify"`) to simulate a human pressing the `Tab`, `Command+A`, and `Enter` keys to physically trigger playback.
- Track Context Injection: When you ask the AI, "Who sang this song?", Jarvis fires a split-second query at the active music player (`set trackName to name of current track`) and injects the result (e.g., "Blinding Lights by The Weeknd") directly into the AI's prompt before the answer is generated.
- Why it matters: You can control massive third-party applications securely and instantly, without developer tokens or cloud webhooks, because Jarvis drives the apps exactly the way a physical macOS user would.
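Building the search URI from the spoken query is plain percent-encoding. A sketch with the stdlib; the `spotify:search:` scheme comes from the text above, the helper name is illustrative:

```python
from urllib.parse import quote

# Hedged sketch: turn a spoken query into the spotify:search: URI that
# Jarvis asks the app to open.
def spotify_search_uri(query: str) -> str:
    # quote() percent-encodes spaces and other unsafe characters.
    return "spotify:search:" + quote(query)
```
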
Since macOS Monterey, Apple aggressively locked down third-party access to "Do Not Disturb" states. Jarvis uses macOS Shortcuts to bridge this gap.
- How it works: The `FocusManager` relies on the `shortcuts run` command-line interface. By executing `subprocess.run(["shortcuts", "run", "Do Not Disturb"])`, Jarvis can trigger system-level notification silencing.
- Why it matters: You can tie Focus modes to larger Jarvis routines. You could program a "Deep Work" command where Jarvis simultaneously launches Spotify's "Deep Focus" playlist, turns your Mac's brightness up, and throws your computer into "Do Not Disturb" mode, all in less than two seconds.
Phase 11: Core Productivity & Personal Management
Reference Files: modules/alarm_manager.py, modules/reminder_manager.py, modules/calendar_manager.py, modules/contact_manager.py
A true assistant needs to manage your time and communications. Phase 11 documents how Jarvis bridges the gap between text-based AI logic and your personal iCloud data.
Time is notoriously difficult for computers to parse from natural language. If you say "Set an alarm for 5," does that mean 5 minutes from now, 5:00 AM, or 5:00 PM?
- How it works: The `_parse_smart_time` function uses a decision tree on top of `dateutil.parser`. First, it checks for relative-time regexes (e.g., `in (\d+) min`). If the time is absolute, it runs a predictive AM/PM inference engine.
- The AM/PM Engine: If it is currently 2:00 PM and you say "Set an alarm for 9," the standard `parser` defaults to 9:00 AM today, which is in the past. Jarvis catches this temporal anomaly: it sees `parsed_time < now` and adds 12 hours (9:00 PM). If 9:00 PM is in the future, it locks it in. If 9:00 PM is also in the past (e.g., it is 10:00 PM), it rolls the date forward 24 hours and sets the alarm for 9:00 AM tomorrow.
- Why it matters: You never have to specify "AM," "PM," or "tomorrow" when speaking to Jarvis. The system deduces your intent from the current hour of the day.
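The inference step can be written as a small pure function. A hedged, stdlib-only sketch of the logic described above; Jarvis's `_parse_smart_time` is more elaborate and uses `dateutil`:

```python
from datetime import datetime, timedelta

# Hedged sketch of AM/PM inference: start from the AM reading of a bare
# hour, try +12h if that is past, and roll to tomorrow if both are past.
def infer_alarm_time(hour: int, now: datetime) -> datetime:
    candidate = now.replace(hour=hour % 12, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(hours=12)   # try the PM interpretation
    if candidate <= now:
        candidate += timedelta(hours=12)   # both past: roll to tomorrow AM
    return candidate
```
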
Unlike alarms that hook into the native macOS Clock via Shortcuts, Reminders are handled entirely in-house using a zero-dependency JSON database.
- How it works: When Jarvis boots, it spins up an isolated background daemon thread (`self.check_thread = threading.Thread(target=self._check_reminders_loop, daemon=True)`). This loop sleeps for 0.5 seconds, wakes up, diffs the current system clock against every pending timestamp in `data/reminders.json`, and goes back to sleep.
- Non-Blocking Execution: If a reminder triggers, it fires a native macOS notification (`osascript -e 'display notification ...'`), plays the `Glass.aiff` system sound, and invokes the `say` TTS engine. Because this all happens in the daemon thread, Jarvis can continue answering a complicated coding question in the main thread while simultaneously alerting you that your laundry is done.
Privacy is critical when dealing with personal contacts. Jarvis does not upload your address book to the cloud.
- How it works: When you ask to call someone, `contact_manager.py` executes a sandboxed AppleScript block (`tell application "Contacts" to (every person whose name contains ...)`). This queries your local Mac Contacts database, pulls the phone number, formats it, and triggers a FaceTime Audio call using the `facetime-audio://` URI scheme.
- The JSON "Black Book": For specific workflows (like the Dead Drop emailer), Jarvis maintains a separate JSON dictionary of explicitly authorized emails (`self.email_db`). You can verbally add people to this Black Book without granting Jarvis sweeping access to your Apple ID.
- Speech Parsing Correction: Voice transcription often mishears names (e.g., "Call Samson" might transcribe as "Call Samsung"). The `apply_name_aliases` dictionary automatically intercepts these common transcription errors before querying the database, ensuring high reliability for difficult names.
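The alias correction is a dictionary lookup over the transcript. A hedged sketch; the alias pair mirrors the example in the text, and the function normalizes to lowercase for matching:

```python
# Hedged sketch of alias-based transcript correction; Jarvis's dictionary
# is user-maintained, this one is illustrative.
NAME_ALIASES = {"samsung": "samson"}

def apply_name_aliases(transcript: str) -> str:
    words = transcript.lower().split()
    corrected = [NAME_ALIASES.get(w, w) for w in words]
    return " ".join(corrected)
```
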
Phase 12: Information Retrieval & Web Search Engine
Reference Files: modules/web_search.py, modules/news_service.py, modules/weather_service.py, modules/translator.py
Large Language Models (LLMs) suffer from hallucinations and knowledge cut-offs. Phase 12 explains how Jarvis overcomes these limitations by autonomously searching the live internet and injecting real-time facts into his context window before generating a response.
Relying on a single third-party web scraper is dangerous because search engines constantly update their bot-blocking security. The WebSearch module uses a robust, 4-stage fallback waterfall to guarantee data retrieval.
- Stage 1 (DuckDuckGo API): First, Jarvis attempts to use the `duckduckgo-search` Python package. It grabs the top 8 results, filters out generic login pages, drops snippets shorter than 40 characters, and returns the top 5 highest-quality textual summaries.
- Stage 2 (Google Search): If DuckDuckGo rate-limits the IP, Jarvis immediately fails over to the `googlesearch-python` module, scraping raw Google result descriptions.
- Stage 3 (The Manual DDG Scraper): If both community libraries break due to API changes, Jarvis resorts to a custom-built manual scraper. It fakes a Chrome-on-macOS User-Agent header, downloads the raw HTML of `html.duckduckgo.com`, and uses `BeautifulSoup` to parse the DOM tree (`.result__title a` and `.result__snippet`), extracting the text.
- Stage 4 (Wikipedia Fallback): As an absolute last resort, it uses the `wikipedia` library to grab the top 4 sentences of the best-matching encyclopedia article.
- Why it matters: Your AI will never fail to find an answer just because a random PyPI search package broke overnight.
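The Stage 3 extraction can even be done without BeautifulSoup. A hedged stdlib-only sketch that pulls the text of every element carrying the `result__snippet` class (the class name comes from the text above; the parser itself is illustrative and ignores void tags like `<br>`):

```python
from html.parser import HTMLParser

# Hedged sketch: collect the text content of elements whose class list
# contains "result__snippet", tracking nesting depth so inline tags like
# <b> inside a snippet don't split the text.
class SnippetParser(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.depth = 0                  # >0 while inside a snippet element
        self.snippets: list[str] = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if "result__snippet" in classes:
            self.depth += 1
            self.snippets.append("")
        elif self.depth:
            self.depth += 1             # nested tag inside a snippet

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.snippets[-1] += data
```
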
How does Jarvis know when to search the web versus when to just answer from memory?
- The Zero-Shot Classifier: Every time you speak, Jarvis runs your raw text through a blazing-fast local regex/NLP engine inside `intent_router.py`. If it detects trigger words (e.g., "who is", "what is the latest", "current price"), it seamlessly halts the standard brain execution.
- The Context Injection: Jarvis executes the Search Waterfall, takes the resulting formatted string (`--- WEB SEARCH RESULTS ---`), and physically prepends it to your original prompt before sending it to the LLM (Groq/Ollama).
- Why it matters: The AI never knows that it searched the web. It simply receives a prompt saying: "Here is factually true data. The user asks: Who won the game last night? Answer them using the data." This sharply reduces hallucinations.
- Language (`translator.py`): Uses `deep-translator` to hook directly into Google Translate's backend. When you ask, "How do you say good morning in Japanese?", the regex engine parses the target language (Japanese -> `ja`), generates the translation, and passes it to the TTS engine so Jarvis physically speaks the Japanese text aloud.
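The context-injection step amounts to simple prompt assembly. A hedged sketch; only the `--- WEB SEARCH RESULTS ---` header comes from the text above, the wrapper phrasing and function name are illustrative:

```python
# Hedged sketch of prepending search results to the user prompt before it
# reaches the LLM, as described in "The Context Injection".
def inject_search_context(user_prompt: str, results: list[str]) -> str:
    block = "--- WEB SEARCH RESULTS ---\n" + "\n".join(f"- {r}" for r in results)
    return (f"{block}\n\n"
            f"The user asks: {user_prompt}\n"
            f"Answer them using the data above.")
```
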
Phase 13: Swift Frontend HUD & Socket API
Reference Files: modules/socket_server.py, modules/hud.py, JarvisApp/Sources/SocketClient.swift, JarvisApp/Sources/ContentView.swift
While Jarvis runs natively as a Python backend terminal script, staring at a terminal is not user-friendly. Phase 13 outlines the custom, native macOS Swift frontend that brings Jarvis to life visually.
Python and Swift cannot natively share memory space. Jarvis solves this using an asynchronous TCP socket server running on port 8492.
- The Python Broadcaster: When Jarvis boots, it spins up `JarvisSocketServer` in a daemon thread (listening on `0.0.0.0:8492`) and maintains a list (`self.clients`) of all connected GUI frontends.
- The Swift Subscriber (`SocketClient.swift`): The macOS app uses Apple's native `Network` framework (`NWConnection`). It aggressively retries a connection to `127.0.0.1:8492` every second until it establishes a TCP handshake. Once connected, it listens for newline-delimited JSON objects.
- Why it matters: This decouples the intelligence from the UI. You could theoretically write a web frontend, an iOS app, or a Linux GUI, and as long as it connects to port `8492` and speaks JSON, Jarvis will work perfectly.
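Newline-delimited JSON framing is what makes the stream easy to split on the Swift side. A hedged sketch of the wire format; these helper names are illustrative, not Jarvis's exact code:

```python
import json

# Hedged sketch of NDJSON framing: one JSON object per line, so the
# receiver can split a TCP byte stream on b"\n".
def encode_message(payload: dict) -> bytes:
    return (json.dumps(payload) + "\n").encode("utf-8")

def decode_stream(buffer: bytes) -> tuple[list[dict], bytes]:
    """Split complete lines into messages; return leftover partial bytes."""
    messages = []
    while b"\n" in buffer:
        line, buffer = buffer.split(b"\n", 1)
        if line.strip():
            messages.append(json.loads(line))
    return messages, buffer
```

The leftover bytes are carried into the next `recv()` call, so a message split across two TCP packets is reassembled correctly.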
Voice-to-text engines are noisy. The socket server prevents the GUI from freaking out over bad data.
- The Partial Filter: As Apple's Speech Recognition engine generates text in real time, it emits `__PARTIAL__` tokens. The socket server caches `last_partial_time`. If the Swift app sends back a "final" command less than 1.0 second after a partial token, the Python server drops it (`🔇 Ignoring Native Speech Final`), preventing double-execution of the same voice command.
- Type-Safety Pipeline: Before Python accepts a command from the Swift UI, it runs it through Pydantic (`from core.schemas import JarvisCommand`). If the Swift side sends malformed JSON, the Python backend catches the `ValidationError` and refuses to crash.
The macOS app is deliberately a "dumb terminal": it contains zero AI logic and exists purely for presentation.
- Real-Time Token Streaming (`ButterStreamText`): When Groq generates code at 800 tokens per second, a plain SwiftUI `Text` view would instantly block-render the entire wall of text, which feels robotic. The `ButterStreamText` view fixes this. It implements an asynchronous `Task` loop that calculates a `streamingBatchSize` from the total character count and reveals 3-5 tokens every `12_000_000` nanoseconds, creating a buttery-smooth "typing" animation identical to ChatGPT's.
- Contextual Haptic Feedback (`SocketClient.swift`): As status updates arrive from Python (e.g., "LISTENING" -> "THINKING" -> "ERROR"), the Swift client translates them into physical macOS trackpad vibrations via `NSHapticFeedbackManager`. Errors trigger `.alignment` (a sharp click), while thinking triggers `.levelChange` (a soft thud).
- The State Shimmer Engine: The `JarvisUIState` enum maps raw Python headers ("FETCHING SUBTITLES") into visual states. When the state is `.thinking`, it triggers the `StateShimmerIndicatorView`, overlaying a pulsing `LinearGradient` mask that sweeps across the text at 60 FPS using hardware-accelerated `.blendMode(.screen)`.
Phase 14: System Reliability - Health, Watchdog & Diagnostics
Reference Files: modules/health_checker.py, core/health.py, scripts/doctor.py
An AI assistant that controls webcams, executes terminal commands, and modifies OS state is inherently brittle. Phase 14 documents Jarvis's three tiers of system reliability, designed to prevent silent failures.
Before the heavy AI models are even loaded into memory, Jarvis performs a lightning-fast preflight check.
- The Microphone Test (`check_microphone`): Fires a dummy `PyAudio` initialization at 16000 Hz. If macOS throws a permission error, booting halts instantly, preventing the voice engine from hanging in a permanent "Listening" state later.
- The Binary Validator (`check_system_commands`): Runs a `subprocess.run(["which", cmd])` sweep across `osascript`, `say`, and `open`. If these core macOS binaries are missing or corrupted, Jarvis flags an error.
- The API Connectivity Ping: Verifies that `PICOVOICE_API_KEY` and `OPENROUTER_API_KEY` exist in `.env`, and specifically pings the `openrouter.ai/api/v1` endpoint to ensure the local ISP isn't blocking the connection.
When things break catastrophically, developers need a quick diagnostic tool without digging through stack traces.
- How it works: Running `.venv/bin/python scripts/doctor.py` kicks off a standalone diagnostic sweep. It parses `.env` with `is_placeholder(value)` to ensure the user actually changed `YOUR_PICOVOICE_KEY_HERE`.
- Dependency Simulation: It attempts bare `import` calls on complex libraries (e.g., `faster_whisper`, `mediapipe`). If a PyPI install failed, `doctor.py` pinpoints exactly which module is missing.
- Port Conflict Detection: Because Jarvis relies on TCP port `8492`, Doctor runs a hidden `socket.bind()` test. If it fails, it warns the user: `⚠️ Port 8492 is already in use (backend may already be running)`.
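The bind probe is a few lines of stdlib socket code. A hedged sketch; port 8492 is Jarvis's port, the helper itself is illustrative:

```python
import socket

# Hedged sketch of the doctor.py port probe: if bind() raises, something
# (probably an already-running backend) holds the port.
def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False
```
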
Once Jarvis is successfully booted, a background watchdog thread guarantees continuous uptime.
- The Heartbeat Loop: The `HealthWatchdog` class runs a daemon thread that wakes up every 30 seconds (`time.sleep(30)`). It loops through the `ServiceRegistry`, querying the `.heartbeat()` method of every active background module.
- The Crash Recovery Paradigm: If `modules/music_controller.py` gets stuck in an infinite AppleScript loop, its heartbeat dies. The watchdog intercepts the crash (`self.report_crash(service_name)`) and launches a non-blocking recovery (`threading.Thread(target=self._attempt_recovery, ...)`). It safely unloads the corrupted module from memory and hot-reloads the service via `app.restart_service(service_name)`, completely transparently to the user.
Phase 15: Performance Tuning & Latency Optimization
Reference File: config.py
Jarvis is designed to run efficiently on Apple Silicon (M-Series processors) without melting your battery. Phase 15 documents the config.py hyper-parameters that allow you to push Jarvis to sub-200ms latency.
The difference between a robot and a human conversation is the latency between when you stop speaking and when they reply.
- `VAD_SILENCE_DURATION` (0.1 s): By default, speech recognizers wait 600-1000 ms after you stop speaking before assuming you are done. Jarvis's VAD is cranked down to 0.1 s (100 ms): the instant you close your mouth, the AI prompt is fired off.
- The Tradeoff: This ultra-aggressive VAD requires you to speak fluidly without long pauses, but the resulting "instant" conversational speed is unparalleled.
Running raw transformers locally destroys battery life. Jarvis forces hardware-level optimizations natively.
- `WHISPER_DEVICE = "cpu"`: Counter-intuitively, testing showed that PyTorch's Metal Performance Shaders (MPS) GPU path incurred slower initialization for tiny audio clips. `config.py` forces Whisper onto the CPU, leaning on the M4 processor's large cache.
- `WHISPER_COMPUTE_TYPE = "int8"`: Neural networks natively use `float32` (32-bit floats) or `fp16`. `config.py` compresses the matrix math down to 8-bit integers (`int8`). Transcription accuracy barely drops 2%, but CPU compute time is cut by over 50%, keeping thermals cool.
Python's `import` statements block the main thread. If you import `mediapipe` for hand tracking at boot, Jarvis takes around 4 seconds to start up.
- `LAZY_LOAD_WHISPER` & `LAZY_LOAD_CURSOR`: These boolean flags prevent heavy libraries from loading until you explicitly ask for them. Jarvis boots in milliseconds using only standard-library packages; MediaPipe is cached into memory the first time you execute a gesture-control command.
- Feature Toggles (`ENABLE_NEWS`, `ENABLE_CALENDAR`): `config.py` contains 20+ feature toggles. Setting one to `False` completely strips its class out of the `ServiceRegistry` at boot, freeing system RAM and keeping Jarvis strictly focused.
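The lazy-load pattern behind those flags can be sketched with `importlib`; Jarvis's actual mechanism may differ, and this wrapper class is illustrative:

```python
import importlib

# Hedged sketch of lazy loading: the wrapped module is imported only on
# the first attribute access, not when the wrapper is created.
class LazyModule:
    def __init__(self, name: str) -> None:
        self._name = name
        self._module = None

    def __getattr__(self, attr: str):
        if self._module is None:
            self._module = importlib.import_module(self._name)  # first touch
        return getattr(self._module, attr)
```
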
🏗️ System Architecture Flow
Understanding the entire request pipeline from voice input to UI rendering is critical for debugging and expanding Jarvis. Here is the complete lifecycle of a single command:
```mermaid
sequenceDiagram
    participant User
    participant SwiftHUD as Swift GUI (Port 8492)
    participant SocketServer as TCP Server
    participant VoiceEngine as Speech.py (Picovoice+Apple)
    participant EventRouter as EventManager (Pub/Sub)
    participant Intent as IntentRouter (Regex/NLP)
    participant Brain as Brain.py (Groq/Ollama)
    participant Skill as Skill / Module

    %% Wake & Speech Phase
    User->>VoiceEngine: "Jarvis?" (Wake Word)
    VoiceEngine-->>SwiftHUD: Status: LISTENING
    User->>VoiceEngine: "Turn on my focus mode"
    VoiceEngine-->>VoiceEngine: VAD Silence Detected (100ms)
    VoiceEngine->>EventRouter: Publish: "command_received"

    %% GUI & Intent Routing Phase
    EventRouter->>SocketServer: Broadcast Transcript
    SocketServer->>SwiftHUD: Status: THINKING (Shimmer UI)
    EventRouter->>Intent: Analyze Command Topology
    Intent-->>Intent: Regex match found for "focus"

    %% Execution Phase
    alt Exact Skill Match
        Intent->>Skill: Execute FocusManager
        Skill-->>EventRouter: Publish: "command_completed"
    else Fuzzy/AI Match
        Intent->>Brain: No exact route found. Route to LLM.
        Brain->>Brain: Inject Context (Time, System State, Web Search)
        Brain-->>Brain: Token Stream Generation (800 T/s)
        Brain->>EventRouter: Publish: "tts_speak" (Chunked)
    end

    %% Post-Execution & TTS Phase
    EventRouter->>VoiceEngine: Convert text to speech
    VoiceEngine-->>SwiftHUD: Status: SPEAKING (Audio Waveform)
    VoiceEngine-->>User: Plays audio through speakers
    VoiceEngine-->>SwiftHUD: Status: IDLE
```
</details>
<details>
<summary><b>🛠️ Custom Skill Development Guide</b></summary>
<br>
Jarvis is built entirely on a modular `ServiceRegistry` and a publish/subscribe `EventManager`. This means you can drop new Python files into the codebase and have Jarvis instantly recognize them without modifying his core brain.
Here is a step-by-step guide to writing your own skill.
### Step 1: Create the Skill File
Create a new file in `modules/skills/my_new_skill.py`. All skills must inherit from the `BaseSkill` class to be automatically registered by the intent router.
```python
from modules.skills.base_skill import BaseSkill
from core.events import EventManager


class MyNewSkill(BaseSkill):
    def __init__(self, event_manager: EventManager):
        super().__init__(event_manager)
        self.name = "MyNewSkill"
        self.description = "A custom skill that controls my smart lights."
        # This regex array tells the Intent Router when to wake this skill up.
        self.trigger_phrases = [
            r"turn (on|off) the lights",
            r"make it (bright|dark) in here",
            r"lights (on|off)"
        ]

    def can_handle(self, command: str) -> bool:
        # The intent router will pass every spoken command through here first.
        return super().can_handle(command)

    def execute(self, command: str) -> str:
        # 1. Parse the command
        action = "on" if "on" in command or "bright" in command else "off"
        # 2. Tell the GUI you are doing something
        self.event_manager.publish("gui_status", {
            "header": "SMART HOME",
            "detail": f"Turning lights {action}..."
        })
        # 3. Perform the physical action (HTTP requests, API calls, etc.)
        # requests.post("http://philips-hue-bridge/api/lights/1/state", json={"on": action == "on"})
        # 4. Return the dialogue Jarvis should speak aloud.
        return f"I have successfully turned {action} the studio lights, Sir."
```

### Step 2: Register the Skill

Open `jarvis.py`. At the top of the file, import your new skill, then register it with the Service Registry before the main loop starts.

```python
# In jarvis.py
from modules.skills.my_new_skill import MyNewSkill

def register_services():
    # ... existing code ...
    registry.register("my_lights", MyNewSkill(event_manager))
```

### Step 3: Test It

Boot Jarvis (`python jarvis.py`) and say: "Jarvis, make it bright in here."
The Intent Router bypasses the expensive Groq LLM API entirely, matches your regex string, instantly fires your Python code, updates the Swift GUI with the "SMART HOME" header, and speaks the return string aloud.
⚙️ Complete Configuration Reference (`config.py`)
To keep the main code clean, almost every single aspect of Jarvis's behavior is exposed as a toggle mechanism inside config.py. Here is the complete reference table for all available environment and execution flags.
| Variable Name | Type | Default Value | Description |
|---|---|---|---|
| Identity & Core | | | |
| `ASSISTANT_NAME` | String | `"Jarvis"` | The name the AI refers to itself as. |
| `USER_NAME` | String | `"Sir"` | The honorific the AI uses to address you. |
| `PICOVOICE_API_KEY` | String | (Required in `.env`) | The access token for the offline wake-word engine. |
| `EXIT_WORDS` | List | `["quit", "goodbye", ...]` | Spoken phrases that hard-kill the application process. |
| `STOP_WORDS` | List | `["stop", "quiet", ...]` | Spoken phrases that immediately halt the current TTS audio. |
| System & Memory Limits | | | |
| `MEMORY_LIMIT_MB` | Integer | `500` | Caps the context-memory arrays to prevent RAM exhaustion. |
| `CONVERSATION_HISTORY_DAYS` | Integer | `30` | Auto-deletes chat logs older than this for privacy. |
| `MAC_APPS` | Dict | `{ "safari": "Safari" }` | Maps spoken words to actual macOS `.app` bundle names. |
| Voice Engine Latency | | | |
| `VAD_SILENCE_DURATION` | Float | `0.1` | Seconds of silence needed to stop recording (100 ms). |
| `USE_APPLE_SPEECH` | Boolean | `True` | Prioritize the offline macOS dictation engine over Whisper. |
| `APPLE_SPEECH_ON_DEVICE` | Boolean | `True` | Keeps transcription on-device instead of sending audio to Apple's servers. |
| `VOICE_ENGINE_FALLBACK` | Boolean | `True` | Auto-boots Whisper if the native Apple recognizer crashes. |
| `WHISPER_DEVICE` | String | `"cpu"` | Device Faster-Whisper runs inference on. |
| `WHISPER_COMPUTE_TYPE` | String | `"int8"` | 8-bit quantization mode for Faster-Whisper, reducing memory use. |
| `VOICE_RATE` | Integer | `240` | TTS output speed in words per minute. |
| Hardware Integrations | | | |
| `ENABLE_CURSOR_CONTROL` | Boolean | `True` | Activates the MediaPipe webcam listener for hand tracking. |
| `CURSOR_CAMERA_INDEX` | Integer | `0` | The hardware ID of your webcam (0 = built-in, 1 = external). |
| `CURSOR_SPEED` | Float | `4.0` | Multiplier for the hand-tracking-to-mouse-movement ratio. |
| `CLICK_THRESHOLD` | Integer | `10` | Pixel distance between thumb and index finger that triggers a left click. |
| `CLICK_COOLDOWN` | Float | `0.5` | Anti-bounce delay (seconds) between consecutive clicks. |
| Security & Privacy | | | |
| `ENABLE_PROXIMITY_LOCK` | Boolean | `True` | Forces the Mac to sleep if the `PHONE_MAC_ADDRESS` Bluetooth device disconnects. |
| `ENABLE_FACE_ID` | Boolean | `True` | Engages OpenCV facial detection before executing nuclear skills. |
| `REFERENCE_IMAGE_PATH` | Path | `"data/me.jpg"` | The baseline photo used by `face_recognition_models` to verify you. |
| `REQUIRE_CONFIRMATION_FOR` | List | `["delete", "shutdown"]` | Forces a secondary "Are you sure?" voice prompt before executing. |
| Lazy Loading Architecture | | | |
| `LAZY_LOAD_WHISPER` | Boolean | `True` | Defers loading the ~2 GB Whisper model until first use instead of at startup. |
| `LAZY_LOAD_CURSOR` | Boolean | `True` | Defers MediaPipe/OpenCV initialization so it doesn't block boot. |
| `CHECK_HEALTH_ON_STARTUP` | Boolean | `False` | Skips `doctor.py` permission checks for faster cold boots. |
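The lazy-loading flags above follow a standard deferred-initialization pattern: skip the expensive load at startup and pay the cost on first use instead. A minimal sketch of how such a flag might gate model loading (class and method names here are illustrative, not Jarvis's actual internals):

```python
class VoiceEngine:
    """Illustrative sketch: defer the heavy Whisper model load until first use."""

    def __init__(self, lazy: bool = True):
        self._model = None
        if not lazy:
            self._load()  # eager path: pay the load cost at startup

    def _load(self):
        # Stand-in for the real (slow, ~2 GB) Faster-Whisper model load.
        self._model = object()
        return self._model

    @property
    def model(self):
        # First access triggers the load; later accesses reuse the cached model.
        if self._model is None:
            self._load()
        return self._model


engine = VoiceEngine(lazy=True)
assert engine._model is None   # nothing loaded at startup
_ = engine.model               # first use triggers the load
assert engine._model is not None
```

The same pattern applies to `LAZY_LOAD_CURSOR`: the MediaPipe pipeline only initializes the first time a gesture command is actually used.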
- Port 8492 Blocked? The backend uses port 8492 to talk to the Swift HUD. If it crashes, free the port: `lsof -tiTCP:8492 -sTCP:LISTEN | xargs kill`
- Permissions Error? Jarvis relies heavily on macOS Accessibility, Microphone, Camera, and Speech Recognition. If it hangs, check System Settings > Privacy & Security and ensure terminal/JarvisApp has access.
- Whisper/Audio Issues? Run `.venv/bin/python scripts/doctor.py` to diagnose missing audio chunks or missing `.env` keys.
Q: Can I run Jarvis completely offline without an internet connection?
A: Yes! As long as `USE_APPLE_SPEECH` and `VOICE_ENGINE_FALLBACK` are enabled, and your underlying brain module is pointed at a local Ollama instance (instead of Groq), Jarvis will function 100% offline. The only modules that will fail are Web Search, News, and Weather.
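The offline behavior described above boils down to a routing decision between a local and a cloud LLM backend. A hedged sketch of that decision (the endpoint URLs and function name are assumptions, not Jarvis's actual code; `http://localhost:11434` is Ollama's default port):

```python
OLLAMA_URL = "http://localhost:11434"  # Ollama's default local endpoint
GROQ_URL = "https://api.groq.com"      # cloud fallback (illustrative)


def pick_llm_endpoint(online: bool, prefer_local: bool = True) -> str:
    """Route to the local Ollama instance when offline or when local is preferred."""
    if not online or prefer_local:
        return OLLAMA_URL
    return GROQ_URL


# Offline always resolves locally; online with cloud preference goes to Groq.
assert pick_llm_endpoint(online=False) == OLLAMA_URL
assert pick_llm_endpoint(online=True, prefer_local=False) == GROQ_URL
```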
Q: Why does Jarvis sometimes type out the wrong name when I ask him to call someone?
A: Cloud transcription APIs often misinterpret proper nouns. If Jarvis consistently misunderstands a contact's name, open `contact_manager.py` and add it to the `apply_name_aliases` dictionary (e.g., `"samsung": "samson"`).
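The alias fix above is essentially a lookup table applied to the transcript after speech recognition. A minimal sketch (the function name mirrors the `apply_name_aliases` mentioned above, but the dictionary contents and normalization logic here are illustrative assumptions):

```python
# Maps common mis-transcriptions to the intended contact name (illustrative).
NAME_ALIASES = {
    "samsung": "samson",
    "sara": "sarah",
}


def apply_name_aliases(transcript: str) -> str:
    """Replace known mis-heard words with the contact names they should be."""
    words = transcript.lower().split()
    fixed = [NAME_ALIASES.get(word, word) for word in words]
    return " ".join(fixed)


assert apply_name_aliases("Call Samsung now") == "call samson now"
```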
Q: How do I change the Wake Word from "Jarvis"?
A: You must generate a new custom `.ppn` wake word file from the Picovoice console and place it in the application's root directory. Then, update the path in `speech.py`.
Q: Is the Swift HUD compatible with iOS or iPadOS?
A: Currently, the Swift codebase is written using AppKit and specifically optimized for macOS. However, because the Python backend communicates purely over a standard TCP Socket on port 8492 using JSON, you can easily write your own iOS client to interact with Jarvis over your local Wi-Fi network.
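Because the backend speaks plain JSON over a TCP socket, a custom client needs nothing beyond standard sockets. A minimal Python sketch (the newline-delimited framing and the `{"type", "text"}` message schema are assumptions; check the actual protocol used by the Swift HUD before relying on them):

```python
import json
import socket

JARVIS_PORT = 8492  # the port the backend listens on, per the docs above


def encode_message(payload: dict) -> bytes:
    """Frame a message as newline-delimited JSON (framing is an assumption)."""
    return (json.dumps(payload) + "\n").encode("utf-8")


def send_command(command: str, host: str = "127.0.0.1") -> None:
    """Open a TCP connection to the backend and send one command."""
    with socket.create_connection((host, JARVIS_PORT), timeout=5) as sock:
        sock.sendall(encode_message({"type": "command", "text": command}))


# The framing round-trips cleanly without touching the network:
msg = encode_message({"type": "command", "text": "open safari"})
assert json.loads(msg.decode().strip()) == {"type": "command", "text": "open safari"}
```

An iOS client would do the same over the local Wi-Fi network, substituting the Mac's LAN address for `127.0.0.1`.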
Contributions to expand Jarvis's skill set are always welcome! Focus areas: Adding new automation skills, optimising model prompts, or improving the Swift UI.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the Apache 2.0 License. See LICENSE for more information.
“Just a rather very intelligent system.”