An embodied AI pharmacist robot that watches, reminds, and cares.
Built on Reachy Mini · Powered by NVIDIA Nemotron Nano VL · Voice by MiniMax TTS
Reachy RX is an embodied AI pharmacist that helps elderly patients take the right medications on time. It uses a camera to watch for people and medication bottles, a vision-language model to understand what it sees, and text-to-speech to talk through the robot's speaker, all while expressing itself with head gestures, antenna wiggles, and synthesized sound effects.
The robot persona is an upbeat, goofy pharmacist, like a cheerful nurse who cracks dad jokes while keeping patients safe.
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
Copy the example env and fill in your keys:
cp .env.example .env
# Required for speech
MINIMAX_TTS_KEY=your_api_key_here
MINIMAX_TTS_GROUP_ID=your_group_id_here
# Only needed for the standalone voice agent (optional)
AGORA_APP_ID=your_app_id_here
AGORA_RESTFUL_KEY=your_restful_key_here
AGORA_RESTFUL_SECRET=your_restful_secret_here
The daemon is a background server that handles low-level communication with motors and sensors. It must be running before you launch the app.
With robot (USB):
uv run reachy-mini-daemon
Simulation (no robot needed):
uv run reachy-mini-daemon --sim
Note: Keep the daemon terminal open. It must stay running while the app is active.
In a new terminal:
uv run main.py # normal mode
uv run main.py --debug # save frames + verbose logging
# custom model/server
uv run main.py --model my-model --server http://host:8000/v1
# custom medication schedule
uv run main.py --sheet-url "https://docs.google.com/spreadsheets/d/..."
The vision-language model is NVIDIA Nemotron Nano VL 12B V2 (FP8), running on an NVIDIA L40S GPU (48 GiB) hosted on Brev and accessed via an OpenAI-compatible API through a Cloudflare tunnel.
- Model: nemotron-nano-12b-vl, a 13B parameter vision-language model with C-RADIOv2 vision encoder
- Quantization: FP8 for fast inference on NVIDIA GPUs
- GPU: NVIDIA L40S (48 GiB) hosted on Brev
- Capabilities: Image understanding, OCR, visual Q&A, tool/function calling
- Serving: vLLM with OpenAI-compatible API endpoints
Override the defaults with --model and --server flags.
Pass --lmstudio to use the LM Studio client (default), which works around tool call parsing issues by describing tools in the system prompt and extracting calls from the model's text output via regex. Use --no-lmstudio for servers with native structured tool call support (vLLM, Ollama, OpenAI).
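
The flags described above could be declared along these lines. This is a minimal sketch, not the parser in main.py: the default values for --model and --server are placeholders, and the help strings are made up for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative CLI for the flags named in this README."""
    parser = argparse.ArgumentParser(description="Reachy RX vision loop (sketch)")
    # Placeholder defaults; the project's real defaults may differ.
    parser.add_argument("--model", default="nemotron-nano-12b-vl",
                        help="model name sent to the server")
    parser.add_argument("--server", default="http://localhost:8000/v1",
                        help="base URL of an OpenAI-compatible server")
    parser.add_argument("--sheet-url", default=None,
                        help="Google Sheet holding the medication schedule")
    parser.add_argument("--debug", action="store_true",
                        help="save frames and log verbosely")
    # BooleanOptionalAction creates both --lmstudio and --no-lmstudio.
    parser.add_argument("--lmstudio", action=argparse.BooleanOptionalAction,
                        default=True,
                        help="use the text-parsing LM Studio client")
    return parser


args = build_parser().parse_args(["--no-lmstudio", "--model", "my-model"])
```

With a single BooleanOptionalAction argument, the default stays in one place and both spellings of the flag remain available.
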
Speech is handled by MiniMax T2A v2, a cloud TTS API that produces natural-sounding speech.
- Model: speech-2.6-turbo
- Voice: English_Upbeat_Woman, matches the robot's cheerful persona
- Flow: Text → MiniMax HTTP API → hex-encoded WAV → decode → resample to 16kHz → push PCM to Reachy's speaker
- Behavior: Non-blocking (daemon thread), drops requests if already speaking
Requires MINIMAX_TTS_KEY and MINIMAX_TTS_GROUP_ID in your .env file.
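
The decode and resample steps of the flow above can be sketched as follows. The function name is hypothetical, the sketch assumes 16-bit PCM WAV data, and it resamples by plain linear interpolation, which may not be what minimax_tts.py does. The HTTP request and the speaker push are left out.

```python
import io
import wave

import numpy as np

TARGET_RATE = 16_000  # speaker sample rate mentioned in this README


def hex_wav_to_pcm16k(hex_audio: str) -> np.ndarray:
    """Decode a hex-encoded 16-bit WAV and return mono int16 PCM at 16 kHz."""
    wav_bytes = bytes.fromhex(hex_audio)
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        raw = wav.readframes(wav.getnframes())
    samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32)
    if channels > 1:
        # Average interleaved channels down to mono.
        samples = samples.reshape(-1, channels).mean(axis=1)
    if rate != TARGET_RATE and len(samples) > 0:
        # Linear interpolation onto the 16 kHz time grid (no anti-alias filter).
        n_out = int(round(len(samples) * TARGET_RATE / rate))
        src_t = np.arange(len(samples)) / rate
        dst_t = np.arange(n_out) / TARGET_RATE
        samples = np.interp(dst_t, src_t, samples)
    return np.clip(samples, -32768, 32767).astype(np.int16)
```

A dedicated resampler would give cleaner audio when downsampling; linear interpolation is used here only to keep the example dependency-free beyond numpy.
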
Reachy RX turns a Reachy Mini desktop robot into an autonomous medication reminder assistant. It's designed for elderly patients who may forget to take their pills on time, a real problem that leads to over 100,000 preventable deaths per year in the US alone.
Instead of a phone alarm that's easy to ignore, Reachy RX is a physical presence that:
- Watches for the patient through a camera
- Knows the medication schedule (pulled live from a Google Sheet)
- Reminds with escalating urgency, gentle chirps at first, alarm beeps if ignored
- Verifies the patient is taking the right medication by reading bottle labels
- Confirms with a thumbs-up gesture check before marking meds as taken
- Celebrates when medications are taken, happy wiggles and all
Medication non-adherence is one of the biggest problems in elder care. Existing solutions (phone alarms, pill organizers, smart dispensers) are either too easy to ignore or too expensive and complex. A robot with a face, a voice, and a personality is much harder to dismiss, and the dad jokes don't hurt either.
The key insight: a medication reminder needs to be persistent AND likeable. Reachy RX escalates from a gentle chirp to an urgent alarm, but always with a warm personality. It's the difference between a nagging phone notification and a friendly nurse who genuinely cares.
The core of Reachy RX is a sequential vision loop in main.py, running on an NVIDIA Jetson Nano Super. It runs one iteration at a time, with no overlapping frames and no parallel audio, to keep things simple and prevent garbled speech.
Here's what happens on every cycle:
Why sequential? Overlapping frames while audio is playing leads to the VLM seeing a "speaking robot" state and generating contradictory actions. Running one complete cycle at a time keeps behavior predictable.
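
As a rough illustration of the sequential design, the sketch below captures one frame, waits for the model, and runs every returned action to completion before capturing again. The three callables are hypothetical stand-ins, not the functions in main.py.

```python
from typing import Any, Callable, List


def run_cycle(capture: Callable[[], Any],
              think: Callable[[Any], List[str]],
              act: Callable[[str], None]) -> List[str]:
    """Run one complete cycle: capture, then think, then act on everything."""
    frame = capture()        # grab a single frame
    actions = think(frame)   # blocking call to the VLM
    for action in actions:
        act(action)          # blocks until the gesture or speech has finished
    return actions


def run_loop(capture: Callable[[], Any],
             think: Callable[[Any], List[str]],
             act: Callable[[str], None],
             cycles: int) -> None:
    """Repeat cycles back to back; no frame is captured while actions run."""
    for _ in range(cycles):
        run_cycle(capture, think, act)
```

Because each action blocks, the next frame can never show the robot mid-sentence, which is the failure mode the paragraph above describes.
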
The medication schedule lives in a Google Sheet, just a shared spreadsheet that a caregiver or pharmacist can edit from anywhere. No database, no custom backend.
| Medication | Dosage | Form | Frequency | Times | Instructions | Condition |
|---|---|---|---|---|---|---|
| Lisinopril | 10mg | Tablet | Once daily | 08:00 | Take with water | Hypertension |
| Omeprazole | 20mg | Capsule | Once daily | 07:30 | Before breakfast | Acid reflux |
| Metformin | 500mg | Tablet | Twice daily | 08:00,18:00 | Take with food | Diabetes |
The system reads this via Google's gviz JSON endpoint, a lightweight way to pull structured data from Sheets without a full API integration.
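
For a sheet readable by link, the gviz endpoint is usually reached at a URL of the form https://docs.google.com/spreadsheets/d/<sheet-id>/gviz/tq?tqx=out:json, and the response wraps a JSON object in a JavaScript callback. The sketch below strips that wrapper and turns rows into dictionaries keyed by column label. The function name and sample payload are illustrative, and medication_reminder.py may parse the response differently.

```python
import json
from typing import Any, Dict, List


def parse_gviz(text: str) -> List[Dict[str, Any]]:
    """Extract rows from a gviz response as dicts keyed by column label."""
    # The payload looks like: google.visualization.Query.setResponse({...});
    start = text.index("{")
    end = text.rindex("}") + 1
    table = json.loads(text[start:end])["table"]
    labels = [col.get("label", "") for col in table["cols"]]
    rows = []
    for row in table["rows"]:
        cells = row.get("c") or []
        # Empty cells arrive as null, so guard before reading "v".
        values = [(cell or {}).get("v") for cell in cells]
        rows.append(dict(zip(labels, values)))
    return rows


# Hand-written sample in the gviz shape, not captured from a real sheet.
SAMPLE = (
    '/*O_o*/\ngoogle.visualization.Query.setResponse({"table":{"cols":'
    '[{"label":"Medication"},{"label":"Times"}],"rows":'
    '[{"c":[{"v":"Lisinopril"},{"v":"08:00"}]},'
    '{"c":[{"v":"Metformin"},null]}]}});'
)
schedule = parse_gviz(SAMPLE)
```

Cell values from a real sheet may be typed differently (for example, time cells), so a production parser would likely need extra normalization.
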
How reminders work:
- Every loop cycle, MedicationReminder.check_and_remind() checks the schedule
- Medications within a ±15 minute window of their scheduled time are flagged as "due"
- Each due med gets a nag count that increments every cycle
- The nag count drives escalating urgency (see below)
- When the patient gives a thumbs up, mark_medication_taken() persists it to medication_taken.json
- Once marked taken, that med stops generating reminders for the rest of the day
- Schedule is cached for 30 seconds to avoid hammering Google's servers
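
The due-window and nag-count rules above can be sketched as follows. NagTracker and its methods are hypothetical names, the taken-set lives in memory rather than in medication_taken.json, and windows that cross midnight are not handled.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set, Tuple

WINDOW = timedelta(minutes=15)  # the ±15 minute window described above


def parse_times(times: str) -> List[str]:
    """Split a Times cell such as '08:00,18:00' into individual entries."""
    return [t.strip() for t in times.split(",") if t.strip()]


def is_due(scheduled_hhmm: str, now: datetime) -> bool:
    """True when now falls within WINDOW of today's scheduled time."""
    hour, minute = map(int, scheduled_hhmm.split(":"))
    scheduled = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    return abs(now - scheduled) <= WINDOW


class NagTracker:
    """Counts how many cycles each due dose has gone unacknowledged."""

    def __init__(self) -> None:
        self.counts: Dict[Tuple[str, str], int] = {}
        self.taken: Set[Tuple[str, str]] = set()

    def check(self, schedule: Dict[str, str],
              now: datetime) -> Dict[Tuple[str, str], int]:
        due = {}
        for name, times in schedule.items():
            for t in parse_times(times):
                key = (name, t)
                if key in self.taken or not is_due(t, now):
                    continue
                self.counts[key] = self.counts.get(key, 0) + 1
                due[key] = self.counts[key]
        return due

    def mark_taken(self, name: str, due_time: str) -> None:
        self.taken.add((name, due_time))
```

Keying on the (name, time) pair lets a twice-daily medication be marked taken for the morning dose without silencing the evening one.
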
Reachy doesn't just remind once and give up. It gets increasingly animated:
| Level | Nag Count | Sound | Gesture | Mood |
|---|---|---|---|---|
| 🟢 | 1 | Gentle chirp ↗ | Soft head tilt + curious antenna perk | "Hey, just a reminder..." |
| 🟡 | 2 | Double chirp ↗↗ | Bouncy side-to-side wiggle | "C'mon, time for your meds!" |
| 🟠 | 3 | Triple chirp ↗↗↗ | Wiggles + antenna flapping + look-up plea | "Please? Pretty please?" |
| 🔴 | 4+ | Alarm beeps | Rapid wiggles → sad droop → hopeful perk-up | "I'm REALLY worried now!" |
All sounds are synthesized with numpy at runtime, no audio files. Pure math generating chirps, arpeggios, and alarm tones at 16kHz.
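
A rising chirp of this kind can be generated with a linear frequency sweep. The sketch below is illustrative only: the frequencies, durations, and volumes are invented, and sounds.py may shape its tones differently.

```python
import numpy as np

SAMPLE_RATE = 16_000  # Hz, as stated above


def chirp(f_start: float, f_end: float,
          duration: float = 0.15, volume: float = 0.4) -> np.ndarray:
    """Sine sweep from f_start to f_end with a smooth fade in and out."""
    n = int(round(SAMPLE_RATE * duration))
    t = np.arange(n) / SAMPLE_RATE
    # Phase is the integral of the linearly changing frequency.
    phase = 2 * np.pi * (f_start * t + 0.5 * (f_end - f_start) * t**2 / duration)
    envelope = np.hanning(n)  # avoids clicks at the edges
    return (volume * envelope * np.sin(phase)).astype(np.float32)


def reminder_sound(nag_count: int) -> np.ndarray:
    """One to three chirps for nag counts 1-3, alarm beeps from 4 upward."""
    gap = np.zeros(int(round(SAMPLE_RATE * 0.05)), dtype=np.float32)
    if nag_count >= 4:
        beep = chirp(880.0, 880.0, duration=0.2, volume=0.6)  # constant pitch
        parts = [beep, gap] * 3
    else:
        parts = [chirp(600.0, 1200.0), gap] * max(1, nag_count)
    return np.concatenate(parts)
```

The result is a float32 array in the range -1 to 1; converting to whatever sample format the speaker API expects is left to the caller.
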
The VLM controls Reachy through 6 tool calls defined as OpenAI-format function schemas:
| Tool | What It Does | Physical Effect |
|---|---|---|
| nod_yes() | Confirm / say yes | Head pitch up/down ×2 |
| shake_no() | Deny / signal concern | Head yaw left/right ×2 |
| look_at(direction) | Track patient position | Head turns to left/right/up/down/center |
| speak(message) | Talk to the patient (only audible output) | MiniMax TTS → WAV → Reachy speaker |
| remind_medication(name) | Play reminder chirp + gesture | Escalating animation based on nag count |
| mark_medication_taken(name, due_time) | Record med as taken | Celebration sound + happy wiggle dance |
Important: speak() is the only way the patient hears the robot. Everything else the VLM outputs is internal thinking. If the model doesn't call speak(), the patient hears nothing.
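
For reference, two of the tools might look like this in the OpenAI function-schema format. The descriptions are paraphrased from the table above, the exact schemas in vlm_client.py are not shown here and may differ, and audible_messages is a hypothetical helper that illustrates the speak() rule.

```python
from typing import Any, Dict, List

TOOLS: List[Dict[str, Any]] = [
    {
        "type": "function",
        "function": {
            "name": "speak",
            "description": "Talk to the patient (only audible output).",
            "parameters": {
                "type": "object",
                "properties": {"message": {"type": "string"}},
                "required": ["message"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "look_at",
            "description": "Turn the head to track the patient's position.",
            "parameters": {
                "type": "object",
                "properties": {
                    "direction": {
                        "type": "string",
                        "enum": ["left", "right", "up", "down", "center"],
                    }
                },
                "required": ["direction"],
            },
        },
    },
]


def audible_messages(calls: List[Dict[str, Any]]) -> List[str]:
    """Return only what the patient would hear: arguments of speak() calls."""
    return [c["arguments"]["message"] for c in calls if c["name"] == "speak"]
```

A response consisting solely of gestures yields an empty list here, which mirrors the note above: without speak(), the patient hears nothing.
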
The vision loop tracks whether someone is in front of the camera using a simple keyword-based state machine, no separate face detection model needed.
After each VLM response, the text output is scanned for keywords:
- Person present: "person", "someone", "patient", "face", "thumbs", "holding", etc.
- No one present: "no one", "nobody", "empty", "alone", "waiting"
State transitions:
- No one → Person detected: Inject "🆕 NEW PERSON" context, VLM greets once
- Person present: Inject "👤 PATIENT PRESENT" context, no re-greeting
- Person → No one: Reset greeting state, ready for next visitor
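
The transitions above can be sketched as a small class. PresenceTracker is a hypothetical name, the keyword tuples include only the examples listed in this README, and checking the "no one" keywords first is an assumption made so that a phrase like "no one" is not mistaken for a sighting.

```python
from typing import Optional

PRESENT_KEYWORDS = ("person", "someone", "patient", "face", "thumbs", "holding")
ABSENT_KEYWORDS = ("no one", "nobody", "empty", "alone", "waiting")


class PresenceTracker:
    """Keyword-driven presence state with a one-time greeting trigger."""

    def __init__(self) -> None:
        self.present = False

    def update(self, vlm_text: str) -> Optional[str]:
        """Return the context string to inject next cycle, or None."""
        text = vlm_text.lower()
        if any(k in text for k in ABSENT_KEYWORDS):
            seen = False
        elif any(k in text for k in PRESENT_KEYWORDS):
            seen = True
        else:
            seen = self.present  # no signal either way: keep the last state
        if seen and not self.present:
            self.present = True
            return "🆕 NEW PERSON"      # VLM should greet once
        if seen:
            return "👤 PATIENT PRESENT"  # no re-greeting
        self.present = False             # reset, ready for the next visitor
        return None
```

Substring matching is crude (for example, "waiting" can appear in a sentence about a patient), so a real implementation may need more careful rules.
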
The VLM integration uses an abstract base class pattern so backends are swappable:
BaseVLMClient (ABC)
├── LMStudioVLMClient - Tools described in system prompt, parsed from text via regex
└── OpenAIVLMClient - Native structured tool calls via tools= API parameter
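
In Python, a hierarchy like this is typically built with the abc module. The sketch below shows the pattern only: the step() method name and signature are assumptions, and EchoClient is a made-up stand-in rather than one of the project's backends.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseVLMClient(ABC):
    """Shared interface; concrete backends decide how tool calls are obtained."""

    def __init__(self, model: str, server: str) -> None:
        self.model = model
        self.server = server

    @abstractmethod
    def step(self, frame_data_uri: str, prompt: str) -> List[Dict[str, Any]]:
        """Send one frame plus prompt and return the parsed tool calls."""


class EchoClient(BaseVLMClient):
    """Hypothetical backend that just speaks the prompt back (no network)."""

    def step(self, frame_data_uri: str, prompt: str) -> List[Dict[str, Any]]:
        return [{"name": "speak", "arguments": {"message": prompt}}]


client: BaseVLMClient = EchoClient("my-model", "http://host:8000/v1")
calls = client.step("data:image/jpeg;base64,", "Hello!")
```

Because the loop only depends on the base interface, swapping backends comes down to constructing a different subclass.
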
Why two clients? LM Studio's tool call parser silently drops tool calls for certain models (including Nemotron VL). The LM Studio client works around this by embedding tool descriptions in the system prompt and using regex to extract calls like nod_yes() or speak({"message": "Hello!"}) from the model's text output.
The OpenAI client works with any server that properly implements the OpenAI tools API (vLLM, Ollama, OpenAI itself).
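
The regex workaround might look roughly like this. The pattern is an illustration written for the two call shapes quoted above, no arguments or a single JSON object, and is not the expression used in vlm_client_lmstudio.py.

```python
import json
import re
from typing import Any, Dict, List

TOOL_NAMES = ("nod_yes", "shake_no", "look_at", "speak",
              "remind_medication", "mark_medication_taken")

# Matches name() or name({...json...}); the JSON object is optional.
CALL_RE = re.compile(
    r"\b(" + "|".join(TOOL_NAMES) + r")\(\s*(\{.*?\})?\s*\)",
    re.DOTALL,
)


def extract_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Pull tool calls out of free-form model text, skipping malformed JSON."""
    calls = []
    for match in CALL_RE.finditer(text):
        name, raw_args = match.group(1), match.group(2)
        try:
            args = json.loads(raw_args) if raw_args else {}
        except json.JSONDecodeError:
            continue  # ignore calls whose arguments do not parse
        calls.append({"name": name, "arguments": args})
    return calls


sample = 'I see the patient. nod_yes() then speak({"message": "Hello!"})'
parsed = extract_tool_calls(sample)
```

Restricting the pattern to known tool names keeps ordinary prose with parentheses from being mistaken for a call.
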
Both clients share:
- Frame encoding: JPEG → base64 data URI (85% quality)
- Rolling history: Last 100 action/observation entries to prevent repetition
- Context injection: inject_context() prepends situational info to the next prompt
- Async support: step_async() / step_collect() for overlapping network latency with action execution
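
The frame encoding and rolling history listed above can be sketched as follows. Function names are hypothetical, and OpenCV is imported inside encode_frame so that the rest of the example runs without it.

```python
import base64
from collections import deque
from typing import Any, Deque

JPEG_QUALITY = 85    # quality setting mentioned above
HISTORY_SIZE = 100   # rolling history length mentioned above


def jpeg_to_data_uri(jpeg_bytes: bytes) -> str:
    """Wrap raw JPEG bytes in a base64 data URI for an image_url field."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    return "data:image/jpeg;base64," + encoded


def encode_frame(frame: Any) -> str:
    """Encode a BGR frame (numpy array) as a JPEG data URI."""
    import cv2  # lazy import; only needed when encoding real frames

    ok, buf = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, JPEG_QUALITY])
    if not ok:
        raise RuntimeError("JPEG encoding failed")
    return jpeg_to_data_uri(buf.tobytes())


# A deque with maxlen silently discards the oldest entry once it is full.
history: Deque[str] = deque(maxlen=HISTORY_SIZE)
```

Using a bounded deque means no explicit trimming is needed: appending the 101st entry drops the first.
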
| File | Purpose |
|---|---|
| main.py | Entry point, vision loop, camera init, Reachy connection, context injection, person state machine |
| vlm_client.py | Base client + tool definitions (6 tools) + execute_tool_calls() gesture choreography |
| vlm_client_lmstudio.py | LM Studio backend, regex-based text tool call parsing |
| vlm_client_openai.py | Standard OpenAI-compatible backend |
| medication_reminder.py | Google Sheets schedule fetcher, due-med checker, taken-log persistence |
| minimax_tts.py | Direct MiniMax HTTP TTS → Reachy speaker, non-blocking daemon thread |
| sounds.py | Synthesized sound effects (chirps, celebration), pure numpy, no audio files |
| macbook_camera.py | MacBook FaceTime camera fallback for development |
| system_prompt.md | Robot persona, behavior rules, action examples |
| medication_taken.json | Daily log of medications taken (auto-generated, gitignored) |
| Component | Technology |
|---|---|
| Robot | Reachy Mini, desktop robot with head servos, antennas, speaker |
| Edge Compute | NVIDIA Jetson Nano Super, runs the vision loop and all local processing |
| VLM | NVIDIA Nemotron Nano VL 12B V2 FP8, 13B param vision-language model on NVIDIA L40S (48 GiB) via Brev |
| TTS | MiniMax T2A v2, speech-2.6-turbo model, English_Upbeat_Woman voice |
| Schedule | Google Sheets via gviz JSON API |
| Language | Python 3.11–3.12, managed with uv |
| Vision | OpenCV (BGR → JPEG → base64) |
| Audio | numpy-synthesized sounds at 16kHz, pushed via Reachy's PCM speaker API |
| Serving | vLLM on NVIDIA L40S (48 GiB) via Brev, exposed through Cloudflare tunnel |