ericwang520 · tangxiya-star · Jun 7, 2026
diff --git a/README.md b/README.md
@@ -1,7 +1,154 @@
-# GreenTag — AI inspector for the construction site.
+# GreenTag — the AI inspector for the construction site
 
-[GreenTag](https://github.com/ericwang520/GreenTag) is the AI inspector for the construction site — it checks work against local building codes in real time and catches violations before the official inspection, so contractors pass the first time.
+### 🦺 Pokémon Go for contractors.
 
-## Feature
+**Point your phone at a wall, and a voice-driven AR layer checks the framing against local
+building code in real time — so contractors pass inspection the first time.**
 
-YC Conversational AI Hackathon 2026/6/6-7 · Eric × Xiya
+Same instinct as the game: raise your phone, and an overlay appears on the real world. But
+instead of catching creatures, you're catching code violations before they cost you a wall.
+
+A contractor points their phone at a framed wall. GreenTag detects the studs, measures
+the center-to-center spacing in AR, and a voice agent tells them — out loud, hands-free —
+whether the spacing passes local code (16" or 24" on-center) and what to do next. No
+typing: they're on a ladder, in gloves, with both hands full.
+
+> YC Conversational AI Hackathon · SF · 2026-06-06/07 · built in 24h by **Eric** (iOS/AR) × **Xiya** (AI/Voice/RAG)
+
+---
+
+## Why it matters
+
+Framing spacing that's out of spec means tearing the wall back down and re-nailing —
+**thousands of dollars in rework** per miss. ~200k US residential contractors face this
+daily, and they catch it only when the official inspector shows up. GreenTag moves that
+check to the moment the wall goes up, by voice, with a real code citation behind every
+verdict.
+
+---
+
+## The one demo loop
+
+```
+Roboflow (detect studs)
+   └─> ARKit (measure center-to-center, inches)
+          └─> [field_observation JSON]  ──LiveKit data channel (topic="field_observation")──┐
+                                                                                             v
+   User speaks: "Is this one ok?" ──LiveKit audio──> MiniMax STT ──> LLM               Agent worker
+                                                                       │  stores latest_observation
+                                                                       │  get_current_reading()
+                                                                       │  lookup_code(city, question)
+                                                                       v
+                                            Moss query (hybrid, filter by city) <── Unsiloed-parsed codes
+                                                                       │            (pre-indexed, offline)
+                                                                       v
+                                            LLM composes verdict ──> MiniMax TTS ──> spoken answer
+```
+
+The AR side sends **facts only** — a raw measurement, never a pass/fail. Whether 16" or
+24" applies depends on the wall (load-bearing? number of floors?), so the verdict is the
+agent's call, grounded in the retrieved code. The agent holds the latest reading and only
+speaks when the contractor asks.
+
+---
+
+## Architecture — three pieces
+
+```
+  ┌───────────────────────────────────────┐         ┌──────────────────────────────────────────┐
+  │            📱 iOS device               │         │              ☁️  Agent worker              │
+  │              app-ios/                  │         │                  agent/                    │
+  │                                        │         │                                            │
+  │  ┌──────────┐   ┌──────────────────┐   │         │   ┌────────────────────────────────────┐   │
+  │  │ Roboflow │──▶│ ARKit            │   │  audio  │   │ MiniMax  STT ─▶ LLM ─▶ TTS          │   │
+  │  │ (CoreML) │   │ center-to-center │   │◀───────▶│   │          (verdict, spoken, short)  │   │
+  │  │  detect  │   │  spacing (in)    │   │  ┌────┐ │   └───────────────┬────────────────────┘   │
+  │  └──────────┘   └────────┬─────────┘   │  │ Li │ │       holds latest_observation             │
+  │                          │             │  │ ve │ │       get_current_reading()                │
+  │   green / red AR overlay │             │  │ Ki │ │                │  lookup_code(city, q)      │
+  │                          ▼             │  │ t  │ │                ▼   (in-process import)      │
+  │            field_observation JSON ─────┼─▶│    │─┼─▶ ┌────────────────────────────────────┐   │
+  │            (facts only, no verdict)    │  └────┘ │   │           backend/                 │   │
+  │                                        │   data  │   │   moss_codes.lookup_code()         │   │
+  └────────────────────────────────────────┘  channel│   │   Moss index ◀─ Unsiloed chunks    │   │
+                                                      │   │   (hybrid α=0.6, filter by city)   │   │
+                                                      │   │   FastAPI /codes/* + US map        │   │
+                                                      │   └────────────────────────────────────┘   │
+                                                      └──────────────────────────────────────────┘
+        Roboflow · ARKit            ──── LiveKit ────            MiniMax            Moss · Unsiloed
+```
+
+| Directory | Owner | What it does |
+|---|---|---|
+| [`app-ios/`](app-ios/) | Eric | SwiftUI + ARKit app. A Roboflow CoreML model detects lumber, ARKit measures center-to-center spacing, and an AR overlay shows green/red. The app publishes `field_observation` events over the LiveKit data channel and streams mic audio. |
+| [`agent/`](agent/README.md) | Xiya | LiveKit Agents voice worker. MiniMax STT → LLM → TTS. Holds the latest observation, exposes `get_current_reading()` and a code-lookup tool, and speaks short verdicts with citations. |
+| [`backend/`](backend/README.md) | Xiya | Moss retrieval (`lookup_code`), the offline Unsiloed ingest pipeline, a FastAPI service, and a US coverage map. The agent imports `lookup_code` **in-process** for sub-10ms retrieval. |
+
+The agent and backend share one Python workspace, so the voice loop calls Moss directly
+(no network hop). iOS talks to the agent purely over LiveKit.
+
+---
+
+## The stack — every layer is a sponsor tool
+
+| Layer | Tool | Role |
+|---|---|---|
+| Object detection | **Roboflow** | detect `lumber`/stud in the camera frame (CoreML, on-device) |
+| Measurement | **ARKit** | center-to-center spacing in inches |
+| Voice transport | **LiveKit** | realtime audio + data channel between phone and agent |
+| Speech + LLM | **MiniMax** | STT, reasoning/verdict, TTS |
+| Code parsing | **Unsiloed** | building-code PDFs → structured chunks (offline, pre-demo) |
+| Retrieval | **Moss** | semantic + keyword search over codes, <10ms, filter by city |
+
+> *Parsing by Unsiloed, retrieval by Moss, voice on LiveKit + MiniMax, detection by Roboflow, measurement in ARKit.*
+
+---
+
+## What's working
+
+- **Real ARKit measurement** of center-to-center stud spacing, with an on-device green/red overlay.
+- **On-device Roboflow CoreML** lumber detection feeding the measurement.
+- **Full MiniMax voice loop** — STT → LLM verdict → TTS — over LiveKit, answering live spoken questions.
+- **Moss retrieval** over building codes pre-parsed by Unsiloed, tagged by city, queried hybrid (`alpha=0.6`) so exact tokens like `R602.3` still hit.
+- **Four jurisdictions indexed**: San Francisco, Seattle, Austin, and the IRC model code as the base.
+- **A US coverage map** (served by the backend) showing which cities are live.
+
+**Honest framing:** SF and Seattle resolve to the *same* 16"/24" numbers — both derive from
+IRC `R602.3(5)`. The story isn't "different spacing per city"; it's *one model code with
+per-city overrides, and Moss switching jurisdiction by `city` metadata*. Only the code ingest
+is offline; the live loop is not, because LiveKit and MiniMax are cloud services. Moss's real
+win is **sub-10ms retrieval**, which means no voice lag and big token savings.
+
+---
+
+## Quickstart
+
+Each component has its own README with full setup. The short version:
+
+```bash
+# 1. Backend — build the Moss index from pre-parsed code chunks (one-time, offline)
+cd backend && .venv/bin/python scripts/ingest.py     # see backend/README.md
+
+# 2. Agent — start the LiveKit voice worker (imports the backend's lookup_code)
+cd agent && uv run python src/agent.py dev            # see agent/README.md
+
+# 3. iOS — open app-ios/ in Xcode and run on a device with ARKit + camera
+```
+
+Secrets live in the repo-root `.env` (gitignored): `MOSS_PROJECT_ID`, `MOSS_PROJECT_KEY`,
+`UNSILOED_API_KEY`, plus the LiveKit and MiniMax credentials.
+
+- **Agent setup & run** → [`agent/README.md`](agent/README.md)
+- **Backend retrieval, ingest & API** → [`backend/README.md`](backend/README.md) · [`backend/API.md`](backend/API.md)
+- **iOS → agent event contract** → [`schema.md`](schema.md)
+
+---
+
+## Team
+
+| | |
+|---|---|
+| **Eric** | iOS / AR — ARKit, Roboflow CoreML, measurement, LiveKit publishing |
+| **Xiya** | AI / Voice / RAG — LiveKit worker, MiniMax STT/LLM/TTS, Moss + Unsiloed |
+
+Built for the YC Conversational AI Hackathon, San Francisco, June 2026.
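
Review note on the README's "facts only" contract: a minimal Python sketch of how an agent could hold the latest reading and leave the verdict to the code lookup. The payload fields `city` and `spacing_in` are assumptions (the real contract lives in `schema.md`, which this diff does not show), and `Reading`, `LatestReadingStore`, and `verdict` are illustrative names rather than the agent's actual classes. The maximum spacing is passed in by hand here; in the described design it would come from the code text returned by `lookup_code`.

```python
import json
from dataclasses import dataclass
from typing import Optional

FIELD_OBSERVATION_TOPIC = "field_observation"  # topic name taken from the README


@dataclass(frozen=True)
class Reading:
    city: str          # hypothetical field name
    spacing_in: float  # hypothetical field name, center-to-center inches


class LatestReadingStore:
    """Keeps only the most recent observation; never decides pass/fail."""

    def __init__(self) -> None:
        self._latest: Optional[Reading] = None

    def on_data(self, payload: bytes, topic: str) -> None:
        # Ignore anything that is not a field observation.
        if topic != FIELD_OBSERVATION_TOPIC:
            return
        raw = json.loads(payload)
        self._latest = Reading(city=str(raw["city"]),
                               spacing_in=float(raw["spacing_in"]))

    def get_current_reading(self) -> Optional[Reading]:
        return self._latest


def verdict(reading: Reading, max_spacing_in: float) -> str:
    """Compare the raw measurement with the limit retrieved from the code."""
    return "pass" if reading.spacing_in <= max_spacing_in else "fail"


store = LatestReadingStore()
store.on_data(b'{"city": "San Francisco", "spacing_in": 17.5}',
              FIELD_OBSERVATION_TOPIC)
reading = store.get_current_reading()
assert reading is not None
print(verdict(reading, max_spacing_in=16.0))  # fail
print(verdict(reading, max_spacing_in=24.0))  # pass
```

The same 17.5" reading fails a 16" limit and passes a 24" one, which is why the phone sends only the measurement and the agent decides which limit applies.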
diff --git a/agent/livekit.toml b/agent/livekit.toml
@@ -0,0 +1,5 @@
+[project]
+  subdomain = "yc-hackaathon-moss-ha1snxvm"
+
+[agent]
+  id = "CA_9vgGWEbTYCVc"
diff --git a/agent/src/agent.py b/agent/src/agent.py
@@ -51,12 +51,12 @@
 # the browser map and curl testing. Both funnel into the same EventDispatcher.
 FIELD_OBSERVATION_TOPIC = "field_observation"
 
-# MiniMax chat model for the voice loop. M3 is a slow *reasoning* model: it spends
-# tokens in a <think> block before the first spoken word, which hurts voice
-# time-to-first-audio. M2.7-highspeed is MiniMax's own recommendation for voice
-# pipelines (~100 tok/s). Overridable so you can drop to MiniMax-M2.1-highspeed
-# (non-reasoning, snappiest) — confirm the exact id string in your MiniMax console.
-MINIMAX_MODEL = os.getenv("MINIMAX_MODEL", "MiniMax-M2.7-highspeed")
+# MiniMax chat model for the voice loop. The whole M2 family *always* reasons
+# before the first spoken word (it cannot be disabled), which hurts voice
+# time-to-first-audio. Measured on our key: M2.1-highspeed at reasoning_effort
+# "low" is the snappiest combo (~200 think tokens / ~6s per verdict vs ~280 /
+# ~8s on M2.7-highspeed), so it's the default for the demo.
+MINIMAX_MODEL = os.getenv("MINIMAX_MODEL", "MiniMax-M2.1-highspeed")
 
 # Bridge to Eric's backend package (a sibling uv project, not pip-installed) so
 # the agent can call his Moss retrieval in-process — the design he documented in
@@ -155,9 +155,13 @@ async def speak(obs: FieldObservation) -> None:
         code = requirement_from_chunks(chunks)
         code = _attach_spacing_threshold(obs, code)
         announcement = build_spoken_announcement(obs, code)
+        # Interruptible on purpose: announcements can stack up while the model
+        # thinks, and blocking interruptions left the contractor unable to get
+        # a word in (the iOS side used to also close the mic while the agent
+        # spoke). The session's min_interruption_words gate filters echo/noise.
         session.say(
             announcement,
-            allow_interruptions=False,
+            allow_interruptions=True,
             add_to_chat_ctx=True,
         )
 
@@ -252,6 +256,13 @@ def __init__(self, store: ObservationStore | None = None) -> None:
                 model=MINIMAX_MODEL,
                 base_url="https://api.minimax.io/v1",
                 api_key=os.getenv("MINIMAX_API_KEY"),
+                # reasoning_split moves the model's <think> chain-of-thought out
+                # of `content` into a separate reasoning field, so it can never
+                # leak into the spoken reply (without it, fragments like "The
+                # user…" escaped the SDK's tag stripping and were spoken aloud).
+                # reasoning_effort "low" trims think tokens — "minimal" is not a
+                # value MiniMax honors and measured *slower* than "low".
+                extra_body={"reasoning_split": True, "reasoning_effort": "low"},
             ),
             # To use a realtime model instead of a voice pipeline, replace the LLM
             # with a RealtimeModel and remove the STT/TTS from the AgentSession

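Review note on `reasoning_split`: the sketch below shows one hypothetical way reasoning text can slip past tag stripping when a reply is streamed. It is not the SDK's actual stripping logic, and both helper names are made up for illustration. A stripper that only removes complete `<think>...</think>` blocks inside each chunk misses a block split across chunks, while buffering the whole reply fixes the leak at the cost of delaying the first spoken word.

```python
import re
from typing import Iterable

# Matches one complete reasoning block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_think_per_chunk(chunks: Iterable[str]) -> str:
    """Hypothetical naive stripper: cleans each streamed chunk on its own."""
    return "".join(THINK_BLOCK.sub("", chunk) for chunk in chunks)


def strip_think_buffered(chunks: Iterable[str]) -> str:
    """Joins the full reply first, then strips; correct but not streamable."""
    return THINK_BLOCK.sub("", "".join(chunks))


# A reasoning block that happens to be split across two streamed chunks.
streamed = [
    "<think>The user ",
    "asked about spacing.</think>",
    "Sixteen inches on center passes.",
]

leaky = strip_think_per_chunk(streamed)
clean = strip_think_buffered(streamed)

print("The user" in leaky)  # True: the reasoning text survived
print(clean)                # Sixteen inches on center passes.
```

Having the API return reasoning in a separate field, as the `extra_body` change does, sidesteps both problems because `content` never contains the tags to begin with.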
diff --git a/app-ios/GreenTag/Models/VoiceAgentSession.swift b/app-ios/GreenTag/Models/VoiceAgentSession.swift
@@ -96,9 +96,11 @@ final class VoiceAgentSession: ObservableObject {
             print("[GreenTag LiveKit] connecting room=\(details.roomName) participant=\(details.participantName)")
             try await room.connect(url: details.serverUrl, token: details.participantToken)
             phase = .connected
-            // Hands-free: open the mic so the agent can hear the wake word. The
-            // half-duplex rule below then closes it whenever the agent is
-            // speaking, which kills the speaker→mic echo loop.
+            // Hands-free: open the mic and keep it open for the whole session
+            // (full duplex). Echo from the speaker is handled by the
+            // `.voiceChat` AVAudioSession mode's hardware echo cancellation
+            // plus the agent's interruption gating — closing the mic while the
+            // agent spoke made the user inaudible for most of the conversation.
             reconcileMic()
         } catch {
             phase = .failed(error.localizedDescription)
@@ -116,18 +118,20 @@ final class VoiceAgentSession: ObservableObject {
         phase = .idle
     }
 
-    /// User toggles their own mute. Auto half-duplex still applies on top.
+    /// User toggles their own mute.
     func toggleMute() {
         muted.toggle()
         reconcileMic()
     }
 
-    /// Desired mic state for hands-free half-duplex: open while connected and
-    /// unmuted, but closed whenever the agent is speaking (so its voice from the
-    /// speaker doesn't loop back into the mic). Called on connect and on every
+    /// Desired mic state for hands-free full duplex: open while connected and
+    /// unmuted — including while the agent is speaking, so the user can talk
+    /// over / interrupt it. Speaker→mic echo is suppressed by the `.voiceChat`
+    /// audio-session AEC; the agent additionally requires several words before
+    /// treating speech as an interruption. Called on connect and on every
     /// agent-state change.
     private func reconcileMic() {
-        desiredMic = phase == .connected && !muted && agentState != .speaking
+        desiredMic = phase == .connected && !muted
         // Only one drain runs at a time; a concurrent reconcile just updates
         // `desiredMic` and the in-flight drain picks it up on its next pass.
         guard !micDraining else { return }

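Review note on the half-duplex to full-duplex change: a small Python model of the two mic rules and of a word-count interruption gate. The predicates mirror the old and new `desiredMic` expressions in `reconcileMic()`. The gate is only in the spirit of the session's `min_interruption_words` setting: the threshold of 3 and the `>=` comparison are assumptions, since the diff shows neither the configured value nor LiveKit's exact check.

```python
def desired_mic_half_duplex(connected: bool, muted: bool,
                            agent_speaking: bool) -> bool:
    """Old rule: the mic closes whenever the agent is speaking."""
    return connected and not muted and not agent_speaking


def desired_mic_full_duplex(connected: bool, muted: bool) -> bool:
    """New rule: the mic stays open for the whole session unless muted."""
    return connected and not muted


def counts_as_interruption(transcript: str, min_words: int) -> bool:
    """Hypothetical gate: short fragments (echo, noise) do not interrupt."""
    return len(transcript.split()) >= min_words


MIN_WORDS = 3  # illustrative threshold, not the project's configured value

# While the agent is speaking, only the new rule keeps the user audible.
print(desired_mic_half_duplex(True, False, agent_speaking=True))  # False
print(desired_mic_full_duplex(True, False))                       # True

# With the mic open, the gate decides what is allowed to cut the agent off.
print(counts_as_interruption("uh", MIN_WORDS))                    # False
print(counts_as_interruption("wait check the next stud", MIN_WORDS))  # True
```

The design trade is that echo suppression moves from a hard mic cutoff to two softer layers: audio-session echo cancellation on the phone and the word gate on the agent.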
diff --git a/vision/.gitkeep b/vision/.gitkeep