diff --git a/README.md b/README.md
index 844f5e7..7b483f4 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,154 @@
-# GreenTag — AI inspector for the construction site.
+# GreenTag — the AI inspector for the construction site
 
-[GreenTag](https://github.com/ericwang520/GreenTag) is the AI inspector for the construction site — it checks work against local building codes in real time and catches violations before the official inspection, so contractors pass the first time.
+### 🦺 Pokémon Go for contractors.
 
-## Feature
+**Point your phone at a wall, and a voice-driven AR layer checks the framing against local
+building code in real time — so contractors pass inspection the first time.**
 
-YC Conversational AI Hackathon 2026/6/6-7 · Eric × Xiya
+Same instinct as the game: raise your phone, and an overlay appears on the real world. But
+instead of catching creatures, you're catching code violations before they cost you a wall.
+
+A contractor points their phone at a framed wall. GreenTag detects the studs, measures
+the center-to-center spacing in AR, and a voice agent tells them — out loud, hands-free —
+whether the spacing passes local code (16" or 24" on-center) and what to do next. No
+typing: they're on a ladder, in gloves, with both hands full.
+
+> YC Conversational AI Hackathon · SF · 2026-06-06/07 · built in 24h by **Eric** (iOS/AR) × **Xiya** (AI/Voice/RAG)
+
+---
+
+## Why it matters
+
+Framing spacing that's out of spec means tearing the wall back down and re-nailing —
+**thousands of dollars in rework** per miss. ~200k US residential contractors face this
+daily, and they catch it only when the official inspector shows up. GreenTag moves that
+check to the moment the wall goes up, by voice, with a real code citation behind every
+verdict.
+
+---
+
+## The one demo loop
+
+```
+Roboflow (detect studs)
+  └─> ARKit (measure center-to-center, inches)
+      └─> [field_observation JSON] ──LiveKit data channel (topic="field_observation")──┐
+                                                                                       v
+User speaks: "Is this one ok?"         ──LiveKit audio──> MiniMax STT ──> LLM Agent worker
+                                                                                │ stores latest_observation
+                                                                                │ get_current_reading()
+                                                                                │ lookup_code(city, question)
+                                                                                v
+                                                               Moss query (hybrid, filter by city) <── Unsiloed-parsed codes
+                                                                                │                      (pre-indexed, offline)
+                                                                                v
+                                                                      LLM composes verdict ──> MiniMax TTS ──> spoken answer
+```
+
+The AR side sends **facts only** — a raw measurement, never a pass/fail. Whether 16" or
+24" applies depends on the wall (load-bearing? number of floors?), so the verdict is the
+agent's call, grounded in the retrieved code. The agent holds the latest reading and only
+speaks when the contractor asks.
+
+---
+
+## Architecture — three pieces
+
+```
+  ┌───────────────────────────────────────┐         ┌────────────────────────────────────────────┐
+  │  📱 iOS device                        │         │  ☁️ Agent worker                           │
+  │  app-ios/                             │         │  agent/                                    │
+  │                                       │         │                                            │
+  │  ┌──────────┐   ┌──────────────────┐  │         │   ┌────────────────────────────────────┐   │
+  │  │ Roboflow │──▶│ ARKit            │  │  audio  │   │ MiniMax STT ─▶ LLM ─▶ TTS          │   │
+  │  │ (CoreML) │   │ center-to-center │  │◀───────▶│   │ (verdict, spoken, short)           │   │
+  │  │ detect   │   │ spacing (in)     │  │  ┌────┐ │   └───────────────┬────────────────────┘   │
+  │  └──────────┘   └────────┬─────────┘  │  │ Li │ │       holds latest_observation             │
+  │                          │            │  │ ve │ │       get_current_reading()                │
+  │   green / red AR overlay │            │  │ Ki │ │                   │ lookup_code(city, q)   │
+  │                          ▼            │  │ t  │ │                   ▼ (in-process import)    │
+  │           field_observation JSON ─────┼─▶│    │─┼─▶ ┌────────────────────────────────────┐   │
+  │          (facts only, no verdict)     │  └────┘ │   │ backend/                           │   │
+  │                                       │  data   │   │ moss_codes.lookup_code()           │   │
+  └───────────────────────────────────────┘  channel│   │ Moss index ◀─ Unsiloed chunks      │   │
+                                                    │   │ (hybrid α=0.6, filter by city)     │   │
+                                                    │   │ FastAPI /codes/* + US map          │   │
+                                                    │   └────────────────────────────────────┘   │
+                                                    └────────────────────────────────────────────┘
+               Roboflow · ARKit   ──── LiveKit ────   MiniMax          Moss · Unsiloed
+```
+
+| Directory | Owner | What it does |
+|---|---|---|
+| [`app-ios/`](app-ios/) | Eric | SwiftUI + ARKit app. Roboflow CoreML model detects lumber, ARKit measures center-to-center spacing, AR overlay shows green/red, and it publishes `field_observation` events over the LiveKit data channel + streams mic audio. |
+| [`agent/`](agent/README.md) | Xiya | LiveKit Agents voice worker. MiniMax STT → LLM → TTS. Holds the latest observation, exposes `get_current_reading()` and a code-lookup tool, and speaks short verdicts with citations. |
+| [`backend/`](backend/README.md) | Xiya | Moss retrieval (`lookup_code`), the offline Unsiloed ingest pipeline, a FastAPI service, and a US coverage map. The agent imports `lookup_code` **in-process** for sub-10ms retrieval. |
+
+The agent and backend share one Python workspace, so the voice loop calls Moss directly
+(no network hop). iOS talks to the agent purely over LiveKit.
+
+---
+
+## The stack — every layer is a sponsor tool
+
+| Layer | Tool | Role |
+|---|---|---|
+| Object detection | **Roboflow** | detect `lumber`/stud in the camera frame (CoreML, on-device) |
+| Measurement | **ARKit** | center-to-center spacing in inches |
+| Voice transport | **LiveKit** | realtime audio + data channel between phone and agent |
+| Speech + LLM | **MiniMax** | STT, reasoning/verdict, TTS |
+| Code parsing | **Unsiloed** | building-code PDFs → structured chunks (offline, pre-demo) |
+| Retrieval | **Moss** | semantic + keyword search over codes, <10ms, filter by city |
+
+> *Parsing by Unsiloed, retrieval by Moss, voice on LiveKit + MiniMax, detection by Roboflow, measurement in ARKit.*
+
+---
+
+## What's working
+
+- **Real ARKit measurement** of center-to-center stud spacing, with an on-device green/red overlay.
+- **On-device Roboflow CoreML** lumber detection feeding the measurement.
+- **Full MiniMax voice loop** — STT → LLM verdict → TTS — over LiveKit, answering live spoken questions.
+- **Moss retrieval** over building codes pre-parsed by Unsiloed, tagged by city, queried hybrid (`alpha=0.6`) so exact tokens like `R602.3` still hit.
+- **Four jurisdictions indexed**: San Francisco, Seattle, Austin, and the IRC model code as the base.
+- **A US coverage map** (served by the backend) showing which cities are live.
+
+**Honest framing:** SF and Seattle resolve to the *same* 16"/24" numbers — both derive from
+IRC `R602.3(5)`. The story isn't "different spacing per city"; it's *one model code with
+per-city overrides, and Moss switching jurisdiction by `city` metadata*. And nothing here is
+offline — LiveKit and MiniMax are cloud. Moss's real win is **sub-10ms retrieval = no voice
+lag and big token savings**.
+
+---
+
+## Quickstart
+
+Each component has its own README with full setup. The short version:
+
+```bash
+# 1. Backend — build the Moss index from pre-parsed code chunks (one-time, offline)
+cd backend && .venv/bin/python scripts/ingest.py # see backend/README.md
+
+# 2. Agent — start the LiveKit voice worker (imports the backend's lookup_code)
+cd agent && uv run python src/agent.py dev # see agent/README.md
+
+# 3. iOS — open app-ios/ in Xcode and run on a device with ARKit + camera
+```
+
+Secrets live in the repo-root `.env` (gitignored): `MOSS_PROJECT_ID`, `MOSS_PROJECT_KEY`,
+`UNSILOED_API_KEY`, plus the LiveKit and MiniMax credentials.
+
+- **Agent setup & run** → [`agent/README.md`](agent/README.md)
+- **Backend retrieval, ingest & API** → [`backend/README.md`](backend/README.md) · [`backend/API.md`](backend/API.md)
+- **iOS → agent event contract** → [`schema.md`](schema.md)
+
+---
+
+## Team
+
+| | |
+|---|---|
+| **Eric** | iOS / AR — ARKit, Roboflow CoreML, measurement, LiveKit publishing |
+| **Xiya** | AI / Voice / RAG — LiveKit worker, MiniMax STT/LLM/TTS, Moss + Unsiloed |
+
+Built for the YC Conversational AI Hackathon, San Francisco, June 2026.
diff --git a/agent/livekit.toml b/agent/livekit.toml
new file mode 100644
index 0000000..fac4829
--- /dev/null
+++ b/agent/livekit.toml
@@ -0,0 +1,5 @@
+[project]
+  subdomain = "yc-hackaathon-moss-ha1snxvm"
+
+[agent]
+  id = "CA_9vgGWEbTYCVc"
diff --git a/agent/src/agent.py b/agent/src/agent.py
index 2a881a4..1de8880 100644
--- a/agent/src/agent.py
+++ b/agent/src/agent.py
@@ -51,12 +51,12 @@
 # the browser map and curl testing. Both funnel into the same EventDispatcher.
 FIELD_OBSERVATION_TOPIC = "field_observation"
 
-# MiniMax chat model for the voice loop. M3 is a slow *reasoning* model: it spends
-# tokens in a block before the first spoken word, which hurts voice
-# time-to-first-audio. M2.7-highspeed is MiniMax's own recommendation for voice
-# pipelines (~100 tok/s). Overridable so you can drop to MiniMax-M2.1-highspeed
-# (non-reasoning, snappiest) — confirm the exact id string in your MiniMax console.
-MINIMAX_MODEL = os.getenv("MINIMAX_MODEL", "MiniMax-M2.7-highspeed")
+# MiniMax chat model for the voice loop. The whole M2 family *always* reasons
+# before the first spoken word (it cannot be disabled), which hurts voice
+# time-to-first-audio. Measured on our key: M2.1-highspeed at reasoning_effort
+# "low" is the snappiest combo (~200 think tokens / ~6s per verdict vs ~280 /
+# ~8s on M2.7-highspeed), so it's the default for the demo.
+MINIMAX_MODEL = os.getenv("MINIMAX_MODEL", "MiniMax-M2.1-highspeed")
 
 # Bridge to Eric's backend package (a sibling uv project, not pip-installed) so
 # the agent can call his Moss retrieval in-process — the design he documented in
@@ -155,9 +155,13 @@ async def speak(obs: FieldObservation) -> None:
         code = requirement_from_chunks(chunks)
         code = _attach_spacing_threshold(obs, code)
         announcement = build_spoken_announcement(obs, code)
+        # Interruptible on purpose: announcements can stack up while the model
+        # thinks, and blocking interruptions left the contractor unable to get
+        # a word in (the iOS side used to also close the mic while the agent
+        # spoke). The session's min_interruption_words gate filters echo/noise.
         session.say(
             announcement,
-            allow_interruptions=False,
+            allow_interruptions=True,
             add_to_chat_ctx=True,
         )
 
@@ -252,6 +256,13 @@ def __init__(self, store: ObservationStore | None = None) -> None:
                 model=MINIMAX_MODEL,
                 base_url="https://api.minimax.io/v1",
                 api_key=os.getenv("MINIMAX_API_KEY"),
+                # reasoning_split moves the model's chain-of-thought out
+                # of `content` into a separate reasoning field, so it can never
+                # leak into the spoken reply (without it, fragments like "The
+                # user…" escaped the SDK's tag stripping and were spoken aloud).
+                # reasoning_effort "low" trims think tokens — "minimal" is not a
+                # value MiniMax honors and measured *slower* than "low".
+                extra_body={"reasoning_split": True, "reasoning_effort": "low"},
             ),
             # To use a realtime model instead of a voice pipeline, replace the LLM
             # with a RealtimeModel and remove the STT/TTS from the AgentSession
diff --git a/app-ios/GreenTag/Models/VoiceAgentSession.swift b/app-ios/GreenTag/Models/VoiceAgentSession.swift
index 30f0d73..3f990cc 100644
--- a/app-ios/GreenTag/Models/VoiceAgentSession.swift
+++ b/app-ios/GreenTag/Models/VoiceAgentSession.swift
@@ -96,9 +96,11 @@ final class VoiceAgentSession: ObservableObject {
             print("[GreenTag LiveKit] connecting room=\(details.roomName) participant=\(details.participantName)")
             try await room.connect(url: details.serverUrl, token: details.participantToken)
             phase = .connected
-            // Hands-free: open the mic so the agent can hear the wake word. The
-            // half-duplex rule below then closes it whenever the agent is
-            // speaking, which kills the speaker→mic echo loop.
+            // Hands-free: open the mic and keep it open for the whole session
+            // (full duplex). Echo from the speaker is handled by the
+            // `.voiceChat` AVAudioSession mode's hardware echo cancellation
+            // plus the agent's interruption gating — closing the mic while the
+            // agent spoke made the user inaudible for most of the conversation.
             reconcileMic()
         } catch {
             phase = .failed(error.localizedDescription)
@@ -116,18 +118,20 @@ final class VoiceAgentSession: ObservableObject {
         phase = .idle
     }
 
-    /// User toggles their own mute. Auto half-duplex still applies on top.
+    /// User toggles their own mute.
     func toggleMute() {
         muted.toggle()
         reconcileMic()
     }
 
-    /// Desired mic state for hands-free half-duplex: open while connected and
-    /// unmuted, but closed whenever the agent is speaking (so its voice from the
-    /// speaker doesn't loop back into the mic). Called on connect and on every
+    /// Desired mic state for hands-free full duplex: open while connected and
+    /// unmuted — including while the agent is speaking, so the user can talk
+    /// over / interrupt it. Speaker→mic echo is suppressed by the `.voiceChat`
+    /// audio-session AEC; the agent additionally requires several words before
+    /// treating speech as an interruption. Called on connect and on every
     /// agent-state change.
     private func reconcileMic() {
-        desiredMic = phase == .connected && !muted && agentState != .speaking
+        desiredMic = phase == .connected && !muted
         // Only one drain runs at a time; a concurrent reconcile just updates
         // `desiredMic` and the in-flight drain picks it up on its next pass.
         guard !micDraining else { return }
diff --git a/vision/.gitkeep b/vision/.gitkeep
deleted file mode 100644
index e69de29..0000000
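
Reviewer note on the `reconcileMic()` hunk above: the change from half duplex to full duplex comes down to dropping one term from a boolean predicate. The sketch below restates the old and new rules in Python only to make the difference easy to check. It is not code from this commit, and the function and parameter names are illustrative stand-ins for the Swift properties `phase`, `muted`, and `agentState`.

```python
from itertools import product


def mic_open_half_duplex(connected: bool, muted: bool, agent_speaking: bool) -> bool:
    """Old rule: the mic is closed whenever the agent is speaking."""
    return connected and not muted and not agent_speaking


def mic_open_full_duplex(connected: bool, muted: bool, agent_speaking: bool) -> bool:
    """New rule: agent speech no longer affects the mic (argument kept for symmetry)."""
    return connected and not muted


# Enumerate every input combination and keep the ones where the rules disagree.
differences = [
    state
    for state in product((False, True), repeat=3)
    if mic_open_half_duplex(*state) != mic_open_full_duplex(*state)
]

print(differences)  # [(True, False, True)]
```

The two rules disagree in exactly one state: connected, unmuted, agent speaking. That is the state in which the user can now interrupt, so echo suppression has to come from the audio session and the agent's interruption gating rather than from closing the mic.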