Upload a PDF novel. Pick a chapter. The app narrates it in a high-quality AI voice and plays a generative ambient drone tuned to the chapter's emotional tone — synthesized live in the browser with no external music API.
- What it does
- System Architecture
- Request Lifecycle
- Module Deep-Dives
- Key Technical Decisions
- Data Shapes
- Environment Variables
- Local Development
- Deployment
- Available Voices
The application takes a PDF novel as input and turns any chapter into a narrated audiobook excerpt with a mood-matched ambient soundscape:
- Parse — extracts and structures chapters from the PDF using AI-assisted heading detection
- Analyze — reads the chapter's emotional tone (anger, joy, sadness, optimism) using an LLM
- Narrate — synthesizes a high-quality voice recording of the chapter excerpt
- Ambience — generates a synthesized drone in the browser, tuned to the detected mood, that plays beneath the narration
┌─────────────────────────────────────────────────────────────────────────┐
│ Browser │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ app/page.tsx (React) │ │
│ │ │ │
│ │ [Upload zone] → POST /api/parse → [Chapter list] │ │
│ │ │ │
│ │ [Settings: voice, speed, ambient volume] │ │
│ │ │ │
│ │ [Generate] → POST /api/analyze ──────────────────────────┐ │ │
│ │ │ │ │ │
│ │ ▼ │ │ │
│ │ POST /api/narrate mood str │ │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ [Audio player] ←──────── startAmbientDrone() │ │
│ │ Web Audio API │ │
│ └──────────────────────────────────────────────────────────────────┘ │
└───────────┬─────────────────────────────┬──────────────────────────────┘
│ HTTPS │ HTTPS
▼ ▼
┌───────────────────────┐ ┌─────────────────────────┐
│ Vercel Serverless │ │ Vercel Serverless │
│ /api/parse │ │ /api/analyze │
│ │ │ │
│ pdf-parse │ │ lib/emotion.ts │
│ lib/pdf.ts │ │ → Groq API │
│ → Groq API (×2) │ │ llama-3.3-70b │
│ │ │ → buildMusicProfile() │
└───────────────────────┘ └─────────────────────────┘
┌───────────────────────────────────────────┐
│ Vercel Serverless │
│ /api/narrate │
│ │
│ lib/narrate.ts │
│ truncateAtSentence() → 1500 chars │
│ polishForTTS() → Groq API │
│ formatForNarration() → paragraph pauses │
│ narrateText() → Deepgram API │
│ Aura 2 TTS │
│ WAV (24 kHz) │
└───────────────────────────────────────────┘
External APIs used:
Groq — llama-3.3-70b-versatile (chapter detection, emotion analysis, TTS polish)
Deepgram — Aura 2 (voice synthesis)
User drops PDF
│
▼
POST /api/parse (multipart/form-data)
│
├─ pdf-parse(buffer)
│ Extracts raw text, preserving line breaks.
│ Does NOT reliably preserve paragraph spacing.
│
├─ cleanPdfText(rawText)
│ Frequency analysis: counts every short line (<80 chars).
│ Lines appearing 4+ times are repeating headers/footers → drop.
│ Exception: chapter headings appearing 4+ times are kept once.
│ Also strips: bare page numbers, Roman numerals, "Page N" patterns.
│ Rejoins hyphenated line-breaks ("some-\nwhere" → "somewhere").
│ Collapses 3+ blank lines to 2.
│
├─ [PARALLEL Groq calls]
│
│ detectChapterPattern(text, client)
│ ├─ Sends first 6000 characters to llama-3.3-70b
│ ├─ Asks for: heading_pattern (JS regex string) + has_subtitle (bool)
│ ├─ Examples handled: "^Chapter\\s+\\d+", "^[—–]\\s*CHAPTER\\s+[A-Z]+"
│ └─ Returns: { headingRegex: RegExp, hasSubtitle: boolean } | null
│
│ detectBookMeta(text, client)
│ ├─ Sends first 1500 characters to llama-3.3-70b
│ ├─ Asks for: book_title, author, genre
│ └─ Returns: { bookTitle, author, genre }
│
├─ buildChaptersFromPattern(text, headingRegex, hasSubtitle)
│ Scans every line of the full text.
│ On heading match: strips surrounding em-dashes from the line.
│ If hasSubtitle: looks ahead up to 6 lines for a non-heading
│ line (<80 chars) and appends it as "HEADING — Subtitle".
│ Collects (title, lineIndex) pairs, then slices content between
│ consecutive headings. Drops chapters with <100 chars of content.
│
├─ cleanTitles(chapters)
│ Strips any remaining leading/trailing em-dashes from keys.
│
├─ [Fallback if Groq returns no pattern or <2 chapters found]
│ parseChaptersFallback(text)
│ Tries 6 strategies in priority order (4 regex patterns, then chunking, then full text):
│ 1. Em-dash decorated: /^([—–-]\s*CHAPTER\s+...)/m
│ 2. "Chapter N" / "CHAPTER N"
│ 3. "Part N" / "PART N"
│ 4. Two-line headings (keyword + subtitle on next line)
│ 5. Paragraph chunking (every 10 paragraphs = one section)
│ 6. Last resort: { "Full Text": entireText }
│
└─ Response: { chapters: Record<string,string>, bookTitle, author, genre }
User clicks Generate
│
▼
POST /api/analyze (JSON)
body: { chapterText, bookTitle, genre, chapterTitle }
│
├─ Groq llama-3.3-70b
│ Prompt includes book/genre/chapter context for accuracy.
│ Returns: { anger, joy, sadness, optimism } (floats summing to 1.0)
│ Normalises the scores if they don't sum to 1.
│ Falls back to { anger:0.1, joy:0.4, sadness:0.2, optimism:0.3 }
│ on any error.
│
├─ cinematicMode(scores)
│ Priority-ordered rules map score combinations → one of 10 modes:
│ danger, awe, melancholy, triumph, hope, tension, joy,
│ optimism, sadness, anger.
│ Example: sadness≥0.55 AND optimism≥0.20 → "awe"
│
├─ buildMusicProfile(scores)
│ Looks up the cinematic mode in PROFILES (10 entries).
│ Builds a text prompt (tempo, key, instrumentation, mix notes).
│ Returns MusicProfile + dominant_emotion + prompt string.
│
└─ Response: { scores, musicPrompt, mood }
mood format: "Awe — wonder-filled and bittersweet"
│
▼ (mood string stored in React state)
POST /api/narrate (JSON)
body: { chapterText, voice, speed }
│
├─ truncateAtSentence(text, 1500)
│ Cuts at last sentence boundary before 1500 chars.
│ Adds "…" suffix. Prevents incomplete sentences going to TTS.
│
├─ polishForTTS(truncated) [Groq call]
│ Fixes PDF extraction artifacts before TTS:
│ • Ligatures: ﬁ→fi, ﬂ→fl, ﬀ→ff, ﬃ→ffi
│ • Missing space after punctuation: "said.He" → "said. He"
│ • Hyphenated line-breaks: "some- where" → "somewhere"
│ • Sentences >50 words split at natural clause boundaries
│ • Ensures every paragraph ends with sentence punctuation
│ Falls back to raw text on any Groq error.
│
├─ formatForNarration(polished)
│ Splits on 2+ consecutive newlines (paragraph boundaries).
│ Ensures each paragraph ends with punctuation.
│ Joins paragraphs with " ... " — Deepgram reads "..." as
│ an audible pause, creating natural paragraph breathing.
│
├─ narrateText(text, voice, speed)
│ POST to Deepgram /v1/speak
│ model={voice}&encoding=linear16&sample_rate=24000&container=wav
│ Returns raw WAV bytes (Buffer).
│ Timeout: 55 seconds.
│
└─ Response:
Content-Type: audio/wav
X-Narrated-Text: encodeURIComponent(narrationText)
Body: WAV binary
│
▼ (client receives blob + header)
Browser:
URL.createObjectURL(blob) → <audio src>
mood → startAmbientDrone(mood, volume)
mood string (e.g. "Awe — wonder-filled")
│
▼
moodToFrequencies(mood)
Regex matches against 8 emotional categories:
dark/dread → [55, 73.4, 82.4] Hz, LPF: 280 Hz
tense → [110, 123.5, 164.8] Hz, LPF: 500 Hz
mysterious → [87.3, 116.5, 155.6] Hz, LPF: 380 Hz
sad → [110, 130.8, 196] Hz, LPF: 550 Hz
peaceful → [196, 261.6, 329.6] Hz, LPF: 1400 Hz
hopeful → [261.6, 329.6, 392] Hz, LPF: 1800 Hz
joyful → [329.6, 392, 523.3] Hz, LPF: 2200 Hz
romantic → [220, 277.2, 329.6] Hz, LPF: 1200 Hz
default → [130.8, 174.6, 196] Hz, LPF: 700 Hz
│
▼
AudioContext created
│
├─ syntheticReverb(ctx) → ConvolverNode
│ Creates a 3.5-second stereo impulse response buffer.
│ Each sample: random noise × exponential decay (power 2.2).
│ This simulates a large reverberant space without a real IR file.
│
├─ BiquadFilter (lowpass, Q=0.6)
│ Cutoff = filterHz from mood mapping.
│ Removes upper harmonics; keeps the sound dark and textural.
│
├─ GainNode (masterGain) → ctx.destination
│ Controlled by the ambient volume slider (0–18%).
│
└─ For each of the 3 frequencies:
OscillatorNode
type: sine (freq[0]) | triangle (freq[1], freq[2])
frequency: hz + (i × 0.25) ← micro-detune for stereo width
GainNode: 0.22 / numFreqs
LFO OscillatorNode
frequency: 0.04 + (i × 0.013) ← ~25s cycle, staggered per osc
GainNode: hz × 0.004 ← modulation depth scales with freq
→ connected to osc.frequency ← creates organic pitch drift
Signal path: osc → gain → reverb → LPF → masterGain → destination
Lifecycle:
onPlay → drone starts (or setVolume restores from 0)
onPause → setVolume(0) — oscillators keep running, instant silence
onEnded → stop() — oscillators stopped, AudioContext closed after 200ms
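The mood lookup in the diagram above can be sketched as a first-match table. This is a minimal sketch, not the project's actual code: the keyword patterns, the `DroneTuning` type, and the `MOOD_TABLE` name are assumptions for illustration, while the frequencies and cutoffs are copied from the diagram.

```typescript
interface DroneTuning {
  freqs: [number, number, number]; // oscillator frequencies in Hz
  filterHz: number; // low-pass cutoff in Hz
}

// Assumed keyword patterns; the numbers come from the diagram above.
const MOOD_TABLE: Array<[RegExp, DroneTuning]> = [
  [/dark|dread/i, { freqs: [55, 73.4, 82.4], filterHz: 280 }],
  [/tense/i, { freqs: [110, 123.5, 164.8], filterHz: 500 }],
  [/mysterious/i, { freqs: [87.3, 116.5, 155.6], filterHz: 380 }],
  [/sad/i, { freqs: [110, 130.8, 196], filterHz: 550 }],
  [/peaceful/i, { freqs: [196, 261.6, 329.6], filterHz: 1400 }],
  [/hopeful/i, { freqs: [261.6, 329.6, 392], filterHz: 1800 }],
  [/joyful/i, { freqs: [329.6, 392, 523.3], filterHz: 2200 }],
  [/romantic/i, { freqs: [220, 277.2, 329.6], filterHz: 1200 }],
];

const DEFAULT_TUNING: DroneTuning = { freqs: [130.8, 174.6, 196], filterHz: 700 };

function moodToFrequencies(mood: string): DroneTuning {
  for (const [pattern, tuning] of MOOD_TABLE) {
    if (pattern.test(mood)) return tuning; // first matching category wins
  }
  return DEFAULT_TUNING; // no keyword matched
}
```

Because the first match wins, the order of the table decides which category applies when a mood string contains more than one keyword.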
Exports: parseBookPdf(buffer: Buffer): Promise<ParsedBook>
The only function the route needs to call. Internally:
- cleanPdfText: normalises raw PDF text before any analysis
- detectChapterPattern: Groq call #1 (heading regex identification)
- detectBookMeta: Groq call #2 (book title, author, genre)
- buildChaptersFromPattern: applies the regex to the full text
- cleanTitles: strips decorative em-dashes from chapter title strings
- sortChapters: sorts by chapter number (Arabic numerals or spelled-out words)
- parseChaptersFallback: regex-then-chunking fallback chain
Why two separate Groq calls (not one)?
Early versions used a combined prompt. Groq consistently returned an empty heading_pattern when asked to identify headings AND extract metadata simultaneously. Splitting into two focused, parallel calls restored accuracy on both tasks. The parallel execution means no latency penalty.
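The concurrency pattern described above can be sketched with `Promise.all`. The two detector bodies below are hypothetical local stand-ins so the example runs without network access; the real functions call Groq and take a client argument.

```typescript
type ChapterPattern = { headingRegex: RegExp; hasSubtitle: boolean } | null;
type BookMeta = { bookTitle: string; author: string; genre: string };

// Hypothetical stand-in: the real function sends the first 6000 characters to the LLM.
async function detectChapterPattern(text: string): Promise<ChapterPattern> {
  const sample = text.slice(0, 6000);
  const headingRegex = /^Chapter\s+\d+/m;
  return headingRegex.test(sample) ? { headingRegex, hasSubtitle: false } : null;
}

// Hypothetical stand-in: the real function sends the first 1500 characters to the LLM.
async function detectBookMeta(text: string): Promise<BookMeta> {
  const firstLine = text.slice(0, 1500).split("\n")[0] ?? "";
  return { bookTitle: firstLine.trim(), author: "Unknown", genre: "Unknown" };
}

// Both promises are created before either is awaited, so the calls overlap
// and total latency is roughly that of the slower call.
async function detectInParallel(text: string) {
  const [pattern, meta] = await Promise.all([
    detectChapterPattern(text),
    detectBookMeta(text),
  ]);
  return { pattern, meta };
}
```

Note that `Promise.all` rejects as soon as either call rejects, so each detector would need its own error handling if one failure should not discard the other result.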
Why pdf-parse over pdfjs-dist?
pdfjs-dist has a complex worker setup that is awkward in serverless Node.js. pdf-parse is synchronous-friendly, has no worker thread requirement, and produces the same text output for standard text-based PDFs.
Why frequency analysis for header/footer removal?
Running headers and footers (book title, author name, chapter title repeated on every page) appear as short identical lines. Counting line frequency and dropping lines that appear 4+ times catches these robustly without needing regex patterns for every possible book. Chapter headings are protected by the CHAPTER_HEADING regex guard — they're allowed to appear once even if they repeat.
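A minimal sketch of the frequency-analysis step, under stated assumptions: the `CHAPTER_HEADING` pattern below is a guess (the real regex is not shown in this README), and `dropRepeatedLines` is a hypothetical name covering only this part of `cleanPdfText`.

```typescript
// Assumed heading guard; the project's actual CHAPTER_HEADING regex may differ.
const CHAPTER_HEADING = /^\s*(chapter|part)\s+\S+/i;

function dropRepeatedLines(text: string, maxLen = 80, minRepeats = 4): string {
  const lines = text.split("\n");

  // Pass 1: count every non-empty short line.
  const counts = new Map<string, number>();
  for (const line of lines) {
    const key = line.trim();
    if (key.length > 0 && key.length < maxLen) {
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }

  // Pass 2: drop frequent lines, keeping one copy of a repeated heading.
  const keptHeadings = new Set<string>();
  return lines
    .filter((line) => {
      const key = line.trim();
      if ((counts.get(key) ?? 0) < minRepeats) return true; // not a repeat
      if (CHAPTER_HEADING.test(key) && !keptHeadings.has(key)) {
        keptHeadings.add(key);
        return true; // first occurrence of a repeated heading
      }
      return false; // running header or footer
    })
    .join("\n");
}
```

Blank lines and long lines are never counted, so they always pass through unchanged.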
Exports: EmotionScores, MusicProfile, buildMusicProfile(), analyzeEmotion()
analyzeEmotion(text, bookTitle, genre, chapterTitle)
Sends the first 3000 characters of the chapter to Groq. Book and chapter context is prepended to the prompt because a chapter of Harry Potter with "dark, raining, owl" text is different emotionally from the same text in a horror novel. Context prevents naive surface-level scoring.
The model returns { anger, joy, sadness, optimism }. Scores are normalised to sum to 1.0 before use, guarding against floating-point drift in the model's output.
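The normalisation step can be sketched as follows. The `normaliseScores` name is hypothetical, and returning the documented fallback scores for a zero or non-finite total is an assumption about how degenerate model output is handled.

```typescript
interface EmotionScores {
  anger: number;
  joy: number;
  sadness: number;
  optimism: number;
}

// Fallback values documented in the request lifecycle above.
const FALLBACK_SCORES: EmotionScores = { anger: 0.1, joy: 0.4, sadness: 0.2, optimism: 0.3 };

function normaliseScores(raw: EmotionScores): EmotionScores {
  const total = raw.anger + raw.joy + raw.sadness + raw.optimism;
  // Assumption: an unusable total is treated like any other error.
  if (!Number.isFinite(total) || total <= 0) return { ...FALLBACK_SCORES };
  return {
    anger: raw.anger / total,
    joy: raw.joy / total,
    sadness: raw.sadness / total,
    optimism: raw.optimism / total,
  };
}
```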
cinematicMode(scores)
Maps the normalised scores to one of 10 named cinematic modes using priority-ordered threshold rules:
anger ≥ 0.55 → danger
sadness ≥ 0.55 AND optimism ≥ 0.20 → awe (bittersweet)
sadness ≥ 0.55 → melancholy
optimism ≥ 0.60 AND joy ≥ 0.20 → triumph
optimism ≥ 0.55 → hope
anger ≥ 0.35 AND sadness ≥ 0.25 → tension
(fallback: dominant raw score name) → joy / sadness / anger / optimism
The awe mode exists specifically for the bittersweet pattern — high sadness but with a strong vein of optimism — which is common in climactic scenes.
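The threshold rules listed above translate directly into an ordered chain of checks. This is a sketch: the thresholds are the documented ones, but the tie-breaking order in the fallback (anger, joy, sadness, optimism) is an assumption.

```typescript
type Scores = { anger: number; joy: number; sadness: number; optimism: number };

function cinematicMode(s: Scores): string {
  // Rules are checked in priority order; the first match wins.
  if (s.anger >= 0.55) return "danger";
  if (s.sadness >= 0.55 && s.optimism >= 0.2) return "awe";
  if (s.sadness >= 0.55) return "melancholy";
  if (s.optimism >= 0.6 && s.joy >= 0.2) return "triumph";
  if (s.optimism >= 0.55) return "hope";
  if (s.anger >= 0.35 && s.sadness >= 0.25) return "tension";

  // Fallback: name of the highest score (earlier names win ties).
  const names: Array<keyof Scores> = ["anger", "joy", "sadness", "optimism"];
  let best: keyof Scores = "anger";
  for (const name of names) {
    if (s[name] > s[best]) best = name;
  }
  return best;
}
```

The order matters: the awe rule has to come before the melancholy rule, otherwise every high-sadness chapter would resolve to melancholy and the bittersweet case would never be reached.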
PROFILES
A static record of 10 cinematic music profiles (not used for music generation anymore — the project moved from Jamendo API to Web Audio synthesis). The mood field from each profile is used to drive the ambient drone's frequency selection.
Exports: truncateAtSentence(), polishForTTS(), formatForNarration(), narrateText()
truncateAtSentence(text, limit=1500)
Hard-truncating at a character limit produces mid-sentence cuts that sound unnatural. This function finds the last sentence-ending punctuation mark (., !, or ?) before the limit and cuts there, appending …. The 1500-character limit keeps the request safely inside Vercel's 60-second function timeout: Deepgram typically renders 1500 characters in roughly 8–12 seconds.
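A minimal sketch of sentence-boundary truncation. Two behaviours here are assumptions rather than documented facts: text already within the limit is returned unchanged, and text with no sentence boundary inside the window falls back to a hard cut.

```typescript
function truncateAtSentence(text: string, limit = 1500): string {
  if (text.length <= limit) return text; // assumption: short text is left as-is

  const head = text.slice(0, limit);
  // Position of the last ".", "!" or "?" inside the window (-1 if none).
  const cut = Math.max(
    head.lastIndexOf("."),
    head.lastIndexOf("!"),
    head.lastIndexOf("?"),
  );

  // Assumption: with no boundary at all, keep the hard cut.
  const body = cut >= 0 ? head.slice(0, cut + 1) : head.replace(/\s+$/, "");
  return body + "…";
}
```

This simple version treats every full stop as a sentence end, so an abbreviation such as "Mr." near the limit can produce an early cut.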
polishForTTS(text)
PDF extraction introduces artifacts that TTS engines render awkwardly:
| Artifact | Example | Fix |
|---|---|---|
| Ligatures | `ﬁre` | → `fire` |
| Missing spaces | `said.He` | → `said. He` |
| Hyphenated breaks | `some-\nwhere` | → `somewhere` |
| Run-on sentences (>50 words) | long clause | split at comma/conjunction |
This is a Groq call with temperature: 0.0 — deterministic output, fixes artifacts without altering story text.
formatForNarration(text)
Deepgram reads ... in input text as an audible pause. This function splits text on paragraph boundaries (\n\n or more) and joins the paragraphs with " ... " (an ellipsis with whitespace on both sides); the surrounding whitespace ensures Deepgram doesn't merge the ellipsis with adjacent words. Each paragraph also gets a guaranteed terminal punctuation mark, because Deepgram's prosody (rising/falling intonation) depends on sentence endings.
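The paragraph-joining behaviour can be sketched as below. Collapsing whitespace inside each paragraph and the exact set of characters accepted as terminal punctuation are assumptions added for the example.

```typescript
function formatForNarration(text: string): string {
  return text
    .split(/\n{2,}/) // paragraph boundaries: two or more newlines
    .map((p) => p.replace(/\s+/g, " ").trim()) // assumption: flatten inner line breaks
    .filter((p) => p.length > 0)
    // Append a full stop unless the paragraph already ends a sentence,
    // optionally followed by a closing quote or bracket.
    .map((p) => (/[.!?…]["'”’)]?$/.test(p) ? p : p + "."))
    .join(" ... ");
}
```

For example, `formatForNarration("First para\n\nSecond one.")` yields `First para. ... Second one.`, giving the TTS engine one pause marker between the two paragraphs.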
narrateText(text, voice, _speed)
Calls Deepgram's /v1/speak endpoint. Output format: linear16 PCM, 24 kHz, WAV container. The speed parameter is accepted but currently unused, because Deepgram Aura 2 does not yet expose a speed knob via the API. The underscore prefix on _speed marks the parameter as intentionally unused.
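The request that narrateText sends can be sketched as a plain descriptor object. The `buildSpeakRequest` helper and `SpeakRequest` type are hypothetical names for illustration; the query parameters are the ones listed in the request lifecycle, and the `Token` authorization scheme and JSON `text` body follow Deepgram's REST API.

```typescript
interface SpeakRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: builds the request without sending it.
function buildSpeakRequest(text: string, voice: string, apiKey: string): SpeakRequest {
  const query = [
    `model=${encodeURIComponent(voice)}`,
    "encoding=linear16",
    "sample_rate=24000",
    "container=wav",
  ].join("&");

  return {
    url: `https://api.deepgram.com/v1/speak?${query}`,
    method: "POST",
    headers: {
      Authorization: `Token ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ text }),
  };
}
```

The descriptor can be passed to `fetch(url, { method, headers, body })`, with an `AbortController` enforcing the 55-second timeout, and the WAV bytes read from `response.arrayBuffer()`.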
Exports: DroneHandle, startAmbientDrone(mood, volume)
This module replaced a Jamendo music API integration. The problem with licensed music APIs: even tracks tagged "ambient" often have recognizable melodies, rhythmic patterns, or production signatures that feel tonally wrong beneath narration. A synthesized drone has no melody — it's pure textural atmosphere.
Signal chain:
OscillatorNode (sine/triangle)
+ micro-detune (i × 0.25 Hz) ← stereo width
+ LFO → osc.frequency ← organic pitch drift
│
▼
GainNode (0.22 / numOscillators)
│
▼
ConvolverNode (synthetic reverb) ← spaciousness
│
▼
BiquadFilter (lowpass, Q=0.6) ← removes brightness, keeps texture
│
▼
GainNode (masterGain = volume) ← slider control
│
▼
AudioContext.destination
Why a ConvolverNode with a random IR instead of a delay-based reverb?
A ConvolverNode using a noise-burst impulse response (exponentially decaying random samples) produces a dense, diffuse reverb that disguises the oscillator's waveform origin. Simple delay lines produce audible echo artifacts at drone frequencies. The IR is generated at runtime so there's no audio file to load.
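The impulse response itself is just decaying noise, which can be generated as a plain array. This is a sketch for one channel: the `decayingNoise` name is hypothetical, and the envelope `(1 - i/length)^2.2` is an assumed reading of "decay (power 2.2)".

```typescript
function decayingNoise(sampleRate: number, seconds = 3.5, power = 2.2): Float32Array {
  const length = Math.floor(sampleRate * seconds);
  const data = new Float32Array(length);
  for (let i = 0; i < length; i++) {
    // Envelope falls from 1 at the first sample towards 0 at the last.
    const envelope = Math.pow(1 - i / length, power);
    data[i] = (Math.random() * 2 - 1) * envelope; // white noise in [-1, 1]
  }
  return data;
}
```

In the browser, one such array per channel can be copied into a buffer from `ctx.createBuffer(2, length, ctx.sampleRate)` with `copyToChannel` and assigned to the ConvolverNode's `buffer` property. Generating independent noise for the left and right channels is what makes the reverb tail sound wide.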
DroneHandle interface
interface DroneHandle {
  setVolume: (v: number) => void; // live gain change, no clicks
  stop: () => void; // stops oscillators, closes AudioContext after 200ms
}

Calling setVolume(0) on pause rather than stop() keeps the oscillators running silently. Restarting oscillators produces an audible click; a live gain change does not. The 200ms delay before ctx.close() on stop prevents audio glitches at the end of the last buffer.
All routes use export const runtime = "nodejs" — Node.js runtime rather than the Edge runtime — because pdf-parse requires Node.js built-ins (Buffer, fs) not available in the Edge runtime.
| Route | Method | Input | Output |
|---|---|---|---|
| `/api/parse` | POST | `multipart/form-data` (`file: File`) | `{ chapters, bookTitle, author, genre }` |
| `/api/analyze` | POST | `{ chapterText, bookTitle, genre, chapterTitle }` | `{ scores, musicPrompt, mood }` |
| `/api/narrate` | POST | `{ chapterText, voice, speed }` | WAV binary + `X-Narrated-Text` header |
/api/narrate sets export const maxDuration = 60, which is Vercel's free tier cap. The two API calls inside (Groq polish + Deepgram TTS) typically complete in 10–25 seconds combined.
A single React component with no external UI library. State:
| State | Type | Purpose |
|---|---|---|
| `chapters` | `Record<string,string>` | Full chapter map from parse response |
| `selectedChapter` | `string` | Key into chapters map |
| `bookTitle`, `bookGenre` | `string` | Displayed in header |
| `fileName` | `string` | Shown in upload zone after file picked |
| `voice`, `speed` | `string`, `number` | Passed to narrate API |
| `ambientVolume` | `number` | 0–0.18, controls drone gain |
| `step` | `"idle" \| "analyzing" \| "generating" \| "done" \| "error"` | Controls button label and loading state |
| `audioUrl` | `string` | Object URL for WAV blob |
| `mood` | `string` | From analyze response, drives drone |
| `narratedText` | `string` | Decoded from X-Narrated-Text header |
| `showTranscript` | `boolean` | Toggles transcript panel |
droneRef holds a DroneHandle across renders without triggering re-renders. It's cleaned up on every new generate call and on narration end.
The generate flow is sequential (not parallel) because the mood string from /api/analyze must be available before the drone can be configured. The narrate call is independent of the mood but must start after analyze to allow the UI to progress through step labels correctly.
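The sequential flow can be sketched with the two API calls injected as functions, which keeps the example runnable outside the browser. The `GenerateDeps` shape and the `generate` name are assumptions for illustration; in the app the equivalents would wrap `fetch` calls and React state setters.

```typescript
type Step = "idle" | "analyzing" | "generating" | "done" | "error";

interface GenerateDeps {
  analyze: (chapterText: string) => Promise<{ mood: string }>;
  narrate: (chapterText: string) => Promise<string>; // resolves to an audio URL
  onStep: (step: Step) => void;
}

async function generate(chapterText: string, deps: GenerateDeps) {
  try {
    deps.onStep("analyzing");
    const { mood } = await deps.analyze(chapterText); // mood must exist before the drone starts

    deps.onStep("generating");
    const audioUrl = await deps.narrate(chapterText); // starts only after analyze resolves

    deps.onStep("done");
    return { mood, audioUrl };
  } catch {
    deps.onStep("error");
    return null;
  }
}
```

Awaiting each call in turn is what lets the step label advance from "analyzing" to "generating" at a meaningful moment, at the cost of the two latencies adding up.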
Groq's free tier provides 14,400 requests/day on llama-3.3-70b-versatile with no credit card required. The model quality is sufficient for structured JSON tasks (emotion scoring, regex extraction, text polishing). Latency is lower than OpenAI on equivalent tasks due to Groq's custom LPU hardware.
Deepgram offers $200 of free credit on signup. Aura 2 voices are natural-sounding for long-form narration and the API is straightforward (single POST, binary WAV response). OpenAI TTS charges from day one and ElevenLabs has a character-per-month limit on the free tier.
Jamendo's "ambient" catalogue includes tracks with recognizable melodies, production signatures, and tempo — all of which clash with narration. A Web Audio synthesizer produces guaranteed-textural output because there is no melody to clash. It also eliminates a third API key, a third external dependency, and the latency of a music API call.
A combined prompt asking for both heading pattern and book metadata consistently degraded chapter detection: the model returned an empty heading_pattern roughly 40% of the time when the prompt also included metadata fields. Separating into two concurrent calls restored accuracy on both tasks with no additional latency.
Vercel functions on the free tier time out at 60 seconds. Benchmarking showed Deepgram renders ~1500 characters in 8–12 seconds. The Groq polish call adds ~3 seconds. Margin is kept for cold starts and network variance. The limit is applied at sentence boundaries so the excerpt always ends cleanly.
pdfjs-dist requires a Web Worker for non-blocking parsing and has significant configuration overhead in a serverless environment. pdf-parse wraps the same underlying PDF.js parser but exposes a simple pdfParse(buffer) → { text } async API that works out of the box in Node.js serverless functions.
// POST /api/parse response
interface ParsedBook {
chapters: Record<string, string>; // { "CHAPTER ONE — The Boy Who Lived": "It was a dark..." }
bookTitle: string;
author: string;
genre: string;
}
// POST /api/analyze response
interface AnalyzeResponse {
scores: { anger: number; joy: number; sadness: number; optimism: number };
musicPrompt: string; // e.g. "expansive sci-fi orchestral score. wonder-filled tone. 72 BPM..."
mood: string; // e.g. "Awe — wonder-filled and bittersweet"
}
// POST /api/narrate response
// Content-Type: audio/wav
// X-Narrated-Text: URI-encoded narration string
// Body: raw WAV bytes (linear16, 24kHz, stereo)

| Variable | Required | Description |
|---|---|---|
| `GROQ_API_KEY` | Yes | console.groq.com (free, no card) |
| `DEEPGRAM_API_KEY` | Yes | console.deepgram.com ($200 free credit) |
cp .env.example .env.local
# fill in both keys

If GROQ_API_KEY is missing, the parse route skips AI chapter detection (falls back to regex), polishForTTS returns the raw text, and analyzeEmotion throws. If DEEPGRAM_API_KEY is missing, the narrate route throws.
npm install
npm run dev
# → http://localhost:3000

Type-check without building:
npx tsc --noEmit

The project deploys to Vercel with no configuration changes. Click the button at the top of this README, or:
npm i -g vercel
vercel

Add both environment variables in Vercel Dashboard → Project → Settings → Environment Variables.
The /api/narrate function has maxDuration = 60 set. All other functions use the default (300s on Pro, 60s on Hobby). If narration times out on cold start, re-triggering typically succeeds as the function instance stays warm.
Aura 2 (recommended — higher naturalness on long-form text):
| Voice ID | Gender |
|---|---|
| `aura-2-thalia-en` | Female |
| `aura-2-andromeda-en` | Female |
| `aura-2-luna-en` | Female |
| `aura-2-stella-en` | Female |
| `aura-2-zeus-en` | Male |
| `aura-2-orion-en` | Male |
Aura v1 (also available, slightly lower naturalness):
| Voice ID | Gender |
|---|---|
| `aura-asteria-en` | Female |
| `aura-luna-en` | Female |
| `aura-orion-en` | Male |
| `aura-orpheus-en` | Male |