AyushGupta-Code/Music

Novel to Audio

Next.js TypeScript Tailwind CSS Groq Deepgram License Deploy with Vercel

Upload a PDF novel. Pick a chapter. The app narrates it in a high-quality AI voice and plays a generative ambient drone tuned to the chapter's emotional tone — synthesized live in the browser with no external music API.


Table of Contents

  1. What it does
  2. System Architecture
  3. Request Lifecycle
  4. Module Deep-Dives
  5. Key Technical Decisions
  6. Data Shapes
  7. Environment Variables
  8. Local Development
  9. Deployment
  10. Available Voices

What it does

The application takes a PDF novel as input and turns any chapter into a narrated audiobook excerpt with a mood-matched ambient soundscape:

  1. Parse — extracts and structures chapters from the PDF using AI-assisted heading detection
  2. Analyze — reads the chapter's emotional tone (anger, joy, sadness, optimism) using an LLM
  3. Narrate — synthesizes a high-quality voice recording of the chapter excerpt
  4. Ambience — generates a synthesized drone in the browser, tuned to the detected mood, that plays beneath the narration

System Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│  Browser                                                                │
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │  app/page.tsx (React)                                            │  │
│  │                                                                  │  │
│  │  [Upload zone]  →  POST /api/parse  →  [Chapter list]           │  │
│  │                                                                  │  │
│  │  [Settings: voice, speed, ambient volume]                        │  │
│  │                                                                  │  │
│  │  [Generate] → POST /api/analyze ──────────────────────────┐     │  │
│  │                       │                                   │     │  │
│  │                       ▼                                   │     │  │
│  │             POST /api/narrate                         mood str  │  │
│  │                       │                                   │     │  │
│  │                       ▼                                   ▼     │  │
│  │              [Audio player] ←──────── startAmbientDrone()       │  │
│  │                                       Web Audio API             │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└───────────┬─────────────────────────────┬──────────────────────────────┘
            │ HTTPS                       │ HTTPS
            ▼                             ▼
┌───────────────────────┐   ┌─────────────────────────┐
│  Vercel Serverless    │   │  Vercel Serverless       │
│  /api/parse           │   │  /api/analyze            │
│                       │   │                          │
│  pdf-parse            │   │  lib/emotion.ts          │
│  lib/pdf.ts           │   │  → Groq API              │
│  → Groq API (×2)      │   │    llama-3.3-70b         │
│                       │   │  → buildMusicProfile()   │
└───────────────────────┘   └─────────────────────────┘

┌───────────────────────────────────────────┐
│  Vercel Serverless                        │
│  /api/narrate                             │
│                                           │
│  lib/narrate.ts                           │
│  truncateAtSentence() → 1500 chars        │
│  polishForTTS()       → Groq API          │
│  formatForNarration() → paragraph pauses  │
│  narrateText()        → Deepgram API      │
│                          Aura 2 TTS       │
│                          WAV (24 kHz)     │
└───────────────────────────────────────────┘

External APIs used:
  Groq    — llama-3.3-70b-versatile  (chapter detection, emotion analysis, TTS polish)
  Deepgram — Aura 2                  (voice synthesis)

Request Lifecycle

Phase 1 — PDF Upload

User drops PDF
       │
       ▼
POST /api/parse  (multipart/form-data)
       │
       ├─ pdf-parse(buffer)
       │      Extracts raw text, preserving line breaks.
       │      Does NOT reliably preserve paragraph spacing.
       │
       ├─ cleanPdfText(rawText)
       │      Frequency analysis: counts every short line (<80 chars).
       │      Lines appearing 4+ times are repeating headers/footers → drop.
       │      Exception: chapter headings appearing 4+ times are kept once.
       │      Also strips: bare page numbers, Roman numerals, "Page N" patterns.
       │      Rejoins hyphenated line-breaks ("some-\nwhere" → "somewhere").
       │      Collapses 3+ blank lines to 2.
       │
       ├─ [PARALLEL Groq calls]
       │
       │   detectChapterPattern(text, client)
       │   ├─ Sends first 6000 characters to llama-3.3-70b
       │   ├─ Asks for: heading_pattern (JS regex string) + has_subtitle (bool)
       │   ├─ Examples handled: "^Chapter\\s+\\d+", "^[—–]\\s*CHAPTER\\s+[A-Z]+"
       │   └─ Returns: { headingRegex: RegExp, hasSubtitle: boolean } | null
       │
       │   detectBookMeta(text, client)
       │   ├─ Sends first 1500 characters to llama-3.3-70b
       │   ├─ Asks for: book_title, author, genre
       │   └─ Returns: { bookTitle, author, genre }
       │
       ├─ buildChaptersFromPattern(text, headingRegex, hasSubtitle)
       │      Scans every line of the full text.
       │      On heading match: strips surrounding em-dashes from the line.
       │      If hasSubtitle: looks ahead up to 6 lines for a non-heading
       │        line (<80 chars) and appends it as "HEADING — Subtitle".
       │      Collects (title, lineIndex) pairs, then slices content between
       │        consecutive headings. Drops chapters with <100 chars of content.
       │
       ├─ cleanTitles(chapters)
       │      Strips any remaining leading/trailing em-dashes from keys.
       │
       ├─ [Fallback if Groq returns no pattern or <2 chapters found]
       │      parseChaptersFallback(text)
       │      Tries these strategies in priority order (4 regex patterns, then 2 fallbacks):
       │        1. Em-dash decorated: /^([—–-]\s*CHAPTER\s+...)/m
       │        2. "Chapter N" / "CHAPTER N"
       │        3. "Part N" / "PART N"
       │        4. Two-line headings (keyword + subtitle on next line)
       │        5. Paragraph chunking (every 10 paragraphs = one section)
       │        6. Last resort: { "Full Text": entireText }
       │
       └─ Response: { chapters: Record<string,string>, bookTitle, author, genre }

Phase 2 — Generate (on button click)

User clicks Generate
       │
       ▼
POST /api/analyze  (JSON)
  body: { chapterText, bookTitle, genre, chapterTitle }
       │
       ├─ Groq llama-3.3-70b
       │    Prompt includes book/genre/chapter context for accuracy.
       │    Returns: { anger, joy, sadness, optimism } (floats summing to 1.0)
       │    Normalises the scores if they don't sum to 1.
       │    Falls back to { anger:0.1, joy:0.4, sadness:0.2, optimism:0.3 }
       │      on any error.
       │
       ├─ cinematicMode(scores)
       │    Priority-ordered rules map score combinations → one of 10 modes:
       │    danger, awe, melancholy, triumph, hope, tension, joy,
       │    optimism, sadness, anger.
       │    Example: sadness≥0.55 AND optimism≥0.20 → "awe"
       │
       ├─ buildMusicProfile(scores)
       │    Looks up the cinematic mode in PROFILES (10 entries).
       │    Builds a text prompt (tempo, key, instrumentation, mix notes).
       │    Returns MusicProfile + dominant_emotion + prompt string.
       │
       └─ Response: { scores, musicPrompt, mood }
              mood format: "Awe — wonder-filled and bittersweet"

                    │
                    ▼ (mood string stored in React state)

POST /api/narrate  (JSON)
  body: { chapterText, voice, speed }
       │
       ├─ truncateAtSentence(text, 1500)
       │    Cuts at last sentence boundary before 1500 chars.
       │    Adds "…" suffix. Prevents incomplete sentences going to TTS.
       │
       ├─ polishForTTS(truncated)   [Groq call]
       │    Fixes PDF extraction artifacts before TTS:
       │      • Ligatures: ﬁ→fi, ﬂ→fl, ﬀ→ff, ﬃ→ffi
       │      • Missing space after punctuation: "said.He" → "said. He"
       │      • Hyphenated line-breaks: "some- where" → "somewhere"
       │      • Sentences >50 words split at natural clause boundaries
       │      • Ensures every paragraph ends with sentence punctuation
       │    Falls back to raw text on any Groq error.
       │
       ├─ formatForNarration(polished)
       │    Splits on 2+ consecutive newlines (paragraph boundaries).
       │    Ensures each paragraph ends with punctuation.
       │    Joins paragraphs with "  ...  " — Deepgram reads "..." as
       │      an audible pause, creating natural paragraph breathing.
       │
       ├─ narrateText(text, voice, speed)
       │    POST to Deepgram /v1/speak
       │      model={voice}&encoding=linear16&sample_rate=24000&container=wav
       │    Returns raw WAV bytes (Buffer).
       │    Timeout: 55 seconds.
       │
       └─ Response:
            Content-Type: audio/wav
            X-Narrated-Text: encodeURIComponent(narrationText)
            Body: WAV binary

                    │
                    ▼ (client receives blob + header)

Browser:
  URL.createObjectURL(blob) → <audio src>
  mood → startAmbientDrone(mood, volume)

Phase 3 — Ambient Drone (client-side, no API)

mood string (e.g. "Awe — wonder-filled")
       │
       ▼
moodToFrequencies(mood)
  Regex matches against 8 emotional categories:
  dark/dread   → [55, 73.4, 82.4] Hz,       LPF: 280 Hz
  tense        → [110, 123.5, 164.8] Hz,     LPF: 500 Hz
  mysterious   → [87.3, 116.5, 155.6] Hz,    LPF: 380 Hz
  sad          → [110, 130.8, 196] Hz,        LPF: 550 Hz
  peaceful     → [196, 261.6, 329.6] Hz,      LPF: 1400 Hz
  hopeful      → [261.6, 329.6, 392] Hz,      LPF: 1800 Hz
  joyful       → [329.6, 392, 523.3] Hz,      LPF: 2200 Hz
  romantic     → [220, 277.2, 329.6] Hz,      LPF: 1200 Hz
  default      → [130.8, 174.6, 196] Hz,      LPF: 700 Hz
       │
       ▼
AudioContext created
       │
       ├─ syntheticReverb(ctx) → ConvolverNode
       │    Creates a 3.5-second stereo impulse response buffer.
       │    Each sample: random noise × exponential decay (power 2.2).
       │    This simulates a large reverberant space without a real IR file.
       │
       ├─ BiquadFilter (lowpass, Q=0.6)
       │    Cutoff = filterHz from mood mapping.
       │    Removes upper harmonics; keeps the sound dark and textural.
       │
       ├─ GainNode (masterGain) → ctx.destination
       │    Controlled by the ambient volume slider (0–18%).
       │
       └─ For each of the 3 frequencies:
            OscillatorNode
              type: sine (freq[0]) | triangle (freq[1], freq[2])
              frequency: hz + (i × 0.25)   ← micro-detune for stereo width
            GainNode: 0.22 / numFreqs
            LFO OscillatorNode
              frequency: 0.04 + (i × 0.013)  ← ~25s cycle, staggered per osc
              GainNode: hz × 0.004            ← modulation depth scales with freq
              → connected to osc.frequency    ← creates organic pitch drift
            Signal path: osc → gain → reverb → LPF → masterGain → destination

Lifecycle:
  onPlay  → drone starts (or setVolume restores from 0)
  onPause → setVolume(0) — oscillators keep running, instant silence
  onEnded → stop() — oscillators stopped, AudioContext closed after 200ms
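
The mood lookup in Phase 3 can be sketched as an ordered list of patterns with a default. This is a minimal sketch, not the project's code: the Hz values and cutoffs are copied from the table above, but the keyword patterns and their order are assumptions, since the README names only the categories.

```typescript
interface DroneTuning {
  frequencies: [number, number, number]; // oscillator pitches in Hz
  filterHz: number;                      // low-pass cutoff in Hz
}

// Keyword patterns are hypothetical; tunings come from the Phase 3 table.
const MOOD_TABLE: Array<{ pattern: RegExp; tuning: DroneTuning }> = [
  { pattern: /dark|dread/i, tuning: { frequencies: [55, 73.4, 82.4], filterHz: 280 } },
  { pattern: /tense|tension/i, tuning: { frequencies: [110, 123.5, 164.8], filterHz: 500 } },
  { pattern: /myster/i, tuning: { frequencies: [87.3, 116.5, 155.6], filterHz: 380 } },
  { pattern: /sad|melanchol/i, tuning: { frequencies: [110, 130.8, 196], filterHz: 550 } },
  { pattern: /peace|calm/i, tuning: { frequencies: [196, 261.6, 329.6], filterHz: 1400 } },
  { pattern: /hope/i, tuning: { frequencies: [261.6, 329.6, 392], filterHz: 1800 } },
  { pattern: /joy/i, tuning: { frequencies: [329.6, 392, 523.3], filterHz: 2200 } },
  { pattern: /romantic|love/i, tuning: { frequencies: [220, 277.2, 329.6], filterHz: 1200 } },
];

const DEFAULT_TUNING: DroneTuning = { frequencies: [130.8, 174.6, 196], filterHz: 700 };

// Returns the first matching tuning, or the default when nothing matches.
function moodToFrequencies(mood: string): DroneTuning {
  for (const entry of MOOD_TABLE) {
    if (entry.pattern.test(mood)) return entry.tuning;
  }
  return DEFAULT_TUNING;
}
```

Because the first match wins, the order of the entries decides what happens when a mood string contains keywords from two categories.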

Module Deep-Dives

lib/pdf.ts — PDF Parsing Pipeline

Exports: parseBookPdf(buffer: Buffer): Promise<ParsedBook>

The only function the route needs to call. Internally:

  • cleanPdfText — normalises raw PDF text before any analysis
  • detectChapterPattern — Groq call #1: heading regex identification
  • detectBookMeta — Groq call #2: book title, author, genre
  • buildChaptersFromPattern — applies the regex to the full text
  • cleanTitles — strips decorative em-dashes from chapter title strings
  • sortChapters — sorts by chapter number (Arabic numerals or spelled-out words)
  • parseChaptersFallback — regex-then-chunking fallback chain
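
As an illustration of the buildChaptersFromPattern step, the sketch below collects heading positions and slices the text between them. It is a simplified approximation under stated assumptions: subtitle look-ahead is omitted, the regex is expected to have no `g` flag, and the dash-stripping pattern is a guess at what "strips surrounding em-dashes" means.

```typescript
// Simplified sketch: find heading lines, then slice content between consecutive headings.
function buildChaptersFromPattern(
  text: string,
  headingRegex: RegExp, // assumed to be created without the "g" flag
  minChars = 100        // chapters with less content than this are dropped
): Record<string, string> {
  const lines = text.split("\n");
  const headings: Array<{ title: string; lineIndex: number }> = [];

  lines.forEach((line, lineIndex) => {
    const trimmed = line.trim();
    if (headingRegex.test(trimmed)) {
      // Strip decorative em/en dashes and spaces from both ends of the title.
      const title = trimmed.replace(/^[\u2014\u2013\s]+|[\u2014\u2013\s]+$/g, "");
      headings.push({ title, lineIndex });
    }
  });

  const chapters: Record<string, string> = {};
  headings.forEach((heading, i) => {
    const nextHeading = headings[i + 1];
    const end = nextHeading ? nextHeading.lineIndex : lines.length;
    const content = lines.slice(heading.lineIndex + 1, end).join("\n").trim();
    if (content.length >= minChars) chapters[heading.title] = content;
  });
  return chapters;
}
```

If fewer than two chapters come back from a call like this, the pipeline described above hands over to parseChaptersFallback.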

Why two separate Groq calls (not one)? Early versions used a combined prompt. Groq frequently returned an empty heading_pattern when asked to identify headings AND extract metadata simultaneously. Splitting into two focused, parallel calls restored accuracy on both tasks. Because the calls run in parallel, there is no latency penalty.

Why pdf-parse over pdfjs-dist? pdfjs-dist has a complex worker setup that is awkward in serverless Node.js. pdf-parse exposes a single promise-based call, has no worker thread requirement, and produces the same text output for standard text-based PDFs.

Why frequency analysis for header/footer removal? Running headers and footers (book title, author name, chapter title repeated on every page) appear as short identical lines. Counting line frequency and dropping lines that appear 4+ times catches these robustly without needing regex patterns for every possible book. Chapter headings are protected by the CHAPTER_HEADING regex guard — they're allowed to appear once even if they repeat.
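
The frequency-analysis idea can be sketched in a few lines. This is an illustrative approximation, not the project's cleanPdfText: the function name, the default heading guard, and the choice to keep the first occurrence of a repeated heading are assumptions.

```typescript
// Drops short lines that repeat across the text (running headers/footers),
// while keeping one copy of any repeated line that looks like a chapter heading.
function dropRepeatedLines(
  text: string,
  maxLen = 80,      // only lines shorter than this are counted
  minRepeats = 4,   // lines seen this many times are treated as headers/footers
  headingGuard: RegExp = /^chapter\s+\S+/i // hypothetical stand-in for CHAPTER_HEADING
): string {
  const lines = text.split("\n");
  const counts = new Map<string, number>();
  for (const line of lines) {
    const key = line.trim();
    if (key.length > 0 && key.length < maxLen) {
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }

  const keptHeadings = new Set<string>();
  return lines
    .filter((line) => {
      const key = line.trim();
      if ((counts.get(key) ?? 0) < minRepeats) return true; // not a repeat: keep
      if (headingGuard.test(key) && !keptHeadings.has(key)) {
        keptHeadings.add(key); // repeated heading: keep the first copy only
        return true;
      }
      return false;
    })
    .join("\n");
}
```

The appeal of this approach is that it needs no per-book patterns: any short line that recurs often enough is treated as page furniture.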


lib/emotion.ts — Emotion Analysis

Exports: EmotionScores, MusicProfile, buildMusicProfile(), analyzeEmotion()

analyzeEmotion(text, bookTitle, genre, chapterTitle)

Sends the first 3000 characters of the chapter to Groq. Book and chapter context is prepended to the prompt because a chapter of Harry Potter with "dark, raining, owl" text is different emotionally from the same text in a horror novel. Context prevents naive surface-level scoring.

The model returns { anger, joy, sadness, optimism }. Scores are normalised to sum to 1.0 before use, because the model's raw output does not always sum to exactly 1.
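
A minimal sketch of the normalisation step follows. The fallback values are the ones listed in the request lifecycle; using them when the total is zero or not finite is an assumption made here, since the README only says they apply "on any error".

```typescript
interface EmotionScores {
  anger: number;
  joy: number;
  sadness: number;
  optimism: number;
}

const FALLBACK_SCORES: EmotionScores = { anger: 0.1, joy: 0.4, sadness: 0.2, optimism: 0.3 };

// Rescales the four scores so they sum to 1.
function normalizeScores(raw: EmotionScores): EmotionScores {
  const total = raw.anger + raw.joy + raw.sadness + raw.optimism;
  // Assumption: an unusable total is treated like an error and gets the fallback.
  if (!Number.isFinite(total) || total <= 0) return { ...FALLBACK_SCORES };
  return {
    anger: raw.anger / total,
    joy: raw.joy / total,
    sadness: raw.sadness / total,
    optimism: raw.optimism / total,
  };
}
```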

cinematicMode(scores)

Maps the normalised scores to one of 10 named cinematic modes using priority-ordered threshold rules:

anger ≥ 0.55                        → danger
sadness ≥ 0.55 AND optimism ≥ 0.20  → awe        (bittersweet)
sadness ≥ 0.55                      → melancholy
optimism ≥ 0.60 AND joy ≥ 0.20      → triumph
optimism ≥ 0.55                     → hope
anger ≥ 0.35 AND sadness ≥ 0.25     → tension
(fallback: dominant raw score name) → joy / sadness / anger / optimism

The awe mode exists specifically for the bittersweet pattern — high sadness but with a strong vein of optimism — which is common in climactic scenes.
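
The threshold rules above translate directly into an ordered chain of checks. The sketch below follows the listed rules; the tie-breaking order in the fallback is an assumption.

```typescript
interface EmotionScores {
  anger: number;
  joy: number;
  sadness: number;
  optimism: number;
}

type CinematicMode =
  | "danger" | "awe" | "melancholy" | "triumph" | "hope"
  | "tension" | "joy" | "sadness" | "anger" | "optimism";

// Rules are checked top to bottom; the first match wins.
function cinematicMode(s: EmotionScores): CinematicMode {
  if (s.anger >= 0.55) return "danger";
  if (s.sadness >= 0.55 && s.optimism >= 0.2) return "awe";
  if (s.sadness >= 0.55) return "melancholy";
  if (s.optimism >= 0.6 && s.joy >= 0.2) return "triumph";
  if (s.optimism >= 0.55) return "hope";
  if (s.anger >= 0.35 && s.sadness >= 0.25) return "tension";

  // Fallback: the name of the highest raw score (ties resolve to the earlier name).
  const names: Array<keyof EmotionScores> = ["anger", "joy", "sadness", "optimism"];
  let dominant: keyof EmotionScores = "anger";
  for (const name of names) {
    if (s[name] > s[dominant]) dominant = name;
  }
  return dominant;
}
```

The order matters: the awe check has to come before melancholy, otherwise every high-sadness chapter would be classified as melancholy and the bittersweet case would never fire.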

PROFILES

A static record of 10 cinematic music profiles (not used for music generation anymore — the project moved from Jamendo API to Web Audio synthesis). The mood field from each profile is used to drive the ambient drone's frequency selection.


lib/narrate.ts — Text-to-Speech Pipeline

Exports: truncateAtSentence(), polishForTTS(), formatForNarration(), narrateText()

truncateAtSentence(text, limit=1500)

Hard-truncating at a character limit produces mid-sentence cuts that sound unnatural. This function finds the last sentence-ending punctuation mark (., !, or ?) before the limit and cuts there, appending "…". The 1500-character limit is chosen to stay safely within Vercel's 60-second function timeout: Deepgram typically renders 1500 characters in roughly 8–12 seconds.
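
One way to implement the sentence-boundary cut is sketched below. Two behaviours are assumptions rather than documented facts: text already under the limit is returned unchanged, and text with no sentence boundary falls back to a hard cut.

```typescript
// Cuts text at the last sentence end before `limit` and appends an ellipsis.
function truncateAtSentence(text: string, limit = 1500): string {
  if (text.length <= limit) return text; // assumption: short text is left untouched

  const head = text.slice(0, limit);
  const cut = Math.max(
    head.lastIndexOf("."),
    head.lastIndexOf("!"),
    head.lastIndexOf("?")
  );
  // Assumption: with no sentence boundary at all, keep the hard-cut head.
  const kept = cut >= 0 ? head.slice(0, cut + 1) : head;
  return kept + "\u2026"; // "\u2026" is the ellipsis character
}
```

For example, `truncateAtSentence("One. Two. Three is long", 12)` keeps "One. Two." and appends the ellipsis, dropping the unfinished third sentence.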

polishForTTS(text)

PDF extraction introduces artifacts that TTS engines render awkwardly:

| Artifact | Example | Fix |
|---|---|---|
| Ligatures | ﬁre | fire |
| Missing spaces | said.He | said. He |
| Hyphenated breaks | some-\nwhere | somewhere |
| Run-on sentences (>50 words) | long clause | split at comma/conjunction |

This is a Groq call with temperature: 0.0 — deterministic output, fixes artifacts without altering story text.
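
To make the artifact classes concrete, here is a regex approximation of the first three fixes. It is not how polishForTTS works (that is a Groq call, as described above); it is a deterministic illustration only, and the function name basicPolish is hypothetical. It does not attempt the run-on sentence split, and it will wrongly join genuinely hyphenated words that happen to break across lines.

```typescript
// Unicode ligature code points mapped to their plain-letter expansions.
const LIGATURES: Array<[RegExp, string]> = [
  [/\uFB03/g, "ffi"],
  [/\uFB04/g, "ffl"],
  [/\uFB00/g, "ff"],
  [/\uFB01/g, "fi"],
  [/\uFB02/g, "fl"],
];

function basicPolish(text: string): string {
  let out = text;
  for (const [pattern, replacement] of LIGATURES) {
    out = out.replace(pattern, replacement);
  }
  // Rejoin words hyphenated across a line break: "some-\nwhere" becomes "somewhere".
  out = out.replace(/([a-z])-\s*\n\s*([a-z])/g, "$1$2");
  // Insert the missing space after sentence punctuation: "said.He" becomes "said. He".
  out = out.replace(/([a-z][.!?])([A-Z])/g, "$1 $2");
  return out;
}
```

A pass like this could also serve as a cheap pre-filter, but judgement calls such as where to split a long sentence are the reason the project hands the text to an LLM.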

formatForNarration(text)

Deepgram reads ... in input text as an audible pause. This function splits text on paragraph boundaries (\n\n or more) and joins them with "  ...  ". The two spaces on each side of the ellipsis ensure Deepgram doesn't merge it with adjacent words. Each paragraph also gets a guaranteed terminal punctuation mark, because Deepgram's prosody (rising/falling intonation) depends on sentence endings.
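
The paragraph-joining step might look like the sketch below. Collapsing single newlines inside a paragraph and treating a closing quote after punctuation as a valid ending are assumptions added for the example.

```typescript
// Splits on paragraph breaks, guarantees terminal punctuation, joins with a pause marker.
function formatForNarration(text: string): string {
  const endsCleanly = /[.!?\u2026]["'\u201D\u2019)]*$/; // punctuation, then optional closers
  return text
    .split(/\n{2,}/)                             // paragraph boundaries
    .map((p) => p.replace(/\s+/g, " ").trim())   // assumption: flatten inner line breaks
    .filter((p) => p.length > 0)
    .map((p) => (endsCleanly.test(p) ? p : p + "."))
    .join("  ...  ");                            // two spaces on each side of the ellipsis
}
```

Given "First para\n\nSecond one.", this returns "First para.  ...  Second one.", so the TTS engine sees a complete sentence followed by a pause.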

narrateText(text, voice, _speed)

Calls Deepgram's /v1/speak endpoint. Output format: linear16 PCM, 24 kHz, WAV container. The speed parameter is accepted but currently unused because Deepgram Aura 2 does not yet expose a speed knob via the API. The underscore prefix on _speed marks the parameter as intentionally unused.
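
The request that narrateText sends can be sketched as a plain descriptor, which keeps the example free of network calls. The query parameters are the ones listed in the request lifecycle; the helper name buildSpeakRequest and the SpeakRequest shape are hypothetical.

```typescript
interface SpeakRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds (but does not send) a Deepgram text-to-speech request.
function buildSpeakRequest(text: string, voice: string, apiKey: string): SpeakRequest {
  const query = [
    "model=" + encodeURIComponent(voice),
    "encoding=linear16",
    "sample_rate=24000",
    "container=wav",
  ].join("&");

  return {
    url: "https://api.deepgram.com/v1/speak?" + query,
    method: "POST",
    headers: {
      Authorization: "Token " + apiKey, // Deepgram's token auth scheme
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ text }),
  };
}
```

The descriptor can then be passed to fetch together with an abort signal that enforces the 55-second timeout, and the response body read as an ArrayBuffer holding the WAV bytes.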


lib/ambient-drone.ts — Generative Audio

Exports: DroneHandle, startAmbientDrone(mood, volume)

This module replaced a Jamendo music API integration. The problem with licensed music APIs: even tracks tagged "ambient" often have recognizable melodies, rhythmic patterns, or production signatures that feel tonally wrong beneath narration. A synthesized drone has no melody — it's pure textural atmosphere.

Signal chain:

OscillatorNode (sine/triangle)
    + micro-detune (i × 0.25 Hz)          ← stereo width
    + LFO → osc.frequency                  ← organic pitch drift
        │
        ▼
    GainNode (0.22 / numOscillators)
        │
        ▼
    ConvolverNode (synthetic reverb)        ← spaciousness
        │
        ▼
    BiquadFilter (lowpass, Q=0.6)           ← removes brightness, keeps texture
        │
        ▼
    GainNode (masterGain = volume)          ← slider control
        │
        ▼
    AudioContext.destination
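
The per-oscillator numbers in this chain can be computed up front, separately from any Web Audio nodes. The sketch below derives them from the formulas in the Phase 3 diagram; the OscillatorPlan shape and the planOscillators name are hypothetical.

```typescript
interface OscillatorPlan {
  type: "sine" | "triangle";
  frequencyHz: number; // base pitch plus micro-detune
  gain: number;        // per-oscillator level before the reverb
  lfoRateHz: number;   // speed of the slow pitch drift
  lfoDepthHz: number;  // how far the pitch drifts, in Hz
}

// Computes settings for each drone oscillator from the mood's frequency list.
function planOscillators(frequencies: number[]): OscillatorPlan[] {
  return frequencies.map((hz, i): OscillatorPlan => ({
    type: i === 0 ? "sine" : "triangle",
    frequencyHz: hz + i * 0.25,
    gain: 0.22 / frequencies.length,
    lfoRateHz: 0.04 + i * 0.013,
    lfoDepthHz: hz * 0.004,
  }));
}
```

Each plan entry maps onto one OscillatorNode, one GainNode, and one LFO oscillator whose gain output is connected to the main oscillator's frequency parameter.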

Why a ConvolverNode with a random IR instead of a delay-based reverb? A ConvolverNode using a noise-burst impulse response (exponentially decaying random samples) produces a dense, diffuse reverb that disguises the oscillator's waveform origin. Simple delay lines produce audible echo artifacts at drone frequencies. The IR is generated at runtime so there's no audio file to load.
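
Generating the impulse response itself needs no audio API, only two arrays of decaying noise. The sketch below assumes the envelope is (1 - t) raised to the power 2.2, a common way to build this kind of IR; the README states the duration and the power but not the exact formula. The injectable random source exists only to make the function testable.

```typescript
// Builds a stereo noise-burst impulse response: random samples under a decaying envelope.
function makeImpulseResponse(
  sampleRate: number,
  seconds = 3.5,
  decayPower = 2.2,
  random: () => number = Math.random
): [Float32Array, Float32Array] {
  const length = Math.floor(sampleRate * seconds);
  const left = new Float32Array(length);
  const right = new Float32Array(length);
  for (let i = 0; i < length; i++) {
    // Envelope falls from 1 at the start towards 0 at the end (assumed shape).
    const envelope = Math.pow(1 - i / length, decayPower);
    left[i] = (random() * 2 - 1) * envelope;
    right[i] = (random() * 2 - 1) * envelope;
  }
  return [left, right];
}
```

In the browser, the two channels would be copied into an AudioBuffer created with `ctx.createBuffer(2, length, ctx.sampleRate)` via `copyToChannel`, and that buffer assigned to the ConvolverNode's `buffer` property.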

DroneHandle interface

interface DroneHandle {
  setVolume: (v: number) => void;  // live gain change, no clicks
  stop: () => void;                // stops oscillators, closes AudioContext after 200ms
}

Calling setVolume(0) on pause, rather than stop(), keeps the oscillators running silently. Restarting oscillators produces an audible click; a live gain change does not. The 200ms delay before ctx.close() on stop prevents audio glitches at the end of the last buffer.


API Routes

All routes use export const runtime = "nodejs" — Node.js runtime rather than the Edge runtime — because pdf-parse requires Node.js built-ins (Buffer, fs) not available in the Edge runtime.

| Route | Method | Input | Output |
|---|---|---|---|
| /api/parse | POST | multipart/form-data, file: File | { chapters, bookTitle, author, genre } |
| /api/analyze | POST | { chapterText, bookTitle, genre, chapterTitle } | { scores, musicPrompt, mood } |
| /api/narrate | POST | { chapterText, voice, speed } | WAV binary + X-Narrated-Text header |

/api/narrate sets export const maxDuration = 60, which is Vercel's free tier cap. The two API calls inside (Groq polish + Deepgram TTS) typically complete in 10–25 seconds combined.


app/page.tsx — Frontend

A single React component with no external UI library. State:

| State | Type | Purpose |
|---|---|---|
| chapters | Record<string,string> | Full chapter map from parse response |
| selectedChapter | string | Key into chapters map |
| bookTitle, bookGenre | string | Displayed in header |
| fileName | string | Shown in upload zone after file picked |
| voice, speed | string, number | Passed to narrate API |
| ambientVolume | number | 0–0.18, controls drone gain |
| step | "idle" \| "analyzing" \| "generating" \| "done" \| "error" | Controls button label and loading state |
| audioUrl | string | Object URL for WAV blob |
| mood | string | From analyze response, drives drone |
| narratedText | string | Decoded from X-Narrated-Text header |
| showTranscript | boolean | Toggles transcript panel |

droneRef holds a DroneHandle across renders without triggering re-renders. It's cleaned up on every new generate call and on narration end.

The generate flow is sequential (not parallel) because the mood string from /api/analyze must be available before the drone can be configured. The narrate call is independent of the mood but must start after analyze to allow the UI to progress through step labels correctly.


Key Technical Decisions

Groq over OpenAI

Groq's free tier provides 14,400 requests/day on llama-3.3-70b-versatile with no credit card required. The model quality is sufficient for structured JSON tasks (emotion scoring, regex extraction, text polishing). Latency is lower than OpenAI on equivalent tasks due to Groq's custom LPU hardware.

Deepgram over ElevenLabs / OpenAI TTS

Deepgram offers $200 of free credit on signup. Aura 2 voices are natural-sounding for long-form narration and the API is straightforward (single POST, binary WAV response). OpenAI TTS charges from day one and ElevenLabs has a character-per-month limit on the free tier.

Generative drone over Jamendo API

Jamendo's "ambient" catalogue includes tracks with recognizable melodies, production signatures, and tempo — all of which clash with narration. A Web Audio synthesizer produces guaranteed-textural output because there is no melody to clash. It also eliminates a third API key, a third external dependency, and the latency of a music API call.

Two parallel Groq calls for PDF parsing

A combined prompt asking for both heading pattern and book metadata consistently degraded chapter detection: the model returned an empty heading_pattern roughly 40% of the time when the prompt also included metadata fields. Separating into two concurrent calls restored accuracy on both tasks with no additional latency.

1500-character narration limit

Vercel functions on the free tier time out at 60 seconds. Benchmarking showed Deepgram renders ~1500 characters in 8–12 seconds. The Groq polish call adds ~3 seconds. Margin is kept for cold starts and network variance. The limit is applied at sentence boundaries so the excerpt always ends cleanly.

pdf-parse over pdfjs-dist

pdfjs-dist requires a Web Worker for non-blocking parsing and has significant configuration overhead in a serverless environment. pdf-parse wraps the same underlying PDF.js parser but exposes a simple pdfParse(buffer) → { text } async API that works out of the box in Node.js serverless functions.


Data Shapes

// POST /api/parse response
interface ParsedBook {
  chapters: Record<string, string>; // { "CHAPTER ONE — The Boy Who Lived": "It was a dark..." }
  bookTitle: string;
  author: string;
  genre: string;
}

// POST /api/analyze response
interface AnalyzeResponse {
  scores: { anger: number; joy: number; sadness: number; optimism: number };
  musicPrompt: string; // e.g. "expansive sci-fi orchestral score. wonder-filled tone. 72 BPM..."
  mood: string;        // e.g. "Awe — wonder-filled and bittersweet"
}

// POST /api/narrate response
// Content-Type: audio/wav
// X-Narrated-Text: URI-encoded narration string
// Body: raw WAV bytes (linear16, 24kHz, mono)
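
The X-Narrated-Text header is URI-encoded because narration text contains characters such as curly quotes, ellipses, and newlines that are not safe to place in an HTTP header as-is. A sketch of both sides follows; the helper names are hypothetical, and returning the raw value on a malformed header is an assumption.

```typescript
// Server side: make arbitrary narration text safe to place in a header value.
function encodeNarratedText(text: string): string {
  return encodeURIComponent(text);
}

// Client side: decode the header, tolerating a missing or malformed value.
function decodeNarratedText(headerValue: string | null): string {
  if (!headerValue) return "";
  try {
    return decodeURIComponent(headerValue);
  } catch {
    return headerValue; // assumption: show the raw value rather than fail
  }
}
```

On the client, the input would come from `response.headers.get("X-Narrated-Text")`, which returns null when the header is absent.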

Environment Variables

| Variable | Required | Description |
|---|---|---|
| GROQ_API_KEY | Yes | console.groq.com (free, no card) |
| DEEPGRAM_API_KEY | Yes | console.deepgram.com ($200 free credit) |

cp .env.example .env.local
# fill in both keys

If GROQ_API_KEY is missing, the parse route skips AI chapter detection (falls back to regex), polishForTTS returns the raw text, and analyzeEmotion throws. If DEEPGRAM_API_KEY is missing, the narrate route throws.


Local Development

npm install
npm run dev
# → http://localhost:3000

Type-check without building:

npx tsc --noEmit

Deployment

The project deploys to Vercel with no configuration changes. Click the button at the top of this README, or:

npm i -g vercel
vercel

Add both environment variables in Vercel Dashboard → Project → Settings → Environment Variables.

The /api/narrate function has maxDuration = 60 set. All other functions use the default (300s on Pro, 60s on Hobby). If narration times out on cold start, re-triggering typically succeeds as the function instance stays warm.


Available Voices

Aura 2 (recommended — higher naturalness on long-form text):

| Voice ID | Gender |
|---|---|
| aura-2-thalia-en | Female |
| aura-2-andromeda-en | Female |
| aura-2-luna-en | Female |
| aura-2-stella-en | Female |
| aura-2-zeus-en | Male |
| aura-2-orion-en | Male |

Aura v1 (also available, slightly lower naturalness):

| Voice ID | Gender |
|---|---|
| aura-asteria-en | Female |
| aura-luna-en | Female |
| aura-orion-en | Male |
| aura-orpheus-en | Male |
