cjessett/narrate


🎙️ pdf-epub-to-audio

A Fastify REST API that converts PDF and EPUB files to natural-sounding audio using AI TTS. Supports Kokoro (local), OpenAI TTS, ElevenLabs, and Google Cloud TTS; switch providers by changing a single env var (TTS_PROVIDER).


Features

  • 📄 PDF text extraction (via pdfjs-dist)
  • 📚 EPUB text extraction with chapter awareness (via epub2)
  • 🎤 Multi-provider TTS — OpenAI, ElevenLabs, Google Cloud
  • 🔀 Smart chunking — splits on sentence boundaries, never mid-word
  • 🎵 Output formats — MP3, WAV, OGG (converted via ffmpeg)
  • Async jobs — upload → get job ID → poll progress → download
  • 🔁 Concurrent synthesis — up to 3 parallel TTS requests per job
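
The "smart chunking" feature above can be illustrated with a minimal sketch. This is not the project's actual chunker.ts; the function names (splitSentences, chunkText) and the 4000-character default are assumptions made for illustration. It groups whole sentences into chunks and falls back to word boundaries only when a single sentence exceeds the limit.

```typescript
// Hypothetical helpers; the real src/utils/chunker.ts may work differently.

// Group whitespace-separated words into sentences. A sentence ends at a word
// that finishes with ., ! or ? (optionally followed by a closing quote/bracket).
function splitSentences(text: string): string[] {
  const sentences: string[] = [];
  let words: string[] = [];
  for (const word of text.split(/\s+/).filter(Boolean)) {
    words.push(word);
    if (/[.!?]["')\]]*$/.test(word)) {
      sentences.push(words.join(" "));
      words = [];
    }
  }
  if (words.length > 0) sentences.push(words.join(" "));
  return sentences;
}

// Pack sentences into chunks of at most maxChars characters (assumed default).
// A single word longer than maxChars is kept whole rather than split mid-word.
function chunkText(text: string, maxChars = 4000): string[] {
  const chunks: string[] = [];
  let current = "";
  const flush = (): void => {
    if (current) {
      chunks.push(current);
      current = "";
    }
  };

  for (const sentence of splitSentences(text)) {
    // Start a new chunk if this sentence would overflow the current one.
    if (current && current.length + 1 + sentence.length > maxChars) flush();

    if (sentence.length <= maxChars) {
      current = current ? `${current} ${sentence}` : sentence;
      continue;
    }
    // Over-long sentence: fall back to splitting on word boundaries.
    for (const word of sentence.split(" ")) {
      if (current && current.length + 1 + word.length > maxChars) flush();
      current = current ? `${current} ${word}` : word;
    }
  }
  flush();
  return chunks;
}
```

Each returned chunk could then be sent to the TTS provider as one request; the right limit depends on the provider's per-request input cap.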

Prerequisites

  • Node.js ≥ 18
  • ffmpeg installed and in your PATH
# macOS
brew install ffmpeg

# Ubuntu / Debian
sudo apt install ffmpeg

# Windows
winget install ffmpeg

Setup

# 1. Install dependencies
npm install

# 2. Configure environment
cp .env.example .env
# Edit .env and set your TTS provider API key

Minimum .env for Kokoro (free, local — recommended)

TTS_PROVIDER=kokoro
KOKORO_BASE_URL=http://localhost:8880   # default, change if needed
KOKORO_VOICE=af_bella

Minimum .env for OpenAI

TTS_PROVIDER=openai
OPENAI_API_KEY=sk-...
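
For orientation, here is a rough sketch of how env-based settings like these might be read and validated. It is not the project's config.ts: the loadTtsConfig name, the TtsConfig shape, the decision to reject a missing TTS_PROVIDER, and the fallback voice are all assumptions. Only the variable names and the default Kokoro URL come from the examples above.

```typescript
// Hypothetical config loader; names and validation rules are illustrative only.
type Provider = "kokoro" | "openai" | "elevenlabs" | "google";

interface TtsConfig {
  provider: Provider;
  kokoroBaseUrl: string;
  kokoroVoice: string;
  openaiApiKey?: string;
}

const KNOWN_PROVIDERS: readonly Provider[] = ["kokoro", "openai", "elevenlabs", "google"];

function loadTtsConfig(env: Record<string, string | undefined>): TtsConfig {
  const raw = (env.TTS_PROVIDER ?? "").toLowerCase();
  if (KNOWN_PROVIDERS.indexOf(raw as Provider) === -1) {
    throw new Error(`TTS_PROVIDER must be one of: ${KNOWN_PROVIDERS.join(", ")}`);
  }
  const provider = raw as Provider;

  // Assumption: an OpenAI key is only required when OpenAI is the active provider.
  if (provider === "openai" && !env.OPENAI_API_KEY) {
    throw new Error("OPENAI_API_KEY is required when TTS_PROVIDER=openai");
  }

  return {
    provider,
    // Default URL taken from the .env example above.
    kokoroBaseUrl: env.KOKORO_BASE_URL ?? "http://localhost:8880",
    // Assumption: fall back to the voice used in the .env example.
    kokoroVoice: env.KOKORO_VOICE ?? "af_bella",
    openaiApiKey: env.OPENAI_API_KEY,
  };
}
```

In a Node process this would typically be called once at startup as loadTtsConfig(process.env), so a misconfigured provider fails fast instead of surfacing on the first conversion.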

Running

# Development (hot reload)
npm run dev

# Production
npm run build
npm start

Server starts at http://localhost:3000.


API Reference

POST /convert

Upload a PDF or EPUB and start conversion.

Content-Type: multipart/form-data

Field     Type    Description
file      File    .pdf or .epub file (required)
format    string  mp3 | wav | ogg (default: mp3)
voice     string  Provider-specific voice ID
speed     number  0.25–4.0 (default: 1.0)
provider  string  Override: openai | elevenlabs | google
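
The field rules in the table above (allowed formats, the 0.25 to 4.0 speed range, and the defaults) could be checked with something like the following. This is a sketch, not the server's actual validation; parseConvertFields and ConvertOptions are hypothetical names, and it assumes form fields arrive as plain strings.

```typescript
// Hypothetical validator for the non-file fields of POST /convert.
const OUTPUT_FORMATS = ["mp3", "wav", "ogg"] as const;
type OutputFormat = (typeof OUTPUT_FORMATS)[number];

interface ConvertOptions {
  format: OutputFormat;
  speed: number;
  voice?: string;
  provider?: string;
}

function isOutputFormat(value: string): value is OutputFormat {
  return (OUTPUT_FORMATS as readonly string[]).indexOf(value) !== -1;
}

function parseConvertFields(fields: Record<string, string | undefined>): ConvertOptions {
  // Defaults follow the table: format mp3, speed 1.0.
  const format = fields.format ?? "mp3";
  if (!isOutputFormat(format)) {
    throw new Error(`format must be one of: ${OUTPUT_FORMATS.join(", ")}`);
  }

  const speed = fields.speed === undefined ? 1.0 : Number(fields.speed);
  if (!isFinite(speed) || speed < 0.25 || speed > 4.0) {
    throw new Error("speed must be a number between 0.25 and 4.0");
  }

  return { format, speed, voice: fields.voice, provider: fields.provider };
}
```

Rejecting bad input before the job is queued keeps errors synchronous with the upload, rather than surfacing later as a job with status "error".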

Response 202:

{
  "jobId": "abc-123",
  "status": "queued",
  "statusUrl": "/jobs/abc-123",
  "downloadUrl": "/download/abc-123"
}

GET /jobs/:jobId

Poll job progress.

Response:

{
  "jobId": "abc-123",
  "status": "converting",
  "progress": 45,
  "message": "Synthesizing chunk 3/7…",
  "totalChunks": 7,
  "completedChunks": 3
}

status values: queued → extracting → converting → merging → done | error
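
A client typically polls this endpoint until the job reaches a terminal status. The sketch below shows one way to do that; waitForJob, isTerminal, the JobInfo shape (inferred from the sample response), and the 2-second default interval are assumptions, not part of the API. The status fetcher is passed in as a function so the loop stays independent of any particular HTTP client.

```typescript
// Hypothetical polling helper built around the status values listed above.
type JobStatus = "queued" | "extracting" | "converting" | "merging" | "done" | "error";

interface JobInfo {
  jobId: string;
  status: JobStatus;
  progress: number;
  message?: string;
  totalChunks?: number;
  completedChunks?: number;
}

// "done" and "error" are the only states after which polling should stop.
function isTerminal(status: JobStatus): boolean {
  return status === "done" || status === "error";
}

// Calls getJob repeatedly, waiting intervalMs between attempts, until the job
// is finished or failed. getJob is supplied by the caller.
async function waitForJob(
  getJob: () => Promise<JobInfo>,
  intervalMs = 2000,
): Promise<JobInfo> {
  for (;;) {
    const job = await getJob();
    if (isTerminal(job.status)) return job;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

With Node 18's built-in fetch, getJob could be a small wrapper that requests /jobs/:jobId and parses the JSON body. A production client would probably also cap the total wait time.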


GET /download/:jobId

Download the finished audio file (only available when status === "done").

Returns the audio file as a binary stream with the correct Content-Type.
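
As an illustration of "the correct Content-Type", the three output formats map onto standard MIME types roughly as follows. The helper name downloadHeaders and the Content-Disposition filename scheme are assumptions; the server may set its headers differently.

```typescript
// Hypothetical header builder for GET /download/:jobId.
type AudioFormat = "mp3" | "wav" | "ogg";

// Commonly used MIME types for the supported output formats.
const AUDIO_CONTENT_TYPES: Record<AudioFormat, string> = {
  mp3: "audio/mpeg",
  wav: "audio/wav",
  ogg: "audio/ogg",
};

function downloadHeaders(jobId: string, format: AudioFormat): Record<string, string> {
  return {
    "Content-Type": AUDIO_CONTENT_TYPES[format],
    // Assumption: name the download after the job ID and format.
    "Content-Disposition": `attachment; filename="${jobId}.${format}"`,
  };
}
```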


GET /voices

List available voices for the active (or specified) provider.

GET /voices?provider=openai

GET /formats

Returns supported output formats and providers.


GET /health

Health check.


Example: cURL

# Convert a PDF to MP3
curl -X POST http://localhost:3000/convert \
  -F "file=@my-book.pdf" \
  -F "format=mp3" \
  -F "voice=nova" \
  -F "speed=1.1"

# Poll status
curl http://localhost:3000/jobs/abc-123

# Download when done
curl -o my-book.mp3 http://localhost:3000/download/abc-123

Provider Notes

Kokoro (local, free — recommended)

  • Runs entirely on your machine, no API key, no cost per character
  • Quality rivals paid providers, especially af_bella and bm_george
  • Requires a running kokoro-fastapi server

Setup (pick one):

# Option A — Python (CPU)
pip install kokoro-fastapi
python -m kokoro_fastapi

# Option B — Docker CPU
docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:latest

# Option C — Docker GPU (much faster for long books)
docker run --gpus all -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-gpu:latest

Available voices:

ID           Name      Accent / gender
af_bella     Bella     American Female
af_nicole    Nicole    American Female
af_sarah     Sarah     American Female
af_sky       Sky       American Female
am_adam      Adam      American Male
am_michael   Michael   American Male
bf_emma      Emma      British Female
bf_isabella  Isabella  British Female
bm_george    George    British Male
bm_lewis     Lewis     British Male

Check the server is reachable before converting:

curl http://localhost:3000/health/kokoro

OpenAI TTS

  • Models: tts-1 (faster, cheaper) or tts-1-hd (higher quality)
  • Voices: alloy, echo, fable, onyx, nova, shimmer
  • ~$0.015 per 1K characters for tts-1; tts-1-hd costs roughly twice that (check OpenAI's pricing page for current rates)
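
Since pricing is per character, a book's cost can be estimated from its extracted text length. A minimal sketch, with the rate passed in as a parameter because prices change; estimateCostUsd is a hypothetical helper, not part of this project.

```typescript
// Hypothetical cost estimator: characters / 1000 * price per 1K characters.
function estimateCostUsd(charCount: number, usdPer1kChars: number): number {
  if (charCount < 0 || usdPer1kChars < 0) {
    throw new RangeError("charCount and usdPer1kChars must be non-negative");
  }
  return (charCount / 1000) * usdPer1kChars;
}

// Example: a 500,000-character book at $0.015 per 1K characters
// comes to roughly $7.50.
const exampleCostUsd = estimateCostUsd(500_000, 0.015);
console.log(`Estimated cost: $${exampleCostUsd.toFixed(2)}`);
```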

ElevenLabs

  • Most natural, expressive voices
  • Use GET /voices to list available voice IDs
  • Turbo v2 model used by default

Google Cloud TTS

  • Requires a service account JSON key
  • Set GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
  • Neural2 voices are highest quality

Project Structure

src/
├── server.ts          # Fastify entry point
├── config.ts          # Env-based config
├── pipeline.ts        # Conversion orchestrator
├── types.ts           # Shared TypeScript types
├── extractors/
│   ├── pdf.ts         # pdfjs-dist extractor
│   └── epub.ts        # epub2 extractor
├── providers/
│   ├── openai.ts      # OpenAI TTS
│   ├── elevenlabs.ts  # ElevenLabs TTS
│   ├── google.ts      # Google Cloud TTS
│   └── factory.ts     # Provider resolver
├── routes/
│   └── index.ts       # All API routes
└── utils/
    ├── chunker.ts     # Text chunking + cleaning
    ├── audioMerger.ts # ffmpeg concat + encode
    └── jobStore.ts    # In-memory job tracker
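
The Features list mentions up to 3 parallel TTS requests per job. One common way to get that behavior is a small worker-pool helper like the sketch below. It is not taken from pipeline.ts; mapWithLimit is a hypothetical name, and the real orchestrator may use a library or a different strategy.

```typescript
// Hypothetical concurrency limiter: runs worker over items with at most
// `limit` calls in flight, and returns results in the original item order.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (limit < 1) throw new RangeError("limit must be at least 1");

  const results: R[] = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item

  // Each runner repeatedly claims the next index until none are left.
  async function run(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runners: Promise<void>[] = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    runners.push(run());
  }
  await Promise.all(runners);
  return results;
}
```

Called as mapWithLimit(chunks, 3, synthesize), with synthesize standing in for a provider call, this keeps three requests in flight while preserving chunk order for the later ffmpeg merge. If any worker rejects, the whole call rejects; retry logic would be a separate concern.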
