Extract intelligence from any social media link.
Paste a URL. Get the transcript, metadata, entities, summary, and key insights — all processed locally, no API keys required.
$ harvest https://youtube.com/watch?v=... https://x.com/user/status/...
- 1000+ platforms — YouTube, Twitter/X, Instagram, TikTok, Reddit, LinkedIn, Twitch, Vimeo, and more via yt-dlp
- Automatic fallback chain — yt-dlp → cobalt.tools → HTML scraper, with circuit breakers
- Local transcription — Whisper via faster-whisper; no API key, runs fully offline
- Smart caching — content-addressable, never processes the same URL twice
- Structured output — JSON, Markdown, plain text, or NDJSON (for pipes)
- Optional AI analysis — summary, key points, entities, topics, sentiment via Ollama (local) or Claude/OpenAI
# Homebrew
brew tap devjoaocastro/harvest
brew install harvest

# pipx
pipx install harvest-cli

# uv
uv tool install harvest-cli

# Extract metadata + transcribe
harvest https://www.youtube.com/watch?v=dQw4w9WgXcQ
# Multiple URLs concurrently
harvest url1 url2 url3
# Save transcript to markdown
harvest https://youtube.com/... --format markdown --output transcript.md
# Use a larger Whisper model for better accuracy
harvest https://youtube.com/... --model large-v3
# Add AI analysis via local Ollama
harvest https://youtube.com/... --analyse
# Export structured JSON (perfect for piping)
harvest https://youtube.com/... --format json | jq '.[] | .transcript.full_text'
# Force re-process (ignore cache)
harvest https://youtube.com/... --force

| Model | Size | Speed (CPU) | Notes |
|---|---|---|---|
| tiny | 39M | ~32x | Fastest, English only |
| base | 74M | ~16x | Default — good balance |
| small | 244M | ~6x | Better multilingual |
| medium | 769M | ~2x | High accuracy |
| large-v3 | 1.5B | ~1x | Best, slow on CPU |
| turbo | 809M | ~8x | large-v3 quality, needs GPU |
Set with `--model <size>` or `HARVEST_WHISPER__MODEL=small`.
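Under the hood, transcription is a local faster-whisper call. A minimal sketch of the equivalent direct usage, assuming an already-extracted audio file (`audio.mp3` is a placeholder, not a file harvest produces by that name):

```python
# Rough equivalent of the transcription step, calling faster-whisper
# directly. The audio path is a placeholder.
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.mp3")

print(f"Detected language: {info.language} ({info.language_probability:.0%})")
full_text = " ".join(seg.text.strip() for seg in segments)
print(full_text)
```

`segments` is a lazy generator, so joining it streams the decode; larger models trade the CPU speed factors in the table above for accuracy.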
Config file at `~/.config/harvest/config.toml`:
[whisper]
model = "small"
language = "pt" # force language (optional)
[analysis]
enabled = true
provider = "ollama" # ollama | anthropic | openai
model = "llama3.2"
[cache]
ttl_seconds = 604800 # 1 week
max_size_mb = 2048
[extractor]
chain = ["ytdlp", "cobalt", "html"] # fallback order

All settings can also be overridden via environment variables:
HARVEST_WHISPER__MODEL=large-v3
HARVEST_ANALYSIS__ENABLED=true
HARVEST_ANALYSIS__PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
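The double underscore maps an environment variable onto a nested config key (`HARVEST_WHISPER__MODEL` → `whisper.model`). A minimal sketch of how that pattern works with pydantic-settings; the class names are illustrative, not harvest's actual internals:

```python
# Illustrative sketch of double-underscore env overrides via
# pydantic-settings; class names are hypothetical.
from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict

class WhisperSettings(BaseModel):
    model: str = "base"
    language: str | None = None

class HarvestSettings(BaseSettings):
    model_config = SettingsConfigDict(
        env_prefix="HARVEST_",
        env_nested_delimiter="__",  # HARVEST_WHISPER__MODEL -> whisper.model
    )
    whisper: WhisperSettings = WhisperSettings()

# $ HARVEST_WHISPER__MODEL=large-v3 python settings_demo.py  -> "large-v3"
print(HarvestSettings().whisper.model)
```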
# 1. Install Ollama
brew install ollama
ollama pull llama3.2
# 2. Run
harvest https://... --analyse

export ANTHROPIC_API_KEY=sk-ant-...
HARVEST_ANALYSIS__PROVIDER=anthropic HARVEST_ANALYSIS__MODEL=claude-haiku-4-5-20251001 \
harvest https://... --analyse

export OPENAI_API_KEY=sk-...
HARVEST_ANALYSIS__PROVIDER=openai HARVEST_ANALYSIS__MODEL=gpt-4o-mini \
harvest https://... --analyse

harvest cache stats # show size + entries
harvest cache clear # wipe all cached results
harvest cache invalidate <url> # remove one URL

URL(s)
│
▼
ExtractorChain (circuit-breaker protected)
├── yt-dlp (1000+ platforms, primary)
├── cobalt.tools (public API, fallback)
└── HTML scraper (metadata-only, last resort)
│
▼
LocalWhisperTranscriber (faster-whisper, fully offline)
│
▼
BaseAnalyser (optional)
├── OllamaAnalyser (local LLM)
├── ClaudeAnalyser (Anthropic API)
└── OpenAIAnalyser (OpenAI API)
│
▼
ResultCache (content-addressable, SHA-256 keyed)
│
▼
HarvestResult (Pydantic v2 model)
→ JSON / Markdown / Text / NDJSON
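The ResultCache addresses entries by SHA-256. One plausible keying scheme, sketched below; the canonicalisation rules are a guess at the kind of normalisation involved, not harvest's exact ones:

```python
# Illustrative SHA-256 content addressing for cached results.
import hashlib
from urllib.parse import urlsplit, urlunsplit

def cache_key(url: str, whisper_model: str = "base") -> str:
    parts = urlsplit(url.strip())
    # Lower-case scheme/host and drop the fragment so trivially
    # different spellings of the same URL share one cache entry.
    canonical = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )
    # Include the model in the key: a new model means a new transcript.
    return hashlib.sha256(f"{canonical}|{whisper_model}".encode()).hexdigest()

print(cache_key("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))
```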
Key design principles:
- Async-first throughout — concurrent URL processing with asyncio
- Circuit breaker per extractor — automatic failover after N failures (see the sketch after this list)
- Content-addressable cache — never re-process the same URL
- Streaming results — partial output as each URL completes
- Zero-config defaults — works out of the box with no configuration
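A minimal sketch of the per-extractor circuit-breaker idea, assuming a simple "open after N consecutive failures, retry after a cooldown" policy; this illustrates the pattern, not harvest's actual code:

```python
# Illustrative circuit breaker: trips open after max_failures
# consecutive errors, then allows a retry after reset_after seconds.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 60.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: float | None = None

    def available(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            self.opened_at = None  # half-open: allow one retry
            self.failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()  # trip: skip this extractor
```

With one breaker per extractor, the chain walks yt-dlp → cobalt → html in order, skipping any extractor whose breaker is open and falling through to the next.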
- Python 3.12+
- ffmpeg (`brew install ffmpeg`)
- yt-dlp (auto-installed)
- faster-whisper (auto-installed)
MIT © João Castro