devjoaocastro/harvest

harvest

Extract intelligence from any social media link.

Paste a URL. Get the transcript, metadata, entities, summary, and key insights — all processed locally, no API keys required.

$ harvest https://youtube.com/watch?v=... https://x.com/user/status/...

Features

  • 1000+ platforms — YouTube, Twitter/X, Instagram, TikTok, Reddit, LinkedIn, Twitch, Vimeo, and more via yt-dlp
  • Automatic fallback chain — yt-dlp → cobalt.tools → HTML scraper, with circuit breakers
  • Local transcription — Whisper via faster-whisper, no API key, runs fully offline
  • Smart caching — content-addressable, never processes the same URL twice
  • Structured output — JSON, Markdown, plain text, or NDJSON (for pipes)
  • Optional AI analysis — summary, key points, entities, topics, sentiment via Ollama (local) or Claude/OpenAI

Install

Homebrew (recommended for macOS)

brew tap devjoaocastro/harvest
brew install harvest

pipx (Python users)

pipx install harvest-cli

uv (fastest)

uv tool install harvest-cli

Usage

# Extract metadata + transcribe
harvest https://www.youtube.com/watch?v=dQw4w9WgXcQ

# Multiple URLs concurrently
harvest url1 url2 url3

# Save transcript to markdown
harvest https://youtube.com/... --format markdown --output transcript.md

# Use a larger Whisper model for better accuracy
harvest https://youtube.com/... --model large-v3

# Add AI analysis via local Ollama
harvest https://youtube.com/... --analyse

# Export structured JSON (perfect for piping)
harvest https://youtube.com/... --format json | jq '.[] | .transcript.full_text'

# Force re-process (ignore cache)
harvest https://youtube.com/... --force

Whisper Models

Model      Size    Speed (CPU)   Notes
tiny       39M     ~32x          Fastest, lowest accuracy
base       74M     ~16x          Default — good balance
small      244M    ~6x           Better multilingual accuracy
medium     769M    ~2x           High accuracy
large-v3   1.5B    ~1x           Best accuracy, slow on CPU
turbo      809M    ~8x           Near large-v3 quality, best with a GPU

Set with --model <size> or HARVEST_WHISPER__MODEL=small.


Configuration

Config file at ~/.config/harvest/config.toml:

[whisper]
model = "small"
language = "pt"   # force language (optional)

[analysis]
enabled = true
provider = "ollama"   # ollama | anthropic | openai
model = "llama3.2"

[cache]
ttl_seconds = 604800   # 1 week
max_size_mb = 2048

[extractor]
chain = ["ytdlp", "cobalt", "html"]   # fallback order

All settings can also be overridden via environment variables:

HARVEST_WHISPER__MODEL=large-v3
HARVEST_ANALYSIS__ENABLED=true
HARVEST_ANALYSIS__PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
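
The `HARVEST_<SECTION>__<KEY>` convention maps each variable onto the matching section of config.toml. A stdlib sketch of that mapping, for illustration only — harvest's real settings loader is not shown here:

```python
def env_overrides(environ, prefix="HARVEST_", delim="__"):
    """Map HARVEST_<SECTION>__<KEY>=value variables onto the same nested
    shape as config.toml, e.g. {"whisper": {"model": "large-v3"}}."""
    out = {}
    for name, value in environ.items():
        if not name.startswith(prefix) or delim not in name:
            continue
        section, _, key = name[len(prefix):].partition(delim)
        out.setdefault(section.lower(), {})[key.lower()] = value
    return out

env = {"HARVEST_WHISPER__MODEL": "large-v3", "HARVEST_ANALYSIS__ENABLED": "true"}
print(env_overrides(env))
# {'whisper': {'model': 'large-v3'}, 'analysis': {'enabled': 'true'}}
```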

Analysis Providers

Ollama (default — fully local)

# 1. Install Ollama
brew install ollama
ollama pull llama3.2

# 2. Run
harvest https://... --analyse

Anthropic Claude

export ANTHROPIC_API_KEY=sk-ant-...
HARVEST_ANALYSIS__PROVIDER=anthropic HARVEST_ANALYSIS__MODEL=claude-haiku-4-5-20251001 \
  harvest https://... --analyse

OpenAI

export OPENAI_API_KEY=sk-...
HARVEST_ANALYSIS__PROVIDER=openai HARVEST_ANALYSIS__MODEL=gpt-4o-mini \
  harvest https://... --analyse
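
All three providers plug into one analyser interface, selected by the `provider` setting. A sketch of that dispatch — the class names come from the Architecture section below, but the method name and return shape are assumptions, not harvest's actual API:

```python
from abc import ABC, abstractmethod

class BaseAnalyser(ABC):
    """Shared interface for analysis providers (illustrative sketch)."""
    @abstractmethod
    def analyse(self, transcript: str) -> dict: ...

class OllamaAnalyser(BaseAnalyser):
    def analyse(self, transcript: str) -> dict:
        # A real implementation would call the local Ollama server;
        # stubbed here so the sketch stays self-contained.
        return {"provider": "ollama", "summary": transcript[:80]}

# The [analysis].provider string picks the implementation.
PROVIDERS = {"ollama": OllamaAnalyser}

analyser = PROVIDERS["ollama"]()
print(analyser.analyse("A transcript worth summarising.")["provider"])  # ollama
```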

Cache Management

harvest cache stats       # show size + entries
harvest cache clear       # wipe all cached results
harvest cache invalidate <url>  # remove one URL
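
Per the Architecture section, the cache is content-addressable: the SHA-256 of the URL names each entry. A minimal file-backed sketch of that idea — the on-disk layout and TTL check are assumptions, not harvest's actual storage format:

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path

class ResultCache:
    """Content-addressable result cache: SHA-256 of the URL is the key."""

    def __init__(self, root: Path, ttl_seconds: int = 604800):
        self.root = root
        self.ttl = ttl_seconds
        root.mkdir(parents=True, exist_ok=True)

    def _path(self, url: str) -> Path:
        digest = hashlib.sha256(url.encode()).hexdigest()
        return self.root / f"{digest}.json"

    def get(self, url: str):
        path = self._path(url)
        if not path.exists() or time.time() - path.stat().st_mtime > self.ttl:
            return None  # miss or expired
        return json.loads(path.read_text())

    def put(self, url: str, result: dict) -> None:
        self._path(url).write_text(json.dumps(result))

    def invalidate(self, url: str) -> None:
        self._path(url).unlink(missing_ok=True)

cache = ResultCache(Path(tempfile.mkdtemp()))
cache.put("https://example.com/v", {"title": "cached"})
print(cache.get("https://example.com/v"))  # {'title': 'cached'}
```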

Architecture

URL(s)
  │
  ▼
ExtractorChain (circuit-breaker protected)
  ├── yt-dlp      (1000+ platforms, primary)
  ├── cobalt.tools (public API, fallback)
  └── HTML scraper (metadata-only, last resort)
  │
  ▼
LocalWhisperTranscriber (faster-whisper, fully offline)
  │
  ▼
BaseAnalyser (optional)
  ├── OllamaAnalyser  (local LLM)
  ├── ClaudeAnalyser  (Anthropic API)
  └── OpenAIAnalyser  (OpenAI API)
  │
  ▼
ResultCache (content-addressable, SHA-256 keyed)
  │
  ▼
HarvestResult (Pydantic v2 model)
  → JSON / Markdown / Text / NDJSON

Key design principles:

  • Async-first throughout — concurrent URL processing with asyncio
  • Circuit breaker per extractor — automatic failover after N failures
  • Content-addressable cache — never re-process the same URL
  • Streaming results — partial output as each URL completes
  • Zero-config defaults — works out of the box with no configuration
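
The fallback chain with per-extractor circuit breakers can be sketched roughly as follows. Names mirror the diagram above, but the threshold, the lack of a cooldown/half-open state, and the extractor signatures are all simplifying assumptions:

```python
import asyncio

class CircuitBreaker:
    """Open (skip the extractor) after `threshold` consecutive failures.
    Real breakers usually also half-open after a cooldown; omitted here."""
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.threshold

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1

async def extract(url, extractors, breakers):
    # Walk the chain in order (ytdlp -> cobalt -> html), skipping any
    # extractor whose breaker has opened, until one succeeds.
    for name, fn in extractors:
        if breakers[name].is_open:
            continue
        try:
            result = await fn(url)
            breakers[name].record(True)
            return name, result
        except Exception:
            breakers[name].record(False)
    raise RuntimeError(f"all extractors failed for {url}")

async def demo():
    async def ytdlp(url):
        raise RuntimeError("blocked")  # primary fails, chain falls through
    async def cobalt(url):
        return {"url": url, "title": "ok"}
    chain = [("ytdlp", ytdlp), ("cobalt", cobalt)]
    breakers = {name: CircuitBreaker() for name, _ in chain}
    return await extract("https://example.com/v", chain, breakers)

print(asyncio.run(demo()))
```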

Requirements

  • Python 3.12+
  • ffmpeg (brew install ffmpeg)
  • yt-dlp (auto-installed)
  • faster-whisper (auto-installed)

License

MIT © João Castro
