devjoaocastro/harvest

harvest

Extract intelligence from any social media link.

Paste a URL. Get the transcript, metadata, entities, summary, and key insights — all processed locally, no API keys required.

$ harvest https://youtube.com/watch?v=... https://x.com/user/status/...

Features

  • 1000+ platforms — YouTube, Twitter/X, Instagram, TikTok, Reddit, LinkedIn, Twitch, Vimeo, and more via yt-dlp
  • Automatic fallback chain — yt-dlp → cobalt.tools → HTML scraper, with circuit breakers
  • Local transcription — Whisper via faster-whisper, no API key, runs fully offline
  • Smart caching — content-addressable, never processes the same URL twice
  • Structured output — JSON, Markdown, plain text, or NDJSON (for pipes)
  • Optional AI analysis — summary, key points, entities, topics, sentiment via Ollama (local) or Claude/OpenAI

Install

Homebrew (recommended for macOS)

brew tap devjoaocastro/harvest
brew install harvest

pipx (Python users)

pipx install harvest-cli

uv (fastest)

uv tool install harvest-cli

Usage

# Extract metadata + transcribe
harvest https://www.youtube.com/watch?v=dQw4w9WgXcQ

# Multiple URLs concurrently
harvest url1 url2 url3

# Save transcript to markdown
harvest https://youtube.com/... --format markdown --output transcript.md

# Use a larger Whisper model for better accuracy
harvest https://youtube.com/... --model large-v3

# Add AI analysis via local Ollama
harvest https://youtube.com/... --analyse

# Export structured JSON (perfect for piping)
harvest https://youtube.com/... --format json | jq '.[] | .transcript.full_text'

# Force re-process (ignore cache)
harvest https://youtube.com/... --force

Whisper Models

Model      Size    Speed (CPU)   Notes
tiny       39M     ~32x          Fastest, lowest accuracy
base       74M     ~16x          Default — good balance
small      244M    ~6x           Better multilingual accuracy
medium     769M    ~2x           High accuracy
large-v3   1.5B    ~1x           Best accuracy, slow on CPU
turbo      809M    ~8x           Near large-v3 quality, best with a GPU

Set with --model <size> or HARVEST_WHISPER__MODEL=small.


Configuration

Config file at ~/.config/harvest/config.toml:

[whisper]
model = "small"
language = "pt"   # force language (optional)

[analysis]
enabled = true
provider = "ollama"   # ollama | anthropic | openai
model = "llama3.2"

[cache]
ttl_seconds = 604800   # 1 week
max_size_mb = 2048

[extractor]
chain = ["ytdlp", "cobalt", "html"]   # fallback order

All settings can also be overridden via environment variables:

HARVEST_WHISPER__MODEL=large-v3
HARVEST_ANALYSIS__ENABLED=true
HARVEST_ANALYSIS__PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
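
The `HARVEST_<SECTION>__<KEY>` convention maps each variable onto the matching section of config.toml. A stdlib sketch of that mapping, for illustration only — harvest's real settings loader is not shown here:

```python
def env_overrides(environ, prefix="HARVEST_", delim="__"):
    """Map HARVEST_<SECTION>__<KEY>=value variables onto the same nested
    shape as config.toml, e.g. {"whisper": {"model": "large-v3"}}."""
    out = {}
    for name, value in environ.items():
        if not name.startswith(prefix) or delim not in name:
            continue
        section, _, key = name[len(prefix):].partition(delim)
        out.setdefault(section.lower(), {})[key.lower()] = value
    return out

env = {"HARVEST_WHISPER__MODEL": "large-v3", "HARVEST_ANALYSIS__ENABLED": "true"}
print(env_overrides(env))
# {'whisper': {'model': 'large-v3'}, 'analysis': {'enabled': 'true'}}
```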

Analysis Providers

Ollama (default — fully local)

# 1. Install Ollama
brew install ollama
ollama pull llama3.2

# 2. Run
harvest https://... --analyse

Anthropic Claude

export ANTHROPIC_API_KEY=sk-ant-...
HARVEST_ANALYSIS__PROVIDER=anthropic HARVEST_ANALYSIS__MODEL=claude-haiku-4-5-20251001 \
  harvest https://... --analyse

OpenAI

export OPENAI_API_KEY=sk-...
HARVEST_ANALYSIS__PROVIDER=openai HARVEST_ANALYSIS__MODEL=gpt-4o-mini \
  harvest https://... --analyse
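
All three providers plug into one analyser interface, selected by the `provider` setting. A sketch of that dispatch — the class names come from the Architecture section below, but the method name and return shape are assumptions, not harvest's actual API:

```python
from abc import ABC, abstractmethod

class BaseAnalyser(ABC):
    """Shared interface for analysis providers (illustrative sketch)."""
    @abstractmethod
    def analyse(self, transcript: str) -> dict: ...

class OllamaAnalyser(BaseAnalyser):
    def analyse(self, transcript: str) -> dict:
        # A real implementation would call the local Ollama server;
        # stubbed here so the sketch stays self-contained.
        return {"provider": "ollama", "summary": transcript[:80]}

# The [analysis].provider string picks the implementation.
PROVIDERS = {"ollama": OllamaAnalyser}

analyser = PROVIDERS["ollama"]()
print(analyser.analyse("A transcript worth summarising.")["provider"])  # ollama
```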

Cache Management

harvest cache stats       # show size + entries
harvest cache clear       # wipe all cached results
harvest cache invalidate <url>  # remove one URL
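
Per the Architecture section, the cache is content-addressable: the SHA-256 of the URL names each entry. A minimal file-backed sketch of that idea — the on-disk layout and TTL check are assumptions, not harvest's actual storage format:

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path

class ResultCache:
    """Content-addressable result cache: SHA-256 of the URL is the key."""

    def __init__(self, root: Path, ttl_seconds: int = 604800):
        self.root = root
        self.ttl = ttl_seconds
        root.mkdir(parents=True, exist_ok=True)

    def _path(self, url: str) -> Path:
        digest = hashlib.sha256(url.encode()).hexdigest()
        return self.root / f"{digest}.json"

    def get(self, url: str):
        path = self._path(url)
        if not path.exists() or time.time() - path.stat().st_mtime > self.ttl:
            return None  # miss or expired
        return json.loads(path.read_text())

    def put(self, url: str, result: dict) -> None:
        self._path(url).write_text(json.dumps(result))

    def invalidate(self, url: str) -> None:
        self._path(url).unlink(missing_ok=True)

cache = ResultCache(Path(tempfile.mkdtemp()))
cache.put("https://example.com/v", {"title": "cached"})
print(cache.get("https://example.com/v"))  # {'title': 'cached'}
```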

Architecture

URL(s)
  │
  ▼
ExtractorChain (circuit-breaker protected)
  ├── yt-dlp      (1000+ platforms, primary)
  ├── cobalt.tools (public API, fallback)
  └── HTML scraper (metadata-only, last resort)
  │
  ▼
LocalWhisperTranscriber (faster-whisper, fully offline)
  │
  ▼
BaseAnalyser (optional)
  ├── OllamaAnalyser  (local LLM)
  ├── ClaudeAnalyser  (Anthropic API)
  └── OpenAIAnalyser  (OpenAI API)
  │
  ▼
ResultCache (content-addressable, SHA-256 keyed)
  │
  ▼
HarvestResult (Pydantic v2 model)
  → JSON / Markdown / Text / NDJSON

Key design principles:

  • Async-first throughout — concurrent URL processing with asyncio
  • Circuit breaker per extractor — automatic failover after N failures
  • Content-addressable cache — never re-process the same URL
  • Streaming results — partial output as each URL completes
  • Zero-config defaults — works out of the box with no configuration
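
The fallback chain with per-extractor circuit breakers can be sketched roughly as follows. Names mirror the diagram above, but the threshold, the lack of a cooldown/half-open state, and the extractor signatures are all simplifying assumptions:

```python
import asyncio

class CircuitBreaker:
    """Open (skip the extractor) after `threshold` consecutive failures.
    Real breakers usually also half-open after a cooldown; omitted here."""
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.threshold

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1

async def extract(url, extractors, breakers):
    # Walk the chain in order (ytdlp -> cobalt -> html), skipping any
    # extractor whose breaker has opened, until one succeeds.
    for name, fn in extractors:
        if breakers[name].is_open:
            continue
        try:
            result = await fn(url)
            breakers[name].record(True)
            return name, result
        except Exception:
            breakers[name].record(False)
    raise RuntimeError(f"all extractors failed for {url}")

async def demo():
    async def ytdlp(url):
        raise RuntimeError("blocked")  # primary fails, chain falls through
    async def cobalt(url):
        return {"url": url, "title": "ok"}
    chain = [("ytdlp", ytdlp), ("cobalt", cobalt)]
    breakers = {name: CircuitBreaker() for name, _ in chain}
    return await extract("https://example.com/v", chain, breakers)

print(asyncio.run(demo()))
```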

Requirements

  • Python 3.12+
  • ffmpeg (brew install ffmpeg)
  • yt-dlp (auto-installed)
  • faster-whisper (auto-installed)

License

MIT © João Castro
