---
### **2. Add Repository Description** (on GitHub)
Click the gear icon next to "About" (right side of the repository page) and add:
- **Description:** `LLM-guided web scraper with personality - talk to McGee, get targeted content extraction`
- **Website:** Leave blank or add your portfolio
- **Topics:** `web-scraping`, `llm`, `gemini`, `python`, `ai-agent`, `automation`
---
### **3. Optional: Add a License**
On GitHub: Add file → Create new file → Name it `LICENSE`
Paste the MIT License text (if you want the project to be open source):
MIT License
Copyright (c) 2025 Merry
Permission is hereby granted, free of charge...
LLM-guided web scraper with personality - talk to McGee, get targeted content extraction
Scrapes McGee is a conversational web scraper that uses AI to understand what you want and extract it intelligently. No CSS selectors, no regex hell: just tell McGee what you're looking for in natural language.
Most scrapers online:
- Hardcoded CSS selectors that break when sites change
- Dumb link following (scrape everything or nothing)
- No content filtering, so you wade through garbage
- Configuration is painful YAML/JSON hell
Scrapes McGee:
- ✅ LLM-guided extraction: describe what you want, McGee figures it out
- ✅ Smart link following: only crawls pages that match your criteria
- ✅ Conversational interface: chat with McGee like a colleague
- ✅ Adaptive: handles site structure changes automatically
- ✅ Context-aware: remembers what it found, avoids duplicates
- ✅ Go 2-3+ links deep: intelligent depth control with quality filtering
git clone https://github.com/merrypranxter/scrapes-mcgee.git
cd scrapes-mcgee
# Install dependencies
pip install -e .
# Or use uv (faster)
uv pip install -e .

- Go to Google AI Studio
- Create an API key
- Create a `.env` file:

echo "GEMINI_API_KEY=your_key_here" > .env

python scrapes.py

You: scrape erowid for DMT entity encounters, grab about 50 reports
McGee: Aight, hunting for machine elves on Erowid. Targeting experience
vaults, looking for breakthrough + entity keywords. Want me to grab
dosages and entity descriptions too?
You: yeah, and only the weird ones, skip boring shit
McGee: Got it, filtering for high-weirdness reports. Starting crawl...
[shows progress]
Found 12 so far. This one mentions 'autonomous hypercube beings'
Keep going?
You: keep going, get to 50
McGee: [continues...]
Done. 53 reports extracted. Top entities: jesters (23),
geometric beings (18), insectoid intelligence (8).
Want JSON, markdown, or SQLite?
You: markdown
McGee: ✅ Exported to data/erowid_dmt_20240315.md
Just talk to McGee:
"scrape shadertoy for voronoi noise techniques"
"get me 100 salvia trip reports, focus on the zipper/wheel entities"
"find all McKenna talks mentioning timewave zero"
"grab shader code from these URLs [paste list]"
McGee uses LLMs to:
- Decide which links to follow based on your criteria
- Extract only what you asked for
- Skip irrelevant pages
- Adapt to different site structures
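The link-following step above can be sketched as a filter with a pluggable judge. This is a minimal illustration, not McGee's actual code: `Link`, `Judge`, `keyword_judge`, and `select_links` are hypothetical names, and the keyword matcher is an offline stand-in for the LLM call so the example runs without an API key.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Link:
    url: str
    anchor_text: str


# A judge takes (selection_prompt, link) and returns True to follow the link.
# In a real run this would wrap an LLM call; that wiring is NOT shown here.
Judge = Callable[[str, Link], bool]


def keyword_judge(selection_prompt: str, link: Link) -> bool:
    """Offline stand-in for the LLM: follow a link if its anchor text
    shares any word of four or more characters with the selection prompt."""
    wanted = {w.strip(".,:-").lower() for w in selection_prompt.split() if len(w) >= 4}
    words = {w.strip(".,:-").lower() for w in link.anchor_text.split()}
    return bool(wanted & words)


def select_links(links: Iterable[Link], selection_prompt: str, judge: Judge) -> List[Link]:
    """Keep only the links the judge approves, preserving page order."""
    return [link for link in links if judge(selection_prompt, link)]


# Made-up sample links for illustration only.
links = [
    Link("https://example.com/report/1", "Breakthrough report with geometric entities"),
    Link("https://example.com/about", "About this site"),
    Link("https://example.com/report/2", "Mild effects, nothing unusual"),
]
prompt = "Only follow links that mention geometric entities or breakthrough experiences"
followed = select_links(links, prompt, keyword_judge)
print([link.url for link in followed])
```

Keeping the judge as a plain callable means the crawl loop can be tested with a cheap stub and switched to a model-backed judge later.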
Example config:
selection_prompt: |
  Only follow links to experience reports that mention:
  - Geometric entities or beings
  - Breakthrough experiences
  - Entity communication
extraction_prompt: |
  Extract:
  - Dosage (mg)
  - Entity description (exact quotes)
  - Duration of contact

- JSON: structured data for code
- Markdown: human-readable with citations
- YAML: config-friendly format
- SQLite: queryable database with full-text search
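As a rough idea of what two of these formats could look like for the same records, here is a small standard-library sketch. The record fields (`dosage_mg`, `entity_description`, `source_url`) and both function names are assumptions made for illustration, and the sample data is invented; McGee's real export layout may differ.

```python
import json
from typing import Dict, List


def to_json(records: List[Dict[str, str]]) -> str:
    """Structured output for code: a JSON array of records."""
    return json.dumps(records, indent=2, ensure_ascii=False)


def to_markdown(records: List[Dict[str, str]]) -> str:
    """Human-readable output: one section per record, source URL as the citation."""
    parts = []
    for i, rec in enumerate(records, start=1):
        parts.append(f"## Report {i}")
        for key, value in rec.items():
            if key != "source_url":
                parts.append(f"- **{key}:** {value}")
        parts.append(f"Source: <{rec['source_url']}>")
        parts.append("")  # blank line between sections
    return "\n".join(parts)


# Invented sample record, not real scraped data.
records = [
    {
        "dosage_mg": "40",
        "entity_description": "autonomous hypercube beings",
        "source_url": "https://example.com/report/1",
    }
]
print(to_json(records))
print(to_markdown(records))
```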
McGee remembers:
- What you've already scraped
- Patterns it's finding
- Your preferences
Avoids:
- Re-scraping the same content
- Duplicate reports with different URLs
- Low-quality matches
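One simple way to catch the same report served from different URLs is to fingerprint the content rather than the address. The sketch below assumes exact-match hashing after light normalization; `fingerprint` and `SeenContent` are hypothetical names, and McGee's own duplicate detection may work differently.

```python
import hashlib
import re
from typing import Set


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace, so the same
    report served from two URLs produces the same fingerprint."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SeenContent:
    """Remembers fingerprints of content that has already been scraped."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def is_new(self, text: str) -> bool:
        """Return True the first time a piece of content is seen, False after."""
        fp = fingerprint(text)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True


seen = SeenContent()
first = seen.is_new("The  entities were\ngeometric.")
second = seen.is_new("the entities were geometric.")
print(first, second)
```

This only catches text that is identical after normalization. Near-duplicates (an extra footer, a reworded intro) would need a fuzzier comparison.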
Not just "go 3 links deep". McGee understands:
stop_conditions = {
    "target_count": 50,           # stop after 50 matching pages
    "content_threshold": "high",  # only high-quality matches
    "max_depth": 5,               # safety limit
    "time_limit": "30min",        # don't run forever
}

python scrapes.py
You: scrape erowid DMT experience vault for entity encounters,
get 50 reports, extract dosage, ROA, and entity descriptions
# McGee handles the rest

You: get all McKenna talks from organism.earth about language and
etymology, extract quotes and concepts
# Results: data/mckenna_language.md

You: scrape shadertoy for fractal techniques, need code + descriptions
# Results: data/shadertoy_fractals.json

scrapes-mcgee/
├── scrapes.py              # Main chat interface, run this
├── scraper/                # Core scraping engine
│   ├── core.py             # LLM-guided scraper
│   ├── storage.py          # SQLite + export utilities
│   └── extractors.py       # Content cleaning (future)
├── mcgee/                  # McGee's brain
│   ├── agent.py            # Conversational agent
│   └── personality.py      # McGee's voice (future)
├── targets/                # Example scrape configs
│   └── examples/
│       ├── erowid_dmt_entities.yaml
│       └── mckenna_language.yaml
└── data/                   # Scraped content
    └── scraper.db          # SQLite database
For repeated scrapes, save configs:
# targets/my_scrape.yaml
target_url: "https://example.com"
max_depth: 3
max_pages: 50
selection_prompt: |
  Only follow links about [topic]
extraction_prompt: |
  Extract:
  - Field 1
  - Field 2
output_format: json

Load it:
You: load targets/my_scrape.yaml and run it

- Fork this repo
- Click "Code" → "Codespaces" → "Create codespace"
- Add your `GEMINI_API_KEY` to Codespaces secrets
- Run `python scrapes.py`
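Codespaces secrets show up as environment variables, while a local setup keeps the key in a `.env` file. A loader that handles both could look roughly like this. It is a standard-library sketch with hypothetical helper names (`parse_env_file`, `get_api_key`); how `scrapes.py` actually reads the key is not shown in this README.

```python
import os
from pathlib import Path
from typing import Dict, Optional


def parse_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=value lines; blank lines and # comments are skipped."""
    values: Dict[str, str] = {}
    if not path.exists():
        return values
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(env_path: Path = Path(".env")) -> Optional[str]:
    """Prefer the process environment (e.g. a Codespaces secret),
    then fall back to the .env file. Returns None if neither has the key."""
    from_env = os.environ.get("GEMINI_API_KEY")
    if from_env:
        return from_env
    return parse_env_file(env_path).get("GEMINI_API_KEY")
```

Failing early with a clear message when `get_api_key()` returns `None` is usually friendlier than letting the first API call error out.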
- Core LLM-guided scraper
- Conversational interface
- SQLite storage + FTS
- Multiple export formats
- Playwright for JS-heavy sites
- Parallel/async crawling (speed boost)
- Web UI (chat in browser)
- Docker deployment
- MCP server integration
- Hugging Face Space demo
| Component | Why |
|---|---|
| Gemini 2.0 Flash | Free, huge context (1M tokens), fast |
| httpx | Async HTTP requests |
| BeautifulSoup | HTML parsing |
| SQLite + FTS5 | Storage with full-text search |
| Rich | Beautiful terminal UI |
| YAML | Human-readable configs |
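To show what the SQLite + FTS5 row buys you, here is a minimal full-text search sketch using Python's built-in `sqlite3`. The `reports` table, its columns, and the `search` helper are assumptions for illustration, not the schema in `storage.py`, and the rows are invented. It also assumes your SQLite build includes the FTS5 extension, which most standard Python builds do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one FTS5 virtual table; url is stored but not indexed.
conn.execute("CREATE VIRTUAL TABLE reports USING fts5(title, body, url UNINDEXED)")
conn.executemany(
    "INSERT INTO reports (title, body, url) VALUES (?, ?, ?)",
    [
        ("Report A", "Geometric beings made of light", "https://example.com/a"),
        ("Report B", "A quiet evening, nothing unusual", "https://example.com/b"),
    ],
)


def search(query: str) -> list:
    """Return (title, url) rows matching the FTS5 query, best match first."""
    rows = conn.execute(
        "SELECT title, url FROM reports WHERE reports MATCH ? ORDER BY rank",
        (query,),
    )
    return rows.fetchall()


print(search("geometric"))
```

The default FTS5 tokenizer is case-insensitive, so `geometric` matches `Geometric` in the stored text.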
McGee is open source! PRs welcome.
Ideas:
- New extraction strategies
- Site-specific scrapers
- UI improvements
- Personality enhancements
MIT. Scrape responsibly, respect robots.txt, don't be a dick.
Open an issue or start a discussion. McGee doesn't bite (much).
Built by: @merrypranxter
Powered by: Gemini 2.0, chaos, and coffee