merrypranxter/scrapes_mcgee


Scrapes McGee 🕷️


---

### **2. Add Repository Description** (on GitHub)

Click "About" (gear icon on right side), add:
- **Description:** `LLM-guided web scraper with personality: talk to McGee, get targeted content extraction`
- **Website:** Leave blank or add your portfolio
- **Topics:** `web-scraping`, `llm`, `gemini`, `python`, `ai-agent`, `automation`

---

### **3. Optional: Add a License**

On GitHub: Add file → Create new file → Name it `LICENSE`

Paste MIT license (if you want open source):

MIT License

Copyright (c) 2025 Merry

Permission is hereby granted, free of charge...

Scrapes McGee 🕷️

LLM-guided web scraper with personality: talk to McGee, get targeted content extraction

Scrapes McGee is a conversational web scraper that uses an LLM to work out what you want and extract it. No CSS selectors, no regex hell: just tell McGee what you're looking for in natural language.


Why McGee Kicks Ass

Most scrapers online:

  • Hardcoded CSS selectors that break when sites change
  • Dumb link following (scrape everything or nothing)
  • No content filtering, so you wade through garbage
  • Configuration is painful YAML/JSON hell

Scrapes McGee:

  • ✅ LLM-guided extraction: describe what you want, McGee figures it out
  • ✅ Smart link following: only crawls pages that match your criteria
  • ✅ Conversational interface: chat with McGee like a colleague
  • ✅ Adaptive: handles site structure changes automatically
  • ✅ Context-aware: remembers what it found, avoids duplicates
  • ✅ Goes 2-3+ links deep: intelligent depth control with quality filtering

Quick Start

1. Clone and Setup

git clone https://github.com/merrypranxter/scrapes_mcgee.git
cd scrapes_mcgee

# Install dependencies
pip install -e .

# Or use uv (faster)
uv pip install -e .

2. Get a Gemini API Key

  1. Go to Google AI Studio
  2. Create an API key
  3. Create .env file:
echo "GEMINI_API_KEY=your_key_here" > .env
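For reference, here is a minimal sketch of how the key in `.env` could be read at startup using only the standard library. The helper names `load_env_file` and `get_gemini_key` are hypothetical; McGee itself may load the key differently.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse KEY=VALUE lines from a .env file (hypothetical helper)."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_gemini_key(path: str = ".env") -> str:
    """Prefer the real environment, then fall back to the .env file."""
    key = os.environ.get("GEMINI_API_KEY") or load_env_file(path).get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; create a .env file first")
    return key
```

Checking the process environment first means the same code would also work where the key is injected as a secret rather than stored in a file.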

3. Run McGee

python scrapes.py

4. Start Scraping

You: scrape erowid for DMT entity encounters, grab about 50 reports

McGee: Aight, hunting for machine elves on Erowid. Targeting experience 
       vaults, looking for breakthrough + entity keywords. Want me to grab 
       dosages and entity descriptions too?

You: yeah, and only the weird ones, skip boring shit

McGee: Got it, filtering for high-weirdness reports. Starting crawl...
       [shows progress]
       Found 12 so far. This one mentions 'autonomous hypercube beings' 👀
       Keep going?

You: keep going, get to 50

McGee: [continues...]
       Done. 53 reports extracted. Top entities: jesters (23), 
       geometric beings (18), insectoid intelligence (8).
       Want JSON, markdown, or SQLite?

You: markdown

McGee: ✓ Exported to data/erowid_dmt_20240315.md

Features

🤖 Natural Language Interface

Just talk to McGee:

"scrape shadertoy for voronoi noise techniques"
"get me 100 salvia trip reports, focus on the zipper/wheel entities"
"find all McKenna talks mentioning timewave zero"
"grab shader code from these URLs [paste list]"

🎯 Smart Content Selection

McGee uses LLMs to:

  • Decide which links to follow based on your criteria
  • Extract only what you asked for
  • Skip irrelevant pages
  • Adapt to different site structures
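The README does not show how link selection is implemented, so the following is only a sketch of the idea. It collects links with the standard-library HTML parser (the project lists BeautifulSoup for this) and hands each one to a `judge` callable. In McGee that judge would presumably be an LLM call driven by your criteria; here `keyword_judge` is a hypothetical keyword stand-in so the example runs offline.

```python
from html.parser import HTMLParser
from typing import Callable, Optional
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute_url, anchor_text) pairs from an HTML page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def select_links(html: str, base_url: str, judge: Callable[[str, str], bool]) -> list[str]:
    """Return URLs the judge accepts, in page order and without duplicates."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    selected: list[str] = []
    for url, text in collector.links:
        if url not in selected and judge(url, text):
            selected.append(url)
    return selected


def keyword_judge(url: str, text: str) -> bool:
    """Stand-in for the LLM decision: accept links mentioning a keyword stem."""
    keywords = ("entit", "breakthrough")  # "entit" covers entity and entities
    haystack = f"{url} {text}".lower()
    return any(k in haystack for k in keywords)
```

Keeping the judge as a plain callable means the crawl loop stays the same whether the decision comes from a keyword list or a model.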

Example config:

selection_prompt: |
  Only follow links to experience reports that mention:
  - Geometric entities or beings
  - Breakthrough experiences
  - Entity communication
  
extraction_prompt: |
  Extract:
  - Dosage (mg)
  - Entity description (exact quotes)
  - Duration of contact

📊 Multiple Output Formats

  • JSON: structured data for code
  • Markdown: human-readable with citations
  • YAML: config-friendly format
  • SQLite: queryable database with full-text search
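As an illustration of two of these formats, the sketch below renders the same records as JSON and as Markdown with a source line per entry. The record fields (`title`, `text`, `url`) and the function names are assumptions for the example, not McGee's actual schema.

```python
import json


def to_json(records: list[dict]) -> str:
    """Serialize records as pretty-printed JSON, keeping non-ASCII text readable."""
    return json.dumps(records, indent=2, ensure_ascii=False)


def to_markdown(records: list[dict]) -> str:
    """Render each record as a section, with its source URL as the citation."""
    parts: list[str] = []
    for rec in records:
        parts.append(f"## {rec['title']}")
        parts.append("")
        parts.append(rec["text"])
        parts.append("")
        parts.append(f"Source: <{rec['url']}>")
        parts.append("")
    return "\n".join(parts)
```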

🧠 Context Memory

McGee remembers:

  • What you've already scraped
  • Patterns it's finding
  • Your preferences

Avoids:

  • Re-scraping the same content
  • Duplicate reports with different URLs
  • Low-quality matches
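One common way to catch the same report under different URLs is to fingerprint the normalized text. The sketch below shows that technique; the class and function names are hypothetical, and the README does not say which approach McGee actually uses.

```python
import hashlib
import re


def content_fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SeenContent:
    """Track visited URLs and content fingerprints (hypothetical helper)."""

    def __init__(self) -> None:
        self._urls: set[str] = set()
        self._fingerprints: set[str] = set()

    def is_new(self, url: str, text: str) -> bool:
        """Record the page and report whether both URL and content are unseen."""
        fingerprint = content_fingerprint(text)
        if url in self._urls or fingerprint in self._fingerprints:
            return False
        self._urls.add(url)
        self._fingerprints.add(fingerprint)
        return True
```

An exact hash only catches copies that differ in case or whitespace. Near-duplicates with small edits would need a fuzzier comparison.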

🔍 Intelligent Depth Control

McGee goes beyond "go 3 links deep" and understands stop conditions such as:

stop_conditions = {
    "target_count": 50,           # stop after 50 matching pages
    "content_threshold": "high",  # only high-quality matches
    "max_depth": 5,               # safety limit
    "time_limit": "30min"         # don't run forever
}
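A sketch of how a crawl loop might enforce these conditions is below. `StopTracker` and `parse_duration` are hypothetical names. `content_threshold` is left out because judging quality would need the LLM and cannot be shown in a self-contained example.

```python
import re
import time
from dataclasses import dataclass, field

# Seconds per supported unit suffix (an assumed set of suffixes).
_UNITS = {"s": 1, "sec": 1, "m": 60, "min": 60, "h": 3600}


def parse_duration(value: str) -> float:
    """Convert strings such as "30min" or "2h" into seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*(s|sec|min|m|h)\s*", value)
    if not match:
        raise ValueError(f"unrecognized duration: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


@dataclass
class StopTracker:
    target_count: int
    max_depth: int
    time_limit_s: float
    started: float = field(default_factory=time.monotonic)
    matches: int = 0

    def should_stop(self) -> bool:
        """True once enough matches are collected or the time budget is spent."""
        if self.matches >= self.target_count:
            return True
        return time.monotonic() - self.started >= self.time_limit_s

    def may_descend(self, depth: int) -> bool:
        """True if links found at this depth may still be followed."""
        return depth < self.max_depth
```

A crawl loop would increment `matches` for each accepted page, check `should_stop()` before each fetch, and check `may_descend(depth)` before queueing links.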

Example Use Cases

Erowid Trip Report Corpus

python scrapes.py

You: scrape erowid DMT experience vault for entity encounters,
     get 50 reports, extract dosage, ROA, and entity descriptions

# McGee handles the rest

McKenna Transcript Collection

You: get all McKenna talks from organism.earth about language and 
     etymology, extract quotes and concepts

# Results: data/mckenna_language.md

Shader Technique Library

You: scrape shadertoy for fractal techniques, need code + descriptions

# Results: data/shadertoy_fractals.json

Project Structure

scrapes-mcgee/
├── scrapes.py              # Main chat interface, run this
├── scraper/                # Core scraping engine
│   ├── core.py            # LLM-guided scraper
│   ├── storage.py         # SQLite + export utilities
│   └── extractors.py      # Content cleaning (future)
├── mcgee/                  # McGee's brain
│   ├── agent.py           # Conversational agent
│   └── personality.py     # McGee's voice (future)
├── targets/                # Example scrape configs
│   └── examples/
│       ├── erowid_dmt_entities.yaml
│       └── mckenna_language.yaml
└── data/                   # Scraped content
    └── scraper.db         # SQLite database

Advanced: YAML Configs

For repeated scrapes, save configs:

# targets/my_scrape.yaml
target_url: "https://example.com"
max_depth: 3
max_pages: 50

selection_prompt: |
  Only follow links about [topic]

extraction_prompt: |
  Extract:
  - Field 1
  - Field 2
  
output_format: json

Load it:

You: load targets/my_scrape.yaml and run it
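Before running a saved config, it can help to validate it. The sketch below checks a parsed config mapping against the keys shown in the example above. `ScrapeConfig` and `config_from_dict` are hypothetical, and the defaults and allowed `output_format` values are assumptions based on the formats listed in this README.

```python
from dataclasses import dataclass


@dataclass
class ScrapeConfig:
    target_url: str
    selection_prompt: str
    extraction_prompt: str
    max_depth: int = 3        # assumed default
    max_pages: int = 50       # assumed default
    output_format: str = "json"

    def __post_init__(self) -> None:
        if not self.target_url.startswith(("http://", "https://")):
            raise ValueError("target_url must be an http(s) URL")
        if self.max_depth < 0 or self.max_pages < 1:
            raise ValueError("max_depth must be >= 0 and max_pages >= 1")
        if self.output_format not in {"json", "markdown", "yaml", "sqlite"}:
            raise ValueError(f"unsupported output_format: {self.output_format}")


def config_from_dict(raw: dict) -> ScrapeConfig:
    """Build a config from a parsed YAML mapping, rejecting unknown keys."""
    allowed = set(ScrapeConfig.__dataclass_fields__)
    unknown = set(raw) - allowed
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return ScrapeConfig(**raw)
```

With PyYAML installed, the mapping could come from `yaml.safe_load` on the file contents. Rejecting unknown keys catches typos such as `max_page` before a crawl starts.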

GitHub Codespaces

  1. Fork this repo
  2. Click "Code" → "Codespaces" → "Create codespace"
  3. Add your GEMINI_API_KEY to Codespaces secrets
  4. Run python scrapes.py

Roadmap

  • Core LLM-guided scraper
  • Conversational interface
  • SQLite storage + FTS
  • Multiple export formats
  • Playwright for JS-heavy sites
  • Parallel/async crawling (speed boost)
  • Web UI (chat in browser)
  • Docker deployment
  • MCP server integration
  • Hugging Face Space demo

Tech Stack

| Component | Why |
|---|---|
| Gemini 2.0 Flash | Free, huge context (1M tokens), fast |
| httpx | Async HTTP requests |
| BeautifulSoup | HTML parsing |
| SQLite + FTS5 | Storage with full-text search |
| Rich | Beautiful terminal UI |
| YAML | Human-readable configs |

Contributing

McGee is open source! PRs welcome.

Ideas:

  • New extraction strategies
  • Site-specific scrapers
  • UI improvements
  • Personality enhancements

License

MIT. Scrape responsibly, respect robots.txt, don't be a dick.
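If you want to check robots.txt in your own scripts, Python's standard library can evaluate the rules. This is a small sketch with a hypothetical helper name; it says nothing about whether or how McGee performs this check itself.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate already-downloaded robots.txt text for one URL and user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

To fetch the file instead of passing its text, `RobotFileParser` also offers `set_url()` followed by `read()`.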


Questions?

Open an issue or start a discussion. McGee doesn't bite (much).

Built by: @merrypranxter
Powered by: Gemini 2.0, chaos, and coffee

About

a useful & kinda sassy lil scrapey mcscraperson. talk to him like a normal ai & just tell him what u want scraped. and he can shove all the scrapings right up ur github for u when he's done.
