ScrapeYard

A Python-based job scraper and dashboard that collects remote software engineering job postings from multiple sources, stores them in a local database, and displays them in an interactive web UI.


Features

  • Scrapes job postings from Remotive API and We Work Remotely
  • Filters jobs by location — Canada, Remote, Worldwide, and US roles open to Canadians
  • Deduplicates jobs across runs using URL matching
  • Validates stored job URLs and marks inactive postings automatically
  • Interactive Streamlit dashboard with filters, metrics, and charts
  • Fully configurable via config.json — no hardcoded values
  • Scheduled scraping and validation run automatically

Project Structure

scrapeyard/
├── scraper.py       — Fetches, filters, deduplicates, and stores job postings
├── validate.py      — Checks stored URLs and marks inactive jobs
├── database.py      — SQLAlchemy ORM schema and database setup
├── dashboard.py     — Streamlit web UI
├── config.json      — All configurable settings
├── run.sh           — Launch script for all services
├── jobs.db          — SQLite database (auto-created on first run)
└── requirements.txt — Python dependencies

Requirements

  • Python 3.9+
  • pip

Setup

1. Clone the repository

git clone https://github.com/yourusername/scrapeyard.git
cd scrapeyard

2. Create and activate a virtual environment

python -m venv venv
source venv/bin/activate        # Mac / Linux
venv\Scripts\activate           # Windows

3. Install dependencies

pip install -r requirements.txt

4. Create the database

python database.py
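For orientation, here is a minimal sketch of the kind of SQLAlchemy schema database.py defines. The column names (`title`, `company`, `url`, `is_active`, `last_checked`) are assumptions based on the behaviour described elsewhere in this README; the real model may differ.

```python
# Hypothetical sketch of database.py; actual columns may differ.
from datetime import datetime
from sqlalchemy import create_engine, Column, Integer, String, Boolean, DateTime
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Job(Base):
    __tablename__ = "jobs"
    id = Column(Integer, primary_key=True)
    title = Column(String, nullable=False)
    company = Column(String)
    location = Column(String)
    url = Column(String, unique=True, nullable=False)  # dedup key across runs
    is_active = Column(Boolean, default=True)          # soft-delete flag
    last_checked = Column(DateTime, default=datetime.utcnow)

engine = create_engine("sqlite:///jobs.db")
Base.metadata.create_all(engine)  # safe to rerun: existing tables are skipped
Session = sessionmaker(bind=engine)
```

`create_all` only creates tables that do not already exist, which is why rerunning this step is harmless.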

Configuration

All settings live in config.json:

{
    "categories": [
        "software-dev",
        "devops-sysadmin",
        "data-science"
    ],
    "wwr_categories": [
        "https://weworkremotely.com/categories/remote-programming-jobs",
        "https://weworkremotely.com/categories/remote-devops-sysadmin-jobs"
    ],
    "limit_per_category": 50,
    "allowed_locations": [
        "canada",
        "worldwide",
        "remote",
        "usa",
        "us",
        "north america",
        ""
    ],
    "check_interval_days": 7
}
Setting              Description
categories           Remotive API job categories to scrape
wwr_categories       We Work Remotely category page URLs
limit_per_category   Max jobs fetched per Remotive category per run
allowed_locations    Keywords used to filter jobs by location
check_interval_days  How many days before a stored job URL is rechecked
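Location filtering presumably reduces to a keyword match against `allowed_locations`. A sketch of that idea follows; the config is inlined so it runs standalone, and `location_allowed` is a hypothetical helper, not necessarily the name used in scraper.py.

```python
import json

# In scraper.py this would come from json.load(open("config.json"));
# inlined here so the sketch is self-contained.
config = json.loads("""{
    "allowed_locations": ["canada", "worldwide", "remote",
                          "usa", "us", "north america", ""]
}""")

allowed = [loc.lower() for loc in config["allowed_locations"]]

def location_allowed(location: str) -> bool:
    """Match a job's location string against the configured keywords.
    A blank location passes only because "" is in the allowed list."""
    loc = (location or "").strip().lower()
    if not loc:
        return "" in allowed
    # skip the empty keyword so it doesn't match every string
    return any(k and k in loc for k in allowed)
```

Note the substring caveat: a keyword like "us" also matches strings such as "australia", so the real filter may apply stricter word-boundary logic.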

Running the Project

Launch everything at once:

chmod +x run.sh     # only needed once
./run.sh

This starts the scraper, validator, and dashboard in one command. The dashboard opens automatically at http://localhost:8501.

Or run each service manually in separate terminals:

python scraper.py                  # terminal 1
python validate.py                 # terminal 2
streamlit run dashboard.py         # terminal 3

How It Works

Scraper (scraper.py)

  • Reads categories and settings from config.json
  • Calls the Remotive API for each configured category
  • Scrapes We Work Remotely HTML pages using BeautifulSoup
  • Filters jobs by location keywords
  • Deduplicates against the database using a single URL set query
  • Saves new jobs to jobs.db
  • Runs automatically every hour via the schedule library
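The deduplication step above can be sketched as follows. An in-memory SQLite database stands in for jobs.db, and the model and function names are illustrative, not necessarily those used in scraper.py.

```python
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Job(Base):
    """Pared-down stand-in for the model in database.py."""
    __tablename__ = "jobs"
    id = Column(Integer, primary_key=True)
    title = Column(String)
    url = Column(String, unique=True)

def save_new_jobs(session, scraped):
    """Insert only jobs whose URL is not already stored."""
    known = {url for (url,) in session.query(Job.url).all()}  # one set query
    fresh = [j for j in scraped if j["url"] not in known]
    session.add_all([Job(title=j["title"], url=j["url"]) for j in fresh])
    session.commit()
    return len(fresh)

engine = create_engine("sqlite://")  # in-memory stand-in for jobs.db
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
```

Fetching every stored URL once into a Python set keeps dedup at a single round-trip per run, rather than one query per scraped job. The hourly cadence would then come from the schedule library (a `schedule.every().hour.do(...)` registration plus a `run_pending()` loop).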

Validator (validate.py)

  • Queries jobs that are active and haven't been checked within check_interval_days
  • Makes an HTTP request to each job URL with a 10-second timeout
  • Marks jobs as inactive if a 404 is returned
  • Skips jobs on network errors — does not remove jobs it cannot confirm are gone
  • Commits all changes in a single batch after processing
  • Runs automatically every 24 hours
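A condensed sketch of that validation loop, with the HTTP call injectable so it can be exercised without a network. The `Job` model here is a minimal stand-in for the real one in database.py, and the function name is an assumption.

```python
import requests
from datetime import datetime, timedelta
from sqlalchemy import create_engine, Column, Integer, String, Boolean, DateTime
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Job(Base):
    """Minimal stand-in for the real model in database.py."""
    __tablename__ = "jobs"
    id = Column(Integer, primary_key=True)
    url = Column(String, unique=True)
    is_active = Column(Boolean, default=True)
    last_checked = Column(DateTime)

def validate_jobs(session, check_interval_days=7, fetch=requests.get):
    """Recheck active jobs not validated within the interval."""
    cutoff = datetime.utcnow() - timedelta(days=check_interval_days)
    stale = (session.query(Job)
                    .filter(Job.is_active.is_(True), Job.last_checked < cutoff)
                    .all())
    for job in stale:
        try:
            resp = fetch(job.url, timeout=10)
        except requests.RequestException:
            continue                   # network error: can't confirm, skip
        if resp.status_code == 404:
            job.is_active = False      # flag as inactive, never delete
        job.last_checked = datetime.utcnow()
    session.commit()                   # single batch commit
```

Skipping on `RequestException` is the conservative choice: a timeout or DNS hiccup says nothing about whether the posting is gone, so only an explicit 404 flips the flag.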

Dashboard (dashboard.py)

  • Loads only active jobs from the database
  • Caches the query with @st.cache_data for performance
  • Provides sidebar filters: title search, category, location, company, tags
  • Displays metrics (total jobs, unique companies) that reflect current filters
  • Shows a top tags bar chart
  • Allows export of the current filtered view as CSV
  • Refresh button clears the cache and reloads from the database
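The sidebar filtering and metrics logic can be sketched in plain pandas, without the Streamlit widget layer. The column names (`title`, `category`, `company`, `is_active`) are assumptions about the schema, and the helpers are illustrative rather than the actual functions in dashboard.py.

```python
import pandas as pd

def apply_filters(df, title="", category=None, location=None):
    """Keep active jobs, then narrow by whichever filters are set."""
    out = df[df["is_active"]]
    if title:
        out = out[out["title"].str.contains(title, case=False, na=False)]
    if category:
        out = out[out["category"] == category]
    if location:
        out = out[out["location"].str.contains(location, case=False, na=False)]
    return out

def metrics(df):
    """Metrics reflect the current filtered view, not the whole table."""
    return {"total_jobs": len(df),
            "unique_companies": df["company"].nunique()}
```

In dashboard.py the database load around this would be wrapped in `@st.cache_data`, so filters rerun against the cached DataFrame instead of hitting SQLite on every widget change; the Refresh button clears that cache.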

Data Sources

Source            Type           Notes
Remotive          JSON API       Free, no auth required
We Work Remotely  HTML scraping  Static HTML, BeautifulSoup

Dependencies

Package         Purpose
requests        HTTP requests for APIs and web pages
beautifulsoup4  HTML parsing for We Work Remotely
sqlalchemy      ORM and SQLite database management
pandas          DataFrame manipulation in the dashboard
streamlit       Interactive web dashboard
schedule        Recurring job scheduling
colorama        Coloured terminal output

Notes

  • jobs.db is created automatically on first run of database.py
  • Jobs are never hard-deleted — inactive jobs are flagged with is_active = False
  • The dashboard only shows active jobs
  • Running python database.py again is safe — it skips existing tables
