ScrapeYard

A Python-based job scraper and dashboard that collects remote software engineering job postings from multiple sources, stores them in a local database, and displays them in an interactive web UI.


Features

  • Scrapes job postings from Remotive API and We Work Remotely
  • Filters jobs by location — Canada, Remote, Worldwide, and US roles open to Canadians
  • Deduplicates jobs across runs using URL matching
  • Validates stored job URLs and marks inactive postings automatically
  • Interactive Streamlit dashboard with filters, metrics, and charts
  • Fully configurable via config.json — no hardcoded values
  • Scheduled scraping and validation run automatically

Project Structure

scrapeyard/
├── scraper.py       — Fetches, filters, deduplicates, and stores job postings
├── validate.py      — Checks stored URLs and marks inactive jobs
├── database.py      — SQLAlchemy ORM schema and database setup
├── dashboard.py     — Streamlit web UI
├── config.json      — All configurable settings
├── run.sh           — Launch script for all services
├── jobs.db          — SQLite database (auto-created on first run)
└── requirements.txt — Python dependencies

Requirements

  • Python 3.9+
  • pip

Setup

1. Clone the repository

git clone https://github.com/yourusername/scrapeyard.git
cd scrapeyard

2. Create and activate a virtual environment

python -m venv venv
source venv/bin/activate        # Mac / Linux
venv\Scripts\activate           # Windows

3. Install dependencies

pip install -r requirements.txt

4. Create the database

python database.py
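For orientation, here is a minimal sketch of the kind of SQLAlchemy schema database.py defines. The column names (`title`, `company`, `url`, `is_active`, `last_checked`) are assumptions based on the behaviour described elsewhere in this README; the real model may differ.

```python
# Hypothetical sketch of database.py; actual columns may differ.
from datetime import datetime
from sqlalchemy import create_engine, Column, Integer, String, Boolean, DateTime
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Job(Base):
    __tablename__ = "jobs"
    id = Column(Integer, primary_key=True)
    title = Column(String, nullable=False)
    company = Column(String)
    location = Column(String)
    url = Column(String, unique=True, nullable=False)  # dedup key across runs
    is_active = Column(Boolean, default=True)          # soft-delete flag
    last_checked = Column(DateTime, default=datetime.utcnow)

engine = create_engine("sqlite:///jobs.db")
Base.metadata.create_all(engine)  # safe to rerun: existing tables are skipped
Session = sessionmaker(bind=engine)
```

`create_all` only creates tables that do not already exist, which is why rerunning this step is harmless.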

Configuration

All settings live in config.json:

{
    "categories": [
        "software-dev",
        "devops-sysadmin",
        "data-science"
    ],
    "wwr_categories": [
        "https://weworkremotely.com/categories/remote-programming-jobs",
        "https://weworkremotely.com/categories/remote-devops-sysadmin-jobs"
    ],
    "limit_per_category": 50,
    "allowed_locations": [
        "canada",
        "worldwide",
        "remote",
        "usa",
        "us",
        "north america",
        ""
    ],
    "check_interval_days": 7
}
Setting              Description
categories           Remotive API job categories to scrape
wwr_categories       We Work Remotely category page URLs
limit_per_category   Max jobs fetched per Remotive category per run
allowed_locations    Keywords used to filter jobs by location
check_interval_days  How many days before a stored job URL is rechecked
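Location filtering presumably reduces to a keyword match against `allowed_locations`. A sketch of that idea follows; the config is inlined so it runs standalone, and `location_allowed` is a hypothetical helper, not necessarily the name used in scraper.py.

```python
import json

# In scraper.py this would come from json.load(open("config.json"));
# inlined here so the sketch is self-contained.
config = json.loads("""{
    "allowed_locations": ["canada", "worldwide", "remote",
                          "usa", "us", "north america", ""]
}""")

allowed = [loc.lower() for loc in config["allowed_locations"]]

def location_allowed(location: str) -> bool:
    """Match a job's location string against the configured keywords.
    A blank location passes only because "" is in the allowed list."""
    loc = (location or "").strip().lower()
    if not loc:
        return "" in allowed
    # skip the empty keyword so it doesn't match every string
    return any(k and k in loc for k in allowed)
```

Note the substring caveat: a keyword like "us" also matches strings such as "australia", so the real filter may apply stricter word-boundary logic.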

Running the Project

Launch everything at once:

chmod +x run.sh     # only needed once
./run.sh

This starts the scraper, validator, and dashboard in one command. The dashboard opens automatically at http://localhost:8501.

Or run each service manually in separate terminals:

python scraper.py                  # terminal 1
python validate.py                 # terminal 2
streamlit run dashboard.py         # terminal 3

How It Works

Scraper (scraper.py)

  • Reads categories and settings from config.json
  • Calls the Remotive API for each configured category
  • Scrapes We Work Remotely HTML pages using BeautifulSoup
  • Filters jobs by location keywords
  • Deduplicates against the database using a single URL set query
  • Saves new jobs to jobs.db
  • Runs automatically every hour via the schedule library
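The deduplication step above can be sketched as follows. An in-memory SQLite database stands in for jobs.db, and the model and function names are illustrative, not necessarily those used in scraper.py.

```python
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Job(Base):
    """Pared-down stand-in for the model in database.py."""
    __tablename__ = "jobs"
    id = Column(Integer, primary_key=True)
    title = Column(String)
    url = Column(String, unique=True)

def save_new_jobs(session, scraped):
    """Insert only jobs whose URL is not already stored."""
    known = {url for (url,) in session.query(Job.url).all()}  # one set query
    fresh = [j for j in scraped if j["url"] not in known]
    session.add_all([Job(title=j["title"], url=j["url"]) for j in fresh])
    session.commit()
    return len(fresh)

engine = create_engine("sqlite://")  # in-memory stand-in for jobs.db
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
```

Fetching every stored URL once into a Python set keeps dedup at a single round-trip per run, rather than one query per scraped job. The hourly cadence would then come from the schedule library (a `schedule.every().hour.do(...)` registration plus a `run_pending()` loop).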

Validator (validate.py)

  • Queries jobs that are active and haven't been checked within check_interval_days
  • Makes an HTTP request to each job URL with a 10-second timeout
  • Marks jobs as inactive if a 404 is returned
  • Skips jobs on network errors — does not remove jobs it cannot confirm are gone
  • Commits all changes in a single batch after processing
  • Runs automatically every 24 hours
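A condensed sketch of that validation loop, with the HTTP call injectable so it can be exercised without a network. The `Job` model here is a minimal stand-in for the real one in database.py, and the function name is an assumption.

```python
import requests
from datetime import datetime, timedelta
from sqlalchemy import create_engine, Column, Integer, String, Boolean, DateTime
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Job(Base):
    """Minimal stand-in for the real model in database.py."""
    __tablename__ = "jobs"
    id = Column(Integer, primary_key=True)
    url = Column(String, unique=True)
    is_active = Column(Boolean, default=True)
    last_checked = Column(DateTime)

def validate_jobs(session, check_interval_days=7, fetch=requests.get):
    """Recheck active jobs not validated within the interval."""
    cutoff = datetime.utcnow() - timedelta(days=check_interval_days)
    stale = (session.query(Job)
                    .filter(Job.is_active.is_(True), Job.last_checked < cutoff)
                    .all())
    for job in stale:
        try:
            resp = fetch(job.url, timeout=10)
        except requests.RequestException:
            continue                   # network error: can't confirm, skip
        if resp.status_code == 404:
            job.is_active = False      # flag as inactive, never delete
        job.last_checked = datetime.utcnow()
    session.commit()                   # single batch commit
```

Skipping on `RequestException` is the conservative choice: a timeout or DNS hiccup says nothing about whether the posting is gone, so only an explicit 404 flips the flag.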

Dashboard (dashboard.py)

  • Loads only active jobs from the database
  • Caches the query with @st.cache_data for performance
  • Provides sidebar filters: title search, category, location, company, tags
  • Displays metrics (total jobs, unique companies) that reflect current filters
  • Shows a top tags bar chart
  • Allows export of the current filtered view as CSV
  • Refresh button clears the cache and reloads from the database
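The sidebar filtering and metrics logic can be sketched in plain pandas, without the Streamlit widget layer. The column names (`title`, `category`, `company`, `is_active`) are assumptions about the schema, and the helpers are illustrative rather than the actual functions in dashboard.py.

```python
import pandas as pd

def apply_filters(df, title="", category=None, location=None):
    """Keep active jobs, then narrow by whichever filters are set."""
    out = df[df["is_active"]]
    if title:
        out = out[out["title"].str.contains(title, case=False, na=False)]
    if category:
        out = out[out["category"] == category]
    if location:
        out = out[out["location"].str.contains(location, case=False, na=False)]
    return out

def metrics(df):
    """Metrics reflect the current filtered view, not the whole table."""
    return {"total_jobs": len(df),
            "unique_companies": df["company"].nunique()}
```

In dashboard.py the database load around this would be wrapped in `@st.cache_data`, so filters rerun against the cached DataFrame instead of hitting SQLite on every widget change; the Refresh button clears that cache.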

Data Sources

Source            Type           Notes
Remotive          JSON API       Free, no auth required
We Work Remotely  HTML scraping  Static HTML, BeautifulSoup

Dependencies

Package         Purpose
requests        HTTP requests for APIs and web pages
beautifulsoup4  HTML parsing for We Work Remotely
sqlalchemy      ORM and SQLite database management
pandas          DataFrame manipulation in the dashboard
streamlit       Interactive web dashboard
schedule        Recurring job scheduling
colorama        Coloured terminal output

Notes

  • jobs.db is created automatically on first run of database.py
  • Jobs are never hard-deleted — inactive jobs are flagged with is_active = False
  • The dashboard only shows active jobs
  • Running python database.py again is safe — it skips existing tables
