A Python-based job scraper and dashboard that collects remote software engineering job postings from multiple sources, stores them in a local database, and displays them in an interactive web UI.
- Scrapes job postings from Remotive API and We Work Remotely
- Filters jobs by location — Canada, Remote, Worldwide, and US roles open to Canadians
- Deduplicates jobs across runs using URL matching
- Validates stored job URLs and marks inactive postings automatically
- Interactive Streamlit dashboard with filters, metrics, and charts
- Fully configurable via `config.json` — no hardcoded values
- Scheduled scraping and validation run automatically
```
scrapeyard/
├── scraper.py        — Fetches, filters, deduplicates, and stores job postings
├── validate.py       — Checks stored URLs and marks inactive jobs
├── database.py       — SQLAlchemy ORM schema and database setup
├── dashboard.py      — Streamlit web UI
├── config.json       — All configurable settings
├── run.sh            — Launch script for all services
├── jobs.db           — SQLite database (auto-created on first run)
└── requirements.txt  — Python dependencies
```
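The ORM schema in `database.py` is not shown here; the sketch below is a plausible reconstruction, with column names assumed from the behaviour described in this README (URL-based dedup, an `is_active` flag, recheck timestamps) rather than taken from the actual source:

```python
# Hypothetical sketch of the Job model in database.py -- column names are assumptions.
from sqlalchemy import Boolean, Column, DateTime, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Job(Base):
    __tablename__ = "jobs"

    id = Column(Integer, primary_key=True)
    title = Column(String, nullable=False)
    company = Column(String)
    location = Column(String)
    url = Column(String, unique=True)          # used for deduplication across runs
    is_active = Column(Boolean, default=True)  # flipped to False by validate.py
    last_checked = Column(DateTime)            # drives the recheck interval

# create_all() skips tables that already exist, so rerunning database.py is safe
engine = create_engine("sqlite:///jobs.db")
Base.metadata.create_all(engine)
```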
- Python 3.9+
- pip
1. Clone the repository

```shell
git clone https://github.com/yourusername/scrapeyard.git
cd scrapeyard
```

2. Create and activate a virtual environment

```shell
python -m venv venv
source venv/bin/activate   # Mac / Linux
venv\Scripts\activate      # Windows
```

3. Install dependencies

```shell
pip install -r requirements.txt
```

4. Create the database

```shell
python database.py
```

All settings live in `config.json`:
```json
{
  "categories": [
    "software-dev",
    "devops-sysadmin",
    "data-science"
  ],
  "wwr_categories": [
    "https://weworkremotely.com/categories/remote-programming-jobs",
    "https://weworkremotely.com/categories/remote-devops-sysadmin-jobs"
  ],
  "limit_per_category": 50,
  "allowed_locations": [
    "canada",
    "worldwide",
    "remote",
    "usa",
    "us",
    "north america",
    ""
  ],
  "check_interval_days": 7
}
```

| Setting | Description |
|---|---|
| `categories` | Remotive API job categories to scrape |
| `wwr_categories` | We Work Remotely category page URLs |
| `limit_per_category` | Max jobs fetched per Remotive category per run |
| `allowed_locations` | Keywords used to filter jobs by location |
| `check_interval_days` | How many days before a stored job URL is rechecked |
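These settings can be read once at startup. `load_config` below is a hypothetical helper, not the project's actual code; its defaults mirror the values shown above:

```python
import json

def load_config(path="config.json"):
    """Read settings once at startup; fall back to defaults for missing keys."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    # Defaults mirror the sample config above
    cfg.setdefault("limit_per_category", 50)
    cfg.setdefault("check_interval_days", 7)
    cfg.setdefault("allowed_locations", ["canada", "worldwide", "remote"])
    return cfg
```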
Launch everything at once:

```shell
chmod +x run.sh   # only needed once
./run.sh
```

This starts the scraper, validator, and dashboard in one command. The dashboard opens automatically at http://localhost:8501.
Or run each service manually in separate terminals:
```shell
python scraper.py            # terminal 1
python validate.py           # terminal 2
streamlit run dashboard.py   # terminal 3
```

- Reads categories and settings from `config.json`
- Calls the Remotive API for each configured category
- Scrapes We Work Remotely HTML pages using BeautifulSoup
- Filters jobs by location keywords
- Deduplicates against the database using a single URL set query
- Saves new jobs to `jobs.db`
- Runs automatically every hour via the `schedule` library
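The single-query dedup step above can be sketched as: pull every stored URL into a set once, then filter incoming postings in memory. `dedupe_jobs` is a hypothetical helper, not the project's actual function:

```python
def dedupe_jobs(new_jobs, existing_urls):
    """Keep only jobs whose URL is not already stored.

    existing_urls is a set built from one database query, e.g.
    {url for (url,) in session.query(Job.url).all()} -- one round trip,
    then every membership check is an O(1) set lookup.
    """
    return [job for job in new_jobs if job["url"] not in existing_urls]

scraped = [
    {"title": "Backend Dev", "url": "https://example.com/a"},
    {"title": "SRE", "url": "https://example.com/b"},
]
stored = {"https://example.com/a"}
fresh = dedupe_jobs(scraped, stored)  # only the SRE posting survives
```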
- Queries jobs that are active and haven't been checked within `check_interval_days`
- Makes an HTTP request to each job URL with a 10-second timeout
- Marks jobs as inactive if a 404 is returned
- Marks jobs as inactive if a 404 is returned
- Skips jobs on network errors — does not remove jobs it cannot confirm are gone
- Commits all changes in a single batch after processing
- Runs automatically every 24 hours
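The validator's keep-unless-confirmed-gone rule can be sketched as follows. `check_job_url` is a hypothetical name, and the choice of a HEAD request is an assumption (the README only says an HTTP request is made):

```python
import requests

def check_job_url(url, timeout=10):
    """Return False only when the posting is confirmed gone (HTTP 404).

    Network errors return True: a job is never deactivated unless we can
    positively confirm it no longer exists.
    """
    try:
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        return resp.status_code != 404
    except requests.RequestException:
        return True  # can't confirm it's gone -- keep it active
```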
- Loads only active jobs from the database
- Caches the query with `@st.cache_data` for performance
- Provides sidebar filters: title search, category, location, company, tags
- Displays metrics (total jobs, unique companies) that reflect current filters
- Shows a top tags bar chart
- Allows export of the current filtered view as CSV
- Refresh button clears the cache and reloads from the database
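The filter and metrics logic can be sketched with plain pandas, independent of the Streamlit widgets. `apply_filters` and `metrics` are hypothetical names; in the real dashboard these would be wired to the sidebar inputs:

```python
import pandas as pd

def apply_filters(jobs, title="", category=None, company=None):
    """Narrow the active-jobs DataFrame the way the sidebar widgets would."""
    df = jobs
    if title:  # case-insensitive substring match on the title
        df = df[df["title"].str.contains(title, case=False, na=False)]
    if category:
        df = df[df["category"] == category]
    if company:
        df = df[df["company"] == company]
    return df

def metrics(df):
    """Headline numbers for the current filtered view."""
    return {"total_jobs": len(df), "unique_companies": df["company"].nunique()}
```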
| Source | Type | Notes |
|---|---|---|
| Remotive | JSON API | Free, no auth required |
| We Work Remotely | HTML scraping | Static HTML, BeautifulSoup |
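A minimal Remotive fetch might look like the sketch below. The endpoint is Remotive's public API; the exact response field names (`company_name`, `candidate_required_location`) are assumptions to verify against a live response:

```python
import requests

API = "https://remotive.com/api/remote-jobs"  # public endpoint, no auth required

def fetch_remotive(category, limit=50):
    """Fetch one category from the Remotive JSON API and normalise each posting."""
    resp = requests.get(API, params={"category": category, "limit": limit}, timeout=10)
    resp.raise_for_status()
    return [
        {
            "title": j["title"],
            "company": j["company_name"],
            "location": j["candidate_required_location"],
            "url": j["url"],
        }
        for j in resp.json().get("jobs", [])
    ]
```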
| Package | Purpose |
|---|---|
| `requests` | HTTP requests for APIs and web pages |
| `beautifulsoup4` | HTML parsing for We Work Remotely |
| `sqlalchemy` | ORM and SQLite database management |
| `pandas` | DataFrame manipulation in the dashboard |
| `streamlit` | Interactive web dashboard |
| `schedule` | Recurring job scheduling |
| `colorama` | Coloured terminal output |
- `jobs.db` is created automatically on first run of `database.py`
- Jobs are never hard-deleted — inactive jobs are flagged with `is_active = False`
- The dashboard only shows active jobs
- Running `python database.py` again is safe — it skips existing tables