
🚀 Jobzy — Automated Job Aggregator

Jobzy is a full-stack job aggregation platform that automatically scrapes job listings from multiple Indian and global job portals — LinkedIn, Naukri, Foundit (Monster India), and Indeed — on a configurable schedule, deduplicates them, and surfaces them in a clean Next.js dashboard.

You define Preferences (search criteria + platforms + polling interval), and Jobzy keeps your job feed fresh in the background — automatically.


✨ Features

  • 🔍 Multi-platform scraping — LinkedIn, Naukri, Foundit, Indeed scraped via Playwright + stealth mode
  • ⚙️ Preference-driven scheduling — set keywords, location, experience range, job type, and poll interval per preference
  • 📦 BullMQ queues — one queue per platform; repeatable background jobs run on your schedule
  • 🧠 Smart deduplication — Redis-backed seen-job cache prevents duplicate DB inserts
  • 🕐 24-hour freshness cap — stale jobs are filtered out regardless of datePosted preference
  • Applied-jobs tracking — mark jobs as applied; they disappear from the main feed automatically
  • 🗑️ Ignore jobs — permanently remove unwanted listings from your feed instantly
  • 📊 Stats API — aggregated job counts by platform, total preferences count
  • ♻️ Self-healing on restart — server re-enqueues any preferences that lost their queue entries
  • 🌐 Next.js frontend — job feed, applied-jobs list, preference management dashboard
  • 🐳 Docker Compose — one command to spin up API, workers, and frontend together

🏗️ Architecture

┌──────────────────────────────────────────────────────┐
│                    Docker Compose                    │
│                                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────┐  │
│  │  Next.js     │  │  Express API │  │  BullMQ   │  │
│  │  Frontend    │→ │  (port 3000) │  │  Workers  │  │
│  │  (port 3001) │  └──────┬───────┘  └─────┬─────┘  │
│  └──────────────┘         │                │        │
└──────────────────────────────────────────────────────┘
                            │                │
              ┌─────────────┴──┐     ┌───────┴──────┐
              │  MongoDB Atlas │     │  Redis Cloud │
              │  (Jobs, Prefs, │     │  (Queues +   │
              │   AppliedJobs) │     │   Dedup cache│
              └────────────────┘     └──────────────┘

Components

| Component     | Tech                    | Purpose                                       |
| ------------- | ----------------------- | --------------------------------------------- |
| API Server    | Express.js              | REST API for preferences, jobs, stats, health |
| Workers       | BullMQ + Node.js        | Background scraping jobs per platform         |
| Scrapers      | Playwright + stealth    | Browser automation for each job portal        |
| Database      | MongoDB Atlas           | Stores jobs, preferences, applied jobs        |
| Queue / Cache | Redis Cloud             | BullMQ job queues + seen-job dedup            |
| Frontend      | Next.js 14 (App Router) | Dashboard UI                                  |

📁 Project Structure

jobzy/
├── src/
│   ├── server.js               # Express app entry point
│   ├── browser.js              # Playwright browser factory (stealth mode)
│   ├── db/
│   │   ├── mongoose.js         # MongoDB connection
│   │   └── models/
│   │       ├── Preference.js   # Search preference schema
│   │       ├── Job.js          # Scraped job schema
│   │       └── AppliedJob.js   # Applied-job snapshot schema
│   ├── routes/
│   │   ├── jobs.js             # LinkedIn scraper route
│   │   ├── naukri.js           # Naukri scraper route
│   │   ├── foundit.js          # Foundit scraper route
│   │   ├── indeed.js           # Indeed scraper route
│   │   └── preferences.js      # Preference CRUD + jobs-list + applied-jobs
│   ├── scrapers/
│   │   ├── linkedin.js         # LinkedIn scraper
│   │   ├── naukri.js           # Naukri scraper
│   │   ├── foundit.js          # Foundit scraper
│   │   └── indeed.js           # Indeed scraper
│   ├── workers/
│   │   ├── index.js            # Worker process entry point (starts all 4 workers)
│   │   ├── linkedin.worker.js
│   │   ├── naukri.worker.js
│   │   ├── foundit.worker.js
│   │   ├── indeed.worker.js
│   │   └── extractors.js       # Platform-specific job ID extractors
│   ├── queues/
│   │   └── index.js            # BullMQ queue definitions & helpers
│   ├── dedup/
│   │   └── redis.js            # Redis client + isJobSeen / markJobSeen helpers
│   └── utils/
│       └── jobFreshness.js     # 24h freshness filter
├── web/                        # Next.js frontend
│   ├── app/
│   │   ├── page.tsx            # Main jobs dashboard
│   │   ├── layout.tsx
│   │   └── jobs/               # Job detail pages
│   ├── components/             # Reusable UI components
│   ├── lib/                    # API helper utilities
│   └── next.config.ts
├── scripts/
│   └── reset.js                # CLI tool to clear queue / DB data
├── Dockerfile                  # API + Worker image
├── docker-compose.yml
├── .env.example
└── package.json

⚙️ Prerequisites

  • Node.js v18+ and npm
  • MongoDB Atlas free cluster (atlas.mongodb.com)
  • Redis Cloud free subscription (redis.io)
  • Docker (optional, for containerised deployment)

🧩 How It Works — End to End

  1. Create a Preference via the dashboard or API (POST /api/preferences).
  2. The API saves it to MongoDB and calls enqueuePreference(), which adds a repeatable BullMQ job to each selected platform's queue with repeat: { every: repeatEvery, immediately: true }.
  3. The Worker process (running separately) picks up jobs from each queue.
  4. Each worker calls the corresponding Playwright scraper, which launches a headless Chromium browser in stealth mode to scrape results.
  5. For each scraped job:
    • A unique platformJobId is extracted from the URL.
    • The Redis seen-job cache (isJobSeen) is checked — duplicates are skipped.
    • A 24-hour freshness check filters out old jobs.
    • Fresh, unseen jobs are inserted into MongoDB and marked in Redis (markJobSeen).
  6. The frontend polls /api/jobs-list and displays the aggregated, deduplicated feed.
  7. Clicking Apply calls POST /api/preferences/apply-job/:id, which snapshots the job to AppliedJob and removes it from the main feed.
  8. Clicking Ignore calls DELETE /api/preferences/job/:id, which hard-deletes the record.
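
The per-job handling in step 5 can be sketched as follows. This is a minimal, runnable illustration: the Redis cache and MongoDB insert are replaced with in-memory stand-ins, and only the helper names `isJobSeen` / `markJobSeen` and the 24-hour window come from the repository — the rest is assumed for the sketch.

```javascript
const seen = new Set();                    // stands in for the Redis seen-job cache
const db = [];                             // stands in for the MongoDB Job collection
const FRESHNESS_MS = 24 * 60 * 60 * 1000;  // 24-hour freshness cap

function isJobSeen(platform, platformJobId) {
  return seen.has(`${platform}:${platformJobId}`);
}

function markJobSeen(platform, platformJobId) {
  seen.add(`${platform}:${platformJobId}`);
}

function isFresh(postedAt, now = Date.now()) {
  return now - new Date(postedAt).getTime() <= FRESHNESS_MS;
}

function processScrapedJob(job) {
  if (isJobSeen(job.platform, job.platformJobId)) return 'duplicate';
  if (!isFresh(job.postedAt)) return 'stale';
  db.push(job);                                      // insert fresh, unseen job
  markJobSeen(job.platform, job.platformJobId);      // so later polls skip it
  return 'inserted';
}
```

The ordering matters: the seen-check runs before the freshness check, so a duplicate is skipped without re-parsing its date.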

🛠️ Local Setup

1. Clone & Install

git clone https://github.com/your-username/jobzy.git
cd jobzy
npm install

2. Install Playwright Browser

npm run install:browsers
# Equivalent to: npx playwright install chromium

3. Configure Environment

cp .env.example .env

Edit .env:

PORT=3000
HEADLESS=true

# Redis Cloud — use rediss:// (double-s) for TLS
REDIS_URL=rediss://default:<password>@<host>:<port>

# MongoDB Atlas SRV connection string
MONGODB_URI=mongodb+srv://<username>:<password>@cluster0.xxxxx.mongodb.net/jobzy?retryWrites=true&w=majority

# Next.js — API base URL seen by the browser
NEXT_PUBLIC_API_URL=http://localhost:3000

4. Start the API Server

npm run dev
# or for production:
npm start

5. Start the Background Workers (separate terminal)

npm run dev:worker
# or for production:
npm run worker

6. Start the Frontend

cd web
npm install
npm run dev     # Runs on http://localhost:3001

🐳 Docker Compose (Recommended)

Spins up the API, worker, and Next.js frontend in one command:

# Build and start all services
docker compose up --build

# Run in the background
docker compose up -d --build

| Service | Port | Description                               |
| ------- | ---- | ----------------------------------------- |
| api     | 3000 | Express REST API                          |
| worker  |      | BullMQ background workers (no HTTP port)  |
| web     | 3001 | Next.js frontend                          |

Note: Redis and MongoDB run on the cloud. Only REDIS_URL and MONGODB_URI in .env are needed — no local Redis/Mongo containers required.


📡 API Reference

Health Check

GET /health

Returns API status and a summary of all available endpoints.


Preferences

Preferences are the core concept — each defines what to search, where, and how often.

| Method | Endpoint                    | Description                                 |
| ------ | --------------------------- | ------------------------------------------- |
| GET    | /api/preferences            | List all preferences with live queue status |
| POST   | /api/preferences            | Create a new preference                     |
| DELETE | /api/preferences/:id        | Delete preference and remove from queue     |
| POST   | /api/preferences/:id/start  | Resume scraping for a preference            |
| POST   | /api/preferences/:id/pause  | Pause scraping (keeps DB record)            |
| GET    | /api/preferences/:id/status | Get live queue status (active \| paused)    |
| GET    | /api/preferences/stats      | Job counts per platform + total preferences |

Create Preference — Request Body

{
  "name": "Senior React Developer",
  "filters": {
    "keywords": "React Node.js",
    "location": "Bangalore",
    "experience": "2-5",
    "experienceMin": 2,
    "experienceMax": 5,
    "datePosted": 60,
    "jobType": "fulltime"
  },
  "platforms": ["naukri", "foundit", "linkedin", "indeed"],
  "repeatEvery": 600000,
  "startNow": true
}

| Field                 | Type    | Description                                                  |
| --------------------- | ------- | ------------------------------------------------------------ |
| name                  | string  | Required. Human-readable label                               |
| filters.keywords      | string  | Job title / skills to search                                 |
| filters.location      | string  | City or region                                               |
| filters.experience    | string  | Experience string for LinkedIn/Naukri (e.g. "2-5")           |
| filters.experienceMin | number  | Min years of experience (Foundit)                            |
| filters.experienceMax | number  | Max years of experience (Foundit)                            |
| filters.datePosted    | number  | Age filter in minutes (e.g. 60 = last hour, 1440 = last day) |
| filters.jobType       | string  | Job type filter                                              |
| platforms             | array   | Platforms to scrape. Defaults to all four                    |
| repeatEvery           | number  | Poll interval in ms. Clamped to 3–60 minutes                 |
| startNow              | boolean | Auto-start on creation. Defaults to true                     |
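The repeatEvery clamp described above can be sketched as a one-liner. The 3–60 minute bounds come from the field description; the function and constant names here are hypothetical, not the repository's actual identifiers.

```javascript
// Clamp a requested poll interval (ms) to the allowed 3–60 minute window.
const MIN_REPEAT_MS = 3 * 60 * 1000;   // 3 minutes
const MAX_REPEAT_MS = 60 * 60 * 1000;  // 60 minutes

function clampRepeatEvery(ms) {
  return Math.min(MAX_REPEAT_MS, Math.max(MIN_REPEAT_MS, ms));
}

// The 600000 (10-minute) interval from the example request body
// already sits inside the window, so it passes through unchanged.
```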

Jobs Feed

| Method | Endpoint                          | Description                                |
| ------ | --------------------------------- | ------------------------------------------ |
| GET    | /api/jobs-list                    | Paginated job feed (excludes applied jobs) |
| POST   | /api/preferences/apply-job/:jobId | Mark a job as applied                      |
| GET    | /api/preferences/applied-jobs     | List all applied jobs                      |
| DELETE | /api/preferences/job/:jobId       | Permanently delete/ignore a job            |

Jobs List Query Parameters

| Param    | Type   | Description                                 |
| -------- | ------ | ------------------------------------------- |
| platform | string | Filter by platform (linkedin, naukri, etc.) |
| prefId   | string | Filter by preference ID                     |
| keyword  | string | Case-insensitive title search               |
| page     | number | Page number (default: 1)                    |
| limit    | number | Results per page (default: 50)              |
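A jobs-list request with these parameters might be assembled like this. `URLSearchParams` is standard in Node 18+; the base URL mirrors the NEXT_PUBLIC_API_URL value from the setup section, and the parameter values are just examples.

```javascript
// Build a /api/jobs-list URL from the documented query parameters.
const params = new URLSearchParams({
  platform: 'linkedin',
  keyword: 'react',
  page: '1',
  limit: '50',
});

const url = `http://localhost:3000/api/jobs-list?${params}`;
// → http://localhost:3000/api/jobs-list?platform=linkedin&keyword=react&page=1&limit=50
```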

On-Demand Scraper Endpoints

Trigger a one-off scrape without creating a preference (useful for testing):

| Endpoint                | Platform | Key Query Params                                                      |
| ----------------------- | -------- | --------------------------------------------------------------------- |
| GET /api/jobs/search    | LinkedIn | keywords, location, datePosted, jobType                               |
| GET /api/naukri/search  | Naukri   | keywords, location, experience, datePosted                            |
| GET /api/foundit/search | Foundit  | keywords, location, experienceMin, experienceMax, datePosted, jobType |
| GET /api/indeed/search  | Indeed   | keywords, location, datePosted                                        |

🗄️ Data Models

Preference

{
  name: String,            // e.g. "Frontend Roles - Bangalore"
  filters: {
    keywords:      String, // Search query
    location:      String,
    experience:    String, // LinkedIn / Naukri format
    experienceMin: Number, // Foundit min years
    experienceMax: Number, // Foundit max years
    datePosted:    Number, // Minutes (60, 1440, etc.)
    jobType:       String,
  },
  platforms: [String],    // ['naukri', 'foundit', 'linkedin', 'indeed']
  repeatEvery: Number,    // Interval in ms (3min–60min)
  createdAt: Date,
}

Job

{
  platform: String,       // 'naukri' | 'foundit' | 'linkedin' | 'indeed'
  platformJobId: String,  // Unique job ID from source platform
  preferenceIds: [String],
  title: String,
  company: String,
  location: String,
  experience: String,
  salary: String,
  postedAt: String,       // ISO date string from source
  skills: [String],
  url: String,
  easyApply: Boolean,
  fetchedAt: Date,
}
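
The platformJobId field above is derived from each listing URL by extractors.js. A sketch for a LinkedIn-style `/jobs/view/<id>` URL is below; the URL shape and function name are illustrative assumptions, not the repository's actual extractor code.

```javascript
// Illustrative extractor: pull the numeric job ID from a
// LinkedIn-style listing URL such as
//   https://www.linkedin.com/jobs/view/3954210987/
// The real extractors.js may use different patterns per platform.
function extractLinkedInJobId(jobUrl) {
  const match = new URL(jobUrl).pathname.match(/\/jobs\/view\/(\d+)/);
  return match ? match[1] : null;
}
```

Returning null for non-matching URLs lets a worker skip listings it cannot identify instead of inserting them without a dedup key.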

AppliedJob

Snapshot of a job at the time it was marked applied. Fields mirror Job. Includes appliedAt: Date.
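
The apply-job snapshot from step 7 of the end-to-end flow can be sketched as plain-object logic. The real route works against the Mongoose Job and AppliedJob models; this in-memory version only illustrates the shape of the operation.

```javascript
// Snapshot a job into an AppliedJob-shaped record and drop it from
// the live feed (in-memory illustration of POST /apply-job/:id).
function applyJob(feed, jobId) {
  const idx = feed.findIndex((j) => j._id === jobId);
  if (idx === -1) return null;                 // unknown job ID
  const [job] = feed.splice(idx, 1);           // remove from main feed
  return { ...job, appliedAt: new Date() };    // AppliedJob mirrors Job + appliedAt
}
```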


🔧 NPM Scripts

| Script           | Command                         | Description                           |
| ---------------- | ------------------------------- | ------------------------------------- |
| start            | node src/server.js              | Start API server (production)         |
| dev              | nodemon src/server.js           | Start API server with auto-reload     |
| worker           | node src/workers/index.js       | Start all BullMQ workers (production) |
| dev:worker       | nodemon src/workers/index.js    | Start workers with auto-reload        |
| reset            | node scripts/reset.js           | Clear queue jobs only                 |
| reset:all        | node scripts/reset.js --all     | Clear queue jobs + MongoDB data       |
| install:browsers | npx playwright install chromium | Install Playwright Chromium           |

🔄 Self-Healing Queue

On every server startup, Jobzy checks all preferences in MongoDB and automatically re-enqueues any whose BullMQ repeatable jobs are missing (e.g. after a Redis flush or container restart). This makes the system resilient to infrastructure restarts with zero manual intervention.
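
The startup check reduces to a set difference. In the sketch below, BullMQ's repeatable-job listing is replaced by a plain Set of preference IDs that still have queue entries; the function name is hypothetical.

```javascript
// Given all preferences from MongoDB and the preference IDs that
// currently have repeatable jobs in BullMQ, return the ones that
// need to be re-enqueued after a Redis flush or container restart.
function findMissingPreferences(preferences, queuedIds) {
  return preferences.filter((p) => !queuedIds.has(p._id));
}

// On startup the server would call enqueuePreference() for each
// preference this returns.
```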


🌍 Environment Variables Reference

| Variable            | Required  | Description                                                                                    |
| ------------------- | --------- | ---------------------------------------------------------------------------------------------- |
| PORT                | No        | API server port (default: 3000)                                                                |
| HEADLESS            | No        | Set to true for headless Playwright (recommended in production)                                |
| REDIS_URL           | Yes       | Redis Cloud connection string (rediss://...)                                                   |
| MONGODB_URI         | Yes       | MongoDB Atlas connection string                                                                |
| NEXT_PUBLIC_API_URL | Yes (web) | API base URL visible to the browser (http://localhost:3000 locally, http://api:3000 in Docker) |

🚧 Limitations & Known Behaviour

  • Rate limiting: Scrapers use stealth mode but may get CAPTCHA challenges on aggressive poll intervals. Keep repeatEvery at 15+ minutes for production use.
  • LinkedIn: Requires a publicly accessible listing URL; sign-in-gated jobs are not scraped.
  • Headless mode: Set HEADLESS=false locally to visually debug scrapers.
  • Dedup TTL: Redis seen-job keys should be given a TTL in production; currently they persist indefinitely.
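
The dedup-TTL fix from the last point can be illustrated in memory. In production this would be a Redis SET with an expiry option instead of a Map; the 48-hour TTL below is an assumption, chosen only because it must exceed the 24-hour freshness window so a still-fresh job can never be re-inserted after its key expires.

```javascript
// In-memory illustration of giving seen-job keys a TTL.
const seen = new Map();                  // key -> expiry timestamp (ms)
const SEEN_TTL_MS = 48 * 60 * 60 * 1000; // assumption: 48h > 24h freshness cap

function markJobSeen(key, now = Date.now()) {
  seen.set(key, now + SEEN_TTL_MS);
}

function isJobSeen(key, now = Date.now()) {
  const expiry = seen.get(key);
  if (expiry === undefined) return false;
  if (now > expiry) {                    // lazily evict expired keys
    seen.delete(key);
    return false;
  }
  return true;
}
```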

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/my-feature
  3. Commit your changes: git commit -m 'feat: add my feature'
  4. Push to the branch: git push origin feature/my-feature
  5. Open a Pull Request

📄 License

MIT © 2025 Jobzy
