Research implementation of a focused web crawler that combines adaptive Q-learning, LinUCB contextual bandits, and graph neural network embeddings for topic-directed crawl decisions.
The repository is organized for reproducible experimentation: seeded datasets, train/validation/test labels, model checkpoints, evaluation scripts, phase reports, and benchmark outputs are included under versioned project folders.
Research paper: docs/paper/adaptive_qlearning_web_crawler.pdf
| Field | Details |
|---|---|
| Title | Adaptive Q-Learning Web Crawler |
| Authors | Vasant Kumar Mogia, Sujal Jariha |
| Paper | docs/paper/adaptive_qlearning_web_crawler.pdf |
| Citation metadata | CITATION.cff |
Focused web crawling tries to maximize the number of topic-relevant pages discovered under a limited crawl budget. Static policies such as random crawling, best-first heuristics, or PageRank-based traversal can waste requests when topical relevance depends on page context, link anchor text, and graph structure.
This project formulates crawling as a sequential decision problem:
- The crawler observes the current page, graph neighborhood, remaining budget, and historical rewards.
- A Q-learning policy decides whether to continue or stop.
- A contextual bandit ranks candidate outgoing links when continuing.
- A frozen GraphSAGE encoder supplies structural page embeddings from the bootstrap web graph.
The research question is whether an adaptive policy can improve harvest rate and average reward over non-adaptive baselines while staying practical on CPU-only student hardware.
The Q-learning state is a 69-dimensional vector:
| Component | Dimensions | Description |
|---|---|---|
| GNN embedding | 64 | Frozen GraphSAGE representation for the current URL |
| Budget remaining | 1 | Normalized remaining page budget |
| Relevant pages found | 1 | Normalized cumulative relevant discoveries |
| Current depth | 1 | Normalized crawl depth |
| Average reward | 1 | Clipped running reward signal |
| Exploration rate | 1 | Current epsilon value |
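As an illustration of the layout above, the following sketch assembles the 69-dimensional state with NumPy. The function name `build_state`, its argument names, and the normalization constants (`max_pages`, `max_depth`, and the reward scale of 10) are assumptions made for this example; the repository's own code may normalize differently.

```python
import numpy as np

def build_state(gnn_embedding, budget_left, max_pages, relevant_found,
                depth, max_depth, avg_reward, epsilon):
    """Append the five scalar features to the 64-d GNN embedding."""
    emb = np.asarray(gnn_embedding, dtype=np.float32)
    if emb.shape != (64,):
        raise ValueError(f"expected a 64-d embedding, got shape {emb.shape}")
    scalars = np.array([
        budget_left / max_pages,                 # normalized remaining budget
        relevant_found / max_pages,              # normalized relevant discoveries
        min(depth, max_depth) / max_depth,       # normalized crawl depth
        np.clip(avg_reward / 10.0, -1.0, 1.0),   # clipped running reward (scale assumed)
        epsilon,                                 # current exploration rate
    ], dtype=np.float32)
    return np.concatenate([emb, scalars])

state = build_state(np.zeros(64), budget_left=8, max_pages=10,
                    relevant_found=1, depth=2, max_depth=10,
                    avg_reward=4.0, epsilon=0.2)
print(state.shape)  # (69,)
```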
Candidate link contexts use 174 dimensions:
| Component | Dimensions | Description |
|---|---|---|
| GNN embedding | 64 | Structural embedding |
| URL features | 20 | Depth, domain, query, file type, path markers |
| Content features | 50 | TF-IDF and HTML/content statistics |
| Anchor features | 30 | Anchor text statistics and topic keyword indicators |
| Graph features | 10 | In-degree, out-degree, PageRank, hub/leaf indicators |
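A minimal sketch of how the five feature groups could be concatenated into one context vector, with a shape check per group. The `CONTEXT_LAYOUT` mapping, the group keys, and the concatenation order are assumptions for illustration; only the group sizes come from the table above.

```python
import numpy as np

# Group sizes from the table above; key names and ordering are assumed.
CONTEXT_LAYOUT = {"gnn": 64, "url": 20, "content": 50, "anchor": 30, "graph": 10}

def build_context(features):
    """Concatenate per-group feature vectors into one 174-d context."""
    parts = []
    for name, dim in CONTEXT_LAYOUT.items():
        vec = np.asarray(features[name], dtype=np.float32)
        if vec.shape != (dim,):
            raise ValueError(f"{name}: expected {dim} values, got shape {vec.shape}")
        parts.append(vec)
    return np.concatenate(parts)

context = build_context({name: np.zeros(dim) for name, dim in CONTEXT_LAYOUT.items()})
print(context.shape)  # (174,)
```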
The Q-agent uses two high-level actions:
| Action | Meaning |
|---|---|
| 0 | Stop the crawl episode |
| 1 | Continue crawling and delegate link choice to LinUCB |
When action 1 is selected, LinUCB scores each candidate link using its 174-dimensional context and chooses the maximum upper-confidence score.
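The selection step can be sketched with the standard LinUCB score, the predicted reward plus an exploration bonus proportional to the uncertainty of that prediction. This example assumes a single shared linear model (`A`, `b`) and an exploration weight `alpha`; the repository may instead keep separate statistics per arm, and the function name `linucb_select` is hypothetical.

```python
import numpy as np

def linucb_select(contexts, A, b, alpha=1.0):
    """Return the index of the best candidate and all upper-confidence scores."""
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b                     # ridge-regression estimate of reward weights
    means = contexts @ theta              # predicted reward for each candidate
    # Confidence width sqrt(x^T A^-1 x), computed for every row x at once.
    widths = np.sqrt(np.einsum("ij,jk,ik->i", contexts, A_inv, contexts))
    scores = means + alpha * widths
    return int(np.argmax(scores)), scores

d = 174
rng = np.random.default_rng(0)
candidates = rng.normal(size=(5, d))      # five candidate links, random contexts
A, b = np.eye(d), np.zeros(d)             # untrained model: identity prior, no rewards yet
best, scores = linucb_select(candidates, A, b)
```

With an untrained model every predicted reward is zero, so the choice is driven entirely by the exploration bonus.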
Rewards are computed from topical relevance, novelty, fetch cost, depth, and duplicate penalties:
| Event | Reward effect |
|---|---|
| Highly relevant page | +10 |
| Moderately relevant page | +5 |
| Low/irrelevant page | 0 to -2 |
| New domain | +2 |
| Duplicate page | -5 |
| Fetch time | -0.1 * fetch_time |
| Crawl depth | -0.1 * min(depth, 10) |
The training loop also rewards useful stopping behavior and penalizes early stops or dead ends.
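The per-page terms in the table can be combined as in the sketch below. The function name, the relevance labels (`"high"`, `"moderate"`, `"low"`), the choice of 0 for low-relevance pages (the table allows 0 to -2), and fetch time measured in seconds are assumptions; the stopping bonuses and penalties mentioned above are left out because their values are not stated here.

```python
def compute_reward(relevance, is_new_domain, is_duplicate, fetch_time, depth,
                   low_reward=0.0):
    """Sum the reward terms listed in the table for one fetched page."""
    base = {"high": 10.0, "moderate": 5.0, "low": low_reward}[relevance]
    reward = base
    if is_new_domain:
        reward += 2.0                   # novelty bonus for a new domain
    if is_duplicate:
        reward -= 5.0                   # duplicate-page penalty
    reward -= 0.1 * fetch_time          # fetch cost
    reward -= 0.1 * min(depth, 10)      # depth penalty, capped at depth 10
    return reward

# Highly relevant page on a new domain, 1.5 s fetch, depth 3:
# 10 + 2 - 0.15 - 0.3, roughly 11.55
example = compute_reward("high", True, False, fetch_time=1.5, depth=3)
```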
Q-learning updates the crawler after each action by comparing the value it predicted for that action with the reward it actually received plus the discounted value of the best next action. Over repeated episodes, this teaches the agent when it is worth continuing a crawl and when stopping is better.
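In its simplest form, this update is the standard temporal-difference rule. The sketch below applies it to a single scalar Q-value; the learning rate `lr`, the discount `gamma`, and the function names are illustrative assumptions, and the project's trained Q-network would apply the same target through gradient descent rather than a direct assignment.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """Reward plus the discounted value of the best next action (0 if the episode ended)."""
    return reward if done else reward + gamma * max(next_q_values)

def q_update(q_value, reward, next_q_values, done, lr=0.1, gamma=0.99):
    """Move the current estimate a fraction lr toward the TD target."""
    td_error = td_target(reward, next_q_values, done, gamma) - q_value
    return q_value + lr * td_error, td_error

# The agent expected 2.0 for "continue", received 5.0, and the best next value is 4.0.
new_q, td_error = q_update(2.0, reward=5.0, next_q_values=[1.0, 4.0],
                           done=False, lr=0.5, gamma=0.9)
# target = 5.0 + 0.9 * 4.0 = 8.6, so the estimate moves from 2.0 toward 8.6
```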
LinUCB updates its link-selection confidence after every selected link. Links with good rewards become more attractive, while uncertain links can still be explored when the model needs more evidence.
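A minimal sketch of that update for a single shared LinUCB model: each observation adds the outer product of its context to `A` and the reward-weighted context to `b`, which raises the predicted reward in that direction and shrinks the confidence width. The names `linucb_update` and `width`, and the 3-dimensional toy context, are assumptions for illustration.

```python
import numpy as np

def linucb_update(A, b, x, reward):
    """Fold one (context, reward) observation into the model statistics in place."""
    A += np.outer(x, x)
    b += reward * x
    return A, b

def width(A, x):
    """Confidence width sqrt(x^T A^-1 x) for context x."""
    return float(np.sqrt(x @ np.linalg.inv(A) @ x))

A, b = np.eye(3), np.zeros(3)
x = np.array([1.0, 0.0, 0.0])
before = width(A, x)                 # 1.0 with the identity prior
linucb_update(A, b, x, reward=10.0)
after = width(A, x)                  # smaller: this direction is now less uncertain
theta = np.linalg.inv(A) @ b         # estimated reward weights after one update
```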
The GraphSAGE encoder is pre-trained on the bootstrap graph with binary relevance labels, then frozen for active crawling to keep evaluation CPU-friendly.
flowchart TD
A[Topic Seeds and Labeled Data] --> B[Build Web Graph]
B --> C[Learn Page Embeddings with GNN]
C --> D[Adaptive Crawler]
D --> E{Q-Learning Decision}
E -->|Stop| H[Save Metrics and Results]
E -->|Continue| F[LinUCB Chooses Next Link]
F --> G[Visit Page and Compute Reward]
G --> D
B --> D
G --> H
In simple words, the system has two parts: preparation before crawling and decisions during crawling.
- The project starts with topic seeds and labeled URLs.
- Those URLs are used to build a web graph, which is a map of pages and links.
- The GNN studies that map and turns each page into a compact embedding.
- During crawling, the adaptive crawler uses the graph and embeddings as context.
- Q-learning makes the high-level decision: stop or continue.
- If it continues, LinUCB picks the best next link from the available candidates.
- The crawler visits the selected page and receives a reward based on relevance, novelty, depth, and cost.
- The reward improves future decisions, and the evaluator saves the final metrics and result files.
The main idea is simple: the GNN helps the crawler understand page structure, Q-learning controls the overall crawl strategy, and LinUCB chooses the next link when the crawler decides to continue.
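The decision loop described above can be sketched as follows. Every callable passed to `run_episode` (`get_links`, `choose_action`, `choose_link`, `visit`) is a hypothetical stand-in: in the real crawler they would wrap the Q-learning policy, LinUCB, the fetcher, and the reward function. The toy graph exists only to make the example runnable.

```python
STOP, CONTINUE = 0, 1  # action ids of the two high-level actions

def run_episode(start_url, get_links, choose_action, choose_link, visit, max_pages):
    """Crawl until the policy stops, the budget runs out, or no new links remain."""
    current, visited, rewards = start_url, {start_url}, []
    while len(rewards) < max_pages:
        budget_left = max_pages - len(rewards)
        if choose_action(current, budget_left) == STOP:
            break                                    # high-level policy chose to stop
        candidates = [u for u in get_links(current) if u not in visited]
        if not candidates:
            break                                    # dead end
        current = choose_link(candidates)            # link selection when continuing
        visited.add(current)
        rewards.append(visit(current))               # fetch the page, observe a reward
    return rewards

toy_graph = {"seed": ["ml/a", "news"], "ml/a": ["ml/b"], "ml/b": [], "news": []}
rewards = run_episode(
    "seed",
    get_links=lambda url: toy_graph.get(url, []),
    choose_action=lambda url, budget_left: CONTINUE,
    choose_link=lambda candidates: candidates[0],
    visit=lambda url: 10.0 if url.startswith("ml/") else 0.0,
    max_pages=5,
)
print(rewards)  # [10.0, 10.0]
```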
| Path | Purpose |
|---|---|
| src/ | Core crawler, models, graph utilities, reward and evaluation code |
| experiments/ | Bootstrap, training, live crawl, and evaluation scripts |
| configs/crawler_config.yaml | Main experiment configuration and hyperparameters |
| data/seeds/ | Topic seed URLs for ML, climate, and blockchain domains |
| data/target_domains/ | Labeled URL dataset and train/val/test splits |
| data/graphs/ | Bootstrap graph artifact |
| data/models/ | GNN, Q-learning, and bandit checkpoints |
| data/results/ | Evaluation JSON/Markdown reports |
| docs/ | Design notes, walkthroughs, phase reports, and detailed explanations |
| docs/paper/ | Research paper PDF |
| train.py | Reproducible root training wrapper |
| evaluate.py | Reproducible root evaluation wrapper |
The included sample dataset targets three topical domains:
| Topic | Seed file | Label data |
|---|---|---|
| Machine learning | data/seeds/ml_seeds.json | data/target_domains/*.csv |
| Climate | data/seeds/climate_seeds.json | data/target_domains/*.csv |
| Blockchain | data/seeds/blockchain_seeds.json | data/target_domains/*.csv |
Current labeled split:
| Split | File |
|---|---|
| Train | data/target_domains/train_labeled.csv |
| Validation | data/target_domains/val_labeled.csv |
| Test | data/target_domains/test_labeled.csv |
To regenerate the bootstrap graph and labeling template:
python experiments/bootstrap_graph.py
python experiments/create_labeled_data.py
After manual labeling, rebuild train/validation/test splits:
python experiments/create_labeled_data.py --split
Evaluation reports the following metrics:
| Metric | Definition |
|---|---|
| Harvest rate | Relevant pages found / total pages crawled |
| Precision@10 | Fraction of the first 10 crawled pages that are relevant |
| Precision@20 | Fraction of the first 20 crawled pages that are relevant |
| Average reward | Mean reward per crawled page |
| Crawl time | Wall-clock runtime per crawler run |
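The first three metrics can be computed from a per-page list of relevance flags in crawl order, as in this sketch. The function names are hypothetical. Dividing by the number of pages actually crawled when fewer than k are available is an assumption, although it matches the benchmark table below, where P@20 equals the harvest rate under a 10-page budget.

```python
def harvest_rate(relevance_flags):
    """Relevant pages found divided by total pages crawled."""
    return sum(relevance_flags) / len(relevance_flags) if relevance_flags else 0.0

def precision_at_k(relevance_flags, k):
    """Fraction of relevant pages among the first k crawled pages."""
    top = relevance_flags[:k]
    return sum(top) / len(top) if top else 0.0

flags = [1, 0, 1, 1, 0, 0, 0, 0, 0, 0]   # 1 = relevant, in crawl order
print(harvest_rate(flags))               # 0.3
print(precision_at_k(flags, 4))          # 0.75
```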
The evaluator compares:
- random
- best_first
- pagerank
- pure_q
- pure_bandit
- hybrid_no_gnn
- hybrid
Canonical strict benchmark: data/results/PHASE_7_FINAL_STRICT.md
Command used:
python experiments/evaluate_baseline.py --max-pages 10 --runs-per-seed 2 --max-seeds-per-topic 2 --random-seed 42 --output-prefix PHASE_7_FINAL_STRICT

| Crawler | Harvest Rate | P@10 | P@20 | Avg Reward | Crawl Time (s) |
|---|---|---|---|---|---|
| random | 0.108 +/- 0.029 | 0.108 | 0.108 | 0.55 | 21.84 |
| best_first | 0.133 +/- 0.078 | 0.133 | 0.133 | 0.84 | 22.50 |
| pagerank | 0.100 +/- 0.000 | 0.100 | 0.100 | 0.48 | 19.06 |
| pure_q | 0.958 +/- 0.144 | 0.958 | 0.958 | 11.21 | 2.77 |
| pure_bandit | 0.100 +/- 0.000 | 0.100 | 0.100 | 0.45 | 27.29 |
| hybrid_no_gnn | 1.000 +/- 0.000 | 1.000 | 1.000 | 11.72 | 2.56 |
| hybrid | 0.117 +/- 0.044 | 0.117 | 0.117 | 0.69 | 25.65 |
Interpretation: the strongest production policy in this snapshot is hybrid_no_gnn, with pure_q as a fallback. The full hybrid path remains experimental; diagnostics in data/results/PHASE_7_DIAG_STRICT.md show second-step LinUCB selections repeatedly choosing irrelevant pages.
Plots are not committed in this snapshot. Reproducible result tables and raw metrics are available as JSON/Markdown under data/results/; model checkpoints are under data/models/.
- Live crawling results can change because websites update, block requests, go offline, or respond slowly.
- Network conditions affect crawl time and sometimes the number of pages successfully fetched.
- The full hybrid policy is still experimental in this snapshot; the current strongest policy is hybrid_no_gnn, with pure_q as a fallback.
- The included dataset is intentionally small and student-budget friendly, so results should be treated as a reproducible project benchmark rather than a large-scale web benchmark.
- Plots are not included in this snapshot, but raw JSON and Markdown result tables are available under data/results/.
This project is intended for research and educational use. The crawler uses conservative settings such as request delays, timeouts, page budgets, and candidate limits to avoid aggressive crawling.
When running live crawls:
- Respect website terms of service and robots policies.
- Keep crawl budgets small unless you have permission.
- Use the configured delay in configs/crawler_config.yaml.
- Do not use the crawler for scraping private, sensitive, login-protected, or copyrighted content at scale.
- Report experiments with the exact command, random seed, date, and network conditions when possible.
Linux/macOS:
python -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt
Windows PowerShell:
python -m venv venv
.\venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt
Conda alternative:
conda env create -f environment.yml
conda activate adaptive-qlearning-crawler
Optional editable install:
pip install -e .
python test_skeleton.py
Run the complete root wrapper:
python train.py --seed 42
Or run each stage explicitly:
python experiments/train_gnn.py
python experiments/train_agent.py
Expected checkpoint outputs:
| File | Description |
|---|---|
| data/models/gnn_encoder_best.pt | Best validation GNN encoder |
| data/models/gnn_encoder_frozen.pt | Frozen GNN encoder for crawling |
| data/models/qlearning_agent.pt | Trained Q-network |
| data/models/bandit_arms.pkl | LinUCB arm state |
Preserved from the previous DEMO.md:
python experiments/run_hybrid_crawler.py
If model files are missing, run:
python experiments/train_gnn.py
python experiments/train_agent.py
Short demo evaluation:
python evaluate.py --max-pages 20 --runs-per-seed 1 --output-prefix DEMO_EVAL --seed 42
Equivalent direct command:
python experiments/evaluate_baseline.py --max-pages 20 --runs-per-seed 1 --output-prefix DEMO_EVAL --random-seed 42
Generated outputs:
| File | Description |
|---|---|
| data/results/DEMO_EVAL.json | Raw evaluation report |
| data/results/DEMO_EVAL.md | Markdown summary table |
For a quick reproducibility check, run:
pip install -r requirements.txt
python test_skeleton.py
python evaluate.py --max-pages 10 --runs-per-seed 2 --max-seeds-per-topic 2 --output-prefix PHASE_7_FINAL_STRICT --seed 42
The final command writes reproducible benchmark files to data/results/.
| Artifact | Status | Location |
|---|---|---|
| Research paper | Included | docs/paper/adaptive_qlearning_web_crawler.pdf |
| Citation metadata | Included | CITATION.cff |
| Python dependencies | Included | requirements.txt |
| Conda environment | Included | environment.yml |
| Seed dataset | Included | data/seeds/ |
| Labeled train/val/test splits | Included | data/target_domains/ |
| Bootstrap graph | Included | data/graphs/bootstrap_graph.pkl |
| Model checkpoints | Included | data/models/ |
| Evaluation results | Included | data/results/ |
| Detailed docs | Included | docs/ |
| Plots | Not included in this snapshot | Recreate from data/results/*.json |
Use --seed 42 for train.py and evaluate.py, or --random-seed 42 when calling experiments/evaluate_baseline.py directly. The evaluation script derives deterministic per-run seeds from the base seed, crawler name, seed URL, and run index.
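One way such a per-run seed could be derived is to hash the four inputs together, as sketched below. This is an assumption about the general approach, not the evaluation script's actual code; the function name `derive_run_seed`, the `|` separator, the example URL, and the 32-bit truncation are all hypothetical.

```python
import hashlib

def derive_run_seed(base_seed, crawler, seed_url, run_index):
    """Map (base seed, crawler, seed URL, run index) to a stable 32-bit seed."""
    key = f"{base_seed}|{crawler}|{seed_url}|{run_index}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big")

seed_a = derive_run_seed(42, "hybrid", "https://example.org/", 0)
seed_b = derive_run_seed(42, "hybrid", "https://example.org/", 1)
```

A cryptographic hash is used here instead of Python's built-in `hash()` because string hashes are salted per process unless `PYTHONHASHSEED` is fixed.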
For stricter deterministic runs, also set:
export PYTHONHASHSEED=42
PowerShell:
$env:PYTHONHASHSEED = "42"

| Artifact | Location |
|---|---|
| Dependencies | requirements.txt, environment.yml |
| Configuration | configs/crawler_config.yaml |
| Sample seed dataset | data/seeds/ |
| Labeled splits | data/target_domains/ |
| Graph artifact | data/graphs/bootstrap_graph.pkl |
| Training scripts | train.py, experiments/train_gnn.py, experiments/train_agent.py |
| Evaluation scripts | evaluate.py, experiments/evaluate_baseline.py |
| Results folder | data/results/ |
| Checkpoints | data/models/ |
| Documentation | docs/ |
| Paper PDF | docs/paper/adaptive_qlearning_web_crawler.pdf |
python evaluate.py --max-pages 10 --runs-per-seed 2 --max-seeds-per-topic 2 --output-prefix PHASE_7_FINAL_STRICT --seed 42
Compare generated outputs against:
- data/results/PHASE_7_FINAL_STRICT.json
- data/results/PHASE_7_FINAL_STRICT.md
Network conditions can affect live crawling results. For stable reporting, preserve raw JSON outputs and include the exact command, seed, date, and network environment in any paper or report.
- Paper PDF: docs/paper/adaptive_qlearning_web_crawler.pdf
- Technical design: docs/DESIGN.md
- Implementation walkthrough: docs/WALKTHROUGH.md
- Practical guide: docs/PRACTICAL_GUIDE.md
- Phase documentation: docs/phases/
- Generated phase reports: docs/reports/
Phase-level documentation is useful for understanding how the project was built and validated:
| Phase | Document | Focus |
|---|---|---|
| 1 | PHASE_1.md | Project setup, dependencies, baseline skeleton |
| 2 | PHASE_2.md | Seed data, bootstrap graph, labeling pipeline |
| 3 | PHASE_3.md | GNN pre-training and frozen encoder |
| 4 | PHASE_4.md | Q-learning and LinUCB training integration |
| 5 | PHASE_5.md | Live hybrid crawler integration |
| 6 | PHASE_6.md | Baseline evaluation protocol |
| 7 | PHASE_7.md | Diagnostics, final benchmark, policy selection |
Relevant background papers:
- Tree-based Focused Web Crawling with Reinforcement Learning, Kontogiannis et al., 2021.
- Deep Reinforcement Learning for Web Crawling, Avrachenkov, Borkar, and Patil, 2021.
- Efficient Deep Web Crawling Using Reinforcement Learning, Jiang et al., 2010.
- Learning to Crawl Deep Web, Zheng et al., 2013.
Citation metadata is provided in CITATION.cff.
BibTeX:
@software{adaptive_qlearning_web_crawler_2026,
author = {Mogia, Vasant Kumar and Jariha, Sujal},
title = {Adaptive Q-Learning Web Crawler},
year = {2026},
version = {0.1.0},
url = {https://github.com/DSCmatter/adaptive-qlearning-web-crawler},
license = {MIT}
}
This project is released under the MIT License. See LICENSE.