Description
The current crawler (crawler/src/main.cpp) indiscriminately crawls URLs without checking robots.txt. This is bad practice and can get our crawler blocked or banned by site administrators.
Tasks
- Implement a RobotsParser class or use a library (e.g., specific C++ library or simple parsing logic).
- Match rules against our user agent (MaxSearchEngineBot).
- Cache robots.txt for domains to avoid re-fetching it for every URL.
Acceptance Criteria
- Crawler respects User-agent: * and User-agent: MaxSearchEngineBot rules in robots.txt.
- Crawler does not fetch disallowed paths.
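For discussion, the parsing and caching tasks could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the intended implementation: only RobotsParser and MaxSearchEngineBot come from this issue, while RobotsCache, Fetcher, toLower, and trim are hypothetical names. It handles Disallow prefix rules only (no Allow lines, wildcards, or crawl delays), and it assumes the usual precedence where a group naming our bot replaces the * group, which otherwise acts as the fallback. The network fetch is injected as a callback so the sketch does not depend on any particular HTTP library.

```cpp
#include <algorithm>
#include <cctype>
#include <functional>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: lowercase copy for case-insensitive comparisons.
static std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Hypothetical helper: strip leading and trailing whitespace.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const auto b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    return s.substr(b, s.find_last_not_of(ws) - b + 1);
}

class RobotsParser {
public:
    // body: raw robots.txt text; agent: our bot name, e.g. "MaxSearchEngineBot".
    RobotsParser(const std::string& body, const std::string& agent) {
        const std::string want = toLower(agent);
        std::vector<std::string> star, own;  // Disallow prefixes per group
        bool inStar = false, inOwn = false, sawOwn = false, lastWasAgent = false;
        std::istringstream in(body);
        std::string line;
        while (std::getline(in, line)) {
            line = trim(line.substr(0, line.find('#')));  // drop comments
            const auto colon = line.find(':');
            if (colon == std::string::npos) continue;
            const std::string key = toLower(trim(line.substr(0, colon)));
            const std::string value = trim(line.substr(colon + 1));
            if (key == "user-agent") {
                // A User-agent line after a rule line starts a new group.
                if (!lastWasAgent) inStar = inOwn = false;
                if (value == "*") inStar = true;
                else if (toLower(value) == want) inOwn = sawOwn = true;
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // An empty Disallow value means "nothing is disallowed".
                if (key == "disallow" && !value.empty()) {
                    if (inStar) star.push_back(value);
                    if (inOwn) own.push_back(value);
                }
            }
        }
        // Assumption: a group naming our bot replaces the "*" group.
        disallowed_ = sawOwn ? own : star;
    }

    // True if no Disallow prefix matches the start of the path.
    bool isAllowed(const std::string& path) const {
        for (const auto& prefix : disallowed_)
            if (path.compare(0, prefix.size(), prefix) == 0) return false;
        return true;
    }

private:
    std::vector<std::string> disallowed_;
};

// Hypothetical per-host cache so robots.txt is fetched once per domain.
class RobotsCache {
public:
    // Fetcher returns the robots.txt body for a host (network code not shown).
    using Fetcher = std::function<std::string(const std::string& host)>;

    RobotsCache(Fetcher fetch, std::string agent)
        : fetch_(std::move(fetch)), agent_(std::move(agent)) {}

    bool isAllowed(const std::string& host, const std::string& path) {
        auto it = cache_.find(host);
        if (it == cache_.end())  // first URL for this host: fetch and parse once
            it = cache_.emplace(host, RobotsParser(fetch_(host), agent_)).first;
        return it->second.isAllowed(path);
    }

private:
    Fetcher fetch_;
    std::string agent_;
    std::unordered_map<std::string, RobotsParser> cache_;
};
```

In the crawl loop, a call such as cache.isAllowed(host, path) would gate each fetch. Cache expiry and the handling of a missing or failing robots.txt are left open here and would need a decision before this could satisfy the acceptance criteria.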