Implement robots.txt Compliance in Crawler #25

@Digvijay-x1

Description

The current crawler (crawler/src/main.cpp) fetches URLs without first checking the site's robots.txt. This is bad practice and can get the crawler blocked or banned by site administrators.

Tasks

  • Crawler (C++): Add a RobotsParser class, either backed by an existing C++ robots.txt library or implemented with simple in-house parsing logic.
  • Crawler (C++): Before adding a URL to the queue or fetching it, check if it is allowed for our user-agent (MaxSearchEngineBot).
  • Crawler (C++): Cache robots.txt for domains to avoid re-fetching it for every URL.
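
A rough sketch of how the three tasks above could fit together, offered as a starting point rather than a finished design. Only the RobotsParser name and the MaxSearchEngineBot user-agent come from this issue; RobotsCache, the Fetcher callback, and all method names are hypothetical. The sketch assumes plain prefix matching (no `*` or `$` wildcards in paths), picks the longest matching rule with Allow winning ties, and falls back to the `User-agent: *` group only when no group names our bot, which is how RFC 9309 describes group selection.

```cpp
#include <algorithm>
#include <cctype>
#include <functional>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sketch: parses one robots.txt body for a single user-agent.
class RobotsParser {
public:
    RobotsParser(const std::string& body, const std::string& agent) {
        const std::string me = lower(agent);
        std::istringstream in(body);
        std::string line;
        std::vector<std::string> agents;  // user-agents of the group being read
        bool inRules = false;             // true once the group has a rule line
        while (std::getline(in, line)) {
            line = trim(line.substr(0, line.find('#')));  // drop comments
            const auto colon = line.find(':');
            if (colon == std::string::npos) continue;
            const std::string key = lower(trim(line.substr(0, colon)));
            const std::string value = trim(line.substr(colon + 1));
            if (key == "user-agent") {
                if (inRules) { agents.clear(); inRules = false; }  // new group
                agents.push_back(lower(value));
                if (agents.back() == me) hasSpecific_ = true;
            } else if (key == "allow" || key == "disallow") {
                inRules = true;
                if (value.empty()) continue;  // an empty rule matches nothing
                for (const auto& a : agents) {
                    if (a == me) specific_.push_back({key == "allow", value});
                    else if (a == "*") wildcard_.push_back({key == "allow", value});
                }
            }
        }
    }

    // Longest matching prefix wins; Allow wins ties; no match means allowed.
    bool isAllowed(const std::string& path) const {
        const auto& rules = hasSpecific_ ? specific_ : wildcard_;
        bool allowed = true;
        std::size_t best = 0;
        for (const auto& r : rules) {
            if (path.compare(0, r.prefix.size(), r.prefix) != 0) continue;
            if (r.prefix.size() > best || (r.prefix.size() == best && r.allow)) {
                best = r.prefix.size();
                allowed = r.allow;
            }
        }
        return allowed;
    }

private:
    struct Rule { bool allow; std::string prefix; };

    static std::string trim(const std::string& s) {
        const char* ws = " \t\r\n";
        const auto b = s.find_first_not_of(ws);
        if (b == std::string::npos) return "";
        return s.substr(b, s.find_last_not_of(ws) - b + 1);
    }
    static std::string lower(std::string s) {
        std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
            return static_cast<char>(std::tolower(c));
        });
        return s;
    }

    bool hasSpecific_ = false;     // a group names our user-agent explicitly
    std::vector<Rule> specific_;   // rules from that group
    std::vector<Rule> wildcard_;   // rules from "User-agent: *" groups
};

// Hypothetical per-host cache: robots.txt is fetched once per host via the
// injected callback, so the crawler's HTTP code stays out of this sketch.
class RobotsCache {
public:
    using Fetcher = std::function<std::string(const std::string& host)>;

    RobotsCache(Fetcher fetch, std::string agent)
        : fetch_(std::move(fetch)), agent_(std::move(agent)) {}

    bool isAllowed(const std::string& host, const std::string& path) {
        auto it = cache_.find(host);
        if (it == cache_.end())
            it = cache_.emplace(host, RobotsParser(fetch_(host), agent_)).first;
        return it->second.isAllowed(path);
    }

private:
    Fetcher fetch_;
    std::string agent_;
    std::unordered_map<std::string, RobotsParser> cache_;
};
```

The crawler would call something like `cache.isAllowed(host, path)` before enqueueing or fetching a URL. Cache expiry, fetch failures (missing or unreachable robots.txt), and thread safety are deliberately left out and would need a decision before merging.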

Acceptance Criteria

  • Crawler respects User-agent: * and User-agent: MaxSearchEngineBot rules in robots.txt.
  • Crawler does not fetch disallowed paths.
