Collin's Webcrawler

Hi, I built this simple web crawler to practice implementing data structures and algorithms. The code is publicly available, and the key tech and components are detailed below.

Tech & Components

Frontier Queue

  • The frontier queue is implemented with a modified BFS (Breadth-First Search) algorithm. For the most part, URLs are handled first in, first out, but some URLs are prioritized based on factors such as path length and depth. Shorter pathnames like https://google.com have a higher priority than https://google.com/x/y/z. Additionally, as the queue grows larger, URLs with a lower depth are prioritized more. Each crawl iteration increases the depth, so prioritizing lower-depth URLs keeps the queue close to first-in, first-out order.
  • The retrieval function for the frontier queue automatically handles politeness and parsing of the robots.txt file. Politeness is the process of throttling requests to a specific hostname, so we do not overwhelm the same site with too much traffic. By default, this is a 1-second delay between requests, but it may change based on the site's robots.txt file.
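The frontier described above can be sketched roughly as follows (a minimal Python illustration, assuming a heap-based priority queue; the class name, priority weights, and politeness handling are illustrative, not the repo's actual code):

```python
import heapq
import time
from urllib.parse import urlparse

class Frontier:
    """Priority frontier: mostly FIFO, but shorter paths and lower
    crawl depth are served first (smaller score = higher priority)."""

    def __init__(self, default_delay=1.0):
        self._heap = []
        self._counter = 0          # preserves FIFO order among equal priorities
        self._next_allowed = {}    # hostname -> earliest next request time
        self._default_delay = default_delay  # 1s politeness delay by default

    def _priority(self, url, depth):
        path_segments = [s for s in urlparse(url).path.split("/") if s]
        # Illustrative weights: longer paths and deeper crawls raise the score.
        return len(path_segments) + 2 * depth

    def add(self, url, depth):
        heapq.heappush(self._heap,
                       (self._priority(url, depth), self._counter, url, depth))
        self._counter += 1

    def pop(self):
        """Return the next (url, depth), waiting out the per-host delay."""
        _, _, url, depth = heapq.heappop(self._heap)
        host = urlparse(url).netloc
        wait = self._next_allowed.get(host, 0) - time.time()
        if wait > 0:
            time.sleep(wait)  # politeness: throttle requests to the same host
        self._next_allowed[host] = time.time() + self._default_delay
        return url, depth
```

With this sketch, https://google.com (zero path segments) would pop before https://google.com/x/y/z (three segments) even if enqueued later.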

Queue Worker

  • The queue worker is the system responsible for retrieving URLs from the queue and crawling them. This worker handles concurrency, so 5 URLs can be crawled at the same time.
  • The worker is optimized to parse documents efficiently. Once each document is fetched, it is parsed, with keywords and URLs extracted, in a separate worker thread. This keeps CPU-heavy parsing off the main path and does not block other operations like the queue.
  • After URLs are extracted from the document, the new URLs are added to the queue to be crawled. Priority is then calculated based on the pathname and depth of the URL.
  • Extracted keywords are added to the reverse keyword index. Rather than an index mapping each URL to its keywords, a reverse index is used, so keywords are mapped to all URLs that contain that keyword, along with their respective TF-IDF scores.
  • The TF-IDF algorithm (Term Frequency - Inverse Document Frequency) is a formula used to calculate keyword relevance. Term frequency represents the ratio of that keyword in the document, so if the word "example" appears 10 times in a document with 100 words, the term frequency is 10/100 = 0.1, or 10%. This term frequency is combined with inverse document frequency, which represents how rare the word is across the index. Words like "the", "is", and "what" are more common than words like "redox" or "subprocess". Including IDF is important so that, when searching, rarer words are boosted over common ones rather than drowned out.
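The worker loop described in the bullets above could be sketched like this (a hedged illustration assuming a thread pool of 5 crawlers; the function names, parameters, and the `concurrent.futures` choice are my assumptions, not the repo's actual design):

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENCY = 5  # crawl up to 5 URLs at the same time

def crawl(url, depth, fetch, parse, frontier, index):
    """Fetch one document, parse it, and feed results back into the system."""
    html = fetch(url)                  # network-bound fetch
    links, keywords = parse(html)      # CPU-bound parse; in the real project
                                       # this runs in a separate worker thread
    for link in links:
        frontier.add(link, depth + 1)  # new URLs join the queue; priority is
                                       # recomputed from pathname and depth
    index.add_document(url, keywords)  # update the reverse keyword index

def run(frontier, fetch, parse, index, max_pages):
    # The pool caps concurrency at 5; exiting the `with` block waits for
    # all in-flight crawls to finish.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as pool:
        for _ in range(max_pages):
            url, depth = frontier.pop()
            pool.submit(crawl, url, depth, fetch, parse, frontier, index)
```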
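The reverse index and TF-IDF scoring described above can be illustrated with a tiny sketch (the class shape and the log-based IDF are common conventions I'm assuming here, not necessarily what the project uses):

```python
import math
from collections import Counter, defaultdict

class ReverseIndex:
    """Maps keyword -> {url: term frequency}; scores with TF-IDF."""

    def __init__(self):
        self._postings = defaultdict(dict)  # keyword -> {url: TF}
        self._total_docs = 0

    def add_document(self, url, words):
        self._total_docs += 1
        for word, count in Counter(words).items():
            # TF: "example" 10 times in a 100-word doc -> 10/100 = 0.1
            self._postings[word][url] = count / len(words)

    def tf_idf(self, word, url):
        docs_with_word = len(self._postings.get(word, {}))
        if docs_with_word == 0:
            return 0.0
        # IDF boosts rare words like "redox" over ubiquitous ones like "the"
        idf = math.log(self._total_docs / docs_with_word)
        return self._postings[word].get(url, 0.0) * idf
```

Note how a word appearing in every indexed document gets IDF = log(1) = 0, so it contributes nothing to a result's score.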

Search Manager

  • After the queue has finished crawling the specified number of URLs, the search manager is started to search the calculated reverse keyword index. When the user enters a prompt, the prompt is tokenized and parsed. Each word is stemmed to its root, so "running", "ran", and "runner" all reduce to "run", and all words are converted to lowercase.
  • After the prompt is tokenized and parsed, each keyword is looked up in the reverse keyword index, and the TF-IDF scores for each URL are summed and sorted to find the three highest totals. These three URLs are then returned to the user as the top results containing the most relevant keywords.
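The lookup step above can be sketched as follows (a simplified illustration: the toy suffix-stripping `stem` stands in for a real stemming algorithm such as Porter's, and `postings` is an assumed keyword-to-scores mapping):

```python
def stem(word):
    # Toy stemmer for illustration: lowercase, then strip common suffixes.
    # A real search manager would use a proper stemming algorithm.
    word = word.lower()
    for suffix in ("ning", "ner", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def search(prompt, postings, top_n=3):
    """Tokenize and stem the prompt, sum TF-IDF scores per URL,
    and return the top_n URLs by total score.

    `postings` maps keyword -> {url: tf_idf_score}."""
    scores = {}
    for token in prompt.split():
        for url, score in postings.get(stem(token), {}).items():
            scores[url] = scores.get(url, 0.0) + score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A URL matching several query keywords accumulates their scores, so it naturally ranks above a URL matching only one.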

Thank you for checking out my web crawler. I've learned a great amount while researching the topics above and implementing them into this project. Check out my GitHub page for more https://github.com/Collin222.
