Collin's Webcrawler

Hi, I built this simple web crawler to practice implementing data structures and algorithms. The code is publicly available, and the key tech and components are detailed below.

Tech & Components

Frontier Queue

  • The frontier queue is implemented with a modified BFS (Breadth-First Search) algorithm. For the most part, URLs are handled first in, first out, but some URLs are prioritized based on factors such as path length and depth. Shorter pathnames like https://google.com have a higher priority than https://google.com/x/y/z. Additionally, as the queue grows larger, URLs with a lower depth are prioritized more. Each crawl iteration increases the depth, so prioritizing lower-depth URLs keeps the queue close to first-in, first-out order.
  • The retrieval function for the frontier queue automatically handles politeness and parsing of the robots.txt file. Politeness is the process of throttling requests to a specific hostname, so we do not overwhelm the same site with too much traffic. By default, this is a 1-second delay between requests, but it may change based on the site's robots.txt file.
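The frontier described above can be sketched roughly as follows (a minimal Python illustration, assuming a heap-based priority queue; the class name, priority weights, and politeness handling are illustrative, not the repo's actual code):

```python
import heapq
import time
from urllib.parse import urlparse

class Frontier:
    """Priority frontier: mostly FIFO, but shorter paths and lower
    crawl depth are served first (smaller score = higher priority)."""

    def __init__(self, default_delay=1.0):
        self._heap = []
        self._counter = 0          # preserves FIFO order among equal priorities
        self._next_allowed = {}    # hostname -> earliest next request time
        self._default_delay = default_delay  # 1s politeness delay by default

    def _priority(self, url, depth):
        path_segments = [s for s in urlparse(url).path.split("/") if s]
        # Illustrative weights: longer paths and deeper crawls raise the score.
        return len(path_segments) + 2 * depth

    def add(self, url, depth):
        heapq.heappush(self._heap,
                       (self._priority(url, depth), self._counter, url, depth))
        self._counter += 1

    def pop(self):
        """Return the next (url, depth), waiting out the per-host delay."""
        _, _, url, depth = heapq.heappop(self._heap)
        host = urlparse(url).netloc
        wait = self._next_allowed.get(host, 0) - time.time()
        if wait > 0:
            time.sleep(wait)  # politeness: throttle requests to the same host
        self._next_allowed[host] = time.time() + self._default_delay
        return url, depth
```

With this sketch, https://google.com (zero path segments) would pop before https://google.com/x/y/z (three segments) even if enqueued later.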

Queue Worker

  • The queue worker is the system responsible for retrieving URLs from the queue and crawling them. This worker handles concurrency, so 5 URLs can be crawled at the same time.
  • The worker is optimized to parse documents efficiently. Once each document is fetched, it is parsed, with keywords and URLs extracted, in a separate worker thread. This keeps CPU-heavy parsing off the main path and does not block other operations like the queue.
  • After URLs are extracted from the document, the new URLs are added to the queue to be crawled. Priority is then calculated based on the pathname and depth of the URL.
  • Extracted keywords are added to the reverse keyword index. Rather than an index mapping each URL to its keywords, a reverse index is used, so keywords are mapped to all URLs that contain that keyword, along with their respective TF-IDF scores.
  • The TF-IDF algorithm (Term Frequency - Inverse Document Frequency) is a formula used to calculate keyword relevance. Term frequency represents the ratio of that keyword in the document, so if the word "example" appears 10 times in a document with 100 words, the term frequency is 10/100 = 0.1, or 10%. This term frequency is combined with inverse document frequency, which represents how rare the word is across the index. Words like "the", "is", and "what" are more common than words like "redox" or "subprocess". Including IDF is important so that, when searching, rarer words are boosted over common ones rather than drowned out.
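The worker loop described in the bullets above could be sketched like this (a hedged illustration assuming a thread pool of 5 crawlers; the function names, parameters, and the `concurrent.futures` choice are my assumptions, not the repo's actual design):

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENCY = 5  # crawl up to 5 URLs at the same time

def crawl(url, depth, fetch, parse, frontier, index):
    """Fetch one document, parse it, and feed results back into the system."""
    html = fetch(url)                  # network-bound fetch
    links, keywords = parse(html)      # CPU-bound parse; in the real project
                                       # this runs in a separate worker thread
    for link in links:
        frontier.add(link, depth + 1)  # new URLs join the queue; priority is
                                       # recomputed from pathname and depth
    index.add_document(url, keywords)  # update the reverse keyword index

def run(frontier, fetch, parse, index, max_pages):
    # The pool caps concurrency at 5; exiting the `with` block waits for
    # all in-flight crawls to finish.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as pool:
        for _ in range(max_pages):
            url, depth = frontier.pop()
            pool.submit(crawl, url, depth, fetch, parse, frontier, index)
```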
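The reverse index and TF-IDF scoring described above can be illustrated with a tiny sketch (the class shape and the log-based IDF are common conventions I'm assuming here, not necessarily what the project uses):

```python
import math
from collections import Counter, defaultdict

class ReverseIndex:
    """Maps keyword -> {url: term frequency}; scores with TF-IDF."""

    def __init__(self):
        self._postings = defaultdict(dict)  # keyword -> {url: TF}
        self._total_docs = 0

    def add_document(self, url, words):
        self._total_docs += 1
        for word, count in Counter(words).items():
            # TF: "example" 10 times in a 100-word doc -> 10/100 = 0.1
            self._postings[word][url] = count / len(words)

    def tf_idf(self, word, url):
        docs_with_word = len(self._postings.get(word, {}))
        if docs_with_word == 0:
            return 0.0
        # IDF boosts rare words like "redox" over ubiquitous ones like "the"
        idf = math.log(self._total_docs / docs_with_word)
        return self._postings[word].get(url, 0.0) * idf
```

Note how a word appearing in every indexed document gets IDF = log(1) = 0, so it contributes nothing to a result's score.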

Search Manager

  • After the queue has finished crawling the specified number of URLs, the search manager is started to search the calculated reverse keyword index. When the user enters a prompt, the prompt is tokenized and parsed. Each word is stemmed to its root, so "running", "ran", and "runner" all reduce to "run", and all words are converted to lowercase.
  • After the prompt is tokenized and parsed, each keyword is looked up in the reverse keyword index, and the TF-IDF scores for each URL are summed and sorted to find the three highest totals. These three URLs are then returned to the user as the top results containing the most relevant keywords.
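The lookup step above can be sketched as follows (a simplified illustration: the toy suffix-stripping `stem` stands in for a real stemming algorithm such as Porter's, and `postings` is an assumed keyword-to-scores mapping):

```python
def stem(word):
    # Toy stemmer for illustration: lowercase, then strip common suffixes.
    # A real search manager would use a proper stemming algorithm.
    word = word.lower()
    for suffix in ("ning", "ner", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def search(prompt, postings, top_n=3):
    """Tokenize and stem the prompt, sum TF-IDF scores per URL,
    and return the top_n URLs by total score.

    `postings` maps keyword -> {url: tf_idf_score}."""
    scores = {}
    for token in prompt.split():
        for url, score in postings.get(stem(token), {}).items():
            scores[url] = scores.get(url, 0.0) + score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A URL matching several query keywords accumulates their scores, so it naturally ranks above a URL matching only one.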

Thank you for checking out my web crawler. I've learned a great amount while researching the topics above and implementing them into this project. Check out my GitHub page for more https://github.com/Collin222.
