kprotopapas/scraping

Scraping helper repo

This repository contains a few helper files for scraping GitHub. Most of the scraping for our projects was done with Scrapy, although scraping.py should be general enough to work with many frameworks.

The most important files here are the "usernames/passwords.txt" files, which contain auth token pairs that can be used to crawl GitHub at a much faster rate without being rate limited. Authenticated requests are rate limited per account, whereas unauthenticated requests are limited per IP. There are therefore two ways around rate limiting: routinely change IP (using expressvpn-python), which is quite clumsy, or simply switch auth tokens whenever one runs out of requests, which is a lot cleaner. In some circumstances (if requests are slow enough), the first account's quota will have reset before the last account runs out of requests.

Care must be taken to handle errors, especially HTTP error codes, as unhandled errors can dramatically slow down scraping.
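The token-switching strategy described above could be sketched roughly as follows. This is a minimal illustration, not the code in scraping.py: the names `TokenPool` and `TokenState` and the "ok" / "rotate" / "retry" / "skip" labels are hypothetical, and it assumes the crawler talks to the GitHub REST API, whose responses carry `x-ratelimit-remaining` and `x-ratelimit-reset` headers and report an exhausted quota with status 403 or 429.

```python
import time
from dataclasses import dataclass


@dataclass
class TokenState:
    """Quota bookkeeping for one auth token (hypothetical helper)."""
    token: str
    remaining: int = 1     # assume usable until a response says otherwise
    reset_at: float = 0.0  # epoch seconds at which the quota refills


class TokenPool:
    """Chooses a token with quota left and tracks quota from responses."""

    def __init__(self, tokens):
        if not tokens:
            raise ValueError("need at least one token")
        self.states = [TokenState(t) for t in tokens]

    def pick(self, now=None):
        """Return (state, seconds_to_wait) for the next request."""
        now = time.time() if now is None else now
        for state in self.states:
            # Usable if quota is left, or if its reset time has already passed.
            if state.remaining > 0 or state.reset_at <= now:
                return state, 0.0
        # Every token is exhausted: wait for whichever one resets first.
        soonest = min(self.states, key=lambda s: s.reset_at)
        return soonest, soonest.reset_at - now

    def update(self, state, status, headers):
        """Record quota headers and classify the response."""
        h = {k.lower(): v for k, v in headers.items()}
        if "x-ratelimit-remaining" in h:
            state.remaining = int(h["x-ratelimit-remaining"])
        if "x-ratelimit-reset" in h:
            state.reset_at = float(h["x-ratelimit-reset"])
        if status in (403, 429) and state.remaining == 0:
            return "rotate"  # quota exhausted: switch to another token
        if status >= 500:
            return "retry"   # server-side error: retry later with a delay
        if status >= 400:
            return "skip"    # other client error: retrying will not help
        return "ok"
```

In a crawl loop, `pick()` would be called before each request and `update()` after each response, sleeping for the returned wait only when every token is exhausted. Classifying status codes up front keeps the crawler from repeatedly retrying requests that cannot succeed, which is one way errors slow scraping down.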
