This repository includes a few helper files for scraping GitHub. Most of the scraping for our projects was done with Scrapy, although the scraping.py file should be general enough to work with many frameworks. The most important files here are the "usernames/passwords.txt" files, which contain auth token pairs that can be used to crawl GitHub at a much faster rate without getting rate limited.

Authenticated requests are rate limited per account, whereas unauthenticated requests are limited per IP address. There are therefore two strategies for getting around rate limiting: routinely change IP address (using expressvpn-python), which is quite clumsy, or simply switch auth tokens when you run out of requests, which is a lot cleaner. In some circumstances (if requests are slow enough), the first account's quota will have reset before the last account runs out of requests, so the crawl can keep cycling through the accounts without pausing. Care must be taken to handle errors, especially HTTP error codes, as unhandled errors can dramatically slow down scraping.
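The token-switching strategy can be sketched roughly as follows. This is a minimal illustration, not the code in scraping.py: the `TokenRotator` class, its method names, and the injectable `clock` argument are all hypothetical, and the sketch assumes each response exposes GitHub's `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers as a plain dictionary. It makes no network calls, so it can be wired into Scrapy or any other framework.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Mapping, Optional


@dataclass
class TokenState:
    remaining: int = 1     # assume the token is usable until told otherwise
    reset_at: float = 0.0  # Unix time at which this token's quota refills


class TokenRotator:
    """Hypothetical helper: switch tokens when the current one is exhausted."""

    def __init__(self, tokens: List[str], clock: Callable[[], float] = time.time):
        if not tokens:
            raise ValueError("at least one token is required")
        self._tokens = list(tokens)
        self._state: Dict[str, TokenState] = {t: TokenState() for t in self._tokens}
        self._clock = clock
        self._index = 0

    def current(self) -> Optional[str]:
        """Return a usable token, or None if every token is exhausted."""
        now = self._clock()
        for offset in range(len(self._tokens)):
            i = (self._index + offset) % len(self._tokens)
            state = self._state[self._tokens[i]]
            # Usable if quota is left, or if its reset time has already passed.
            if state.remaining > 0 or now >= state.reset_at:
                self._index = i
                return self._tokens[i]
        return None

    def update(self, token: str, headers: Mapping[str, str]) -> None:
        """Record the quota reported in a response's rate-limit headers."""
        lowered = {k.lower(): v for k, v in headers.items()}
        state = self._state[token]
        if "x-ratelimit-remaining" in lowered:
            state.remaining = int(lowered["x-ratelimit-remaining"])
        if "x-ratelimit-reset" in lowered:
            state.reset_at = float(lowered["x-ratelimit-reset"])

    def seconds_until_ready(self) -> float:
        """Seconds until the earliest token reset (0.0 if one has passed)."""
        earliest = min(s.reset_at for s in self._state.values())
        return max(0.0, earliest - self._clock())
```

In a crawl loop, one would call `current()` before each request, pass the response headers to `update()`, and sleep for `seconds_until_ready()` only when `current()` returns `None`. If requests are slow enough, that sleep never happens, because the first token has reset by the time the last one runs out.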
kprotopapas/scraping