Scraping helper repo

This repository includes a few helper files for scraping GitHub. Most of the scraping for our projects was done using Scrapy, although the scraping.py file should be general enough to work with many other frameworks. The most important files here are usernames.txt and passwords.txt, which contain auth token pairs that can be used to crawl GitHub at a much faster rate without getting rate limited.

Authenticated requests are rate limited per account, whereas unauthenticated requests are limited per IP. There are therefore two strategies for getting around rate limiting: routinely change IP (using expressvpn-python), which is quite clumsy, or simply switch auth tokens when an account runs out of requests, which is a lot cleaner. In some circumstances (if requests are slow enough), the first account's quota will have reset before the last account runs out of requests, so scraping can continue without pausing.

Care must be taken to handle errors, especially HTTP error codes, as unhandled errors can dramatically slow down scraping.
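The token-switching strategy described above can be sketched roughly as follows. This is a minimal illustration, not the code in scraping.py: the names `Credential` and `CredentialRotator`, the starting budget of 5000 requests, and the idea of feeding in the remaining count and reset time after each response (for example from GitHub's `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers) are all assumptions made for this example.

```python
import time
from dataclasses import dataclass


@dataclass
class Credential:
    """One account; field names are hypothetical, not taken from scraping.py."""
    username: str
    token: str
    remaining: int = 5000   # assumed starting budget per account
    reset_at: float = 0.0   # Unix time at which the budget refills


class CredentialRotator:
    """Cycle through accounts, skipping those that are out of requests."""

    def __init__(self, credentials, clock=time.time):
        if not credentials:
            raise ValueError("at least one credential is required")
        self.credentials = list(credentials)
        self.clock = clock  # injectable so the logic can be tested offline
        self.index = 0

    def update(self, remaining, reset_at):
        """Record the rate-limit state reported for the account in use."""
        cred = self.credentials[self.index]
        cred.remaining = remaining
        cred.reset_at = reset_at

    def next_available(self):
        """Return (credential, seconds_to_wait).

        Accounts are tried in order starting from the current one. An account
        is usable if it has requests left or its reset time has passed. If no
        account is usable, the one that resets soonest is returned together
        with the time to wait for it.
        """
        now = self.clock()
        n = len(self.credentials)
        for offset in range(n):
            i = (self.index + offset) % n
            cred = self.credentials[i]
            if cred.remaining > 0 or cred.reset_at <= now:
                self.index = i
                return cred, 0.0
        # Every account is exhausted: pick the one that refills first.
        i = min(range(n), key=lambda k: self.credentials[k].reset_at)
        self.index = i
        return self.credentials[i], self.credentials[i].reset_at - now
```

In a crawler, one would call `next_available()` before each request, sleep for the returned number of seconds if it is non-zero, and call `update()` with the values reported by the response. When a response signals that the limit was hit (GitHub documents 403 or 429 for this case), calling `update(0, reset_time)` moves the crawler on to the next account instead of retrying the exhausted one.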

Files in this repository: README.md, passwords.txt, requirements.txt, scraping.py, usernames.txt.
