GitHub - codescholar/webb: Python: An all-in-one Web Crawler and Web Parser library

Webb - Web Scraper and Crawler

An all-in-one Python library to scrape, parse, and crawl web pages

Gist

This is a lightweight, dynamic, feature-rich Python library to crawl, download, index, parse, scrape, and analyze web pages in a systematic manner.

Compatibility

This library is compatible with both Python 2 (2.x) and Python 3 (3.x). It is a download, import, and run script that needs little or no modification by users.

Dependencies

This project has no dependencies. It relies entirely on Python's built-in standard library and needs no external packages or installations. Just download and run!

Usage

Download webb.py
Create a new file in the same folder and give it any name like main.py
Import the webb library by writing import webb in the main.py file
Once this is done, use any of the following commands:

Traceroute a URL:
webb.traceroute("your-web-page-url")
Get IP address of a webpage:
webb.get_ip("your-web-page-url")
Download entire HTML page:
print (webb.download_page("your-web-page-url"))
Print the page title:
webb.title("your-web-page-url")
Find all the links in a web page and print them one per line:
webb.find_all_links("your-web-page-url")
Find all the links in a page, convert them into absolute URLs, and print them:
webb.find_all_links("www.zseries.in","absolute")
Find all the links in a particular web page and print them as a list:
print (webb.find_all_links_as_list("your-web-page-url"))
Crawl web pages in a breadth-first manner:
webb.web_crawl("your-web-page-url")
Crawl web pages with a delay of 2 seconds after each page crawled:
webb.web_crawl("your-web-page-url",2)
Normalize URL (Convert Relative URL to absolute URL):
print(webb.url_normalize("your-relative-url","your-seed-page-url"))
Download Google Images for a single keyword or for a list of keywords:
webb.download_google_images("keyword")
webb.download_google_images(['keyword 1','keyword 2','keyword 3'])
Get links of all the images in a given web page:
webb.get_all_images("your-web-page-url")
Get links of all the images in a given web page and download those images to the local disk:
webb.get_all_images("your-web-page-url","download")
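To illustrate what converting a relative URL into an absolute one involves, here is a minimal sketch that uses only the Python 3 standard library. It is not webb's own implementation: the helper name `to_absolute` and the example.com URLs are hypothetical, and dropping the `#fragment` is an assumption about what a crawler would usually want.

```python
from urllib.parse import urljoin, urldefrag


def to_absolute(relative_url, seed_url):
    """Resolve relative_url against seed_url (hypothetical helper, not part of webb)."""
    # urljoin applies the standard relative-reference resolution rules.
    absolute = urljoin(seed_url, relative_url)
    # Assumption: fragments point inside a page, so a crawler can drop them.
    absolute, _fragment = urldefrag(absolute)
    return absolute


if __name__ == "__main__":
    seed = "http://example.com/docs/index.html"
    print(to_absolute("/about.html", seed))
    print(to_absolute("page2.html#top", seed))
```

A link that is already absolute passes through unchanged, so the same helper can be applied to every link found on a page.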

Status

This is a stand-alone Python script that is ready to run but still under development. Many more features will be added shortly.

Disclaimer

The crawler function lets you download and crawl a large number of web pages. Please do not download or crawl any pages of a domain without first reading that domain's robots.txt file.
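One possible way to check a domain's rules before crawling is the standard library's `urllib.robotparser` module (Python 3). The sketch below is an illustration only and is not a feature of webb: the user-agent name `my-crawler`, the sample rules, and the example.com URLs are hypothetical. The rules are parsed from an in-memory list so that the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied as a list of lines.
SAMPLE_ROBOTS = [
    "User-agent: *",
    "Disallow: /private/",
]


def build_parser(robots_lines):
    """Return a RobotFileParser loaded with the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


parser = build_parser(SAMPLE_ROBOTS)

# can_fetch(useragent, url) reports whether the rules permit the request.
allowed = parser.can_fetch("my-crawler", "http://example.com/index.html")
blocked = parser.can_fetch("my-crawler", "http://example.com/private/data.html")

print(allowed, blocked)
```

For a live site, `parser.set_url("http://example.com/robots.txt")` followed by `parser.read()` would fetch the real file instead of the sample lines.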

Violating a site's robots.txt file is inappropriate and may lead to the domain blocking or blacklisting your crawler. It is also inappropriate to crawl pages at a high rate, as this can put heavy load on the server.
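The idea of a breadth-first crawl that pauses between pages can be sketched as follows. This is a hedged illustration of the general technique, not the code behind `webb.web_crawl`: the function `crawl_breadth_first`, its parameters, and the in-memory `LINK_GRAPH` standing in for real pages are all hypothetical.

```python
import time
from collections import deque


def crawl_breadth_first(seed, get_links, delay_seconds=0.0, max_pages=100):
    """Visit pages level by level, sleeping between pages to limit server load.

    get_links is any callable that returns the links found on a page.
    """
    visited = []          # pages in the order they were crawled
    seen = {seed}         # pages already queued, to avoid repeats
    queue = deque([seed])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
        # Pause only when more pages are waiting to be crawled.
        if queue and delay_seconds > 0:
            time.sleep(delay_seconds)
    return visited


# Hypothetical link graph used in place of real web pages.
LINK_GRAPH = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}

order = crawl_breadth_first("a", lambda url: LINK_GRAPH.get(url, []))
print(order)
```

A FIFO queue gives the breadth-first order, and the `max_pages` cap is an added safeguard so that a crawl cannot grow without bound.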

