An all-in-one Python library to scrap, parse and crawl web pages
This is a light-weight, dynamic and featured Python library crawl, download, index, parse, scrap analyze web pages in a systematic manner.
This library is compatible with both Python 2 (2.x) as well as Python 3 (3.x) versions. It is a download-import-and-run program with no or little changes as required by users.
There are no dependencies to this project. It functions entirely of the standard in-build library support. It does not need any external support or installations. Just download and run!!!
- Download
webb.py - Create a new file in the same folder and give it any name like
main.py - Import the webb library by writing
import webbin themain.pyfile - Once this is done, use any of the following commands:
-
Traceroute a URL:
webb.traceroute("your-web-page-url") -
Get IP address of a webpage:
webb.get_ip("your-web-page-url") -
Download entire HTML page:
print (webb.download_page("your-web-page-url")) -
Print the page title:
webb.title("your-web-page-url") -
Find all the links in a web page and print it one below the other:
webb.find_all_links("your-web-page-url") -
Find all the URLs in a page and convert them into absolute URLs and print it
webb.find_all_links("www.zseries.in","absolute") -
Find all the links in a pirticular web page and print it as a list:
print (webb.find_all_links_as_list("your-web-page-url")) -
Crawl web pages in breathe-first manner:
webb.web_crawl("your-web-page-url") -
Crawl web pages with delay of 2 seconds after every page crawled:
webb.web_crawl("your-web-page-url",2) -
Normalize URL (Convert Relative URL to absolute URL):
print(webb.url_normalize("your-relative-url","your-seed-page-url")) -
Download Google Images from keywords:
webb.download_google_images("keyword")orwebb.download_google_images(['keyword 1','keyword 2','keyword 3']) -
Get links of all the images in a given web page:
webb.get_all_images("your-web-page-url") -
Get links of all the images in a given web page and download all those images on local disk (computer):
webb.get_all_images("your-web-page-url","download")
This is a stand-alone python script which is ready-to-run, but still under development. Many more features will be added to it shortly.
The crawler function lets you download and crawl tons of web pages. Please do not download and crawl any pages of a domain without reading about the robot.txt file of that domain.
It is inappropriate to violate the robot.txt file. This may even lead to the domain completely blocking your crawler and thus blacklisting it. It is also not appropriate to crawl pages at high rate as it may put a lot of pressure on the server.