The "Indian Cricket Team Web Crawler" project aims to develop a web crawler that extracts data from Wikipedia pages related to the Indian Cricket Team. The project begins by scraping Wikipedia to gather information about the team, extracting relevant links, and storing the content in HTML format within a directory. Subsequently, an indexer processes this data to generate a JSON file containing title and content information. Using this indexed data, a TF-IDF vectorizer and cosine similarity model are constructed.
The project utilizes Flask to create a web application where users can input queries. The cosine similarity model is employed to compute similarities between the user query and stored documents, allowing the display of top-k relevant documents based on the query.
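The query-matching step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the documents and query below are placeholders, while the real system builds the vectorizer from the indexed Wikipedia pages.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder corpus; the real project loads records from its JSON index.
documents = [
    "Virat Kohli is a former captain of the Indian cricket team.",
    "The Indian cricket team won the 1983 Cricket World Cup.",
    "Sachin Tendulkar scored one hundred international centuries.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(documents)  # shape: (n_docs, n_terms)

def top_k(query: str, k: int = 2):
    """Return the k documents most similar to the query, with scores."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, tfidf_matrix).ravel()
    ranked = scores.argsort()[::-1][:k]
    return [(documents[i], float(scores[i])) for i in ranked]

for doc, score in top_k("1983 world cup win"):
    print(f"{score:.3f}  {doc}")
```

The same vectorizer must be used to transform the query as was fitted on the corpus, so both live in the same term space.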
Next steps involve optimizing the web crawler's efficiency, refining the indexing and querying mechanisms, and enhancing the user interface for a seamless experience.
The "Indian Cricket Team Web Crawler" project focuses on extracting title and content data from Wikipedia pages related to the Indian Cricket Team using Scrapy, a web crawling and scraping framework. The solution outline is as follows:
1. **Web Scraping and Data Extraction**
   - Use Scrapy to crawl Wikipedia pages related to the Indian Cricket Team.
   - Save HTML files of the crawled pages for further processing.
2. **Indexer (Data Processing and Indexing)**
   - Extract title and content information from the saved HTML files.
   - Clean and preprocess the extracted data for storage and indexing.
   - Generate `webpage_records.json` and create the TF-IDF vectorizer.
3. **Processor (Querying and Presentation)**
   - Develop a command-line interface (CLI) or Flask-based web application to interact with the indexed data.
   - Implement commands or routes to query and retrieve relevant information from the stored dataset.
   - Use cosine similarity to compute relevance scores and present top-k results based on user queries.
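The indexer step of the outline can be sketched as below, using BeautifulSoup4 (one of the project's listed dependencies). This is a hedged outline: the directory of saved pages and the exact record fields beyond `title` and `content` are assumptions, not the project's actual implementation.

```python
import json
from pathlib import Path

from bs4 import BeautifulSoup

def extract_record(html: str) -> dict:
    """Extract the title and visible text content from one saved HTML page."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    # Drop script/style blocks so only readable text remains.
    for tag in soup(["script", "style"]):
        tag.decompose()
    content = " ".join(soup.get_text(separator=" ").split())
    return {"title": title, "content": content}

def build_index(pages_dir: str, out_file: str = "webpage_records.json") -> None:
    """Turn every saved HTML file into a title/content record and write JSON."""
    records = [extract_record(p.read_text(encoding="utf-8"))
               for p in sorted(Path(pages_dir).glob("*.html"))]
    Path(out_file).write_text(json.dumps(records, indent=2), encoding="utf-8")
```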
Follow these steps to set up and run the Indian Cricket Team Information Retrieval System on your local machine:
1. **Clone the Project**
   - Clone the project repository to your local machine: use `git clone https://github.com/patroswastik/Web_Crawler.git`.
   - Navigate into the project directory (`indian_cricket_team_information_retrieval`).
2. **Configure Filepaths**
   - Locate and open the `config.ini` file in the project directory.
   - Update the filepaths in the configuration file to specify desired file locations.
3. **Create and Activate Virtual Environment**
   - Create a virtual environment for the project: run `python -m venv env`.
   - Activate the virtual environment:
     - On Windows: `.\env\Scripts\activate`
     - On macOS/Linux: `source env/bin/activate`
4. **Install Requirements**
   - Install project dependencies: execute `pip install -r requirements.txt`.
5. **Scrape Data**
   - Navigate to the `indian_cricket_team_crawler` directory.
   - Run the Scrapy spider to start scraping: use `scrapy crawl indian_cricket`.
6. **Generate TF-IDF Index**
   - Go to the `indian_cricket_team_indexer` directory.
   - Execute `python tfidf_generator.py` to generate the TF-IDF index.
   - After execution, `tfidf_index.pkl` will be generated, containing the TF-IDF matrix and vectorizer.
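The core of `tfidf_generator.py` can be sketched as follows, assuming `webpage_records.json` holds a list of `{"title", "content"}` records as described earlier; the exact pickle layout here is an assumption.

```python
import json
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

def generate_index(records_file: str = "webpage_records.json",
                   index_file: str = "tfidf_index.pkl") -> None:
    """Fit a TF-IDF vectorizer on the crawled pages and pickle both the
    matrix and the fitted vectorizer for the processor to load later."""
    with open(records_file, encoding="utf-8") as f:
        records = json.load(f)
    corpus = [r["title"] + " " + r["content"] for r in records]
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(corpus)
    with open(index_file, "wb") as f:
        pickle.dump({"matrix": matrix, "vectorizer": vectorizer}, f)
```

Pickling the fitted vectorizer alongside the matrix matters: queries must be transformed with the same vocabulary and IDF weights the corpus was fitted with.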
7. **Run Flask Server**
   - Move to the `indian_cricket_team_processor` directory.
   - Here the `tfidf_index.pkl` file is loaded to recover the TF-IDF matrix and vectorizer, which are then used for the cosine similarity computation.
   - Start the Flask server to run the information retrieval system: run `python processor.py`.
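The processor can be sketched as below. The route path `/primary_search/<query_search>` comes from the API section of this README; the pickle layout (`{"matrix", "vectorizer"}`), the response shape, and the top-5 cutoff are assumptions for illustration.

```python
import pickle

from flask import Flask, jsonify
from sklearn.metrics.pairwise import cosine_similarity

def create_app(index):
    """Build the Flask app around a loaded index dict
    (assumed layout: {"matrix": tfidf_matrix, "vectorizer": fitted_vectorizer})."""
    app = Flask(__name__)

    @app.route("/primary_search/<query_search>")
    def primary_search(query_search):
        # Rank stored documents by cosine similarity to the query.
        query_vec = index["vectorizer"].transform([query_search])
        scores = cosine_similarity(query_vec, index["matrix"]).ravel()
        top_k = scores.argsort()[::-1][:5]
        return jsonify([{"doc_id": int(i), "score": float(scores[i])}
                        for i in top_k])

    return app

if __name__ == "__main__":
    with open("tfidf_index.pkl", "rb") as f:
        app = create_app(pickle.load(f))
    app.run(port=5000)
```

Wrapping the app in a factory keeps the pickle load out of import time, which also makes the route easy to exercise with Flask's test client.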
8. **Access the API**
   - Once the Flask server is running, use a tool such as Postman to access the endpoint `http://127.0.0.1:5000/primary_search/<query_search>` and retrieve search results for a specific query.
9. **Use Web Interface (Additional)**
   - Open a web browser and visit `http://127.0.0.1:5000/search`.
   - Enter a query in the input box on the webpage to perform a search and view results.
This project has been a valuable learning experience, allowing me to deepen my understanding of information retrieval techniques while gaining practical skills in web scraping using Scrapy and API development with Flask. Building a fully working web crawler has enhanced my proficiency in data extraction, processing, and API integration. Moving forward, I am excited to apply these skills to future projects and explore more advanced applications of web data retrieval and analysis.
All the HTML files related to the Indian Cricket Team downloaded by Scrapy can be found in the `Webpages` directory.

A `requirements.txt` is included so the libraries can be installed directly. The following libraries have been used:
- **Scrapy**: Scrapy is a powerful web crawling and scraping framework used to extract data from websites and APIs efficiently.
- **Flask**: Flask is a lightweight and flexible web framework for building web applications in Python, providing tools and libraries to create RESTful APIs and web services.
- **Pandas**: Pandas is a popular library for data manipulation and analysis in Python, offering data structures and tools for reading, writing, and processing structured data.
- **Scikit-learn**: Scikit-learn is a comprehensive machine learning library in Python, providing simple and efficient tools for data mining and analysis, including algorithms for classification, regression, clustering, and more.
- **BeautifulSoup4**: BeautifulSoup4 is a Python library for parsing HTML and XML documents, enabling easy navigation, extraction, and manipulation of data from web pages.
- Scrapy:
- Flask:
- Pandas:
  - Book: "Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython" by Wes McKinney
- TF-IDF Vectorizer with Scikit-learn:
  - Journal Article: "Scikit-learn: Machine Learning in Python" by Pedregosa et al. (2011)