This project collects data from different newspapers of a particular region, analyses it, and extracts salient topics. These topics can be used to surface relevant information about current trends in that region.
Articles pertaining to Bangalore were used for the analysis. The required articles were scraped from two newspapers:
- Deccan Herald
- The Indian Express
Data from August 2018 to June 2019 was scraped, extracted, cleaned, and stored in CSV (Comma-Separated Values) format. Special attention was paid to removing duplicates that may have occurred while scraping.
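The deduplicate-and-store step can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the sample records, the `(date, headline)` dedup key, and the `articles.csv` filename are all assumptions.

```python
import csv

# Hypothetical scraped records: (date, headline, body) tuples.
articles = [
    ("2018-08-01", "Metro line extended", "The metro line was extended..."),
    ("2018-08-01", "Metro line extended", "The metro line was extended..."),  # duplicate
    ("2018-08-02", "Lake cleanup drive", "Volunteers gathered at the lake..."),
]

# Drop duplicates while preserving scrape order, keyed on the
# (date, headline) pair so re-scraped copies of a story collapse.
seen = set()
unique_articles = []
for record in articles:
    key = (record[0], record[1])
    if key not in seen:
        seen.add(key)
        unique_articles.append(record)

# Persist to CSV for the downstream topic-modelling pipeline.
with open("articles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "headline", "body"])
    writer.writerows(unique_articles)
```

Keying on `(date, headline)` rather than the full body tolerates minor whitespace differences between scraped copies of the same story.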
The model used for training and clustering the data is LDA (Latent Dirichlet Allocation). Topic clusters were formed and then used to classify each article into a cluster based on the probability distribution of words in that cluster. A frequency table for each month was then created, which can serve as the basis for supervised models in future analysis.
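The cluster-and-count step can be sketched with scikit-learn's LDA implementation. This is an illustrative toy, assuming a tiny corpus, two topics, and invented month labels; the project's real vocabulary, cluster count, and hyperparameters are not shown here.

```python
from collections import Counter

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for cleaned article text.
docs = [
    "metro construction traffic road",
    "traffic police road accident",
    "school education students exam",
    "students exam results education",
]

# Bag-of-words counts feed the LDA model.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Fit LDA with a chosen number of topic clusters.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)  # per-document topic probabilities

# Classify each article into its most probable cluster.
labels = doc_topic.argmax(axis=1)

# Build a (month, cluster) frequency table, as described above.
months = ["2018-08", "2018-08", "2018-09", "2018-09"]  # illustrative dates
freq = Counter(zip(months, labels.tolist()))
```

Each row of `doc_topic` sums to one, so `argmax` picks the dominant topic; the `Counter` then aggregates cluster counts per month.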
The website was built with standard web technologies. For the frontend, the technologies used were:
- HTML5
- CSS3
- JavaScript
- Twitter Bootstrap
For the backend, the technology used was:
- Flask
For data querying, the CSV file generated by our model was used. Graphs were plotted using the Bokeh library in Python and rendered in the frontend using Jinja2 syntax. The graphs used were:
- Frequency Distributions corresponding to each month for a given cluster.
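Serving a Bokeh plot from Flask via Jinja2 can be sketched as below. This is a minimal assumption-laden example: the route, template, month labels, and counts are invented, and a real app would read them from the generated CSV.

```python
from flask import Flask, render_template_string
from bokeh.embed import components
from bokeh.plotting import figure
from bokeh.resources import CDN

app = Flask(__name__)

# Inline Jinja2 template; a real app would use a template file.
TEMPLATE = """
<html>
  <head>{{ resources|safe }}{{ script|safe }}</head>
  <body>{{ div|safe }}</body>
</html>
"""

@app.route("/")
def index():
    # Hypothetical per-month article counts for one cluster.
    months = ["Aug", "Sep", "Oct"]
    counts = [12, 19, 7]

    p = figure(x_range=months, title="Cluster frequency by month")
    p.vbar(x=months, top=counts, width=0.8)

    # components() returns <script> and <div> snippets that the
    # Jinja2 template embeds; CDN.render() loads BokehJS.
    script, div = components(p)
    return render_template_string(
        TEMPLATE, resources=CDN.render(), script=script, div=div
    )
```

The `|safe` filter is needed so Jinja2 does not escape the generated HTML and JavaScript.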
Data from other places can be added and trained with the LDA model, generating useful insights for those regions as well.