GitHub - Sharpentine/Text_Representation_Analysis_of_BBC_Article_on_Dengue: Applying NLP preprocessing and feature engineering techniques to analyze a newspaper article. This will help understand how these methods transform raw text into meaningful data for machine learning models.

Text Representation Analysis of BBC Article on Dengue

Introduction

This report examines the application of three text representation methods – Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and Word Embeddings – to a BBC article focusing on dengue fever in the Philippines. The primary goal is to assess the effectiveness of each method in capturing the article's semantic meaning and facilitating further analysis.

Data and Methods

The dataset used for this analysis is a news article sourced from BBC News, titled "Philippine town offers bounty for mosquitoes as dengue rises." The article details the response of a Philippine town to an increase in dengue cases by offering rewards for captured mosquitoes.

Three text representation methods were employed:

Bag of Words (BoW):
- Represents text as a collection of words and their frequencies, disregarding word order and semantic relationships.
TF-IDF (Term Frequency-Inverse Document Frequency):
- Builds on BoW by weighting each word by its importance across the corpus: words that are frequent in a specific document but rare in the rest of the corpus receive higher weights.
Word Embeddings (Word2Vec):
- Represents words as dense vectors in a continuous space, capturing semantic relationships and placing words with similar meanings closer together.
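The first two representations can be sketched with the standard library alone. This is a minimal illustration, not the notebook's code: the three sentences are a hypothetical toy corpus (not quoted from the article), `tokenize`, `bag_of_words`, and `tf_idf` are names chosen here, and the TF-IDF formula is the unsmoothed textbook form, which differs from the smoothed variants that libraries typically use.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus; illustrative sentences, not quotes from the article.
docs = [
    "the town offers a bounty for mosquitoes",
    "dengue cases rise in the town",
    "mosquitoes spread dengue",
]


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Map each token to its raw count; word order is discarded."""
    return Counter(tokenize(text))


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = log(number of documents / documents containing the term)
    """
    counts = [bag_of_words(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append(
            {t: (n / total) * math.log(n_docs / df[t]) for t, n in c.items()}
        )
    return weights


bow = bag_of_words(docs[0])
weights = tf_idf(docs)
```

In this toy corpus "bounty" occurs in one document and "town" in two, so `weights[0]["bounty"]` is larger than `weights[0]["town"]`. Note that with a single document every IDF term is `log(1/1) = 0`, so a lone article has to be split into smaller units (sentences or paragraphs, for example) before TF-IDF weights carry any information.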

Findings

BoW:
- This method successfully created a basic representation of the article's vocabulary and word frequencies.
- However, it failed to capture the semantic relationships between words, hindering a deeper understanding of the article's context.
TF-IDF:
- While offering an improvement over BoW by highlighting important words, TF-IDF still exhibited limitations in capturing the nuances of the article's meaning.
- Because each word is still treated as an independent dimension, it could not represent the connections between words like "dengue" and "fever" or "mosquito" and "insect."
Word Embeddings:
- This method proved to be the most effective in capturing the article's semantic meaning.
- By leveraging pre-trained word embeddings (word2vec-google-news-300), it successfully represented words as vectors in a semantic space, allowing for the identification of relationships between words and phrases.
- This representation provided a deeper understanding of the article's context and subject matter.
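The difference between the findings above comes down to how similarity behaves in each vector space. The sketch below compares cosine similarity for one-hot (BoW-style) vectors and for dense vectors. The dense values are made up for illustration and are not taken from word2vec-google-news-300, and `cosine` is a helper defined here, not a library call.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# One-hot vectors over the vocabulary [dengue, fever, bounty]:
# every word gets its own dimension, as in BoW and TF-IDF.
one_hot = {
    "dengue": [1, 0, 0],
    "fever": [0, 1, 0],
    "bounty": [0, 0, 1],
}

# Made-up 3-dimensional dense vectors (assumption, for illustration only).
toy_embeddings = {
    "dengue": [0.9, 0.8, 0.1],
    "fever": [0.8, 0.9, 0.2],
    "bounty": [0.1, 0.2, 0.9],
}

sparse_sim = cosine(one_hot["dengue"], one_hot["fever"])
dense_related = cosine(toy_embeddings["dengue"], toy_embeddings["fever"])
dense_unrelated = cosine(toy_embeddings["dengue"], toy_embeddings["bounty"])
```

Any two distinct words have similarity 0 in the one-hot space, so no relationship between them can be expressed. In a dense space, related words can be placed close together, which is the property the pre-trained embeddings supply.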

Analysis and Discussion

Each method exhibits distinct strengths and weaknesses:

BoW:
- Strengths: Simple and efficient, suitable for basic text analysis.
- Weaknesses: Overlooks semantic relationships and word order. Inadequate for complex semantic understanding.
TF-IDF:
- Strengths: Improves upon BoW by considering word importance.
- Weaknesses: Still lacks the ability to capture deeper semantic relationships.
Word Embeddings:
- Strengths: Offers the most comprehensive semantic representation, capturing relationships between words and phrases, enabling a richer understanding of the text.
- Weaknesses: Computationally expensive to train, and a pre-trained Word2Vec model has no vector for out-of-vocabulary words, which must be skipped or handled separately.
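To illustrate the out-of-vocabulary weakness, here is a minimal sketch of one common way to turn word vectors into a document vector: average the vectors of the known words and skip the rest. Whether the notebook aggregates vectors this way is an assumption. `document_vector` is a hypothetical helper, the 2-dimensional vectors are made up, and "barangay" is used only as an example of a token missing from the toy vocabulary.

```python
def document_vector(tokens, embeddings, dim):
    """Average the vectors of in-vocabulary tokens.

    Returns (vector, skipped), where `skipped` lists the tokens that had
    no embedding and therefore contributed nothing to the average.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    skipped = [t for t in tokens if t not in embeddings]
    if not known:
        # No known words at all: fall back to a zero vector.
        return [0.0] * dim, skipped
    vector = [sum(column) / len(known) for column in zip(*known)]
    return vector, skipped


# Made-up 2-dimensional vectors (assumption, for illustration only).
toy_embeddings = {
    "dengue": [1.0, 0.0],
    "fever": [0.0, 1.0],
}

vector, skipped = document_vector(["dengue", "fever", "barangay"], toy_embeddings, dim=2)
```

Here `vector` is `[0.5, 0.5]` and `skipped` is `["barangay"]`. Reporting the skipped tokens makes it visible how much of a text the embedding model silently ignores.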

Conclusion

For this specific task of capturing the semantic meaning of a news article about dengue, Word Embeddings emerged as the most effective method. By representing words as vectors in a semantic space, it facilitated the understanding of relationships between crucial concepts like "dengue," "fever," "mosquito," and "bounty."

While BoW and TF-IDF provide basic representations, they fall short in capturing the deeper meaning and context embedded in the article. For tasks requiring nuanced semantic understanding, such as topic modeling or sentiment analysis, Word Embeddings are generally the better starting point.

Repository files:
- README.md
- Text_Representation_Analysis_of_BBC_Article_on_Dengue.ipynb
