
Natural Language Processing (NLP) with NLTK and spaCy 🤖📚

This repository demonstrates basic NLP tasks using the NLTK and spaCy libraries. It contains Python notebooks with code and explanations for tasks such as tokenization, POS tagging, N-gram language modeling, named entity recognition (NER), text classification, stemming, and more.


Getting Started 🚀

To get started with this repository, follow the steps below:

Prerequisites 🛠️

  1. Python 3.7+
  2. Install the required libraries:
    pip install nltk spacy
    python -m spacy download en_core_web_sm
  3. Download the NLTK data used by the notebooks:
    python -m nltk.downloader punkt stopwords averaged_perceptron_tagger

Files in the Repository 📂

  • 📘 n_gram_language_model.ipynb: Build and evaluate N-gram language models using NLTK.
  • 🌍 named_entity_recognition.ipynb: Perform NER with NLTK and spaCy.
  • 🏷️ pos_tagging.ipynb: POS tagging using NLTK and spaCy.
  • spelling_correction.ipynb: Demonstrates spelling correction techniques.
  • 🌱 stemming_stopwords.ipynb: Covers stemming types and stopword removal methods.
  • 📊 text_classification.ipynb: Basic text classification using NLTK.
  • ✂️ tokenization.ipynb: Different tokenization techniques using NLTK and spaCy.

Topics Covered 📖

1. Tokenization ✂️

Splitting a sentence into words or subwords.

from nltk.tokenize import word_tokenize
# word_tokenize needs the "punkt" models: nltk.download("punkt")
sentence = "Natural Language Processing is exciting!"
tokens = word_tokenize(sentence)
print(tokens)

📌 Output: ['Natural', 'Language', 'Processing', 'is', 'exciting', '!']
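Beyond word_tokenize, NLTK also ships rule-based tokenizers that need no downloaded models. A quick sketch of two alternatives (the tokenizer choices here are illustrative, not taken from the notebook):

```python
from nltk.tokenize import TreebankWordTokenizer, RegexpTokenizer

sentence = "Natural Language Processing is exciting!"

# Penn Treebank conventions: punctuation becomes its own token
treebank_tokens = TreebankWordTokenizer().tokenize(sentence)

# Regex tokenizer: keep only alphanumeric runs (drops the "!")
regexp_tokens = RegexpTokenizer(r"\w+").tokenize(sentence)

print(treebank_tokens)
print(regexp_tokens)
```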


2. N-gram Language Model with NLTK 📈

Building N-grams and predicting the next word.

from nltk import ngrams
sentence = "I am learning NLP."
n_grams = list(ngrams(sentence.split(), 2))
print(n_grams)

🔗 Output: [('I', 'am'), ('am', 'learning'), ('learning', 'NLP.')]
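The "predicting the next word" part can be sketched with a conditional frequency distribution over bigrams. The toy corpus below is invented for illustration:

```python
from nltk import ngrams, ConditionalFreqDist

corpus = "I am learning NLP . I am enjoying NLP . I am learning fast .".split()

# Map each word to the frequency distribution of the words that follow it
cfd = ConditionalFreqDist(ngrams(corpus, 2))

# The most frequent successor of "am" in this toy corpus
prediction = cfd["am"].max()
print(prediction)
```

With more data, the same structure supports generating text by repeatedly sampling a likely successor.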


3. Named Entity Recognition (NER) 🌍

Identifying entities like names, locations, and dates in text using spaCy.

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama was the 44th President of the USA.")
for ent in doc.ents:
    print(ent.text, ent.label_)

📌 Output:

  • Barack Obama PERSON
  • 44th ORDINAL
  • USA GPE

4. Part-of-Speech (POS) Tagging 🏷️

Tagging words in a sentence with their respective parts of speech.

from nltk import pos_tag
from nltk.tokenize import word_tokenize
# Requires nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")
sentence = "NLTK makes POS tagging simple."
tags = pos_tag(word_tokenize(sentence))
print(tags)

📌 Output: [('NLTK', 'NNP'), ('makes', 'VBZ'), ('POS', 'NNP'), ('tagging', 'NN'), ('simple', 'JJ'), ('.', '.')]


5. Spelling Correction

Correcting misspelled words using NLTK's edit_distance.

from nltk.metrics.distance import edit_distance
def correct_spelling(word, vocab):
    return min(vocab, key=lambda x: edit_distance(word, x))

vocab = {"learning", "machine", "intelligence"}
print(correct_spelling("lerning", vocab))

📌 Output: learning
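edit_distance returns the Levenshtein distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. A couple of examples (the word pairs are chosen for illustration):

```python
from nltk.metrics.distance import edit_distance

# "lerning" -> "learning": insert one "a"
d1 = edit_distance("lerning", "learning")

# "rain" -> "shine": substitute r->s and a->h, insert e
d2 = edit_distance("rain", "shine")

print(d1, d2)
```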


6. Stemming Types 🌱

Reducing words to their base form using algorithms like Porter and Lancaster stemmers.

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem("running"))  # Output: run

🔗 Other Examples:

  • learning → learn
  • connected → connect
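The Lancaster stemmer mentioned above is more aggressive than Porter, so the two can disagree. A quick side-by-side (the word list is illustrative):

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Compare the two algorithms on the same words
for word in ["running", "connected", "learning"]:
    print(f"{word}: porter={porter.stem(word)}, lancaster={lancaster.stem(word)}")
```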

7. Stopword Removal 🛑

Removing common stopwords that do not add much meaning.

from nltk.corpus import stopwords
# Requires nltk.download("stopwords"); compare lowercased so "I" is removed too
stop_words = set(stopwords.words("english"))
words = ["I", "am", "learning", "NLP", "with", "NLTK"]
filtered_words = [w for w in words if w.lower() not in stop_words]
print(filtered_words)

📌 Output: ['learning', 'NLP', 'NLTK']


8. Text Classification 📊

Classifying text into predefined categories using NLTK.

from nltk.classify import NaiveBayesClassifier
train_data = [({"word": "love"}, "positive"), ({"word": "hate"}, "negative")]
classifier = NaiveBayesClassifier.train(train_data)
print(classifier.classify({"word": "love"}))  # Output: positive

📌 Output: positive
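The single-word features above are a toy. A slightly more realistic sketch uses a bag-of-words feature extractor over whole sentences; the training data here is invented for illustration:

```python
from nltk.classify import NaiveBayesClassifier

def bow_features(text):
    # Bag-of-words: every lowercase token becomes a boolean feature
    return {f"contains({w})": True for w in text.lower().split()}

train = [
    ("I love this movie", "positive"),
    ("what a great film", "positive"),
    ("I hate this movie", "negative"),
    ("what a terrible film", "negative"),
]
classifier = NaiveBayesClassifier.train(
    [(bow_features(text), label) for text, label in train]
)

# "great" only appears in positive training examples
label = classifier.classify(bow_features("a great movie"))
print(label)
```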


How to Use the Repository

  1. Clone the repository:
    git clone https://github.com/rushikeshraghatate90/Natural_Language_Processing.git
  2. Navigate to the project directory:
    cd Natural_Language_Processing
  3. Open any .ipynb file in Jupyter Notebook or JupyterLab to explore the code.


Contributing

Contributions are welcome! Please fork the repository and submit a pull request with your changes.


License

This project is licensed under the MIT License - see the LICENSE file for details.
