This repository demonstrates basic NLP tasks using the NLTK and spaCy libraries. It contains a collection of Jupyter notebooks with Python code and explanations for tasks such as tokenization, POS tagging, N-gram language modeling, named entity recognition (NER), text classification, stemming, and more.
To get started with this repository, follow the steps below:
- Python 3.7+
- Install the required libraries:

```bash
pip install nltk spacy
python -m spacy download en_core_web_sm
```
- 📘 `n_gram_language_model.ipynb`: Build and evaluate N-gram language models using NLTK.
- 🌍 `named_entity_recognition.ipynb`: Perform NER with NLTK and spaCy.
- 🏷️ `pos_tagging.ipynb`: POS tagging using NLTK and spaCy.
- ✅ `spelling_correction.ipynb`: Demonstrates spelling correction techniques.
- 🌱 `stemming_stopwords.ipynb`: Covers stemming types and stopword removal methods.
- 📊 `text_classification.ipynb`: Basic text classification using NLTK.
- ✂️ `tokenization.ipynb`: Different tokenization techniques using NLTK and spaCy.
Splitting a sentence into words or subwords.
```python
import nltk
nltk.download("punkt")  # one-time download of the tokenizer models
from nltk.tokenize import word_tokenize

sentence = "Natural Language Processing is exciting!"
tokens = word_tokenize(sentence)
print(tokens)
```

📌 Output: `['Natural', 'Language', 'Processing', 'is', 'exciting', '!']`
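Rule-based tokenizers offer finer control over what counts as a token. A minimal sketch using NLTK's `RegexpTokenizer` (the pattern here is an illustrative choice, not taken from the notebooks):

```python
from nltk.tokenize import RegexpTokenizer

# Keep runs of word characters together; emit punctuation as separate tokens.
tokenizer = RegexpTokenizer(r"\w+|[^\w\s]")
sentence = "Natural Language Processing is exciting!"
print(tokenizer.tokenize(sentence))
# → ['Natural', 'Language', 'Processing', 'is', 'exciting', '!']
```

Unlike `word_tokenize`, this needs no downloaded models, which makes it handy for quick experiments.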
Building N-grams and predicting the next word.
```python
from nltk import ngrams

sentence = "I am learning NLP."
n_grams = list(ngrams(sentence.split(), 2))
print(n_grams)
```

🔗 Output: `[('I', 'am'), ('am', 'learning'), ('learning', 'NLP.')]`
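Bigram counts like these can drive a toy next-word predictor: count how often each bigram occurs in a corpus, then pick the most frequent continuation of a given word. A minimal sketch (the `corpus` and the `predict_next` helper are illustrative, not part of the notebooks):

```python
from collections import Counter
from nltk import ngrams

corpus = "i am learning nlp and i am enjoying nlp".split()
bigram_counts = Counter(ngrams(corpus, 2))

def predict_next(word):
    # Among bigrams starting with `word`, return the most frequent second word.
    candidates = {b: c for b, c in bigram_counts.items() if b[0] == word}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)[1]

print(predict_next("i"))  # → am  ("i am" occurs twice in the corpus)
```

Real N-gram language models add smoothing and probability estimates; the notebook covers evaluation in more depth.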
Identifying entities like names, locations, and dates in text using spaCy.
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama was the 44th President of the USA.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```

📌 Output:

```
Barack Obama PERSON
44th ORDINAL
USA GPE
```
Tagging words in a sentence with their respective parts of speech.
```python
import nltk
nltk.download("punkt")  # one-time downloads for the
nltk.download("averaged_perceptron_tagger")  # tokenizer and tagger models
from nltk import pos_tag
from nltk.tokenize import word_tokenize

sentence = "NLTK makes POS tagging simple."
tags = pos_tag(word_tokenize(sentence))
print(tags)
```

📌 Output: `[('NLTK', 'NNP'), ('makes', 'VBZ'), ('POS', 'NNP'), ('tagging', 'NN'), ('simple', 'JJ'), ('.', '.')]`
Correcting misspelled words using NLTK's edit_distance.
```python
from nltk.metrics.distance import edit_distance

def correct_spelling(word, vocab):
    # Return the vocabulary word with the smallest edit distance to `word`.
    return min(vocab, key=lambda x: edit_distance(word, x))

vocab = {"learning", "machine", "intelligence"}
print(correct_spelling("lerning", vocab))
```

📌 Output: `learning`
Reducing words to their base form using algorithms like Porter and Lancaster stemmers.
```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("running"))  # Output: run
```

🔗 Other examples: `learning` → `learn`, `connected` → `connect`
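The notebook also covers the Lancaster stemmer, which is more aggressive than Porter and often truncates words further. A quick side-by-side comparison (the word list is illustrative):

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

for word in ["running", "connected", "maximum"]:
    # Print each word alongside both stemmers' results for comparison.
    print(f"{word}: porter={porter.stem(word)}, lancaster={lancaster.stem(word)}")
```

Which stemmer to use depends on the task: Porter is gentler and more widely used; Lancaster collapses more word forms at the cost of readability.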
Removing common stopwords that do not add much meaning.
```python
import nltk
nltk.download("stopwords")  # one-time download of the stopword lists
from nltk.corpus import stopwords

words = ["I", "am", "learning", "NLP", "with", "NLTK"]
stop_words = set(stopwords.words("english"))
# Compare lowercased, since NLTK's stopword list is lowercase ("i", not "I").
filtered_words = [w for w in words if w.lower() not in stop_words]
print(filtered_words)
```

📌 Output: `['learning', 'NLP', 'NLTK']`
Classifying text into predefined categories using NLTK.
```python
from nltk.classify import NaiveBayesClassifier

train_data = [({"word": "love"}, "positive"), ({"word": "hate"}, "negative")]
classifier = NaiveBayesClassifier.train(train_data)
print(classifier.classify({"word": "love"}))
```

📌 Output: `positive`
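To classify whole sentences rather than single words, a common approach is word-presence ("bag of words") features. A minimal sketch, assuming a tiny hand-made training set and a `document_features` helper (both illustrative, not from the notebooks):

```python
from nltk.classify import NaiveBayesClassifier

train_sentences = [
    ("I love this movie", "positive"),
    ("This film is great", "positive"),
    ("I hate this movie", "negative"),
    ("This film is terrible", "negative"),
]

def document_features(text):
    # One boolean feature per lowercase token in the sentence.
    return {word: True for word in text.lower().split()}

train_data = [(document_features(t), label) for t, label in train_sentences]
classifier = NaiveBayesClassifier.train(train_data)
print(classifier.classify(document_features("I love this film")))  # → positive
```

Larger datasets and richer features (bigrams, TF-IDF weights) generally improve accuracy; this sketch only shows the mechanics of the API.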
- Clone the repository:

```bash
git clone https://github.com/rushikeshraghatate90/Natural_Language_Processing.git
```

- Navigate to the project directory:

```bash
cd Natural_Language_Processing
```

- Open any `.ipynb` file in Jupyter Notebook or JupyterLab to explore the code.
Contributions are welcome! Please fork the repository and submit a pull request with your changes.
This project is licensed under the MIT License - see the LICENSE file for details.