GitHub - amerfarooq/multiclass-text-classification: A Jupyter notebook that employs a variety of techniques to perform Multiclass Text Classification

Dataset:

The dataset I was working with consisted of emails spread across 21 classes. It contained 450 rows, and I used an 80-20 split for training and evaluation.

Techniques used:

- Sklearn: SVC with 2 features: TF-IDF vectorizer (ngram_range = (1,3)) and embeddings generated using SpacyVectorTransformer
- TextCategorizer in spaCy
- Ludwig
- fastText
- Keras
- Fine-tuned RoBERTa using ktrain

Results:

The best results I had came from using a fine-tuned RoBERTa model (roberta-large).

Text Classification Techniques

Categorizing emails into 21 distinct classes is difficult for two reasons:
- there is simply not enough data to properly train a model
- the classes are imbalanced, e.g. some classes have 40 rows while others have only 5.

One technique I tried was to group similar classes into categories. The idea was to train one classifier that places an email into a category, and then another classifier that further classifies that email into one of the classes belonging to that category. With this method, I ended up with four categories and a total of five classifiers: one that assigns an email to one of the four categories, and four that each assign an email to a particular class within their category. However, this technique did not produce decent results.

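
The two-stage idea can be sketched roughly as follows. This is a minimal illustration, not the notebook's implementation: the category and class names, the keyword rules, and the helper `classify_two_stage` are all hypothetical stand-ins for trained models.

```python
from typing import Callable, Dict

# A classifier here is any callable that maps an email's text to a label.
Classifier = Callable[[str], str]


def classify_two_stage(
    text: str,
    category_clf: Classifier,
    class_clfs: Dict[str, Classifier],
) -> str:
    """Route an email to a category first, then to a class within it."""
    category = category_clf(text)       # stage 1: pick the coarse category
    return class_clfs[category](text)   # stage 2: pick the class in that category


# Hypothetical keyword-based stand-ins for trained models, for illustration only.
def toy_category_clf(text: str) -> str:
    lowered = text.lower()
    return "billing" if "invoice" in lowered or "refund" in lowered else "support"


toy_class_clfs: Dict[str, Classifier] = {
    "billing": lambda t: "refund_request" if "refund" in t.lower() else "invoice_query",
    "support": lambda t: "password_reset" if "password" in t.lower() else "general_help",
}

print(classify_two_stage("Please refund my last order", toy_category_clf, toy_class_clfs))
```

Note that in a cascade like this a mistake in stage 1 cannot be corrected in stage 2, since the email is only ever shown to the classifier of the category it was routed to.
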
SVC (Support Vector Classifier) from Sklearn:

I used the SVC from Sklearn and initialized it with the following parameters:
```
SVC(C=150, gamma=0.02, probability=True)
```
Please have a look at this article to get a better understanding of the SVC parameters.

No machine learning classifier or model can operate directly on text. Models only understand numbers, whether the input is text, images, etc., so we must first convert our text data into a numeric form. To do this, I used two techniques in conjunction with the SVC classifier:
- The Sklearn library provides three vectorizers for converting text to numeric form: TfidfVectorizer, CountVectorizer and HashingVectorizer. After testing all of them, TfidfVectorizer with the following parameters gave the best results:
```
TfidfVectorizer(ngram_range=(1, 3))
```
  The ngram_range defines the lengths of the token sequences (n-grams) extracted from the text, e.g. for the text he ate the apple, a value of 1 will produce: [he, ate, the, apple], which are called unigrams. A value of 2 will produce: [he ate, ate the, the apple], which are called bigrams, a value of 3 produces trigrams, and so on. By giving a range of (1,3) we are using unigrams, bigrams and trigrams together.
- A vectorizer utilizing word vectors from spaCy.
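
The n-gram expansion described above can be sketched in plain Python. This is only an illustration of the idea: it splits on whitespace, whereas TfidfVectorizer applies its own tokenization (by default it lowercases the text and drops single-character tokens). The helper names `word_ngrams` and `ngrams_in_range` are made up for this example.

```python
from typing import List


def word_ngrams(text: str, n: int) -> List[str]:
    """Return all runs of n consecutive whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngrams_in_range(text: str, low: int, high: int) -> List[str]:
    """Collect n-grams for every n from low to high, inclusive."""
    result: List[str] = []
    for n in range(low, high + 1):
        result.extend(word_ngrams(text, n))
    return result


print(word_ngrams("he ate the apple", 2))
# ['he ate', 'ate the', 'the apple']
print(ngrams_in_range("he ate the apple", 1, 3))
```

For a four-word text, a range of (1, 3) yields 4 unigrams, 3 bigrams and 2 trigrams, so the feature space grows quickly as the upper bound increases.
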
These two vectorizers are used together in the following way:
```
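from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

# SpacyVectorTransformer and nlp (a loaded spaCy model) are assumed to be
# defined elsewhere in the notebook.
outer_pipe = Pipeline(
    steps=[
        # FeatureUnion runs both vectorizers and concatenates their outputs
        ("combined_features", FeatureUnion(
            transformer_list=[
                ("tfid", TfidfVectorizer(ngram_range=(1, 3))),
                ("embed", SpacyVectorTransformer(nlp)),
            ]
        )),
        ("classifier", SVC(C=150, gamma=0.02, probability=True)),
    ]
)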
```
The outputs of both vectorizers are concatenated and then fed to the SVC model described earlier.

Ludwig:

Ludwig is an ML library developed by Uber that allows models to be created without spending effort on coding them.

The following model was used:

```
import logging
from ludwig.api import LudwigModel

model_definition = {
    'input_features': [
        {'name': 'Text', 'type': 'text', 'level': 'word', 'dropout': True}
    ],
    'output_features': [
        {'name': 'Label', 'type': 'category'}
    ],
    'training': {'epochs': 50}
}

model = LudwigModel(model_definition)
training_stats = model.train(training_dataframe, logging_level=logging.INFO)
```

The results from this model were not particularly good, with an accuracy of around 50% at best.

fastText:

fastText is a text classification library from Facebook. It does not provide many facilities for fine-tuning the model and requires its input in a specific format. Its results were better than Ludwig's but still not particularly good.

Preparing data for fastText:

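```
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("Train.csv")
data = data.dropna()
training_df, validation_df = train_test_split(data, test_size=0.20, random_state=45)

def create_file(df, filename):
    # fastText expects one example per line: "__label__<label> <text>"
    with open(filename + '.txt', "w") as file:
        for index, row in df.iterrows():
            label = row['Label'].replace("_", "-")
            # Replace line breaks with spaces so each email stays on one line
            # without gluing neighbouring words together
            text = row['Text'].replace('\r', ' ').replace('\n', ' ')
            # tokenizer is assumed to be a helper defined elsewhere in the notebook
            text = tokenizer(text)

            file.write('__label__' + label + ' ' + text + '\n')

create_file(training_df, 'fasttext-train')
create_file(validation_df, 'fasttext-val')
```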

Training model:

```
import fasttext

model = fasttext.train_supervised(input="fasttext-train.txt", autotuneValidationFile='fasttext-val.txt')
```

The autotuneValidationFile parameter turns on fastText's automatic hyperparameter search, which tunes the model's parameters against the given validation file.
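
As a small illustration of the input format fastText expects, the sketch below builds one training line and strips the `__label__` prefix that fastText puts on predicted labels. The helper names and the example label `leave_request` are hypothetical and do not come from the notebook.

```python
LABEL_PREFIX = "__label__"


def to_fasttext_line(label: str, text: str) -> str:
    """Format one example as a single line: '__label__<label> <text>'."""
    single_line = " ".join(text.split())  # collapse newlines and repeated spaces
    return f"{LABEL_PREFIX}{label.replace('_', '-')} {single_line}"


def strip_label_prefix(predicted: str) -> str:
    """Turn a predicted '__label__xyz' string back into 'xyz'."""
    if predicted.startswith(LABEL_PREFIX):
        return predicted[len(LABEL_PREFIX):]
    return predicted


print(to_fasttext_line("leave_request", "Hi,\r\nI need two days off."))
# __label__leave-request Hi, I need two days off.
print(strip_label_prefix("__label__leave-request"))
# leave-request
```
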

ktrain:

ktrain is a library designed to make deep learning and other ML models easy to use. We are particularly interested in its text classification facilities. Overall, the roberta-large model gave the best results in terms of prediction accuracy.

```
import ktrain
from ktrain import text
from sklearn import metrics

MODEL_NAME = 'roberta-large'
t = text.Transformer(MODEL_NAME, classes=my_classes)
trn = t.preprocess_train(x_train.values, y_train.values)
val = t.preprocess_test(x_test.values, y_test.values)
model = t.get_classifier()
# Batch size is the number of data rows processed per training step. The
# larger the batch size, the more resource-intensive the training will be.
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=8)

# 5e-5 = 0.00005 and specifies the maximum learning rate.
# 6 is the number of epochs, i.e. full passes over the training data.
learner.fit_onecycle(5e-5, 6)

predictor = ktrain.get_predictor(learner.model, preproc=t)

# Testing the model's accuracy
predictions = predictor.predict(x_test.values)
# accuracy_score returns the fraction of correct predictions by default
print("Accuracy:", metrics.accuracy_score(y_true=y_test.values, y_pred=predictions))
print(metrics.classification_report(y_true=y_test.values, y_pred=predictions))
```

Creating an Ensemble of SVC and RoBERTa:

An ensemble is a collection of multiple ML models that are trained on the same input and combine their outputs into a single prediction. Since SVC and RoBERTa were giving me the best results, I tried creating an ensemble of the two to see if it could overcome their individual weaknesses and combine their strengths:
```
from mlxtend.classifier import EnsembleVoteClassifier
ensemble = EnsembleVoteClassifier(clfs=[pipe, predictor], weights=[1, 1], voting='soft', refit=False)
ensemble.fit(x_train, y_train)
ensmbl_preds = ensemble.predict(x_test.values)
```
Here pipe and predictor in the clfs (classifiers) parameter of EnsembleVoteClassifier are the SVC pipeline and the RoBERTa predictor, respectively. The issue with this approach was that the probability RoBERTa assigned to its predicted class was usually very high, so it would almost always dominate the overall prediction. Hence, the ensemble gave slightly worse results than either of the individual models.

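
The dominance effect can be reproduced with a few lines of arithmetic. The sketch below is a simplified re-implementation of soft voting (a weighted average of class probabilities), not mlxtend's code, and the probabilities and class names are invented for illustration.

```python
from typing import Dict, List


def soft_vote(prob_dicts: List[Dict[str, float]], weights: List[float]) -> str:
    """Return the class with the highest weighted average probability."""
    total = sum(weights)
    classes = prob_dicts[0].keys()
    averaged = {
        c: sum(w * p[c] for w, p in zip(weights, prob_dicts)) / total
        for c in classes
    }
    return max(averaged, key=averaged.get)


# Hypothetical probabilities for one email over three classes.
svc_probs = {"billing": 0.45, "support": 0.35, "sales": 0.20}      # moderately confident
roberta_probs = {"billing": 0.02, "support": 0.97, "sales": 0.01}  # near-certain

print(soft_vote([svc_probs, roberta_probs], weights=[1, 1]))
# support
```

With equal weights the averaged scores are 0.235 for billing and 0.66 for support, so the near-certain model decides the outcome even though the other model preferred a different class.
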
Techniques employed for performance tuning:
1. Parameter optimization:
  
  Parameter optimization involves trying out many different parameter values for an ML model to find the values that give the best results, e.g. the following are ranges for different parameters of a RandomForest:
```
params = {
    "combined_features__bow__tfidf__use_idf": [True, False],
    "combined_features__bow__tfidf__ngram_range": [(1, 1), (1, 2)],
    "classifier__bootstrap": [True, False],
    "classifier__class_weight": ["balanced", None],
    "classifier__n_estimators": [100, 300, 500, 800, 1200],
    "classifier__max_depth": [5, 8, 15, 25, 30],
    "classifier__min_samples_split": [2, 5, 10, 15, 100],
    "classifier__min_samples_leaf": [1, 2, 5, 10]
}
search = RandomizedSearchCV(pipe, params)
```
  RandomizedSearchCV tries random variations of the parameters to find the best fit. There is also a GridSearchCV which is much more exhaustive and tries all possible combinations. However, this can take a very long time which is why the randomized search is sometimes preferred.
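
To get a feel for the difference in cost, the sketch below enumerates a grid and then samples from it using only the standard library. It mimics the idea behind the two search strategies rather than calling Sklearn, and the reduced `small_params` grid is a hypothetical subset chosen to keep the example short.

```python
import itertools
import random

# Hypothetical, smaller grid in the same shape as the one above.
small_params = {
    "classifier__bootstrap": [True, False],
    "classifier__n_estimators": [100, 300, 500, 800, 1200],
    "classifier__max_depth": [5, 8, 15, 25, 30],
}

# Grid search: every combination of every value (2 * 5 * 5 = 50 candidates).
names = list(small_params)
grid = [dict(zip(names, values)) for values in itertools.product(*small_params.values())]
print(len(grid))  # 50

# Randomized search: draw a fixed number of candidates instead of trying them all.
rng = random.Random(0)  # seeded so the draw is repeatable
n_iter = 10
sampled = rng.sample(grid, n_iter)
print(len(sampled))  # 10
```

The full grid shown earlier has 2 x 2 x 2 x 2 x 5 x 5 x 5 x 4 = 8000 combinations, each of which would be fitted once per cross-validation fold, whereas RandomizedSearchCV evaluates only n_iter candidates (10 by default).
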
2. Data Augmentation:
  
  This technique is used when the training data is limited. The idea is to use different strategies to augment the available training data, e.g. for text classification we can replace words with their synonyms. As an example, for the sentence amer enjoyed eating the apple, we could generate another similar sentence: amer liked eating the apple.
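
A minimal sketch of synonym replacement is shown below. The synonym table is a tiny hand-written, hypothetical example; in practice the synonyms would come from a thesaurus or a similar resource, and the function name `augment` is made up for this illustration.

```python
from typing import Dict, List

# Hypothetical hand-written synonym table, for illustration only.
SYNONYMS: Dict[str, List[str]] = {
    "enjoyed": ["liked", "loved"],
    "eating": ["having"],
}


def augment(sentence: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Create one new sentence per (word, synonym) pair by swapping a single word."""
    tokens = sentence.split()
    variants = []
    for i, token in enumerate(tokens):
        for alt in synonyms.get(token.lower(), []):
            variants.append(" ".join(tokens[:i] + [alt] + tokens[i + 1:]))
    return variants


for variant in augment("amer enjoyed eating the apple", SYNONYMS):
    print(variant)
# amer liked eating the apple
# amer loved eating the apple
# amer enjoyed having the apple
```

Each generated sentence keeps the label of the sentence it was derived from, and the generated rows should only be added to the training split so that the evaluation data stays untouched.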
