NLP_Project

This repository contains codes for training Chinese news classification models and the User interface which allows users to input Chinese texts(news) and predict a label.

1.Training classification models

Install the required packages:

First, create an environment and install packages using environment.yml with anaconda.

Download environment.yml, and create a new environment.
In the command line, type:
conda env create --name your_envname -f environment.yml
For example,
conda env create --name nlptask -f /Users/downloads/environment.yml
Then, conda activate your_envname (in the example, conda activate nlptask)

Dataset

Please go to this dataset repository and download the Chinanews dataset. Chinanews dataset has 7 categories: 'mainland China politics', 'Taiwan - Hong Kong- Macau politics', 'International news','financial news','culture','entertainment','sports'.

Both training dataset and testing dataset are required. Rename the training dataset file to 'Chinanews_train.csv',
and the test dataset file to 'Chinanews_test.csv'.

Fasttext pre-trained word vectors

Download Fasttext's Pre-trianed Chinese vector model at fasttext.

In the Models section, find 'Chinese' select 'bin' format file.
Download the file to a folder and unzip it.
Now you can find a 'cc.zh.300.bin' file in the folder.

Training

(1)Training LR, SVM and NB classifier With Fasttext's pre-trained word vectors
The model will be saved as output.

code	expected output(saved model)
LR_pretrained_segmented.py	LR_pretrained_segmented_model.sav
svm_pretrained_segmented.py	svm_pretrained_segmented_model.sav
NB_pretrained_segmented.py	nb_pretrained_segmented_model.sav

(2)Train a word vector model using the Chinanews Dataset
Run Chinese2vec.py, this step will generate a model called 'Chinanews_word2vec.model'.

(3)Training LR, SVM and NB classifier uisng the 'Chinanews_word2vec.model'

code	expected output(saved model)
LR_word2vec_segmented.py	LR_word2vec_segmented_model.sav
svm_word2vec_segmented.py	svm_word2vec_segmented_model.sav
NB_word2vec_segmented.py	nb_word2vec_segmented_model.sav

2.UI for predicting a Chinese(news) text to a category.

Check out the video showing how the UI works here!
Before using the 'GUI.py' code,
you can either run the LR_pretrained_segmented.py, svm_pretrained_segmented.py and save your classification model,
Or you can go to our pre-trained classification model, download 'LR_pretrained_segmented_model.sav' and 'svm_pretrained_segmented_model.sav'.
The pre-trained classification models are trained on 56000 pieces of Chinese news, each category 8000 piece.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
code		code
README.md		README.md
environment.yml		environment.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NLP_Project

1.Training classification models

Install the required packages:

Dataset

Fasttext pre-trained word vectors

Training

2.UI for predicting a Chinese(news) text to a category.

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

NLP_Project

1.Training classification models

Install the required packages:

Dataset

Fasttext pre-trained word vectors

Training

2.UI for predicting a Chinese(news) text to a category.

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages