MachineTranslation

Dataset :
Hindi-English parallel corpus consisting of 15k sentence pairs.

Pre-processing steps performed are :

Removing special characters, quotes and extra spaces
Adding start and end tokens to both the English and Hindi sentences
Building the Hindi and English vocabularies
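
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the token spellings `<start>` and `<end>`, the helper names `clean`, `add_tokens` and `build_vocab`, and the choice to treat every character that is not a letter, combining mark, digit or whitespace as "special" are all assumptions. Combining marks are kept because Devanagari vowel signs fall in that Unicode category.

```python
import unicodedata

# Assumed token spellings; the original project may use different ones.
START, END = "<start>", "<end>"


def clean(sentence: str) -> str:
    """Drop special characters and quotes, then collapse extra whitespace."""
    kept = []
    for ch in sentence:
        major = unicodedata.category(ch)[0]
        # Keep letters (L), combining marks (M, needed for Hindi vowel signs)
        # and digits (N); everything else becomes a space.
        if major in ("L", "M", "N") or ch.isspace():
            kept.append(ch)
        else:
            kept.append(" ")
    # split() + join() removes leading, trailing and repeated spaces.
    return " ".join("".join(kept).split())


def add_tokens(sentence: str) -> str:
    """Wrap an already cleaned sentence in start and end tokens."""
    return f"{START} {sentence} {END}"


def build_vocab(sentences):
    """Map each distinct token to an integer id in first-seen order."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.split():
            vocab.setdefault(token, len(vocab))
    return vocab


english = [add_tokens(clean('  "Hello,   world!" '))]
hindi = [add_tokens(clean("नमस्ते, दुनिया!"))]
english_vocab = build_vocab(english)
hindi_vocab = build_vocab(hindi)
```

Cleaning runs before the tokens are added so that the angle brackets in the tokens are not stripped as special characters.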

Model Architecture :
Used an encoder-decoder architecture consisting of LSTM modules with an attention mechanism for capturing long-range dependencies.
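
To make the attention step concrete, here is a small NumPy sketch of a single decoding step. The README does not say which attention variant or framework is used, so dot-product scoring is an assumption here, as are the function name `dot_product_attention` and the toy shapes. The random arrays only stand in for the LSTM encoder outputs and the current decoder hidden state.

```python
import numpy as np


def dot_product_attention(decoder_state, encoder_outputs):
    """One attention step (assumed dot-product scoring).

    decoder_state:   shape (hidden,)          current decoder hidden state
    encoder_outputs: shape (src_len, hidden)  one vector per source token
    """
    # Similarity between the decoder state and every encoder output.
    scores = encoder_outputs @ decoder_state          # (src_len,)
    # Softmax, shifted by the max for numerical stability.
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()           # (src_len,), sums to 1
    # Context vector: attention-weighted sum of the encoder outputs.
    context = weights @ encoder_outputs               # (hidden,)
    return context, weights


# Toy stand-ins for real LSTM states: 5 source tokens, hidden size 8.
rng = np.random.default_rng(0)
encoder_outputs = rng.normal(size=(5, 8))
decoder_state = rng.normal(size=8)
context, weights = dot_product_attention(decoder_state, encoder_outputs)
```

Because the context vector is recomputed from all encoder outputs at every decoding step, the decoder can draw on distant source tokens directly instead of relying only on the final encoder state.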

Model Evaluation :
Used the BLEU score metric from the NLTK module to measure the model's translation quality.
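
A BLEU computation with NLTK might look like the sketch below. The example sentences are made up for illustration, and the use of smoothing (`SmoothingFunction().method1`) is an assumption; the README does not say whether the project scores per sentence or over the whole corpus, or whether it smooths.

```python
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu, sentence_bleu

# Made-up example: one reference translation and one model output, tokenized.
reference = "the cat is on the mat".split()
hypothesis = "the cat sat on the mat".split()

# Smoothing avoids a zero score when some higher-order n-gram has no match.
smooth = SmoothingFunction().method1

# sentence_bleu takes a list of references for a single hypothesis.
score = sentence_bleu([reference], hypothesis, smoothing_function=smooth)

# A hypothesis identical to its reference gets the maximum score.
perfect = sentence_bleu([reference], reference, smoothing_function=smooth)

# corpus_bleu takes one list of references per hypothesis.
corpus_score = corpus_bleu([[reference]], [hypothesis], smoothing_function=smooth)

print(f"sentence BLEU: {score:.3f}, corpus BLEU: {corpus_score:.3f}")
```

For evaluating a whole test set, `corpus_bleu` is usually preferable to averaging `sentence_bleu` values, since it pools n-gram counts across sentences before computing the precisions.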