Text classification is one of the most common Natural Language Processing (NLP) tasks: an input text is assigned to one class out of a set of available classes. The same technique can be used for email classification. Different departments can be treated as different classes, so that a given email is routed to a specific department.

We take the subject of an email as the input and the different departments as the classes, so that each email is classified according to its subject. For this we use the publicly available 20 Newsgroups (news20) dataset, which contains 20 different newsgroups and therefore 20 different classes. Now, let's walk through the classification process we followed.

  1. Data Preprocessing:

The data initially comes as 20 sub-directories, one per class. Each sub-directory contains between 625 and 997 email files, so we under-sampled to keep 525 email files per directory, giving 525 email subjects for each category; the remaining files were kept for testing. We then extracted the subject from each file and built the vocabulary dictionary from those subjects.

Finally, we created a numpy file containing the vocabulary index values of those subjects together with their class labels.
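As a rough illustration, the preprocessing could look like the sketch below. This is a hypothetical reconstruction: the directory layout, the subject extraction and the tokenization rules are assumptions, not the exact original script.

import os
import numpy as np

data_dir = "20_newsgroups"
classes = sorted(os.listdir(data_dir))              # 20 sub-directories, one per newsgroup
vocab = {"<PAD>": 0, "<UNK>": 1}
subjects, labels = [], []

for class_id, group in enumerate(classes):
    # under-sample: keep only the first 525 files of each newsgroup
    files = sorted(os.listdir(os.path.join(data_dir, group)))[:525]
    for fname in files:
        with open(os.path.join(data_dir, group, fname), errors="ignore") as f:
            subject = next((line for line in f if line.startswith("Subject:")), "Subject:")
        tokens = subject.replace("Subject:", "").lower().split()
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))       # grow the vocabulary dictionary
        subjects.append([vocab[tok] for tok in tokens])
        labels.append(class_id)

np.save("subject_indexes.npy", np.array(subjects, dtype=object))
np.save("labels.npy", np.array(labels))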

  2. Classification Model:

The input indexes, which are integers, are passed to an embedding layer where they are mapped to feature vectors. The class labels, which are also integers, are converted to one-hot encoded vectors. We use an LSTM (Long Short-Term Memory) network to train on these inputs.
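A minimal sketch of this input pipeline is shown below, assuming the TensorFlow 1.x API used later in the post. The sizes and placeholder names are assumptions, not values from the original model.

import tensorflow as tf

VOCAB_SIZE = 10000      # assumed vocabulary size
EMBED_DIM = 100         # assumed feature-vector size
NUM_CLASSES = 20        # one class per newsgroup

_x_ids = tf.placeholder(tf.int32, [None, None])    # [batch_size, seq_len] subject word indexes
_labels = tf.placeholder(tf.int32, [None])         # integer class ids (0..19)
_x_seq_lens = tf.placeholder(tf.int32, [None])     # true length of each subject

embedding = tf.get_variable("embedding", [VOCAB_SIZE, EMBED_DIM])
_x = tf.nn.embedding_lookup(embedding, _x_ids)     # [batch_size, seq_len, EMBED_DIM] feature vectors
_y = tf.one_hot(_labels, depth=NUM_CLASSES)        # one-hot encoded class vectors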

The LSTM is a variant of the Recurrent Neural Network (RNN) that was introduced to overcome the vanishing gradient problem in RNNs by adding three gates: an input gate, a forget gate and an output gate. If you would like to learn more about LSTMs, you can visit this website: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
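For reference, the lstm_cell() used in the snippet below can be defined with TensorFlow 1.x's RNN cell API; the cell size here is an assumed hyperparameter, not the value used in the original model.

import tensorflow as tf

LSTM_SIZE = 128   # assumed cell size; tuned along with the other hyperparameters

def lstm_cell():
    # a basic LSTM cell with the input, forget and output gates described above
    return tf.nn.rnn_cell.BasicLSTMCell(num_units=LSTM_SIZE)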

We have also added an attention layer on top of the LSTM to improve performance, by letting the model give more weight to the parts of the input sequence that matter most. The LSTM is run through dynamic_rnn, which unrolls the inputs dynamically and builds the graph for variable-length sequences; in TensorFlow it can be written as:
outputs, final_state = tf.nn.dynamic_rnn(cell=lstm_cell(), inputs=_x, sequence_length=_x_seq_lens, dtype=tf.float32)
where:
outputs = the outputs of dynamic_rnn at every time step, with shape [batch_size, input sequence length, lstm_cell size]
final_state = the final LSTM state (only outputs is used further here)
cell = an lstm_cell with a specific cell size
inputs = the embedded inputs, with shape [batch_size, input sequence length, feature vector size]
sequence_length = a vector holding the true (unpadded) length of each input subject in the batch
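The attention layer mentioned above could be sketched roughly as follows, applied on top of the dynamic_rnn outputs. The dense-layer scoring and the final projection are assumptions and not necessarily the exact layer used in the original model.

import tensorflow as tf

# `outputs` is the dynamic_rnn result above: [batch_size, seq_len, LSTM_SIZE];
# in practice the padded time steps should be masked out before the softmax
scores = tf.squeeze(tf.layers.dense(outputs, units=1), axis=-1)          # [batch_size, seq_len]
weights = tf.nn.softmax(scores)                                          # importance of each time step
context = tf.reduce_sum(tf.expand_dims(weights, -1) * outputs, axis=1)   # weighted sum: [batch_size, LSTM_SIZE]
logits = tf.layers.dense(context, units=20)                              # one score per newsgroup class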

We make sure the sequence lengths are computed for each batch and that the examples are sorted in ascending order of length.
Then we tune hyperparameters such as vocab_size, batch_size, the training algorithm and the learning rate.
The model thus produces a vector of class scores, which is converted to a specific class label and compared with the original data labels; with this setup we attained 80% accuracy on test data with the news20 dataset.
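A rough sketch of the training and evaluation ops implied above is given below; the optimizer and learning rate are assumptions, while `logits` and `_y` refer to the earlier sketches.

import tensorflow as tf

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=_y, logits=logits))
train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)

predictions = tf.argmax(logits, axis=1)                    # predicted class label per example
correct = tf.equal(predictions, tf.argmax(_y, axis=1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))    # fraction of subjects classified correctly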

 

For further reading:

https://www.tensorflow.org/tutorials/word2vec

http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/

http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/
