October 17, 2018

How to Use Machine Learning for Sentiment Analysis with Word Embeddings

Machine learning researchers have made many attempts to predict the sentiment of a sentence automatically. Predicting sentiment is a typical NLP (Natural Language Processing) problem, and many papers and techniques address it using different machine learning methods.

Most of these papers and techniques extract features and feed them to a "traditional machine learning" algorithm. In the last few years, the appearance of word embeddings has made neural networks more common than traditional algorithms because of their superior results.

In this article, I will take an approach using Convolutional Neural Networks and Word Embeddings to predict the sentiment of Amazon's reviews.

[Figure: Sentiment Analysis]

Amazon Reviews Corpus

In the taxonomy of machine learning algorithms, neural networks are classified as supervised learning. In supervised learning, the algorithm learns a function that maps an input to an output from a set of pre-classified examples (called the training set). When the learning phase is completed, we can apply the learned function to classify new examples.

In the case of neural networks, we have a set of "neurons", each of which computes a specific function called the activation function. The neurons are grouped in layers and linked together with specific weights. Each neuron receives its input from the previous layer and passes the result of applying the activation function on to the next layer.

The number of layers, the number of neurons per layer, and the type of activation functions are together called the neural network architecture.

Given a set of neurons with a specific architecture, the learning algorithm adjusts the values of the weights using the training set.
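To make the description above concrete, here is a minimal sketch of how an input flows through such a network. The architecture (3 inputs, 2 hidden neurons, 1 output), the weights, and the helper name dense_forward are all made up for illustration; a real network would learn its weights from the training set.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense_forward(inputs, weights, biases, activation):
    """Compute one fully connected layer: activation(w . x + b) for each neuron."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(total))
    return outputs

# Hypothetical architecture: 3 inputs -> 2 hidden neurons (ReLU) -> 1 output (sigmoid).
hidden_weights = [[0.5, -1.0, 0.25], [1.0, 1.0, -1.0]]
hidden_biases = [0.0, 0.5]
output_weights = [[1.0, -2.0]]
output_biases = [0.1]

x = [1.0, 2.0, 2.0]
hidden = dense_forward(x, hidden_weights, hidden_biases, relu)
prediction = dense_forward(hidden, output_weights, output_biases, sigmoid)
print(hidden, prediction)
```

Training consists of nudging the numbers in hidden_weights, hidden_biases, output_weights, and output_biases so that prediction moves closer to the known label of each training example.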

[Figure: Training set]

In the case of Sentiment Analysis, we want to find a function that can take a sentence and determine its sentiment (positive or negative). In order to learn this function using neural networks, however, we need to train the machine with a corpus of sentences with their corresponding sentiment.

In this post, we will use the "Amazon Reviews Dataset" (https://www.kaggle.com/bittlingmayer/amazonreviews) published on Kaggle (https://www.kaggle.com/datasets).

This dataset is a CSV file that contains a set of 4,000,000 Amazon reviews with the following information:

  • Title
  • Review Text
  • Label

The label is "label1" if the review has 1 or 2 stars, or "label2" if it has 4 or 5 stars. Reviews with 3 stars (neutral reviews) are not included in the dataset. This is an especially interesting dataset because it contains reviews that cover many topics.

For example, a positive review (with label label2):

Title: A fascinating insight into the life of modern Japanese teens

Text: I thoroughly enjoyed Rising Sons and Daughters. I don't know of any other book that looks at Japanese society from the point of view of its young people poised as they are between their parents' age-old Japanese culture of restraint and obedience to the will of the community, and their peers' adulation of Western culture. True to form, the "New Young" of Japan seem to be creating an "international" blend, as the Ando family demonstrates in this beautifully written book of vignettes of the private lives of members of this family. Steven Wardell is clearly a talented young author, adopted for some of his schooling into this family of four teens, and thus able to view family life in Japan from the inside out. A great read!

And, a negative review (with label label1):

Title: ADDONICS PORTABLE CD DRIVE - I am disappointed in its performance

Text: I am disappointed in its performance. It seems underpowered and is constantly trying to read CDs, half the time unsuccessfully. I am going to try to return it to Amazon.
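As a rough sketch of how such records could be turned into training pairs, the snippet below reads a two-row in-memory sample. The column order (label, title, text), the sample rows, the helper name load_reviews, and the choice to join the title and the text are assumptions made for illustration, not details of the actual file.

```python
import csv
import io

# Hypothetical sample in the assumed column order: label, title, text.
SAMPLE_CSV = '''label2,A great read,"I thoroughly enjoyed this book."
label1,Disappointed,"It seems underpowered, and I am going to return it."
'''

LABEL_TO_SENTIMENT = {"label1": 0, "label2": 1}  # 0 = negative, 1 = positive

def load_reviews(csv_file):
    """Return (text, sentiment) pairs; title and review text are joined (an assumption)."""
    reviews = []
    for label, title, text in csv.reader(csv_file):
        reviews.append((title + " " + text, LABEL_TO_SENTIMENT[label]))
    return reviews

reviews = load_reviews(io.StringIO(SAMPLE_CSV))
for text, sentiment in reviews:
    print(sentiment, text)
```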

This dataset is split into two groups:

  • "Training Set" containing 3,600,000 reviews

  • "Testing Set" containing 400,000 reviews

The idea is to use the training set to teach our neural network to assess sentiment, and then to use the testing set to evaluate how well it has learned to do so.


In order to compare the results, I will define a baseline using the Liu and Hu Opinion Lexicon. This lexicon contains a list of words that are classified as positive or negative. The baseline algorithm is very simple and intuitive: count the number of positive and negative words in a sentence. If there are more positive words than negative words, we assume that the sentiment of the sentence is positive. If there are more negative words, we assume that it is negative. However, there are many cases that this algorithm doesn't cover, such as negation.

The baseline algorithm is the following:

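from nltk.corpus import opinion_lexicon
from nltk.tokenize import treebank

# Load the lexicon once into sets so that membership tests are fast.
# The lexicon needs a one-time nltk.download('opinion_lexicon').
POSITIVE_WORDS = set(opinion_lexicon.positive())
NEGATIVE_WORDS = set(opinion_lexicon.negative())

def demo_liu_hu_lexicon(sentence):
    tokenizer = treebank.TreebankWordTokenizer()
    pos_words = 0
    neg_words = 0
    tokenized_sent = [word.lower() for word in tokenizer.tokenize(sentence)]

    # Count how many tokens appear in each half of the lexicon.
    for word in tokenized_sent:
        if word in POSITIVE_WORDS:
            pos_words += 1
        elif word in NEGATIVE_WORDS:
            neg_words += 1
    if pos_words > neg_words:
        return 'Positive'
    elif pos_words < neg_words:
        return 'Negative'
    else:
        # Equal counts (including no lexicon words at all).
        return 'Neutral'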

When we apply this algorithm to the Amazon dataset we obtain an accuracy of 50%.
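The accuracy figure is simply the fraction of reviews for which the predicted label matches the true one. The sketch below illustrates the calculation with a toy stand-in for the lexicon and five hand-written sentences; the word lists, the sentences, and the function names are hypothetical, and the resulting number says nothing about the real dataset. Note that the dataset has no neutral reviews, so every "Neutral" answer counts as an error, as does the negation case.

```python
# Toy stand-in for the Opinion Lexicon (hypothetical word lists).
POSITIVE = {"good", "great", "love", "beautiful"}
NEGATIVE = {"bad", "disappointed", "hate", "poor"}

def lexicon_sentiment(sentence):
    """Classify by counting positive and negative words, as the baseline does."""
    words = sentence.lower().split()
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    if pos > neg:
        return "Positive"
    if pos < neg:
        return "Negative"
    return "Neutral"

def accuracy(classifier, labeled_examples):
    """Fraction of examples whose predicted label equals the true label."""
    correct = sum(classifier(text) == label for text, label in labeled_examples)
    return correct / len(labeled_examples)

examples = [
    ("a great and beautiful book", "Positive"),
    ("i love it", "Positive"),
    ("i am disappointed with this poor drive", "Negative"),
    ("this is not good", "Negative"),               # negation: the count says Positive
    ("the package arrived on monday", "Positive"),  # no lexicon words: Neutral
]
print(accuracy(lexicon_sentiment, examples))
```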

Neural Networks Algorithms

Neural network algorithms require the input to be expressed as high-dimensional (more than 100 components) numeric vectors. To use a neural network in NLP tasks, we need a way to convert sentences into vectors. Converting sentences into vectors is part of pre-processing, and there are several ways of accomplishing it.

In this article, I will use a word embedding model to obtain the vectors associated with the sentences. Word embeddings are a mapping between words and vectors built using an automatic algorithm. Word embeddings are very useful in NLP tasks because the generated vectors maintain the semantic relation between words.

For example, the function that transforms Man into Woman will return Queen if we apply it to King.


Word embeddings also maintain Semantic Similarity. Words that have similar semantics will generate vectors that are similar.
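Both properties can be illustrated with a toy example. The two-dimensional vectors below are invented so that one axis roughly means "royalty" and the other "gender"; real models such as word2vec learn hundreds of dimensions from text, and the helper names here are hypothetical.

```python
import math

# Hypothetical 2-D embeddings: (royalty, gender). Real models use hundreds of dimensions.
VECTORS = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(vector, exclude=()):
    """Return the known word whose vector is closest in direction to the given one."""
    candidates = [word for word in VECTORS if word not in exclude]
    return max(candidates, key=lambda word: cosine_similarity(VECTORS[word], vector))

# Apply the "man -> woman" offset to "king".
offset = [w - m for w, m in zip(VECTORS["woman"], VECTORS["man"])]
target = [k + o for k, o in zip(VECTORS["king"], offset)]
print(most_similar(target, exclude=("king",)))
```

Cosine similarity is a common choice for comparing embeddings because it ignores vector length and looks only at direction.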

In this post, I will use a word embedding model provided by Google, generated with the word2vec algorithm from a news corpus.

To obtain a vector that represents the sentences we need to pre-process them. The first step is to tokenize each sentence, which will convert them into a list of tokens.

For example:

This sound track was beautiful! It paints the senery [sic] in your mind so well I would recomend [sic] it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^

is tokenized like:

['this', 'sound', 'track', 'was', 'beautiful', '!', 'it', 'paints', 'the', 'senery', 'in', 'your', 'mind', 'so', 'well', 'i', 'would', 'recomend', 'it', 'even', 'to', 'people', 'who', 'hate', 'vid', '.', 'game', 'music', '!', 'i', 'have', 'played', 'the', 'game', 'chrono', 'cross', 'but', 'out', 'of', 'all', 'of', 'the', 'games', 'i', 'have', 'ever', 'played', 'it', 'has', 'the', 'best', 'music', '!', 'it', 'backs', 'away', 'from', 'crude', 'keyboarding', 'and', 'takes', 'a', 'fresher', 'step', 'with', 'grate', 'guitars', 'and', 'soulful', 'orchestras', '.', 'it', 'would', 'impress', 'anyone', 'who', 'cares', 'to', 'listen', '!', '^_^']
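A tokenizer with similar behaviour can be approximated with a regular expression. This is only a sketch: the function name simple_tokenize and the pattern are my own, it handles ASCII text only, and it will not match a full tokenizer such as NLTK's in every case (for example, it keeps "don't" as one token).

```python
import re

def simple_tokenize(text):
    """Lowercase the text and split it into word tokens and punctuation tokens."""
    # First alternative: runs of letters, digits, and apostrophes (words).
    # Second alternative: runs of any other non-space characters (punctuation, emoticons).
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']+", text.lower())

print(simple_tokenize("This sound track was beautiful! It paints the senery in your mind."))
```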

The next step consists of replacing all the "stop words" with a placeholder token (<UNK>) and keeping only the N most common tokens; every other token is replaced with <UNK> as well. By keeping only the N most common tokens, we can ignore words that occur in few reviews and don't contribute to the learning algorithm.

After some experiments on the training set with different values of N, I found that N=7000 gave the best results.

This is what a review looks like after replacing the stop words and keeping only the most common tokens:

['<UNK>', '<UNK>', 'finished', 'reading', '<UNK>', '<UNK>', '<UNK>', 'wicked', '<UNK>', '<UNK>', '<UNK>', 'fell', '<UNK>', 'love', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', 'expected', '<UNK>', 'average', 'romance', 'read', '<UNK>', '<UNK>', 'instead', '<UNK>', 'found', 'one', '<UNK>', '<UNK>', 'favorite', 'books', '<UNK>', '<UNK>', 'time', '<UNK>', '<UNK>', '<UNK>', '<UNK>', 'thought', '<UNK>', 'could', 'predict', '<UNK>', 'outcome', '<UNK>', '<UNK>', 'shocked', '<UNK>', '<UNK>', 'writting', '<UNK>', '<UNK>', 'descriptive', '<UNK>', '<UNK>', 'heart', 'broke', '<UNK>', 'julia', "'s", '<UNK>', '<UNK>', '<UNK>', 'felt', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', 'instead', '<UNK>', '<UNK>', '<UNK>', 'distant', 'reader', '<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>', 'lover', '<UNK>', 'romance', 'novels', '<UNK>', '<UNK>', '<UNK>', '<UNK>', 'must', 'read', '<UNK>', '<UNK>', "n't", 'let', '<UNK>', 'cover', 'fool', '<UNK>', '<UNK>', 'book', '<UNK>', 'spectacular', '<UNK>']
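This filtering step might be sketched as follows. The stop-word list here is a tiny hypothetical one (a real list is much longer), the three reviews are invented, and the helper names are my own; the example uses n=3 instead of 7000 so that the effect is visible.

```python
from collections import Counter

UNK = "<UNK>"
# Hypothetical, tiny stop-word list used only for this example.
STOP_WORDS = {"i", "the", "this", "it", "was", "a", "and", "."}

def build_vocabulary(tokenized_reviews, n):
    """Return the n most common non-stop-word tokens across all reviews."""
    counts = Counter(
        token
        for review in tokenized_reviews
        for token in review
        if token not in STOP_WORDS
    )
    return [token for token, _ in counts.most_common(n)]

def replace_unknown(review, vocabulary):
    """Replace every token outside the vocabulary with the <UNK> marker."""
    known = set(vocabulary)
    return [token if token in known else UNK for token in review]

reviews = [
    ["i", "love", "this", "book", "."],
    ["the", "book", "was", "a", "great", "read", "."],
    ["i", "love", "a", "great", "book", "."],
]
vocabulary = build_vocabulary(reviews, n=3)
print(vocabulary)
print(replace_unknown(reviews[1], vocabulary))
```

Because stop words never enter the vocabulary, they are replaced by the same marker as rare words, which matches the output shown above.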

Finally, we must define an embedding matrix and replace words with the corresponding index in the matrix. This method is described in more detail here: http://adventuresinmachinelearning.com/gensim-word2vec-tutorial/

After this step, we will be left with the following:

[7000, 97, 362, 7000, 283, 7000, 7000, 5936, 7000, 7000, 7000, 7000, 313, 7000, 16, 7000, 6, 1359, 7000, 14, 7000, 40, 7000, 552, 7000, 7000, 126, 44, 7000, 7000, 7000, 390, 7000, 126, 7000, 1681, 7000, 7000, 7000, 7000, 7000, 7000, 655, 7000, 7000, 48, 390, 7000, 7000, 7000, 23, 44, 7000, 7000, 6737, 157, 7000, 3762, 7000, 7000, 302, 7000, 7000, 1131, 7000, 7000, 2921, 7000, 3830, 7000, 7000, 7000, 6, 4382, 83, 7000, 1971, 7000, 216, 7000, 7000]
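The index replacement can be sketched like this. I am assuming that the vocabulary words occupy rows 0 to N-1 of the embedding matrix and that the unknown token takes row N, which would be consistent with 7000 appearing wherever <UNK> was; the helper names and the three-word vocabulary are illustrative only.

```python
def build_index(vocabulary):
    """Map each vocabulary word to its row number; unknown tokens get the row after the last word."""
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    unk_index = len(vocabulary)
    return word_to_index, unk_index

def encode(tokens, word_to_index, unk_index):
    """Replace each token with its row number in the embedding matrix."""
    return [word_to_index.get(token, unk_index) for token in tokens]

vocabulary = ["book", "love", "great"]  # hypothetical 3-word vocabulary
word_to_index, unk_index = build_index(vocabulary)
print(encode(["<UNK>", "book", "<UNK>", "great"], word_to_index, unk_index))
```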

Basic Neural Model

The first approach that I took to predict the sentiment of a review was using a basic neural network with two dense layers.

The neural networks were defined and trained using Keras (https://keras.io/) with a TensorFlow (https://www.tensorflow.org/) backend.

The architecture of the network was defined as:

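from keras.models import Sequential
from keras.layers import Dense, Flatten

model = Sequential()
model.add(embedding_layer)
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])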

Where embedding_layer is defined as:

from keras.layers import Embedding
embedding_layer = Embedding(len(text_vector_dict_keys), 300, weights=[embedding_matrix], input_length=MAX_WORDS, trainable=False)

The first layer is an Embedding layer. This layer receives a vector of word indices and returns a matrix with the corresponding word embeddings, one row per word.

The second layer is a Flatten layer. This layer concatenates all the word vectors into a single vector. If we adjust the reviews to 500 words and the word embedding model has dimension 300, the result of this layer is a vector with 500 x 300 = 150,000 components.

Finally, the third and fourth layers are Dense layers, which reduce the dimension of this vector to 250 and 1 respectively.

The learning algorithm only fits the weights in the third and fourth layers. The weights in the first layer are not trained because they represent the word embedding model.
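To get a feeling for the size of the trainable part, the arithmetic can be worked out directly. This is a back-of-the-envelope sketch that uses the 500-word and 300-dimension figures from the text; the helper name dense_parameters is my own.

```python
MAX_WORDS = 500        # words per review after adjusting the length
EMBEDDING_DIM = 300    # dimension of the word embedding model
HIDDEN_UNITS = 250

# The Flatten layer turns the 500 x 300 matrix into one long vector.
flattened_size = MAX_WORDS * EMBEDDING_DIM

def dense_parameters(inputs, units):
    """A dense layer has one weight per input per unit, plus one bias per unit."""
    return inputs * units + units

hidden_params = dense_parameters(flattened_size, HIDDEN_UNITS)
output_params = dense_parameters(HIDDEN_UNITS, 1)
print(flattened_size, hidden_params, output_params)
```

Almost all of the trainable weights sit in the first Dense layer, because every one of its 250 units is connected to every component of the flattened vector.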

This basic neural network achieves 80% accuracy, an improvement of 30 percentage points over the baseline. However, the state of the art in sentiment analysis is over 95%, so the model still needs improvement.

CNN Neural Model

The next step I took was replacing the basic neural network with a Convolutional Neural Network (CNN). CNNs were designed to process images efficiently. The problem with regular networks is that all their layers are fully connected, meaning that if we create a large number of hidden layers, the cost of adjusting the weights is high.

The definition of CNNs and their differences from regular networks can be found at http://cs231n.github.io/convolutional-networks/.

In summary, with CNNs we don't consider the input vector as a whole; for each component, we consider only its neighboring elements. This change greatly reduces the cost of fitting the network because there are considerably fewer connections between neurons.

CNNs can be used for problems in which the neighborhood is important. This condition is satisfied by images (a change in a pixel affects the image in terms of its neighborhood) and also by sentences, where the meaning of a word depends on the words around it.
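The idea of looking only at a neighborhood can be shown with a one-dimensional convolution written by hand. This is a simplified sketch with a single channel, a single filter, and an odd kernel length; the function name conv1d_same is my own, and, as in most deep learning libraries, the kernel is applied without flipping.

```python
def conv1d_same(signal, kernel):
    """Slide the kernel over the signal, zero-padding both ends ('same' padding)."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    # Each output value depends only on len(kernel) neighboring inputs.
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]

print(conv1d_same([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0]))
```

With a kernel of size 3, each output is computed from three inputs and the same three weights are reused at every position, whereas a fully connected layer would need a separate weight for every input and output pair.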

[Figure: Regular Neural Network]
[Figure: Convolutional Neural Network]

We can use CNN in our problem with the following architecture:

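from keras.models import Sequential
from keras.layers import Conv1D, Dense, Flatten

model = Sequential()
model.add(embedding_layer)  # same frozen Embedding layer as in the basic model
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(Conv1D(filters=64, kernel_size=3, padding='same', activation='relu'))
model.add(Conv1D(filters=128, kernel_size=3, padding='same', activation='relu'))
model.add(Flatten())  # collapse the sequence so the Dense layers see one vector per review
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])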

The architecture is similar to the previous solution. The main difference is that we introduce some additional layers with convolutions.
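As a rough check on how cheap the convolutional layers are, their weights can be counted by hand. The sketch assumes that the first convolution receives the 300-dimensional word embeddings as its input channels; the helper name conv1d_parameters is my own.

```python
def conv1d_parameters(kernel_size, in_channels, filters):
    """Each filter has kernel_size * in_channels weights plus one bias."""
    return kernel_size * in_channels * filters + filters

EMBEDDING_DIM = 300  # channels entering the first convolution (assumed)

# (kernel_size, in_channels, filters) for the three Conv1D layers.
layers = [(3, EMBEDDING_DIM, 32), (3, 32, 64), (3, 64, 128)]
conv_params = [conv1d_parameters(*layer) for layer in layers]
print(conv_params, sum(conv_params))
```

The three convolutions together need fewer than 60,000 weights, because each filter is reused at every position of the review. A dense layer connecting a 500 x 300 input to 250 units needs 150,000 x 250 weights.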

When I apply this network to the Amazon reviews corpus, it achieves an accuracy of nearly 90%. These results are good, and I consider them acceptable.

Conclusions and Future Work

In this article, I have shown that the use of word embeddings to solve NLP problems is a good approach. Word embeddings have properties that we can use to our benefit. The property of maintaining the semantic relations between words is very important.

Also, they let us omit the feature extraction phase. This point is very important because in some NLP problems the required features are not trivial, and extracting them requires a lot of time spent researching and testing.

Another advantage of word embeddings is that they allow us to use neural networks in NLP tasks. Without word embeddings, using neural networks for NLP would be impractical because we would need to extract more than 100 features by hand.

I have also demonstrated how CNNs can be used outside of image processing. CNNs are useful in sentiment analysis, improving the accuracy by about 10 percentage points (from 80% to nearly 90%) with respect to traditional neural networks.

For future work, we could use different word embedding models to obtain better results. For example, we could create a vector model from a review corpus instead of a news corpus. Using a "Sentiment Word Embedding" should be more accurate than a generic word embedding.

"How to Use Machine Learning for Sentiment Analysis with Word Embeddings" by Pablo Grill is licensed under CC BY SA. Source code examples are licensed under MIT. Categorized under research & learning.
