Sentiment Analysis on Movie Reviews

In this project, sentiment analysis is performed on a movie review dataset to classify reviews as positive or negative. Here, I cover a variety of techniques from both supervised and unsupervised models to analyze sentiment:

  • Unsupervised lexicon-based models
  • Supervised traditional machine learning models

To solve the problem at hand, i.e., sentiment analysis or opinion mining, I analyzed a set of textual documents and predicted their sentiment or opinion based on their content.

A text corpus consists of multiple text documents, and each document can be as simple as a single sentence or as complex as a complete document with multiple paragraphs. Textual data, in spite of being highly unstructured, can be classified into two major types of documents:

  • Factual/objective documents: typically depict some form of statements or facts with no specific feelings or emotion attached to them.
  • Subjective documents: text that expresses feelings, moods, emotions, and opinions.

Sentiment analysis typically works best on subjective text, where people express opinions, feelings, and moods. From a real-world industry standpoint, sentiment analysis is widely used to analyze corporate surveys, feedback surveys, social media data, and reviews for movies, places, commodities, and much more. The idea is to analyze and understand people's reactions toward a specific entity and take insightful actions based on their sentiment.

Sentiment polarity is typically a numeric score assigned to the positive and negative aspects of a text document based on subjective parameters like specific words and phrases expressing feelings and emotion. Neutral sentiment typically has a polarity of 0, since it does not express any specific sentiment; positive sentiment has polarity > 0, and negative sentiment has polarity < 0.

Classifying Sentiments

Machine Learning:

This approach employs machine-learning techniques and diverse features to construct a classifier that can identify text expressing sentiment. Nowadays, deep-learning methods are popular because they learn representations directly from the data.

Lexicon-Based:

This method uses a variety of words annotated with polarity scores to decide the overall sentiment score of a given piece of content. The strongest asset of this technique is that it does not require any training data, while its weakest point is that a large number of words and expressions are not included in sentiment lexicons.

STEP 1: Text Pre-Processing and Normalization

An initial step in text and sentiment classification is pre-processing. A number of techniques are applied to the data to improve classification effectiveness. This enables standardization across a document corpus, which helps build meaningful features, reduce dimensionality, and reduce noise that can be introduced by factors like irrelevant symbols, special characters, and XML and HTML tags.

The main components in our text normalization pipeline are:

Cleaning Text — strip HTML

Our text often contains unnecessary content like HTML tags, which add little value when analyzing sentiment, so we need to remove them before extracting features. The BeautifulSoup library does an excellent job of providing the necessary functions for this. Our strip_html_tags(…) function handles cleaning and stripping out HTML code.
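A minimal sketch of this helper, assuming the bs4 package is installed (the exact implementation may differ in details):

```python
from bs4 import BeautifulSoup

def strip_html_tags(text):
    # Parse the markup and keep only the visible text content
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text(separator=" ").strip()
```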

Removing accented characters

In our dataset, we are dealing with reviews in the English language, so we need to make sure that characters in any other format, especially accented characters, are converted and standardized into ASCII characters. A simple example would be converting é to e. Our remove_accented_chars(…) function helps us in this respect.
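One common way to implement this is with the standard unicodedata module; a sketch:

```python
import unicodedata

def remove_accented_chars(text):
    # Decompose accented characters (NFKD) and drop the non-ASCII bytes,
    # e.g. 'Sómě těxt' becomes 'Some text'
    return (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('utf-8', 'ignore'))
```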

Expanding Contractions

In the English language, contractions are essentially shortened versions of words or syllables. Contractions pose a problem in text normalization because we have to deal with special characters like the apostrophe, and we also have to convert each contraction to its expanded, original form. Our expand_contractions(…) function uses regular expressions and a mapping of common contractions to expand all contractions in our text corpus.
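A sketch of the idea, with a deliberately tiny contraction map (a real pipeline would use a much larger one):

```python
import re

# Tiny illustrative mapping; a full map covers hundreds of contractions
CONTRACTION_MAP = {"don't": "do not", "can't": "cannot",
                   "i'm": "i am", "it's": "it is"}

def expand_contractions(text, contraction_map=CONTRACTION_MAP):
    pattern = re.compile('|'.join(re.escape(c) for c in contraction_map),
                         flags=re.IGNORECASE)
    # Replace each matched contraction with its expanded form
    return pattern.sub(lambda m: contraction_map[m.group(0).lower()], text)

print(expand_contractions("I'm sure you don't mind"))  # i am sure you do not mind
```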

Removing Special Characters

Simple regexes can be used to achieve this. Our function remove_special_characters(…) helps us remove special characters. In our code, we have retained numbers, but you can also remove them if you do not want them in your normalized corpus.
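A sketch consistent with that description (numbers kept by default, removable via a flag):

```python
import re

def remove_special_characters(text, remove_digits=False):
    # Keep letters, whitespace, and (unless told otherwise) digits
    pattern = r'[^a-zA-Z\s]' if remove_digits else r'[^a-zA-Z0-9\s]'
    return re.sub(pattern, '', text)
```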

Lemmatizing text

A word stem is the base form of a word; new words can be created by attaching affixes like prefixes and suffixes to the stem. This is known as inflection. The reverse process, obtaining the base form from an inflected word, is known as stemming. The nltk package offers a wide range of stemmers like the PorterStemmer and LancasterStemmer. Lemmatization is very similar to stemming: we remove word affixes to get to the base form of a word. However, the base form in this case is known as the root word rather than the root stem. The difference is that the root word is always a lexicographically correct word present in the dictionary, whereas the root stem may not be. We will use only lemmatization in our normalization pipeline to retain lexicographically correct words. The function lemmatize_text(…) helps us with this aspect.
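A minimal sketch using nltk's WordNetLemmatizer (the actual lemmatize_text(…) may use a different backend, such as spaCy):

```python
from nltk.stem import WordNetLemmatizer

# nltk.download('wordnet')  # one-time resource download

lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    # Without POS information WordNet treats every token as a noun;
    # a fuller pipeline would POS-tag tokens and pass the tag along
    return ' '.join(lemmatizer.lemmatize(token) for token in text.split())
```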

Removing Stopwords

Words which have little or no significance, especially when constructing meaningful features from text, are known as stopwords (or stop words). These are usually the words with the highest frequency if you compute a simple term or word frequency over a document corpus. Words like a, an, and the are considered stopwords. There is no universal stopword list, but we use a standard English stopword list from nltk. You can also add your own domain-specific stopwords if needed. The function remove_stopwords(…) removes stopwords and retains the words with the most significance and context in a corpus.
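A sketch using nltk's standard English stopword list:

```python
from nltk.corpus import stopwords

# nltk.download('stopwords')  # one-time resource download

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # Keep only tokens that are not in the stopword list
    return ' '.join(token for token in text.split()
                    if token.lower() not in stop_words)
```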

Normalize text corpus — tying it all together

We tie all these components together in the following function, normalize_corpus(…), which takes a document corpus as input and returns the same corpus with cleaned and normalized text documents.
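Using the helpers sketched above, the pipeline could look roughly like this (the lowercasing step and the fixed step order are assumptions; the original function may expose flags to toggle individual steps):

```python
def normalize_corpus(corpus):
    # Run every document through the cleaning steps described above
    normalized = []
    for doc in corpus:
        doc = strip_html_tags(doc)
        doc = remove_accented_chars(doc)
        doc = expand_contractions(doc)
        doc = doc.lower()
        doc = remove_special_characters(doc)
        doc = lemmatize_text(doc)
        doc = remove_stopwords(doc)
        normalized.append(doc)
    return normalized
```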

STEP 2: Loading files

This step mostly consists of code for importing the different libraries, loading the data files, and processing them into a proper format for manipulation. Let us not go into the details of the code here; all of it can be found on my GitHub profile linked below. We also cleaned the data and pre-processed it into a usable format.

STEP 3: Building Models

Sentiment Analysis — Unsupervised Lexicon-Based Models

Even though we have labelled data, this section should give you a good idea of how lexicon-based models work, and you can apply the same approach to your own datasets when you do not have labelled data.

Unsupervised sentiment analysis models use well-curated knowledge bases, ontologies, lexicons, and databases that have detailed information about subjective words and phrases, including sentiment, mood, polarity, objectivity, subjectivity, and so on. A lexicon model typically uses a lexicon, also known as a dictionary or vocabulary of words, specifically aligned toward sentiment analysis. Usually, these lexicons contain a list of words associated with positive and negative sentiment, polarity (magnitude of the negative or positive score), parts-of-speech (POS) tags, subjectivity classifiers (strong, weak, neutral), mood, modality, and so on. You can use these lexicons to compute the sentiment of a text document by matching the presence of specific words from the lexicon, looking at additional factors like the presence of negation, surrounding words, overall context and phrases, and aggregating the sentiment polarity scores to decide the final sentiment score.

There are several popular lexicon models used for sentiment analysis. Some of them are mentioned as follows.

  • Bing Liu’s Lexicon
  • MPQA Subjectivity Lexicon
  • Pattern Lexicon
  • AFINN Lexicon
  • SentiWordNet Lexicon
  • VADER Lexicon

We’ll look at the AFINN, SentiWordNet, and VADER lexicon models.

Sentiment Analysis with AFINN

The AFINN lexicon is perhaps one of the simplest and most popular lexicons, used extensively for sentiment analysis. It is a list of words rated for valence with an integer between minus five (negative) and plus five (positive). The current version of the lexicon is AFINN-en-165.txt and it contains over 3,300 words, each with an associated polarity score. The author has also created a nice Python wrapper library on top of this called afinn, which we will be using for our analysis. AFINN also takes into account aspects like emoticons and exclamations.
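A sketch of how the afinn package can be used (the afinn_polarity(…) helper is hypothetical, for illustration):

```python
from afinn import Afinn  # pip install afinn

afn = Afinn(emoticons=True)

# Positive scores indicate positive sentiment, negative scores the opposite
print(afn.score('This movie was absolutely awesome! :)'))    # > 0
print(afn.score('The plot was dull and the acting awful.'))  # < 0

def afinn_polarity(text, threshold=0.0):
    # Hypothetical helper: map the raw valence score to a label
    return 'positive' if afn.score(text) >= threshold else 'negative'
```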

Sentiment Analysis with SentiWordNet

The WordNet corpus is definitely one of the most popular corpora for the English language, used extensively in natural language processing and semantic analysis. WordNet gave us the concept of synsets, or synonym sets. The SentiWordNet lexicon is based on WordNet synsets and can be used for sentiment analysis and opinion mining. It typically assigns three sentiment scores to each WordNet synset: a positive polarity score, a negative polarity score, and an objectivity score. We will be using the nltk library, which provides a Pythonic interface to SentiWordNet. Consider, for example, the adjective awesome.
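A quick look at its scores via nltk (a sketch):

```python
from nltk.corpus import sentiwordnet as swn

# nltk.download('sentiwordnet'); nltk.download('wordnet')  # one-time downloads

# 'a' restricts the lookup to adjective synsets
awesome = list(swn.senti_synsets('awesome', 'a'))[0]
print('Positive polarity:', awesome.pos_score())
print('Negative polarity:', awesome.neg_score())
print('Objectivity:', awesome.obj_score())
```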

Sentiment Analysis with VADER

The VADER lexicon, developed by C.J. Hutto, is a lexicon paired with a rule-based sentiment analysis framework, specifically tuned to analyze sentiments in social media. VADER stands for Valence Aware Dictionary and Sentiment Reasoner. The file titled vader_lexicon.txt contains the sentiment scores associated with words, emoticons, and slang terms (like wtf, lol, nah, and so on). From a total of over 9,000 candidate lexical features, over 7,500 curated features with properly validated valence scores were finally selected for the lexicon. Each feature was rated on a scale from “[-4] Extremely Negative” to “[4] Extremely Positive”, with allowance for “[0] Neutral (or Neither, N/A)”. Lexical features were kept if they had a non-zero mean rating and a standard deviation of less than 2.5, as determined by the aggregate of ten independent raters.
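VADER ships with nltk; a sketch of scoring a review:

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# nltk.download('vader_lexicon')  # one-time resource download

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores('The movie was lol funny, but wtf was that ending :(')
print(scores)  # dict with 'neg', 'neu', 'pos' and a normalized 'compound' score

# A common convention: compound >= 0.05 is positive, <= -0.05 is negative
```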

Classifying Sentiment with Supervised Learning

Feature Engineering

The feature engineering techniques I used are based on the Bag of Words model and the TF-IDF model.

The bag-of-words model is commonly used in document classification, where the (frequency of) occurrence of each word is used as a feature for training a classifier. The core principle is to convert text documents into numeric vectors. The dimension or size of each vector is N, where N is the number of distinct words across the corpus of documents. Each document, once transformed, is a numeric vector of size N, where the values or weights in the vector indicate the frequency of each word in that specific document.

There are some potential problems which might arise with the Bag of Words model when it is used on large corpora. Since the feature vectors are based on absolute term frequencies, there might be some terms which occur frequently across all documents and these will tend to overshadow other terms in the feature set. The TF-IDF model tries to combat this issue by using a scaling or normalizing factor in its computation. TF-IDF stands for Term Frequency-Inverse Document Frequency, which uses a combination of two metrics in its computation, namely: term frequency (tf) and inverse document frequency (idf).
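Both models are available in scikit-learn; a minimal sketch on a toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['the movie was great', 'the movie was awful']  # toy corpus

# Bag of Words: raw term counts per document
bow = CountVectorizer()
bow_features = bow.fit_transform(corpus)
print(bow.get_feature_names_out())   # distinct words across the corpus
print(bow_features.toarray())        # one count vector per document

# TF-IDF: the same counts rescaled by inverse document frequency
tfidf = TfidfVectorizer(use_idf=True)
tfidf_features = tfidf.fit_transform(corpus)
print(tfidf_features.toarray())
```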

Traditional Supervised Machine Learning Models

We can now use some traditional supervised machine learning algorithms which work very well on text classification. We recommend logistic regression, support vector machines, and multinomial Naïve Bayes models.

Model Training

Logistic regression is intended for binary (two-class) classification problems: it predicts the probability of an instance belonging to the default class, which can be mapped to a 0 or 1 classification. In our case, we predict the probability that a given movie review belongs to one of the two discrete classes, with Y = 1 denoting a positive review:

p(X) = P(Y=1 | X) = 1 / (1 + e^−(β₀ + β·X))
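Training is then a straightforward scikit-learn call. In the sketch below, the feature and label variable names are placeholders for the BoW or TF-IDF matrices from the previous step:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

# train_features / test_features: BoW or TF-IDF matrices (placeholders)
# train_labels / test_labels: the 'positive' / 'negative' sentiments
lr = LogisticRegression(max_iter=1000)
lr.fit(train_features, train_labels)
lr_predictions = lr.predict(test_features)

# The SVM and multinomial Naive Bayes models are trained the same way
svm = LinearSVC()
nb = MultinomialNB()
```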

STEP 4: Evaluations

This is the main evaluation function used.
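Since the function itself isn't reproduced here, here is a minimal scikit-learn sketch of the kind of report printed below (evaluate_model(…) is a hypothetical name):

```python
from sklearn import metrics

def evaluate_model(true_labels, predicted_labels,
                   classes=('positive', 'negative')):
    # Accuracy plus the per-class precision/recall/F1 report
    print('Accuracy: {:.2%}'.format(
        metrics.accuracy_score(true_labels, predicted_labels)))
    print(metrics.classification_report(true_labels, predicted_labels,
                                        labels=list(classes)))
    # Rows are actual classes, columns are predicted classes
    print(metrics.confusion_matrix(true_labels, predicted_labels,
                                   labels=list(classes)))
```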

Evaluation of Unsupervised lexicon models

AFINN

Model Performance metrics: 
--------------------
Accuracy of the model is 71.04%
Recall score is 84.65%
Precision score is 66.4%
F1 Score is 74.0%

Model Classification report:
--------------------
              precision    recall  f1-score   support

    positive       0.79      0.58      0.67     15072
    negative       0.66      0.85      0.74     14928

    accuracy                           0.71     30000
   macro avg       0.73      0.71      0.71     30000
weighted avg       0.73      0.71      0.71     30000


Prediction Confusion Matrix:
--------------------
                   predicted:
                   positive  negative
Actual:  positive     12637      2291
         negative      6396      8676

SentiWordNet

Model Performance metrics: 
--------------------
Accuracy of the model is 68.26%
Recall score is 74.37%
Precision score is 66.1%
F1 Score is 70.0%

Model Classification report:
--------------------
              precision    recall  f1-score   support

    positive       0.71      0.62      0.66     15072
    negative       0.66      0.74      0.70     14928

    accuracy                           0.68     30000
   macro avg       0.69      0.68      0.68     30000
weighted avg       0.69      0.68      0.68     30000


Prediction Confusion Matrix:
--------------------
                   predicted:
                   positive  negative
Actual:  positive     11102      3826
         negative      5696      9376

VADER

Model Performance metrics: 
--------------------
Accuracy of the model is 70.97%
Recall score is 82.78%
Precision score is 66.8%
F1 Score is 74.0%

Model Classification report:
--------------------
              precision    recall  f1-score   support

    positive       0.78      0.59      0.67     15072
    negative       0.67      0.83      0.74     14928

    accuracy                           0.71     30000
   macro avg       0.72      0.71      0.71     30000
weighted avg       0.72      0.71      0.71     30000


Prediction Confusion Matrix:
--------------------
                   predicted:
                   positive  negative
Actual:  positive     12358      2570
         negative      6139      8933

So, it can be concluded that the best performing lexicon model is AFINN, with an accuracy of 71% and an F1 score of 0.74.

Evaluation of Supervised ML models

Logistic Regression

Logistic Regression results with BoW:
Model Performance metrics:
--------------------
Accuracy of the model is 90.13%
Recall score is 91.22%
Precision score is 89.3%
F1 Score is 90.0%

Model Classification report:
--------------------
              precision    recall  f1-score   support

    positive       0.91      0.89      0.90      7507
    negative       0.89      0.91      0.90      7493

    accuracy                           0.90     15000
   macro avg       0.90      0.90      0.90     15000
weighted avg       0.90      0.90      0.90     15000


Prediction Confusion Matrix:
--------------------
                   predicted:
                   positive  negative
Actual:  positive      6835       658
         negative       823      6684
Logistic Regression results with TF-IDF:
Model Performance metrics:
--------------------
Accuracy of the model is 89.15%
Recall score is 90.66%
Precision score is 88.0%
F1 Score is 89.0%

Model Classification report:
--------------------
              precision    recall  f1-score   support

    positive       0.90      0.88      0.89      7507
    negative       0.88      0.91      0.89      7493

    accuracy                           0.89     15000
   macro avg       0.89      0.89      0.89     15000
weighted avg       0.89      0.89      0.89     15000


Prediction Confusion Matrix:
--------------------
                   predicted:
                   positive  negative
Actual:  positive      6793       700
         negative       927      6580

Support Vector Machine

SVM results with BoW:
Model Performance metrics:
--------------------
Accuracy of the model is 84.97%
Recall score is 78.86%
Precision score is 89.8%
F1 Score is 84.0%

Model Classification report:
--------------------
              precision    recall  f1-score   support

    positive       0.81      0.91      0.86      7507
    negative       0.90      0.79      0.84      7493

    accuracy                           0.85     15000
   macro avg       0.85      0.85      0.85     15000
weighted avg       0.85      0.85      0.85     15000


Prediction Confusion Matrix:
--------------------
                   predicted:
                   positive  negative
Actual:  positive      5909      1584
         negative       671      6836
SVM results with TF-IDF:
Model Performance metrics:
--------------------
Accuracy of the model is 89.15%
Recall score is 90.66%
Precision score is 88.0%
F1 Score is 89.0%

Model Classification report:
--------------------
              precision    recall  f1-score   support

    positive       0.90      0.88      0.89      7507
    negative       0.88      0.91      0.89      7493

    accuracy                           0.89     15000
   macro avg       0.89      0.89      0.89     15000
weighted avg       0.89      0.89      0.89     15000


Prediction Confusion Matrix:
--------------------
                   predicted:
                   positive  negative
Actual:  positive      6793       700
         negative       927      6580

So, using traditional ML models I obtained accuracies in the range of 85–90%. The Logistic Regression model trained on bag-of-words features outperformed all others, with 90% accuracy and an F1 score of about 0.9, whereas the best lexicon model only reached about 71% accuracy.
