Text Sentiment Analysis


In this session we will use the Natural Language Toolkit (NLTK) to train a text sentiment analysis classification model, and then use it to determine whether movie reviews are positive or negative.

Installation & Setup

As always, we're using Python 3. We will be using Python for the foreseeable future, so please ensure you have it installed.

In this session we will be using a single Python library: nltk. It can be installed from the command prompt or terminal using the following command:

pip install nltk

As always, if you run into issues when installing this library, please ask a committee member for help.

Let's Get Coding!

This week we are learning about sentiment analysis, the way computers can identify emotions from a given source, usually text or video.

We will be using a Naive Bayes classification model. Don't worry, you don't need to know what this is! But if you're interested click here.

To demonstrate, let's have a play with the NLTK library. NLTK (Natural Language Toolkit) is an easy-to-use platform for building Python programs that work with human language data.

The first step is importing the required features.

# Importing the required methods and structures from nltk
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

# Downloading the movie reviews dataset
nltk.download('movie_reviews')

This will allow you to access the library's resources.

Next, we will write a function to extract the key words that determine if a review is positive or negative.

Paste the below code at the bottom of your Python file

# Extracting key words from dataset
def extract_features(word_list):
    # Mark every word that appears in the list as present
    return {word: True for word in word_list}
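To see what extract_features produces, here is a small standalone sketch you can run in a separate file. The word list is made up for illustration and is not taken from the dataset.

```python
# Same function as in the tutorial, repeated so this snippet runs on its own
def extract_features(word_list):
    return dict([(word, True) for word in word_list])

# A made-up list of words (illustration only, not from the dataset)
sample_words = ["a", "great", "film", "a", "great", "cast"]

features = extract_features(sample_words)
print(features)

# Repeated words collapse into a single key, so only presence is recorded
print(len(sample_words), "words ->", len(features), "features")
```

Note that the function records only whether a word appears, not how many times it appears.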

Now that we have a function to work with the dataset, we can begin to build our model.

First we want to split the reviews into positive and negative. From now on, all code goes within this if statement.

Paste the below code at the bottom of your Python file

if __name__ == '__main__':
    # Load positive and negative reviews
    positive_fileids = movie_reviews.fileids('pos')
    negative_fileids = movie_reviews.fileids('neg')

Next we will use our extract_features function to take these reviews and gather the key words for both the positive and the negative reviews.

Paste the below code at the bottom of your Python file

    # Gathering positive key words
    features_positive = [(extract_features(movie_reviews.words(fileids=[f])), 'Positive') for f in positive_fileids]
    # Gathering negative key words
    features_negative = [(extract_features(movie_reviews.words(fileids=[f])), 'Negative') for f in negative_fileids]

Okay, now we have our data prepared to train our model! But how do we do this?

First we need to decide what percentage of our data is used to train the model. Play around with this value and see if you can get a more accurate model! I am using 80% training data, which leaves the remaining 20% for testing.

Paste the below code at the bottom of your Python file

    # Determine the threshold factor to use (80/20)
    threshold_factor = 0.8
    threshold_positive = int(threshold_factor * len(features_positive))
    threshold_negative = int(threshold_factor * len(features_negative))

Next we are going to use these thresholds to split the key words into training and test datasets.

Paste the below code at the bottom of your Python file

    # Split the data into test and training lists
    features_train = features_positive[:threshold_positive] + features_negative[:threshold_negative]
    features_test = features_positive[threshold_positive:] + features_negative[threshold_negative:]
    print("\nNumber of training datapoints:", len(features_train))
    print("Number of test datapoints:", len(features_test))
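If the slicing is hard to picture, the same threshold logic can be tried on a tiny made-up dataset. This is only a sketch to run in a separate file: the toy_ names and the ten fake reviews per label are assumptions for illustration and are not part of the tutorial script.

```python
# Toy stand-ins for features_positive and features_negative (illustration only)
toy_positive = [("pos review %d" % i, "Positive") for i in range(10)]
toy_negative = [("neg review %d" % i, "Negative") for i in range(10)]

# Same 80/20 threshold calculation as in the tutorial
threshold_factor = 0.8
threshold_positive = int(threshold_factor * len(toy_positive))
threshold_negative = int(threshold_factor * len(toy_negative))

# Everything before the threshold is training data, everything after is test data
toy_train = toy_positive[:threshold_positive] + toy_negative[:threshold_negative]
toy_test = toy_positive[threshold_positive:] + toy_negative[threshold_negative:]

print("Training datapoints:", len(toy_train))
print("Test datapoints:", len(toy_test))
```

Because each label is sliced separately, both the training and test lists keep the same balance of positive and negative examples.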


Now that we have set everything up, we can train and use our model!

Paste the below code at the bottom of your Python file

    # Train a Naive Bayes classifier
    classifier = NaiveBayesClassifier.train(features_train)
    print("\nAccuracy of the classifier:", nltk.classify.util.accuracy(classifier, features_test))

We have now created a Naive Bayes classifier, and trained it using the movie reviews we gathered and processed earlier! This is the model we will use to determine the sentiment of the text input.

Finally we can use this model to classify reviews! Enter some movie reviews and see the results after running the script.

Paste the below code at the bottom of your Python file

    # Sample input reviews
    input_reviews = [
        "add more reviews by adding to this list",
    ]

    print("\nPredictions:")
    for review in input_reviews:
        print("\nReview:", review)
        # Split the review into words and get a probability for each label
        probdist = classifier.prob_classify(extract_features(review.split()))
        # The predicted sentiment is the label with the highest probability
        pred_sentiment = probdist.max()
        print("Predicted sentiment:", pred_sentiment)
        print("Probability:", round(probdist.prob(pred_sentiment), 2))
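One thing worth knowing when you type your own reviews: review.split() only splits on whitespace, so a word like "Great" keeps its capital letter and "it!" keeps its punctuation. The words in the movie_reviews corpus are, as far as I'm aware, lowercase with punctuation split into separate tokens, so those inputs may not match any word the model was trained on. The sketch below compares the two approaches. simple_tokenize is a hypothetical helper name, not part of NLTK or the tutorial, and it is only a rough approximation of how the corpus is tokenized.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: lowercase the text, then return runs of word
    # characters and runs of punctuation as separate tokens
    return re.findall(r"\w+|[^\w\s]+", text.lower())

review = "Great film, loved it!"

# Whitespace splitting keeps capitals and attached punctuation
print(review.split())

# The helper separates punctuation and lowercases every token
print(simple_tokenize(review))
```

If you want to try it, you could pass simple_tokenize(review) to extract_features in place of review.split() and see whether the predictions change.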

Going Further

So, what are the next steps?

Text Sentiment Analysis is used in many different scenarios, in many industries.

Research some of these use cases and try to modify the code we wrote today: use a different dataset, or output different results such as the most important key words.
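As a starting point for the key words idea, NLTK's Naive Bayes classifier has a show_most_informative_features method. The sketch below uses it on a tiny made-up training set so that it runs quickly in a separate file. The four toy sentences and the toy_ names are assumptions for illustration only, and a dataset this small is far too limited to give meaningful results.

```python
from nltk.classify import NaiveBayesClassifier

# Same function as in the tutorial, repeated so this snippet runs on its own
def extract_features(word_list):
    return dict([(word, True) for word in word_list])

# Tiny made-up training set (illustration only, far too small for real use)
toy_train = [
    (extract_features("a great and fun film".split()), 'Positive'),
    (extract_features("great cast and a fun story".split()), 'Positive'),
    (extract_features("a dull and boring film".split()), 'Negative'),
    (extract_features("boring story and a dull cast".split()), 'Negative'),
]

toy_classifier = NaiveBayesClassifier.train(toy_train)

# Print the features that best separate the two labels
toy_classifier.show_most_informative_features(5)

# most_informative_features returns the features as a list instead of printing
top_features = toy_classifier.most_informative_features(5)
print(top_features)
```

In the script we wrote today, the equivalent would be calling classifier.show_most_informative_features(10) inside the if statement, after the classifier has been trained.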