Quantifying emotions in language using code!

Quite often, the data that we have is in a written format. We may want to convert it into numbers for analysis, and we could do this by hand using a set of criteria we devise ourselves. However, this introduces human bias. To reduce it, we can write code which takes a chunk of text as input, processes it using a technique called Sentiment Analysis, and outputs a meaningful number.

Prerequisites

Ensure that you read this starter tutorial to learn the basics of Colaboratory and how to navigate the files and interface. We’ll be using Colaboratory for our code, so we won’t need to download anything.

Stage 1 – Getting text from the user

We need some text to analyse. This can be obtained in a few different ways, including automated methods that pull text from websites or Excel spreadsheets. For now, we’re going to keep it simple: let’s ask the user to enter some text each time they run the code, and assign it to a variable called “text”. This is done like so:

text = input("Input text here")

This prompts the user with “Input text here” and an empty text box. The user types in their text and hits ENTER to assign it to the variable “text”.

For our purposes, let’s use the following block of text from BBC News:

The 5p fee for plastic carrier bags in England will be doubled to 10p, and extended to all shops, under plans set out by the environment secretary.

The change is contained in a government consultation aimed at further reducing the plastic used by consumers and could come into effect in January 2020.

Smaller retailers, who are exempt from the current levy, supply an estimated 3.6 billion single-use bags annually.

Schools in England are also being told to eliminate unnecessary plastic.

Education Secretary Damian Hinds is urging school leaders to replace items such as plastic straws, bottles and food packaging with sustainable alternatives by 2022.

Now let’s analyse the text!

Stage 2 – Sentiment Analysis using a pre-defined classifier

VADER stands for “Valence Aware Dictionary and sEntiment Reasoner”. It is available both as a standalone Python package (vaderSentiment) and as part of the NLTK (Natural Language Toolkit) package. In short, it’s a package designed to score the sentiment of social media posts and other informal written/spoken language. It is fully open source, making it a perfect package to use for large-scale projects.

Nuances of VADER

VADER is able to handle some of the complexities of NLP that simpler tools such as TextBlob cannot. For example, VADER takes into account punctuation, capitalisation of words, and conjunctions (e.g. ‘but’), which simpler classifiers often miss.

NB. For absolute gold-standard results, it would be ideal to extract positive and negative phrases/words from our own data and train our own Naive Bayes classifier. However, for the purposes of this tutorial, we’re not going to go that far.

Let’s install VADER with !pip install vaderSentiment. Once that’s done, we can test it out. We need to import the SentimentIntensityAnalyzer class from vaderSentiment.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()

We also create a variable “analyser” that holds an instance of the SentimentIntensityAnalyzer() class (NB. American spelling), which lets us use its built-in functions much more easily.

We can then create a function called sentiment_analyzer_scores, which takes one input called sentence. The name sentence is arbitrary; it’s just a placeholder that tells the function what we will give it. We supply the actual input when we “call” the function, for example: sentiment_analyzer_scores(variable containing text to process is put here).

def sentiment_analyzer_scores(sentence):
    score = analyser.polarity_scores(sentence)
    print(sentence, "\n", score)

Our function takes one input, sentence, gets the polarity score of that sentence and assigns it to the variable score, then prints the sentence and the score with a new line – "\n" – between them.
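We can now call the function, passing in the text we captured in Stage 1:

sentiment_analyzer_scores(text)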

When we run our text from above, we find the following:
{'neg': 0.0, 'neu': 0.986, 'pos': 0.014, 'compound': 0.1027}

There are 4 values given to us:

Value      Description
neg        The proportion of the text that was negative
neu        The proportion of the text that was neutral
pos        The proportion of the text that was positive
compound   An overall sentiment score for the text, ranging from -1 (highly negative) to +1 (highly positive)

Note that neg + neu + pos must sum to 1, as they are proportions of the text.
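As a quick illustration of the nuances mentioned earlier, we can score two made-up sentences that differ only in emphasis; capitalisation and exclamation marks should push the scores further from neutral (the sentences here are invented purely for demonstration):

#Two invented sentences showing how emphasis affects the scores
sentiment_analyzer_scores("The results are good.")
sentiment_analyzer_scores("The results are GOOD!!!")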

Importing a word document for analysis

For most applications, it’s much easier to import a Word document containing the text to analyse. We can do so using our existing code, with a slight alteration: we need a package called docx2txt (NB. this works for .docx files only).

Let’s install it using:
!pip install docx2txt

We can then upload our Word document by clicking the upload button in the left-hand pane.

Now that we have the docx2txt package and our Word document, we need to import the package and process our text. We can copy the name of our file by right-clicking the document in the left-hand pane and choosing Copy path, then pasting it in place of “testfile.docx”:

import docx2txt
my_text = docx2txt.process("testfile.docx")
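As a quick sanity check that the import worked, we can preview the start of the extracted text (the slice length of 200 characters is an arbitrary choice):

#Preview the first 200 characters of the extracted text
print(my_text[:200])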

Finally, we run the analysis as before:

sentiment_analyzer_scores(my_text)

Great! So far we have determined the sentiment. We can also see which words appear most often within the text.

Stage 3 – Metrics

Let’s get some metrics so we have a better understanding of the data. We’re going to write some code to remove the “stop words” from our text to improve its quality. Stop words are words that carry no sentiment and are used primarily to allow flow when reading. We do this by first defining a list of stop words:

stoplist = 'a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your'

Once we’ve done that, we can remove these words from our text input. To do that, we need to replace the commas in the stoplist with spaces, so it looks like stoplist = 'a able about across... instead. This is done so we’re not searching for “a,” (including a comma) to remove.

stopwords = stoplist.replace(',', ' ')
stoplist = set(stopwords.split())

We use the .replace function on our stopwords string to replace commas with spaces. Then we create a set using the set() function, where the stopwords have been split into individual items. So rather than one long string containing the whole list of stopwords, we now have a set with multiple items, each being an individual stopword.
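To make that concrete, here is the same transformation on a miniature three-word stoplist:

#Miniature example of the same transformation
demo = 'a,able,about'
demo_spaced = demo.replace(',', ' ')   #'a able about'
demo_set = set(demo_spaced.split())    #{'a', 'able', 'about'}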

From there, we can use a data type called a dictionary to store our words and measure how many times they appear in the text. A dictionary has two key features – keys and values. In our case, the keys are the words in the text, and the values represent the number of times each word has been seen. From an abstract perspective, we must add words that have not been seen before with a value of 1, and add 1 to the value of any word that is already in the dictionary.

We can flesh that out in code using a function:

def word_count(input_text):
    #Create an empty dictionary called "counts" and split the inputted text into words
    counts = dict()
    words = input_text.split()

    for word in words:
        #Adds 1 to the count of a word if it's not a stopword and has been seen before
        if word in counts and word not in stoplist:
            counts[word] += 1
        #Creates an entry for the word if it's not a stopword and hasn't been seen yet
        elif word not in stoplist:
            counts[word] = 1
        #If the word is a stopword, then ignore it (aka pass and move to the next word)
        else:
            pass
    #The output of the function - the completed dictionary called "counts"
    return counts


counted_words = word_count(text)

This function contains code that runs on the input and has an output, which in our case is the completed dictionary counts, given back to us using the command return.

When we run it – we get the following output:

{'5p': 1, 'fee': 1, 'plastic': 3, 'carrier': 1, 'bags': 2, 'England': 2, 'doubled': 1, ...}

Sorting the dictionary

To make it easier to interpret, we can sort the dictionary items by their values. This is done using a small anonymous function created with the keyword lambda.

sorted_by_value = sorted(counted_words.items(), key=lambda kv: kv[1], reverse=True)
print(sorted_by_value)

Lambda here allows us to define a “mini-function” inline. The mini-function takes each (word, frequency) pair and returns its second element – the frequency. The second element is denoted by [1] because of zero-indexing, where the first element is actually the 0th rather than the 1st. reverse=True puts the highest frequency at the top of the list.
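If it helps, the lambda is just shorthand for a small named function; this version behaves identically (the name second_element is our own choice):

#Equivalent version using a named helper instead of lambda
def second_element(kv):
    return kv[1]

sorted_by_value = sorted(counted_words.items(), key=second_element, reverse=True)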

When we run it – we get the following output:

[('plastic', 3), ('bags', 2), ('England', 2), ('5p', 1), ('fee', 1), ('carrier', 1), ('doubled', 1), ...]

Looks like it worked! We can also plot the data if we want to analyse it graphically.

Stage 4 – Plotting word frequency

We can plot our graphs using the Python package – Matplotlib. It’s pre-installed on Colaboratory so we don’t need to install it. Let’s import it and tell Colaboratory to show our graphs in the notebook:

#Imports the Matplotlib module to use to plot the frequency of words
import matplotlib.pyplot as plt
#Tells Matplotlib to show our graphs in this notebook
%matplotlib inline

We can then create our plot:

#Creates an empty figure of size 20x10
plt.figure(figsize=(20, 10)) # This increases chart size

#Tells Matplotlib to create a bar chart from 0-(length of the dictionary), with height being equal to the values of each key
plt.bar(range(len(counted_words)), list(counted_words.values()), align='center')

#Changes the numbers 0-(length of dictionary) to be equal to the words in the dictionary
#font size can be changed to xx-small if text is overlapping
plt.xticks(range(len(counted_words)), list(counted_words.keys()), rotation='vertical', fontsize='small')

#Saves the figure to a file called 'output.png'
plt.savefig('output.png', dpi=100)

If we click refresh in the left-hand pane (under Files), we can see that we have a new file called ‘output.png’, which is our graph. We can download it by right-clicking and choosing Download.

Fantastic! We can see that it’s worked and we’ve got a graphical representation of our word frequencies.
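If the chart is too crowded to read, one option is to plot only the most frequent words by reusing the sorted_by_value list from Stage 3 (the cut-off of 20 words is an arbitrary choice):

#Plot only the 20 most frequent words, using the sorted list from Stage 3
top_n = 20
top_words = [kv[0] for kv in sorted_by_value[:top_n]]
top_counts = [kv[1] for kv in sorted_by_value[:top_n]]

plt.figure(figsize=(12, 6))
plt.bar(range(len(top_words)), top_counts, align='center')
plt.xticks(range(len(top_words)), top_words, rotation='vertical', fontsize='small')
plt.savefig('output_top20.png', dpi=100)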

Stage 5 – Large Scale Sentiment Analysis

We may also have a very large dataset, for example from a survey, which we want to analyse. Here it’s best to automate the process rather than manually copying and pasting! We can do this using Microsoft Excel. For this next section, we’re assuming that our data sits in a single-column Excel spreadsheet.

Let’s upload the Excel spreadsheet to Colaboratory as we did before, using the ‘Upload’ button in the left-hand pane. Once that’s done, right-click the file and choose Copy path.



Let’s import the Pandas package, and our excel file.

import pandas as pd

df = pd.read_excel('positive_comments.xlsx')
df.head()

We’ve named a variable df to represent the Excel file, and then called df.head(), which prints the top 5 rows of the data. We’ve called it df because it stands for dataframe, a common term for a 2-dimensional labelled data structure whose columns can hold different types. In short, df holds the contents of our Excel file.
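If you don’t have a spreadsheet to hand, you can build a small stand-in dataframe to test the rest of the pipeline (the column name and comments below are entirely made up):

#Hypothetical stand-in for the spreadsheet, for testing only
df = pd.DataFrame({'comment': [
    "I love this product, it works brilliantly!",
    "Delivery was slow and the packaging was damaged.",
    "It's okay, nothing special."
]})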

Let’s also bring our VADER analysis function back:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()

def sentiment_analyzer_scores(sentence):
    score = analyser.polarity_scores(sentence)
    print(score)
    return score

From there, we’re going to add some columns into the data to enable us to add in the values for neg, neu, pos, and compound.

#Creates empty lists for each metric
neg_list = []
neu_list = []
pos_list = []
compound_list = []


#Creates an empty column for each metric in the dataframe
df['neg'] = ""
df['neu'] = ""
df['pos'] = ""
df['compound'] = ""

#Loops over the rows to add each value to the respective list
for i in range(len(df)):
  results = sentiment_analyzer_scores(df.iloc[i, 0])
  neg_list.append(results['neg'])
  neu_list.append(results['neu'])
  pos_list.append(results['pos'])
  compound_list.append(results['compound'])

#These 4 blocks then add the values from each list to the respective empty columns we created above
series_neg = pd.Series(neg_list)
df['neg'] = series_neg.values

series_neu = pd.Series(neu_list)
df['neu'] = series_neu.values

series_pos = pd.Series(pos_list)
df['pos'] = series_pos.values

series_compound = pd.Series(compound_list)
df['compound'] = series_compound.values


df

What we’re doing here is creating empty lists and empty columns which are then filled with the numbers for each metric. First the list is filled, then the values are copied into the columns.
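As an aside, pandas can achieve the same result more compactly with .apply, though the loop above makes each step explicit. A sketch, assuming the comments sit in the first column:

#More compact alternative: apply the analyser row by row
scores = df.iloc[:, 0].apply(analyser.polarity_scores)
df['neg'] = scores.apply(lambda s: s['neg'])
df['neu'] = scores.apply(lambda s: s['neu'])
df['pos'] = scores.apply(lambda s: s['pos'])
df['compound'] = scores.apply(lambda s: s['compound'])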

For the full code on large-scale sentiment analysis, check it out on GitHub here.

Fantastic – looks like it worked!
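If we want to keep the results, we can also write the annotated table back out to a new spreadsheet (the filename is our own choice):

#Saves the dataframe, including the new sentiment columns, to a new Excel file
df.to_excel('comments_with_sentiment.xlsx', index=False)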

Summary

  • VADER is a great tool to analyse sentiment in text that typically would be difficult to analyse with other packages.
  • Dictionaries can be used to iterate through a document and measure word frequencies.
  • For improved NLP classification, it is recommended to train a classifier on your data to tailor it for your needs.
  • We can automate large-scale sentiment analysis using Excel and loops to go through the data.