---
title: "Sentiment analysis"
author: ""
date: ""
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Sentiment analysis is the study of the emotional content of a body of text. When we read text as humans, we infer the emotional content from the words used and from more subtle cues in how those words are put together. Sentiment analysis tries to do the same thing algorithmically.
One way of approaching the problem is to assess the sentiment of individual words, and then aggregate the sentiments of the words in a body of text in some way. For example, if we can classify whether each word is positive, negative, or neutral, we can count up the number of positive, negative, and neutral words in the document and define that as the sentiment of the document. This is just one way - a particularly simple way - of doing document-level sentiment analysis.
When assessing the sentiment or emotional content of individual words, we usually make use of existing sentiment dictionaries (or "lexicons") that have already done this using some kind of manual classification.
This notebook is an introduction to sentiment analysis, in which we will:
1. Introduce the sentiment lexicons that come with **tidytext**.
2. Look at how to aggregate sentiments over words to assess the sentiment of a longer sequence of text, like a tweet.
3. See how to handle "negation" words like "not" that reverse the sentiment of the word that follows them - for example "not good".
4. Use all of the above to analyze the emotional content of Donald Trump's tweets and examine how these have changed over time.
[Chapter 2](http://tidytextmining.com/sentiment.html) of *Text Mining with R* (TMR) covers sentiment analysis, and negation is handled in [Chapter 4](http://tidytextmining.com/ngrams.html#tokenizing-by-n-gram). Many of the ideas and some of the code in this workbook are drawn from these chapters.
First load the packages we need for this lesson:
```{r}
library(tidyverse)
library(tidytext)
library(stringr)
library(lubridate)
options(repr.plot.width=4, repr.plot.height=3) # set plot size in the notebook
```
Next, we load the .RData file containing the tweets and get the data into tidy text format. These are the same operations we did in the previous notebook, so go back to that notebook if you need more details about what is happening below.
> Note! This notebook has been updated and analyses tweets up to 18 August 2018. The results will look a bit different to the ones in the lecture video, which is from 2017. Comment out the call to load the up-to-2018 tweets if you want it to look like the video!
```{r}
load("data/trump-tweets-2018.RData")
# load("data/trump-tweets.RData") ### 2017 version
# make data a tibble
tweets <- as_tibble(tweets)
# parse the date and add some date related variables
tweets <- tweets %>%
  mutate(date = parse_datetime(str_sub(created_at, 5, 30), "%b %d %H:%M:%S %z %Y")) %>%
  mutate(is_prez = (date > ymd(20161108))) %>%
  mutate(month = make_date(year(date), month(date)))
# turn into tidy text
replace_reg <- "(https?:.*?([\\s]|[a-zA-Z0-9]$))|(www:.*?([\\s]|[a-zA-Z0-9]$))|&amp;|&lt;|&gt;|RT"
unnest_reg <- "[^A-Za-z_\\d#@']"
tidy_tweets <- tweets %>%
filter(!str_detect(text, "^RT")) %>% # remove retweets
mutate(text = str_replace_all(text, replace_reg, "")) %>% # remove stuff we don't want like links
unnest_tokens(word, text, token = "regex", pattern = unnest_reg) %>% # tokenize
filter(!word %in% stop_words$word, str_detect(word, "[a-z]")) %>% # remove stop words
select(date,word,is_prez,favorite_count,id_str,month) # choose the variables we need
```
## Using sentiment lexicons
The **tidytext** package comes with three existing sentiment lexicons or dictionaries. These describe the emotional content of individual words in different formats, and have been put together manually.
* *afinn*: a list of words given a positivity score between minus five (negative) and plus five (positive). The words have been manually labelled by Finn Arup Nielsen. See [here](https://finnaarupnielsen.wordpress.com/2011/03/16/afinn-a-new-word-list-for-sentiment-analysis/) and [here](http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010) for more details.
* *bing*: a sentiment lexicon created by Bing Liu and collaborators. A list of words are labelled as "positive" or "negative". More details [here](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html).
* *nrc*: a sentiment lexicon put together by Saif Mohammad and Peter Turney using crowdsourcing on Amazon Mechanical Turk. Words are labelled as "positive" or "negative", but also as "anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise", or "trust". A word can receive multiple labels. More details [here](http://saifmohammad.com/WebPages/lexicons.html).
```{r}
get_sentiments("afinn") %>% head(10)
get_sentiments("bing") %>% head(10)
get_sentiments("nrc") %>% head(10)
```
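To see the multiple labels mentioned above, we can pull out all the *nrc* rows for a single word ("abandoned" is just an illustrative choice):
```{r}
# a single word can carry several nrc labels
get_sentiments("nrc") %>% filter(word == "abandoned")
```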
Below we use the *bing* lexicon to add a new variable indicating whether each word in our `tidy_tweets` data frame is positive or negative. We use a left join here, which keeps *all* the words in `tidy_tweets`. Words appearing in our tweets but not in the *bing* lexicon will appear as `NA`. We rename these "neutral", but need to be a bit careful here. No sentiment lexicon contains all words, so some words that are *actually* positive or negative will be labelled as `NA` and hence "neutral". We can avoid this problem by using an inner join rather than a left join, by filtering out neutral words later on, or by just keeping in mind that "neutral" doesn't really mean "neutral".
There's one last issue: in the *bing* lexicon the word "trump" is positive, which will obviously skew the sentiment of Trump's tweets, particularly bearing in mind he often tweets about himself! We (rather generously I think) manually recode the sentiment of this word to "neutral".
```{r}
tidy_tweets <- tidy_tweets %>%
left_join(get_sentiments("bing")) %>% # add sentiments (pos or neg)
select(word,sentiment,everything()) %>%
mutate(sentiment = ifelse(word == "trump", NA, sentiment)) %>% # "trump" is a positive word in the bing lexicon!
mutate(sentiment = ifelse(is.na(sentiment), "neutral", sentiment))
```
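If you would rather avoid the somewhat artificial "neutral" category, an inner join keeps only the words that appear in the *bing* lexicon. A quick sketch (kept in a separate data frame, `tidy_tweets_inner`, so the rest of the notebook is unaffected):
```{r}
# alternative: an inner join keeps only words matched in the bing lexicon,
# so there are no NA / "neutral" rows to recode
tidy_tweets_inner <- tidy_tweets %>%
  select(-sentiment) %>%                              # drop the sentiment column added above
  inner_join(get_sentiments("bing"), by = "word") %>%
  filter(word != "trump")                             # "trump" would still need special handling
head(tidy_tweets_inner)
```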
Let's look at Trump's 20 most common positive words:
```{r}
tidy_tweets %>%
filter(sentiment == "positive") %>%
count(word) %>%
arrange(desc(n)) %>%
filter(rank(desc(n)) <= 20) %>%
ggplot(aes(reorder(word,n),n)) + geom_col() + coord_flip() + xlab("")
```
and the 20 most common negative words:
```{r}
tidy_tweets %>%
filter(sentiment == "negative") %>%
count(word) %>%
arrange(desc(n)) %>%
filter(rank(desc(n)) <= 20) %>%
ggplot(aes(reorder(word,n),n)) + geom_col() + coord_flip() + xlab("")
```
## Changes in sentiment over time
Once we have attached sentiments to words in our data frame, we can analyze these in various ways. For example, we can examine trends in sentiment over time. Here we count the number of positive, negative and neutral words used each month and plot these. Because the neutral words dominate, it's difficult to see any trends with them included (a quick check of this is shown below). We therefore remove the neutral words before plotting.
```{r}
sentiments_per_month <- tidy_tweets %>%
group_by(month, sentiment) %>%
summarize(n = n())
```
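To see why we drop the neutral words, first plot all three categories (the quick check suggested above); the neutral counts swamp the positive and negative ones:
```{r}
# with "neutral" included, neutral word counts dominate the plot
ggplot(sentiments_per_month, aes(x = month, y = n, fill = sentiment)) +
  geom_col()
```
Removing the neutral words gives the clearer picture below.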
```{r}
ggplot(filter(sentiments_per_month, sentiment != "neutral"), aes(x = month, y = n, fill = sentiment)) +
geom_col()
```
There doesn't seem to be any very clear trend here, but the variation in the number of tweets made each month makes it difficult to see. We can improve the visualization by plotting the *proportion* of all words tweeted in a month that were positive or negative. The plot shows the raw proportions as well as smoothed versions of these.
```{r}
sentiments_per_month <- sentiments_per_month %>%
left_join(sentiments_per_month %>%
group_by(month) %>%
summarise(total = sum(n))) %>%
mutate(freq = n/total)
```
```{r}
sentiments_per_month %>% filter(sentiment != "neutral") %>%
ggplot(aes(x = month, y = freq, colour = sentiment)) +
geom_line() +
geom_smooth(aes(colour = sentiment))
```
We can fit a simple linear model to check whether the proportion of negative words has increased over time. Strictly speaking, the linear model is not appropriate as the response is a proportion bounded between 0 and 1 - you could try fitting e.g. a binomial GLM instead (a rough sketch follows the output below).
```{r}
model <- lm(freq ~ month, data = subset(sentiments_per_month, sentiment == "negative"))
summary(model)
```
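As suggested above, a binomial GLM respects the fact that the response is a proportion. A minimal sketch, using the monthly negative-word count `n` out of `total` words already in `sentiments_per_month`:
```{r}
# rough sketch: model the monthly negative-word count as binomial(total, p)
neg_per_month <- filter(sentiments_per_month, sentiment == "negative")
glm_model <- glm(cbind(n, total - n) ~ month, family = binomial, data = neg_per_month)
summary(glm_model)
```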
## Aggregating sentiment over words
So far we've looked at the sentiment of individual words. How can we assess the sentiment of longer sequences of text, like bigrams, sentences or entire tweets? One approach is to attach sentiments to each word in the longer sequence, and then add up the sentiments over words. This isn't the only way, but it is relatively easy to do and fits in nicely with the use of tidy text data.
Suppose we want to analyze the sentiment of entire tweets. We'll measure the positivity of a tweet by the difference in the number of positive and negative words used in the tweet.
```{r}
sentiments_per_tweet <- tidy_tweets %>%
group_by(id_str) %>%
summarize(net_sentiment = (sum(sentiment == "positive") - sum(sentiment == "negative")),
month = first(month))
```
To see if the measure makes sense, let's have a look at the most negative tweets.
```{r}
tweets %>%
left_join(sentiments_per_tweet) %>%
arrange(net_sentiment) %>%
head(10) %>%
select(text, net_sentiment)
```
And the most positive tweets:
```{r}
tweets %>%
left_join(sentiments_per_tweet) %>%
arrange(desc(net_sentiment)) %>%
head(10) %>%
select(text, net_sentiment)
```
We can also look at trends over time. The plot below shows the proportion of monthly tweets that were negative (i.e. where the number of negative words exceeded the number of positive ones).
```{r}
sentiments_per_tweet %>%
group_by(month) %>%
summarize(prop_neg = sum(net_sentiment < 0) / n()) %>%
ggplot(aes(x = month, y = prop_neg)) +
geom_line() + geom_smooth()
```
## Dealing with negation
One problem we haven't considered yet is what to do with terms like "not good", where a positive word is negated by the use of "not" before it. We need to reverse the sentiment of words that are preceded by negation words like not, never, *etc*.
We'll do this in the context of a sentiment analysis on bigrams. We start by creating the bigrams, and separating the two words making up each bigram. This is the same code used in the previous notebook.
```{r}
bigrams_separated <- tweets %>%
filter(!str_detect(text, "^RT")) %>%
mutate(text = str_replace_all(text, replace_reg, "")) %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ")
```
Then we use the *bing* sentiment dictionary to look up the sentiment of each word in each bigram.
```{r}
bigrams_separated <- bigrams_separated %>%
# add sentiment for word 1
left_join(get_sentiments("bing"), by = c(word1 = "word")) %>%
rename(sentiment1 = sentiment) %>%
mutate(sentiment1 = ifelse(word1 == "trump", NA, sentiment1)) %>%
mutate(sentiment1 = ifelse(is.na(sentiment1), "neutral", sentiment1)) %>%
# add sentiment for word 2
left_join(get_sentiments("bing"), by = c(word2 = "word")) %>%
rename(sentiment2 = sentiment) %>%
mutate(sentiment2 = ifelse(word2 == "trump", NA, sentiment2)) %>%
mutate(sentiment2 = ifelse(is.na(sentiment2), "neutral", sentiment2)) %>%
select(month,word1,word2,sentiment1,sentiment2,everything())
```
Now we need a list of words that we consider to be negation words. I'll use the following set, taken from TMR [Chapter 4](http://tidytextmining.com/ngrams.html), and show a few examples.
```{r}
negation_words <- c("not", "no", "never", "without")
# show a few
filter(bigrams_separated, word1 %in% negation_words) %>%
head(10) %>% select(month, word1, word2, sentiment1, sentiment2) # for display purposes
```
We now reverse the sentiment of `word2` whenever it is preceded by a negation word, and then add up the number of positive and negative words within a bigram and take the difference. That difference (a score from -2 to +2) is the sentiment of the bigram.
We do this in two steps for illustrative purposes. First we reverse the sentiment of the second word in the bigram if the first one is a negation word.
```{r}
bigrams_separated <- bigrams_separated %>%
# create a variable that is the opposite of sentiment2
mutate(opp_sentiment2 = recode(sentiment2, "positive" = "negative",
"negative" = "positive",
"neutral" = "neutral")) %>%
# reverse sentiment2 if word1 is a negation word
mutate(sentiment2 = ifelse(word1 %in% negation_words, opp_sentiment2, sentiment2)) %>%
# remove the opposite sentiment variable, which we don't need any more
select(-opp_sentiment2)
```
Next, we calculate the sentiment of each bigram and join up the words in the bigram again.
```{r}
bigrams_separated <- bigrams_separated %>%
mutate(net_sentiment = (sentiment1 == "positive") + (sentiment2 == "positive") -
(sentiment1 == "negative") - (sentiment2 == "negative")) %>%
unite(bigram, word1, word2, sep = " ", remove = FALSE)
```
Below we show Trump's most common positive and negative bigrams.
```{r}
bigrams_separated %>%
filter(net_sentiment > 0) %>% # get positive bigrams
count(bigram, sort = TRUE) %>%
filter(rank(desc(n)) < 20) %>%
ggplot(aes(reorder(bigram,n),n)) + geom_col() + coord_flip() + xlab("")
```
```{r}
bigrams_separated %>%
filter(net_sentiment < 0) %>% # get negative bigrams
count(bigram, sort = TRUE) %>%
filter(rank(desc(n)) < 20) %>%
ggplot(aes(reorder(bigram,n),n)) + geom_col() + coord_flip() + xlab("")
```
None of the most common negative bigrams have negated words in them, but some that are slightly less frequently used do. Notice that the most frequently used bigram below is "no wonder" - which is not really negative, although you can see how, using the approach we have, it has ended up classified as such. Cases like these would need to be handled on an individual basis (one way of doing so is sketched after the plot).
```{r}
bigrams_separated %>%
filter(net_sentiment < 0) %>% # get negative bigrams
filter(word1 %in% negation_words) %>% # get bigrams where first word is negation
count(bigram, sort = TRUE) %>%
filter(rank(desc(n)) < 20) %>%
ggplot(aes(reorder(bigram,n),n)) + geom_col() + coord_flip() + xlab("")
```
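As an example of handling such cases individually, one option is to keep a small list of exception bigrams whose computed sentiment we override. The list below is purely illustrative:
```{r}
# hypothetical exception list: bigrams whose net sentiment we force to neutral (0)
bigram_exceptions <- tibble(bigram = c("no wonder", "no doubt"), override = 0)

bigrams_corrected <- bigrams_separated %>%
  left_join(bigram_exceptions, by = "bigram") %>%
  mutate(net_sentiment = ifelse(is.na(override), net_sentiment, override)) %>%
  select(-override)
```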
## Exercises
1. Adverbs like "very", "extremely", "totally", "really", "quite", "marginally", *etc* increase or decrease the magnitude of the sentiment that follows them. Think of a few qualifiers and build these into the analyses above. What are the most positive and negative tweets once these qualifiers have been added? Does it affect the assessment of whether the sentiment of Trump's tweets has changed over time?
2. Is there a relationship between the sentiment of a tweet and its popularity? Are negative tweets less popular than positive ones?
3. Is there a "day of the week" effect on the number and sentiment of tweets? Are weekends any different to weekdays?
4. Use the more nuanced *nrc* lexicon, which categorises words as "anger", "fear", *etc* to re-analyse Trump's tweets. Can you pick up anything interesting?
5. Note the appearance of the bigrams "miss usa" and "miss universe" in the plot of most common negative bigrams in the notebook. This is a clear misinterpretation of "miss". How might you remove the ambiguity in these kinds of bigrams in a general way? (Either just think of a way or, better, implement it!)