The SentimentAnalysis package introduces a powerful toolchain facilitating the sentiment
analysis of textual contents in R. This implementation utilizes various
existing dictionaries, such as QDAP, Harvard IV and Loughran-McDonald.
Furthermore, it can also create customized dictionaries. The latter
function uses LASSO regularization as a statistical approach to select
relevant terms based on an exogenous response variable. Finally, all
methods can be easily compared using built-in evaluation routines.
Sentiment analysis is a research branch located at the heart of natural language processing (NLP), computational linguistics and text mining. It refers to any method by which subjective information is extracted from textual documents. In other words, it extracts the polarity of the expressed opinion in a range spanning from positive to negative. As a result, one may also refer to sentiment analysis as opinion mining (Pang and Lee 2008).
Sentiment analysis has received great traction lately (Ravi and Ravi 2015; Pang and Lee 2008), which we explore in the following. Current research in finance and the social sciences utilizes sentiment analysis to understand human decisions in response to textual materials. This immediately reveals manifold implications for practitioners, as well as those involved in the fields of finance research and the social sciences: researchers can use R to extract text components that are relevant for readers and test their hypotheses on this basis. By the same token, practitioners can measure which wording actually matters to their readership and enhance their writing accordingly (Pröllochs, Feuerriegel, and Neumann 2015). We demonstrate below the added benefits in two case studies drawn from finance and the social sciences.
Several applications demonstrate the uses of sentiment analysis for organizations and enterprises:
Finance: Investors in financial markets refer to textual information in the form of financial news disclosures before exercising ownership in stocks. Interestingly, they rely not only on quantitative numbers, but also soft information, such as tone and sentiment (Henry 2008; Loughran and McDonald 2011; Tetlock 2007), which thereby strongly influences stock prices. By utilizing sentiment analysis, automated traders can automatically analyze the sentiment conveyed in financial disclosures in order to trigger investment decisions within milliseconds.
Marketing: Marketing departments are often interested in tracking brand image. For that purpose, they collect large volumes of user opinions from social media and evaluate the feelings of individuals towards brands, products and services. Practitioners in the field of marketing can exploit these insights to enhance their wording according to the feedback of their readership.
Rating and review platforms: Rating and review platforms fulfill a valuable function by collecting user ratings or preferences for certain products and services. Here, one can automatically process large volumes of user-generated content and exploit the knowledge gained thereby. For example, one can identify which cues convey a positive or negative opinion, or even automatically validate their credibility.
As sentiment analysis is applied to a broad variety of domains and textual sources, research has devised various approaches to measuring sentiment. A recent literature overview (Pang and Lee 2008) provides a comprehensive, domain-independent survey.
On the one hand, machine learning approaches are preferred when one strives for high prediction performance. However, machine learning usually works as a black box, thereby making interpretations difficult. On the other hand, dictionary-based approaches generate lists of positive and negative words. The respective occurrences of these words are then combined into a single sentiment score. Therefore, the underlying decisions become traceable and researchers can understand the factors that result in a specific sentiment.
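To make the dictionary-based idea concrete, the following base R sketch counts positive and negative words and combines them into a single score. The word lists and the function name `score_sentiment` are hypothetical and only serve as an illustration; this is not the package's implementation.

```r
# Hypothetical mini-dictionary; real dictionaries contain many more entries
positive_words <- c("good", "great", "excellent")
negative_words <- c("bad", "poor", "miserable")

# Score = (#positive - #negative) / #tokens, so the result lies in [-1, 1]
score_sentiment <- function(text) {
  tokens <- strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) == 0) return(0)
  (sum(tokens %in% positive_words) - sum(tokens %in% negative_words)) / length(tokens)
}

scores <- vapply(c("The food was great", "What poor service", "It is a table"),
                 score_sentiment, numeric(1), USE.NAMES = FALSE)
scores
```

Because every score is a simple count ratio, one can trace each result back to the individual words that triggered it.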
In addition, SentimentAnalysis allows one to generate
tailored dictionaries. These are customized to a specific domain,
improve prediction performance compared to pure dictionaries and allow
full interpretability. Details of this methodology can be found in
Pröllochs, Feuerriegel, and Neumann (2018).
In the process of performing sentiment analysis, one must convert the
running text into a machine-readable format. This is achieved by
executing a series of preprocessing operations. First, the text is
tokenized into single words, followed by what are common preprocessing
steps: stopword removal, stemming, removal of punctuation and conversion
to lower-case. These operations are also conducted by default in
SentimentAnalysis, but can be adapted to one’s personal needs.
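The preprocessing steps can be sketched in base R as follows. This is a simplified illustration under stated assumptions: the stopword list is a hypothetical stand-in, and stemming is omitted because it requires an external stemmer; the package relies on its own routines instead.

```r
# Hypothetical stopword list for illustration only
stopword_list <- c("the", "is", "a", "this", "was")

preprocess <- function(text) {
  text <- tolower(text)                       # conversion to lower-case
  text <- gsub("[[:punct:]]", " ", text)      # removal of punctuation
  tokens <- strsplit(text, "\\s+")[[1]]       # tokenization into single words
  tokens <- tokens[nzchar(tokens)]            # drop empty tokens
  tokens[!tokens %in% stopword_list]          # stopword removal
}

tokens <- preprocess("This is a GREAT result, really!")
tokens
```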
Even though sentiment analysis has received great traction lately,
the available tools are not yet living up to the needs of researchers.
The SentimentAnalysis package is intended to partially
close this gap and offer capabilities that most research demands.
First, install the SentimentAnalysis package from CRAN. Afterwards, one
merely needs to load the SentimentAnalysis package as follows. This
section shows the basic functionality for analyzing the sentiment of
textual content. The following lines compute the sentiment of a single
sentence and of a small set of example documents.
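The attach message printed next is the kind of output produced when the package is loaded. A minimal sketch of the installation and loading step, assuming a standard CRAN setup:

```r
# install.packages("SentimentAnalysis")  # run once; fetches the package from CRAN
library(SentimentAnalysis)
```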
##
## Attaching package: 'SentimentAnalysis'
## The following object is masked from 'package:base':
##
## write
# Analyze a single string to obtain a binary response (positive / negative)
sentiment <- analyzeSentiment("Yeah, this was a great soccer game for the German team!")
convertToBinaryResponse(sentiment)$SentimentQDAP
## [1] positive
## Levels: negative positive
# Create a vector of strings
documents <- c("Wow, I really like the new light sabers!",
"That book was excellent.",
"R is a fantastic language.",
"The service in this restaurant was miserable.",
"This is neither positive or negative.",
"The waiter forget about my dessert -- what poor service!")
# Analyze sentiment
sentiment <- analyzeSentiment(documents)
# Extract dictionary-based sentiment according to the QDAP dictionary
sentiment$SentimentQDAP
## [1] 0.3333333 0.5000000 0.5000000 -0.3333333 0.0000000 -0.4000000
# View sentiment direction (i.e. positive, neutral and negative)
convertToDirection(sentiment$SentimentQDAP)
## [1] positive positive positive negative neutral negative
## Levels: negative neutral positive
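The evaluation tables that follow compare the sentiment scores with a gold standard. The call that produced them is not shown here; a plausible sketch uses `compareToResponse()` and `plotSentimentResponse()` from the package. The response vector below is an assumption (hand-assigned labels for the six documents) and may differ from the values behind the printed results.

```r
library(SentimentAnalysis)

documents <- c("Wow, I really like the new light sabers!",
               "That book was excellent.",
               "R is a fantastic language.",
               "The service in this restaurant was miserable.",
               "This is neither positive or negative.",
               "The waiter forget about my dessert -- what poor service!")
sentiment <- analyzeSentiment(documents)

# Assumed gold standard: +1 positive, 0 neutral, -1 negative
response <- c(+1, +1, +1, -1, 0, -1)

# Correlation, error and classification metrics for every rule
compareToResponse(sentiment, response)

# Classification metrics only, based on a binary response
compareToResponse(sentiment, convertToBinaryResponse(response))

# Scatter plot of sentiment scores against the response
plotSentimentResponse(sentiment$SentimentQDAP, response)
```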
## Warning in cor(sentiment, response): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(sentiment, response): the standard deviation is zero
## WordCount SentimentGI NegativityGI PositivityGI
## cor -0.18569534 0.990011498 -9.974890e-01 0.942954167
## cor.t.statistic -0.37796447 14.044046450 -2.816913e+01 5.664705543
## cor.p.value 0.72465864 0.000149157 9.449687e-06 0.004788521
## lm.t.value -0.37796447 14.044046450 -2.816913e+01 5.664705543
## r.squared 0.03448276 0.980122766 9.949843e-01 0.889162562
## RMSE 3.82970843 0.450102869 1.186654e+00 0.713624032
## MAE 3.33333333 0.400000000 1.100000e+00 0.666666667
## Accuracy 0.66666667 1.000000000 6.666667e-01 0.666666667
## Precision NaN 1.000000000 NaN NaN
## Sensitivity 0.00000000 1.000000000 0.000000e+00 0.000000000
## Specificity 1.00000000 1.000000000 1.000000e+00 1.000000000
## F1 NaN 1.000000000 NaN NaN
## BalancedAccuracy 0.50000000 1.000000000 5.000000e-01 0.500000000
## avg.sentiment.pos.response 3.25000000 0.333333333 8.333333e-02 0.416666667
## avg.sentiment.neg.response 4.00000000 -0.633333333 6.333333e-01 0.000000000
## SentimentHE NegativityHE PositivityHE SentimentLM
## cor 0.4152274 -0.083045480 0.3315938 0.7370455
## cor.t.statistic 0.9128709 -0.166666667 0.7029595 2.1811142
## cor.p.value 0.4129544 0.875718144 0.5208394 0.0946266
## lm.t.value 0.9128709 -0.166666667 0.7029595 2.1811142
## r.squared 0.1724138 0.006896552 0.1099545 0.5432361
## RMSE 0.8416254 0.922958207 0.8525561 0.7234178
## MAE 0.7500000 0.888888889 0.8055556 0.6333333
## Accuracy 0.6666667 0.666666667 0.6666667 0.8333333
## Precision NaN NaN NaN 1.0000000
## Sensitivity 0.0000000 0.000000000 0.0000000 0.5000000
## Specificity 1.0000000 1.000000000 1.0000000 1.0000000
## F1 NaN NaN NaN 0.6666667
## BalancedAccuracy 0.5000000 0.500000000 0.5000000 0.7500000
## avg.sentiment.pos.response 0.1250000 0.083333333 0.2083333 0.2500000
## avg.sentiment.neg.response 0.0000000 0.000000000 0.0000000 -0.1000000
## NegativityLM PositivityLM RatioUncertaintyLM
## cor -0.40804713 0.6305283 NA
## cor.t.statistic -0.89389841 1.6247248 NA
## cor.p.value 0.42189973 0.1795458 NA
## lm.t.value -0.89389841 1.6247248 NA
## r.squared 0.16650246 0.3975659 NA
## RMSE 0.96186547 0.7757911 0.9128709
## MAE 0.92222222 0.7222222 0.8333333
## Accuracy 0.66666667 0.6666667 0.6666667
## Precision NaN NaN NaN
## Sensitivity 0.00000000 0.0000000 0.0000000
## Specificity 1.00000000 1.0000000 1.0000000
## F1 NaN NaN NaN
## BalancedAccuracy 0.50000000 0.5000000 0.5000000
## avg.sentiment.pos.response 0.08333333 0.3333333 0.0000000
## avg.sentiment.neg.response 0.10000000 0.0000000 0.0000000
## SentimentQDAP NegativityQDAP PositivityQDAP
## cor 0.9865356369 -0.944339551 0.942954167
## cor.t.statistic 12.0642877257 -5.741148345 5.664705543
## cor.p.value 0.0002707131 0.004560908 0.004788521
## lm.t.value 12.0642877257 -5.741148345 5.664705543
## r.squared 0.9732525629 0.891777188 0.889162562
## RMSE 0.5398902495 1.068401367 0.713624032
## MAE 0.4888888889 1.011111111 0.666666667
## Accuracy 1.0000000000 0.666666667 0.666666667
## Precision 1.0000000000 NaN NaN
## Sensitivity 1.0000000000 0.000000000 0.000000000
## Specificity 1.0000000000 1.000000000 1.000000000
## F1 1.0000000000 NaN NaN
## BalancedAccuracy 1.0000000000 0.500000000 0.500000000
## avg.sentiment.pos.response 0.3333333333 0.083333333 0.416666667
## avg.sentiment.neg.response -0.3666666667 0.366666667 0.000000000
## WordCount SentimentGI NegativityGI PositivityGI
## Accuracy 0.6666667 1.0000000 0.66666667 0.6666667
## Precision NaN 1.0000000 NaN NaN
## Sensitivity 0.0000000 1.0000000 0.00000000 0.0000000
## Specificity 1.0000000 1.0000000 1.00000000 1.0000000
## F1 NaN 1.0000000 NaN NaN
## BalancedAccuracy 0.5000000 1.0000000 0.50000000 0.5000000
## avg.sentiment.pos.response 3.2500000 0.3333333 0.08333333 0.4166667
## avg.sentiment.neg.response 4.0000000 -0.6333333 0.63333333 0.0000000
## SentimentHE NegativityHE PositivityHE SentimentLM
## Accuracy 0.6666667 0.66666667 0.6666667 0.8333333
## Precision NaN NaN NaN 1.0000000
## Sensitivity 0.0000000 0.00000000 0.0000000 0.5000000
## Specificity 1.0000000 1.00000000 1.0000000 1.0000000
## F1 NaN NaN NaN 0.6666667
## BalancedAccuracy 0.5000000 0.50000000 0.5000000 0.7500000
## avg.sentiment.pos.response 0.1250000 0.08333333 0.2083333 0.2500000
## avg.sentiment.neg.response 0.0000000 0.00000000 0.0000000 -0.1000000
## NegativityLM PositivityLM RatioUncertaintyLM
## Accuracy 0.66666667 0.6666667 0.6666667
## Precision NaN NaN NaN
## Sensitivity 0.00000000 0.0000000 0.0000000
## Specificity 1.00000000 1.0000000 1.0000000
## F1 NaN NaN NaN
## BalancedAccuracy 0.50000000 0.5000000 0.5000000
## avg.sentiment.pos.response 0.08333333 0.3333333 0.0000000
## avg.sentiment.neg.response 0.10000000 0.0000000 0.0000000
## SentimentQDAP NegativityQDAP PositivityQDAP
## Accuracy 1.0000000 0.66666667 0.6666667
## Precision 1.0000000 NaN NaN
## Sensitivity 1.0000000 0.00000000 0.0000000
## Specificity 1.0000000 1.00000000 1.0000000
## F1 1.0000000 NaN NaN
## BalancedAccuracy 1.0000000 0.50000000 0.5000000
## avg.sentiment.pos.response 0.3333333 0.08333333 0.4166667
## avg.sentiment.neg.response -0.3666667 0.36666667 0.0000000
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
## Warning: Failed to fit group -1.
## Caused by error in `smooth.construct.cr.smooth.spec()`:
## ! x has insufficient unique values to support 10 knots: reduce k.
The SentimentAnalysis package spares the user manual effort here: it
recognizes that the user has supplied a vector of strings and thus
automatically performs a set of default preprocessing operations from
text mining. Hence, it tokenizes each document and finally converts the
input into a document-term matrix. All of the previous operations are
undertaken without manual specification. The analyzeSentiment() routine
also accepts other input formats in case the user has already performed
a preprocessing step or wants to implement a specific set of
operations.
The following sections present the functionality in terms of working with different input formats and the underlying dictionaries.
The SentimentAnalysis package provides interfaces with
several other input formats, among which are
Vector of strings.
DocumentTermMatrix and TermDocumentMatrix as implemented in the tm package (Feinerer, Hornik, and Meyer 2008).
Corpus object as implemented by the tm package (Feinerer, Hornik, and Meyer 2008).
We provide examples in the following.
documents <- c("This is good",
"This is bad",
"This is inbetween")
convertToDirection(analyzeSentiment(documents)$SentimentQDAP)
## [1] positive negative neutral
## Levels: negative neutral positive
## Loading required package: NLP
corpus <- VCorpus(VectorSource(documents))
convertToDirection(analyzeSentiment(corpus)$SentimentQDAP)
## [1] positive negative neutral
## Levels: negative neutral positive
## [1] positive negative neutral
## Levels: negative neutral positive
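For the document-term matrix input, a minimal sketch could look as follows. It builds the matrix directly with `DocumentTermMatrix()` from tm, which is an assumption made for illustration; the package's own preprocessing routines would be an alternative.

```r
library(SentimentAnalysis)
library(tm)

documents <- c("This is good",
               "This is bad",
               "This is inbetween")

# Build the document-term matrix manually with tm; any custom
# preprocessing (own stopword list, synonym merging, ...) could go here
corpus <- VCorpus(VectorSource(documents))
dtm <- DocumentTermMatrix(corpus)

# The matrix is passed on directly to the sentiment rules
direction <- convertToDirection(analyzeSentiment(dtm)$SentimentQDAP)
direction
```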
Since the package can work directly with a document-term matrix, this
allows one to use customized preprocessing operations in the first
place. Afterwards, one can utilize the SentimentAnalysis
package for the computation of sentiment scores. For instance, one can
replace the stopwords with those from a different list, or even perform
tailored synonym merging, among other options. By default, the package
uses the built-in routines transformIntoCorpus() to convert
the input into a Corpus object and preprocessCorpus() to convert it
into a DocumentTermMatrix.
The SentimentAnalysis package ships four different
dictionaries:
Harvard-IV dictionary
Henry’s Financial dictionary (Henry 2008)
Loughran-McDonald Financial dictionary (Loughran and McDonald 2011)
QDAP dictionary from the package qdapDictionaries
All of them can be manually inspected and even accessed as follows:
## Warning in data(DictionarHE): data set 'DictionarHE' not found
## List of 2
## $ negative: chr [1:85] "below" "challenge" "challenged" "challenges" ...
## $ positive: chr [1:105] "above" "accomplish" "accomplished" "accomplishes" ...
# Access dictionary as an object of type SentimentDictionary
dict.HE <- loadDictionaryHE()
# Print summary statistics of dictionary
summary(dict.HE)
## Dictionary type: binary (positive / negative)
## Total entries: 97
## Positive entries: 53 (54.64%)
## Negative entries: 44 (45.36%)
## List of 3
## $ negative : chr [1:2355] "abandon" "abandoned" "abandoning" "abandonment" ...
## $ positive : chr [1:354] "able" "abundance" "abundant" "acclaimed" ...
## $ uncertainty: chr [1:297] "abeyance" "abeyances" "almost" "alteration" ...
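Analogous to `loadDictionaryHE()`, the remaining dictionaries can be loaded as objects through their own loader functions. The sketch below only uses loaders exported by the package; the variable names are arbitrary.

```r
library(SentimentAnalysis)

# Each loader returns a dictionary object that can be used in a rule
dict.GI   <- loadDictionaryGI()     # Harvard-IV
dict.LM   <- loadDictionaryLM()     # Loughran-McDonald
dict.QDAP <- loadDictionaryQDAP()   # QDAP

# Print summary statistics of one of them
summary(dict.LM)
```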
The SentimentAnalysis package distinguishes between
three different types of dictionaries. All of them differ by the data
they store, which ultimately also controls which methods of sentiment
analysis one can apply. The dictionaries are as follows:
SentimentDictionaryWordlist contains a list of words
belonging to a single category. For instance, it can bundle a list of
uncertainty words in order to compute the ratio of uncertainty words in
a particular document.
SentimentDictionaryBinary stores two lists of words,
one for positive and one for negative entries. This allows one to later
compute the polarity of the document on a scale from very positive to
very negative. However, the categories are not further distinguished or
rated, i.e. all positive words are assigned the same degree of
positivity.
SentimentDictionaryWeighted allows words to take on
continuous sentiment scores. This allows one, for instance, to rate
"increase" as being more positive than "improve". These
weights can then be transformed into a linear model. For this purpose,
the SentimentDictionaryWeighted also contains an
intercept. It can also store an additional factor in order to reverse the
weighting by an inverse document frequency.
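The first two dictionary types can be created along the following lines. The three words in the word list are placeholders chosen for illustration; the binary example reuses the entries that appear in the alternative call further down.

```r
library(SentimentAnalysis)

# Word list: a single category (the three words are placeholders)
d.wordlist <- SentimentDictionaryWordlist(c("uncertain", "possible", "likely"))
summary(d.wordlist)

# Binary: positive entries first, negative entries second
d.binary <- SentimentDictionaryBinary(c("increase", "rise", "more"),
                                      c("fall", "drop"))
summary(d.binary)
```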
## Dictionary type: word list (single set)
## Total entries: 3
## Dictionary type: word list (single set)
## Total entries: 3
## Dictionary type: binary (positive / negative)
## Total entries: 5
## Positive entries: 3 (60%)
## Negative entries: 2 (40%)
# Alternative call
d <- SentimentDictionary(c("increase", "rise", "more"),
c("fall", "drop"))
summary(d)
## Dictionary type: binary (positive / negative)
## Total entries: 5
## Positive entries: 3 (60%)
## Negative entries: 2 (40%)
d <- SentimentDictionaryWeighted(c("increase", "decrease", "exit"),
c(+1, -1, -10),
rep(NA, 3))
summary(d)
## Dictionary type: weighted (words with individual scores)
## Total entries: 3
## Positive entries: 1 (33.33%)
## Negative entries: 2 (66.67%)
## Neutral entries: 0 (0%)
##
## Details
## Average score: -3.333333
## Median: -1
## Min: -10
## Max: 1
## Standard deviation: 5.859465
## Skewness: -0.6155602
# Alternative call
d <- SentimentDictionary(c("increase", "decrease", "exit"),
c(+1, -1, -10),
rep(NA, 3))
summary(d)
## Dictionary type: weighted (words with individual scores)
## Total entries: 3
## Positive entries: 1 (33.33%)
## Negative entries: 2 (66.67%)
## Neutral entries: 0 (0%)
##
## Details
## Average score: -3.333333
## Median: -1
## Min: -10
## Max: 1
## Standard deviation: 5.859465
## Skewness: -0.6155602
The following example shows how the SentimentAnalysis
package can extract statistically relevant textual drivers based on an
exogenous response variable. The details of this method are presented in
Pröllochs, Feuerriegel, and Neumann (2018), while we provide a brief
summary here. Let $y$ denote a response variable in the form of a vector.
Furthermore, variables $x_1, \ldots, x_n$ give the number of occurrences
of word $i$ in a document. The methodology then estimates a linear model
$y = \beta_0 + \sum_{i=1}^{n} \beta_i x_i$ with intercept $\beta_0$ and
coefficients $\beta_1, \ldots, \beta_n$. The
estimation routine is based on LASSO regularization, which implicitly
performs variable selection. In so doing, it sets some of the
coefficients to exactly zero. The remaining words can then be ranked by
polarity according to their coefficient.
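For reference, the LASSO estimate can be written as a penalized least-squares problem. In the sketch below, $y_d$ is the response of document $d$, $x_{d,i}$ is the number of occurrences of word $i$ in document $d$, $N$ is the number of documents, $n$ the number of words, $\beta_0$ the intercept, $\beta_i$ the coefficient of word $i$ and $\lambda \geq 0$ the penalty weight. The scaling factor $1/(2N)$ follows a common software convention and is an assumption here, not something stated in the text.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta} \;
  \frac{1}{2N} \sum_{d=1}^{N} \Bigl( y_d - \beta_0 - \sum_{i=1}^{n} \beta_i \, x_{d,i} \Bigr)^2
  + \lambda \sum_{i=1}^{n} \lvert \beta_i \rvert
```

The absolute-value penalty is what drives some coefficients to exactly zero; larger values of $\lambda$ yield shorter dictionaries.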
# Create a vector of strings
documents <- c("This is a good thing!",
"This is a very good thing!",
"This is okay.",
"This is a bad thing.",
"This is a very bad thing.")
response <- c(1, 0.5, 0, -0.5, -1)
# Generate dictionary with LASSO regularization
dict <- generateDictionary(documents, response)
dict
## Type: weighted (words with individual scores)
## Intercept: 5.55333e-05
## -0.51 bad
## 0.51 good
## Dictionary type: weighted (words with individual scores)
## Total entries: 2
## Positive entries: 1 (50%)
## Negative entries: 1 (50%)
## Neutral entries: 0 (0%)
##
## Details
## Average score: -5.251165e-05
## Median: -5.251165e-05
## Min: -0.5119851
## Max: 0.5118801
## Standard deviation: 0.7239821
## Skewness: 0
In practice, users have several options for fine-tuning. Among these,
they can disable the intercept and fix it to zero, or standardize the
response variable $y$. In addition, it is possible to replace the LASSO
with any variant of the elastic net, simply by changing the argument
alpha.
Finally, one can save and reload dictionaries using write() and read()
as follows:
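A minimal sketch of the save-and-reload round trip is given below. The small weighted dictionary and the temporary file path are stand-ins so that the example is self-contained; in practice, one would pass a generated dictionary and a file name of choice.

```r
library(SentimentAnalysis)

# Small stand-in dictionary so that the example is self-contained
dict <- SentimentDictionaryWeighted(c("good", "bad"), c(0.5, -0.5))

# Save to disk and load again; the file name is arbitrary
path <- file.path(tempdir(), "dictionary.dict")
write(dict, file = path)
dict.loaded <- read(path)

summary(dict.loaded)
```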
Ultimately, several routines allow one to explore the generated
dictionary further. On the one hand, a simple overview can be displayed
by means of the summary() routine. On the other hand, a
kernel density estimation can also visualize the distribution of
positive and negative words. For instance, one can identify whether the
opinionated words are skewed to either end of the polarity scale.
Lastly, the compareDictionaries() routine can compare the
generated dictionary to dictionaries from the literature. It
automatically computes various metrics, among which are the overlap and
the correlation.
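The three exploration steps can be sketched as follows. The weighted dictionary below is a hypothetical stand-in for a generated one, and the choice of the QDAP dictionary as the comparison target is an assumption made for illustration.

```r
library(SentimentAnalysis)

# Hypothetical weighted dictionary standing in for a generated one
dict <- SentimentDictionaryWeighted(c("good", "fine", "bad", "awful"),
                                    c(0.5, 0.2, -0.5, -0.9))

summary(dict)   # descriptive statistics of the scores
plot(dict)      # kernel density estimate of the scores

# Overlap and correlation with a dictionary from the literature
cmp <- compareDictionaries(dict, loadDictionaryQDAP())
```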
## Comparing: wordlist vs weighted
##
## Total unique words: 4213
## Matching entries: 2 (0.0004747211%)
## Entries with same classification: 0 (0%)
## Entries with different classification: 2 (0.0004747211%)
## Correlation between scores of matching entries: 1
## $totalUniqueWords
## [1] 4213
##
## $totalSameWords
## [1] 2
##
## $ratioSameWords
## [1] 0.0004747211
##
## $numWordsEqualClass
## [1] 0
##
## $numWordsDifferentClass
## [1] 2
##
## $ratioWordsEqualClass
## [1] 0
##
## $ratioWordsDifferentClass
## [1] 0.0004747211
##
## $correlation
## [1] 1
## Dictionary
## cor 0.94868330
## cor.t.statistic 5.19615237
## cor.p.value 0.01384683
## lm.t.value 5.19615237
## r.squared 0.90000000
## RMSE 0.23301039
## MAE 0.20001111
## Accuracy 1.00000000
## Precision 1.00000000
## Sensitivity 1.00000000
## Specificity 1.00000000
## F1 1.00000000
## BalancedAccuracy 1.00000000
## avg.sentiment.pos.response 0.45116801
## avg.sentiment.neg.response -0.67675202
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
## Warning: Failed to fit group -1.
## Caused by error in `smooth.construct.cr.smooth.spec()`:
## ! x has insufficient unique values to support 10 knots: reduce k.
The following example demonstrates how a calculated dictionary can be used for predicting the sentiment of out-of-sample data. In addition, the code then evaluates the prediction performance by comparing it to the built-in dictionaries.
test_documents <- c("This is neither good nor bad",
"What a good idea!",
"Not bad")
test_response <- c(0, 1, 1)
pred <- predict(dict, test_documents)
compareToResponse(pred, test_response)
## Dictionary
## cor 5.922189e-05
## cor.t.statistic 5.922189e-05
## cor.p.value 9.999623e-01
## lm.t.value 5.922189e-05
## r.squared 3.507232e-09
## RMSE 8.523018e-01
## MAE 6.666521e-01
## Accuracy 3.333333e-01
## Precision 0.000000e+00
## Sensitivity NaN
## Specificity 3.333333e-01
## F1 NaN
## BalancedAccuracy NaN
## avg.sentiment.pos.response 1.457684e-05
## avg.sentiment.neg.response NaN
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
## Warning: Failed to fit group -1.
## Caused by error in `smooth.construct.cr.smooth.spec()`:
## ! x has insufficient unique values to support 10 knots: reduce k.
## Warning in cor(sentiment, response): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(sentiment, response): the standard deviation is zero
## WordCount SentimentGI NegativityGI PositivityGI
## cor -0.8660254 -0.18898224 0.18898224 -0.18898224
## cor.t.statistic -1.7320508 -0.19245009 0.19245009 -0.19245009
## cor.p.value 0.3333333 0.87896228 0.87896228 0.87896228
## lm.t.value -1.7320508 -0.19245009 0.19245009 -0.19245009
## r.squared 0.7500000 0.03571429 0.03571429 0.03571429
## RMSE 1.8257419 1.19023807 0.60858062 0.67357531
## MAE 1.3333333 0.83333333 0.44444444 0.61111111
## Accuracy 1.0000000 0.66666667 1.00000000 1.00000000
## Precision NaN 0.00000000 NaN NaN
## Sensitivity NaN NaN NaN NaN
## Specificity 1.0000000 0.66666667 1.00000000 1.00000000
## F1 NaN NaN NaN NaN
## BalancedAccuracy NaN NaN NaN NaN
## avg.sentiment.pos.response 2.0000000 -0.16666667 0.44444444 0.27777778
## avg.sentiment.neg.response NaN NaN NaN NaN
## SentimentHE NegativityHE PositivityHE SentimentLM
## cor -0.18898224 NA -0.18898224 -0.18898224
## cor.t.statistic -0.19245009 NA -0.19245009 -0.19245009
## cor.p.value 0.87896228 NA 0.87896228 0.87896228
## lm.t.value -0.19245009 NA -0.19245009 -0.19245009
## r.squared 0.03571429 NA 0.03571429 0.03571429
## RMSE 0.67357531 0.8164966 0.67357531 1.19023807
## MAE 0.61111111 0.6666667 0.61111111 0.83333333
## Accuracy 1.00000000 1.0000000 1.00000000 0.66666667
## Precision NaN NaN NaN 0.00000000
## Sensitivity NaN NaN NaN NaN
## Specificity 1.00000000 1.0000000 1.00000000 0.66666667
## F1 NaN NaN NaN NaN
## BalancedAccuracy NaN NaN NaN NaN
## avg.sentiment.pos.response 0.27777778 0.0000000 0.27777778 -0.16666667
## avg.sentiment.neg.response NaN NaN NaN NaN
## NegativityLM PositivityLM RatioUncertaintyLM
## cor 0.18898224 -0.18898224 NA
## cor.t.statistic 0.19245009 -0.19245009 NA
## cor.p.value 0.87896228 0.87896228 NA
## lm.t.value 0.19245009 -0.19245009 NA
## r.squared 0.03571429 0.03571429 NA
## RMSE 0.60858062 0.67357531 0.8164966
## MAE 0.44444444 0.61111111 0.6666667
## Accuracy 1.00000000 1.00000000 1.0000000
## Precision NaN NaN NaN
## Sensitivity NaN NaN NaN
## Specificity 1.00000000 1.00000000 1.0000000
## F1 NaN NaN NaN
## BalancedAccuracy NaN NaN NaN
## avg.sentiment.pos.response 0.44444444 0.27777778 0.0000000
## avg.sentiment.neg.response NaN NaN NaN
## SentimentQDAP NegativityQDAP PositivityQDAP
## cor -0.18898224 0.18898224 -0.18898224
## cor.t.statistic -0.19245009 0.19245009 -0.19245009
## cor.p.value 0.87896228 0.87896228 0.87896228
## lm.t.value -0.19245009 0.19245009 -0.19245009
## r.squared 0.03571429 0.03571429 0.03571429
## RMSE 1.19023807 0.60858062 0.67357531
## MAE 0.83333333 0.44444444 0.61111111
## Accuracy 0.66666667 1.00000000 1.00000000
## Precision 0.00000000 NaN NaN
## Sensitivity NaN NaN NaN
## Specificity 0.66666667 1.00000000 1.00000000
## F1 NaN NaN NaN
## BalancedAccuracy NaN NaN NaN
## avg.sentiment.pos.response -0.16666667 0.44444444 0.27777778
## avg.sentiment.neg.response NaN NaN NaN
When desired, one can implement a tailored preprocessing stage that
adapts to specific needs. The following code snippets demonstrate such
adaptation. In particular, the SentimentAnalysis package
ships a function ngram_tokenize() in order to extract
n-grams from the corpus. This does not affect the results of the built-in
dictionaries but rather changes the features used as part of dictionary
generation.
corpus <- VCorpus(VectorSource(documents))
tdm <- TermDocumentMatrix(corpus,
control=list(wordLengths=c(1,Inf),
tokenize=function(x) ngram_tokenize(x, char=FALSE,
ngmin=1, ngmax=2)))
rownames(tdm)
## [1] "a" "a bad" "a good" "a very" "bad"
## [6] "bad thing." "good" "good thing!" "is" "is a"
## [11] "is okay." "okay." "thing!" "thing." "this"
## [16] "this is" "very" "very bad" "very good"
## Dictionary type: weighted (words with individual scores)
## Total entries: 7
## Positive entries: 3 (42.86%)
## Negative entries: 4 (57.14%)
## Neutral entries: 0 (0%)
##
## Details
## Average score: 2.136424e-06
## Median: -5.891262e-05
## Min: -0.4368759
## Max: 0.4380331
## Standard deviation: 0.3016221
## Skewness: 0.004124293
## Type: weighted (words with individual scores)
## Intercept: -3.230431e-05
## -0.44 bad
## -0.29 very bad
## -0.00 bad thing.
## -0.00 thing.
## 0.00 good thing!
## 0.29 a good
## 0.44 good
Once the user has decided upon a preferred rule, they can adapt the
analyzeSentiment() routine by restricting it to calculate
only the rules of interest. Such behavior can be implemented by changing
the default value of the argument rules. See the following
code snippets for an example:
sentiment <- analyzeSentiment(documents,
rules=list("SentimentLM"=list(ruleSentiment, loadDictionaryLM())))
sentiment
## SentimentLM
## 1 0.5
## 2 0.5
## 3 0.0
## 4 -0.5
## 5 -0.5
SentimentAnalysis can be adapted for use with languages
other than English. In order to do this, one needs to introduce changes
at two points:
Preprocessing: The built-in routines use a
parameter language="english" to perform all preprocessing
operations for the English language. Instead, one might prefer to change
stemming and stopwords to a desired language. If one wishes to make
further changes to the preprocessing, it might be beneficial to replace
the automatic preprocessing with one’s own routines, which then return a
DocumentTermMatrix.
Dictionary: If one has a response or baseline
variable, one can use the dictionary generation approach that is shipped
with SentimentAnalysis. This can then automatically
generate a dictionary of positive and negative words that can be applied
to the given language. Otherwise, if one has no baseline variable at
hand, one needs to load a dictionary for that language. It might be
worthwhile to search online for pre-defined lists of positive and
negative words.
The following example demonstrates how SentimentAnalysis
can be adapted to work with a sample in German. Here, we supply a
positive and a negative document in the variable documents.
Afterwards, we introduce a very small dictionary of positive and
negative words, which is stored in dictionaryGerman.
Finally, we use analyzeSentiment() to perform a sentiment
analysis, where we introduce changes as follows: first of all, we supply
language="german" to ensure that all preprocessing
operations are performed for the German language. Additionally, we
define our custom rule GermanSentiment, which uses our
previous, customized dictionary.
documents <- c("Das ist ein gutes Resultat",
"Das Ergebnis war schlecht")
dictionaryGerman <- SentimentDictionaryBinary(c("gut"),
c("schlecht"))
sentiment <- analyzeSentiment(documents,
language="german",
rules=list("GermanSentiment"=list(ruleSentiment, dictionaryGerman)))
sentiment
## GermanSentiment
## 1 0.0
## 2 -0.5
## [1] positive negative
## Levels: negative positive
Similarly, one can implement a dictionary with custom sentiment scores.
woorden <- c("goed","slecht")
scores <- c(0.8,-0.5)
dictionaryDutch <- SentimentDictionaryWeighted(woorden, scores)
documents <- "dit is heel slecht"
sentiment <- analyzeSentiment(documents,
language="dutch",
rules=list("DutchSentiment"=list(ruleLinearModel, dictionaryDutch)))
sentiment
## DutchSentiment
## 1 -0.5
Notes:
The argument rules is a named list of approaches,
where each entry specifies a combination of a rule and a
dictionary.
Caution is needed when working with stemming. The default
routines of SentimentAnalysis automatically perform
stemming. Therefore, it is necessary to include stemmed terms in the
original dictionary. One can easily achieve such a conversion by calling
tm::stemDocument().
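The stemming note can be illustrated with a short sketch. It assumes that the SnowballC package, on which tm's stemmer relies, is installed; the word list is a hypothetical example.

```r
library(tm)

# Stem dictionary entries so that they match the stemmed document terms
words <- c("increase", "increases", "increasing")
stems <- stemDocument(words, language = "english")

# Inflected forms collapse to a shared stem, which is what the
# dictionary should contain when stemming is enabled
unique(stems)
```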
The following example shows the usage of
SentimentAnalysis in an applied setting. More precisely, we
utilize Reuters oil-related news from the tm package.
library(tm)
data("crude")
# Analyze sentiment
sentiment <- analyzeSentiment(crude)
# Count positive and negative news releases
table(convertToBinaryResponse(sentiment$SentimentLM))
##
## negative positive
## 16 4
# News releases with highest and lowest sentiment
crude[[which.max(sentiment$SentimentLM)]]$meta$heading
## [1] "HOUSTON OIL <HO> RESERVES STUDY COMPLETED"
## [1] "DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.08772 -0.04366 -0.02341 -0.02953 -0.01375 0.00000
## SentimentLM SentimentHE SentimentQDAP
## SentimentLM 1.0000000 0.2769878 0.4769730
## SentimentHE 0.2769878 1.0000000 0.6141075
## SentimentQDAP 0.4769730 0.6141075 1.0000000
SentimentAnalysis can also be used to count words in
documents with the help of countWords().
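A minimal sketch of such a word count is shown below. The example sentence is a placeholder and not necessarily the text behind the printed counts.

```r
library(SentimentAnalysis)

documents <- "This is a test document"   # placeholder text

# Count words after the default stopword removal
countWords(documents)

# Count all words, including stopwords
wc <- countWords(documents, removeStopwords = FALSE)
wc
```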
## WordCount
## 1 3
## WordCount
## 1 4
Note: The package has a built-in rule ruleWordCount(),
which is used for the “WordCount” column when calling
analyzeSentiment(). However, this rule is likely to return
different results as it is subject to the preprocessing rules of
analyzeSentiment(). By default, it removes stopwords,
excludes words with three or fewer letters and might apply a
sparsity operation. Hence, one should always use
countWords() when working with word counts.
The current version leaves open avenues for further enhancement. In the future, we see the following items as being potentially subject to improvements:
Negations: We envision a generic negation rule object that can be injected to negate fixed windows or apply complex negation rules (Pröllochs, Feuerriegel, and Neumann 2016).
Multi-language support: The current version has
built-in dictionaries for the English language only. We think that the
package would benefit greatly from support of further languages. In such
a setup, one would not need to adapt the preprocessing routines, as the
underlying tm package would already have support for
further languages (Feinerer, Hornik, and Meyer 2008). Instead, it would
only be required that the user tailor the applied dictionaries.
We cordially invite everyone to contribute source code, dictionaries and further demos.
SentimentAnalysis is released under the MIT License Copyright (c) 2021 Stefan Feuerriegel & Nicolas Pröllochs