Title: | Dictionary-Based Sentiment Analysis |
---|---|
Description: | Performs a sentiment analysis of textual contents in R. This implementation utilizes various existing dictionaries, such as Harvard-IV or finance-specific dictionaries. Furthermore, it can also create customized dictionaries. The latter uses LASSO regularization as a statistical approach to select relevant terms based on an exogenous response variable. |
Authors: | Nicolas Proellochs [aut, cre], Stefan Feuerriegel [aut] |
Maintainer: | Nicolas Proellochs <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.3-5 |
Built: | 2024-12-17 04:12:35 UTC |
Source: | https://github.com/sfeuerriegel/sentimentanalysis |
Performs a sentiment analysis of a given object (vector of strings, document-term matrix, or corpus).
analyzeSentiment(
  x,
  language = "english",
  aggregate = NULL,
  rules = defaultSentimentRules(),
  removeStopwords = TRUE,
  stemming = TRUE,
  ...
)

## S3 method for class 'Corpus'
analyzeSentiment(x, language = "english", aggregate = NULL,
  rules = defaultSentimentRules(), removeStopwords = TRUE,
  stemming = TRUE, ...)

## S3 method for class 'character'
analyzeSentiment(x, language = "english", aggregate = NULL,
  rules = defaultSentimentRules(), removeStopwords = TRUE,
  stemming = TRUE, ...)

## S3 method for class 'data.frame'
analyzeSentiment(x, language = "english", aggregate = NULL,
  rules = defaultSentimentRules(), removeStopwords = TRUE,
  stemming = TRUE, ...)

## S3 method for class 'TermDocumentMatrix'
analyzeSentiment(x, language = "english", aggregate = NULL,
  rules = defaultSentimentRules(), removeStopwords = TRUE,
  stemming = TRUE, ...)

## S3 method for class 'DocumentTermMatrix'
analyzeSentiment(x, language = "english", aggregate = NULL,
  rules = defaultSentimentRules(), removeStopwords = TRUE,
  stemming = TRUE, ...)
x |
A vector of characters, a data.frame, or an object of type Corpus, TermDocumentMatrix or DocumentTermMatrix |
language |
Language used for preprocessing operations (default: English) |
aggregate |
A factor variable by which documents can be grouped. This is helpful when aggregating, e.g., news from the same day or movie reviews by the same author |
rules |
A named list containing individual sentiment metrics. Each entry is itself a list consisting of a method, optionally followed by a dictionary. |
removeStopwords |
Flag indicating whether to remove stopwords or not (default: yes) |
stemming |
Perform stemming (default: TRUE) |
... |
Additional parameters passed to functions, e.g. for preprocessing |
This function returns a data.frame with continuous sentiment values. If other formats are desired, such as binary response values (positive / negative) or ternary values (positive, neutral, negative), the scores need to be converted. Consider using the functions convertToBinaryResponse and convertToDirection, which convert a vector of continuous sentiment scores into a factor object.
Result is a matrix with sentiment values for each document across all defined rules
compareToResponse for evaluating the results, convertToBinaryResponse and convertToDirection for obtaining binary results, generateDictionary for dictionary generation, plotSentiment and plotSentimentResponse for visualization
## Not run: 
library(tm)

# via vector of strings
corpus <- c("Positive text", "Neutral but uncertain text", "Negative text")
sentiment <- analyzeSentiment(corpus)
compareToResponse(sentiment, c(+1, 0, -2))

# via Corpus from tm package
data("crude")
sentiment <- analyzeSentiment(crude)

# via DocumentTermMatrix (with stemmed entries)
dtm <- DocumentTermMatrix(VCorpus(VectorSource(c("posit posit", "negat neutral"))))
sentiment <- analyzeSentiment(dtm)
compareToResponse(sentiment, convertToBinaryResponse(c(+1, -1)))

# By adapting the parameter rules, one can incorporate customized dictionaries,
# e.g. in order to adapt to arbitrary languages
dictionaryAmplifiers <- SentimentDictionary(c("more", "much"))
sentiment <- analyzeSentiment(corpus,
                              rules=list("Amplifiers"=list(ruleRatio, dictionaryAmplifiers)))

# One can also restrict the computed methods to the ones of interest
# in order to achieve performance optimizations
sentiment <- analyzeSentiment(corpus,
                              rules=list("SentimentLM"=list(ruleSentiment, loadDictionaryLM())))
sentiment

## End(Not run)
Routine compares two dictionaries in terms of their similarities and differences. Among the calculated measures are the total number of distinct words, the overlap between both dictionaries, etc.
compareDictionaries(d1, d2)
d1 |
The first sentiment dictionary of type SentimentDictionaryWordlist, SentimentDictionaryBinary or SentimentDictionaryWeighted |
d2 |
The second sentiment dictionary of type SentimentDictionaryWordlist, SentimentDictionaryBinary or SentimentDictionaryWeighted |
Returns a list with different metrics depending on the dictionary type.
Currently, this routine only supports the case where both dictionaries are of the same type.
SentimentDictionaryWordlist, SentimentDictionaryBinary and SentimentDictionaryWeighted for the specific classes
# compare two wordlist dictionaries
d1 <- SentimentDictionary(c("uncertain", "possible", "likely"))
d2 <- SentimentDictionary(c("rather", "intend", "likely"))
cmp <- compareDictionaries(d1, d2)

# compare two binary dictionaries
d1 <- SentimentDictionary(c("increase", "rise", "more"),
                          c("fall", "drop"))
d2 <- SentimentDictionary(c("positive", "rise", "more"),
                          c("negative", "drop"))
cmp <- compareDictionaries(d1, d2)

# compare two weighted dictionaries
d1 <- SentimentDictionary(c("increase", "decrease", "exit"),
                          c(+1, -1, -10),
                          rep(NA, 3))
d2 <- SentimentDictionary(c("increase", "decrease", "drop", "neutral"),
                          c(+2, -5, -1, 0),
                          rep(NA, 4))
cmp <- compareDictionaries(d1, d2)
This function compares the calculated sentiment values with an external response variable. Examples of such an exogenous response are stock market movements or IMDb movie ratings. Both usually reflect a "true" value that the sentiment should match.
compareToResponse(sentiment, response)

## S3 method for class 'logical'
compareToResponse(sentiment, response)

## S3 method for class 'factor'
compareToResponse(sentiment, response)

## S3 method for class 'integer'
compareToResponse(sentiment, response)

## S3 method for class 'data.frame'
compareToResponse(sentiment, response)

## S3 method for class 'numeric'
compareToResponse(sentiment, response)
sentiment |
Matrix with sentiment scores for each document across several sentiment rules |
response |
Vector with "true" response. This vector can contain either continuous numeric values or binary values. In the latter case, FALSE is matched to a negative sentiment value, while TRUE is matched to a non-negative one. |
Matrix with different performance metrics for all given sentiment rules
sentiment <- matrix(c(5.5, 2.9, 0.9, -1),
                    dimnames=list(c("A", "B", "C", "D"), c("Sentiment")))

# continuous numeric response variable
response <- c(5, 3, 1, -1)
compareToResponse(sentiment, response)

# binary response variable
response <- c(TRUE, TRUE, FALSE, FALSE)
compareToResponse(sentiment, response)
This function converts continuous sentiment scores into their corresponding binary sentiment class. As such, the result is a factor with two levels indicating positive and negative content. Neutral documents (with a sentiment score of 0) are counted as positive.
convertToBinaryResponse(sentiment)
sentiment |
Vector, matrix or data.frame with sentiment scores. |
If a matrix or data.frame is provided, this routine does not change all columns. Instead, it scans for columns whose name starts with "Sentiment" and changes only these. Hence, columns with pure negativity, positivity, ratios or word counts are ignored.
If a vector is supplied, it returns a factor with two levels representing positive and negative content. Otherwise, it returns a data.frame with the corresponding columns being exchanged.
sentiment <- c(-1, -0.5, +1, 0.6, 0)
convertToBinaryResponse(sentiment)
convertToDirection(sentiment)

df <- data.frame(No=1:5, Sentiment=sentiment)
df
convertToBinaryResponse(df)
convertToDirection(df)
This function converts continuous sentiment scores into their corresponding sentiment direction. As such, the result is a factor with three levels indicating positive, neutral and negative content. In contrast to convertToBinaryResponse, neutral documents have their own category.
convertToDirection(sentiment)
sentiment |
Vector, matrix or data.frame with sentiment scores. |
If a matrix or data.frame is provided, this routine does not change all columns. Instead, it scans for columns whose name starts with "Sentiment" and changes only these. Hence, columns with pure negativity, positivity, ratios or word counts are ignored.
If a vector is supplied, it returns a factor with three levels representing positive, neutral and negative content. Otherwise, it returns a data.frame with the corresponding columns being exchanged.
sentiment <- c(-1, -0.5, +1, 0.6, 0)
convertToBinaryResponse(sentiment)
convertToDirection(sentiment)

df <- data.frame(No=1:5, Sentiment=sentiment)
df
convertToBinaryResponse(df)
convertToDirection(df)
Function counts the words in each document
countWords(x, aggregate = NULL, removeStopwords = TRUE,
  language = "english", ...)

## S3 method for class 'Corpus'
countWords(x, aggregate = NULL, removeStopwords = TRUE,
  language = "english", ...)

## S3 method for class 'character'
countWords(x, aggregate = NULL, removeStopwords = TRUE,
  language = "english", ...)

## S3 method for class 'data.frame'
countWords(x, aggregate = NULL, removeStopwords = TRUE,
  language = "english", ...)

## S3 method for class 'TermDocumentMatrix'
countWords(x, aggregate = NULL, removeStopwords = TRUE,
  language = "english", ...)

## S3 method for class 'DocumentTermMatrix'
countWords(x, aggregate = NULL, removeStopwords = TRUE,
  language = "english", ...)
x |
A vector of characters, a data.frame, or an object of type Corpus, TermDocumentMatrix or DocumentTermMatrix |
aggregate |
A factor variable by which documents can be grouped. This is helpful when aggregating, e.g., news from the same day or movie reviews by the same author |
removeStopwords |
Flag indicating whether to remove stopwords or not (default: yes) |
language |
Language used for preprocessing operations (default: English) |
... |
Additional parameters passed to function for e.g. preprocessing |
Result is a matrix with word counts for each document.
documents <- c("This is a test", "an one more")

# count words (without stopwords)
countWords(documents)

# count all words (including stopwords)
countWords(documents, removeStopwords=FALSE)
Dictionary with a list of positive and negative words according to the psychological Harvard-IV dictionary as used in the General Inquirer software. This is a general-purpose dictionary developed at Harvard University.
data(DictionaryGI)
A list with different terms according to the Harvard-IV dictionary
All words are in lower case and non-stemmed
https://inquirer.sites.fas.harvard.edu/homecat.htm
data(DictionaryGI)
summary(DictionaryGI)
Dictionary with a list of positive and negative words according to Henry's finance-specific dictionary. This dictionary was first presented in the Journal of Business Communication and is among the early adopters of text analysis in the finance discipline.
data(DictionaryHE)
A list with different wordlists according to Henry
All words are in lower case and non-stemmed
Henry (2008): Are Investors Influenced By How Earnings Press Releases Are Written?, Journal of Business Communication, 45:4, 363-407
data(DictionaryHE)
summary(DictionaryHE)
Dictionary with a list of positive, negative and uncertainty words according to the Loughran-McDonald finance-specific dictionary. This dictionary was first presented in the Journal of Finance and has been widely used in the finance domain ever since.
data(DictionaryLM)
A list with different terms according to Loughran-McDonald
All words are in lower case and non-stemmed
https://sraf.nd.edu/loughranmcdonald-master-dictionary/
Loughran and McDonald (2011) When is a Liability not a Liability? Textual Analysis, Dictionaries, and 10-Ks, Journal of Finance, 66:1, 35-65
data(DictionaryLM)
summary(DictionaryLM)
Function estimates coefficients based on elastic net regularization.
enetEstimation(
  x,
  response,
  control = list(alpha = 0.5, s = "lambda.min", family = "gaussian", grouped = FALSE),
  ...
)
x |
An object of type DocumentTermMatrix |
response |
Response variable including the given gold standard. |
control |
(optional) A list of parameters defining the elastic net model; the defaults are shown in the usage above. |
... |
Additional parameters passed to the underlying estimation routine |
Result is a list with coefficients, coefficient names and the model intercept.
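In practice, enetEstimation is invoked indirectly through generateDictionary. The sketch below, with made-up toy documents and responses, assumes the SentimentAnalysis package is attached and selects the elastic net via modelType = "enet":

```r
library(SentimentAnalysis)

# Toy data (invented for illustration): five short documents
# with a known sentiment response
documents <- c("This is a good thing!",
               "This is a very good thing!",
               "This is okay.",
               "This is a bad thing.",
               "This is a very bad thing.")
response <- c(1, 0.5, 0, -0.5, -1)

# Elastic net regularization (the default alpha = 0.5 mixes LASSO and ridge penalties)
dictionary <- generateDictionary(documents, response, modelType = "enet")
summary(dictionary)
```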
Returns all entries from a dictionary.
extractWords(d)
d |
Dictionary of type SentimentDictionaryWordlist, SentimentDictionaryBinary or SentimentDictionaryWeighted |
extractWords(SentimentDictionary(c("uncertain", "possible", "likely"))) # returns 3
extractWords(SentimentDictionary(c("increase", "rise", "more"),
                                 c("fall", "drop"))) # returns 5
extractWords(SentimentDictionary(c("increase", "decrease", "exit"),
                                 c(+1, -1, -10),
                                 rep(NA, 3))) # returns 3
Routine applies method for dictionary generation (LASSO, ridge regularization, elastic net, ordinary least squares, generalized linear model or spike-and-slab regression) to the document-term matrix in order to extract decisive terms that have a statistically significant impact on the response variable.
generateDictionary(
  x,
  response,
  language = "english",
  modelType = "lasso",
  filterTerms = NULL,
  control = list(),
  minWordLength = 3,
  sparsity = 0.9,
  weighting = function(x) tm::weightTfIdf(x, normalize = FALSE),
  ...
)

## S3 method for class 'Corpus'
generateDictionary(x, response, language = "english", modelType = "lasso",
  filterTerms = NULL, control = list(), minWordLength = 3, sparsity = 0.9,
  weighting = function(x) tm::weightTfIdf(x, normalize = FALSE), ...)

## S3 method for class 'character'
generateDictionary(x, response, language = "english", modelType = "lasso",
  filterTerms = NULL, control = list(), minWordLength = 3, sparsity = 0.9,
  weighting = function(x) tm::weightTfIdf(x, normalize = FALSE), ...)

## S3 method for class 'data.frame'
generateDictionary(x, response, language = "english", modelType = "lasso",
  filterTerms = NULL, control = list(), minWordLength = 3, sparsity = 0.9,
  weighting = function(x) tm::weightTfIdf(x, normalize = FALSE), ...)

## S3 method for class 'TermDocumentMatrix'
generateDictionary(x, response, language = "english", modelType = "lasso",
  filterTerms = NULL, control = list(), minWordLength = 3, sparsity = 0.9,
  weighting = function(x) tm::weightTfIdf(x, normalize = FALSE), ...)

## S3 method for class 'DocumentTermMatrix'
generateDictionary(x, response, language = "english", modelType = "lasso",
  filterTerms = NULL, control = list(), minWordLength = 3, sparsity = 0.9,
  weighting = function(x) tm::weightTfIdf(x, normalize = FALSE), ...)
x |
A vector of characters, a data.frame, or an object of type Corpus, TermDocumentMatrix or DocumentTermMatrix |
response |
Response variable including the given gold standard. |
language |
Language used for preprocessing operations (default: English). |
modelType |
A string denoting the estimation method. Allowed values are "lasso", "ridge", "enet", "lm", "glm" or "spikeslab" (default: "lasso"). |
filterTerms |
Optional vector of strings (default: NULL) that restricts the dictionary generation to a predefined list of terms. |
control |
(optional) A list of parameters defining the model used for dictionary generation. The available options depend on the choice of modelType; see the individual estimation routines (e.g. lassoEstimation) for details. |
minWordLength |
Removes words given a specific minimum length (default: 3). This preprocessing is applied when the input is a character vector or a corpus and the document-term matrix is generated inside the routine. |
sparsity |
A numeric for removing sparse terms in the document-term matrix; the argument specifies the maximal allowed sparsity (default: 0.9). |
weighting |
Weights a document-term matrix, e.g. by term frequency-inverse document frequency (default). Other weighting variants from the tm package can be used. |
... |
Additional parameters passed to functions for e.g. preprocessing or the underlying estimation routine. |
Result is a dictionary of class SentimentDictionaryWeighted containing the estimated scores for each term.
Pröllochs and Feuerriegel (2018). Statistical Inferences for Polarity Identification in Natural Language. PloS One 13(12). doi:10.1371/journal.pone.0209323
analyzeSentiment, predict.SentimentDictionaryWeighted, plot.SentimentDictionaryWeighted and compareToResponse for advanced evaluations
# Create a vector of strings
documents <- c("This is a good thing!",
               "This is a very good thing!",
               "This is okay.",
               "This is a bad thing.",
               "This is a very bad thing.")
response <- c(1, 0.5, 0, -0.5, -1)

# Generate dictionary with LASSO regularization
dictionary <- generateDictionary(documents, response)

# Show dictionary
dictionary
summary(dictionary)
plot(dictionary)

# Compute in-sample performance
sentiment <- predict(dictionary, documents)
compareToResponse(sentiment, response)
plotSentimentResponse(sentiment, response)

# Generate new dictionary with spike-and-slab regression instead of LASSO regularization
library(spikeslab)
dictionary <- generateDictionary(documents, response, modelType="spikeslab")

# Generate new dictionary with tf weighting instead of tf-idf
library(tm)
dictionary <- generateDictionary(documents, response, weighting=weightTf)
sentiment <- predict(dictionary, documents)
compareToResponse(sentiment, response)

# Use instead lambda.min from the LASSO estimation
dictionary <- generateDictionary(documents, response, control=list(s="lambda.min"))
sentiment <- predict(dictionary, documents)
compareToResponse(sentiment, response)

# Use instead OLS as estimation method
dictionary <- generateDictionary(documents, response, modelType="lm")
sentiment <- predict(dictionary, documents)
sentiment

dictionary <- generateDictionary(documents, response, modelType="lm",
                                 filterTerms = c("good", "bad"))
sentiment <- predict(dictionary, documents)
sentiment

dictionary <- generateDictionary(documents, response, modelType="lm",
                                 filterTerms = extractWords(loadDictionaryGI()))
sentiment <- predict(dictionary, documents)
sentiment

# Generate dictionary without LASSO intercept
dictionary <- generateDictionary(documents, response, intercept=FALSE)
dictionary$intercept

## Not run: 
imdb <- loadImdb()

# Generate Dictionary
dictionary_imdb <- generateDictionary(imdb$Corpus, imdb$Rating, family="poisson")
summary(dictionary_imdb)

compareDictionaries(dictionary_imdb, loadDictionaryGI())

# Show estimated coefficients with Kernel Density Estimation (KDE)
plot(dictionary_imdb)
plot(dictionary_imdb) + xlim(c(-0.1, 0.1))

# Compute in-sample performance
pred_sentiment <- predict(dictionary_imdb, imdb$Corpus)
compareToResponse(pred_sentiment, imdb$Rating)

# Test a different sparsity parameter
dictionary_imdb <- generateDictionary(imdb$Corpus, imdb$Rating, family="poisson",
                                      sparsity=0.99)
summary(dictionary_imdb)
pred_sentiment <- predict(dictionary_imdb, imdb$Corpus)
compareToResponse(pred_sentiment, imdb$Rating)

## End(Not run)
Function estimates coefficients based on a generalized linear model.
glmEstimation(x, response, control = list(family = "gaussian"), ...)
x |
An object of type DocumentTermMatrix |
response |
Response variable including the given gold standard. |
control |
(optional) A list of parameters defining the model, e.g. the distribution family (default: gaussian). |
... |
Additional parameters passed to the underlying estimation routine |
Result is a list with coefficients, coefficient names and the model intercept.
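glmEstimation is likewise usually reached through generateDictionary by setting modelType = "glm". A minimal sketch with invented toy data; restricting the terms via filterTerms keeps the design matrix small, and the family entry mirrors the default shown above:

```r
library(SentimentAnalysis)

# Toy data (invented for illustration)
documents <- c("This is a good thing!",
               "This is a very good thing!",
               "This is okay.",
               "This is a bad thing.",
               "This is a very bad thing.")
response <- c(1, 0.5, 0, -0.5, -1)

# Generalized linear model with a Gaussian family, restricted to two terms
dictionary <- generateDictionary(documents, response, modelType = "glm",
                                 filterTerms = c("good", "bad"),
                                 control = list(family = "gaussian"))
summary(dictionary)
```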
Function estimates coefficients based on LASSO regularization.
lassoEstimation(
  x,
  response,
  control = list(alpha = 1, s = "lambda.min", family = "gaussian", grouped = FALSE),
  ...
)
x |
An object of type DocumentTermMatrix |
response |
Response variable including the given gold standard. |
control |
(optional) A list of parameters defining the LASSO model; the defaults are shown in the usage above. |
... |
Additional parameters passed to the underlying estimation routine |
Result is a list with coefficients, coefficient names and the model intercept.
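Since lassoEstimation is the default estimation routine behind generateDictionary, a plain call exercises it. A short sketch with invented toy data; the fitted intercept is stored alongside the term weights:

```r
library(SentimentAnalysis)

# Toy data (invented for illustration)
documents <- c("This is a good thing!",
               "This is a very good thing!",
               "This is okay.",
               "This is a bad thing.",
               "This is a very bad thing.")
response <- c(1, 0.5, 0, -0.5, -1)

# modelType = "lasso" is the default
dictionary <- generateDictionary(documents, response)
dictionary$intercept
summary(dictionary)
```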
Function estimates coefficients based on ordinary least squares.
lmEstimation(x, response, control = list(), ...)
x |
An object of type DocumentTermMatrix |
response |
Response variable including the given gold standard. |
control |
(optional) A list of parameters (not used). |
... |
Additional parameters (not used). |
Result is a list with coefficients, coefficient names and the model intercept.
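lmEstimation backs modelType = "lm" in generateDictionary. A brief sketch with invented toy data, mirroring the OLS variant from the generateDictionary examples:

```r
library(SentimentAnalysis)

# Toy data (invented for illustration)
documents <- c("This is a good thing!",
               "This is a very good thing!",
               "This is okay.",
               "This is a bad thing.",
               "This is a very bad thing.")
response <- c(1, 0.5, 0, -0.5, -1)

# Ordinary least squares estimation
dictionary <- generateDictionary(documents, response, modelType = "lm")
sentiment <- predict(dictionary, documents)
```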
Loads Harvard-IV dictionary (as used in General Inquirer) into a standardized dictionary object
loadDictionaryGI()
object of class SentimentDictionary
Result is a list of stemmed words in lower case
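A short usage sketch (assuming the SentimentAnalysis package is attached):

```r
library(SentimentAnalysis)

# Load the Harvard-IV dictionary and inspect it
dict.GI <- loadDictionaryGI()
summary(dict.GI)
numEntries(dict.GI)  # total number of entries
```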
Loads Henry's finance-specific dictionary into a standardized dictionary object
loadDictionaryHE()
object of class SentimentDictionary
Result is a list of stemmed words in lower case
Loads Loughran-McDonald financial dictionary into a standardized dictionary object (here, categories positive and negative are considered)
loadDictionaryLM()
object of class SentimentDictionary
Result is a list of stemmed words in lower case
Loads uncertainty words from Loughran-McDonald into a standardized dictionary object
loadDictionaryLM_Uncertainty()
object of class SentimentDictionary
Result is a list of stemmed words in lower case
Loads polarity words from the data object key.pol, which is provided by the qdap package. This is then converted into a standardized dictionary object.
loadDictionaryQDAP()
object of class SentimentDictionary
Result is a list of stemmed words in lower case
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
Hu and Liu (2004). Mining Opinion Features in Customer Reviews. National Conference on Artificial Intelligence.
Function downloads IMDb dataset and prepares corresponding user ratings for easy usage.
loadImdb()
Returns a list where the entry named Corpus contains the IMDb reviews and Rating is the corresponding scaled rating.
Pang and Lee (2005). Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. Proceedings of the ACL. See http://www.cs.cornell.edu/people/pabo/movie-review-data/
## Not run: 
imdb <- loadImdb()
dictionary <- generateDictionary(imdb$Corpus, imdb$Rating)

## End(Not run)
Decides upon an estimation method for dictionary generation. Input is the name of the estimation method; output is the corresponding function object.
lookupEstimationMethod(type)
type |
A string denoting the estimation method. Allowed values are "lasso", "ridge", "enet", "lm", "glm" or "spikeslab". |
Function that implements the specific estimation method.
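A quick sketch of the lookup (assuming the SentimentAnalysis package is attached); per the description above, the returned value is a function object implementing the chosen method:

```r
library(SentimentAnalysis)

# Map the method name to its estimation function
fn <- lookupEstimationMethod("lasso")
is.function(fn)
```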
A tokenizer for use with a document-term matrix from the tm package. Supports both character and word n-grams, including its own wrapper to handle non-Latin encodings.
ngram_tokenize(x, char = FALSE, ngmin = 1, ngmax = 3)
x |
input string |
char |
boolean value specifying whether to use character (char = TRUE) or word n-grams (char = FALSE, default) |
ngmin |
integer giving the minimum order of n-gram (default: 1) |
ngmax |
integer giving the maximum order of n-gram (default: 3) |
library(tm)
en <- c("Romeo loves Juliet", "Romeo loves a girl")
en.corpus <- VCorpus(VectorSource(en))
tdm <- TermDocumentMatrix(en.corpus,
                          control=list(wordLengths=c(1,Inf),
                                       tokenize=function(x) ngram_tokenize(x, char=TRUE,
                                                                           ngmin=3, ngmax=3)))
inspect(tdm)

ch <- c("abab", "aabb")
ch.corpus <- VCorpus(VectorSource(ch))
tdm <- TermDocumentMatrix(ch.corpus,
                          control=list(wordLengths=c(1,Inf),
                                       tokenize=function(x) ngram_tokenize(x, char=TRUE,
                                                                           ngmin=1, ngmax=2)))
inspect(tdm)
Counts total number of entries in dictionary.
numEntries(d)
d |
Dictionary of type |
numPositiveEntries
and
numNegativeEntries
for more options to count the number of entries
numEntries(SentimentDictionary(c("uncertain", "possible", "likely")))    # returns 3
numEntries(SentimentDictionary(c("increase", "rise", "more"),
                               c("fall", "drop")))                       # returns 5
numEntries(SentimentDictionary(c("increase", "decrease", "exit"),
                               c(+1, -1, -10),
                               rep(NA, 3)))                              # returns 3
Counts total number of negative entries in dictionary.
numNegativeEntries(d)
d |
is a dictionary of type |
Entries in SentimentDictionaryWeighted
with a weight of 0
are not counted here
numEntries
and
numPositiveEntries
for more options to count the number of entries
numNegativeEntries(SentimentDictionary(c("increase", "rise", "more"),
                                       c("fall", "drop")))               # returns 2
numNegativeEntries(SentimentDictionary(c("increase", "decrease", "exit"),
                                       c(+1, -1, -10),
                                       rep(NA, 3)))                      # returns 2
Counts total number of positive entries in dictionary.
numPositiveEntries(d)
d |
is a dictionary of type |
Entries in SentimentDictionaryWeighted
with a weight of 0
are not counted here
numEntries
and
numNegativeEntries
for more options to count the number of entries
numPositiveEntries(SentimentDictionary(c("increase", "rise", "more"),
                                       c("fall", "drop")))               # returns 3
numPositiveEntries(SentimentDictionary(c("increase", "decrease", "exit"),
                                       c(+1, -1, -10),
                                       rep(NA, 3)))                      # returns 1
The function performs a kernel density estimation (KDE) of the coefficients and then
plots these using ggplot
. This type of plot allows one to
inspect whether the distribution of coefficients is skewed, which can reveal whether there
are more positive terms than negative ones, or vice versa.
## S3 method for class 'SentimentDictionaryWeighted'
plot(x, color = "gray60", theme = ggplot2::theme_bw(), ...)
x |
Dictionary of class |
color |
Color for filling the density plot (default: gray color) |
theme |
Visualization theme for |
... |
Additional parameters passed to function. |
Returns a plot of class ggplot
plotSentiment
and plotSentimentResponse
for further plotting options
d <- SentimentDictionaryWeighted(paste0(character(100), 1:100),
                                 rnorm(100),
                                 numeric(100))
plot(d)

# Change color in plot
plot(d, color="red")

library(ggplot2)
# Extend plot with additional layout options
plot(d) + ggtitle("KDE plot")
plot(d) + theme_void()
Simple line plot to visualize the evolution of sentiment scores. This is especially helpful when studying a time series of sentiment scores.
plotSentiment(
  sentiment,
  x = NULL,
  cumsum = FALSE,
  xlab = "",
  ylab = "Sentiment"
)
sentiment |
|
x |
Optional parameter with labels or time stamps on x-axis. |
cumsum |
Parameter deciding whether the cumulative sentiment
is plotted (default: FALSE). |
xlab |
Name of x-axis (default: empty string). |
ylab |
Name of y-axis (default: "Sentiment"). |
Returns a plot of class ggplot
plotSentimentResponse
and plot.SentimentDictionaryWeighted
for further plotting options
sentiment <- data.frame(Dictionary=runif(20))

plotSentiment(sentiment)
plotSentiment(sentiment, cumsum=TRUE)

# Change name of x-axis
plotSentiment(sentiment, xlab="Tone")

library(ggplot2)
# Extend plot with additional layout options
plotSentiment(sentiment) + ggtitle("Evolving sentiment")
plotSentiment(sentiment) + theme_void()
Generates a scatterplot with pairs of sentiment scores and
the response variable. In addition, the plot adds a trend line
in the form of a generalized additive model (GAM). Other
smoothing methods are possible via geom_smooth
.
This function is helpful for visualizing the relationship
between computed sentiment scores and the gold standard.
plotSentimentResponse(
  sentiment,
  response,
  smoothing = "gam",
  xlab = "Sentiment",
  ylab = "Response"
)
sentiment |
|
response |
Vector with response variables of the same length |
smoothing |
Smoothing functionality. Default is "gam". |
xlab |
Description on x-axis (default: "Sentiment"). |
ylab |
Description on y-axis (default: "Response"). |
Returns a plot of class ggplot
plotSentiment
and plot.SentimentDictionaryWeighted
for further plotting options
sentiment <- data.frame(Dictionary=runif(10))
response <- sentiment[[1]] + rnorm(10)

plotSentimentResponse(sentiment, response)

# Change x-axis
plotSentimentResponse(sentiment, response, xlab="Tone")

library(ggplot2)
# Extend plot with additional layout options
plotSentimentResponse(sentiment, response) + ggtitle("Scatterplot")
plotSentimentResponse(sentiment, response) + theme_void()
Function takes a dictionary of class SentimentDictionaryWeighted
with weights
as input. It then applies this dictionary to textual contents in order to calculate
a sentiment score.
## S3 method for class 'SentimentDictionaryWeighted'
predict(
  object,
  newdata = NULL,
  language = "english",
  weighting = function(x) tm::weightTfIdf(x, normalize = FALSE),
  ...
)
object |
Dictionary of class |
newdata |
A vector of characters, a |
language |
Language used for preprocessing operations (default: English). |
weighting |
Function used for weighting of words; default is the tf-idf scheme. |
... |
Additional parameters passed to function for e.g. preprocessing. |
data.frame
with predicted sentiment scores.
SentimentDictionaryWeighted
, generateDictionary
and
compareToResponse
for default dictionary generations
# Create a vector of strings
documents <- c("This is a good thing!",
               "This is a very good thing!",
               "This is okay.",
               "This is a bad thing.",
               "This is a very bad thing.")
response <- c(1, 0.5, 0, -0.5, -1)

# Generate dictionary with LASSO regularization
dictionary <- generateDictionary(documents, response)

# Compute in-sample performance
sentiment <- predict(dictionary, documents)
compareToResponse(sentiment, response)
Preprocess existing corpus of type Corpus
according to default operations.
This helper function groups all standard preprocessing steps such that the usage of the
package is more convenient.
preprocessCorpus(
  corpus,
  language = "english",
  stemming = TRUE,
  verbose = FALSE,
  removeStopwords = TRUE
)
corpus |
|
language |
Default language used for preprocessing (i.e. stop word removal and stemming) |
stemming |
Perform stemming (default: TRUE) |
verbose |
Print preprocessing status information |
removeStopwords |
Flag indicating whether to remove stopwords or not (default: yes) |
Object of Corpus
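A minimal sketch of the default pipeline, assuming the tm and SentimentAnalysis packages are attached:

```r
library(tm)
library(SentimentAnalysis)

corpus <- VCorpus(VectorSource(c("This is a good thing!",
                                 "This is a bad thing.")))

# Apply the default preprocessing (stop word removal, stemming, etc.)
processed <- preprocessCorpus(corpus)

inspect(processed[[1]])
```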
Prints entries of sentiment dictionary to the screen
## S3 method for class 'SentimentDictionaryWordlist'
print(x, ...)

## S3 method for class 'SentimentDictionaryBinary'
print(x, ...)

## S3 method for class 'SentimentDictionaryWeighted'
print(x, ...)
x |
Sentiment dictionary of type |
... |
Additional parameters passed to specific sub-routines |
summary
for showing a brief summary
print(SentimentDictionary(c("uncertain", "possible", "likely")))
print(SentimentDictionary(c("increase", "rise", "more"),
                          c("fall", "drop")))
print(SentimentDictionary(c("increase", "decrease", "exit"),
                          c(+1, -1, -10),
                          rep(NA, 3)))
This routine reads a sentiment dictionary from a text file. Such a text file can
be created e.g. via write
. The dictionary type is recognized
according to the internal format of the file.
read(file)
file |
File name pointing to text file |
Dictionary of type SentimentDictionaryWordlist
,
SentimentDictionaryBinary
or
SentimentDictionaryWeighted
write
for creating such a file
d.out <- SentimentDictionary(c("uncertain", "possible", "likely"))
write(d.out, "example.dict")
d.in <- read("example.dict")
print(d.in)

d.out <- SentimentDictionary(c("increase", "rise", "more"),
                             c("fall", "drop"))
write(d.out, "example.dict")
d.in <- read("example.dict")
print(d.in)

d.out <- SentimentDictionary(c("increase", "decrease", "exit"),
                             c(+1, -1, -10),
                             rep(NA, 3),
                             intercept=5)
write(d.out, "example.dict")
d.in <- read("example.dict")
print(d.in)

unlink("example.dict")
Function estimates coefficients based on ridge regularization.
ridgeEstimation(
  x,
  response,
  control = list(s = "lambda.min", family = "gaussian", grouped = FALSE),
  ...
)
x |
An object of type |
response |
Response variable including the given gold standard. |
control |
(optional) A list of parameters defining the model as follows:
|
... |
Additional parameters passed to function for |
Result is a list with coefficients, coefficient names and the model intercept.
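In typical use, ridgeEstimation is not called directly but selected when generating a dictionary; the modelType argument below is an assumption about the generateDictionary interface, so treat this as a sketch:

```r
library(SentimentAnalysis)

documents <- c("This is a good thing!", "This is a very good thing!",
               "This is okay.", "This is a bad thing.")
response <- c(1, 0.5, 0, -0.5)

# Hypothetical call: select ridge regularization for dictionary generation
dictionary <- generateDictionary(documents, response, modelType = "ridge")
summary(dictionary)
```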
Sentiment score as denoted by a linear model.
ruleLinearModel(dtm, d)
dtm |
Document-term matrix |
d |
Dictionary of type |
Continuous sentiment score
Ratio of words labeled as negative in that dictionary compared to the total
number of words in the document. Here, it uses the entry negativeWords
of the SentimentDictionaryBinary
.
ruleNegativity(dtm, d)
dtm |
Document-term matrix |
d |
Dictionary of type |
Ratio of negative words compared to all
Ratio of words labeled as positive in that dictionary compared to the total
number of words in the document. Here, it uses the entry positiveWords
of the SentimentDictionaryBinary
.
rulePositivity(dtm, d)
dtm |
Document-term matrix |
d |
Dictionary of type |
Ratio of positive words compared to all
Ratio of words in that dictionary compared to the total number of words in the document
ruleRatio(dtm, d)
dtm |
Document-term matrix |
d |
Dictionary of type |
Ratio of dictionary words compared to all
Sentiment score defined as the difference between positive and negative word counts divided by the total number of words.
ruleSentiment(dtm, d)
dtm |
Document-term matrix |
d |
Dictionary of type |
Let p denote the number of positive words, n the number of
negative words, and N the total number of words
in the document. Then, the sentiment score is defined as
(p - n) / N. Here, it uses the entries negativeWords
and
positiveWords
of the SentimentDictionaryBinary
.
Sentiment score in the range of -1 to 1.
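The rule can be applied through the rules argument of analyzeSentiment, paired with a binary dictionary. A sketch following the package's list convention, with loadDictionaryGI() assumed as the dictionary source:

```r
library(SentimentAnalysis)

documents <- c("This is a good thing!", "This is a bad thing.")

# Pair ruleSentiment with the Harvard-IV dictionary
rules <- list("Sentiment" = list(ruleSentiment, loadDictionaryGI()))
analyzeSentiment(documents, rules = rules)
```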
Sentiment score defined as the difference between positive and negative word counts divided by the sum of positive and negative words.
ruleSentimentPolarity(dtm, d)
dtm |
Document-term matrix |
d |
Dictionary of type |
Let p denote the number of positive words and n the number of
negative words
. Then, the sentiment polarity is defined as
(p - n) / (p + n). Here, it uses the entries negativeWords
and
positiveWords
of the SentimentDictionaryBinary
.
Sentiment score in the range of -1 to 1.
Counts total word frequencies in each document
ruleWordCount(dtm)
dtm |
Document-term matrix |
Total number of words
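Since ruleWordCount requires no dictionary, it is listed without one in the rules specification. A sketch:

```r
library(SentimentAnalysis)

documents <- c("Short text", "A somewhat longer example text")

# Count the words per document as a sentiment "rule"
analyzeSentiment(documents,
                 rules = list("WordCount" = list(ruleWordCount)))
```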
Depending on the input, this function creates a new sentiment dictionary of different type.
SentimentDictionary(...)
... |
Arguments as passed to one of the three functions
|
SentimentDictionaryWordlist
,
SentimentDictionaryBinary
,
SentimentDictionaryWeighted
This routine creates a new object of type SentimentDictionaryBinary
that
stores two separate vectors of negative and positive words
SentimentDictionaryBinary(positiveWords, negativeWords)
positiveWords |
is a vector containing the entries labeled as positive |
negativeWords |
is a vector containing the entries labeled as negative |
Returns a new object of type SentimentDictionaryBinary
# generate a dictionary with positive and negative words
d <- SentimentDictionaryBinary(c("increase", "rise", "more"),
                               c("fall", "drop"))
summary(d)

# alternative call
d <- SentimentDictionary(c("increase", "rise", "more"),
                         c("fall", "drop"))
summary(d)
This routine creates a new object of type SentimentDictionaryWeighted
that
contains a number of words, each linked to a continuous score (i.e. weight) for
specifying its polarity. The scores can later be interpreted as a linear model
SentimentDictionaryWeighted(
  words,
  scores,
  idf = rep(1, length(words)),
  intercept = 0
)
words |
is collection (vector) of different words as strings |
scores |
are the corresponding scores or weights denoting the word's polarity |
idf |
provide further details on the frequency of words in the corpus as an additional source for normalization |
intercept |
is an optional parameter for shifting the zero level (default: 0) |
Returns a new object of type SentimentDictionaryWeighted
The intercept is useful when the mean or median of a response variable is not located exactly at zero. For instance, stock market returns have a slight positive bias.
doi:10.1371/journal.pone.0209323
Pr\"ollochs and Feuerriegel (2018). Statistical inferences for Polarity Identification in Natural Language, PloS One 13(12).
# generate dictionary (based on linear model)
d <- SentimentDictionaryWeighted(c("increase", "decrease", "exit"),
                                 c(+1, -1, -10),
                                 rep(NA, 3))
summary(d)

# alternative call
d <- SentimentDictionaryWeighted(c("increase", "decrease", "exit"),
                                 c(+1, -1, -10))
summary(d)

# alternative call
d <- SentimentDictionary(c("increase", "decrease", "exit"),
                         c(+1, -1, -10),
                         rep(NA, 3))
summary(d)
This routine creates a new object of type SentimentDictionaryWordlist
SentimentDictionaryWordlist(wordlist)
wordlist |
is a vector containing the individual entries as strings |
Returns a new object of type SentimentDictionaryWordlist
# generate a dictionary with "uncertainty" words
d <- SentimentDictionaryWordlist(c("uncertain", "possible", "likely"))
summary(d)

# alternative call
d <- SentimentDictionary(c("uncertain", "possible", "likely"))
summary(d)
Function estimates coefficients based on spike-and-slab regression.
spikeslabEstimation(
  x,
  response,
  control = list(n.iter1 = 500, n.iter2 = 500),
  ...
)
x |
An object of type |
response |
Response variable including the given gold standard. |
control |
(optional) A list of parameters defining the spike-and-slab model. Default is |
... |
Additional parameters passed to function for |
Result is a list with coefficients, coefficient names and the model intercept.
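As with the other estimation routines, spikeslabEstimation is normally invoked indirectly when generating a dictionary; the modelType value below is an assumption about the generateDictionary interface, so treat this as a sketch:

```r
library(SentimentAnalysis)

documents <- c("This is a good thing!", "This is a very good thing!",
               "This is okay.", "This is a bad thing.")
response <- c(1, 0.5, 0, -0.5)

# Hypothetical call: select spike-and-slab estimation for the dictionary
dictionary <- generateDictionary(documents, response, modelType = "spikeslab")
summary(dictionary)
```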
Output summary information on sentiment dictionary
## S3 method for class 'SentimentDictionaryWordlist'
summary(object, ...)

## S3 method for class 'SentimentDictionaryBinary'
summary(object, ...)

## S3 method for class 'SentimentDictionaryWeighted'
summary(object, ...)
object |
Sentiment dictionary of type |
... |
Additional parameters passed to specific sub-routines |
print
for output the entries of a dictionary
summary(SentimentDictionary(c("uncertain", "possible", "likely")))
summary(SentimentDictionary(c("increase", "rise", "more"),
                            c("fall", "drop")))
summary(SentimentDictionary(c("increase", "decrease", "exit"),
                            c(+1, -1, -10),
                            rep(NA, 3)))
Preprocess existing corpus of type Corpus
according to default operations.
This helper function groups all standard preprocessing steps such that the usage of the
package is more convenient. The result is a document-term matrix.
toDocumentTermMatrix(
  x,
  language = "english",
  minWordLength = 3,
  sparsity = NULL,
  removeStopwords = TRUE,
  stemming = TRUE,
  weighting = function(x) tm::weightTfIdf(x, normalize = FALSE)
)
x |
|
language |
Default language used for preprocessing (i.e. stop word removal and stemming) |
minWordLength |
Minimum length of words used for cut-off; i.e. shorter words are removed. Default is 3. |
sparsity |
A numeric for the maximal allowed sparsity, in the range between zero (exclusive) and
one (exclusive). Default is |
removeStopwords |
Flag indicating whether to remove stopwords or not (default: yes) |
stemming |
Perform stemming (default: TRUE) |
weighting |
Function used for weighting of words; default is the tf-idf scheme. |
Object of DocumentTermMatrix
DocumentTermMatrix
for the underlying class
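A minimal sketch, assuming the tm and SentimentAnalysis packages are attached:

```r
library(tm)
library(SentimentAnalysis)

corpus <- transformIntoCorpus(c("Romeo loves Juliet",
                                "Romeo loves a girl"))

# Build a tf-idf weighted document-term matrix with default preprocessing
dtm <- toDocumentTermMatrix(corpus, minWordLength = 3)
inspect(dtm)
```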
Takes the given input of characters and transforms it into a Corpus
. The input is checked to match the expected class and format.
transformIntoCorpus(x)
x |
A list, data.frame or vector consisting of characters |
The generated Corpus
Factors are automatically cast to characters, but a warning is printed
preprocessCorpus
for further preprocessing, analyzeSentiment
for subsequent sentiment analysis
transformIntoCorpus(c("Document 1", "Document 2", "Document 3"))
transformIntoCorpus(list("Document 1", "Document 2", "Document 3"))
transformIntoCorpus(data.frame("Document 1", "Document 2", "Document 3"))
This routine exports a sentiment dictionary to a text file, e.g. for use by other programs or for inspecting its contents.
write(d, file)

## S3 method for class 'SentimentDictionaryWordlist'
write(d, file)

## S3 method for class 'SentimentDictionaryBinary'
write(d, file)

## S3 method for class 'SentimentDictionaryWeighted'
write(d, file)
d |
Dictionary of type |
file |
File to which the dictionary should be exported |
read
for later access
d.out <- SentimentDictionary(c("uncertain", "possible", "likely"))
write(d.out, "example.dict")
d.in <- read("example.dict")
print(d.in)

d.out <- SentimentDictionary(c("increase", "rise", "more"),
                             c("fall", "drop"))
write(d.out, "example.dict")
d.in <- read("example.dict")
print(d.in)

d.out <- SentimentDictionary(c("increase", "decrease", "exit"),
                             c(+1, -1, -10),
                             rep(NA, 3),
                             intercept=5)
write(d.out, "example.dict")
d.in <- read("example.dict")
print(d.in)

unlink("example.dict")