Hi there. On this page, I share some experimental work in the R programming language. I use R and text analysis to examine the words in the Dr. Seuss children's book The Cat In The Hat.
Image Source: http://wooderice.com/wp-content/uploads/2014/04/catinthehat.jpg
A text version of the book can be found at https://github.com/robertsdionne/rwet/blob/master/hw2/drseuss.txt. I copied and pasted the contents into a separate .txt file for offline use.
The R packages that are loaded in are:
# Text Mining on the Dr. Seuss - The Cat In The Hat Kids Book
# Text Version Of Book Source:
# https://github.com/robertsdionne/rwet/blob/master/hw2/drseuss.txt
# 1) Wordclouds
# 2) Word Counts
# 3) Sentiment Analysis - nrc, bing and AFINN Lexicons
#----------------------------------
# Load libraries into R:
# Install packages with install.packages("pkg_name")
library(dplyr)
library(tidyr)
library(ggplot2)
library(tidytext)
library(wordcloud)
library(tm)
To start, I load The Cat In The Hat book from the offline text file with the readLines() function. The resulting character vector is then put into a VectorSource and from there into a Corpus.
Once you have the Corpus object, the tm_map() function can be used to clean up the text. Options include removing punctuation, converting text to lowercase, removing numbers, stripping extra whitespace and removing stopwords (words like the, and, or, for, me).
# 1) Wordclouds
# Reference: http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know
# Ref 2: https://www.youtube.com/watch?v=JoArGkOpeU0
catHat_book <- readLines("cat_in_the_hat_textbook.txt")
## Warning in readLines("cat_in_the_hat_textbook.txt"): incomplete final line
## found on 'cat_in_the_hat_textbook.txt'
catHat_text <- Corpus(VectorSource(catHat_book))
# Clean the text up:
catHat_clean <- tm_map(catHat_text, removePunctuation)
## Warning in tm_map.SimpleCorpus(catHat_text, removePunctuation):
## transformation drops documents
catHat_clean <- tm_map(catHat_clean, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(catHat_clean, content_transformer(tolower)):
## transformation drops documents
catHat_clean <- tm_map(catHat_clean, removeNumbers)
## Warning in tm_map.SimpleCorpus(catHat_clean, removeNumbers): transformation
## drops documents
catHat_clean <- tm_map(catHat_clean, stripWhitespace)
## Warning in tm_map.SimpleCorpus(catHat_clean, stripWhitespace):
## transformation drops documents
# Remove English stopwords such as: the, and, or, over, under, and so on:
catHat_clean <- tm_map(catHat_clean, removeWords, stopwords('english'))
## Warning in tm_map.SimpleCorpus(catHat_clean, removeWords,
## stopwords("english")): transformation drops documents
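To see what each cleaning step does to the text, here is a minimal base R sketch on two lines from the book. It is not the tm pipeline itself: gsub() and tolower() stand in for the tm_map() transformations, and toy_stopwords is a tiny hand-picked list assumed for illustration, not the full stopwords('english') set. Whitespace is tidied last here so that the printed result is clean.

```r
# Two sample lines from the book
toy_lines <- c("The sun did not shine.", "All that cold, cold, wet day.")

# Hypothetical mini stop-word list (the real tm list is much longer)
toy_stopwords <- c("the", "did", "not", "all", "that")

clean <- gsub("[[:punct:]]", "", toy_lines)  # remove punctuation
clean <- tolower(clean)                       # convert to lowercase
clean <- gsub("[[:digit:]]", "", clean)       # remove numbers

# Remove stop words, matching whole words only
stop_pattern <- paste0("\\b(", paste(toy_stopwords, collapse = "|"), ")\\b")
clean <- gsub(stop_pattern, "", clean, perl = TRUE)

# Collapse repeated spaces and trim both ends
clean <- trimws(gsub("\\s+", " ", clean))

print(clean)
```

Only the content words are left in each line, which is the form the word counts are built from.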
The next step is to convert the cleaned corpus into a Term Document Matrix and then into a data frame. Once a data frame is obtained, wordclouds and bar graphs can be generated.
# Convert to Term Document Matrix:
td_mat<- TermDocumentMatrix(catHat_clean)
matrix <- as.matrix(td_mat)
sorted <- sort(rowSums(matrix),decreasing=TRUE)
data_text <- data.frame(word = names(sorted), freq = sorted)
#Preview data:
head(data_text, 30)
## word freq
## like like 88
## will will 58
## said said 43
## sir sir 37
## one one 35
## fish fish 34
## house house 29
## cat cat 29
## say say 29
## now now 29
## things things 26
## fox fox 26
## eat eat 26
## grinch grinch 26
## two two 25
## can can 25
## box box 25
## look look 24
## thing thing 22
## socks socks 22
## hat hat 20
## know know 20
## hop hop 18
## good good 17
## new new 17
## knox knox 17
## little little 16
## mouse mouse 16
## bump bump 15
## saw saw 15
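The row sums of the term document matrix are simply per-word totals across all lines of the text. As a rough illustration of the same idea without tm, the base R sketch below counts words with table() on two toy lines. The toy_clean input is made up for the example, and splitting on spaces is a simplification of tm's tokenizer (by default, TermDocumentMatrix() also ignores words shorter than three characters, which this sketch does not replicate).

```r
# Toy input: two already-cleaned lines (illustrative only)
toy_clean <- c("sun shine", "cold cold wet day")

# Split each line into words and pool them into one vector
toy_words <- unlist(strsplit(toy_clean, " ", fixed = TRUE))

# Count each distinct word and sort from most to least frequent
toy_counts <- sort(table(toy_words), decreasing = TRUE)

# Same two-column layout as data_text: word and freq
toy_data <- data.frame(word = names(toy_counts),
                       freq = as.integer(toy_counts),
                       stringsAsFactors = FALSE)

print(toy_data)
```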
The wordcloud() function from the wordcloud package generates a colourful wordcloud, as shown below.
# Wordcloud with colours:
set.seed(1234)
wordcloud(words = data_text$word, freq = data_text$freq, min.freq = 5,
max.words = 100, random.order=FALSE, rot.per=0.35,
colors = rainbow(30))
To make the wordcloud smaller, you can raise the minimum frequency requirement for words by changing the value of the min.freq argument in wordcloud(). Lowering the max.words argument also caps how many words are drawn.
# Wordcloud with colours with lower max words and raise minimum frequency:
wordcloud(words = data_text$word, freq = data_text$freq, min.freq = 15,
max.words = 80, random.order=FALSE, rot.per=0.35,
colors = rainbow(30))
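Before redrawing the cloud, it can help to check which words would survive a given threshold. The sketch below is a base R approximation and does not call wordcloud(): it assumes the documented behaviour that words with a frequency below min.freq are not plotted. The toy_data rows reuse a few counts from the preview above, and min_freq is a hypothetical value chosen for the example.

```r
# A few word counts copied from the data_text preview
toy_data <- data.frame(
  word = c("like", "will", "said", "hat", "bump", "saw"),
  freq = c(88, 58, 43, 20, 15, 15),
  stringsAsFactors = FALSE
)

# Hypothetical threshold for illustration
min_freq <- 20

# Keep only the words that meet the threshold
eligible <- toy_data[toy_data$freq >= min_freq, ]

print(eligible$word)
```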
The word like is the most common, followed by words such as will, sir, fish, things and grinch.
In my other R text mining pages, I use the tidytext package and its unnest_tokens() function to obtain the most common words in a text. On this page, however, I reuse the output from the previous section. The data_text object has already been preprocessed with the tm_map() function and is ready for plotting with ggplot2.
I take the 25 most common words from The Cat In The Hat book. The bars are drawn with the geom_col() function, and coord_flip() turns them sideways. Axis labels and a title are added with the labs() function, and the counts are printed on the bars with geom_text(). The theme() function allows for adjustment of aesthetics such as text colours, text sizes and so forth.
# Wordcounts Plot:
# ggplot2 bar plot (Top 25 Words)
data_text[1:25, ] %>%
mutate(word = reorder(word, freq)) %>%
ggplot(aes(word, freq)) +
geom_col(fill = "lightblue") +
coord_flip() +
labs(x = "Word \n", y = "\n Count ", title = "Word Counts In \n The Cat In The Hat Book \n (Top 25) \n") +
geom_text(aes(label = freq), hjust = 1.2, colour = "black", fontface = "bold", size = 3.7) +
theme(plot.title = element_text(hjust = 0.5, colour = "darkgreen", size = 15),
axis.title.x = element_text(face="bold", colour="darkblue", size = 12),
axis.title.y = element_text(face="bold", colour="darkblue", size = 12),
panel.grid.major = element_blank(),
panel.grid.minor= element_blank())
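The reorder(word, freq) call is what sorts the bars by count instead of alphabetically. A small base R sketch of that effect follows, using three counts from the preview above (toy_data is just an illustrative subset).

```r
# Three rows taken from the data_text preview
toy_data <- data.frame(
  word = c("like", "will", "said"),
  freq = c(88, 58, 43),
  stringsAsFactors = FALSE
)

# Default factor levels are alphabetical
alphabetical <- levels(factor(toy_data$word))
print(alphabetical)

# reorder() sorts the levels by ascending freq, so after coord_flip()
# the word with the largest count sits at the top of the chart
ordered_word <- reorder(factor(toy_data$word), toy_data$freq)
print(levels(ordered_word))
```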
The wordclouds do not show the count associated with each word. The bar graph, with the counts printed on the bars, makes them easy to read.
The most common words in The Cat In The Hat include like, will, said, sir, one, fish and say.
Sentiment analysis looks at a piece of text and determines whether the text is positive or negative (depending on the lexicon). Three lexicons are used here for analyzing words: nrc, bing and AFINN.
Do keep in mind that each lexicon has its own way of scoring words in terms of positive/negative sentiment. In addition, a word may appear in some lexicons and not in others. These lexicons are not perfect, as their scoring is subjective.
I read the book into R again and convert it into a tibble (a neater data frame). The head() function is used to preview the start of the book.
# 3) Sentiment Analysis
# Is the book positive, negative, neutral?
catHat_book <- readLines("cat_in_the_hat_textbook.txt")
## Warning in readLines("cat_in_the_hat_textbook.txt"): incomplete final line
## found on 'cat_in_the_hat_textbook.txt'
# Preview the start of the book:
catHat_book_df <- tibble(Text = catHat_book) # tibble aka neater data frame (data_frame() is deprecated)
head(catHat_book_df, n = 15)
## # A tibble: 15 x 1
## Text
## <chr>
## 1 The sun did not shine.
## 2 It was too wet to play.
## 3 So we sat in the house
## 4 All that cold, cold, wet day.
## 5 ""
## 6 I sat there with Sally.
## 7 We sat there, we two.
## 8 "And I said, \"How I wish"
## 9 "We had something to do!\""
## 10 ""
## 11 Too wet to go out
## 12 And too cold to play ball.
## 13 So we sat in the house.
## 14 We did nothing at all.
## 15 ""
The unnest_tokens() function is then applied to the tibble, so that each word in The Cat In The Hat has its own row. An anti_join() removes English stop words such as the, and, for, my, myself. The count() function with the sort = TRUE argument then returns the count for each word, in descending order.
catHat_book_words <- catHat_book_df %>%
unnest_tokens(output = word, input = Text)
# Retrieve word counts as set up for sentiment lexicons:
catHat_book_wordcounts <- catHat_book_words %>%
anti_join(stop_words) %>%
count(word, sort = TRUE)
## Joining, by = "word"
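The same three steps can be tried on a couple of lines to see what each one returns. This is a small sketch using the first two lines of the book, with toy_df, toy_words and toy_counts as illustrative names. Passing by = "word" to anti_join() is optional, but it makes the join column explicit and avoids the "Joining, by" message.

```r
library(dplyr)
library(tidytext)

# First two lines of the book as a tiny tibble
toy_df <- tibble(Text = c("The sun did not shine.", "It was too wet to play."))

# One row per word; unnest_tokens() lowercases the text and
# strips punctuation by default
toy_words <- toy_df %>%
  unnest_tokens(output = word, input = Text)

print(toy_words$word)

# Drop stop words, then count what is left
toy_counts <- toy_words %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

print(toy_counts)
```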
nrc Lexicon
The nrc lexicon tags words with emotions such as trust, fear, sadness and anger, as well as with the broader sentiments negative and positive. Here, the sentiments of interest from the nrc lexicon are negative and positive.
#### Using nrc, bing and AFINN lexicons
word_labels_nrc <- c(
`negative` = "Negative Words",
`positive` = "Positive Words"
)
### nrc lexicons:
# get_sentiments("nrc")
catHat_book_words_nrc <- catHat_book_wordcounts %>%
inner_join(get_sentiments("nrc"), by = "word") %>%
filter(sentiment %in% c("positive", "negative"))
# Preview common words with sentiment label:
head(catHat_book_words_nrc, n = 12)
## # A tibble: 12 x 3
## word n sentiment
## <chr> <int> <chr>
## 1 sir 37 positive
## 2 eat 26 positive
## 3 fun 16 positive
## 4 mother 15 negative
## 5 mother 15 positive
## 6 tree 15 positive
## 7 sing 13 positive
## 8 battle 11 negative
## 9 green 10 positive
## 10 noise 10 negative
## 11 goo 9 negative
## 12 trick 9 negative
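Notice that mother appears twice in the preview, once as negative and once as positive. That is a side effect of inner_join(): a word that the lexicon lists under several sentiments comes back once per match, and a word missing from the lexicon is dropped. The sketch below reproduces this with toy_lexicon, a hypothetical three-row stand-in for the real nrc lexicon, and a few toy counts.

```r
library(dplyr)

# Toy word counts (illustrative values)
toy_counts <- tibble(word = c("sir", "mother", "hat"),
                     n = c(37L, 15L, 20L))

# Hypothetical mini lexicon; the real nrc lexicon is far larger
toy_lexicon <- tibble(
  word = c("sir", "mother", "mother"),
  sentiment = c("positive", "negative", "positive")
)

# Keep only words found in the lexicon, one row per matching sentiment
toy_joined <- toy_counts %>%
  inner_join(toy_lexicon, by = "word")

print(toy_joined)
```

If counts are later summed per sentiment, such a word contributes to both the positive and the negative total.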
Here is the code and output for the word counts influenced by the nrc Lexicon for The Cat In The Hat book. There is a lot of code in the section below as I wanted to make the plot look nicer than usual.
# Sentiment Plot with nrc Lexicon (Word Count over 5)
catHat_book_words_nrc %>%
filter(n > 5) %>%
ggplot(aes(x = reorder(word, n), y = n, fill = sentiment)) +
geom_bar(stat = "identity", position = "identity") +
geom_text(aes(label = n), colour = "black", hjust = 1, fontface = "bold", size = 3) +
facet_wrap(~sentiment, nrow = 2, scales = "free_y", labeller = as_labeller(word_labels_nrc)) +
labs(x = "\n Word \n", y = "\n Word Count ", title = "Negative & Positive Words In \n The Cat In The Hat Kids' Book \n With The nrc Lexicon \n") +
theme(plot.title = element_text(hjust = 0.5),
axis.title.x = element_text(face="bold", colour="darkblue", size = 12),
axis.title.y = element_text(face="bold", colour="darkblue", size = 12),
strip.background = element_rect(fill = "lightblue"),
strip.text.x = element_text(size = 10, face = "bold")) +
scale_fill_manual(values=c("#FF0000", "#01DF3A"), guide="none") +
coord_flip()
bing Lexicon
The bing lexicon categorizes words as either positive or negative. In the bar plot below, you will see that the top words differ from the ones picked out by the nrc lexicon. (These lexicons are subjective.)
### bing lexicon:
# get_sentiments("bing")
word_labels_bing <- c(
`negative` = "Negative Words",
`positive` = "Positive Words"
)
catHat_book_words_bing <- catHat_book_wordcounts %>%
inner_join(get_sentiments("bing"), by = "word") %>%
ungroup()
# Preview the words and counts:
head(catHat_book_words_bing, n = 15)
## # A tibble: 15 x 3
## word n sentiment
## <chr> <int> <chr>
## 1 fun 16 positive
## 2 bump 15 negative
## 3 fast 11 positive
## 4 likes 10 positive
## 5 noise 10 negative
## 6 dark 9 negative
## 7 trick 9 negative
## 8 fear 8 negative
## 9 bad 7 negative
## 10 cold 7 negative
## 11 sad 7 negative
## 12 slow 7 negative
## 13 top 6 positive
## 14 fall 5 negative
## 15 funny 5 negative
# Sentiment Plot with bing Lexicon (Counts over 3):
catHat_book_words_bing %>%
filter(n > 3) %>%
ggplot(aes(x = reorder(word, n), y = n, fill = sentiment)) +
geom_bar(stat = "identity", position = "identity") +
geom_text(aes(label = n), colour = "black", hjust = 1, fontface = "bold", size = 3) +
facet_wrap(~sentiment, nrow = 2, scales = "free_y", labeller = as_labeller(word_labels_bing)) +
labs(x = "\n Word \n", y = "\n Word Count ", title = "Negative & Positive Words In \n The Cat In The Hat Kids' Book \n With The bing Lexicon \n") +
theme(plot.title = element_text(hjust = 0.5),
axis.title.x = element_text(face="bold", colour="darkblue", size = 12),
axis.title.y = element_text(face="bold", colour="darkblue", size = 12),
strip.background = element_rect(fill = "#BEE1D3"),
strip.text.x = element_text(size = 10, face = "bold", colour = "black")) +
scale_fill_manual(values=c("#FF0000", "#01DF3A"), guide="none") +
coord_flip()
The top negative word according to bing is bump. Other intriguing negative words include sue, funny, trick and noise. The word sue could be a verb, as in to sue someone, or it could be a name. I am not sure I agree with funny being a negative word. The word trick can be used as a verb, as in to trick someone, or as a noun, as in a magic trick. I presume bing interprets trick as a verb.
AFINN Lexicon
Words from the AFINN lexicon are given a score from -5 to +5 (whole numbers only). Negative scores are for negative words and positive scores are for positive words. I have used the mutate() function from R's dplyr package to add a new column which indicates whether a word is positive or negative. This extra column makes it possible to split the words into separate panels within a single ggplot2 plot.
### AFINN lexicon:
# Change labels
# (Source: https://stackoverflow.com/questions/3472980/ggplot-how-to-change-facet-labels)
catHat_book_words_AFINN <- catHat_book_wordcounts %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
mutate(is_positive = score > 0)
head(catHat_book_words_AFINN, n = 15)
## # A tibble: 15 x 4
## word n score is_positive
## <chr> <int> <int> <lgl>
## 1 fun 16 4 TRUE
## 2 battle 11 -1 FALSE
## 3 likes 10 2 TRUE
## 4 stop 9 -1 FALSE
## 5 fear 8 -2 FALSE
## 6 bad 7 -3 FALSE
## 7 dear 7 2 TRUE
## 8 sad 7 -2 FALSE
## 9 top 6 2 TRUE
## 10 blocks 5 -1 FALSE
## 11 funny 5 4 TRUE
## 12 fan 4 3 TRUE
## 13 fight 4 -1 FALSE
## 14 luck 4 3 TRUE
## 15 shame 4 -2 FALSE
word_labels_AFINN <- c(
`FALSE` = "Negative Words",
`TRUE` = "Positive Words"
)
catHat_book_words_AFINN %>%
filter(n > 3) %>%
ggplot(aes(x = reorder(word, n), y = n, fill = is_positive)) +
geom_bar(stat = "identity", position = "identity") +
geom_text(aes(label = n), colour = "black", hjust = 1, fontface = "bold", size = 3.2) +
facet_wrap(~is_positive, scales = "free_y", nrow = 2, labeller = as_labeller(word_labels_AFINN)) +
labs(x = "\n Word \n", y = "\n Word Count ", title = "Negative & Positive Words In \n The Cat In The Hat Kids' Book \n With The AFINN Lexicon \n",
fill = c("Negative", "Positive")) +
theme(plot.title = element_text(hjust = 0.5),
axis.title.x = element_text(face="bold", colour="darkblue", size = 12),
axis.title.y = element_text(face="bold", colour="darkblue", size = 12),
strip.background = element_rect(fill = "#D5ADA4"),
strip.text.x = element_text(size = 10, face = "bold", colour = "black")) +
scale_fill_manual(values=c("#FF0000", "#01DF3A"), guide="none") +
coord_flip()
Under AFINN, the most common negative word is battle and the most common positive word is fun. The word fun is featured in all three lexicons and the “negative” word bad is featured in all three as well. As different as these lexicons are in how they categorize words, the three still share a few common words.
Overall, the nrc lexicon rates The Cat In The Hat book more positively than bing and AFINN do. bing gives the book a more negative rating overall, and the AFINN results are fairly balanced.
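One way to put numbers behind a comparison like this is to total the word occurrences on each side. The base R sketch below does so for AFINN using only the 15 preview rows shown earlier, so the totals illustrate the method and are not the book's overall result. The weighted net score (count times score) is an extra measure I am assuming here for illustration; it is not computed elsewhere on this page. If your installed tidytext returns the AFINN score in a column named value instead of score, adjust the column name accordingly.

```r
# The 15 AFINN preview rows (word, count n, AFINN score)
afinn_preview <- data.frame(
  word  = c("fun", "battle", "likes", "stop", "fear", "bad", "dear", "sad",
            "top", "blocks", "funny", "fan", "fight", "luck", "shame"),
  n     = c(16, 11, 10, 9, 8, 7, 7, 7, 6, 5, 5, 4, 4, 4, 4),
  score = c(4, -1, 2, -1, -2, -3, 2, -2, 2, -1, 4, 3, -1, 3, -2),
  stringsAsFactors = FALSE
)

# Total occurrences of negative (FALSE) and positive (TRUE) words
occurrences <- tapply(afinn_preview$n, afinn_preview$score > 0, sum)
print(occurrences)

# Net score: each occurrence weighted by its AFINN score
net_score <- sum(afinn_preview$n * afinn_preview$score)
print(net_score)
```

The two measures can disagree: the occurrence totals treat every word equally, while the weighted score lets strongly scored words such as fun count for more.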