Text Mining


During this practical, we will cover an introduction to text mining. Topics covered are how to pre-process mined text, different ways to visualize this the mined text and an introduction to one type of analysis you can conduct during text mining: sentiment analysis. As a whole, there are multiple ways to analysis mine & analyze text within R. However, for this practical we will be focused on the techniques covered in the tidytext package, based upon the tidyverse.

For this practical, you will need the following packages:

# General Packages

# Text Mining

For this practical, we will be using text mined through the Gutenberg Project; briefly this project contains over 60,000 freely accessible eBooks, which through the package gutenberger, can be easily accessed and perfect for text mining and analysis.

During this practical, we will be looking at several books from the late 1800s, in the mindset to compare and contrast the use of language within them. These books include:

  • Alice’s Adventures in Wonderland by Lewis Carroll
  • The Picture of Dorian Gray by Oscar Wilde
  • Dracula by Bram Stoker
  • The Strange Case of Dr. Jekyll and Mr. Hyde by Robert Louis Stevenson

Despite being from the late 1800s, these books still are popular and hold cultural significance in TV, Movies and the English Language. To access this novel suitable for this practical the following function should be used:

AAIWL <- gutenberg_download(28885) # 28885 is the eBook number of Alice in Wonderland
PODG <- gutenberg_download(174) # 174 is the eBook number of The Picture of Dorian Gray
Drac <- gutenberg_download(345) # 345 is the eBook number of Dracula
SCJH <- gutenberg_download(43) # 43 is the eBook number of Dr. Jekyll and Mr. Hyde

After having loaded all of these books into your working directory (using the code above), examine one of these books using the View() function. When you view any of these data frames, you will see that these have an extremely messy layout and structure. As a result of this complex structure means that conducting any analysis would be extremely challenging, so pre-processing must be undertaken to get this into a format which is usable.

Pre-Processing Text

In order for text to be used effectively within statistical processing and analysis; it must be pre-processed so that it can be uniformly examined. In general the major steps of pre-processing include:

  • Removing numbers
  • Removing capital letters
  • Removing stop words
  • Removing inter-punctuation
  • Application of a stemming algorithm

These steps are important as this allows the text to be presented uniformly for analysis; within this practical we will discuss how to undergo each of these steps using the tidytext package, with the exception of stemming.

Step 1: Un-nesting Text

When we previously looked at this text, as we discovered it was extremely messy with it being attached one line per row in the data frame. As such, it is important to un-nest this text so that it attaches one word per row.

Before un-nesting text, it is useful to make a note of aspects such as the line which text is on, and the chapter each line falls within. This can be important when examining anthologies or making chapter comparisons as this can be specified within the analysis.

In order to specify the line number and chapter of the text, it is possible to use the mutuate function from the dplyr package.

  1. Apply the code below, which uses the mutate function, to add line numbers and chapter references one of the books. Next, use the View() function to examine how this has changed the structure.

# Template: 

tidy_[BOOKNAME] <- [BOOKNAME] %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE))))

From this, it is now possible to pass the function unnest_tokens() in order to split apart the sentence string, and apply each word to a new line. When using this function, you simply need to pass the arguments, word (as this is what you want selecting) and text (the name of the column you want to unnest).

  1. Apply unnest_tokens to your tidied book to unnest this text. Next, once again use the View() function to examine the output.

Hint: As with Question 1, ensure to use the piping operator (%>%) to easily apply the function.

This results in one word being linked per row of the data frame. The benefit of using the tidytext package in comparison to other text mining packages, is that this automatically applies some of the basic steps to pre-process your text, including removing of capital letters, inter-word punctuation and numbers. However additional pre-processing is required.

Intermezzo: word clouds

Before continuing the pre-processing process, let’s have a first look at our text by making a simple visualization using word clouds. Typically these word clouds visualize the frequency of words in a text through relating the size of the displayed words to frequency, with the largest words indicating the most common words.

To plot word clouds, we first have to create a data frame containing the word frequencies.

  1. Create a new data frame, which contains the frequencies of words from the unnested text. To do this, you can make use of the function count().

Hint: As with Question 1, ensure to use the piping operator (%>%) to easily apply the function.

  1. Using the wordcloud() function, create a word cloud for your book text. Use the argument max.words within the function to set the maximum number of words to be displayed in the word cloud.

Hint: As with Question 1, ensure to use the piping operator (%>%) to easily apply the function. Note: Ensure to use the function with(), is used after the piping operator.

  1. Discuss with another individual or group, whether you can tell what text each word clouds come from, based on the popular words which occur.

Step 2: Removing Stop Words

As discussed within the lecture, stop words are words in any language which have little or no meaning, and simply connect the words of importance. Such as the, a, also, as, were… etc. To understand the importance of removing these stop words, we can simply do a comparison between the text which has had them removed and those which have not been.

To remove the stop words, we use the function anti_join(). This function works through un-joining this table based upon the components, which when passed with the argument stop_words, which is a table containing these words across three lexicons. This removes all the stop words from the presented data frame.

  1. Use the function anti_join() to remove stop words from your tidied text attaching it to a new data frame.

Hint: As with Question 1, ensure to use the piping operator (%>%) to easily apply the function.

In order to examine the impact of removing these filler words, we can use the count() function to examine the frequencies of different words. This when sorted, will produce a table of frequencies in descending order. An other option is to redo the wordclouds on the updated data frame containing the word counts of the tidied book text without stop words.

  1. Use the function count() to compare the frequencies of words in the dataframes containing the tidied book text with and without stop words (use sort = TRUE within the count() function), or redo the wordclouds. Do you notice a difference in the (top 10) words which most commonly occur in the text?

Hint: As with Question 1, ensure to use the piping operator (%>%) to easily apply the function.

Step 3: Additional Pre-Processing

So far, you have examined different ways of processing the actual text within the data itself and preparing this for analysis. Before moving on to visualizing and analyzing this text in more detail, it should be commented that usually additional processing is required, such as removal of irrelevant sections like pre-text or details on the publisher and publishing data and place. In our case, if you have taken time to review the the text documents at the beginning, you will have seen there is an initial pre-text, and other irrelevant details which are not required for an analysis. Although these are unlike to influence analyses which are based on frequencies, if other analyses are conducted it could impact your findings. As such it can be important to remove them.

In order to filter your data, the filter() functions previously discussed in Week 7 (Interactive Data Visualization) and Basics of dplyr, can be used. In our case, you will want to use this function to remove all data which is contained within Chapter 0, which is before the novel actually begins.

  1. Using the filter() function, in addition to your knowledge of dplyr, remove Chapter 0 from your tidied text.

Hint: As with Question 1, ensure to use the piping operator (%>%) to easily apply the function. Note: Jekyll and Hyde does not have multiple chapters, so skip this step for this text.

Visualising Mined Text

When it comes to visualizing text output, there are multiple methods which can be effective. A majority of these are based upon the frequency of these words. We’ve already seen word clouds. Next, we’ll have a look at Word Frequency Bar Charts.

Word Frequency Bar Charts

One of the clearest way to visualize the frequency of different words across different documents. As a whole, there are two methods to do this: 1) plotting the most common words across all documents (through a simple count method for each document, examining the term frequencies individually), 2) plotting the term frequency inverse document frequency (tf-idf) across all documents.

First, we examine the frequency of words within individual documents. This method can be easily completed using ggplot. First, again count the word frequencies, but this time filter out the top scoring words in the text and use the mutate() function to reorder the variable in order of frequency.

  1. Create a new variable, which is made up of the frequencies of words from the tidied text, which contains only words with a frequency over 50. In addition, use the mutate() function to reorder the variable in order of frequency.

Hint: As with Question 1, ensure to use the piping operator (%>%) to easily apply the function.

From these new variables, you can use ggplot() with the geom_col() function to create a plot containing the word frequencies.

  1. Using ggplot() function using geom_col() create a frequency plot for the words which occur more than 50 times in the text. For the books ‘The Picture of Dorian Gray’ and ‘Dracula’, restrict your plots to the top 20.

Hint: Use the function coord_flip(), to flip this on the axis.

The second way to examine word frequency is through an examination where all documents are combined into one data frame. Where both term frequency (tf) is combined across all documents which can then be used in combination with the inverse document frequency (idf) to determine how importance a word is, in relation to each document within a collection.

Before all text can be combined, it is important to ensure all data frames are correctly labelled. At this time, if you were to combine all the documents, you would see the main idenifier for words from different texts is the gutenberg_id, this although is specific to each text, provides very limited information to anyone viewing these data frames. As such it is important to add the name of the book into each data frame.

  1. Add a column, labeled “book” to the tidied text which names the book.

Hint: This can easily be done through using the operator $ to define a new column in the data frame before adding the name accordingly.

  1. Swapping your code with others in your group, combine all four tided texts together, into one document called “tidy_text”.

Now that the books have been combined, you can now examine again the frequency of different words and the total amount of words per book. To do this, after grouping the data frame by book (using the group_by() function), you will need to find the frequency of each word, again using the count() function. Secondly, you will also need to find the total number of words per book, this can be done simply through the summarize() function. Before being combined together to form a new data frame which details the frequency of each word and the total number of overall words per book. Calculating this is important, as this can then be plotted to determine how common highly occurring words are, compared to rare or unique terms.

  1. Using the code template below, obtain the term frequencies across these books.

tidy.text.count <- tidy_text %>%
  group_by(??) %>%
  count(??, ??, sort = TRUE)

tidy.text.total <- tidy.text.count %>%
  group_by(??) %>%
  summarize(total = sum(??)) 

tidy.text.tf <- left_join(tidy.text.count, tidy.text.total)

From the term frequencies, we can obtain the tf, idf and tf_idf of the data set. For this, we use the function bind_tf_idf().

  1. Using the bind_tf_idf() function, and the code template below, plot the tf_idf across the books.

tidy.text.tf <- tidy.text.tf %>% 
  bind_tf_idf(word, ??, ?)

tidy.text.tf %>%
  arrange(desc(??)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(??) %>% 
  top_n(??) %>% 
  ungroup() %>%
  ggplot(aes(??, ??, fill = ??)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "??") +
  facet_wrap(~book, ncol = 2, scales = "free") +

Analysing Mined Text: Sentiment Analysis

Beyond simply understanding how frequently words occur within text, it is possible to conduct analysis which examines the sentiment of the text. When considering the sentiment of any text it can be used to examine emotions, such as Happy vs Sad, or the association of words with traits such as Good vs Bad. This sentiment analysis can be put to good use when attempting to classify text, whether classifying tweets, blog posts or emails (to determine their content or whether they are Spam). For this practical however, we will focus upon the foundations of sentiment analysis, and whether we can use it to determine what text occurs from the books we have discussed.

Similar to stop words and stemming algorithms, sentiment analysis can be based upon pre-existing dictionaries and methods. Each which have their benefits, the three covered within the recommended reading are AFINN, bing and nrc. Each of these as discussed in the book provides a different analysis of the sentiment in the text. For this practical we will focus on the nrc and AFINN dictionaries.

Before we conduct a sentiment analysis, let us examine what exists in these dictionaries.

  1. Using the function get_sentiments() passing the argument “afinn” or “nrc” to explore the assignment of values to words.

Note: to access these lexicons you will need to install the package textdata as well as the dictionaries fully.

Here you can see the main differences between these dictionaries, with afinn assigning a value between -5 to +5 indicating the sentiment from negative (-5) to positive (+5). Whereas nrc simply classifies the words by a singular emotion.

Given the size of these dictionaries, it is often more effective to conduct a sentiment analysis without having passed any stemming algorithm.

sentiment analysis using afinn

Step 1: Applying Sentiment Ranking

One method of calculating and visualizing sentiment is through a chapter calculation and determination of how positive or negative a chapter is. To do this, the first step is to apply the afinn dictionary to the passages, using the inner_join() function passing the argument get_sentiments("afinn") to assign weights to these values. Note that this is not possible for the book ’The Strange Case of Dr. Jekyll and Mr. Hyde`, as this book does not include chapters.

  1. Using the combined function: inner_join(get_sentiments("afinn")), apply this to your text to apply the weight to the text.

From this you can determine the positive or negative ranking per chapter through running the following code:

  1. Run the following code for your tided text, to create a weight per chapter ranking the positive or negative rating.

  group_by(chapter) %>%
  mutate(sentiments = sum(value)) %>%
  distinct(chapter, .keep_all = T)

Step 2: Visualising Sentiment

From this, you can create a graph using ggplot to display the the relationship between chapter and sentiments. Which will allow you to compare the different sentiments visually across these books.

  1. Using your knowledge of ggplot(), plot your tidied text to indicate the sentiments per chapter.

OPTIONAL - sentiment analysis using nrc

Step 1: Select Sentiment Focus

Before beginning any analysis with nrc, a focus of the sentiment analysis should be selected. By selecting a focus, this means selecting an emotion which you wish to examine, let us first consider fear. To do this you simply need to filter() the sentiment to “fear”.

  1. Filter the nrc dictionary, to only be listed with the “fear” terms, assigning this to the variable list nrc_fear.

Step 2: Filtering the Focus

From this you can use the inner_join() function passing the argument nrc_fear to isolate only the words on this list.

  1. Using the inner_join() and count() functions to create a list of the fear words and their frequencies within your tided text.

Step 3: Comparing Sentiment Components

  1. In your groups, discuss the differences in frequencies of these fear base, and re-run questions 19 & 20, with different emotions (for example, “joy” and “fear”) and compare the frequencies in these different texts.