Chapter 14 Text³⁵

Where are we? Where are we headed?

Up till now, you should have covered:

Loading in data;
R notation;
Matrix algebra.

14.1 Review

" and ' are usually equivalent.
<- and = are usually interchangeable³⁶. (x <- 3 is equivalent to x = 3, although the former is preferred because it explicitly states the assignment).
Use ( ) when you are giving input to a function:

# my_results <- FunctionName(FunctionInputs)

# Note: `c(1,2,3)` passes three numbers as input to the function `c`

Use { } when you are defining a function or writing a for loop:

#function 
MyFunction <- function(InputMatrix){ 
  TempMat <- InputMatrix
  for(i in 1:5){
    TempMat <- t(TempMat)  %*% TempMat / 10
  } 
  return( TempMat )
}
myMat <- matrix(rnorm(100*5), nrow = 100, ncol = 5)
print( MyFunction(myMat) )

##           [,1]      [,2]       [,3]       [,4]      [,5]
## [1,]  342.3602  196.1668   856.7638  -732.7517  173.1954
## [2,]  196.1668  515.3176   762.8554  -277.1625  299.6710
## [3,]  856.7638  762.8554  2697.1230 -1868.8323  461.6741
## [4,] -732.7517 -277.1625 -1868.8323  1678.3580 -264.6936
## [5,]  173.1954  299.6710   461.6741  -264.6936  219.0823

# loop 
x <- c() 
for(i in 1:20){
  x[i] <- i 
}
print(x)

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

14.2 Goals for today

Today, we will learn more about using text data. Our objectives are:

To read and write text in R;
To learn how to use paste and sprintf;
To learn how to use regular expressions;
To learn about other tools for representing + analyzing text in R.

14.3 Reading and writing text in R

To read in a text file, use readLines:

readLines("~/Downloads/Carboxylic acid - Wikipedia.html")

To write a text file, use:

write.table(my_string_vector, "~/mydata.txt", sep="\t")
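As a minimal sketch of a full round trip, the lines below write a character vector to a temporary file and read it back. The vector contents and the temporary path are made up for illustration. The sketch uses writeLines, a base R alternative to write.table that writes each element as one plain line, without quotes or row names.

```r
# Hypothetical character vector: one element per line of text
my_string_vector <- c("Sing, goddess, the rage of Achilles",
                      "Tell me, Muse, of the man")

# Write to a temporary file so nothing in your home directory is touched
tmp_path <- tempfile(fileext = ".txt")
writeLines(my_string_vector, tmp_path)

# readLines returns a character vector with one element per line
text_back <- readLines(tmp_path)
print(identical(text_back, my_string_vector))
## [1] TRUE
```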

14.4 `paste()` and `sprintf()`

paste and sprintf are useful commands in text processing, such as for automatically naming files or automatically performing a series of commands over a subset of your data. Making tables will also often require these commands.

paste concatenates character vectors.

#use collapse to join the elements of a single vector (length > 1)
my_string <- c("Not", "one", "could", "equal")
paste(my_string, collapse = " ")

## [1] "Not one could equal"

#use sep to join separate arguments (each of length == 1 here)
paste("Not", "one", "could", "equal", sep = " ")

## [1] "Not one could equal"
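The file-naming use mentioned above could look roughly like the following sketch, which combines sep and collapse. The "figure" prefix and the .png extension are invented for the example.

```r
# paste is vectorized: sep joins the arguments element by element
file_names <- paste("figure", 1:3, sep = "_")
file_names <- paste0(file_names, ".png")  # paste0 is paste with sep = ""
print(file_names)
## [1] "figure_1.png" "figure_2.png" "figure_3.png"

# collapse then glues the resulting vector into a single string
one_string <- paste(file_names, collapse = ", ")
print(one_string)
## [1] "figure_1.png, figure_2.png, figure_3.png"
```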

For more sophisticated concatenation, use sprintf. This is very useful for automatically making tables.

sprintf("Coefficient for %s: %.3f (%.2f)", "Gender", 1.52324, 0.03143)

## [1] "Coefficient for Gender: 1.523 (0.03)"

#%s is replaced by a character string
#%.3f is replaced by a floating-point number with 3 decimal places
#%.2f is replaced by a floating-point number with 2 decimal places
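Because sprintf is vectorized, one call can format every row of a table. The sketch below assumes three hypothetical coefficients with invented values. The width specifiers %-8s (left-justified, 8 characters wide) and %7.3f (7 characters wide, 3 decimals) are standard sprintf formats that go slightly beyond the example above.

```r
# Hypothetical regression output (values invented for illustration)
term     <- c("Gender", "Age", "Income")
estimate <- c(1.52324, -0.20418, 0.00731)
std_err  <- c(0.03143, 0.05120, 0.00210)

# One call formats all three rows; fixed widths keep the columns aligned
rows <- sprintf("%-8s %7.3f (%.3f)", term, estimate, std_err)
cat(rows, sep = "\n")
```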

14.5 Regular expressions

A regular expression is a special text string that describes a search pattern. Regular expressions are most often used in functions for detecting, locating, and replacing desired text in a corpus.

Use cases:

TEXT PARSING. E.g. I have 10000 congressional speeches. Find all those which mention Iran.
WEB SCRAPING. E.g. Parse html code in order to extract research information from an online table.
CLEANING DATA. E.g. After loading in a dataset, we might need to remove mistakes from the dataset, or subset the data using regular expression tools.

Example in R. Extract the tweet mentioning Indonesia.

s1 <- "If only Bradley's arm was longer. RT"
s2 <- "Share our love in Indonesia and in the World. RT if you agree." 
my_string <- c(s1, s2)
grepl(my_string, pattern = "Indonesia")

## [1] FALSE  TRUE

my_string[ grepl(my_string, pattern = "Indonesia")]

## [1] "Share our love in Indonesia and in the World. RT if you agree."

Key point: Many R commands use regular expressions. See ?grepl. Assume that x is a character vector and that pattern is the target pattern. In the earlier example, x could have been something like my_string and pattern would have been "Indonesia". Here are other key uses:

DETECT PATTERNS. grepl(pattern, x) goes through all the entries of x and returns a logical vector of TRUE and FALSE values of the same length as x. It returns TRUE whenever an entry contains the target pattern, and FALSE whenever it doesn't.
REPLACE PATTERNS. gsub(pattern, replacement, x) goes through all the entries of x and replaces every occurrence of pattern with replacement.

gsub(x = my_string,
     pattern = "o", 
     replacement = "AAAA")

## [1] "If AAAAnly Bradley's arm was lAAAAnger. RT"                                   
## [2] "Share AAAAur lAAAAve in IndAAAAnesia and in the WAAAArld. RT if yAAAAu agree."
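For comparison, here is a small sketch using sub, the base R sibling of gsub that replaces only the first match in each element. It reuses the first tweet from the example.

```r
s1 <- "If only Bradley's arm was longer. RT"

# sub replaces only the first match in each element
first_only <- sub(pattern = "o", replacement = "AAAA", x = s1)
print(first_only)
## [1] "If AAAAnly Bradley's arm was longer. RT"

# gsub replaces every match
all_matches <- gsub(pattern = "o", replacement = "AAAA", x = s1)
print(all_matches)
## [1] "If AAAAnly Bradley's arm was lAAAAnger. RT"
```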

LOCATE PATTERNS. regexpr(pattern, text) goes through each element of the character vector text. It returns an integer vector of the same length, whose entries give the position of the first pattern match, or -1 if no match was found.

regex_object <- regexpr(pattern = "was",  text = my_string)
attr(regex_object, "match.length")

## [1]  3 -1

attr(regex_object, "useBytes")

## [1] TRUE

regexpr(pattern = "was", text = my_string)[1]

## [1] 23

regexpr(pattern = "was", text = my_string)[2]

## [1] -1
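The positions returned by regexpr can be turned back into text. The sketch below shows two ways, assuming the same two tweets: the base R helper regmatches, and substr with the start position and match.length computed by hand.

```r
my_string <- c("If only Bradley's arm was longer. RT",
               "Share our love in Indonesia and in the World. RT if you agree.")
match_info <- regexpr(pattern = "was", text = my_string)

# regmatches keeps only the elements that matched, so the result has length 1
matched_text <- regmatches(my_string, match_info)
print(matched_text)
## [1] "was"

# The same extraction by hand, from the start position and the match length
start <- match_info[1]
len <- attr(match_info, "match.length")[1]
by_hand <- substr(my_string[1], start, start + len - 1)
print(by_hand)
## [1] "was"
```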

Seems simple? The problem: the patterns can get pretty complex!

14.5.1 Character classes

Some symbols stand in for a whole class of characters rather than being taken literally.

[[:digit:]] Matches with all digits.

[[:lower:]] Matches with lower case letters.

[[:alpha:]] Matches with all alphabetic characters.

[[:punct:]] Matches with all punctuation characters.

[[:cntrl:]] Matches with "control" characters such as \n, \r, etc.

Example in R:

my_string <- "Do you think that 34% of apples are red?"
gsub(my_string, pattern = "[[:digit:]]", replace ="DIGIT")

## [1] "Do you think that DIGITDIGIT% of apples are red?"

gsub(my_string, pattern = "[[:alpha:]]", replace ="")

## [1] "    34%    ?"
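Two further variations on the same sentence, as a sketch. The first uses [[:punct:]]. The second adds the quantifier + (one or more of the preceding item), which is not covered above, so that a run of digits is treated as a single match.

```r
my_string <- "Do you think that 34% of apples are red?"

# Both % and ? belong to [[:punct:]], so both are removed
no_punct <- gsub(pattern = "[[:punct:]]", replacement = "", x = my_string)
print(no_punct)
## [1] "Do you think that 34 of apples are red"

# With +, the two digits are replaced together rather than one at a time
one_number <- gsub(pattern = "[[:digit:]]+", replacement = "NUMBER", x = my_string)
print(one_number)
## [1] "Do you think that NUMBER% of apples are red?"
```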

14.5.2 Special Characters.

Certain characters (such as ., *, \) have special meaning in regular expressions (they are used to form conditional patterns, as discussed below). Thus, when we want our pattern to include those characters literally, we must "escape" them by preceding them with \ or enclosing them in \Q...\E. Inside an R string the backslash itself must be escaped, so a literal * is written "\\*".

Example in R:

my_string <- "Do *really* think he will win?"
gsub(my_string, pattern = "\\*", replace ="")

## [1] "Do really think he will win?"

my_string <- "Now be brave! \n Dread what comrades say of you here in combat! "
gsub(my_string, pattern = "\\\n", replace ="")

## [1] "Now be brave!  Dread what comrades say of you here in combat! "
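To see why escaping matters, the sketch below runs the same replacement three ways on a made-up version string. The fixed = TRUE argument is a standard option of gsub and grepl that treats the pattern as plain text, which avoids escaping altogether.

```r
version <- "v1.2"  # hypothetical string containing a literal period

# Unescaped, . matches any single character, so every character is replaced
unescaped <- gsub(pattern = ".", replacement = "_", x = version)
print(unescaped)
## [1] "____"

# Escaped, \\. matches only a literal period
escaped <- gsub(pattern = "\\.", replacement = "_", x = version)
print(escaped)
## [1] "v1_2"

# fixed = TRUE switches off regular expressions, so no escaping is needed
fixed_match <- gsub(pattern = ".", replacement = "_", x = version, fixed = TRUE)
print(fixed_match)
## [1] "v1_2"
```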

14.5.3 Conditional patterns

[] The target characters to match are located between the brackets. For example, [aAbB] will match with the characters a, A, b, B.

[^...] Matches with everything except the material between the brackets. For example, [^aAbB] will match with everything but the characters a, A, b, B.

(?=) Lookahead -- match something that IS followed by the pattern.

(?!) Negative lookahead -- match something that is NOT followed by the pattern.

(?<=) Lookbehind -- match something that IS preceded by the pattern.

In R, the lookahead and lookbehind patterns only work when perl = TRUE is set.

my_string <- "Do you think that 34%of the 23%of apples are red?"
gsub(my_string, pattern = "(?<=%)", replace = " ", perl = TRUE)

## [1] "Do you think that 34% of the 23% of apples are red?"
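The bracket patterns and the positive lookahead listed above have no example yet, so here is a brief sketch of each. The sentences are invented for illustration.

```r
my_string <- "Do you think that 34% of apples are red?"

# [aeiou] matches any one of the listed characters
no_vowels <- gsub(pattern = "[aeiou]", replacement = "", x = my_string)
print(no_vowels)
## [1] "D y thnk tht 34% f ppls r rd?"

# [^[:digit:]%] matches every character that is neither a digit nor %
only_pct <- gsub(pattern = "[^[:digit:]%]", replacement = "", x = my_string)
print(only_pct)
## [1] "34%"

# (?=%) is a lookahead: only digits that are followed by % are matched
pct_only <- gsub(pattern = "[[:digit:]]+(?=%)", replacement = "NUMBER",
                 x = "34% of the 12 apples", perl = TRUE)
print(pct_only)
## [1] "NUMBER% of the 12 apples"
```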

my_string <- c("legislative1_term1.png", 
               "legislative1_term1.pdf",
               "legislative1_term2.png",
               "legislative1_term2.pdf",
               "term2_presidential1.png", 
               "presidential1.png", 
               "presidential1_term2.png",
               "presidential1_term1.pdf", 
               "presidential1_term2.pdf")

grepl(my_string, pattern = "^(?!presidential1).*\\.png", perl = TRUE)

## [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE

This indicates which file names don't start with presidential1 but do contain .png (in these file names, .png always falls at the end).
^ indicates that the pattern should start at the beginning of the string.
(?!presidential1) is a negative lookahead -- the match continues only if the characters that come next are NOT presidential1.
The first . indicates that, following the negative lookahead, there can be any characters, and the * says that it doesn't matter how many. Note that we have to escape the . in .png (by writing \\. instead of just .).
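The pattern above does not anchor the end of the string, so it would also accept a name that merely contains .png. The sketch below adds the $ anchor (end of string) to rule that out. The file name ending in .bak is hypothetical and was added only to show the difference.

```r
# Hypothetical file names: the last one contains .png but does not end in it
files <- c("legislative1_term1.png", "presidential1.png", "legislative1.png.bak")

# Without $, the pattern only requires .png somewhere after the start
loose <- grepl(pattern = "^(?!presidential1).*\\.png", x = files, perl = TRUE)
print(loose)
## [1]  TRUE FALSE  TRUE

# With $, the string must end immediately after .png
strict <- grepl(pattern = "^(?!presidential1).*\\.png$", x = files, perl = TRUE)
print(strict)
## [1]  TRUE FALSE FALSE
```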

You will have the chance to try out some regular expressions for yourself at the end!

14.6 Representing Text

In courses and research, we often want to analyze text, to extract meaning out of it. One of the key decisions we need to make is how to represent the text as numbers. Once the text is represented numerically, we can then apply a host of statistical and machine learning methods to it. Those methods are discussed more in the Gov methods sequence (Gov 2000-2003). Here's a summary of the decisions you must make:

WHICH TEXT TO USE? Which text do I want to analyze? What is my universe of documents?
HOW TO REPRESENT THE TEXT NUMERICALLY? How do I use numbers to represent different things about the text?
HOW TO ANALYZE THE NUMERICAL REPRESENTATION? How do I extract meaning out of the numerical representation?

Representing text numerically.

Document term matrix. The document term matrix (DTM) is a common method for representing text. The DTM is a matrix. Each row of this matrix corresponds to a document; each column corresponds to a word; each entry records how often that word appears in that document. It is often useful to look at summary statistics such as the percentage of speeches in which Democratic lawmakers used the word "inequality" compared to Republican lawmakers; the DTM would be very helpful for this and other tasks.

doc1 <- "Rage---Goddess, sing the rage of Peleus’ son Achilles,
         murderous, doomed, that cost the Achaeans countless losses,
         hurling down to the House of Death so many sturdy souls,
         great fighters’ souls."
doc2 <- "And fate? No one alive has ever escaped it,
         neither brave man nor coward, I tell you, 
         it's born with us the day that we are born."
doc3 <- "Many cities of men he saw and learned their minds,
         many pains he suffered, heartsick on the open sea,
         fighting to save his life and bring his comrades home."

DocVec <- c(doc1, doc2, doc3)

Now we can use utility functions in the tm package:

library(tm)
DocCorpus <- Corpus(VectorSource(DocVec) ) 
DTM1 <-  inspect( DocumentTermMatrix(DocCorpus) )
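To make the structure of a DTM concrete, the sketch below builds one by hand in base R. It is not how tm works internally, only an illustration of the row-per-document, column-per-word layout. The two short documents are made up, and splitting on single spaces is a simplifying assumption.

```r
# Two short made-up documents (not the passages above)
docs <- c(doc1 = "rage sing the rage", doc2 = "sing of the man")

# Split each document into words on single spaces
tokens <- strsplit(docs, split = " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per word, entries are word counts
dtm <- matrix(0L, nrow = length(docs), ncol = length(vocab),
              dimnames = list(names(docs), vocab))
for (i in seq_along(tokens)) {
  counts <- table(tokens[[i]])
  dtm[i, names(counts)] <- as.integer(counts)
}
print(dtm)

# "rage" appears twice in the first document
dtm["doc1", "rage"]
## [1] 2
```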

Consider the effect of different "pre-processing" choices on the resulting DTM!

DocVec <- tolower(DocVec)
DocVec <- gsub(DocVec, pattern ="[[:punct:]]", replace = " ")
DocVec <- gsub(DocVec, pattern ="[[:cntrl:]]", replace = " ")
DocCorpus <- Corpus(VectorSource(DocVec) ) 
DTM2 <-  inspect(DocumentTermMatrix(DocCorpus, 
                                    control = list(stopwords = TRUE,  stemming = TRUE)))

Stemming is the process of reducing inflected/derived words to their word stem or base (e.g. stemming, stemmed, stemmer --> stem*)

14.7 Important packages for parsing text

rvest -- Useful for downloading and manipulating HTML and XML.
tm -- Useful for converting text into a numerical representation (forming DTMs).
stringr -- Useful for string parsing.

Exercises

1

Figure out why this command does what it does:

sprintf("%s of spontaneous events are %s in the mind. 
        Really, %.2f?", 
        "15.03322123", "puzzles", 15.03322123)

## [1] "15.03322123 of spontaneous events are puzzles in the mind. \n        Really, 15.03?"

2

Why does this command not work?

try(sprintf("%s of spontaneous events are %s in the mind. Really, %.2f?",
            "15.03322123", "puzzles", "15.03322123" ), TRUE)

3

Using grepl, these materials, Google, and your friends, describe what the following command does. What changes when value = FALSE?

grep('\'', 
     c("To dare is to lose one's footing momentarily.",  "To not dare is to lose oneself."), value = TRUE)

## [1] "To dare is to lose one's footing momentarily."

4

Write code to automatically extract the file names that DO start with presidential and DO end in .pdf

my_string <- c("legislative1_term1.png", 
               "legislative1_term1.pdf",
               "legislative1_term2.png", 
               "legislative1_term2.pdf",
               "term2_presidential1.png", 
               "presidential1.png", 
               "presidential1_term2.png",
               "presidential1_term1.pdf",
               "presidential1_term2.pdf")

5

Using the same string as in the above, write code to automatically extract the file names that end in .pdf and that contain the text term2.

# Your code here

6

Combine these two strings into a single string separated by a "-". Desired output: "The carbonyl group in aldehydes and ketones is an oxygen analog of the carbon–carbon double bond."

string1 <- "The carbonyl group in aldehydes and ketones 
            is an oxygen analog of the carbon" 
string2 <-  "–carbon double bond."

7

Challenge problem! Download this webpage https://en.wikipedia.org/wiki/Odyssey

Read the html file into your R workspace.
Remove all of the html tags (you may need Google to help with this one).
Remove all punctuation.
Make all the characters lower case.
Do this same process with this webpage (https://en.wikipedia.org/wiki/Iliad).
Form a document term matrix from the two resulting text strings.

# Your code here

³⁵ Module originally written by Connor Jerzak.
³⁶ Only equal signs are allowed to define the values of a function's arguments.
