# my_results <- FunctionName(FunctionInputs)
14 Text
Where are we? Where are we headed?
Up till now, you should have covered:
- Loading in data;
- R notation;
- Matrix algebra.
14.1 Review
- `"` and `'` are usually equivalent.
- `<-` and `=` are usually interchangeable [1]. (`x <- 3` is equivalent to `x = 3`, although the former is preferred because it states the assignment explicitly.)
- Use `(` `)` when you are giving input to a function:
note `c(1,2,3)` is inputting three numbers in the function `c`
- Use `{` `}` when you are defining a function or writing a for loop:
# function
MyFunction <- function(InputMatrix) {
  TempMat <- InputMatrix
  # five times over, replace TempMat by its cross-product divided by 10
  for (i in 1:5) {
    TempMat <- t(TempMat) %*% TempMat / 10
  }
  return(TempMat)
}
myMat <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)
print(MyFunction(myMat))
[,1] [,2] [,3] [,4] [,5]
[1,] 432.1881 306.5901 513.59275 -651.5584 -486.13134
[2,] 306.5901 396.4736 441.54178 -570.0129 -529.78341
[3,] 513.5927 441.5418 1541.09392 822.1930 14.36814
[4,] -651.5584 -570.0129 822.19297 4055.4073 2074.23989
[5,] -486.1313 -529.7834 14.36814 2074.2399 1240.84829
# loop
x <- c()
for (i in 1:20) {
  x[i] <- i
}
print(x)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
14.2 Goals for today
Today, we will learn more about using text data. Our objectives are:
- To read and write text in R;
- To learn how to use paste and sprintf;
- To learn how to use regular expressions;
- To learn about other tools for representing and analyzing text in R.
14.3 Reading and writing text in R
- To read in a text file, use readLines:
readLines("~/Downloads/Carboxylic acid - Wikipedia.html")
- To write a text file, use:
write.table(my_string_vector, "~/mydata.txt", sep="\t")
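As a minimal round-trip sketch: the example below writes a character vector to disk and reads it back. The temporary path from tempfile() and the variable names are assumptions made for illustration. It uses writeLines, the base R counterpart to readLines, which writes one element per line without the quoting and row names that write.table adds by default.

```r
# Illustrative vector; any character vector works
my_string_vector <- c("Sing, goddess, the rage", "of Peleus' son Achilles")

# tempfile() returns a throwaway path, so nothing in your home folder is touched
tmp_path <- tempfile(fileext = ".txt")

# Write one element per line, then read the lines back in
writeLines(my_string_vector, tmp_path)
read_back <- readLines(tmp_path)

# TRUE if the text survived the round trip unchanged
print(identical(read_back, my_string_vector))
```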
14.4 paste() and sprintf()
paste and sprintf are useful commands in text processing, for example for automatically naming files or for automatically performing a series of commands over a subset of your data. Making tables will also often require these commands.
paste concatenates character vectors.
# use collapse for inputs of length > 1
my_string <- c("Not", "one", "could", "equal")
paste(my_string, collapse = " ")
[1] "Not one could equal"
# use sep for inputs of length == 1
paste("Not", "one", "could", "equal", sep = " ")
[1] "Not one could equal"
For more sophisticated concatenation, use sprintf. This is very useful for automatically making tables.
sprintf("Coefficient for %s: %.3f (%.2f)", "Gender", 1.52324, 0.03143)
[1] "Coefficient for Gender: 1.523 (0.03)"
# %s is replaced by a character string
# %.3f is replaced by a floating point number with 3 decimal places
# %.2f is replaced by a floating point number with 2 decimal places
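Both functions are vectorized, which is what makes them handy for naming files automatically. The sketch below is illustrative only: the file name stems and the suffix letters are made up, and paste0 is simply paste with sep = "".

```r
# paste0() is paste() with sep = ""; it recycles the fixed pieces over 1:3
file_names <- paste0("legislative", 1:3, ".png")

# %02d pads an integer with leading zeros to width 2; %s takes a string
padded_names <- sprintf("term%02d_%s.pdf", 1:3, c("a", "b", "c"))

print(file_names)
print(padded_names)
```

Zero-padded numbers keep file names in the right order when they are sorted alphabetically.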
14.5 Regular expressions
A regular expression is a special text string that describes a search pattern. Regular expressions are most often used in functions for detecting, locating, and replacing desired text in a corpus.
Use cases:
- TEXT PARSING. E.g. I have 10000 congressional speeches. Find all those which mention Iran.
- WEB SCRAPING. E.g. Parse html code in order to extract research information from an online table.
- CLEANING DATA. E.g. After loading in a dataset, we might need to remove mistakes from the dataset, or subset the data using regular expression tools.
Example in R. Extract the tweet mentioning Indonesia.
s1 <- "If only Bradley's arm was longer. RT"
s2 <- "Share our love in Indonesia and in the World. RT if you agree."
my_string <- c(s1, s2)
grepl(my_string, pattern = "Indonesia")
[1] FALSE TRUE
my_string[grepl(my_string, pattern = "Indonesia")]
[1] "Share our love in Indonesia and in the World. RT if you agree."
Key point: Many R commands use regular expressions. See ?grepl. Assume that x is a character vector and that pattern is the target pattern. In the earlier example, x could have been something like my_string and pattern would have been “Indonesia”. Here are other key uses:
- DETECT PATTERNS. grepl(pattern, x) goes through all the entries of x and returns a logical vector of TRUE and FALSE values of the same length as x. It returns TRUE whenever that entry contains the target pattern, and FALSE whenever it doesn’t.
- REPLACE PATTERNS. gsub(pattern, replacement, x) goes through all the entries of x and replaces every match of pattern with replacement.
gsub(
x = my_string,
pattern = "o",
replacement = "AAAA"
)
[1] "If AAAAnly Bradley's arm was lAAAAnger. RT"
[2] "Share AAAAur lAAAAve in IndAAAAnesia and in the WAAAArld. RT if yAAAAu agree."
- LOCATE PATTERNS. regexpr(pattern, text) goes through each element of the character vector. It returns an integer vector of the same length, where each entry gives the position of the first match in that element, or -1 if there is no match.
regex_object <- regexpr(pattern = "was", text = my_string)
attr(regex_object, "match.length")
[1] 3 -1
attr(regex_object, "useBytes")
[1] TRUE
regexpr(pattern = "was", text = my_string)[1]
[1] 23
regexpr(pattern = "was", text = my_string)[2]
[1] -1
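Once a match has been located, the matched text itself can be pulled out with regmatches, which takes the original vector and the object returned by regexpr. The sketch below reuses the two example tweets. The pattern "In[[:lower:]]+" (a capital "In" followed by lower case letters) is an illustrative choice, not part of the original example.

```r
my_string <- c(
  "If only Bradley's arm was longer. RT",
  "Share our love in Indonesia and in the World. RT if you agree."
)

# Position of the first match in each element (-1 where there is none)
match_info <- regexpr(pattern = "In[[:lower:]]+", text = my_string)

# regmatches() returns the matched text, skipping elements with no match
matched_text <- regmatches(my_string, match_info)

print(as.integer(match_info))
print(matched_text)
```

Note that regmatches drops the elements without a match, so its result can be shorter than the input vector.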
Seems simple? The problem: the patterns can get pretty complex!
14.5.1 Character classes
Some symbols stand in for a whole class of characters, rather than being taken literally.
- [[:digit:]] matches all digits.
- [[:lower:]] matches lower case letters.
- [[:alpha:]] matches all alphabetic characters.
- [[:punct:]] matches all punctuation characters.
- [[:cntrl:]] matches “control” characters such as \n, \r, etc.
Example in R:
my_string <- "Do you think that 34% of apples are red?"
gsub(my_string, pattern = "[[:digit:]]", replace = "DIGIT")
[1] "Do you think that DIGITDIGIT% of apples are red?"
gsub(my_string, pattern = "[[:alpha:]]", replace = "")
[1] " 34% ?"
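The other character classes work the same way. As a small sketch on the same sentence (the variable names no_punct and n_digits are illustrative), the code below strips punctuation with [[:punct:]] and counts digits by collecting every [[:digit:]] match with gregexpr, which finds all matches rather than only the first.

```r
my_string <- "Do you think that 34% of apples are red?"

# Remove every punctuation character (here the % and the ?)
no_punct <- gsub(pattern = "[[:punct:]]", replacement = "", x = my_string)

# gregexpr() locates all matches; lengths() counts them per element
n_digits <- lengths(regmatches(my_string, gregexpr("[[:digit:]]", my_string)))

print(no_punct)
print(n_digits)
```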
14.5.2 Special characters
Certain characters (such as ., *, \) have special meaning in the regular expressions framework (they are used to form conditional patterns as discussed below). Thus, when we want our pattern to explicitly include those characters as characters, we must “escape” them by using \ or encoding them in \Q…\E.
Example in R:
my_string <- "Do *really* think he will win?"
gsub(my_string, pattern = "\\*", replace = "")
[1] "Do really think he will win?"
my_string <- "Now be brave! \n Dread what comrades say of you here in combat! "
gsub(my_string, pattern = "\\\n", replace = "")
[1] "Now be brave! Dread what comrades say of you here in combat! "
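To see why escaping matters, compare three ways of replacing the periods in a short string. This is a sketch with a made-up input. The third call uses the fixed = TRUE argument of gsub, which treats the pattern as literal text and so is an alternative to escaping when no regular expression features are needed.

```r
dotted <- "a.b.c"

# Unescaped: . matches ANY character, so every character is replaced
unescaped <- gsub(pattern = ".", replacement = "-", x = dotted)

# Escaped: \\. matches only a literal period
escaped <- gsub(pattern = "\\.", replacement = "-", x = dotted)

# fixed = TRUE: the pattern is taken literally, no escaping required
literal <- gsub(pattern = ".", replacement = "-", x = dotted, fixed = TRUE)

print(c(unescaped, escaped, literal))
```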
14.5.3 Conditional patterns
- [] The target characters to match are located between the brackets. For example, [aAbB] will match any one of the characters a, A, b, B.
- [^...] Matches everything except the characters between the brackets. For example, [^aAbB] will match everything but the characters a, A, b, B.
- (?=) Lookahead: match something that IS followed by the pattern.
- (?!) Negative lookahead: match something that is NOT followed by the pattern.
- (?<=) Lookbehind: match something that is preceded by the pattern.
In R, the lookahead and lookbehind patterns require perl = TRUE.
my_string <- "Do you think that 34%of the 23%of apples are red?"
gsub(my_string, pattern = "(?<=%)", replace = " ", perl = TRUE)
[1] "Do you think that 34% of the 23% of apples are red?"
my_string <- c(
"legislative1_term1.png",
"legislative1_term1.pdf",
"legislative1_term2.png",
"legislative1_term2.pdf",
"term2_presidential1.png",
"presidential1.png",
"presidential1_term2.png",
"presidential1_term1.pdf",
"presidential1_term2.pdf"
)
grepl(my_string, pattern = "^(?!presidential1).*\\.png", perl = TRUE)
[1] TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
- Indicates which file names don’t start with presidential1 but do end in .png.
- `^` indicates that the pattern should start at the beginning of the string.
- `?!` indicates negative lookahead: we’re looking for any string that does NOT begin with presidential1 and that meets the subsequent conditions (explained in the next point).
- The first `.` indicates that, following the negative lookahead, there can be any character, and the `*` says that it doesn’t matter how many. Note that we have to escape the `.` in .png (by writing `\\.` instead of just `.`).
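The positive lookahead and the negated bracket can be sketched in the same way. The sentence below is a made-up variation on the earlier example, and the variable names are illustrative. The first pattern keeps only digits that are followed by a percent sign, and the second deletes everything that is not a digit.

```r
my_string <- "Do you think that 34% of the 23 apples are red?"

# (?=%) is a lookahead: the digits must be followed by %, but % is not part of the match
percent_values <- regmatches(
  my_string,
  gregexpr("[[:digit:]]+(?=%)", my_string, perl = TRUE)
)[[1]]

# [^[:digit:]] matches any character that is NOT a digit
digits_only <- gsub(pattern = "[^[:digit:]]", replacement = "", x = my_string)

print(percent_values)
print(digits_only)
```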
You will have the chance to try out some regular expressions for yourself at the end!
14.6 Representing Text
In courses and research, we often want to analyze text, to extract meaning out of it. One of the key decisions we need to make is how to represent the text as numbers. Once the text is represented numerically, we can then apply a host of statistical and machine learning methods to it. Those methods are discussed more in the Gov methods sequence (Gov 2000-2003). Here’s a summary of the decisions you must make:
- WHICH TEXT TO USE? Which text do I want to analyze? What is my universe of documents?
- HOW TO REPRESENT THE TEXT NUMERICALLY? How do I use numbers to represent different things about the text?
- HOW TO ANALYZE THE NUMERICAL REPRESENTATION? How do I extract meaning out of the numerical representation?
Representing text numerically.
- Document term matrix. The document term matrix (DTM) is a common method for representing text. The DTM is a matrix. Each row of this matrix corresponds to a document; each column corresponds to a word; each cell counts how often that word appears in that document. It is often useful to look at summary statistics such as the percentage of speeches in which a Democratic lawmaker used the word “inequality” compared to a Republican; the DTM would be very helpful for this and other tasks.
doc1 <- "Rage---Goddess, sing the rage of Peleus’ son Achilles,
murderous, doomed, that cost the Achaeans countless losses,
hurling down to the House of Death so many sturdy souls,
great fighters’ souls."
doc2 <- "And fate? No one alive has ever escaped it,
neither brave man nor coward, I tell you,
it's born with us the day that we are born."
doc3 <- "Many cities of men he saw and learned their minds,
many pains he suffered, heartsick on the open sea,
fighting to save his life and bring his comrades home."
DocVec <- c(doc1, doc2, doc3)
Now we can use utility functions in the tm package:
library(tm)
DocCorpus <- Corpus(VectorSource(DocVec))
DTM1 <- inspect(DocumentTermMatrix(DocCorpus))
Consider the effect of different “pre-processing” choices on the resulting DTM!
DocVec <- tolower(DocVec)
DocVec <- gsub(DocVec, pattern = "[[:punct:]]", replace = " ")
DocVec <- gsub(DocVec, pattern = "[[:cntrl:]]", replace = " ")
DocCorpus <- Corpus(VectorSource(DocVec))
DTM2 <- inspect(DocumentTermMatrix(DocCorpus,
  control = list(stopwords = TRUE, stemming = TRUE)
))
Stemming is the process of reducing inflected/derived words to their word stem or base (e.g. stemming, stemmed, stemmer -> stem*).
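To make the structure of a DTM concrete, here is a base R sketch that builds one by hand, without the tm package. It is not how tm works internally. The three short documents are already lower-cased and stripped of punctuation, and all names (docs, tokens, vocab, count_terms) are illustrative assumptions.

```r
# Three tiny, already pre-processed documents (illustrative)
docs <- c(
  doc1 = "sing the rage of achilles the rage",
  doc2 = "no one alive has ever escaped it",
  doc3 = "many cities of men he saw"
)

# Split each document into words, then collect the sorted vocabulary
tokens <- strsplit(docs, split = " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word occurs in one document
count_terms <- function(doc_tokens) {
  as.integer(table(factor(doc_tokens, levels = vocab)))
}

# One row per document, one column per word
dtm <- t(vapply(tokens, count_terms, integer(length(vocab))))
rownames(dtm) <- names(docs)
colnames(dtm) <- vocab

print(dtm[, c("of", "rage", "the")])
```

Each pre-processing choice (lower-casing, removing punctuation, dropping stop words, stemming) changes the vocabulary and therefore the columns of this matrix.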
14.7 Important packages for parsing text
- rvest – Useful for downloading and manipulating HTML and XML.
- tm – Useful for converting text into a numerical representation (forming DTMs).
- stringr – Useful for string parsing.
Exercises
1
Figure out why this command does what it does:
15.03322123 of spontaneous events are puzzles in the mind. Really, 15.03?
2
Why does this command not work?
try(sprintf(
  "%s of spontaneous events are %s in the mind. Really, %.2f?",
  "15.03322123", "puzzles", "15.03322123"
), TRUE)
3
Using grepl, these materials, Google, and your friends, describe what the following command does. What changes when value = FALSE?
grep("'",
c("To dare is to lose one's footing momentarily.", "To not dare is to lose oneself."),
value = TRUE
)
[1] "To dare is to lose one's footing momentarily."
4
Write code to automatically extract the file names that DO start with presidential and DO end in .pdf
my_string <- c(
"legislative1_term1.png",
"legislative1_term1.pdf",
"legislative1_term2.png",
"legislative1_term2.pdf",
"term2_presidential1.png",
"presidential1.png",
"presidential1_term2.png",
"presidential1_term1.pdf",
"presidential1_term2.pdf"
)
5
Using the same string as in the above, write code to automatically extract the file names that end in .pdf and that contain the text term2.
# Your code here
6
Combine these two strings into a single string separated by a “-”. Desired output: “The carbonyl group in aldehydes and ketones is an oxygen analog of the carbon–carbon double bond.”
string1 <- "The carbonyl group in aldehydes and ketones
is an oxygen analog of the carbon"
string2 <- "–carbon double bond."
7
Challenge problem! Download this webpage https://en.wikipedia.org/wiki/Odyssey
- Read the html file into your R workspace.
- Remove all of the HTML tags (you may need Google to help with this one).
- Remove all punctuation.
- Make all the characters lower case.
- Do this same process with this webpage (https://en.wikipedia.org/wiki/Iliad).
- Form a document term matrix from the two resulting text strings.
# Your code here
[1] Only equal signs are allowed to define the values of a function’s arguments.