You will inevitably encounter situations where you need to process strings. Text processing is a large and complex field, but a handful of functions will suffice for most simple tasks.
When you work with text in computer programs, it is called ‘string processing’ because the computer does not know anything about words or concepts, so it treats text as strings of characters.
String - anything composed of characters, or of characters and numbers.
Text - the full document, sometimes called a corpus.
Word - text surrounded by spaces, sometimes called tokens.
We can vectorize text by breaking it into sentences, words, letters, etc.
## [1] "This is a string." "These" "words"
## [4] "are" "also" "strings."
# putting text together
paste( "This is a string.", "These", "words","are", "also", "strings.", sep=" ")
## [1] "This is a string. These words are also strings."
## [1] "This" "is" "a" "string." "These" "words" "are"
## [8] "also" "strings."
## [1] "T" "h" "i" "s" " " "i" "s" " " "a" " " "s" "t" "r" "i" "n" "g" "." "T" "h"
## [20] "e" "s" "e" "w" "o" "r" "d" "s" "a" "r" "e" "a" "l" "s" "o" "s" "t" "r" "i"
## [39] "n" "g" "s" "."
There are a handful of functions that you will use to work with strings. These functions find specific words or characters in your data, find parts of words, and replace them with other words or characters. There are also some functions to break text apart, put text together, or format it.
Function | Use |
---|---|
grep() | Find a word or phrase (returns the positions of the matching strings, or the strings themselves with value=TRUE). |
grepl() | Find a word or phrase (returns a logical vector). |
regexpr() | Find a part of a word or phrase - very flexible. |
agrep() | Find an approximate match. |
sub() | Replace the first occurrence of a word or phrase. |
gsub() | Replace ALL occurrences of a word or phrase. |
paste() | Combine multiple strings into a single string. |
strsplit() | Split one string into multiple strings. |
substr() | Extract part of a string. |
Let’s look at some examples of these functions in action.
We often need to combine several pieces of text into one string, called concatenation. R’s function for this is paste().
## [1] "My name is mud."
## [1] "My name is mud."
## [1] "My name is Larry" "My name is Moe" "My name is Curly"
## [1] "x1" "x2" "x3"
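As a minimal sketch of how paste() behaves, the calls below show the sep and collapse arguments and the paste0() shorthand. The variable names are illustrative assumptions rather than the original example code; the values are borrowed from the output above.

```r
# Illustrative vector (variable names are assumptions, not from the original text)
first.names <- c("Larry", "Moe", "Curly")

# paste() is vectorized: the fixed piece is recycled for each name,
# and the default separator is a single space
greeting <- paste("My name is", first.names)

# paste0() is shorthand for paste(..., sep = "")
var.names <- paste0("x", 1:3)

# collapse merges a whole vector into a single string
one.string <- paste(first.names, collapse = ", ")

greeting    # "My name is Larry" "My name is Moe" "My name is Curly"
var.names   # "x1" "x2" "x3"
one.string  # "Larry, Moe, Curly"
```

Note the difference between sep, which goes between the arguments, and collapse, which goes between the elements of the result.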
Need to sort a column of text by the length of words? You count characters with the function nchar():
## [1] 5 13
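A short sketch of the sorting idea mentioned above, assuming a made-up vector of words: nchar() returns one count per element, and order() turns those counts into a sort order.

```r
# Hypothetical example data
words <- c("banana", "fig", "cherry", "kiwi")

# one character count per element
word.length <- nchar(words)

# reorder the words from shortest to longest
sorted.words <- words[order(word.length)]

word.length    # 6 3 6 4
sorted.words   # "fig" "kiwi" "banana" "cherry"
```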
Counting words is a little more complicated, since a body of text is often stored as a single character string.
## [1] 5
## [1] 30
We can split text using the string split function strsplit(). We just need to tell it the delimiter, which is a single space in this case.
## [[1]]
## [1] "This" "is" "all" "one" "piece" "of" "text."
## [1] 7
If we want to split the text into individual characters, we give it an empty string as the delimiter:
## [[1]]
## [1] "a" "b" "c"
## [1] 30
## [1] 30
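The two kinds of split can be sketched as follows. The sentence is inferred from the output above and the variable names are assumptions. Note that strsplit() returns a list, so unlist() is used to get back a plain character vector.

```r
my.text <- "This is all one piece of text."

# split on single spaces; strsplit() returns a list with one element per input string
word.list <- strsplit(my.text, split = " ")
words <- unlist(word.list)
n.words <- length(words)

# split on the empty string to get individual characters
chars <- unlist(strsplit(my.text, split = ""))
n.chars <- length(chars)

n.words          # 7
n.chars          # 30
nchar(my.text)   # 30, the same count without splitting
```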
Recall that the census downloads contain a field called GEO.id, which consists of several FIPS codes pasted together. If we inspect this ID we can see that the county FIPS code (the one we often use for merges) is included as the last five digits. How can we use this variable to extract the county FIPS codes?
The function substr() takes a character vector as its argument and returns the substring specified by the start and stop positions.
## [1] "ick"
GEO.id <- c("0500000US01001","0500000US01003","0500000US01005")
substr( GEO.id, start=10, stop=14 ) # returns county fips codes only (characters 10 to 14)
## [1] "01001" "01003" "01005"
## [1] "0222000US01001" "0222000US01003" "0222000US01005"
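Two further uses of substr() may be worth sketching. The first computes the positions from nchar() instead of hard-coding them, and the second uses the assignment form to overwrite part of each string. The names county.fips and new.id are illustrative assumptions.

```r
GEO.id <- c("0500000US01001", "0500000US01003", "0500000US01005")

# take the last five characters, whatever the total length is
n <- nchar(GEO.id)
county.fips <- substr(GEO.id, start = n - 4, stop = n)

# substr() can also be used on the left side of an assignment
# to replace characters in place (here positions 2 through 4)
new.id <- GEO.id
substr(new.id, start = 2, stop = 4) <- "222"

county.fips   # "01001" "01003" "01005"
new.id        # "0222000US01001" "0222000US01003" "0222000US01005"
```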
If we want to search text for a keyword we use grep().
In case you are curious about what ‘grep’ means, it is a term inherited from Unix operating systems.
GREP (g/re/p): Globally search for a Regular Expression and Print
my.text <- c("micky","minnie","goofy","pluto")
grep( pattern="goofy", my.text ) # correctly returns the position of the third element
## [1] 3
## integer(0)
## [1] 3
# returns the position of each element that contains the matching text
grep( "new", c("california","new york","new jersey","tennessee") )
## [1] 2 3
# perhaps we want to see the matching elements themselves
grep( "new", c("california","new york","new jersey","tennessee"), value=TRUE )
## [1] "new york" "new jersey"
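The logical version, grepl(), appears in the function table but not in the examples. A minimal sketch, reusing the state names from above (the variable names are assumptions):

```r
states <- c("california", "new york", "new jersey", "tennessee")

# grepl() returns TRUE or FALSE for every element, which is handy for subsetting
is.new <- grepl("new", states)
new.states <- states[is.new]

# logical vectors can be summed to count matches
n.new <- sum(is.new)

# ignore.case = TRUE makes the search case-insensitive
is.new.any.case <- grepl("NEW", states, ignore.case = TRUE)

is.new       # FALSE TRUE TRUE FALSE
new.states   # "new york" "new jersey"
n.new        # 2
```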
Find and replace the first case in a string with sub() or all cases with gsub():
## [1] "We are traveling from Old York to New Jersey"
## [1] "california" "old york" "old jersey" "tennessee"
## [1] "We are tpartyling from New York to New Jersey"
## [1] "We are traveling from Old York to Old Jersey"
## [1] "california" "old york" "old jersey" "tennessee"
## [1] ".Hello there?"
## [1] "Hello there."
## [1] "Hello.There"
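The difference between the two functions can be sketched as below. The sentence is taken from the output above; the calls themselves are an illustration and not necessarily the original code. The last three lines show a common surprise: a period in the pattern is treated as a wildcard unless it is escaped or fixed=TRUE is set.

```r
trip <- "We are traveling from New York to New Jersey"

first.only <- sub("New", "Old", trip)    # replaces the first match only
all.cases  <- gsub("New", "Old", trip)   # replaces every match

greeting <- "Hello there."
wildcard <- sub(".", "?", greeting)                # "." matches the first character
escaped  <- sub("\\.", "?", greeting)              # "\\." matches a literal period
fixed    <- sub(".", "?", greeting, fixed = TRUE)  # pattern treated as plain text

first.only   # "We are traveling from Old York to New Jersey"
all.cases    # "We are traveling from Old York to Old Jersey"
wildcard     # "?ello there."
escaped      # "Hello there?"
fixed        # "Hello there?"
```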
We often need to search large bodies of text for patterns.
Regular expressions are a stylized syntax used to query bodies of text and return very specific results. They use symbols that match groups of characters, as well as expressions that refer to locations within strings (a pattern at the beginning of a word or the end of a sentence, for example).
Note that this section borrows heavily from Gloria Li and Jenny Bryan. Thank you for the clear examples provided at:
https://stat545-ubc.github.io/block022_regular-expression.html
Recall that logical operators are symbols that allow us to translate nuanced questions into computer code. For example, how many left-handed batters have been inducted into the Baseball Hall of Fame?
Similarly, regular expression operators allow us to create complex search terms.
Instead of saying, search for the word “cat” in the text, we might want to say, search for the word “cat”, but only at the beginning of sentences, and do not return instances like “catch” that contain “cat”.
In order to specify these searches, we need a more flexible language. Regular expressions give us this.
Each of these symbols functions as an operator in the regular expressions framework:
$ * + . ? [ ] ^ { } | ( )
Here are the uses of some of these:
Operator | Use |
---|---|
. | matches any single character (wild card for single character) |
* | matches the preceding character 0 or more times (combine with . as .* to match any number of any characters) |
^ | start of a string |
$ | end of a string |
? | matches when the preceding character appears 0 or 1 times |
+ | matches when the preceding character appears 1 or more times |
| | OR statement - match either statement given |
[ ] | OR statement - match any of the characters given |
[^ ] | match any characters EXCEPT those given in the list |
\ | escape character - turns an operator into plain text |
strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12", "ab$")
# match anything that contains ab followed by any single character
grep("ab.", strings, value = TRUE)
## [1] "abc" "abd" "abe" "ab 12" "ab$"
## [1] "abc" "abd"
## [1] "abc" "abd" "abe"
## [1] "abd" "abe" "ab 12" "ab$"
## [1] "ab" "abc" "abd" "abe" "ab 12" "ab$"
## [1] "^ab" "ab"
## [1] "^ab" "ab" "abc" "abd" "abe" "ab 12" "ab$"
## [1] "^ab"
## [1] "^ab" "ab" "abc" "abd" "abe" "ab 12" "ab$"
## [1] "ab$"
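A few more operators can be tried on the same vector. The patterns below are an illustrative sketch (they are assumptions, not necessarily the calls that produced the output above), covering character sets, negated sets, the OR operator, and the two anchors.

```r
strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12", "ab$")

set.cd    <- grep("ab[cd]",  strings, value = TRUE)  # ab followed by c OR d
range.ce  <- grep("ab[c-e]", strings, value = TRUE)  # ab followed by c, d, or e
not.c     <- grep("ab[^c]",  strings, value = TRUE)  # ab followed by anything EXCEPT c
either    <- grep("abc|abd", strings, value = TRUE)  # abc OR abd
starts.ab <- grep("^ab",     strings, value = TRUE)  # strings that start with ab
ends.ab   <- grep("ab$",     strings, value = TRUE)  # strings that end with ab

set.cd      # "abc" "abd"
range.ce    # "abc" "abd" "abe"
not.c       # "abd" "abe" "ab 12" "ab$"
starts.ab   # "ab" "abc" "abd" "abe" "ab 12" "ab$"
ends.ab     # "^ab" "ab"
```

Note that the unescaped ^ and $ act as anchors here, so the literal "^ab" does not count as starting with ab.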
If we want to search for one of these special operators in our text, we need to tell R that we are looking for the operator, and not trying to use a regular expression statement. We accomplish this with an escape sequence.
Create an escape sequence by placing a double backslash “\\” in front of a special operator. Characters such as a quote, a newline, or a tab have their own escape sequences in the text itself (\", \n, and \t).
## [1] "Here is a long string\n of text that contains \n some breaks."
## [1] 79
## [[1]]
## [1] 22 56
## attr(,"match.length")
## [1] 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## [[1]]
## [1] 5 8 10 15 36 41 46 55 72
## attr(,"match.length")
## [1] 1 1 1 1 1 1 1 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## [[1]]
## [1] 34
## attr(,"match.length")
## [1] 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## [[1]]
## [1] 12 48 69
## attr(,"match.length")
## [1] 1 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
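Escaping can be sketched with the vector of strings used earlier. The short text containing line breaks is a hypothetical example, not the text used for the output above. Setting fixed=TRUE is an alternative to escaping: it tells R to treat the whole pattern as plain text.

```r
strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12", "ab$")

# unescaped, "$" is the end-of-string anchor and matches every element
n.anchor <- length(grep("$", strings))

# escaped, the same symbols are treated as ordinary characters
caret  <- grep("\\^", strings, value = TRUE)
dollar <- grep("\\$", strings, value = TRUE)

# fixed = TRUE switches regular expressions off entirely
dollar.fixed <- grep("$", strings, fixed = TRUE, value = TRUE)

# gregexpr() reports the position of every match, here each line break
short.text <- "one\ntwo\nthree"   # hypothetical text
breaks <- as.integer(gregexpr("\n", short.text, fixed = TRUE)[[1]])

n.anchor   # 7
caret      # "^ab"
dollar     # "ab$"
breaks     # 4 8
```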
The regexpr() and gregexpr() functions are odd because they return a character position instead of an element from the character vector. These start and stop positions are used to extract pieces of text from the whole body of text.
## [1] 3
## attr(,"match.length")
## [1] 5
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
start.pos <- regexpr( "c.*g", "abcdefghi" )
# subtract 1 so that stop.pos points at the last matched character
stop.pos <- start.pos + attr( start.pos, "match.length" ) - 1
substr( "abcdefghi", start=start.pos, stop=stop.pos )
## [1] "cdefg"
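Base R also provides regmatches(), which takes the result of regexpr() and does the position arithmetic for us. A minimal sketch on the same string (the variable names are assumptions):

```r
x <- "abcdefghi"

# regexpr() gives the start position, with the match length as an attribute
m <- regexpr("c.*g", x)
start.at  <- as.integer(m)
n.matched <- attr(m, "match.length")

# regmatches() extracts the matched text directly
matched <- regmatches(x, m)

start.at    # 3
n.matched   # 5
matched     # "cdefg"
```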
The quantifiers allow us to specify the number of times a character is repeated.
Operator | Use |
---|---|
* | matches at least 0 times. |
. | matches any single character, exactly one time. |
+ | matches at least 1 time. |
? | matches at most 1 time. |
{n} | matches exactly n times. |
{n,} | matches at least n times. |
{n,m} | matches between n and m times. |
strings <- c("ht","hot","hoot","hooot")
# match the letter o at least zero times
grep("ho*t", strings, value = TRUE)
## [1] "ht" "hot" "hoot" "hooot"
## [1] "hot"
## [1] "hot" "hoot" "hooot"
## [1] "ht" "hot"
## [1] "hoot"
## [1] "hoot" "hooot"
## [1] "hot" "hoot"
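The remaining quantifiers can be sketched on the same vector. Each quantifier applies to the character immediately before it, here the letter o. The patterns are illustrative assumptions, not necessarily the original calls.

```r
strings <- c("ht", "hot", "hoot", "hooot")

one.or.more <- grep("ho+t",     strings, value = TRUE)  # at least one o
zero.or.one <- grep("ho?t",     strings, value = TRUE)  # at most one o
exactly.two <- grep("ho{2}t",   strings, value = TRUE)  # exactly two
two.or.more <- grep("ho{2,}t",  strings, value = TRUE)  # two or more
one.to.two  <- grep("ho{1,2}t", strings, value = TRUE)  # between one and two

one.or.more   # "hot" "hoot" "hooot"
zero.or.one   # "ht" "hot"
exactly.two   # "hoot"
two.or.more   # "hoot" "hooot"
one.to.two    # "hot" "hoot"
```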
Position operators specify whether the characters occur at the beginning, middle, or end of a word or phrase.
Note that “a dog” is a STRING that contains two WORDS for the definitions below.
Operator | Use |
---|---|
^ | matches the start of the STRING. |
$ | matches the end of the STRING. |
\\b | matches the empty string at either edge of a WORD. |
\\B | matches the string provided it is NOT at an edge of a word. |
strings <- c("abcd", "cdab", "cabd", "c abd")
# anywhere in the text
grep("ab", strings, value = TRUE)
## [1] "abcd" "cdab" "cabd" "c abd"
## [1] "abcd"
## [1] "cdab"
## [1] "abcd" "c abd"
## [1] "cdab" "cabd"
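The four position operators from the table can be sketched on the same vector as follows. The patterns are an illustrative assumption about how such results can be obtained.

```r
strings <- c("abcd", "cdab", "cabd", "c abd")

at.start    <- grep("^ab",   strings, value = TRUE)  # start of the STRING
at.end      <- grep("ab$",   strings, value = TRUE)  # end of the STRING
word.edge   <- grep("\\bab", strings, value = TRUE)  # ab at the start of a WORD
inside.word <- grep("\\Bab", strings, value = TRUE)  # ab NOT at the start of a word

at.start      # "abcd"
at.end        # "cdab"
word.edge     # "abcd" "c abd"
inside.word   # "cdab" "cabd"
```

The string "c abd" shows the difference between ^ and \\b: ab begins a word there, but not the string.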
## [1] 5
## attr(,"match.length")
## [1] 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## [1] 3
## [1] 1 3
## [1] 4
## [1] 4
## integer(0)
## [1] 1 2
# FormA OR FormB OR FormC
my.text <- c( "FormA", "FormC", "FormE" )
grep( pattern="Form[ABC]", my.text )
## [1] 1 2
## [1] 1 3
# replace land with LAND in all country names
gsub( "land", "LAND", c("finland", "iceland", "michael landon") )
## [1] "finLAND" "iceLAND" "michael LANDon"
# need to anchor the word to the end
gsub( "land$", "LAND", c("finland", "iceland", "michael landon") )
## [1] "finLAND" "iceLAND" "michael landon"
R has a special class of text elements for dates. This class translates letters and numbers into calendar dates, and it knows how to translate these elements easily between days and years.
You would use this class in order to re-cast characters from a database into calendar dates, or to calculate time between events.
## [1] "Mon Feb 03 20:51:53 2020"
Perhaps we are running simulations and need to print output to a file in a way that we can generate random names for the files but still keep track of the order. We can create filenames using dates:
## [1] "Mon Feb 03 20:51:53 2020.pdf"
That’s a complicated title. Perhaps we want a simple representation of the full date. We can format a date object using some simple commands. For a full list see strptime().
## [1] "2020-02-03 20:51:53 MST"
## [1] "Mon Feb 03 2020"
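A minimal sketch of formatting a time stamp for use in a file name. A fixed time stamp (taken from the output above) is used so that the result is reproducible; in practice Sys.time() would supply the current time. The "sim-" prefix and the variable names are hypothetical.

```r
# fixed time stamp for reproducibility; use Sys.time() for the current time
stamp <- as.POSIXct("2020-02-03 20:51:53", tz = "UTC")

# format codes: %Y year, %m month, %d day, %H hour, %M minute, %S second
day.only <- format(stamp, "%Y-%m-%d")
compact  <- format(stamp, "%Y%m%d-%H%M%S")

# numeric year-month-day names sort in chronological order
# and avoid spaces and colons in file names
file.name <- paste0("sim-", compact, ".pdf")

day.only    # "2020-02-03"
file.name   # "sim-20200203-205153.pdf"
```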
Suppose you want to calculate the time between two dates in your data set:
start.date <- c("2011/06/13","2011/07/25","2011/05/24")
end.date <- c("2012/01/01","2012/01/01","2012/03/19")
start.date
## [1] "2011/06/13" "2011/07/25" "2011/05/24"
## [1] "character"
You will notice that our dates were read in as characters, so we first need to translate them to the date class in order to make any meaningful comparisons between them. So we cast them as dates.
## [1] "2012-01-01" "2012-01-01" "2012-03-19"
## Time differences in days
## [1] 202 160 300
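The conversion and subtraction can be sketched as below, using the dates from the example. The new variable names are assumptions. Subtracting two Date vectors returns a difftime object, and as.numeric() strips it down to plain numbers.

```r
start.date <- c("2011/06/13", "2011/07/25", "2011/05/24")
end.date   <- c("2012/01/01", "2012/01/01", "2012/03/19")

# cast the character vectors as dates
start.as.date <- as.Date(start.date)
end.as.date   <- as.Date(end.date)

# subtraction gives the time difference in days
elapsed <- end.as.date - start.as.date
elapsed.days <- as.numeric(elapsed)

# difftime() allows other units
elapsed.weeks <- difftime(end.as.date, start.as.date, units = "weeks")

class(start.as.date)   # "Date"
elapsed.days           # 202 160 300
```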
It worked! Let’s be a little more careful, though, about how we are conducting the translation to make sure we are not introducing any errors. We can explicitly specify the format of the dates to ensure they are interpreted correctly:
## [1] "2011-06-13" "2011-07-25" "2011-05-24"
That works correctly. What if we mix up days and months, though? (European and American dates often order days and months differently.)
## [1] NA NA NA
## [1] "2004-06-30"
## [1] NA
At least R is smart enough to know there are no months higher than 12 and only 30 days in June, and no recycling here!
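This behavior can be sketched with a few single dates. The inputs are hypothetical examples chosen to match the situations described above: the format argument tells as.Date() which piece is the month and which is the day, and impossible dates come back as NA instead of being silently rearranged.

```r
# American ordering: month/day/year
us.date <- as.Date("06/30/2004", format = "%m/%d/%Y")

# a European-style date read with the American format: there is no month 30
wrong.order <- as.Date("30/06/2004", format = "%m/%d/%Y")

# June has only 30 days
no.such.day <- as.Date("06/31/2004", format = "%m/%d/%Y")

# the European date read with the matching format
eu.date <- as.Date("30/06/2004", format = "%d/%m/%Y")

us.date       # "2004-06-30"
wrong.order   # NA
no.such.day   # NA
eu.date       # "2004-06-30"
```

Checking for NA values with is.na() after a conversion is a cheap way to catch a mismatched format.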
We can use the sequence function to generate lists of dates as long as the arguments are dates.
a <- as.Date("2010/01/01")
b <- as.Date("2010/02/01")
c <- as.Date("2011/01/15")
seq( from=a, to=b, by=1 ) # sequence of days
## [1] "2010-01-01" "2010-01-02" "2010-01-03" "2010-01-04" "2010-01-05"
## [6] "2010-01-06" "2010-01-07" "2010-01-08" "2010-01-09" "2010-01-10"
## [11] "2010-01-11" "2010-01-12" "2010-01-13" "2010-01-14" "2010-01-15"
## [16] "2010-01-16" "2010-01-17" "2010-01-18" "2010-01-19" "2010-01-20"
## [21] "2010-01-21" "2010-01-22" "2010-01-23" "2010-01-24" "2010-01-25"
## [26] "2010-01-26" "2010-01-27" "2010-01-28" "2010-01-29" "2010-01-30"
## [31] "2010-01-31" "2010-02-01"
## [1] "2010-01-01" "2010-01-08" "2010-01-15" "2010-01-22" "2010-01-29"
## [1] "2010-01-01" "2010-01-08" "2010-01-15" "2010-01-22" "2010-01-29"
## [1] "2010-01-01" "2010-02-01" "2010-03-01" "2010-04-01" "2010-05-01"
## [6] "2010-06-01" "2010-07-01" "2010-08-01" "2010-09-01" "2010-10-01"
## [11] "2010-11-01" "2010-12-01" "2011-01-01"
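Weekly and monthly sequences can be sketched as follows. The dates are the ones defined above, but the third date is stored as last.day here (an assumed name) to avoid masking R's c() function.

```r
a <- as.Date("2010/01/01")
b <- as.Date("2010/02/01")
last.day <- as.Date("2011/01/15")

# by can be a number of days or a unit such as "week" or "month"
weekly.num <- seq(from = a, to = b, by = 7)
weekly     <- seq(from = a, to = b, by = "week")
monthly    <- seq(from = a, to = last.day, by = "month")

# length.out fixes the number of dates instead of the end point
first.three <- seq(from = a, by = "day", length.out = 3)

length(weekly)    # 5
length(monthly)   # 13
first.three       # "2010-01-01" "2010-01-02" "2010-01-03"
```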
ASCII stands for the American Standard Code for Information Interchange, a standard table of letters, numbers and punctuation based upon the English alphabet. ASCII defines 128 characters: 95 printable characters (letters, numbers, etc.) and 33 control characters (end of line, tab, etc.). It is therefore limited to text without accent marks or special characters. ASCII was originally the standard character encoding of the World Wide Web, but it has since been replaced by UTF-8, a more flexible global standard.
Data analysis can be adversely affected if non-ASCII characters find their way into datasets. If they are causing you trouble, it’s useful to know some tricks to find and remove non-ASCII text. The iconv() function is one option:
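A minimal sketch, assuming the text is encoded as UTF-8 and using a made-up vector of words. When iconv() cannot convert an element to ASCII it returns NA, which flags the problem cases; supplying sub="" drops the offending characters instead. Results may vary with the platform's iconv implementation.

```r
# hypothetical data; \u00e9 is an accented e and \u00ef is an i with a diaeresis
x <- c("cafe", "caf\u00e9", "na\u00efve")

# elements that cannot be represented in ASCII come back as NA
has.non.ascii <- is.na(iconv(x, from = "UTF-8", to = "ASCII"))

# sub = "" removes the characters that cannot be converted
cleaned <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")

has.non.ascii   # FALSE TRUE TRUE
cleaned         # "cafe" "caf" "nave"
```

Removing characters changes the words, so it is worth inspecting the flagged elements before cleaning them.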