Lecture 11 - String Processing in R

You will inevitably encounter situations where you need to process strings. Text processing is a large and complex field, but a handful of functions will suffice for most simple tasks.

Some Vocabulary

When you work with text in computer programs, it is called ‘string processing’ because the computer does not know anything about words or concepts, so it treats text as strings of characters.

String - any sequence of characters, which may include letters, digits, punctuation, and spaces.
Text - the full document (a collection of texts is often called a corpus).
Word - text surrounded by spaces; words are sometimes called tokens.

We can vectorize text by breaking it into sentences, words, letters, etc.

## [1] "This is a string." "These"             "words"            
## [4] "are"               "also"              "strings."
## [1] "This is a string. These words are also strings."
## [1] "This"     "is"       "a"        "string."  "These"    "words"    "are"     
## [8] "also"     "strings."
##  [1] "T" "h" "i" "s" " " "i" "s" " " "a" " " "s" "t" "r" "i" "n" "g" "." "T" "h"
## [20] "e" "s" "e" "w" "o" "r" "d" "s" "a" "r" "e" "a" "l" "s" "o" "s" "t" "r" "i"
## [39] "n" "g" "s" "."
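Results like those above can be produced with paste() and strsplit(). This is only a sketch: the vector name x and the other variable names are assumptions, not taken from the original code.

```r
# Hypothetical example vector: one sentence plus several single words.
x <- c("This is a string.", "These", "words", "are", "also", "strings.")

# Collapse the vector into one string, separating elements with spaces.
one.string <- paste(x, collapse = " ")

# Break every element into words by splitting on spaces.
words <- unlist(strsplit(x, " "))

# Break every element into single characters by splitting on "".
chars <- unlist(strsplit(x, ""))

length(words)   # 9
length(chars)   # 42
```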

There are a handful of functions that you will use to work with strings. These functions find specific words or characters in your data, find parts of words, and replace them with other words or characters. There are also some functions to break text apart, put text together, or format it.

Function     Use
grep()       Find a word or phrase (returns the positions of matching elements, or the elements themselves with value=TRUE).
grepl()      Find a word or phrase (returns a logical vector).
regexpr()    Find the position of a pattern within a string - very flexible.
agrep()      Find an approximate match.
sub()        Replace the first occurrence of a word or phrase.
gsub()       Replace ALL occurrences of a word or phrase.
----------   ----------
paste()      Combine multiple strings into a single string.
strsplit()   Split one string into multiple strings.
substr()     Extract part of a string.

Let’s look at some examples of these functions in action.

Combining Numbers and Words

We often need to combine several pieces of text into one string, an operation called concatenation. R’s function for this is paste().

## [1] "My name is mud."
## [1] "My name is mud."
## [1] "My name is Larry" "My name is Moe"   "My name is Curly"
## [1] "x1" "x2" "x3"
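A short sketch of how paste() behaves, using made-up inputs chosen to give results like those shown above. The variable names are assumptions.

```r
# paste() joins its arguments with a single space by default.
a <- paste("My", "name", "is", "mud.")

# The sep argument controls what goes between the pieces.
b <- paste("My name is ", "mud.", sep = "")

# paste() is vectorized: the shorter argument is recycled.
stooges <- paste("My name is", c("Larry", "Moe", "Curly"))

# Combine letters and numbers with no separator.
var.names <- paste("x", 1:3, sep = "")   # "x1" "x2" "x3"
```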

Format Case

## [1] "ABCDEFG"
## [1] "abcdefg"
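Case conversions like the two above are usually done with the base functions toupper() and tolower(). The input strings here are assumed for illustration.

```r
upper <- toupper("abcdefg")   # "ABCDEFG"
lower <- tolower("ABCDEFG")   # "abcdefg"
```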

Counting Characters

Need to sort a column of text by the length of words? You count characters with the function nchar():

## [1]  5 13
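As a sketch, nchar() can be combined with order() to sort by length. The example words are hypothetical, chosen only because they have 5 and 13 characters.

```r
terms <- c("macroeconomic", "micro")   # hypothetical example words

# nchar() is vectorized and returns one count per element.
lengths <- nchar(terms)                # 13 5

# Sort the words from shortest to longest.
sorted <- terms[order(nchar(terms))]   # "micro" "macroeconomic"
```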

Counting Words

This is a little more complicated since text is often processed as a single character string.

## [1] 5
## [1] 30

We can split text using the string split function strsplit(). We only need to tell it the delimiter, which is a single space in this case. Note that strsplit() returns a list with one element per input string.

## [[1]]
## [1] "This"  "is"    "all"   "one"   "piece" "of"    "text."
## [1] 7
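A minimal sketch of the word count, assuming the sample sentence shown in the output. The variable names are hypothetical.

```r
txt <- "This is all one piece of text."

length(txt)   # 1, because the whole sentence is a single element
nchar(txt)    # 30 characters, spaces and punctuation included

# strsplit() returns a list, so take its first element
# before counting the words.
words <- strsplit(txt, " ")[[1]]
n.words <- length(words)   # 7
```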

If we want to split a string into its individual characters, we pass an empty string as the split argument:

## [[1]]
## [1] "a" "b" "c"
## [1] 30
## [1] 30
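For instance, assuming the same sample sentence as before, splitting on the empty string gives one element per character, so the length of the result agrees with nchar():

```r
txt <- "This is all one piece of text."   # assumed sample sentence

abc <- strsplit("abc", "")[[1]]    # "a" "b" "c"

chars <- strsplit(txt, "")[[1]]
length(chars)                      # 30
nchar(txt)                         # 30
```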

Extracting Part of Text

Recall that the census downloads contain a field called GEO.id which consists of several FIPS codes pasted together. If we inspect this ID we can see that the county FIPS code (the one we often use for merges) is included as the last five digits. How can we use this variable to extract the county FIPS codes?

The function substr() takes a character vector as its argument and returns the substring specified by the start and stop positions.

## [1] "ick"
## [1] "01001" "01003" "01005"
## [1] "0222000US01001" "0222000US01003" "0222000US01005"
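A sketch of the extraction. The GEO.id values below are made-up samples in the usual census layout, and geo.id is a hypothetical variable name.

```r
# Characters 2 through 4 of a single string.
nick <- substr("Rick", start = 2, stop = 4)   # "ick"

# Hypothetical GEO.id values: a 9-character prefix plus a 5-digit county FIPS.
geo.id <- c("0500000US01001", "0500000US01003", "0500000US01005")

# Positions 10 to 14 hold the county FIPS code.
fips <- substr(geo.id, 10, 14)

# Equivalent, without hard-coding positions: take the last five characters.
fips2 <- substr(geo.id, nchar(geo.id) - 4, nchar(geo.id))
```

Counting back from nchar() is a little safer when the prefix length is not guaranteed to be the same in every row.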

Search Text for a Match

If we want to search text for a keyword we use grep().

In case you are curious about what ‘grep’ means, it is a term inherited from Unix operating systems.

GREP (g/re/p): Globally search for a Regular Expression and Print

## [1] 3
## integer(0)
## [1] 3
## [1] 2 3
## [1] "new york"   "new jersey"
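The following sketch shows calls that give results like those above. The vector of state names is an assumption, as are the variable names.

```r
states <- c("california", "new york", "new jersey", "tennessee")

# By default grep() returns the positions of the matching elements.
pos <- grep("jersey", states)                          # 3

# Matching is case sensitive unless told otherwise.
none <- grep("Jersey", states)                         # integer(0)
any.case <- grep("Jersey", states, ignore.case = TRUE) # 3

# A keyword can match several elements.
both <- grep("new", states)                            # 2 3

# value = TRUE returns the matching strings instead of positions.
hits <- grep("new", states, value = TRUE)

# grepl() returns a logical vector, handy for subsetting.
is.new <- grepl("new", states)
```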

Replacing Text

Find and replace the first case in a string with sub() or all cases with gsub():

## [1] "We are traveling from Old York to New Jersey"
## [1] "california" "old york"   "old jersey" "tennessee"
## [1] "We are tpartyling from New York to New Jersey"
## [1] "We are traveling from Old York to Old Jersey"
## [1] "california" "old york"   "old jersey" "tennessee"
## [1] ".Hello there?"
## [1] "Hello there."
## [1] "Hello.There"
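A brief sketch contrasting sub() and gsub(), with assumed inputs. It also shows a common pitfall: the pattern is a regular expression, so a character such as "." has a special meaning unless fixed = TRUE is set.

```r
sentence <- "We are traveling from New York to New Jersey"
states <- c("california", "new york", "new jersey", "tennessee")

# sub() replaces only the first match in each string.
first.only <- sub("New", "Old", sentence)

# gsub() replaces every match.
all.cases <- gsub("New", "Old", sentence)

# Both functions are vectorized over the elements of a character vector.
old.states <- sub("new", "old", states)

# "." matches any character, so the first character is replaced ...
wrong <- sub(".", "!", "Hello there.")                 # "!ello there."

# ... unless the pattern is treated as plain text.
right <- sub(".", "!", "Hello there.", fixed = TRUE)   # "Hello there!"
```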

Regular Expressions

We often need to search large bodies of text for patterns.

Regular expressions are a stylized syntax used to query bodies of text and return very specific results. The syntax uses symbols that match groups of characters, as well as expressions that refer to locations within strings (a pattern at the beginning of a word or the end of a sentence).

Note that this section borrows heavily from Gloria Li and Jenny Bryan. Thank you for the clear examples provided at:

https://stat545-ubc.github.io/block022_regular-expression.html

Regular Expression Operators

Recall that logical operators are symbols that allow us to translate nuanced questions into computer code. For example, how many left-handed batters have been inducted into the Baseball Hall of Fame?

Similarly, regular expression operators allow us to create complex search terms.

Instead of saying, search for the word “cat” in the text, we might want to say, search for the word “cat”, only at the beginning of sentences, and do not return instances like “catch” that contain “cat”.

In order to specify these searches, we need a more flexible language. Regular expressions give us this.

Each of these symbols functions as an operator in the regular expressions framework:

$ * + . ? [ ] ^ { } | ( )

Here are the uses of some of these:

Operator   Use
.          matches any single character (wild card for a single character)
*          matches the preceding character 0 or more times (".*" is the wild card for any number of characters)
^          start of a string
$          end of a string
?          matches the preceding character 0 or 1 times
+          matches the preceding character 1 or more times
|          OR statement - match either expression given
[ ]        OR statement - match any one of the characters given
[^ ]       match any one character EXCEPT those given in the list
\          escape character - turns an operator into plain text (written as \\ inside an R string)
## [1] "abc"   "abd"   "abe"   "ab 12" "ab$"
## [1] "abc" "abd"
## [1] "abc" "abd" "abe"
## [1] "abd"   "abe"   "ab 12" "ab$"
## [1] "ab"    "abc"   "abd"   "abe"   "ab 12" "ab$"
## [1] "^ab" "ab"
## [1] "^ab"   "ab"    "abc"   "abd"   "abe"   "ab 12" "ab$"
## [1] "^ab"
## [1] "^ab"   "ab"    "abc"   "abd"   "abe"   "ab 12" "ab$"
## [1] "ab$"
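The operators can be tried out with grep(). This is a sketch only: the test vector is assumed from the printed results, and the patterns are one plausible set that gives such results.

```r
strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12", "ab$")

# "." needs some character after "ab".
dot <- grep("ab.", strings, value = TRUE)          # "abc" "abd" "abe" "ab 12" "ab$"

# Square brackets match one character from a set or range.
range <- grep("ab[c-e]", strings, value = TRUE)    # "abc" "abd" "abe"
not.c <- grep("ab[^c]", strings, value = TRUE)     # "abd" "abe" "ab 12" "ab$"

# "|" matches either alternative.
either <- grep("abc|abd", strings, value = TRUE)   # "abc" "abd"

# "^" and "$" anchor the pattern to the start or end of the string.
starts <- grep("^ab", strings, value = TRUE)       # all except "^ab"
ends <- grep("ab$", strings, value = TRUE)         # "^ab" "ab"

# Escaped, the same symbols are matched as literal characters.
caret <- grep("\\^ab", strings, value = TRUE)      # "^ab"
dollar <- grep("ab\\$", strings, value = TRUE)     # "ab$"
```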

If we want to search for one of these special operators in our text, we need to tell R that we are looking for the operator, and not trying to use a regular expression statement. We accomplish this with an escape sequence.

Create an escape sequence by placing a double backslash “\\” in front of a special operator. For example, to search for a quote, a newline, or a tab in the text use these:

  • \\’: single quote.
  • \\": double quote.
  • \\n: newline.
  • \\r: carriage return.
  • \\t: tab character.
  • \\b: matches the empty string at either edge of a WORD.
  • \\B: matches the string provided it is NOT at an edge of a word.
## [1] "Here is a long string\n           of text that contains \n           some breaks."
## [1] 79
## [[1]]
## [1] 22 56
## attr(,"match.length")
## [1] 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## [[1]]
## [1]  5  8 10 15 36 41 46 55 72
## attr(,"match.length")
## [1] 1 1 1 1 1 1 1 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## [[1]]
## [1] 34
## attr(,"match.length")
## [1] 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## [[1]]
## [1] 12 48 69
## attr(,"match.length")
## [1] 1 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
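Here is a small sketch of searching for escape sequences with gregexpr(). It uses a shorter, made-up string than the one printed above so that the positions are easy to check by hand.

```r
# Hypothetical text containing one newline and one tab.
txt <- "one fish\ntwo fish\tred fish"
nchar(txt)   # 26

# gregexpr() returns a list; as.integer() drops the attributes
# and keeps only the starting positions.
newlines <- as.integer(gregexpr("\\n", txt)[[1]])   # 9
tabs <- as.integer(gregexpr("\\t", txt)[[1]])       # 18

# "\\b" anchors the match to the edge of a word:
# every "f" that begins a word.
word.start.f <- as.integer(gregexpr("\\bf", txt)[[1]])   # 5 14 23

# "\\B" requires that the match is NOT at the edge of a word:
# the "o" in "two" qualifies, the "o" in "one" does not.
inner.o <- as.integer(gregexpr("\\Bo", txt)[[1]])   # 12
```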

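The regexpr() and gregexpr() functions are unusual because they return the character position where a match starts (with its length stored in the match.length attribute) instead of an element from the character vector. regexpr() reports the first match in each string and gregexpr() reports all of them. These start and stop positions are used to extract pieces of text from the whole body of text.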

## [1] 3
## attr(,"match.length")
## [1] 5
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## [1] "cdefgh"
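As a sketch of how the position and match length work together, the example below extracts a match with substr(). The string and variable names are assumptions. Note that the last matched character sits at start + length - 1.

```r
x <- "abcdefghijk"   # hypothetical example string

m <- regexpr("cdefg", x)
start <- as.integer(m)            # 3
len <- attr(m, "match.length")    # 5

# The match occupies positions start through start + len - 1.
piece <- substr(x, start, start + len - 1)   # "cdefg"

# regmatches() performs the same extraction directly from the match object.
same.piece <- regmatches(x, m)               # "cdefg"
```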

Quantifiers

The quantifiers allow us to specify the number of times a character is repeated.

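Operator   Use
*          matches the preceding character at least 0 times.
.          matches any single character exactly once (a wild card rather than a quantifier).
+          matches the preceding character at least 1 time.
?          matches the preceding character at most 1 time.
{n}        matches the preceding character exactly n times.
{n,}       matches the preceding character at least n times.
{n,m}      matches the preceding character between n and m times.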
## [1] "ht"    "hot"   "hoot"  "hooot"
## [1] "hot"
## [1] "hot"   "hoot"  "hooot"
## [1] "ht"  "hot"
## [1] "hoot"
## [1] "hoot"  "hooot"
## [1] "hot"  "hoot"
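A sketch of the quantifiers applied to the vector shown above. The patterns are one plausible set that gives such results, and the variable names are assumptions.

```r
strings <- c("ht", "hot", "hoot", "hooot")

zero.plus <- grep("ho*t", strings, value = TRUE)      # all four
one.any <- grep("h.t", strings, value = TRUE)         # "hot"
one.plus <- grep("ho+t", strings, value = TRUE)       # "hot" "hoot" "hooot"
at.most.1 <- grep("ho?t", strings, value = TRUE)      # "ht" "hot"
exactly.2 <- grep("ho{2}t", strings, value = TRUE)    # "hoot"
two.plus <- grep("ho{2,}t", strings, value = TRUE)    # "hoot" "hooot"
one.to.2 <- grep("ho{1,2}t", strings, value = TRUE)   # "hot" "hoot"
```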

Position

Position operators specify whether the characters occur at the beginning, middle, or end of a word or phrase.

Note that “a dog” is a STRING that contains two WORDS for the definitions below.

Operator   Use
^          matches the start of the STRING.
$          matches the end of the STRING.
\\b        matches the empty string at either edge of a WORD.
\\B        matches the string provided it is NOT at an edge of a word.
## [1] "abcd"  "cdab"  "cabd"  "c abd"
## [1] "abcd"
## [1] "cdab"
## [1] "abcd"  "c abd"
## [1] "cdab" "cabd"
## [1] 5
## attr(,"match.length")
## [1] 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## [1] 3
## [1] 1 3
## [1] 4
## [1] 4
## integer(0)
## [1] 1 2
## [1] 1 2
## [1] 1 3
## [1] "finLAND"        "iceLAND"        "michael LANDon"
## [1] "finLAND"        "iceLAND"        "michael landon"
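The position operators can be sketched as follows. The first vector is taken from the output above; the place names and the patterns are assumptions chosen to give similar results.

```r
strings <- c("abcd", "cdab", "cabd", "c abd")

at.start <- grep("^ab", strings, value = TRUE)      # "abcd"
at.end <- grep("ab$", strings, value = TRUE)        # "cdab"

# "ab" at the start of a word, even in the middle of the string.
word.edge <- grep("\\bab", strings, value = TRUE)   # "abcd" "c abd"

# "ab" that is NOT at the start of a word.
inside <- grep("\\Bab", strings, value = TRUE)      # "cdab" "cabd"

# Anchors also work in replacements: only "land" at the end is changed.
places <- c("finland", "iceland", "michael landon")
everywhere <- gsub("land", "LAND", places)
end.only <- sub("land$", "LAND", places)
```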

Dates in R

R has a special class for dates. Although dates print like text, the Date class stores each one as a number of days (counted from 1970-01-01), so R can do arithmetic on them and translate easily between days, months, and years.

You would use this class in order to re-cast characters from a database into calendar dates, or to calculate time between events.

Time Between Dates

Suppose you want to calculate the time between two dates in your data set:

## [1] "2011/06/13" "2011/07/25" "2011/05/24"
## [1] "character"

You will notice that our dates were read in as characters, so we first need to convert them to the Date class before we can make any meaningful comparisons between them.

## [1] "2012-01-01" "2012-01-01" "2012-03-19"
## Time differences in days
## [1] 202 160 300
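A minimal sketch of the conversion and subtraction, using the dates printed above. The variable names are assumptions.

```r
start.dates <- c("2011/06/13", "2011/07/25", "2011/05/24")
end.dates <- c("2012/01/01", "2012/01/01", "2012/03/19")

class(start.dates)   # "character"

# as.Date() recognizes the year/month/day layout without extra help.
start <- as.Date(start.dates)
end <- as.Date(end.dates)

# Subtracting two Date vectors gives a difftime object measured in days.
gap <- end - start

# as.numeric() drops the units when plain numbers are needed.
days <- as.numeric(gap)   # 202 160 300
```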

It worked! Let’s be a little more careful, though, about how we are conducting the translation to make sure we are not introducing any errors. We can explicitly specify the format of the dates to ensure they are interpreted correctly:

## [1] "2011-06-13" "2011-07-25" "2011-05-24"
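For example, assuming the same character dates as before, the format argument spells out the layout: %Y is a four-digit year, %m the month, and %d the day.

```r
start.dates <- c("2011/06/13", "2011/07/25", "2011/05/24")

# Year, then month, then day, separated by slashes.
start <- as.Date(start.dates, format = "%Y/%m/%d")
```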

That works correctly. What if we mix up days and months, though? (European and American dates often order days and months differently.)

## [1] NA NA NA
## [1] "2004-06-30"
## [1] NA
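A sketch of what goes wrong with a mismatched format, under the same assumed inputs. Reading the second field as the day and the third as the month asks for months 13, 25, and 24, so every value becomes NA.

```r
start.dates <- c("2011/06/13", "2011/07/25", "2011/05/24")

# Day and month swapped in the format string: no such months exist.
swapped <- as.Date(start.dates, format = "%Y/%d/%m")   # NA NA NA

# June has 30 days, so the 30th is valid and the 31st is not.
ok <- as.Date("2004-06-30", format = "%Y-%m-%d")
bad <- as.Date("2004-06-31", format = "%Y-%m-%d")      # NA
```

Checking for NA after a conversion is a cheap way to catch format mistakes early.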

At least R is smart enough to know that there are no months higher than 12 and only 30 days in June. It returns NA for impossible dates instead of silently converting them into valid ones.

Creating a Sequence of Dates

We can use the seq() function to generate vectors of dates as long as the arguments are dates.

##  [1] "2010-01-01" "2010-01-02" "2010-01-03" "2010-01-04" "2010-01-05"
##  [6] "2010-01-06" "2010-01-07" "2010-01-08" "2010-01-09" "2010-01-10"
## [11] "2010-01-11" "2010-01-12" "2010-01-13" "2010-01-14" "2010-01-15"
## [16] "2010-01-16" "2010-01-17" "2010-01-18" "2010-01-19" "2010-01-20"
## [21] "2010-01-21" "2010-01-22" "2010-01-23" "2010-01-24" "2010-01-25"
## [26] "2010-01-26" "2010-01-27" "2010-01-28" "2010-01-29" "2010-01-30"
## [31] "2010-01-31" "2010-02-01"
## [1] "2010-01-01" "2010-01-08" "2010-01-15" "2010-01-22" "2010-01-29"
## [1] "2010-01-01" "2010-01-08" "2010-01-15" "2010-01-22" "2010-01-29"
##  [1] "2010-01-01" "2010-02-01" "2010-03-01" "2010-04-01" "2010-05-01"
##  [6] "2010-06-01" "2010-07-01" "2010-08-01" "2010-09-01" "2010-10-01"
## [11] "2010-11-01" "2010-12-01" "2011-01-01"
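Sequences like those above can be sketched with seq() and its by argument. The variable names are assumptions.

```r
first <- as.Date("2010-01-01")

# Every day from January 1 through February 1 (32 dates).
daily <- seq(first, as.Date("2010-02-01"), by = "day")

# Five weekly dates, written two equivalent ways.
weekly <- seq(first, by = "week", length.out = 5)
weekly2 <- seq(first, by = 7, length.out = 5)

# The first of each month through January 2011 (13 dates).
monthly <- seq(first, as.Date("2011-01-01"), by = "month")
```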

Removing Non-ASCII Characters

ASCII stands for the American Standard Code for Information Interchange, a standard table of letters, numbers and punctuation based upon the English alphabet. ASCII defines 128 characters: 95 printable characters (letters, numbers, etc.) and 33 control characters (end of line, tab, etc.). It is limited to text without accent marks or special characters. ASCII was for many years the most common character encoding on the World Wide Web, but it has been overtaken by UTF-8, a more flexible global standard that includes ASCII as a subset.

Data analysis can be adversely affected if non-ASCII characters find their way into datasets. If they are causing you trouble, it’s useful to know some tricks to find and remove non-ASCII text. The iconv() function is one option:
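A minimal sketch, assuming the text is encoded as UTF-8. The sample vector and variable names are hypothetical, and iconv() behavior can vary somewhat between platforms.

```r
# Hypothetical data; \u00e9 is an e with an acute accent.
x <- c("caf\u00e9", "plain text")

# With the default sub = NA, any element that cannot be converted
# to ASCII comes back as NA, which makes it easy to flag.
has.non.ascii <- is.na(iconv(x, from = "UTF-8", to = "ASCII"))

# With sub = "", the bytes that cannot be converted are dropped instead.
cleaned <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
```

Flagging before removing is usually the safer order, since dropping characters silently changes the data.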