1 Data Recipes

1.1 Packages Used in This Chapter

library( dplyr )
library( pander )
library( ggvis )

1.2 Key Concepts

Figure 1.1: Data recipes are written using a series of data steps. We can simplify this process using pipes



Figure 1.2: Recall, data verbs use data frames as the primary input and the output value.



Figure 1.3: The pipe operator passes a data frame forward through a chain of data verbs. We only reference the dataset name once, and all other times it’s implicitly called through piping.

1.3 The Pipe Operator %>%

The idea of functions() was first introduced using the metaphor of a cookie recipe: a recipe has ingredients (data and arguments), and each step of the process builds on the results of the previous step.

The pipe operator allows us to follow this same model to build “data recipes”, a stylized way of writing a program as a series of data verbs chained together to wrangle and analyze the data. The pipe operator passes the data from one verb to the next without having to name it directly.

Figure 1.4: The pipe operator allows us to pass a transformed dataset forward in the recipe.
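
To make the idea concrete, here is a minimal sketch that writes the same three steps twice, once as nested function calls and once as a piped recipe. The data frame `toy` and its values are invented for illustration and are not part of the baby names data.

```r
library( dplyr )

# Toy data frame, invented for illustration (not the SSA data)
toy <- data.frame(
  Name  = c( "Ann", "Bob", "Cal", "Dee" ),
  Year  = c( 1950, 1950, 1951, 1951 ),
  Count = c( 40, 10, 30, 20 )
)

# Nested version: has to be read from the inside out
nested <- head( arrange( filter( toy, Count > 15 ), desc( Count ) ), 2 )

# Piped version: reads top to bottom, one data verb per line
piped <- 
  toy %>%
  filter( Count > 15 ) %>%
  arrange( desc( Count ) ) %>%
  head( 2 )

# Both versions should produce the same data frame
identical( nested, piped )
```

Note that `toy` is named only once in the piped version. Each verb receives the output of the previous line as its first argument.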

1.4 Building Data Recipes

Data recipes are simple scripts that follow a series of steps, just like a recipe.

This chapter demonstrates how data verbs and pipe operators can be used to write recipes to generate interesting insights.

To demonstrate the idea, we will use a dataset of US Baby Names released by the Social Security Administration. This version was downloaded by Ryan Burge and posted on Kaggle. I’ve re-posted it on GitHub so it can be read directly into R easily:

URL <- "https://github.com/DS4PS/Data-Science-Class/blob/master/DATA/BabyNames.rds?raw=true"
names <- readRDS( gzcon( url( URL )))
names %>% head() %>% pander()
Id Name Year Gender Count
1 Mary 1880 F 7065
2 Anna 1880 F 2604
3 Emma 1880 F 2003
4 Elizabeth 1880 F 1939
5 Minnie 1880 F 1746
6 Margaret 1880 F 1578

Let’s start by building a recipe to identify the top 10 male names for Baby Boomers.

  1. Create a subset of data for men born between 1946 and 1964.
  2. Sort by the annual count of each name in the subset.
  3. Keep only the most popular year for each name.
  4. Identify the top 10 most popular during this period.
  5. Print the results in a nice table that includes name and peak year data.

The recipe will look something like this:

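names %>% 
  filter( Gender =="M" & Year >= 1946 & Year <= 1964 ) %>%   # men born 1946-1964
  arrange( desc( Count ) ) %>%                               # largest annual counts first
  distinct( Name, .keep_all=TRUE ) %>%                       # first row per name = its peak year
  top_n( 10, Count ) %>%                                     # ten highest peaks
  select( Name, Year, Count ) %>%
  pander()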
Name Year Count
James 1947 94755
Michael 1957 92709
Robert 1947 91642
John 1947 88318
David 1955 86191
William 1947 66969
Richard 1946 58859
Mark 1960 58735
Thomas 1952 48617
Charles 1947 40773

There are many ways to construct a data recipe. We could have alternatively taken this approach:

  1. Create a subset of data for men born between 1946 and 1964.
  2. Count the total number of men given each name during the period.
  3. Find the top 10 most popular names.

names %>% 
  filter( Gender =="M" & Year >= 1946 & Year <= 1964 ) %>%
  group_by( Name ) %>%
  dplyr::summarize( total=sum(Count) ) %>%
  dplyr::arrange( desc(total) ) %>%
  slice( 1:10 ) %>%
  pander()
Name total
James 1570607
Robert 1530527
John 1524619
Michael 1463911
David 1395499
William 1072303
Richard 959321
Thomas 810160
Mark 684159
Charles 657780

These two approaches give slightly different answers to our question. Both return the same ten names, but the order changes: Michael ranks second by peak year and fourth by total, and Mark and Thomas swap places.
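
The gap comes from what each recipe ranks: the first uses a name's single best year, the second uses its total over the whole period. The sketch below shows how the two rankings can disagree. The names and counts are invented for illustration, with "Al" having one big year and "Bo" staying steady.

```r
library( dplyr )

# Invented counts: "Al" has one big year, "Bo" is steady
toy <- data.frame(
  Name  = c( "Al", "Al", "Al", "Bo", "Bo", "Bo" ),
  Year  = c( 1946, 1947, 1948, 1946, 1947, 1948 ),
  Count = c( 90, 10, 10, 50, 52, 51 )
)

# Approach 1: rank names by their single best year
by.peak <- 
  toy %>%
  arrange( desc( Count ) ) %>%
  distinct( Name, .keep_all=TRUE ) %>%
  select( Name, Year, Count )

# Approach 2: rank names by their total over the period
by.total <- 
  toy %>%
  group_by( Name ) %>%
  summarize( total=sum( Count ) ) %>%
  arrange( desc( total ) )

by.peak    # Al first: peak of 90 versus 52
by.total   # Bo first: total of 153 versus 110
```

Neither ranking is wrong. Which one to use depends on whether "popular" means a sharp spike or sustained use.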

Let’s try to identify when specific female names have peaked.

  1. Create a subset of data for women.
  2. Group the data by “Name” so we can analyze each name separately.
  3. Find the year with the highest count for each name.
  4. Store this data as “peak.years”.

Each name will occur once in this dataset, in the year that it experienced its peak popularity.

peak.years <- 
  names %>%
  filter( Gender == "F" ) %>%
  group_by( Name ) %>%
  top_n( 1, Count ) %>% 
  ungroup() 

peak.years %>% head( 5 ) %>% pander()
Id Name Year Gender Count
568 Manerva 1880 F 10
720 Neppie 1880 F 7
2621 Zilpah 1881 F 9
4625 Crete 1882 F 8
4750 Alwina 1882 F 6
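
One caveat: top_n() keeps ties, so a name whose highest count occurs in more than one year would appear more than once in peak.years. The sketch below, on invented data, compares top_n() with slice_max( with_ties=FALSE ), which keeps exactly one row per group. It assumes dplyr 1.0.0 or later, where slice_max() is available.

```r
library( dplyr )

# Invented data: "Ida" has the same top count in two different years
toy <- data.frame(
  Name  = c( "Ida", "Ida", "Ida", "Fay", "Fay" ),
  Year  = c( 1900, 1901, 1902, 1900, 1901 ),
  Count = c( 8, 8, 5, 3, 7 )
)

# top_n() keeps every row tied for the maximum
with.ties <- 
  toy %>%
  group_by( Name ) %>%
  top_n( 1, Count ) %>%
  ungroup()

# slice_max() with with_ties=FALSE keeps exactly one row per group
one.per.name <- 
  toy %>%
  group_by( Name ) %>%
  slice_max( Count, n=1, with_ties=FALSE ) %>%
  ungroup()

nrow( with.ties )      # 3 rows: Ida appears twice
nrow( one.per.name )   # 2 rows: one per name
```

Whether ties should be kept is a judgment call. For plotting name histories it makes little difference, but it matters if later steps assume one row per name.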

We can then filter by years to see which names peaked in a given period.

filter( peak.years, Year == 1950 ) %>% 
  arrange( desc( Count ) ) %>%
  slice( 1:5 ) %>%
  pander()
Id Name Year Gender Count
462006 Constance 1950 F 4442
462008 Glenda 1950 F 4213
462103 Bonita 1950 F 1527
462301 Ilene 1950 F 453
462305 Marta 1950 F 445

We can plot the full history of one of these names, Constance, to see the peak:

# library( ggvis )
names %>%
  filter( Name == "Constance" & Gender =="F" ) %>%
  select (Name, Year, Count) %>%
  ggvis( ~Year, ~Count, stroke = ~Name ) %>%
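  layer_lines()

The same steps work for any year. Here are the five most popular names that peaked in 1920, followed by a plot of their histories:

top.five.1920 <- 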
  filter( peak.years, Year == 1920 ) %>% 
  top_n( 5, Count ) 

top.five.1920
## # A tibble: 5 × 5
##       Id Name     Year Gender Count
##    <int> <chr>   <int> <chr>  <int>
## 1 169464 Ruth     1920 F      26100
## 2 169465 Mildred  1920 F      18058
## 3 169472 Marie    1920 F      12745
## 4 169477 Lillian  1920 F      10050
## 5 169481 Gladys   1920 F       8819

names %>%
  filter( Name %in% top.five.1920$Name & Gender =="F" ) %>%
  select (Name, Year, Count) %>%
  ggvis( ~Year, ~Count, stroke = ~Name ) %>%
  layer_lines()

Names that peaked in 1975:

top.five.1975 <- 
  filter( peak.years, Year == 1975 ) %>% 
  top_n( 5, Count ) 

names %>%
  filter( Name %in% top.five.1975$Name & Gender =="F" ) %>%
  select (Name, Year, Count) %>%
  ggvis( ~Year, ~Count, stroke = ~Name ) %>%
  layer_lines()

Names that peaked in 2000:

top.five.2000 <- 
  filter( peak.years, Year == 2000 ) %>% 
  top_n( 5, Count ) 

names %>%
  filter( Name %in% top.five.2000$Name & Gender =="F" ) %>%
  select (Name, Year, Count) %>%
  ggvis( ~Year, ~Count, stroke = ~Name ) %>%
  layer_lines()
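
The 1920, 1975 and 2000 blocks above repeat one recipe with a different year each time. One option is to wrap the recipe in a function. The sketch below is only an illustration: `top.peaks()` is a hypothetical helper, and `toy.peaks` is an invented stand-in for peak.years with made-up names and counts.

```r
library( dplyr )

# Invented stand-in for peak.years: one row per name at its peak year
toy.peaks <- data.frame(
  Name  = c( "Nell", "Opal", "Pru", "Quin", "Rae" ),
  Year  = c( 1920, 1920, 1920, 1975, 1975 ),
  Count = c( 300, 500, 100, 900, 700 )
)

# Hypothetical helper: the n highest-count names that peaked in a given year
top.peaks <- function( peaks, year, n=5 )
{
  peaks %>%
    filter( Year == year ) %>%
    arrange( desc( Count ) ) %>%
    head( n )
}

top.two.1920 <- top.peaks( toy.peaks, 1920, n=2 )
top.two.1920
```

Because the function takes a data frame as its first argument and returns a data frame, it behaves like any other data verb and can be used inside a pipe, for example `toy.peaks %>% top.peaks( 1975 )`.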

Ryan Burge posted a fun project on Kaggle about how to find hipster names using this historical data. He defines hipster names as those meeting the following criteria:

  • They were popular when your grandmother was young.
  • They were unpopular when your parents were young.
  • They have recently become popular again.

Let’s stick with women’s names.

# popular when your grandmother was young
df1 <- filter( names, Gender == "F" & Year >= 1915 & Year <= 1935 & Count > 3000 )
# unpopular when your parents were young
df2 <- filter( names, Gender == "F" & Year == 1980 & Count <= 1000 )
# popular again recently
df3 <- filter( names, Gender == "F" & Year >= 2010 & Count > 2000)

hipster.names <- 
  names %>%
  filter( Name %in% df1$Name & Name %in% df2$Name & Name %in% df3$Name ) %>%
  group_by( Name ) %>%
  dplyr::summarize( total=sum(Count), peak=max(Count) ) %>%
  arrange( desc( peak ) ) 
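
The three `%in%` conditions in the filter amount to a set intersection: a name qualifies only if it appears in all three subsets. The sketch below shows that logic on invented data, using base R's Reduce() and intersect(). The names, counts, and the objects `toy`, `then`, `lull` and `now` are hypothetical.

```r
library( dplyr )

# Invented counts for three names in three eras
toy <- data.frame(
  Name  = c( "Ada", "Ada", "Ada", "Bea", "Bea", "Bea", "Cleo", "Cleo", "Cleo" ),
  Year  = c( 1920, 1980, 2012, 1920, 1980, 2012, 1920, 1980, 2012 ),
  Count = c( 5000, 400, 3000, 5000, 4000, 3000, 100, 50, 2500 )
)

then <- filter( toy, Year >= 1915 & Year <= 1935 & Count > 3000 )  # popular long ago
lull <- filter( toy, Year == 1980 & Count <= 1000 )                # unpopular in 1980
now  <- filter( toy, Year >= 2010 & Count > 2000 )                 # popular again

# Names that appear in all three subsets
hipster <- Reduce( base::intersect, list( then$Name, lull$Name, now$Name ) )
hipster
```

Here only "Ada" meets all three criteria: "Bea" never became unpopular, and "Cleo" was never popular in the early period.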

Here are the six female hipster names with the highest peak counts:

top.hipster.names <- c("Emma","Evelyn","Alice","Grace","Lillian","Charlotte")

names %>%
  filter( Name %in% top.hipster.names & Gender =="F" ) %>%
  select (Name, Year, Count) %>%
  ggvis( ~Year, ~Count, stroke = ~Name ) %>%
  layer_lines()

And the full list:

hipster.names %>% pander()
Name total peak
Emma 595546 22701
Evelyn 534502 14279
Grace 469034 12770
Alice 551034 11956
Lillian 421392 10050
Charlotte 312022 10048
Ella 273663 9868
Josephine 297064 8683
Eleanor 268153 8499
Ruby 340143 8407
Hazel 244069 7615
Clara 268980 5779
Eva 252465 4564
Lucy 185064 4257
Stella 155080 4165
Violet 122984 4156
Vivian 198012 4128

1.5 Conclusion

The pipe operator is a little confusing when you first encounter it, but you will find that using data verbs contained in the dplyr package and the pipe operator will speed up your analysis and make your code more readable.

In the next chapter we focus more on the use of groups in data science, and the applications of the group_by() function to make your job easier.