16 Data Recipes
16.1 Packages Used in This Chapter
library( dplyr )
library( pander )
library( ggvis )
16.2 Key Concepts
16.3 The Pipe Operator %>%
The idea of functions() was first introduced using the metaphor of a cookie recipe, which has ingredients (data and arguments) and requires that each step of the process build on the results of the previous step.
The pipe operator allows us to follow this same model to build “data recipes”, a stylized way of writing a program as a series of data verbs chained together to wrangle and analyze the data. The pipe operator passes the data from one verb to the next without having to name it directly.
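As a minimal sketch of what the pipe does, the example below writes the same two-step calculation twice: once as nested function calls and once as a piped recipe. The data frame `toy` and its values are invented for illustration and are not part of the baby names dataset.

```r
library( dplyr )

# Hypothetical toy data, invented for illustration
toy <- data.frame( Name  = c( "Ann", "Bob", "Cat", "Dan" ),
                   Count = c( 5, 20, 15, 10 ) )

# Nested calls: read from the inside out
nested <- head( arrange( toy, desc( Count ) ), 2 )

# Piped recipe: read from top to bottom, one verb per step
piped <-
  toy %>%
  arrange( desc( Count ) ) %>%
  head( 2 )

identical( nested, piped )   # compares the two results
```

Both versions do the same work. The piped form simply lists the verbs in the order they are applied, so `toy` is named once and never repeated.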
16.4 Building Data Recipes
Data recipes are simple scripts that follow a series of steps, just like a recipe.
This chapter demonstrates how data verbs and pipe operators can be used to write recipes to generate interesting insights.
To demonstrate the idea, we will use a dataset of US Baby Names released by the Social Security Administration. This version was downloaded by Ryan Burge and posted on Kaggle. I’ve re-posted it on GitHub so it can be read directly into R easily:
URL <- "https://github.com/DS4PS/Data-Science-Class/blob/master/DATA/BabyNames.rds?raw=true"
names <- readRDS( gzcon( url( URL )))
names %>% head() %>% pander()
Id | Name | Year | Gender | Count |
---|---|---|---|---|
1 | Mary | 1880 | F | 7065 |
2 | Anna | 1880 | F | 2604 |
3 | Emma | 1880 | F | 2003 |
4 | Elizabeth | 1880 | F | 1939 |
5 | Minnie | 1880 | F | 1746 |
6 | Margaret | 1880 | F | 1578 |
Let’s start by building a recipe to identify the top 10 male names for Baby Boomers.
- Create a subset of data for men born between 1946 and 1964.
- Sort by the annual count of each name in the subset.
- Keep only the most popular year for each name.
- Identify the top 10 most popular during this period.
- Print the results in a nice table that includes name and peak year data.
The recipe will look something like this:
names %>%
  filter( Gender == "M" & Year >= 1946 & Year <= 1964 ) %>%
  arrange( desc( Count ) ) %>%
  distinct( Name, .keep_all=TRUE ) %>%
  top_n( 10, Count ) %>%
  select( Name, Year, Count ) %>%
  pander()
Name | Year | Count |
---|---|---|
James | 1947 | 94755 |
Michael | 1957 | 92709 |
Robert | 1947 | 91642 |
John | 1947 | 88318 |
David | 1955 | 86191 |
William | 1947 | 66969 |
Richard | 1946 | 58859 |
Mark | 1960 | 58735 |
Thomas | 1952 | 48617 |
Charles | 1947 | 40773 |
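The `arrange()` and `distinct()` pair is what carries out the "keep only the most popular year for each name" step: `distinct()` keeps the first row it finds for each name, and after sorting by descending count that first row is the name's peak year. A small sketch on invented data (the values in `toy` are made up for illustration):

```r
library( dplyr )

# Hypothetical counts for two names across three years
toy <- data.frame( Name  = c( "Al", "Al", "Al", "Bo", "Bo", "Bo" ),
                   Year  = c( 1946, 1947, 1948, 1946, 1947, 1948 ),
                   Count = c( 10, 30, 20, 50, 40, 45 ) )

peak <-
  toy %>%
  arrange( desc( Count ) ) %>%          # largest counts first
  distinct( Name, .keep_all = TRUE )    # first row per name = peak year

peak
```

Here "Al" is kept with its 1947 row and "Bo" with its 1946 row. Reversing the order of the two verbs would instead keep whichever year happens to come first in the raw data.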
There are many ways to construct a data recipe. We could have alternatively taken this approach:
- Create a subset of data for men born between 1946 and 1964.
- Count the total number of men given each name during the period.
- Find the top 10 most popular names.
names %>%
  filter( Gender == "M" & Year >= 1946 & Year <= 1964 ) %>%
  group_by( Name ) %>%
  dplyr::summarize( total=sum(Count) ) %>%
  dplyr::arrange( desc(total) ) %>%
  slice( 1:10 ) %>%
  pander()
Name | total |
---|---|
James | 1570607 |
Robert | 1530527 |
John | 1524619 |
Michael | 1463911 |
David | 1395499 |
William | 1072303 |
Richard | 959321 |
Thomas | 810160 |
Mark | 684159 |
Charles | 657780 |
These two approaches return the same ten names but rank them differently: Michael, for example, is second by peak year and fourth by total.
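The rankings differ because the first recipe measures a name's single best year while the second measures its total over the whole period. The sketch below uses invented numbers to show how a name with one spike can win the first ranking and lose the second. The names "Spike" and "Steady" are hypothetical.

```r
library( dplyr )

# Hypothetical data: "Spike" has one big year, "Steady" is consistent
toy <- data.frame( Name  = rep( c( "Spike", "Steady" ), each = 3 ),
                   Year  = rep( 1946:1948, times = 2 ),
                   Count = c( 100, 10, 10, 60, 60, 60 ) )

# Rank by the best single year
by.peak <-
  toy %>%
  group_by( Name ) %>%
  summarize( peak = max( Count ) ) %>%
  arrange( desc( peak ) )

# Rank by the total over the period
by.total <-
  toy %>%
  group_by( Name ) %>%
  summarize( total = sum( Count ) ) %>%
  arrange( desc( total ) )

by.peak$Name[ 1 ]    # top name by best single year
by.total$Name[ 1 ]   # top name by total over the period
```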
Let’s try to identify when specific female names have peaked.
- Create a subset of data for women.
- Group the data by “Name” so we can analyze each name separately.
- Find the year with the highest count for each name.
- Store this data as “peak.years”.
Each name will occur once in this dataset in the year that it experienced its peak popularity.
peak.years <-
  names %>%
  filter( Gender == "F" ) %>%
  group_by( Name ) %>%
  top_n( 1, Count ) %>%
  ungroup()
peak.years %>% head( 5 ) %>% pander()
Id | Name | Year | Gender | Count |
---|---|---|---|---|
568 | Manerva | 1880 | F | 10 |
720 | Neppie | 1880 | F | 7 |
2621 | Zilpah | 1881 | F | 9 |
4625 | Crete | 1882 | F | 8 |
4750 | Alwina | 1882 | F | 6 |
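One detail worth knowing: `top_n()` keeps ties, so a name whose highest count occurs in more than one year would appear once per tied year in `peak.years`. In dplyr 1.0.0 and later, `top_n()` is superseded by `slice_max()`, which makes the tie behaviour explicit through its `with_ties` argument. A sketch on invented data, assuming dplyr 1.0.0 or newer is installed:

```r
library( dplyr )

# Hypothetical data: "Ida" has the same peak count in two years
toy <- data.frame( Name  = c( "Ida", "Ida", "Ida", "Joy", "Joy" ),
                   Year  = c( 1900, 1901, 1902, 1900, 1901 ),
                   Count = c( 8, 8, 5, 3, 9 ) )

# top_n() keeps both tied rows for Ida
with.ties <-
  toy %>%
  group_by( Name ) %>%
  top_n( 1, Count ) %>%
  ungroup()

# slice_max() with with_ties = FALSE keeps exactly one row per name
one.per.name <-
  toy %>%
  group_by( Name ) %>%
  slice_max( Count, n = 1, with_ties = FALSE ) %>%
  ungroup()

nrow( with.ties )      # 3 rows: Ida twice, Joy once
nrow( one.per.name )   # 2 rows: one per name
```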
We can then filter by years to see which names peaked in a given period.
filter( peak.years, Year == 1950 ) %>%
arrange( desc( Count ) ) %>%
slice( 1:5 ) %>%
pander()
Id | Name | Year | Gender | Count |
---|---|---|---|---|
462006 | Constance | 1950 | F | 4442 |
462008 | Glenda | 1950 | F | 4213 |
462103 | Bonita | 1950 | F | 1527 |
462301 | Ilene | 1950 | F | 453 |
462305 | Marta | 1950 | F | 445 |
# library( ggvis )
names %>%
  filter( Name == "Constance" & Gender == "F" ) %>%
  select( Name, Year, Count ) %>%
  ggvis( ~Year, ~Count, stroke = ~Name ) %>%
  layer_lines()
top.five.1920 <-
  filter( peak.years, Year == 1920 ) %>%
  top_n( 5, Count )
top.five.1920
# A tibble: 5 × 5
Id Name Year Gender Count
<int> <chr> <int> <chr> <int>
1 169464 Ruth 1920 F 26100
2 169465 Mildred 1920 F 18058
3 169472 Marie 1920 F 12745
4 169477 Lillian 1920 F 10050
5 169481 Gladys 1920 F 8819
names %>%
  filter( Name %in% top.five.1920$Name & Gender == "F" ) %>%
  select( Name, Year, Count ) %>%
  ggvis( ~Year, ~Count, stroke = ~Name ) %>%
  layer_lines()
top.five.1975 <-
  filter( peak.years, Year == 1975 ) %>%
  top_n( 5, Count )
names %>%
  filter( Name %in% top.five.1975$Name & Gender == "F" ) %>%
  select( Name, Year, Count ) %>%
  ggvis( ~Year, ~Count, stroke = ~Name ) %>%
  layer_lines()
top.five.2000 <-
  filter( peak.years, Year == 2000 ) %>%
  top_n( 5, Count )
names %>%
  filter( Name %in% top.five.2000$Name & Gender == "F" ) %>%
  select( Name, Year, Count ) %>%
  ggvis( ~Year, ~Count, stroke = ~Name ) %>%
  layer_lines()
Ryan Burge posted a fun project on Kaggle about how to find hipster names using this historical data. He defines hipster names as those meeting the following criteria:
- They were popular when your grandmother was young.
- They were unpopular when your parents were young.
- They have recently become popular again.
Let’s stick with women’s names.
df1 <- filter( names, Gender == "F" & Year >= 1915 & Year <= 1935 & Count > 3000 )
df2 <- filter( names, Gender == "F" & Year == 1980 & Count <= 1000 )
df3 <- filter( names, Gender == "F" & Year >= 2010 & Count > 2000 )
hipster.names <-
  names %>%
  filter( Name %in% df1$Name & Name %in% df2$Name & Name %in% df3$Name ) %>%
  group_by( Name ) %>%
  dplyr::summarize( total=sum(Count), peak=max(Count) ) %>%
  arrange( desc( peak ) )
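The three `%in%` tests in the filter amount to a set intersection: a name qualifies only if it appears in all three subsets. The sketch below shows the same logic on invented name vectors using base R's `intersect()`. The vectors `grandma`, `parents`, and `recent` are hypothetical stand-ins for `df1$Name`, `df2$Name`, and `df3$Name`.

```r
# Hypothetical stand-ins for df1$Name, df2$Name, and df3$Name
grandma <- c( "Emma", "Ruth", "Alice", "Mildred" )   # popular 1915-1935
parents <- c( "Emma", "Alice", "Zelda" )             # rare in 1980
recent  <- c( "Emma", "Alice", "Mia" )               # popular since 2010

# A name is "hipster" only if it is in all three sets
hipster <- Reduce( intersect, list( grandma, parents, recent ) )

# Equivalent test written with %in%, mirroring the filter in the recipe
all.names <- unique( c( grandma, parents, recent ) )
same <- all.names[ all.names %in% grandma &
                   all.names %in% parents &
                   all.names %in% recent ]

hipster
```

Only "Emma" and "Alice" survive, because every other name is missing from at least one of the three sets.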
Here are the top 6 female hipster names:
top.hipster.names <-
  c( "Emma", "Evelyn", "Alice",
     "Grace", "Lillian", "Charlotte" )
names %>%
  filter( Name %in% top.hipster.names & Gender == "F" ) %>%
  select( Name, Year, Count ) %>%
  ggvis( ~Year, ~Count, stroke = ~Name ) %>%
  layer_lines()
And the full list:
hipster.names %>% pander()
Name | total | peak |
---|---|---|
Emma | 595546 | 22701 |
Evelyn | 534502 | 14279 |
Grace | 469034 | 12770 |
Alice | 551034 | 11956 |
Lillian | 421392 | 10050 |
Charlotte | 312022 | 10048 |
Ella | 273663 | 9868 |
Josephine | 297064 | 8683 |
Eleanor | 268153 | 8499 |
Ruby | 340143 | 8407 |
Hazel | 244069 | 7615 |
Clara | 268980 | 5779 |
Eva | 252465 | 4564 |
Lucy | 185064 | 4257 |
Stella | 155080 | 4165 |
Violet | 122984 | 4156 |
Vivian | 198012 | 4128 |
16.5 Conclusion
The pipe operator is a little confusing when you first encounter it, but you will find that using data verbs contained in the dplyr package and the pipe operator will speed up your analysis and make your code more readable.
In the next chapter we focus more on the use of groups in data science, and the applications of the group_by() function to make your job easier.