12  Descriptive Statistics

12.1 Some Useful Packages

library( stargazer )  # publication quality tables
library( skimr )      # quick and comprehensive descriptive stats
library( dplyr )      # data wrangling
library( pander )     # render R objects as markdown tables

Descriptive statistics are essential to any analysis, but they can be challenging to produce because each class of variable calls for different tables or statistics to be meaningful.
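As a small illustration of why variable class matters, the sketch below applies the same function to a numeric vector and to a factor. The two vectors hold made-up values chosen for this illustration only.

```r
# illustrative vectors (assumed values for this sketch)
height <- c( 73, 67, 62, 65 )
gender <- factor( c("male","male","female","female") )

# a mean is meaningful for a numeric variable
mean( height )

# a factor is better described by counts per level
table( gender )

# mean() is not defined for a factor: R returns NA with a warning,
# which is suppressed here to keep the output clean
suppressWarnings( mean( gender ) )
```

A single table of means would therefore silently drop or mangle the categorical variables, which is why the tools in this chapter choose statistics by data type.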

The most general tool is the core R summary() function. It prints basic descriptives for each variable in a dataset and chooses the statistics it reports based on data type:

name group gender height weight strength
adam treatment male 73 180 167
jamal control male 67 190 185
linda treatment female 62 130 119
sriti control female 65 140 142
summary( dat ) 
     name                 group      gender      height          weight         strength
 Length:4           control  :2   female:2   Min.   :62.00   Min.   :130.0   Min.   :119.0
 Class :character   treatment:2   male  :2   1st Qu.:64.25   1st Qu.:137.5   1st Qu.:136.2
 Mode  :character                            Median :66.00   Median :160.0   Median :154.5
                                             Mean   :66.75   Mean   :160.0   Mean   :153.2
                                             3rd Qu.:68.50   3rd Qu.:182.5   3rd Qu.:171.5
                                             Max.   :73.00   Max.   :190.0   Max.   :185.0
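The text does not show how dat was created. A minimal sketch that would reproduce the example follows. The column classes are assumptions inferred from the summary() output: name is treated as character, and group and gender as factors.

```r
# rebuild the four-person example; classes are inferred, not confirmed
dat <- data.frame(
  name     = c("adam","jamal","linda","sriti"),
  group    = factor( c("treatment","control","treatment","control") ),
  gender   = factor( c("male","male","female","female") ),
  height   = c( 73, 67, 62, 65 ),
  weight   = c( 180, 190, 130, 140 ),
  strength = c( 167, 185, 119, 142 ),
  stringsAsFactors = FALSE    # keep name as character
)

# character columns report length and class,
# factors report counts per level,
# numeric columns report quartiles and the mean
summary( dat )
```

Changing a column's class changes what summary() reports for it. For example, if group were stored as character it would show only length, class, and mode, as name does.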

This raw output is not polished enough to include in a report. Fortunately, several functions produce well-formatted tables for R Markdown reports. We will use the stargazer package extensively for regression results and descriptive statistics.

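library( stargazer )
# keep only the numeric columns
dat.numeric <- select( dat, where( is.numeric ) )
# type="html" emits raw HTML; in R Markdown set the chunk option results='asis'
stargazer( dat.numeric, type="html", digits=2, 
           summary.stat = c("n","min","median","mean","max","sd") )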
Statistic N Min Median Mean Max St. Dev.
height 4 62 66 66.75 73 4.65
weight 4 130 160 160.00 190 29.44
strength 4 119 154.5 153.25 185 28.85
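To see what each column of this table reports, the same statistics can be computed by hand. This is a sketch in base R, not a description of how stargazer works internally. It rebuilds the numeric columns from the values shown earlier so that it runs on its own, and the object name desc is an arbitrary choice.

```r
# numeric columns of the four-person example
dat.numeric <- data.frame(
  height   = c( 73, 67, 62, 65 ),
  weight   = c( 180, 190, 130, 140 ),
  strength = c( 167, 185, 119, 142 )
)

# one row per variable, one column per statistic;
# length() also counts missing values, so data with NAs would need
# sum( !is.na(x) ) and na.rm = TRUE in the other functions
desc <- t( sapply( dat.numeric, function( x ) {
  c( n = length( x ), min = min( x ), median = median( x ),
     mean = mean( x ), max = max( x ), sd = sd( x ) )
} ) )

round( desc, 2 )
```

The result is an ordinary numeric matrix, so it can be passed to other table functions or checked against the formatted output.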

In many instances we will be working with large datasets in which many of the variables are non-numeric. For example, the Lahman package contains a People data frame with demographic information on every Major League Baseball player in the league's history, which spans well over 100 years.

Variables contained in the People data frame in the Lahman package (variable name, class, description):

playerID factor A unique code assigned to each player. The playerID links the data in this file with records on players in the other files.
birthYear numeric Year player was born
birthMonth numeric Month player was born
birthDay numeric Day player was born
birthCountry character Country where player was born
birthState character State where player was born
birthCity character City where player was born
deathYear numeric Year player died
deathMonth numeric Month player died
deathDay numeric Day player died
deathCountry character Country where player died
deathState character State where player died
deathCity character City where player died
nameFirst character Player’s first name
nameLast character Player’s last name
nameGiven character Player’s given name (typically first and middle)
weight numeric Player’s weight in pounds
height numeric Player’s height in inches
bats factor Player’s batting hand (left (L), right (R), or both (B))
throws factor Player’s throwing hand (left (L) or right (R))
debut character Date that player made first major league appearance
finalGame character Date that player made last major league appearance (blank if still active)
retroID character ID used by retrosheet, http://www.retrosheet.org/
bbrefID character ID used by Baseball Reference website, http://www.baseball-reference.com/
birthDate date Player’s birthdate, in as.Date format
deathDate date Player’s deathdate, in as.Date format

In these cases, many of the summary functions will be of limited use. The skimr package was developed for large datasets like these. It will automatically create a set of summary tables for a variety of data types, and the default statistics are reasonable and informative:

library( skimr )
library( Lahman )

data( People )
skim( People )
Data summary
Name People
Number of rows 20676
Number of columns 26
Column type frequency:
character 14
Date 2
factor 2
numeric 8
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
playerID 0 1.00 5 9 0 20676 0
birthCountry 59 1.00 3 14 0 58 0
birthState 540 0.97 2 22 0 305 0
birthCity 168 0.99 3 25 0 4971 0
deathCountry 10582 0.49 3 14 0 25 0
deathState 10638 0.49 2 20 0 110 0
deathCity 10587 0.49 2 26 0 2737 0
nameFirst 37 1.00 2 14 0 2613 0
nameLast 0 1.00 2 16 0 10440 0
nameGiven 37 1.00 2 43 0 13740 0
debut 213 0.99 10 10 0 10842 0
finalGame 213 0.99 10 10 0 9731 0
retroID 49 1.00 8 8 0 20627 0
bbrefID 11 1.00 5 9 0 20665 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
deathDate 10580 0.49 1872-03-17 2023-01-20 1969-08-13 9031
birthDate 420 0.98 1820-04-17 2001-11-19 1945-08-15 16528

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
bats 1178 0.94 FALSE 3 R: 12823, L: 5419, B: 1256
throws 977 0.95 FALSE 3 R: 15728, L: 3970, S: 1

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
birthYear 109 0.99 1936.12 43.33 1820 1897 1944 1975 2001 ▁▅▅▅▇
birthMonth 278 0.99 6.63 3.46 1 4 7 10 12 ▇▅▅▆▇
birthDay 420 0.98 15.62 8.77 1 8 16 23 31 ▇▇▇▇▆
deathYear 10578 0.49 1967.56 33.61 1872 1944 1969 1995 2023 ▁▃▇▇▇
deathMonth 10579 0.49 6.48 3.54 1 3 6 10 12 ▇▅▅▅▇
deathDay 10580 0.49 15.52 8.79 1 8 15 23 31 ▇▇▇▆▆
weight 812 0.96 188.21 22.50 65 173 185 200 320 ▁▂▇▁▁
height 732 0.96 72.38 2.62 43 71 72 74 83 ▁▁▁▇▁
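When only a few variables matter, skim() can be narrowed. The following is a minimal sketch, assuming skimr version 2 or later (the version that prints the skim_variable layout shown above). Columns are selected by name after the data argument, and a grouped data frame yields one set of statistics per group.

```r
library( skimr )
library( dplyr )
library( Lahman )

data( People )

# summarize only the two body-size variables
skim( People, weight, height )

# the same statistics, split by batting hand
People %>%
  group_by( bats ) %>%
  skim( weight, height )
```

Restricting the columns keeps the report short enough to read when a data frame has dozens of variables.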

For more functionality see:

vignette( "Using_skimr", package = "skimr" )
browseVignettes( package = "skimr" )   # lists the vignettes in the installed version

There are many additional packages and tricks for producing descriptive statistics. Note, though, that most produce a print-out of summary statistics but do not return a useful “tidy” dataset that can be used in subsequent steps. For most data recipes, we will rely on the summarize() function in the dplyr package. Its utility will become obvious in the next two chapters.
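As a preview of that approach, here is a minimal sketch of summarize() applied to the four-person example from the start of the chapter. The data frame is rebuilt so the block runs on its own, and the output column names (n, mean.height, mean.strength) are arbitrary choices for this illustration.

```r
library( dplyr )

dat <- data.frame(
  group    = c("treatment","control","treatment","control"),
  height   = c( 73, 67, 62, 65 ),
  strength = c( 167, 185, 119, 142 )
)

# the result is itself a data frame with one row per group,
# so it can be filtered, joined, or plotted in later steps
stats <- dat %>%
  group_by( group ) %>%
  summarize( n             = n(),
             mean.height   = mean( height ),
             mean.strength = mean( strength ) )

stats
```

Because stats is a data frame and not printed text, a later step such as filter( stats, mean.height > 66 ) works without any re-parsing.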