12 Descriptive Statistics
12.1 Some Useful Packages
library( stargazer ) # publication quality tables
library( skimr ) # quick and comprehensive descriptive stats
library( dplyr ) # data wrangling
library( pander ) # render tables in R Markdown
Descriptive statistics are hugely important for any analysis, but they can be challenging to produce because different classes of variable require different tables or statistics to be meaningful.
The core R summary() function is the most general option. It prints basic descriptives for each variable in a dataset, choosing the statistics based on the variable's data type:
name | group | gender | height | weight | strength |
---|---|---|---|---|---|
adam | treatment | male | 73 | 180 | 167 |
jamal | control | male | 67 | 190 | 185 |
linda | treatment | female | 62 | 130 | 119 |
sriti | control | female | 65 | 140 | 142 |
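The chapter does not show how dat was created. One way to build a matching data frame is sketched below; the column types are an assumption inferred from the summary() output (name as character, group and gender as factors, the rest numeric).

```r
# Assumed reconstruction of the example data shown in the table above
dat <- data.frame(
  name     = c( "adam", "jamal", "linda", "sriti" ),
  group    = factor( c( "treatment", "control", "treatment", "control" ) ),
  gender   = factor( c( "male", "male", "female", "female" ) ),
  height   = c( 73, 67, 62, 65 ),
  weight   = c( 180, 190, 130, 140 ),
  strength = c( 167, 185, 119, 142 ),
  stringsAsFactors = FALSE )   # keep name as character

# Check the class of each column before summarizing
sapply( dat, class )
```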
summary( dat )
name | group | gender | height | weight | strength |
---|---|---|---|---|---|
Length:4 | control :2 | female:2 | Min. :62.00 | Min. :130.0 | Min. :119.0 |
Class :character | treatment:2 | male :2 | 1st Qu.:64.25 | 1st Qu.:137.5 | 1st Qu.:136.2 |
Mode :character | NA | NA | Median :66.00 | Median :160.0 | Median :154.5 |
NA | NA | NA | Mean :66.75 | Mean :160.0 | Mean :153.2 |
NA | NA | NA | 3rd Qu.:68.50 | 3rd Qu.:182.5 | 3rd Qu.:171.5 |
NA | NA | NA | Max. :73.00 | Max. :190.0 | Max. :185.0 |
This raw output is not polished enough to include in a report. Fortunately, there are some functions that produce nice tables for R Markdown reports. We will use the stargazer package extensively for regression results and descriptive statistics.
library( stargazer )
dat.numeric <- select_if( dat, is.numeric )
stargazer( dat.numeric, type="html", digits=2,
summary.stat = c("n","min","median","mean","max","sd") )
Statistic | N | Min | Median | Mean | Max | St. Dev. |
---|---|---|---|---|---|---|
height | 4 | 62 | 66 | 66.75 | 73 | 4.65 |
weight | 4 | 130 | 160 | 160.00 | 190 | 29.44 |
strength | 4 | 119 | 154.5 | 153.25 | 185 | 28.85 |
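When checking a table interactively, it can help to render it as plain text in the console before switching to HTML for the knitted report. The sketch below uses stargazer's type="text" option; the data frame is rebuilt by hand from the example table, which is an assumption about how the data were entered.

```r
library( stargazer )

# Assumed reconstruction of the numeric columns of the example data
dat.numeric <- data.frame(
  height   = c( 73, 67, 62, 65 ),
  weight   = c( 180, 190, 130, 140 ),
  strength = c( 167, 185, 119, 142 ) )

# type="text" prints an ASCII table instead of HTML;
# capture.output() stores the printed lines so they can be inspected
tab <- capture.output(
  stargazer( dat.numeric, type="text", digits=2,
             summary.stat = c("n","min","median","mean","max","sd") ) )

cat( tab, sep="\n" )
```

In an R Markdown document, the type="html" version typically needs the chunk option results='asis' so that the HTML is passed through to the report rather than printed as raw text.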
In many instances we will be working with a large dataset in which many variables are non-numeric. For example, the Lahman package contains a People data frame with demographic information on Major League Baseball players going back more than 100 years.
Variables contained in the People data frame in the Lahman package:
VARIABLE | CLASS | DESCRIPTION |
---|---|---|
playerID | character | A unique code assigned to each player. The playerID links the data in this file with records on players in the other files. |
birthYear | numeric | Year player was born |
birthMonth | numeric | Month player was born |
birthDay | numeric | Day player was born |
birthCountry | character | Country where player was born |
birthState | character | State where player was born |
birthCity | character | City where player was born |
deathYear | numeric | Year player died |
deathMonth | numeric | Month player died |
deathDay | numeric | Day player died |
deathCountry | character | Country where player died |
deathState | character | State where player died |
deathCity | character | City where player died |
nameFirst | character | Player’s first name |
nameLast | character | Player’s last name |
nameGiven | character | Player’s given name (typically first and middle) |
weight | numeric | Player’s weight in pounds |
height | numeric | Player’s height in inches |
bats | factor | a factor: Player’s batting hand (left (L), right (R), or both (B)) |
throws | factor | a factor: Player’s throwing hand (left(L) or right(R)) |
debut | character | Date that player made first major league appearance |
finalGame | character | Date that player made last major league appearance (blank if still active) |
retroID | character | ID used by retrosheet, http://www.retrosheet.org/ |
bbrefID | character | ID used by Baseball Reference website, http://www.baseball-reference.com/ |
birthDate | date | Player’s birthdate, in as.Date format |
deathDate | date | Player’s deathdate, in as.Date format |
In these cases, many of the summary functions will be of limited use. The skimr package was developed for large datasets like these. It will automatically create a set of summary tables for a variety of data types, and the default statistics are reasonable and informative:
library( skimr )
library( Lahman )
data( People )
skim( People )
Name | People |
Number of rows | 20676 |
Number of columns | 26 |
_______________________ | |
Column type frequency: | |
character | 14 |
Date | 2 |
factor | 2 |
numeric | 8 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
playerID | 0 | 1.00 | 5 | 9 | 0 | 20676 | 0 |
birthCountry | 59 | 1.00 | 3 | 14 | 0 | 58 | 0 |
birthState | 540 | 0.97 | 2 | 22 | 0 | 305 | 0 |
birthCity | 168 | 0.99 | 3 | 25 | 0 | 4971 | 0 |
deathCountry | 10582 | 0.49 | 3 | 14 | 0 | 25 | 0 |
deathState | 10638 | 0.49 | 2 | 20 | 0 | 110 | 0 |
deathCity | 10587 | 0.49 | 2 | 26 | 0 | 2737 | 0 |
nameFirst | 37 | 1.00 | 2 | 14 | 0 | 2613 | 0 |
nameLast | 0 | 1.00 | 2 | 16 | 0 | 10440 | 0 |
nameGiven | 37 | 1.00 | 2 | 43 | 0 | 13740 | 0 |
debut | 213 | 0.99 | 10 | 10 | 0 | 10842 | 0 |
finalGame | 213 | 0.99 | 10 | 10 | 0 | 9731 | 0 |
retroID | 49 | 1.00 | 8 | 8 | 0 | 20627 | 0 |
bbrefID | 11 | 1.00 | 5 | 9 | 0 | 20665 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
deathDate | 10580 | 0.49 | 1872-03-17 | 2023-01-20 | 1969-08-13 | 9031 |
birthDate | 420 | 0.98 | 1820-04-17 | 2001-11-19 | 1945-08-15 | 16528 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
bats | 1178 | 0.94 | FALSE | 3 | R: 12823, L: 5419, B: 1256 |
throws | 977 | 0.95 | FALSE | 3 | R: 15728, L: 3970, S: 1 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
birthYear | 109 | 0.99 | 1936.12 | 43.33 | 1820 | 1897 | 1944 | 1975 | 2001 | ▁▅▅▅▇ |
birthMonth | 278 | 0.99 | 6.63 | 3.46 | 1 | 4 | 7 | 10 | 12 | ▇▅▅▆▇ |
birthDay | 420 | 0.98 | 15.62 | 8.77 | 1 | 8 | 16 | 23 | 31 | ▇▇▇▇▆ |
deathYear | 10578 | 0.49 | 1967.56 | 33.61 | 1872 | 1944 | 1969 | 1995 | 2023 | ▁▃▇▇▇ |
deathMonth | 10579 | 0.49 | 6.48 | 3.54 | 1 | 3 | 6 | 10 | 12 | ▇▅▅▅▇ |
deathDay | 10580 | 0.49 | 15.52 | 8.79 | 1 | 8 | 15 | 23 | 31 | ▇▇▇▆▆ |
weight | 812 | 0.96 | 188.21 | 22.50 | 65 | 173 | 185 | 200 | 320 | ▁▂▇▁▁ |
height | 732 | 0.96 | 72.38 | 2.62 | 43 | 71 | 72 | 74 | 83 | ▁▁▁▇▁ |
For more functionality see:
vignette( "Using_skimr", package = "skimr" )
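Two features that are often useful are skimming only selected columns and skimming within groups. The sketch below shows both on a small hand-built data frame so that it runs without the Lahman package; the data frame is an assumed reconstruction of the earlier example, and the same calls should apply to People.

```r
library( dplyr )
library( skimr )

# Assumed reconstruction of the small example data
dat <- data.frame(
  group    = factor( c( "treatment", "control", "treatment", "control" ) ),
  height   = c( 73, 67, 62, 65 ),
  weight   = c( 180, 190, 130, 140 ) )

# Skim only the columns of interest
sk <- skim( dat, height, weight )

# Skim the same columns separately for each group
sk.grouped <- dat %>% group_by( group ) %>% skim( height, weight )

# The result is a data frame, so individual statistics can be extracted
as.data.frame( sk )[ , c( "skim_variable", "numeric.mean", "numeric.sd" ) ]
```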
There are many additional packages and tricks for producing descriptive statistics. Note, though, that most produce a print-out of summary statistics but do not return a useful “tidy” dataset that can be used in subsequent steps. For most data recipes, we will rely on the summarize() function in the dplyr package. Its utility will become obvious in the next two chapters.
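As a brief preview, here is a minimal sketch of summarize() returning a tidy data frame of group-level statistics. The data frame is an assumed reconstruction of the small example from earlier in the chapter, and the output column names (n, mean.height, mean.strength) are chosen here for illustration.

```r
library( dplyr )

# Assumed reconstruction of the small example data
dat <- data.frame(
  group    = factor( c( "treatment", "control", "treatment", "control" ) ),
  height   = c( 73, 67, 62, 65 ),
  strength = c( 167, 185, 119, 142 ) )

# One row per group, one column per statistic
dat.summary <-
  dat %>%
  group_by( group ) %>%
  summarize( n = n(),
             mean.height = mean( height ),
             mean.strength = mean( strength ) )

dat.summary
```

Because dat.summary is an ordinary data frame, it can be filtered, joined, or plotted in later steps, which printed summaries do not allow.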