12 Descriptive Statistics

12.1 Some Useful Packages

library( stargazer )  # publication quality tables
library( skimr )      # quick and comprehensive descriptive stats
library( dplyr )      # data wrangling
library( pander )     # render R objects as markdown tables

Descriptive statistics are essential to any analysis, but they can be challenging to produce because different classes of variables require different tables or statistics to be meaningful.

The most general tool is the base R summary() function. It prints basic descriptives for every variable in a dataset and chooses which statistics to report based on each variable's data type:

name	group	gender	height	weight	strength
adam	treatment	male	73	180	167
jamal	control	male	67	190	185
linda	treatment	female	62	130	119
sriti	control	female	65	140	142
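The code that creates dat is not shown in this chapter. A minimal sketch that rebuilds it from the table above, assuming group and gender are stored as factors (summary() reports counts for them) and name is stored as character:

```r
# Rebuild the example data frame by hand (values copied from the table above).
dat <- data.frame(
  name     = c( "adam", "jamal", "linda", "sriti" ),
  group    = factor( c( "treatment", "control", "treatment", "control" ) ),
  gender   = factor( c( "male", "male", "female", "female" ) ),
  height   = c( 73, 67, 62, 65 ),
  weight   = c( 180, 190, 130, 140 ),
  strength = c( 167, 185, 119, 142 ),
  stringsAsFactors = FALSE   # keep name as character rather than factor
)

sapply( dat, class )   # confirm the class of each column
```

The column classes matter here because summary() decides what to report from them: counts for factors, quantiles and the mean for numeric vectors, and only length and class for character vectors.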

summary( dat )

name	group	gender	height	weight	strength
Length:4	control :2	female:2	Min. :62.00	Min. :130.0	Min. :119.0
Class :character	treatment:2	male :2	1st Qu.:64.25	1st Qu.:137.5	1st Qu.:136.2
Mode :character	NA	NA	Median :66.00	Median :160.0	Median :154.5
NA	NA	NA	Mean :66.75	Mean :160.0	Mean :153.2
NA	NA	NA	3rd Qu.:68.50	3rd Qu.:182.5	3rd Qu.:171.5
NA	NA	NA	Max. :73.00	Max. :190.0	Max. :185.0

This raw output is not polished enough to include in a report. Fortunately, several functions produce well-formatted tables for R Markdown reports. We will use the stargazer package extensively for both regression results and descriptive statistics.

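library( stargazer )
# keep only the numeric columns (select_if comes from dplyr)
dat.numeric <- select_if( dat, is.numeric )
# in R Markdown, type="html" output needs the chunk option results='asis'
stargazer( dat.numeric, type="html", digits=2, 
           summary.stat = c("n","min","median","mean","max","sd") )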


Statistic	N	Min	Median	Mean	Max	St. Dev.
height	4	62	66	66.75	73	4.65
weight	4	130	160	160.00	190	29.44
strength	4	119	154.5	153.25	185	28.85
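stargazer produces formatted output rather than a data frame, so it can help to cross-check the numbers, or keep them for later use, with base R. A minimal sketch, assuming the same values as the example data (the numeric columns are re-created here so the block runs on its own; the helper name describe_num is hypothetical):

```r
dat.numeric <- data.frame(
  height   = c( 73, 67, 62, 65 ),
  weight   = c( 180, 190, 130, 140 ),
  strength = c( 167, 185, 119, 142 )
)

# hypothetical helper: the same six statistics requested from stargazer
describe_num <- function( x ) {
  c( n      = sum( !is.na( x ) ),
     min    = min( x, na.rm=TRUE ),
     median = median( x, na.rm=TRUE ),
     mean   = mean( x, na.rm=TRUE ),
     max    = max( x, na.rm=TRUE ),
     sd     = sd( x, na.rm=TRUE ) )
}

# sapply returns one column per variable; transpose for one row per variable
stats <- t( sapply( dat.numeric, describe_num ) )
round( stats, 2 )
```

The result is a plain numeric matrix, so individual values can be pulled out by name, for example stats[ "height", "mean" ].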

In many instances we will be working with a large dataset in which many variables are non-numeric. For example, the Lahman package contains a People data frame with demographic information on Major League Baseball players going back to 1871.

Variables contained in the People data frame in the Lahman package:

VARIABLE	CLASS	DESCRIPTION
playerID	character	A unique code assigned to each player. The playerID links the data in this file with records on players in the other files.
birthYear	numeric	Year player was born
birthMonth	numeric	Month player was born
birthDay	numeric	Day player was born
birthCountry	character	Country where player was born
birthState	character	State where player was born
birthCity	character	City where player was born
deathYear	numeric	Year player died
deathMonth	numeric	Month player died
deathDay	numeric	Day player died
deathCountry	character	Country where player died
deathState	character	State where player died
deathCity	character	City where player died
nameFirst	character	Player’s first name
nameLast	character	Player’s last name
nameGiven	character	Player’s given name (typically first and middle)
weight	numeric	Player’s weight in pounds
height	numeric	Player’s height in inches
bats	factor	Player’s batting hand: left (L), right (R), or both (B)
throws	factor	Player’s throwing hand: left (L) or right (R)
debut	character	Date that player made first major league appearance
finalGame	character	Date that player made last major league appearance (blank if still active)
retroID	character	ID used by retrosheet, http://www.retrosheet.org/
bbrefID	character	ID used by Baseball Reference website, http://www.baseball-reference.com/
birthDate	Date	Player’s birthdate, in as.Date format
deathDate	Date	Player’s deathdate, in as.Date format

In these cases, many of the summary functions will be of limited use. The skimr package was developed for large datasets like these. It will automatically create a set of summary tables for a variety of data types, and the default statistics are reasonable and informative:

library( skimr )
library( Lahman )

data( People )
skim( People )

Data summary
Name	People
Number of rows	20676
Number of columns	26
_______________________
Column type frequency:
character	14
Date	2
factor	2
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
playerID	0	1.00	5	9	20676
birthCountry	59	1.00	3	14	58
birthState	540	0.97	2	22	305
birthCity	168	0.99	3	25	4971
deathCountry	10582	0.49	3	14	25
deathState	10638	0.49	2	20	110
deathCity	10587	0.49	2	26	2737
nameFirst	37	1.00	2	14	2613
nameLast	0	1.00	2	16	10440
nameGiven	37	1.00	2	43	13740
debut	213	0.99	10	10	10842
finalGame	213	0.99	10	10	9731
retroID	49	1.00	8	8	20627
bbrefID	11	1.00	5	9	20665

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
deathDate	10580	0.49	1872-03-17	2023-01-20	1969-08-13	9031
birthDate	420	0.98	1820-04-17	2001-11-19	1945-08-15	16528

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
bats	1178	0.94	FALSE	3	R: 12823, L: 5419, B: 1256
throws	977	0.95	FALSE	3	R: 15728, L: 3970, S: 1

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
birthYear	109	0.99	1936.12	43.33	1820	1897	1944	1975	2001	▁▅▅▅▇
birthMonth	278	0.99	6.63	3.46	1	4	7	10	12	▇▅▅▆▇
birthDay	420	0.98	15.62	8.77	1	8	16	23	31	▇▇▇▇▆
deathYear	10578	0.49	1967.56	33.61	1872	1944	1969	1995	2023	▁▃▇▇▇
deathMonth	10579	0.49	6.48	3.54	1	3	6	10	12	▇▅▅▅▇
deathDay	10580	0.49	15.52	8.79	1	8	15	23	31	▇▇▇▆▆
weight	812	0.96	188.21	22.50	65	173	185	200	320	▁▂▇▁▁
height	732	0.96	72.38	2.62	43	71	72	74	83	▁▁▁▇▁

For more functionality see:

vignette( "Using_skimr", package = "skimr" )
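As one small example of that functionality, skim() accepts column names after the data argument, and its result can be stored and inspected like a data frame. A sketch, assuming the skimr and Lahman packages are installed (the object name sk is arbitrary):

```r
library( skimr )
library( Lahman )

data( People )

# summarize only the two body-size variables instead of all 26 columns
sk <- skim( People, height, weight )
sk

# the stored result has one row per skimmed variable
sk$skim_variable
```

Limiting the call to a few columns keeps the printed summary short when only part of a wide dataset is relevant.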

There are many additional packages and tricks for producing descriptive statistics. Note, though, that most print summary statistics without returning a “tidy” dataset that can be used in subsequent steps. For most data recipes, we will rely on the summarize() function in the dplyr package. Its utility will become obvious in the next two chapters.
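As a brief preview of that approach, summarize() returns an ordinary data frame, so the results can feed directly into later steps. A minimal sketch, assuming the four-person example data from the start of the chapter (re-created here so the block runs on its own; the output column names are arbitrary):

```r
library( dplyr )

dat <- data.frame(
  group    = c( "treatment", "control", "treatment", "control" ),
  height   = c( 73, 67, 62, 65 ),
  strength = c( 167, 185, 119, 142 )
)

# one row per group, one column per statistic
group.stats <- 
  dat %>%
  group_by( group ) %>%
  summarize( n = n(),
             mean.height   = mean( height ),
             mean.strength = mean( strength ) )

group.stats
```

Because group.stats is itself a data frame, it can be filtered, joined, or passed to a table or plotting function like any other dataset.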