DP4SS
Data Programming for the Social Sciences (DP4SS)
An introduction to data programming in R for social science audiences.
Jesse Lecy & Jamison Crawford
Attribution · NonCommercial · ShareAlike
Source Code
This textbook is being developed by adapting lecture notes and resources from a graduate-level introductory course in data science that is offered at the Watts College of Public Service at Arizona State University.
Comments and suggestions are welcome!
CONTENTS:
- Data Programming for the Social Sciences (DP4SS)
- Your Data Science Toolkit
- Getting Started
- Using R
- One-Dimensional Datasets
- Two-Dimensional Datasets
- Data IO
- Data Wrangling (dplyr)
- Explore and Describe
- Efficient Analysis With Groups
- Visualization
- Dynamic Documents
Your Data Science Toolkit
You will need three tools to manage your data science projects: a data programming language (R), a project management interface (R Studio), and a way to create data-driven documents (R Markdown).
Core R
- What is R? [ video ]
- Packages
R Studio
- Installing R and R Studio
- Tour of R Studio
Data-Driven Docs
- Automation & Flexibility
- The Importance of Reproducibility
- Formats
- Gallery
Markdown
- R Markdown Formats overview
- Headers and Chunks
- Knitting
- Customization
Getting Started
These are some useful resources and guides for learning how to program if you are new to R or data programming.
Starting to Code
Getting Help
- Help files
- Error messages
- Discussion boards
The Learning Curve
- Vocabulary and verbs
- Learning to Learn R
Using R
Functions, variables, and operators are the core components of any functional programming language. These first chapters are foundational for everything moving forward.
R as a Calculator
- Mathematical Operators
- Objects
- Assignment
Functions
- Input-Output Devices
- Arguments
- Values
- Returns
Logical Operators
- Logical operators
- equal
- not equal
- greater than or less than
- opposite of
Special Operators
- Unique values
- Duplicates
- Missing values (NA)
- Maximum
- Minimum
One-Dimensional Datasets
Vectors are the building blocks of analysis in R. They come in a variety of flavors; we cover the four most salient data types here: numeric, character, factor (categories), and logical (Boolean).
Vectors
- Vector Types
- Numeric (v)
- Character (s)
- Factor (ordered vs unordered) (f)
- Logical (true/false) (L)
- Checking vector types
- data class
- data mode
Converting Data Type
- Casting
- explicit casting
- implicit casting (coercion)
- Information loss
- Care with factors
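The casting topics listed above can be illustrated with a minimal sketch in base R. The vector names (`s`, `v`, `mixed`, `lossy`, `f`) are hypothetical and chosen only for illustration; the example assumes nothing beyond a standard R installation.

```r
# Explicit casting: convert a character vector to numeric
s <- c("10", "20", "30")
v <- as.numeric(s)

# Implicit casting (coercion): mixing types inside c() promotes
# every element to the most flexible type, here character
mixed <- c(1, "a", TRUE)
mixed_class <- class(mixed)

# Information loss: text that cannot be read as a number becomes NA
# (R also raises a warning, suppressed here to keep the output clean)
lossy <- suppressWarnings(as.numeric(c("1", "two", "3")))

# Care with factors: as.numeric() on a factor returns the internal
# level codes, so convert to character first to recover the values
f <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)
values <- as.numeric(as.character(f))
```

The last two lines show why factors deserve care: `codes` holds the level positions, while `values` holds the numbers that were printed.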
Variable Transformations
- Linear transformations
- vectorized functions
- recycling rules
- Recoding values
- find and replace
- recoding factors
- Floors and ceilings
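As a rough illustration of vectorized functions, recycling rules, and find-and-replace recoding, consider the sketch below. The variable names and values are made up for the example.

```r
# Linear transformation with vectorized arithmetic:
# the formula is applied to every element at once
celsius    <- c(0, 25, 100)
fahrenheit <- celsius * 9 / 5 + 32

# Recycling: the shorter vector is reused until it matches
# the length of the longer one
v        <- c(1, 2, 3, 4)
recycled <- v + c(10, 20)

# Recoding by find and replace: a logical vector selects
# the elements to overwrite
s <- c("yes", "no", "yes")
s[ s == "yes" ] <- "Y"
```

Recycling is convenient but silent, so it is worth checking vector lengths when a result looks surprising.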
Two-Dimensional Datasets
Vectors typically represent individual variables in the social science context. A dataset contains IDs for individuals, and multiple measures from each individual. Typically data is organized so that columns represent distinct variables and rows represent individuals in the dataset. This spreadsheet representation of data is operationalized as data frames in R. Here you learn how to construct and manipulate data frames.
Dataframes
- Creating data frames from vectors
- rows and columns
- the $ operator
- Checking and changing class types
Dataframe Subsets
- Filter rows and select columns
- the [] operator
- dplyr::filter and dplyr::select
- Reorder rows or columns
- sort() versus order()
- dplyr::arrange
Dataframe Constructors
- Building data objects: data.frame() vs cbind() and rbind()
- Variable transformations in df’s
- assignment inside a df: dat$x_squared <- x * x
- dplyr::mutate vs dplyr::transmute()
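The constructors named above might be used as in the following sketch. It relies on base R only, and the object names (`id`, `x`, `dat`, `m`, `dat2`) are hypothetical.

```r
# Build a data frame from two vectors of equal length
id  <- c("a", "b", "c")
x   <- c(1, 2, 3)
dat <- data.frame(id, x, stringsAsFactors = FALSE)

# Assignment inside a data frame: add a transformed variable
dat$x_squared <- dat$x * dat$x

# cbind() on plain vectors returns a matrix, not a data frame
m <- cbind(x, x_squared = x * x)

# rbind() appends rows; the column names must match
new_row <- data.frame(id = "d", x = 4, x_squared = 16,
                      stringsAsFactors = FALSE)
dat2 <- rbind(dat, new_row)
```

The contrast between `dat` and `m` is the main point: `data.frame()` keeps mixed column types, while `cbind()` on vectors produces a single-type matrix.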
Matrices and Lists
- Matrix
- Lists
- Conversions:
- matrix to df
- list to df
Data IO
Data import and export [ input / output ].
Navigation
- Working directories
- paths: windows v linux
- current working directory: getwd()
- change working directory: setwd()
- check files in directory: dir()
- create new folder: dir.create("name")
- Unzip files: unzip("filename")
- Delete files tutorial
Built-In Datasets
- Core R datasets
- Datasets in packages
- Packages that are data
Importing Data into R
- Read options
- Copy and paste from Excel
- Using rdata format
- Read from csv or tsv
- Read text files
- Import from Excel
- Import from common format (foreign package)
- Import from the web (RCurl)
- Import from GitHub
- Import from DropBox
- [ tutorial ]
Exporting Data
- Write options
- CSV
- R Data Sets (RDS)
- CSV vs RDS
- Tables
- RData Format
- SPSS or Stata
- Copy to Clipboard
- Copy to Excel
- [ tutorial ]
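One way to see the CSV versus RDS trade-off is a small round trip, sketched below. The data frame is invented, and the files are written to temporary paths so nothing in the working directory is touched.

```r
# A small data frame with a factor column
dat <- data.frame(id = 1:3, group = factor(c("a", "b", "a")))

# Temporary file paths (removed when the R session ends)
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# Export in both formats
write.csv(dat, csv_path, row.names = FALSE)
saveRDS(dat, rds_path)

# Import again
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
from_rds <- readRDS(rds_path)

# CSV is plain text, so the factor comes back as character;
# RDS stores the R object itself, so the class is preserved
csv_class <- class(from_csv$group)
rds_class <- class(from_rds$group)
```

CSV files can be opened by almost any program, which makes them a good choice for sharing; RDS is usually the safer choice when the data will be read back into R.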
APIs
- What is an API?
- Examples
- Census
- Socrata
- [ Demo with DataUSA API ]
Data Wrangling (dplyr)
Data wrangling is the process of preparing data for analysis, which includes reading data into R from a variety of formats, cleaning data, tidying datasets, creating subsets and filters, transforming variables, grouping data, and joining multiple datasets.
The goal of data wrangling is to create a rodeo dataset (clean and well-structured) that is ready for the big show (modeling and visualization)!
Slicing Datasets
- Subset operator []
- by position
- by name
- by logical vector
- with recycling
- Selector vectors
- Subset by row
- dat[ row_selector , ]
- dplyr::filter( dat, row_selector )
- Subset by column
- dat[ , column_selector ]
- dplyr::select( dat, column_selector )
- Reorder
- with index
- order / match
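The slicing patterns in this list can be sketched in base R as follows. The data frame `dat` and its columns are hypothetical, and `row_selector` follows the naming used in the outline.

```r
dat <- data.frame(name = c("ann", "bob", "cat", "dan"),
                  age  = c(25, 40, 31, 52),
                  stringsAsFactors = FALSE)

# Subset rows by position
first_and_third <- dat[ c(1, 3), ]

# Subset a column by name (returns a vector)
ages <- dat[ , "age" ]

# Subset rows with a logical selector vector:
# TRUE keeps the row, FALSE drops it
row_selector <- dat$age > 30
older <- dat[ row_selector, ]

# Reorder rows with an index built by order()
sorted <- dat[ order(dat$age, decreasing = TRUE), ]
```

The dplyr equivalents listed above, `filter()` and `select()`, express the same row and column selections with a different syntax.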
Data Wrangling Recipes
- Pipe operator
- Window vs summary functions
- dplyr cheat sheet
Combining Datasets
- merge() and match()
- join() in dplyr
- inner, outer, right, left
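A minimal sketch of combining datasets with base R `merge()` and `match()` is shown below. The two data frames are invented, and the join types are expressed through the `all`, `all.x`, and `all.y` arguments of `merge()` rather than through dplyr.

```r
people <- data.frame(id = c(1, 2, 3),
                     name = c("ann", "bob", "cat"),
                     stringsAsFactors = FALSE)
scores <- data.frame(id = c(2, 3, 4),
                     score = c(90, 85, 70))

# Inner join: keep only ids found in both tables
inner <- merge(people, scores, by = "id")

# Left join: keep every row of the first table
left <- merge(people, scores, by = "id", all.x = TRUE)

# Outer join: keep every id from either table
outer <- merge(people, scores, by = "id", all = TRUE)

# match() returns the position of each id in the second table
# (NA when absent), which can be used to look up values
pos <- match(people$id, scores$id)
looked_up <- scores$score[pos]
```

Rows without a partner are filled with `NA` in the left and outer joins, so it is worth counting missing values after any merge.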
Explore and Describe
The first step in the data science process is to get to know your data through descriptive and exploratory analysis that searches for useful patterns or trends. We accomplish this here through summary statistics, and in a later section through visualization.
Summarizing Vectors
- Counting things: sum( logical statement )
- Counting missing data: sum( is.na(x) )
- Categorical data: table( f1, f2 ), prop.table() and margin.table()
- Numeric data: min, max, mean, median, summary, quantile
- all vectors at once: summary( data.frame )
Summarizing Groups of Vectors
- table( f1, f2 )
- ftable( row.vars=c("f1","f2"), col.vars="f3" )
- Function over groups: tapply( v1, f1 ) or dplyr::group_by() + summarise()
- Functions over levels of numeric data: tapply( v1, cut(v2) )
- tapply( v1, INDEX=list(f1,f2) ) or dplyr::group_by() + summarise()
- aggregate( dat, FUN, by=f1 )
- https://cran.r-project.org/web/packages/DescTools/vignettes/DescToolsCompanion.pdf
- v1, v2 using cor() or visually with pairs()
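The group summaries listed above might look like the following in base R. The names `f1`, `f2`, `v1`, and `v2` mirror the placeholders in the outline, and the data are invented.

```r
dat <- data.frame(f1 = c("a", "a", "b", "b"),
                  f2 = c("x", "y", "x", "y"),
                  v1 = c(1, 3, 5, 7),
                  v2 = c(2, 4, 6, 8),
                  stringsAsFactors = FALSE)

# Two-way table of counts for the categorical variables
counts <- table(dat$f1, dat$f2)

# Function over groups: mean of v1 within each level of f1
means <- tapply(dat$v1, dat$f1, mean)

# Function over combinations of two grouping variables
means_2way <- tapply(dat$v1, INDEX = list(dat$f1, dat$f2), mean)

# The same group means as a data frame, using the formula interface
agg <- aggregate(v1 ~ f1, data = dat, FUN = mean)

# Linear association between two numeric variables
r <- cor(dat$v1, dat$v2)
```

`tapply()` returns a named vector or array, while `aggregate()` returns a data frame, which is often easier to merge back into other tables.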
Efficient Analysis With Groups
As you become proficient with descriptive analysis you will want to find ways to be more efficient. Unless you learn how to scale data exploration and modeling you will not be able to quickly identify patterns in your data. The most efficient way to scale your analysis is to understand the dimensionality or internal problem space in your data, and use apply functions in R to replicate analysis over many groups at once.
Groups
- Logical statements
- define group criteria
- TRUE signifies membership
- Group constructors
- from categorical variables
- from numeric variables
- from strings
- from missing values
- Compound logical statements: AND and OR
- Casting logical vectors
Group Structure
- Combining factors and numeric data for analysis
- Faceting in plots
Counting Group Members
- Mathematical operators with logical vectors
- counts of members: sum( L1 )
- proportions of members: mean( L1 )
- Conditional proportions
- subset then tabulate
- logical statement in numerator and denominator
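Counting with logical vectors works because R treats TRUE as 1 and FALSE as 0 in arithmetic. The sketch below uses invented data; `L1` follows the naming in the outline, and `female` is a hypothetical second attribute.

```r
age    <- c(15, 22, 37, 41, 68)
female <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# Logical statement defines the group: TRUE signifies membership
L1 <- age >= 18

# Counts of members and proportion of members
n_adults <- sum(L1)
p_adults <- mean(L1)

# Conditional proportion, two equivalent approaches:
# (1) logical statements in numerator and denominator
p_female_given_adult <- sum(L1 & female) / sum(L1)

# (2) subset first, then tabulate
p_female_subset <- mean(female[L1])
```

Both approaches give the share of adults who are female; the subset version is usually easier to read, while the ratio version makes the denominator explicit.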
The Mathematics of Groups
- Group structure
- generalizing logical statements
- Group dimensionality
- how many unique groups are in the data?
- combinatorics of attributes
- total groups from f1 and f2 = nlevels(f1) * nlevels(f2)
- Groups as problem spaces
- complexity theory
- search
- dimension reduction
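The idea of group dimensionality can be made concrete with a short sketch: the number of possible groups is the product of the factor levels, while the number of groups actually observed may be smaller. The factors below are invented for illustration.

```r
f1 <- factor(c("urban", "rural", "urban", "rural"))
f2 <- factor(c("low", "mid", "high", "low"))

# Possible groups: every combination of levels of f1 and f2
possible_groups <- nlevels(f1) * nlevels(f2)

# Observed groups: unique combinations that appear in the data
observed_groups <- nrow(unique(data.frame(f1, f2)))
```

Here two levels times three levels gives six possible groups, but only four combinations occur in the data. That gap grows quickly as more attributes are added.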
Analysis with Groups
- Contingency tables
- counts of members: f1 * f2
- Statistics by group
- function applied over a group: v1 ~ f1 * f2
- apply() functions
- dplyr group_by() and summarize() functions
Latent Groups
- clustering
- unsupervised learning approaches
Visualization
For a great overview with examples of R code:
Wilke, C. O. (2019). Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. O’Reilly Media. FREE EBOOK
Principles of Visual Communication
- Ground, figure, narrative (context, subject, action)
- Tufte’s rules
- Visual tragedies
Core Graphics Engine
- plot() function
- Arguments:
- plot point types
- colors
- size
- axis labels
- plot title
Customizing Graphics
- Defining a canvas: xlim, ylim
- Adding data
- Type (point, line, both)
- Symbols
- Color
- Size
- Adding grids
- Adding axes
- Adding titles / axes labels
- Adding data labels: text()
- Margins
Colors in R
- select by name:
- color theory
- value
- shade, tint, tone
- hue, saturation
- transparency
- color values
- color functions
Advanced Plot Features
- Custom fonts
- Math symbols
- Multiple plots (core graphics)
- incorrect: https://en.wikipedia.org/wiki/File:Smallmult.png#/media/File:Smallmult.png
- Custom graph layouts
Grammar of Graphics and ggplot2
- Grammar of graphics concept
- ggplot overview
Animations
Dynamic Documents
R shiny
- What makes documents dynamic?
- Widgets
- input objects
- Widgets Gallery
- Render functions
- Reactive functions
- [ tutorial ]
Dashboards in R
- Principles of good dashboard design
- Layouts
- Sidebars
- Value boxes
- [ demo RMD ]
Customizing Styles
- CSS: cascading style sheets