The Data Science Toolkit

You will need three tools to manage your data science projects: a data programming language (R), a project management interface (R Studio), and a way to create data-driven documents (R Markdown).

Core R [ CH-01 ]

What is R? [ video ]
Packages

R Studio [ CH-02 ]

Installing R and R Studio
Tour of R Studio

Data-Driven Docs [ CH-03 ]

Automation & Flexibility
The Importance of Reproducibility
Formats [ link ]
Gallery [ link ]

Markdown [ CH-04 ]

R Markdown Formats [ overview ]
- Headers and Chunks [ link ]
- Knitting [ link ]
- Customization

Getting Started

R as a Calculator [ CH-05 ]

Mathematical Operators
Assignment
Objects

Functions [ CH-06 ]

Input-Output Devices
Arguments
Values
Returns

The Learning Curve [ CH-07 ]

Vocabulary and verbs
Learning to Learn R

Getting Help [ CH-08 ]

Help files
Error messages
Discussion boards

Starting to Code

One-Dimensional Datasets

Intro to Vectors [ CH-09 ]

Observations vs Variables (rows vs columns)
Vector Types
- Numeric
- Character
- Factors (ordered vs unordered)
- Logical (true/false)
Checking Vector Types
Casting
- Implicit Casting (coercion)
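
A minimal sketch of the vector types, type checks, and casting listed above. The object names and values (ages, people, size, and so on) are made up for illustration and do not come from the course materials.

```r
# Hypothetical example vectors, one per type
ages     <- c(23, 35, 41)                       # numeric
people   <- c("ann", "bob", "cy")               # character
group    <- factor(c("low", "high", "low"))     # unordered factor
size     <- factor(c("small", "large", "medium"),
                   levels = c("small", "medium", "large"),
                   ordered = TRUE)              # ordered factor
is_adult <- ages >= 18                          # logical

# Checking vector types
class(ages)        # "numeric"
class(people)      # "character"
class(group)       # "factor"
is.ordered(size)   # TRUE
class(is_adult)    # "logical"

# Explicit casting with the as.*() family
ages_as_text <- as.character(ages)     # "23" "35" "41"
adult_as_num <- as.numeric(is_adult)   # 1 1 1

# Casting a factor to numeric returns the internal level codes,
# so convert to character first when the labels are numbers
years       <- factor(c("2010", "2020", "2010"))
year_codes  <- as.numeric(years)                 # 1 2 1
year_values <- as.numeric(as.character(years))   # 2010 2020 2010

# Implicit casting (coercion): mixing types in one vector
# silently converts everything to the most general type
mixed <- c(1, "a", TRUE)
class(mixed)       # "character"
```

The factor example is included because casting a factor directly to numeric is a common source of silent errors.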

Identifying Groups within Data [ CH-10 ]

Set theory as categories and membership
Logical Operators
- equal
- not equal
- greater than or less than
- opposite of
Compound Statements: AND and OR
Casting logical vectors
Algebra with logical vectors
Defining groups
- from categorical variables
- from numeric variables
- missing values as a group
Recoding Values
Find and replace
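
The topics in this chapter can be sketched with a small made-up dataset. The vectors age and job below are hypothetical; the point is only to show how logical statements define group membership, how they combine, and how they behave in arithmetic.

```r
# Hypothetical data: six people, one missing age
age <- c(23, 35, 41, NA, 67, 52)
job <- c("nurse", "teacher", "nurse", "clerk", "teacher", "nurse")

# Logical operators return one TRUE/FALSE value per observation
is_nurse  <- job == "nurse"    # equal
not_nurse <- job != "nurse"    # not equal
over_40   <- age > 40          # greater than (NA stays NA)
opposite  <- !is_nurse         # opposite of

# Compound statements: AND and OR
older_nurse    <- is_nurse & over_40
nurse_or_older <- is_nurse | over_40

# Algebra with logical vectors: TRUE counts as 1, FALSE as 0
n_nurses     <- sum(is_nurse)    # 3
share_nurses <- mean(is_nurse)   # 0.5

# Defining groups from a numeric variable,
# with missing values treated as their own group
age_group <- ifelse(age > 40, "over 40", "40 or under")
age_group[is.na(age)] <- "unknown"

# Recoding values (find and replace) with a logical selector
job[job == "clerk"] <- "administrator"
```

Because the logical vector has the same length as the data, it can also be used directly as a subset selector, for example age[is_nurse].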

Two-Dimensional Datasets

Dataframes

Creating data frames from vectors
the $ operator
Checking and changing class types
Filter rows and select columns
Reorder rows or columns
CSV vs RDS formats

Matrices and Lists

Matrix
Lists
Building data objects:
data.frame() vs cbind() and rbind()
Transformations of Datasets

Data IO

Navigating R (directories, paths, object lists)
Built-In Datasets

Getting Data into R [ data import ]

Read options
Copy and paste from Excel
Using RData format
Read from CSV or TSV
Read text files
Import from Excel
Import from common format (foreign package)
Import from the web (RCurl)
Import from GitHub
Import from Dropbox
[ tutorial ]

Saving Data [ exporting datasets ]

Write options
- CSV
- R Data Sets (RDS)
- CSV vs RDS
- Tables
- RData Format
- SPSS or Stata
Copy to Clipboard
Copy to Excel
[ tutorial ]

APIs [ using APIs in R ] [ Demo with DataUSA API ]

What is an API?
Examples
- Census
- Socrata
- Twitter

Data Wrangling (dplyr)

Data wrangling is the process of preparing data for analysis, which includes reading data into R from a variety of formats, cleaning data, tidying datasets, creating subsets and filters, transforming variables, grouping data, and joining multiple datasets.

The goal of data wrangling is to create a rodeo dataset (clean and well-structured) that is ready for the big show (modeling and visualization)!

Slicing Datasets – Base R and dplyr [ CH-11 ]

Subset operator
By index, including order / match
By logical
Recycling
Subset by row – dplyr::filter()
Indices
Selector Vectors
Subset by column – dplyr::select()

Wrangling Recipes [ CH-12 ]

Pipe operator
Window vs summary functions
dplyr cheat sheet

Combining Datasets [ CH-13 ]

merge and match
join in dplyr
inner, outer, right, left

Explore and Describe

Group Structure [ CH-14 ]

Combining factors and numeric data for analysis
Faceting in plots

Summarizing Vectors

Counting things: sum( logical statement )
Categorical data: tables
Missing values
prop.table() and margin.table()
Numeric data: min, max, mean, summary / quantile
Missing values
All at once: summary + data.frame / matrix
Creating tables of descriptives: factors vs numeric
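
A short sketch of these summaries in base R, using the built-in mtcars dataset. The choice of dataset and variables is an assumption made for illustration, as is the small vector x used to show missing-value handling.

```r
# mtcars ships with base R
data(mtcars)

# Counting things: sum() of a logical statement counts the TRUE cases
n_efficient <- sum(mtcars$mpg > 25)

# Categorical data: counts and proportions
cyl_counts <- table(mtcars$cyl)
cyl_props  <- prop.table(cyl_counts)

# Two-way table and its row totals
cyl_by_gear <- table(mtcars$cyl, mtcars$gear)
cyl_totals  <- margin.table(cyl_by_gear, 1)

# Missing values: count them explicitly, then drop them when summarizing
x <- c(4, 8, NA, 6)
na_counts  <- table(is.na(x))
mean_naive <- mean(x)                # NA, because one value is missing
mean_clean <- mean(x, na.rm = TRUE)  # 6

# Numeric data: min, max, mean, quartiles
mpg_summary <- summary(mtcars$mpg)
mpg_deciles <- quantile(mtcars$mpg, probs = c(0.1, 0.9))

# All at once: summary() applied to several columns of a data frame
summary(mtcars[, c("mpg", "hp", "wt")])
```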

Summarizing Groups of Vectors

table( f1, f2 ), ftable( row.vars=c("f1","f2"), col.vars="f3" )
Function over groups: tapply( v1, f1, FUN ) or dplyr::group_by() + summarise()
Functions over levels of numeric data: tapply( v1, cut(v2, breaks), FUN )
Function over two factors: tapply( v1, INDEX=list(f1,f2), FUN ) or dplyr::group_by() + summarise()
aggregate( dat, by=list(f1), FUN )
https://cran.r-project.org/web/packages/DescTools/vignettes/DescToolsCompanion.pdf
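
The group summaries above can be tried out in base R as follows. This is a sketch under assumptions: it uses the built-in mtcars dataset, maps the placeholder names f1, f2, f3, v1, and v2 onto columns chosen for illustration, and picks arbitrary break points for cut(). The dplyr alternative is left out so the example runs without extra packages.

```r
data(mtcars)

# Map the outline's placeholder names onto mtcars columns (illustrative choice)
f1 <- factor(mtcars$cyl)
f2 <- factor(mtcars$am, labels = c("automatic", "manual"))
f3 <- factor(mtcars$gear)
v1 <- mtcars$mpg
v2 <- mtcars$hp

# Cross-tabulations: two factors, then three factors flattened with ftable()
two_way   <- table(f1, f2)
three_way <- ftable(table(f1, f2, f3),
                    row.vars = c("f1", "f2"), col.vars = "f3")

# Function over groups: mean mpg for each cylinder count
mean_by_cyl <- tapply(v1, f1, mean)

# Function over levels of numeric data: bin horsepower, then summarize
hp_groups  <- cut(v2, breaks = c(0, 100, 200, 400))
mean_by_hp <- tapply(v1, hp_groups, mean)

# Function over two factors at once: returns a matrix of group means
mean_by_two <- tapply(v1, INDEX = list(f1, f2), FUN = mean)

# aggregate() does the same for several columns and returns a data frame
agg <- aggregate(mtcars[, c("mpg", "hp")], by = list(cyl = f1), FUN = mean)
```

tapply() returns a vector or matrix indexed by group, while aggregate() returns a data frame, which is usually easier to merge or plot afterwards.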

Visualize

Principles of Visual Communication [ Intro to Data Viz ]

Ground, figure, narrative (context, subject, action)
Tufte’s rules
Visual tragedies

Core Graphics Engine [ Core ] [ Custom ]

Defining a canvas: xlim, ylim
Adding data
Type (point, line, both)
Symbols
Color
Size
Adding grids
Adding axes
Adding titles / axes labels
Adding data labels: text()
Margins

Advanced Graphics

Colors and color functions
Custom fonts / math symbols
Multiple Plots (core graphics)
- Incorrect: https://en.wikipedia.org/wiki/File:Smallmult.png#/media/File:Smallmult.png
Custom graph layouts

ggplot2 [ Intro to the Grammar of Graphics ]

Grammar of graphics concept
ggplot overview

Make Dynamic

R Shiny [ overview ] [ tutorial ]

What makes documents dynamic?
Widgets
- input objects
- Widgets Gallery
Render functions
reactive

flexdashboards [ overview ] [ demo RMD ]

Principles of good dashboard design
Layouts
Sidebars
Value boxes
CSS basics

Data Programming for Social Scientists