In this chapter, we’ll explore the following key concepts:
Too long; didn’t read? Here’s what you need to know:
read*()
functions
Importing data is also called reading in data in R.
Many base R functions begin with read
, like read.csv()
and read.table()
.
Some R packages have new reading functions, like read_csv()
in the “readr” package.
"If you know the [data] and know yourself,
you need not fear the result of a hundred [analyses]."
Sun Tzu, The Art of [Data Science]
If you want to get certain data into R, you must know the file format.
.csv
.csv
, .tsv
, and .xlsx
flatly /ˈflatli/
- Showing little interest or emotion.
- In a firm and unequivocal manner; absolutely.
- In a smooth and even way.
Oxford English Dictionary
The far most familiar family of file formats includes text data or flat files.
PRO TIP:
Flat files are ideal for sharing data with collaborators.
Comma-separated values or CSV files store data in plain text:
A CSV in the Wild: Observe the following public construction records on GitHub:
A Wrangled CSV: R (and other software) interpret and tame these formats tabularly.
## project name ending zip ssn class hours rate gross
## 1 Lakeview Ajay Glass 2015-08-02 14612 3888 Journeyman 40 18.5 740.0
## 2 Lakeview Ajay Glass 2015-08-02 13205 2807 Journeyman 32 28.7 918.4
## 3 Lakeview Ajay Glass 2015-08-02 14428 7762 Journeyman 40 33.7 1348.0
## 4 Lakeview Ajay Glass 2015-08-02 14549 9759 Journeyman 26 33.7 876.2
## 5 Lakeview Ajay Glass 2015-08-09 14433 5632 Journeyman 32 17.0 544.0
## 6 Lakeview Ajay Glass 2015-08-09 14612 3888 Journeyman 40 18.5 740.0
## 7 Lakeview Ajay Glass 2015-08-09 14428 7762 Journeyman 40 33.7 1348.0
## 8 Lakeview Ajay Glass 2015-08-09 13118 2838 Journeyman 16 28.7 459.2
## 9 Lakeview Ajay Glass 2015-08-16 14433 5632 Journeyman 32 17.0 544.0
## 10 Lakeview Ajay Glass 2015-08-16 14612 3888 Journeyman 42 18.5 795.5
## net sex race ot pdf_no pdf_pg pg_ob
## 1 511.31 <NA> <NA> 0 1 1 1
## 2 586.65 <NA> <NA> 0 1 1 2
## 3 1025.92 <NA> <NA> 0 1 1 3
## 4 891.08 <NA> <NA> 0 1 1 4
## 5 384.08 <NA> <NA> 0 1 3 1
## 6 657.86 <NA> <NA> 0 1 3 2
## 7 1025.92 <NA> <NA> 0 1 3 3
## 8 361.98 <NA> <NA> 0 1 3 4
## 9 384.08 <NA> <NA> 0 1 5 1
## 10 616.28 <NA> <NA> 0 1 5 2
Tab-separated values or TSV files store data in plain text:
A TSV in the Wild: Observe the same public construction records on GitHub:
Microsoft Excel files are significantly different from text or flat files.
.xls
; a workbook is in .xlsx
formatreadxl
, XLConnect
, gdata
WARNING:
Reading Excel files into R will remove all formatting, like:
Some Excel users use formatting to represent information. For example, a user may use red, yellow, and green to represent a categorical variable like “low”, “medium”, and “high”, respectively. In such cases, create a new column to store these data.
Excel Files in the Wild: It’s common to see Excel files with heavy formatting:
The JSON format stands for “JavaScript Object Notation”:
.json
jsonlite
help R read in .json
filesJSON Files in the Wild: Check out these Game of Thrones subtitles in JSON format:
Each major statistical software has its own (sometimes proprietary) file format, e.g.:
.sav
and .por
.sas7bdat
and .sas7bcat
.dta
Different packages exist for importing statistical software files, e.g.:
foreign
, created by R’s core team, supports many formatshaven
, created by Hadley Wickham, is faster but for fewer formats
You may have noticed, when working in R, that you’re asked to save your workspace.
When saving a workspace, .rdata
or .rda
files are stored in the working directory.
save()
load()
.rdata
and .rda
files store your session’s objects, command history, etc.
There are a variety of file formats that you can read into R.
For virtually every file format, there are functions and packages to import them.
You must know the file’s format to decide on the best import function.
Once you’ve identified the extension type, you should consider how to read it into R.
,
, ;
?We’ll look at a few common functions for importing common formats.
The workhorse of all base R reading functions is read.table()
.
Thanks to its modifiability, function read.table()
is remarkably flexible.
Not only does it import most file types, it’s used by many wrapper functions.
QUESTION
You can see whats under a function’s hood by typing it without ()
.
Look at the internals of function read.csv()
. What function is used?
read.csv
## function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
## fill = TRUE, comment.char = "", ...)
## read.table(file = file, header = header, sep = sep, quote = quote,
## dec = dec, fill = fill, comment.char = comment.char, ...)
## <bytecode: 0x00000000132bda28>
## <environment: namespace:utils>
A wrapper function is a modified version of a more powerful and versatile function.
Each is optimized for a specific task, like read.csv()
- a very common function.
Wrapper functions are like customized tools. Their toolboxes are packages.
Reading comma-separated values (CSV) files is done with base R’s read.csv()
function.
path =
path =
accepts the directory path or a web URLread.csv(file = "~/dp4ss-textbook/my_data.csv")
read.csv(file = "http://www.ds4ps.com/textbook/my_data.csv")
We can read in Syracuse, NY lead violations directly from the city’s open data portal.
#' Put your URL in quotes; here, we name it "url"
#' Use function paste0() to join long strings of text
url <- paste0("https://opendata.arcgis.com/datasets/",
"c15a39a8a00e48b1a60c826c8a2cb3e0_0.csv")
lead_data <- read.csv(file = url, # Use object storing URL text
stringsAsFactors = FALSE) # Always include this argument!
Now, we have a locally-stored dataset:
lead_data[1:10, 1:4] # Bracks specify rows and columns to include
Consider the following notable arguments for read.csv()
:
c()
to rename variablesc()
to specify classes
PRO TIP: READING FROM CLIPBOARDS
Why not read the data you’ve copied to your clipboard?
In this approach, just change the first argument, file =, to “clipboard”.
read.csv(file = "clipboard")
There are many ways to delimit data in a flat (text) file.
For example, tab-delimited, semicolon-delimited, asterisk-delimited - you get it.
In these instances, you can use function read.delim()
to specify the delimiter.
Here, we specify the delimiter as a comma, ,
:
lead <- read.delim(file = url,
stringsAsFactors = FALSE,
sep = ',') # Specifies delimiter
You can specify your delimiter as any character you wish:
read.delim(file = my_file, sep = ',')
read.delim(file = my_file, sep = ';')
read.delim(file = my_file, sep = '\t') # Tab-delimited
You can also read in any flat file, point-and-click, with RStudio’s import wizard:
In the import wizard, you can specify arguments with a user-friendly interface.
The “readr” package improves on several functions in base R’s reading toolkit, e.g.:
read_
Observe the output from the “tibble” data frame. What’s different?
library(readr)
url <- paste0("https://opendata.arcgis.com/datasets/",
"c15a39a8a00e48b1a60c826c8a2cb3e0_0.csv")
read_csv(url)
PRO TIP
Consider getting used to tibbles (enhanced data frames from “readr”).
ANOTHER PRO TIP
You can change variable classes much more easily in “readr” functions.
Short string representation allows you to specify classes with one character:
In a function, to specify 3 logical, 2 integer, and 4 character variables:
read_csv(path = path, col_types = c("llliicccc"))
There are many packages available for reading in Excel’s .xlsx
format.
Package “readxl” is part of the same package ecosystem as “readr”.
read_xlsx()
excel_sheets()
library(readxl)
read_xlsx(path = url,
sheet = "FY 2018 Revenue", # Specify sheet
range = "B3:G78", # Specify range
trim_ws = TRUE, # Remove leading/trailing spaces
col_types = "dcic_cli_cc") # Specify variable classes
Of course, RStudio has macros set up if you’d prefer:
LIFE HACK
Using RStudio’s import wizard may seem like such a noob move.
It is, but that doesn’t mean you can’t turn it into a clutch move.
Even when using wizards and macros, code will still run in the R console:
Copy that code and throw it into your script. It’s often faster than typing.
Two packages are commonly used to import files from SPSS, SAS, STATA and other software:
Their differences are similar to those between base R and “readr” functions.
read.dta()
is equivalent to read_dta()
in the packages, respectivelyread.spss()
is equivalent to read_spss()
, respectivelyread.stata()
is equavalent to read_stata()
, and so on
There are a variety of methods to get data out of the web and into R.
Here are a few popular methods:
Many times, it’s a matter of copying URLs from links that download datasets.
Here, we find the “Download” button on a City Of Pheonix Open Data dataset:
You can save the URL in an object or use the whole string for the path = argument.
GitHub is riddled with datasets.
You can often find a “Download” button for them.
In the event that you cannot find a “Download” button, click on “Raw”.
You can use the URL provided to read the data into R.
The following resources should help you further your knowledge on reading in data.
read.table()