As an expert data scientist, you have been hired by the Mayor of Tempe to conduct a study and make recommendations on ways to reduce traffic injuries in the city. You will use the crash data from the city’s open data portal and your data wrangling skills to look for patterns that help us understand the causes of traffic accidents in the city, and might suggest some ways to reduce injuries and fatalities.
Consider the following questions:
You will use the following packages for this lab:
library( dplyr ) # data wrangling
library( pander ) # formatting output
library( ggmap ) # grab map tiles
library( viridis ) # color pallette for maps
library( ggplot2 ) # fancy graphics
library( ggthemes ) # fancy themes for ggplots
In this lab you will use a traffic accidents dataset from the Tempe Open Data Portal:
URL <- "https://github.com/DS4PS/Data-Science-Class/blob/master/DATA/TempeTrafficAccidents.rds?raw=true"
dat <- readRDS(gzcon(url( URL )))
head( dat )
The dataset contains the following variables:
column | type | label | description |
---|---|---|---|
Incidentid | numeric | Incident ID | Unique incident ID number assigned by Arizona Department of Transportation (ADOT). |
DateTime | timestamp | Date Time | Date and time that the crash occurred. |
Year | numeric | Year | Year that the crash occurred. |
StreetName | text | Street Name | The street that the crash occurred on. |
CrossStreet | text | Cross-street | The nearest intersecting street or road. |
Distance | numeric | Distance from Intersection | The distance, in feet, that the crash occurred from the cross-street. |
JunctionRelation | text | Junction Relation | The location of the crash in relation to a junction, either an intersection or connection between a driveway and a roadway. |
Totalinjuries | numeric | Total Injuries | Total number of persons with non-fatal injuries involved in the crash. |
Totalfatalities | numeric | Total Fatalities | Total number of persons with fatal injuries involved in the crash. |
Injuryseverity | text | Injury Severity | The highest severity of injury of all persons involved in the crash. |
Collisionmanner | text | Collision Manner | Identifies the manner in which two vehicles initially came into contact. |
Lightcondition | text | Lighting Conditions | The type/level of light that existed at the time of the crash. |
Weather | text | Weather | The prevailing (most significant) atmospheric conditions that existed at the time of the crash. |
SurfaceCondition | text | Surface Condition | The roadway surface condition at the time and place of a crash. |
Unittype_One | text | Unit Type One | Driver, Passenger, Pedestrian, Pedalcyclist or Driverless. |
Age_Drv1 | numeric | ||
Gender_Drv1 | text | ||
Traveldirection_One | text | Travel Direction | The direction the unit was traveling before the incident occurred, |
Unitaction_One | text | Unit Action One | The maneuver, or last action, of the unit before the crash. |
Violation1_Drv1 | text | Violation One | The main violation/behavior of the unit that contributed to the crash. |
AlcoholUse_Drv1 | text | Alcohol Use 1 | Indicates whether alcohol was a contributing factor in the crash or not. |
DrugUse_Drv1 | text | Drug Use 1 | Indicates whether drug use was a contributing factor in the crash or not. |
Unittype_Two | text | Unit Type Two | Driver, Passenger, Pedestrian, Pedalcyclist or Driverless. |
Age_Drv2 | numeric | ||
Gender_Drv2 | text | ||
Traveldirection_Two | text | Travel Direction Two | The direction the unit was traveling before the incident occurred. |
Unitaction_Two | text | Unit Action Two | The maneuver, or last action, of the unit before the crash. |
Violation1_Drv2 | text | Violation Two | The main violation/behavior of the unit that contributed to the crash. |
AlcoholUse_Drv2 | text | Alcohol Use 2 | Indicates whether alcohol was a contributing factor in the crash or not. |
DrugUse_Drv2 | text | Drug Use 2 | Indicates whether drug use was a contributing factor in the crash or not. |
Latitude | numeric | Latitude | Used to specify the precise location of the crash. |
Longitude | numeric | Longitude | Used to specify the precise location of the crash. |
date.vec <- strptime( dat$DateTime, format="%m/%d/%y %H:%M" )
dat$hour <- format( date.vec, format="%H" )
dat$month <- format( date.vec, format="%b" )
dat$day <- format( date.vec, format="%a" )
dat$day365 <- format( date.vec, format="%j" )
dat$week <- format( date.vec, format="%V" )
Correct the order of categorical variables by making them ordered factors:
Fri | Mon | Sat | Sun | Thu | Tue | Wed |
---|---|---|---|---|---|---|
5006 | 4094 | 3044 | 2145 | 4814 | 4656 | 4711 |
# correct order of days
dat$day <- factor( dat$day, levels=c("Mon","Tue","Wed","Thu","Fri","Sat","Sun") )
table( dat$day ) %>% pander()
Mon | Tue | Wed | Thu | Fri | Sat | Sun |
---|---|---|---|---|---|---|
4094 | 4656 | 4711 | 4814 | 5006 | 3044 | 2145 |
Create 12-hour format and order the times correctly:
1 AM | 1 PM | 2 AM | 2 PM | 3 AM | 3 PM |
---|---|---|---|---|---|
289 | 1832 | 362 | 1928 | 190 | 2301 |
# set the levels so they are in the correct order
time.levels <-
c( "12 AM", " 1 AM", " 2 AM", " 3 AM", " 4 AM", " 5 AM",
" 6 AM", " 7 AM", " 8 AM", " 9 AM", "10 AM", "11 AM",
"12 PM", " 1 PM", " 2 PM", " 3 PM", " 4 PM", " 5 PM",
" 6 PM", " 7 PM", " 8 PM", " 9 PM", "10 PM", "11 PM" )
dat$hour12 <- factor( dat$hour12, levels=time.levels )
table( dat$hour12 ) %>% head() %>% pander()
12 AM | 1 AM | 2 AM | 3 AM | 4 AM | 5 AM |
---|---|---|---|---|---|
388 | 289 | 362 | 190 | 125 | 300 |
age.labels <- paste0( "Age ", c(16,18,25,35,45,55,65,75), "-", c(18,25,35,45,55,65,75,100) )
age.labels
## [1] "Age 16-18" "Age 18-25" "Age 25-35" "Age 35-45" "Age 45-55"
## [6] "Age 55-65" "Age 65-75" "Age 75-100"
In this lab you will practice your logical statements, data verbs (dplyr functions), and recipes to conduct analysis looking for types of accidents that cause serious injury. You will need to pay attention to the difference between counts of events, and severity of events.
We will define “harm” as any accident that causes at least one injury OR fatality. You need to define a new variable in your dataset that indicates whether accidents caused harm or not.
For each question, write down your data recipe or pseudocode.
You can create a new RMarkdown file, or download the lab template: RMD Template
Practice writing logical statements and basic data recipes for the following:
day | n | injuries | fatalities | harm.rate |
---|---|---|---|---|
Mon | 4094 | 1644 | 13 | 0.3 |
Tue | 4656 | 2056 | 8 | 0.32 |
Wed | 4711 | 2144 | 9 | 0.31 |
Thu | 4814 | 2204 | 10 | 0.33 |
Fri | 5006 | 2103 | 12 | 0.3 |
Sat | 3044 | 1192 | 11 | 0.26 |
Sun | 2145 | 866 | 6 | 0.27 |
Which age group has the largest number of accidents at 7am?
You can use the dplyr count() function for this (or use table() in core R if you are old school like that).
Which age group experiences the largest proportion of accidents at 7am?
Note, this calculation would be useful for assessing whether there are differences when each age group is driving recklessly. You can see differences in alertness and physical stamina across age groups at different times of day.
Do NOT confuse this question with, what age group is most likely to cause an accident at a given time of day. In that case, all of the proportions WITHIN EACH TIME PERIOD would sum to one.
Let’s break this problem down with a sample dataset. Instead of age and hour12, we will use factors f1 and f2. We want to know the proportion of each f2 within f1 (i.e. proportion of accidents per hour for each age group).
f1 | f2 | x |
---|---|---|
A | L-01 | 1 |
A | L-01 | 1 |
A | L-02 | 1 |
A | L-02 | 1 |
A | L-02 | 1 |
A | L-03 | 1 |
B | L-01 | 1 |
B | L-01 | 1 |
B | L-03 | 1 |
B | L-03 | 1 |
The data recipe will be:
Step 1: count f1-f2 groups
f1 | f2 | n |
---|---|---|
A | L-01 | 2 |
A | L-02 | 3 |
A | L-03 | 1 |
B | L-01 | 2 |
B | L-03 | 2 |
Step 2: count f1 groups
f1 | n |
---|---|
A | 6 |
B | 4 |
Step 3: Combine steps 1-2 into a single dataset
The problem with Step 2 is that count() and summarize() drop all columns except the summary stats. We know, however, that mutate() creates new variables and keeps the rest of the dataset. So the trick is to replace summarize with mutate.
f1 | f2 | n | n.f1 |
---|---|---|---|
A | L-01 | 2 | 3 |
A | L-02 | 3 | 3 |
A | L-03 | 1 | 3 |
B | L-01 | 2 | 2 |
B | L-03 | 2 | 2 |
That’s not quite right because we are counting the occurrences of f1 in a summary table. We know that there are 6 f1’s, not 3. Each row of f1 occurs n times in the original dataset. So we need to sum over n, not count rows.
f1 | f2 | n | n.f1 |
---|---|---|---|
A | L-01 | 2 | 6 |
A | L-02 | 3 | 6 |
A | L-03 | 1 | 6 |
B | L-01 | 2 | 4 |
B | L-03 | 2 | 4 |
That looks better!
Step 4: Divide each f1-f2 group count by the f1 group count.
The last step is then to divide (f1-f2) / f1.
We can then filter by a specific factor to answer a question like what age of driver is most dangerous at 7am?
d %>%
count( f1, f2 ) %>%
group_by( f1 ) %>%
mutate( n.f1 = sum(n) ) %>%
mutate( prop= round( n / n.f1, 2 ) ) %>%
filter( f2 == "L-01" )
f1 | f2 | n | n.f1 | prop |
---|---|---|---|---|
A | L-01 | 2 | 6 | 0.33 |
B | L-01 | 2 | 4 | 0.5 |
Your final table will look something like this, with p.age representing the within-age proportions.
The p.age proportion answers the question, for all of the drivers from a specific age group that get into accidents, what time of day are they most likely to have an accident?
The p.hour proportion answers the question, which age group causes the most accidents at a given time of day (e.g. 7AM).
We don’t have info on the total number of drivers from each age group on the road at any given time, so we cannot say anything about the absolute rates of dangerous driving. These rates tell us, for drivers that get into accidents, how do the accidents vary by time and age group across the day?
age | hour12 | n | n.age | n.hour | p | p.age | p.hour |
---|---|---|---|---|---|---|---|
Age 16-18 | 7 AM | 77 | 1458 | 1606 | 0.05 | 0.05 | 0.05 |
Age 18-25 | 7 AM | 408 | 8796 | 1606 | 0.25 | 0.05 | 0.25 |
Age 25-35 | 7 AM | 371 | 5456 | 1606 | 0.23 | 0.07 | 0.23 |
Age 35-45 | 7 AM | 243 | 3250 | 1606 | 0.15 | 0.07 | 0.15 |
Age 45-55 | 7 AM | 175 | 2679 | 1606 | 0.11 | 0.07 | 0.11 |
Age 55-65 | 7 AM | 116 | 1878 | 1606 | 0.07 | 0.06 | 0.07 |
Age 65-75 | 7 AM | 39 | 970 | 1606 | 0.02 | 0.04 | 0.02 |
Age 75-100 | 7 AM | 17 | 570 | 1606 | 0.01 | 0.03 | 0.01 |
NA | 7 AM | 160 | 3413 | 1606 | 0.10 | 0.05 | 0.10 |
Maybe these are related?
As a public health expert specializing in traffic accidents, you need to think about how to best target traffic accidents to reduce harm.
Should we focus on the volume of traffic accidents, or the types of accidents that are most likely to cause harm?
Calculate each of these four descriptive statistics above as a function of the 24 hours of the day, and either print a table with times and counts/rates, or plot a graph of the statistics as a function of time similar to the examples above.
# example plotting code
plot( as.numeric(d2$hour), d2$ave.num.injuries,
pch=19, type="b", cex=2, bty="n",
xlab="Hour of the Day",
ylab="Ave. Number of Passengers Hurt",
main="Average Injuries or Fatalities Per Harmful Crash")
Reflection point: as the analyst your job is to translate research questions into the models that are presented to decision-makers. You will likely receive a broad objective like identifying patterns in traffic accidents, and you will often have a lot of flexibility in how you operationalize the question.
How might strategies for addressing traffic injuries change if you switch from an emphasis on the accidents that are most common to accidents that are most likely to cause harm? These are subtle nuances in how to calculate simple statistics (counts and proportions) over groups, but they can have a big impact on policy-making!
Report your tables or graphcs. You don’t have to include a written response to the reflection point.
Using at most two variables in the dataset to define a group structure and identify:
The most dangerous accident to be involved in (highest rate of harm).
For example, your groups could be teen-agers (group 1: age) that rear-end another driver (group 2: collision type), or drunk-drivers (group 1: alcohol) that hit pedestrians (group 2: driver type), or men (group 1: gender) on Labor Day (group 2: date). Calculate rates of harm for each type of accident, and identify the most harmful case.
There is one constraint: there must be at least five cases of the accident in the dataset. For example, any type of accident involving a 95-year old likely occurs once in the dataset since these drivers are rare. If the driver was injured then that accident type will have a 100% harm rate, but it’s unlikely an accurate representation of the true harm rate because we are generalizing from a single observation.
You can use any variables from the dataset, but you are limit to groups constructed from two variables. Report the most dangerous accident type (most likely to cause harm - i.e. injury or death) that you can identify.
So far we have worked with character, numeric, logical, and categorical (factor) vectors.
We need to introduce a new type of vector class for this lab, a date variable. Dates are complicated because they must function simultaneously as categorical variables (months of the year) and numeric variables capable of arithmatic (time that passes between two dates). Furthermore, we often want to convert between idiosyncratic date representations, such as a day of a specific month to a day of the week.
They are also complicated because they can be represented in many ways:
When you first read a dataset, they are typically loaded as character vectors:
## [1] "1/10/12 9:04 " "1/5/12 17:24 " "1/16/12 19:08 " "1/27/12 14:41 "
## [5] "1/10/12 13:41 " "1/9/12 17:49 "
We can convert dates stored as characters to a special date object by specifying the format using codes understood by the strptime() function:
## [1] "2012-01-10 09:04:00 MST" "2012-01-05 17:24:00 MST"
## [3] "2012-01-16 19:08:00 MST" "2012-01-27 14:41:00 MST"
## [5] "2012-01-10 13:41:00 MST" "2012-01-09 17:49:00 MST"
Now R will recognize that the variable is a date, not a string, and it will be able to do complex day and time manipulations. Note that the format= argument above requires you to tell R what each value represents. In this case, %m represents month, %d represents day, and %y represents year. In the original data they are separated by a back slash, so that’s included in the format argument.
We need to be explicit because dates can be stored as DD-MM-YYYY, MM-DD-YY, YYYY-MM-DD, or any other number of formats. The format argument tells R how to structure the date.
We can now use the format() function to specify how we want the date represented using many common styles. Note that format()
will return a character vector, not another date class.
## [1] "09" "17" "19" "14" "13" "17"
## [1] "09" "05" "07" "02" "01" "05"
## [1] "AM" "PM" "PM" "PM" "PM" "PM"
## [1] "01" "01" "01" "01" "01" "01"
## [1] "Jan" "Jan" "Jan" "Jan" "Jan" "Jan"
## [1] "Tuesday" "Thursday" "Monday" "Friday" "Tuesday" "Monday"
## [1] "Tue" "Thu" "Mon" "Fri" "Tue" "Mon"
We can apply a wide range of formatting options to dates:
%a: Abbreviated weekday name in the current locale on this platform. (Also matches full name on input: in some locales there are no abbreviations of names.)
%A: Full weekday name in the current locale. (Also matches abbreviated name on input.)
%b: Abbreviated month name in the current locale on this platform. (Also matches full name on input: in some locales there are no abbreviations of names.)
%B: Full month name in the current locale. (Also matches abbreviated name on input.)
%c: Date and time. Locale-specific on output, “%a %b %e %H:%M:%S %Y” on input.
%C: Century (00–99): the integer part of the year divided by 100.
%d: Day of the month as decimal number 01–31.
%D: Date format such as %m/%d/%y: the C99 standard says it should be that exact format, but not all OS’s comply.
%e: Day of the month as decimal number 1–31, with a leading space for a single-digit number.
%F: Equivalent to %Y-%m-%d the ISO 8601 date format.
%g: The last two digits of the week-based year. Accepted but ignored on input.
%G: The week-based year as a decimal number. Accepted but ignored on input.
%h: Equivalent to %b.
%H: Hours as decimal number 00–23. As a special exception strings such as 24:00:00 are accepted for input, since ISO 8601 allows these.
%I: Hours as decimal number 01–12.
%j: Day of year as decimal number 001–366.
%m: Month as decimal number 01–12.
%M: Minute as decimal number 00–59.
%n: Newline on output, arbitrary whitespace on input.
%p: AM/PM indicator in the locale. Used in conjunction with %I and not with %H. An empty string in some locales (and the behaviour is undefined if used for input in such a locale). Some platforms accept %P for output, which uses a lower-case version: others will output P.
%r: The 12-hour clock time (using the locale’s AM or PM). Only defined in some locales.
%R: Equivalent to %H:%M.
%S: Second as integer 00–61, allowing for up to two leap-seconds (but POSIX-compliant implementations will ignore leap seconds).
%t: Tab on output, arbitrary whitespace on input.
%T: Equivalent to %H:%M:%S.
%u: Weekday as a decimal number 1–7, Monday is 1.
%U: Week of the year as decimal number 00–53 using Sunday as the first day 1 of the week (and typically with the first Sunday of the year as day 1 of week 1). The US convention.
%V: Week of the year as decimal number 01–53 as defined in ISO 8601. If the week (starting on Monday) containing 1 January has four or more days in the new year, then it is considered week 1. Otherwise, it is the last week of the previous year, and the next week is week 1. (Accepted but ignored on input.)
%w: Weekday as decimal number 0–6, Sunday is 0.
%W: Week of the year as decimal number 00–53 using Monday as the first day of week (and typically with the first Monday of the year as day 1 of week 1). The UK convention.
%x: Date. Locale-specific on output, “%y/%m/%d” on input.
%X: Time. Locale-specific on output, “%H:%M:%S” on input.
%y: Year without century 00–99. On input, values 00 to 68 are prefixed by 20 and 69 to 99 by 19 – that is the behaviour specified by the 2004 and 2008 POSIX standards, but they do also say ‘it is expected that in a future version the default century inferred from a 2-digit year will change’.
%Y: Year with century. Note that whereas there was no zero in the original Gregorian calendar, ISO 8601:2004 defines it to be valid (interpreted as 1BC): see https://en.wikipedia.org/wiki/0_(year). Note that the standards also say that years before 1582 in its calendar should only be used with agreement of the parties involved. For input, only years 0:9999 are accepted.
%z: Signed offset in hours and minutes from UTC, so -0800 is 8 hours behind UTC. Values up to +1400 are accepted as from R 3.1.1: previous versions only accepted up to +1200. (Standard only for output.)
%Z: (Output only.) Time zone abbreviation as a character string (empty if not available). This may not be reliable when a time zone has changed abbreviations over the years.
date.vec <- strptime( dat$DateTime, format="%m/%d/%y %H:%M" )
dat$hour <- format( date.vec, format="%H" )
dat$month <- format( date.vec, format="%b" )
dat$day <- format( date.vec, format="%a" )
dat$day365 <- format( date.vec, format="%j" )
dat$week <- format( date.vec, format="%V" )
# set the levels so they are in the correct order
time.levels <-
c( "12 AM", " 1 AM", " 2 AM", " 3 AM", " 4 AM", " 5 AM",
" 6 AM", " 7 AM", " 8 AM", " 9 AM", "10 AM", "11 AM",
"12 PM", " 1 PM", " 2 PM", " 3 PM", " 4 PM", " 5 PM",
" 6 PM", " 7 PM", " 8 PM", " 9 PM", "10 PM", "11 PM" )
dat$hour12 <- format( date.vec, format="%l %p" )
dat$hour12 <- factor( dat$hour12, levels=time.levels )
qmplot( Longitude, Latitude, data=dat, geom = "blank",
zoom = 13, maptype = "toner-background", darken = .1 ) +
stat_density_2d( aes(fill = ..level..), geom = "polygon", alpha=0.3, color = NA) +
scale_fill_viridis( ) +
facet_wrap( ~ hour12, ncol=6, nrow=4)
d2 <-
dat %>%
filter( as.numeric(week) <= 52 ) %>%
count( week )
plot( as.numeric(d2$week), d2$n, pch=19, type="b", cex=2, bty="n",
xlab="Week", ylab="Number of Crashes",
main="Cumulative Crashes by Week of the Year: 2012-2018" )
d2 <-
dat %>%
filter( as.numeric(week) <= 52 ) %>%
group_by( week ) %>%
summarize( harm = mean( Totalinjuries > 0 | Totalfatalities > 0 ) )
plot( as.numeric(d2$week), d2$harm, pch=19, type="b", cex=2, bty="n",
xlab="Weeks in the Year", ylab="Rate of Harm",
main="Proportion of Crashes that Result in Harm Across Weeks")
abline( h=mean(d2$harm), col="gray", lty=2, lwd=2 )
If we want to be more precise about crash counts per week within a given year, which is a more intuitive and actionable statistic than summing across all years in the dataset:
d2 <-
dat %>%
filter( as.numeric(week) <= 52 ) %>%
group_by( Year ) %>%
count( week ) %>%
group_by( week ) %>%
summarize( ave.crashes.per.week = mean(n) )
plot( as.numeric(d2$week), d2$ave.crashes.per.week,
pch=19, type="b", cex=2, bty="n",
xlab="Week", ylab="Number of Crashes",
main="Ave Crashes by Week of Year" )
Some of the categorical variables are hard to work with because they have a levels that are small or hard to interpret.
Collisionmanner | n |
---|---|
10 | 3 |
Rear To Rear | 56 |
Rear To Side | 174 |
Sideswipe Opposite Direction | 189 |
Unknown | 345 |
Head On | 348 |
Other | 971 |
Single Vehicle | 1737 |
Sideswipe Same Direction | 3565 |
ANGLE (Front To Side)(Other Than Left Turn) | 4686 |
Left Turn | 5395 |
Rear End | 11001 |
dat$Collisionmanner <- recode( dat$Collisionmanner,
"ANGLE (Front To Side)(Other Than Left Turn)"="Angle" )
dat$Collisionmanner <- recode( dat$Collisionmanner, "Sideswipe Same Direction"="Lane Change" )
drop.these <- c("Unknown","10","Rear To Side","Rear To Rear","Sideswipe Opposite Direction","Other")
dat <- filter( dat, ! ( Collisionmanner %in% drop.these ) )
dat$Collisionmanner <- factor( dat$Collisionmanner )
count( dat, Collisionmanner ) %>% arrange(n) %>% pander()
Collisionmanner | n |
---|---|
Head On | 348 |
Single Vehicle | 1737 |
Lane Change | 3565 |
Angle | 4686 |
Left Turn | 5395 |
Rear End | 11001 |
Patterns in types of crashes by time of day:
Angle | Head On | Left Turn | Rear End | Lane Change | Single Vehicle | |
---|---|---|---|---|---|---|
00 | 71 | 4 | 59 | 83 | 50 | 89 |
01 | 34 | 8 | 32 | 78 | 28 | 87 |
02 | 40 | 7 | 23 | 87 | 34 | 136 |
03 | 21 | 8 | 13 | 35 | 7 | 92 |
04 | 18 | 6 | 23 | 17 | 5 | 50 |
05 | 49 | 7 | 63 | 73 | 39 | 49 |
06 | 127 | 9 | 117 | 213 | 82 | 57 |
07 | 286 | 19 | 359 | 616 | 180 | 69 |
08 | 258 | 13 | 279 | 623 | 192 | 60 |
09 | 212 | 11 | 180 | 360 | 174 | 63 |
10 | 217 | 6 | 188 | 398 | 153 | 51 |
11 | 250 | 12 | 246 | 531 | 203 | 58 |
12 | 333 | 14 | 284 | 729 | 210 | 56 |
13 | 347 | 21 | 325 | 738 | 233 | 54 |
14 | 353 | 19 | 334 | 764 | 295 | 64 |
15 | 379 | 24 | 420 | 998 | 303 | 79 |
16 | 397 | 23 | 582 | 1200 | 323 | 87 |
17 | 427 | 36 | 619 | 1285 | 337 | 90 |
18 | 289 | 22 | 421 | 811 | 236 | 76 |
19 | 186 | 19 | 249 | 437 | 143 | 63 |
20 | 135 | 26 | 182 | 316 | 111 | 83 |
21 | 114 | 9 | 177 | 294 | 98 | 74 |
22 | 76 | 13 | 130 | 192 | 76 | 66 |
23 | 67 | 12 | 90 | 123 | 53 | 84 |
We have a wide range of driver ages:
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s |
---|---|---|---|---|---|---|
2 | 22 | 31 | 43.64 | 51 | 255 | 360 |
# remove meaningless ages
dat$Age_Drv1[ dat$Age_Drv1 > 99 ] <- NA
dat$Age_Drv1[ dat$Age_Drv1 < 16 ] <- NA
dat %>%
filter( ! is.na(Age_Drv1) ) %>%
count( Age_Drv1 ) %>%
ggplot( aes( x=Age_Drv1, y=n ) ) +
geom_point(size=3) + geom_line() +
theme_fivethirtyeight() +
ggtitle("Crash Count by Age") +
xlab("Age")
This many ages will make our analysis complicated, so it is better to convert the numeric age variable into a categorical age-group variable. We will use the cut() function for this, which accepts a numeric variable and group cut points (the breaks= argument), then returns the proper group label for each age.
These group labels are a little awkward, so let’s improve them a bit by creating our own:
age.labels <- paste0( "Age ", c(16,18,25,35,45,55,65,75), "-", c(18,25,35,45,55,65,75,100) )
dat$age <- cut( dat$Age_Drv1, breaks=c(16,18,25,35,45,55,65,75,100), labels=age.labels )
barplot( table(dat$age) )
We can now analyze some trends by age group.
d3 <-
dat %>%
count( hour, age )
d3 <- na.omit( d3 )
qplot( data=d3, x=as.numeric(as.character(hour)), y=n ) +
geom_line( size=0.8, color="firebrick4" ) +
geom_point( size=3, color="darkred" ) +
facet_wrap( ~ age, ncol=4 ) +
xlab("Time of Day (24hrs)") +
ylab("Number of Accidents") +
ggtitle("Number of Accidents by Time and Age Group") +
# theme_minimal()
theme_wsj( base_size=10, color="gray" )
After you have completed your lab, knit your RMD file. Login to Canvas at http://canvas.asu.edu and navigate to the assignments tab in the course repository. Upload your RMD and your HTML files to the appropriate lab submission link.
Remember to: