DS4PS


As an expert data scientist, you have been hired by the Mayor of Tempe to conduct a study and make recommendations on ways to reduce traffic injuries in the city. You will use the crash data from the city’s open data portal and your data wrangling skills to look for patterns that help us understand the causes of traffic accidents in the city, and might suggest some ways to reduce injuries and fatalities.

Consider the following questions:











Packages

You will use the following packages for this lab:


Data

In this lab you will use a traffic accidents dataset from the Tempe Open Data Portal:

The dataset contains the following variables:

column type label description
Incidentid numeric Incident ID Unique incident ID number assigned by Arizona Department of Transportation (ADOT).
DateTime timestamp Date Time Date and time that the crash occurred.
Year numeric Year Year that the crash occurred.
StreetName text Street Name The street that the crash occurred on.
CrossStreet text Cross-street The nearest intersecting street or road.
Distance numeric Distance from Intersection The distance, in feet, that the crash occurred from the cross-street.
JunctionRelation text Junction Relation The location of the crash in relation to a junction, either an intersection or connection between a driveway and a roadway.
Totalinjuries numeric Total Injuries Total number of persons with non-fatal injuries involved in the crash.
Totalfatalities numeric Total Fatalities Total number of persons with fatal injuries involved in the crash.
Injuryseverity text Injury Severity The highest severity of injury of all persons involved in the crash.
Collisionmanner text Collision Manner Identifies the manner in which two vehicles initially came into contact.
Lightcondition text Lighting Conditions The type/level of light that existed at the time of the crash.
Weather text Weather The prevailing (most significant) atmospheric conditions that existed at the time of the crash.
SurfaceCondition text Surface Condition The roadway surface condition at the time and place of a crash.
Unittype_One text Unit Type One Driver, Passenger, Pedestrian, Pedalcyclist or Driverless.
Age_Drv1 numeric
Gender_Drv1 text
Traveldirection_One text Travel Direction The direction the unit was traveling before the incident occurred,
Unitaction_One text Unit Action One The maneuver, or last action, of the unit before the crash.
Violation1_Drv1 text Violation One The main violation/behavior of the unit that contributed to the crash.
AlcoholUse_Drv1 text Alcohol Use 1 Indicates whether alcohol was a contributing factor in the crash or not.
DrugUse_Drv1 text Drug Use 1 Indicates whether drug use was a contributing factor in the crash or not.
Unittype_Two text Unit Type Two Driver, Passenger, Pedestrian, Pedalcyclist or Driverless.
Age_Drv2 numeric
Gender_Drv2 text
Traveldirection_Two text Travel Direction Two The direction the unit was traveling before the incident occurred.
Unitaction_Two text Unit Action Two The maneuver, or last action, of the unit before the crash.
Violation1_Drv2 text Violation Two The main violation/behavior of the unit that contributed to the crash.
AlcoholUse_Drv2 text Alcohol Use 2 Indicates whether alcohol was a contributing factor in the crash or not.
DrugUse_Drv2 text Drug Use 2 Indicates whether drug use was a contributing factor in the crash or not.
Latitude numeric Latitude Used to specify the precise location of the crash.
Longitude numeric Longitude Used to specify the precise location of the crash.

Make Age Categorical

## [1] "Age 16-18"  "Age 18-25"  "Age 25-35"  "Age 35-45"  "Age 45-55" 
## [6] "Age 55-65"  "Age 65-75"  "Age 75-100"



Lab Instructions

In this lab you will practice your logical statements, data verbs (dplyr functions), and recipes to conduct analysis looking for types of accidents that cause serious injury. You will need to pay attention to the difference between counts of events, and severity of events.

We will define “harm” as any accident that causes at least one injury OR fatality. You need to define a new variable in your dataset that indicates whether accidents caused harm or not.

For each question, write down your data recipe or pseudocode.

You can create a new RMarkdown file, or download the lab template: RMD Template



PART 1: Warm-Up With Data Summaries

Practice writing logical statements and basic data recipes for the following:

1) How many accidents happen on Mondays?

  • Sum over a logical statement

2) What proportion of accidents each week occur on Monday?

  • Mean of a logical statement

3) What proportion of accidents on Mondays result in harm?

  • Compound logical statement

4) What is the most common type of accident (Collisionmanner) that occurs on Mondays?

  • Use dplyr’s count() function.

5) Are there differences in the proportion of accidents that result in harm each day of the week?

  • Create a table of proportion of accidents that result in harm each day of the week
  • Use group_by() and summarize()
  • Note you can define custom summary statistics in summarize() using logical statements from above

6) Create a table that reports the following for each day of the week:

  1. Number of accidents
  2. Number of people hurt in accidents (total injuries)
  3. Number of people killed in accidents (total fatalities)
  4. Proportion of accidents resulting in harm (injuries + fatalities)
  • Use group_by() and summarize()
  • Note that in the summarize function you can count occurrences by n=n()
day n injuries fatalities harm.rate
Mon 4094 1644 13 0.3
Tue 4656 2056 8 0.32
Wed 4711 2144 9 0.31
Thu 4814 2204 10 0.33
Fri 5006 2103 12 0.3
Sat 3044 1192 11 0.26
Sun 2145 866 6 0.27



PART 2: Age Groups

1) Create a table of counts of accidents by time of day (24 one-hour segments) and age of driver (use the factor age groups).

Which age group has the largest number of accidents at 7am?

You can use the dplyr count() function for this (or use table() in core R if you are old school like that).

2) Create a new table of time of day and age group, but report the proportion of accidents by time of day for each age group (i.e. the proportions WITHIN EACH AGE GROUP should sum to one).

Which age group experiences the largest proportion of accidents at 7am?

Note, this calculation would be useful for assessing whether there are differences when each age group is driving recklessly. You can see differences in alertness and physical stamina across age groups at different times of day.

Do NOT confuse this question with, what age group is most likely to cause an accident at a given time of day. In that case, all of the proportions WITHIN EACH TIME PERIOD would sum to one.

Let’s break this problem down with a sample dataset. Instead of age and hour12, we will use factors f1 and f2. We want to know the proportion of each f2 within f1 (i.e. proportion of accidents per hour for each age group).

f1 f2 x
A L-01 1
A L-01 1
A L-02 1
A L-02 1
A L-02 1
A L-03 1
B L-01 1
B L-01 1
B L-03 1
B L-03 1

The data recipe will be:

  1. Calculate the frequency of each f1-f2 combination (cells in the f1 x f2 table).
  2. Calculate the frequency of each f1 group.
  3. Combine steps 1-2 into a single dataset.
  4. Divide each f1-f2 group count by the f1 group count.

Step 1: count f1-f2 groups

f1 f2 n
A L-01 2
A L-02 3
A L-03 1
B L-01 2
B L-03 2

Step 2: count f1 groups

f1 n
A 6
B 4

Step 3: Combine steps 1-2 into a single dataset

The problem with Step 2 is that count() and summarize() drop all columns except the summary stats. We know, however, that mutate() creates new variables and keeps the rest of the dataset. So the trick is to replace summarize with mutate.

f1 f2 n n.f1
A L-01 2 3
A L-02 3 3
A L-03 1 3
B L-01 2 2
B L-03 2 2

That’s not quite right because we are counting the occurrences of f1 in a summary table. We know that there are 6 f1’s, not 3. Each row of f1 occurs n times in the original dataset. So we need to sum over n, not count rows.

f1 f2 n n.f1
A L-01 2 6
A L-02 3 6
A L-03 1 6
B L-01 2 4
B L-03 2 4

That looks better!

Step 4: Divide each f1-f2 group count by the f1 group count.

The last step is then to divide (f1-f2) / f1.

We can then filter by a specific factor to answer a question like what age of driver is most dangerous at 7am?

f1 f2 n n.f1 prop
A L-01 2 6 0.33
B L-01 2 4 0.5

Your final table will look something like this, with p.age representing the within-age proportions.

The p.age proportion answers the question, for all of the drivers from a specific age group that get into accidents, what time of day are they most likely to have an accident?

The p.hour proportion answers the question, which age group causes the most accidents at a given time of day (e.g. 7AM).

We don’t have info on the total number of drivers from each age group on the road at any given time, so we cannot say anything about the absolute rates of dangerous driving. These rates tell us, for drivers that get into accidents, how do the accidents vary by time and age group across the day?

age hour12 n n.age n.hour p p.age p.hour
Age 16-18 7 AM 77 1458 1606 0.05 0.05 0.05
Age 18-25 7 AM 408 8796 1606 0.25 0.05 0.25
Age 25-35 7 AM 371 5456 1606 0.23 0.07 0.23
Age 35-45 7 AM 243 3250 1606 0.15 0.07 0.15
Age 45-55 7 AM 175 2679 1606 0.11 0.07 0.11
Age 55-65 7 AM 116 1878 1606 0.07 0.06 0.07
Age 65-75 7 AM 39 970 1606 0.02 0.04 0.02
Age 75-100 7 AM 17 570 1606 0.01 0.03 0.01
NA 7 AM 160 3413 1606 0.10 0.05 0.10

Maybe these are related?



PART 3: Rates of Harm

As a public health expert specializing in traffic accidents, you need to think about how to best target traffic accidents to reduce harm.

Should we focus on the volume of traffic accidents, or the types of accidents that are most likely to cause harm?

Calculate each of these four descriptive statistics above as a function of the 24 hours of the day, and either print a table with times and counts/rates, or plot a graph of the statistics as a function of time similar to the examples above.

Reflection point: as the analyst your job is to translate research questions into the models that are presented to decision-makers. You will likely receive a broad objective like identifying patterns in traffic accidents, and you will often have a lot of flexibility in how you operationalize the question.

How might strategies for addressing traffic injuries change if you switch from an emphasis on the accidents that are most common to accidents that are most likely to cause harm? These are subtle nuances in how to calculate simple statistics (counts and proportions) over groups, but they can have a big impact on policy-making!

Report your tables or graphcs. You don’t have to include a written response to the reflection point.



CHALLENGE QUESTION (not graded):

Using at most two variables in the dataset to define a group structure and identify:

The most dangerous accident to be involved in (highest rate of harm).

For example, your groups could be teen-agers (group 1: age) that rear-end another driver (group 2: collision type), or drunk-drivers (group 1: alcohol) that hit pedestrians (group 2: driver type), or men (group 1: gender) on Labor Day (group 2: date). Calculate rates of harm for each type of accident, and identify the most harmful case.

There is one constraint: there must be at least five cases of the accident in the dataset. For example, any type of accident involving a 95-year old likely occurs once in the dataset since these drivers are rare. If the driver was injured then that accident type will have a 100% harm rate, but it’s unlikely an accurate representation of the true harm rate because we are generalizing from a single observation.

You can use any variables from the dataset, but you are limit to groups constructed from two variables. Report the most dangerous accident type (most likely to cause harm - i.e. injury or death) that you can identify.



Working with Dates

So far we have worked with character, numeric, logical, and categorical (factor) vectors.

We need to introduce a new type of vector class for this lab, a date variable. Dates are complicated because they must function simultaneously as categorical variables (months of the year) and numeric variables capable of arithmatic (time that passes between two dates). Furthermore, we often want to convert between idiosyncratic date representations, such as a day of a specific month to a day of the week.

They are also complicated because they can be represented in many ways:

When you first read a dataset, they are typically loaded as character vectors:

## [1] "1/10/12 9:04 "  "1/5/12 17:24 "  "1/16/12 19:08 " "1/27/12 14:41 "
## [5] "1/10/12 13:41 " "1/9/12 17:49 "

We can convert dates stored as characters to a special date object by specifying the format using codes understood by the strptime() function:

## [1] "2012-01-10 09:04:00 MST" "2012-01-05 17:24:00 MST"
## [3] "2012-01-16 19:08:00 MST" "2012-01-27 14:41:00 MST"
## [5] "2012-01-10 13:41:00 MST" "2012-01-09 17:49:00 MST"

Now R will recognize that the variable is a date, not a string, and it will be able to do complex day and time manipulations. Note that the format= argument above requires you to tell R what each value represents. In this case, %m represents month, %d represents day, and %y represents year. In the original data they are separated by a back slash, so that’s included in the format argument.

We need to be explicit because dates can be stored as DD-MM-YYYY, MM-DD-YY, YYYY-MM-DD, or any other number of formats. The format argument tells R how to structure the date.

We can now use the format() function to specify how we want the date represented using many common styles. Note that format() will return a character vector, not another date class.

## [1] "09" "17" "19" "14" "13" "17"
## [1] "09" "05" "07" "02" "01" "05"
## [1] "AM" "PM" "PM" "PM" "PM" "PM"
## [1] "01" "01" "01" "01" "01" "01"
## [1] "Jan" "Jan" "Jan" "Jan" "Jan" "Jan"
## [1] "Tuesday"  "Thursday" "Monday"   "Friday"   "Tuesday"  "Monday"
## [1] "Tue" "Thu" "Mon" "Fri" "Tue" "Mon"


Date Formatting Options

We can apply a wide range of formatting options to dates:

  • %a: Abbreviated weekday name in the current locale on this platform. (Also matches full name on input: in some locales there are no abbreviations of names.)

  • %A: Full weekday name in the current locale. (Also matches abbreviated name on input.)

  • %b: Abbreviated month name in the current locale on this platform. (Also matches full name on input: in some locales there are no abbreviations of names.)

  • %B: Full month name in the current locale. (Also matches abbreviated name on input.)

  • %c: Date and time. Locale-specific on output, “%a %b %e %H:%M:%S %Y” on input.

  • %C: Century (00–99): the integer part of the year divided by 100.

  • %d: Day of the month as decimal number 01–31.

  • %D: Date format such as %m/%d/%y: the C99 standard says it should be that exact format, but not all OS’s comply.

  • %e: Day of the month as decimal number 1–31, with a leading space for a single-digit number.

  • %F: Equivalent to %Y-%m-%d the ISO 8601 date format.

  • %g: The last two digits of the week-based year. Accepted but ignored on input.

  • %G: The week-based year as a decimal number. Accepted but ignored on input.

  • %h: Equivalent to %b.

  • %H: Hours as decimal number 00–23. As a special exception strings such as 24:00:00 are accepted for input, since ISO 8601 allows these.

  • %I: Hours as decimal number 01–12.

  • %j: Day of year as decimal number 001–366.

  • %m: Month as decimal number 01–12.

  • %M: Minute as decimal number 00–59.

  • %n: Newline on output, arbitrary whitespace on input.

  • %p: AM/PM indicator in the locale. Used in conjunction with %I and not with %H. An empty string in some locales (and the behaviour is undefined if used for input in such a locale). Some platforms accept %P for output, which uses a lower-case version: others will output P.

  • %r: The 12-hour clock time (using the locale’s AM or PM). Only defined in some locales.

  • %R: Equivalent to %H:%M.

  • %S: Second as integer 00–61, allowing for up to two leap-seconds (but POSIX-compliant implementations will ignore leap seconds).

  • %t: Tab on output, arbitrary whitespace on input.

  • %T: Equivalent to %H:%M:%S.

  • %u: Weekday as a decimal number 1–7, Monday is 1.

  • %U: Week of the year as decimal number 00–53 using Sunday as the first day 1 of the week (and typically with the first Sunday of the year as day 1 of week 1). The US convention.

  • %V: Week of the year as decimal number 01–53 as defined in ISO 8601. If the week (starting on Monday) containing 1 January has four or more days in the new year, then it is considered week 1. Otherwise, it is the last week of the previous year, and the next week is week 1. (Accepted but ignored on input.)

  • %w: Weekday as decimal number 0–6, Sunday is 0.

  • %W: Week of the year as decimal number 00–53 using Monday as the first day of week (and typically with the first Monday of the year as day 1 of week 1). The UK convention.

  • %x: Date. Locale-specific on output, “%y/%m/%d” on input.

  • %X: Time. Locale-specific on output, “%H:%M:%S” on input.

  • %y: Year without century 00–99. On input, values 00 to 68 are prefixed by 20 and 69 to 99 by 19 – that is the behaviour specified by the 2004 and 2008 POSIX standards, but they do also say ‘it is expected that in a future version the default century inferred from a 2-digit year will change’.

  • %Y: Year with century. Note that whereas there was no zero in the original Gregorian calendar, ISO 8601:2004 defines it to be valid (interpreted as 1BC): see https://en.wikipedia.org/wiki/0_(year). Note that the standards also say that years before 1582 in its calendar should only be used with agreement of the parties involved. For input, only years 0:9999 are accepted.

  • %z: Signed offset in hours and minutes from UTC, so -0800 is 8 hours behind UTC. Values up to +1400 are accepted as from R 3.1.1: previous versions only accepted up to +1200. (Standard only for output.)

  • %Z: (Output only.) Time zone abbreviation as a character string (empty if not available). This may not be reliable when a time zone has changed abbreviations over the years.

Creating New Date Variables



If we want to be more precise about crash counts per week within a given year, which is a more intuitive and actionable statistic than summing across all years in the dataset:



Recode Factor Levels

Some of the categorical variables are hard to work with because they have a levels that are small or hard to interpret.

Collisionmanner n
10 3
Rear To Rear 56
Rear To Side 174
Sideswipe Opposite Direction 189
Unknown 345
Head On 348
Other 971
Single Vehicle 1737
Sideswipe Same Direction 3565
ANGLE (Front To Side)(Other Than Left Turn) 4686
Left Turn 5395
Rear End 11001
Collisionmanner n
Head On 348
Single Vehicle 1737
Lane Change 3565
Angle 4686
Left Turn 5395
Rear End 11001

Patterns in types of crashes by time of day:

  Angle Head On Left Turn Rear End Lane Change Single Vehicle
00 71 4 59 83 50 89
01 34 8 32 78 28 87
02 40 7 23 87 34 136
03 21 8 13 35 7 92
04 18 6 23 17 5 50
05 49 7 63 73 39 49
06 127 9 117 213 82 57
07 286 19 359 616 180 69
08 258 13 279 623 192 60
09 212 11 180 360 174 63
10 217 6 188 398 153 51
11 250 12 246 531 203 58
12 333 14 284 729 210 56
13 347 21 325 738 233 54
14 353 19 334 764 295 64
15 379 24 420 998 303 79
16 397 23 582 1200 323 87
17 427 36 619 1285 337 90
18 289 22 421 811 236 76
19 186 19 249 437 143 63
20 135 26 182 316 111 83
21 114 9 177 294 98 74
22 76 13 130 192 76 66
23 67 12 90 123 53 84

Age of Driver

We have a wide range of driver ages:

Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
2 22 31 43.64 51 255 360

This many ages will make our analysis complicated, so it is better to convert the numeric age variable into a categorical age-group variable. We will use the cut() function for this, which accepts a numeric variable and group cut points (the breaks= argument), then returns the proper group label for each age.

These group labels are a little awkward, so let’s improve them a bit by creating our own:

We can now analyze some trends by age group.



Submission Instructions

After you have completed your lab, knit your RMD file. Login to Canvas at http://canvas.asu.edu and navigate to the assignments tab in the course repository. Upload your RMD and your HTML files to the appropriate lab submission link.

Remember to: