There are 4 sets of questions. You can earn up to 100 points + bonus questions. Points are indicated next to each question.
Remember to:
This lab is loosely based on Branas et al. (2011) article, entitled “A Difference-in-Differences Analysis of Health, Safety, and Greening Vacant Urban Space”.
The ‘broken windows’ theory suggests that “vacant lots offer refuge to criminal and other illegal activity and visibly symbolize that a neighborhood has deteriorated, that no one is in control, and that unsafe or criminal behavior is welcome to proceed with little if any supervision” (p. 1297).
To prevent these problems, city A has decided to implement a public program that will green some of the city’s vacant lots in 4 different neighborhoods. ‘Greening’ includes removing trash and debris and planting grass and trees to create a park.
Simulated data on vandalism acts include the total number of calls about vandalism acts that the police has received during the year before and after the greening from households near the vacant lots.
Research question: Does greening vacant lots decreases vandalism in City A?
Three years after its implementation, the city wants to know whether the program has been successful and propose to compare lots that have been ‘greened’ and lots that are still vacants within the same neighborhoods but could have been chosen for greening.
Hypothesis: Greening vacant lots will decrease vandalism in City A.
Data on vandalism acts were reported by the local police and include the total number of calls about vandalism acts that the police has received during the year before and after the greening from households near the vacant lots.
URL <- "https://raw.githubusercontent.com/DS4PS/pe4ps-textbook/master/labs/DATA/diff-in-diff-lab.csv"
data <- read.csv( URL, stringsAsFactors=F )
head( data ) %>% pander()
X | Vandalism | Months | Group | Post |
---|---|---|---|---|
1 | 1777 | 0 | 1 | 0 |
2 | 2079 | 0 | 1 | 0 |
3 | 2432 | 0 | 0 | 0 |
4 | 2271 | 0 | 0 | 0 |
5 | 2726 | 0 | 1 | 0 |
6 | 2866 | 10 | 0 | 1 |
Variable name | Description |
---|---|
Vandalism | Number of calls about vandalism acts near to the vacant lots |
Group | Treatment (=1) and control group (=0) |
Post_Treatment | Observation three years before the treatment (=0) and three years after the treatment (=1) |
Q1: Write down the difference-in-difference model (5 points)
Q2: Run the regression in R and present results in a nice table with stargazer (5 + 5 points)
Q3: Let’s now look at the counterfactual.
Let’s consider the same study design but now let’s compare two different approaches to analyzing the data.
We will use a slightly different dataset, one that contains the monthly number of vandalism crimes that the police has received in the four study neighborhoods over a four-year period.
URL <- "https://raw.githubusercontent.com/DS4PS/pe4ps-textbook/master/labs/DATA/time-series-diff-n-diff-comparison.csv"
dat <- read.csv( URL, stringsAsFactors=F )
rbind( head(dat), tail(dat) ) %>% pander()
nhood | quadrant | group | month | year | time | treat | |
---|---|---|---|---|---|---|---|
1 | NH-01 | Q-A | treat | 1 | 1994 | 1 | 1 |
2 | NH-01 | Q-B | control | 1 | 1994 | 1 | 0 |
3 | NH-01 | Q-C | treat | 1 | 1994 | 1 | 1 |
4 | NH-01 | Q-D | control | 1 | 1994 | 1 | 0 |
5 | NH-01 | Q-E | treat | 1 | 1994 | 1 | 1 |
6 | NH-01 | Q-F | control | 1 | 1994 | 1 | 0 |
1243 | NH-04 | Q-U | treat | 12 | 1997 | 48 | 1 |
1244 | NH-04 | Q-V | control | 12 | 1997 | 48 | 0 |
1245 | NH-04 | Q-W | treat | 12 | 1997 | 48 | 1 |
1246 | NH-04 | Q-X | control | 12 | 1997 | 48 | 0 |
1247 | NH-04 | Q-Y | treat | 12 | 1997 | 48 | 1 |
1248 | NH-04 | Q-Z | control | 12 | 1997 | 48 | 0 |
post.treat | time.since | crimes | treat.x.post | |
---|---|---|---|---|
1 | 0 | 0 | 107 | 0 |
2 | 0 | 0 | 96 | 0 |
3 | 0 | 0 | 92 | 0 |
4 | 0 | 0 | 72 | 0 |
5 | 0 | 0 | 89 | 0 |
6 | 0 | 0 | 91 | 0 |
1243 | 1 | 24 | 257 | 1 |
1244 | 1 | 24 | 208 | 0 |
1245 | 1 | 24 | 190 | 1 |
1246 | 1 | 24 | 192 | 0 |
1247 | 1 | 24 | 214 | 1 |
1248 | 1 | 24 | 196 | 0 |
We have 48 months of data, and the vacant lot clean-up occurred between the 24th and 25th months. The mayor asks an analyst to evaluate the impact of the intervention.
Here is what our data looks like with red points representing the treatment group and blue representing the control:
dat$group <- factor( dat$group )
plot( dat$time, dat$crime, pch=19, col=dat$group, bty="n",
xlab="Time (months)", ylab="Number of Crimes per Quadrant" )
abline( v=24.5, lty=3, col="gray", lwd=3 )
We can simplify the graph by calculating the average crime rate across all quadrants in the study within each time period:
ts1 <-
dat %>%
filter( group == "control" ) %>%
group_by( time ) %>%
summarize( ave=mean( crimes ) )
ts2 <-
dat %>%
filter( group == "treat" ) %>%
group_by( time ) %>%
summarize( ave=mean( crimes ) )
y.24 <- ts2$ave[24]
y.25 <- ts2$ave[25]
diff <- y.24 - y.25
plot( ts1$time, ts1$ave, type="b", pch=19, col="steelblue", bty="n",
xlab="Time (months)", ylab="Number of Crimes",
xlim=c(0,50), ylim=c(60,225) )
points( ts2$time, ts2$ave, type="b", pch=19, col="firebrick" )
abline( v=c(12,36,48), lty=2, col="gray" )
abline( v=24.5, lty=3, col="gray", lwd=3 )
abline( h=c(y.24,y.25), lty=3, col="firebrick" )
arrows( x0=38, y0=y.24, y1=y.25, col="firebrick", code=3 )
text( x=40, y=mean(c(y.24,y.25)),
paste0( "Diff ~ ", round(diff,0) ),
col="firebrick", pos=4, cex=1.4 )
Note that there appears to be an immediate impact of the program, but over time we also observe what appears to be a regression to the mean level of vandalism in the treatment quadrants.
To estimate program impact the analyst runs the follow two models:
Diff-in-Diff
\[ \text{crimes} = a_0 + a_1 \cdot treat + a_2 \cdot post + a_3 \cdot treat.post + e \]
m1 <- lm( crimes ~ treat + post.treat + treat.x.post, data=dat )
stargazer( m1, type="html", digits=2,
omit.stat=c("f","rsq","adj.rsq","ser"),
intercept.bottom = FALSE,
covariate.labels = c("Intercept (a0)",
"Treatment Group (a1)",
"Post-Treatment Period (a2)",
"Treat x Post (a3)" ) )
Dependent variable: | |
crimes | |
Intercept (a0) | 110.53*** |
(1.43) | |
Treatment Group (a1) | 20.51*** |
(2.03) | |
Post-Treatment Period (a2) | 64.85*** |
(2.03) | |
Treat x Post (a3) | -23.09*** |
(2.87) | |
Observations | 1,248 |
Note: | p<0.1; p<0.05; p<0.01 |
C1 <- mean( dat$crimes[ dat$year == 1994 & dat$group == "control" ] )
C2 <- mean( dat$crimes[ dat$year == 1995 & dat$group == "control" ] )
C3 <- mean( dat$crimes[ dat$year == 1996 & dat$group == "control" ] )
C4 <- mean( dat$crimes[ dat$year == 1997 & dat$group == "control" ] )
T1 <- mean( dat$crimes[ dat$year == 1994 & dat$group == "treat" ] )
T2 <- mean( dat$crimes[ dat$year == 1995 & dat$group == "treat" ] )
T3 <- mean( dat$crimes[ dat$year == 1996 & dat$group == "treat" ] )
T4 <- mean( dat$crimes[ dat$year == 1997 & dat$group == "treat" ] )
dd.C1 <- mean( dat$crimes[ dat$post == 0 & dat$group == "control" ] )
dd.C2 <- mean( dat$crimes[ dat$post == 1 & dat$group == "control" ] )
dd.T1 <- mean( dat$crimes[ dat$post == 0 & dat$group == "treat" ] )
dd.T2 <- mean( dat$crimes[ dat$post == 1 & dat$group == "treat" ] )
plot( ts1$time, ts1$ave, type="b", pch=19, col="steelblue", bty="n",
xlab="Time (months)", ylab="Number of Crimes",
main="Difference-in-Difference Group Means",
xlim=c(0,50), ylim=c(60,225) )
points( ts2$time, ts2$ave, type="b", pch=19, col="firebrick" )
abline( v=c(12,36,48), lty=2, col="gray" )
abline( v=24.5, lty=3, col="gray", lwd=3 )
abline( h=c(y.24,y.25), lty=3, col="firebrick" )
points( c(6.5,18.5,30.5,42.5), c(C1,C2,C3,C4),
pch=19, col=adjustcolor("steelblue",alpha=0.5), cex=2 )
points( c(6.5,18.5,30.5,42.5), c(T1,T2,T3,T4),
pch=19, col=adjustcolor("firebrick",alpha=0.5), cex=2 )
points( c(12.5,36.5), c(dd.C1,dd.C2+3), col="steelblue", pch=19, cex=4 )
points( c(12.5,36.5), c(dd.T1,dd.T2-3), col="firebrick", pch=19, cex=4 )
Interrupted Time Series
\[ \text{crimes} = b_0 + b_1 \cdot time + b_2 \cdot post + b_3 \cdot time.since + e \]
Note we only need the treatment group for this model since it is a reflexive design:
d2 <- filter( dat, group == "treat" )
m2 <- lm( crimes ~ time + post.treat + time.since, data=d2 )
plot( d2$time, d2$crimes, pch=19,
col=adjustcolor("firebrick",alph=0.5), cex=1.5, bty="n",
xlab="Time (months)",
ylab="Count of Crimes Per Quadrant",
xlim=c(0,50) )
abline( v=c(12,36,48), lty=2, col="gray" )
abline( v=24.5, lty=3, col="gray", lwd=3 )
points( d2$time, m2$fitted.values, pch=19, type="l", lwd=3 )
stargazer( m2, type="html", digits=2,
omit.stat=c("f","rsq","adj.rsq","ser"),
intercept.bottom = FALSE,
covariate.labels = c("Intercept (b0)",
"Time in Month (b1)",
"Post-Treatment Period (b2)",
"Months Since Treatment (b3)" ))
Dependent variable: | |
crimes | |
Intercept (b0) | 97.71*** |
(1.91) | |
Time in Month (b1) | 2.67*** |
(0.13) | |
Post-Treatment Period (b2) | -36.43*** |
(2.62) | |
Months Since Treatment (b3) | 1.14*** |
(0.19) | |
Observations | 624 |
Note: | p<0.1; p<0.05; p<0.01 |
Questions:
Not pertinent to the questions above but useful to note, you can run an interrupted time series with a comparison group as follows:
m3 <- lm( crimes ~ time + treat + post.treat*treat +
time.since + time.since*treat, data=dat )
plot( dat$time, m3$fitted.values,
col=dat$group, pch=19, bty="n",
xlab="Time", ylab="Crimes" )
How would you interpret the results of this model? Which coefficient reprents the immediate effects? And which represents sustained effects?
stargazer( m3, type="html", digits=2,
omit.stat=c("f","rsq","adj.rsq","ser"),
intercept.bottom = FALSE )
Dependent variable: | |
crimes | |
Constant | 77.14*** |
(1.34) | |
time | 2.67*** |
(0.08) | |
treat | 20.51*** |
(1.17) | |
post.treat | 0.77 |
(2.12) | |
time.since | -0.003 |
(0.15) | |
treat:post.treat | -37.26*** |
(2.67) | |
treat:time.since | 1.13*** |
(0.17) | |
Observations | 1,248 |
Note: | p<0.1; p<0.05; p<0.01 |
How would you test the parallel lines assumption for the difference-in-difference model with this data? Write the code to create the appropriate data subset and model. Report your results.
The parallel lines assumption must be met for the difference-in-difference approach to produce unbiased estimates of program impact.
Does the interrupted time series have a similar assumption? Why or why not? Hint, what type of research design estimator does the ITS operationalize?
When you have completed your assignment, knit your RMD file to generate your rendered HTML file.
Login to Canvas at http://canvas.asu.edu and navigate to the assignments tab in the course repository. Upload your HTML and RMD files to the appropriate lab submission link.
Remember to:
Platforms like Canvas and Blackboard sometimes disallow you from submitting HTML files when there is embedded computer code. If this happens create a zipped folder with both the RMD and HTML files.