Instructions

There are 4 sets of questions. You can earn up to 100 points + bonus questions. Points are indicated next to each question.

Remember to:

Format your table using stargazer
Submit the r markdown file with all your codes.
Bold your answers.

The policy problem

This lab is loosely based on Branas et al. (2011) article, entitled “A Difference-in-Differences Analysis of Health, Safety, and Greening Vacant Urban Space”.

The ‘broken windows’ theory suggests that “vacant lots offer refuge to criminal and other illegal activity and visibly symbolize that a neighborhood has deteriorated, that no one is in control, and that unsafe or criminal behavior is welcome to proceed with little if any supervision” (p. 1297).

To prevent these problems, city A has decided to implement a public program that will green some of the city’s vacant lots in 4 different neighborhoods. ‘Greening’ includes removing trash and debris and planting grass and trees to create a park.

Simulated data on vandalism acts include the total number of calls about vandalism acts that the police has received during the year before and after the greening from households near the vacant lots.

Research question: Does greening vacant lots decreases vandalism in City A?

Three years after its implementation, the city wants to know whether the program has been successful and propose to compare lots that have been ‘greened’ and lots that are still vacants within the same neighborhoods but could have been chosen for greening.

Hypothesis: Greening vacant lots will decrease vandalism in City A.

Data

Data on vandalism acts were reported by the local police and include the total number of calls about vandalism acts that the police has received during the year before and after the greening from households near the vacant lots.

URL <- "https://raw.githubusercontent.com/DS4PS/pe4ps-textbook/master/labs/DATA/diff-in-diff-lab.csv"
data <- read.csv( URL, stringsAsFactors=F )
head( data ) %>% pander()

X	Vandalism	Months	Group	Post
1	1777	0	1	0
2	2079	0	1	0
3	2432	0	0	0
4	2271	0	0	0
5	2726	0	1	0
6	2866	10	0	1

Variable name	Description
Vandalism	Number of calls about vandalism acts near to the vacant lots
Group	Treatment (=1) and control group (=0)
Post_Treatment	Observation three years before the treatment (=0) and three years after the treatment (=1)

Lab Questions

Question 1

Q1: Write down the difference-in-difference model (5 points)

Question 2

Q2: Run the regression in R and present results in a nice table with stargazer (5 + 5 points)

Q2a: What is the average number of calls about vandalism acts in the control group before the treatment? (5 points)
Q2b: Did the number of calls for vandalism in the treatment and control groups differ significantly before the treatment? Which coefficient provides you this information? (5 + 5 points)
Q2c: Do treatment and control groups need to be identical in the pre-treatment period for valid inference? (10 points)
Q2d: Does the number of calls in the control group change after the intervention? Specify if it increases or decreases. (5 points)
Q2e: What is the average number of calls in the control group after the intervention? (5 points)
Q2f: Does greening help to reduce vandalism? Report the coefficient that you need to look. (5 + 5 points)
Q2g: What is the average number of calls in the treatment group after the intervention? (5 points)

Question 3

Q3: Let’s now look at the counterfactual.

Q3a: Write the counterfactual formula. (5 points)
Q3b: Calculate the counterfactual, i.e., how many calls for vandalism acts we would expect, had the intervention not occured. (5 points)
Q3c: What is the main assumption behind the counterfactual? (5 points)

Question 4

Let’s consider the same study design but now let’s compare two different approaches to analyzing the data.

We will use a slightly different dataset, one that contains the monthly number of vandalism crimes that the police has received in the four study neighborhoods over a four-year period.

URL <- "https://raw.githubusercontent.com/DS4PS/pe4ps-textbook/master/labs/DATA/time-series-diff-n-diff-comparison.csv"
dat <- read.csv( URL, stringsAsFactors=F )
rbind( head(dat), tail(dat) ) %>% pander()

Table continues below
	nhood	quadrant	group	month	year	time	treat
1	NH-01	Q-A	treat	1	1994	1	1
2	NH-01	Q-B	control	1	1994	1	0
3	NH-01	Q-C	treat	1	1994	1	1
4	NH-01	Q-D	control	1	1994	1	0
5	NH-01	Q-E	treat	1	1994	1	1
6	NH-01	Q-F	control	1	1994	1	0
1243	NH-04	Q-U	treat	12	1997	48	1
1244	NH-04	Q-V	control	12	1997	48	0
1245	NH-04	Q-W	treat	12	1997	48	1
1246	NH-04	Q-X	control	12	1997	48	0
1247	NH-04	Q-Y	treat	12	1997	48	1
1248	NH-04	Q-Z	control	12	1997	48	0

	post.treat	time.since	crimes	treat.x.post
1	0	0	107	0
2	0	0	96	0
3	0	0	92	0
4	0	0	72	0
5	0	0	89	0
6	0	0	91	0
1243	1	24	257	1
1244	1	24	208	0
1245	1	24	190	1
1246	1	24	192	0
1247	1	24	214	1
1248	1	24	196	0

We have 48 months of data, and the vacant lot clean-up occurred between the 24th and 25th months. The mayor asks an analyst to evaluate the impact of the intervention.

Here is what our data looks like with red points representing the treatment group and blue representing the control:

dat$group <- factor( dat$group )
plot( dat$time, dat$crime, pch=19, col=dat$group, bty="n",
      xlab="Time (months)", ylab="Number of Crimes per Quadrant" )
abline( v=24.5, lty=3, col="gray", lwd=3 )

We can simplify the graph by calculating the average crime rate across all quadrants in the study within each time period:

ts1 <- 
  dat %>% 
  filter( group == "control" ) %>%
  group_by( time ) %>% 
  summarize( ave=mean( crimes ) )

ts2 <- 
  dat %>% 
  filter( group == "treat" ) %>%
  group_by( time ) %>% 
  summarize( ave=mean( crimes ) )

y.24 <- ts2$ave[24]
y.25 <- ts2$ave[25]
diff <- y.24 - y.25 

plot( ts1$time, ts1$ave, type="b", pch=19, col="steelblue", bty="n",
      xlab="Time (months)", ylab="Number of Crimes", 
      xlim=c(0,50), ylim=c(60,225) )
points( ts2$time, ts2$ave, type="b", pch=19, col="firebrick" ) 
abline( v=c(12,36,48), lty=2, col="gray" )
abline( v=24.5, lty=3, col="gray", lwd=3 )
abline( h=c(y.24,y.25), lty=3, col="firebrick" )

arrows( x0=38, y0=y.24, y1=y.25, col="firebrick", code=3 )
text( x=40, y=mean(c(y.24,y.25)), 
      paste0( "Diff ~ ", round(diff,0) ), 
      col="firebrick", pos=4, cex=1.4 )

Note that there appears to be an immediate impact of the program, but over time we also observe what appears to be a regression to the mean level of vandalism in the treatment quadrants.

To estimate program impact the analyst runs the follow two models:

Diff-in-Diff

\[ \text{crimes} = a_0 + a_1 \cdot treat + a_2 \cdot post + a_3 \cdot treat.post + e \]

m1 <- lm( crimes ~ treat + post.treat + treat.x.post, data=dat )
stargazer( m1, type="html", digits=2,
           omit.stat=c("f","rsq","adj.rsq","ser"),
           intercept.bottom = FALSE,
           covariate.labels = c("Intercept (a0)", 
                                "Treatment Group (a1)",
                                "Post-Treatment Period (a2)", 
                                "Treat x Post (a3)" ) )


	Dependent variable:

	crimes

Intercept (a0)	110.53^***
	(1.43)

Treatment Group (a1)	20.51^***
	(2.03)

Post-Treatment Period (a2)	64.85^***
	(2.03)

Treat x Post (a3)	-23.09^***
	(2.87)


Observations	1,248

Note:	p<0.1; p<0.05; p<0.01

C1 <- mean( dat$crimes[ dat$year == 1994 & dat$group == "control" ] )
C2 <- mean( dat$crimes[ dat$year == 1995 & dat$group == "control" ] )
C3 <- mean( dat$crimes[ dat$year == 1996 & dat$group == "control" ] )
C4 <- mean( dat$crimes[ dat$year == 1997 & dat$group == "control" ] )

T1 <- mean( dat$crimes[ dat$year == 1994 & dat$group == "treat" ] )
T2 <- mean( dat$crimes[ dat$year == 1995 & dat$group == "treat" ] )
T3 <- mean( dat$crimes[ dat$year == 1996 & dat$group == "treat" ] )
T4 <- mean( dat$crimes[ dat$year == 1997 & dat$group == "treat" ] )

dd.C1 <- mean( dat$crimes[ dat$post == 0 & dat$group == "control" ] )
dd.C2 <- mean( dat$crimes[ dat$post == 1 & dat$group == "control" ] )

dd.T1 <- mean( dat$crimes[ dat$post == 0 & dat$group == "treat" ] )
dd.T2 <- mean( dat$crimes[ dat$post == 1 & dat$group == "treat" ] )

plot( ts1$time, ts1$ave, type="b", pch=19, col="steelblue", bty="n",
      xlab="Time (months)", ylab="Number of Crimes", 
      main="Difference-in-Difference Group Means",
      xlim=c(0,50), ylim=c(60,225) )
points( ts2$time, ts2$ave, type="b", pch=19, col="firebrick" ) 
abline( v=c(12,36,48), lty=2, col="gray" )
abline( v=24.5, lty=3, col="gray", lwd=3 )
abline( h=c(y.24,y.25), lty=3, col="firebrick" )

points( c(6.5,18.5,30.5,42.5), c(C1,C2,C3,C4), 
        pch=19, col=adjustcolor("steelblue",alpha=0.5), cex=2 )
points( c(6.5,18.5,30.5,42.5), c(T1,T2,T3,T4), 
        pch=19, col=adjustcolor("firebrick",alpha=0.5), cex=2 )

points( c(12.5,36.5), c(dd.C1,dd.C2+3), col="steelblue", pch=19, cex=4 )
points( c(12.5,36.5), c(dd.T1,dd.T2-3), col="firebrick", pch=19, cex=4 )

Interrupted Time Series

\[ \text{crimes} = b_0 + b_1 \cdot time + b_2 \cdot post + b_3 \cdot time.since + e \]

Note we only need the treatment group for this model since it is a reflexive design:

d2 <- filter( dat, group == "treat" )
 
m2 <- lm( crimes ~ time + post.treat + time.since, data=d2 )

plot( d2$time, d2$crimes, pch=19, 
      col=adjustcolor("firebrick",alph=0.5), cex=1.5, bty="n",
      xlab="Time (months)", 
      ylab="Count of Crimes Per Quadrant", 
      xlim=c(0,50) )
abline( v=c(12,36,48), lty=2, col="gray" )
abline( v=24.5, lty=3, col="gray", lwd=3 )
points( d2$time, m2$fitted.values, pch=19, type="l", lwd=3 )

stargazer( m2, type="html", digits=2,
           omit.stat=c("f","rsq","adj.rsq","ser"),
           intercept.bottom = FALSE,
           covariate.labels = c("Intercept (b0)", 
                                "Time in Month (b1)",
                                "Post-Treatment Period (b2)", 
                                "Months Since Treatment (b3)" ))


	Dependent variable:

	crimes

Intercept (b0)	97.71^***
	(1.91)

Time in Month (b1)	2.67^***
	(0.13)

Post-Treatment Period (b2)	-36.43^***
	(2.62)

Months Since Treatment (b3)	1.14^***
	(0.19)


Observations	624

Note:	p<0.1; p<0.05; p<0.01

Questions:

Q4a: What is the size of the program impact according to the diff-in-diff model? (5 points)
Q4b: What is the size of the immediate program impact according to the interrupted time series model? (5 points)
Q4c: How are the post-treatment coefficients conceptually different in these two models (a2 and b2)? (4 points)
Q4d: Which coefficient in each model captures the secular trend in the data (changes we can expect independent of the treatment)? (4 points)
Q4e: Do the two models make the same predictions regarding the size of secular trends (approximately)? Show your math. (2 points)

Not pertinent to the questions above but useful to note, you can run an interrupted time series with a comparison group as follows:

m3 <- lm( crimes ~ time + treat + post.treat*treat + 
            time.since + time.since*treat, data=dat )

plot( dat$time, m3$fitted.values, 
      col=dat$group, pch=19, bty="n",
      xlab="Time", ylab="Crimes" )

How would you interpret the results of this model? Which coefficient reprents the immediate effects? And which represents sustained effects?

stargazer( m3, type="html", digits=2,
           omit.stat=c("f","rsq","adj.rsq","ser"),
           intercept.bottom = FALSE )


	Dependent variable:

	crimes

Constant	77.14^***
	(1.34)

time	2.67^***
	(0.08)

treat	20.51^***
	(1.17)

post.treat	0.77
	(2.12)

time.since	-0.003
	(0.15)

treat:post.treat	-37.26^***
	(2.67)

treat:time.since	1.13^***
	(0.17)


Observations	1,248

Note:	p<0.1; p<0.05; p<0.01

BONUS QUESTIONS

BQ1: Explain why the difference-in-difference model reports a smaller program impact (a3) than the post-treatment coefficient (b2) in the time-series model. (3 points)
BQ2: Is a3 to b2 a fair comparison? Construct a treatment group prediction from the interrupted time series regression that would be a better comparison for a3. Show your math. (4 points)
BQ3: Assuming you have the data to run either model, which of these is a more convincing counter-factual, in your opinion? (3 points)

CHALLENGE QUESTION

How would you test the parallel lines assumption for the difference-in-difference model with this data? Write the code to create the appropriate data subset and model. Report your results.

The parallel lines assumption must be met for the difference-in-difference approach to produce unbiased estimates of program impact.

Does the interrupted time series have a similar assumption? Why or why not? Hint, what type of research design estimator does the ITS operationalize?

plot( dat$time, m3$fitted.values, 
      col=dat$group, pch=19, bty="n",
      xlab="Time", ylab="Crimes" )

Submission Instructions

When you have completed your assignment, knit your RMD file to generate your rendered HTML file.

Login to Canvas at http://canvas.asu.edu and navigate to the assignments tab in the course repository. Upload your HTML and RMD files to the appropriate lab submission link.