Packages
The following packages are used in this lecture.
library( "scales" ) # variable transforms, formatting tables
library( "stargazer" ) # formatting regression output
library( "dplyr" ) # data wrangling
library( "pander" ) # nice tables
library( "Wats" ) # time series plots
They can be installed from the following locations
# CRAN
install.packages( "dplyr" )
install.packages( "pander" )
install.packages( "stargazer" )
install.packages( "scales" )
# GITHUB
devtools::install_github( repo="OuhscBbmc/Wats" )
What should be clear after this lecture
Interrupted time series can be used when:
Your data need to include observations before and after the intervention had occured. The more observations you have on both sides of the intervention the more robust your model will be (typically).
Examples include:
Data | Research question |
---|---|
Yearly data on marijuana consumption in a state | Does marijuana consumption increase after the state passes a law the legalize marijuana use? |
Monthly employment data on local businesses in a city | Does a raise in the minimum wage affect local levels of employment? |
Number of books read monthly by students in a school | Does the opening of a new library within the school affect the number of books read by students? |
The dependent variable in these models will sometimes be a population average. For example, the GDP of a country, the crime rate in a city, the average math score in a school.
Unlike most other policy models that require a lot of controls, these models are nice in that they can generate valid inferences in certain circumstances with only two independent variables - time and the start date of the program.
For our first example, let’s assume we have data on the total consumption of tobacco in a US state from a point in time (t0) to ten years later (t10). Our observations are represented in the graph 1.1 below. The line suggests an upward trend, with state residents consuming more tobacco over time.
Now let’s imagine that the policymakers decide to introduce a new tax on tobacco at the end of year t10 to reduce its consumption. What are the possible outcomes of this intervention?
We can represent some possible outcomes in figures 1.2 - 1.5.
In figure 1.2, the tax has no effect on tobacco consumption, which continues to increase at the same trend as before.
In figure 1.3, the tax reduces tobacco consumption right after the policy intervention but the upward trend continues over time. There is an immediate policy effect but no sustained effect.
In figure 1.4, the tax does not reduce tobacco consumption immediately, but it shifts the trend downward in the long term. There is no immediate policy effect but we observe a sustained effect of the policy.
Finally, we can also imagine that the tax decreases tobacco consumption immediately after AND shifts the trend downward. Figure 1.5 shows both an immediate and a sustained policy effect.
Interrupted time series enable us to investigate each effect, thereby understanding if the policy intervention:
In mathematical terms, it means that the time series equation includes four key coefficients:
\[\begin{equation} \text{Y} = \text{b}_0 + \text{b}_1T + \text{b}_2D + \text{b}_3P + \text{e} \tag{1.1} \end{equation}\]
Where:
\(\text{Y}\) is the outcome variable;
\(\text{T}\) is a continuous variable which indicates the time (e.g., days, months, years…) passed from the start of the observational period;
\(\text{D}\) is a dummy variable indicating observation collected before (=0) or after (=1) the policy intervention;
\(\text{P}\) is a continuous variable indicating time passed since the intervention has occured (before intervention has occured P is equal to 0).
Our dataset will look like this:
We can graphically illustrated each coefficient and gradually build equation (1.1).
In figure 1.7, \(\text{b}_0\) is the baseline level of the outcome (the intercept or constant) and \(\text{b}_1\) is the slope of the line before the intervention (green line).
\[\begin{equation} \text{Y} = \text{b}_0 + \text{b}_1T + \text{e} \tag{1.2} \end{equation}\]
In figure 1.8, \(\text{b}_2\) is the immediate effect that occurs after the intervention. It represents how the outcome level has changed from the last observation before the intervention to the first one after. In this case the slope of the line has not changed, and the lines after and before the intervention are parallel. Their slope is equal to \(\text{b}_1\).
\[\begin{equation} \text{Y} = \text{b}_0 + \text{b}_1T + \text{b}_2D + \text{e} \tag{1.3} \end{equation}\]
In figure 1.9, the slope has changed after the intervention. \(\text{b}_3\) is the difference between the slope of the line before and the slope of the line after the intervention. It represents the sustained effect of the policy intervention. Note that \(\text{b}_2\) = 0, meaning that there has been no immediate effect of the policy.
\[\begin{equation} \text{Y} = \text{b}_0 + \text{b}_1T + \text{b}_3P + \text{e} \tag{1.4} \end{equation}\]
Finally, in figure 1.10, we have both an immediate and sustained effect and \(\text{b}_2\) and \(\text{b}_3\) are both different from zero.
\[\begin{equation} \text{Y} = \text{b}_0 + \text{b}_1T + \text{b}_2D + \text{b}_3P + \text{e} \tag{1.1} \end{equation}\]
In time series, it is important to understand the counterfactual. The counterfactual refers to what it would have occured to Y, had the policy intervention not happened.
Let’s take a case where our policy intervention had both an immediate and a sustained effect, as shown in figure 1.11. The green line is represented by the equation (1.1), where \(\beta{b}_2\) and \(\beta{b}_3\) are both different from 0.
\[\begin{equation} \text{Y} = \text{b}_0 + \text{b}_1T + \text{b}_2D + \text{b}_3P + \text{e} \tag{1.1} \end{equation}\]
To calculate the counterfactual, we need to assume that the intervention has never occured and there has been no immediate nor sustained effect. In other words, if the policy intervention had not happenend, the green line would have continued as shown by the dotted gray line. Our equation (1.1) would become:
\[\begin{equation} \text{Y} = \text{b}_0 + \text{b}_1T + \text{b}_2*0 + \text{b}_3*0 + \text{e} \tag{1.5} \end{equation}\]
Or just:
\[\begin{equation} \text{Y} = \text{b}_0 + \text{b}_1T + \text{e} \tag{1.6} \end{equation}\]
We can calculate the counterfactual for each point in time. For instance, we look at t = 12. The predicted Y is calculated in the equation (1.7). The predicted Y for t = 12 falls on the green line in figure 1.12. Its counterfactual is calculcated in equation (1.8). It is located on the dotted line, perpendicularly above the predicted Y.
\[\begin{equation} \text{Y} = \text{b}_0 + \text{b}_1*12 + \text{b}_2*D*1 + \text{b}_3*P*2 + \text{e} \tag{1.7} \end{equation}\]
\[\begin{equation} \text{Y} = \text{b}_0 + \text{b}_1*12 + \text{b}_2*D*0 + \text{b}_3*P*0 + \text{e} \tag{1.8} \end{equation}\]
The counterfactual changes for each point in time. Let’s look at t = 18.
\[\begin{equation} \text{Y} = \text{b}_0 + \text{b}_1*18 + \text{b}_2*D*1 + \text{b}_3*P*8 + \text{e} \tag{1.9} \end{equation}\]
\[\begin{equation} \text{Y} = \text{b}_0 + \text{b}_1*18 + \text{b}_2*D*0 + \text{b}_3*P*0 + \text{e} \tag{1.10} \end{equation}\]
We will apply these concepts in the following working example.
We are going to have a look at a simulated dataset describing the wellbeing of a class of students. Students’ wellbeing is an increasingly discussed topic in policy debates and some states are introducing mental health education classes (even some universities).
The dataset contains 365 daily observations of the wellbeing of a class of students. Wellbeing is measured by a index from 0 to 300. At t = 201 students will start attend a mental health education class. We want to understand how (and if) their wellbeing improves.
# load data from GitHub
URL <- "https://github.com/DS4PS/pe4ps-textbook/blob/master/data/time-series-example.rds?raw=true"
dataTS <- readRDS(gzcon(url( URL )))
First, we have a look at how the dataset is structured using the head and tail function. We can also look at rows 198-204 where the interruption occurs.
dataTS$Y <- round(dataTS$Y,1)
d.temp <- rbind( head(dataTS), c("...","...","...","..."),
dataTS[ 198:200, ],
c("**START**","**TREATMENT**","---","---"),
dataTS[ 201:204, ],
c("...","...","...","..."),
tail(dataTS) )
row.names( d.temp ) <- NULL
pander( d.temp )
Y | T | D | P |
---|---|---|---|
38.5 | 1 | 0 | 0 |
48.6 | 2 | 0 | 0 |
102.4 | 3 | 0 | 0 |
58.1 | 4 | 0 | 0 |
60 | 5 | 0 | 0 |
107.8 | 6 | 0 | 0 |
… | … | … | … |
60 | 198 | 0 | 0 |
79.4 | 199 | 0 | 0 |
62.4 | 200 | 0 | 0 |
START | TREATMENT | — | — |
173.1 | 201 | 1 | 1 |
147.3 | 202 | 1 | 2 |
100.7 | 203 | 1 | 3 |
125.7 | 204 | 1 | 4 |
… | … | … | … |
300 | 360 | 1 | 160 |
217.5 | 361 | 1 | 161 |
243.9 | 362 | 1 | 162 |
233.3 | 363 | 1 | 163 |
256.6 | 364 | 1 | 164 |
251.1 | 365 | 1 | 165 |
plot( dataTS$T, dataTS$Y,
bty="n", pch=19, col="gray",
ylim = c(0, 300), xlim=c(0,400),
xlab = "Time (days)",
ylab = "Wellbeing Index" )
# Line marking the interruption
abline( v=200, col="firebrick", lty=2 )
text( 200, 300, "Start of Wellbeing Classes", col="firebrick", cex=1.3, pos=4 )
# Add the regression line
ts <- lm( Y ~ T + D + P, data=dataTS )
lines( dataTS$T, ts$fitted.values, col="steelblue", lwd=2 )
The dataset contains four variables:
Column | Variable name | Description |
---|---|---|
\(\text{Y}\) | \(\text{Wellbeing}\) | Wellbeing index (from 0 to 300) |
\(\text{T}\) | \(\text{Time}\) | Time (from 1 to 365) |
\(\text{D}\) | \(\text{Treatment}\) | Observation post (=1) and pre (=0) intervention |
\(\text{P}\) | \(\text{Time Since Treatment}\) | Time passed since the intervention |
Our model is based on the equation (1.1):
\[\begin{equation} \text{Y} = \text{b}_0 + \text{b}_1*Time + \text{b}_2*Treatment + \text{b}_3*Time Since Treatment + \text{e} \tag{2.1} \end{equation}\]
We can run the model using the lm function in R.
regTS = lm ( Y ~ T + D + P) # Our time series model
stargazer( regTS,
type = "html",
dep.var.labels = ("Wellbeing"),
column.labels = ("Model results"),
covariate.labels = c("Time", "Treatment",
"Time Since Treatment"),
omit.stat = "all",
digits = 2 )
Dependent variable: | |
Wellbeing | |
Model results | |
Time | 0.19^{***} |
(0.04) | |
Treatment | 13.09^{**} |
(6.12) | |
Time Since Treatment | 0.54^{***} |
(0.06) | |
Constant | 57.15^{***} |
(4.13) | |
Note: | ^{}p<0.1; ^{}p<0.05; ^{}p<0.01 |
Let’s interpret our coefficients:
A useful exercise is to calculate outcomes at different points in time as we did in section 1.3. For instance, we can calculate the outcome right after the intervention, which occured at time = 200. Note that while \(\text{Time}\) = 201, \(\text{Time Since Treatment}\) is equal to 1 because it is the first day after the intervention.
\[\begin{equation} \text{Y} = \text{b}_0 + \text{b}_1*201 + \text{b}_2*1 + \text{b}_3*1 + \text{e} \tag{2.2} \end{equation}\]
We can also represent the point on a graph:
# We create a small dataset with the new values
data1 <- as.data.frame( cbind( T = 201, D = 1, P = 1 ))
# We use the function predict to (1) take the
# coefficients estimated in regTS and
# (2) calculate the outcome Y based on the
# values we set in the new datset
y1 <- predict( regTS, data1 )
# We plot our initial observations, the column Y in our dataset
plot( Y,
bty="n",
col = gray(0.5,0.5), pch=19,
xlim = c(1, 365),
ylim = c(0, 300),
xlab = "Time (days)",
ylab = "Wellbeing Index")
# We add a point showing the level of wellbeing at time = 201)
points( 201, y1, col = "dodgerblue4",
pch = 19, bg = "dodgerblue4", cex = 2 )
text( 201, y1, labels = "t = 201", pos = 4, cex = 1 )
# Line marking the interruption
abline( v=200, col="red", lty=2 )
Now, we can calculate the outcome 30 days after the intervention. \(\text{Time}\) will be equal to 230 while \(\text{Time Since Treatment}\) will be equal to 30. \(\text{Treatment}\) is always equal to 1 since it is a dummy variable indicating pre or post intervention.
\[\begin{equation} \text{Y} = \text{b}_0 + \text{b}_1*230 + \text{b}_2*1 + \text{b}_3*30 + \text{e} \tag{2.3} \end{equation}\]
We can add the new point at t = 230 to the graph:
data2 <- as.data.frame( cbind( T = 230, D = 1, P = 30 )) # New data
y2 <- predict( regTS, data2 ) # We predict the new outcome
# We plot our initial observations, the column Y in our dataset
plot( Y,
bty="n",
col = gray(0.5,0.5), pch=19,
xlim = c(1, 365),
ylim = c(0, 300),
xlab = "Time (days)",
ylab = "Wellbeing Index")
# We add a point showing the level of wellbeing at time = 201
points(201, y1, col = "dodgerblue4", pch = 19, bg = "red", cex = 2)
# We add a new point showing the level of wellbeing at time = 230)
points(230, y2, col = "dodgerblue4", pch = 19, bg = "red", cex = 2)
# Label for our predicted outcome
text(201, y1, labels = "t = 201", pos = 4, cex = 1)
#Label for the counterfactual
text(230, y2, labels = "t = 230", pos = 4, cex = 1)
# Line marking the interruption
abline( v=200, col="red", lty=2 )
Using the same mechanism illustrated in section 1.3 and 2.1, we can calculate the counterfactual for any point in time. The counterfactual at \(\text{Time}\) = 230 is the level of wellbeing at that point in time if the intervention had not occured.
Note the changes in our equation (2.3): \(\text{Treatment}\) and \(\text{Time Since Treatment}\) are both equal to 0.
\[\begin{equation} \text{Y} = \text{b}_0 + \text{b}_1*230 + \text{b}_2*0 + \text{b}_3*0 + \text{e} \tag{2.4} \end{equation}\]
As before, we plot the results on a graph. Y is our predict value and C is the counterfactual.
data3 <- as.data.frame(cbind( T= 230, D = 0, P = 0)) # Data if the intervention does not occur
y3 <- predict(regTS, data3) #Counterfactual
# We plot our initial observations, the column Y in our dataset
plot( Y,
bty="n",
col = gray(0.5,0.5), pch=19,
xlim = c(1, 365),
ylim = c(0, 300),
xlab = "Time (days)",
ylab = "Wellbeing Index")
# We add a point showing the level of wellbeing at time = 230
points(230, y2, col = "dodgerblue4", pch = 19, bg = "red", cex = 2)
# We add a point indicating the counterfactual
points(230, y3, col = "darkorange2", pch = 19, cex = 2)
# Label for our predicted outcome
text(230, y2, labels = "Y at t = 230", pos = 4, cex = 1)
#Label for the counterfactual
text(230, y3, labels = "C at t = 230", pos = 4, cex = 1)
# Line marking the interruption
abline( v=200, col="red", lty=2 )
It is important to note the counterfactual varies for each point in time. We can clearly see this if we plot a new point at time = 320 and its counterfactual.
data4 <- as.data.frame(cbind( T = 320, D = 1, P = 80))
data5 <- as.data.frame(cbind( T = 320, D = 0, P = 0))
y4 <- predict(regTS, data4)
y5 <- predict(regTS, data5)
# We plot our initial observations, the column Y in our dataset
plot( Y,
bty="n",
col = gray(0.5,0.5), pch=19,
xlim = c(1, 365),
ylim = c(0, 300),
xlab = "Time (days)",
ylab = "Wellbeing index")
# Wellbeing at time = 230
points(230, y2, col = "dodgerblue4", pch = 19, bg = "red", cex = 2)
# Counterfactual at time = 230
points(230, y3, col = "darkorange2", pch = 19, cex = 2)
# Wellbeing at time = 320
points(320, y4, col = "dodgerblue4", pch = 19, cex = 2)
# Counterfactual at time = 320
points(320, y5, col = "darkorange2", pch = 19, cex = 2)
# Adding labels
text(320, y4, labels = "Y at t = 320", pos = 4, cex = 1)
text(320, y5, labels = "C at t = 320", pos = 4, cex = 1)
text(230, y2, labels = "Y at at = 230", pos = 4, cex = 1)
text(230, y3, labels = "C at t = 230", pos = 4, cex = 1)
# Line marking the interruption
abline( v=200, col="red", lty=2 )
Finally, we can look at the results by plotting all predicted outcomes and their counterfactuals. In figure 2.5 We can see that there is an increasing in the wellbeing index after the intervention. The dashed line represents the counterfactual.
pred1 <- predict(regTS, dataTS)
# To estimate all predicted values of Y, we just use our dataset
datanew <- as.data.frame(cbind(T = rep(1 : 365), D = rep(0), P = rep(0)))
# Create a new dataset where Treatment and Time Since Treatment are equal to 0 as the intervention did not occur.
pred2 <- predict(regTS, datanew)
# Predict the counterfactuals
plot( Y,
bty="n",
col = gray(0.5,0.5), pch=19,
xlim = c(1, 365),
ylim = c(0, 400),
xlab = "Time (days)",
ylab = "Wellbeing index")
lines( rep(1:199), pred1[1:199], col="dodgerblue4", lwd = 3 )
lines( rep(201:365), pred1[201:365], col="dodgerblue4", lwd = 3 )
lines( rep(200:365), pred2[200:365], col="darkorange2", lwd = 3, lty = 5 )
text(0, 45, labels = "Predicted values", pos = 4, cex = 1, col = "dodgerblue3")
text(300, 95, labels = "Counterfactual", pos = 4, cex = 1, col = "darkorange2")
# Line marking the interruption
abline( v=200, col="darkorange2", lty=2 )
Note that the coefficients obtained from equation (2.1) do not tell us if the difference between each point (the predicted outcomes) and its counterfactual is statistically significant. They only tell you if
There is no statistical test to look at whether there is a statistically significant difference between a predicted outcome and its counterfactual. This has important implications. It could be that effect of the intervetion varies over time as in figure 2.6. In this case, there is a drop in students’ wellbeing right after the intervention. For instance, it could be that the class challenges students to think about their emotions, thereby decreasing their overall sense of wellbeing.
reg2 = lm(Y ~ T + D + P, data = dataTS2)
stargazer( reg2,
type = "html",
dep.var.labels = ("Wellbeing"),
column.labels = ("Model results"),
covariate.labels = c("Time", "Treatment", "Time Since Treatment"),
omit.stat = "all",
digits = 2 )
Dependent variable: | |
Wellbeing | |
Model results | |
Time | 0.22^{***} |
(0.04) | |
Treatment | -34.67^{***} |
(7.14) | |
Time Since Treatment | 0.63^{***} |
(0.07) | |
Constant | 66.67^{***} |
(4.81) | |
Note: | ^{}p<0.1; ^{}p<0.05; ^{}p<0.01 |
We represent these results on graph 2.6. Depending on the point in time we consider the graph, the intervention had a negative effect or a positive one. If we look at results immediately after the intervention had occured (the red dotted line), we might conclude that the effect is negative. But if we consider some delay and look at the intervention at t = 300 (green dotted line), the effect are positive.
Thinking about when the effect is likely to occur is important in time series. We discuss other example in section 4.2.
pred1 <- predict(reg2, dataTS2)
datanew <- as.data.frame(cbind(T = rep(1 : 365), D = rep(0), P = rep(0)))
pred2 <- predict(reg2, datanew)
plot(Y,
col = "gray",
xlim = c(1, 365),
ylim = c(0, 400),
xlab = "Time (days)",
ylab = "Wellbeing index")
lines( rep(1:199), pred1[1:199], col="dodgerblue4", lwd = 3 )
lines( rep(201:365), pred1[201:365], col="dodgerblue4", lwd = 3 )
lines( rep(200:365), pred2[200:365], col="darkorange2", lwd = 3, lty = 5 )
text(0, 45, labels = "Predicted values", pos = 4, cex = 1, col = "dodgerblue3")
text(300, 105, labels = "Counterfactual", pos = 4, cex = 1, col = "darkorange2")
# Line marking the interruption
abline( v=200, col="red", lty=2 )
# Line at t = 300
abline( v=300, col="forestgreen", lty=2 )
We are going to replicate a study conducted by Rodgers, John, and Coleman (2005) to investigate fertility rates in Oklahoma after the Oklahoma City bombing in 1995, which killed 168 persons.
Theories suggest that birth rates respond to sociocultural situations, including the occurence of major disasters - both natural, such as hurricanes, and man-made, such as 9/11. Individuals feel more prone to have children as a response to major life losses or because they feel closer to traditional values (such as having a family) when unexpected events occur. By contrast, they might be less willing to have children when experiencing events that negatively affect their life style.
Evidence supporting these theories is limited but there have been ongoing investigations, especially as fertility rates steadily declined in past years.
Data used in the study are available in the Wats package in R and utilized by Rodgers, Beasley, and Schuelke to discuss representation of interrupted time series.
We start by uploading the Wats package and the data.
library( "Wats" )
# Upload the data from the package
data( CountyMonthBirthRate2014Version )
# We rename the data set to avoid the long name!
data <- CountyMonthBirthRate2014Version
These are the variables contained in the dataset.
Variable name | Description |
---|---|
Fips | Fips code |
CountyName | Name of the county |
Year | Year (1990-1999) |
Month | Month |
FecundPopulation | Fecund population |
BirthCount | Number of births in the month |
Date | Date when the data were collected |
DaysInMonth | Number of days in the month |
DaysInYear | Number of days in the year |
StageID | Indicate the time 9 months after the bombing |
BirthRateMonthly | Birth |
The dataset contains observations from multiple counties, but for this exercise we are going to consider only the county of Oklahoma city.
#Subset the data to keep only Oklahoma county data
oklahoma <- subset(data, data$CountyName == "oklahoma")
To get a first sense of the data we can look at this plot that Rodgers and colleagues used to represent birth rate over time in Oklahoma county:
What do you expect to find based on the figure?
When preparing data for time series analysis, we need to create three new variables to apply our time series equation (1.1):
# Create a time variable from 1 to last row of the dataset
oklahoma$time <- rep( 1 : nrow( oklahoma ))
# Create a dummy variable equal to 1 if after 1995 or after May 1995, and equal to 0 otherwise
oklahoma$treatment <- ifelse( oklahoma$Year > 1995 | oklahoma$Year == 1995 & oklahoma$Month >= 5, 1, 0)
# Create a dummy variable that is equal to 0 before time = 64 (when the event occured) and 1 otherwise.
# If we look at the frequency of the treatment variable, we see that 64 rows has received no treatment and 56 rows did receive the treatment.
table(oklahoma$treatment)
##
## 0 1
## 64 56
# The new variable will have 64 zeros and then will count time from 1 to 56
oklahoma$timeSince <- c(rep(0, 64), rep(1:56))
We can now calculate the coefficients as we did before:
reg_OK <- lm( BirthRateMonthly ~
time +
treatment +
timeSince,
data = oklahoma)
stargazer( reg_OK,
type = "html",
dep.var.labels = ("Monthly Birth Rate"),
column.labels = "",
covariate.labels = c("Time", "Treatment", "Time Since Treatment"),
omit.stat = "all",
digits = 3 )
Dependent variable: | |
Monthly Birth Rate | |
Time | -0.005^{**} |
(0.002) | |
Treatment | 0.054 |
(0.123) | |
Time Since Treatment | 0.014^{***} |
(0.004) | |
Constant | 5.924^{***} |
(0.085) | |
Note: | ^{}p<0.1; ^{}p<0.05; ^{}p<0.01 |
Can you interpret results from the table? In particular:
We can see that:
We can look at our observed and predicted birth rate on a plot:
# Predict Birth rate
pred_OK <- predict( reg_OK, oklahoma )
plot( oklahoma$BirthRateMonthly,
col = "gray",
xlim = c( 1, nrow (oklahoma) ),
ylim = c( (min ( oklahoma$BirthRateMonthly ) - 0.5), (max (oklahoma$BirthRateMonthly) + 0.5)),
xlab = "Time (days)",
ylab = "Birth rate" )
lines( pred_OK, col="dodgerblue4", lwd = 3 )
# Line marking the interruption
abline( v=64, col="red", lty=2 )
When conducting an interrupted time series analysis is important to think about possible threats to the validity of the study. In this case, Rodgers and colleagues noted that it would be difficult to observe an increase in the birth rate right after the event for the simple reason that a pregnancy lasts 9 months!
We can run our analysis again and see whether we can find an immediate affect 9 months after the bombing. We do it by changing our Treatment and Time Since Treatment variables to indicate a different point in time.
oklahoma$treatment2 <- ifelse(oklahoma$Year > 1996 | oklahoma$Year == 1996 & oklahoma$Month >= 2, 1, 0)
# Our new treatment variable is equal to 1 if the bombing has occured and 9 months have passed.It is equal to zero otherwise.
oklahoma$timeSince2 <- ifelse(oklahoma$time <= 73, 0, rep(1 : 73))
# Our Time Since Treatment variable starts 9 months after the bombing, instead than at the time of the bombing.
reg_OK2 = lm(BirthRateMonthly ~
time +
treatment2 +
timeSince2,
data = oklahoma)
stargazer( reg_OK2,
type = "html",
dep.var.labels = ("Monthly Birth Rate"),
column.labels = "",
covariate.labels = c("Time", "Treatment - 9 months delay", "Time Since Treatment - 9 months delay"),
omit.stat = "all",
digits = 2 )
Dependent variable: | |
Monthly Birth Rate | |
Time | -0.004^{**} |
(0.002) | |
Treatment - 9 months delay | 0.13 |
(0.13) | |
Time Since Treatment - 9 months delay | 0.01^{***} |
(0.004) | |
Constant | 5.90^{***} |
(0.08) | |
Note: | ^{}p<0.1; ^{}p<0.05; ^{}p<0.01 |
Again, we don’t find a significant immediate effect. Given the further evidence, we can rule out that an immediate an effect has occured as result of the event.
In this section we are going to discuss some additional issues that you might need to consider with interrupted time series.
Autocorrelation is a major issue when working with time series. Autocorrelation occurs when observation at one point in time depends from observations at another point in time. For example:
When we use OLS, we assume that error terms associated with each observation are not correlated. But, this assumption does not hold in the presence of autocorrelation because error terms are correlated across observations. If you don’t correct for autocorrelation, you might underestimate the standard errors, meaning that you are overestimating the statistical significance.
You can assess autocorrelation by looking at the residual pattern. If residuals are randomly dispersed, then there is no autocorrelation.
regTS_OK <- lm(BirthRateMonthly ~ time, data = oklahoma)
# Calculate a simple regression where birth rate depends only on time
plot(resid( regTS_OK ))
# Plot the residuals
Some times, it can be difficult to detect small correlation patterns using the residual plot as in figure 4.1. As alternative, we can test autocorrelation of residuals using the Durbin-Watson test in the package lmtest.
#install.packages("lmtest", repos="https://cran.cnr.berkeley.edu")
#library("lmtest")
#dwtest(oklahoma$BirthRateMonthly ~ oklahoma$time)
The test is positive as the p-value is smaller than 0.05. We can conclude that autocorrelation is an issue.
The function acf is another easy way to plot your residuals and assess autocorrelation. We can have a look at it.
regTS_OK <- lm(BirthRateMonthly ~ time, data = oklahoma)
# We run again a simple regression where the outcome depends solely on time
acf(resid( regTS_OK ))
# We use the acf function to plot the residuals
In figure 4.2, the x axis represents the time while the y axis is the level of autocorrelation. If a line goes outside the blue horizontal line, the autocorrelation is significant at that point in time (p-value <= 0.05).
In our figure, it looks like our error terms are correlated at T = 1, T = 11, and T = 12. We can retrieve from the acf function the level of correlation.
# Correlation at time 1 (note that 0 is time 1, so time 1 is actually our second observation )
acf(resid( regTS_OK ))$acf[2]
# Correlation at time 12
acf(resid( regTS_OK) )$acf[12]
# Correlation at time 13
acf(resid( regTS_OK ))$acf[13]
## [1] 0.2747975
## [1] 0.2440887
## [1] 0.489367
We can use the value of the correlation that we just retrieve to correct our standard errors.
Another alternative is to an ARIMA model which allows to adjust for autocorrelation and other time series characteristics such seasonality. In this course, we do not address ARIMA models.
Time series can get complicated as there might be late effects (as we saw in our Section 3) that can be misleading. A classical example are traning programs. Let’s imagine a state that introduces a new training program at T = 8 and we want to evaluate the immediate and sustained effect. Our data are represented in figure 4.3.
Look at the lines. If we consider the time of the intervention, we would conclude that the program has a negative effect on income! But this is because individuals drop out from the labour market or stop working as they need to attend the training. Income, in fact, increases over time when individuals re-enter the labour market.
Thinking about when the effect is likely to occur is importan in time series. As we mentioned in section 1.3, we cannot test if the slope difference is significant at different point in time.
If we suspect a late effect, we can look at immediate and sustained effects some months after the intervention. In the training example, we would observe an immediate, positive effect if we look at t = 18. Moreover, the “Time since Treatment” coefficient would have indicated a larger, positive effect.
It is also difficult to know for how long the effect of a policy will be sustained. In several cases researchers note a regression to the mean phenomenon, whereas the outcome slowly returns to its initial value.
For instance, let’s think of our first example about wellbeing. It is possible that the school program has a positive effect in the short term, as students draw benefits from the new class. But in the long term, it is likely that students will get used to the wellbeing program and their wellbeing goes back to its initial values. This is shown in figure 4.5.
There are also factors that are outside the policy makers’ control and might affect the sustained effect of the intervention. For instance, family and other environmental factors might push back wellbeing to its initial value.
Finally, time series are also subject to threats to internal validity, such as:
To address these issues, you can:
This design makes sure that the observed effect is the result of the policy intervention. The data will have two observations per each point in time and will include a dummy variable to differentiate the treatment (=1) and the control (=0). The model has a similar structure but (1) we will include a dummy variable that indicates the treatment and the control group and (2) we will interact the group dummy variable with all 3 time serie coefficients to see if there is a statistically significant difference across the 2 groups.
You can see this in the following equation, where \(\text{G}\) is a dummy indicating treatment and control group.
\[\begin{equation} \text{Y} = \text{b}_0 + \text{b}_1*T + \text{b}_2*D + \text{b}_3*P + \text{b}_4*G + \text{b}_5*G*T + \text{b}_6*G*D + \text{b}_7*G*P \tag{4.1} \end{equation}\]
To interpret the coefficients you need to remember that the reference group is the treatment group (=1). The Group dummy (b4) indicates the difference between the treatment and the control group. b5 represents the slope difference between the intervention and control group in the pre-intervention period. b6 represents the difference between the control and intervention group associated with the intervention. b7 represents the difference between the sustained effect of the control and intervention group after the intervention.
We can also take two (or more) groups where the intervention has occured at different point in time and see whether the effect of the intervention is significant for both of them.