1. Ordinary Least Squares Regression: A Review

Start with some function,

\[Y_i = F(X_i,\beta,\epsilon_i)\]

where \(Y_i\) is the dependent variable, \(X_i\) is a \(K\)-vector of observed independent variables, and \(\epsilon_i\) is an unobserved scalar error term, for observations \(i=1,...,N\). \(\beta\) is an unobserved \(K\)-vector of parameters. The parameters \(\beta\), together with the function \(F(\cdot)\), govern how \(X\) affects \(Y\).

Linear Model

Eq. (1) is quite general. We assume that \(Y\) is linear in \(X\) and that the error term enters additively,

\[Y_i = X_i\beta +\epsilon_i\]

  • Linear regression models the linear relationship between the dependent variable, \(Y\), and the predictor variables \(x_1, x_2, \ldots, x_p\).

  • There are different methods to obtain estimates of \(\beta\)
  • The simplest algorithm is Ordinary Least Squares (OLS):
    • Find the \(\beta\) that minimizes the sum of the squared errors

Error Term

Given that we know \(X\), if we knew \(\beta\), then we would know everything predictable about \(Y\). That is, our ignorance would only be about the error term \(\epsilon\).

The error term \(\epsilon\) has many interpretations. It could be:

  1. Unmodeled factors and left-out variables

  2. Measurement error in \(Y\)

  3. Unobserved heterogeneity

  4. The sum of random influences that are independent of \(X\)

  5. Random shifters (like business cycles or the weather)

In Eq. (2), \(X\beta\) is the prediction of \(Y\) given \(X\) (more on this later), and \(\epsilon\) is the prediction error. (If \(\epsilon\) were 0, the prediction would be exactly right.)

Given an estimate \(\hat{\beta}\) of \(\beta\), we can define the prediction error \(\epsilon\) in reverse:

\[\epsilon = Y - X\hat{\beta}\]

OLS Computation

Since prediction error is undesirable, one might wish to minimize its (weighted) sum, i.e. the (weighted) sum of absolute residuals:

\[min_{\hat{\beta}} \sum_{i=1}^n \omega_i |\epsilon_i|\]

where \(\omega_i\) is a weight for each residual. When we weight each \(|\epsilon_i|\) by its own size, i.e. \(\omega_i = |\epsilon_i|\), we get ordinary least squares:

\[min_{\hat{\beta}} \sum_{i=1}^n |\epsilon_i| \cdot |\epsilon_i|\]

After substituting Eq. (3) into Eq. (5), we get

\[min_{\hat{\beta}} \sum_{i=1}^n |Y_i - X_i\hat{\beta}| \cdot |Y_i - X_i\hat{\beta}|\]

\[= min_{\hat{\beta}} \sum_{i=1}^n (Y_i - X_i\hat{\beta})^2\]
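To make this concrete, here is a minimal numerical sketch (using R's built-in women data set, which reappears in Section 4, and the general-purpose optimizer optim()): minimizing the sum of squared residuals directly should land on essentially the same coefficients that lm() reports.

# Minimal sketch: minimize the sum of squared residuals numerically
# using the built-in women data (weight regressed on height)
sse <- function(b, y, x) sum((y - b[1] - b[2] * x)^2)  # sum of squared errors

fit <- optim(par = c(0, 0), fn = sse, method = "BFGS",
             y = women$weight, x = women$height)
fit$par                                  # numerical estimates of (beta_0, beta_1)

coef(lm(weight ~ height, data = women))  # should be essentially identical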

2. A Graphical Example

Let’s look at a graphical example for better intuition.

Graph 1

Consider a scatterplot with 5 data points.

Graph 2

Draw any linear function in this scatterplot.

  • Define a fitted value for \(y\) when \(x = x_i\) as: \(\hat{y}_i =\hat{\beta}_0 + \hat{\beta}_1 x_i\)

Graph 3

  • Define a residual for the observation \(i\) as the difference between the actual \(y_i\) and its fitted value

\[\hat{u}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\]

Graph 4

Minimization Problem

Want to choose \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to make the sum of squared residuals,

\[\sum_{i=1}^n \hat{u}_i^2 = \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2\]

as small as possible, i.e.:

\[min_{\hat{\beta}_0,\hat{\beta}_1} \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2\]
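Setting the partial derivatives with respect to \(\hat{\beta}_0\) and \(\hat{\beta}_1\) equal to zero and solving yields the familiar closed-form solutions:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]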

3. Review of OLS Assumptions

We need to make five important assumptions for our regression model to be valid (i.e. for the OLS estimators to be unbiased and efficient).

If the following five assumptions hold (the Gauss-Markov assumptions), then

\(\hat{\beta}_0,\hat{\beta}_1,...,\hat{\beta}_k\) are the best (i.e. smallest variance) linear unbiased estimators of \(\beta_0,\beta_1,...,\beta_k\).

In other words, OLS is BLUE: the Best Linear Unbiased Estimator.

Assumption 1

Linear in parameters

\[y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k + u\]

Assumption 2

Random sampling

We have a random sample of \(n\) observations, \(\{(x_{i1}, x_{i2}, ..., x_{ik}, y_i): i = 1, 2, ..., n\}\), following the population model.

Assumption 3

No Perfect Multicollinearity

Two part assumption:

  1. X is not constant (\(Var(x) \neq 0\)); and

  2. No perfect collinearity between the X’s

In your regression class you probably learned that collinearity can throw off the coefficient estimates.

Here’s an extreme example of perfectly collinear data.

my.data <- data.frame(y  = c(12, 13, 10, 5, 7, 12, 15),
                      x1 = c(6, 6.5, 5, 2.5, 3.5, 6, 7.5),
                      x2 = c(6, 6.5, 5, 2.5, 3.5, 6, 7.5))
my.data
##    y  x1  x2
## 1 12 6.0 6.0
## 2 13 6.5 6.5
## 3 10 5.0 5.0
## 4  5 2.5 2.5
## 5  7 3.5 3.5
## 6 12 6.0 6.0
## 7 15 7.5 7.5

What do you notice?

By construction, x1 and x2 are exactly the same variable, and the outcome y is perfectly modeled as \(y = x_1 + x_2\).

But there’s a problem… because the following are also true

\(y = 2 x_1\)

\(y = 3 x_1 - x_2\)

\(y = -400x_1 + 402 x_2\)

In other words, based on the data, there’s no way of figuring out which of these models is “right”. However, if we drop one of the variables from the model, we know exactly what the coefficient of the other should be.

Collinearity among predictors causes problems for regression precisely because the model is unable to accurately distinguish between many nearly equally plausible linear combinations of collinear variables. This can lead to large standard errors on coefficients, and even coefficient signs that don’t make sense.
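To see this in practice, here is a quick sketch of what R does if we fit a regression on the perfectly collinear data above: lm() detects the aliasing and reports an NA coefficient for the redundant variable.

# Fit a regression on the perfectly collinear data above
collinear.lm <- lm(y ~ x1 + x2, data = my.data)
summary(collinear.lm)
# The coefficient on x2 is reported as NA
# ("not defined because of singularities")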

Assumption 4

Zero conditional mean - Important!

The error \(u\) has an expected value of zero given any values of the independent variables. In other words,

\(E(u| x_1, x_2, ..., x_k) = 0\)

The assumption that the unobservables are, on average, unrelated to the explanatory variables is key to deriving the unbiasedness of the OLS estimator.

  • The assumption fails in the case of

    1. omitted variables,

    2. misspecified functional form, and

    3. measurement error.

Assumption 4 (zero conditional mean) is violated when there are omitted variables. Let's take an example: we are interested in the effect of education on hourly wage:

\[wage = \beta_0 + \beta_1 educ + u\]

In the simple linear regression (SLR) above, we assume that educ is uncorrelated with the unobservables captured by the error term \(u\). Is that reasonable? What about labor market experience, for instance? (A short simulation below illustrates the resulting bias.)

\[wage = \beta_0 + \beta_1 educ + \beta_2 Exper + u\]

  • Multiple linear regression takes \(exper\) out of the error term u.
  • With the SLR we would have to assume that experience is uncorrelated with education.
  • \(\beta_1\) measures the ceteris paribus effect of \(educ\) on \(wage\); we hold \(exper\) constant.
  • \(\beta_2\) measures the ceteris paribus effect of \(exper\) on \(wage\); we hold \(educ\) constant.
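A small simulation (a sketch with made-up parameter values, not the wage data itself) illustrates omitted variable bias: when educ and exper are correlated and exper is left in the error term, the SLR estimate of \(\beta_1\) is pulled away from its true value.

# Sketch: omitted variable bias with simulated data and made-up coefficients
set.seed(123)
n     <- 1000
educ  <- rnorm(n, mean = 13, sd = 2)
exper <- 20 - 0.7 * educ + rnorm(n, sd = 3)     # exper negatively correlated with educ
wage  <- 5 + 1.0 * educ + 0.5 * exper + rnorm(n, sd = 2)

coef(lm(wage ~ educ))          # SLR: biased (here downward, since cor(educ, exper) < 0)
coef(lm(wage ~ educ + exper))  # MLR: recovers a value close to the true 1.0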

Assumption 5

Homoskedasticity

The error \(u\) has the same variance given any values of the explanatory variables. In other words,

\(Var(u|x_1,...,x_k) = \sigma^2.\)

Let's look at the diamonds data in the ggplot2 package to see what a more typical analysis of linear model diagnostic plots might reveal.

#run regression
library(ggplot2)
diamonds.lm <- lm(price ~ carat + cut + clarity + color, data = diamonds)
par(mar = c(4, 4, 2, 2), mfrow = c(1, 2)) #optional
plot(diamonds.lm, which=c(1,2))

Residuals vs. Fitted

There is a clear indication of non-linearity present in this plot. Furthermore, we see that the variance appears to be increasing in fitted value.

Normal QQ plot

The residuals appear highly non-normal. Both the lower tail and upper tail are heavier than we would expect under normality. This may be due to the non-constant variance issue we observed in the Residuals vs. Fitted plot.
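For a more formal check of the non-constant variance suggested by these plots, one option (a sketch, assuming the lmtest package is installed) is the Breusch-Pagan test:

# Optional: Breusch-Pagan test for heteroskedasticity (requires the lmtest package)
library(lmtest)
bptest(diamonds.lm)
# A small p-value is evidence against the constant-variance assumption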

Log Transformation

  • Here’s what happens if we log-transform both the price and carat variables.
diamonds.lm2 <- lm(log(price) ~ I(log(carat)) + cut + clarity + color,
                   data = diamonds)

par(mar = c(4, 4, 2, 2), mfrow = c(1, 2)) #optional
plot(diamonds.lm2,which = c(1, 2))

While there remains a very slight indication of non-linearity in the Residuals vs. Fitted plot, the non-constant variance issue appears to have been addressed by the variable transformations. The Normal QQ plot indicates that the residuals have a heavier-tailed distribution than normal, but since we have a very large sample size this should not cause problems for inference.

4. OLS Computation in R

We will first run the canned version of OLS using lm() in R.

lm() Routine

## This is the OLS regression we will manually calculate:
reg <- lm(weight ~ height, data=women)
summary(reg)
## 
## Call:
## lm(formula = weight ~ height, data = women)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

Interpret Model Results

  • Coefficient value: shows the strength of the relationship
    • Speaks to economic meaningfulness
    • A 1-inch increase in the height of a woman is associated with a weight increase of 3.45 (reg$coef[2]) pounds. A 1/2-foot increase (6 inches) in the height of a woman is associated with a weight increase of 20.7 (6 * reg$coef[2]) pounds.
  • P-value: tests the null hypothesis that the coefficient is equal to zero (no effect)
    • Speaks to statistical significance
    • A low p-value (rule of thumb: p < 0.05) indicates that you can reject the null hypothesis of no effect
    • A significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference
  • R-squared:
    • Shows how much of the variation in Y is explained by X
    • Higher values indicate that X is a better predictor of Y
    • 0% indicates that the model explains none of the variability of the response data
  • F-statistic:
    • Shows the overall fit of the model
    • Tests the null hypothesis that the intercept-only model and your model fit equally well
    • A low p-value (rule of thumb: p < 0.05) indicates that the specified model fits better than the intercept-only model
    • The F-test in regression compares the fits of different linear models
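Rather than reading these quantities off the printed summary, we can also extract them from the fitted model object directly (a quick sketch using standard base R accessors):

coef(reg)                    # point estimates
confint(reg, level = 0.95)   # 95% confidence intervals for the coefficients
summary(reg)$coefficients    # estimates, std. errors, t values, p-values
summary(reg)$r.squared       # R-squared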

Manual Approach

To better understand what is going on in the background, we will calculate OLS regression manually using matrix algebra.

Matrix Notation

Recall that, using summation notation, we minimize the sum of squared errors as follows,

\[min_{\hat{\beta}} \sum_{i=1}^n (Y_i - X_i\hat{\beta})^2\]

In matrix notation, we can write the sum of squared residuals as follows,

\[\epsilon'\epsilon = (Y-X\hat{\beta})'(Y-X\hat{\beta})\]
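Expanding this quadratic form, and writing \(S\) for the sum of squared residuals, gives

\[S(\hat{\beta}) = (Y-X\hat{\beta})'(Y-X\hat{\beta}) = Y'Y - 2\hat{\beta}'X'Y + \hat{\beta}'X'X\hat{\beta}\]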

We obtain the estimate \(\hat{\beta}\) by minimizing \(\epsilon'\epsilon\): take the derivative of \(S\) with respect to \(\hat{\beta}\) and set it to 0. This yields the following First Order Condition (FOC),

\[\frac{\partial S}{\partial \beta} = -2X'Y + 2X'X\hat{\beta} = 0\]

\[2X'X\hat{\beta}=2X'Y \] \[X'X\hat{\beta} = X'Y\]

Solving for \(\hat{\beta}\):

\[\hat{\beta}=(X'X)^{-1}X'Y\]

R Matrix Operators

Define X and Y

# Define matrix X (first column of 1's for the intercept)
X <- matrix(c(1, 2.4, 1, 3.1), nrow = 2, ncol = 2, byrow = TRUE)
# Define vector y
y <- matrix(c(4.1, 5.4), nrow = 2, ncol = 1, byrow = TRUE)

Define \(X'\)

# Transpose X
t(X) 

Multiply \(X'X\)

# Multiply matrices
t(X)%*%X

Get Inverse of \(X'X\) with solve()

# Get the inverse of a matrix with solve()
solve(t(X)%*%X)

Multiply \(X'Y\)

t(X)%*%y

  • Put it all together

\[\hat{\beta}=(X'X)^{-1}X'Y\]

#Get Beta Coefficient
solve(t(X)%*%X)%*%t(X)%*%y

Empirical Example

data(women)
str(women)

Create Matrices

## Create X and Y matrices for this specific regression
X = as.matrix(cbind(1,women$height))
Y = as.matrix(women$weight)

Minimize sum squared residuals

We want to solve for,

\[\hat{\beta}=(X'X)^{-1}X'y\]

X = the matrix of regressor data (the first column is all 1’s for the intercept)
y = the vector of the dependent variable data

## Choose beta-hat to minimize the sum of squared residuals
## resulting in matrix of estimated coefficients:
bh = round(solve(t(X)%*%X)%*%t(X)%*%Y, digits=2)

Combine Results

## Label and organize results into a data frame
beta.hat = as.data.frame(cbind(c("Intercept","Height"),bh))
names(beta.hat) = c("Coeff.","Est")
beta.hat
##      Coeff.    Est
## 1 Intercept -87.52
## 2    Height   3.45

Standard errors

To calculate standard errors, we need to recover the variance-covariance (VCV) matrix.

The VCV matrix will be a square \(k \times k\) matrix that looks like:

\[\left(\begin{array}{cccc} var(\hat{\beta_1}) & cov(\hat{\beta_1},\hat{\beta_2}) & \cdots & cov(\hat{\beta_1},\hat{\beta_k})\\ cov(\hat{\beta_2},\hat{\beta_1}) & var(\hat{\beta_2}) & \cdots & cov(\hat{\beta_2},\hat{\beta_k}) \\ \vdots & \vdots & \ddots & \vdots \\ cov(\hat{\beta_k},\hat{\beta_1}) & cov(\hat{\beta_k},\hat{\beta_2}) & \cdots & var(\hat{\beta_k}) \end{array}\right)\]

  • The standard errors of the estimated parameters \(\hat{\beta}\) are obtained by taking the square root of the diagonal elements of the VCV matrix.

  • The variance for each parameter given X is as follows:

\[Var(\hat{\beta}|X)= \frac{1}{n-k} \hat{\epsilon}'\hat{\epsilon}(X'X)^{-1}\]

\(n\): Number of observations

\(k\): Number of parameters

\(\hat{\epsilon}\): Observed minus fitted values (\(y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\))

Calculate Standard Errors

## Calculate vector of residuals
res = as.matrix(women$weight-bh[1]-bh[2]*women$height)

## Define n and k parameters
n = nrow(women)
k = ncol(X)

## Calculate Variance-Covariance Matrix
VCV = 1/(n-k) * as.numeric(t(res)%*%res) * solve(t(X)%*%X)

## Standard errors of the estimated coefficients
StdErr = sqrt(diag(VCV))

Calculate P-values

## Calculate p-value for a t-test (pt) of coefficient significance
P.Value = rbind(2*pt(abs(bh[1]/StdErr[1]), df=n-k,lower.tail= FALSE),
                2*pt(abs(bh[2]/StdErr[2]), df=n-k,lower.tail= FALSE))

## concatenate into a single data.frame
beta.hat = cbind(beta.hat,StdErr,P.Value)
beta.hat
##      Coeff.    Est     StdErr      P.Value
## 1 Intercept -87.52 5.93696041 1.710338e-09
## 2    Height   3.45 0.09113675 1.091012e-14
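As a sanity check (a quick sketch), we can compare these hand-computed estimates, standard errors, and p-values with the coefficient table from the lm() fit we ran earlier; up to the rounding applied to bh, they should match.

# Compare the manual matrix-algebra results to R's built-in lm() output
summary(reg)$coefficients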

A Scatterplot with OLS line

library(ggplot2)
FigSmooth <- ggplot(women, aes(height, weight)) + geom_point() +
  geom_smooth(method = 'lm', se = FALSE, color = "blue") +
  ggtitle("Regression Line Built-in R")

FigManual <- ggplot(women, aes(height, weight)) + geom_point() +
  geom_abline(intercept = bh[1], slope = bh[2], color = "red") +
  ggtitle("Regression Line Manual")

library(gridExtra)
grid.arrange(FigSmooth, FigManual, nrow = 1)

5. Multiple OLS Example

Let’s look at a small data set in which we’re interested in predicting the crime rate per million population based on socio-economic and demographic information at the state level.

Let’s begin by loading the packages we’ll need to get started

library(MASS)
library(plyr)
library(ggplot2)
library(knitr)
library(RCurl)
library(stargazer)

Data Prep

  • Let’s import the data set and see what we’re working with.
# Import data set
crime <-read.table("https://raw.githubusercontent.com/TonyHuaHowell/DataScienceClass2017/master/crime_simple.txt",sep = "\t", header = TRUE)
  • Relabel column names and transform variables
# Assign more meaningful variable names
colnames(crime) <- c("crime.per.million", "young.males", "is.south", "average.ed",
"exp.per.cap.1960", "exp.per.cap.1959", "labour.part",
"male.per.fem", "population", "nonwhite",
"unemp.youth", "unemp.adult", "median.assets", "num.low.salary")

#change to numeric variables
crime<-data.frame(lapply(crime, function(x) as.numeric(as.character(x))))

# Convert is.south to a factor
# Divide average.ed by 10 so that the variable is actually average education
# Convert median assets to 1000's of dollars instead of 10's
crime <- transform(crime, is.south = as.factor(is.south),
average.ed = average.ed/10,
median.assets = median.assets/100)

# print summary of the data
summary(crime)

Regression Model

To construct a linear regression model in R, we use the lm() function.

The first model we fit is a regression of the outcome (crime.per.million) against all the other variables in the data set. You can either write out all the variable names or use the shorthand y ~ . to specify that you want to include all the variables in your regression.

##Run model and store results 
crime.lm <- lm(crime.per.million ~ young.males + is.south + average.ed+
                 exp.per.cap.1959+ exp.per.cap.1960  + labour.part+
                 male.per.fem + population + nonwhite+
                 unemp.youth + unemp.adult + median.assets + num.low.salary, data = crime)

# Get Summary of the linear regression model
summary(crime.lm)

R’s default is to output values in scientific notation. This can make it hard to interpret the numbers. Here’s some code that can be used to force full printout of numbers.

options(scipen=4)  # Set scipen = 0 to get back to default
summary(crime.lm)

Interpreting Main Results

The coefficients for these predictors are all positive, so crime rates are positively associated with wealth inequality, adult unemployment rates, average education levels, and high rates of young males in the population.
Looking at the p-values, it looks like num.low.salary (the number of families per 1000 earning below 1/2 the median income), unemp.adult (the unemployment rate of urban males aged 35-39, per 1000), average.ed (the mean number of years of schooling of persons 25 or older), and young.males (the number of males aged 14-24 per 1000 population) are all statistically significant predictors of the crime rate.
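We can also pull out the statistically significant predictors programmatically (a quick sketch using the conventional 0.05 cutoff):

# Sketch: list the coefficients with p-values below 0.05
crime.coefs <- summary(crime.lm)$coefficients
crime.coefs[crime.coefs[, "Pr(>|t|)"] < 0.05, ]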

Plot Model Diagnostics

Plotting the lm object produces four plots that are important diagnostic tools for assessing whether the linear model is appropriate. The first two plots are the most important, but the last two can also help with identifying outliers and non-linearities. We focus on the first two.

Residuals vs. Fitted

When a linear model is appropriate, we expect

  1. the residuals will have constant variance when plotted against fitted values; and

  2. the residuals and fitted values will be uncorrelated.

If there are clear trends in the residual plot, or the plot looks like a funnel, these are clear indicators that the given linear model is inappropriate.

par(mar = c(4, 4, 2, 2), mfrow = c(1, 2)) #optional
plot(crime.lm,which=c(1,2))

The Residuals vs. Fitted plot suggests that some non-linearity may exist in the residuals. Based on the Scale-Location diagnostic plot (not shown here), the non-linearity does not seem to be a major issue, though. In any case, we could take logarithms of the continuous variables, add or remove variables, and re-check the diagnostics to see whether a re-specified model fits the data better.

Collinearity

As a demo, let’s look at some of the economic indicators in our data set.

economic.var.names <- c("exp.per.cap.1959", "exp.per.cap.1960", "unemp.adult", "unemp.youth", "labour.part", "median.assets")

pairs(crime[,economic.var.names], lower.panel = NULL)

Looking at the plot, we see that many of the variables are very strongly correlated. In particular, police expenditures are pretty much identical in 1959 and 1960. This is an extreme case of collinearity. Also, unsurprisingly, youth unemployment and adult unemployment are also highly correlated.
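To quantify what the scatterplot matrix suggests, we can also compute the pairwise correlations directly (a quick sketch):

# Pairwise correlations among the economic indicators
round(cor(crime[, economic.var.names]), 2)
# exp.per.cap.1959 and exp.per.cap.1960 should show a correlation close to 1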

Correcting for collinearity

Let’s just include the 1960 police expenditure variable, and also drop the youth unemployment variable. We’ll do this using the update() function. Here’s what happens.

crime.lm2 <- lm(crime.per.million ~ young.males + is.south + average.ed+
                 exp.per.cap.1960  + labour.part+
                 male.per.fem + population + nonwhite+
                 unemp.adult + median.assets + num.low.salary, data = crime)

summary(crime.lm2)

Present Results

stargazer(crime.lm, crime.lm2, title = "Effect of Public Expenditures on Crime",
          type = 'html', align = TRUE)

Effect of Public Expenditures on Crime
                        Dependent variable: crime.per.million
                        (1)                       (2)
young.males                1.040**   (0.423)        1.127**   (0.419)
is.south1                 -8.308    (14.912)        -0.557   (13.883)
average.ed                18.016***  (6.497)        15.328**   (6.203)
exp.per.cap.1959          -0.667     (1.149)
exp.per.cap.1960           1.608     (1.059)         1.138***  (0.227)
labour.part               -0.041     (0.153)         0.069     (0.134)
male.per.fem               0.165     (0.210)         0.003     (0.173)
population                -0.041     (0.130)        -0.064     (0.128)
nonwhite                   0.007     (0.064)        -0.014     (0.062)
unemp.youth               -0.602     (0.437)
unemp.adult                1.792**   (0.856)         0.931*    (0.542)
median.assets             13.736    (10.583)        15.159    (10.524)
num.low.salary             0.793***  (0.235)         0.826***  (0.234)
Constant                -691.838*** (155.888)     -633.439*** (145.470)
Observations              47                        47
R2                         0.769                     0.754
Adjusted R2                0.678                     0.677
Residual Std. Error       21.936 (df = 33)          21.978 (df = 35)
F Statistic                8.462*** (df = 13; 33)    9.769*** (df = 11; 35)
Note: *p<0.1; **p<0.05; ***p<0.01
  • Observe that the coefficient on the 1960 expenditure variable went from being non-significant to significant (its p-value is now very small).

  • This is interesting. It’s essentially saying that, all else being equal, every dollar per capita increase in police expenditure is on average associated with an increase in crime of 1.13 per million population.

Caution With Interpreting Results

Takeaway 1: Just because a coefficient is significant, doesn’t mean your covariate causes your response

  • This is the old adage that correlation does not imply causation. In this example, we have strong evidence that higher police expenditures are positively associated with crime rates. This doesn’t mean that decreasing police expenditure will lower crime rate. The relationship is not causal – at least not in that direction. A more reasonable explanation is that higher crime rates prompt policy makers to increase police expenditure.

Takeaway 2: There’s a difference between practical significance and statistical significance

  • Both average.ed and exp.per.cap.1960 are statistically significant. exp.per.cap.1960 has a much more significant p-value, but also a much smaller coefficient. When looking at your regression model, you shouldn’t just look at the p-value column. The really interesting covariates are the ones that are significant, but also have the largest effect.

Note also that the units of measurement should be taken into account when thinking about coefficient estimates and effect sizes. Suppose, for example, that we regressed income (measured in $) on height and got a coefficient estimate of 100, with a standard error of 20. Is 100 a large effect?

The answer depends on the units of measurement. If height had been measured in meters, we would be saying that every 1 m increase in height is associated on average with a $100 increase in income. That's too small an effect for us to care about. Now what if height was measured in mm? Then we'd be saying that every 1 mm increase in height is associated on average with a $100 increase in income. Since 1 inch = 25.4 mm, this means that every 1-inch difference in height is on average associated with a $2540 difference in income. This would be a tremendously large effect.

Moral of the story: Whether an effect is ‘practically significant’ depends a lot on the unit of measurement.