Introduction

You have been asked to create a scale or an index that combines several measures into a single measure.

These sorts of scales can be useful for a variety of purposes. For example:

Miles, J. N., Weden, M. M., Lavery, D., Escarce, J. J., Cagney, K. A., & Shih, R. A. (2016). Constructing a time-invariant measure of the socio-economic status of US census tracts. Journal of Urban Health, 93(1), 213-232.

A well-established literature documents that the socio-economic characteristics of the places in which we live influence our health and wellbeing. For example, neighborhood socio-economic status (NSES), over and above individual socio-economic status, can have lasting effects on outcomes ranging from hypertension to allostatic load, disability, and depression.

Reviews of research on neighborhoods and health have suggested we need to better understand the role of critical periods, sequencing, and the accumulation of (dis)advantages over time. Longitudinal studies hoping to address these questions, however, must first address the methodological challenge of appropriately measuring neighborhood characteristics over time.

In other words, we need a measure where a value of 50 means the same thing in both time periods. We need a reliable instrument to conduct this type of research.

Measurement theory is an entire branch of statistics, so methods can get quite nuanced and complex. But the basic idea is to combine multiple items into a single index or scale that captures a specific latent construct like health, happiness, personality, or ability.

This exercise helps you think through how you might combine Census variables to form measures of neighborhood traits that help you better conceptualize the study of neighborhood change.

Setup

Helper functions to draw insightful correlation tables:

They are used as part of the pairs() function to add graphs and correlation coefficients to the table:
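
A minimal sketch of one such helper (the name panel.cor is illustrative, not necessarily the one used here):

```r
# Illustrative panel function for pairs(): prints the pairwise correlation
# in the upper panels while scatterplots fill the lower panels.
panel.cor <- function(x, y, digits = 2, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))          # restore plot coordinates when done
  par(usr = c(0, 1, 0, 1))
  r <- cor(x, y, use = "pairwise.complete.obs")
  text(0.5, 0.5, format(r, digits = digits), cex = 1.5)
}

# Smoothed scatterplots below the diagonal, correlations above it
pairs(state.x77[, c("Life Exp", "Murder", "HS Grad")],
      lower.panel = panel.smooth, upper.panel = panel.cor)
```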




Reliability

Science requires instruments that can be used to measure things. Reliability describes one characteristic of an instrument: the consistency of the measure.

There are different forms of reliability, but a good way to think about it is the accuracy of a bathroom scale. If the same person steps off the scale and back on five times, how much will each measurement vary? Is it one-tenth of a pound or ten pounds? It will depend on the quality (and likely the expense and age) of the scale.

Similarly, in social science we have better and worse instruments. A reliable instrument like an IQ test should give us similar answers when administered over and over. If the same person took an IQ test two days in a row, how much would the score vary? Their IQ will likely not change over that time period (although things like sleep can impact performance). Or if a group of people who all perform very similarly in school took the IQ test, would we expect them to score similarly?

Instruments become less reliable when noise, or random error, is part of the measure. In statistics, random error specifically means the error is unbiased: the scale must be equally likely to report your weight five pounds heavier than it actually is as five pounds lighter.

Systematic errors, where the scale is always five pounds off in the same direction, are actually pretty easy to fix. They don’t increase the variance of the measure (which leads to Type II errors in regression), and once you know the bias you can just add the correction factor to everyone in the sample. Oddly, even if everyone’s weight is off by a large amount, as long as it is off by the same amount it would not actually bias the slope in the model.

If several items are highly correlated then creating an index from them can actually help stabilize the instrument and result in something more reliable than any single item. The items’ value comes from the ability to triangulate a better estimate of the underlying construct.

For example, if you were evaluating the classroom performance of a teacher there is a chance you could arrive on an especially good day, or an especially bad day, and your evaluation would not be reflective of their actual ability. If you visit their classroom several times and average the scores you probably have a better overall measure of their ability. Similarly, when measuring a latent construct if you can combine several data points that all represent different dimensions of the same underlying construct you will have a more stable total score.

Calculating Cronbach’s Alpha

Cronbach was one of the first to develop and popularize the use of a reliability score to evaluate instruments in psychology. The formula is a little hairy, but the intuition is straightforward.

You have three variables that you will use to create a scale. Some of the variance for each variable captures the construct of interest, and some of the variance does not.

X1 = a1 + e1
X2 = a2 + e2
X3 = a3 + e3

When you combine the three variables X1 to X3 into a common scale you will have a component that represents a stable measure of the construct: A = ( a1 + a2 + a3 ) / 3

And you will have a component of the three variables that represents random measurement error: B = ( e1 + e2 + e3 ) / 3

The ratio of these components, the signal as a share of the total A/(A+B), drives the reliability measure. This is a grossly over-simplified explanation, but it gives you some context for what alpha is reporting.
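
To make the intuition concrete, the standard textbook formula compares the sum of the item variances to the variance of the summed scale. A base-R sketch (cronbach is a hypothetical helper name, not the psych implementation used below):

```r
# Cronbach's alpha from first principles:
# alpha = k/(k-1) * (1 - sum of item variances / variance of the summed scale)
cronbach <- function(items) {        # items: a matrix with one column per item
  k <- ncol(items)
  total <- rowSums(items)
  (k / (k - 1)) * (1 - sum(apply(items, 2, var)) / var(total))
}
```

Note that items measured in opposite directions (like murder rates versus life expectancy) must be reverse-coded before applying the formula.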

We can calculate the alpha easily enough with the psych package in R. Let’s use data from the built-in state dataset to demonstrate. After loading the state data we can access a matrix called state.x77, which reports 1977 statistics for all 50 states:
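
One way to pull up the first few rows of the table shown below:

```r
data(state)       # attaches state.x77 and related objects from the datasets package
head(state.x77)   # first six states
```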

           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766

If we wanted to construct a measure for something that approximates quality of life in each state, we can select a subset of these and combine them into a single instrument. Let’s use life expectancy, the murder rate, and high school graduation rates.

## [1] "Population" "Income"     "Illiteracy" "Life Exp"   "Murder"    
## [6] "HS Grad"    "Frost"      "Area"

Now calculate the alpha if these three variables are combined into an index:
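
A sketch of the calculation, assuming the psych package is installed (the object name qol is illustrative; murder runs opposite the other two items, so check.keys = TRUE lets alpha() reverse-code it):

```r
library(psych)

qol <- data.frame(state.x77[, c("Life Exp", "Murder", "HS Grad")])

# check.keys = TRUE reverse-codes any item (here Murder) that correlates
# negatively with the total scale before computing alpha
a <- alpha(qol, check.keys = TRUE)
a$total$raw_alpha
```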

## [1] 0.5701591

Note that the alpha measure is derived from the correlation of the three variables. If we add another variable with a lower correlation it lowers the score:

Oddly, the number of days below freezing in each state is strongly (negatively) correlated with murder rates! But it has a poor overall relationship with the other items. Thus we have reduced the reliability of our index.
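
Recomputed with Frost included, the call might look like this (again assuming psych; the alpha drops, as the output below shows):

```r
library(psych)

qol4 <- data.frame(state.x77[, c("Life Exp", "Murder", "HS Grad", "Frost")])
a4 <- alpha(qol4, check.keys = TRUE)$total$raw_alpha
a4
```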

## [1] 0.2367693

Note that some might argue temperature contributes a great deal to the quality of life in a state! So there are theoretical reasons to include it. But recall that the trick in creating instruments is to define your latent construct as precisely as possible. Life expectancy, murder rates, and education outcomes say something about the quality of institutions or the level of civility in a state, which is likely distinct from other geographic constructs that could form a separate quality of life index.

We can improve our reliability slightly if we replace graduation rates with illiteracy:

## [1] 0.6433149

We are now above the threshold of 0.60 used for a minimally reliable index.

Combining Items

We have determined that life expectancy, murder rates, and literacy rates all partially measure the same construct. We achieve a reliability score of 0.64.

How do we combine these items, though? Notice the different scales:

Life Exp Murder Illiteracy
Min. :67.96 Min. : 1.400 Min. :0.500
1st Qu.:70.12 1st Qu.: 4.350 1st Qu.:0.625
Median :70.67 Median : 6.850 Median :0.950
Mean :70.88 Mean : 7.378 Mean :1.170
3rd Qu.:71.89 3rd Qu.:10.675 3rd Qu.:1.575
Max. :73.60 Max. :15.100 Max. :2.800

If we simply add them together the measures with a greater range and variance will contribute more toward the final index score than variables with a lower range and variance. We need to do something to standardize the inputs so they are contributing similar amounts.

If you are interested in better approaches to this problem, check out work on factor analysis and instrument design.

Before and after standardizing:

Rescaling Data

One approach is to convert all current variables to new scales ranging from 0 to 100.

This is sometimes called normalizing a variable, but that term is used inconsistently across disciplines so it is better to be explicit and say you are rescaling a variable to a new scale of A (min value) to B (max value).

The formula is:

Y = (new.max) * (x - x.min) / (x.max-x.min)
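
As a sketch, a hypothetical helper implementing this formula, applied to the murder rate (which produces the summary below):

```r
# Rescale x to run from 0 to new.max, per the formula above
rescale100 <- function(x, new.max = 100) {
  new.max * (x - min(x)) / (max(x) - min(x))
}

summary(rescale100(state.x77[, "Murder"]))
```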

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   21.53   39.78   43.64   67.70  100.00

Or more conveniently:
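
One convenient option (assuming the scales package is installed) is its rescale() function, which implements the same arithmetic:

```r
library(scales)

# to = c(0, 100) sets the new minimum and maximum
summary(rescale(state.x77[, "Murder"], to = c(0, 100)))
```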

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   21.53   39.78   43.64   67.70  100.00

Let’s combine our items after rescaling:
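
A sketch of the combination. The reverse-coding here is an assumption: murder and illiteracy are subtracted from 100 so that higher index values mean higher quality of life.

```r
r100 <- function(x) 100 * (x - min(x)) / (max(x) - min(x))

life  <- r100(state.x77[, "Life Exp"])
crime <- 100 - r100(state.x77[, "Murder"])      # reversed: less murder is better
lit   <- 100 - r100(state.x77[, "Illiteracy"])  # reversed: less illiteracy is better

index <- life + crime + lit
summary(index)
```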

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   28.05  137.92  190.81  178.98  234.30  277.74

We can see that all inputs are on a scale of 0 to 100, and the overall quality of life index now ranges from a min of 28 to a max of 277 on a scale of 0 to 300.

Standardizing Data

Alternatively, we can convert each item into a standardized variable called a Z score. After standardization each variable will have a mean of zero and a standard deviation of 1:
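
A sketch using base R's scale(), which centers and standardizes each column (producing the summary below):

```r
# Each column now has mean 0 and standard deviation 1
z <- scale(state.x77[, c("Life Exp", "Murder", "Illiteracy")])
summary(z)
```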

Life Exp Murder Illiteracy
Min. :-2.1742 Min. :-1.6194 Min. :-1.0992
1st Qu.:-0.5670 1st Qu.:-0.8203 1st Qu.:-0.8941
Median :-0.1517 Median :-0.1430 Median :-0.3609
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.7553 3rd Qu.: 0.8931 3rd Qu.: 0.6644
Max. : 2.0273 Max. : 2.0918 Max. : 2.6742

And we can similarly combine these into an index:
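
As with the rescaled version, murder and illiteracy would be reverse-coded (their z-scores subtracted) before summing. A sketch under that assumption:

```r
z <- scale(state.x77[, c("Life Exp", "Murder", "Illiteracy")])

# Subtracting the z-scores flips Murder and Illiteracy so higher = better
index.z <- z[, "Life Exp"] - z[, "Murder"] - z[, "Illiteracy"]
summary(index.z)
```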

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -5.8295 -1.5689  0.4242  0.0000  2.1059  3.8612

Notably, rescaling or standardizing variables does not change the underlying correlation structure. So we are not impacting the reliability metrics by rescaling:
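
This is easy to verify, since correlations are invariant to linear transformations of each variable:

```r
m <- state.x77[, c("Life Exp", "Murder", "Illiteracy")]
r100 <- function(x) 100 * (x - min(x)) / (max(x) - min(x))

all.equal(cor(m), cor(scale(m)))           # identical after standardizing
all.equal(cor(m), cor(apply(m, 2, r100)))  # identical after rescaling too
```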

Impact of Outliers

Just like in regression, outliers can heavily skew a scale.

You might check for some extreme outliers and consider truncating values if they are compressing most of your data into a small range. Or do a log transformation before rescaling.

You can top-code outliers to see if it has a big impact, but be sure to note any changes you make to the original data in your data manifest, and include tables showing results with and without truncation if you are altering valid data points to minimize the influence of outliers. If the value is a data-entry error, it is sufficient to note the fix.
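
For example, land area in state.x77 is heavily skewed by Alaska. A sketch of top-coding at the 95th percentile (the cutoff choice is an assumption) alongside a log transformation:

```r
area <- state.x77[, "Area"]

# Top-code: pull extreme values down to the 95th percentile
cap <- quantile(area, 0.95)
area.top <- pmin(area, cap)

# Alternative: log transform to compress the long right tail before rescaling
area.log <- log(area)

c(max(area), max(area.top))   # Alaska's area before and after top-coding
```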