This lab will have you replicate the work you did for the previous lab, but instead of using the Phoenix data you will select a city of your choice to use for this lab, subsequent labs, and the final project.

You want to select a dense city that has a lot of census tracts, so I suggest you select from the 50 largest cities in the US. The Census defines Metropolitan Statistical Areas (MSAs) as “one or more adjacent counties or county equivalents that have at least one urban core area of at least 50,000 population, plus adjacent territory that has a high degree of social and economic integration with the core as measured by commuting ties.” There are currently 384 MSAs in the US.

Wikipedia provides a convenient list of MSAs sorted by size.

Packages

Data Source

We will again use Census data made available through the Diversity and Disparities Project. They have created a dataset with 170 variables that have been harmonized from 1970 onward for analysis of changes in tracts over time. We will use a subset of these:

Build Your Rodeo Dataset

The most time-intensive component of most data projects the wrangling necessary to combine data from a variety of sources, cleaning it, transforming variables, and standardizing the data before it is ready for formal models.

We will use clustering to identify coherent neighborhoods within cities, but we first must acquire data, download shapefiles, transform the shapefiles into Dorling cartograms, and merge all of the map and census files together for the analysis.

This lab requires the following steps:

  1. Select a city (MSA) and identify all counties in the city
  2. Download a shapefile for your city
  3. Merge census data to your map file
  4. Transform the shapefile into a Dorling cartogram for more meaningful maps

At this point you are ready to start clustering census tracts.

I will demonstrate the code that you will use for the lab using Minneapolis. You will need to adapt it for your own city.

Step 1: Select Your MSA

You can select a city from the list of large MSAs.

To get Census data on the city you will first need to identify all of the counties that comprise the MSA. You can look this information up through MSA to FIPS crosswalks provided by the National Bureau for Economic Research (NBER): https://www.nber.org/data/cbsa-fips-county-crosswalk.html

I have added the file to GitHub for ease of access.

##  [1] "CHICO-PARADISE, CA" "CHICAGO, IL"        "CHICAGO, IL"       
##  [4] "CHICAGO, IL"        "CHICAGO, IL"        "CHICAGO, IL"       
##  [7] "CHICAGO, IL"        "CHICAGO, IL"        "CHICAGO, IL"       
## [10] "CHICAGO, IL"
##  [1] "MINNESOTA"                   "MINNEAPOLIS-ST. PAUL, MN-WI"
##  [3] "MINNESOTA"                   "MINNESOTA"                  
##  [5] "MINNESOTA"                   "MINNESOTA"                  
##  [7] "MINNESOTA"                   "MINNESOTA"                  
##  [9] "MINNEAPOLIS-ST. PAUL, MN-WI" "MINNESOTA"                  
## [11] "MINNESOTA"                   "MINNEAPOLIS-ST. PAUL, MN-WI"
## [13] "MINNESOTA"                   "MINNESOTA"                  
## [15] "MINNESOTA"                   "MINNESOTA"                  
## [17] "MINNEAPOLIS-ST. PAUL, MN-WI" "MINNESOTA"                  
## [19] "MINNESOTA"                   "MINNESOTA"                  
## [21] "MINNESOTA"                   "MINNESOTA"                  
## [23] "MINNESOTA"                   "MINNESOTA"                  
## [25] "MINNEAPOLIS-ST. PAUL, MN-WI" "MINNESOTA"                  
## [27] "MINNEAPOLIS-ST. PAUL, MN-WI" "MINNESOTA"                  
## [29] "MINNESOTA"                   "MINNESOTA"                  
## [31] "MINNESOTA"                   "MINNESOTA"                  
## [33] "MINNESOTA"                   "MINNESOTA"                  
## [35] "MINNESOTA"                   "MINNESOTA"                  
## [37] "MINNESOTA"                   "MINNESOTA"                  
## [39] "MINNESOTA"                   "MINNESOTA"                  
## [41] "MINNESOTA"                   "MINNESOTA"                  
## [43] "MINNESOTA"                   "MINNESOTA"                  
## [45] "MINNESOTA"                   "MINNESOTA"                  
## [47] "MINNESOTA"                   "MINNESOTA"                  
## [49] "MINNESOTA"                   "MINNESOTA"                  
## [51] "MINNESOTA"                   "MINNESOTA"                  
## [53] "MINNESOTA"                   "MINNESOTA"                  
## [55] "MINNESOTA"                   "MINNESOTA"                  
## [57] "MINNEAPOLIS-ST. PAUL, MN-WI" "MINNESOTA"                  
## [59] "MINNESOTA"                   "MINNESOTA"                  
## [61] "MINNESOTA"                   "MINNESOTA"                  
## [63] "MINNESOTA"                   "MINNEAPOLIS-ST. PAUL, MN-WI"
## [65] "MINNEAPOLIS-ST. PAUL, MN-WI" "MINNESOTA"                  
## [67] "MINNESOTA"                   "MINNESOTA"                  
## [69] "MINNESOTA"                   "MINNESOTA"                  
## [71] "MINNESOTA"                   "MINNESOTA"                  
## [73] "MINNESOTA"                   "MINNESOTA"                  
## [75] "MINNEAPOLIS-ST. PAUL, MN-WI" "MINNESOTA"                  
## [77] "MINNESOTA"                   "MINNESOTA"                  
## [79] "MINNEAPOLIS-ST. PAUL, MN-WI" "MINNESOTA"                  
## [81] "MINNESOTA"                   "MINNEAPOLIS-ST. PAUL, MN-WI"
## [83] "MINNEAPOLIS-ST. PAUL, MN-WI"

Select all of your county fips. To use them in the TidyCenss package you will need to split the state and county:

Step 3: Add Census Data

DATA DICTIONARY

LABEL VARIABLE
tractid GEOID
pnhwht12 Percent white, non-Hispanic
pnhblk12 Percent black, non-Hispanic
phisp12 Percent Hispanic
pntv12 Percent Native American race
pfb12 Percent foreign born
polang12 Percent speaking other language at home, age 5 plus
phs12 Percent with high school degree or less
pcol12 Percent with 4-year college degree or more
punemp12 Percent unemployed
pflabf12 Percent female labor force participation
pprof12 Percent professional employees
pmanuf12 Percent manufacturing employees
pvet12 Percent veteran
psemp12 Percent self-employed
hinc12 Median HH income, total
incpc12 Per capita income
ppov12 Percent in poverty, total
pown12 Percent owner-occupied units
pvac12 Percent vacant units
pmulti12 Percent multi-family units
mrent12 Median rent
mhmval12 Median home value
p30old12 Percent structures more than 30 years old
p10yrs12 Percent HH in neighborhood 10 years or less
p18und12 Percent 17 and under, total
p60up12 Percent 60 and older, total
p75up12 Percent 75 and older, total
pmar12 Percent currently married, not separated
pwds12 Percent widowed, divorced and separated
pfhh12 Percent female-headed families with children

Clustering

We will use the same set of variables as last week. The data is transformed into z-score so that they are all on similar scales.

pnhwht12 pnhblk12 phisp12 pntv12 pfb12 polang12
0.9411 -0.5724 -0.7285 -0.4538 -1.025 -0.9737
0.9847 -0.7166 -0.6991 -0.137 -1.009 -1.072
1.004 -0.7088 -0.7173 -0.004358 -0.8534 -1.046
0.8184 -0.7244 -0.7215 -0.1075 -0.9287 -0.4711
0.8448 -0.7244 -0.6332 1.093 -1.117 -1.004
0.9798 -0.7097 -0.7775 -0.4538 -0.8511 -0.7984

Prepare Data for Clustering

We transform all of the variables to z scorse so they are on the same scale while clustering. This ensures that each census variable has equal weight. Z-scores typically range from about -3 to +3 with a mean of zero.

pnhwht12 pnhblk12 phisp12 pntv12 pfb12 polang12
0.9411 -0.5724 -0.7285 -0.4538 -1.025 -0.9737
0.9847 -0.7166 -0.6991 -0.137 -1.009 -1.072
1.004 -0.7088 -0.7173 -0.004358 -0.8534 -1.046
0.8184 -0.7244 -0.7215 -0.1075 -0.9287 -0.4711
0.8448 -0.7244 -0.6332 1.093 -1.117 -1.004
0.9798 -0.7097 -0.7775 -0.4538 -0.8511 -0.7984

Perform Cluster Analysis

For more details on cluster analysis visit the mclust tutorial.

## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VVE (ellipsoidal, equal orientation) model with 7 components: 
## 
##  log-likelihood   n  df       BIC       ICL
##       -15370.41 771 861 -36464.47 -36533.72
## 
## Clustering table:
##   1   2   3   4   5   6   7 
##  74 151 112 237  89  63  45

Some visuals of model fit statistics (doesn’t work well with a lot of variables):

Variable Selection for Clustering

We can compare the model created with reliable measures of community stability that were identified from the first lab, and compare this model to one created from three census variables, and one created from three reliable indices of neighborhood stability generated from Lab 01:

Neighborhood transitivity (Cronbach’s Alpha 0.8)

Neighborhood Diversity (Cronbach’s Alpha 0.78)

Human Capital (Cronbach’s Alpha 0.93)

You will create each index by adding the component variables together:

I arbitrarily selected three census variables to compare methods. Here we are using % of population 18 and under, % of female labor force participation, and household income.

## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VVI (diagonal, varying volume and shape) model with 5 components: 
## 
##  log-likelihood   n df       BIC       ICL
##       -4626.543 771 34 -9479.108 -9849.923
## 
## Clustering table:
##   1   2   3   4   5 
## 156  78 180 268  89
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VEV (ellipsoidal, equal shape) model with 3 components: 
## 
##  log-likelihood   n df       BIC       ICL
##       -3046.201 771 25 -6258.594 -6747.571
## 
## Clustering table:
##   1   2   3 
## 187 263 321

This function does not work well using the model with 30 census variables because the plots are so small, but when you plot the model results from the clustering procedure is will show the relationship between each variable, and color-code each group. This helps you visualize how each set of variables is contributing the the clusters that were identified.

Recall that similar to how regression models are trying to select lines that minimize the residual errors in the model, clustering tries to select groups (assign tracts to groups) in a way that maximizes the overall distance between group centroids.

You can see here that the three indices we created do a better job of creating meaningful groups that the three census variables. When selecting variables for your model it is better to have independent measures - highly-correlated input variables will be explaining the same variance in your populations, and thus provide redundant information that is not useful for identifying groups.




Lab Instructions

PART 1:

Select a city, build your dataset, and cluster the neighborhoods in your city using the same techniques we used last week.

Show the clusters and the cluster demographics for the new city.

Report the labels you have assigned to each group.


PART 2:

Part 2 explores how the input data impacts your clusters.

You will use your full model with 30 census variables as the reference point, then do the following:

Question 1:

Compare that set of groups to the groups identified by the model using only the three indices above. Are they identifying the same groups? Which group is missing?

Question 2:

Select three census variables from the full list of 30, and create a new clustering model. How many groups are generated? Are they similar groups to the full model? Are our inferences significantly different if we have a limited amount of data to draw from?

Report your three variables and describe the groups that are identified.

You can use the following to create data dictionaries to provide informative labels to the group percentile plots:

# use dput( data.dictionary ) to create reproducible data frames for RMD files
data.dictionary <- 
structure(list(LABEL = c("pnhwht12", "pnhblk12", "phisp12", 
"pntv12", "pfb12", "polang12", "phs12", "pcol12", "punemp12", 
"pflabf12", "pprof12", "pmanuf12", "pvet12", "psemp12", "hinc12", 
"incpc12", "ppov12", "pown12", "pvac12", "pmulti12", "mrent12", 
"mhmval12", "p30old12", "p10yrs12", "p18und12", "p60up12", "p75up12", 
"pmar12", "pwds12", "pfhh12"), VARIABLE = c("Percent white, non-Hispanic", 
"Percent black, non-Hispanic", "Percent Hispanic", "Percent Native American race", 
"Percent foreign born", "Percent speaking other language at home, age 5 plus", 
"Percent with high school degree or less", "Percent with 4-year college degree or more", 
"Percent unemployed", "Percent female labor force participation", 
"Percent professional employees", "Percent manufacturing employees", 
"Percent veteran", "Percent self-employed", "Median HH income, total", 
"Per capita income", "Percent in poverty, total", "Percent owner-occupied units", 
"Percent vacant units", "Percent multi-family units", "Median rent", 
"Median home value", "Percent structures more than 30 years old", 
"Percent HH in neighborhood 10 years or less", "Percent 17 and under, total", 
"Percent 60 and older, total", "Percent 75 and older, total", 
"Percent currently married, not separated", "Percent widowed, divorced and separated", 
"Percent female-headed families with children")), class = "data.frame", row.names = c(NA, 
-30L))


# list variables for clustering
use.these <- c("pnhwht12", "pnhblk12", "phisp12", "pntv12", "pfb12", "polang12", 
               "phs12", "pcol12", "punemp12", "pflabf12", "pprof12", "pmanuf12", 
               "pvet12", "psemp12", "hinc12", "incpc12", "ppov12", "pown12", 
               "pvac12", "pmulti12", "mrent12", "mhmval12", "p30old12", "p10yrs12", 
               "p18und12", "p60up12", "p75up12", "pmar12", "pwds12", "pfhh12")

dd.cluster1 <- data.dictionary[ data.dictionary$LABEL %in% use.these , ]


# cluster 2
LABEL <- c("dim1","dim2","dim3")
VARIABLE <- c("Neighborhood transitivity","Neighborhood diversity","Human capital")
dd.cluster2 <- data.frame( LABEL, VARIABLE )


# cluster 3 - update with your variables
use.these <- c("pnhwhtxx", "pnhblkxx", "phispxx")
dd.cluster3 <- data.dictionary[ data.dictionary$LABEL %in% use.these , ]

Reflection (not graded, but please answer):

Reflect on how the data we use in these models defines how we are defining and describing the neighborhoods (groups) within these models?

Are these clusters valid constructs? Do we believe that they are accurately describing a concrete reality about the world? Would knowing which group that a tract is part of help us predict something about the tract, such as outcomes that children would experience living there, or how the tract might change over time?

How do we know which is the right set of groups to use?






Submission Instructions

After you have completed your lab submit via Canvas. Login to the ASU portal at http://canvas.asu.edu and navigate to the assignments tab in the course repository. Upload your RMD and your HTML files to the appropriate lab submission link. Or else use the link from the Schedule page.

Remember to name your files according to the convention: Lab-##-LastName.xxx