Background

In this hypothetical scenario, pretend that you are a student currently learning the R language. Take your time and really get into the role.

We’re going to use public data from the U.S. Department of Labor in order to lightly research various R-related occupations, the values and interests of R-using occupational incumbents, the labor market value of the R language, and we conclude by predicting your future salaries based on your work values!

Data: O*NET & SOC Codes

O*NET, or the Occupational Information Network, is the admirable pursuit by the Department of Labor to consolidate the classification of approximately one thousand kinds of jobs. If that seems like a lot, it comes from the even more voluminous DOT (Dictionary of Occupational Titles), which defined over 13,000 jobs since 1938. To do this, all federal agencies use a unique identifier for each job type, known as a SOC or Standard Occcupational Classification. SOC codes, typically variable O*NET-SOC Code, most often become the unique keys necessary to joining our tables!

The O*NET Resource Center contains all individual tables used to power My Next Move and other platforms. Tables are organized according to myriad dimensions, seen here:

Exploring O*NET Tables

Let’s look at a few tables to get to know the data.

Occupation Titles

Read in one of the highest-level tables, Occupation Data, which contains:

Occupation SOC Code
Occupation Title
Occupation Description

url <- "https://www.onetcenter.org/dl_files/database/db_26_3_text/Occupation%20Data.txt"

occupations <- read_delim(url, show = FALSE)

occupations %>% 
  head(3) %>%          # Show first 3
  pander()

O*NET-SOC Code	Title	Description
11-1011.00	Chief Executives	Determine and formulate policies and provide overall direction of companies or private and public sector organizations within guidelines set up by a board of directors or similar governing body. Plan, direct, or coordinate operational activities at the highest level of management with the help of subordinate executives and staff managers.
11-1011.03	Chief Sustainability Officers	Communicate and coordinate with management, shareholders, customers, and employees to address sustainability issues. Enact or oversee a corporate sustainability strategy.
11-1021.00	General and Operations Managers	Plan, direct, or coordinate the operations of public or private sector organizations, overseeing multiple departments or locations. Duties and responsibilities include formulating policies, managing daily operations, and planning the use of materials and human resources, but are too diverse and general in nature to be classified in any one functional area of management or administration, such as personnel, purchasing, or administrative services. Usually manage through subordinate supervisors. Excludes First-Line Supervisors.

Occupation Technologies

Now, let’s take a look at a more sophisticated table, Technology Skills, which contains:

Occupation SOC Code
Technology Title
Technology Example
UNSPSC Commodity Code
UNSPSC Commodity Title
“Hot” Technology

Here, the first 6 observations all have the same SOC Code. That’s because “Technology Skills” lists all technologies associated with the occupation. Our preview shows the first 6 technologies important for occupation SOC code 11-1011.00, or that of “Chief Executives”.

url <- "https://www.onetcenter.org/dl_files/database/db_26_3_text/Technology%20Skills.txt"

technologies <- read_delim(url, show = FALSE)

technologies %>% 
  head(6) %>%          # Show first 6
  pander()

Table continues below
O*NET-SOC Code	Example	Commodity Code
11-1011.00	Adobe Systems Adobe Acrobat	43232202
11-1011.00	AdSense Tracker	43232306
11-1011.00	Atlassian JIRA	43232201
11-1011.00	Blackbaud The Raiser’s Edge	43232303
11-1011.00	ComputerEase construction accounting software	43231601
11-1011.00	Database reporting software	43232305

Commodity Title	Hot Technology
Document management software	Y
Data base user interface and query software	N
Content workflow software	Y
Customer relationship management CRM software	N
Accounting software	N
Data base reporting software	N

Joining Tables

Recall that SOC Codes are unique identifiers for each occupation and are most often used fully, or in part, as merge keys.

We can merge these using dplyr function left_join(), specifying argument by =:

occupations %>% 
  left_join(technologies, 
            by = `O*NET-SOC Code`)

Alternatively, we can use base R function merge():

merge(x = occupations, 
      y = technologies, 
      by = "O*NET-SOC Code")

In fact, if we simply use left_join() without identifying a merge column, it automatically detects merge keys with common names and notifies us in the console.

occupations %>% 
  left_join(technologies) %>% 
  select(Title, `Commodity Title`) %>% 
  head(100)

Insights with Joins

Now that we’re capable of joining O*NET database tables, we can begin to glean our first insights. For example:

How many technologies are associated with each occupation?
Which occupations require the most diverse array of technologies?

We can capture this quickly because our occupations-to-technologies merge is one-to-many, meaning that one record in one table corresponds to multiple records in another, adjoining table. Hence, if we count the number of new rows created by merging occupation titles and their many technologies, a simple row count will sum up to total unique technologies by occupation.

occupations %>% 
  left_join(technologies) %>% 
  group_by(Title) %>% 
  summarize(Technologies = n()) %>% 
  arrange(-Technologies) %>% 
  head(10)

Instructions

Lab Template

Downlaod the lab template:

You may have to right-click on the file and “save as” depending upon your browser.

Remember to name your file: lab-##-lastname.rmd

Data

The following objects have been created for you from the O*NET 26.3 Database.

Abilities measures innate or natural abilities of occupational incumbents
AlternateTitles lists over 53,000 alternate names for O*NET occupations
Experience measures various kinds of experience, albeit encoded
ExperienceCategories decodes measured experience
Interests measures occupations with the Holland Code (RIASEC) Test
JobZoneReference decodes classes of career development
JobZones broad classes of career development for each occupation; encoded
Knowledge defines areas of expertise and importance
OccupationData lists the occupation SOC, title, and description
Skills measures learned abilities of occupational incumbents
TechnologySkills lists all relevant technology skills
WorkStyles measures various performance indicators
WorkValues measures important occupation qualities

Part 1: Basic Join Challenges

The following challenges require no more then 2-3 tables and use the same column names across all tables. Use basic dplyr verbs to answer specific questions.

(1) Question

Which 10 occupation titles (OccupationData$Title) have the highest number of “hot technologies” (TechnologySkills$HotTechnology)?

# Code

(2) Question:

Which 10 occupation titles (OccupationData$Title), with morethan 15 unique technologies, have the highest proportion (%) of “hot technologies” (TechnologySkills$HotTechnology)?

# Code

(3) Question

Which unique occupation titles (OccupationData$Title) have “R” listed as a technology example (TechnologySkills$Example)?

# Code

(4) Question

Which alternate occupation titles (AlternateTitles) have “R” listed as a technology example (TechnologySkills$Example)?

# Code

(5) Question

Which top 25 occupation titles (OccupationData$Title) have the highest level of “Independence” (Element Name) according to WorkValues?

# Code

(6) Question

Which top 25 occupation titles (OccupationData$Title) have the highest level of “Persistence” (Element Name) according to WorkStyles?

# Code

(7) Question

According to Holland Codes, “Realistic” describes people who like to work with things, requiring “motor coordination, skill, and strength”. Which top 20 occupation titles (OccupationData$Title) have the highest rating in “Realistic” scores (Element Name) according to Interests?

# Code

(8) Question

Again, per Holland Codes, “Investigative” describes people who like to work with ideas, preferring observation over action and facts over feelings. Which top 20 occupation titles (OccupationData$Title) have the highest rating in “Investigative” scores (Element Name) according to Interests?

# Code

(9) Question

Job Zones loosely classify how advanced one typically must be as an occupational incumbent. How many unique occupation titles (OccupationData$Title) are classified in each Job Zone (JobZones)?

# Code

(10) Question

Of occupation titles (OccupationData$Title) that list “R” as the Example in TechnologySkills, what proportion is in each Job Zone (JobZones)?

# Code

Part 2: Joins & Salary Predictions

The following challenges require 3-4 merged tables and incorporate data from the U.S. Bureau of Labor Statistics’ 2021 Occupational Employment and Wage Statistics which, like other federal agencies, uses unique SOC codes for each occupation.

The following commands will read in BLS data and rename and reformat the columns to match O*NET.

url <- paste0("https://raw.githubusercontent.com/DS4PS/ays-r-cod",
              "ing-sum-2022/main/labs/bls_occupation_salaries.csv")

Salaries <- read_csv(url, 
                     col_select = c("O*NET-SOC Code" = "OCC_CODE", 
                                    "Title" = "OCC_TITLE", 
                                    "Median" = "A_MEDIAN")) %>% 
  mutate(`O*NET-SOC Code` = gsub(x = `O*NET-SOC Code`, 
                                 pattern = "$", 
                                 replacement = "\\.00"),
         Median = gsub(x = Median, 
                       pattern = ",", 
                       replacement = ""),
         Median = as.numeric(Median))

Note: There is one key distinction between ONET and BLS SOC codes. The 2019 O*NET Taxonomy features even more detailed profiles than typical federal agencies. Some more detailed ONET occupation profiles will not successfully merge with BLS occupations.

(1) Question

Join the O*NET OccupationData and BLS Salaries tables. Then, filter out any occupations with an NA value in variable Median. Store this new table as object Common. How many occupations have salary data available from BLS?

# Code

Use basic dplyr verbs to answer each question.

(2) Question

Join new object Common with object Interests. Note that each interest is in Element Name and each interest rating (on a 7-point scale) is listed in Data Value. In effect, we can now associate each “interest” (e.g. artistic, social, enterprising) with median annual salaries.

Use the lm() function with formula Median ~ Element Name to print the coefficients of a simple linear model. Which two element names (not “High Points”) are most associated with higher salaries?

# Code

(3) Question

Now join Common with TechnologySkills. Filter occupations to only include the following technologies in Example, which have been provided for you in object tech:

R
C
Python
JavaScript
Microsoft Excel
Spreadsheet software
SAS statistical software
Hypertext markup language HTML

Group by technologies (Example) and determine the following summaries:

Average value for Median (annual salary)
Count for total occupations with said tech

tech <- c("R",
          "C++",
          "Python",
          "JavaScript",
          "Microsoft Excel",
          "Spreadsheet software",
          "SAS statistical software",
          "Hypertext markup language HTML")

# Code

(4) Question

Please rate your own work values by providing a score from 1 to 7 for the following value categories.

Achievement
Working Conditions
Recognition
Relationships
Support
Independence

A “7” indicates highest importance. A “1” indicates lowest importance. A default of 3.5 has been provided as an example.

my_values <- tibble::tibble("Achievement"        = 3.5, 
                            "Working Conditions" = 3.5,
                            "Recognition"        = 3.5,
                            "Relationships"      = 3.5,
                            "Support"            = 3.5,
                            "Independence"       = 3.5)

(5) Question

Join objects Common and WorkValues and filter the following values from variable Element Name:

“First Work Value High-Point”
“Second Work Value High-Point”
“Third Work Value High-Point”

Then, use select() on the following variables:

“O*NET-SOC Code”
“Element Name”
“Data Value”
“Median”

Pipe (%>%) your output into the following tidyr function:

# Code

tidyr::pivot_wider(names_from = "Element Name", 
                   values_from = "Data Value")

Lastly, use function lm() and predict() to estimate your salary based on your work values!

Make sure to set eval = TRUE for the following to run:

my_model <- lm(formula = Median ~ 
                 Achievement + 
                 `Working Conditions` + 
                 Recognition + 
                 Relationships + 
                 Support + 
                 Independence, 
               data = you_data)

predict(object = my_model, newdata = my_values)

How to Submit

Use the following instructions to submit your assignment, which may vary depending on your course’s platform.

Knitting to HTML

When you have completed your assignment, click the “Knit” button to render your .RMD file into a .HTML report.

Special Instructions

Perform the following depending on your course’s platform:

Canvas: Upload both your .RMD and .HTML files to the appropriate link
Blackboard or iCollege: Compress your .RMD and .HTML files in a .ZIP file and upload to the appropriate link

.HTML files are preferred but not allowed by all platforms.

Before You Submit

Remember to ensure the following before submitting your assignment.

Name your files using this format: lab-##-lastname.rmd and lab-##-lastname.html
Show both the solution for your code and write out your answers in the body text
Do not show excessive output; truncate your output, e.g. with function head()
Follow appropriate styling conventions, e.g. spaces after commas, etc.
Above all, ensure that your conventions are consistent

See Google’s R Style Guide for examples of common conventions.

Common Knitting Issues

.RMD files are knit into .HTML and other formats procedural, or line-by-line.

An error in code when knitting will halt the process; error messages will tell you the specific line with the error
Certain functions like install.packages() or setwd() are bound to cause errors in knitting
Altering a dataset or variable in one chunk will affect their use in all later chunks
If an object is “not found”, make sure it was created or loaded with library() in a previous chunk

If All Else Fails: If you cannot determine and fix the errors in a code chunk that’s preventing you from knitting your document, add eval = FALSE inside the brackets of {r} at the beginning of a chunk to ensure that R does not attempt to evaluate it, that is: {r eval = FALSE}. This will prevent an erroneous chunk of code from halting the knitting process.

Lab 06: Dataset Joins

Background

Data: O*NET & SOC Codes

Exploring O*NET Tables

Occupation Titles

Occupation Technologies

Joining Tables

Insights with Joins

Instructions

Lab Template

Data

Part 1: Basic Join Challenges

(1) Question

(2) Question:

(3) Question

(4) Question

(5) Question

(6) Question

(7) Question

(8) Question

(9) Question

(10) Question

Part 2: Joins & Salary Predictions

(1) Question

(2) Question

(3) Question

(4) Question

(5) Question

How to Submit

Knitting to HTML

Special Instructions

Before You Submit

Common Knitting Issues

Data Science for
Public Service

Lab 06: Dataset Joins

Background

Data: O*NET & SOC Codes

Exploring O*NET Tables

Occupation Titles

Occupation Technologies

Joining Tables

Insights with Joins

Instructions

Lab Template

Data

Part 1: Basic Join Challenges

(1) Question

(2) Question:

(3) Question

(4) Question

(5) Question

(6) Question

(7) Question

(8) Question

(9) Question

(10) Question

Part 2: Joins & Salary Predictions

(1) Question

(2) Question

(3) Question

(4) Question

(5) Question

How to Submit

Knitting to HTML

Special Instructions

Before You Submit

Common Knitting Issues

Data Science forPublic Service

Data Science for
Public Service