In this hypothetical scenario, pretend that you are a student currently learning the R language. Take your time and really get into the role.
We’re going to use public data from the U.S. Department of Labor to lightly research various R-related occupations, the values and interests of R-using occupational incumbents, and the labor market value of the R language. We conclude by predicting your future salary based on your work values!
O*NET, or the Occupational
Information Network, is the admirable pursuit by the Department of
Labor to consolidate the classification of approximately one thousand
kinds of jobs. If that seems like a lot, it comes from the even more
voluminous DOT (Dictionary
of Occupational Titles), which defined over 13,000 jobs since 1938.
To do this, all federal agencies use a unique identifier for each job
type, known as a SOC or Standard Occupational Classification. SOC
codes, typically stored in variable O*NET-SOC Code, most often become
the unique keys necessary for joining our tables!
The O*NET Resource Center contains all individual tables used to power My Next Move and other platforms. Tables are organized according to myriad dimensions, seen here:
Let’s look at a few tables to get to know the data.
Read in one of the highest-level tables, Occupation Data, which contains:
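library(readr)   # read_delim()
library(dplyr)   # %>%, left_join(), and other verbs
library(pander)  # pander()

url <- "https://www.onetcenter.org/dl_files/database/db_26_3_text/Occupation%20Data.txt"

occupations <- read_delim(url, show_col_types = FALSE)

occupations %>%
  head(3) %>%   # Show first 3
  pander()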
O*NET-SOC Code | Title | Description |
---|---|---|
11-1011.00 | Chief Executives | Determine and formulate policies and provide overall direction of companies or private and public sector organizations within guidelines set up by a board of directors or similar governing body. Plan, direct, or coordinate operational activities at the highest level of management with the help of subordinate executives and staff managers. |
11-1011.03 | Chief Sustainability Officers | Communicate and coordinate with management, shareholders, customers, and employees to address sustainability issues. Enact or oversee a corporate sustainability strategy. |
11-1021.00 | General and Operations Managers | Plan, direct, or coordinate the operations of public or private sector organizations, overseeing multiple departments or locations. Duties and responsibilities include formulating policies, managing daily operations, and planning the use of materials and human resources, but are too diverse and general in nature to be classified in any one functional area of management or administration, such as personnel, purchasing, or administrative services. Usually manage through subordinate supervisors. Excludes First-Line Supervisors. |
Now, let’s take a look at a more sophisticated table, Technology Skills, which contains:
Here, the first 6 observations all have the same SOC Code. That’s
because “Technology Skills” lists all technologies associated with the
occupation. Our preview shows the first 6 technologies important for
occupation SOC code 11-1011.00, or that of “Chief Executives”.
url <- "https://www.onetcenter.org/dl_files/database/db_26_3_text/Technology%20Skills.txt"

technologies <- read_delim(url, show_col_types = FALSE)

technologies %>%
  head(6) %>%   # Show first 6
  pander()
O*NET-SOC Code | Example | Commodity Code | Commodity Title | Hot Technology |
---|---|---|---|---|
11-1011.00 | Adobe Systems Adobe Acrobat | 43232202 | Document management software | Y |
11-1011.00 | AdSense Tracker | 43232306 | Data base user interface and query software | N |
11-1011.00 | Atlassian JIRA | 43232201 | Content workflow software | Y |
11-1011.00 | Blackbaud The Raiser’s Edge | 43232303 | Customer relationship management CRM software | N |
11-1011.00 | ComputerEase construction accounting software | 43231601 | Accounting software | N |
11-1011.00 | Database reporting software | 43232305 | Data base reporting software | N |
Recall that SOC Codes are unique identifiers for each occupation and are most often used fully, or in part, as merge keys.
We can merge these using dplyr function left_join(), specifying argument by =:
occupations %>%
  left_join(technologies,
            by = "O*NET-SOC Code")   # by = takes a quoted column name
Alternatively, we can use base R function merge():
merge(x = occupations,
y = technologies,
by = "O*NET-SOC Code")
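The two approaches are not exact equivalents: merge() performs an inner join unless told otherwise, while left_join() keeps every row of the left table. The sketch below illustrates the difference on two small, hypothetical tables (the codes and titles are made up and are not O*NET rows).

```r
library(dplyr)

# Hypothetical toy tables; the codes and titles are made up for illustration
toy_occupations <- tibble(code  = c("A", "B", "C"),
                          Title = c("Analyst", "Baker", "Courier"))

toy_technologies <- tibble(code    = c("A", "A", "B"),
                           Example = c("R", "Python", "Spreadsheet software"))

# dplyr left join: keeps all three occupations, matched or not
joined_dplyr <- left_join(toy_occupations, toy_technologies, by = "code")

# Base R merge() defaults to an inner join: "Courier" has no match and is dropped
joined_inner <- merge(x = toy_occupations, y = toy_technologies, by = "code")

# all.x = TRUE keeps unmatched rows of x, mirroring left_join()
joined_left <- merge(x = toy_occupations, y = toy_technologies,
                     by = "code", all.x = TRUE)

# Compare row counts across the three results
c(dplyr = nrow(joined_dplyr),
  inner = nrow(joined_inner),
  left  = nrow(joined_left))
```

If every occupation must survive the merge, all.x = TRUE is the safer choice when working in base R.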
In fact, if we simply use left_join() without identifying a merge
column, it automatically detects merge keys with common names and
notifies us in the console.
occupations %>%
  left_join(technologies) %>%
  select(Title, `Commodity Title`) %>%
  head(100)
Now that we’re capable of joining O*NET database tables, we can begin to glean our first insights. For example:
We can capture this quickly because our occupations-to-technologies merge is one-to-many, meaning that one record in one table corresponds to multiple records in another, adjoining table. Hence, if we count the number of new rows created by merging occupation titles and their many technologies, a simple row count will sum up to total unique technologies by occupation.
occupations %>%
  left_join(technologies) %>%
  group_by(Title) %>%
  summarize(Technologies = n()) %>%
  arrange(-Technologies) %>%
  head(10)
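To see why a row count works for a one-to-many merge, here is a minimal sketch on hypothetical toy tables (the codes, titles, and technologies are invented for illustration). It also shows one caveat: after a left join, an occupation with no technologies still contributes one row, so n() reports 1 where the number of actual matches is 0.

```r
library(dplyr)

# Hypothetical toy tables; codes and titles are illustrative, not O*NET rows
toy_occupations <- tibble(code  = c("A", "B", "C"),
                          Title = c("Analyst", "Baker", "Courier"))

toy_technologies <- tibble(code    = c("A", "A", "A", "B"),
                           Example = c("R", "Python", "SQL", "Spreadsheet software"))

toy_counts <- toy_occupations %>%
  left_join(toy_technologies, by = "code") %>%
  group_by(Title) %>%
  summarize(Rows         = n(),                    # rows created by the join
            Technologies = sum(!is.na(Example))) %>%  # rows with a real match
  arrange(desc(Technologies))

toy_counts
```

In this toy case "Courier" has one row but zero technologies, which is worth keeping in mind when reading the low end of such a ranking.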
Download the lab template:
You may have to right-click on the file and “save as” depending upon your browser.
Remember to name your file: lab-##-lastname.rmd
The following objects have been created for you from the O*NET 26.3 Database.
Abilities: measures innate or natural abilities of occupational incumbents
AlternateTitles: lists over 53,000 alternate names for O*NET occupations
Experience: measures various kinds of experience, albeit encoded
ExperienceCategories: decodes measured experience
Interests: measures occupations with the Holland Code (RIASEC) Test
JobZoneReference: decodes classes of career development
JobZones: broad classes of career development for each occupation; encoded
Knowledge: defines areas of expertise and importance
OccupationData: lists the occupation SOC, title, and description
Skills: measures learned abilities of occupational incumbents
TechnologySkills: lists all relevant technology skills
WorkStyles: measures various performance indicators
WorkValues: measures important occupation qualities
The following challenges require no more than 2-3 tables and use the
same column names across all tables. Use basic dplyr verbs to answer
specific questions.
Which 10 occupation titles (OccupationData$Title) have the highest
number of “hot technologies” (TechnologySkills$HotTechnology)?
# Code
Which 10 occupation titles (OccupationData$Title), with more than 15
unique technologies, have the highest proportion (%) of “hot
technologies” (TechnologySkills$HotTechnology)?
# Code
Which unique occupation titles (OccupationData$Title) have “R” listed
as a technology example (TechnologySkills$Example)?
# Code
Which alternate occupation titles (AlternateTitles) have “R” listed as
a technology example (TechnologySkills$Example)?
# Code
Which top 25 occupation titles (OccupationData$Title) have the highest
level of “Independence” (Element Name) according to WorkValues?
# Code
Which top 25 occupation titles (OccupationData$Title) have the highest
level of “Persistence” (Element Name) according to WorkStyles?
# Code
According to Holland Codes, “Realistic” describes people who like to
work with things, requiring “motor coordination, skill, and strength”.
Which top 20 occupation titles (OccupationData$Title) have the highest
rating in “Realistic” scores (Element Name) according to Interests?
# Code
Again, per Holland Codes, “Investigative” describes people who like
to work with ideas, preferring observation over action and facts over
feelings. Which top 20 occupation titles (OccupationData$Title) have
the highest rating in “Investigative” scores (Element Name) according
to Interests?
# Code
Job Zones loosely classify how advanced one typically must be as an
occupational incumbent. How many unique occupation titles
(OccupationData$Title) are classified in each Job Zone (JobZones)?
# Code
Of occupation titles (OccupationData$Title) that list “R” as the
Example in TechnologySkills, what proportion is in each Job Zone
(JobZones)?
# Code
The following challenges require 3-4 merged tables and incorporate data from the U.S. Bureau of Labor Statistics’ 2021 Occupational Employment and Wage Statistics. Like other federal agencies, BLS uses unique SOC codes for each occupation.
The following commands will read in BLS data and rename and reformat the columns to match O*NET.
url <- paste0("https://raw.githubusercontent.com/DS4PS/ays-r-cod",
              "ing-sum-2022/main/labs/bls_occupation_salaries.csv")

Salaries <- read_csv(url,
                     col_select = c("O*NET-SOC Code" = "OCC_CODE",
                                    "Title" = "OCC_TITLE",
                                    "Median" = "A_MEDIAN")) %>%
  mutate(`O*NET-SOC Code` = gsub(x = `O*NET-SOC Code`,   # append ".00" to match O*NET
                                 pattern = "$",
                                 replacement = "\\.00"),
         Median = gsub(x = Median,                       # strip thousands separators
                       pattern = ",",
                       replacement = ""),
         Median = as.numeric(Median))
Note: There is one key distinction between O*NET and BLS SOC codes. The 2019 O*NET Taxonomy features more detailed occupation profiles than those used by most federal agencies. Some of these more detailed O*NET profiles will not successfully merge with BLS occupations.
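A small sketch of why some profiles fail to merge, using hypothetical toy tables (the salary figures are made up). BLS-style codes gain a ".00" suffix, so an O*NET profile ending in anything else, such as ".03", has no partner. Here paste0() is used as a simple stand-in for the gsub() call above, and anti_join() lists the O*NET codes left without a match.

```r
library(dplyr)

# Hypothetical BLS-style rows: six-digit codes, salaries stored as text
toy_bls <- tibble(OCC_CODE = c("11-1011", "11-1021"),
                  Median   = c("100,000", "90,000"))   # made-up figures

# O*NET-style codes, including one detailed profile ending in .03
toy_onet <- tibble(`O*NET-SOC Code` = c("11-1011.00", "11-1011.03", "11-1021.00"))

toy_salaries <- toy_bls %>%
  mutate(`O*NET-SOC Code` = paste0(OCC_CODE, ".00"),         # append the suffix
         Median = as.numeric(gsub(",", "", Median)))          # "100,000" -> 100000

# O*NET rows with no partner in the salary table
unmatched <- anti_join(toy_onet, toy_salaries, by = "O*NET-SOC Code")

unmatched
```

Running a check like this before merging gives a quick count of how many detailed profiles will end up with missing salaries.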
Join the O*NET OccupationData and BLS Salaries tables. Then, filter
out any occupations with an NA value in variable Median. Store this
new table as object Common. How many occupations have salary data
available from BLS?
# Code
Use basic dplyr verbs to answer each question.
Join new object Common with object Interests. Note that each interest
is in Element Name and each interest rating (on a 7-point scale) is
listed in Data Value. In effect, we can now associate each “interest”
(e.g. artistic, social, enterprising) with median annual salaries.
Use the lm() function with formula Median ~ `Element Name` to print
the coefficients of a simple linear model (the backticks are required
because the column name contains a space). Which two element names
(not “High Points”) are most associated with higher salaries?
# Code
Now join Common with TechnologySkills. Filter occupations to only
include the following technologies in Example, which have been
provided for you in object tech:
Group by technologies (Example) and determine the following summaries:
Median (annual salary)
tech <- c("R",
          "C++",
          "Python",
          "JavaScript",
          "Microsoft Excel",
          "Spreadsheet software",
          "SAS statistical software",
          "Hypertext markup language HTML")
# Code
Please rate your own work values by providing a score from 1 to 7 for
the following value categories.
A “7” indicates highest importance. A “1” indicates lowest
importance. A default of 3.5 has been provided as an example.
my_values <- tibble::tibble("Achievement" = 3.5,
                            "Working Conditions" = 3.5,
                            "Recognition" = 3.5,
                            "Relationships" = 3.5,
                            "Support" = 3.5,
                            "Independence" = 3.5)
Join objects Common and WorkValues and filter the following values
from variable Element Name:
Then, use select() on the following variables:
Pipe (%>%) your output into the following tidyr function:
# Code

tidyr::pivot_wider(names_from = "Element Name",
                   values_from = "Data Value")
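As a rough illustration of what this reshaping does, the sketch below applies pivot_wider() to a hypothetical long table with made-up titles, salaries, and scores. Each distinct Element Name becomes its own column, leaving one row per occupation.

```r
library(dplyr)
library(tidyr)

# Hypothetical long table: one row per occupation and work value (made-up scores)
toy_long <- tibble(
  Title          = rep(c("Analyst", "Baker"), each = 2),
  Median         = rep(c(80000, 40000), each = 2),
  `Element Name` = rep(c("Achievement", "Independence"), times = 2),
  `Data Value`   = c(6.0, 5.5, 3.0, 4.5)
)

# Spread each work value into its own column
toy_wide <- toy_long %>%
  pivot_wider(names_from  = "Element Name",
              values_from = "Data Value")

toy_wide
```

The wide layout matters for the next step because a model formula refers to each work value as a separate column.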
Lastly, use function lm() and predict() to estimate your salary based
on your work values!
Make sure to set eval = TRUE for the following to run:
my_model <- lm(formula = Median ~
                 Achievement +
                 `Working Conditions` +
                 Recognition +
                 Relationships +
                 Support +
                 Independence, data = you_data)

predict(object = my_model, newdata = my_values)
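For intuition about how lm() and predict() fit together, here is a minimal sketch on a hypothetical data frame with only two predictors. All numbers are made up, and toy_data, toy_model, and toy_values are illustrative names rather than lab objects. The key requirement is that the column names in newdata match the predictor names in the formula.

```r
# Hypothetical wide table: one row per occupation (all numbers made up)
toy_data <- data.frame(
  Median       = c(40000, 52000, 61000, 75000, 83000, 98000),
  Achievement  = c(2.0, 3.0, 3.5, 4.5, 5.0, 6.5),
  Independence = c(3.0, 2.5, 4.0, 4.5, 6.0, 5.5)
)

# Fit salary as a linear function of the two ratings
toy_model <- lm(Median ~ Achievement + Independence, data = toy_data)

# One-row table of new ratings; names must match the model's predictors
toy_values <- data.frame(Achievement = 3.5, Independence = 3.5)

# Estimated salary for those ratings
toy_prediction <- predict(toy_model, newdata = toy_values)

coef(toy_model)
toy_prediction
```

With so few rows the estimate is only a demonstration of the mechanics; the same pattern applies to the six work values in the lab.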
Use the following instructions to submit your assignment, which may vary depending on your course’s platform.
When you have completed your assignment, click the “Knit” button to
render your .RMD file into a .HTML report.
Perform the following depending on your course’s platform:
Upload your .RMD and .HTML files to the appropriate link
Place your .RMD and .HTML files in a .ZIP file and upload to the appropriate link
.HTML files are preferred but not allowed by all platforms.
Remember to ensure the following before submitting your assignment.
head()
See Google’s R Style Guide for examples of common conventions.
.RMD files are knit into .HTML and other formats procedurally, or
line by line.
install.packages() or setwd() are bound to cause errors in knitting
library() in a previous chunk
If All Else Fails: If you cannot determine and fix the errors in a
code chunk that are preventing you from knitting your document, add
eval = FALSE inside the brackets of {r} at the beginning of a chunk to
ensure that R does not attempt to evaluate it, that is:
{r eval = FALSE}. This will prevent an erroneous chunk of code from
halting the knitting process.