Lab 04 introduces the tools of string operators and regular expressions that enable rich text analysis in R. They are important functions for cleaning data in large datasets, generating new variables, and qualitative analysis of text-based databases using tools like content analysis, sentiment analysis, and natural language processing libraries.
The data for the lab comes from IRS archives of 1023-EZ applications that nonprofits submit when they are filing for tax-exempt status. We will use mission statements, organizational names, and activity codes.
The lab consists of three parts. Part I is a warm-up that asks you to construct a few regular expressions to identify specific patterns in the mission text.
Part II asks you to use the quanteda package, a popular text analysis package in R, to perform a simple content analysis by counting the most frequently-used words in the mission statements.
Part III asks you to use text functions and regular expressions to search mission statements to develop a sample of specific nonprofits using keywords and phrases.
IRS documentation on 1023-EZ forms is available here.
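library( dplyr )    # provides the %>% pipe
library( pander )   # provides pander() for formatted tables
URL <- "https://github.com/DS4PS/cpp-527-spr-2020/blob/master/labs/data/IRS-1023-EZ-MISSIONS.rds?raw=true"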
dat <- readRDS(gzcon(url( URL )))
head( dat[ c("orgname","codedef01","mission") ] ) %>% pander()
orgname | codedef01 |
---|---|
NIA PERFORMING ARTS | Arts, Culture, and Humanities |
THE YOUNG ACTORS GUILD INC | Arts, Culture, and Humanities |
RUTH STAGE INC | Arts, Culture, and Humanities |
STRIPLIGHT COMMUNITY THEATRE INC | Arts, Culture, and Humanities |
NU BLACK ARTS WEST THEATRE | Arts, Culture, and Humanities |
OLIVE BRANCH THEATRICALS INC | Arts, Culture, and Humanities |
mission |
---|
A community based art organization that inspires, nutures,educates and empower artist and community. |
We engage and educate children in the various aspect of theatrical productions, through acting, directing, and stage crew. We produce community theater productions for children as well as educational theater camps and workshops. |
Theater performances and performing arts |
To produce high-quality theater productions for our local community, guiding performers and audience members to a greater appreciation of creativity through the theatrical arts - while leading with respect, organization, accountability. |
Search for each of the patterns below, print the first six missions that meet the criteria, and report the total count of missions that meet the criteria.
You will use grep() and grepl(). By default grep() returns the positions (indices) of the missions that match the pattern; with value=TRUE it returns the full matching strings instead. grepl() is the logical version of grep(): it returns a vector of TRUE or FALSE values, with TRUE marking the cases that match the specified criteria.
grep( pattern="some.reg.ex", x="mission statements", value=TRUE )
You will print and count matches as follows. As an example, let’s search for missions that contain numbers.
Provide entertainment and education to private residence of a private residential community over 55., To serve the community as a nonprofit organization dedicated to producing live theatre and educational opportunities in the theater arts. The theater’s primary activity is to put on 3-5 plays annually in Colorado Springs, CO., The organization is a theater company that performs 3-4 plays per year., Our mission is to facilitate personal growth and social development through the creativity of the Theatre Arts. We offer musical theatre camps for ages 4-7, 8-12, and 13-20 in the summers & community theatre and classes for all ages fall-spring, Nurture minority actors, directors, playwrights, and theater artists by offering them the opportunity to participate in the best classic, contemporary, and original theater (A65 and R30). and The 574 Theatre Company strives to be a professional theatre company located in St. Joseph County, IN who seeks to create, inspire, and educate the members of the 574 community by producing high quality and innovative theatrical entertainment.
## [1] 4142
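As a minimal sketch of the print-and-count pattern, the example below searches a small made-up vector for missions that contain numbers. The vector `toy.missions` and the pattern `"[0-9]"` are illustrative assumptions, not the lab data or the original code chunk.

```r
# hypothetical toy data standing in for dat$mission
toy.missions <- c( "we stage 3-5 plays annually",
                   "theater performances and performing arts",
                   "serving residents over 55",
                   "to educate children" )

# grep() returns positions by default, full strings with value=TRUE
match.index <- grep( "[0-9]", toy.missions )
match.text  <- grep( "[0-9]", toy.missions, value=TRUE )

# grepl() returns one TRUE/FALSE per element, so sum() counts matches
has.number <- grepl( "[0-9]", toy.missions )
n.matches  <- sum( has.number )

head( match.text )   # preview the first matches
n.matches            # total count
```

Because TRUE is treated as 1 and FALSE as 0, summing the logical vector from grepl() gives the number of matching missions.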
How many missions start with the word “to”? Make sure it is the word “to” and not words that start with “to” like “towards”. You can ignore capitalization.
How many mission fields are blank? How many mission fields contain only spaces (one or more)?
How many missions have trailing spaces (extra spaces at the end)? After identifying the cases with trailing spaces use the trim white space function trimws() to clean them up.
How many missions contain the dollar sign? Note that the dollar sign is a special symbol, so you need to use an escape character to search for it.
How many mission statements contain numbers that are at least two digits long? You will need to use a quantity qualifier from regular expressions.
Report your code and answers for these five questions.
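The regular expression features these questions rely on (anchors, word boundaries, escapes, and quantifiers) can be sketched on a made-up vector. The patterns below are analogous to the questions but deliberately not identical to them, and the vector `x` is a hypothetical stand-in for the mission field.

```r
# hypothetical toy strings
x <- c( "We serve youth",
        "welcome center for veterans",
        "",
        "   ",
        "  arts + crafts for ages 5 to 12",
        "founded in 1998" )

# anchor plus word boundary: starts with the word "we", not "welcome"
starts.with.we <- grepl( "^we\\b", x, ignore.case=TRUE )

# empty strings versus strings made only of spaces
is.blank    <- x == ""
only.spaces <- grepl( "^ +$", x )

# leading spaces, then cleaned with trimws()
leading   <- grepl( "^ +", x )
x.trimmed <- trimws( x, which="left" )

# escape a special character to match it literally
has.plus <- grepl( "\\+", x )

# quantifier: a run of at least three digits
three.digits <- grepl( "[0-9]{3,}", x )

sum( starts.with.we )
sum( three.digits )
```

The same building blocks apply to the lab questions once the anchor position, the escaped symbol, and the quantifier are adjusted.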
Perform a very basic content analysis with the mission text. Report the ten most frequently-used words in mission statements. Exclude punctuation, and “stem” the words.
You will be using the quanteda package in R for the language processing functions. It is an extremely powerful tool that integrates with a variety of natural language processing tools, qualitative analysis packages, and machine learning frameworks for predictive analytics using text inputs.
In general, languages and thus text are semi-structured data sources. There are patterns and rules to languages, but the rules are less rigid and the patterns can be more subtle (computers are much better than humans at picking out patterns in language use from large amounts of text). As a result, you will find that the cleaning, processing, and preparation steps can be more intensive for text than for quantitative data. These steps are designed to filter out the large portions of text that hold sentences together and create subtle meaning in context, but offer little in terms of general pattern recognition. Eliminating capitalization and punctuation helps simplify the text. Paragraphs and sentences are atomized into lists of words. And steps like stemming or converting multiple words to single compound words (e.g. White House to white_house) help reduce the complexity of the text.
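To make these pre-processing steps concrete, here is a rough base R sketch on a single made-up sentence. It is only an illustration of the ideas; the quanteda functions used later in the lab handle these steps more carefully, and the sentence and object names are assumptions.

```r
# hypothetical example sentence
txt <- "The White House, in Washington, hosts visitors."

# 1. eliminate capitalization
txt <- tolower( txt )

# 2. eliminate punctuation (done before compounding, because
#    the underscore also counts as punctuation)
txt <- gsub( "[[:punct:]]", "", txt )

# 3. join a multi-word concept into one compound word
txt <- gsub( "white house", "white_house", txt, fixed=TRUE )

# 4. atomize the sentence into a list of words
words <- strsplit( txt, " +" )[[1]]
words
```

The order of steps matters here: stripping punctuation after compounding would delete the underscore that joins the compound word.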
The short tutorial below is meant to introduce you to a few functions that can be useful for initiating analysis with text and introduce you to common pre-processing steps.
Note that in the field of literature we refer to an author’s or a field’s body of work. In text analysis, we similarly refer to a database of text-based documents as a “corpus” (Latin for body). Each document has text, which is the unit of analysis. But it also has meta-data that is useful for making sense of the text and identifying patterns. Common meta-data might be things like year of publication, author, or type of document (newspaper article, tweet, email, spoken speech, etc.). The corpus() function primarily serves to make the text database easy to use by keeping the text and meta-data connected and in sync during pre-processing steps.
Typically large texts are broken into smaller parts, or “tokenized”. Paragraphs can be split into sentences, and sentences split into words. In regression we pay attention to correlations between numbers: when one variable X is increasing, is another variable Y also increasing, decreasing, or not covarying with X? In text analysis the analogous operation is co-occurrence. How often do words co-occur in sentences or documents? Or do they co-occur less frequently than we would expect given their frequency in the corpus (the equivalent of two words being negatively correlated)? It is through tokenization that these relationships can be established.
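Co-occurrence can be illustrated with a small base R sketch: tokenize a few made-up documents, build a document-by-word indicator matrix, and count how many documents each pair of words shares. The documents and object names are hypothetical, and this is a simplified stand-in for what text analysis packages do at scale.

```r
# hypothetical toy documents
docs <- c( "youth education program",
           "youth sports program",
           "education for adults" )

# tokenize each document into words
words <- strsplit( docs, " " )
vocab <- sort( unique( unlist( words ) ) )

# document-by-word matrix: 1 if the word appears in the document
dtm <- t( sapply( words, function(w) as.integer( vocab %in% w ) ) )
colnames( dtm ) <- vocab

# word-by-word matrix: number of documents in which both words appear
cooc <- crossprod( dtm )
diag( cooc ) <- 0    # ignore a word co-occurring with itself

cooc[ "youth", "program" ]
cooc[ "youth", "adults" ]
```

Here "youth" and "program" share two documents while "youth" and "adults" never appear together, which is the text analogue of a positive versus absent association.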
In the example below we will split missions into sets of words, apply a “dictionary” or “thesaurus” to join multiple words that describe a single concept (e.g. New York City), stem the words to standardize them as much as possible, then conduct the simplest type of content analysis possible - count word frequency.
library( quanteda )
# convert missions to all lower-case
dat$mission <- tolower( dat$mission )
# use a sample for demo purposes
dat.sample <- dat[ sample( 1:50000, size=1000 ) , ]
corp <- corpus( dat.sample, text_field="mission" )
corp
## Corpus consisting of 1,000 documents and 36 docvars.
## 19419
## "the mission of empowered communities united, inc. seeks to empower and impact the lives of families in the global community as well as help maximize the purpose of family relationships."
## 33068
## "to provide children who has been affected by violence and loss a parent with community resources, social, mental, and physical supports by bridging the gap for at risk youth."
## 42829
## "to provide support services to families that have members suffering alzheimer's disease."
## 43994
## "bowling green prayer breakfast inc brings business leaders and students together to emphasize the importance of faith in jesus christ and prayer in everyday life"
## 14804
## "to bring comfort via miniature service horse social visits for individuals suffering from physical or emotional distress (ex: hospice, hospital, halfway house, shelter, retirement home, nursing home, etc.). mini service horse education and awareness"
# pre-processing steps:
# remove sentences that contain fewer than 3 tokens
corp <- corpus_trim( corp, what="sentences", min_ntoken=3 )
# remove punctuation
tokens <- tokens( corp, what="word", remove_punct=TRUE )
head( tokens )
## tokens from 6 documents.
## 19419 :
## [1] "the" "mission" "of" "empowered"
## [5] "communities" "united" "inc" "seeks"
## [9] "to" "empower" "and" "impact"
## [13] "the" "lives" "of" "families"
## [17] "in" "the" "global" "community"
## [21] "as" "well" "as" "help"
## [25] "maximize" "the" "purpose" "of"
## [29] "family" "relationships"
##
## 33068 :
## [1] "to" "provide" "children" "who" "has" "been"
## [7] "affected" "by" "violence" "and" "loss" "a"
## [13] "parent" "with" "community" "resources" "social" "mental"
## [19] "and" "physical" "supports" "by" "bridging" "the"
## [25] "gap" "for" "at" "risk" "youth"
##
## 42829 :
## [1] "to" "provide" "support" "services" "to"
## [6] "families" "that" "have" "members" "suffering"
## [11] "alzheimer's" "disease"
##
## 43994 :
## [1] "bowling" "green" "prayer" "breakfast" "inc"
## [6] "brings" "business" "leaders" "and" "students"
## [11] "together" "to" "emphasize" "the" "importance"
## [16] "of" "faith" "in" "jesus" "christ"
## [21] "and" "prayer" "in" "everyday" "life"
##
## 14804 :
## [1] "to" "bring" "comfort" "via" "miniature"
## [6] "service" "horse" "social" "visits" "for"
## [11] "individuals" "suffering" "from" "physical" "or"
## [16] "emotional" "distress" "ex" "hospice" "hospital"
## [21] "halfway" "house" "shelter" "retirement" "home"
## [26] "nursing" "home" "etc" "mini" "service"
## [31] "horse" "education" "and" "awareness"
##
## 9214 :
## [1] "to" "manage" "and" "run" "the" "babe"
## [7] "ruth" "baseball" "program" "in" "ipswich" "ma"
# remove filler words like the, and, a, to
tokens <- tokens_remove( tokens, c( stopwords("english"), "nbsp" ), padding=F )
my_dictionary <- dictionary( list( five01_c_3= c("501 c 3","section 501 c 3") ,
united_states = c("united states"),
high_school=c("high school"),
non_profit=c("non-profit", "non profit"),
stem=c("science technology engineering math",
"science technology engineering mathematics" ),
los_angeles=c("los angeles"),
ny_state=c("new york state"),
ny=c("new york")
))
# apply the dictionary to the text
tokens <- tokens_compound( tokens, pattern=my_dictionary )
head( tokens )
## tokens from 6 documents.
## 19419 :
## [1] "mission" "empowered" "communities" "united"
## [5] "inc" "seeks" "empower" "impact"
## [9] "lives" "families" "global" "community"
## [13] "well" "help" "maximize" "purpose"
## [17] "family" "relationships"
##
## 33068 :
## [1] "provide" "children" "affected" "violence" "loss" "parent"
## [7] "community" "resources" "social" "mental" "physical" "supports"
## [13] "bridging" "gap" "risk" "youth"
##
## 42829 :
## [1] "provide" "support" "services" "families" "members"
## [6] "suffering" "alzheimer's" "disease"
##
## 43994 :
## [1] "bowling" "green" "prayer" "breakfast" "inc"
## [6] "brings" "business" "leaders" "students" "together"
## [11] "emphasize" "importance" "faith" "jesus" "christ"
## [16] "prayer" "everyday" "life"
##
## 14804 :
## [1] "bring" "comfort" "via" "miniature" "service"
## [6] "horse" "social" "visits" "individuals" "suffering"
## [11] "physical" "emotional" "distress" "ex" "hospice"
## [16] "hospital" "halfway" "house" "shelter" "retirement"
## [21] "home" "nursing" "home" "etc" "mini"
## [26] "service" "horse" "education" "awareness"
##
## 9214 :
## [1] "manage" "run" "babe" "ruth" "baseball" "program" "ipswich"
## [8] "ma"
# find frequently co-occurring words (typically compound words)
ngram2 <- tokens_ngrams( tokens, n=2 ) %>% dfm()
ngram2 %>% textstat_frequency( n=10 )
## provide community support mission youth education
## 255 229 178 152 150 144
## organization educational children services
## 128 115 108 96
Many words have a stem that is altered when conjugated (if a verb) or made plural (if a noun). As a result, it can be hard to consistently count the appearance of a specific word.
Stemming removes the last part of the word such that the word is reduced to its most basic stem. For example, running would become run, and Tuesdays would become Tuesday.
Quanteda already has a powerful stemming function included.
## provid educ communiti organ support mission youth program
## 373 327 285 230 216 156 155 140
## servic promot
## 137 131
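A minimal sketch of stemming with quanteda is shown below, assuming the package is installed. The toy text and object names are hypothetical and this is not necessarily the chunk that produced the output above; `tokens_wordstem()`, `dfm()`, and `topfeatures()` are quanteda functions.

```r
library( quanteda )

# hypothetical toy text
toy <- c( d1="running programs for communities",
          d2="the program runs on tuesdays" )

# tokenize and drop filler words
toks <- tokens( toy, remove_punct=TRUE )
toks <- tokens_remove( toks, stopwords("english") )

# reduce each word to its stem, so running and runs both become run
toks.stem <- tokens_wordstem( toks )

# count stems across documents
stem.dfm <- dfm( toks.stem )
topfeatures( stem.dfm, n=5 )
```

After stemming, different forms of the same word are counted together, which is why the frequencies of stems such as provid are higher than the counts for provide alone.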
Replicate the steps above with the following criteria:
For the last part of this lab, you will use text to classify nonprofits.
A large foundation is interested in knowing how many new nonprofits created in 2018 have an explicit mission of serving minority communities. We will start by trying to identify nonprofits that are membership organizations for Black communities or provide services to Black communities.
To do this, you will create a set of words or phrases that you believe indicates that the nonprofit works with or for the target population.
You will need to think about different ways that language might be used distinctively within the mission statements of nonprofits that serve Black communities. There is a lot of trial and error involved: you can test several words and phrases, preview the mission statements that are identified, then refine your methods.
Your final product will be a data frame of the nonprofit names, activity codes, and mission statements for the organizations identified by your criteria. The goal is to identify as many as possible while minimizing errors that occur when you include a nonprofit that does not serve the Black community. This example was selected specifically because “black” is a common and ambiguous term.
To get you started, let’s look at a similar example where we want to identify immigrant rights nonprofits. We would begin as follows:
# start with key phrases
#
# use grep( ..., value=TRUE ) so you can view mission statements
# that meet your criteria and adjust the language as necessary
grep( "immigrant rights", dat$mission, value=TRUE ) %>% head()
## [1] "community justice alliance, inc. promotes social change through advocacy, communications, community education, and litigation in the areas of racial justice, immigrant rights, and political access."
## [2] "to provide fair, trustworthy immigration legal counsel. to provide information on immigrant rights and opportunities within the community."
grep( "immigration", dat$mission, value=TRUE ) %>% head()
## [1] "charitable and educational hereditary society, encourages the study of the history of ancient wales and welsh immigration to america. research and preserve documents. support the restoration of sites and landmarks."
## [2] "1.provide immigration and other social assistance services\n\nto new oromo immigrants and refugees;\n\n2.provide health awareness and education services to\n\nmembers and the community at large;\n\n3.promote self-help and social assistance among the oromos"
## [3] "helping the immigration community with education, culture and humanity."
## [4] "provides legal immigration service to low income immigrants"
## [5] "to research migration and immigration patterns impacting economic, political and social landscape in the united states."
## [6] "to provide immigration services for low-income population in new york city, educating the public about information and issues related to immigration law as well as organizing educational programs for youth and women."
grep( "refugee", dat$mission, value=TRUE ) %>% head()
## [1] "the purposes of this organization are to unite various faiths under the umbrella of love one another, love go, and respect your environment. in addition, we wish to help the homeless, refugees and be an educational resource."
## [2] "dance beyond borders provides dance fitness instructor, financial literacy and leadership training to legal immigrants and refugees. we provide a platform for instructors to teach dance to the public which fosters cultural and ethnic awareness."
## [3] "1.provide immigration and other social assistance services\n\nto new oromo immigrants and refugees;\n\n2.provide health awareness and education services to\n\nmembers and the community at large;\n\n3.promote self-help and social assistance among the oromos"
## [4] "ethiopian and eritrean cultural and resource center (eecrc) is a non profit organization that assists african refugee and immigrant communities in oregon. it provides education, advocacy, direct services, referrals and connection to resources."
## [5] "fostering original written and vocal artworks and encouraging people to write with a particular focus on working class individuals, women, refugees and immigrants as well as the lgbtq and disability communities and survivors of violence."
## [6] "the corporation is formed for the charitable purposes of educating refugees about music through demonstration and live music making, music history and the effects of music on society."
After you feel comfortable that individual statements are primarily identifying nonprofits within your desired group and have low error rates, you will need to combine all of the criteria to create one group. Note that any organization that has more than one keyword or phrase in its mission statement would be double-counted if you use the raw groups, so we need to make sure we include each organization only once. We can do this using compound logical statements.
Note that grepl() returns a logical vector, so we can combine multiple statements using AND and OR conditions.
criteria.01 <- grepl( "immigrant rights", dat$mission )
criteria.02 <- grepl( "immigration", dat$mission )
criteria.03 <- grepl( "refugee", dat$mission )
criteria.04 <- grepl( "humanitarian", dat$mission )
criteria.05 <- ! grepl( "humanities", dat$mission ) # exclude humanities
Note that to select all high school boys you would write:
( grade_9 | grade_10 | grade_11 | grade_12 ) & ( boys )
You would NOT specify:
( grade_9 | grade_10 | grade_11 | grade_12 ) | boys
Because that would then include boys at all levels and all people in grades 9-12.
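The difference between the two statements can be checked on a small made-up roster. The data frame `students` and its columns are hypothetical; the grade condition is written as a single comparison rather than four separate vectors to keep the sketch short.

```r
# hypothetical roster
students <- data.frame( grade = c( 8, 9, 10, 11, 12, 7 ),
                        sex   = c( "boy", "boy", "girl", "boy", "girl", "girl" ) )

high.school <- students$grade >= 9      # stands in for grade_9 | ... | grade_12
boys        <- students$sex == "boy"

# AND: must be in high school and a boy
n.and <- sum( high.school & boys )

# OR: everyone in high school plus boys at every grade level
n.or <- sum( high.school | boys )

n.and
n.or
```

With this roster the AND statement selects only the two high school boys, while the OR statement selects five students, including an eighth-grade boy and two high school girls.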
Now create your sample:
these.nonprofits <- ( criteria.01 | criteria.02 | criteria.03 | criteria.04 ) & criteria.05
sum( these.nonprofits )
## [1] 406
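The double-counting concern can be verified with a short sketch on a made-up vector, comparing the total number of raw matches against the count from a compound logical statement. The vector `toy` and the object names are assumptions for illustration.

```r
# hypothetical toy missions; the second one contains both keywords
toy <- c( "legal aid for refugee families",
          "immigration services for refugee and immigrant youth",
          "community theater productions" )

# raw groups: each keyword searched separately
hits.immigration <- grep( "immigration", toy )
hits.refugee     <- grep( "refugee", toy )
n.raw <- length( c( hits.immigration, hits.refugee ) )

# compound logical statement: one TRUE/FALSE per mission
keep <- grepl( "immigration", toy ) | grepl( "refugee", toy )
n.unique <- sum( keep )

n.raw      # the second mission is counted twice
n.unique   # each mission is counted at most once
```

Because the logical vector has exactly one element per mission, subsetting with it can never return the same organization twice.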
dat$activity.code <- paste0( dat$codedef01, ": ", dat$codedef02 )
d.immigrant <- dat[ these.nonprofits, c("orgname","activity.code","mission") ]
row.names( d.immigrant ) <- NULL
d.immigrant %>% head(25) %>% pander()
orgname | activity.code |
---|---|
WALK OF STARS FOUNDATION | Arts, Culture, and Humanities: Media, Communications Organizations |
UNITED ARAB-AMERICAN SOCIETY | Arts, Culture, and Humanities: Cultural/Ethnic Awareness |
INCLUSIVE MOVEMENT FOR BOSNIA AND HERZEGOVINA INC | Arts, Culture, and Humanities: Cultural/Ethnic Awareness |
HEREDITARY ORDER OF THE RED DRAGON | Arts, Culture, and Humanities: Cultural/Ethnic Awareness |
FILIPINO-AMERICAN ASSOCIATION OF COASTAL GEORGIA INC | Arts, Culture, and Humanities: Cultural/Ethnic Awareness |
CYRUS THE GREAT KING OF PERSIA FOUNDATION | Arts, Culture, and Humanities: Cultural/Ethnic Awareness |
DANCE BEYOND BORDERS | Arts, Culture, and Humanities: Cultural/Ethnic Awareness |
SIKKO-MANDO RELIEF ASSOCIATION | Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose |
ETHIOPIAN AND ERITREAN CULTURAL ANDRESOURCE CENTER | Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose |
HARYANVI BAYAREA ASSOCIATION | Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose |
STREET CRY INC | Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose |
ASOCIACION DE MIGRANTES TIERRA Y LIBERTAD | Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose |
CONCERTS FOR COMPASSION INCORPORATED | Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose |
ARTOGETHER | Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose |
TEEN TREEHUGGERS INC | Arts, Culture, and Humanities: Arts, Cultural Organizations—Multipurpose |
UMUKABIA EHIME DEVELOPMENT ASSOC | Arts, Culture, and Humanities: Single Organization Support |
EVAN WALKER ORGANIZATION | Arts, Culture, and Humanities: Single Organization Support |
SAEA INC | Arts, Culture, and Humanities: Professional Societies & Associations |
ROSEDALE PICTURES INC | Arts, Culture, and Humanities: Film, Video |
AMERICAN DELEGATION OF THE ORDER OFDANILO 1 INC | Arts, Culture, and Humanities: Other Art, Culture, Humanities Organizations/Services N.E.C. |
VALLEY MOO DUK KWAN | Arts, Culture, and Humanities: Other Art, Culture, Humanities Organizations/Services N.E.C. |
MOVE THE WORLD | Arts, Culture, and Humanities: Dance |
BOOTHEEL CULTURAL AND PERFORMING ARTS CENTER | Arts, Culture, and Humanities: Arts Service Activities/ Organizations |
GREAT NSASS ALUMNI ASSOCIATION OF NORTH AMERICA INC | Arts, Culture, and Humanities: Humanities Organizations |
CHULA VISTA SUNSET ROTARY FOUNDATION INC | Arts, Culture, and Humanities: Humanities Organizations |
mission |
---|
the mission of the walk of stars foundation is to honor those who have excelled and made major contributions in their respective capacity in the entertainment area in motion pictures, radio and television, humanitarians, civic leaders, medal of hono |
to serve as a social organization for arab-americans to preserve the arab heritage, culture and traditions. to promote and support humanitarian and community outreach efforts locally and internationally. |
to promote the advancement of bosnia and herzegovina by fostering an inclusive platform for innovation and entrepreneurship; cultural and humanitarian activities; and networking. |
charitable and educational hereditary society, encourages the study of the history of ancient wales and welsh immigration to america. research and preserve documents. support the restoration of sites and landmarks. |
to engage in humanitarian, civic, educational , cultural, and charitable activities that would preserve, promote, and share with the community the customs, values and heritage of the filipino culture. |
the purposes of this organization are to unite various faiths under the umbrella of love one another, love go, and respect your environment. in addition, we wish to help the homeless, refugees and be an educational resource. |
dance beyond borders provides dance fitness instructor, financial literacy and leadership training to legal immigrants and refugees. we provide a platform for instructors to teach dance to the public which fosters cultural and ethnic awareness. |
1.provide immigration and other social assistance services to new oromo immigrants and refugees; 2.provide health awareness and education services to members and the community at large; 3.promote self-help and social assistance among the oromos |
ethiopian and eritrean cultural and resource center (eecrc) is a non profit organization that assists african refugee and immigrant communities in oregon. it provides education, advocacy, direct services, referrals and connection to resources. |
hba is involved in multiple non-profit activities including haryanvi cultural promotion and preservation, community services, educational activities, humanitarian aid and social activities. |
fostering original written and vocal artworks and encouraging people to write with a particular focus on working class individuals, women, refugees and immigrants as well as the lgbtq and disability communities and survivors of violence. |
helping the immigration community with education, culture and humanity. |
the corporation is formed for the charitable purposes of educating refugees about music through demonstration and live music making, music history and the effects of music on society. |
artogether is a community building creative arts project that hosts free creative art workshops, social gatherings, and family picnics to forge connections between the refugee community and the general public. |
the purpose for which the corporation is organized is to provide youth the platform to address wildlife, environmental, and humanitarian issues through arts journalism. |
umukabia ehime development association, usa, is organized exclusively for charitable purposes, which includes assisting nigerian abandoned and disabled children and to support humanitarian programs in our community and the public in general. |
our mission is to spread prosperity and compassion for all by achieving our commitment to excellence in humanitarian, environmental, and scientific initiatives. we recently conducted a successful food drive for evacuees of wildfires in ca. |
sudanese american engineers association is a non-profit, non-political, educational and humanitarian organization. its members are engineer professionals of sudanese descent. |
to train employ underprivileged individuals particularly refugees in the visual media industry. build moviemaking skills with learn by doing approach creating content to be produced by the organization with profits to partially fund future projects |
perpetuating the traditions of the dynastic and hereditary chivalric orders of the royal house of petrovi-njegos by supporting charitable, humanitarian. educational and artistic works that promote a continuing public interest in their history. |
to undertake charitable activities, both through group and individual action consistent with the practice of the art and in the interest of humanitarian principles. to facilitate participation in, and stugy of the martial arts of soo bah do. |
through the powerful expression of dance, our youth dance company performs on a local, nat’l and global stage to raise awareness of social and environmental issues. example: pollution, hunger, global refugee crisis |
to meet the cultural and humanitarian needs of the under served in the bootheel region |
humanitarian aid educational services for children and adults that are in hardships or in disaster areas within and outside united states |
world and local humanitarian, educational and cultural community service |
For your deliverables for Part III:
If you selected three nonprofit subsectors from the activity codes (code01), then created three data subsets based upon these criteria, you could conduct something like content analysis as a way to compare how the three groups use language differently.
Re-run the content analysis to identify the most frequently-used words. But this time run it separately for each subsector.
How do the most frequently used words vary by subsector? Which words are shared between the three subsectors? Which are distinct?
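One possible way to organize a by-subsector comparison is to split the missions by group and apply the same counting function to each piece. The base R sketch below uses a made-up data frame and a hypothetical helper `top.words()`; with the lab data you would substitute the quanteda steps from Part II inside the function.

```r
# hypothetical toy data with two subsectors
d <- data.frame(
  subsector = c( "arts", "arts", "education", "education" ),
  mission   = c( "community theater for youth",
                 "theater camps and theater workshops",
                 "tutoring students after school",
                 "scholarships support students" ) )

# hypothetical helper: n most frequent words in a character vector
top.words <- function( x, n=3 )
{
  words <- unlist( strsplit( tolower( x ), "[^a-z]+" ) )
  words <- words[ words != "" ]
  head( sort( table( words ), decreasing=TRUE ), n )
}

# run the same analysis separately for each subsector
by.sector <- lapply( split( d$mission, d$subsector ), top.words )
by.sector
```

The result is a named list with one frequency table per subsector, which makes it easy to see which words are shared and which are distinct.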
Another way to compare differences in language use is by creating semantic networks:
Compare prominent word relationships in mission statements of arts, environmental, and education nonprofits (codedef01). Build semantic networks for each, then compare and contrast the prominence of specific words within the networks.
When you have completed your assignment, knit your RMD file to generate your rendered HTML file.
Log in to Canvas at http://canvas.asu.edu and navigate to the assignments tab in the course repository. Upload your HTML and RMD files to the appropriate lab submission link.
Remember to:
See Google’s R Style Guide for examples.
Note that when you knit a file, it starts from a blank slate. You might have packages loaded or datasets active on your local machine, so you can run code chunks fine. But when you knit you might get errors that functions cannot be located or datasets don’t exist. Be sure that you have included chunks to load these in your RMD file.
Your RMD file will not knit if you have errors in your code. If you get stuck on a question, just add eval=F to the code chunk and it will be ignored when you knit your file. That way I can give you credit for attempting the question and provide guidance on fixing the problem.
If you are having problems with your RMD file, visit the RMD File Styles and Knitting Tips manual.