Using Administrative Data

January 18, 2019

(Team 7)

Overview

The following summaries provide an overview of the many uses of administrative data and of topics regarding privacy and transparency.

Reality Mining: Using big data to engineer a better world. Chapter 7 - Taking the Pulse of a Nation: Census, Mobile Phones, and Internet Giants

National-scale researchers & entrepreneurs get access to data from national censuses, call data records or call detail records (CDRs), major internet companies (Google, Facebook, Twitter), and banks.

National Census Data – National census data is the easiest to obtain because it’s publicly available. The World Bank conducts international surveys and compiles census data, and Google has integrated this data into a visualization tool in its search results. The U.S. government makes U.S. census data available through American FactFinder. (P.113) In 2009 the government launched data.gov, where many interactive data sets have their own Application Programming Interfaces (APIs) for integrating maps and charts into web applications. (P.114) The World Bank accumulates data from more than 200 countries with over 7,000 indicators (such as GDP and gas prices) to produce a data catalog on land, literacy rates, health, climate change, and much more. Although the World Bank holds one of the “richest collections” of world data, some inconsistencies exist because of differences in nations’ timing and reporting practices. Data can also be accessed by indicators, operations, and financial data. Google uses World Bank data in Google Public Data Explorer, which allows one to “slice and dice data” from the World Bank. (P. 115)

Call Data Records (CDRs) – CDRs were historically used only for billing purposes, but since 2005 researchers have recognized how valuable this data can be for modeling human mobility. Very few researchers and entrepreneurs have access to CDRs, and those who do must abide by legal stipulations, such as “no proprietary or personal information will ever be made public.” (P.112) To obtain access, some researchers must agree to show how their work can help the mobile carrier with future predictions. Mobile operators must be careful about making the data available because, even with anonymized data, it’s still possible to identify an individual by cross-referencing data. With just a date of birth, gender, and zip code, 63-87% of the U.S. population can be identified. (P.117) Sprint researchers had a data set of 30 billion calls made by 25 million mobile users in the U.S. The researchers felt that, given enough time to figure out each user’s top locations (home and work) and combine them with census data, they could identify almost all the users. Based on this conclusion, researchers suggest that data should be “coarse in time and in space.” (P.117) In other words, data should be collected over one day rather than over a month, and from a larger geographical area rather than from a single tower. AT&T operates the Work & Home Extracted Regions (WHERE) project, in which a synthetic model of mobile users is developed for a particular city. This approach has been fairly successful in maintaining individual privacy in NYC & LA, but is still in the research phase. WHERE could become an important tool if it can be applied outside of city limits. Currently, AT&T is working on projects such as AirGig, 5G, and Acumos to bring superfast internet to suburban and rural areas over power lines, combat biases in advertising, and help accelerate the adoption of various technologies.
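
The cross-referencing risk described above can be illustrated with a small sketch. It is not the method used in the book: the records are invented, and the coarsening rule (birth year only, first three zip digits) is an assumption chosen to mirror the “coarse in time and in space” idea on demographic fields rather than on call records.

```python
from collections import Counter

# Hypothetical, made-up records: (date of birth, gender, zip code).
records = [
    ("1980-03-14", "F", "28202"),
    ("1980-03-14", "F", "28203"),
    ("1980-07-02", "M", "28202"),
    ("1980-11-30", "F", "28205"),
    ("1975-01-09", "M", "28202"),
    ("1975-06-21", "M", "28204"),
]

def unique_fraction(rows):
    """Share of rows whose attribute combination appears exactly once."""
    counts = Counter(rows)
    unique = sum(1 for row in rows if counts[row] == 1)
    return unique / len(rows)

def coarsen(row):
    """Keep only the birth year and the first three zip digits (assumed rule)."""
    dob, gender, zip_code = row
    return (dob[:4], gender, zip_code[:3])

fine = unique_fraction(records)
coarse = unique_fraction([coarsen(r) for r in records])
print(f"unique at full detail: {fine:.0%}")
print(f"unique after coarsening: {coarse:.0%}")
```

In this toy data every record is unique at full detail, while only one of six remains unique after coarsening, which is the intuition behind releasing coarser data.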

Internet Companies – Google services include Gmail, the Google+ social network, YouTube, and the Chrome web browser. Google keeps a log of searches and URLs if the “Instant” feature is enabled on Chrome. Google also “logs YouTube videos watched, activity on Google+, and the text of emails sent through Gmail.” (P.119) Google’s AdSense program runs advertisements based on information gathered through key words and phrases.

Facebook offers an advertising network similar to Google’s AdSense. Both programs operate on “clicks” and “impressions” based on key words. Key words are set by the advertiser and “pulled” from FB users’ profiles. An advertiser then receives general statistics, such as the number of emails or profiles the ad appears in. This information is “mined” to get the overall feel of a product across demographics. FB also has an API that gives a programmer access to phone numbers, contact lists, etc., if the FB user allows this in their privacy settings. Zynga, an FB application developer, collects the habits and behaviors of people who play its games. Its most popular game, FarmVille, has been questioned on ethical grounds because it requires participants to trade information about their friends, likes, desires, and consumption habits in order to play.

Twitter is different from Google and FB because most of its users’ data is publicly available and can be used to “mine for the sentiment of a nation” based on the location of a tweet. (P. 121) However, it’s a challenge to separate the signal from the “noise” of fragmented conversations, links, hashtags, and abbreviations.

Banking Transactions – Banking transactions are the most difficult to access. Banks use data analytics to detect fraud, to predict when someone might switch banks, or to adjust interest rates depending on spending habits or risk. (P. 121) Since banking transactions are tied to a location and a specific action, researchers can obtain a “fine-grained picture of a person’s economic behavior.” (P. 121) In 2008, Bank of America partnered with MIT using a sample of 10,000 customers, with various metrics, over a 3-year period. Recently, Tresata, an analytics software company in Charlotte, North Carolina, secured a $50 million growth capital investment from GCP Capital Partners. Mint, owned by Intuit, is a free, web-based personal financial management service for the US and Canada. Mint tracks on-line spending and uses the data to advertise financial products to customers based on their spending habits.

The NYPD Was Systematically Ticketing Legally Parked Cars for Millions of Dollars a Year- Open Data Just Put an End to It

This blog post shows how Open Data can be used by citizens to help themselves and their government.

In late 2008, NYC passed legislation allowing drivers to park in front of a sidewalk pedestrian ramp as long as it’s not connected to a crosswalk. A citizen continued to receive parking tickets for parking in a “legal” spot. The tickets were always dismissed, but only after a time-consuming process.

This citizen used NYC’s Open Data Portal to find the parking spots in NYC where cars were most commonly ticketed for blocking pedestrian ramps. As a sample, 30 random spots were chosen from those that had received more than 5 tickets in the last 2.5 years, and Google Maps confirmed that all of the sampled spots were legal. Based on the data, there were 1,966 such spots, generating about $1.7 million annually. Note: It’s possible some of these spots were illegal, but the sample results suggest the majority are legal.
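
The filter-then-sample approach described above can be sketched as follows. This is an illustration only, not the blogger’s actual code: the addresses and counts are invented, and the function names `frequent_spots` and `sample_for_review` are hypothetical.

```python
import random
from collections import Counter

# Hypothetical ticket records: one street address per ticket issued.
tickets = (
    ["100 Example St"] * 8
    + ["22 Sample Ave"] * 6
    + ["7 Placeholder Rd"] * 3
    + ["55 Demo Blvd"] * 12
)

def frequent_spots(addresses, min_tickets=5):
    """Return {address: count} for spots with more than min_tickets tickets."""
    counts = Counter(addresses)
    return {addr: n for addr, n in counts.items() if n > min_tickets}

def sample_for_review(spots, k, seed=0):
    """Pick up to k spots at random for manual checking on a map."""
    rng = random.Random(seed)
    names = sorted(spots)
    return rng.sample(names, min(k, len(names)))

spots = frequent_spots(tickets)
to_check = sample_for_review(spots, k=30)
print(spots)
print(to_check)
```

For scale, the figures quoted above ($1.7 million across 1,966 spots) work out to roughly $865 in fines per spot per year.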

The citizen posted a map of 1,000 pedestrian ramp parking spots, with the number of tickets each spot had received to date, for others to view on-line. The citizen determined that Brooklyn’s 70th precinct had the most wrongly ticketed cars, generating over $107k in fines per year, with the 77th precinct bringing in over $101k per year. Next, the citizen reached out to the NYPD via the Mayor’s Office of Analytics and Manhattan Borough President Gale Brewer’s Office and received feedback stating that patrol officers were unfamiliar with the rule change. The officers have since been trained, and the NYPD is digitally monitoring tickets to limit erroneous ticketing in the future.

Open Data Reveals $791 Million Error in Newly Adopted NYC Budget

In 2016, New York City launched a searchable database of the municipal budget. Although it covered only the current-year budget, it allowed a breakdown of specific budget units by department. The transparency led to the realization that NYC had adopted a budget with a nearly $800 million error.

The new searchable database categorized expenses all the way down to the “Object Code” Name. These codes cover items such as full-time position costs, overtime, and postage. Each code is assigned to an individual agency, making it easy to track expenses. Prior to the database, anyone who wanted to analyze and assess the NYC budget had to sort through hundreds of pages of PDFs.

While sorting through the data, a question arose about the largest expenses within the NYPD. An analysis of the largest budget codes made it clear that “Protection of Foreign Missions” was the largest expense. By way of comparison, more money was going to be spent on protecting foreign missions than on school safety, transit, housing, and narcotics combined. The analysis showed that this budget category alone amounted to about 1% of NYC’s budget and 15% of the NYPD’s entire budget.
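
As a rough sanity check, the two percentages above can be turned back into the totals they imply. The sketch below uses only the figures quoted in this summary; because the shares are rounded, the results are approximate, and the helper name `implied_total` is hypothetical.

```python
# Figures quoted in the summary above; the shares are rounded ("about").
line_item = 791e6        # "Protection of Foreign Missions" line, in dollars
share_of_city = 0.01     # about 1% of NYC's budget
share_of_nypd = 0.15     # about 15% of the NYPD's budget

def implied_total(part, share):
    """Back out the total that a part and its share of that total imply."""
    return part / share

city_total = implied_total(line_item, share_of_city)
nypd_total = implied_total(line_item, share_of_nypd)
print(f"implied city budget: ${city_total / 1e9:.1f} billion")
print(f"implied NYPD budget: ${nypd_total / 1e9:.1f} billion")
```

The implied totals come to roughly $79 billion for the city and roughly $5.3 billion for the NYPD, which shows how large a single $791 million line is relative to the department.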

Looking at the prior year’s budget made it obvious that this $791 million line item was an error. But how on earth does a mistake like this happen? How does a typo like this make it through the entire budget process? Human error is understandable, and acceptable, but examples like this make the case for why open data and transparency are necessary to hold public entities accountable.
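
The prior-year comparison mentioned above can be sketched as a simple year-over-year check. This is an assumption about how such a check might look, not the blogger’s method: only the 791 figure comes from the summary, every other number is invented, and the threshold `max_ratio` is arbitrary.

```python
# Hypothetical budget lines in millions of dollars: (agency, code, prior, adopted).
# Only the 791.0 figure comes from the summary; all other numbers are invented.
budget_lines = [
    ("Police", "Overtime", 500.0, 520.0),
    ("Police", "Postage", 1.0, 1.1),
    ("Police", "Protection of Foreign Missions", 25.0, 791.0),
    ("Parks", "Full-time positions", 300.0, 310.0),
]

def flag_jumps(lines, max_ratio=3.0):
    """Return lines whose adopted amount exceeds max_ratio times the prior year."""
    flagged = []
    for agency, code, prior, adopted in lines:
        if prior > 0 and adopted / prior > max_ratio:
            flagged.append((agency, code, prior, adopted))
    return flagged

for agency, code, prior, adopted in flag_jumps(budget_lines):
    print(f"{agency} / {code}: {prior} -> {adopted} ({adopted / prior:.1f}x)")
```

A check like this only works when line items are published in a machine-readable form for more than one year, which is part of the argument for open budget data.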

A Look at NYC’s $650 Million Property Tax Breaks Related to Religion

A team downloaded every individual tax bill via PDF to create a map that shows the locations of religious property tax exemptions as well as the type of exemption, the amount, and the name of the religious institution. The exemption types included: “House of Worship,” “Religious Dormitory,” “Clergy,” “Parsonage,” “Religious Mission,” “Bible,” and “Salvation Army.”

A first look at the data revealed that the sum of all exemptions totaled $12.9 billion, against a total tax due of about $21.6 billion. Many of these tax abatements go to public agencies such as parks, the department of education, and the port authority. Number six on the list, however, is houses of worship, totaling $650 million a year, approximately 1% of the city’s entire budget. These exemptions also add up to $76 per NYC resident.
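
The figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted in this summary; the variable names are hypothetical, and reading $12.9 billion as a share of $21.6 billion is one interpretation of the sentence above.

```python
# Figures quoted in the summary above, in dollars.
total_exemptions = 12.9e9     # sum of all property tax exemptions
total_tax_due = 21.6e9        # total tax due
worship_exemptions = 650e6    # houses of worship, per year
per_resident = 76             # religious exemptions per NYC resident

exempt_share = total_exemptions / total_tax_due
worship_share = worship_exemptions / total_exemptions
implied_residents = worship_exemptions / per_resident

print(f"exemptions as a share of tax due: {exempt_share:.0%}")
print(f"houses of worship as a share of exemptions: {worship_share:.0%}")
print(f"implied number of residents: {implied_residents / 1e6:.1f} million")
```

With these inputs the exemption share is roughly 60%, houses of worship account for about 5% of all exemptions, and the per-resident figure implies about 8.6 million residents, in line with New York City’s population.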

Here are some of the highlights of the analysis:
* The neighborhoods receiving the largest amount of exemptions per resident are some of our wealthiest
* Communities with large Jewish populations have the most religious schools per capita
* Clergy may be priced out of Manhattan, but there are plenty living in Ocean Parkway South
* South Jamaica has the most houses of worship per capita
* There is a “Bible” exemption taken by just two properties in NYC

8 Principles of Open Data

In 2007, a number of open data advocates met and established principles of open government data. Below is a brief recap of the principles:

Complete: All data considered public is made available. Public data is defined as “data that is not subject to valid privacy, security or privilege limitations.”
Primary: All data is collected from the original source.
Timely: Data is made available as quickly as possible.
Accessible: Data should be available to a wide variety of users for a wide range of purposes. This includes being accessible via the internet, considering the needs of the disabled, and following current industry-standard protocols and formats.
Machine-processable: Data is provided in a format that allows for automated processing, which requires proper encoding.
Non-discriminatory: Data is available to all persons, including those who request it anonymously.
Non-proprietary: Data is available in a format over which no one entity has exclusive control, and therefore not in a proprietary format.
License-free: Government data is often a mixture of personal data, copyrighted information, and other forms of non-open data. Therefore, the experts determined that open data should not be subject to any copyright, patent, trademark, or trade secret regulation, but “reasonable privacy, security and privilege restrictions may be allowed.”
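
To make the “machine-processable” principle concrete, the sketch below shows how little code is needed to analyze data published as properly encoded CSV, compared with reading figures out of PDFs by hand. The column names and values are invented for illustration, and `total_by_agency` is a hypothetical helper.

```python
import csv
import io

# Hypothetical open-data extract; the column names and values are invented.
raw = """agency,object_code,amount
Police,Overtime,520.0
Police,Postage,1.1
Parks,Overtime,12.5
"""

def total_by_agency(text):
    """Sum the amount column per agency from CSV text."""
    totals = {}
    for row in csv.DictReader(io.StringIO(text)):
        agency = row["agency"]
        totals[agency] = totals.get(agency, 0.0) + float(row["amount"])
    return totals

print(total_by_agency(raw))
```

The same totals would require manual transcription if the data were only available as scanned or formatted documents.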

There are seven additional principles that the group could have considered but did not. Those include:

Online and free: Open data should be available online at no cost.
Permanent: The information should be provided in a stable location and remain accessible for as long as possible.
Trusted: The data should carry attestation, such as digital signatures or publication dates, verifying its authenticity.
A presumption of openness: The government makes public information available proactively, with little to no barriers to use and access.
Documented: It is as important for users to know that the data is current as it is for the data itself to be current, so that users can assess its accuracy.
Safe to open: Data content should be free of malware, viruses, worms, etc.
Designed with public input: Public input helps ensure that appropriate information technologies are used for public use and dissemination.

Key Take-aways:

Summary #1:

National-scale researchers & entrepreneurs get access to data sources from:
National censuses - World Bank, American FactFinder, Data.gov
Call data records or call detail records (CDRs) - Modeling human mobility; caution due to cross-referencing; coarse in time & in space; AT&T WHERE project – synthetic users
Major internet companies - Google (Gmail, Google+ social network, YouTube, Chrome web browser, AdSense); Facebook (advertising, Zynga, FarmVille); Twitter (“mine for the sentiment of a nation” based on the location of a tweet)
Banking transactions - Most difficult to access; transactions are tied to a location & a specific action; “fine-grained picture of a person’s economic behavior”

Summary #2:

NYC passed legislation allowing drivers to park in front of a sidewalk pedestrian ramp as long as it’s not connected to a crosswalk.

Patrol officers continued to ticket vehicles parked in “legal” spots.

A citizen was able to use NYC’s Open Data Portal and Google Maps to show that citizens parking in legal spots were being fined millions of dollars.

The citizen notified NYC, and the NYPD provided training to officers & established a digital monitoring system to ensure citizens weren’t being ticketed in “legal” spots.

Summary #3:

Open Data on the NYC budget for one fiscal year revealed a nearly $800 million typo in the approved budget. The new searchable database categorized expenses all the way down to the “Object Code” Name. These codes cover items such as full-time position costs, overtime, and postage. Each code is assigned to an individual agency, making it easy to track expenses. It also made it easy to find errors.

Summary #4:

Summary #5:

According to leading experts, Open Data should be complete, primary, timely, accessible, machine-processable, non-discriminatory, non-proprietary, and license-free.

Discussion Questions:

Do you think Zynga is being ethical (through Facebook) in allowing FarmVille to gather information regarding friends, likes, desires, and consumption habits?
Are you comfortable with Tresata, as well as an investment firm, having access to your personal financial management data?
Do you like personalized advertisements regarding on-line spending based on your personal spending tracked through Mint?
Do you think a city allowing citizens access to an Open Portal is good for the citizens and/or the city?
NYC will lose a substantial dollar amount in parking fines now that a citizen took the initiative, using the City’s Open Portal system, to show that parking tickets were unfair. Do you think all cities should be this transparent?
Do you think cities that aren’t transparent are purposefully withholding data to “hide” the facts?
How concerning is it that there was such a large error in the approved budget?
Have you encountered a situation where an error was exposed by a third party who analyzed your organization’s data? 
Should all budget line items be available for public scrutiny? What potential challenges might this present?
From a taxpayer perspective, how valuable is this specific property tax data? How should policy makers use this data for policy decisions?
Have you used similar mapping strategies to better understand where resources are allocated in your organization? 
Are there privacy concerns about visually mapping out specific taxpayer information including address/location?
In your current role/organization, how many of these 8 principles do you utilize when providing open data to the public? 
For those in the public sector, what are some of the challenges government entities face when attempting to provide data and information? 

References:

Eagle, N., & Greene, K. (2014). Reality mining: Using big data to engineer a better world. MIT Press. Chapter 7 - Taking the Pulse of a Nation: Census, Mobile Phones, and Internet Giants
Project AirGig Gets Closer to Initial Commercial Deployment, Dallas, Sept 10, 2018 https://about.att.com/story/project_airgig_trials_georgia.html
A History of Firsts: AT&T Labs is Still Creating the Future 35 Years After the First Cellular Service, Andre Fuetsch, October 11, 2018 https://about.att.com/newsroom/2018/35th_mobile_anniversary_call.html
Zynga’s FarmVille, social games, and the ethics of big data mining, Michele Willson & Tama Leaver, June 10, 2015 https://www.tandfonline.com/doi/full/10.1080/22041451.2015.1048039
Tech company Tresata just became Charlotte’s third unicorn, Caroline Hudson - Staff Writer, Charlotte Business Journal, Oct 10, 2018, 10:55am EDT https://www.bizjournals.com/charlotte/news/2018/10/10/tech-company-tresata-just-became-charlottes-third.html
The NYPD Was Systematically Ticketing Legally Parked Cars for Millions of Dollars a Year- Open Data Just Put an End to It, May 11, 2016, http://iquantny.tumblr.com/
Open Data Reveals $791 Million Error in Newly Adopted NYC Budget July 15, 2016 http://iquantny.tumblr.com/
A Look at NYC’s $650 Million Property Tax Breaks Related to Religion http://iquantny.tumblr.com/
8 Principles of Open Government Data LINK