How Data Science Can Be Applied to Solve Public Interest Problems Without Losing Its Soul

This article is a preprint of the introduction to the special issue of the Journal of Technology in Human Services: Selections from the Data for Good Exchange in 2017, co-authored by Gideon Mann, Bloomberg’s Head of Data Science, and Cornell Tech’s Arnaud Sahuguet (link).

Data science has had a tremendous impact in the private sector and is increasingly being applied in the public sector. The Data for Good Exchange (D4GX) is an annual conference centered on these novel problems and applications. This article describes the history of the conference and introduces a selection of the best papers from the 2017 conference that focus on human services.
Keywords: data science; machine learning; open data; public sector
Subject classification codes: N/A

The Data for Good Exchange (D4GX) is an annual conference on the application of data science for social good. The event, which is hosted by Bloomberg’s Office of the CTO in New York City each September, attracts a mixed audience of data scientists and direct practitioners from academia, industry, non-profits and government. Throughout the day, there is a mix of panels and paper presentations, all of which are recorded and available online. This special issue of the Journal of Technology in Human Services presents a selection of the best papers from the 2017 conference that focus on human services, as well as some of the history behind D4GX.

What is Data Science for Social Good?

At first blush, “data science for social good” could be seen as a pleonasm, redundantly emphasizing that science is good. Certainly, technologists often believe that science is a good in its own right, and that discoveries that advance the state of knowledge also advance the human condition; that is, that deploying technology by itself aids life, no matter what the technology is. This view of science as a good in and of itself rests on a deep faith in the power of free markets and on the libertarian impulse to believe that human discoveries naturally lead to societal advancement.

However, one observation from the machine learning revolution is that some applications of machine learning are more equal than others: the public sector has seen far less of it and lags behind the private sector. In particular, the algorithmic advertising business is so profitable that it has driven the direction and focus of machine learning research for a number of years.

To some degree this is exacerbated by the shortage of qualified data scientists and machine learning scientists. Because of this labor scarcity, such scientists are expensive to employ, and public-sector organizations struggle to hire and retain the staff and strategic consultants needed to transform themselves. On top of this, since industry funds a substantial portion of scientific research, applications that fall outside what the private sector funds receive less scientific attention and become less appealing research targets for academics. Part of the impetus behind “data science for social good” is to focus attention on problems that market forces might not address: problems in government and the public sector, problems affecting people who hold little market power, and problems the market would otherwise ignore.

In addition, there is the emerging question of not just where this technology is being applied, but more broadly how. If the goal is to work in the public domain, it is crucial to ask whether the technology is being applied fairly to everyone it needs to serve. And because machine learning models are opaque, it is equally crucial to investigate how to ensure those models are fair and unbiased, especially when they are employed by the state rather than by a private company.

How is data science different from quantitative methods?

Another important question is what makes data science or machine learning different from statistics and quantitative modeling. Certainly, quantitative modeling was practiced in many disciplines long before the recent explosion of work. So what makes data science different?

One distinction in the new application of data science is that, instead of focusing on analysis, there is an emphasis on directly deploying decision-making models as part of a service. It is no longer simply a question of where to place a new service center, but of the recurring decisions of whom to serve and how to serve them.

Another crucial distinction comes in the types of data being used to make these decisions. In public health, for example, there is a long line of research that uses surveys to understand human behavior. Data science and machine learning methods promise to bring data collected for other purposes, by other methods, to bear on old problems. Typically, these methods are distinguished by a relative lack of specificity to the problem domain and a stronger focus on modeling the data. In that sense, techniques applied in one problem domain may transfer more readily to others (as opposed to traditional quantitative techniques, which are deeply embedded in modeling a particular scientific question).

The Data for Good Exchange

With these general guidelines defining data science for social good, the next question becomes how to make sure the right problems are being addressed within the field: not just problems of academic interest, but those that will make a difference in society. Finding the right problems means bringing people from the data science community, who have the tools and know-how, together with professionals on the front lines who are trying to address real-world problems. There are very few places where these two groups can meet.

The Data for Good Exchange was created to meet this need. The event started in nascent form in 2014 as a partnership with the Association for Computing Machinery’s Special Interest Group on Knowledge Discovery and Data Mining (ACM SIGKDD) as part of KDD 2014, an interdisciplinary conference whose theme that year was “Data Mining for Social Good.” As part of that year’s conference in New York City, there was a full-day workshop, “Unleash Data: Accelerate Impact — KDD at Bloomberg,” that focused specifically on applying data science to improve civic and social outcomes. That workshop became the kernel for the Data for Good Exchange, and in 2015 D4GX held its inaugural meeting as a free-standing conference. Since that inaugural meeting, one of the unique aspects of the conference has been the mix of data scientists from academia and industry and people from the public sector. At the 2017 conference, the audience was 41% corporate, 37% academic and 21% non-profit or government.

Initially, the conference topics were focused on the promise of data science for social good: government services, public health, economic justice, the environment and methods around their application, data collaboration and data science education. Each year, there have been strong keynotes to support these themes. Oliver Wise, who was the head of data analytics in New Orleans, spoke about applying data science methods to government performance and accountability. Other keynotes have supported work on public health: Sarah Tofte from Everytown for Gun Safety, John Kahan of Microsoft on analytics approaches to understanding the mysteries of Sudden Infant Death Syndrome (SIDS) and the broader category of Sudden Unexpected Infant Death (SUID), and Kelly Henning from Bloomberg Philanthropies on the 4-year, $100 million Data for Health initiative. The keynote by Cathy O’Neil in 2016 helped shift the conference to include some of the perils of data science in social applications: bias, privacy intrusion, and opacity, as well as the war on data and fake news.

Of course, some related topics have never appeared at the conference. One is the possibility of the singularity, of truly artificially intelligent machines. Though this is a fascinating question, the actual state of the technology seems quite far from that point, and the topic remains academic rather than practical. Another absent area is autonomous weapons, a potential threat to democracy if a small cohort could build an autonomous army and exert control through mechanical means, without the democratic consent that a standing human army requires. This topic is primarily a political discussion, unrelated to data science methods. The impact of data science on labor and the changing nature of jobs are other crucial questions, but they are primarily matters for political solutions and have therefore been avoided at the conference. (For a related effort, see the Shift Commission on Work, Workers and Technology, a joint project of New America and Bloomberg.)

Another area the conference might explore in the future is safety concerns around the application of data science. While the conference has discussed safety in the context of fairness, human safety (e.g., how to prevent car accidents) is not something the conference has yet examined, though it falls solidly within its focus.

Along with this set of topics, there has also been a focus on efforts to actually build and deploy systems, and this special issue showcases some of the conference papers in that area. One mechanism the conference has supported that has successfully done this is the D4GX Immersion Day program. Through this program, graduate students are air-dropped into non-profit organizations or municipalities for a few days to help non-profit leadership or local government officials develop a data strategy or explore how to use their existing data more effectively. Not only has this model enabled non-profits and governments to execute data science projects cheaply, it has also helped build enduring relationships.

Papers in this issue

Over the past three years (2015–2017), nearly 150 papers have been presented at the Data for Good Exchange. In this special issue, we present nine of the best papers on human services from 2017. Taken together, they give a provocative hint of what data for good could mean.

The first three papers approach traditional public health problems using new methodologies and new data sets. “Hiding in Plain Sight: Insights About Healthcare Trends Gained Through Open Health Data,” by Ravi Rao and Daniel Clarke, looks at data released by the New York Statewide Planning and Research Cooperative System to understand the changing nature of health care needs and costs. “Food for Thought: Analyzing Public Opinion on the Supplemental Nutrition Assistance Program,” by Miriam Chappelka, Jihwan Oh, Dorris Scott and Mizzani Walker-Holmes, draws on social media posts and news articles to understand how a public health intervention is framed and understood by the public. Last, “Data Science: A Powerful Catalyst for Cross-Sector Collaborations to Transform the Future of Global Health — Developing a New Interactive Relational Mapping Tool,” by Barbara Bulc, Cassie Landers, Katherine Driscoll and Jeff Mohr, explores how to approach the Sustainable Development Goals (SDGs). These papers simply could not have been written before open data portals and electronic text repositories existed. Together, they shed a very different light on how to design and implement public health interventions. In particular, social media has become an increasingly important window onto behavior, and using it properly requires both the methods of machine learning and natural language processing and the theoretical framework provided by social science.

Closely related to these core public health papers are those that examine government services and their public health effects. “Predictors of Re-admission for Homeless Families in New York City: The Case of the Win Shelter Network,” by Constantine E. Kontokosta, Boyeong Hong, Awais Malik, Ira Bellach, Xueqi Huang, Kristi Korsberg, Dara Perl and Avikal Somvanshi, creates a unified database of the homeless population served by the Win Shelter Network and explores how machine learning can predict whether these services are being used effectively. “Exploring the Urban-Rural Incarceration Divide: Drivers of Local Jail Incarceration Rates in the U.S.,” by Rachael Weiss Riley, Jacob Kang-Brown, Chris Mulligan, Soumyo Chakraborty, Vinod Valsalam and Christian Henrichson, uses a generalized estimating equations model on more than a decade of county jail records to reveal the economic drivers of local jail incarceration rates.
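To give a concrete sense of that kind of analysis, the following is a minimal sketch, written against the statsmodels Python library, of fitting a generalized estimating equations (GEE) model to a county-year panel. The data set, column names and covariates are hypothetical; this illustrates the general technique, not the authors’ actual specification.

    # Minimal GEE sketch (hypothetical panel data and covariates, not the
    # authors' specification): county jail incarceration rates regressed on
    # local economic indicators, with yearly observations clustered by county.
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    df = pd.read_csv("county_jail_panel.csv")  # hypothetical: county_id, year, jail_rate, ...

    model = smf.gee(
        "jail_rate ~ unemployment + median_income + pct_rural",  # assumed covariates
        groups="county_id",                       # repeated yearly measures within each county
        data=df,
        family=sm.families.Gaussian(),
        cov_struct=sm.cov_struct.Exchangeable(),  # within-county correlation structure
    )
    result = model.fit()
    print(result.summary())

GEE is a natural fit for this setting because each county contributes many correlated yearly observations; the exchangeable covariance structure accounts for that correlation without having to model it explicitly.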

Another category of work looks at how to develop governmental services using machine learning. “Machine Learning for Drug Overdose Surveillance,” by Daniel B. Neill and William Herlands, leverages data from New York City to pinpoint sub-areas that are epicenters of opioid overdoses. “Can Machine Learning Create an Advocate for Foster Youth?” by Meredith Brindley, James Heyes and Darrell Booker, applies machine learning methods to aid youth as they age out of the foster care system. In both cases, the problem of delivering services is an old one; what is new is using algorithmic means to determine the best way to deliver those services. The final paper in this category, “Measuring the Unmeasurable — A Project of Domestic Violence Risk Prediction and Management” by Ya-Yun Chen, Yu-Hsiu Wang, Yi-Shan Hsieh, Jing-Tai Ke, Chia-Kai Liu, Sue-Chuan Chen and T. C. Hsieh, applies machine learning to allocate resources effectively to reduce domestic violence.

As machine learning becomes increasingly ingrained in how government services are delivered, the question arises of whether the technology is being applied fairly. “Themis-ML: A Fairness-aware Machine Learning Interface for End-to-end Discrimination Discovery and Mitigation,” by Niels Bantilan, implements a method for adjusting trained models to conform to a fairness criterion.
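As a rough illustration of what discrimination discovery and mitigation can look like in practice, the sketch below uses synthetic data and a deliberately naive per-group threshold adjustment; it is not the Themis-ML method or API. It measures the gap in positive-decision rates between an unprotected and a protected group, then picks per-group thresholds that shrink that gap.

    # Illustrative fairness sketch on synthetic data; a naive per-group
    # threshold adjustment, not the method or API of Themis-ML.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))
    s = rng.integers(0, 2, size=1000)   # protected attribute: 0 = unprotected, 1 = protected
    y = (X[:, 0] - 0.5 * s + rng.normal(scale=0.5, size=1000) > 0).astype(int)

    features = np.column_stack([X, s])
    scores = LogisticRegression().fit(features, y).predict_proba(features)[:, 1]

    def mean_difference(decision, s):
        # Gap in positive-decision rates between unprotected and protected groups.
        return decision[s == 0].mean() - decision[s == 1].mean()

    # Discovery: measure the gap at a single global threshold.
    print("mean difference before:", round(mean_difference(scores >= 0.5, s), 3))

    # Mitigation (naive): per-group thresholds that equalize positive-decision rates.
    target_rate = (scores >= 0.5).mean()
    thresholds = {g: np.quantile(scores[s == g], 1 - target_rate) for g in (0, 1)}
    adjusted = scores >= np.vectorize(thresholds.get)(s)
    print("mean difference after: ", round(mean_difference(adjusted, s), 3))

Real fairness-aware tooling offers more principled interventions, such as relabelling training data, constrained model fitting or calibrated post-processing, but even this crude adjustment makes the core trade-off visible: shrinking the between-group gap changes who receives the service.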

Discussion

Going forward, it seems likely that issues related to data science and machine learning in the public domain will grow. Just as machine learning has permeated industrial applications, the public sector is also adopting these methods. Adoption is slower in some cases, but the ultimate impact of machine learning in the public sector may be greater than in the private sector. There are few sources of data as rich as what the public sector creates and collects through its normal function, both because of its immense client base and because of the number of people it directly employs. However, there are significant difficulties in adoption, and not for purely technological reasons. The ethical and legal obligations of the public sector are more intricate than those of the private sector, so the application of machine learning has to be performed with greater care. As these issues evolve, the Data for Good Exchange will aim to evolve as well.

As the perils of data science become more significant, ethics has surfaced as a crucial topic at the conference and in industry discussions. Today, there is no general agreement on what ethics for data science means or should be, in contrast to established professions like medicine and law, where codes of ethics exist; medicine, for example, has the Hippocratic Oath, which defines how doctors should approach their work. In service to this idea, Bloomberg, in partnership with Data for Democracy and BrightHive, has undertaken an effort to develop an ethical code for data science, the “Community Principles on Ethical Data Sharing” (CPEDS). The intent is for CPEDS to address the individual data scientist, not the organization, with the hope that individual commitments can effect institutional change from the bottom up. Since the goal is something built at the individual level, governance of the code comes as much as possible from the community, and the code will exist as an evolving document rather than a fixed set of ideas. Ultimately, the strength of the code will derive from the legitimating voice of the community; by integrating a broad swath of the data science community into its creation, there is the best chance of accelerating agreement and driving broad adoption. At present, the way to get involved is through Data for Democracy, this GitHub repository or Data for Democracy’s #p-code-of-ethics Slack channel.

In 2018, the conference will be held again in September. Please refer to www.bloomberg.com/d4gx for more information on past events and how to participate in future ones.

Acknowledgements

Over the past four years, a tremendous number of people have made this event possible. This includes a whole host of people organizing and facilitating it from inside Bloomberg, where the event has been hosted. In addition, we have had a Program Committee of 20–30 people each year who perform peer review of all submitted papers. Of course, the conference has also relied on the participation of many institutions, including Bloomberg Associates, Bloomberg Philanthropies, Facebook, Google, IBM, Microsoft, Twitter and UNICEF. Truly, this special issue would not exist without them.