Q&A about data science code of ethics with Lilian Huang of Data for Democracy

A new initiative from Bloomberg, BrightHive, and Data for Democracy is looking to create shared values on the ethics of data science. In September, at Bloomberg’s 2017 Data for Good Exchange (D4GX), the group announced a project called the “Community Principles on Ethical Data Sharing” (or CPEDS), which is designed to develop a set of guidelines for data scientists related to data sharing and collaboration on data-driven projects. Tech@Bloomberg sat down with Lilian Huang, Data Ethics Lead at Data for Democracy, to get more detail on this important initiative.

What is Data for Democracy and why is it trying to create a data science code of ethics?
Data for Democracy is a grassroots technology collective with more than 2,000 volunteers from across the globe. We started out as an online community in December 2016, and now have multiple local chapters based in major cities. Our volunteers are data scientists, technologists, and activists who partner with civic organizations and carry out open-source research, analysis, and software development. Our goal is to explore and enhance the relationship between tech, government and society.

Helping to create a data science code of ethics fits naturally into our mission, because it provides an opportunity to consider and articulate how data scientists can produce work that benefits society.

Why does the industry need a data science code of ethics?
Data science isn’t so much a single industry as it is an approach or methodological framework that’s applied across multiple domains and industries – policy, finance, medicine, marketing, and many more. As the data science “approach” becomes increasingly influential in more sectors, there’s also a heavy emphasis on collecting and storing data, and applying it for commercial and non-profit purposes. This raises a lot of thorny issues, such as the safe storage of sensitive personal data, or the development of biased algorithms that target or exclude individuals. As these ethical questions grow more pressing, it makes sense to outline a set of baseline responsibilities and ethical obligations that data scientists should consent to upholding – or at least, some values and priorities that they can consider – when trying to maximize the benefit and minimize the harm from their work.

What does it mean for data scientists and companies to share data responsibly?
Overall, there’s a need for safe and ethical practices regarding how data is collected, stored, and distributed – these include preventing data breaches and protecting individuals’ rights over how their data is collected and used. Some of the more specific points raised in our preliminary community scan include: ensuring transparency around what data is collected and how it will be used; ensuring clear provenance, so that data scientists always know where their datasets come from; ensuring equitable and ethical access to data once it’s collected, so that a few large companies don’t simply buy up data and gain an advantage; ensuring the quality of collected data; and holding sources accountable for low-quality or actively misleading data.

What steps have been taken so far?
We’ve recently completed a preliminary scan conducted largely within the Data for Democracy community – as mentioned, this is a group of over 2,000 people connected on Slack and Twitter. We posed discussion questions through those channels and community members responded. From these responses, we identified recurring themes that community members consider important, along with some concrete examples, and arranged them in a systematic framework. The key areas of concern we identified are issues regarding: the data itself; the questions and problems to work on; the algorithms and models being used; the technological products and applications created from research; and the data science community and culture. Each of these is a broad area with specific sub-topics.

Our community discussions also helped us identify practical barriers to implementing a code of ethics. One major challenge is defining what a data scientist even is – as mentioned earlier, data science is applied in so many fields and organizations that have fundamentally different priorities, so there’s a question of how to ensure any degree of standardization across the board.

How does the approach you’re taking differ from other efforts to create a data science code of ethics?
We’re taking a community-first approach – we’re choosing to host discussions among “data scientists,” which we’re broadly defining to include students, civic technologists, professionals working for data-oriented firms, and so on. We want this code of ethics to emerge as something created “by data scientists, for data scientists,” drawing on community input regarding their values and concerns.

The resulting code of ethics should better capture the diverse spectrum of interests across the data science community, and be more responsive to the various needs and priorities at play. We also hope that the more we work to incorporate community feedback from early on, the more receptive the community will be to eventually adopting the completed code of ethics. Other efforts to develop a code of ethics have been more tightly defined in terms of scope – for example, addressing research statisticians in particular – and we’re hoping that our code of ethics will pertain to a broader swath of people who interact with data. Of course, the increased scale means that it’s even more challenging to take all these general concerns and distill them into concrete, achievable targets, but we think that’s what’s most exciting about this initiative.

What do you hope to achieve?
So far, people have been enthusiastic about the idea of having a data science code of ethics, though some are rightfully skeptical about the logistics of implementation, due to the issues previously mentioned. But it’s clear that many people think this discussion – of the ethics surrounding data science – is relevant and increasingly vital as the field grows. Our proposed code of ethics may only achieve limited buy-in and adoption – and certainly, no single document can be all things to such a large and diverse community. But, if we can at least stimulate further discussion, get some people thinking about ethical concerns, and provide a platform to amplify the voices of people who have already been working on these questions for a while, we will have achieved something.

What comes next and how does the Data for Good Exchange community play a role?
We hope to draw upon what we’ve learned in this preliminary scan to reach out to the greater data science community and seek input from around 100,000 data scientists. From this larger and more diverse pool of opinions, we’ll prepare a first draft of the actual code of ethics, hopefully to be completed by May or June 2018. We want to keep the process transparent and open to feedback throughout, as much as possible. We also hope to create a platform to share useful tools and techniques that will allow people to implement best practices – ideally, we don’t just want to produce a manifesto; we also want to give people the tools they need to make these ideas actionable. The Data for Good Exchange community will be very valuable in helping us spread the word about this initiative, and will hopefully draw more people to participate in our discussions.

What advice would you give to a data scientist or organization looking to leverage data to address any kind of social good?
Our community has brought up a lot of useful points for consideration, but one highlight is to be thoughtful in identifying problems to work on – make sure that your questions and problems are clearly defined, and that they really are of value and relevance to someone. Also, be sure to link up with pre-existing resources and communicate with parties who are already working in that area. When it comes to issues of “social good,” there’s bound to already be some civic organization or non-profit working on the problem, and they can tell you a lot about the challenges on the ground and the real needs of the people you’re trying to help. It’s best to draw on this experience and non-technical knowledge, rather than trying to reinvent the wheel or assuming that an exciting new algorithm can solve all problems. Data for Democracy has made a lot of progress through collaborating with partner organizations, who let us know where and how our skillsets will be of most use, so we can make sure that the work done is the work that actually needs to be done.

Any final thoughts?
This is an ambitious effort. With the rate at which the data science field is growing and evolving, the specific guidelines we come up with will probably be out of date in a couple of years, if not sooner. But, as long as we’re able to promote a continuing discussion that people can build on in future, I think we will have accomplished something meaningful.