Two themes surfaced in almost every session at the 2017 Data for Good Exchange (D4GX), held at Bloomberg headquarters on Sunday, September 24. One was the need to build trust between data scientists and the people whose data they hold. Another was the need to address the bias that often appears in data and algorithms.
“When data scientists are entrusted with the most private and valuable data out there, the data science community must work to deserve the trust of those whose data we are holding,” says Gideon Mann, Bloomberg’s head of data science.
A new initiative from Bloomberg, BrightHive, and Data for Democracy aims to help address these issues, along with other ethical dilemmas that arise when working with data. Called the “Community Principles on Ethical Data Sharing” (CPEDS), the project is designed to provide guidelines for data sharing and collaboration among data scientists.
“We want data scientists to be able to work together with each other to answer questions such as how we keep data safe and minimize bias,” said Lilian Huang, Data Ethics Lead at Data for Democracy.
“This is much like physicians have a Hippocratic Oath,” says Natalie Evans Harris, the COO and vice president of ecosystem development at BrightHive and a former senior policy advisor to the U.S. Chief Technology Officer. “We will need something like that as we professionalize as a community.”
Over the next six to nine months, the partners plan to take a community-first approach, one that reflects the diverse interests of the data science community, to defining values and priorities for ethical behavior by data scientists.
During the conference’s keynote session, Harris and Huang invited attendees to participate – both in person and online – in a series of ongoing discussions devoted to developing a code of ethics. Harris asked for the audience’s input at an in-person workshop held during the conference, as well as on Twitter, Slack, and GitHub. “It doesn’t matter if we do this if the people in this room don’t buy into it,” she said.
So far, more than 2,000 data scientists have participated in the discussion, weighing in on the challenges of sharing data and developing trust in the data science community. The conversation has unearthed five particular areas of concern:
- The data itself – Includes overall practices in collecting, storing, and distributing data, as well as understanding and minimizing intrinsic bias in data.
- Questions and problems – Includes identifying relevant problems to work on, collaborating with the people already active in those fields, and building on their existing work. Huang described this, in part, as, “How can you make sure you’re not reinventing the wheel every time?”
- Algorithms and models – Includes understanding and minimizing bias in algorithms and models, and working responsibly with black-box algorithms.
- Technological products and applications – Includes taking responsibility for how one’s research is used. “How do you guard against the potential for misuse,” asked Huang, “even if the research itself seems neutral?”
- Community building – Includes fostering a data science community that is inclusive and deliberately promotes equity and representation, as well as finding ethical, non-invasive ways to track progress toward this goal.
Huang said even these topics of discussion remain open to revision and addition. The group intends to conduct a larger survey of the data science community, she said, with the hope of collecting input from more than 100,000 data scientists. “If you don’t buy into this, we want to know early, so we can change, vector, or quit,” says Harris. “And I’m not so good with quitting.”