Data Science Research Grants: Announcing Our Fourth Round of Winners

April 25, 2017

The Bloomberg Data Science Research Grant Program aims to support cutting-edge research in broadly construed data science, including natural language processing, machine learning, and search and ranking, in addition to the creation of, or contributions to, open source software used for data science. In April 2015, we announced our first round of recipients; in October 2015, our second; and in April 2016, our third. Today, we are pleased to announce the winners of our fourth round of grants.

“Bloomberg is proud to sponsor academic research in areas of data science that are relevant to our mission of connecting decision makers to a dynamic network of information, people, and ideas,” said Amanda Stent, NLP researcher in the Office of the CTO at Bloomberg and a member of the grant committee. “Through these grants, we can shine a spotlight on the wealth of data and research problems in the financial analytics universe and contribute to the open sharing of research outcomes such as data and systems. We hope our program will encourage student and faculty researchers in their work on core data science ideas and technologies.”

Out of nearly two hundred applications from faculty members at universities around the world, a committee of Bloomberg’s data scientists from across the organization selected the following eight research projects:

Greg Durrett (University of Texas at Austin)
Combining structured knowledge and big data for coreference resolution
When humans read text, they synthesize and process the information it contains in light of their preexisting background knowledge. Natural language processing systems typically lack this ability to use explicit world knowledge and understand text in a context-dependent way. One manifestation of this is systems’ poor ability to follow the actors through a narrative and track which parts of a text refer to whom, a problem called coreference resolution. Professor Durrett’s work will improve on state-of-the-art systems for coreference resolution by drawing on large-scale unlabeled data and preexisting knowledge bases.
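To make the task concrete, here is a minimal, purely illustrative sketch (not Professor Durrett’s method) of what a coreference resolver produces: clusters that group together every mention of the same entity. The naive baseline below links each pronoun to the nearest preceding name, and its failure on the first pronoun shows why background knowledge matters.

```python
# Toy coreference illustration; mentions are (index, surface text, kind).
MENTIONS = [
    (0, "Ada Lovelace", "NAME"),
    (1, "Charles Babbage", "NAME"),
    (2, "She", "PRONOUN"),   # truly refers to Ada Lovelace
    (3, "his", "PRONOUN"),   # truly refers to Charles Babbage
]

def naive_resolve(mentions):
    """Link each pronoun to the nearest preceding NAME mention."""
    clusters, last_name = {}, None
    for idx, _text, kind in mentions:
        if kind == "NAME":
            clusters[idx] = idx        # a name starts its own cluster
            last_name = idx
        elif last_name is not None:
            clusters[idx] = last_name  # pronoun joins the nearest name
    return clusters

print(naive_resolve(MENTIONS))
# {0: 0, 1: 1, 2: 1, 3: 1} -- "She" is wrongly linked to Babbage;
# resolving it correctly requires the world knowledge the project targets.
```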

Hannaneh Hajishirzi (University of Washington)
Question answering and reasoning in multimodal data
Question answering is an increasingly important adjunct to traditional information retrieval. Recently, there has been growing interest in multimodal question answering: synthesizing answers from data in text, graphics, images, and videos. Professor Hajishirzi will develop a system that “can read a multimodal context along with a multimodal question and reason about the answer, which may also be multimodal in nature.”

Paolo Ferragina (Università di Pisa)
Entity salience via sophisticated syntactic and semantic features
This proposal aims at more accurate determination of entity salience (i.e., which of a set of known entities are most salient to a document). Professor Ferragina will improve his well-known existing entity salience system, SWAT. A public API to SWAT will be released as an outcome of the funded research.
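To illustrate the task itself (and emphatically not SWAT, whose feature set is far richer), a toy salience scorer might combine how often an entity is mentioned with how early it first appears:

```python
# Hypothetical baseline for entity salience: mention frequency discounted
# by how late the entity first appears. Input is a list of entity mentions
# in document order (entity linking is assumed to have been done already).
from collections import Counter

def salience_scores(doc_entities):
    counts = Counter(doc_entities)
    first_pos = {}
    for pos, ent in enumerate(doc_entities):
        first_pos.setdefault(ent, pos)
    n = len(doc_entities)
    return {ent: counts[ent] * (1 - first_pos[ent] / n) for ent in counts}

doc = ["Apple", "Tim Cook", "Apple", "iPhone", "Apple", "Tim Cook"]
print(sorted(salience_scores(doc).items(), key=lambda kv: -kv[1]))
# [('Apple', 3.0), ('Tim Cook', 1.666...), ('iPhone', 0.5)]
```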

Thorsten Joachims (Cornell University)
Counterfactual learning with log data
Log data is one of the most ubiquitous forms of data available, as it can be recorded from a variety of online systems (e.g., search engines, query auto-completion, terminal browsing) at little cost. Professor Joachims proposes to develop well-founded, scalable methods for learning from the partial-information feedback that is ubiquitously available in the form of logged user behavior. Taking a Counterfactual Risk Minimization (CRM) approach, he will develop methods for deep learning with log data, as well as new counterfactual learning methods that exploit the stochasticity in user behavior as a substitute for explicitly randomized control.
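For intuition, the workhorse estimator behind this line of work is inverse propensity scoring (IPS), which evaluates a new policy offline from logs gathered by an old one; CRM builds on such estimators by additionally penalizing their variance. The log entries and the policy below are fabricated purely for illustration.

```python
# IPS off-policy evaluation on a made-up log of
# (context, logged_action, propensity, reward) tuples.
LOG = [
    ("q1", "a", 0.5, 1.0),
    ("q1", "b", 0.5, 0.0),
    ("q2", "a", 0.8, 0.0),
    ("q2", "b", 0.2, 1.0),
]

def new_policy_prob(context, action):
    """Hypothetical new policy: strongly prefers action 'b'."""
    return 0.9 if action == "b" else 0.1

def ips_estimate(log):
    """Unbiased estimate of the new policy's expected reward: each logged
    reward is reweighted by how much more (or less) likely the new policy
    is to take the logged action than the logging policy was."""
    return sum(r * new_policy_prob(x, a) / p for x, a, p, r in log) / len(log)

print(ips_estimate(LOG))  # 1.175 on this toy log
```

That the estimate exceeds the maximum possible reward of 1.0 on this tiny log is exactly the high-variance behavior that CRM’s variance regularization is designed to control.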

Mark Steedman (University of Edinburgh)
Learning hidden semantics by machine reading using entailment graphs
Existing information extraction systems have limited ability to reason about entailment (i.e., to make common-sense inferences about the states that necessarily follow from events and actions described in text). Professor Steedman will address this problem by building “a combined distributional and logical operator-based semantics automatically, by machine reading large amounts of text with an existing parser, inducing a hidden semantics of paraphrase and entailment that supports commonsense reasoning.”
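As a toy illustration of the target data structure (not the learned semantics the project will induce), an entailment graph links predicates with directed “entails” edges, and simple reachability then yields multi-step common-sense inferences:

```python
# Hypothetical hand-written entailment graph: an edge p -> q means
# "p entails q", e.g. acquiring something entails buying it.
ENTAILS = {
    "acquire": {"buy"},
    "buy":     {"own"},
    "own":     {"control"},
}

def entailed_by(pred):
    """All predicates reachable from `pred` via entailment edges."""
    seen, stack = set(), [pred]
    while stack:
        for q in ENTAILS.get(stack.pop(), ()):
            if q not in seen:
                seen.add(q)
                stack.append(q)
    return seen

print(entailed_by("acquire"))  # {'buy', 'own', 'control'}
```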

Maarten de Rijke (University of Amsterdam)
Deep explanation learning for knowledge graph relations
Knowledge graphs capture structured information valuable for decision making. It is important for practical decision-making systems to explain their reasoning. Professor de Rijke will build on his previous successful research with Bloomberg, developing a system to construct short textual explanations for entity relationships from knowledge graphs to support intelligent decision making.

Simon Preston, Karthik Bharath and Yves van Gennip (University of Nottingham); Michaela Mahlberg (University of Birmingham)
Dynamic word embeddings – and applications in analysis of real-world discourses
Our ability to understand natural language text has increased due to innovations such as neural embeddings of words, phrases, and entities that capture the semantics of the construct in question. However, when constructing neural embeddings, we often do not model the fact that semantics evolve over time. For example, consider answering the question: “How did the relationship between smoking and cancer evolve over the last century?” This work, proposed by Professor Preston et al., will focus on explicitly modeling time when constructing embeddings. Furthermore, their work will develop methodology to determine statistically significant trends in the dynamic embeddings.
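One common recipe for time-aware embeddings, sketched here only to illustrate the general idea rather than the grantees’ proposed method, trains a separate embedding matrix per time slice and rotates consecutive slices onto a common coordinate system with an orthogonal Procrustes alignment, so that a word’s trajectory across decades becomes comparable:

```python
# Orthogonal Procrustes alignment of per-decade embedding matrices.
# Rows are a shared vocabulary; the data here is synthetic.
import numpy as np

def procrustes_align(X_old, X_new):
    """Find the orthogonal R minimizing ||X_new @ R - X_old||_F."""
    U, _, Vt = np.linalg.svd(X_new.T @ X_old)
    return X_new @ (U @ Vt)

rng = np.random.default_rng(0)
X_1950 = rng.normal(size=(1000, 100))  # stand-in for embeddings from 1950s text
Q = np.linalg.qr(rng.normal(size=(100, 100)))[0]  # a random rotation
X_2000 = X_1950 @ Q                    # "2000s" embeddings in rotated coordinates
aligned = procrustes_align(X_1950, X_2000)
print(np.allclose(aligned, X_1950, atol=1e-6))  # True: the rotation is undone
```

After alignment, the distance between a word’s 1950 and 2000 vectors becomes a meaningful signal of semantic change.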

Alexander Rush (Harvard University)
Coarse-to-fine neural attention and generation with applications to document analysis
Neural sequence-to-sequence models have broken new ground in machine translation, visual question answering, and more recently, document analysis and summarization. However, current methods do not scale well to very long input “sequences,” such as a multi-page document or a high-resolution image. Professor Rush proposes using a technique known as coarse-to-fine pruning to create fast, memory-efficient sequence-to-sequence models.
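For intuition, the coarse-to-fine idea can be sketched as follows (an illustrative simplification, not Professor Rush’s exact model): a cheap coarse pass scores fixed-size chunks of the input, and the expensive token-level attention runs only inside the top-k surviving chunks.

```python
# Coarse-to-fine attention sketch over a long token sequence.
import numpy as np

def coarse_to_fine_attention(query, tokens, chunk=64, k=2):
    """query: (d,); tokens: (n, d). Returns a context vector of shape (d,)."""
    n, d = tokens.shape
    chunks = tokens[: n - n % chunk].reshape(-1, chunk, d)
    summaries = chunks.mean(axis=1)             # cheap coarse representation
    coarse_scores = summaries @ query           # one score per chunk
    keep = np.argsort(coarse_scores)[-k:]       # prune all but the top-k chunks
    fine_tokens = chunks[keep].reshape(-1, d)   # k*chunk tokens instead of n
    scores = fine_tokens @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over the survivors
    return weights @ fine_tokens

rng = np.random.default_rng(0)
ctx = coarse_to_fine_attention(rng.normal(size=16), rng.normal(size=(4096, 16)))
print(ctx.shape)  # (16,): fine attention touched 128 tokens, not 4096
```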

“The eight projects chosen for funding in 2017 through our Bloomberg Data Science Research Grant Program were selected from a very competitive pool of proposals,” said senior NLP researcher Prabhanjan (Anju) Kambadur, a member of the grant committee. “Each of these projects – whether about search, question answering, text analytics, or machine learning – involves a strong data science component and a commitment by the principal investigators to release data or systems to the research community. Most importantly, all will break new scientific ground.”

The application deadline for Round 5 will be announced before the end of 2017.