Since 2015, Bloomberg has supported academic research in data science, broadly construed to include natural language processing (NLP), information retrieval, machine learning, and data mining, through its annual Data Science Research Grant Program (learn more about prior grant recipients and their research).
Today, we are pleased to announce the winners of our sixth round of grants.
Out of hundreds of applications from university faculty members, a committee of Bloomberg’s data scientists from across the organization chose to fund the following six research projects:
Marti A. Hearst (University of California, Berkeley)
Unsupervised Abstractive News Summarization
Summarization is a critical problem for a wide variety of applications, including news. To date, most automated summarization algorithms have been extractive, meaning they extract sentences from the original document to create a summary. However, humans usually write abstractive summaries: they create novel sentences. With the advent of deep learning, automated abstractive approaches have only recently come to the fore. Professor Hearst and PhD student Philippe Laban have recently contributed to this research stream, developing a new Transformer-based approach to abstractive summarization that includes a conceptually simple keyword-coverage algorithm and a method for generating two different summary formats. Their research will explore a further idea: moving abstractive summarization from supervised to unsupervised learning, using a new architecture that leverages the output of a supervised model to bootstrap an unsupervised, reinforcement-learning-based approach.
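To make the extractive/abstractive distinction concrete, here is a minimal sketch of an extractive summarizer. It is a generic word-frequency baseline written for illustration only, not the approach used by Professor Hearst's group; the function name `extractive_summary` and its scoring rule are assumptions.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, copied verbatim from `text`.

    Illustrative baseline only: a sentence's score is the average
    document-wide frequency of its words.
    """
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Document-wide word counts (lowercased).
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Pick the top-scoring sentences, then restore document order.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return [sentences[i] for i in chosen]


if __name__ == "__main__":
    article = ("Markets rose today. Markets rose as tech stocks rallied. "
               "The weather was mild.")
    for sentence in extractive_summary(article, num_sentences=2):
        print(sentence)
```

Every sentence in the output is lifted unchanged from the input, which is the defining property of an extractive method. An abstractive system would instead generate new sentences that need not appear anywhere in the source.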
Maria-Florina Balcan (Carnegie Mellon University)
Data-Driven Transfer Clustering
Clustering is a fundamental problem in data science, used in a myriad of scientific and business applications. Despite significant research in different fields, clustering remains a major challenge. In many real-world applications, it is often unclear what similarity measure or objective function to use to identify a good clustering for the given data. Even when this is known, optimally solving the underlying combinatorial clustering problems is typically intractable. Motivated by the fact that many important applications require solving several related clustering problems, Professor Balcan proposes a new data-driven approach to address these challenges. Building on her past work on unsupervised learning and data-driven algorithm design, she aims to design scalable and data-efficient meta-learning procedures with provable guarantees that will produce fast, accurate clustering algorithms for the domain at hand, providing a new general tool for data science. She will also test these algorithms on large-scale clustering tasks, including image and NLP data.
Stefano Ermon (Stanford University)
Differentiable Ranking Losses
Sorting input objects is an important step in many machine learning pipelines (e.g., ranking objects for information retrieval). However, the sorting operator is non-differentiable with respect to its inputs, which prohibits end-to-end gradient-based optimization of losses involving rankings. NeuralSort, a general-purpose continuous relaxation of the sorting operator recently introduced by Professor Ermon’s group, permits direct gradient-based optimization of any computational graph involving a sorting operation. He proposes exploring ways to apply this new technique to information retrieval. Specifically, his goals are to 1) determine to what extent this approach can improve performance with end-to-end optimization objectives, and 2) evaluate the quality of representations learned using ranking supervision. If successful, this could lead to improvements in information retrieval tasks, question answering, and unsupervised representation and manifold learning methods.
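For readers curious about what a continuous relaxation of sorting looks like, below is a minimal pure-Python sketch of the NeuralSort relaxation as described in the published NeuralSort paper: row i of a relaxed permutation matrix is a softmax over ((n + 1 - 2i) * s - A·1) / tau, where A holds the pairwise absolute differences of the scores s. The function name `neuralsort` and the default temperature are illustrative assumptions, and this sketch computes only the forward pass. In practice the same expression would be written in an automatic-differentiation framework to obtain gradients.

```python
import math


def neuralsort(scores, tau=1.0):
    """Relaxed permutation matrix that sorts `scores` in descending order.

    Each row is a probability distribution over input positions. As tau
    approaches 0, row i concentrates on the index of the i-th largest score.
    """
    n = len(scores)
    # (A 1)_j: sum of absolute differences between score j and all scores.
    abs_diff_sums = [sum(abs(sj - sk) for sk in scores) for sj in scores]
    rows = []
    for i in range(n):
        # With 0-based i this equals n + 1 - 2 * (i + 1) from the 1-based formula.
        scale = n - 1 - 2 * i
        logits = [(scale * sj - dj) / tau
                  for sj, dj in zip(scores, abs_diff_sums)]
        # Numerically stable softmax over the row.
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


if __name__ == "__main__":
    s = [0.3, 2.0, -1.0]
    p_hat = neuralsort(s, tau=0.01)
    # Multiplying the relaxed permutation by the scores gives a "soft sort".
    soft_sorted = [sum(p * v for p, v in zip(row, s)) for row in p_hat]
    print(soft_sorted)
```

Because every entry of the matrix is a smooth function of the scores, a loss defined on the soft-sorted output can be optimized with gradient descent. The temperature `tau` trades off fidelity to the true sort against smoothness of the gradients.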
Walter Lasecki and Jonathan Kummerfeld (University of Michigan)
An Adaptive Crowdsourcing System for Real-Time Domain Adaptation
In time-sensitive settings, where there is a need to extract information from unstructured and semi-structured text, current approaches to training often fall short: manual annotation can take too long, and machine learning approaches make too many errors. Professor Lasecki and Dr. Kummerfeld propose an adaptive system that uses a combination of human workers and machine learning systems to select the best way for workers to efficiently annotate documents given what the system has learned so far. This work will combine crowdsourcing and natural language processing to let people provide guidance that helps machine learning models adapt their structural knowledge of previously seen documents to new ones.
Eduardo Blanco (University of North Texas)
Extracting Spatial Timelines from Text
Spatial timelines capture where individuals have and have not been located over time. Professor Blanco is proposing to extract spatial timelines from news articles, biographies and tweets to create the first corpus annotated with spatial timelines – and to define computational models to extract them. This project has the potential to improve named entity linking and opens the door to applications such as eyewitness verification.
Jeff Dalton (University of Glasgow)
A Multi-task Model for Information Extraction and Entity-Centric Ranking Tasks
Dr. Dalton’s proposal is to develop new models for entity-centric information extraction and retrieval. The focus is on multi-task deep learning models for topic-specific extraction and ranking over heterogeneous text collections trained using existing knowledge resources and weak supervision. The ultimate goal is to retrieve or be alerted to information, including text content, entities, or key facts that are relevant to an information analyst’s goals.
“These six proposals were selected by the grant committee as they align greatly with key research and business areas at Bloomberg,” said Dr. Christoph Kofler, a member of the grant committee from the Office of the CTO’s Data Science team. “We are excited to provide support for each of these projects, which have the potential to deliver high impact in the scientific communities targeting machine learning, natural language processing, information retrieval and crowdsourcing/data annotation, and we look forward to building strong relationships with these respected researchers.”