Natural Language Processing & Machine Learning at Bloomberg

Across the Bloomberg Terminal, natural language processing and machine learning play a central role.

Throughout its history, Bloomberg has relied on text as a key underlying source of data for our clients. Over the past decade, we have increased our investment in statistical natural language processing (NLP) techniques that extend our capabilities. Our engineering teams have built state-of-the-art NLP technology for core document understanding, recommendation, and customer-facing systems.

At the heart of our NLP program is technology that extracts structured information from documents, sometimes known as digitization or normalization. Underpinning this program is a proprietary, robust real-time NLP library that performs low-level text processing tasks such as tokenization, chunking, and parsing. On top of this core tool set, we have built named entity extractors that detect people, companies, tickers, and organizations in natural text; these are deployed across our news and social text databases. The named entity extractors are crucial for our derived sentiment indicators (BSV<GO> and TREN<GO>), which estimate how positive a piece of news is for a particular company. Beyond that, our topic classification engine (e.g., NI OIL<GO>) automatically tags documents with normalized topics to make retrieval and monitoring straightforward. In the legal domain, we have built a legal principles engine that enables lawyers to uncover the underlying case law argumentation that supports a particular decision.
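Our production library is proprietary, but the flavor of this pipeline can be sketched with open-source tools. The snippet below is a minimal illustration rather than our actual system: it uses the spaCy library to tokenize a headline, pull out named entities, and attach a toy topic tag. The model name, keyword list, and topic codes are assumptions made only for the example.

    # Minimal sketch of a tokenize -> NER -> topic-tag pipeline using the
    # open-source spaCy library. Illustrative only; Bloomberg's real-time
    # NLP library is proprietary and not shown here.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # small English model (assumed installed)

    headline = "Exxon Mobil shares rise after OPEC agrees to cut oil output."
    doc = nlp(headline)

    # Tokenization and low-level analysis come from the spaCy pipeline.
    tokens = [token.text for token in doc]

    # Named entities: people, companies/organizations, and so on.
    entities = [(ent.text, ent.label_) for ent in doc.ents]

    # A toy topic tagger: map keywords to normalized topic codes
    # (the codes here are hypothetical, not Bloomberg's taxonomy).
    TOPIC_KEYWORDS = {"oil": "NI OIL", "opec": "NI OPEC"}
    topics = {code for word, code in TOPIC_KEYWORDS.items()
              if word in headline.lower()}

    print(tokens)
    print(entities)  # e.g. [('Exxon Mobil', 'ORG'), ('OPEC', 'ORG')]
    print(topics)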

Beyond these core functions, we have built sophisticated fact extractors (also known as relationship extractors) that pick out specific information from documents in order to ease our ingestion flow. We have also built out a large suite of tools for structured data. Among these are table detection and segmentation tools that enable our analysts to increase the scope of the data they ingest. Additionally, we have built research systems for figure understanding that extract the underlying data from scatter plots. We have also built tools for our reporters that let them create self-service topic streams to find pieces of news about the companies or sectors they are responsible for covering.
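As a simplified illustration of fact (relationship) extraction, the sketch below pulls an acquirer, target, and price out of a sentence with a single regular expression and returns them as a structured record. Our deployed extractors are statistical and far more robust; the pattern and example sentence here are purely illustrative.

    # Toy fact (relationship) extractor: turn a free-text sentence into a
    # structured acquisition record. Illustrative only.
    import re
    from typing import NamedTuple, Optional

    class Acquisition(NamedTuple):
        acquirer: str
        target: str
        price: str

    PATTERN = re.compile(
        r"(?P<acquirer>[A-Z][\w.&' ]+?) (?:acquired|agreed to buy) "
        r"(?P<target>[A-Z][\w.&' ]+?) for (?P<price>\$[\d.]+ ?(?:billion|million))"
    )

    def extract_acquisition(sentence: str) -> Optional[Acquisition]:
        """Return a structured record if the sentence matches, else None."""
        match = PATTERN.search(sentence)
        if match is None:
            return None
        return Acquisition(match["acquirer"].strip(),
                           match["target"].strip(),
                           match["price"])

    print(extract_acquisition(
        "Amazon agreed to buy Whole Foods for $13.7 billion in June 2017."))
    # Acquisition(acquirer='Amazon', target='Whole Foods', price='$13.7 billion')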

All of these core NLP tools stay strictly within the domain of text, but we have also built significant functionality that connects text to other artifacts, such as people or stock tickers. Our market moving news indicators (MMN<GO>) automatically detect and tag crucially important news headlines. We have a robustly deployed related stories function that surfaces additional relevant information to readers while they are reading a story.
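The related stories idea can be sketched with a plain text-similarity baseline: represent each story as a TF-IDF vector and surface its nearest neighbours. The snippet below, using scikit-learn, is only a toy baseline under that assumption; the production system combines many more signals, such as entities, topics, and recency.

    # Sketch of a "related stories" ranker: TF-IDF features plus cosine
    # similarity to find the nearest neighbours of the story being read.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    stories = [
        "Oil prices climb as OPEC extends production cuts",
        "Tech shares slide after disappointing earnings reports",
        "Crude rally continues on supply concerns and OPEC deal",
        "Central bank holds interest rates steady amid low inflation",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(stories)

    def related(story_index, top_k=2):
        """Return the top_k stories most similar to stories[story_index]."""
        sims = cosine_similarity(matrix[story_index], matrix).ravel()
        sims[story_index] = -1.0  # exclude the story itself
        best = sims.argsort()[::-1][:top_k]
        return [stories[i] for i in best]

    print(related(0))  # the other OPEC/crude story should rank first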

Finally, we have invested heavily in tools that simplify client interaction. Our search system (HL<GO>) is highly sophisticated, with state-of-the-art ranking and query understanding. Furthermore, we’ve built a natural language query interface (e.g., ‘What is IBM’s market cap<Search>’) that lets people ask questions in plain English and get precise answers. This search functionality is deployed across many document collections, but our news search and ranking (NSE<GO>) receives particular attention. For our internal help system, we have automatic routing that directs incoming queries to the appropriate internal experts. We have also built automatic answering capabilities that detect and answer frequently occurring client inquiries.
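To illustrate the query-understanding step behind such an interface, the toy parser below maps a question like ‘What is IBM’s market cap?’ onto a (ticker, field) lookup. The field mnemonics and data values are placeholders invented for the example, not our actual data model or figures.

    # Toy natural-language query parser: plain-English question -> lookup.
    # All reference data below is placeholder content for the sketch.
    import re

    FIELDS = {"market cap": "CUR_MKT_CAP", "pe ratio": "PE_RATIO"}
    DATA = {("IBM", "CUR_MKT_CAP"): "130B", ("IBM", "PE_RATIO"): "22.1"}

    QUESTION = re.compile(
        r"what is (?P<entity>[A-Za-z.]+)'s (?P<field>[a-z ]+?)\??$",
        re.IGNORECASE)

    def answer(question):
        """Parse the question and look up the requested field, if known."""
        match = QUESTION.search(question.strip())
        if not match:
            return "Sorry, I did not understand the question."
        ticker = match["entity"].upper()
        field = FIELDS.get(match["field"].lower().strip())
        value = DATA.get((ticker, field))
        return f"{ticker} {field}: {value}" if value else "No data found."

    print(answer("What is IBM's market cap?"))  # -> IBM CUR_MKT_CAP: 130B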

From a staffing perspective, we have multiple natural language processing and machine learning experts, including former professors and graduates of top programs. As we build out our team, we are also building out the infrastructure that supports them, such as a large GPU cluster that speeds up the deep learning/neural network models that increasingly make up a large part of our deployed technology. Every year, we publish papers at top academic conferences; recently, our team has published at ACL, SIGIR, ICML, ECML-PKDD, and more. Over the past decade, our NLP and ML teams have grown into a formidable force, and we anticipate the next decade will see them develop even further.

Select Recent Papers

We contribute back to academia whenever we can by attending and speaking at conferences in ML, NLP, and IR, awarding the Bloomberg Data Science Research Grant, hosting the Bloomberg Data Science Ph.D. Fellows (new in 2018), and serving as committee members for conferences. Here are some of the recent papers we have published at peer-reviewed conferences or in journals:

2018

Learning Better Name Translation for Cross-Lingual Wikification. Chen-Tse Tsai and Dan Roth. AAAI-18.

Estimating the Cardinality of Conjunctive Queries over RDF Data Using Graph Summarisation. Giorgio Stefanoni, Boris Motik, Egor V. Kostylev. WWW 2018.

Never-Ending Learning for Open-Domain Question Answering over Knowledge Bases. Abdalghani Abujabal, Rishiraj Saha Roy, Mohamed Yahya (Bloomberg), Gerhard Weikum. WWW 2018.

Key2Vec: Automated Ranked Keyphrase Extraction from Scientific Articles Using Phrase Embeddings. Debanjan Mahata (Bloomberg), John Kuriakose, Rajiv Ratn Shah, Roger Zimmermann. NAACL-HLT 2018.

Collective Entity Disambiguation with Structured Gradient Tree Boosting. Yi Yang, Ozan Irsoy, Shefaet Rahman. NAACL-HLT 2018.

Weakly-supervised Contextualization of Knowledge Graph Facts. Nikos Voskarides, Edgar Meij, Ridho Reinanda, Abhinav Khaitan, Miles Osborne, Giorgio Stefanoni, Prabhanjan Kambadur and Maarten de Rijke. SIGIR 2018.

2017

Generating Descriptions of Entity Relationships. Nikos Voskarides, Edgar Meij, and Maarten de Rijke. European Conference on Information Retrieval (ECIR) 2017.

Adaptive Submodular Ranking. Anju Kambadur and Fatemeh Navidi with Viswanath Nagarajan. Integer Programming and Combinatorial Optimization (IPCO) 2017.

Faster Greedy MAP Inference for Determinantal Point Processes. Anju Kambadur with Insu Han, Kyoungsoo Park, Jinwoo Shin. International Conference on Machine Learning (ICML) 2017.

Civil Asset Forfeiture: A Judicial Perspective. Leslie Barrett, Alexandra Ortan, Ryon Smey, Michael W. Sherman, Zefu Lu, Wayne Krug, Roberto Martin, Anu Pradhan, Trent Wenzel, Alexander Sherman, Karin D. Martin. Data for Good Exchange 2017.

Camera Based Two Factor Authentication Through Mobile and Wearable Devices. Mozhgan Azimpourkivi, Umut Topkara (Bloomberg), Bogdan Carbunar. UbiComp 2017.