Throughout the life of the company, Bloomberg has always relied on text as a key underlying source of data for our clients. Over the past decade, we have increased our investment in statistical natural language processing (NLP) techniques that extend our capabilities. Our engineering teams have built state-of-the-art NLP technology for core document understanding, recommendation, and customer-facing systems.
At the heart of our NLP program is technology that extracts structured information from documents, sometimes known as digitization or normalization. Its foundation is a proprietary, robust real-time NLP library that performs low-level text resolution tasks such as tokenization, chunking and parsing. On top of this core tool set, we have built named entity extractors that detect people, companies, tickers and organizations in natural text; these extractors are deployed across our news and social text databases. They are crucial for enabling our sentiment analysis indicators (BSV<GO> and TREN<GO>), which estimate how positive a piece of news is for a particular company. Beyond that, our topic classification engine (e.g., NI OIL<GO>) automatically tags documents with normalized topics to make retrieval and monitoring straightforward. In the legal domain, we have built a legal principles engine that enables lawyers to uncover the underlying case law argumentation that supports a particular decision.
Beyond these core functions, we have built sophisticated fact extractors (or relationship extractors) that pick out specific information from documents in order to ease our ingestion flow. We have also built out a large suite of tools for structured data. These include table detection and segmentation tools that enable our analysts to increase the scope of the data they ingest. Additionally, we have built research systems for figure understanding that extract the underlying data from scatter plots. We have also built tools for our reporters that allow them to create self-service topic streams to find pieces of news about the companies or sectors they are responsible for covering.
All of these core NLP tools stay strictly within the domain of text, but we have also built out significant functionality that connects text to other artifacts: either people or stock tickers. Our market-moving news indicators (MMN<GO>) automatically detect and tag news headlines that are crucially important. We have also deployed a robust related stories function that highlights additional relevant information for people while they are reading stories.
Finally, we have invested heavily in tools that simplify client interaction. Our search system (HL<GO>) is very sophisticated, with state-of-the-art ranking and query understanding. Furthermore, we’ve built a natural language query interface (e.g., ‘What is IBM’s market cap<Search>’) where people can ask questions in plain English and get precise answers. This search functionality is deployed across many document collections, with our news search and ranking (NSE<GO>) receiving particular attention. For our internal help system, we have automatic routing systems that direct incoming queries to the appropriate internal experts. We have also built automatic answering capabilities that can detect and answer frequently occurring client inquiries.
From a staffing perspective, we have multiple natural language processing and machine learning experts, including former professors and graduates from the best programs. As we build out our team, we are also building out the infrastructure that supports it, such as a large GPU cluster to speed up the deep learning/neural network models that increasingly make up a large part of our deployed technology. Every year, we publish papers at top academic conferences; recently, our team has published papers at ACL, SIGIR, ICML, ECML-PKDD, and more. Over the past decade, our NLP and ML teams have grown into a formidable force and we anticipate the next decade will see them develop even further.
Select Recent Papers
Bloomberg contributes back to academia whenever we can by attending and speaking at conferences in ML, NLP, and IR, handing out the Bloomberg Data Science Research Grant, hosting the Bloomberg Data Science Ph.D. Fellows (new in 2018), and serving as committee members for conferences. Here are some of the recent papers we published at peer-reviewed conferences or in journals:
Harnessing GANs for Zero-Shot Learning of New Classes in Visual Speech Recognition. Yaman Kumar, Dhruva Sahrawat, Shubham Maheshwari, Debanjan Mahata, Amanda Stent, Yifang Yin, Rajiv Ratn Shah and Roger Zimmermann. AAAI 2020.
Keyphrase Generation for Scientific Articles using GANs. Avinash Swaminathan, Raj Kuwar Gupta, Haimin (Raymond) Zhang, Debanjan Mahata, Rakesh Gosangi, Rajiv Ratn Shah. AAAI 2020.
Keyphrase Extraction from Scholarly Articles as Sequence Labeling using Contextualized Embeddings. Dhruva Sahrawat, Debanjan Mahata, Haimin (Raymond) Zhang, Mayank Kulkarni, Agniv Sharma, Rakesh Gosangi, Amanda Stent, Yaman Kumar, Rajiv Ratn Shah and Roger Zimmermann. ECIR 2020.
Knowledge Graph-based Ranking of Notable News Articles. Antonia Saravanou, Edgar Meij and Giorgio Stefanoni. ECIR 2020.
Novel Entity Discovery from Web Tables. Shuo Zhang, Edgar Meij, Ridho Reinanda and Krisztian Balog. WWW 2020.
Bias in Automatic Knowledge Graph Construction: A Workshop. Edgar Meij, Tara Safavi, Chenyan Xiong, Gianluca Demartini, Miriam Redi and Fatma Ozcan. AKBC 2020.
Get IT Scored using AutoSAS – An Automated System for Scoring Short Answers. Yaman Kumar, Swati Aggarwal, Debanjan Mahata, Rajiv Shah, Ponnurangam Kumaraguru and Roger Zimmermann. EAAI 2019.
Predicting and Analyzing Language Specificity in Social Media Posts. Yifan Gao, Yang Zhong, Daniel Preoţiuc-Pietro, Junyi Jessy Li. AAAI 2019.
Visual Attention Model for Cross-sectional Stock Return Prediction and End-to-End Multimodal Market Representation Learning. Ran Zhao, Yuntian Deng, Mark Dredze, Arun Verma, David Rosenberg, Amanda Stent. FLAIRS 2019.
Improving Grey-Box Fuzzing by Modeling Program Control Flow. Siddharth Karamcheti, Gideon Mann and David Rosenberg. ML4SE 2019.
SemEval-2019 Task 6: Identifying Offensive Posts and Targeted Offense from Twitter. Haimin (Raymond) Zhang, Debanjan Mahata, Simra Shahid, Laiba Mehnaz, Sarthak Anand, Yaman Singla, Rajiv Ratn Shah, and Karan Uppal. International Workshop on Semantic Evaluation 2019 at NAACL-HLT 2019.
SemEval-2019 Task 9: Suggestion Mining from Online Reviews using ULMFiT. Sarthak Anand, Debanjan Mahata, Kartik Aggarwal, Laiba Mehnaz, Simra Shahid, Haimin (Raymond) Zhang, Yaman Singla, Rajiv Ratn Shah, Karan Uppal. International Workshop on Semantic Evaluation 2019 at NAACL-HLT 2019.
SNAP-BATNET: Cascading Author Profiling and Social Network Graphs for Suicide Ideation Detection on Social Media. Rohan Mishra, Pradyumn Prakhar Sinha, Ramit Sawhney, Debanjan Mahata, Puneet Mathur and Rajiv Ratn Shah. NAACL Student Research Workshop (SRW) 2019 at NAACL-HLT 2019.
Speak Up, Fight Back! Detection of Social Media Disclosures of Sexual Harassment. Arijit Ghosh Chowdhury, Ramit Sawhney, Puneet Mathur, Debanjan Mahata and Rajiv Ratn Shah. NAACL Student Research Workshop (SRW) 2019 at NAACL-HLT 2019.
Decoding the Style and Bias of Song Lyrics. Manash Pratim Barman, Amit Awekar, and Sambhav Kothari. SIGIR 2019.
Dialogue Act Classification in Group Chats with DAG-LSTMs. Ozan Irsoy, Rakesh Gosangi, Haimin (Raymond) Zhang, Mu-Hsin Wei, Peter Lund, Duccio Pappadopulo, Brendan Fahy, Neophytos Neophytou, and Camilo Ortiz. 1st Workshop on Conversational Interaction Systems (WCIS) at SIGIR 2019.
Modeling financial analysts’ decision making via the pragmatics and semantics of earnings calls. Katie Keith and Amanda Stent. ACL 2019.
Multi-task Pairwise Neural Ranking for Hashtag Segmentation. Mounica Maddela, Wei Xu, and Daniel Preoţiuc-Pietro. ACL 2019.
Categorizing and Inferring the Relationship between the Text and Image of Twitter Posts. Alakananda Vempala and Daniel Preoţiuc-Pietro. ACL 2019.
Analyzing Linguistic Differences between Owner and Staff Attributed Tweets. Daniel Preoţiuc-Pietro and Rita Devlin Marier. ACL 2019.
Automatically Identifying Complaints in Social Media. Daniel Preoţiuc-Pietro, Mihaela Găman, and Nikolaos Aletras. ACL 2019.
A Semi-Markov Structured Support Vector Machine Model for High-Precision Named Entity Recognition. Ravneet Arora, Chen-Tse Tsai, Ketevan Tsereteli, Anju Kambadur, and Yi Yang. ACL 2019.
Grammatical Sequence Prediction for Real-Time Neural Semantic Parsing. Chunyang Xiao, Christoph Teichmann, and Konstantine Arkoudas. Deep Learning & Formal Languages: Building Bridges Workshop @ ACL 2019.
Hush-Hush Speak: Speech Reconstruction Using Silent Videos. Shashwat Uttam, Yaman Kumar, Dhruva Sahrawat, Mansi Aggarwal, Rajiv Ratn Shah, Debanjan Mahata and Amanda Stent. INTERSPEECH 2019.
MobiVSR: Efficient and Light-weight Neural Network for Visual Speech Recognition on Mobile Devices. Nilay Shrivastava, Astitwa Saxena, Yaman Kumar, Rajiv Ratn Shah, Amanda Stent, Debanjan Mahata, Preeti Kaur, Roger Zimmermann. INTERSPEECH 2019.
Challenges in end-to-end neural scientific document OCR. Yuntian Deng, David Rosenberg, and Gideon Mann. ICDAR 2019.
Semantically Driven Auto-completion. Konstantine Arkoudas and Mohamed Yahya. CIKM 2019.
Understanding Goal-Oriented Active Learning via Influence Functions. Minjie Xu and Gary Kazantsev. Machine Learning with Guarantees Workshop @ NeurIPS 2019.
Learning Better Name Translation for Cross-Lingual Wikification. Chen-Tse Tsai and Dan Roth. AAAI 2018.
Estimating the Cardinality of Conjunctive Queries over RDF Data Using Graph Summarisation. Giorgio Stefanoni, Boris Motik, Egor V. Kostylev. WWW 2018.
Never-Ending Learning for Open-Domain Question Answering over Knowledge Bases. Abdalghani Abujabal, Rishiraj Saha Roy, Mohamed Yahya, Gerhard Weikum. WWW 2018.
RIDDL at SemEval-2018 Task 1: Rage Intensity Detection with Deep Learning. Venkatesh Elango and Karan Uppal. SemEval-2018 (at NAACL-HLT 2018).
Key2Vec: Automated Ranked Keyphrase Extraction from Scientific Articles Using Phrase Embeddings. Debanjan Mahata, John Kuriakose, Rajiv Ratn Shah, Roger Zimmermann. NAACL-HLT 2018.
Collective Entity Disambiguation with Structured Gradient Tree Boosting. Yi Yang, Ozan Irsoy, Shefaet Rahman. NAACL-HLT 2018.
Weakly-supervised Contextualization of Knowledge Graph Facts. Nikos Voskarides, Edgar Meij, Ridho Reinanda, Abhinav Khaitan, Miles Osborne, Giorgio Stefanoni, Anju Kambadur and Maarten de Rijke. SIGIR 2018.
Trends in the Adoption of Corporate Child Labor Policies: An Analysis with Bloomberg Terminal ESG Data. Data for Good Exchange 2018.
Adaptive Grey-Box Fuzz-Testing with Thompson Sampling. Siddharth Karamcheti, Gideon Mann and David Rosenberg. AISec 2018.
Predicting Good Twitter Conversations. Zach Wood-Doughty, Anju Kambadur and Gideon Mann. W-NUT 2018 (at EMNLP 2018).
The Remarkable Benefit of User-Level Aggregation for Lexical-based Population-level Predictions. Salvatore Giorgi, Daniel Preoţiuc-Pietro, Anneke Buffone, Daniel Rieman, Lyle Ungar and H. Andrew Schwartz. EMNLP 2018.
Zero-Shot Open Entity Typing as Type-Compatible Grounding. Ben Zhou, Daniel Khashabi, Chen-Tse Tsai and Dan Roth. EMNLP 2018.
Why Swear? Analyzing and Inferring the Intentions of Vulgar Expressions. Eric Holgate, Isabel Cachola, Daniel Preoţiuc-Pietro and Junyi Jessy Li. EMNLP 2018.
Improving Grey-Box Fuzzing by Modeling Program Behavior. Siddharth Karamcheti, Gideon Mann and David Rosenberg. arXiv.
Generating descriptions of entity relationships. Nikos Voskarides, Edgar Meij, and Maarten de Rijke. ECIR 2017.
Automated Template Generation for Question Answering over Knowledge Graphs. Abdalghani Abujabal, Mohamed Yahya, Mirek Riedewald and Gerhard Weikum. WWW 2017.
Adaptive Submodular Ranking. Anju Kambadur and Fatemeh Navidi with Viswanath Nagarajan. IPCO 2017.
Beyond Binary Labels: Political Ideology Prediction of Twitter Users. Daniel Preoţiuc-Pietro, Ye Liu, Daniel Hopkins and Lyle Ungar. ACL 2017.
Faster Greedy MAP Inference for Determinantal Point Processes. Anju Kambadur with Insu Han, Kyoungsoo Park, Jinwoo Shin. ICML 2017.
A Randomized Algorithm for Approximating the Log Determinant of a Symmetric Positive Definite Matrix. Christos Boutsidis, Petros Drineas, Anju Kambadur, Eugenia-Maria Kontopoulou and Anastasios Zouzias. ICML 2017.
Boosting Information Extraction Systems with Character-level Neural Networks and Free Noisy Supervision. Philipp Meerkamp and Zhengyi Zhou. Structured Predictions Workshop (at EMNLP 2017).
Cheap Translation for Cross-Lingual Named Entity Recognition. Stephen Mayhew, Chen-Tse Tsai and Dan Roth. EMNLP 2017.
Controlling Human Perception of Basic User Traits. Daniel Preoţiuc-Pietro, Sharath Chandra Guntuku and Lyle Ungar. EMNLP 2017.
Scatteract: Automated extraction of data from scatter plots. Mathieu Cliche, David Rosenberg, Dhruv Madeka and Connie Yee. ECML PKDD 2017.
Civil Asset Forfeiture: A Judicial Perspective. Leslie Barrett, Alexandra Ortan, Ryon Smey, Michael W. Sherman, Zefu Lu, Wayne Krug, Roberto Martin, Anu Pradhan, Trent Wenzel, Alexander Sherman, Karin D. Martin. Data for Good Exchange 2017.
Knowledge Questions from Knowledge Graphs. Dominic Seyler, Mohamed Yahya and Klaus Berberich. ICTIR 2017.
Camera Based Two Factor Authentication Through Mobile and Wearable Devices. Mozhgan Azimpourkivi, Umut Topkara, Bogdan Carbunar. UbiComp 2017.