Bloomberg’s AI Researchers & Engineers Publish 3 NLP Papers at ACL-IJCNLP 2021

During the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021) this week, researchers and engineers from Bloomberg’s AI Group are showcasing their expertise in natural language processing (NLP) and computational linguistics by publishing three papers at the main conference and co-located workshops, including the 17th International Conference on Parsing Technologies (IWPT 2021) and *SEM 2021: The 10th Joint Conference on Lexical and Computational Semantics.

In these papers, the authors and their collaborators present contributions both to fundamental NLP problems in the areas of question answering, syntactic and semantic analysis, and disentanglement, and to applications of NLP technology focused on extracting hybrid data from financial reports, as well as understanding sentences and online chats. Among the collaborators are computer science Ph.D. student Lisa Bauer of The University of North Carolina at Chapel Hill, who performed her research as an intern in our AI Group, and her advisor, Dr. Mohit Bansal, Director of the MURGe-Lab (Multimodal Understanding, Reasoning, and Generation for Language Lab), which is part of UNC’s Natural Language Processing and Machine Learning Group, and a prior recipient of the Bloomberg Data Science Research Grant.

We asked the authors to summarize their research and explain why the results were notable in advancing the state-of-the-art in the field of computational linguistics:


Tuesday, August 3, 2021

Session 9C: Question Answering 2 (10:30-10:40 UTC)
TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance
Fengbin Zhu (National University of Singapore), Wenqiang Lei (NUS), Youcheng Huang (Sichuan University), Chao Wang (6Estates), Shuo Zhang (Bloomberg), Jiancheng Lv (Sichuan University), Fuli Feng (NUS) and Tat-Seng Chua (NUS)

Click here to read "TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance" published at ACL-IJCNLP 2021 on August 3, 2021

Please summarize your research.

Shuo: Existing Question-Answering (QA) systems largely focus on unstructured text, structured knowledge bases, or semi-structured tables in isolation. Work that deals with hybrid data, which consists of both unstructured text and structured or semi-structured knowledge bases/tables, is rare, despite hybrid data being pervasive in real-world use cases, such as financial reports.

The main goal of this work is to construct a benchmark on a hybrid of tabular and textual content. In particular, we extract hybrid data from financial reports and conduct labor-intensive annotation tasks to build a new large-scale QA dataset containing four types of annotations:

  1. context relevant to the tables;
  2. question-answer pairs where numerical reasoning is usually required to infer the answer;
  3. answer type and derivation, which indicate the reasoning type; and
  4. the sources for inferring the answer (see Figure 1 for an example; a hypothetical record sketching these annotations appears below).
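
To illustrate, a hypothetical record carrying these four kinds of annotations might look as follows. The field names and values here are invented for illustration; see the paper for the actual TAT-QA schema.

```python
example = {
    "table": [["", "2019", "2018"],
              ["Revenue", "1,200", "1,000"]],
    "paragraphs": ["Revenue increased primarily due to ..."],  # relevant context
    "question": "What was the change in revenue from 2018 to 2019?",
    "answer": {"value": 200, "scale": "thousand"},
    "answer_type": "arithmetic",          # reasoning type
    "derivation": "1,200 - 1,000",
    "answer_from": "table",               # source for inferring the answer
}
```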

In addition, we propose a QA model that adopts sequence tagging to extract relevant cells, and then applies symbolic reasoning over the extracted cells to arrive at the final answer.
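
At a high level, the two stages could be wired together as in the following sketch. This is illustrative Python only, not the authors' TAGOP code; `tag_relevant` and `pick_operator` stand in for learned components, and the operator names are examples.

```python
# Symbolic operators applied over the numbers extracted by the tagger.
OPERATORS = {
    "SUM": sum,
    "DIFF": lambda xs: xs[0] - xs[1],
    "AVERAGE": lambda xs: sum(xs) / len(xs),
    "CHANGE_RATIO": lambda xs: (xs[0] - xs[1]) / xs[1],
}

def answer(question, cells, tag_relevant, pick_operator):
    # (1) Sequence tagging: keep the cells the tagger marks as relevant.
    relevant = [c for c, keep in zip(cells, tag_relevant(question, cells)) if keep]
    numbers = [float(c.replace(",", "")) for c in relevant]  # assumes numeric cells
    # (2) Symbolic reasoning: a classifier chooses which operator to apply.
    return OPERATORS[pick_operator(question, relevant)](numbers)
```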

An example of TAT-QA. The dashed box on the left shows the hybrid context. The rows with a blue background are row headers, while the column with a grey background contains the column headers. The solid box on the right shows the corresponding question, the answer with its scale, and the derivation used to arrive at the answer.

Why are these results notable? How will it help advance the state-of-the-art in the field of natural language processing?

Existing QA systems for hybrid data are based on Wikipedia, where the embedded tables are mostly content (text-based) tables covering more diverse data types. This new benchmark is the first endeavor to bring research attention to QA over hybrid content in the finance domain, where numerical tables, as well as their surrounding context, are pervasive and harder to make sense of. Performing complex calculations over these financial reports may require a professional background, which makes it all the more important to make this hybrid content discoverable. TAT-QA is a dataset that will benefit both the research community and industry applications.

The dataset was labeled by annotators with a financial background and took approximately three months to complete. The quality of the annotations is ensured by our strict controls. We included multiple types of annotations: paragraphs relevant to each table, question-answer pairs, answer type and derivation, and answer source. As a result, not only can the QA task utilize this resource, but other sub-tasks, like information extraction and semantic type prediction, can benefit from it as well.

We adapted some state-of-the-art QA models for tables and context, like TaPas and HyBrider, but they do not generalize well to hybrid data containing numerical tables. In the end, our proposed method, TAGOP, achieves only 58.0% in terms of F1 score, which indicates the difficulty of this task and signals that QA on hybrid content is still an open challenge. There are multiple reasons for this: the evidence for inferring the answer is scattered across both tables and text, quantities need to be converted based on their units, and the calculations required for financial reports are inherently complex.

What can we do to close the gap? The lead authors of this work have built a leaderboard based on this resource so the community can work together to move this research forward (Figure 2 provides a snapshot of the leaderboard).

Snapshot of the TAT-QA leaderboard.

Friday, August 6, 2021

IWPT 2021: The 17th International Conference on Parsing Technologies
Poster Session (13:00-13:30 UTC)
Generic Oracles for Structured Prediction
Christoph Teichmann (Bloomberg) and Antoine Venant (Université de Montréal)

Click here to read "Generic Oracles for Structured Prediction" published at IWPT 2021 on August 6, 2021

Please summarize your research.

Christoph: For tasks like machine translation (MT) or mapping a sentence to its meaning, the output will be generated in a sequence of steps. In MT, we produce a translation word-by-word. To understand a sentence, we gradually fill in who did what to whom. When we train a model for these tasks, the model must be able to make decisions based on its own output. For example, if we are translating “Mary sieht John” (“Mary sees John”) into English and the first output the model produces is “Mary,” then a good continuation is “sees John.” If the model produces “John” as the first word, then there is still a way to recover from this suboptimal choice by going with “was seen by Mary” for the rest of the translation.

In order to train a model that can recover from its own errors, we need two ingredients:

  1. Examples of the model making errors and information on how it should have recovered. It is easy to get the errors simply by running a preliminary model on some inputs and comparing the results to a correct solution.
  2. Dynamic oracles that determine what the model should have done at each step of its run. In the example above, if our model makes a worse mistake and starts its translation with “Sees,” then a dynamic oracle will tell us what to do next. “What to do next” is defined as taking the action that leads to the best achievable outcome relative to the gold solution “Mary sees John.” (A toy sketch of such an oracle appears after the figure below.)
Example structured prediction task for active imitation learning.
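
To make the idea concrete, here is a toy dynamic oracle for sequence generation under edit (Levenshtein) distance, written as a minimal sketch rather than the method from the paper: given a possibly erroneous prefix, it returns the next tokens that still permit the lowest achievable distance to the gold output. For simplicity, it only considers tokens from the gold sequence as candidates; a real system would score the full vocabulary.

```python
def edit_row(prefix, gold):
    # Levenshtein distances between `prefix` and every gold prefix gold[:j].
    row = list(range(len(gold) + 1))
    for tok in prefix:
        new = [row[0] + 1]
        for j, g in enumerate(gold, 1):
            new.append(min(new[j - 1] + 1,            # insert
                           row[j] + 1,                # delete
                           row[j - 1] + (tok != g)))  # substitute / match
        row = new
    return row

def oracle_next_tokens(prefix, gold):
    # Best achievable final distance after emitting `tok`: align the
    # extended prefix to some gold prefix, then emit the gold suffix.
    best = {tok: min(edit_row(prefix + [tok], gold)) for tok in set(gold)}
    target = min(best.values())
    return {tok for tok, d in best.items() if d == target}

gold = "mary sees john".split()
print(oracle_next_tokens(["mary"], gold))  # {'sees'}: continue with the gold word
```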

Our paper shows how to translate the questions that dynamic oracles must answer into an optimization problem that can be expressed in the language of finite state automata. Once this translation is complete, we can use well-known techniques, such as dynamic programming, to efficiently obtain the needed answers.
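
As an illustration of that last step: once the loss and the space of continuations are encoded as a weighted acyclic automaton, the “best achievable outcome” is simply a minimum-cost path, which dynamic programming finds in a single pass. The following toy code is our illustration, not the paper's implementation, and assumes integer states numbered in topological order.

```python
def min_cost(arcs, start, finals):
    # arcs: {state: [(next_state, cost), ...]}; states are integers
    # numbered in topological order (an assumption of this sketch).
    states = set(arcs) | {t for outs in arcs.values() for t, _ in outs}
    states |= {start} | set(finals)
    best = {s: float("inf") for s in states}
    best[start] = 0.0
    for s in sorted(states):            # topological order by assumption
        for t, cost in arcs.get(s, []):
            best[t] = min(best[t], best[s] + cost)
    return min(best[f] for f in finals)

# Toy automaton: two paths from state 0 to state 2, with costs 1.0 vs. 2.0.
print(min_cost({0: [(1, 0.0), (2, 2.0)], 1: [(2, 1.0)]}, start=0, finals={2}))  # 1.0
```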

Why are these results notable? How does it advance the state-of-the-art in the field of natural language processing/computational linguistics?

There is an enormous variety of problems that one encounters in the field of NLP: machine translation, named entity recognition, and dialogue structure parsing all have their own evaluation metrics and different sets of actions between which a model must choose at each step. This diversity has meant that previous implementations of dynamic oracles were task-specific: they could tell us what to do for specific types of syntactic or semantic analysis, but each new problem statement required researchers to propose a new way to obtain an efficient oracle.

Our research will make this process much simpler. If the loss function for a problem and the set of all possible solutions can be represented in terms of finite state automata, then our techniques lead directly to an efficient dynamic oracle. This means that we will be able to extend error-aware training to a wider range of existing and new problems in this field.

Friday, August 6, 2021

*SEM 2021: The 10th Joint Conference on Lexical and Computational Semantics
QA Session 6: Discourse, Dialog, Generation (15:54-16:02 UTC)
Poster Session 6: Discourse, Dialog, Generation (16:10-17:00 UTC)
Disentangling Online Chats with DAG-Structured LSTMs
Duccio Pappadopulo (Bloomberg), Lisa Bauer (UNC Chapel Hill), Marco Farina (Bloomberg), Ozan İrsoy (Bloomberg), Mohit Bansal (UNC)

Click here to read "Disentangling Online Chats with DAG-Structured LSTMs" published at *SEM 2021 on August 6, 2021

Please summarize your research.

Duccio: Online chat and text messaging systems are very common communication tools nowadays. The text conversations among groups of users have rich, complex structures that can be an obstacle for downstream NLP tasks such as question answering, summarization, or topic modeling. Disentangling these interwoven conversation threads is a crucial step before these other tasks can be performed.

To simplify this quite challenging clustering problem, it is intuitive to assume the existence of a binary relation between posts, where a post either starts a new thread (e.g., by asking a new question) or replies to an earlier post. Once all the reply-to pairs have been identified, the threads follow immediately.

The benefit of framing the problem this way is that identifying reply-to pairs is a much simpler classification problem: given a post, we aim to predict which of the previous posts it is replying to.
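
Concretely, once a reply-to parent has been predicted for every post, the threads are just the connected components of those links. A minimal union-find sketch of this recovery step (our illustration, with a hypothetical `predict_parent` standing in for the learned classifier):

```python
def threads_from_replies(messages, predict_parent):
    # Union-find forest over message indices.
    parent = list(range(len(messages)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(messages)):
        # `predict_parent` returns the index of the post that message i
        # replies to, or i itself if it starts a new thread (hypothetical).
        j = predict_parent(i, messages)
        parent[find(i)] = find(j)

    threads = {}
    for i in range(len(messages)):
        threads.setdefault(find(i), []).append(messages[i])
    return list(threads.values())
```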

Our work was motivated by a new dataset released in 2019 by Kummerfeld et al., which includes annotated reply-to pairs for training conversation disentanglement models.

In our paper, we introduce a new architecture to perform thread disentanglement. Building on our previous work, “Dialogue Act Classification in Group Chats with DAG-LSTMs,” published at the 1st Workshop on Conversational Interaction Systems at SIGIR 2019, we use DAG-LSTMs to encode textual features, as they allow us to keep track of the graph-like structure of a conversation that arises from user turns and user mentions.
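
As a rough sketch of the core idea (a simplified, hypothetical cell, not the implementation from either paper): a DAG-LSTM cell looks like a standard LSTM cell, except that each node first pools the states of all of its parents in the conversation graph, e.g., the previous post in the channel and the last post by a mentioned user.

```python
import torch
import torch.nn as nn

class DAGLSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x, parent_states):
        # Pool parent states by summation (one simple aggregation choice).
        if parent_states:
            h_in = torch.stack([h for h, _ in parent_states]).sum(dim=0)
            c_in = torch.stack([c for _, c in parent_states]).sum(dim=0)
        else:  # a post that starts a new thread has no parents
            h_in = x.new_zeros(self.hidden_size)
            c_in = x.new_zeros(self.hidden_size)
        i, f, o, g = self.gates(torch.cat([x, h_in])).chunk(4)
        c = torch.sigmoid(f) * c_in + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c  # nodes are processed in topological order over the DAG
```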

We expand the set of features introduced for the baseline model by Kummerfeld et al. in order to capture the instances in which a user mentions another using an abbreviated or misspelled version of their username.

Excerpt from the IRC dataset (left) and our reply-to classifier architecture (right). Blue dots represent a unidirectional DAG-LSTM unit processing the states coming from the children of the current node. Red dots represent the GRU units performing thread encoding. At this point in time, we are computing the score (log-odds) of the fifth utterance replying to the third.

Why are these results notable? How does it advance the state-of-the-art in the field of natural language processing/computational linguistics?

Our work achieves state-of-the-art results in the task of recovering reply-to relations.

We performed thorough feature-ablation experiments showing that our model and new handcrafted features provide a significant improvement over strong existing baselines, thanks to their ability to capture the uniquely complex structure of conversational data by leveraging relationships in user turns and mentions.

While the new features we introduce are tailored to the dataset we used to evaluate our model, the DAG-LSTM architecture is flexible enough to be applied to other datasets for which disentanglement is a prerequisite for downstream tasks. In particular, while we only use user-turn and user-mention metadata to define parent-child links in the DAG-LSTM graph, additional relations can be used. Examples include the time difference between utterances, the presence of common words in two utterances, or, more generally, any dataset-specific binary relation.

Furthermore, we believe our model is simple enough to allow for deployment in live scenarios where latency is a concern.