Bloomberg Engineers Publish 4 NLP Papers during EMNLP 2021’s Main Conference

During the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021) this week, AI researchers and engineers from Bloomberg are showcasing their expertise in natural language processing (NLP) and computational linguistics by publishing four papers during the main conference. They also have two papers included in “Findings of the Association for Computational Linguistics: EMNLP 2021,” and another two papers featured at the co-located Workshop on Insights from Negative Results in NLP (more on these four papers here).

List of papers published during the main conference at EMNLP 2021

In these papers, the authors and their collaborators (among them Bloomberg Data Science Ph.D. Fellow Alexander Spangher, a Ph.D. candidate in the Department of Computer Science at the USC Viterbi School of Engineering, and his advisor, Professor Jonathan May of the school’s Information Sciences Institute) present contributions to fundamental NLP problems in the areas of headline part-of-speech tagging, relation extraction, discourse analysis, code switching, dialogue state tracking, document clustering & topic noise, and word embeddings.

We asked the authors of the main conference papers to summarize their research and explain why the results were notable in advancing the state-of-the-art in the field of computational linguistics:

Monday, November 8, 2021

Virtual Poster & Demo Session 2 (12:30-14:30 AST)
Cross-Register Projection for Headline Part of Speech Tagging
Adrian Benton, Hanyang Li, Igor Malioutov

Click to read "Cross-Register Projection for Headline Part of Speech Tagging"

Please summarize your research.

Igor: Being able to process news headlines is important for a number of key downstream applications, including summarization, information extraction, question answering, as well as other niche problems like timeline generation and first story detection. Headlines often summarize the most critical information, events, and actors in a news story.

In addition, many articles are published as “flash headlines,” which are typically sent out when a breaking news event is occurring, even before journalists have time to write a full article. Over time, the article body will be populated, but the headline is initially the only source of text available.

Surprisingly little work has been done to develop core NLP tools specifically for processing headlines — and few annotated corpora are available. In this work, we aim to bridge this gap by developing a methodology for bootstrapping annotations and presenting state-of-the-art models for part-of-speech tagging on headlines. Part-of-speech tagging is used as a key signal in a number of downstream tasks, ranging from lemmatization, syntactic parsing and coreference resolution to semantic role labeling and open domain information extraction, among others.

Illustration of the application of part-of-speech tagging on headlines

To address the lack of available annotated headlines, we developed a bootstrapping methodology inspired by work in cross-lingual annotation projection in machine translation. If we align headlines and similar long-form sentences in articles, we can transfer the part-of-speech tags from a resource-rich domain with reliable statistical models — long-form news text — to corresponding words in headlines.
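The projection idea can be illustrated with a minimal Python sketch. It assumes the simplest possible aligner, case-insensitive exact token matching via the standard library's difflib, which is only a stand-in and not necessarily the alignment method used in the paper. The function name project_tags, the placeholder tag "X", and the example headline and sentence are all hypothetical.

```python
from difflib import SequenceMatcher


def project_tags(headline_tokens, sentence_tokens, sentence_tags, unknown="X"):
    """Copy POS tags from a tagged long-form sentence onto headline tokens.

    Alignment is plain case-insensitive exact matching (an assumption made
    for illustration); unaligned headline tokens keep the `unknown` tag.
    """
    head = [t.lower() for t in headline_tokens]
    sent = [t.lower() for t in sentence_tokens]
    projected = [unknown] * len(headline_tokens)
    matcher = SequenceMatcher(a=head, b=sent, autojunk=False)
    for block in matcher.get_matching_blocks():
        # Each block says head[a : a + size] equals sent[b : b + size].
        for offset in range(block.size):
            projected[block.a + offset] = sentence_tags[block.b + offset]
    return projected


# Hypothetical headline and the similar long-form sentence it is aligned to.
headline = ["Acme", "buys", "rival", "for", "cash"]
sentence = ["Acme", "buys", "its", "rival", "for", "cash", "."]
tags = ["PROPN", "VERB", "PRON", "NOUN", "ADP", "NOUN", "PUNCT"]

silver = project_tags(headline, sentence, tags)
print(list(zip(headline, silver)))
```

The resulting "silver" tags are only as good as the alignment and the long-form tagger, which is why a real system would likely need a more robust aligner and some filtering of poorly aligned pairs.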

In our experiments, we show that existing approaches trained on long-form text suffer from significant, but predictable, errors on headlines. This can mainly be attributed to the particularities of headline language. In fact, it has long been acknowledged by linguists that headlines constitute a unique stand-alone language register due to the omission of articles and auxiliary verbs, and their use of stand-alone nominals and adverbials.

We demonstrate that joint training on both long-form news and headlines delivers improvements over training on just a single training set, as well as over naively concatenating training sets. We evaluate on a newly annotated corpus of 5,248 English news headlines from the Google sentence compression corpus and show that our model yields a 23% relative error reduction per token and 19% per headline. In addition, we demonstrate that better headline POS tag assignments can improve the performance of a syntax-based open domain information extraction system. We’re also releasing a gold annotated corpus of part-of-speech tagged headlines, POS-tagged Headline (POSH), to encourage research in further improving NLP models for news headlines.
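For readers less familiar with the metric, relative error reduction compares error rates rather than accuracies. The short Python sketch below shows the arithmetic. The two accuracy values are hypothetical numbers chosen only so that the result comes out near 23%; they are not figures reported in the paper.

```python
def relative_error_reduction(baseline_accuracy, new_accuracy):
    """Fraction of the baseline's errors that the new model eliminates."""
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    return (baseline_error - new_error) / baseline_error


# Hypothetical accuracies, used only to illustrate the calculation.
reduction = relative_error_reduction(0.90, 0.923)
print(f"{reduction:.1%}")
```

Under these assumed numbers, a gain of 2.3 accuracy points corresponds to removing roughly 23% of the baseline's errors, which is why the relative figure can look much larger than the absolute one.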

Why is this research notable?

This work takes a first step toward developing stronger NLP models for headlines by focusing on improving POS taggers. We show that training a tagger on headlines with projected POS tags results in a far stronger model than taggers trained on gold-annotated long-form text. This suggests that more expensive syntactic annotations, such as dependency trees, may also be reliably projected onto headlines, obviating the need for gold dependency annotations when training a headline parser.

Although this work is focused on learning strong headline POS taggers, the projection technique we introduced in this work can also be adapted to train other strong headline sequence taggers (e.g., training a headline chunker or named entity tagger). Projection could potentially be applied to generate silver labeled data for other domains, such as simplified English (e.g., aligned sentences from simplified to original Wikipedia) and other languages.

How will it help advance the state-of-the-art in the field?

Critically, our work aims to motivate others to study headlines as a unique register (from a computational linguistics point-of-view). We hope others will approach this research with an eye toward the many other applications and core tasks in the traditional NLP pipeline and tech stack.

Monday, November 8, 2021

Virtual Poster & Demo Session 2 (12:30-14:30 AST)
Towards Realistic Few-Shot Relation Extraction
Sam Brody, Sichao Wu, Adrian Benton

Click to read "Towards Realistic Few-Shot Relation Extraction"

Please summarize your research.

Sam: Our research focuses on relation extraction, an important problem in NLP. Given a piece of text (e.g., a news article), we want to identify all occurrences of some pre-defined relation. For example, we might build a relation extraction system that identifies mentions of company acquisitions. Ideally, sentences like “Company X will purchase Y Holdings LLC for $2.3 billion.” would be tagged as containing the relation (Company X, ACQUIRES, Y Holdings LLC), whereas “Z INC. bought $1.5M in raw materials from supplier W Mining Corp.” would not.

In the past, this problem was mostly addressed through supervised learning of individual relations, where the system learns to identify each relation from training data consisting of thousands of example sentences. Obtaining such training data in sufficient quantity and quality for every relation of interest can be very costly and involves extensive manual effort.

In 2018, FewRel was proposed as a new approach to the task that relies on few-shot learning. Instead of learning individual relations, the system would be trained to try to distinguish between a wide range of different relations (i.e., relation classification). The purpose of this setting is to teach the system to understand when sentences are expressing similar relations, and when they differ. When properly trained, such a system could also potentially be used for relation extraction, to identify a relation it had never seen before from only a handful of examples.
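To make the few-shot setting concrete, here is a minimal Python sketch of classifying a query sentence against a handful of labeled examples per relation. It uses nearest-prototype classification (averaging the example embeddings for each relation), which is one common few-shot technique and not necessarily the model studied in the paper. The two-dimensional "embeddings" are toy values, and the SUPPLIES relation name is hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def classify_few_shot(support, query):
    """Return the relation whose prototype is closest to the query.

    support: dict mapping relation name -> list of example embeddings
    query:   embedding of the sentence to classify
    """
    prototypes = {rel: mean_vector(vecs) for rel, vecs in support.items()}
    return min(prototypes, key=lambda rel: math.dist(prototypes[rel], query))


# Toy 2-way, 2-shot episode; a real system would use encoder outputs.
support = {
    "ACQUIRES": [[0.9, 0.1], [0.8, 0.2]],
    "SUPPLIES": [[0.1, 0.9], [0.2, 0.8]],
}
prediction = classify_few_shot(support, [0.7, 0.3])
print(prediction)
```

Note that this sketch always picks one of the given relations. A relation extraction system must also be able to decide that a sentence expresses none of them, which is part of what makes the extraction setting harder.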

This setting was very exciting to us, since it suggested a way to create a single relation extraction system that could handle any new relations that might be of interest to Bloomberg with very little cost and effort.

In our paper, we study few-shot relation classification models and how well they perform when deployed in a relation extraction setting. What we found was illuminating: While state-of-the-art pre-trained neural networks can do as well as humans at few-shot relation classification, their performance varies a lot when used for relation extraction. In a relation extraction setting, it was clear that these models were confusing many relations that humans can easily distinguish (e.g., deciding whether someone mentioned is the child or spouse of another person).

We observe that these models are very good at inferring entity types, even though this information is not explicitly provided, by using the structure of the words and where they lie in the sentence. As a result, they tend to confuse relations that involve similar types of entities: e.g., relations between two people, or relations that connect an organization and a place (like the city-of-headquarters and country-of-registration relations).

In addition to identifying this blind spot in state-of-the-art few-shot relation classification models, our paper explores different methods for mitigating this type bias. We considered several different representations of the example sentences, but ultimately found that changing the training procedure to force the model to discriminate between similarly-typed relations was most effective at forcing models to rely less on argument-type information.
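One way to picture such a training change is to build each training episode only from relations that share the same argument types, so that entity types alone cannot separate them. The Python sketch below illustrates this idea. The sampling scheme, the function name sample_hard_episode, and the type labels are assumptions made for illustration, not the paper's exact procedure.

```python
import random
from collections import defaultdict


def sample_hard_episode(relation_types, n_way, rng):
    """Pick n_way relations that all share one (head type, tail type) pair."""
    by_signature = defaultdict(list)
    for relation, signature in relation_types.items():
        by_signature[signature].append(relation)
    # Only type signatures with enough relations can fill an episode.
    eligible = [rels for rels in by_signature.values() if len(rels) >= n_way]
    if not eligible:
        raise ValueError("no type signature has enough relations")
    group = rng.choice(eligible)
    return rng.sample(group, n_way)


# Hypothetical relation inventory with assumed argument types.
relation_types = {
    "child": ("PERSON", "PERSON"),
    "spouse": ("PERSON", "PERSON"),
    "sibling": ("PERSON", "PERSON"),
    "city_of_headquarters": ("ORG", "PLACE"),
    "country_of_registration": ("ORG", "PLACE"),
}
episode = sample_hard_episode(relation_types, n_way=2, rng=random.Random(0))
print(episode)
```

Because every relation in such an episode involves the same kinds of entities, a model can only succeed by attending to how the relation is actually expressed in the sentence.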

Why is this research notable? How will it help advance the state-of-the-art in the field?

Our work shows that the few-shot relation classification approach does not provide an out-of-the-box solution for the relation extraction problem, which is of much greater practical interest. However, by uncovering the weaknesses of the approach — and presenting potential solutions — we help bring few-shot learning closer to being a viable alternative to the current costly and time-consuming strategy of learning individual relations from thousands of examples.

Monday, November 8, 2021

Virtual Poster & Demo Session 2 (12:30-14:30 AST)
Multitask Semi-Supervised Learning for Class-Imbalanced Discourse Classification
Alexander Spangher, Jonathan May, Sz-rung Shiang, Lingjia Deng

Click to read "Multitask Semi-Supervised Learning for Class-Imbalanced Discourse Classification"

Please summarize your research.

Lingjia: Discourse analysis reveals the functions of paragraphs as they relate to the whole document. As shown in the figure below, some paragraphs in a press release discuss the main topic of the news, while others provide background information or reactions to the event.

Figure showing how some paragraphs in a press release discuss the main topic of the news, while others provide background information or reactions to the event.

This task is useful for many downstream NLP tasks, including document summarization, event extraction, and storyline identification.

One of the key challenges to discourse analysis is that the discourse datasets are usually class-imbalanced. For example, in articles from The New York Times in the NewsDiscourse dataset, 24% are labeled as “Current Context” (events that happen at the same time as the main event), while only 1.7% are labeled as “Consequence” (events that the main event leads to). Furthermore, collecting discourse annotations is expensive because this complex task requires annotators to be trained to provide good annotation data.

In addition, there are several competing and related discourse schemas which are not exactly the same. Though different discourse schemas define different discourse labels, we observe that they appear to offer complementary information. For example, Rhetorical Structure Theory Treebanks provide lower-level discourse information, modeling the relation between two sentences. One of the labels is “question-answer,” meaning that one sentence is the answer to another sentence. Recent news discourse schemas (e.g., the NewsDiscourse dataset) offer higher-level discourse information, modeling the relation between the sentence and the document. One of the labels is “Main Event,” meaning that the sentence talks about the main event in the news article. We hypothesize that lower-level NLP tasks could help higher-level NLP tasks. Thus, a multi-task approach can use the lower-level discourse information to help understand the higher-level discourse information.

To test this hypothesis, we propose a multi-task neural framework that includes seven discourse datasets (one of these is newly introduced in this work), an events dataset, and an unlabeled large-scale news dataset to predict sentence-level discourse relations. Our experiments show that this multi-task approach can improve discourse classification on the NewsDiscourse dataset with an increase of 4.9 points in F-1 measure, with the biggest improvements occurring in underrepresented classes. These results demonstrate that the multi-task approach can utilize other discourse dataset information to boost performance, especially as it relates to the underrepresented classes.
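Per-class F1 makes it easy to see why underrepresented classes matter in evaluation. The minimal Python sketch below computes F1 for each label on a toy example in which a model always predicts the majority class. The label counts are hypothetical, and the way F1 scores are averaged across classes in the paper is not specified here, so the sketch reports per-class scores only.

```python
def per_class_f1(gold, predicted):
    """Compute F1 separately for every label seen in gold or predicted."""
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy data: a majority-class predictor on an imbalanced label set.
gold = ["Current Context"] * 8 + ["Consequence"] * 2
predicted = ["Current Context"] * 10

scores = per_class_f1(gold, predicted)
accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
print(scores, accuracy)
```

In this toy case the predictor is right 80% of the time yet never identifies a single "Consequence" example, which is the kind of failure that gains on underrepresented classes address.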

Why is this research notable? How will it help advance the state-of-the-art in the field?

Discourse analysis reveals the structure of the news stories. For readers, discourse analysis can help quickly locate different sections of the stories. For journalists, discourse analysis can suggest which aspect is still missing from the story and help guide their writing. The most exciting part of this work is that it shows that the multi-task approach can utilize different discourse schemas to help learn the underrepresented classes.

This is in contrast with other proposed methods, such as training data augmentation or unsupervised data augmentation, neither of which improved performance. Instead, the multi-task learning framework in this paper can utilize the correlations between classes in divergent schemas, as well as provide support for underrepresented classes in the primary task.

Monday, November 8, 2021

Virtual Poster & Demo Session 2 (12:30-14:30 AST)
GupShup: Summarizing Open-Domain Code-Switched Conversations
Laiba Mehnaz, Debanjan Mahata, Rakesh Gosangi, Uma Sushmitha Gunturi, Riya Jain, Gauri Gupta, Amardeep Kumar, Isabelle Lee, Anish Acharya, Rajiv Ratn Shah

Click to read "GupShup: Summarizing Open-Domain Code-Switched Conversations"

Please summarize your research.

Rakesh: Code-switching is the communication behavior where speakers switch between different languages during a conversation. With the widespread adoption of conversational agents and chat platforms, code-switching has become an integral part of written conversations in many multilingual communities worldwide. Our research work introduces the task of abstractive summarization of open-domain code-switched written conversations. Namely, given a multi-party conversation in Hindi-English on any topic, the objective is to generate a summary in English. These English summaries can serve as input to other downstream NLP models, which are often trained only on English data, to perform various other tasks, such as intent classification, question answering, and item recommendation.

To facilitate this task, we built the first open-domain code-switched conversation summarization dataset. This new corpus, named GupShup, contains over 6,800 Hindi-English code-switched conversations and corresponding human-annotated summaries in English and Hindi-English. We provide a thorough analysis of the dataset and the performance of various state-of-the-art abstractive summarization models for this task. We observed that mBART, which was pre-trained on data from multiple languages, obtained the best performance on automated evaluation metrics. We also performed a human evaluation of the model-generated summaries. This experiment not only helped us evaluate and compare different models, but also evaluate the quality of automated summary evaluation metrics. Our results show that ROUGE-based metrics and BLEURT were highly correlated with the human evaluation scores, but metrics like BERTScore and BLEU proved relatively ineffective for this task.
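As a rough illustration of what an overlap-based summary metric measures, here is a simplified unigram-overlap F1 score in Python, in the spirit of ROUGE-1. It assumes lowercased whitespace tokenization with no stemming, so it is a sketch rather than the official ROUGE implementation, and the example summaries are hypothetical.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1-style F1: clipped unigram overlap of two texts."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical human-written reference and model-generated summary.
reference = "sam will meet alex at the cafe tomorrow"
candidate = "sam meets alex at the cafe"
score = rouge1_f1(reference, candidate)
print(round(score, 3))
```

Scores like this can then be compared against human ratings of the same summaries to check how well an automated metric tracks human judgment.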

Figure 1: A sample conversation in English and the corresponding code-switched version in Hindi-English. Also included in this figure are summaries in both English and Hindi-English.

Why is this research notable? How will it help advance the state-of-the-art in the field?

Code-switching is an integral part of both written and spoken conversations for various multilingual communities around the world. It is commonly observed during interactions between peers who are fluent in multiple languages. For example, on the Indian subcontinent, it is common for people to alternate between English and other regional languages (e.g., Hindi) throughout the course of a single conversation. Developing models that can accurately process code-switched text is essential for the proliferation of NLP-based technologies to these communities, in addition to contributing toward the diversity and inclusivity of available language resources. However, building such models would require high-quality human-curated datasets. This is where our work comes into play.