AACL 2020: Bloomberg’s AI Group & CTO Office Engineers and Researchers Publish 2 Papers

During the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing (AACL-IJCNLP 2020), which is being held virtually December 4-7, 2020, researchers from Bloomberg’s AI Group and Office of the CTO will be showcasing two papers they are publishing at the conference together with their academic collaborators. Through these papers, the authors highlight contributions to low-resource NLP, as well as a new task at the intersection of NLP and social science. Together, these papers demonstrate the group’s interest in both improving core NLP methodology and broadening its applications.

We asked two of the papers’ co-authors, Rakesh Gosangi and Daniel Preoţiuc-Pietro, to summarize their research and explain why the results were notable:

Session 13A Semantics II
Sunday, December 6, 2020 (14:00-14:20, UTC +8)


Two-Step Classification using Recasted Data for Low Resource Settings (Dataset | Code)
Shagun Uppal (IIIT-Delhi), Vivek Gupta (School of Computing, University of Utah), Avinash Swaminathan  (IIIT-Delhi), Debanjan Mahata (Bloomberg), Rakesh Gosangi (Bloomberg), Haimin Zhang, Rajiv Ratn Shah (IIIT-Delhi), Amanda Stent (Bloomberg)


Please summarize your research.
Rakesh: Textual Entailment (TE) is an essential task for understanding the reasoning ability of language models. It tests a system’s ability to infer whether a premise sentence entails a hypothesis sentence, contradicts it, or is unrelated to it. TE is typically framed as a supervised learning task, and researchers have curated datasets for training and testing models, mostly focusing on English. Yet very little work has been done to cover other widely spoken languages, such as Hindi, which is spoken by 800 million people and is one of India’s two official languages.

This paper aims to address this gap by using data recasting to create four new Hindi TE datasets from existing human-annotated text classification datasets. In this recasting process, we build a template hypothesis for each class in the label taxonomy of the respective text classification task and then pair each original annotated sentence with each of the templates to create Hindi TE samples. TE models often make inconsistent predictions across such related pairs — for example, predicting that a sentence entails the templates of two mutually exclusive classes. We propose a consistency regularizer to reduce these pairwise inconsistencies in the predictions of TE models. We also propose a new two-step approach that combines the predictions on related pairs of TE samples to predict the classification label of the original task. We further improve classification performance by jointly training the classification and textual entailment tasks.
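To illustrate the recasting idea, here is a minimal sketch. The label names and English templates below are hypothetical stand-ins (the paper builds Hindi templates for each dataset’s own label taxonomy); the point is only the mechanics of turning one labeled classification example into several TE pairs.

```python
# Minimal sketch of data recasting: turning a labeled text-classification
# example into textual-entailment (premise, hypothesis, label) triples.
# One template hypothesis is written per class; the pair built from the
# gold class is "entailed", pairs built from all other classes are not.

def recast(sentence, gold_label, label_templates):
    """Pair a sentence with one template hypothesis per class.

    Returns a list of (premise, hypothesis, te_label) triples.
    """
    samples = []
    for label, template in label_templates.items():
        te_label = "entailed" if label == gold_label else "not-entailed"
        samples.append((sentence, template, te_label))
    return samples

# Hypothetical two-class sentiment taxonomy with English templates.
templates = {
    "positive": "This review expresses a positive sentiment.",
    "negative": "This review expresses a negative sentiment.",
}
pairs = recast("The movie was a delight.", "positive", templates)
```

Each annotated sentence thus yields as many TE samples as there are classes, which is how a modest classification dataset expands into a much larger entailment dataset without new human annotation.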

Why are these results notable? How does it advance the state-of-the-art in the field of computational linguistics?
Rakesh: We aimed to demonstrate how large-scale data for the TE task can be developed for low-resource languages without costly and time-consuming human annotation. We hope our findings will encourage other researchers to pursue similar approaches to address data scarcity in popular, yet understudied, languages.

The new regularization constraint and joint training objective we proposed improve on the state-of-the-art performance for TE and classification tasks on four Hindi datasets, and also mitigate the inconsistency problem. We expect these approaches to be more generally useful for improving TE and text classification in various domains, such as sentiment analysis, news categorization, and identifying discourse modes.
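The second step of the two-step approach — reading a classification label off the TE model’s outputs — can be sketched as follows. The scores here are hypothetical; in practice they would come from a trained TE model applied to each (sentence, template) pair.

```python
# Sketch of two-step inference: a TE model first scores each
# (sentence, template-hypothesis) pair, then the classification label
# is taken to be the class whose template is most strongly entailed.

def classify_from_te(entailment_probs):
    """entailment_probs maps each class label to the TE model's
    probability that the sentence entails that class's template."""
    return max(entailment_probs, key=entailment_probs.get)

# Hypothetical TE-model outputs for one sentence's template pairs.
scores = {"positive": 0.91, "negative": 0.12}
predicted = classify_from_te(scores)
```

A consistency regularizer penalizes the TE model when these per-template probabilities disagree with each other (e.g., assigning high entailment probability to several mutually exclusive templates at once), which is what makes this argmax step reliable.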

Session 14C Social Media & Computational Social Science II
Sunday, December 6, 2020 (15:20-15:35, UTC+8)


Point-of-Interest Type Inference from Social Media Text
Danae Sánchez Villegas (University of Sheffield), Daniel Preoţiuc-Pietro (Bloomberg), Nikolaos Aletras (University of Sheffield)


Please summarize your research.
Daniel: The proliferation of mobile connected devices enables the publication of social media content from a variety of physical places — or points of interest. The type of place shapes the content published from it and, in turn, this content gives a glimpse into the place’s atmosphere.

Place categories with sample tweets:

Arts & Entertainment: “i’m back in central park . this place gives me war flashbacks now lol”
College & University: “currently visiting my dream school 😥 🧡”
Food: “Some Breakfast, it’s only right! #LA”
Great Outdoors: “Sorry Southport, Billy is dishing out donuts at #donutfest today. See you next weekend!”
Nightlife Spot: “Chicago really needs to step up their Aloha shirt game. Only a few of us dressed “appropriately” tonight.:) 🗿🌴🌺”
Professional & Other Places: “Leaving the news station after a long day”
Shop & Service: “Came to get an old fashioned tape measures and a button for my coat”
Travel & Transport: “Shoutout to anyone currently on the way to the APCE Annual Event in Louisville, KY! #APCE2018”

This paper studies the language used in eight different types of places, such as restaurants, offices, or the outdoors. Through large-scale linguistic analysis, we show that text posted from a place (e.g., ‘outdoor places’) can mention specific place types (e.g., ‘beach’ or ‘island’), activities performed there (e.g., ‘hike’ or ‘swim’), feelings about the place (e.g., ‘beautiful’) or moments associated with that place (e.g., ‘sunset’).

The paper also demonstrates that one of the eight place types can be predicted from the language of a single tweet with a Macro F1 score of 43.67, using modern Transformer-based methods.

Why are these results notable? How does it advance the state-of-the-art in the field of computational linguistics?
Daniel: The paper is the first to study the relationship between the language used at a location and the type information associated with that point of interest. It presents a method and dataset in which tweets are paired with Foursquare location metadata indicating the type of place from which they were posted. We believe this dataset and initial analysis will pave the way for research into how places impact user expression. Practically, being able to infer the place type from text could help geographers and social scientists study mobility patterns and how people interact with places in real-time, something that would be important for studying the spread of COVID-19 and its impact on businesses and society.