Bloomberg researchers have long been familiar faces at academic conferences focused on data science. But the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), held October 31 to November 4 in Brussels, marked a bit of a coming-out for the researchers in Bloomberg’s artificial intelligence (AI) group. Some of the highlights: Gideon Mann, Bloomberg’s head of data science, gave one of the conference’s three keynotes, and NLP researchers Daniel Preoţiuc-Pietro and Chen-Tse Tsai presented both long and short papers. Preoţiuc-Pietro also delivered an invited talk at a workshop co-located with EMNLP.
Preoţiuc-Pietro’s long paper, “Why Swear? Analyzing and Inferring the Intentions of Vulgar Expressions,” examines the use of vulgarity on Twitter. Given the amount of buzz about the quality of the dialog on Twitter, it’s perhaps surprising that Preoţiuc-Pietro says “Why Swear?” is the first computational research on this topic. “This has only been studied in linguistics,” says Preoţiuc-Pietro. “We built on the linguistic research and used computational methods to disambiguate the use of vulgar words.”
Preoţiuc-Pietro is hopeful that his team’s research could lead to more precise filtering of hate speech. Most attempts to screen out hate speech will automatically remove any content containing curse words. “The assumption is that when people use vulgar words they tend to express aggression or hate speech towards others,” he says. “But, this is the case in just 15 percent of the tweets.” This means that most attempts to censor hate speech will inadvertently suppress a significant amount of more innocuous content as well.
Preoţiuc-Pietro’s method began with human annotators, who classified curse words according to their meaning. “Ass,” for example, could mean a donkey, but it could also be used for emphasis or aggression. Other curse words could be used to signal group affiliation or simply the presence of a more casual speaking environment.
To build a model that would enable an algorithm to make similar decisions, Preoţiuc-Pietro’s team also leveraged insights from linguistics and psychology. A curse word at the beginning of a tweet was often used to demonstrate group identity or informality. In product reviews, a vulgar word was more likely to be used for emotion or emphasis, as was a vulgar word appearing next to an adjective.
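Cues like these can be turned into simple features for a classifier. The following is only a rough sketch of that idea, not the feature set used in the paper: the lexicon `VULGAR`, the function `vulgar_features`, and the part-of-speech tag `"ADJ"` are all hypothetical, and the input is assumed to be already tokenized and tagged by some other tool.

```python
# Hypothetical, deliberately tiny lexicon; a real system would use a curated list.
VULGAR = {"damn", "hell"}


def vulgar_features(tagged_tokens, lexicon=VULGAR):
    """Return one feature dict per vulgar word in a tweet.

    tagged_tokens: list of (token, pos_tag) pairs, assumed to come from an
    external tokenizer and part-of-speech tagger.
    """
    features = []
    for i, (token, _pos) in enumerate(tagged_tokens):
        if token.lower() not in lexicon:
            continue
        # Part-of-speech tags of the immediate neighbors, if any.
        prev_pos = tagged_tokens[i - 1][1] if i > 0 else None
        next_pos = tagged_tokens[i + 1][1] if i + 1 < len(tagged_tokens) else None
        features.append({
            "word": token.lower(),
            "at_start": i == 0,  # positional cue: word opens the tweet
            "next_to_adjective": "ADJ" in (prev_pos, next_pos),
        })
    return features


example = [("Damn", "ADV"), ("good", "ADJ"), ("game", "NOUN")]
example_features = vulgar_features(example)
```

A downstream classifier could combine features like these with the annotators' labels; how the authors actually did this is not described here.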
Through a survey, Preoţiuc-Pietro’s group identified some demographic characteristics of the Twitter users who swear. They found that those who used vulgarity online were more likely to be younger, male, politically liberal, and less educated. By filtering for hate speech without considering the differing meanings of vulgarity, “we censor potentially important information and from certain groups,” says Preoţiuc-Pietro.
The workshop at which Preoţiuc-Pietro gave his invited talk, the 4th Workshop on Noisy User-generated Text (W-NUT), is closely related to this research. It examines text that is especially difficult for a machine to understand, often because the literal meaning of the text is at odds with its actual meaning or intent.
Preoţiuc-Pietro is also part of a research team presenting a short paper at EMNLP on the gains to be made in community-level prediction by aggregating tweets by user before aggregating them by county. Titled “The Remarkable Benefit of User-Level Aggregation for Lexical-Based Population-Level Predictions,” the research maps a billion tweets to the counties from which they originate. Twitter has previously been used to measure community health, well-being, and political sentiment, but there hasn’t been much work on how best to aggregate tweets when assessing those or similar metrics at a community level. By grouping the geographically mapped tweets by user before combining them by county, Preoţiuc-Pietro and his co-authors were able to make large and consistent improvements in the accuracy of community-level predictions.
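The paper’s title points to user-level aggregation as the source of the gains. As a minimal sketch of what that distinction could look like, assuming a simple word-frequency feature, the code below compares pooling every tweet in a county with first computing a rate per user and then averaging users within the county. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def pooled_rate(tweets, word):
    """County score when all tweets are pooled: share of the county's tokens equal to `word`."""
    hits, totals = Counter(), Counter()
    for county, _user, tokens in tweets:
        hits[county] += tokens.count(word)
        totals[county] += len(tokens)
    return {c: hits[c] / totals[c] for c in totals if totals[c]}


def user_level_rate(tweets, word):
    """County score when each user's rate is computed first, then averaged over users."""
    hits, totals = Counter(), Counter()
    for county, user, tokens in tweets:
        hits[(county, user)] += tokens.count(word)
        totals[(county, user)] += len(tokens)
    per_county = defaultdict(list)
    for (county, user), total in totals.items():
        if total:
            per_county[county].append(hits[(county, user)] / total)
    return {c: sum(rates) / len(rates) for c, rates in per_county.items()}


# Toy data: one prolific user and one occasional user in the same county.
tweets = [
    ("A", "u1", ["great", "day"]),
    ("A", "u1", ["great", "day"]),
    ("A", "u1", ["great", "day"]),
    ("A", "u2", ["rainy", "day"]),
]
pooled = pooled_rate(tweets, "great")       # {"A": 0.375}
by_user = user_level_rate(tweets, "great")  # {"A": 0.25}
```

In the toy data the prolific user dominates the pooled score, while the user-level score weights both users equally; this illustrates one plausible reason user-level aggregation could help, without claiming it is the paper’s explanation.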
Tsai’s long paper, “Zero-Shot Open Entity Typing as Type-Compatible Grounding,” focuses on a common and important problem known as entity typing, or properly categorizing an entity mention, so that a machine can better understand it and find relevant information. Knowing an entity’s type – Actor? Author? Politician? – is crucial in helping a machine answer questions. “If a question contains entities, it is helpful to understand the type of entity, in context, so that it’s easier for the computer to find answers,” says Tsai. Some type sets used by an algorithm can include thousands of types; others, only a handful.
The method developed by Tsai and his colleagues can be used with any type set and doesn’t need typing-specific supervision. Most previous work relies on costly annotation and cannot generalize to unseen types or out-of-domain data. The idea of grounding mentions to type-compatible entities in Wikipedia makes this system robust across different domains. Tsai’s system works significantly better than other so-called open typing systems, and it performs almost as well as systems that need extensive in-domain training data.
On Saturday, November 3, 2018, Mann delivered a keynote talk that examined the use of natural language technology to help participants in the global capital markets understand world events and breaking business news as it happens. He looked at the state of the art in using natural language processing in finance and highlighted some of the open problems that researchers are currently tackling.
Those problems are part of what brought Tsai to work at Bloomberg. He is quick to say that Bloomberg has “very nice people” and that he enjoys working with them. But, he also notes, “We have good data and good problems. I’m really excited about working on the challenging problems at Bloomberg.”