WSDM 2021: Contextualizing Trending Entities in News Stories

Every day, millions of news stories are produced by media companies and served up to readers. No one can read every story, so these companies curate their front pages and news feeds to showcase the most popular, relevant content. This can now be done automatically using trends.

Trending keywords or phrases are terms that appear frequently in a media environment during a particular timeframe. Media companies use these trends to help readers discover popular content. Different organizations identify and amplify trends in different ways, but whichever way this is accomplished, the presentation of popular, timely content to readers is mission-critical for news organizations and social media companies.

A team of Bloomberg researchers is attempting to push forward the science of identifying trending content with a new paper entitled “Contextualizing Trending Entities in News Stories.” The paper proposes two ways to retrieve and rank information that contextualizes a trending entity. This information takes the form of other key terms, called “contextual entities,” which can help content publishers and readers understand why a particular piece of content is popular.

For example, terms related to the trending entity “Joe Biden” might include “president,” “Kamala Harris,” “United States,” “politics,” “government,” and so on. These contextual entities all help tell the story of who or what the trending entity, “Joe Biden,” is, and why it is trending. Once these entities are pinpointed, they must be ranked by their salience to the trending entity.

The proposed retrieval and ranking of contextual entities is accomplished with two distinct methods. The first is an unsupervised, graph-based algorithm built on Personalized PageRank and entity embeddings. The second is a supervised method based on Learning to Rank, which required the creation of a test collection using crowdsourced annotation.


AI researcher Marco Ponza is presenting the paper this week at the 14th ACM International Conference on Web Search and Data Mining (WSDM 2021) on behalf of his team, which included his colleagues Diego Ceccarelli, Edgar Meij, and Sambhav Kothari, as well as the University of Pisa’s Paolo Ferragina.

On the first day of Ponza’s internship at Bloomberg in October 2019, Diego Ceccarelli, his mentor and future co-author, introduced him to the existing trend identification functionality within Bloomberg, which got him thinking about how to build on it.

“I had the idea that maybe we could contextualize trends using other entities that are found in the news stories,” Ponza says.

He spent the next month planning and determining what data to use, a task he says was the most time-consuming part of the project. Because machine learning models are subject to “garbage in, garbage out,” it was critical that Ponza define the problem clearly, so that crowd workers could annotate the data to a high standard of quality. The subject matter also presented challenges: the financial data that Bloomberg produces requires a certain level of subject matter expertise, and annotators must understand the data to annotate it properly.

Because no labeled data was initially available, Ponza started by working on a sophisticated, unsupervised approach with machine learning models. First, the contextual entities needed to be linked to the trending entity, and then Ponza’s team connected the contextual entities to each other and weighted these relationships using knowledge sourced from Wikipedia. Finally, they employed Personalized PageRank, a standard tool for finding vertices in a graph that are most salient to a query or user.
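To make the graph step more concrete, here is a minimal sketch of Personalized PageRank over a small weighted entity graph. It is an illustration under stated assumptions, not the team’s implementation: the edge weights are made-up placeholders standing in for the Wikipedia-derived relatedness described above, and the function names (`build_graph`, `personalized_pagerank`, `rank_context`), the damping value, and the iteration count are all hypothetical choices.

```python
from collections import defaultdict


def build_graph(weighted_edges):
    """Build an undirected weighted graph as {node: {neighbor: weight}}."""
    graph = defaultdict(dict)
    for a, b, weight in weighted_edges:
        graph[a][b] = weight
        graph[b][a] = weight
    return dict(graph)


def personalized_pagerank(graph, seeds, alpha=0.85, iterations=50):
    """Power iteration where the random walk restarts only at the seed nodes."""
    nodes = list(graph)
    seed_total = sum(seeds.values())
    restart = {n: seeds.get(n, 0.0) / seed_total for n in nodes}
    scores = dict(restart)
    for _ in range(iterations):
        # With probability (1 - alpha) the walk jumps back to a seed.
        nxt = {n: (1 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            out_weight = sum(graph[n].values())
            if out_weight == 0:
                # A node without edges returns its mass to the seeds.
                for m in nodes:
                    nxt[m] += alpha * scores[n] * restart[m]
                continue
            # Otherwise the walk follows an edge in proportion to its weight.
            for m, w in graph[n].items():
                nxt[m] += alpha * scores[n] * w / out_weight
        scores = nxt
    return scores


def rank_context(graph, trending_entity):
    """Rank every other entity by its score relative to the trending entity."""
    scores = personalized_pagerank(graph, {trending_entity: 1.0})
    candidates = [(e, s) for e, s in scores.items() if e != trending_entity]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Hypothetical relatedness weights, invented for this example only.
edges = [
    ("Joe Biden", "Kamala Harris", 0.9),
    ("Joe Biden", "United States", 0.7),
    ("Kamala Harris", "United States", 0.6),
    ("United States", "politics", 0.5),
    ("politics", "government", 0.8),
]
graph = build_graph(edges)
all_scores = personalized_pagerank(graph, {"Joe Biden": 1.0})
ranked = rank_context(graph, "Joe Biden")
for entity, score in ranked:
    print(f"{entity}: {score:.3f}")
```

Because the walk restarts only at the trending entity, entities that are closely connected to it accumulate more score than distant ones, which is what makes the ranking “personalized” to the trend.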

The team found that the supervised method generally performed better than the unsupervised method, with improvements of up to 10 percent on the new enriched dataset that was built and released for this research task. This dataset contains hundreds of trends and thousands of entities. In constructing it, human annotators labeled each entity with a relevance score representing how useful that entity is in explaining the trend.
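As a rough illustration of how graded relevance labels like these can be used to score a ranking, the sketch below computes NDCG@k. This is an assumption made for the example: the article does not say which metric lies behind the reported improvement, and NDCG is simply a common choice when labels are graded rather than binary. The labels, the two rankings, and the function names are all hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: higher grades count more near the top."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances)
    )


def ndcg_at_k(ranked_entities, labels, k):
    """DCG of the top-k ranking divided by the DCG of the ideal ordering."""
    gains = [labels.get(entity, 0) for entity in ranked_entities[:k]]
    ideal = sorted(labels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical annotator grades (0 = not useful, 3 = very useful).
labels = {"Kamala Harris": 3, "president": 2, "United States": 1, "weather": 0}

good_ranking = ["Kamala Harris", "president", "United States", "weather"]
poor_ranking = ["weather", "United States", "president", "Kamala Harris"]

good_score = ndcg_at_k(good_ranking, labels, k=4)
poor_score = ndcg_at_k(poor_ranking, labels, k=4)
print(good_score, poor_score)
```

A ranking that places the most useful entities first scores close to 1, while one that buries them scores lower, so two ranking methods can be compared on the same labeled trends.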

Content producers can leverage this research to serve content to their readers more efficiently and effectively. When contextual entities are successfully retrieved from a knowledge graph and ranked, they can be used to automatically build more useful summaries of a given piece of content, recommendations for further reading, and auto-completion of search suggestions, among other uses.

This project serves as a testament to Bloomberg’s commitment to state-of-the-art research. Because Bloomberg engineers are given time and resources to pursue worthy research projects alongside their day-to-day product work, new employees – and even interns – have the opportunity to get involved in groundbreaking research beginning on their first day at the office.

“In our group, you can dedicate time to work on product and you can also advocate for time for developing your own research,” says Ponza. “And there are meetings every week where you can have other people that are doing similar research help you.”

In this fashion, researchers at Bloomberg are never on their own. When they get stuck, they have easy access to other experts who can help them overcome any roadblocks that stand in their way.