Finding Relevant Facts in a Knowledge Graph: New Work by Bloomberg Researchers Presented at SIGIR 2018

How can you tell if two facts are related to one another? For humans, the question “Is that relevant?” is answered almost instinctively. For computers and algorithms, it’s a bit trickier.

Ridho Reinanda, a London-based researcher in the AI group at Bloomberg, is working with a team to develop an algorithm that, when given a fact, can find other facts that are relevant to it. “There are a small number of facts that might be useful to enrich our understanding of the main fact,” says Reinanda. “If you know that Bill Gates is a founder of Microsoft, it might also be useful to know that Paul Allen was a founder of Microsoft too.” Reinanda says this research represents a new path. “There has been no previous work solving this problem,” he says.

The paper, entitled “Weakly-Supervised Contextualization of Knowledge Graph Facts,” was co-authored by Reinanda and a team of London- and New York-based Bloomberg AI researchers, including Edgar Meij, Abhinav Khaitan, Miles Osborne, Giorgio Stefanoni, and Anju Kambadur. The research was conducted together with Nikos Voskarides, a Ph.D. student from the University of Amsterdam who completed the work while interning at Bloomberg, and his supervisor, Professor Maarten de Rijke, a recipient of the 2017-2018 Bloomberg Data Science Research Grant (who is unable to attend SIGIR due to lengthy delays surrounding his visa application).

Voskarides will present the results of this research this morning at The 41st International ACM Special Interest Group on Information Retrieval (SIGIR 2018) Conference on Research and Development in Information Retrieval, being held this week at the University of Michigan in Ann Arbor, Michigan.

“I’m excited to go back to this conference,” says Reinanda, who previously attended SIGIR while working on his Ph.D. “They’ve got a really interesting program. They try to cover a lot of new developments in information retrieval and search engine technology, in addition to topics like conversational search, recommending news, and personal assistants.”

The first step in this research was simply to find candidate facts, defined as facts that might be related to the main fact. Each fact is composed of two entities (nodes) and the relationship (edge) between them. In natural language, these would be referred to as the subject, the object, and the predicate, respectively. With the exception of a few relationships that could generate thousands of candidate facts on their own, Voskarides et al. decided that any fact one or two hops away in the knowledge graph would qualify as a candidate fact. Even so, a single main fact can yield a set of more than a thousand candidate facts.
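The candidate-generation step can be illustrated with a small sketch. It is not the authors' implementation: the toy triples, the `candidate_facts` function, and the `skip_predicates` parameter are hypothetical, and "hops" is interpreted here as a breadth-first expansion from the two entities of the main fact.

```python
from collections import defaultdict

# Toy knowledge graph as (subject, predicate, object) triples.
# Illustrative data only; not taken from the paper's data set.
TRIPLES = [
    ("Bill Gates", "spouse", "Melinda Gates"),
    ("Bill Gates", "founder_of", "Microsoft"),
    ("Paul Allen", "founder_of", "Microsoft"),
    ("Microsoft", "headquartered_in", "Redmond"),
    ("Redmond", "located_in", "Washington"),
    ("Bill Gates", "nationality", "United States"),
    ("Paul Allen", "nationality", "United States"),
]


def candidate_facts(triples, main_fact, max_hops=2, skip_predicates=frozenset()):
    """Collect facts reachable within max_hops of the main fact's entities.

    skip_predicates is a hypothetical stand-in for the few high-fanout
    relationships the article says were excluded.
    """
    # Index every usable fact under both of its entities.
    by_entity = defaultdict(list)
    for fact in triples:
        subject, predicate, obj = fact
        if predicate in skip_predicates:
            continue
        by_entity[subject].append(fact)
        by_entity[obj].append(fact)

    frontier = {main_fact[0], main_fact[2]}
    seen_entities = set(frontier)
    candidates = set()
    for _ in range(max_hops):
        next_frontier = set()
        for entity in frontier:
            for fact in by_entity[entity]:
                if fact == main_fact:
                    continue  # the main fact is not its own candidate
                candidates.add(fact)
                for neighbor in (fact[0], fact[2]):
                    if neighbor not in seen_entities:
                        seen_entities.add(neighbor)
                        next_frontier.add(neighbor)
        frontier = next_frontier
    return candidates


main = ("Bill Gates", "spouse", "Melinda Gates")
for fact in sorted(candidate_facts(TRIPLES, main, skip_predicates={"nationality"})):
    print(fact)
```

With this toy graph, the Paul Allen founder fact is reached on the second hop through Microsoft, while the fact about Redmond's location is three hops away and is left out.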

The next step was to rank the candidate facts by relevance. To do that, Voskarides et al. turned the main fact into a query and considered each candidate fact as a potential answer to that query. By combining hand-crafted features and a neural network, Voskarides et al. found they could develop a good ranking model.

Continuing on the Bill Gates theme, Reinanda suggests “Bill Gates/Spouse/Melinda Gates” as a query fact. A candidate fact might be that Paul Allen is a founder of Microsoft. Microsoft and Bill Gates can be connected directly, and Paul Allen can be connected to Bill Gates through a variety of paths. To better represent this, Voskarides et al.’s algorithm aggregates the contextual information from all the paths leading to the facts being considered and runs it through a neural network model that is optimized for this fact-ranking task.

Because the discovery of related facts is a novel problem, the team had to devise their own way of determining how well their solution works. They came up with two methods for collecting relevance labels: a distant supervision procedure that uses Wikipedia data to assign annotations automatically, and one that relies on human annotators. In the Gates example, they would go to the Wikipedia page for Melinda Gates and look for mentions of Bill Gates. Then they would check whether Paul Allen is mentioned in the same sentence as Bill Gates, in the sentence immediately before that mention, or in the sentence immediately after it. If so, the fact is considered relevant. The data generated through this distant supervision procedure is used to train the ranking model.
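The sentence-window rule described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: the `split_sentences` and `is_relevant` helpers are hypothetical, entity mentions are matched by plain substring search (a real pipeline would more likely rely on entity links), and the sample text is invented for the example rather than taken from Wikipedia.

```python
import re


def split_sentences(text):
    """Naive sentence splitter; an assumption made to keep the sketch short."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def is_relevant(sentences, anchor, candidate, window=1):
    """Return True if `candidate` appears in the same sentence as a mention
    of `anchor`, or within `window` sentences before or after it."""
    for i, sentence in enumerate(sentences):
        if anchor not in sentence:
            continue
        start = max(0, i - window)
        end = min(len(sentences), i + window + 1)
        if any(candidate in s for s in sentences[start:end]):
            return True
    return False


# Invented placeholder text standing in for an article about Melinda Gates.
TEXT = (
    "Melinda Gates joined Microsoft. She later married Bill Gates. "
    "Bill Gates started Microsoft with Paul Allen. The company grew quickly. "
    "Redmond is a city in Washington."
)

sentences = split_sentences(TEXT)
print(is_relevant(sentences, "Bill Gates", "Paul Allen"))
print(is_relevant(sentences, "Bill Gates", "Redmond"))
```

In this toy text, Paul Allen shares a sentence with Bill Gates and is labeled relevant, while Redmond is mentioned two sentences after the last Bill Gates mention and falls outside the window.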

For the human annotators, it worked a bit differently. For each fact, crowdsourced annotators were presented with a candidate fact and asked, “If you were writing a description of this fact, would you include this additional candidate fact?” Facts that elicited an answer of “definitely” were considered the most relevant, followed by those the annotators said they would include only if they had enough space. This manually annotated data is then used for evaluation.
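To show how graded answers like these can score a ranking, here is a minimal sketch. The article does not say which metric the team used; NDCG is a common choice for graded relevance and is an assumption here, as are the numeric grades, the label names in `GRADE`, and the `dcg` and `ndcg` helpers.

```python
import math

# Hypothetical mapping from annotator answers to numeric grades.
GRADE = {"definitely": 2, "if_space": 1, "no": 0}


def dcg(grades):
    """Discounted cumulative gain with the common (2^grade - 1) gain."""
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(grades))


def ndcg(ranked_labels, k=None):
    """NDCG of a ranking, given annotator labels in ranked order."""
    grades = [GRADE[label] for label in ranked_labels]
    ideal = sorted(grades, reverse=True)
    if k is not None:
        grades, ideal = grades[:k], ideal[:k]
    best = dcg(ideal)
    return dcg(grades) / best if best > 0 else 0.0


# A ranking that places a "definitely" fact below an irrelevant one
# scores lower than the ideal ordering of the same facts.
print(ndcg(["definitely", "if_space", "no"]))
print(ndcg(["no", "if_space", "definitely"]))
```

A ranking that lists every "definitely" fact first, then the "if space" facts, then the rest, scores 1.0 under this metric.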

In their experiments, the learning-to-rank method outperformed all baselines by a large margin, thereby demonstrating that the distant supervision procedure is useful for learning a supervised ranking function for this task. More details on the task, the researchers’ approach, and the evaluation data set created for this task can be found in the paper.

Unveiling the State of the Art

On Sunday, July 8, Edgar Meij, a senior researcher leading Bloomberg’s graph analytics efforts, joined the University of New Hampshire’s Laura Dietz and Wayne State University’s Alexander Kotov to host a half-day tutorial at the conference, “Utilizing Knowledge Graphs for Text-centric Information Retrieval.” Meij is also one of the organizers of the full-day KG4IR’18 (Knowledge Graphs and Semantics for Text Retrieval, Analysis and Understanding) Workshop, which is co-located with SIGIR ’18 and takes place on Thursday, July 12, 2018. The workshop will look at the use of knowledge graphs and similar semantic resources for information retrieval applications. Because the conference is quite general, Meij says it will be helpful, especially for new Ph.D. students, to have a more intimate setting in which to discuss these particular interests.

Bloomberg senior data scientist Edgar Meij (Photographer: Jason Alden)

The workshop, says Meij, will be appealing to many of the industry and academic researchers at the conference. He also expects it to attract academics who are interested in what’s going on in industry. This particular workshop will feature a joint panel with CAIR’18, another workshop on conversational approaches to information retrieval, or chatbots. “We decided to reach out to them and see if they would have a shared panel and to see if there is interest and room for collaboration, as knowledge graphs are a key enabling technology for conversational agents,” adds Meij.

The tutorial on Sunday was aimed more at those who are newer to the field, to help them get up to speed on some of the newest and most interesting research. It gave participants a window into the state of the art in knowledge graph research and illuminated the thinking of researchers who are trying to improve the search experience.

Both events, says Meij, help support community development within the field. He and the other attendees get to see “who else is building similar things around the world, who’s really pushing the envelope, and where the wildest ideas are coming from.”

He’ll also get some strong hints as to where the next batch of smart researchers is coming from, which is obviously helpful in recruiting. His own career is a perfect example: Three years ago, he was working at Yahoo! Labs when he came to SIGIR ’15 and saw researchers from Bloomberg present a paper called “Finding Money in the Haystack: Information Retrieval at Bloomberg.”

“That particular talk really resonated and stuck in my mind,” says Meij. “It’s one of the reasons I’m now working at Bloomberg.”