SIGIR 2018 Research: Weakly-supervised Contextualization of Knowledge Graph Facts

This page describes the gold standard data set we developed for the evaluation of the algorithms in our SIGIR 2018 paper.

Nikos Voskarides, Edgar Meij, Ridho Reinanda, Abhinav Khaitan, Miles Osborne, Giorgio Stefanoni, Prabhanjan Kambadur and Maarten de Rijke: Weakly-supervised Contextualization of Knowledge Graph Facts. 2018. In Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’18).

Knowledge graphs (KGs) model facts about the world; they consist of nodes (entities such as companies and people) that are connected by edges (relations such as founderOf). Facts encoded in KGs are frequently used by search applications to augment result pages. When presenting a KG fact to an end user, providing other facts that are pertinent to that main fact can enrich the user experience and support exploratory information needs. KG fact contextualization is the task of augmenting a given KG fact with additional, useful KG facts. The task is challenging because KGs are large: discovering other relevant facts even in a small neighborhood of the given fact yields an enormous number of candidates.

We introduce a neural fact contextualization method (NFCM) to address the KG fact contextualization task. NFCM first generates a set of candidate facts in the neighborhood of a given fact and then ranks the candidate facts using a supervised learning-to-rank model. The ranking model combines two kinds of features: features that are learned automatically from data to represent the query-candidate fact pair, and hand-crafted features that we devised or adapted for this task.
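The two-stage structure described above (generate candidates, then rank them) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: facts are modeled as (relation, subject, object) tuples, the neighborhood is restricted to facts that share an entity with the query fact, and toy_score is a hypothetical placeholder standing in for the learned ranking model, which is not reproduced here.

```python
# Toy knowledge graph: each fact is a (relation, subject, object) tuple (assumed layout).
kg = [
    ("founderOf", "Bill Gates", "Microsoft"),
    ("spouseOf", "Bill Gates", "Melinda Gates"),
    ("parentOf", "Melinda Gates", "Jennifer Gates"),
    ("founderOf", "Paul Allen", "Microsoft"),
]


def candidate_facts(kg, query_fact):
    """Return every other fact that shares at least one entity with the query fact.

    This one-hop neighborhood is a simplification chosen for the sketch.
    """
    _, subject, obj = query_fact
    entities = {subject, obj}
    return [
        fact for fact in kg
        if fact != query_fact and (fact[1] in entities or fact[2] in entities)
    ]


def toy_score(query_fact, candidate):
    """Hypothetical placeholder scorer, NOT the learned model from the paper.

    Counts shared entities and adds 1 when the relation is the same.
    """
    shared = len({query_fact[1], query_fact[2]} & {candidate[1], candidate[2]})
    same_relation = 1 if query_fact[0] == candidate[0] else 0
    return shared + same_relation


def rank(query_fact, candidates, score):
    """Sort candidates by descending score with respect to the query fact."""
    return sorted(candidates, key=lambda fact: score(query_fact, fact), reverse=True)


query = kg[0]
candidates = candidate_facts(kg, query)
ranked = rank(query, candidates, toy_score)
```

In a real system the scorer would be the trained ranking model; the sketch only shows where such a model plugs in.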

To obtain the annotations required to train the learning-to-rank model at scale, we generate training data automatically using distant supervision on a large entity-tagged text corpus. We show that ranking functions learned on this data are effective at contextualizing KG facts. Evaluation with human assessors shows that NFCM significantly outperforms several baselines. For this evaluation, we used crowdsourcing to develop a human-curated evaluation set, which can be found here. This page provides more details on the process we used.

The procedure we use to construct this evaluation dataset is as follows. First, for each of the 65 relationships we consider, we sample five query facts of the relationship. Since fact enumeration for a query fact can yield hundreds or thousands of facts, it is infeasible to consider all candidate facts for manual annotation. Therefore, we only include a candidate fact in the set of facts to be annotated if:

  • the candidate fact was deemed relevant by the automatic data gathering procedure (as detailed in the paper) or
  • the candidate fact matches a fact pattern that is built using relevant facts that appear in at least 10% of the query facts of a certain relationship. An example fact pattern is parentOf<?, ?>, which would match the fact parentOf<Bill Gates, Jennifer Gates>.
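The second criterion can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the procedure used to build the dataset: facts are (relation, subject, object) tuples, a pattern wildcards both arguments as in the parentOf<?, ?> example, and the function names and the relevant_by_query mapping are hypothetical.

```python
from collections import Counter


def fact_pattern(fact):
    """Abstract a fact into a pattern by replacing both arguments with '?'."""
    relation, _subject, _obj = fact
    return f"{relation}<?, ?>"


def frequent_patterns(relevant_by_query, min_fraction=0.10):
    """Return patterns that occur among the relevant facts of at least
    min_fraction of the query facts of one relationship.

    relevant_by_query maps each query fact to its list of relevant facts.
    """
    counts = Counter()
    for relevant_facts in relevant_by_query.values():
        # Count each pattern at most once per query fact.
        counts.update({fact_pattern(fact) for fact in relevant_facts})
    threshold = min_fraction * len(relevant_by_query)
    return {pattern for pattern, count in counts.items() if count >= threshold}


def matches(fact, patterns):
    """A candidate fact matches if its abstracted pattern is in the pattern set."""
    return fact_pattern(fact) in patterns


# Illustrative input with made-up placeholder entities for the second query fact.
relevant_by_query = {
    ("spouseOf", "Bill Gates", "Melinda Gates"): [
        ("parentOf", "Bill Gates", "Jennifer Gates"),
    ],
    ("spouseOf", "Person A", "Person B"): [
        ("parentOf", "Person A", "Person C"),
        ("bornIn", "Person A", "City X"),
    ],
}

# A stricter fraction than 10% is used here only because the toy input is tiny.
patterns = frequent_patterns(relevant_by_query, min_fraction=1.0)
```

With this toy input, only the parentOf pattern appears for every query fact, so a candidate such as parentOf<Bill Gates, Jennifer Gates> would be kept for annotation while a bornIn fact would not.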

We use the CrowdFlower platform and ask the annotators to judge a candidate fact with respect to its relevance to a query fact. We provide the annotators with the following scenario:

We are given a specific real-world fact, e.g., “Bill Gates is the founder of Microsoft,” which we call the query fact. We are interested in writing a description of the query fact (a sentence or a small paragraph). The purpose of this assessment task is to identify other facts that could be included in a description of the query fact. Note that even though all facts presented for assessment will be accurate, not all will be relevant or equally important to the description of the main fact.

We ask the annotators to assess the relevance of a candidate fact on a three-point graded scale:

  • Very relevant – I would include the candidate fact in the description of the query fact; the candidate fact provides additional context to the query fact.
  • Somewhat relevant – I would include the candidate fact in the description of the query fact, but only if there is space.
  • Irrelevant – I would not include the candidate fact in the description of the query fact.

Alongside each query-candidate fact pair, we provide a set of extra facts that annotators can use to decide on the relevance of a candidate fact. These include facts that connect the entities in the query fact with the entities in the candidate fact. For example, if we present the annotators with the query fact spouseOf<Bill Gates, Melinda Gates> and the candidate fact parentOf<Melinda Gates, Jennifer Gates>, we also show the fact parentOf<Bill Gates, Jennifer Gates>. Each query-candidate fact pair is annotated by three annotators. We use majority voting to obtain the gold labels, breaking ties arbitrarily. By following this crowdsourcing procedure, we obtain 28,281 fact judgments for 2,275 query facts (65 relations, 5 query facts each).
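The label aggregation step can be sketched in Python as below. The label strings and the function name are assumptions for illustration, and a random choice among the tied labels is just one possible way to break ties arbitrarily.

```python
import random
from collections import Counter


def majority_label(labels, rng=random):
    """Return the most frequent label; if several labels tie, pick one at random.

    rng can be the random module or a random.Random instance (for reproducibility).
    """
    counts = Counter(labels)
    top_count = max(counts.values())
    tied = sorted(label for label, count in counts.items() if count == top_count)
    if len(tied) == 1:
        return tied[0]
    return rng.choice(tied)


# Three annotators, two of whom agree: the majority label wins.
clear_case = majority_label(["very relevant", "very relevant", "irrelevant"])

# Three annotators who all disagree: the tie is broken arbitrarily.
tie_labels = ["very relevant", "somewhat relevant", "irrelevant"]
tie_case = majority_label(tie_labels, rng=random.Random(0))
```

With three annotators and three grades, a tie arises only when all three annotators choose different grades.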