If Sir Tim Berners-Lee and Vinton Cerf are both speaking at a conference, it must be a pretty big deal. That’s exactly what will be happening at The Web Conference (April 23-27, 2018) in Lyon, France, where two London-based Bloomberg research scientists will also be presenting papers. Each will present his paper on Wednesday, April 25th, in a research track entitled Web Content Analysis, Semantics, and Knowledge.
Giorgio Stefanoni will be attending The Web Conference for the first time. The conference, he says, “has a very good reputation, and is known for having a high impact factor.” Mohamed Yahya has been to the event twice before. “This is one of the top conferences,” he says. “It is a place where you can meet very important people. But you also meet people from much more diverse backgrounds and interests compared to other events.”
Stefanoni’s research, entitled “Estimating the Cardinality of Conjunctive Queries over RDF Data Using Graph Summarisation,” tackles the problem of estimating how many answers a query might have when run against RDF (Resource Description Framework) data, the format commonly used to serve knowledge graphs. The number of possible answers is an important factor in deciding how those answers should be retrieved: a poor retrieval method could take hours to return results, while a more suitable one could find those same results within seconds.
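As a toy illustration (the data and names here are hypothetical, not from the paper), the cost of evaluating the same conjunctive query can differ enormously depending on which triple pattern is processed first, which is why good answer-count estimates matter:

```python
# Toy illustration: an RDF-style graph with 1,000 people,
# only two of whom work at "acme".
triples = [("p%d" % i, "type", "Person") for i in range(1000)]
triples += [("p0", "worksAt", "acme"), ("p1", "worksAt", "acme")]

def evaluate(patterns):
    """Evaluate triple patterns over ?x left to right, intersecting the
    candidate subjects; return the intermediate result size at each step."""
    candidates, sizes = None, []
    for _, pred, obj in patterns:  # each pattern is (?x, predicate, object)
        matches = {s for (s, p, o) in triples if p == pred and o == obj}
        candidates = matches if candidates is None else candidates & matches
        sizes.append(len(candidates))
    return sizes

# Starting with the unselective pattern carries 1,000 bindings forward;
# starting with the selective one carries only 2.
print(evaluate([("?x", "type", "Person"), ("?x", "worksAt", "acme")]))  # [1000, 2]
print(evaluate([("?x", "worksAt", "acme"), ("?x", "type", "Person")]))  # [2, 2]
```

Both orders return the same two answers, but a planner that knew the counts in advance would clearly pick the second order.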
The most widely-used methods for estimating the number of answers to a query are often wildly inaccurate ‒ by many orders of magnitude. “The answer could be 10, but the estimate could be in the millions,” says Stefanoni. Another problem: “The systems just output a number. There’s never really an interpretation of the number, and they don’t tell you how confident they are in the number.”
The estimates are so far off partly because of the way they are derived. Often, they are formed by breaking up the query into small pieces, and then estimating the number of answers to each piece independently using one-dimensional statistics of the data. These partial estimates are then combined using ad-hoc approaches to obtain an estimate of the number of answers to the full query. “Putting it all back together is where you run into problems,” says Stefanoni. “The data are not independent of each other.”
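A small sketch (with hypothetical, deliberately correlated data) shows how the independence assumption breaks down: each triple pattern is counted on its own and the counts are combined as if the patterns were unrelated, so correlated data produces a badly wrong estimate.

```python
# Hypothetical data: 100 people work at "acme" and a *different* 100 people
# live in "london" -- the two patterns are perfectly anti-correlated.
triples = [("emp%d" % i, "worksAt", "acme") for i in range(100)]
triples += [("res%d" % i, "livesIn", "london") for i in range(100)]

def true_count(p1, p2):
    """Exact number of answers to the join: ?x p1 ?y . ?x p2 ?z"""
    return sum(1 for (s1, q1, _) in triples if q1 == p1
                 for (s2, q2, _) in triples if q2 == p2 and s2 == s1)

def independent_estimate(p1, p2):
    """Combine per-pattern counts, assuming each pattern's subjects are
    spread independently over all subjects (selectivity 1/|subjects|)."""
    subjects = {s for (s, _, _) in triples}
    c1 = sum(1 for (_, q, _) in triples if q == p1)
    c2 = sum(1 for (_, q, _) in triples if q == p2)
    return c1 * c2 / len(subjects)

print(true_count("worksAt", "livesIn"))           # 0 -- nobody matches both
print(independent_estimate("worksAt", "livesIn"))  # 50.0 -- wildly off
```

The independence-based estimate of 50 answers, when the true count is zero, is a miniature version of the millions-versus-ten discrepancy Stefanoni describes.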
Stefanoni’s research looks at this problem differently. Instead of breaking up the RDF graph into parts and trying to summarize each part, his work attempts to summarize the graph in its entirety. This is done mainly by collapsing nodes that are similar. Because of this compression, some information is lost, and it’s possible to create a variety of graphs that are compatible with the new compressed graph. The estimate becomes the average of all the graphs that are compatible with the summary. Stefanoni and fellow researchers at the University of Oxford, Boris Motik and Egor Kostylev, have come up with a formula that can compute the estimate just by looking at the summary. Importantly, it also outputs a confidence factor, so that the user knows how likely it is that a particular query will overrun the estimate by a significant amount.
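The idea of summarizing the whole graph and averaging over compatible graphs can be sketched very roughly as follows. This is a hedged simplification, not the paper’s actual algorithm: here “similar” nodes are just nodes with the same set of outgoing predicates, and the estimate assumes each bucket’s edges are spread uniformly over its members.

```python
from collections import Counter, defaultdict

# Hypothetical toy graph.
triples = [
    ("alice", "worksAt", "acme"), ("alice", "livesIn", "london"),
    ("carol", "worksAt", "initech"), ("carol", "livesIn", "paris"),
    ("bob", "worksAt", "acme"),
]

def build_summary(triples):
    """Collapse nodes with the same outgoing-predicate set into one bucket,
    keeping only bucket sizes and per-predicate edge counts."""
    out_preds = defaultdict(set)
    for s, p, _ in triples:
        out_preds[s].add(p)
    bucket_of = {s: frozenset(ps) for s, ps in out_preds.items()}
    sizes = Counter(bucket_of.values())          # bucket -> number of nodes
    edges = Counter((bucket_of[s], p) for s, p, _ in triples)
    return sizes, edges

def estimate_star(sizes, edges, p1, p2):
    """Expected answers to ?x p1 ?y . ?x p2 ?z, averaged over the graphs
    compatible with the summary (edges placed uniformly within a bucket)."""
    return sum(edges[(b, p1)] * edges[(b, p2)] / n for b, n in sizes.items())

sizes, edges = build_summary(triples)
print(estimate_star(sizes, edges, "worksAt", "livesIn"))  # 2.0
```

Because alice and carol land in the same bucket and bob in another, the summary-based estimate (2.0) matches the true answer count exactly here; the paper’s contribution includes a formula of this flavor that also reports how confident the estimate is.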
This method produces estimates that are far more accurate than earlier ones. “On some queries, we may get almost the right number, by a factor of two,” says Stefanoni. “A standard system would be off by a factor of 1,000.”
A Better Way to Train Question Answering Systems
One of the factors limiting research and the use of machine learning is the amount of data needed to train any machine learning system. Yahya and his collaborators at the Max Planck Institute for Informatics, Abdalghani Abujabal, Rishiraj Saha Roy and Gerhard Weikum, have found a way to match the results of typical question answering systems while starting with dramatically less training data, and to improve over time. They’ve published their findings in a paper entitled “Never-Ending Learning for Open-Domain Question Answering over Knowledge Bases.” The secret: strategically inserting a very small amount of human feedback to gently nudge the machine along the correct path.
Ideally, the questions used to train a machine learning system will be similar to the ones that users actually ask. If they’re not, the system will make lots of mistakes and have no chance to improve itself by learning from those mistakes. In Yahya’s work, “even if the system doesn’t know how to answer a question, it can gracefully recover by backing off and taking a different, less accurate approach to answering it.” The system then tells the user how the answer was found, and asks if the answer looks correct. This information is used to improve the system for subsequent users, reducing the amount of feedback required from users over time.
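The back-off-and-learn loop described above can be sketched like this. The class and method names are hypothetical illustrations, not the authors’ actual system: a question either matches a learned template or falls back to a cruder strategy, and confirmed fallback answers become templates for later users.

```python
class NeverEndingQA:
    """Hedged sketch of a question answering loop with user feedback."""

    def __init__(self):
        # learned mapping from questions to confirmed answers
        self.templates = {}

    def answer(self, question):
        """Use a learned template if possible; otherwise back off to a
        less accurate fallback, and report which path was taken."""
        if question in self.templates:
            return self.templates[question], "matched a learned template"
        return self._keyword_fallback(question), "fell back to keyword search"

    def _keyword_fallback(self, question):
        # stand-in for a cruder keyword-matching strategy
        return "best guess for: " + question

    def record_feedback(self, question, answer, correct):
        # A user-confirmed fallback answer becomes a template, so later
        # users get the accurate path and need to give less feedback.
        if correct:
            self.templates[question] = answer

qa = NeverEndingQA()
ans, how = qa.answer("Where is the Louvre?")   # fell back to keyword search
qa.record_feedback("Where is the Louvre?", "Paris", correct=True)
ans, how = qa.answer("Where is the Louvre?")   # now matches a learned template
```

The “explanation” string returned alongside each answer plays the role of telling the user how the answer was found, which is what makes the feedback meaningful.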
User feedback turns out to be immensely important. Yahya estimates it can reduce the amount of training data needed to get a question answering system up and running by 90 percent. The system starts out being correct about 20 percent of the time, but it reaches about 50 percent accuracy over time. “The numbers sound low, but this is state of the art,” says Yahya.
Crucially, the system also provides information relating to how an answer was derived. As a user, “I need to have confidence that the system actually understood the question as I intended it to be understood,” says Yahya. The explanation allows people to make corrections, and most importantly, to have the confidence that the answer is correct.
Says Yahya, “If users have to do a tiny amount of work to get a huge improvement in results, they are probably willing to do that.”