In mid-2017, Kazi Shefaet Rahman of Bloomberg’s NLP group was looking at how he could improve the alerts that Bloomberg sends to its customers. To produce the alerts, an algorithm has to analyze tweets and tag them accurately. Rahman thought there might be a way to increase the accuracy of that tagging – not just for tweets, but for any type of news.
Rahman was wrestling with one of the fundamental problems in natural language processing, often known as entity disambiguation. Given a word with multiple meanings or uses – “Washington,” for example – how can an algorithm accurately choose the proper meaning, be that the President, the city, the state, a street, a particular sports team, or something else?
One of Rahman’s colleagues, Yi Yang, thought he had a way to do it. “We started by looking at the product internally, and trying to improve it,” says Yang. “But when we saw how close we were to beating the state-of-the-art, we realized that with some more effort we could do it.”
The resulting paper authored by Yang, Rahman, and their colleague Ozan Irsoy titled “Collective Entity Disambiguation with Structured Gradient Boosting,” does indeed improve on earlier solutions to this problem. It is one of three papers authored by Bloomberg engineers to be presented today at the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT), being held in New Orleans from June 1-6, 2018.
To solve this problem, the three senior research scientists employed a combination of two machine learning methods: structured prediction and gradient tree boosting.
Structured prediction attempts to resolve all the possible meanings of various mentions of entities in a document simultaneously. To do this, it must consider not only the candidates for every individual mention, but all possible combinations of every mention, which is computationally expensive.
Gradient tree boosting uses a small sample of features to measure similarities between potential answers and any given mention. It can readily work with features with different scales coming from different sources and combine them in an optimal way. It can also bucket ranges of potential answers into nodes, making the method more efficient and adaptable.
The Bloomberg researchers combined these two methods with a strategy they refer to as bi-directional beam search with golden path. (“Golden path” refers to the fact that while the algorithm was being trained, the researchers made sure the correct answers were retained in hopes that the algorithm would have better odds of finding the correct answers once it was operating on its own.)
To explain bi-directional beam search, the authors use the example of a sentence containing the entities “Cook,” “Apple,” and “Cupertino.” The beam search will look at “Cook” and retain only the most likely answers. Then it will move on to “Apple” and, again, keep only the most likely answers. By constantly triaging possible answers and holding onto only the most likely, beam search greatly decreases computing costs.
By tossing aside some less-likely answers, however, it’s possible that the beam search will inadvertently miss the correct one. An algorithm that analyzed “Cook” as roughly “chef,” and “Apple” as a tree fruit, might be stumped at “Cupertino.” Says Irsoy: “It has no idea because it dropped all the other correct candidates.”
The bi-directional beam search helps prevent this by considering the sentence not only from left to right, but also from right to left. In this case, the algorithm would also encounter “Apple” and “Cupertino” together, thereby increasing the likelihood that “Cook” would be recognized as “Tim Cook,” Apple’s CEO. “The bi-directional beam search helps you make a good decision overall when you have multiple candidates for every entry,” says Irsoy.
Most algorithms, says Yang, would look at “Cook” by itself and a few words around it, and then repeat that process with Cupertino and Apple. “If there is a really long document, you are making lots of independent decisions,” he says. “You’re not able to ask if you have coherence among all the predictions you’ve made. This approach allows you to look at long documents with long-range dependencies.”
Irsoy and Yang say that further research could perhaps improve the algorithm by finding a better way to extract different features of the various candidate mentions, or to make the beam search even better at approximating a wider range of choices.
But Rahman points out that it’s often not the algorithm that needs more attention. “Sometimes it’s not a limitation of the model, but a reflection of the resources available,” he says. “Availability of data resources is still the bottleneck and will continue to be sought after.” Bloomberg researchers are fortunate, he says, because they work with a dataset of people, companies, and financial information that has been assembled and curated over the past 35 years. “By leveraging those resources to inform our model,” says Rahman, “we can gain more of an edge than any improvement to the model.”