Chen-Tse Tsai, a research scientist in the artificial intelligence group at Bloomberg, has been working on a problem that seems simple enough: Given a name in a text, find the correct Wikipedia page.
But consider that many names appear on dozens of Wikipedia pages. The word “Chicago,” for example, could refer to the city, the University of Chicago, or even the band – and that’s just for starters. Plus, what if the name is not in English? What if it is usually transliterated, so that the phonetic sound, rather than any literal meaning, is preserved in the new language?
Name translation is an especially helpful step in general translation, says Tsai. If the names in the text can be identified, it’s often possible to get an idea of what the text is about. “Some facts are only stated in foreign-language texts. By grounding them to the English-language Wikipedia, we can get more information,” says Tsai. “It’s a pretty huge problem.” The English-language Wikipedia, says Tsai, is the most useful Wikipedia for this sort of research, simply because it’s the largest.
Tsai says a three-step process can be used to untangle cross-lingual wikification. The first is to successfully identify named entities (people, locations, organizations, etc.) in a foreign text. That might sound simple, but, he says, “This is still a very challenging problem.”
The second step is to identify possible English Wikipedia pages for each foreign-language name. This was the subject of Tsai’s work. His goal was to pick the 30 most likely Wikipedia candidates for any name, hoping that the best match would be among them.
The third step is a ranking problem, which Tsai did not attempt in this particular research.
In his research, Tsai devised a new way to surface likely Wikipedia matches. He showed that his model outperformed six others, sometimes by impressive margins. Tsai presented his work on Tuesday, February 6, 2018, at the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), held in New Orleans and sponsored by the Association for the Advancement of Artificial Intelligence (AAAI).
To identify possible Wikipedia pages for a given foreign-language name, Tsai says most researchers have tried to use a dictionary. If a name exists in multiple languages, Wikipedia will often link between them. This doesn’t work as well as one might suspect, though: Tsai points out that for Spanish-language mentions, this method yields the correct English name in about 40 percent of cases. The smaller the non-English Wikipedia, the harder it is to consistently find matches.
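To make the dictionary approach concrete, here is a minimal sketch of a lookup over inter-language links. The link table, the function names, and the sample mentions are all invented for illustration; real links would come from Wikipedia itself, and this is not the code any of the researchers used.

```python
from typing import Dict, List, Optional

# Hypothetical inter-language links: Spanish title -> English title.
# Illustrative entries only, not taken from a real Wikipedia dump.
INTERLANGUAGE_LINKS: Dict[str, str] = {
    "Estados Unidos": "United States",
    "Nueva Orleans": "New Orleans",
    "Universidad de Chicago": "University of Chicago",
}


def lookup_english_title(mention: str, links: Dict[str, str]) -> Optional[str]:
    """Return the linked English title only if the mention is an exact key."""
    return links.get(mention.strip())


def dictionary_hit_rate(mentions: List[str], links: Dict[str, str]) -> float:
    """Fraction of mentions for which the dictionary returns any title."""
    if not mentions:
        return 0.0
    hits = sum(1 for m in mentions if lookup_english_title(m, links) is not None)
    return hits / len(mentions)
```

The weakness is visible right away: a mention such as "Universidad de Nueva Orleans" returns nothing, even though every word in it is covered by some other entry in the table.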
Tsai’s methodology is more advanced. Instead of simply looking up a name, his model tries to generalize from the entire dictionary. It looks at all the title pairs joined by inter-language links, and then tries to learn how to translate them. In other words, the links themselves are used as training data. In Spanish, Tsai says, there are about 10,000 title pairs that he used to train the model, while there are only about 1,000 in Tagalog.
Tsai’s model also pays special attention to the order of the words in a name, which can vary between different languages. While a transliterated word may be most likely to show up as the third word in a foreign phrase, for instance, it might more commonly be found as the first word in English. The key idea in the proposed model is to consider word alignment and word transliteration jointly. Tsai says a better understanding of word order helps the model better handle transliterations, and vice versa.
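The interplay between word order and word-level matching can be illustrated with a toy scorer. This is not Tsai's model: it swaps in a plain string-similarity measure where his model learns transliteration, and it searches word alignments by brute force. The translation table, the stopword list, and the function names are assumptions made up for this sketch.

```python
from difflib import SequenceMatcher
from itertools import permutations
from typing import List

# Hypothetical word-level translations; illustrative only.
WORD_TRANSLATIONS = {"universidad": "university", "nueva": "new"}
# Hypothetical function words ignored during alignment.
STOPWORDS = {"de", "of", "the", "la", "el"}


def word_score(foreign: str, english: str) -> float:
    """Score one word pair: a known translation scores 1.0; otherwise
    fall back to character similarity as a crude transliteration proxy."""
    if WORD_TRANSLATIONS.get(foreign) == english:
        return 1.0
    return SequenceMatcher(None, foreign, english).ratio()


def content_words(title: str) -> List[str]:
    """Lowercase, split on whitespace, and drop function words."""
    return [w for w in title.lower().split() if w not in STOPWORDS]


def alignment_score(foreign_title: str, english_title: str) -> float:
    """Best total word score over all one-to-one alignments, in any order,
    divided by the longer word count so unmatched words lower the score."""
    f = content_words(foreign_title)
    e = content_words(english_title)
    if not f or not e:
        return 0.0
    swapped = len(f) > len(e)
    short, long_ = (e, f) if swapped else (f, e)
    best = 0.0
    for perm in permutations(long_, len(short)):
        total = sum(
            word_score(l, s) if swapped else word_score(s, l)
            for s, l in zip(short, perm)
        )
        best = max(best, total)
    return best / len(long_)
```

Because every ordering is tried, "Universidad de Chicago" scores the same against "University of Chicago" and "Chicago University", while the bare title "Chicago" scores lower because one foreign word is left unmatched. Brute-force permutation is only workable for short titles.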
Most research of this type focuses on the biggest languages, says Tsai – English, Spanish, and Chinese. But a more generalized model, such as his, can be extended to other languages with less Wikipedia coverage. Notably, Tsai’s method achieves impressive candidate generation coverage with Tagalog (73 percent), Italian (66 percent), and Bengali (65 percent). (Arabic and Hebrew proved more difficult, at 37 percent and 46 percent, respectively.)
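One plausible reading of "candidate generation coverage" is the share of mentions whose correct English title appears somewhere in the candidate list. The sketch below assumes that definition, which the article does not spell out. The function names and the default cutoff of 30 (taken from the candidate count mentioned earlier) are choices made for this example.

```python
from typing import Dict, List


def top_k(scored_candidates: Dict[str, float], k: int = 30) -> List[str]:
    """Keep the k highest-scoring candidate titles, best first."""
    ranked = sorted(scored_candidates.items(), key=lambda kv: kv[1], reverse=True)
    return [title for title, _ in ranked[:k]]


def coverage(candidates_per_mention: List[List[str]], gold_titles: List[str]) -> float:
    """Fraction of mentions whose gold title is among their candidates.
    Assumes the two lists are parallel (one entry per mention)."""
    if not gold_titles:
        return 0.0
    hits = sum(
        1 for cands, gold in zip(candidates_per_mention, gold_titles) if gold in cands
    )
    return hits / len(gold_titles)
```

Under this definition, coverage says nothing about where in the list the correct page sits. That is the job of the separate ranking step.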
Tsai says his research relates to the work he does at Bloomberg in the field of information extraction and disambiguation. He has worked on several multilingual and cross-lingual problems. It would be ideal to create something specific to each language, says Tsai, but that can be difficult and slow, and each tool is of limited utility. “There is a need to cover more languages using the existing data we have,” he says. “We want this model to cover as many as possible.”