Bloomberg Machine Learning & NLP Researchers Collaborate with Academic Teams to Publish Papers at AAAI 2019

Bloomberg researchers are familiar faces at academic conferences focused on artificial intelligence and machine learning. At the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19) held January 27-February 1, 2019 in Honolulu, Hawaii, and the co-located 9th Symposium on Educational Advances in Artificial Intelligence (EAAI-19) held January 28-29, 2019, Bloomberg’s Daniel Preoţiuc-Pietro, NLP Researcher, and Debanjan Mahata, Text Analytics Researcher, joined teams of academic researchers in presenting papers on machine learning applications.

While Bloomberg regularly conducts core research in-house and applies cutting-edge technologies to its own projects, the research presented here was conducted outside the company by university professors and undergraduate students working alongside the Bloomberg researchers.

Mahata participated in this research in his role as an adjunct professor at the Indraprastha Institute of Information Technology, Delhi (IIIT-Delhi), alongside Yaman Kumar, an undergraduate intern working under Assistant Professor Rajiv Ratn Shah. They collaborated with Swati Aggarwal, assistant professor at Netaji Subhas Institute of Technology, Ponnurangam Kumaraguru, associate professor at IIIT-Delhi, and Roger Zimmermann, associate professor at the National University of Singapore, to publish “Get IT Scored using AutoSAS – An Automated System for Scoring Short Answers” during the special AI for Education paper track at EAAI-19 on Monday, January 28th.

Preoţiuc-Pietro, who previously researched social media communication, worked with Junyi Jessy Li, assistant professor in the Linguistics Department at The University of Texas at Austin, who previously researched language specificity. They collaborated with Yifan Gao and Yang Zong, two undergraduates at UT-Austin, to publish “Predicting and Analyzing Language Specificity in Social Media Posts.” Their poster was presented at AAAI-19 on Wednesday evening, January 30th.

Get IT Scored using AutoSAS – An Automated System for Scoring Short Answers

Manually grading thousands of online exams with short answers can take hours, if not days. Using natural language processing and decision trees to process the text can save teachers time and money. For example, British teachers spend 30% of their time evaluating students – which translates to about £3 billion (or almost $4 billion) a year. While this application is focused on grading, the algorithm developed in this research can be used to process text in any form that collects short answers.

By using a fundamental technique, the algorithm can score each answer while also providing an explanation of why it received that particular grade, so students know which areas they need to improve. A random forest model is an ensemble of simple, interpretable models that work together to provide a transparent picture of which areas are driving a score higher, as well as which areas require improvement.

“The model is an ensemble of decision trees that have been initialized using randomly selected parameters and trained on data sampled from the original training data,” said Mahata. “Individually, these decision trees are not very accurate machine learning models, but when they come together to do the final prediction, they perform better than individual trees that are tuned and trained for a particular task.”

Decision trees have advantages over many machine learning models. At every stage, the model evaluates the available features and splits on the most informative one, until it reaches a final prediction. “Each step can be understood and integrated later to gain visibility into how the decision tree is making a decision and what went into each decision along the way for that to be the final outcome,” said Mahata. “That’s why we used decision trees – or an ensemble of decision trees – to train against this particular data set.”
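As a concrete illustration, here is a minimal Python sketch of this idea using scikit-learn; the feature names and data are invented, and this is not the AutoSAS code. Each tree in the forest is trained on a bootstrap sample of the data and considers a random subset of features at every split, and the ensemble's per-feature importances provide the kind of visibility into the score that Mahata describes:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical feature matrix: one row per answer, one column per feature,
# with a 0-3 score loosely driven by prompt overlap and keyword weight.
feature_names = ["prompt_overlap", "word_count", "keyword_weight"]
X = rng.random((500, 3))
y = np.clip(3 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 0.2, 500), 0, 3)

# Each tree sees a bootstrap sample of the data and a random subset
# of features at every split.
model = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0)
model.fit(X, y)

# Per-feature importances show what drove the predicted scores.
for name, importance in zip(feature_names, model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```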

The training data for the algorithm consisted of actual, publicly available tests taken by high school students, released by the Automated Student Assessment Prize (ASAP), a competition hosted on Kaggle and sponsored by The William and Flora Hewlett Foundation. There were 10 different questions and more than 16,000 responses, each manually graded by teachers and double-scored on a scale of 0 to 3 by ASAP graders. The average answer was 50 words long, but individual responses ranged from one to 300 words.

The model made its predictions based on nine types of features: Word2Vec and Doc2Vec; part-of-speech (POS) tagging; weighted keywords; prompt overlap; lexical overlap; word frequency, difficulty, and diversity; statistics of sentence and word length; logical-operator-based features; and temporal features.

Of these, Word2Vec and Doc2Vec, along with prompt overlap and weighted keywords, were the most important features in the model. Prompt overlap measures how many words in an answer overlap with the question, which reflects reading comprehension and how well the test-taker understood the question. Weighted keywords expand the vocabulary associated with a particular question and then search for those words in the answer. Together, these techniques capture whether or not an answer is on topic.
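For illustration, here is a minimal, hypothetical sketch of these two features in Python; the tokenizer, stop-word list, and keyword weights are invented for the example and are not the AutoSAS implementation:

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "on", "where"}

def tokens(text):
    """Lowercase content words, ignoring punctuation and stop words."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS}

def prompt_overlap(question, answer):
    """Fraction of the question's content words that also appear in the answer."""
    q, a = tokens(question), tokens(answer)
    return len(q & a) / len(q) if q else 0.0

def weighted_keyword_score(answer, keyword_weights):
    """Sum of weights for expanded topic keywords found in the answer."""
    a = tokens(answer)
    return sum(w for kw, w in keyword_weights.items() if kw in a)

question = "Where are polar bears found in the wild?"
answer = "Polar bears are found in the Arctic region, on sea ice."
keywords = {"arctic": 1.0, "ice": 0.6, "snow": 0.6, "tundra": 0.8}  # invented expansion

print(prompt_overlap(question, answer))          # 0.75: most prompt words reused
print(weighted_keyword_score(answer, keywords))  # 1.6: on-topic keyword hits
```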

Word2Vec and Doc2Vec models are trained on large corpora of text documents to capture semantic relationships between words. The Word2Vec model was released by Google and trained on news articles, while the Doc2Vec model was trained on Wikipedia data. Only nouns, verbs, and adverbs, or meaningful noun phrases, are translated into a vector – a mathematical representation – while stop words like “not,” “be,” “a,” and “an” are not processed.

“The algorithms only understand numbers and math, so we need to feed them a suitable representation of text that they would understand,” said Mahata. Built on modern neural network techniques, Word2Vec and Doc2Vec represent a piece of text as a vector, where each vector is a meaningful numerical representation of the text.

For example, the vectors corresponding to the words ‘king’ and ‘queen’ are close to each other in the space captured by these models and can even be related by an equation: ‘king’ – ‘man’ + ‘woman’ ~= ‘queen.’ “It basically understands the inherent meaning of the word and how king is relative to man – the gender is man and the gender of queen is woman,” said Mahata. “It’s the representation of words in a low-dimensional vector space, where each dimension captures a latent meaning of the word.”
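This analogy can be reproduced with any pretrained word-vector model. The sketch below uses a small GloVe model available through gensim's downloader, rather than the news-trained Word2Vec and Wikipedia-trained Doc2Vec models the paper used, purely because it is quick to fetch:

```python
import gensim.downloader as api

# Load small pretrained word vectors (a stand-in for the paper's models).
vectors = api.load("glove-wiki-gigaword-50")

# vector('king') - vector('man') + vector('woman') lands nearest 'queen'.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# e.g. [('queen', 0.86...)]
```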

Since certain words are related to each other, a short description of a polar bear would include the words “bear,” “polar,” and “snow,” along with mentions of locations like the Arctic. A computer does not inherently understand the relationships between words, but features such as Word2Vec and Doc2Vec help the algorithms capture the underlying relationships between words and phrases, thereby helping determine whether an answer is good, well-written, and succinct for a given topic. Using these features, an answer to a question about where polar bears are found would be rated higher if it mentions “arctic region” than if it mentions “zoo.”
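Here is a hedged sketch of how such vectors can prefer the on-topic answer: each text is represented as the average of its word vectors, and cosine similarity to a reference description scores the “arctic” answer above the “zoo” one. The reference sentence and candidate answers are invented for illustration:

```python
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # small pretrained word vectors

def embed(text):
    """Represent a text as the average of its in-vocabulary word vectors."""
    words = [w for w in text.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Invented reference description and candidate answers.
reference = embed("polar bears live in the arctic on sea ice")
print(cosine(embed("they are found in the arctic region"), reference))
print(cosine(embed("they are found in the zoo"), reference))  # expected to be lower
```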

Predicting and Analyzing Language Specificity in Social Media Posts

Social media posts provide value in different ways, especially on Twitter, where users express themselves in very short utterances. “You can study more things about the person who is writing the text,” said Preoţiuc-Pietro. “Twitter is mostly studied when one wants to understand more about the user.”

Specificity – the level of detail conveyed in a post about a concept, object, or event – helps provide context for the language and communication between two people. Users with different backgrounds and ideologies have their own unique styles of communicating on the platform, and the specificity of their posts reveals information about their character and background.

“In social media, you have to be very succinct and short,” said Preoţiuc-Pietro. “It’s also about an expectation of your audience – are you using it to communicate with friends or to express your viewpoint?”

There are many applications where specificity can be leveraged. In political argumentation, specificity reveals whether someone relies on specific numbers and references, on generalities, or on a more emotional stance. In summarization, the level of specificity dictates what background to include, if any – whether the entire thread should be displayed with a tweet, or whether the tweet has enough context to stand alone.

Understanding more about social media posts and users has many applications at Bloomberg. Clients demand real-time information, and more and more news-like content is disseminated via Twitter, so this understanding helps them follow trends, analyze brands, and more. Since Bloomberg processes social media content, tracking events and identifying specific tweets is important, as is determining whether a given tweet requires additional background.

In this work, the team researched how an algorithm can predict the specificity of a social media post on a fine-grained scale. Dr. Li had previously studied specificity prediction on a coarser scale for news text, where the information tends to be more self-contained. A tweet, by contrast, cannot be analyzed in isolation: its social context is important because it reveals information about the person who sent it.

Demographic information, such as gender, age, faith, political ideology, income, and education level, is very important, and the keywords and language used in a tweet can help shed light on the sender. Tweets about school or homework, for example, are likely sent by someone between 14 and 18 years old. General posts that don’t reference specific events or actions assume the reader already knows what the user is talking about. Younger users tend to communicate with close-knit communities who know everything about them. Contrast that with older users, who may assume their followers don’t know them personally or have the entire context, which leads them to include more background in what they tweet.

The team analyzed training data composed of about 7,000 tweets from 4,000 users who shared demographic information. Messages were annotated for their specificity – how general or detailed each tweet was – with ratings crowdsourced via Amazon Mechanical Turk. A standard supervised learning algorithm was then trained to predict the score.

Text length, the simplest attribute included in the model, is a key determinant of specificity. The percentage of capital letters and part-of-speech (POS) tags – like nouns, proper nouns, determiners, pronouns, adjectives, prepositions, and punctuation – were also included among the surface and lexical features. These attributes were the most significant drivers of the score, achieving the highest correlation with the human ratings, at 0.67.

While emoji usage signaled a more subjective post and the use of adjectives tended to indicate less specificity, numbers and capitalization patterns that identify proper names were a very good proxy for specificity.
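To make the modeling pipeline concrete, here is a minimal, hypothetical sketch: hand-built surface features like those described above (length, capitalization, digits, punctuation) fed to a standard supervised regressor and evaluated with Pearson correlation, the measure reported above. The tweets, ratings, feature set, and choice of ridge regression are all invented for illustration and are not the paper's:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

def surface_features(tweet):
    n = max(len(tweet), 1)
    return [
        len(tweet),                           # text length
        sum(c.isupper() for c in tweet) / n,  # share of capital letters
        sum(c.isdigit() for c in tweet) / n,  # numbers as a specificity cue
        tweet.count("!") + tweet.count("?"),  # punctuation counts
    ]

tweets = [
    "AAPL up 3.2% after Q2 earnings beat on Jan 30",
    "ugh mondays",
    "Meeting Sarah at 5pm at the Main St cafe",
    "so tired of everything lol",
]
scores = np.array([4.5, 1.0, 4.0, 1.2])  # invented fine-grained specificity ratings

X = np.array([surface_features(t) for t in tweets])
model = Ridge().fit(X, scores)

r, _ = pearsonr(model.predict(X), scores)
print(f"Pearson correlation: {r:.2f}")
```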

The algorithm also utilized distributional word representations that capture a tweet’s overall context, social media-specific features that look for prominent elements of posts, and emotion features that measure a tweet’s subjectivity.