Some of today’s most successful applications of machine learning are based on supervised learning, a class of machine learning algorithms that rely on labelled training data, encoding ground truths known a priori, to reproduce a human decision-making process. Collecting and generating such datasets can be very expensive.
For example, if we were to apply machine learning in the context of medical diagnosis, then the data we wish to make predictions about – something like X-ray images or analysis results – must be reviewed and judged by a panel of medical experts. In the finance domain, if a news editor is covering equities, monitoring the entire stream of Twitter messages for breaking news can be very time-consuming. Machine learning models can be trained to monitor the stream for specific tweets about a particular topic, like “earnings.” This way, a reporter can be alerted only when a related news story “breaks” on Twitter, enabling them to be the first to cover it. Training the models to do that is also typically expensive – each of the potentially thousands or even millions of tweets used in the model’s training data set must first be annotated by hand, usually by multiple judges.
Good labels and annotations are therefore key to training models that produce accurate output. Active learning allows us to select new samples for labelling strategically, minimizing the number of samples needed; the goal is for a model trained this way to perform as if it were trained with a much larger dataset.
The Bloomberg team conducts machine learning research in-house. In a recent paper, Minjie Xu, a senior ML/NLP researcher and engineer in London, collaborated with Gary Kazantsev, Head of Quant Technology Strategy in the Office of the CTO, to examine how to use analytical tools recently employed for interpretability research to conduct more efficient and effective active learning. Gary presented the paper, “Understanding Goal-Oriented Active Learning via Influence Functions,” at the NeurIPS 2019 Workshop on Machine Learning with Guarantees in Vancouver, Canada on Saturday, December 14, 2019.
“Using the analytical tool recently resurfaced for interpretability studies to analyze active learning algorithms, we discovered some interesting insights that shed light on a suite of popular active learning strategies, and we hope this will help people better understand their underlying mechanisms and make the right choice in practice,” said Xu. “It was a natural fit for this workshop, where they focus on theoretical analyses of machine learning algorithms.”
Active Learning Strategies
“Active learning helps you to figure out which data samples to choose for annotation so you get the most benefit from this training data,” said Xu. “It helps you prioritize your annotation budget — ideally, if you have 1 million unlabelled samples, you want to annotate them all, but this takes time and energy. If you only have a budget to label 100 samples, active learning helps you identify which 100 will provide the most benefit.”
The most popular active learning strategies are based on the concept of “uncertainty.” Very often, a trained model outputs, alongside each prediction, a confidence value indicating how likely that prediction is to be correct. In this class of active learning, the samples the model is most “uncertain” about are the ones selected next for annotation.
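As an illustration, a minimal least-confidence sampler might look like the following. This is our own sketch, not code from the paper; the function name and scoring rule are illustrative:

```python
import numpy as np

def uncertainty_sampling(probs: np.ndarray, budget: int) -> np.ndarray:
    """Pick the `budget` unlabelled samples the model is least sure about.

    probs: (n_samples, n_classes) predicted class probabilities.
    Returns the indices of the selected samples.
    """
    # Least-confidence score: 1 minus the probability of the top class.
    uncertainty = 1.0 - probs.max(axis=1)
    # Highest uncertainty first.
    return np.argsort(-uncertainty)[:budget]

probs = np.array([
    [0.95, 0.05],   # confident prediction
    [0.55, 0.45],   # very uncertain prediction
    [0.80, 0.20],
])
print(uncertainty_sampling(probs, budget=1))  # -> [1]
```

Other common uncertainty scores, such as prediction entropy or the margin between the top two classes, can be swapped in for the least-confidence score without changing the overall selection loop.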
Another type of active learning, called “goal-oriented active learning” by Xu and Kazantsev, is guided by an explicitly chosen “goal function” imposed on the trained model. The “utility” of each training sample is then measured by its potential influence on helping the model achieve the designated goal. The next best sample to label is the one having the highest “utility.”
“You ask your goal function what is the goal for the new model given I have one more sample,” explains Xu. “And you want to annotate those samples so that the goal potentially increases the most.”
Cost of Goal-Oriented Active Learning from a Pool
However, this is not as straightforward as it may sound, as the ground-truth labels of the samples in the pool are unknown at this stage. To use them for training purposes, you first have to “guess” their labels in some fashion.
“To carry out this goal-oriented utility evaluation, you have to do some computation for each one in this big pool of unlabelled samples – if you have one million of them, you need to carry out the computation one million times,” said Xu. “If you have a quick way to compute this utility for each sample, it might still be fine. But in such a goal-oriented active learning paradigm, computing the utility for even just one unlabelled sample can already take a long time.”
For example, if a label can take one of a thousand possible categories, and there are a million samples in the pool to choose from, you may end up retraining the model a billion times (one million samples times one thousand labels) just to determine the single best next sample. Goal-oriented active learning can thus be expensive, and accurately evaluating these utilities over a large pool of samples becomes almost impossible in practice.
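The brute-force procedure described above can be sketched as follows. The names `retrain` and `goal` are illustrative placeholders, not the paper’s API: the expected goal value is computed by guessing each possible label, retraining on the augmented dataset, and weighting by the model’s current belief in that label.

```python
def goal_oriented_utility(candidate_x, labelled_data, label_probs, classes,
                          retrain, goal):
    """Expected goal value if `candidate_x` were labelled and added.

    Since the true label is unknown, we "guess" it: for each possible
    class, retrain on the augmented dataset and weight the resulting
    goal by the model's current belief in that class. Scoring a whole
    pool this way costs len(pool) * len(classes) retrainings.
    """
    utility = 0.0
    for c, p in zip(classes, label_probs):
        model = retrain(labelled_data + [(candidate_x, c)])
        utility += p * goal(model)
    return utility

# Toy illustration: the "model" is just the labelled set itself, and the
# "goal" is the fraction of positive labels it contains.
labelled = [(0.1, 0), (0.9, 1)]
u = goal_oriented_utility(0.5, labelled, label_probs=[0.3, 0.7],
                          classes=[0, 1], retrain=lambda d: d,
                          goal=lambda d: sum(y for _, y in d) / len(d))
print(round(u, 3))  # -> 0.567
```

With a real model, each call to `retrain` is a full training run, which is exactly where the billion-retrainings cost in the example above comes from.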
One of the paper’s contributions is to efficiently approximate the calculation of such goal-based utilities using a method from robust statistics called “influence functions,” which provides an estimate of the updated model (and, accordingly, of the updated goal) without the need to actually retrain the model. As a result, massive model retraining costs can be replaced with much cheaper gradient computations. More importantly, the approximation also extends naturally to the batch-mode setting, allowing the algorithm to select multiple new samples at each step and further saving on computational costs, which otherwise scale exponentially with the batch size.
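At a high level, the influence-function estimate replaces each retraining with gradient algebra: up-weighting a training point z perturbs the model’s optimum by roughly -H⁻¹∇L(z), so a goal g changes by approximately -∇g·H⁻¹∇L(z). The toy sketch below uses our own notation and a deliberately simple quadratic problem, assuming a twice-differentiable loss; the paper’s exact formulation may differ.

```python
import numpy as np

def influence_on_goal(grad_goal, hessian, grad_loss_z):
    """First-order estimate of how a goal g(theta) changes when sample z
    is up-weighted in training, with no retraining:
        d_goal ~ -grad_g(theta_hat)^T H^{-1} grad_L(z, theta_hat)
    """
    return -grad_goal @ np.linalg.solve(hessian, grad_loss_z)

# Toy check: theta_hat = mean of y under squared loss; goal g(theta) = theta.
y = np.array([1.0, 2.0, 3.0])
theta_hat = y.mean()                      # 2.0
hessian = np.array([[float(len(y))]])     # sum of per-sample Hessians (all 1)
z = 5.0
grad_loss_z = np.array([theta_hat - z])   # d/dtheta of (theta - z)^2 / 2
grad_goal = np.array([1.0])
print(influence_on_goal(grad_goal, hessian, grad_loss_z))  # -> 1.0
```

The positive influence says that adding the point z = 5.0 would pull the fitted mean (and hence the goal) upward, which matches intuition, and the estimate required only one gradient and one linear solve rather than a retraining run.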
The Unresolved Questions
Along their journey, Xu and Kazantsev also made some unexpected discoveries that call into question some common practices and beliefs about active learning strategies.
For example, if one were to “guess” the label of a sample directly from the current model’s beliefs (i.e., taking the expectation without any modifications), the approximation would yield exactly the same utility for every sample, rendering it useless for ranking. On the other hand, even after doing the expensive computation of the true utilities, they would still be fairly close to one another, offering limited distinguishability. In addition, even when one is allowed to “peek” at the true labels of the samples, which in principle should give an advantage over any possible “guess,” many popular goal-oriented active learning strategies were empirically found to perform much worse.
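One way to see why guessing with the model’s own beliefs collapses the ranking, sketched in our own notation under the assumption of a log-loss model and the first-order influence approximation: the utility of a sample x involves the loss gradient at a guessed label y, and averaging that gradient under the model’s own predictive distribution gives zero by the score identity,

```latex
\mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\!\left[\nabla_\theta \left(-\log p_\theta(y \mid x)\right)\right]
  = -\sum_y p_\theta(y \mid x)\, \frac{\nabla_\theta\, p_\theta(y \mid x)}{p_\theta(y \mid x)}
  = -\nabla_\theta \sum_y p_\theta(y \mid x)
  = 0,
```

since the class probabilities always sum to one. The expected first-order change in the goal is therefore zero for every unlabelled sample, consistent with the paper’s observation that all utilities come out identical.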
While Xu and Kazantsev’s results are so far preliminary, they see refining this analysis as the way to move the work forward.
“The main purpose of this paper is to share our findings with the community so fellow researchers in the machine learning field can become aware of these issues and make more exciting discoveries to carry this line of research forward,” said Xu. “We are very interested in this topic and will continue looking into it.”