Scatteract: The First Fully Automated Way of Mining Data from Scatter Plots

Extracting insight from unstructured information is nothing new for Bloomberg. Every day Bloomberg uses automated systems to pull data from Word documents, PDFs, paragraphs of text, and tables.

However, one form of data remains terribly difficult for software to process: data embedded in charts. While charts are an excellent way to convey patterns and trends, they do not facilitate further modeling of the data or close inspection of individual data points. This has been an ongoing challenge for Bloomberg’s Data Technologies Automation team in Princeton. Their job is to find ways to extract data from financial documents (usually PDFs) in an automated way, and to produce “better-than-human” analyses. This team has built engines that can extract data from tables and text. Charts seemed the next logical step, starting with scatter plots.

These handy visual expressions of trends or patterns are very useful for quickly conveying high-level information, but they have been nearly impenetrable when it comes to deeper, more detailed analysis. It’s a bit like flying over a forest: you can see the overall pattern of tree growth, but if you need to know more about each species, you can only guess.

Data scientists Mathieu Cliche and Connie Yee, together with David Rosenberg from the Office of the CTO, have been collaborating with the Data Technologies Automation team, led by Biye Li, to develop a system that automatically extracts numerical data points from scatter plot charts in PDF documents. The system uses deep learning techniques to identify the key components of the chart, then combines optical character recognition (OCR) with robust regression to map from pixels to the coordinate system of the chart.

The system is called Scatteract, and it’s the first system to extract data from the image of a scatter plot and represent it in the coordinate system of the chart in a fully automated way. Mathieu will be presenting a paper about it at The European Conference on Machine Learning & Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2017), a major international conference on machine learning held this week in Macedonia. The team has also released the Scatteract code as open source so that others can extract data from scatter plots.

The results are impressive. The system has been shown to successfully extract data from 89 percent of procedurally generated scatter plots, and 78 percent of scatter plots found on the Web.

Here’s how Scatteract works in a nutshell. It locates three types of objects in a chart: tick marks, the small marks along the axes that indicate positions on the scale; tick values, the numbers that give the scale of the axes and are usually printed near the tick marks; and finally the plotted points on the chart itself.
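To make the relationship between tick marks and tick values concrete, here is a minimal sketch of one way to pair each tick value with the tick mark it sits next to. The article does not say how Scatteract performs this step; the nearest-center rule, the Box type, and the pair_ticks function are all assumptions made for illustration, and the pixel values are made up.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in pixel coordinates (hypothetical type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self):
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def pair_ticks(tick_marks, tick_values):
    """Attach each tick value to the closest tick mark (assumed rule).

    tick_marks:  list of Box, one per detected tick mark
    tick_values: list of (Box, float), the label region and the number read from it
    Returns a list of (tick mark center, value) pairs.
    """
    pairs = []
    for label_box, value in tick_values:
        lx, ly = label_box.center
        nearest = min(
            tick_marks,
            key=lambda m: hypot(m.center[0] - lx, m.center[1] - ly),
        )
        pairs.append((nearest.center, value))
    return pairs


# Three tick marks along an x-axis near y = 400 px, with labels printed just below.
marks = [Box(98, 395, 102, 405), Box(198, 395, 202, 405), Box(298, 395, 302, 405)]
labels = [
    (Box(90, 410, 110, 422), 0.0),
    (Box(190, 410, 210, 422), 10.0),
    (Box(290, 410, 310, 422), 20.0),
]
pairs = pair_ticks(marks, labels)
for center, value in pairs:
    print(center, value)
```

Each resulting pair ties a pixel position on an axis to a number on the chart's scale, which is the information needed to convert any other pixel position into chart units.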

Scatteract identifies these elements and produces a bounding box for each one, expressed in pixel coordinates. The next major step is to map those pixel coordinates into the coordinate system defined by the chart’s axes. The results are output as tabular data.
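The pixel-to-chart mapping can be sketched as a robust line fit per axis. The example below is an illustration under stated assumptions, not Scatteract's implementation: it assumes linear axes, and it uses the Theil-Sen estimator (median of pairwise slopes) as a stand-in, since the article says only that "robust regression" is used. The function names fit_axis and to_chart and all numbers are hypothetical.

```python
from itertools import combinations
from statistics import median


def fit_axis(pixels, values):
    """Fit value = slope * pixel + intercept with the Theil-Sen estimator.

    The slope is the median of all pairwise slopes, so a minority of
    misread tick values cannot drag the fit away.
    """
    slopes = [
        (v2 - v1) / (p2 - p1)
        for (p1, v1), (p2, v2) in combinations(zip(pixels, values), 2)
        if p2 != p1
    ]
    if not slopes:
        raise ValueError("need at least two ticks at distinct pixel positions")
    slope = median(slopes)
    intercept = median(v - slope * p for p, v in zip(pixels, values))
    return slope, intercept


def to_chart(pixel, fit):
    """Convert one pixel coordinate to chart units using a fitted axis."""
    slope, intercept = fit
    return slope * pixel + intercept


# X-axis ticks: pixel columns and the values read next to them. The last
# value is a deliberate misread (40 read as 48) to show the robustness.
x_fit = fit_axis([100, 200, 300, 400, 500], [0, 10, 20, 30, 48])
# Y-axis ticks: pixel rows grow downward, so the slope comes out negative.
y_fit = fit_axis([400, 300, 200, 100], [0, 5, 10, 15])

# Centers of detected point boxes, in pixels (made-up values).
points_px = [(150, 350), (250, 200), (450, 100)]
table = [(to_chart(px, x_fit), to_chart(py, y_fit)) for px, py in points_px]
for row in table:
    print(row)
```

Despite the misread tick, the x-axis fit still recovers a scale of about 0.1 chart units per pixel, so the first point lands at roughly (5, 2.5) in chart coordinates. An ordinary least-squares fit would have been skewed by the bad value, which is why a robust method suits input that comes from OCR.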

While there have been numerous efforts to extract numerical data from charts, this is the first system to use machine learning techniques to do it. The team used Google’s open source machine learning library TensorFlow and TensorBox, an object detection framework built on top of it. They then combined these with Tesseract, Google’s open source optical character recognition engine.

“It’s a potential game-changing way in how we approach the problem of data extraction at Bloomberg,” says Biye Li. “Previously, to make use of data in a document, you needed to know which data was present. Scatteract makes it possible to extract all the useful data from a document and then decide how to use it later. It enables us to access a lot more data in a more consistent and structured way, for the benefit of Bloomberg customers.”

At least part of the inspiration for Scatteract came from last year’s U.S. presidential election. Websites like Nate Silver’s FiveThirtyEight published many charts predicting different results with varying levels of confidence. It was difficult to analyze the underlying data from the images presented. At one point, someone pulled out a ruler to try measuring a chart image for scale to see if he could quantify any data points.

Scatteract addresses a growing need for access to more kinds of data for use in building financial models. Generally speaking, more data is better, and emerging technologies like neural networks are especially data hungry: the more data you feed them, the more accurate their models tend to become. Scatteract essentially unlocks a source of data that had previously been unavailable.

Down the road, the team hopes to extend Scatteract’s capabilities to also extract categorical information from scatter plot legends, in addition to applying the system to other types of charts, including bar charts, line charts and pie charts. Eventually, an end-to-end system will be able to detect and classify charts in a document and use the appropriate techniques to extract useful data from them.