The search for Solr analytics

In the constant search to extract meaningful insight from data, one team of Bloomberg programmers is having an outsized impact, both inside and outside the company.

Steven Bower and Houston Putman are contributing a new version of their Analytics Component to a project called Apache Solr, and their work is benefitting programmers and data scientists all over the world. Solr (pronounced like “solar”) is an open source platform designed for indexing and searching data. Created in 2004, it’s widely used as a data search tool at a variety of high-profile companies, including Best Buy, eBay and Netflix.

Bower said Solr also serves as the technical foundation for more than 300 functions on the Bloomberg Terminal. “Pretty much anytime you search on the Terminal, Solr is there,” he said. “It forms the underpinning for things like News Search, Bloomberg Unified Search, Fixed Income Search (SRCH<GO>), and even jobs listed on Bloomberg.com/careers. It’s all over the place.”

Bloomberg’s reliance on Solr led the company’s software engineers to contribute to the Solr project in a big way. Three members of the Bloomberg search and news teams are “committers” to the project, meaning they can modify the code directly. Another 20 Bloomberg engineers have submitted code that has ultimately been added to Solr. And one is a member of the Apache Lucene/Solr Project Management Committee (PMC), which provides oversight of the project for the Apache Software Foundation (ASF), decides the release strategy, appoints new committers and sets community and technical direction for the project.

The original idea for adding the Analytics Component to Solr dates back four years, to when Putman, then a 19-year-old intern at Bloomberg, started work on an earlier data analysis project.

“Basically people were asking us for ways they could do simple rollups of their data generated by Solr,” Putman said. “The requests were pretty basic, like how many records were in the system every week. Over time, the requests got more advanced.”
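The kind of "simple rollup" Putman describes — counting records per week — can be sketched outside of Solr in a few lines of Python. This is only an illustration of the computation being requested, not the Analytics Component's actual query syntax; the documents and field names are invented for the example.

```python
from collections import Counter
from datetime import date

# Hypothetical documents, each with an indexed date field, standing in
# for records matched by a Solr query.
docs = [
    {"id": 1, "created": date(2017, 9, 4)},
    {"id": 2, "created": date(2017, 9, 5)},
    {"id": 3, "created": date(2017, 9, 12)},
]

def weekly_counts(docs):
    """Roll documents up into per-ISO-week record counts."""
    counts = Counter()
    for doc in docs:
        year, week, _ = doc["created"].isocalendar()
        counts[(year, week)] += 1
    return dict(counts)

print(weekly_counts(docs))  # {(2017, 36): 2, (2017, 37): 1}
```

In Solr itself, this kind of request is expressed declaratively as part of the search query, so the rollup runs inside the search platform rather than in client code.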

Solr already included a component called Stats, but it proved too inflexible for these requests. Once Putman released the first version of the Analytics Component, it became popular in the Solr community for the broader range of functionality it offered. Inside Bloomberg, the new functionality helped the company cut back on its use of some commercial database products.

The Analytics Component’s ability to run computations on live data removes the need for a separate data warehouse, which speeds up and simplifies analysis while also reducing its cost. “It’s pretty intense what it can do,” Bower said. Version 2.0 of the Analytics Component (SOLR-10123) is a standard part of Solr 7.0, the platform’s soon-to-be-released major revision.

The new version addresses a number of challenges found as usage of Solr has grown at Bloomberg.

First, a common method that Solr users employ to handle ever-growing document stores (called ‘collections’ in Solr), and to provide consistent performance, is called ‘sharding’: breaking the document store into chunks. Those chunks can then be distributed across a number of Solr instances, either located on a single machine (to take advantage of multiple CPUs, for example), or across multiple machines (when a single machine can no longer provide sufficient performance for the application). Unfortunately, the first version of the Analytics Component could not be applied to sharded collections. The new version directly supports sharded collections, which allows it to be used on collections with tens of billions (even hundreds of billions) of documents.
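The core idea of sharding — deterministically splitting one collection into chunks — can be sketched as hashing each document's id to pick a shard. This is an illustrative sketch only: Solr's actual routing uses a different hash (MurmurHash over the id), and the shard count and document ids here are made up.

```python
import hashlib

NUM_SHARDS = 4

def shard_for(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard by hashing the document id.

    md5 stands in for Solr's real routing hash; the point is only
    that the assignment is deterministic and roughly uniform.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Distribute ten hypothetical documents across the shards.
shards = {i: [] for i in range(NUM_SHARDS)}
for doc_id in (f"doc-{n}" for n in range(10)):
    shards[shard_for(doc_id)].append(doc_id)
```

Because the same id always hashes to the same shard, queries and updates can be routed without consulting a central index of where each document lives.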

This enhancement provides much more than just the ability to support larger collections, though. Sharding can be used to more fully utilize the processing power of machines running Solr (which are often constrained by how quickly they can load and store data from their storage devices, not by their CPUs). Executing analytics on a collection organized this way employs a method referred to as ‘MapReduce’, providing near-linear performance increases as the number of shards grows. When the shards are distributed across multiple machines, the parallelism benefit continues, as the computations are also distributed across the machines.
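The MapReduce pattern described above can be sketched in Python: each shard computes a small partial aggregate locally (the "map" step), and the partials are then merged into one result (the "reduce" step). This is a conceptual sketch with invented data, not Solr's implementation; the key property it shows is that statistics like count, sum, min and max are mergeable, so shards can work independently and in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical numeric field values held by three shards.
shards = [
    [3.0, 1.5, 9.0],
    [2.0, 7.5],
    [4.0, 0.5, 6.0],
]

def map_shard(values):
    """'Map' step: each shard aggregates its own values locally."""
    return {"count": len(values), "sum": sum(values),
            "min": min(values), "max": max(values)}

def reduce_partials(partials):
    """'Reduce' step: merge the per-shard partials into one answer."""
    return {
        "count": sum(p["count"] for p in partials),
        "sum": sum(p["sum"] for p in partials),
        "min": min(p["min"] for p in partials),
        "max": max(p["max"] for p in partials),
    }

# Shards run their map step concurrently, mimicking distribution.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(map_shard, shards))

result = reduce_partials(partials)
result["mean"] = result["sum"] / result["count"]
```

Because only the small partial aggregates cross the network — never the raw documents — adding shards adds compute without a matching growth in coordination cost, which is where the near-linear scaling comes from.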

Second, in the years since the Analytics Component was first made available to teams at Bloomberg, usage has demonstrated a need to support hundreds (or even thousands) of categories in multi-level group searches performed against large collections. The new version of the component has a completely rewritten grouping algorithm, which can handle these searches without a significant loss in performance.
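A multi-level group search aggregates a value under every combination of grouping fields in a single pass, instead of issuing one query per category. The sketch below illustrates the idea with invented documents and fields; it says nothing about the rewritten algorithm itself, which must also cope with thousands of categories distributed across shards.

```python
from collections import defaultdict

# Hypothetical documents with two categorical fields and a numeric one.
docs = [
    {"region": "AMER", "sector": "Tech",   "price": 10.0},
    {"region": "AMER", "sector": "Energy", "price": 20.0},
    {"region": "EMEA", "sector": "Tech",   "price": 30.0},
    {"region": "AMER", "sector": "Tech",   "price": 40.0},
]

def grouped_sum(docs, fields, value_field="price"):
    """Sum a numeric field under every multi-level group key in one pass."""
    totals = defaultdict(float)
    for doc in docs:
        key = tuple(doc[f] for f in fields)  # e.g. ("AMER", "Tech")
        totals[key] += doc[value_field]
    return dict(totals)

print(grouped_sum(docs, ("region", "sector")))
```

With two levels and N and M categories per level there can be up to N×M group keys, which is why the number of groups — not just the number of documents — drives the cost of these searches.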

Why all the effort to boost an open source project? It falls in line with Bloomberg’s cultural values of making a difference in the communities where we live and work. This way, other companies — even potential competitors — benefit from the work that Putman, Bower and their Bloomberg colleagues have done. And those same external users help improve the software by finding its flaws and fixing them, after putting it through their own real-world deployments.

Attending Lucene/Solr Revolution 2017 in Las Vegas? Learn more about the next iteration of the Solr Analytics Component from Houston Putman during “Analytics at Scale with the Analytics Component 2.0” from 10:00-10:40 AM PT on Thursday, September 14, 2017.