How Bloomberg Integrated Learning-to-Rank into Apache Solr

January 23, 2017

The latest milestone in open source development at Bloomberg is the incorporation of the Learning-to-Rank (LTR) plug-in into Apache Solr 6.4.0, which shipped this week. The release of the plug-in marks the culmination of a year’s worth of close collaboration between two groups of Bloomberg software engineers in New York and London and the open source project’s community to make it easier to re-rank search results using machine learning.

Apache Solr is an open source enterprise search platform built on top of the Apache Lucene search engine library. Solr powers search for companies and websites worldwide, and it is one of several open source projects to which Bloomberg engineers contribute as part of the company’s ongoing effort to actively participate in the open source community.

Development of the Solr Learning-to-Rank plug-in was jointly led by Michael Nilsson and Joshua Pantony from the Unified Search team in New York and Diego Ceccarelli from the News Search team in London. Individual code contributors included New York-based Jon Dorando, Naveen Santhapuri and David Grohmann from Bloomberg, and, from outside Bloomberg, London-based Alessandro Benedetti (Benedetti and Ceccarelli had initially met because they were both Italian and working on Solr in London).

David Grohmann, Michael Nilsson, Jon Dorando and Naveen Santhapuri, four engineers based at Bloomberg Global Headquarters in New York who worked on the Learning-to-Rank plug-in. (Photographer: Teddy Vuong/Bloomberg, January 17, 2017)

Their original goal was to improve both Federated Search and News Search on the Bloomberg Terminal. A Solr-based Search-as-a-Service platform drives search for multiple functions on the Terminal, and Learning-to-Rank algorithms are responsible for the quality of many of its search results. Any time subscribers perform a search, they expect to instantly find the most relevant companies, people and news. “When we started this project, our existing unified search application (HL<GO>) already had a re-ranking framework for search results, but it was something we had built ourselves outside of the search engine,” says Nilsson.

On the other side of the Atlantic, the re-ranking requirements of the News Search team differed in their details but were broadly similar. As the engineers talked with colleagues, other teams also came forward asking for their own Solr-based re-ranking frameworks. “We knew we wanted more features, but that was the push to create a general Learning-to-Rank plug-in for Solr that would make customization easier and allow others to take advantage of their own re-ranking algorithms,” Nilsson says.

In the Information Retrieval field, Learning-to-Rank techniques are used to improve the relevance of users’ search results. First, a search query retrieves the documents that match the user’s search terms. The top N results of that original query are then re-ranked using new scores computed by applying a trained machine learning model. Because scoring documents with the model is more computationally intensive (slower and more expensive, in other words) than the original query, applying it to only this subset of results keeps performance acceptable while still delivering relevant results.
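The two-phase approach described above can be illustrated with a minimal Python sketch. It is not the plug-in's code or API: the function name rerank_top_n, the hit tuples, the feature values and the tiny linear model are all hypothetical and chosen only to show how re-ranking the top N results leaves the remaining results in their original order.

```python
from typing import Callable, List, Sequence, Tuple

# A hit is (doc_id, first-pass score, feature vector).
# All values below are invented for illustration.
Hit = Tuple[str, float, Sequence[float]]


def rerank_top_n(
    hits: List[Hit],
    model: Callable[[Sequence[float]], float],
    n: int,
) -> List[str]:
    """Re-rank only the top n first-pass hits with the (expensive) model."""
    # Phase 1: order every hit by the cheap first-pass score.
    ordered = sorted(hits, key=lambda h: h[1], reverse=True)
    head, tail = ordered[:n], ordered[n:]
    # Phase 2: score just the head with the model and re-order it.
    reranked = sorted(head, key=lambda h: model(h[2]), reverse=True)
    # Hits outside the top n keep their first-pass order.
    return [h[0] for h in reranked + tail]


def linear_model(features: Sequence[float]) -> float:
    """Hypothetical trained model: a weighted sum of two features."""
    weights = (0.2, 0.8)  # assumed weights, not from any real model
    return sum(w * x for w, x in zip(weights, features))


hits = [
    ("doc_a", 9.0, (1.0, 0.1)),
    ("doc_b", 8.0, (0.5, 0.9)),
    ("doc_c", 7.0, (0.2, 0.5)),
    ("doc_d", 1.0, (0.0, 1.0)),
]

print(rerank_top_n(hits, linear_model, n=3))
```

In this sketch doc_d would score well under the model, but it is never evaluated because it falls outside the top 3 of the first pass. That is the trade-off the approach accepts in exchange for running the model on only a few documents per query.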

The effort to integrate the Learning-to-Rank plug-in into the upstream project was led by Apache Lucene/Solr committer Christine Poerschke, a senior software engineer on the News Search team in London. Last month, Poerschke was named to the Apache Lucene Project Management Committee (PMC), becoming the first Bloomberg employee to be invited to join any Apache PMC. In this new role, she is part of a group of developers around the globe that provides oversight of the project for the Apache Software Foundation (ASF), decides the release strategy, appoints new committers and sets the project’s community and technical direction.

Diego Ceccarelli and Christine Poerschke, engineers on Bloomberg’s News Search team in Bloomberg’s EMEA Headquarters in London (Photographer: Matthew Muzerie for Bloomberg, January 11, 2017)

The contribution process formally started with the creation of the SOLR-8542 ticket in the project’s issue tracking system. The project welcomes contributions in the form of code ‘diff’ patch files, as well as via GitHub pull requests. The engineers initially proposed the code change for the Learning-to-Rank plug-in as a patch file, but then switched to a GitHub branch-and-pull-request approach, as the latter better supports an iterative collaborative development process.

Both leading up to and throughout the contribution process, the team worked hard to ensure that the plug-in would be accepted by open source users and ultimately be useful to the larger Solr community. In addition to engaging in discussions with other developers directly on the JIRA ticket, the team also spoke about the plug-in at community meetups and conferences (the slides and video recording of Ceccarelli and Nilsson’s talk at Lucene/Solr Revolution in 2015 are available online) in order to gather feedback and answer questions.

Part of the integration effort was to take a step back and consider how any ‘it works like this for us’ aspects of the code fit with the larger community’s requirements. For example, Solr users already familiar with the configuration of existing components would benefit if the Learning-to-Rank model configuration worked in a similar way, or they could encounter problems if subtle details unnecessarily worked differently. “Working together, we were forced to generalize, to satisfy all these different requirements,” says Ceccarelli.

This word cloud shows the classes used by the Learning-to-Rank plug-in; the more frequently a class is mentioned in the code, the more prominent its visualization.

After a year-long period of on-and-off iterative code revisions, public comments and documentation, the Learning-to-Rank plug-in is now part of the Solr 6.4.0 release. The plug-in provides an easy-to-use framework to deploy machine learning models into Solr. Now search engineers, both inside and outside Bloomberg, can use the plug-in and their own machine learning models to improve their search solutions. This allows engineering teams to focus on their specific domain, rather than spend time building and maintaining their own re-ranking infrastructure.

With the inclusion of the Learning-to-Rank plug-in as part of Solr, the project’s worldwide community has taken on the responsibility for maintaining and extending this technology. This collaborative open development means that, in the future, the community – which includes several Bloomberg engineers who are active contributors, developers at other companies, as well as independent search experts – will be able to integrate their own extensions and improvements to the plug-in. Those updates will then automatically ship to all Learning-to-Rank plug-in users as part of future Solr releases.

The opportunity for Bloomberg engineers to participate in important and interesting open source projects also has other benefits. Search results ranking is a relatively difficult technical problem. Taking on, and then contributing the results of, this kind of challenge is validating and rewarding for Bloomberg’s engineers and it is also of interest to many prospective Bloomberg engineers. Pantony says, “When I talk to potential candidates, a lot of them get really excited when they hear Bloomberg is doing stuff like this. It helps us reinforce that Bloomberg is a collaborative and open environment for engineers.”

If you are new to Solr, the Solr Quick Start tutorial is a good place to begin. The Learning-to-Rank plug-in documentation is available in the Apache Solr Reference Guide.
