How Bloomberg Built a Better Search Engine with Open Source Technology

320,000 users, 10 million searches a day and just 180 milliseconds response time for each query: that’s what Bloomberg’s news search back-end has to stand up to every day. Redesigning a system to meet these demands would be a pretty daunting task, right?

Now imagine the users have in-depth knowledge of the domain and need to perform complex searches in real time, with huge financial decisions riding on the results.

Those are the unique challenges Bloomberg software developer Ramkumar Aiyengar and his team faced when they decided to switch to an open source platform to power the news search functionality on Bloomberg Terminals. In this conversation, Aiyengar, Engineering Manager of News Infrastructure at Bloomberg, discusses how open source search technology was the key to addressing those challenges and delivering better search results to customers.

Why did you redesign Bloomberg’s news search back-end?

Around three years ago, we decided to invest more heavily in building powerful, flexible solutions for news search. The primary driver was that we wanted a framework to help us develop more powerful, intuitive features and deliver more relevant search results to our clients. Our previous back-end ran on a third-party solution, which we could not adapt to meet changing business requirements. We needed a scalable solution that could take on additional users and machines, with the flexibility to add more features as required.

We decided to use open source software so that we could make changes to improve the system, while also contributing back to the community.

What open source solution did you select?

We carried out a survey of many of the open source solutions available and selected Apache Solr/Lucene, which is a feature-rich distributed search engine supporting many of the use cases we had. Not only is this a free software product, but it is built using a free development model. This means the development community driving the direction of the project is completely meritocratic and not linked to any organization.

As a result, there is an active, diverse community which influences and shapes its future. Being part of an open development model means that we can contribute to the software, and get the world to help test the code for us and maintain it. It’s the best of both worlds. As a matter of fact, we now have three committers who contribute directly to the project.

What were the business and user requirements?

We get around 10 million searches a day. As soon as a news story comes in, it has to be made available for search in around 100 milliseconds, and our average search response time is around 180 milliseconds. The system provides intuitive mechanisms which allow clients to create searches and alerts of varying complexity – the most complex can exceed 20,000 characters – and the technology needs to execute these quickly and efficiently.

Chronology and relevance are equally important. Let’s say you’re a day trader who is interested in a certain stock that you trade on a daily basis. You need information in real-time when it comes in. You’re more interested in the most recent news first because a few seconds after the news breaks, it could already be meaningless. The challenge here is deciding in real-time what to show to the customer and what to leave out. We don’t want to serve too many stories and clutter the user experience, nor serve too few and miss out on news the customer cares about.

Then you have portfolio managers and people who assess risk. They often want to know the most important news stories in their areas of interest in the last few hours. Here, chronology is still important – but older news, which may be more relevant, could appear at the top of their search.
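
To make the contrast between these two kinds of user concrete, here is a minimal sketch of one way to blend relevance with recency. It is an illustration only, not a description of Bloomberg's ranking: the `Story` fields, the exponential decay and the half-life values are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Story:
    headline: str
    relevance: float    # hypothetical text-match score from the search engine
    age_seconds: float  # time elapsed since publication


def blended_score(story, half_life_seconds):
    """Discount relevance by an exponential decay that halves every half-life."""
    decay = 0.5 ** (story.age_seconds / half_life_seconds)
    return story.relevance * decay


def rank(stories, half_life_seconds):
    """Return stories ordered from highest to lowest blended score."""
    return sorted(
        stories,
        key=lambda s: blended_score(s, half_life_seconds),
        reverse=True,
    )


stories = [
    Story("Fresh but minor update", relevance=1.0, age_seconds=30),
    Story("Older, highly relevant analysis", relevance=5.0, age_seconds=3 * 3600),
]

# Assumed profile for a day trader: news loses half its value every minute.
trader_view = rank(stories, half_life_seconds=60)

# Assumed profile for a portfolio manager: half-life of six hours.
manager_view = rank(stories, half_life_seconds=6 * 3600)
```

With the short half-life the fresh story ranks first; with the long half-life the older, more relevant story overtakes it. The single tunable half-life is a design simplification that keeps both use cases on one scoring function.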

Tell us about the system redesign.

Like any big system redesign, we went through multiple phases to make it work, then to make it fast and stable. Making it work was probably the easiest of the three goals. However, speed and stability – which are just as important as making it work – are often competing priorities which need to be carefully addressed in tandem.

An interesting example is a product feature called news on a ticker list, where the customer has a list (say, a portfolio) of a few thousand securities and wants news on every one of them. In search engine terms, this is the equivalent of searching for any of a few thousand search terms, which can be challenging to execute quickly. Like many other requirements in this project, this one needed fairly intricate changes to Lucene, the lower-level component of the search engine, and Solr, the application built on top of it.
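
As a rough illustration of what "any of a few thousand terms" looks like as a request, the sketch below builds parameters for Solr's stock terms query parser (`{!terms f=...}`), which accepts a comma-separated list of values for one field. This is not the modified Lucene/Solr code described above. The field names `ticker` and `published`, the `news` collection and the endpoint are hypothetical.

```python
from urllib.parse import urlencode


def ticker_list_params(tickers, field="ticker", rows=20):
    """Build Solr /select parameters matching stories tagged with any ticker."""
    # The terms query parser reads a comma-separated value list by default,
    # so ticker symbols containing commas would need a different separator.
    terms = ",".join(tickers)
    return {
        "q": "*:*",
        "fq": f"{{!terms f={field}}}{terms}",
        "sort": "published desc",  # hypothetical publication-date field
        "rows": rows,
    }


portfolio = ["IBM", "AAPL", "MSFT"]
params = ticker_list_params(portfolio)
query_string = urlencode(params)

# Hypothetical endpoint the query string would be sent to:
#   http://localhost:8983/solr/news/select?<query_string>
```

The list is placed in a filter query rather than the main query because membership in a ticker list is a yes/no condition and does not need to contribute to scoring.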

For a technical deep dive, watch Ram’s presentation from Berlin Buzzwords.

The last three years have been an exercise in redesigning the system and dealing with the challenges we have come across along the way. At the beginning of this year, we released the product to all customers, and this year we have been looking at building on that rewrite, using Solr/Lucene to add functionality on top of our news search back-end.

We have also built an entire solution from scratch for our alerting back-end, which is responsible for updating all our search results in real-time, and letting all our customers know when news is published that matches their interests, all within 100 milliseconds of publishing.
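
Alerting inverts the usual search problem: the queries are stored, and each incoming story is matched against them. The toy sketch below shows that idea under strong simplifying assumptions: every alert is a plain set of terms that must all appear in the story, and tokenization is a whitespace split. The `AlertIndex` class and the alert names are hypothetical and say nothing about how Bloomberg's alerting back-end is implemented.

```python
from collections import defaultdict


class AlertIndex:
    """Toy index of stored alerts, each requiring all of its terms to match."""

    def __init__(self):
        self._by_term = defaultdict(set)  # term -> ids of alerts requiring it
        self._required = {}               # alert id -> number of required terms

    def add_alert(self, alert_id, terms):
        unique = {t.lower() for t in terms}
        self._required[alert_id] = len(unique)
        for term in unique:
            self._by_term[term].add(alert_id)

    def match(self, story_text):
        """Return ids of alerts whose required terms all occur in the story."""
        tokens = set(story_text.lower().split())
        hits = defaultdict(int)
        # Only alerts sharing at least one term with the story are visited.
        for token in tokens:
            for alert_id in self._by_term.get(token, ()):
                hits[alert_id] += 1
        return sorted(a for a, n in hits.items() if n == self._required[a])


index = AlertIndex()
index.add_alert("oil-merger", ["oil", "merger"])
index.add_alert("any-earnings", ["earnings"])
index.add_alert("gold-earnings", ["gold", "earnings"])

matched = index.match("Oil major announces merger ahead of earnings")
```

Indexing the alerts by term means the cost of matching a story depends on the alerts that share a term with it, not on the total number of stored alerts, which matters when the time budget is measured in milliseconds.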

What is your current focus now that the redesign is done?

We are now focusing on three specific areas: improving the relevancy of our search results, trend detection, and faceting. For relevancy, we are in the process of building a mechanism for learning to rank in Solr. This will allow systems using Solr, including ours, to use machine learning to improve relevancy and show the best results for any given search.
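
For readers unfamiliar with learning to rank, a common pattern is to retrieve candidates with a conventional query and then re-order the top few with a model that scores each document from a set of features. The sketch below shows only that re-ranking step with a linear model. The feature names and weights are invented for the example (in practice the weights would be learned from training data), and it does not represent the mechanism being built for Solr.

```python
def rerank(candidates, weights, top_n=100):
    """Re-order the first top_n candidates by a linear model over features."""
    head, tail = candidates[:top_n], candidates[top_n:]

    def model_score(doc):
        # Weighted sum of features; a missing feature counts as 0.0.
        return sum(w * doc["features"].get(name, 0.0) for name, w in weights.items())

    # Documents beyond top_n keep their original order.
    return sorted(head, key=model_score, reverse=True) + tail


# Hypothetical weights; a real system would learn these from judged results.
weights = {"text_score": 1.0, "source_quality": 2.0, "hours_old": -0.1}

# Candidates in the order returned by the initial query.
candidates = [
    {"id": "a", "features": {"text_score": 3.0, "source_quality": 0.2, "hours_old": 1.0}},
    {"id": "b", "features": {"text_score": 2.5, "source_quality": 0.9, "hours_old": 2.0}},
    {"id": "c", "features": {"text_score": 1.0, "source_quality": 0.5, "hours_old": 0.5}},
]

reranked_ids = [doc["id"] for doc in rerank(candidates, weights, top_n=2)]
```

Restricting the model to the top candidates keeps the extra scoring cost bounded, which is one reason re-ranking is a popular way to apply machine-learned models under tight latency budgets.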

We have also used Solr to build a system which detects publication trends, to show our clients which companies are being talked about the most in news and on social media. Leveraging similar technologies, we are also revamping our classification back-end, which annotates news stories with the companies, topics and people mentioned in them.

How have clients responded to these improvements?

The redesign has been one of the most complex architectural moves that I’ve ever been involved with. The reason for this complexity was that, from a user perspective, our work needed to be invisible. Clients should have a much better experience – the data should be better and the search capability should be faster and slicker, even if nothing in front of them is changing.

If our users are able to find the news they are looking for the moment they are looking for it, without having to go anywhere else to find it – then we have succeeded. That’s our end goal.