Building a Real-Time News Search Engine

Bloomberg developer Dan Collins reveals how the team at Bloomberg R&D built a real-time search engine and alerting service with open source technology.

Time is precious when it comes to accessing data, particularly business or financial information that investors use to make trades. When a company goes public or announces a new product, informed decisions need to be made within seconds, or even milliseconds.

Today, financial information piles up far quicker than humans can parse it. Few understand the complexity of the challenge better than Dan Collins, a telecom R&D veteran who is now a senior developer in Bloomberg's London R&D office.

Collins recently flew to New York City to give a talk about Bloomberg’s approach to solving this problem by building a real-time search and alert framework using open source search software Lucene and Solr.

As open source technology becomes increasingly important to companies, Collins talks about the search ecosystem, how the team built this framework and the challenges they overcame.

This transcript of Collins's talk has been edited for length.

The Surprising Thing about Bloomberg

Bloomberg has this image as being a finance company, or a TV station, depending on how you look at it. But we’re really a technology company.

If, for example, you take a deeper look at the news content we provide on the Bloomberg Terminal, you'll see that in addition to the stories by Bloomberg News there are stories from other major news organizations like The New York Times, the Wall Street Journal and the BBC. We pull data from about 125,000 sources and share it with clients. So when our Bloomberg Terminal subscribers run eight million searches every day – roughly a hundred every second – and have a million and a half saved searches, it's a big deal.

Search Query Complexity

The news stories range from an 80-character headline to a 10MB research document whose text is extracted from a PDF, and the searches themselves can be up to 20,000 characters long. Users search by both keywords and topics; queries can include proximity ("near") searches with slop factors, and zoning – restricting where in the story the match must appear. On top of that, users can apply further filters, such as time-based ranges.
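The idea of a slop factor can be illustrated without any search library: a proximity query matches when its terms occur within a given number of positions of each other. This is a simplified sketch of that concept, not Bloomberg's engine or Lucene's exact semantics (Lucene's slop is an edit-distance measure over phrase positions):

```python
def within_slop(tokens, term_a, term_b, slop):
    """Return True if term_a and term_b occur within `slop` intervening
    positions of each other in the token list."""
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(a - b) - 1 <= slop for a in pos_a for b in pos_b)

doc = "bloomberg announces new terminal search features today".split()
print(within_slop(doc, "terminal", "features", 1))  # True: one word apart
print(within_slop(doc, "bloomberg", "today", 1))    # False: five words apart
```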

It gets even more complicated when you set up alerts on these searches, because alerts – and people's expectations of them – have evolved. Our users have a variety of delivery options, including emails, instant messages and pop-ups. We've also got scrolling alerts that appear on screen as news stories come in that match a saved search query. Including those, we generate about five and a half million alerts a day.

So the question is: how do we build our own platform for real-time search queries and alerts, one that gives us the flexibility to provide more sophisticated ways for our users to get the information they need as quickly as possible?

The Bloomberg Solution

At Bloomberg, we’ve been active in the Solr community for quite some time. So, for the alerting system we needed to build, we started looking at the Luwak library, which Alan Woodward and the rest of the Flax search team over in Cambridge had developed.

Luwak has what it calls a “pre-searcher”, which filters out queries before they are run when we know there’s no way they can possibly match a given document. We can be more efficient because we’ve got fewer searches to go over.
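The idea behind a pre-searcher can be sketched in a few lines: index the registered queries by the terms they mention, then use an incoming document's terms to select only the queries that could possibly match. This is a simplified illustration of the concept, not Luwak's actual implementation; the query ids and terms are made up:

```python
from collections import defaultdict

# Registered saved searches: query id -> terms the query mentions
# (simplified; real queries are full Lucene query trees).
queries = {
    "q1": {"ipo", "tech"},
    "q2": {"earnings", "q3"},
    "q3": {"merger"},
}

# Pre-search index: term -> ids of queries mentioning that term.
by_term = defaultdict(set)
for qid, terms in queries.items():
    for term in terms:
        by_term[term].add(qid)

def candidates(doc_terms):
    """Return ids of queries sharing at least one term with the document,
    i.e. the only queries worth actually running against it."""
    hits = set()
    for term in doc_terms:
        hits |= by_term.get(term, set())
    return hits

doc = {"tech", "ipo", "london"}
print(candidates(doc))  # {'q1'}: only q1 can possibly match
```

With a million and a half saved searches, pruning the candidate set this way before any real matching happens is what makes per-document alerting tractable.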

The second phase is to process the document: you index it as normal into an in-memory index (which can literally be a Lucene MemoryIndex), and what you get out is a list of matching queries. There are also some extensions we’re working on around scoring, so you can score particular alerts.
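The two phases together can be sketched as: pre-select candidate queries, then evaluate only those against the document itself. Here a trivial "all terms present" test stands in for running the query against a real in-memory Lucene index; the names are illustrative:

```python
def match(doc_terms, queries, candidate_ids):
    """Evaluate only the pre-selected candidate queries against the doc.
    A query 'matches' here if all of its terms appear in the document --
    a stand-in for executing it against a Lucene MemoryIndex."""
    return [qid for qid in candidate_ids
            if queries[qid] <= set(doc_terms)]

queries = {"q1": {"ipo", "tech"}, "q2": {"merger"}}
doc = {"tech", "ipo", "london"}
print(match(doc, queries, ["q1", "q2"]))  # ['q1']
```

The output of this phase – the list of matched query ids – is what drives the alert deliveries (email, instant message, pop-up) described earlier.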

While we used the open source technologies Lucene and Luwak, we also contributed to them to make them work for us. For example, Luwak needs to be able to extract terms out of all types of queries, so there are a couple of term extractors we’ve had to send back. Filters were something it didn’t support particularly well, so we ended up rewriting most of our filter queries as alternative query types. We also found a bug in Luwak: certain alerts started failing consistently in one particular language – for us it was Korean.
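Term extraction is essentially a walk over the query tree, collecting terms the pre-searcher can index. For a disjunction every branch must contribute terms, while for a conjunction any single branch's terms are enough to filter on, so a small branch is preferable. This is a simplified sketch of that strategy, not the extractors actually contributed to Luwak:

```python
def extract_terms(query):
    """Collect terms usable for pre-filtering a query.
    `query` is a nested tuple: ("term", t), ("or", [...]) or ("and", [...]).
    For OR, terms from every branch are needed; for AND, one branch's
    terms suffice, so we pick the smallest set."""
    kind = query[0]
    if kind == "term":
        return {query[1]}
    if kind == "or":
        out = set()
        for sub in query[1]:
            out |= extract_terms(sub)
        return out
    if kind == "and":
        return min((extract_terms(sub) for sub in query[1]), key=len)
    raise ValueError(f"unsupported query type: {kind}")

q = ("and", [("term", "ipo"),
             ("or", [("term", "tech"), ("term", "biotech")])])
print(extract_terms(q))  # {'ipo'}
```

Query types whose terms cannot be extracted this way (such as some filters) defeat the pre-searcher, which is one reason rewriting filter queries as ordinary query types pays off.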

Finally, we built a custom component, which is still in the works. At the moment the matching application is a separate web application, but the goal is to integrate it back into Solr.

What’s Next

Corporate technology has become highly complex. At the lower levels of the stack, innovators know that proprietary software can cause more problems than it solves. A lot of companies are deciding they can’t sit behind closed doors any more, and they need to get more involved in open source. Like mobile device makers and car manufacturers before them, corporate tech departments are beginning to join together around open standards, contributing back to the infrastructure that all modern businesses need to thrive. We have been increasing our contribution over the last few years and have three Solr committers in the company.