Tech At Bloomberg

Bloomberg Bets Big on SREs

March 26, 2018

When Bloomberg reached out to Tucker Vento about an opportunity as a Site Reliability Engineer (SRE), he wasn’t initially sold. “Working on a news site didn’t sound like a lot of fun,” he says. “It was one of my friends who said, no, no, talk to them.”

But the job he was holding didn’t have the level of impact he wanted. “In my prior role, I felt like I was working against the developers a lot of the time,” he says. “I was there to make sure the product stayed quick, and they wanted to do things that slowed it down.” He thought he could perhaps have more influence and job satisfaction at a place that emphasized the quality of its technology, not just the number of features the software had. He quickly learned that Bloomberg is much more than a news organization; it’s a trusted network of people and information used by financial professionals around the globe, powered by incredibly complex technology infrastructure.

Now an SRE at Bloomberg, he collaborates with a group of engineers that builds messaging software that handles more than a billion messages a day. Everyone in Bloomberg’s 5,000+ person Engineering department understands the importance of reliability and speed. “If my opinion is about reliability, well, our developers want to be on that side too,” Tucker says. “That shared responsibility creates a more collaborative culture.”

During the past two years, Bloomberg has made a huge investment in SRE talent, and is looking for additional SREs to join teams across the company to continue making its systems better. This effort coincides with Bloomberg’s products and systems getting more advanced and a growing number of institutional clients turning to the company as a single, enterprise-wide unified data source. Simply put, “The need to have dedicated SRE teams that focus solely on stability, availability, and scale has become even more important,” says Stig Sorensen, head of the Production Visibility group within Bloomberg’s Technology Infrastructure Engineering organization.

Three years ago, Garry Ryan joined Bloomberg to lead the London Feeds development team, which creates the software that pushes market data into the Bloomberg Terminal for more than 500 stock exchanges around the world. He is now the global Feeds SRE manager. “Our senior leadership understands that this is massively important,” he says. “The global head of Engineering, Vlad Kliatchko, is incredibly supportive of SRE and is pushing a whole bunch of tech improvements that are making real progress. The team is very motivated because they understand what they’re doing is so important.”

At Bloomberg, there are essentially two types of SREs: Infrastructure SREs and Application SREs. Infrastructure SREs work on Bloomberg’s Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) offerings, while Application SREs partner with Application Development teams to improve the stability of a particular product. Together, they build tools that tackle highly complex and ever-changing problems. “There needs to be strong collaboration between the two teams,” Stig says. “It is not so much about simply handing software over to an SRE team, as it is working with them to make sure it is stable and reliable.”

In both cases, Bloomberg SREs are viewed as a special type of software engineer. Instead of being obsessed with features, they care about performance ‒ stability, availability, reliability, and operability. All Bloomberg SREs (both Infrastructure and Application) focus on these SRE principles: Monitoring; Provisioning, Configuration and Orchestration; Capacity Management; Deployment and Rollback; and Incident Management.

In addition, since it’s very difficult to have stable systems without a solid development process, software development life cycle (SDLC) practices can also be included. In certain cases, it makes sense for SRE teams to take the lead on improving Continuous Integration / Continuous Deployment (CI/CD), automated testing frameworks, as well as build and release engineering (through code quality tools).

“Our SREs are united by a common vision of harnessing the power of automation through software development to deliver reliable, stable services to our clients,” says Sorensen. “They care about how we can manage our infrastructure and applications more efficiently, and they do that through software development.”

In many cases, Bloomberg’s strongest SREs were once application developers with an eye for system availability and stability who transitioned into this new role to help make the product even more reliable. Bloomberg has also recently turned its search for SREs outward, hiring from other tech giants. They are all working together to continue developing the SRE function within Bloomberg, the opportunities available to them, and the impact they can have.

“I think of our team as force multipliers,” says Saru Thuraiman, an SRE team lead who works with the Trading Systems engineering group to ensure the availability of the buy-side (AIM) and sell-side order management systems (TOMS and SSEOMS) that are used by thousands of professionals across the financial industry. “It could be positive or negative. Anything we do that is good will be awesome for everybody a level up. But the negative impact could also be huge. As a result, we need to be thoughtful, have empathy towards our feature teams and improve the overall stability of the system through SRE principles.”

Saru enjoys being a member of the SRE community at Bloomberg and encourages others to jump at the opportunity as well. At a few larger companies, he observes, SRE is almost a solved problem, and a new hire walks in as one of many existing SREs. At Bloomberg, he says, “We have a successful product, and we’re at an inflection point where we’re trying to modernize and adopt more of these SRE principles. Whoever joins the SRE team now gets to make that impact. There’s a lot of room for innovation and change.”

After a few years working in Asia and Australia, James Vautin recently moved home to New York, where he joined Bloomberg. Looking back, he realizes he’d been doing a lot of the work of an SRE, even before the term became popular: trying to keep a real estate website up and running, monitoring systems for a telecommunications company, and trying to match software demands and infrastructure at a mission-critical data center. When looking for his next position, he knew he wanted to do something that mattered. “It’s a much bigger deal if a system goes down here than if someone can’t post a status update on Facebook,” he says.

As an early member of the Data & Analytics Infrastructure team, he’s now getting an opportunity to build a data science platform that will allow Bloomberg’s software engineers to use state-of-the-art tools like Spark, TensorFlow, and the company’s sizable GPU footprint in a consistent, easy-to-use way to build applications that leverage machine learning. “This is at the forefront of what is going on in technology right now,” he adds. “And we’re using Kubernetes, which has been a technology I’ve been interested in for a while.”

Yoga Ramalingam became an SRE team leader about two years ago, responsible for the internal telemetry system that monitors all of the company’s thousands of servers running a combination of open source and proprietary software.

“With open source, the challenge is to pick and choose what’s going to work for you,” he says, noting that a lot of open source software has cool features, but will break when deployed at scale inside an enterprise. He finds the best open source products, determines if they will scale and support all the platforms he needs to monitor and, if not, decides whether or not they’re worth trying to fix. “You’re learning new things, and someone is paying you to play with these toys,” he notes. If Yoga believes a particular open source application could be altered to accommodate Bloomberg’s needs, he dives in (such as with his contribution of CollectdWin, a Windows agent similar to ‘collectd’). “You feel really proud because you are contributing to the open source community,” he says. “Our senior leaders encourage us to do that. They will always find time for us to do it.”

Of course, the concept of SREs is not new to Bloomberg. What’s new is the structure. Why the formal shift? The problem, says Sorensen, is that “an SRE who works on a software development team won’t have any time to really do SRE work.”

That sounds familiar to Arundhati Kogekar, who has spent ten years with Bloomberg’s Communications Channels engineering team, which is responsible for more than a billion email and Instant Bloomberg (IB) messages a day, as well as a substantial amount of audio and video. “These systems are so highly visible and so critical,” she says. “If a post is even a little bit delayed, a customer will notice, since that slight delay could cost them a lot of money.” The team knew how important it was to have the system up and running properly. Still, she adds, “That had to be balanced with continually pushing out new features. We couldn’t focus completely on stability.”

Now, she and her team in Communication Channels engineering can do just that. In April 2017, Arundhati started her department’s first SRE team. “Building this new team helped us focus our energies,” she says.

Her work as an SRE has also boosted her profile internally. “This has been a big experience in forging my own way,” she says. Arundhati’s been strategizing about how the team should function, and how it will work with the feature teams. She’s also learning how to encourage the feature teams to be just as invested in stability as her team is. “Because we are pioneers, this has come under a lot of scrutiny, but we also have a lot of backing and support from our senior leadership,” she notes. “If we can pull it off, it’ll be a great model for how SREs can work with feature teams on building highly scalable and reliable products, while also keeping up with the fast pace of feature releases.”

Michael Rembetsy, the manager of Automation & Systems in Bloomberg’s Technology Infrastructure Engineering group, points out that for an SRE, Bloomberg’s size and scale are ideal. The company is big enough to allow people to move between different areas of interest, but small enough that everyone knows each other and actively collaborates. “It’s the best of both worlds,” he says. And the work at Bloomberg is particularly challenging because the teams are providing global services for market data ‒ tens of billions of data points come in from thousands of sources around the world every day. “There are very few places you’re going to walk in and deal with the scale we have here,” he says. “As an SRE, the challenge of scale is always a good one.”