Bloomberg has adopted Kubernetes, the open source system for deploying and managing containerized applications that has gained a great deal of industry momentum, as part of its infrastructure. As a result, its systems are becoming more distributed than ever before, running on machines scattered around the globe and across the cloud. This means there are more moving parts, any of which could fail for a long list of reasons.
Systems engineers want to feel confident that the complex systems they’ve built will withstand problems and keep running. To do that, they run batteries of elaborate tests designed to simulate all sorts of problems. But it’s impossible to imagine every potential problem, let alone plan for all of them.
“When you have something so complex that it’s difficult to predict things, the only reasonable way to deal with that is to simulate the kinds of failure you’re expecting to see before they happen to you in production,” says Mikolaj Pawlikowski, a London-based software engineer with Bloomberg’s Data Technologies team.
Because problems in the real world don’t occur on a schedule, the best way to test is to cause problems at random. Netflix, the streaming video provider, developed a tool it calls Chaos Monkey, which randomly terminates virtual machines running on Amazon Web Services (AWS). “It would take down nodes so that engineers could gain confidence the application would keep running anyway,” Pawlikowski says.
Inspired in part by Chaos Monkey, Bloomberg has built its own tool for testing Kubernetes clusters, called PowerfulSeal, which Pawlikowski presented at KubeCon + CloudNativeCon North America 2017 in Austin, TX.
The tool is aimed specifically at Kubernetes, and includes the ability to describe the objects running in each container so that it knows precisely which things it needs to break for testing purposes.
It also has an interactive mode that allows systems engineers to experiment and see how it behaves on their clusters and, over time, build their own testing policies.
“You can just point it at a cluster and ask it to delete things, take things up and down, and execute arbitrary commands,” Pawlikowski says. “It lets you get a good idea of how resilient the application is, and then, with that experience, you can write policies in YAML and deploy them.”
Those policies can be fine-tuned in many ways, including rules for the time of day, the probability of a failure, how much of the application to break, and where to break it. Once deployed, the tool runs in autonomous mode. And like Chaos Monkey before it, PowerfulSeal is being released as an open source tool via Bloomberg’s GitHub repository.
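To make the idea of such a policy concrete, here is a minimal Python sketch of how rules for time of day, probability, how much to break, and where to break it might combine to pick targets. It is an illustration only: the `Policy` class, its field names, and the `select_victims` function are hypothetical and do not reflect PowerfulSeal's actual YAML schema or code.

```python
import random
from dataclasses import dataclass
from datetime import time


@dataclass
class Policy:
    """Hypothetical policy; field names are illustrative, not PowerfulSeal's schema."""
    namespace: str       # where to break things
    start: time          # earliest time of day the policy may act
    end: time            # latest time of day the policy may act
    probability: float   # chance that each matching pod is selected
    max_fraction: float  # upper bound on the share of matching pods to break


def select_victims(pods, policy, now, rng):
    """Return the names of the pods the policy would terminate at time `now`."""
    # Time-of-day rule: do nothing outside the allowed window.
    if not (policy.start <= now <= policy.end):
        return []
    # "Where" rule: only pods in the targeted namespace are candidates.
    candidates = [p["name"] for p in pods if p["namespace"] == policy.namespace]
    # Probability rule: each candidate is picked independently.
    chosen = [name for name in candidates if rng.random() < policy.probability]
    # "How much" rule: never exceed the configured share of candidates.
    limit = int(len(candidates) * policy.max_fraction)
    rng.shuffle(chosen)
    return sorted(chosen[:limit])


if __name__ == "__main__":
    pods = [{"name": f"web-{i}", "namespace": "web"} for i in range(4)]
    pods.append({"name": "db-0", "namespace": "db"})
    policy = Policy("web", time(9, 0), time(17, 0), probability=0.5, max_fraction=0.5)
    print(select_victims(pods, policy, time(12, 0), random.Random()))
```

Capping the selection at a fraction of the matching pods keeps a single run from taking down the whole application, which is the kind of safeguard the "how much to break" rule implies.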
PowerfulSeal currently includes drivers for Kubernetes clusters running on OpenStack, but Pawlikowski says he’s hopeful the open source community will create additional drivers to use it with other cloud platforms, including Amazon Web Services EC2, Microsoft Azure, and Google Compute Engine. He’s also hoping the community will create some “pretty crazy filters” that will enable testing for additional kinds of scenarios. “I hope the community will make this PowerfulSeal even more powerful,” he added.