Performance engineering: what is 'chaos testing' in application development?
Performance engineering is the activity of making software applications perform better. It's a holistic approach to performance testing and the best practices associated with it. Over the last decade, 'chaos testing' has emerged as an important part of this testing methodology. It helps to ensure applications perform well despite failures or unexpected events.
'Just as athletes can’t win without a sophisticated mixture of strategy, form, attitude, tactics, and speed, performance engineering requires a good collection of metrics and tools to deliver the desired business results.'
— Todd DeCapua, IT engineer and writer
A 'good collection of metrics and tools' has to cover as many situations as possible - including the extreme ones.
What is chaos testing?
Chaos testing (or chaos engineering) is the practice of deliberately subjecting a software system to unexpected or extreme conditions. The result is a measure of how robustly the system withstands such events, obtained under controlled conditions rather than during a real production incident. It lets app developers identify and learn from failures before they become outages.
The Simian Army
Let's talk about Netflix. The content streaming giant built a chaos testing framework after it began moving to a distributed cloud architecture on AWS (Amazon Web Services) in 2008. In their new home, they created Chaos Monkey, a tool designed to randomly kill instances and services within their architecture to see how well the overall system kept running despite these failures.
The idea behind this kind of chaos testing is to build resilience proactively, so that when abnormal or unplanned events arise in the future, the software can withstand them.
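To make the idea concrete, here is a minimal, purely in-memory sketch of a Chaos Monkey style experiment. It is not Netflix's implementation and touches no real cloud API: the `Instance`, `Service`, `unleash_monkey` and `run_experiment` names are all hypothetical, and the "service" is assumed to fail over to the first healthy replica.

```python
import random


class Instance:
    """A hypothetical stand-in for a server instance."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Service:
    """Routes each request to the first healthy instance (simple failover)."""

    def __init__(self, instances):
        self.instances = instances

    def handle(self, request):
        for instance in self.instances:
            try:
                return instance.handle(request)
            except ConnectionError:
                continue  # try the next replica
        raise RuntimeError("no healthy instances left")


def unleash_monkey(service, kills, rng):
    """Terminate `kills` randomly chosen live instances; return their names."""
    alive = [i for i in service.instances if i.alive]
    victims = rng.sample(alive, min(kills, len(alive)))
    for victim in victims:
        victim.alive = False
    return [v.name for v in victims]


def run_experiment(replicas=3, kills=2, seed=None):
    """Return True if the service still answers after the random kills."""
    rng = random.Random(seed)
    service = Service([Instance(f"node-{n}") for n in range(replicas)])
    unleash_monkey(service, kills, rng)
    try:
        service.handle("GET /health")
        return True
    except RuntimeError:
        return False
```

With three replicas and two kills, `run_experiment()` reports that the service survives; killing all three makes it fail. A real experiment would replace the in-memory objects with actual infrastructure and a proper health check.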
This developed into the tool suite known as 'The Simian Army'. The army consists of too many troops (a.k.a. test types) to cover in detail here, but includes Chaos Gorilla, Latency Monkey and 10-18 Monkey. These all replicate different types and scales of failure-inducing activity.
From there, the engineers at Netflix created Spinnaker, an open-source, multi-cloud continuous delivery platform. Several members of The Simian Army have since been absorbed into this platform. It's worth noting the Chaos Monkey system can only be used within an application managed by Spinnaker.
Spinnaker isn't your only option, though. Other tools, such as Failure Injection Testing (FIT) and Gremlin, can be used more widely for chaos engineering, and they can inject more kinds of failure than just killing instances. For Kubernetes, check out Litmus and Chaos Mesh as well.
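As one example of a failure variant other than killing instances, the sketch below injects artificial latency into a dependency and checks that the caller falls back gracefully. It does not use the API of Gremlin, FIT, Litmus or Chaos Mesh; `with_latency`, `call_with_timeout` and `fetch_recommendations` are hypothetical names chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def with_latency(func, delay_seconds):
    """Wrap func so every call is delayed, imitating a slow dependency."""
    def slow(*args, **kwargs):
        time.sleep(delay_seconds)
        return func(*args, **kwargs)
    return slow


def call_with_timeout(func, timeout_seconds, fallback):
    """Run func in a worker thread; return fallback if it takes too long."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block the caller waiting for the slow call to finish.
        executor.shutdown(wait=False)


def fetch_recommendations():
    """Hypothetical downstream call that normally answers immediately."""
    return ["title-a", "title-b"]


# Healthy dependency: the real answer comes back.
healthy = call_with_timeout(fetch_recommendations, 0.5, fallback=[])

# Injected latency (0.3 s) exceeds the 0.05 s budget: the fallback is used.
degraded = call_with_timeout(
    with_latency(fetch_recommendations, 0.3), 0.05, fallback=[]
)
```

The point of such an experiment is to confirm that a slow dependency degrades the user experience (here, an empty recommendation list) instead of hanging the whole request.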
Signs you should be doing chaos testing
If a digital monkey got into your system and started pulling out the metaphorical wiring, would your application hold up? Here are four compelling reasons you might want to start doing chaos testing:
- You'd like to reduce the number of outages you're currently experiencing.
- You want to minimize mean time to recovery (MTTR).
- You want to communicate to stakeholders that your application won't suffer from costly and unexpected downtime (98 percent of businesses said that one hour of downtime could cost up to $100,000).
- You are about to launch your application beyond alpha and beta stages, and are looking for good reviews from the general public or clients.
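Mean time to recovery, mentioned in the list above, is simply the average time between an incident starting and service being restored. A minimal sketch of the calculation follows; the `mean_time_to_recovery` function and the incident timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average (recovered - started) over a list of (started, recovered) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (recovered - started for started, recovered in incidents),
        timedelta(),  # start value so that sum() adds timedeltas
    )
    return total / len(incidents)


# Hypothetical incident log: outages lasting 30, 60 and 90 minutes.
incidents = [
    (datetime(2023, 1, 5, 9, 0), datetime(2023, 1, 5, 9, 30)),
    (datetime(2023, 2, 11, 14, 0), datetime(2023, 2, 11, 15, 0)),
    (datetime(2023, 3, 20, 22, 0), datetime(2023, 3, 20, 23, 30)),
]
mttr = mean_time_to_recovery(incidents)  # timedelta of 1 hour
```

Tracking this figure before and after introducing chaos experiments is one way to see whether recovery is actually getting faster.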
Capgemini's World Quality Report recommends that 25 percent of a development team's budget should go towards Quality Assurance.