Importance of Chaos Engineering in DevOps

DevOps builds on agile development. Agile development is an approach to creating software quickly while keeping errors to a minimum. It works iteratively: short cycles of designing, building, and testing produce a prototype, which is then tested and improved cycle after cycle until it becomes a working product.

DevOps takes this a step further. It brings the development and operations teams together to deliver a product quickly. Including the operations team makes the whole process more accountable, spreads responsibility across both teams, and improves collaboration.

Despite these advantages, the fast pace of DevOps and agile development makes them prone to errors. A system-breaking error can seriously damage the trust between a customer and the software supplier, so safeguards need to be put in place. This is where chaos engineering comes into the picture.

What is Chaos Engineering?

Chaos engineering is a method of testing software by intentionally using it in faulty ways and introducing errors. The idea is to test the resilience and reliability of the system.

Many systems seem to work fine as long as they are used in the intended ways. However, they can break and exhibit unwanted behavior if they are used in unintended ways. 

This is a security risk, because such behavior can be used as an attack vector to disrupt the software. In a large-scale deployment, an issue like this can be catastrophic.

Chaos engineering is used to find such errors and bugs in a controlled environment where they cannot do any harm. 

DevOps teams rely on chaos engineering to make sure that their product behaves as intended and does not contain system-breaking bugs.

In 2024, chaos engineering is more important than ever because of the massive market for cloud-based SaaS applications. These applications and software suites are constantly undergoing changes.

Without chaos engineering practices, something like the CrowdStrike outage can happen. That outage cost money not only for the software provider but also for its customers. It has been described as the largest IT outage in history.

Now that you understand the importance of chaos engineering, it is time to learn about its constituents.

The Process of Chaos Engineering

The process of chaos engineering can be broken down into five basic steps. Their particulars are discussed below.

  1. Steady State Hypothesis

The steady state hypothesis refers to defining the normal behavior of the software. This includes its intended function, time taken to do that function, and how many times it is permitted to fail. 

This also includes the behavior of the system during unexpected circumstances. For example, if the system runs on a virtual machine, what will happen if the VM is suddenly powered off during normal operation? Is the system intended to resume as normal after power is restored, or is it supposed to stop working?

The steady state hypothesis lays out all the variables and the ideal performance of the system when each variable is changed. It is commonly understood that the ideal scenario outlined in the hypothesis is rarely achieved in full. This is why the rest of the steps are necessary.
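One way to make a steady state hypothesis concrete is to write it down as data that later experiments can be checked against. The sketch below is only an illustration: the class names, field names, and threshold values (200 ms, 1% errors) are assumptions chosen for the example, not values taken from any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Hypothetical description of 'normal' behavior for one service."""
    max_latency_ms: float          # slowest acceptable response time
    max_error_rate: float          # fraction of requests allowed to fail (0.0 to 1.0)
    recovers_after_restart: bool   # expected behavior once power is restored


@dataclass(frozen=True)
class Observation:
    """What was actually measured during one run."""
    latency_ms: float
    error_rate: float
    recovered_after_restart: bool


def violations(expected: SteadyState, observed: Observation) -> list:
    """Return the names of the hypothesis terms the observation breaks."""
    problems = []
    if observed.latency_ms > expected.max_latency_ms:
        problems.append("latency")
    if observed.error_rate > expected.max_error_rate:
        problems.append("error_rate")
    if observed.recovered_after_restart != expected.recovers_after_restart:
        problems.append("recovery")
    return problems


# Illustrative values only.
hypothesis = SteadyState(max_latency_ms=200.0, max_error_rate=0.01,
                         recovers_after_restart=True)
healthy = Observation(latency_ms=120.0, error_rate=0.002,
                      recovered_after_restart=True)
degraded = Observation(latency_ms=450.0, error_rate=0.002,
                       recovered_after_restart=False)

print(violations(hypothesis, healthy))
print(violations(hypothesis, degraded))
```

Writing the hypothesis as explicit thresholds means every later experiment can be judged by the same rule instead of by impression.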

  2. Designing Failure-Inducing Experiments

Once the ideal working of the software is determined, the next step is to brainstorm methods of making the system deviate from it. This step is necessary because it helps you understand system dependencies. Too many people assume happy scenarios and fail to consider failure scenarios. This results in outages like the aforementioned CrowdStrike outage.

Our earlier example of cutting the power to a VM is applicable here. 

In the hypothesis stage, we only defined the ideal behavior. In the experiment design phase, you have to define how you would practically create the scenario in which the software needs to be tested. 

For example, the power of a VM can be cut in several ways. 

  • You can simply shut down the VM from the host machine
  • You can shut down the host machine itself
  • You can delete the VM
  • You can suspend the VM

There are many ways of doing essentially the same thing. In this step, you need to come up with as many experiments as you possibly can.
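The list of VM power-cut methods above can be turned into an experiment catalogue fairly mechanically. The sketch below is a rough illustration: the four faults come from the list, while the `TIMINGS` dimension, the `Experiment` class, and the `build_catalogue` helper are assumptions introduced here to show how combining dimensions multiplies the number of experiments.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Experiment:
    """One planned failure scenario (hypothetical structure)."""
    fault: str      # how the failure is induced
    timing: str     # when it is induced (assumed extra dimension)
    expected: str   # outcome taken from the steady state hypothesis


# The four ways of cutting power to a VM described in the text.
FAULTS = [
    "shut down the VM from the host",
    "shut down the host machine",
    "delete the VM",
    "suspend the VM",
]

# Assumed second dimension, added only to illustrate combinations.
TIMINGS = ["while idle", "during a write"]


def build_catalogue(faults, timings, expected):
    """Pair every fault with every timing to enumerate experiments."""
    return [Experiment(fault, timing, expected)
            for fault, timing in product(faults, timings)]


catalogue = build_catalogue(FAULTS, TIMINGS,
                            expected="as defined in the steady state hypothesis")

for experiment in catalogue:
    print(f"{experiment.fault} ({experiment.timing})")
```

Each extra dimension multiplies the catalogue, which is the point of this step: the wider the set of scenarios, the fewer untested failure paths remain.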

The question now is whether these different methods produce the same outcome or different ones, and whether that outcome matches what we defined in the steady state hypothesis.

These questions are answered when you actually run the experiments, which is the next step.

  3. Conducting Experiments in a Secure Environment

Once you have devised all the experiments, it is time to conduct them in a secure environment. For example, if your software is cloud-based and distributed, then your experiments need to check the networking aspect of it. 

That would require using a lot of VMs and extensive use of an internal network where you know the system is safe from external harm. Then you can test the system by deliberately messing with DNS records and other network settings. Things you can do include but are not limited to:

  • Introducing extra latency
  • Removing DNS records or altering them erroneously
  • Making the TTL of records extremely long to check the impact on DNS propagation
  • Using different protocols than the system was designed for (e.g., UDP instead of TCP)

You also need to tabulate the results of these experiments. For example, how was the system response time affected by introducing latency? How much did it change as the amount of latency changed? Did increasing the DNS propagation time alter the system's function? Did altering the DNS records make certain resources unavailable? Once the results are tabulated, they can be analyzed.

This is just one example. However, you get the point. The experiments need to be conducted in an environment where unexpected variables and outcomes cannot do harm.
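As a small illustration of injecting latency and tabulating the outcome, the sketch below wraps a stand-in service with an artificial delay and records one row per trial. Everything here is an assumption made for the example: the `FakeClock` replaces real waiting so the code runs instantly, the service is a dummy that "works" for 50 ms, and the 500 ms timeout is arbitrary. A real experiment would inject the delay at the network level inside the isolated environment.

```python
import csv
import io


class FakeClock:
    """Deterministic stand-in for a real clock, so the sketch runs instantly."""

    def __init__(self):
        self.now_ms = 0

    def sleep_ms(self, duration_ms):
        self.now_ms += duration_ms


def inject_latency(func, delay_ms, clock):
    """Wrap func so every call first waits delay_ms on the given clock."""
    def wrapper(*args, **kwargs):
        clock.sleep_ms(delay_ms)
        return func(*args, **kwargs)
    return wrapper


def run_latency_sweep(delays_ms, work_ms=50, timeout_ms=500):
    """Run one trial per injected delay and return one result row per trial."""
    clock = FakeClock()

    def service():
        clock.sleep_ms(work_ms)  # pretend the service spends work_ms on the request
        return "ok"

    rows = []
    for delay_ms in delays_ms:
        slowed = inject_latency(service, delay_ms, clock)
        start = clock.now_ms
        slowed()
        elapsed = clock.now_ms - start
        rows.append({
            "injected_ms": delay_ms,
            "elapsed_ms": elapsed,
            "timed_out": elapsed > timeout_ms,
        })
    return rows


def to_csv(rows):
    """Render the result rows as CSV text for later analysis."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["injected_ms", "elapsed_ms", "timed_out"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


results = run_latency_sweep([0, 100, 500])
print(to_csv(results))
```

Keeping the results in a flat table like this makes it straightforward to compare them against the steady state hypothesis in the next step.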

  4. Analyzing Results

Finally, you have to analyze the results of the experiments.  The results show how much the system performance deviated from your steady state hypothesis. 

Not only must the deviation be measured, but its cause must be found as well. After all, you cannot fix a large deviation from expected behavior without learning what caused it.

The larger the deviation, the more tweaking the system needs. If every result deviates beyond an acceptable margin, the software is considered unfit for deployment.

At that point, it is worth considering a complete overhaul. If the deviations are minor, then solutions and fixes can be devised instead.
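The decision described above can be sketched as a simple comparison between expected and observed values. This is a rough illustration under stated assumptions: the 20% margin, the experiment names, and the three verdict labels are invented for the example, and a real analysis would also record the cause behind each deviation.

```python
def relative_deviation(expected, observed):
    """Deviation of an observed value from the expected one, as a fraction."""
    return abs(observed - expected) / expected


def assess(results, margin=0.2):
    """Classify a set of experiment results.

    results maps an experiment name to an (expected, observed) pair.
    margin is the assumed acceptable relative deviation (20% here).
    """
    deviations = {
        name: relative_deviation(expected, observed)
        for name, (expected, observed) in results.items()
    }
    failing = [name for name, value in deviations.items() if value > margin]

    if not failing:
        verdict = "within hypothesis"
    elif len(failing) == len(deviations):
        verdict = "overhaul"          # every result is off by more than the margin
    else:
        verdict = "targeted fixes"    # only some results deviate
    return verdict, failing


# Illustrative response times in milliseconds: (expected, observed).
mixed = {"vm_shutdown": (200, 210), "dns_latency": (200, 400)}
all_bad = {"vm_shutdown": (200, 600), "dns_latency": (200, 400)}

print(assess(mixed))
print(assess(all_bad))
```

The margin is a policy decision rather than a technical constant, so it should come from the steady state hypothesis and not be tuned after the results are in.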

  5. Devising Solutions

The insights gained from the experiments must be used to create patches and fixes that make the software more resilient. Resilience here refers to the ability of the software to handle unexpected variables and perform according to the steady state hypothesis. 

If fixes are not possible, then at least an action plan needs to be created. The plan’s aim should be to minimize damage caused by the erroneous behavior. If the blast radius of the error is not big enough to be worrisome, then the software can be deployed after drafting an action plan. 
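What a fix looks like depends entirely on the problem found, but one common pattern is to retry a failing dependency a few times and then fall back to a degraded answer, which limits the blast radius when the dependency stays down. The sketch below is only an example of that pattern: the function name, the retry count, the delays, and the cached fallback are all assumptions, not a prescribed solution.

```python
import time


def call_with_retry(operation, fallback, attempts=3, base_delay_s=0.1,
                    sleep=time.sleep):
    """Try operation up to `attempts` times, then use fallback.

    The wait doubles after each failed attempt (exponential backoff).
    `sleep` is injectable so the behavior can be tested without waiting.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt < attempts - 1:
                sleep(base_delay_s * 2 ** attempt)
    return fallback()


def always_down():
    """Hypothetical dependency that never recovers."""
    raise ConnectionError("dependency unavailable")


# Record the waits instead of actually sleeping.
recorded_delays = []
answer = call_with_retry(always_down, fallback=lambda: "cached value",
                         sleep=recorded_delays.append)
print(answer, recorded_delays)
```

After a change like this, the same chaos experiments should be run again to confirm that the system now stays within the steady state hypothesis.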

The solutions you need depend on the problems your software faces, so they have to be worked out on a case-by-case basis.

Conclusion

So, there you have it: chaos engineering. We learned that chaos engineering is a highly useful process in DevOps. Its aim is to identify and fix problems in software by deliberately trying to break it. Different types of experiments are devised and conducted in a controlled environment, and the insights gained from them are used to implement a solution. This is chaos engineering in a nutshell.
