Chaos engineering

A brief introduction.

I recently published an npm package called chaos-engine that runs destructive tests on a function. It comes with a predefined array of arguments, passes each of them to whatever function you provide, and then returns a chaos report showing the argument(s) passed and how your function responded.
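
To make the idea concrete, here is a minimal sketch of how a destructive test runner of this kind could work. It is not the actual chaos-engine API: the names `HOSTILE_ARGS`, `runChaos`, and `shout` are made up for illustration, and the real package's argument list and report format may differ.

```javascript
// Hypothetical sketch, NOT the real chaos-engine API.
// A fixed list of awkward inputs, similar in spirit to a predefined argument array.
const HOSTILE_ARGS = [undefined, null, NaN, 0, -1, "", "abc", [], {}, () => {}];

// Calls fn once per hostile argument and records whether it returned or threw.
function runChaos(fn, args = HOSTILE_ARGS) {
  return args.map((arg) => {
    try {
      return { arg, status: "returned", value: fn(arg) };
    } catch (err) {
      return { arg, status: "threw", error: String(err) };
    }
  });
}

// Example target: fine for strings, but written without other inputs in mind.
function shout(text) {
  return text.toUpperCase() + "!";
}

const report = runChaos(shout);
const failures = report.filter((entry) => entry.status === "threw");
console.log(`${failures.length} of ${report.length} inputs made shout() throw`);
```

Each report entry pairs an argument with the outcome, so the inputs that were never considered stand out. This sketch only handles synchronous functions; a function that returns a promise would need its rejection awaited and recorded separately.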

I created the package in response to a prompt on chaos engineering.

What is Chaos Engineering?

Chaos Engineering is about making the chaos inherent in the system visible. 1

As a programmer, developer, or engineer, it is (I believe) common practice to ensure that the system or program you have built does what it was designed to do. The program should also be able to handle any errors that come up, especially those caused by forces outside your control (external servers shutting down, users entering invalid values, etc.).

Chaos engineering is the practice of testing a system's resilience by occasionally causing it to fail, that is, by intentionally trying to break it. Doing so helps expose potential errors and weaknesses in the system, so they can be fixed or managed before they become bigger issues and lead to losses.
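
At the scale of a single function, "causing it to fail on purpose" can be sketched as a wrapper that makes a dependency fail some of the time. This is only an illustration under stated assumptions: `withInjectedFailures` and `getGreeting` are hypothetical names, and real chaos tooling works on running infrastructure rather than on in-process function calls.

```javascript
// Hypothetical helper: wraps fn so that a fraction of calls fail on purpose.
// `random` is injectable so the behavior can be made deterministic in tests.
function withInjectedFailures(fn, failureRate, random = Math.random) {
  return (...args) => {
    if (random() < failureRate) {
      throw new Error("Injected failure");
    }
    return fn(...args);
  };
}

// A caller that is expected to survive the failure by falling back.
function getGreeting(fetchName) {
  try {
    return `Hello, ${fetchName()}!`;
  } catch (err) {
    return "Hello, guest!"; // the fallback keeps the feature usable
  }
}

// Roughly 30% of calls to the wrapped dependency will throw.
const flakyFetchName = withInjectedFailures(() => "Ada", 0.3);
console.log(getGreeting(flakyFetchName));
```

If `getGreeting` had no `try`/`catch`, the injected failure would surface as a crash, which is exactly the kind of weakness this practice is meant to reveal early.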

"By breaking things on purpose we surface unknown issues that could impact our systems and customers." 2

Origin story

Chaos engineering as a concept first became popular in 2010, when Netflix created Chaos Monkey. The tool was a response to the company's move from physical infrastructure to cloud infrastructure provided by Amazon Web Services, and its goal was to ensure that the loss of an Amazon instance would not affect the Netflix streaming experience.

Chaos Monkey works by frequently shutting down a random instance of the Netflix streaming service in production during business hours. The engineers then have to fix the problem. By doing this repeatedly, the Netflix engineers became better at recovering whenever a similar situation occurred due to external forces.

"By taking a rare and potentially catastrophic event and making it frequent, we give engineers a strong incentive to build their service in such a way that this type of event doesn’t matter. Engineers are forced to handle this type of failure early and often. Through automation, redundancy, fallbacks, and other best practices of resilient design, engineers quickly make the failure scenario irrelevant to the operation of their service." 2

The whole story is pretty fascinating and more insightful than I can fit into this short write-up. You can read more about it here: Birth of chaos

Conclusion

The idea is to uncover unknown weaknesses in your system and figure out how to fix or manage them in time. However, before applying chaos engineering to your system, you should be reasonably sure that what you have built can withstand or recover from errors on its own. If not, it is best to look over it again first. Chaos engineering should boost the level of confidence you have in your system.

Although Chaos Monkey is more complex than my chaos-engine package, the purpose is similar. When a user passes a function to it, chaos-engine runs several arguments through the function in an attempt to break it. By looking at the function's response to each argument, you will be able to see cases that you did not factor in when writing the function.

The documentation explains a lot more about the package. You can give it a try here: Try chaos-engine package on RunKit.

Attribution

  1. Birth of chaos
  2. Chaos Engineering: the history, principles, and practice
  3. Why Do Chaos Engineering?

Cover Photo by Naveen Kumar on Unsplash