Chaos engineering week at Track.health

Dinuka Arseculeratne
3 min read · Nov 13, 2020

Stepping into the realm of Chaos engineering

We were inspired by the Chaos engineering activities done at Netflix and wanted to test our platform here at Track.health to see how resilient it was to random failures.

As a famous quote from the book Release It! Design and Deploy Production-Ready Software by Michael T. Nygard goes:

Enterprise software must be cynical. Cynical software expects bad things to happen and is never surprised when they do. Cynical software doesn’t even trust itself, so it puts up internal barriers to protect itself from failures. It refuses to get too intimate with other systems, because it could get hurt.

Defining the steady state

Before we looked at what tools were out there for Chaos engineering, we first defined what we wanted to test and what a steady state looks like for our application.

Afterwards, we wrote a load test suite using Gatling so that we could simulate the expected load while the chaos activities took place.
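To give a flavour of what that suite looks like, here is a minimal sketch of a Gatling simulation; the base URL, endpoint, and load figures below are illustrative stand-ins rather than our real values:

import scala.concurrent.duration._
import io.gatling.core.Predef._
import io.gatling.http.Predef._

class SteadyStateSimulation extends Simulation {

  val httpProtocol = http
    .baseUrl("https://api.track.health") // illustrative base URL
    .acceptHeader("application/json")

  // A scenario approximating the steady-state traffic we defined above
  val steadyState = scenario("Steady state")
    .exec(
      http("Get accounts")
        .get("/api/accounts") // illustrative endpoint
        .check(status.is(200))
    )
    .pause(1)

  // Hold the expected load for the duration of the chaos experiments
  setUp(
    steadyState.inject(constantUsersPerSec(10).during(10.minutes))
  ).protocols(httpProtocol)
}

With this in place, we started looking at tools to start off our Chaos activities.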

Selecting a tool to start off with

There is an overwhelming number of tools out there, most of them open-sourced by various amazing people. We went through this amazing list compiled on GitHub.

Since our platform runs on Kubernetes, and because we wanted to keep the first iteration simple, we went ahead with kube-monkey.

We identified the key deployments that we wanted to shut down sporadically and then deployed kube-monkey into their respective namespaces.

There are different ways to schedule kube-monkey to shut down deployments in your cluster, and the documentation does a good job of explaining them, so I won't repeat them as part of this post.

We started off with its debug mode just to see it in action. With debug enabled, you can schedule kube-monkey to run at a short, fixed interval so you can verify that the correct deployments are being shut down.
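kube-monkey itself is driven by a small TOML config, typically mounted in via a ConfigMap. As a rough sketch of what a debug-mode config looks like, based on the kube-monkey documentation (the hours and time zone below are illustrative):

[kubemonkey]
dry_run = false                           # set to true to only log kills instead of performing them
run_hour = 8                              # schedule kills at 8am on weekdays
start_hour = 10                           # no kills before 10am
end_hour = 16                             # no kills after 4pm
time_zone = "Australia/Melbourne"         # illustrative time zone
blacklisted_namespaces = ["kube-system"]  # namespaces kube-monkey must never touch

[debug]
enabled = true
schedule_immediate_kill = true            # kill on a short interval so you can watch it in action

Now to get this working, we also had to label our deployments as follows: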

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ms-account-management
  namespace: ns-accounts
  labels:
    app: ms-account-management
    {{- if eq .Values.kube_monkey_enabled "yes"}}
    kube-monkey/enabled: enabled
    kube-monkey/identifier: ms-account-management
    kube-monkey/mtbf: '1'
    kube-monkey/kill-mode: "fixed"
    kube-monkey/kill-value: '1'
    {{- end}}
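A quick decode of those kube-monkey labels, per its documentation: mtbf (mean time between failures) is measured in weekdays, so '1' means the deployment can expect a kill roughly once per weekday, and kill-mode "fixed" with a kill-value of '1' means exactly one pod of the deployment is killed each time.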

What the hell is this?

{{- if eq .Values.kube_monkey_enabled "yes"}}

We use Helm templates, and feature-flagging things is the norm at Track.health, so we put this functionality behind a flag in order to turn it on and off with each deployment.
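Concretely, toggling it is just a matter of setting that value at deploy time. A hypothetical example (the release and chart names below are made up):

# values.yaml
kube_monkey_enabled: "yes"

# ...or overridden per deployment from the CLI
helm upgrade --install ms-account-management charts/ms-account-management \
  --set kube_monkey_enabled=yes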

Once kube-monkey was deployed, we ran our load test suite at regular intervals and captured the AWS CloudWatch metrics on each run. This led to some interesting findings for us.

We saw that certain parts of the platform did not fail fast and were left hanging on some of the HTTP requests they made.

Circuit breaking to the rescue

Circuit breaking was clearly in the cards. We wanted a solution that gave us fine-grained control over how we did circuit breaking, which is why we chose resilience4j over Istio.
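As a rough sketch of the idea, here is what wrapping a downstream call in a resilience4j circuit breaker can look like; the breaker name, threshold values, and callAccountService are illustrative stand-ins, not our production settings:

import java.time.Duration
import io.github.resilience4j.circuitbreaker.{CircuitBreakerConfig, CircuitBreakerRegistry}

object CircuitBreakerSketch extends App {

  // Open the breaker when too many calls fail or are slow
  val config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                         // % of failed calls before opening
    .slowCallDurationThreshold(Duration.ofSeconds(2)) // calls slower than this count as slow
    .slowCallRateThreshold(50)                        // % of slow calls before opening
    .waitDurationInOpenState(Duration.ofSeconds(30))  // how long to stay open before probing again
    .slidingWindowSize(20)                            // how many calls the rates are computed over
    .build()

  val breaker = CircuitBreakerRegistry.of(config).circuitBreaker("ms-account-management")

  // Stand-in for the real HTTP client call
  def callAccountService(): String = "ok"

  // Once the breaker is open, this throws CallNotPermittedException immediately
  // instead of hanging on a dead deployment
  val result = breaker.executeSupplier(() => callAccountService())
  println(result)
}

Note that the slow-call thresholds only trip the breaker after slow calls have actually been observed, so this sits alongside sensible client timeouts rather than replacing them.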

Also, the author of the library is quite responsive, especially on Twitter; he even replied to one of my tweets.

Wrapping up

Our journey with Chaos engineering has just begun, and there is much more to do and uncover that we have not even touched on yet. Still, this was a good start, and it has already led us to make our platform more resilient to sudden changes in the state of our deployments.

More on our Chaos journey to come in the future.

Have a good one everyone and thank you for reading!

