The Self Healing Myth: Readiness & Liveness Probes

When starting out with Kubernetes, we had some discussions around what to do with “liveness” and “readiness” probes.

What are Readiness & Liveness Probes?

Liveness Probes tell Kubernetes your application is still alive. If the liveness probe fails, Kubernetes will kill the container and bring up a new one.

Readiness Probes tell Kubernetes that your pod is ready to receive traffic.

NOTE
Both Readiness & Liveness probes get hit throughout the lifetime of the pod, not just on startup.
If your ‘readiness probe’ says it’s not ready, then Kubernetes will leave the pod running, but not send traffic to it. Later when it’s ready again, it can join the pool of pods receiving traffic.

The Dilemma – Smart versus Dumb Probes

If your application depends on a database, and is pretty much useless if that database is down, then you could make “smart” probes that check the database (and other dependencies).

Or you could make “dumb” probes with no logic attached: as long as the application itself is up, the probe returns a 200 response code, even if dependent services are down and the application is effectively useless.

As we progressed in migrating 50+ applications across to Kubernetes, we encountered a variety of issues that changed our way of thinking and helped us refine our usage of these probes.

Some people believed that since we were now in the cloud, autoscaling and self-healing came out of the box. If there was a problem and a pod needed to restart, then it should restart automatically and “self heal”.

The Self Healing Myth

Self healing is hard. If you auto-restart a pod because of an issue, then you’d better be very sure that the restart will actually fix the issue. Otherwise you might wake up in the morning to find that your pods have restarted 1000 times overnight. Constant restarts also put strain on the cluster, especially because of the “CPU goes to 100% on startup” problem that Java applications have.

Don’t want to restart your pods forever? Well then you need a back-off strategy. But that is another complexity to manage.

Here’s my lesson for you about self-healing in general:

Self healing only works when you are 100% sure the action being taken will solve the issue.

If you are not 100% sure, you can still ‘attempt’ to self-heal, as long as you have logic about how to handle failed self-heal attempts, and how to back off on automated healing.
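
To make that concrete, here is a minimal sketch of the kind of back-off logic the previous paragraph is talking about, for self-heal actions you trigger yourself. The class name, delays and attempt cap are all assumptions made for the illustration:

```java
import java.time.Duration;

// Illustrative back-off policy for automated healing attempts: each
// failed attempt doubles the wait, and after a cap we stop trying
// automatically and let a human investigate instead.
public class HealingBackoff {
    private final Duration baseDelay = Duration.ofSeconds(10);
    private final int maxAttempts = 5;
    private int failedAttempts = 0;

    // Record a self-heal attempt that did not fix the issue.
    public void recordFailure() {
        failedAttempts++;
    }

    // True while automated healing is still worth attempting.
    public boolean shouldRetry() {
        return failedAttempts < maxAttempts;
    }

    // Exponential back-off: 10s, 20s, 40s, 80s, ...
    public Duration nextDelay() {
        return baseDelay.multipliedBy(1L << failedAttempts);
    }

    // A successful heal resets the policy.
    public void reset() {
        failedAttempts = 0;
    }
}
```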

General Rule of Thumb

Make both the liveness and readiness probes as dumb as possible. Make them hit an endpoint on your application that just returns HTTP 200 and nothing more.
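
For an HTTP-based probe, a “dumb” endpoint really is just a couple of lines. Here is a minimal sketch using the JDK’s built-in HttpServer; the path and port are assumptions, and you would point both the liveness and readiness probes at it:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// A "dumb" probe endpoint: if the process can answer this request at
// all, it returns 200 OK. No dependency checks, no extra logic.
public class HealthEndpoint {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/healthz", exchange -> {
            byte[] body = "OK".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}
```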

To handle downstream dependencies being down, you should implement failure handling in your code (circuit breakers are a good example). Then you can handle failure gracefully and respond to the caller of your service immediately.
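
To illustrate the idea, here is a minimal hand-rolled circuit breaker. In practice you would more likely reach for a library such as Resilience4j; the class names and thresholds below are assumptions made for the sketch:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Minimal circuit breaker: after failureThreshold consecutive failures
// the circuit "opens" and calls fail fast for openDuration, instead of
// letting every request wait on a dependency that is known to be down.
public class CircuitBreaker {
    public static class CircuitOpenException extends RuntimeException {
        public CircuitOpenException(String message) { super(message); }
    }

    private final int failureThreshold;
    private final Duration openDuration;
    private int consecutiveFailures = 0;
    private Instant openedAt = null;

    public CircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    public synchronized <T> T call(Supplier<T> action) {
        if (openedAt != null) {
            if (Instant.now().isBefore(openedAt.plus(openDuration))) {
                // Fail fast: do not even attempt the downstream call.
                throw new CircuitOpenException("dependency unavailable");
            }
            openedAt = null; // half-open: let one trial call through
        }
        try {
            T result = action.get();
            consecutiveFailures = 0; // success closes the circuit
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = Instant.now();
            }
            throw e;
        }
    }
}
```

With something like this in place, a request that arrives while the database is down fails in microseconds with a clear exception, instead of hanging on a connection timeout.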

When to use “smart” Liveness Probes

If you really want to get smart with your liveness probes, then there is one simple rule to follow: be 100% sure that the restart will fix your problem.

NOTE
If you are finding you need to restart to solve problems, then the problem should be fixed in the code. Restarting is usually just a band-aid over an underlying problem.

Example of when to Restart: Java Memory Leak

We had cases where a Java application leaked memory: usage kept growing, and garbage collection was not reclaiming anything. Restarts were the only way to clear out the memory and resolve the immediate problem.

So if you made a liveness endpoint in the application that could tell when memory was reaching a particular threshold, it could ask to be restarted by returning an HTTP 500 code on the liveness probe.
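
Here is a minimal sketch of what that endpoint could look like, assuming heap usage is the signal and 90% is the restart threshold (both the path and the threshold are illustrative choices, not values from our setup):

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.net.InetSocketAddress;

// Illustrative "smart" liveness endpoint: report failure (HTTP 500)
// once heap usage crosses a threshold, so that Kubernetes restarts
// the container. Only sensible because a restart is known to fix
// this particular problem (the leak described above).
public class LivenessEndpoint {
    private static final double HEAP_THRESHOLD = 0.90; // assumed value

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/livez", exchange -> {
            MemoryUsage heap =
                    ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            boolean leaking = heap.getMax() > 0
                    && (double) heap.getUsed() / heap.getMax() > HEAP_THRESHOLD;
            byte[] body = (leaking ? "RESTART ME" : "OK").getBytes();
            // A non-2xx answer makes the liveness probe fail; after enough
            // consecutive misses, Kubernetes restarts the container.
            exchange.sendResponseHeaders(leaking ? 500 : 200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}
```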


When to use “smart” Readiness Probes

My biggest mistake with using Readiness probes at first was not realising:

The readiness probe is not just called on startup of the pod – it is ALWAYS called during the lifetime of the pod.

So should you mark the pod as not-ready when the database is down?

No!
Doing this would mean that the pod stays alive, but Kubernetes stops sending traffic to it. But the database would be shared amongst all of the instances of that application, so they would most likely all be marked as “Not Ready”.
And if you mark your pod as not-ready, then what happens to the caller of your service? They will start getting errors such as HTTP 503 responses. What would the caller do with this?
If you instead implemented circuit breakers that failed fast, your service could at least send back sensible messages to the caller.
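
As a sketch, a request handler can wrap its database call in the hypothetical CircuitBreaker from earlier and answer immediately with a descriptive error, while the pod itself stays Ready:

```java
import java.time.Duration;

// Sketch of failing fast: the pod stays Ready, but a request that
// needs the database gets an immediate, descriptive answer instead of
// a timeout, or a bare 503 from a pod that marked itself not-ready.
// Reuses the hypothetical CircuitBreaker from the earlier sketch.
public class OrderHandler {
    private final CircuitBreaker dbBreaker =
            new CircuitBreaker(5, Duration.ofSeconds(30));

    public String getOrder(String id) {
        try {
            return dbBreaker.call(() -> loadOrderFromDatabase(id));
        } catch (CircuitBreaker.CircuitOpenException e) {
            // A sensible message the caller can actually act on.
            return "{\"error\":\"database unavailable, please retry later\"}";
        }
    }

    private String loadOrderFromDatabase(String id) {
        // Placeholder for the real database call.
        throw new UnsupportedOperationException("not part of this sketch");
    }
}
```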

So when do we actually use “smart” readiness probes?

Only if you have a very specific use case. In general you should never use them. Instead, implement failure handling in your code in the form of circuit breakers etc. and handle the failure gracefully.

