
How Kubernetes Eviction Works: Resource Management Gone Wrong

The final chapter of a wizard's journey through the technical inner workings of Kubernetes resource management

Understanding the mysterious inner workings of Kubernetes resource management at a deep level can make you feel like a wizard. As detailed in the first chapter in this series, becoming a wizard of Kubernetes resource management involves achieving an end-to-end contextual understanding of how resource management functions in Kubernetes, including everything from its user abstractions to technical implementation at the Linux kernel level.

In Chapter 1, we detailed how pod spec and node status are used to make matches between pending pods and available nodes. In Chapter 2 and Chapter 3, we delved deep into the particulars of how requests and limits are converted to Linux process settings, and we explored what implications that has for container resource and reliability outcomes at runtime. 

In this final chapter, we’ll return briefly to the kubelet, then try to summarize what all of this means for us as administrators who are operating the high-level abstraction.

Node Pressure and Eviction #

CPU throttling and containers being OOMKilled are examples of resource issues coming to a head on the node — somewhat outside of Kubernetes’ direct control. In those situations, Kubernetes has done its best to set pods and containers up for success, but ultimately, Kubernetes code isn’t directly involved in arbitrating conflicts that happen. Whatever the outcome, the Kubernetes workloads will still be running (or restarting) on their assigned node during and after the conflict.

If, for some reason, a collection of pods running together on a node clearly isn’t working out (lots of OOMs happening, for example), Kubernetes can intervene and kick some pods out to try and improve the situation. This is where node pressure and eviction come in.

The kubelet periodically evaluates node pressure according to its housekeeping-interval, which is 10 seconds by default. If an eviction signal such as available memory crosses either a soft-eviction threshold or a hard-eviction threshold, the kubelet reports a resource pressure condition on the node (such as MemoryPressure), which temporarily prevents new pods from being scheduled there, and then picks one or more pods to evict from the node.
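These thresholds are configurable. As a rough sketch (the threshold values below are made-up examples, not recommendations), memory-related eviction settings in a kubelet configuration file look something like this:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Hard thresholds: crossing one triggers eviction immediately.
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
# Soft thresholds: eviction happens only if the threshold stays
# crossed for the corresponding grace period.
evictionSoft:
  memory.available: "500Mi"
evictionSoftGracePeriod:
  memory.available: "1m30s"
```

Even if you never touch these settings, the kubelet ships with built-in hard-eviction defaults, so node pressure eviction is in play even on clusters where nobody has configured it explicitly.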

The idea is that something has gone wrong, and the kubelet is going to try to disrupt whatever is happening in the hope of improving things. Kubernetes resource management works well when workload resource requests are configured correctly, but ensuring requests are set correctly is a whole problem by itself. Without automated tools in place to manage requests, it’s unlikely that they are set well across the board.

When the kubelet decides to evict pods to relieve resource pressure, it looks for “bad actors” or “worst offenders”: pods that are using more of the resource under pressure than they requested. Those pods are evicted first. Pod priority is also considered, but having requests set too low is the best way to get voted off the island.
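To make that concrete, here is a rough sketch (the names, image, and values are hypothetical) of the two levers a pod has for staying off the eviction shortlist: requests that reflect real usage, and a higher priority:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: important                   # hypothetical name
value: 100000                       # higher value = higher priority
globalDefault: false
description: "Workloads that should be evicted last under node pressure."
---
apiVersion: v1
kind: Pod
metadata:
  name: well-behaved                # hypothetical name
spec:
  priorityClassName: important
  containers:
    - name: app
      image: example.com/app:latest # placeholder image
      resources:
        requests:
          cpu: 500m                 # set to reflect actual observed usage
          memory: 512Mi
        limits:
          memory: 1Gi
```

The kubelet considers whether a pod’s usage exceeds its requests before it considers priority, which is why accurate requests are the stronger lever.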

When eviction happens, the evicted pod’s containers will not keep running on the node; the pod is terminated. If the pod belonged to a workload resource such as a Deployment, a new pod will be created to replace it, and that new pod will need to be scheduled onto a node before it can run again. Hopefully, it will land on a different node. If no other suitable nodes are available, though, there’s a chance it will be scheduled onto the same node again once the resource pressure condition clears.

Because of the careful scheduling Kubernetes does, which ensures that the sum of the requests made by pods running on a node never exceeds the node’s allocatable capacity, eviction usually only occurs when many pods have no requests set at all or have their requests set too low.

💡 KEY OBSERVATION

Node pressure eviction serves to evict ill-behaved pods on overloaded nodes, hopefully redistributing the evicted workloads to other nodes.

This is most likely to occur when BestEffort pods are being heavily used, or Burstable pods have their requests set too low.
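In pod spec terms, the difference looks roughly like this (container names, image, and values are hypothetical):

```yaml
# BestEffort: no requests or limits anywhere in the pod.
# First in line when the node comes under pressure.
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-example          # hypothetical name
spec:
  containers:
    - name: app
      image: example.com/app:latest # placeholder image
---
# Burstable: requests are set, but far below what the app actually uses,
# which makes this pod a likely "worst offender" under memory pressure.
apiVersion: v1
kind: Pod
metadata:
  name: burstable-example           # hypothetical name
spec:
  containers:
    - name: app
      image: example.com/app:latest # placeholder image
      resources:
        requests:
          cpu: 50m
          memory: 64Mi              # actual usage might be hundreds of MiB
```

A pod whose usage stays within its requests, including a Guaranteed pod where requests equal limits, lands at the back of the eviction line.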

A somber parting note about the potential danger of node pressure scenarios:

If your cluster has a static size and does not automatically provision additional nodes as needed, node pressure eviction can lead to a sort of chain reaction meltdown. 

The basic scenario goes like this:

  1. A pod over-consumes memory on a node.
  2. The node evicts the over-consumer and taints itself with a resource pressure condition (no new pods).
  3. The evicted pod is rescheduled on a new node.
  4. The pod again over-consumes memory.
  5. The new node evicts the over-consumer and also taints itself with a resource pressure condition, just like the first node.
  6. This cycle continues, potentially tainting and locking down many (possibly all) of the cluster’s nodes.

Now toss a horizontal pod autoscaler (HPA) into the mix. An HPA might react to the situation by scaling the affected workload up, creating even more over-consumer pods that add fuel to the fire. It’s not exactly nuclear, but if unchecked, it becomes a resource mismanagement situation that can affect the whole cluster.
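For context, here is what a typical CPU-based HPA looks like (the names and targets below are hypothetical). The utilization target is measured against the pods’ resource requests:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa                 # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app               # hypothetical workload
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80    # percentage of the pods' CPU requests
```

Because utilization-based metrics are calculated relative to requests, requests that are set too low can also skew the autoscaler’s picture of how loaded the workload really is.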

This anecdote is based on a real-world scenario. It serves as an interesting reminder that even though this series spends a lot of time talking about Kubernetes resources under a microscope — one container, even one resource type at a time — resource management is a building block underlying a complex system, and how well resources are managed can contribute to macro-scale effects.

Journey's End: Making Sense of Resource Management #

Congratulations! You made it. Assuming you haven’t skip-scrolled too much to get here (but casting no shade if you did), it took you quite a lot of reading to do it. Here’s a quick summary to help remind you what you learned along the way, and what technical implementation details you’ve been exposed to — specifically in the context of how Kubernetes requests and limits actually work.

  • Pod spec – where you set requests and limits (see the annotated manifest just after this list)
  • Node status – reports node resource capacity
  • Pod scheduling – decides which node to run pods on based on requests
  • Container configuration for CPU – cgroups and CFS 🤯
  • Container configuration for memory – cgroups and oom_score_adj 🤯
  • Node pressure and eviction – terminates pods when things just aren’t working out
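As a closing illustration (the pod name, image, and values below are hypothetical), here is roughly where each of those pieces touches a single container’s resources block:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: recap-example               # hypothetical name
spec:
  containers:
    - name: app
      image: example.com/app:latest # placeholder image
      resources:
        requests:
          cpu: "1"      # informs scheduling; becomes the container's cgroup CPU weight
          memory: 256Mi # informs scheduling and the eviction ranking
        limits:
          cpu: "1"      # enforced at runtime as a CFS quota
          memory: 256Mi # hard cgroup memory limit; exceeding it gets the container OOMKilled
```

Because requests equal limits for both resources here, the pod gets the Guaranteed QoS class, which also earns it the lowest oom_score_adj the kubelet assigns to ordinary workloads.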

It’s been a grimoire. You’ve pulled back the curtain, peered into the crystal ball, observed the outlines of gears and pulleys that power the kube-tron and can distinguish some of the magic from the technology. So … what can you do with this knowledge, now that you have it?

At the end of the day, most developers and administrators won’t be dealing with the engine of Kubernetes; they’ll be working with the user interface: setting resource request and limit values for containers in YAML manifests. When you step back from all of these fascinating technical details (scheduling, cgroup configuration, runtime contention events, the possibility of eviction), the takeaway for our day-to-day decision-making as developers or administrators comes down to this:

Resource requests matter.

None of the volatility, the exception handling, and the decide-who-gets-more-and-who-gets-less logic needs to matter very much when each container’s resource requests are thoughtfully aligned with the container’s actual resource usage. When usage aligns with requests, the contingency mechanisms don’t need to be exercised. Having the right resource request settings figured out up front sidesteps it all.

💡 KEY OBSERVATION SUMMARY

Resource requests matter.

Mechanisms for handling resource contention are commendable, but they are fundamentally reactive to a mismatch between how much CPU or memory a container requests and how much CPU or memory it actually uses. Over-relying on contention handling is fraught, even with the quality of work Kubernetes offers.

Kubernetes can be robust, effective, and efficient when resource requests correctly match requirements, with minimal need to exercise contention handling mechanisms. For optimal performance and efficiency, endeavor to set request values correctly for every container in every pod.

Getting requests right all of the time is a lot easier said than done though. It’s actually an incredibly tall order. At least, for humans.

Aligning resource request values in Kubernetes to real container resource requirements is an intractable task for developers and administrators. It’s a continuous battle with no end. We wrote an article about that. But it is possible; there is a way.

Just automate it.

At StormForge, this is what we spend our time on. When we’re not writing articles like this or spelunking around in Kubernetes source code, we’re hard at work building automation to make real-world realization of this key observation possible: Set the right CPU and memory requests for every workload, on every cluster, every time, and do it with minimal human intervention.

If you’re curious about that and want to see what we do, you can try Optimize Live for free or play around in our sandbox environment.

The goal is to make it feel like magic 🪄

…to everyone else. For you, it’s probably too late. You’ve learned too much. Don’t worry though. It’s not so bad, being a wizard. 

Additional Reference and Resources #

Tools that are your friend:

  • kubectl
  • jq
  • crictl
  • systemd-cgls

Relevant cgroupfs and procfs locations:

  • /sys/fs/cgroup/kubepods.slice
  • /proc/${pid}/oom_score
  • /proc/${pid}/oom_score_adj
