Chapter 3 - EKS Karpenter: Deep Dive
Karpenter is an open-source Kubernetes autoscaling project that was donated to the Cloud Native Computing Foundation (CNCF) by Amazon. The project aims to optimize how Kubernetes clusters scale their worker nodes to maximize resource efficiency and cost optimization, especially compared to traditional autoscaling tools like the Cluster Autoscaler.
Karpenter provides several valuable features for cluster administrators, such as supporting dynamic instance types, gracefully handling interrupted instances (like during spot termination), and enabling faster pod scheduling time. This article provides Kubernetes administrators with a comprehensive overview of Karpenter’s architecture and benefits, installation guidance, and best practices for effectively leveraging Karpenter.
Karpenter is a modern Kubernetes autoscaler developed to tightly integrate with cloud providers, enabling more intelligent node autoscaling for Kubernetes clusters. It is designed to overcome the limitations of traditional tools like the Cluster Autoscaler and provide a flexible and powerful solution for provisioning worker nodes while ensuring proper resource allocation.
A highlight of Karpenter is its direct integration with the APIs of cloud providers like AWS EC2, enabling it to make precise decisions quickly in response to cluster events. A backlog of pods stuck in the Unschedulable status will cause Karpenter to analyze hundreds of available instance types to launch a worker node matching exactly the requirements of the incoming pods. Examples of these requirements include resource capacity, availability zone selection, operating system types, and spot/on-demand instance mixtures.
Karpenter monitors for interruption events such as spot instance termination or upcoming EC2 maintenance via AWS APIs and then quickly launches replacement instances to ensure that pods are rescheduled with minimal downtime. This contrasts with the traditional Cluster Autoscaler project, which is limited to a static set of instance types, cannot gracefully handle instance interruption, and causes pod scheduling delays due to its reliance on AutoScalingGroup resources to handle instance creation instead of directly leveraging the EC2 API.
The dynamic instance support from Karpenter is a crucial feature for administrators struggling to rightsize their clusters and avoid costly wasted resources. Karpenter’s ability to launch instance types that precisely suit workload requirements enables a cluster to accurately rightsize its resource utilization, ensuring that there is no wasted compute capacity or cost inefficiencies. The project also allows workload consolidation. It will continuously analyze node capacity and pod requirements to dynamically terminate/replace instances, allowing it to rightsize the cluster and ensure that resources aren’t being wasted—all without impacting pod performance.
Karpenter is cloud provider agnostic and currently supports AWS and Azure.
Understanding how Karpenter works under the hood will help you better grasp how scaling and scheduling decisions are made, how to better leverage Karpenter’s features, and how to troubleshoot potential problems.
Karpenter is a Kubernetes controller; these are applications that monitor the Kubernetes API server to watch the state of the cluster’s objects and react to particular events. Karpenter watches a few specific objects—like NodePools and NodeClasses (which we will discuss in more detail shortly), as well as pod objects—to determine how to perform its autoscaling responsibilities.
To understand these objects and why Karpenter monitors them, let’s step through the workflow of some typical scaling operations. There are four primary components to Karpenter’s operations, which we explore in detail below.
When pods are deployed to a Kubernetes cluster, a component called the Kubernetes Scheduler decides which worker node will host each pod. This decision takes into account the pod’s requirements (like CPU/memory demands) and available node capacity. If no available node can host the pod—such as when there aren’t enough memory resources available—the pod will be stuck in an Unschedulable status. The trimmed snippet below shows a pod that is flagged as Unschedulable due to insufficient node memory:
kubectl get --output yaml pod pod-name
status:
  phase: Pending
  conditions:
  - type: PodScheduled
    status: "False"
    reason: Unschedulable # Pod is marked as Unschedulable.
    message: '0/2 nodes are available: 2 Insufficient memory. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod.'
    lastTransitionTime: "2024-03-02T08:03:35Z"
    lastProbeTime: null
Karpenter continuously monitors the API server to find pods flagged as Unschedulable. It will then determine how to get these pods scheduled to a new node. This first step is similar to how the Cluster Autoscaler operates by observing Unschedulable pods to determine whether node scaling is necessary.
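One quick way to spot these pods in your own cluster is to filter on the Pending phase, since Unschedulable pods remain Pending until a node becomes available:
kubectl get pods --all-namespaces --field-selector=status.phase=Pending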
Karpenter’s goal is to launch enough worker nodes to satisfy the requirements of any Unschedulable pods. However, Karpenter must evaluate all constraints set in the pods’ attributes to determine what instance configuration to launch, taking into account factors such as resource requests (CPU, memory, and other resources like GPUs), node selectors and affinity rules, taints and tolerations, and topology spread constraints.
Karpenter collects all constraints specified by the Unschedulable pods to determine what kind of instance it can launch to fit the pod and in which availability zone. A common challenge here is ensuring that the CPU and memory requirements are configured correctly for the pod: Misconfigured values will cause Karpenter to select inaccurate instance configurations, leading to wasted resources and excessive costs.
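As a sketch (the pod name and values below are hypothetical), a single pod can carry several of these constraints at once, and Karpenter combines the resource requests with the node selector when choosing an instance type and availability zone:
apiVersion: v1
kind: Pod
metadata:
  name: zone-pinned-app # hypothetical workload
spec:
  nodeSelector:
    topology.kubernetes.io/zone: us-west-2a # pin to a specific availability zone
    kubernetes.io/arch: amd64
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "1"      # Karpenter sizes the instance based on these requests
        memory: 2Gi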
Karpenter will now need to match the pod constraints with a set of available node configurations. There are two Karpenter-specific object types we’ll explore here to understand how nodes are configured.
NodePool objects define the desired configuration of worker nodes provisioned by Karpenter. The configuration includes settings like taints, labels, and instance attributes such as spot capacity and GPU hardware. Earlier versions of Karpenter called this object a “Provisioner,” which was deprecated when Karpenter graduated to beta.
Administrators can create multiple NodePool objects to organize their nodes neatly based on separate use cases. For example, nodes belonging to different teams in an organization may have team-specific taint and label configurations, which can then be leveraged by pod affinity rules to ensure that each team’s pods are only scheduled to their own NodePools. When Karpenter attempts to provision a node for Unschedulable pods, it will select an appropriate NodePool where the configuration of the NodePool and the pod match.
Here is an example of a NodePool object that specifies constraints on the operating system (Linux) and a taint:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general-purpose
  annotations:
    kubernetes.io/description: "General Purpose"
spec:
  template:
    metadata:
      labels:
        nodepooltype: general-purpose
    spec:
      taints:
      - key: example.com/taint
        effect: NoSchedule
      requirements:
      - key: kubernetes.io/os
        operator: In
        values: ["linux"]
      nodeClassRef:
        name: default
The NodeClass object is the second object administrators must configure for Karpenter—you can see in the example above that a NodeClass is referenced in the NodePool resource. The NodeClass object holds cloud-provider-specific constraints, which will be applied along with the NodePool constraints when Karpenter launches a Node. The schema for a NodeClass will vary depending on what cloud provider Karpenter is deployed to.
Here is an example of an AWS-specific NodeClass object that defines basic subnet, security group, and IAM role settings:
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: "cluster-1"
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: "cluster-1"
  role: "KarpenterNodeRole-cluster-1"
The NodeClass object on AWS also supports configuring UserData, tags, BlockDeviceMappings, and MetadataOptions.
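For example, the default NodeClass from above could be extended along these lines to add instance tags, an encrypted root volume, and IMDSv2 enforcement (a sketch with illustrative values; consult the EC2NodeClass documentation for your Karpenter version):
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  role: "KarpenterNodeRole-cluster-1"
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: "cluster-1"
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: "cluster-1"
  tags:
    team: platform # propagated to the launched EC2 instances
  metadataOptions:
    httpTokens: required # enforce IMDSv2 on the instances
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      volumeSize: 100Gi
      volumeType: gp3
      encrypted: true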
Once Karpenter has matched the pod’s constraints with a particular NodePool, it will use the EC2 API to launch an EC2 instance with the constraints defined in the NodePool and NodeClass objects. EC2 instances are launched quickly by leveraging the EC2 API directly instead of the Cluster Autoscaler’s approach of updating AutoScalingGroup configurations, which involves extra steps that introduce provisioning delays. The Unschedulable pods will schedule to the new node once it joins the cluster, and the pods will no longer be flagged as Unschedulable. Karpenter will continuously poll the API server for Unschedulable pods and run through the above workflow whenever additional nodes are required.
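Each node that Karpenter provisions is tracked by a NodeClaim object, so you can follow provisioning progress directly from kubectl (the output shown below is illustrative):
kubectl get nodeclaims
# NAME                    TYPE        ZONE         NODE                                          READY   AGE
# general-purpose-8xk2p   c5.xlarge   us-west-2a   ip-192-168-50-xx.us-west-2.compute.internal   True    3m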
A significant benefit of the above Kubernetes objects is that the entire cluster’s worker node configuration is managed through Kubernetes YAML manifests. This approach allows administrators to leverage capabilities like version-controlling the node configuration, implementing a GitOps strategy, and using native Kubernetes security controls like role-based access control (RBAC) and API server audit logs.
Traditional tools like the Cluster Autoscaler require administrators to additionally set up AutoScalingGroups, launch templates, and manage node group objects to enable the autoscaling functionality. This introduces complexity for the administrator, who must deploy and manage several pieces of infrastructure to facilitate autoscaling instead of just relying on Karpenter to manage the setup centrally with a Kubernetes-native approach.
Once pods are in the Running state, Karpenter will continue looking for efficiency improvement opportunities. It will regularly evaluate active pods and node utilization across the entire cluster to determine if pods can be consolidated onto fewer nodes, enabling Karpenter to terminate unused nodes to reduce costs. Karpenter also continuously evaluates the instance size configuration across the cluster to determine whether instances can be replaced with smaller, cheaper ones based on real-time workload requirements. Consolidation performed by Karpenter is more intelligent than the Cluster Autoscaler’s because the former takes into account the entire cluster’s node utilization to determine scale-down actions. The Cluster Autoscaler only looks at individual node utilization, which is less effective for accurately bin-packing pods.
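Consolidation behavior can be tuned per NodePool through its disruption block. The snippet below is a minimal sketch of what that configuration might look like on the general-purpose NodePool shown earlier; available fields and defaults vary by Karpenter version:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general-purpose
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized # consolidate whenever pods can be repacked onto cheaper capacity
    expireAfter: 720h # additionally recycle nodes after 30 days
  # template: ... (node template omitted; see the NodePool example earlier)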
Recently, Karpenter has also rolled out a spot-to-spot consolidation feature. This functionality actively tracks spot market prices, enabling Karpenter to replace instances with more cost-efficient alternatives based on dynamic pricing data while carefully balancing the expected interruption rate. Karpenter further enhances spot support by monitoring AWS APIs to proactively detect spot interruption warnings (and other EC2 maintenance-related events), enabling it to launch replacement nodes to reschedule pods immediately before downtime occurs. Traditional projects like the Cluster Autoscaler require the setup of additional tools to allow this type of functionality.
In summary, the four components of Karpenter’s operations are as follows:
1. Watching the API server for pods stuck in the Unschedulable status
2. Evaluating those pods’ scheduling constraints and matching them to a NodePool and NodeClass
3. Launching right-sized EC2 instances directly via the EC2 API
4. Continuously consolidating, replacing, and retiring nodes to keep the cluster cost-efficient
Administrators can significantly reduce waste and optimize costs through Karpenter’s automatic instance rightsizing features during both initial scheduling and consolidation. However, the effectiveness of Karpenter’s features hinges on the accurate configuration of pod resource allocations, specifically CPU and memory, via the pod’s “Requests” fields. If these allocations are not set correctly, Karpenter’s node rightsizing analysis may lead to inefficient resource use through overallocation (resulting in unnecessary costs) or underallocation (leading to performance issues). Therefore, it’s crucial for administrators to meticulously configure pod resource allocations to fully benefit from Karpenter’s capabilities.
Karpenter’s architecture represents a significant advancement in Kubernetes autoscaling options, offering a more efficient and cost-effective solution than traditional tools like the Cluster Autoscaler. By leveraging Kubernetes-native objects like NodePools and NodeClasses, Karpenter simplifies the deployment and management of node resources and allows for precise node provisioning based on real-time workload demands.
This section will show you how to install Karpenter in an EKS cluster to enable experimentation and further learning. The tutorial will guide you through creating an EKS cluster, enabling Fargate support, installing the Karpenter tool, and testing the node autoscaling functionality. We also have a video tutorial, if you prefer.
There are a few tools we’ll use to follow this tutorial: the AWS CLI, eksctl, kubectl, and Helm. Refer to each tool’s documentation for installation instructions.
The tutorial will assume you have IAM permissions to create AWS resources.
First, we’ll define some important settings, such as the Karpenter version, Kubernetes version, and the region where the cluster will be deployed. Adjust these settings carefully based on your requirements:
export KARPENTER_NAMESPACE="kube-system"
export KARPENTER_VERSION="0.35.0"
export K8S_VERSION="1.29"
export AWS_PARTITION="aws"
export CLUSTER_NAME="karpenter-demo"
export AWS_DEFAULT_REGION="us-west-2"
export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
export TEMPOUT="$(mktemp)"
The following script will deploy an EKS cluster using the eksctl tool and set up some Karpenter-related resources, like the EventBridge rules for monitoring spot interruption events. The script enables Fargate support, which is how we’ll run Karpenter itself with a serverless approach for simplicity: the Karpenter pods need somewhere to run, and leveraging Fargate simplifies this initial bootstrapping step. IAM roles are configured automatically to grant Karpenter the appropriate IAM permissions to manage our EC2 instances. No changes are required to the script below, but administrators are encouraged to review it to understand what is being executed by eksctl and how the ClusterConfig resource works:
# Download the CloudFormation template to create a KarpenterNodeRole
curl -fsSL https://raw.githubusercontent.com/aws/karpenter-provider-aws/v"${KARPENTER_VERSION}"/website/content/en/preview/getting-started/getting-started-with-karpenter/cloudformation.yaml > "${TEMPOUT}"
# Create the KarpenterNodeRole
aws cloudformation deploy \
--stack-name "${CLUSTER_NAME}" \
--template-file "${TEMPOUT}" \
--capabilities CAPABILITY_NAMED_IAM \
--parameter-overrides "ClusterName=${CLUSTER_NAME}"
# Create a new EKS cluster
eksctl create cluster -f - <<EOF
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${CLUSTER_NAME}
  region: ${AWS_DEFAULT_REGION}
  version: "${K8S_VERSION}"
  tags:
    karpenter.sh/discovery: ${CLUSTER_NAME}
iam:
  withOIDC: true
  serviceAccounts:
  - metadata:
      name: karpenter
      namespace: "${KARPENTER_NAMESPACE}"
    roleName: ${CLUSTER_NAME}-karpenter
    attachPolicyARNs:
    - arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:policy/KarpenterControllerPolicy-${CLUSTER_NAME}
iamIdentityMappings:
- arn: "arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/KarpenterNodeRole-${CLUSTER_NAME}"
  username: system:node:{{EC2PrivateDNSName}}
  groups:
  - system:bootstrappers
  - system:nodes
fargateProfiles:
- name: karpenter
  selectors:
  - namespace: "${KARPENTER_NAMESPACE}"
EOF
# Retrieve the new cluster's details
export CLUSTER_ENDPOINT="$(aws eks describe-cluster --name "${CLUSTER_NAME}" --query "cluster.endpoint" --output text)"
export KARPENTER_IAM_ROLE_ARN="arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/${CLUSTER_NAME}-karpenter"
# Print the cluster details to the screen
echo "${CLUSTER_ENDPOINT} ${KARPENTER_IAM_ROLE_ARN}"
# Create the service-linked role for EC2 Spot if it doesn't already exist
aws iam create-service-linked-role --aws-service-name spot.amazonaws.com || true
The output of the above should indicate that cluster creation was completed successfully. If you encounter problems (such as IAM errors), refer to AWS’s documentation on troubleshooting cluster creation.
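Before continuing, you can verify that the cluster is active and that the Fargate profile used to host Karpenter exists (expected output is shown as comments):
aws eks describe-cluster --name "${CLUSTER_NAME}" --query "cluster.status" --output text
# ACTIVE
eksctl get fargateprofile --cluster "${CLUSTER_NAME}" --region "${AWS_DEFAULT_REGION}"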
We’re ready to deploy Karpenter to our new cluster; use the Helm command below to begin the installation. There are additional flags available to customize the installation configuration:
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version "${KARPENTER_VERSION}" --namespace "${KARPENTER_NAMESPACE}" --create-namespace \
--set "settings.clusterName=${CLUSTER_NAME}" \
--set "settings.interruptionQueue=${CLUSTER_NAME}" \
--set controller.resources.requests.cpu=1 \
--set controller.resources.requests.memory=1Gi \
--set serviceAccount.create=false \
--set serviceAccount.name=karpenter \
--wait
You can see Karpenter deployed by running:
kubectl describe deployment karpenter --namespace "${KARPENTER_NAMESPACE}"
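You can also confirm that both controller replicas are running. The label selector below assumes the chart's default labels, and the output is illustrative:
kubectl get pods --namespace "${KARPENTER_NAMESPACE}" --selector app.kubernetes.io/name=karpenter
# NAME                         READY   STATUS    RESTARTS   AGE
# karpenter-xxxxxxxxxx-xxxxx   1/1     Running   0          2m
# karpenter-xxxxxxxxxx-yyyyy   1/1     Running   0          2m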
Next, we need to set up some NodePools and NodeClass resources to define our desired worker node configuration. Running the command below will create two NodePools and a NodeClass object, with each NodePool containing a team-based taint. The example will demonstrate how we can set up multiple NodePools for different teams and segregate each workload. The subnet and security groups associated with the EC2 instances are automatically selected based on existing tags, which were all created by eksctl in the previous steps.
cat <<EOF | envsubst | kubectl apply -f -
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: team-1
spec:
  template:
    spec:
      taints:
      - key: team-1-nodes
        effect: NoSchedule
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
      nodeClassRef:
        name: default
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: team-2
spec:
  template:
    spec:
      taints:
      - key: team-2-nodes
        effect: NoSchedule
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
      nodeClassRef:
        name: default
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2 # Amazon Linux 2
  role: "KarpenterNodeRole-${CLUSTER_NAME}"
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: "${CLUSTER_NAME}"
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: "${CLUSTER_NAME}"
EOF
With the steps above completed, the cluster is ready to begin scheduling new pods. Let’s deploy some to validate that Karpenter is working correctly.
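You can quickly confirm that the objects were created before moving on:
kubectl get nodepools,ec2nodeclasses
# Both team NodePools and the default EC2NodeClass should be listed.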
So far, no EC2 instances have been deployed to our cluster since our Karpenter pods are running on serverless Fargate nodes. We can now test whether Karpenter can provision some EC2 nodes by deploying new pods.
We will deploy two pods with the below YAML. Each pod will have a separate toleration, allowing the pods to only schedule to one of the NodePools we defined above. When these two pods are deployed, we expect to see the “team-1-nginx” pod schedule to the “team-1” NodePool, and the “team-2-nginx” pod schedule to the “team-2” NodePool. The example demonstrates how we can configure multiple NodePools to segregate different types of workloads based on taints and tolerations, and rely on Karpenter to launch appropriate nodes to schedule the desired pods:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: team-1-nginx
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: "0.5"
        memory: 300Mi
  tolerations:
  - key: "team-1-nodes"
    operator: "Exists"
    effect: "NoSchedule"
---
apiVersion: v1
kind: Pod
metadata:
  name: team-2-nginx
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: "0.5"
        memory: 300Mi
  tolerations:
  - key: "team-2-nodes"
    operator: "Exists"
    effect: "NoSchedule"
EOF
Karpenter will see the pods above stuck in the Unschedulable status since there are no EC2 instances to host them. It will react by provisioning two new nodes to fit our new pods, which we can see as follows:
kubectl logs deploy/karpenter --namespace "${KARPENTER_NAMESPACE}"
# "message":"found provisionable pod(s)"
kubectl get nodes
# ip-192-168-4-xx.us-west-2.compute.internal Ready v1.29.0
# ip-192-168-8-xx.us-west-2.compute.internal Ready v1.29.0
kubectl get pods
# team-1-nginx Running
# team-2-nginx Running
The test above demonstrates that Karpenter is successfully provisioning new worker nodes to host our pods, while respecting the taint and tolerations configuration. You can continue experimenting by extending the NodePools and NodeClass with more granular settings and deploying new workloads with varying constraints (like pod affinity) to observe Karpenter’s provisioning behavior. Deleting the pods will result in Karpenter automatically terminating the excess nodes.
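For example, deleting the two test pods and watching the NodeClaim objects should show the now-empty nodes being removed after a few minutes:
kubectl delete pod team-1-nginx team-2-nginx
kubectl get nodeclaims --watch
# Both NodeClaims (and their nodes) should disappear once Karpenter detects the empty capacity.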
Administrators can implement a few key best practices to fully leverage the value of Karpenter.
Karpenter supports a Kubernetes-native feature called leader election, which allows multiple controller replicas to run in parallel without conflicting: Only one replica makes decisions while the other remains on standby. The standby replica enables high availability by taking over responsibilities if the active replica fails. Karpenter’s Helm chart enables two replicas by default, and in a production cluster, this minimum should be maintained.
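If you need to set the replica count explicitly, the Helm chart exposes it as a value. The sketch below assumes the chart's top-level replicas setting; verify the value name against your chart version:
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "${KARPENTER_VERSION}" \
  --namespace "${KARPENTER_NAMESPACE}" \
  --reuse-values \
  --set replicas=2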
Karpenter exposes many Prometheus metrics by default, which can be scraped by any Prometheus-compatible monitoring tool. Administrators of production clusters should create dashboards and alerts for Karpenter’s metrics to allow for visibility into issues like node provisioning failures. The blast radius of broken autoscaling is significant, so enabling appropriate observability is critical in production. Metrics will also provide insight into whether the Karpenter pods need more CPU/memory resources to prevent autoscaling bottlenecks, especially in large clusters.
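A quick way to explore the available metrics is to port-forward the controller and scrape its metrics endpoint. This assumes the chart's commonly used default metrics port of 8000; adjust the port if your configuration differs:
# Forward the controller's metrics port locally, then list Karpenter metric names
kubectl port-forward --namespace "${KARPENTER_NAMESPACE}" deploy/karpenter 8000:8000 &
curl -s localhost:8000/metrics | grep '^karpenter_' | head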
For critical pods that must not be interrupted until their work is complete, Karpenter supports a valuable annotation that prevents it from terminating the pod’s worker node as part of consolidation efforts. Consider applying this annotation to pods that cannot tolerate interruption. The annotation can also be applied to a worker node object to ensure that the node won’t be consolidated, which is helpful when an administrator needs to keep a particular node online (for example, to gather logs or other data).
kubectl annotate pod pod-name karpenter.sh/do-not-disrupt='true'
kubectl annotate node node-name karpenter.sh/do-not-disrupt='true'
NodePool and NodeClass objects support many configuration parameters that will impact the shape of your cluster. Their values should be carefully evaluated to ensure that the desired node configurations are being deployed and are suitable for your workloads’ use cases.
A key feature of NodePool resources is the “weight” attribute. It is possible to set up multiple NodePools that are compatible with your workloads while setting priorities for Karpenter to respect. For example, a common use case is creating a high-priority NodePool configured for spot or reserved instances for cost optimization purposes. Setting that NodePool’s “weight” attribute to a high value tells Karpenter to prioritize it when provisioning nodes. If those instance types are unavailable or have reached a limit, Karpenter can fall back to a lower-weight NodePool, which might contain regular on-demand instances. Using multiple NodePools this way allows you to prioritize a desired configuration while still ensuring fallbacks are available to avoid blocking pod scheduling.
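Here is a minimal sketch of this pattern with two hypothetical NodePools (one preferring spot capacity with a high weight and an on-demand fallback with a lower weight), both referencing the default NodeClass from earlier:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-preferred # hypothetical name
spec:
  weight: 100 # higher weight is evaluated first
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot"]
      nodeClassRef:
        name: default
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: on-demand-fallback # hypothetical name
spec:
  weight: 10 # used when the spot NodePool cannot satisfy the pods
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]
      nodeClassRef:
        name: default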
Accurately configuring pod resource requests is crucial for enabling Karpenter to make correct instance rightsizing decisions. Without accurate resource values, Karpenter cannot select the right instance types, consolidate nodes to reduce waste, or optimize costs.
Administrators typically struggle to rightsize pod resource allocations, especially in large clusters running a wide variety of workloads. The challenge is analyzing each pod’s historical utilization to determine optimal resource allocations, continuously repeating this analysis to keep allocations up to date, and doing all of this at scale.
Oversized resources will cause waste, while undersized resources will create performance bottlenecks, so implementing an automation tool to reduce the operational overhead for administrators and improve the accuracy of resource allocation is recommended. StormForge Optimize Live can assist administrators in addressing these needs as an HPA-compatible automated rightsizing solution powered by machine learning algorithms, removing the need to manually set pod resource limits and requests.
Karpenter represents a significant advancement for Kubernetes autoscaling, offering instance configuration flexibility, cost optimization features, and the ability to manage instances with Kubernetes-native objects. By understanding the architecture and design of Karpenter, it is easy to see how its approach to autoscaling and consolidation can bring value to any EKS cluster. Following the installation tutorial described above allows administrators to get hands-on experience configuring Karpenter and will be a starting point for further experimentation.
Getting the most out of Karpenter requires following best practices around its configuration, NodePool constraints, observability, and accurately defined pod resource allocations to enable precise scaling decisions. Administrators leveraging Karpenter’s rightsizing capabilities can consider testing an automation solution for rightsizing pod CPU/memory resources with StormForge. By using StormForge and Karpenter together, administrators benefit from precise cluster-wide rightsizing and cost optimization, ensuring that resources aren’t wasted and that workload performance is maximized. You can try out StormForge for free.