By Yofti Makonnen and Erwin Daria, StormForge | Oct 26, 2023
For any organization on its app-modernization journey, Kubernetes is the go-to platform for deploying and managing containerized applications. But concealed within its hyper-dynamic flexibility and cloud-native awesomology™ is an inconvenient truth: Kubernetes by itself is remarkably inefficient at resource management, and that inefficiency carries a cost that has proven difficult to address at scale.
Over the first two installments of this blog series, we covered why cost optimization capabilities are non-negotiable for today’s cloud-native application platforms, starting with the importance of intelligent node autoscaling tools like Karpenter as a critical first step in managing resource capacity for Kubernetes.
In this installment, we’ll build on those insights by switching our focus to perhaps the hardest problem to solve within the Kubernetes cost optimization paradigm: Workload Resource Right-Sizing.
Where node autoscaling manages the resource capacity of a Kubernetes cluster, pod resource right-sizing adjusts the resource requests of the workloads themselves in response to changes in traffic or resource requirements. With right-sized workloads, intelligent node autoscalers inherently make more cost-effective decisions about the kind and shape of nodes to provision for a cluster.
The most obvious example of a pod resource right-sizing tool is the Kubernetes-native Vertical Pod Autoscaler, or VPA.
VPA, like the Horizontal Pod Autoscaler, or HPA, is designed to scale pod resources to accommodate changing application load. However, unlike HPA, which scales pod replicas horizontally, VPA scales a pod’s resources vertically – changing CPU and memory request values up or down.
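For reference, here’s a minimal sketch of a VPA object; the target Deployment name and update mode are illustrative:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: productcatalogservice-vpa
spec:
  targetRef:
    # Workload whose requests VPA will manage (illustrative name)
    apiVersion: apps/v1
    kind: Deployment
    name: productcatalogservice
  updatePolicy:
    # "Auto" lets VPA evict and recreate pods to apply new requests
    updateMode: "Auto"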
While the VPA in theory demonstrates clear value for organizations managing Kubernetes workload resources, it can present a few limitations in practical use.
Vertically right-sizing pod requests based on resource demands is an undeniably critical capability that should be part of any Kubernetes platform. But the Kubernetes-native VPA’s current limitations often prevent widespread adoption, leaving much of its cost savings potential on the table. A recent user survey calculated VPA adoption at less than 1% of all workloads; HPA, by contrast, was adopted on more than 40% of workloads.
Some of VPA’s limitations to consider:
- VPA cannot safely be combined with the HPA when both act on the same CPU or memory metrics, which excludes many production workloads.
- Applying VPA recommendations requires evicting and restarting pods, which can disrupt sensitive workloads.
- VPA runs as a set of in-cluster components that must be installed, configured, and tuned in every cluster, which becomes difficult to manage at scale.
To address the workload resource right-sizing problem without the limitations of the VPA, StormForge has developed Optimize Live: a hosted right-sizing platform that uses machine learning to provide pod-level resource recommendations that are compatible with the HPA and designed to support large-scale Kubernetes environments.
Optimize Live is designed to right-size Kubernetes workloads continuously, eliminating resource overprovisioning while ensuring application performance.
Optimize Live improves upon the Kubernetes-native VPA by analyzing pod usage metrics for a wider array of workload types, including those scaled with an HPA, and by providing a continuous cadence of right-sizing recommendations that can be implemented autonomously. Additionally, because all of Optimize Live’s right-sizing logic is hosted, it dramatically simplifies installation and configuration while scaling to support large environments with hundreds of thousands of workloads across many clusters – a feat that cannot be accomplished reliably with the VPA.
In order to illustrate the benefits of implementing Optimize Live along with Karpenter (from our previous post), we’ll be using a few tools against our sample environment:
For instructions on deploying and configuring auto-scaling groups with Cluster Autoscaler on EKS, please refer to AWS’ official documentation.
For instructions on deploying and configuring Karpenter, please refer to Karpenter’s official documentation.
Sign up for a 30-day free trial of StormForge Optimize Live.
For specific configurations used in this blog, please refer to StormForge’s GitHub.
In addition to the tools listed above, our sample Kubernetes environment is configured with two node groups:
- A Cluster Autoscaler-managed node group of m5.xlarge on-demand instances
- A Karpenter provisioner configured for amd64 on-demand instances

With the infrastructure details out of the way, let’s review our testing methodology.
Our goal in this blog is to use a control set of Kubernetes objects (our microservices demo) to induce each node provisioner to add a set of worker nodes to the cluster, then implement the workload right-sizing recommendations generated by Optimize Live and compare the costs for each.

To keep the two provisioners separate, each set of nodes is labeled and tainted. For the Cluster Autoscaler nodes:

labels: node-type=cluster-autoscaler
taints: dedicated=cluster-autoscaler:NoSchedule
For the Karpenter nodes:
labels: node-type=karpenter
taints: dedicated=karpenter:NoSchedule
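For illustration, here’s a minimal sketch of how those Karpenter labels and taints might be configured, assuming the v1alpha5 Provisioner API current at the time of writing; the exact configurations we used are in the GitHub repo linked above:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Label and taint applied to every node Karpenter provisions
  labels:
    node-type: karpenter
  taints:
    - key: dedicated
      value: karpenter
      effect: NoSchedule
  # Limit Karpenter to amd64 on-demand capacity, matching our node setup
  requirements:
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  # Consolidation lets Karpenter repack workloads onto cheaper nodes
  consolidation:
    enabled: true
  # (providerRef to an AWSNodeTemplate omitted for brevity)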
We’ll use nodeSelector and tolerations settings to ensure that our pods are scheduled using each of the respective node provisioners. Lastly, we’ll use eks-node-viewer to review and compare the projected monthly costs of each set of nodes with and without Optimize Live’s recommendations implemented.
As in our previous installment, we’ll use helm to deploy our microservices demo application:
helm install ecommerce-1 -n ecommerce-1 microservices-demo \
--create-namespace \
--values values.yaml
Included in the values.yaml file:
nodeSelector:
  key: node-type
  value: cluster-autoscaler
tolerations:
  key1: "dedicated"
  operator1: "Equal"
  value1: "cluster-autoscaler"
  effect1: "NoSchedule"
Figure 4: For reference, the values.yaml file includes both the nodeSelector and tolerations for our Cluster Autoscaler nodes
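Assuming the chart passes these values straight through to the standard Kubernetes scheduling fields, each rendered pod spec would look roughly like this:

spec:
  # Pin pods to the Cluster Autoscaler node group...
  nodeSelector:
    node-type: cluster-autoscaler
  # ...and tolerate its dedicated taint
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "cluster-autoscaler"
      effect: "NoSchedule"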
With all 20 microservices demo pods successfully running, we can use eks-node-viewer to visualize the monthly costs for these instances.
eks-node-viewer -node-selector node-type=cluster-autoscaler -resources cpu,memory
EKS-Node-Viewer Output:
By using eks-node-viewer, we can see that our non-optimized application is deployed across 7 x m5.xlarge instances, representing a projected monthly cost of $981.12.
Our next step is to take a look at the workload right-sizing recommendations made by Optimize Live and see whether implementing them impacts the number of nodes provisioned by Cluster Autoscaler – and thereby our monthly infrastructure costs.
With our microservices demo application deployed, we can log into our Optimize Live account to see our recommendations:
Upon initial inspection of our “ecommerce-1” namespace in the Optimize Live user interface, we can see that our current (non-optimized) requests are calculated at >18.5 cores and >36GiB of memory. Meanwhile, Optimize Live’s machine learning has been analyzing the actual usage of each microservice and has generated a series of recommendations that can reduce total CPU requests by ~8 cores and memory requests by ~34GiB.
To view each of the 11 microservices’ recommendations independently, we have a number of options:
First, we can click each of the microservices to view their details:
The details page for each workload provides visualizations for current request values, resource usage metrics over the analysis period, along with recommended requests and the projected impact of applying those recommendations.
To retrieve each of the recommendations as YAML objects, we can simply click the “Export Patch” button in the UI, or we can use the stormforge CLI to download JSON:
stormforge patch -d . --cluster partner-sandbox-2 --resource deployments --namespace ecommerce-2 --name productcatalogservice --recommendation 1698164280
Deployment_productcatalogservice.json:
{
  "apiVersion": "apps/v1",
  "kind": "Deployment",
  "metadata": {
    "annotations": {
      "kubernetes.io/change-cause": "Optimize Live (1698164280)",
      "stormforge.io/last-updated": "2023-10-24T19:29:04Z",
      "stormforge.io/recommendation-url": "https://api.stormforge.io/v3/workloads/partner-sandbox-2/ecommerce-2/deployments/productcatalogservice/recommendations/1698164280"
    },
    "name": "productcatalogservice"
  },
  "spec": {
    "template": {
      "spec": {
        "containers": [
          {
            "name": "server",
            "resources": {
              "limits": {
                "cpu": "874m",
                "memory": "23Mi"
              },
              "requests": {
                "cpu": "728m",
                "memory": "19Mi"
              }
            }
          }
        ]
      }
    }
  }
}
Figure 9: Patches produced by Optimize Live include annotations that describe the “change-cause” along with other metadata that helps explain why the recommendations were made
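If you’d rather apply a single exported patch by hand instead of using the CLI, one option is kubectl patch against the exported file – a sketch using the file name from the export above, not the documented StormForge workflow:

kubectl patch deployment productcatalogservice -n ecommerce-2 \
  --patch-file Deployment_productcatalogservice.json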
Of course, simply generating recommendations isn’t what you’re here to see. Let’s apply the recommendations and see how a properly right-sized set of workloads affects the nodes provisioned by Cluster Autoscaler.
For the complete documentation covering the application of recommendations, visit StormForge Docs.
For the purposes of our testing, we will use the stormforge CLI to apply the most recent recommendations to all of the workloads in our “ecommerce-1” namespace.
stormforge apply --namespace ecommerce-1
This will apply the recommendations generated for each of our workloads within the ecommerce-1 namespace.
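To sanity-check that the new requests have landed, a quick kubectl query like this works (the column paths assume single-container pods, as in our demo):

kubectl get deployments -n ecommerce-1 \
  -o custom-columns='NAME:.metadata.name,CPU:.spec.template.spec.containers[0].resources.requests.cpu,MEM:.spec.template.spec.containers[0].resources.requests.memory'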
We can then use eks-node-viewer again to see how the optimized workloads impact node provisioning:
EKS-Node-Viewer Output:
With Optimize Live’s recommendations applied, our demo infrastructure has gone from 7 x m5.xlarge instances to 6 x m5.xlarge instances, resulting in a decrease in costs from $981.12 to $840.96 – a savings of $140.16 (14%), or exactly one fewer node.
While projected monthly infrastructure costs are down, total CPU allocated sits at just 56.5%, suggesting that a different set of instances could deliver much better resource utilization for this application.
In a nod to our previous installment, these results raise the question: how would Karpenter fare with these right-sized workload requests?
Once again, we’ll deploy a second copy of our microservices demo application with helm, passing our values file configured for Karpenter:
helm install ecommerce-2 -n ecommerce-2 microservices-demo \
--create-namespace \
--values karpenter-values.yaml
Included in the karpenter-values.yaml file:
nodeSelector:
  key: node-type
  value: karpenter
tolerations:
  key1: "dedicated"
  operator1: "Equal"
  value1: "karpenter"
  effect1: "NoSchedule"
Figure 11: For reference, the karpenter-values.yaml file includes both the nodeSelector and tolerations for our Karpenter nodes
After successfully deploying our 20 non-optimized pods, this time on Karpenter-provisioned nodes, we can use eks-node-viewer again to visualize and record the projected monthly infrastructure costs.
eks-node-viewer -node-selector node-type=karpenter -resources cpu,memory
EKS-Node-Viewer Output:
With our microservices demo application successfully deployed, we can log into our Optimize Live account and confirm that Optimize Live is now also analyzing and generating recommendations for our microservices in the ecommerce-2 namespace. From there, we can use the same steps outlined above to inspect and apply our recommendations:
Looking at the most recent copy of our microservices demo application deployed into the “ecommerce-2” namespace, we can see an identical set of recommendations highlighting a reduction in CPU requests by ~8 cores and a reduction in memory requests by ~34GiB.
How will these right-sizing recommendations affect Karpenter? Let’s apply them and take a look.
We can apply Optimize Live’s right-sizing recommendations using the stormforge CLI (just as we did above) to see how they affect which instance types Karpenter provisions to our cluster:
stormforge apply --namespace ecommerce-2
With the recommendations applied, we can turn our attention back to eks-node-viewer to see how Karpenter re-provisions nodes and consolidates the now-right-sized workloads onto a different set of instances:
eks-node-viewer -node-selector node-type=karpenter -resources cpu,memory
EKS-Node-Viewer Output:
Now that all of the workloads in our “ecommerce-2” namespace have been right-sized, Karpenter is able to provision a different set of instances that further reduces our projected monthly infrastructure costs from $721.24 to $253.89 – a total reduction of 74% from our initial baseline with Cluster Autoscaler, and a dramatic improvement over using Karpenter alone.
In the previous installment of this blog, we highlighted the benefits of implementing an intelligent node autoscaling solution, specifically Karpenter, as a critical first step toward cost optimization in a Kubernetes environment. While the infrastructure cost savings from node autoscaling alone can be substantial, adding a workload right-sizing solution like Optimize Live multiplies those savings. As you’ve seen, applying both Karpenter and Optimize Live to a single namespace delivered a clear cost optimization benefit.