Chapter 4 - Horizontal Pod Autoscaler
The Horizontal Pod Autoscaler (HPA) is a key component of the Kubernetes autoscaling feature set. It adjusts the number of replicas for a workload based on observed metrics such as CPU utilization or custom metrics. A well-configured HPA helps maintain application performance and reliability while keeping costs under control.
CPU-based scaling needs little setup beyond a running Metrics Server and CPU requests on the target pods. Using the HPA effectively is harder: autoscaling parameters need fine-tuning, and custom metrics require setting up and maintaining additional components that collect, aggregate, and expose those metrics to the autoscaler.
This article describes the fundamental concepts, components, and inner workings of the HPA. We walk through setting up, configuring, and observing it under various load scenarios in a practical demonstration, then explore common obstacles and advanced tools for running the HPA in production environments. By the end of this article, you will have a solid foundation in the HPA and be well-equipped to use it in your Kubernetes clusters.
The Horizontal Pod Autoscaler is a Kubernetes resource that automatically scales the number of pods in scalable resources such as deployments and replica sets based on observed metrics. The primary goal of the HPA is to ensure that applications have sufficient resources to handle varying levels of load while also avoiding overprovisioning and wasting resources during periods of low demand.
The HPA has evolved significantly since its introduction. In the early versions of Kubernetes, HPA v1 only supported scaling based on CPU utilization. This meant that the HPA could only make scaling decisions by comparing the observed CPU usage of pods against a target CPU utilization percentage set by the user.
To address this limitation, Kubernetes introduced HPA v2 (the autoscaling/v2 API, stable since v1.23), which added support for scaling based on memory utilization and custom metrics. Users can define their own metrics, so the HPA can make scaling decisions tailored to the specific needs of their applications.
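The HPA can scale based on three types of metrics: resource metrics (CPU and memory, served by the Metrics Server), custom metrics (application-specific metrics exposed through the Custom Metrics API), and external metrics (metrics from outside sources exposed through the External Metrics API).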
The HPA periodically fetches the specified metrics and calculates the desired number of replicas based on the observed values and the target values set by the user. The basic scaling algorithm can be summarized as follows:
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
For example, if the current metric value is 200m (millicores) and the desired value is 100m, the number of replicas will be doubled (200 / 100 = 2). If the current value is 50m, the number of replicas will be halved (50 / 100 = 0.5), with any fractional result rounded up to the next whole replica.
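The formula above can be tried out with a few lines of Python. This is only a sketch of the basic calculation: the function name is ours, and details of the real controller (such as its tolerance band and its handling of unready pods) are left out.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float) -> int:
    """Basic HPA formula: ceil(currentReplicas * (currentMetricValue / desiredMetricValue))."""
    return math.ceil(current_replicas * (current_value / target_value))


# Four replicas averaging 200m against a 100m target: the count doubles.
print(desired_replicas(4, 200, 100))  # 8
# Four replicas averaging 50m against a 100m target: the count is halved.
print(desired_replicas(4, 50, 100))   # 2
# Three replicas at 50m: 1.5 is rounded up to 2.
print(desired_replicas(3, 50, 100))   # 2
```

Note that the result depends on the current replica count, not only on the ratio of the two metric values.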
The HPA also considers the minReplicas and maxReplicas values specified in the HPA definition to ensure that the number of replicas stays within the configured bounds.
When multiple metrics are specified, the HPA calculates each metric’s desired number of replicas independently and then takes the maximum value as the final desired replica count.
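As a rough illustration of the two rules above (take the largest per-metric proposal, then keep it within minReplicas and maxReplicas), consider the following sketch. The function names and the sample metric values are hypothetical and chosen only for the example.

```python
import math


def desired_for_metric(current_replicas, current_value, target_value):
    """Replica count proposed by a single metric."""
    return math.ceil(current_replicas * (current_value / target_value))


def final_replicas(current_replicas, metrics, min_replicas, max_replicas):
    """metrics is a list of (current_value, target_value) pairs, one per metric."""
    proposals = [desired_for_metric(current_replicas, cur, tgt) for cur, tgt in metrics]
    wanted = max(proposals)  # the most demanding metric wins
    return max(min_replicas, min(max_replicas, wanted))  # clamp to the configured bounds


# 4 replicas; CPU at 80% vs. a 50% target proposes 7, requests at 30 vs. 50 proposes 3.
print(final_replicas(4, [(80, 50), (30, 50)], min_replicas=1, max_replicas=10))  # 7
# The same load with maxReplicas lowered to 5 is capped at 5.
print(final_replicas(4, [(80, 50), (30, 50)], min_replicas=1, max_replicas=5))   # 5
```

Taking the maximum means a single saturated metric is enough to trigger a scale-up, even if every other metric is well below its target.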
When the load drops, the HPA doesn't immediately scale down the number of replicas. Instead, it uses a stabilization window to prevent rapid fluctuations in the number of replicas due to dynamic load patterns.
You can customize the scaling-down behavior using the behavior field in the HPA specification. For example:
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
This configuration sets a 5-minute stabilization window and allows scaling down by 100% of the current replicas every 15 seconds. You can further refine this behavior by specifying multiple policies and using the selectPolicy field to choose how these policies are applied.
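To make the effect of the stabilization window more concrete, here is a simplified sketch. It assumes the controller keeps a history of its recent replica recommendations and, when scaling down, acts on the highest one seen inside the window. The function name, the history format, and the sample values are hypothetical.

```python
def stabilized_recommendation(history, now, window_seconds):
    """Return the highest recommendation made within the last window_seconds.

    history is a list of (timestamp_seconds, recommended_replicas) tuples
    and is assumed to include the most recent recommendation.
    """
    recent = [replicas for timestamp, replicas in history
              if now - timestamp <= window_seconds]
    return max(recent)


# Load drops after t=60: raw recommendations fall from 8 to 2.
history = [(0, 8), (60, 8), (120, 3), (180, 2), (240, 2)]

# With a 300-second window, the earlier recommendation of 8 still holds at t=240.
print(stabilized_recommendation(history, now=240, window_seconds=300))  # 8
# With a 60-second window, the scale-down to 2 goes ahead.
print(stabilized_recommendation(history, now=240, window_seconds=60))   # 2
```

A longer window trades slower cost savings for protection against scaling down just before the load returns.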
It's also possible to disable scale-down altogether:
behavior:
  scaleDown:
    selectPolicy: Disabled
The Metrics Server is a cluster-wide aggregator of resource usage data in Kubernetes. It collects CPU and memory usage metrics from the kubelet on each node and exposes them through the Kubernetes API. The HPA uses the Metrics Server to access resource metrics for scaling decisions.
To use the HPA with resource metrics, you must deploy the Metrics Server in your Kubernetes cluster. The Metrics Server is not deployed by default in most Kubernetes distributions, so you may need to install it separately.
Custom and External Metrics APIs
The Custom Metrics API and External Metrics API allow the HPA to access application-specific and external metrics for scaling decisions. These APIs are provided by third-party adapters that collect metrics from various sources and expose them in a format compatible with the HPA.
Popular custom and external metrics adapters include the Prometheus Adapter, which we use in the custom metrics demo later in this article, and KEDA.
To use custom or external metrics with the HPA, you must deploy the appropriate adapter in your Kubernetes cluster and configure it to expose the desired metrics. The adapter then registers itself with the Kubernetes API server, making the metrics available for the HPA to consume.
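In this section, we walk through a hands-on demo of the HPA to demonstrate its usage and behavior in a real-world scenario. Before we begin, be sure that you have the following prerequisites in place: a running Kubernetes cluster, kubectl configured to access it, and the Metrics Server installed in the cluster.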
We deploy a web application for this demo and expose an endpoint using a service.
Create a new file named hpa-demo.yaml with the following content:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hpa-demo
spec:
  selector:
    matchLabels:
      run: hpa-demo
  template:
    metadata:
      labels:
        run: hpa-demo
    spec:
      containers:
      - name: hpa-demo
        image: registry.k8s.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 200m
---
apiVersion: v1
kind: Service
metadata:
  name: hpa-demo
  labels:
    run: hpa-demo
spec:
  ports:
  - port: 80
  selector:
    run: hpa-demo
To deploy, run the command below:
> kubectl apply -f hpa-demo.yaml
deployment.apps/hpa-demo created
service/hpa-demo created
Next, verify the deployment status:
> kubectl get deploy/hpa-demo
NAME READY UP-TO-DATE AVAILABLE AGE
hpa-demo 1/1 1 1 71s
Now that the demo app is running, we can create the autoscaler.
Create a new file named hpa.yaml with the following definition:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hpa-demo
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
This HPA configuration will automatically scale the hpa-demo deployment based on the average CPU utilization of its pods. By adjusting the number of replicas between 1 and 10, it will try to maintain an average CPU utilization of 50%.
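CPU utilization here is measured relative to the pods' CPU requests, not their limits. The sketch below shows how such a percentage could be derived, assuming it is computed as total usage divided by total requests across the pods. The function name and sample usage figures are hypothetical; the 200m request and 500m limit come from the deployment above.

```python
def average_cpu_utilization(usage_millicores, request_millicores):
    """Percentage of requested CPU in use across all pods (integer percent)."""
    return 100 * sum(usage_millicores) // sum(request_millicores)


# One pod using 136m of its 200m request.
print(average_cpu_utilization([136], [200]))            # 68
# One pod running at its 500m limit: utilization exceeds 100% of the request.
print(average_cpu_utilization([500], [200]))            # 250
# Two pods using 150m and 50m against 200m requests each.
print(average_cpu_utilization([150, 50], [200, 200]))   # 50
```

Because the limit (500m) is 2.5 times the request (200m), a pod in this demo can report up to 250% utilization.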
To create the HPA, run this command:
> kubectl apply -f hpa.yaml
horizontalpodautoscaler.autoscaling/hpa-demo created
To observe how the autoscaler responds to increased load, we’ll create a separate pod that acts as a client. This client pod will continuously send requests to the hpa-demo service in an infinite loop.
In a new terminal, run the following command to create the load generator pod:
> kubectl run -i --tty test-load-generator --rm --image=busybox:1.35 --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://hpa-demo; done"
This command will run the load generator pod and send requests to the hpa-demo service every 0.01 seconds.
In the original terminal, execute the following command to monitor the HPA:
> kubectl get hpa hpa-demo --watch
Within a short period, you should notice an increase in CPU load. As the load continues, the number of replicas will increase. For example:
NAME       REFERENCE             TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
hpa-demo   Deployment/hpa-demo   0%/50%     1         10        1          8m52s
hpa-demo   Deployment/hpa-demo   68%/50%    1         10        1          9m16s
hpa-demo   Deployment/hpa-demo   250%/50%   1         10        2          9m31s
hpa-demo   Deployment/hpa-demo   200%/50%   1         10        4          9m46s
hpa-demo   Deployment/hpa-demo   105%/50%   1         10        5          10m
In this case, CPU utilization peaked at 250% of the pods' requested CPU, five times the 50% target, prompting the Horizontal Pod Autoscaler to scale the deployment in steps up to five replicas.
To verify the number of replicas, run:
> kubectl get deployment hpa-demo
The output should display the replica count, matching the value from the HPA:
NAME READY UP-TO-DATE AVAILABLE AGE
hpa-demo 5/5 5 5 10m
To complete the demonstration, you must stop generating load on the hpa-demo service.
In the terminal where you created the load generator pod using the busybox image, press Ctrl+C to terminate the load generation process.
After allowing a brief period for the system to adjust, observe the HorizontalPodAutoscaler’s status by running:
> kubectl get hpa hpa-demo --watch
You should see output resembling the following:
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
hpa-demo Deployment/hpa-demo 0%/50% 1 10 1 15m
This indicates that the CPU utilization has dropped to 0%, falling below the target of 50%.
To confirm that the deployment has scaled down, run:
> kubectl get deployment hpa-demo
The output should show that the number of replicas has been reduced to one:
NAME READY UP-TO-DATE AVAILABLE AGE
hpa-demo 1/1 1 1 20m
As the CPU utilization decreased to 0%, the HPA automatically scaled the deployment to one replica.
Remember that scale-down is not immediate: the default stabilization window is five minutes, so allow a few minutes for the replica count to drop.
In the example above, we covered CPU usage-based scaling for pods. However, scaling on metrics like CPU and memory will often be insufficient. To address this, HPA provides custom pod-metrics-based scaling, as mentioned earlier.
In this section, we deploy a demo app that exposes a custom metric called request_count and deploy an HPA that scales the demo based on this custom metric.
We must install the Prometheus Operator and Prometheus Adapter to implement our use case:
The Prometheus Adapter bridges Prometheus and the Kubernetes metrics API, exposing custom metrics collected by Prometheus in a format that Kubernetes can understand and use for autoscaling.
After installation, we must update the adapter rules to ensure that the Prometheus Adapter collects the metrics from our demo app. Save the rules below as values.yaml:
rules:
  default: true
  custom:
  - seriesQuery: 'request_count'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_count"
      as: "${1}_per_second"
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
  - seriesQuery: |
      {namespace!="",__name__!~"^container_.*"}
    resources:
      template: "<<.Resource>>"
    name:
      matches: "^(.*)_count"
      as: ""
    metricsQuery: |
      sum by (<<.GroupBy>>) (
        irate (
          <<.Series>>{<<.LabelMatchers>>}[1m]
        )
      )
Then apply the updated values to the adapter release:
> helm upgrade -f values.yaml prometheus-adapter prometheus-community/prometheus-adapter
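The name section of the first custom rule rewrites series names with a regular expression. The Python sketch below mimics that mapping to show what the matches and as fields do. It is an approximation for illustration only: the helper name is ours, and Python's \g<1> stands in for the ${1} capture-group syntax used in the rule.

```python
import re


def rename_metric(series_name, matches=r"^(.*)_count", as_template=r"\g<1>_per_second"):
    """Apply a matches/as style rename; return None if the series does not match."""
    if re.search(matches, series_name) is None:
        return None
    return re.sub(matches, as_template, series_name)


# The captured prefix "request" is reused in the new name.
print(rename_metric("request_count"))        # request_per_second
# Series that do not end in "_count" are not covered by this rule.
print(rename_metric("http_requests_total"))  # None
```

The new name reflects the metricsQuery in the same rule, which wraps the counter in rate() so that the HPA sees a per-second rate instead of an ever-growing total.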
To implement our use case, apply the manifest below.
> kubectl apply -f https://gist.githubusercontent.com/decisivedevops/d23cb8620275af24d1c0ac4097518f49/raw/9b332a69504e3ae8998387e298c62aeb8fc374ec/pod-metrics-hpa-demo.yaml
Here’s a brief overview of each resource:
Next, ensure that the application is deployed and running.
> kubectl get pods
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-adapter-65bf5d48b-5rprb     1/1     Running   0          4m40s
prometheus-operator-865844f8b4-jxhqd   1/1     Running   0          2m17s
prometheus-prometheus-0                2/2     Running   0          2m3s
prometheus-prometheus-1                2/2     Running   0          2m3s
py-hpa-demo-746778999b-kxw7r           1/1     Running   0          111s
We can confirm that the metrics exposed by the application are available through the custom metrics API:
> kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
You should see the pods/request_count metric.
Now we can confirm that the HPA is reading the pod metrics:
> kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
py-hpa-demo-hpa Deployment/py-hpa-demo 0/50 1 10 1 5m15s
A TARGETS value of 0/50 indicates that the demo app has not received any requests yet, so the HPA is idle.
To generate the requests, you can use the test-load-generator loop below:
> kubectl run -i --tty test-load-generator --rm --image=busybox:1.35 --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://py-hpa-demo:8080; done"
If you don't see a command prompt, try pressing enter.
Hello, World! Request count: 11
Hello, World! Request count: 12
Hello, World! Request count: 13
Hello, World! Request count: 14
Hello, World! Request count: 15
Hello, World! Request count: 16
After a few seconds, you can see that the HPA TARGETS value is updated:
> kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
py-hpa-demo-hpa Deployment/py-hpa-demo 28/50 1 10 1 8m4s
Continue to run that loop to see the HPA in action. After 1-2 minutes of running the test-load-generator loop, check the HPA again:
> kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
py-hpa-demo-hpa Deployment/py-hpa-demo 154/50 1 10 2 9m59s
This confirms that the HPA is indeed scaling the pods based on the custom metrics that the app is exposing.
While the Horizontal Pod Autoscaler (HPA) is a powerful tool for automatically scaling applications in Kubernetes, there are several common pitfalls to be aware of. Let’s explore them in detail.
While the HPA supports custom metrics, setting them up can be complex. It requires additional components and configuration, such as a monitoring system to collect the metrics and an adapter that serves them through the Custom Metrics API.
Collecting and exposing custom metrics often involves integrating with external monitoring systems or instrumenting the application code. This setup process can be time-consuming and requires expertise in Kubernetes and the chosen monitoring solution. Ensuring the reliability and scalability of the custom metrics pipeline adds another layer of complexity to the overall setup.
The HPA has several parameters that must be tuned based on the application’s requirements and behavior. These parameters include the target metric values, the minimum and maximum number of replicas, and the scaling behavior.
Finding the correct values for these parameters can be challenging because it requires understanding the application’s performance characteristics and load patterns. Incorrectly tuned parameters can lead to aggressive scaling or slow response to changes in demand, impacting the application’s performance and cost efficiency.
The HPA can only scale applications within the limits of the available cluster resources. If the cluster cannot accommodate the additional pods, they stay in the Pending state and the application stops scaling at the cluster’s maximum capacity, even when maxReplicas has not been reached. This can result in performance degradation or even service unavailability. It’s vital to ensure that the cluster has enough capacity to handle the expected workload and to plan for potential scaling needs in advance.
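A quick back-of-the-envelope check can show whether a cluster could actually host maxReplicas pods. The sketch below is a deliberately simplified estimate that considers only CPU requests and assumes each pod must fit entirely on one node. The function name and the node figures are hypothetical, and real scheduling also depends on memory, taints, affinity rules, and other constraints.

```python
def max_schedulable_replicas(node_free_millicores, pod_request_millicores):
    """Count how many pods fit when each pod must fit entirely on a single node."""
    return sum(free // pod_request_millicores for free in node_free_millicores)


# Three nodes with 1500m, 900m, and 300m of unreserved CPU; each pod requests 200m.
print(max_schedulable_replicas([1500, 900, 300], 200))  # 12

# A smaller cluster fits only 3 pods, so an HPA with maxReplicas: 10 would stall there.
print(max_schedulable_replicas([500, 300], 200))        # 3
```

If the estimate falls below maxReplicas, either the cluster needs more capacity or the HPA bounds should be adjusted to match what the cluster can deliver.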
Monitoring and troubleshooting HPA can be challenging, especially when dealing with complex applications and large-scale deployments. Proper monitoring and logging mechanisms are required to track scaling behavior and identify issues.
Troubleshooting HPA issues may involve analyzing metrics, reviewing HPA events and logs, and correlating them with application behavior. Effective monitoring and troubleshooting practices are essential to ensure the smooth operation of HPA and to resolve any issues that arise quickly.
In this comprehensive guide, we explored the Horizontal Pod Autoscaler (HPA), a powerful Kubernetes feature that automatically scales the number of pods based on observed metrics. We learned the core concepts, components, and working principles of HPA and implemented it through a hands-on demonstration, showcasing its ability to scale a deployment under varying load conditions.
We also discussed the common pitfalls and challenges associated with HPA, such as the complexities of setting up custom metrics, the difficulty in tuning HPA parameters, and the limitations imposed by cluster capacity. To address these challenges and extend HPA’s capabilities, we introduced Kubernetes Event-driven Autoscaling (KEDA) as an advanced tool. KEDA simplifies the process of autoscaling based on event-driven and custom metrics, providing a more flexible and intuitive approach to scaling applications.
As you continue to explore Kubernetes autoscaling, remember to assess your application’s requirements carefully, choose the appropriate metrics for scaling, and leverage the right tools and techniques to overcome the challenges associated with HPA. With a solid understanding of HPA and the benefits of advanced tools like KEDA, you’ll be well-equipped to build and manage scalable applications on Kubernetes confidently.