Editor’s note: This guest blog post was written by Max Brenner, a freelance DevOps and cloud consultant based in Germany. For more from Max, see his blog here.
Note from the author: This post has been happily sponsored by StormForge, the company building the product that is being used. Despite sponsorship, they allowed me to share my honest opinion. Huge thanks for their trust and support!
Manually optimizing an application on Kubernetes is hard, especially if your goal is to keep costs low. There are plenty of configuration options, the dependency trees of modern microservice architectures keep growing, and for some (most?) of us all of this runs on hardware we have no control over.
Additionally, without significant testing and monitoring effort it’s very difficult to determine the “correct” specs for your deployments. It takes extensive trial and error to push your application into the sweet spot where it still performs well without being overprovisioned.
Otherwise you’ll end up wasting resources (and thus money), or with a system that falls apart as load increases. That’s why I decided to take a look at StormForge, a tool that aims to solve this problem without consuming endless amounts of engineering time.
StormForge is a SaaS product that combines trial-and-error experiments with machine learning to help you determine configurations that optimize your application for the metrics you care about. It also includes Performance Testing to place load on your system in a test environment, but for this blog I’ll focus on StormForge’s optimization capabilities. The visualization below explains the general workflow of a StormForge experiment.
Over the course of a single experiment, StormForge will be able to predict configurations that will exceed the performance of your baseline values, based on the metrics you defined.
As Kubernetes is StormForge’s primary target platform, you interact with it through manifests, a concept you are probably already familiar with. The main one is the Experiment manifest, which consists of the following parts:
- optimization: settings for the experiment itself, such as the number of trials
- parameters: the configuration values StormForge is allowed to vary
- patches: how those parameters are applied to your Kubernetes manifests
- trial template: the Job that measures your application’s performance
- metrics: the criteria you want to gather and optimize
Let’s go through them one by one.
The optimization section allows you to set the number of trials that the experiment will go through. Adjust this value according to your use case. Integrating StormForge experiments into your CI pipeline to prevent performance regressions? 20 trials may be sufficient. Creating a baseline configuration for your newly written application? Let StormForge cycle through 200 trials overnight to get the most accurate result.
...
optimization:
- name: "experimentBudget"
  value: "50"
...
Parameters define the possible configuration values that StormForge can experiment with, like the amount of memory, the number of replicas, or different disk storage types.
As a general rule of thumb, increase the number of trials along with the number of parameters: the more permutations your experiment covers, the more data points the ML model needs to determine the optimal configurations.
...
parameters:
- name: memory
  min: 500
  max: 2000
  baseline: 1000
- name: cpu
  min: 500
  max: 2000
  baseline: 500
...
Patches instruct the StormForge controller how to apply the previously declared parameters to your Kubernetes manifests. In my opinion this is where StormForge’s biggest strength lies, as you can patch everything that is configurable through a Kubernetes manifest.
It may start with simple things like CPU and memory resource limits, but with more and more extensions coming to Kubernetes lately, the possibilities are nearly endless. One example I can imagine is using Crossplane to cycle through the VM types of different cloud providers and see on which one your application performs best (see the sketch after the example below).
...
patch: |
  spec:
    template:
      spec:
        containers:
        - name: my-app
          resources:
            limits:
              memory: "{{ .Values.memory }}Mi"
              cpu: "{{ .Values.cpu }}m"
...
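To make the Crossplane idea a bit more concrete, here is a purely hypothetical sketch. The NodePoolClaim kind, its instanceTypeIndex field, and the vmType parameter are all assumptions of mine; the real names depend on your Crossplane provider and composition, which would map the index to concrete instance sizes.

# Hypothetical patch: let each trial provision a different VM type via a
# Crossplane claim. None of these names come from a real provider schema.
patches:
- targetRef:
    kind: NodePoolClaim                      # assumed composite resource claim
    apiVersion: compute.example.org/v1alpha1 # assumed API group
    name: cassandra-nodes
  patch: |
    spec:
      # the composition would map e.g. 0 -> m5.large, 1 -> c5.xlarge, 2 -> r5.large
      instanceTypeIndex: {{ .Values.vmType }}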
The trial template specifies what your application-specific performance measurement looks like. Internally it launches a Kubernetes Job, which gives you the freedom to use your tool of choice, like StormForge’s own load testing solution, or another tool like Locust or Gatling.
You can also take care of preparation tasks like deploying a Helm chart or a dedicated Prometheus instance (see the sketch after the example below). Be aware that trial jobs are executed on the same Kubernetes cluster as your application under test, so be sure to keep room for their potential additional CPU and RAM usage.
...
template: # trial template
  spec:
    initialDelaySeconds: 15
    template: # job template
      spec:
        template: # pod template
          spec:
            containers:
            - image: crunchydata/crunchy-pgbench:centos7-11.4-2.4.1
              name: pgbench
              envFrom:
              - secretRef:
                  name: postgres-secret
...
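For the preparation tasks mentioned above, the documentation describes setup tasks that run before the trial job. A minimal sketch, assuming a setupTasks field with built-in Helm chart support as described in the docs; double-check the exact schema for your controller version:

template: # trial template
  spec:
    # assumed setup-task mechanism: installs a Prometheus chart before the
    # trial job starts, so metrics can be queried during the trial
    setupTasks:
    - name: monitoring
      helmChart: prometheus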
Metrics allow you to define the criteria that you want to gather and/or optimize, e.g. minimizing the costs while maximizing the throughput of your application. StormForge comes with certain metrics out of the box, like the trial duration, but it can also query popular monitoring systems like Prometheus or Datadog if you are looking to optimize more specific metrics (see the sketch after the example below).
...
metrics:
- name: duration
  minimize: true
  query: "{{duration .StartTime .CompletionTime}}"
- name: cost
  minimize: true
  type: pods
  # calculating with $20/month/CPU core and $3/month/GB of RAM
  query: '{{resourceRequests .Pods "cpu=0.020,memory=0.000000000003"}}'
  selector:
    matchLabels:
      component: my-app
...
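As a sketch of what a monitoring-backed metric could look like, here is an assumed prometheus-type metric. The type and url fields follow the documentation’s pattern, but I haven’t verified this exact schema, and the PromQL query is just an example:

metrics:
- name: p95-latency
  minimize: true
  type: prometheus               # assumed metric type for PromQL-backed metrics
  url: http://prometheus:9090    # assumed field pointing at the Prometheus instance
  query: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))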
For more information, check out StormForge’s documentation, which covers all of these elements in more detail.
With all the concepts clarified, let’s prepare our application and cluster for the first experiment. To begin with, we have to decide on an application that we want to optimize. I went with Apache Cassandra because I always see it using huge amounts of resources in Kubernetes clusters and wanted to check whether that can be improved.
So I went ahead and created a standard StatefulSet that takes care of deploying my Cassandra pods. Additionally, we’ll need some way of putting load onto the database during the trial job. Luckily, Cassandra ships with a tool to do just that, called cassandra-stress.
...
template: # job template
  spec:
    template: # pod template
      spec:
        containers:
        - image: cassandra:4.0
          name: cassandra-stress
          # making sure that our trial job does not eat up endless amounts of resources
          resources:
            limits:
              memory: "1024Mi"
              cpu: "2000m"
          command:
          - bash
          - -c
          - |
            CASSANDRA_URL="cassandra.default.svc.cluster.local"
            # 1. populate the database with write operations
            /opt/cassandra/tools/bin/cassandra-stress write n=100000 -rate threads=400 -node $CASSANDRA_URL
            # 2. run a mixed read/write workload against it
            /opt/cassandra/tools/bin/cassandra-stress mixed n=100000 -rate threads=400 -node $CASSANDRA_URL
            # 3. drop the generated keyspace so the next trial starts clean
            cqlsh --request-timeout=60 -e "DROP KEYSPACE keyspace1;" $CASSANDRA_URL || true
As you can see, our trial job consists of the following three steps:
1. Populate the database with write operations.
2. Run a mixed read/write workload against it.
3. Drop the generated keyspace so the next trial starts with a clean state.
Because we always execute the same number of read and write operations (2 x 100,000), simply measuring the time the trial job takes to complete is a good way to judge how our database performs.
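If you want to sanity-check the workload before wiring it into a trial, you can fire a shortened version of it from a throwaway pod; a quick sketch, reusing the service name from above with a reduced operation count:

# run a short smoke test against the in-cluster Cassandra service;
# "stress-check" is just a throwaway pod name
$ kubectl run stress-check --rm -it --restart=Never --image=cassandra:4.0 -- \
    /opt/cassandra/tools/bin/cassandra-stress write n=1000 \
    -node cassandra.default.svc.cluster.local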
With our application good to go, let’s quickly prepare our Kubernetes cluster. Assuming you have a connection to it set up, the following commands will deploy the StormForge controller after you have installed the stormforge CLI tool.
$ stormforge login # log into our StormForge account
Opening your default browser to visit:
https://auth.carbonrelay.io/authorize?...
You are now logged in.
$ stormforge init # deploy the StormForge controller
customresourcedefinition.apiextensions.k8s.io/experiments.redskyops.dev created
customresourcedefinition.apiextensions.k8s.io/trials.redskyops.dev created
clusterrole.rbac.authorization.k8s.io/redsky-manager-role created
clusterrolebinding.rbac.authorization.k8s.io/redsky-manager-rolebinding created
namespace/redsky-system created
deployment.apps/redsky-controller-manager created
clusterrole.rbac.authorization.k8s.io/redsky-patching-role created
clusterrolebinding.rbac.authorization.k8s.io/redsky-patching-rolebinding created
secret/redsky-manager created
$ stormforge check controller --wait # wait for the controller to become ready
Success.
After going through the preparation phase we can start with our first experiment. To keep it simple, we will start with two parameters (CPU and memory limits) and a single metric (duration of our trial) that we want to optimize. Below you can see the relevant parts of the experiment spec:
...
parameters:
- name: memory
  min: 1000
  max: 4000
  baseline: 2000
- name: cpu
  min: 500
  max: 3500
  baseline: 1000
patches:
- targetRef:
    kind: StatefulSet
    apiVersion: apps/v1
    name: cassandra
  patch: |
    spec:
      template:
        spec:
          containers:
          - name: cassandra
            resources:
              limits:
                memory: "{{ .Values.memory }}Mi"
                cpu: "{{ .Values.cpu }}m"
              requests:
                memory: "{{ .Values.memory }}Mi"
                cpu: "{{ .Values.cpu }}m"
metrics:
- name: duration
  minimize: true
  query: "{{duration .StartTime .CompletionTime}}"
...
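Starting the experiment is just a regular kubectl apply; since stormforge init installed the experiment and trial CRDs, we can also inspect everything with kubectl (the file name experiment.yaml is the same one used in the export command further below):

$ kubectl apply -f experiment.yaml   # create the Experiment resource
$ kubectl get experiments            # confirm the controller picked it up
$ kubectl get trials -w              # watch the trial runs as they complete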
After roughly 1.5 hours we end up with the following chart in the StormForge Optimize dashboard.
As you may have already guessed, the results are not really surprising: the more resources fuel your application, the better it performs. Who would’ve thought? Still, I think this experiment is valuable for two reasons.
First, it’s a simple showcase that helps understand how StormForge works, and second, it validated our assumptions. Imagine if Cassandra performed worse with more resources, or if the trial duration turned out to be completely random. In that case we would know that something in our test setup is messed up. Instead, we can now jump into our first proper experiment with confidence.
So let’s add another dimension of metrics. In most real-world scenarios you try to achieve the best performance while keeping resource usage as low as possible, and that’s exactly what we will be trying to do. As a general rule of thumb for all kinds of optimization, you should always optimize at least two metrics that pull against each other, like increasing throughput while decreasing CPU usage, or reducing latency while minimizing cache size.
We will combine the CPU and memory usage into a single cost-estimation metric using the resourceRequests function, which allows us to create a weighted sum of the two. The parameters stay the same while the metrics change to the following:
...
metrics:
- name: duration
  minimize: true
  query: "{{duration .StartTime .CompletionTime}}"
- name: costs
  minimize: true
  type: pods
  # weights are equal to the AWS EC2 pricing of a c5.xlarge instance
  # ($30.60/month/CPU core, $15.30/month/GB of RAM)
  query: '{{resourceRequests .Pods "cpu=0.03060,memory=0.000000000001530"}}'
  selector:
    matchLabels:
      component: cassandra
...
After applying our Experiment manifest, we can have a look at the individual trial runs using kubectl like so:
$ kubectl get trials -o wide -w
NAME                          STATUS      ASSIGNMENTS             VALUES
cassandra-two-metrics-1-000   Completed   memory=2000, cpu=1000   duration=69, costs=33.808642559999996
cassandra-two-metrics-1-001   Completed   cpu=2845, memory=3003   duration=36, costs=91.87477680384
cassandra-two-metrics-1-002   Completed   cpu=1782, memory=2886   duration=47, costs=59.15927121407999
cassandra-two-metrics-1-003   Completed   cpu=1529, memory=3225   duration=51, costs=51.961336128
cassandra-two-metrics-1-004   Completed   cpu=728, memory=3403    duration=87, costs=27.73630531584
cassandra-two-metrics-1-005   Completed   cpu=537, memory=2269    duration=107, costs=20.07240498432
cassandra-two-metrics-1-006   Completed   cpu=1929, memory=4000   duration=41, costs=65.44468512
...
Now that we have a chart with more useful information, here’s how to read the different data points: every dot is a single trial run, plotted by its costs and duration values. One of them marks our baseline configuration, and the trials along the lower-left edge form the Pareto front, where neither metric can be improved without worsening the other. StormForge picks its recommended configuration from this front.
After selecting the trial run that fits our needs, we can export the new application configuration using the stormforge CLI. I’ll go with StormForge’s recommended trial, which decreased the costs by 16% while achieving the same trial duration.
# export the patched resource:
#   -f cassandra.yaml             -> path to the Cassandra StatefulSet manifest
#   -f experiment.yaml            -> path to the Experiment manifest
#   cassandra-two-metrics-1-015   -> the trial run with our desired configuration
$ stormforge export -t \
    -f cassandra.yaml \
    -f experiment.yaml \
    cassandra-two-metrics-1-015
...
resources:
  limits:
    cpu: 875m
    memory: 1000Mi
  requests:
    cpu: 875m
    memory: 1000Mi
...
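Since the export is written to stdout, you can pipe it straight back into the cluster once you are happy with the configuration; a small sketch reusing the command from above:

# re-run the export and apply the optimized StatefulSet in one go
$ stormforge export -t -f cassandra.yaml -f experiment.yaml \
    cassandra-two-metrics-1-015 | kubectl apply -f -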
Comparing this configuration to our baseline values of 1000m CPU shares and 2000Mi of memory, we were able to save quite a bit of resources. While the CPU improvement is probably down to a bit of wiggle room in the trial run duration, the RAM savings definitely confirm that Cassandra runs perfectly fine with half of our original value.
Please be aware that the outcome of StormForge experiments is only meaningful if your load test closely resembles what your application is exposed to in production, so be sure to select a load generation tool that is capable of reproducing that traffic.
So there you have it: a basic example of optimizing Cassandra on Kubernetes by letting software take care of the tedious and repetitive task of figuring out which position certain knobs should be set to. Of course, this example could be extended endlessly by adding more and more parameters, like the JVM heap size, different disk types, or various compression algorithms.
But as all of those very much depend on your use case and the capabilities of your infrastructure, I’d be very happy if you are able to explore them on your own using StormForge after reading this blog post. If there are any open questions, feel free to reach out to the StormForge team.