By James Hochadel | Apr 15, 2021
Kubernetes is a powerful tool that helps DevOps teams work more efficiently and with less friction, which is why more and more of them are using it every day. Most developers have little trouble standing up clusters and scheduling workloads. But when it comes to the ongoing task of optimizing their deployments for both performance and resource efficiency, it’s a different story.
In this post, we discuss some common Kubernetes Day Two challenges and how machine learning can help developers overcome them.
Kubernetes has caught on and become very popular among DevOps teams for good reason. Developers and engineers love it because it gives them easier ways to build more flexible and scalable applications and infrastructure. Theoretically, it also helps developers ensure that their applications perform reliably and contribute to a positive user/customer experience.
Kubernetes also helps developers achieve faster time-to-market with new apps or features. Its flexible, microservices-based approach and better resource utilization make for less friction in the development process. It also helps dev teams to be responsive to the needs of the business, enables them to break up monolithic applications, and it smooths the path for moving more of their operations to the cloud.
So it’s no surprise that, according to the Cloud Native Computing Foundation (CNCF), this orchestration system continues to lead the container charge. In its CNCF 2020 Survey, 91% of respondents reported using Kubernetes, with 83% saying they’re running it in production. With adoption and usage stats like that, it’s safe to say that Kubernetes is here to stay.
While the upside of Kubernetes is compelling, this container orchestration system is not without its challenges.
One area in which dev teams often encounter challenges is in the Day Two phase. Of course, none of these phases are specific days. They’re actually weeks, months, or even years. Day Two follows Day Zero (requirements, design, and prototyping) and Day One (build, test and deployment).
In the Day Two phase, the application is running in production. That means it needs all of the ‘care and feeding’ that goes with resources in production, including performance monitoring, troubleshooting and incident remediation, maintenance and upgrades, security and compliance checks, etc. Day Two is also when teams become familiar enough with their new resources to start making adjustments and fine-tuning things to strike their desired balance of performance and costs.
But there’s a catch, and it’s the source of Day Two hurdles for lots of developers. Kubernetes does provide capabilities for changing things like application configuration settings to boost performance or lower cloud costs. However, those controls are not automated, and given the dynamic nature of many Kubernetes workloads, making the correct changes fast enough to produce the desired result simply isn’t humanly possible.
Developers could sit at a console all day, querying their Prometheus metrics to make educated guesses about what resources their containers need. But past one or two containers, that’s not a viable strategy, and developers typically have higher priorities. Instead, what they need are automated tools that have the speed necessary to keep pace with development, and the intelligence needed to know which configuration changes influence resource consumption and performance.
The good news for DevOps teams is that new, machine learning-powered solutions are replacing manual configuration management. Let’s take a look at one of these Kubernetes Day Two issues and how machine learning can help developers resolve it.
One of the Day Two goals is to make sure that containers have the right amount of resources to run in the way that’s best for the organization. That means ensuring containers have enough resources so they don’t fail or run poorly, but also that they’re not excessively over-provisioned and driving cloud costs through the roof. Without automation tools, it’s very difficult for developers to consistently strike this ‘just right’ balance.
One overarching factor that affects developers’ resourcing decisions is how important an application or service is to the business. Where an app ranks on the ‘business-critical’ scale generally determines which Quality of Service (QoS) class developers will aim for. Kubernetes has three QoS classes: the highest is Guaranteed, the middle is Burstable, and the lowest is BestEffort.
If a Kubernetes node comes under resource pressure, particularly on incompressible resources like memory or disk space, its kubelet may evict pods to keep the node stable. When that happens, Kubernetes evicts pods with lower QoS classes first. BestEffort pods specify no resource requests or limits and are the lowest priority. Burstable pods request a minimum amount of resources and set an upper limit; if they exceed their requests, they become candidates for eviction. Finally, only if there is no other option will Kubernetes evict pods with the Guaranteed QoS class, although even those pods can be killed during normal operation if they exceed their resource limits.
QoS classes are determined by two key controls: requests and limits. Essentially, they’re dials developers can use to turn up or down the amount of resources, such as CPU or memory, allocated to a container. Requests establish guaranteed resources for a container: when a container requests a resource, Kubernetes will only schedule it on a node that has that resource available. Limits, on the other hand, set a ‘not to exceed’ level beyond which Kubernetes won’t let the container’s resource consumption go.
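To make those dials concrete, here is a minimal sketch of a pod spec. The name, image, and values are hypothetical placeholders, but the requests/limits structure is standard Kubernetes:

```yaml
# Hypothetical example (name, image, and values are placeholders).
# Requests and limits are set and equal for every resource in the container,
# so Kubernetes assigns this pod the Guaranteed QoS class.
apiVersion: v1
kind: Pod
metadata:
  name: checkout-api
spec:
  containers:
    - name: checkout-api
      image: registry.example.com/checkout-api:1.0.0
      resources:
        requests:
          cpu: "500m"      # scheduler only places the pod where 0.5 CPU is available
          memory: "512Mi"  # 512 MiB is reserved for this container
        limits:
          cpu: "500m"      # CPU usage above this level is throttled
          memory: "512Mi"  # memory usage above this makes the container a candidate for an OOM kill
# Setting requests lower than limits (e.g. requests.memory: "256Mi") would make
# the pod Burstable; omitting requests and limits entirely would make it BestEffort.
```

Once the pod is running, `kubectl get pod checkout-api -o jsonpath='{.status.qosClass}'` shows the class Kubernetes assigned.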
Other workload-specific configuration options that developers can control, such as JVM settings for Java applications, may be specified on the pod in a variety of ways, including environment variables, mounted volumes with config files, and CLI arguments to the container. Changing these settings, such as adjusting heap sizes or garbage collection parameters, can have a big impact on application performance.
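As an illustration, JVM flags are often passed to a Java container through an environment variable. The sketch below is hypothetical (the service name, image, and values are made up) and uses JAVA_TOOL_OPTIONS, which standard JVMs read at startup:

```yaml
# Hypothetical example (name, image, and values are placeholders): passing JVM
# tuning flags to a Java container via an environment variable, so no image
# changes are required.
apiVersion: v1
kind: Pod
metadata:
  name: orders-service
spec:
  containers:
    - name: orders-service
      image: registry.example.com/orders-service:2.3.1
      env:
        - name: JAVA_TOOL_OPTIONS
          value: "-Xms256m -Xmx768m -XX:+UseG1GC"   # heap bounds and GC choice
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"    # leave headroom above -Xmx for metaspace, threads, etc.
        limits:
          cpu: "1"
          memory: "1Gi"    # exceeding this triggers an OOM kill
```

Note how the settings interact: keeping the maximum heap comfortably below the container’s memory limit leaves room for off-heap usage and reduces the risk of OOM kills. This is exactly the kind of interdependent tuning that is hard to get right by hand.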
In the early days of deploying an app or service, there is typically some ‘wiggle room’ for configuration settings. Developers, who for obvious reasons don’t want their apps to fail, tend to over-provision them. But as the Kubernetes deployment scales, and new applications with different behaviors and resource requirements are added to the picture, performance and costs can go way out of bounds.
That’s when developers really need to tune their apps’ configuration settings. It’s also when they realize that they don’t have the tools to do so effectively – with precision and speed. Without visibility into the effects of configuration setting changes, developers have resorted to risky guesswork.
Until fairly recently, developers didn’t have a better alternative. But that has changed with the arrival of new solutions that leverage AI and machine learning to provide more intelligent, effective, and automated ways to optimize applications running in Kubernetes environments. The StormForge Platform is one such solution.
StormForge Optimize Live analyzes, optimizes, and refines cloud-native application configurations. This intelligent automation helps dev teams ensure that their apps and services consistently meet their goals for stability, performance, and cost. The StormForge machine learning engine continually observes apps and services and gauges in real time how they respond to configuration changes. It learns more with each trial, zeroing in on the optimal settings for achieving performance and cost goals. It also shows developers the trade-offs associated with setting changes, recommends optimal configurations, and makes it easy to download and apply the recommended settings.
The Day Two phase of Kubernetes deployments shouldn’t involve developers trying their luck at guessing the best configuration settings, or repeatedly cleaning up the operational mess that results from a bad guess. Instead, it should be a time when their cloud-native apps and services run largely on auto-pilot, automatically adjusted to account for changes in the environment, in business goals, or anything else.
In other words, it should be a time when an organization is realizing the full benefits of Kubernetes, and developers can work on more pressing and interesting projects than chasing configuration setting changes all day long.
You can try Optimize Live for yourself with a free trial, or play around in our sandbox environment.