How to Rightsize Infrastructure When Migrating from Virtual Machines to Kubernetes
Pro Tip: “Don’t take your legacy world to be your future world.”
70% cost reduction achieved from workload optimization
99.99% availability
66% accelerated EKS migration (from ~18 to 6 months)
Acquia needed to solve two primary problems, and they knew they needed automation to achieve their goals.
As a website hosting service, they needed to efficiently allocate compute capacity for the tens of thousands of customer applications their platform has running at any given time. Each customer environment is essentially a snowflake with different code bases, profiles, traffic, etc. The resources needed to serve traffic from one customer aren’t like any other.
“We can’t be in the business of manually figuring out how to allocate resources to different customers across the board. The best we could do is an approximation with some pretty broad bands, but that opens the door to some stability problems if you’re sizing people too small,” said Wil Reed, principal software architect, Acquia.
At the same time, they needed to manage their internal platform services as they migrated from legacy virtual machines to Kubernetes and Amazon EKS. They were changing their architecture wholesale, and they did not know the performance profile of their customers’ code running on a containerized platform. Ultimately, they needed to know how to size the environment without guessing in order to deliver the availability, performance and scalability their customers needed.
"We needed something that could keep up with the pace of scale that we have."
Ed Brennan
Chief Software Architect, Acquia
Properly sizing CPU and memory requests for pods based on customers’ actual usage, and automating that process so it runs daily.
Optimizing their internal environment to ensure application availability that can handle their customers’ highly variable workloads.
Automating Amazon EKS capacity planning by intelligently rightsizing workloads for the stability needed to complete their EKS migration.
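To make the first goal concrete, here is a minimal sketch of sizing requests from observed usage. It is not how Optimize Live computes its recommendations (the source does not describe that); it assumes a simple percentile-plus-headroom rule, and the function names, the 95th percentile, and the 15% headroom are all hypothetical choices for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Recommendation:
    cpu_millicores: int
    memory_mib: int


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def recommend_requests(cpu_samples_m, mem_samples_mib, pct=95, headroom_pct=15):
    """Suggest pod requests from usage samples (hypothetical rule).

    Takes the pct-th percentile of observed usage and adds headroom_pct
    percent on top, rounding up to whole millicores / MiB.
    """
    cpu = percentile(cpu_samples_m, pct) * (100 + headroom_pct) / 100
    mem = percentile(mem_samples_mib, pct) * (100 + headroom_pct) / 100
    return Recommendation(math.ceil(cpu), math.ceil(mem))


# Example: one customer environment with a single traffic spike.
cpu_usage = [100, 120, 110, 400, 130, 125, 115, 105, 135, 140]
mem_usage = [200, 210, 220, 230, 512]
rec = recommend_requests(cpu_usage, mem_usage)
```

Running a rule like this per workload on a schedule, rather than picking a few broad size bands by hand, is the kind of automation the goals above describe.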
See what Acquia’s chief software architect Ed Brennan and principal software architect Wil Reed had to say about their experience with StormForge and Optimize Live’s continuous rightsizing capabilities.
“It’s not just about cost optimization. It’s about rightsizing the environment, so you get what you expect from the application services you’re deploying and using,” Ed said.
“We needed a platform like Kubernetes that enables customization and … we needed Optimize Live to configure that platform,” Wil said.
“Within the first six months of deploying Optimize Live — if not sooner — we achieved the ROI that we wanted to achieve, and we have continued to optimize that over time,” Ed said.
At first, they applied recommendations just once a week. They proceeded cautiously to ensure each change would have the desired effect, whether that was better performance for customers’ applications or cost savings for Acquia.
To increase the frequency of their automation schedule, they used the auto-deploy thresholds feature. With change thresholds configured, Optimize Live’s automation makes high-impact improvements to workloads that need them while avoiding churn for workloads that don’t.
Using the auto-deploy thresholds, they moved to automatically apply recommendations once a day, and they plan to continue increasing the frequency.
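The idea behind a change threshold can be sketched in a few lines. This is an illustration only, not Optimize Live's implementation or configuration syntax: it assumes a single relative-change cutoff, and `min_change_pct`, `should_apply`, and `plan_rollout` are hypothetical names.

```python
def should_apply(current, recommended, min_change_pct=10.0):
    """Return True when a recommendation differs enough to be worth deploying."""
    if current <= 0:
        # No usable baseline to compare against, so accept the recommendation.
        return True
    change_pct = abs(recommended - current) * 100 / current
    return change_pct >= min_change_pct


def plan_rollout(workloads, min_change_pct=10.0):
    """Pick workloads whose recommended request clears the threshold.

    workloads maps a workload name to (current_request, recommended_request).
    """
    return sorted(
        name
        for name, (current, recommended) in workloads.items()
        if should_apply(current, recommended, min_change_pct)
    )


# Example: CPU requests in millicores for three hypothetical workloads.
workloads = {
    "web": (500, 300),    # 40% smaller: worth applying
    "cron": (200, 205),   # 2.5% change: skipped to avoid churn
    "api": (1000, 1200),  # 20% larger: worth applying
}
to_apply = plan_rollout(workloads)
```

With a filter like this in place, running the automation more often does not mean restarting more pods, because small adjustments are skipped.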
“There’s two reasons why we trust it,” Wil said. “One, we enabled recommendations a while before we said, ‘go and apply them.’ We picked a handful of our non-production environments, and we watched it, and then we rolled it out slowly. After a period of time, we just have the data to back it up, that these things do tend to be accurate.
“The other reason is that for our workloads in particular (but for lots of workloads, if people are being honest with themselves), it’s a little silly to imagine that a human can do better.
“The goal is essentially [to automatically apply recommendations] as frequently as possible, without disruption. The more often you do it, the more you can take advantage of peaks and troughs in traffic patterns. And our workloads are highly variable based on the time of day and the day of the week,” Wil said.
“The Kubernetes ecosystem is about dynamic responsiveness to changes throughout the day. If you think of Optimize Live as a static configuration of a behavior pattern, then the more the merrier,” Wil said.
“StormForge understands the Kubernetes ecosystem and design principles, and they’re committed to integrating with those design principles. They don’t try to get you to treat Kubernetes like some other static infrastructure or build some imperative process around what should be a declarative system. That is a fairly stark comparison with some other vendors.”
“StormForge allows people to achieve the promise of Kubernetes, which is workloads being dynamically sized based on just what they need in a shared pool of infrastructure that can be reused and reclaimed and dynamic. You can’t do that to the fullest extent without this tool.”
“One of the reasons why it’s interesting to look at different clusters is because different clusters are in different regions, and different regions have different prices at AWS land. So that’s very helpful to be able to look at that dimension,” Wil said.
“Because StormForge uses the same declarative mechanisms as anything else, it integrates super well with Argo CD. There’s no challenge with using Argo CD because StormForge provides a CRD or you work through annotations, and both of those are declarative. So Argo CD works pretty flawlessly,” Wil said.
Acquia recognized that they needed the flexibility of cloud-native architecture to effectively deliver their platform’s services to customers. They began migrating to Amazon EKS, but they found their environment wasn’t stable enough due to workload churn.
Their control planes were “kind of just falling down all the time because of the amount of churn we had in those environments, and part of that churn was that we were scaling on the wrong metrics,” Ed said.
Without those insights into the applications running on EKS and their performance, they decided to pause their EKS migration to go through a stabilization exercise. “This is where StormForge came in to give us those insights, and to make those adjustments for us,” Ed said.
“EKS and Kubernetes is a very complicated beast. EKS is only part of the equation,” Ed said. “The other part of the equation is the nodes and the workloads that are running on them, and having the insights to make the right adjustments in any given time.”
“You have to know what levers you can pull, and then pull the right levers at the right time,” Ed said. “Optimize Live gave us that capability to create that application profile and adjust it over time as the use of that application changes.”
With these insights from Optimize Live, Acquia swiftly completed their EKS migration in about 6 months. “It could have taken us a year and a half, if we didn’t have the right tools system around us to make that happen.”
While Acquia was going to adopt EKS for a variety of reasons, the ability to automate EKS capacity planning using Optimize Live made it that much more attractive.
According to Wil, “Optimize Live increases the benefits of moving to EKS.”