
Eliminating the Hidden Costs of Kubernetes with ML-powered Optimization

Rich: Alright, I think we will go ahead and get started. Thank you everyone for joining today’s webinar, Eliminating the Hidden Costs of Kubernetes Using Machine Learning Powered Rapid Experimentation.

So really happy to have you with us today. A couple of housekeeping notes before we get started. First of all, if you have questions anytime during the webinar, please enter them into the Q&A part of the Zoom interface, and we will make sure we block out a little bit of time at the end of the webinar to get to your questions. Second of all, we're doing a giveaway today. We're giving away a couple of Amazon gift cards, and the way you get entered to win those is, first, just by attending you get an entry, and second, for each question you ask we will give you another entry to win the first Amazon gift card. We want to make sure that we give all of you the opportunity to ask questions and encourage that, so we will give away one Amazon gift card based on your attendance plus the questions you ask, and then we will be sending out a follow-up survey after the webinar. If you take the time to complete that survey, we will enter you into a drawing for the second gift card as well. So hopefully everybody takes advantage of that.

Alright, so let's dive into the topic today. My name is Rich Bentley. I am the Senior Director of Product Marketing at StormForge. I've been in the software industry for quite a while with a few different experiences out there, and I'm based here in Michigan. We've also got on the webinar today Brad Ascar, who is our Senior Solutions Architect and a Certified Kubernetes Administrator, and who has been an architect and cloud practitioner in the industry for a long time, with some of the experience that you see there. He's based out of Atlanta, Georgia.

So I'm going to kick it off and spend a little bit of time talking about what we see as some of the challenges in the industry, and then Brad will lead us into how we can use machine learning and rapid experimentation to help address some of those issues. So just to start out, we all know that cloud native architectures and technology hold a lot of promise and can potentially provide a lot of benefits to our organization. They can help us get to a point where we can introduce new capabilities and new features to the market faster, become more agile in our development efforts, provide a better user experience, scale more easily as we expand our usage of applications, really make those applications more responsive and easier to scale to the needs of the audience, improve cost efficiency, and provide that always-on availability.

So there are a lot of great benefits to cloud native, but as we probably also know, there are a lot of challenges that come in when we try to get there. Cloud-native technologies are complicated. There are a lot of moving parts. They're very dynamic, ephemeral types of workloads that are running, and so it makes it hard to get to those benefits. We know that performance can be an issue because of the complexity. We want to make sure that our applications are responsive, that they're providing the experience our users need in order to have that positive experience and to do what they need to do in a fast time frame. Then there's reliability. We have to have reliability; any downtime can be extremely costly to an organization. And then scalability, right? It's easy to get started with cloud native, but as we start to scale up to larger and larger user bases, that can cause problems in terms of capacity and also performance and availability as well. And then last but not least is cost efficiency. One of the reasons we're moving to cloud native is that we want to pay for what we use and not be paying for excess capacity, but it's very difficult to do that effectively, as we'll see as we get into the content for today.

So one of the reasons for the challenges in achieving the benefits of cloud native technology is the complexity of Kubernetes and of deploying applications on Kubernetes. If we think about it, when we deploy a set of containers, and a set of services running within those containers, there are a lot of different things that we can tune in those applications. There are Kubernetes-specific settings like CPU and memory requests and limits, replicas, things like that, but then there are also application-specific settings depending on what type of application we're running in those containers. There are a number of different things that we can set and tweak there as well, and if you multiply that by the number of different containers you're running and all that complexity, there's really an almost infinite number of different ways that you can deploy your application. Each of those decisions that you make on how to deploy your application has an impact on the cost of running that application, the way that it performs, and how reliable the application is as well. So it's a very difficult and complicated problem, especially when you're trying to do it as a human, trying to understand everything that's going on in your application and make those decisions in a way that's going to provide the best outcomes.
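To make that concrete, here is a minimal sketch of where those knobs live for a single container, assuming a hypothetical Java-based service (the names, image, and values are illustrative, not taken from the webinar). Every field shown is one of the decisions being described, and a real application multiplies this across many containers and many application-level settings.

```yaml
# Illustrative only: a hypothetical Deployment showing the kinds of knobs
# being described. Names and values are made up for the example.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service            # hypothetical service name
spec:
  replicas: 3                       # Kubernetes-level knob: how many pods run
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:1.4.2   # placeholder image
          resources:
            requests:
              cpu: "500m"           # what the container asks the scheduler for
              memory: "1Gi"
            limits:
              cpu: "1"              # the ceiling it is allowed to consume
              memory: "2Gi"
          env:
            - name: JAVA_OPTS       # application-level knob: JVM heap size
              value: "-Xms512m -Xmx1g"
```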

A few things that expand on that. Cloud waste is a really significant problem in the industry. There are a couple of studies out there that have shown the amount of waste going on in the cloud just in terms of idle and over-provisioned resources: $17.5 billion is completely wasted every year on cloud resources that are never used and aren't needed, but you do it because you don't know what you don't know, right? That has an impact on the environment as well. We know that data centers account for the same level of greenhouse gas emissions as the airline industry, and obviously that's growing over time as cloud usage and cloud applications expand.

It's also a significant problem within organizations. Datadog did a report a little while back that I thought was really interesting. Datadog is monitoring a lot of different sites in production. They're a SaaS vendor, so they have a lot of data that they can aggregate and look at, and one of the interesting things they found is that most containers running in production are only using a small percentage of the requested CPU and memory. You can see here that 49% of containers are using less than 30% of the CPU that they request, and very similarly on the memory side, 45% of containers are using less than 30% of the memory that they request. What that means is that you're basically paying for a lot of CPU and a lot of memory that isn't actually being used.

Also on the user experience side of things, if the complexity of your application is making it hard to provide a good experience for your users, that has a lot of impacts. It really depends on what type of business you're in. It could be a direct impact on revenues, it could be an impact on converting customers and following up with them, or users abandoning your site, which can have an impact on your brand and on customer satisfaction. But there are also impacts internally as well, right? The amount of time you spend in war rooms trying to solve performance issues can take a lot of people, a lot of time, and a lot of effort, and really just slow you down in terms of your transformation and what you're trying to accomplish as an organization.

Then there's also an impact on the productivity of your developers and your engineers. An interesting study here from D2IQ found that 38% of developers and architects say that the work they do makes them feel burnt out. And really, almost shockingly, 51% of developers and architects say that building cloud-native applications makes them want to find a new job. So it's a really big impact on the morale and retention rates of your developers, which is key for an organization that wants to be able to move fast and develop new capabilities. And then the last thing I wanted to share here is one of my favorite quotes. This is a Tweet from a few years back, but I really just love this. Jason Jackson said: "Overheard: we replaced our monolith with microservices so that every outage could be more like a murder mystery." The point here is that if you're not addressing these issues up front, before you're deploying applications, it can be a real challenge trying to address them reactively. Because of the complexity, because of the dynamic nature of these environments, it can be very difficult to track down a problem that occurs, because by the time you figure it out, the container that caused the problem may not be running anymore, right? So it's very challenging to actually solve these problems from a reactive perspective.

Now, there are a lot of different approaches you may be using today to help address the problems that we talked about. Each of these has its place, but there are also challenges with each of these approaches as well. So trial and error is sort of the default way of getting started, right? When you deploy your application, you have to figure out all these settings. Well, the only thing you can really do is try something, see what works, see how it performs, see if there are problems, then go back and tweak it. But referring back to the slide I covered earlier, there are so many different settings, so many different things that you have to look at, that it becomes a really overwhelming problem and not something you as a human being can effectively do.

Second, performance and load testing. So performance testing tools are really important. We have a performance testing tool of our own that Brad will talk a little bit about, but on their own they can only take you so far, right? A performance testing tool can show you how your application will perform under load, but it can’t tell you what to do about it if there’s a problem, right? So they give you kind of a first step, but they really don’t take you kind of all the way to addressing those problems.

Next, we've got things built into Kubernetes like the Horizontal Pod Autoscaler. These can help you dynamically scale the number of pods, but the thing about these technologies is that they also require configuration to optimize, and they can only take you so far in terms of addressing application-specific settings as well.

And then finally you’ve got your monitoring and observability tools. Everybody has multiple tools in place out there and they’re great for identifying problems in production, but going back to the last slide, they’re very reactive in nature, right? By the time you find a problem with your monitoring tool, the problem has already happened and it’s already impacted your user base as well. So each of these things on their own can be useful, but on their own they can only take you so far.

And so our view is that what you really need is what we call Application Optimization, which is what we're going to talk about for the rest of the webinar today. Application Optimization has to use machine learning to help you make those decisions on tuning your applications, it's got to be automated to make it work effectively in a complex environment, and it needs to use what we call Rapid Experimentation to go through and help make those tuning decisions so your applications run optimally.

So at this point I am going to turn it over to Brad to talk more about what we mean by optimization and how you can address it with StormForge.

Brad: Thanks, Rich. So I think, first off, on the machine learning-powered part: machine learning is at the very heart of our company. It's really what we're completely built around. It's not just a marketing add-on. It's not a simple script, right? It's actual machine learning, with scientists that do this for a living, day in and day out, making our product better. So that's the first big component: if you really want AI and machine learning, you have to have somebody that actually knows what they're doing there. It's also about building it, automated, into your environment so that you can integrate it into the way that you work. And ultimately the way we do that is through Rapid Experimentation, and we'll go into what that looks like. I have a few more slides and then I'll jump into the demo.

So what is optimization? It's choosing the set of inputs that produces the best results, and those inputs are all the things that make up your application. Some people get stuck thinking about all the Kubernetes-based things, which are very important, but ultimately Kubernetes should move to the background. It's what delivers your application. So Kubernetes has things like CPU limits and requests, memory requests, and numbers of replicas, but you also have things for your own applications, application-specific parameters like JVM heap size, caching behaviors, all sorts of things that are actually at the application layer. Those are the knobs and levers that change the way the application behaves. And there are things like the Horizontal Pod Autoscaler in Kubernetes, which is great, but out of the box it just does things based on memory and CPU utilization. If you don't tune it, that's all it knows about. Well, maybe your application doesn't really move much in its memory and CPU utilization, and it's other things, like backups and queues and other kinds of things, that are really important to your application. That's really what you should be scaling on. Those are the leading indicators that maybe the system is getting busy and you need to scale, and you need to be able to expose those. You need to be able to use those signals in the HPA, but even more importantly, you've got to know what those settings are so that it behaves in the way that you want it to. Ultimately we want to maximize the performance and stability of your application and minimize the cost and resource utilization of your application.
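To illustrate the point about scaling signals, the sketch below is a hedged example of a HorizontalPodAutoscaler that scales on CPU utilization and on an external queue-depth metric. The queue_depth metric name is hypothetical, and an external metric like this only works if a metrics adapter (for example, prometheus-adapter) is installed and serving it; out of the box, only the resource metrics are available.

```yaml
# Hedged example: HPA scaling on CPU plus a hypothetical external metric.
# The external metric requires a metrics adapter (e.g. prometheus-adapter);
# metric and object names are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource                 # what the HPA uses out of the box
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: External                 # a leading indicator such as queue depth
      external:
        metric:
          name: queue_depth          # hypothetical metric name
        target:
          type: AverageValue
          averageValue: "100"
```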

It's always a trade-off, right? You're trying to do this to make your application more efficient and more optimized, and there's some time and effort involved in that, especially when you're doing it in reactive kinds of ways, or in ways where you're not using great tools. That's a lot of time and effort, right? It's not unusual for us to talk to customers and partners who literally spend days and weeks trying to tune their application, making it do what they want it to do. It's also about application performance: what do you change so that the result of that time and effort is a better performing application, and also about the cost of resource utilization. And they're all interconnected, right? You can make an application a lot faster by throwing a ton of resources at it, but at the end of the day that's an ineffective way to do it, because at some point you cross a threshold where you're just limited by the architecture of the application itself to be able to deliver. It's also super costly, and performance is one of those things that's elusive, particularly if you're trying to do it manually and there are a lot of parameters to change or a lot of things to consider. Humans aren't really good at doing that, and it's super boring work, which gets you into that DevOps kind of burnout.

But what if you could optimize it automatically? What if we took that time and effort part of it and really shrunk it down, so that it helps you understand what the performance is and how to get better cost and resource utilization without having to involve that many people to do it? And then every single release of your software could need a new optimization. It really isn't just something that you set once and then forget. Your application changes over time. It changes because you're adding business features, hopefully. We're going to show you how you free up resources to build new business features instead of taking care of plumbing things like trying to optimize the application manually the old way; it's much easier to do it in the way we're going to describe.

So there are three major portions of what we do with the StormForge Platform. We'll start with the left-hand side, which is where we started out: application optimization, a true machine learning-powered rapid experimentation engine. This is really the heart of the solution for optimization. Machine learning is very efficient at finding things in multi-dimensional space when you have a lot of parameters that you're changing. What do those parameters do to your application? In microservices applications, you've got a bunch of different microservices, and each one of those tunings for each one of those microservices affects the whole application. How do you put those together and understand the cause and effect of those kinds of changes? One important part of what we talk about, because a lot of people have brushed up against machine learning before, is that we don't require a lot of upfront data. Some solutions out there will literally want two or three months of your performance data for your application so they can figure out how your application behaves. We're not that kind of solution. Our technology allows us to do it from the point you start experimenting, gets you answers in minutes and hours, and automatically implements your optimal configuration based on your goals. You tell us what it is you want your application to do. Maybe it's high throughput, maybe it's reduced latency, maybe it's those things while also keeping an eye on costs, because you do have a cost threshold you need to stay within. It also allows you to identify high-risk configurations. A configuration you choose might seem like a great configuration to save costs, until of course it falls over and breaks on you.

The next part of our platform is performance testing. A key component of what we do on the optimization side is measuring how your application behaves under stress, under the load that it's going to be expected to work under. Performance testing is really important, and we've got a portion of our platform that is performance testing. It allows you to very quickly create performance tests. We do this in a hosted manner, so we host it as a service in AWS data centers around the world. It allows you to generate load tests within minutes, push the button, and literally within seconds your performance test is being sent for you. No more waiting on the QA team for that part of the infrastructure because they're overloaded. They really need help like this as well, but ultimately it allows you to work in the way that DevOps teams work, which is: I have something I need to test, I create the test for it, and I immediately have the ability to test, ultimately doing things like integrating into your CI/CD pipeline so that it's part of the workflow. In fact, the optimization and the performance testing are both things we want you to build into your pipeline. That's really a shift left, moving it back in the process so that long before you're pushing it out to your customer, you know what the performance limits are and how to optimize your application. And we execute realistic tests to give you that, using an open workload model. If you're not in the performance testing world: there are two different models, the closed model and the open model. A lot of tools that are out there, a lot of the things you may already be using, use a closed workload model. When the system that's being tested gets under pressure, it actually slows down the system that's testing it. The problem with that is that you don't get a real answer as to how it works in the real world. Your customers are coming and hitting your site. As a lot more customers come, they hit your site harder. Just because your system slows down doesn't mean the customers stop coming, unless of course you do that constantly and they decide to move away from you. Ultimately, you really want to know what the real-world scenarios look like.

An open workload test, which is what we do, sends the kind of load that you expect. So you model a million requests per second or 100,000 requests per second, whatever it is. That kind of system watches itself to make sure it's continuing to send that load even as the system that it's testing is slowing down. That's super important, because that's the way the real world works. The real world doesn't just say, okay, let's just wait for the site to become more useful to me and wait around 30 seconds or a minute for it.
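This isn't StormForge's own test definition (that DSL is JavaScript-based, as shown in the demo below). Purely as an illustration of the open-model idea, here is how an arrival-rate-driven profile looks in Artillery, a common open-source load tool: new virtual users arrive on a fixed schedule whether or not earlier requests have completed, so a slowing target does not throttle the load generator. The target URL and endpoints are placeholders.

```yaml
# Illustration of an open workload model (not StormForge's own DSL).
# The arrival rate is fixed by the test, not by how fast the target responds.
config:
  target: "https://shop.example.com"   # placeholder target
  phases:
    - duration: 300        # ramp from 10 to 1000 new users/second over 5 minutes
      arrivalRate: 10
      rampTo: 1000
    - duration: 120        # then hold 1000 new users/second for 2 minutes
      arrivalRate: 1000
scenarios:
  - flow:
      - get:
          url: "/search?q=tickets"
      - post:
          url: "/cart"
          json:
            itemId: "12345"
```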

The last part is record and replay, which answers: how do I write performance tests, and how do I know they're reliable? What if there were an open source project that allowed you to record the traffic in your application and then replay it, or use it as a basis for your performance testing efforts? That's super important, because then you have a reliable test, because it actually is a recording of real traffic. And maybe you then also use that to amplify the traffic, so you can see what higher peak times look like based on real-world usage.

Let me share my screen, all right. Are you seeing my screen?

Rich: Oh yeah.

Brad: Okay, so here I'm in the performance testing product. This is our hosted platform for performance testing. It's pretty simple. We use a JavaScript-based DSL. We use it because, number one, people that write applications, pretty much everybody, know JavaScript. JavaScript is heavily used in the web world, as opposed to some older testing frameworks where there are a lot of XML files and stuff like that. It's easy to define something. You say: where am I sending this load, what is it that I'm doing, how am I simulating the way that traffic comes in? In the real world it doesn't just all turn on immediately and then turn off immediately. Users trickle in; maybe at the top of the hour, if you sell tickets for concerts, a large amount of traffic comes in and then over a little time it fades off. You need to be able to model what that traffic looks like, and you have complete control of what that traffic looks like. You set it up and say what kind of infrastructure you need for sending that load. Your plan determines how much infrastructure you have to be able to send that kind of traffic, and then ultimately your developers are able to easily, in a real simple declarative JavaScript way, tell it what it's doing and how it's supposed to treat each session: so it does a login, it does in this case a shopping cart metaphor, it searches for something, maybe puts something in a shopping cart. At the end of the day, you're doing this so that you can actually determine how the application behaves.

So here's an example of the test runs. This one's called troublemaker, because it will cause some trouble. Here I'm going to show you the results of this particular run. In this particular run, you'll see that the Apdex, or application performance index, is not great. It's actually pretty low, because there were a whole bunch of errors. We sent 315,782 requests. It started out at a 25 millisecond response time and ended up at a 30 second response time, which is not great. That's actually where we cut it off. As we indicated, we created an increasing amount of traffic, up to the very large amount you can see, and then tailed it down. All of the things that you'd expect from a performance testing framework are here, all the bars and graphs to show you everything that was going on. This one gets to be the interesting one: this is what's happening to your latency. How is your application responding as we cranked up the number of users? As soon as we hit this point, it started having a problem. As soon as we got here, we really started having a lot of errors. That tells you the application can't handle the kind of traffic I'm designing for it. We'll tell you the error codes, we'll show you over time what kind of error codes you're getting, we'll show you where that huge bump was in the problems that your site was having, and other events that are going on, how many connections, and that kind of stuff. The kind of details that you want, of course, out of a performance testing system. The challenge is, when I'm looking at this, I know I have a problem, and on the inside you're watching your application and looking at something that looks like this. This is not exactly this application, but if you've ever looked at your monitoring dashboards, there's some point where somebody says, aha, there's the problem. You see the problem. Great, now what? Maybe this has actually happened to you. Hopefully not in production.
Hopefully you're doing this in another environment, though most people are testing in production. But at this point, what is it I change? I know I have a problem. Is it the number of replicas? How much memory? How much CPU? Is my autoscaler not scaling in the right way? The challenge is that unless you know, when I see this and I see this, what it is I change to make my application do what I want it to do, then I'm just admiring graphs. I'm not actually able to take action on these graphs. Or it may be that there are two or three people total in your organization that know how to do this, and if they move on, you don't know how to troubleshoot this, right? That brings you to what we're doing on the machine learning and optimization side.

So I'm giving you a sneak peek. I got the okay to show you the next generation of our UI, which I'm told is coming very, very soon. So if I hit a UI glitch, it's because this is a test and development system that I'm showing you from. Here I will show you some experiments. I'm going to show you the cake after it's baked, and that's really what you're concerned about: understanding how we do what we do on the optimization side. So I'm going to show you a microservices application. In this case it's the Docker voting app, for those of you that know it. We've run experiments, in fact we ran 159 trials here, to determine what the best configuration for this application is. We measured it, and we were looking to optimize the cost while also making sure throughput stayed within acceptable levels. All of these dots are trials; the boxes and diamonds are the optimal ones, which is really what you care about. You care about the configurations that were best. So cost is one factor, and you can see that as I allow a little bit more cost I can handle more throughput, but there are some places along this curve, it's called the Pareto front by the way, that are really interesting. Here's the original configuration of this application. If I click on it, it was the most expensive way to run this, and not the most performant way to run it. As I click over here and scroll down, these were the settings for the database, CPU, numbers of replicas, and so on for this application. These are the things that are in YAML files, or in config maps, or wherever in your Kubernetes application. This is how this application performs. Now, say I want to reduce my cost massively, and in this case I'm going to reduce performance just a little bit. This darker one is actually the middle of that Pareto front. When I click on it, if you look below, those were all the changes that were necessary to make it that much more efficient on cost. Now, it may be that I want something that's just as performant, so I'm going to go to this one. And it's just as performant, but it was actually 57% less expensive to operate, for a performance level that was actually higher than the other one. That's the difference. Those are the things you need to change, the knobs and levers, in the application to get the solution that you want. This is really important, because this is how this application is optimized right now, until you make changes to the application, at which point the performance may change drastically. One thing that we like to talk about is doing CI/CO/CD in your pipeline, with continuous optimization in the middle. Which is: I'm about ready to push out a new deployment of my application, a new version. Each one of these is a trial of the same experiment, so why don't I run a single trial again using these settings and make sure that the application still performs the way that it did before? It could be that it passes. It runs within one or two percent, and you say, okay, I've not done anything with the application that harms its performance or its cost, I'm ready to push this thing out, and it rolls out through the rest of your CI/CD pipeline. But what if it gets to this point and it actually falls over? All of a sudden it's 25% more expensive to run, or, even worse, 25% less performant? Now I have a bad deployment and risk sending this to my users. Hopefully you've got some deployment processes that can handle some of this, maybe canary deploys or blue-green.
But even with canary or blue-green, you're giving a portion of your customers, whatever portion that is, a bad experience when they run into this problem. And it may be that the application will totally fall over, and you may not see it during the time that you're doing a canary or blue-green. Everything may look okay until peak time, and then it falls over and you have to roll it back. Then maybe 100% of your users got a bad experience before you rolled back. Wouldn't it be better to be proactive, and to know, before you push the button to deploy that thing (or it deploys automatically), that it's actually still performing, it actually still works well, and it's not going to cost you a ton to continue to run that application? That's the kind of thing that we want our customers to be able to do, and those are the answers that we give you. So in this case it's quite obvious: even though we could save even more, at the exact same performance we saved you 57%. It's not unusual for us to find with our customers and partners that we literally save them that kind of money, because they're able to understand the behavior of the application, and in a lot of cases they didn't even know that certain levers made that big of an impact on the way the application behaves. In fact, there may have been knobs and levers that they never exposed in the application because they never had a tool like this; they had their hands full trying to figure it out the way they were already doing it. Now they have a new tool that gives them even more capability to do even more testing to understand how their application behaves.

This is a simple microservices application, but there are other kinds of applications that people run in their organizations. A lot of people run Apache Spark; they do business intelligence kinds of things, crunch through some data. In this case we worked with a partner, and this partner had us looking just to optimize for usage. They didn't care so much about the duration, because this is the kind of thing that's batchy; it runs overnight. So they were trying to just drop the total amount of usage from the original configuration. The original configuration of the Spark application, against the particular data set they were processing, had this set of settings as the base configuration. As I click on this and show you the optimal configuration, look how little change was actually necessary, but it's a little change on a lot of parameters, and that made it that much more efficient. That's a huge, huge savings for that little change. Now, when we showed this to the person we were working with, he said, this is not surprising to me, which is actually unusual; most people say, I'm totally surprised how much. What he said to us was: it's not surprising to me because I'm an expert on Spark. I've actually spent two years tuning Spark at very large scale. This is not surprising to me. What's surprising to me is that you've got an expert in a box, because you gave me this answer in five hours, and it took me two years of hard knocks to understand that kind of tuning, right? And every data set, because every company's got unique data, every data set can cause this to behave very differently. You need tuning for your specific instance, and that way you can drive the value that you want out of your applications.
These are some of the things that our customers are able to do, and they're also able to watch it over time, so that as they release new versions of what they're doing, maybe there's a new version of Spark, maybe there's a dot version of it, they can rerun this and make sure it's still optimal. And if it's not optimal, they'll find out what the new settings are to make it optimal for what they're doing. So over time, as your applications change, as your data changes, and as you add more and more applications, we want you doing this. We want you to understand the performance implications of your applications, the impact on cost, and the impact on operations, because ultimately, at the end of the day, this makes for better behaved applications, which means that in your Kubernetes environment, which is generally multi-tenant, better neighbors make for a more stable system. That's ultimately what we're driving you towards: being able to use the power of machine learning to do all this. And the key is that we're doing this all in the background. You just fire it off, and the machine learning does this by deploying your application in your environment, measuring how it behaves, and determining what the best configuration is, including avoiding bad configurations, which bring their own operational risk, reputational risk, and cost risk.
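As a rough illustration of the CI/CO/CD gate described above, here is a hypothetical GitHub Actions job. The trial-running step is a placeholder script rather than a real StormForge command, and the file names and 5% threshold are assumptions; the point is simply where a single-trial regression check could sit in a pipeline.

```yaml
# Hypothetical sketch of a "continuous optimization" gate in CI.
# The trial step is a placeholder, not an actual StormForge command;
# file names and thresholds are assumptions for illustration.
name: continuous-optimization-gate
on:
  pull_request:
    branches: [main]

jobs:
  optimization-trial:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Placeholder: deploy the candidate build to a test cluster, run one
      # trial of the existing experiment against it, and write the measured
      # cost and latency to trial-result.json.
      - name: Run a single optimization trial
        run: ./scripts/run-optimization-trial.sh --out trial-result.json

      # Fail the pipeline if the candidate regresses more than 5% on cost
      # or p95 latency versus the baseline stored from the last release.
      - name: Compare against baseline
        run: |
          python3 - <<'EOF'
          import json
          baseline = json.load(open("baseline-result.json"))
          candidate = json.load(open("trial-result.json"))
          for metric in ("cost_per_hour", "p95_latency_ms"):
              allowed = baseline[metric] * 1.05
              if candidate[metric] > allowed:
                  raise SystemExit(f"{metric} regressed: {candidate[metric]} > {allowed:.2f}")
          print("Within 5% of baseline; safe to promote.")
          EOF
```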

So any kind of application, anything that runs in Kubernetes, we're able to work with and expose. Any way that you deploy: deployments, StatefulSets, Helm charts, whatever it is. Maybe you've got your own custom applications and a very unique way that you deploy them; that's great, we can work with all of it. We were born for Kubernetes. We work in Kubernetes. In fact, these experiments that we talk about, and the trials that we run within experiments, are also Kubernetes objects. So in the Kubernetes world, we're doing Kubernetes things in a Kubernetes way, because that's exactly where we live, and we think that's the best way to work within the Kubernetes environment. Alrighty, that's the end for me here on the demo. Let me jump back to the slides.

I'm going to show you a couple of examples of results that we've gotten with our customers. We worked with a very large travel website; I actually worked on this project. They were able to achieve a 50% cost efficiency improvement for their application. This is their primary application. They had a team of people literally focused on the performance of this application. It is their product; it's what they do. They had minimal dev and test environments due to private cloud resource constraints, and as you can imagine, in 2020, which is when we worked with them, travel companies were getting hit very, very hard. They really needed to find efficiencies. We would have been happy coming in and saving them 10% or 15%, and when we got into the POC it became quite obvious pretty quickly that, based off of their original configuration (the UI looks a little different here because we've changed the UI since), with no change in performance, so the performance was exactly the same, we could achieve a 50% cost efficiency increase for the application. We're talking about six-figure kinds of changes for this company. Seven figures, I'm sorry, seven figures.

Another company we've worked with is Trilio; in fact we've got another webinar with them if you'd like to go see that as well. They improved their customer experience, so maybe your product is also the product you're trying to improve, right, and it has a direct impact on your customer. They are a SaaS vendor doing cloud-native backup for Kubernetes, and each customer's environment is of course unique, because every customer is a different customer (they're snowflakes), and they wanted to reduce the failures and durations of their backups. In the end, we were able to do some important things with them, including finding the things that drove the cost factor, and how fast we could do the backup while still maintaining that lowest cost factor, and more importantly, we really helped them eliminate high-risk configurations, things that would cause the backup to break. If you're a backup vendor, the last thing you need, of course, is to have your backups break, because then when the customer tries to restore, it doesn't actually restore. So it was a huge, huge benefit to them, and they're looking to use StormForge as a health check for their disaster recovery customers. This is really important for them.

Then we’re on the next steps. So Rich, I think this is your part.

Rich: Yeah, for sure. Definitely some things you can do here to follow up. Sign up for free! Basically, you can use the product for free and try it out with your own application. If you want to schedule a demo, we can actually walk you through it, get a little bit more specific, and answer questions specific to your environment. Then you can see the ways you can follow us and contact us there as well. So we definitely look forward to talking with everyone on here more as a follow-up as we go forward.

So we've got a couple of questions here that we will take. We've got a few minutes left, but if you have other questions, please enter them and we will try to get to them. The first one, and I really like this question: Brad, you talked about a lot of big numbers in terms of cost savings, and the question here is, how do I best take this data back to my CEO, and how do I tie it specifically to total cost savings? So, taking what comes out of our application and making that real, and taking it back to your organization.

Brad: Yeah, so we definitely have some white papers and some case studies on the things that we've done with other customers, and we can… Definitely contact us. We'll definitely be happy to give you that kind of information, find out a little bit more about what you're doing, so we can basically tailor that to you and help you bring that back and show that value. We'd love to be able to do a demonstration for your CEO, your CTO, or whoever wants to see this as well, and ultimately if it's something that's really compelling to them, which I think it will be, with 30 to 50-plus percent cost savings and more efficient and reliable applications, we'd love to do a proof of value with you.

Rich: Great, the next question is around load testing. You showed a little bit of the StormForge load testing solution. The question is whether I can use other load testing tools with your optimization.

Brad: Perfect question, because I forgot to mention it as I was doing the demonstration. While we love our performance testing tool, we understand you've got your own performance testing tools and you've already got a lot of investment there. We can work with whatever you have. For Locust, we have direct out-of-the-box integration. We're looking to add additional capabilities there, and of course we do it with our own performance testing as well. And then we've got a way for you to integrate with anything else, so it's pretty simple to actually do that configuration, and it's a very, very easy-to-follow pattern to be able to use really any system that has a CLI or an API where you can basically trigger that load to come and tell it when to stop. So, great question.

Rich: Cool, looks like two more questions. I think we’ve got time to get them both in here before we’re done. The first one is, what is the advantage of using machine learning to predict and update your parameters beforehand versus just load balancing on demand like elastic scaling based on incoming volume like a lot of applications do these days?

Brad: Yeah, that's a great question. Number one, you have to understand the performance capabilities of your application. Not every application can scale the way that you would wish. Not every application that's being load balanced and scaled has every tier able to handle any amount of load. We talk to all sorts of companies: companies in traditional retail that are worried about Black Friday scenarios, different kinds of people that have different times of the year that are important to them, or some folks that have daily events based on user behavior. There are challenges sometimes, in that the architecture that you've chosen for your application limits its top-end capability. You won't know that, and just tweaking what you have in production won't be able to show you that ahead of time, and it may reach a point where it just can't scale anymore because of architectural challenges elsewhere in the ecosystem, with all the various microservices. By testing and doing experimentation, you can actually determine that there are serious problems there and that maybe you can't get above a certain performance level, and that might be really bad: if you're expecting 25% growth and you can only grow 5% more than last year's peak, you've got a serious problem. You want to know that ahead of time so that you can take the steps necessary to make those changes. Utilize great things like load balancers, but load balancers don't fix everything. Every portion of your architecture doesn't scale in the same way. In fact, with things like database tiers, the way scale-up databases work, when you scale those, a lot of times they actually slow down because now they're building out additional nodes or shards, and you could be telling it to scale at exactly the wrong time. When you're under pressure is exactly the wrong time to scale that kind of database technology. Great question.

Rich: Cool, alright, we've got a few more questions, but we're almost out of time here, so one last question we'll get to on the live event today. Can you talk a little bit about what it takes to set up an experiment? What's involved? How hard is it? What do you have to go through to actually create an experiment?

Brad: Yeah, so experiments are actually pretty easy. An experiment is a Kubernetes object, so it's a YAML file like other Kubernetes YAML files, so anybody that's doing application development and deployment in Kubernetes knows how to work with it. We've actually got a declarative way that you can, in that file, say what it is you want to do and what your targets are, and then it generates the experiment file, which you can then further modify. These are things that are done in minutes, right? You hook it up to the load test, and then the rest of it is actually handled within the system: it will deploy the application, apply the new configuration, and run the load against it. It does require a load test, and if you don't have a load test, of course we've got some great ways to get the data to inform how you build that load test.
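For a rough sense of the shape of such a file, here is a hedged, purely illustrative sketch; the API group, kind, and field names below are assumptions for the example, not the actual StormForge experiment schema. The idea is the one described above: declare the parameters the machine learning may vary, the metrics to optimize, and the load test that drives each trial.

```yaml
# Illustrative sketch only -- not the real StormForge experiment schema.
# Field names and values are assumptions showing the general shape.
apiVersion: example.dev/v1          # placeholder API group
kind: Experiment
metadata:
  name: voting-app-tuning
spec:
  parameters:                       # knobs the machine learning is allowed to vary
    - name: cpu                     # millicores
      min: 100
      max: 2000
    - name: memory                  # MiB
      min: 256
      max: 4096
    - name: replicas
      min: 1
      max: 10
  metrics:                          # goals measured for every trial
    - name: cost
      minimize: true
    - name: p95-latency
      minimize: true
  trialTemplate:
    loadTest: voting-app-scenario   # the performance test run against each trial
```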

Rich: Great, thanks Brad! Well, thank you everybody for attending. Thanks for all the great questions. Apologies if we did not get to your question, but we will follow up after the event. As a reminder, we will send out a survey as well and would really appreciate your input and your feedback on what you thought of the event. If there are other things you would like to see in future events, we always look forward to having your feedback on that. So thank you everyone. Thanks, Brad, and this concludes today's webinar.
