
How to Use Optimize Live’s OOM Response

With OOM Response, you gain a reactive response, in addition to our proactive recommendations. Learn how it works and how to configure it.

By Nick Walker | Nov 22, 2024

[Image: a chart showing the OOM Response feature in StormForge Optimize Live]

I’m thrilled to launch Optimize Live’s OOM Response feature, which automates away the toil of responding to out-of-memory (OOM) errors. Memory usage changes as traffic patterns shift and software is updated. Unpredictable memory usage spikes may eventually cause OOM kills that lead to service disruptions and performance issues. The OOM Response feature provides insurance for your memory settings.

How Does OOM Response Work?

With Optimize Live OOM Response, you gain a reactive response in addition to our proactive recommendations. This feature continuously monitors Kubernetes clusters for OOM events. When one is detected, Optimize Live produces a new recommendation that increases memory by a configurable percentage (20% by default) for the next 4 days. This timeframe allows our machine learning to analyze memory usage data and refine memory recommendations going forward.
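To make the default concrete, here is a small worked example of the arithmetic. It is only an illustrative sketch: the helper name `bumped_memory_mib` and the 512 MiB starting value are hypothetical, and Optimize Live may round or bound the result differently.

```python
def bumped_memory_mib(current_mib: float, percent: float = 20.0) -> float:
    """Return a memory value after a percentage increase (hypothetical helper)."""
    if current_mib < 0 or percent < 0:
        raise ValueError("memory and percent must be non-negative")
    return current_mib * (1 + percent / 100)


# A container set to 512 MiB that gets OOM killed would, at the default 20%,
# be recommended roughly 614 MiB.
print(round(bumped_memory_mib(512)))
```

A different percentage can be passed as the second argument, for example `bumped_memory_mib(512, 50)` for a 50% increase.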

Some of the key benefits include:

  • Continuous Monitoring: Real-time detection of OOM events triggers immediate action, removing manual toil.
  • Configurable Responses: Choose your desired OOM memory increase percentage and whether to immediately apply the OOM Response recommendations, or wait for your next automatically scheduled recommendation to be applied.  
  • Measurable Results: Review OOM events over time, and watch them go down as you automate the problem away.  

How to Enable OOM Response

  1. Sign up for a free account, or log in if you’re an existing user.  
  2. Install the Optimize Live Agent and Applier. 
    1. For existing users, be sure to upgrade your agent to at least version 2.16.1, which includes the OOM Response feature.
  3. Enable OOM Response by configuring cluster defaults. To do so, create a file called cluster-defaults.yaml with the following contents:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-defaults
  namespace: stormforge-system
data:
  cluster-defaults.yaml: | 
    live.stormforge.io/reliability.oom.memory-bump-up.percent: >
      20,resource:daemonsets=0
    live.stormforge.io/reliability.oom.memory-bump-up.min: >
      100Mi,resource:daemonsets=0Mi
    live.stormforge.io/reliability.oom.memory-bump-up.max: >
      2Gi
    live.stormforge.io/reliability.oom.memory-bump-up.apply-immediately: >
      IfAutoDeployEnabled,resource:daemonsets=Never,resource:statefulsets=Never

Then, apply the cluster-defaults.yaml file and restart the agent to pick up the changes. 

kubectl apply -f cluster-defaults.yaml -n stormforge-system
kubectl rollout restart deployment stormforge-agent-workload-controller -n stormforge-system

What Does This Configuration Do?

With this configuration, OOM Response recommendations are applied immediately wherever you’ve already enabled auto-deploy. While this is our recommended configuration for OOM Response, you may update these cluster defaults or override the behavior at a namespace or workload level to suit your own needs.

It’s important to note that this configuration does not enable OOM Response for DaemonSets, because increasing the memory even slightly for a pod that runs on every node can add up to a significant change across your cluster. For now, we recommend leaving this feature off for DaemonSets.
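The percent, min, and max settings in the ConfigMap can be pictured with a short sketch. This post does not spell out how min and max are applied, so the code assumes they bound the size of the increase itself; treat that as an assumption rather than documented behavior. The helper name `bump_amount_mib` and the sample workload sizes are hypothetical.

```python
def bump_amount_mib(current_mib: float, percent: float,
                    min_mib: float, max_mib: float) -> float:
    """Size of the increase in MiB.

    Assumption: min and max bound the increase, not the resulting total.
    """
    raw = current_mib * percent / 100
    return max(min_mib, min(raw, max_mib))


# Values from the ConfigMap: 20%, min 100Mi, max 2Gi (2048 MiB).
print(bump_amount_mib(256, 20, 100, 2048))    # small workload: raised to the 100 MiB floor
print(bump_amount_mib(4096, 20, 100, 2048))   # mid-size workload: plain 20% of 4096 MiB
print(bump_amount_mib(16384, 20, 100, 2048))  # large workload: capped at 2048 MiB

# DaemonSet overrides: percent 0 and min 0Mi, so no increase at all.
print(bump_amount_mib(512, 0, 0, 2048))
```

Under this reading, the DaemonSet override has to zero out both percent and min: a 0% bump with the 100Mi floor still in place would be raised to 100Mi.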

Additionally, this configuration does not immediately apply memory increases for StatefulSets as these workloads are often sensitive to restarts. Waiting until the next scheduled recommendation works well for StatefulSets.   
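The per-resource behavior described above follows from the value syntax in the ConfigMap: a default value, then comma-separated `resource:<type>=<value>` overrides. The sketch below shows one way to read those strings. It is an illustration of the syntax as it appears in this post, not Optimize Live's own parser, and `resolve_setting` is a hypothetical helper.

```python
def resolve_setting(raw: str, resource_type: str) -> str:
    """Return the value that applies to resource_type (hypothetical helper).

    raw looks like "20,resource:daemonsets=0": a default value, then
    optional comma-separated "resource:<type>=<value>" overrides.
    """
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    if not parts:
        raise ValueError("setting value is empty")
    default, overrides = parts[0], parts[1:]
    prefix = f"resource:{resource_type}="
    for item in overrides:
        if item.startswith(prefix):
            return item[len(prefix):]
    return default


percent = "20,resource:daemonsets=0"
apply_immediately = (
    "IfAutoDeployEnabled,resource:daemonsets=Never,resource:statefulsets=Never"
)

# "deployments" is just an example of a type with no override in the values above.
for kind in ("deployments", "statefulsets", "daemonsets"):
    print(kind, resolve_setting(percent, kind),
          resolve_setting(apply_immediately, kind))
```

Read this way, StatefulSets still receive the 20% increase but never have it applied immediately, while DaemonSets resolve to a 0% increase.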

Take Control of Your Resource Management

Adding this reactive protection from OOM Response to our proactive rightsizing recommendations ensures that every platform team is empowered to drive automated optimization in their environment while improving the reliability of their platform.

Our team is committed to expanding your ability to automate workload optimization. We’re excited to see how the OOM Response feature will empower you to further take control of your resource management.

Test it out with a free trial, or see it in the sandbox environment, and let us know what you think!
