We recently finished switching our customers over to our new load testing engine. While we had already done a lot of testing internally, we thought: why not go a little beyond our requirements? No sooner said than done. Our goal? Make more than one million requests per second with at least one million active clients. 😱
To be honest: most of our customers need to simulate far less traffic than that. Some require a large amount of bandwidth, others have very little, but very valuable traffic. Still, we always need to be aware of where our limits are so that our customers can rely on StormForge.
The use case we selected as the basis for our experiments is a fictitious HbbTV scenario where a large number of viewers have HbbTV-enabled devices that send a regular heartbeat.
Although streaming is getting more and more popular, “normal” broadcast television is still VERY big around the world. HbbTV is a solution to bring more interactivity into the broadcasting industry. It is also used for analytics purposes, which we want to use as an example in this article.
What we had in mind boils down to a very simple scenario per simulated client: select a TV channel, request a "watching token" for it, and then send a heartbeat ping roughly every second for about five minutes. (The full test case definition is included at the end of this article.)
By the way, some readers will remember that we have written about TV-related, very large tests in the past. Compared to our adventures testing a very large scale interactive German TV show in 2014, this test is way bigger!
The first problem we encountered was actually not our load generator setup itself. We have many years of experience in ad-hoc provisioning of cloud resources and managing them without any manual intervention. The problem was setting up a target that could actually handle the load. So how do you test a very efficient, scalable load testing engine?
As you might know, we have our open source testapp publicly available. This is a simple Golang application and we are running it on a cheap yet beefy bare metal server. The box is quite powerful, but not nearly powerful enough to handle north of 1,000,000 requests per second with many more established TCP connections.
We are always looking at different cloud technologies, mostly out of curiosity. One of the things that had already caught our interest was AWS Fargate, AWS's solution for running containers without having to manage servers or clusters. Since we had already built a Docker image for our testapp, we thought: why not try to run and scale it on Fargate?
After some experiments on how efficiently our testapp runs on Fargate, we ended up with the following configuration:
To set up and manage our target on Fargate, we used fargatecli, a command line tool to set up Fargate tasks and services. Once we had requested an increase in the number of allowed Fargate tasks, provisioning our test target was very simple: first we created a new network load balancer (NLB), and then a new Fargate service with 150 instances of our testapp, roughly as sketched below.
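For illustration, the provisioning boils down to a handful of fargatecli calls along these lines. This is a sketch, not a copy of our exact commands: the service name, image, sizing and flags are assumptions and may differ depending on your setup and fargatecli version.

# A TCP listener gives us a network load balancer (NLB)
fargate lb create testapp-lb --port TCP:8080
# Run the testapp image as a Fargate service behind that NLB ...
fargate service create testapp --lb testapp-lb --port TCP:8080 --image stormforger/testapp --cpu 2048 --memory 4096
# ... and scale it out horizontally to 150 tasks
fargate service scale testapp 150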
Before we actually ran the fully scaled test, we set ourselves some intermediate steps. The issue with going from 0 to 100 right away is that problems quickly generate a lot of noise once you run higher-traffic scenarios. Generally, it is a good idea to cover some basic ground first.
We defined three steps, which is something we recommend to our customers as well:
0) Measure the base capacity of the target system
First, we ran a series of tests against different CPU/memory configurations of our testapp deployed on a single AWS Fargate task to establish a baseline. Our goal was to get close to 80% CPU utilization per container.
While our testapp usually runs on a beefy bare metal server, we quickly realized that it behaves quite differently on AWS Fargate: we get around 6,250 req/sec/core on our 8-core bare metal box, but only 3,250 req/sec/core on 2-core/4 GB Fargate containers. 🤔 Since our goal here was not to optimize for performance, and scaling out on Fargate is not an issue, we had to grit our teeth and carry on. 😁
We also noticed that the default nofile limit in Fargate tasks is 2048/4096 (soft/hard), which we had to increase since we were planning to have many thousands of connections per container.
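One way to raise the limit is via the ulimits setting in the container definition of the ECS task definition; a minimal sketch (the limit values here are illustrative, not necessarily the ones we used):

"ulimits": [
  {
    "name": "nofile",
    "softLimit": 65536,
    "hardLimit": 65536
  }
]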
1) Run at ~10% for 5 minutes
The first step towards our goal went very smoothly. We mainly used it to verify the results from our capacity testing and to improve our confidence for the next steps. From the test definition's point of view, scaling a run down to ~10% is just a matter of lowering the arrival rate, as the sketch below shows.
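For illustration, a ~10% run only differs from the full definition (shown at the end of this article) in its arrival phase. The exact numbers here are a sketch, not our actual configuration:

// ~10% of the target scale: about 333 arrivals/sec for 5 minutes
definition.setArrivalPhases([{ duration: 5 * 60, rate: 333.33, }]);

Since the simulated client behaviour stays the same, a tenth of the arrival rate results in roughly a tenth of the concurrent clients and requests per second.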
2) Run at ~50% for 10 minutes
Because we do not use AWS Fargate in our daily operations, we immediately hit the default account limit for concurrently running AWS Fargate tasks. Before we could continue, we had to request an increase from AWS Support. Once this was approved, we could run our test without further issues.
After passing our internal milestones, verifying that our target would not be accidentally overloaded and that everything behaved the way we wanted for our experiment, it was time for our first full-scale test run.
It took almost exactly 11 seconds for our load generator cluster to be provisioned after we hit the “Launch new Test Run” button. 17 seconds later, we reached 1 million requests per second!
After that, we performed a series of tests at that scale to pinpoint possible optimizations in our test infrastructure and analytics components. It turns out that processing hundreds of millions of requests per test run takes a moment and generates quite a bit of data that needs to be taken care of.
Using our new load generator engine (which is now enabled by default for all customers), scaling a StormForge performance test up to very large scenarios takes the same effort from the user's perspective as a very small test.
Setting up a test target that can handle over a million active clients and well over 1 million requests per second was a bit more involved. Our usual and recommended approach of taking intermediate steps towards your testing goal has proven to be a good idea, especially because we saw significant differences in the vertical scalability of our testapp on Fargate (investigation pending). Horizontal scaling, on the other hand, was very straightforward: just crank up the number of tasks and you are good to go!
During our testing, we generated hundreds of millions of requests which needed to be analyzed. We confirmed some assumptions about our test analysis pipeline, which we plan to work on to further improve processing times; our goal is that nothing should take longer than 60 seconds. Once the data was analyzed, we were glad to see that our reports and latency analysis tooling worked as expected, with sub-second delays.
Do you have questions or remarks regarding AWS Fargate, large scale testing or other performance topics? Just drop us a line 🙂
Here are some more details on the setup:
The StormForge test case definition that was executed is pretty straightforward. Note that we artificially delay the responses of our testapp to simulate a slightly more realistic "processing time" on the target.
definition.setTarget("http://testapp-lb-657638500f694428.elb.eu-west-1.amazonaws.com:8080");

// 1,000,000 active users, ~5 min (300 s) per user: 3333.33 arrivals/sec
definition.setArrivalPhases([{ duration: 10 * 60, rate: 3333.33, }]);

definition.session("up-to-eleven", function(session) {
  const channels = session.ds.loadStructured("hbbtv_channels.csv");
  const channel = session.ds.pickFrom(channels);

  // Get a "watching token" for the selected channel
  session.post("/random/get_token?delay=100&channel=:id", {
    tag: "get_token",
    params: {
      id: channel.get("id"),
    },
    extraction: {
      jsonpath: {
        "token": "$.token",
      }
    }
  });

  session.assert("token_received", session.getVar("token"), "!=", "");

  const pings = session.ds.generate("random_number", { range: [280, 320] });
  // Each client sends ~280-320 heartbeat pings, roughly one per second
  session.times(pings, function(ctx) {
    ctx.get("/ping?delay=100&token=:token", {
      tag: "ping",
      params: {
        token: session.getVar("token"),
      }
    });
    ctx.waitExp(1);
  });
});