30 Sept, 2022
Cloud engineers often work in complex microservice environments with many moving parts, any of which can cause problems in your systems. If you aren’t proactive about understanding the demands and usage patterns placed on those systems, it will undoubtedly come back to haunt you at 3am in some form of incident or outage!
On a recent client engagement, our team started doing simple load testing on the API we managed because we wanted to see how it would perform as the number of users grew (our team had some very ambitious usage targets for the following months, at more than twice the existing load). I’ll take you through what we learned along the way and share some tips for starting load testing in your development team, focusing on the process we went through rather than the load testing tooling we used (as there’s already plenty of information available on this).
The top 5 tips when starting load testing in your team
Be clear on what you are testing and the outcomes you’re expecting from your load testing.
Ideally your app will be extremely well monitored and understood, and you already have great visibility into how your application is working. Even if you still have work to do in this space, before you start your load test it’s important to consider what information you want to gain from this exercise and whether you have visibility into it yet. For example, if you’re worried about a performance bottleneck in a specific spot in your system, do you have the ability to see how that bottleneck is performing during the test? Where the bottleneck sits (for example in your database, in your application, or in a dependency on another application) will determine exactly how you gain visibility. Putting in time to make sure you have the metrics, traces, or logs you need to monitor it will help.
For the Midnyte City crew, it was also important to look at how our whole application performed under this high demand. In addition to checking the load test results, we monitored during the load testing to see how response times fluctuated and how the CPU and memory of our containers were responding. If we hadn’t done the work to improve visibility into our system before we started load testing, we wouldn’t have felt confident in understanding how our system performed under load (which is different to just knowing the response times for our system).
One more thing we did before we ran our load tests was to understand what a ‘good’ result looked like (e.g. a target response time). Without this, it can be hard for people in your team to interpret the results of the load test. You might already have a feel for what good looks like. This will depend on who is using your application (i.e. is it showing directly to users or is it a behind-the-scenes process), any performance goals your team/organisation has set, and any previous load/performance testing that has been done. If what ‘good’ looks like isn’t specified clearly for the team to understand, you will find people on your team will choose what makes sense to them (which will vary from person to person).
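To make the idea of an agreed ‘good’ result concrete, here is a minimal sketch in Python of checking one load test run against a target. The function names, the 95th percentile choice, and the 500 ms and 1% thresholds are all assumptions for illustration, not the numbers or code we used on the engagement.

```python
import math


def percentile(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("no samples recorded")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_run(response_times_ms, error_count,
                 target_p95_ms=500, max_error_rate=0.01):
    """Compare one run against the team's agreed targets (hypothetical values)."""
    p95 = percentile(response_times_ms, 95)
    error_rate = error_count / len(response_times_ms)
    return {
        "p95_ms": p95,
        "error_rate": error_rate,
        "passed": p95 <= target_p95_ms and error_rate <= max_error_rate,
    }
```

Writing the target down as a default argument (or in shared config) means everyone on the team reads the same pass/fail answer from a run, rather than each person applying their own threshold.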
Understand current usage patterns, and what are the ‘high loads’ you want to test.
The systems we are testing can often be used in many different ways, and some of these are more resource intensive (or more likely to have bottlenecks) than others.
As a team we looked at our current metrics to understand existing usage patterns. We determined which endpoints were being hit and how often, to build up a baseline of what current load looked like on our system. It’s easy to repeatedly test hitting one endpoint, but this is rarely what customers do. We also looked at our peak usage patterns; these could be particular hours during the day or different times of year, depending on your application.
Typically in load tests, it’s important to run scenarios with gradually increasing numbers of simulated customers. This allows you to see how the latency of calls varies with increasing load, and understanding your baseline and peak usage patterns is what allows these scenarios to be accurate.
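As a rough illustration of building that baseline, the sketch below turns a list of observed requests into the share of traffic each endpoint receives. The input format (one "METHOD /path" string per request) and the function name are assumptions; in practice this data would come from whatever metrics or access logs you already have.

```python
from collections import Counter


def endpoint_mix(requests):
    """Return each endpoint's share of total traffic, busiest first.

    `requests` is a hypothetical list with one "METHOD /path" string
    per observed request.
    """
    counts = Counter(requests)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no requests observed")
    return {endpoint: count / total for endpoint, count in counts.most_common()}
```

The resulting proportions can then be used as weights when deciding how often each endpoint is called in a load test scenario.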
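One way to sketch a gradual ramp is to step the number of simulated customers from the current baseline up to a multiple of the observed peak. The function name, the linear steps, and the default multiplier of 2 are assumptions for illustration; most load testing tools have their own way of expressing stages like these.

```python
def ramp_stages(baseline_users, peak_users, steps, target_multiplier=2.0):
    """Return evenly spaced user counts from baseline up to a multiple of peak."""
    if steps < 2:
        raise ValueError("need at least two steps to form a ramp")
    top = peak_users * target_multiplier
    if top < baseline_users:
        raise ValueError("target load is below the baseline")
    # Linear interpolation between the baseline and the target ceiling.
    increment = (top - baseline_users) / (steps - 1)
    return [round(baseline_users + i * increment) for i in range(steps)]
```

Running each stage long enough for latency to settle makes it easier to see at which step response times start to degrade.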
Ensure you understand how the tooling works, and how to create scenarios that match what you want to test.
There are many different load testing tools, and they’re all fairly similar. You can define a user scenario you want to test and then have different simulated customers execute these scenarios. Now that you have a great understanding of the scenarios you are testing from the previous tip, it’s time to turn these into load test code.
The things we found important to consider when writing the test scenarios were:
- Building some randomness into the load test to more closely mimic end users
- Finding the right balance between timing and frequency of endpoint calls to mimic user patterns
- Using data that reflected ‘production’ data as closely as possible
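The first two considerations above can be sketched without tying them to any particular tool. The example below plans one simulated customer’s session by picking endpoints at random, weighted by how often they are hit in production, with a random pause between calls. The function name, the weights, and the 1 to 5 second pause range are hypothetical; a real tool would replace the returned plan with actual requests.

```python
import random


def build_session(mix, actions, rng, min_think_s=1.0, max_think_s=5.0):
    """Plan one simulated customer's session as (endpoint, pause_seconds) pairs.

    `mix` maps each endpoint to its share of traffic (assumed to come from
    your own usage metrics). `rng` is a random.Random instance, so a run
    can be reproduced by reusing the same seed.
    """
    endpoints = list(mix)
    weights = [mix[endpoint] for endpoint in endpoints]
    session = []
    for _ in range(actions):
        # Weighted choice mimics how often real users hit each endpoint.
        endpoint = rng.choices(endpoints, weights=weights, k=1)[0]
        # A random pause avoids every simulated customer calling in lockstep.
        pause = rng.uniform(min_think_s, max_think_s)
        session.append((endpoint, pause))
    return session
```

Passing in a seeded `random.Random` keeps the randomness but lets you replay the same session when you need to reproduce a result.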
Picking the right environment.
To get the best outcomes from load testing, you want the system to match production as closely as possible to ensure the results you see are indicative of the performance on production systems. I know organisations can have all sorts of different test environments, and depending on the sort of load testing, getting test data could be quite challenging or time consuming.
It’s likely you will be load testing in a shared test environment, where your load testing could have impacts on other systems, so it’s important to consider any other teams that may be impacted and notify them of your plans to load test. Part of this should also be considering any downstream effects your load tests might have (sending emails, creating users/orders etc.), and any cleanup required.
Even if your test environment looks a bit different from your production environment (whether this is due to test data or something else), it’s still worth doing load testing and tracking down bottlenecks. Just be mindful of the differences, and highlight any risks or concerns they create. For example, if you only have access to a much smaller data set for load testing than you do in production, this may impact how accurate your load testing is. It’s good to let the broader team know how this dataset will affect your load testing, and how you plan to mitigate the differences to build further confidence in your system.
Document outcomes (or baselines), and either build them into your pipeline or redo them regularly.
Hopefully you have now finished your load testing and have more confidence in your system. What you do next depends on how often you’re planning to perform load testing, and whether it’s a new or established process. If this is a process you want to run outside of your pipeline at intermittent intervals (e.g. every couple of weeks or months), it’s worthwhile documenting results so you can compare your load test outcomes over time.
One thing we discussed at our client site when we were doing load testing, but didn’t prioritise for implementation immediately, was running the load tests in our CI/CD pipeline to help increase confidence. Due to the time it takes to run load tests, it isn’t always feasible to run them on every check-in though. One option to help work around this would be to run a subset of your load tests every day, and a larger set every week.
The longer you go without running load tests, the more changes there are to search through, and the harder it can be to track down the culprit behind a performance problem. Performance problems can be very hard to resolve, and now that you have done all the hard work to get load testing working, it’s important to run the tests regularly.
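If you record each run’s headline numbers, comparing them over time can be as simple as the sketch below. The metric names, the 10% tolerance, and the assumption that a higher value is always worse (true for latency and error rate, not for throughput) are all illustrative choices rather than anything we implemented on the engagement.

```python
def find_regressions(baseline, current, tolerance=0.10):
    """Return metrics that got worse than the baseline by more than tolerance.

    Both arguments are dicts of metric name to value. Every metric is
    assumed to be "higher is worse", such as p95 latency or error rate.
    """
    regressions = {}
    for metric, old_value in baseline.items():
        new_value = current.get(metric)
        if new_value is None:
            continue  # Metric was not recorded in this run.
        if new_value > old_value * (1 + tolerance):
            regressions[metric] = (old_value, new_value)
    return regressions
```

Storing the baseline alongside the load test code gives the team one documented reference point to compare every later run against.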
Coming back to the client where we began doing load testing, we ran our load tests by following the tips above and we felt very confident in being able to achieve the massive growth targets ahead of us.
We gained important learnings along the way - that’s another great thing about load testing. It can help you learn significantly more about your application, and forces you to think about your application as part of a larger, complex system, helping to identify different concerns and problems around things like connectivity, performance, and network speed.
As your team's maturity in performing load testing grows, you can also do some fun things like seeing how your performance goes during a deployment, or if a container goes down. Load testing is a practice you can use to better understand and build confidence in how your system performs under load. It can help us cloud engineers validate that our systems are as resilient and scalable as we expect.