
7 Simple Tips for Better Performance Engineering


Software performance and resilience are key components of the user experience, but as the software industry embraces DevOps, it’s starting to fall short on the performance and resilience aspects. Performance issues are often overlooked until the software fails entirely.

However, we all know that performance doesn't suddenly degrade. As software is released through iterations, there is a performance cost every time more code is added, along with additional logic loops where things can fail, affecting the overall stability.

Crippling performance or software availability issues are hardly ever due to a single code change. Instead, it’s usually death by a thousand cuts. Having rigorous practices to reinforce performance and resilience, and testing continuously for these aspects, are great ways to catch a problem before it starts. And as with many aspects of testing, the quality of the performance practice is much more important than the quantity of tests being executed.

Here are seven simple tips to drive an efficient performance and resilience engineering practice.

1. Use benchmarks and change only one variable at a time

In performance and resilience engineering, a benchmark is a standardized problem or test that serves as a basis for evaluation or comparison. We define such tests so that we can compare them to each other. In order to compare, we change one element and measure the impact of that change against another test.

During our continuous integration process, we benchmark new versions of the software to measure how code changes impact the performance and resilience of our software. In other benchmarks, we want to measure how our software performs on different-sized hardware. As we also support multiple architectures, platforms, operating systems, databases, and file systems, we want not only to define how to get the best performance and reliability, but also to compare these configurations to one another.

These are all valid benchmark practices because we change one element and measure the impact of that change. However, if we were to change the software version under test and the hardware on which we test at the same time, and then try to compare results, we would not be able to conclude whether any change observed is due to one change, the other, or a combination of both—often, the combination of changes will have a different effect from when they happen individually. 

In performance engineering, try to do "apples to apples" comparisons, use benchmarks, and change only one variable across multiple versions of the test you want to compare. 
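As a sketch of the "one variable at a time" discipline, here is a minimal benchmark harness in Python (the language and the `benchmark` helper name are my own illustrations, not from the article). Two runs are comparable only when exactly one element differs between them:

```python
import time

def benchmark(workload, label, repeats=3):
    """Time one configuration of a workload. To compare fairly, change
    exactly one element (code version, hardware, file system, ...)
    between two benchmark runs and keep everything else fixed."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()                       # the standardized test problem
        timings.append(time.perf_counter() - start)
    return {"label": label, "best_s": min(timings), "runs": len(timings)}

# Same test, same machine, one variable changed (the payload size):
baseline = benchmark(lambda: sum(range(100_000)), "sum-100k")
candidate = benchmark(lambda: sum(range(200_000)), "sum-200k")
```

Comparing `baseline` against `candidate` attributes any difference to the payload size alone; changing the hardware at the same time would make the comparison inconclusive.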

2. Monitor memory, CPU, disk, and network usage

As performance and resilience engineering is a scientific endeavor, it can only be achieved by seeking to objectively explain the events we observe in a reproducible way. That means we need to measure.

For performance engineering, we must not only measure the software we are testing, but also the hardware we are testing it on. Monitoring memory, CPU, disk, and network usage is key for our analysis. We must also understand how those resources are allocated relative to our processing needs.

In information technology, we are always transferring data from one point to another and transforming it. Along the way we add redundancy; some of that redundancy is waste or overhead, and some of it is necessary because it lets us ensure data integrity and security. Performance engineering is all about removing unnecessary overhead while preserving data integrity.
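As a minimal sketch of resource sampling (Python standard library only; a real monitor would also record CPU and network counters, for example via a library such as psutil, at a fixed interval for the whole duration of the test):

```python
import shutil
import time

def snapshot(path="."):
    """One monitoring sample: timestamp plus disk usage for the volume
    holding `path`. Collect these before, during, and after a test run
    so resource behavior can be analyzed alongside the results."""
    disk = shutil.disk_usage(path)
    return {
        "t": time.time(),
        "disk_used_frac": disk.used / disk.total,  # 0.0 .. 1.0
    }

sample = snapshot()
```

Plotting a series of such snapshots next to the test's primary metric is what lets us explain, rather than guess at, the events we observe.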

3. Run each test at least three times

Before we can compare test results, we need to make sure the numbers we want to compare are trustworthy. Every time we run a test, we expect that if we run the same test under the same conditions at a different time, we should get the same results and metrics.

But when we run a test for the first time, we have no history of that test under the new conditions to decide if the results we have are repeatable. Keep in mind that previous tests where one component is different cannot be taken into account for result repeatability; only the same test executed multiple times can allow us to gain confidence in our result. 

Results we can trust are a key element, so I recommend that you not consider the results of a test for performance comparison unless you have executed that test at least three times. Five times is even better test hygiene. And for a release to customers or a general availability release, many more executions will be necessary. 
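The repetition rule can be sketched as a tiny Python helper (the `repeated_metric` name is illustrative, not from the article). The point is to keep every result, not just the last one:

```python
import statistics

def repeated_metric(run_test, repeats=3):
    """Execute the identical test, under identical conditions, at least
    three times, and summarize all the results."""
    results = [run_test() for _ in range(repeats)]
    return {
        "results": results,
        "median": statistics.median(results),
        "spread": max(results) - min(results),
    }
```

Only when the spread across these identical runs is small can the median be used for comparison against another configuration.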

4. Achieve a result variance under 3 percent

Still on the topic of results, we must prove that the same test repeated at different times produces the same result. A key indicator for that is the variance (also called variability) of the primary metric. The variance expresses the percentage difference between the best and worst executions of the same test.

Let’s consider a performance test where the primary metric is throughput measured in transactions per second. If the worst execution achieves a throughput of one hundred transactions per second and the best achieves one hundred ten transactions per second, our variance is 10 percent:

(Larger value – lower value) / Lower value
(110 – 100) / 100 = 0.1

Likewise, for a resilience test where the primary metric is the recovery time in seconds, if the worst recovery time is five minutes (300 seconds) and the best is four minutes (240 seconds), our variance is (300 – 240) / 240 = 0.25, or 25 percent.

The variance is the key indicator of whether our results can be trusted. A variance under 3 percent means our results are reliable. A variance between 3 percent and 5 percent means results are acceptable and repeatable, but with room for improvement regarding stability of the test, environment, or software under test. A variance between 6 percent and 10 percent means we cannot repeat our results and should actively investigate why we have such a high variance. And any test with a variance greater than 10 percent cannot be used for performance consideration at all. 
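The variance calculation from the formula above is one line of Python (the function name is mine; the numbers reproduce the article's two examples):

```python
def variance_pct(results):
    """Percentage spread between the best and worst execution of the
    same test: (larger value - lower value) / lower value * 100."""
    lo, hi = min(results), max(results)
    return (hi - lo) / lo * 100.0

variance_pct([100, 104, 110])   # throughput in txn/s -> 10.0 (investigate)
variance_pct([240, 252, 300])   # recovery times in seconds -> 25.0 (unusable)
```

Gating on this value, for example refusing to publish any comparison where the variance exceeds 3 percent, is an easy check to automate in a CI pipeline.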

5. Run your load tests for at least half an hour

Load tests are often aimed at measuring the capacity of a system for a specific usage. The goal is to get that system to process the largest workload in the shortest period without failing. For the measurements of such tests to have any basis in reality, in my opinion, the measured performance has to be sustainable for thirty minutes at the very least.

When you think about it, the only thing you have proven with a fifteen-minute load test is that the system can handle the load for fifteen minutes. Additionally, the shorter the run, the more subject to artificial variance it will be.

In performance engineering, we also need warm-up periods, because first executions are always slower: caches are cold and code paths are not yet optimized. Even on a warmed-up system, the first few transactions of a test are likely to be slower and not necessarily consistent between runs; hence the artificial variance. On a test of thirty minutes or longer, those effects barely show in the totals and are much less likely to induce variance.

If a load test runs for less than thirty minutes, its results have very little meaning from a performance engineering standpoint. And the thirty minutes of measured load should exclude any warm-up period.
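A load-test loop that discards the warm-up period before measuring can be sketched as follows (Python, with the function name and default durations as illustrations; the article's floor is 1,800 seconds of measured load):

```python
import time

def timed_load(txn, duration_s=1800.0, warmup_s=120.0):
    """Drive txn() in a loop for duration_s seconds of *measured* load,
    after first running (and discarding) a warm-up period."""
    warm_deadline = time.monotonic() + warmup_s
    while time.monotonic() < warm_deadline:
        txn()                                # warm-up: run but don't count

    completed = 0
    start = time.monotonic()
    deadline = start + duration_s
    while time.monotonic() < deadline:
        txn()
        completed += 1
    return completed / (time.monotonic() - start)   # steady-state throughput
```

Because the warm-up transactions never enter the throughput figure, the number reported is the sustainable steady-state rate rather than an average polluted by slow first calls.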

6. Prove your load results can be sustained for at least two hours

Again, I recommend half an hour at a minimum. As explained in the previous point, the only thing you have proven with a thirty-minute load test is that the system can sustain the load for thirty minutes. While thirty minutes will be enough to detect most new performance changes as they are introduced, in order to make these tests legitimate, it is necessary to also be able to prove they can run for at least two hours at the same load.

Short of running out of space, a peak load should be sustainable indefinitely. Proving the load can be run for two hours is a good first step. I recommend aiming for six, twelve, and twenty-four hours as milestones, and when possible, prove you can run these loads for five consecutive days.

Note that these endurance-under-load tests are to prove sustainability of load results. They do not need to be run against every single code change, but only to prove load numbers’ sustainability.

Start with proving two hours is sustainable. Anything less and your performance numbers should not be used for performance publications, and definitely not for capacity planning.

7. Ensure you have good automation

You cannot have successful performance engineering without good automation. Do you spend more time analyzing your test results (good automation), or executing tests and making changes to existing automation (bad automation)?

If you think you can improve your automation practices, start with these seven principles:

  1. Know why you automate
  2. Understand the steps of your automation
  3. Don't consider only the happy path or the unhappy path
  4. Build blocks you can stack on top of each other
  5. Plan automation early
  6. Scenarize your automation
  7. Gather metrics from your automation
posted Dec 28, 2018 by Arun


Related Articles

Reporting the results of performance testing is much more nuanced, and there are many ways of displaying these values—but Michael Stahl felt none of these ways was particularly effective. He proposes a reporting method that makes performance test results easy to read at a glance.

Effective reporting of test results is one of the holy grails of our profession. If done correctly, it improves the project’s quality and helps us focus on the real issues. But if done badly, it adds confusion and reduces the value that testers bring.

Reporting the results of functional tests is relatively simple because these tests have a clear pass or fail outcome. Reporting the results of performance testing is much more nuanced.

Let’s start with a definition: For the purpose of this article, I use the term performance test to mean any test that performs a measurement, with a range of numeric values all considered an acceptable result. It may be measurement of power consumption, the number of users a website serves in parallel, the speed that data can be read from a disk, etc.—any measurement of a nonfunctional requirement.

The first challenge in performance testing is deciding what’s considered a “pass.” Frequently this is neglected in the requirements definition phase. I have seen many requirements that read something like, “Data extraction time from the database shall be less than 10 mSec,” or “The rate of processing a video file shall be at least 100 frames per second (fps).” Such requirements are incomplete, as they do not include the actual target we want to hit. We only know the worst result we agree to tolerate and still approve the product. There are two problems here.

First, let’s assume I ran a test and found that video file processing is done at a rate of 101 fps (recall that the requirement was “at least 100 fps”). Looks good, right? But does it mean we are close to the edge (that is, the product hardly meets the requirement) or that everything is fine? If the requirement had been well defined, it would have included both the target and the minimum—for example, target: 120 fps; minimum: 100 fps. With such a requirement, a result of 101 fps clearly indicates the product hardly meets the requirements.

Second, when a test fails marginally (e.g., 99 fps), the product manager is under pressure to be “flexible” and accept the product as is. How often have we heard, “Indeed, we are below the minimum, but we are almost passing, so we can decide it’s fine”? If the full requirement were available (target: 120 fps), it would be clear how far the results are from the target and that the product has a real issue.
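A target-plus-minimum requirement makes the verdict mechanical. Here is a small Python sketch (the `classify` function and its labels are my own illustration of the idea, for a higher-is-better metric):

```python
def classify(value, minimum, target):
    """Judge a measurement against a requirement that states both the
    minimum we will tolerate and the target we aim for (higher is better)."""
    if value < minimum:
        return "fail"
    if value >= target:
        return "good"
    return "marginal"   # passes, but close to the edge

classify(101, minimum=100, target=120)   # -> "marginal"
classify(99,  minimum=100, target=120)   # -> "fail"
```

With only the minimum specified, both 101 fps and 119 fps would have been reported as an identical "pass"; the target is what separates "barely acceptable" from "healthy."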

For the sake of completeness, I will mention that a nonfunctional requirement must not only specify target and minimum, but also the test method, since the test method influences the results. For example, when measuring CPU utilization, the results would vary significantly depending on how we perform the measurement. Do we measure the maximum value recorded? Over how long a time? Do we average measurements? How many measurements a second? What else is running on the CPU in parallel to our test?

In theory, reporting performance test results should not be a problem at all. Just present the results and indicate a pass or fail. But again, we don’t only want to know the result; we want to get an idea of how the result relates to the target. Crafting a report that is not overly complex but still delivers a complete picture of the status is a balancing act.

We could use a table:

[Image: Table showing video processing requirement of 120 frames per second]

However, because most products have many performance requirements, we will end up with a large table full of numbers. It will be hard to quickly see where there is a problem. We could use color to improve readability:

[Image: Table showing where tests met requirements, using yellow for within range and green for good]

But this brings up more questions. Does it make sense that frame processing speed and CPU utilization get the same color code? One is almost failing, while the other is well within the acceptable range. So maybe color frame processing in red? But then what color would we use for a failure? And how long would we consider a result green before it should become yellow? Not to mention the difficulties that could occur due to some people having color-blindness.

I was thinking about this issue when my doctor sent me for my annual blood check, which I do meticulously—about every three years. Anyway, the results from the lab included a list of dozens of numbers displayed in this format:

[Image: Blood test results depicted on a color-coded sliding scale]

Even though I am not a physician, I could tell right away which results were fine, which were marginal, and which were something I should discuss with my doctor.

A light bulb went on in my head: Why not use this method for reporting performance tests? I took a few data points and experimented with PowerPoint:

[Image: Performance test results displayed in the same color-coded sliding-scale format]

Note that I still use colors, but the axis explains the choice of color and identifies where higher is better and where lower is better in a color-independent way. The reader can clearly see the position of each measurement within the allowed range; the colors serve mainly to focus attention where there is trouble. Creating such a report takes some time, but it could be automated.
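Even without graphics, the position-on-a-scale idea can be approximated in plain text. This Python sketch (function name and layout are my own, not Stahl's implementation) renders each measurement as a marker on its minimum-to-target axis:

```python
def scale_bar(name, value, minimum, target, width=30):
    """Render one measurement on a minimum->target scale, in the spirit
    of the blood-test report: the marker's position carries the
    information, so color becomes optional."""
    pos = (value - minimum) / (target - minimum)   # also works when lower is better
    pos = max(0.0, min(1.0, pos))                  # clamp out-of-range values
    cells = ["-"] * width
    cells[round(pos * (width - 1))] = "|"
    return f"{name:<18} {minimum} [{''.join(cells)}] {target}  ({value})"

print(scale_bar("frame rate (fps)", 101, 100, 120))
# For lower-is-better metrics, pass minimum as the worst tolerated value
# and target as the goal, e.g. CPU utilization tolerated up to 80%, aiming for 20%:
print(scale_bar("CPU util (%)", 40, 80, 20))
```

A marker hugging the left edge flags a result that barely passes, regardless of whether the reader can distinguish the colors in a rendered chart.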

I have not yet seen this idea implemented in a real project—I’m still working on that—but if you do use this idea, I’d be happy to learn about your experience and the reaction from your organization.