
In any experiment that is carried out, we often rely on probabilities to prove (or disprove) a hypothesis.

When carrying out an A/B test, for example, we are often seeking statistically significant results.

We are great advocates of testing in production and so A/B testing is one effective way to test your features on a select number of users to make sure that they’re working as they should before rolling them out to everyone else.

However, no hypothesis test can be 100% certain, because such tests are always based on probabilities. **This is why we may sometimes arrive at wrong conclusions, leading to what are known as type I and type II errors.**

We mentioned the term ‘statistical significance’ which is what any experiment is seeking to find. In the experiments you run, you want to make sure that a relationship actually exists between the variables proposed in your hypothesis, which is the purpose of an A/B test.

You are ultimately seeking to ensure that your A/B tests achieve statistical significance before making any decisions.

If you’ve often carried out A/B tests, then you’re probably familiar with this term as it gives you the tools necessary to make informed decisions to meet your business goals.

For the sake of further clarification, a statistically significant result in such tests means that the result is highly unlikely to have occurred randomly and is instead attributed to a specific cause or trend.

Simply put, it indicates that the gap or difference between the variations and the control is unlikely to be random or due to chance. The significance threshold you choose reflects your risk tolerance and confidence level.

In other words, when you run an A/B test with a 95% confidence level (a 5% significance level), this means that if there were truly no difference between the variations, there would be only a 5% chance of the test declaring a winner purely by chance.

However, as with any hypothesis test based on statistics and probabilities, two types of errors can show up in your results.

Before we delve deeper into type I errors, it would be worthwhile to give an overview of what hypothesis testing is.

Hypothesis testing is when a hypothesis is tested against its opposite to determine which of the two the data supports. In this case, you have two hypotheses: the null hypothesis and the alternative hypothesis.

Therefore, a statistical hypothesis test is used to determine a possible conclusion from two different and conflicting hypotheses.

The null hypothesis posits that there is no relationship between the two proposed phenomena, while the alternative hypothesis is the opposite of what is stated in the null hypothesis.

P-values used in statistical testing help decide whether to reject the null hypothesis. The p-value tells you how likely it is that results at least as extreme as yours would have occurred if the null hypothesis were true. The smaller the p-value, the stronger the evidence against the null hypothesis and the more likely you are to reject it.

The threshold for declaring statistical significance is most commonly set at 0.05, so a result with p < 0.05 is declared statistically significant.
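To make the p-value more concrete, here is a minimal sketch of how it could be computed for an A/B test that compares two conversion rates. It assumes a pooled two-proportion z-test, which is one common choice but not the only one, and the function name and visitor counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate: the best estimate of the shared rate if the null hypothesis is true.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # identical all-or-nothing results: no evidence of a difference
    z = (p_b - p_a) / std_err
    # Probability of a z-score at least this extreme in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: 200 of 4,000 visitors convert on the control,
# 250 of 4,000 convert on the variation.
p_value = two_proportion_p_value(200, 4000, 250, 4000)
print(f"p-value = {p_value:.4f}")
print("significant at 0.05" if p_value < 0.05 else "not significant at 0.05")
```

With these made-up numbers the p-value falls below 0.05, so the null hypothesis would be rejected at the usual threshold.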

However, in any statistical test, there is always a degree of uncertainty, so there is always some risk of committing an error.

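The following table depicts these errors in relation to the null hypothesis:

| | Null hypothesis is true | Null hypothesis is false |
|---|---|---|
| Reject the null hypothesis | Type I error (false positive) | Correct decision |
| Do not reject the null hypothesis | Correct decision | Type II error (false negative) |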

One such error is type 1 (or type I) error, also referred to as false positive, which is the wrong rejection of a null hypothesis even though it’s true. In other words, you conclude that the results are statistically significant when they are simply a result of chance or due to unrelated factors.

Simply put, a type 1 error occurs when the tester validates a statistically significant difference when there isn’t one.

In an A/B test, a type 1 error is when you declare a variation the winner even though it does not actually outperform the control. In other words, as a false positive, it leads you to believe that a variation has made a statistically significant difference when it has not.

Type 1 errors have a probability of “α” or alpha, which is determined by the confidence level you set (α = 1 − confidence level). For example, if you set a confidence level of 95%, then there is a 5% chance of getting a type 1 error when the null hypothesis is true.
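One way to see this 5% in practice is to simulate many "A/A" tests, where both groups get the same experience and the null hypothesis is therefore true by construction. The sketch below is only an illustration: the conversion rate, visitor counts, seed and the pooled z-test are all assumptions chosen for the example.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # identical all-or-nothing results: no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(true_rate=0.10, visitors=500, experiments=1000,
                        alpha=0.05, seed=1):
    """Share of simulated A/A tests that are wrongly declared significant."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(experiments):
        # Both groups convert at the same true rate, so any "winner" is a type 1 error.
        conv_a = sum(rng.random() < true_rate for _ in range(visitors))
        conv_b = sum(rng.random() < true_rate for _ in range(visitors))
        if p_value(conv_a, visitors, conv_b, visitors) < alpha:
            false_positives += 1
    return false_positives / experiments


rate = false_positive_rate()
print(f"share of A/A tests wrongly declared significant: {rate:.3f}")
```

The printed share should land close to alpha, although it will vary a little with the seed and the number of simulated experiments.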

A type 1 error means wrongly concluding that your variation had an effect when it did not. Consequently, the main reason to remain on the lookout for such errors is that they may end up costing your company a lot of money, as they could lead to a loss in sales.

Suppose, for example, that you test a change in the color of a button on your homepage and notice early on that the new button leads to more clicks. Convinced that this variation has made a difference, you decide to end the test early, wrongly concluding that there is indeed a correlation between this change in color and conversion rates.

Thus, you end up deploying this variation to all your users to find that, surprise, it didn’t actually have an impact. The end result is that you could risk hurting your customer conversion rate in the long run.

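The best way to avoid such errors is to set an adequate sample size and test duration in advance and to let the test run its full course, so that you can be sure your variation outperformed the control over the long run rather than during a lucky early streak.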
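The effect of ending a test at the first promising result can be illustrated with a small simulation. The sketch below runs A/A tests (no real difference) and compares two policies: checking the result at ten interim "looks" and stopping at the first significant one, versus checking only once at the planned end. All parameters, the seed and the pooled z-test are assumptions made for the example.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def peeking_vs_fixed(true_rate=0.10, visitors=1000, looks=10,
                     experiments=400, alpha=0.05, seed=7):
    """Return (false positive rate when peeking, false positive rate at the planned end)."""
    rng = random.Random(seed)
    step = visitors // looks
    peek_hits = 0
    final_hits = 0
    for _ in range(experiments):
        conv_a = conv_b = 0
        significant_at_any_look = False
        p = 1.0
        for look in range(1, looks + 1):
            # Add one more batch of visitors to each group, then look at the result.
            conv_a += sum(rng.random() < true_rate for _ in range(step))
            conv_b += sum(rng.random() < true_rate for _ in range(step))
            n = look * step
            p = p_value(conv_a, n, conv_b, n)
            if p < alpha:
                significant_at_any_look = True
        peek_hits += significant_at_any_look
        final_hits += p < alpha  # p now holds the result at the planned end
    return peek_hits / experiments, final_hits / experiments


peek_rate, fixed_rate = peeking_vs_fixed()
print(f"false positives when stopping at the first significant look: {peek_rate:.3f}")
print(f"false positives when checking only at the planned end:       {fixed_rate:.3f}")
```

Because both groups are identical, every "win" here is a type 1 error. Repeated peeking gives chance more opportunities to cross the threshold, which is why the first figure comes out noticeably higher than the second.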


Type 2 (or type II) errors, also referred to as false negatives, occur when you fail to reject the null hypothesis even though it is actually false, so you end up wrongly dismissing your own hypothesis and variation. Type 2 errors have a probability of β or beta.

In an A/B test, this means that you fail to conclude there was an effect when there indeed was and so no conclusive winner is declared among the control and variations even though there should be one.

In other words, you believe that a variation has made no statistically significant difference, and you mistakenly accept the null hypothesis that a relationship doesn’t exist when it does.

The probability of a type 2 error is inversely related to the statistical power of a test, where power is the probability that a test detects an effect that actually exists (power = 1 − β). The higher the statistical power, the lower the probability of committing a type 2 error.

Statistical power usually depends on three factors: sample size, significance level and the “true” value of your tested parameter (that is, the size of the real effect).
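The way these three factors interact can be sketched with a standard normal approximation for a two-sided test on two conversion rates. This is an approximation under stated assumptions (equal group sizes, large samples, the negligible far tail ignored), and the function name and example rates are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_control, p_variation, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test with equal group sizes."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    p_bar = (p_control + p_variation) / 2
    # Standard error of the difference if the null hypothesis were true.
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    # Standard error of the difference under the assumed true rates.
    se_alt = sqrt((p_control * (1 - p_control)
                   + p_variation * (1 - p_variation)) / n_per_group)
    diff = abs(p_variation - p_control)
    return NormalDist().cdf((diff - z_alpha * se_null) / se_alt)


# Hypothetical scenario: the control converts at 10%, the variation truly converts at 12%.
for n in (1000, 2000, 4000, 8000):
    power = power_two_proportions(0.10, 0.12, n)
    print(f"n per group = {n:>5}: power = {power:.2f}, beta = {1 - power:.2f}")
```

In this sketch, power rises (and β falls) as the sample size grows, as the true effect gets larger, or as the significance level is relaxed.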

Just like type I errors, type II errors can lead to false assumptions and poor decision making, particularly when a test is concluded too early.

Furthermore, getting false negatives and failing to notice the effect of your variations may lead to missed opportunities to increase your conversion rate.

To reduce the risk of such an error, make sure you increase the statistical power of your test, for example by having a big enough sample size. This would entail gathering more data over a longer period of time to help avoid reaching the false conclusion that your experiment didn’t have an impact when the opposite is true.
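As a rough guide to what "big enough" means, the sample size needed per group can be estimated before the test starts. The sketch below uses a common normal-approximation formula for comparing two proportions with equal group sizes. The baseline rate, the uplift you hope to detect and the 80% power target are assumptions you would replace with your own figures.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p_control, p_variation, alpha=0.05, power=0.80):
    """Approximate visitors needed per group for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # controls the type 1 error rate
    z_power = NormalDist().inv_cdf(power)          # controls the type 2 error rate
    p_bar = (p_control + p_variation) / 2
    spread_null = sqrt(2 * p_bar * (1 - p_bar))
    spread_alt = sqrt(p_control * (1 - p_control)
                      + p_variation * (1 - p_variation))
    n = ((z_alpha * spread_null + z_power * spread_alt)
         / (p_variation - p_control)) ** 2
    return ceil(n)


# Hypothetical goal: detect an improvement from a 10% to a 12% conversion rate.
print(sample_size_per_group(0.10, 0.12))
# A smaller expected uplift needs far more visitors to detect.
print(sample_size_per_group(0.10, 0.11))
```

Note that the required sample size grows quickly as the effect you want to detect shrinks, which is why small uplifts call for longer tests.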

The probability of making type I and type II errors is depicted in the image below, where the null hypothesis distribution shows all possible results if the null hypothesis is true while the alternative hypothesis shows all possible results if the alternative hypothesis is true:

As can be seen, type I and type II errors occur where these two distributions overlap.

Let’s consider these two scenarios:

- If your results demonstrate statistical significance, this suggests that there is a difference between the variations. In that case you may reject the null hypothesis. However, this could sometimes be a type 1 error.
- If your results don’t show statistical significance then the null hypothesis cannot be rejected. This could also sometimes be a type 2 error.

In the end, it’s important to strike a balance between the risks of type 1 and type 2 errors. Many argue that type 1 errors may be more damaging, as they could lead to changes that end up wasting resources and costing time and money, while type 2 errors are more about ‘missed opportunities’ (though they could also have significant consequences).

The essential thing to remember is that A/B tests are based on statistical probabilities, meaning that the results obtained are never 100% certain.

Nevertheless, these tests serve as a valuable tool to help marketers increase sales and conversion rates. Even if your results may not be as certain as you’d like them to be, you can still make them more trustworthy by reducing the risk of the aforementioned errors.

To reduce the probability of error, the key is to use a large enough sample size and to run the test for long enough to collect data that is as accurate as possible, which increases the credibility of your test results.

Read more about A/B testing statistics in our A/B testing guide here.
