A/B test statistical significance explained
Statistical significance is the most cited, and most misunderstood, concept in A/B testing. This guide explains what it actually means and how to interpret it correctly.
What is statistical significance?
Statistical significance tells you whether the difference you observe between variants is likely real or could have appeared by chance alone.
When we say a result is "statistically significant at 95% confidence," we mean: if there were no real difference between the variants, there is less than a 5% probability of seeing a difference this large or larger just from random variation.
Crucially, it does not mean there is a 95% chance the winning variant is better. That is a common misinterpretation.
How p-values work
The p-value is the probability of observing your data (or something more extreme) assuming the null hypothesis is true: that is, assuming there is no actual difference between variants.
- Low p-value (e.g. p = 0.02) → The observed difference would be unlikely under pure chance. We reject the null hypothesis and call the result significant.
- High p-value (e.g. p = 0.35) → The observed difference is easily explained by random variation. We fail to reject the null hypothesis.
The threshold: Most teams use α = 0.05 (5%) as the cutoff. If p < α, the result is significant. This threshold is a convention, not a law of nature; some teams use 0.01 or 0.10 depending on the cost of errors.
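To make the mechanics concrete, here is a minimal sketch of the two-proportion z-test that most conversion calculators run under the hood. It uses only the standard library; the function name and example numbers are illustrative, not taken from any particular tool.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal survival function
    return math.erfc(abs(z) / math.sqrt(2))

# Example: 2.0% vs 2.6% conversion on 10,000 visitors per variant
p = two_proportion_p_value(200, 10_000, 260, 10_000)
print(f"p = {p:.4f}, significant at alpha = 0.05: {p < 0.05}")
```

With these inputs the test comes out well below 0.05, so the difference would be called significant at the conventional threshold.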
Confidence intervals tell you more than p-values
A p-value only tells you whether the effect is likely nonzero. A confidence interval tells you the plausible range of the effect size.
For example: "The conversion rate difference is +1.2% with a 95% confidence interval of [+0.3%, +2.1%]." This tells you the effect is significant (the interval doesn't include zero) and gives you a sense of the likely magnitude.
If the interval is [−0.5%, +2.9%], the result is not significant, but you can see that the effect might still be meaningful. You probably need more data.
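The interval in the example above can be computed with a standard Wald interval for the difference of two proportions. This is a hedged sketch, assuming a simple normal approximation; the function name and inputs are illustrative.

```python
import math

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, z=1.96):
    """95% Wald CI for the difference in conversion rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Unpooled standard error: intervals use each variant's own variance,
    # unlike the pooled SE used by the significance test
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

lo, hi = diff_confidence_interval(200, 10_000, 260, 10_000)
print(f"95% CI for the lift: [{lo:+.4f}, {hi:+.4f}]")
```

If the printed interval excludes zero, the result is significant at the matching 5% level, which is the same verdict the p-value gives.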
Common significance mistakes
- Peeking at results daily – Checking significance every day and stopping when p < 0.05 dramatically inflates false positives. A test with a 5% false positive rate when checked once can have a 30%+ false positive rate when checked daily. Use sequential testing if you need to monitor results continuously.
- Confusing significance with importance – A statistically significant result can be practically meaningless. A +0.01% conversion lift might be significant with millions of visitors, but it is not worth shipping. Always check the effect size, not just the p-value.
- Treating non-significance as proof of no effect – A non-significant result means you could not detect an effect, not that no effect exists. Your test may simply be underpowered. Check the Power Calculator to understand what your test could actually detect.
- Ignoring multiple comparisons – Testing 20 metrics at α = 0.05 means you expect one false positive by chance alone. Use corrections (Bonferroni, Holm) or focus on a single primary metric.
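The Holm correction mentioned in the last point is easy to apply by hand. Here is a minimal sketch of the step-down procedure; the function name is illustrative.

```python
def holm_significant(p_values, alpha=0.05):
    """Return booleans marking which hypotheses survive the Holm step-down."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, i in enumerate(order):
        # Compare the k-th smallest p-value against alpha / (m - k)
        if p_values[i] <= alpha / (m - rank):
            significant[i] = True
        else:
            break  # step-down: once one fails, all larger p-values fail too
    return significant

# Four metrics tested at once: only the strongest result survives
print(holm_significant([0.001, 0.04, 0.03, 0.20]))
```

Note that p = 0.04 and p = 0.03 would each look significant in isolation, but neither clears the corrected threshold, which is exactly the protection the correction buys.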
The Bayesian alternative
If the frequentist framework feels counterintuitive, Bayesian analysis gives you a direct probability statement: "There is a 94% probability that variant B is better than A."
This is often what people think significance means. The Bayesian Calculator computes this for you โ no need to reason about null hypotheses or p-values.
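The kind of number a Bayesian calculator reports can be estimated with a short Monte Carlo simulation over Beta posteriors. This is a sketch under stated assumptions: a uniform Beta(1, 1) prior, an illustrative sample count, and hypothetical function and variable names.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000, seed=42):
    """Monte Carlo estimate of P(rate_B > rate_A) under uniform priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior is Beta(successes + 1, failures + 1) with a Beta(1, 1) prior
        rate_a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        rate_b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        wins += rate_b > rate_a
    return wins / draws

print(f"P(B > A) ~ {prob_b_beats_a(200, 10_000, 260, 10_000):.3f}")
```

The result is a direct probability statement about the variants, which is the interpretation people often mistakenly give to a 95% confidence level.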
Check significance for your test
Use the Conversions Calculator to test whether your A/B test results are statistically significant. Enter your visitor counts and conversions to get a p-value, confidence interval, and effect size.