ABtesting.tools

How to analyze A/B test results

Your test has finished running. Now what? This guide walks you through the analysis step by step so you make the right decision.

Step 1: Confirm the test ran correctly

Before looking at results, verify these basics:

  • Sample ratio mismatch (SRM) — If you expected a 50/50 split but observed something like 52/48, the randomization may be broken. Whether a given imbalance is a problem depends on sample size, so run a chi-squared test on the assignment counts; a statistically significant mismatch invalidates the results.
  • Full weekly cycles β€” The test should have run for complete weeks (7, 14, 21 days) to avoid day-of-week bias.
  • No external interference β€” Confirm no major events (site outages, marketing campaigns, holidays) occurred during the test that could skew results.
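The SRM check above can be sketched with a chi-squared goodness-of-fit test on the assignment counts (the counts below are hypothetical, chosen to illustrate a 52/48 split on 20,000 users):

```python
from scipy.stats import chisquare

# Hypothetical observed assignment counts: a 52/48 split on 20,000 users
control, variant = 10_400, 9_600
expected = [(control + variant) / 2] * 2  # expected 50/50 split

stat, p = chisquare([control, variant], f_exp=expected)
if p < 0.001:  # a strict threshold is common for SRM checks
    print(f"Sample ratio mismatch detected (p = {p:.2e}); investigate before trusting results")
```

With these numbers the mismatch is highly significant, so the results should not be trusted until the assignment bug is found.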

Step 2: Check statistical significance

Enter your data into the appropriate significance calculator for your metric type — for example, a proportion test for conversion rates, or a t-test for continuous metrics such as revenue per user.

Look at the p-value. If p < 0.05, the result is statistically significant at the 95% confidence level.
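For a conversion-rate metric, the significance check amounts to a two-proportion z-test. A minimal sketch (the conversion counts are hypothetical, chosen to match the 4.2% vs 4.8% example used later in this guide):

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null hypothesis
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided p from normal CDF
    return z, p_value

# Hypothetical data: control 420/10,000 (4.2%), variant 480/10,000 (4.8%)
z, p = two_proportion_z_test(420, 10_000, 480, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Here p comes out just under 0.05, so the result clears the significance bar — but as the next step explains, that alone is not a shipping decision.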

Step 3: Look at the effect size

Statistical significance alone is not enough. A +0.01% lift can be significant with enough data but is probably not worth shipping.

  • Absolute effect β€” The raw difference (e.g. control: 4.2%, variant: 4.8% β†’ absolute effect: +0.6 percentage points).
  • Relative effect β€” The percentage change (e.g. +0.6pp on a 4.2% baseline β†’ +14.3% relative lift). This is what matters for business decisions.
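The two effect measures above are simple arithmetic on the observed rates:

```python
control_rate, variant_rate = 0.042, 0.048  # 4.2% vs 4.8%, from the example above

absolute_effect = variant_rate - control_rate   # difference in percentage points
relative_lift = absolute_effect / control_rate  # percentage change vs baseline

print(f"Absolute: {absolute_effect * 100:+.1f}pp, Relative: {relative_lift * 100:+.1f}%")
```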

Ask yourself: is this lift large enough to justify the engineering and product costs of shipping the change?

Step 4: Read the confidence interval

The confidence interval gives you the range of plausible effect sizes.

  • Narrow interval (e.g. [+0.3%, +0.9%]) β€” You have good precision. The effect is likely between +0.3% and +0.9%. Safe to make a decision.
  • Wide interval (e.g. [βˆ’0.5%, +1.7%]) β€” High uncertainty. The true effect could be negative or much larger than observed. Consider running the test longer.

The interval is more informative than the p-value alone β€” it tells you both whether the effect exists and how large it might be.
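For conversion rates, a 95% confidence interval for the difference can be sketched with the standard Wald approximation (same hypothetical counts as earlier):

```python
from math import sqrt

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, z=1.96):
    """Approximate 95% Wald confidence interval for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)  # unpooled standard error
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Hypothetical data: control 420/10,000, variant 480/10,000
lo, hi = diff_confidence_interval(420, 10_000, 480, 10_000)
print(f"95% CI: [{lo * 100:+.2f}pp, {hi * 100:+.2f}pp]")
```

With these numbers the interval stays just above zero but is wide relative to the observed effect — significant, yet the true lift could plausibly be anywhere from near zero to roughly double the point estimate.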

Step 5: Make the decision

  • Significant positive result β€” Ship the variant. The effect is real and the direction is clear.
  • Significant negative result β€” The variant hurt performance. Do not ship. Analyze why.
  • Not significant — You could not detect a difference. This does not mean there is no difference — your test may be underpowered. Check what minimum detectable effect (MDE) your test was powered to detect using the Power Calculator.

If inconclusive: either extend the test (if practical) or accept that the effect is too small to detect with your traffic volume and move on to higher-impact ideas.
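The MDE check can be approximated by hand with the usual normal-approximation formula (assuming a two-sided test at α = 0.05 with 80% power — z-values 1.96 and 0.84 respectively; the baseline and sample size below are hypothetical):

```python
from math import sqrt

def minimum_detectable_effect(baseline, n_per_arm, alpha_z=1.96, power_z=0.84):
    """Approximate absolute MDE for a two-sided test at alpha=0.05, power=0.80."""
    se = sqrt(2 * baseline * (1 - baseline) / n_per_arm)
    return (alpha_z + power_z) * se

# Hypothetical: 4.2% baseline, 10,000 users per arm
mde = minimum_detectable_effect(0.042, 10_000)
print(f"MDE ≈ {mde * 100:.2f}pp on a 4.2% baseline")
```

If the lift you care about is smaller than the MDE this returns, extending the test (or pooling more traffic) is the only way an honest analysis can detect it.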

Common analysis pitfalls

  • Cherry-picking metrics β€” If your primary metric showed no effect, do not hunt through secondary metrics for a win. Pre-register which metric is primary.
  • Post-hoc segmentation β€” Slicing results by country, device, or user type after the test increases false positive risk. Only trust pre-registered segments.
  • Ignoring novelty effects β€” New designs often show an initial lift that fades as users get accustomed. Consider monitoring post-launch metrics for a few weeks.
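On the segmentation pitfall: each extra segment you inspect is an extra chance of a false positive. One simple (conservative) guard is a Bonferroni correction on the significance threshold — a sketch:

```python
def bonferroni_threshold(alpha, n_comparisons):
    """Adjusted per-comparison significance threshold for multiple comparisons."""
    return alpha / n_comparisons

# Inspecting 5 segments: require p < 0.01 in each, rather than p < 0.05
print(bonferroni_threshold(0.05, 5))
```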