How to analyze A/B test results
Your test has finished running. Now what? This guide walks you through the analysis step by step so you can reach the right decision.
Step 1: Confirm the test ran correctly
Before looking at results, verify these basics:
- Sample ratio mismatch (SRM) – If you expected a 50/50 split but got 52/48 or worse, something may be wrong with the randomization. Significant SRM invalidates results.
- Full weekly cycles – The test should have run for complete weeks (7, 14, 21 days) to avoid day-of-week bias.
- No external interference – Confirm no major events (site outages, marketing campaigns, holidays) occurred during the test that could skew results.
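The SRM check above can be sketched with a chi-square goodness-of-fit test (one degree of freedom for a two-arm test). This is a minimal stdlib-only illustration, not the guide's own tooling; the function name `srm_check` is ours:

```python
import math

def srm_check(n_control, n_variant, expected_ratio=0.5):
    """Chi-square test for sample ratio mismatch (two arms, 1 df)."""
    total = n_control + n_variant
    exp_c = total * expected_ratio
    exp_v = total * (1 - expected_ratio)
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_variant - exp_v) ** 2 / exp_v
    # p-value for chi-square with 1 df via the complementary error function
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p = srm_check(10_400, 9_600)  # a 52/48 split on 20,000 users
print(f"chi2={chi2:.1f}, p={p:.1e}")  # p far below 0.001: investigate the randomizer
```

A common convention is to flag SRM when this p-value falls below 0.001, since the check is run on every test and a 0.05 threshold would produce frequent false alarms.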
Step 2: Check statistical significance
Enter your data into the appropriate calculator based on your metric type:
- Conversion rates (clicked, purchased, signed up) – Use the Conversions Calculator
- Per-user numeric metrics (revenue, time on site) – Use the Continuous Metrics Calculator
- Ratio metrics (AOV, revenue per click) – Use the Ratio Metrics Calculator
Look at the p-value. If p < 0.05, the result is statistically significant at the 95% confidence level.
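For conversion-rate metrics, the underlying computation is typically a two-proportion z-test. A rough stdlib-only sketch of what a conversions calculator does (the function name is ours, not the calculator's API):

```python
import math

def two_proportion_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal survival function
    return math.erfc(abs(z) / math.sqrt(2))

p = two_proportion_pvalue(420, 10_000, 480, 10_000)  # 4.2% vs 4.8%
print(f"p = {p:.3f}")  # roughly 0.04: significant at the 95% level
```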
Step 3: Look at the effect size
Statistical significance alone is not enough. A +0.01% lift can be significant with enough data but is probably not worth shipping.
- Absolute effect – The raw difference (e.g. control: 4.2%, variant: 4.8% → absolute effect: +0.6 percentage points).
- Relative effect – The percentage change (e.g. +0.6pp on a 4.2% baseline → +14.3% relative lift). This is what matters for business decisions.
Ask yourself: is this lift large enough to justify the engineering and product costs of shipping the change?
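The two effect sizes above are simple arithmetic on the observed rates; using the 4.2% vs 4.8% example:

```python
control, variant = 0.042, 0.048   # conversion rates from the example above
absolute = variant - control      # difference in percentage points (as a fraction)
relative = absolute / control     # lift relative to the baseline
print(f"absolute: {absolute * 100:+.1f}pp, relative: {relative * 100:+.1f}%")
# absolute: +0.6pp, relative: +14.3%
```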
Step 4: Read the confidence interval
The confidence interval gives you the range of plausible effect sizes.
- Narrow interval (e.g. [+0.3%, +0.9%]) – You have good precision. The effect is likely between +0.3% and +0.9%. Safe to make a decision.
- Wide interval (e.g. [−0.5%, +1.7%]) – High uncertainty. The true effect could be negative or much larger than observed. Consider running the test longer.
The interval is more informative than the p-value alone – it tells you both whether the effect exists and how large it might be.
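For conversion rates, a confidence interval on the difference can be sketched with the standard Wald formula (an approximation that works well at these sample sizes; the function name `diff_ci` is ours). Continuing the 4.2% vs 4.8% example:

```python
import math

def diff_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """Approximate 95% Wald confidence interval for the difference in rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

lo, hi = diff_ci(420, 10_000, 480, 10_000)
print(f"95% CI: [{lo * 100:+.2f}pp, {hi * 100:+.2f}pp]")
# the interval just excludes zero, consistent with p < 0.05
```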
Step 5: Make the decision
- Significant positive result – Ship the variant. The effect is real and the direction is clear.
- Significant negative result – The variant hurt performance. Do not ship. Analyze why.
- Not significant – You could not detect a difference. This does not mean there is no difference – your test may be underpowered. Check the minimum detectable effect (MDE) your test was powered for using the Power Calculator.
If inconclusive: either extend the test (if practical) or accept that the effect is too small to detect with your traffic volume and move on to higher-impact ideas.
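To gauge whether an inconclusive test was underpowered, the MDE for a two-proportion test can be approximated from the baseline rate and per-arm sample size. A standard textbook approximation (80% power, two-sided alpha = 0.05; the function name is ours):

```python
import math

def mde_absolute(baseline, n_per_arm, z_alpha=1.96, z_beta=0.8416):
    """Approximate absolute MDE at 80% power and alpha = 0.05."""
    se = math.sqrt(2 * baseline * (1 - baseline) / n_per_arm)
    return (z_alpha + z_beta) * se

mde = mde_absolute(0.042, 10_000)
print(f"MDE: {mde * 100:.2f}pp absolute")
# roughly 0.8pp on a 4.2% baseline, i.e. about a 19% relative lift
```

If your observed lift is well below this MDE, extending the test slightly will not help; the effect is too small to detect at your traffic volume.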
Common analysis pitfalls
- Cherry-picking metrics – If your primary metric showed no effect, do not hunt through secondary metrics for a win. Pre-register which metric is primary.
- Post-hoc segmentation – Slicing results by country, device, or user type after the test increases false positive risk. Only trust pre-registered segments.
- Ignoring novelty effects – New designs often show an initial lift that fades as users grow accustomed. Consider monitoring post-launch metrics for a few weeks.