ABtesting.tools

How to analyze A/B test results

Your test has finished running. Now what? This guide walks you through the analysis step by step so you make the right decision.

Step 1: Confirm the test ran correctly

Before looking at results, verify these basics:

  • Sample ratio mismatch (SRM) — If you expected a 50/50 split but observed something like 52/48, the randomization may be broken. Whether a given imbalance is a problem depends on sample size, so run a chi-squared test on the assignment counts; a statistically significant mismatch invalidates the results.
  • Full weekly cycles β€” The test should have run for complete weeks (7, 14, 21 days) to avoid day-of-week bias.
  • No external interference β€” Confirm no major events (site outages, marketing campaigns, holidays) occurred during the test that could skew results.
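The SRM check above can be sketched with a chi-squared goodness-of-fit test on the assignment counts (the counts below are hypothetical, chosen to illustrate a 52/48 split on 20,000 users):

```python
from scipy.stats import chisquare

# Hypothetical observed assignment counts: a 52/48 split on 20,000 users
control, variant = 10_400, 9_600
expected = [(control + variant) / 2] * 2  # expected 50/50 split

stat, p = chisquare([control, variant], f_exp=expected)
if p < 0.001:  # a strict threshold is common for SRM checks
    print(f"Sample ratio mismatch detected (p = {p:.2e}); investigate before trusting results")
```

With these numbers the mismatch is highly significant, so the results should not be trusted until the assignment bug is found.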

Step 2: Check statistical significance

Enter your data into the appropriate significance calculator for your metric type — for example, a proportion test for conversion rates, or a t-test for continuous metrics such as revenue per user.

Look at the p-value. If p < 0.05, the result is statistically significant at the 95% confidence level.
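For a conversion-rate metric, the significance check amounts to a two-proportion z-test. A minimal sketch (the conversion counts are hypothetical, chosen to match the 4.2% vs 4.8% example used later in this guide):

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null hypothesis
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided p from normal CDF
    return z, p_value

# Hypothetical data: control 420/10,000 (4.2%), variant 480/10,000 (4.8%)
z, p = two_proportion_z_test(420, 10_000, 480, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Here p comes out just under 0.05, so the result clears the significance bar — but as the next step explains, that alone is not a shipping decision.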

Step 3: Look at the effect size

Statistical significance alone is not enough. A +0.01% lift can be significant with enough data but is probably not worth shipping.

  • Absolute effect β€” The raw difference (e.g. control: 4.2%, variant: 4.8% β†’ absolute effect: +0.6 percentage points).
  • Relative effect β€” The percentage change (e.g. +0.6pp on a 4.2% baseline β†’ +14.3% relative lift). This is what matters for business decisions.
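The two effect measures above are simple arithmetic on the observed rates:

```python
control_rate, variant_rate = 0.042, 0.048  # 4.2% vs 4.8%, from the example above

absolute_effect = variant_rate - control_rate   # difference in percentage points
relative_lift = absolute_effect / control_rate  # percentage change vs baseline

print(f"Absolute: {absolute_effect * 100:+.1f}pp, Relative: {relative_lift * 100:+.1f}%")
```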

Ask yourself: is this lift large enough to justify the engineering and product costs of shipping the change?

Step 4: Read the confidence interval

The confidence interval gives you the range of plausible effect sizes.

  • Narrow interval (e.g. [+0.3%, +0.9%]) β€” You have good precision. The effect is likely between +0.3% and +0.9%. Safe to make a decision.
  • Wide interval (e.g. [βˆ’0.5%, +1.7%]) β€” High uncertainty. The true effect could be negative or much larger than observed. Consider running the test longer.

The interval is more informative than the p-value alone β€” it tells you both whether the effect exists and how large it might be.
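For conversion rates, a 95% confidence interval for the difference can be sketched with the standard Wald approximation (same hypothetical counts as earlier):

```python
from math import sqrt

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, z=1.96):
    """Approximate 95% Wald confidence interval for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)  # unpooled standard error
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Hypothetical data: control 420/10,000, variant 480/10,000
lo, hi = diff_confidence_interval(420, 10_000, 480, 10_000)
print(f"95% CI: [{lo * 100:+.2f}pp, {hi * 100:+.2f}pp]")
```

With these numbers the interval stays just above zero but is wide relative to the observed effect — significant, yet the true lift could plausibly be anywhere from near zero to roughly double the point estimate.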

Step 5: Make the decision

  • Significant positive result β€” Ship the variant. The effect is real and the direction is clear.
  • Significant negative result β€” The variant hurt performance. Do not ship. Analyze why.
  • Not significant — You could not detect a difference. This does not mean there is no difference — your test may be underpowered. Check what minimum detectable effect (MDE) your test was powered to detect using the Power Calculator.

If inconclusive: either extend the test (if practical) or accept that the effect is too small to detect with your traffic volume and move on to higher-impact ideas.
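The MDE check can be approximated by hand with the usual normal-approximation formula (assuming a two-sided test at α = 0.05 with 80% power — z-values 1.96 and 0.84 respectively; the baseline and sample size below are hypothetical):

```python
from math import sqrt

def minimum_detectable_effect(baseline, n_per_arm, alpha_z=1.96, power_z=0.84):
    """Approximate absolute MDE for a two-sided test at alpha=0.05, power=0.80."""
    se = sqrt(2 * baseline * (1 - baseline) / n_per_arm)
    return (alpha_z + power_z) * se

# Hypothetical: 4.2% baseline, 10,000 users per arm
mde = minimum_detectable_effect(0.042, 10_000)
print(f"MDE ≈ {mde * 100:.2f}pp on a 4.2% baseline")
```

If the lift you care about is smaller than the MDE this returns, extending the test (or pooling more traffic) is the only way an honest analysis can detect it.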

Common analysis pitfalls

  • Cherry-picking metrics β€” If your primary metric showed no effect, do not hunt through secondary metrics for a win. Pre-register which metric is primary.
  • Post-hoc segmentation β€” Slicing results by country, device, or user type after the test increases false positive risk. Only trust pre-registered segments.
  • Ignoring novelty effects β€” New designs often show an initial lift that fades as users get accustomed. Consider monitoring post-launch metrics for a few weeks.
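On the segmentation pitfall: each extra segment you inspect is an extra chance of a false positive. One simple (conservative) guard is a Bonferroni correction on the significance threshold — a sketch:

```python
def bonferroni_threshold(alpha, n_comparisons):
    """Adjusted per-comparison significance threshold for multiple comparisons."""
    return alpha / n_comparisons

# Inspecting 5 segments: require p < 0.01 in each, rather than p < 0.05
print(bonferroni_threshold(0.05, 5))
```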