Note: This talk is also available as a blog post
We want to...
...compare conversion rates and revenue
...estimate uncertainty of metrics
We don't want...
...binary answers
...too many modelling assumptions
Variant | A | B |
---|---|---|
Title | Rambling about bootstrapping, confidence intervals, and the reliability of online sources | Bootstrapping the right way |
Visitors | 2500 | 2500 |
Ticket sales | 350 | 410 |
Conversion rate | 14% | 16.4% |
Let's look at a type of dataset that I often work on: conversions [...] the formula for the confidence interval [...]
scipy.stats.beta.ppf([0.025, 0.975], k, n - k)
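A minimal runnable sketch, plugging in the conversions (k) and visitors (n) from the table above. This applies the formula exactly as quoted; note that a uniform prior would give a Beta(k + 1, n - k + 1) posterior instead:

```python
from scipy import stats

# k conversions out of n visitors, per variant (from the table above)
for variant, k, n in [('A', 350, 2500), ('B', 410, 2500)]:
    lower, upper = stats.beta.ppf([0.025, 0.975], k, n - k)
    print(f'{variant}: {lower:.3f} to {upper:.3f}')
```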
In statistics, a confidence interval (CI) is a type of interval estimate, computed from the statistics of the observed data, that might contain the true value of an unknown population parameter. The interval has an associated confidence level that, loosely speaking, quantifies the level of confidence that the parameter lies in the interval. More strictly speaking, the confidence level represents the frequency (i.e. the proportion) of possible confidence intervals that contain the true value of the unknown population parameter. In other words, if confidence intervals are constructed using a given confidence level from an infinite number of independent sample statistics, the proportion of those intervals that contain the true value of the parameter will be equal to the confidence level.
Interval: Not a single point, a range
Confidence level: Higher is wider, lower is narrower
Confidence interval: Yet another confusing frequentist concept
Key insight: We can test the correctness of CI algorithms!
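For example, here's a sketch of such a test for the quoted formula: pick a known conversion rate, simulate many datasets, and count how often the "95%" interval contains the truth (the rate, sample size, and trial count here are arbitrary choices):

```python
import numpy as np
from scipy import stats

rnd = np.random.RandomState(0)
true_rate, n, num_trials = 0.14, 2500, 1000
covered = 0
for _ in range(num_trials):
    k = rnd.binomial(n, true_rate)  # simulate one dataset with a known rate
    lower, upper = stats.beta.ppf([0.025, 0.975], k, n - k)
    covered += lower <= true_rate <= upper
print(f'Coverage: {covered / num_trials:.1%}')  # close to 95% if well calibrated
```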
Sample | Data | Mean |
---|---|---|
Original | 10, 12, 20, 30, 45 | 23.4 |
Resample 1 | 30, 20, 12, 12, 45 | 23.8 |
Resample 2 | 20, 20, 30, 30, 30 | 26 |
... | many more resamples | ... |
import numpy as np

# Percentile bootstrap: resample with replacement, then take
# the central 95% of the resampled means
means = [np.random.choice(sample, size=len(sample)).mean()
         for _ in range(num_resamples)]
np.percentile(means, [2.5, 97.5])
[...] the number of resamples needs to be 15,000 or more, for 95% probability that simulation-based one-sided levels fall within 10% of the true values, for 95% intervals [...]
We want decisions to depend on the data, not random variation in the Monte Carlo implementation. We used r = 500,000 in the Verizon project.
The sample sizes needed for different intervals to satisfy the "reasonably accurate" (off by no more than 10% on each side) criterion are: n ≥ 101 for the bootstrap t, 220 for the skewness-adjusted t statistic, 2,235 for expanded percentile, 2,383 for percentile, 4,815 for ordinary t (which I have rounded up to 5,000 above), 5,063 for t with bootstrap standard errors and something over 8,000 for the reverse percentile method.
Bootstrapping is promoted because "it's just for loops" *
We should also use for loops to validate bootstrapping code!
* It's actually not that simple:
In practice, implementing some of the more accurate bootstrap methods is difficult, and people should use a package rather than attempt this themselves.
Variant | A | B |
---|---|---|
Free | 40% @ $0 | 60% @ $0 |
Tier 1 | 30% @ $25 | 25% @ $50 |
Tier 2 | 20% @ $50 | 10% @ $100 |
Tier 3 | 10% @ $100 | 5% @ $200 |
True mean | $27.5 | $32.5 |
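(For example, variant A's true mean is the weighted sum 0.4 × $0 + 0.3 × $25 + 0.2 × $50 + 0.1 × $100 = $27.50.)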
Factors: different taxes, exchange rates, discount vouchers, etc.
(and our friend, randomness)
import numpy as np

# Simulate 100 visitors from variant A: tier membership is multinomial,
# and paid amounts vary around the tier price (Poisson noise)
rnd = np.random.RandomState(0)
weights = [0.4, 0.3, 0.2, 0.1]
prices = [0, 25, 50, 100]
sample = []
for price, size in zip(prices, rnd.multinomial(100, weights)):
    if price:
        sample.extend(rnd.poisson(price, size))
    else:
        sample.extend([0] * size)
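Each simulated sample mean fluctuates around the true mean of $27.50; that variation is exactly what the CI needs to capture.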
How often is the true difference in means in the "95%" CI?
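Here's a sketch of that check, under the simulation assumptions above (Poisson-distributed paid amounts, 100 visitors per variant). The trial and resample counts are kept small for speed; the quotes above suggest using far more resamples in practice:

```python
import numpy as np

def simulate(rnd, weights, prices, n=100):
    """Draw per-visitor revenue for one variant, as in the snippet above."""
    sample = []
    for price, size in zip(prices, rnd.multinomial(n, weights)):
        sample.extend(rnd.poisson(price, size) if price else [0] * size)
    return np.array(sample)

rnd = np.random.RandomState(0)
variant_a = ([0.4, 0.3, 0.2, 0.1], [0, 25, 50, 100])     # true mean $27.50
variant_b = ([0.6, 0.25, 0.1, 0.05], [0, 50, 100, 200])  # true mean $32.50
true_diff = 32.5 - 27.5
num_trials, num_resamples = 200, 1000  # small for speed; see the quotes above
covered = 0
for _ in range(num_trials):
    sample_a, sample_b = simulate(rnd, *variant_a), simulate(rnd, *variant_b)
    # Percentile bootstrap CI for the difference in means
    diffs = [rnd.choice(sample_b, size=len(sample_b)).mean()
             - rnd.choice(sample_a, size=len(sample_a)).mean()
             for _ in range(num_resamples)]
    lower, upper = np.percentile(diffs, [2.5, 97.5])
    covered += lower <= true_diff <= upper
print(f'Coverage: {covered / num_trials:.1%}')  # compare with the nominal 95%
```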
Out of scope, but remember the IID (independent and identically distributed) assumption
Further reading: *Hackers beware: Bootstrap sampling may be harmful* on yanirseroussi.com