Bootstrapping¶
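Bootstrapping is a statistical method that repeatedly resamples, with replacement, from a single observed sample. The resulting distribution of a statistic (such as the mean) approximates its sampling distribution, which lets us make inferences about the population without collecting more data.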
Steps for bootstrapping¶
- Obtain a random sample from a population
- Draw a resample from the original sample, with replacement, of the same size as the original sample
- Calculate the statistic of interest from the resample (mean, median, etc.)
- Repeat the resampling process until you have a distribution of resample statistics
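The four steps above can be sketched as a single function. This is only a minimal illustration: the helper name `bootstrap_statistic`, its parameters, and the small `toy_sample` array are hypothetical and stand in for the blood lead data used in the rest of the notebook.

```python
import numpy as np

def bootstrap_statistic(sample, statistic=np.mean, n_resamples=1000, seed=0):
    """Return n_resamples bootstrap replicates of statistic (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    replicates = np.empty(n_resamples)
    for i in range(n_resamples):
        # Step 2: resample with replacement, same size as the original sample
        resample = rng.choice(sample, size=sample.size, replace=True)
        # Step 3: compute the statistic of interest on the resample
        replicates[i] = statistic(resample)
    # Step 4: the collected replicates form the bootstrap distribution
    return replicates

# Step 1: an observed sample (made-up values, for illustration only)
toy_sample = np.array([12, 15, 9, 21, 18, 14, 16, 11])
boot_medians = bootstrap_statistic(toy_sample, statistic=np.median)
```

Passing the statistic as a function keeps the resampling logic identical whether the quantity of interest is a mean, a median, or something else.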
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("bloodlead2.csv")
df.head()
Out[2]:
 | type | blood_lead |
---|---|---|
0 | exposed | 38 |
1 | exposed | 23 |
2 | exposed | 41 |
3 | exposed | 18 |
4 | exposed | 37 |
Get sample from population¶
In [3]:
exposed_bl = np.array(df[df['type'] == 'exposed']['blood_lead'])
control_bl = np.array(df[df['type'] == 'control']['blood_lead'])
Resample data from original sample¶
In [4]:
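# Draw one bootstrap resample: same size as the original sample, with replacement
resampled_vls = np.random.choice(control_bl, size=len(control_bl), replace=True)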
Calculate statistic of interest¶
In [5]:
np.mean(resampled_vls)
Out[5]:
17.575757575757574
Repeat resampling process many times¶
In [6]:
meanlist = []
for each_sample in range(10000):
    # Resample with replacement and record the mean of each resample
    resampled_vls = np.random.choice(control_bl, len(control_bl))
    fic_mean = np.mean(resampled_vls)
    meanlist.append(fic_mean)
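The loop above can also be written without an explicit Python loop by drawing every resample at once. The sketch below is one possible way to do this. It assumes a seeded `np.random.default_rng` generator for reproducibility and uses a made-up array, `sample`, as a stand-in for `control_bl`; the seed and the variable names are illustrative choices, not part of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily for reproducibility

# Stand-in for control_bl (made-up values, for illustration only)
sample = np.array([16, 18, 18, 24, 19, 11, 10, 15, 16, 18])

n_resamples = 10_000
# Random indices: one row per resample, one column per observation
idx = rng.integers(0, sample.size, size=(n_resamples, sample.size))
# Indexing with idx builds all resamples; the row-wise mean gives one mean per resample
boot_means = sample[idx].mean(axis=1)
```

This trades memory for speed, since all resamples are held in one array of shape `(n_resamples, sample.size)`.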
Visualize bootstrap distribution¶
In [7]:
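fig, ax = plt.subplots()
# Histogram of the 10,000 bootstrap means
ax.hist(meanlist)
ax.set_title('Expected distribution of mean with further sampling')
# 95% percentile interval: the 2.5th and 97.5th percentiles of the bootstrap means
conf_interval = np.percentile(meanlist, [2.5, 97.5])
# Mark the interval with a red line at a hand-picked height of 2600 counts
ax.plot([conf_interval[0], conf_interval[1]], [2600, 2600], 'r')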
Out[7]:
[<matplotlib.lines.Line2D at 0x241450166a0>]
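The red line in the plot above spans the 2.5th to 97.5th percentiles of the bootstrap means, which is commonly read as a 95% percentile confidence interval for the mean. The sketch below shows how those numbers, along with a bootstrap estimate of the standard error, might be computed and printed. It uses a made-up `sample` array and a seeded generator as assumptions, so that it runs without `bloodlead2.csv`.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily

# Stand-in data (made-up values); boot_means plays the role of meanlist
sample = np.array([16, 18, 18, 24, 19, 11, 10, 15, 16, 18])
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(5000)
])

# 95% percentile interval: the central 95% of the bootstrap distribution
lower, upper = np.percentile(boot_means, [2.5, 97.5])
# Bootstrap estimate of the standard error of the mean
boot_se = boot_means.std(ddof=1)

print(f"95% percentile interval: ({lower:.2f}, {upper:.2f})")
print(f"bootstrap standard error: {boot_se:.2f}")
```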