Bootstrapping¶

Bootstrapping is a statistical method that repeatedly resamples, with replacement, from an observed sample in order to estimate the sampling distribution of a statistic and make inferences about the population the sample was drawn from.

Steps for bootstrapping¶

  • Obtain a random sample from a population
  • Draw a resample, with replacement, of the same size as the original sample
  • Calculate the statistic of interest from the resample (mean, median, etc.)
  • Repeat the resampling many times to build up a distribution of the resample statistic
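The four steps above can be sketched as one small function. This is only an illustration under stated assumptions: the function name `bootstrap_distribution`, its parameters, and the `observed` values are hypothetical and are not part of the notebook or its dataset.

```python
import numpy as np

def bootstrap_distribution(sample, statistic=np.mean, n_resamples=1000, seed=0):
    """Evaluate `statistic` on `n_resamples` bootstrap resamples of `sample`."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    stats = np.empty(n_resamples)
    for i in range(n_resamples):
        # Step 2: resample with replacement, same size as the original sample
        resample = rng.choice(sample, size=sample.size, replace=True)
        # Step 3: compute the statistic of interest on the resample
        stats[i] = statistic(resample)
    # Step 4: the collected values form the bootstrap distribution
    return stats

# Step 1: stand-in for an observed random sample (made-up values, not the notebook's data)
observed = np.array([12, 15, 9, 21, 18, 14, 30, 11, 16, 19])
boot_medians = bootstrap_distribution(observed, statistic=np.median, n_resamples=2000)
```

Passing the statistic as an argument means the same loop works for a mean, a median, or any other function that maps an array to a number.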
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("bloodlead2.csv")
df.head()
Out[2]:
type blood_lead
0 exposed 38
1 exposed 23
2 exposed 41
3 exposed 18
4 exposed 37

Get sample from population¶

In [3]:
exposed_bl = np.array(df[df['type'] == 'exposed']['blood_lead'])
control_bl = np.array(df[df['type'] == 'control']['blood_lead'])

Resample data from original sample¶

In [4]:
# np.random.choice samples with replacement by default (replace=True)
resampled_vls = np.random.choice(control_bl, len(control_bl))

Calculate statistic of interest¶

In [5]:
np.mean(resampled_vls)
Out[5]:
17.575757575757574
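Because the resample is random, the mean shown above will differ from run to run. If reproducible output is wanted, one option is a seeded NumPy `Generator`. The sketch below assumes a small made-up array, `control_bl_demo`, in place of the notebook's `control_bl`, and the seed value is arbitrary.

```python
import numpy as np

# Hypothetical stand-in values; the notebook's `control_bl` would be used instead
control_bl_demo = np.array([13, 16, 21, 18, 14, 22, 19, 15])

# Two generators created with the same (arbitrary) seed
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

resample_a = rng_a.choice(control_bl_demo, size=len(control_bl_demo), replace=True)
resample_b = rng_b.choice(control_bl_demo, size=len(control_bl_demo), replace=True)

# Same seed, same resample, and therefore the same mean
same_mean = np.mean(resample_a) == np.mean(resample_b)
```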

Repeat resampling process many times¶

In [6]:
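meanlist = []  # one bootstrap mean per resample
for each_sample in range(10000):
    # Draw a new resample (with replacement) of the same size as the original sample
    resampled_vls = np.random.choice(control_bl, len(control_bl))
    fic_mean = np.mean(resampled_vls)
    meanlist.append(fic_mean)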
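The loop above can also be written without an explicit Python loop by drawing all resamples at once as a 2-D array, with one row per resample. This is an alternative sketch, not the notebook's method; `sample_demo` holds made-up values standing in for `control_bl`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the original sample
sample_demo = np.array([13, 16, 21, 18, 14, 22, 19, 15])
n_resamples = 10_000

# Each row is one resample drawn with replacement from the original sample
resamples = rng.choice(sample_demo, size=(n_resamples, sample_demo.size), replace=True)

# Mean of each row gives one bootstrap mean per resample
boot_means = resamples.mean(axis=1)
```

The trade-off is memory: the array holds `n_resamples * sample_demo.size` values at once, which is fine for small samples but may not suit very large ones.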

Visualize bootstrap distribution¶

In [7]:
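fig, ax = plt.subplots()
ax.hist(meanlist)
ax.set_title('Expected distribution of mean with further sampling')
# 95% percentile interval: the middle 95% of the bootstrap means
conf_interval = np.percentile(meanlist, [2.5, 97.5])
# Mark the interval with a red horizontal line; y=2600 is a hand-picked height in count units
ax.plot([conf_interval[0], conf_interval[1]], [2600, 2600], 'r')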
Out[7]:
[<matplotlib.lines.Line2D at 0x241450166a0>]
[Figure: histogram of the bootstrap means, with a red horizontal line spanning the 2.5th to 97.5th percentile interval]
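The bootstrap distribution can also be summarized numerically. The sketch below reports a 95% percentile interval and a bootstrap standard error (the standard deviation of the bootstrap means). It is an illustration only: `sample_demo` is a made-up sample, not the blood lead data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the original sample
sample_demo = np.array([13, 16, 21, 18, 14, 22, 19, 15])

# Bootstrap distribution of the mean: one row per resample
resamples = rng.choice(sample_demo, size=(10_000, sample_demo.size), replace=True)
boot_means = resamples.mean(axis=1)

# 95% percentile interval: the 2.5th and 97.5th percentiles of the bootstrap means
lower, upper = np.percentile(boot_means, [2.5, 97.5])

# Bootstrap standard error: spread of the bootstrap means
boot_se = np.std(boot_means, ddof=1)

print(f"95% percentile interval: ({lower:.2f}, {upper:.2f})")
print(f"bootstrap standard error: {boot_se:.2f}")
```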