Resampling methods are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and recomputing a statistic of interest on each sample. Bootstrapping is one such resampling method: it repeatedly draws independent samples from our data set and provides a direct computational way of assessing uncertainty. This tutorial covers the basics of bootstrapping to estimate the accuracy of a single statistic. Note that resampling (including bootstrapping) to improve the bias-variance tradeoff in predictive modeling is discussed elsewhere.
First, I cover the packages and data used to reproduce the results displayed in this tutorial. I then discuss how bootstrapping works, followed by an illustration of how to implement the method in R.
This tutorial primarily uses the rsample package to create bootstrap samples, but also uses the purrr and ggplot2 packages (both contained in the tidyverse). To illustrate, our example attrition data comes from the rsample package.
library(tidyverse)
library(rsample)
as_tibble(attrition)
## # A tibble: 1,470 x 31
## Age Attr… Busi… Dail… Depa… Dist… Educ… Educ… Envi… Gend… Hour… JobI…
## * <int> <fct> <fct> <int> <fct> <int> <ord> <fct> <ord> <fct> <int> <ord>
## 1 41 Yes Trav… 1102 Sales 1 Coll… Life… Medi… Fema… 94 High
## 2 49 No Trav… 279 Rese… 8 Belo… Life… High Male 61 Medi…
## 3 37 Yes Trav… 1373 Rese… 2 Coll… Other Very… Male 92 Medi…
## 4 33 No Trav… 1392 Rese… 3 Mast… Life… Very… Fema… 56 High
## 5 27 No Trav… 591 Rese… 2 Belo… Medi… Low Male 40 High
## 6 32 No Trav… 1005 Rese… 2 Coll… Life… Very… Male 79 High
## 7 59 No Trav… 1324 Rese… 3 Bach… Medi… High Fema… 81 Very…
## 8 30 No Trav… 1358 Rese… 24 Belo… Life… Very… Male 67 High
## 9 38 No Trav… 216 Rese… 23 Bach… Life… Very… Male 44 Medi…
## 10 36 No Trav… 1299 Rese… 27 Bach… Medi… High Male 94 High
## # ... with 1,460 more rows, and 19 more variables: JobLevel <int>, JobRole
## # <fctr>, JobSatisfaction <ord>, MaritalStatus <fctr>, MonthlyIncome
## # <int>, MonthlyRate <int>, NumCompaniesWorked <int>, OverTime <fctr>,
## # PercentSalaryHike <int>, PerformanceRating <ord>,
## # RelationshipSatisfaction <ord>, StockOptionLevel <int>,
## # TotalWorkingYears <int>, TrainingTimesLastYear <int>, WorkLifeBalance
## # <ord>, YearsAtCompany <int>, YearsInCurrentRole <int>,
## # YearsSinceLastPromotion <int>, YearsWithCurrManager <int>
Direct standard error formulas, as discussed in the univariate statistical inference tutorial, exist for various statistics and help to compute confidence intervals. However, prior to the computer age, estimating certain parameters of a distribution (e.g., percentile points, proportions, odds ratios, and correlation coefficients) required complex and laborious Taylor series expansions to compute the errors of an estimate. The bootstrap was developed as an alternative, computer-intensive approach to derive estimates of standard errors and confidence intervals for any statistic.
Bootstrapping involves repeatedly drawing independent samples from our data set $Z$ to create bootstrap data sets $Z^{*1}, Z^{*2}, \ldots, Z^{*B}$. This sampling is performed with replacement, which means that the same observation can be sampled more than once, and each bootstrap sample $Z^{*b}$ will have the same number of observations as the original data set.
The basic idea of bootstrapping is that inference about a population from sample data can be modelled by resampling the sample data and performing inference about the sample from the resampled data (resampled → sample → population).
The bootstrap process begins with a statistic that we are interested in, $\hat{\theta}$. Some large number $B$ of bootstrap samples are independently drawn, and each bootstrap sample is used to compute this statistic, yielding $\hat{\theta}^{*1}, \hat{\theta}^{*2}, \ldots, \hat{\theta}^{*B}$. We can then use all the bootstrapped estimates to compute the standard error of the desired statistic:

$$SE_B\big(\hat{\theta}\big) = \sqrt{\frac{1}{B - 1} \sum_{b=1}^{B} \left( \hat{\theta}^{*b} - \frac{1}{B} \sum_{b'=1}^{B} \hat{\theta}^{*b'} \right)^2}$$

Thus, $SE_B\big(\hat{\theta}\big)$ serves as an estimate of the standard error of $\hat{\theta}$ estimated from the original data set.
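To make this concrete, here is a minimal sketch of this computation in base R, using the median as the statistic and a simulated skewed sample (the vector x and the choice of statistic are illustrative assumptions, not part of the attrition example):
# minimal sketch: bootstrap standard error of a sample median
# (x is simulated data for illustration only)
set.seed(1)
x <- rlnorm(100)   # a skewed sample of n = 100 observations
B <- 1000
theta_star <- replicate(B, median(sample(x, replace = TRUE)))
# the standard deviation of the B bootstrapped medians
# approximates the standard error of the sample median
sd(theta_star)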
The first objective in bootstrapping is to create your B bootstrap samples. We can use rsample::bootstraps to create an object that contains B resamples of the original data.
# reproducibility
set.seed(124)
# create 10 bootstrap samples
(bt_samples <- bootstraps(attrition, times = 10))
## # Bootstrap sampling
## # A tibble: 10 x 2
## splits id
## <list> <chr>
## 1 <S3: rsplit> Bootstrap01
## 2 <S3: rsplit> Bootstrap02
## 3 <S3: rsplit> Bootstrap03
## 4 <S3: rsplit> Bootstrap04
## 5 <S3: rsplit> Bootstrap05
## 6 <S3: rsplit> Bootstrap06
## 7 <S3: rsplit> Bootstrap07
## 8 <S3: rsplit> Bootstrap08
## 9 <S3: rsplit> Bootstrap09
## 10 <S3: rsplit> Bootstrap10
The bootstrap samples are stored in a data-frame-like tibble object where each bootstrap sample is nested in the splits column. We can access each bootstrap sample just as we would access parts of a list. Here, we access the first bootstrap sample stored in bt_samples$splits[[1]]. The <1470/530/1470> output indicates that 1,470 observations are in the bootstrap sample, 530 observations from the original data were never drawn into this bootstrap sample (which means that, to reach 1,470 observations, some observations were necessarily sampled two or more times), and the last 1470 indicates the total number of observations in the original data set. Remember, when performing bootstrap sampling, each bootstrap sample will always contain the same number of observations as the original data set. To extract the first bootstrap sample we can use rsample::analysis.
# observation split
bt_samples$splits[[1]]
## <1470/530/1470>
# extract the first bootstrap sample
analysis(bt_samples$splits[[1]]) %>% as_tibble()
## # A tibble: 1,470 x 31
## Age Attr… Busi… Dail… Depa… Dist… Educ… Educ… Envi… Gend… Hour… JobI…
## * <int> <fct> <fct> <int> <fct> <int> <ord> <fct> <ord> <fct> <int> <ord>
## 1 56 Yes Trav… 441 Rese… 14 Mast… Life… Medi… Fema… 72 High
## 2 32 No Trav… 859 Rese… 4 Bach… Life… High Fema… 98 Medi…
## 3 34 No Trav… 216 Sales 1 Mast… Mark… Medi… Male 75 Very…
## 4 34 No Trav… 1111 Sales 8 Coll… Life… High Fema… 93 High
## 5 39 Yes Trav… 1162 Sales 3 Coll… Medi… Very… Fema… 41 High
## 6 46 No Trav… 1009 Rese… 2 Bach… Life… Low Male 51 High
## 7 53 No Trav… 1223 Rese… 7 Coll… Medi… Very… Fema… 50 High
## 8 50 No Trav… 939 Rese… 24 Bach… Life… Very… Male 95 High
## 9 41 No Trav… 337 Sales 8 Bach… Mark… High Fema… 54 High
## 10 60 No Trav… 422 Rese… 7 Bach… Life… Low Fema… 41 High
## # ... with 1,460 more rows, and 19 more variables: JobLevel <int>, JobRole
## # <fctr>, JobSatisfaction <ord>, MaritalStatus <fctr>, MonthlyIncome
## # <int>, MonthlyRate <int>, NumCompaniesWorked <int>, OverTime <fctr>,
## # PercentSalaryHike <int>, PerformanceRating <ord>,
## # RelationshipSatisfaction <ord>, StockOptionLevel <int>,
## # TotalWorkingYears <int>, TrainingTimesLastYear <int>, WorkLifeBalance
## # <ord>, YearsAtCompany <int>, YearsInCurrentRole <int>,
## # YearsSinceLastPromotion <int>, YearsWithCurrManager <int>
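As an aside, the 530 out-of-bag observations (those never drawn into this bootstrap sample) can be retrieved with rsample::assessment; checking the row count against the split notation above is a quick way to verify the sampling (output not shown):
# out-of-bag observations for the first bootstrap sample
assessment(bt_samples$splits[[1]]) %>% nrow()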
So we have successfully created bootstrap samples and we know how to access each sample. Now it's time to apply a function iteratively over each bootstrap sample to compute our desired statistic.
Assume we wanted to understand the difference in income for employees that churn versus those that do not. As discussed in the multivariate statistical inference tutorial we can use the two-sample t-test to compare the difference in mean income. But what if we want to compare the difference in median income? Or the difference in the 95th percentile income?
There is no simple test statistic to perform these comparisons; however, we can use bootstrapping to get a better understanding of these estimates and their standard errors. First, we’ll create a function that computes the difference in median monthly incomes between employees that churn versus those that do not.
statistic <- function(splits) {
  # extract the bootstrap sample from the rsplit object
  x <- analysis(splits)
  # difference in median monthly income: attrition vs. non-attrition
  median_yes <- median(x$MonthlyIncome[x$Attrition == "Yes"])
  median_no  <- median(x$MonthlyIncome[x$Attrition == "No"])
  median_yes - median_no
}
The above function takes an individual bootstrap sample (i.e., a split) and computes the difference in median income. Now let's create our bootstrap samples. There is no concrete number of bootstrap samples you should use, but a good rule of thumb is to use at least 500 when computing a measure of central tendency and at least 1,000 when computing an extreme value such as the 95th percentile.1 With today's modern computing power, my default is to use a minimum of 2,000.
set.seed(123)
bt_samples <- bootstraps(attrition, times = 2000)
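Before iterating over all 2,000 samples, we can sanity-check the function on a single split (an optional step; output not shown):
# apply the statistic to a single bootstrap sample
statistic(bt_samples$splits[[1]])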
We can now iterate over each bootstrap sample and apply our statistic. Here, we use map_dbl, which comes from the purrr package and is part of the tidyverse, to iterate over each sample and compute the difference in median incomes. The results show a slightly bimodal and skewed distribution.
# iterate over each bootstrap sample and compute statistic
bt_samples$wage_diff <- map_dbl(bt_samples$splits, statistic)
# plot distribution
ggplot(bt_samples, aes(x = wage_diff)) +
geom_line(stat = "density", adjust = 1.25) +
xlab("Difference in Median Monthly Income (Attrition minus Non-Attrition)")
We see that across all bootstrap samples, this statistic ranges from about -2500 to -1000. We can compute a confidence interval by taking percentiles of the bootstrap distribution. We see that the median difference is -$1,949 with a 90% confidence interval (bounded by the 5th and 95th percentiles) between -$2,355 and -$1,409.
quantile(bt_samples$wage_diff, probs = c(.05, .5, .95))
## 5% 50% 95%
## -2355.050 -1948.750 -1409.475
Consequently, we can be 90% confident that, on average, employees that churn are paid less than employees that do not churn (approximately $1,409 to $2,355 less).
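If a conventional 95% interval is preferred, the same vector of bootstrapped statistics supplies it via the 2.5th and 97.5th percentiles, and its standard deviation gives the bootstrap standard error (a short sketch; output not shown):
# bootstrap standard error: standard deviation of the 2,000 statistics
sd(bt_samples$wage_diff)
# a 95% percentile interval uses the 2.5th and 97.5th percentiles
quantile(bt_samples$wage_diff, probs = c(.025, .975))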
Bootstrapping has many useful applications, ranging from simple parameter estimates to being incorporated into modeling approaches (e.g., random forests). You can learn more about bootstrapping, and its application in R, with the following resources: