In regression analysis, bootstrapping is an efficient tool for statistical
inference. It approximates the sampling distribution of a statistic by
resampling the originally observed data with replacement [1]. The term
bootstrapping, introduced by Bradley Efron in his 1979 paper “Bootstrap
methods: another look at the jackknife”, comes from the cliché of ‘pulling
oneself up by one’s bootstraps’ [2]. In keeping with this idea, the sample data
are treated as if they were the population, and samples are repeatedly drawn
from them with replacement in order to make statistical inferences from the
original sample. The essential bootstrap analogy states that “the population is
to the sample as the sample is to the bootstrap samples” [2].

The bootstrap falls into two types, parametric and nonparametric. Parametric
bootstrapping assumes that the original data set is drawn from a specific
distribution, e.g. a normal distribution [2], and the bootstrap samples are
generally drawn at the same size as the original data set. Nonparametric
bootstrapping is the approach described at the start of this summary: it
repeatedly and randomly draws bootstrap samples of a given size from the
original data. According to our regression analysis lecture, bootstrapping is
quite useful in non-linear regression and generalized linear models. For small
sample sizes, the parametric bootstrap is preferred [2]; for large sample
sizes, the nonparametric bootstrap is preferred. For
a more detailed account of nonparametric bootstrapping, suppose a sample data
set $A = \{x_1, x_2, \ldots, x_k\}$ is randomly drawn from a population
$B = \{X_1, X_2, \ldots, X_K\}$, where $K$ is much larger than $k$. The
statistic $T = t(A)$ is an estimate of the corresponding population parameter
$P = t(B)$ [2]. Nonparametric bootstrapping estimates the sampling distribution
of a statistic empirically; no assumptions about the form of the population are
necessary. Next, a sample of size $k$ is drawn from the elements of $A$ with
replacement, denoted $A^*_1 = \{x^*_{11}, x^*_{12}, \ldots, x^*_{1k}\}$. The
asterisk distinguishes resampled data from the original data. Sampling with
replacement is essential, since otherwise every draw would simply reproduce the
original sample $A$. The resampling is repeated many times, typically one
thousand or ten thousand, a number that keeps growing as computing power
increases [1]. The statistic is computed for each bootstrap sample, and the
mean of these bootstrap estimates estimates the expectation of the bootstrapped
statistic. This mean minus $T$ is
the estimate of the bias of $T$. The variance of the bootstrap estimates $T^*$,
the bootstrap variance estimate, estimates the sampling variance of $T$.
Bootstrap confidence intervals can then be calculated using either the
bootstrap percentile interval approach or the normal-theory interval approach.
The bootstrap percentile method uses the empirical quantiles of the bootstrap
estimates as the confidence limits, which is written as
$T^*_{(\text{lower})} < P < T^*_{(\text{upper})}$. In more detail, it can be
written as $\hat{T} - (\hat{T}^*_{\text{upper}} - \hat{T}^*) \le P \le \hat{T}
+ (\hat{T}^* - \hat{T}^*_{\text{lower}})$ [2].
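The nonparametric procedure above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method of any particular package: the function names `bootstrap_replicates` and `percentile_interval`, the values in the sample `A`, the choice of the mean as the statistic t, and the (n + 1)α/2 rule for picking the ordered replicates are all assumptions made for the example.

```python
import random
import statistics


def bootstrap_replicates(sample, statistic, n_resamples=1000, seed=0):
    """Compute the statistic on n_resamples bootstrap samples of the data."""
    rng = random.Random(seed)
    k = len(sample)
    replicates = []
    for _ in range(n_resamples):
        # Draw k elements from the sample WITH replacement (one A*).
        resample = rng.choices(sample, k=k)
        replicates.append(statistic(resample))
    return replicates


def percentile_interval(replicates, conf=0.95):
    """Percentile interval: empirical quantiles of the ordered replicates."""
    ordered = sorted(replicates)
    n = len(ordered)
    alpha = 1.0 - conf
    # Assumed quantile rule: ordered replicates number (n + 1) * alpha / 2
    # and (n + 1) * (1 - alpha / 2), clamped to the valid index range.
    lower_idx = max(0, int((n + 1) * alpha / 2) - 1)
    upper_idx = min(n - 1, int((n + 1) * (1 - alpha / 2)) - 1)
    return ordered[lower_idx], ordered[upper_idx]


# Hypothetical sample A (values invented for illustration).
A = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.9, 3.7, 2.5]
T = statistics.mean(A)                      # T = t(A)

reps = bootstrap_replicates(A, statistics.mean, n_resamples=2000, seed=42)
boot_mean = statistics.mean(reps)           # estimates E*(T*)
bias = boot_mean - T                        # bootstrap estimate of the bias of T
variance = statistics.variance(reps)        # estimates the sampling variance of T
ci_low, ci_high = percentile_interval(reps, conf=0.95)

print(f"T = {T:.3f}, bias = {bias:.4f}, SE = {variance ** 0.5:.3f}")
print(f"95% percentile interval: ({ci_low:.3f}, {ci_high:.3f})")
```

Fixing the seed makes the run reproducible; in practice the number of resamples would be chosen large enough that the interval limits stop changing noticeably.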
Bootstrapping is an effective method for double-checking the stability of model
estimation results. Bootstrap intervals are often more reliable than intervals
calculated from the sample variance under a normality assumption, especially
when that assumption is doubtful. Simplicity is another important benefit of
bootstrapping. For complicated estimators of complex parameters of the
distribution, such as correlation coefficients and percentile points, it is a
straightforward way to generate estimates of confidence intervals and standard
errors. However, this simplicity is also a disadvantage, because it makes the
important assumptions behind bootstrapping easy to neglect [1]. In addition,
bootstrap results are often over-optimistic, and the method provides no general
finite-sample guarantees [1].
There are several bootstrapping schemes for regression problems. One typical
approach is to resample the residuals of the regression model. The procedure is
as follows. First, fit the model to the original data set to obtain the
parameter estimates $\hat{\beta}$ and the residuals $\hat{\varepsilon}$.
Second, randomly and repeatedly resample the residuals with replacement
(typically 1000 or 10000 times) to get $K$ sets of residuals of size $k$, and
add each resampled residual to the corresponding fitted value to generate the
bootstrapped responses $Y^*$. Finally, refit the model to each bootstrapped
$Y^*$ to obtain the bootstrap estimates $\hat{\beta}^*$ [2].
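The three steps of residual resampling can be sketched as follows for a simple linear regression. This is a hedged illustration, not the implementation used in any cited source: the helper names `fit_line` and `residual_bootstrap` and the data values are hypothetical, and ordinary least squares with one predictor is assumed purely to keep the example short.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def residual_bootstrap(x, y, n_resamples=1000, seed=0):
    """Resample residuals, rebuild Y*, and refit the model each time."""
    rng = random.Random(seed)
    # Step 1: fit the original data, keep fitted values and residuals.
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    estimates = []
    for _ in range(n_resamples):
        # Step 2: resample residuals with replacement, add to fitted values.
        e_star = rng.choices(residuals, k=len(residuals))
        y_star = [fi + ei for fi, ei in zip(fitted, e_star)]
        # Step 3: refit the model on (x, Y*).
        estimates.append(fit_line(x, y_star))
    return (b0, b1), estimates


# Hypothetical data (values invented for illustration).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.6, 2.9, 3.7, 3.9, 4.4, 5.2, 5.3, 6.1, 6.4, 7.1]

(intercept, slope), boot_estimates = residual_bootstrap(x, y, 1000, seed=1)
boot_slopes = [b1 for _, b1 in boot_estimates]
slope_se = statistics.stdev(boot_slopes)    # bootstrap standard error of the slope

print(f"slope = {slope:.3f}, bootstrap SE = {slope_se:.3f}")
```

Note that the predictor values stay fixed across resamples in this scheme; only the responses change.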
Another typical approach in the regression context is random-x resampling,
also called case resampling [2]. We can either apply a Monte Carlo algorithm,
which repeatedly resamples the data with replacement at the same size as the
original data set, or enumerate every possible resample of the data set [2].
In our case, before fitting the regression model, the original
predictor-response pairs $(x_i, y_i)$, for $i = 1, 2, \ldots, k$, are resampled
with replacement to get $K$ new data sets of size $k$. The regression model is
then fit to each of these $K$ data sets, and the bootstrap estimates
$\hat{\beta}^*$ are the resulting $K$ parameter estimates.
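Case resampling differs from residual resampling only in what is drawn: whole (x, y) pairs rather than residuals. A minimal sketch of the Monte Carlo variant is given below. The names `fit_line` and `case_bootstrap` and the data are hypothetical, and the guard that redraws a resample containing a single distinct x value is an assumption added so that the slope is always defined.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def case_bootstrap(x, y, n_resamples=1000, seed=0):
    """Resample (x, y) pairs with replacement and refit the model each time."""
    rng = random.Random(seed)
    pairs = list(zip(x, y))
    estimates = []
    while len(estimates) < n_resamples:
        # Draw k pairs with replacement; each pair stays intact.
        sample = rng.choices(pairs, k=len(pairs))
        xs = [p[0] for p in sample]
        ys = [p[1] for p in sample]
        if len(set(xs)) < 2:
            continue  # slope undefined for a single distinct x; redraw
        estimates.append(fit_line(xs, ys))
    return estimates


# Hypothetical data (values invented for illustration).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.6, 2.9, 3.7, 3.9, 4.4, 5.2, 5.3, 6.1, 6.4, 7.1]

_, slope = fit_line(x, y)
case_estimates = case_bootstrap(x, y, n_resamples=1000, seed=1)
case_slopes = [b1 for _, b1 in case_estimates]
case_slope_se = statistics.stdev(case_slopes)

print(f"slope = {slope:.3f}, case-resampling SE = {case_slope_se:.3f}")
```

Because the predictor values vary from resample to resample, this scheme treats x as random, which is where the name random-x resampling comes from.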
In the next section, I review the nonparametric bootstrapping package in R with
some examples from my research area, population pharmacokinetic analysis, which
uses differential-equation models and statistics to describe how the human body
handles drugs. In R, the package “boot” provides facilities for bootstrapping
either a single statistic or a vector of statistics. To run the boot() function
in the boot library, there are 3 necessary parameters [3]:
1) data, which can be a vector, matrix, or data frame to be resampled [3];
2) statistic, the function that produces the statistic to be bootstrapped. This
function should take the data set and an indices parameter, which selects the
cases for each resample [3];
3) R, the number of bootstrap resamples [3].
The function boot() calls the statistic function R times. In each call, it
generates a set of random indices, drawn with replacement, to select a sample.
The statistics calculated for each sample are then collected in the returned
object, here called bootobject. So the function boot() is used as
bootobject <- boot(data= , statistic= , R= , ...) [3]. After checking that the
plot of the results, e.g. from plot(bootobject), looks satisfactory, we use
boot.ci(bootobject, conf= , type= ) to get confidence intervals [3].
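To make the role of the indices parameter concrete, the calling pattern can be imitated outside R. The Python sketch below is not the boot package and does not reproduce its behavior in detail: `boot_like`, `median_stat`, and the data values are hypothetical names and numbers chosen for illustration, and only the convention that the statistic receives the data plus a vector of resampled indices is taken from the description above. The keys `t0` and `t` mirror the components of the same names in an R boot object.

```python
import random
import statistics


def boot_like(data, statistic, R, seed=0):
    """Hypothetical stand-in for the calling pattern of R's boot()."""
    rng = random.Random(seed)
    n = len(data)
    # Statistic on the original data: indices 0..n-1, each used once.
    t0 = statistic(data, list(range(n)))
    t = []
    for _ in range(R):
        # One set of random indices, drawn with replacement.
        indices = [rng.randrange(n) for _ in range(n)]
        t.append(statistic(data, indices))
    return {"t0": t0, "t": t}


def median_stat(data, indices):
    """User-supplied statistic: it must select cases via the indices."""
    return statistics.median([data[i] for i in indices])


# Hypothetical data (values invented for illustration).
data = [4.1, 5.6, 3.8, 7.2, 6.0, 5.1, 4.7, 6.6, 5.9]
bootobject = boot_like(data, median_stat, R=1000, seed=7)

print("original statistic:", bootobject["t0"])
print("number of replicates:", len(bootobject["t"]))
```

The point of the design is that the resampling routine never needs to know what the statistic is; it only hands over indices, so the same routine works for a mean, a median, or a regression coefficient.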
Bootstrapping is widely used in the population analysis of clinical trials in
the pharmaceutical and biotech industries, where it is a useful tool for
assessing and controlling the stability of a model analysis. A good example is
the use of bootstrapping to validate a population pharmacokinetic (PK) model
for Triptan, a vasopressor used for the acute treatment of migraine attacks
[5]. A single oral dose of 50 mg was given to 26 healthy Korean male subjects.
Plasma samples were collected pre-dose and at 0.25, 0.5, 0.75, 1, 1.5, 2, 2.5,
3, 4, 6, 8, 10, and 12 h post-dose [5]. Population PK analysis of Triptan was
performed on the plasma concentration data with NONMEM, the software we use to
build models from differential equations. A total of 364 plasma concentration
observations were successfully described by a one-compartment model with
first-order absorption with lag time, first-order elimination, and a combined
transit compartment [5]. The model scheme is shown in Figure 1 below:
Figure 1: The scheme of the final PK model of Triptan [5]
The final model was validated with a 1000-replicate bootstrap, in which 1000
datasets were resampled from the original dataset [5]. The median and 90%
confidence intervals of all the PK parameters are shown in Table 1 together
with the final parameter estimates.
Table 1: NONMEM estimated parameters and bootstrap results [5]
Results from the visual predictive check with 1000 simulations were assessed by
visually comparing the gray area, which is the 90% prediction interval from the
simulated data, with an overlay of the circled raw data. Markedly more than the
expected 10% of observations falling outside the gray area would indicate that
the estimates are not reliable.
Figure 2: Visual predictive check plot of the model from time 0 to 12 h after a
single oral administration of 50 mg Triptan. Circles represent the raw data,
the gray area is the 90% prediction interval of the 1000 simulations, and the
solid lines are the 5th, median, and 95th percentiles of the observed
concentrations [5].
Our conclusion is that the final model and its estimated parameters were
sufficiently robust and stable, as assessed by bootstrapping. All estimated
parameters from the final model were within the 95% bootstrap confidence
intervals.