bootstrap

Bootstrapping - how to assess reliability of partitions given in a tree.

Baron Karl Friedrich Hieronymus von Münchhausen

Bootstrapping is one of the most popular ways to assess the reliability of branches. The term bootstrapping goes back to Baron Münchhausen, who is said to have pulled himself out of a swamp by his own shoelaces. Briefly, positions (columns) of the multiple sequence alignment are randomly sampled with replacement. The sampled positions are assembled into new data sets, the so-called bootstrapped samples, each with the same number of positions as the original alignment. Each position has about a 63% chance of appearing at least once in a particular bootstrapped sample. If a grouping has a lot of support, it will be supported by at least some positions in each of the bootstrapped samples, and all the bootstrapped samples will yield this grouping. Bootstrapping can be applied to all methods of phylogenetic reconstruction.
Bootstrapping thus realizes the impossible: the evolution of sequences in real life happened only once, and it is impossible to run the evolution of, let's say, small subunit ribosomal RNAs again. Nevertheless, using the resampling approach, pseudosamples are generated whose variation resembles the variation one would have obtained if it were possible to sample 100 or 1000 parallel worlds in which the evolution of 16S rRNAs occurred over and over again. You end up with a statistical analysis that uses only a single original sample.
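The figure of about 63% can be checked with a short calculation. The following is a sketch, assuming an alignment of N positions from which N positions are drawn independently and uniformly with replacement (the symbol N and the event notation are introduced here for illustration):

```latex
% Probability that a given position i is missed by a single draw: 1 - 1/N.
% Probability that it is missed by all N independent draws: (1 - 1/N)^N.
\[
  P(\text{position } i \text{ appears at least once})
  = 1 - \left(1 - \frac{1}{N}\right)^{N}
  \;\xrightarrow{\;N \to \infty\;}\;
  1 - e^{-1} \approx 0.632
\]
% For a short alignment with N = 6:
\[
  1 - \left(\frac{5}{6}\right)^{6} \approx 0.665
\]
```

For short alignments the chance is slightly higher than 63%, and it approaches 63.2% as the alignment gets longer.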

Bootstrapping has become very popular to assess the reliability of reconstructed phylogenies. Its advantage is that it can be applied to different methods of phylogenetic reconstruction, and that it assigns a probability-like number to every possible partition of the dataset (= branch in the resulting tree). Its disadvantage is that the support for individual groups decreases as you add more sequences to the dataset, and that it only measures how much support for a partition is in your data given a method of analysis. If the method of reconstruction falls victim to a bias or an artifact, the same error will be reproduced in each of the bootstrapped samples, and the incorrect grouping will receive high bootstrap support values.
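As an illustration of how such a probability-like number can be obtained, the sketch below counts in what fraction of the bootstrap trees each partition occurs. It assumes a hypothetical representation that is not taken from any particular program: each tree is given as a set of partitions, and each partition is written as the frozenset of taxa on its smaller side. The function name split_support and the four example trees are made up for this example.

```python
from collections import Counter


def split_support(replicate_trees):
    """Return, for each partition, the fraction of replicate trees containing it."""
    counts = Counter()
    for splits in replicate_trees:
        # set() guards against a partition being listed twice for one tree.
        counts.update(set(splits))
    total = len(replicate_trees)
    return {split: n / total for split, n in counts.items()}


# Hypothetical trees from four bootstrapped samples (assumption: a real analysis
# would read these from the output of the tree reconstruction program).
trees = [
    {frozenset({"Alpha", "Beta"}), frozenset({"Delta", "Epsilon"})},
    {frozenset({"Alpha", "Beta"}), frozenset({"Gamma", "Delta"})},
    {frozenset({"Alpha", "Beta"}), frozenset({"Delta", "Epsilon"})},
    {frozenset({"Alpha", "Gamma"}), frozenset({"Delta", "Epsilon"})},
]

support = split_support(trees)
for split, value in sorted(support.items(), key=lambda item: -item[1]):
    print(sorted(split), f"{value:.0%}")
```

In this made-up example the grouping of Alpha with Beta occurs in three of the four trees and would be reported with 75% support. The count says nothing about whether the reconstruction method itself is biased.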

For information on bootstrapping and non-informative sites go here.

Creating a bootstrapped sample

Joe Felsenstein describes the bootstrap procedure in his manual to the seqboot program (part of the PHYLIP package, the manual is here, the citations here) as follows:

The bootstrap. Bootstrapping was invented by Bradley Efron in 1979, and its use in phylogeny estimation was introduced by me (Felsenstein, 1985b; see also Penny and Hendy, 1985). It involves creating a new data set by sampling N characters randomly with replacement, so that the resulting data set has the same size as the original, but some characters have been left out and others are duplicated. The random variation of the results from analyzing these bootstrapped data sets can be shown statistically to be typical of the variation that you would get from collecting new data sets. The method assumes that the characters evolve independently, an assumption that may not be realistic for many kinds of data.

The sample input and output of the seqboot program illustrate the generation of the bootstrapped samples:

TEST DATA SET

    5    6
Alpha     AACAAC
Beta      AACCCC
Gamma     ACCAAC
Delta     CCACCA
Epsilon   CCAAAC

CONTENTS OF OUTPUT FILE

(If Replicates are set to 10 and seed to 4333)

    5    6
Alpha     ACAAAC
Delta     CACCCA
Gamma     ACAAAC
Beta      ACCCCC
Epsilon   CAAAAC
    5    6
Alpha     AACAAC
Beta      AACCCC
Epsilon   CCAAAC
Delta     CCACCA
Gamma     CCCAAC
    5    6
Delta     CAACCC
Beta      ACCCCC
Gamma     ACCAAA
Alpha     ACCAAA
Epsilon   CAAAAA
    5    6
Alpha     AAAACA
Beta      AAAACC
Gamma     AAACCA
Delta     CCCCAC
Epsilon   CCCCAA
    5    6
Beta      ACCCCC
Epsilon   CAAACC
Delta     CCCCAA
Gamma     AAAACC
Alpha     AAAACC
    5    6
Gamma     CCAACC
Alpha     ACAACC
Epsilon   CAAACC
Delta     CACCAA
Beta      ACCCCC
    5    6
Alpha     AAACAA
Delta     CCCACC
Epsilon   CCCAAA
Gamma     AACCAA
Beta      AAACCC
    5    6
Alpha     AAAACC
Delta     CCCCAA
Beta      CCCCCC
Epsilon   AAAACC
Gamma     AAAACC
    5    6
Beta      AAAAAC
Alpha     AAAAAC
Gamma     AACCCC
Delta     CCCCCA
Epsilon   CCCCCC
    5    6
Delta     CCCCAA
Epsilon   CCAACC
Gamma     AAAACC
Alpha     AAAACC
Beta      AACCCC
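The resampling step itself is simple enough to sketch in a few lines of Python. The code below is an illustration only, not the seqboot implementation: it draws column indices with replacement from the test data set above and assembles ten bootstrapped samples. The function name bootstrap_alignment is made up for this example. Python's random number generator is not the one used by seqboot, so even with the seed 4333 the samples will differ from those listed above.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrapped sample: whole columns drawn with replacement."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    # Draw as many column indices as the alignment has positions; an index
    # may be drawn several times or not at all.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same columns are used for every sequence, so positions stay aligned.
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


# The test data set from the seqboot documentation.
alignment = {
    "Alpha": "AACAAC",
    "Beta": "AACCCC",
    "Gamma": "ACCAAC",
    "Delta": "CCACCA",
    "Epsilon": "CCAAAC",
}

rng = random.Random(4333)  # fixed seed so that the run is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(10)]

for rep in replicates:
    print(f"{len(rep):5d}{len(rep['Alpha']):5d}")
    for name, seq in rep.items():
        print(f"{name:<10}{seq}")
```

Every column of every sample is a column of the original alignment. Only the number of times each column appears changes from sample to sample.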