Trees with CLUSTALW

Besides aligning sequences, Clustal also includes programs to calculate distance trees. The trees generated by clustalw certainly have their limitations; however, if one is aware of these limitations, the program is extremely useful for initial exploration.

Trees are calculated from a corrected or uncorrected distance matrix using the neighbor joining method. This method does not search for the tree that optimizes some criterion; instead, it builds a single tree with a much faster stepwise algorithm.
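
To give a feeling for why neighbor joining is fast, here is a minimal Python sketch of a single joining step: it picks the pair of taxa to join first and computes the branch lengths to the new internal node. The function name nj_pick_pair and the five-taxon distance matrix are made up for illustration; this is not clustalw's code, and a full implementation would repeat the step on a shrinking matrix until the tree is resolved.

```python
def nj_pick_pair(names, dist):
    """Return the pair of taxa that neighbor joining joins first,
    together with the branch lengths from each taxon to the new node.

    names: list of taxon labels
    dist:  symmetric matrix (list of lists) of pairwise distances
    """
    n = len(names)
    if n < 3:
        raise ValueError("need at least three taxa")
    # Row sums: total distance from each taxon to all others.
    totals = [sum(row) for row in dist]

    best = None
    for i in range(n):
        for j in range(i + 1, n):
            # Q is small when i and j are close to each other
            # but far from everything else.
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)

    _, i, j = best
    # Branch lengths from i and j to the new internal node.
    len_i = dist[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    len_j = dist[i][j] - len_i
    return names[i], names[j], len_i, len_j


# Hypothetical example distances, not taken from any real alignment.
NAMES = ["a", "b", "c", "d", "e"]
DIST = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

if __name__ == "__main__":
    print(nj_pick_pair(NAMES, DIST))
```

For this matrix the step joins a and b, with branch lengths 2.0 and 3.0. Note that in this sketch ties in Q are resolved by input order, because only a strictly smaller value replaces the current best pair.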

Several parameters that you can choose in clustalw influence tree building.

 The choice of substitution matrix, and of other alignment parameters

 You can ignore all positions that contain a gap in any of the sequences

 You can correct for multiple substitutions
(In a perfect world, you would use the actual number of substitutions that occurred during evolution, not the number of sites that differ between two sequences.)
Later in the course we will discuss other methods for distance correction; however, all things considered, clustalw does quite well.
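
The last two options can be sketched in a few lines of Python: drop every column that contains a gap, count the fraction of differing sites, and then correct that fraction for multiple substitutions. The correction shown is Kimura's empirical formula for protein distances; that clustalw applies exactly this formula is an assumption here, and the function names and toy sequences are made up for illustration.

```python
import math


def strip_gapped_columns(seqs, gap="-"):
    """Drop every alignment column in which any sequence has a gap."""
    keep = [col for col in zip(*seqs) if gap not in col]
    if not keep:
        return ["" for _ in seqs]
    return ["".join(chars) for chars in zip(*keep)]


def p_distance(a, b):
    """Fraction of compared sites at which two aligned sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned and non-empty")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def kimura_protein_distance(p):
    """Kimura's empirical correction: estimated substitutions per site
    from the observed fraction of differing sites (assumed formula)."""
    arg = 1.0 - p - 0.2 * p * p
    if arg <= 0:
        return float("inf")  # too diverged for the correction to apply
    return -math.log(arg)


if __name__ == "__main__":
    # Toy alignment, invented for this example.
    aligned = ["AC-GT", "ACAGA", "TCAGT"]
    ungapped = strip_gapped_columns(aligned)
    p = p_distance(ungapped[0], ungapped[1])
    print(ungapped, p, kimura_protein_distance(p))
```

The corrected distance is always at least as large as the observed one, and the gap between the two grows as the sequences become more different.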

Clustalw also provides possibilities for bootstrapping:

Bootstrapping - how to assess reliability of partitions given in a tree.

Baron Karl Friedrich Hieronymus von Münchhausen

Bootstrapping is one of the most popular ways to assess the reliability of branches.  The term bootstrapping goes back to Baron Münchhausen, who claimed to have pulled himself out of a swamp by his own bootstraps (by his own hair, in the original tale). Briefly, positions of the aligned sequences are sampled randomly, with replacement, from the multiple sequence alignment.  The sampled positions are assembled into new data sets, the so-called bootstrapped samples.  Each position has a roughly 63% chance of making it into a particular bootstrapped sample.  If a grouping has a lot of support, it will be supported by at least some positions in each of the bootstrapped samples, and all the bootstrapped samples will yield this grouping. Bootstrapping can be applied to all methods of phylogenetic reconstruction.
Bootstrapping thus realizes the impossible: the evolution of sequences in real life happened only once, and it is impossible to run the evolution of, let's say, small subunit ribosomal RNAs again. Nevertheless, using the resampling approach, pseudosamples are generated whose variation resembles the variation one would have obtained if it were possible to sample 100 or 1000 parallel worlds in which the evolution of 16S rRNAs occurred over and over again. You end up with a statistical analysis based on only a single original sample.
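
Where does the figure of about 63% come from? With N positions in the alignment, a single draw misses a given position with probability 1 - 1/N, so N independent draws all miss it with probability (1 - 1/N)^N, which approaches 1/e for large N. The short Python sketch below evaluates this; the function name is made up for illustration.

```python
import math


def inclusion_probability(n_positions):
    """Chance that a given alignment position appears at least once
    in a bootstrapped sample of the same length."""
    if n_positions < 1:
        raise ValueError("alignment must have at least one position")
    # Each of the n draws misses the position with probability 1 - 1/n.
    return 1.0 - (1.0 - 1.0 / n_positions) ** n_positions


if __name__ == "__main__":
    for n in (6, 100, 1000):
        print(n, round(inclusion_probability(n), 3))
    # Limit for very long alignments: 1 - 1/e
    print("limit:", round(1.0 - math.exp(-1.0), 3))
```

For very short alignments the chance is a little higher than the limit; for alignments of realistic length it is essentially 1 - 1/e.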

Bootstrapping has become very popular to assess the reliability of reconstructed phylogenies. Its advantage is that it can be applied to different methods of phylogenetic reconstruction, and that it assigns a probability-like number to every possible partition of the dataset (= branch in the resulting tree). Its disadvantages are that the support for individual groups decreases as you add more sequences to the dataset, and that it only measures how much support for a partition is in your data given a method of analysis. If the method of reconstruction falls victim to a bias or an artifact, this will be reproduced in every one of the bootstrapped samples, and it will result in high bootstrap support values.
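
The "probability-like number" is simply the percentage of bootstrapped trees in which a grouping appears. A minimal Python sketch, assuming each tree has already been reduced to a set of groupings (each grouping a frozenset of sequence names); the function name and the four replicate trees are invented for illustration. Strictly speaking, a branch in an unrooted tree defines a bipartition, and representing it by one side only is a simplification.

```python
def bootstrap_support(reference_groups, replicate_trees):
    """Percentage of replicate trees that contain each reference grouping.

    reference_groups: groupings (frozensets of names) found in the
                      tree calculated from the original alignment
    replicate_trees:  one set of groupings per bootstrapped sample
    """
    n = len(replicate_trees)
    if n == 0:
        raise ValueError("need at least one replicate tree")
    support = {}
    for group in reference_groups:
        hits = sum(1 for tree in replicate_trees if group in tree)
        support[group] = 100.0 * hits / n
    return support


# Hypothetical groupings; not derived from any real analysis.
AB = frozenset({"Alpha", "Beta"})
DE = frozenset({"Delta", "Epsilon"})
AG = frozenset({"Alpha", "Gamma"})

REPLICATES = [
    {AB, DE},
    {AB, DE},
    {AG, DE},
    {AB, DE},
]

if __name__ == "__main__":
    for group, value in bootstrap_support([AB, DE], REPLICATES).items():
        print(sorted(group), value)
```

Note that the count says nothing about whether a grouping is correct: a grouping produced by a systematic artifact will be found in every replicate and score 100.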

For information on bootstrapping and non-informative sites go here.

 

 

Creating a bootstrapped sample

Joe Felsenstein describes the bootstrap procedure in his manual to the seqboot program (part of the PHYLIP package, the manual is here, the citations here) as follows:

The bootstrap. Bootstrapping was invented by Bradley Efron in 1979, and its use in phylogeny estimation was introduced by me (Felsenstein, 1985b; see also Penny and Hendy, 1985). It involves creating a new data set by sampling N characters randomly with replacement, so that the resulting data set has the same size as the original, but some characters have been left out and others are duplicated. The random variation of the results from analyzing these bootstrapped data sets can be shown statistically to be typical of the variation that you would get from collecting new data sets. The method assumes that the characters evolve independently, an assumption that may not be realistic for many kinds of data.

The sample input and output of the seqboot program illustrate the generation of the bootstrapped samples:


TEST DATA SET

 
    5    6
Alpha     AACAAC
Beta      AACCCC
Gamma     ACCAAC
Delta     CCACCA
Epsilon   CCAAAC


CONTENTS OF OUTPUT FILE

(If Replicates are set to 10 and seed to 4333)

 
    5     6
Alpha        ACAAAC
Delta        CACCCA
Gamma        ACAAAC
Beta         ACCCCC
Epsilon      CAAAAC
    5     6
Alpha        AACAAC
Beta         AACCCC
Epsilon      CCAAAC
Delta        CCACCA
Gamma        CCCAAC
    5     6
Delta        CAACCC
Beta         ACCCCC
Gamma        ACCAAA
Alpha        ACCAAA
Epsilon      CAAAAA
    5     6
Alpha        AAAACA
Beta         AAAACC
Gamma        AAACCA
Delta        CCCCAC
Epsilon      CCCCAA
    5     6
Beta         ACCCCC
Epsilon      CAAACC
Delta        CCCCAA
Gamma        AAAACC
Alpha        AAAACC
    5     6
Gamma        CCAACC
Alpha        ACAACC
Epsilon      CAAACC
Delta        CACCAA
Beta         ACCCCC
    5     6
Alpha        AAACAA
Delta        CCCACC
Epsilon      CCCAAA
Gamma        AACCAA
Beta         AAACCC
    5     6
Alpha        AAAACC
Delta        CCCCAA
Beta         CCCCCC
Epsilon      AAAACC
Gamma        AAAACC
    5     6
Beta         AAAAAC
Alpha        AAAAAC
Gamma        AACCCC
Delta        CCCCCA
Epsilon      CCCCCC
    5     6
Delta        CCCCAA
Epsilon      CCAACC
Gamma        AAAACC
Alpha        AAAACC
Beta         AACCCC
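
The resampling step itself is easy to sketch in Python. The code below draws as many column indices as the alignment has positions, with replacement, and applies the same indices to every sequence so that columns stay intact. It is a sketch of the idea under these assumptions, not a reimplementation of seqboot: it uses Python's own random number generator, so the samples will not match the output listed above even with the same seed, and it leaves the order of the sequences unchanged. The function name is made up for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping sequence name -> aligned sequence
    rng:       random.Random instance (seed it for reproducibility)
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all sequences must have the same length")
    # Draw `length` column indices with replacement; the same indices
    # are used for every sequence so that columns stay intact.
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][i] for i in picks)
            for name in names}


# The test data set from the seqboot example.
TEST_DATA = {
    "Alpha": "AACAAC",
    "Beta": "AACCCC",
    "Gamma": "ACCAAC",
    "Delta": "CCACCA",
    "Epsilon": "CCAAAC",
}

if __name__ == "__main__":
    rng = random.Random(4333)  # seed chosen only to echo the example
    for _ in range(3):
        sample = bootstrap_alignment(TEST_DATA, rng)
        for name, seq in sample.items():
            print(f"{name:<10}{seq}")
        print()
```

Every column of a bootstrapped sample is a column of the original alignment; some original columns appear more than once and others not at all.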

 

Problems with clustalw:

 The input order in analyzing the bootstrapped samples is not randomized; therefore, if you have no phylogenetic information at all, you get 100% bootstrap values.
LOOK AT YOUR ALIGNMENTS CAREFULLY!
Or: "From junk comes junk!"

 If branch lengths differ greatly, long branches tend to attract each other, even if a "molecular clock" is running.

 

TREEVIEW

To view trees generated by clustalw, you can use treeview from Rod Page.

The program should already be installed on your PCs. The program is extremely user friendly. Trees generated can be copied and pasted into Microsoft Word, and the labels can be rearranged after double clicking on the imported image.

 Several programs are available that, among other things, calculate distance matrices (some with more sophisticated corrections than those available in clustal).  You can use Joe Felsenstein’s program Neighbor.exe to calculate neighbor joining trees from the distance matrices.  A PC version of the program is here, source code and executables are available through the Phylip homepage.

 

 Assignments

  1. Align the sequences contained in testseq1.txt using clustalx. Set the option to put support values on the nodes (not the branches). Calculate a neighbor joining tree AND perform a bootstrap analysis (trees menu, bootstrap NJ tree). Load the trees into treeview. In treeview, toggle between the different display options (buttons on top of the tree window). Go to Tree and define the outgroup as Sulfolobus and Thermococcus. Then use the outgroup to root the tree (same menu).
    Does the tree correspond to your expectations? (What is your expectation?)

Note: you can also use treeview to edit a tree. This comes in handy when you need to generate usertrees for other programs. Try it out!!
In treeview, copy a tree onto the computer's clipboard and paste it into an MS Word document.

Try to save the tree as a Windows metafile and import it (Insert picture) into MS Word.

njplot is an excellent alternative to treeview. It is distributed as part of the treeview package, and it has an excellent user interface to re-root trees. Load the trees you generated into the njplot program. (Follow the oral instructions on how to start the program.)

  2. The sequences in testseq1.txt (V/A-ATPase catalytic subunits) are quite similar to one another. To test the effect of long branches, I added a homologous, but only distantly related sequence to this file (the ATPase involved in flagellar assembly from Salmonella). The resulting file is testseq1b.txt.

 Align the sequences and calculate bootstrapped trees for this file using the possible permutations of gaps/ no gaps, and with and without correction for multiple substitutions.

 Which of the resulting trees appears to best reflect the actual evolution?

 Give a justification for your choice.

 What might be the reason that the other options worked less well?

 What do you expect to happen when you replace the Salmonella sequence with a completely (?) unrelated sequence (testseq1c.txt)?

 Is your expectation confirmed?

 

Discussion of Results

Analysis of trees obtained with testseq1b: In my opinion, the best trees were obtained with correction for multiple substitutions turned on. Without correction for multiple substitutions, the two longest branches (flSalmonella and Borrelia) group together and the group of the two yeasts is broken up by the Neurospora sequence. Excluding the positions with gaps resulted in a slight improvement (the yeasts go together), and the bootstrap values for the branches that are supported by other evidence were higher, whereas questionable groupings appropriately received little support.

Analysis of trees obtained with testseq1c: The Synechococcus sequence is not homologous to any of the other sequences. Accordingly, the distance correction does not work for all instances. With positions that contain gaps excluded, the sequence groups with the longest branch (long branch attraction); with these positions included, it groups with the Drosophila sequence, probably because the amino-terminal ends of the two sequences match up.

 Exclusion of positions with gaps gets rid of a lot of noise (these regions are usually least conserved), and of instances of convergent gap formation (some other programs handle this problem with more alternatives).

 Multiple substitutions occur, thus it is a good thing to take this into consideration when calculating distances.

 

3. Jalview is an excellent JAVA applet to inspect and edit multiple sequence alignments. It also allows inspection of protein space for the aligned sequences. This works surprisingly well. The Jalview Homepage contains a lot of additional information.
Start Jalview as a Java Web Start application.
Load the file ATPaseSU.unix.aln (here).
Explore the different coloring options (COLOUR menu). Which one seems to work best (i.e., is most meaningful)? Scroll through the alignment to a more conserved region to compare them.

Note: You can change/edit the alignment by pointing at an amino acid residue and dragging it to the right or left. Try it, but leave the sequences in an aligned state before you move on.

CALCULATE an AVERAGE DISTANCE TREE USING PID
Click somewhere in the resulting tree to color groups of related sequences in the same color.

CALCULATE the PRINCIPAL COMPONENT ANALYSIS.
In a principal component analysis, the new dimensions are calculated as linear combinations of the original dimensions, so that the greatest variance of any projection of the data set comes to lie on the first axis, the second greatest on the second axis, and so on for the following dimensions.
Can you find a higher dimension that breaks up the vacuolar ATPase A subunits into two clusters? (Their names start with A.)
Which of the A subunit sequences cluster together if you use this dimension (2, 3 and 5 worked for me)?
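
To make the idea of a principal component concrete, here is a minimal generic sketch in plain Python that finds the first axis (the direction of greatest variance) by power iteration on the covariance matrix and projects the data onto it. It is not Jalview's implementation; the function names and the toy points are made up for illustration, and the fixed starting vector is a simplifying assumption that can fail if it happens to be exactly perpendicular to the first axis.

```python
import math


def first_principal_axis(points, iterations=200):
    """Approximate the direction of greatest variance by power
    iteration on the covariance matrix of the points."""
    n, dim = len(points), len(points[0])
    if n < 2:
        raise ValueError("need at least two points")
    # Center the data on its mean.
    means = [sum(p[k] for p in points) / n for k in range(dim)]
    centered = [[p[k] - means[k] for k in range(dim)] for p in points]
    # Covariance matrix of the original dimensions.
    cov = [[sum(row[a] * row[b] for row in centered) / (n - 1)
            for b in range(dim)] for a in range(dim)]
    # Repeated multiplication pulls the vector toward the first axis.
    axis = [1.0] * dim  # assumed starting vector
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * axis[b] for b in range(dim))
               for a in range(dim)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0:
            break
        axis = [v / norm for v in nxt]
    return axis


def project(points, axis):
    """Coordinate of each (mean-centered) point along the axis,
    i.e. a linear combination of the original dimensions."""
    n, dim = len(points), len(axis)
    means = [sum(p[k] for p in points) / n for k in range(dim)]
    return [sum((p[k] - means[k]) * axis[k] for k in range(dim))
            for p in points]


if __name__ == "__main__":
    # Toy points lying on a straight line, invented for this example.
    pts = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
    axis = first_principal_axis(pts)
    print(axis, project(pts, axis))
```

Further axes would be found the same way after removing the variance already explained by the earlier ones; those higher dimensions are the ones you switch between in the exercise above.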