Assignment 10

Your name:
Your email address:

Answer all the questions in red in the provided boxes.

Intro-SLIDES are here

Assignments:

1. (35 minutes) Parsimony, distance matrix and ml analyses: sensitivity to LBA and missing data.    

PHYLIP is a collection of programs for phylogenetic analyses written by Joe Felsenstein. The programs are freely available (including source code), and can be used on a variety of different operating system. The programs are modular. Different modules exist to create bootstrap samples, calculate distance matrices and calculate trees from the distance matrices (Fitch and Neighbor), calculate consensus trees, etc.. All programs either use files called infile or intree, or alternatively the user needs to provide the file name. We will use the sequences in testseq1b.txt.

We will use the program protpars as implemented in seaview.

(IGNORE THIS PARAGRAPH, if you are in class) This paragraph has comments on phylip that you can safely ignore for now: If you use protpars directly (without the seqview interface), note that phylip by default treats gaps as a 21st character. If you want to treat the gap as missing data, you need to replace the gap symbol with "?"'s. In case you want to use one of the programs on your own, you need to read the excellent manuals that come with the software.

To calculate a phylogenetic tree from the aligned sequences using protein parsimony, open seaview, and drag the file into alignment window.

Align the sequences using muscle. (click on align, then align all).

In the trees menu select parsimony. Uncheck "ignore all gap sites" and check "gaps as unknown state". To do a more thorough search for the tree that explains the data with the least number of substitutions, select "randomize seq order" 5 times provides a reasonable number of starting points for the heuristic search.

Four notes:

  1. Sulfolobus and Thermococcus are Archaea, Borrelia is a Spirochete (bacterium), Acetabularia is a green algae, Daucus is a flowering plant (carrot), Candida, and Saccharomyces are yeasts, Neurospora is another fungus (not a yeast though), Drosophila is an animal (fruit fly) and Trypanosomes are protists.
  2. the Borrelia sequence actually is an archaeal type ATPase acquired through gene transfer
  3. The sequences in testseq1b.txt (V/A-ATPase catalytic subunits) are quite similar to one another. To test the effect of long branches, I added a homologous (paralogous), but only distantly related sequence to this file (the ATPase involved in flagellar assembly from Salmonella).
  4. Remember that some sequences contain inteins -- What are potential problems that might be caused by the intein sequence?

How are the fungal sequences resolved? (What does this tell us about parsimony and missing data?)
Where does the paralogous Salmonella sequence go? (This is as expected, parsimony analyses are very sensitive to the Long Branch Attraction (LBA)- the parsimony result is depicted as a tree without meaning full branchlengths - to addrees this question you will need to compare the parsimony tree to one that has meaningful branchlenghts).
How many equally parsimonious trees did you obtain? Compare with your neighbors: did the most parsimonious tree have the same number of steps?
Does the resulting tree indicate that this parsimony analysis was very sensitive to Long Branch Attraction?

Repeat the parsimony analysis using 100 bootstrap samples (do not ignore positions with gaps, but do not check 5 random starting trees--if you do a bootstrap analysis, repeated heuristic searches for each dataset are not worth the time). What is the support for the two long branches going together? Check with your neighbors, did you get the same result?

Perform a distance analysis in Seaview. Select bootstrap, Kimura distances, NJ, and do not check ignore all gap sites. Inspect the bootstrap support values.
Does the resulting tree indicate that this distance matrix phylogenetic analysis was very sensitive to Long Branch Attraction?
What might have gone wrong with the fungal seqeunces?

Seaview also allows calculating trees using the maximum likelihood (ml) principle. The ml tree is the tree under which the sequence alignment has the highest probability. We also will use the program to determine if the assumption of ASRV and additional invariant sites leads to a significan better fit. Because the tree calculation on the PCs takes some time, you will collaborate with two of your neighbor.

One of you will use the following settings: Amino acid equilibrium frequencies: Empirical; Invariable sites: optimized; Across site rate variation: Optimized; Tree search operations: Best of NNI & SPR. (Use the default settings for the other parameters)

The second will use: Amino acid equilibrium frequencies: Empirical; Invariable sites: none; Across site rate variation: Optimized; Tree search operations: Best of NNI & SPR.

And the thiord will use: Amino acid equilibrium frequencies: Empirical; Invariable sites: none; Across site rate variation: None; Tree search operations: Best of NNI & SPR.

NOTE: You will want to save the phyml output. When the tree calculation is done, DO NOT CLICK OK - first copy the output to the clipboard and then save it into a text file.
How does the ml tree compare to the parsimony and neighbor joining trees?

What was your (or your two neighbors) estimation for the percent invaraint sites and for the shape parameter of the Gamma distribution?

The Likelihood Ratio Test (LRT) can be used to compare nested models In our case the three models are LG (No ASRV),LG + Gamma, and LG + Gamma + I.

One important condition that has to be fulfilled before one can use a Likelihood Ratio Test (LRT) to compare two models, is that the models should be "nested". This means that the simpler model must be a constrained version of the parameter-rich model. The likelihood ratio test is performed by doubling the difference in log-likelihood scores and comparing this test statistic with the critical value from a chi-squared distribution having degrees of freedom equal to the difference in the number of estimated parameters in the two models. The parameter-rich model will always have a better fit, due to the extra parameters and will therefore have the highest log-likelihood, so the difference should be a positive number. In this case there is 1 degree of freedom between each of the models — the gamma shape parameter is one parameter and the % invariant sites is a second parameter. Use this online chi-square calculator to determine the significance of the test.

What were the three log-likelihood values for the three models?
Does the LG+Gamma model explain the data significantly better than the more simple LG model? 
Does the LG+Gamma+I model explain the data significantly better than LG+Gamma model?

If you have a dataset to analyze, you will want to test a lot of different models and pick the best. Currently the easiest way to do this is to use iqtree. The sofware is available via a web interface (e.g. here - they assume you use HIV sequences, you need to carefully select what you want the program to do), and it is installed on the xanadu cluster. To run iqtree in the cluster:

Create a directory for lab11, and transfer the aligned sequences for testseq1b (as a multiple fasta file created via Save as in seaview) into that directory.

When done, use filezilla to move the files created by iqtree to your desktop computer. You can open the treefile in seaview (or in figtree, which allows for more editing, beautification of the tree).

Open the log file in a texteditor. At the and of listing of the for the lnL for the individula models, is the listing of the best models under the different criteria.

Which models were chosen, and what do the abbreviations mean? (see the iqtree documantation chapter 8.6 Rate heterogeneity across sites)

Does the tree calculated under the best model correspond to the tree you obtained with seaview?

 

Exercise 2:

Long Branch Attraction (LBA) is a serious problem in phylogenetic reconstruction. LBA denotes the fact that long branches tend to be grouped together with significant support, even though the organisms representing the long branches did not share more recent common ancestry. The support usually is measured through bootstrap support values for the different trees. We have simulated the evolution of 4 sequences (named A,B,C,D) according to the following tree:
tree

Files containing these sequences in multiple sequence fasta format were generated and named according to the length chosen for the two long branches (all scaled in substitutions per site). For the simulation we assumed that the Among Site Rate Variation could be described with a gamma distribution that has a shape factor of 1 (equal to an exponential distribution).

These files in a single zipped file are here

Your task is to explore the sensitivity of different phylogenetic reconstruction algorithms towards LBA. At the minimum you should use protein parsimony and one protein distance matrix or ml analysis approach. In this case we know that the sequences are aligned as given; however, to explore the effect that the alignment algorithm has on LBA, we can align them before phylogenetic reconstruction. To keep track of things, name the files accordingly.

NOTE I: If you want to explore the effect of alignment, it might be a good idea to use seaview and muscle as alignment program - especially for the more divergent sequences. We will use the GUI provided in seaview.

Note II: You can divide the labor with your neighbor, distributing different sequences to different students.

We will use programs as implemented in SEAVIEW

3A: To test parsimony, choose the files with x = 0.1, 0.3, 1, 3.

For the datasets with x = 0.1, 0.3, 1, 3, use the tree menu in seaview, select parsimony, uncheck "ignore all gap sites", check "gaps as unknown states", check "bootstrap with 100 replicates", and move the consensus tree level lever to the left. (Note: If you are interested in the best parsimony tree, then you want to use the original dataset (not bootstrapped) and randomize the input order for several independent heuristic searches, if you do a bootstrap analysis, repeated heuristic searches for each dataset are not worth the time.)

In the following box list the files that you chose, aligned or as provided, and the bootstrap support for the correct tree ((A,D),(B,C)), or the support for the LBA tree ((A,C),(B,D,)) (note: seaview will show them arbitrarily rooted)

3B) (or do 3C) Explore a distance matrix based approach with respect to LBA (Neighbor joining using Poisson corrected or observed distances work well). Depending on the settings, these might be less sensitive to LBA. x = 0.3, 1, 3, 10 are good choices to explore.

In the following box list the parameters you selected in seaview, the files that you chose (aligned or as provided), and for each file indicate the bootstrap support for the correct tree, or the support for the LBA tree:

 

3C (optional) Explore the sensitivity of phyml towards LBA. This only works on a fast computer - either transfer the sequences to the cluster, login, qlogin and start the program by typing phyml at the command line, or use seaview on your own computer.

In the following box list give the parameters you chose for phyml, the files that you chose, indicate if you aligned them or used them as provided, and for each file give the support value for the correct tree, or the support for the LBA tree:


Finished?

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.