Assignment 9

Your name:
Your email address:

Answer all the questions in red in the provided boxes.

Assignments:

1) Maximum likelihood tests using phyml as implemented in Seaview.
We will test the following models:

Model            # of rates   # of frequencies   Gamma   Invariant sites   Degrees of freedom relative to previous model
JC               1            1                  N       N
HKY85            2            4                  N       N                 4
GTR              6            4                  N       N                 4
GTR + Gamma      6            4                  Y       N                 1
GTR + GammaInv   6            4                  Y       Y                 1

Open this file in Seaview. Select all sequences. Select sites Extein.
Under Trees select phyml.

Select JC69 from the first pull-down menu. Select none for Invariable sites and none for Among Site Rate Variation. For the tree search use Nearest Neighbor Interchange (NNI). Run. Write down the log likelihood value into a table. Keep the tree window open.
Select HKY from the first pull-down menu. Select none for Invariable sites and none for Among Site Rate Variation. For the Transition / Transversion ratio and the nucleotide frequencies select optimized. For the tree search use Nearest Neighbor Interchange (NNI). Run. Write down the log likelihood value into a table.
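Select GTR from the first pull-down menu. Select none for Invariable sites and none for Among Site Rate Variation. For the nucleotide frequencies select optimized. For the tree search use Nearest Neighbor Interchange (NNI). Run. Write down the log likelihood value into a table.
Select GTR from the first pull-down menu. Select none for Invariable sites and optimized for Among Site Rate Variation. For the nucleotide frequencies select optimized. For the tree search use Nearest Neighbor Interchange (NNI). Run. Write down the log likelihood value into a table.
Select GTR from the first pull-down menu. Select optimized for Invariable sites and optimized for Among Site Rate Variation. For the nucleotide frequencies select optimized. For the tree search use Nearest Neighbor Interchange (NNI). Run. Write down the log likelihood value into a table.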

One important condition that has to be fulfilled before one can use a Likelihood Ratio Test (LRT) to compare two models is that the models are "nested". This means that the simpler model must be a constrained version of the parameter-rich model. The likelihood ratio test is performed by doubling the difference in log-likelihood scores and comparing this test statistic with the critical value from a chi-squared distribution with degrees of freedom equal to the difference in the number of estimated parameters in the two models. The parameter-rich model will always fit at least as well because of its extra parameters and will therefore have the higher log-likelihood, so the difference should be a positive number. The degrees of freedom between consecutive models are given in the table above: the gamma shape parameter counts as one parameter (even though the distribution is approximated by 4 rate categories), and the % invariant sites also counts as one parameter.
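As a cross-check on the online calculator, the test described above can be sketched in a few lines of Python. This is only an illustration using the standard library: the function names chi2_sf and lrt are made up for this example, and the log likelihood values in the usage lines are placeholders, not results from the yeast alignment.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability P(X >= x) of a chi-squared distribution
    with a positive integer number of degrees of freedom."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Start from a closed form: Q(1, h) = exp(-h) for even df,
    # Q(1/2, h) = erfc(sqrt(h)) for odd df.
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    # Recurrence: Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1)
    while a < df / 2.0:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return q

def lrt(lnl_simple, lnl_complex, df):
    """Return (2*deltaLogL, P-value) for two nested models."""
    stat = 2.0 * (lnl_complex - lnl_simple)
    return stat, chi2_sf(stat, df)

# Placeholder values: replace with the log likelihoods you wrote down.
stat, p = lrt(lnl_simple=-5000.0, lnl_complex=-4990.0, df=4)
print(f"2*deltaLogL = {stat:.2f}, P-value = {p:.3g}")
```

A small P-value means the simpler model is rejected in favor of the parameter-rich one.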

Use this online chi-square calculator to determine the significance of the test.

Are all the more complex models a significant improvement over the more simple ones?
Enter twice the log likelihood ratio and the P-value with which the simpler hypothesis is rejected.

JC
HKY85 2*deltaLogL: P-value:
GTR 2*deltaLogL: P-value:
GTR + Gamma 2*deltaLogL: P-value:
GTR + GammaInv 2*deltaLogL: P-value:

Doing this using a GUI and copying numbers back and forth is tedious. An older program called modeltest automatically tested a few dozen models. More recently iqtree incorporates testing for the appropriate model.
To get the alignment into a format readable by iqtree:
in seaview, select sites extein, select all sequences, then select File > Save selection.
Enter a filename (e.g., Yeast_vma1_extein_aligned.fst) and select fasta format.

The software is available via a web interface (e.g. here). Today, we will use the version available on xanadu. To run iqtree:

Start Filezilla and connect to transfer.cam.uchc.edu
Use PuTTY to connect to xanadu-submit-ext.cam.uchc.edu
Log in
srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash

Create a directory for lab9, and transfer the aligned sequences for exteins only (as a multiple fasta file created via save selection as in seaview - see above) into that directory.

cd lab9
load the iqtree module
module load iqtree/1.6.10
execute iqtree
iqtree -s Yeast_vma1_extein_aligned.fst
this will take some time.

When done, use filezilla to move the files created by iqtree to your desktop computer. You can open the treefile in seaview.

Open the log file in a text editor. At the end of the listing of the lnL values for the individual models is the listing of the best models under the different criteria.

Which models were chosen, and what do the abbreviations mean? (See the iqtree documentation, chapter 11.6 Rate heterogeneity across sites; for info on the different criteria see here.)

BIC: AICc: AIC:
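For orientation, the three criteria can be computed from a model's log likelihood lnL, its number of free parameters k, and a sample size n. The Python sketch below uses the standard textbook formulas. Taking n as the number of alignment sites is an assumption made here, the function names are made up for this example, and the numbers in the usage lines are placeholders rather than output from iqtree. For every criterion the model with the lowest score is preferred.

```python
import math

def aic(lnl, k):
    """Akaike Information Criterion: 2k - 2 lnL."""
    return 2 * k - 2 * lnl

def aicc(lnl, k, n):
    """AIC with the small-sample correction 2k(k+1)/(n-k-1)."""
    return aic(lnl, k) + (2 * k * (k + 1)) / (n - k - 1)

def bic(lnl, k, n):
    """Bayesian Information Criterion: k ln(n) - 2 lnL."""
    return k * math.log(n) - 2 * lnl

# Placeholder values: lnL, parameter count, and alignment length are invented.
lnl, k, n = -4990.0, 30, 1800
print(f"AIC = {aic(lnl, k):.1f}  AICc = {aicc(lnl, k, n):.1f}  BIC = {bic(lnl, k, n):.1f}")
```

Because BIC multiplies k by ln(n) rather than by 2, it penalizes extra parameters more heavily than AIC whenever n is larger than about 7 sites, which is one reason the criteria can disagree on the best model.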

Open the "selection.fst.treefile" file in seaview (in the alignment window File > open > select the file).
Does the tree calculated under the best model correspond to the trees you obtained with seaview? What is the main difference between the models that consider Among Site Rate Variation and those that do not?


Exercise 2:

Long Branch Attraction (LBA) is a serious problem in phylogenetic reconstruction. LBA denotes the fact that long branches tend to be grouped together with significant support, even though the organisms representing the long branches did not share more recent common ancestry. The support usually is measured through bootstrap support values for the different trees. We have simulated the evolution of 4 sequences (named A,B,C,D) according to the following tree:
[Figure: tree used for the simulation]

Files containing these sequences in multiple sequence fasta format were generated and named according to the length chosen for the two long branches (all scaled in substitutions per site). For the simulation we assumed that the Among Site Rate Variation could be described with a gamma distribution that has a shape factor of 1 (equal to an exponential distribution).

These files in a single zipped file are here

Your task is to explore the sensitivity of different phylogenetic reconstruction algorithms towards LBA. At a minimum you should use protein parsimony and either a protein distance matrix approach or a maximum likelihood (ML) approach. In this case we know that the sequences are aligned as given; however, to explore the effect that the alignment algorithm has on LBA, we can align them before phylogenetic reconstruction. To keep track of things, name the files accordingly.

NOTE I: If you want to explore the effect of alignment, it might be a good idea to use muscle as the alignment program in seaview, especially for the more divergent sequences. We will use the GUI provided in seaview.

Note II: You can divide the labor with your neighbor, distributing different sequences to different students.

We will use programs as implemented in SEAVIEW

2A: To test parsimony, choose the files with x = 0.1; 0.3; 1; 3.

For the datasets with x = 0.1, 0.3, 1, 3, use the tree menu in seaview, select parsimony, uncheck "ignore all gap sites", check "gaps as unknown states", check "bootstrap with 100 replicates", and move the consensus tree level lever to the left. (Note: If you are interested in the best parsimony tree, use the original dataset (not bootstrapped) and randomize the input order for several independent heuristic searches. If you do a bootstrap analysis, repeated heuristic searches for each dataset are not worth the time.)

In the following box list the files that you chose, whether they were aligned or used as provided, and the bootstrap support for the correct tree ((A,D),(B,C)) or for the LBA tree ((A,C),(B,D)). (Note: seaview will show the trees arbitrarily rooted.)
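To see what the bootstrap support values mean for a four-taxon dataset, the following Python sketch may help. With only four sequences, a site is parsimony-informative only if two sequences share one state and the other two share a different state, and the most parsimonious tree is the one supported by the largest number of such sites. The sketch resamples alignment columns with replacement and counts how often each tree wins. It is a simplified illustration, not seaview's implementation: the function names are made up, sites containing a gap are simply treated as uninformative, and ties are reported separately.

```python
import random
from collections import Counter

def site_support(a, b, c, d, gap="-"):
    """Return the four-taxon tree supported by one site, or None."""
    if gap in (a, b, c, d):
        return None  # simplification: gapped sites are treated as uninformative
    if a == d and b == c and a != b:
        return "((A,D),(B,C))"
    if a == c and b == d and a != b:
        return "((A,C),(B,D))"
    if a == b and c == d and a != c:
        return "((A,B),(C,D))"
    return None

def parsimony_winner(columns):
    """Tree supported by the most informative sites, or 'tie'."""
    counts = Counter(s for s in (site_support(*col) for col in columns) if s)
    ranked = counts.most_common()
    if not ranked or (len(ranked) > 1 and ranked[0][1] == ranked[1][1]):
        return "tie"
    return ranked[0][0]

def bootstrap_support(seqs, replicates=100, seed=0):
    """Fraction of bootstrap replicates won by each tree."""
    columns = list(zip(seqs["A"], seqs["B"], seqs["C"], seqs["D"]))
    rng = random.Random(seed)
    wins = Counter()
    for _ in range(replicates):
        sample = [rng.choice(columns) for _ in columns]
        wins[parsimony_winner(sample)] += 1
    return {tree: n / replicates for tree, n in wins.items()}

# Toy alignment, invented for illustration only.
toy = {"A": "KKLKE", "B": "RRMRE", "C": "RKMRD", "D": "KRLKD"}
print(bootstrap_support(toy))
```

When the long branches accumulate many parallel changes, sites of the second kind (A and C sharing a state by chance) can outnumber the sites that reflect the true history, which is how LBA arises under parsimony.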

2B) (or do 2C) Explore a distance matrix based approach with respect to LBA (Neighbor joining using Poisson corrected or observed distances works well). Depending on the settings, these might be less sensitive to LBA. x = 0.3, 1, 3, 10 are good choices to explore.

In the following box list the parameters you selected in seaview, the files that you chose (aligned or as provided), and for each file indicate the bootstrap support for the correct tree, or the support for the LBA tree:
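The difference between observed and Poisson corrected distances can be made concrete with a short Python sketch. The observed distance p is the fraction of sites that differ between two sequences, and the Poisson correction d = -ln(1 - p) estimates the number of substitutions per site, accounting for multiple changes at the same site. For four taxa, the first join made by neighbor joining corresponds to the pairing with the smallest sum of within-pair distances, which is what the last function checks. The function names and the distance values in the usage lines are invented for illustration and are not seaview's implementation.

```python
import math

def p_distance(s1, s2, gap="-"):
    """Observed distance: fraction of differing sites, skipping gapped sites."""
    pairs = [(x, y) for x, y in zip(s1, s2) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)

def poisson_distance(p):
    """Poisson corrected distance d = -ln(1 - p)."""
    if p >= 1.0:
        raise ValueError("distance is undefined when all sites differ")
    return -math.log(1.0 - p)

def best_four_taxon_split(d):
    """d maps 'AB', 'AC', 'AD', 'BC', 'BD', 'CD' to pairwise distances.
    Return the pairing with the smallest sum of within-pair distances."""
    sums = {
        "((A,D),(B,C))": d["AD"] + d["BC"],
        "((A,C),(B,D))": d["AC"] + d["BD"],
        "((A,B),(C,D))": d["AB"] + d["CD"],
    }
    return min(sums, key=sums.get)

# Invented distances, for illustration only.
d = {"AB": 1.0, "AC": 1.0, "AD": 0.4, "BC": 0.4, "BD": 1.0, "CD": 1.0}
print(best_four_taxon_split(d))
print(poisson_distance(p_distance("KRLME", "KRLMD")))
```

Observed distances cannot exceed 1 and therefore underestimate long branches, so comparing the two settings on the more divergent files shows how much the correction matters.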

2C) Explore the sensitivity of phyml towards LBA. This may work better on a fast computer.

Use the default settings for phyml in seaview (go with nearest neighbor interchange (NNI) and the LG substitution matrix).
The search converges much faster if you do not align the sequences first.
Use the aLRT (approximate likelihood ratio test) support values (not the real non-parametric bootstrap). The aLRT values are between 0 and 1, with 1 corresponding to maximum probability for the branch to be present in the true tree.
x = 1, 3, and 10 are good values to explore.

In the following box list the parameters you chose for phyml and the files that you chose, indicate whether you aligned them or used them as provided, and for each file give the support value for the correct tree or for the LBA tree:

