Assignment 9

Your name:
Your email address:

Answer all the questions in red in the provided boxes.

Assignments:

1) Maximum likelihood tests using phyml as implemented in Seaview.
We will test the following models:

#of rates #of frequencies Gamma Invariant sites degrees of freedom to previous model

JC 1 1 N N

HKY 85 2 4 N N 4

GTR 6 4 N N 2

GTR + Gamma 6 4 Y N 1

GTR +GammaInv 6 4 Y Y 1

Open this file in Seaview. Select all sequences. Select sites Extein.
Under Trees select phyml.

Select JC69 from the first pull-down menu. Select none for Invariable sites and none for Among Site Rate Variation. For the tree search use Nearest Neighbor Interchange (NNI). Run. Write down the log likelihood value into a table. Keep the tree window open.
Select HKY from the first pull-down menu. Select none for Invariable sites and none for Among Site Rate Variation. For the Transition / Transversion ratio and the nucleotide frequencies select optimized. For the tree search use Nearest Neighbor Interchange (NNI). Run. Write down the log likelihood value into a table.
Select GTR from the first pull-down menu. Select none for Invariable sites and none for Among Site Rate Variation. For the nucleotide frequencies select optimized.
Select GTR from the first pull-down menu. Select none for Invariable sites and optimized for Among Site Rate Variation. For the nucleotide frequencies select optimized.
Select GTR from the first pull-down menu. Select optimized for Invariable sites and optimized for Among Site Rate Variation. For the nucleotide frequencies select optimized.

One important condition that has to be fulfilled before one can use a Likelihood Ratio Test (LRT) to compare two models, is that the models should be "nested". This means that the simpler model must be a constrained version of the parameter-rich model. The likelihood ratio test is performed by doubling the difference in log-likelihood scores and comparing this test statistic with the critical value from a chi-squared distribution having degrees of freedom equal to the difference in the number of estimated parameters in the two models. The parameter-rich model will always have a better fit, due to the extra parameters and will therefore have the highest log-likelihood, so the difference should be a positive number. The degree of freedom between each of the models is given in the above table - plus/minus gamma shape parameter is one parameter (even though is is approximated by 4 rate categoroies) and the % invariant sites also counts as a parameter.

Use this online chi-square calculator to determine the significance of the test.

Are all the more complex models a significant improvement over the more simple ones?
Enter twice the log likelihood ratio and the P-value with which the simpler hypothesis is rejected.

Are all the more complex models a significant improvement over the more simple ones? JC HKY85 2*deltaLogL: P-value: GTR 2*deltaLogL: P-value: GTR + Gamma 2*deltaLogL: P-value: GTR + GammaInv 2*deltaLogL: P-value:

Doing this using a GUI and copying numbers back and forth is tedious. An older program called modeltest automatically tested a few dozen models. More recently iqtree incorporates testing for the appropriate model.
To get the aligmnet into a format readable by iqtree,
in seaview, select sites extein, select all sequences, then select File > Save Selection
Enter a filename, select fasta format. save
The software is available via a web interface (e.g. here - they assume you use HIV sequences, you need to carefully select what you want the program to do). Select the radio button nucleotides, then select "find best and apply" in the pull-down menu. To be careful with computational resources, you should get the results of the analysis are here. Click on Download all files and unzip the file iqt.model and open it an a text editor. (Or download the log file at the bottom of the page). Scrol down to the listing of the best models.
Which models were chosen, and what do the abbreviations mean? (see the iqtree documantation chapter 8.6 Rate heterogeneity across sites; for info on the different criteria see here)

BIC: AICc: AIC:

Aside (do not do this now!): iqtree is also installed on the xanadu cluster. To run iqtree in the cluster:

Start Filezilla and connect to transfer.cam.uchc.edu
PuTTY to xanadu-submit-ext.cam.uchc.edu
login
srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash

Create a directory for lab9, and transfer the aligned sequences for exteins only (as a multiple fasta file created via save selection as in seaview - see above) into that directory.

cd lab9
load the iqtree module
module load iqtree
execute iqtree
iqtree -s
this will take some time.

When done, use filezilla to move the files created by iqtree to your desktop computer. You can open the treefile in seaview.

Open the log file in a text editor. At the and of listing of the for the lnL for the individula models, is the listing of the best models under the different criteria.

END ASIDE

Open the "iqt.treefile" file from iqtree output folder (iqtall) in seaview (in the alignment window File > open > select the file).
Does the tree calculated under the best model correspond to the trees you obtained with seaview? What is the main difference between the models that consider Among Site Rate Variation and those that do not?

Does the tree calculated under the best model correspond to the trees you obtained with seaview? What is the main difference between the models that consider Among Site Rate Variation and those that do not?

Exercise 2:

Install the latest version of tracer from the github repository on your computer.
We will set-up and analyze a MrBayes analysis of the intein and extein sequences from the Yeast vma1 sequence alignment. The files were saved as selection from seaview, and the following MrBayes block was added to each:

begin mrbayes;
lset nst=6 rates=gamma;
mcmcp filename=analysis_Extein;
mcmcp samplefreq=50 printfreq=50 diagnfreq=500;
mcmcp ngen=20000;
mcmcp savebrlens=yes;
end;

nst6 denotes the GTR model. I only chose the gamma distribution (not gamma+invariant sites) because the two parameters have very similar effects in descibing ASRV.
The file with the intein sequances is here. The one with the extein is here.

Download the files,
Start filezilla and connect to transfer.cam.uchc.edu. (use you username and password)
Create a lab9 directory and transfer the files you downloaded that directory.

Connect to xanadu in terminal or putty:
ssh mcb3421usrXX@xanadu-submit-ext.cam.uchc.edu
or enter user name and address in putty

srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash
cd lab9
module load MrBayes/3.2.7
Check that the nexus files are in the directory (using more is a good choice).
start MrBayes by typing
mb
then type

execute Yeast_vma1_intein_aligned.nxs
showmodel

Read through the output after each command.

Type

mcmc

to start the chains.

Every 50 generations the program prints a line to the screen and every 500 lines it prints the

Average standard deviation of split frequencies

This values should be below 0.01 to indicate the the two parallel runs obtain similar results. It the value is above, run the chain for another couple 10000 generations.
When you are done (do not do more than 40000 generations - you can download the parameter and treefiles from a longer runs below)
type "no" at the Continue with analysis? (yes/no): prompt.

After the run is finished, the " sump " command will plot the logL vs. generation number, that allows to determine the necessary burnin (you want to discard those samples as "burnin" where the -logL is still rising steadily).
To see the whole logL curve, you need to set the burnin fraction to .02 . (type help sump at the mb command line).  sump burninfrac=.02

At the start of the run, the likelihood rapidly increases by orders of magnitude. If the first samples are included in the plot, one really cannot see if the likelihood values fluctuate around a constant
value. You can exclude the first couple of samples by specifying a burnin or a burninfrac. (The new version of MrBayes uses a burnin of 25% by default.)

Repeat the sump command using different values for  burninfrac. Select the value that no longer reveals an upswing at the beginning.

Summarize the trees sampled after the burnin using the sumt command:
sumt burninfrac =.25
, where you need to substitute '.25' with the number you obtained in the previous step of
the exercise. This command creates a consensus tree, and shows posterior probabilities of the clades. You can take a look at the tree on the screen (scroll up to see the bipartition table, and different version of the tree)

1) Which value did you use for the burnin/burninfraction?
2) Which branch in the tree is the longest? (check the branch lengths tables, then look up the branch in the bipartition table)
3) How long is it? (use the mean value)
4) What is the measure? (Check the label at the scale bar for the last tree image)
5) How big is the shape parameters for the intein? What is the 95% credibility interval?
6) What is the probability for a nucleotide to be an A? What is the probability to be a C?
7) Can you explain in a few words, why is it important to exclude a 'burnin' from our analyses?
1) Which value did you use for the burnin/burninfraction? 2) Which branch in the tree is the longest? (check the bipartition and branch lengths tables) 3) How long is it? 4) What is the measure? (Check the scale bar at the tree image) 5) What is the shape parameters for the intein ? What is the 95% credibility interval? 6) What is the probability for a nucleotide to be an A? What is the probability to be a C? 7) Can you explain in a few words, why is it important to exclude a 'burnin' from our analyses?

Here are parameter files from longer runs for the intein and extein sequences separately (unzip the file).
Start Tracer on your computer. Drag the four .p files into the Trace file window (upper left).
Select the 2 runs for the intein, select trace on the upper right above the plot window, uncheck line plot to make the image less busy.
Repeat for the two extein runs. Select alpha (the shape parameter), select marginal density on top, and histogram in the display options. Then switch to mu estimates on top.
What are the means for the shape parameter?
Switch to pi(A) an pi(T). Is the frequency of A and T different for inteins and exteins?
The tree length for each of the sampled trees (in substitutions per site) is sampled under TL. How many more subsitutions occured overall in the intein and compared to the extein? Is the different significant?

1) What are the means for the shape parameter? 2) Is the frequency of A and T different for inteins and exteins? 3) How long is it? 4) How many more subsitutions occured overall in the intein and compared to the extein? 5) Is the different significant?

Finished?

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.

	#of rates	#of frequencies	Gamma	Invariant sites	degrees of freedom to previous model
JC	1	1	N	N
HKY 85	2	4	N	N	4
GTR	6	4	N	N	2
GTR + Gamma	6	4	Y	N	1
GTR +GammaInv	6	4	Y	Y	1