MCB 5472 : Calculating phylogenetic trees from molecular data

Student Projects

If you have not done so, complete the MrBayes exercise from last week!

Summary of what you should do:

Download the MrBayes output from here
Inspect the terminal output from the sumt command (part of the above zip file) and answer the question: what is the posterior probability that the intein from AB093510.1_Saccharomyces_exiguus (seq 22) forms a clan with sequence #19 (XM_003683782.1_Tetrapisispora_phaffii)?
Load both .p files (part of the above zip file) into Tracer and determine the mean, median and 95% highest posterior density interval for the estimated omega for the sites under purifying selection.
Download the spreadsheet here, which contains the generations >80000 from the two runs. The averages for the probability to be under positive selection, pr+(xyz), and for omega(zyz) are already calculated (you need to scroll to the right to get to these columns). The second sheet contains bar graphs of these values. Are the regions that contain the LAGLIDADG motifs under negative (purifying) selection?
More information is in the original exercise below (option 3 below)
PS: if you want to have data from a longer run, let me know - I can share it with you.

Get to know iqtree

iqtree is a rather new (and fast, and extremely versatile) program for calculating trees from sequence data. The authors are very helpful in implementing new models, and at least in Europe and Canada, this seems to have become the software of choice at the moment (with RAxML a close second).

The manual is at http://www.iqtree.org/doc/. I found it easiest to download the PDF and search for items in it.

The program reads phylip files with long sequence names (created by seaview). It handles DNA, protein, character, and partitioned data. We will use it on the yeast intein and extein vma1 DNA sequences.
To enable iqtree on bbc3, you need to load the module:
module load iqtree/current

If you run an alignment without specifying a model, iqtree will first determine the best-fitting model and then run the analysis. You can also specify a model yourself; the program then runs faster. It is a good idea to run every analysis in a separate directory, or to create multiple sequence alignment files with names that reflect the analysis you are planning to do.
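One way to keep each analysis in its own directory is sketched below in Python. It is only an illustration: the function name prepare_analysis and the directory name intein_modeltest are made up for this example, and the command uses just the -s and -m options of iqtree. The toy alignment is a placeholder, not real data.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_analysis(alignment, workdir, name, model=None):
    """Copy the alignment into its own directory and build the iqtree command.

    Returns (run_dir, command). Nothing is executed here.
    """
    run_dir = Path(workdir) / name
    run_dir.mkdir(parents=True, exist_ok=True)
    local_copy = run_dir / Path(alignment).name
    shutil.copy(alignment, local_copy)
    command = ["iqtree", "-s", local_copy.name]
    if model:  # without -m, iqtree searches for the best-fitting model itself
        command += ["-m", model]
    return run_dir, command


# Small demonstration with a toy alignment in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    alignment = Path(tmp) / "intein.phy"
    alignment.write_text(" 2 4\nseqA ACGT\nseqB ACGA\n")  # placeholder data
    run_dir, command = prepare_analysis(alignment, tmp, "intein_modeltest")
    copied = (run_dir / "intein.phy").exists()
    print(run_dir.name, command)
```

On the cluster, after module load iqtree/current, the returned command could be started with subprocess.run(command, cwd=run_dir, check=True), so that all output files end up next to the copied alignment.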

We will run the intein and extein data to find the best model. iqtree -s intein.phy and iqtree -s extein.phy
What models did iqtree consider the best fit for the data? What do these models specify? (check the manual)
Load the resulting treefiles into figtree. Are the trees for the intein and extein different? (Also check the branch lengths!)

To compare the tree topologies, we need to have support values for the nodes. To get these, we can run a fast bootstrap analysis, e.g.,
iqtree -s intein.phy -bb 1000 -m GTR+F+R4 and iqtree -s extein.phy -bb 1000 -m TIM2e+I+G4

Load the trees into figtree, and under node labels select label to see the bootstrap support values. Are the intein and extein trees different? Does the difference pertain to significantly supported nodes? (In the unlikely event that this takes too long, treefiles for the nucleotide and protein datasets are here)
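If you prefer to check the comparison outside figtree, the following Python sketch lists the splits (bipartitions) that are well supported in one tree but absent from the other. It makes several assumptions: both trees contain the same sequence names, the internal node labels are plain numbers (as with -bb alone), names contain no quotes or brackets, and 95 is used as the cutoff for a "significantly supported" node. The function names are made up for this example, and the two toy trees are not the course data.

```python
def parse_splits(newick):
    """Return {split: support} for the labelled internal branches of a Newick tree.

    Each split is stored as the side that does NOT contain the alphabetically
    first taxon, so the same bipartition gets the same key in both trees.
    """
    text = newick.strip().rstrip(";")
    pos = 0
    clades = {}

    def read_label():
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in "(),:":
            pos += 1
        return text[start:pos]

    def skip_branch_length():
        nonlocal pos
        if text[pos:pos + 1] == ":":
            pos += 1
            read_label()

    def read_node():
        nonlocal pos
        if text[pos] != "(":  # leaf: name, optional branch length
            name = read_label()
            skip_branch_length()
            return frozenset([name])
        members = frozenset()
        pos += 1  # skip "("
        while True:
            members |= read_node()
            closing = text[pos] == ")"
            pos += 1  # skip "," or ")"
            if closing:
                break
        label = read_label()  # support value, may be empty
        skip_branch_length()
        if label:
            clades[members] = float(label)
        return members

    taxa = read_node()
    reference = min(taxa)
    splits = {}
    for clade, support in clades.items():
        side = taxa - clade if reference in clade else clade
        if 2 <= len(side) <= len(taxa) - 2:  # ignore trivial splits
            splits[side] = support
    return splits


def unshared_supported_splits(newick_a, newick_b, threshold=95.0):
    """Splits with support >= threshold that occur in only one of the two trees."""
    splits_a, splits_b = parse_splits(newick_a), parse_splits(newick_b)
    only_a = {s: v for s, v in splits_a.items() if v >= threshold and s not in splits_b}
    only_b = {s: v for s, v in splits_b.items() if v >= threshold and s not in splits_a}
    return only_a, only_b


# Toy trees with five taxa.
tree_a = "((A:0.1,B:0.1)98:0.2,(C:0.1,D:0.1)98:0.2,E:0.3);"
tree_b = "((A:0.1,C:0.1)97:0.2,(B:0.1,D:0.1)60:0.2,E:0.3);"
only_a, only_b = unshared_supported_splits(tree_a, tree_b)
for split, support in only_b.items():
    print(sorted(split), support)
```

For the exercise, the two Newick strings would be read from the .treefile outputs of the intein and extein runs. An empty result for both directions would mean that the trees differ only at weakly supported nodes.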

Use MrBayes to determine Omega

The MrBayes manual is here

A file of the intein sequences in nexus format is here (this is the file format that MrBayes and PAUP use). The file was created in seaview.
I added the following MrBayes block:

begin mrbayes;
lset nst=2 rates=gamma nucmodel=codon omegavar=Ny98;
report possel = yes siteomega = yes;
mcmcp filename=analysisS;
mcmcp samplefreq=100 printfreq=100 diagnfreq=500;
mcmc ngen=100000;
mcmcp savebrlens=yes;
end;

This directs MrBayes to run the Nielsen and Yang codon model (omegavar=Ny98) and to report, for each codon, the omega value and the probability of being under positive selection.
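For orientation, omega is the ratio of the nonsynonymous to the synonymous substitution rate. The block below summarizes the standard interpretation of this ratio; it is not taken from the exercise files. The Ny98 option is generally described as assigning each codon to one of three classes of this kind (check the MrBayes manual for the exact parameterization). The notation assumes the amsmath package.

```latex
\omega = \frac{d_N}{d_S}, \qquad
\begin{cases}
\omega < 1 & \text{negative (purifying) selection} \\
\omega = 1 & \text{neutral evolution} \\
\omega > 1 & \text{positive (diversifying) selection}
\end{cases}
```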

To run the analysis:

move the nexus formatted file into a directory on the cluster,
change into this directory,
load the MrBayes module for version 3.2.6: module load mrbayes/3.2.6
start MrBayes by typing mb
then type execute Yeast_vma1_intein_withMotifs.nxs
check that everything is loaded as expected
start the Metropolis-coupled Markov chain Monte Carlo (MCMC) run by typing mcmc

After the specified number of generations, the program halts and waits for your input (continue the run, yes or no).

This takes at least a couple of hours; for an analysis that you want to rely on, you should run it for several days! A criterion for halting the chains is the "average standard deviation of split frequencies". This number compares the results from two parallel runs; the closer the value is to zero, the better. Results from an overnight run are here.
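To make the convergence criterion concrete, here is a minimal Python sketch of how such a number can be computed from the split frequencies of two runs. It is not the MrBayes code. It assumes the sample standard deviation (n - 1 in the denominator) and a minimum frequency of 0.10 for a split to be counted; both choices, the function name, and the split labels are assumptions made for this illustration.

```python
from statistics import stdev


def average_sd_of_split_frequencies(run_frequencies, min_freq=0.10):
    """Average, over splits, of the standard deviation of their frequency across runs.

    run_frequencies: one dict {split label: frequency} per run.
    A split missing from a run counts as frequency 0 in that run.
    Splits below min_freq in every run are ignored.
    """
    splits = set().union(*run_frequencies)
    deviations = []
    for split in splits:
        freqs = [run.get(split, 0.0) for run in run_frequencies]
        if max(freqs) >= min_freq:
            deviations.append(stdev(freqs))
    return sum(deviations) / len(deviations) if deviations else 0.0


# Toy example: the runs agree on one split and disagree on another;
# the third split is too rare in both runs to be counted.
runs = [
    {"AB|CDE": 1.0, "CD|ABE": 0.75, "CE|ABD": 0.05},
    {"AB|CDE": 1.0, "CD|ABE": 0.25},
]
asdsf = average_sd_of_split_frequencies(runs)
print(round(asdsf, 4))
```

Two runs that sample the same trees at the same frequencies give a value of zero, which is why values close to zero indicate that the runs have converged on the same posterior distribution of trees.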

After a run is finished, you can use the "sump" command (within MrBayes) to plot the logL vs. generation number. This lets you determine the necessary burnin (you want to discard as "burnin" those samples where the logL is still rising steadily).
To see the whole logL curve, you need to set the burnin fraction to .02 (type help sump at the mb command line for details): sump burninfrac=.02

Rather than using the sump command, you can also download the Tracer application, which reads the parameter files from MrBayes and provides an easy, intuitive and interactive way to evaluate these files with respect to burnin and confidence intervals. The .p and .t files from an analysis that I ran on this dataset, and the terminal output from the sump and sumt commands, are here.
Reading the sumt output, can you figure out what the posterior probability is that the intein from AB093510.1_Saccharomyces_exiguus (seq 22) forms a clan with sequence #19 (XM_003683782.1_Tetrapisispora_phaffii)? Note that sequences from the genus Saccharomyces do not form a clan.

Load both .p files into Tracer. You can load both parameter files at the same time and select, in the upper left field of the Tracer application, whether you want to analyze one or both of them.

Inspect the trace (select the right button over the graphics window) of lnL (select in the list of parameters on the left).

Determine the mean, median and 95% highest posterior density interval for the estimated omega for the sites under purifying selection.
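If you want to check the Tracer numbers by hand, the Python sketch below computes the same three summaries from a parameter file. It assumes that the .p file is tab separated, starts with a bracketed comment line, and has the generation number in the first column. The column name omega(-) in the toy data is an assumption; use whatever name your file's header gives for the omega of the sites under purifying selection. The HPD interval is taken as the shortest interval that contains 95% of the samples.

```python
import io
import math
import statistics


def read_p_file(handle, after_generation=0):
    """Read a tab-separated parameter file into {column name: [values]}.

    Rows with a generation number <= after_generation are discarded as burnin.
    """
    columns = None
    samples = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("["):  # blank line or [ID: ...] comment
            continue
        fields = line.split("\t")
        if columns is None:  # first real line is the header
            columns = fields
            samples = {name: [] for name in columns}
            continue
        if float(fields[0]) <= after_generation:
            continue
        for name, value in zip(columns, fields):
            samples[name].append(float(value))
    return samples


def hpd_interval(values, mass=0.95):
    """Shortest interval that contains the fraction `mass` of the samples."""
    ordered = sorted(values)
    n = len(ordered)
    k = max(1, math.ceil(mass * n))  # number of samples inside the interval
    start = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[start], ordered[start + k - 1]


def summarize(values):
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "hpd": hpd_interval(values),
    }


# Toy data, not real MrBayes output.
DEMO = (
    "[ID: 0]\n"
    "Gen\tLnL\tomega(-)\n"
    "0\t-5000.0\t0.90\n"
    "100\t-4200.0\t0.10\n"
    "200\t-4100.0\t0.20\n"
    "300\t-4100.5\t0.25\n"
    "400\t-4099.0\t0.40\n"
)
samples = read_p_file(io.StringIO(DEMO), after_generation=0)
summary = summarize(samples["omega(-)"])
print(summary)
```

To combine both runs, read each .p file with the same burnin cutoff and concatenate the two lists for the column of interest before calling summarize.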

Load the two parameter files (.p) into Excel and copy the values for the generations after the burnin into a new spreadsheet.

Follow the guidance from the powerpoint slides and calculate the values estimated for omega for each codon, and the probability that the codon is under positive selection. ALTERNATIVELY, here is a spreadsheet with the generations >80000 from the two runs, and the average for the probability to be under positive selection pr+(xyz) and the omega(zyz) already calculated (you need to scroll to the right to get to these columns). The second sheet contains bar graphs of these values.

Are the regions that contain the LAGLIDADG motifs under negative (purifying) selection?
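The per-codon averaging described above can also be done without a spreadsheet. The Python sketch below pools the samples after the burnin from several parameter files and averages every per-codon column. It assumes tab-separated files with the generation in the first column and per-codon column names of the form pr+(x,y,z) and omega(x,y,z), where x, y, z are the nucleotide positions of the codon; check the header of your own .p files, since this naming is an assumption. The toy values are invented.

```python
import csv
import io
import re
from statistics import mean

# Assumed naming of the per-codon columns, e.g. pr+(1,2,3) and omega(1,2,3).
SITE_COLUMN = re.compile(r"^(pr\+|omega)\(\d+,\d+,\d+\)$")


def per_codon_means(handles, after_generation):
    """Pool samples after the burnin from all runs and average each per-codon column."""
    pooled = {}
    for handle in handles:
        lines = (ln for ln in handle if ln.strip() and not ln.startswith("["))
        rows = csv.reader(lines, delimiter="\t")
        header = next(rows)
        for row in rows:
            if int(row[0]) <= after_generation:  # discard burnin
                continue
            for name, value in zip(header, row):
                if SITE_COLUMN.match(name):
                    pooled.setdefault(name, []).append(float(value))
    return {name: mean(values) for name, values in pooled.items()}


# Toy data for two runs, one codon.
RUN1 = (
    "[ID: 1]\n"
    "Gen\tLnL\tpr+(1,2,3)\tomega(1,2,3)\n"
    "80000\t-4100.0\t0.90\t2.00\n"
    "80100\t-4099.0\t0.25\t0.25\n"
)
RUN2 = (
    "[ID: 2]\n"
    "Gen\tLnL\tpr+(1,2,3)\tomega(1,2,3)\n"
    "80100\t-4098.0\t0.75\t0.50\n"
)
means = per_codon_means([io.StringIO(RUN1), io.StringIO(RUN2)], after_generation=80000)
print(means)
```

Codons whose averaged omega is clearly below 1 and whose pr+ is close to 0 are candidates for purifying selection; comparing these values with the positions of the motifs in the alignment answers the question above.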

2) Work on your student project!

Include a summary of what you did in your report!