MCB 5472 : Inferring Phylogenies (I)

Please send your answers by email to gogarten@uconn.edu, or hand in a hard copy.

Please let me know how far you got during the lab. If most students didn't finish, we may continue this next week!

1) ML on protein sequences using PhyML

a) For a dataset of your choice (or here or here; for the latter, the PHYLIP-formatted file to be used for TREE-PUZZLE is here), use PhyML to calculate the tree with the highest likelihood. NOTE: your sequences need to be aligned before you analyze them. On bbcsrv3, after qlogin, enter "phyml" at the command line. Use a model for Among Site Rate Variation (ASRV) in which the proportion of invariant sites is estimated from the data and the remaining sites are described by 4 rate categories that are a discrete approximation of a continuous Gamma distribution whose shape parameter is also estimated from the data.
Repeat the analysis using a model that does not include invariant sites. Perform a likelihood ratio test (LRT) to determine whether the more complex model (the one with an estimated proportion of invariant sites) leads to a significant improvement in likelihood.

Notes:

phyml is also implemented as part of seaview. If you only want to analyze a single dataset, this might be the easiest way to go.
phyml comes with a command line version that makes it easier to call the program from scripts. To get information on how to use the command line version, type phyml - or check the manual (pdf)
there also are several online servers
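
For the command line version, the two analyses in (a) might be set up roughly as follows. This is only a sketch: the flags used (-i, -d, -m, -c, -a, -v) are those described in the PhyML 3 manual, while the file name my_alignment.phy, the helper build_phyml_command, and the choice of the LG substitution model are assumptions made for illustration. Check the help output of the phyml version installed on bbcsrv3 before relying on it.

```python
def build_phyml_command(alignment, estimate_pinv=True):
    """Return a phyml argument list for a protein alignment (hypothetical helper).

    Assumed flags, as described in the PhyML 3 manual:
      -i  input alignment (PHYLIP format)
      -d  data type ("aa" for amino acids)
      -m  substitution model (LG is an assumption, pick what suits your data)
      -c  number of discrete Gamma rate categories
      -a  Gamma shape parameter ("e" = estimate from the data)
      -v  proportion of invariant sites ("e" = estimate, "0" = none)
    """
    cmd = ["phyml", "-i", alignment, "-d", "aa", "-m", "LG", "-c", "4", "-a", "e"]
    cmd += ["-v", "e" if estimate_pinv else "0"]
    return cmd


# Print the two commands: Gamma + invariant sites, then Gamma only.
for with_invariant in (True, False):
    print(" ".join(build_phyml_command("my_alignment.phy", with_invariant)))
```

The printed lines can be pasted into the shell after qlogin. Since both runs read the same input file, it is worth saving the results of the first run under a different name before starting the second.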

If you use phyml as implemented in seaview, DO NOT CLICK OK when the analysis is finished. Instead, click on copy at the bottom of the window, open a text editor, and paste the content into the editor window. We are interested in the log likelihood and in the last values estimated for the shape parameter and the % invariable sites (if the latter two were estimated as part of the model).

One important condition that has to be fulfilled before one can use a Likelihood Ratio Test (LRT) to compare two models is that the models are "nested". This means that the simpler model must be a constrained version of the parameter-rich model. The likelihood ratio test is performed by doubling the difference in log-likelihood scores and comparing this test statistic with the critical value from a chi-squared distribution whose degrees of freedom equal the difference in the number of estimated parameters between the two models. Because of its extra parameters, the parameter-rich model will always fit at least as well as the simpler one and will therefore have the higher log-likelihood, so the difference should be a positive number.

In this case there is 1 degree of freedom: both models estimate the gamma shape parameter, but only the more complex model also estimates the % invariant sites. Use this online chi-square calculator to determine the significance of the test.
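
As an alternative to an online calculator, the test with 1 degree of freedom can be computed in a few lines. The sketch below uses the fact that for a chi-squared distribution with 1 degree of freedom the upper tail probability equals erfc(sqrt(x/2)). The function name lrt_one_df and the two log-likelihood values are made-up placeholders; substitute the values from your own PhyML runs.

```python
import math


def lrt_one_df(lnl_simple, lnl_complex):
    """Likelihood ratio test for nested models differing by one parameter.

    Returns the test statistic 2 * (lnL_complex - lnL_simple) and its
    p-value under a chi-squared distribution with 1 degree of freedom.
    """
    stat = 2.0 * (lnl_complex - lnl_simple)
    if stat < 0:
        raise ValueError("the complex model should not fit worse than the simple one")
    # Upper tail of chi-squared with df = 1: P(X > stat) = erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Placeholder log-likelihoods (NOT real results): Gamma only vs. Gamma + invariant sites
stat, p_value = lrt_one_df(lnl_simple=-10234.56, lnl_complex=-10229.81)
print(f"2*delta lnL = {stat:.3f}, p = {p_value:.4f}")
```

A p-value below the chosen significance level (commonly 0.05, corresponding to a test statistic above 3.84) indicates that adding the invariant sites parameter significantly improves the fit.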

What dataset did you use?
What was the estimated proportion of invariant sites?
What was the estimated shape parameter?
What does a shape parameter of this magnitude signify?

2) Using the same dataset and the same model in TREE-PUZZLE

Invoke TREE-PUZZLE from the command line by typing "puzzle"

Use the tree from (1) as user tree (option k). Take your time in selecting the correct model (four rate categories plus invariant sites)!

Did all sequences pass the test for homogeneous composition?
What was the proportion of invariant sites?
What shape parameter was estimated?

2B) Strict Molecular Clock

Repeat the analysis from above, but test whether a strict molecular clock is compatible with the data (option z).

Set the proportion of invariant sites (pinvar) and the shape parameter (alpha) to the values from the previous analysis.

For a large dataset, this might take some time (about 20 minutes for archaea_euk.phy). If you want to regain control of the command line, you can send the running puzzle process into the background. To do so, stop the process in the foreground by pressing <ctrl> and <z> simultaneously. Then restart the process in the background by typing
bg %1

Where did TREE-PUZZLE place the root?
Was the clock model rejected? What was the difference in 2 x log-likelihood?
What else can you learn from the outfile?
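
To check the clock test reported in the outfile by hand, the same likelihood ratio logic applies, but with more degrees of freedom: an unrooted tree with n sequences has 2n - 3 branch lengths, a clock tree has n - 1 node heights, so the difference is n - 2. The sketch below computes the chi-squared tail probability for any positive integer number of degrees of freedom using only the standard library. The function names and the example numbers are hypothetical; use the log-likelihoods and the number of sequences from your own outfile.

```python
import math


def chi2_sf(stat, df):
    """Upper tail probability of a chi-squared distribution (integer df >= 1)."""
    if df < 1 or stat < 0:
        raise ValueError("need df >= 1 and stat >= 0")
    if stat == 0:
        return 1.0
    x = stat / 2.0
    # Start from the closed forms for df = 1 (a = 1/2) or df = 2 (a = 1) ...
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(x))
    else:
        a, q = 1.0, math.exp(-x)
    # ... and step up with Q(a + 1, x) = Q(a, x) + x^a * exp(-x) / Gamma(a + 1)
    while a < df / 2.0:
        q += math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
        a += 1.0
    return q


def clock_lrt(lnl_no_clock, lnl_clock, n_sequences):
    """LRT of clock vs. no clock; degrees of freedom = n_sequences - 2."""
    stat = 2.0 * (lnl_no_clock - lnl_clock)
    df = n_sequences - 2
    return stat, df, chi2_sf(stat, df)


# Placeholder values (NOT real results)
stat, df, p_value = clock_lrt(lnl_no_clock=-5120.3, lnl_clock=-5131.9, n_sequences=12)
print(f"2*delta lnL = {stat:.2f}, df = {df}, p = {p_value:.4f}")
```

If the p-value is below the chosen significance level, the strict clock is rejected for this dataset.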

======================== END Assignment================

Work on your student project!