MCB 5472 : Inferring Phylogenies #1

Please send your answers by email to gogarten@uconn.edu, or hand in a hard copy.

Please let me know how far you got during the lab. If most students did not finish, we may continue this next week!

1) ML on protein sequences using PhyML on the cluster

a) For a dataset of your choice (aligned amino acid or nucleotide sequences), use PhyML (enter "phyml" at the command line) to calculate the tree with the highest likelihood. Use a model for Among Site Rate Variation (ASRV) in which the proportion of invariant sites is estimated from the data and the remaining sites are described by 4 rate categories. These categories are a discrete approximation of a continuous Gamma distribution whose shape parameter is also estimated from the data.

If you do not have a dataset, you can use atp_all.phy.

Notes:

PhyML is also implemented as part of seaview. If you only want to analyze a single dataset, this might be the easiest way to go.
PhyML comes with a command-line version that makes it easier to call the program from scripts. To get information on how to use the command-line version, type phyml - or check the manual at http://www.atgc-montpellier.fr/phyml/usersguide.php
There are also several online servers.
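
For scripted runs, the model described in (a) might be passed to the command-line version roughly as sketched below. This is a minimal Python sketch, not a required part of the assignment. It assumes the PhyML 3 options -i (input file), -d (data type), -m (substitution model), -c (number of rate categories), -a e (estimate alpha) and -v e (estimate the proportion of invariant sites). The helper name phyml_args and the choice of the LG model are illustrative assumptions; check the options against the manual for the installed version.

```python
def phyml_args(alignment, datatype="aa", model="LG"):
    """Build a PhyML argument list for Gamma(4) + invariant sites.

    The option names are assumed from the PhyML 3 manual; "LG" is only
    an example substitution model, not a recommendation.
    """
    return [
        "phyml",
        "-i", alignment,   # alignment in (relaxed) phylip format
        "-d", datatype,    # "aa" for protein, "nt" for nucleotides
        "-m", model,       # substitution model (assumed choice)
        "-c", "4",         # four discrete Gamma rate categories
        "-a", "e",         # estimate the Gamma shape parameter
        "-v", "e",         # estimate the proportion of invariant sites
    ]


# Show the command line that would be run for the example dataset.
print(" ".join(phyml_args("atp_all.phy")))
```

To start the analysis from Python, the returned list can be passed to subprocess.run.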

Note: PhyML uses a "relaxed" phylip format that allows the names of OTUs to be longer than 10 characters. In seaview this works seamlessly with protpars and protdist; however, other programs that read the phylip format allow OTU names of only 10 characters. In that case, use clustalw2 or a similar program to generate the .phy formatted alignments.
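
Whether an alignment would need shortened names can be checked quickly. The sketch below assumes a relaxed phylip file in which the first line gives the number of sequences, each line of the first block starts with the OTU name, and the name is separated from the sequence by whitespace. The function name long_otu_names and the toy alignment are hypothetical.

```python
def long_otu_names(phylip_text, max_len=10):
    """Return the OTU names longer than max_len characters.

    Assumes relaxed phylip: header line "ntaxa nsites", then one line per
    taxon (first block) starting with a whitespace-delimited name.
    """
    lines = [ln for ln in phylip_text.splitlines() if ln.strip()]
    n_taxa = int(lines[0].split()[0])
    names = [ln.split()[0] for ln in lines[1:1 + n_taxa]]
    return [name for name in names if len(name) > max_len]


# Toy alignment (made up for illustration only).
example = """2 8
Homo_sapiens_ATPase  MKTAYIAK
E_coli     MKTAYLAK
"""
print(long_otu_names(example))
```

An empty list means that all names already fit the strict 10-character limit.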

Note 2: By default PhyML calculates support values using an approximate likelihood ratio test (aLRT). These are comparable in stringency to posterior probabilities, but usually higher than support values calculated through non-parametric bootstrap.

What dataset did you use?
What was the estimated proportion of invariant sites?
What was the estimated shape parameter?
What does a shape parameter of this magnitude signify?

2) Using the same dataset and the same model in TREE-PUZZLE

Invoke TREE-PUZZLE from the command line by typing "puzzle".

Use the tree from (1) as usertree (option k). Take your time in selecting the correct model (four rate categories plus invariant sites)!

Did all sequences pass the test for homogeneous composition?
What was the proportion of invariant sites?
What shape parameter was estimated?

2B) Strict Molecular Clock

Repeat the analyses from above, but test whether a strict molecular clock is compatible with the data (option z).

Set pinvar and alpha to the values estimated in the previous analysis.

For a large data set, this might take some time (about 20 minutes for archaea_euk.phy). Once the analysis is running, either open a new ssh connection and qlogin to a different node (preferred), or send the running puzzle process into the background. For the latter, suspend the foreground process by pressing <ctrl> and <z> simultaneously, then resume it in the background by typing
bg %1

While waiting, continue with your student project.

Where did TREE-PUZZLE place the root?
Was the clock-model rejected? What was the difference in 2log(likelihood)?
What else can you learn from the outfile?
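
The clock test reported in the outfile is a likelihood ratio test. As a rough cross-check, a minimal Python sketch is given below. It assumes the standard setup in which twice the log-likelihood difference is compared with a chi-square distribution with n - 2 degrees of freedom for n sequences (an unrooted tree has 2n - 3 branch lengths, a clock tree has n - 1 node heights). The function names and the log-likelihood values in the usage lines are hypothetical, not results from any dataset.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) * sum_{i < df/2} y^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= y / i
            total += term
        return math.exp(-y) * total
    # Odd df: erfc(sqrt(y)) + exp(-y) * sum_{i=1..(df-1)/2} y^(i-1/2) / Gamma(i+1/2)
    total = 0.0
    term = math.sqrt(y) / math.gamma(1.5)
    for i in range(1, (df - 1) // 2 + 1):
        if i > 1:
            term *= y / (i - 0.5)
        total += term
    return math.erfc(math.sqrt(y)) + math.exp(-y) * total


def clock_lrt(lnl_no_clock, lnl_clock, n_taxa):
    """Return (test statistic, degrees of freedom, p-value) for the clock test."""
    stat = 2.0 * (lnl_no_clock - lnl_clock)
    df = n_taxa - 2  # assumed: (2n - 3) branch lengths minus (n - 1) node heights
    return stat, df, chi2_sf(stat, df)


# Hypothetical log-likelihoods for a 4-sequence alignment.
stat, df, p = clock_lrt(-1000.0, -1003.0, 4)
print(stat, df, p)
```

A p-value below the chosen significance level (for example 0.05) means the clock model is rejected.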

3) Puzzleboot (only if you have time, and you want to try this).

puzzleboot is a UNIX shell-script program that allows the distance matrix option of PUZZLE to be used in the context of a bootstrap analysis with PHYLIP programs (something which PUZZLE was not originally designed to do).

Use a file of your choice (needs to be phylip formatted).

Create 100 bootstrap samples using seqboot.
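
Conceptually, each bootstrap sample is built by drawing alignment columns with replacement until the replicate has the original length. A minimal Python sketch of that resampling step follows. It illustrates the idea only and is not a substitute for seqboot, which writes the replicates in phylip format for the downstream programs. The function name and the toy alignment are assumptions.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment maps OTU name -> aligned sequence (all of equal length).
    The same column indices are used for every sequence, so each
    resampled column is an intact column of the original alignment.
    """
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


# Toy alignment (made up); a fixed seed makes the run repeatable.
toy = {"A": "ACGT", "B": "TGCA"}
rng = random.Random(42)
replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]
print(len(replicates), replicates[0])
```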

Copy puzzleboot_mod.sh and puzzle.cmds into the directory where you want to run the script.
CAREFUL: puzzleboot removes all outfiles and outtrees from the directory it runs in!

Change the permissions of puzzleboot_mod.sh so that it is executable:
chmod u+x puzzleboot_mod.sh

Change the responses in puzzle.cmds so that they correspond to the model you want to use. You probably want to fix pinvar and alpha to values you already estimated! (If you don't, the program spends a long time finding ML estimates for these parameters.)

Hint: to help with troubleshooting, run puzzleboot first on the original data (one file). Move on to the bootstrap samples only after you are happy with the commands file.

Run seqboot on your data. To execute the script type:
./puzzleboot_mod.sh your_file_name_that_contains_the_bootstrapped_samples

The output consists of 100 distance matrices. Run them through neighbor or fitch; you need to select the m option (multiple data sets) so that all 100 matrices are analyzed.

You get the support values with consense.

======================== END Assignment================

Work on your student project!