==============The first two exercises are optional; you should already have completed them last week.=============
1) ML-based reconstruction on protein sequences using PhyML
a) For a dataset of your choice or here (tyrRS), use PhyML (enter "phyml" at the command line) to calculate the tree with the highest likelihood. Use a model for Among Site Rate Variation (ASRV) in which the proportion of invariant sites is estimated from the data, and in which the remaining sites are described by 4 rate categories that form a discrete approximation of a continuous Gamma distribution whose shape parameter is also estimated from the data.
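If you would rather not step through the interactive menu, the same ASRV settings can probably be passed as command-line options. The sketch below assumes the PhyML 3.x option names (-i, -d, -m, -c, -a, -v). The alignment name tyrRS.phy and the WAG substitution matrix are placeholders that this exercise does not specify, so check the PhyML manual for the version installed on the cluster before relying on it.

```shell
# Hypothetical input file name; replace it with your own PHYLIP-formatted alignment.
aln="tyrRS.phy"

# -d aa  : amino acid data
# -m WAG : substitution matrix (an assumption; choose the one you want)
# -c 4   : four discrete rate categories
# -a e   : estimate the Gamma shape parameter (alpha) from the data
# -v e   : estimate the proportion of invariant sites from the data
phyml_args="-i $aln -d aa -m WAG -c 4 -a e -v e"

if command -v phyml >/dev/null 2>&1 && [ -f "$aln" ]; then
    # Unquoted on purpose so the shell splits the string into separate options.
    phyml $phyml_args
else
    # Dry run when phyml or the alignment is not available on this machine.
    echo "would run: phyml $phyml_args"
fi
```

The dry-run branch only prints the command, which makes it easy to check the options before starting a long run.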
2) Using the same dataset and the same model in TREE-PUZZLE
Invoke TREE-PUZZLE from the command line by typing "puzzle"
Use the tree from (1) as usertree (option k). Take your time in selecting the correct model (four rate categories plus invariant sites)!
=========================End optional exercise #1 ! ===========================
3) Strict Molecular Clock
Repeat the analyses from above, but test whether a strict molecular clock is compatible with the data (option z).
Use the pinvar and alpha values from the previous analysis.
For a large data set, this might take some time (about 20 minutes for archaea_euk.phy). Once it is running, open a new ssh connection and qrsh to a different node (preferred). Alternatively, you can send the running puzzle process into the background: stop the foreground process by pressing <ctrl> and <z> simultaneously, then restart it in the background by typing
bg %1
While waiting, continue with 5 below.
4) User trees in Tree-Puzzle (Confidence Sets)
One of the uses of TREE-PUZZLE is the ease with which it lets you calculate confidence sets. This is important when you want to decide whether the tree you obtained from a data set is significantly different from expectation. For example, if you analyze the phylogeny of a gene and it differs from the expected organismal phylogeny, you need to decide whether the gene phylogeny is significantly different from the organismal one. The significance level reported corresponds to the probability that the gene's data set could have been generated under the organismal phylogeny (or whatever you want to compare the gene family to). Puzzle uses a 5% significance level, meaning that in 5% of the cases you will deem the phylogenies different even though the gene's data were in fact generated under the organismal tree. If you want more control over the significance level, and if you want to use the software that represents the dernier cri, you'll need to use CONSEL, which we normally use in conjunction with PAML but which now also works in conjunction with TREE-PUZZLE (I have not tried this yet).
Puzzle expects the trees you want to test in a single file. The first line contains the number of trees. The trees are in Newick format; branch lengths and internal labels are allowed but ignored. Trees generated by phyml, clustalw, or the programs from PHYLIP work without any problem. To modify a tree, you can either edit the tree in parenthesis notation using a text editor, or use tree editing software. The most useful is the edit option in treeview, but this is only available under Microsoft Windows. Under OSX you can use TreeEdit (a copy of the program is also here), but this application dies frequently, and you need to export each individual tree to a file, selecting Newick format (the copy/paste option uses PAUP's Nexus format, but you can copy and paste from a text file in Newick format to get a tree into TreeEdit).
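The file layout described above can be sketched as follows. The taxon names A to D, the two topologies, and the file name usertrees.tre are made up for illustration; with real data the names must match those in your alignment.

```shell
# Write a usertree file: first line = number of trees, then one Newick tree per line.
cat > usertrees.tre <<'EOF'
2
(A,B,(C,D));
(A,C,(B,D));
EOF

# Sanity check: the declared count should match the number of trees in the file.
n_declared=$(head -n 1 usertrees.tre)
# Counts lines containing a ';' (valid here because each tree sits on its own line).
n_found=$(grep -c ';' usertrees.tre)
echo "declared: $n_declared, found: $n_found"
```

A mismatch between the two numbers usually means a tree was added or removed without updating the first line.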
For a dataset of your choice or here (tyrosylRS), try to generate a couple of user trees. The phyml tree for the sample dataset is here.
An example of a usertree file is here (for the ATPases) or here (for the tyrRS archaea_euk.phy dataset). What questions might you want to address using usertrees?*
Start treepuzzle, load the dataset, select an appropriate model, start the analysis and load the usertrees.
For a large data set, this analysis might take some time (about 20 minutes for archaea_euk.phy). Once it is running, open a new ssh connection and qrsh to a different node (preferred). Alternatively, you can send the running puzzle process into the background: stop the foreground process by pressing <ctrl> and <z> simultaneously, then restart it in the background by typing
bg %1
(ctrl-z followed by bg can come in handy in a lot of different situations).
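<ctrl>-<z> and bg only work in an interactive shell, so the sketch below instead starts a job directly in the background with & and then shows how to check on it and wait for it. Here sleep 2 is only a stand-in for a long puzzle run. In an interactive shell, jobs lists your background jobs and fg %1 brings job 1 back to the foreground.

```shell
# Stand-in for a long-running analysis, started in the background with &.
sleep 2 &
job_pid=$!    # process ID of the most recent background job

# kill -0 sends no signal; it only checks that the process still exists.
if kill -0 "$job_pid" 2>/dev/null; then
    running=yes
else
    running=no
fi
echo "background job $job_pid running: $running"

# Block until the background job finishes and record its exit status.
wait "$job_pid"
wait_status=$?
echo "job finished with exit status $wait_status"
```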
While waiting, read through this output generated by treepuzzle on a set of ATP synthase catalytic subunits and their homologs. What do you conclude from these analyses?
5) MrBayes is installed on the cluster. You invoke it by typing mb at the command line. In case you want to install the software on additional computers, go here and follow instructions to download and install MrBayes.
The goal of this exercise is to learn how to use MrBayes to reconstruct phylogenies.
===================== more optional exercises =======================
6) Puzzleboot (only if you have time, and you want to try this).
puzzleboot is a UNIX shell-script program that allows the distance matrix option of PUZZLE to be used in the context of a bootstrap analysis with PHYLIP programs (something which PUZZLE was not originally designed to do).
If you want to analyze a phylogeny, it usually is a good idea to have at least 2 support values for each branch. Puzzleboot is a popular way to obtain one set of these support values.
Its advantage is that the pairwise distances are calculated using a sophisticated model that is rather insensitive to long branch attraction. And, as is true for other distance matrix analyses, the program is pretty fast.
Use a file of your choice (needs to be phylip formatted).
Copy puzzleboot_mod.sh and puzzle.cmds into the directory where you want to run the script.
CAREFUL: puzzleboot removes all outfiles and outtrees from the directory it runs in!
Change the permissions of puzzleboot_mod.sh so that you can execute it:
chmod u+x puzzleboot_mod.sh
Change the responses in puzzle.cmds so that they correspond to the model you want to use. You probably want to fix pinvar and alpha to values you already estimated!
Hint: to help with troubleshooting, run puzzleboot first on the original data (one file), and move to the bootstrapped samples only after you are happy with the commands file.
Run seqboot on your data. To execute the script type:
./puzzleboot_mod.sh your_file_name_that_contains_the_bootstrapped_samples
The output consists of 100 distance matrices; run them through neighbor or fitch. You need to select the m option (multiple data sets) :) !
You get the support values with consense.
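Because puzzleboot removes outfiles and outtrees from its working directory, it may be worth moving earlier results out of the way before you start. A minimal sketch, run here in a scratch directory with dummy files; the backup directory name saved_results is an assumption, while outfile and outtree are the default output names of the PHYLIP programs.

```shell
# Scratch directory with dummy results, standing in for your real working directory.
workdir=$(mktemp -d)
echo "earlier distance matrix" > "$workdir/outfile"
echo "(A,B,(C,D));" > "$workdir/outtree"

# Move any existing outfile/outtree into a backup directory before running puzzleboot.
mkdir -p "$workdir/saved_results"
for f in outfile outtree; do
    if [ -e "$workdir/$f" ]; then
        mv "$workdir/$f" "$workdir/saved_results/$f"
    fi
done

# Show what was saved.
ls "$workdir/saved_results"
```

In real use you would replace the scratch directory with the directory in which you plan to run the script.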
======================== END OPTIONAL Part 2================
Your main task for today is to work on your student project !
* Comment: Often you want to test trees that have constraints, e.g., all Eukaryotes or all Halobacteria in a single clade. You might not want to, or be able to, go through all permutations possible under this constraint. PAML and TREEFINDER allow the user to reconstruct the ML tree given a constraint. PAML is slow in its heuristic search algorithm; TREEFINDER is much faster. The program and instruction manual are available at http://www.treefinder.de/ . When I used the program, I used the graphics interface version to develop the correct commands and then I ran these for longer problems on the cluster.