MCB 5472 : Inferring Phylogenies (continued)

==============The first two exercises are optional, you should have completed them already last week.=============

1) ML based reconstruction on protein sequences using PhyML

a) For a dataset of your choice or here (tyrRS) use PhyML (enter "phyml" at the command line) and calculate the tree with the highest likelihood using a model for Among Site Rate Variation (ASRV) that has a proportion of invariant site estimated from the data, and that describes the remaining sites with 4 rate categories that are a discrete approximation of a continuous Gamma distribution whose shape parameter is estimated from the data.

What dataset did you use?
What was the estimated proportion of invariant sites?
What was the estimated shape parameter?
What does a shape parameter of this magnitude signify?

2) Using the same dataset and the same model in TREE-PUZZLE

Invoke TREE-PUZZLE from the command line by typing "puzzle"

Use the tree from (1) as usertree (option k). Take you time in selecting the correct model ! (four rate categories plus invariant sites).

Did all sequences pass the test for homogeneous composition?
What was the proportion of invariant sites?
What shape parameter was estimated?

=========================End optional exercise #1 ! ===========================

3) Strict Molecular Clock

Repeat the analyses from above, but estimate if a strict molecular clock is compatible with the data (option z).

Select the pinvar and alpha from the previous analysis.

For a large data set, this might take some time (about 20 minutes for archaea_euk.phy). Start it, once it is running, open a new ssh connection , and qrsh to a different node (preferred), alternatively you can send the process (running puzzle) into background. For the latter, stop the process in foreground by pressing down <ctrl> and <z> simultaneously. Then restart the process in background by typing
bg %1

While waiting, continue with 5 below.

Where did treepuzzle place the root?
Was the clock-model rejected? What was the difference in 2log(likelihood)?
What else can you learn from the outfile?

4) User trees in Tree-Puzzle (Confidence Sets)

One of the uses for tree-puzzle is the ease with which it allows to calculate confidence sets. This is important when you want to decide if the tree you obtained from a data set is significantly different from expectation. For example: you analyze the phylogeny of a gene that is different from the expected organismal phylogeny, you need to decide if the gene phylogeny is significantly different from the organismal one. The significance level reported corresponds to the probability that the gene's data set could have been generated under the organismal phylogeny (or whatever you want to compare the gene family to). Puzzle uses a 5% significance level, meaning that in 5% of the cases you will deem the phylogenies as different, even though the gene tree might have been calculated under the organismal tree. If you want more control on the significance level, and if you want to use the software that represents the dernier crie, you'll need to use consel which we normally use in conjunction with PAML, but which now also works in conjunction with treepuzzle (but I have not tried this yet).

Puzzle expects the trees you want to test in a single file. The first line contains the number of trees, the trees are in Newick format, branchlengths and internal labels are possible but idnored. Trees generated by phyml, clustalw, or by programs from PHYLIP work without any problem. To modify a tree, you can either edit the tree in parenthesis notation using a texteditor, or use a tree editing software. The most useful is the edit option in treeview, but this is only abailable under Microsoft. Under OSX you can use tree edit (a copy of the progam is also here), but this application dies frequently, and you need to export each individual tree to file selecting Newick format (the copy paste option uses PAUP's Nexus format, but you can use copy/paste from a text file in Newick format to get a tree into TreeEdit).

For a dataset of your choice or here (tyrosylRS), try to generate a couple of user trees. The phyml tree for the sample dataset is here.

An example for a usertree file is here (for the ATPases) or here (for the tyrRS archaea_euk.phy dataset). What questions might you want to address using usertrees?*

Start treepuzzle, load the dataset, select an appropriate model, start the analysis and load the usertrees.

Which dataset did you use?
What usertrees did you generate? (Describe them in words.) What do these trees test?
Which model for Among Site Rate Variation did you choose?

For a large data set, this analysis might take some time (about 20 minutes for archaea_euk.phy). Start it, once it is running, open a new ssh connection , and qrsh to a different node (preferred), alternatively you can send the process (running puzzle) into background. For the latter, stop the process in foreground by pressing down <ctrl> and <z> simultaneously. Then restart the process in background by typing
bg %1 (ctrl-z bg can come in handy in a lot of different situations).

While waiting read though this output generated by treepuzzle on a set of ATP synthase catalytic subunits and their homologs. What do you conclude from these analyses?

Could the flagellar ATPase subunits have evolved from the bacterial alpha or beta subunits?
Could the A-subunit be the ortholog to the alpha subunits?

Can you conclude anything from your usertree analysis? If yes, what:

5) MrBayes is installed on the cluster. You invoke it by typing mb at the command line. In case you want to install the software on additional computers, go here and follow instructions to download and install MrBayes.

The goal of this exercise is to learn how to use MrBayes to reconstruct phylogenies.

Save testseq1c.nex (catalytic ATPase subunits) (or sequences of your choice, or here (glycyl-tRNA synthetase) or here (bacterial threonyl-tRNA synthetase)--The latter 2 analyses might take a long time. I placed the files into an archive. After you expand the archive, execute the file. You then can summarize the results from the previous run. The archives are here and here.) into the folder from which you want to run the MrBayes Program. The 1st is the dataset of vacuolar ATPase A subunit you used earlier, but in the NEXUS format that MrBayes reads (clustal and many other programs write nexus formated files, but usually, you need to go into the file with a texteditor and change things like the way gaps are treated, or the data type). The other two are aminoacyl tRNA synthetases. You might need to rename the file after downloading. Start MrBayes by typing mb
At the MrBayes command prompt type "execute filename" (where filename is the name of nexus file you want to analyze). This will load the data into the program.
Type some of the following commands one by one to select a model and settings for your analyis:

prset aamodelpr=fixed(jones) [this sets the substititution matrix to JTT, a mdern version of PAM]
lset rates=invgamma
lset ngammacat=4 [this selects a model in which ASRV is described by a gamma distribution approaximated by 4 categories]
mcmcp samplefreq=50 printfreq=50 nchains=2 startingtree=random [this sets the frequency with which the "robot" reports results to the screen and to the files (different files for parameters (.p) and trees (.t))]
mcmcp savebrlens=yes [mcmcp sets parameters for the chain. savebrlens tells it to save the trees with branchlengths]
[mcmcp filename=testseq1c] [if you use this command, it tells the program to save the files under a certain name. This is handy, if you want to read in data from a previous run, but it usually is easier to go with the default, which in this case is testseq1c.nex]
mcmc ngen=10000 [this starts the chain, at the end you'll be asked if you want to continue the run, read the stuff below while you wait]
sump [or sump filename=testseq1c] [this summarizes the data stored in the parameter file]

help prset lists the settings for the priors, help lset lists the settings for the evolutionaly model, and help mcmcp lists the settings of the chains.
Information on the BETA and DIRILECHT distributions is here and here. More info is in the MrBayes Manual.
Before you start the run type showmodel.

Rather than typing the commands one by one you could use a MrBayes block at the end of the input file, e.g.:
begin mrbayes;
    prset aamodelpr=fixed(jones);
    mcmcp samplefreq=50 printfreq=50 nchains=2 startingtree=random;
    mcmcp savebrlens=yes filename=testseq1c;
    mcmc ngen=20000; [you need to add set autoclose=yes to get to the next step]
    sump filename=testseq1c;
end;
NOTE: The blue set of parameters (two chains only, no among site rate variation leads to very fast execution, but it is only recommended to create data that you can try sump and sumt on; this does NOT represent a reasonable set of parameters!)

" prset" sets the priors for the model used to calculate likelihoods. In this case we choose the substitution parameters from the JTT amino acid substitution model (Jones et al., 1992).
" mcmcp " sets parameters for Markov Chain Monte Carlo: we set to sample every 50 generation, to print results to a screen every 50th generation, run 2 chains simultaneously, start with random tree, and save branch lengths.
" mcmc " actually runs the chain, and we set it to run for 20000 generations.
The program runs to analyses in parallel (by default each with 4 chains, and three of these chains are heated; we use only two to make things a little faster--- it definitely is a good idea to run mb on a fast and remote computer). The smaller the reported average standard deviation of split frequencies is, the more reliable the result (i.e., your run is close enough to infinitely long). When the value is below .015, or when your patience is exhausted, terminate the run, by typing no at the prompt.

After the run is finished, the " sump " command will plot the logL vs. generation number, that allows to determine the necessary burnin (you want to discard those samples as "burnin" where the -logL is still rising steadily).

[Rather than using the sump command, you also can import the parameter file into EXEL and plot the logL values as a chart in EXEL. See below.]

During the start of the run, the likelihood rapidly increases by
orders of magnitude. If the first samples are included in the plot, one
really cannot see if the likelihood values fluctuate around a constant
value. You can exclude the first couple of samples by specifying a burnin.
sump burnin=20 excludes the first 20 samples.
Note, that the y-axis is rescaled automatically, which provides you a better spread to judge it the chain is "mixing well", or still in the initial orientation.

Type
" sumt burnin=20 " or " sumt filename=testseq1c burnin=20 "
, where you need to substitute '20' with the number you obtained in the previous step of the exercise (Note: that burnin values is the # of generations divided by 50, since we sample every 50th generation). This command creates a consensus tree, and shows posterior probabilities of the clades. You can take a look at the tree in Treeview or njplot by loading the testseq1.con file.

Which branch in the tree is the longest?
How long is it?
What is the measure?
Can you explain in a few words, why is it important to exclude a 'burnin' from our analyses?
Did your analysis extimate a shape parameter and a % invariant sites?
If yes, what values did you obtain, and what is the 95% credibility intervall for these values? (To estimate the latter, load the parameter file into MSExel, copy the values for each generation after the burnin and sort them. Then discard the top and bottom 2.5% of the values. The reaminder represents the 95% credibility interval)

===================== more optional exercises =======================

6) Puzzleboot (only if you have time, and you want to try this).

puzzleboot is a UNIX shell-script program that allows the distance matrix option of PUZZLE to be used in the context of a bootstrap analysis with PHYLIP programs (something which PUZZLE was not originally designed to do).

If you want to analyze a phylogeny, it usually is a good idea to have at least 2 support values for each branch. Puzzleboot is a popular way to obtain one set of these support values.
Their advantage is that the pairwise distances are calculated using a sophisticated model, that is rather insensitive to long branch attraction. But as is true for other distance matrix analyses, the program is pretty fast.

Use a file of your choice (needs to be phylip formatted).

Copy puzzleboot_mod.sh and puzzle.cmds into the directory where you want to run the script. .
CAREFUL: puzzleboot removes all outfiles and outtrees from the directory it runs in!

change permission of puzzleboot_mod.sh
chmod u+x puzzleboot_mod.sh

Change the responses in puzzle.cmds to that they correspond to the model you want to use. You probably want to fix pinvar and alpha to values you already estimated!

Hint: to help trouble shooting, run the puzzleboot first on the original data (one file), move to the bootstrap sample only after you are happy with the commands file.

Run seqboot on your data. To execute the script type:
./puzzleboot_mod.sh your_file_name_that_contains_the_bootstrapped_samples

The output consists of 100 distance matrices, run them through neighbor or fitch. You need to select the m option :) !

You get the support values with consense.

======================== END OPTIONAL Part 2================

Your main task for today is to work on your student project !

* Comment: Often you want to test trees that have constrains, e.g., all Eukaryotes or all Halobacteria in a single clade. You might not want to, or be able to, go through all possible permutations possible under this constraint. PAML and TREEFINDER allow the user to reconstruct the ml tree given a constraint. PAML is slow in its heuristic search algorithm, treefinder is much faster. The program and instruction manual is available at http://www.treefinder.de/ . When I used the program, I used the graphics interface version to develop the correct commands and then I ran these for longer problems on the cluster.