MCB 5472 : Inferring Phylogenies

Please send your answers by email to bioinf@carrot.mcb.uconn.edu, or hand in a hard copy.

Please let me know how far you got during the lab. If most students didn't finish, we may continue this next week!

You should answer the questions in red!

You can do these exercises with the sequences in atp_all.phy or infile1.txt, or you can use an alignment of your choice.
REMEMBER: Programs in PHYLIP treat the "-" symbol as an informative character (a 5th base, or a 21st amino acid). If you want to treat gaps as missing characters, you need to replace them with "?".

To do so in vi:
vi filename.phy (enter)
:%s/-/?/g (enter)
ZZ (saves the file and quits; no enter needed)
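
The same substitution can also be done without opening an editor. The lines below are a minimal sketch using sed; demo.phy and demo.nogaps.phy are made-up file names, and the tiny alignment is created only so the example runs as-is. Like the vi command, this replaces every "-" in the file, including any hyphen in a sequence name.

```shell
# Hypothetical two-sequence alignment, only to make the example runnable.
printf ' 2 8\nseqA      AC-GT--A\nseqB      ACCGT-TA\n' > demo.phy

# Replace every "-" with "?" and write the result to a new file,
# leaving the original untouched (same substitution as :%s/-/?/g in vi).
sed 's/-/?/g' demo.phy > demo.nogaps.phy

cat demo.nogaps.phy
```

Writing to a new file keeps the original alignment available in case you later want gaps treated as characters again.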

Preparation:

 

1) Protein parsimony analysis using Phylip

a) In phylip_temp execute the program seqboot by typing
   > seqboot
Read the menu and enter the appropriate letters to generate 100 pseudo samples using bootstrap. If in doubt, read the manual.

Remember to move your output to a new name:

   > mv outfile your_filename.boot.phy

b) Do a protein parsimony analysis on the original dataset
  > protpars
Read the menu and enter the appropriate letters to perform a heuristic search for the most parsimonious tree. Jumble the input order twice.

Again, remember to move your output to new names:

  > mv outfile your_filename.protpars.outfile
  > mv outtree your_filename.protpars.outtree
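
Since the same two mv commands recur after every PHYLIP program in this lab, a small shell function can save typing. This is only a convenience sketch: save_outputs is a hypothetical helper name, not something that ships with PHYLIP.

```shell
# Hypothetical helper (not part of PHYLIP): rename outfile and outtree
# with a common prefix so they are out of the way before the next run.
save_outputs() {
  prefix=$1
  for f in outfile outtree; do
    if [ -e "$f" ]; then
      mv "$f" "$prefix.$f"
    fi
  done
}

# Usage after a protpars run on the original data:
# save_outputs your_filename.protpars
```

Files that do not exist are skipped, so the same call also works after programs such as seqboot that write only an outfile.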

c) Do a protein parsimony analysis on the pseudo samples (your_filename.boot.phy)
  > protpars
Read the menu and enter the appropriate letters to perform a heuristic search for the most parsimonious tree. (Jumble the input order, but only once.)
  > mv outtree your_filename.boot.protpars.outtree
Calculate a consensus tree from the (100) trees in your_filename.boot.protpars.outtree
  > consense
  > mv outfile your_filename.boot.protpars.consense.outfile
  > mv outtree your_filename.boot.protpars.consense.outtree
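
A quick sanity check on the tree file that goes into consense is to count how many trees it holds. The sketch below relies only on the fact that every Newick tree ends with a single ";"; demo.outtree is a made-up three-tree file so the example runs on its own. On the real file the count may be larger than 100 if protpars saved several equally parsimonious trees for some pseudo samples.

```shell
# Hypothetical tree file with three Newick trees, only for illustration.
printf '(A,(B,C),D);\n(A,(B,D),C);\n((A,B),C,D);\n' > demo.outtree

# Each Newick tree ends with one ";", so counting semicolons counts trees.
n_trees=$(tr -cd ';' < demo.outtree | wc -c)
echo "trees in file: $((n_trees))"
```

For the lab data, replace demo.outtree with your_filename.boot.protpars.outtree.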

Is the topology of the consensus tree different from the most parsimonious tree(s)?

 

2) Protein distance matrix analyses using Phylip

a) Calculate two protein distance matrices from your data.
  > protdist
Read the menu and enter the appropriate letters. For the first matrix, use the JTT substitution matrix without any correction for multiple substitutions (i.e. the default values). When done, rename the outfile.
  > mv outfile your_filename.protdist.outfile
For the second analysis, choose to correct for multiple substitutions using the Gamma correction. Type G, then Y.
You will be asked to enter the
"Coefficient of variation of substitution rate among positions (must be positive) In gamma distribution parameters, this is 1/(square root of alpha)."

If you don't know this parameter, enter 1. For ATP_all.phy this parameter is 0.88; for infile1.txt it is 1.24.
Save the outfile.
  > mv outfile your_filename.protdist_gamma.outfile
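
Because protdist asks for the coefficient of variation (CV) while most other programs report the shape parameter alpha, it helps to be able to convert between the two. From CV = 1/(square root of alpha) it follows that alpha = 1/CV^2. The function name cv_to_alpha below is made up for this sketch.

```shell
# alpha = 1 / CV^2, rearranged from CV = 1 / sqrt(alpha)
cv_to_alpha() {
  awk -v cv="$1" 'BEGIN { printf "%.2f\n", 1 / (cv * cv) }'
}

cv_to_alpha 0.88   # CV quoted above for ATP_all.phy
cv_to_alpha 1.24   # CV quoted above for infile1.txt
```

The two values quoted above correspond to alpha of about 1.29 and 0.65, and the fallback CV of 1 corresponds to alpha = 1.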

b) To calculate trees from the distance matrices use the programs neighbor and fitch (with the global rearrangement option).

  > fitch (follow the menu)
  > mv outfile your_filename.protdist.fitch.outfile
  > mv outtree your_filename.protdist.fitch.outtree

  > neighbor
  > mv outfile your_filename.protdist.neighbor.outfile
  > mv outtree your_filename.protdist.neighbor.outtree

Are the trees from NEIGHBOR and FITCH different in their topology?

  > fitch (this time use the gamma-corrected matrix, your_filename.protdist_gamma.outfile, as input)
  > mv outfile your_filename.protdist_gamma.fitch.outfile
  > mv outtree your_filename.protdist_gamma.fitch.outtree

Is the tree calculated using the gamma correction different in topology from the one calculated without the Gamma correction?

Copy (via afp:// or Fugu) the trees calculated from the distance matrices onto your computer and open them in njplot.
Explore the different options in njplot: re-root the trees in a place that seems appropriate.

What is the difference in the trees calculated from distances with and without gamma correction?

 

3) ML on protein sequences using PhyML

a) For a dataset of your choice (or the one linked here), use PhyML (enter "phyml" at the command line) to calculate the tree with the highest likelihood. Use a model for Among Site Rate Variation (ASRV) in which the proportion of invariant sites is estimated from the data, and the remaining sites are described by 4 rate categories that are a discrete approximation of a continuous Gamma distribution whose shape parameter is also estimated from the data.
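
If the installed PhyML is a 3.x build that accepts command-line options (an assumption; the interactive menu works either way), the model described in (a) would correspond roughly to the options printed by the sketch below. The helper name phyml_cmd, the input file name, and the choice of JTT as substitution matrix are placeholders, not settings prescribed by this lab. The function only prints the command so that you can inspect it before running it.

```shell
# Print (rather than run) a PhyML 3.x command line for the ASRV model in (a).
#   -d aa  amino acid data
#   -m JTT substitution matrix (placeholder choice)
#   -v e   proportion of invariant sites estimated from the data
#   -c 4   four discrete Gamma rate categories
#   -a e   Gamma shape parameter (alpha) estimated from the data
phyml_cmd() {
  printf 'phyml -i %s -d aa -m JTT -v e -c 4 -a e\n' "$1"
}

phyml_cmd your_filename.phy
```

Note the estimated pinvar and alpha values in the PhyML output; they are needed again in the TREE-PUZZLE steps.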

4) Using the same dataset and the same model in TREE-PUZZLE

Invoke TREE-PUZZLE from the command line by typing "puzzle"

Use the tree from (3) as the user tree (option k). Take your time selecting the correct model (four rate categories plus invariant sites)!

=====================This is how far I expect everyone to get today!==================

4B) Strict Molecular Clock

Repeat the analysis from above, but test whether a strict molecular clock is compatible with the data (option z).

Use the pinvar and alpha values estimated in the previous analysis.

For a large data set, this might take some time (about 20 minutes for archaea_euk.phy). Once it is running, either open a new ssh connection and qrsh to a different node (preferred), or send the running puzzle process into the background. For the latter, suspend the foreground process by pressing <ctrl> and <z> simultaneously, then resume it in the background by typing
bg %1
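
The <ctrl>-<z> and bg sequence only works interactively, but the same idea can be tried out with the shell's "&" operator. In the sketch below, "sleep 2" is just a stand-in for a long-running command that no longer needs keyboard input; it is not how puzzle itself is started in this lab.

```shell
# "sleep 2" is a stand-in for any long-running, non-interactive command.
sleep 2 &
pid=$!
echo "started background job with PID $pid"

# ... the shell stays usable here while the job runs ...

wait "$pid"          # block until the background job has finished
status=$?
echo "job finished with exit status $status"
```

In an interactive shell, typing jobs lists suspended and background jobs together with the job numbers used by bg and fg.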

While waiting, continue with 5 below.

5) ML mapping

Use a dataset of your choice or use testseq5.phy.

The latter file contains vacuolar/archaeal ATPases from the following pro- and eukaryotes.

Daucus carota, Arabidopsis thaliana, Gossypium hirsutum are plants;

Acetabularia acetabulum is a green alga and Cyanidium caldarium is a red alga;

Mus, Homo, Bos (mammals), Gallus (bird), Drosophila, Aedes (insects) are animals;

Saccharomyces, Candida, Schizosaccharomyces, and Neurospora are fungi

Dictyostelium discoideum, Entamoeba, Plasmodium falciparum, Trypanosoma, Giardia are protists or protozoa

Sulfolobus acidocaldarius (70 °C), Archaeoglobus fulgidus (83 °C), Methanosarcina barkeri (30-37 °C), Methanosarcina mazeii (37 °C), Methanococcus jannaschii (80 °C), Haloferax volcanii (37 °C), Halobacterium salinarium (ca. 37 °C), Methanobacterium thermoautotrophicum (60-65 °C), Desulfurococcus sp. (85-90 °C), Thermococcus sp. (75+ °C) (archaea),

Enterococcus hirae (37 °C), Borrelia burgdorferi (33-37 °C), Thermus thermophilus (70-80 °C), Deinococcus radiodurans (30 °C) (Bacteria) are prokaryotes. (Usually Bacteria have an F- and not an A-ATPase. The bacteria probably obtained the archaeal/vacuolar type ATPase through horizontal gene transfer.)

The prokaryotes can be considered as outgroup for the eukaryotes.

Design a question that you can address using ML mapping.
For example:
Are the plants, green algae and red algae a monophyletic group?
Is Giardia the deepest branch among the eukaryotic sequences?
Do the animals group with the fungi, or do the fungi go with the plants?
After you have a question, figure out what your 4 groups should be (see the tree below). Only then start the program.

Before entering the final y to run the program, select that you want to run this on all possible quartets (option n, enter 0) with one sequence from each group.
Select a gamma distribution with 8 classes, enter the shape parameter as 0.6. 
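
To get a feeling for how much work "all possible quartets with one sequence from each group" implies, multiply the four group sizes. The sketch below does this for one possible grouping (plants, animals, fungi, prokaryotes as outgroup) and assumes that each name in the list above corresponds to exactly one sequence in testseq5.phy; check your file before relying on these counts.

```shell
# Group sizes counted from the taxon list above (assumption: one sequence per name).
plants=3          # Daucus, Arabidopsis, Gossypium
animals=6         # Mus, Homo, Bos, Gallus, Drosophila, Aedes
fungi=4           # Saccharomyces, Candida, Schizosaccharomyces, Neurospora
prokaryotes=14    # 10 archaea + 4 bacteria

# One sequence from each group: the number of quartets is the product.
quartets=$((plants * animals * fungi * prokaryotes))
echo "quartets to evaluate: $quartets"
```

With these counts the product is 1008 quartets; a different choice of groups changes the number accordingly.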

This is an exercise in thinking and concentration. If you think too little, you will have to go back to the start several times.

The eukaryotic part of the tree calculated with PhyML is as follows; the whole tree is here:

 

6) Puzzleboot (only if you have time, or if you want to try this on your own data).

puzzleboot is a UNIX shell-script program that allows the distance matrix option of PUZZLE to be used in the context of a bootstrap analysis with PHYLIP programs (something which PUZZLE was not originally designed to do).

Use a file of your choice (needs to be phylip formatted).

Copy puzzleboot_mod.sh and puzzle.cmds into the directory where you want to run the script.
CAREFUL: puzzleboot removes all outfiles and outtrees from the directory it runs in!

Change the permissions of puzzleboot_mod.sh so that you can execute it:
chmod u+x puzzleboot_mod.sh
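
Given the warning that puzzleboot removes all outfiles and outtrees, one cautious way to set things up is a dedicated scratch directory that contains nothing else. The directory name puzzleboot_run and the alignment file name are placeholders; the touch line only creates empty stand-in files when the real ones are missing, so that the sketch runs as-is.

```shell
# Stand-ins so the sketch runs anywhere; with the real files present,
# touch merely updates their timestamps.
touch puzzleboot_mod.sh puzzle.cmds your_filename.boot.phy

# Work in a separate directory so earlier results cannot be deleted.
mkdir -p puzzleboot_run
cp puzzleboot_mod.sh puzzle.cmds your_filename.boot.phy puzzleboot_run/
chmod u+x puzzleboot_run/puzzleboot_mod.sh
ls -l puzzleboot_run
```

After this, cd into puzzleboot_run before starting the script.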

Change the responses in puzzle.cmds so that they correspond to the model you want to use. You probably want to fix pinvar and alpha to values you already estimated!

Hint: to help with troubleshooting, first run puzzleboot on the original data (one file), and move to the bootstrap samples only after you are happy with the commands file.

Run seqboot on your data. To execute the script type:
./puzzleboot_mod.sh
your_file_name_that_contains_the_bootstrapped_samples

The output consists of 100 distance matrices. Run them through neighbor or fitch; you need to select the m option (multiple data sets).

You get the support values with consense.