MCB 372 : Inferring Phylogenies

Please send your answers per email to bioinf@carrot.mcb.uconn.edu, or hand in a hardcopy

Please let me know, how far you got during the lab. If most students didn't finish, we may continue this next week!

Complete pymol exercise #3 on coloring according to site conservation from last week (Note: instructions and conservation file are updated.).
Hopefully, this should take less than about 40 minutes.

Conservation profiles, inteins and PyMOL

a. We will color a protein according to its conservation profile and determine where the inteins sit in the VMA protein

b. You will need the conservation profile and protein .pdb file

c. Lets work with the excel file first (note: a perlscript that creates a conservation profile from a multiple sequence alignment is here)

i. In this file the first column gives the alignment position and the second column is the conservation score - the lower the more conserved (the number of aa types in a sliding window)

ii. Sort the profile on the second column smallest to largest

iii. Save your file and put it a side for now

d. Now we will work with the protein files

i. Open 1VDZ.pdb in PyMOL (fetch 1VDZ)

1. On the right under the all object click S then as ribbons

2. In the PyMOL> bar type

a. color gray, all

3. The conservation scores range from 2 to 11 for each number range ie. 2-3, 3-4, 4-5, 5-6, 6-7, 7-8, 8-9, 9-10, 10-11. To make it an object the easiest is to mark each range on the excel sheet

4. For each object we marked off in the excel file select the amino acid positions from the first column

a. So for the 2-3 group select those, copy them then in another cell using right click > paste special and select transpose

b. Copy the transposed list and paste it in text wrangler

c. Then copy the space between the numbers

d. Then Search>find

e. Paste into the top box and replace all with what between, the quotes here ‘ resi ‘ so space resi space

f. Back to pymol

i. Type

ii. select OBJECTnnn, resi then copy and paste the line from, text wrangler into the PyMOL> box

remember, give each number range is a different object name, OBJECT1,OBJECT2 ...

iii. Now all those amino acids are grouped together

iv. Do the same for the rest of the objects replacing the object identifier with the appropriate term of your choice.

g. Once all the objects are made they should show up on the side bar

h. Make one more group for the inteins

i. select intein, resi 300 resi 322

i. The colors pymol is capable of is here http://pymolwiki.org/index.php/Color_Values with the color range of your choice Color each group ie.

i. color red, OBJECT1

ii. color yellow, OBJECT2

iii. ect.

j. To make the intein insertion sites pop out against the rest of the protein select intein on the side bar click S >as spheres

5. Where are the inteins sitting?

==========================================================

You should answer the questions in red!

You can do these exercises either with the sequences in atp_all.phy, ompA.phy or you can use an alignment of your choice.
REMEMBER: Programs in PHYLIP treat the "-" symbol as an informative character (5th base, or 21st amino acid). If you want to treat gaps as missing characters, you need to replace them with "?"

To do so in vi:
vi filename.phy (enter)
:%s/-/?/g (enter)
ZZ (enter)

Preparation:

Install njplot and figtree on your computer (if it isn't installed already). To do so click on the link and download the version for your operating system.
ssh to your account on bbcxsrv1.biotech.uconn.edu and create a directory phylip_temp
> ssh yourname@bbcxsrv1.biotech.uconn.edu
> qrsh
> mkdir phylip_temp
> cd phylip_temp
copy the files with the aligned sequences into the phylip_temp directory (use afp://bbcxsrv1.biotech.uconn.edu or Filezilla or ... )

1) Protein parsimony analysis using Phylip

a) In phylip_temp execute the program seqboot by typing
> seqboot
Read the menu and enter the appropriate letters to generate 100 pseudo samples using bootstrap. If in doubt, read the manual.

Remember to move your output to a new name:

> mv outfile your_filename.boot.phy

b) do a protein parsimony analysis on the original dataset
> protpars
Read the menu and enter the appropriate letters to a heuristic search for the most parsimonious tree. Jumble the input order twice.

Again, remember to move your output to new names:

> mv outfile your_filename.protpars.outfile
> mv outtree your_filename.protpars.outtree

c) do a protein parsimony analysis on the pseudo samples (your_filename.boot.phy)
  > protpars
Read the menu and enter the appropriate letters to a heuristic search for the most parsimonious tree. (Jumble the input order, but only once)
  > mv outtree your_filename.boot.protpars.outtree
Calculate a consensus tree from the (100) trees in your_filename.boot.protpars.outtree
  > consense
   > mv outfile your_filename.boot.protpars.consense.outfile
  > mv outtree your_filename.boot.protpars.consense.outtree
Is the topology of the consensus tree different from the most parsimonious tree(s)? How could this happen?

2) Protein distance matrix analyses using Phylip

a) Calculate two protein distance matrices from your data.
  > protdist
Read the menu and enter the appropriate letters to calculate two distance matrices, one using the JTT substitution matrix without any correction for multiple substitutions (i.e. the default values). When done rename the outfile.
   > mv outfile your_filename.protdist.outfile
For the second analyses select to correct for multiple substitutions using the Gamma correction. Type G, then Y.
You will be asked to enter the
"Coefficient of variation of substitution rate among positions (must be positive) In gamma distribution parameters, this is 1/(square root of alpha)."
If you don't know this parameter, enter 1. In case of ATP_all.phy this parameter is 0.88.
Save the outfile.
   > mv outfile your_filename.protdist_gamma.outfile

b) To calculate trees from the distance matrices use the programs neighbor and fitch (with the global rearrangement option).

  > fitch (follow the menu)
  > mv outfile your_filename.protdist.fitch.outfile
  > mv outtree your_filename.protdist.fitch.outtree

  > neighbor
  > mv outfile your_filename.protdist.neighbor.outfile
  > mv outtree your_filename.protdist.neighbor.outtree

Are the trees from NEIGHBOR and FITCH different in their topology"?

  > fitch
  > mv outfile your_filename.protdist_gamma.fitch.outfile
  > mv outtree your_filename.protdist_gamma.fitch.outtree

Is the tree calculated using the gamma correction different in topology from the one calculated without the Gamma correction?

Copy (via afp:// or Fugu) the trees calculated from the distance matrices onto your computer and open them in njplot.
Explore the different options in njplot: re-root the trees in a place that seems appropriate.

What is the difference in the trees calculated from distances with and without gamma correction?

3) ML on protein sequences using PhyML

a) For a dataset of your choice use PhyML (enter "phyml" at the command line) and calculate the tree with the highest likelihood using a model for Among Site Rate Variation (ASRV) that has a proportion of invariant site estimated from the data, and that describes the remaining sites with 4 rate categories that are a discrete approximation of a continuous Gamma distribution whose shape parameter is estimated from the data.
Notes:

phyml is also implemented as part of seaview - if you only want to analyze a singel datseet, this might be the easiest way to go.
phyml come with a command line version that makes it easier to call the program from scripts. To get information on how to use the commandline version, type phyml - or check the manual (pdf)
there also are several online servers

What dataset did you use?
What was the estimated proportion of invariant sites?
What was the estimated shape parameter?
What does a shape parameter of this magnitude signify?

4) Using the same dataset and the same model in TREE-PUZZLE

Invoke TREE-PUZZLE from the command line by typing "puzzle"

Use the tree from (3) as usertree (option k). Take you time in selecting the correct model ! (four rate categories plus invariant sites).

Did all sequences pass the test for homogeneous composition?
What was the proportion of invariant sites?
What shape parameter was estimated?

4B) Strict Molecular Clock

Repeat the analyses from above, but estimate if a strict molecular clock is compatible with the data (option z).

Select the pinvar and alpha from the previous analysis.

For a large data set, this might take some time (about 20 minutes for archaea_euk.phy). Start it, once it is running, open a new ssh connection , and qrsh to a different node (preferred), alternatively you can send the process (running puzzle) into background. For the latter, stop the process in foreground by pressing down <ctrl> and <z> simultaneously. Then restart the process in background by typing
bg %1

While waiting, continue with 5 below.

Where did treepuzzle place the root?
Was the clock-model rejected? What was the difference in 2log(likelihood)?
What else can you learn from the outfile?

5) Puzzleboot (only if you have time, and you want to try this).

puzzleboot is a UNIX shell-script program that allows the distance matrix option of PUZZLE to be used in the context of a bootstrap analysis with PHYLIP programs (something which PUZZLE was not originally designed to do).

Use a file of your choice (needs to be phylip formatted).

Copy puzzleboot_mod.sh and puzzle.cmds into the directory where you want to run the script. .
CAREFUL: puzzleboot removes all outfiles and outtrees from the directory it runs in!

change permission of puzzleboot_mod.sh
chmod u+x puzzleboot_mod.sh

Change the responses in puzzle.cmds to that they correspond to the model you want to use. You probably want to fix pinvar and alpha to values you already estimated! (if you don't, the program spends a long time to find ml estimates for these parameters)

Hint: to help trouble shooting, run the puzzleboot first on the original data (one file), move to the bootstrap sample only after you are happy with the commands file.

Run seqboot on your data. To execute the script type:
./puzzleboot_mod.sh your_file_name_that_contains_the_bootstrapped_samples

The output consists of 100 distance matrices, run them through neighbor or fitch. You need to select the m option.

You get the support values with consense.