Assignment 8

Dot Plot Exercises and Seaview

Your name:
Your email address:

Dotplots with Gepard

Gepard is a dotplot program that plots the similarity between two sequences, similar to the dot matrix comparison in pairwise BLAST. However, it plots the actual similarities between every window in one sequence and every window in the other sequence. The program is available as a Java jar file. On Windows you can start the program by double-clicking it.
On a Mac this is more complicated
(if you are working in the computer lab, skip to the paragraph starting with "Download the program ...").
Download the program from https://cube.univie.ac.at/gepard (The link to Gepard-1.40.jar is somewhere in the middle of the page under download.)
Move the file to your Applications folder (or to another folder).
Open Terminal, cd to the directory where Gepard-1.40.jar is located, and type
java -jar Gepard-1.40.jar
or, if you put the jar into the Applications folder:
java -jar /Applications/Gepard-1.40.jar

However, this still might not work. (You might get a message like "Exception in thread ...".)
I replaced the Java version I had on my Mac with OpenJDK:
I went to /Library/Java/JavaVirtualMachines and deleted both folders that were there.
Then I went to https://adoptopenjdk.net/archive.html? and downloaded jdk-11.0.9+11.1 (both JRE and JDK).
I installed the packages, and the above command now works beautifully.

Download the program from https://cube.univie.ac.at/gepard (The link to Gepard-1.40.jar is somewhere in the middle of the page under download.)
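
To make the idea of comparing every window in one sequence against every window in the other concrete, here is a minimal Python sketch. It is a naive illustration and not Gepard's own algorithm: it simply counts identical characters per window pair, and the function names, window size, and threshold are assumptions chosen for this example.

```python
def dot_matrix(seq1, seq2, window=3, min_matches=3):
    """Return the set of (i, j) window start positions that count as a dot.

    A dot is placed at (i, j) when the windows seq1[i:i+window] and
    seq2[j:j+window] agree at min_matches or more positions.
    """
    dots = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            pairs = zip(seq1[i:i + window], seq2[j:j + window])
            matches = sum(a == b for a, b in pairs)
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(seq1, seq2, dots):
    """Draw the dots as text: seq1 runs left to right, seq2 top to bottom."""
    rows = []
    for j in range(len(seq2)):
        rows.append("".join("*" if (i, j) in dots else "." for i in range(len(seq1))))
    return "\n".join(rows)


if __name__ == "__main__":
    # Made-up sequences; the second one lacks a short stretch of the first.
    a = "ACGTACGTTTGACC"
    b = "ACGTACGGACC"
    print(render(a, b, dot_matrix(a, b)))
```

Lowering min_matches below the window size lets imperfect matches through, which adds sensitivity and noise in roughly the way the DISPLAY settings do in the exercises below.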

1. Locating an intein in a bacteriophage terminase

Download the genomes of the actinobacteriophages Neos5 and HarveySr, and the two terminases encoded on these genomes (Neos5.fasta, HarveySr.fasta, Gene6_HarveySR.faa and Gene6_Neos5.faa).
These files (and others we might use today) are in this zip file.
Double-click Gepard-1.40.jar (or on a Mac start the program from the command line; see above).
In the menu on the left, select the two phage genomes as sequence 1 and sequence 2.
If you click on the plot, you place a cross hair into the image. The alignment corresponding to the cross hair is depicted below. You can move the cross hair using the cursor arrows on the keyboard.
If this does not give a nice image, play with the settings in the PLOT menu (on the left). Good choices are check marks next to Auto params and Auto matrix. You can also move the sliders in the DISPLAY menu to see more or less noise. About how many nucleotides long is the first insertion in Neos5?

The gene encoded in this region of the genomes is a terminase; its function is to stuff DNA into the phage head.
Gepard at times has difficulty adjusting the scale when you load new sequences. It is best to exit the program, start it again, and load the two gene6 sequences (note: these are amino acid sequences).
Calculate a dotplot comparing the two sequences. Use the cross hair to find the beginning and end of the insertion in Neos5.
What are the first three and the last three amino acids in the Neos5 insertion?
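
As a toy illustration of what "beginning and end of the insertion" means, the following Python sketch locates a single inserted block, assuming the two sequences are otherwise identical. That assumption does not hold for real proteins, which also differ at other positions (this is why the exercise uses the dotplot and the cross hair). The function name and the example sequences are made up.

```python
def locate_insertion(with_insert, without_insert):
    """Return (start, end) so that with_insert[start:end] is the inserted block.

    Assumes the two sequences are identical apart from one insertion.
    Positions are 0-based; end is exclusive, as in Python slicing.
    """
    extra = len(with_insert) - len(without_insert)
    if extra <= 0:
        raise ValueError("the first sequence must be the longer one")
    # Walk along the shared prefix until the sequences disagree.
    start = 0
    while start < len(without_insert) and with_insert[start] == without_insert[start]:
        start += 1
    end = start + extra
    # The remainder after the inserted block must match the shorter sequence.
    if with_insert[end:] != without_insert[start:]:
        raise ValueError("sequences differ by more than a single insertion")
    return start, end


if __name__ == "__main__":
    longer = "MKTCLAHNAAGS"   # made-up sequence containing an insertion
    shorter = "MKTAAGS"       # made-up sequence without it
    start, end = locate_insertion(longer, shorter)
    inserted = longer[start:end]
    print(inserted[:3], inserted[-3:])
```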

2. Repetitive proteins in dot plots

Proteolipids are subunits of the ATP synthase. In the F1 ATPases there are 9-12 of these subunits, each forming a hairpin of two connected alpha helices (see slide 5 from class 3).
In Gepard, compare the Methanocaldococcus protein against itself (the sequence is part of the file archive downloaded above).
Do you see any repetitive units? How many?
Adjust the plot and display parameters to optimize the display of the diagonals parallel to the main diagonal.
Which settings optimize the diagonals without introducing too much noise?
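
A diagonal parallel to the main diagonal in a self-comparison means that the sequence resembles a shifted copy of itself. A rough way to express this in code is to score every shift (offset) of the sequence against itself. The sketch below uses plain identity rather than a substitution matrix, so it is less sensitive than Gepard for diverged repeats; the function names and the min_pairs cutoff are assumptions for illustration.

```python
def offset_identity(seq, offset):
    """Fraction of positions i where seq[i] equals seq[i + offset]."""
    pairs = len(seq) - offset
    if offset < 1 or pairs < 1:
        raise ValueError("offset must be between 1 and len(seq) - 1")
    same = sum(seq[i] == seq[i + offset] for i in range(pairs))
    return same / pairs


def best_offsets(seq, top=3, min_pairs=5):
    """Return the offsets with the highest self-identity, best first.

    Offsets that leave fewer than min_pairs compared positions are skipped,
    because very short overlaps give unreliable fractions.
    """
    scored = [
        (offset_identity(seq, offset), offset)
        for offset in range(1, len(seq) - min_pairs + 1)
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [offset for _, offset in scored[:top]]


if __name__ == "__main__":
    # Made-up sequence built from three copies of the same unit.
    print(best_offsets("ABCDEFGH" * 3, top=2))
```

An offset that scores well corresponds to the distance between the main diagonal and one of the parallel diagonals in the plot.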



Compare the Methanopyrus sequence against the one from Methanocaldococcus. How many equivalents to the single repeat unit in Methanocaldococcus do you find in Methanopyrus?


How many repeats do you identify when you compare the Methanopyrus sequence against itself?


3. Getting to know Seaview

If you are working from home: Download and install Seaview on your computer (you can download the program at http://doua.prabi.fr/software/seaview).

Seaview includes alignment programs (muscle and clustalo) and phylogenetic reconstruction programs (neighbor joining and parsimony analysis from PHYLIP, a collection of programs for phylogenetic analyses written by Joe Felsenstein, and PhyML, a maximum likelihood program).

Advantages of Seaview are:

  • You can designate sites as subsets or groups and analyze them separately.
  • You can save (and read) multiple sequence files in different formats (Seaview has its own format, called mase, and it is recommended that you use it if you specified groups of sites or groups of sequences).
  • You can switch between displaying open reading frames as nucleotide sequences and displaying and aligning them as amino acid sequences, and then go back to the nucleotide sequences.
  • You can modify the alignments by hand.
  • This is a great program to get a quick idea of what is going on in your data sets.

Open Seaview and load the multiple FASTA file Yeast_vma1_all_not_aligned.fst (in the archive downloaded above). The file contains a selection of nucleotide sequences that encode the vacuolar ATPase in different yeasts. Some of these have been invaded by an intein.

3.1) In the Seaview window, select Props and place a check mark next to view as protein. If you downloaded the sequences as ORFs or from an alignment resulting from a tblastn search, you should not have any stop codons (shown as a little * in the view as protein display).

Do you see any stop codons in your sequences? 
Delete the sequence that has stop codons (click on the name of the sequence so that it turns white on black, then select edit -> delete sequences). Uncheck view as protein and save the file in fst format. Then go back to view as protein.
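
The same check can be sketched outside Seaview. The Python example below scans each nucleotide sequence in reading frame 1 for stop codons. It assumes the standard genetic code and that every sequence starts in frame; the function names and the example records are made up, and the text of the real FASTA file could be passed in instead.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def parse_fasta(text):
    """Return {name: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def stop_codon_positions(nucleotides):
    """Return 1-based codon numbers of stop codons in reading frame 1."""
    seq = nucleotides.upper().replace("-", "")  # drop alignment gaps
    usable = len(seq) - len(seq) % 3            # ignore a trailing partial codon
    return [
        start // 3 + 1
        for start in range(0, usable, 3)
        if seq[start:start + 3] in STOP_CODONS
    ]


def sequences_with_stops(fasta_text):
    """Return {name: codon numbers} for sequences with at least one stop."""
    flagged = {}
    for name, seq in parse_fasta(fasta_text).items():
        positions = stop_codon_positions(seq)
        if positions:
            flagged[name] = positions
    return flagged


if __name__ == "__main__":
    example = ">seq_ok\nATGAAAGGC\n>seq_with_stop\nATGTAGAAA\n"
    print(sequences_with_stops(example))
```

Note that this reports every in-frame stop, including one at the very end of a complete coding sequence.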

Select Align -> Alignment options -> muscle, then select Align. How many alignment columns are in your alignment? (Scroll to the right and click the last column; the line on top tells you the sequence and the position in the alignment | position in the sequence.)

3.2) The first four sequences have not been invaded by an intein.  Can you find the place where the intein begins and where it ends?  What are the first two and the last three amino acids of the intein?  

3.3) Create sets of sites that correspond to the extein and the intein. First go to Sites and create a set called "all sites", then duplicate this set and call it intein. Scroll to the right, and in the row of xxx below the alignment, click on the x below the last aa of the N-extein (the x disappears and the column is grayed out). Then right click on any of the xxx below the N-extein -> all the x below the N-extein should disappear. Do the same at the end of the intein: remove the x under the first aa of the C-extein, then right click on any of the x to the right.
This might be a good point in time to save your file in mase format. To do this you first need to unselect view as protein (Props -> remove the check mark). After saving, return to view as protein.
Do the same for the sites corresponding to the extein: Sites -> all sites, then Sites -> duplicate set, call it extein.
Move to the right, click on the x below the first and the last aa of the intein, then right click on an x under the intein. (The right click removes all the x between two non-x columns. If you right click before the x under the last column of the intein is removed, you remove everything till the end :( ).
If you want information on how to modify an alignment by hand, check out the help pages.
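
Conceptually, the two site sets just partition the alignment columns into those inside the intein and those outside it. A minimal Python sketch of that partition follows; it is not how Seaview stores site sets, and the function name, column numbers, and example alignment are made up.

```python
def split_sites(alignment, intein_start, intein_end):
    """Split every aligned sequence into extein and intein columns.

    alignment maps sequence names to aligned sequences of equal length.
    intein_start and intein_end are 1-based alignment columns, both inclusive.
    Returns two dictionaries: (extein columns, intein columns).
    """
    lengths = {len(row) for row in alignment.values()}
    if len(lengths) > 1:
        raise ValueError("all aligned sequences must have the same length")
    extein, intein = {}, {}
    for name, row in alignment.items():
        intein[name] = row[intein_start - 1:intein_end]
        extein[name] = row[:intein_start - 1] + row[intein_end:]
    return extein, intein


if __name__ == "__main__":
    toy = {
        "with_intein": "MKTCLAHNAAGS",
        "intein_free": "MKT-----AAGS",
    }
    extein, intein = split_sites(toy, 4, 8)
    print(extein)
    print(intein)
```

In the toy alignment the intein-free sequence contributes only gaps to the intein set, which is why the intein tree in the later steps is built from the intein-containing sequences only.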

3.4) Uncheck view as protein. Save the alignment in mase format. Select Sites -> extein. Then highlight all the intein-containing sequences. Select Trees -> PhyML -> model GTR (everything else as default) -> RUN. After a minute the tree building is complete. If you do serious work, you want to copy all and place the results into your notebook BEFORE you click ok.
After you click ok, the window opens with the calculated maximum likelihood tree. 
Explore the Swap and Re-root buttons on top.  These operations do not change the tree (which is calculated as an unrooted tree). 
If you click on Br support, the estimated probability that the branch is real is displayed next to the branch. (Just in case, select file -> save unrooted tree and give it a name. Also, copy and paste the image of the tree into your notebook.)

3.5) Repeat this for the intein sequence: Sites -> intein; select the intein-containing sequences, then Trees -> PhyML (GTR) -> RUN (copy all and save the program output before clicking ok). Save the tree, and compare it to the extein tree.

Do you see any similarities between the trees calculated for the intein and the extein?

3.6) If you have time, rename the intein-free genes (select the sequence by clicking on the name, then add a prefix (e.g. N_) at the beginning of the name). Select Sites -> extein, select all sequences (click on the names), then Trees -> PhyML -> RUN.

Do the intein-free genes form a clan/clade?
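
In an unrooted tree, a group of sequences forms a clan when a single branch separates them from all other sequences. The Python sketch below checks this for a tree written by hand as nested tuples. That representation, the function names, and the example names are assumptions for illustration; the sketch does not read Seaview or PhyML output.

```python
def collect_leaf_sets(tree, found):
    """Add the leaf set below every node of a nested-tuple tree to found.

    Leaves are strings; internal nodes are tuples of subtrees.
    Returns the leaf set of the whole (sub)tree.
    """
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(*(collect_leaf_sets(child, found) for child in tree))
    found.add(leaves)
    return leaves


def forms_clan(tree, group):
    """Return True if one branch separates group from all other leaves.

    Because the tree is treated as unrooted, the branch may sit above the
    group itself or above its complement.
    """
    found = set()
    all_leaves = collect_leaf_sets(tree, found)
    group = frozenset(group)
    if not group <= all_leaves:
        raise ValueError("group contains names that are not in the tree")
    return group in found or (all_leaves - group) in found


if __name__ == "__main__":
    # Made-up tree; N_ marks intein-free sequences as in step 3.6.
    tree = ("N_a", ("N_b", ("c", "d")))
    print(forms_clan(tree, {"N_a", "N_b"}))
```

Checking the complement as well as the group itself is what makes the answer independent of where the tree happens to be rooted for display.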

Finished?

Check the appropriate radio button below before pressing the submit button:

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.