The Last Assignment

Your name:
Your email address:

Dotlet Exercise

Aside:  In class 13, an assignment was to read through the dotlet  examples pages.  Dotlet is a Java applet provided by the Swiss Institute for Bioinformatics that performs interactive dot plots.  This application is great, but next to impossible to get to run from a "normal" browser, because JAVA applets are no longer supported in Google's Chrome, Safari, firefox, and Microsoft's Edge. Waterfox or Internet Explorer may work, but may be not, because the operating system may interfere). End Aside

Instead, we will use a java script version of dotlet that is in beta testing. This version works fine for protein to protein comparisons, but fails tor DNA to protein comparisons.

Comparing yeast ATPase catalytic subunit with yeast HO endonuclease.
This is similar to the analysis we did using pairwise blast. However in this case, we plot all pairwise comparisons between windows, i.e. we do not need to worry about E value cutoffs and low complexity filters.

Go to the applet and input the two sequences to analyze (note: just the sequence, not the header).  These are the sequences we will be using: 

    Sce_VMA.fa (the vacuolar ATPase catalytic subunit from yeast),

    SceHO.fa (the mating type switching HO endonuclease from yeast),

    vma1Neurospora.fa (the vacuolar ATPase catalytic subunit from Neurospora crassa) and

  1. Compare the yeast vma-1 subunit with intein (Sce_VMA.fa) (sequence 1) with the Neurospora A-subunit (vma1Neurospora.fa) (sequwence 2). Select a window size between 9 and 15 and click "compute". The program will compare every window of the chosen size in one sequence to all the possible windows in the other sequence. Below the dot matrix window you see a histogram that describes how often window pairs with the indicated score occurred. The sliding bars below and above the histogram let you select the colors with which matches are depicted.   (zoom out until you see the complete yeast vma-1 sequence.

    If you click inside the dot plot panel, the alignment window at the bottom aligns the two sequences accordingly. You can fine-tune the alignment using the keyboard arrows (this may be somewhat of a pain, because the downwards arrow moves both the hairpin cross and the webpage; it helps to use command - or command + to zoom in or out; also, the windows are shown in differnet arrangements in different browsers).

    Which sequence positions (from ... to....) in the yeast sequence represent the intein?


  2. Compare the yeast vma-1 subunit with intein (Sce_VMA.fa) to the HO endonuclease (sex change enzyme) (SceHO.fa) (i.e. replace sequence 2 from the first part of the exercise) (Blossum 45, and a Window of 19 works well),
    does the complete intein sequence match to something in the HO endonuclease?

    Does this mean the HO endonuclease contains a self splicing domain? 
    Is there a part of the sequence in the HO endonuclease that might correspond to an extein?

  3. Compare two unrelated sequences (e.g., (SceHO.fa) and vma1Neurospora.fa).  If you adjust the levels in the histogram, the dot matrix comparison looks a little like the  rain in a Japanese woodcut (here or here). 
    What is the reason for these many small diagonals?
      (Hint: Check the impact of the window size; move the hairline-cross along one of the little diagonal streaks.)


    Repetitive proteins in Dotlet  

    We will use proteolipids as examples (these consist of hydrophobic hairpin loops (alpha helices) embedded in the membrane.  They are the subunits of ATPsynthases/ATPases (V-, F- , and A-ATPases) that move the proton (or Na+) accross the membrane.

  1. Compare the Methanocaldococcus protein (WP_010869717.1) against itself. Do you see any repetitive units? How many?
    Does the choice of scoring matrix make a difference?




  2. Compare the Methanopyrus sequence GI 19887539 against the one from Methanocaldococcus (WP_010869717.1). How many equivalents to the single repeat unit in Methanocaldococcus do you find in Methanopyrus?


  3. How many repeats do you identify when you compare the Methanopyrus sequence GI 19887539 against itself?


  4. Compare the two sequences using Pairwise Blast.
    Which program should you use?
    What is the effect of turning the low complexity filter on or off? 
    Can you identify the same number of repeat units as in #5?
    What happens if you compare
    the Methanopyrus sequence GI 19887539 against itself?


Creating a mummer/nucmer plot

Mummer as an easy alternative to gene plots and to pairwise blastn searches.  The result is something like a dot matrix comparison.   nucmer (=NUCleotide mumMER) is part of the mummer package, and it handles input files with multiple contigs rather well.  (e.g., if one of your genomes contains several plasmids or additional chromosomes, or if the genome is not closed. 

The difference to a gene plot is that in this case the search is done on the nucleotide level, and that the program keeps track of the + and the - strand. Mummer is installed on the cluster. 

The following assumes that you established a filezilla and an ssh connection to xanadu (see lab 5).  [note: establish the ssh connection first, to be sure your password has not expired!]

module load MUMmer/4.0.2 

module load gnuplot

module load perl

mkdir lab13

cd lab13

srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash

Download two or more genomes you want to compare (browse microbial genomes at NCBI and download chromosome or genome (*.fna) files to your computer or use the curl -O command targeting the NCBI FTP site (to get the link, right click in the link in the ftp server).

For an example you could download four Haloferax genomes using

curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/025/685/GCF_000025685.1_ASM2568v1/GCF_000025685.1_ASM2568v1_genomic.fna.gz ( Haloferax volcanii DS2)

curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/005/406/325/GCF_005406325.1_ASM540632v1/GCF_005406325.1_ASM540632v1_genomic.fna.gz ( Haloferax mediterranei ATCC 33500)

curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/010/692/905/GCF_010692905.1_ASM1069290v1/GCF_010692905.1_ASM1069290v1_genomic.fna.gz ( Haloferax alexandrinus)

curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/190/965/GCF_001190965.1_ASM119096v1/GCF_001190965.1_ASM119096v1_genomic.fna.gz ( Haloferax gibbonsii)

Unzip the files
gunzip *.gz

To compare two genomes we use
nucmer file1.fna file2.fna

For example
nucmer GCF_010692905.1_ASM1069290v1_genomic.fna GCF_001190965.1_ASM119096v1_genomic.fna

To plot the matches we use a script that comes with mummer and that creates a file to be send to gnuplot. Sadly, a few of the options in mummerplot do not work on the cluster, resulting in an error message. Nevertheless, mummerplot (with the postscript option)  generates a file that is being understood by gnuplot as installed on the cluster.

To runmummer plot:
mummerplot --postscript --prefix give_it_a_name out.delta -  out.delta is the output file from nucmer.

You will get an error message - ignore it :)  We will send the gnuplot-file to gnuplot from the commandline:

gnuplot give_it_a_name.gp

This creates a file in your working directory that ends on .ps (e.g., give_it_a_name.ps).

Use filezilla to transfer the file to your personal computer.  On Macs, preview reads postscript files; I am told that Adobe's Acrobat reader opens .ps files under windows, but you could use inkscape or gimp.    

If you want to compare many genomes, you can use the following script that compares all .fna files in the directory with one another.  This takes time, you could use it get a snack or coffee.  The script is here. Run it from the command line by typing
perl mummer2.pl

In the name of the plot files, the first name is the REF (in the plot), the second the query
Which genomes did you analyze?
Did the genomes use homologous starting points?
Any interesting observations? 



Type exit to release the compute node from the queue.
If you you encountered problems in your session, check the queue for abandoned sessions using the command qstat. If there are abandoned sessions under your account, kill them by deleting them from the queue by typing qdel job-ID, e.g. "qdel 40000" would delete Job # 40000

Finished?

Check the appropriate radio button below before pressing the submit button:

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.