Assignment 5: Statistics of Sequence Comparison / Dotplots

Your name:
Your email address:

Some cluster computer basics

Spend 5 minutes reading this introduction to the command-line interface. If you think something like this might be in your future, read the rest after class.

If you are working on a windows machine, start a program called PuTTY. In the "Host" field, type xanadu-submit-ext.cam.uchc.edu

In the "Username" field, type your username mcb3421usrXX, where XX is a number assigned to you. (If your number is below 10 , do not use a leading 0. Your username would be mcb3421usrX.)
Login. It will ask you to accept a new host key. Now enter the password _______ (the password will be on the whiteboard and on huskyCT). Hit return. <Note the terminal will not echo the keystrokes you make to enter the password.>

The window with a black background that opens on your screen is the command-line window.

In the window with a black background, you will see a flashing cursor. Stuff you type appears at the cursor position.

This paragraph is only relevant if you are working on a Mac, find the terminal application (Termianl.app). By default is located in the Applications folder, in a subfolder utilitites (/System/Applications/Utilities/Terminal.app). The easiest might be to search for Terminal.app in the finder or in spotlight.

Double click the Terminal application icon. In the window that opens, type

ssh mcb3421usrXX@xanadu-submit-ext.cam.uchc.edu
where XX is a number assigned to you. If your number is below 10, do not use a leading 0. Your username would be mcb3421usrX.)
It will ask you to accept a new host key, type yes <return> [<return> means hit the return key.]
Now enter the password _______. Hit return. <Note the terminal will not echo the keystrokes you make to enter the password.>

It is important to know a little bit about the structure of the cluster. When you log into the cluster, you are at the head node. Everything you execute will run on the head node. You never ever should run anything that needs computational resources on the head node. To actually run programs or execute comand such as blast, you should have moved to a compute node, or submit the command to the queue (which runs it on a compute node)!
You can read more about the xanadu cluster here.

To move from the login node to run an interactive session (you should never, ever run things on the login node) to a compute node, type
srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash

and then hit the return or enter key. This takes you to a "compute" node.
If the cluster is busy, and the queue you select is busy, the amove to compute node can take several minutes.
You can check the node you are on by typing hostname at the commandline. The headnode is hpc-ext-2.cam.uchc.edu, the compute node something like xanadu-68.
whoami returns your username.

Now type qstat
This command shows the job-ID of your session.

Type ls
This gives you a directory (or folder) listing. There may be nothing to see (yet). Type mkdir lab5
This command stands for "make directory". Now type ls
again. You should see your newly-created directory. Type pwd
This stands for "print working directory". This forward-slash (/) separated "path" is a shorthand for the directory you are currently in.
Type cd lab5 (cd .. would move you back up to your home directory)
Now type pwd again. You have moved into the lab5 directory.

If you need more information on a command (e.g., pwd), you can use the man command. man pwd displays the manual pages for the pwd command. (You can move forward using the space-bar, or the arrow keys. to leave the manual page type q.)

There are many ways to get files into your account, e.g., the file transfer application that is part of the SSH Client or Filezilla. Another possibility is the wget commant. To move this file into the directory on your cluster, you could type
wget https://j.p.gogarten.uconn.edu/mcb3421_2020/labs/HaloATPases+Intein.txt --no-check-certificate

You could copy paste the above line to the terminal window, or right click "this file" and copy the link address.

Do the same for this file:
wget https://j.p.gogarten.uconn.edu/mcb3421_2020/labs/HaloATPases-Intein.txt --no-check-certificate

Once you downloaded the files into the lab5 directory, you can display their content by typing
cat HaloATPases+Intein.txt
cat HaloATPases-Intein.txt

Alternatively you could use the more or less commands that allows you to go through a file page by page (Type <space> to get to the next page, type "q" to quit).
This is a typical multiple sequence fasta file from the NCBI. Take a look at the annotation lines.

Where is the species and strain given in the comment-line/annotation-line? What pattern could you use to automate retrieval of the species and strain designation?

To display both files you could use the wildcard "*", which stands for any character.
cat H*
Displays both files to the screen. By default the output of cat goes to the standard output, in our case the screen. But you can redirect the output to a file using the > symbol.
cat H* > HaloATPases_and_homologs.txt
copies both files into a new file.

Use more to scrol through the file
more HaloATPases_and_homologs.txt

There is a similar command called less (strange humor, more or less) that allows you to move forward and backwards in a file:
less HaloATPases_and_homologs.txt

The unix command line comes with helpful features.

One is automatic line completion.
If you type a command and hit the <tab> key, the interface will complete the line as long as there is only one possible completion.
In our case,
less H<tab> will complete the line to less HaloATPases. If you enter the <tab> key again, the different possible completions will be listed, enter the next character of the file you want to see (_ or - or +) and hit the <tab> key again.

Another is the history. The upward arrow in the cursor field will recall the last command (and hitting it repeatedly will allow you to go back in the command history.) You can use the forward and backward arrow to move the cursor to within the command and then modify the text (this is great, if you mistyped, or were in the wrong directory). The only complication is that the mouse does not work to move the cursor location in the command line. Try it out!
You can output the complete history of the commands you submitted by typing history .

type more HaloATPases and_homologs.txt
then use the recall command and the arrow keys to correct he command.

Type exit
You will return to the master node. Now type qstat
You should have no jobs running. If there is a job listed, then type qdel
followed by a space, and then the job-ID number, followed by return or enter key.
Then type qstat again, to confirm that there are no running jobs.

To change the password for your account, you can use the passwd command. Warning: only do this, if you will be able to memorize both your username and your password!

Then type logout to exit from the master node.

Start Filezilla
Open the site manager (under the file menu, or the left most icon in the bar on top of the application window.
Click on the NewSite button at the bottom left of the site manager window.
On the right
under protocol select SFTP
under host enter transfer.cam.uchc.edu
under LogOn Type select normal
under User enter mcb3421usrXX (Go to the class notebook, behind the name of your personal section is an _#. this is your number. If your number is 1, do not use a leading 0. Your username would be mcb3421usr1.)
under Password enter your xanadu password
Hit connect.

You should see 4 windows. On top the directories structure on your computer (left) and on the cluster (right). On the bottom the contents of the two active directories (i.e. the directories between which the files will be transferred). You can move to a directory/ subdirectory by double clicking. The folder with two dots behind it is the parent directory.
If you double click on a file in the bottom windows it is transferred (copied) between the two directories on the two computers. You also can drop and drag files in the window representing the directory on xanadu.
Right clicking offers a couple of options, the most convenient is to change permissions. You can easily give or revoke permissions to others.

Which groups and which type of permissions can you select using FileZilla?

1. PRSS (20 minutes)

PRSS is a program written by Bill Pearson that uses the shuffling approach to determine if two sequences are significantly similar.
Using PRSS*, determine if there is significant similarity between the proteins with gi numbers: 2506213, 2493127, 4323566, 2983405, 1303679. To start the program click on "Shuffle Sequence". Sample values are in the table below. Note that 2e-3 means 2 times 10^-3. Note, you can do many comparisons at once, using multiple gi numbers for the second sequence (here).
Each of you will pick one row and enter the pseudo E-value for 10000 comparisons into the table (here).
Note: If someone already entered a value, add your value in addition, separated by a ";"
Note: if you enter multiple gi numbers as the second sequences, the output lists them in order of significance level.

Group discussion of results.

For a comparison that resulted in a significant E value (strong - smaller than 1e-5), but not overwhelming (i.e. > 1e-90) repeat the analysis selecting a different substitution matrix (Blosum80, PAM250).
How does the E-value change?

For a comparison that resulted in a significant E value (strong - smaller than 1e-5), but not overwhelming (i.e. > 1e-90) repeat the analysis increasing and decreasing the gap opening penalty (e.g., -1, -10, -13, -20, -100).
Which gap opening penalty gives the smallest E-value? What might be the reason for your finding?

* The PRSS server at embNET provides the traditional PRSS output, but their link to sequence retrieval (among others) is broken ... .

2. BLAST (10 minutes)

Repeat a few of the pairwise comparisons using Pairwise BLAST
(go here, then
select the protein BLAST.
Click the box to select align two or more sequences, which is at the bottom of the "Enter Query Sequence" box. Make sure the blastp tab in the header is selected).
You can paste the GI numbers directly into the box labeled "Enter accession number(s), gi(s), or FASTA sequence(s)" You can force the program to report insignificant alignments by increasing the expect value. To do this, click on the plus sign labeled "Algorithm parameters". A number of additional options will drop down, including the "Expect Threshold," which is set to 10 by default, but you can enter a larger number to obtain less significant matches (do this only if the default parameter does not return any result). When all of your parameters are set, click the BLAST button.

How do these E-values compare to the ones obtained using PRSS?

Note: The NCBI BLAST interface adjusts the gap penalties to a new default value for each substitution matrix.

3a. Transitive homology? Part one (5 minutes)

You find that sequence A (gi 1303679) has a significant similarity to sequence B (gi 2506213) over the entire length and sequence B has significant similarity to C (gi 2983405) over the entire length, but C and A are not significantly similar (assuming an E-value <e-8).
Can you nevertheless conclude that A is homologous to C? (Two characters -- sequences, or morphological characters -- are homologous if they are derived from the same character existing in some ancient organism.)

3b. Transitive homology? Part two (10 minutes)

Does the same reasoning hold for gi 6320016, gi 1303679, gi 2507047?

Why might this case be different from the previous one?
How does the output from the pairwise blast comparison help you to draw a conclusion (compare 6320016 with 1303679, and 6320016 with 2507047)? (Check both the Graphic Summary nad the DotPlot.)

4. Dotplots (10 minutes)

Gephard is a dotplot program, that plots the similarity between two sequences similar to the dot matrix comparison in pairwise blast. However, it plots the actual similarities between every window in one sequence against every window in the other sequence. The program is available as a java jar. On windows you can start the program through double clicking.

On a Mac this is more complicated.
Download the program from https://cube.univie.ac.at/gepard (The link to Gepard-1.40.jar is somewhere in the middel of the page under download.)
Move the file to your application folder) or to another folder.
Open terminal and cd to the dirctory where the gephard.jar is located and type
java -jar Gepard-1.40.jar
or, if you put the jar inot the application folder:
java -jar /Applications/Gepard-1.40.jar

However, this still might not work, (You might get some message like "Exception in thread ..."
I replaced the JAVA version I had on my Mac with OpenJAVA:
I went to /Library/Java/JavaVirtualMachines and deleted both folders which were there.
Then I went to https://adoptopenjdk.net/archive.html? and downloaded jdk-11.0.9+11.1 (both jre and jdk)
Installed the packages, and the above command works beautifully

Download the program from https://cube.univie.ac.at/gepard (The link to Gepard-1.40.jar is somewhere in the middel of the page under download.)
Download the genomes from Actinobacteriophage Neos5 and HarveySr, and the two terminases encoded on these genomes (Neos5.fasta, HarveySr.fasta, Gene6_HarveySR.faa and Gene6_Neos5.faa)
These files are in this zip file
Double click the gephard.jar (or on a Mac start the program form the commandline).
In the menu on the left, select the two phage genomes as sequence 1 and sequence 2.
If you click on the plot, you place a cross-hair into the image. The alignment corresponding to the cross is depicted below. You can move the cross using the cursor arrows on the keyboard.
About how many nucleotides long is the first insertion in Neos5?

The gene encoded in this region of the genomes is a terminase, its function is to stuff DNA into the phage head.
Gephard at times has difficulties to adjust the scale if you load new sequences. The best is to exit the program, start it again and load the to gene6 sequences (not these are aa sequences).
Calculate a dotplot comparing the two sequences. Use the cross hair to find the beginning and end of the insertion in Neos5.
What are the first 3 and the last three amino acids in the Neos5 insertion?

OPTIONAL: FASTA (10 minutes)

Do a databank search of the Swissprot database with (gi 2493127). Fasta is accessible through the web at

http://www.ebi.ac.uk/Tools/sss/fasta/

(Enter the sequence in Fasta format (either search entrez for for 2493127 or follow this link) in Step2. In the display pulldown-menu select FASTA (step 3). In Step 1, select the Uniprot clusters 50% as database (you need to click on the triangle to see the different uniprot clusters. In step 3 under options, change the HISTOGRAM pulldown menu to yes, under the "Matrix" pulldown select the BLASTP62 matrix, under the "alignments" pulldown select 50 alignments and 1000 scores, leave everything else in the default options.)
< As a curtesy to others using the server, get the results here. >

Click on the "Tool Output" tab.

1) How many proteins have an alignment score above 140?
2) In comparing the number of actual matches (==) to the distribution fitted to these data (*) for each alignment score (given in the left column), for which value of score(s) does the number of actual hits deviate (as percent of the expectation) most from the expectation?
3) In the pairwise alignments, what is signified by the ":" and the "." ?

Finished?

Check the appropriate radio button below before pressing the submit button:

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.

Table 1: PRSS pseudo E-value for 10000 comparisons.

protein GI number	2506213 flagellar ATPase	2493127 vacuolar ATPase subunit B	4323566 ATPase subunit alpha	2983405 transcription termination factor Rho	1303679 vacuolar ATPase subunit A
2506213 flagellar ATPase	8.3e-165	4.4e-19	5.6e-17	8.6e-14	6.2e-14
2493127 vacuolar ATPase subunit B	4.1e-18	1e-186	1.6e-17	26	1.3e-13
4323566 ATPase subunit alpha	5e-19	1.1e-20	1.8e-163	0.00012	1.7e-08
2983405 transcription termination factor Rho	1.5e-11	74	0.0085	2.6e-177	3.3
1303679 vacuolar ATPase subunit A	2.2e-13	8.7e-14	5.4e-07	0.85	0

Table 2: BLASTp E-value.

protein GI number

2506213

flagellar ATPase

2493127

vacuolar ATPase subunit B

4323566

ATPase subunit alpha

2983405

transcription termination factor Rho

1303679

vacuolar ATPase subunit A

2506213

flagellar ATPase

1e-31

1e-32

1e-20

9e-26

2493127

vacuolar ATPase subunit B

3e-28

4e-35

0.003

2e-24

4323566

ATPase subunit alpha

2e-28

4e-35

5e-06

2e-15

2983405

transcription termination factor Rho

1e-16

0.003

4e-06

4e-05

1303679

vacuolar ATPase subunit A

5e-22

3e-24

2e-15

7e-05