Assignment 5: Statistics of Sequence Comparison

Your name:
Your email address:

Some cluster computer basics

Spend 5 minutes reading this introduction to the command-line interface. If you think something like this might be in your future, read the rest after class. 

If you are working on a windows machine, start a program called PuTTY.  In the "Host" field, type xanadu-submit-ext.cam.uchc.edu

In the "Username" field, type your username mcb3421usrXX, where XX is a number assigned to you. (Go to the class notebook, behind the name of your personal section is an _#. this is your number.  If your number is 1, do not use a leading 0.  Your username would be mcb3421usr1.)
Login. It will ask you to accept a new host key. Now enter the password _______ (the password is in the notebook under today's summary and goals section).  Hit return. <Note the terminal will not echo the keystrokes you make to enter the password.>

The window with a black background that opens on your screen is the command-line window.

In the window with a black background, you will see a flashing cursor. Stuff you type appears at the cursor position.

If you are working on a Mac, find the terminal application (Termianl.app).  By default is located in the Applications folder, in a subfolder utilitites (/System/Applications/Utilities/Terminal.app).  The easiest might be to search for Terminal.app in the finder or in spotlight. 

Double click the Terminal application icon.  In the window that opens, type

ssh mcb3421usrXX@xanadu-submit-ext.cam.uchc.edu
where XX is a number assigned to you. (Go to the class notebook, behind the name of your personal section is an _#. this is your number.  If your number is 1, do not use a leading 0.  Your username would be mcb3421usr1.)
It will ask you to accept a new host key, type yes <return>  [<return> means hit the return key.]
Now enter the password _______ (the password is in the notebook under today's summary and goals section).  Hit return. <Note the terminal will not echo the keystrokes you make to enter the password.>

It is important to know a little bit about the structure of the cluster.  When you log into the cluster, you are at the head node.  Everything you execute will run on the head node.  You never ever should run anything that needs computational resources on the head node.  To actually run programs or execute comand such as blast, you should have moved to a compute node!
You can read more about the xanadu cluster here.

To move from the login node to run an interactive session (you should never, ever run things on this node) to a compute node,  type
srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash

and then hit the return or enter key. This takes you to a "compute" node.
If the cluster is busy, and the queue you select is busy, the amove to  compute node can take several minutes.

Now type qstat
This command shows the job-ID of your session.

Type ls
This gives you a directory (or folder) listing. There may be nothing to see (yet). Type mkdir lab5
This command stands for "make directory". Now type ls
again. You should see your newly-created directory. Type pwd
This stands for "print working directory". This forward-slash (/) separated "path" is a shorthand for the directory you are currently in.
Type cd lab5  (cd ..  would move you back up to your home directory)
Now type pwd again. You have moved into the lab5 directory.

If you need more information on a command (e.g., pwd), you can use the man command. man pwd displays the manual pages for the pwd command. (You can move forward using the space-bar, or the arrow keys. to leave the manual page type q.)

There are many ways to get files into your account, e.g., the file transfer application that is part of the SSH Client or Filezilla. Another possibility is the wget commant. To move this file into the directory on your cluster, you could type
wget https://j.p.gogarten.uconn.edu/mcb3421_2020/labs/HaloATPases+Intein.txt --no-check-certificate 

You could copy paste the above line to the terminal window, or right click "this file" and copy the link address.

Do the same for this file:
wget https://j.p.gogarten.uconn.edu/mcb3421_2020/labs/HaloATPases-Intein.txt --no-check-certificate 

Once you downloaded the files into the lab5 directory, you can display their content by typing
cat HaloATPases+Intein.txt
cat HaloATPases-Intein.txt


Alternatively you could use the more or less commands that allows you to go through a file page by page (Type <space> to get to the next page, type "q" to quit).
This is a typical multiple sequence fasta file from the NCBI. Take a look at the annotation lines.

Where is the species and strain given in the commandline?  What pattern could you use to automate retrieval of the species and strain designation?

To display both files you could use the wildcard "*", which stands for any character.
cat H*
Displays both files to the screen. By default the output of cat goes to the standard output, in our case the screen. But you can redirect the output to a file using the > symbol.
cat H* > HaloATPases_and_homologs.txt
copies both files into a new file.

Use more to scrol through the file
more HaloATPases_and_homologs.txt

There is a similar command called less (strange humor, more or less) that allows you to move forward and backwards in a file:
less HaloATPases_and_homologs.txt

The unix command line come with helpful features.

One is automatic line completion.
If you type a command and hit the <tab> key, the interface will complete the line as long as there is only one possible completion.
In our case,
less H<tab> will complete the line to less HaloATPases. If you enter the <tab> key again, the different possible completions will be listed, enter the next character of the file you want to see (_ or - or +) and hit the <tab> key again.

Another is the history. The upward arrow in the cursor field will recall the last command (and hitting it repeatedly will allow you to go back in the command history.) You can use the forward and backward arrow to move the cursor to within the command and then modify the text (this is great, if you mistyped, or were in the wrong directory). The only complication is that the mouse does not work to move the cursor location in the command line. Try it out!

type more HaloATPases and_homologs.txt
then use the recall command and the arrow keys to correct he command.

You also can type history in the command line to see all the commands you executed.

Type exit
You will return to the master node. Now type qstat
You should have no jobs running. If there is a job listed, then type qdel
followed by a space, and then the job-ID number, followed by return or enter key.
Then type qstat again, to confirm that there are no running jobs.

To change the password for your account, you can use the passwd command. Warning: only do this, if you will be able to memorize both your username and your password!

Then type logout to exit from the master node.

Start  Filezilla
Open the site manager (under the file menu, or the left most icon in the bar on top of the application window. 
Click on the NewSite button at the bottom left of the site manager window.
On the right
under protocol select SFTP
under host enter transfer.cam.uchc.edu
under LogOn Type select normal
under User enter mcb3421usrXX (Go to the class notebook, behind the name of your personal section is an _#. this is your number.  If your number is 1, do not use a leading 0.  Your username would be mcb3421usr1.)
under Password enter your xanadu password (the password is in the notebook under today's summary and goals section)
Hit connect.

You should see 4 windows.  On top the directories structure on your computer (left) and on the cluster (right).  On the bottom the contents of the two active directories (i.e. the directories between which the files will be transferred).  You can move to a directory/ subdirectory by double clicking.  The folder with two dots behind it is the parent directory.
If you double click on a file in the bottom windows it is transferred (copied) between the two directories on the two computers.  You also can drop and drag files in the window representing the directory on xanadu. 
Right clicking offers a couple of options, the most convenient is to change permissions.   You can easily give or revoke permissions to others. 

Which groups and which type of permissions can you select using FileZilla?

1. PRSS (20 minutes)

PRSS is a program written by Bill Pearson that uses the shuffling approach to determine if two sequences are significantly similar.
Using PRSS*, determine if there is significant similarity between the proteins with gi numbers: 2506213, 2493127, 4323566, 2983405, 1303679. To start the program click on "Shuffle Sequence". Sample values are in the table below. Note that 2e-3 means 2 times 10-3. Note, you can do many comparisons at once, using multiple gi numbers for the second sequence  (here).
Each of you will pick one row and enter the pseudo E-value for 10000 comparisons into the table (here). 
Note: If someone already entered a value, add your value in addition, separated by a ";"
Note: if you enter multiple gi numbers as the second sequences, the output lists them in order of significance level. 

Discussion of results via webex.

For a comparison that resulted in a significant E value (strong - but smaller than 1e-5), but not overwhelming (i.e. > 1e-90) repeat the analysis selecting a different substitution matrix (Blosum50, Blosum80, PAM250).
How does the E-value change?

For a comparison that resulted in a significant E value (strong - smaller than 1e-5), but not overwhelming (i.e. > 1e-90) repeat the analysis increasing and decreasing the gap opening penalty (e.g., -1, -10, -13, -20, -100).
Which gap opening penalty gives the smallest E-value? What might be the reason for your finding?

* The PRSS server at embNET provides the traditional PRSS output, but their link to sequence retrieval (among others) is broken ... .

2. BLAST (10 minutes)

Repeat a few of the pairwise comparisons using Pairwise BLAST
(go here,
  then select the protein BLAST.
  Click the box to select align two or more sequences, which is at the bottom of the "Enter Query Sequence" box. Make sure the blastp tab in the header is selected).
You can paste the GI numbers directly into the box labeled "Enter accession number(s), gi(s), or FASTA sequence(s)" You can force the program to report insignificant alignments by increasing the expect value. To do this, click on the plus sign labeled "Algorithm parameters". A number of additional options will drop down, including the "Expect Threshold," which is set to 10 by default, but you can enter a larger number to obtain less significant matches (do this only if the default parameter does not return any result). When all of your parameters are set, click the BLAST button.

Enter you results into the second table  (here).


How do these E-values compare to the ones obtained using PRSS?

Note: The NCBI BLAST interface adjusts the gap penalties to a new default value for each substitution matrix. ||

3a. Transitive homology? Part one (5 minutes)

You find that sequence A (gi 1303679) has a significant similarity to sequence B (gi 2506213) over the entire length and sequence B has significant similarity to C (gi 2983405) over the entire length, but C and A are not significantly similar (assuming an E-value <e-8).

Can you nevertheless conclude that A is homologous to C? (Two characters -- sequences, or morphological characters -- are homologous if they are derived from the same character existing in some ancient organism.)


3b. Transitive homology? Part two (10 minutes)

Does the same reasoning hold for gi 6320016, gi 1303679, gi 2507047?

Why might this case be different from the previous one?
How does the output from the pairwise blast comparison help you to draw a conclusion (compare 6320016 with 1303679, and 6320016 with 2507047)?

 

4. FASTA (10 minutes)

Do a databank search of the Swissprot database with (gi 2493127). Fasta is accessible through the web at 

   http://www.ebi.ac.uk/Tools/sss/fasta/

(Enter the sequence in Fasta format (either search entrez for for 2493127 or follow this link) in Step2.  In the display pulldown-menu select FASTA (step 3). In Step 1, select the Uniprot clusters 50% as database (you need to click on the triangle to see the different uniprot clusters. In step 3 under options, change the HISTOGRAM pulldown menu to yes, under the "Matrix" pulldown select the BLASTP62 matrix, under the "alignments" pulldown select 1000 alignments and 1000 scores, leave everything else in the default options.)
< If the server is overloaded, you can get the results here. >

1) How many proteins show sequence similarity to the query sequence with an E-value smaller than 1E-198?
2) What type of sequences are among the matches?

Click on the "Tool Output" tab.
3) In comparing the number of actual matches (==) to the distribution fitted to these data (*) for each alignment score (given in the left column), for which value of score(s) does the number of actual hits deviate (as percent of the expectation) most from the expectation?
4) In the pairwise alignments, what is signified by the ":" and the "."
?

Finished?

Check the appropriate radio button below before pressing the submit button:

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.

 

Table 1: PRSS pseudo E-value for 10000 comparisons.

protein GI number
2506213
flagellar ATPase
2493127
vacuolar ATPase subunit B
4323566
ATPase subunit alpha
2983405
transcription termination factor Rho
1303679
vacuolar ATPase subunit A
2506213
flagellar ATPase
8.3e-165
4.4e-19
5.6e-17
8.6e-14
6.2e-14
2493127
vacuolar ATPase subunit B
4.1e-18
1e-186
1.6e-17
26
1.3e-13
4323566
ATPase subunit alpha
5e-19
1.1e-20
1.8e-163
0.00012
1.7e-08
2983405
transcription termination factor Rho
1.5e-11
74
0.0085
2.6e-177
3.3
1303679
vacuolar ATPase subunit A
2.2e-13
8.7e-14
5.4e-07
0.85
0

 

 

Table 2: BLASTp E-value.

protein GI number
2506213
flagellar ATPase
2493127
vacuolar ATPase subunit B
4323566
ATPase subunit alpha
2983405
transcription termination factor Rho
1303679
vacuolar ATPase subunit A
2506213
flagellar ATPase
0 1e-31 1e-32 1e-20 9e-26
2493127
vacuolar ATPase subunit B
3e-28 0
4e-35
0.003 2e-24
4323566
ATPase subunit alpha
2e-28
4e-35
0 5e-06
2e-15
2983405
transcription termination factor Rho
1e-16 0.003
4e-06
0 4e-05
1303679
vacuolar ATPase subunit A
5e-22
3e-24
2e-15
7e-05
0