Assignment 5: Statistics of Sequence Comparison

Your name:
Your email address:

Some cluster computer basics

Spend 5 minutes reading this introduction to the command-line interface.

Start a program called PuTTY. In the "Host" field, type xanadu-submit-ext.cam.uchc.edu
In the "Username" field, type your username mcb3421usrXX, where XX is a number assigned to you. Login. It will ask you to accept a new host key. Now enter the password _______. Hit return. <Note the terminal will not echo the keystrokes you make to enter the password.>

The window with a black background that opens on your screen is the command-line window.

In the window with a black background, you will see a flashing cursor. Stuff you type appears at the cursor position.

You can read more about the xanadu cluster here.

To move from the login node (you should never, ever run things on this node) to a compute node,
type srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash
and then hit the return or enter key. This takes you to a "compute" node.
If the cluster is busy, and the queue you select is busy, the move to the compute node can take several minutes.

Now type qstat
This command shows the job-ID of your session.

Type ls
This gives you a directory (or folder) listing. There may be nothing to see (yet). Type mkdir lab5
This command stands for "make directory". Now type ls
again. You should see your newly-created directory. Type pwd
This stands for "print working directory". This forward-slash (/) separated "path" is a shorthand for the directory you are currently in.
Type cd lab5
Now type pwd again. You have moved into the lab5 directory.

If you need more information on a command (e.g., pwd), you can use the man command. man pwd displays the manual pages for the pwd command. (You can move forward using the space-bar, or the arrow keys. to leave the manual page type q.)

There are many ways to get files into your account, e.g., the file transfer application that is part of the SSH Client or Filezilla. Another possibility is the wget commant. To move this file into the directory on your cluster, you could type
wget http://carrot.mcb.uconn.edu/mcb3421_2017/HaloATPases+Intein.txt
You could copy paste the above line to the terminal window, or right click "this file" and copy the link address.

Do the same for this file:
wget http://carrot.mcb.uconn.edu/mcb3421_2017/HaloATPases-Intein.txt

Once you downloaded the files into the lab5 directory, you can display their content by typing
cat HaloATPases-Intein.txt
Alternatively you could use the more or less commands that allows you to go through a file page by page (Type <space> to get to the next page, type "q" to quit).

To display both files you could use the wildcard "*", which stands for any character.
cat H*
Dispalys both files to the screen. By default the output of cat goes to the standard output, in our case the screen. But you can redirect the output to a file using the > symbol.
cat H* > HaloATPases.txt
copies both files into a new file.

Use more to scrol through the file
more HaloATPases.txt

There is a similar command called less (strange humor, more or less) that allows you to move forward and backwards in a file:
less HaloATPases.txt

The unix command line come with helpful features.

One is automatic line completion.
If you type a command and hit the <tab> key, the interface will complete the line as long as there is only one possible completion.
In our case,
less H<tab> will complete the line to less HaloATPases. If you enter the <tab> key again, the different possible completions will be listed, enter the next character of the file you want to see (. or - or +) and hit the <tab> key again.

Another is the history. The upward arrow in the cursor field will recall the last command (and hitting it repeatedly will allow you to go back in the command history.) You can use the forward and backward arrow to move the cursor to within the command and then modify the text (this is great, if you mistyped, or were in the wrong directory). The only complication is that the mouse does not work to move the cursor location in the commandline. Try it out!

type more HaloATPases .txt
then use the recall command and the arrow keys to correct he command.

You also can type history in the commanline to see all the commands you executed.

Type exit
You will return to the master node. Now type qstat
You should have no jobs running. If there is a job listed, then type qdel
followed by a space, and then the job-ID number, followed by return or enter key.
Then type qstat again, to confirm that there are no running jobs.

To change the password for your account, you can use the passwd command. Warning: only do this, if you will be able to memorize both your username and your password!

Then type logout to exit from the master node.

 

1. PRSS (20 minutes)

Using PRSS*, determine if there is significant similarity between the proteins with gi numbers: 2506213, 2493127, 4323566, 2983405, 1303679. To start the program click on "Shuffle Sequence". Sample values are in the table below. Note that 2e-3 means 2 × 10-3. Note, you can do many comparisons at once, using multiple gi numbers for the second query (here)

Table 1: PRSS pseudo E-value for 10000 comparisons. Fill out one column and enter numbers on whiteboard.

protein GI number
2506213
flagellar ATPase
2493127
vacuolar ATPase subunit B
4323566
ATPase subunit alpha
2983405
transcription termination factor Rho
1303679
vacuolar ATPase subunit A
2506213
flagellar ATPase
         
2493127
vacuolar ATPase subunit B
       
4323566
ATPase subunit alpha
     
2983405
transcription termination factor Rho
   
1303679
vacuolar ATPase subunit A
 

For a comparison that resulted in a significant E value (strong - but smaller than 1e-5), but not overwhelming (i.e. > 1e-90) repeat the analysis selecting a different substitution matrix (Blosum50, Blosum80, PAM250).
How does the E-value change?

For a comparison that resulted in a significant E value (strong - smaller than 1e-5), but not overwhelming (i.e. > 1e-90) repeat the analysis increasing and decreasing the gap opening penalty (e.g., -1, -10, -13, -20, -100).
Which gap opening penalty gives the smallest E-value? What might be the reason for your finding?

* The PRSS server at embNET provides the traditional PRSS output, but their link to sequence retrieval (among others) is broken ... .

2. BLAST (10 minutes)

Repeat a few of the pairwise comparisons using Pairwise BLAST
(go here,
  then select the protein BLAST.
  Click the box to select align two or more sequences, which is at the bottom of the "Enter Query Sequence" box. Make sure the blastp tab in the header is selected).
You can paste the GI numbers directly into the box labeled "Enter accession number(s), gi(s), or FASTA sequence(s)" You can force the program to report insignificant alignments by increasing the expect value. To do this, click on the plus sign labeled "Algorithm parameters". A number of additional options will drop down, including the "Expect Threshold," which is set to 10 by default, but you can enter a larger number to obtain less significant matches (do this only if the default parameter does not return any result). When all of your parameters are set, click the BLAST button.

Table 2: BLASTp E-value.Enter your values on the whiteboard

protein GI number
2506213
flagellar ATPase
2493127
vacuolar ATPase subunit B
4323566
ATPase subunit alpha
2983405
transcription termination factor Rho
1303679
vacuolar ATPase subunit A
2506213
flagellar ATPase
         
2493127
vacuolar ATPase subunit B
 
   
4323566
ATPase subunit alpha


   
2983405
transcription termination factor Rho

   
1303679
vacuolar ATPase subunit A



 

How do these E-values compare to the ones obtained using PRSS?

Note: The NCBI BLAST interface adjusts the gap penalties to a new default value for each substitution matrix.
How do these values correspond the impact of the gap penalty on the E-value you performed in exercise 1?


3a. Transitive homology? Part one (5 minutes)

You find that sequence A (gi 1303679) has a significant similarity to sequence B (gi 2506213) over the entire length and sequence B has significant similarity to C (gi 2983405) over the entire length, but C and A are not significantly similar (assuming an E-value <e-8).

Can you nevertheless conclude that A is homologous to C? (Two characters -- sequences, or morphological characters -- are homologous if they are derived from the same character existing in some ancient organism.)


3b. Transitive homology? Part two (10 minutes)

Does the same reasoning hold for gi 6320016, gi 1303679, gi 2507047?

Why might this case be different from the previous one?
How does the output from the pairwise blast comparison help you to draw a conclusion (compare 6320016 with 1303679, and 6320016 with 2507047)?

 

4. FASTA (10 minutes)

Do a databank search of the Swissprot database with (gi 2493127). Fasta is accessible through the web at

   http://www.ebi.ac.uk/Tools/sss/fasta/

(Enter the sequence in Fasta format (either search entrez for for 2493127 or go here and cut and paste (step 2)). In the display pulldown-menu select FASTA (step 3). Select the UniProtKB/swissprot database as target (step 1). In step 3 under options, change the HISTOGRAM pulldown menu to yes, under the "Matrix" pulldown select the BLASTP62 matrix, under the "alignments" pulldown select 1000 alignments and 1000 scores, leave everything else in the default options.)
<If the server is overloaded, you can get the results here.>

1) How many proteins show sequence similarity to the query sequence with an E-value smaller than 1E-198?
2) What type of sequences are among the matches?

Click on the "Tool Output" tab.
3) In comparing the number of actual matches (==) to the distribution fitted to these data (*) for each alignment score (given in the left column), for which value of score(s) does the number of actual hits deviate most from the expectation?
4) In the pairwise alignments, what is signified by the ":" and the "." ?

Repeat the search (same parameters) with the uniprot databank that clusters sequences together that have more than 50% identity. <If the server is overloaded, you can get the results here.>
5) How many matches better than 1E-198 did you obtain?

Finished?

Check the appropriate radio button below before pressing the submit button:

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.

 

Table 1: PRSS pseudo E-value for 10000 comparisons.

protein GI number
2506213
flagellar ATPase
2493127
vacuolar ATPase subunit B
4323566
ATPase subunit alpha
2983405
transcription termination factor Rho
1303679
vacuolar ATPase subunit A
2506213
flagellar ATPase
8.3e-165
4.4e-19
5.6e-17
8.6e-14
6.2e-14
2493127
vacuolar ATPase subunit B
4.1e-18
1e-186
1.6e-17
26
1.3e-13
4323566
ATPase subunit alpha
5e-19
1.1e-20
1.8e-163
0.00012
1.7e-08
2983405
transcription termination factor Rho
1.5e-11
74
0.0085
2.6e-177
3.3
1303679
vacuolar ATPase subunit A
2.2e-13
8.7e-14
5.4e-07
0.85
0

 

 

Table 2: BLASTp E-value.

protein GI number
2506213
flagellar ATPase
2493127
vacuolar ATPase subunit B
4323566
ATPase subunit alpha
2983405
transcription termination factor Rho
1303679
vacuolar ATPase subunit A
2506213
flagellar ATPase
0 1e-31 1e-32 1e-20 9e-26
2493127
vacuolar ATPase subunit B
3e-28 0
4e-35

0.003 2e-24
4323566
ATPase subunit alpha
2e-28

4e-35

0 5e-06
2e-15

2983405
transcription termination factor Rho
1e-16 0.003
4e-06

0 4e-05
1303679
vacuolar ATPase subunit A
5e-22
3e-24

2e-15

7e-05

0