Assignment 11

Your name:
Your email address:

Pop-Up Quiz:

Why is the E-value not a good measure for false positives in a PSI-blast search?

Assuming you did 10 iterations in a PSI blast search with an evalue cut-off of 0.005. Why should you be skeptical of a match reported in the 10th iteration even though that match has a reported E-value of .000000001?

Assignments:

1. Using PRSS (or here) determine if significant similarity exists between proteins with the following accession numbers AAA23671.1 (D-Ala D-Ala ligase) and P04425.1 (Glutathione synthetase). Copy the fasta formated sequences into the PRSS window. Select 1000 shuffles, then click on "Shuffle Sequence" to start the comparison.

What is the E-Value (10000) of the comparison?

Collaborate with others in the class. Some should do exercise #2, the other exercise #3. Share the results!

2. (10 minutes) Do a PSI-BLAST search (this is the same submission page as for blastp, but you check off the PSI blast radio button) with the

Glutathione synthetase (P04425.1 ) as a query,
use UniProtKBSwiss-Prot as database,
restrict the taxon to only search Bacteria (bacteria (taxid:2)),
select the PAM250 substitutions matrix,
select 20000 Max Target Sequences).
Under "Filters and Masking", turn on the filter for low complexity (tick/check the box),
set the E-value cut-off for inclusion in the next round ("PSI-BLAST Threshold") to 0.001 (or 1e-3) and
change the maximum target sequences to 20000.

Note: Depending on the settings, PSI-Blast switches back and forth between the format and the result window. Place a check in the box "Show results in a new window" (at the bottom of the page).

Note that the iteration number is listed at the top of the result page. One "launches" the next PSI-Blast iteration by looking for the "Run PSI-Blast iteration ? with max" at the top (or the bottom) of the list of significant matches, and clicking "Go".

After how many iterations (do not more than 5 iterations!) do you start to pick up D-Ala D-Ala ligase, biotin carboxylase (aka Acetyl-CoA carboxylase A1) and Carbamoyl-phosphate synthase ?
D-Ala D-Ala ligase: biotin carboxylase (aka Acetyl-CoA carboxylase A1): carbamoyl-phosphate synthase:

Which other types of enzyme are included among the hits?

Note:Collaborate with your neighbor. One of you should do task #2, the other #3.

3. (10 minutes) Do a PSI-BLAST search (this is the same submission page as for blastp, but you check off the PSI blast radio button) with the

D-Ala D-Ala ligase AAA23671.1 as a query,
use UniProtKBSwiss-Prot as database,
restrict the taxon to only search Bacteria (bacteria (taxid:2)),
select the PAM250 substitutions matrix),
select 20000 Max Target Sequences.
Under "Filters and Masking", turn on the filter for low complexity (tick/check the box),
set the E-value cut-off for inclusion in the next round ("PSI-BLAST Threshold") to 0.001 (or 1e-3), and
change the maximum target sequences to 20000.

Note : Depending on the settings, PSI-Blast switches back and forth between the format and the result window. Place a check in the box Show results in a new window (at the bottom of the page).

After how many iterations (do not more than 5 iterations!) do you start to pick up carbamoyl phosphate synthetases, glutathione synthetase, and biotin carboxylase (aka Acetyl-CoA carboxylase A1)?
glutathione synthetase: carbamoyl phosphate synthetases: biotin carboxylase (aka Acetyl-CoA carboxylase A1):

Which other types of enzyme are included among the hits?

What might be the reason for the different results obtained in tasks 2 and 3?

4. (30 Minutes) Do a PSI-BLAST (useuse UniProtKBSwiss-Prot as database and restrict the taxon to only search archaea (archaea (taxid:2157))) search for 5 iterations, an E-value cut-off for inclusion in the next round ("PSI-BLAST Threshold") to 0.0001 with the following sequence (if this takes to long, use this PSSM linked below):

DO NOT use the gi number as query. The gi number refers to the whole protein, we want to use the intein sequence only!
>Pab_VMA intein from gi|7436316|pir||D75028
CVDGDTLVLTKEFGLIKIKDLYKILDGKGKKTVNGNEEWTELERPITLYGYKDGKIVEIKATHVYKGFS
AGMIEIRTRTGRKIKVTPIHKLFTGRVTKNGLEIREVMAKDLKKGDRIIVAKKIDGGERVKLNIRVEQKR
GKKIRIPDVLDEKLAEFLGYLIADGTLKPRTVAIYNNDESLLRRANELANELFNIEGKIVKGRTVKALLI
HSKALVEFFSKLGVPRNKKARTWKVPKELLISEPEVVKAFIKAYIMCDGYYDENKGEIEIVTASEEAAYG
FSYLLAKLGIYAIIREKIIGDKVYYRVVISGESNLEKLGIERVGRGYTSYDIVPVEVEELYNALGRPYAE
LKRAGIEIHNYLSGENMSYEMFRKFAKFVGMEEIAENHLTHVLFDEIVEIRYISEGQEVYDVTTETHNFIGG
NMPTLLHNT

DO NOT use the gi number as query. The gi number refers to the whole protein, we want to use the intein sequence only!

What types of enzymes do you get as hits?
How could you verify that the target proteins contain inteins?
Which E-value cut-off for inclusion in the next round did you choose?
What is the percent identity of the least significant hit added in the last iteration (clicking on the score in the table will jump to the alignment)?

Save the PSSM (Position Specific Scoring Matrix, or profile) from this search. To do that choose PSSM from pull-down menu inside the download link close to the top of the result page. Save the PSSM as an ASN file on your computer.
(If the iterations take too long, the PSSM after 6 iterations on the swissprot database is here, the PSSM after 5 iterations on the nr database is here*)

* I used the following commands (commandline using the new blast+ package):

module load blast/2.10.1

psiblast -db /isg/shared/databases/blast/v5/nr -out out.inteinq -query intein_query.fasta -out_pssm inteinquery.pssm -out_ascii_pssm inteinquery.asci.pssm -inclusion_ethresh 0.0001 -outfmt 6 -num_iterations 5 -num_threads 2 -save_pssm_after_last_round -max_target_seqs 50000

These PSSMs should work in detecting a wide variety of intein containing proteins. We will use them to search the genomes of organisms of interest. Go to PSI-BLAST. Select BlastP (on top of the page). [You can skip this with the new blast webpage: Paste intein sequence into query sequence box.] Select the non redundant database.
Select the organisms to which the search should be restricted. You can select individual organisms or whole groups. (if you start typing, options will appear form which to select taxa)

Possible are

Pyrobaculum aerophilum
Aeropyrum pernix
Sulfolobus tokodaii
Archaeoglobus fulgidus
Methanothermobacter thermautotrophicus
Thermoplasma volcanium
Methanocaldococcus
Methanococcales
Halobacteria
Halorubrum
Haloferax

After selecting the genomes to search, go to Algorithm parameters and under PSI-blast options select and upload your PSSM (the one you saved yourself, or the ones that are linked above; if you have time, use the different PSSMs to see if this makes any difference).

As you start your search with a PSSM matrix already, you do not need to (and should not) do iterations!

What are the results of your search? Did you get any significant matches? What are they?
If you have significant matches, does the match occur over the full lengths of both query and subject sequences? What is the percent identity?
Use a hit in the databank as a query in a fast blast search to investigate if the hit indeed harbors an inteins.
What is your conclusion?
In your answer indicate
- the PSSM you used,
- genomes searched,
- number of significant matches found,
- the E-values of these matches, and
- the identity of these matches (i.e., are these probable inteins, or are they likely to be something else?).

5. Using PSSMs

SKIP THE COMMANDs IN BLUE!!!!!

Execute the ones in green

IS605.faa.txt is a FASTA formatted file containing an annotated IS605 transposase protein sequence from a Frankia genome.

We will use it to build a PSSM for this protein family, and then compare (mainly quantitatively) three searches:

a blastp search of Frankia genomes for ORFs with significant matches to the sequence in the FASTA file,
a tblastn of Frankia genomes for significant matches to the sequence in the FASTA file
a PSI-blastp search of the collection of ORFs from the five genomes (using a pre-calculated PSSM, and
a PSI-tblastn search of Frankia genomes for significant matches to this PSSM

To do this, we will use the cluster. Use PuTTY (or another terminal program) to establish a terminal connection to the cluster at via xanadu-submit-ext.cam.uchc.edu (same username and password as previously). (Since the files are going TO the cluster in this case, we will use "curl" for convenience, but it might be a good idea to also open a filezilla connection.)

A FASTA file with all the proteins from 5 Frankia genomes is here - sixFrankia.faa.txt

A FASTA file of the nucleotide sequences of the 5 Frankia genomes is here - sixFrankia.fna.txt

Note that these (currently 5) Frankia genomes were retrieved from the Entrez ftp site. In this case, I retrieved the *.faa (protein) files, and the *.fna (genome nucleotide) files.

Preparation: (bracketed parts are comments only!)

	mkdir lab11								(make a new folder called "lab13")
	cd lab11								(equivalent of "double-clicking" a folder to descend "into" it)
	curl -O https://j.p.gogarten.uconn.edu/mcb3421_2019/labs/IS605.faa.txt	(curl stands for "see URL", this downloads the IS605 file to the cluster)
	                                                              	 	(...you could also accomplish this with the Bitvise SFTP client)
	ls									(make sure the file is there)
	cat IS605.faa.txt					          	(...and that the contents look OK)
	srun --pty --partition=mcbstudent --qos=mcbstudent --mem=2G bash	(we're about to do some serious computation, so onto a compute node we go...)
	module load blast                                                       (load the blast module)
	ls	                                                                (once more to make sure the IS605 file is present in this folder)

The latest version of blast requires processors that have a particular instruction set (usually found in computers used in gaming).  To use this you need to select the compute node and blast module as follows: 
	srun --pty --partition=xeon --qos=mcbstudent --mem=2G bash	 (we're about to do some serious computation, so onto a compute node we go...)
	module load blast/2.10.1                 (load the blast module)

To create the PSSM for IS605 we can use the following command: (the psiblast command is long!...remember to scroll to the right to get all of it!)
Do NOT EXECUTE THE COMMAND IN BLUE!!!!!psiblast -query IS605.faa.txt -db nr -out is605.faa.blast -outfmt 6 -inclusion_ethresh 0.0005 -num_threads 2 -num_iterations 5 -out_pssm is605.faa.pssm -max_target_seqs 100000 -save_pssm_after_last_round ("-num_threads 2" says to use two threads -- similar to two processors/CPUs -- which should make it faster, at the expense of using twice the resources)

This takes quite a while, grab the PSSM from here - is605.faa.pssm: curl -O https://j.p.gogarten.uconn.edu/mcb3421_2020/labs/is605.faa.pssm

Options for psiblast can be seen using psiblast -help or look at psiblast-help.txt


         In
                preparation for our blast searches on the Frankia genomes, we
                need to first create a searchable blast database: 
        
        	curl -O https://j.p.gogarten.uconn.edu/mcb3421_2020/labs/sixFrankia.faa.txt
        makeblastdb -in sixFrankia.faa.txt -dbtype prot -parse_seqids
	curl -O https://j.p.gogarten.uconn.edu/mcb3421_2020/labs/sixFrankia.fna.txt
	makeblastdb -in sixFrankia.fna.txt -dbtype nucl -parse_seqids

        To do a normal blastp search: 
        blastp -db sixFrankia.faa.txt -query IS605.faa.txt -out blastp.out -evalue 1e-5 -outfmt 6 -num_threads 2
        To do a normal tblastn search:


            tblastn -db sixFrankia.fna.txt -query IS605.faa.txt -out tblastn.out -evalue 1e-5 -outfmt 6 -num_threads 2
          

          To do a PSI-blast search of the encoded proteins:
        	(get the PSSM first, because we skipped the command in blue, see above)
psiblast -db sixFrankia.faa.txt -in_pssm is605.faa.pssm -out PSIblastP.out -inclusion_ethresh 1e-5 -evalue 1e-5 -outfmt 6 -num_threads 2
         
        As we are doing only the last iteration, an
          E-value cut-off of .001 is ok -- no danger of corrupting the PSSM. 

          Note, we need to add -evalue 1e-3, otherwise the
          e-values are reported in the table up to an e-value of 10!
          
        
        You will receive a fancy warning message, but
            it seems to works just fine. More
              on the warning message is here. 
        
         To do a PSI-blast search of the 6 reading frames of
          the genomes: 
        	tblastn -db sixFrankia.fna.txt -in_pssm is605.faa.pssm -out psitblastn.out -evalue 1e-5 -outfmt 6 -num_threads 2
         Counting the number of lines (corresponding to the
          number of significant matches; note, we had set the e-value to 10^-3,
          and selected tabular output) in a file:
         wc
                -l blastp.out

                    wc -l tblastn.out

                wc -l PSIblastP.out

              wc -l
                psitblastn.out
        or   wc
                -l psitblastn.out PSIblastP.out tblastn.out
                        blastp.out
        Also note (for fun purposes only) that the tblastn
          matches are not distributed evenly by genome: 
         cut -f2
              psitblastn.out | uniq -c 
          (cuts out 2nd field, which is the target genome ID, and counts the
          matches for each of the 5 genomes)

          To figure out which genome is which, you can use the blastdbcmd

blastdbcmd -db sixFrankia.fna.txt -entry NC_007777.1,NC_009921.1,NC_014666.1,NC_015656.1,NZ_CM001489.1,NC_014666.1 -outfmt "%l %a %t"

 The outfmt string specifies length of
            th sequence, accession number, and title. To get more help on the
            blastdbcmd use

blastdbcmd
                -help 
           How does the number of blastp matches compare to
            the number of tblastn, PSI-blast, and PSI-tblastn matches?
           
          
          If there is a significant difference in the number
            of matches, can you think of a reason why this could happen?
           
          
          BE SURE TO TYPE logout (log out from
                  compute node and return to master node) TO RELEASE THE
                QLOGIN JOB FROM THE QUEUE 
          
            Finished? 
            Type logout to release the compute
              node from the queue. 

                If you you encountered problems in your session, check the queue
                for abandoned sessions using the command qstat. If there are
                abandoned sessions under your account, kill them by deleting
                them from the queue by typing qdel
                job-ID, e.g. "qdel 40000"
                would delete Job # 40000 
          
           
            
              Send email to your instructor (and yourself)
              upon submit 

               Send email to yourself
                only upon submit (as a backup) 

               Show summary upon
              submit but do not send email to anyone.