Powerpoint slides on BLAST are here

Blast, PSI–blast, and Homology

Are two similar sequences homologous (i.e., is their similarity due to shared ancestry) ?

One way to quantify the similarity between two sequences is to

1.    compare the actual sequences and calculate alignment score

2.    randomize (scramble) one (or both) of the sequences and calculate the alignment score for the randomized sequences.

3.    repeat step 2 at least 100 times

4.    describe distribution of randomized alignment scores

5.    do a statistical test to determine if the score obtained for the real sequences is significantly better than the score for the randomized sequences

To illustrate the assessment of similarity/homology we will use a program from Pearson's FASTA package called PRSS. 
This and many other programs by Bill Pearson are available from his web page at ftp://ftp.virginia.edu/pub/fasta/

A web version is available here.

Go through example. Sequences are here (fl), here (B), here (A) and here (A2)

There are many other alignment programs.  BLAST is a program that is widely used and offered through the NCBI (go here for more info).  It also offers to do pairwise comparisons  (go here, do example).

To force the program to report an alignment increase the E-value.

Rules of thumb:

Usually E values (in a blast search or through randomization) smaller than 10-4 are convincing.

(For small values the E value gives the probability to find a match of this quality in a search of a data bank of the same size by chance alone - for more detailed information see Terminology section and the BLAST help manual on P values)

If you can demonstrate significant similarity using either randomization or an unweighted blast search, your sequences are homologous (i.e. related by common ancestry).  Convergent evolution has not been shown to lead to sequence similarities detectable by these means (see above - this might not be true for scores in PSI-blast)

If the actual alignment score is more than three standard deviations (of the randomized sequences) better than the mean for the randomized sequences, the two sequences are homologous (i.e. related by common ancestry).  PRSS and many other program use more accurate distributions to describe the distribution of random hits.  The expectation value for the alignment-score of the actual sequences is based on these statistics.

Terminology:

E-values give the expected number of matches with an alignment score this good or better,
P-values give the probability of to find a match of this quality or better. P values are [0,1], E-values are [0,infinity).
For small values E=P
z-values give the distance between the actual alignment score and the mean of the scores for the randomized sequences expressed as multiples of the standard deviation calculated for the randomized scores.
For example: a z-value of 3 means that the actual alignment score is 3 standard deviations better than the average for the randomized sequences. Z-values > 3 are usually considered as suggestive of homology, z-values > 5 are considered as sufficient demonstration. (see the but below). A somewhat readable description of E, P, HSP and other values is here.

BUT:
Failure to detect significant similarity does only shows our inability to detect homology, it does not prove that the sequences are not homologous.

Examples:

Jim Knox (MCB-UConn) has studied many proteins involved in bacterial cell wall biosynthesis and antibiotic binding, synthesis or destruction. Many of these proteins have identical 3-D structure, and therefore can be assumed to be homologous, however, the above tests fail to detect this homologies. (for example, enzymes with GRASP nucleotide binging sites are depicted here.)

DNA replication involves many different enzymes. Some of the proteins do the same thing in bacteria, archaea and eukaryotes; they have similar 3-D structures (e.g.: sliding clamp, E. coli dnaN and eukaryotic PCNA, see Edgell and Doolittle, Cell 89, 995-998), but again, the above tests fail to detect homology.

Helicase and F1-ATPase. Both form hexamers with something rotating in the middle (either the gamma subunit or the DNA; D. Crampton, pers. communication). The monomers have the same type of nucleotide binding fold (see gogarten.uconn.edu rotating F-ATPases)

BLAST and PSI BLAST

Run a blast trial with  this sequence

A BLAST tutorial for standard blast is here; a more general tutorial on BLAST, including PSI BLAST, is here

An easy way to force the program to report less significant matches is to increase the expect value in the blast search form pages

PSI-blast provides an enormous advantage over normal blast in the detection of distantly related sequences.  It only works if some closely related sequences are already available, but if this is the case it finds a lot of other distantly related sequences. 

The NCBI page describes PSI blast as follows:
Position-Specific Iterated BLAST (PSI-BLAST) provides an automated, easy-to-use version of a "profile" search, which is a sensitive way to look for sequence homologues. The program first performs a gapped BLAST database search. The PSI-BLAST program uses the information from any significant alignments returned to construct a position-specific score matrix, which replaces the query sequence for the next round of database searching. PSI-BLAST may be iterated until no new significant alignments are found. At this time PSI-BLAST may be used only for comparing protein queries with protein databases. 

A diagram giving an overviw over the PSI-blast procedure is here.

The results of a normal blast search are aligned and a pattern of conserved residues is extracted from the alignment.  This pattern is used for the next iteration.  An important parameter to adjust is the E-value threshold down to which matches are included in the alignment and pattern extraction. 
At higher iterations a PSI blast profile can be corrupted and false positives are identified with “significant” E-values.   I.e., in a traditional blast search one can be quite certain that a match with an E-value of 10^-13 represents a homologue; this is not clear with PSI blast.  Test studies indicate that profile corruptions are likely after more than 5 iterations. On the positive side: there are many fewer false negatives with PSI blast than with normal blast.

 

Assignment #3:

Write down your answers!

Blast using the NCBI web interface:
  1. There are different BLAST programs.
    Why/when would you want to use BLASTP? When blastN, blastx, tblastn or tblastx?


  2. Using a protein sequence of your choice (see above, or gi|59713171) search NR, SWISSPROT and env_nr.
    How many significant hits did you find in each of these databanks?
    How do the results change, when you use a smaller word size or a different scoring matrix (use swissprot to explore these questions)?
    Do all the target sequences have similar description lines? At what E-values does the content of the annotation lines change?
    Compare your results with that of your neighbors.
    Did the low complexity filter replace any part of your query sequence with XXX?
    If you want to search an individual genome, you can go to the corresponding genome page (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi click in the genome in the Refseq column), and start a blast search. You also can start searches of several genomes simultaneously at the microbial genome blast page.

  3. Using Entrez download the protein sequence with PID# 2506213. What type of proteins do you find among the related proteins?

  4. Copy the FASTA format of gi2506213 onto your computer's clipboard (not the clipboard of the NCBI page), go to the NCBI BLAST page and do a BLAST search (default options). Do you obtain any different results?

  5. Download the fasta formated sequences for D-Alanin-D-Alanin ligase (gi|145722) and a glutathione synthetase (gi|121663). Also check among the related sequences for carbamoyl phosphate synthetases.

  6. Do a BLAST search of Swissprot with the D-Ala D-Ala ligase or the glutathione synthetase as a query. Increase the "Expect" parameter to 1000 and the "descriptions" to 250

  7. Do a PSI-BLAST search with the same sequence as a query (use the SWISSPROT database as target). After how many iterations do you start to pick up the other sequence types? Which other types of enzyme are included among the hits?
    (If it is too slow check here for 7 rounds of PSI-BLAST).

  8. Use Internet Explorer for this exercise!!!!
    Do a PSI-BLAST search for 3 iterations with the following sequence:
    >gi|7436316|pir||D75028 Pab VMA intein CVDGDTLVLTKEFGLIKIKDLYKILDGKGKKTVNGNEEWTELERPITLYGYKDGKIVEIKATHVYKGFS
    AGMIEIRTRTGRKIKVTPIHKLFTGRVTKNGLEIREVMAKDLKKGDRIIVAKKIDGGERVKLNIRVEQKR GKKIRIPDVLDEKLAEFLGYLIADGTLKPRTVAIYNNDESLLRRANELANELFNIEGKIVKGRTVKALLI
    HSKALVEFFSKLGVPRNKKARTWKVPKELLISEPEVVKAFIKAYIMCDGYYDENKGEIEIVTASEEAAYG FSYLLAKLGIYAIIREKIIGDKVYYRVVISGESNLEKLGIERVGRGYTSYDIVPVEVEELYNALGRPYAE
    LKRAGIEIHNYLSGENMSYEMFRKFAKFVGMEEIAENHLTHVLFDEIVEIRYISEGQEVYDVTTETHNFI GGNMPTLLHNT
    What types of enzymes do you get as hits? Do you notice anything strange about the search results? Save the PSSM (Position Specific Scoring Matrix, or profile) from your search on the 4th iteration. To do that choose PSSM from pull-down menu under Format options and click "Format!" button. After the search is done, you should get strangely looking alphanumerical symbol mixture in your browser window. This is a PSSM. Save PSSM matrix to the disk as text file, or keep this browser window opened. We are going to use this profile in the next question.

  9. Now we will use the PSSM to BLAST the completed genomes. Go to Microbial Genomes Genomic BLAST page (Let it load completely before choosing any options). Paste intein sequence into query sequence box, change Query and Database entries to "Protein". Choose one of the following genomes as Database:
    • Pyrobaculum aerophilum
    • Aeropyrum pernix
    • Sulfolobus tokodaii
    • Archaeoglobus fulgidus
    • Methanothermobacter thermautotrophicus
    • Thermoplasma volcanium
    • Methanococcus jannaschii
    • Saccharomyces cerevisiae (This genome is on the eukaryotes Genomic BLAST page - select eukaryoted on top of the genome blast page)

    After that click "Adv. BLAST" button. This will redirect you to the BLAST search window. Paste your PSSM from Question #7 into PSSM box (under Options). What are the results of your search? Did you get any significant matches? What are they? If you have significant matches, does the match occur over the full lengths of both query and subject sequences? Use Blink to investigate if the hits are indeed inteins.
    What is your conclusion? In your answer indicate
     - genomes searched,
     - number of significant matches found,
     - the E-values of these matches, and
     - the identity of these matches
         (i.e., are these probable inteins, or are they likely to be something else?).

    The following are optional. Do them only if you have time.
    You might want to return to this exercise if you have timthis afternoon of next week.

    Using Blast on a "local" machine:
    Many completed genomes can be found on the NCBI FTP site (ftp://ftp.ncbi.nih.gov). The eukaryotic genomes are accessible from the genomes directory while all prokaryotic (bacteria + archaea) genomes are stored in the the subfolder /bacteria (a clear case of anti archaeal discrimination). Go into the bacteria sub-folder and click on the Pyrococcus abyssi folder (Pyrococcus is a member of the archaeal domain). Among the files in these folders we are are interested in the .ffn file which contains all nucleotide sequences for individual predicted ORF's and the .faa file which contains the predicted ORF's amino acids sequences. Both files are in a fasta format. Copy the .faa file on to your computer and rename the file p_abyssi.faa. Also download the .faa file form Thermotoga maritima and rename it to t_maritima.faa.
    Also download any other genomes you might be interested in!

  10. To be able to use the fasta file as a target for a database search, we need to use the formatdb program. You invoke this program by typing formatdb in a dos prompt window (peovided the blast program folder has been added to your path. if this is not the case, follow oral instuctions!).
    The following should work to turn p_abyssi into a searchable databank. (You can do this for any collection of sequences in fasta format. Note you do not need to format the file containing your query sequences)

    formatdb -i p_abyssi.faa -o T -p T

    The -i is to indicate your infile.name, -p T is for a protein sequence, for a nucleotide sequence, you need to use -p F (for False); the -o T instructs the formatdb program to create indices. For more information on the formatdb program check http://bioinformatics.ubc.ca/resources/tools/index.php?name=formatdb .

  11. To do a blast search of every sequence in thermotoga.faa against the P.abyssi genome, we use blastall. To get information on the program type
    blastall <enter>
    For details check one of the many help pages on blastall, examples are here and here.


    Which option turns off the the low complexity filter?
    Which option, and which setting, sets the wordsize to 2?
    Which option allows to use two processors?
    Which options (and why) will you use in your blast search? (the answer depends on what questions you want to answer).


  12. The following options might work:

    blastall -i t_maritima.faa -d p_abyssi.faa -o blast.out -p blastp -e 10e-20 -m 8
    -i the querry sequence(s)
    -d the database
    -o the output file
    -p the blastprogram to use
    -e the e-value cut off
    -m the format of the output. option 8 gives a tabular output, that can be easily read into excel spreadsheet or into a database program. option 9 is similar but includes comment lines.
    the individual columns in the output denote the following
    # Query id # Subject id # % identity # alignment length # number of mismatches # number of gap openings # position of query start # position of query end # position of subject start # position of subject end # e-value of a hit # bit score of a hit

  13. Load the results the results into an excel spreadsheet.
    We might have a demonstration in class on how to do this.

    If you want only the one top scoring blast hit for each query sequence, you can run the mod.pl program on your sequences. The program is here (It also adds a header to the table and adds another column). Download it into your directory. You should be able to invoke the program from the command line (after you moved to the correct directory using the cd command) by typing
    perl extract_lines.pl blast.out (stay tuned there might be issues with the path, and with perl under windows, etc.)
    Also, open the program in a word processor and read through it - it might be easier to understand than you think - PERL often comes in handy.

    How many matches did you find that are above an E-value of 10e-20?
    Which gene was most similar between T. maritima and P. abyssi?