homology assesment

Blast using the NCBI web interface:

There are different BLAST programs.
Why/when would you want to use BLASTP? When blastN, blastx, tblastn or tblastx?
Using a protein sequence of your choice (see above, or gi|59713171) search NR, SWISSPROT and env_nr.
How many significant hits did you find in each of these databanks?
How do the results change, when you use a smaller word size or a different scoring matrix (use swissprot to explore these questions)?
Do all the target sequences have similar description lines? At what E-values does the content of the annotation lines change?
Compare your results with that of your neighbors.
Did the low complexity filter replace any part of your query sequence with XXX?
If you want to search an individual genome, you can go to the corresponding genome page (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi click in the genome in the Refseq column), and start a blast search. You also can start searches of several genomes simultaneously at the microbial genome blast page.
Using Entrez download the protein sequence with PID# 2506213. What type of proteins do you find among the related proteins?

Copy the FASTA format of gi2506213 onto your computer's clipboard (not the clipboard of the NCBI page), go to the NCBI BLAST page and do a BLAST search (default options). Do you obtain any different results?

Download the fasta formated sequences for D-Alanin-D-Alanin ligase (gi|145722) and a glutathione synthetase (gi|121663). Also check among the related sequences for carbamoyl phosphate synthetases.

Do a BLAST search of Swissprot with the D-Ala D-Ala ligase or the glutathione synthetase as a query. Increase the "Expect" parameter to 1000 and the "descriptions" to 250

Do a PSI-BLAST search with the same sequence as a query (use the SWISSPROT database as target). After how many iterations do you start to pick up the other sequence types? Which other types of enzyme are included among the hits?
(If it is too slow check here for 7 rounds of PSI-BLAST).

Use Internet Explorer for this exercise!!!!
Do a PSI-BLAST search for 3 iterations with the following sequence:
>gi|7436316|pir||D75028 Pab VMA intein CVDGDTLVLTKEFGLIKIKDLYKILDGKGKKTVNGNEEWTELERPITLYGYKDGKIVEIKATHVYKGFS
AGMIEIRTRTGRKIKVTPIHKLFTGRVTKNGLEIREVMAKDLKKGDRIIVAKKIDGGERVKLNIRVEQKR GKKIRIPDVLDEKLAEFLGYLIADGTLKPRTVAIYNNDESLLRRANELANELFNIEGKIVKGRTVKALLI
HSKALVEFFSKLGVPRNKKARTWKVPKELLISEPEVVKAFIKAYIMCDGYYDENKGEIEIVTASEEAAYG FSYLLAKLGIYAIIREKIIGDKVYYRVVISGESNLEKLGIERVGRGYTSYDIVPVEVEELYNALGRPYAE
LKRAGIEIHNYLSGENMSYEMFRKFAKFVGMEEIAENHLTHVLFDEIVEIRYISEGQEVYDVTTETHNFI GGNMPTLLHNT
What types of enzymes do you get as hits? Do you notice anything strange about the search results? Save the PSSM (Position Specific Scoring Matrix, or profile) from your search on the 4th iteration. To do that choose PSSM from pull-down menu under Format options and click "Format!" button. After the search is done, you should get strangely looking alphanumerical symbol mixture in your browser window. This is a PSSM. Save PSSM matrix to the disk as text file, or keep this browser window opened. We are going to use this profile in the next question.
Now we will use the PSSM to BLAST the completed genomes. Go to Microbial Genomes Genomic BLAST page (Let it load completely before choosing any options). Paste intein sequence into query sequence box, change Query and Database entries to "Protein". Choose one of the following genomes as Database:
- Pyrobaculum aerophilum
- Aeropyrum pernix
- Sulfolobus tokodaii
- Archaeoglobus fulgidus
- Methanothermobacter thermautotrophicus
- Thermoplasma volcanium
- Methanococcus jannaschii
- Saccharomyces cerevisiae (This genome is on the eukaryotes Genomic BLAST page - select eukaryoted on top of the genome blast page)
After that click "Adv. BLAST" button. This will redirect you to the BLAST search window. Paste your PSSM from Question #7 into PSSM box (under Options). What are the results of your search? Did you get any significant matches? What are they? If you have significant matches, does the match occur over the full lengths of both query and subject sequences? Use Blink to investigate if the hits are indeed inteins.
What is your conclusion? In your answer indicate
- genomes searched,
- number of significant matches found,
- the E-values of these matches, and
- the identity of these matches
(i.e., are these probable inteins, or are they likely to be something else?).

The following are optional. Do them only if you have time.
You might want to return to this exercise if you have timthis afternoon of next week.

Using Blast on a "local" machine:
Many completed genomes can be found on the NCBI FTP site (ftp://ftp.ncbi.nih.gov). The eukaryotic genomes are accessible from the genomes directory while all prokaryotic (bacteria + archaea) genomes are stored in the the subfolder /bacteria (a clear case of anti archaeal discrimination). Go into the bacteria sub-folder and click on the Pyrococcus abyssi folder (Pyrococcus is a member of the archaeal domain). Among the files in these folders we are are interested in the .ffn file which contains all nucleotide sequences for individual predicted ORF's and the .faa file which contains the predicted ORF's amino acids sequences. Both files are in a fasta format. Copy the .faa file on to your computer and rename the file p_abyssi.faa. Also download the .faa file form Thermotoga maritima and rename it to t_maritima.faa.
Also download any other genomes you might be interested in!
To be able to use the fasta file as a target for a database search, we need to use the formatdb program. You invoke this program by typing formatdb in a dos prompt window (peovided the blast program folder has been added to your path. if this is not the case, follow oral instuctions!).
The following should work to turn p_abyssi into a searchable databank. (You can do this for any collection of sequences in fasta format. Note you do not need to format the file containing your query sequences)
formatdb -i p_abyssi.faa -o T -p T
The -i is to indicate your infile.name, -p T is for a protein sequence, for a nucleotide sequence, you need to use -p F (for False); the -o T instructs the formatdb program to create indices. For more information on the formatdb program check http://bioinformatics.ubc.ca/resources/tools/index.php?name=formatdb .
To do a blast search of every sequence in thermotoga.faa against the P.abyssi genome, we use blastall. To get information on the program type
blastall <enter>
For details check one of the many help pages on blastall, examples are here and here.

Which option turns off the the low complexity filter?
Which option, and which setting, sets the wordsize to 2?
Which option allows to use two processors?
Which options (and why) will you use in your blast search? (the answer depends on what questions you want to answer).
The following options might work:

blastall -i t_maritima.faa -d p_abyssi.faa -o blast.out -p blastp -e 10e-20 -m 8
-i the querry sequence(s)
-d the database
-o the output file
-p the blastprogram to use
-e the e-value cut off
-m the format of the output. option 8 gives a tabular output, that can be easily read into excel spreadsheet or into a database program. option 9 is similar but includes comment lines.
the individual columns in the output denote the following
# Query id # Subject id # % identity # alignment length # number of mismatches # number of gap openings # position of query start # position of query end # position of subject start # position of subject end # e-value of a hit # bit score of a hit
Load the results the results into an excel spreadsheet.
We might have a demonstration in class on how to do this.

If you want only the one top scoring blast hit for each query sequence, you can run the mod.pl program on your sequences. The program is here (It also adds a header to the table and adds another column). Download it into your directory. You should be able to invoke the program from the command line (after you moved to the correct directory using the cd command) by typing
perl extract_lines.pl blast.out (stay tuned there might be issues with the path, and with perl under windows, etc.)
Also, open the program in a word processor and read through it - it might be easier to understand than you think - PERL often comes in handy.

How many matches did you find that are above an E-value of 10e-20?
Which gene was most similar between T. maritima and P. abyssi?