MCB 372: PSI-Blast

Please send your answers by email to gogarten@uconn.edu, or hand in a hard copy.

Please let me know how far you got during the lab. If most students didn't finish, we will continue this next week!

You should answer the questions in red!

In case you cannot access nr or other databanks on bbcxsrv1, see the instructions on the BLASTDB environmental variable near the end of this page.

1) Decide on a sequence for which you want to find distant homologs.
Possible examples are:

Inteins (molecular parasites that cut themselves out of the host protein)
Homing endonucleases (nucleases that have a large recognition site and that help molecular parasites integrate into their "home")
Transposases, integrases, or reverse transcriptases
A glycosidase (like a beta-fructofuranosidase or invertase). This would allow you to find all enzymes encoded in an organism that cleave glycosidic bonds, which might be interesting if you look at an organism that interacts with plants or that eats plant/algal material.

Download an appropriate sequence from ENTREZ and save it as a FASTA file in a directory on the cluster. Make sure that the sequence contains only the parts you are interested in. This is especially important if you have a multidomain protein (host protein + intein) or if the sequence has a signal sequence. Read the GenBank file, and select only the appropriate amino acids. Also download the genome you are interested in: both the annotated encoded amino acid sequences and the raw nucleotide sequences.
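One way to trim a downloaded protein down to the region of interest is sketched below. The file names, the sequence, and the coordinates (residues 11 to 40) are made up for illustration; take the real coordinates from the feature table of your GenBank file.

```shell
# Hypothetical input: a single-sequence FASTA file (sequence invented for this example).
cat > full_protein.fa <<'EOF'
>example_host_plus_intein (made-up sequence for illustration)
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
APILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ
EOF

START=11   # first residue to keep (1-based)
END=40     # last residue to keep (inclusive)

# Join the sequence lines, cut out START..END, and rewrap at 60 columns.
awk -v s="$START" -v e="$END" '
  /^>/ { header = $0; next }
  { seq = seq $0 }
  END {
    print header " residues " s "-" e
    sub_seq = substr(seq, s, e - s + 1)
    for (i = 1; i <= length(sub_seq); i += 60) print substr(sub_seq, i, 60)
  }' full_protein.fa > domain_only.fa

cat domain_only.fa
```

This only works for a file with a single sequence. Check the first and last residues of the result against the GenBank annotation before you use it as a query.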
Turn both of the genome sequence data files into searchable databanks using formatdb.
In case of multiple chromosomes, use cat *.fna > all.fna to create a single multiple-sequence FASTA file (see footnote in http://gogarten.uconn.edu/MCB372/Laboratories/assign2.html)
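Before formatting the combined file, it is worth checking that it contains one FASTA header per replicon. The sketch below uses two tiny made-up files in a scratch directory in place of real downloads. It writes to a temporary name first, because if you rerun cat *.fna > all.fna in the same directory, the glob also matches all.fna itself.

```shell
# Work in a scratch directory so the glob only matches the example files.
workdir=$(mktemp -d)
cd "$workdir"

# Two made-up replicon files standing in for downloaded *.fna files.
printf '>chromosome_example\nATGCATGCATGC\n' > chr.fna
printf '>plasmid_example\nGGCCTTAA\n' > plasmid.fna

# Concatenate under a name the glob does not match, then rename.
cat *.fna > all.tmp
mv all.tmp all.fna

# Count the FASTA headers; this should equal the number of replicons.
n_seqs=$(grep -c '^>' all.fna)
echo "sequences in all.fna: $n_seqs"
```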

Which commands did you use to run formatdb?
Which sequence did you choose?
Which genome did you choose?

2) Use a query and database of your choice to do a normal BLAST search of swissprot and a genome of your choice (download from ftp://ftp.ncbi.nih.gov/)

Log in to your account: ssh yourname@bbcxsrv1.biotech.uconn.edu
qrsh (log into a compute node)
blastall -p blastp -i your.query -d swissprot
blastall -p blastp -i your.query -d your_genome.faa
blastall -p tblastn -i your.query -d your_genome.fna
(you could add the -o, -a 2, or -F flags ...)
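If you want to count hits rather than read through the alignments, blastall can write a tab-separated table with -m 8, where column 2 is the subject ID and column 11 is the E-value. The sketch below counts distinct subjects at or below a cutoff. The table rows are made up so the example runs without BLAST, and the cutoff of 1e-4 is only a placeholder; choosing and justifying the significance level is up to you.

```shell
# On the cluster the table would come from something like:
#   blastall -p blastp -i your.query -d swissprot -m 8 -o swissprot.m8
# Here four made-up rows stand in for real output (12 tab-separated columns).
printf 'query1\tsubjA\t45.2\t210\t110\t3\t1\t208\t5\t214\t2e-40\t160\n'      >  example.m8
printf 'query1\tsubjA\t38.0\t50\t30\t1\t220\t269\t230\t279\t0.002\t40.1\n'   >> example.m8
printf 'query1\tsubjB\t31.5\t190\t120\t6\t4\t190\t12\t200\t3e-07\t52.0\n'    >> example.m8
printf 'query1\tsubjC\t27.0\t80\t55\t2\t10\t88\t40\t119\t1.5\t28.9\n'        >> example.m8

CUTOFF=1e-4   # placeholder significance level

# Keep rows whose E-value (column 11) is at or below the cutoff,
# then count each subject sequence only once.
n_hits=$(awk -F'\t' -v c="$CUTOFF" '($11 + 0) <= (c + 0) { print $2 }' example.m8 \
         | sort -u | wc -l | tr -d ' ')
echo "subjects with E <= $CUTOFF: $n_hits"
```

A subject with several matching segments appears on several rows, which is why the subject IDs are deduplicated before counting.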

Which commands did you use?
How many significant hits (what is your significance level?) did you have in the three databanks?

3) The command-line program for a PSI-BLAST search is blastpgp.
blastpgp -
gives you a listing of all the options.

To calculate a PSSM you need to define the following options:

-i
-j
-C
-h
-e

You also might want to use

-a
-Q

What are these options?

The command to generate a PSSM might look as follows (this example searches swissprot; use -d nr to build the profile from nr instead):
blastpgp -i test1.fa -d swissprot -I T -h 0.0001 -j 4 -C test1.chk -Q test1.matrix -a2
Remember, do this on a compute node! (if you use nr and get a segmentation fault, you might need to ssh node017 first)

To search a protein databank with the profile we could use something like the following:
blastpgp -i test1.fa -d genome.faa -R test1.chk -o genome.test1.br -I T -a 2

To search a nucleotide databank with the profile we could use something like the following:
blastall -i test1.fa -d target_genome_nucl.fna -p psitblastn -R test1.chk
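The three commands above can be collected in one small script, so that the same query and checkpoint file are used in every step. This is only a sketch: the options are copied from the examples above, while DRY_RUN and run are helper names introduced here. With DRY_RUN=1 (the default in this sketch) the commands are only printed, so you can check them before running them on a compute node.

```shell
# File names follow the examples above; adjust them to your own data.
QUERY=test1.fa
CHK=test1.chk
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = execute them

# Helper for this sketch: print the command in dry-run mode, else run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# 1) Build the profile (checkpoint file) by iterating against swissprot.
run blastpgp -i "$QUERY" -d swissprot -I T -h 0.0001 -j 4 -C "$CHK" -Q test1.matrix -a 2

# 2) Search the annotated proteins of the genome with the saved profile.
run blastpgp -i "$QUERY" -d genome.faa -R "$CHK" -o genome.test1.br -I T -a 2

# 3) Search the raw nucleotide sequence with the same profile.
run blastall -i "$QUERY" -d target_genome_nucl.fna -p psitblastn -R "$CHK"
```

To actually run the searches, save this as a script and call it with DRY_RUN=0 set in the environment.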

How many significant hits to your profile did you obtain?
Does this mean anything?

================================================================================

IF YOU HAVE TIME, and want to explore the NCBI's web interface, do the following (seems to work in SAFARI):

Do a PSI-BLAST search at the NCBI using the same or a different query as above (to speed things up, you could use swissprot as database). If you don't have a good sequence, use this intein from Pyrococcus:

>Pab_VMA intein from gi|7436316|pir||D75028
CVDGDTLVLTKEFGLIKIKDLYKILDGKGKKTVNGNEEWTELERPITLYGYKDGKIVEIKATHVYKGFS
AGMIEIRTRTGRKIKVTPIHKLFTGRVTKNGLEIREVMAKDLKKGDRIIVAKKIDGGERVKLNIRVEQKR
GKKIRIPDVLDEKLAEFLGYLIADGTLKPRTVAIYNNDESLLRRANELANELFNIEGKIVKGRTVKALLI
HSKALVEFFSKLGVPRNKKARTWKVPKELLISEPEVVKAFIKAYIMCDGYYDENKGEIEIVTASEEAAYG
FSYLLAKLGIYAIIREKIIGDKVYYRVVISGESNLEKLGIERVGRGYTSYDIVPVEVEELYNALGRPYAE
LKRAGIEIHNYLSGENMSYEMFRKFAKFVGMEEIAENHLTHVLFDEIVEIRYISEGQEVYDVTTETHNFIGG
NMPTLLHNT

On the Format page, set the E-value cut-off for inclusion in the next round to 0.0001 and change the maximum target sequences to 10000. Note: By default PSI-BLAST switches back and forth between the format and the result window. DO NOT CLICK the "Run PSI_Blast iteration X" button repeatedly. Click it once and open the Format window!

Save the PSSM (Position Specific Scoring Matrix, or profile) from your search on the 4th iteration. To do that, choose PSSM from the pull-down menu under Format options (in the SHOW line) and THEN click the "Format!" button. After the search is done, you should get a strange-looking mixture of alphanumerical symbols in your browser window. This is a PSSM. Save the PSSM to disk as a text file, and keep this browser window open. We are going to use this profile in the next search.

Now we will use the PSSM to BLAST the completed genomes. Go to the Microbial Genomes Genomic BLAST page (let it load completely before choosing any options!).
Paste the intein sequence into the query sequence box,
change the Query and Database entries to "Protein", and select the blast program BLASTP.
Choose one or more of the genomes as database. The following work for inteins:

Pyrobaculum aerophilum
Aeropyrum pernix
Sulfolobus tokodaii
Archaeoglobus fulgidus
Methanothermobacter thermautotrophicus
Thermoplasma volcanium
Methanococcus jannaschii
Saccharomyces cerevisiae (this genome is on the Other eukaryotes Genomic BLAST page, but the user interface is the same)

After that, click the "Adv. BLAST" button. This will redirect you to the advanced BLAST search window. Fill out the form. Select PSI-BLAST and use "Upload PSSM" to load your PSSM file, then click BLAST, then FORMAT (keep your fingers crossed). At present I have not found a way to use the PSSMs calculated using blastpgp in the web interface.

The following applies only if blast doesn't recognize the common databanks:

In order to use the databases installed on bbcxsrv1.biotech.uconn.edu, the programs need to know where the databases are installed. One way to do this is to use an environmental variable.

BLASTDB=/common/data/
export BLASTDB

If you need to do this frequently, you should add these two lines to your .profile. This file is normally hidden (its name starts with a period, so you see it only if you type ls -a). You can edit or create it using any text editor; one possibility would be to use vi .profile

type i to enter insert mode
type or paste the following two lines (in vt100 emulations, SHIFT+INSERT usually pastes at the cursor location):
BLASTDB=/common/data/
export BLASTDB
Hit the <ESCAPE> key and type :wq

Every time you log in (including qrsh) the environmental variable is defined and exported.
If this is too complicated, you can use this .profile and sftp it into your home directory.
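To check whether BLASTDB points at a directory that actually holds a given databank, you could use something like the following sketch. It assumes the databanks were built with formatdb, which writes index files ending in .pin (protein) or .nin (nucleotide), with .pal or .nal alias files for databanks split into several volumes. have_db is a helper name made up for this example.

```shell
# Succeed if a formatted databank named $1 exists in directory $2.
# Assumption: formatdb-style index (.pin/.nin) or alias (.pal/.nal) files.
have_db() {
  for ext in pin pal nin nal; do
    [ -e "$2/$1.$ext" ] && return 0
  done
  return 1
}

# Use the existing value if BLASTDB is already set, else the cluster default.
BLASTDB=${BLASTDB:-/common/data/}
export BLASTDB

for db in swissprot nr nt; do
  if have_db "$db" "$BLASTDB"; then
    echo "$db: found in $BLASTDB"
  else
    echo "$db: not found in $BLASTDB"
  fi
done
```

If a databank is reported as not found on the cluster, check the spelling of the directory and run echo $BLASTDB after logging in again (or after qrsh) to see whether your .profile was read.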

The databanks available in /common/data/ include:

alu.a
alu.n
drosoph.aa
drosoph.nt
ecoli.aa
ecoli.nt
env_nr
env_nt
est_human
est_mouse
est_others
gss
human_genomic
mito.aa
mito.nt
nr
nt
other_genomic
pataa
patnt
pdbaa
pdbnt
pfam
sts
swissprot
vector
yeast.aa
yeast.nt

To use any of these databanks, define and export the BLASTDB environmental variable as described above.