Please send your answers by email to gogarten@uconn.edu, or hand in a hardcopy. Please let me know how far you got during the lab. If most students didn't finish, we will continue this next week!
You should answer the questions in red! If you cannot access nr or other databanks on bbcxsrv1, see the note on the BLASTDB variable at the end of this page.
1) Decide on a sequence for which you want to find distant homologs. Possible examples are:
Download an appropriate sequence from ENTREZ and save it as a FASTA file in a directory on the cluster. Make sure that the sequence contains only the parts you are interested in. This is especially important if you have a multidomain protein (host protein + intein), or if the sequence has a signal sequence: read the GenBank file, and select only the appropriate amino acids.
Also download the genome you are interested in. Download both the annotated encoded amino acid sequences and the raw nucleotide sequences, and turn both genome data files into searchable databanks using formatdb. In case of multiple chromosomes, use cat *.fna > all.fna to create a single multi-sequence FASTA file (see the footnote in http://gogarten.uconn.edu/MCB372/Laboratories/assign2.html).
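The two preparation steps above (cutting the query down to the region of interest, and checking the combined genome file) can be sketched with two small shell helpers. This is only a sketch: trim_fasta and count_records are hypothetical helper names, not part of BLAST, and trim_fasta assumes the file holds a single sequence.

```shell
# trim_fasta FILE START END
# Print residues START..END (1-based, inclusive) of a single-record FASTA file,
# re-wrapped at 60 residues per line. The coordinates are appended to the header.
trim_fasta() {
  awk -v s="$2" -v e="$3" '
    /^>/ { if (header == "") header = $0; next }   # remember the first header line
    { gsub(/[ \t\r]/, ""); seq = seq $0 }          # join sequence lines, drop whitespace
    END {
      part = substr(seq, s, e - s + 1)
      print header " [residues " s "-" e "]"
      for (i = 1; i <= length(part); i += 60) print substr(part, i, 60)
    }' "$1"
}

# count_records FILE
# Print the number of FASTA records (header lines) in FILE.
count_records() {
  grep -c '^>' "$1"
}
```

For example, trim_fasta full_protein.fa 236 689 > test1.fa would keep residues 236 to 689 (the file name and coordinates are made up; take the real ones from the GenBank feature table). After cat *.fna > all.fna, count_records all.fna should equal the number of chromosomes and plasmids you downloaded. If all.fna already exists when you rerun the cat command, the *.fna pattern also matches the output file and cat may complain that the input file is the output file, so delete the old all.fna first.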
Which commands did you use to run formatdb? Which sequence did you choose? Which genome did you choose?
2) Use a query of your choice to do a normal BLAST search of swissprot and of a genome of your choice (download from ftp://ftp.ncbi.nih.gov/).
Which commands did you use? How many significant hits (what is your significance level?) did you have in the three databanks?
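Counting hits by eye gets tedious for large result files. One possible way to count them, assuming you asked for tabular output (the -m 8 option of blastall, where column 2 is the subject ID and column 11 the E-value), is the helper below. The name count_significant_hits and the default cut-off of 0.001 are assumptions for illustration; pick your own significance level.

```shell
# count_significant_hits TABULAR_FILE [EVALUE_CUTOFF]
# Count distinct subject sequences with at least one HSP at or below the cut-off.
# Assumes 12-column tab-separated BLAST output (-m 8).
count_significant_hits() {
  awk -F '\t' -v cutoff="${2:-0.001}" '
    /^#/ || NF < 12 { next }               # skip comment lines and malformed lines
    {
      e = $11
      if (e ~ /^e/) e = "1" e              # defensive: treat "e-100" as "1e-100"
      if (e + 0 <= cutoff + 0 && !seen[$2]++) n++   # count each subject only once
    }
    END { print n + 0 }' "$1"
}
```

A subject with several HSPs is counted once, so the number is hits per databank entry, not per alignment.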
3) The command-line program for a PSI-BLAST search is blastpgp. Typing blastpgp - gives you a listing of all the options.
To calculate a PSSM you need to define the following options:
-i, -j, -C, -h, -e
You also might want to use
-a, -Q
What are these options?
The command to generate a PSSM using swissprot might look as follows: blastpgp -i test1.fa -d swissprot -I T -h 0.0001 -j 4 -C test1.chk -Q test1.matrix -a 2 Remember, do this on a compute node! (If you use nr instead and get a segmentation fault, you might need to ssh node017 first.)
To search a protein databank with the profile we could use something like the following: blastpgp -i test1.fa -d genome.faa -R test1.chk -o genome.test1.br -I T -a 2
To search a nucleotide databank with the profile we could use something like the following: blastall -i test1.fa -d target_genome_nucl.fna -p psitblastn -R test1.chk
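The three commands above depend on each other (the two searches need the checkpoint file written by the first run), so it can help to chain them in one script. The following is a sketch only: psiblast_pipeline, run and the DRY_RUN switch are hypothetical names, the output file names are derived from the inputs, and the -o option on the last command is an addition that is not in the command above. As before, run it on a compute node.

```shell
# run CMD ARGS...
# Print the command; execute it unless DRY_RUN=1 (hypothetical switch).
run() {
  echo "+ $*"
  [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

# psiblast_pipeline QUERY PROTEIN_DB NUCLEOTIDE_DB
psiblast_pipeline() {
  query=$1; protdb=$2; nucdb=$3
  base=${query%.*}                      # test1.fa -> test1

  # 1. Build the profile: 4 iterations against swissprot, inclusion threshold 0.0001.
  run blastpgp -i "$query" -d swissprot -I T -h 0.0001 -j 4 \
      -C "$base.chk" -Q "$base.matrix" -a 2 || return 1

  # Stop if no (or an empty) checkpoint file was written.
  if [ "${DRY_RUN:-0}" != "1" ] && [ ! -s "$base.chk" ]; then
    echo "no checkpoint file $base.chk; nothing to search with" >&2
    return 1
  fi

  # 2. Search the protein databank with the profile.
  run blastpgp -i "$query" -d "$protdb" -R "$base.chk" \
      -o "${protdb%.*}.$base.br" -I T -a 2 || return 1

  # 3. Search the nucleotide databank with the profile.
  run blastall -i "$query" -d "$nucdb" -p psitblastn -R "$base.chk" \
      -o "${nucdb%.*}.$base.br"
}
```

With DRY_RUN=1 set, psiblast_pipeline test1.fa genome.faa target_genome_nucl.fna only prints the three commands, which is a cheap way to check the file names before submitting the real run.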
How many significant hits to your profile did you obtain? Does this mean anything?
================================================================================
IF YOU HAVE TIME, and want to explore the NCBI's web interface, do the following (seems to work in SAFARI):
Do a PSI-BLAST search at the NCBI using the same or a different query as above (to speed things up, you could use swissprot as database). If you don't have a good sequence, use this intein from Pyrococcus:
>Pab_VMA intein from gi|7436316|pir||D75028
CVDGDTLVLTKEFGLIKIKDLYKILDGKGKKTVNGNEEWTELERPITLYGYKDGKIVEIKATHVYKGFS
AGMIEIRTRTGRKIKVTPIHKLFTGRVTKNGLEIREVMAKDLKKGDRIIVAKKIDGGERVKLNIRVEQKR
GKKIRIPDVLDEKLAEFLGYLIADGTLKPRTVAIYNNDESLLRRANELANELFNIEGKIVKGRTVKALLI
HSKALVEFFSKLGVPRNKKARTWKVPKELLISEPEVVKAFIKAYIMCDGYYDENKGEIEIVTASEEAAYG
FSYLLAKLGIYAIIREKIIGDKVYYRVVISGESNLEKLGIERVGRGYTSYDIVPVEVEELYNALGRPYAE
LKRAGIEIHNYLSGENMSYEMFRKFAKFVGMEEIAENHLTHVLFDEIVEIRYISEGQEVYDVTTETHNFIGG
NMPTLLHNT
On the Format page, set the E-value cut-off for inclusion in the next round to 0.0001 and change the maximum target sequences to 10000. Note: By default PSI-BLAST switches back and forth between the format and the result window. DO NOT CLICK the "Run PSI_Blast iteration X" button repeatedly. Click it once and open the Format window!
Save the PSSM (Position Specific Scoring Matrix, or profile) from your search on the 4th iteration. To do that, choose PSSM from the pull-down menu under Format options (in the SHOW line) and THEN click the "Format!" button. After the search is done, you should get a strange-looking mixture of alphanumeric symbols in your browser window. This is a PSSM. Save the PSSM to disk as a text file, and keep this browser window open. We are going to use this profile in the next search.
Now we will use the PSSM to BLAST the completed genomes. Go to the Microbial Genomes Genomic BLAST page (let it load completely before choosing any options!). Paste the intein sequence into the query sequence box, change the Query and Database entries to "Protein", and select the BLAST program BLASTP. Choose one or more of the genomes as the database. The following work for inteins:
After that, click the "Adv. BLAST" button. This will redirect you to the advanced BLAST search window. Fill out the form: select PSI-BLAST, use "Upload PSSM" to load your PSSM file, click BLAST, then FORMAT (keep your fingers crossed). At present I have not found a way to use the PSSMs calculated using blastpgp in the web interface.
The following applies only if blast doesn't recognize the common databanks:
In order to use the databases installed on bbcxsrv1.biotech.uconn.edu, the programs need to know where the databases are installed. One way to do this is to use an environment variable.
BLASTDB=/common/data/
export BLASTDB
If you do this frequently, you should add these two lines to your .profile. This file is normally hidden (its name starts with a period, so you see it only if you type ls -a). You can edit or create it using any text editor; one possibility is vi .profile
Every time you log in (including qrsh), the environment variable is then defined and exported. If this is too complicated, you can use this .profile and sftp it into your home directory.
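If you would rather not open an editor, the two lines can also be appended from the command line. The helper below is a sketch under the assumption that your login shell reads ~/.profile; add_blastdb_to_profile is a hypothetical name, and it leaves the file alone if a BLASTDB= line is already present.

```shell
# add_blastdb_to_profile [PROFILE_FILE] [DATABASE_DIR]
# Append the BLASTDB definition to the profile unless one is already there.
add_blastdb_to_profile() {
  profile=${1:-$HOME/.profile}
  dbdir=${2:-/common/data/}
  if [ -f "$profile" ] && grep -q '^BLASTDB=' "$profile"; then
    echo "BLASTDB already set in $profile"
    return 0
  fi
  # The same two lines as above: define the variable, then export it.
  printf 'BLASTDB=%s\nexport BLASTDB\n' "$dbdir" >> "$profile"
}
```

Calling add_blastdb_to_profile without arguments uses ~/.profile and /common/data/. The change takes effect at the next login; type echo $BLASTDB afterwards to check.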
The databanks available in /common/data/ include:
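The set of installed databanks changes over time, so it may be easiest to look for yourself. formatdb writes index files ending in .pin (protein) or .nin (nucleotide), and large databanks are split into numbered volumes tied together by a .pal or .nal alias file. The helper below lists databank names from those files; list_blast_dbs is a hypothetical name, and this is a sketch, not an official BLAST tool.

```shell
# list_blast_dbs [DIRECTORY]
# Print "name<TAB>protein|nucleotide" for each formatted databank in DIRECTORY
# (default: $BLASTDB, or the current directory if BLASTDB is not set).
list_blast_dbs() {
  dir=${1:-${BLASTDB:-.}}
  for f in "$dir"/*.pin "$dir"/*.nin "$dir"/*.pal "$dir"/*.nal; do
    [ -e "$f" ] || continue              # pattern matched nothing
    name=${f##*/}                        # strip the directory
    case $name in
      *.pin|*.pal) kind=protein ;;
      *)           kind=nucleotide ;;
    esac
    name=${name%.*}                      # strip the extension
    name=$(printf '%s\n' "$name" | sed 's/\.[0-9][0-9]$//')   # nr.00 -> nr
    printf '%s\t%s\n' "$name" "$kind"
  done | sort -u
}
```

The names in the first column are what you pass to the -d option.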