MCB 5472 : Blastall and genome plots

Please let me know how far you got during the lab. If most students didn't finish, we will continue this next week!

Questions you should answer are given in red. Please send your answers by email to gogarten@uconn.edu with the subject MCB5472.

1) PSI-BLAST using the NCBI's web interface (this description seems to work in Safari; some browsers have a problem translating wrap-arounds into line breaks):

Do a PSI-BLAST search at the NCBI. You could use the same query as in exercise 2 below or a different one, but inteins work nicely because you can easily assess the success of the search. If you don't have a good sequence, use this intein from Pyrococcus (a large intein containing both an endonuclease domain and the self-splicing domain):

>Pab_VMA intein from gi|7436316|pir||D75028
CVDGDTLVLTKEFGLIKIKDLYKILDGKGKKTVNGNEEWTELERPITLYGYKDGKIVEIKATHVYKGFS
AGMIEIRTRTGRKIKVTPIHKLFTGRVTKNGLEIREVMAKDLKKGDRIIVAKKIDGGERVKLNIRVEQKR
GKKIRIPDVLDEKLAEFLGYLIADGTLKPRTVAIYNNDESLLRRANELANELFNIEGKIVKGRTVKALLI
HSKALVEFFSKLGVPRNKKARTWKVPKELLISEPEVVKAFIKAYIMCDGYYDENKGEIEIVTASEEAAYG
FSYLLAKLGIYAIIREKIIGDKVYYRVVISGESNLEKLGIERVGRGYTSYDIVPVEVEELYNALGRPYAE
LKRAGIEIHNYLSGENMSYEMFRKFAKFVGMEEIAENHLTHVLFDEIVEIRYISEGQEVYDVTTETHNFIGG
NMPTLLHNT

On the Format page, set the E-value cut-off for inclusion in the next round to 0.0001 (this is the PSI Blast threshold at the bottom of the form NOT the Expect Threshold) and change the maximum target sequences to 20000.

While the search is ongoing, meditate on the question "Why does the PSI-BLAST form have two different E-value thresholds?".

Note : You can instruct PSI-Blast to switch back and forth between the format and the result window (with a little checkmark). If you do, DO NOT CLICK the "Run PSI_Blast iteration X" button repeatedly. Click it once and open the Format window!

On the result page, scroll down to a hit with a high E-value (a borderline significant hit) and check that the target sequence indeed encodes an intein. (CTRL-click on the link to the sequence to open it in a new window; in the new window select BLINK ...)

After running at least two iterations (three BLAST searches; you should be asked to run iteration 4), save the PSSM (Position Specific Scoring Matrix, or profile) from your search. To do that, choose the download link at the top of the output window. Save the text into a file on your desktop.

The plan was to use this PSSM to do other searches - at present this does not work using the web interface. If you want to check one particular organism, you can select to display the results for only one particular organism:

Now we will use the PSSM to BLAST the completed genomes. Go to the Microbial Genomes Genomic BLAST page (let it load completely before choosing any options!).
Paste the intein sequence into the query sequence box.
Change the Query and Database entries to "Protein" and select the blast program BLASTP.
Choose one or more of the genomes as database. The following work well for inteins:

Halobacteriales
Methanocaldococcus jannaschii DSM 2661
Pyrobaculum aerophilum
Aeropyrum pernix
Sulfolobus tokodaii
Archaeoglobus fulgidus
Methanothermobacter thermautotrophicus
Thermoplasma volcanium
Methanococcus jannaschii
Saccharomyces cerevisiae (this genome is on the Other eukaryotes Genomic BLAST page, but the user interface is the same)

After that, click the "Adv. BLAST" button. This will redirect you to the advanced BLAST search window. Fill out the form: select PSI-BLAST in the format option menu, use "Upload PSSM" to load your PSSM file, and click BLAST. At present I have not found a way to use a PSSM to search nucleotide sequences using the web interface. Be careful: make sure the program actually reads the PSSM, not only the query sequence.

2) BLAST via the command line

Decide on a sequence for which you want to find distant homologs.
Possible examples are:

Inteins (molecular parasites that cut themselves out of the host protein)
Homing endonucleases (nucleases that have a large recognition site and that help a molecular parasite to integrate into its "home")
Transposases, integrases, or reverse transcriptases
A glycosidase (like a beta-fructofuranosidase or invertase). This would allow you to find all enzymes encoded in an organism that cleave glycosidic bonds, which might be interesting if you look at an organism that interacts with plants, or that eats plant or algal material.

Download an appropriate sequence from ENTREZ and save it as a FASTA file in a directory on the cluster. Make sure that the sequence only contains the parts you are interested in. This is especially important if you have a multidomain protein (host protein + intein), or if the sequence has a signal sequence. Read the GenBank file, and only select the appropriate amino acids. Also download the genome(s) you are interested in. Download both the annotated encoded amino acid sequences and the raw nucleotide sequences.
Turn both of the genome sequence data files into searchable databanks using formatdb.
In case of multiple chromosomes use cat *.fna > all.fna to create a single multiple sequence fasta file (see footnote in http://gogarten.uconn.edu/MCB372/Laboratories/assign2.html)
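The two steps above can be sketched as a pair of small shell functions. This is only a sketch under stated assumptions: the function names merge_fna and make_databanks are made up for illustration, and the file names all.fna and your_genome.faa are placeholders. The formatdb flags used are -i (input file) and -p (T for protein, F for nucleotide); run formatdb - on the cluster to see the full option list for the installed version.

```shell
# Merge every *.fna file in the current directory into one multi-sequence FASTA file.
# The old output file is removed first so that it is not fed back into cat.
merge_fna() {
    out=$1
    rm -f "$out"
    cat ./*.fna > "$out"
}

# Build BLAST databanks with the legacy formatdb program, if it is on the PATH.
# -p F marks nucleotide input, -p T marks protein input.
make_databanks() {
    nucl=$1
    prot=$2
    if command -v formatdb >/dev/null 2>&1; then
        formatdb -i "$nucl" -p F
        formatdb -i "$prot" -p T
    else
        echo "formatdb not found; run this on the cluster" >&2
        return 1
    fi
}

# Example calls (file names are placeholders):
#   merge_fna all.fna
#   make_databanks all.fna your_genome.faa
```

Removing the old all.fna before concatenating matters when you rerun the command, because otherwise the *.fna pattern would also match the previous output file.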

Which commands did you use to run formatdb?
Which sequence did you choose?
Which genome did you choose?

3) Use a query and database of your choice to do a normal blast search of swissprot and a genome of your choice (download from ftp://ftp.ncbi.nih.gov/)

ssh to your account on: ssh yourname@bbcxsrv1.biotech.uconn.edu
qrsh (log into a compute node)
blastall -p blastp -i your.query -d swissprot
blastall -p blastp -i your.query -d your_genome.faa
blastall -p tblastn -i your.query -d your_genome.fna
(you could add the -o, -a 2, or -F flags ...)
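One way to count hits is to ask blastall for tabular output with -m 8 and then filter on the E-value column. The sketch below assumes the standard 12-column tabular layout, where column 2 is the subject identifier and column 11 is the E-value; the function name count_significant_hits and the cutoff of 1e-4 are illustrative choices, not a recommended significance level.

```shell
# Count distinct subject sequences with an E-value at or below a cutoff
# in tabular BLAST output (blastall -m 8).
# Column 2 = subject id, column 11 = E-value.
count_significant_hits() {
    table=$1
    cutoff=$2
    awk -v cutoff="$cutoff" '
        /^#/ { next }                               # skip comment lines (written by -m 9)
        ($11 + 0) <= (cutoff + 0) { seen[$2] = 1 }  # remember each subject only once
        END { n = 0; for (s in seen) n++; print n }
    ' "$table"
}

# Example (file names are placeholders):
#   blastall -p blastp -i your.query -d your_genome.faa -m 8 -a 2 -o your_genome.blastp.tab
#   count_significant_hits your_genome.blastp.tab 1e-4
```

Counting distinct subjects rather than lines avoids counting a target twice when BLAST reports several alignments to the same sequence.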

Which commands did you use?
How many significant hits (what is your significance level?) did you have in the three databanks?

4) the command line for a PSI BLAST search is blastpgp
blastpgp -
gives you a listing of all the options.

To calculate a PSSM you might need to define the following options:

-i
-j
-C
-h
-e

You also might want to use

-a
-Q

What are these options?

Make a note of the names you used for your query, the checkpoint file, and the databank (nr is the non-redundant databank; it should be available on the cluster).

The command to generate a PSSM might look as follows (this example searches swissprot; use -d nr to search the non-redundant databank instead):
blastpgp -i test1.fa -d swissprot -I T -h 0.0001 -j 4 -C test1.chk -Q test1.matrix -a2
Remember, do this on a compute node! (if you use nr and get a segmentation fault, you might need to ssh node017)

Once you have started the search, this would be a good time to take a break.

To search a protein databank with the profile you could use something like the following:
blastpgp -i test1.fa -d genome.faa -R test1.chk -o genome.test1.br -I T -a 2
(where test1.chk is the file created with the -C flag in blastpgp)
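Because a profile search is only meaningful if the checkpoint file was actually written, it can help to test for it before starting. The following is a minimal sketch; the function name profile_search is made up, and the file names in the example call are the ones used above.

```shell
# Search a protein databank with a PSI-BLAST checkpoint file,
# refusing to run if the checkpoint is missing or empty.
profile_search() {
    query=$1
    db=$2
    chk=$3
    out=$4
    if [ ! -s "$chk" ]; then
        echo "checkpoint file '$chk' is missing or empty" >&2
        return 1
    fi
    if ! command -v blastpgp >/dev/null 2>&1; then
        echo "blastpgp not found; log into a compute node on the cluster" >&2
        return 2
    fi
    # -R restarts the search from the checkpoint written earlier with -C
    blastpgp -i "$query" -d "$db" -R "$chk" -o "$out" -I T -a 2
}

# Example call:
#   profile_search test1.fa genome.faa test1.chk genome.test1.br
```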

To search a nucleotide databank with the profile you could use something like the following:
blastall -i test1.fa -d target_genome_nucl.fna -p psitblastn -R test1.chk -F F
(the -F F filter flag is necessary because blastpgp by default has the filter off, whereas it is on by default in blastall)

How many significant hits to your profile did you obtain?
Does this mean anything?

5) If you have time, run a PSI-BLAST search of nr with one of the following sequences.
Do you obtain any matches that might reveal the function of the query sequence?

>gi|111225682|ref|YP_716476.1| hypothetical protein FRAAL6341 [Frankia alni ACN14a]
MNPATLAASRRFPLVGRPRPACPALPDRVNEIADIAQGAVQEGADGLAEGAHALNKAALVASDCGMPTLA
RDLCWQHINIYRSADRPLTVLQTRYMLEPVLNLARLHFRAGADDRALRLLTSMYRAVTSHTDLVVDGHAL
PSTGLTGTRHDHHKLREWVWLHLVGDGIRALALAGRWDDAVAHARAHRGIGLHLMEGRQAKILAHCLNGR
PAAAITALAGSTPEQPWELQVASCLNVMCTDGTPASRNIDEMIEHFVGQEPMPGYVVFRAQLGLTVATLA
RTTDRAAANGVLAQVADEVIKAGDGYAARDVLRHPDTRAADLTSEQGSALVDVLTSSGLEAGRLPEPLLR
SLLSSARTAAEALNSSIPLRDHLVAVEASPRPGVHLNARNCAPGLPAEQPGVEGGETDPVRPRGSRGHRL
ADLP

>gi|158311919|ref|YP_001504427.1| conserved hypothetical protein [Frankia sp. EAN1pec]
MNPVALALACFFPLIGRPRLACQPLPDRVAEIAEIAQAAARDGADGLAEGAHALNKAALLASDCGLAPLA
RDLCWQHINIYCAVPRPLTVHEARYMLEPALNLARLQIRASDGEQALGLLTAMFQAVSSNTDLVVDGRVL
PLTDLIGTRDERHKLREWVWLHLVGDGVRALALAGRWDDAVIHADTYRGIGLHLLEGRQAKILAHCLTGT
SAEARAALAESTPMYPWELQVASCLEVMCTEDTSTAHGVTTMIGQFLGQRPMPGYAVFRAHLGMTVAALA
ATTDPDAATRVLTQTVEEVIEAEDGYAARDVLRLRPTQAVDLPARHEKALADLLNASGLRAETPPEPVLE
SVLGSARTAEAAIVAATHPQRR

>gi|158313186|ref|YP_001505694.1| conserved hypothetical protein [Frankia sp. EAN1pec]
MAATTSTAATPSTFDIVAARFPLVPRSRPSCPPLDARIAHVAALAGQAAGGGGDALLRAAEAHNLAALIA
SDCGLPDLARSLCWRQIDTLPLRRPLDGATAKLALQPFINLARLRLRAGDGLAAYQMLTTLYDVVVARTS
TAIDERALVFDDLVTDVDHPQTVRWLWTVLLADGTRALTRTGHWTEALDHLNRHKGIGQRLLDGRQTAIL
AHHAHRDHYAAEHLLTTTATTQPWEQSVATCLGLLHRHLTGLKTPDDGRSTIDALLPSNNPEHLTFNIQL
GLCLLDLADTPQHLRPVLDTIIDGALHSDDAYAARDLLTHPAARGYLNRDQLTLLNERQRHSGLGSGRIP
EALRTRLLGALALAS

>gi|158314618|ref|YP_001507126.1| hypothetical protein Franean1_2797 [Frankia sp. EAN1pec]
MQLTRLVLTGTCLLGLLAAGAPAFADSYATTDCAQNPIPGCDLAAGGHGLAPAPPGGQAPPPQGGGSGGS
GRGGSARPPGDVALDPADVARCSYVRSDFQPTTEAIQPVRFRPAPDGGLRVVTAVDRPGPFLARPVATGP
DGRPGAWYVYQCQTDGVRDALYRPPVWIPDGQGGPVAGGPDVGGLAEQARSQLRLQGPAIALSPIGRQLI
RLPTWMWLDPAGWRPVSATAAAGGVSVTATATPTGVDWVMGDGAQVHCTGPGTPYPDGGDPKAPSPDCGH
TYQTVSEDQPGGVFTLTATVTWNVTWAGGGQTGVFDGLTTVSTVQVAVISIPALITGGG

>gi|158314619|ref|YP_001507127.1| hypothetical protein Franean1_2798 [Frankia sp. EAN1pec]
MAPRSDSRGRRWPVRGVAVGLLAAGVLVVSCSSNDSAEPAPGPAPSTSRSPAPRMTVSPTPTSPADAAGQ
RAVAAYVGLWEAMAEASHTSDWQSPELARYASGDALQAVSGGLYADHYNGLVSRGAPVLHPEITSVEPAD
APTTVMVFDCSDSTNWLRHRADGAPFTDEPGGRRAVTSEVRLHQDGSWKVTRFAVEPVGSCS

>gi|158318490|ref|YP_001510998.1| hypothetical protein Franean1_6758 [Frankia sp. EAN1pec]
MHSPTEAPGRFAGRTPADGRQPGPDPSSSTAVRRSGVGSAVVVGMGREPVTPAGDDAPSPPLPIRVPRTT
WPAGRRRLSSTGRGYYVRPLSGPAGAAEQKWTVVVLDADGDPRETVTGFSSLEAADAYAELDPAIIRWIS
VPTRPAIPERIPGL

>gi|86739833|ref|YP_480233.1| hypothetical protein Francci3_1124 [Frankia sp. CcI3]
MWPPPNDADTVTRMRAAHERASAALRVTVDRTQAEEAWGFAGRTLGRPVMTPDSPGWLRIAATEAGQQIT
TFWDGGRTAQQALPASIPRPALRAIHDHHHDGWDYRAELYDRVHVRPLAVTTVPRRLTDSPKGQWFQALR
RSFRILATVSTDRRTIEQSYLDDVMPRILGEPITTTSPAPWVTAHGDLHWANLCGPTLCMLDWEGWGLAP
AGYDAATLYCHSLFMPTLAAQVQARFADALSTESGRYAELAVIAELLDTVHSGTGLDIDHAGLLRIRATH
LLRRGIPRQQEGAATRPGQT

>gi|86740523|ref|YP_480923.1| hypothetical protein Francci3_1818 [Frankia sp. CcI3]
MRQRCGLSLAFRHPIWPGLVPTASDLASCRLAREPNGVGVGLSARKGPDRMTAGGLAHDTAHGPASGQSL
PVGQIDCRLAIVVGATPWRNTAVTPTLVSSAGARERGAVEEVFGSVRLAGRIVHVDDGRLRRGGPHEWRT
QREPADGPRVRAESAPSRRRPSGEWRRHPTPTMSAAPVTRVRQFTRRPATDRSSAGPCIPKVTERPCRLC
LRAALRCASRTEHAQPAGRVLPEARWGCPSQTNSAAKGECPGHPGHGPLSNPAVRIR

>gi|86740529|ref|YP_480929.1| hypothetical protein Francci3_1824 [Frankia sp. CcI3]
MANWFPLVPRPRPPTIDRRQRLTEIEHLAHSAPGNARRTTNAAEALNKAALLASDCGLPDLAKDLCWRQF
HVFDNAGPQPPALATAVLQPLINLGRLELRADDPDRAYTFFNQIHHAVRTSTAVVLDGVTVNPARLFADD
NTLLRARRFTWTVLLGDGTRALARAGRWDDAVAHLHRHHGIGRRLLDGRQTLILASHLAGESEAARTALA
TSVTPTPWERAVAAVLGILCRHGSEPQAIGAALDDLRASSRDPTHPVFQNRLGLTLLDLATDATDQKWIA
SAILAGLRDGHDGRCAADALAHPRLQQHLSPAHRVDLVERVKTAGIQRPPPTTFRDAMTTATLAAERGLR
DALAGPLEGGRLITRRPDQSAARGEAG

>gi|86740733|ref|YP_481133.1| hypothetical protein Francci3_2030 [Frankia sp. CcI3]
MTNVGTDPADENLLAWFPLVQRPRPPGLPLEDRVRQLHDLAARTSDGLPLLRAAEVCNKAALIASDCGQP
DLAQDLCWRQHTLFDQARPLPASAAELALQPVLNLPRQLIRDGDGNRAHAILQALHEAARTQTSALIDGR
SVSLHNVTCASDDHRTMRTLTWTALLADGVRALARAGRWHEAAEQAAAHRGVGRRLLDGRQATVLALAQA
GHTEQAAALVDQSATPEPWEQAIQTILRVHCLRQAGADTGPQIAPLLATALTLMRQPDLSTMVFRARAGM
IALDLADGHDHPRIDVLRRALIAGTFKDAYTARDTLAHRLRESMTTTQRQTLADVFRAAGLDAGSIPESL
YGDLMETVKFAEDQLRGCLGRHARHCECTTATS

>gi|86742003|ref|YP_482403.1| hypothetical protein Francci3_3317 [Frankia sp. CcI3]
MAVSGSLARPVKNAGHDHDRPGQQLPVSMVIDRLSGHILDAVVVIPSLSVDEVKESLKRLRQGHGLARPT
ALPTVLPELLRARLADGRVSEPASAQEVTLLVTAFRQAVEGLADDERLCVEVDFNLSAEHRYPTLTERQE
SLARQQRCAAKTVRRRADRALDTLAYMLLTNGPSTVTSTVRSPATISSPEQDRSPVDHGEPWGEDLRAFW
RLSHGARIDIVCSEIPEDERPEYASPADRNYLRYAKFADLDTLIYLRTRFARLAPTVTIRDFAPSEYFDT
QADVLVVVGGPPWNAKYREFLPRLPFFFEPHPLGADDPLVVPGMNGLVLGPRWTERNELLEDLAVFTRLT
LAQGTTVFLLGGCLTLGVLGAARCLLEAERGARSSRYITEHVNDADFVLVTEARRIGGLTDVADLTRVPP
LLLLARSNNEPFAVHVDNSDRYLQDQSGTEVHHIDHSC

================================================================================

The following applies only if blast doesn't recognize the common databanks:

In order to use the databases installed on bbcxsrv1.biotech.uconn.edu the programs need to know where the databases are installed. One way to do this is to use an environmental variable.

BLASTDB=/common/data/
export BLASTDB

If you do this frequently, you should add these two lines to your .profile. This file is normally hidden (its name starts with a period, so you only see it if you type ls -a). You can edit or create it using any text editor; one possibility would be to use vi .profile

type i to enter insert mode
type or paste the following two lines (in vt100 emulations it is usually the SHIFT-INSERT key that pastes at the cursor location):
BLASTDB=/common/data/
export BLASTDB
Hit the <ESCAPE> key and type :wq

Every time you log in (including qrsh) the environmental variable is defined and exported.
If this is too complicated, you can use this .profile and sftp it into your home directory.
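As an alternative to editing the file by hand, the two lines can be appended from the shell. This is a sketch only; the function name add_blastdb_to_profile is made up, and it assumes that your login shell reads .profile, as described above.

```shell
# Append the BLASTDB definition to a profile file unless it is already there,
# so that running the function twice does not duplicate the lines.
add_blastdb_to_profile() {
    profile=$1
    if grep -q '^BLASTDB=' "$profile" 2>/dev/null; then
        echo "BLASTDB already defined in $profile"
    else
        printf 'BLASTDB=/common/data/\nexport BLASTDB\n' >> "$profile"
    fi
}

# Example call:
#   add_blastdb_to_profile "$HOME/.profile"
```

After logging in again, echo $BLASTDB should print /common/data/.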

The databanks available in /common/data/ include:

alu.a
alu.n
drosoph.aa
drosoph.nt
ecoli.aa
ecoli.nt
env_nr
env_nt
est_human
est_mouse
est_others
gss
human_genomic
mito.aa
mito.nt
nr
nt
other_genomic
pataa
patnt
pdbaa
pdbnt
pfam
sts
swissprot
vector
yeast.aa
yeast.nt

define and export the BLASTDB environmental variable.