Please let me know how far you got during the lab. If most students didn't finish, we will continue this next week!
Questions you should answer are given in red. Please send your answers by email to gogarten@uconn.edu with the subject MCB5472.
1) PSI-BLAST using the NCBI's web interface (this description seems to work in Safari; some browsers have a problem translating wrap-arounds into line breaks):
Do a PSI-BLAST search at the NCBI. You could use the same query as below (in exercise 2) or a different one, but inteins work nicely because you can easily assess the success of the search. If you don't have a good sequence, use this intein from Pyrococcus (this is a large intein containing both an endonuclease domain and the self-splicing domain):
>Pab_VMA intein from gi|7436316|pir||D75028
CVDGDTLVLTKEFGLIKIKDLYKILDGKGKKTVNGNEEWTELERPITLYGYKDGKIVEIKATHVYKGFS
AGMIEIRTRTGRKIKVTPIHKLFTGRVTKNGLEIREVMAKDLKKGDRIIVAKKIDGGERVKLNIRVEQKR
GKKIRIPDVLDEKLAEFLGYLIADGTLKPRTVAIYNNDESLLRRANELANELFNIEGKIVKGRTVKALLI
HSKALVEFFSKLGVPRNKKARTWKVPKELLISEPEVVKAFIKAYIMCDGYYDENKGEIEIVTASEEAAYG
FSYLLAKLGIYAIIREKIIGDKVYYRVVISGESNLEKLGIERVGRGYTSYDIVPVEVEELYNALGRPYAE
LKRAGIEIHNYLSGENMSYEMFRKFAKFVGMEEIAENHLTHVLFDEIVEIRYISEGQEVYDVTTETHNFIGG
NMPTLLHNT
On the Format page, set the E-value cut-off for inclusion in the next round to 0.0001 (this is the PSI-BLAST threshold at the bottom of the form, NOT the Expect threshold) and change the maximum number of target sequences to 20000.
While the search is ongoing, meditate on the question "Why does the PSI-BLAST form have two different E-value thresholds?".
Note: You can instruct PSI-BLAST to switch back and forth between the format and the result window (with a little checkmark). If you do, DO NOT CLICK the "Run PSI_Blast iteration X" button repeatedly. Click it once and open the Format window!
On the result page, scroll down to a hit with a high E-value (a borderline significant hit) and check that the target sequence indeed encodes an intein. (CTRL-click on the link to the sequence to open it in a new window; in the new window select BLINK ...)
After running at least two iterations (three BLAST searches; you should be asked to run iteration 4), save the PSSM (Position Specific Scoring Matrix, or profile) from your search. To do that, choose the download link at the top of the output window. Save the text into a file on your desktop.
The plan was to use this PSSM to do other searches - at present this does not work using the web interface. If you want to check one particular organism, you can select to display the results for only one particular organism:
Now we will use the PSSM to BLAST the completed genomes. Go to the Microbial Genomes Genomic BLAST page (let it load completely before choosing any options!). Paste the intein sequence into the query sequence box, change the Query and Database entries to "Protein", and select the BLAST program BLASTP. Choose one or more of the genomes as the database. The following work well for inteins:
After that, click the "Adv. BLAST" button. This will redirect you to the advanced BLAST search window. Fill out the form: select PSI-BLAST; in the format option menu, use "Upload PSSM" to load your PSSM file; then click BLAST. At present I have not found a way to use a PSSM to search nucleotide sequences using the web interface. Be careful: make sure the program actually reads the PSSM, not only the query sequence.
2) BLAST via the command line
Decide on a sequence for which you want to find distant homologs. Possible examples are:
Download an appropriate sequence from ENTREZ and save it as a FASTA file in a directory on the cluster. Make sure that the sequence contains only the parts you are interested in. This is especially important if you have a multidomain protein (host protein + intein) or if the sequence has a signal sequence: read the GenBank file and select only the appropriate amino acids. Also download the genome(s) you are interested in, both the annotated encoded amino acid sequences and the raw nucleotide sequences. Turn both of the genome sequence data files into searchable databanks using formatdb. In case of multiple chromosomes, use cat *.fna > all.fna to create a single multiple-sequence FASTA file (see the footnote in http://gogarten.uconn.edu/MCB372/Laboratories/assign2.html).
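The databank preparation described above can be sketched as the following shell session. This is only a sketch under stated assumptions: the two tiny chromosome files are made up so that the example runs anywhere, genome.faa is a hypothetical file name, and the formatdb call is skipped if the legacy NCBI BLAST tools are not installed. With formatdb, -p F marks the input as nucleotide and -p T as protein.

```shell
#!/bin/sh
# Work in a scratch directory so that *.fna matches only the files created here.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two made-up "chromosomes" standing in for the downloaded .fna files.
printf '>chr1 made-up example\nATGCATGCATGC\n' > chr1.fna
printf '>chr2 made-up example\nGGCCTTAAGGCC\n' > chr2.fna

# Remove an old all.fna first; otherwise the glob would also match the output file.
rm -f all.fna
cat *.fna > all.fna

# Sanity check: the number of FASTA header lines equals the number of sequences.
grep -c '^>' all.fna

if command -v formatdb >/dev/null 2>&1; then
    formatdb -i all.fna -p F          # nucleotide databank
    # formatdb -i genome.faa -p T     # protein databank (hypothetical file name)
else
    echo "formatdb not found; skipping the formatting step"
fi
```

Counting the header lines before formatting is a cheap way to confirm that every chromosome or plasmid ended up in the combined file.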
Which commands did you use to run formatdb? Which sequence did you choose? Which genome did you choose?
3) Use a query and database of your choice to do a normal BLAST search of swissprot and a genome of your choice (download from ftp://ftp.ncbi.nih.gov/)
Which commands did you use? How many significant hits (what is your significance level?) did you have in the three databanks?
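One way to count hits is to ask BLAST for tabular output (the -m 8 option of the legacy programs) and filter on the E-value, which is column 11 of that table. The sketch below assumes that format; the helper name count_significant, the threshold of 1e-4, and the small table are all made up for illustration and are not real search results.

```shell
#!/bin/sh
# count_significant FILE THRESHOLD
# Prints the number of distinct database sequences (column 2) that have at
# least one hit with an E-value (column 11) at or below THRESHOLD.
count_significant() {
    awk -v thr="$2" '($11 + 0) <= (thr + 0) && !seen[$2]++ { n++ } END { print n + 0 }' "$1"
}

# Made-up tabular hits, for illustration only (NOT real search results).
hits=$(mktemp)
printf 'q1\tsA\t45.0\t200\t110\t2\t1\t200\t5\t204\t1e-50\t190\n'   >  "$hits"
printf 'q1\tsB\t30.0\t150\t105\t3\t10\t160\t20\t170\t2e-05\t48.1\n' >> "$hits"
printf 'q1\tsB\t28.0\t60\t43\t0\t170\t230\t200\t260\t0.003\t35.0\n' >> "$hits"
printf 'q1\tsC\t25.0\t80\t60\t1\t5\t85\t7\t87\t0.5\t28.0\n'         >> "$hits"
printf 'q1\tsD\t22.0\t50\t39\t0\t1\t50\t1\t50\t8.1\t24.3\n'         >> "$hits"

# Count the database sequences with an E-value of 1e-4 or better.
count_significant "$hits" 1e-4
```

For the made-up table this prints 2 (sA and sB). Counting distinct subject sequences rather than lines avoids counting a sequence twice when it produces several alignments.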
4) The command-line program for a PSI-BLAST search is blastpgp. Running blastpgp - gives you a listing of all the options.
To calculate a PSSM you might need to define the following options:
-i -j -C -h -e
You also might want to use
-a -Q
What are these options?
Take note of the names you used for your query, the checkpoint file, and the databank (nr is the non-redundant databank; it should be available on the cluster).
The command to generate a PSSM might look as follows (this example searches swissprot; use -d nr to search nr instead):
blastpgp -i test1.fa -d swissprot -I T -h 0.0001 -j 4 -C test1.chk -Q test1.matrix -a 2
Remember, do this on a compute node! (If you use nr and get a segmentation fault, you might need to ssh node017.)
Once you have started the search, this would be a good time to take a break.
To search a protein databank with the profile you could use something like the following (where test1.chk is the file created with the -C flag in blastpgp):
blastpgp -i test1.fa -d genome.faa -R test1.chk -o genome.test1.br -I T -a 2
To search a nucleotide databank with the profile you could use something like the following (the -F F flag is necessary because blastpgp has the filter off by default, whereas it is on by default in blastall):
blastall -i test1.fa -d target_genome_nucl.fna -p psitblastn -R test1.chk -F F
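The three searches can be collected in one small script so that the query and checkpoint names stay consistent between steps. This is a sketch, not a tested pipeline: the file names are the examples used in the text, and the run helper and the DRY_RUN variable are made up for this illustration. By default the script only prints the commands; set DRY_RUN=0 on a compute node to execute them.

```shell
#!/bin/sh
# File names are the examples used in the text; adjust them to your own data.
query=test1.fa
chk=test1.chk

# Print each command; execute it only when DRY_RUN=0 is set in the environment.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

psiblast_steps() {
    # 1. Build the profile (checkpoint file) by iterating against swissprot.
    run blastpgp -i "$query" -d swissprot -I T -h 0.0001 -j 4 -C "$chk" -Q test1.matrix -a 2
    # 2. Restart from the checkpoint to search the genome's protein databank.
    run blastpgp -i "$query" -d genome.faa -R "$chk" -o genome.test1.br -I T -a 2
    # 3. Use the same checkpoint to search the nucleotide databank.
    run blastall -i "$query" -d target_genome_nucl.fna -p psitblastn -R "$chk" -F F
}

psiblast_steps
```

Printing the commands first makes it easy to check that the -C file written in step 1 is the same file read with -R in steps 2 and 3.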
How many significant hits to your profile did you obtain? Does this mean anything?
5) If you have time, run a PSI-BLAST search of nr with one of the following sequences. Do you obtain any matches that might reveal the function of the query sequence?
>gi|111225682|ref|YP_716476.1| hypothetical protein FRAAL6341 [Frankia alni ACN14a] MNPATLAASRRFPLVGRPRPACPALPDRVNEIADIAQGAVQEGADGLAEGAHALNKAALVASDCGMPTLA RDLCWQHINIYRSADRPLTVLQTRYMLEPVLNLARLHFRAGADDRALRLLTSMYRAVTSHTDLVVDGHAL PSTGLTGTRHDHHKLREWVWLHLVGDGIRALALAGRWDDAVAHARAHRGIGLHLMEGRQAKILAHCLNGR PAAAITALAGSTPEQPWELQVASCLNVMCTDGTPASRNIDEMIEHFVGQEPMPGYVVFRAQLGLTVATLA RTTDRAAANGVLAQVADEVIKAGDGYAARDVLRHPDTRAADLTSEQGSALVDVLTSSGLEAGRLPEPLLR SLLSSARTAAEALNSSIPLRDHLVAVEASPRPGVHLNARNCAPGLPAEQPGVEGGETDPVRPRGSRGHRL ADLP
>gi|158311919|ref|YP_001504427.1| conserved hypothetical protein [Frankia sp. EAN1pec] MNPVALALACFFPLIGRPRLACQPLPDRVAEIAEIAQAAARDGADGLAEGAHALNKAALLASDCGLAPLA RDLCWQHINIYCAVPRPLTVHEARYMLEPALNLARLQIRASDGEQALGLLTAMFQAVSSNTDLVVDGRVL PLTDLIGTRDERHKLREWVWLHLVGDGVRALALAGRWDDAVIHADTYRGIGLHLLEGRQAKILAHCLTGT SAEARAALAESTPMYPWELQVASCLEVMCTEDTSTAHGVTTMIGQFLGQRPMPGYAVFRAHLGMTVAALA ATTDPDAATRVLTQTVEEVIEAEDGYAARDVLRLRPTQAVDLPARHEKALADLLNASGLRAETPPEPVLE SVLGSARTAEAAIVAATHPQRR
>gi|158313186|ref|YP_001505694.1| conserved hypothetical protein [Frankia sp. EAN1pec] MAATTSTAATPSTFDIVAARFPLVPRSRPSCPPLDARIAHVAALAGQAAGGGGDALLRAAEAHNLAALIA SDCGLPDLARSLCWRQIDTLPLRRPLDGATAKLALQPFINLARLRLRAGDGLAAYQMLTTLYDVVVARTS TAIDERALVFDDLVTDVDHPQTVRWLWTVLLADGTRALTRTGHWTEALDHLNRHKGIGQRLLDGRQTAIL AHHAHRDHYAAEHLLTTTATTQPWEQSVATCLGLLHRHLTGLKTPDDGRSTIDALLPSNNPEHLTFNIQL GLCLLDLADTPQHLRPVLDTIIDGALHSDDAYAARDLLTHPAARGYLNRDQLTLLNERQRHSGLGSGRIP EALRTRLLGALALAS
>gi|158314618|ref|YP_001507126.1| hypothetical protein Franean1_2797 [Frankia sp. EAN1pec] MQLTRLVLTGTCLLGLLAAGAPAFADSYATTDCAQNPIPGCDLAAGGHGLAPAPPGGQAPPPQGGGSGGS GRGGSARPPGDVALDPADVARCSYVRSDFQPTTEAIQPVRFRPAPDGGLRVVTAVDRPGPFLARPVATGP DGRPGAWYVYQCQTDGVRDALYRPPVWIPDGQGGPVAGGPDVGGLAEQARSQLRLQGPAIALSPIGRQLI RLPTWMWLDPAGWRPVSATAAAGGVSVTATATPTGVDWVMGDGAQVHCTGPGTPYPDGGDPKAPSPDCGH TYQTVSEDQPGGVFTLTATVTWNVTWAGGGQTGVFDGLTTVSTVQVAVISIPALITGGG
>gi|158314619|ref|YP_001507127.1| hypothetical protein Franean1_2798 [Frankia sp. EAN1pec] MAPRSDSRGRRWPVRGVAVGLLAAGVLVVSCSSNDSAEPAPGPAPSTSRSPAPRMTVSPTPTSPADAAGQ RAVAAYVGLWEAMAEASHTSDWQSPELARYASGDALQAVSGGLYADHYNGLVSRGAPVLHPEITSVEPAD APTTVMVFDCSDSTNWLRHRADGAPFTDEPGGRRAVTSEVRLHQDGSWKVTRFAVEPVGSCS
>gi|158318490|ref|YP_001510998.1| hypothetical protein Franean1_6758 [Frankia sp. EAN1pec] MHSPTEAPGRFAGRTPADGRQPGPDPSSSTAVRRSGVGSAVVVGMGREPVTPAGDDAPSPPLPIRVPRTT WPAGRRRLSSTGRGYYVRPLSGPAGAAEQKWTVVVLDADGDPRETVTGFSSLEAADAYAELDPAIIRWIS VPTRPAIPERIPGL
>gi|86739833|ref|YP_480233.1| hypothetical protein Francci3_1124 [Frankia sp. CcI3] MWPPPNDADTVTRMRAAHERASAALRVTVDRTQAEEAWGFAGRTLGRPVMTPDSPGWLRIAATEAGQQIT TFWDGGRTAQQALPASIPRPALRAIHDHHHDGWDYRAELYDRVHVRPLAVTTVPRRLTDSPKGQWFQALR RSFRILATVSTDRRTIEQSYLDDVMPRILGEPITTTSPAPWVTAHGDLHWANLCGPTLCMLDWEGWGLAP AGYDAATLYCHSLFMPTLAAQVQARFADALSTESGRYAELAVIAELLDTVHSGTGLDIDHAGLLRIRATH LLRRGIPRQQEGAATRPGQT
>gi|86740523|ref|YP_480923.1| hypothetical protein Francci3_1818 [Frankia sp. CcI3] MRQRCGLSLAFRHPIWPGLVPTASDLASCRLAREPNGVGVGLSARKGPDRMTAGGLAHDTAHGPASGQSL PVGQIDCRLAIVVGATPWRNTAVTPTLVSSAGARERGAVEEVFGSVRLAGRIVHVDDGRLRRGGPHEWRT QREPADGPRVRAESAPSRRRPSGEWRRHPTPTMSAAPVTRVRQFTRRPATDRSSAGPCIPKVTERPCRLC LRAALRCASRTEHAQPAGRVLPEARWGCPSQTNSAAKGECPGHPGHGPLSNPAVRIR
>gi|86740529|ref|YP_480929.1| hypothetical protein Francci3_1824 [Frankia sp. CcI3] MANWFPLVPRPRPPTIDRRQRLTEIEHLAHSAPGNARRTTNAAEALNKAALLASDCGLPDLAKDLCWRQF HVFDNAGPQPPALATAVLQPLINLGRLELRADDPDRAYTFFNQIHHAVRTSTAVVLDGVTVNPARLFADD NTLLRARRFTWTVLLGDGTRALARAGRWDDAVAHLHRHHGIGRRLLDGRQTLILASHLAGESEAARTALA TSVTPTPWERAVAAVLGILCRHGSEPQAIGAALDDLRASSRDPTHPVFQNRLGLTLLDLATDATDQKWIA SAILAGLRDGHDGRCAADALAHPRLQQHLSPAHRVDLVERVKTAGIQRPPPTTFRDAMTTATLAAERGLR DALAGPLEGGRLITRRPDQSAARGEAG
>gi|86740733|ref|YP_481133.1| hypothetical protein Francci3_2030 [Frankia sp. CcI3] MTNVGTDPADENLLAWFPLVQRPRPPGLPLEDRVRQLHDLAARTSDGLPLLRAAEVCNKAALIASDCGQP DLAQDLCWRQHTLFDQARPLPASAAELALQPVLNLPRQLIRDGDGNRAHAILQALHEAARTQTSALIDGR SVSLHNVTCASDDHRTMRTLTWTALLADGVRALARAGRWHEAAEQAAAHRGVGRRLLDGRQATVLALAQA GHTEQAAALVDQSATPEPWEQAIQTILRVHCLRQAGADTGPQIAPLLATALTLMRQPDLSTMVFRARAGM IALDLADGHDHPRIDVLRRALIAGTFKDAYTARDTLAHRLRESMTTTQRQTLADVFRAAGLDAGSIPESL YGDLMETVKFAEDQLRGCLGRHARHCECTTATS
>gi|86742003|ref|YP_482403.1| hypothetical protein Francci3_3317 [Frankia sp. CcI3] MAVSGSLARPVKNAGHDHDRPGQQLPVSMVIDRLSGHILDAVVVIPSLSVDEVKESLKRLRQGHGLARPT ALPTVLPELLRARLADGRVSEPASAQEVTLLVTAFRQAVEGLADDERLCVEVDFNLSAEHRYPTLTERQE SLARQQRCAAKTVRRRADRALDTLAYMLLTNGPSTVTSTVRSPATISSPEQDRSPVDHGEPWGEDLRAFW RLSHGARIDIVCSEIPEDERPEYASPADRNYLRYAKFADLDTLIYLRTRFARLAPTVTIRDFAPSEYFDT QADVLVVVGGPPWNAKYREFLPRLPFFFEPHPLGADDPLVVPGMNGLVLGPRWTERNELLEDLAVFTRLT LAQGTTVFLLGGCLTLGVLGAARCLLEAERGARSSRYITEHVNDADFVLVTEARRIGGLTDVADLTRVPP LLLLARSNNEPFAVHVDNSDRYLQDQSGTEVHHIDHSC
================================================================================
The following applies only if blast doesn't recognize the common databanks:
In order to use the databases installed on bbcxsrv1.biotech.uconn.edu, the programs need to know where the databases are installed. One way to do this is to use an environment variable:
BLASTDB=/common/data/
export BLASTDB
If you do this frequently, you should add these two lines to your .profile. This file is normally hidden (its name starts with a period, so you see it only if you type ls -a). You can edit or create it using any text editor; one possibility is vi .profile
Every time you log in (including qrsh), the environment variable is then defined and exported. If this is too complicated, you can use this .profile and sftp it into your home directory.
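As an alternative to editing the file by hand, the two lines can be appended from the shell. The sketch below is one possible way to do that: the function name add_blastdb is made up, and the demonstration writes to a scratch file rather than to your real .profile.

```shell
#!/bin/sh
# add_blastdb FILE: append the two BLASTDB lines to FILE unless already present.
add_blastdb() {
    if ! grep -q '^BLASTDB=' "$1" 2>/dev/null; then
        printf 'BLASTDB=/common/data/\nexport BLASTDB\n' >> "$1"
    fi
}

# Demonstration on a scratch file; calling it twice adds the lines only once.
scratch=$(mktemp)
add_blastdb "$scratch"
add_blastdb "$scratch"
cat "$scratch"
```

To change your real start-up file, call add_blastdb "$HOME/.profile" instead. After the next login, echo "$BLASTDB" can be used to check that the variable is set.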
The databanks available in /common/data/ include:
define and export the BLASTDB environmental variable.