MCB 5472: Blastall and PSI Blast

Please let me know, how far you got during the lab. If most students didn't finish, we will continue this next week!

Questions you should answer are given in blue. Please collect your answers in a word document and send them per email to gogarten@uconn.edu with subject MCB5472

1) Complete this exercise from last week

Blast from the command line:

Inteins are molecular parasites that invade in the most conserved regions of the most conserved proteins (see here). Two of these inteins have invaded the Archaeal/Vacuolar type ATPase/ATP synthase (the gene is sometimes known as vma-1). We are particularily interested in the one in insertion site b, often denoted as vma-1b.

The sequence surrounding the insertion site is very conserved, the intein itself accumulates substitutions at a much higher rate.

Reads from a second generation metagenome sequencing project are at

/Users/jpgogarten/Thane_smetagenomes/santa_pola_37_percent_reads_longer_than_30.txt

We will use these sequences to search for inteins that are located in vma-1b

The sequence of a V-ATPases with intein from Candidatus Nanosalinarum is

>gi|339757395|gb|EGQ40976.1| archaeal/vacuolar-type H+ ATPase subunit A [Candidatus Nanosalinarum sp. J07AB56]
MTQETIESEGEIYKITGPVVVAEDLDCQMNDVVYVGGEELLGEVIQIEGGQAYIQVYEETTGVSPGEPVK NTGEPLSVELGPGLLGSIYDGLQRPLPELEEQMGSFIQRGEDAPGINPEEEYSFEPAVEEGDHVEPGDVI GTVQVSYGEHKVLLPPHSDGGEVEEVREGNYTVTDHVAELDTGEKVSMRQEVPIRDQRPAEEQLPPEVPL ITGQRVFDGLFPLAKGGTAAVPGGFGTGKCVVGDTPVALADGSKRAIEDIYHDYEGRGDQTQRENERWTD LDNGPRVLSRKKGEIVEKQVSTVYKGKTDSTVSVGTRSGREVELTPVHRLFVLTPEMEIEEREAQNLQKG DTLLVPRNLPVDSSNVEIEVAEALPEKRVVGNGFEAVREAITVLEENHGTRKAAAEHLGIDDHRMDAYAT GRNRPRVADATAILEEAGKTHKLRCVKGEKQSKPTRIPEEVDEDFAELLGLLLGDGTVKPRSVQFYNNNE RLLDRVEHLAEKLFGLKSARTTANTVESVRVDSKVLRDLLVYLGFPETEKSLNCSVPETVMKGSEDVVAA FVRGYFLADGHFSEYEAELSTSSRRMQEDLAYALTRIGITPRVSEKQTDKNPHFRVRFSGDQLIEFHRKL EADYDKFEQIEEYLDRVEDHFRGTESVDIAPETVRSAFESSEATRADLKSAGAKLSNYETQEERISVPAL QNFADVTKNKHLAEMAGNHLEHFHPDRVASIETHQETKEVYDLTVPDTHNFVGGNAPMLLHNTVTQQSLA KFSDADVIVYIGCGERGNEMTEVLEEFPELEDPQTGEKLIDRTVLIANTSNMPVAAREASIYTGVTIAEY FRDQGLDVALMADSTSRWAEAMREVSARLEEMPGERGYPAYLASRLAEFYERGGRVRPLGPEEPGSVSII GAVSPPGGDFSEPVTQNTLRVVKNFFALDKDLAEKRHFPSINWNDSYSGYSETLSDYWQEEVDEDWNQNV QRLRDLLQESDDLEETVQLVGKDALPDRDRLTLEIGDMLKEFYLQQNAFHPVDQYSSPQKTFDMLEVILE YADHAYEALDEGALVEDIVSLNSRAEIGRIKTAEDHREKLEEVREQMQDEFGEVAEQ

The sequence of the selfspliching domain of the intein is

>Candidatus Nanosalinarum sp. J07AB56 intein splicing domain
CVVGDTPVALADGSKRAIEDIYHDYEGRGDQTQRENERWTDLDNGPRVLSRKKGEIVE KQVSTVYKGKTDSTVSVGTRSGREVELTPVHRLFVLTPEM-EIEEREAQNLQKGDTLL VPRNLPVDSSNVHFHPDRVASIETHQETKEVYDLTVPDTHNFVGGNAPMLLHN

The sequence of the selfspliching domain with 18 nucleotides upstream and downstream of the einsertion site is:

> Candidatus Nanosalinarum sp. J07AB56 intein splicing domain +-18 nucleotides
LAKGGTAAVPGGFGTGKCVVGDTPVALADGSKRAIEDIYHDYEGRGDQTQRENERWTDLDNGPRVLSRKKGEIVE KQVSTVYKGKTDSTVSVGTRSGREVELTPVHRLFVLTPEMEIEEREAQNLQKGDTLL VPRNLPVDSSNVHFHPDRVASIETHQETKEVYDLTVPDTHNFVGGNAPMLLHNTVTQQSLAKFSDADVIVY

We will attempt to use tblastn to determine how many intein are encoded in the read collection, and howmany possible insertion sites are present.

cd $HOME

mkdir blasttest2

cd blasttest2

cp /Users/jpgogarten/Thane_smetagenomes/santa_pola_37_percent_reads_longer_than_30.txt reads.ffn

(this is s big file, and copying it may take a few minutes)

While the file is being copied, decide which would be an appropiate query sequence (see above). Use a texteditor to put together a appropriate query. Save this sequence into a file called query.fa

formatdb -i reads.ffn -o T -p F

The commant for the search could be something like this:

blastall -i query.fa -d reads.ffn -p tblastn -e 10e-2 -a2

If you want to write the output to a file

blastall -i query.fa -d reads.ffn -o blastout1 -p tblastn -e 10e-2 -a2

If you want to use tabular output, use the -m 9 option.

What was your query, and which reasoning did you use to decide on it?

How many reads include the intein target sequence?

Howmany of these did include the intein?

[Aside: If you don't have decided on a student project, one possibility would be to devise an automated strategy to use ngs reads from a population or from metagenomes to determine how many vma-1b target sites are occupied by an intein].

=========================================================

2) Using Blastall to do genome plots for microbial genomes (different Frankia, or different Thermotoga species work nicely) .

[aside: The NCB provides a genome plot facility that is somewhat similar to this exercise. Check gene plot (Frankia), (Borrelia), (Leptosira) or (Mycobacteria) for examples. Problems include that multiple chromosomes and plasmids are concatenated, and that at time it is not clear if top scoring, top-scoring reciprocal, or one way scoring and reciprocal blasthits are depicted.]

Download the *.faa, *.ptt and *.fna files from two genomes (main chromosome only, check gene plot to make sure that your genomes aren't too divergent) from the ncbi's ftp site ftp://ftp.ncbi.nih.gov. As we want to plot syntenic relationships between the two genomes, we want to use the chromosomes from two close relatives.

We want to plot a genes genome position in one genome against the genome position of a homolog in another genome. One way to do this is to replace the GI numbers in the genome with the location in the genome. A program that does this is here. Save the two genomes (as faa files, the two ptt files, the two files, and the perlscript into a new directory on the cluster (name in gp (for genome plot)). We also will use the extract_lines.pl script.

Open a terminal connection to the cluster,
qrsh
cd gp, run the addnucnum script. For example, if the genome and ptt files are ACN.faa, ACN.ptt, CCI.faa, CCI.ppt then the commands would be
perl addnucnum.pl ACN
and
perl addnucnum.pl CCI
Check the addnucnum.pl program, the ptt files, and the output in a a texteditor or in vi (Note, at present the script uses the middle of the ORF to identify the location, if you want the plot the beginning of each ORF, modify the script.

Pick one of the numbered genomes and turn it into a "blastable" databank. e.g.
formatdb -i ACN.num.faa -p T -o T
Check the result of the formatdb command by typing
more *.log

then run a blastall search of the other genome (or rather the proteins encoded in the other genome) against the databank
blastall -i CCI.num.faa -d ACN.num.faa -o blast.out -p blastp -e 10e-13 -m 8 -a2
NOTE: You want to select a reasonable significance cut-off, because we will plot all significant matches.
NOTE: You need to use the numbered genomes in both cases.

Run blast.out through extract_lines.pl
perl extract_lines.pl blast.out

Load the files blast.out and blast.out.top into excel and create a scatterplot for the first two columns.

What if any is the difference between the two plots?
Why do some proteins have more than one match?

If you have time, repeat the analysis, but use a lower or higher E-value cut-off, and/or use the same genome as target and query (matches off the diagonal are paralogs).

=========================================================

3) Mummer is a program that finds matching nucleotide sequences, and produces nice plots, similar to the genome plot created above. The difference is that in this case the search is done on the nucleotide level, and that the program keeps track of the + and the - strand. Mummer is installed on the cluster. Execute the following command, replacing CCI and ACN with the genomes you chose.
mummer -mum -b -c CCI.fna ACN.fna > ref_qry.mums
If you are interested in details, check the mummer manual at http://mummer.sourceforge.net/manual/#mummer
You can use mummer -maxmatch -b -c CCI.fna ACN.fna > ref_qry.mums to get all matches, not only the unique matches.

mummerplot --postscript --prefix=ref_qry ref_qry.mums

To plot the results, we need to go to the head-node (only that node has gnuplot installed)
exit

cd gp (move to the genome plot directory)

The installed version of mummerplot creates a gnuplot file (*.gp) designed for an older version of gnuplot.

You will get a warning about "deprecated syntax" - ignore it.

On a Mac, view output (ref_qry.ps) with preview; under windows the ghostview program is a freeware program that works reasonably well - see here, or you could use the online viewer here.

Do find differences between the plots created in exercise #2 and the mummer plots?
Which of the generated plots do you think most useful for ... ?

=========================================================

4) PSI BLAST using blastall

In case you cannot access nr or other databanks on bbcxsrv1 go here

* Decide on a sequence for which you want to find distant homologs.
Possible examples are:

Download an appropriate sequence from ENTREZ and save it as a FASTA file in a directory on the cluster. Make sure that the sequence only contains the parts you are interested in. Especially important if you have a multidomain protein (host protein + intein), or if the sequence has a signal sequence. Read the genbank file, and only select the appropriate amino acids. Also download the genome you are interested in (download form ftp://ftp.ncbi.nih.gov/). Download both the annotated encoded aa sequences, and the raw nucleotide sequences.
Turn both of the genome sequence data files into searchable databanks using formatdb.
In case of multiple chromosomes use cat *.fna > all.fna and cat *.faa > all.faa to create a single multiple sequence fasta file.

Which commands did you use to run formatdb?
Which sequence did you choose?
Which genome did you choose?

* Use a query and database of your choice to do a normal blast search of swissprot and a genome of your choice

Which commands did you use?
How many significant hits (what is your significance level?) did you have in the three databanks?

* the command line for a PSI BLAST search is blastpgp
     blastpgp -
gives you a listing of all the options.

To calculate a PSSM you need to define the following options:

-i
-j

-C
-h
-e

You also might want to use

-a
-Q

What are these options?

The command to generate a PSSM matrix using nr might look as follows:
blastpgp -i test1.fa -d swissprot -I T -h 0.0001 -j 4 -C test1.chk -Q test1.matrix -a2
Remember, do this on a compute node! (if you use nr and get a segmentation fault, you might need to ssh node017 first)

To search a protein databank with the profile we could use something like the following:
blastpgp -i test1.fa -d genome.faa -R test1.chk -o genome.test1.br -I T -a 2

To search a nucleotide databank with the profile we could use something like the following:
blastall -i test1.fa -d target_genome_nucl.fna -p psitblastn -R test1.chk

How many significant hits to your profile did you obtain?
Does this mean anything?

================================================================================

IF YOU HAVE TIME, and want to explore the NCBI's web interface, do the following:

Do a PSI-BLAST search at the NCBI using the same or a different query as above (to speed things up, you could use swissprot as database). If you don't have a good sequence, use this intein from Pyrococcus:

>Pab_VMA intein from gi|7436316|pir||D75028
CVDGDTLVLTKEFGLIKIKDLYKILDGKGKKTVNGNEEWTELERPITLYGYKDGKIVEIKATHVYKGFS
AGMIEIRTRTGRKIKVTPIHKLFTGRVTKNGLEIREVMAKDLKKGDRIIVAKKIDGGERVKLNIRVEQKR
GKKIRIPDVLDEKLAEFLGYLIADGTLKPRTVAIYNNDESLLRRANELANELFNIEGKIVKGRTVKALLI
HSKALVEFFSKLGVPRNKKARTWKVPKELLISEPEVVKAFIKAYIMCDGYYDENKGEIEIVTASEEAAYG
FSYLLAKLGIYAIIREKIIGDKVYYRVVISGESNLEKLGIERVGRGYTSYDIVPVEVEELYNALGRPYAE
LKRAGIEIHNYLSGENMSYEMFRKFAKFVGMEEIAENHLTHVLFDEIVEIRYISEGQEVYDVTTETHNFIGG
NMPTLLHNT

[aside: DO NOT use the gi number as query. gi|7436316 also contain the extein portion, the sequence above only contains the intein.]

On the Format page, set the E-value cut-off for inclusion in the next round to 0.0001 and change the maximum target sequences to 10000, and turn on the filter for low complexity.

Save the PSSM (Position Specific Scoring Matrix, or profile) from your search after the 4th iteration. To do that choose PSSM with parametrs ASN.1 from pull-down menu under Download. Save the PSSM matrix to disk as text file, and keep this browser window open. We are going to use this profile in the next search.

Now we will use the PSSM to BLAST the completed genomes. Go to Microbial Genomes Genomic BLAST page ( Let it load completely before choosing any options! ).
Paste intein sequence into query sequence box,
change Query and Database entries to "Protein" and select blast program BLASTP .
Choose one or more of the genomes as database. The following work for inteins:

What was your query, what was your genome, how many matches did you obtain?

If you repeat the exercise using tblastn on the nucleotide sequences, do you obtain the same results?

==========================================================================

The following applies only if blast doesn't recognize the common databanks:

In order to use the databases installed on bbcxsrv1.biotech.uconn.edu  the programs need to know where the databases are installed. One way to do this is to use an environmental variable.

BLASTDB=/common/data/
export BLASTDB

If this is a common thing for you to do frequently, you should add these two lines to your .profile . This is a file that is normally hidden (its name starts with a period, you see it only, if you type ls -a), you can edit, or create it using any text editor, one possibility would be to use vi .profile

Every time you log in (including qrsh) the environmental variable is defined and exported.

The databanks available in /common/data/ include:

define and export the BLASTDB environmental variable.