Class 8

Old tasks

blastall, formatdb and fastacmd are installed on the cluster and can be invoked form the commandline.

typing blastall alone gives a list of options. Note: -F, -W, -a, -e

typing "formatdb -" or "fastacmd -" gives you a list of options

Download sequences from http://genome.ornl.gov/microbial/tmel/

Other genomes at http://lamarck.mcb.uconn.edu/~jpgogarten/temp/thermotoga_genomes/

Do example of formating databank (Thermotoga MSB8 in geneomes)

formatdb -i *.faa -p T -o T -n msb8.faa

A typical blastall command using the commandline is

blastall -p blastp -d /common/data/swissprot -i query.fa -o results.tab -F F -m 9 -W 2 -a 2 -e 0.000001

blastall -p blastp -d /Users/jpgogarten/genomes/Thermotoga_maritima/MSB8.faa -i tmel.faa -o /scratch/results.tab -F F -m 9 -W 2 -a 2 -e 0.001

DISCUSS /scratch and NFS problems.

An easy way to extract individual fasta files form a databank is the fastacmd for example, if we want to retrieve sequences with gi numbers 32699731 and 3915985 :

fastacmd -d /common/data/swissprot -p T -s 32699731,3915985

Note: no space between numbers. This only works if in making the database with formatdb you used the -o T option.
Within a script you could use the -o option to write the sequences to file, or you could assign the sequences to a variable using backtics:

$gi1=32699731;
$gi2=3915985;
$seq1=`fastacmd -d nr -p T -s $gi1,$gi2` ;

print "the sequences are: \n$seq1\n";

More info on fastacmd, blastall and formatdb is available at the NCBI: fastacmd, blastall, formatdb

7.1 read through the PSIBLAST notes at ftp://ftp.ncbi.nlm.nih.gov/blast/documents/blastpgp.html
Try to figure out which commands will allow you to use nr as a database to develop a PSSM and then use the result to search another, possibly user defined database.
(hints: Use checkpoints (-C and -R option) the end of the file has an example (more complicated because it uses tblastn).

7.2 Develop an idea for a student project, write outline and outline for programs

7.3 Modify the cumulative nucleotide strand bias program to
(A) calculate the (G-C)/(G+C) bias in a sliding window
(B) writes a plotfile for use in gnuplot that has the correct name for the genome as title for the plot

7.4 Write a script that takes the input and ouput from a blastall search and returns all sequences from the query file that did NOT have a match in the databank.

Else: All the programs available through the emboss suite are available on the cluster (type the name at the command line). See http://emboss.sourceforge.net/ for more details.
A listing of program groups is here. The information for a program that calculated IEP and statistics is here.

run_pepstats.pl runs pepstat on all the sequences in the genome.
($%#^&* this requires aa sequences without any ambiguous characters, to accomplish this, you can use the script check_genome.pl . This generates a new *.faa file where all the non aa characters are removed.)

parse_pepstats.pl goes through the outputs of the pepstat program and extracts information. The resulting files can be loaded into excel (see here for an example).

Assignment: Make appointment to discuss student project!