Old tasks
blastall, formatdb and fastacmd are installed on the cluster and can be invoked form the commandline.
typing blastall alone gives a list of options. Note: -F, -W, -a, -e
typing "formatdb -" or "fastacmd -" gives you a list of options
Download sequences from http://genome.ornl.gov/microbial/tmel/
Other genomes at http://lamarck.mcb.uconn.edu/~jpgogarten/temp/thermotoga_genomes/
Do example of formating databank (Thermotoga MSB8 in geneomes)
formatdb -i *.faa -p T -o T -n msb8.faa
A typical blastall command using the commandline is
blastall -p blastp -d /common/data/swissprot -i query.fa -o results.tab -F F -m 9 -W 2 -a 2 -e 0.000001
blastall -p blastp -d /Users/jpgogarten/genomes/Thermotoga_maritima/MSB8.faa -i tmel.faa -o /scratch/results.tab -F F -m 9 -W 2 -a 2 -e 0.001
DISCUSS /scratch and NFS problems.
An easy way to extract individual fasta files form a databank is the fastacmd for example, if we want to retrieve sequences with gi numbers 32699731 and 3915985 :
fastacmd -d /common/data/swissprot -p T -s 32699731,3915985
Note: no space between numbers. This only works if in making the database with formatdb you used the -o T option.
Within a script you could use the -o option to write the sequences to file, or you could assign the sequences to a variable using backtics:
$gi1=32699731;
$gi2=3915985;
$seq1=`fastacmd -d nr -p T -s $gi1,$gi2` ;
print "the sequences are: \n$seq1\n";
More info on fastacmd, blastall and formatdb is available at the NCBI: fastacmd, blastall, formatdb
7.1 read through the PSIBLAST notes at ftp://ftp.ncbi.nlm.nih.gov/blast/documents/blastpgp.html
Try to figure out which commands will allow you to use nr as a database to develop a PSSM and then use the result to search another, possibly user defined database.
(hints: Use checkpoints (-C and -R option) the end of the file has an example (more complicated because it uses tblastn).
7.2 Develop an idea for a student project, write outline and outline for programs
7.3 Modify the cumulative nucleotide strand bias program to
(A)
calculate the (G-C)/(G+C) bias in a sliding window
(B) writes a plotfile for use in gnuplot that has the correct name for the genome as title for the plot
7.4 Write a script that takes the input and ouput from a blastall search and returns all sequences from the query file that did NOT have a match in the databank.
Else: All the programs available through the emboss suite are available on the cluster (type the name at the command line). See http://emboss.sourceforge.net/ for more details.
A listing of program groups is here. The information for a program that calculated IEP and statistics is here.
run_pepstats.pl runs pepstat on all the sequences in the genome.
($%#^&* this requires aa sequences without any ambiguous characters, to accomplish this, you can use the script check_genome.pl . This generates a new *.faa file where all the non aa characters are removed.)
parse_pepstats.pl goes through the outputs of the pepstat program and extracts information. The resulting files can be loaded into excel (see here for an example).
Assignment: Make appointment to discuss student project!