MCB 5472 : Extracting information from repetitive tasks

1) Handling genome (or larger) amounts of data -- Extracting text from other applications

EMBOSS is installed on the cluster. Here is a list of programs in EMBOSS. Today we will be using pepstats. Click on its entry in the list to see the command line arguments.

Download a "genome" of your choice from NCBI. The easiest is to use the ftp server: ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/. Alternatively, especially if the genome is not jet on the ftp server, select the genome link at the NCBI, follow the link to the "bioproject," click on the number in protein links (protein sequences nnnn links), then use the "send to" -> file pull down menu to create a multiple fasta file for all the linked sequences. Since we will be using pepstats, be sure to grab the protein sequence ".faa" file(s).

Move your multiple sequence fasta file to a directory on the cluster. (As you are transferring files, you might as well move this R-script, and the scripts parse_pepstats.pl and parse_pepstats_mod.pl into the same directory.)
The script histogramScript_pdf.R creates a histogram from data in a file called indata. The first line in indata contains the name of the data (no spaces), the following lines contain the the data (one data point per line). The output of the script are two pdf files, plot1.pdf contains a histogram of the data, plot2.pdf compares the histogram of the data in indata to a normal distribution. More information is in the documentation file (MS-Word formated file).

parse_pepstats_mod.pl calls the histogram script, which needs to be in the same directory.


To run the EMBOSS program from the command line, type: 
pepstats genome.faa -outfile genome.pepstats
where "genome.faa" should be the name of the file that contains the protein sequences encoded in the genome, and "genome.pepstats" is the output file.

Check the output file generated by pepstats using a text editor.

We will use parse_pepstats.pl or parse_pepstats_mod.pl to extract the isoelectric point for all proteins:

Read through the script. Try to figure out how the program finds the values for the theoretical isoelectric points in the pepstats output. Execute the program:

perl parse_pepstats.pl genome.pepstats
where genome.pepstats is the name of outfile (see above)

parse_pepstats.pl will generate three files, with suffixes ".pI", ".pos_charged", and ".parsed".

Use the ".pI" file (isoelectric points) to construct a histogram (or the .parsed file). You can use the histogram_script (via parse_pepstats_mod.pl) or try to use an Excel addon for this.

Which genomes did you analyze?
Describe the distribution of isoelectric points in your selected genome. How many peaks?
Why might there be a minimum at around pH7?
Compare your finding with others in the class. Do thermo- and halophiles have the distributions of isoelectric points?

Check a few of the ORFs with very alkaline theoretical isoelectric point (the *.parsed outfile contains accession numbers in the first column; you could use
fastacmd -s accession_number to retrieve the sequence from nr).
Which charge would these proteins have at neutral pH? Can you see a pattern in the types of enzmes?

2) Work on your student project!

Include a one sentence summary of what you did in your email.