MCB 5472 : Extracting information from repetitive tasks

You should answer the questions in red! (email to gogarten@uconn.edu)

1) Handling genome-scale amounts of data -- Extracting text from other applications

EMBOSS is installed on the cluster. Here is a list of programs in EMBOSS. Today we will be using pepstats. Click on its entry in the list to see the command line arguments.

Download a "genome" of your choice from NCBI. The easiest way is to use the ftp server: ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/. Alternatively, especially if the genome is not yet on the ftp server, select the genome link at NCBI, follow the link to the "bioproject", click on the number in the protein links (protein sequences nnnn links), then use the "send to file" pull-down menu to create a multiple-FASTA file of all the linked sequences. Since we will be using pepstats, be sure to grab the protein sequence ".faa" file(s).

pepstats genome.faa -outfile genome.pepstats
Check the output file generated by pepstats using a text editor.
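
Besides looking at the file in a text editor, a quick sanity check is to compare the number of sequences in the ".faa" file with the number of reports in the pepstats output. The following is a minimal Python sketch, not part of the course scripts. It assumes that every pepstats report begins with a line starting with "PEPSTATS of "; check this against your own output file. The function names are made up for this illustration.

```python
import sys


def count_fasta_records(lines):
    """Count sequences: every FASTA record starts with a '>' header line."""
    return sum(1 for line in lines if line.startswith(">"))


def count_pepstats_records(lines):
    """Count reports. Assumption: each report starts with 'PEPSTATS of '."""
    return sum(1 for line in lines if line.startswith("PEPSTATS of "))


# Usage: python count_records.py genome.faa genome.pepstats
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as faa, open(sys.argv[2]) as report:
        n_seq = count_fasta_records(faa)
        n_rep = count_pepstats_records(report)
    print(f"{n_seq} sequences, {n_rep} pepstats reports")
```

If the two numbers differ, pepstats probably did not finish or the input file contains something other than protein sequences.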

 

Use parse_pepstats.pl to extract the isoelectric point for all proteins:

Read through the script. Try to figure out how the program finds the values for the theoretical isoelectric points in the pepstats output. Execute the program:

perl ./parse_pepstats.pl genome.pepstats

parse_pepstats.pl will generate three files, with suffixes ".pI", ".pos_charged", and ".parsed".
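
To illustrate the kind of pattern matching such a parser relies on, here is a short Python sketch. It is not a translation of parse_pepstats.pl and does not reproduce its three output files. It assumes that each report starts with a line of the form "PEPSTATS of <name> from ..." and contains a line of the form "Isoelectric Point = <value>"; compare these patterns with your own pepstats output before relying on them. The function name is made up for this illustration.

```python
import re
import sys

# Assumed line formats in the pepstats report (verify against your file).
NAME_RE = re.compile(r"^PEPSTATS of (\S+)")
PI_RE = re.compile(r"^Isoelectric Point\s*=\s*([0-9]+(?:\.[0-9]+)?)")


def parse_isoelectric_points(lines):
    """Return a list of (sequence name, isoelectric point) pairs."""
    results = []
    name = None
    for line in lines:
        match = NAME_RE.match(line)
        if match:
            name = match.group(1)  # remember which protein we are in
            continue
        match = PI_RE.match(line)
        if match and name is not None:
            results.append((name, float(match.group(1))))
            name = None  # wait for the next report header
    return results


# Usage: python parse_pi.py genome.pepstats
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as report:
        for seq_name, pi in parse_isoelectric_points(report):
            print(f"{seq_name}\t{pi}")
```

The parser keeps track of the most recent header line so that each isoelectric point is paired with the protein it belongs to.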

Use the ".pI" file (isoelectric points), or the ".parsed" file, to construct a histogram. You can use the histogram_script from http://lamarck.mcb.uconn.edu/~jpgogarten/scripts/ or try the Excel 2008 spreadsheet at http://gogarten.uconn.edu/mcb3421_2008/histo_upgrade4.xls . (Copy the isoelectric point data into the first column below the X. In the table in the center, adjust the first midpoint to 0.5 and the last midpoint to 13.5.)
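
If you prefer not to use the script or the spreadsheet, the binning can also be sketched in a few lines of Python. This is an illustration only, not the histogram_script mentioned above. It assumes bins of width 1 with midpoints from 0.5 to 13.5, and it assumes that the isoelectric point is the last whitespace-separated field on each line of the input file; lines where that field is not a number are skipped. The function names are made up for this illustration.

```python
import math
import sys


def read_values(lines):
    """Collect the last field of each line if it parses as a number."""
    values = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        try:
            values.append(float(fields[-1]))
        except ValueError:
            continue  # header or other non-numeric line
    return values


def histogram(values, low=0, high=14):
    """Count values in bins [low, low+1), ..., [high-1, high)."""
    counts = [0] * (high - low)
    for value in values:
        if low <= value < high:  # values outside the range are ignored
            counts[int(math.floor(value)) - low] += 1
    return counts


# Usage: python pi_histogram.py genome.pepstats.pI
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as handle:
        counts = histogram(read_values(handle))
    for i, count in enumerate(counts):
        print(f"{i + 0.5:5.1f}\t{count}")  # bin midpoint, number of proteins
```

The output has one line per bin (midpoint and count) and can be pasted into any plotting program.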
Which genomes did you analyze?
Describe the distribution of isoelectric points in your selected genome. How many peaks?
Why might there be a minimum at around pH7?
Compare your findings with others in the class. Do thermo- and halophiles have the same distributions of isoelectric points?

Check a few of the ORFs with very alkaline theoretical isoelectric points (the *.parsed outfile contains accession numbers in the first column; you could use fastacmd -s accession_number to retrieve the sequence from nr).
Which charge would these proteins have at neutral pH? Can you see a pattern in the types of enzymes?

2) Work on your student project!

Include a one-sentence summary of what you did in your report.