Assignment 11 Handling genome (or larger) amounts of data -- Extracting text from other applications

Your name:
Your email address:

Histograms of isoelectric points (IEPs) of all proteins encoded in a genome OR finding organisms that follow a salt-in strategy.

See these slides for some background.

Underlying today's exercise are two observations:
1) Some organisms that live in environments with very high salt concentrations follow a salt-in strategy.  Instead of keeping salt out of the cell, they accumulate very high concentrations of KCl (>4 M salt).  A consequence of the high salt concentration is that the reach of a charge is very short (the so-called Debye length; using the formula given here, the Debye length is in the Angstrom range inside the cytoplasm of an organism following the salt-in strategy).  To compensate for this, organisms following the salt-in strategy have many negatively charged amino acids in their proteins (or in most of them).  These negatively charged amino acids (sidechains containing a COO- group) lead to an isoelectric point for the proteins at a low (acidic) pH.  (I.e., one would need to move the pH of the solution to around pH 2 for the overall protein to have zero charge.)
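
To get a feeling for the numbers, here is a minimal Python sketch of the Debye length estimate. It uses the standard Debye-Hueckel expression for a 1:1 electrolyte, which may differ in detail from the linked formula. The function name and the default dielectric constant (pure water near 25 C) are assumptions made for illustration, and the theory is strictly a dilute-solution approximation, so treat the 4 M value as an order-of-magnitude estimate only.

```python
import math

# Physical constants (SI units).
EPSILON_0 = 8.8541878128e-12   # vacuum permittivity, F/m
K_B = 1.380649e-23             # Boltzmann constant, J/K
E_CHARGE = 1.602176634e-19     # elementary charge, C
N_A = 6.02214076e23            # Avogadro constant, 1/mol


def debye_length_angstrom(ionic_strength_molar, temperature_k=298.15, eps_r=78.4):
    """Debye length in Angstrom for a 1:1 electrolyte such as KCl.

    eps_r=78.4 (pure water near 25 C) is an assumption; the value inside a
    crowded, salty cytoplasm is lower.
    """
    ionic_strength_si = ionic_strength_molar * 1000.0  # mol/L -> mol/m^3
    numerator = eps_r * EPSILON_0 * K_B * temperature_k
    denominator = 2.0 * N_A * E_CHARGE ** 2 * ionic_strength_si
    return math.sqrt(numerator / denominator) * 1e10  # m -> Angstrom


# Compare a "normal" salt concentration with the salt-in situation.
for molar in (0.15, 1.0, 4.0):
    print(f"{molar:>5} M: {debye_length_angstrom(molar):.1f} Angstrom")
```

Under these assumptions the estimate at 4 M comes out at roughly 1.5 Angstrom, compared with several Angstrom at 0.15 M.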

Aside:  most proteins have a negative charge at the pH at which they normally exist.  The overall negative charge prevents proteins from clumping together.  Exceptions are proteins that bind to the DNA (the backbone contains phosphate groups that give the DNA an overall negative charge; for a protein to bind to the DNA, it needs to have a positive charge), and proteins that bind to the cell wall or extracellular matrix.  The cell wall has an overall negative charge, and for a protein to stay inside the cell wall it helps to have a net positive charge.  
For a "normal" cell the histogram of theoretical isoelectric points (calculated from the types of sidechains present in each protein) looks as follows: 

IEP Histogram

2) The Haloarchaea follow the salt-in strategy, and as a consequence have one big peak in the histogram of IEPs below pH 4.  The same is true for a group of archaea known as Nanohaloarchaea.  The placement of the Nanohaloarchaea remains uncertain.  They were initially considered the sister group to the Haloarchaea.  Later they were grouped with other recently discovered Archaea into the DPANN group (one of the Ns stands for Nanohaloarchaea).  The DPANN group allegedly is a deep-branching group in the archaea. 
More recently, a paper found that they evolved independently of the Haloarchaea from a methanogen ancestor.  The putative ancestral groups of methanogens are the

Methanomicrobiales
(e.g., Methanoregula boonei 6A8; Methanolinea tarda NOBI-1; Methanoculleus marisnigri JR1; Methanolacinia petrolearia DSM 11571; Methanofollis liminatans DSM 4140; Methanocorpusculum labreanum Z; Methanosphaerula palustris E1-9c) for the Haloarchaea, and the
Methanocellales (e.g., Methanocella arvoryzae MRE50; Methanocella paludicola SANAE; Methanocella conradii HZ254) for the Nanohaloarchaea. 
A third possible ancestral group of the Halobacteria is the
Methanosarcinales
(e.g., Methanosarcina barkeri; Methanohalobium evestigatum Z-7303; Methanosalsum zhilinae DSM 4017, Methanosaeta thermophila PT)
Recently a group of methanogens was described, the Methanonatronarchaeia (aka Methanonatronarcheia). The placement of this group inside the Archaea remains controversial (here, here and here).

Your task is to test whether any of these ancestral groups contain species that appear to be on the path towards a salt-in strategy. 

Every student should analyze at least one haloarchaeal genome (the group is still called Halobacteria), one nanohaloarchaeal genome, and one genome from each of the proposed ancestral groups (Methanomicrobiales, Methanocellales, Methanosarcinales, Methanonatronarchaeia). 
We do not restrict ourselves to completely sequenced genomes! In addition, feel free to use any other genome you are interested in (halophilic bacteria, acidophiles; how would you detect a proton-in strategy?).
Also, the modified version of the script analyzes all faa files present in a directory automatically; therefore, feel free to download many faa files (the only additional work is to rename the files)!

EMBOSS is installed on the cluster. Here is a list of programs in EMBOSS. Today we will be using pepstats. Click on its entry in the list to see the command line arguments.

First, we will download the proteins encoded in the genome we will analyze.

Go to the NCBI's current genome list.

Click on the "Prokaryotes" tab. Click on Filters in the upper right corner, and check Assembly level "Complete", "Chromosome", and "Scaffold".

Use the "Search by organism" box to narrow to a taxonomic group. (Note that after a "Search by organism", one might need to repeat the process of clicking on the "Prokaryotes" tab and re-tick the filter boxes.)

Then look for the R and G links in the far right-hand column (you will probably have to scroll to the right). The R takes you to a listing of all the RefSeq files for a genome project (R is preferred over G). If you select an organism for which only the G link is available, and if this link does not include an faa.gz file, select a different strain.

You want to download the file ending in ".faa.gz". This "faa" file contains all of the proteins coded by a genome.

Download the file to your computer, uncompress the file, and rename it using the Genus_species_strain designation. Remember not to use spaces or special characters in the name!
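
If you download many genomes, renaming by hand gets error-prone. The following is a small Python sketch of one possible naming rule, not something the course scripts require: it keeps letters, digits and hyphens and turns everything else into underscores. The function name `faa_filename` is made up for this example.

```python
import re


def faa_filename(organism_name):
    """Turn an organism name into a Genus_species_strain.faa file name."""
    # Replace every run of characters other than letters, digits and '-'
    # with a single underscore (keeping '-' is an assumption of this sketch).
    cleaned = re.sub(r"[^A-Za-z0-9-]+", "_", organism_name.strip())
    return cleaned.strip("_") + ".faa"


print(faa_filename("Methanocella paludicola SANAE"))
print(faa_filename("Methanosphaerula palustris E1-9c"))
```

You can then rename the downloaded file to the printed name, either on your computer or with mv on the cluster.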

Using filezilla (transfer.cam.uchc.edu, username mcb3421usrxyz), create a directory for lab11, and transfer the faa files into the directory.

Which organisms did you select, and which links did you use?


PuTTY to xanadu-submit-ext.cam.uchc.edu

login

srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash

cd lab11

If you did not unzip the faa files on your computer, do this now:
gunzip *.gz (the .gz suffix means this text file is compressed, so uncompress it)

more the_name_of_one_of_your_genomes.faa (inspect the first few lines of the faa file, type "q" to exit) (...and space to go forward) (...and "b" to go back)
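
Besides paging through the file, it can help to check how many proteins it contains. Here is a minimal Python sketch that counts FASTA records and their lengths; it assumes Python 3 is available on the machine you use, and the two sequences in the example are made up.

```python
def fasta_lengths(lines):
    """Return {sequence id: sequence length} for protein FASTA text."""
    lengths = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word after '>' (usually the accession) as the key.
            words = line[1:].split()
            current = words[0] if words else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


# Tiny made-up example; for a real file pass open("G_species.faa") instead.
example = """>prot_A made-up protein [Example organism]
MKTAYIAKQR
QISFVK
>prot_B another made-up protein [Example organism]
MDEEDL
""".splitlines()

lengths = fasta_lengths(example)
print(len(lengths), "proteins; longest has", max(lengths.values()), "residues")
```

The number of proteins should match the number of records you later find in the pepstats output.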

Today we will be using programs from the emboss package, and R scripts. Thus we need to load the corresponding modules:

module load R
module load emboss
module load perl 
(To check the modules available: module avail)

We will use the following scripts today

run_pepstats.pl
parse_pepstats.pl
parse_pepstats_mod2.pl
histogramScript_pdf.R

Use either curl -O name_of_link or filezilla to get these scripts into the lab11 directory.

Finally, here is the exciting pepstats command:
pepstats the_name_of_one_of_your_genomes.faa -outfile the_name_of_one_of_your_genomes.pepstats
more the_name_of_one_of_your_genomes.pepstats

(inspect the pepstats file, type space to go forward)
(...and "b" to go back)
(...and "q" to exit)

Now we need a program to extract the isoelectric points (amongst other stuff). It's called parse_pepstats.pl. It will work provided the output of pepstats is in a file ending in ".pepstats" (remember that is what you named it above; type "ls" to confirm). Read through the parse_pepstats.pl script. Try to figure out how the program finds the values for the theoretical isoelectric points in the pepstats output.

perl parse_pepstats.pl (run the script, and extract the isoelectric points)
ls -l (you made a bunch of additional files)
head the_name_of_one_of_your_genomes.pepstats.pI (the first 10 isoelectric points!)
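
For comparison with the Perl script, here is a minimal Python sketch of the same idea. It is not a translation of parse_pepstats.pl. It assumes that each record in the pepstats report starts with a line "PEPSTATS of <id> ..." and contains a line "Isoelectric Point = <value>"; check this against your own .pepstats file. The sample text and the ids in it are made up.

```python
import re

# Assumed layout of the pepstats text report (verify against your own file).
HEADER_RE = re.compile(r"^PEPSTATS of (\S+)")
PI_RE = re.compile(r"^Isoelectric Point\s*=\s*([0-9]+(?:\.[0-9]+)?)")


def extract_pi(lines):
    """Return a list of (sequence_id, isoelectric_point) tuples."""
    results = []
    current_id = None
    for line in lines:
        header = HEADER_RE.match(line)
        if header:
            current_id = header.group(1)
            continue
        pi = PI_RE.match(line)
        if pi and current_id is not None:
            results.append((current_id, float(pi.group(1))))
            current_id = None  # wait for the next record header
    return results


# Made-up sample; for a real file pass open("G_species.pepstats") instead.
sample = """PEPSTATS of prot_A from 1 to 120

Molecular weight = 13000.00  Residues = 120
Isoelectric Point = 4.2100

PEPSTATS of prot_B from 1 to 80

Molecular weight = 9000.00  Residues = 80
Isoelectric Point = 10.5300
""".splitlines()

for seq_id, pi in extract_pi(sample):
    print(seq_id, pi)
```

The approach is a line-by-line scan: remember the id from the record header, then pick the number out of the matching line with a regular expression.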

Use filezilla and drag the file from the lab11 folder containing the isoelectric points (.pepstats.pI ending) and the table with the parsed output (ending on .parsed) to your computer.
Load the .pI and .parsed files into Excel (remember to select "All Files" in the open dialog to see them).

Make histograms of the pI data in Excel: use Insert -- Statistic Chart (all-blue column chart icon in the Charts section) -- Histogram.
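
What the histogram chart does behind the scenes is simple binning. The Python sketch below shows it with a text histogram; the bin width, the 0 to 14 range and the demo values are all assumptions chosen for illustration, not data from a real genome.

```python
def bin_pi_values(pi_values, bin_width=0.5, low=0.0, high=14.0):
    """Count pI values per bin; returns a list of (bin_start, count)."""
    n_bins = int(round((high - low) / bin_width))
    counts = [0] * n_bins
    for pi in pi_values:
        if low <= pi <= high:
            # A value exactly equal to `high` goes into the last bin.
            index = min(int((pi - low) / bin_width), n_bins - 1)
            counts[index] += 1
    return [(low + i * bin_width, c) for i, c in enumerate(counts)]


# Made-up values for illustration only; use the numbers from your .pI file.
demo = [4.1, 4.3, 4.4, 4.9, 5.2, 6.0, 9.5, 9.8, 10.1, 11.7]
for start, count in bin_pi_values(demo, bin_width=1.0):
    print(f"{start:4.1f}-{start + 1.0:4.1f} {'#' * count}")
```

A narrower bin width shows more detail but makes peaks noisier, which matters when you later count the peaks in each distribution.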

This is a rather tedious procedure that can be easily automated:

run_pepstats.pl (is a script that runs pepstats on all faa files in the directory, use nano or more to inspect the script)

parse_pepstats_mod2.pl (runs parse_pepstats on all pepstats output files, reformats the .pI file and hands it over to an R script that makes histograms, and finally renames the histograms).

Briefly read through the scripts (they should already be in the lab11 directory) to understand what they do.


ls                                (to make sure the scripts are in your directory)
perl run_pepstats.pl              (runs pepstats on all *.faa files in the lab11 directory)
ls                                (to check which files were created)
perl parse_pepstats_mod2.pl       (runs parse_pepstats on all pepstats files, extracts the isoelectric points, and creates the histograms)
ls -l                             (you made a bunch of additional files)
head G_species.pepstats.pI        (the first 10 isoelectric points!)
head G_species.pepstats.parsed    (check the table that summarizes the results for each protein) 
            (for each pepstats file a pdf file containing the histogram should be created in the folder)
            (move the pdf, .pI and parsed files to your PC and inspect them using acrobat, excel or similar)
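
The batch step is just a loop over the faa files. As an illustration (this is not the run_pepstats.pl script), here is a minimal Python sketch that builds one pepstats command per file, using the same arguments as the single-genome command above. The function name `pepstats_commands` is made up, and the example only prints the commands (a dry run).

```python
from pathlib import Path


def pepstats_commands(directory="."):
    """Build one pepstats command (as an argument list) per .faa file."""
    commands = []
    for faa in sorted(Path(directory).glob("*.faa")):
        outfile = faa.with_suffix(".pepstats")
        # Same arguments as the single-genome pepstats command used earlier.
        commands.append(["pepstats", str(faa), "-outfile", str(outfile)])
    return commands


# Dry run: print the commands instead of executing them.
for command in pepstats_commands("."):
    print(" ".join(command))
```

To actually run the commands you could pass each list to subprocess.run(command, check=True) after loading the emboss module.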

Use filezilla to drag the file from the lab11 folder containing the isoelectric points (.pepstats.pI ending) and the table with the parsed output (ending on .parsed) and the .pdf files with the histograms to your computer.

For at least three of the analyzed genomes, describe the distribution of isoelectric points. How many peaks?
Why might there be a minimum at around pH 7.5?
Compare your findings with others in the class.
Check a few of the ORFs with very alkaline theoretical isoelectric points (the *.parsed outfile contains accession numbers in the first column; searching Entrez Protein for the accession number will get you to the GenBank entry for that protein). What functions do these genes have?
Which charge would these proteins have at neutral pH? Can you see a pattern in the types of enzymes?
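
When comparing many genomes, a few summary numbers per genome can complement the histograms. The Python sketch below is one possible way to do this; the cutoffs of pI 5 and pI 9 are arbitrary assumptions, the function name is made up, and the two example genomes contain invented numbers.

```python
import statistics


def summarize_pi(pi_values, acidic_cutoff=5.0, alkaline_cutoff=9.0):
    """Summarize a list of pI values; both cutoffs are arbitrary assumptions."""
    n = len(pi_values)
    if n == 0:
        raise ValueError("no pI values given")
    return {
        "n": n,
        "median": statistics.median(pi_values),
        "fraction_acidic": sum(pi < acidic_cutoff for pi in pi_values) / n,
        "fraction_alkaline": sum(pi > alkaline_cutoff for pi in pi_values) / n,
    }


# Made-up numbers, only to show the call; use your own .pI values instead.
genomes = {
    "example_A": [4.0, 4.2, 4.5, 4.8, 9.6],
    "example_B": [5.5, 6.1, 8.9, 9.4, 10.2],
}
for name, values in genomes.items():
    print(name, summarize_pi(values))
```

A genome shifting towards a salt-in strategy would be expected to show a larger acidic fraction and a lower median than its relatives, but the histogram remains the better guide to the number of peaks.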


Send an email to gogarten@uconn.edu with the pdf of the histograms you created.


    Finished?

    Type exit to release the compute node from the queue.
    If you encountered problems in your session, check the queue for abandoned sessions using the command qstat. If there are abandoned sessions under your account, delete them from the queue by typing qdel job-ID, e.g. "qdel 40000" would delete Job # 40000.

 
