Assignment 12:
Creating publication quality trees
Handling genome (or larger) amounts of data -- Extracting text from other applications

Your name:
Your email address:

1) Histograms of isoelectric points (IEPs) of all proteins encoded in a genome (aka finding organisms that follow a salt-in strategy)

See these the lecture slides for some background.

Underlying today's exercise are two observation:
1) some organisms that live in environments with very high salt concentrations follow a salt in strategy.  Instead of keeping salt out of the cell, they accumulate very high concentration of KCl (>4M salt).  A consequence of the high salt concentration is that the reach of a charge is very low (the so-called Debye lengths --- using the formula given here, the Debye length is in the Angstrom range inside the cytoplasm of an organism following the salt in strategy).  Too compensate for this, organisms following the salt-in strategy have many negatively charged amino acids in their proteins (or in most of them).  And these negatively charges aa (sidechains containing a COO- group) lead to an isoelectric point for the proteins at a low (acidic) pH.  (I.e. one would need to move the pH of the solution to around pH 2 to have the overall protein with zero charge.) 

Aside:  most proteins have a negative charge at the pH at which they normally exist.  The overall negative charge prevents proteins from clumping together.  Exceptions are proteins that bind to the DNA (the backbone contains phosphate groups that give the DNA on overall negative charge; for a protein to bind to the DNA, it needs to have a positive charge), and proteins that bind to the cell-wall or extra-cellular matrix.  The cell wall has an overall negative charge, and for a protein to stay inside the cell wall it helps to have a net positive charge.  
For a "normal" cell the theoretical isoelectric points (calculated from the types of sidechains present in the protein) looks as follows: 

IEP Histogram

2) The Haloarchaea follow the salt-in strategy, and as a consequence have one big peak in the histogram of IEPs at below pH 4.  The same is true for a group of archaea know as Nanohaloarchaea.  The placement of the Nanohaloarchaea remains uncertain.  They were initially considered the sister group to the Haloarchaea.  Later they were grouped with other recently discovered Archaea into the DPANN group (one of the Ns stands for Nanohaloarchaea).  The DPANN group allegedly is a deep branching group in the archaea. 
And more recently a paper found them to have evolved independently from the Haloarchaea from a methanogen ancestor.  The putative ancestral groups of methanogens are the

Methanomicrobiales
(e.g., Methanoregula boonei 6A8; Methanolinea tarda NOBI-1; Methanoculleus marisnigri JR1; Methanolacinia petrolearia DSM 11571; Methanofollis liminatans DSM 4140; Methanocorpusculum labreanum Z; Methanosphaerula palustris E1-9c) for the Haloarchaea, and the
Methanocellales (e.g., Methanocella arvoryzae MRE50; Methanocella paludicola SANAE; Methanocella conradii HZ254), and
Nanohaloarchaea
A third possibly ancestral group to the Halobacteria are the
Methanosarcinales
(e.g., Methanosarcina barkeri; Methanohalobium evestigatum Z-7303; Methanosalsum zhilinae DSM 4017, Methanosaeta thermophila PT)
Recently a new group of methanogenes was described, the Methanonatronarchaeia (aka Methanonatronarcheia). The placement of this group inside the Archaea remains controversia (here, here and here).
Another group of possible interest are the Marine Group II and Marine Group III archaea (Ca. Poseidoniales ord. nov.),  marine archaea (related to the Thermoplasmatales).   

Your task is to test, if any of these ancestor groups contain species which appear to be on the path towards a salt-in strategy. 

Every student should analyze at least one haloarchaeal (the group is still called Halobacteria), one nanohaloarchaeal, and one genome from each of the proposed ancestral groups (Methomicrobiales, Methanocellales, Methanosarcinales, Methanonatronarchaeia). 
We do not restrict ourselves to completely sequenced genomes! In addition, feel free to use any other genome you are interested in (halophilic bacteria, acidophiles (how would you detect a proton in strategy?), human, yeast, ....
Also, the modified version of the script analyzes all faa files present in a directory automatically; therefore, feel free to download as many faa files (the only additional work is to rename the files, so you easily recognize which profile is from which organism)!

EMBOSS is installed on the cluster. Here is a list of programs in EMBOSS. Today we will be using pepstats. Click on its entry in the list to see the command line arguments.

First we will download the encoded proteins from the genome we will analyze

Go to the NCBI's current genome list.

Click on the "Prokaryotes" tab. Click on Filters in the upper right corner, Check Assembley level "Complete", "Chromosome", and "Scaffold".

Use the use the "Search by organism" box to narrow to a taxonomic group. (Note that after a "Search by organism", one might need to repeat the process of clicking on the "Prokaryotes" tab, and re-tick the filters genomes box.)

Then look for the R and G links in the far right-hand column (you will probably have to scroll to the right). The R takes you to a listing of all the refseq files for a genome project (R is referred over G). If you select an organism for which only the G link is available, and if this link does not include an faa.gz file, select a different strain.

You want to download the file ending in ".faa.gz". This "faa" file contains all of the proteins coded by a genome.

Download the file to your computer, uncompress the file, and rename it using the Genus_species_strain designation. Remember not to use spaces or special characters in the name!

Copy the name and the link address into the answer field below!  (e.g., 
Acidiphilium_cryptum ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/016/725/GCF_000016725.1_ASM1672v1/GCF_000016725.1_ASM1672v1_protein.faa.gz )

Using filezilla (transfer.cam.uchc.edu, username mcb3421usrXX), create a directory for lab12, and transfer the faa files into the directory.

Which organisms did you select, what are the links that you used:


PuTTY to xanadu-submit-ext.cam.uchc.edu

login

srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash

cd lab12

If you did not unzip the faa files on your computer, do this now:
gunzip *.gz (the .gz suffix means this text file is compressed, so uncompress it)

more the_name_of_one_of_your_genomes.faa (inspect the first few lines of the faa file, type "q" to exit) (...and space to go forward) (...and "b" to go back)

Today we will be using programs from the emboss package, and R scripts. Thus we need to load the corresponding modules:

module load R
module load emboss
module load perl 
(To check the modules available: module avail)

We will use the following scripts today

run_pepstats.pl
parse_pepstats.pl
parse_pepstats_mod2.pl
histogramScript_pdf.R

Either use curl -O name_of_link , or filezilla to get these scripts into the lab12 directory

finally, here is the exciting pepstats command:
pepstats the_name_of_one_of_your_genomes.faa -outfile the_name_of_one_of_your_genomes.pepstats
more the_name_of_one_of_your_genomes.pepstats

(inspect the pepstats file, type space to go forward)
(...and "b" to go back)
(...and "q" to exit)

Now we need a program to extract the isoelectric points (amongst other stuff). It's called parse_pepstats.pl. It will work provided the output of pepstats is in a file ending in ".pepstats" (remember that is what you named it above, type "ls" to confirm). Read through the parse_pepstats.pl script. Try to figure out how the program finds the values for the theoretical isoelectric points in the pepstats output.

perl parse_pepstats.pl (run the script, and extract the isoelectric points)
ls -l (you made a bunch of additional files)
head the_name_of_one_of_your_genomes.pepstats.pI (the first 10 isoelectic points!)
head
the_name_of_one_of_your_genomes.pepstats.parsed (the columns are the accession number of the protein, the length of the protein, the theoretical isoelectric point, and fraction of positively charged residues)

Use filezilla and drag the file from the lab12 folder containing the isoelectric points (.pepstats.pI ending) and the table with the parsed output (ending on .parsed) to your computer.
Load the .pI and .parsed into Excel.

Make histograms of the pI data in Excel, (remember to select "All Files" to see it), use Insert -- Statistic Chart (all-blue column chart icon in the Charts section) -- Histogram.

This is a rather tedious procedure that can be easily automated:

run_pepstats.pl (is a script that runs pepstats on all faa files in the directory, use nano or more to inspect the script)

parse_pepstats_mod2.pl (runs parse_pepstats on all pepstat output files, reformats the .pI file and hands it over to an R script that makes histograms, and finally, renames the histograms).

Briefly read through the scripts (they should be already in the lab11 directory) to understand what they do.


ls                                (to make sure the scripts are in your directory)
perl run_pepstats.pl              (runs pepstat on all *.faa files in the lab9 directory)
ls                                (to check which files were created)
perl parse_pepstats_mod2.pl       (run parse pepstats on all files and extract the isoelectric points and creates a histogram)
ls -l                             (you made a bunch of additional files)
head G_species.pepstats.pI        (the first 10 isoelectic points!)
head G_species.pepstats.parsed    (check the table that summarizes the results for each protein) 
            (for each pepstats file a pdf file containing the histogram should be created in the folder)
            (move the pdf, .pI and parsed files to your PC and inspect them using acrobat, excel or similar)

Use filezilla to drag the file(s) from the lab12 folder containing the isoelectric points (.pepstats.pI ending) and the table with the parsed output (ending on .parsed) and the .pdf files with the histograms to your computer.

For at least three of the analyzed genomes, describe the distribution of isoelectric points. How many peaks?
Why might there be a minimum at around pH7.5?
Compare your finding with others in the class.
Check a few of the ORFs with very alkaline theoretical isoelectric point (the *.parsed outfile contains accession numbers in the first column, sort on the IEP, using entrez Protein with the accession number will get you to the genbank entry for that protein; if you seem to be stuck with hypothetical proteins, pick proteins that are longer).
What functions do these genes have?
Which charge would these proteins have at neutral pH? Can you see a pattern in the types of enzymes?


Place the histograms of the theoretical isoelectric points into your notebook. 

2) Tree Images in Figtree

Figtree is a very useful program to create publication quality trees.  Is also can save them as vector graphics, which makes it easy to edit them in Inkscape or Illustrator.  See slides here

1) Download and install figtree from the github page

2) Open Figtree and copy the extein tree below into Figtree (copy/paste)

Extein tree:
(PA_Pharsalus:0.0000021695,(((((((CA_Youngblood:0.0007389017,DC_Kostya:0.0000021695)26:0.0000021695,((NJ_TeardropMSU:0.0000021695,PA_Mindy:0.0000021695)73:0.0007299594,KY_Icee:0.0007297887)19:0.0000021695)63:0.0007487685,NC_Command613:0.0099892160)26:0.0007521299,((NY_Tuco:0.0007274828,ME_IHOP:0.0000021695)33:0.0014822120,((((VA_Toto:0.0000021695,ME_Myrale:0.0000021695)16:0.0000021695,PA_Henry:0.0000021695)6:0.0000021695,(NJ_Dusk:0.0000021695,IL_DrDrey:0.0000021695)11:0.0000021695)7:0.0000021695,(RI_BadStone:0.0000021695,CT_ABCat:0.0000021695)16:0.0000021695)58:0.0000021695)30:0.0007461349)14:0.0007470694,PA_CrystalP:0.0000021695)10:0.0000021695,(NY_Miniwave:0.0007223984,((VA_NelitzaMV:0.0000021695,GA_Traaww1:0.0007246375)44:0.0014760113,WA_Rimmer:0.0000021695)61:0.0015323681)43:0.0015889701)13:0.0022451928,(ME_Phaja:0.0000021695,(SouthAfrica_BaboJay:0.0000021709,(((((CA_xkcd:0.0006370142,(((((VA_Rakim:0.0007198987,(WA_Petra64142:0.0038696953,(VA_Easy2Say:0.0007445415,(PA_Cjw1:0.0030100050,MT_Nimrod:0.0025433828)89:0.0033513392)47:0.0000021695)52:0.0024440353)22:0.0007041544,RI_Kanye:0.0039834044)21:0.0008356508,((CA_Murica:0.0000021695,CA_HufflyPuff:0.0000021695)89:0.0000021695,WA_Buck:0.0023155053)55:0.0015039979)15:0.0000021695,MT_Phaux:0.0000021695)26:0.0000021695,((CA_PhatBacter:0.0000021695,CA_MPhalcon:0.0000021695)44:0.0000021695,CA_FireRed:0.0000021695)96:0.0030518329)60:0.0016234527)69:0.0024210906,(SouthAfrica_Goku:0.0000021695,SouthAfrica_Eureka:0.0000021695)100:0.0066971945)41:0.0008081659,(((((CA_Willez:0.0000021695,CA_Bruin:0.0000021695)20:0.0000021695,((CA_Sassay:0.0000021695,CA_Mosby:0.0000021695)21:0.0000021695,CA_Contagion:0.0000021695)10:0.0000021695)99:0.0046911789,(((CA_Terminus:0.0000021695,CA_Misfit:0.0000021695)40:0.0000021695,CO_Tomaszewski:0.0000021695)74:0.0007390456,ME_ShereKhan:0.0000021695)73:0.0000021695)54:0.0015322692,CA_Nala:0.0007711033)13:0.0000021695,CA_MISSy:0.0022618887)20:0.0000021695)94:0.0056762426,GA_Sotrice96:0.0000021695)100:0.0130795493,SouthAfrica_Manda:0.0000025212)100:0.0220515140)34:0.0014967017)29:0.0007549360)7:0.0000021695,(((FL_Harella:0.0000021695,NC_Glexan:0.0000021695)22:0.0000021695,(NJ_Gator:0.0000021695,VA_Flypotenuse:0.0000021695)16:0.0000021695)37:0.0000021695,NV_Cookies:0.0007367395)38:0.0007276793);

3) Do the following: 

As many of the branches have very low support values, we can display the support in a different way

Figtree is intelligent. 
The tree we have displayed now was calculated from genes in genomes of tailed actinophage (viruses that infect actinobacteria) cluster E. The gene encodes a "terminase" (involved in packaging the DNA into the phage heads)
The names are composed of the state (or other indication of from where the phage was isolated).  The gene has been, surprize, invaded by an intein.  The displayed tree was calculated from the extein only. 
The intein sequences have the same names, but obviously the tree is expected (and it is) rather different.

Open a new Figtree window (leave the old tree open in its window!)

Copy paste the intein tree (below) into the empty Figtree window. 

Intein tree:
(PA_Pharsalus:0.0010157623,((((((CA_Youngblood:0.0000023697,KY_Icee:0.0000020247)0:0.0000020164,((((ME_IHOP:0.0000023697,PA_CrystalP:0.0000023697)0:0.0000023697,IL_DrDrey:0.0000023697)0:0.0000020139,CA_FireRed:0.0000023697)0:0.0000023697,((NJ_Gator:0.0000023697,VA_Flypotenuse:0.0000023766)3:0.0000022778,CA_MISSy:0.0000023697)0:0.0000020179)0:0.0000020160)0:0.0000020060,(((((NY_Tuco:0.0000023697,GA_Sotrice96:0.0000023697)0:0.0000023697,(VA_Toto:0.0000024910,NJ_TeardropMSU:0.0000023697)6:0.0000023697)0:0.0000021212,CA_Bruin:0.0000023697)0:0.0000020161,(FL_Harella:0.0000023697,((((((((NC_Glexan:0.0000023697,CA_Sassay:0.0000023697)0:0.0000023697,CO_Tomaszewski:0.0010161926)0:0.0000023697,CT_ABCat:0.0000023697)0:0.0000023697,NJ_Dusk:0.0000023697)0:0.0000023697,CA_Contagion:0.0000023697)0:0.0000023697,(MT_Phaux:0.0000023697,(CA_Murica:0.0000023697,RI_Kanye:0.0000023697)3:0.0000023697)0:0.0000022628)0:0.0000023697,CA_HufflyPuff:0.0000023697)0:0.0000023697,(CA_Misfit:0.0000023697,CA_MPhalcon:0.0000020389)0:0.0000020247)0:0.0000023697)0:0.0000020326)0:0.0000020277,(ME_ShereKhan:0.0000023697,VA_Rakim:0.0000023697)3:0.0000023697)0:0.0000022952)0:0.0000023697,(ME_Myrale:0.0000023697,CA_Nala:0.0000023697)3:0.0000023239)0:0.0000023697,((((((((ME_Phaja:0.0000023697,CA_xkcd:0.0000023697)1:0.0000023697,CA_Mosby:0.0000022786)0:0.0000023697,PA_Mindy:0.0000023697)0:0.0000020167,((RI_BadStone:0.0000023697,CA_PhatBacter:0.0000023697)1:0.0000023697,CA_Willez:0.0000023697)0:0.0000022376)0:0.0000020155,NV_Cookies:0.0000023697)0:0.0000020541,NC_Command613:0.0000023697)0:0.0000020336,(DC_Kostya:0.0000023697,PA_Henry:0.0000023697)3:0.0000023697)0:0.0000021281,CA_Terminus:0.0000023697)0:0.0000022223)3:0.0000023697,(NY_Miniwave:0.0009978895,((SouthAfrica_BaboJay:0.0000020393,SouthAfrica_Manda:0.0000020286)88:0.0022320460,(SouthAfrica_Goku:0.0000021165,SouthAfrica_Eureka:0.0000022045)88:0.0000021122)96:0.0045939461)13:0.0000022390)3:0.0000020074,(((VA_NelitzaMV:0.0000023697,(WA_Petra64142:0.0000020924,PA_Cjw1:0.0000020268)23:0.0000020412)40:0.0010005115,(((GA_Traaww1:0.0010081859,VA_Easy2Say:0.0000023697)5:0.0000020185,WA_Rimmer:0.0000020912)6:0.0000020028,MT_Nimrod:0.0000023697)29:0.0000020741)59:0.0010117882,WA_Buck:0.0000021882)32:0.0010033939);

If everything works as expected, the colors selected for the leaves on the extein tree are applied to the new tree.
Under file select new.  Copy past the intein tree (in Newick format into the figtree window)

Re-root the tree in one of the longer branches. Set the root branch length to zero.   "Colour" the branches based on support values (see above). 

-> the clans defined by the extein tree are not reflected in the intein phylogeny.  The intein was transferred several times between the red and green clans.   Note the nearly identical intein between red and green leaves.

Give two examples for very similar inteins from the two distinct extein clans.  Below list at least two pairs of actinophages with divergent extein sequences but with similar inteins (Blue and Red leaves next to each other on the intein tree) reflecting recent intein sharing: 


    Finished?

 

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.