Your name: Your email address:
See these slides for some background.
Underlying today's exercise are two observations: 1) Some organisms that live in environments with very high salt concentrations follow a salt-in strategy. Instead of keeping salt out of the cell, they accumulate very high concentrations of KCl (>4 M salt). A consequence of the high salt concentration is that the reach of a charge is very short (the so-called Debye length; using the formula given here, the Debye length is in the Angstrom range inside the cytoplasm of an organism following the salt-in strategy). To compensate for this, organisms following the salt-in strategy have many negatively charged amino acids in their proteins (or in most of them). These negatively charged amino acids (sidechains containing a COO- group) lead to an isoelectric point for the proteins at a low (acidic) pH. (I.e., one would need to move the pH of the solution to around pH 2 for the overall protein to have zero charge.)
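To get a feel for the numbers, the Debye length can be estimated with a few lines of Python. This is a minimal sketch that assumes the standard Debye-Hueckel expression for a 1:1 salt in water at 25 °C (it may differ in detail from the formula linked above), and it uses the dielectric constant of pure water, which is only a rough approximation at multi-molar salt. The function name `debye_length_nm` is made up for this illustration.

```python
import math

# Physical constants in SI units.
ELEMENTARY_CHARGE = 1.602176634e-19      # C
BOLTZMANN = 1.380649e-23                 # J/K
AVOGADRO = 6.02214076e23                 # 1/mol
VACUUM_PERMITTIVITY = 8.8541878128e-12   # F/m


def debye_length_nm(ionic_strength_molar, temperature_k=298.15, rel_permittivity=78.4):
    """Debye length in nm for a given ionic strength in mol/L.

    Assumption: dilute-solution Debye-Hueckel theory with the dielectric
    constant of pure water; both are rough at multi-molar salt.
    """
    ionic_strength_si = ionic_strength_molar * 1000.0  # mol/L -> mol/m^3
    numerator = rel_permittivity * VACUUM_PERMITTIVITY * BOLTZMANN * temperature_k
    denominator = 2.0 * AVOGADRO * ELEMENTARY_CHARGE ** 2 * ionic_strength_si
    return math.sqrt(numerator / denominator) * 1e9


# For a 1:1 salt such as KCl the ionic strength equals the concentration.
for salt in (0.15, 1.0, 4.0):
    print(f"{salt:4.2f} M: {debye_length_nm(salt) * 10:.1f} Angstrom")
```

Under these assumptions the estimate drops from roughly 8 Angstrom at 0.15 M salt to roughly 1.5 Angstrom at 4 M, which is consistent with the "Angstrom range" mentioned above.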
Aside: most proteins have a negative charge at the pH at which they normally exist. The overall negative charge prevents proteins from clumping together. Exceptions are proteins that bind to DNA (the backbone contains phosphate groups that give the DNA an overall negative charge; for a protein to bind to DNA, it needs to have a positive charge), and proteins that bind to the cell wall or extracellular matrix. The cell wall has an overall negative charge, and for a protein to stay inside the cell wall it helps to have a net positive charge. For a "normal" cell the theoretical isoelectric points (calculated from the types of sidechains present in the protein) look as follows:
2) The Haloarchaea follow the salt-in strategy and, as a consequence, have one big peak in the histogram of IEPs below pH 4. The same is true for a group of archaea known as Nanohaloarchaea. The placement of the Nanohaloarchaea remains uncertain. They were initially considered the sister group of the Haloarchaea. Later they were grouped with other recently discovered Archaea into the DPANN group (one of the Ns stands for Nanohaloarchaea). The DPANN group allegedly is a deep-branching group in the Archaea. More recently, a paper found them to have evolved independently of the Haloarchaea from a methanogen ancestor. A recent article from the Gogarten lab on the origin of Haloarchaea is here.
In today's exercise we will study the proteomes from Haloarchaea and from groups possibly related to the Haloarchaea for their IEP profiles:
Nanohaloarchaea. A group of small archaea that live in close association with Haloarchaea (ectosymbionts). In 16S rRNA and ATPase phylogenies they are often recovered with the Haloarchaea.
Hikarchaeia. A recently described group (from MAGs) that is likely the sister group of the Haloarchaea. A paper describing them is here. (Note: the genomes the authors submitted do not include any ORFs; the ORFs in the attached files were generated by Yutian Feng from the Gogarten Lab.)
Methanonatronarchaeia. A recently described group of methanogens. The placement of this group inside the Archaea remains controversial (here, here and here).
Other groups of possible interest are the Marine Group I, Marine Group II, and Marine Group III archaea (Ca. Poseidoniales ord. nov.), marine archaea (related to the Thermoplasmatales).
Your task is to determine whether any of these groups contain species that appear to be on the path towards a salt-in strategy.
Every student should analyze at least one haloarchaeal genome (the group is still called Halobacteria), one nanohaloarchaeal genome, and several genomes from each of the proposed ancestral groups. A selection of multiple fasta files for different genomes is here. We do not restrict ourselves to completely sequenced genomes! In addition, feel free to use any other genome you are interested in (halophilic bacteria, acidophiles (how would you detect a proton-in strategy?), human, yeast, ...). Also, the modified version of the script analyzes all faa files present in a directory automatically; therefore, feel free to download as many faa files as you like (the only additional work is renaming the files so that you can easily recognize which profile is from which organism)!
EMBOSS is installed on the cluster. Here is a list of programs in EMBOSS. Today we will be using pepstats. Click on its entry in the list to see the command line arguments.
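Before running pepstats, it may help to see what a "theoretical isoelectric point" calculation looks like in principle. The sketch below sums Henderson-Hasselbalch charges over the ionizable groups and searches for the pH of zero net charge by bisection. The pKa values are approximate textbook numbers chosen for illustration; pepstats uses its own pKa table, so its results will differ somewhat. The function names are hypothetical.

```python
# Approximate textbook pKa values (illustrative assumptions only;
# EMBOSS pepstats uses its own table).
POSITIVE_PKA = {"K": 10.5, "R": 12.5, "H": 6.0}
NEGATIVE_PKA = {"D": 3.9, "E": 4.3, "C": 8.3, "Y": 10.1}
N_TERMINUS_PKA = 8.0
C_TERMINUS_PKA = 3.1


def net_charge(sequence, ph):
    """Net charge of a peptide at the given pH (Henderson-Hasselbalch)."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERMINUS_PKA))    # free amino terminus
    charge -= 1.0 / (1.0 + 10 ** (C_TERMINUS_PKA - ph))   # free carboxyl terminus
    for residue in sequence.upper():
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


def isoelectric_point(sequence, tolerance=0.001):
    """Find the pH of zero net charge by bisection between pH 0 and 14."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        middle = (low + high) / 2.0
        if net_charge(sequence, middle) > 0:
            low = middle    # still positive: the pI lies at higher pH
        else:
            high = middle   # negative or zero: the pI lies at lower pH
    return (low + high) / 2.0


# Made-up peptides: one rich in Asp/Glu, one rich in Lys/Arg.
for peptide in ("MDDEEDDEEL", "MKKRRKKRRL"):
    print(peptide, round(isoelectric_point(peptide), 2))
```

Bisection works here because the net charge decreases monotonically as the pH rises, so there is exactly one zero crossing between pH 0 and 14.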
First we will download the encoded proteins from the genome we will analyze and/or transfer the .faa files to the xanadu cluster.
Optional: Go to the NCBI's current genome list.
Click on the "Prokaryotes" tab. Click on Filters in the upper right corner, and check Assembly level "Complete", "Chromosome", and "Scaffold".
Use the "Search by organism" box to narrow to a taxonomic group. (Note that after a "Search by organism", one might need to click on the "Prokaryotes" tab again and re-tick the filter boxes.)
Then look for the R and G links in the far right-hand column (you will probably have to scroll to the right). The R takes you to a listing of all the RefSeq files for a genome project (R is preferred over G). If you select an organism for which only the G link is available, and if this link does not include an faa.gz file, select a different strain.
You want to download the file ending in ".faa.gz". This "faa" file contains all of the proteins coded by a genome.
Download the file to your computer, uncompress the file, and rename it using the Genus_species_strain designation. Remember not to use spaces or special characters in the name!
Using filezilla (transfer.cam.uchc.edu, username mcb3421usrXX), create a directory for lab13, and transfer the faa files into this directory.
Which organisms did you select? Why are these interesting?:
PuTTY to xanadu-submit-ext.cam.uchc.edu
login
srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash
cd lab13
If you did not unzip the faa files on your computer, do this now: gunzip *.gz (the .gz suffix means this text file is compressed, so uncompress it)
more the_name_of_one_of_your_genomes.faa (inspect the first few lines of the faa file, type "q" to exit) (...and space to go forward) (...and "b" to go back)
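If you would rather check the faa file programmatically than page through it with more, a small FASTA reader is enough. This is a minimal sketch; the function name `read_fasta` and the example records are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Tiny made-up example; with a real file, pass an open file handle instead,
# e.g. open("the_name_of_one_of_your_genomes.faa").
example = [
    ">prot_1 hypothetical protein",
    "MKDEE",
    "LLDD",
    ">prot_2 hypothetical protein",
    "MRKK",
]
records = list(read_fasta(example))
print(len(records), "proteins")
for header, sequence in records:
    print(header.split()[0], len(sequence))
```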
Today we will be using programs from the emboss package, and R scripts. Thus we need to load the corresponding modules:
module load R/4.0.3
module load emboss/6.6.0
module load perl
We will use the following scripts today
run_pepstats.pl
parse_pepstats.pl
parse_pepstats_mod2.pl
histogramScript_pdf.R
These scripts are available in this archive. Move them into the lab13 directory.
Finally, here is the exciting pepstats command:
pepstats the_name_of_one_of_your_genomes.faa -outfile the_name_of_one_of_your_genomes.pepstats
more the_name_of_one_of_your_genomes.pepstats
(inspect the pepstats file, type space to go forward) (...and "b" to go back) (...and "q" to exit)
Now we need a program to extract the isoelectric points (amongst other stuff). It's called parse_pepstats.pl. It will work provided the output of pepstats is in a file ending in ".pepstats" (remember that is what you named it above, type "ls" to confirm). Read through the parse_pepstats.pl script. Try to figure out how the program finds the values for the theoretical isoelectric points in the pepstats output.
perl parse_pepstats.pl (run the script, and extract the isoelectric points)
ls -l (you made a bunch of additional files)
head the_name_of_one_of_your_genomes.pepstats.pI (the first 10 isoelectric points!)
head the_name_of_one_of_your_genomes.pepstats.parsed (the columns are the accession number of the protein, the length of the protein, the theoretical isoelectric point, and the fraction of positively charged residues)
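For comparison with the Perl script, here is a minimal Python sketch of the same idea. It is not a translation of parse_pepstats.pl. It assumes that each pepstats report starts with a line of the form "PEPSTATS of <name> from ..." and contains a line of the form "Isoelectric Point = <value>"; check this against your own .pepstats file. The sample text and its numbers are made up.

```python
import re

# Assumed line layouts of a pepstats report (verify against your own output).
HEADER = re.compile(r"^PEPSTATS of (\S+)")
PI_LINE = re.compile(r"^Isoelectric Point\s*=\s*([0-9]+(?:\.[0-9]+)?)")


def parse_isoelectric_points(lines):
    """Return a list of (protein_id, pI) tuples from pepstats report lines."""
    results, current_id = [], None
    for line in lines:
        header = HEADER.match(line)
        if header:
            current_id = header.group(1)   # remember which protein we are in
            continue
        pi_match = PI_LINE.match(line)
        if pi_match and current_id is not None:
            results.append((current_id, float(pi_match.group(1))))
    return results


# Made-up excerpt in the assumed layout.
sample = """PEPSTATS of prot_1 from 1 to 120
Molecular weight = 13000.00 Residues = 120
Isoelectric Point = 4.12
PEPSTATS of prot_2 from 1 to 80
Isoelectric Point = 10.5
""".splitlines()

for protein_id, pi_value in parse_isoelectric_points(sample):
    print(protein_id, pi_value)
```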
Use filezilla to drag the files containing the isoelectric points (ending in .pepstats and .pI) and the table with the parsed output (ending in .parsed) from the lab13 folder to your computer. Load the .pI and .parsed files into Excel.
Make histograms of the pI data in Excel (remember to select "All Files" to see the file in the Excel load window): use Insert -- Statistic Chart (all-blue column chart icon in the Charts section) -- Histogram.
Select a few proteins with very alkaline theoretical IEP, copy their accession number, and then use Entrez to determine the function of these proteins. (see the questions below).
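Instead of sorting in Excel, the most alkaline proteins can also be pulled out of the .parsed table with a short script. The sketch below assumes whitespace-separated columns in the order described above (accession, length, isoelectric point, fraction of positively charged residues); adjust the splitting if your file uses a different layout. The function name and the example rows are made up.

```python
def most_alkaline(lines, count=5):
    """Return the `count` rows with the highest pI as (accession, length, pI)."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        try:
            rows.append((fields[0], int(fields[1]), float(fields[2])))
        except ValueError:
            continue  # skip a header line or a malformed row
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:count]


# Made-up rows in the assumed column order.
example = [
    "prot_1 250 4.21 0.08",
    "prot_2 95 11.40 0.25",
    "prot_3 410 9.80 0.17",
    "prot_4 130 4.05 0.07",
]
for accession, length, pi_value in most_alkaline(example, count=2):
    print(accession, length, pi_value)
```

Printing the length alongside the pI makes it easier to favor longer proteins, which are less likely to be short hypothetical ORFs.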
This is a rather tedious procedure that can be easily automated:
run_pepstats.pl (is a script that runs pepstats on all faa files in the directory, use nano or more to inspect the script)
parse_pepstats_mod2.pl (runs parse_pepstats on all pepstats output files, reformats the .pI file, hands it over to an R script that makes histograms, and finally renames the histograms).
Briefly read through the scripts (they should be already in the lab13 directory) to understand what they do.
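The core idea of a batch script like this is a loop over all faa files. The following Python sketch is not the course's run_pepstats.pl; it only illustrates the pattern, reusing the same pepstats invocation shown earlier in this exercise. The function names are hypothetical, and `run_all` only works after the emboss module has been loaded.

```python
import subprocess
from pathlib import Path


def pepstats_commands(directory="."):
    """Build one pepstats command per *.faa file in `directory`."""
    commands = []
    for faa in sorted(Path(directory).glob("*.faa")):
        outfile = faa.with_suffix(".pepstats")  # Genus_species.faa -> Genus_species.pepstats
        commands.append(["pepstats", str(faa), "-outfile", str(outfile)])
    return commands


def run_all(directory="."):
    """Run the commands one after another; stops at the first failure."""
    for command in pepstats_commands(directory):
        subprocess.run(command, check=True)


# Show what would be run in the current directory without running it.
for command in pepstats_commands("."):
    print(" ".join(command))
```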
ls (to make sure the scripts are in your directory)
perl run_pepstats.pl (runs pepstats on all *.faa files in the lab13 directory)
ls (to check which files were created)
perl parse_pepstats_mod2.pl (runs parse_pepstats on all files, extracts the isoelectric points, and creates a histogram)
ls -l (you made a bunch of additional files)
head G_species.pepstats.pI (the first 10 isoelectric points!)
head G_species.pepstats.parsed (check the table that summarizes the results for each protein)
(For each pepstats file, a pdf file containing the histogram should be created in the folder.)
(Move the pdf, .pI and .parsed files to your PC and inspect them using Acrobat, Excel or similar.)
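To compare genomes numerically as well as by eye, the pI values can be binned and summarized in a few lines. This is a sketch only: the 0.5 pH bin width and the pH 5 cutoff are arbitrary choices for illustration, the example values are made up, and reading real data assumes one pI value per line in the .pI file, which you should verify with head first.

```python
import math
from collections import Counter


def bin_pi_values(values, bin_width=0.5):
    """Count pI values per bin; keys are the lower edge of each bin."""
    counts = Counter()
    for value in values:
        lower_edge = math.floor(value / bin_width) * bin_width
        counts[round(lower_edge, 6)] += 1
    return dict(sorted(counts.items()))


def acidic_fraction(values, cutoff=5.0):
    """Fraction of proteins with a pI below `cutoff` (hypothetical threshold)."""
    return sum(1 for value in values if value < cutoff) / len(values)


# Made-up pI values for illustration.
example = [3.9, 4.1, 4.2, 4.4, 4.6, 6.8, 9.7, 10.3]
for edge, count in bin_pi_values(example).items():
    print(f"{edge:4.1f}-{edge + 0.5:4.1f} {'#' * count}")
print(f"fraction below pH 5: {acidic_fraction(example):.2f}")
```

A single summary number such as the acidic fraction cannot replace looking at the histogram, but it makes it easier to rank many genomes by how strongly their proteomes are shifted towards low pI.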
Use filezilla to drag the file(s) containing the isoelectric points (ending in .pepstats.pI), the table with the parsed output (ending in .parsed), and the .pdf files with the histograms from the lab13 folder to your computer.
For at least three of the analyzed genomes, describe the distribution of isoelectric points. How many peaks? Why might there be a minimum at around pH 7.5? Compare your findings with others in the class.
Check a few of the ORFs with a very alkaline theoretical isoelectric point (the *.parsed outfile contains accession numbers in the first column; sort on the IEP; using Entrez Protein with the accession number will get you to the GenBank entry for that protein; if you seem to be stuck with hypothetical proteins, pick proteins that are longer). What functions do these genes have? Which charge would these proteins have at neutral pH? Can you see a pattern in the types of enzymes?