Assignment 9 Handling genome (or larger) amounts of data -- Extracting text from other applications

Your name:
Your email address:

Histograms of isoelectric points (IEPs) of all proteins encoded in a genome OR finding organisms that follow a salt-in strategy.

See these slides for some background.

Underlying today's exercise are two observation:
1) some organisms that live in environments with very high salt concentrations follow a salt in strategy. Instead of keeping salt out of the cell, they accumulate very high concentration of KCl (>4M salt). A consequence of the high salt concentration is that the reach of a charge is very low (the so-called Debye lengths). Too compensate for this, organisms following the salt-in strategy have many negatively charged amino acids in their proteins (or in most of them). And these negatively charges aa (sidechains containing a COO- group) lead to an isoelectric point for the proteins at a low (acidic) pH. (I.e. one would need to move the pH of the solution to around pH 2 to have the overall protein with zero charge.)

Aside: most proteins have a negative charge at the pH at which they normally exist. The overall negative charge prevents proteins from clumping together. Exceptions are proteins that bind to the DNA (the backbone contains phosphate groups that give the DNA on overall negative charge; for a protein to bind to the DNA, it needs to have a positive charge), and proteins that bind to the cell-wall or extra-cellular matrix. The cell wall has an overall negative charge, and for a protein to stay inside the cell wall it helps to have a net positive charge.
For a "normal" cell the theoretical isoelectric points (calculated from the types of sidechains present in the protein) looks as follows:

2) The Haloarchaea follow the salt-in strategy, and as a consequence have one big peak in the histogram of IEPs at below pH 4. The same is true for a group of archaea know as Nanohaloarchaea. The placement of the Nanohaloarchaea remains uncertain. They were initially considered the sister group to the Haloarchaea. Later they were grouped with other recently discovered Archaea into the DPANN group (one of the Ns stands for Nanohaloarchaea). The DPANN group allegedly is a deep branching group in the archaea.
And more recently a paper found them to have evolved independently from the Haloarchaea from a methanogen ancestor. The two putatively ancestral groups of methanogens are the

Methomicrobiales
(e.g., Methanoregula boonei 6A8; Methanolinea tarda NOBI-1; Methanoculleus marisnigri JR1; Methanolacinia petrolearia DSM 11571; Methanofollis liminatans DSM 4140; Methanocorpusculum labreanum Z; Methanosphaerula palustris E1-9c) for the Haloarchaea, and the
Methanocellales (e.g., Methanocella arvoryzae MRE50; Methanocella paludicola SANAE; Methanocella conradii HZ254) or the Nanohaloarchaea.
A third possibly ancestral group to the Halobacteria are the
Methanosarcinales (e.g., Methanosarcina barkeri; Methanohalobium evestigatum Z-7303; Methanosalsum zhilinae DSM 4017, Methanosaeta thermophila PT)
Your task is to test, if any of these ancestor groups contain species which appear to be on the path towards a salt-in strategy.

Every student should analyze at least one haloarchaeal (the group is still called Halobacteria), one Nanohaloarchaeal, and one genome from each of the proposed ancestral groups (Methomicrobiales, Methanocellales, Methanosarcinales). We do not restrict ourselves to completely sequenced genomes! In addition feel free to use any other genome you are intersted in (halophilic bacteria, acidophiles (how would you detect a proton in strategy?).

EMBOSS is installed on the cluster. Here is a list of programs in EMBOSS. Today we will be using pepstats. Click on its entry in the list to see the command line arguments.

First we will download the encoded proteins from the genome we will analyze

Go to the NCBI's current genome list.

Click on the "Prokaryotes" tab. Click on Filters in the upper right corner, Check Assembley level "Complete", "Chromosome", and "Scaffold".

Use the use the "Search by organism" box to narrow to a taxonomic group. (Note that after a "Search by organism", one might need to repeat the process of clicking on the "Prokaryotes" tab, and re-tick the filters genomes box.)

Then look for the R and G links in the far right-hand column (you will probably have to scroll to the right). The R takes you to a listing of all the refseq files for a genome project. If you select an organism for whhich only the G link is available, and if this link does not include an faa.gz file, select a different strain.

You want to download the file ending in ".faa.gz". This "faa" file contains all of the proteins coded by a genome.

Right-click on the faa file, and select "Copy link location". Paste the links for the genomes you want to analyze into a textfile. Also note the name of the strain you selected.

Start the Bitvise SSH Client. In the "Host" field, type mcbsubmit.cam.uchc.edu
In the "Username" field, type your username: mcb3421usrXX, where XX is a number assigned to you. Login. It may ask you to accept a new host key. Now enter your password: xxx.

commands for the cluster via ssh: (bracketed parts are comments only!)

  srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash 
  mkdir lab9						    (make a new folder called "lab9")
	cd lab9								(equivalent of "double-clicking" a folder to descend "into" it)
	curl -O paste the link you copied from NCBI				(curl stands for "see URL", this downloads the NCBI link to the cluster)
	                                                        (...you could also accomplish this with the Bitvise SFTP client)
  [don't copy this mindlessly! It is a good idea to answer the first question 1st :-)] e.g. curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000940915.1_ASM94091v1/GCF_000940915.1_ASM94091v1_protein.faa.gz
  [comment] let's make the filename simpler and more descriptive for humans
	[comment] The easiest is to use the Organism's name (including the strain number. 
	[comment] REMEMBER TO USE THE TAB KEY TO COMPLETE LONG FILENAMES — this makes life much easier
	mv oldname.faa.gz newname.faa.gz						(changes the name of a file from oldname to newname
  (repeat the curl and mv commands for all the genomes (the .faa.gz files) you decided to analyze.)

Which organisms did you select, which are the links you copied (to use in the curl -O comand):


This is a sample of the commands to get 5 faa files into your folder (#not the typing is easier, when you use automatic line completion):
Do this with the genomes you selected! 
 
mkdir lab9
cd lab9
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/017/625/GCF_000017625.1_ASM1762v1/GCF_000017625.1_ASM1762v1_protein.faa.gz
mv GCF_000017625.1_ASM1762v1_protein.faa.gz Methanogegula_boonei6A8.faa.gz 
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/761/425/GCA_001761425.1_ASM176142v1/GCA_001761425.1_ASM176142v1_protein.faa.gz
mv GCA_001761425.1_ASM176142v1_protein.faa.gz Nanohaloarchaea_archaeon_SG9.faa.gz
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/805/GCF_000006805.1_ASM680v1/GCF_000006805.1_ASM680v1_protein.faa.gz
mv GCF_000006805.1_ASM680v1_protein.faa.gz Halobacterium_salinarum_NRC1.faa.gz
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/005/GCF_000011005.1_ASM1100v1/GCF_000011005.1_ASM1100v1_protein.faa.gz
mv GCF_000011005.1_ASM1100v1_protein.faa.gz Methanocella_paludicola_SANAE.faa.gz
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/355/655/GCF_002355655.1_ASM235565v1/GCF_002355655.1_ASM235565v1_protein.faa.gz
mv GCF_002355655.1_ASM235565v1_protein.faa.gz Halorubrum_trapanicum.faa.gz
gunzip *.gz

	ls -l									(make sure the files are there)
	gunzip *.gz								(the .gz suffix means this text file is compressed, so uncompress it)		
	ls -l									(the .gz suffix should be gone, and the file got larger)
	
	more G_species.faa					(inspect the first few lines of the faa file, type "q" to exit)
										(...and space to go forward)
										(...and "b" to go back)

Today we will be using programs from the emboss package, and R scripts. Thus we need to load the corresponding modules:

module load R/3.5.1
module load emboss/6.6.0
module load perl

(To check the modules available: module avail)

finally, here is the exciting pepstats command:

	pepstats G_species.faa -outfile G_species.pepstats
	more G_species.pepstats							(inspect the pepstats file, type space to go forward)
										(...and "b" to go back)
										(...and "q" to exit)

Now we need a program to extract the isoelectric points (amongst other stuff). It's called parse_pepstats.pl. It will work provided the output of pepstats is in a file ending in ".pepstats" (remember that is what you named it above, type "ls" to confirm. Read through the parse_pepstats.pl.txt script. Try to figure out how the program finds the values for the theoretical isoelectric points in the pepstats output.

	curl -O http://carrot.mcb.uconn.edu/mcb3421_2017/parse_pepstats.pl.txt		(download the perl script via link)
	perl parse_pepstats.pl.txt							(run the script, and extract the isoelectric points)
	ls -l										(you made a bunch of additional files)
	head G_species.pepstats.pI							(the first 10 isoelectic points!)

RECOMMENDED:
As an alternative, especially if you have many genomes that you want to analyze, you could use a modified script that calls an Rscript to calcualte and plot the histogram. The Perl script is parse_pepstats_mod2.pl.
The R script , called by parse_pepstats_mod2.pl, is histogramscript_pdf.R . I also have a script that runs pepstat run_pepstats.pl on all .faa files in the same directory.

Briefly read through the scripts to understant what they do.
Use the curl command to get the above scripts into your lab9 directory, then

	ls (to make sure the scripts are in your directory)
  perl run_pepstats.pl          							(runs pepstat on all *.faa files in the lab9 directory)
  ls to check which files were created
  perl parse_pepstats_mod2.pl					        (run parse pepstats on all files and extract the isoelectric points and creates a histogram)
	ls -l										        (you made a bunch of additional files)
	head G_species.pepstats.pI							(the first 10 isoelectic points!)
  head G_species.pepstats.parsed                      (check the table that summarizes the results for each protein)
                                                      (for each pepstats file a pdf file containing the histogram should be created in the folder)
                                                      (move the pdf, .pI and parsed files to your PC and inspect them using acrobat, excel or similar)
End RECOMMENDED:

Use bitwise and connect to transfer.cam.uchc.edu (using your username and password), using SFTP Window, drag the file from the lab9 folder containing the isoelectric points (.pepstats.pI ending) and the table with the parsed output (ending on .parsed) and, if you created them already, the .pdf files with the histograms to your computer.
Load the .pI and .parsed into Excel. (If you have the histograms already, skip the pI files not the .parsed files)

If you did not follw the Recommended route above, make histograms in Excel, after loading the .pI file (remember to select "All Files" to see it), use Insert -- Statistic Chart (all-blue column chart icon in the Charts section) -- Histogram.

For each of the genomes, describe the distribution of isoelectric points. How many peaks?
Why might there be a minimum at around pH7.5?
Compare your finding with others in the class.
Check a few of the ORFs with very alkaline theoretical isoelectric point (the *.parsed outfile contains accession numbers in the first column). What functions do these genes have?
Which charge would these proteins have at neutral pH? Can you see a pattern in the types of enzymes?

Finished?

Type exit to release the compute node from the queue.
If you you encountered problems in your session, check the queue for abandoned sessions using the command qstat. If there are abandoned sessions under your account, kill them by deleting them from the queue by typing qdel job-ID, e.g. "qdel 40000" would delete Job # 40000

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.