MCB 5472 : Determining dN/dS ratios along a sequence

We will use the sequences encoding the intein in the vacuolar ATPase in yeast. A file containing the nucleotide sequences is here.

Download the file and create a codon based alignment in seaview. (Load the sequences, under properties select view as protein, align using muscle, view as nucleotides, save as nexus, fst (fasta), mase and phylip formated files).

In the musle aligned file (here in phylip format with short names) the conserved blocks are at nucleotide positions 1-39 211-252 694-729 1324-1353 1444-1500 1762-1803 and 1810-1830; The LAGLIDADG motifs of the homing endonuclease are at pos 694-729 1324-1353, corresponding to codon 232-243 and 442-451.

Intro slides for today are here

Option 1: Use the data monkey

HyPhy is a program package that allows you to use maximum likelihood analyses to test hypothesis about the evolution of sequences, including different tests for types of selection, models for substitution, molecular clocks, and recombination. HyPhy (pronounced hifi) is most easily run on the data monkey server. If you have a more complicated model you want to test (e.g., different parts of a multiple sequence alignment follow different substitution models), you can use a GUI version that can be installed on your Windows or Mac computer. HyPhy allows the use of maximum likelihood ratio tests, and the use of simulations under a null hypothesis (parametric bootstrapping) to test the significance of results.

On the datamonkey webpage, click on get started, then choose "selection", then "sites", then "pervasive", then "small", then select the FEL model (which does a good testing of hypotheses see the background pages for more information).

In the output table, alpha gives the rate of synonymous substitutions and beta the rate of non-synonymous substitutions. This might take some time, results from an earlier analysis might still be here or here.

NOTE: the interface also allows you to test individual branches, or parts of the tree.

Which sites have the highest rates of non-synonymous substitution? Are the LAGLIDAG motifs under negative (purifying) selection?

Option 2: Use PAML on the cluster

Transfer the phylip formated file PAML compatible (with short names, missing data as ? and an I for interleaved in the first line) (here) into a directory on the cluster. A phylogentic tree calculated from these sequences (GTR + G +I) is here.

The control file to run three models (one omega, two omegas (one for purifying selection one neutral) and 3 omegas (one for positive selection) is here

NOTE: codeml is rather sensitive on the tree and sequence format. The treefile should be in Newick format, but not contain labels for the support values.

To run codeml (part of paml) to test different models that assume one, two, or three types of sites with respect to dN/dS ratios, transfer the sequence, tree and control file into a directory on the cluster. qlogin, change to this directory, and run codeml by executing codeml codeml.ctl.

The program takes about 5 minutes for the 1 omega model, about 20 minutes for the model that includes sites with an omega of 1, and 1.5 hours for the model tha considers positive selection. The outfiles generated by the program are here.

What is the estimated omega when all sites are considered equal, what is the fraction of sites that are nearly neutral (in the second model) and what is the estimated omega for the sites under purifying selection? Are there any sites for which there is significant support that the site may be under positive selection?

Option 3: Use MrBayes

The MrBayes manual is here

A file of the intein sequences in nexus format is here (this is the file format that MrBayes and PAUP use). The file was created in seaview.
I added the following MrBayes block:

begin mrbayes;
lset nst=2 rates=gamma nucmodel=codon omegavar=Ny98;
report possel = yes siteomega = yes;
mcmcp filename=analysisS;
mcmcp samplefreq=100 printfreq=100 diagnfreq=500;
mcmc ngen=100000;
mcmcp savebrlens=yes;
end;

This directs MrBayes to run the Yang model and to report omega and significance values for each codon.

To run the analysis:

move the nexus formated file into a directory on the cluster,
change into this directory,
load the MrBayes module for version 3.2.6. module load mrbayes/3.2.6
start MrBayes by typing mb
then type execute Yeast_vma1_intein_withMotifs.nxs
check that everything is loaded as expected
start the Metropolis coupled Monte Carlo Markov Chain by typing mcmc

After the specified number of generations, the program halts and waits for your input (continue the run, yes or no).

This takes at least a couple of hours, for an analysis that you want to rely on you should run this for several days! A criterion for halting the chains is the "average standard deviation of split frequencies". This number compares the results from two parallel runs, the closer to zero the value the better. Results from an overnight run are here.

After a run is finished, you can use the " sump " command (within MrBayes) to plot the logL vs. generation number. This allows to determine the necessary burnin (you want to discard those samples as "burnin" where the -logL is still rising steadily).
To see the whole logL curve, you need to set the burnin fraction to .02 . (type help sump at the mb commandline). sump burninfrac=.02

Rather than using the sump command, you also can download the tracer application that can read the parameter files from MrBayes, and provides an easy intuitive and interactive way to evaluate these files with respect to burnin and confidence intervals. The .p and .t files from an analyses that I ran on this dataset, and the terminal output from the sump and sumt commands is here.
Reading the sumt output, can you figure out what the posterior probability is for the intein from AB093510.1_Saccharomyces_exiguus (seq 22) froms a clann with sequence #19 (XM_003683782.1_Tetrapisispora_phaffii)? Note that sequences from the genus Sacharomyces do not form a clann.

Load the .p files (both) into into tracer, you can load both parameter files at the same time, and select in the upper left field of the tracer application, if you want to analyze one or both of the parameter files.

Inspect the trace (select the right button over the graphics window) of lnL (select in the list of parameters on the left).

Determine the mean, median and 95% highest posterior density interval for the estimated omega for the sites under purifying selection.

Load the two parameter files (.p) into excel and copy the values for the generations after the burnin into a new spreadsheet.

Follow the guidance from the powerpoint slides and calculate the values estimated for omega for each codon, and the probability that the codon is under positive selection. ALTERNATIVELY, here is a spreadsheet with the generations >80000 from the two runs, and the average for the probability to be under positive selection pr+(xyz) and the omega(zyz) already calculated (you need to scroll to the right to get to these columns). The second sheet contains bar graphs of these values.

Are the regions that contain the LAGLIDAdG motifs under negative (purifying) selection?

2) Work on your student project!

Include a one sentence summary of what you did in your report.