Assignment 12

Your name:
Your email address:

Today, we will start with exercise #2, followed by exercise #3, and continue with exercise #4 and #1, if time permits. (in case you did not do assignment #11, Exercize #1 is recommended)

Exercise 1:

Likelihood ratio tests

Open seaview, and load Yeast_vma1_all_seaview.mase. Note that these nucleotide sequences are already aligned based on their encoded amino acids. (To see the aa alignment click on view and check protein - return to the nucleotide sequence view). The dataset has three set of sites: all, intein, extein.
Collaborate with two of your neighbors, each of you will use a different set of sites (i.e., all, intein, extein).

We want to calculate two likelihood values using Trees in thePhyML menu:

Model: GTR, Across site rate variation: Optimized (4) — we'll call this "GTR+Gamma"
Model: GTR, Across site rate variation: Optimized (4) Invariable Sites Optimized — we'll call this "GTR+Gamma+I"

All other parameters are the same for the two runs: Support values: aLRTvalues, nucleotide equilibrium frequencies optimized, tree searching NNI, starting tree BioNJ, optimize tree topology.

Once the tree reconstruction is finished, DO NOT CLICK OK, rather click on copy at the bottom of the window, open a text editor and paste the content into the test window. We are interested in the log likelihood, and the last values estimated for the shape parameter, and the % invariable sites (if the latter two were estimated as part of the model).

One important condition that has to be fulfilled before one can use a Likelihood Ratio Test (LRT) to compare two models, is that the models should be "nested". This means that the simpler model must be a constrained version of the parameter-rich model. The likelihood ratio test is performed by doubling the difference in log-likelihood scores and comparing this test statistic with the critical value from a chi-squared distribution having degrees of freedom equal to the difference in the number of estimated parameters in the two models. The parameter-rich model will always have a better fit, due to the extra parameters and will therefore have the highest log-likelihood, so the difference should be a positive number. In this case there is 1 degree of freedom between each of the models — the gamma shape parameter is one parameter and the % invariant sites is the second parameter. Use this online chi-square calculator to determine the significance of the test.

Does the GTR+Gamma+I model explain the data significantly better than GTR+Gamma model?
Is this true for all of the the three sets of sites?

When you compare the trees you obtains for the intein, extein and the combined datasets, do you observe any differences?

Exercise 2:

Introduction to Bayesian Analyses

In class we will use MrBayes3.2 as installed on Xanadu. The goal of this exercise is to learn how to use MrBayes to reconstruct phylogenies and estimate parameters.

Log on to the mcbsubmit.cam.uchc.edu cluster using Bitvise SSH Client (ssh mcb3421usrx@mcbsubmit.cam.uchc.edu).
After you established an ssh connection to the terminal, log into into a compute node [srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash]
load the module for the latest version of the MrBayes program: module load MrBayes/3.2.7 (to see which modules are available on the cluster type module avail ).
Create a class12 directory on the cluster (mkdir class12).
The nexus formated file that we will use is here. Move/copy this file into the class 12 directory.
The easiest will be to use the curl command: curl -O http://gogarten.uconn.edu/mcb3421_2018/labs/Yeast_vma1_2partions.nxs
Have a look at the file in a text editor or using more - pay attention to the MrBayes block at the end of the file).
To start the program, in terminal change to the directory where the sequence data are located and type mb.
At the MrBayes command prompt type "execute Yeast_vma1_2partions.nxs". This will load the data into the program and execute the MrBayes block..

Here is an explanation of the commands at the end of sequence file in the MrBayes block:

charset intein = 856-2685;
charset extein = 1-855 2686-3705; [this assignes the alignment columns to the intein and extein partions.]
partition favored = 2: intein, extein; [the two partition scheme is named favored and contains two partitions]
set partition = favored; [this tell the program to use the two partitions]
lset applyto=(1,2) nst=6 rates=gamma; [both partitions are set to the GTR model with six parameters and an ASRV described by a gamma distribution]
mcmcp filename=analysisP; [the result files are named analysisP ...]
mcmcp samplefreq=50 printfreq=50;
[mcmcp sets parameters for the chain. This sets the frequency with which the "robot" reports results to the screen and to the files (different files for trees (.t) and other parameters (.p)]
mcmcp savebrlens=yes; [mcmcp sets parameters for the chain. savebrlens tells it to save the trees with branchlengths]
unlink statefreq=(all) revmat=(all) shape=(all); [unlinks partitions, shape parameters and rate multipliers will be estimated for each of the partitions]
prset applyto=(all) ratepr=variable; [allows rates to vary between partitions]
end;
" prset" sets the priors for the model used to calculate likelihoods.
" mcmcp " sets parameters for Markov Chain Monte Carlo: we set to sample every 50 generation, to print results to a screen every 50th generation, run 2 chains simultaneously, start with random tree, and save branch lengths.
The MrBayes manual (pdf) , which includes tutorial sections, is here. Use search in you pdf program to find information on a particular command.
Type "showmodel" to have MrBayes display the model you selected.
At the MrBayes command prompt type "mcmc ngen=5000". This actually runs the chains, and we set it to run for 5000 generations. The program runs two analyses in parallel (by default each with 4 chains, and three of these chains are heated; it definitely is a good idea to run mb on a fast and remote computer). The smaller the reported average standard deviation of split frequencies is, the more reliable the result (i.e., your run is close enough to infinitely long). To continue the run type yes and enter the number of additional generation. When the value is below .015, or when your patience is exhausted, terminate the run, by typing no at the prompt. Give it at least 5 minutes.
Make sure you have typed "no" at the Continue with analysis? (yes/no): prompt.

After the run is finished, the " sump " command will plot the logL vs. generation number, that allows to determine the necessary burnin (you want to discard those samples as "burnin" where the -logL is still rising steadily).
To see the whole logL curve, you need to set the burnin fraction to .02 . (type help sump at the mb command line). sump burninfrac=.02

[Rather than using the sump command, you also can import the parameter file into EXCEL and plot the logL values as a chart in EXCEL. See below.
Or you can download the tracer (local download link here) application that can read the parameter files from MrBayes, and provides an easy, intuitive and interactive way to evaluate these files with respect to burnin and confidence intervals]

At the start of the run, the likelihood rapidly increases by orders of magnitude. If the first samples are included in the plot, one really cannot see if the likelihood values fluctuate around a constant
value. You can exclude the first couple of samples by specifying a burnin or a burninfrac. (The new version of MrBayes uses a burnin of 25% by default.)

To change the size of the burnin, type
sumt burninfrac =.25
, where you need to substitute '.25' with the number you obtained in the previous step of the exercise. This command creates a consensus tree, and shows posterior probabilities of the clades. You can take a look at the tree on the screen (scroll up to see the bipartition table, and different version of the tree) - you want to take the time to answer the green questions 1-4 given below at this point.
Load the .p files into into tracer (local download link here), it should be already installed on your computer. You can load both parameter files at the same time, and select in the upper left field of the tracer application, if you want to analyze one or both of the parameter files.
Inspect the trace (select the right button over the graphics window) of lnL (select in the list of parameters on the left).
Determine the mean, median and 95% highest posterior density interval (select the left button over the graphics window) for shape parameter (gamma) and the relative branch length multiplier (m) for each of the data partitions. (1 is the intein, 2 is the extein) (the .p files from a longer mcmc are here.) (Questions 5 and 6 below)
Keep in mind the the x axis is different in the display for the two partitions. To see the histograms on the same scale, select marginal probabilities (button over the graphics window), and select the parameters for both of the partitions in the left panel. At the display bottom on top you might want to select histogram or KDE.
You also can plot the estimated stationary probabilities (pi(A) pi(T) pi(G) pi(C) for the two partitions [ {1} is the intein, {2} is the extein].

1) Which value did you use for the burnin/burninfraction?
2) Which branch in the tree is the longest? (check the bipartition and branch lengths tables)
3) How long is it?
4) What is the measure?
5) What are the shape parameters for the intein and extein partition, respectively? What is the respective 95% credibility interval?
6) What does this mean for the number of sites with low substitution rates in the intein and extein, receptively?
7) Can you explain in a few words, why is it important to exclude a 'burnin' from our analyses?
8) What is the 95% highest posterior density interval for the relative rate parameters for the two data partitions?
9) Does the intein or the extein have higher estimated A and T frequencies?

Type " quit " at the prompt to exit MrBayes.

MrBayes by example: Identification of sites under positive selection in a protein

Exercise 3:

Background:

Professor Walter M. Fitch and assistant research biologist Robin M. Bush of UCI's Department of Ecology and Evolutionary Biology, working with researchers at the Centers for Disease Control and Prevention, studied the evolution of a prevalent form of the influenza A virus during an 11-year period from 1986 to 1997. They discovered that viruses having mutations in certain parts of an important viral surface protein were more likely than other strains to spawn future influenza lineages. Human susceptibility to infection depends on immunity gained during past bouts of influenza; thus, new viral mutations are required for new epidemics to occur. Knowing which currently circulating mutant strains are more likely to have successful offspring potentially may help in vaccine strain selection. The researchers' findings appear in the Dec. 3 issue of Science magazine.

Fitch and his fellow researchers followed the evolutionary pattern of the influenza virus, one that involves a never-ending battle between the virus and its host. The human body fights the invading virus by making antibodies against it. The antibodies recognize the shape of proteins on the viral surface. Previous infections only prepare the body to fight viruses with recognizable shapes. Thus, only those viruses that have undergone mutations that change their shape can cause disease. Over time, new strains of the virus continually emerge, spread and produce offspring lineages that undergo further mutations. This process is called antigenic drift. "The cycle goes on and on-new antibodies, new mutants," Fitch said.

The research into the virus' genetic data focused on the evolution of the hemagglutinin gene-the gene that codes for the major influenza surface protein. Fitch and fellow researchers constructed "family trees" for viral strains from 11 consecutive flu seasons. Each branch on the tree represents a new mutant strain of the virus. They found that the viral strains undergoing the greatest number of amino acid changes in specified positions of the hemagglutinin gene were most closely related to future influenza lineages in nine of the 11 flu seasons tested.

By studying the family trees of various flu strains, Fitch said, researchers can attempt to predict the evolution of an influenza virus and thus potentially aid in the development of more effective influenza vaccines.

The research team is currently expanding its work to include all three groups of circulating influenza viruses, hoping that contrasting their evolutionary strategies may lend more insight into the evolution of influenza.

Along with Fitch and Bush, Catherine A. Bender, Kanta Subbarao and Nancy J. Cox of the Centers for Disease Control and Prevention participated in the study.

A talk by Walter Fitch (slides and sound) is here

The goal of this exercise is to detect sites in hemmagglutinin that are under positive selection.

Since the analysis takes a very long time to run (several days), here are the saved results of the MrBayes run: Fitch_HA.nex.p.txt, Fitch_HA.nex.t.txt .

The original data file is flu_data.paup . The dataset is obtained from an article by Yang et al, 2000 . The File used for MrBayes is here

The MrBayes block used to obtain results above is:

begin mrbayes;
set autoclose=yes;
lset nst=2 rates=gamma nucmodel=codon omegavar=Ny98;
mcmcp samplefreq=500 printfreq=500;
mcmc ngen=500000;
sump burnin=50;
sumt burnin=50; end;

Selecting a nucmodel=codon with Omegavar=Ny98 specifies a model in which for every codon the ratio of the rate of non-synonymous to synonymous substitutions is considered. This ratio is called OMEGA. The Ny98 model considers three different omegas, one equal to 1 (no selection, this site is neutral); the second with omega < 1, these sites are under purifying selection; and the third with Omega >1, i.e. these sites are under positive or diversifying selection. (The problem of this model is that the there are only three distinct omegas estimated, and for each site the probability to fall into one of these three classes. If the omega>1 is estimated to be very large, because one site has a large omega, the other sites might not have a high probability to have the same omega, even though they might also be under positive selection. This leads to the site with largest omega to be identified with confidence, the others have more moderate probabilities to be under positive selection).

Note : Version 2.0 of Mr Bayes has a model that estimates omega for each site individually, the new version only allows the Ny98 model as described above..

First, you need to detect how many generations to burn in (meaning the number of samples you will have to discard). Open the file Fitch_HA.nex.p.txt with Excel and plot # of generations versus -LnL values. Determine after how many generations the graph becomes "stationary" (hint: change the Y-axis bounds to "zoom in", e.g., -3300 min to -3200 max). The burnin value is that number of generations divided by 50 (since only every 50th generation was sampled; i.e. the burnin value roughly is equal to the number of rows - not quite because there is a header). To more accurately determine the burnin, you need to rescale the Y-axis (click at the Y-axis -- if you aim accurately, you'll get a box that allows rescaling).
The result (scatterplot of LogL versus generation) might look like this:
This file contains information for posterior probabilities for each codon (columns) at each sampled generation (rows). Scroll to the right to see these columns, starting with pr+(1,2,3), pr+(4,5,6), etc. Calculate average posterior probability for each site of being under positive selection (Do not forget to exclude first N rows as a burnin; you should have detected value of N in the first question of this exercise - to be clear on where the burnin ends, you might want to highlight the rows representing the burnin and select a different font color. (Use AVERAGE() function of Excel, enter the formula in a cell below the values for the individual generations -- starting in column pr+(1,2,3) -- copy the formula to all columns) (see slides)
Plot average posterior probability vs. site #. (select the row in which you calculated the averages, then click Graph, and select a bar graph). Write down the codon positions for a few sites with the highest posterior probability of being positively selected (the columns name pr+(1,2,3), pr+(4,5,6)....and so on. pr+(1,2,3) mean probability of codon #1 (nucleotide #1, #2 and #3) to be under positive selection))

Determine the 95% credibility interval for the omega<1 value. To do this you sort posterior probability column in ascending order (Select data you want to sort, and go to Data->Sort... ). Again, do not forget to discard the burnin ; the easiest might be to actually delete it.. After sorting, exclude 5% of the data on the top and on the bottom. The range of the remaining data gives you the 90% confidence interval. (Enter answer in box below!)
The structure of hemagglutinin has been crystallized and is publicly available through PDB. Examin the 2VIU.pdb file (here) in chimera. Chain A of the PDB file corresponds to the sequences we did our analysis with (color the molecule according to chain). Below is a comparison of one of the sequences we used for analyses with the sequence for which the structure was determined:

Using this alignment as a guide, map the site(s) which have the highest probability to belong to the class with omega>1. Where are these sites located in the protein? (Reminders: The position number in the nexus file corresponds to nucleotide sequence, the structure is based on the amino acid sequence - take the third codon position and divide by 3 to find the amino acid. You only want to be concerned with Chain A!)

What is the 95% credibility interval for the omega < 1?
Does this value indicate strong purifying selection?
Which codon(s) showed signs of positive selection?
Which position and which amino acid does this correspond to in the above alignment?
Where is this aa located in the structure?

Exercise 4:

dN/dS ratios along a sequence. I ran the dataset from exercise #1 and 2 using the NY98 model (no partitions). The file is here. The model has a parameter for transition/transversion ratio and for the dN/dS ratio (called omega). The model uses and explores only two of these - one for sites under purifying selection, one for sites under diversifying selection. In addition for each codon the probability that the codon is in the omega+ group, and the omega value for each codon is estimated. This is the MrBayes block in the file:

begin mrbayes;
lset nst=2 rates=gamma nucmodel=codon omegavar=Ny98;
report possel = yes siteomega = yes;
mcmcp filename=analysisS;
mcmcp samplefreq=50 printfreq=50 diagnfreq=500;
mcmcp savebrlens=yes;
end;

This results in a lot of parameters, which makes for a slow moving robot. The parameter files resulting form a 24h run is here (already imported into excel - a file with a sheet calculating the the expected omega value along the sequence is here)
NOTE: the spreadsheet contains several sheets, these contain so many rows (one for each generation) and columns (one for each parameter). The scrol bar covers the sheet tabs. To switch from sheet to sheet press the <ctrl ><Page Up> keys.|
Two sheets give the parameters for each run, the third contains the combined values after the burnin, and the third one a bar graph for the omega values along the sequence.

On the sheet giving the combined samples after the burnin, scroll to the right to see the columns, starting with pr+(1,2,3), pr+(4,5,6), and etc, and omega(1,2,3), omega(4,5,6) ... . Calculate averages for each of these columns using the =AVERAGE() function of Excel, enter the formula in a cell below the values for the individual generations -- starting in column pr+(1,2,3) -- copy the formula to all columns). You can do the same with the estimated omega values (to the right of the pr(xyz columns). Plot the average values for Omega for each column as a bar graph.
(Note: in my version of excel, this only worked when I specified the cells fro which to the average was to be calculated, i.e., =AVERAGE(BX2:BX1353) - when I highlighted the column, the resulting formula would not copy correctly to the neighboring cells.)

Do you see different estimated P+ and Omega values for the extein and the Intein?
If yes, which seems to be under stronger purifying selection?
What are the average of the expected Omega values for sites in the N-extein
What are the average of the expected Omega values for sites in the intein
What are the average of the expected Omega values for sites in the C-Extein

Finished?

Type logout to release the compute node from the queue.
If you you encountered problems in your session, check the queue for abandoned sessions using the command qstat. If there are abandoned sessions under your account, kill them by deleting them from the queue by typing qdel job-ID, e.g. "qdel 40000" would delete Job # 40000

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.