MCB 5472 : dN/dS ratios

1) MrBayes by example: Identification of sites under positive selection in a protein

Background:

News piece from UC Irvine ->

A talk by Walter Fitch (slides and sound) is here

Professor Walter M. Fitch and assistant research biologist Robin M. Bush of UCI's Department of Ecology and Evolutionary Biology, working with researchers at the Centers for Disease Control and Prevention, studied the evolution of a prevalent form of the influenza A virus during an 11-year period from 1986 to 1997. They discovered that viruses having mutations in certain parts of an important viral surface protein were more likely than other strains to spawn future influenza lineages. Human susceptibility to infection depends on immunity gained during past bouts of influenza; thus, new viral mutations are required for new epidemics to occur. Knowing which currently circulating mutant strains are more likely to have successful offspring potentially may help in vaccine strain selection. The researchers' findings appear in the Dec. 3 issue of Science magazine.

Fitch and his fellow researchers followed the evolutionary pattern of the influenza virus, one that involves a never-ending battle between the virus and its host. The human body fights the invading virus by making antibodies against it. The antibodies recognize the shape of proteins on the viral surface. Previous infections only prepare the body to fight viruses with recognizable shapes. Thus, only those viruses that have undergone mutations that change their shape can cause disease. Over time, new strains of the virus continually emerge, spread and produce offspring lineages that undergo further mutations. This process is called antigenic drift. "The cycle goes on and on-new antibodies, new mutants," Fitch said.

The research into the virus' genetic data focused on the evolution of the hemagglutinin gene-the gene that codes for the major influenza surface protein. Fitch and fellow researchers constructed "family trees" for viral strains from 11 consecutive flu seasons. Each branch on the tree represents a new mutant strain of the virus. They found that the viral strains undergoing the greatest number of amino acid changes in specified positions of the hemagglutinin gene were most closely related to future influenza lineages in nine of the 11 flu seasons tested.

By studying the family trees of various flu strains, Fitch said, researchers can attempt to predict the evolution of an influenza virus and thus potentially aid in the development of more effective influenza vaccines.

The research team is currently expanding its work to include all three groups of circulating influenza viruses, hoping that contrasting their evolutionary strategies may lend more insight into the evolution of influenza.

Along with Fitch and Bush, Catherine A. Bender, Kanta Subbarao and Nancy J. Cox of the Centers for Disease Control and Prevention participated in the study.

The goal of this exercise is to detect sites in hemagglutinin that are under positive selection (aka diversifying selection).

Since the analysis takes a very long time to run (several days), here are the saved results of the MrBayes run: Fitch_HA.nex.p, Fitch_HA.nex .t.

The original data file is flu_data.paup. The dataset is obtained from an article by Yang et al, 2000. The File used for MrBayes is here

The MrBayes block used to obtain results above is:

begin mrbayes;
set autoclose=yes;
lset nst=2 rates=gamma nucmodel=codon omegavar=Ny98;
mcmcp samplefreq=500 printfreq=500;
mcmc ngen=500000;
sump burnin=50;
sumt burnin=50;
end;

Selecting a nucmodel=codon with Omegavar=Ny98 specifies a model in which for every codon the ratio of the rate of non-synonymous to synonymous substitutions is considered. This ratio is called OMEGA. The Ny98 model considers three different omegas, one equal to 1 (no selection, this site is neutral); the second with omega < 1, these sites are under purifying selection; and the third with Omega >1, i.e. these sites are under positive or diversifying selection. (The problem of this model is that the there are only three distinct omegas estimated, and for each site the probability to fall into one of these three classes. If the omega>1 is estimated to be very large, because one site has a large omega, the other sites might not have a high probability to have the same omega, even though they might also be under positive selection. This leads to the site with largest omega to be identified with confidence, the others have more moderate probabilities to be under positive selection).

1. Load Fitch_HA.nex.p into Excel.

If you use the iMacs in the computerlab, you can ignore this paragraph.

If you use the old version of EXCEL, you need to load the data into 2 sheets, since this version of Excel does not allow to load more than 256 columns per sheet. To load the data, create a new Excel spreadsheet. Go to Data->Get External Data->Import Text File, and load first 256 columns. If you use a German version of Microsoft, in the import wizard's section on Format, select advanced and then select the correct decimal point symbol, i.e. ".".
Go to a separate sheet, and repeat "Get External Data" command to load the rest of the data, you need to block out (=select =make black) and exclude the first 256 columns.

If you use the new version of EXCEL, just import the file Fitch_HA.nex.p

This file contains information for posterior probabilities for each site (columns) at each sampled generation (rows).

First, you need to detect how many generations to burn in. Plot # of generations versus -LnL values using a scatterplot. Determine after how many generations the graph becomes stationary. The burnin value is that number of generations divided by 50 (since only every 50th generation was sampled; i.e. the burnin value roughly is equal to the number of rows - not quite because there is a header). To more accurately determine the burnin, you need to rescale the Y-axis.

2. Calculate average posterior probability for each site of being under positive selection (Do not forget to exclude first N rows as a burnin; you should have detected value of N in the first question of this exercise). (Use AVERAGE() function of Excel)

In your own words, describe what this posterior probability means, and how this is related to the omega value for sites under positive selection.

4. Plot average posterior probability vs. site # (You can select the values and use a column chart). Select a few sites with the highest posterior probability of being positively selected.

Which codon has the highest probability to be under positive selection? What is this probability?

5. If you are up to using a histogram speadsheet, create a histogram for the omega value <1 and the omega>1. A simple histogram spreadsheet is at http://www.coventry.ac.uk/ec/~nhunt/oatbran/ (seems to work better than the previous version)
Download the histogram spreadsheet, ignore the message that virtual basic could not be loaded, enter your data on the data page (again, do not forget to discard the burnin from consideration), enter values for lowest and highest bin and for the bin size.
Plot the calculated histogram data as a histogram. What is the shape of the distribution?

6. Determine the 95% credibility interval for the omega<1 value. To do this you sort posterior probability column in ascending order (Select data you want to sort and copy them into a new worksheet.). Again, do not forget to discard the burnin. After sorting, exclude 2.5% of the data on the top and on the bottom. The range of the remaining data gives you the 95% credibility interval.

7. The structure of hemagglutinin has been crystallized and is publicly available through PDB. Download the PDB file here and examine it with SPDBV (installed on your computers, if you have not used this program before, see below at exercise 3 for more info). Chain A of the PDB file corresponds to the sequences we did our analysis with (color the molecule according to chain). Below is a comparison of the one of the sequences we used for analyses with the sequence for which the structure was determined:

Using this alignment as a help, map the sites which have the highest probability to belong to the class with omega>1. Where are these sites located in the protein? (Reminders: The position number in the nexus file corresponds to nucleotide sequence, the structure is based on the amino acid sequence - take the third codon position and divide by 3 to find the amino acid. You only want to concern yourself with Chain A!)

2) Hypothesis Testing Using Phylogenies (HyPhy) (optional)

HyPhy is a program package that allows you to use maximum likelihood analyses to test hypothesis about the evolution of sequences, including different tests for types of selection, models for substitution, molecular clocks, and recombination. HyPhy (pronounced hifi) is most easily run on the data monkey server. If you have a more complicated model you want to test (e.g., different parts of a multiple sequence alignment follow different substitution models), you can use a GUI version that can be installed on your Windows or Mac computer. HyPhy allows the use of maximum likelihood ratio tests, and the use of simulations under a null hypothesis (parametric bootstrapping) to test the significance of results.

A command line version is also available and installed on the cluster; however, for the time being, if you want to run an analysis I recommend to download HYPHY2.0 (OSX version), cd into this directory, and invoke hyphy (else you get error messages on missing batch files etc.)

Do this: Go to the analysis page on the Data Monkey and upload the flu_data.paup file (or any other file you want to analyze). (If you want a nice picture of your alignment check out the amino acid translation in pdf format.)

Select the FEL model (which does a good testing of hypotheses see the help pages on what models are tested). For the flu data the best substitution model is AC: AC; all the other rates AT. Run the analysis, compare the results to the ones obtained with MrBayes. (Use the neighbor joining tree from HyPhy for the analysis).

Did MrBayes and the FEL model in HyPhy identify the same sites as being under positive selection?

3) Swiss Protein Data Bank viewer (optional)

Optional: If you think that working with protein structure will be useful for you, simple exercises that get you acquainted with the Swiss pdb file viewer are here (get to know the interface); here (working with layers). The program should be available on the computers in the computer lab, to install it on other computers go to http://spdbv.vital-it.ch/download.html .

4) Work on your student project!

Include a one sentence summary of what you did in your report.