MCB 5472 : Inferring Phylogenies (continued)

You should answer the questions in green! (email to gogarten@uconn.edu)

1) A first and short exposure to MrBayes

MrBayes is installed on both clusters. You invoke it by typing mb at the command line. In case you want to install the software on additional computers, go here and follow instructions to download and install MrBayes. (cluster 2 alos has an MPI version, see the wiki of the bioinformatics facility for more information)

The goal of this exercise is to learn how to use MrBayes to reconstruct phylogenies.

  1. Save testseq1c.nex (catalytic ATPase subunits) (or sequences of your choice, or here (glycyl-tRNA synthetase) or here (bacterial threonyl-tRNA synthetase)--The latter 2 analyses might take a long time. I placed the files into an archive. After you expand the archive, execute the file. You then can summarize the results from the previous run. The archives are here and here.) into the folder from which you want to run the MrBayes Program. The 1st is the dataset of vacuolar ATPase A subunit you used earlier, but in the NEXUS format that MrBayes reads (clustal and many other programs write nexus formated files, but usually, you need to go into the file with a texteditor and change things like the way gaps are treated, or the data type). The other two are aminoacyl tRNA synthetases. You might need to rename the file after downloading. Start MrBayes by typing mb. Keep in mind that if you use MrBayes to do work on a large dataset, you need to run it for millions of generations. Especially for datasets with more than 50 OTU (=Operational Taxonomic Units) repeated runs may be advisable to make sure one is not stuck on a local optimum. Plotting clann support values from one run against the other is a reasonable way to illustrate convergence. Also, for large datasets you want to explore the use of mb-mpi (check with Pascal).

  2. At the MrBayes command prompt type "execute filename" (where filename is the name of nexus file you want to analyze). This will load the data into the program.

  3. Type some of the following commands one by one to select a model and settings for your analysis:

    prset aamodelpr=fixed(jones) [this sets the substititution matrix to JTT, a mdern version of PAM]
    lset rates=invgamma
    lset ngammacat=4 [this selects a model in which ASRV is described by a gamma distribution approaximated by 4 categories]
    mcmcp samplefreq=50 printfreq=50 nchains=2 startingtree=random [this sets the frequency with which the "robot" reports results to the screen and to the files (different files for parameters (.p) and trees (.t))]
    mcmcp savebrlens=yes [mcmcp sets parameters for the chain. savebrlens tells it to save the trees with branchlengths]
    [mcmcp filename=testseq1c] [if you use this command, it tells the program to save the files under a certain name. This is handy, if you want to read in data from a previous run, but it usually is easier to go with the default, which in this case is testseq1c.nex]
    mcmc ngen=10000 [this starts the chain, at the end you'll be asked if you want to continue the run, read the stuff below while you wait]
    sump [or sump filename=testseq1c
    ] [this summarizes the data stored in the parameter file]

    help prset lists the settings for the priors, help lset lists the settings for the evolutionaly model, and help mcmcp lists the settings of the chains.
    Information on the BETA and DIRILECHT distributions is here and here
    . More info is in the MrBayes Manual.
    Before you start the run type showmodel.

    Rather than typing the commands one by one you could use a MrBayes block at the end of the input file, e.g.:
    begin mrbayes;
        prset aamodelpr=fixed(jones);
        lset rates=invgamma;
        lset ngammacat=4 ;
        mcmcp samplefreq=50 printfreq=50 nchains=2 startingtree=random;
        mcmcp savebrlens=yes filename=testseq1c;
        mcmc ngen=20000; [you need to add
    set autoclose=yes to get to the next step]
        sump filename=testseq1c;

    end;
    NOTE: The blue set of parameters (two chains only, no among site rate variation leads to very fast execution, but it is only recommended to create data that you can try sump and sumt on; this does NOT represent a reasonable set of parameters!)


    " prset" sets the priors for the model used to calculate likelihoods. In this case we choose the substitution parameters from the JTT amino acid substitution model (Jones et al., 1992).
    " mcmcp " sets parameters for Markov Chain Monte Carlo: we set to sample every 50 generation, to print results to a screen every 50th generation, run 2 chains simultaneously, start with random tree, and save branch lengths.
    " mcmc "
    actually runs the chain, and we set it to run for 20000 generations.
    The program runs to analyses in parallel (by default each with 4 chains, and three of these chains are heated; we use only two to make things a little faster--- it definitely is a good idea to run mb on a fast and remote computer). The smaller the reported average standard deviation of split frequencies is, the more reliable the result (i.e., your run is close enough to infinitely long). When the value is below .015, or when your patience is exhausted, terminate the run, by typing no at the prompt.

    After the run is finished, the " sump " command will plot the logL vs. generation number, that allows to determine the necessary burnin (you want to discard those samples as "burnin" where the -logL is still rising steadily).

    [Rather than using the sump command, you also can import the parameter file into EXEL and plot the logL values as a chart in EXEL.]

    During the start of the run, the likelihood rapidly increases by
    orders of magnitude. If the first samples are included in the plot, one
    really cannot see if the likelihood values fluctuate around a constant
    value. You can exclude the first couple of samples by specifying a burnin.
    sump burnin=20 excludes the first 20 samples.
    Note, that the y-axis is rescaled automatically, which provides you a better spread to judge it the chain is "mixing well", or still in the initial orientation.

    Type
    " sumt burnin=20 " or " sumt filename=testseq1c burnin=20 "
    , where you need to substitute '20' with the number you obtained in the previous step of the exercise (Note: that burnin values is the # of generations divided by 50, since we sample every 50th generation). This command creates a consensus tree, and shows posterior probabilities of the clades. You can take a look at the tree in Treeview or njplot by loading the testseq1.con file.

    Which branch in the tree is the longest?
    How long is it?
    What is the measure?
    Can you explain in a few words, why is it important to exclude a 'burnin' from our analyses?

    Did your analysis extimate a shape parameter and a % invariant sites?
    How do you tell MrBayes to use a model with a % invariant sites and a gamma distribution for the remainisn sites (help lset).
    If yes, what values did you obtain, and what is the 95% credibility intervall for these values?
    (To estimate the latter, load the parameter file into MSExel, copy the values for each generation after the burnin and sort them. Then discard the top and bottom 2.5% of the values. The reaminder represents the 95% credibility interval)
    What prior does MrBayes use for branchlengths? Why might this be better than a uniform prior between 0 and 100?