MCB 5472 : Inferring Phylogenies (continued)

You should answer the questions in red! (email to gogarten@uconn.edu)

1) If you have not done so, complete the MrBayes exercise from last week:

MrBayes is installed on the cluster. You invoke it by typing mb at the command line. In case you want to install the software on additional computers, go here and follow instructions to download and install MrBayes.

The goal of this exercise is to learn how to use MrBayes to reconstruct phylogenies.

  1. Save testseq1c.nex (catalytic ATPase subunits) (or sequences of your choice, or here (glycyl-tRNA synthetase) or here (bacterial threonyl-tRNA synthetase)--The latter 2 analyses might take a long time. I placed the files into an archive. After you expand the archive, execute the file. You then can summarize the results from the previous run. The archives are here and here.) into the folder from which you want to run the MrBayes Program. The 1st is the dataset of vacuolar ATPase A subunit you used earlier, but in the NEXUS format that MrBayes reads (clustal and many other programs write nexus formated files, but usually, you need to go into the file with a texteditor and change things like the way gaps are treated, or the data type). The other two are aminoacyl tRNA synthetases. You might need to rename the file after downloading. Start MrBayes by typing mb

  2. At the MrBayes command prompt type "execute filename" (where filename is the name of nexus file you want to analyze). This will load the data into the program.

  3. Type some of the following commands one by one to select a model and settings for your analyis:

    prset aamodelpr=fixed(jones) [this sets the substititution matrix to JTT, a mdern version of PAM]
    lset rates=invgamma
    lset ngammacat=4 [this selects a model in which ASRV is described by a gamma distribution approaximated by 4 categories]
    mcmcp samplefreq=50 printfreq=50 nchains=2 startingtree=random [this sets the frequency with which the "robot" reports results to the screen and to the files (different files for parameters (.p) and trees (.t))]
    mcmcp savebrlens=yes [mcmcp sets parameters for the chain. savebrlens tells it to save the trees with branchlengths]
    [mcmcp filename=testseq1c] [if you use this command, it tells the program to save the files under a certain name. This is handy, if you want to read in data from a previous run, but it usually is easier to go with the default, which in this case is testseq1c.nex]
    mcmc ngen=10000 [this starts the chain, at the end you'll be asked if you want to continue the run, read the stuff below while you wait]
    sump [or sump filename=testseq1c
    ] [this summarizes the data stored in the parameter file]

    help prset lists the settings for the priors, help lset lists the settings for the evolutionaly model, and help mcmcp lists the settings of the chains.
    Information on the BETA and DIRILECHT distributions is here and here
    . More info is in the MrBayes Manual.
    Before you start the run type showmodel.

    Rather than typing the commands one by one you could use a MrBayes block at the end of the input file, e.g.:
    begin mrbayes;
        prset aamodelpr=fixed(jones);
        mcmcp samplefreq=50 printfreq=50 nchains=2 startingtree=random;
        mcmcp savebrlens=yes filename=testseq1c;
        mcmc ngen=20000; [you need to add
    set autoclose=yes to get to the next step]
        sump filename=testseq1c;

    end;
    NOTE: The blue set of parameters (two chains only, no among site rate variation leads to very fast execution, but it is only recommended to create data that you can try sump and sumt on; this does NOT represent a reasonable set of parameters!)


    " prset" sets the priors for the model used to calculate likelihoods. In this case we choose the substitution parameters from the JTT amino acid substitution model (Jones et al., 1992).
    " mcmcp " sets parameters for Markov Chain Monte Carlo: we set to sample every 50 generation, to print results to a screen every 50th generation, run 2 chains simultaneously, start with random tree, and save branch lengths.
    " mcmc "
    actually runs the chain, and we set it to run for 20000 generations.
    The program runs to analyses in parallel (by default each with 4 chains, and three of these chains are heated; we use only two to make things a little faster--- it definitely is a good idea to run mb on a fast and remote computer). The smaller the reported average standard deviation of split frequencies is, the more reliable the result (i.e., your run is close enough to infinitely long). When the value is below .015, or when your patience is exhausted, terminate the run, by typing no at the prompt.

    After the run is finished, the " sump " command will plot the logL vs. generation number, that allows to determine the necessary burnin (you want to discard those samples as "burnin" where the -logL is still rising steadily).

    [Rather than using the sump command, you also can import the parameter file into EXEL and plot the logL values as a chart in EXEL. See below.]

    During the start of the run, the likelihood rapidly increases by
    orders of magnitude. If the first samples are included in the plot, one
    really cannot see if the likelihood values fluctuate around a constant
    value. You can exclude the first couple of samples by specifying a burnin.
    sump burnin=20 excludes the first 20 samples.
    Note, that the y-axis is rescaled automatically, which provides you a better spread to judge it the chain is "mixing well", or still in the initial orientation.

    Type
    " sumt burnin=20 " or " sumt filename=testseq1c burnin=20 "
    , where you need to substitute '20' with the number you obtained in the previous step of the exercise (Note: that burnin values is the # of generations divided by 50, since we sample every 50th generation). This command creates a consensus tree, and shows posterior probabilities of the clades. You can take a look at the tree in Treeview or njplot by loading the testseq1.con file.

    Which branch in the tree is the longest?
    How long is it?
    What is the measure?
    Can you explain in a few words, why is it important to exclude a 'burnin' from our analyses?

    Did your analysis extimate a shape parameter and a % invariant sites?
    If yes, what values did you obtain, and what is the 95% credibility intervall for these values?
    (To estimate the latter, load the parameter file into MSExel, copy the values for each generation after the burnin and sort them. Then discard the top and bottom 2.5% of the values. The reaminder represents the 95% credibility interval)

 

2) Handling genome amounts of data -- Extracting text from other applications

EMBOSS is installed on the cluster. Here is a list of programs in EMBOSS. Today we will be using pepstats. Click on its entry in the list to see the command line arguments.

Download a genome of your choice from NCBI. Use the blue "F" links (far right of screen) to see the list of files for a given genome. Since we will be using pepstats, be sure to grab the protein sequence ".faa" file(s).

pepstats genome.faa -outfile genome.pepstats
Check the output file generated by pepstats using a text editor.

 

Use parse_pepstats.pl to extract the isoelectric point for all proteins:

Read through the script. Try to figure out how the program finds the values for the theoretical isoelectric points in the pepstats output. Execute the program:

perl ./parse_pepstats.pl genome.pepstats

parse_pepstats.pl will generate three files, with suffixes ".pI", ".pos_charged", and ".parsed".

Use the ".pI" file (isoelectric points) to construct a histogram (or the .parsed file)
(an Excel 2008 spreadsheet is at http://gogarten.uconn.edu/mcb3421_2008/histo_upgrade4.xls . Copy the isoelectric point data into the first column below the X. In the table in the center, adjust the first midpoint to 0.5 and the last midpoint to 13.5)
Describe the distribution of isoelectric points in your selected genome. How many peaks?
Why might there be a minimum at around pH7?

Check a few of the ORFs with very alkaline theoretical isoelectric point (the *.parsed outfile contains accession numbers in the first column; you could use
fastacmd -s accession_number to retrieve the sequence from nr).
Which charge would these proteins have at neutral pH? Can you see a pattern in the types of enzmes?

3) Your main task for today is to work on your student project!

Include a one sentence summary of what you did in your report.