
Why phylogenetic reconstruction of molecular evolution?
A) Systematic classification of organisms
e.g.:
Who were the first angiosperms? (i.e., where are the first angiosperms located relative to present-day angiosperms?)
Where in the tree of life is the last common ancestor located?
B) Evolution of molecules
e.g.: domain shuffling, reassignment of function, gene duplications, horizontal gene transfer, drug targets, detection of genes that drive the evolution of a species/population (e.g. influenza virus)
C) Identification of organisms
e.g.: phylotyping in microbiome samples, origin of genes and viruses (e.g. the recent Ebola outbreak)

How:
1) Obtain sequences
- Sequencing
- Databank searches -> NCBI: a) Entrez, b) BLAST, c) BLAST of pre-release data
- Friends
2) Determine homology (see the notes from earlier classes for the practical implementation)
Reminder on definitions:
Homology: Two sequences are homologous if an ancestral molecule existed in the past from which both sequences descend.
3) Align sequences
Most algorithms used for phylogenetic reconstruction require a global alignment. An exception is statalign from Thorne JL and Kishino H (1992, Freeing phylogenies from artifacts of alignment. Mol Biol Evol 9:1148-1162).
Some evolutionary biologists recommend selecting only the reliable part of the alignment. (Discuss!) Modify the alignment if necessary.
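As a reminder of what a global alignment is, here is a minimal sketch of the Needleman-Wunsch algorithm for two sequences. The function name and the scoring values (match +1, mismatch -1, linear gap penalty -1) are assumptions chosen for illustration and are not taken from these notes; real alignment programs use substitution matrices and affine gap penalties, and align many sequences at once.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b (illustrative scoring scheme)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # score for aligning a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned_a, aligned_b, best = needleman_wunsch("ACGT", "AGT")
print(aligned_a)
print(aligned_b)
print(best)
```

Every position of both sequences ends up in the alignment, which is what distinguishes a global from a local alignment.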
4) Reconstruct evolutionary history
A) Distance analyses
- calculate pairwise distances (different distance measures, correction for multiple hits, correction for codon bias)
- make distance matrix (table of pairwise corrected distances)
- calculate tree from distance matrix
i) using an optimality criterion (e.g.: smallest error between distance matrix and distances in tree), or
ii) using algorithmic approaches (UPGMA or neighbor joining)
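The three steps of a distance analysis might be sketched as follows, assuming the Jukes-Cantor model as the correction for multiple hits and UPGMA as the algorithmic approach. The toy sequences and all function names are made up for illustration. The sketch returns only the branching order (no branch lengths), and the correction is undefined once the observed difference reaches 75%.

```python
import math
from itertools import combinations


def p_distance(s1, s2):
    """Fraction of differing sites; columns with a gap in either sequence are skipped."""
    pairs = [(x, y) for x, y in zip(s1, s2) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor correction for multiple hits: d = -3/4 * ln(1 - 4p/3)."""
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(seqs):
    """Map each unordered pair of names to its corrected distance."""
    return {frozenset((a, b)): jukes_cantor(p_distance(seqs[a], seqs[b]))
            for a, b in combinations(seqs, 2)}


def upgma(dist, names):
    """Build a tree (nested tuples) by repeatedly joining the two closest clusters."""
    dist = dict(dist)                 # work on a copy
    size = {n: 1 for n in names}      # number of leaves in each cluster
    while len(size) > 1:
        pair = min(dist, key=dist.get)
        a, b = tuple(pair)
        new = (a, b)
        for c in size:
            if c not in pair:
                # size-weighted average of the distances to the two joined clusters
                dist[frozenset((new, c))] = (
                    size[a] * dist[frozenset((a, c))]
                    + size[b] * dist[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[new] = size[a] + size[b]
        del size[a], size[b]
        # drop all distances that involve the two clusters that were joined
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
    return next(iter(size))


seqs = {
    "A": "ACGTACGTACGT",
    "B": "ACGTACGTACGA",
    "C": "ACGAACCTTCGT",
    "D": "ACGAACCTTCGG",
}
dist = distance_matrix(seqs)
tree = upgma(dist, list(seqs))
print(tree)  # A is grouped with B and C with D; the order inside a tuple may vary
```

UPGMA assumes a molecular clock; neighbor joining follows the same "join two clusters, update the matrix" pattern but does not make that assumption.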
B) Parsimony analyses
Find the tree that explains the sequence data with the minimum number of substitutions (the tree includes a hypothesis for the sequence at each of the nodes).
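Counting the minimum number of substitutions on one given tree can be sketched with Fitch's algorithm; this is one standard way to do the counting and is an assumption here, since the notes do not name a specific algorithm. Trees are written as nested 2-tuples of leaf names, and the toy alignment and function names are invented for the example. A full parsimony analysis would repeat this count over many candidate trees and keep the best one.

```python
def fitch(tree, states):
    """Return (possible states at this node, substitutions needed below it)
    for a single alignment column. states maps leaf name -> character."""
    if isinstance(tree, str):                 # leaf: the observed character
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:                                # children agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the per-column substitution counts over the whole alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


alignment = {
    "A": "ACGTACGTACGT",
    "B": "ACGTACGTACGA",
    "C": "ACGAACCTTCGT",
    "D": "ACGAACCTTCGG",
}
topologies = {
    "AB|CD": (("A", "B"), ("C", "D")),
    "AC|BD": (("A", "C"), ("B", "D")),
    "AD|BC": (("A", "D"), ("B", "C")),
}
scores = {label: parsimony_score(t, alignment) for label, t in topologies.items()}
for label, score in scores.items():
    print(label, score)
```

The state sets returned at the inner nodes are the "hypothesis of the sequence at each node" mentioned above; for these toy data the grouping of A with B and C with D needs the fewest substitutions.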
C) Maximum Likelihood analyses
Given a model for sequence evolution, find the tree under which the observed sequence data have the highest probability (likelihood).
This approach can also be used to successively refine the model.
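To make "probability of the data given a tree and a model" concrete, here is a minimal sketch that computes the log-likelihood of a small alignment with Felsenstein's pruning algorithm. It assumes the Jukes-Cantor (JC69) model, equal base frequencies at the root, independent sites, and the same fixed branch length on every branch; none of these choices come from the notes. An inner node is written as ((child, branch_length), (child, branch_length)), and all names are illustrative. A real ML program would also optimize branch lengths and search over many trees.

```python
import math

BASES = "ACGT"


def jc_prob(i, j, t):
    """JC69 probability that base i is replaced by base j along a branch of
    length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def partial(node, column):
    """For each possible base at node: probability of the data observed below it."""
    if isinstance(node, str):                       # leaf: observed base
        return {b: 1.0 if column[node] == b else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        below = partial(child, column)
        for b in BASES:
            result[b] *= sum(jc_prob(b, c, t) * below[c] for c in BASES)
    return result


def log_likelihood(tree, alignment):
    """Sum of per-site log-likelihoods (sites are treated as independent)."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        root = partial(tree, column)
        total += math.log(sum(0.25 * root[b] for b in BASES))
    return total


def quartet(a, b, c, d, t=0.1):
    """Four-taxon tree grouping a with b and c with d; every branch has length t."""
    return ((((a, t), (b, t)), t), (((c, t), (d, t)), t))


alignment = {
    "A": "ACGTACGTACGT",
    "B": "ACGTACGTACGA",
    "C": "ACGAACCTTCGT",
    "D": "ACGAACCTTCGG",
}
trees = {
    "AB|CD": quartet("A", "B", "C", "D"),
    "AC|BD": quartet("A", "C", "B", "D"),
    "AD|BC": quartet("A", "D", "B", "C"),
}
ll = {label: log_likelihood(tree, alignment) for label, tree in trees.items()}
for label, value in ll.items():
    print(label, round(value, 3))
```

The tree with the largest (least negative) log-likelihood is the ML choice among the candidates that were compared.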
Bayesian statistics use the likelihood calculations of ML analyses to obtain posterior probabilities for trees, clades and evolutionary parameters. MCMC approaches in particular have become very popular in recent years, because they make it possible to estimate evolutionary parameters (e.g., which site in a virus protein is under positive selection) without assuming that one actually knows the "true" phylogeny.
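The MCMC idea can be illustrated with a deliberately small toy: a Metropolis sampler for a single parameter, the evolutionary distance d between two sequences. It assumes the Jukes-Cantor model and an exponential prior on d; the site counts, the prior rate, the proposal width and the function names are all invented for the example. Real Bayesian phylogenetics programs additionally propose changes to the tree itself, which is what lets them average over phylogenies instead of fixing one.

```python
import math
import random


def log_posterior(d, n_sites, n_diff, prior_rate=1.0):
    """Unnormalized log posterior of the distance d between two sequences."""
    if d <= 0:
        return -math.inf
    # JC69 probability that a site differs between the two sequences
    p_diff = -0.75 * math.expm1(-4.0 * d / 3.0)
    log_lik = n_diff * math.log(p_diff) + (n_sites - n_diff) * math.log(1.0 - p_diff)
    return log_lik - prior_rate * d           # exponential prior on d (assumption)


def metropolis(n_sites, n_diff, steps=20000, burn_in=2000, step_sd=0.05, seed=1):
    """Sample d from its posterior with a Gaussian random-walk proposal."""
    rng = random.Random(seed)
    d = 0.1
    current = log_posterior(d, n_sites, n_diff)
    samples = []
    for i in range(steps):
        proposal = d + rng.gauss(0.0, step_sd)
        candidate = log_posterior(proposal, n_sites, n_diff)
        # accept with probability min(1, posterior ratio)
        if math.log(1.0 - rng.random()) < candidate - current:
            d, current = proposal, candidate
        if i >= burn_in:
            samples.append(d)
    return samples


samples = metropolis(n_sites=200, n_diff=30)
posterior_mean = sum(samples) / len(samples)
print(round(posterior_mean, 3))
```

The samples approximate the whole posterior distribution of d, so a credible interval can be read directly from their quantiles.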
D - ...) Other approaches:
spectral analyses, evolutionary parsimony, i.e., methods that look only at patterns of substitutions.
Another way to categorize methods of phylogenetic reconstruction is to ask whether they use
- an optimality criterion (e.g.: smallest error between distance matrix and distances in tree, least number of steps), or
- algorithmic approaches (UPGMA or neighbor joining)
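The difference between the two categories is that an optimality criterion assigns a score to any candidate tree, so trees can be compared, whereas an algorithmic approach constructs a single tree directly. As a sketch of the first kind, the code below scores a tree by the summed squared difference between the observed distances and the path lengths in the tree. The unweighted least-squares form, the tree encoding ((child, branch_length), (child, branch_length)), the branch lengths and the function names are all assumptions made for this example.

```python
def leaf_depths(node):
    """Map each leaf below node to its path length from node."""
    if isinstance(node, str):
        return {node: 0.0}
    depths = {}
    for child, t in node:
        for leaf, d in leaf_depths(child).items():
            depths[leaf] = d + t
    return depths


def tree_distances(node, out=None):
    """Path-length distance between every pair of leaves in the tree."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    (left, t_left), (right, t_right) = node
    # pairs of leaves that are separated at this node
    for a, da in leaf_depths(left).items():
        for b, db in leaf_depths(right).items():
            out[frozenset((a, b))] = da + t_left + t_right + db
    tree_distances(left, out)
    tree_distances(right, out)
    return out


def squared_error(tree, observed):
    """Sum of squared differences between observed and tree distances."""
    fitted = tree_distances(tree)
    return sum((observed[pair] - fitted[pair]) ** 2 for pair in observed)


def quartet(a, b, c, d):
    """Four-taxon tree grouping a with b and c with d (made-up branch lengths)."""
    return ((((a, 0.1), (b, 0.2)), 0.2), (((c, 0.3), (d, 0.1)), 0.2))


true_tree = quartet("A", "B", "C", "D")
observed = tree_distances(true_tree)       # pretend these are measured distances
errors = {
    "AB|CD": squared_error(true_tree, observed),
    "AC|BD": squared_error(quartet("A", "C", "B", "D"), observed),
}
for label, err in errors.items():
    print(label, round(err, 4))
```

A search procedure would evaluate this score for many trees and branch lengths and keep the minimum; UPGMA and neighbor joining skip the scoring step entirely.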
5) Interpret the result.
It is especially important to consider artifacts that might originate in phylogenetic reconstruction, and to assess the reliability of your results.