Assignment for Friday:

Read through sequence alignment section (class 15)
If interested, read through the details of the Needleman Wunsch alignment here.
If you want to work on this at home or from another computer, install the latest version of SeaView. Send email if you encounter problems. ClustalX is available here (download X, not W); SeaView is available here.
Take-home exam 5 is due next Monday (last chance to ask questions). The deadline for take-home exam 4 has been extended.

Assignments for Monday

Read through the Wikipedia entries for UPGMA and Neighbor Joining

Read Walter Fitch's article on types of homology (available on HuskyCT or here)
Assignment for today: Read excerpts of Chapters 5 and 6 from Li's "Molecular Evolution"

See here for the splice site consensus in Arabidopsis

Discuss the nonsense-mediated decay pathway (Wikipedia, review article)

Introns and Their Evolution

Three groups of introns based on their splicing mechanisms:

Group I and II introns are self-splicing [they have different splicing mechanisms: see this figure for comparison of splicing]:

Group III introns are present in the eukaryotic nucleus and need spliceosomes to be spliced out:

Where do the different groups of introns occur?

Group I: discovered in the ciliated protozoan Tetrahymena; also found in Physarum, in fungal, algal and plant mitochondria, and in phage T4; rare in Bacteria, although one is present in Thermotoga 23S rRNA. Similar to inteins, they often rely on homing endonucleases to invade a host gene.
Group II: common in Bacteria, and so far found only in one Archaeal genus, Methanosarcina
Spliceosomal Introns: present throughout eukaryotes, but more common in "crown-group" eukaryotes

Where do spliceosomal introns come from, and how did the splicing machinery evolve?

Hypothesis:

Spliceosomal introns evolved from group II introns; the function of some of the internal loops of the group II introns is taken over by the spliceosomal snRNAs (small nuclear RNAs).

Support:

Group II introns are often located in intergenic regions in Bacteria, suggesting their mobility as parasitic genetic elements
Group II and spliceosomal introns both form a lariat structure (see figures above)
Group II introns that are non-functional because a loop has been removed splice in the presence of snRNA.
The reverse is true too: a domain of a group II intron can substitute for an snRNA of the spliceosome.

Gratuitous complexity hypothesis for evolution of spliceosomal machinery: See reading assignment on WebCT [the portions for the reading are highlighted in the PDF file]

Problem:

Group II introns are found in Bacteria, and in only one Archaeal genus, Methanosarcina; why is it that predominantly "crown-group" eukaryotes have introns?

Not much of a splice site consensus (exon1 GT-intron-AG exon2, see here for the splice site consensus in Arabidopsis)

Group I introns often have homing endonucleases.
Homing endonucleases and intron mobility. Spread in populations, selective pressure on endonuclease. See the excellent paper by Goddard and Burt on the reinvasion cycle.

Also: reverse splicing

Possible benefits of having introns:

Exon shuffling, alternative splicing (1 gene -> different protein products) ....

Two rival hypotheses: Intron Early vs. Intron Late

Intron early:

Protein diversity arose in analogy to exon shuffling in the generation of antibody diversity (see your biochemistry or genetics textbook on the maturation of the immune system).

Claims:

Introns separate structural domains. Example of a Go-plot is here (from here; these authors describe a significant excess of introns in the linker regions defined through the overlap in the Go-plot).
In Triose Phosphate Isomerase an intron was found in a position suggested by a Go-plot (here).
Introns arose early, before the uptake of the mitochondrial and chloroplast endosymbionts.
Neighboring introns often are in the same phase. While significant, the excess is rather small (216 of 570, 36 more than expected under a random distribution). However, the excess is larger if only multidomain proteins are considered, suggesting that these indeed evolved through exon shuffling (see here for a recent analysis).

Intron late:

Present day introns are late invaders of already functional genes. Exon shuffling might play some role in eukaryotes, but most of protein diversity arose before introns invaded protein coding genes.

Claims:

The distribution of introns mapped on phylogenetic trees unambiguously points towards late invasion (and here).
The correlation between structure and intron position is not unambiguous.
The finding that mitochondrial (eubacterial) and nucleocytoplasmic genes have introns in the same location could reflect a preferred intron integration site. The phase pattern is also observed in vertebrate genes, in which the introns are of late origin.
Exon shuffling requires introns located in the same phase, but there might be other reasons for having a slight excess of introns in the same phase. For introns to frequently invade genes, there need to be mechanisms for introns to find new "homes" (see above).
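
The phase requirement for exon shuffling comes down to simple arithmetic. The sketch below assumes the usual convention, which is not spelled out in these notes, that the phase of an intron (0, 1 or 2) is the number of nucleotides of the interrupted codon that lie upstream of it. The function names are made up for illustration.

```python
def exon_length_mod3(upstream_phase, downstream_phase):
    """Length (mod 3) of an exon flanked by introns of the given phases.

    Assumed convention: phase = number of codon nucleotides that
    precede the intron (0, 1 or 2).
    """
    return (downstream_phase - upstream_phase) % 3


def shuffling_preserves_frame(upstream_phase, downstream_phase):
    """True if the exon can be inserted or removed without a frameshift.

    This is the case only when the exon length is a multiple of three,
    i.e. when both flanking introns are in the same phase.
    """
    return exon_length_mod3(upstream_phase, downstream_phase) == 0


# Enumerate all nine combinations of flanking intron phases.
for up in range(3):
    for down in range(3):
        print(up, down, shuffling_preserves_frame(up, down))
```

Only the three combinations with identical phases (0-0, 1-1, 2-2) keep the downstream reading frame intact.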

Compromise:

mixed model of intron evolution

version 1 - while some introns are recent, most are old. E.g.: [Roy, 2003].
version 2 - while most introns are recent, some are older, but not necessarily very old. E.g.: [Rogozin et al., 2003]

Else:

It was suggested that group II introns were the reason for the separation between transcription and translation in Eukaryotes (accomplished through the nuclear envelope). Martin and Koonin's hypothesis suggests that group II introns were brought into the eukaryotic cell by the mitochondrial endosymbiont.

Discussion - two debate teams on the function of introns in evolution:
Team A) Introns Early versus Team B) Introns Late

1) discuss arguments within group (5 minutes)
2) present arguments in favor of your thesis (each side, one person one argument)
3) discuss counter arguments within group (3 minutes)
4) present arguments against the opposing team's evidence

Goals class 16:

Know about genes in pieces and the intron early versus intron late debate
Know the main arguments in favor of introns late (Why is the TPI intron not a strong argument?)
Appreciate the contributions that spliceosomal introns make to the molecular biology of eukaryotes
Know what the term Go domain refers to (not the Japanese board game, but the scientist Mitiko Gō) and how this relates to introns
Know about the theory that connects introns to the origin of the nuclear envelope

PRO INTRONS EARLY:

Self-splicing RNAs are an example of catalytic RNA that could have been present in the RNA world.
There is little reason to assume that the RNA world was not plagued by self-splicing parasites.
Neighboring Introns are more frequently in same phase than expected by chance
Spliceosomal introns are present in all eukaryotes (including supposedly deep branching ones)
Introns frequently are found in linker regions (connecting the more tightly packed Go domains)
Exon shuffling can create a large number of different catalytic sites (see the maturation of the immune system)

PRO INTRONS LATE :

Mapping individual introns onto organismal evolutionary history shows that many introns were inserted relatively recently into the sites where they are found today.
Exon shuffling in the maturation of the adaptive immune system is a modern trait of vertebrates.
Even if introns are ancient, this does not prove that they played a role in assembling the now existing protein families.
Intron preference for linker regions could be the result of selection (they do less harm here than in tightly packed domains).

From:<http://dml.cmnh.org/2002Jul/msg00351.html>

----- Original Message -----
From: <Dinogeorge@aol.com>
Sent: Thursday, July 11, 2002 6:47 PM
Subject: Re: New finds

> > --+--+-----------A
> >   | `--+--+-----B
> >   |     | `--+--C
> >   |     |     `--D
> >   |     `--------E
> >    `--------------F
>
> This is >not< a Hennigian comb. Only the entire ABCDE clade and the F lineage
> make a (two-toothed) Hennigian comb in this cladogram. In a Hennigian comb
> the side branches are left unbranched, like the teeth of a comb. Hence the
> name.

This _is_ a Hennigian comb, because in a cladogram, _only_ topology counts.
A cladogram is a mobile. Look at the following -- it's exactly the same
cladogram as above:

--+--F
  `--+--A
     `--+--E
        `--+--B
           `--+--D
              `--C

... what a side branch is lies completely in the hand of the presentator.
All I did was I rotated a few stems around their long axes.
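
The claim that both drawings show the same cladogram can be checked mechanically. The sketch below is an illustration added to these notes, not part of the quoted email: each tree is written as nested tuples, read off the two ASCII drawings above, and child order is discarded by converting every node to a frozenset. The helper name `canonical` is made up for this example.

```python
def canonical(tree):
    """Order-independent form of a rooted tree given as nested tuples."""
    if isinstance(tree, str):          # a leaf, e.g. "A"
        return tree
    # The children of a node form an unordered set: rotating branches
    # around a node changes the drawing but not this representation.
    return frozenset(canonical(child) for child in tree)


# First drawing:  --+--+--A ... `--F
comb_as_drawn = (("A", (("B", ("C", "D")), "E")), "F")
# Second drawing: --+--F  `--+--A ...
comb_rotated = ("F", ("A", ("E", ("B", ("D", "C")))))
# A genuinely different topology, for contrast.
other_tree = ((("A", "B"), ("C", "D")), ("E", "F"))

print(canonical(comb_as_drawn) == canonical(comb_rotated))  # True
print(canonical(comb_as_drawn) == canonical(other_tree))    # False
```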

sequence space slides

Intro to phylogenetic reconstruction

Phylogenetic analysis is an inference of evolutionary relationships between organisms.
Those relationships are usually represented by tree-like diagrams.
Note: the assumption that evolution is exclusively tree-like is not justified.

Steps of the phylogenetic analysis:

Compilation of sequence dataset

Alignment

Determination of substitution model

Tree building

Tree evaluation

Why phylogenetic reconstruction of molecular evolution?

A) Systematic classification of organisms

e.g.: Who were the first angiosperms? (i.e. where are the first angiosperms located relative to present day angiosperms?)

Where in the tree of life is the last common ancestor located?

B) Evolution of molecules

e.g.: domain shuffling, reassignment of function, gene duplications, horizontal gene transfer, drug targets, detection of genes that drive evolution of a species/population (e.g. influenza virus, see here for more examples)

C) Identification of organisms

e.g., phylotyping in microbiome samples,
origin of genes and viruses (e.g. recent Ebola outbreak)

How:

1) Obtain sequences

Sequencing

Databank searches -> NCBI a) Entrez, b) BLAST, c) BLAST of pre-release data

Friends

2) Determine homology (see notes for earlier classes for practical implementation)

Reminder on Definitions:
Homology: Two sequences are homologous if there existed a molecule in the past that is ancestral to both of the sequences.

3) Align sequences

(Most algorithms used for phylogenetic reconstruction require a global alignment. An exception is statalign from Thorne JL and Kishino H, 1992, Freeing phylogenies from artifacts of alignment. Mol Biol Evol 9:1148-1162.)

Some evolutionary biologists recommend selecting only the part of the alignment that is reliable. (Discuss!) Modify the alignment, if necessary.
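
As a reminder of what "global alignment" means in practice, here is a minimal sketch of the Needleman-Wunsch dynamic programming recursion. It computes only the optimal score, without traceback. The scoring values (match +1, mismatch -1, linear gap penalty -1) are arbitrary assumptions chosen for illustration; they are not the parameters used by ClustalX or SeaView.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch
            )
            gap_in_b = score[i - 1][j] + gap   # a[i-1] aligned to a gap
            gap_in_a = score[i][j - 1] + gap   # b[j-1] aligned to a gap
            score[i][j] = max(diagonal, gap_in_b, gap_in_a)
    # Global alignment: the answer is in the bottom-right cell.
    return score[-1][-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))  # 7
print(needleman_wunsch_score("ACGT", "AGT"))         # 2
```

Because both sequences must be aligned from end to end, the score is read from the last cell of the matrix rather than from the highest-scoring cell, which is what a local alignment would use.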

4) Reconstruct evolutionary history

A) Distance analyses

calculate pairwise distances
(different distance measures, correction for multiple hits, correction for codon bias)
make distance matrix (table of pairwise corrected distances)
calculate tree from distance matrix

i) using optimality criterion
(e.g.: smallest error between distance matrix
and distances in tree), or use
ii) algorithmic approaches (UPGMA or neighbor joining)
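
The first two steps of a distance analysis can be sketched as follows. The example assumes the Jukes-Cantor model for the multiple-hit correction, which is only one of the possible distance measures mentioned above, and it ignores codon bias. The three toy sequences are invented for illustration.

```python
import math
from itertools import combinations


def p_distance(seq1, seq2):
    """Proportion of aligned, gap-free sites at which two sequences differ."""
    pairs = [(x, y) for x, y in zip(seq1, seq2) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Correction for multiple hits: d = -3/4 * ln(1 - 4p/3).

    Only defined for p < 0.75 (saturation).
    """
    return -0.75 * math.log(1 - 4 * p / 3)


# Toy alignment (made-up sequences, already aligned).
alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",
    "C": "ACGAACGGAT",
}

# Distance matrix as a table of pairwise corrected distances.
distances = {
    (n1, n2): jukes_cantor(p_distance(alignment[n1], alignment[n2]))
    for n1, n2 in combinations(alignment, 2)
}

for pair, d in distances.items():
    print(pair, round(d, 4))
```

The corrected distance is always at least as large as the observed proportion of differences, and the gap between the two grows as the sequences become more divergent.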

B) Parsimony analyses

find that tree that explains sequence data with minimum number of substitutions

(tree includes hypothesis of sequence at each of the nodes)
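
For a single alignment column and a fixed tree, the minimum number of substitutions can be counted with Fitch's algorithm. The sketch below is a minimal illustration, assuming a rooted binary tree written as nested 2-tuples and equal cost for all substitutions; it does not search tree space. The state set returned for each node is the "hypothesis of sequence" at that node.

```python
def fitch(tree, states):
    """Return (candidate ancestral states, minimum substitutions) for one site.

    tree   : leaf name (str) or a 2-tuple of subtrees
    states : dict mapping leaf name -> observed nucleotide
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra substitution.
        return shared, left_cost + right_cost
    # Children disagree: one substitution is needed on this node's branches.
    return left_set | right_set, left_cost + right_cost + 1


site = {"A": "G", "B": "G", "C": "T", "D": "T"}
tree1 = (("A", "B"), ("C", "D"))
tree2 = (("A", "C"), ("B", "D"))

print(fitch(tree1, site)[1])  # 1
print(fitch(tree2, site)[1])  # 2
```

Summing this count over all columns of the alignment gives the parsimony score of a tree; the most parsimonious tree is the one with the lowest total. For this site, tree1 is preferred.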

C) Maximum Likelihood analyses

given a model for sequence evolution, find the tree under which the observed sequence data have the highest probability (likelihood).

This approach can also be used to successively refine the model.
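
The smallest possible maximum likelihood problem is a "tree" with two sequences and a single branch. The sketch below assumes the Jukes-Cantor model and a made-up data set (100 sites, 20 of which differ) and finds the branch length that maximizes the likelihood by a simple grid search. Real ML programs optimize over tree topologies and many branch lengths at once, which is not attempted here.

```python
import math


def jc_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of two aligned sequences separated by distance d.

    d is in expected substitutions per site under Jukes-Cantor.
    """
    e = math.exp(-4 * d / 3)
    p_same = 0.25 * (0.25 + 0.75 * e)   # a given identical pair, e.g. A/A
    p_diff = 0.25 * (0.25 - 0.25 * e)   # a given differing pair, e.g. A/G
    return (n_sites - n_diff) * math.log(p_same) + n_diff * math.log(p_diff)


n_sites, n_diff = 100, 20               # hypothetical data

# Try distances 0.001, 0.002, ..., 2.000 and keep the most likely one.
grid = [i / 1000 for i in range(1, 2001)]
ml_distance = max(grid, key=lambda d: jc_log_likelihood(d, n_sites, n_diff))

# For this simple model the ML estimate has a closed form,
# the Jukes-Cantor corrected distance.
analytic = -0.75 * math.log(1 - 4 * (n_diff / n_sites) / 3)

print(ml_distance, round(analytic, 4))
```

The grid search and the closed-form Jukes-Cantor distance agree to within the grid spacing, which illustrates that the corrected distance is itself a maximum likelihood estimate under this model.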

Bayesian statistics use ML analyses to calculate posterior probabilities for trees, clades and evolutionary parameters. MCMC approaches in particular have become very popular in recent years, because they allow one to estimate evolutionary parameters (e.g., which site in a virus protein is under positive selection) without assuming that one actually knows the "true" phylogeny.

D - ...) Else:
spectral analyses, evolutionary parsimony, i.e., look only at patterns of substitutions, supertrees from many gene trees.

Another way to categorize methods of phylogenetic reconstruction is to ask if they are using

an optimality criterion (e.g.: smallest error between distance matrix and distances in tree, least number of steps), or

algorithmic approaches (UPGMA or neighbor joining)
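
To show what an algorithmic approach looks like, here is a minimal UPGMA sketch: repeatedly join the two closest clusters and average their distances to all other clusters, weighted by cluster size. It returns only the topology as nested tuples and does not compute branch lengths. The input format and the toy distances are assumptions made for this example. Note that UPGMA implicitly assumes a constant rate of evolution along all lineages.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster leaves by UPGMA and return the tree as nested tuples.

    names : list of leaf names
    dist  : dict mapping (name1, name2) -> pairwise distance
    """
    size = {name: 1 for name in names}                  # cluster -> leaf count
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(size) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(size, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        # Distance from the new cluster to every other cluster is the
        # size-weighted average of the distances from a and from b.
        for other in size:
            if other not in (a, b):
                d[frozenset((merged, other))] = (
                    size[a] * d[frozenset((a, other))]
                    + size[b] * d[frozenset((b, other))]
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    return next(iter(size))


toy_distances = {
    ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
    ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10,
}
print(upgma(["A", "B", "C", "D"], toy_distances))
# ('D', ('C', ('A', 'B')))
```

No optimality criterion is evaluated at any point: the procedure simply yields one tree, which is what distinguishes it from the optimality-based methods listed above.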

5) Interpret the result.

It is especially important to consider artifacts that might originate in phylogenetic reconstruction, and to assess the reliability of your results.

6) Discussion: How can a tree be rooted?

Slides