The Clay of Evolution - How to study genes and genomes.

How can genes get duplicated?
Through whole genome duplication, partial genome duplication, or duplication of single genes (tandem repeats).

Whole genome duplication: a frequent event in plants, also speculated to have occurred at least twice in the early evolution of vertebrates. 15% of the yeast genome is present in duplicated form; the currently accepted idea is that an ancient duplication was followed by rearrangement and gene loss. The idea of genome duplications in early vertebrate evolution has become very popular, but the phylogeny of regulatory proteins does not support it (see here and here for pro and here for contra).

The picture below is a comparison of the yeast proteome with itself (the diagonal is removed). It clearly shows many small regions of duplication.

Parts of chromosomes get duplicated: traces of this are seen in Arabidopsis and Caenorhabditis.

Single genes get duplicated -> gene families that were originally tandemly replicated (see the Caenorhabditis paper above).

Some TOOLS at NCBI

The NCBI provides several different interfaces to browse through and analyze genomes. For example, in the Borrelia genome, if you click on the complete genome, you get a graphical representation; further clicks move you down through several levels to the nucleotide and encoded amino acid sequence. If you click on an ORF, you retrieve the sequence followed by the output of a BLAST search of this sequence against the nr database. The graphic representation shows you which part of the ORF generated the match; if you click on the number that represents the score, you open a new window with the alignment (again with nice graphics included). If you click on the number, a window with the matching sequence in gb-format opens up. If the ORF is part of a cluster of putatively orthologous genes, you can get information on the cluster by clicking on the COG number.

From the Borrelia genome page, you can go to tables listing all ORFs, or to taxtable, which provides an interesting nearest-neighbor coloring of the genome. It is noteworthy that many of the pink dots are endonucleases. Also, there are many transporters among the oddly colored genes.

In an attempt to capture some phylogenetic information in BLAST comparisons, Olendzenski et al. pioneered an approach that uses multiple reference genomes to screen for putatively horizontally transferred genes (see Fig. 4). A similar approach, but using only two instead of three reference genomes, is implemented in the TAX PLOT program at the NCBI's genome page (see below).

You pick one genome to analyze, and two reference genomes. The program plots every ORF of the selected genome in a coordinate system in which the two coordinates are the ORF's highest alignment scores against each of the two reference genomes:

The selected genome was from Borrelia burgdorferi. The list of selected genes is below:

Definition | Blast2Seq | GenBank | Blink

V-type ATPase, subunit B (atpB) [Borrelia burgdorferi] | | 15594439 | =>
aa V-TYPE ATP SYNTHASE BETA CHAIN (V-TYPE A | 722 | 12585403 | =>
aa ATP synthase F1 alpha subunit [Aquifex a | 261 | 15606090 | =>

V-type ATPase, subunit A (atpA) [Borrelia burgdorferi] | | 15594440 | =>
aa H+-transporting ATP synthase, subunit A | 1051 | 11498766 | =>
aa ATP synthase F1 beta subunit [Aquifex ae | 221 | 15607015 | =>

prolyl-tRNA synthetase (proS) [Borrelia burgdorferi] | | 15594747 | =>
aa prolyl-tRNA synthetase (proS) [Archaeogl | 655 | 11499201 | =>
aa proline-tRNA synthetase [Aquifex aeolicu | 167 | 15605873 | =>

phenylalanyl-tRNA synthetase, beta subunit (pheT) [Borrelia burgdorferi] | | 15594859 | =>
aa phenylalanyl-tRNA synthetase, subunit be | 709 | 11499019 | =>
aa phenylalanyl-tRNA synthetase beta subuni | 153 | 15606806 | =>

chemotaxis histidine kinase (cheA-1) [Borrelia burgdorferi] | | 15594912 | =>
aa chemotaxis histidine kinase (cheA) [Arch | 798 | 11498645 | =>
aa histidine kinase sensor protein [Aquifex | 86 | 15605839 | =>

methionyl-tRNA synthetase (metG) [Borrelia burgdorferi] | | 15594932 | =>
aa methionyl-tRNA synthetase (metS) [Archae | 873 | 11499048 | =>
aa methionyl-tRNA synthetase alpha subunit | 436 | 15606482 | =>

spermidine/putrescine ABC transporter, ATP-binding protein (potA) [Borrelia burgdorferi] | | 15594987 | =>
aa spermidine/putrescine ABC transporter, A | 678 | 11499200 | =>
aa ABC transporter [Aquifex aeolicus] | 325 | 15607081 | =>

lysyl-tRNA synthetase [Borrelia burgdorferi] | | 15595004 | =>
aa lysyl-tRNA synthetase (lysS) [Archaeoglo | 642 | 11498815 | =>
aa cysteinyl-tRNA synthetase [Aquifex aeoli | 92 | 15606347 | =>
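
The idea behind the plot can be tried out on the scores listed in the table above. The following is only a minimal sketch, not the TAX PLOT implementation: each ORF becomes a point whose two coordinates are its best scores against the two reference genomes, and ORFs lying far from the diagonal are flagged. The names OrfPoint, TABLE and off_diagonal, the short gene labels, and the ratio cutoff of 2 are assumptions made for illustration. It is also an assumption that the first hit in each group comes from the archaeal reference (the truncated labels suggest Archaeoglobus) and the second from Aquifex.

```python
from typing import List, NamedTuple


class OrfPoint(NamedTuple):
    gene: str        # label for the Borrelia burgdorferi ORF
    score_ref1: int  # best score against the first reference genome
    score_ref2: int  # best score against the second reference genome


# Scores copied from the table above (first hit, second hit of each group).
TABLE = [
    OrfPoint("atpB", 722, 261),
    OrfPoint("atpA", 1051, 221),
    OrfPoint("proS", 655, 167),
    OrfPoint("pheT", 709, 153),
    OrfPoint("cheA-1", 798, 86),
    OrfPoint("metG", 873, 436),
    OrfPoint("potA", 678, 325),
    OrfPoint("lysyl-tRNA synthetase", 642, 92),
]


def score_ratio(point: OrfPoint) -> float:
    """Ratio of the two coordinates; the max() guards against a zero score."""
    return point.score_ref1 / max(point.score_ref2, 1)


def off_diagonal(points: List[OrfPoint], min_ratio: float = 2.0) -> List[OrfPoint]:
    """Return ORFs scoring at least min_ratio times higher against reference 1
    than against reference 2, most extreme first (hypothetical cutoff)."""
    flagged = [p for p in points if score_ratio(p) >= min_ratio]
    return sorted(flagged, key=score_ratio, reverse=True)


if __name__ == "__main__":
    for p in off_diagonal(TABLE):
        print(f"{p.gene}: {p.score_ref1} vs {p.score_ref2} (ratio {score_ratio(p):.2f})")
```

A ratio is used here only because it is easy to read; a difference of scores or a distance from the diagonal would serve the same purpose. A high ratio marks a candidate, not proof of transfer.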

More on Comparing Genomes:

Genome dot plots make it possible to compare two genomes (or rather the ORFs encoded in these genomes). In contrast to a normal dot plot, one does not move a window through the sequence; rather, one takes one ORF at a time and compares it to the other genome.
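
The ORF-by-ORF procedure can be sketched in a few lines. This is an illustration under stated assumptions, not the method used by any particular site: the function names are hypothetical, the four-ORF "genomes" are made up, and the shared 3-mer fraction is only a toy stand-in for a real BLASTP comparison.

```python
from typing import Callable, List, Tuple


def shared_kmer_fraction(a: str, b: str, k: int = 3) -> float:
    """Toy similarity score (stand-in for BLASTP): fraction of shared k-mers."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    if not kmers_a or not kmers_b:
        return 0.0
    return len(kmers_a & kmers_b) / min(len(kmers_a), len(kmers_b))


def orf_dot_plot(
    orfs_a: List[str],
    orfs_b: List[str],
    score: Callable[[str, str], float] = shared_kmer_fraction,
    cutoff: float = 0.5,
) -> List[Tuple[int, int]]:
    """Take one ORF of genome A at a time, compare it to every ORF of genome B,
    and record a dot (i, j) for each pair scoring at or above the cutoff."""
    points = []
    for i, orf_a in enumerate(orfs_a):
        for j, orf_b in enumerate(orfs_b):
            if score(orf_a, orf_b) >= cutoff:
                points.append((i, j))
    return points


def render(points: List[Tuple[int, int]], n_a: int, n_b: int) -> str:
    """Draw the dots as text: columns are ORFs of genome A, rows ORFs of genome B."""
    grid = [["." for _ in range(n_a)] for _ in range(n_b)]
    for i, j in points:
        grid[j][i] = "*"
    return "\n".join("".join(row) for row in grid)


if __name__ == "__main__":
    # Made-up protein sequences; genome B has the last two ORFs in swapped order.
    genome_a = ["MKTAYIAKQR", "GSHMLEDPVA", "WQNNCFTRPL", "HHEEKKLLAA"]
    genome_b = ["MKTAYIAKQR", "GSHMLEDPVA", "HHEEKKLLAA", "WQNNCFTRPL"]
    dots = orf_dot_plot(genome_a, genome_b)
    print(dots)
    print(render(dots, len(genome_a), len(genome_b)))
```

The axes are ORF positions rather than nucleotide positions, which is what distinguishes this from a sliding-window dot plot.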

Robert L. Charlebois' genome and bioinformatics site performs these and other analyses.

For example, the BLASTP-based dot plot of Pyrococcus abyssi vs. Pyrococcus horikoshii depicted below clearly reveals inversions and a duplication (two parallel diagonals); the latter can also be detected by comparing a genome to itself.

See this paper by Tillier and Collins for a discussion of this and similar patterns.
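
One way to read such patterns off a dot plot automatically is to group the dots into runs along rising diagonals (same gene order) and falling diagonals (inverted gene order). The sketch below is a simplified illustration, not the method of the cited paper: the function name diagonal_runs, the minimum run length of 3, and the example coordinates are assumptions, and real data would need tolerance for gaps and missing dots.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def diagonal_runs(points: List[Point], min_len: int = 3) -> List[Tuple[str, Point, int]]:
    """Group dots into runs of consecutive ORFs.

    Returns (direction, start_point, length) tuples, where direction is
    "forward" for a rising diagonal and "inverted" for a falling one.
    Dots that belong to no run of at least min_len are ignored.
    """
    remaining = set(points)
    runs = []
    for start in sorted(points):
        if start not in remaining:
            continue  # already assigned to an earlier run
        for step, label in ((1, "forward"), (-1, "inverted")):
            run = [start]
            x, y = start
            # Extend the run one ORF at a time along the chosen diagonal.
            while (x + 1, y + step) in remaining:
                x, y = x + 1, y + step
                run.append((x, y))
            if len(run) >= min_len:
                remaining.difference_update(run)
                runs.append((label, start, len(run)))
                break
    return runs


if __name__ == "__main__":
    # Hypothetical comparison of two 10-ORF genomes with ORFs 3..6 inverted.
    inversion = [(0, 0), (1, 1), (2, 2), (3, 6), (4, 5), (5, 4), (6, 3),
                 (7, 7), (8, 8), (9, 9)]
    print(diagonal_runs(inversion))

    # Hypothetical self-comparison with the main diagonal removed:
    # a duplicated block shows up as forward runs away from the diagonal.
    duplication = [(0, 5), (1, 6), (2, 7), (5, 0), (6, 1), (7, 2)]
    print(diagonal_runs(duplication))
```

In a self-comparison, a forward run whose start point has unequal coordinates lies parallel to the removed main diagonal, which is the signature of a duplicated block.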

 

Assignments: (You should spend most of your time on 3 and 4)

1.      Go to the taxonomy browser in Entrez.  Use the search function at the top of the page to find the taxonomic position of Aquifex, Borrelia, Pyrococcus and Aeropyrum.
To which superkingdoms (also known as domains) and phyla do these belong?
(Note: within the two prokaryotic domains, the Bacteria and Archaea, there is only one category between the class and the domain. Usually these categories are called phyla (singular: phylum), but sometimes they are also called kingdoms.)

2.   Go to the ENTREZ genome section.
Select the genome of Aeropyrum pernix (click on genome under the appropriate domain; in the table, select the link to the right of the species name. Selecting the species name itself will bring you back to the taxonomy browser).

Explore the different genome views:

  • Select Protein coding genes under feature table:
    - scroll down to an entry that is not labeled as a hypothetical protein, and explore the different links, i.e., click on the three diamond shapes at the beginning of the line, on the PID and on the COG link.
  • Select structural RNAs under feature table: how many 16S rRNA and how many 5S rRNA coding genes are described for this genome?
  • Click somewhere on the circular map of the genome. In the window that opens, click on one of the ORFs (what do the colors stand for?). In the blink (=BLAST link) report that opens in a separate window, what do the different colors represent in the symbolic alignment on the left-hand side of the table? (If you picked something that doesn't have any matches, go back and select an ORF that is colored.) In the blink window, what are the scores linked to?
  • Select TaxMap: the window that opens has an interactive graphic that displays all ORFs as dots colored according to the domain that the highest scoring BLAST hit belongs to. (What do yellow, red and blue represent?) Click on one of the pink dots. In the table that will be displayed below the graphic, click on the number to the left of the pink letter E. Does the top hit represent a significant hit? (Hint: click on the score.)
  • Go back in the browser window that has the TaxMap until you reach the window that contains the curves displaying the distribution of BLAST hits.
    Can you figure out what the two axes represent?
    Can you guess why the pink line catches up with the yellow and blue lines as you move to lower scores?
  • You can change the cutoff by clicking inside the distribution curve. DO NOT CLICK REPEATEDLY. It takes time for the page to refresh. Click once approximately in the middle of the graphic. How many pink dots are left?
    Click on the number on the right-hand side in the BEST column, in the row of the eubacteria. Are you surprised by the type of proteins you find listed?
  • In the browser window that has the Entrez listing of the Aeropyrum genome, click on the green upward arrow next to "Microbial Genomes".

3.  Select a microbial genome and a question to address using TAX PLOT. Select two reference genomes appropriate for your question (see below for examples). Clicking on the graphic changes the zoom. Select a different function, then click compare.

Your question:

Your genome:

Your two reference genomes:

Which candidate genes did you find?:

For example:

  • If you ask the question "Which genes in Treponema pallidum are candidates for having been transferred from the archaeal domain into this genome?", you would select Treponema pallidum as your query genome. To look for genes transferred from the Archaea, you need to select one bacterial genome (a deep-branching one would be nice, if there is such a thing) and one archaeal genome. Aquifex aeolicus and Archaeoglobus fulgidus would be suitable.

  • If you look for halobacterial (archaeal) genes in cyanobacteria, you could select B. subtilis (B. = Bacillus) and H. sp. NRC1 (H. = Halobacterium, which is an archaeon, not a bacterium!) as reference genomes and the genome of Synechocystis sp. PCC6803 as the genome to analyze (i.e., your query).

What do the two coordinates represent? What are the individual dots?

If substitutions were fixed in the different genes in a clock-like fashion, and if there were no horizontal gene transfer, where would all the ORFs end up?

4. Go to the NCBI's GenePlot page.

First select the two Leptospira serovars from the drop-down lists (serovar Copenhageni and serovar Lai) and do a Genome Plot by pressing the 'Compare' button. What can you conclude about the differences between the genomes by looking at the left window?

Select two closely related species and do the Genome Plot (if the two species you have chosen to compare do not yield any interesting results, retry with other species).

Which species did you compare?

What interesting feature did you find when comparing those two genomes?