Introduction to Bioinformatics - Course at FH Mannheim, Summer 2007

J. Peter Gogarten
Dept. of Molecular and Cell Biology
University of Connecticut
Storrs, CT 06269-3125

Email: Gogarten@UConn.edu
http://gogarten.uconn.edu/

Basis for grading:
Participation, Assignments, Exam on Wednesday (Start 13.00)

Class notes and assignments will be available on the web at:
        
http://web.uconn.edu/gogarten/bioinf (USA)
        

The first set of assignments is at the bottom of this page.

Textbook: none is required but the following are recommended.

Essential Bioinformatics (Paperback)
by Jin Xiong

An excellent book that provides a very readable and concise overview of the most important tools and concepts in bioinformatics.

Link to Amazon.com

Bioinformatics for Dummies
by Jean-Michel Claverie

Excellent introductory bioinformatics book.

 

Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Third Edition

Edited by Andreas D. Baxevanis and B. F. Francis Ouellette

The book covers many aspects of bioinformatics that we do not cover in class, but it is an excellent reference. The section on phylogenetics is weak, but you have your instructor to provide you with much more detail.

Don't buy the 2nd edition by mistake!

link to Amazon.com link to publisher

Excellent book to look up things and to consult if faced with a real world problem.
Covers many more techniques and approaches than we will in this course.

Inferring Phylogenies
by Joe Felsenstein

ISBN: 0-87893-177-5;   $61.95 paper

Excellent book on phylogenetics and many aspects of population genetics (e.g., gene coalescence in populations, a topic that is rather relevant to species phylogenies in microorganisms :)).
For most Molecular Biologists this is not exactly bedtime reading, but if you need a well founded thorough explanation, this is a good book to consult.

Link to publisher, link to amazon.com

Bioinformatics And Molecular Evolution (Paperback)
by Paul G. Higgs, Teresa K. Attwood

The authors discuss in detail many applications in molecular evolution and bioinformatics. This book should be very useful to those who want to study some aspects of things covered in this course in more detail.

Link to amazon.com

Molecular Evolution : A Phylogenetic Approach
by Roderic D. M. Page and Edward C. Holmes

Price: $63.95; paperback, 352 pages (October 1998)

Blackwell Science Inc; ISBN: 0865428891

This book gives an excellent introduction to terms, methods, and problems in molecular evolution. It does not contain many details on individual algorithms, but it provides a very readable overview.
Rather expensive!

 

Other recommended books:

Graur and Li: Fundamentals of Molecular Evolution, Second Edition

 

Bioinformatics (general definition): 
   Area between Computer Sciences (Informatics) and Biology (genomics)
   
(or application of the tools of informatics to biology)

Bioinformatics took off only with the availability of large amounts of genome information, thus a more narrow delineation might be:
 
     Area between Informatics and Genomics

Related areas: Computational biology, Cybernetics

 Typically bioinformatics is considered to include: 

management of biological databanks,
access to biological data, and
extracting useful information from biological data.
For a more detailed discussion, see Mark Gerstein's introduction.

 

 

What does Bioinformatics have to do with Molecular Evolution? 

 Problem: Application of first principles does not (yet) work: 

 

The following chain of events, although (believed to be) mainly determined by the DNA sequence (plus other components of the cell, which in turn are encoded by other parts of the genome), cannot at present be simulated in a computer.

  

DNA sequence ->
transcription ->
translation ->
protein folding ->
protein function (catalytic and other properties) ->
properties of the organism(s) ->
ecology (taking also the non biological environment into account) -> ... .

    

Most scientists believe that the principle of reductionism (plus new laws and relations emerging on each level) is true for this chain; however, this is clearly “in principle” only.
Biology usually assumes that this sequence works more or less unambiguously (prions being an exception), but:

At several steps along the way from DNA to function our understanding of the chemical and physical processes involved is so incomplete that prediction of protein function based on only a single DNA sequence is at present impossible (at least for a protein of reasonable size).

 

Solution: 

Use evolutionary context - 
“Everything in biology makes sense only if considered in the context of evolution.”

Present day proteins evolved through substitution and selection from ancestral proteins.  As a result
related proteins have similar sequence AND similar structure AND similar function. 

 

In the above mantra "similar function" can refer to:

Experience shows that protein sequence space is so big that similar sequences do not arise through convergent evolution (at least if significant similarity is detectable through pairwise comparison; in contrast, simple similar protein folds might have evolved twice independently).

 

The Size of Protein Sequence Space (back of the envelope calculation):

Consider a protein of 600 amino acids. Assume that every position could hold any of the twenty possible amino acids. Then the total number of possibilities is 20 choices for the first position times 20 for the second position times 20 for the third .... = 20 to the 600 = 4*10^780 different possible proteins with a length of 600 amino acids.

For comparison the universe contains only about 10^89 protons and has an age of about 5*10^17 seconds or 5*10^29 picoseconds.

If every proton in the universe were a computer that explored one possible protein sequence per picosecond, we would have explored only 5*10^118 sequences, i.e. a negligible fraction of the possible sequences with length 600 (one in about 10^662).
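The back-of-the-envelope numbers above can be checked with exact integer arithmetic. The following is a minimal sketch in Python; the constants are simply the estimates quoted in these notes, and the helper `sci` is a hypothetical name introduced only for formatting.

```python
ALPHABET_SIZE = 20         # possible amino acids per position
LENGTH = 600               # residues in the example protein
PROTONS = 10**89           # estimate used in the notes
PICOSECONDS = 5 * 10**29   # age of the universe as used in the notes

def sci(n):
    """Format a positive integer as 'm*10^e', keeping about one significant digit."""
    digits = str(n)
    exponent = len(digits) - 1
    mantissa = round(int(digits[:3].ljust(3, "0")) / 100)
    if mantissa == 10:      # e.g. 9.96 rounds up to the next power of ten
        mantissa, exponent = 1, exponent + 1
    return f"{mantissa}*10^{exponent}"

total = ALPHABET_SIZE ** LENGTH      # all sequences of length 600
explored = PROTONS * PICOSECONDS     # one sequence per proton per picosecond

print("possible sequences:", sci(total))
print("explored sequences:", sci(explored))
print("explored is one in about", sci(total // explored))
```

Python integers have arbitrary precision, so 20 to the 600 is computed exactly. The last line prints a ratio just under 10^662, in line with the estimate given above.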

 

The following is based on observation and not on an a priori truth:

If two sequences show significant similarity in their primary sequence, they have shared ancestry, and probably similar function.
(Although some proteins acquired radically new functional assignments, e.g. lysozyme -> lens crystallin.)


To date there is no example known where convergent evolution has led to significant similarity of the primary sequence (although there are examples where similar selection pressures have resulted in similar convergent substitutions in homologous proteins).

 

THE REVERSE IS NOT TRUE:


DOMAINS WITH THE SAME OR SIMILAR FUNCTION DO NOT ALWAYS SHOW SIGNIFICANT SEQUENCE SIMILARITY for one of two reasons:

a)  they evolved independently (e.g. different types of nucleotide binding sites); or

b)   they underwent so many substitution events that there is no readily detectable similarity remaining.

In particular, DOMAINS WITH SHARED ANCESTRY DO NOT ALWAYS SHOW SIGNIFICANT SIMILARITY (reason: see b above). Many recent breakthroughs in bioinformatics concern the improved detection of similarity.
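How repeated substitutions erase detectable similarity can be illustrated with a toy simulation. The sketch below is only an illustration under strong simplifying assumptions that are not part of these notes: all sites and all amino acids are treated as equally likely to change, there are no insertions or deletions, and no selection. The function names `mutate` and `identity` are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids

def mutate(seq, n_substitutions, rng):
    """Apply n_substitutions random single-site replacements (a site may be hit repeatedly)."""
    residues = list(seq)
    for _ in range(n_substitutions):
        pos = rng.randrange(len(residues))
        # always replace with a residue different from the current one
        residues[pos] = rng.choice([a for a in AMINO_ACIDS if a != residues[pos]])
    return "".join(residues)

def identity(a, b):
    """Fraction of identical positions in two ungapped sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

rng = random.Random(1)  # fixed seed so the run is repeatable
ancestor = "".join(rng.choice(AMINO_ACIDS) for _ in range(600))
for n in (0, 300, 600, 1200, 2400, 4800):
    descendant = mutate(ancestor, n, rng)
    print(n, "substitutions ->", round(identity(ancestor, descendant), 3), "identity")
```

Under these assumptions the identity falls toward the level expected for unrelated random sequences (about 1 in 20 positions), even though the two sequences share ancestry by construction.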

 

The problems in finding exons in eukaryotic genomes illustrate the difficulties in "first principle approaches". (See this afternoon's class.) Again, consideration of the evolutionary context provides a solution.

Powerpoint slides on protein space and homology are here

 

Databanks and databank searches

Sites for databank searches and retrieval:
     
This is probably the only site you'll ever need for databank searches:
       *******
http://www.ncbi.nlm.nih.gov/ ******

The NCBI maintains several databanks.  The entries in each databank are pre-linked to other entries in the same databank and to entries in the other databanks:

Medline (PubMed), including books
Protein
Nucleotide
Structure
Genome
Taxonomy

Other Webpages

http://www.ebi.ac.uk/  
The European homologue/analogue to NCBI.  I use this site for their excellent software archive.

http://rdp.cme.msu.edu/
    The Ribosomal Database Project

http://www.jgi.doe.gov/index.html
    Microbial genomes at the DOE Joint Genome Institute

http://www.tigr.org
        Home of several "completed" genome projects

http://genome-www.stanford.edu/
        Yeast and Arabidopsis genome projects

http://www.ncbi.nlm.nih.gov/genomes/static/micr.html
        List of completed genomes at the NCBI

 

ENTREZ

Medline - DNA - protein genome data banks - protein structures - books

Everything already cross linked between the three databanks.

"Homologous" sequences and papers (!) one click away (related sequence / related medline buttons)

Warning: Sometimes CROSSLINKS are updated only slowly. Links from papers to sequences often never make it into the databanks.

In addition to using the prelinked relationships you can search for similar sequences at the NCBI's site; however, often this is not necessary.
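Entrez text searches can also be scripted. The sketch below only builds the request URL for NCBI's E-utilities `esearch` service, using field tags and a Boolean operator; it does not contact the server. The helper names `build_query` and `esearch_url` and the example search terms are hypothetical and were chosen purely for illustration.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_query(terms, operator="AND"):
    """Join (text, field_tag) pairs into one Entrez query string.

    A tag of None leaves the term untagged, so it is searched in all fields.
    """
    parts = [f"{text}[{tag}]" if tag else text for text, tag in terms]
    return f" {operator} ".join(parts)

def esearch_url(db, query, retmax=20):
    """Return the esearch URL for a databank (e.g. 'pubmed' or 'protein')."""
    return EUTILS_ESEARCH + "?" + urlencode({"db": db, "term": query, "retmax": retmax})

# Illustrative search terms only, not taken from the assignments
query = build_query([
    ("Doolittle WF", "Author"),
    ("Science", "Journal"),
    ("phylogenetic", "Title"),
])
print(query)
print(esearch_url("pubmed", query))
```

The same query string can be pasted into the PubMed search box; the field tags in square brackets restrict each term to the author, journal, or title field.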

A sample GenBank-formatted entry is here. Explore the meaning of the different links.

Other formats that are frequently used and notes on the different alphabets are here.
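To get a feeling for the layout of a GenBank flat file, it can help to read one programmatically. The following is a minimal sketch, not a full parser: it assumes that keywords start in the first column, that indented lines continue the preceding keyword, that the sequence follows the ORIGIN line, and that `//` ends the record. Sub-keywords and the feature table are simply appended to the preceding keyword. The record in `SAMPLE` is made up for illustration and is not a real database entry.

```python
def parse_genbank(text):
    """Return (fields, sequence) for one GenBank flat-file record (simplified)."""
    fields, sequence, key, in_origin = {}, [], None, False
    for line in text.splitlines():
        if line.startswith("//"):              # record terminator
            break
        if in_origin:
            # sequence lines: a position number followed by blocks of residues
            sequence.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line[:1].isalpha():               # keyword in the first column
            key, _, value = line.partition(" ")
            fields[key] = value.strip()
        elif key is not None and line.strip():  # indented continuation line
            fields[key] += " " + line.strip()
    return fields, "".join(sequence)

# Made-up record, used only to show the layout
SAMPLE = """\
LOCUS       EXAMPLE01      24 bp    DNA
DEFINITION  Made-up record used only to illustrate the layout,
            not a real database entry.
ACCESSION   EXAMPLE01
ORIGIN
        1 atggctagct agctaggatc ctga
//
"""

fields, seq = parse_genbank(SAMPLE)
print(fields["ACCESSION"], len(seq), "bp")
print(fields["DEFINITION"])
```

For real work an established library is the safer choice; this sketch is only meant to make the structure of the flat file visible.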

An easy way to stay up to date is to use services (agents) that search the web for new and interesting publications or sequences.  Many companies offer this; some that are available to everyone are:

Also: While Medline is incorporating more and more non-medical literature, there are still gaps in the coverage.  Alternatives are other databanks available through the National Library of Medicine (here), through the ISI Web of Science and through local services.

The Web of Science databases allow you to search articles that cite a particular article or author.

Powerpoint slides on data banks are here (use only the first 5 slides)

 Assignments:

[ A) write down your answers!
  
B) write your name on a piece of paper, fold it into a sign and put it on your desk!
  C) if you need help with an assignment, move your name sign to the top of your screen ]

  1. Use PubMed in NCBI's Entrez to find an article written by Carl R. Woese (famous scientist, codiscoverer of the archaea), published in the journal Proceedings of the National Academy of Sciences with the words primary kingdoms in the title of the paper. Try to use Boolean operators and field tags; if you cannot recall the tags, use the Preview/Index tool.
    What query found the 1977 article?
    How many related articles are linked to this article?
    When was the most recent of the related articles published?   
    Search for the same article in Google Scholar. How many articles cited Woese's 1977 PNAS paper?
    When was the most recent citation?
    In what order does Google Scholar list the articles that cite the Woese paper? (A consequence of this is that the rich get richer.)
    In what order does Entrez list the related articles?

    Dr. JP Gogarten seems obsessed by an important protein called ATP synthase. Is he interested in anything else? How many articles did he publish that are NOT related to the ATP synthase OR ATPase?
    What query did you assemble?
    How many articles did you find? 

    2. Find a paper co-authored by Senejani, Hilario and Gogarten published in BMC Biochemistry. What was the topic of the paper?
    Display the abstract of this paper and click on book in the link menu (on top right of abstract). Items in the abstract that are covered in any of the reference books turn into hyperlinks. If you need more information on any of the items follow these links. What item did you look up? Was this helpful?

    3. To what domain, phylum/kingdom and family does Thermoplasma belong? (Use the Taxonomy link in Entrez)

    4. How many protein sequences are available for Thermoplasma acidophilum, how many are available for the genus Thermoplasma? (In the taxonomy browser go to Thermoplasma and check protein in the header then hit "return".)

5. Use Entrez to find a Protein sequence that is of interest to you. (If you don't find something of interest, use gi|405795).
How many related protein sequences does your sequence have (see the pulldown menu under LINK)?
How many related nucleotide sequences does your sequence have (see the pulldown menu under LINK)?
How many related nucleotide sequences does the nucleotide sequence have?
Explore the BLink page (results from a data bank search with this sequence).
What is shown on this page? (check where some of the links lead to)
What do the colors in the symbolic alignment on the right hand side signify?
Where do the three links in every entry link to?
Note: all of these results are already linked to your sequence, you did not need to perform a new search to get the results.
The symbolic alignment is particularly helpful in case your protein consists of many different domains (go here for a striking example).

 

Challenge: 
6. How many different archaeal RubisCO (=ribulose bisphosphate carboxylase oxygenase = rbcl = ribulose bisphosphate carboxylase large subunit) encoding genes can you find in the protein data bank? Pretend that you are only interested in RubisCO genes in Archaea, NOT in bacterial RubisCOs. (Archaea and Bacteria are the two domains of prokaryotes.)
One possibility is to utilize your clipboard at the NCBI. Start by selecting "protein" in ENTREZ.  Explore different search strategies (names, fields, enzyme and substrate names ... .)  Save positives to the clipboard. If you later go to the clipboard, you can retrieve related sequences. Remember, nobody claimed that this is a perfect world.  It certainly is not easy to formulate a good search strategy. If you don't know if an organism is an Archaeon, click on the taxonomy link associated with most sequences.
How many different archaea that have a RubisCO homologue can you find?
Do many of these have more than one RubisCO gene?

7. If you have time and/or interest sign up for the pubcrawler service (see above) to send you a notice when a paper is published on something you are interested in.