Data Bank Searches

Related areas: Computational biology, Cybernetics

Typically bioinformatics is considered to include:

management of biological databanks,

access to biological data, and

extracting useful information from biological data.
For a more detailed discussion, see Mark Gerstein's introduction.

Use evolutionary context -
“Everything in biology makes sense only if considered in the context of evolution.”

Present-day proteins evolved through substitution and selection from ancestral proteins. As a result,
related proteins have similar sequence AND similar structure AND similar function.

similar function, e.g.:

identical reactions catalyzed in different organisms; or

same catalytic mechanism but different substrate (malic and lactic acid dehydrogenases);

similar subunits and domains that are brought together through a (hypothetical) process called domain shuffling,
e.g. nucleotide binding domains in hexokinase, myosin, HSP70, and ATP synthases.

Experience shows that protein sequence space is so big that similar sequences do not arise through convergent evolution, at least when the similarity is significant and detectable through pairwise comparison. In contrast, simple, similar protein folds might have evolved twice independently.

The Size of Protein Sequence Space (back of the envelope calculation):

Consider a protein of 600 amino acids. Assume that any of the twenty possible amino acids could occur at every position. Then the total number of possibilities is 20 choices for the first position times 20 for the second position times 20 for the third ... = 20 to the 600 = 4*10^780 different possible proteins with a length of 600 amino acids.

For comparison the universe contains only about 10^89 protons and has an age of about 5*10^17 seconds or 5*10^29 picoseconds.

If every proton in the universe were a computer that explored one possible protein sequence per picosecond, we would have explored only 5*10^118 sequences, i.e. a negligible fraction of the possible sequences of length 600 (one in about 10^662).
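
The estimate above can be checked with exact integer arithmetic. The following is a minimal Python sketch; the constants are simply the figures quoted in the text (not independent measurements), and the helper names are made up for illustration.

```python
# Back-of-the-envelope estimate of the size of protein sequence space.
# All constants are the figures quoted in the text above.

AMINO_ACIDS = 20                # choices per sequence position
LENGTH = 600                    # protein length in amino acids
PROTONS = 10**89                # protons in the universe, as quoted in the text
AGE_PICOSECONDS = 5 * 10**29    # age of the universe, as quoted in the text


def order_of_magnitude(n: int) -> int:
    """Return the power of ten of a positive integer (digit count minus one)."""
    return len(str(n)) - 1


def leading_digits(n: int, k: int = 2) -> float:
    """Return the first k+1 digits of n as a mantissa between 1 and 10 (truncated)."""
    digits = str(n)
    return float(digits[0] + "." + digits[1:k + 1])


def describe(n: int) -> str:
    """Format a positive integer roughly as 'm*10^e'."""
    return f"{leading_digits(n)}*10^{order_of_magnitude(n)}"


# Python integers have arbitrary precision, so 20**600 is computed exactly.
total_sequences = AMINO_ACIDS ** LENGTH

# One sequence per proton per picosecond over the age of the universe.
explored = PROTONS * AGE_PICOSECONDS

# How many possible sequences exist for every sequence explored.
one_in = total_sequences // explored

print("possible sequences:", describe(total_sequences))
print("explored sequences:", describe(explored))
print("explored fraction : one in about", describe(one_in))
```

The ratio comes out just under 10^662, in line with the "one in about 10^662" quoted above.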

The following is based on observation and not on an a priori truth:

If two sequences show significant similarity in their primary sequence, they share ancestry and probably have similar function.
(Some proteins have, however, acquired radically new functional assignments, e.g. lysozyme -> lens crystallin.)

To date there is no known example where convergent evolution has led to significant similarity of the primary sequence (although there are examples where similar selection pressures have resulted in similar convergent substitutions in homologous proteins).

THE REVERSE IS NOT TRUE:

DOMAINS WITH THE SAME OR SIMILAR FUNCTION DO NOT ALWAYS SHOW SIGNIFICANT SEQUENCE SIMILARITY for one of two reasons:

a) they evolved independently (e.g. different types of nucleotide binding sites); or

b) they underwent so many substitution events that there is no readily detectable similarity remaining.

In particular, DOMAINS WITH SHARED ANCESTRY DO NOT ALWAYS SHOW SIGNIFICANT SIMILARITY (reason: see b above). Many recent breakthroughs in bioinformatics concern the improved detection of similarity.

The problems in finding exons in eukaryotic genomes illustrate the difficulties in "first principle approaches". (See this afternoon's class.) Again, consideration of the evolutionary context provides a solution.

Powerpoint slides on protein space and homology are here

Databanks and databank searches:

Sites for databank searches and retrieval:

This is probably the only site you'll ever need for databank searches:
******* http://www.ncbi.nlm.nih.gov/ ******

The NCBI maintains several databanks. The entries in each databank are pre-linked to other entries in the same databank and to entries in the other databanks:

Medline (PubMed), including books
Protein
Nucleotide
Structure
Genome
Taxonomy

Other Webpages

http://www.ebi.ac.uk/
The European homologue/analogue to NCBI. I use this site for their excellent software archive.

http://rdp.cme.msu.edu/
The ribosomal databank project

http://www.jgi.doe.gov/index.html
Microbial genomes at the DOE joint genome institute

http://www.tigr.org
Home of several "completed" genome projects

http://genome-www.stanford.edu/
Yeast and Arabidopsis genome projects

http://www.ncbi.nlm.nih.gov/genomes/static/micr.html
List of completed genomes at the NCBI

ENTREZ

Medline - DNA - protein genome data banks - protein structures - books

Everything is already cross-linked between the databanks.

"Homologous" sequences and papers (!) one click away (related sequence / related medline buttons)

Warning: CROSSLINKS are sometimes updated only slowly. Links from papers to sequences often never make it into the databanks.

In addition to using the prelinked relationships you can search for similar sequences at the NCBI's site; however, often this is not necessary.

A sample GenBank-formatted entry is here. Explore the meaning of the different links.

Other formats that are frequently used and notes on the different alphabets are here.

An easy way to stay up to date is to use services (agents) that search the web for new and interesting publications or sequences. Many companies offer this; some services that are available to everyone are:

PubCrawler at http://pubcrawler.gen.tcd.ie/ for publications and sequence data (results are available via a webpage);
a similar literature search service from Elsevier at http://www.scirus.com/;
Swiss shop at http://www.expasy.ch/swiss-shop/ for proteins (results are sent via email).

Also: While Medline is incorporating more and more non-medical literature, there are still gaps in the coverage. Alternatives are other databanks available through the National Library of Medicine (here), through the ISI Web of Science and through local services.

The Web of Science databases allow you to search articles that cite a particular article or author.

Powerpoint slides on data banks are here (use only the first 5 slides)

Assignments:

[ A) write down your answers!
B) write your name on a piece of paper, fold it into a sign and put it on your desk!
C) if you need help with an assignment, move your name sign to the top of your screen ]

1. Use PubMed in NCBI's Entrez to find an article written by Carl R. Woese (famous scientist, codiscoverer of the archaea), published in the journal Proceedings of the National Academy of Sciences with the words primary kingdoms in the title of the paper. Try to use Boolean operators and field tags; if you cannot recall the tags, use the Preview/Index tool.
What query found the 1977 article?
How many related articles are linked to this article?
When was the most recent of the related articles published?
Search for the same article in Google Scholar. How many articles cited Woese's 1977 PNAS paper? When was the most recent citation?
In what order does Google Scholar list the articles that cite the Woese paper? (A consequence of this is that the rich get richer.)
In what order does Entrez list the related articles?
Dr. JP Gogarten seems obsessed by an important protein called ATP synthase. Is he interested in anything else? How many articles did he publish that are NOT related to the ATP synthase OR ATPase?
What query did you assemble?
How many articles did you find?

2. Find a paper co-authored by Senejani, Hilario and Gogarten published in BMC Biochemistry. What was the topic of the paper?
Display the abstract of this paper and click on book in the link menu (on top right of abstract). Items in the abstract that are covered in any of the reference books turn into hyperlinks. If you need more information on any of the items follow these links. What item did you look up? Was this helpful?

3. To what domain, phylum/kingdom and family does Thermoplasma belong? (Use the Taxonomy link in Entrez)

4. How many protein sequences are available for Thermoplasma acidophilum, and how many are available for the genus Thermoplasma? (In the taxonomy browser go to Thermoplasma and check protein in the header, then hit "return".)

5. Use Entrez to find a Protein sequence that is of interest to you. (If you don't find something of interest, use gi|405795).
How many related protein sequences does your sequence have (see the pulldown menu under LINK)?
How many related nucleotide sequences does your sequence have (see the pulldown menu under LINK)?
How many related nucleotide sequences does the nucleotide sequence have?
Explore the BLink page (results from a data bank search with this sequence).
What is shown on this page? (check where some of the links lead to)
What do the colors in the symbolic alignment on the right hand side signify?
Where do the three links in every entry link to?
Note: all of these results are already linked to your sequence, you did not need to perform a new search to get the results.
The symbolic alignment is particularly helpful if your protein consists of many different domains (go here for a striking example).

Challenge:
6. How many different archaeal RubisCO (= ribulose bisphosphate carboxylase oxygenase = rbcL = ribulose bisphosphate carboxylase large subunit) encoding genes can you find in the protein data bank? Pretend that you are only interested in RubisCO genes in Archaea, NOT in bacterial RubisCOs. (Archaea and Bacteria are the two domains of prokaryotes.)
One possibility is to utilize your clipboard at the NCBI. Start by selecting "protein" in ENTREZ. Explore different search strategies (names, fields, enzyme and substrate names, ...). Save positives to the clipboard. If you later go to the clipboard, you can retrieve related sequences. Remember, nobody claimed that this is a perfect world. It certainly is not easy to formulate a good search strategy. If you don't know whether an organism is an archaeon, click on the taxonomy link associated with most sequences.
How many different archaea that have a RubisCO homologue can you find?
Do many of these have more than one RubisCO gene?

7. If you have time and/or interest sign up for the pubcrawler service (see above) to send you a notice when a paper is published on something you are interested in.
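
For the Entrez searches in the assignments above, Boolean operators (AND, OR, NOT) and field tags in square brackets can be combined into a single query string. The sketch below only assembles query strings and the corresponding NCBI E-utilities esearch address; it does not contact the server. The helper names and the author and title terms are placeholders invented for illustration, not the answers to the assignments.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(term: str, field: str) -> str:
    """Attach an Entrez field tag to a search term, e.g. 'Doe J[Author]'."""
    return f"{term}[{field}]"


def all_of(*parts: str) -> str:
    """Require every part (Boolean AND)."""
    return "(" + " AND ".join(parts) + ")"


def any_of(*parts: str) -> str:
    """Accept any one of the parts (Boolean OR)."""
    return "(" + " OR ".join(parts) + ")"


def but_not(wanted: str, unwanted: str) -> str:
    """Keep hits for 'wanted' and exclude hits for 'unwanted' (Boolean NOT)."""
    return f"({wanted} NOT {unwanted})"


def esearch_url(db: str, query: str, count_only: bool = False) -> str:
    """Build (but do not send) an esearch request for the given databank."""
    params = {"db": db, "term": query}
    if count_only:
        # Ask only for the number of hits instead of a list of identifiers.
        params["rettype"] = "count"
    return ESEARCH_URL + "?" + urlencode(params)


# Placeholder literature query: a made-up author and made-up title words.
literature_query = but_not(
    all_of(tagged("Doe J", "Author"), tagged("example", "Title")),
    any_of(tagged("review", "Title"), tagged("erratum", "Title")),
)

# Restricting a protein search to a taxon uses the Organism field.
protein_query = tagged("Thermoplasma", "Organism")

print(literature_query)
print(esearch_url("pubmed", literature_query))
print(esearch_url("protein", protein_query, count_only=True))
```

The printed query string can also be pasted directly into the Entrez search box. Building the query from small pieces makes it easier to see which parenthesis belongs to which operator.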

	Essential Bioinformatics (Paperback) by Jin Xiong. Excellent book; it provides a very readable and concise overview of the most important tools and concepts in bioinformatics.
	Bioinformatics for Dummies by Jean-Michel Claverie. Excellent introductory bioinformatics book.
	Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Third Edition. Edited by Andreas D. Baxevanis and B. F. Francis Ouellette. The book covers many aspects of bioinformatics that we do not cover in class, but it is an excellent reference. The section on phylogenetics is weak, but you have your instructor to provide you with much more detail. Don't buy the 2nd edition by mistake! Excellent book to look up things and to consult if faced with a real-world problem. Covers many more techniques and approaches than we will in this course.
	Inferring Phylogenies by Joe Felsenstein. ISBN: 0-87893-177-5; $61.95 paper. Excellent book on phylogenetics and many aspects of population genetics (e.g., gene coalescence in populations, a topic that is rather relevant to species phylogenies in microorganisms :)). For most molecular biologists this is not exactly bedtime reading, but if you need a well-founded, thorough explanation, this is a good book to consult.
	Bioinformatics and Molecular Evolution (Paperback) by Paul G. Higgs and Teresa K. Attwood. The authors discuss in detail many applications in molecular evolution and bioinformatics. This book should be very useful to those who want to study some of the topics covered in this course in more detail.
	Molecular Evolution: A Phylogenetic Approach by Roderic D. M. Page and Edward C. Holmes. Price: $63.95. Paperback, 352 pages (October 1998), Blackwell Science Inc; ISBN: 0865428891. This book gives an excellent introduction to terms, methods, and problems in molecular evolution. It does not contain too many details on individual algorithms, but it provides a very readable overview. Rather expensive!