Assignments for Today:

Read http://en.wikipedia.org/wiki/BLAST and http://en.wikipedia.org/wiki/FASTA
(read through; you don't need to recall the formulas, but you should understand the principle)
Read through http://en.wikipedia.org/wiki/FASTA_format
http://en.wikipedia.org/wiki/Substitution_matrix and here (optional) for a curious twist
Read through the entry on the Bonferroni correction on Wikipedia: http://en.wikipedia.org/wiki/Bonferroni_correction (a concise version is here, a discussion of fishing expeditions is in the introduction here. This link is especially worth noting. Optional: Article in the New Yorker).

Assignments for Wednesday

Take-home exam #2 is due next Monday. If any questions aren't clear, we will discuss them on Wednesday.
Read http://en.wikipedia.org/wiki/Standard_score
Understand the difference between false positives and false negatives (see error types)

Discussion of Take Home Exam 1 (with answers) (Excel file for extra question, see Genbank release notes) -- Pan Genome and KS Plot slides here

DataBank Searches

Sequence and structure databanks can be divided into many different categories.
One of the most important is:

Supervised databanks with gatekeeper.

Examples:

Swissprot

Refseq (at NCBI)

Entries are checked for accuracy.
+ more reliable annotations
-- frequently out of date

Repositories without gatekeeper.

Examples:

GenBank

EMBL

TrEMBL

Everything is accepted.
+ everything is available
-- many duplicates
-- poor reliability of annotations

One problem in maintaining databanks (supervised and unsupervised) is "ownership" of sequences, which in many databanks prevents continuous updating of sequences. Even if errors are detected, they are not easily removed from the databank. E.g. ATP synthase operons in E. coli, see Fig. 1 in http://mic.microbiologyresearch.org/content/journal/micro/10.1099/mic.0.033811-0#tab2

Even species names are often wrongly assigned (slides)

If you can demonstrate significant similarity using randomization, your sequences are homologous (i.e. related by common ancestry). Convergent evolution has not been shown to lead to sequence similarities between complex sequences detectable through pairwise comparison.

When are two similar sequences significantly similar/homologous?
I.e.,
when can we infer that their similarity is due to homology and shared ancestry?
(The opposite of homology is analogy, which is due to convergent evolution.)

(Note: we will discuss alignment algorithms later, for now it is sufficient to know that given a scoring matrix and two sequences, one can calculate an alignment that has an optimal score)

One way to quantify the similarity between two sequences is to

1. compare the actual sequences and calculate the alignment score

2. randomize (scramble) one (or both) of the sequences and calculate the alignment score for the randomized sequences.

3. repeat step 2 at least 100 times

4. describe the distribution of the randomized alignment scores

5. do a statistical test to determine if the score obtained for the real sequences is significantly better than the score for the randomized sequences
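The five steps above can be sketched in a few lines of Python. This is only an illustration of the principle, not how PRSS is implemented: the scoring function is a bare-bones local alignment with made-up match, mismatch and gap values instead of a real substitution matrix, the two sequences are invented, and the function names are hypothetical.

```python
import random


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local (Smith-Waterman style) alignment score, linear gap penalty.

    The scoring values are illustrative assumptions, not a real matrix.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def shuffle_test(a, b, n_shuffles=200, seed=0):
    """Steps 1-5: score the real pair, then score a against shuffled copies of b."""
    rng = random.Random(seed)
    real = local_alignment_score(a, b)              # step 1
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):                     # steps 2 and 3
        rng.shuffle(letters)                        # keeps composition and length
        shuffled_scores.append(local_alignment_score(a, "".join(letters)))
    # Steps 4 and 5, in the simplest possible form: the fraction of shuffled
    # scores at least as good as the real one (+1 so the estimate is never 0).
    as_good = sum(s >= real for s in shuffled_scores)
    p_empirical = (as_good + 1) / (n_shuffles + 1)
    return real, shuffled_scores, p_empirical


# Two invented sequences that differ at a few positions.
seq_a = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
seq_b = "MKTAYLAKQRQISFVRSHFSRQMEERLGLIEVQ"
real, shuffled, p = shuffle_test(seq_a, seq_b)
print("real score:", real)
print("best shuffled score:", max(shuffled))
print("empirical P:", p)
```

Shuffling preserves the length and amino acid composition of the sequence, so the randomized scores show what the same composition produces by chance. With only 200 shuffles the empirical P cannot drop below 1/201, which is one reason to fit a distribution to the shuffled scores instead of just counting them.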

To illustrate the assessment of similarity/homology we will use a program from Pearson's FASTA package called PRSS.
This and many other programs by Bill Pearson can be found via his web page at http://fasta.bioch.virginia.edu/

A web version is available here. (Output of old PRSS is here)

Go through example. Sequences are here (fl), here (B), here (A) and here (A2)

There are many other alignment programs. BLAST is a program that is widely used and offered through the NCBI (go here for more info). It also offers to do pairwise comparisons (go here, do example).

To force the program to report an alignment, increase the E-value threshold.

An approach similar to PRSS is used in the FASTA database search. If one chooses to display a histogram of the search, the output includes the histogram of all the alignment scores obtained with the individual sequences contained in the database. Included are the actual sequence scores and the ones that are expected based on a probability distribution. An example is here.

Summary of Terminology:

E-values give the expected number of matches with an alignment score this good or better due to chance alone (no shared ancestry, no convergent evolution)

P-values give the probability of finding a match of this quality or better due to chance alone (no shared ancestry, no convergent evolution). The P-value is the probability of obtaining a score at least this good if the null hypothesis (similarity is due to chance alone) is true; it is not the probability that the null hypothesis itself is true. This probability is also known as the significance level at which the null hypothesis can be rejected.

P-values lie in [0,1], E-values in [0,infinity).

Both P- and E-values should take the size of the databank into consideration, and you should consider correcting for multiple searches to avoid "fishing expeditions".
For small values, E is approximately equal to P.
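Why E and P agree for small values can be seen from the usual conversion between the two. The sketch below assumes that the number of chance matches follows a Poisson distribution with mean E, so that the probability of at least one chance match is P = 1 - exp(-E); the function name is made up for this example.

```python
import math


def p_from_e(e_value):
    """P = 1 - exp(-E), assuming chance matches are Poisson distributed."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-e_value)


for e in (0.001, 0.01, 0.1, 1.0, 10.0):
    print(f"E = {e:<6} P = {p_from_e(e):.6f}")
```

For E well below 0.1 the two numbers are nearly identical; for larger E the P-value levels off below 1 while E keeps growing.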

z-values give the distance between the actual alignment score and the mean of the scores for the randomized sequences expressed as multiples of the standard deviation calculated for the randomized scores.
For example: a z-value of 3 means that the actual alignment score is 3 standard deviations better than the average for the randomized sequences. Z-values > 3 are usually considered as suggestive of homology, z-values > 5 are considered as sufficient demonstration. (see the "but" below). A discussion of z-values is here. A somewhat readable description of E, P, HSP and other values is here.
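As a small illustration of the definition, the z-value can be computed directly from the real score and a list of randomized scores. The scores below are invented for the example, and the function name is hypothetical.

```python
import statistics


def z_value(real_score, randomized_scores):
    """Distance of the real score from the randomized mean, in standard deviations."""
    mean = statistics.mean(randomized_scores)
    sd = statistics.stdev(randomized_scores)  # sample standard deviation
    return (real_score - mean) / sd


# Invented alignment scores for ten randomized sequences.
randomized = [18, 20, 22, 19, 21, 20, 23, 17, 20, 20]
print("z =", round(z_value(31, randomized), 2))
```

In practice one would use at least 100 randomized scores, as in the procedure above; ten are used here only to keep the example readable.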

BUT:
Failure to detect significant similarity only shows our inability to detect homology; it does not prove that the sequences are not homologous.

Examples:

Jim Knox (MCB-UConn) has studied many proteins involved in bacterial cell wall biosynthesis and antibiotic binding, synthesis or destruction. Many of these proteins have identical 3-D structure and can therefore be assumed to be homologous; however, the above tests fail to detect these homologies. (For example, enzymes with GRASP nucleotide binding sites are depicted here.)

DNA replication involves many different enzymes. Some of the proteins do the same thing in bacteria, archaea and eukaryotes; they have similar 3-D structures (e.g.: sliding clamp, E. coli dnaN and eukaryotic PCNA, see Edgell and Doolittle, Cell 89, 995-998), but again, the above tests fail to detect homology.

Helicase and F1-ATPase. Both form hexamers with something rotating in the middle (either the gamma subunit or the DNA; D. Crampton, pers. communication). The monomers have the same type of nucleotide binding fold (picture)

Discuss how the P values should be adjusted in case multiple tests are performed.

Types of Error in a Databank search

False positives: The number of false positives is estimated by the E-value. The P-value or significance value gives the probability that a positive identification is made in error (same as with drug tests).
Danger: avoid fishing expeditions. If you do 100 tests on random data, you expect one to be positive at the 1% significance level.

You could apply the Bonferroni correction:

The significance level for each individual test is calculated by dividing the overall desired significance level by the number of parallel tests. The hypothesis to be rejected is that none of the individual tests differs significantly from chance; rejecting it means that at least one of them does.
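The arithmetic behind the correction is short enough to sketch. The example below assumes that the tests are independent and that the data are pure chance; the function names are made up for this illustration.

```python
def bonferroni_threshold(overall_alpha, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    return overall_alpha / n_tests


def prob_at_least_one_false_positive(per_test_alpha, n_tests):
    """Chance of one or more false positives, assuming independent tests on random data."""
    return 1.0 - (1.0 - per_test_alpha) ** n_tests


n_tests = 100
overall_alpha = 0.01

# Without correction: each of the 100 tests is judged at the 1% level.
uncorrected = prob_at_least_one_false_positive(overall_alpha, n_tests)

# With correction: each test is judged at 0.01 / 100.
per_test = bonferroni_threshold(overall_alpha, n_tests)
corrected = prob_at_least_one_false_positive(per_test, n_tests)

print(f"expected false positives without correction: {overall_alpha * n_tests:.1f}")
print(f"P(at least one false positive), uncorrected: {uncorrected:.3f}")
print(f"per-test level after correction: {per_test:.6f}")
print(f"P(at least one false positive), corrected:   {corrected:.5f}")
```

The corrected probability of at least one false positive stays just under the overall 1% level, at the price of making each individual test much harder to pass.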

False negatives: Homologous sequences in the databank that are not recognized as such. If there are only 12000 different protein families, on average a sequence should have (size of the databank)/12000 matches. In other words, the number of false negatives is probably very large.

Discussion: Decay of significance. Can this be corrected?

Goals class 8:

Understand how to analyze exponential growth / decay.
Understand the problems that result from ownership of database entries (know a few examples).
Be able to discuss the process that may lead to the decay of significance.