- Read through Using BLAST to Teach “E-value-tionary” Concepts (ppts here)

- If you have not done so, read through the file on frequently used formats here
- Read the general Wikipedia entry on substitution matrices and the entries on PAM and BLOSUM matrices (here). Which would you use for closely related sequences, and which for divergent sequences?
- Read through the PowerPoint slides from Using BLAST to Teach “E-value-tionary” Concepts

If you can demonstrate significant similarity using randomization, your sequences are homologous (i.e., related by common ancestry). Convergent evolution has not been shown to lead to sequence similarities between complex sequences that are detectable through pairwise comparison. When are two similar sequences significantly similar, and therefore homologous? (Note: we will discuss alignment algorithms later; for now it is sufficient to know that, given a scoring matrix and two sequences, one can calculate an alignment that has an optimal score.) One way to quantify the similarity between two sequences is to
1. compare the actual sequences and calculate the alignment score
2. randomize (scramble) one (or both) of the sequences and calculate the alignment score for the randomized sequences
3. repeat step 2 at least 100 times
4. describe the distribution of the randomized alignment scores
5. do a statistical test to determine whether the score obtained for the real sequences is significantly better than the scores for the randomized sequences
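
The five steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names local_alignment_score and shuffle_test are hypothetical, and the simple match/mismatch/gap scores stand in for a real substitution matrix such as PAM or BLOSUM. The alignment score is computed with a basic Smith-Waterman local alignment, which is one possible choice and not necessarily the algorithm used in class.

```python
import random
import statistics


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty).

    The match/mismatch/gap values are illustrative assumptions, not a
    real substitution matrix.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + pair, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def shuffle_test(a, b, n_shuffles=100, seed=0):
    """Compare the real score with scores for shuffled copies of b.

    Returns (real score, mean and standard deviation of the shuffled
    scores, z-value, empirical P-value). Needs n_shuffles >= 2.
    """
    rng = random.Random(seed)
    real = local_alignment_score(a, b)          # step 1
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):                 # steps 2 and 3
        rng.shuffle(letters)
        shuffled_scores.append(local_alignment_score(a, "".join(letters)))
    mean = statistics.mean(shuffled_scores)     # step 4
    sd = statistics.stdev(shuffled_scores)
    # step 5: z-value, undefined (nan) if all shuffled scores are equal
    z = (real - mean) / sd if sd > 0 else float("nan")
    # Empirical P-value; the +1 counts the real score as one sample,
    # so the estimate is never exactly 0.
    as_good = sum(s >= real for s in shuffled_scores)
    p = (1 + as_good) / (n_shuffles + 1)
    return real, mean, sd, z, p


if __name__ == "__main__":
    # Two made-up peptide sequences, used only to show the call.
    print(shuffle_test("MKTAYIAKQRQISFVKSHFSRQ", "MKTAYLAKQRNISFVKAHFSRQ"))
```

Shuffling keeps the composition and length of the sequence but destroys the order of the residues, so the shuffled scores approximate what chance alone produces for sequences of this composition. With 100 shuffles the smallest empirical P-value this sketch can report is 1/101; more shuffles are needed to resolve smaller values.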
There are many other alignment programs. BLAST is widely used and is offered through the NCBI (go here for more info). It can also do pairwise comparisons (go here, do example). To force the program to report an alignment, increase the E-value threshold.
E-values give the expected number of matches with an alignment score this good or better due to chance alone (no shared ancestry, no convergent evolution).
P-values give the probability of finding a match of this quality or better due to chance alone (no shared ancestry, no convergent evolution). In other words, the P-value is the probability of obtaining a score at least this good if the null hypothesis (similarity is due to chance alone) is true; it is not the probability that the null hypothesis itself is true. It is also the smallest significance level at which the null hypothesis can be rejected. P-values lie in [0,1]; E-values lie in [0,infinity).
z-values give the distance between the actual alignment score and the mean of the scores for the randomized sequences, expressed as multiples of the standard deviation calculated for the randomized scores. For example, a z-value of 3 means that the actual alignment score is 3 standard deviations better than the average for the randomized sequences. z-values > 3 are usually considered suggestive of homology; z-values > 5 are considered sufficient demonstration (see the "but" below). A discussion of z-values is here. A somewhat readable description of E, P, HSP and other values is here.
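
The link between E-values and P-values can be made concrete with a short sketch. It assumes that the number of chance matches scoring at least as well as the observed one follows a Poisson distribution with mean E (the assumption behind BLAST's statistics), so that P = 1 - exp(-E). The function names p_from_e and e_from_p are hypothetical.

```python
import math


def p_from_e(e_value):
    """P-value for a given E-value, assuming Poisson-distributed chance hits.

    P = 1 - exp(-E) is the probability of at least one chance match.
    expm1 keeps precision when E is very small.
    """
    return -math.expm1(-e_value)


def e_from_p(p_value):
    """Inverse relation: E = -ln(1 - P), defined for 0 <= P < 1."""
    return -math.log1p(-p_value)


if __name__ == "__main__":
    for e in (1e-10, 0.01, 1.0, 10.0):
        print(e, p_from_e(e))
```

For small values (roughly below 0.01) E and P are nearly identical. For large values they diverge: E can grow without bound, while P approaches 1.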
Examples:
Jim Knox (MCB-UConn) has studied many proteins involved in bacterial cell wall biosynthesis and antibiotic binding, synthesis or destruction. Many of these proteins have identical 3-D structure and can therefore be assumed to be homologous; however, the above tests fail to detect these homologies. (For example, enzymes with GRASP nucleotide binding sites are depicted here.)
DNA replication involves many different enzymes. Some of the proteins do the same thing in bacteria, archaea and eukaryotes; they have similar 3-D structures (e.g.: sliding clamp,
Discuss how the P-values should be adjusted in case multiple tests are performed.
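
As one starting point for this discussion, the sketch below shows the Bonferroni correction, which multiplies each P-value by the number of tests performed and caps the result at 1. It is only one of several possible adjustments and is rather conservative; the function name bonferroni is hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted P-values: p * number of tests, capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


if __name__ == "__main__":
    # Made-up P-values from three hypothetical pairwise comparisons.
    print(bonferroni([0.01, 0.04, 0.5]))
```

The idea is similar to the way an E-value grows with the size of the database searched: the more comparisons are made, the more matches of a given quality are expected by chance alone.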

Powerpoint slides on