Assignment for Friday's class:

Read through Using BLAST to Teach “E-value-tionary” Concepts (ppts here)

Assignments for Monday's class:

Read through file on frequently used formats here
Read the general Wikipedia entry on substitution matrices and the entries on PAM and BLOSUM matrices. Which would you use for closely related sequences, and which for divergent sequences (here)?
Optional: Dayhoff recoding is an elegant approach to avoid compositional bias in phylogenetic reconstruction. The PAM250 matrix with highlighting of the groups that are collapsed in the recoding is here.
Read through the PowerPoint slides from Using BLAST to Teach "E-value-tionary" Concepts.
Optional: Dan Graur wrote an introduction to his textbook "Intro to Molecular and Genome Evolution" (here) in which he argues that all of evolution boils down to changing allele frequencies. At least on a first reading, this appears to embrace the modern synthesis and does not consider symbiosis, holobionts and hologenomes (maybe one could argue that picking up a symbiont with new properties is equal to a mutation?). If you submit a 1-2 page* essay (12pt font, line spacing 1.5, 1 inch margins) discussing/critiquing Dan Graur's introduction by next Wednesday, it will be graded and may take the place of one of the takehome exams.
* Bibliography and figures do not count towards the page limit.

Types of Error in a Databank search

False positives: The number of false positives is estimated by the E-value. The P-value or significance value gives the probability that a positive identification is made in error, i.e. that a match at least this good arises by chance alone (same as with drug tests).
Danger: avoid fishing expeditions. If you do 100 tests on random data, you expect one to be positive at the 1% significance level.
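
To make the fishing-expedition danger concrete, here is a minimal Python sketch. It assumes the tests are independent and run on purely random data; the function names are hypothetical and chosen only for this illustration.

```python
def expected_false_positives(num_tests, alpha):
    """Expected number of tests on random data that come out positive by chance."""
    return num_tests * alpha


def prob_at_least_one_false_positive(num_tests, alpha):
    """Probability that at least one of num_tests independent tests is a false positive."""
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # 100 tests at the 1% significance level, as in the example above.
    print(expected_false_positives(100, 0.01))
    print(prob_at_least_one_false_positive(100, 0.01))
```

With 100 tests at the 1% level the expected number of false positives is one, and the chance of seeing at least one is well above one half.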

You could apply the Bonferroni correction:

The significance level for the individual test is calculated through dividing the overall desired significance level by the number of parallel tests.

The null hypothesis of the overall test that is to be rejected is that none of the individual tests is significantly different from chance. (The opposite of "none" being "at least one".)
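
The Bonferroni correction described above can be sketched in a few lines of Python. The function names and the example P-values are hypothetical and serve only as an illustration.

```python
def bonferroni_threshold(overall_alpha, num_tests):
    """Significance level for each individual test: overall level / number of tests."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return overall_alpha / num_tests


def significant_after_bonferroni(p_values, overall_alpha=0.05):
    """For each P-value, report whether it passes the Bonferroni-corrected threshold."""
    threshold = bonferroni_threshold(overall_alpha, len(p_values))
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Three made-up P-values tested at an overall significance level of 5%.
    print(bonferroni_threshold(0.05, 3))
    print(significant_after_bonferroni([0.0001, 0.01, 0.04], overall_alpha=0.05))
```

In this made-up example the individual threshold is 0.05/3, so the P-value of 0.04, which would pass an uncorrected 5% test, no longer counts as significant.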

False negatives: Homologous sequences in the databank that are not recognized as such. If there are only 12000 different protein families, on average a sequence should have (size of the databank)/12000 matches. In other words, the number of false negatives is probably very large.
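
As a rough illustration of this estimate, the following Python sketch computes the expected number of homologs per query. The databank size and the number of hits found are made-up numbers, the function names are hypothetical, and the calculation assumes that sequences are spread evenly over the families, which is certainly not true in practice.

```python
def expected_homologs(databank_size, num_families=12000):
    """Average number of homologs per query if sequences are spread evenly over families."""
    return databank_size / num_families


def estimated_false_negatives(databank_size, hits_found, num_families=12000):
    """Homologs expected in the databank minus those actually found (never below zero)."""
    return max(0.0, expected_homologs(databank_size, num_families) - hits_found)


if __name__ == "__main__":
    # Hypothetical databank with 1.2 million sequences and a search that found 15 hits.
    print(expected_homologs(1_200_000))
    print(estimated_false_negatives(1_200_000, hits_found=15))
```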

Discussion of problems with databanks:

Sequence and structure databanks can be divided into many different categories.
One of the most important is:

Supervised databanks with gatekeeper.

Examples:

Swissprot

Refseq (at NCBI)

Entries are checked for accuracy.
+ more reliable annotations
-- frequently out of date

Repositories without gatekeeper.

Examples:

GenBank

EMBL

TrEMBL

Everything is accepted.
+ everything is available
-- many duplicates
-- poor reliability of annotations

One problem in maintaining databanks (supervised and unsupervised) is "ownership" of sequences, which in many databanks prevents a continuous update of sequences. Even if errors are detected, they are not easily removed from the databank.
Example 1: ATP synthase operons in E. coli; see Fig. 1 in http://mic.microbiologyresearch.org/content/journal/micro/10.1099/mic.0.033811-0#tab2
Example 2: Even species names are often wrongly assigned (slides)

Slides on Margaret Dayhoff and the origins of GenBank

PowerPoint slides on BLAST

If time:
Discussion:
Meaning of phylogeny.
sequence space