Types of Error in a Databank Search
False positives: The number of false positives is estimated by the E-value, the number of matches with at least the observed score that are expected by chance. The P-value, or significance value, gives the probability that a positive identification is made in error (the same as with drug tests).
False negatives: Homologous sequences in the databank that are not recognized as such. If there are only 12000 different protein families, a sequence should have, on average, (size of the databank)/12000 matches. A search usually reports far fewer significant matches than this; in other words, the number of false negatives is probably very large.
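The two quantities above can be made concrete with a little arithmetic. The following is a minimal sketch, not a definitive implementation. It assumes that chance hits follow a Poisson distribution with mean E (the model used in BLAST statistics), so that P = 1 - exp(-E), and it takes the uniform family size from the text at face value. The function names and the databank size of 1,200,000 sequences are hypothetical, chosen only for illustration.

```python
import math


def p_value_from_e_value(e_value: float) -> float:
    """Probability of at least one chance hit, assuming chance hits are
    Poisson-distributed with mean E: P = 1 - exp(-E)."""
    # expm1 keeps precision for very small E, where P is almost equal to E
    return -math.expm1(-e_value)


def expected_matches(databank_size: int, n_families: int = 12000) -> float:
    """Average number of homologs per sequence if the databank were split
    evenly among n_families protein families (assumption from the notes)."""
    return databank_size / n_families


if __name__ == "__main__":
    for e in (1e-5, 0.1, 1.0, 10.0):
        print(f"E = {e:g}  ->  P = {p_value_from_e_value(e):.6g}")

    size = 1_200_000  # hypothetical databank size
    print(f"Expected homologs per sequence: {expected_matches(size):.0f}")
```

For small E-values the P-value and the E-value are nearly identical; they only diverge once E approaches 1. Comparing the expected number of homologs with the number of significant hits a search actually reports gives a rough feeling for how many false negatives there are.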
Discussion of problems with databanks:
Sequence and structure databanks can be divided into many different categories. One of the most important distinctions is between supervised and unsupervised databanks.
One problem in maintaining databanks (supervised and unsupervised) is "ownership" of sequences, which in many databanks prevents a continuous update of sequences. Even if errors are detected, they are not easily removed from the databank.
Example 1: ATP synthase operons in E. coli; see Fig. 1 in http://mic.microbiologyresearch.org/content/journal/micro/10.1099/mic.0.033811-0#tab2
Example 2: Even species names are often wrongly assigned (slides)
Slides on Margaret Dayhoff and the origins of GenBank
PowerPoint slides on BLAST
If time:
Discussion:
Meaning of phylogeny.
sequence space