Assignments for Wednesday's class:

Read through the article by Tillier and Collins on Genome rearrangement by replication-directed translocation (also available on HuskyCT). Try to understand Figures 1 and 2. Can you think of an alternative explanation?

Assignment for Friday's class:

Go through today's blast slides and think about how you will transfer files back and forth from the cluster.

Discussion

PAM versus Blosum

Which to use for divergent sequences?
What is the PAM/Blosum matrix with the highest number?

Illustrations of homologs that do not show significant sequence similarity in pairwise comparisons:

Jim Knox (MCB-UConn) has studied many proteins involved in bacterial cell wall biosynthesis and antibiotic binding, synthesis or destruction. Many of these proteins have identical 3-D structures, and therefore can be assumed to be homologous; however, tests based on pairwise sequence comparisons fail to detect these homologies. (For example, enzymes with GRASP nucleotide binding sites are depicted here.)

DNA replication involves many different enzymes. Some of the proteins do the same thing in bacteria, archaea and eukaryotes and have similar 3-D structures (e.g., the sliding clamp: E. coli dnaN and eukaryotic PCNA; see Edgell and Doolittle, Cell 89, 995-998), but again, tests based on pairwise sequence comparisons fail to detect homology.

Helicase and F1-ATPase. Both form hexamers with something rotating in the middle (either the gamma subunit or the DNA; D. Crampton, pers. communication). The monomers have the same type of nucleotide binding fold (picture).

blast and commandline blast (the slides contain links that become accessible only after you switch to presentation mode)

Discussion 2

E-values and multiple tests

If you select two sequences from the database and calculate their pairwise alignment score, what would be a useful null hypothesis for assessing the significance of that score?
How is this null hypothesis implemented in PRSS and FASTA?
Are E-values and P-values measures of false positives or of false negatives?
Assume you have 100 students who repeat this exercise. What would be the expected number of false positives if each individual test is required to pass the 1% significance level?
What would you need to do to have false positives with an overall (for all 100 students) rate of 1%? Which significance level would the individual experiment need to pass?
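
One way to make the null hypothesis question concrete is a shuffling test in the spirit of PRSS: keep the first sequence, shuffle the second many times (this preserves length and composition but destroys the order), and ask how often a shuffled sequence scores at least as well as the real one. The sketch below is only an illustration under stated assumptions. The function names, the simple match/mismatch/gap scores, and the number of shuffles are made up for this example; the real programs use substitution matrices such as PAM or Blosum and estimate significance from a fitted score distribution rather than from a raw count.

```python
import random


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not PAM/Blosum scores.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def shuffle_p_value(a, b, n_shuffles=200, seed=0):
    """Estimate how often a shuffled version of b scores >= the real score."""
    rng = random.Random(seed)
    observed = local_alignment_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same composition, random order
        if local_alignment_score(a, "".join(letters)) >= observed:
            hits += 1
    # Add-one estimate so the P-value is never reported as exactly zero.
    return observed, (hits + 1) / (n_shuffles + 1)


# Hypothetical toy sequence: each of the 20 amino acids exactly once.
toy = "ACDEFGHIKLMNPQRSTVWY"
observed_score, p_value = shuffle_p_value(toy, toy)
print(observed_score, p_value)
```

Note that with 200 shuffles the smallest P-value this sketch can report is 1/201, which is one reason to fit a distribution to the shuffled scores instead of only counting.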

Types of Error in a Databank search

False positives: The number of false positives is estimated by the E-value. The P-value or significance value gives the probability that a positive identification is made in error (same as with drug tests).
Danger: avoid fishing expeditions. If you do 100 tests on random data, you expect one to be positive at the 1% significance level.

You could apply the Bonferroni correction:

The significance level for the individual test is calculated by dividing the overall desired significance level by the number of parallel tests. The null hypothesis of the overall test that is to be rejected is that none of the individual tests is significantly different from chance (the opposite of "none" being "at least one").
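
The arithmetic behind the 100-student example can be sketched as follows. The function names are made up for this illustration, and the formula for the chance of at least one false positive assumes that the individual tests are independent, which real databank searches may not satisfy.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives among n_tests tests on random data."""
    return n_tests * alpha


def chance_of_at_least_one(n_tests, alpha):
    """Probability of one or more false positives, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_level(overall_alpha, n_tests):
    """Significance level each individual test must pass (Bonferroni)."""
    return overall_alpha / n_tests


n_students = 100
print(expected_false_positives(n_students, 0.01))  # one false positive expected
print(chance_of_at_least_one(n_students, 0.01))    # roughly 0.63
per_test = bonferroni_level(0.01, n_students)      # 0.0001
print(chance_of_at_least_one(n_students, per_test))  # just under 0.01
```

With the corrected level of 0.0001 per test, the chance that any of the 100 tests comes out positive by chance stays just below the desired overall 1%.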

False negatives: Homologous sequences in the databank that are not recognized as such. If there are only 12000 different protein families, on average a sequence should have (size of the databank)/12000 matches. In other words, the number of false negatives is probably very large.

Each research group applies significance testing on its own. How can this lead to the decay of significance? How can this be corrected?

If time:

short demo on Zotero
sequence space

Goals class 10

Understand how the databanks at the NCBI are different from flatfile and relational databanks.
Be able to discuss the advantages of the commandline in general and blast searches via the commandline in particular.
also see the goals from class 9