Questions (yours) and Answers (mine, i.e. J. Peter Gogarten)

BLASTp compares an amino acid sequence against protein sequences, but the amino acids are encoded by a nucleotide sequence. I do not clearly see what the differences are between nucleotide sequences and protein sequences in a BLAST search. Do the protein comparisons consider protein folds or other structures?
Comparing protein sequences, one can look back in time 10 times further (Pearson's estimate) than when using nucleotide sequences. The reason is that text written in 20 letters is easier to align than text with only four different letters. As a consequence, if you align nucleotide sequences that encode proteins, it is better to align the proteins first and then align the nucleotides based on the multiple protein sequence alignment than to align the nucleotide sequences directly.
Protein BLAST and most multiple sequence alignment programs do not take protein structure into account. SAM <http://compbio.soe.ucsc.edu/sam.html> incorporates secondary structure prediction in finding an alignment. As a rule of thumb, if two sequences have significantly similar primary sequences, then they also have similar structures (the reverse is not true: similar structures do not imply similar sequences). Thus residues that align with one another in a multiple sequence alignment of significantly similar sequences usually also occupy similar (homologous) places in the respective protein structures.
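
The advice above (align the proteins first, then place the nucleotides according to the protein alignment) can be illustrated with a minimal sketch. It assumes that each coding sequence is in frame and corresponds exactly to its protein; the function name thread_codons and the toy sequences are made up for illustration and do not come from any particular program.

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Place the codons of an ungapped coding sequence onto one gapped
    row of a protein alignment (one codon per residue, '---' per gap)."""
    n_residues = sum(1 for c in aligned_protein if c != gap)
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is too short for the protein")
    out = []
    pos = 0  # position of the next unused codon in the coding sequence
    for c in aligned_protein:
        if c == gap:
            out.append(gap * 3)          # a protein gap becomes a codon-sized gap
        else:
            out.append(cds[pos:pos + 3])  # copy the codon for this residue
            pos += 3
    return "".join(out)


# Toy data (hypothetical): two aligned proteins and their coding sequences.
protein_alignment = {
    "seqA": "MK-LV",
    "seqB": "MKTLV",
}
coding_sequences = {
    "seqA": "ATGAAACTGGTT",
    "seqB": "ATGAAGACCCTCGTG",
}

codon_alignment = {
    name: thread_codons(row, coding_sequences[name])
    for name, row in protein_alignment.items()
}
for name, row in codon_alignment.items():
    print(name, row)
```

The nucleotide alignment inherits its gaps from the protein alignment, so gaps always fall between codons and the reading frame is preserved.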

The E-value gives the expected number of matches, but it is always very small when I run BLAST. Does this mean there is no match with an alignment score this good or better?
No, the E-value gives the expectation under the null hypothesis. If the query sequence did not have a true match in the databank, then one would still expect "E-value" many matches with a score this good or better due to chance alone. For small E-values, the E-value is approximately equal to the significance level P with which you can reject the null hypothesis.
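
To see why E and P nearly coincide for small values, here is a small sketch. It assumes the Poisson model that underlies BLAST statistics, in which the probability of finding at least one chance match with a score this good or better is P = 1 - exp(-E). The function name evalue_to_pvalue is my own label for illustration.

```python
import math


def evalue_to_pvalue(e_value: float) -> float:
    """P(at least one chance hit) = 1 - exp(-E), assuming Poisson-distributed hits."""
    # -expm1(-E) equals 1 - exp(-E) but keeps precision for very small E
    return -math.expm1(-e_value)


for e in (10.0, 1.0, 0.1, 1e-5, 1e-50):
    print(f"E = {e:g}  ->  P = {evalue_to_pvalue(e):.6g}")
```

For E = 1 the P-value is about 0.63, so the two numbers clearly differ; for E below roughly 0.01 they are the same for all practical purposes.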

> My major is computer science, and I have some difficulties in gathering biological data directly. Maybe I can only do something on computation, algorithm design, or mathematical modeling based on existing databases. Can I focus on these mathematical things? And would that be enough for a student project?
Definitely, a focus on mathematical and computational things is entirely appropriate. Previously a student used simulated sequence evolution to study how different algorithms handle long branches. Maria Poptsova and I wrote a paper on in silico transfer to test approaches to detect transferred genes (essentially we took an faa file from one organism and dropped a few additional genes into it). There are so many genomes and metagenomes available that it definitely is possible to do the student project on other people's data.
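
As a rough illustration of the "drop a few additional genes into an faa file" idea, here is a minimal sketch. It is not the procedure from the paper; the helper names (parse_fasta, spike_in), the tag added to the headers, and the toy sequences are all assumptions made for this example, and the FASTA text is kept in memory so the snippet runs on its own.

```python
import random


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def spike_in(recipient, donor, n_genes, seed=0):
    """Append n_genes randomly chosen donor proteins to the recipient set.

    Returns the mixed protein set and the headers of the transferred genes,
    so a detection method can later be scored against the known answer."""
    rng = random.Random(seed)  # fixed seed keeps the test set reproducible
    picked = rng.sample(donor, n_genes)
    tagged = [(h + " [in silico transfer]", s) for h, s in picked]
    return recipient + tagged, [h for h, _ in picked]


# Toy faa content (hypothetical sequences).
recipient_faa = ">r1\nMKLVT\n>r2\nMSTNP\nQQ\n"
donor_faa = ">d1\nMAAAK\n>d2\nMGGGR\n>d3\nMCCCH\n"

recipient = parse_fasta(recipient_faa)
donor = parse_fasta(donor_faa)
mixed, transferred = spike_in(recipient, donor, n_genes=2)

for header, seq in mixed:
    print(f">{header}\n{seq}")
```

Because the transferred genes are known, one can count how many of them a detection method recovers and how many native genes it flags by mistake.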

In the lecture, you discussed homology and analogy and their relationship to protein structure. My question is: if A and B are homologs, what will happen when I do a protein alignment?

If the homologs were identified based on significant similarity of their primary sequence, then you will obtain a good alignment. However, if the homology was established through PSI-BLAST followed by additional supporting evidence (structure, genomic location, gene neighborhood), then the alignment is often poor and unreliable.

I read a paper about protein-protein interaction networks, which said that such networks help to identify the function of proteins. Would this be helpful in determining homology?

Homology, and especially homology based on significant sequence
similarity, is a good predictor for similar function. This can be
improved by actually reconstructing phylogenies (so you know which sequences
are orthologs, and which ones are paralogs).

These days one possible outcome is that you find that the gene you are
interested in is an ortholog of a "conserved hypothetical protein".
Under these circumstances the fact of homology is not enlightening.
One cool technique that can help is phylogenetic profiling, i.e. you
screen for genes that in a set of genomes have the same presence/absence
distribution as your "conserved hypothetical protein". If
all the other proteins with the same distribution are part of a
metabolic pathway, or part of a multi enzyme complex, chances are high
that your "conserved hypothetical protein" is part of the same
pathway/ complex.
Genomic context also can help. If, for example, you know that a
protein is the substrate binding protein of an ABC transporter (using
homology inferred from similarity of the primary sequence), and in
all of the closely related homologs the ABC transporter genes are
preceded by genes for carbohydrate-hydrolyzing enzymes, and in the case you
study the neighboring gene encodes a laminarinase, then a good guess
is that the ABC transporter will actually bind and transport the
hydrolysis product of laminarin (beta 1->3 linked glucose or a related
sugar).
I am less confident about using experimentally determined
protein interaction networks for prediction; the main reason
is that they are not determined very reliably in experiments.
One possibility would be to study the evolution of transcription
factor binding motifs in a set of tightly co-regulated operons, or the
substitution rate of DNA stretches that bind regulatory proteins that
regulate replication or the termination of replication.
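
The phylogenetic profiling idea described above can be sketched in a few lines. This is only a toy illustration: the gene names, the genomes, and the 0/1 profiles are invented, and real analyses would first have to decide presence and absence from homology searches.

```python
def matching_profiles(profiles, query, max_mismatches=0):
    """Return (gene, mismatches) for genes whose presence/absence profile
    differs from the query's profile in at most max_mismatches genomes."""
    target = profiles[query]
    hits = []
    for gene, profile in profiles.items():
        if gene == query:
            continue
        # count genomes where one gene is present and the other is absent
        mismatches = sum(a != b for a, b in zip(target, profile))
        if mismatches <= max_mismatches:
            hits.append((gene, mismatches))
    return sorted(hits, key=lambda hit: hit[1])


# Columns correspond to six hypothetical genomes; 1 = present, 0 = absent.
genomes = ["G1", "G2", "G3", "G4", "G5", "G6"]
profiles = {
    "hypothetical_X":    (1, 0, 1, 1, 0, 0),
    "pathway_enzyme_1":  (1, 0, 1, 1, 0, 0),
    "pathway_enzyme_2":  (1, 0, 1, 1, 0, 1),
    "ribosomal_protein": (1, 1, 1, 1, 1, 1),
    "unrelated_gene":    (0, 1, 0, 0, 1, 1),
}

print(matching_profiles(profiles, "hypothetical_X"))
print(matching_profiles(profiles, "hypothetical_X", max_mismatches=1))
```

Genes that are present everywhere (like the ribosomal protein in this toy set) carry no information for profiling; the signal comes from genes with a patchy distribution.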

I am able to connect to the server by using the command
>sftp xxxx@bbcxsrv1.biotech.uconn.edu
and then entering my password.

When I use 'ls', it lists all of the files that I have on the server
directory, so I think I am connected. However, when I type "vi", the text
editor won't open, and it returns the error message 'invalid command'.
Also, the command 'qrsh **' returns the error message 'invalid command'.

I have no problem typing 'vi' and using the editor from the terminal when
I'm not connected to the server, so I don't know what the problem is.

Life is complicated sometimes.
sftp sets up a secure file transfer connection (ftp stands for file
transfer protocol). This allows you to get and put files between your
laptop and the server. ls and cd list and change the directory on the
server; lls and lcd do the same for your local directory. mget and mput allow you to transfer multiple files (as in mput *.faa).
Instead of sftp you could use Fugu or set up an AFP connection to the server.

sftp (or Fugu, or AFP) does not set up a terminal connection to the
server, which is why vi and qrsh return 'invalid command': they are
not sftp commands. For a terminal connection you need to set up an ssh
connection. In Terminal the command would be
> ssh xxx@bbcxsrv1.biotech.uconn.edu
(you could use Jellyfish to keep track of the ssh connections; this means
less typing if you go to the same server repeatedly).

For the Mac I recommend the use of Fugu, Jellyfish and TextWrangler.
To install these go to

http://www.grepsoft.net/products.php
click on download behind the Jellyfish icon, open the dmg disk image
(if it doesn't open automatically) and drag the Jellyfish icon into your
Applications folder. Jellyfish sets up terminal connections.

http://rsug.itd.umich.edu/software/fugu/
click on the download link, then on the version 1.2 link; the rest is as above.

What is phylogenetics? Does it mean building trees?

Equating phylogeny with trees is a widely propagated misconception. The origins of the word phylo-geny are Greek phylon, race or class, and Greek -geneia, from -genes, born (from the American Heritage Dictionary). Phylogeny describes how the larger taxonomic categories came into existence (as opposed to ontogeny, which describes how the individual organism comes into existence). Botanists discovered long ago that the origin of many species results from the fusion of genomes belonging to different parent species. They coined the term reticulate evolution. There is hardly any crop plant that is not allopolyploid (i.e. every cell contains copies of genomes from two different parent species).

Every eukaryotic cell represents the result of a fusion between at least two independent ancestors, an alpha proteobacterium that evolved into the mitochondrion, but whose genes nowadays mostly reside in the nucleus, and a host cell that was a close relative of the archaea. (There might have been many more organisms contributing genes over time, but except for the cyanobacteria, these additional contributors currently are less well defined.)

Many organisms are in fact microbial communities, whose members live in close association (e.g., lichens), and many (all?) microbial communities can be viewed as higher order entities with a shared genetic resource (open source genetics :-) ).

Especially for microorganisms (although gene transfer between very divergent angiosperms has also been reported recently), the evolutionary history of organisms is not tree-like; at best, it can be approximated by a tree. For more on this see
Gogarten, J. P., Doolittle, W. F., Lawrence, J. G. (2002). Prokaryotic evolution in light of gene transfer. Mol Biol Evol 19, 2226-2238.
Zhaxybayeva, O., Gogarten, J. P. (2004). Cladogenesis, Coalescence and the Evolution of the Three Domains of Life. Trends in Genetics 20, 182-187.
Zhaxybayeva, O., Swithers, K. S., Lapierre, P., Fournier, G. P., Bickhart, D. M., DeBoy, R. T., Nelson, K. E., Nesbø, C. L., Doolittle, W. F., Gogarten, J. P., Noll, K. M. (2009). On the Chimeric Nature, Thermophilic Origin and Phylogenetic Placement of the Thermotogales. Proc Nat Acad Sci USA 106(14), 5865-5870.

This is what Wikipedia currently says on phylogeny:
A phylogeny (or phylogenesis) is the origin and evolution of a set of organisms, usually of a species. A major task of systematics is to determine the ancestral relationships among known species (both living and extinct), and the most commonly used methods to infer phylogenies include cladistics, phenetics, maximum likelihood, and Bayesian.

During the late 19th century, the theory of recapitulation, or Haeckel's biogenetic law, was widely accepted. This theory was often expressed as "ontogeny recapitulates phylogeny", i.e. that the development of an organism exactly mirrors the evolutionary development of the species. The early version of this hypothesis has since been rejected as being oversimplified and misleading. However, modern biology recognizes numerous connections between ontogeny and phylogeny, explains them using evolutionary theory, and views them as supporting evidence for that theory. See the article on ontogeny and phylogeny.

Can we collaborate on our student projects?
Every student needs to hand in their own project. It should be based on the student's own work. Sources (literature, web pages, other students) need to be clearly indicated. Students should not cut and paste from the work of others without giving credit. Plagiarism represents serious academic misconduct.