Questions (yours) and Answers (mine, i.e. J. Peter Gogarten)

I feel like the simplest way to do the assignment would be to use the "sed" command to change the header to the correct format, but I keep getting hung up on the specifics. The code I've written so far is as follows:
#!/usr/bin/perl
use warnings;
system (grep ">" glcosylTransferases.fasta | sed 's/gi\|(\d+)\|ref|.|/gi\|(\d+)\_/);
I'm not sure how to get the program to recognize the gi number (between > and |) and the species (between []). Do I have to declare these as variables for the program to recognize them? I suppose if it was this simple I could just use a unix command and not bother with a perl script, so maybe I'm completely oversimplifying it.

I am not sure that there is an easy way to do this using sed and grep, because you need to rearrange the order in which things are encountered in the comment line. Using a pattern match in Perl, as in
$in =~ m/gi\|(\d+)\|/; #matches gi|number| and captures the number in $1; $& contains gi|number|
would work much better and more elegantly.
You could use \w+ for the species name (see P17 in the text book). Note that \w+ stops at the first space, so for a species name of several words it is safer to capture everything between the brackets, as in \[([^\]]+)\]
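The pattern-match approach can be sketched as follows. This is only a minimal sketch under stated assumptions: the sub name reformat_header is hypothetical, the example header is made up, and the output layout (species name, underscore, gi number, then the original comment) is an assumption, so adjust it to whatever format the assignment asks for. The species name is captured with \[([^\]]+)\] rather than \w+, so that names of several words are kept whole.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: rewrites one FASTA comment line, leaves other lines alone.
# Assumed input:  >gi|12345|ref|XP_000001.1| glycosyl transferase [Homo sapiens]
# Assumed output: >Homo_sapiens_12345 gi|12345|ref|XP_000001.1| ...
sub reformat_header {
    my ($in) = @_;
    chomp $in;
    return $in unless $in =~ /^>/;                 # sequence lines pass through

    my ($gi)      = $in =~ /gi\|(\d+)\|/;          # number between gi| and |
    my ($species) = $in =~ /\[([^\]]+)\]\s*$/;     # text inside the last [ ]
    return $in unless defined $gi && defined $species;

    $species =~ s/\s+/_/g;                         # spaces become underscores
    (my $rest = $in) =~ s/^>//;                    # original comment without >
    return ">${species}_$gi $rest";
}

# Made-up example lines, not taken from the course file
my @lines = (
    ">gi|12345|ref|XP_000001.1| glycosyl transferase [Homo sapiens]\n",
    "MKVLAAGIVG\n",
);
print reformat_header($_), "\n" for @lines;
```

To run this on a real file, replace the @lines array with a loop that reads the FASTA file line by line.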

But as usual, there are many ways to do the same thing. At the moment I use a very ancient Perl script for this task, and it certainly is not the most elegant solution (actually it is pretty awful, because the writer was too lazy to look up the pertinent regular expressions).
First, after opening the input file and reading it line by line, the script identifies the comment lines using
if ($in =~ '\>'){} #works, but does not insist that the line actually begins with >; looking for /^\>/ would be better
After the > is removed (substituted with nothing), the line is split on the [ character:
@temp=split ('\[',$in);
The result in $temp[1] is split again on ] :
@temp2= split ('\]',$temp[1]);
This is horrible code, but $temp2[0] now holds the species name, in which spaces are substituted with _ using
$temp2[0] =~ s/ /_/g; #using \s+ would be nicer and more foolproof
Then the script uses the same approach on the | symbol that separates the gi number:
@temp3 = split ('\|',$in);
Then the output is
print ">$temp2[0]$temp3[1] $in";
As I said, pretty awkward, but it works!
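Put together, the split-based steps described above might look roughly like this. It is a sketch and not the original script: the sub name split_header and the example header are made up, and the in-memory filehandle merely stands in for the real input file so that the example runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical reconstruction of the split-based approach.
sub split_header {
    my ($in) = @_;
    chomp $in;
    return $in unless $in =~ /^>/;       # only comment lines are changed
    $in =~ s/^>//;                       # remove the leading >

    my @temp = split /\[/, $in;          # $temp[1] starts with the species name
    my @temp3 = split /\|/, $in;         # $temp3[1] is the gi number
    return ">$in" unless defined $temp[1] && @temp3 > 1;

    my @temp2 = split /\]/, $temp[1];    # $temp2[0] is the species name
    $temp2[0] =~ s/\s+/_/g;              # spaces become underscores

    return ">$temp2[0]$temp3[1] $in";    # same layout as the print statement above
}

# Made-up two-line FASTA record, read line by line from an in-memory filehandle
my $fasta = ">gi|12345|ref|XP_000001.1| glycosyl transferase [Homo sapiens]\nMKVLAAGIVG\n";
open my $fh, '<', \$fasta or die "cannot open in-memory file: $!";
while (my $line = <$fh>) {
    print split_header($line), "\n";
}
close $fh;
```

Note that species name and gi number are joined without a separator, exactly as in the print statement; inserting an underscore between them would make the new header easier to read.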

I also have a question about the next homework assignment. In the slides, you say that you give us a multiple sequence file for extracting the GI numbers, but there is no hyperlink to download the file. Is this a specific file you'd like us to use, or will any multiple-sequence fasta formatted file work?

I updated the link, or you can go directly to http://gogarten.uconn.edu/mcb5472_2012/glcosylTransferases.fasta or use a file of your choice.

I am able to connect to the server by using the command
>sftp xxxx@bbcxsrv1.biotech.uconn.edu
and then entering my password.

When I use 'ls', it lists all of the files that I have in my server
directory, so I think I am connected. However, when I type "vi", the text
editor won't open, and it returns the error message 'invalid command'.
Also, the command 'qrsh **' returns the error message 'invalid command'.

I have no problem typing 'vi' and using the editor from the terminal when
I'm not connected to the server, so I don't know what the problem is.

Life is complicated sometimes.
sftp sets up a secure file transfer connection (ftp stands for file
transfer protocol). This allows you to get and put files between your
laptop and the server. ls and cd list and change directory on the
server; lls and lcd do the same for your local directory. mget and mput allow you to transfer multiple files (as in mput *.faa).
Instead of sftp you could use FileZilla or set up an afp connection to the server.

sftp (or Fugu, or afp) does not set up a terminal connection to the
server, which is why vi and qrsh are reported as invalid commands.
For this you need to set up an ssh connection. In the terminal
the command would be
> ssh xxx@bbcxsrv1.biotech.uconn.edu
(You could use Jellyfish to keep track of the ssh connections; this means less
typing if you go to the same server repeatedly.)

For a Mac I recommend the use of FileZilla, Jellyfish and TextWrangler.
To install these go to

http://www.grepsoft.net/products.php
click on download behind the Jellyfish icon, open the dmg disk image
(if it doesn't open automatically) and drag the Jellyfish icon into your
Applications folder. Jellyfish sets up terminal connections.

http://rsug.itd.umich.edu/software/fugu/
click on the download link, then on the latest version; the rest is as above.

http://filezilla-project.org/
click on the download button and follow the instructions.

What is phylogenetics?  Does it mean building trees? 

Equating trees with phylogeny is a widely propagated misconception.  The origins of the word phylo-geny are Greek phylon, race or class, and Greek -geneia, from -genes, born (from the American Heritage Dictionary).  Phylogeny describes how the larger taxonomic categories came into existence (as opposed to ontogeny, which describes how the individual organism comes into existence).  Botanists discovered long ago that the origin of many species results from the fusion of genomes belonging to different parent species.  They coined the term reticulate evolution.  There is hardly any crop plant that is not an allopolyploid (i.e. every cell contains copies of genomes from two different parent species. More here.).

Every eukaryotic cell represents the result of a fusion between at least two independent ancestors, an alpha proteobacterium that evolved into the mitochondrion, but whose genes nowadays mostly reside in the nucleus, and a host cell that was a close relative of the archaea.  (There might have been many more organisms contributing genes over time, but except for the cyanobacteria, these additional contributors currently are less well defined.)

Many organisms are in fact microbial communities, whose members live in close association (e.g., lichen), and many (all?) microbial communities can be viewed as higher order entities with a shared genetic resource (open source genetics :-)).

Especially for microorganisms (but see here and here for recent examples of gene transfer between very divergent angiosperms) the evolutionary history of organisms is not tree-like; at best, it can be approximated by a tree.  For more on this see
Gogarten, J. P., Doolittle, W. F., Lawrence, J. G. (2002). Prokaryotic evolution in light of gene transfer. Mol Biol Evol 19, 2226-2238.
Zhaxybayeva, O., Gogarten, J. P. (2004). Cladogenesis, Coalescence and the Evolution of the Three Domains of Life. Trends in Genetics 20, 182-187
Zhaxybayeva O, Swithers KS, Lapierre P, Fournier GP, Bickhart DM, DeBoy RT, Nelson KE, Nesbø CL, Doolittle WF, Gogarten JP, and Noll KM (2009) On the Chimeric Nature, Thermophilic Origin and Phylogenetic Placement of the Thermotogales. Proc Nat Acad Sci USA 106(14):5865-70

This is what Wikipedia currently says on phylogeny:
A phylogeny (or phylogenesis) is the origin and evolution of a set of organisms, usually of a species. A major task of systematics is to determine the ancestral relationships among known species (both living and extinct), and the most commonly used methods to infer phylogenies include cladistics, phenetics, maximum likelihood, and Bayesian.

During the late 19th century, the theory of recapitulation, or Haeckel's biogenetic law, was widely accepted. This theory was often expressed as "ontogeny recapitulates phylogeny", i.e. that the development of an organism exactly mirrors the evolutionary development of the species. The early version of this hypothesis has since been rejected as being oversimplified and misleading. However, modern biology recognizes numerous connections between ontogeny and phylogeny, explains them using evolutionary theory, and views them as supporting evidence for that theory. See the article on ontogeny and phylogeny.

Can we collaborate on our student projects?
Every student needs to hand in their own project. It should be based on the student's own work. Sources (literature, web pages, other students) need to be clearly indicated. Students should not cut and paste from the work of others without giving credit. Plagiarism represents serious academic misconduct.