MCB 5472 : sequence alignment

Please send your answers per email to gogarten@uconn.edu, or hand in a hardcopy

You should answer the questions in red!

In today's class, we will begin by using the multiple alignment programs clustalw and muscle to align some sequences. Both of these programs are installed on the cluster. Try clustalw -help for the command-line options. Typing clustalw will bring up an interactive menu. Or you could use the newer version by typing clustalw2 instead of clustalw)

Typing muscle will list its command-line options. The muscle usermanual is availbale at http://www.drive5.com/muscle/muscle_userguide3.8.html

ClustalX can be used to view sequence alignments, and you also can use it as a GUI to calculate alignments and trees (only neighborjoining, but fast with bootstrap suppport). You can get it from here. There are both Mac and Windows versions available - if you try to load a file that seems to have the correct format, but is loaded into a single sequence only, open the file in text wrangler and save it as a unix file.

Seaview has been recently integrated with quite good phylogenetic reconstruction software. Its strenghts is a good graphics user interface, and the easy with which you can sample sequences (to study the impact of taxon sampling) and positions (e.g., you can exclude portions of a sequence, such as an intein, from an anlysis, or perfome a separate analysis for these sites. Seaview is availabe at http://pbil.univ-lyon1.fr/software/seaview.html .

The datasets:

We will mainly work with Thermotogales 23S.fna. This is a set of 23SRNA genes encoded in the genomes of different species of Thermotogales in multiple FASTA format.

[If you want to, you can use these datasets for comparison: Thermotogales 16S.fna, Thermotogales RpoC.faa, Thermotogales LeuRS.faa]

Read this section first, develop a strategy, then do it:
Goal is to add at least 5 additional homologous sequences to each of the datasets. Open the data files, copy paste one of the sequences (chose one of the shorter sequences, some of the longer ones have introns) and do a blast search of nr on the web.
(Under organisms you can select to exclude (include) the Thermotogales and to include only the Bacteria; if you leave Megablast as an option, you might be faster; if you select single genes, not genomes, you can down load them using the checkmarks in the alignment section -- this works better for the protein sequences).
Add at least 5 sequences from related bacteria to each dataset. If you can find them, add the homologous sequences from other Thermotogales, Aquifex, Thermus, Thermoanaerobacter, some other Clostridia, Bacillus.
(You might make things easier for you by using the send to clipboard feature of the NCBI web page -- this is NOT the clipboard on your iMac, but a clipboard at the NCBI, you can reformat all your selected sequences at once and copy/paste them to your sequence dataset.
One way to get to the rRNA coding sequences only is to click on the Max Score of the hit, which moves you to the alignment, if you click on the "features link" you get the sequence of the rRNA only, which you then can display as a Fasta sequence.
Naming sequences is difficult. It is a good idea to keep a fasta formated multiple sequence file that for each sequence includes the GI number and the positon of the sequence. In case the features link is not available (for recent sequences) you need to download the whole genome, with sequence, search for 23S, select RNA, copy the highlighted sequence. IF IT says that the complement strand encodes the rRNA, you need to calculate the reverse complement of the highlighted sequence. A website that does this is at http://reverse-complement.com/ .
The annotation line of most sequences from the NCBI conforms to a standard that begins with a GI number and ends with a species/strain designation.
Clustal uses the annotation line up to the first white space as name of the sequences. However, many other programs only use the first 10 letters. Thus it is important to have the first ten letters unique and meaningful (if you look at the aignment or a tree, you want to easily be able to figure out what is what).

Why for many of the genomes are several identical hits reported for the rRNA queries?

After you added the sequences, move the file to the cluster into an appropriately named subdirectory. (Don't forget to move away from the headnode!) On the cluster, align the sequences with clustalw2. If in doubt, use the default settings.

For each run you obtain two files as output: *.aln is the alignment, and *.dnd is the guide tree.

Now let's align the 23S dataset with muscle:

muscle -in 23S.fna -out 23S.muscle

Now we are going to compare the ClustalW and Muscle alignments. One way to "improvise" is to load both alignments into ClustalX, so they will both be on the screen - one under the other. A slight complication is that the sequences have the same name (ClustalX requires them to be unique), so let's make a small change to the muscle alignment. From the command-line, type:

sed "s/^>/>muscle/g" 23S.muscle > 23S.muscle.renamed

This will make a new file with a ".renamed" suffix. In this modified file, containing the Muscle alignment, all the FASTA sequence names will have "muscle" prepended to them. Sed is the UNIX stream editor. Type "man sed" for details. In this case, it is matching a ">" at the beginning of a line ("^" matches the beginning of a line), and substituting it with ">muscle". The "g" means it should make this replacement globally (throughout the entire file).

Move the aligned sequence files to you local computer. Start ClustalX. From the menu, select File... Load Sequences and the ClustalW alignment (.aln suffix).

Now we add the muscle alignment (with modified names) to the ClustalW alignment we previously loaded into ClustalX. Go back to the ClustalX screen (the ClustalW alignment should be already on the screen), and select File... Append Sequences. Choose the "23S.muscle.renamed" alignment.

You should now see both alignments on the screen. Scroll across the screen, through the alignment, and look for any differences (if any).

Are there any differences between the alignments these programs generate?
If there are differences, then which program appears to be doing a better job of reflecting homologous columns?
Did you succeed in creating an alignment in which the boarders of the selfsplicing introns are recognizable?

Can you find settings that improve the alignment around the self-splicing intron? (for clustalw2, you might want to use the menu driven version)

Perform alignments for the other 3 datasets (keep the results!).

If you never used clustalw/clustalx before, save the aligned sequences in the different available formats and look at them in MS Word (select a non-proportional font) to get familiar with the different multiple sequence file formats.

In the aln files, you can delete whole blocks using microsoft word (e.g., those corresponding to the inteins/introns) by pressing down the alt-option key while selecting with a mouse. After you saved the aln file from word, you can re-open the file in clustalx and save it in any other format you like.

To illustrate the advantages of the seaview GUI, load the 23S dataset or the archaeal ATPase dataset into seaview (on your computer). (The 23S dataset contains in intron in at least 4 of the sequences from the Thermotogales, the ATPases dataset contains sequences of archaeal ATPase catalytic subunits that were invaded by an intein, it also contains a few ATPases subunits without intein for reference). Using the align menu -> options, select muscle and then align the sequences. Using the sites menu, create a set called all sites, duplicate the set twice call the duplicate sets intron/intein and exon/extein. Then select the intron/intein set and remove the x's under the exon/extein parts of the alignment and vice versa for the intron/intein set*. (If you save the file in mase format, the site selections will be saved as well. If you want to analyze selected sites in a different program, you can select save selection as in the file menu). While we have not talked about phylogenetic trees, select all species (their names need to be black), and select distance method in the Tree menu. (HKY distance and 100 bootstrap samples work well for the RNA sequences, for the protein sequences, the best is to use phyml with the default options, but if this takes too long, a distance analysis with Poisson correction is a reasonable alternative). Then select the intron/intein sites and the intron/intein containing species, repeat the tree building. If you place a checkmark in the depicted tree window, the program displays support values in this case either bootstrap support values or aproximate Likelyhood RatioTest derived probabilities (the latter are less conservative than the former).
Are the two trees compatible? What are the possible explanations for intron/intein gain and loss?

When you are done, you might want to repeat the exercise for datasets that you consider using for your student project.

If you want to do more, consider doing the dotlet exercises form here (but do not submit the form :) ).

* from the help files of seaview:

Click mouse to select/unselect a site.
Drag with the mouse button an extremity of a block of sites to extend or reduce it.
Click with the middle mouse button or Ctrl-Click to extend the block of sites at left.
Click with the right mouse button or Shift-Click on a block of sites to delete it.

A good strategy is to unmark the first and the last x to remove and then to shift click in the midle of the block.