MCB 5472 : sequence alignment

Please send your answers per email to gogarten@uconn.edu, or hand in a hardcopy

You should answer the questions in red!

In today's class, we will begin by using the multiple alignment programs clustalw and muscle to align some sequences. Both of these programs are installed on the cluster. Try clustalw -help for the command-line options. Typing clustalw will bring up an interactive menu. Or you could use the newer version by typing clustalw2 instead of clustalw)

(clustalx is available on your iMacs. It only reads files with the unix new lines at the end of the annotation line - if you try to load a file that seems to have the correct format, but is loaded into a single sequence only, open the file in text wrangler and save it as a unix file.)

Typing muscle will list its command-line options.

The datasets:

Thermotogales 23S.fna, Themrotogales 16S.fna, Thermotogales RpoC.faa, Thermotogales LeuRS.faa are sets of genes/proteins encoded in the genomes of different species of Thermotogales in multiple FASTA format.

Read this section first, develop a strategy, then do it:
Goal is to add at least 5 additional homologous sequences to each of the datasets. Open the data files, copy paste one of the sequences (chose one of the shorter sequences, some of the longer ones have introns) and do a blast search of nr on the web.
(Under organisms you can select to exclude the Thermotogales and to include only the Bacteria; if you leave Megablast as an option, you might be faster; if you select single genes, not genomes, you can down load them using the checkmarks in the alignment section -- this works better for the protein sequences).
Add at least 5 sequences from related bacteria to each dataset. If you find them, add sequences from Aquifex, Thermus, Thermoanaerobacter, some other Clostridia, Bacillus.
(You might make things easier for you by using the send to clipboard feature of the NCBI web page -- this is NOT the clipboard on your iMac, but a clipboard at the NCBI, you can reformat all your selected sequences at once and copy/paste them to your sequence dataset.
One way to get to the rRNA coding sequences only is to click on the Max Score of the hit, which moves you to the alignment, if you click on the "features link" you get the sequence of the rRNA only, which you then can display as a Fasta sequence.
Naming sequences is difficult. It is a good idea to keep a fasta formated multiple sequence file that for each sequence includes the GI number and the positon of the sequence.
The annotation line of most sequences from the NCBI conforms to a standard that begins with a GI number and ends with a species/strain designation.
Clustal uses the annotation line up to the first white space as name of the sequences. However, many other programs only use the first 10 letters. Thus it is important to have the first ten letters unique and meaningful (if you look at the aignment or a tree, you want to easily be able to figure out what is what).

Why for many of the genomes are several identical hits reported for the rRNA queries?

After you added the sequences, move the file to the cluster into an appropriately named subdirectory. (Don't forget to move away from the masternode!) On the cluster, align the sequences with clustalw2. If in doubt, use the default settings.

For each run you obtain two files as output: *.aln is the alignment, and *.dnd is the guide tree.

Now let's align the 23S dataset with muscle:

muscle -in 23S.fna -out 23S.muscle

ClustalX can be used to view sequence alignments. You can get it from here, but it should be already installed on the iMac's in the classroom. There are both Mac (clustalx-2.0.3-macosx.dmg) and Windows (clustalx-2.0.3-win.msi) versions.

Start ClustalX. From the menu, select File... Load Sequences and the ClustalW alignment (.aln suffix).

Now we are going to compare the ClustalW and Muscle alignments. One way to "improvise" is to load both alignments into ClustalX, so they will both be on the screen - one under the other. A slight complication is that the sequences have the same name (ClustalX requires them to be unique), so let's make a small change to the muscle alignment. From the command-line, type:

sed "s/^>/>muscle/g" 23S.muscle > 23S.muscle.renamed

This will make a new file with a ".renamed" suffix. In this modified file, containing the Muscle alignment, all the FASTA sequence names will have "muscle" prepended to them. Sed is the UNIX stream editor. Type "man sed" for details. In this case, it is matching a ">" at the beginning of a line ("^" matches the beginning of a line), and substituting it with ">muscle". The "g" means it should make this replacement globally (throughout the entire file).

Now we can add this alignment to the ClustalW alignment we previously loaded into ClustalX. Go back to the ClustalX screen (the ClustalW alignment should be already on the screen), and select File... Append Sequences. Choose the "23S.muscle.renamed" alignment.

You should now see both alignments on the screen. Scroll across the screen, through the alignment, and look for any differences (if any).

Are there any differences between the alignments these programs generate?
If there are differences, then which program appears to be doing a better job of reflecting homologous columns?
Did you succeed in creating an alignment in which the boarders of the selfsplicing introns are recognizable?

Can you find settings that improve the alignment around the self-splicing intron? (for clustalw2, you might want to use the menu driven version)

Perform alignments for the other 3 datasets (keep the results!).

If you never used clustalw/clustalx before, save the aligned sequences in different format and look at them in MS Word (select a non-proportional font).

In the aln files, you can delete whole blocks using microsoft word (e.g., those corresponding to the inteins/introns) by pressing down the alt-option key while selecting with a mouse. After you saved the aln file from word, you can re-open the file in clustalx and save it in any other format you like. Do this

When you are done, you might want to repeat the exercise for datasets that you consider using for your student project.

If you want to do more, consider doing the dotlet exercises form here (but do not submit the form :) ).