MCB 5472 : sequence alignment

Please send your answers per email to gogarten@uconn.edu, or hand in a hardcopy

In case you encountered problems with Assignment #3, please discuss these with the instructors or with fellow students.

You should answer the questions in Blue!

In today's class, we will begin by using the multiple alignment programs clustalw and muscle to align some sequences. Both of these programs are installed on the cluster. Try clustalw -help for the command-line options. Typing clustalw will bring up an interactive menu. Or you could use the newer version by typing clustalw2 instead of clustalw)

Typing muscle will list its command-line options. The muscle usermanual is availbale at http://www.drive5.com/muscle/manual/

ClustalX can be used to view sequence alignments, and you also can use it as a GUI to calculate alignments and trees (only neighborjoining, but fast with bootstrap suppport). You can get it from here. There are both Mac and Windows versions available - if you try to load a file that seems to have the correct format, but is loaded into a single sequence only, open the file in text wrangler and save it as a unix file.

Seaview has been recently integrated with quite good phylogenetic reconstruction software. Its strenghts is a good graphics user interface, and the easy with which you can sample sequences (to study the impact of taxon sampling) and positions (e.g., you can exclude portions of a sequence, such as an intein, from an anlysis, or perfome a separate analysis for these sites. Seaview is availabe at http://pbil.univ-lyon1.fr/software/seaview.html .

The datasets:

We will mainly work with Thermotogales 23S.fna. This is a set of 23SRNA genes encoded in the genomes of different species of Thermotogales in multiple FASTA format.

[If you want to, you can use these datasets for comparison: Thermotogales 16S.fna, Thermotogales RpoC.faa, Thermotogales LeuRS.faa]

Read this section first, develop a strategy, then do it:
Goal is to add at least 5 additional homologous sequences to each of the datasets. Open the data files, copy paste one of the sequences (chose one of the shorter sequences, some of the longer ones have introns) and do a blast search of nr on the web.
(Under organisms you can select to exclude (include) the Thermotogales and to include only the Bacteria; if you leave Megablast as an option, you might be faster; if you select single genes, not genomes, you can down load them using the checkmarks in the alignment section -- this works better for the protein sequences).
Add at least 5 sequences from related bacteria to each dataset. If you can find them, add the homologous sequences from other Thermotogales, Aquifex, Thermus, Thermoanaerobacter, some other Clostridia, Bacillus.
(You might make things easier for you by using the send to clipboard feature of the NCBI web page -- this is NOT the clipboard on your iMac, but a clipboard at the NCBI, you can reformat all your selected sequences at once and copy/paste them to your sequence dataset.
One way to get to the rRNA coding sequences only is to click on the Max Score of the hit, which moves you to the alignment, if you click on the "features link" you get the sequence of the rRNA only, which you then can display as a Fasta sequence.
Naming sequences is difficult. It is a good idea to keep a fasta formated multiple sequence file that for each sequence includes the GI number and the positon of the sequence. In case the features link is not available (for recent sequences) you need to download the whole genome, with sequence, search for 23S, select RNA, copy the highlighted sequence. IF IT says that the complement strand encodes the rRNA, you need to calculate the reverse complement of the highlighted sequence. A website that does this is at http://reverse-complement.com/ (or use your own script from the homework).
The annotation line of most sequences from the NCBI conforms to a standard that begins with a GI number and ends with a species/strain designation.
Clustal uses the annotation line up to the first white space as name of the sequences. However, many other programs only use the first 10 letters. Thus it is important to have the first ten letters unique and meaningful (if you look at the aignment or a tree, you want to easily be able to figure out what is what).

Why for many of the genomes are several identical hits reported for the rRNA queries?

After you added the sequences, move the file to the cluster into an appropriately named subdirectory. (Don't forget to move away from the headnode!) On the cluster, align the sequences with clustalw2. If in doubt, use the default settings.

For each run you obtain two files as output: *.aln is the alignment, and *.dnd is the guide tree.

Now let's align the 23S dataset with muscle:

muscle -in 23S.fna -out 23S.muscle

Now we are going to compare the ClustalW and Muscle alignments. One way to "improvise" is to load both alignments into ClustalX or seaview, so they will both be on the screen - one under the other. A slight complication is that the sequences have the same name (ClustalX requires them to be unique), so let's make a small change to the muscle alignment. From the command-line, type:

sed "s/^>/>muscle/g" 23S.muscle > 23S.muscle.renamed

This will make a new file with a ".renamed" suffix. In this modified file, containing the Muscle alignment, all the FASTA sequence names will have "muscle" prepended to them. Sed is the UNIX stream editor. Type "man sed" for details. In this case, it is matching a ">" at the beginning of a line ("^" matches the beginning of a line), and substituting it with ">muscle". The "g" means it should make this replacement globally (throughout the entire file).

Move the aligned sequence files to you local computer. Start ClustalX. From the menu, select File... Load Sequences and the ClustalW alignment (.aln suffix).

Now we add the muscle alignment (with modified names) to the ClustalW alignment we previously loaded into ClustalX. Go back to the ClustalX screen (the ClustalW alignment should be already on the screen), and select File... Append Sequences. Choose the "23S.muscle.renamed" alignment.

You should now see both alignments on the screen. Scroll across the screen, through the alignment, and look for any differences (if any).

Are there any differences between the alignments these programs generate?
If there are differences, then which program appears to be doing a better job of reflecting homologous columns?
Did you succeed in creating an alignment in which the boarders of the selfsplicing introns are recognizable?

Can you find settings that improve the alignment around the self-splicing intron? (for clustalw2, you might want to use the menu driven version)

Perform alignments for the other 3 datasets (keep the results!).

Comment: In the aln files, you can delete whole blocks using microsoft word (e.g., those corresponding to the inteins/introns) by pressing down the alt-option key while selecting with a mouse. After you saved the aln file from word, you can re-open the file in clustalx and save it in any other format you like. Alternatively, you can select sites is seaview and save selected as - see below)

To illustrate the advantages of the seaview GUI, load the 23S dataset or the archaeal ATPase dataset into seaview (install on your computer). (The 23S dataset contains in intron in at least 4 of the sequences from the Thermotogales, the ATPases dataset contains sequences of archaeal ATPase catalytic subunits that were invaded by an intein, it also contains a few ATPases subunits without intein for reference). Using the align menu -> options, select muscle and then align the sequences. Using the sites menu, create a set called all sites, duplicate the set twice call the duplicate sets intron/intein and exon/extein. Then select the intron/intein set and remove the x's under the exon/extein parts of the alignment and vice versa for the intron/intein set*. (If you save the file in mase format, the site selections will be saved as well. If you want to analyze selected sites in a different program, you can select save selection as in the file menu). While we have not talked about phylogenetic trees, select all species (their names need to be black), and select distance method in the Tree menu. (HKY distance and 100 bootstrap samples work well for the RNA sequences, for the protein sequences, the best is to use phyml with the default options, but if this takes too long, a distance analysis with Poisson correction is a reasonable alternative). Then select the intron/intein sites and the intron/intein containing species, repeat the tree building. If you place a checkmark in the depicted tree window, the program displays support values in this case either bootstrap support values or aproximate Likelyhood RatioTest derived probabilities (the latter are less conservative than the former).
Are the two trees compatible? What are the possible explanations for intron/intein gain and loss?

When you are done, you might want to repeat the exercise for datasets that you consider using for your student project.

If time: Slide show on ancient paralogs (to explain my inordinate fondness for ATPases ...)

 

If you want to do more, consider doing the following dotlet exercises:

The Swiss Institute for Bioinformatics provides a Java applet called Dotlet that performs interactive dot plots. Warning: Google's Chrome and Microsoft's Edge browser do not support Java applets. Therefore, you will need to run today's exercises in Firefox or Internet Explorer (or Safari, if you are on a Mac). (Related, if experiencing problems: How do I enable Java in my web browser?)

java script version of dotlet is in beta testing. This version works fine for protein to protein comparisons, but fails tor DNA to protein.

The main use of dot plots is to detect domains, duplications, insertions, deletions, and, if you work at the DNA level, inversions. (Note dotlet compares both strands, and if you compare DNA and protein, the DNA is "translated" in the three forward reading frames. Excellent illustrations of the use of dot plots are given on the examples page).

Comparing yeast ATPase catalytic subunit with yeast HO endonuclease. Go to the applet and input the sequences:

    Sce_VMA.fa (the vacuolar ATPase catalytic subunit from yeast),

    SceHO.fa (the mating type switching HO endonuclease from yeast),

    vma1Neurospora.fa (the vacuolar ATPase catalytic subunit from Neurospora crassa) and

    Sce_intein.fa.

    Careful , once you leave the webpage, the back arrow will only return you to the applet, but you have to input the sequences again (so make sure that your applet is in a separate browser window). 
    Also, when you input sequences make sure you paste the sequence only , without a sequence description line. Give the sequences a name that allows you to recognize which sequence is which (e.g. Yeast_vma1, YeastHO, Neurospora_vma1, Yeast_intein)

Select Neurospora A-subunit (vma1Neurospora.fa) and the yeast subunit with intein (Sce_VMA.fa). Select a window size between 9 and 15 and click "compute". The program will compare every window of the chosen size in one sequence to all the possible windows in the other sequence. On the right you see a histogram that describes how often window pairs with the indicated score occurred. The sliding bars below and above the histogram let you select the colors with which matches are depicted. (I like black for matches, white for mismatches better than the default). 
Note: the sequences may be longer than fit into the display window. Either you can use the levers on top and at the left side to move the display window down and/or to the right, or you can select a compression using the pull-down menu labeled as 1:1 (1:3 or 1:4 usually work) 

If you click on the dot plot panel, the alignment window at the bottom aligns the two sequences accordingly. You can fine-tune the alignment using the arrows.

Which sequence positions (from ... to....) in the yeast sequence represent the intein?

If you compare the HO endonuclease (sex change enzyme) (SceHO.fa) to the intein (Sce_intein.fa) (PAM250, Window of 19 and 1:2 compression works well), does the complete intein sequence match to something in the HO endonuclease, or is there a part of the sequence in the HO endonuclease that might correspond to an extein?

Comparison of nucleotide sequence with introns vs. protein sequence it codes.
Dot plots have many different applications. One of them is to analyze and visualize the intron exon structure of genes. In dotlet, if you use a nucleotide sequence for the first sequence, and a protein sequence for the second, the program will compare the translation in all three frames to the protein sequence. Load the following two sequences into dotlet: 

A) The genomic sequence from Arabidopsis thaliana containing the gene encoding the vacuolar ATPase (arab.fa), the given sequence is the reverse complement of a sequence that is part of chromosome 1. 
B) The protein sequence as translated from the cDNA sequence as given in GI 3334404
(A windowsize of 29, the identity matrix, and a 1:5 compression work well. The DNA sequence needs to be the horizontal or 1st sequence)
How many exons are in the gene ? 

Repetitive proteins in Dotlet
Using dotlet load the sequence of the proteolipid of the ATP synthase from Methanpyrus kandleri  GI 19887539 (again omit the labels from the sequence). 
Compare the proteolipid protein against itself. Do you see any repetitive units? How many? 
Does the choice of scoring matrix make a difference?

Compare this sequence (AAM02226.1) to the sequence of a related archaeal proteolipid (WP_010876590.1) using Pairwise Blast (pairwise blast does not work to detect repeated domains in a comparison against self). Expand the Dot Matrix view. What is the effect of turning the low complexity filter on or off?

 

* from the help files of seaview:

A good strategy is to unmark the first and the last x to remove and then to shift click in the midle of the block.