Powerpoint slides are here

Assignments are at the end of the page

Sequence alignment

Pairwise alignment

A) DOT PLOT

The easiest way to align two sequences is to use a dotplot. In its most straight forward implementation the two sequences to be aligned are written along the coordinate axis.

In more realistic implementations a window of 5 to 20 nucleotides or amino acids is slid along one of the axes (i.e., sequences) and compared to every possible window on the other axis (sequence). The dot intensity is adjusted to reflect the percent identity (or similarity) in the two windows.

See the Dotlet exercises below.

Optimal global and local alignments.

There are many different algorithms to calculate pairwise sequence alignments. For two sequences it is "easy" to calculate an optimal global alignment. (According to the motto: "It can be easily shown" -- see here). The so called Needleman-Wunsch algorithm is widely used, it optimizes a positive alignment score, a related (and under some conditions equivalent approach) is to minimize the differences between to sequences.

Multiple Sequence Alignments

CLUSTAL, CLUSTALW and CLUSTALX

Usually global alignments are the easiest to calculate (local see below)

One of the easiest to use, most sophisticated, and most versatile alignment programs is clustalw

(Higgins DG, Sharp PM (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73:237-244;
Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22, 4673-4680
)
.

Clustalw runs on all possible platforms (unix, mac, pc), and it is part of most multiprogram packages, and it is also available via different web interfaces (for examples here, here, and here). 

Clustalw uses a very simple menu driven command-line interface, and you also can run it from the command line only (i.e. it is easy to incorporate into scripts.)

Clustalx uses the same algorithms as clustalw.  However, it has a much nicer interface, it displays information on the level of similarity, and it uses color in the alignment.  Especially for amino acids the use of color greatly enhances the ability to recognize conservative replacements. Clustalx is available for different platforms at the ebi's ftp site (follow your platform, clustalx is stored in the clustalw folders)

Clustal reads and writes most formats used by different programs.  The easiest format is the FASTA format:

> name of sequence or any other information goes in the first line. This line starts with ">". The line can be longer than 80 characters. The first line ends with the first paragraph sign.p
The second line contains the sequence itself; numbers and other non standard characters are ignored. Be careful if you download sequences. Often the transfer programs introduce paragraph signs every 100 characters, and the end of a command line frequently ends up as the beginning of the sequence.
All sequences to be read should be in a single file.

(sample clustalw input file)

(sample clustalw output file)

Clustal also reads aligned sequences.  If you input aligned sequences you can go directly to the tree section.
!! Be careful if you make a mistake, and the sequences are not aligned, your tree will look strange!!
!!!
ALWAYS CHECK YOUR ALIGNMENT!!!

Clustal also is useful to reformat and edit alignments, it is very forgiving in reading formats, e.g., you can open the clustal format (*.aln) in a text editor and delete columns and reload the file into clustalw, and output it in the other formats available.

For calculating an alignment, you can select different substitution matrices, and gap penalties (end-gaps can be considered differently!)

Clustal is better than its reputation. It is doing a great job in handling gaps, especially terminal gaps, and it makes good use of different substitution matrices.

To align sequences clustal performs the following steps:

1) Pairwise distance calculation
2) Clustering analysis of the sequences
3) Iterated alignment of two most similar sequences or groups of sequences.

It is important to realize that the second step is the most important. The relationships found here will create a serious bias in the final alignment. The better your guide tree, the better your final alignment. You can load a guide tree into clustal. This tree will then be used instead of the neighbor joining tree calculated by clustalw as a default. (The guide tree needs to be in normal parenthesis notation WITH branch lengths).

Other programs often used for multiple sequence alignment
(We will not use these program in this course; if you are already confused by the information provided, skip to the assignments):

A program available via the www is SAM (sequence alignment and modeling system) by Richard Hughey, Anders Krogh, Christian Barrett, & Leslie Grate at UCSC. The input consists of a multiple sequence file (aligned or not aligned) in FASTA format. The program uses secondary structure predictions, neighboring sites, etc. to place gaps. The program can be accessed using netscape at " http://www.cse.ucsc.edu/research/compbio/sam.html ".

If your sequences are not very similar, and if you are not able to generate a trustworthy multiple sequence alignment, you can calculate distance trees based on pairwise alignments only. The best program for this purpose is statalign from Jeff Thorne (Thorne JL, Kishino H (1992) Freeing phylogenies from artifacts of alignment. Mol Bio Evol 9:1148-1162). It runs under standard UNIX.  It's only worth your effort if you are getting gray hairs because of a data set you cannot reliably align.

PILEUP in the GCG package generates alignments that are very similar to clustalw. The TREE programs in GCG are currently considered by many to be worthless (UPGMA).  It is planned -since over three years- to incorporate PAUP into GCG in the "near future".

Local Alignments (e.g. MACAW) search the sequences for motives that occur in different sequences. In macaw the user has the option to select different tools to search matching motives, the user can select subsets of sequences or positions to search for similar motives, and the user has to accept/reject each of the motives found.

Alignments by Eye:
On PCs there is a DOS program called the Eyeball Sequence Editor (ESEE) that allows to simulataneously align nucleotid and encoded protein sequences. Needs some getting used to.

One useful sequence editor is seaview, the companion sequence editor to phylo_win. It runs on PC and most unix flavors, and is the easiest way to get alignments into phylo_win.

 

Assignments:

Tasks #1: Dotlet exercises..

The Swiss Institute for Bioinformatics provides a JAVA applet that perform interactive dot plots. It is called Dotlet. The main use of dot plots is to detect domains, duplications, insertions, deletions, and, if you work at the DNA level, inversions (excellent illustrations of the use of dot plots are given on the examples page). Currently, the applet will work in Internet Explorer ONLY.

  1. Comparing yeast ATPase catalytic subunit with yeast HO endonuclease. Go to the applet and input the sequences: here, here, here and here. Careful, once you leave the webpage, the back arrow will only return you to the applet, but you have to input the sequences again (so make sure that your applet is in a separate browser window). Also, when you input sequences make sure you paste sequence only, without a sequence description line.

  2. Select Neurospora A-subunit and the yeast subunit with intein. Select a window size between 9 and 15 and click "compute". The program will compare every window of the chosen size in one sequence to all the possible windows in the other sequence. On the right you see a histogram that describes how often window pairs with the indicated score occurred. The sliding bars below and above the histogram let you select the colors with which matches are depicted. (I like black for matches, white for mismatches better than the default).

    If you click on the dot plot panel, the alignment window at the bottom aligns the two sequences accordingly. You can fine-tune the alignment using the arrows.

    Which sequence positions in the yeast sequence represent the intein?

  3. If you compare the HO endonuclease (sex change enzyme) to the intein only, does the complete intein sequence match to something in the HO endonuclease? Is there a part of the sequence in the HO endonuclease that might correspond to an extein?

  4. Comparison of nucleotide sequence with introns vs. protein sequence it codes. (If you are low on time skip 4-12, results are in class12, mcb221, choose a few examples that you find interesting).
    Dot plots have many different applications. One of them is to analyze and visualize the intron exon structure of genes. In dotlet, if you use a nucleotide sequence for the first sequence, and a protein sequence for the second, the program will compare the translation in all three frames to the protein sequence. Load the following two sequences into dotlet:

    A) the genomic sequence from Arabidopsis thaliana containing the gene encoding the vacuolar ATPase (sequence below at end of page), the given sequence is the reverse complement of a sequence that is part of chromosome 1.

    B) the protein sequence as translated from the cDNA sequence as given in GI 3334404

  5. How many exons are in the gene ? (Is this the same result we got Wednesday afternoon?)

  6. Are neighboring exon sequences always in the same reading frame? (Use the mouse pointer to place the blue cross-hairs on the diagonal and then use the arrow key until one of the three frames matches to the protein sequence.) Try this for a couple of exons. WHY?

  7. Using BLAST to compare nucleotide and protein sequences
    An alternative approach is to use the genome sequence in a blast search of a protein databank. A drawback is that small exons might be below the significance level.

    Use the nucleotide sequence given below and compare it to the Swiss Protein database.
    Which BLAST program do you use?

  8. Should you turn off the low-complexity filter?

  9. How many exons can you identify?
    Note that you don't need to have the cDNA from the organism whose gene you are studying in the database. The matches to related organisms (in this case plants) are sufficient to identify the exons.

  10. Repetitive proteins in Dotlet
    Using dotlet load GI 30686594 and GI 19887539 (again omit the labels from the sequence, but give them a name so you can recognize them :)).
    Compare the Arabidopsis proteolipid against itself. Do you see any repetitive units? How many?
    Does the choice of scoring matrix make a difference?

  11. Compare the Methanopyrus sequence against the one from Arabidopsis. How many equivalents to the single repeat unit in Arabidopsis do you find?

  12. How many repeats do you identify when you compare the Methanopyrus sequence against itself?

 

Task 2)

Start clustalx by double-clicking the icon;
load globin.pep (right click, save to your computer) into the program;
calculate an alignment and save the result under different names in the different possible formats.  Using a text editor (MS Word) and a non-proportional font (i.e., a type of Courier), inspect the different formats.

*.MSF is read by GCG; *.PIR output is like FASTA - format, but the seq. all have equal lengths, you have to merge the first and second line in order to be FASTA compatible (you also could use the GDE format and using a text editor replace % with >).
Phylip (*.phy) is the "new" phylip - interlaced format. 

You can use the input/output options to reformat an alignment. Some programs require specialized formats. You can use a text editor like MSWord to get your alignment into the desired format, but things are certainly much easier, if you start out with a format close to the desired one.

Hint: Often you do not want to use the complete alignment, but only those portions which are sufficiently conserved. You can take a file in clustal format (*.aln) and delete columns with a text editor (in MSWord pressing down the alt key before clicking the mouse switches to column mode -- to see the alignment, you often need to decrease the font size and select a non-proportional font in your editor!). Although the different blocks in the resulting alignment have different lengths, clustal reads in the aligned sequences correctly, and you can output the shortened sequences in any desired format you want.
 

Task 3)

Use this file as input for clustalx.
Load the sequences into clustalx and calculate an alignment using the default parameters.
What is the huge indel in the center?

Explore the different menu options; in particular, explore the effect of different gap penalties on the alignment -- make sure that in alignment options the gaps are reset before you calculate a new alignment.

Task 4)

Jalview is an excellent JAVA applet to inspect and edit multiple sequence alignments. It also allows inspection of protein space for the aligned sequences. This works surprisingly well. The Jalview Homepage contains a lot of additional information.
Start Jalview as Java Web Start Application (top button on the linked we page; if the window does not appear after a few seconds, check the dock for the JAVA icon and click on it).

Align the file ATPaseSU.MAC.fa in clustalx (sequences need to be aligned! or use this) and load the aligned sequences into Jalview (File menu ->Input alignment from local file; NOTE: You need to select the correct format).
Explore the different coloring options (COLOUR menu). Which one seems to work best (most meaningful - scroll through the alignment to a more conserved region).

Note: You can change/edit the alignment by pointing on an amino acid residue and dragging it to the right or left. Try it, but leave the sequences in an aligned state before you move on.

CALCULATE an AVERAGE DISTANCE TREE USING PID
Click somewhere in the resulting tree to color groups of related sequences in the same color.

CALCULATE the PRINCIPAL COMPONENT ANALYSIS.
In a principal component analyses, the new dimensions are calculated as a linear combination of the original dimensions, so that greatest variance by any projection of the data set comes to lie on the first axis, etc. for the following dimensions. Can you find a higher dimension that breaks up the vacuolar ATPase A subunits? (Their names start with A.).

Which of the A subunit sequences cluster together, if you use this dimension (2, 3 and 5 worked for me)?

Task 5) (we skip this exercise)

If you have time, explore the different gap penalty parameters using the more divergent sequences in testseq2.txt. Are the nucleotide binding site motifs GGxxxxxGxxGxGKTV (in the V-ATPase A-subunits) properly aligned between the different types?

Does your alignment reveal the so-called non-homologous region in the catalytic V-ATPase subunits (i.e. a region of 100 aa, about 150 aa from the amino terminal end that is absent in the other ATPase subunits)?
Which choice of gap penalties gives the most satisfactory alignment?
What sequence in the V-ATPase B subunits does align to this motif?
Do the ATPase subunits and the rho termination factors (ttf in testseq2.txt) share other motifs besides the GKT region?

Task 6)

If you have time, explore one of the many different multiple sequence alignment websites available. Use a multiple sequence file of your choice (see above for examples). Possible programs include:

 

 

>gi|3766106:c18000-13000 Arabidopsis thaliana chromosome 1 BAC F9K20 sequence, complete sequence

 
ACTTGCTTTCCCTTTCTTCACTTTCAGAAAATTTCCCCGAGAAAACAAAACTGATCACTTTTGTGAATTT TCCGGCTTGATCAGAGAGAGAGAGAGATTCTCAAGCCGCATGTCTTTGTTGACTAAACGGACCCACTTAT 
TGGGCCGTAAGTATATATTATAAAGCCCAATAAATAAACTGTCACGTGGCTATTAACGACAAAGTCCAGA TCTACGCGCGTATCGTGTGTAGTCATGTACCGGATCACTCCACGATTTCGTCAATCAGATCAAAACTTCT 
CCAACTCTTTCAATATCTATCTACGTGCATCCACCTCTCTCTTCTTCTTCTTGTGACCACTCTTCCTGTC TGTTCACTGGTTAGGAGATCTCTATCCCTCTCTCTCTGATCCTTTTTGTTTCTTTCTGATTTATTGTATC 
GCTCTCTTTGAAAAACCTTAAGAGTTGTCTGATTTTGGCGATTTCTGATTTGTTGTTGTTGTTGTTTGAT CGGATCTCGGCATTTTCTAGGAGGTGAATTTTGATTTTTTTGGCCGATCGGATTTGAGTTTTTGAATCTC 
TGATTGCTGAGAAAATGCCGGCGTTTTACGGAGGAAAGCTTACGACCTTCGAGGACGATGAGAAGGAGAG CGAGTATGGTTACGTTCGTAAGGTATTATCCTGTTTCGTTCGATCTGGTTTCAATTTGTTTTTTTTTCTG 
TTTTGCGATGTTAGTTTTTGGTGATGGATAGATGAAATAGTTGATCTGCTTACCAGGTAAGATTGGTGGG ATAGCTAGATTTGATCTGATAGTTATCAGTGATTGAATCGGTTGATTCCGCGTTGGTTCAGTAACCTCGT 
CTTTGAATTTCTGATCTGATCTGATAGTTCTCAGTGATTTGACATTTTCTTTTATGGGATGCAGGTTTCA GGTCCTGTTGTTGTTGCCGATGGTATGGCCGGTGCTGCTATGTATGAGTTAGTGCGTGTTGGTCATGATA 
ATTTGATTGGTGAAATCATCCGTCTTGAAGGAGATTCTGCCACCATCCAAGGTTTGTTTCTTCTATTGTG CTTCCTAGTGTAATTTACTTTACGGCATCTTATGTGACCTCTGTCGAAGTAAGATATCTTAACTGATATT 
TGGCAACTTCCTTTTGATCAGTTTACGAGGAAACAGCTGGATTGACAGTTAATGATCCCGTTCTTCGAAC ACACAAGGTTCGCGAGTTATTTATCTTGGTTTTTTCTAGTGTTGTTCATCTGCAGCTAACATATAATTTG 
TCCTGAATTTACTACAGCCACTTTCTGTGGAGCTCGGGCCAGGAATATTGGGAAATATCTTTGATGGAAT TCAGGTTCAGTTGGATTTATAATCTTGCTAGACATGATTTTTTTTACTTTTATGATTCGTTTTATGTGGC 
TTCTTACGATTCTTTGGTTTCATTTCTTTAAATGTCACAGAGGCCTTTGAAGACTATTGCAAGAATATCC GGTGATGTGTACATTCCTCGGGGTGTGTCTGTTCCAGCTCTTGACAAAGATTGTCTTTGGGAGTTCCAGC 
CCAATAAATTTGGTAATGTGGTTTACTCCATATGCCTGTCTATGGAAGTGTTCATTTGGTTTTAATCTTG ATGGTCAATTGAATTCGTTTTGTTTGCAGTCGAGGGAGACACAATAACTGGTGGTGACTTGTATGCTGTA 
AGTTTATTGGTCTCCTCTTTAATCTGCTTTTGACAAGGGAATCTATTTACACAGTTACCGTGGTGTTTCC CTTGTTTACACTGGGAATAGTTTTTTCTGAAAGTCAAATTAAACTTTGGAATGCAGACTGTCTTTGAGAA 
CACTTTGATGAATCACCTCGTTGCCCTTCCTCCGGATGCCATGGGGAAGATCACTTACATTGCTCCAGCT GGTCAATATTCGCTTAAGGTTTGACTTTAAGTTTCCCTCAAACAGTTATGAATAAATACGTTTCAAACTT 
TTTCTTCCTTGATTTCTTTGAATTCAACGTTTGAGTTAATATATGGCTAACTTGATCAATTGGTAATCAC TTCCTGTTGTAGATCATGTTTGGCTTGTTGCTAATAATTGTTTGTCGGTGATTTTCATTTCTCAGGATAC 
CGTGATAGAGCTTGAATTCCAGGGGATCAAGAAATCTTACACCATGCTTCAGGTTTGCATGTATCTTTAA TCTTCCTACTTGCAAACGTAAATTTTAAGCTATTTGGTTCACTCTGTTAAATTGGTTTGGTTGATATATG 
TCAGAGCTGGCCTGTACGTACGCCTAGGCCAGTTGCATCAAAGCTTGCTGCCGATACTCCTCTACTTACG GGGCAGGTGATTACTCGATTAATTCTTCTTACAGTGGTGATAGTCATTTGAATACATGTGTTGCTGATTG 
CTTTCTTTTCCTGTTGTCAGCGTGTTCTTGATGCCCTTTTCCCTTCTGTTCTTGGTGGAACCTGTGCCAT TCCTGGTGCTTTTGGCTGTGGGAAAACTGTTATCAGTCAGGCACTTTCCAAGGTACCTTGTGACACTCTC 
TGGTTTTGTTCCATTTAATTACTGGATAGATTGAATTTCCAAAGCTAACTTTTTCTTATTTACATAGTAC TCCAACTCTGATGCTGTTGTGTATGTTGGTTGTGGAGAGAGAGGAAATGAAATGGCTGAGGTATATCTCT 
TCTCATTCTAAATTTGCATATTGTTCATACAAATCGGACATTTGATCTGATTGTTTCTCATAAATTAGGT TCTTATGGACTTCCCACAATTGACAATGACGTTGCCTGATGGCCGTGAGGAATCTGTCATGAAACGTACC 
ACACTTGTTGCTAACACCTCTAACATGCCTGTGGCTGCTCGTGAAGCCTCAATTTACACAGGTAATGTTC AGGCACACAGATTTAATAGTTATTGATGAATCCCATTGCCTATGCTCATTTTTTTTTTTTTTTTTTAATG 
TGAATTCCAGGAATCACAATCGCTGAATATTTTAGAGATATGGGCTACAATGTTAGTATGATGGCAGACT CAACTTCCCGTTGGGCAGAAGCATTAAGAGAAATTTCAGGACGGCTGGTAATCTTATGCGTTTCACTTTT 
GCTATATGGATGTTCGTGTTGTCCTCATCTCACTTTTCTTTTTCTCAGTTTATTGACACCTATTTTGCTT TGTTTTATAGGCTGAAATGCCTGCTGACAGTGGATATCCAGCCTATCTAGCAGCACGTTTAGCATCTTTC 
TATGAACGTGCTGGTAAAGTAAAATGTCTTGGTGGACCAGAACGTAACGGAAGTGTTACAATTGTTGGTG CAGTTTCGCCTCCTGGAGGAGACTTTTCAGATCCTGTGACTTCAGCAACCCTTAGTATTGTGCAGGTGAT 
TATTTGGTTCATGTCTGCTTCCCTATCTTCCATTGTAGATTACATAGTCGTATATGTTGGTTGAGATGAA CCAGATGGTGTTTAGTTTTAGATCTGCCGCAGACTCGTATATTTAAGCATTTTTTTTCTCCACTTTGAAA 
TGCTTACTCTTCCATTCTGGTTGTTTCTCTTTTCTTCTGCAGGTCTTCTGGGGTTTGGACAAAAAGCTTG CCCAGAGAAAACATTTTCCCTCTGTTAATTGGTTGATTTCTTACTCAAAGTATTCAACGGTATGCTTAAA 
TATTCTCGGTTCAAACTTGTCTTGGTTTACTATCTAGAAATCTTGTATATAAAACGCTGCTTTTTGTTTT AGGCACTGGAATCTTTCTATGAGAAGTTCGATCCAGATTTCATCAACATCAGGACAAAGGCCAGAGAGGT 
GTTGCAGAGGGAAGACGATCTTAATGAAATTGTCCAGGTATGTATCACTTATCCTTGTATAAGTATCTAT TGTGGTGACCAATGAACTCTTGTCTCAGCAACCCTAATACATTTTGAAGGGGTTGAACGATAATCTTGGC 
ATGTAAACTTGACTTGAGTTATAGAAGGAAAACAGTGCTAGCACGTTATTCTTTTCGAAAGGAACTTATT TGACCCACACATTGCTTTTTGTGTGCAGCTTGTAGGAAAAGATGCGCTAGCAGAAGGGGACAAAATCACA 
TTGGAAACAGCTAAGCTATTGAGGGAAGATTACCTTGCTCAAAACGCGTTTACACCGTAAGATTTGTTGG CTCCCTTCGTTTTGGTTTAGTACTCTCTCTTTCTCTCTCAACGGGTTATTCACTCTTGAACCTTTTGGAT 
GAATTTTTTGACAGATATGACAAATTCTGTCCTTTCTACAAGTCCGTGTGGATGATGCGTAACATTATCC ATTTCTACAACCTAGCCAACCAGGTAAATAAGATGAGATTTATACATACTATGCTAAGTGGGGATTAAGG 
TCAATTGGTTTGTCTAGGTAAAAACCCATTAATTGTTTTGGATACACAGGCGGTTGAGAGAGCAGCTGGA ATGGACGGTCAAAAGATTACCTACACTCTTATCAAGCATCGCTTAGGAGATCTTTTCTACCGTTTAGTGT 
AAGCAAACGACTTGCTTCTCCTCGATTTCTCTATGACTCTGTTACATAGCGCTCTAATAAAATGGTCTGA AACGGAATTATGGGAACTACAGGTCTCAGAAGTTCGAAGACCCAGCAGAAGGGGAGGATACACTGGTGGA 
AAAATTCAAGAAATTGTACGACGATCTCAATGCTGGATTCCGTGCTTTGGAAGATGAAACTCGGTAAGCT GTCGAGTCTCCACCGCAAGTAAAAAAAATCCACAGAATTGGGTTGTTTTTGGAGAAAGAGGGTTTCATTC 
ATGGTCTCTTTCTTGTGTTTTTGAACCAACAACTATCATAGTGGTCGGTATTTTATTTATCGGTTTGGTC GATCGATTGAGTTTTAGCTCTGTGAGCGTCATGATTCTCCGGCTGTGCTGTGCTGTGTAATATGTTTGAT 
TCGTTGTTTTCATGTTTTTATTTCGGTGGTAATAAGGTACAGCCAATGTGAGTCATATATTTGATTTGAT GTACCCTCTCAATTCAATAAGTTAATTTTAT