MCB 372: Sequence alignment

Please send your answers by email to gogarten@uconn.edu, or hand in a hard copy.

You should answer the questions in red!

Part 1

In today's class, we will begin by using the multiple alignment programs clustalw and muscle to align some sequences. Both of these programs are installed on the cluster. Try clustalw -help for the command-line options. Typing clustalw will bring up an interactive menu.

Typing muscle will list its command-line options.

9Thermotoga_23S.fna is a set of Thermotoga 23S rRNA genes in FASTA format.

On the cluster, align 9Thermotoga_23S.fna with clustalw (don't forget to move off the master node first!):

clustalw 9Thermotoga_23S.fna

Two files are output. 9Thermotoga_23S.aln is the alignment, and 9Thermotoga_23S.dnd is the guide tree.

Now let's align 9Thermotoga_23S.fna with muscle:

muscle -in 9Thermotoga_23S.fna -out 9Thermotoga_23S.muscle

ClustalX can be used to view sequence alignments. It is available for download in both Mac (clustalx-2.0.3-macosx.dmg) and Windows (clustalx-2.0.3-win.msi) versions.

Start ClustalX. From the menu, select File... Load Sequences and choose the ClustalW alignment (.aln suffix).

Now we are going to compare the ClustalW and Muscle alignments. One way to improvise a comparison is to load both alignments into ClustalX, so that they appear on the screen one under the other. A slight complication is that the sequences have the same names in both files (ClustalX requires names to be unique), so let's make a small change to the muscle alignment. From the command line, type:

sed "s/^>/>muscle/g" 9Thermotoga_23S.muscle > 9Thermotoga_23S.muscle.renamed

This will make a new file with a ".renamed" suffix. In this modified file, which contains the Muscle alignment, every FASTA sequence name has "muscle" prepended to it. Sed is the UNIX stream editor. Type "man sed" for details. In this case, it matches a ">" at the beginning of a line ("^" matches the beginning of a line) and replaces it with ">muscle". The "g" flag tells sed to replace every match on a line rather than only the first one. Because "^" can match only once per line, the flag is harmless but not needed here; sed applies the substitution to every line of the file either way.
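To see what the substitution does before running it on the real alignment, the sketch below applies the same sed expression to a tiny stand-in file. The file name demo.fasta and its two sequences are made up for illustration; only the sed command itself comes from the exercise.

```shell
# Build a small stand-in FASTA file; the names and sequences are invented.
tmpdir=$(mktemp -d)
printf '>seqA\nAC-GT\n>seqB\nACGGT\n' > "$tmpdir/demo.fasta"

# Same substitution as in the exercise: only lines starting with ">" change.
sed "s/^>/>muscle/g" "$tmpdir/demo.fasta" > "$tmpdir/demo.fasta.renamed"

# Show the header lines before and after the change.
grep '^>' "$tmpdir/demo.fasta"
grep '^>' "$tmpdir/demo.fasta.renamed"
```

The sequence lines pass through unchanged, so the alignment itself is not affected by the renaming.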

Now we can add this alignment to the ClustalW alignment we previously loaded into ClustalX. Go back to the ClustalX screen (the ClustalW alignment should be already on the screen), and select File... Append Sequences. Choose the "9Thermotoga_23S.muscle.renamed" alignment.

You should now see both alignments on the screen. Scroll across the screen, through the alignment, and look for differences between the two.

Are there any differences between the alignments these programs generate? If there are differences, then which program appears to be doing a better job of reflecting homologous columns?
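Alongside the visual comparison, a rough numerical check can help show where to look. The sketch below is only an illustration: it assumes an alignment in aligned FASTA format (the muscle output already is; the ClustalW .aln file is in CLUSTAL format and would have to be converted first), and the file aligned.fasta with its sequences is made up. For each sequence it reports the aligned length and the number of gap characters.

```shell
# Stand-in aligned FASTA file; names and sequences are invented.
# Sequences are wrapped over two lines to show that wrapping is handled.
tmpdir=$(mktemp -d)
printf '>seqA\nAC-GT\nA--T\n>seqB\nACGGT\nACGT\n' > "$tmpdir/aligned.fasta"

# For each sequence print: name, aligned length (columns), number of "-" gaps.
awk '
  /^>/ { if (name != "") print name, len, gaps
         name = substr($0, 2); len = 0; gaps = 0; next }
       { len += length($0); gaps += gsub(/-/, "-") }
  END  { if (name != "") print name, len, gaps }
' "$tmpdir/aligned.fasta" > "$tmpdir/aligned.summary"
cat "$tmpdir/aligned.summary"
```

If the same sequence has a different gap count in the two alignments, the programs placed gaps differently for it. Equal counts do not prove the alignments are identical, so this complements rather than replaces scrolling through the columns in ClustalX.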

Now repeat these steps for 10Thermotoga_23S.fna, which includes an extra sequence. Try to answer the same question for the expanded alignment.

Can you find settings that improve the alignment around the self-splicing intron? (For clustalw, you might want to use the menu-driven mode, which you start by typing clustalw.)

Part 2

EMBOSS is installed on the cluster. Here is a list of programs in EMBOSS. Today we will be using pepstats. Click on its entry in the list to see the command line arguments.

Download a genome of your choice from NCBI. Use the blue "F" links (far right of screen) to see the list of files for a given genome. Since we will be using pepstats, be sure to grab the protein sequence ".faa" file(s).

pepstats genome.faa -outfile genome.pepstats

Check the output file generated by pepstats using a text editor.

Use parse_pepstats.pl to extract the isoelectric point for all proteins. First read through the script and try to figure out how the program finds the values for the theoretical isoelectric points in the pepstats output. Then run it:

perl ./parse_pepstats.pl genome.pepstats

parse_pepstats.pl will generate three files, with suffixes ".pI", ".pos_charged", and ".parsed".
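As a cross-check on your reading of the script, here is a minimal sketch of one way such values could be pulled out with awk. It is not the course script and may differ from how parse_pepstats.pl works. It assumes that each protein's block in the pepstats output starts with a line beginning "PEPSTATS of <name>" and contains a line of the form "Isoelectric Point = <value>"; the abbreviated file demo.pepstats and its numbers are invented.

```shell
# Abbreviated, made-up excerpt in the assumed pepstats layout.
tmpdir=$(mktemp -d)
cat > "$tmpdir/demo.pepstats" <<'EOF'
PEPSTATS of protA from 1 to 120

Molecular weight = 13250.10  Residues = 120
Isoelectric Point = 5.4321

PEPSTATS of protB from 1 to 95

Molecular weight = 10877.55  Residues = 95
Isoelectric Point = 9.8765
EOF

# Remember the protein name from each "PEPSTATS of" header line, then print
# it next to the last field of the matching "Isoelectric Point" line.
awk '
  /^PEPSTATS of/       { name = $3 }
  /^Isoelectric Point/ { print name "\t" $NF }
' "$tmpdir/demo.pepstats" > "$tmpdir/demo.pI.tsv"
cat "$tmpdir/demo.pI.tsv"
```

Comparing a few values extracted this way with the ".pI" file is a quick sanity check that the script did what you expect.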

Use the ".pI" file (isoelectric points) to construct a histogram. Describe the distribution of isoelectric points in your selected genome.
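If you want a quick look at the distribution before moving to a spreadsheet or plotting program, a text histogram can be built on the command line. The sketch below assumes the ".pI" file holds one isoelectric point per line, which you should verify against your own file; the file demo.pI and its values are invented. It counts values in bins one pH unit wide.

```shell
# Made-up pI values, one per line (assumed layout of the ".pI" file).
tmpdir=$(mktemp -d)
printf '4.2\n4.9\n5.1\n9.3\n9.8\n10.4\n5.7\n' > "$tmpdir/demo.pI"

# Count values per 1-pH-unit bin; print the bin, the count, and a bar
# with one "#" per protein. Empty lines and empty bins are skipped.
awk '
  NF { bin = int($1); count[bin]++ }
  END {
    for (b = 0; b <= 14; b++) {
      if (!(b in count)) continue
      bar = ""
      for (i = 0; i < count[b]; i++) bar = bar "#"
      printf "%2d-%-2d %4d %s\n", b, b + 1, count[b], bar
    }
  }
' "$tmpdir/demo.pI" > "$tmpdir/demo.hist"
cat "$tmpdir/demo.hist"
```

For a whole genome with thousands of proteins the bars become very long, so it may be more practical to read the counts alone or to use a narrower bin width in a plotting program.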