muscle/g" 9Thermotoga_23S.muscle > 9Thermotoga_23S.muscle.renamed
This will make a new file with a ".renamed" suffix. In this modified copy of the Muscle alignment, every FASTA sequence name has "muscle" prepended to it. Sed is the UNIX stream editor; type "man sed" for details. In this case, it matches a ">" at the beginning of a line ("^" matches the beginning of a line) and replaces it with ">muscle". Sed applies the command to every line of the file. The "g" flag tells it to replace every match within a line rather than only the first one; because "^" can match only once per line, the flag is harmless here but not actually needed.
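To see the effect of the anchored substitution in isolation, here is a minimal sketch on a throwaway two-sequence FASTA file. The file name, sequence names, and sequences are made up for illustration; only the sed expression mirrors the step described above.

```shell
# Build a tiny demo FASTA file in a temporary directory (hypothetical data).
demo_dir=$(mktemp -d)
printf '>seqA\nACGT\n>seqB len>100\nGGCC\n' > "$demo_dir/demo.fasta"

# Prepend "muscle" to each sequence name. The "^" anchor restricts the match
# to a ">" at the start of a line, so the ">" inside "len>100" is left alone.
sed 's/^>/>muscle/' "$demo_dir/demo.fasta" > "$demo_dir/demo.fasta.renamed"

cat "$demo_dir/demo.fasta.renamed"
# >muscleseqA
# ACGT
# >muscleseqB len>100
# GGCC
```

Without the "^" anchor, the "g" flag would also rewrite the ">" in the second header's description, which is why the anchor matters more than the flag.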
Now we can add this alignment to the ClustalW alignment we previously loaded into ClustalX. Go back to the ClustalX screen (the ClustalW alignment should be already on the screen), and select File... Append Sequences. Choose the "9Thermotoga_23S.muscle.renamed" alignment.
You should now see both alignments on the screen. Scroll across the alignment and look for differences between them.
Are there any differences between the alignments these programs generate? If there are differences, then which program appears to be doing a better job of reflecting homologous columns?
Now repeat these steps for 10Thermotoga_23S.fna, which includes an extra sequence. Try to answer the same question for the expanded alignment.
Can you find settings that improve the alignment around the self-splicing intron? (For ClustalW, you might want to use the menu-driven mode, which you can start by running clustalw without any arguments.)
EMBOSS is installed on the cluster. Here is a list of programs in EMBOSS. Today we will be using pepstats. Click on its entry in the list to see the command line arguments.
Download a genome of your choice from NCBI. Use the blue "F" links (far right of screen) to see the list of files for a given genome. Since we will be using pepstats, be sure to grab the protein sequence ".faa" file(s).
pepstats genome.faa -outfile genome.pepstats
Check the output file generated by pepstats using a text editor.
Use parse_pepstats.pl to extract the isoelectric point for all proteins:
Read through the script. Try to figure out how the program finds the values for the theoretical isoelectric points in the pepstats output.
perl ./parse_pepstats.pl genome.pepstats
parse_pepstats.pl will generate three files, with suffixes ".pI", ".pos_charged", and ".parsed".
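As a point of comparison while reading the Perl script, the same kind of extraction can be sketched with awk. This is not how parse_pepstats.pl is written. It also assumes that each pepstats record starts with a line of the form "PEPSTATS of NAME from ..." and contains a line of the form "Isoelectric Point = VALUE"; check this against your own output file. The mock records, protein names, and output file name below are hypothetical.

```shell
# Write a small mock of pepstats-style output (assumed layout, made-up values).
pep_dir=$(mktemp -d)
cat > "$pep_dir/demo.pepstats" <<'EOF'
PEPSTATS of prot1 from 1 to 120

Molecular weight = 13000.00  Residues = 120
Isoelectric Point = 5.4321

PEPSTATS of prot2 from 1 to 80

Molecular weight = 9000.00  Residues = 80
Isoelectric Point = 10.2500
EOF

# Remember the protein name from each header line, then print it next to the
# last field of the following "Isoelectric Point" line (tab-separated).
awk '/^PEPSTATS of/       { name = $3 }
     /^Isoelectric Point/ { print name "\t" $NF }' \
    "$pep_dir/demo.pepstats" > "$pep_dir/demo_pI.tsv"

cat "$pep_dir/demo_pI.tsv"
```

The pattern is the usual one for record-oriented text: keep the most recent header in a variable and emit a row when the line of interest appears.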
Use the ".pI" file (isoelectric points) to construct a histogram. Describe the distribution of isoelectric points in your selected genome.
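If you want a quick look at the distribution before moving to a plotting program, a text histogram can be sketched with awk. The layout of the ".pI" file is an assumption here: the sketch expects the isoelectric point to be the last whitespace-separated field on each line, so adjust the field if your file differs. The demo values are made up.

```shell
# Mock isoelectric points, one per line (hypothetical values).
hist_dir=$(mktemp -d)
printf '4.5\n5.2\n5.9\n9.8\n10.1\n' > "$hist_dir/demo.pI"

# Count values in bins one pH unit wide (bin 5 covers 5.0 up to, but not
# including, 6.0), then print: bin lower edge, count, and a bar of "#" marks.
awk '{ count[int($NF)]++ }
     END {
         for (b = 0; b <= 14; b++) {
             if (count[b] > 0) {
                 bar = ""
                 for (i = 0; i < count[b]; i++) bar = bar "#"
                 printf "%d\t%d\t%s\n", b, count[b], bar
             }
         }
     }' "$hist_dir/demo.pI" > "$hist_dir/demo.hist"

cat "$hist_dir/demo.hist"
```

To run it on real data, replace the demo file with your own ".pI" file. Narrower bins (for example 0.5 pH units) can make it easier to see whether the distribution has more than one peak.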