MCB 5472 : Branchclust

You should answer the questions in red! (email to gogarten@uconn.edu)

1) Overview and setting up

flow_scheme

setting up- copy files and directories using the command line interface

Do the following:

At the command line in your terminal connection to the server, type:

mkdir workshop

cd workshop

mkdir test

cd test

Your prompt should now look like: bbcxsrv1:~/workshop/test mcb221_u1$

To copy the files and scripts into your folder:

cp -R /Users/jpgogarten/workshop/test/* /Users/mcb221_unnn/workshop/test/

This should be one line, and mcb221_unnn should be replaced with the name of your home directory.

The -R tells UNIX to copy recursively (including subdirectories). This command also copies a directory called fasta that contains 5 genomes to work on. If you want to work on different genomes, delete the 5 *.faa files that contain the Thermotogales genomes and replace them with the genomes of your choice. (Here, "genome" really means all the proteins encoded by the ORFs present in the genome.)
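The recursive copy can be tried out safely before running it on the server. The following is a minimal sketch, not part of the workshop scripts: SRC and DEST are hypothetical stand-ins (temporary directories with one made-up .faa file) so that it runs anywhere. On the course server, SRC would be /Users/jpgogarten/workshop/test and DEST would be the workshop/test folder in your home directory.

```shell
# Hypothetical stand-ins so the sketch runs on any machine.
SRC=$(mktemp -d)
DEST=$(mktemp -d)/workshop/test

# Fake source tree: one subdirectory (fasta) holding one demo protein file.
mkdir -p "$SRC/fasta"
printf '>demo_protein\nMKV\n' > "$SRC/fasta/demo.faa"

# mkdir -p creates workshop and workshop/test in one step.
mkdir -p "$DEST"

# -R copies recursively, so the fasta subdirectory and its files come along.
cp -R "$SRC"/* "$DEST"/

# List everything that arrived in the destination.
ls -R "$DEST"
```

Writing the destination as "$HOME/workshop/test/" instead of /Users/mcb221_unnn/workshop/test/ avoids having to type your user name by hand.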

ASIDE (you do not need to do this if you work on the cluster): If you install branchclust on a different machine, you also need to have bioperl installed. You can copy the bioperl folder from /Users/jpgogarten/bioperl-1.5-my/ using, for example:
cp -R /Users/jpgogarten/bioperl-1.5-my/* /Users/mcb221_unnn/bioperl-1.5-my/
NOTE: You will also need to change the location of the modules in some of the scripts to refer to your bioperl location! At present the scripts refer to the bioperl in my home directory :)
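Before editing the scripts, it can help to check whether Perl can actually load bioperl from the folder you copied. The sketch below is an assumption about how you might test this, not something the workshop scripts do: BIOPERL_DIR is a hypothetical variable that defaults to a bioperl-1.5-my folder in your home directory, and Bio::SeqIO is used as a representative bioperl module.

```shell
# Hypothetical location of your bioperl copy; adjust if you put it elsewhere.
BIOPERL_DIR="${BIOPERL_DIR:-$HOME/bioperl-1.5-my}"

# -I adds the folder to Perl's module search path; -MBio::SeqIO tries to load
# the module; -e 1 runs a do-nothing program. Exit status 0 means it loaded.
if perl -I"$BIOPERL_DIR" -MBio::SeqIO -e 1 2>/dev/null; then
  bioperl_status="found"
else
  bioperl_status="missing"
fi

echo "BioPerl in $BIOPERL_DIR: $bioperl_status"
```

If the status is "missing", the module location in the scripts (or the copy itself) still needs attention.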

The folder that you just copied (workshop/test) contains a text file with a summary of the commands: commands_workshop_one_script . It will be useful to open this file in TextWrangler and use it to copy and paste commands (sorry, some of the text below is in images from which you cannot copy/paste text). Also included in the folder are a file with slides, "WorkshopTAU_final.pdf", and a tutorial file that might address some problems you might encounter later on, "BranchClustTutorial.pdf".

other genomes

IF YOU USE GENOMES WITH NCBI ANNOTATION LINES, YOU NEED TO USE THE SCRIPTS CALLED BY do_all_GI.sh !!
(Sorry, in its present form this version does not allow filtering by E-value when parsing the blast searches. This means that you need to select a reasonable E-value in your initial blast searches. If you want to use an E-value cut-off of 10^-20, you need to edit the do_blast.pl script! If you use the JGI format, you can use the parse_blast_cutoff_Thermotoga script to change the E-value, i.e., you don't have to re-run all of the blast searches.)
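To illustrate what an E-value cut-off does, here is a small sketch that filters hits at 10^-20. It is not how the workshop scripts work (they parse the blast output with bioperl); it assumes blast hits in tabular form (as produced by blastall with -m 8), where the E-value is the 11th tab-separated column. The three hits are made-up demo data.

```shell
# Made-up tabular blast hits: query, subject, %identity, alignment length,
# mismatches, gap openings, q.start, q.end, s.start, s.end, E-value, bit score.
hits=$(mktemp)
printf 'q1\ts1\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-50\t380\n' > "$hits"
printf 'q1\ts2\t35.0\t120\t70\t3\t5\t120\t9\t125\t3e-10\t60\n' >> "$hits"
printf 'q2\ts3\t60.0\t180\t70\t1\t1\t180\t1\t180\t2e-21\t95\n' >> "$hits"

# Keep only hits whose E-value (column 11) is at most 1e-20.
# Adding 0 forces awk to compare the field as a number.
kept=$(awk -F'\t' '$11 + 0 <= 1e-20' "$hits" | wc -l | tr -d ' ')

echo "hits kept at E <= 1e-20: $kept"
```

The hit with E-value 3e-10 is dropped; the other two pass the cut-off. A stricter (smaller) cut-off keeps fewer hits and therefore yields smaller superfamilies.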

2) building superfamilies

turn genomes into searchable library

cd into the test directory.

Execute the following three commands in the test subdirectory.

perl create_one_faa.pl
perl format_faa.pl
more formatdb.log

create_one_faa.pl is a Perl script that copies all the protein sequences in the fasta subdirectory into a single file in a directory called fasta_all.

format_faa.pl is a script that uses a program from the blastall package (formatdb) to make a blast-searchable library from the collected fasta sequences.

more formatdb.log lists the content of the log file created by formatdb.
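A quick sanity check after combining the genomes is to compare the number of sequences going in with the number in the combined file; every fasta record starts with a ">" line. The sketch below uses made-up demo files, a plain cat as a stand-in for create_one_faa.pl, and a hypothetical output name (all.faa); check the fasta_all directory for the real file name.

```shell
# Demo working directory with the same layout as workshop/test.
work=$(mktemp -d)
mkdir -p "$work/fasta" "$work/fasta_all"

# Two made-up "genomes": three protein records in total.
printf '>protA genome1\nMKKL\n>protB genome1\nMSTV\n' > "$work/fasta/genome1.faa"
printf '>protC genome2\nMALW\n' > "$work/fasta/genome2.faa"

# Stand-in for create_one_faa.pl (assumption: it concatenates the files).
# all.faa is a hypothetical name for the combined file.
cat "$work"/fasta/*.faa > "$work/fasta_all/all.faa"

# Count header lines (lines starting with ">") before and after.
in_count=$(cat "$work"/fasta/*.faa | grep -c '^>')
out_count=$(grep -c '^>' "$work/fasta_all/all.faa")

echo "input sequences: $in_count, combined file: $out_count"
```

If the two counts differ on your real data, some genome file was probably skipped or is not in fasta format.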

queue

The do_all shell script:

bc15

If you use sequences with GI numbers à la NCBI, you need to use the script do_all_GI.sh .

 

vi do_blast.pl

# to see what the parameters are doing, type blastall or
# blastall | more at the command line.
# If you move this to a different computer you might need to change a 2 to a 1

 

vi parse_blast_cutoff_thermotoga.pl

# change bioperl directory; change cutoff E-value
# the script as written uses the bioperl library in my home directory
# Note: if using closely related genomes, you can cut back on the
# size of the superfamilies by using a smaller E-value
# (if your genomes have normal GI numbers, use
# vi parse_blast_cutoff1.pl)

# check output:
more parsed/all_vs_all.parsed ### type q to leave more
more parsed/all_vs_all.parsed | wc -l # counts the number of lines = number of superfamilies in the output
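The line count can also be obtained without more by letting wc read the file directly. A minimal sketch, assuming (as the comment above states) one superfamily per line; the file content here is made-up demo data, not the real format of all_vs_all.parsed.

```shell
# Made-up stand-in for parsed/all_vs_all.parsed: one superfamily per line.
parsed=$(mktemp)
printf 'fam1_memberA fam1_memberB\nfam2_memberA\n' > "$parsed"

# wc -l counts lines; reading via < keeps the file name out of the output,
# and tr strips the padding some systems put in front of the number.
n=$(wc -l < "$parsed" | tr -d ' ')

echo "superfamilies: $n"
```

On the real data the equivalent command would be wc -l < parsed/all_vs_all.parsed .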

3)
fam to tree

4) Superfamilies to families

bc6

The choice of the many parameter is crucial. If it is too small, you get too many small families; if it is too large, too many families are lumped together. You can rerun branchclust on a tree with different values for many.

bc7

bc8

bc9

bc10

Which superfamily was the largest in terms of families?

bc11

bc12

Which family did you inspect?

Were there discrepancies between your evaluation of the tree and the clustering generated by branchclust?

 

5) Families to multiple sequence fasta files

bc14


6) Work on your student project!

Include a one sentence summary of what you did in your report.