Many completed genomes can be found on the NCBI FTP site (ftp://ftp.ncbi.nih.gov) in the folder named genomes.
The eukaryotic genomes are accessible from the genomes directory while all prokaryotic
(bacteria + archaea) genomes are stored in the the subfolder /bacteria (a clear
case of archaeal discrimination). Go into the bacteria sub-folder and download two or more genomes or your choice. to do so, click
on the appropriately named folder (e.g., Pyrococcus abyssi, Pyrococcus is a member of
the archaeal domain). Among the files in these folders we are are interested in
the .ffn file which contains all nucleotide sequences for individual
predicted ORF's, the .faa file which contains the predicted
ORF's amino acids sequences, and the .fna file, which contains the whole genome. These files are in a fasta format. For this exercise we will only use the .faa
files. If a folder contains more than one .faa file, take both of these, and copy them into a single file*. On to your computer rename the files into something you recognize. (e.g., p_abyssi.faa and t_maritima.faa).
Load the two .faa files into a folder in your home directory on bbcxsrv1.biotech.uconn.edu.
As usual there are many ways to do this.
Alternative #1: The easiest might be to use the finder. Select Go->connect to server in the finder-menu. In the address field enter afp://bbcxsrv1.biotech.uconn.edu . When prompted enter your username and password (if you still need to change your password, see below first)
Alternative #2: ssh "always" works, but it is very pedestrian: Save the two files onto your computer. Then open a terminal window on
your mac and type
ssh your_account_name@bbcxsrv1.biotech.uconn.edu enter your password when prompted.
psswd enter
a new non-trivial password. Remember UNIX is case sensitive. This permanently
changes your password. DO NOT LOOSE IT.
mkdir
blasttest creates a directory named blasttest
Open
another terminal window on your mac and type
sftp
your_account_name@bbcxsrv1.biotech.uconn.edu enter
your password when prompted. This establishes a secure ftp connection to the XSERV
cluster
cd blasttest changes
to the directory blasttest. lcd changes the directory on your local machine, lpwd
reports your local directory name
mput *.faa transfers all files from
your local directory that end with .faa from the local machine to the remote machine.
Alternative 3: use FUGU and Jellyfish, or finder and jellyfish. jellyfish is a simple program that keeps track of the ssh connections. It saves you from typing ssh username ....
Whatever works for you - get the two files into your account on the cluster and establish a terminal with an ssh connection to the cluster.
On
the remote machine, you should NOT run long processes on the master node. Either qrsh to
a node that is not busy (type qrsh **), or submit your command to the queue using
qsub.
On the XSERV cluster using the ssh connection established earlier type
qrsh <enter> enter
your password when prompted,
cd
blasttest
formatdb
-i p_abyssi.faa -o T -p T
The -i is to
indicate your infile.name, -p T is for a protein sequence, for a nucleotide sequence,
you need to use -p F (for False); the -o T instructs the formatdb program to create
indices. For more information on the formatdb program check http://bioinformatics.ubc.ca/resources/tools/formatdb or type formatdb - <enter>
You need to replace p_abyssi.faa with one of the genomes you downloaded. This genome will be converted into a searchable database.
To
do a blast search of every sequence in t_maritima.faa against the P.abyssi
genome, we use blastall.
To get information on the program type
blastall
<enter>
For details check one of the many help pages on
blastall, examples are here and here.
We
will use the following options:
blastall
-i t_maritima.faa -d p_abyssi.faa -o blast.out -p blastp -e 10e-9 -m 9
-a2
-i the query sequence(s)
-d
the database
-o the output file
-p the blast program to use
-e the e-value
cut off
-m the format of the output. option 8 gives a tabular output, that
can be easily read into excel spreadsheet or into a database program. option 9
is similar but includes comment lines.
the individual columns in the output
denote the following
# Query id # Subject id # % identity # alignment length
# number of mismatches # number of gap openings # position of query start # position
of query end # position of subject start # position of subject end # e-value of
a hit # bit score of a hit
Which option turns off the the low complexity
filter?
Which option, and which setting, sets the wordsize to 2?
Which option
allows to use two processors?
Inspect the output using more blast.out
Often one is only interested in the top scoring hit for each query. A simple perlscript (here) goes through the blastall output in m8 or m9 format and retains only the top scoring hit, and it adds a header line that will be correctly interpreted by EXCEL. Save this script into your blasttest folder on the cluster, and read through the script using a text editor (vi or textwrangler).
To run the blast.out file through the perl script type
perl extract_lines.pl blast.out
This will create a file in your directory called blast.out.top that only contains the top scoring hits, and from which the comment lines are removed.
A modified version of this program adds another column to the listing that for each alignment gives the bitscore per residue in the alignment.
Import this file into an excel spreadsheet. If you have an afp connection to the cluster, you can start EXCEL on you local machine, open an empty spreadsheet, select Data -> Get external data; select your file (you'll need to tell it to look for any file).
Sorting the data on the different columns you can figure out which gene was found to be most similar between the two genomes.
Which were the genomes you selected?
Which gene pair had the highest % identity?
Next, we will calculate a histogram for the % identity values for all the significant alignments. In Excel select Tools -> Data Analysis, select histogram. (If this does not show-up in your menu, you need to install the analysis tool pack. If you did not install office with all options, you will need to have the office installation disk, sorry.)
In the window select the column you want to analyze, to beautify things, you might add a columns with bin ranges - see demo in class.
Do you observe a smooth distribution?
Do any genes constitute a second peak of more similar sequences, compared to the bulk of the comparisons?
Can you can up with other interesting thing you could do with these data in excel? (histogram for the values in other columns, correlation between the values in different columns, calculate additional columns with corrected values)
The following is for your information only. We will not do this in class.
You can install blastall on your own machine. The program is available at the NCBI ftp site (see above).
In case you want to do this, you need to add the directory where blast is residing to
your PATH, i.e, the list of directories where the operating system will be looking
for things. The path is stored in a variable called $PATH .
To add a new directory
to the end of $PATH type the following at the command prompt in a terminal window:
PATH=$PATH:/Blast-2.2.10/bin
export
PATH
(for the advanced: you can add these commands to a file called
.profile. This file is "hidden" and it is read every time you log in.
You can do this using the text editor vi. To do so type
vi
.profile
<type "i" to enter insert mode>
PATH=$PATH:/Blast-2.2.10/bin
export
PATH
<type ESC to leave insert mode>
:wq
(: opens a command line. w writes to file, q quites. To quit without changes
type :q!)
To submit a job to the queue, you need to have a script that
executes the blast search.
The first line of the script tells the the operating
system what to do with the following lines. For example, the first line in the
file named blastall.sh could be
#!
/bin/bash/
The next line could be the command you want to execute:
blastall
-i t_maritima.faa -d p_abyssi.faa -p blastp -m 8 -e 10e-4 -o test.out -a 2
To
make the file executable type the following at the command line:
>
chmod 755 blastall.sh
To submit the file to the queue type at
the command line:
> qsub -cwd
blastall.sh
# -cwd tells the queue to use the current working directory.
* You can use a text editor for this (textwrangler), but the files might be pretty large. An alternative is to use unix:
** Screen is a unix program that can be very useful, especially, if you want to go home and keep the cluster running on a program that you started. The best is to start screen immediately after you connect to the server via ssh. To get going type screen -a <return>, then type qrsh <return> this sends you to a compute node, you might need to enter your password again. Once you are ready to leave, enter the following keys <ctrl> and a at the same time followed by a d. This detaches the screen. logout from the master node; go home; ....; when you want to check the program, login to the master node and type screen -r to reattach to the screen (if you have several screens running you get instructions how to connect to a particular screen).