MCB 5472 : Blastall and genome plots

Please let me know, how far you got during the lab. If most students didn't finish, we will continue this next week!

Questions you should answer are given in blue. Please send your answers per email to gogarten@uconn.edu with subject MCB5472

Using Blast on a "local" machine:

Many completed genomes can be found on the NCBI FTP site (ftp://ftp.ncbi.nih.gov) in the folder named genomes. The eukaryotic genomes are accessible from the genomes directory while all prokaryotic (bacteria + archaea) genomes are stored in the the subfolder /bacteria (a clear case of archaeal discrimination). Go into the bacteria sub-folder and download two or more genomes or your choice. to do so, click on the appropriately named folder (e.g., Pyrococcus abyssi, Pyrococcus is a member of the archaeal domain). Among the files in these folders we are are interested in the .ffn file which contains all nucleotide sequences for individual predicted ORF's, the .faa file which contains the predicted ORF's amino acids sequences, and the .fna file, which contains the whole genome. These files are in a fasta format. For this exercise we will only use the .faa files. If a folder contains more than one .faa file, take both of these, and copy them into a single file*. On to your computer rename the files into something you recognize. (e.g., p_abyssi.faa and t_maritima.faa). Note: the *.ptt and *.rnt files contain tables where genes are located in the genome sequence.
An alternative way to download the .faa files is to use the genome listing at the NCBI here. Use the "F" links in the "Tools" column for your chosen genomes (ctrl click on the *.faa file, not the F, to save the file on your computer.)

Load the two .faa files into a folder in your home directory on bbcxsrv1.biotech.uconn.edu.
As usual there are many ways to do this.

Alternative #1: The easiest might be to use the finder. Select Go->connect to server in the finder-menu. In the address field enter afp://bbcxsrv1.biotech.uconn.edu . When prompted enter your username and password (if you still need to change your password, see below first)

Alternative #2: ssh "always" works, but it is very pedestrian: Save the two files onto your computer. Then open a terminal window on your mac and type
ssh your_account_name@bbcxsrv1.biotech.uconn.edu   enter your password when prompted.
passwd    enter a new non-trivial password. Remember UNIX is case sensitive. This permanently changes your password. DO NOT LOOSE IT.
mkdir blasttest    creates a directory named blasttest

Open another terminal window on your mac and type
sftp your_account_name@bbcxsrv1.biotech.uconn.edu enter your password when prompted. This establishes a secure ftp connection to the XSERV cluster
cd blasttest changes to the directory blasttest. lcd changes the directory on your local machine, lpwd reports your local directory name
mput *.faa transfers all files from your local directory that end with .faa from the local machine to the remote machine.

Alternative 3: use FUGU and Jellyfish, or finder and jellyfish. Jellyfish is a simple program that keeps track of the ssh connections. It saves you from typing ssh username .... These programs should be available on your desktop. JELLYFISH maintains a list of data for your ssh connections, FUGU is a file transfer program with GUI that works, even if afp is not installed on the server. Both programs should be located inside a folder on your desktop.

Whatever works for you - get the two files into your account on the cluster and establish a terminal with an ssh connection to the cluster.

On the remote machine, you should NOT run long processes on the master node. Either qrsh to a node that is not busy (type qrsh **), or submit your command to the queue using qsub (we will learn about this next week).

On the XSERV cluster using the ssh connection established earlier type
qrsh <enter> enter your password when prompted,
cd blasttest # (If you haven't done above you need to make the directory 1st)
formatdb -i p_abyssi.faa -o T -p T
The -i is to indicate your infile.name, -p T is for a protein sequence, for a nucleotide sequence, you need to use -p F (for False); the -o T instructs the formatdb program to create indices. For more information on the formatdb program type formatdb - <enter>
You need to replace p_abyssi.faa with one of the genomes you downloaded. This genome will be converted into a searchable database.

To do a blast search of every sequence in t_maritima.faa against the P.abyssi genome, we use blastall. To get information on the program type
blastall <enter>
For even more details check one of the many help pages on blastall available on the web.

We will use the following options:
blastall -i t_maritima.faa -d p_abyssi.faa -o blast.out -p blastp -e 10e-9 -m 9 -a2
-i the query sequence(s)
-d the database
-o the output file
-p the blast program to use
-e the e-value cut off
-m the format of the output. option 8 gives a tabular output, that can be easily read into excel spreadsheet or into a database program. option 9 is similar but includes comment lines.
the individual columns in the output denote the following
# Query id # Subject id # % identity # alignment length # number of mismatches # number of gap openings # position of query start # position of query end # position of subject start # position of subject end # e-value of a hit # bit score of a hit

Which option turns off the the low complexity filter?
Which option, and which setting, sets the wordsize to 2?
Which option allows to use two processors?

Inspect the output using more blast.out

Often one is only interested in the top scoring hit for each query. A simple perlscript (here) goes through the blastall output in m8 or m9 format and retains only the top scoring hit, and it adds a header line that will be correctly interpreted by EXCEL. Save this script into your blasttest folder on the cluster, and read through the script using a text editor (vi or textwrangler).
A modified version of this program (here) adds another column to the listing that for each alignment gives the bitscore per residue in the alignment.

To run the blast.out file through the perl script type
perl extract_lines.pl blast.out
This will create a file in your directory called blast.out.top that only contains the top scoring hits, and from which the comment lines are removed.

Import this file into an excel spreadsheet. If you have an afp connection to the cluster, you can start EXCEL on you local machine, open an empty spreadsheet, select Data -> Get external data; select your file (you'll need to tell it to look for any file).

Sorting the data on the different columns you can figure out which gene was found to be most similar between the two genomes.
Which were the genomes you selected?
Which gene pair had the highest % identity?
Which pair had the smallest E-value?
Add additional columns to your spreadsheet: bit score per aligned position (= bitscore / alignmentlength); # of number of identical aa per match (= %identity * alignmenet lenghts), and negative log of E-value. [Double click in the 1st field of the new column, then type an = sign, click in the same row on the column whose value you want to use in the calculation, then enter the mathematical operation (* or+ or - or /), then click on the other column (same row) for the field you want to use in calculating the new column. press <return> when you are done. Select the field you just created, copy to clipboard, select all field in the column you want to create, and past the content from your clipboard.]
For ease of analysis it is useful to highlight the part of the table that contains significant results by selecting this part and selecting a different font color. (You need to sort your spreadsheet according to E-values first).

To retrieve the target sequence, you can use the command fastacmd. If the GI number of the target is 14521962, and the database is p_all.faa the command is as follows:
fastacmd -s 14521962 -d p_all.faa
By default the databank is nr as installed on the cluster. Thus fastacmd -s 14521962 also works (and you get more annotation), provided the sequences in your databank are also in nr.

Next, we will calculate a histogram for the % identity values for all the significant alignments.

FOR THE OLD VERSION OF EXCEL: In Excel select Tools -> Data Analysis, select histogram. (If this does not show-up in your menu, you need to install the analysis tool pack.)
Select the column you want to analyze, to beautify things, you might add a columns with bin ranges.

For the 2008 version of Excel for Macs: This version no longer allows to calculate histograms via the add-ons, but there are spreadsheets abailable on the net into which you can copy paste the the data you want to plot (a histrogram generating spreadsheet for Excel2008 is here). Open the blastall result table as an EXCEL spreadsheet. Select the data you want to turn into a histogram (do not select the column but click on the 1st cell that contains data you want to use, then <shift> click at the last cell you want to use) and copy to your selection to the clipp board. Open the hist_upgrade4.xls speadsheet (download via the here link). Click on the 1st cell on the histogram spreadsheet that contains data (Cell $A$5), paste the data from the clipboard.

Do you observe a smooth distribution?
Do any genes constitute a second peak of more similar sequences, compared to the bulk of the comparisons?
Does the histogram change when you use only the % identity values from significant matches?

Do at least two other interesting things with these data in excel?
(histogram for the values in other columns; plot values from different columns against one another).
What did you discover?

* You can use a text editor for this (textwrangler), but the files might be pretty large. An alternative is to use unix:

open a terminal window (the application is in the utility folder).
cd to the directory where you downloaded the two files (e.g., cd Desktop)
cat name1.faa name2.faa > new_name_for_both_faa_files.faa
where name1.faa name2.faa are the names of the files you want to copy. cat list the content of one ore more files and > directs the output from the default (the screen) to a file

2) Using Blastall to do genome plots for microbial genomes (different Frankia, or different Thermotoga species work nicely) .

Download the *.faa, *.ptt and *.fna files from two genomes (main chromosome only) from the ncbi's ftp site ftp://ftp.ncbi.nih.gov. As we want to plot syntenic relationships between the two genomes, we want to use two closely related genomes.

We want to plot a genes genome position in one genome against the genome position of a homolog in another genome. One way to do this is to replace the GI numbers in the genome with the location in the genome. A program that does this is here. Save the two genomes (as faa files, the two ptt files, the two files, and the perlscript into a new directory on the cluster (name in gp (for genome plot)). We also will use the extract_lines.pl script.

Open a terminal connection to the cluster,
qrsh
cd gp, run the addnucnum script. For example, if the genome and ptt files are ACN.faa, ACN.ptt, CCI.faa, CCI.ppt then the commands would be
perl addnucnum.pl ACN
and
perl addnucnum.pl CCI
Check the output and the program.

Pick one of the numbered genomes and turn it into a "blastable" databank. e.g.
formatdb -i ACN.num.faa -p T -o T
Check the result of the formatdb command by typing
more *.log

then run a blastall search of the other genome (or rather the proteins encoded in the other genome) against the databank
blastall -i CCI.num.faa -d ACN.num.faa -o blast.out -p blastp -e 10e-13 -m 8 -a2
NOTE: You want to select a reasonable significance cut-off, because we will plot all significant matches.
NOTE: You need to use the numbered genomes in both cases.

Run blast.out through extract_lines.pl
perl extract_lines.pl blast.out

Load the files blast.out and blast.out.top into excel and create a scatterplot for the first two columns.

What if any is the difference between the two plots?
Why do some proteins have more than one match?

If you have time, repeat the analysis, but use a lower or higher E-value cut-off, and/or use the same genome as target and query (matches off the diagonal are paralogs).

3) Mummer is a program that finds matching nucleotide sequences, and produces nice plots, similar to the genome plot created above. The difference is that in this case the search is done on the nucleotide level, and that the program keeps track of the + and the - strand. Mummer is installed on the cluster. Execute the following command, replacing CCI and ACN with the genomes you chose.
mummer -mum -b -c CCI.fna ACN.fna > ref_qry.mums
If you are interested in details, check the mummer manual at http://mummer.sourceforge.net/manual/#mummer
You can use mummer -maxmatch -b -c CCI.fna ACN.fna > ref_qry.mums to get all matches, not only the unique matches.

mummerplot --postscript --prefix=ref_qry ref_qry.mums

To plot the results, we need to go to the head-node (only that node has gnuplot installed)
exit

cd gp

The installed version of mummerplot creates a gnuplot file (*.gp) designed for an older version of gnuplot. To fix the problem, open the file ref_qry.gp in vi and delete the three set linestyle statements. (move the cursor over the line and type dd to delete the line. Then type :wq .

gnuplot ref_qry.gp

You will get a warning about "deprecated syntax" - ignore it.
View output (ref_qry.ps) with preview (installed on your iMAC).

Do find differences between the plots created in exercise #2 and the mummer plots?
Which of the generated plots do you think most useful for ... ?

Do not forget to exit, quit, logout from the cluster.

More assignments for Monday:

A) Work through the tutorial on PSI blast at the NCBI here (click at the button next to the red arrows to get to the next page)

B) Read the page on RPS blast and the CDD databank here

C) For the following array declaration

@myArray = ('A', 'B', 'C', 'D', 'E');

what is the value of the following expressions:

$#myArray
length(@myArray)
$myArray[1]
$n=@myArray
reverse (@myArray)