A) obtaining the genome sequences
When searching the Internet, you may find references to both the BLAST (classic) and BLAST+ (new) command-line tools. If you have used the old BLAST tools, you may find this Quick Start Guide for switching from BLAST to BLAST+ command line tools useful. Some searches only work in the old blastall system, e.g., searching a nucleotide database (e.g. a genome) with a position specific scoring matrix.
Today we want to compare every open reading frame in one genome, with all the open reading frames in another genome. We will use the encoded aa sequences as queries and as the databank.
Go to the NCBI's current genome list.
Click on the "Prokaryotes" tab. Then click on filters and try to figure out how many entries are for completely sequenced genomes, or genomes that had at least their chromosome completely sequenced?
Download two chromosomes or two completely sequenced genomes from organisms of your choice that belong to the same species or the same genus.
To do so locate the genome in the current genome list. On the right side of the row are one or tow links to the refseq (R) and genome (G) download links. If an R-link is available use it, if not take the G link. Click on it, or open the link in your favorite FTP program. At the minimum you want to down load the feature_tabble.txt, the faa, and the fna file (if you might want to do other things with these genomes, download all the files).
To do so, copy the link from the ftp listing (right click copy link), then ssh to your account (if you use terminal, ssh your_account_name@bbcsrv3.uconn.edu). Once you logged in make a directory for today's class (e.g., mkdir lab03). Change into this directory (e.g., cd lab03) then download the files into this directory using curl (copy pasting the link from the ftp site to your terminal window, e.g.:
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/895/965/GCF_001895965.1_ASM189596v1/GCF_001895965.1_ASM189596v1_feature_table.txt.gz
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/895/965/GCF_001895965.1_ASM189596v1/GCF_001895965.1_ASM189596v1_protein.faa.gz
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/285/935/GCF_002285935.1_ASM228593v1/GCF_002285935.1_ASM228593v1_feature_table.txt.gz
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/285/935/GCF_002285935.1_ASM228593v1/GCF_002285935.1_ASM228593v1_protein.faa.gz
Alternatively, you can download the files to your laptop, and then use sftp in ssh-client or filezilla to transfer them to bbcsrv3.
Whatever works for you - get the four files into your account on the cluster and establish a terminal with an ssh connection to the cluster.
On
the remote machine, you should NOT run long processes on the master node. Either qlogin (see below) to
a node that is not busy or submit your command to the queue using
qsub.
In the SSH connection, change into the directory that contains your genome and feature files.
Upack the compressed files you downloaded:
gunzip *.gz (The star is a wild card, the shell will look for all files whose name ends with .gz and unpack them
ls (to see what files you now have in your directory)
You might want to use the "mv" command to rename a long, unwieldy filename into a more concise one. E.g.,
mv ridiculouslyLongFilename.faa somethingShorter.faa
A good convention would be the first letter of the genus, and then the species name and the strain designation, e.g., A_hydrophila_ML09-119.faa.gz and A_hydrophila_ML09-119_feature.txt.gz
Remember to NOT use spaces in the names
Take a look at the first few lines with
head filename.faa
Here is a Perl program that substitutes the accession numbers in a protein FASTA file with the
corresponding genome (start) coordinates in the feature table:Load it into your directory using curl:
curl -O http://carrot.mcb.uconn.edu/mcb3421_2017/faaReplaceAccessionWithStart.pl
You use it as follows:
perl faaReplaceAccessionWithStart.pl yourGenome1.faa yourFeatureTable1.txt > yourGenome1WithStart.faa
perl faaReplaceAccessionWithStart.pl yourGenome2.faa yourFeatureTable2.txt > yourGenome2WithStart.faa
head yourGenome1WithStart.faa
Check that the ">accession" lines have been replaced with ">number" lines.
B) Running BLAST using the modified *.faa files
qlogin
This takes you to a "compute" node.
(Why? Because we are going to run the BLAST+ command, and we want to "farm" the processing out to another computer, rather than hammering the single computer which operates as the "gateway" for everyone. See cluster etiquette.) If you have problems to log into your default queue, the command qacct -q lists all available queues and qlogin -q name_of_que, uses that queue for an interactive login.
cd lab01
We want to turn one genome into a searchable databank, and use the other genome as query. To make the databank, we either use the the program makeblastbb, which is part of the blast+ package, or formatdb, which is part of the bastall package. The databanks created with one, can be used with the other.
type makeblastdb -help to get information on the program parameters you can set.
makeblastdb -in yourGenome1WithStart.faa -dbtype prot -parse_seqids
Choose the first of your genomes as the "database". Do an "ls" to see the extra files you just made. Use the FASTA file with the start coordinates!
The -parse_seqids option directs the program to create an index that allows to retrieve sequences from the databank.
type blastp -help to get information on the blastp program
Which option turns off the the low complexity
filter in blastp?
Which option, and which setting, sets the wordsize to 2?
Which option
allows to use two processors?
blastp -query yourGenome2WithStart.faa -db yourGenome1WithStart.faa -out blast.txt -outfmt 6 -evalue 1e-8
The other genome will be the "query".
An E-value cut-off of 10-8 is used. Use the FASTA files with the start coordinates.
-outfmt 6 specifies a tabular output format, which will be writen into a file called blast.txt .
This will take a few minutes.
Here is a description of the columns.
To get just the top hit for each query sequence, we use another Perl program. Since the hits for each query are ordered by best E-value to worst, the top hit is simply the first hit for each query:
curl -O http://carrot.mcb.uconn.edu/mcb3421_2016/blastTopHit.pl
You use it as follows:
perl blastTopHit.pl blast.txt > blastTopHit.txt
head blastTopHit.txt
Notice that there is only one hit returned per query in the blastTopHit.txt file.
Note: The "-max_target_seqs 1" option also returns the top BLASTp database hit for each query sequence.
However, since we also want all hits with E-value ≤ 10-8 for the other plot, we can use the Perl program to avoid computing the BLASTp twice (once without the max_target_seqs option, and once with).
C) Plotting the results:
To plot the location in one genome against the location of the matches in the other genome we have two options. (B) using excel (see below) or (A) using gnuplot.
Option 1: Plotting the results using gnuplot
Gnuplot is installed on the cluster, and you can use it to create scatter plots for both of the matches (top and all significant) in the same coordinate system. (More on gnuplot here.)
A script that does this is here.
The script needs to be present in the same folder as the files with the blast output (blast.txt and blastTopHit.txt)
You need to open the file in a text editor, and add the names of the file with the blast output - if you used the names blastTopHit.txt and blast.txt the program runs without editing.
(there are many ways to download and edit the perl script. One is to use curl to get the file
curl -O http://carrot.mcb.uconn.edu/mcb3421_2017/plotwgnu_mod2.pl
and use the editor nano to edit the file
nano plotwgnu_mod2.pl
An alternative is download the file to your desktop (rightclick on the link and select save as), edit the file on you desktop (in your favorite editor), and transfer it to the cluster using the sFTP)
To run the script type
perl plotwgnu_mod2.pl
Transfer the resulting plot to your computer using the SFTP , and display the image on the screen. Remember to rename the files (blast.txt .... plot.png before you run the second analysis)
What type of homology is represented by most of the green x?
==============================================
Option 2: plotting the results using Excel
Open a SFTP window in sshclient or filezilla, navigate to your lab01 directory, and transfer the
blast.txt and blastTopHit.txt files
to your Desktop.
Make an Excel scatter plot using all BLAST hits with E-value ≤ 10-8, and another using just the top hits.
What type of homology is represented by most of the dots in the plot of blast.txt? ?
Mummer as an easy alternative to study the synteny between two genomes:
Mummer is a program that finds matching nucleotide sequences, and produces nice plots, similar to the genome plot created above. The difference is that in this case the search is done on the nucleotide level, and that the program keeps track of the + and the - strand. Mummer is installed on the cluster. Execute the following command, replacing CCI and ACN with the genomes you chose.
mummer -mum -b -c CCI.fna ACN.fna > ref_qry.mums
To keep track of things, you could replace ref_qry with something meaningful, e.g., mummer -mum -b -c B_garinii.fna B_burg.fna > B_garinii_B_burg.mums
If you are interested in details, check the mummer manual at http://mummer.sourceforge.net/manual/#mummer
You can use mummer -maxmatch -b -c CCI.fna ACN.fna > ref_qry.mums to get all matches, not only the unique matches.
mummerplot --postscript --prefix=ref_qry ref_qry.mums
You might get a warning about "deprecated syntax" - ignore it.
If you use a Mac, you can view the postscript output (ref_qry.ps) with preview. On a Windows PCyou might need to install a program to display postscript files (GSview is one option). Alternatively, you could replace the --postscript option with --png --large.
(The installed version of mummerplot calls gnuplot from within the mummerplot script; however, the gnuplot file described in the manual is generated and stored. Editing this file (extension: .gp) is an easy way to modify the plot. To replot the data type gnuplot ref_qry.gp). More on gnuplot here.
Is the result similar to the one you obtained with the geneplot created in ? (Two of the gene plots for the different Aeromonas strains are here and here). What can explain the difference?
What are the differences between the plots created in section (C) and the mummer plot? Which is more useful (this might depend on the question you are asking)?