MCB 5472 : Database searches, Blast and Blastall

Please let me know how far you got during the lab. If most students didn't finish, we will continue next week!

Questions you should answer are given in blue. Please send your answers by email to gogarten@uconn.edu with the subject MCB5472.

Database Searches :

Literature Databases

99.9% of the time Google Scholar or PubMed is sufficient. In rare instances, e.g., if one is interested in conference proceedings or published abstracts, it may be advisable to search the Web of Science or Scopus databases. Most universities and companies have subscriptions to dedicated literature databank search services. Search for a scientist of your choice (e.g., your advisor, or someone who publishes in your field of interest) using the Scopus database.

For which author did you search?
In the search results click on the author and then on "View in Analyze author output"
Which was the most cited article?
What is the author's h-index (and what does this even mean)?
When was the most recent citation?
Why might the h-index be a reasonable measure of an author's importance? Did you find any interesting article (if yes, list the title)?
Was this article available online?
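
For orientation, the h-index is the largest number h such that the author has h papers with at least h citations each. A minimal shell sketch of that calculation follows; the function name h_index and the citation counts are made up for illustration and are not taken from Scopus.

```shell
# h_index: print the h-index for a list of citation counts (one per argument).
# Hypothetical helper for illustration; the counts used below are invented.
h_index() {
  printf '%s\n' "$@" |
    sort -nr |
    awk '$1 >= NR { h = NR } END { print h + 0 }'
  # sort -nr puts the most cited paper first; awk remembers the last rank
  # at which the paper at that rank still has at least that many citations.
}

h_index 25 8 5 3 3 1
```

With these six counts the function prints 3: three papers have at least three citations each, but there are not four papers with at least four.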

Repeat the search using PubMed as available at the NCBI (the NCBI interface was known as Entrez for some time, and I might refer to the NCBI web interface by this name). Use this link (it installs "uconn-tools").

Repeat the search using Google Scholar. Enter the author name as "Initials Family name" (e.g., "JP Gogarten").

Which database gave you the most reasonable results?

If you have not completed the BLAST search exercise using the NCBI's web interface from last week, do it now.

Using Blast on a "local" machine:

A) obtaining the genome sequences

When searching the Internet, you may find references to both the BLAST (classic) and BLAST+ (new) command-line tools. If you have used the old BLAST tools, you may find this Quick Start Guide for switching from BLAST to BLAST+ command line tools useful. Some searches only work in the old blastall system, e.g., searching a nucleotide database (e.g. a genome) with a position specific scoring matrix.

Today we want to compare every open reading frame in one genome with all the open reading frames in another genome. We will use the encoded amino acid sequences both as queries and as the databank.

Go to the NCBI's current genome list.

Click on the "Prokaryotes" tab. Then click on filters and try to figure out how many entries are for completely sequenced genomes, or for genomes that had at least their chromosome completely sequenced.

Download two chromosomes or two completely sequenced genomes from organisms of your choice that belong to the same species or the same genus.
To do so, locate the genome in the current genome list. On the right side of the row are one or two download links, for RefSeq (R) and genome (G). If an R link is available, use it; if not, take the G link. Click on it, or open the link in your favorite FTP program. At a minimum you want to download the feature_table.txt and the faa file (if you might want to do other things with these genomes, download all the files).

To download the files, copy the link from the FTP listing (right click, copy link), then ssh to your account (if you use a terminal: ssh your_account_name@bbcsrv3.uconn.edu). Once you are logged in, make a directory for today's class (e.g., mkdir lab01). Change into this directory (e.g., cd lab01), then download the files into this directory using curl, copy-pasting the link from the FTP site into your terminal window, e.g.:

curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/895/965/GCF_001895965.1_ASM189596v1/GCF_001895965.1_ASM189596v1_feature_table.txt.gz
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/895/965/GCF_001895965.1_ASM189596v1/GCF_001895965.1_ASM189596v1_protein.faa.gz
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/285/935/GCF_002285935.1_ASM228593v1/GCF_002285935.1_ASM228593v1_feature_table.txt.gz
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/285/935/GCF_002285935.1_ASM228593v1/GCF_002285935.1_ASM228593v1_protein.faa.gz
Alternatively, you can download the files to your laptop, and then use sftp in ssh-client or filezilla to transfer them to bbcsrv3.

Whatever works for you - get the four files into your account on the cluster and establish a terminal with an ssh connection to the cluster.

On the remote machine, you should NOT run long processes on the master node. Either qlogin (see below) to a node that is not busy or submit your command to the queue using qsub.

In the SSH connection, change into the directory that contains your genome and feature files.

Unpack the compressed files you downloaded:

gunzip *.gz (The star is a wildcard: the shell will look for all files whose names end with .gz and unpack them.)

ls (to see what files you now have in your directory)

You might want to use the "mv" command to rename a long, unwieldy filename into a more concise one. E.g.,

mv ridiculouslyLongFilename.faa somethingShorter.faa

A good convention would be the first letter of the genus, then the species name and the strain designation, e.g., A_hydrophila_ML09-119.faa and A_hydrophila_ML09-119_feature.txt (the .gz ending is gone once the files are unpacked).
Remember NOT to use spaces in the names.

Take a look at the first few lines with

head filename.faa
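
Each protein record in a FASTA file starts with a header line beginning with ">", so counting those lines gives the number of protein sequences in the file. A small sketch on a toy file (the helper name count_proteins, the file name toy.faa, and its two sequences are invented for illustration):

```shell
# count_proteins: number of FASTA records in a file (hypothetical helper name).
count_proteins() {
  grep -c '^>' "$1"   # count the lines that start with ">"
}

# Toy FASTA file with two invented records, written to a temporary directory.
workdir=$(mktemp -d)
printf '>WP_000001.1 hypothetical protein\nMKTAYIAK\n>WP_000002.1 hypothetical protein\nMSLLTEVE\n' > "$workdir/toy.faa"

count_proteins "$workdir/toy.faa"
```

On your own data, running count_proteins on the unpacked faa file should report a number in the range you expect for the protein-coding genes of that genome.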

Here is a Perl program that substitutes the accession numbers in a protein FASTA file with the corresponding genome (start) coordinates from the feature table. Load it into your directory using curl:

curl -O http://carrot.mcb.uconn.edu/mcb3421_2017/faaReplaceAccessionWithStart.pl

You use it as follows:

perl faaReplaceAccessionWithStart.pl yourGenome1.faa yourFeatureTable1.txt > yourGenome1WithStart.faa

perl faaReplaceAccessionWithStart.pl yourGenome2.faa yourFeatureTable2.txt > yourGenome2WithStart.faa

head yourGenome1WithStart.faa

Check that the ">accession" lines have been replaced with ">number" lines.
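
To make the substitution less of a black box, here is a rough awk sketch of the same idea. It is not the contents of faaReplaceAccessionWithStart.pl. It assumes the feature table is tab-separated with the feature type in column 1, the start coordinate in column 8, and the product accession in column 11 (check the header line of your own file), and it runs on invented toy data. The helper name is hypothetical.

```shell
# replace_accession_with_start FAA FEATURE_TABLE
# Hypothetical helper: print the FASTA file with each ">accession ..." header
# replaced by ">start", using the first CDS row found for that accession.
# Assumed columns: 1 = feature type, 8 = start, 11 = product accession.
replace_accession_with_start() {
  awk -F'\t' '
    NR == FNR {                              # first file: the feature table
      if ($1 == "CDS" && !($11 in start)) start[$11] = $8
      next
    }
    /^>/ {                                   # second file: FASTA header lines
      split(substr($0, 2), parts, " ")       # accession is the first word
      if (parts[1] in start) print ">" start[parts[1]]
      else print                             # unknown accession: keep header
      next
    }
    { print }                                # sequence lines pass through
  ' "$2" "$1"
}

# Invented toy input: one CDS that starts at position 337.
workdir=$(mktemp -d)
printf 'CDS\twith_protein\tasm\tPrimary Assembly\tchromosome\t\tNZ_X\t337\t1698\t+\tWP_000001.1\n' > "$workdir/toy_feature_table.txt"
printf '>WP_000001.1 hypothetical protein [Toy organism]\nMKTAYIAK\n' > "$workdir/toy.faa"

replace_accession_with_start "$workdir/toy.faa" "$workdir/toy_feature_table.txt"
```

For the toy input the header becomes ">337" and the sequence line is unchanged. Taking the first CDS row per accession is a simplification chosen for this sketch; the course script may handle proteins that occur more than once differently.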

qlogin
This takes you to a "compute" node. (Why? Because we are going to run the BLAST+ command, and we want to "farm" the processing out to another computer, rather than hammering the single computer that operates as the "gateway" for everyone. See cluster etiquette.) If you have problems logging into your default queue, the command qacct -q lists all available queues, and qlogin -q name_of_queue uses that queue for an interactive login.

cd lab01

We want to turn one genome into a searchable databank and use the other genome as the query. To make the databank, we use either the program makeblastdb, which is part of the BLAST+ package, or formatdb, which is part of the blastall package. The databanks created with one can be used with the other.

Type makeblastdb -help to get information on the program parameters you can set.
makeblastdb -in yourGenome1WithStart.faa -dbtype prot -parse_seqids
Choose the first of your genomes as the "database". Use the FASTA file with the start coordinates! Do an "ls" to see the extra files you just made.
The -parse_seqids option directs the program to create an index that allows sequences to be retrieved from the databank.

Type blastp -help to get information on the blastp program.

Which option turns off the low-complexity filter in blastp?
Which option, and which setting, sets the word size to 2?
Which option allows blastp to use two processors?

blastp -query yourGenome2WithStart.faa -db yourGenome1WithStart.faa -out blast.txt -outfmt 6 -evalue 1e-8
The other genome will be the "query". An E-value cut-off of 10^-8 is used. Use the FASTA files with the start coordinates.
-outfmt 6 specifies a tabular output format, which will be written into a file called blast.txt.

This will take a few minutes. Here is a description of the columns.
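
With -outfmt 6 and no further format specifiers, BLAST+ writes twelve tab-separated columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore. A small sketch that pulls out query, subject, and E-value with awk follows; the helper name and the toy rows are invented for illustration and are not real BLAST results.

```shell
# show_query_subject_evalue FILE
# Hypothetical helper: print columns 1 (qseqid), 2 (sseqid) and 11 (evalue)
# of a default -outfmt 6 table, tab-separated.
show_query_subject_evalue() {
  awk -F'\t' 'BEGIN { OFS = "\t" } { print $1, $2, $11 }' "$1"
}

# Two invented hits in the twelve-column layout.
workdir=$(mktemp -d)
printf '337\t412\t98.5\t453\t7\t0\t1\t453\t1\t453\t0.0\t912\n' > "$workdir/toy_blast.txt"
printf '337\t90210\t41.2\t180\t95\t4\t20\t195\t33\t208\t3e-25\t101\n' >> "$workdir/toy_blast.txt"

show_query_subject_evalue "$workdir/toy_blast.txt"
```

Because the FASTA headers were replaced by start coordinates, columns 1 and 2 of your real blast.txt are the genome positions that get plotted against each other later.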

To get just the top hit for each query sequence, we use another Perl program. Since the hits for each query are ordered by best E-value to worst, the top hit is simply the first hit for each query:

curl -O http://carrot.mcb.uconn.edu/mcb3421_2016/blastTopHit.pl

You use it as follows:

perl blastTopHit.pl blast.txt > blastTopHit.txt

head blastTopHit.txt
Notice that there is only one hit returned per query in the blastTopHit.txt file.
Note: The "-max_target_seqs 1" option also returns the top BLASTp database hit for each query sequence. However, since we also want all hits with E-value ≤ 10^-8 for the other plot, we can use the Perl program to avoid computing the BLASTp twice (once without the max_target_seqs option, and once with).
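
The "first hit per query" idea can also be sketched as an awk one-liner. This illustrates the logic under the assumption that the hits for each query are already sorted best-first; it is not a copy of blastTopHit.pl, and the helper name and toy rows are invented.

```shell
# top_hit_per_query FILE
# Hypothetical helper: keep only the first line seen for each query ID
# (column 1) of a tab-separated BLAST table.
top_hit_per_query() {
  awk -F'\t' '!seen[$1]++' "$1"
}

# Three invented hits: two for query 337, one for query 1702.
workdir=$(mktemp -d)
printf '337\t412\t98.5\t453\t7\t0\t1\t453\t1\t453\t0.0\t912\n' > "$workdir/toy_blast.txt"
printf '337\t90210\t41.2\t180\t95\t4\t20\t195\t33\t208\t3e-25\t101\n' >> "$workdir/toy_blast.txt"
printf '1702\t1810\t87.0\t300\t39\t0\t1\t300\t1\t300\t1e-150\t540\n' >> "$workdir/toy_blast.txt"

top_hit_per_query "$workdir/toy_blast.txt"
```

The expression !seen[$1]++ is true only the first time a given query ID appears, so the toy table is reduced to two lines, one per query.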

To plot the location in one genome against the location of the matches in the other genome, we have two options: (A) using gnuplot or (B) using Excel (see below).

A) Plotting the results using gnuplot

Gnuplot is installed on the cluster, and you can use it to create scatter plots for both sets of matches (top hits and all significant hits) in the same coordinate system.
A script that does this is here. The script needs to be present in the same folder as the files with the blast output (blast.txt and blastTopHit.txt)
You need to open the file in a text editor, and add the names of the file with the blast output - if you used the names blastTopHit.txt and blast.txt the program runs without editing.

(There are many ways to download and edit the Perl script. One is to use curl to get the file
curl -O http://carrot.mcb.uconn.edu/mcb3421_2017/plotwgnu_mod2.pl
and use the editor nano to edit the file
nano plotwgnu_mod2.pl

An alternative is to download the file to your desktop (right-click on the link and select "save as"), edit the file on your desktop (in your favorite editor), and transfer it to the cluster using SFTP.)

To run the script type
perl plotwgnu_mod2.pl

Transfer the resulting plot to your computer using SFTP, and display the image on the screen. Remember to rename the files (blast.txt .... plot.png) before you run the second analysis.

==============================================
B) Plotting the results using Excel

Open an SFTP window in sshclient or FileZilla, navigate to your lab01 directory, and transfer the blast.txt and blastTopHit.txt files to your Desktop.

Make an Excel scatter plot using all BLAST hits with E-value ≤ 10^-8, and another using just the top hits.

What, if any, is the difference between the two plots?

Assignment for next Monday:

class3_2018.pl