Assignment 6: Database searches using BLAST+ executables

Your name:
Your email address:

1) Using Blast on a "local" machine via the command-line

A) obtaining the genome sequences

When searching the Internet, you may find references to both the BLAST (classic) and BLAST+ (new) command-line tools. If you have used the old BLAST tools, you may find this Quick Start Guide for switching from BLAST to BLAST+ command line tools useful.

We will need two sets of 2 or four sequences each. Set A should consist distantly related organisms (e.g. one Archaeon, one Bacterium, or a DPANN archaeon and a Euryarchaeote. If in doubt of what to choose, go for one of the Nanohaloarchaea archaeon SG9 and Haloferax volcanii DS2 genomes).; the second set (B)should consist of closely related bacteria or archaea (either strains from the same species, or species from the same genus).

There are many different ways to get the genome sequences onto the cluster; however, the easiest will be to first download the files to your computer, unzip the files and then transfer them to a directory on the cluster.

Go to the NCBI's current genome list.

Click on the "Prokaryotes" tab. Then click on "Levels" (at the top right) Under Assembly level select "Complete". This should result in a listing of completely sequenced genomes.

Take a note of the number of complete genomes:

Click on "Download Reports from FTP site" (top right of page). Then download "prokaryotes.txt" to your computer. (On the ftp site, right click on prokaryotes.txt, save target as, or Save link as, depending on your browser.) Save to the Desktop.

Load "prokaryotes.txt" into Excel. Start with a blank workbook, then File -- Open -- Browse -- Select "All Files". Choose prokaryotes.txt

Text Import Wizard opens. Just choose "Finish", as the defaults work fine.

Click on the header of column "A". The status bar should show about 160,000 rows.

Click on the first cell (A1). Then select "Format as Table". The header lines should now appear in table format.

Scroll to the right until you see a "Status" heading. Click on Status, and tick only "Complete Genome". Check status bar for the number of records found.
Does it agree with the previous number?

Sort by Release Date. How many complete genomes have been published so far this year?

Which genome has the most genes (look at all submission dates)?

Which genome is the largest?

The highest GC content (give name and GC content)?

The lowest GC content (give name and GC content)?

Go back to the NCBI's current genome list. Click on the "Prokaryotes" tab, display only complete genomes.

We will need two sets of 2 or 4 genome sequences each. Set A should consist distantly related organisms (e.g. one Archaeon, one Bacterium, or a DPANN archaeon and a Euryarchaeote. If in doubt of what to choose, go for one of the Nanohaloarchaea archaeon SG9 and Haloferax volcanii DS2 genomes).; the second set (B)should consist of closely related bacteria or archaea (either strains from the same species, or species from the same genus (e.g., different Aeromonas species, different Thermotoga species, different Borreliella species (they have linear chromosomes).

MODIFIED PROCEDURE:

For today we want to use two genomes from organisms from differnent bacterial or archaeal phyla. Identify the genomes you want to use on the NCBI's current genome list, click on prokaryotes, and then select the filter for complete genomes. In the right column is the link (R or G) to the FTP site for the genomes.

You want to download the protein.faa file for each genome (do not do this yet). This "faa" file contains all of the proteins coded by a genome. Right click on the faa file and copy the file location.

Log into the cluster via mcbsubmit.cam.uchc.edu (same username and password as you used last class).

mkdir lab6

cd lab6

curl -O paste the link you copied

Repeat this for the second genome.

e.g. curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/761/425/GCA_001761425.1_ASM176142v1/GCA_001761425.1_ASM176142v1_protein.faa.gz
and
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/025/685/GCF_000025685.1_ASM2568v1/GCF_000025685.1_ASM2568v1_protein.faa.gz

ls

Once you have the two files in the lab rename them into something you remember:

mv oldname newname

mv ridiculouslyLongFilename.faa.gz somethingShorter.faa.gz

A good convention would be the first letter of the genus, followed by the species name, followed by the strain identifier, e.g., A_veronii_TH0426.faa.gz

gunzip YourFileName.faa.gz

ls

The file should now end with just "faa". The "gz" ending was a compressed file format. Take a look at the first few lines with

head YourFileName.faa

And repeat the procedure for the second genome.

END modified procedure, continue at #####

 

There are many different ways to get the genome sequences onto the cluster; the easiest will be to first download the files to your computer, unzip the files and then transfer them to a directory on the cluster.

Use the search feature (bar on top of the page) to find the complete genomes you want to work on.

Then look for the R or the G link in the far right-hand column. This green R takes you to an FTP site listing of all the RefSeq files for a genome project. The G gets you to an FTP site that does not have refseq review.

You want to download the protein.faa file for each genome. This "faa" file contains all of the proteins coded by a genome. For the genomes from closely related organisms we will also need the the file ending with _feature_table.txt

Right click on the file you want to download, save it to your computer, unzip the file and give it a name that you recognize and that also contains the strain ID. E.g., A_veronii.faa would not be sufficient, because dozens of A. veronii genomes have been sequence. A_veronii_TH0426.faa would be fine.

Then start the bitwise ssh application. To upload files, you need to connect to a special transfer node: transfer.cam.uchc.edu. Your username and password are the same as last class. Create two directories (lab6 for the divergent genomes and lab7 for the similar genomes) and transfer the files from the two sets into their respective directories.

Then log into the cluster via mcbsubmit.cam.uchc.edu (same username and password).

The following section between {} is an option only in case the file transfer via your local computer does not work. Skip this section, if you created the directories and transfered the files successfully.

{

Right-click on the faa file in the refseq ftp listing, and select "Copy link location". Then go to the cluster and paste the link, like this:

Start a program called Bitvise SSH Client. In the "Host" field, type mcbsubmit.cam.uchc.edu
In the "Username" field, type your username (mcb3421usrXX, where XX is a number assigned to you). Login. It may ask you to accept a new host key. Now enter your password.
It might help further on, if you tell the SSH Client to use lines longer than 80 characters in the terminal.

mkdir lab6

cd lab6

curl -O paste the link you copied

e.g. curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/761/425/GCA_001761425.1_ASM176142v1/GCA_001761425.1_ASM176142v1_protein.faa.gz
and
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/025/685/GCF_000025685.1_ASM2568v1/GCF_000025685.1_ASM2568v1_protein.faa.gz

ls

mv oldname newname

mv ridiculouslyLongFilename.faa.gz somethingShorter.faa.gz

A good convention would be the first letter of the genus, followed by the species name, followed by the strain identifier, e.g., A_veronii_TH0426.faa.gz

gunzip YourFileName.faa.gz

ls

The file should now end with just "faa". The "gz" ending was a compressed file format. Take a look at the first few lines with

head YourFileName.faa

And repeat the procedure for the second genome.

}

 

#######

srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash
This takes you to a "compute" node. (Why? Because we are going to run the BLAST+ command, and we want to "farm" the processing out to another computer, rather than hammering the single computer which operates as the "gateway" for everyone. See cluster etiquette.)

If you are not sure that you are on a compute node, execute the command hostname

cd lab6

To use blast you need to load the blast module:
module load blast

Execute the following to turn one of the multiple fasta files from set one into a searchable databank:
makeblastdb -in database_protein.faa -dbtype prot -parse_seqids

Do an "ls" to see the extra files you just made. -parse_seqids directs the program to create an index that allows to retrieve sequences from the databank you have created .

The other genome will be the "query".
blastp -query query_protein.faa -db database_protein.faa -out blast.txt -outfmt 6

-outfmt 6 specifies a tabular output format.

Note, consider reducing the size of the output file by setting the e value cut-off to 1 you would add -evalue 1 to the above command.

The blast program will now take every sequence in query file and do a blast search against the database (i.e., this performs a couple of thousand blast searches).
This will take a few minutes. While waiting help your neighbor :)

Here is a description of the columns in the outfit -6. To add these to the blast output file,

curl -O http://carrot.mcb.uconn.edu/mcb3421_2017/header.txt
cat header.txt blast.txt > blast.out.txt

Open a new SFTP window in BitVise, navigate to your lab6 directory, and transfer over the blast.out.txt file to your Desktop. Load it into Excel.
You may find it useful to convert to table format with headings.

What is the accession ID pair of the most conserved protein between your two genomes?

Copy the Accession Number of the best match in the database, go back to the terminal window and type

blastdbcmd -entry IDnumber -db DatabaseName
for example
blastdbcmd -entry WP_004041152.1 -db Haloferax_volcanii_DS2_protein.faa

This should return the fasta formated sequence of the match.

What are the IDs of the best match, and what is given as function in the fasta annotation line?

In Excel, select the third column from the left (this is the percent identity for each BLASTp match), and Insert -- Statistic Chart (all-blue column chart icon in the Charts section) -- Histogram
(Aside: I had problems in editing the histograms this creates on a Mac. It worked better for me when I selected tools, analysis tool pak Histogram. I added a column somewhere that gave the bins ) to 100 in 5% increments)

(More about histograms in Excel, including the formula used for default binning.)

You get a histogram of all percent identities. You may find it useful to convert to table format, and add headings. Repeat the histogram for E-values ≤ 1, and E-values ≤ 10-3.

Add a column to the spreadsheet that calculates the number of identical residues (column 3 * column 4 /100).
Make separate histograms for the number of identical residues for all E-values, better than 1 and better than 1E-3.

Do you observe a smooth distribution of % identity values?


Do any genes constitute a second peak of more similar sequences, compared to the bulk of the comparisons?


What can explain the difference in the histograms for significant and insignificant hits?

Can you explain, why % identity is not a good criterion to distinguish significant from insignificant hits?
(Hint: you could insert another column that contains the number of matching aa, by inserting a column that contains a formula multiplying percentID * alignment length)
(If we are running out of time see the demo file)


For the following questions, the BLAST Command Line Applications User Manual may be helpful → link


How can you get information on the possible parameters in blastp using the commandline? (try blastp -h first)


How can you set the wordsize to 2 ?


How can you filter the query sequence for regions with low complexity?

Type logout
You will return to the master computer. Now type qstat {
you could only list your jobs using qstat -u username}
You should have no jobs running.
If there is a job listed, then type qdel
followed by a space, and then the job-ID number, followed by return or enter key. Then type qstat
again, to confirm that there are no running jobs. Then type logout
to exit the main computer.

Finished?

Check the appropriate radio button below before pressing the submit button:

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.