Assignment 6: Database searches using BLAST+ executables and possibly GC Skew

Your name:
Your email address:

1) Using Blast on a "local" machine via the command-line

A) obtaining the genome sequences

When searching the Internet, you may find references to both the BLAST (classic) and BLAST+ (new) command-line tools. If you have used the old BLAST tools, you may find this Quick Start Guide for switching from BLAST to BLAST+ command line tools useful.

There are many different ways to get the genome sequences onto the cluster; however, the easiest will be to first download the files to your computer, unzip the files and then transfer them to a directory on the cluster.

Go to the NCBI's current genome list.

Click on the "Prokaryotes" tab. Then click on "Filters" (at the top right). Place a check mark behind archaea, and under Assembly level select "Complete". The number in parenthesis following complete, should give the number of completely sequenced archaeal genomes, and these should be listed in the table below.

Take a note of the number of complete archaeal genomes:

Unclick the filter for complete genomes (leave the checkmark behind archaeal), then click on the "Download" button (top right of page, or below the filter settings). This download "prokaryotes.csv" to your computer. Save to the Desktop.

Load "prokaryotes.csv" into Excel.

Click on the first cell (A1). Then select "Format as Table". The header lines should now appear in table format.

Scroll to the right until you see a "Level" heading. Click on Status, and tick only "Complete Genome". Check status bar (on the bottom of the window) for the number of records found.
Does it agree with the previous number?

Sort by Release Date. How many complete archaeal genomes have been published so far this year?

Which genome has the most genes (look at all submission dates)?

Which genome is the largest?

The highest GC content (give name and GC content)?

The highest AT content (give name and GC content)?

Go back to the NCBI's current genome list. Click on the "Prokaryotes" tab, display only complete genomes.

For today's first exercise we will need  2 genome very divergent genome sequences. We will use two of my favorites: Haloferax volcanii DS2 and Aeromonas veronii HM21

Two choices are possible to download the genomes: 
Choice 1) In the NCBIs complete genome list for an organism of your choice, scroll to the right and click on the R (which is a link to the NCBI FTP server). This green R takes you to an FTP site listing of all the RefSeq files for a genome project. The G gets you to an FTP site that does not have refseq review.  We try to stick to the R link for now.  Many browsers (on a Mac use waterfox) will comply without any problem and open the ftp page.  For each genome we will download three files form the FTP directory.  The first contains all the protein sequences encoded in the genome (it ends with  _protein.faa.gz; the second one is the feature table, ending in _feature_table.txt.gz, and the third is the nucleotide sequence of the selected genome ending in _genomic.fna.gz).  Clicking on them in the browser will download the files to your computer.  Open filezilla, connect to Xanadu's transfer node transfer.cam.uchc.edu, log in with your username and password.  Filezilla's site manager should remember the site you filled out in lab5.  Create a new laboratory for lab6 and transfer the six files into the lab6 subdirecotry. 

Alternative Procedure:

Establish an secure shell connection to Xanadu.  ssh mcb3421usrXX@xanadu-submit-ext.cam.uchc.edu . See lab5 for details. 

To move from the login node to run an interactive session (you should never, ever run things on this node) to a compute node,  type 
srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash

In your home directory make a new subdirectory for lab6 and enter it:

mkdir lab6
cd lab6

In the NCBIs complete genome list select organism you want to analyze, scroll to the right and click on the R (which is a link to the NCBI FTP server).    For each genome we will download three files form the FTP directory. 
The first contains all the protein sequences encoded in the genome (it ends with _protein.faa.gz);
the second one is the genebank feature file, ending in _feature_table.txt.gz, and
the third is the nucleotide sequence of the selected genome ending in _genomic.fna.gz). 

You now should have the ftp site for your genome open in a browser, and a terminal connection to xanadu. 
In the browser window, right click on one of the files you want to download and copy the file location.
Go to the terminal window and type curl -O and paste the link you had copied.  (The -O is a capital O -- not a zero)

curl -O paste the link you copied

Repeat this for the three files and then repeat it for the other genome.

To uncompress the files type gunzip *.gz

End Alternative Procedure

What are the ftp links to the pages for the Haloferax volcanii DS2 and Aeromonas veronii HM21 genomes at the NCBI ftp server?  

Establish an secure shell connection to Xanadu.  ssh mcb3421usrXX@xanadu-submit-ext.cam.uchc.edu . See lab5 for details. 

To move from the login node to run an interactive session (you should never, ever run things on this node) to a compute node,  type 
srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash

If you are in your home directory change to the lab6 subdirecory (you can use the pwd command to check where you are in the directory hierarchy. 

cd lab6


The files at the NCBI FTP server are compressed.  If your browser did not yet unpack them, you can unzip them on the command line:

gunzip *.gz   The star is wild card and *.gz will be replaced with a list of all files ending in .gz)

Once you have the 6 files rename them into something you recognize. 
Rename the A_veronii_Hm21 files to
 A_veronii_Hm21.faa,
 A_veronii_Hm21_feature_table.txt, and
 A_veronii_Hm21_genomic.fna


Rename the Haloferx volcanii DS2 files into
 Haloferax_volcanii_DS2.faa
 Haloferax_volcanii_DS2_feature_table.txt
 Haloferax_volcanii_DS2_genomic.fna
The easiest is to do this in filezilla (click on the file to select it - wait - then click again and you can edit the file name).

Alternatively, you can use the command line:  mv oldname newname (Do not forget that you can copy/pasta and/or use automatic line completion)

Take a look at the first few lines of the faa files with

head YourFileName.faa

This "faa" files contains all of the proteins coded by a genome.

Today  we will use perl (an interpreter that allows you to use perl scripts), and blast (the basic local alignment search tool).   To use these programs You need to load the corresponding modules:
module load blast/2.11.0
module load perl

If you want to check which modules are available on xanadu, type
module avail

We also will use a couple of scripts that you can download via the copy url command (the O is the letter, not the number zero): 

curl -O https://j.p.gogarten.uconn.edu/mcb3421_2021/labs/Files_you_need_forLab6.zip

Execute the following to turn one of the multiple fasta files into a blast searchable databank:

makeblastdb -in A_veronii_Hm21.faa -dbtype prot -parse_seqids
-parse_seqids directs the program to create an index that allows to retrieve sequences from the databank you have created.
Do an "ls" to see the extra files you just made.

The Haloferax "genome" will be the "query".
blastp -query Haloferax_volcanii_DS2.faa -db A_veronii_Hm21.faa -out blast_all.txt -outfmt 6 -evalue 10  

-outfmt 6 specifies a tabular output format. (We will use an evalue of 10 to also obtain some insignificant hits.)

The blast program will now take every sequence in query file and do a blast search against the database (i.e., this performs a couple of thousand blast searches).
This will take a few minutes. While waiting check in what other students are doing :) Or check this listing of options to use with the blast command.


We are also interested in the best blast hit for each query.  The script
blastTopHit.pl goes to thought he output table and for every query only prints out the first entry. 
(It is possible to restrict the blast search to return only one hit, but it is alleged that this only return the first significant hit encountered, not the best one).  To run the perl script, execute:

perl blastTopHit.pl blast_all.txt > blast_top.txt

Here is a description of the columns in the outfmt -6 option. A text file containing the header is here. Use the cat command to add these to the blast output file

cat header.txt blast_top.txt > blast.top_h.txt

Open an SFTP window in Filezilla [start the filezilla application. You need to connect to the transfer node: transfer.cam.uchc.edu. Your username and password are the same as in lab5. Navigate to your lab6 directory, and transfer both the blast_all.txt and the blast.top_h.txt file to your Desktop. Load it into Excel.  Click in the top left field of the table, then convert the spreadsheet into a table with headers.  

What is the accession ID pair of the most conserved protein between the two genomes (sort the table on bitscores, then on e-values)?

To learn what the most similar sequence(s) encode, we can copy the Accession numbers for the most similar matches (use the database IDs in the second column), paste them into a text editor, save the the file as IDfile.txt and transfer it into the class6 directory:


blastdbcmd -entrybatch Filename.txt IDnumber1,IDnumber2,IDnumber3 -db YourDatabaseName
for example (all on one line):
blastdbcmd -entry_batch IDfile.txt -db A_veronii_Hm21.faa

This should return the fasta formated sequences of the matches.

What is given as function in the fasta annotation lines?

In Excel, select the third column from the left (this is the percent identity for each BLASTp match), and Insert -- Statistic Chart (all-blue column chart icon in the Charts section) -- Histogram
(Aside: I had problems in editing the histograms this creates on a Mac. It worked much better, when I selected tools, analysis tool pack (you might need to install this first), Histogram. I added a column somewhere that gave the bins) from 0 to 100% in 5% increments)

(More about histograms in Excel, including the formula used for default binning.)

You get a histogram of all percent identities. Repeat the histogram for E-values ≤ 10-3, and  E-values >.1

Do you observe a smooth distribution of % identity values (a distribution with two peaks would be noteworthy, as it might be an indication of a chimera formation)?


What can explain the difference (or lack thereof) in the histograms for significant and insignificant hits?

Can you explain, why % identity is not a good criterion to distinguish significant from insignificant hits?


For the following questions, the BLAST Command Line Applications User Manual may be helpful → link


How can you get information on the possible parameters in blastp using the command line? (try blastp -h first)


How can you set the wordsize to 2 ?


How can you filter the query sequence for regions with low complexity?



==========> This is how far you should get in lab 6 ... You can leave the two answers below empty, but you need to log out of Xanadu and submit your work sheet.  <===================


2) Cumulative Strand Bias

The script GCskew.pl goes through a genome and counts how many more Cs than Gs (or As than Ts, or Gs +Ts more than Cs and As).  It counts continuously, if the C is encountered the C versus G counter goes up, and the Gs +Ts versus Cs and As goes down one. 

The script runs on a file that ends in fna that is in the same directory as the script.  We already down-loaded the nucleotide sequence of one bacterial and one archaeal genome (see above).  For this exercise, we only want to use the bacterial chromosome. (if you are curious repeat this exercise for the archaeal genome)  Transfer the *genomic.fna files for the bacterial genome to your computer (do the same for the the feature_table.txt file, we will need it later).  Open the fna file in a text editor (notepad++, BBedit).  The first annotation line usually says something like this is the chromosome, and even if it doesn't, it usually is. This is the part of the file you want to keep. In rare cases, the first annotation line says plasmid or mini chromosome. 
To get to the second annotation line, search for >. 
If the first sequence was the chromosome, delete from before the second > to the end. 
If the first one was a plasmid, delete form before the second > to the beginning.   (repeat until only the chromosome is left)

Using filezilla, create a subdirectory in lab6 (call it GCSkew), and copy the fna file with only the bacterial chromosome sequence and the GCskew.pl script in this directory.  Go back to the terminal window. 
[If you were locked out, you need to execute the following: 
ssh mcb3421usrXX@xanadu-submit-ext.cam.uchc.edu
module load perl
cd lab6 ]

cd GCSkew
perl GCskew.pl
Aside:  If you want to repeat this for the archaeal sequence, or for another genome, you need to put its fna file into a separate directory and run it there.  (The script runs on one fna file in the directory and overwrites my_table if the file already exists.)

The script creates a file called my_table. Transfer it to your computer, add the extension .txt, and open it in excel.  Turn it into a table and plot the first four columns as a scatter plot.
Do you see one or more turning points (Peaks)? (Consider that the genome is circular. How would it look, if you keep going around the genome? ) 


The turning points usually are the origin and the terminus of replication. 
Open the feature_table.txt file in a texteditor and search for dnaA in the annotation lines. In bacteria dnaA is often/usually the first gene next to the origin of replication. 
Does the position of the dnaA encoding gene correspond to one of the turning points?  Is it located at the beginning of the plot?



Type exit at the xanadu commandline. This will return to the head node. Now type qstat {
you could only list your jobs using qstat -u username}
You should have no jobs running.
If there is a job listed, then type qdel
followed by a space, and then the job-ID number, followed by return or enter key. Then type qstat
again, to confirm that there are no running jobs. Then type logout
to exit the main computer.

Finished?

Check the appropriate radio button below before pressing the submit button:

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.































If you are struggling with transferring files to the cluster, most of the stuff we need today is in this zip file: