Assignment 6: Database searches using BLAST+ executables and possibly GC Skew

Your name:
Your email address:

1) Using Blast on a "local" machine via the command-line

A) obtaining the genome sequences

When searching the Internet, you may find references to both the BLAST (classic) and BLAST+ (new) command-line tools. If you have used the old BLAST tools, you may find this Quick Start Guide for switching from BLAST to BLAST+ command line tools useful.

There are many different ways to get the genome sequences onto the cluster; however, the easiest will be to first download the files to your computer, unzip the files and then transfer them to a directory on the cluster.

Go to the NCBI's current genome list.

Click on the "Prokaryotes" tab. Then click on "Filters" (at the top right). Under Assembly level select "Complete". This should result in a listing of completely sequenced genomes.

Take a note of the number of complete genomes:

Unclick the filter, then click on the "Download" button (top right of page). This download "prokaryotes.csv" to your computer. Save to the Desktop.

Load "prokaryotes.csv" into Excel.

Click on the header of column "A". The status bar should show the number of rows.

Click on the first cell (A1). Then select "Format as Table". The header lines should now appear in table format.

Scroll to the right until you see a "Status" heading. Click on Status, and tick only "Complete Genome". Check status bar for the number of records found.
Does it agree with the previous number?

Sort by Release Date. How many complete genomes have been published so far this year?

Which genome has the most genes (look at all submission dates)?

Which genome is the largest?

The highest GC content (give name and GC content)?

The lowest GC content (give name and GC content)?

Go back to the NCBI's current genome list. Click on the "Prokaryotes" tab, display only complete genomes.

For today's exercise we will need 2 genome sequences. These should be distantly related organisms (e.g. one Archaeon, one Bacterium will work). The genomes need to be completely assembled.
Which organisms do you choose?

Two choices are possible to download the genomes:
1) In the NCBIs complete genome list for an organism of your choice, scroll to the right and click on the R (which is a link to the NCBI FTP server). This green R takes you to an FTP site listing of all the RefSeq files for a genome project. The G gets you to an FTP site that does not have refseq review. We try to stick to the R link for now. Most browsers will comply without any problem and open the ftp page. For each genome we will download three files form the FTP directory. The first contains all the protein sequences encoded in the genome (it ends with _protein.faa.gz; the second one is the genebank feature file, ending in _genomic.gbff.gz, and the third is the nucleotide sequence of the selected genome ending in _genomic.fna.gz). Clicking on them in the browser will download the files to your computer. Open filezilla, connect to Xanadu's transfer node transfer.cam.uchc.edu, log in with your username and password. Filezilla's site manager should remember the site you filled out in lab5. Create a new laboratory for lab6 and transfer the six files into the lab6 subdirecotry.

Alternative Procedure:

Establish an secure shell connection to Xanadu. ssh mcb3421usrXX@xanadu-submit-ext.cam.uchc.edu . See lab5 for details.

To move from the login node to run an interactive session (you should never, ever run things on this node) to a compute node, type
srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash

In your home directory make a new subdirectory for lab6 and enter it:

mkdir lab6
cd lab6

In the NCBIs complete genome list for an organism of your choice, scroll to the right and click on the R (which is a link to the NCBI FTP server). For each genome we will download three files form the FTP directory. The first contains all the protein sequences encoded in the genome (it ends with _protein.faa.gz; the second one is the genebank feature file, ending in _genomic.gbff.gz, and the third is the nucleotide sequence of the selected genome ending in _genomic.fna.gz).

You now should have the ftp site for your genome open in a browser, and a terminal connection to xanadu.
In the browser window, right click on one of the files you want to download and copy the file location.
Go to the terminal window and type curl -O and paste the link you had copied. (The -O is a capital O -- not a zero)

curl -O paste the link you copied

Repeat this for the three files and then repeat it for the other genome.

End Alternative Procedure:

Establish an secure shell connection to Xanadu. ssh mcb3421usrXX@xanadu-submit-ext.cam.uchc.edu . See lab5 for details.

To move from the login node to run an interactive session (you should never, ever run things on this node) to a compute node, type
srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash

If you are in your home directory change to the lab6 subdirecory (you can use the pwd command to check where you are in the directory hierarchy.

cd lab6

The files at the NCBI FTP server are compressed. If your browser did not yet unpack them, you can unzip them on the command line:

gunzip *.gz The star is wild card and *.gz will be replaced with a list of all files ending in .gz)

Once you have the 6 files rename them into something you recognize. The easiest is to do this in filezilla (click on the file to select it - wait - then click again and you can edit the file name).

Alternatively, you can use the command line: mv oldname newname (Do not forget to use automatic line completion)

Take a look at the first few lines of the faa files with

head YourFileName.faa

This "faa" files contains all of the proteins coded by a genome.

Today we will use perl, and blast. You need to load the blast corresponding modules:
module load blast
module load perl

We also will use a couple of scripts that you can download via the copy url command (the O is the letter, not the number zero):

curl -O https://j.p.gogarten.uconn.edu/mcb3421_2020/labs/faaReplaceAccessionWithStart.pl
curl -O https://j.p.gogarten.uconn.edu/mcb3421_2020/labs/blastTopHit.pl
curl -O https://raw.githubusercontent.com/Gogarten-Lab/GC-Strand-Bias-Perl-Script/master/GCskew.pl
curl -O https://j.p.gogarten.uconn.edu/mcb3421_2020/labs/header.txt

Execute the following to turn one of the multiple fasta files into a blast searchable databank:

makeblastdb -in One_of_your_filenames.faa -dbtype prot -parse_seqids
-parse_seqids directs the program to create an index that allows to retrieve sequences from the databank you have created.
Do an "ls" to see the extra files you just made.

The other genome will be the "query".
blastp -query query_protein.faa -db One_of_your_filenames.faa -out blast_all.txt -outfmt 6

-outfmt 6 specifies a tabular output format. (We will use the default evalue of 10 to also obtain some insignificant hits.)

The blast program will now take every sequence in query file and do a blast search against the database (i.e., this performs a couple of thousand blast searches).
This will take a few minutes. While waiting check in what other students are going :)

We are only interested in the best blast hit for each query. The script blastTopHit.pl goes to thought he output table and for every query only prints out the first entry.
(It is possible to restrict the blast search to return only one hit, but it is alleged that this only return the first significant hit encountered, not the best one). To run the perl script, execute:

perl blastTopHit.pl blast_all.txt > blast_top.txt

Here is a description of the columns in the outfit -6. Use the cat command to add these to the blast output file

cat header.txt blast_top.txt > blast.top_h.txt

Open an SFTP window in Filezilla [start the filezilla application. You need to connect to a special transfer node: transfer.cam.uchc.edu. Your username and password are the same as in lab5. Navigate to your lab6 directory, and transfer the blast.top_h.txt file to your Desktop. Load it into Excel. Click in the top left field of the table, then convert the spreadsheet into a table with headers.

What is the accession ID pair of the most conserved protein between your two genomes (sort the table on bitscores, then on e-values)?

To learn what the most similar sequence(s) encode, we can copy the Accession numbers for the most similar matches (use the database IDs in the second column), paste them into a text editor, replace the new line symbols "\n" with "," and copy the comma separated list into the following command:

blastdbcmd -entry IDnumber1,IDnumber2,IDnumber3 -db YourDatabaseName
for example (all on one line):
blastdbcmd -entry WP_012166471.1,WP_012162656.1,WP_012165611.1, WP_012164084.1,WP_012164084.1,WP_012163537.1,WP_012165712.1, WP_012164790.1,WP_071819954.1,WP_012161635.1,WP_012162913.1, WP_012163195.1,WP_012163567.1,WP_012164795.1,WP_012163348.1,
WP_012163195.1,WP_012163250.1,WP_012163431.1 -db Acaryochloris_marina_MBIC11017.faa

This should return the fasta formated sequences of the matches.

What is given as function in the fasta annotation lines?

In Excel, select the third column from the left (this is the percent identity for each BLASTp match), and Insert -- Statistic Chart (all-blue column chart icon in the Charts section) -- Histogram
(Aside: I had problems in editing the histograms this creates on a Mac. It worked much better, when I selected tools, analysis tool pack (you might need to install this first), Histogram. I added a column somewhere that gave the bins) from 0 to 100% in 5% increments)

(More about histograms in Excel, including the formula used for default binning.)

You get a histogram of all percent identities. Repeat the histogram for E-values ≤ 10^-3, and E-values >.1

Do you observe a smooth distribution of % identity values (a distribution with two peaks would be noteworthy)?

What can explain the difference (or lack thereof) in the histograms for significant and insignificant hits?

Can you explain, why % identity is not a good criterion to distinguish significant from insignificant hits?

For the following questions, the BLAST Command Line Applications User Manual may be helpful → link

How can you get information on the possible parameters in blastp using the command line? (try blastp -h first)

How can you set the wordsize to 2 ?

How can you filter the query sequence for regions with low complexity?

==========> This is how far you should get in lab 6 ... You can leave the two answers below empty, but you need to log out of Xanadu and submit your work sheet. <===================

2) Cumulative Strand Bias

The script GCskew.pl goes through a genome and counts how many more Cs than Gs (or As than Ts, or Gs +Ts more than Cs and As). It counts continuously, if the C is encountered the C versus G counter goes up, and the Gs +Ts versus Cs and As goes down one.

The script runs on a file that ends in fna that is in the same directory as the script. We already down-loaded the nucleotide sequence of one bacterial genome (see above). For this exercise, we only want to use the chromosome. Transfer the *genomic.fna file for the bacterial genome to your computer (do the same for the the genomic.gbff file, we will need it later). Open the fna file in a text editor (atom, notepad++, BBedit). The first annotation line usually says something like this is the chromosome, and even if it doesn't, it usually is. This is the part of the file you want to keep. In rare cases, the first annotation line says plasmid or mini chromosome.
To get to the second annotation line, search for >.
If the first sequence was the chromosome, delete from before the second > to the end.
If the first one was a plasmid, delete form before the second > to the beginning. (repeat until only the chromosome is left)

Using filezilla, create a subdirectory in lab6 (call it GCSkew), and copy the fna file with only the bacterial chromosome sequence and the GCskew.pl script in this directory. Go back to the terminal window.

[If you were locked out, you need to execute the following:

ssh mcb3421usrXX@xanadu-submit-ext.cam.uchc.edu
module load perl
cd lab6 ]

cd GCSkew
perl GCskew.pl
Aside: If you want to repeat this for the archaeal sequence, you need to put its fna file into a separate directory and run it there. (The script runs it on one fna file in the directory and overwrites my_table if the file already exists.)

The script creates a file called my_table. Transfer it to your computer, add the extension .txt, and open it in excel. Turn it into a table and plot the first four columns as a scatter plot.
Do you see one or more turning points (Peaks)?

These usually are the origin and the terminus of replication.
Open the genomic.gbff file in a texteditor and search for dnaA in the annotation lines (sorry, there likely will be several amino acid sequences that also spell out DNAA). In bacteria dnaA is often/usually the first gene next to the origin of replication.
Does the position of the dnaA encoding gene correspond to one of the turning points? Is it located at the beginning of the plot?

You will return to the master computer. Now type qstat {
you could only list your jobs using qstat -u username}
You should have no jobs running.
If there is a job listed, then type qdel
followed by a space, and then the job-ID number, followed by return or enter key. Then type qstat
again, to confirm that there are no running jobs. Then type logout
to exit the main computer.

Finished?

Check the appropriate radio button below before pressing the submit button:

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.