Assignment 7: Gene Plots via BLAST+ executables & Strand Bias

Your name:
Your email address:

Introduction

This week, we will make a gene plot. (As seen in the lecture.) This is basically a way of visualizing the BLASTp hits between proteins (ORFs) in two genomes, in order to compare their relative arrangement (inversions, etc.). One genome is the x-axis, and the other genome is the y-axis. For each point (x,y) on a scatter plot, the following holds:

x and y are coordinates (base pairs from the origin of replication) of two proteins in their respective genomes
x and y represents a BLASTp match between the proteins at coordinates x (in genome 1) and y (in genome 2)

From last week, you'll recall that the tabular output (i.e., outfmt option 6) of a BLASTp search between proteins in two genomes (g1 and g2) looks like this:

BLASTp results
g1 accession	g2 accession	and 8 other columns (link)	E-value	Bit score
g1p1	g2p7	...	1e-5	50
g1p2	g2p8	...	2e-5	49
g1p3	g2p9	...	3e-5	48

You'll also recall that we downloaded a "..._protein.faa" file, which was a FASTA format file of all the encoded proteins in a given genome.

The FASTA format has a ">" line, followed by each sequence. The NCBI ..._protein.faa files look something like this:

>proteinAccessionA(space)some other descriptive text
M...the rest of the protein for the accession on the previous line, on one or more lines...
>proteinAccessionB(space)some other descriptive text
M...the rest of the protein for the accession on the previous line, on one or more lines...

There are several files in the FTP directories (RefSeq or other) for a genome . Today we will be using the four shaded in blue:

ReqSeq FTP directory contents, amongst others
File type	Description of contents
..._protein.faa	FASTA of encoded proteins
..._protein.gbff	Genbank format file
..._feature_table.txt	For each gene, lists the coordinate and whether chromosome/plasmid
..._genomic.fna	FASTA of genome assembly
..._cds_from_genomic.fna	FASTA of nucleotide coding sequences
..._genomic.gff	Generic Feature Format Version 3, similar to the feature table information, link
..._genomic.gbff	Genbank format file
..._rna_from_genomic.fna	FASTA of RNA sequences

Adding gene coordinates to BLASTp output

Given two genome (g1 and g2) FASTA protein files, our BLASTp output might look as follows:

g1 protein.faa FASTA (the query)
>g1p1 MLAMARCK >g1p2 MDARWIN >g1p3 MWALLACE

g2 protein.faa FASTA (the database)
>g2p7 MLAMARCH >g2p8 MDARWEN >g2p9 MWALLASE

BLASTp output
g1 accession	g2 accession	and 8 other columns (link)	E-value	Bit score
g1p1	g2p7	...	1e-5	50
g1p2	g2p8	...	2e-5	49
g1p3	g2p9	...	3e-5	48

We would then need some way to add the coordinates for the query and database accessions to the BLASTp output. This information is in the respective feature tables.

g1 feature table
g1 accession	start coordinate	end coordinate
g1p1	10	33
g1p2	40	60
g1p3	70	93

g2 feature table
g2 accession	start coordinate	end coordinate
g2p7	20	43
g2p8	50	70
g2p9	80	103

One simple way would be to change the header lines in our FASTA protein files to a genome coordinate instead of an accession. For this example we will choose the start coordinate.

modified g1 protein.faa FASTA
>10 MLAMARCK >40 MDARWIN >70 MWALLACE

modified g2 protein.faa FASTA
>20 MLAMARCH >50 MDARWEN >80 MWALLASE

BLASTp output from modified FASTA
g1 accession	g2 accession	and 8 other columns (link)	E-value	Bit score
10	20	...	1e-5	50
40	50	...	2e-5	49
70	80	...	3e-5	48

Then we proceed to do a scatter plot of the first two columns.

There are many ways to accomplish this (below), but today we will use option 1.

Use a Perl program to substitute the accession numbers in the FASTA files for each genome with the corresponding genome coordinate in the feature table, then BLAST as usual.
Use the command-line program "join". See Appendix 1 (if you dare).
Use Excel to merge the BLASTp output with the feature tables. This is called "Get & Transform".

Using Blast to do gene plots for microbial genomes

(different E.coli/Shigella/Salmonella, Frankia, Aeromonas, or different Thermotoga species work nicely).

A) obtaining the genome sequences

For today's exercise we will need two closely related bacterial (or archaeal) genomes. These could be strains for the same species (this has the risk of being slightly boring, if the two strains are closely related), or from the same genus (this has the risk of being too interesting, if the two strains separated by too many recombination events. Go to the NCBI's current genome list. Click on the "Prokaryotes" tab, display only complete genomes. The genomes need to be completely assembled.

Which species and strains did you choose?

The easiest will be to download the files for two or more genomes to your computer. Then use FileZilla to transfer the files into a Lab7 directory in your account on the cluster.

1) start filezilla and connect to transfer.cam.uchc.edu. (use you username and password)
2) create a lab7 directory and transfer the files you downloaded from the NCBI into that directory.
3) if you want to, rename the files into something more memorable than GCF_000025685.1_ASM2568v1_protein.faa.
Be careful when renaming the file. The most common mistake is to mix up file names. You an keep the long unwieldy names (at least you know what faa file corresponds to the feature table, but write into your notebook which organism is behind the number.
Keep the extensions of the file the same ( _protein.faa.gz , _feature_table.txt.gz , _genomic.fna.gz and _genomic.gbff).

Start Terminal or PuTTY. In the "Host" field, type xanadu-submit-ext.cam.uchc.edu
or, if you use terminal on a Mac open terminal and type (you could try the upwards arrow on your keybord to recall the command used last week):
ssh mcb3421usrXX@xanadu-submit-ext.cam.uchc.edu )
In the "Username" field, type your username: mcb3421usrXX, where XX is a number assigned to you. Login. It may ask you to accept a new host key. Now enter your password: .

type
srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash
and then hit the return or enter key. This takes you to a "compute" node. (Why? Because we are going to run the BLAST+ command, and we want to "farm" the processing out to another computer, rather than hammering the single computer which operates as the "gateway" for everyone.

Use qstat or hostname to verify that you are on a compute node.

change into the directory where fhe genome iles are located
cd lab7

Uncompress all the files using
gunzip *.gz (gunzip calls the program to unzip files, and * is a wild card the will be expanded into a list of filenames that includes all files ending on .gz)

To make the blast+ programs available:
module load blast

To make perl available in a way that works:
module load perl

To make Gnuplot available
module load gnuplot

Use the copy url commands below to get the following scripts into the lab 7 directory.

Perl program that substitutes the accession numbers in a protein FASTA file with the corresponding genome (start) coordinates in the feature table:
curl -O https://j.p.gogarten.uconn.edu/mcb3421_2020/labs/faaReplaceAccessionWithStart.pl

Other script we will need are

For removing all blast hits but the best one from the blast output table:
curl -O https://j.p.gogarten.uconn.edu/mcb3421_2020/labs/blastTopHit.pl

For making a plot using gnuplot
curl -O https://j.p.gogarten.uconn.edu/mcb3421_2020/labs/plotwgnu_mod2.pl

To add the location in the genome to the annotation line use the following command (the green text needs to be replaced by the names of your files, the pink is a name of your choice, but make sure you know which genome is which if you decide to call them genome_1wS.faa and genome_2wS.faa). Keep the extensions as indicated:

perl faaReplaceAccessionWithStart.pl yourGenome1.faa yourFeatureTable1.txt > yourGenome1WithStart.faa

perl faaReplaceAccessionWithStart.pl yourGenome2.faa yourFeatureTable2.txt > yourGenome2WithStart.faa

perl faaReplaceAccessionWithStart.pl yourGenome3.faa yourFeatureTable3.txt > yourGenome3WithStart.faa

Check that the content of the files is as expected
head yourGenome1WithStart.faa

Check that the ">accession" lines have been replaced with ">number" lines.

In case this doesn't work for one of the genomes, pick another genome and send me a link to the featuretable and faa file that failed.

Replace the accession lines in both (or more) genome's multiple FASTA files (the ones ending in .faa). Be careful that you use the corresponding feature table for each genome. Also note that if a file exists with the same name as that to the right of the "screen output" redirect (>) symbol, it will be replaced!

makeblastdb -in database_proteinWithStart.faa -dbtype prot -parse_seqids
Choose the first of your genomes as the "database". Do an "ls" to see the extra files you just made. Use the FASTA file with the start coordinates!

blastp -query query_proteinWithStart.faa -db database_proteinWithStart.faa -out blast.txt -outfmt 6 -evalue 1e-8
The other genome will be the "query". An E-value cut-off of 10^-8 is used, you could select a smaller cut-off (1e-14). Use the FASTA files with the start coordinates!
If you do multiple geneplots, you could call one output file blast.txt, and the next one blast2.txt ..,

This will take a few minutes. Again, here is a description of the columns.

To get just the top hit for each query sequence, we use another Perl program. Since the hits for each query are ordered by best E-value to worst, the top hit is simply the first hit for each query:
You use it as follows:

perl blastTopHit.pl blast.txt > blastTopHit.txt

head blastTopHit.txt

Notice that there is only one hit returned per query in the blastTopHit.txt file.
If you leave the names as given, the following script (below) works without modification. If you want to create multiple geneplots, you could call the files

Note: The "-max_target_seqs 1" option also returns one top BLASTp database hit for each query sequence (for problems with this command see here). However, since we also want all hits with E-value ≤ 10^-8 for the other plot, we can use the Perl program to avoid computing the BLASTp twice (once without the max_target_seqs option, and once with).

To plot the location in one genome against the location of the matches in the other genome we have two options. (B) using excel (see below) or (A) using gnuplot. The use of gnuplot is recomended.

A) Plotting the results using gnuplot

Gnuplot is installed on the cluster, and you can use it to create scatter plots for both of the matches (top and all significant) in the same coordinate system.

A script that does this is here. The script needs to be present in the same folder as the files with the blast output (blast.txt and blastTopHit.txt)
You need to open the file in a text editor, and edit the names of the files with the blast output - if you used the names blastTopHit.txt and blast.txt the program runs without editing, and you can skip the next paragraph.

There are many ways to download and edit the perl script.
One is to save and edit locally and use filezilla to move the file back and forth), or
use the editor nano to edit the file:
nano plotwgnu_mod2.pl
An alternative is download the file to your desktop (right click on the link and select save as), edit the file on you desktop (MSWord, save as txt), and transfer it back to the cluster using Filezilla)

To run the script type
perl plotwgnu_mod2.pl

Transfer the resulting plot (the file is called plot.png) to your computer using filezilla, and display the image on the screen.

If you want the best scoring blasthits depicted in red, you can edit the script in nano and change this line
print PLOT "plot ".'"'."$plot1".'" using 2:1 with points ls 2,'.'"'."$plot2".'" using 2:1 with points $

print PLOT "plot ".'"'."$plot1".'" using 2:1 with points ls 7,'.'"'."$plot2".'" using 2:1 with points $

Remember to rename the files (blast.txt .... plot.png before you run a second analysis)

==============================================
B) plotting the results using Excel:

Open filezilla and navigate to your lab7 directory, and transfer the blast.txt and blastTopHit.txt files to your Desktop.

Make an Excel scatter plot using all BLAST hits with E-value ≤ 10^-8, and another using just the top hits.

===============================================

What if any is the difference between the two plots? (Or the data plotted in different colors using gnuplot.)

Remember to rename the files (blast.txt .... before you run the second analysis)

Which genomes did you compare?
Describe the results you obtained in words AND copy the plots into you notebook.
For each plot, give the name of the strain used as databank (using the plotwgnu_mod2.pl script, the the databank genome ends up on the X-axis), and the name of the strain used as query.
Discuss what genome rearrangements might have given rise to this result. Does it appear likley that the origins of replication were placed in non-homologous positions?
Description of results: Was the ORI placed in homologous locations?

2) Cumulative Strand Bias

The script GCskew.pl goes through a genome and counts how many more Cs than Gs (or As than Ts, or Gs +Ts more than Cs and As). It counts continuously, if the C is encountered the C versus G counter goes up, and the Gs +Ts versus Cs and As goes down one.

The script runs on a file that ends in fna that is in the same directory as the script. We already down-loaded the nucleotide sequence of two genomes (the one with the fna extension -- see above). For this exercise, we only want to use the chromosome. Transfer the *genomic.fna file(s) for the bacterial genome to your computer (do the same for the the genomic.gbff files, we will need it later). Open the fna file in a text editor (atom, notepad++, BBedit). The first annotation line usually says something like this is the chromosome, and even if it doesn't, it usually is. This is the part of the file you want to keep. In rare cases, the first annotation line says plasmid or mini chromosome.
To get to the second annotation line, search for >.
If the first sequence was the chromosome, delete from before the second > to the end.
If the first one was a plasmid, delete form before the second > to the beginning. (repeat until only the chromosome is left).
Save the file under name_chromosome.fna

Using filezilla, create two subdirectories in lab6 (call them GCSkew1 and GCSkew2), and copy the fna files with only the chromosome sequences and the GCskew.pl script into these directory (one each). Go back to the terminal window.

[If you were locked out, you need to execute the following:

ssh mcb3421usrXX@xanadu-submit-ext.cam.uchc.edu
module load perl
cd lab7 ]

cd GCSkew1
perl GCskew.pl
cd..
cd GCSkew2
perl GCskew.pl
Note: The script runs it on one fna file in the directory and overwrites my_table if the file already exists.

The script creates a file called my_table. Transfer it to your computer, add the extension .txt, and open it in excel. Turn it into a table and plot the first four columns as a scatter plot.
Do you see one or more turning points (Peaks)?

These usually are the origin and the terminus of replication.
Open the genomic.gbff or the file in a texteditor and search for dnaA in the annotation lines (sorry, there likely will be several amino acid sequences that also spell out dnaA). In bacteria dnaA is often/usually the first gene next to the origin of replication. [You also can open the feature table in Excel and search for dnaA]
Does the position of the dnaA encoding gene correspond to one of the turning points? Is it located at the beginning of the plot?

Optional: Sequence conservation along a genome

Plot the level of sequence conservation along a genome. An easy way to do this is to sort the EXCEL spreadsheet on the ORF position, and then plot the bitscores as a bargraph, or using a scatterplot (bitscore versus position, or -log E-values versus position, or % identity versus position ... ). For this last exercise, if you want to identify the genes (see blastdbcmd).

Which region(s) of the genome is least conserved?

Type exit
You will return to the master node. Now type qstat
You should have no jobs running. If there is a job listed, then type qdel
followed by a space, and then the job-ID number, followed by return or enter key. Then type qstat
again, to confirm that there are no running jobs. Then type logout
to exit the main computer.

Finished?

Type logout to release the compute node form the queue.
Check the queue for abandoned sessions using qstat.
If there are abandoned sessions under your account, kill them by deleting them from the queue by typing qdel job-ID, e.g. "qdel 40000" would delete Job # 40000

Check the appropriate radio button below before pressing the submit button:

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.

Appendix 1: Using the join command

Problem: Your BLASTp output contains accession numbers, but no genome coordinates. The genome coordinates are in the feature_table files. We want to compare the genome coordinates of the matches. One way is with the "join" command:

e.g., Take two genomes, g1 and g2.

BLASTp results, top hits with "-max_target_seqs 1"
g1 accession	g2 accession
g1p1	g2p7
g1p2	g2p8
g1p3	g2p9

g1 feature table
g1 accession	coordinate
g1p1	1
g1p2	3
g1p3	5

g2 feature table
g2 accession	coordinate
g2p7	2
g2p8	4
g2p9	6

join blast_top_hit.txt g1_feature.txt > step1.txt

It will join on the first column.

step1.txt
g1 accession	g2 accession	g1 coordinate
g1p1	g2p7	1
g1p2	g2p8	3
g1p3	g2p9	5

join -1 2 step1.txt g2_feature.txt

The "-1 2" tells join that the first file (-1) will be joined on the second (2) column.

final output
g2 accession	g1 accession	g1 coordinate	g2 coordinate
g2p1	g1p7	1	2
g2p2	g1p8	3	4
g2p3	g1p9	5	6

A bit tedious, but it gets the job done. The files to join must be sorted by the columns they're joined on.

grep '^CDS' query_feature_table.txt | grep $'\tchromosome\t' | cut -f8,11 > query_start_accession.txt
head query_start_accession.txt
grep '^CDS' database_feature_table.txt | grep $'\tchromosome\t' | cut -f8,11 > database_start_accession.txt
head database_start_accession.txt
cut -f1,2 blast_top_hit.txt > accession_top_hit.txt
join -1 1 -2 2 <(sort accession_top_hit.txt) <(sort -k 2 query_start_accession.txt) > accession_top_hit_query_start.txt
join -1 2 -2 2 <(sort -k 2 accession_top_hit_query_start.txt) <(sort -k 2 database_start_accession.txt) > accession_top_hit_query_start_database_start.txt

When finished, open a new SFTP window in filezilla, navigate to your lab7 directory, and transfer over the accession_top_hit_query_start_database_start.txt file to your Desktop. Load it into Excel.

Make an Excel scatter plot of the joined file (accession_top_hit_query_start_database_start.txt)

{ This is an alternative to download the genomes! Filezilla is easier!

Start PuTTY. In the "Host" field, type mcbsubmit.cam.uchc.edu
In the "Username" field, type your username: mcb3421usrXX, where XX is a number assigned to you. Login. It may ask you to accept a new host key. Now enter your password: xxx.

In the terminal window type srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash
and then hit the return or enter key. This takes you to a "compute" node.
mkdir lab7
cd lab7

In your browser go to the NCBI's current genome list. Click on the "Prokaryotes" tab, display only complete genomes.
For today's exercise, we will need the genomes from two or more closely related bacteria or archaea (either strains from the same species, or species from the same genus (e.g., different Aeromonas species, different Thermotoga species, different Borreliella species (they have linear chromosomes)).

Open a text or word editor on you PC, and copy paste the names of the genomes you want to analyze. The reference genome (the one you will turn into a databank should be first). z
On the NCBI's current genome list move to the right an click on the R (or if there is only a G click there) to go to the FTP server, from which you can download the genome data. Copy the names of the .faa.gz file and the feature-table.txt.gz file onto your text file below the species and strain name.
Right click on the .faa.gz file and the feature-table.txt.gz and copy the link address, and paste it under the names of the feature-table and faa file. The result should look something like this:

Haloferax volcanii DS2
GCF_000025685.1_ASM2568v1_feature_table.txt.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/025/685/GCF_000025685.1_ASM2568v1/GCF_000025685.1_ASM2568v1_feature_table.txt.gz
File:GCF_000025685.1_ASM2568v1_protein.faa.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/025/685/GCF_000025685.1_ASM2568v1/GCF_000025685.1_ASM2568v1_protein.faa.gz

Haloferax mediterranei ATCC 33500
File:GCF_000306765.2_ASM30676v2_feature_table.txt.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/306/765/GCF_000306765.2_ASM30676v2/GCF_000306765.2_ASM30676v2_feature_table.txt.gz
GCF_000306765.2_ASM30676v2_protein.faa.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/306/765/GCF_000306765.2_ASM30676v2/GCF_000306765.2_ASM30676v2_protein.faa.gz

Haloferax gibbonsii
GCF_001190965.1_ASM119096v1_feature_table.txt.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/190/965/GCF_001190965.1_ASM119096v1/GCF_001190965.1_ASM119096v1_feature_table.txt.gz
GCF_001190965.1_ASM119096v1_protein.faa.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/190/965/GCF_001190965.1_ASM119096v1/GCF_001190965.1_ASM119096v1_protein.faa.gz

Make sure you are in the lab7 subdirectory.Then down load the faa and feature_table.txt files using the curl command:

curl -O paste_the_ftp_link_you_copied
e.g.,

curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/025/685/GCF_000025685.1_ASM2568v1/GCF_000025685.1_ASM2568v1_feature_table.txt.gz
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/025/685/GCF_000025685.1_ASM2568v1/GCF_000025685.1_ASM2568v1_protein.faa.gz

curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/306/765/GCF_000306765.2_ASM30676v2/GCF_000306765.2_ASM30676v2_feature_table.txt.gz
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/306/765/GCF_000306765.2_ASM30676v2/GCF_000306765.2_ASM30676v2_protein.faa.gz

curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/190/965/GCF_001190965.1_ASM119096v1/GCF_001190965.1_ASM119096v1_feature_table.txt.gz
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/190/965/GCF_001190965.1_ASM119096v1/GCF_001190965.1_ASM119096v1_protein.faa.gz

Uncompress all the files using
gunzip *.gz (gunzip calls the program to unzip files, and * is a wild card the will be expanded into a list of filenames that includes all files ending on .gz)

Rename the files using the species and strain names using the mv command. The names you choose should not contain spaces. e.g., Haloferax volcanii DS2 files should be named Haloferax_volcanii_DS2.

mv ridiculouslyLongFilename.faa somethingShorter.faa
mv ridiculouslyLongFilename_feature_table.txt somethingShorter_feature_table.txt

e.g.:
mv GCF_000025685.1_ASM2568v1/GCF_000025685.1_ASM2568v1_feature_table.txt Haloferax_volcanii_DS2_feature_table.txt
mv GCF_000025685.1_ASM2568v1/GCF_000025685.1_ASM2568v1_protein.faa Haloferax_volcanii_DS2.faa

mv GCF_000306765.2_ASM30676v2/GCF_000306765.2_ASM30676v2_feature_table.txt Haloferax_mediterranei_ATCC_33500_feature_table.txt
mv GCF_000306765.2_ASM30676v2/GCF_000306765.2_ASM30676v2_protein.faa Haloferax_mediterranei_ATCC_33500.faa

mv GCF_001190965.1_ASM119096v1_feature_table.txt Haloferax_gibbonsii_feature_table.txt
mv GCF_001190965.1_ASM119096v1_protein.faa Haloferax_gibbonsii.faa

}