Assignment 7: Gene Bias Plots via BLAST+ executables & Strand Bias Plots

Your name:
Your email address:




Today we will meet on webex at https://uconn-cmr.webex.com/meet/jpg02006


Introduction

This week, we will make a gene and a strand bias plot. (As seen in the lecture.)

The strand bias plot depicts the bias in nucleotide usage along the genome (x-axis). Plotting this in a cumulative fashion integrates over local fluctuation in a window (see slides 29 an 32 in class11_strand_bias_and_recombination)

A gene plot is basically a way of visualizing the BLASTp hits between proteins (ORFs) in two genomes, in order to compare their relative arrangement (inversions, etc.). One genome is the x-axis, and the other genome is the y-axis. For each point (x,y) on a scatter plot, the following holds:

For both of these plots it is helpful to keep in mind that most of the bacterial and archaeal genome are circular, i.e., the x-axis should be considered on a circle.

Set-Up

ssh

If you are working on a windows machine, start a program called PuTTY (download from here). 
In the "Host" field, type xanadu-submit-ext.cam.uchc.edu

In the "Username" field, type your username mcb3421usrXX, where XX is a number assigned to you. (Go to the class notebook, behind the name of your personal section is an _#. this is your number.  If your number is 1, do not use a leading 0.  Your username would be mcb3421usr1.)
Login. It will ask you to accept a new host key. Now enter the password _______ (the password is in the notebook under today's summary and goals section).  Hit return. <Note the terminal will not echo the keystrokes you make to enter the password.>

The window with a black background that opens on your screen is the command-line window.

In the window with a black background, you will see a flashing cursor. Stuff you type appears at the cursor position.

If you are working on a Mac, find the terminal application (Termianl.app).  By default is located in the Applications folder, in a subfolder utilitites (/System/Applications/Utilities/Terminal.app).  The easiest might be to search for Terminal.app in the finder or in spotlight. 

Double click the Terminal application icon.  In the window that opens, type

ssh mcb3421usrXX@xanadu-submit-ext.cam.uchc.edu
where XX is a number assigned to you. (Go to the class notebook, behind the name of your personal section is an _#. this is your number.  If your number is 1, do not use a leading 0.  Your username would be mcb3421usr1.)
It will ask you to accept a new host key, type yes <return>  [<return> means hit the return key.]
Now enter the password _______ (the password is in the notebook under today's summary and goals section).  Hit return. <Note the terminal will not echo the keystrokes you make to enter the password.>

sftp

File transfer using Filezilla

Start  Filezilla -- download here

Open the site manager (under the file menu, or the left most icon in the bar on top of the application window. 
Click on the NewSite button at the bottom left of the site manager window.
On the right
under protocol select SFTP
under host enter transfer.cam.uchc.edu
under LogOn Type select normal
under User enter mcb3421usrXX (Go to the class notebook, behind the name of your personal section is an _#. this is your number.  If your number is 1, do not use a leading 0.  Your username would be mcb3421usr1.)
under Password enter your xanadu password
Hit connect.

You should see 4 windows.  On top the directories structure on your computer (left) and on the cluster (right).  On the bottom the contents of the two active directories (i.e. the directories between which the files will be transferred).  You can move to a directory/ subdirectory by double clicking.  The folder with two dots behind it is the parent directory.
If you double click on a file in the bottom windows it is transferred (copied) between the two directories on the two computers.  You also can drop and drag files in the window representing the directory on xanadu. 


Files and scripts for today

The files that you will need today are here (as zip file) or here as a tar.gz file.
You could download the latter using ssh using
curl -O https://j.p.gogarten.uconn.edu/mcb3421_2021/labs/lab7archive.tar.gz
To unpack the file, execute
tar -xzvf lab7archive.tar.gz
The advantage of the curl command is that you do not fill up your harddrive.
The zip archive needs to be unpacked on your desk- or laptop, and the files transferred via Filezilla.

Cumulative Strand Bias

The script GCskew_1000_all.pl goes through a genome and counts how many more Cs than Gs (or As than Ts, or Gs +Ts more than Cs and As).  It counts continuously, if the C is encountered the C versus G counter goes up, and the Gs +Ts versus Cs and As goes down one. 

The script runs on all file that ends in fna that are in the same directory as the script. These files should only contain the chromosome. Several chromosome sequences are already in the Lab7_FileArchive.

available ), or TextEdit (Mac).  The first annotation line usually says something like this is the chromosome, and even if it doesn't, it usually is. This is the part of the file you want to keep. In rare cases, the first annotation line says plasmid or mini chromosome. 
To get to the second annotation line, search for >. 
If the first sequence was the chromosome, delete from before the second > to the end. 
If the first one was a plasmid, delete form before the second > to the beginning.   (repeat until only the chromosome is left)]

Using filezilla, create a directory called lab7. Inside this directory create a directory called GCSkew.
Copy the ...chromosome.fna files and the GCskew_1000_all.pl into this directory.  Open an ssh terminal window (either using PuTTY or Terminal) Log into xanadu (see above)
ssh mcb3421usrXX@xanadu-submit-ext.cam.uchc.edu
module load perl
cd lab7
cd GCSkew
perl GCskew_1000_all.pl
The script works on all *.fna sequences in the directory and creates one table for each ending in ...mytable.txt . In the provided archive are several chromosome sequences, and a few files where the chromosome sequence is repeated three times. These files (ending in _3x.fna) might make it more clear that the chromosome is circular and that there are two turning point in GC bias along the chromosome.

The script creates a files ending in my_table.txt. Transfer these to your computer, and open it in excel.  Turn it into a table and plot the first four columns as a scatter plot.
Do you see one or more turning points (Peaks)? (Consider that the genome is circular. How would it look, if you keep going around the genome? If in doubt look at the analysis of the 3x.fna files) 


The turning points usually are the origin and the terminus of replication. 
Open the feature_table.txt file for the genome you are analyzing in a texteditor and search for dnaA in the annotation lines. In bacteria dnaA is often/usually the first gene next to the origin of replication. 
Does the position of the dnaA encoding gene correspond to one of the turning points?  Is it located at the beginning of the plot?


Repeat this analysis for the Haloferax volcanii genome.
How many turning points do you see for Haloferax? What could be an explanation for this?
(see Figure 4 in https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0030077 and
Figure 2 in https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0009605 and
figure 4 in https://www.nature.com/articles/ncomms9321?origin=ppub )



Gene Plots

For a gene plot We will use a ...protein.faa files for which each annotation line starts with the location of the gene in the genome. The perl script used to do this is part of the file archive for this class. The command to run the program is

perl faaReplaceAccessionWithStart.pl yourGenome1.faa yourFeatureTable1.txt > yourGenome1WithStart.faa

The faa and the feature table file need to be in the same directory.

For today's exercise we will need two closely related bacterial (or archaeal) genomes.  These could be strains for the same species or from the same genus In the archive for this class are several .faa, where the location of the genome has been added to the beginning of each annotation line.
these files end with ..._with_start.faa. Select two of these files and transfer them to xanadu (using filezilla)

Which species/strains did you choose?  


Start Terminal or PuTTY. In the "Host" field, type xanadu-submit-ext.cam.uchc.edu
or, if you use terminal on a Mac open terminal and type
ssh mcb3421usrXX@xanadu-submit-ext.cam.uchc.edu
In the "Username" field of PuTTY, type your username: mcb3421usrXX, where XX is a number assigned to you. Login. It may ask you to accept a new host key. Now enter your password: ......

type 
srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash
and then hit the return or enter key. This takes you to a "compute" node.  (Why? Because we are going to run the BLAST+ command, and we want to "farm" the processing out to another computer, rather than hammering the single computer which operates as the "gateway" for everyone.

Use qstat or hostname to verify that you are on a compute node.

change into the directory where fhe genome iles are located
cd lab7

To make the blast+ programs available:
module load blast/2.11.0
module load perl

To make Gnuplot available
module load gnuplot

Copy the following scripts into the lab 7 directory.

Two genomes ...with_start.faa
plotwgnu_mod2.pl
header.txt
blastTopHit.pl

Check that the content of the files is as expected
head
yourGenome1_with_start.faa

Check that the ">accession" lines have been replaced with ">number something" lines.


In the following command, replace OneOfYour...with_start.faa with the actual file name of one of the genomes you want to compare. This file will become your databank

makeblastdb -in OneOfYour...with_start.faa -dbtype prot -parse_seqids
To do the blast search we will use the following command (again, you need to replace the names with the names of the actual filenames!

query_protein_with_start.faa it the other genome you want to analyze.
database_protein_with_start.fa is the database we had just created.

blastp -query query_protein_with_start.faa -db database_protein_with_start.faa -out blast.txt -outfmt 6 -evalue 1e-20
An E-value cut-off of 10-20 is used, you could select a smaller cut-off (1e-40).

This will take a few minutes. Again, here is a description of the columns.

To get just the top hit for each query sequence, we use another Perl program. Since the hits for each query are ordered by best E-value to worst, the top hit is simply the first hit for each query:
You use it as follows:

perl blastTopHit.pl blast.txt > blastTopHit.txt

head blastTopHit.txt

Notice that there is only one hit returned per query in the blastTopHit.txt file.
If you leave the names as given, the following script (below) works without modification.  If you want to create multiple geneplots, you could call the files

Note: The "-max_target_seqs 1" option also returns one top BLASTp database hit for each query sequence (for problems with this command see here). However, since we also want all hits with E-value ≤ 10-20 for the other plot, we can use the Perl program to avoid computing the BLASTp twice (once without the max_target_seqs option, and once with).

To plot the location in one genome against the location of the matches in the other genome you have two options. A) Download the tabular blast output into excel, or use gnuplot. The use of gnuplot is recomended.

Plotting the results using gnuplot

Gnuplot is installed on the cluster, and you can use it to create scatter plots for both of the matches (top and all significant) in the same coordinate system.

A script that does this is plotwgnu_mod2.pl (in the file archive). The script needs to be present in the same folder as the files with the blast output (blast.txt and blastTopHit.txt)
If you used the commands as given above, i.e., your blast output tables are called blast.txt and blastTopHit.txt you can use the script as is.
Else, you need to open the file in a text editor, and edit the names of the files with the blast output.

There are many ways to download and edit the perl script.
One is to save and edit locally and use filezilla to move the file back and forth), or
use the editor nano to edit the file:
nano
plotwgnu_mod2.pl

An alternative is download the file to your desktop (right click on the link and select save as), edit the file on you desktop (MSWord, save as txt), and transfer it back to the cluster using Filezilla)

To run the script type
perl plotwgnu_mod2.pl

Transfer the resulting plot (the file is called plot.png) to your computer using filezilla, and display the image on the screen.

If you want the best scoring blasthits depicted in red, you can edit the script in nano and change this line
print PLOT "plot ".'"'."$plot1".'" using 2:1 with points ls 2,'.'"'."$plot2".'" using 2:1 with points ls 1;'."\n";

to

print PLOT "plot ".'"'."$plot1".'" using 2:1 with points ls 7 ,'.'"'."$plot2".'" using 2:1 with points ls 1;'."\n";


If you want larger or smaller point plotted for each blast hit, change the set pointsize in line 11 of the plotwgnu_mod2.pl script.

Remember to rename the files (blast.txt .... plot.png) before you run a second analysis

==============================================
B) plotting the results using Excel:

Open filezilla and navigate to your lab7 directory, and transfer the blast.txt and blastTopHit.txt files to your Desktop.

Make an Excel scatter plot using all BLAST hits with E-value ≤ 10-20, and another using just the top hits.

===============================================

If you used excel: What if any is the difference between the two plots? What is the difference between the data plotted in different colors using gnuplot?

In case you want to run a second analysis, remember to rename the files (blast.txt .... before you run the second analysis)


Which genomes did you compare?
Describe the results you obtained in words AND copy the plots into you notebook.
For each plot, give the name of the strain used as databank (If you used the plotwgnu_mod2.pl script, the databank genome ends up on the X-axis), and the name of the strain used as query.
Discuss what and how many genome rearrangements might have given rise to this result. Does it appear likley that the starts of the genomes were placed in non-homologous positions?
Does this agree with the GC Bias plots?


In your ssh connection, type exit
You will return to the master node. Now type qstat
You should have no jobs running. If there is a job listed, then type qdel
followed by a space, and then the job-ID number, followed by return or enter key. Then type qstat
again, to confirm that there are no running jobs. Then type logout
to exit the main computer.


Finished?

Type logout to release the compute node form the queue.
Check the queue for abandoned sessions using qstat.
If there are abandoned sessions under your account, kill them by deleting them from the queue by typing qdel job-ID, e.g. "qdel 40000" would delete Job # 40000

Check the appropriate radio button below before pressing the submit button:

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.