Assignment 8 Handling genome (or larger) amounts of data -- Extracting text from other applications

Your name:
Your email address:

1) Complete the gene-plot from Assignment 7:

(This assumes that you already have the faa files and corresponding feature tables in the lab7 directory. Use the R link in the NCBI's current genome list. Just in case, get the files for three genomes from the same genus - sometimes the feature table and the faa file appear to use different accession numbers).

AND: keep notes of what you are doing! (open a textfile and copy names, links and commands to use into that file!

In the instruction below, the names of files you need to change are given in green courier font.

Start Filezilla and connect to transfer.cam.uchc.edu

PuTTY to xanadu-submit-ext.cam.uchc.edu

srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash

cd lab7

module load blast

module load perl

module load gnuplot

curl -O https://j.p.gogarten.uconn.edu/mcb3421_2019/labs/faaReplaceAccessionWithStart.pl

curl -O http://carrot.mcb.uconn.edu/mcb3421_2016/blastTopHit.pl

perl faaReplaceAccessionWithStart.pl yourGenome1.faa yourFeatureTable1.txt > yourGenome1WithStart.faa

perl faaReplaceAccessionWithStart.pl yourGenome2.faa yourFeatureTable2.txt > yourGenome2WithStart.faa

Check the resulting files.

Select one of the renumbered genomes as the query genome, and the other renumbered genomes as database.

makeblastdb -in database_proteinsWithStart.faa -dbtype prot -parse_seqids

blastp -query query_proteinWithStart.faa -db database_proteinWithStart.faa -out blast.txt -outfmt 6 -evalue 1e-8

perl blastTopHit.pl blast.txt > blastTopHit.txt

Use blast.txt and blastTopHit.txt as filenames, then you do not have to edit the gnuplot script.

Check the contents of blast.txt and blastTopHit.txt (use cat or head)

Create a gene plot using

perl plotwgnu_mod2.pl

(this creates a file called plot.png file in lab7). Transfer plot.png to your machine using filezilla and look at the plot.

Which genomes did you compare?
Describe the results you obtained in words AND the this description and email the resulting plots as attachment to gogarten@uconn.edu.
For each plot, give the name of the strain used as databank (using the plotwgnu_mod2.pl script, the the databank genome ends up on the Y-axis), and the name of the strain used as query.
Discuss what genome rearrangements might have given rise to this result. Does it appear likely that the origins of replication were placed in non-homologous positions?
Genome1 (as databank): Genome2 (as query): Description of results: Were the ORIs placed in homologous locations?

2) Visualizing strand bias

Go the NCBI's current genome list - and return to the two genomes you analyzed for the gene plot. (note: if possible, you should have used the R -link in the right column.) Download the two *_genomic.fna.gz files. Unzip the file (this might work by double clicking on the file on your PC, if not, filezilla to the lab7 directory, go to the PuTTY terminal, gunzip *.gz the file, and filezilla it back to your PC).

Open the unzipped file in a unix compatible texteditor (notepad++) and search for the fasta annotation lines. If there are multiple annotation lines, delete the annotation lines and sequences that do NOT correspond to the main chromosome. When only the sequence of the main chromosome is left, save the file (give it a name that lets you recognize which genome you used, and makes clear that this is only the chromosome (something like GCF_002934425.1_ASM293442v1/GCF_002934425.1_ASM293442v1_genomic_chromosome.fna would work).

Put each of the chromosome sequences in a separate subdirectory.

cd into each of these subdirectories and download a script to analyze strand bias into the same directories:

curl -O https://raw.githubusercontent.com/Gogarten-Lab/GC-Strand-Bias-Perl-Script/master/GCskew.pl

execute the script by typing

perl GCskew.pl

Do these for both genomes!

(Note: check out the script. In its present incarnation, it prints to file every 50 nucleotides. This maybe excessive. Can you find the line where the print to file is initiated?)

The script creates a table (my_Table) that you can load into Excel (maybe give it a txt extension first).

In Excel, select the first 4 columns and plot them as a scatter plot. For most bacterial genomes the turning points in cumulative composition bias correspond the origin and terminus of replication.
Can you identify the two turning points in cumulative bias?
Do your findings agree with the gene plot you had created? (In your answer include the organism whose genome you analyzed)
An unwritten convention was to start the numbering of a genome at the origin of replication and to go in the direction of the DnaA emcoding gene, often next to the origin of replication. Is DnaA the first gene in the faa file?

Description of results on Genome1: Genome2:

3) Creating a mummer plot

Mummer as an easy alternative to gene plots to study the synteny between two genomes:

Mummer finds matching nucleotide sequences, and produces nice plots, similar to the gene plot (above). The difference is that in this case the search is done on the nucleotide level, and that the program keeps track of the + and the - strand. Mummer is installed on the cluster.

The following assumes that you already have a terminal and filezilla connection to the cluster, and that you have moved to a compute node!

load MUMmer/4.0.2

module load gnuplot

Download two or more genomes you want to compare (you can use the same as above, or you can browse microbial genomes at NCBI and download chromosomes or genomes (*.fna) files to your computer or use the curl -O command targeting the NCBI FTP site.

For an example you could download three Haloferax chromosomes from http://gogarten.uconn.edu/mcb3421_2019/labs/Haloferax_genomes.zip

If you only have the main chromosomes in the files, mummer works fine:
Execute the following commands, replacing the genome names with the names of the ones you chose to analyze. gib_med is an abbreviation that should remind you of which genomes you compared. Replace it with something corresponding to your genomes.
mummer -maxmatch -b -c Haloferax_gibbonsii_strain_ARA6.fasta Haloferax_mediterranei_ATCC_33500.fasta > gib_med.mums

To plot the matches we again use gnuplot. Mummer comes with a script that creates a file to be send to gnuplot. Sadly, a few of the options on file output that mummerplot makes do not work on the cluster, resulting in an error message. Nevertheless, mummer generates a file that is close to being understood by gnuplot as installed on the cluster. We will execute mummerplot and then edit this file, and then send it to gnuplot from the commandline:

mummerplot --png --large --prefix=gib_med gib_med.mums

You will get an error message, because the gnuplotfile is not quite understood by gnuplot. To fix the problem, edit the gp file in a text editor (nano, vi):
delete "tiny" and increase size to 4000 4000 (e.g.: nano gib_med.gp will open the file in nano. Use the cursor keys to move around in the file and to edit delete things, then use the <control> key options to save the file and exit the editor.)

gnuplot gib_med.gp

Download the resulting png file to your computer ....

If your genomes consists of multiple contigs or chromosomes, mummerplot is rescaling the output, so that each contig covers the whole Y axis, which may not be what one is looking for. The plots that only give the best matches look nicer:

nucmer Haloferax_gibbonsii_strain_ARA6.fasta Haloferax_mediterranei_ATCC_33500.fasta
mummerplot out.delta
Again, you get an error that an incorrect output device is selected.

Modify out.gp in a texteditor
delete the current set terminal line
add lines specifying a png terminal (below)
and a line sending the output to a file

set terminal png size 4000,4000
set output "gib_med_nucmer.png"

Then execute
gnuplot out.gp

hit <return> to leave the "interactive mode"

Repeat for other genome comparisons:

If you are interested in details, check the mummer manual at http://mummer.sourceforge.net/manual/#mummer

Which genomes did you analyze?
Did the genomes use homologous starting points?

Description of your results Genome1: Genome2:

Finished?

Type exit to release the compute node from the queue.
If you you encountered problems in your session, check the queue for abandoned sessions using the command qstat. If there are abandoned sessions under your account, kill them by deleting them from the queue by typing qdel job-ID, e.g. "qdel 40000" would delete Job # 40000

Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.