This week, we will make a gene plot (as seen in the lecture). This is a way of visualizing the BLASTp hits between proteins (ORFs) in two genomes in order to compare their relative arrangement (inversions, etc.). One genome is the x-axis, and the other genome is the y-axis. For each point (x,y) on a scatter plot, the following holds:
From last week, you'll recall that the tabular output (i.e., outfmt option 6) of a BLASTp search between proteins in two genomes (g1 and g2) looks like this:
You'll also recall that we downloaded a "..._protein.faa" file, which was a FASTA format file of all the encoded proteins in a given genome.
In FASTA format, each sequence is preceded by a header line that starts with ">". The NCBI ..._protein.faa files look something like this:
>proteinAccessionA(space)some other descriptive text
M...the rest of the protein for the accession on the previous line, on one or more lines...
>proteinAccessionB(space)some other descriptive text
M...the rest of the protein for the accession on the previous line, on one or more lines...
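As a quick check of the format described above, the accession is the text between ">" and the first space of each header line. The following is a minimal sketch; the function name list_accessions and the toy records are invented for illustration and are not part of the course scripts.

```shell
# List the accession of every record in a protein FASTA file.
# The accession is taken to be everything between ">" and the first space.
list_accessions() {
    grep '^>' "$1" | cut -d ' ' -f 1 | sed 's/^>//'
}

# Toy input; accessions and sequences are invented for illustration.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
>WP_000000001.1 hypothetical protein [Example organism]
MKTAYIAKQR
QISFVKSHFS
>WP_000000002.1 another protein [Example organism]
MSDNELLK
EOF

list_accessions "$tmp"     # one accession per line
grep -c '^>' "$tmp"        # number of proteins in the file
rm -f "$tmp"
```

Counting the header lines with grep -c is also a convenient way to see how many proteins a genome encodes before running BLAST.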
There are several files in the FTP directories (RefSeq or other) for a genome. Today we will be using the four shaded in blue:
Given two genome (g1 and g2) FASTA protein files, our BLASTp output might look as follows:
We would then need some way to add the coordinates for the query and database accessions to the BLASTp output. This information is in the respective feature tables.
One simple way would be to change the header lines in our FASTA protein files to a genome coordinate instead of an accession. For this example we will choose the start coordinate.
Then we proceed to do a scatter plot of the first two columns.
There are many ways to accomplish this (below), but today we will use option 1.
(different E. coli/Shigella/Salmonella, Frankia, Aeromonas, or different Thermotoga species work nicely).
A) Obtaining the genome sequences
For today's exercise we will need two closely related bacterial (or archaeal) genomes. These could be strains of the same species (this has the risk of being slightly boring, if the two strains are closely related), or from the same genus (this has the risk of being too interesting, if the two strains are separated by too many recombination events). Go to the NCBI's current genome list. Click on the "Prokaryotes" tab and display only complete genomes. The genomes need to be completely assembled.
The easiest approach is to download the files for two or more genomes to your computer. Then use FileZilla to transfer the files into a lab7 directory in your account on the cluster.
1) Start FileZilla and connect to (use your username and password).
2) Create a lab7 directory and transfer the files you downloaded from the NCBI into that directory.
3) If you want to, rename the files to something more memorable than GCF_000025685.1_ASM2568v1_protein.faa. Be careful when renaming: the most common mistake is to mix up file names. You can keep the long unwieldy names (at least you know which faa file corresponds to which feature table), but write in your notebook which organism is behind the number. Keep the file extensions the same (_protein.faa.gz, _feature_table.txt.gz, _genomic.fna.gz and _genomic.gbff).
Start Terminal or PuTTY. In the "Host" field, type or, if you use Terminal on a Mac, open Terminal and type (you could try the up arrow on your keyboard to recall the command used last week): ssh ) In the "Username" field, type your username: mcb3421usrXX, where XX is a number assigned to you. Log in. It may ask you to accept a new host key. Now enter your password: .
Type srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash and then hit the return or enter key. This takes you to a "compute" node. (Why? Because we are going to run the BLAST+ command, and we want to "farm" the processing out to another computer, rather than hammering the single computer which operates as the "gateway" for everyone.)
Use qstat or hostname to verify that you are on a compute node.
Change into the directory where the genome files are located: cd lab7
Uncompress all the files using gunzip *.gz (gunzip calls the program to unzip files, and * is a wildcard that will be expanded into a list of filenames that includes all files ending in .gz)
To make the blast+ programs available: module load blast
To make perl available in a way that works: module load perl
To make Gnuplot available: module load gnuplot
Use the curl commands below to get the following scripts into the lab7 directory.
Perl program that substitutes the accession numbers in a protein FASTA file with the corresponding genome (start) coordinates in the feature table: curl -O
Other scripts we will need are:
For removing all blast hits but the best one from the blast output table: curl -O
For making a plot using gnuplot: curl -O
To add the location in the genome to the annotation line, use the following commands. Replace yourGenomeN.faa and yourFeatureTableN.txt with the names of your files; the output name after > is a name of your choice, but make sure you know which genome is which if you decide to call them genome_1wS.faa and genome_2wS.faa. Keep the extensions as indicated:
perl yourGenome1.faa yourFeatureTable1.txt > yourGenome1WithStart.faa
perl yourGenome2.faa yourFeatureTable2.txt > yourGenome2WithStart.faa
perl yourGenome3.faa yourFeatureTable3.txt > yourGenome3WithStart.faa
Check that the ">accession" lines have been replaced with ">number" lines.
In case this doesn't work for one of the genomes, pick another genome and send me a link to the feature table and faa file that failed.
Replace the accession lines in the multi-FASTA files (the ones ending in .faa) of both (or more) genomes. Be careful to use the corresponding feature table for each genome. Also note that if a file already exists with the same name as the one to the right of the "screen output" redirect (>) symbol, it will be replaced!
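To make the substitution step concrete, here is a minimal awk sketch of the idea. It is not the course Perl script. It assumes a tab-separated feature table with "CDS" in column 1, the start coordinate in column 8 and the protein accession in column 11. The function name add_start_coordinates, the choice to drop proteins without a CDS entry, and the toy data are all assumptions made for illustration.

```shell
# Replace each FASTA header with ">start", looking the start up in the feature table.
# Assumed feature-table layout: tab-separated, column 1 = feature type,
# column 8 = start coordinate, column 11 = protein accession.
# usage: add_start_coordinates feature_table.txt proteins.faa
add_start_coordinates() {
    awk -F '\t' '
        NR == FNR {                      # first file: the feature table
            if ($1 == "CDS" && $11 != "" && !($11 in start)) start[$11] = $8
            next
        }
        /^>/ {                           # second file: FASTA header lines
            acc = substr($0, 2)
            sub(/ .*/, "", acc)          # keep the text up to the first space
            keep = (acc in start)        # sketch choice: drop proteins with no CDS entry
            if (keep) print ">" start[acc]
            next
        }
        keep { print }                   # sequence lines of the records we kept
    ' "$1" "$2"
}

# Toy data (invented): one CDS row and a two-record FASTA file.
ft=$(mktemp); faa=$(mktemp)
printf 'CDS\twith_protein\tasm\tPrimary Assembly\tchromosome\t\tNC_X\t190\t255\t+\tWP_A.1\n' > "$ft"
printf '>WP_A.1 some protein\nMKRISTT\n>WP_B.1 not in the table\nMSDNE\n' > "$faa"

add_start_coordinates "$ft" "$faa"   # header of the first record becomes ">190"
rm -f "$ft" "$faa"
```

If an accession occurs more than once in the feature table, this sketch keeps the first start coordinate it sees; the course script may handle that case differently.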
makeblastdb -in database_proteinWithStart.faa -dbtype prot -parse_seqids
Choose the first of your genomes as the "database". Do an "ls" to see the extra files you just made. Use the FASTA file with the start coordinates!
blastp -query query_proteinWithStart.faa -db database_proteinWithStart.faa -out blast.txt -outfmt 6 -evalue 1e-8
The other genome will be the "query". An E-value cut-off of 10^-8 is used; you could select a smaller cut-off (1e-14). Use the FASTA files with the start coordinates! If you do multiple gene plots, you could call one output file blast.txt, the next one blast2.txt, and so on.
This will take a few minutes. Again, here is a description of the columns.
To get just the top hit for each query sequence, we use another Perl program. Since the hits for each query are ordered from best E-value to worst, the top hit is simply the first hit for each query. You use it as follows:
perl blast.txt > blastTopHit.txt
head blastTopHit.txt
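The "first hit per query" rule can also be written as an awk one-liner. This is a sketch of the same idea, not the course Perl script; the function name top_hits and the toy rows are invented, and it assumes tab-separated outfmt 6 rows with the query in column 1.

```shell
# Keep only the first row seen for each query (column 1).
# Because BLAST lists the hits of a query from best to worst,
# the first row is the top hit.
# usage: top_hits blast.txt      (or pipe the table in on stdin)
top_hits() {
    awk -F '\t' '!seen[$1]++' "$@"
}

# Toy table (invented values; only the first three columns are shown).
printf '100\t120\t98.5\n100\t4000\t45.0\n2500\t2600\t91.2\n' | top_hits
```

The expression !seen[$1]++ is true only the first time a given query ID appears, so every later row for that query is skipped.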
A) Plotting the results using gnuplot
Gnuplot is installed on the cluster, and you can use it to create scatter plots for both sets of matches (top hits and all significant hits) in the same coordinate system.
There are many ways to download and edit the Perl script. One is to use the editor nano to edit the file on the cluster: nano
Another is to edit the file locally: download it to your desktop (right-click on the link and select "save as"), edit it on your desktop (in MS Word, save as txt), and transfer it back to the cluster using FileZilla.
To run the script type perl
Transfer the resulting plot (the file is called plot.png) to your computer using FileZilla, and display the image on the screen. If you want the best-scoring BLAST hits depicted in red, you can edit the script in nano and change this line
print PLOT "plot ".'"'."$plot1".'" using 2:1 with points ls 2,'.'"'."$plot2".'" using 2:1 with points $
to
print PLOT "plot ".'"'."$plot1".'" using 2:1 with points ls 7,'.'"'."$plot2".'" using 2:1 with points $
==============================================
B) Plotting the results using Excel:
Open filezilla and navigate to your lab7 directory, and transfer the blast.txt and blastTopHit.txt files to your Desktop.
Make an Excel scatter plot using all BLAST hits with E-value ≤ 10^-8, and another using just the top hits.
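If you want to try a stricter cut-off such as 1e-14 without rerunning blastp, the table can be filtered on the E-value column before loading it into Excel. A minimal sketch, assuming the standard outfmt 6 column order (E-value in column 11) and FASTA headers that were replaced by start coordinates, so that columns 1 and 2 are the two axes; the function name hits_below_evalue and the toy rows are invented.

```shell
# Print query start and database start (columns 1 and 2) for every hit
# whose E-value (column 11 of outfmt 6) is at or below the cut-off.
# usage: hits_below_evalue CUTOFF [blast.txt]
hits_below_evalue() {
    cutoff=$1
    shift
    awk -F '\t' -v cutoff="$cutoff" \
        '($11 + 0) <= (cutoff + 0) { print $1 "\t" $2 }' "$@"
}

# Toy rows (invented): 12 tab-separated outfmt 6 columns each.
printf '100\t120\t98.5\t300\t4\t0\t1\t300\t1\t300\t1e-150\t600\n4520\t90\t35.0\t80\t50\t2\t5\t84\t9\t88\t2e-09\t52.4\n' |
    hits_below_evalue 1e-14
```

Adding 0 to both values forces awk to compare them as numbers, which matters for E-values written in exponent notation.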
What, if any, is the difference between the two plots? (Or between the data plotted in different colors using gnuplot.)
Remember to rename the files (blast.txt, ...) before you run the second analysis.
Which genomes did you compare? Describe the results you obtained in words AND copy the plots into your notebook. For each plot, give the name of the strain used as databank (using the script, the databank genome ends up on the X-axis), and the name of the strain used as query. Discuss what genome rearrangements might have given rise to this result. Does it appear likely that the origins of replication were placed in non-homologous positions? Description of results: Was the ORI placed in homologous locations?
Plot the level of sequence conservation along a genome. An easy way to do this is to sort the Excel spreadsheet on the ORF position, and then plot the bitscores as a bar graph, or as a scatter plot (bitscore versus position, or -log E-values versus position, or % identity versus position ... ). If you want to identify the genes for this last exercise, see blastdbcmd.
Which region(s) of the genome are least conserved?
Type logout to release the compute node from the queue. Check the queue for abandoned sessions using qstat. If there are abandoned sessions under your account, kill them by deleting them from the queue by typing qdel job-ID, e.g. "qdel 40000" would delete Job # 40000
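The sorting step can also be done on the cluster before the file goes into Excel. A minimal sketch, assuming the top-hit table is in outfmt 6 with the query start coordinate in column 1 (because the FASTA headers were replaced) and the bitscore in column 12; the function name conservation_profile and the toy rows are invented.

```shell
# Sort hits numerically by query position (column 1) and keep
# only position and bitscore (column 12), ready for plotting.
# usage: conservation_profile blastTopHit.txt   (or pipe the table in on stdin)
conservation_profile() {
    tab=$(printf '\t')
    sort -t "$tab" -k 1,1n "$@" | cut -f 1,12
}

# Toy rows (invented): 12 tab-separated columns each.
printf '9000\t8800\t40.1\t90\t50\t1\t1\t90\t1\t90\t3e-10\t55.1\n150\t170\t99.0\t400\t4\t0\t1\t400\t1\t400\t0.0\t812\n' |
    conservation_profile
```

For % identity versus position, cut -f 1,3 would be used instead, since % identity is column 3 of the tabular output.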
Problem: Your BLASTp output contains accession numbers, but no genome coordinates. The genome coordinates are in the feature_table files. We want to compare the genome coordinates of the matches. One way is with the "join" command:
e.g., Take two genomes, g1 and g2.
join blast_top_hit.txt g1_feature.txt > step1.txt
It will join on the first column.
join -1 2 step1.txt g2_feature.txt
The "-1 2" tells join that the first file (-1) will be joined on the second (2) column.
A bit tedious, but it gets the job done. The files to join must be sorted by the columns they're joined on.
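Because forgetting to sort is the most common reason for join to drop lines silently, it can help to wrap the sort and the join together. The following is a small sketch on toy data; the helper name join_on, the file names and all values are invented for illustration.

```shell
# Join two whitespace-separated files on the given columns,
# sorting each file on its join column first (join requires sorted input).
# usage: join_on COL1 FILE1 COL2 FILE2
join_on() {
    s1=$(mktemp); s2=$(mktemp)
    sort -k "${1}b,${1}" "$2" > "$s1"
    sort -k "${3}b,${3}" "$4" > "$s2"
    join -1 "$1" -2 "$3" "$s1" "$s2"
    rm -f "$s1" "$s2"
}

# Toy data (invented): top hits plus one start/accession table per genome.
work=$(mktemp -d)
printf 'qB dA\nqA dB\n'   > "$work/top_hit.txt"          # query accession, database accession
printf '100 qA\n2500 qB\n' > "$work/query_start.txt"     # start, accession
printf '90 dB\n2600 dA\n'  > "$work/database_start.txt"  # start, accession

# Step 1: add the query start; step 2: add the database start.
join_on 1 "$work/top_hit.txt" 2 "$work/query_start.txt" > "$work/step1.txt"
join_on 2 "$work/step1.txt" 2 "$work/database_start.txt"
rm -rf "$work"
```

join prints the join column first, so the final columns are: database accession, query accession, query start, database start.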
grep '^CDS' query_feature_table.txt | grep $'\tchromosome\t' | cut -f8,11 > query_start_accession.txt
head query_start_accession.txt
grep '^CDS' database_feature_table.txt | grep $'\tchromosome\t' | cut -f8,11 > database_start_accession.txt
head database_start_accession.txt
cut -f1,2 blast_top_hit.txt > accession_top_hit.txt
join -1 1 -2 2 <(sort accession_top_hit.txt) <(sort -k 2 query_start_accession.txt) > accession_top_hit_query_start.txt
join -1 2 -2 2 <(sort -k 2 accession_top_hit_query_start.txt) <(sort -k 2 database_start_accession.txt) > accession_top_hit_query_start_database_start.txt
When finished, open a new SFTP window in filezilla, navigate to your lab7 directory, and transfer over the accession_top_hit_query_start_database_start.txt file to your Desktop. Load it into Excel.
Make an Excel scatter plot of the joined file (accession_top_hit_query_start_database_start.txt)
