MCB 5472 : Delineating pan-genomes

Please send your answers per email to gogarten@uconn.edu, or hand in a hardcopy

Please let me know, how far you got during the lab. If most students didn't finish, we may continue this next week!

Walk-through

Log on to the cluster on your Terminal and SSH client and log into a compute node (qlogin).

mkdir Pangenome_Lab

cd Pangenome_Lab

Load the GET_HOMOLOGUES module
module load get_homologues/20160802
(you can check for available modules typing module avail)

Create a directory for .faa files
mkdir Hrr_Files

Move the 5 genome (1, 2, 3, 4, 5 or zipped) files to 'Hrr_Files' (if you use curl -O link for the zipped file, you can unzip it using unzip *.zip on the commandline)

Time to run the BLAST searches and create our first pan-genome
get_homologues.pl -M -c -t 0 -d Hrr_Files
run the command from the Pangenome_Lab directory. This should take ~20 minutes to run on BBC3.

The get_homologues manual is here; the paper describing the software is here

The '-M' selects the OrthoMCL algorithm, which uses MCL to merge and break apart clusters after the BLAST calculations. This is probably the best general purpose algorithm to use. The '-t 0' option instructs what sized clusters to not report. '0' instructs it to report every cluster, period. '-t 4' will prevent any clusters with 4 or fewer members from being included in the results. The '-c' option creates a .tab file that we will be able to use later to create core genome & rarefaction curve plots sensu Tettelin, 2005.

Once that is done, create a 2nd pan-genome. Because all of the BLASTing is already done this run will only take moments.
get_homologues.pl -G -d Hrr_Files

This command uses '-G', which calls the COGtriangle algorithm. This gives us the opportunity to compare two pan-genomes and, if we like both algorithms, take the consensus pan-genome of both.

Once both pan-genomes are finished:
cd Hrr_Files_homologues

When you 'ls' you should see a number of formated BLAST databases, 25 .blast files and two folders with names like 'HrrGa36_f0_0taxa_algOMCL_e0_' and 'HrrGa36_f0_alltaxa_algCOG_e0_'

With both pan-genomes completed, we can create pan-genome matrices and simple parsimony trees for each method.
compare_clusters.pl -o ./Intersection_OMCL/ -m -T -d ./HrrGa36_f0_0taxa_algOMCL_e0_
& then:
compare_clusters.pl -o ./Intersection_COG/ -m -T -d ./HrrGa36_f0_alltaxa_algCOG_e0_

This will create 2 new folders. Inside each will be an essential repeat of the initial clustering and pangenome_matrix_t0.tab, pangenome_matrix_t0.phylip.ph, and pangenome_matrix_t0.phylip files. 'cd' into one of the directories and 'ls' to take a look at the contents.

One last preparation step. Create a consensus of the two pan-genomes:
compare_clusters.pl -o ./Intersection_Combined/ -m -T -d \./HrrGa36_f0_0taxa_algOMCL_e0_,\./HrrGa36_f0_alltaxa_algCOG_e0_

Note that the intersection is 1761 clusters. How does this compare to the size of the two pan-genomes that went into it?
Why are these numbers the same or different?

cd Intersection_Combined
Look over the files here. Open one of the unique_* files.
Move the 'venn.pdf' file to your local device and view it in an image viewing software.
Kind of boring right? Well good news, with all that work done we are now ready to run some analyses.
Pull back to the Hrr_Files_homologues level, if not there already.

Lets start by making some core genome estimates:
plot_pancore_matrix.pl -i core_genome_algOMCL.tab -f core_Tettelin
plot_pancore_matrix.pl -i pan_genome_algOMCL.tab -f core_Tettelin

Transfer the .png files to your machine and view in an image viewer. These are diagrams of your core genome sizes and rarefaction curves in sensu Tettelin 2005.

Then run all of these commands:
parse_pangenome_matrix.pl -m ./Intersection_OMCL/pangenome_matrix_t0.tab -s
parse_pangenome_matrix.pl -m ./Intersection_COG/pangenome_matrix_t0.tab -s
parse_pangenome_matrix.pl -m ./Intersection_Combined/pangenome_matrix_t0.tab -s

Look in both of these directories and transfer the image files to your local machine and view them.
(hint, you may want to rename them to reflect which algorithm was used. The program tends to use generic naming for these types of outputs.)

Now create heat maps using the following commands:
plot_matrix_heatmap.sh -i ./Intersection_OMCL/pangenome_matrix_t0.tab
plot_matrix_heatmap.sh -i ./Intersection_COG/pangenome_matrix_t0.tab
plot_matrix_heatmap.sh -i ./Intersection_Combined/pangenome_matrix_t0.tab

This may take a number of minutes, as the program calls R and seems to run in an inefficient manner. It may also sometimes fail for reasons often opaque to me. If it works it is kind of cool, though, so we try it.

While this runs, transfer one of your other Intersection folders to your local machine. This will take anywhere from seconds to a few minutes depending on the size of the pan-genome. So, maybe choose one of the smaller ones.

Open R and set that downloaded folder as your working directory

Let's make our own rarefaction curve using the following code:

library("vegan")
pan_matrix <- read.table(file = "pangenome_matrix_t0.tab",stringsAsFactors = TRUE, header = TRUE, row.names = 1)
sp <- specaccum(pan_matrix,method="random",permutations=100)
plot(sp,ci.type="poly",col="blue",lwd=2,ci.lty=0,ci.col="lightblue",xlab="Genomes",ylab="Proteins",main="Gene accumulation plot")
boxplot(sp,col="yellow",add=TRUE)

#Write out to a file
jpeg('rarefaction_plot.jpg')
plot(sp,ci.type="poly",col="blue",lwd=2,ci.lty=0,ci.col="lightblue",xlab="Genomes",ylab="Proteins",main="Gene accumulation plot")
boxplot(sp,col="yellow",add=TRUE)
dev.off()

And now, make our own histogram:

n_row <- nrow(pan_matrix) #count number of rows
n_col <- ncol(pan_matrix) #Ditto columns
HG <- vector(mode="numeric", length = 0)

for(col in 1:n_col){
col_count = 0
for(row in 1:n_row){
if(pan_matrix[row,col] > 0){
col_count = col_count + 1
}
}
HG[col] <- col_count
}
t_HG <- t(HG)
hist(HG,col="red",xlab="Present in # Genomes",ylab="# HGs",main="Histogram of HG Frequenies")

jpeg('histogram_plot.jpg')
hist(HG,col="red",xlab="Present in # Genomes",ylab="# HGs",main="Histogram of HG Frequenies")
dev.off()

How do you like our own figures vs. those that came from GET_HOMOLOGUES? Try playing with some of the settings to see if you can make something you like more. On the rarefaction curve I recommend playing with the "type" and "lwd." It may help to search "R plot" to get ideas of what other options exist for these plots.

Lets take a look at that ultra-simple parsimony tree GET_HOMOLOGUES made for us.

library("ape")
library("philentropy")
r_names <- row.names(pan_matrix)
matrix_dist <- distance(pan_matrix, method = "jaccard")
row.names(matrix_dist) <- r_names
tree_bionj <- bionj(matrix_dist)
plot(tree_bionj)

This produces a very simple and quick tree from the number of representatives each taxon has for each HG. It is probably marginally more reliable than the parsimony tree GET_HOMOLOGUES produced, but far from top marks. But its should give a reasonable view of how everything fits together.

Another way to look at that is to plot our genomes on a PCoA:

pcoa <- wcmdscale(matrix_dist)
pcoa_names <- row.names(pcoa)
plot(pcoa,pch=20,cex=0.25)
text(pcoa, pcoa_names,cex=0.5)

You might want to sort your HGs by how many taxa are represented within them.
Move the "SortPangenomeByHGSize_01_27Feb18.pl" script to your downloaded folder. Then in your Terminal invoke it with:
perl SortPangenomeByHGSize_01_27Feb18.pl
(Note: If your local machine does not run perl, do this on the cluster!)

You should see 4 new folders appear in that directory with copies of your HGs in them.
Look at the script and one of the HG files. Can you tell which part of that ugly identification format tells us which genome the ORF hails from?

======================== END Assignment================

Work on your student project !