Add on: Different ways to perform a gene plot

A) as in Appendix to class 7: Using data tables and the join command

This likely is the most elegant way, and it teaches you about data tables.

One problem is that you need to carefully check your organisms genome for the presence of multiple chromosomes. If you do not, then the multiple chromosomes end up in the same plot.

The problem the join comman d solves is that your BLASTp output contains accession numbers, but no genome coordinates. The genome coordinates are in the feature_table files. We want to compare the genome coordinates of the matches. One way is with the "join" command:

e.g., Take two genomes, g1 and g2.

BLASTp results, top hits with "-max_target_seqs 1"
g1 accession	g2 accession
g1p1	g2p7
g1p2	g2p8
g1p3	g2p9

g1 feature table
g1 accession	coordinate
g1p1	1
g1p2	3
g1p3	5

g2 feature table
g2 accession	coordinate
g2p7	2
g2p8	4
g2p9	6

join blast_top_hit.txt g1_feature.txt > step1.txt

It will join on the first column.

step1.txt
g1 accession	g2 accession	g1 coordinate
g1p1	g2p7	1
g1p2	g2p8	3
g1p3	g2p9	5

join -1 2 step1.txt g2_feature.txt

The "-1 2" tells join that the first file (-1) will be joined on the second (2) column.

final output
g2 accession	g1 accession	g1 coordinate	g2 coordinate
g2p1	g1p7	1	2
g2p2	g1p8	3	4
g2p3	g1p9	5	6

A bit tedious, but it gets the job done. The files to join must be sorted by the columns they're joined on.

grep '^CDS' query_feature_table.txt | grep $'\tchromosome\t' | cut -f8,11 > query_start_accession.txt
head query_start_accession.txt
grep '^CDS' database_feature_table.txt | grep $'\tchromosome\t' | cut -f8,11 > database_start_accession.txt
head database_start_accession.txt
cut -f1,2 blast_top_hit.txt > accession_top_hit.txt
join -1 1 -2 2 <(sort accession_top_hit.txt) <(sort -k 2 query_start_accession.txt) > accession_top_hit_query_start.txt
join -1 2 -2 2 <(sort -k 2 accession_top_hit_query_start.txt) <(sort -k 2 database_start_accession.txt) > accession_top_hit_query_start_database_start.txt

When finished, open a new SFTP window in BitVise, navigate to your lab7 directory, and transfer over the accession_top_hit_query_start_database_start.txt file to your Desktop. Load it into Excel.

Make an Excel scatter plot of the joined file (accession_top_hit_query_start_database_start.txt)

B) As described in class 7

Again, you need the two feature tables and the two genomes, but you use a perlscript to replace the annotation line with the location. The concern about multiple chromosomes is the same as above.

C) Use one chromosome at a time -

download multiple fasta files from NCBI see pptx slides here (slide 34 ff)

One minor complication is that the makeblastdb program installed on the cluster has a bug (removed in newer versions of the program) that does not allow the multiple fasta file to begin with a new line symbol.
Solution A: use formatdb from the old blastall package
Solution B: open the output from the program that adds the location in the genome as identifier in a text editor and delete the first empty line.