This likely is the most elegant way, and it teaches you about data tables.
One problem is that you need to carefully check your organisms genome for the presence of multiple chromosomes. If you do not, then the multiple chromosomes end up in the same plot.
The problem the join comman d solves is that your BLASTp output contains accession numbers, but no genome coordinates. The genome coordinates are in the feature_table files. We want to compare the genome coordinates of the matches. One way is with the "join" command:
e.g., Take two genomes, g1 and g2.
g1 accession | g2 accession |
---|---|
g1p1 | g2p7 |
g1p2 | g2p8 |
g1p3 | g2p9 |
g1 accession | coordinate |
---|---|
g1p1 | 1 |
g1p2 | 3 |
g1p3 | 5 |
g2 accession | coordinate |
---|---|
g2p7 | 2 |
g2p8 | 4 |
g2p9 | 6 |
join blast_top_hit.txt g1_feature.txt > step1.txt
It will join on the first column.
g1 accession | g2 accession | g1 coordinate |
---|---|---|
g1p1 | g2p7 | 1 |
g1p2 | g2p8 | 3 |
g1p3 | g2p9 | 5 |
join -1 2 step1.txt g2_feature.txt
The "-1 2" tells join that the first file (-1) will be joined on the second (2) column.
g2 accession | g1 accession | g1 coordinate | g2 coordinate |
---|---|---|---|
g2p1 | g1p7 | 1 | 2 |
g2p2 | g1p8 | 3 | 4 |
g2p3 | g1p9 | 5 | 6 |
A bit tedious, but it gets the job done. The files to join must be sorted by the columns they're joined on.
grep '^CDS' query_feature_table.txt | grep $'\tchromosome\t' | cut -f8,11 > query_start_accession.txt
head query_start_accession.txt
grep '^CDS' database_feature_table.txt | grep $'\tchromosome\t' | cut -f8,11 > database_start_accession.txt
head database_start_accession.txt
cut -f1,2 blast_top_hit.txt > accession_top_hit.txt
join -1 1 -2 2 <(sort accession_top_hit.txt) <(sort -k 2 query_start_accession.txt) > accession_top_hit_query_start.txt
join -1 2 -2 2 <(sort -k 2 accession_top_hit_query_start.txt) <(sort -k 2 database_start_accession.txt) > accession_top_hit_query_start_database_start.txt
When finished, open a new SFTP window in BitVise, navigate to your lab7 directory, and transfer over the accession_top_hit_query_start_database_start.txt file to your Desktop. Load it into Excel.
Make an Excel scatter plot of the joined file (accession_top_hit_query_start_database_start.txt)
Again, you need the two feature tables and the two genomes, but you use a perlscript to replace the annotation line with the location. The concern about multiple chromosomes is the same as above.
download multiple fasta files from NCBI see pptx slides here (slide 34 ff)
One minor complication is that the makeblastdb program installed on the cluster has a bug (removed in newer versions of the program) that does not allow the multiple fasta file to begin with a new line symbol.
Solution A: use formatdb from the old blastall package
Solution B: open the output from the program that adds the location in the genome as identifier in a text editor and delete the first empty line.