Problem: Your BLASTp output contains accession numbers, but no genome coordinates. The genome coordinates are in the feature_table files. We want to compare the genome coordinates of the matches. One way is with the "join" command:
e.g., Take two genomes, g1 and g2.
g1 accession | g2 accession |
---|---|
g1p1 | g2p7 |
g1p2 | g2p8 |
g1p3 | g2p9 |
g1 accession | coordinate |
---|---|
g1p1 | 1 |
g1p2 | 3 |
g1p3 | 5 |
g2 accession | coordinate |
---|---|
g2p7 | 2 |
g2p8 | 4 |
g2p9 | 6 |
join blast_top_hit.txt g1_feature.txt > step1.txt
It will join on the first column.
g1 accession | g2 accession | g1 coordinate |
---|---|---|
g1p1 | g2p7 | 1 |
g1p2 | g2p8 | 3 |
g1p3 | g2p9 | 5 |
join -1 2 step1.txt g2_feature.txt
The "-1 2" tells join that the first file (-1) will be joined on the second (2) column.
g2 accession | g1 accession | g1 coordinate | g2 coordinate |
---|---|---|---|
g2p1 | g1p7 | 1 | 2 |
g2p2 | g1p8 | 3 | 4 |
g2p3 | g1p9 | 5 | 6 |
A bit tedious, but it gets the job done. The files to join must be sorted by the columns they're joined on.
grep '^CDS' query_feature_table.txt | grep $'\tchromosome\t' | cut -f8,11 > query_start_accession.txt
head query_start_accession.txt
grep '^CDS' database_feature_table.txt | grep $'\tchromosome\t' | cut -f8,11 > database_start_accession.txt
head database_start_accession.txt
cut -f1,2 blast_top_hit.txt > accession_top_hit.txt
join -1 1 -2 2 <(sort accession_top_hit.txt) <(sort -k 2 query_start_accession.txt) > accession_top_hit_query_start.txt
join -1 2 -2 2 <(sort -k 2 accession_top_hit_query_start.txt) <(sort -k 2 database_start_accession.txt) > accession_top_hit_query_start_database_start.txt
When finished, open a new SFTP window in BitVise, navigate to your lab7 directory, and transfer over the accession_top_hit_query_start_database_start.txt file to your Desktop. Load it into Excel.
Make an Excel scatter plot of the joined file (accession_top_hit_query_start_database_start.txt)