Problem: Your BLASTp output contains accession numbers, but no genome coordinates. The genome coordinates are in the feature_table files. We want to compare the genome coordinates of the matches. One way is with the "join" command:
e.g., Take two genomes, g1 and g2.
g1 accession | g2 accession |
---|---|
g1p1 | g2p7 |
g1p2 | g2p8 |
g1p3 | g2p9 |
g1 accession | coordinate |
---|---|
g1p1 | 1 |
g1p2 | 3 |
g1p3 | 5 |
g2 accession | coordinate |
---|---|
g2p7 | 2 |
g2p8 | 4 |
g2p9 | 6 |
join blast_top_hit.txt g1_feature.txt > step1.txt
It will join on the first column.
g1 accession | g2 accession | g1 coordinate |
---|---|---|
g1p1 | g2p7 | 1 |
g1p2 | g2p8 | 3 |
g1p3 | g2p9 | 5 |
join -1 2 step1.txt g2_feature.txt
The "-1 2" tells join that the first file (-1) will be joined on the second (2) column.
g2 accession | g1 accession | g1 coordinate | g2 coordinate |
---|---|---|---|
g2p1 | g1p7 | 1 | 2 |
g2p2 | g1p8 | 3 | 4 |
g2p3 | g1p9 | 5 | 6 |
A bit tedious, but it gets the job done. The files to join must be sorted by the columns they're joined on.
grep '^CDS' query_feature_table.txt | grep $'\tchromosome\t' | cut -f8,11 > query_start_accession.txt
head query_start_accession.txt
grep '^CDS' database_feature_table.txt | grep $'\tchromosome\t' | cut -f8,11 > database_start_accession.txt
head database_start_accession.txt
cut -f1,2 blast_top_hit.txt > accession_top_hit.txt
join -1 1 -2 2 <(sort accession_top_hit.txt) <(sort -k 2 query_start_accession.txt) > accession_top_hit_query_start.txt
join -1 2 -2 2 <(sort -k 2 accession_top_hit_query_start.txt) <(sort -k 2 database_start_accession.txt) > accession_top_hit_query_start_database_start.txt
When finished, open a new SFTP window in BitVise, navigate to your lab7 directory, and transfer over the accession_top_hit_query_start_database_start.txt file to your Desktop. Load it into Excel.
Make an Excel scatter plot of the joined file (accession_top_hit_query_start_database_start.txt)
{ This is an alternative to download the genomes! Filezilla is easier!
Start PuTTY. In the "Host" field, type mcbsubmit.cam.uchc.edu
In the "Username" field, type your username: mcb3421usrXX, where XX is a number assigned to you. Login. It may ask you to accept a new host key. Now enter your password: xxx.In the terminal window type srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash
and then hit the return or enter key. This takes you to a "compute" node.
mkdir lab7
cd lab7In your browser go to the NCBI's current genome list. Click on the "Prokaryotes" tab, display only complete genomes.
For today's exercise, we will need the genomes from two or more closely related bacteria or archaea (either strains from the same species, or species from the same genus (e.g., different Aeromonas species, different Thermotoga species, different Borreliella species (they have linear chromosomes)).Open a text or word editor on you PC, and copy paste the names of the genomes you want to analyze. The reference genome (the one you will turn into a databank should be first). z
On the NCBI's current genome list move to the right an click on the R (or if there is only a G click there) to go to the FTP server, from which you can download the genome data. Copy the names of the .faa.gz file and the feature-table.txt.gz file onto your text file below the species and strain name.
Right click on the .faa.gz file and the feature-table.txt.gz and copy the link address, and paste it under the names of the feature-table and faa file. The result should look something like this:Haloferax volcanii DS2
GCF_000025685.1_ASM2568v1_feature_table.txt.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/025/685/GCF_000025685.1_ASM2568v1/GCF_000025685.1_ASM2568v1_feature_table.txt.gz
File:GCF_000025685.1_ASM2568v1_protein.faa.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/025/685/GCF_000025685.1_ASM2568v1/GCF_000025685.1_ASM2568v1_protein.faa.gzHaloferax mediterranei ATCC 33500
File:GCF_000306765.2_ASM30676v2_feature_table.txt.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/306/765/GCF_000306765.2_ASM30676v2/GCF_000306765.2_ASM30676v2_feature_table.txt.gz
GCF_000306765.2_ASM30676v2_protein.faa.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/306/765/GCF_000306765.2_ASM30676v2/GCF_000306765.2_ASM30676v2_protein.faa.gzHaloferax gibbonsii
GCF_001190965.1_ASM119096v1_feature_table.txt.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/190/965/GCF_001190965.1_ASM119096v1/GCF_001190965.1_ASM119096v1_feature_table.txt.gz
GCF_001190965.1_ASM119096v1_protein.faa.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/190/965/GCF_001190965.1_ASM119096v1/GCF_001190965.1_ASM119096v1_protein.faa.gz
Make sure you are in the lab7 subdirectory.Then down load the faa and feature_table.txt files using the curl command:
curl -O paste_the_ftp_link_you_copied
e.g.,
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/025/685/GCF_000025685.1_ASM2568v1/GCF_000025685.1_ASM2568v1_feature_table.txt.gz
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/025/685/GCF_000025685.1_ASM2568v1/GCF_000025685.1_ASM2568v1_protein.faa.gz
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/306/765/GCF_000306765.2_ASM30676v2/GCF_000306765.2_ASM30676v2_feature_table.txt.gz
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/306/765/GCF_000306765.2_ASM30676v2/GCF_000306765.2_ASM30676v2_protein.faa.gz
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/190/965/GCF_001190965.1_ASM119096v1/GCF_001190965.1_ASM119096v1_feature_table.txt.gz
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/190/965/GCF_001190965.1_ASM119096v1/GCF_001190965.1_ASM119096v1_protein.faa.gzUncompress all the files using
gunzip *.gz (gunzip calls the program to unzip files, and * is a wild card the will be expanded into a list of filenames that includes all files ending on .gz)Rename the files using the species and strain names using the mv command. The names you choose should not contain spaces. e.g., Haloferax volcanii DS2 files should be named Haloferax_volcanii_DS2.
mv ridiculouslyLongFilename.faa somethingShorter.faa
mv ridiculouslyLongFilename_feature_table.txt somethingShorter_feature_table.txte.g.:
mv GCF_000025685.1_ASM2568v1/GCF_000025685.1_ASM2568v1_feature_table.txt Haloferax_volcanii_DS2_feature_table.txt
mv GCF_000025685.1_ASM2568v1/GCF_000025685.1_ASM2568v1_protein.faa Haloferax_volcanii_DS2.faa
mv GCF_000306765.2_ASM30676v2/GCF_000306765.2_ASM30676v2_feature_table.txt Haloferax_mediterranei_ATCC_33500_feature_table.txt
mv GCF_000306765.2_ASM30676v2/GCF_000306765.2_ASM30676v2_protein.faa Haloferax_mediterranei_ATCC_33500.faa
mv GCF_001190965.1_ASM119096v1_feature_table.txt Haloferax_gibbonsii_feature_table.txt
mv GCF_001190965.1_ASM119096v1_protein.faa Haloferax_gibbonsii.faa
}