Your name: Your email address:
A) obtaining the genome sequences
When searching the Internet, you may find references to both the BLAST (classic) and BLAST+ (new) command-line tools. If you have used the old BLAST tools, you may find this Quick Start Guide for switching from BLAST to BLAST+ command line tools useful.
We will need two sets of 2 or four sequences each. Set A should consist distantly related organisms (e.g. one Archaeon, one Bacterium, or a DPANN archaeon and a Euryarchaeote. If in doubt of what to choose, go for one of the Nanohaloarchaea archaeon SG9 and Haloferax volcanii DS2 genomes).; the second set (B)should consist of closely related bacteria or archaea (either strains from the same species, or species from the same genus).
There are many different ways to get the genome sequences onto the cluster; however, the easiest will be to first download the files to your computer, unzip the files and then transfer them to a directory on the cluster.
Go to the NCBI's current genome list.
Click on the "Prokaryotes" tab. Then click on "Filters" (at the top right). Under Assembly level select "Complete". This should result in a listing of completely sequenced genomes.
Take a note of the number of complete genomes:
Unclick the filter, then click on the "Download" button (top right of page). This downloasd "prokaryotes.csv" to your computer. Save to the Desktop.
Load "prokaryotes.csv" into Excel.
Click on the header of column "A". The status bar should show the number of rows.
Click on the first cell (A1). Then select "Format as Table". The header lines should now appear in table format.
Scroll to the right until you see a "Status" heading. Click on Status, and tick only "Complete Genome". Check status bar for the number of records found. Does it agree with the previous number?
Sort by Release Date. How many complete genomes have been published so far this year?
Which genome has the most genes (look at all submission dates)?
Which genome is the largest?
The highest GC content (give name and GC content)?
The lowest GC content (give name and GC content)?
Go back to the NCBI's current genome list. Click on the "Prokaryotes" tab, display only complete genomes.
We will needa sets of 2 genome sequences. These should be distantly related organisms (e.g. one Archaeon, one Bacterium, or a DPANN archaeon and a Euryarchaeote. If in doubt of what to choose, go for one of the Nanohaloarchaea archaeon SG9 and Haloferax volcanii DS2 genomes).
MODIFIED PROCEDURE:
For today we want to use two genomes from organisms from differnent bacterial or archaeal phyla. Identify the genomes you want to use on the NCBI's current genome list, click on prokaryotes, and then select the filter for complete genomes. In the right column is the link (R or G) to the FTP site for the genomes.
You want to download the protein.faa file for each genome (do not do this yet). This "faa" file contains all of the proteins coded by a genome. Right click on the faa file and copy the file location.
Log into the cluster at xanadu-submit-ext.cam.uchc.edu (same username and password as you used last class).
mkdir lab6
cd lab6
curl -O paste the link you copied
Repeat this for the second genome.
e.g. curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/761/425/GCA_001761425.1_ASM176142v1/GCA_001761425.1_ASM176142v1_protein.faa.gz and curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/025/685/GCF_000025685.1_ASM2568v1/GCF_000025685.1_ASM2568v1_protein.faa.gz
ls
Once you have the two files in the lab rename them into something you remember:
mv oldname newname
mv ridiculouslyLongFilename.faa.gz somethingShorter.faa.gz
A good convention would be the first letter of the genus, followed by the species name, followed by the strain identifier, e.g., H_volcanii_DS.faa.gz
gunzip YourFileName.faa.gz
The file should now end with just "faa". The "gz" ending was a compressed file format. Take a look at the first few lines with
head YourFileName.faa
And repeat the procedure for the second genome.
END modified procedure, continue at #####
There are many different ways to get the genome sequences onto the cluster; the easiest will be to first download the files to your computer, unzip the files and then transfer them to a directory on the cluster.
Use the search feature (bar on top of the page) to find the complete genomes you want to work on.
Then look for the R or the G link in the far right-hand column. This green R takes you to an FTP site listing of all the RefSeq files for a genome project. The G gets you to an FTP site that does not have refseq review.
You want to download the protein.faa file for each genome. This "faa" file contains all of the proteins coded by a genome. For the genomes from closely related organisms we will also need the the file ending with _feature_table.txt
Right click on the file you want to download, save it to your computer, unzip the file and give it a name that you recognize and that also contains the strain ID. E.g., A_veronii.faa would not be sufficient, because dozens of A. veronii genomes have been sequence. A_veronii_TH0426.faa would be fine.
Then start the filezilla sftp application. To upload files, you need to connect to a special transfer node: transfer.cam.uchc.edu. Your username and password are the same as last class. Select port 22. Create a directories (lab6) and transfer the files into the directory.
Then log into the cluster using PuTTY via xanadu-submit-ext.cam.uchc.edu (same username and password as you used last class).(same username and password).
#######
srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bashThis takes you to a "compute" node. (Why? Because we are going to run the BLAST+ command, and we want to "farm" the processing out to another computer, rather than hammering the single computer which operates as the "gateway" for everyone. See cluster etiquette.)
If you are not sure that you are on a compute node, execute the command hostname
To use blast you need to load the blast module: module load blast
Execute the following to turn one of the multiple fasta files from set one into a searchable databank: makeblastdb -in database_protein.faa -dbtype prot -parse_seqids Do an "ls" to see the extra files you just made. -parse_seqids directs the program to create an index that allows to retrieve sequences from the databank you have created .
The other genome will be the "query". blastp -query query_protein.faa -db database_protein.faa -out blast.txt -outfmt 6 -outfmt 6 specifies a tabular output format.
Note, consider reducing the size of the output file by setting the e value cut-off to 1. You would add -evalue 1 to the above command.
The blast program will now take every sequence in query file and do a blast search against the database (i.e., this performs a couple of thousand blast searches). This will take a few minutes. While waiting help your neighbor :)
Here is a description of the columns in the outfit -6. To add these to the blast output file,
curl -O http://carrot.mcb.uconn.edu/mcb3421_2017/header.txt cat header.txt blast.txt > blast.out.txt
Open an SFTP window in Filezilla [start the filezilla application. You need to connect to a special transfer node: transfer.cam.uchc.edu. Your username and password are the same as last class. Select port 22] navigate to your lab6 directory, and transfer over the blast.out.txt file to your Desktop. Load it into Excel. You may find it useful to convert to table format with headings.
What is the accession ID pair of the most conserved protein between your two genomes?
Copy the Accession Number of the best match in the database, go back to the terminal window and type
blastdbcmd -entry IDnumber -db DatabaseName for example blastdbcmd -entry WP_004041152.1 -db Haloferax_volcanii_DS2_protein.faa
This should return the fasta formated sequence of the match.
What are the IDs of the best match, and what is given as function in the fasta annotation line?
In Excel, select the third column from the left (this is the percent identity for each BLASTp match), and Insert -- Statistic Chart (all-blue column chart icon in the Charts section) -- Histogram (Aside: I had problems in editing the histograms this creates on a Mac. It worked better for me when I selected tools, analysis tool pak Histogram. I added a column somewhere that gave the bins ) to 100 in 5% increments)
(More about histograms in Excel, including the formula used for default binning.)
You get a histogram of all percent identities. Repeat the histogram for E-values ≤ 1, and E-values ≤ 10-3.
Add a column to the spreadsheet that calculates the number of identical residues (column 3 * column 4 /100). Make separate histograms for the number of identical residues for all E-values, better than 1 and better than 1E-3.
Do you observe a smooth distribution of % identity values? Do any genes constitute a second peak of more similar sequences, compared to the bulk of the comparisons? What can explain the difference in the histograms for significant and insignificant hits?
Can you explain, why % identity is not a good criterion to distinguish significant from insignificant hits? (Hint: you could insert another column that contains the number of matching aa, by inserting a column that contains a formula multiplying percentID * alignment length) (If we are running out of time see the demo file)
For the following questions, the BLAST Command Line Applications User Manual may be helpful → link
How can you get information on the possible parameters in blastp using the commandline? (try blastp -h first) How can you set the wordsize to 2 ? How can you filter the query sequence for regions with low complexity?
Type logout You will return to the master computer. Now type qstat { you could only list your jobs using qstat -u username} You should have no jobs running. If there is a job listed, then type qdel followed by a space, and then the job-ID number, followed by return or enter key. Then type qstat again, to confirm that there are no running jobs. Then type logout to exit the main computer.
Check the appropriate radio button below before pressing the submit button:
Send email to your instructor (and yourself) upon submit Send email to yourself only upon submit (as a backup) Show summary upon submit but do not send email to anyone.