Your name: Your email address:
A) obtaining the genome sequences
When searching the Internet, you may find references to both the BLAST (classic) and BLAST+ (new) command-line tools. If you have used the old BLAST tools, you may find this Quick Start Guide for switching from BLAST to BLAST+ command line tools useful.
There are many different ways to get the genome sequences onto the cluster; however, the easiest will be to first download the files to your computer, unzip the files and then transfer them to a directory on the cluster.
Go to the NCBI's current genome list.
Click on the "Prokaryotes" tab. Then click on "Filters" (at the top right). Under Assembly level select "Complete". This should result in a listing of completely sequenced genomes.
Take a note of the number of complete genomes:
Unclick the filter, then click on the "Download" button (top right of page). This download "prokaryotes.csv" to your computer. Save to the Desktop.
Load "prokaryotes.csv" into Excel.
Click on the header of column "A". The status bar should show the number of rows.
Click on the first cell (A1). Then select "Format as Table". The header lines should now appear in table format.
Scroll to the right until you see a "Status" heading. Click on Status, and tick only "Complete Genome". Check status bar for the number of records found. Does it agree with the previous number?
Sort by Release Date. How many complete genomes have been published so far this year?
Which genome has the most genes (look at all submission dates)?
Which genome is the largest?
The highest GC content (give name and GC content)?
The lowest GC content (give name and GC content)?
Go back to the NCBI's current genome list. Click on the "Prokaryotes" tab, display only complete genomes.
For today's exercise we will need 2 genome sequences. These should be distantly related organisms (e.g. one Archaeon, one Bacterium will work). The genomes need to be completely assembled. Which organisms do you choose?
Two choices are possible to download the genomes: 1) In the NCBIs complete genome list for an organism of your choice, scroll to the right and click on the R (which is a link to the NCBI FTP server). This green R takes you to an FTP site listing of all the RefSeq files for a genome project. The G gets you to an FTP site that does not have refseq review. We try to stick to the R link for now. Most browsers will comply without any problem and open the ftp page. For each genome we will download three files form the FTP directory. The first contains all the protein sequences encoded in the genome (it ends with _protein.faa.gz; the second one is the genebank feature file, ending in _genomic.gbff.gz, and the third is the nucleotide sequence of the selected genome ending in _genomic.fna.gz). Clicking on them in the browser will download the files to your computer. Open filezilla, connect to Xanadu's transfer node transfer.cam.uchc.edu, log in with your username and password. Filezilla's site manager should remember the site you filled out in lab5. Create a new laboratory for lab6 and transfer the six files into the lab6 subdirecotry.
Alternative Procedure:
Establish an secure shell connection to Xanadu. ssh mcb3421usrXX@xanadu-submit-ext.cam.uchc.edu . See lab5 for details.
To move from the login node to run an interactive session (you should never, ever run things on this node) to a compute node, type srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash
In your home directory make a new subdirectory for lab6 and enter it:
mkdir lab6 cd lab6
In the NCBIs complete genome list for an organism of your choice, scroll to the right and click on the R (which is a link to the NCBI FTP server). For each genome we will download three files form the FTP directory. The first contains all the protein sequences encoded in the genome (it ends with _protein.faa.gz; the second one is the genebank feature file, ending in _genomic.gbff.gz, and the third is the nucleotide sequence of the selected genome ending in _genomic.fna.gz).
You now should have the ftp site for your genome open in a browser, and a terminal connection to xanadu. In the browser window, right click on one of the files you want to download and copy the file location. Go to the terminal window and type curl -O and paste the link you had copied. (The -O is a capital O -- not a zero)
curl -O paste the link you copied
Repeat this for the three files and then repeat it for the other genome.
End Alternative Procedure:
If you are in your home directory change to the lab6 subdirecory (you can use the pwd command to check where you are in the directory hierarchy.
cd lab6
The files at the NCBI FTP server are compressed. If your browser did not yet unpack them, you can unzip them on the command line:
gunzip *.gz The star is wild card and *.gz will be replaced with a list of all files ending in .gz)
Once you have the 6 files rename them into something you recognize. The easiest is to do this in filezilla (click on the file to select it - wait - then click again and you can edit the file name).
Alternatively, you can use the command line: mv oldname newname (Do not forget to use automatic line completion)
head YourFileName.faa
Today we will use perl, and blast. You need to load the blast corresponding modules: module load blast module load perl
We also will use a couple of scripts that you can download via the copy url command (the O is the letter, not the number zero):
curl -O https://j.p.gogarten.uconn.edu/mcb3421_2020/labs/faaReplaceAccessionWithStart.pl curl -O https://j.p.gogarten.uconn.edu/mcb3421_2020/labs/blastTopHit.pl curl -O https://raw.githubusercontent.com/Gogarten-Lab/GC-Strand-Bias-Perl-Script/master/GCskew.pl curl -O https://j.p.gogarten.uconn.edu/mcb3421_2020/labs/header.txt
Execute the following to turn one of the multiple fasta files into a blast searchable databank:
makeblastdb -in One_of_your_filenames.faa -dbtype prot -parse_seqids -parse_seqids directs the program to create an index that allows to retrieve sequences from the databank you have created. Do an "ls" to see the extra files you just made.
The other genome will be the "query". blastp -query query_protein.faa -db One_of_your_filenames.faa -out blast_all.txt -outfmt 6
-outfmt 6 specifies a tabular output format. (We will use the default evalue of 10 to also obtain some insignificant hits.)
The blast program will now take every sequence in query file and do a blast search against the database (i.e., this performs a couple of thousand blast searches). This will take a few minutes. While waiting check in what other students are going :)
perl blastTopHit.pl blast_all.txt > blast_top.txt
Here is a description of the columns in the outfit -6. Use the cat command to add these to the blast output file
cat header.txt blast_top.txt > blast.top_h.txt
Open an SFTP window in Filezilla [start the filezilla application. You need to connect to a special transfer node: transfer.cam.uchc.edu. Your username and password are the same as in lab5. Navigate to your lab6 directory, and transfer the blast.top_h.txt file to your Desktop. Load it into Excel. Click in the top left field of the table, then convert the spreadsheet into a table with headers.
What is the accession ID pair of the most conserved protein between your two genomes (sort the table on bitscores, then on e-values)?
To learn what the most similar sequence(s) encode, we can copy the Accession numbers for the most similar matches (use the database IDs in the second column), paste them into a text editor, replace the new line symbols "\n" with "," and copy the comma separated list into the following command:
This should return the fasta formated sequences of the matches.
What is given as function in the fasta annotation lines?
In Excel, select the third column from the left (this is the percent identity for each BLASTp match), and Insert -- Statistic Chart (all-blue column chart icon in the Charts section) -- Histogram (Aside: I had problems in editing the histograms this creates on a Mac. It worked much better, when I selected tools, analysis tool pack (you might need to install this first), Histogram. I added a column somewhere that gave the bins) from 0 to 100% in 5% increments)
(More about histograms in Excel, including the formula used for default binning.)
You get a histogram of all percent identities. Repeat the histogram for E-values ≤ 10-3, and E-values >.1
Do you observe a smooth distribution of % identity values (a distribution with two peaks would be noteworthy)? What can explain the difference (or lack thereof) in the histograms for significant and insignificant hits?
Can you explain, why % identity is not a good criterion to distinguish significant from insignificant hits?
For the following questions, the BLAST Command Line Applications User Manual may be helpful → link
How can you get information on the possible parameters in blastp using the command line? (try blastp -h first) How can you set the wordsize to 2 ? How can you filter the query sequence for regions with low complexity?
You will return to the master computer. Now type qstat { you could only list your jobs using qstat -u username} You should have no jobs running. If there is a job listed, then type qdel followed by a space, and then the job-ID number, followed by return or enter key. Then type qstat again, to confirm that there are no running jobs. Then type logout to exit the main computer.
Check the appropriate radio button below before pressing the submit button:
Send email to your instructor (and yourself) upon submit Send email to yourself only upon submit (as a backup) Show summary upon submit but do not send email to anyone.