Your name: Your email address:
A) obtaining the genome sequences
When searching the Internet, you may find references to both the BLAST (classic) and BLAST+ (new) command-line tools. If you have used the old BLAST tools, you may find this Quick Start Guide for switching from BLAST to BLAST+ command line tools useful.
There are many different ways to get the genome sequences onto the cluster; however, the easiest will be to first download the files to your computer, unzip the files and then transfer them to a directory on the cluster.
Go to the NCBI's current genome list.
Click on the "Prokaryotes" tab. Then click on "Filters" (at the top right). Place a check mark behind archaea, and under Assembly level select "Complete". The number in parenthesis following complete, should give the number of completely sequenced archaeal genomes, and these should be listed in the table below.
Take a note of the number of complete archaeal genomes:
Unclick the filter for complete genomes (leave the checkmark behind archaeal), then click on the "Download" button (top right of page, or below the filter settings). This download "prokaryotes.csv" to your computer. Save to the Desktop.
Load "prokaryotes.csv" into Excel.
Click on the first cell (A1). Then select "Format as Table". The header lines should now appear in table format.
Scroll to the right until you see a "Level" heading. Click on Status, and tick only "Complete Genome". Check status bar (on the bottom of the window) for the number of records found. Does it agree with the previous number?
Sort by Release Date. How many complete archaeal genomes have been published so far this year?
Which genome has the most genes (look at all submission dates)?
Which genome is the largest?
The highest GC content (give name and GC content)?
The highest AT content (give name and GC content)?
Go back to the NCBI's current genome list. Click on the "Prokaryotes" tab, display only complete genomes.
For today's first exercise we will need 2 genome very divergent genome sequences. We will use two of my favorites: Haloferax volcanii DS2 and Aeromonas veronii HM21
Two choices are possible to download the genomes: Choice 1) In the NCBIs complete genome list for an organism of your choice, scroll to the right and click on the R (which is a link to the NCBI FTP server). This green R takes you to an FTP site listing of all the RefSeq files for a genome project. The G gets you to an FTP site that does not have refseq review. We try to stick to the R link for now. Many browsers (on a Mac use waterfox) will comply without any problem and open the ftp page. For each genome we will download three files form the FTP directory. The first contains all the protein sequences encoded in the genome (it ends with _protein.faa.gz; the second one is the feature table, ending in _feature_table.txt.gz, and the third is the nucleotide sequence of the selected genome ending in _genomic.fna.gz). Clicking on them in the browser will download the files to your computer. Open filezilla, connect to Xanadu's transfer node transfer.cam.uchc.edu, log in with your username and password. Filezilla's site manager should remember the site you filled out in lab5. Create a new laboratory for lab6 and transfer the six files into the lab6 subdirecotry.
Alternative Procedure:
Establish an secure shell connection to Xanadu. ssh mcb3421usrXX@xanadu-submit-ext.cam.uchc.edu . See lab5 for details.
To move from the login node to run an interactive session (you should never, ever run things on this node) to a compute node, type srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash
In your home directory make a new subdirectory for lab6 and enter it:
mkdir lab6 cd lab6
In the NCBIs complete genome list select organism you want to analyze, scroll to the right and click on the R (which is a link to the NCBI FTP server). For each genome we will download three files form the FTP directory. The first contains all the protein sequences encoded in the genome (it ends with _protein.faa.gz); the second one is the genebank feature file, ending in _feature_table.txt.gz, and the third is the nucleotide sequence of the selected genome ending in _genomic.fna.gz).
You now should have the ftp site for your genome open in a browser, and a terminal connection to xanadu. In the browser window, right click on one of the files you want to download and copy the file location. Go to the terminal window and type curl -O and paste the link you had copied. (The -O is a capital O -- not a zero)
curl -O paste the link you copied
Repeat this for the three files and then repeat it for the other genome.
To uncompress the files type gunzip *.gz
End Alternative Procedure
If you are in your home directory change to the lab6 subdirecory (you can use the pwd command to check where you are in the directory hierarchy.
cd lab6
The files at the NCBI FTP server are compressed. If your browser did not yet unpack them, you can unzip them on the command line:
gunzip *.gz The star is wild card and *.gz will be replaced with a list of all files ending in .gz)
Once you have the 6 files rename them into something you recognize. Rename the A_veronii_Hm21 files to A_veronii_Hm21.faa, A_veronii_Hm21_feature_table.txt, and A_veronii_Hm21_genomic.fna Rename the Haloferx volcanii DS2 files into Haloferax_volcanii_DS2.faa Haloferax_volcanii_DS2_feature_table.txt Haloferax_volcanii_DS2_genomic.fna The easiest is to do this in filezilla (click on the file to select it - wait - then click again and you can edit the file name).
Alternatively, you can use the command line: mv oldname newname (Do not forget that you can copy/pasta and/or use automatic line completion)
head YourFileName.faa
Today we will use perl (an interpreter that allows you to use perl scripts), and blast (the basic local alignment search tool). To use these programs You need to load the corresponding modules: module load blast/2.11.0 module load perl
If you want to check which modules are available on xanadu, type module avail
curl -O https://j.p.gogarten.uconn.edu/mcb3421_2021/labs/Files_you_need_forLab6.zip Execute the following to turn one of the multiple fasta files into a blast searchable databank:
makeblastdb -in A_veronii_Hm21.faa -dbtype prot -parse_seqids -parse_seqids directs the program to create an index that allows to retrieve sequences from the databank you have created. Do an "ls" to see the extra files you just made.
The Haloferax "genome" will be the "query". blastp -query Haloferax_volcanii_DS2.faa -db A_veronii_Hm21.faa -out blast_all.txt -outfmt 6 -evalue 10
-outfmt 6 specifies a tabular output format. (We will use an evalue of 10 to also obtain some insignificant hits.)
The blast program will now take every sequence in query file and do a blast search against the database (i.e., this performs a couple of thousand blast searches). This will take a few minutes. While waiting check in what other students are doing :) Or check this listing of options to use with the blast command.
perl blastTopHit.pl blast_all.txt > blast_top.txt
Here is a description of the columns in the outfmt -6 option. A text file containing the header is here. Use the cat command to add these to the blast output file
cat header.txt blast_top.txt > blast.top_h.txt
Open an SFTP window in Filezilla [start the filezilla application. You need to connect to the transfer node: transfer.cam.uchc.edu. Your username and password are the same as in lab5. Navigate to your lab6 directory, and transfer both the blast_all.txt and the blast.top_h.txt file to your Desktop. Load it into Excel. Click in the top left field of the table, then convert the spreadsheet into a table with headers.
What is the accession ID pair of the most conserved protein between the two genomes (sort the table on bitscores, then on e-values)?
To learn what the most similar sequence(s) encode, we can copy the Accession numbers for the most similar matches (use the database IDs in the second column), paste them into a text editor, save the the file as IDfile.txt and transfer it into the class6 directory:
This should return the fasta formated sequences of the matches.
What is given as function in the fasta annotation lines?
In Excel, select the third column from the left (this is the percent identity for each BLASTp match), and Insert -- Statistic Chart (all-blue column chart icon in the Charts section) -- Histogram (Aside: I had problems in editing the histograms this creates on a Mac. It worked much better, when I selected tools, analysis tool pack (you might need to install this first), Histogram. I added a column somewhere that gave the bins) from 0 to 100% in 5% increments)
(More about histograms in Excel, including the formula used for default binning.)
You get a histogram of all percent identities. Repeat the histogram for E-values ≤ 10-3, and E-values >.1
Do you observe a smooth distribution of % identity values (a distribution with two peaks would be noteworthy, as it might be an indication of a chimera formation)? What can explain the difference (or lack thereof) in the histograms for significant and insignificant hits?
Can you explain, why % identity is not a good criterion to distinguish significant from insignificant hits?
For the following questions, the BLAST Command Line Applications User Manual may be helpful → link
How can you get information on the possible parameters in blastp using the command line? (try blastp -h first) How can you set the wordsize to 2 ? How can you filter the query sequence for regions with low complexity?
Type exit at the xanadu commandline. This will return to the head node. Now type qstat { you could only list your jobs using qstat -u username} You should have no jobs running. If there is a job listed, then type qdel followed by a space, and then the job-ID number, followed by return or enter key. Then type qstat again, to confirm that there are no running jobs. Then type logout to exit the main computer.
Check the appropriate radio button below before pressing the submit button:
Send email to your instructor (and yourself) upon submit Send email to yourself only upon submit (as a backup) Show summary upon submit but do not send email to anyone.