Your name: Your email address:
This week, we will make a gene and a strand bias plot. (As seen in the lecture.)
The strand bias plot depicts the bias in nucleotide usage along the genome (x-axis). Plotting this in a cumulative fashion integrates over local fluctuation in a window (see slides 29 an 32 in class11_strand_bias_and_recombination)
If you are working on a windows machine, start a program called PuTTY (download from here). In the "Host" field, type xanadu-submit-ext.cam.uchc.edu
In the "Username" field, type your username mcb3421usrXX, where XX is a number assigned to you. (Go to the class notebook, behind the name of your personal section is an _#. this is your number. If your number is 1, do not use a leading 0. Your username would be mcb3421usr1.) Login. It will ask you to accept a new host key. Now enter the password _______ (the password is in the notebook under today's summary and goals section). Hit return. <Note the terminal will not echo the keystrokes you make to enter the password.>
The window with a black background that opens on your screen is the command-line window.
In the window with a black background, you will see a flashing cursor. Stuff you type appears at the cursor position.
If you are working on a Mac, find the terminal application (Termianl.app). By default is located in the Applications folder, in a subfolder utilitites (/System/Applications/Utilities/Terminal.app). The easiest might be to search for Terminal.app in the finder or in spotlight.
Double click the Terminal application icon. In the window that opens, type
Start Filezilla -- download here
Open the site manager (under the file menu, or the left most icon in the bar on top of the application window. Click on the NewSite button at the bottom left of the site manager window. On the right under protocol select SFTP under host enter transfer.cam.uchc.edu under LogOn Type select normal under User enter mcb3421usrXX (Go to the class notebook, behind the name of your personal section is an _#. this is your number. If your number is 1, do not use a leading 0. Your username would be mcb3421usr1.) under Password enter your xanadu password Hit connect. You should see 4 windows. On top the directories structure on your computer (left) and on the cluster (right). On the bottom the contents of the two active directories (i.e. the directories between which the files will be transferred). You can move to a directory/ subdirectory by double clicking. The folder with two dots behind it is the parent directory. If you double click on a file in the bottom windows it is transferred (copied) between the two directories on the two computers. You also can drop and drag files in the window representing the directory on xanadu.
The files that you will need today are here (as zip file) or here as a tar.gz file. You could download the latter using ssh using curl -O https://j.p.gogarten.uconn.edu/mcb3421_2021/labs/lab7archive.tar.gz To unpack the file, execute tar -xzvf lab7archive.tar.gz The advantage of the curl command is that you do not fill up your harddrive. The zip archive needs to be unpacked on your desk- or laptop, and the files transferred via Filezilla.
available ), or TextEdit (Mac). The first annotation line usually says something like this is the chromosome, and even if it doesn't, it usually is. This is the part of the file you want to keep. In rare cases, the first annotation line says plasmid or mini chromosome. To get to the second annotation line, search for >. If the first sequence was the chromosome, delete from before the second > to the end. If the first one was a plasmid, delete form before the second > to the beginning. (repeat until only the chromosome is left)]
perl faaReplaceAccessionWithStart.pl yourGenome1.faa yourFeatureTable1.txt > yourGenome1WithStart.faa
For today's exercise we will need two closely related bacterial (or archaeal) genomes. These could be strains for the same species or from the same genus In the archive for this class are several .faa, where the location of the genome has been added to the beginning of each annotation line. these files end with ..._with_start.faa. Select two of these files and transfer them to xanadu (using filezilla)
Start Terminal or PuTTY. In the "Host" field, type xanadu-submit-ext.cam.uchc.edu or, if you use terminal on a Mac open terminal and type ssh mcb3421usrXX@xanadu-submit-ext.cam.uchc.edu In the "Username" field of PuTTY, type your username: mcb3421usrXX, where XX is a number assigned to you. Login. It may ask you to accept a new host key. Now enter your password: ......
type srun --pty -p mcbstudent --qos=mcbstudent --mem=2G bash and then hit the return or enter key. This takes you to a "compute" node. (Why? Because we are going to run the BLAST+ command, and we want to "farm" the processing out to another computer, rather than hammering the single computer which operates as the "gateway" for everyone.
Use qstat or hostname to verify that you are on a compute node.
change into the directory where fhe genome iles are located cd lab7
To make the blast+ programs available: module load blast/2.11.0 module load perl
To make Gnuplot available module load gnuplot
Copy the following scripts into the lab 7 directory.
Check that the ">accession" lines have been replaced with ">number something" lines.
makeblastdb -in OneOfYour...with_start.faa -dbtype prot -parse_seqids To do the blast search we will use the following command (again, you need to replace the names with the names of the actual filenames!
blastp -query query_protein_with_start.faa -db database_protein_with_start.faa -out blast.txt -outfmt 6 -evalue 1e-20 An E-value cut-off of 10-20 is used, you could select a smaller cut-off (1e-40).
This will take a few minutes. Again, here is a description of the columns.
To get just the top hit for each query sequence, we use another Perl program. Since the hits for each query are ordered by best E-value to worst, the top hit is simply the first hit for each query: You use it as follows:
perl blastTopHit.pl blast.txt > blastTopHit.txt
head blastTopHit.txt
Gnuplot is installed on the cluster, and you can use it to create scatter plots for both of the matches (top and all significant) in the same coordinate system.
There are many ways to download and edit the perl script. One is to save and edit locally and use filezilla to move the file back and forth), or use the editor nano to edit the file: nano plotwgnu_mod2.pl An alternative is download the file to your desktop (right click on the link and select save as), edit the file on you desktop (MSWord, save as txt), and transfer it back to the cluster using Filezilla)
To run the script type perl plotwgnu_mod2.pl
Transfer the resulting plot (the file is called plot.png) to your computer using filezilla, and display the image on the screen. If you want the best scoring blasthits depicted in red, you can edit the script in nano and change this line print PLOT "plot ".'"'."$plot1".'" using 2:1 with points ls 2,'.'"'."$plot2".'" using 2:1 with points ls 1;'."\n";
print PLOT "plot ".'"'."$plot1".'" using 2:1 with points ls 7 ,'.'"'."$plot2".'" using 2:1 with points ls 1;'."\n";
============================================== B) plotting the results using Excel:
Open filezilla and navigate to your lab7 directory, and transfer the blast.txt and blastTopHit.txt files to your Desktop.
Make an Excel scatter plot using all BLAST hits with E-value ≤ 10-20, and another using just the top hits.
===============================================
If you used excel: What if any is the difference between the two plots? What is the difference between the data plotted in different colors using gnuplot?
In case you want to run a second analysis, remember to rename the files (blast.txt .... before you run the second analysis)
Which genomes did you compare? Describe the results you obtained in words AND copy the plots into you notebook. For each plot, give the name of the strain used as databank (If you used the plotwgnu_mod2.pl script, the databank genome ends up on the X-axis), and the name of the strain used as query. Discuss what and how many genome rearrangements might have given rise to this result. Does it appear likley that the starts of the genomes were placed in non-homologous positions?Does this agree with the GC Bias plots? Description of results: Was the genome start placed in homologous locations?
Which genomes did you compare? Describe the results you obtained in words AND copy the plots into you notebook. For each plot, give the name of the strain used as databank (If you used the plotwgnu_mod2.pl script, the databank genome ends up on the X-axis), and the name of the strain used as query. Discuss what and how many genome rearrangements might have given rise to this result. Does it appear likley that the starts of the genomes were placed in non-homologous positions?Does this agree with the GC Bias plots?
Type logout to release the compute node form the queue. Check the queue for abandoned sessions using qstat. If there are abandoned sessions under your account, kill them by deleting them from the queue by typing qdel job-ID, e.g. "qdel 40000" would delete Job # 40000
Check the appropriate radio button below before pressing the submit button:
Send email to your instructor (and yourself) upon submit Send email to yourself only upon submit (as a backup) Show summary upon submit but do not send email to anyone.