Midterm 2018

Practical Portion – you are allowed to use your notes, the computer, Google, etc.

======================================================================

NOTE 1: If you log into bbcsrv3 from outside the University, you must first establish a VPN connection. Juno Pulse Secure works great and is available here.

NOTE 2: Never ever run your BLAST search on the master node. Use qlogin to log into a compute node. Use qlogin -q course.q to specify a particular queue (in this case "course.q"). Use qacct -q to get info on the available queues.

NOTE 3: Submit the answers/results before 4 pm on Wednesday, March 21st, by email to gogarten@uconn.edu, with a cc to Artemis.Louyakis@uconn.edu and to yourself.

======================================================================

First exercise: The most conserved protein homologs.

Given two multi-sequence FASTA files, each of which includes all the proteins encoded in an organism’s genome
(human ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_protein.faa.gz  and
the archaeon Sulfolobus acidocaldarius ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/338/775/GCF_000338775.1_ASM33877v1/GCF_000338775.1_ASM33877v1_protein.faa.gz ), find the homologs that are most conserved between these two genomes.

What are the most conserved genes? Which measure of similarity did you use?
(If you find several with equal scores, you can copy and paste the IDs into the search field at https://www.ncbi.nlm.nih.gov/protein/ to look them up.)

Hint:
To obtain an easily manageable list of hits, consider setting the maximum number of target sequences per query to 1, and
using a small E-value cut-off (you are interested only in the most conserved sequences).
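
One possible way to follow the hint is sketched below. It is not the required solution. It assumes (none of this is prescribed by the exercise) that you ran blastp with -outfmt 6 together with -max_target_seqs 1 and a small -evalue, so that the report is a tab-delimited table in the default column order, and that you rank the hits by bit score. Percent identity would be another defensible measure. The function name best_hits and the three example rows are made up for illustration; they are not real results.

```python
import csv
import io

# Default column order of a BLAST tabular report (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, top_n=5, key="bitscore"):
    """Return the top_n rows of a tabular BLAST report, ranked by `key`.

    `key` is assumed to be a numeric column such as "bitscore" or "pident".
    """
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(COLUMNS, fields))
        row["pident"] = float(row["pident"])
        row["length"] = int(row["length"])
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        rows.append(row)
    rows.sort(key=lambda r: r[key], reverse=True)
    return rows[:top_n]


# Made-up rows, only to show the expected input format.
example = (
    "query_A\tsubject_1\t62.5\t400\t150\t0\t1\t400\t1\t400\t1e-180\t520\n"
    "query_B\tsubject_2\t71.0\t210\t61\t0\t1\t210\t1\t210\t1e-100\t310\n"
    "query_C\tsubject_3\t58.3\t700\t292\t0\t1\t700\t1\t700\t0.0\t780\n"
)

for hit in best_hits(io.StringIO(example), top_n=2):
    print(hit["qseqid"], hit["sseqid"], hit["pident"], hit["bitscore"])
```

To use it on a real report, pass an open file instead of the io.StringIO object, e.g. best_hits(open("my_report.tsv")), where my_report.tsv is a placeholder for whatever you named your BLAST output.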

======================================================================

Second exercise: Cumulative plot of palindromes along a genome.  

Palindromic DNA motifs are sequences that are identical to their reverse complement, e.g., CTAG. Your task is to explore the distribution of pairs of palindromes. You should first calculate the cumulative occurrence of CTAG and GATC. Note that the second palindrome is the first one read backwards; nevertheless, these are two different palindromes. In DNA, a sequence is considered a palindrome only when its reverse complement is the same motif.
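
The definition above can be checked with a few lines of Python. This is a minimal sketch; the helper names reverse_complement and is_dna_palindrome are chosen here for illustration, and the third motif (CTGA) is included only as an example of a sequence that is not a palindrome.

```python
# Translation table that maps each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_dna_palindrome(seq):
    """True if the sequence is identical to its own reverse complement."""
    return seq.upper() == reverse_complement(seq)


for motif in ("CTAG", "GATC", "CTGA"):
    print(motif, reverse_complement(motif), is_dna_palindrome(motif))
```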

Write a script that determines the cumulative occurrence of this pair of palindromes (CTAG and GATC) along the E. coli K12 chromosome < https://www.ncbi.nlm.nih.gov/nuccore/NC_000913.3?report=fasta or https://www.ncbi.nlm.nih.gov/nuccore/NC_000913.3?report=fasta&log$=seqview&format=text >. You might need to wait a little before the sequence is loaded.
Use Excel (or any other program of your choice) to plot the results.
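
The following is one possible outline of such a script, not the expected solution. It assumes that the chromosome has been saved as a single-record FASTA file, that overlapping occurrences should be counted, and that reporting the running totals every 1000 bp (the step parameter, an arbitrary choice) is fine enough for plotting. All function names are made up for this sketch, and the demonstration uses a short toy sequence rather than the real genome.

```python
import bisect
import re


def read_fasta_sequence(lines):
    """Join the sequence lines of a single-record FASTA file into one string."""
    return "".join(
        line.strip() for line in lines if not line.startswith(">")
    ).upper()


def motif_positions(sequence, motif):
    """0-based start positions of all (possibly overlapping) occurrences."""
    pattern = "(?=" + re.escape(motif) + ")"  # lookahead allows overlaps
    return [m.start() for m in re.finditer(pattern, sequence)]


def cumulative_counts(sequence, motifs, step=1000):
    """Rows of (position, count for motif 1, count for motif 2, ...).

    Each count is the number of occurrences that start before `position`.
    """
    positions = {m: motif_positions(sequence, m) for m in motifs}
    checkpoints = list(range(step, len(sequence), step)) + [len(sequence)]
    rows = []
    for end in checkpoints:
        counts = [bisect.bisect_left(positions[m], end) for m in motifs]
        rows.append((end, *counts))
    return rows


# Toy input, only to show the shape of the output.
toy = read_fasta_sequence([">toy record\n", "CTAGGATC\n", "CTAG\n"])
for row in cumulative_counts(toy, ["CTAG", "GATC"], step=4):
    print("\t".join(str(value) for value in row))
```

For the real chromosome, read the lines from your downloaded file (e.g. with open("ecoli.fasta"), where ecoli.fasta is a placeholder name) and write the tab-separated rows to a text file that Excel or another plotting program can import.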

Submit the plot(s) you obtain as an image file, make sure that the axes are labeled and the numbers are readable.

Only in case you do not complete this exercise, and if you want to receive partial credit, submit the latest script you have written.

If time permits, use your script on a different genome, e.g., an archaeon.

If you still have time, modify the script to also work on CATG and GTAC; and/or AGCT and TCGA; and/or ACGT and TGCA.

If you got this far, write down a few sentences on the implications of your result.

=======================================================================

Send your results (most similar genes and chosen measure of similarity; plot of cumulative occurrence as pdf, jpg, or png) to gogarten@uconn.edu with a cc to Artemis.Louyakis@uconn.edu and to yourself.
Before you leave, make sure that the cc to yourself has arrived.

=======================================================================