MCB 372 : geneplot and repetitive elements

Please send your answers per email to gogarten@uconn.edu, or hand in a hardcopy

You should answer the questions in red!

In today's exercise you will use two genomes of your choice. These two genomes should be from closely related organisms. We only will use the main chromosome.

Good choices are two of the three Frankia genomes or Thermotoga maritima and Thermotoga letingae or Pyrococcus abyssi and Pyrococcus horikoshii. If you want to use other genomes, do a quick check using geneplot to see if the genomes contain some syntenic regions.

Go to ftp://ftp.ncbi.nih.gov/genomes/Bacteria and download the genomes in faa and fna format. Also download the file with the .ptt extension. Open the file and glance through its contents. Place the files into a new folder on bbcxsv1.. (note: if the folder contains multiple chromosomes, take the "main" chromosome.)
Rename the files with names that are informative, but keep the extensions. For each genome you should have .faa .fna and .ptt files.

Download the perl scripts extract_lines.pl and addnumnuc.pl and place them into the same folder.

A) Geneplot using Excel

We will use the information in the *.ptt files to add the nucleotide position as first entry in the annotation line. To do so ssh to bbcxsrv1 using your favorite tools.
cd into the directory where you stored the files.

If you have not done so already, glance through the addnumnuc.pl script using vi .

For both of the genomes execute the following:
perl adnumnuc.pl name_of_genome

This should generate new *.faa files in the directory that end on name_of_genome.num.faa

For example, if your genomes are Tpet, Tlet and Tmar

chmod a+x *.pl
perl addnucnum.pl Tpet
perl addnucnum.pl Tmar
perl addnucnum.pl Tlet

will create the file Tpet.num.faa, Tmar.num.faa ...

Next we will turn one of these genomes into a searchable databank:

formatdb -i Tmar.num.faa -p T -o T
formatdb -i Tpet.num.faa -p T -o T

We now can search the all the ORFs in one genome against all the ORFs in the the other genome. Using the m8 option all significant hits (which e-value do you choose, the default is ten and definitely gives too much junk) will be stored in table, where the first column is the location in nucleotides of the query, and the second is the location in the other genome. The following command would work in case of the Tpet Tmar genomes:
blastall -p blastp -d Tpet.num.faa -i Tmar.num.faa -o Tpet_Tmar.tab -F F -m 8 -W 2 -a 2 -e 0.001

You might want to run this again with a higher significance level (10^-9). If you do, give the output a different name.

Next we will load the output table (Tpet_Tmar.tab in the example) into Excel (data - get external data - import text file). Because the script added a \t you should tell Excel to treat consecutive delineators as one (checkmark on the 2nd import screen).

Produce a x-y scatter plot of the first two columns.

If you cannot see syntenic regions, repeat the blastall search with a smaller E value cut-off.

Which genomes did you choose?
Can you see syntenic regions?
Did it help to lower the E-value cut-off?

To only analyze the top scoring blast hit, we will send the output from the blastall search through extract_lines.pl, for example

perl extract_lines.pl Tpet_Tmar.tab #replace Tpet_Tmar.tab with the filename where you stored your blast search results

This results in a file named Tpet_Tmar.tab.top that contains only the top scoring hit per query. Import this file into Excel and produce a x-y scatter plot of the first two columns.

Can you more clearly recognize syntenic regions?
How many rearrangements separate the two genomes?

B) We want to use PSIblast to determine where genes of divergent families are placed in a genome. To build the PSSM we use the non redundant database and a seed sequence. Possible are transposases, integrases, or phage proteins (e.g. capsid or tail proteins that are found in many phages and prophages).

In a first step we build the check point file using nr. E.g.:
blastpgp -i integrase.fa -d nr -I T -h 0.000001 -j 5 -C integrase.chk -a2 -F F

In a second step we use the profile to search the genomes of our choice, e.g.:
blastall -i integrase.fa -d Fnod.fna -p psitblastn -R integrase.chk – o integrase_Fnod.tab –m8 –a2 –F F

If the output is in m8 format, we can import it into Excel and plot position against bitscore.

What sequence did you use to build your profile?
What genomes did you use as target?
How many significant matches did you find with your profile?
Were the matches to the genome in syntenic regions?
Did the some matches correspond to a match in the other genome?

PS: if you are serious about this, you need to use more than on sequence as a seed. You could for example write script that uses all known transposases, builds a PSSM for each and searches the target genome...