Assignment 12

Your name:
Your email address:

1) fill out the SETs

2) go over take home exam #9

3) ask any other questions regarding the final

4) Analyze this gene, sequenced from a deep-sea sediment metagenome (the sampling location is known as Loki's Castle and gave the Lokiarchaea their name; the Lokiarchaea are now widely believed to be the ancestors of the eukaryotic nucleocytoplasm).

This is the coding sequence:

>LAZR01000084.1:16162-19626 Marine sediment metagenome LCGC14_contig000084
ATGTCAAACGCAGATGATGTGACTGATTTCTTTCAGCGTTTTTTCACCGAATACAAGGACGATGATGGTA
ATTTCAAATATACCAATAGAATAAATCGTATGGTCAGAGACGGCAGTCAATATTTACAGATTGATTATGA
TGATGTATTATTATACGAGGCAGGAGATGGCGATATTTCCACTCCACTTTTTGAAAATCCTTATTTTGTT
ATGGATTATGCCAATATGGCACTGGGTGAGGCTGTACGCCAAGAATCAGTAGATTTTTATAATGATATGA
ATCGTGATGGGGTGGATTTTATGATTCAATTTGTGGACTTACCTATTGAGATAGGTCTCAGAGATCTGAG
GGCAAAACACGTACGCACGATGCGCGTGATTGAGGGCATTGTGACTAGAACCACCGATATAAAGGGAATT
ATCCGCGAGGCTCAATTTTTCTGCAAGGAAAACAGAGAACACATAGTGGTGATGACTTTACTCGATGGAA
TATACTCGTCTCCCAACCAATGCAGTGTTCCGACCTGTAGAAGTAAACAATTCTCATTAGAAATGGAATT
TTCAAGTCAAGTAGATTGGCAATTGGTCACGCTTCAAGAAATGCCTGAAAACATCTCTGCAGGGCGAACT
CCTGTATCAATTAGATGTAGATTCACACAAGGGATGGTGGGTTCAGCTAATCCGGGAAACCGTATTGCAG
TTACAGGAGTAATTCGGGCTCAATCAAGAAAAACTATCCAGAAAGGAAAAATCATGTTGTTGGATAAATG
GATTGATACCAATCATGTGAAAGTATTGGGTTATCAACAAAAATATGAAGAAATCTTACCAGCAGAATTA
AAAGAATTTGACGAGATGGCAAAAGATCCAAAACTATTTGATAAACTGGTAAATTCGTTTGCCCCAACCA
TTTATGGGTTAAAAGAGGTAAAAGCAGCATTACTTTTATTTTTGTTGGGTGGTGTAGATAAAATACGTGC
TGATGGAATTAAATTAAGGGGACAATCCAATATCTTGCTAGTTGGCGATCCAAGTATGGGAAAATGCTGT
CGGGGGGCCACAACCTACGTATTTAGTAATAGAGGCATGAAATTACTTAGTGATTTTTATCAAACAGATG
AAATTGAAAGTGACAAAGAATTTTCATTGGGAATTGAAACATTTGATTTTCAATCATTCAATCCAAAGAA
CACGGTGGCAATTTATCAACGCAAACAAGCAAAAACTATAAAGATTACCAATAGTATAGGATTGGCGATA
GAAGGAACTCCACATCACAGAATCATTATTCAAAACAATACTGGTGACATTGAATGGAAACAATTACAAG
ATATTCAAGAATCAGACCACATAGTTATTCGTGTGGGTAGCAATCTATTCAACAAATCACACAAAAAGAT
AACCTTCTCATCACCAATTCAAAAAATAAAGAATACGAAAAAAATAACAATTCCAAAAACAATGAATATG
GACTTGGCTTATTACATTAGTCTATTGATTGGCGATGGCTGTCTAACCAAAAAAAGATGTATTGAATTTA
CAAATGCAGATAAATATTTATTAGATCAATTCGAAAAATTATCTCTTGATCTATTTGGTCTGATTGGGTT
TGTACAACTAAAGAAAGGATCTATTGCTTCTACTGTAGTAATTTCTTCTGTCACATTACAGCGGTTCTTC
GATTATTTAGGGCTCGGTGGAAAATACTCTTTTGAGAAAACAATACCCCAACTTATCTTGGAATCACCAA
AATCCGTTCAGATAGCGTGCTTGAAGGGATTATTTGATACCGATGGTGAAATTTCTAAATATGATGTGGC
ATACACATCAACATCTGAACTATTGGTTATACAGGTTCAACTAATATTATTAAATCTAGGGATTGTCACA
TCTAAGAAAACAAAGACAACAACTCATAGGGATCTCTATCGACTTAGAATTGTTGGTGCCTACATTCCTC
TCTTCAAAGAATTAATTGGTTTTCGTTGTACACAAAAAAGAACGGCTCTCGACAAAACAAAAGCAAGAAA
CAAGACCAACGTATGTGGAGTTCCAAATATCCAACGTCATTTATACAAATTGTGGTATTCTATTCCCGAG
GAGGTTCGTTATAAAAATGGTTATAAAAAAGGATCAAAGGCAAGTATTGGTGGAGTCACGTTCACTTATC
TACGTAGATATTTCTTAAAATCACAAAATCGAAACATTCCTGTTTATAAATTAGGTTCGCTATTAGAGGG
TTTTGCAAAATTATATCCTAAAATTACTCAACTAAAAGAATATAAAAAACTAACAGTATTTACAAAGGGG
ATGTTCTTTACAAAATTAGCACATAAAACAACAGGCATTGCAGATGTAATGGATTTTACAATTCCTGACA
CCGAATCTTTTACAGCAAATGGTATAATTAATCACAACAGCATATTGCTAAAGTATACTAAATCCCTATC
TGATCGTGCAATATTCACCTCTGGGAAGGGCTCAACTGCAGCAGGTTTAACGGCTGCGATGTTACGGGAC
CCTGATACGGGCGAATTCAATCTAGAAGCCGGGGCAATTGTACTTGCAGATGAAGGATATGTTTGTATTG
ATGAATTCGATAAGATGAGTGAGAATGATCGGTCTGCAATTCATGAGGCGATGGAACAACATCAGGTTTC
AATATCCAAAGCTGGTATAGTGACAACATTAAATGCCCGAACAGGAATTTTAGCAGCCGCAAATCCAAAA
TATGGGCGTTATGAATCACATAGAACATTTATGGAAAATGTAAATTTACCTCCTACAATACTGTCTCGTT
TTGATTTAATATTTCCATTACTGGATGATCCCAAACAGAGGGACGATGCTGCGCGAGTAGAATATATTCT
TGCCAGTCATAGAATGGAAACAATAGCAAAAACAACCGAAACTTACTCTACTGCGGTAATGCAAAAATAT
ATTGCATATGCGAAATCAACATCATCCCCCATACTATCAGAGAGTGCTGAACAGGCTATCTTTGAGTTCT
ATATTAATCTAAGAGAACAGATTGGTGACGATAAGGGACGAATCCCCATTACAGATCGTCAACTTGAAAG
TATTATCAGATTGGCAGAAGCAAGGGCAAAAATCAATCTTAAAAAAACAGTTTCCAAACAAGATGCATTG
AAAGCAATTCAACTGGTACAATATTGTCTTGAACAGGTAACAACTGATCCAGAAACTGGAAAATTAGATA
TAGATTTCATGTATTCGGGAGAAAGTTCTACCAAACGGACTACAAGGAATAAGATGGAGAAAATCATGGC
ATTATTGAGTTTCTTCCAACGCACTTACTCGGGGCCATTCAGTGAGGAAGAATTTCTCAAAGAAGCAGAA
AATGAGGGGTTGACTCAAGAATATACAATTGCTGTGTTGGAGCAGTTAAAAAGAGATGGAAAAATTTATA
CGCCGACGCCTGGCCGTCTGAAGCTTGCTTCATGA

This is the encoded protein (extein + intein):

>lcl_ORF1
MSNADDVTDFFQRFFTEYKDDDGNFKYTNRINRMVRDGSQYLQIDYDDVL
LYEAGDGDISTPLFENPYFVMDYANMALGEAVRQESVDFYNDMNRDGVDF
MIQFVDLPIEIGLRDLRAKHVRTMRVIEGIVTRTTDIKGIIREAQFFCKE
NREHIVVMTLLDGIYSSPNQCSVPTCRSKQFSLEMEFSSQVDWQLVTLQE
MPENISAGRTPVSIRCRFTQGMVGSANPGNRIAVTGVIRAQSRKTIQKGK
IMLLDKWIDTNHVKVLGYQQKYEEILPAELKEFDEMAKDPKLFDKLVNSF
APTIYGLKEVKAALLLFLLGGVDKIRADGIKLRGQSNILLVGDPSMGKCC
RGATTYVFSNRGMKLLSDFYQTDEIESDKEFSLGIETFDFQSFNPKNTVA
IYQRKQAKTIKITNSIGLAIEGTPHHRIIIQNNTGDIEWKQLQDIQESDH
IVIRVGSNLFNKSHKKITFSSPIQKIKNTKKITIPKTMNMDLAYYISLLI
GDGCLTKKRCIEFTNADKYLLDQFEKLSLDLFGLIGFVQLKKGSIASTVV
ISSVTLQRFFDYLGLGGKYSFEKTIPQLILESPKSVQIACLKGLFDTDGE
ISKYDVAYTSTSELLVIQVQLILLNLGIVTSKKTKTTTHRDLYRLRIVGA
YIPLFKELIGFRCTQKRTALDKTKARNKTNVCGVPNIQRHLYKLWYSIPE
EVRYKNGYKKGSKASIGGVTFTYLRRYFLKSQNRNIPVYKLGSLLEGFAK
LYPKITQLKEYKKLTVFTKGMFFTKLAHKTTGIADVMDFTIPDTESFTAN
GIINHNSILLKYTKSLSDRAIFTSGKGSTAAGLTAAMLRDPDTGEFNLEA
GAIVLADEGYVCIDEFDKMSENDRSAIHEAMEQHQVSISKAGIVTTLNAR
TGILAAANPKYGRYESHRTFMENVNLPPTILSRFDLIFPLLDDPKQRDDA
ARVEYILASHRMETIAKTTETYSTAVMQKYIAYAKSTSSPILSESAEQAI
FEFYINLREQIGDDKGRIPITDRQLESIIRLAEARAKINLKKTVSKQDAL
KAIQLVQYCLEQVTTDPETGKLDIDFMYSGESSTKRTTRNKMEKIMALLS
FFQRTYSGPFSEEEFLKEAENEGLTQEYTIAVLEQLKRDGKIYTPTPGRL
KLAS
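
The protein above is the conceptual translation of the coding sequence. As a minimal sketch of that step, assuming the standard codon assignments and that the sequence starts in frame at its first base, the translation could be reproduced in Python as follows. The function name `translate` is chosen here for illustration and is not part of the assignment.

```python
# Standard codon assignments, listed in the usual TCAG order
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a coding sequence in frame 1, stopping at the first stop codon."""
    dna = "".join(dna.split()).upper()  # remove line breaks and spaces
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[pos:pos + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# First 27 bases of the coding sequence given above.
print(translate("ATGTCAAACGCAGATGATGTGACTGAT"))  # MSNADDVTD
```

Pasting the full coding sequence (without the header line) into `translate` should give the lcl_ORF1 protein, which is a quick way to confirm that no bases were lost when copying.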

This protein contains an intein. To determine where the intein is located, you can do a simple BLAST search with lcl_ORF1. Restrict the search to archaea in nr. Download the matching sequences; be sure to include sequences that do not contain an intein.
Align the sequences in seaview. One complication is that the host protein in different organisms harbors different intein alleles (i.e., inteins that target different insertion sites). We are only interested in the ones in the same insertion site as lcl_ORF1. As we use this alignment only to find the intein boundaries, delete the sequences that have inteins in different locations.

Realign the sequences, determine where the intein starts and stops, and use this information to cut out the intein and rejoin the extein parts. (If you are short on time, use the sequences below.)

Also, keep the well-matching sequences that contain only the lcl_ORF1 intein for the phylogenetic analysis, along with some of the sequences that do not contain the intein!

This is the intein only:

> Intein only
CCRGATTYVFSNRGMKLLSDFYQTDEIESDKEFSLGIETFDFQSFNPKNTVA
IYQRKQAKTIKITNSIGLAIEGTPHHRIIIQNNTGDIEWKQLQDIQESDH
IVIRVGSNLFNKSHKKITFSSPIQKIKNTKKITIPKTMNMDLAYYISLLI
GDGCLTKKRCIEFTNADKYLLDQFEKLSLDLFGLIGFVQLKKGSIASTVV
ISSVTLQRFFDYLGLGGKYSFEKTIPQLILESPKSVQIACLKGLFDTDGE
ISKYDVAYTSTSELLVIQVQLILLNLGIVTSKKTKTTTHRDLYRLRIVGA
YIPLFKELIGFRCTQKRTALDKTKARNKTNVCGVPNIQRHLYKLWYSIPE
EVRYKNGYKKGSKASIGGVTFTYLRRYFLKSQNRNIPVYKLGSLLEGFAK
LYPKITQLKEYKKLTVFTKGMFFTKLAHKTTGIADVMDFTIPDTESFTAN
GIINHN

Extein only:

> Extein only
MSNADDVTDFFQRFFTEYKDDDGNFKYTNRINRMVRDGSQYLQIDYDDVL
LYEAGDGDISTPLFENPYFVMDYANMALGEAVRQESVDFYNDMNRDGVDF
MIQFVDLPIEIGLRDLRAKHVRTMRVIEGIVTRTTDIKGIIREAQFFCKE
NREHIVVMTLLDGIYSSPNQCSVPTCRSKQFSLEMEFSSQVDWQLVTLQE
MPENISAGRTPVSIRCRFTQGMVGSANPGNRIAVTGVIRAQSRKTIQKGK
IMLLDKWIDTNHVKVLGYQQKYEEILPAELKEFDEMAKDPKLFDKLVNSF
APTIYGLKEVKAALLLFLLGGVDKIRADGIKLRGQSNILLVGDPSMGK
SILLKYTKSLSDRAIFTSGKGSTAAGLTAAMLRDPDTGEFNLEA
GAIVLADEGYVCIDEFDKMSENDRSAIHEAMEQHQVSISKAGIVTTLNAR
TGILAAANPKYGRYESHRTFMENVNLPPTILSRFDLIFPLLDDPKQRDDA
ARVEYILASHRMETIAKTTETYSTAVMQKYIAYAKSTSSPILSESAEQAI
FEFYINLREQIGDDKGRIPITDRQLESIIRLAEARAKINLKKTVSKQDAL
KAIQLVQYCLEQVTTDPETGKLDIDFMYSGESSTKRTTRNKMEKIMALLS
FFQRTYSGPFSEEEFLKEAENEGLTQEYTIAVLEQLKRDGKIYTPTPGRL
KLAS
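
Once the intein boundaries are known from the alignment, cutting out the intein and rejoining the extein parts is a matter of slicing the precursor. The sketch below assumes 1-based, inclusive intein coordinates; the function names and the shortened toy precursor are made up for illustration. The boundary check encodes only a common rule of thumb (many inteins start with Cys or Ser, end with Asn, and are followed by Cys, Ser, or Thr); exceptions exist, so treat a failed check as a prompt to look again, not as proof of an error.

```python
def split_intein(precursor, intein_start, intein_end):
    """Return (intein, spliced_extein) for 1-based inclusive intein coordinates."""
    if not 1 <= intein_start <= intein_end <= len(precursor):
        raise ValueError("intein coordinates fall outside the sequence")
    n_extein = precursor[:intein_start - 1]
    intein = precursor[intein_start - 1:intein_end]
    c_extein = precursor[intein_end:]
    return intein, n_extein + c_extein


def boundaries_look_typical(precursor, intein_start, intein_end):
    """Rule-of-thumb check of the residues at the two splice junctions."""
    if intein_end >= len(precursor):
        return False  # no C-extein residue follows the intein
    first = precursor[intein_start - 1]   # first intein residue
    last = precursor[intein_end - 1]      # last intein residue
    next_residue = precursor[intein_end]  # first C-extein residue
    return first in "CS" and last == "N" and next_residue in "CST"


# Toy precursor: the flanking residues are copied from lcl_ORF1, but the
# intein is shortened to its first 10 and last 6 residues for illustration.
toy = "LLVGDPSMGK" + "CCRGATTYVF" + "GIINHN" + "SILLKYTKSL"
intein, extein = split_intein(toy, 11, 26)
print(intein)   # CCRGATTYVFGIINHN
print(extein)   # LLVGDPSMGKSILLKYTKSL
print(boundaries_look_typical(toy, 11, 26))  # True
```

Counting 50 residues per line in the lcl_ORF1 listing, the intein and extein sequences given above correspond to an intein at residues 349 to 806 of the 1154-residue precursor.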

We want to learn whether the intein and extein evolved together, and whether transfers occurred only between related organisms or also between divergent ones. To do this, we will compile a dataset that contains all sequences that harbor the homologous intein allele.
You can try a database search with the intein only; however, with the settings and databases I tried, I only retrieved inteins that sit in MutS, a protein that is different from lcl_ORF1. The reason for this is that inteins are not well conserved ...

To supplement our dataset with extein-containing sequences, we will do a blastp search with the extein only. To avoid being overwhelmed by tens of thousands of sequences, we will use the uniprot database.

A file that combines the sequences from the first blast search and from the search of uniprot50 is here (note that the annotation lines were modified by scripts to provide names that give some indication of the organisms to which the sequence is ascribed; the intein-containing sequences are at the bottom). (The script used to reannotate the NCBI fasta files is here, the one for uniprot is here. Seaview complains about the resulting file: you need to delete the first empty line, and replace "(" and ")" with "_".)
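
The two fixes that seaview needs (dropping the leading empty line and replacing parentheses with underscores) can be done by hand in a text editor, or with a few lines of Python. This is a minimal sketch, not one of the reannotation scripts mentioned above; the function name and the example header are hypothetical.

```python
def clean_fasta_for_seaview(text):
    """Drop leading empty lines and replace ( and ) with _ in FASTA header lines."""
    lines = text.splitlines()
    # Remove empty lines at the very top of the file.
    while lines and not lines[0].strip():
        lines.pop(0)
    cleaned = []
    for line in lines:
        if line.startswith(">"):
            # Only annotation lines can contain parentheses.
            line = line.replace("(", "_").replace(")", "_")
        cleaned.append(line)
    return "\n".join(cleaned) + "\n"


# Hypothetical example record with a leading blank line.
example = "\n>seq1_Lokiarchaeum_sp.(strain_X)\nMSNADDVTDF\n"
print(clean_fasta_for_seaview(example), end="")
```

To apply it to a file, read the file contents into a string, pass the string to `clean_fasta_for_seaview`, and write the result to a new file so that the original download stays untouched.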

Use this file (or your own data) to calculate phylogenies for the inteins and exteins separately. (If you do this in class, you will want to use neighbor joining for the extein trees; the inteins are too divergent for the Poisson correction, so use ML instead.) If you have time, do the analysis in IQ-TREE.
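
To see why very divergent sequences are a problem for distance-based trees, it helps to look at the Poisson correction itself: with p the fraction of aligned sites that differ, the corrected distance is d = -ln(1 - p). The sketch below is an illustration under that standard formula, not the code seaview runs; the function names are made up, and gapped columns are simply skipped.

```python
import math


def p_distance(seq_a, seq_b):
    """Fraction of differing sites between two aligned sequences, ignoring gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    differences = sum(a != b for a, b in pairs)
    return differences / len(pairs)


def poisson_distance(seq_a, seq_b):
    """Poisson-corrected distance d = -ln(1 - p)."""
    p = p_distance(seq_a, seq_b)
    if p >= 1.0:
        return math.inf  # the correction is undefined when every site differs
    return -math.log(1.0 - p)


# The correction grows slowly at first, then without bound as p approaches 1.
for p in (0.1, 0.5, 0.9, 0.99):
    print(p, round(-math.log(1.0 - p), 3))
```

For closely related exteins, p is small and the correction is mild. For inteins, where most sites may differ, small changes in p produce large changes in d, so the estimated distances become unreliable.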

Save the trees as unrooted trees, and load them into Figtree. Arrange and color the resulting trees so that you can make sense of what might have happened. For example, in the extein tree, color all sequences that were invaded by an intein in a different color. Collapse the parts of the tree that contain only non-invaded exteins.
Highlight the sequence from the metagenome. Is it in the same neighborhood in both trees?

Describe your findings, and email the resulting trees (both as a tree file and as a PDF) to gogarten@uconn.edu


Send email to your instructor (and yourself) upon submit
Send email to yourself only upon submit (as a backup)
Show summary upon submit but do not send email to anyone.