Your name: Your email address:
1) fill out the SETs
2) go over take home exam #9
3) ask any other questions regarding the final
4) Analyze this gene sequenced from a deep sea sediment metagenome (the location is known as Loki's Castle; it gave its name to the Lokiarchaea, which are now widely believed to be the ancestors of the eukaryotic nucleocytoplasm).
This is the coding sequence:
>LAZR01000084.1:16162-19626 Marine sediment metagenome LCGC14_contig000084 ATGTCAAACGCAGATGATGTGACTGATTTCTTTCAGCGTTTTTTCACCGAATACAAGGACGATGATGGTA ATTTCAAATATACCAATAGAATAAATCGTATGGTCAGAGACGGCAGTCAATATTTACAGATTGATTATGA TGATGTATTATTATACGAGGCAGGAGATGGCGATATTTCCACTCCACTTTTTGAAAATCCTTATTTTGTT ATGGATTATGCCAATATGGCACTGGGTGAGGCTGTACGCCAAGAATCAGTAGATTTTTATAATGATATGA ATCGTGATGGGGTGGATTTTATGATTCAATTTGTGGACTTACCTATTGAGATAGGTCTCAGAGATCTGAG GGCAAAACACGTACGCACGATGCGCGTGATTGAGGGCATTGTGACTAGAACCACCGATATAAAGGGAATT ATCCGCGAGGCTCAATTTTTCTGCAAGGAAAACAGAGAACACATAGTGGTGATGACTTTACTCGATGGAA TATACTCGTCTCCCAACCAATGCAGTGTTCCGACCTGTAGAAGTAAACAATTCTCATTAGAAATGGAATT TTCAAGTCAAGTAGATTGGCAATTGGTCACGCTTCAAGAAATGCCTGAAAACATCTCTGCAGGGCGAACT CCTGTATCAATTAGATGTAGATTCACACAAGGGATGGTGGGTTCAGCTAATCCGGGAAACCGTATTGCAG TTACAGGAGTAATTCGGGCTCAATCAAGAAAAACTATCCAGAAAGGAAAAATCATGTTGTTGGATAAATG GATTGATACCAATCATGTGAAAGTATTGGGTTATCAACAAAAATATGAAGAAATCTTACCAGCAGAATTA AAAGAATTTGACGAGATGGCAAAAGATCCAAAACTATTTGATAAACTGGTAAATTCGTTTGCCCCAACCA TTTATGGGTTAAAAGAGGTAAAAGCAGCATTACTTTTATTTTTGTTGGGTGGTGTAGATAAAATACGTGC TGATGGAATTAAATTAAGGGGACAATCCAATATCTTGCTAGTTGGCGATCCAAGTATGGGAAAATGCTGT CGGGGGGCCACAACCTACGTATTTAGTAATAGAGGCATGAAATTACTTAGTGATTTTTATCAAACAGATG AAATTGAAAGTGACAAAGAATTTTCATTGGGAATTGAAACATTTGATTTTCAATCATTCAATCCAAAGAA CACGGTGGCAATTTATCAACGCAAACAAGCAAAAACTATAAAGATTACCAATAGTATAGGATTGGCGATA GAAGGAACTCCACATCACAGAATCATTATTCAAAACAATACTGGTGACATTGAATGGAAACAATTACAAG ATATTCAAGAATCAGACCACATAGTTATTCGTGTGGGTAGCAATCTATTCAACAAATCACACAAAAAGAT AACCTTCTCATCACCAATTCAAAAAATAAAGAATACGAAAAAAATAACAATTCCAAAAACAATGAATATG GACTTGGCTTATTACATTAGTCTATTGATTGGCGATGGCTGTCTAACCAAAAAAAGATGTATTGAATTTA CAAATGCAGATAAATATTTATTAGATCAATTCGAAAAATTATCTCTTGATCTATTTGGTCTGATTGGGTT TGTACAACTAAAGAAAGGATCTATTGCTTCTACTGTAGTAATTTCTTCTGTCACATTACAGCGGTTCTTC GATTATTTAGGGCTCGGTGGAAAATACTCTTTTGAGAAAACAATACCCCAACTTATCTTGGAATCACCAA AATCCGTTCAGATAGCGTGCTTGAAGGGATTATTTGATACCGATGGTGAAATTTCTAAATATGATGTGGC ATACACATCAACATCTGAACTATTGGTTATACAGGTTCAACTAATATTATTAAATCTAGGGATTGTCACA 
TCTAAGAAAACAAAGACAACAACTCATAGGGATCTCTATCGACTTAGAATTGTTGGTGCCTACATTCCTC TCTTCAAAGAATTAATTGGTTTTCGTTGTACACAAAAAAGAACGGCTCTCGACAAAACAAAAGCAAGAAA CAAGACCAACGTATGTGGAGTTCCAAATATCCAACGTCATTTATACAAATTGTGGTATTCTATTCCCGAG GAGGTTCGTTATAAAAATGGTTATAAAAAAGGATCAAAGGCAAGTATTGGTGGAGTCACGTTCACTTATC TACGTAGATATTTCTTAAAATCACAAAATCGAAACATTCCTGTTTATAAATTAGGTTCGCTATTAGAGGG TTTTGCAAAATTATATCCTAAAATTACTCAACTAAAAGAATATAAAAAACTAACAGTATTTACAAAGGGG ATGTTCTTTACAAAATTAGCACATAAAACAACAGGCATTGCAGATGTAATGGATTTTACAATTCCTGACA CCGAATCTTTTACAGCAAATGGTATAATTAATCACAACAGCATATTGCTAAAGTATACTAAATCCCTATC TGATCGTGCAATATTCACCTCTGGGAAGGGCTCAACTGCAGCAGGTTTAACGGCTGCGATGTTACGGGAC CCTGATACGGGCGAATTCAATCTAGAAGCCGGGGCAATTGTACTTGCAGATGAAGGATATGTTTGTATTG ATGAATTCGATAAGATGAGTGAGAATGATCGGTCTGCAATTCATGAGGCGATGGAACAACATCAGGTTTC AATATCCAAAGCTGGTATAGTGACAACATTAAATGCCCGAACAGGAATTTTAGCAGCCGCAAATCCAAAA TATGGGCGTTATGAATCACATAGAACATTTATGGAAAATGTAAATTTACCTCCTACAATACTGTCTCGTT TTGATTTAATATTTCCATTACTGGATGATCCCAAACAGAGGGACGATGCTGCGCGAGTAGAATATATTCT TGCCAGTCATAGAATGGAAACAATAGCAAAAACAACCGAAACTTACTCTACTGCGGTAATGCAAAAATAT ATTGCATATGCGAAATCAACATCATCCCCCATACTATCAGAGAGTGCTGAACAGGCTATCTTTGAGTTCT ATATTAATCTAAGAGAACAGATTGGTGACGATAAGGGACGAATCCCCATTACAGATCGTCAACTTGAAAG TATTATCAGATTGGCAGAAGCAAGGGCAAAAATCAATCTTAAAAAAACAGTTTCCAAACAAGATGCATTG AAAGCAATTCAACTGGTACAATATTGTCTTGAACAGGTAACAACTGATCCAGAAACTGGAAAATTAGATA TAGATTTCATGTATTCGGGAGAAAGTTCTACCAAACGGACTACAAGGAATAAGATGGAGAAAATCATGGC ATTATTGAGTTTCTTCCAACGCACTTACTCGGGGCCATTCAGTGAGGAAGAATTTCTCAAAGAAGCAGAA AATGAGGGGTTGACTCAAGAATATACAATTGCTGTGTTGGAGCAGTTAAAAAGAGATGGAAAAATTTATA CGCCGACGCCTGGCCGTCTGAAGCTTGCTTCATGA
This is the encoded protein (extein + intein):
>lcl_ORF1 MSNADDVTDFFQRFFTEYKDDDGNFKYTNRINRMVRDGSQYLQIDYDDVL LYEAGDGDISTPLFENPYFVMDYANMALGEAVRQESVDFYNDMNRDGVDF MIQFVDLPIEIGLRDLRAKHVRTMRVIEGIVTRTTDIKGIIREAQFFCKE NREHIVVMTLLDGIYSSPNQCSVPTCRSKQFSLEMEFSSQVDWQLVTLQE MPENISAGRTPVSIRCRFTQGMVGSANPGNRIAVTGVIRAQSRKTIQKGK IMLLDKWIDTNHVKVLGYQQKYEEILPAELKEFDEMAKDPKLFDKLVNSF APTIYGLKEVKAALLLFLLGGVDKIRADGIKLRGQSNILLVGDPSMGKCC RGATTYVFSNRGMKLLSDFYQTDEIESDKEFSLGIETFDFQSFNPKNTVA IYQRKQAKTIKITNSIGLAIEGTPHHRIIIQNNTGDIEWKQLQDIQESDH IVIRVGSNLFNKSHKKITFSSPIQKIKNTKKITIPKTMNMDLAYYISLLI GDGCLTKKRCIEFTNADKYLLDQFEKLSLDLFGLIGFVQLKKGSIASTVV ISSVTLQRFFDYLGLGGKYSFEKTIPQLILESPKSVQIACLKGLFDTDGE ISKYDVAYTSTSELLVIQVQLILLNLGIVTSKKTKTTTHRDLYRLRIVGA YIPLFKELIGFRCTQKRTALDKTKARNKTNVCGVPNIQRHLYKLWYSIPE EVRYKNGYKKGSKASIGGVTFTYLRRYFLKSQNRNIPVYKLGSLLEGFAK LYPKITQLKEYKKLTVFTKGMFFTKLAHKTTGIADVMDFTIPDTESFTAN GIINHNSILLKYTKSLSDRAIFTSGKGSTAAGLTAAMLRDPDTGEFNLEA GAIVLADEGYVCIDEFDKMSENDRSAIHEAMEQHQVSISKAGIVTTLNAR TGILAAANPKYGRYESHRTFMENVNLPPTILSRFDLIFPLLDDPKQRDDA ARVEYILASHRMETIAKTTETYSTAVMQKYIAYAKSTSSPILSESAEQAI FEFYINLREQIGDDKGRIPITDRQLESIIRLAEARAKINLKKTVSKQDAL KAIQLVQYCLEQVTTDPETGKLDIDFMYSGESSTKRTTRNKMEKIMALLS FFQRTYSGPFSEEEFLKEAENEGLTQEYTIAVLEQLKRDGKIYTPTPGRL KLAS
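As a quick sanity check that the protein above really is the frame 1 translation of the coding sequence, the translation can be sketched in a few lines of Python. This is a minimal illustration, not the tool used to produce lcl_ORF1; it assumes the standard genetic code (archaea use the same codon assignments), and the function name translate is made up for this example.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(cds: str) -> str:
    """Translate a coding sequence in frame 1, stopping at the first stop codon."""
    cds = "".join(cds.split()).upper()  # drop spaces and line breaks from the FASTA body
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[pos:pos + 3], "X")  # X for codons with ambiguous bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# First 30 nucleotides of the coding sequence above.
print(translate("ATGTCAAACGCAGATGATGTGACTGATTTC"))  # MSNADDVTDF
```

Pasting the complete coding sequence into translate should reproduce the lcl_ORF1 protein, including the intein.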
This protein contains an intein. To determine where the intein is located, you can do a simple BLAST search with lcl_ORF1. Restrict the search to archaea in nr. Download matching sequences; be sure to include sequences that do not contain an intein. Align the sequences in seaview. One complication is that the host protein in different organisms harbors different intein alleles (i.e., inteins that target different insertion sites). We are only interested in the ones in the same insertion site as lcl_ORF1. As we use this only to find the intein boundaries, delete the sequences that have inteins in different locations.
Realign the sequences, determine where the intein starts and stops, and use this information to cut out the intein and to rejoin the extein parts. (If you are short on time, use the sequences below.)
Also keep the well-matching sequences that contain only the lcl_ORF1 intein, together with some of the sequences that do not contain the intein, for the phylogenetic analysis.
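Once the intein start and stop positions are read off the alignment, cutting out the intein and rejoining the extein is a simple slicing operation. The sketch below assumes 1-based, inclusive coordinates (as displayed in an alignment viewer). The function name split_intein is made up for this example, and the toy precursor is an abbreviated stand-in built from the residues flanking the two splice junctions of lcl_ORF1, with the middle of the intein left out.

```python
def split_intein(precursor: str, start: int, end: int) -> tuple[str, str]:
    """Return (intein, spliced_extein) for 1-based, inclusive intein coordinates."""
    if not 1 <= start <= end <= len(precursor):
        raise ValueError("intein coordinates fall outside the sequence")
    intein = precursor[start - 1:end]
    # Rejoin the N-terminal and C-terminal extein parts.
    extein = precursor[:start - 1] + precursor[end:]
    return intein, extein


# Abbreviated toy precursor (NOT the full protein): 10 residues on each side
# of the two splice junctions, with the middle of the intein omitted.
N_EXTEIN_END = "LLVGDPSMGK"
INTEIN_ENDS = "CCRGATTYVF" + "FTANGIINHN"
C_EXTEIN_START = "SILLKYTKSL"
toy_precursor = N_EXTEIN_END + INTEIN_ENDS + C_EXTEIN_START

intein, extein = split_intein(toy_precursor, start=11, end=30)
print(intein)
print(extein)
```

For the real protein, replace toy_precursor with the ungapped lcl_ORF1 sequence and use the coordinates found in your own alignment.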
This is the intein only:
> Intein only CCRGATTYVFSNRGMKLLSDFYQTDEIESDKEFSLGIETFDFQSFNPKNTVA IYQRKQAKTIKITNSIGLAIEGTPHHRIIIQNNTGDIEWKQLQDIQESDH IVIRVGSNLFNKSHKKITFSSPIQKIKNTKKITIPKTMNMDLAYYISLLI GDGCLTKKRCIEFTNADKYLLDQFEKLSLDLFGLIGFVQLKKGSIASTVV ISSVTLQRFFDYLGLGGKYSFEKTIPQLILESPKSVQIACLKGLFDTDGE ISKYDVAYTSTSELLVIQVQLILLNLGIVTSKKTKTTTHRDLYRLRIVGA YIPLFKELIGFRCTQKRTALDKTKARNKTNVCGVPNIQRHLYKLWYSIPE EVRYKNGYKKGSKASIGGVTFTYLRRYFLKSQNRNIPVYKLGSLLEGFAK LYPKITQLKEYKKLTVFTKGMFFTKLAHKTTGIADVMDFTIPDTESFTAN GIINHN
Extein only:
> Extein only MSNADDVTDFFQRFFTEYKDDDGNFKYTNRINRMVRDGSQYLQIDYDDVL LYEAGDGDISTPLFENPYFVMDYANMALGEAVRQESVDFYNDMNRDGVDF MIQFVDLPIEIGLRDLRAKHVRTMRVIEGIVTRTTDIKGIIREAQFFCKE NREHIVVMTLLDGIYSSPNQCSVPTCRSKQFSLEMEFSSQVDWQLVTLQE MPENISAGRTPVSIRCRFTQGMVGSANPGNRIAVTGVIRAQSRKTIQKGK IMLLDKWIDTNHVKVLGYQQKYEEILPAELKEFDEMAKDPKLFDKLVNSF APTIYGLKEVKAALLLFLLGGVDKIRADGIKLRGQSNILLVGDPSMGK SILLKYTKSLSDRAIFTSGKGSTAAGLTAAMLRDPDTGEFNLEA GAIVLADEGYVCIDEFDKMSENDRSAIHEAMEQHQVSISKAGIVTTLNAR TGILAAANPKYGRYESHRTFMENVNLPPTILSRFDLIFPLLDDPKQRDDA ARVEYILASHRMETIAKTTETYSTAVMQKYIAYAKSTSSPILSESAEQAI FEFYINLREQIGDDKGRIPITDRQLESIIRLAEARAKINLKKTVSKQDAL KAIQLVQYCLEQVTTDPETGKLDIDFMYSGESSTKRTTRNKMEKIMALLS FFQRTYSGPFSEEEFLKEAENEGLTQEYTIAVLEQLKRDGKIYTPTPGRL KLAS
We want to learn whether the intein and extein evolved together, and whether transfers occurred only between related organisms or also between divergent ones. To do this we will compile a dataset that contains all sequences that harbor the homologous intein allele. You can try a databank search with the intein only; however, with the settings and databases I tried, I only retrieved inteins that sit in MutS, a protein that is different from lcl_ORF1. The reason for this is that inteins are not well conserved ...
To supplement our dataset with extein-containing sequences, we will do a blastp search with the extein only. To avoid being overwhelmed by tens of thousands of sequences, we will use the uniprot database.
A file that combines the sequences from the first blast search and from the search of uniprot50 is here (note that the annotation lines were modified by scripts to provide names that give some indication of the organism to which each sequence is ascribed; the intein-containing sequences are at the bottom). (The script used to reannotate the NCBI fasta files is here, the one for uniprot is here. If seaview complains, you need to delete the first empty line and replace "(" and ")" with "_".)
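The cleanup that seaview needs (dropping the leading empty line and replacing parentheses with underscores) can also be done programmatically. The following is a minimal sketch, not the reannotation script linked above; the function name clean_fasta and the example header are hypothetical.

```python
def clean_fasta(text: str) -> str:
    """Drop leading empty lines and replace parentheses in header lines with '_'."""
    lines = text.splitlines()
    # Remove empty lines at the top of the file.
    while lines and not lines[0].strip():
        lines.pop(0)
    cleaned = []
    for line in lines:
        if line.startswith(">"):
            # Only annotation (header) lines are modified.
            line = line.replace("(", "_").replace(")", "_")
        cleaned.append(line)
    return "\n".join(cleaned) + "\n"


# Hypothetical example record with a leading empty line.
raw = "\n>seq1 Lokiarchaeum (strain X)\nMSNADDVTDF\n"
print(clean_fasta(raw), end="")
```

To apply it to a file, read the file contents into a string, pass them through clean_fasta, and write the result to a new file so that the original download stays untouched.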
Use this file (or your own data) to calculate phylogenies for the inteins and exteins separately. If you do this in class, you want to use neighbor joining for the extein trees (the inteins are too divergent for the Poisson correction; use ML instead). If you have time, do the analysis in IQ-TREE.
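To see why highly divergent sequences are a problem for the Poisson correction, recall that it converts the observed fraction of differing sites p into a distance d = -ln(1 - p). As p approaches 1, d grows without bound, so small changes in p produce large changes in the estimated distance. The sketch below illustrates this formula only; it is not the seaview implementation, the function names are made up for this example, and gap handling (skipping any site with a gap) is an assumption.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of differing sites between two aligned sequences, ignoring gapped sites."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # assumption: sites with a gap in either sequence are skipped
        compared += 1
        differing += a != b
    if compared == 0:
        raise ValueError("no ungapped sites to compare")
    return differing / compared


def poisson_distance(p: float) -> float:
    """Poisson-corrected distance d = -ln(1 - p); undefined for p >= 1."""
    if p >= 1.0:
        raise ValueError("Poisson correction is undefined for p >= 1")
    return -math.log(1.0 - p)


# The correction is mild for similar sequences and steep for divergent ones.
for p in (0.1, 0.5, 0.8, 0.95):
    print(f"p = {p:.2f}  ->  d = {poisson_distance(p):.3f}")
```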
Save the trees as unrooted trees, and load them into FigTree. Arrange and color the resulting trees so that you can make sense of what might have happened. For example, in the extein tree, color all sequences that were invaded by an intein in a different color. Collapse parts of the tree that contain only non-invaded exteins. Highlight the sequence from the metagenome. Is it in the same neighborhood in both trees?
Describe your findings, and email the resulting trees (as treefile and as PDF) to gogarten@uconn.edu