Questions (yours) and Answers (mine, i.e. J. Peter Gogarten)

 

============================================

Midterm postponed to Wednesday after spring break. Please try to do assignment #5 and go through the rest of these slides.

============================================

Hello! I was working through the lab. I think everything worked fine until part 2. When I input the Yeast_vma1_all_codon.fst from the website, it threw an error "Unable to proceed (missing number of sequences)". I don't see an argument for number of seqs in the tree-puzzle users manual. I tried with the input file as .fst .aln and .phy (which I think the user manual said it should be able to read). Not sure what's up. 

Tree-puzzle only reads PHYLIP-formatted files, which start with the number of sequences and the number of alignment columns. I added a link to a file in this format. Tree-puzzle only allows for sequence names with fewer than 10 characters. The easiest way to create this format is to load the file into clustal and save it in PHYLIP format (sadly, seaview does not work for this, because it uses an expanded version of this format that allows for longer species names).
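
For illustration, the layout described above can also be written by a short Perl script. This is only a sketch, assuming the input is an already aligned FASTA file and that a sequential (non-interleaved) layout is acceptable. The sub name fasta_to_phylip and the example sequences are made up, and truncating names to 9 characters can produce duplicate names, so check that they stay unique.

```perl
use strict;
use warnings;

# Sketch: convert aligned FASTA text into sequential PHYLIP text.
# fasta_to_phylip is a hypothetical helper name, not part of any package.
sub fasta_to_phylip {
    my ($fasta) = @_;
    my (@names, %seq);
    my $name;
    foreach my $line (split /\n/, $fasta) {
        $line =~ s/\r$//;                 # tolerate Windows line ends
        next unless $line =~ /\S/;        # skip blank lines
        if ($line =~ /^>\s*(\S+)/) {      # header: keep the first word as name
            $name = substr($1, 0, 9);     # keep names shorter than 10 characters
            push @names, $name;
            $seq{$name} = '';
        }
        elsif (defined $name) {
            $line =~ s/\s//g;             # remove white space inside the sequence
            $seq{$name} .= $line;
        }
    }
    my $ncol = length $seq{ $names[0] };
    # first line: number of sequences and number of alignment columns
    my $out = sprintf(" %d %d\n", scalar(@names), $ncol);
    foreach my $n (@names) {
        $out .= sprintf("%-10s%s\n", $n, $seq{$n});   # name padded to 10 columns
    }
    return $out;
}

# made-up two-sequence alignment, only to show the layout
my $example = ">Scer_vma1_long_name\nATGGCT\nGGT\n>Calb\nATGGCA\nGGT\n";
print fasta_to_phylip($example);
```

The first output line for this example is " 2 9" (two sequences, nine alignment columns), followed by one line per sequence.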

============================================

I wrote the script to check/reconfirm my understanding of Perl.  I know that as Perl doesn't have strict classes for integers or strings, that there is some malleability when dealing with variables.  However the script also yields True, when the statement is 
if ( $a = $b) {...
Based on this, I would think the correct answer would be all three choices.

NOT CORRECT!
($a == $b) and ($a eq $b) work (most of the time).
"eq" compares $a and $b as strings. If $a=2 and $b=2 this is not a problem.
However, if $b is the string '2.0' and $a is 2, then eq will evaluate them as different.
== compares them as numbers and correctly considers 2.0 and 2 as equivalent.
So far so good: usually eq and == give the same result, but it is safer to use == for numbers and eq for strings.
This is why if ($a=$b) {...} is wrong:
= is an assignment. $a = $b will assign the value of $b to $a. The expression ($a = $b) then evaluates to the newly assigned value, so it is true whenever $b is true (e.g., any non-zero number), no matter what $a was before. This is not what the if statement intends to do. The intention is to do something if $a has the same value as $b (and not whenever an assignment was made).
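
The three cases can be tried out with a few lines. This is a minimal sketch; the variable names and values are arbitrary, and the helper sub verdict exists only for this illustration.

```perl
use strict;
use warnings;

# Returns the string "true" or "false" for the value of a condition.
sub verdict { return $_[0] ? "true" : "false"; }

my $x = 2;        # a number
my $y = '2.0';    # a string that looks like a number

print "2 == '2.0' is ", verdict($x == $y), "\n";   # numeric comparison: true
print "2 eq '2.0' is ", verdict($x eq $y), "\n";   # string comparison: false

# Assignment inside a condition: the outcome only reflects the value of $d,
# and $c is overwritten along the way.
my ($c, $d) = (5, 7);
if ($c = $d) { print "branch taken, and \$c is now $c\n"; }

($c, $d) = (5, 0);
if ($c = $d) { print "not reached\n"; }
else         { print "branch skipped, and \$c is now $c\n"; }
```

In both if statements $c started out as 5 and was different from $d, yet the first branch was taken, simply because 7 counts as true.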

============================================

I know there are two parts of the exam; questions on paper and a practical.
How many questions are there going to be on the exam (specifically, questions on papers)?
I remember you said there will be a couple of questions 10s or so.

The closed book part at the moment has 13 questions, about 4 of these have multiple parts.
The practical has 2 distinct exercises. My expectation is that the closed book part should not take more than 30 minutes, leaving about one hour for each of the practical exercises. 

============================================

I turned Matt Fullmer's R instructions into an
Rmd and an html markup file,
updated the histogram script, and
placed all the created plots (including the ones from get_homologues) into a zip archive. All linked here

============================================

Possible questions for mid-term (to be continued):

When are two sequences homologous? 

Do all sequences that are homologous show significant sequence similarity?

If two sequences (that do not contain regions of low complexity) show significant similarity in their primary sequence, are they homologous? 

Are orthologous sequences homologs? 

Can a sequence in one organism have many orthologous sequences in another organism? 

Two sequences (each 300 amino acids long) in a pairwise sequence alignment have 70% identical residues (distributed rather evenly along the sequences), another 15% of the residues show conservative substitution.  Which is correct?

A) the sequences are 70% homologous
B) the sequences are 85% homologous
C) the sequences are homologous
D) There is a good chance the sequences are not homologous. 

What is the difference between a global and a local alignment?

In a blast search, how is the chance expressed that a match between a query and a database sequence is a false positive? 

In blast searches, is the E-value of a match proportional to the size of the database that was searched to obtain the match?

What does the abbreviation “ssh” mean, and what can you do with ssh?

You have several sequence files in your directory that have very, very long filenames (GCF_000196515.1_ASM19651v1_cds_from_genomic.fna GCF_000020025.1_ASM2002v1_cds_from_genomic.fna and GCF_000009705.1_ASM970v1_cds_from_genomic.fna).  You want to copy the contents of these files into a single file.  How can you do this without typing the very long names at the command line (and without using a graphical user interface)?

What is automatic line completion, and how can you invoke it?

At the command line you start typing a command, e.g. "cd ~/D".  You then hit the <tab> key on your keyboard.  What will happen? 

You do a pan genome analysis of 28 genomes. 
One gene family has members in all 28 genomes, the genes in this family would be considered to be part of the __________________
One gene family has members in 10 genomes, genes of this family would be considered to be part of the __________________
One gene family has members in 27 genomes, these genes would be considered to be part of the ____________________ or of ___________________

What are possible reasons for recombination events within a genome to frequently occur between sites equidistant to the origin of replication?

In comparing two genomes, what are the differences between a mummer and a gene plot? (Which is based on nucleotide sequences, and which on encoded proteins?  Which provides information on the encoding DNA strand?  Which is better in detecting paralogs?)

What is strand bias? 

G pairs with C and A pairs with T, how can there be a bias in Gs and Cs? 

Why is plotting cumulative bias less affected by noise than plotting the G over C bias in a window?

Why is a PSI blast search more effective in finding distant homologs than a normal blast search?

Why do PSI blast searches sometimes have a problem with estimating the probability of false positives in a useful way? 

Why is % identity between a query and a match in the database not a good criterion to evaluate the significance of a match?   

You write a script that for each entry in an array (e.g., a nucleotide in a genome) performs an activity (e.g., increasing the counter for this base by 1).  You start the loop with
foreach (@array) { }
What is the standard name for each array entry inside the loop? 

You want to perform a set of steps in a program only if two numerical variables ($a and $b) have the same value.  Which is correct:
if ($a = $b) {some commands}
if ($a == $b) {some commands}
if ($a eq $b) {some commands}

Which character at the beginning of the line denotes a comment line in Perl?

Which character at the beginning of the line denotes the name/annotation of a sequence in a fasta formatted sequence file?

What is a potential problem with progressive alignment programs (e.g., clustalw)?

If an intron does not have a length that is a multiple of 3, what problem does this cause for alternative splicing? 

Briefly describe the following approaches to phylogenetic reconstruction from aligned sequences:
Distance matrix analysis
Maximum parsimony analysis
Maximum likelihood analysis

Now that you made it this far, you could check here for the answers I expected.

============================================

I just wanted to make sure that I understand this: you want me to write a Perl script that will create a table that looks similar to the table on this excel spreadsheet correct?
(The file attached to this E-mail is excel spreadsheet)

Also, question about this "tetra" and "Penta"-nucleotide

Let's say the sequence is ATAGCTAG;
This means I will have two tetra-nucleotide correct? the first "tetra" is ATAG, and the second tetra is CTAG?
Then there will be only one penta-nucleotide in the same sequence: ATAGC

Is this right?
I am thinking about creating arrays that contain 4 characters (for tetra-nucleotides), then count the number of arrays to get the "tetra-nucleotides sequence"

Sadly, the answer to all of these questions is no:

Let's say the sequence is ATAGCTAG,
then the first tetranucleotide is ATAG,
the second is TAGC, i.e., you move over by one nucleotide, not four,
the third is AGCT ....

If you want to have the overall occurrence (as opposed to the occurrence on one strand only, which is what you would want, for example, to plot cumulative tetranucleotide strand bias), you also need to account for the reverse complement, i.e., if you encounter ATAG, you also need to count CTAT.

For the first part of the task (genome wide analysis), your Excel spreadsheet needs to have a row (or column) for every possible tetramer. You might need to think briefly about palindromic sequences (such as GTAC, whose reverse complement is itself), but if you use hashes this should work nicely, because a palindromic sequence will be added twice.

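HINT: Assuming that your genome is a big array called @genome, then the following will create a counter for each tetramer occurring on one DNA strand:

$number = scalar(@genome); # number of nucleotides in the genome
for ($i = 0 ; $i<=($number-4) ;$i++)  # Perl arrays start at index 0; the last tetramer starts at $number-4
   { $string = $genome[$i].$genome[$i+1].$genome[$i+2].$genome[$i+3];  
     $oligo_hash{$string} += 1;}  # one counter per tetramer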
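
The counting on both strands described above could be sketched as follows. This is one possible way to do it, not the only one: it keeps the sequence in a string and uses substr instead of an array, and the sub names revcomp and count_oligos are made up for this example. It assumes an upper-case sequence containing only A, C, G and T.

```perl
use strict;
use warnings;

# Reverse complement of a DNA string (upper-case A, C, G, T only).
sub revcomp {
    my ($dna) = @_;
    my $rc = reverse $dna;      # reverse the string
    $rc =~ tr/ACGT/TGCA/;       # complement every base
    return $rc;
}

# Count every overlapping oligomer of length $k on both strands.
sub count_oligos {
    my ($seq, $k) = @_;
    my %count;
    for (my $i = 0; $i <= length($seq) - $k; $i++) {
        my $oligo = substr($seq, $i, $k);   # window moves one base at a time
        $count{$oligo}++;                   # given strand
        $count{ revcomp($oligo) }++;        # opposite strand
    }
    return %count;
}

my %tetra = count_oligos('ATAGCTAG', 4);
foreach my $oligo (sort keys %tetra) {
    print "$oligo\t$tetra{$oligo}\n";
}
```

For ATAGCTAG the palindromes AGCT and CTAG each end up with a count of 2, because they are added once for each strand.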

============================================

Question about "true or false" - Powerpoint class 4, Slides #32.

What exactly are we suppose to do to "evaluate to true or false"?

Should I write a perl script to evaluate the "true or false"? YES!
Am I suppose to compare values? If so, what value do I compare "1" with?

These are to be considered as logical expressions (i.e., if you treat them as logical expressions, what do they evaluate to?), and the task is to write a script in Perl that evaluates them.
If you want to know if a statement evaluates to true, you could have the following line:
if($R eq 'R'){print "this is true\n"};
This would print "this is true", if $R is equal to the string R. In the same way you could have
if(1){print "this is true\n"};
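
A small script along these lines could test several expressions in one go. This is only a sketch: the expressions listed are examples I picked for illustration, not the ones from the slide, and the helper sub verdict is made up.

```perl
use strict;
use warnings;

# Evaluate a value as a condition and report the outcome.
sub verdict { return $_[0] ? "true" : "false"; }

# Example values only; substitute the expressions from the slide.
my @tests = (
    [ '1',          1          ],
    [ '0',          0          ],
    [ "''",         ''         ],
    [ "'0'",        '0'        ],
    [ "'0.0'",      '0.0'      ],
    [ "'R' eq 'R'", 'R' eq 'R' ],
    [ '3 == 4',     3 == 4     ],
);

foreach my $t (@tests) {
    my ($label, $value) = @$t;
    print "$label evaluates to ", verdict($value), "\n";
}
```

Note that the string '0.0' counts as true, although the number 0 and the string '0' count as false.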

============================================

Is there a way for perl to ignore blank lines? I ask because when working on the tasks in the class4.pl script, I find that the loops are counting the blank line as a GI number. To compensate, one could either remove the blank line at the bottom of gi_list.txt prior to answering the questions or fix any counters by subtracting one from their final summation, but I would think there would be better ways of fixing this issue, especially to fix a potential file littered with blank lines. 

One possibility (I am sure there is a more elegant solution) is to read in the line, then remove all white spaces from the line (in case there is a tab or a space on the empty line), and then to only push the line into the array, if it is not empty:

while(defined($line=<IN>)){
      chomp($line);
      $line =~ s/\s//g; #removes white spaces from line
      if ($line ne ''){
          push(@gi,$line); # push is a function that adds the $calar to the array.
          } # pushes line into array, only in case it is not empty
      }

If the input line has words separated by spaces, it becomes more difficult, because s/\s//g would also remove the spaces between the words. One could either hope that the empty lines are really empty, or split the line on the white spaces and check if the resulting array has an entry.
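
Another option, sketched here under the assumption that a line counts as blank when it contains nothing but white space, is to test the line with a pattern instead of changing it. The sub name non_blank_lines and the example lines are made up.

```perl
use strict;
use warnings;

# Keep only lines that contain at least one non-whitespace character.
# Unlike s/\s//g, this leaves spaces inside a kept line untouched.
sub non_blank_lines {
    my ($fh) = @_;
    my @kept;
    while (defined(my $line = <$fh>)) {
        chomp($line);
        next unless $line =~ /\S/;   # skip empty and whitespace-only lines
        push(@kept, $line);
    }
    return @kept;
}

# Demonstration on an in-memory "file"; the GI numbers are made up.
my $text = "12345\n\n67890 some note\n \t\n";
open(my $in, '<', \$text) or die "cannot open in-memory file: $!";
my @gi = non_blank_lines($in);
close($in);
print scalar(@gi), " lines kept\n";
```

Here two of the four lines are kept, and the space inside "67890 some note" survives.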

============================================

A few pointers for class 3 homework

Please don't stress too much over the homework.  This is for you to try things out, and to learn how to do things, and obviously, one makes mistakes in the beginning.  You are not getting a worse grade for trying and failing in a homework assignment, you only get a worse grade for not trying.   

A) I would make a new program for the GC counting program, and not as part of a single homework script.

B) The @ARGV array by default contains the strings that follow the name of the script when the script is called.  E.g., if the script is called countGC.pl and is called with the following command:

perl countGC.pl MahellaSequence.fasta filename2

then @ARGV has the following entries:

$ARGV[0] = MahellaSequence.fasta 

and

$ARGV[1] = filename2

C) When trying something out, I would NOT use a complete genome, but only about a hundred nucleotides.  In troubleshooting a script, you don't want to wait 30 seconds every time before you get an error message.   

D) Before you actually start writing the script, do the reading assignment on conditional statements and on loops.
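
Point B above can be tried out with a few lines. This is a minimal sketch: the sub name describe_args is made up, and opening the first argument with the file handle IN is just one way to continue from there.

```perl
use strict;
use warnings;

# List the arguments that followed the script name on the command line.
sub describe_args {
    my @args = @_;
    my @lines;
    for my $i (0 .. $#args) {
        push @lines, "\$ARGV[$i] = $args[$i]";
    }
    return @lines;
}

if (@ARGV) {
    print "$_\n" foreach describe_args(@ARGV);
    # the first argument is taken to be the name of the sequence file
    open(IN, '<', $ARGV[0]) or die "cannot open $ARGV[0]: $!\n";
    # ... read from IN here ...
    close(IN);
}
else {
    print "usage: perl countGC.pl MahellaSequence.fasta\n";
}
```

Called without any arguments, the script only prints the usage line.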

============================================

E) Fragments of code that might be useful: 

Reading in a fasta file with one or more headers. This assumes the file handle IN is assigned to a fasta file.

while(defined($line=<IN>)){
#this loop goes through an input file with the file handle IN line by line
  if ($line =~m/^>/) 
       #this is true when the line matches (=~ m) a ">" at the beginning of the line ("^")
      {$header=$line;
   print "\nthe analyzed sequence has the following comment line:\n$header \n\n"}
   else #if $line is not a header do the following
    {chomp($line);
    $seq=$seq.$line}
     }
The result is that the headers of a fasta file (one or multiple) are printed out, and that the sequences are placed into a single string that is not interrupted by any line breaks.

============================================
Split a string into an array
Assume you have your sequence in one long string.  You can split the string into smaller strings based on a pattern (see class03.pl).
If you use an empty pattern, it splits after every character, i.e., you have created an array where every letter in the original string is sitting in its own slot in the array.
The following would split a nucleotide sequence, and every base is consecutively assigned to an array called bases.  
@bases=split(//,$seq);

============================================
Foreach Loop

foreach (@bases) {...} 
#the commands in curly brackets are executed for every element of @bases.  Perl by default assigns the array element it is working on to a variable called $_.   But you could also specify a variable:
foreach $dummy (@bases) {...} 

===========================================
A simple counter:

if ($_ =~ /A/) {$A++};  
# this increases the variable $A by one if $_ matches an A.  
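
Put together, the fragments on split, foreach and counting might look like the following sketch. It uses a hash instead of one variable per base, which is my choice here and not a requirement; the sub name count_bases and the test sequence are made up.

```perl
use strict;
use warnings;

# Count how often each character occurs in a sequence string.
sub count_bases {
    my ($seq) = @_;
    my %count;
    my @bases = split(//, uc($seq));   # one character per array slot
    foreach (@bases) {
        $count{$_}++;                  # $_ holds the current base
    }
    return %count;
}

my %n = count_bases('ATGCGCGATA');     # short made-up test sequence
foreach my $base (sort keys %n) {
    print "$base: $n{$base}\n";
}
```

A short test sequence like this makes it easy to check the counts by hand before running the script on a genome.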

=================================================================================================================================

When I got to the second problem in the assignment associated with the file class02_2018.pl I got a little confused with the wording of the question. If you could please clarify the following parts, I'd appreciate it very much.
a) Is the first operation suppose to be $1=1 or is it suppose to be $i=1? If its the former then there's no value since $i wouldn't have an assignment, and if it was the latter the answer would be 1. and
b) The last line, $i = $i . Òscore andÓ . $i+3; , is that suppose to be one line or two separate assignments?

Yes, there were two mistakes that found their way into the assignment.  (I had updated the class02_2018.pl file, but some of you were faster than I).  And I had added a note to the class page. 
You are correct on the $1=1  (should have been $i=1).   
The strange symbols should have been quotation marks. I have no idea how they ended up in the file (I had mentioned the problem with "intelligent" quotation marks in Word, but this file should never have seen Word; maybe I copied it out of a pdf). The . operator concatenates two strings. E.g., "hello"." world" is the same as "hello world".
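
A short sketch of the . operator follows; the strings and the sub name describe_sum are made up for this illustration and are not taken from the assignment. In Perl, . and + have the same precedence and are evaluated from left to right, so parentheses are the safe way to mark which part is arithmetic.

```perl
use strict;
use warnings;

# Build a sentence from a number and a sum; the parentheses make sure
# the addition is done before the result is concatenated.
sub describe_sum {
    my ($i) = @_;
    return $i . " plus 3 is " . ($i + 3);
}

my $greeting = "hello" . " world";   # same as "hello world"
print "$greeting\n";
print describe_sum(1), "\n";         # prints: 1 plus 3 is 4
```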

======================================================================================

I have some questions about the "student projects"
On the second slide, you stated "students are required to complete a student project (this project can be related to the student's research)"
At this moment, I'm not in a lab and don't have a specific project on my own at the moment.

There are plenty of data available to do a project on.  You certainly do not need to be affiliated with a laboratory. 
Over 40,000 microbial genomes have been sequenced, and the databases are overflowing with metagenome data.  Pick something that interests you.  In the past I had many undergraduates taking the course, and they were, at least when they started the course, not involved in research.  The slides make some suggestions.  A long time ago I did some interferon phylogenies, and on the surface these suggested that in birds and mammals the interferon gene duplicated independently; however, the two versions of the gene appear to be regulated in a similar way.  It might be interesting to revisit this with an eye towards finding possible recombination events between the two versions of the gene.  I know nothing about SIRT, but there are many genes related to disease that will be interesting to research (Toll-like receptors are currently popular).  

My questions are: Do I need to be in a lab, in order to work on this type of project?
NO.
How should I gather data for the project? (I'm interested in interferon, SIRT and other gene that usually related to diseases)
Data bank searches - we will go over this in detail.  

======================================================================================

What is phylogenetics? Does it mean building trees?

Equating phylogeny with trees is a widely propagated misconception. The origins of the word phylo-geny are Greek phylon, race or class, and Greek -geneia, from -genes, born (from the American Heritage Dictionary). Phylogeny describes how the larger taxonomic categories came into existence (as opposed to ontogeny, which describes how the individual organism comes into existence). Botanists discovered long ago that the origin of many species results from the fusion of genomes belonging to different parent species. They coined the term reticulate evolution. There is hardly any crop plant that is not an allopolyploid (i.e., every cell contains copies of genomes from two different parent species. More here.).

Every eukaryotic cell represents the result of a fusion between at least two independent ancestors, an alpha proteobacterium that evolved into the mitochondrion, but whose genes nowadays mostly reside in the nucleus, and a host cell that was a close relative of the archaea. (There might have been many more organisms contributing genes over time, but except for the cyanobacteria, these additional contributors currently are less well defined.)

Many organisms are in fact microbial communities, whose members live in close association (e.g., lichen), and many (all?) microbial communities can be viewed as higher order entities with a shared genetic resource (open source genetics :-) ).

Especially for microorganisms (but see here and here for recent examples of gene transfer between very divergent angiosperms) the evolutionary history of organisms is not tree-like; at best, it can be approximated by a tree. For more on this see
Gogarten, J. P., Doolittle, W. F., Lawrence, J. G. (2002). Prokaryotic evolution in light of gene transfer. Mol Biol Evol 19, 2226-2238.
Zhaxybayeva, O., Gogarten, J. P. (2004). Cladogenesis, Coalescence and the Evolution of the Three Domains of Life. Trends in Genetics 20, 182-187
Zhaxybayeva O, Swithers KS, Lapierre P, Fournier GP, Bickhart DM, DeBoy RT, Nelson KE, Nesbø CL, Doolittle WF, Gogarten JP, and Noll KM (2009) On the Chimeric Nature, Thermophilic Origin and Phylogenetic Placement of the Thermotogales. Proc Nat Acad Sci USA 106(14):5865-70

During the late 19th century, the theory of recapitulation, or Haeckel's biogenetic law, was widely accepted. This theory was often expressed as "ontogeny recapitulates phylogeny", i.e. that the development of an organism exactly mirrors the evolutionary development of the species. The early version of this hypothesis has since been rejected as being oversimplified and misleading. However, modern biology recognizes numerous connections between ontogeny and phylogeny, explains them using evolutionary theory, and views them as supporting evidence for that theory.

See the article on ontogeny and phylogeny.

=======================================================================================

Can we collaborate on our student projects?
Every student needs to hand in their own project. It should be based on the student's own work. Sources (literature, web pages, other students) need to be clearly indicated. Students should not cut and paste from the work of others without giving credit. Plagiarism represents serious academic misconduct.