Topic: I.G.D. Strachan: An Evaluation of "Ev"
|
yersinia
Member
Member # 324
|
posted 15. August 2003 14:16
I vaguely recall asking this question before, but if not:
If two proteins, or a DNA sequence and a protein, evolve to "match" (bind) each other via RM&NS, has information been evolved, or was there a pre-specified target that "front-loaded" this information?
I don't see how Ev can be "evaluated" without an up-front answer to this question.
IP: Logged
|
|
Pim van Meurs
Member
Member # 541
|
posted 16. August 2003 14:58
John,
You seem to want to change the focus from the applicability of a fixed-target solution à la Vete to a co-evolving target à la Ev, to whether or not Ev shows how Rseq can evolve to Rfreq.
First of all, it now seems obvious that mutation and selection in Ev do lead to an increase in information in the genome without the need for a fixed target. Let me point out the inconsistency in your arguments: on the one hand you want to argue that Rseq will evolve to Rfreq because of the front-loading, and at the same time you want to argue the inadequacy of Ev since Rseq does not evolve to Rfreq.
Thus when you state "So obviously, the binding sites will automatically evolve enough information to locate them in the genome, which is precisely what rfreq tells you", it seems that 1) this is hardly obvious when, according to Strachan, Rseq does NOT evolve to Rfreq, and 2) there is no global requirement, like the one found in Vete, for Ev to evolve to.
In other words in Vete the goal is to evolve exactly the amount of information as found in the target string. In Ev there is no such explicit target and the fact that Rseq evolves to match Rfreq shows how mutation and selection can increase the information.
Schneider is very clear: Rfreq is pre-specified, but there is no a priori goal for Rseq to match Rfreq. What Ev shows is that evolutionary mechanisms, which are argued to match what is found in biology, can lead to Rseq evolving towards Rfreq.
Thus when John repeats his (erroneous) claim that 'I think he should have been forthright in his paper that he injected this information and that it constituted an explicit, external target toward which the program evolved', he still seems to confuse Rfreq and Rseq, probably because in Vete Rseq is 'forced' to evolve to exactly Rfreq through the cost function. In Ev no such global requirement exists, and thus, not surprisingly, the information in Ev is fully contingent (which is why Rseq varies around a mean value), while in Vete Rseq is forced to be equal to Rfreq. That's a major difference, and thus the claim that Schneider injected the information Rseq in an explicit manner seems hard to support.
So Strachan fails in his argument about Ev being similar to Vete. In fact, that Rseq in Ev is contingent and varies around a mean value, unlike in Vete, should have been an indication that there was something wrong with the similarity. All Vete does is 'hide' the explicit information in an encoded version of the target string, yet from an information-theoretical perspective the information is equivalent to the target string "The answer is forty two". This can easily be determined by setting the encoding string length to one.
The open issue is the accuracy of calculating Rseq in Ev, and whether Ev explains why in real life Rfreq and Rseq seem to be very similar. Strachan argues that Ev does not achieve this since Ev uses an approximation for the entropy which fails at small sample sizes.
Thus when Strachan discusses the main objections
quote:
1. The information that arises has to be pre-specified in the sitelocations array
Objection 1 describes the experiment, which pre-specifies Rfreq and shows that Rseq evolves towards Rfreq. I fail to see why this is a relevant objection.
quote:
2. Rsequence will not in general be equal to Rfrequency, unless strict conditions are applied to the probability distribution.
Schneider is clear that he uses the assumption of no correlation between the sites. Why this should be an objection is not fully clear to me.
quote:
3. The reason Rsequence appears to be the same as Rfrequency is because it has been calculated incorrectly for the small sample sizes.
This may be a valid objection for Schneider's simulations but do these objections invalidate Schneider's findings that Rseq tends to evolve towards Rfreq? What would the findings be if the genome simulated in Ev would be of a more realistic size and form?
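To make the small-sample point concrete, here is a minimal sketch of how the uncorrected "plug-in" entropy estimate behaves when only a few sequences are available. It is not Ev's actual calculation; the function names and the choice of equiprobable bases are assumptions made for illustration. It computes the exact average of the estimated entropy at a position whose true entropy is 2 bits (true information 0 bits).

```python
import math
from itertools import product

def plugin_entropy(counts):
    """Entropy in bits computed directly from observed base counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def expected_plugin_entropy(n):
    """Exact mean of the plug-in entropy for n draws from equiprobable A, C, G, T."""
    total = 0.0
    for a, b, c in product(range(n + 1), repeat=3):
        d = n - a - b - c
        if d < 0:
            continue
        # Number of ordered samples that give the counts (a, b, c, d).
        ways = math.factorial(n) // (
            math.factorial(a) * math.factorial(b) * math.factorial(c) * math.factorial(d)
        )
        total += ways / 4 ** n * plugin_entropy((a, b, c, d))
    return total

for n in (4, 16, 32):
    h = expected_plugin_entropy(n)
    # Second number: apparent information (2 - H) at a position that truly carries none.
    print(n, round(h, 3), round(2.0 - h, 3))
```

The apparent information shrinks as the number of sequences grows, which is why an uncorrected calculation on a handful of sites can overstate Rseq.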
The answer to this question seems to be found in Kim et al, Journal Of Theoretical Biology 220 (2003) 529-544
quote:
Abstract
Empirically, it has been observed in several cases that the information content of transcription factor binding site sequences (Rsequence) approximately equals the information content of binding site positions (Rfrequency). A general framework for formal models of transcription factors and binding sites is developed to address this issue. Measures for information content in transcription factor binding sites are revisited and theoretic analyses are compared on this basis. These analyses do not lead to consistent results. A comparative review reveals that these inconsistent approaches do not include a transcription factor state space. Therefore, a state space for mathematically representing transcription factors with respect to their binding site recognition properties is introduced into the modelling framework. Analysis of the resulting comprehensive model shows that the structure of genome state space favours equality of Rsequence and Rfrequency indeed, but the relation between the two information quantities also depends on the structure of the transcription factor state space. This might lead to significant deviations between Rsequence and Rfrequency. However, further investigation and biological arguments show that the effects of the structure of the transcription factor state space on the relation of Rsequence and Rfrequency are strongly limited for systems which are autonomous in the sense that all DNA binding proteins operating on the genome are encoded in the genome itself. This provides a theoretical explanation for the empirically observed equality.
Seems that there is after all a solid theoretical foundation for Schneider's claims when Schneider's work is extended to include a more general system.
In fact, Kim et al. highlight
quote:
SCHNEIDER et al. observed an approximate equality of the information content of binding site sequences (Rsequence) and the information content of binding site positions, calculated on the basis of binding site frequency within the genome (Rfrequency) (Schneider et al., 1986). Intuitively, this equality appears plausible: Rsequence, the amount of information required to identify one out of 2^Rsequence sequences as a binding sequence, may be expected to equal the amount of information necessary to address one site out of 2^Rfrequency possible sites on the genome. However, this is just a vague plausibility argument, and counterexamples can easily be constructed. With this contribution, we aim to extend the theoretical basis for studying transcription factors, their binding sites, and their coevolution within the context of regulatory gene networks.
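For readers who want the two quantities in concrete form, the following is a minimal sketch under stated assumptions: Rfrequency is taken as log2(genome length / number of sites), and Rsequence as the sum over aligned positions of (2 - H) bits, with no small-sample correction and independent positions. The function names, the genome size, and the toy alignment are illustrative only.

```python
import math
from collections import Counter

def r_frequency(genome_length, num_sites):
    """Bits needed to single out num_sites positions among genome_length."""
    return math.log2(genome_length / num_sites)

def r_sequence(sites):
    """Sum over aligned positions of (2 - H) bits; no small-sample correction."""
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        counts = Counter(column)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - entropy
    return total

# Hypothetical numbers: 16 sites in a 256-base genome.
print(r_frequency(256, 16))  # 4.0

# Toy alignment: two fully conserved positions, one position with no preference.
sites = ["ACA", "ACC", "ACG", "ACT"]
print(r_sequence(sites))  # 4.0
```

In this toy case the two values coincide by construction; whether they coincide after evolution is exactly what the thread is arguing about.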
Certainly the argument that Schneider pre-specified Rsequence through the specification of the site locations seems to be untenable in this light and in light of my arguments above.
Kim et al continue
quote:
We then show that already for an unbiased (within the chosen coding scheme) a priori probability for the binding behaviour of the transcription factor the quantity Rsequence can significantly deviate from Rfrequency. However, by using biological principles we can show that these deviations can expected to be limited.
They continue to note something interesting as well
quote:
It is interesting to note that this optimization can be achieved by transcription factor binding mechanisms in which each nucleotide in the binding site contributes to the binding energy independently of the nucleotides at the other positions. Indeed, this independence is observed experimentally (Stormo & Fields, 1998).
They find that Rfreq and Rseq are equal if tau_k is a constant function of k, and that B_n,k induces a tendency towards equivalence of Rfreq and Rseq while tau_k displaces Rseq from Rfreq.
Here k is the number of words recognized by the transcription factor tau as binding words, tau_k is an individual transcription factor, and B_n,k is the pattern of binding sites along the genome. They conclude that a uniform distribution of tau_k or a saddle point at k_n would be a remarkable property and require explanation. They find that the amount of displacement of Rseq from Rfreq is determined by the variance of tau_k versus B_n,k at k_n. In fact, given biological constraints, it can be argued that the equivalence of Rseq and Rfreq in biological systems results from the limited size of the transcription factor state space.
They argue that the limited size of the transcription state space is derived from protein biochemistry.
quote:
From this perspective, it appears that the physical size of the device for storing genetic information (i.e. DNA), relative to the size of the components of the machinery interpreting genetic information (i.e. transcription factors etc.), imposes a limit on |THETA| .
In fact they go beyond this conclusion and find that
quote:
The approximate equality of Rsequence and Rfrequency can therefore be characterized as a bioinformatic property which invariantly applies to any genetically autonomous system even if the chemistry or the information storage and interpretation devices underlying biological systems were fundamentally different from those known today.
They argue that the equality of Rseq and Rfreq cannot be deduced from information-theoretic principles alone and that in fact
quote:
While the distribution of |tau_k| cannot be calculated in detail, an upper limit of its magnitude derives from the biological principle that the genome has to encode all components of a living system, including all transcription factors, in a genetically autonomous system. Therefore, the size of the transcription factor space, and also the variance of the distribution of k values within this space, are limited to be small in relation to the size of the genome space. This constraint, which applies independently of the physical and chemical constitution of a living system, is the key principle which confines deviations from Rseq=Rfreq to a very small range.
To conclude:
Strachan's objections to Schneider having pre-specified the information Rfreq seem to be untenable, and the similarity between Ev and Vete is at most superficial.
Strachan's objections to the actual calculations by Schneider and the possible deviations of Rseq from Rfreq are valid, but when biologically relevant parameters are applied it is found from an information-theoretical perspective that Rseq and Rfreq will be very close, as observed in nature. [ 16. August 2003, 15:02: Message edited by: Pim van Meurs ]
IP: Logged
|
|
John Bracht
Member
Member # 5
|
posted 16. August 2003 18:12
quote:
In other words in Vete the goal is to evolve exactly the amount of information as found in the target string. In Ev there is no such explicit target and the fact that Rseq evolves to match Rfreq shows how mutation and selection can increase the information.
Pim, I feel frustrated. Here you simply re-state what you've wrongly been stating all along. It's as if my posts to this point have absolutely been ignored. You have failed to show why VETE has a fixed target any more than Ev (esp. considering the contingency in the actual evolved text strings) and you simply continue to repeat your tired-out argument above. I'm done arguing with you, because to continue this discussion would require basically just re-posting my previous post (where the arguments I actually presented were ignored). I don't have time to go in circles.
If you decide you are willing to try to understand what I'm trying to say, let me know. Otherwise, I think your blinders are keeping you from really understanding the two programs. You seem genuinely unable to see what I'm trying to say, and I'm tired of saying it.
John
IP: Logged
|
|
Rex Kerr
Member
Member # 632
|
posted 16. August 2003 23:45
Micah,
When doing experiments, an experimenter usually has to intentionally construct a scenario that is unlikely to occur by chance while you're watching, but is informative.
The key is that the intentional construct must not change the conclusion of the study by virtue of it being an experiment. For example, if one creates a binding site and an evolutionary algorithm that always produces a model binding protein that binds it, the conclusion is not that every possible binding site necessarily has a binding protein that is specific for it. That's confusing the intentionality present in the experiment (to have a binding site and strong selection towards binders) from the conclusion. The proper conclusion of the experiment would be, if there exists a site at which binding of this protein confers a strong selective advantage then it is plausible that evolution will lead to such a binding. In order to apply the results of such a study to biology, one has to separately address the conditionals: does such a site exist to confer a strong selective advantage? By being careful with our conclusions, we can thus decouple necessary intentionality when doing experiments from our conclusions about a process hypothesized to be non-intentional.
Selection is specifying in Ev a *particular* "binding protein" (actually, a genetically encoded artificial neural network) and a *particular* recognition sequence. This need not have happened; you could very well have specified a binding location, and yet found that Ev was unable to produce a sequence at that location and a "binding protein" that was able to recognize that sequence accurately. And if, for example, you plop a binding protein down in the middle of a transcription start site, it'll block transcription of a gene--potentially with profound consequences. There is absolutely no question that there exist locations in the genome which would provide a strong selective advantage (or disadvantage) to having something bind there. This is where the biological relevance comes from.
I actually don't think that Ev has much more biological relevance than Vete or WEASEL, once you understand the mathematics of information gain in evolutionary algorithms. Given the mathematics, the behavior of Ev, Vete, and WEASEL are all unavoidable consequences.
However, if the mathematics were unknown (and they are to most people, including apparently the reviewers of Schneider's article), Ev would be more biologically relevant because it solves a biologically relevant task, i.e. co-specifying a sequence of DNA and a binding protein for that sequence of DNA, whereas Vete and WEASEL play games with English letters--which is completely irrelevant biologically.
Of course, there are many aspects of Ev which are not biologically realistic, most important of which is perhaps the nature of the binding "protein". Real proteins are not simple artificial neural networks, and they don't treat binding sites as a classification problem. However, the Ev result does demonstrate in principle that binder/target coevolution is feasible. Given the Ev results, one cannot simply dismiss evolution of DNA binding proteins with a particular specificity as implausible. Instead, the job is twofold: first, show how the biological job is different than the one Ev solved, and second, show (quantitatively!) why the difference causes the process to fail in the more biologically plausible case rather than succeeding in the Ev case.
Pim,
I think I'm going to have to agree with John on this one--Vete and Ev both have targets of a similar amount of fixedness. Vete has a substantially simpler mapping than Ev does, but the basic idea is the same. Reward similarity to THE ANSWER IS FORTY TWO, or reward similarity to the desired binding pattern.
John,
Since you seem to keep bringing this point up, I have to wonder why you think it's important. Yes, in the experiment, the binding sites are pre-specified. Do you doubt that binding of proteins in certain locations can be highly relevant biologically, due to a combination of chance position of genes and physical law governing transcription, translation, and so on? Explicit specification of the binding sites is just an abstraction of the observation that position of binding matters, just like explicit specification of THE ANSWER IS FORTY TWO is an abstraction of the observation that particular sequences can be functionally important.
IP: Logged
|
|
John Bracht
Member
Member # 5
|
posted 17. August 2003 03:14
Rex,
I haven't been making any arguments about how realistic the Ev simulation is regarding actual biology. I agree with you that there may well be examples of pre-specified binding sites in biology, and there may be smooth fitness functions which target those binding sites (as in Ev).
However, Pim has been making the rather outlandish claim (in my opinion) that Ev has no fixed target. He has further claimed that VETE is an irrelevant simulation with no similarity to Ev because it does have a fixed target. This was the claim I wanted to challenge, since I felt it was clearly wrong. Once the question of fixed targets is out of the way (as it seems to be, at least for me, though I doubt it is for Pim), we can address what is relevant to biology. In addition, I think we want to address whether the Ev simulation's fitness function is similar to biological fitness functions.
What do you all think Schneider meant to model with the Ev program? Is it transcription (or the binding of transcription factors to cis-regulatory DNA)? He implies in the paper that it's a splicing system (I think). Is he modelling spliceosome function (recognition), and if so, what process is analogous in biology to the evolution of new or novel splice sites?
John
IP: Logged
|
|
Art
Member
Member # 179
|
posted 17. August 2003 08:55
John asked: "what process is analogous in biology to the evolution of new or novel splice sites?"
???
How about the evolution of alternative splicing events? "Recruitment" of new exons, skipping of others, producing proteins with different combinations of exon-encoded modules, etc. This would seem to be highly relevant (and adds some interesting twists to discussions about protein evolution).
I haven't read Schneider's ideas on this, but splice site characterization and evolution would seem to me to be pretty darned important. And interesting.
IP: Logged
|
|
Pim van Meurs
Member
Member # 541
|
posted 17. August 2003 13:09
John: Pim, I feel frustrated. Here you simply re-state what you've wrongly been stating all along.
I disagree, John. You seem to have been confusing Rfreq and Rseq. In Vete the goal is explicitly to evolve Rseq to become identical to Rfreq. In fact, repeating the simulations shows that the information in the decoder/string is exactly Rfreq; the information in the code is exactly "The answer is forty two", with no variations, always the same answer. Surely it should be obvious that this is very different from Ev, in which Rseq is totally contingent and varies around a mean value.
Rex: I think I'm going to have to agree with John on this one--Vete and Ev both have targets of a similar amount of fixedness. Vete has a substantially simpler mapping than Ev does, but the basic idea is the same. Reward similarity to THE ANSWER IS FORTY TWO, or reward similarity to the desired binding pattern.
There is NO desired binding pattern in Ev. Vete ALWAYS evolves to the global pattern "The answer is forty two"; Ev has no such global pattern to evolve to. In fact, Vete's goal is to exactly recover the global string "The answer is forty two", and the information created matches the global string exactly. Not so in Ev.
I find it somewhat frustrating that this simple fact is so easily overlooked. John tries
quote:
However, Pim has been making the rather outlandish claim (in my opinion) that Ev has no fixed target. He has further claimed that VETE is an irrelevant simulation with no similarity to Ev because it does have a fixed target. This was the claim I wanted to challenge, since I felt it was clearly wrong.
Could you please explain why the Rseq in Vete seems to be exactly Rfreq, time after time, while the Rseq in Ev is not so constrained? Of course there is a 'fixed target' in Ev in the sense that the location of the binding sites is (randomly) fixed; that should be obvious. But that has nothing to do with the information that evolves in these binding sites. In fact, the goal of the simulation was to determine if mutation and selection were sufficient to explain the similarities in Rseq and Rfreq. Unlike Vete, however, Ev does not have a global fitness function which requires it to evolve Rseq to Rfreq. We can of course discuss the equivocation of the term fixed target, but in Ev there is no fixed target for the information at the binding sites. There is at most a fixed binding site, but the information at these binding sites is fully contingent and not always exactly equal to Rfreq, as it is with Vete.
That should be obvious, I think. So is my claim outlandish when I point out the discrepancies between Vete and Ev? Perhaps it's time for John to address these issues rather than assert that I am wrong.
RBH has given an excellent overview
quote:
Given that each 'evolutionary' run in VETE is initialized with a different randomization of the key and text, the exact key and text strings that are produced by the simulation are virtually certain to be different from run to run. Yet all produce the same output when key is applied to text: "THE ANSWER IS FORTY TWO." So there is a many-one mapping from key/text strings to output string.
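The many-one mapping described in the quote can be sketched as follows. The exact rule VETE uses to apply key to text is not given in this thread, so the additive, symbol-by-symbol rule below is purely an assumption for illustration, as are the names `decode` and `text_for`.

```python
import random
import string

ALPHABET = string.ascii_uppercase + " "
TARGET = "THE ANSWER IS FORTY TWO"

def decode(key, text):
    """Apply key to text symbol by symbol (hypothetical additive rule)."""
    return "".join(
        ALPHABET[(ALPHABET.index(k) + ALPHABET.index(t)) % len(ALPHABET)]
        for k, t in zip(key, text)
    )

def text_for(key, target=TARGET):
    """Solve for the text string that decodes to target under the given key."""
    return "".join(
        ALPHABET[(ALPHABET.index(c) - ALPHABET.index(k)) % len(ALPHABET)]
        for k, c in zip(key, target)
    )

rng = random.Random(0)
for _ in range(3):
    key = "".join(rng.choice(ALPHABET) for _ in TARGET)
    text = text_for(key)
    # Different key/text pairs, same decoded output.
    print(repr(key), repr(text), "->", decode(key, text))
```

Under this assumed rule every possible key has a matching text, so the number of key/text pairs that decode to the target equals the number of possible keys.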
So the differences between Ev and Vete are quite distinct
quote:
Where does selection come from? In biology, it comes from the environment, where "environment" includes other members of the population of which a replicator is a member (its most formidable competitors); other species that may be prey, predators, parasites, or competitors for the same resources; and the non-biological physical environment.
No distant goal
quote:
Does the fact that experimenters, including Author, set a distant (from the initial conditions) "target" for an evolutionary simulation invalidate the assertion that in biology, evolution proceeds without long-term goals? No. The Avida digital organisms are not 'striving' toward any goal; they are not required to perform any particular input-output mapping in order to replicate. They are merely differentially reproducing at a rate that is a function of the reproductive resources that they have managed to accumulate from behaving in their environment. And when they do things that their selective environment rewards with reproductive advantage, they out-reproduce their brethren. But at every single time slice they are responding to their immediate environment; they know nothing of long-term "targets" or distant goals. In our experimenter-as-God role we know that the selective environment will reward this or that behavior, but the replicators don't. All they 'know' is what they can do in the immediate context.
But in Vete there is an explicit distant goal namely the target string "The answer is forty two" and invariably Vete finds the correct solutions. Or as RBH argues
quote:
The illegitimacy of the VETE demonstration as some sort of refutation of the Lenski, et al., study is due to its definition of fitness according to a global criterion (similarity to a distant goal) as opposed to local criteria (local topography of a fitness landscape).
[ 17. August 2003, 15:46: Message edited by: Pim van Meurs ]
IP: Logged
|
|
Pim van Meurs
Member
Member # 541
|
posted 17. August 2003 16:53
To address the relevance of Ev to biological reality I would like to point to the following paper
arXiv:cond-mat/0301574 v1 29 Jan 2003 On the evolution of gene regulation Johannes Berg, Michael Laessig, and Stana Radic
quote:
Transcription factors and their binding sites emerge as an ideal model system to study molecular evolution. Binding site sequences are short and their sequence space is simple. Moreover, explicit fitness landscapes can be derived from empirical data on binding affinities. For a single site, the simplest examples are of the mesa [9] or of the crater type, see fig. 1(a,b). Landscapes for a pair of sites with cooperative binding interactions are of a similar kind as shown in fig. 2(a-d). They can be used to predict the outcome of specific single-site mutation experiments to a certain extent.
Additional references
Berg OG, von Hippel PH Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J Mol Biol. 1987 Feb 20;193(4):723-50.
Berg OG Selection of DNA binding sites by regulatory proteins. Functional specificity and pseudosite competition.J Biomol Struct Dyn. 1988 Oct;6(2):275-97.
IP: Logged
|
|
Rex Kerr
Member
Member # 632
|
posted 19. August 2003 00:39
Pim,
When Ev randomly fixes its binding sites, that is exactly a global pattern that is sought. From Schneider's results section:
quote:
Then the half of the population making the least mistakes is allowed to replicate by having their genomes replace ('kill') the ones making more mistakes.
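The selection step in that sentence from Schneider can be sketched in a few lines. This is only an illustration of "the better half overwrites the worse half"; the function name, the list-of-bases genome representation, and the placeholder mistake counter are assumptions, not Ev's actual code or recognizer.

```python
def select_and_replicate(population, count_mistakes):
    """Keep the half with the fewest mistakes and copy it over the other half.

    Assumes an even population size; count_mistakes is any scoring function.
    """
    ranked = sorted(population, key=count_mistakes)  # fewest mistakes first
    survivors = ranked[: len(ranked) // 2]
    return survivors + [genome[:] for genome in survivors]

# Placeholder scoring, standing in for Ev's recognizer: count non-'A' bases.
def toy_mistakes(genome):
    return sum(base != "A" for base in genome)

population = [list("CCCA"), list("AAAA"), list("ACCA"), list("AAAC")]
next_generation = select_and_replicate(population, toy_mistakes)
print(["".join(g) for g in next_generation])  # ['AAAA', 'AAAC', 'AAAA', 'AAAC']
```

Note that the step only compares individuals with each other in the current generation; what counts as a "mistake" is where the pre-specified site locations enter.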
And there's no reason why Vete has to have Rseq be equal to Rfreq, except that it has an extraordinarily linear "encryption" key (i.e. utterly useless for real encryption). If a real encryption system had been used (e.g. DES), Vete would have failed utterly to generate the target sequence because good encryption is all about having tiny changes in the input produce unpredictably decorrelated output.
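A small sketch of the contrast drawn here, under loud assumptions: the "linear" mapping is a hypothetical additive rule (not VETE's documented one), and SHA-256 from Python's hashlib is used only as a convenient stand-in for a mapping with strong diffusion. It is a hash, not DES or any cipher. The point is how many output positions change after a single-symbol mutation of the input.

```python
import hashlib

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "

def linear_map(key, text):
    """Hypothetical additive mapping: each output symbol depends on one input symbol."""
    return [(ALPHABET.index(k) + ALPHABET.index(t)) % len(ALPHABET)
            for k, t in zip(key, text)]

def scrambled_map(key, text):
    """Stand-in for a well-mixing mapping: the 32 bytes of SHA-256(key + text)."""
    return list(hashlib.sha256((key + text).encode()).digest())

def positions_changed(a, b):
    return sum(x != y for x, y in zip(a, b))

key = "QWERTYUIOPASDFGHJKLZXCV"
text = "THE ANSWER IS FORTY TWO"
mutant = "X" + text[1:]  # one-symbol mutation

print(positions_changed(linear_map(key, text), linear_map(key, mutant)))  # 1
print(positions_changed(scrambled_map(key, text), scrambled_map(key, mutant)))
```

With the additive rule, a one-symbol mutation moves the output by one symbol, so cumulative selection has a gradient to follow. With the well-mixing stand-in, almost every output byte changes, so closeness to the target says nothing about closeness of the input.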
These are not the reasons why Ev and Vete are of differing biological interest. (The article you linked to gives one reason: Vete is not a model for t.f./binding site evolution.)
John,
Why not move on to a discussion of the degree to which Ev is relevant to biology? It seems to me that if it is relevant, the question of pre-specified binding sites is unimportant.
I will take Schneider at his word that he was trying to model splice acceptor site recognition, but his results are sufficiently general as to be essentially equally applicable to transcription factor/binding site evolution, or even to something like evolution of specificity of tRNA loading.
--------
As an aside, since we keep bringing up Vete, I think there's been a misunderstanding of the relevance of distant ideal targets. WEASEL, Vete, and Ev all have distant ideal targets. This is not inherently a problem: given that chromosomal rearrangement and/or complete reworking of cellular biochemistry is infrequent compared to point mutations, there exist in biological organisms potential ideal targets (or classes thereof).
WEASEL is a bad model because selection does not reward closeness of DNA sequence to the ideal target. Evolution has no long-term goal, and DNA sequence alone generally doesn't provide any advantage. (E.g. if you have seven stop codons in your gene-to-be, one correct conversion of an untranslated alanine to valine isn't going to help your fitness.)
Vete is a bad model because selection does not reward closeness of the linear mapping of a pair of sequences to an ideal target.
Ev is a bad model because selection does not reward closeness of a binding pattern to that of an ideal target (and it doesn't recognize the pattern via perceptrons).
But you can see the direction here--we're moving from biologically ridiculous ideal targets towards those more like the ones that necessarily exist by virtue of being a biochemical organism in a diverse environment. (Vete is a bit of an odd offshoot; the mapping between the genome and target is clearly more complex than in WEASEL, yet it is no more biologically plausible.)
IP: Logged
|
|
Pim van Meurs
Member
Member # 541
|
posted 19. August 2003 00:55
Rex: When Ev randomly fixes its binding sites, that is exactly a global pattern that is sought. From Schneider's results section:
Ah, I see the confusion. I do not disagree that these binding sites are being set up as a goal. But there is no predefined goal as to what the information has to evolve to, unlike Vete which prescribes a fixed string. This seems to go contrary to the claims made by Bracht namely that setting up Rfreq somehow sets a goal for Rseq when in fact such a goal in Ev is not pre-specified but rather follows from the mutation/selection. The suggestion that Ev somehow preloaded the information that eventually arises seems hard to support when one realizes that the Rseq in Ev is not a fixed target as in Vete.
Thus in Vete any information that arises has been predefined in the target string. In Ev however the information has not been predefined and in fact as can be seen, the amount of information that arises is contingent. That is Rseq varies around a mean value. Not in Vete.
Rex: Ev is a bad model because selection does not reward closeness of a binding pattern to that of an ideal target
But there is no 'ideal target' in Ev other than that the distance between the recognizer and the binding sites is given a fitness value. The perceptron statement is more interesting, since this is the same argument the paper I quoted mentions. It is something they claim to have fixed in their more general model, a model which btw does not result in Rseq = Rfreq unless biological restrictions are taken into consideration, in which case Rseq approximates Rfreq.
The linear encryption key in Vete results in Vete finding the same amount of information. No surprise since it focuses in on the distant goal of "The answer is forty two".
There is no 'a priori' reason for Rseq to evolve to Rfreq in Ev, unlike in Weasel and Vete where these values are pre-determined by the distant goal. In fact, Rseq in Ev can be seen to evolve to a mean value close to Rfreq, with a variance. The information generated here is not there because some distant goal prescribed that it reach this value, but is rather the result of the mutation/selection process at these binding sites. Furthermore, as Berg's article suggests, empirical fitness functions for transcription factors and their binding sites can be derived. [ 19. August 2003, 11:56: Message edited by: Pim van Meurs ]
IP: Logged
|
|
Rex Kerr
Member
Member # 632
|
posted 20. August 2003 04:34
Pim,
I still think you're overstating the differences. Perhaps it's a case of it being more difficult to intuitively see how Ev finds a fitness maximum as compared to Vete.
Suppose you have N possible binding sites, out of which you need to recognize K as actual binding sites and not recognize N-K as actual binding sites. For simplicity, assume that each site is disjoint and can be in one of M possible configurations. The goal is to find a partition q such that q of these configurations signal a binding site and M-q do not.
In Ev, one has to solve this problem locally--but in Vete one has to locally come up with a genome that best generates THE ANSWER IS FORTY TWO. Fortunately for our intuition, it's easy to see that answers exist for Vete, and that they're easy to find by cumulative mutation and selection.
Let's consider Ev's problem in more detail. Selection is going to add information to the genome, while mutation is going to whittle it away. As such, we expect to end up in a situation where we have maximum entropy over the entire genome while still solving the problem. The number of states of the genome, given fixed positions for the binding sites, is going to be (M-q)^(N-K)*q^K. This is maximized when (M-q)*q^(K/(N-K)) is maximized (since x^(1/(N-K)) is monotonically increasing in x). Let A = K/(N-K). We then have a bit of calculus and algebra:
code:
d/dq [ (M-q)*q^A ] = 0
-q^A + A*(M-q)*q^(A-1) = 0
A*(M-q)*q^(A-1) = q^A
A*(M-q) = q
A*M = (1+A)*q
q = M * A/(1+A)
Now if we substitute in for A, we find that q = M * K/(N-K) * (N-K)/N = M * (K/N)
So, it's quite simple. The information-theoretically ideal solution is q/M = K/N. But that's just the same thing as saying Rseq = Rfreq. (Since Rseq = -log2(q/M) and Rfreq = -log2(K/N).)
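As a quick numerical sanity check on the calculus, a brute-force search over integer q can confirm where the state count (M-q)^(N-K)*q^K peaks. This is only a sketch: the sizes M, N and K are made-up values and the helper names are mine, not taken from Ev or Vete.

```python
import math

def log_states(q, M, N, K):
    """log2 of (M-q)^(N-K) * q^K, the number of genome states for a given q."""
    return (N - K) * math.log2(M - q) + K * math.log2(q)

def best_q(M, N, K):
    """Integer q in 1..M-1 that maximizes the state count (brute force)."""
    return max(range(1, M), key=lambda q: log_states(q, M, N, K))

# Hypothetical sizes: M configurations per site, N candidate sites, K real ones.
M, N, K = 1024, 256, 16

q = best_q(M, N, K)
rseq = -math.log2(q / M)
rfreq = -math.log2(K / N)
print(f"q = {q}, M*K/N = {M * K / N}, Rseq = {rseq}, Rfreq = {rfreq}")
```

With these numbers M*K/N is an integer, so the search lands on it exactly. When it is not, the best integer q sits next to M*K/N and Rseq matches Rfreq only approximately.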
The fact that Ev more-or-less converges on this answer is not surprising to me, and it seems quite appropriate to call Rseq = Rfreq a fixed target, since that is the optimal solution.
Added in edit: the same analysis indicates that Rfreq = Rseq is (approximately) a fixed target in biology as well.
Also, note that the result can be skewed somewhat if the pattern-recognizer (be it perceptron or binding protein) is not good at picking the fraction q/M but can pick some other (similar) fraction more reliably. However, both proteins and perceptrons appear to be pretty general-purpose, so one wouldn't expect the effect to be particularly large. [ 20. August 2003, 05:04: Message edited by: Rex Kerr ]
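One way to picture the skew is a deliberately crude recognizer that can only demand an exact match at w positions of a 4-letter site, so that q/M is restricted to 4^-w and Rseq can only come in steps of 2 bits. This recognizer is purely a hypothetical illustration (neither a perceptron nor a real binding protein is this limited), and the sizes below are made up.

```python
import math

def log_states(q, M, N, K):
    """log2 of (M-q)^(N-K) * q^K, the number of genome states for a given q."""
    return (N - K) * math.log2(M - q) + K * math.log2(q)

def best_exact_match_width(site_length, N, K):
    """Best number of exactly matched positions w, where q = M / 4**w."""
    M = 4 ** site_length
    widths = range(1, site_length + 1)  # w = 0 would accept every configuration
    return max(widths, key=lambda w: log_states(M // 4 ** w, M, N, K))

# Hypothetical sizes chosen so that Rfreq is an odd number of bits.
site_length, N, K = 6, 256, 8

M = 4 ** site_length
w = best_exact_match_width(site_length, N, K)
rseq = -math.log2((M // 4 ** w) / M)
rfreq = -math.log2(K / N)
print(f"w = {w}, Rseq = {rseq} bits, Rfreq = {rfreq} bits")
```

Because this recognizer cannot hit q/M = K/N, the entropy-maximizing choice settles on a neighbouring value and Rseq differs from Rfreq. A more flexible recognizer would be expected to close most of that gap.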
|
|
Pim van Meurs
Member
Member # 541
|
posted 23. August 2003 15:35
Rex: Added in edit: the same analysis indicates that Rfreq = Rseq is (approximately) a fixed target in biology as well.
I think that the important statement here is "approximately" a fixed target. The "solution" allows for a variation around the maximum likelihood value.
I refer to "Bioinformatic Principles Underlying the Information Content of Transcription Factor Binding Sites" by Jan T. Kim, Thomas Martinetz and Daniel Polani.
About the argument that Rfreq and Rseq should be the same value, they say:
"However, this is just a vague plausibility argument, and counterexamples can easily be constructed."
quote:
We then show that already for an unbiased (within the chosen coding scheme) a priori probability for the binding behaviour of the transcription factor the quantity Rseq can significantly deviate from Rfreq. However, by using biological principles we can show that these deviations can expected to be limited.
In your approach you neglect the evolution of the transcription factor, something the authors of the paper above also point out when they use maximum-entropy methods to show the equality of the two values:
quote:
It is important to notice that evolution of the transcription factor is neglected by this approach. It is assumed that evolution samples at random from the genome space as a state space (i.e. neutral evolution is assumed), while the transcription factor, which structures the genome state space, is assumed to be constant. This is a substantial difference from biological evolution in which transcription factors and genomes coevolve. This motivates us to introduce a more comprehensive model in Section 5.
|
|
Rex Kerr
Member
Member # 632
|
posted 24. August 2003 02:46
Pim: I agree, but the Schneider article doesn't address the question of how the nonuniformity of binding-space affects Rfreq and Rseq. Firstly, it doesn't because perceptrons are not proteins, and have different nonuniformities. Secondly, it doesn't because he doesn't seem to consider the issue.
|
|
Pim van Meurs
Member
Member # 541
|
posted 24. August 2003 14:23
Rex: Schneider's recognizer and binding sites coevolve, which seems to complicate the maximum-entropy calculations.
Schneider does in passing mention the effects and causes of skewed genomic composition and references various papers
Measuring Molecular Information [ 24. August 2003, 14:24: Message edited by: Pim van Meurs ]
|
|
Rex Kerr
Member
Member # 632
|
posted 25. August 2003 00:01
They do complicate it a bit, but perceptrons are pretty general classifiers. In any case, if Strachan's calculations are correct (and they seem to be), Schneider doesn't actually show Rfreq = Rseq, but only Rfreq ~= Rseq, which we already knew from the max entropy argument.
Anyway, regardless of whether Ev shows us nothing, little, or much, I'm much more curious as to whether and why John (and others?) thinks that similarities between Ev and Vete are telling.
|
|
|