ISCID News Editor
Member # 1417
posted 09. January 2005 19:30
Synopsis courtesy PLoS Biology
Genome Sequencing: Using Models to Predict Who's Next
Published January 4, 2005
Copyright: © 2004 Public Library of Science. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Citation: (2005) Genome Sequencing: Using Models to Predict Who's Next. PLoS Biol 3(1): e25.
It's hard to believe it was just ten years ago that scientists reported the first complete genome ^ sequence of an organism, the bacterial pathogen Haemophilus influenzae. The list has grown considerably since then: add over 160 bacterial species (and counting), most major model organisms, and an ever-growing list of mammals—including, of course, humans. With 99% of our genome now fully sequenced, the Human Genome Project's next major goal is to identify all the functional elements contained in our 2.85 billion nucleotides. Such an effort is hardly trivial: producing the sequence of a mammalian-size genome can run from $10 to $50 million, the estimated price tag of the Cow Genome Project.
In an ideal world, any organism would be fair game for sequencing, but in the real world, sequencing resources are scarce. Comparing genome sequences turns out to be a great way to identify regions that have important functions, but comparative genomics studies would be far more efficient if scientists could figure out in advance which genomes would reveal the most information ^ about a particular question. Taking up that challenge, computational biologist Sean Eddy reports a statistical model that predicts how many genomes, and at what evolutionary distance ^, are needed for effective comparative genomic analyses. In addition to confirming some working principles of comparative genomics, the model also reveals a surprisingly simple guideline for future studies.
Comparative genomics works by aligning sequences of different organisms to identify patterns that operate over both large and small distances. Aligning mouse chromosomes ^ with human chromosomes, for example, shows that 99% of our protein-coding genes align with homologous sequences in mice. Underlying such analyses is the principle that DNA ^ sequences that are highly conserved are likely to be functionally important. A common assumption is that adding more comparative genomes to the alignment helps distinguish functionally significant from irrelevant conserved sequences^.#
How do you go about creating an abstract model that captures what Eddy calls the “essential flavor of comparative genomic analysis”? His model puts aside the specific characteristics of individual organisms, genomic features, and analysis programs# in favor of identifying higher-level patterns and scaling relationships, specifically between the number of genomes, evolutionary distance, and feature size (features include genetic elements like exons ^ and transcription ^ factors).
The model shows that the number of genomes required to identify conserved regions—that is, regions evolving under selection—scales inversely with the size of the feature being sought. Thus, to look for conserved sequences half as long, you need twice as many genomes, assuming a constant evolutionary distance and statistical power. For example, to identify a conserved human feature the size of a coding exon (about 50 nucleotides), it is sufficient to compare just the human and mouse genomes. But to identify conserved single nucleotides, you would need 55 comparative genomes at “mouse-like” evolutionary distances (roughly 75 million years).
Things get a little trickier when varying evolutionary distance. We observe sequences only as they are today: we can't tell how many times a site has changed, for example, or whether it changed at some point and then changed back. But at short evolutionary distances—where it's safer to assume no sites have changed more than once—the evolutionary distance is roughly the same as the fraction of sites observed to have changed, and evolutionary distance and the number of genomes needed scale inversely. Therefore, the closer the evolutionary distance, the more genomes needed: one would need seven times as many comparative genomes using human/baboon distances, for example, compared to human/mouse distances. So when it comes to using primate sequences to study the human genome, our most distant relatives (such as lemurs) offer far more comparative analysis power than our next of kin (chimps and bonobos).
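To make the scaling rules in the two preceding paragraphs concrete, here is a rough back-of-envelope sketch in Python (a reader's illustration, not from the paper or the synopsis). It treats the number of comparative genomes needed as proportional to 1/(feature length × per-genome distance), with the constant calibrated, as an assumption, from the exon example above; the article's actual model handles statistical power more carefully.

```python
# Rough illustration (not from the paper) of the two inverse-scaling rules in
# the synopsis: at fixed stringency, genomes needed ~ 1 / (feature length),
# and, at short distances, ~ 1 / (evolutionary distance per genome).

MOUSE_LIKE_DISTANCE = 0.31    # neutral substitutions/site per branch (article's "mouse-like" value)
BABOON_LIKE_DISTANCE = 0.03   # article's "baboon-like" value

# Calibration constant (an assumption for illustration): one mouse-distance
# genome suffices for an exon-sized feature of ~50 nucleotides.
K = 1 * 50 * MOUSE_LIKE_DISTANCE

def genomes_needed(feature_length, distance_per_genome):
    """Rule-of-thumb estimate of comparative genomes needed (rough, illustrative)."""
    return K / (feature_length * distance_per_genome)

print(genomes_needed(50, MOUSE_LIKE_DISTANCE))   # 1.0: exon-sized feature, mouse distance
print(genomes_needed(25, MOUSE_LIKE_DISTANCE))   # 2.0: half the feature size, twice the genomes
print(genomes_needed(1, MOUSE_LIKE_DISTANCE))    # 50: close to the ~55 quoted for single nucleotides
print(genomes_needed(50, BABOON_LIKE_DISTANCE))  # ~10: closer genomes, more of them needed
                                                 # (the synopsis quotes roughly 7-fold; this linear
                                                 # rule is only a rough guide at these distances)
```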
While this model confirms the intuitive assumption ^* that identifying smaller features requires more genomes, it reveals an inverse scaling relationship far more direct, and precise, than previously imagined. With the next phase of the Human Genome Project under way, Eddy's model offers valuable guidelines for identifying which genomes and how many might best meet this ambitious goal.
[Emphases added by ISCID News Editor]
[Link-underlined terms with ^ indicate linked entry in ISCID Encyclopedia of Science and Philosophy as added by ISCID News Editor]
[^* Link to a 2001 article by John Bracht which elucidates the pitfalls of assuming that invented (evolutionary) algorithms are capable of generating novel information, and that 'tuning' and 'selective parameterisation' of the inputs and boundary conditions of computational algorithms do not affect algorithmic outcomes - ISCID News Editor]
[# For a powerful critique of the practice of claiming unbiased, informationally novel confirmation of Darwinian assumptions through computational analyses that themselves incorporate evolutionary assumptions, and of biasing algorithmic searches and analyses by displacing critical search information into the search constraint space, see
Dembski, William A., The Design Revolution, ch. 35, "Displacement and the No Free Lunch Principle," InterVarsity Press: Downers Grove, 2004 - ISCID News Editor]
From the Research Article:
A Model of the Statistical Power of Comparative Genome Sequence Analysis
Sean R. Eddy
Howard Hughes Medical Institute and Department of Genetics, Washington University School of Medicine, Saint Louis, Missouri, United States of America
Comparative genome sequence analysis is powerful, but sequencing genomes is expensive. It is desirable to be able to predict how many genomes are needed for comparative genomics, and at what evolutionary distances. Here I describe a simple mathematical model for the common problem of identifying conserved sequences. The model leads to some useful rules of thumb. For a given evolutionary distance, the number of comparative genomes needed for a constant level of statistical stringency in identifying conserved regions scales inversely with the size of the conserved feature to be detected. At short evolutionary distances, the number of comparative genomes required also scales inversely with distance. These scaling behaviors provide some intuition for future comparative genome sequencing needs, such as the proposed use of “phylogenetic shadowing” methods using closely related comparative genomes, and the feasibility of high-resolution detection of small conserved features.
Received June 9, 2004; Accepted November 2, 2004; Published January 4, 2005
Copyright: © 2005 Sean R. Eddy. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abbreviations: FP, false positive probability; FN, false negative probability; HKY, Hasegawa-Kishino-Yano; LLR, log likelihood ratio
Academic Editor: Ross C. Hardison, Pennsylvania State University, United States of America
Citation: Eddy SR (2005) A Model of the Statistical Power of Comparative Genome Sequence Analysis. PLoS Biol 3(1): e10.
Comparative genome sequence analysis is a powerful means of identifying functional DNA sequences by their evolutionary conservation [1,2,3]. It will be instrumental for achieving the goal of the Human Genome Project to comprehensively identify functional elements in the human genome. How many comparative genome sequences do we need? Where is the point of diminishing returns, after which sequencing another koala or bat does not contribute significant information to human genome analysis? Since sequencing is expensive and capacity remains limited, one would like to address this issue as rigorously as possible.
Empirical evaluations of candidate comparative genomes have become important in allocating sequencing resources. Pilot sequencing and analysis in Saccharomyces and Drosophila species were done to choose appropriate species for comparative genome sequencing [5,6]. A pilot sequencing effort is underway for a number of mammalian genomes to evaluate their utility for human genome analysis. Given the complexity of genomes, empirical studies are necessary. However, one would also like to complement this with higher-level, general insights that are independent of the details of particular analysis programs, organisms, and genomic features.
Cooper et al. proposed a mathematical model of one important type of comparative genome analysis. They framed a question amenable to quantitative modeling: how many comparative genomes, and at what distances, are required to detect that an individual base in a target genome is “neutral” (inferred to be evolving at the neutral rate) as opposed to “conserved” (inferred to be under purifying selection)? Their model infers a nucleotide site to be conserved if it is 100% identical to homologous sites in N comparative genomes. The key parameters are the independent branch lengths (d_i) contributed to a phylogeny by each new comparative genome (i), measured in neutral substitutions per site. More neutral evolutionary distance makes it more likely that neutral sites will have one or more substitutions in the alignment. Analytical strength increases as a function of the total neutral branch length in the phylogeny (Σ_i d_i), because the probability that a neutral site has no changes in any branch of the phylogeny (and thus would be misclassified as conserved) is taken to be approximately e^(−Σ_i d_i). Based on the model, they concluded that 5.0 neutral substitutions/site of total branch length (about 10–20 well-chosen mammalian genomes) would approach “single nucleotide resolution” for human genome analysis, with a false positive probability (FP) of less than e^(−5.0) per invariant site.
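As a quick numerical check of the figures quoted above, the approximation FP ≈ e^(−Σ_i d_i) can be evaluated directly; the individual branch lengths in this sketch are illustrative assumptions, and only the 5.0 total and the e^(−5.0) bound come from the text:

```python
import math

# The Cooper et al. approximation quoted above: a neutral site is misclassified
# as conserved only if it shows no change on any branch, with probability about
# exp(-(total independent neutral branch length)).
def false_positive_per_site(branch_lengths):
    """FP per invariant site ~= exp(-sum of independent neutral branch lengths d_i)."""
    return math.exp(-sum(branch_lengths))

# Illustrative branch lengths (only the 5.0 total is from the text): roughly
# 17 comparative genomes each contributing ~0.3 neutral substitutions/site.
branches = [0.3] * 17
print(false_positive_per_site(branches))   # ~exp(-5.1) ~= 0.006
print(math.exp(-5.0))                      # the e^(-5.0) ~= 0.0067 bound quoted above
```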
This model has some limitations that seem serious enough to question the proposed target of 10–20 mammalian genomes. Most importantly, it assumes that conserved sites are invariant. Few conserved features are absolutely invariant. If invariance is required to infer conservation, the fraction of truly conserved sites that are wrongly inferred to be neutral (because a substitution is seen in one of the comparative genomes) asymptotically approaches one as the number of comparative genomes or their evolutionary distance increases. We want to consider not just our FP, but our statistical power—our ability to successfully detect features that are conserved.
Additionally, single nucleotide resolution may not be the most relevant goal. It is useful to consider single nucleotide resolution as an ultimate limit on comparative analyses—one can imagine plausible analyses of single bases, and certainly individual codons ^—but we are mostly concerned with identifying conserved features of greater length, such as exons or transcription factor binding sites.
Nonetheless, the level of abstraction introduced by Cooper et al. is attractive. There is a need for better intuitions for planning comparative genome sequencing. How many more comparative genomes are needed as one looks for smaller and smaller conserved features—from exons to regulatory sites to single codons or even single nucleotides? How many more genomes are needed as one uses more and more closely related comparative genomes, in order to improve the chances that homologous lineage-specific features are found and correctly aligned [8,9]? Precise answers will be elusive, because genome biology is complex, but perhaps there are rough, useful scaling relationships amongst comparative genome number, evolutionary distance, and feature size. To explore this, I have extended the ideas introduced by Cooper et al. and developed an abstract model that seems to capture the essential flavor of comparative genome analysis.
Description of the Model
A “feature” is a sequence of L nucleotide ^ sites in the target genome. We assume we have a correct, ungapped multiple sequence alignment of this sequence to N homologous features from N additional comparative genomes, and that the L sites are independent.
In the NL nucleotides in the aligned comparative sequences, we count how many changes are observed relative to the target feature sequence; call this c. If c is greater than some threshold C, we infer the feature is evolving at the neutral rate. If, on the other hand, c is less than or equal to C, we infer the feature is conserved.
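As a minimal sketch of this decision rule (illustrative code, not from the paper), assuming the alignment is already in hand as plain strings:

```python
def classify_feature(target, comparative_seqs, cutoff_C):
    """Count observed changes c across the aligned comparative sequences and
    apply the decision rule: infer 'neutral' if c > C, otherwise 'conserved'.

    target:            the feature's L nucleotides in the target genome
    comparative_seqs:  N aligned, ungapped homologous sequences, each of length L
    """
    c = sum(a != b
            for seq in comparative_seqs
            for a, b in zip(target, seq))
    label = "neutral" if c > cutoff_C else "conserved"
    return label, c

# Toy example: an L = 8 feature, N = 2 comparative genomes, cutoff C = 3.
print(classify_feature("ACGTACGT", ["ACGTACGA", "ACGAACGT"], cutoff_C=3))
# -> ('conserved', 2): two changes observed among the N*L = 16 compared sites
```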
We assume that each comparative genome is independently related to the target genome by a branch length of D neutral substitutions per site, that is, a uniform star topology, with the target at the root, and equal length branches to the comparative genomes at the leaves. A uniform star topology allows us to model how evolutionary distance affects comparative analysis at an abstract level, as a single variable D, independent of the details of real phylogenies. The biologically unrealistic placement of the known target at the root simplifies the mathematics, and does not significantly affect the results compared to making the more realistic assumption of an unknown ancestor at the root of a tree with N + 1 leaves, including the target.
We assume that the only difference between conserved features and neutral features is that conserved features evolve more slowly, by a relative rate coefficient ω. A conserved site accumulates an average of ωD substitutions, whereas a neutral site accumulates an average of D substitutions. ω = 0 for an absolutely conserved feature; ω = 1 for a neutrally evolving feature. At short evolutionary distances, we expect about c = DNL changes in neutral features, and c = ωDNL changes in conserved features, with binomial densities for P(c) around those values.
To model the probability that two nucleotides diverged by D or ωD substitutions will be observed to be identical (to deal with multiple substitutions at one site), we assume a Jukes-Cantor process in which all types of base substitution occur at the same rate. Under a Jukes-Cantor model, the probability that two sites that have diverged by D substitution events are identical is 1/4 + (3/4)e^(−4D/3), which approaches 25% at infinite divergence.
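For reference, the Jukes-Cantor identity probability just quoted is easy to compute directly; a minimal sketch (standard Jukes-Cantor algebra, written here for illustration, not code taken from the paper):

```python
import math

def jc_identity(d):
    """Probability that two sites diverged by d substitutions/site are observed
    identical under a Jukes-Cantor model (all substitution types equally likely)."""
    return 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)

def jc_difference(d):
    """Probability of observing a difference; the per-site 'change' probability."""
    return 1.0 - jc_identity(d)

print(jc_identity(0.0))    # 1.0  (no divergence)
print(jc_identity(0.31))   # ~0.75 at a "mouse-like" branch length
print(jc_identity(100.0))  # ~0.25 at (effectively) infinite divergence
```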
Given these assumptions, the FP in a comparative analysis (the probability that we erroneously infer that a neutral feature is conserved) is the probability that a neutral feature happens to have C or fewer observed changes (a cumulative binomial distribution):
...[formula omitted - ISCID News Editor] ...
The model therefore depends on four parameters: the size of the conserved feature, L, the relative rate of evolution of the conserved feature, ω, the number of comparative genomes, N, and the neutral distance of the comparative genomes from the target genome, D. The threshold C is usually not an input parameter (except in the special case of invariance; C = 0). Rather, we find the minimum genome number N (or feature size L) at which there exists any cutoff C that can satisfy specified FN and FP thresholds.#
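Putting the pieces together, here is a minimal sketch in Python of how the FP/FN calculation and the search for a minimum genome number N could be implemented, following the description above (a cumulative binomial over observed changes, with per-site change probabilities from the Jukes-Cantor expression). The function names, search loop, error thresholds, and example parameter values are illustrative assumptions, not the paper's code or its stringency settings.

```python
import math

def p_change(d):
    """Jukes-Cantor probability of observing a difference between two sites
    diverged by d substitutions/site (complement of the identity probability)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, c) * p**c * (1.0 - p)**(n - c) for c in range(k + 1))

def error_rates(L, omega, N, D, C):
    """FP: a neutral feature (per-site divergence D) shows <= C changes;
    FN: a conserved feature (per-site divergence omega*D) shows > C changes."""
    fp = binom_cdf(C, N * L, p_change(D))
    fn = 1.0 - binom_cdf(C, N * L, p_change(omega * D))
    return fp, fn

def min_genomes(L, omega, D, fp_max=0.05, fn_max=0.05, n_max=200):
    """Smallest N for which some cutoff C meets both error thresholds
    (returns (N, C), or None if no N up to n_max works)."""
    for N in range(1, n_max + 1):
        for C in range(N * L + 1):
            fp, fn = error_rates(L, omega, N, D, C)
            if fp <= fp_max and fn <= fn_max:
                return N, C
    return None

# Illustrative run: an exon-sized feature (L = 50) evolving 5-fold slower than
# neutral (omega = 0.2) at a "mouse-like" distance (D = 0.31).  The 5% error
# thresholds are my own choices, so exact numbers will differ from the paper's
# figures; expect a small N (one or two mouse-distance genomes).
print(min_genomes(L=50, omega=0.2, D=0.31))
```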
The Cooper et al. model is essentially a special case where L = 1 (single nucleotide resolution), ω = 0 (conserved sites are always invariant), C = 0 (only invariant sites are inferred to be conserved), and FN = 0 by definition (if all conserved sites are invariant, and all invariant sites are inferred to be conserved, then all conserved sites are detected). Also, instead of using an evolutionary model to account for multiple substitutions at one site (saturation), Cooper et al. make a Poisson assumption that the probability of observing no change at a comparative site is e^(−D), which is only valid for small D.
The model discriminates features based on their relative rate of evolution. The same equations could be used to detect features evolving faster than the neutral rate (positively selected features), or to detect highly conserved features on a background of less strongly conserved sequence, as, for instance, transcription factor binding sites in an upstream region often appear [11,12]. For simplicity, I will only talk about discriminating “conserved” from “neutral” features here.
Reasonable Parameter Values #
The feature length L and conservation coefficient ω abstractly model the type of feature one is looking for. I use L = 50, L = 8, and L = 1 as examples of detecting small coding exons, transcription factor binding sites, and single nucleotides, respectively, solely by sequence conservation. On average, conserved exons and regulatory sites appear to evolve about 2- to 7-fold slower than neutral sequences (ω = 0.5–0.15) [7,8,13,14,15]. I use 5-fold slower (ω = 0.2) in most cases discussed below. Typically, one doesn't know L or ω when looking for novel features. These two parameters behave as bounds: if one can detect a specified feature, larger and/or more conserved features are also detected.
The model's single distance parameter, D, abstractly represents the independent neutral branch length contributed by each comparative genome. In a phylogenetic tree of the target with N > 1 comparative genomes that are as independent from each other as possible, we can roughly consider the independent branch length contributed by each comparative genome to be one-half its pairwise distance to the target genome, because in a real tree (with unknown common ancestors, as opposed to placing the target at the root of a uniform star topology) all comparative genomes share at least one branch leading to the target. Thus the figures highlight D = 0.03, 0.19, and 0.31 as “baboon-like,” “dog-like,” and “mouse-like” distances from human, 50% of one set of pairwise neutral distance estimates of 0.06, 0.38, and 0.62, respectively, arbitrarily chosen from the literature. These labels are solely to give some intuition for what the model's D parameter means. The correspondence between D and real branch lengths is crude. Real neutral distance estimates are a subject of substantial (up to about 2-fold) uncertainty in the literature, and there are regional variations and strong context effects on neutral substitution rates in mammalian genomes [16,17]. More importantly, the model's uniform star topology, though it allows a high-level analysis in terms of just two parameters, D and N, makes direct comparison to real phylogenies difficult. Large numbers of equidistant, independently evolved mammalian genomes do not occur in reality. Real genomes are not independent, and will generally contribute an independent neutral branch length of less than one-half of their pairwise distance to the target genome.
Critically, the model assumes that homologous features are present, correctly detected, and correctly aligned. In reality, with increasing evolutionary distance, features can be gained, lost, or transposed [14,18,19,20,21], the ability to detect homology by significant sequence similarity decreases, and alignments become less reliable. The frequency of effects like loss, gain, and transposition depends on the biology of particular types of features, so departures from the model's “alignment assumptions” are difficult to model abstractly. However, minimally, we can posit a maximum neutral distance, Dmax, beyond which the alignment assumptions will not hold, based just on the ability of alignment programs to recognize and align homologous DNA sequences. Roughly speaking, reliability of DNA sequence alignments begins to break down at about 70% pairwise identity. For alignments of conserved features evolving 5-fold slower than neutral, this suggests Dmax ≈ 0.15/0.2 = 0.75; Figures 1 and 2 show results out a little further, to Dmax = 1.0.
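A rough arithmetic check of the Dmax rule of thumb, under the assumption (my inference, not stated explicitly in the excerpt) that the ~70% identity limit corresponds to roughly 0.15 substitutions per site per branch within the conserved feature itself:

```python
# Rough check of the Dmax rule of thumb quoted above.  Assumption (mine, for
# illustration): alignment reliability breaks down once the conserved feature
# has accumulated about 0.15 substitutions/site along each branch (roughly the
# 70% pairwise-identity regime).  Dividing by the conservation coefficient
# omega converts this into an equivalent neutral-rate branch length.
conserved_branch_limit = 0.15   # substitutions/site on the conserved feature (assumed)
omega = 0.2                     # conserved features evolving 5-fold slower than neutral
d_max = conserved_branch_limit / omega
print(d_max)                    # 0.75, matching the Dmax ~= 0.15/0.2 = 0.75 quoted above
```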
Read full research article at PLoS Biology
Read the synopsis at PLoS Biology
[ 10. January 2005, 18:56: Message edited by: ISCID News Editor ]