ISCID News Editor
Member # 1417
posted 25. April 2006 05:05
Source: PLoS Computational Biology
Identification and Classification of Conserved RNA Secondary Structures in the Human Genome
Editor: Richard Durbin, Sanger Institute, United Kingdom
Received: September 8, 2005; Accepted: March 6, 2006; Published: April 21, 2006
Jakob Skou Pedersen, Gill Bejerano, Adam Siepel, Kate Rosenbloom, Kerstin Lindblad-Toh, Eric S. Lander, Jim Kent, Webb Miller, David Haussler
The discoveries of microRNAs and riboswitches, among others, have shown functional RNAs to be biologically more important and genomically more prevalent than previously anticipated. We have developed a general comparative genomics method based on phylogenetic stochastic context-free grammars for identifying functional RNAs encoded in the human genome and used it to survey an eight-way genome-wide alignment of the human, chimpanzee, mouse, rat, dog, chicken, zebra-fish, and puffer-fish genomes for deeply conserved functional RNAs. At a loose threshold for acceptance, this search resulted in a set of 48,479 candidate RNA structures. This screen finds a large number of known functional RNAs, including 195 miRNAs, 62 histone 3′UTR stem loops, and various types of known genetic recoding elements. Among the highest-scoring new predictions are 169 new miRNA candidates, as well as new candidate selenocysteine insertion sites, RNA editing hairpins, RNAs involved in transcript auto regulation, and many folds that form singletons or small functional RNA families of completely unknown function. While the rate of false positives in the overall set is difficult to estimate and is likely to be substantial, the results nevertheless provide evidence for many new human functional RNAs and present specific predictions to facilitate their further characterization.
Structurally functional RNA is a versatile component of the cell that comprises both independent molecules and regulatory elements of mRNA transcripts. The many recent discoveries of functional RNAs, most notably miRNAs, suggest that many more are yet to be found. Computational identification of functional RNAs has traditionally been hampered by the lack of strong sequence signals. However, structural conservation over long evolutionary times creates a characteristic substitution pattern, which can be exploited with the advent of comparative genomics. The authors have devised a method for identification of functional RNA structures based on phylogenetic analysis of multiple alignments. This method has been used to screen the regions of the human genome that are under strong selective constraints. The result is a set of 48,479 candidate RNA structures. For some classes of known functional RNAs, such as miRNAs and histone 3′UTR stem loops, this set includes nearly all deeply conserved members. The initial large candidate set has been partitioned by size, shape, and genomic location and ranked by score to produce specific lists of top candidates for miRNAs, selenocysteine insertion sites, RNA editing hairpins, and RNAs involved in transcript auto regulation.
Many new classes of functional RNA structures (fRNAs), such as snoRNAs, miRNAs, splicing factors, and riboswitches [1–3], have been discovered over the last few years. These structures function both as independent molecules and as part of mRNA transcripts. These recent discoveries verify that fRNAs fulfill many important regulatory, structural, and catalytic roles in the cell, and suggest that perhaps only a small fraction of these fRNAs are currently identified [1,3,4].
The development of computational methods that can efficiently identify fRNAs by comparative genomics has been hampered by the fact that fRNAs often exhibit only weakly conserved primary-sequence signals. Fortunately, the stem-pairing regions of fRNA structures evolve mostly with a characteristic substitution pattern such that only substitutions that maintain the pairing capability between paired bases will be allowed. This leads to compensatory double substitutions (e.g., GC ↔ AU) and to a few types of compatible single substitutions (e.g., GC ↔ GU); the latter made possible by RNA's ability to form a non–Watson-Crick pair between G and U. This evolutionary signal can be exploited for comparative identification of fRNAs [6–12].
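The substitution pattern described above can be made concrete with a small sketch. This is an illustration only, not code from the paper: the function name `classify_pair_substitution` and the labels it returns are made up for this example, and the set of allowed pairs is assumed to be the four Watson-Crick pairs plus the G-U wobble pair.

```python
# Base pairs assumed able to form a stem: Watson-Crick pairs plus G-U wobble.
# (Assumption for illustration; the paper's model is probabilistic, not rule-based.)
PAIRING = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def classify_pair_substitution(before, after):
    """Classify a change between two base pairs, each given as a (5', 3') tuple."""
    if before not in PAIRING:
        raise ValueError(f"starting pair {before} cannot pair")
    if before == after:
        return "unchanged"
    if after not in PAIRING:
        # Pairing capability is lost, so the stem would be disrupted.
        return "disruptive"
    # Count how many of the two positions changed.
    changed = sum(b != a for b, a in zip(before, after))
    return "compensatory double" if changed == 2 else "compatible single"


print(classify_pair_substitution(("G", "C"), ("A", "U")))  # compensatory double
print(classify_pair_substitution(("G", "C"), ("G", "U")))  # compatible single
print(classify_pair_substitution(("G", "C"), ("G", "A")))  # disruptive
```

A real method would weigh such changes by how likely they are along the phylogenetic tree rather than sorting them into hard categories.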
The many non-human vertebrate genomes now sequenced can be aligned against the human genome, leading to a multiple alignment with considerable information about the evolutionary process at every position [13–15]. Given a diverse enough set of genomes, comparative methods that can make effective use of this evolutionary information should in principle be able to efficiently identify the conserved human fRNAs. We have developed a comparative method called EvoFold for functional RNA-structure identification in multiple sequence alignments. EvoFold makes use of a recently devised model construction, a phylogenetic stochastic context-free grammar (phylo-SCFG) [refs], which is a combined probabilistic model of RNA secondary structure and sequence evolution. Phylo-SCFGs use stochastic context-free grammars (SCFGs) [refs] to define a prior distribution over possible RNA secondary structures, and a set of phylogenetic models [refs] to evaluate how well the substitution pattern of each alignment column conforms with its secondary-structure annotation. EvoFold uses a very general model of RNA secondary structures that allows it to model everything from short hairpins to complex multiforking structures, including novel structures not seen in its training set. The substitution process explicitly models co-evolution of paired bases within the structure using the phylogenetic tree and evolutionary branch lengths relating the sequences of the alignment. Stem-pairing regions are detected not only by the presence of compensatory substitutions, but also by the presence of compatible single substitutions and the overall slower rate of evolution. We have built a human-referenced eight-way vertebrate whole-genome alignment and used EvoFold to search for functional RNAs in the human genome. This search resulted in a total of 48,479 candidate RNA structures. 
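To give a feel for what "evaluating how well each alignment column conforms with its secondary-structure annotation" means, here is a deliberately simplified sketch. It is not EvoFold and not a phylo-SCFG: it swaps the probabilistic model for a plain count of how many aligned sequences can still pair at each annotated stem position. The function names, the dot-bracket input format, and the toy alignment are all assumptions made for this example.

```python
# Pairs assumed able to form a stem: Watson-Crick pairs plus G-U wobble.
PAIRING = {"GC", "CG", "AU", "UA", "GU", "UG"}


def paired_columns(structure):
    """Return sorted (i, j) column pairs from a dot-bracket structure string."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure")
    return sorted(pairs)


def pairing_support(alignment, structure):
    """For each annotated pair, the fraction of sequences whose bases can pair."""
    support = {}
    for i, j in paired_columns(structure):
        ok = sum((seq[i] + seq[j]) in PAIRING for seq in alignment)
        support[(i, j)] = ok / len(alignment)
    return support


# Toy three-sequence alignment of a short hairpin (made-up data).
alignment = [
    "GCAAAGC",  # reference: G-C and C-G pairs
    "GUAAAAC",  # inner pair changed C-G -> U-A (compensatory)
    "GCAAAAC",  # inner pair changed C-G -> C-A (pairing lost)
]
print(pairing_support(alignment, "((...))"))
```

In this toy case the outer pair is supported by all three sequences and the inner pair by two of three. The actual method goes much further: it uses the phylogenetic tree and branch lengths, models the slower rate of evolution in stems, and sums over possible structures instead of taking one as given.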
Based on estimates of the false-positive rate, which unfortunately are associated with very large uncertainties, we estimate that the candidate set contains approximately 18,500 substructures of approximately 10,000 RNA transcripts. These numbers are derived using an estimated false-positive rate of 62%. Among the highest-scoring candidates, where the estimated false-positive rate is much lower, this screen finds a large number of known functional RNAs, and contains new candidate miRNAs, selenocysteine insertion sites, RNA editing hairpins, RNAs involved in transcript auto regulation, and many folds that form singletons or small functional RNA families of completely unknown function.
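The arithmetic behind these figures can be checked with a short sketch. The function name `expected_true_candidates` is made up for this example; the two inputs (48,479 candidates and a 62% false-positive rate) come from the text above.

```python
def expected_true_candidates(n_candidates, false_positive_rate):
    """Expected number of genuine structures among the candidates."""
    return n_candidates * (1 - false_positive_rate)


estimate = expected_true_candidates(48_479, 0.62)
print(round(estimate))  # 18422
```

This gives about 18,400, in line with the "approximately 18,500" quoted above; the small gap presumably reflects rounding of the reported rate, which is an assumption on my part. The further step from substructures to roughly 10,000 transcripts depends on how many structures fall within one transcript and cannot be derived from these two numbers alone.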
[Emphases added by ISCID News Editor]
[Link-underlined terms in text (as added by ISCID News Editor) indicate linked entry in ISCID Encyclopedia of Science and Philosophy]
Read the full research paper at PLoS Computational Biology
Copyright[PLoS]: © 2006 Pedersen et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
* To whom correspondence should be addressed. E-mail: firstname.lastname@example.org
[ 25. April 2006, 05:19: Message edited by: ISCID News Editor ]