ISCID News Editor
Member # 1417
posted 16. October 2004 11:31
BioMed Central, BMC Bioinformatics, September 9, 2004
Copyright © 2004 Sinha et al; licensee BioMed Central Ltd.
BMC Bioinformatics. 2004; 5: 129.
doi: 10.1186/1471-2105-5-129. Published online 2004 September 9.
Cross-species comparison significantly improves genome-wide prediction of cis-regulatory modules in Drosophila
Saurabh Sinha, Mark D Schroeder, Ulrich Unnerstall, Ulrike Gaul, and Eric D Siggia
Received July 8, 2004; Accepted September 9, 2004.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
The discovery of cis-regulatory modules in metazoan genomes is crucial for understanding the connection between genes and organism diversity. It is important to quantify how comparative genomics can improve computational detection of such modules.
We run the Stubb software on the entire D. melanogaster genome to obtain predictions of modules involved in segmentation of the embryo. Stubb uses a probabilistic model to score sequences for clustering of transcription factor binding sites, and can exploit multiple-species data within the same probabilistic framework. The predictions are evaluated using publicly available gene expression data for thousands of genes, after careful manual annotation. We demonstrate that the use of a second genome (D. pseudoobscura) for cross-species comparison significantly improves the prediction accuracy of Stubb, and is a more sensitive approach than intersecting the results of separate runs over the two genomes. The entire list of predictions is made available online.
Evolutionary conservation of modules serves as a filter to improve their detection in silico. The future availability of additional fruitfly genomes therefore carries the prospect of highly specific genome-wide predictions using Stubb.
Several computational approaches to the problem of predicting cis-regulatory modules (CRMs) have been reported recently. Berman et al., Markstein et al. and Halfon et al. predicted CRMs involved in body patterning in the fly, and experimentally verified their predictions. The underlying principle in these algorithms was to detect dense clusters of binding sites, as determined by matches (above some threshold) to catalogued transcription factor weight matrices. The algorithm of Rajewsky et al., called Ahab, avoided the use of thresholds on weight matrix matches by probabilistic modeling of CRMs. Ahab predictions within the segmentation gene network were subjected to extensive experimental validation, with excellent overall success (Schroeder et al.). Most predicted CRMs, when placed upstream of a reporter gene, faithfully reproduce one or more aspects of the endogenous gene expression pattern. Moreover, an analysis of binding site composition over the entire set of validated modules reveals that Ahab's prediction of binding sites correlates well with expression patterns produced by the modules and suggests basic rules governing module composition.
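To make the threshold-based idea concrete, here is a minimal Python sketch of "find weight matrix matches above a threshold, then look for the densest window". It is not the code of any of the cited groups: the 4-bp matrix, the uniform background, the threshold and the window size are all invented for illustration.

```python
import math

# Hypothetical 4-bp weight matrix (per-position base probabilities).
# Illustrative values only, not a catalogued transcription factor matrix.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform background base frequency


def log_odds(site):
    """Log2 odds of a site under the matrix versus the background."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(PWM, site))


def find_hits(seq, threshold):
    """Start positions whose log-odds score reaches the threshold."""
    width = len(PWM)
    return [i for i in range(len(seq) - width + 1)
            if log_odds(seq[i:i + width]) >= threshold]


def densest_window(seq, window, threshold):
    """Return (start, count) for the first window holding the most complete hits."""
    hits = find_hits(seq, threshold)
    best_start, best_count = 0, 0
    for start in range(max(1, len(seq) - window + 1)):
        count = sum(1 for h in hits
                    if start <= h and h + len(PWM) <= start + window)
        if count > best_count:
            best_start, best_count = start, count
    return best_start, best_count


seq = "TTTTAGCTAGCTAGCTTTTTTTTT"
print(find_hits(seq, 5.0))           # positions of matches above threshold
print(densest_window(seq, 12, 5.0))  # densest 12-bp window and its hit count
```

The hard cutoff in find_hits is exactly what Ahab, as described above, replaces with a probabilistic model: a site scoring just under the threshold contributes nothing here.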
The Stubb algorithm (Sinha et al.) extended Ahab's approach by incorporating the use of two-species sequence information. Stubb also allows the option of scoring positional correlations between binding sites, but this option was not exercised in this study. For each sequence window analyzed, Stubb first computes the homologous sequence in the second species and aligns the two using LAGAN (Brudno et al.). The sequence is then partitioned into "blocks" (contiguous ungapped aligned regions of high percent identity) and non-blocks (sequence fragments between consecutive blocks, in either species). Putative binding sites in blocks are scored under an assumption of common evolutionary descent, using a probabilistic model of binding site evolution. Thus a "weak" site that is well conserved will score higher, while a "strong" site that is poorly conserved will have its score down-weighted. The score of the sequence window includes contributions from binding sites in blocks as well as in non-blocks. Stubb is implemented so that it can be run on either single-species or two-species data. In the single-species mode, it is practically identical to the Ahab program. The Stubb software is available for download from [URL=http://edsc.rockefeller.edu/cgi-bin/stubb/download.pl]http://edsc.rockefeller.edu/cgi-bin/stubb/download.pl[/URL]
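The block/non-block partition can be sketched roughly as follows. This is a simplified Python illustration, not Stubb's actual procedure: it takes an already aligned pair of sequences (gaps written as "-"), and the min_len and min_identity cutoffs are hypothetical values chosen for the example. It also reports non-blocks as ranges of alignment columns rather than as separate fragments per species.

```python
def partition_blocks(aln_a, aln_b, min_len=5, min_identity=0.7):
    """Split a pairwise alignment into blocks and non-blocks.

    Returns (blocks, non_blocks), each a list of (start, end) alignment
    column ranges with end exclusive. The cutoffs are illustrative
    assumptions, not values taken from Stubb.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")

    # Step 1: collect maximal ungapped runs of alignment columns.
    runs, start = [], None
    for i, (a, b) in enumerate(zip(aln_a, aln_b)):
        gapped = a == "-" or b == "-"
        if not gapped and start is None:
            start = i
        elif gapped and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(aln_a)))

    # Step 2: keep runs that are long enough and similar enough as blocks.
    blocks = []
    for s, e in runs:
        matches = sum(1 for a, b in zip(aln_a[s:e], aln_b[s:e]) if a == b)
        if e - s >= min_len and matches / (e - s) >= min_identity:
            blocks.append((s, e))

    # Step 3: everything between (and around) the blocks is non-block.
    non_blocks, prev = [], 0
    for s, e in blocks:
        if s > prev:
            non_blocks.append((prev, s))
        prev = e
    if prev < len(aln_a):
        non_blocks.append((prev, len(aln_a)))
    return blocks, non_blocks


blocks, non_blocks = partition_blocks("ACGTACGT--TTGACCA",
                                      "ACGTACCTGGTT--CCA")
print(blocks)      # column ranges treated as conserved blocks
print(non_blocks)  # remaining column ranges
```

In the scheme described above, sites falling inside a block would then be scored jointly in both species, while sites in non-blocks would be scored in one species alone; that scoring model is not reproduced here.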