1. Bohnsack KS, Kaden M, Abel J, Villmann T. Alignment-Free Sequence Comparison: A Systematic Survey From a Machine Learning Perspective. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:119-135. [PMID: 34990369 DOI: 10.1109/tcbb.2022.3140873] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
The large amounts of biological sequence data generated during the last decades, together with algorithmic and hardware improvements, have made it possible to apply machine learning techniques in bioinformatics. While the machine learning community is aware of the necessity to rigorously distinguish data transformation from data comparison and adopt reasonable combinations thereof, this awareness is often lacking in the field of comparative sequence analysis. With growing awareness of the disadvantages of alignments for sequence comparison, typical applications increasingly use so-called alignment-free approaches. In light of this development, we present a conceptual framework for alignment-free sequence comparison, which highlights the delineation of: 1) the sequence data transformation, comprising adequate mathematical sequence coding and feature generation, from 2) the subsequent (dis-)similarity evaluation of the transformed data by means of problem-specific but mathematically consistent proximity measures. We consider coding to be an information-lossless data transformation that yields an appropriate representation, whereas feature generation is inevitably information-lossy, with the intention of extracting just the task-relevant information. This distinction sheds light on the plethora of methods available and assists in identifying suitable methods in machine learning and data analysis to compare the sequences under these premises.
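The distinction between lossless coding and lossy feature generation can be illustrated with a minimal sketch. It is not taken from the survey: one-hot coding, 2-mer frequency profiles, the Euclidean distance and all function names are assumptions chosen only to make the two stages concrete.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumed DNA alphabet

def one_hot(seq):
    """Coding (lossless): each nucleotide becomes a 4-dimensional unit vector."""
    return [[1 if base == a else 0 for a in ALPHABET] for base in seq]

def decode(code):
    """Invert one_hot, which shows that the coding lost no information."""
    return "".join(ALPHABET[row.index(1)] for row in code)

def kmer_profile(seq, k=2):
    """Feature generation (lossy): relative k-mer frequencies, positions discarded."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return [counts["".join(w)] / total for w in product(ALPHABET, repeat=k)]

def euclidean(u, v):
    """Proximity measure applied to the transformed data."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

s1, s2 = "ACAA", "AACA"
print(decode(one_hot(s1)) == s1)                       # coding is invertible
print(euclidean(kmer_profile(s1), kmer_profile(s2)))   # distinct sequences, same features
```

The two example sequences differ, yet they share the same 2-mer profile, so the feature representation cannot tell them apart while the one-hot code can.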
2. Bohnsack KS, Kaden M, Abel J, Saralajew S, Villmann T. The Resolved Mutual Information Function as a Structural Fingerprint of Biomolecular Sequences for Interpretable Machine Learning Classifiers. Entropy (Basel, Switzerland) 2021; 23:1357. [PMID: 34682081 PMCID: PMC8534762 DOI: 10.3390/e23101357] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2021] [Revised: 10/11/2021] [Accepted: 10/14/2021] [Indexed: 11/16/2022]
Abstract
In the present article we propose the application of variants of the mutual information function as characteristic fingerprints of biomolecular sequences for classification analysis. In particular, we consider the resolved mutual information functions based on Shannon-, Rényi-, and Tsallis-entropy. In combination with interpretable machine learning classifier models based on generalized learning vector quantization, a powerful methodology for sequence classification is achieved which allows substantial knowledge extraction in addition to the high classification ability due to the model-inherent robustness. Any potential (slightly) inferior performance of the used classifier is compensated by the additional knowledge provided by interpretable models. This knowledge may assist the user in the analysis and understanding of the used data and considered task. After theoretical justification of the concepts, we demonstrate the approach for various example data sets covering different areas in biomolecular sequence analysis.
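As a rough illustration of the Shannon case, the sketch below computes the mutual information between symbols separated by a fixed lag and keeps the individual symbol-pair terms. Treating those per-pair terms as the "resolved" quantity is an assumption made here, as are the function names; the Rényi and Tsallis variants and the exact definitions in the article are not reproduced.

```python
from collections import Counter
from math import log2

def mutual_information_terms(seq, lag):
    """Per-symbol-pair Shannon terms p(a,b) * log2(p(a,b) / (p(a) * p(b))) at one lag."""
    pairs = list(zip(seq, seq[lag:]))          # symbols that are `lag` positions apart
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)        # marginal of the first symbol
    right = Counter(b for _, b in pairs)       # marginal of the second symbol
    return {
        (a, b): (c / n) * log2((c / n) / ((left[a] / n) * (right[b] / n)))
        for (a, b), c in joint.items()
    }

def mutual_information(seq, lag):
    """Classical mutual information function value: the sum of all pair terms."""
    return sum(mutual_information_terms(seq, lag).values())

periodic = "ACGT" * 10
for lag in (1, 2, 4):
    print(lag, mutual_information(periodic, lag))
```

Evaluating the function over a range of lags gives a fixed-length vector per sequence, which is the kind of fingerprint a vector-based classifier could consume.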
Affiliation(s)
- Katrin Sophie Bohnsack
  - Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany
- Marika Kaden
  - Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany
- Julia Abel
  - Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany
- Sascha Saralajew
  - Bosch Center for Artificial Intelligence, 71272 Renningen, Germany
- Thomas Villmann
  - Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany
3. Lu YY, Tang K, Ren J, Fuhrman JA, Waterman MS, Sun F. CAFE: aCcelerated Alignment-FrEe sequence analysis. Nucleic Acids Res 2019; 45:W554-W559. [PMID: 28472388 PMCID: PMC5793812 DOI: 10.1093/nar/gkx351] [Citation(s) in RCA: 40] [Impact Index Per Article: 8.0] [Received: 02/27/2017] [Accepted: 04/20/2017] [Indexed: 12/13/2022]
Abstract
Alignment-free genome and metagenome comparisons are increasingly important with the development of next generation sequencing (NGS) technologies. Recently developed state-of-the-art k-mer based alignment-free dissimilarity measures including CVTree, $d_2^*$ and $d_2^S$ are more computationally expensive than measures based solely on the k-mer frequencies. Here, we report a standalone software, aCcelerated Alignment-FrEe sequence analysis (CAFE), for efficient calculation of 28 alignment-free dissimilarity measures. CAFE allows for both assembled genome sequences and unassembled NGS shotgun reads as input, and wraps the output in a standard PHYLIP format. In downstream analyses, CAFE can also be used to visualize the pairwise dissimilarity measures, including dendrograms, heatmap, principal coordinate analysis and network display. CAFE serves as a general k-mer based alignment-free analysis platform for studying the relationships among genomes and metagenomes, and is freely available at https://github.com/younglululu/CAFE.
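To make the pipeline shape concrete, here is a small sketch of a frequency-only k-mer dissimilarity written out as a PHYLIP-style square distance matrix. It is not CAFE's code or interface: the Euclidean measure, the function names and the toy sequences are assumptions, and the more expensive measures named in the abstract are not implemented.

```python
from collections import Counter
from math import sqrt

def kmer_freqs(seq, k):
    """Relative k-mer frequencies of one sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def euclidean(f, g):
    """Euclidean distance between two sparse frequency profiles."""
    return sqrt(sum((f.get(w, 0.0) - g.get(w, 0.0)) ** 2 for w in set(f) | set(g)))

def phylip_matrix(named_seqs, k=3):
    """Format all pairwise distances as a square PHYLIP distance matrix."""
    names = list(named_seqs)
    profiles = [kmer_freqs(named_seqs[n], k) for n in names]
    lines = [f"{len(names):5d}"]                      # first line: number of taxa
    for name, p in zip(names, profiles):
        row = " ".join(f"{euclidean(p, q):.6f}" for q in profiles)
        lines.append(f"{name[:10]:<10} {row}")       # names padded to 10 characters
    return "\n".join(lines)

toy = {"seqA": "ACGTACGTAC", "seqB": "ACGTACGAAC", "seqC": "TTTTTTTTTT"}  # made-up data
print(phylip_matrix(toy, k=2))
```

A matrix in this layout can be passed to distance-based tree builders that read PHYLIP input.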
Affiliation(s)
- Yang Young Lu
  - Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, CA 90089, USA
- Kujin Tang
  - Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, CA 90089, USA
- Jie Ren
  - Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, CA 90089, USA
- Jed A Fuhrman
  - Department of Biological Sciences and Wrigley Institute for Environmental Studies, University of Southern California, Los Angeles, CA 90089, USA
- Michael S Waterman
  - Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, CA 90089, USA; Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, 200433 Shanghai, China
- Fengzhu Sun
  - Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, CA 90089, USA; Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, 200433 Shanghai, China
4. Saw AK, Raj G, Das M, Talukdar NC, Tripathy BC, Nandi S. Alignment-free method for DNA sequence clustering using Fuzzy integral similarity. Sci Rep 2019; 9:3753. [PMID: 30842590 PMCID: PMC6403383 DOI: 10.1038/s41598-019-40452-6] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 07/31/2018] [Accepted: 01/28/2019] [Indexed: 12/28/2022]
Abstract
The growing amount of sequence data in private and public databases produced by next-generation sequencing poses new challenges because of the limitations of alignment-based methods for sequence comparison. There is therefore a strong need for faster sequence analysis algorithms. In this study, we developed an alignment-free algorithm for faster sequence analysis. The novelty of our approach is the inclusion of a fuzzy integral with a Markov chain for sequence analysis in the alignment-free model. The method estimates the parameters of a Markov chain by considering the frequencies of occurrence of all possible nucleotide pairs in each DNA sequence. These estimated Markov chain parameters are used to calculate the similarity among all pairwise combinations of DNA sequences based on a fuzzy integral algorithm. The resulting matrix is used as input for the neighbor program in the PHYLIP package for phylogenetic tree construction. Our method was tested on eight benchmark datasets and on in-house generated datasets (18S rDNA sequences from 11 arbuscular mycorrhizal fungi (AMF) and 16S rDNA sequences of 40 bacterial isolates from plant interior). The results indicate that the fuzzy integral algorithm is an efficient and feasible alignment-free method for sequence analysis on the genomic scale.
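The parameter-estimation step can be sketched as follows, assuming a first-order Markov chain whose transition probabilities are estimated from adjacent nucleotide pair counts. The function name and the handling of unseen nucleotides are choices made here; the fuzzy integral similarity itself is not reproduced.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumed DNA alphabet

def transition_matrix(seq):
    """Estimate first-order Markov transition probabilities from dinucleotide counts."""
    pair_counts = Counter(zip(seq, seq[1:]))          # counts of adjacent pairs (a, b)
    matrix = {}
    for a in ALPHABET:
        row_total = sum(pair_counts[(a, b)] for b in ALPHABET)
        # Rows for nucleotides that never occur as a predecessor are left at zero.
        matrix[a] = {
            b: (pair_counts[(a, b)] / row_total if row_total else 0.0)
            for b in ALPHABET
        }
    return matrix

for a, row in transition_matrix("AACAGATTACA").items():
    print(a, row)
```

Each sequence is thereby reduced to a 4 x 4 matrix, and a pairwise similarity can then be computed between the matrices of two sequences.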
Affiliation(s)
- Ajay Kumar Saw
  - Institute of Advanced Study in Science and Technology, Mathematical Sciences Division, Guwahati, 781035, India
- Garima Raj
  - Institute of Advanced Study in Science and Technology, Life Science Division, Guwahati, 781035, India
- Manashi Das
  - Institute of Advanced Study in Science and Technology, Life Science Division, Guwahati, 781035, India
- Narayan Chandra Talukdar
  - Institute of Advanced Study in Science and Technology, Life Science Division, Guwahati, 781035, India
- Soumyadeep Nandi
  - Institute of Advanced Study in Science and Technology, Life Science Division, Guwahati, 781035, India
5. Leimeister CA, Sohrabi-Jahromi S, Morgenstern B. Fast and accurate phylogeny reconstruction using filtered spaced-word matches. Bioinformatics 2017; 33:971-979. [PMID: 28073754 PMCID: PMC5409309 DOI: 10.1093/bioinformatics/btw776] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Received: 08/02/2016] [Accepted: 12/02/2016] [Indexed: 11/13/2022]
Abstract
Motivation: Word-based or ‘alignment-free’ algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignment-based methods.
Results: We propose Filtered Spaced Word Matches (FSWM), a fast alignment-free approach to estimate phylogenetic distances between large genomic sequences. For a pre-defined binary pattern of match and don’t-care positions, FSWM rapidly identifies spaced-word matches between input sequences, i.e. gap-free local alignments with matching nucleotides at the match positions and with mismatches allowed at the don’t-care positions. We then estimate the number of nucleotide substitutions per site by considering the nucleotides aligned at the don’t-care positions of the identified spaced-word matches. To reduce the noise from spurious random matches, we use a filtering procedure where we discard all spaced-word matches for which the overall similarity between the aligned segments is below a threshold. We show that our approach can accurately estimate substitution frequencies even for distantly related sequences that cannot be analyzed with existing alignment-free methods; phylogenetic trees constructed with FSWM distances are of high quality. A program run on a pair of eukaryotic genomes of a few hundred Mb each takes a few minutes.
Availability and Implementation: The program source code for FSWM, including documentation, as well as the software that we used to generate artificial genome sequences, are freely available at http://fswm.gobics.de/
Supplementary information: Supplementary data are available at Bioinformatics online.
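The matching and filtering idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the FSWM program: the filter here is a plain identity threshold over the don't-care positions (`min_identity` is a made-up parameter), the Jukes-Cantor correction is assumed as the substitution model because the abstract does not name one, and the pattern must contain at least one `0`.

```python
from collections import defaultdict
from math import log

def spaced_words(seq, pattern):
    """Map each spaced word (characters at the '1' positions) to its start positions."""
    match_pos = [i for i, c in enumerate(pattern) if c == "1"]
    index = defaultdict(list)
    for start in range(len(seq) - len(pattern) + 1):
        word = "".join(seq[start + i] for i in match_pos)
        index[word].append(start)
    return index

def mismatch_rate(seq_a, seq_b, pattern, min_identity=0.5):
    """Fraction of mismatches at don't-care ('0') positions over retained matches."""
    dont_care = [i for i, c in enumerate(pattern) if c == "0"]
    index_b = spaced_words(seq_b, pattern)
    mismatches = compared = 0
    for word, starts_a in spaced_words(seq_a, pattern).items():
        for i in starts_a:
            for j in index_b.get(word, []):
                diff = sum(seq_a[i + d] != seq_b[j + d] for d in dont_care)
                # Simplified filter: discard matches that look like random hits.
                if 1 - diff / len(dont_care) < min_identity:
                    continue
                mismatches += diff
                compared += len(dont_care)
    return mismatches / compared if compared else None

def jukes_cantor(p):
    """Substitutions per site from a mismatch fraction p (assumed model, needs p < 0.75)."""
    return -0.75 * log(1 - 4 * p / 3)

a = "ACGTTGCAACGTAGCTAGGA"   # made-up sequences
b = "ACGTTGCTACGTAGCAAGGA"
p = mismatch_rate(a, b, "1101011")
print(p, None if p is None else jukes_cantor(p))
```

Indexing one sequence by spaced word keeps the search close to linear for realistic patterns, which is what makes this family of methods fast compared with full alignment.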
Affiliation(s)
- Chris-André Leimeister
  - Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Salma Sohrabi-Jahromi
  - Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Burkhard Morgenstern
  - Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, Goldschmidtstr. 1, 37077 Göttingen, Germany; University of Göttingen, Center for Computational Sciences, Goldschmidtstr. 1, 37077 Göttingen, Germany
6. Kleftogiannis D, Kalnis P, Bajic VB. Progress and challenges in bioinformatics approaches for enhancer identification. Brief Bioinform 2015; 17:967-979. [PMID: 26634919 PMCID: PMC5142011 DOI: 10.1093/bib/bbv101] [Citation(s) in RCA: 56] [Impact Index Per Article: 6.2] [Received: 08/27/2015] [Revised: 10/22/2015] [Indexed: 12/20/2022]
Abstract
Enhancers are cis-acting DNA elements that play critical roles in distal regulation of gene expression. Identifying enhancers is an important step for understanding distinct gene expression programs that may reflect normal and pathogenic cellular conditions. Experimental identification of enhancers is constrained by the set of conditions used in the experiment. This requires multiple experiments to identify enhancers, as they can be active under specific cellular conditions but not in different cell types/tissues or cellular states. This has opened prospects for computational prediction methods that can be used for high-throughput identification of putative enhancers to complement experimental approaches. Potential functions and properties of predicted enhancers have been catalogued and summarized in several enhancer-oriented databases. Because the current methods for the computational prediction of enhancers produce significantly different enhancer predictions, it will be beneficial for the research community to have an overview of the strategies and solutions developed in this field. In this review, we focus on the identification and analysis of enhancers by bioinformatics approaches. First, we describe a general framework for computational identification of enhancers, present relevant data types and discuss possible computational solutions. Next, we cover over 30 existing computational enhancer identification methods that were developed since 2000. Our review highlights advantages, limitations and potentials, while suggesting pragmatic guidelines for development of more efficient computational enhancer prediction methods. Finally, we discuss challenges and open problems of this topic, which require further consideration.
7. Kazemian M, Suryamohan K, Chen JY, Zhang Y, Samee MAH, Halfon MS, Sinha S. Evidence for deep regulatory similarities in early developmental programs across highly diverged insects. Genome Biol Evol 2015; 6:2301-20. [PMID: 25173756 PMCID: PMC4217690 DOI: 10.1093/gbe/evu184] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Indexed: 12/16/2022]
Abstract
Many genes familiar from Drosophila development, such as the so-called gap, pair-rule, and segment polarity genes, play important roles in the development of other insects and in many cases appear to be deployed in a similar fashion, despite the fact that Drosophila-like "long germband" development is highly derived and confined to a subset of insect families. Whether or not these similarities extend to the regulatory level is unknown. Identification of regulatory regions beyond the well-studied Drosophila has been challenging as even within the Diptera (flies, including mosquitoes) regulatory sequences have diverged past the point of recognition by standard alignment methods. Here, we demonstrate that methods we previously developed for computational cis-regulatory module (CRM) discovery in Drosophila can be used effectively in highly diverged (250-350 Myr) insect species including Anopheles gambiae, Tribolium castaneum, Apis mellifera, and Nasonia vitripennis. In Drosophila, we have successfully used small sets of known CRMs as "training data" to guide the search for other CRMs with related function. We show here that although species-specific CRM training data do not exist, training sets from Drosophila can facilitate CRM discovery in diverged insects. We validate in vivo over a dozen new CRMs, roughly doubling the number of known CRMs in the four non-Drosophila species. Given the growing wealth of Drosophila CRM annotation, these results suggest that extensive regulatory sequence annotation will be possible in newly sequenced insects without recourse to costly and labor-intensive genome-scale experiments. We develop a new method, Regulus, which computes a probabilistic score of similarity based on binding site composition (despite the absence of nucleotide-level sequence alignment), and demonstrate similarity between functionally related CRMs from orthologous loci. 
Our work represents an important step toward being able to trace the evolutionary history of gene regulatory networks and defining the mechanisms underlying insect evolution.
Affiliation(s)
- Majid Kazemian
  - Department of Computer Science, University of Illinois at Urbana-Champaign; Laboratory of Molecular Immunology, National Heart Lung and Blood Institute, National Institutes of Health, Bethesda, Maryland
- Kushal Suryamohan
  - Department of Biochemistry, University at Buffalo-State University of New York; NY State Center of Excellence in Bioinformatics and Life Sciences, Buffalo, New York
- Jia-Yu Chen
  - Department of Computer Science, University of Illinois at Urbana-Champaign
- Yinan Zhang
  - Department of Computer Science, University of Illinois at Urbana-Champaign
- Marc S Halfon
  - Department of Biochemistry, University at Buffalo-State University of New York; NY State Center of Excellence in Bioinformatics and Life Sciences, Buffalo, New York; Department of Biological Sciences, University at Buffalo-State University of New York; Molecular and Cellular Biology Department and Program in Cancer Genetics, Roswell Park Cancer Institute, Buffalo, New York
- Saurabh Sinha
  - Department of Computer Science, University of Illinois at Urbana-Champaign; Institute of Genomic Biology, University of Illinois at Urbana-Champaign
8. Morgenstern B, Zhu B, Horwege S, Leimeister CA. Estimating evolutionary distances between genomic sequences from spaced-word matches. Algorithms Mol Biol 2015; 10:5. [PMID: 25685176 PMCID: PMC4327811 DOI: 10.1186/s13015-015-0032-x] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Received: 11/19/2014] [Accepted: 01/06/2015] [Indexed: 01/06/2023]
Abstract
Alignment-free methods are increasingly used to calculate evolutionary distances between DNA and protein sequences as a basis of phylogeny reconstruction. Most of these methods, however, use heuristic distance functions that are not based on any explicit model of molecular evolution. Herein, we propose a simple estimator $d_N$ of the evolutionary distance between two DNA sequences that is calculated from the number $N$ of (spaced) word matches between them. We show that this distance function is more accurate than other distance measures that are used by alignment-free methods. In addition, we calculate the variance of the normalized number $N$ of (spaced) word matches. We show that the variance of $N$ is smaller for spaced words than for contiguous words, and that the variance is further reduced if our spaced-words approach is used with multiple patterns of 'match positions' and 'don't care positions'. Our software is available online and as downloadable source code at: http://spaced.gobics.de/.
9. Rouault H, Santolini M, Schweisguth F, Hakim V. Imogene: identification of motifs and cis-regulatory modules underlying gene co-regulation. Nucleic Acids Res 2014; 42:6128-45. [PMID: 24682824 PMCID: PMC4041412 DOI: 10.1093/nar/gku209] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Indexed: 11/23/2022]
Abstract
Cis-regulatory modules (CRMs) and motifs play a central role in tissue and condition-specific gene expression. Here we present Imogene, an ensemble of statistical tools that we have developed to facilitate their identification and implemented in publicly available software. Starting from a small training set of mammalian or fly CRMs that drive similar gene expression profiles, Imogene determines de novo cis-regulatory motifs that underlie this co-expression. It can then predict on a genome-wide scale other CRMs with a regulatory potential similar to the training set. Imogene bypasses the need for large datasets for statistical analyses by making central use of the information provided by the sequenced genomes of multiple species, based on the developed statistical tools and explicit models for transcription factor binding site evolution. We test Imogene on characterized tissue-specific mouse developmental CRMs. Its ability to identify CRMs with the same specificity based on its de novo created motifs is comparable to that of previously evaluated ‘motif-blind’ methods. We further show, both in flies and in mammals, that Imogene de novo generated motifs are sufficient to discriminate CRMs related to different developmental programs. Notably, purely relying on sequence data, Imogene performs as well in this discrimination task as a previously reported learning algorithm based on Chromatin Immunoprecipitation (ChIP) data for multiple transcription factors at multiple developmental stages.
Affiliation(s)
- Hervé Rouault
  - Developmental and Stem Cell Biology Department, Institut Pasteur, F-75015 Paris, France; CNRS, URA2578, F-75015 Paris, France
- Marc Santolini
  - Laboratoire de Physique Statistique, CNRS, École Normale Supérieure, Université P. et M. Curie, Université Paris-Diderot
- François Schweisguth
  - Developmental and Stem Cell Biology Department, Institut Pasteur, F-75015 Paris, France; CNRS, URA2578, F-75015 Paris, France
- Vincent Hakim
  - Laboratoire de Physique Statistique, CNRS, École Normale Supérieure, Université P. et M. Curie, Université Paris-Diderot
10. Song K, Ren J, Zhai Z, Liu X, Deng M, Sun F. Alignment-free sequence comparison based on next-generation sequencing reads. J Comput Biol 2013; 20:64-79. [PMID: 23383994 DOI: 10.1089/cmb.2012.0228] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.9] [Indexed: 01/22/2023]
Abstract
Next-generation sequencing (NGS) technologies have generated enormous amounts of shotgun read data, and assembly of the reads can be challenging, especially for organisms without template sequences. We study the power of genome comparison based on shotgun read data without assembly using three alignment-free sequence comparison statistics, $D_2$, $D_2^*$ and $D_2^S$, both theoretically and by simulations. Theoretical formulas for the power of detecting the relationship between two sequences related through a common motif model are derived. It is shown that both $D_2^*$ and $D_2^S$ outperform $D_2$ for detecting the relationship between two sequences based on NGS data. We then study the effects of length of the tuple, read length, coverage, and sequencing error on the power of $D_2^*$ and $D_2^S$. Finally, variations of these statistics, $d_2$, $d_2^*$ and $d_2^S$, respectively, are used to first cluster five mammalian species with known phylogenetic relationships, and then cluster 13 tree species whose complete genome sequences are not available using NGS shotgun reads. The clustering results using $d_2^S$ are consistent with biological knowledge for the 5 mammalian and 13 tree species, respectively. Thus, the statistic $d_2^S$ provides a powerful alignment-free comparison tool to study the relationships among different organisms based on NGS read data without assembly.
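For orientation, the plain $D_2$ statistic is the inner product of the k-tuple count vectors of two sequences. The sketch below computes it, together with counts centered by their expectation under an i.i.d. letter model, which is the kind of centering that $D_2^*$ and $D_2^S$ build on. The normalizations that define $D_2^*$ and $D_2^S$ are deliberately not reproduced, and the i.i.d. background and the function names are assumptions made here.

```python
from collections import Counter
from itertools import product

def kmer_counts(seq, k):
    """Count every overlapping k-tuple in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2(seq_x, seq_y, k):
    """D2: sum over all k-tuples w of X_w * Y_w."""
    x, y = kmer_counts(seq_x, k), kmer_counts(seq_y, k)
    return sum(c * y[w] for w, c in x.items())

def centered_counts(seq, k, alphabet="ACGT"):
    """Counts minus their expectation under an assumed i.i.d. letter model."""
    letter_p = {a: seq.count(a) / len(seq) for a in alphabet}
    counts = kmer_counts(seq, k)
    n_windows = len(seq) - k + 1
    centered = {}
    for letters in product(alphabet, repeat=k):
        p_word = 1.0
        for a in letters:
            p_word *= letter_p[a]                 # probability of the word under i.i.d.
        word = "".join(letters)
        centered[word] = counts[word] - n_windows * p_word
    return centered

print(d2("ACGTACGT", "ACGTTGCA", 2))
print(centered_counts("ACGTAC", 2)["AC"])
```

For unassembled reads, the same counts would be accumulated over all reads of a sample rather than over one contiguous sequence.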
Affiliation(s)
- Kai Song
  - School of Mathematics, Peking University, Beijing, PR China
11. Wang C, Zhang MQ, Zhang Z. Computational identification of active enhancers in model organisms. Genomics, Proteomics & Bioinformatics 2013; 11:142-50. [PMID: 23685394 PMCID: PMC4357786 DOI: 10.1016/j.gpb.2013.04.002] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 12/28/2012] [Revised: 04/01/2013] [Accepted: 04/20/2013] [Indexed: 12/11/2022]
Abstract
As a class of cis-regulatory elements, enhancers were first identified nearly 30 years ago as genomic regions that are able to markedly increase the transcription of genes. Enhancers can regulate gene expression in a cell-type specific and developmental stage specific manner. Although experimental technologies have been developed to identify enhancers genome-wide, the design principle of the regulatory elements and the way they rewire the transcriptional regulatory network tempo-spatially are far from clear. At present, developing predictive methods for enhancers, particularly for the cell-type specific activity of enhancers, is central to computational biology. In this review, we survey the current computational approaches for active enhancer prediction and discuss future directions.
Affiliation(s)
- Chengqi Wang
  - CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China
- Michael Q. Zhang
  - Department of Molecular Cell Biology, Center for Systems Biology, University of Texas at Dallas, Richardson, TX 75080, USA
  - Bioinformatics Division, Center for Synthetic and Systems Biology, TNLIST, Tsinghua University, Beijing 100084, China
- Zhihua Zhang
  - CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China
12. Busser BW, Taher L, Kim Y, Tansey T, Bloom MJ, Ovcharenko I, Michelson AM. A machine learning approach for identifying novel cell type-specific transcriptional regulators of myogenesis. PLoS Genet 2012; 8:e1002531. [PMID: 22412381 PMCID: PMC3297574 DOI: 10.1371/journal.pgen.1002531] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.8] [Received: 03/30/2011] [Accepted: 12/23/2011] [Indexed: 12/22/2022]
Abstract
Transcriptional enhancers integrate the contributions of multiple classes of transcription factors (TFs) to orchestrate the myriad spatio-temporal gene expression programs that occur during development. A molecular understanding of enhancers with similar activities requires the identification of both their unique and their shared sequence features. To address this problem, we combined phylogenetic profiling with a DNA-based enhancer sequence classifier that analyzes the TF binding sites (TFBSs) governing the transcription of a co-expressed gene set. We first assembled a small number of enhancers that are active in Drosophila melanogaster muscle founder cells (FCs) and other mesodermal cell types. Using phylogenetic profiling, we increased the number of enhancers by incorporating orthologous but divergent sequences from other Drosophila species. Functional assays revealed that the diverged enhancer orthologs were active in largely similar patterns as their D. melanogaster counterparts, although there was extensive evolutionary shuffling of known TFBSs. We then built and trained a classifier using this enhancer set and identified additional related enhancers based on the presence or absence of known and putative TFBSs. Predicted FC enhancers were over-represented in proximity to known FC genes; and many of the TFBSs learned by the classifier were found to be critical for enhancer activity, including POU homeodomain, Myb, Ets, Forkhead, and T-box motifs. Empirical testing also revealed that the T-box TF encoded by org-1 is a previously uncharacterized regulator of muscle cell identity. Finally, we found extensive diversity in the composition of TFBSs within known FC enhancers, suggesting that motif combinatorics plays an essential role in the cellular specificity exhibited by such enhancers. 
In summary, machine learning combined with evolutionary sequence analysis is useful for recognizing novel TFBSs and for facilitating the identification of cognate TFs that coordinate cell type-specific developmental gene expression patterns.
Affiliation(s)
- Brian W. Busser
  - Laboratory of Developmental Systems Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Leila Taher
  - Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
- Yongsok Kim
  - Laboratory of Developmental Systems Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Terese Tansey
  - Laboratory of Developmental Systems Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Molly J. Bloom
  - Laboratory of Developmental Systems Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Ivan Ovcharenko
  - Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
- Alan M. Michelson
  - Laboratory of Developmental Systems Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland, United States of America
13. Nourmohammad A, Lässig M. Formation of regulatory modules by local sequence duplication. PLoS Comput Biol 2011; 7:e1002167. [PMID: 21998564 PMCID: PMC3188502 DOI: 10.1371/journal.pcbi.1002167] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Received: 02/15/2011] [Accepted: 06/30/2011] [Indexed: 11/24/2022]
Abstract
Turnover of regulatory sequence and function is an important part of molecular evolution. But what are the modes of sequence evolution leading to rapid formation and loss of regulatory sites? Here we show that a large fraction of neighboring transcription factor binding sites in the fly genome have formed from a common sequence origin by local duplications. This mode of evolution is found to produce regulatory information: duplications can seed new sites in the neighborhood of existing sites. Duplicate seeds evolve subsequently by point mutations, often towards binding a different factor than their ancestral neighbor sites. These results are based on a statistical analysis of 346 cis-regulatory modules in the Drosophila melanogaster genome, and a comparison set of intergenic regulatory sequence in Saccharomyces cerevisiae. In fly regulatory modules, pairs of binding sites show significantly enhanced sequence similarity up to distances of about 50 bp. We analyze these data in terms of an evolutionary model with two distinct modes of site formation: (i) evolution from independent sequence origin and (ii) divergent evolution following duplication of a common ancestor sequence. Our results suggest that pervasive formation of binding sites by local sequence duplications distinguishes the complex regulatory architecture of higher eukaryotes from the simpler architecture of unicellular organisms.
Since Jacob and Monod stressed the importance of gene regulation in evolution, our understanding of the mechanisms of regulation has substantially advanced. In higher eukaryotes, genes often have complex regulatory input, which is encoded in cis-regulatory sequence with multiple transcription factor binding sites. However, the modes of genome evolution generating regulatory complexity are much less understood. This study reports a surprising finding: in fly regulatory modules, the majority of transcription factor binding sites show evidence of a local sequence duplication in their evolutionary history, which relates their sequence information to that of neighboring binding sites. Our analysis suggests that local sequence duplications are a pervasive production mode of regulatory information. This mode appears to be specific to higher eukaryotes; we have not found evidence of frequent local duplications in the yeast genome. Our results affect genomic sequence analysis, in particular, computational identification of cis-regulatory elements and alignment of regulatory DNA. At the same time, they address fundamental questions on the evolution of regulation: How much of the regulatory “grammar” observed in higher eukaryotes is due to optimization of function, and how much reflects the underlying sequence evolution modes? What is the result and what is the substrate of natural selection?
Affiliation(s)
- Michael Lässig
- Institute for Theoretical Physics, University of Cologne, Köln, Germany
|
14
|
Abstract
Accurately predicting regulatory sequences and enhancers in entire genomes is an important but difficult problem, especially in large vertebrate genomes. With the advent of ChIP-seq technology, experimental detection of genome-wide EP300/CREBBP bound regions provides a powerful platform to develop predictive tools for regulatory sequences and to study their sequence properties. Here, we develop a support vector machine (SVM) framework which can accurately identify EP300-bound enhancers using only genomic sequence and an unbiased set of general sequence features. Moreover, we find that the predictive sequence features identified by the SVM classifier reveal biologically relevant sequence elements enriched in the enhancers, but we also identify other features that are significantly depleted in enhancers. The predictive sequence features are evolutionarily conserved and spatially clustered, providing further support of their functional significance. Although our SVM is trained on experimental data, we also predict novel enhancers and show that these putative enhancers are significantly enriched in both ChIP-seq signal and DNase I hypersensitivity signal in the mouse brain and are located near relevant genes. Finally, we present results of comparisons between other EP300/CREBBP data sets using our SVM and uncover sequence elements enriched and/or depleted in the different classes of enhancers. Many of these sequence features play a role in specifying tissue-specific or developmental-stage-specific enhancer activity, but our results indicate that some features operate in a general or tissue-independent manner. In addition to providing a high confidence list of enhancer targets for subsequent experimental investigation, these results contribute to our understanding of the general sequence structure of vertebrate enhancers.
|
15
|
Kazemian M, Zhu Q, Halfon MS, Sinha S. Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison. Nucleic Acids Res 2011; 39:9463-72. [PMID: 21821659 PMCID: PMC3239187 DOI: 10.1093/nar/gkr621] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.3] [Indexed: 12/16/2022] Open
Abstract
Despite recent advances in experimental approaches for identifying transcriptional cis-regulatory modules (CRMs, ‘enhancers’), direct empirical discovery of CRMs for all genes in all cell types and environmental conditions is likely to remain an elusive goal. Effective methods for computational CRM discovery are thus a critically needed complement to empirical approaches. However, existing computational methods that search for clusters of putative binding sites are ineffective if the relevant TFs and/or their binding specificities are unknown. Here, we provide a significantly improved method for ‘motif-blind’ CRM discovery that does not depend on knowledge or accurate prediction of TF-binding motifs and is effective when limited knowledge of functional CRMs is available to ‘supervise’ the search. We propose a new statistical method, based on ‘Interpolated Markov Models’, for motif-blind, genome-wide CRM discovery. It captures the statistical profile of variable length words in known CRMs of a regulatory network and finds candidate CRMs that match this profile. The method also uses orthologs of the known CRMs from closely related genomes. We perform in silico evaluation of predicted CRMs by assessing whether their neighboring genes are enriched for the expected expression patterns. This assessment uses a novel statistical test that extends the widely used Hypergeometric test of gene set enrichment to account for variability in intergenic lengths. We find that the new CRM prediction method is superior to existing methods. Finally, we experimentally validate 12 new CRM predictions by examining their regulatory activity in vivo in Drosophila; 10 of the tested CRMs were found to be functional, while 6 of the top 7 predictions showed the expected activity patterns. We make our program available as downloadable source code, and as a plugin for a genome browser installed on our servers.
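For orientation, the standard hypergeometric enrichment test that this abstract says the authors extend can be sketched as below. This is only the textbook upper-tail test; it does not include the paper's correction for variable intergenic lengths, and the function name and argument names are illustrative assumptions rather than anything taken from the authors' software.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, drawn, hits):
    """Upper-tail hypergeometric p-value P(X >= hits).

    population: total number of genes considered
    annotated:  genes in the population carrying the expression pattern
    drawn:      genes neighbouring the predicted CRMs (the sample)
    hits:       sampled genes that carry the expression pattern
    """
    lowest = max(hits, 0)
    highest = min(annotated, drawn)
    total = comb(population, drawn)
    # math.comb returns 0 when more items are requested than exist,
    # so impossible configurations contribute nothing to the sum.
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(lowest, highest + 1)
    )
    return favourable / total
```

A small p-value would indicate that genes near the predicted CRMs carry the expected expression pattern more often than random sampling would explain.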
Affiliation(s)
- Majid Kazemian
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
|
16
|
Liu X, Wan L, Li J, Reinert G, Waterman MS, Sun F. New powerful statistics for alignment-free sequence comparison under a pattern transfer model. J Theor Biol 2011; 284:106-16. [PMID: 21723298 DOI: 10.1016/j.jtbi.2011.06.020] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.8] [Received: 02/16/2011] [Revised: 05/30/2011] [Accepted: 06/17/2011] [Indexed: 12/15/2022]
Abstract
Alignment-free sequence comparison is widely used for comparing gene regulatory regions and for identifying horizontally transferred genes. Recent studies on the power of a widely used alignment-free comparison statistic $D_2$ and its variants $D_2^*$ and $D_2^s$ showed that their power approaches a limit smaller than 1 as the sequence length tends to infinity under a pattern transfer model. We develop new alignment-free statistics based on $D_2$, $D_2^*$ and $D_2^s$ by comparing local sequence pairs and then summing over all the local sequence pairs of a certain length. We show that the new statistics are much more powerful than the corresponding original statistics and that their power tends to 1 as the sequence length tends to infinity under the pattern transfer model.
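As a rough illustration of the statistic this abstract builds on, the plain D2 score is commonly defined as the inner product of the k-mer count vectors of two sequences. The sketch below computes that score and then, as a loose reading of "summing over all the local sequence pairs", adds up plain D2 over every pair of fixed-length windows. The windowed function, its `window` parameter, and all names are assumptions made for illustration; the paper's actual statistics also cover the centred variants and their normalisation, which are not reproduced here.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq (empty if seq is shorter than k)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Plain D2: sum over k-mers of (count in seq_a) * (count in seq_b)."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Counter returns 0 for k-mers that are absent from seq_b.
    return sum(n * counts_b[word] for word, n in counts_a.items())


def local_d2_sum(seq_a, seq_b, k, window):
    """Hypothetical local variant: plain D2 summed over all window pairs."""
    windows_a = [seq_a[i:i + window] for i in range(len(seq_a) - window + 1)]
    windows_b = [seq_b[i:i + window] for i in range(len(seq_b) - window + 1)]
    return sum(d2(a, b, k) for a in windows_a for b in windows_b)
```

The double loop over windows is quadratic in sequence length, which is acceptable for a sketch but would need a smarter counting scheme for long regulatory regions.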
Affiliation(s)
- Xuemei Liu
- School of Physics, South China University of Technology, Guangzhou, PR China
|
17
|
Koohy H, Dyer NP, Reid JE, Koentges G, Ott S. An alignment-free model for comparison of regulatory sequences. Bioinformatics 2010; 26:2391-7. [PMID: 20696736 DOI: 10.1093/bioinformatics/btq453] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 12/16/2022]
Abstract
MOTIVATION: Some recent comparative studies have revealed that regulatory regions can retain function over large evolutionary distances, even though the DNA sequences are divergent and difficult to align. It is also known that such enhancers can drive very similar expression patterns. This poses a challenge for the in silico detection of biologically related sequences, as they can only be discovered using alignment-free methods.
RESULTS: Here, we present a new computational framework called Regulatory Region Scoring (RRS) model for the detection of functional conservation of regulatory sequences using predicted occupancy levels of transcription factors of interest. We demonstrate that our model can detect the functional and/or evolutionary links between some non-alignable enhancers with a strong statistical significance. We also identify groups of enhancers that are likely to be similarly regulated. Our model is motivated by previous work on prediction of expression patterns and it can capture similarity by strong binding sites, weak binding sites and even the statistically significant absence of sites. Our results support the hypothesis that weak binding sites contribute to the functional similarity of sequences. Our model fills a gap between two families of models: detailed, data-intensive models for the prediction of precise spatio-temporal expression patterns on the one side, and crude, generally applicable models on the other side. Our model borrows some of the strengths of each group and addresses their drawbacks.
AVAILABILITY: The RRS source code is freely available upon publication of this manuscript: http://www2.warwick.ac.uk/fac/sci/systemsbiology/staff/ott/tools_and_software/rrs.
Affiliation(s)
- Hashem Koohy
- MOAC Doctoral Training Centre, Coventry House, University of Warwick, Coventry, CV4 7AL, UK.
|
18
|
Genome-wide identification of cis-regulatory motifs and modules underlying gene coregulation using statistics and phylogeny. Proc Natl Acad Sci U S A 2010; 107:14615-20. [PMID: 20671200 DOI: 10.1073/pnas.1002876107] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.4] [Indexed: 01/14/2023] Open
Abstract
Cell fate determination depends in part on the establishment of specific transcriptional programs of gene expression. These programs result from the interpretation of the genomic cis-regulatory information by sequence-specific factors. Decoding this information in sequenced genomes is an important issue. Here, we developed statistical analysis tools to computationally identify the cis-regulatory elements that control gene expression in a set of coregulated genes. Starting with a small number of validated and/or predicted cis-regulatory modules (CRMs) in a reference species as a training set, but with no a priori knowledge of the factors acting in trans, we computationally predicted transcription factor binding sites (TFBSs) and genomic CRMs underlying coregulation. This method was applied to the gene expression program active in Drosophila melanogaster sensory organ precursor cells (SOPs), a specific type of neural progenitor cells. Mutational analysis showed that four, including one newly characterized, out of the five top-ranked families of predicted TFBSs were required for SOP-specific gene expression. Additionally, 19 out of the 29 top-ranked predicted CRMs directed gene expression in neural progenitor cells, i.e., SOPs or larval brain neuroblasts, with a notable fraction active in SOPs (11/29). We further identified the lola gene as the target of two SOP-specific CRMs and found that the lola gene contributed to SOP specification. The statistics and phylogeny-based tools described here can be more generally applied to identify the cis-regulatory elements of specific gene regulatory networks in any family of related species with sequenced genomes.
|
19
|
Arunachalam M, Jayasurya K, Tomancak P, Ohler U. An alignment-free method to identify candidate orthologous enhancers in multiple Drosophila genomes. Bioinformatics 2010; 26:2109-15. [PMID: 20624780 DOI: 10.1093/bioinformatics/btq358] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Indexed: 12/17/2022]
Abstract
MOTIVATION: Evolutionarily conserved non-coding genomic sequences represent a potentially rich source for the discovery of gene regulatory regions such as transcriptional enhancers. However, detecting orthologous enhancers using alignment-based methods in higher eukaryotic genomes is particularly challenging, as regulatory regions can undergo considerable sequence changes while maintaining their functionality.
RESULTS: We have developed an alignment-free method which identifies conserved enhancers in multiple diverged species. Our method relies on similarity metrics between two sequences that are based on the co-occurrence of sequence patterns regardless of their order and orientation, thus tolerating sequence changes observed in non-coding evolution. We show that our method is highly successful in detecting orthologous enhancers in distantly related species without requiring additional information such as knowledge about transcription factors involved, or predicted binding sites. By estimating the significance of similarity scores, we are able to discriminate experimentally validated functional enhancers from seemingly equally conserved candidates without function. We demonstrate the effectiveness of this approach on a wide range of enhancers in Drosophila, and also present encouraging results to detect conserved functional regions across large evolutionary distances. Our work provides encouraging steps on the way to ab initio unbiased enhancer prediction to complement ongoing experimental efforts.
AVAILABILITY: The software, data and the results used in this article are available at http://www.genome.duke.edu/labs/ohler/research/transcription/fly_enhancer/.
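One simple way to make a pattern-based comparison insensitive to order and orientation, as this abstract describes, is to count k-mers after collapsing each k-mer with its reverse complement and then compare the resulting count profiles. The sketch below does this with cosine similarity. Both the use of plain k-mers as the "sequence patterns" and the choice of cosine similarity are assumptions made for illustration; they are not the authors' metric, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def canonical_profile(seq, k):
    """Strand-independent k-mer counts; positions along seq are discarded."""
    seq = seq.upper()
    return Counter(canonical(seq[i:i + k]) for i in range(len(seq) - k + 1))


def cosine_similarity(profile_a, profile_b):
    """Cosine of the angle between two count profiles (0.0 if either is empty)."""
    dot = sum(n * profile_b[word] for word, n in profile_a.items())
    norm_a = sqrt(sum(n * n for n in profile_a.values()))
    norm_b = sqrt(sum(n * n for n in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Under these assumptions a sequence and its reverse complement receive identical profiles, so an enhancer that has been inverted still scores as maximally similar to the original.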
|