1
Rasoarahona R, Wattanadilokchatkun P, Panthum T, Jaisamut K, Lisachov A, Thong T, Singchat W, Ahmad SF, Han K, Kraichak E, Muangmai N, Koga A, Duengkae P, Antunes A, Srikulnath K. MicrosatNavigator: exploring nonrandom distribution and lineage-specificity of microsatellite repeat motifs on vertebrate sex chromosomes across 186 whole genomes. Chromosome Res 2023; 31:29. [PMID: 37775555 DOI: 10.1007/s10577-023-09738-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2023] [Revised: 08/11/2023] [Accepted: 09/05/2023] [Indexed: 10/01/2023]
Abstract
Microsatellites are short tandem DNA repeats, ubiquitous in genomes. They are believed to be under selection pressure, considering their high distribution and abundance beyond chance or random accumulation. However, limited analysis of microsatellites in single taxonomic groups makes it challenging to understand their evolutionary significance across taxonomic boundaries. Despite abundant genomic information, microsatellites have been studied in limited contexts and within a few species, warranting an unbiased examination of their genome-wide distribution in distinct versus closely related clades. Large-scale comparisons have revealed relevant trends, especially in vertebrates. Here, "MicrosatNavigator", a new tool that allows quick and reliable investigation of perfect microsatellites in DNA sequences, was developed. This tool can identify microsatellites across the entire genome sequences. Using this tool, microsatellite repeat motifs were identified in the genome sequences of 186 vertebrates. A significant positive correlation was noted between the abundance, density, length, and GC bias of microsatellites and specific lineages. The (AC)n motif is the most prevalent in vertebrate genomes, showing distinct patterns in closely related species. Longer microsatellites were observed on sex chromosomes in birds and mammals but not on autosomes. Microsatellites on sex chromosomes of non-fish vertebrates have the lowest GC content, whereas high-GC microsatellites (≥ 50 M% GC) are preferred in bony and cartilaginous fishes. Thus, similar selective forces and mutational processes may constrain GC-rich microsatellites to different clades. These findings should facilitate investigations into the roles of microsatellites in sex chromosome differentiation and provide candidate microsatellites for functional analysis across the vertebrate evolutionary spectrum.
Affiliation(s)
- Ryan Rasoarahona
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Sciences for Industry, Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Pish Wattanadilokchatkun
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Thitipong Panthum
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Special Research Unit for Wildlife Genomics (SRUWG), Department of Forest Biology, Faculty of Forestry, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Kitipong Jaisamut
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Artem Lisachov
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Thanyapat Thong
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Worapong Singchat
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Special Research Unit for Wildlife Genomics (SRUWG), Department of Forest Biology, Faculty of Forestry, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Syed Farhan Ahmad
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Special Research Unit for Wildlife Genomics (SRUWG), Department of Forest Biology, Faculty of Forestry, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Kyudong Han
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Department of Microbiology, College of Science & Technology, Dankook University, Cheonan, 31116, Republic of Korea
- Center for Bio-Medical Engineering Core Facility, Dankook University, Cheonan, 31116, Republic of Korea
- Ekaphan Kraichak
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Department of Botany, Faculty of Science, Kasetsart University, Bangkok, 10900, Thailand
- Narongrit Muangmai
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Department of Fishery Biology, Faculty of Fisheries, Kasetsart University, Chatuchak, Bangkok, 10900, Thailand
- Akihiko Koga
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Prateep Duengkae
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Special Research Unit for Wildlife Genomics (SRUWG), Department of Forest Biology, Faculty of Forestry, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand
- Agostinho Antunes
- CIIMAR/CIMAR, Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros Do Porto de Leixes, Av. General Norton de Matos, S/N, 4450-208, Porto, Portugal
- Department of Biology, Faculty of Sciences, University of Porto, Rua do Campo Alegre, S/N, 4169-007, Porto, Portugal
- Kornsorn Srikulnath
- Animal Genomics and Bioresource Research Unit (AGB Research Unit), Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand.
- Sciences for Industry, Faculty of Science, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand.
- Special Research Unit for Wildlife Genomics (SRUWG), Department of Forest Biology, Faculty of Forestry, Kasetsart University, 50 Ngamwongwan, Chatuchak, Bangkok, 10900, Thailand.
- Center for Advanced Studies in Tropical Natural Resources, National Research University-Kasetsart University, Kasetsart University, (CASTNAR, NRU-KU, Thailand), Bangkok, 10900, Thailand.
- Center of Excellence on Agricultural Biotechnology (AG-BIO/PERDO-CHE), Bangkok, 10900, Thailand.
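The abstract of entry 1 does not describe how MicrosatNavigator is implemented. As a rough illustration of what detecting perfect (uninterrupted) microsatellites and measuring their GC content can involve, the following Python sketch scans a DNA string with a regular expression. The function names, the 1 to 6 bp motif range, and the minimum of four repeats are assumptions chosen for this example, not parameters taken from the tool.

```python
import re


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'ACAC')."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_perfect_microsatellites(seq, min_unit=1, max_unit=6, min_repeats=4):
    """Return sorted (start, motif, repeat_count) tuples for perfect tandem repeats.

    Thresholds are illustrative assumptions, not MicrosatNavigator defaults.
    """
    seq = seq.upper()
    hits = []
    for unit in range(min_unit, max_unit + 1):
        # One captured unit followed by at least (min_repeats - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_repeats - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):  # avoid reporting (AC)n again as (ACAC)n
                hits.append((match.start(), motif, len(match.group(0)) // unit))
    return sorted(hits)


def gc_fraction(motif):
    """Fraction of G and C bases in a motif."""
    return sum(base in "GC" for base in motif) / len(motif)


if __name__ == "__main__":
    example = "TTG" + "AC" * 6 + "GGT" + "A" * 5 + "C"
    for start, motif, count in find_perfect_microsatellites(example):
        print(start, motif, count, gc_fraction(motif))
```

A regular expression with a back-reference keeps the sketch short; a genome-scale tool would more likely use a linear scan, which this example does not attempt.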
2
Tognon M, Giugno R, Pinello L. A survey on algorithms to characterize transcription factor binding sites. Brief Bioinform 2023; 24:bbad156. [PMID: 37099664 PMCID: PMC10422928 DOI: 10.1093/bib/bbad156] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 01/13/2023] [Revised: 03/27/2023] [Accepted: 04/01/2023] [Indexed: 04/28/2023]
Abstract
Transcription factors (TFs) are key regulatory proteins that control the transcriptional rate of cells by binding short DNA sequences called transcription factor binding sites (TFBS) or motifs. Identifying and characterizing TFBS is fundamental to understanding the regulatory mechanisms governing the transcriptional state of cells. During the last decades, several experimental methods have been developed to recover DNA sequences containing TFBS. In parallel, computational methods have been proposed to discover and identify TFBS motifs based on these DNA sequences. This is one of the most widely investigated problems in bioinformatics and is referred to as the motif discovery problem. In this manuscript, we review classical and novel experimental and computational methods developed to discover and characterize TFBS motifs in DNA sequences, highlighting their advantages and drawbacks. We also discuss open challenges and future perspectives that could fill the remaining gaps in the field.
Affiliation(s)
- Manuel Tognon
- Computer Science Department, University of Verona, Verona, Italy
- Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Rosalba Giugno
- Computer Science Department, University of Verona, Verona, Italy
- Luca Pinello
- Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Department of Pathology, Harvard Medical School, Boston, Massachusetts, United States of America
3
Theepalakshmi P, Reddy US. Freezing firefly algorithm for efficient planted (ℓ, d) motif search. Med Biol Eng Comput 2022; 60:511-530. [PMID: 35020123 DOI: 10.1007/s11517-021-02468-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2020] [Accepted: 11/06/2021] [Indexed: 10/19/2022]
Abstract
The detection of inimitable patterns (motif) occurring in a set of biological sequences could elevate new biological discoveries. Its application in recognition of transcription factors and their binding sites have demonstrated the necessity to attain knowledge of gene function, human diseases, and drug design. The literature identifies (ℓ, d) motif search as the widely studied problem in PMS (Planted Motif Search). This paper proposes an efficient optimization algorithm named "Freezing FireFly (FFF)" to solve (ℓ, d) motif search problem. The new strategy freezing such as local and global was added to increase the performance of the basic Firefly algorithm. It freezes the best possible out coming positions even in the lesser brighter one. The performance of the proposed algorithm is experienced on simulated and real datasets. The experimental results show that the proposed algorithm resolves the instance (50, 21) within 1.47 min in the simulated dataset. For real (such as ChIP-seq (Chromatin Immunoprecipitation)) and synthetic datasets, the proposed algorithm runs much faster in comparison to existing state-of-the-art optimization algorithms, including Samselect, TraverStringRef, PMS8, qPMS9, AlignACE, FMGA, and GSGA.
Affiliation(s)
- P Theepalakshmi
- Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamilnadu, India.
- U Srinivasulu Reddy
- Machine Learning and Data Analytics Lab, Center of Excellence in Artificial Intelligence, Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamilnadu, India
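For readers unfamiliar with the planted (ℓ, d) motif search problem named in entry 3, the sketch below states it as code: find every ℓ-mer that occurs in each input sequence with at most d mismatches. This is a plain exhaustive baseline written for illustration, not the Freezing Firefly algorithm, and all function names are hypothetical. It is only practical for small ℓ and d.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def neighbors(kmer, d, alphabet="ACGT"):
    """All strings within Hamming distance <= d of kmer."""
    result = {kmer}
    for k in range(1, d + 1):
        for positions in combinations(range(len(kmer)), k):
            for letters in product(alphabet, repeat=k):
                candidate = list(kmer)
                for pos, letter in zip(positions, letters):
                    candidate[pos] = letter
                result.add("".join(candidate))
    return result


def occurs_within(motif, seq, d):
    """True if some window of seq is within distance d of motif."""
    width = len(motif)
    return any(hamming(motif, seq[i:i + width]) <= d
               for i in range(len(seq) - width + 1))


def planted_motif_search(seqs, width, d):
    """Exhaustive (l, d) search: candidates come from the first sequence's d-neighborhoods."""
    first, rest = seqs[0], seqs[1:]
    candidates = set()
    for i in range(len(first) - width + 1):
        candidates |= neighbors(first[i:i + width], d)
    return sorted(m for m in candidates
                  if all(occurs_within(m, s, d) for s in rest))


if __name__ == "__main__":
    # "ACGTAC" is planted with at most one mutation in each toy sequence.
    toy = ["TTACGTACTT", "GGACGAACGG", "CCTCGTACCC"]
    print(planted_motif_search(toy, 6, 1))
```

Because any valid motif must lie within distance d of some window of the first sequence, restricting candidates to those neighborhoods loses no solutions. The cost still grows quickly with ℓ and d, which is the motivation for heuristic and specialized exact methods.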
4
Bailey TL. STREME: accurate and versatile sequence motif discovery. Bioinformatics 2021; 37:2834-2840. [PMID: 33760053 PMCID: PMC8479671 DOI: 10.1093/bioinformatics/btab203] [Citation(s) in RCA: 237] [Impact Index Per Article: 79.0] [Received: 12/15/2020] [Revised: 02/21/2021] [Accepted: 03/23/2021] [Indexed: 02/02/2023]
Abstract
MOTIVATION Sequence motif discovery algorithms can identify novel sequence patterns that perform biological functions in DNA, RNA and protein sequences, for example, the binding site motifs of DNA- and RNA-binding proteins. RESULTS The STREME algorithm presented here advances the state-of-the-art in ab initio motif discovery in terms of both accuracy and versatility. Using in vivo DNA (ChIP-seq) and RNA (CLIP-seq) data, and validating motifs with reference motifs derived from in vitro data, we show that STREME is more accurate, sensitive and thorough than several widely used algorithms (DREME, HOMER, MEME, Peak-motifs) and two other representative algorithms (ProSampler and Weeder). STREME's capabilities include the ability to find motifs in datasets with hundreds of thousands of sequences, to find both short and long motifs (from 3 to 30 positions), to perform differential motif discovery in pairs of sequence datasets, and to find motifs in sequences over virtually any alphabet (DNA, RNA, protein and user-defined alphabets). Unlike most motif discovery algorithms, STREME reports a useful estimate of the statistical significance of each motif it discovers. STREME is easy to use individually via its web server or via the command line, and is completely integrated with the widely used MEME Suite of sequence analysis tools. The name STREME stands for 'Simple, Thorough, Rapid, Enriched Motif Elicitation'. AVAILABILITY AND IMPLEMENTATION The STREME web server and source code are provided freely for non-commercial use at http://meme-suite.org. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Timothy L Bailey
- Department of Pharmacology, University of Nevada, Reno, NV 89557, USA
5
A noncanonical AR addiction drives enzalutamide resistance in prostate cancer. Nat Commun 2021; 12:1521. [PMID: 33750801 PMCID: PMC7943793 DOI: 10.1038/s41467-021-21860-7] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Received: 06/07/2020] [Accepted: 02/17/2021] [Indexed: 12/13/2022]
Abstract
Resistance to next-generation anti-androgen enzalutamide (ENZ) constitutes a major challenge for the treatment of castration-resistant prostate cancer (CRPC). By performing genome-wide ChIP-seq profiling in ENZ-resistant CRPC cells we identify a set of androgen receptor (AR) binding sites with increased AR binding intensity (ARBS-gained). While ARBS-gained loci lack the canonical androgen response elements (ARE) and pioneer factor FOXA1 binding motifs, they are highly enriched with CpG islands and the binding sites of unmethylated CpG dinucleotide-binding protein CXXC5 and the partner TET2. RNA-seq analysis reveals that both CXXC5 and its regulated genes including ID1 are upregulated in ENZ-resistant cell lines and these results are further confirmed in patient-derived xenografts (PDXs) and patient specimens. Consistent with the finding that ARBS-gained loci are highly enriched with H3K27ac modification, ENZ-resistant PCa cells, organoids, xenografts and PDXs are hyper-sensitive to NEO2734, a dual inhibitor of BET and CBP/p300 proteins. These results not only reveal a noncanonical AR function in acquisition of ENZ resistance, but also posit a treatment strategy to target this vulnerability in ENZ-resistant CRPC. Resistance to second generation anti-androgen therapies such as enzalutamide (ENZ) can emerge in prostate cancer patients. Here, the authors identify an ENZ-resistant mechanism driven by AR-dependent transcription of noncanonical targets that make resistant cells susceptible to dual inhibition of BET and CBP/p300 signaling.
6
Li Y, Ni P, Zhang S, Li G, Su Z. ProSampler: an ultrafast and accurate motif finder in large ChIP-seq datasets for combinatory motif discovery. Bioinformatics 2020; 35:4632-4639. [PMID: 31070745 DOI: 10.1093/bioinformatics/btz290] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 10/03/2018] [Revised: 03/29/2019] [Accepted: 04/18/2019] [Indexed: 01/25/2023]
Abstract
MOTIVATION The availability of numerous ChIP-seq datasets for transcription factors (TF) has provided an unprecedented opportunity to identify all TF binding sites in genomes. However, the progress has been hindered by the lack of a highly efficient and accurate tool to find not only the target motifs, but also cooperative motifs in very big datasets. RESULTS We herein present an ultrafast and accurate motif-finding algorithm, ProSampler, based on a novel numeration method and Gibbs sampler. ProSampler runs orders of magnitude faster than the fastest existing tools while often more accurately identifying motifs of both the target TFs and cooperators. Thus, ProSampler can greatly facilitate the efforts to identify the entire cis-regulatory code in genomes. AVAILABILITY AND IMPLEMENTATION Source code and binaries are freely available for download at https://github.com/zhengchangsulab/prosampler. It was implemented in C++ and supported on Linux, macOS and MS Windows platforms. SUPPLEMENTARY INFORMATION Supplementary materials are available at Bioinformatics online.
Affiliation(s)
- Yang Li
- School of Mathematics, Shandong University, Jinan 250100, China; Department of Bioinformatics and Genomics, College of Computing and Informatics, The University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- Pengyu Ni
- Department of Bioinformatics and Genomics, College of Computing and Informatics, The University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- Shaoqiang Zhang
- College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China
- Guojun Li
- School of Mathematics, Shandong University, Jinan 250100, China; Department of Bioinformatics and Genomics, College of Computing and Informatics, The University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- Zhengchang Su
- Department of Bioinformatics and Genomics, College of Computing and Informatics, The University of North Carolina at Charlotte, Charlotte, NC 28223, USA
7
Wylie DC, Hofmann HA, Zemelman BV. SArKS: de novo discovery of gene expression regulatory motif sites and domains by suffix array kernel smoothing. Bioinformatics 2020; 35:3944-3952. [PMID: 30903136 DOI: 10.1093/bioinformatics/btz198] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/12/2018] [Revised: 03/04/2019] [Accepted: 03/20/2019] [Indexed: 11/14/2022]
Abstract
MOTIVATION We set out to develop an algorithm that can mine differential gene expression data to identify candidate cell type-specific DNA regulatory sequences. Differential expression is usually quantified as a continuous score (fold-change, test-statistic, P-value) comparing biological classes. Unlike existing approaches, our de novo strategy, termed SArKS, applies non-parametric kernel smoothing to uncover promoter motif sites that correlate with elevated differential expression scores. SArKS detects motif k-mers by smoothing sequence scores over sequence similarity. A second round of smoothing over spatial proximity reveals multi-motif domains (MMDs). Discovered motif sites can then be merged or extended based on adjacency within MMDs. False positive rates are estimated and controlled by permutation testing. RESULTS We applied SArKS to published gene expression data representing distinct neocortical neuron classes in Mus musculus and interneuron developmental states in Homo sapiens. When benchmarked against several existing algorithms using a cross-validation procedure, SArKS identified larger motif sets that formed the basis for regression models with higher correlative power. AVAILABILITY AND IMPLEMENTATION https://github.com/denniscwylie/sarks. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dennis C Wylie
- Center for Computational Biology and Bioinformatics, University of Texas at Austin, Austin, TX, USA
- Hans A Hofmann
- Center for Computational Biology and Bioinformatics, University of Texas at Austin, Austin, TX, USA; Institute for Cellular and Molecular Biology, University of Texas at Austin, Austin, TX, USA; Department of Integrative Biology, University of Texas at Austin, Austin, TX, USA; Institute for Neuroscience, University of Texas at Austin, Austin, TX, USA
- Boris V Zemelman
- Institute for Cellular and Molecular Biology, University of Texas at Austin, Austin, TX, USA; Institute for Neuroscience, University of Texas at Austin, Austin, TX, USA; Department of Neuroscience, University of Texas at Austin, Austin, TX, USA; Center for Learning and Memory, University of Texas at Austin, Austin, TX, USA
8
Hashim FA, Houssein EH, Hussain K, Mabrouk MS, Al-Atabany W. A modified Henry gas solubility optimization for solving motif discovery problem. Neural Comput Appl 2019. [DOI: 10.1007/s00521-019-04611-0] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Indexed: 10/25/2022]
9
Sun CX, Yang Y, Wang H, Wang WH. A Clustering Approach for Motif Discovery in ChIP-Seq Dataset. Entropy (Basel) 2019; 21:E802. [PMID: 33267515 PMCID: PMC7515331 DOI: 10.3390/e21080802] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/06/2019] [Revised: 08/04/2019] [Accepted: 08/15/2019] [Indexed: 12/25/2022]
Abstract
Chromatin immunoprecipitation combined with next-generation sequencing (ChIP-Seq) technology has enabled the identification of transcription factor binding sites (TFBSs) on a genome-wide scale. To effectively and efficiently discover TFBSs in the thousand or more DNA sequences generated by a ChIP-Seq data set, we propose a new algorithm named AP-ChIP. First, we set two thresholds based on probabilistic analysis to construct and further filter the cluster subsets. Then, we use Affinity Propagation (AP) clustering on the candidate cluster subsets to find the potential motifs. Experimental results on simulated data show that the AP-ChIP algorithm is able to make an almost accurate prediction of TFBSs in a reasonable time. Also, the validity of the AP-ChIP algorithm is tested on a real ChIP-Seq data set.
Affiliation(s)
- Chun-xiao Sun
- College of Science, Northwest A&F University, Yangling 712100, China
- Yu Yang
- School of Computer Science, Pingdingshan University, Pingdingshan 467000, China
- School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
- Hua Wang
- College of Software, Nankai University, Tianjin 300071, China
- Department of Mathematical Sciences, Georgia Southern University, Statesboro, GA 30460, USA
- Wen-hu Wang
- School of Computer Science, Pingdingshan University, Pingdingshan 467000, China
10
Hashim FA, Mabrouk MS, Al-Atabany W. Review of Different Sequence Motif Finding Algorithms. Avicenna J Med Biotechnol 2019; 11:130-148. [PMID: 31057715 PMCID: PMC6490410] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/12/2018] [Accepted: 05/26/2018] [Indexed: 11/05/2022]
Abstract
The DNA motif discovery is a primary step in many systems for studying gene function. Motif discovery plays a vital role in identification of Transcription Factor Binding Sites (TFBSs) that help in learning the mechanisms for regulation of gene expression. Over the past decades, different algorithms were used to design fast and accurate motif discovery tools. These algorithms are generally classified into consensus or probabilistic approaches that many of them are time-consuming and easily trapped in a local optimum. Nature-inspired algorithms and many of combinatorial algorithms are recently proposed to overcome these problems. This paper presents a general classification of motif discovery algorithms with new sub-categories that facilitate building a successful motif discovery algorithm. It also presents a summary of comparison between them.
Affiliation(s)
- Fatma A. Hashim
- Department of Biomedical Engineering, Helwan University, Egypt
- Mai S. Mabrouk
- Department of Biomedical Engineering, Misr University for Science and Technology (MUST), Egypt
11
Hashim FA, Mabrouk MS, Atabany WA. Comparative Analysis of DNA Motif Discovery Algorithms: A Systemic Review. Current Cancer Therapy Reviews 2019. [DOI: 10.2174/1573394714666180417161728] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/22/2022]
Abstract
Background: Bioinformatics is an interdisciplinary field that combines biology and information technology to study how to deal with the biological data. The DNA motif discovery problem is the main challenge of genome biology and its importance is directly proportional to increasing sequencing technologies which produce large amounts of data. DNA motif is a repeated portion of DNA sequences of major biological interest with important structural and functional features. Motif discovery plays a vital role in the antibody-biomarker identification which is useful for diagnosis of disease and to identify Transcription Factor Binding Sites (TFBSs) that help in learning the mechanisms for regulation of gene expression. Recently, scientists discovered that the TFs have a mutation rate five times higher than the flanking sequences, so motif discovery also has a crucial role in cancer discovery.
Methods: Over the past decades, many attempts use different algorithms to design fast and accurate motif discovery tools. These algorithms are generally classified into consensus or probabilistic approach.
Results: Many of DNA motif discovery algorithms are time-consuming and easily trapped in a local optimum.
Conclusion: Nature-inspired algorithms and many of combinatorial algorithms are recently proposed to overcome the problems of consensus and probabilistic approaches. This paper presents a general classification of motif discovery algorithms with new sub-categories. It also presents a summary comparison between them.
Affiliation(s)
- Fatma A. Hashim
- Department of Biomedical Engineering, Helwan University, Helwan, Egypt
- Mai S. Mabrouk
- Department of Biomedical Engineering, Misr University for Science and Technology (MUST), Cairo, Egypt
12
Pei C, Wang SL, Fang J, Zhang W. GSMC: Combining Parallel Gibbs Sampling with Maximal Cliques for Hunting DNA Motif. J Comput Biol 2017; 24:1243-1253. [PMID: 29116820 DOI: 10.1089/cmb.2017.0100] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 01/01/2023]
Abstract
Regulatory elements are responsible for regulating gene transcription. Therefore, identification of these elements is a tremendous challenge in the field of gene expression. Transcription factors (TFs) play a key role in gene regulation by binding to target promoter sequences. A set of conserved sequence patterns with a highly similar structure that is bound by a TF is called a motif. Motif discovery has been a difficult problem over the past decades. Meanwhile, it is a foundation stone in meeting this challenge. Recent advances in obtaining genomic sequences and high-throughput gene expression analysis techniques have enabled the rapid development of computational methods for motif discovery. As a result, a large number of motif-finding algorithms aiming at various motif models have sprung up in the past few years. However, most of them are not suitable for analysis of the large data sets generated by next-generation sequencing. To better handle large-scale ChIP-Seq data and achieve better performance in computational time and motif detection accuracy, we propose an excellent motif-finding algorithm known as GSMC (Combining Parallel Gibbs Sampling with Maximal Cliques for hunting DNA Motif). The GSMC algorithm consists of two steps. First, we employ the commonly used Gibbs sampling to generating initial motifs. Second, we utilize maximal cliques to cluster motifs according to Similarity with Position Information Contents (SPIC). Consequently, we raise the detection accuracy in a great degree, in the meantime holding comparative computation efficiency. In addition, we can find much more credible cofactor interacting motifs.
Affiliation(s)
- Chao Pei
- College of Computer Science and Electronics Engineering, Hunan University, Changsha, China
- Shu-Lin Wang
- College of Computer Science and Electronics Engineering, Hunan University, Changsha, China
- Jianwen Fang
- Biometric Research Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute, Rockville, MD 20850
- Wei Zhang
- College of Computer Science and Electronics Engineering, Hunan University, Changsha, China
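The abstract of entry 12 names Gibbs sampling as the first step of GSMC but gives no details. The sketch below shows a generic, textbook-style Gibbs sampler for a single fixed-width motif, as a reminder of what that step typically looks like. It is not the GSMC implementation: the parallelization, the maximal-clique clustering, and the SPIC similarity measure are not covered, and the function name, pseudocount of 1, and iteration count are assumptions. Input sequences are assumed to contain only the uppercase letters A, C, G, and T.

```python
import random


def gibbs_motif_sampler(seqs, width, iterations=200, seed=0):
    """Return one motif start position per sequence after Gibbs sampling."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    # Start from random motif positions in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        held_out = rng.randrange(len(seqs))
        # Position-specific counts (pseudocount 1) from all other sequences.
        counts = [{base: 1.0 for base in alphabet} for _ in range(width)]
        for idx, (seq, start) in enumerate(zip(seqs, starts)):
            if idx == held_out:
                continue
            for j in range(width):
                counts[j][seq[start + j]] += 1.0
        total = (len(seqs) - 1) + len(alphabet)
        # Score every window of the held-out sequence under the profile.
        target = seqs[held_out]
        weights = []
        for start in range(len(target) - width + 1):
            weight = 1.0
            for j in range(width):
                weight *= counts[j][target[start + j]] / total
            weights.append(weight)
        # Resample the held-out position in proportion to its score.
        starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


if __name__ == "__main__":
    toy = ["TTTTACGTACTTTT", "GGGGACGTACGGGG", "CCCCACGTACCCCC"]
    positions = gibbs_motif_sampler(toy, 6)
    print([s[p:p + 6] for s, p in zip(toy, positions)])
```

Sampling the new position rather than always taking the best-scoring window is what lets the procedure escape poor starting points, although a single run can still settle on a shifted or spurious motif, so several restarts are common in practice.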
13
Reinert K, Dadi TH, Ehrhardt M, Hauswedell H, Mehringer S, Rahn R, Kim J, Pockrandt C, Winkler J, Siragusa E, Urgese G, Weese D. The SeqAn C++ template library for efficient sequence analysis: A resource for programmers. J Biotechnol 2017; 261:157-168. [PMID: 28888961 DOI: 10.1016/j.jbiotec.2017.07.017] [Citation(s) in RCA: 67] [Impact Index Per Article: 9.6] [Received: 02/26/2017] [Revised: 07/17/2017] [Accepted: 07/19/2017] [Indexed: 11/27/2022]
Abstract
BACKGROUND The use of novel algorithmic techniques is pivotal to many important problems in life science. For example the sequencing of the human genome (Venter et al., 2001) would not have been possible without advanced assembly algorithms and the development of practical BWT based read mappers have been instrumental for NGS analysis. However, owing to the high speed of technological progress and the urgent need for bioinformatics tools, there was a widening gap between state-of-the-art algorithmic techniques and the actual algorithmic components of tools that are in widespread use. We previously addressed this by introducing the SeqAn library of efficient data types and algorithms in 2008 (Döring et al., 2008). RESULTS The SeqAn library has matured considerably since its first publication 9 years ago. In this article we review its status as an established resource for programmers in the field of sequence analysis and its contributions to many analysis tools. CONCLUSIONS We anticipate that SeqAn will continue to be a valuable resource, especially since it started to actively support various hardware acceleration techniques in a systematic manner.
Affiliation(s)
- Knut Reinert
- Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany.
- Temesgen Hailemariam Dadi
- Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany
| | - Marcel Ehrhardt
- Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany
| | - Hannes Hauswedell
- Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany
| | - Svenja Mehringer
- Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany
| | - René Rahn
- Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, Takustrasse 9, 14195 Berlin, Germany
| | - Jongkyu Kim
- Efficient Algorithms for -Omics Data, Max Planck Institute for Molecular Genetics, Ihnestrasse 62-73, 14195 Berlin, Germany
| | - Christopher Pockrandt
- Efficient Algorithms for -Omics Data, Max Planck Institute for Molecular Genetics, Ihnestrasse 62-73, 14195 Berlin, Germany
| | - Jörg Winkler
- Efficient Algorithms for -Omics Data, Max Planck Institute for Molecular Genetics, Ihnestrasse 62-73, 14195 Berlin, Germany
| | | | - Gianvito Urgese
- Department of Control and Computer Engineering, Politecnico di Torino, Italy
| | | |
Collapse
|
14
|
ATF3 negatively regulates cellular antiviral signaling and autophagy in the absence of type I interferons. Sci Rep 2017; 7:8789. [PMID: 28821775 PMCID: PMC5562757 DOI: 10.1038/s41598-017-08584-9] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2017] [Accepted: 07/21/2017] [Indexed: 01/19/2023] Open
Abstract
Stringent regulation of antiviral signaling and cellular autophagy is critical for the host response to virus infection. However, little is known about how these cellular processes are regulated in the absence of type I interferon signaling. Here, we show that ATF3 is induced following Japanese encephalitis virus (JEV) infection, and regulates cellular antiviral and autophagy pathways in the absence of type I interferons in mouse neuronal cells. We have identified new targets of ATF3 and show that it binds to the promoter regions of Stat1, Irf9, Isg15 and Atg5, thereby inhibiting cellular antiviral signaling and autophagy. Consistent with these observations, ATF3-depleted cells showed enhanced antiviral responses and induction of robust autophagy. Furthermore, we show that JEV replication was significantly reduced in ATF3-depleted cells. Our findings identify ATF3 as a negative regulator of antiviral signaling and cellular autophagy in mammalian cells, and demonstrate its important role in the JEV life cycle.
|
15
|
Fu H, Zhang X. Noncoding Variants Functional Prioritization Methods Based on Predicted Regulatory Factor Binding Sites. Curr Genomics 2017; 18:322-331. [PMID: 29081688 PMCID: PMC5635616 DOI: 10.2174/1389202918666170228143619] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2016] [Revised: 10/16/2016] [Accepted: 11/02/2016] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND With the advent of the post-genomic era, research into the genetic mechanisms of disease has come to depend increasingly on studies of genes, gene networks, and gene-protein interaction networks. To explore gene expression and regulation, researchers have carried out many studies on transcription factors and their binding sites (TFBSs). Based on the large number of TFBS prediction values produced by deep learning models, further computation and analysis have been done to reveal the relationship between gene mutation and the occurrence of disease. It has been demonstrated that deep learning methods outperform conventional methods in predicting the functions of noncoding variants. Research on predicting the functions of single nucleotide polymorphisms (SNPs) is expected to uncover how gene mutations affect human traits and diseases. RESULTS We reviewed the conventional TFBS identification methods from different perspectives. For the deep learning methods that predict TFBSs, we discussed related problems such as raw data preprocessing, the structural design of the deep convolutional neural network (CNN), and model performance measures. We then summarized the techniques usually used to find functional noncoding variants from de novo sequence. CONCLUSION With the rapid development of high-throughput assays, more sample data and chromatin features should help improve the prediction accuracy of deep CNNs for TFBS identification. Meanwhile, gaining more insight into the deep CNN framework itself has proved useful both for improving model performance and for developing designs better suited to the sample data. Based on the feature values predicted by the deep CNN model, a prioritization model for functional noncoding variants would help reveal the effect of gene mutation on disease.
Affiliation(s)
- Haoyue Fu
- College of Sciences, Northeastern University, Shenyang, China
- Lianping Yang
- College of Sciences, Northeastern University, Shenyang, China
- University of Southern California, Dept. Biol. Sci., Program Mol & Computat Biol, USA
- Xiangde Zhang
- College of Sciences, Northeastern University, Shenyang, China
|
16
|
Liu B, Yang J, Li Y, McDermaid A, Ma Q. An algorithmic perspective of de novo cis-regulatory motif finding based on ChIP-seq data. Brief Bioinform 2017; 19:1069-1081. [DOI: 10.1093/bib/bbx026] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2016] [Indexed: 01/06/2023] Open
Affiliation(s)
- Bingqiang Liu
- School of Mathematics, Shandong University, Jinan Shandong, P. R. China
- Jinyu Yang
- Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, USA
- Yang Li
- School of Mathematics, Shandong University, Jinan Shandong, P. R. China
- Adam McDermaid
- Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, USA
- Qin Ma
- Department of Agronomy, Horticulture and Plant Science, South Dakota State University, Brookings, SD, USA
|
17
|
Yu Q, Huo H, Feng D. PairMotifChIP: A Fast Algorithm for Discovery of Patterns Conserved in Large ChIP-seq Data Sets. BIOMED RESEARCH INTERNATIONAL 2016; 2016:4986707. [PMID: 27843946 PMCID: PMC5098105 DOI: 10.1155/2016/4986707] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/22/2016] [Revised: 09/04/2016] [Accepted: 09/27/2016] [Indexed: 11/18/2022]
Abstract
Identifying conserved patterns in DNA sequences, namely, motif discovery, is an important and challenging computational task. With hundreds or more sequences contained, the high-throughput sequencing data set is helpful to improve the identification accuracy of motif discovery but requires an even higher computing performance. To efficiently identify motifs in large DNA data sets, a new algorithm called PairMotifChIP is proposed by extracting and combining pairs of l-mers in the input with relatively small Hamming distance. In particular, a method for rapidly extracting pairs of l-mers is designed, which can be used not only for PairMotifChIP, but also for other DNA data mining tasks with the same demand. Experimental results on the simulated data show that the proposed algorithm can find motifs successfully and runs faster than the state-of-the-art motif discovery algorithms. Furthermore, the validity of the proposed algorithm has been verified on real data.
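To make the pair-extraction idea concrete, the following is a minimal brute-force sketch in Python that collects pairs of l-mers whose Hamming distance is at most a threshold. It is not the PairMotifChIP implementation, whose rapid extraction method is not detailed in the abstract. The function names (`hamming`, `lmers`, `close_pairs`), the parameter `max_dist`, and the choice to pair only l-mers taken from different sequences are all assumptions made for illustration.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("l-mers must have equal length")
    return sum(x != y for x, y in zip(a, b))


def lmers(seq: str, l: int):
    """Yield (offset, l-mer) for every substring of length l in seq."""
    for i in range(len(seq) - l + 1):
        yield i, seq[i:i + l]


def close_pairs(seqs, l, max_dist):
    """Return l-mer pairs from different sequences within max_dist mismatches.

    Each pair is ((seq_index, offset, lmer), (seq_index, offset, lmer)).
    This compares every l-mer with every other one, so it is quadratic in
    the input size and only suitable for small examples (hypothetical helper).
    """
    pairs = []
    for (si, a), (sj, b) in combinations(enumerate(seqs), 2):
        for i, x in lmers(a, l):
            for j, y in lmers(b, l):
                if hamming(x, y) <= max_dist:
                    pairs.append(((si, i, x), (sj, j, y)))
    return pairs


if __name__ == "__main__":
    for pair in close_pairs(["ACGTAC", "TTGTAA"], l=4, max_dist=1):
        print(pair)
```

The quadratic comparison is what makes the naive approach slow on large ChIP-seq inputs, which is the cost a faster pair-extraction method would aim to avoid.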
Affiliation(s)
- Qiang Yu
- School of Computer Science and Technology, Xidian University, Xi'an 710071, China
- Hongwei Huo
- School of Computer Science and Technology, Xidian University, Xi'an 710071, China
- Dazheng Feng
- School of Electronic Engineering, Xidian University, Xi'an 710071, China
|
18
|
Ye Z, Chen Z, Sunkel B, Frietze S, Huang THM, Wang Q, Jin VX. Genome-wide analysis reveals positional-nucleosome-oriented binding pattern of pioneer factor FOXA1. Nucleic Acids Res 2016; 44:7540-54. [PMID: 27458208 PMCID: PMC5027512 DOI: 10.1093/nar/gkw659] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2016] [Accepted: 07/12/2016] [Indexed: 11/24/2022] Open
Abstract
The compaction of nucleosomal structures creates a barrier for DNA-binding transcription factors (TFs) to access their cognate cis-regulatory elements. Pioneer factors (PFs) such as FOXA1 are able to directly access these cis-targets within compact chromatin. However, how these PFs interplay with nucleosomes remains to be elucidated, and is critical for us to understand the underlying mechanism of gene regulation. Here, we have conducted a computational analysis of strand-specific paired-end ChIP-exo (termed ChIP-ePENS) data of FOXA1 in LNCaP cells using our novel algorithm ePEST. We find that FOXA1 chromatin binding occurs via four distinct border modes (or footprint boundary patterns), with preferential footprint boundary patterns relative to FOXA1 motif orientation. In addition, from this analysis three fundamental nucleotide positions (oG, oS and oH) emerged as major determinants for blocking exo-digestion and forming these four distinct border modes. By integrating histone MNase-seq data, we found that an astonishingly consistent, ‘well-positioned’ configuration occurs between FOXA1 motifs and dyads of nucleosomes genome-wide. We further performed ChIP-seq of eight chromatin remodelers and found an increased occupancy of these remodelers on FOXA1 motifs for all four border modes (or footprint boundary patterns), indicating that the full occupancy of the FOXA1 complex on the three blocking sites (oG, oS and oH) likely produces an active regulatory status with well-positioned phasing for protein binding events. Together, our results suggest a positional-nucleosome-oriented accessing model for PFs seeking target motifs, in which FOXA1 can examine each underlying DNA nucleotide and is able to sense all potential motifs regardless of whether they face inward or outward from histone octamers along the DNA helix axis.
Affiliation(s)
- Zhenqing Ye
- Department of Molecular Medicine, University of Texas Health Science Center at San Antonio, TX 78229, USA
- Zhong Chen
- Department of Molecular Virology, Immunology and Medical Genetics, The Ohio State University College of Medicine, OH 43210, USA; Comprehensive Cancer Center, The Ohio State University College of Medicine, OH 43210, USA
- Benjamin Sunkel
- Department of Molecular Virology, Immunology and Medical Genetics, The Ohio State University College of Medicine, OH 43210, USA; Comprehensive Cancer Center, The Ohio State University College of Medicine, OH 43210, USA
- Seth Frietze
- MLRS Department, University of Vermont, VT 05405, USA
- Tim H-M Huang
- Department of Molecular Medicine, University of Texas Health Science Center at San Antonio, TX 78229, USA
- Qianben Wang
- Department of Molecular Virology, Immunology and Medical Genetics, The Ohio State University College of Medicine, OH 43210, USA; Comprehensive Cancer Center, The Ohio State University College of Medicine, OH 43210, USA
- Victor X Jin
- Department of Molecular Medicine, University of Texas Health Science Center at San Antonio, TX 78229, USA
|
19
|
MOCCS: Clarifying DNA-binding motif ambiguity using ChIP-Seq data. Comput Biol Chem 2016; 63:62-72. [PMID: 26971251 DOI: 10.1016/j.compbiolchem.2016.01.014] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2016] [Accepted: 01/25/2016] [Indexed: 11/21/2022]
Abstract
BACKGROUND As a key mechanism of gene regulation, transcription factors (TFs) bind to DNA by recognizing specific short sequence patterns that are called DNA-binding motifs. A single TF can accept ambiguity within its DNA-binding motifs, which comprise both canonical (typical) and non-canonical motifs. Clarification of such DNA-binding motif ambiguity is crucial for revealing gene regulatory networks and evaluating mutations in cis-regulatory elements. Although chromatin immunoprecipitation sequencing (ChIP-seq) now provides abundant data on the genomic sequences to which a given TF binds, existing motif discovery methods are unable to directly answer whether a given TF can bind to a specific DNA-binding motif. RESULTS Here, we report a method for clarifying the DNA-binding motif ambiguity, MOCCS. Given ChIP-Seq data of any TF, MOCCS comprehensively analyzes and describes every k-mer to which that TF binds. Analysis of simulated datasets revealed that MOCCS is applicable to various ChIP-Seq datasets, requiring only a few minutes per dataset. Application to the ENCODE ChIP-Seq datasets proved that MOCCS directly evaluates whether a given TF binds to each DNA-binding motif, even if known position weight matrix models do not provide sufficient information on DNA-binding motif ambiguity. Furthermore, users are not required to provide numerous parameters or background genomic sequence models that are typically unavailable. MOCCS is implemented in Perl and R and is freely available via https://github.com/yuifu/moccs. CONCLUSIONS By complementing existing motif-discovery software, MOCCS will contribute to the basic understanding of how the genome controls diverse cellular processes via DNA-protein interactions.
|
20
|
Zhang Y, Wang P. A Fast Cluster Motif Finding Algorithm for ChIP-Seq Data Sets. BIOMED RESEARCH INTERNATIONAL 2015; 2015:218068. [PMID: 26236718 PMCID: PMC4509496 DOI: 10.1155/2015/218068] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/08/2015] [Accepted: 06/04/2015] [Indexed: 11/17/2022]
Abstract
The new high-throughput technique ChIP-seq, which couples chromatin immunoprecipitation experiments with high-throughput sequencing technologies, has extended the identification of transcription factor binding locations to genome-wide regions. However, most existing motif discovery algorithms are time-consuming and limited in their ability to identify binding motifs in ChIP-seq data, which are typically large in scale. To improve efficiency, we propose a fast cluster motif finding algorithm, named FCmotif, to identify (l, d) motifs in large-scale ChIP-seq data sets. It is inspired by the emerging substrings mining strategy: it finds enriched substrings and then searches their neighborhood instances to construct PWMs and cluster motifs of different lengths. FCmotif does not follow the OOPS model constraint and can find long motifs. The effectiveness of the proposed algorithm has been demonstrated by experiments on ChIP-seq data sets from mouse ES cells. Detection of the real binding motifs and processing of the full-size data of several megabytes finished in a few minutes. The experimental results show that FCmotif is advantageous for (l, d) motif finding in ChIP-seq data; it also demonstrates better performance than other widely used algorithms such as MEME, Weeder, ChIPMunk, and DREME.
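The PWM construction step mentioned in this abstract can be illustrated with a short Python sketch that turns a set of aligned, equal-length motif instances into a position weight matrix of per-position base frequencies. This is a generic textbook construction, not FCmotif's code. The function name `build_pwm` and the `pseudocount` parameter are assumptions introduced here, and how FCmotif selects or weights its instances is not described in the abstract.

```python
def build_pwm(instances, alphabet="ACGT", pseudocount=1.0):
    """Build a position weight matrix from aligned, equal-length instances.

    Returns a list with one dict per motif position, mapping each base to
    its (pseudocount-smoothed) relative frequency at that position.
    Hypothetical helper for illustration; bases outside `alphabet` raise
    a KeyError.
    """
    if not instances:
        raise ValueError("at least one instance is required")
    length = len(instances[0])
    if any(len(s) != length for s in instances):
        raise ValueError("all instances must have the same length")

    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so no frequency is exactly zero
        # unless the caller passes pseudocount=0.
        counts = {base: pseudocount for base in alphabet}
        for s in instances:
            counts[s[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: counts[base] / total for base in alphabet})
    return pwm


if __name__ == "__main__":
    for column in build_pwm(["ACGT", "ACGA", "TCGT"]):
        print({base: round(p, 3) for base, p in column.items()})
```

A nonzero pseudocount keeps unseen bases from receiving probability zero, which matters when the matrix is later used to score new candidate sites.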
Affiliation(s)
- Yipu Zhang
- Department of Automation, School of Electronics and Control Engineering, Chang'An University, Xi'an 710064, China
- Ping Wang
- Department of Automation, School of Electronics and Control Engineering, Chang'An University, Xi'an 710064, China
|
21
|
Zhang Y, He Y, Zheng G, Wei C. MOST+: A de novo motif finding approach combining genomic sequence and heterogeneous genome-wide signatures. BMC Genomics 2015; 16 Suppl 7:S13. [PMID: 26099518 PMCID: PMC4474412 DOI: 10.1186/1471-2164-16-s7-s13] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/09/2023] Open
Abstract
Background Motifs are regulatory elements that will activate or inhibit the expression of related genes when proteins (such as transcription factors, TFs) bind to them. Therefore, motif finding is important to understand the mechanisms of gene regulation. De novo discovery of regulatory elements, like transcription factor binding sites (TFBSs), has long been a major challenge in gaining insight into mechanisms of gene regulation. Recent advances in experimental profiling of genome-wide signals such as histone modifications and DNase I hypersensitivity sites allow scientists to develop better computational methods to enhance motif discovery. However, existing methods for motif finding suffer from high false positive rates and slow speed, and it is difficult to evaluate the performance of these methods systematically. Result Here we present MOST+, a motif finder integrating genomic sequences and genome-wide signals such as intensity and shape features from histone modification marks and DNase I hypersensitivity sites, to improve the prediction accuracy. MOST+ can detect motifs from a large input sequence of about 100 Mb within a few minutes. A systematic comparison method has been established and MOST+ has been compared with existing methods. Conclusion MOST+ is a fast and accurate de novo method for motif finding by integrating genomic sequence and experimental signals as clues.
|
22
|
Colombo N, Vlassis N. FastMotif: spectral sequence motif discovery. Bioinformatics 2015; 31:2623-31. [PMID: 25886979 DOI: 10.1093/bioinformatics/btv208] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2014] [Accepted: 04/09/2015] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Sequence discovery tools play a central role in several fields of computational biology. In the framework of Transcription Factor binding studies, most of the existing motif finding algorithms are computationally demanding, and they may not be able to support the increasingly large datasets produced by modern high-throughput sequencing technologies. RESULTS We present FastMotif, a new motif discovery algorithm that is built on a recent machine learning technique referred to as Method of Moments. Based on spectral decompositions, our method is robust to model misspecifications and is not prone to locally optimal solutions. We obtain an algorithm that is extremely fast and designed for the analysis of big sequencing data. On HT-Selex data, FastMotif extracts motif profiles that match those computed by various state-of-the-art algorithms, but one order of magnitude faster. We provide a theoretical and numerical analysis of the algorithm's robustness and discuss its sensitivity with respect to the free parameters. AVAILABILITY AND IMPLEMENTATION The Matlab code of FastMotif is available from http://lcsb-portal.uni.lu/bioinformatics. CONTACT vlassis@adobe.com SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Nicoló Colombo
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Luxembourg and
|
23
|
Ikebata H, Yoshida R. Repulsive parallel MCMC algorithm for discovering diverse motifs from large sequence sets. Bioinformatics 2015; 31:1561-8. [PMID: 25583120 PMCID: PMC4426842 DOI: 10.1093/bioinformatics/btv017] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2014] [Accepted: 01/06/2015] [Indexed: 11/14/2022] Open
Abstract
Motivation: The motif discovery problem consists of finding recurring patterns of short strings in a set of nucleotide sequences. This classical problem is receiving renewed attention as most early motif discovery methods lack the ability to handle large data of recent genome-wide ChIP studies. New ChIP-tailored methods focus on reducing computation time and pay little regard to the accuracy of motif detection. Unlike such methods, our method focuses on increasing the detection accuracy while maintaining the computation efficiency at an acceptable level. The major advantage of our method is that it can mine diverse multiple motifs undetectable by current methods. Results: The repulsive parallel Markov chain Monte Carlo (RPMCMC) algorithm that we propose is a parallel version of the widely used Gibbs motif sampler. RPMCMC is run on parallel interacting motif samplers. A repulsive force is generated when different motifs produced by different samplers come near each other. Thus, different samplers explore different motifs. In this way, we can detect much more diverse motifs than conventional methods can. Through application to 228 transcription factor ChIP-seq datasets of the ENCODE project, we show that the RPMCMC algorithm can find many reliable cofactor interacting motifs that existing methods are unable to discover. Availability and implementation: A C++ implementation of RPMCMC and discovered cofactor motifs for the 228 ENCODE ChIP-seq datasets are available from http://daweb.ism.ac.jp/yoshidalab/motif. Contact: ikebata.hisaki@ism.ac.jp, yoshidar@ism.ac.jp Supplementary information: Supplementary data are available from Bioinformatics online.
Affiliation(s)
- Hisaki Ikebata
- Department of Statistical Science, The Graduate University for Advanced Studies (Sokendai), 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan, Department of Statistical Modeling, The Institute of Statistical Mathematics, Research Organization of Information and Systems, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan, JST-CREST, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan, JST-ERATO Sato Live Bio-Forecasting Project, 2-2-2 Hikaridai Seika-cho, Soraku-gun, Khoto-fu 619-0288, Japan and The Thomas N. Sato BioMEC-X Laboratories, Advanced Telecommunications Research Institute International, 2-2-2 Hikaridai Seika-cho, Soraku-gun, Khoto-fu 619-0288, Japan
- Ryo Yoshida
- Department of Statistical Science, The Graduate University for Advanced Studies (Sokendai), 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan, Department of Statistical Modeling, The Institute of Statistical Mathematics, Research Organization of Information and Systems, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan, JST-CREST, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan, JST-ERATO Sato Live Bio-Forecasting Project, 2-2-2 Hikaridai Seika-cho, Soraku-gun, Khoto-fu 619-0288, Japan and The Thomas N. Sato BioMEC-X Laboratories, Advanced Telecommunications Research Institute International, 2-2-2 Hikaridai Seika-cho, Soraku-gun, Khoto-fu 619-0288, Japan
|
24
|
Niu M, Tabari ES, Su Z. De novo prediction of cis-regulatory elements and modules through integrative analysis of a large number of ChIP datasets. BMC Genomics 2014; 15:1047. [PMID: 25442502 PMCID: PMC4265420 DOI: 10.1186/1471-2164-15-1047] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2014] [Accepted: 11/19/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In eukaryotes, transcriptional regulation is usually mediated by interactions of multiple transcription factors (TFs) with their respective specific cis-regulatory elements (CREs) in the so-called cis-regulatory modules (CRMs) in DNA. Although the knowledge of CREs and CRMs in a genome is crucial to elucidate gene regulatory networks and understand many important biological phenomena, little is known about the CREs and CRMs in most eukaryotic genomes due to the difficulty of characterizing them by either computational or traditional experimental methods. However, the exponentially increasing number of TF binding location data produced by the recent wide adoption of chromatin immunoprecipitation coupled with microarray hybridization (ChIP-chip) or high-throughput sequencing (ChIP-seq) technologies has provided an unprecedented opportunity to identify CRMs and CREs in genomes. Nonetheless, how to effectively mine these large volumes of ChIP data to identify CREs and CRMs at nucleotide resolution is a highly challenging task. RESULTS We have developed a novel graph-theoretic algorithm, DePCRM, for genome-wide de novo predictions of CREs and CRMs using a large number of ChIP datasets. DePCRM predicts CREs and CRMs by identifying overrepresented combinatorial CRE motif patterns in multiple ChIP datasets in an effective way. When applied to 168 ChIP datasets of 56 TFs from D. melanogaster, DePCRM identified 184 and 746 overrepresented CRE motifs and their combinatorial patterns, respectively, and predicted a total of 115,932 CRMs in the genome. The predictions recover 77.9% of known CRMs in the datasets and 89.3% of known CRMs containing at least one predicted CRE. We found that the putative CRMs as well as CREs as a whole in a CRM are more conserved than randomly selected sequences. CONCLUSION Our results suggest that the CRMs predicted by DePCRM are highly likely to be functional. Our algorithm is the first of its kind for de novo genome-wide prediction of CREs and CRMs using a large number of transcription factor ChIP datasets. The algorithm and predictions will hopefully facilitate the elucidation of gene regulatory networks in eukaryotes. All the predicted CREs, CRMs, and their target genes are available at http://bioinfo.uncc.edu/mniu/pcrms/www/.
Affiliation(s)
- Zhengchang Su
- Department of Bioinformatics and Genomics, College of Computing and Informatics, The University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, NC 28223, USA
|
25
|
Reid JE, Wernisch L. STEME: a robust, accurate motif finder for large data sets. PLoS One 2014; 9:e90735. [PMID: 24625410 PMCID: PMC3953122 DOI: 10.1371/journal.pone.0090735] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2014] [Accepted: 02/04/2014] [Indexed: 11/19/2022] Open
Abstract
Motif finding is a difficult problem that has been studied for over 20 years. Some older popular motif finders are not suitable for analysis of the large data sets generated by next-generation sequencing. We recently published an efficient approximation (STEME) to the EM algorithm that is at the core of many motif finders such as MEME. This approximation allows the EM algorithm to be applied to large data sets. In this work we describe several efficient extensions to STEME that are based on the MEME algorithm. Together with the original STEME EM approximation, these extensions make STEME a fully-fledged motif finder with similar properties to MEME. We discuss the difficulty of objectively comparing motif finders. We show that STEME performs comparably to existing prominent discriminative motif finders, DREME and Trawler, on 13 sets of transcription factor binding data in mouse ES cells. We demonstrate the ability of STEME to find long degenerate motifs which these discriminative motif finders do not find. As part of our method, we extend an earlier method due to Nagarajan et al. for the efficient calculation of motif E-values. STEME's source code is available under an open source license and STEME is available via a web interface.
Affiliation(s)
- John E. Reid
- MRC Biostatistics Unit, Institute of Public Health, Cambridge, United Kingdom
- Lorenz Wernisch
- MRC Biostatistics Unit, Institute of Public Health, Cambridge, United Kingdom
|
26
|
Abstract
MOTIVATION Identifying regulatory elements is a fundamental problem in the field of gene transcription. Motif discovery (the task of identifying the sequence preference of transcription factor proteins, which bind to these elements) is an important step in this challenge. MEME is a popular motif discovery algorithm. Unfortunately, MEME's running time scales poorly with the size of the dataset. Experiments such as ChIP-Seq and DNase-Seq are providing a rich amount of information on the binding preference of transcription factors. MEME cannot discover motifs in data from these experiments in a practical amount of time without a compromising strategy such as discarding a majority of the sequences. RESULTS We present EXTREME, a motif discovery algorithm designed to find DNA-binding motifs in ChIP-Seq and DNase-Seq data. Unlike MEME, which uses the expectation-maximization algorithm for motif discovery, EXTREME uses the online expectation-maximization algorithm to discover motifs. EXTREME can discover motifs in large datasets in a practical amount of time without discarding any sequences. Using EXTREME on ChIP-Seq and DNase-Seq data, we discover many motifs, including some novel and infrequent motifs that can only be discovered by using the entire dataset. Conservation analysis of one of these novel infrequent motifs confirms that it is evolutionarily conserved and possibly functional. AVAILABILITY AND IMPLEMENTATION All source code is available at the Github repository http://github.com/uci-cbcl/EXTREME.
Affiliation(s)
- Daniel Quang
- Department of Computer Science, University of California, Irvine, CA 92697, USA and Center for Complex Biological Systems, University of California, Irvine, CA 92697, USA
- Xiaohui Xie
- Department of Computer Science, University of California, Irvine, CA 92697, USA and Center for Complex Biological Systems, University of California, Irvine, CA 92697, USA
|
27
|
Jia C, Carson MB, Wang Y, Lin Y, Lu H. A new exhaustive method and strategy for finding motifs in ChIP-enriched regions. PLoS One 2014; 9:e86044. [PMID: 24475069 PMCID: PMC3901781 DOI: 10.1371/journal.pone.0086044] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Received: 06/03/2013] [Accepted: 12/04/2013] [Indexed: 12/22/2022]
Abstract
ChIP-seq, which combines chromatin immunoprecipitation (ChIP) with next-generation parallel sequencing, allows for the genome-wide identification of protein-DNA interactions. This technology poses new challenges for the development of novel motif-finding algorithms and methods for determining exact protein-DNA binding sites from ChIP-enriched sequencing data. State-of-the-art heuristic and exhaustive search algorithms are limited to identifying short (l, d) motifs (l ≤ 10, d ≤ 2) contained in ChIP-enriched regions. In this work we have developed a more powerful exhaustive method (FMotif) for finding long (l, d) motifs in DNA sequences. In conjunction with our method, we have adopted a simple ChIP-enriched sampling strategy for finding these motifs in large-scale ChIP-enriched regions. Empirical studies on synthetic samples and applications using several ChIP data sets, including 16 TF (transcription factor) ChIP-seq data sets and five TF ChIP-exo data sets, have demonstrated that our proposed method is capable of finding these motifs with high efficiency and accuracy. The source code for FMotif is available at http://211.71.76.45/FMotif/.
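To make the (l, d) notation concrete: an (l, d) motif is commonly defined as a length-l string that occurs in every input sequence with at most d mismatches. The sketch below is a naive brute-force search under that definition, not the FMotif algorithm; the function names (`hamming`, `neighbors`, `find_ld_motifs`) are hypothetical and chosen only for illustration. Its cost grows quickly with l and d, which is consistent with the abstract's point that exhaustive approaches have been restricted to short motifs.

```python
from itertools import combinations, product

ALPHABET = "ACGT"

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def neighbors(kmer, d):
    """All strings within Hamming distance d of kmer."""
    result = {kmer}
    for k in range(1, d + 1):
        for positions in combinations(range(len(kmer)), k):
            for letters in product(ALPHABET, repeat=k):
                candidate = list(kmer)
                for pos, letter in zip(positions, letters):
                    candidate[pos] = letter
                result.add("".join(candidate))
    return result

def occurs_within(motif, seq, d):
    """True if some l-mer of seq is within Hamming distance d of motif."""
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))

def find_ld_motifs(seqs, l, d):
    """Brute force: candidates are the d-neighbors of the first sequence's
    l-mers; keep those that also occur (up to d mismatches) in every
    other sequence."""
    first = seqs[0]
    candidates = set()
    for i in range(len(first) - l + 1):
        candidates |= neighbors(first[i:i + l], d)
    return sorted(m for m in candidates
                  if all(occurs_within(m, s, d) for s in seqs[1:]))

if __name__ == "__main__":
    toy = ["TTACGTTT", "GGACCTGG", "CCAGGTCC"]  # made-up toy sequences
    print(find_ld_motifs(toy, l=4, d=1))
```

The toy sequences are invented for the example; on real ChIP-enriched regions a sampling step or a smarter search structure would be needed to keep the candidate set tractable.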
Affiliation(s)
- Caiyan Jia
- School of Computer and Information Technology & Beijing Key Lab of Traffic Data Analysis, Beijing Jiaotong University, Beijing, China
- Department of Bioengineering/Bioinformatics, University of Illinois at Chicago, Chicago, Illinois, United States of America
- Matthew B. Carson
- Center for Healthcare Studies, Institute for Public Health and Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, United States of America
- Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, United States of America
- Yang Wang
- School of Computer and Information Technology & Beijing Key Lab of Traffic Data Analysis, Beijing Jiaotong University, Beijing, China
- Youfang Lin
- School of Computer and Information Technology & Beijing Key Lab of Traffic Data Analysis, Beijing Jiaotong University, Beijing, China
- Hui Lu
- Department of Bioengineering/Bioinformatics, University of Illinois at Chicago, Chicago, Illinois, United States of America
- Shanghai Institute of Medical Genetics, Shanghai Children’s Hospital, Shanghai JiaoTong University, Shanghai, China
|
28
|
Zhang Z, Chang CW, Hugo W, Cheung E, Sung WK. Simultaneously learning DNA motif along with its position and sequence rank preferences through expectation maximization algorithm. J Comput Biol 2014; 20:237-48. [PMID: 23461573 DOI: 10.1089/cmb.2012.0233] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Indexed: 01/19/2023]
Abstract
Although de novo motifs can be discovered through mining over-represented sequence patterns, this approach misses some real motifs and generates many false positives. To improve accuracy, one solution is to consider additional binding features (i.e., position preference and sequence rank preference). This information is usually required from the user. This article presents a de novo motif discovery algorithm called SEME (sampling with expectation maximization for motif elicitation), which uses a pure probabilistic mixture model to model the motif's binding features and uses expectation maximization (EM) algorithms to simultaneously learn the sequence motif, position, and sequence rank preferences without asking for any prior knowledge from the user. SEME is both efficient and accurate thanks to two important techniques: variable motif length extension and importance sampling. Using 75 large-scale synthetic datasets, 32 metazoan compendium benchmark datasets, and 164 chromatin immunoprecipitation sequencing (ChIP-Seq) libraries, we demonstrated the superior performance of SEME over existing programs in finding transcription factor (TF) binding sites. SEME is further applied to the more difficult problem of finding co-regulated TF (coTF) motifs in 15 ChIP-Seq libraries. It identified significantly more correct coTF motifs and, at the same time, predicted coTF motifs that better match the known motifs. Finally, we show that the learned position and sequence rank preferences of each coTF reveal potential interaction mechanisms between the primary TF and the coTF within these sites. Some of these findings were further validated by ChIP-Seq experiments on the coTFs. The application is available online.
Affiliation(s)
- ZhiZhuo Zhang
- National University of Singapore, Singapore, Singapore
|
29
|
Zhang Y, Huo H, Yu Q. A heuristic cluster-based EM algorithm for the planted (l, d) problem. J Bioinform Comput Biol 2013; 11:1350009. [PMID: 23859273 DOI: 10.1142/s0219720013500091] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 11/18/2022]
Abstract
The planted motif search problem arises from locating transcription factor binding sites (TFBSs), which are crucial for understanding gene regulatory relationships. Many attempts to use expectation maximization (EM) for TFBS discovery have been successful in the past. However, identifying highly degenerate motifs and reducing the effect of local optima remain arduous tasks. To alleviate the vulnerability of EM to becoming trapped in local optima, we present a heuristic cluster-based EM algorithm, CEM, which refines the cluster subsets in the EM method to explore the best local optimal solution. Based on experiments using both synthetic and real datasets, our algorithm demonstrates significant improvements in identifying motif instances and performs better than current widely used algorithms. CEM is a novel planted motif finding algorithm that is able to solve challenging instances and is easy to parallelize, since the process of solving each cluster subset is independent.
Affiliation(s)
- Yipu Zhang
- Department of Computer Science, Xidian University, Xi'an, 710071, Shaanxi, P. R. China.
|
30
|
Disordered binding regions and linear motifs--bridging the gap between two models of molecular recognition. PLoS One 2012; 7:e46829. [PMID: 23056474 PMCID: PMC3463566 DOI: 10.1371/journal.pone.0046829] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.3] [Received: 08/03/2012] [Accepted: 09/05/2012] [Indexed: 12/25/2022]
Abstract
Intrinsically disordered proteins (IDPs) exist without the presence of a stable tertiary structure in isolation. These proteins are often involved in molecular recognition processes via their disordered binding regions that can recognize partner molecules by undergoing a coupled folding and binding process. The specific properties of disordered binding regions give way to specific, yet transient interactions that enable IDPs to play central roles in signaling pathways and act as hubs of protein interaction networks. An alternative model of protein-protein interactions with largely overlapping functional properties is offered by the concept of linear interaction motifs. This approach focuses on distilling a short consensus sequence pattern from proteins with a common interaction partner. These motifs often reside in disordered regions and are considered to mediate the interaction roughly independent from the rest of the protein. Although a connection between linear motifs and disordered binding regions has been established through common examples, the complementary nature of the two concepts has yet to be fully explored. In many cases the sequence based definition of linear motifs and the structural context based definition of disordered binding regions describe two aspects of the same phenomenon. To gain insight into the connection between the two models, prediction methods were utilized. We combined the regular expression based prediction of linear motifs with the disordered binding region prediction method ANCHOR, each specialized for either model to get the best of both worlds. The thorough analysis of the overlap of the two methods offers a bioinformatics tool for more efficient binding site prediction that can serve a wide range of practical implications. At the same time it can also shed light on the theoretical connection between the two co-existing interaction models.
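The idea of combining the two predictions can be sketched as follows: scan a protein sequence with a regular expression for a linear motif, then keep only the matches that fall inside predicted disordered binding regions. This is a minimal illustration, not the authors' pipeline. The pattern `P..P` (a PxxP-like core), the toy sequence, and the region coordinates are all assumptions made for the example, and the regions are taken as given input rather than computed by ANCHOR.

```python
import re

def motif_hits(sequence, pattern):
    """Return (start, end, match) for every regex match, including
    overlapping ones, using a zero-width lookahead with a capture group."""
    lookahead = re.compile(f"(?=({pattern}))")
    return [(m.start(1), m.end(1), m.group(1))
            for m in lookahead.finditer(sequence)]

def hits_in_regions(hits, regions):
    """Keep hits fully contained in any half-open region [start, end)."""
    return [h for h in hits
            if any(start <= h[0] and h[1] <= end for start, end in regions)]

if __name__ == "__main__":
    seq = "MAPSLPAAAPPTPGG"   # made-up toy protein sequence
    regions = [(8, 15)]       # hypothetical predicted disordered binding region
    hits = motif_hits(seq, "P..P")
    print(hits)                            # all regex matches
    print(hits_in_regions(hits, regions))  # matches supported by both models
```

Requiring full containment is one possible overlap rule; a partial-overlap criterion would be an equally reasonable design choice depending on how strictly the two predictions should agree.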
|
31
|
Zambelli F, Pesole G, Pavesi G. Motif discovery and transcription factor binding sites before and after the next-generation sequencing era. Brief Bioinform 2012; 14:225-37. [PMID: 22517426 PMCID: PMC3603212 DOI: 10.1093/bib/bbs016] [Citation(s) in RCA: 93] [Impact Index Per Article: 7.8] [Indexed: 02/07/2023]
Abstract
Motif discovery has been one of the most widely studied problems in bioinformatics ever since genomic and protein sequences became available. In particular, its application to the de novo prediction of putative over-represented transcription factor binding sites in nucleotide sequences has been, and still is, one of the most challenging flavors of the problem. Recently, novel experimental techniques like chromatin immunoprecipitation (ChIP) have been introduced, permitting the genome-wide identification of protein-DNA interactions. ChIP, applied to transcription factors and coupled with genome tiling arrays (ChIP on Chip) or next-generation sequencing technologies (ChIP-Seq), has opened new avenues in research and posed new challenges to bioinformaticians developing algorithms and methods for motif discovery.
|