1
Badia-I-Mompel P, Wessels L, Müller-Dott S, Trimbour R, Ramirez Flores RO, Argelaguet R, Saez-Rodriguez J. Gene regulatory network inference in the era of single-cell multi-omics. Nat Rev Genet 2023; 24:739-754. [PMID: 37365273 DOI: 10.1038/s41576-023-00618-5] [Citation(s) in RCA: 28] [Impact Index Per Article: 28.0] [Accepted: 05/12/2023] [Indexed: 06/28/2023]
Abstract
The interplay between chromatin, transcription factors and genes generates complex regulatory circuits that can be represented as gene regulatory networks (GRNs). The study of GRNs is useful to understand how cellular identity is established, maintained and disrupted in disease. GRNs can be inferred from experimental data - historically, bulk omics data - and/or from the literature. The advent of single-cell multi-omics technologies has led to the development of novel computational methods that leverage genomic, transcriptomic and chromatin accessibility information to infer GRNs at an unprecedented resolution. Here, we review the key principles of inferring GRNs that encompass transcription factor-gene interactions from transcriptomics and chromatin accessibility data. We focus on the comparison and classification of methods that use single-cell multimodal data. We highlight challenges in GRN inference, in particular with respect to benchmarking, and potential further developments using additional data modalities.
Affiliation(s)
- Pau Badia-I-Mompel
  - Heidelberg University, Faculty of Medicine, Heidelberg University Hospital, Institute for Computational Biomedicine, Bioquant, Heidelberg, Germany
- Lorna Wessels
  - Heidelberg University, Faculty of Medicine, Heidelberg University Hospital, Institute for Computational Biomedicine, Bioquant, Heidelberg, Germany
  - Department of Vascular Biology and Tumor Angiogenesis, European Center for Angioscience, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
- Sophia Müller-Dott
  - Heidelberg University, Faculty of Medicine, Heidelberg University Hospital, Institute for Computational Biomedicine, Bioquant, Heidelberg, Germany
- Rémi Trimbour
  - Heidelberg University, Faculty of Medicine, Heidelberg University Hospital, Institute for Computational Biomedicine, Bioquant, Heidelberg, Germany
  - Institut Pasteur, Université Paris Cité, CNRS UMR 3738, Machine Learning for Integrative Genomics Group, Paris, France
- Ricardo O Ramirez Flores
  - Heidelberg University, Faculty of Medicine, Heidelberg University Hospital, Institute for Computational Biomedicine, Bioquant, Heidelberg, Germany
- Julio Saez-Rodriguez
  - Heidelberg University, Faculty of Medicine, Heidelberg University Hospital, Institute for Computational Biomedicine, Bioquant, Heidelberg, Germany
2
Tognon M, Giugno R, Pinello L. A survey on algorithms to characterize transcription factor binding sites. Brief Bioinform 2023; 24:bbad156. [PMID: 37099664 PMCID: PMC10422928 DOI: 10.1093/bib/bbad156] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 01/13/2023] [Revised: 03/27/2023] [Accepted: 04/01/2023] [Indexed: 04/28/2023] Open
Abstract
Transcription factors (TFs) are key regulatory proteins that control the transcriptional rate of cells by binding short DNA sequences called transcription factor binding sites (TFBS) or motifs. Identifying and characterizing TFBS is fundamental to understanding the regulatory mechanisms governing the transcriptional state of cells. During the last decades, several experimental methods have been developed to recover DNA sequences containing TFBS. In parallel, computational methods have been proposed to discover and identify TFBS motifs based on these DNA sequences. This is one of the most widely investigated problems in bioinformatics and is referred to as the motif discovery problem. In this manuscript, we review classical and novel experimental and computational methods developed to discover and characterize TFBS motifs in DNA sequences, highlighting their advantages and drawbacks. We also discuss open challenges and future perspectives that could fill the remaining gaps in the field.
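As a rough illustration of the computational side of TFBS characterization described in this abstract, the sketch below builds a log-odds position weight matrix (PWM) from a few aligned sites and scans a sequence for the best-scoring window. It is a minimal, generic example and not a method taken from the survey; the toy sites, the uniform background of 0.25, the pseudocount, and the function names `build_log_odds` and `scan` are all assumptions made for illustration.

```python
import math


def build_log_odds(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per position) from equal-length sites."""
    pwm = []
    for pos in range(len(sites[0])):
        # Start every base at the pseudocount so no probability is zero.
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((c / total) / background)
                    for b, c in counts.items()})
    return pwm


def scan(sequence, pwm):
    """Score every window of the sequence; return (best_start, best_score)."""
    width = len(pwm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if best is None or score > best[1]:
            best = (start, score)
    return best


# Toy, hypothetical binding sites (not real experimental data).
sites = ["TATAAA", "TATAAT", "TATATA", "TATAAA"]
pwm = build_log_odds(sites)
best_start, best_score = scan("GGCGTATAAAGGC", pwm)
print(best_start, round(best_score, 2))
```

In practice a score threshold or a background model estimated from the genome would replace the uniform background assumed here.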
Affiliation(s)
- Manuel Tognon
  - Computer Science Department, University of Verona, Verona, Italy
  - Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Rosalba Giugno
  - Computer Science Department, University of Verona, Verona, Italy
- Luca Pinello
  - Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
  - Department of Pathology, Harvard Medical School, Boston, Massachusetts, United States of America
3
Farhadi F, Allahbakhsh M, Maghsoudi A, Armin N, Amintoosi H. DiMo: discovery of microRNA motifs using deep learning and motif embedding. Brief Bioinform 2023; 24:bbad182. [PMID: 37165972 DOI: 10.1093/bib/bbad182] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2022] [Revised: 04/17/2023] [Accepted: 04/21/2023] [Indexed: 05/12/2023] Open
Abstract
MicroRNAs are small regulatory RNAs that decrease gene expression after transcription in various biological disciplines. In bioinformatics, identifying microRNAs and predicting their functionalities is critical. Finding motifs is one of the most well-known and important methods for identifying the functionalities of microRNAs. Several motif discovery techniques have been proposed, some of which rely on artificial intelligence-based techniques. However, in the case of few or no training data, their accuracy is low. In this research, we propose a new computational approach, called DiMo, for identifying motifs in microRNAs and generally macromolecules of small length. We employ word embedding techniques and deep learning models to improve the accuracy of motif discovery results. Also, we rely on transfer learning models to pre-train a model and use it in cases of a lack of (enough) training data. We compare our approach with five state-of-the-art works using three real-world datasets. DiMo outperforms the selected related works in terms of precision, recall, accuracy and f1-score.
Affiliation(s)
- Fatemeh Farhadi
  - Department of Bioinformatics, University of Zabol, Zabol, Iran
- Ali Maghsoudi
  - Department of Bioinformatics, University of Zabol, Zabol, Iran
- Nadieh Armin
  - Computer Engineering Department, Ferdowsi University of Mashhad, Mashhad, Iran
- Haleh Amintoosi
  - Computer Engineering Department, Ferdowsi University of Mashhad, Mashhad, Iran
4
Monteiro LDFR, Giraldi LA, Winck FV. From Feasting to Fasting: The Arginine Pathway as a Metabolic Switch in Nitrogen-Deprived Chlamydomonas reinhardtii. Cells 2023; 12:1379. [PMID: 37408213 PMCID: PMC10216424 DOI: 10.3390/cells12101379] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Revised: 05/09/2023] [Accepted: 05/10/2023] [Indexed: 07/07/2023] Open
Abstract
The metabolism of the model microalgae Chlamydomonas reinhardtii under nitrogen deprivation is of special interest due to its resulting increment of triacylglycerols (TAGs), that can be applied in biotechnological applications. However, this same condition impairs cell growth, which may limit the microalgae's large applications. Several studies have identified significant physiological and molecular changes that occur during the transition from an abundant to a low or absent nitrogen supply, explaining in detail the differences in the proteome, metabolome and transcriptome of the cells that may be responsible for and responsive to this condition. However, there are still some intriguing questions that reside in the core of the regulation of these cellular responses that make this process even more interesting and complex. In this scenario, we reviewed the main metabolic pathways that are involved in the response, mining and exploring, through a reanalysis of omics data from previously published datasets, the commonalities among the responses and unraveling unexplained or non-explored mechanisms of the possible regulatory aspects of the response. Proteomics, metabolomics and transcriptomics data were reanalysed using a common strategy, and an in silico gene promoter motif analysis was performed. Together, these results identified and suggested a strong association between the metabolism of amino acids, especially arginine, glutamate and ornithine pathways to the production of TAGs, via the de novo synthesis of lipids. Furthermore, our analysis and data mining indicate that signalling cascades orchestrated with the indirect participation of phosphorylation, nitrosylation and peroxidation events may be essential to the process. The amino acid pathways and the amount of arginine and ornithine available in the cells, at least transiently during nitrogen deprivation, may be in the core of the post-transcriptional, metabolic regulation of this complex phenomenon. 
Their further exploration is important to the discovery of novel advances in the understanding of microalgae lipids' production.
Affiliation(s)
- Lucca de Filipe Rebocho Monteiro
  - Laboratory of Regulatory Systems Biology, Center for Nuclear Energy in Agriculture, University of São Paulo, Piracicaba 13416-000, Brazil
  - Department of Botany, Institute of Biosciences, University of São Paulo, São Paulo 05508-090, Brazil
- Laís Albuquerque Giraldi
  - Laboratory of Regulatory Systems Biology, Center for Nuclear Energy in Agriculture, University of São Paulo, Piracicaba 13416-000, Brazil
  - Department of Biochemistry, Institute of Chemistry, University of São Paulo, São Paulo 05508-000, Brazil
- Flavia Vischi Winck
  - Laboratory of Regulatory Systems Biology, Center for Nuclear Energy in Agriculture, University of São Paulo, Piracicaba 13416-000, Brazil
5
Maseko NN, Steenkamp ET, Wingfield BD, Wilken PM. An in Silico Approach to Identifying TF Binding Sites: Analysis of the Regulatory Regions of BUSCO Genes from Fungal Species in the Ceratocystidaceae Family. Genes (Basel) 2023; 14:genes14040848. [PMID: 37107606 PMCID: PMC10137650 DOI: 10.3390/genes14040848] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Revised: 03/26/2023] [Accepted: 03/27/2023] [Indexed: 04/03/2023] Open
Abstract
Transcriptional regulation controls gene expression through regulatory promoter regions that contain conserved sequence motifs. These motifs, also known as regulatory elements, are critically important to expression, which is driving research efforts to identify and characterize them. Yeasts have been the focus of such studies in fungi, including in several in silico approaches. This study aimed to determine whether in silico approaches could be used to identify motifs in the Ceratocystidaceae family, and if present, to evaluate whether these correspond to known transcription factors. This study targeted the 1000 base-pair region upstream of the start codon of 20 single-copy genes from the BUSCO dataset for motif discovery. Using the MEME and Tomtom analysis tools, conserved motifs at the family level were identified. The results show that such in silico approaches could identify known regulatory motifs in the Ceratocystidaceae and other unrelated species. This study provides support to ongoing efforts to use in silico analyses for motif discovery.
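The abstract above targets the 1000 base-pair region upstream of each start codon. A minimal sketch of how such a region could be cut from a genomic sequence, taking strand into account, is given below. The coordinate convention (0-based position of the first base of the start codon, given in forward-strand coordinates) and the function names are assumptions for illustration, not details taken from the study.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def upstream_region(genome, start_codon_pos, strand, length=1000):
    """Return up to `length` bases upstream of a start codon, read 5'->3'
    on the gene's own strand.

    start_codon_pos: 0-based forward-strand index of the FIRST base of the
    start codon (for a minus-strand gene this is the codon's rightmost base).
    """
    if strand == "+":
        begin = max(0, start_codon_pos - length)
        return genome[begin:start_codon_pos]
    if strand == "-":
        # Upstream of a minus-strand gene lies at higher forward coordinates.
        end = min(len(genome), start_codon_pos + 1 + length)
        return reverse_complement(genome[start_codon_pos + 1:end])
    raise ValueError("strand must be '+' or '-'")


# Toy sequences: ATG at index 8 on the plus strand; CAT (ATG on the
# minus strand) at indices 3-5, so the codon's first base is index 5.
plus_up = upstream_region("CCCCGGGGATGAAA", 8, "+", length=4)
minus_up = upstream_region("TTTCATAACC", 5, "-", length=4)
print(plus_up, minus_up)
```

The extracted regions would then be written to FASTA and passed to a motif discovery tool such as MEME; that step is omitted here.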
6
Deyneko IV. Guidelines on the performance evaluation of motif recognition methods in bioinformatics. Front Genet 2023; 14:1135320. [PMID: 36824436 PMCID: PMC9941176 DOI: 10.3389/fgene.2023.1135320] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2022] [Accepted: 01/19/2023] [Indexed: 02/09/2023] Open
7
Scagnoli F, Palma A, Favia A, Scuoppo C, Illi B, Nasi S. A New Insight into MYC Action: Control of RNA Polymerase II Methylation and Transcription Termination. Biomedicines 2023; 11:biomedicines11020412. [PMID: 36830948 PMCID: PMC9952900 DOI: 10.3390/biomedicines11020412] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2022] [Revised: 01/16/2023] [Accepted: 01/26/2023] [Indexed: 02/01/2023] Open
Abstract
MYC oncoprotein deregulation is a common catastrophic event in human cancer and limiting its activity restrains tumor development and maintenance, as clearly shown via Omomyc, an MYC-interfering 90 amino acid mini-protein. MYC is a multifunctional transcription factor that regulates many aspects of transcription by RNA polymerase II (RNAPII), such as transcription activation, pause release, and elongation. MYC directly associates with Protein Arginine Methyltransferase 5 (PRMT5), a protein that methylates a variety of targets, including RNAPII at the arginine residue R1810 (R1810me2s), crucial for proper transcription termination and splicing of transcripts. Therefore, we asked whether MYC controls termination as well, by affecting R1810me2S. We show that MYC overexpression strongly increases R1810me2s, while Omomyc, an MYC shRNA, or a PRMT5 inhibitor and siRNA counteract this phenomenon. Omomyc also impairs Serine 2 phosphorylation in the RNAPII carboxyterminal domain, a modification that sustains transcription elongation. ChIP-seq experiments show that Omomyc replaces MYC and reshapes RNAPII distribution, increasing occupancy at promoter and termination sites. It is unclear how this may affect gene expression. Transcriptomic analysis shows that transcripts pivotal to key signaling pathways are both up- or down-regulated by Omomyc, whereas genes directly controlled by MYC and belonging to a specific signature are strongly down-regulated. Overall, our data point to an MYC/PRMT5/RNAPII axis that controls termination via RNAPII symmetrical dimethylation and contributes to rewiring the expression of genes altered by MYC overexpression in cancer cells. It remains to be clarified which role this may have in tumor development.
Affiliation(s)
- Fiorella Scagnoli
  - IBPM-CNR, Biology and Biotechnology Department, Sapienza University, 00185 Rome, Italy
  - Correspondence: (F.S.); (B.I.); (S.N.)
- Alessandro Palma
  - Translational Cytogenomics Research Unit, Bambino Gesù Children’s Hospital, IRCCS, 00146 Rome, Italy
- Annarita Favia
  - IBPM-CNR, Biology and Biotechnology Department, Sapienza University, 00185 Rome, Italy
- Claudio Scuoppo
  - Institute for Cancer Genetics, Columbia University, New York, NY 10032, USA
- Barbara Illi
  - IBPM-CNR, Biology and Biotechnology Department, Sapienza University, 00185 Rome, Italy
  - Correspondence: (F.S.); (B.I.); (S.N.)
- Sergio Nasi
  - IBPM-CNR, Biology and Biotechnology Department, Sapienza University, 00185 Rome, Italy
  - Correspondence: (F.S.); (B.I.); (S.N.)
8
Yu Q, Zhang X, Hu Y, Chen S, Yang L. A Method for Predicting DNA Motif Length Based On Deep Learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:61-73. [PMID: 35275822 DOI: 10.1109/tcbb.2022.3158471] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
A DNA motif is a sequence pattern shared by the DNA sequence segments that bind to a specific protein. Discovering motifs in a given DNA sequence dataset plays a vital role in studying gene expression regulation. As an important attribute of the DNA motif, the motif length directly affects the quality of the discovered motifs. How to determine the motif length more accurately remains a difficult challenge to be solved. We propose a new motif length prediction scheme named MotifLen by using supervised machine learning. First, a method of constructing sample data for predicting the motif length is proposed. Secondly, a deep learning model for motif length prediction is constructed based on the convolutional neural network. Then, the methods of applying the proposed prediction model based on a motif found by an existing motif discovery algorithm are given. The experimental results show that i) the prediction accuracy of MotifLen is more than 90% on the validation set and is significantly higher than that of the compared methods on real datasets, ii) MotifLen can successfully optimize the motifs found by the existing motif discovery algorithms, and iii) it can effectively improve the time performance of some existing motif discovery algorithms.
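To make the notion of motif length concrete, the sketch below uses a simple classical heuristic: compute the information content of each column of a set of aligned sites and trim the low-information flanks. This is explicitly not the MotifLen model from the abstract (which is a convolutional neural network); the 0.5-bit threshold, the toy sites, and the function names are assumptions chosen only for illustration.

```python
import math


def column_information(sites):
    """Information content in bits of each alignment column (no pseudocounts)."""
    n = len(sites)
    info = []
    for pos in range(len(sites[0])):
        ic = 2.0  # maximum for a 4-letter alphabet with uniform background
        for base in "ACGT":
            p = sum(1 for s in sites if s[pos] == base) / n
            if p > 0:
                ic += p * math.log2(p)
        info.append(ic)
    return info


def trimmed_span(info, threshold=0.5):
    """Return (start, end) of the span that remains after dropping
    low-information flanking columns; `end` is exclusive."""
    keep = [i for i, ic in enumerate(info) if ic >= threshold]
    if not keep:
        return (0, 0)
    return (keep[0], keep[-1] + 1)


# Toy, hypothetical alignment: a conserved TGACG core with random flanks.
sites = ["ATGACGC", "CTGACGA", "GTGACGT", "TTGACGG"]
info = column_information(sites)
span = trimmed_span(info)
print(span, span[1] - span[0])
```

On this toy alignment the flanking columns carry no information and the estimated motif length is the width of the conserved core.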
9
NetREX-CF integrates incomplete transcription factor data with gene expression to reconstruct gene regulatory networks. Commun Biol 2022; 5:1282. [PMID: 36418514 PMCID: PMC9684490 DOI: 10.1038/s42003-022-04226-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2021] [Accepted: 11/04/2022] [Indexed: 11/25/2022] Open
Abstract
The inference of Gene Regulatory Networks (GRNs) is one of the key challenges in systems biology. Leading algorithms utilize, in addition to gene expression, prior knowledge such as Transcription Factor (TF) DNA binding motifs or results of TF binding experiments. However, such prior knowledge is typically incomplete; therefore, integrating it with gene expression to infer GRNs remains difficult. To address this challenge, we introduce NetREX-CF (Regulatory Network Reconstruction using EXpression and Collaborative Filtering), a GRN reconstruction approach that brings together Collaborative Filtering to address the incompleteness of the prior knowledge and a biologically justified model of gene expression (sparse Network Component Analysis based model). We validated NetREX-CF using Yeast data and then used it to construct the GRN for Drosophila Schneider 2 (S2) cells. To corroborate the GRN, we performed a large-scale RNA-Seq analysis followed by a high-throughput RNAi treatment against all 465 expressed TFs in the cell line. Our knockdown result has not only extensively validated the GRN we built, but also provides a benchmark that our community can use for evaluating GRNs. Finally, we demonstrate that NetREX-CF can infer GRNs using single-cell RNA-Seq, and outperforms other methods, by using previously published human data.
10
Zhang S, Ma A, Zhao J, Xu D, Ma Q, Wang Y. Assessing deep learning methods in cis-regulatory motif finding based on genomic sequencing data. Brief Bioinform 2022; 23:bbab374. [PMID: 34607350 PMCID: PMC8769700 DOI: 10.1093/bib/bbab374] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/19/2021] [Revised: 08/22/2021] [Accepted: 08/23/2021] [Indexed: 12/28/2022] Open
Abstract
Identifying cis-regulatory motifs from genomic sequencing data (e.g. ChIP-seq and CLIP-seq) is crucial in identifying transcription factor (TF) binding sites and inferring gene regulatory mechanisms for any organism. Since 2015, deep learning (DL) methods have been widely applied to identify TF binding sites and predict motif patterns, with the strengths of offering a scalable, flexible and unified computational approach for highly accurate predictions. As far as we know, 20 DL methods have been developed. However, without a clear and systematic assessment, users will struggle to choose the most appropriate tool for their specific studies. In this manuscript, we evaluated 20 DL methods for cis-regulatory motif prediction using 690 ENCODE ChIP-seq, 126 cancer ChIP-seq and 55 RNA CLIP-seq data. Four metrics were investigated, including the accuracy of motif finding, the performance of DNA/RNA sequence classification, algorithm scalability and tool usability. The assessment results demonstrated the high complementarity of the existing DL methods. It was determined that the most suitable model should primarily depend on the data size and type and the method's outputs.
Affiliation(s)
- Shuangquan Zhang
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
- Anjun Ma
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
- Jing Zhao
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
- Dong Xu
  - Department of Electrical Engineering and Computer Science, and Christopher S. Bond Life Science Center, University of Missouri, MO, 65211, USA
- Qin Ma
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
- Yan Wang
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
  - School of Artificial Intelligence, Jilin University, Changchun, 130012, China
11
Kuiper M, Bonello J, Fernández-Breis JT, Bucher P, Futschik ME, Gaudet P, Kulakovskiy IV, Licata L, Logie C, Lovering RC, Makeev VJ, Orchard S, Panni S, Perfetto L, Sant D, Schulz S, Zerbino DR, Lægreid A. The Gene Regulation Knowledge Commons: The action area of GREEKC. Biochimica et Biophysica Acta - Gene Regulatory Mechanisms 2021; 1865:194768. [PMID: 34757206 DOI: 10.1016/j.bbagrm.2021.194768] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/20/2021] [Revised: 10/18/2021] [Accepted: 10/20/2021] [Indexed: 02/08/2023]
Abstract
The COST Action Gene Regulation Ensemble Effort for the Knowledge Commons (GREEKC, CA15205, www.greekc.org) organized nine workshops in a four-year period, starting September 2016. The workshops brought together a wide range of experts from all over the world working on various parts of the knowledge cycle that is central to understanding gene regulatory mechanisms. The discussions between ontologists, curators, text miners, biologists, bioinformaticians, philosophers and computational scientists spawned a host of activities aimed to update and standardise existing knowledge management workflows, encourage new experimental approaches and thoroughly involve end-users in the process to design the Gene Regulation Knowledge Commons (GRKC). The GREEKC consortium describes its main achievements, contextualised in a state-of-the-art of current tools and resources that today represent the GRKC.
Affiliation(s)
- Martin Kuiper
  - Systems Biology Group, Department of Biology, Norwegian University of Science and Technology, Trondheim, Norway
- Joseph Bonello
  - Faculty of Information & Communication Technology, University of Malta, Msida, Malta
- Philipp Bucher
  - Swiss Institute of Bioinformatics, Quartier Sorge, Bâtiment Amphipôle, 1015 Lausanne, Switzerland
- Matthias E Futschik
  - Systems Biology and Bioinformatics Laboratory (SysBioLab), Centre of Marine Sciences (CCMAR), University of Algarve, 8005-139 Faro, Portugal
- Pascale Gaudet
  - SIB Swiss Institute of Bioinformatics, 1 Rue Michel-Servet, 1204 Geneva, Switzerland
- Ivan V Kulakovskiy
  - Institute of Protein Research, Russian Academy of Sciences, Institutskaya 4, 142290 Pushchino, Russia
- Luana Licata
  - Department of Biology, University of Rome Tor Vergata, Rome, Italy
- Colin Logie
  - Department of Molecular Biology, Faculty of Science, Radboud University, PO Box 9101, Nijmegen 6500HG, the Netherlands
- Ruth C Lovering
  - Functional Gene Annotation, Pre-clinical and Fundamental Science, Institute of Cardiovascular Science, University College London, 5 University Street, London WC1E 6JF, UK
- Vsevolod J Makeev
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, Gubkina 3, 119991 Moscow, Russia
- Sandra Orchard
  - European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- Simona Panni
  - Department DIBEST, University of Calabria, Rende, Italy
- Livia Perfetto
  - Fondazione Human Technopole, Department of Biology, Via Cristina Belgioioso, 171, 20157 Milan, Italy
- David Sant
  - Department of Biomedical Informatics, University of Utah, 421 Wakara Way #140, Salt Lake City, UT 84108, United States
- Stefan Schulz
  - Institute of Medical Informatics, Statistics and Documentation, Medical University of Graz, Auenbruggerpl. 2, Graz, Austria
- Daniel R Zerbino
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Astrid Lægreid
  - Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology, 7491 Trondheim, Norway
12
Castellana S, Biagini T, Parca L, Petrizzelli F, Bianco SD, Vescovi AL, Carella M, Mazza T. A comparative benchmark of classic DNA motif discovery tools on synthetic data. Brief Bioinform 2021; 22:6341664. [PMID: 34351399 DOI: 10.1093/bib/bbab303] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/16/2021] [Revised: 07/08/2021] [Accepted: 07/15/2021] [Indexed: 01/01/2023] Open
Abstract
Hundreds of human proteins were found to establish transient interactions with rather degenerate consensus DNA sequences or motifs. Identifying these motifs and the genomic sites where interactions occur represents one of the most challenging research goals in modern molecular biology and bioinformatics. The last twenty years witnessed an explosion of computational tools designed to perform this task, whose performance was last compared fifteen years ago. Here, we survey sixteen of them, benchmark their ability to identify known motifs nested in twenty-nine simulated sequence datasets, and finally report their strengths, weaknesses, and complementarity.
Affiliation(s)
- Stefano Castellana
  - Bioinformatics Unit, IRCCS Casa Sollievo della Sofferenza, S. Giovanni Rotondo 71013, Italy
- Tommaso Biagini
  - Bioinformatics Unit, IRCCS Casa Sollievo della Sofferenza, S. Giovanni Rotondo 71013, Italy
- Luca Parca
  - Bioinformatics Unit, IRCCS Casa Sollievo della Sofferenza, S. Giovanni Rotondo 71013, Italy
- Francesco Petrizzelli
  - Bioinformatics Unit, IRCCS Casa Sollievo della Sofferenza, S. Giovanni Rotondo 71013, Italy
  - Department of Experimental Medicine, Sapienza University of Rome, Rome 00161, Italy
- Angelo Luigi Vescovi
  - ISBReMIT Institute for Stem Cell Biology, Regenerative Medicine and Innovative Therapies, IRCCS Casa Sollievo della Sofferenza, San Giovanni Rotondo (FG), 71013, Italy
- Massimo Carella
  - Medical Genetics Unit, IRCCS Casa Sollievo della Sofferenza, S. Giovanni Rotondo 71013, Italy
- Tommaso Mazza
  - Bioinformatics Unit, IRCCS Casa Sollievo della Sofferenza, S. Giovanni Rotondo 71013, Italy
13
Li JY, Jin S, Tu XM, Ding Y, Gao G. Identifying complex motifs in massive omics data with a variable-convolutional layer in deep neural network. Brief Bioinform 2021; 22:6312656. [PMID: 34219140 DOI: 10.1093/bib/bbab233] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/25/2021] [Revised: 05/25/2021] [Accepted: 05/28/2021] [Indexed: 01/10/2023] Open
Abstract
Motif identification is among the most common and essential computational tasks for bioinformatics and genomics. Here we proposed a novel convolutional layer for deep neural network, named variable convolutional (vConv) layer, for effective motif identification in high-throughput omics data by learning kernel length from data adaptively. Empirical evaluations on DNA-protein binding and DNase footprinting cases well demonstrated that vConv-based networks have superior performance to their convolutional counterparts regardless of model complexity. Meanwhile, vConv could be readily integrated into multi-layer neural networks as an 'in-place replacement' of canonical convolutional layer. All source codes are freely available on GitHub for academic usage.
Affiliation(s)
- Jing-Yi Li
  - Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
- Shen Jin
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Xin-Ming Tu
  - Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
- Yang Ding
  - Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
- Ge Gao
  - Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
14
Abstract
Aims:
A more robust and accurate method for identifying transcription factor binding sites (TFBS) for gene expression.
Background:
Deep neural networks (DNNs) have shown promising growth in solving complex machine learning problems. Conventional techniques are comfortably replaced by DNNs in computer vision, signal processing, healthcare, and genomics. Understanding DNA sequences is always a crucial task in healthcare and regulatory genomics. For DNA motif prediction, choosing the right dataset with a sufficient number of input sequences is crucial in order to design an effective model.
Objective:
To design a new algorithm that works on different datasets while improving performance for TFBS prediction.
Methods:
With the help of Layerwise Relevance Propagation, the proposed algorithm identifies the invariant features with adaptive noise patterns.
Results:
The performance is compared with standard as well as recent methods by calculating various metrics, and a significant improvement is noted.
Conclusion:
By identifying the invariant and robust features in the DNA sequences, the classification performance can be increased.
Collapse
Affiliation(s)
- Kanu Geete
- Department of Computer Science & Engineering, Maulana Azad National Institute of Technology, Bhopal, India
| | - Manish Pandey
- Department of Computer Science & Engineering, Maulana Azad National Institute of Technology, Bhopal, India
|
15
|
Identification of Cis-Regulatory Sequences Controlling Pollen-Specific Expression of Hydroxyproline-Rich Glycoprotein Genes in Arabidopsis thaliana. PLANTS 2020; 9:plants9121751. [PMID: 33322028 PMCID: PMC7763877 DOI: 10.3390/plants9121751] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 10/31/2020] [Revised: 11/26/2020] [Accepted: 12/07/2020] [Indexed: 02/06/2023]
Abstract
Hydroxyproline-rich glycoproteins (HRGPs) are a superfamily of plant cell wall structural proteins that function in various aspects of plant growth and development, including pollen tube growth. We have previously characterized protein sequence signatures for three family members in the HRGP superfamily: the hyperglycosylated arabinogalactan-proteins (AGPs), the moderately glycosylated extensins (EXTs), and the lightly glycosylated proline-rich proteins (PRPs). However, the mechanism of pollen-specific HRGP gene expression remains unexplored. To this end, we developed an integrative analysis pipeline combining RNA-seq gene expression and promoter sequences to identify cis-regulatory motifs responsible for pollen-specific expression of HRGP genes in Arabidopsis thaliana. Specifically, we mined public RNA-seq datasets and identified 13 pollen-specific HRGP genes. Ensemble motif discovery identified 15 promoter elements conserved between A. thaliana and A. lyrata. Motif scanning revealed two pollen-related transcription factors: GATA12 and the brassinosteroid (BR) signaling pathway regulator BZR1. Finally, we performed a regression analysis and demonstrated that the 15 motifs provided a good model of HRGP gene expression in pollen (R = 0.61). In conclusion, we performed the first integrative analysis of cis-regulatory motifs in pollen-specific HRGP genes, revealing important insights into transcriptional regulation in pollen tissue.
|
16
|
He Y, Shen Z, Zhang Q, Wang S, Huang DS. A survey on deep learning in DNA/RNA motif mining. Brief Bioinform 2020; 22:5916939. [PMID: 33005921 PMCID: PMC8293829 DOI: 10.1093/bib/bbaa229] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 07/18/2020] [Revised: 08/19/2020] [Accepted: 08/24/2020] [Indexed: 01/18/2023]
Abstract
DNA/RNA motif mining is the foundation of gene function research. It plays an extremely important role in identifying DNA- or RNA-protein binding sites, which helps to understand the mechanisms of gene regulation and management. For the past few decades, researchers have been working on designing new efficient and accurate algorithms for mining motifs. These algorithms can be roughly divided into two categories: the enumeration approach and the probabilistic method. In recent years, machine learning methods have made great progress, and deep learning algorithms in particular have achieved good performance. Existing deep learning methods in motif mining can be roughly divided into three types of models: convolutional neural network (CNN) based models, recurrent neural network (RNN) based models, and hybrid CNN–RNN based models. We introduce the application of deep learning in the field of motif mining in terms of data preprocessing and the features of existing deep learning architectures, and we compare the differences between the basic deep learning models. Through the analysis and comparison of existing deep learning methods, we found that more complex models tend to perform better than simple ones when data are sufficient, and that the current methods are relatively simple compared with those in other fields such as computer vision, natural language processing (NLP), and computer games. Therefore, it is necessary to summarize deep learning in motif mining, which can help researchers understand this field.
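As an illustration of the data preprocessing and CNN-based scanning that the abstract above refers to, the following minimal Python sketch one-hot encodes a DNA sequence and slides a single convolution filter across it. It is not taken from the survey: the function names and the hand-set "TATA" filter are assumptions made for this example, and a real model would learn many filters from data.

```python
# Sketch only: one-hot encoding plus one convolution filter, standard library only.
BASES = "ACGT"

def one_hot(seq):
    """One 4-element row per base; unknown bases such as N become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv1d_scan(encoded, kernel):
    """Slide a (width x 4) kernel over an (L x 4) input without padding."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(4):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(total)
    return scores

# Hypothetical filter that rewards exact matches to "TATA".
tata_kernel = one_hot("TATA")
scores = conv1d_scan(one_hot("GGTATAGC"), tata_kernel)
best_position = scores.index(max(scores))  # global max pooling keeps only the best window
```

Each score counts the positions at which a window agrees with the filter, so the window starting at `best_position` is the closest match to the pattern.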
Affiliation(s)
- Ying He
- computer science and technology at Tongji University, China
| | - Zhen Shen
- computer science and technology at Tongji University, China
| | - Qinhu Zhang
- computer science and technology at Tongji University, China
| | - Siguo Wang
- computer science and technology at Tongji University, China
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, Tongji University
|
17
|
Sultan I, Fromion V, Schbath S, Nicolas P. Statistical modelling of bacterial promoter sequences for regulatory motif discovery with the help of transcriptome data: application to Listeria monocytogenes. J R Soc Interface 2020; 17:20200600. [PMID: 33023397 PMCID: PMC7653377 DOI: 10.1098/rsif.2020.0600] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2020] [Accepted: 09/10/2020] [Indexed: 11/12/2022]
Abstract
Automatic de novo identification of the main regulons of a bacterium from genome and transcriptome data remains a challenge. To address this task, we propose a statistical model that can use information on exact positions of the transcription start sites and condition-dependent expression profiles. The central idea of this model is to improve the probabilistic representation of the promoter DNA sequences by incorporating covariates summarizing expression profiles (e.g. coordinates in projection spaces or hierarchical clustering trees). A dedicated trans-dimensional Markov chain Monte Carlo algorithm adjusts the width and palindromic properties of the corresponding position-weight matrices, the number of parameters to describe exact position relative to the transcription start site, and chooses the expression covariates relevant for each motif. All parameters are estimated simultaneously, for many motifs and many expression covariates. The method is applied to a dataset of transcription start sites and expression profiles available for Listeria monocytogenes. The results validate the approach and provide a new global view of the transcription regulatory network of this important pathogen. Remarkably, a previously unreported motif is found in promoter regions of ribosomal protein genes, suggesting a role in the regulation of growth.
Affiliation(s)
- Ibrahim Sultan
- Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
| | | | | | - Pierre Nicolas
- Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
|
18
|
|
19
|
Zhou J, Lu Q, Gui L, Xu R, Long Y, Wang H. MTTFsite: cross-cell type TF binding site prediction by using multi-task learning. Bioinformatics 2020; 35:5067-5077. [PMID: 31161194 PMCID: PMC6954652 DOI: 10.1093/bioinformatics/btz451] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 01/24/2019] [Revised: 05/19/2019] [Accepted: 05/30/2019] [Indexed: 12/30/2022]
Abstract
Motivation: The prediction of transcription factor binding sites (TFBSs) is crucial for gene expression analysis. Supervised learning approaches for TFBS prediction require large amounts of labeled data. However, many TFs in certain cell types have either insufficient labeled data or none at all. Results: In this paper, a multi-task learning framework (called MTTFsite) is proposed to address the lack of labeled data by leveraging labeled data available across cell types. The proposed MTTFsite contains a shared CNN to learn features common to all cell types and a private CNN for each cell type to learn private features. The common features are intended to help predict TFBSs for all cell types, especially those that lack labeled data. MTTFsite is evaluated on 241 cell type TF pairs and compared with a baseline method that uses no multi-task learning model and with a fully shared multi-task model that uses only a shared CNN and no private CNNs. For cell types with insufficient labeled data, results show that MTTFsite performs better than the baseline method and the fully shared model on more than 89% of pairs. For cell types without any labeled data, MTTFsite outperforms the baseline method and the fully shared model on more than 80% and 93% of pairs, respectively. A novel gene expression prediction method (called TFChrome) using both MTTFsite and histone modification features is also presented. Results show that TFBSs predicted by MTTFsite alone can achieve good performance. When MTTFsite is combined with histone modification features, a significant 5.7% performance improvement is obtained. Availability and implementation: The resource and executable code are freely available at http://hlt.hitsz.edu.cn/MTTFsite/ and http://www.hitsz-hlt.com:8080/MTTFsite/. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jiyun Zhou
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China.,Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Hong Kong
| | - Qin Lu
- Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Hong Kong
| | - Lin Gui
- Department of Computer Science, University of Warwick, Coventry CV4 4AL, UK
| | - Ruifeng Xu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
| | - Yunfei Long
- Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Hong Kong
| | - Hongpeng Wang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
|
20
|
Höllbacher B, Balázs K, Heinig M, Uhlenhaut NH. Seq-ing answers: Current data integration approaches to uncover mechanisms of transcriptional regulation. Comput Struct Biotechnol J 2020; 18:1330-1341. [PMID: 32612756 PMCID: PMC7306512 DOI: 10.1016/j.csbj.2020.05.018] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 02/27/2020] [Revised: 05/21/2020] [Accepted: 05/23/2020] [Indexed: 02/06/2023]
Abstract
Advancements in the field of next generation sequencing lead to the generation of ever-more data, with the challenge often being how to combine and reconcile results from different OMICs studies such as genome, epigenome and transcriptome. Here we provide an overview of the standard processing pipelines for ChIP-seq and RNA-seq as well as common downstream analyses. We describe popular multi-omics data integration approaches used to identify target genes and co-factors, and we discuss how machine learning techniques may predict transcriptional regulators and gene expression.
Affiliation(s)
- Barbara Höllbacher
- Institute for Diabetes and Cancer IDC, Helmholtz Zentrum Muenchen (HMGU) and German Center for Diabetes Research (DZD), Munich 85764, Neuherberg, Germany.,Institute of Computational Biology ICB, Helmholtz Zentrum Muenchen (HMGU) and German Center for Diabetes Research (DZD), Munich 85764, Neuherberg, Germany.,Department of Informatics, TUM, Munich 85748, Garching, Germany
| | - Kinga Balázs
- Institute for Diabetes and Cancer IDC, Helmholtz Zentrum Muenchen (HMGU) and German Center for Diabetes Research (DZD), Munich 85764, Neuherberg, Germany
| | - Matthias Heinig
- Institute of Computational Biology ICB, Helmholtz Zentrum Muenchen (HMGU) and German Center for Diabetes Research (DZD), Munich 85764, Neuherberg, Germany.,Department of Informatics, TUM, Munich 85748, Garching, Germany
| | - N Henriette Uhlenhaut
- Institute for Diabetes and Cancer IDC, Helmholtz Zentrum Muenchen (HMGU) and German Center for Diabetes Research (DZD), Munich 85764, Neuherberg, Germany.,Metabolic Programming, TUM School of Life Sciences Weihenstephan, Munich 85354, Freising, Germany
|
21
|
Ronzio M, Zambelli F, Dolfini D, Mantovani R, Pavesi G. Integrating Peak Colocalization and Motif Enrichment Analysis for the Discovery of Genome-Wide Regulatory Modules and Transcription Factor Recruitment Rules. Front Genet 2020; 11:72. [PMID: 32153638 PMCID: PMC7046753 DOI: 10.3389/fgene.2020.00072] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 10/07/2019] [Accepted: 01/22/2020] [Indexed: 12/14/2022]
Abstract
Chromatin immunoprecipitation followed by next-generation sequencing (ChIP-Seq) has opened new avenues of research in the genome-wide characterization of regulatory DNA-protein interactions at the genetic and epigenetic level. As a consequence, it has become the de facto standard for studies on the regulation of transcription, and thousands of data sets for transcription factors and cofactors in different conditions and species are now available to the scientific community. However, while pipelines and best practices have been established for the analysis of a single experiment, there is still no consensus on the best way to perform an integrated analysis of multiple datasets in the same condition in order to identify the most relevant and widespread regulatory modules composed of different transcription factors and cofactors. We present here a computational pipeline for this task that integrates peak summit colocalization, a novel statistical framework for the evaluation of its significance, and motif enrichment analysis. We show examples of its application to ENCODE data that led to the identification of relevant regulatory modules composed of different factors, as well as the organization on DNA of the binding motifs responsible for their recruitment.
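The peak summit colocalization step mentioned in the abstract above can be pictured with a minimal Python sketch. It only counts how many summits of one factor have a summit of another factor nearby. The function name, the distance threshold and the coordinates are assumptions made for this example, and the significance framework described by the authors is not reproduced here.

```python
from bisect import bisect_left, bisect_right

def colocalized_fraction(summits_a, summits_b, max_distance):
    """Fraction of A summits with at least one B summit within max_distance bp.

    Assumes all coordinates lie on the same chromosome.
    """
    if not summits_a:
        return 0.0
    sorted_b = sorted(summits_b)
    hits = 0
    for pos in summits_a:
        # Index range of B summits inside [pos - max_distance, pos + max_distance].
        lo = bisect_left(sorted_b, pos - max_distance)
        hi = bisect_right(sorted_b, pos + max_distance)
        if hi > lo:
            hits += 1
    return hits / len(summits_a)

# Hypothetical summit coordinates for two factors.
factor_a = [100, 500, 900, 1500]
factor_b = [120, 880, 3000]
fraction = colocalized_fraction(factor_a, factor_b, max_distance=50)
```

Sorting one summit list and using binary search keeps the lookup logarithmic per summit, which matters when each experiment contributes tens of thousands of peaks.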
Affiliation(s)
- Mirko Ronzio
- Dipartimento di Bioscienze, Università di Milano, Milan, Italy
| | | | - Diletta Dolfini
- Dipartimento di Bioscienze, Università di Milano, Milan, Italy
| | | | - Giulio Pavesi
- Dipartimento di Bioscienze, Università di Milano, Milan, Italy
|
22
|
Li T, Zhang X, Luo F, Wu FX, Wang J. MultiMotifMaker: A Multi-Thread Tool for Identifying DNA Methylation Motifs from Pacbio Reads. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:220-225. [PMID: 30059318 DOI: 10.1109/tcbb.2018.2861399] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 06/08/2023]
Abstract
The methylation of DNA is an important mechanism for controlling biological processes. Recently, Pacbio SMRT technology has provided a new way to identify base methylation in the genome. MotifMaker is a tool developed by Pacbio for discovering DNA methylation motifs from methylated DNA sequences. However, MotifMaker is single-threaded and computationally expensive when identifying methylation motifs in large genomes. Here, we present an efficient motif finding algorithm (MultiMotifMaker), a multi-threaded implementation of MotifMaker. MultiMotifMaker speeds up the motif search about 8-9 times on a 32-core computer compared with MotifMaker. MultiMotifMaker makes it possible to identify methylation motifs from Pacbio reads for large genomes.
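The general pattern behind multi-threading a motif search (split the input, process the parts concurrently, merge the partial results) can be sketched in Python as follows. This is not MultiMotifMaker's algorithm: plain k-mer counting stands in for the actual motif search, and the function names and input sequences are assumptions made for this example. In CPython, threads do not speed up pure-Python computation because of the global interpreter lock, so the sketch shows the partition-and-merge structure only.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_kmers(sequences, k):
    """Count every k-mer in a list of sequences (stand-in for a motif search)."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def parallel_count_kmers(sequences, k, workers=4):
    """Split the sequences into one chunk per worker and merge the partial counts."""
    chunks = [sequences[i::workers] for i in range(workers)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(lambda chunk: count_kmers(chunk, k), chunks):
            total.update(partial)
    return total

# Hypothetical sequence windows around modified bases.
windows = ["GATCGATC", "AGATCT", "TTGATC"]
merged = parallel_count_kmers(windows, k=4)
```

Because the chunks are independent and the merge is a simple sum, the threaded result equals the single-threaded one regardless of how the work is divided.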
|
23
|
Yu Q, Zhao X, Huo H. A new algorithm for DNA motif discovery using multiple sample sequence sets. J Bioinform Comput Biol 2019; 17:1950021. [PMID: 31617465 DOI: 10.1142/s0219720019500215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
DNA motif discovery plays an important role in understanding the mechanisms of gene regulation. Most existing motif discovery algorithms can identify motifs in an efficient and effective manner when dealing with small datasets. However, large datasets generated by high-throughput sequencing technologies pose a huge challenge: it is too time-consuming to process the entire dataset, but if only a small sample sequence set is processed, it is difficult to identify infrequent motifs. In this paper, we propose a new DNA motif discovery algorithm: first divide the input dataset into multiple sample sequence sets, then refine initial motifs of each sample sequence set with the expectation maximization method, and finally combine all the results from each sample sequence set. Besides, we design a new initial motif generation method with the utilization of the entire dataset, which helps to identify infrequent motifs. The experimental results on the simulated data show that the proposed algorithm has better time performance for large datasets and better accuracy of identifying infrequent motifs than the compared algorithms. Also, we have verified the validity of the proposed algorithm on the real data.
Affiliation(s)
- Qiang Yu
- School of Computer Science and Technology, Xidian University, Xi'an, 710071, P. R. China
| | - Xiang Zhao
- School of Computer Science and Technology, Xidian University, Xi'an, 710071, P. R. China
| | - Hongwei Huo
- School of Computer Science and Technology, Xidian University, Xi'an, 710071, P. R. China
|
24
|
Lan G, Zhou J, Xu R, Lu Q, Wang H. Cross-Cell-Type Prediction of TF-Binding Site by Integrating Convolutional Neural Network and Adversarial Network. Int J Mol Sci 2019; 20:ijms20143425. [PMID: 31336830 PMCID: PMC6679139 DOI: 10.3390/ijms20143425] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 05/13/2019] [Revised: 06/27/2019] [Accepted: 07/08/2019] [Indexed: 01/18/2023]
Abstract
Transcription factor binding sites (TFBSs) play an important role in gene expression regulation. Many computational methods for TFBS prediction need sufficient labeled data. However, many transcription factors (TFs) lack labeled data in some cell types. We propose a novel method, referred to as DANN_TF, for TFBS prediction. DANN_TF consists of a feature extractor, a label predictor, and a domain classifier. The feature extractor and the domain classifier constitute an Adversarial Network, which ensures that the learned features are common across different cell types. DANN_TF is evaluated on five TFs in five cell types, for a total of 25 cell-type TF pairs, and compared to a baseline method that does not use an Adversarial Network. For both data augmentation and cross-cell-type prediction, DANN_TF performs better than the baseline method on most cell-type TF pairs. DANN_TF is further evaluated on an additional 13 TFs in the five cell types, for a total of 65 cell-type TF pairs. Results show that DANN_TF achieves significantly higher AUC than the baseline method on 96.9% of the 65 cell-type TF pairs. This is a strong indication that DANN_TF can indeed learn common features for cross-cell-type TFBS prediction.
Affiliation(s)
- Gongqiang Lan
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China
| | - Jiyun Zhou
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China.
| | - Ruifeng Xu
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China.
| | - Qin Lu
- Department of Computing, The Hong Kong Polytechnic University, Hong Kong 810005, China
| | - Hongpeng Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China
|
25
|
Hu J, Wang J, Lin J, Liu T, Zhong Y, Liu J, Zheng Y, Gao Y, He J, Shang X. MD-SVM: a novel SVM-based algorithm for the motif discovery of transcription factor binding sites. BMC Bioinformatics 2019; 20:200. [PMID: 31074373 PMCID: PMC6509868 DOI: 10.1186/s12859-019-2735-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 12/28/2022]
Abstract
BACKGROUND: Transcription factors (TFs) play important roles in the regulation of gene expression. They can activate or block transcription of downstream genes by binding to specific genomic sequences. Therefore, motif discovery of these binding preference patterns is of central significance in understanding molecular regulation mechanisms. Many algorithms have been proposed for the identification of transcription factor binding sites. However, it remains a challenging problem. RESULTS: Here, we propose a novel motif discovery algorithm based on support vector machines (MD-SVM) to learn a discriminative model for TF binding sites. MD-SVM first obtains a position weight matrix (PWM) from a set of training datasets. It then translates the motif discovery (MD) problem into the computational framework of multiple instance learning (MIL). It was applied to several real biological datasets. Results show that our algorithm outperforms MI-SVM in terms of both accuracy and specificity. CONCLUSIONS: In this paper, we modeled the TF motif discovery problem as a MIL optimization problem. The SVM algorithm was adapted to discriminate positive and negative bags of instances. Compared to other SVM-based algorithms, MD-SVM shows its superiority over its competitors in terms of ROC AUC. Hopefully, it could be of benefit to the research community in understanding the molecular functions of DNA functional elements and transcription factors.
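The two ingredients named in the abstract above, a position weight matrix and a multiple instance learning view of sequences, can be illustrated with a minimal Python sketch. It is not the MD-SVM algorithm and omits the SVM training entirely. The pseudocount, the uniform background, the use of the best-scoring window as the bag score (the standard MIL assumption) and the example sites are all assumptions made for this example.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from aligned sites of equal length."""
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        pwm.append({
            b: math.log2((column.count(b) + pseudocount) / total / background)
            for b in BASES
        })
    return pwm

def score_window(pwm, window):
    """Sum of per-position log-odds scores for one window (one MIL instance)."""
    return sum(col[base] for col, base in zip(pwm, window))

def bag_score(pwm, sequence):
    """MIL view: the sequence is a bag of windows and is scored by its best window."""
    width = len(pwm)
    windows = [sequence[i:i + width] for i in range(len(sequence) - width + 1)]
    return max(score_window(pwm, w) for w in windows)

# Hypothetical aligned binding sites and two test sequences.
pwm = build_pwm(["TGACTCA", "TGAGTCA", "TGACTCA"])
bound = bag_score(pwm, "CCTGACTCAGG")
unbound = bag_score(pwm, "CCCCCCCCCCC")
```

A sequence containing a site scores well above a sequence without one, which is the signal a discriminative classifier on bags would then be trained to separate.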
Affiliation(s)
- Jialu Hu
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
- Centre of Multidisciplinary Convergence Computing, School of Computer Science, Northwestern Polytechnical University, 1 Dong Xiang Road, Xi’an, 710129 China
| | - Jingru Wang
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
| | - Jianan Lin
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
| | - Tianwei Liu
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
| | - Yuanke Zhong
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
| | - Jie Liu
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
| | - Yan Zheng
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
| | - Yiqun Gao
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
| | - Junhao He
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
| | - Xuequn Shang
- School of Computer Science, Northwestern Polytechnical University, West Youyi Road 127, Xi’an, 710072 China
|
26
|
Transcription-dependent spreading of the Dal80 yeast GATA factor across the body of highly expressed genes. PLoS Genet 2019; 15:e1007999. [PMID: 30818362 PMCID: PMC6413948 DOI: 10.1371/journal.pgen.1007999] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 07/09/2018] [Revised: 03/12/2019] [Accepted: 01/31/2019] [Indexed: 12/30/2022]
Abstract
GATA transcription factors are highly conserved among eukaryotes and play roles in transcription of genes implicated in cancer progression and hematopoiesis. However, although their consensus binding sites have been well defined in vitro, the in vivo selectivity for recognition by GATA factors remains poorly characterized. Using ChIP-Seq, we identified the Dal80 GATA factor targets in yeast. Our data reveal Dal80 binding to a large set of promoters, sometimes independently of GATA sites, correlating with nitrogen- and/or Dal80-sensitive gene expression. Strikingly, Dal80 was also detected across the body of promoter-bound genes, correlating with high expression. Mechanistic single-gene experiments showed that Dal80 spreading across gene bodies requires active transcription. Consistently, Dal80 co-immunoprecipitated with the initiating and post-initiation forms of RNA Polymerase II. Our work suggests that GATA factors could play dual, synergistic roles during transcription initiation and post-initiation steps, promoting efficient remodeling of the gene expression program in response to environmental changes. GATA transcription factors are highly conserved among eukaryotes and play key roles in cancer progression and hematopoiesis. In budding yeast, four GATA transcription factors are involved in the response to the quality of nitrogen supply. Here, we have determined the whole genome binding profile of the Dal80 GATA factor, and revealed that it also associates with the body of promoter-bound genes. The observation that intragenic spreading correlates with high expression levels and exquisite Dal80 sensitivity suggests that GATA factors could play other, unexpected roles at post-initiation stages in eukaryotes.
|
27
|
Dao P, Hoinka J, Takahashi M, Zhou J, Ho M, Wang Y, Costa F, Rossi JJ, Backofen R, Burnett J, Przytycka TM. AptaTRACE Elucidates RNA Sequence-Structure Motifs from Selection Trends in HT-SELEX Experiments. Cell Syst 2019; 3:62-70. [PMID: 27467247 DOI: 10.1016/j.cels.2016.07.003] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Received: 05/14/2016] [Revised: 06/24/2016] [Accepted: 07/01/2016] [Indexed: 10/21/2022]
Abstract
Aptamers, short RNA or DNA molecules that bind distinct targets with high affinity and specificity, can be identified using high-throughput systematic evolution of ligands by exponential enrichment (HT-SELEX), but scalable analytic tools for understanding sequence-function relationships from diverse HT-SELEX data are not available. Here we present AptaTRACE, a computational approach that leverages the experimental design of the HT-SELEX protocol, RNA secondary structure, and the potential presence of many secondary motifs to identify sequence-structure motifs that show a signature of selection. We apply AptaTRACE to identify nine motifs in C-C chemokine receptor type 7 targeted by aptamers in an in vitro cell-SELEX experiment. We experimentally validate two aptamers whose binding required both sequence and structural features. AptaTRACE can identify low-abundance motifs, and we show through simulations that, because of this, it could lower HT-SELEX cost and time by reducing the number of selection cycles required.
Affiliation(s)
- Phuong Dao
- National Center of Biotechnology Information, National Library of Medicine, NIH, Bethesda, MD 20894, USA
| | - Jan Hoinka
- National Center of Biotechnology Information, National Library of Medicine, NIH, Bethesda, MD 20894, USA
| | - Mayumi Takahashi
- Department of Molecular and Cellular Biology, Beckman Research Institute of City of Hope, Duarte, CA 91010, USA
| | - Jiehua Zhou
- Department of Molecular and Cellular Biology, Beckman Research Institute of City of Hope, Duarte, CA 91010, USA
| | - Michelle Ho
- Department of Molecular and Cellular Biology, Beckman Research Institute of City of Hope, Duarte, CA 91010, USA
| | - Yijie Wang
- National Center of Biotechnology Information, National Library of Medicine, NIH, Bethesda, MD 20894, USA
| | - Fabrizio Costa
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg 79110, Germany
| | - John J Rossi
- Department of Molecular and Cellular Biology, Beckman Research Institute of City of Hope, Duarte, CA 91010, USA
| | - Rolf Backofen
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg 79110, Germany
| | - John Burnett
- Department of Molecular and Cellular Biology, Beckman Research Institute of City of Hope, Duarte, CA 91010, USA
| | - Teresa M Przytycka
- National Center of Biotechnology Information, National Library of Medicine, NIH, Bethesda, MD 20894, USA.
|
28
|
Tran NTL, Huang CH. Performance evaluation for MOTIFSIM. Biol Proced Online 2018; 20:23. [PMID: 30574025 PMCID: PMC6299673 DOI: 10.1186/s12575-018-0088-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2018] [Accepted: 12/07/2018] [Indexed: 11/10/2022]
Abstract
Background: Previous studies show that different motif finders produce different results for an identical dataset. This is largely because these tools use different strategies and possess unique features for discovering motifs. Hence, using multiple tools and methods has been suggested, because the motifs they commonly report are more likely to be biologically significant. Results: The common significant motifs from multiple tools can be obtained by using the MOTIFSIM tool. In this work, we evaluated the performance of MOTIFSIM in three aspects. First, we compared the pair-wise comparison technique of MOTIFSIM with the un-gapped Smith-Waterman algorithm and four common distance metrics: average Kullback-Leibler, average log-likelihood ratio, Chi-Square distance, and Pearson Correlation Coefficient. Second, we compared the performance of MOTIFSIM with the RSAT Matrix-clustering tool for motif clustering. Lastly, we evaluated the performance of nineteen motif finders and the reliability of MOTIFSIM for identifying the common significant motifs from multiple tools. Conclusions: The pair-wise comparison results reveal that MOTIFSIM attains better performance than the un-gapped Smith-Waterman algorithm and the four distance metrics. The clustering results also demonstrate that MOTIFSIM achieves similar or even better performance than RSAT Matrix-clustering. Furthermore, the findings indicate that, when motif detection does not require a special tool for a specific type of motif, using multiple motif finders and combining them with MOTIFSIM to obtain the common significant motifs improves the results of DNA motif detection. Electronic supplementary material: The online version of this article (10.1186/s12575-018-0088-3) contains supplementary material, which is available to authorized users.
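Two of the distance metrics listed in the abstract above can be written down in a few lines of Python for a pair of aligned motif columns, each a probability distribution over A, C, G and T. These are textbook definitions given as a sketch. They are not necessarily the exact variants used in the evaluation, and the function names and example columns are assumptions made for this example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; undefined if either column is uniform."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def kl(p, q):
    """Kullback-Leibler divergence D(p || q); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def average_kl(p, q):
    """Symmetrised divergence: the mean of D(p || q) and D(q || p)."""
    return 0.5 * (kl(p, q) + kl(q, p))

# Hypothetical columns: one prefers A, the other prefers T (order A, C, G, T).
col_a = [0.7, 0.1, 0.1, 0.1]
col_t = [0.1, 0.1, 0.1, 0.7]
similarity = pearson(col_a, col_t)
divergence = average_kl(col_a, col_t)
```

Pearson correlation is a similarity (higher means more alike) while the averaged divergence is a distance (zero for identical columns), so the two must be oriented consistently before they are compared or combined across the columns of a motif.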
Affiliation(s)
- Ngoc Tam L Tran
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269 USA
| | - Chun-Hsi Huang
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269 USA
|
29
|
Tran NTL, Huang CH. MODSIDE: a motif discovery pipeline and similarity detector. BMC Genomics 2018; 19:755. [PMID: 30340511 PMCID: PMC6194616 DOI: 10.1186/s12864-018-5148-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 06/24/2018] [Accepted: 10/08/2018] [Indexed: 01/06/2023]
Abstract
Background Previous studies demonstrate the usefulness of using multiple tools and methods for improving the accuracy of motif detection. Over the past years, numerous motif discovery pipelines have been developed. However, they typically report only the top ranked results either from individual motif finders or from a combination of multiple tools and algorithms. Results Here we present MODSIDE, a motif discovery pipeline and similarity detector. The pipeline integrated four de novo motif finders: ChIPMunk, MEME, Weeder, and XXmotif. It also incorporated a motif similarity detection tool MOTIFSIM. MODSIDE was designed for delivering not only the predictive results from individual motif finders but also the comparison results for multiple tools. The results include the common significant motifs from multiple tools, the motifs detected by some tools but not by others, and the best matches for each motif in the motif collection of multiple tools. MODSIDE also possesses other useful features for merging similar motifs and clustering motifs into motif trees. Conclusions We evaluated MODSIDE and its adopted motif finders on 16 benchmark datasets. The statistical results demonstrate MODSIDE achieves better accuracy than individual motif finders. We also compared MODSIDE with two popular motif discovery pipelines: MEME-ChIP and RSAT peak-motifs. The comparison results reveal MODSIDE attains similar performance as RSAT peak-motifs but better accuracy than MEME-ChIP. In addition, MODSIDE is able to deliver various comparison results that are not offered by MEME-ChIP, RSAT peak-motifs, and other existing motif discovery pipelines. Electronic supplementary material The online version of this article (10.1186/s12864-018-5148-1) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Ngoc Tam L Tran
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, 06269, USA.
| | - Chun-Hsi Huang
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, 06269, USA
| |
|
30
|
Lee NK, Li X, Wang D. A comprehensive survey on genetic algorithms for DNA motif prediction. Inf Sci (N Y) 2018. [DOI: 10.1016/j.ins.2018.07.004] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Indexed: 12/25/2022]
|
31
|
Guca E, Suñol D, Ruiz L, Konkol A, Cordero J, Torner C, Aragon E, Martin-Malpartida P, Riera A, Macias MJ. TGIF1 homeodomain interacts with Smad MH1 domain and represses TGF-β signaling. Nucleic Acids Res 2018; 46:9220-9235. [PMID: 30060237 PMCID: PMC6158717 DOI: 10.1093/nar/gky680] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 05/16/2018] [Accepted: 07/17/2018] [Indexed: 12/16/2022] Open
Abstract
TGIF1 is a multifunctional protein that represses TGF-β-activated transcription by interacting with Smad2-Smad4 complexes. We found that the complex structure of TGIF1-HD bound to the TGACA motif revealed a combined binding mode that involves the HD core and the major groove, on the one hand, and the amino-terminal (N-term) arm and the minor groove of the DNA, on the other. We also show that TGIF1-HD interacts with the MH1 domain of Smad proteins, thereby indicating that TGIF1-HD is also a protein-binding domain. Moreover, the formation of the HD-MH1 complex partially hinders the DNA-binding site of the complex, preventing the efficient interaction of TGIF1-HD with DNA. We propose that the binding of the TGIF1 C-term to the Smad2-MH2 domain brings both the HD and MH1 domain into close proximity. This local proximity facilitates the interaction of these DNA-binding domains, thus strengthening the formation of the protein complex versus DNA binding. Once the protein complex has been formed, the TGIF1-Smad system would be released from promoters/enhancers, thereby illustrating one of the mechanisms used by TGIF1 to exert its function as an active repressor of Smad-induced TGF-β signaling.
Affiliation(s)
- Ewelina Guca
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac, 10, Barcelona 08028, Spain
| | - David Suñol
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac, 10, Barcelona 08028, Spain
| | - Lidia Ruiz
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac, 10, Barcelona 08028, Spain
| | - Agnieszka Konkol
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac, 10, Barcelona 08028, Spain
| | - Jorge Cordero
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac, 10, Barcelona 08028, Spain
| | - Carles Torner
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac, 10, Barcelona 08028, Spain
| | - Eric Aragon
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac, 10, Barcelona 08028, Spain
| | - Pau Martin-Malpartida
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac, 10, Barcelona 08028, Spain
| | - Antoni Riera
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac, 10, Barcelona 08028, Spain
- Departament de Química Inorgànica i Orgànica, Secció de Química Orgànica, Universitat de Barcelona, Martí i Franquès 1-11, 08028, Barcelona, Spain
| | - Maria J Macias
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac, 10, Barcelona 08028, Spain
- ICREA, Passeig Lluís Companys 23, 08010-Barcelona, Spain
- To whom correspondence should be addressed. Tel: +34 934037189;
| |
|
32
|
Al-Ouran R, Schmidt R, Naik A, Jones J, Drews F, Juedes D, Elnitski L, Welch L. Discovering Gene Regulatory Elements Using Coverage-Based Heuristics. IEEE/ACM Trans Comput Biol Bioinform 2018; 15:1290-1300. [PMID: 26540692 DOI: 10.1109/tcbb.2015.2496261] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/05/2023]
Abstract
Data mining algorithms and sequencing methods (such as RNA-seq and ChIP-seq) are being combined to discover genomic regulatory motifs that relate to a variety of phenotypes. However, motif discovery algorithms often produce very long lists of putative transcription factor binding sites, hindering the discovery of phenotype-related regulatory elements by making it difficult to select a manageable set of candidate motifs for experimental validation. To address this issue, the authors introduce the motif selection problem and provide coverage-based search heuristics for its solution. Analysis of 203 ChIP-seq experiments from the ENCyclopedia of DNA Elements project shows that our algorithms produce motifs that have high sensitivity and specificity and reveals new insights about the regulatory code of the human genome. The greedy algorithm performs the best, selecting a median of two motifs per ChIP-seq transcription factor group while achieving a median sensitivity of 77 percent.
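The motif selection problem described in this abstract resembles a set-cover problem: choose a small set of motifs whose occurrences together cover as many sequences as possible. The sketch below shows a generic greedy set-cover heuristic in that spirit. It is an assumption-laden illustration rather than the authors' algorithm; the function name, the `target_fraction` parameter, and the toy coverage table are all hypothetical.

```python
def greedy_motif_selection(coverage, target_fraction=1.0):
    """Greedily pick motifs until the requested share of sequences is covered.

    coverage maps a motif name to the set of sequence IDs it occurs in.
    The universe is assumed to be every sequence hit by at least one motif.
    """
    universe = set().union(*coverage.values())
    goal = target_fraction * len(universe)
    covered = set()
    selected = []
    while len(covered) < goal:
        # Choose the motif that adds the most not-yet-covered sequences.
        best = max(coverage, key=lambda m: len(coverage[m] - covered))
        gain = coverage[best] - covered
        if not gain:
            break  # no motif can improve coverage any further
        selected.append(best)
        covered |= gain
    return selected, covered

# Hypothetical motif-to-sequence coverage table.
toy_coverage = {
    "m1": {1, 2, 3, 4},
    "m2": {3, 4, 5},
    "m3": {5, 6},
    "m4": {1},
}
chosen, covered_ids = greedy_motif_selection(toy_coverage)
```

Here the heuristic picks `m1` first (four new sequences), then `m3` (two new sequences), and stops once all six sequences are covered. The sensitivity and specificity reported in the abstract would additionally require a background set, which this sketch does not model.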
|
33
|
Yu Q, Wei D, Huo H. SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets. BMC Bioinformatics 2018; 19:228. [PMID: 29914360 PMCID: PMC6006848 DOI: 10.1186/s12859-018-2242-y] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 02/02/2018] [Accepted: 06/12/2018] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND Given a set of t n-length DNA sequences, q satisfying 0 < q ≤ 1, and l and d satisfying 0 ≤ d < l < n, the quorum planted motif search (qPMS) finds l-length strings that occur in at least qt input sequences with up to d mismatches and is mainly used to locate transcription factor binding sites in DNA sequences. Existing qPMS algorithms have been able to efficiently process small standard datasets (e.g., t = 20 and n = 600), but they are too time consuming to process large DNA datasets, such as ChIP-seq datasets that contain thousands of sequences or more. RESULTS We analyze the effects of t and q on the time performance of qPMS algorithms and find that a large t or a small q causes a longer computation time. Based on this information, we improve the time performance of existing qPMS algorithms by selecting a sample sequence set D' with a small t and a large q from the large input dataset D and then executing qPMS algorithms on D'. A sample sequence selection algorithm named SamSelect is proposed. The experimental results on both simulated and real data show (1) that SamSelect can select D' efficiently and (2) that the qPMS algorithms executed on D' can find implanted or real motifs in a significantly shorter time than when executed on D. CONCLUSIONS We improve the ability of existing qPMS algorithms to process large DNA datasets from the perspective of selecting high-quality sample sequence sets so that the qPMS algorithms can find motifs in a short time in the selected sample sequence set D', rather than take an unfeasibly long time to search the original sequence set D. Our motif discovery method is an approximate algorithm.
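The qPMS definition in this abstract can be made concrete with a brute-force search that enumerates every l-length string and counts the input sequences containing it with at most d mismatches. This is a minimal sketch for tiny inputs only, assuming the DNA alphabet ACGT; it is neither SamSelect nor any of the existing qPMS algorithms, and the function names are illustrative.

```python
import math
from itertools import product

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def occurs(candidate, seq, d):
    """True if some window of seq is within d mismatches of candidate."""
    l = len(candidate)
    return any(
        hamming(candidate, seq[i:i + l]) <= d
        for i in range(len(seq) - l + 1)
    )

def qpms_bruteforce(seqs, l, d, q):
    """Return all l-mers occurring in at least q*t of the t sequences."""
    needed = math.ceil(q * len(seqs))
    hits = []
    for letters in product("ACGT", repeat=l):  # 4**l candidates
        candidate = "".join(letters)
        support = sum(occurs(candidate, s, d) for s in seqs)
        if support >= needed:
            hits.append(candidate)
    return hits

# Hypothetical toy input: t = 4 sequences of length n = 8.
toy_seqs = ["TTACGTTT", "GGACGTGG", "CCACGACC", "AAAAAAAA"]
exact_hits = qpms_bruteforce(toy_seqs, l=4, d=0, q=0.5)
fuzzy_hits = qpms_bruteforce(toy_seqs, l=4, d=1, q=0.75)
```

With no mismatches allowed and q = 0.5, only `ACGT` appears in at least two of the four sequences. The enumeration grows as 4^l and every candidate is checked against all t sequences, which illustrates why a large t makes exact search expensive and why the abstract proposes running qPMS on a smaller sample set.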
Affiliation(s)
- Qiang Yu
- School of Computer Science and Technology, Xidian University, Xi’an, 710071 China
| | - Dingbang Wei
- School of Computer Science and Technology, Xidian University, Xi’an, 710071 China
| | - Hongwei Huo
- School of Computer Science and Technology, Xidian University, Xi’an, 710071 China
| |
|
34
|
Lee NK, Azizan FL, Wong YS, Omar N. DeepFinder: An integration of feature-based and deep learning approach for DNA motif discovery. Biotechnol Biotechnol Equip 2018. [DOI: 10.1080/13102818.2018.1438209] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 10/18/2022] Open
Affiliation(s)
- Nung Kion Lee
- Department of Cognitive Sciences, Faculty of Cognitive Sciences and Human Development, Universiti Malaysia Sarawak, Kota Samarahan, Sarawak, Malaysia
| | - Farah Liyana Azizan
- Centre For Pre-University Studies, Universiti Malaysia Sarawak, Kota Samarahan, Sarawak, Malaysia
| | - Yu Shiong Wong
- Department of Cognitive Sciences, Faculty of Cognitive Sciences and Human Development, Universiti Malaysia Sarawak, Kota Samarahan, Sarawak, Malaysia
| | - Norshafarina Omar
- Department of Cognitive Sciences, Faculty of Cognitive Sciences and Human Development, Universiti Malaysia Sarawak, Kota Samarahan, Sarawak, Malaysia
| |
|
35
|
Mitra S, Biswas A, Narlikar L. DIVERSITY in binding, regulation, and evolution revealed from high-throughput ChIP. PLoS Comput Biol 2018; 14:e1006090. [PMID: 29684008 PMCID: PMC5933800 DOI: 10.1371/journal.pcbi.1006090] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 09/07/2017] [Revised: 05/03/2018] [Accepted: 03/14/2018] [Indexed: 12/27/2022] Open
Abstract
Genome-wide in vivo protein-DNA interactions are routinely mapped using high-throughput chromatin immunoprecipitation (ChIP). ChIP-reported regions are typically investigated for enriched sequence-motifs, which are likely to model the DNA-binding specificity of the profiled protein and/or of co-occurring proteins. However, simple enrichment analyses can miss insights into the binding-activity of the protein. Note that ChIP reports regions making direct contact with the protein as well as those binding through intermediaries. For example, consider a ChIP experiment targeting protein X, which binds DNA at its cognate sites, but simultaneously interacts with four other proteins. Each of these proteins also binds to its own specific cognate sites along distant parts of the genome, a scenario consistent with the current view of transcriptional hubs and chromatin loops. Since ChIP will pull down all X-associated regions, the final reported data will be a union of five distinct sets of regions, each containing binding sites of one of the five proteins, respectively. Characterizing all five different motifs and the corresponding sets is important to interpret the ChIP experiment and ultimately, the role of X in regulation. We present DIVERSITY, which attempts exactly this: it partitions the data so that each partition can be characterized with its own de novo motif. DIVERSITY uses a Bayesian approach to identify the optimal number of motifs and the associated partitions, which together explain the entire dataset. This is in contrast to standard motif finders, which report motifs individually enriched in the data, but do not necessarily explain all reported regions. We show that the different motifs and associated regions identified by DIVERSITY give insights into the various complexes that may be forming along the chromatin, something that has so far not been attempted from ChIP data.
Webserver at http://diversity.ncl.res.in/; standalone (Mac OS X/Linux) from https://github.com/NarlikarLab/DIVERSITY/releases/tag/v1.0.0. A high-throughput chromatin immunoprecipitation (ChIP) experiment identifies genomic regions bound by a protein in vivo. Current motif-discovery approaches seek an enriched motif signature in the reported regions, which they can attribute to the protein’s binding preferences. However, DIVERSITY models the fact that, since a ChIP experiment pulls down regions participating in all complexes involving the profiled protein, the reported regions are, in all likelihood, a collection of different types of protein-DNA contacts. DIVERSITY asks a different question: what sequence component caused a specific region to be reported in a ChIP experiment? The answer, in combination with additional data such as sequence conservation, SNPs, chromatin structure, downstream gene-expression, etc., can yield insights into the diverse regulatory mechanisms at play. The added benefits of a webserver and a standalone parallel version make DIVERSITY a practical tool for discovering new biology from ChIP experiments.
Affiliation(s)
- Sneha Mitra
- Department of Chemical Engineering, CSIR-National Chemical Laboratory, Pune, India
| | - Anushua Biswas
- Department of Chemical Engineering, CSIR-National Chemical Laboratory, Pune, India
| | - Leelavati Narlikar
- Department of Chemical Engineering, CSIR-National Chemical Laboratory, Pune, India
| |
|
36
|
Guo Y, Tian K, Zeng H, Guo X, Gifford DK. A novel k-mer set memory (KSM) motif representation improves regulatory variant prediction. Genome Res 2018; 28:891-900. [PMID: 29654070 PMCID: PMC5991515 DOI: 10.1101/gr.226852.117] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 06/29/2017] [Accepted: 04/04/2018] [Indexed: 12/15/2022]
Abstract
The representation and discovery of transcription factor (TF) sequence binding specificities is critical for understanding gene regulatory networks and interpreting the impact of disease-associated noncoding genetic variants. We present a novel TF binding motif representation, the k-mer set memory (KSM), which consists of a set of aligned k-mers that are overrepresented at TF binding sites, and a new method called KMAC for de novo discovery of KSMs. We find that KSMs more accurately predict in vivo binding sites than position weight matrix (PWM) models and other more complex motif models across a large set of ChIP-seq experiments. Furthermore, KSMs outperform PWMs and more complex motif models in predicting in vitro binding sites. KMAC also identifies correct motifs in more experiments than five state-of-the-art motif discovery methods. In addition, KSM-derived features outperform both PWM and deep learning model derived sequence features in predicting differential regulatory activities of expression quantitative trait loci (eQTL) alleles. Finally, we have applied KMAC to 1600 ENCODE TF ChIP-seq data sets and created a public resource of KSM and PWM motifs. We expect that the KSM representation and KMAC method will be valuable in characterizing TF binding specificities and in interpreting the effects of noncoding genetic variations.
Affiliation(s)
- Yuchun Guo
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| | - Kevin Tian
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| | - Haoyang Zeng
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| | - Xiaoyun Guo
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| | - David Kenneth Gifford
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| |
|
37
|
Tiana M, Acosta-Iborra B, Puente-Santamaría L, Hernansanz-Agustin P, Worsley-Hunt R, Masson N, García-Rio F, Mole D, Ratcliffe P, Wasserman WW, Jimenez B, del Peso L. The SIN3A histone deacetylase complex is required for a complete transcriptional response to hypoxia. Nucleic Acids Res 2018; 46:120-133. [PMID: 29059365 PMCID: PMC5758878 DOI: 10.1093/nar/gkx951] [Citation(s) in RCA: 64] [Impact Index Per Article: 10.7] [Received: 02/22/2017] [Revised: 10/02/2017] [Accepted: 10/06/2017] [Indexed: 01/02/2023] Open
Abstract
Cells adapt to environmental changes, including fluctuations in oxygen levels, through the induction of specific gene expression programs. To identify genes regulated by hypoxia at the transcriptional level, we pulse-labeled HUVEC cells with 4-thiouridine and sequenced nascent transcripts. Then, we searched genome-wide binding profiles from the ENCODE project for factors that correlated with changes in transcription and identified binding of several components of the Sin3A co-repressor complex, including SIN3A, SAP30 and HDAC1/2, proximal to genes repressed by hypoxia. SIN3A interference revealed that it participates in the downregulation of 75% of the hypoxia-repressed genes in endothelial cells. Unexpectedly, it also blunted the induction of 47% of the upregulated genes, suggesting a role for this corepressor in gene induction. In agreement, ChIP-seq experiments showed that SIN3A preferentially localizes to the promoter region of actively transcribed genes and that SIN3A signal was enriched in hypoxia-repressed genes prior to exposure to the stimulus. Importantly, SIN3A occupancy was not altered by hypoxia in spite of changes in H3K27ac signal. In summary, our results reveal a prominent role for SIN3A in the transcriptional response to hypoxia and suggest a model where modulation of the associated histone deacetylase activity, rather than its recruitment, determines the transcriptional output.
Affiliation(s)
- Maria Tiana
- Departamento de Bioquímica, Universidad Autónoma de Madrid (UAM) and Instituto de Investigaciones Biomédicas ‘Alberto Sols’ (CSIC-UAM), 28029 Madrid, Spain
- IdiPaz, Instituto de Investigación Sanitaria del Hospital Universitario La Paz, 28029 Madrid, Spain
- CIBER de Enfermedades Respiratorias (CIBERES), Instituto de Salud Carlos III, 28029 Madrid, Spain
| | - Barbara Acosta-Iborra
- Departamento de Bioquímica, Universidad Autónoma de Madrid (UAM) and Instituto de Investigaciones Biomédicas ‘Alberto Sols’ (CSIC-UAM), 28029 Madrid, Spain
| | - Laura Puente-Santamaría
- Departamento de Bioquímica, Universidad Autónoma de Madrid (UAM) and Instituto de Investigaciones Biomédicas ‘Alberto Sols’ (CSIC-UAM), 28029 Madrid, Spain
| | - Pablo Hernansanz-Agustin
- Departamento de Bioquímica, Universidad Autónoma de Madrid (UAM) and Instituto de Investigaciones Biomédicas ‘Alberto Sols’ (CSIC-UAM), 28029 Madrid, Spain
- Servicio Inmunología, Hospital Universitario de La Princesa, Instituto de Investigación Sanitaria del hospital de La Princesa, 28006 Madrid, Spain
| | - Rebecca Worsley-Hunt
- Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Department of Medical Genetics, University of British Columbia Vancouver, British Columbia V5Z 4H4, Canada
| | - Norma Masson
- Target Discovery Institute, University of Oxford, Oxford OX3 7FZ, UK
| | - Francisco García-Rio
- IdiPaz, Instituto de Investigación Sanitaria del Hospital Universitario La Paz, 28029 Madrid, Spain
- CIBER de Enfermedades Respiratorias (CIBERES), Instituto de Salud Carlos III, 28029 Madrid, Spain
- Servicio de Neumología, Hospital Universitario La Paz, Instituto de Investigación Sanitaria del hospital de La Paz, 28029 Madrid, Spain
| | - David Mole
- Henry Wellcome Building for Molecular Physiology, University of Oxford, Oxford OX3 7BN, UK
| | - Peter Ratcliffe
- Target Discovery Institute, University of Oxford, Oxford OX3 7FZ, UK
| | - Wyeth W Wasserman
- Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Department of Medical Genetics, University of British Columbia Vancouver, British Columbia V5Z 4H4, Canada
| | - Benilde Jimenez
- Departamento de Bioquímica, Universidad Autónoma de Madrid (UAM) and Instituto de Investigaciones Biomédicas ‘Alberto Sols’ (CSIC-UAM), 28029 Madrid, Spain
- IdiPaz, Instituto de Investigación Sanitaria del Hospital Universitario La Paz, 28029 Madrid, Spain
- CIBER de Enfermedades Respiratorias (CIBERES), Instituto de Salud Carlos III, 28029 Madrid, Spain
| | - Luis del Peso
- Departamento de Bioquímica, Universidad Autónoma de Madrid (UAM) and Instituto de Investigaciones Biomédicas ‘Alberto Sols’ (CSIC-UAM), 28029 Madrid, Spain
- IdiPaz, Instituto de Investigación Sanitaria del Hospital Universitario La Paz, 28029 Madrid, Spain
- CIBER de Enfermedades Respiratorias (CIBERES), Instituto de Salud Carlos III, 28029 Madrid, Spain
| |
|
38
|
Vishnevsky OV, Bocharnikov AV, Kolchanov NA. Argo_CUDA: Exhaustive GPU based approach for motif discovery in large DNA datasets. J Bioinform Comput Biol 2017; 16:1740012. [PMID: 29281953 DOI: 10.1142/s0219720017400121] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 11/18/2022]
Abstract
The development of chromatin immunoprecipitation sequencing (ChIP-seq) technology has revolutionized the genetic analysis of the basic mechanisms underlying transcription regulation and led to the accumulation of a huge number of DNA sequences. Many web services are currently available for de novo motif discovery in datasets containing information about DNA/protein binding. The enormous diversity of motifs makes finding them challenging. To avoid these difficulties, researchers use various stochastic approaches. Unfortunately, the efficiency of motif discovery programs declines dramatically as the query set size increases. As a result, only a fraction of the top "peak" ChIP-Seq segments can be analyzed, or the area of analysis has to be narrowed. Thus, motif discovery in massive datasets remains a challenging issue. The Argo_CUDA (Compute Unified Device Architecture) web service is designed to process massive DNA data. It is a program for the detection of degenerate oligonucleotide motifs of fixed length written in the 15-letter IUPAC code. Argo_CUDA is a fully exhaustive approach based on high-performance GPU technologies. Compared with existing motif discovery web services, Argo_CUDA shows good prediction quality on simulated sets. The analysis of ChIP-Seq sequences revealed motifs that correspond to known transcription factor binding sites.
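A degenerate motif written in the 15-letter IUPAC code, as mentioned in this abstract, can be matched against DNA by expanding each symbol to the set of bases it stands for. The sketch below is a plain CPU illustration of that matching step only, not the Argo_CUDA method or its GPU search; the function names and the example motif are illustrative assumptions.

```python
# Standard IUPAC nucleotide codes: each symbol maps to the bases it allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def matches_at(motif, seq, start):
    """True if the degenerate motif matches seq beginning at position start."""
    return all(seq[start + i] in IUPAC[sym] for i, sym in enumerate(motif))

def find_occurrences(motif, seq):
    """0-based start positions where the motif matches the forward strand."""
    last_start = len(seq) - len(motif)
    return [i for i in range(last_start + 1) if matches_at(motif, seq, i)]

def fraction_with_motif(motif, seqs):
    """Share of sequences containing at least one occurrence of the motif."""
    if not seqs:
        return 0.0
    return sum(bool(find_occurrences(motif, s)) for s in seqs) / len(seqs)

# Hypothetical example: S stands for C or G at the fourth position.
positions = find_occurrences("TGASTCA", "AATGACTCAGG")
share = fraction_with_motif("TGASTCA", ["AATGACTCAGG", "TGAGTCA", "TGAATCA"])
```

An exhaustive search of the kind the abstract describes would score every fixed-length string over this alphabet against the whole dataset, which is the part that benefits from GPU parallelism. Reverse-strand matching is omitted here for brevity.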
Affiliation(s)
- Oleg V Vishnevsky
- Institute of Cytology and Genetics SB RAS, Lavrentieva Ave., 10, Novosibirsk 630090, Russia; Novosibirsk State University, Pirogova, 10, Novosibirsk 630090, Russia
| | | | - Nikolay A Kolchanov
- Institute of Cytology and Genetics SB RAS, Lavrentieva Ave., 10, Novosibirsk 630090, Russia; Novosibirsk State University, Pirogova, 10, Novosibirsk 630090, Russia
| |
|
39
|
Fu H, Zhang X. Noncoding Variants Functional Prioritization Methods Based on Predicted Regulatory Factor Binding Sites. Curr Genomics 2017; 18:322-331. [PMID: 29081688 PMCID: PMC5635616 DOI: 10.2174/1389202918666170228143619] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/01/2016] [Revised: 10/16/2016] [Accepted: 11/02/2016] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND With the advent of the post-genomic era, research into the genetic mechanisms of disease has come to depend increasingly on studies of genes, gene networks and gene-protein interaction networks. To explore gene expression and regulation, researchers have carried out many studies on transcription factors and their binding sites (TFBSs). Based on the large number of TFBS values predicted by deep learning models, further computation and analysis have been done to reveal the relationship between gene mutation and the occurrence of disease. It has been demonstrated that deep learning methods outperform conventional methods in predicting the functions of noncoding variants. Research on predicting the functions of single nucleotide polymorphisms (SNPs) is expected to uncover the mechanisms by which gene mutations affect human traits and diseases. RESULTS We reviewed the conventional TFBS identification methods from different perspectives. For the deep learning methods used to predict TFBSs, we discussed the related problems, such as raw data preprocessing, the structural design of the deep convolutional neural network (CNN) and the measurement of model performance. We then summarized the techniques that are usually used to find functional noncoding variants from de novo sequence. CONCLUSION Along with the rapid development of high-throughput assays, more sample data and chromatin features should help improve the prediction accuracy of deep convolutional neural networks for TFBS identification. Meanwhile, gaining more insight into the deep CNN framework itself has proved useful both for improving model performance and for developing designs better suited to the sample data. Based on the feature values predicted by the deep CNN model, the prioritization model for functional noncoding variants would help reveal the effect of gene mutation on disease.
Affiliation(s)
- Haoyue Fu
- College of Sciences, Northeastern University, Shenyang, China
| | - Lianping Yang
- College of Sciences, Northeastern University, Shenyang, China
- University of Southern California, Dept. Biol. Sci., Program Mol & Computat Biol, USA
| | - Xiangde Zhang
- College of Sciences, Northeastern University, Shenyang, China
| |
|
40
|
Zhang H, Zhu L, Huang DS. WSMD: weakly-supervised motif discovery in transcription factor ChIP-seq data. Sci Rep 2017; 7:3217. [PMID: 28607381 PMCID: PMC5468353 DOI: 10.1038/s41598-017-03554-7] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 12/15/2016] [Accepted: 05/02/2017] [Indexed: 01/24/2023] Open
Abstract
Although discriminative motif discovery (DMD) methods are promising for eliciting motifs from high-throughput experimental data, computational expense forces most existing DMD methods to adopt approximate schemes that greatly restrict the search space, leading to a significant loss of predictive accuracy. In this paper, we propose Weakly-Supervised Motif Discovery (WSMD) to discover motifs from ChIP-seq datasets. In contrast to the learning strategies adopted by previous DMD methods, WSMD allows a "global" optimization scheme of the motif parameters in continuous space, thereby reducing the information loss of model representation and improving the quality of the resultant motifs. Meanwhile, by exploiting the connection between the DMD framework and existing weakly supervised learning (WSL) technologies, we also present highly scalable learning strategies for the proposed method. The experimental results on both real ChIP-seq datasets and synthetic datasets show that WSMD substantially outperforms former DMD methods (including DREME, HOMER, XXmotif, motifRG and DECOD) in terms of predictive accuracy, while also achieving a competitive computational speed.
Affiliation(s)
- Hongbo Zhang
- Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
| | - Lin Zhu
- Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China.
| |
|
41
|
Tran NTL, Huang CH. Cloud-based MOTIFSIM: Detecting Similarity in Large DNA Motif Data Sets. J Comput Biol 2017; 24:450-459. [DOI: 10.1089/cmb.2016.0080] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/13/2022] Open
Affiliation(s)
- Ngoc Tam L. Tran
- Department of Computer Science and Engineering, University of Connecticut, Storrs, Connecticut
| | - Chun-Hsi Huang
- Department of Computer Science and Engineering, University of Connecticut, Storrs, Connecticut
| |
|
42
|
Liu B, Yang J, Li Y, McDermaid A, Ma Q. An algorithmic perspective of de novo cis-regulatory motif finding based on ChIP-seq data. Brief Bioinform 2017; 19:1069-1081. [DOI: 10.1093/bib/bbx026] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Received: 12/20/2016] [Indexed: 01/06/2023] Open
Affiliation(s)
- Bingqiang Liu
  - School of Mathematics, Shandong University, Jinan Shandong, P. R. China
- Jinyu Yang
  - Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, USA
- Yang Li
  - School of Mathematics, Shandong University, Jinan Shandong, P. R. China
- Adam McDermaid
  - Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, USA
- Qin Ma
  - Department of Agronomy, Horticulture and Plant Science, South Dakota State University, Brookings, SD, USA
43
Kumagai Y, Vandenbon A, Teraguchi S, Akira S, Suzuki Y. Genome-wide map of RNA degradation kinetics patterns in dendritic cells after LPS stimulation facilitates identification of primary sequence and secondary structure motifs in mRNAs. BMC Genomics 2016; 17:1032. [PMID: 28155712 PMCID: PMC5259865 DOI: 10.1186/s12864-016-3325-7] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Indexed: 12/20/2022]
Abstract
BACKGROUND: Immune cells have to change their gene expression patterns dynamically in response to external stimuli such as lipopolysaccharide (LPS). Gene expression is regulated at multiple steps in eukaryotic cells, in which control of RNA levels at both the transcriptional and the post-transcriptional level plays an important role. Impairment of this control leads to aberrant immune responses such as excessive or impaired production of cytokines. However, genome-wide studies focusing on post-transcriptional control were relatively rare until recently. Moreover, several RNA cis elements and RNA-binding proteins have been found to be involved in the process, but our general understanding remains poor, partly because identification of regulatory RNA motifs is very challenging in spite of its importance. We took advantage of genome-wide measurement of RNA degradation, in combination with estimation of degradation kinetics by a qualitative approach, and performed de novo prediction of RNA sequence and structure motifs.
METHODS: To classify genes by their RNA degradation kinetics, we first measured RNA degradation time courses in mouse dendritic cells after LPS stimulation, and the time courses were clustered to estimate degradation kinetics and to find patterns in the kinetics. Genes were then clustered by their similarity in degradation kinetics patterns. The 3' UTR sequences of a cluster were subjected to de novo sequence or structure motif prediction.
RESULTS: Quick degradation kinetics was found to be strongly associated with lower gene expression level, immediate regulation (both induction and repression) of gene expression level, and longer 3' UTR length. De novo sequence motif prediction found AU-rich element-like and TTP-binding sequence-like motifs, which are enriched in quickly degrading genes. De novo structure motif prediction found a known functional motif, namely a stem-loop structure containing a sequence bound by the RNA-binding proteins Roquin and Regnase-1, as well as unknown motifs.
CONCLUSIONS: The current study indicated that degradation kinetics patterns lead to a classification different from that by gene expression, and that this differential classification facilitates identification of functional motifs. Identification of novel motif candidates implied post-transcriptional controls different from those by known pairs of RNA-binding protein and RNA motif.
Affiliation(s)
- Yutaro Kumagai
  - Quantitative Immunology Research Unit, WPI Immunology Frontier Research Center, Osaka University, 3-1 Yamada-oka, Suita, Osaka, 565-0871, Japan
- Alexis Vandenbon
  - Immuno-Genomics Research Unit, WPI Immunology Frontier Research Center, Osaka University, 3-1 Yamada-oka, Suita, Osaka, 565-0871, Japan
- Shunsuke Teraguchi
  - Quantitative Immunology Research Unit, WPI Immunology Frontier Research Center, Osaka University, 3-1 Yamada-oka, Suita, Osaka, 565-0871, Japan
- Shizuo Akira
  - Laboratory of Host Defense, WPI Immunology Frontier Research Center, Osaka University, 3-1 Yamada-oka, Suita, Osaka, 565-0871, Japan
- Yutaka Suzuki
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, 277-8561, Japan
44
Zhang Y, Wang P, Yan M. An Entropy-Based Position Projection Algorithm for Motif Discovery. Biomed Res Int 2016; 2016:9127474. [PMID: 27882329 PMCID: PMC5110948 DOI: 10.1155/2016/9127474] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 05/17/2016] [Revised: 09/20/2016] [Accepted: 10/05/2016] [Indexed: 12/31/2022]
Abstract
The motif discovery problem is crucial for understanding the structure and function of gene expression. Over the past decades, many attempts at motif finding using consensus and probability training models have been successful. However, most existing motif discovery algorithms are still time-consuming or easily trapped in a local optimum. To overcome these shortcomings, in this paper, we propose an entropy-based position projection algorithm, called EPP, which designs a projection process to divide the dataset and explores the best local optimal solution. The experimental results on real DNA sequences, Tompa data, and ChIP-seq data show that EPP is advantageous in dealing with the motif discovery problem and outperforms current widely used algorithms.
Affiliation(s)
- Yipu Zhang
  - Department of Automation, School of Electronics and Control Engineering, Chang'An University, Xi'an 710064, China
- Ping Wang
  - Department of Automation, School of Electronics and Control Engineering, Chang'An University, Xi'an 710064, China
- Maode Yan
  - Department of Automation, School of Electronics and Control Engineering, Chang'An University, Xi'an 710064, China
45
Yu Q, Huo H, Feng D. PairMotifChIP: A Fast Algorithm for Discovery of Patterns Conserved in Large ChIP-seq Data Sets. Biomed Res Int 2016; 2016:4986707. [PMID: 27843946 PMCID: PMC5098105 DOI: 10.1155/2016/4986707] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 06/22/2016] [Revised: 09/04/2016] [Accepted: 09/27/2016] [Indexed: 11/18/2022]
Abstract
Identifying conserved patterns in DNA sequences, namely, motif discovery, is an important and challenging computational task. Containing hundreds of sequences or more, a high-throughput sequencing data set helps improve the identification accuracy of motif discovery but demands even higher computing performance. To efficiently identify motifs in large DNA data sets, a new algorithm called PairMotifChIP is proposed, which extracts and combines pairs of l-mers in the input with relatively small Hamming distance. In particular, a method for rapidly extracting pairs of l-mers is designed, which can be used not only for PairMotifChIP, but also for other DNA data mining tasks with the same demand. Experimental results on the simulated data show that the proposed algorithm can find motifs successfully and runs faster than the state-of-the-art motif discovery algorithms. Furthermore, the validity of the proposed algorithm has been verified on real data.
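As a rough illustration of the pair-extraction idea mentioned in this abstract, the sketch below enumerates l-mer pairs by brute force. It is not the authors' rapid extraction method, which the abstract does not detail. The function names (`hamming`, `lmers`, `close_lmer_pairs`), the restriction to pairs drawn from different sequences, and the `max_dist` threshold are all assumptions made for the example.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def lmers(seq: str, l: int) -> list[str]:
    """All overlapping substrings of length l, in order of position."""
    return [seq[i:i + l] for i in range(len(seq) - l + 1)]


def close_lmer_pairs(sequences: list[str], l: int, max_dist: int):
    """Brute-force search for l-mer pairs, taken from two different
    sequences (an assumption of this sketch), whose Hamming distance
    is at most max_dist. Returns (seq_index, lmer, seq_index, lmer)."""
    pairs = []
    for (i, s), (j, t) in combinations(enumerate(sequences), 2):
        for a in lmers(s, l):
            for b in lmers(t, l):
                if hamming(a, b) <= max_dist:
                    pairs.append((i, a, j, b))
    return pairs


if __name__ == "__main__":
    for pair in close_lmer_pairs(["ACGTAC", "TTACGA"], l=4, max_dist=1):
        print(pair)
```

The nested loops compare every l-mer of one sequence with every l-mer of another, so the cost grows quadratically with the total input length; that cost is the kind of bottleneck a fast extraction method would need to avoid.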
Affiliation(s)
- Qiang Yu
  - School of Computer Science and Technology, Xidian University, Xi'an 710071, China
- Hongwei Huo
  - School of Computer Science and Technology, Xidian University, Xi'an 710071, China
- Dazheng Feng
  - School of Electronic Engineering, Xidian University, Xi'an 710071, China
46
Mathelier A, Xin B, Chiu TP, Yang L, Rohs R, Wasserman WW. DNA Shape Features Improve Transcription Factor Binding Site Predictions In Vivo. Cell Syst 2016; 3:278-286.e4. [PMID: 27546793 PMCID: PMC5042832 DOI: 10.1016/j.cels.2016.07.001] [Citation(s) in RCA: 84] [Impact Index Per Article: 10.5] [Received: 09/12/2015] [Revised: 03/04/2016] [Accepted: 06/30/2016] [Indexed: 01/09/2023]
Abstract
Interactions of transcription factors (TFs) with DNA comprise a complex interplay between base-specific amino acid contacts and readout of DNA structure. Recent studies have highlighted the complementarity of DNA sequence and shape in modeling TF binding in vitro. Here, we have provided a comprehensive evaluation of in vivo datasets to assess the predictive power obtained by augmenting various DNA sequence-based models of TF binding sites (TFBSs) with DNA shape features (helix twist, minor groove width, propeller twist, and roll). Results from 400 human ChIP-seq datasets for 76 TFs show that combining DNA shape features with position-specific scoring matrix (PSSM) scores improves TFBS predictions. Improvement has also been observed using TF flexible models and a machine-learning approach using a binary encoding of nucleotides in lieu of PSSMs. Incorporating DNA shape information is most beneficial for E2F and MADS-domain TF families. Our findings indicate that incorporating DNA sequence and shape information benefits the modeling of TF binding under complex in vivo conditions.
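For readers unfamiliar with the PSSM scores that this abstract takes as its sequence-only baseline, the following is a minimal sketch of log-odds PSSM scoring. It covers only the sequence component; the DNA shape features and the classifiers evaluated in the paper are not modeled here. The function names, the pseudocount of 1, the uniform background of 0.25, and the toy binding sites are assumptions chosen for illustration.

```python
import math


def build_pssm(sites, alphabet="ACGT", pseudocount=1.0, background=0.25):
    """Build a log2-odds position-specific scoring matrix from aligned,
    equal-length sites. Returns one {base: score} dict per position."""
    total = len(sites) + pseudocount * len(alphabet)
    pssm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pssm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in alphabet
        })
    return pssm


def score_windows(seq, pssm):
    """Score every window of the PSSM's width along seq."""
    width = len(pssm)
    return [
        sum(pssm[k][seq[i + k]] for k in range(width))
        for i in range(len(seq) - width + 1)
    ]


if __name__ == "__main__":
    toy_sites = ["TGACT", "TGACT", "TGAGT", "TGACA"]  # hypothetical sites
    scores = score_windows("CCTGACTCC", build_pssm(toy_sites))
    best = scores.index(max(scores))
    print(f"best window starts at {best} with score {scores[best]:.2f}")
```

In a combined model of the kind the abstract describes, a score like this would be one input feature alongside shape-derived values for the same window.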
Affiliation(s)
- Anthony Mathelier
  - Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, 980 West 28th Avenue, Vancouver, BC V5Z 4H4, Canada; Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo and Oslo University Hospital, 0318 Oslo, Norway; Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital Radiumhospitalet, 0372 Oslo, Norway
- Beibei Xin
  - Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Tsu-Pei Chiu
  - Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Lin Yang
  - Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Remo Rohs
  - Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Wyeth W Wasserman
  - Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, 980 West 28th Avenue, Vancouver, BC V5Z 4H4, Canada
47
Guo H, Huo H, Yu Q. SMCis: An Effective Algorithm for Discovery of Cis-Regulatory Modules. PLoS One 2016; 11:e0162968. [PMID: 27637070 PMCID: PMC5026350 DOI: 10.1371/journal.pone.0162968] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 03/17/2016] [Accepted: 08/31/2016] [Indexed: 12/02/2022]
Abstract
The discovery of cis-regulatory modules (CRMs) is a challenging problem in computational biology. Limited by the difficulty of using an HMM to model dependent features in transcriptional regulatory sequences (TRSs), the probabilistic modeling methods based on HMMs cannot accurately represent the distance between regulatory elements in TRSs and are cumbersome to model the prevailing dependencies between motifs within CRMs. We propose a probabilistic modeling algorithm called SMCis, which builds a more powerful CRM discovery model based on a hidden semi-Markov model. Our model characterizes the regulatory structure of CRMs and effectively models dependencies between motifs at a higher level of abstraction based on segments rather than nucleotides. Experimental results on three benchmark datasets indicate that our method performs better than the compared algorithms.
Affiliation(s)
- Haitao Guo
  - School of Computer Science and Technology, Xidian University, Xi’an, Shaanxi, China
- Hongwei Huo
  - School of Computer Science and Technology, Xidian University, Xi’an, Shaanxi, China
- Qiang Yu
  - School of Computer Science and Technology, Xidian University, Xi’an, Shaanxi, China
48
Liu B, Zhang H, Zhou C, Li G, Fennell A, Wang G, Kang Y, Liu Q, Ma Q. An integrative and applicable phylogenetic footprinting framework for cis-regulatory motifs identification in prokaryotic genomes. BMC Genomics 2016; 17:578. [PMID: 27507169 PMCID: PMC4977642 DOI: 10.1186/s12864-016-2982-x] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 04/05/2016] [Accepted: 07/29/2016] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Phylogenetic footprinting is an important computational technique for identifying cis-regulatory motifs in orthologous regulatory regions from multiple genomes, as motifs tend to evolve more slowly than their surrounding non-functional sequences. Its application, however, faces several difficulties in optimizing the selection of orthologous data and reducing false positives in motif prediction.
RESULTS: Here we present an integrative phylogenetic footprinting framework for accurate motif predictions in prokaryotic genomes (MP3). The framework includes a new orthologous data preparation procedure, an additional promoter scoring and pruning method, and an integration of six existing motif finding algorithms as basic motif search engines. Specifically, we collected orthologous genes from available prokaryotic genomes and built the orthologous regulatory regions based on sequence similarity of promoter regions. This procedure made full use of the large-scale genomic data and taxonomy information and filtered out the promoters with limited contribution to produce a high-quality orthologous promoter set. The promoter scoring and pruning is implemented through motif voting by a set of complementary predicting tools that mine as many motif candidates as possible and simultaneously eliminate the effect of random noise. We applied the framework to the Escherichia coli K12 genome and evaluated the prediction performance through comparison with seven existing programs. This evaluation was systematically carried out at the nucleotide and binding site level, and the results showed that MP3 consistently outperformed other popular motif finding tools. We have integrated MP3 into our motif identification and analysis server DMINDA, allowing users to efficiently identify and analyze motifs in 2,072 completely sequenced prokaryotic genomes.
CONCLUSION: The performance evaluation indicated that MP3 is effective for predicting regulatory motifs in prokaryotic genomes. Its application may enhance progress in elucidating transcription regulation mechanisms, and thus benefit the genomic research community and prokaryotic genome researchers in particular.
Affiliation(s)
- Bingqiang Liu
  - School of Mathematics, Shandong University, Jinan, 250100, China
- Hanyuan Zhang
  - Systems Biology and Biomedical Informatics (SBBI) Laboratory, University of Nebraska-Lincoln, Lincoln, NE, 68588-0115, USA
- Chuan Zhou
  - School of Mathematics, Shandong University, Jinan, 250100, China
- Guojun Li
  - School of Mathematics, Shandong University, Jinan, 250100, China
- Anne Fennell
  - Department of Agronomy, Horticulture, and Plant Science, South Dakota State University, Brookings, SD, 57007, USA; BioSNTR, Brookings, SD, USA
- Guanghui Wang
  - School of Mathematics, Shandong University, Jinan, 250100, China
- Yu Kang
  - CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics of CAS, Beijing, 100101, People's Republic of China
- Qi Liu
  - Department of Bioinformatics, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Qin Ma
  - Department of Agronomy, Horticulture, and Plant Science, South Dakota State University, Brookings, SD, 57007, USA; BioSNTR, Brookings, SD, USA
49
Yu Q, Huo H, Zhao R, Feng D, Vitter JS, Huan J. RefSelect: a reference sequence selection algorithm for planted (l, d) motif search. BMC Bioinformatics 2016; 17 Suppl 9:266. [PMID: 27454113 PMCID: PMC4959363 DOI: 10.1186/s12859-016-1130-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 11/10/2022]
Abstract
BACKGROUND: The planted (l, d) motif search (PMS) is an important yet challenging problem in computational biology. Pattern-driven PMS algorithms usually use k out of t input sequences as reference sequences to generate candidate motifs, and they can find all the (l, d) motifs in the input sequences. However, most of them simply take the first k sequences in the input as reference sequences without an elaborate selection process, and thus they may exhibit sharp fluctuations in running time, especially for large alphabets.
RESULTS: In this paper, we formulate the reference sequence selection problem and propose a method named RefSelect to solve it quickly by evaluating the number of candidate motifs for the reference sequences. RefSelect can bring a practical running-time improvement to state-of-the-art pattern-driven PMS algorithms. Experimental results show that RefSelect (1) makes the tested algorithms solve the PMS problem steadily and efficiently, (2) in particular, gives them a speedup of up to about 100× on the protein data, and (3) is also suitable for large data sets that contain hundreds of sequences or more.
CONCLUSIONS: The proposed algorithm RefSelect can be used to address the execution-time instability that many pattern-driven PMS algorithms exhibit. RefSelect requires a small amount of storage space and is capable of selecting reference sequences efficiently and effectively. A parallel version of RefSelect is also provided for handling large data sets.
Affiliation(s)
- Qiang Yu
  - School of Computer Science and Technology, Xidian University, Xi’an, 710071 China
- Hongwei Huo
  - School of Computer Science and Technology, Xidian University, Xi’an, 710071 China
- Ruixing Zhao
  - School of Computer Science and Technology, Xidian University, Xi’an, 710071 China
- Dazheng Feng
  - School of Electronic Engineering, Xidian University, Xi’an, 710071 China
- Jeffrey Scott Vitter
  - Department of Computer and Information Science, The University of Mississippi, Oxford, MS 38677-1848 USA
- Jun Huan
  - Department of Electrical Engineering and Computer Science, the University of Kansas, Lawrence, KS 66045 USA
50
Al-Okaily A, Huang CH. ET-Motif: Solving the Exact (l, d)-Planted Motif Problem Using Error Tree Structure. J Comput Biol 2016; 23:615-23. [PMID: 27152692 DOI: 10.1089/cmb.2015.0238] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022]
Abstract
Motif finding is an important and challenging problem in many biological applications such as discovering promoters, enhancers, locus control regions, transcription factors, and more. The (l, d)-planted motif search, PMS, is one of several variations of the problem. In this problem, there are n given sequences over alphabets of size [Formula: see text], each of length m, and two given integers l and d. The problem is to find a motif m of length l, where in each sequence there is at least one l-mer at a Hamming distance of [Formula: see text] of m. In this article, we propose ET-Motif, an algorithm that can solve the PMS problem in [Formula: see text] time and [Formula: see text] space. The time bound can be further reduced by a factor of m with [Formula: see text] space. In case the suffix tree that is built for the input sequences is balanced, the problem can be solved in [Formula: see text] time and [Formula: see text] space. Similarly, the time bound can be reduced by a factor of m using [Formula: see text] space. Moreover, the variations of the problem, namely the edit distance PMS and edited PMS (Quorum), can be solved using ET-Motif with simple modifications but with higher bounds on space and time. For edit distance PMS, the time and space bounds will be increased by [Formula: see text], while for edited PMS the increase will be of [Formula: see text] in the time bound.
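The problem statement in this abstract can be made concrete with a small exhaustive solver. This is only a sketch of the problem definition, not of ET-Motif or its error tree structure, and it is exponential in l. It assumes the usual formulation in which the elided distance condition is "at most d"; the function names and the DNA alphabet default are likewise assumptions of the example.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(seq: str, motif: str, d: int) -> bool:
    """True if some l-mer of seq lies within Hamming distance d of motif
    (the 'at most d' reading is an assumption of this sketch)."""
    l = len(motif)
    return any(
        hamming(seq[i:i + l], motif) <= d
        for i in range(len(seq) - l + 1)
    )


def planted_motifs(sequences, l: int, d: int, alphabet: str = "ACGT"):
    """Exhaustively test every string of length l over the alphabet and
    keep those with an approximate occurrence in every input sequence."""
    found = []
    for candidate in product(alphabet, repeat=l):
        motif = "".join(candidate)
        if all(occurs_within(s, motif, d) for s in sequences):
            found.append(motif)
    return found


if __name__ == "__main__":
    seqs = ["AACGT", "ACGTT", "TACGA"]
    print(planted_motifs(seqs, l=3, d=0))
```

Enumerating all candidates costs on the order of the alphabet size raised to the power l, multiplied by the input size, which is why practical PMS algorithms rely on pruning or index structures instead.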
Affiliation(s)
- Anas Al-Okaily
  - Computer Science & Engineering Department, University of Connecticut, Storrs, Connecticut
- Chun-Hsi Huang
  - Computer Science & Engineering Department, University of Connecticut, Storrs, Connecticut