1
Yu Q, Zhang X, Hu Y, Chen S, Yang L. A Method for Predicting DNA Motif Length Based On Deep Learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:61-73. [PMID: 35275822 DOI: 10.1109/tcbb.2022.3158471] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
A DNA motif is a sequence pattern shared by the DNA sequence segments that bind to a specific protein. Discovering motifs in a given DNA sequence dataset plays a vital role in studying gene expression regulation. As an important attribute of the DNA motif, the motif length directly affects the quality of the discovered motifs. How to determine the motif length more accurately remains a difficult challenge to be solved. We propose a new motif length prediction scheme named MotifLen by using supervised machine learning. First, a method of constructing sample data for predicting the motif length is proposed. Secondly, a deep learning model for motif length prediction is constructed based on the convolutional neural network. Then, the methods of applying the proposed prediction model based on a motif found by an existing motif discovery algorithm are given. The experimental results show that i) the prediction accuracy of MotifLen is more than 90% on the validation set and is significantly higher than that of the compared methods on real datasets, ii) MotifLen can successfully optimize the motifs found by the existing motif discovery algorithms, and iii) it can effectively improve the time performance of some existing motif discovery algorithms.
2
Delos Santos NP, Duttke S, Heinz S, Benner C. MEPP: more transparent motif enrichment by profiling positional correlations. NAR Genom Bioinform 2022; 4:lqac075. [PMID: 36267125 PMCID: PMC9575187 DOI: 10.1093/nargab/lqac075] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/06/2022] [Revised: 08/18/2022] [Accepted: 09/23/2022] [Indexed: 11/11/2022] Open
Abstract
Score-based motif enrichment analysis (MEA) is typically applied to regulatory DNA to infer transcription factors (TFs) that may modulate transcription and chromatin state in different conditions. Most MEA methods determine motif enrichment independent of motif position within a sequence, even when those sequences harbor anchor points that motifs and their bound TFs may functionally interact with in a distance-dependent fashion, such as other TF binding motifs, transcription start sites (TSS), sequencing assay cleavage sites, or other biologically meaningful features. We developed motif enrichment positional profiling (MEPP), a novel MEA method that outputs a positional enrichment profile of a given TF's binding motif relative to key anchor points (e.g. transcription start sites, or other motifs) within the analyzed sequences while accounting for lower-order nucleotide bias. Using transcription initiation and TF binding as test cases, we demonstrate MEPP's utility in determining the sequence positions where motif presence correlates with measures of biological activity, inferring positional dependencies of binding site function. We demonstrate how MEPP can be applied to interpretation and hypothesis generation from experiments that quantify transcription initiation, chromatin structure, or TF binding measurements. MEPP is available for download from https://github.com/npdeloss/mepp.
Affiliation(s)
- Nathaniel P Delos Santos
  - Department of Biomedical Informatics, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0634, USA
- Sascha Duttke
  - School of Molecular Biosciences, College of Veterinary Medicine, Washington State University, Pullman, WA, USA
- Sven Heinz
  - Department of Medicine, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0634, USA
- Christopher Benner
  - Department of Medicine, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0634, USA
3
Heide T, Househam J, Cresswell GD, Spiteri I, Lynn C, Mossner M, Kimberley C, Fernandez-Mateos J, Chen B, Zapata L, James C, Barozzi I, Chkhaidze K, Nichol D, Gunasri V, Berner A, Schmidt M, Lakatos E, Baker AM, Costa H, Mitchinson M, Piazza R, Jansen M, Caravagna G, Ramazzotti D, Shibata D, Bridgewater J, Rodriguez-Justo M, Magnani L, Graham TA, Sottoriva A. The co-evolution of the genome and epigenome in colorectal cancer. Nature 2022; 611:733-743. [PMID: 36289335 PMCID: PMC9684080 DOI: 10.1038/s41586-022-05202-1] [Citation(s) in RCA: 52] [Impact Index Per Article: 17.3] [Received: 08/09/2021] [Accepted: 08/05/2022] [Indexed: 12/13/2022]
Abstract
Colorectal malignancies are a leading cause of cancer-related death [1] and have undergone extensive genomic study [2,3]. However, DNA mutations alone do not fully explain malignant transformation [4-7]. Here we investigate the co-evolution of the genome and epigenome of colorectal tumours at single-clone resolution using spatial multi-omic profiling of individual glands. We collected 1,370 samples from 30 primary cancers and 8 concomitant adenomas and generated 1,207 chromatin accessibility profiles, 527 whole genomes and 297 whole transcriptomes. We found positive selection for DNA mutations in chromatin modifier genes and recurrent somatic chromatin accessibility alterations, including in regulatory regions of cancer driver genes that were otherwise devoid of genetic mutations. Genome-wide alterations in accessibility for transcription factor binding involved CTCF, downregulation of interferon and increased accessibility for SOX and HOX transcription factor families, suggesting the involvement of developmental genes during tumourigenesis. Somatic chromatin accessibility alterations were heritable and distinguished adenomas from cancers. Mutational signature analysis showed that the epigenome in turn influences the accumulation of DNA mutations. This study provides a map of genetic and epigenetic tumour heterogeneity, with fundamental implications for understanding colorectal cancer biology.
Affiliation(s)
- Timon Heide
  - Centre for Evolution and Cancer, The Institute of Cancer Research, London, UK
  - Computational Biology Research Centre, Human Technopole, Milan, Italy
- Jacob Househam
  - Centre for Evolution and Cancer, The Institute of Cancer Research, London, UK
  - Evolution and Cancer Lab, Centre for Genomics and Computational Biology, Barts Cancer Institute, Queen Mary University of London, London, UK
- George D Cresswell
  - Centre for Evolution and Cancer, The Institute of Cancer Research, London, UK
- Inmaculada Spiteri
  - Centre for Evolution and Cancer, The Institute of Cancer Research, London, UK
- Claire Lynn
  - Centre for Evolution and Cancer, The Institute of Cancer Research, London, UK
- Maximilian Mossner
  - Centre for Evolution and Cancer, The Institute of Cancer Research, London, UK
  - Evolution and Cancer Lab, Centre for Genomics and Computational Biology, Barts Cancer Institute, Queen Mary University of London, London, UK
- Chris Kimberley
  - Evolution and Cancer Lab, Centre for Genomics and Computational Biology, Barts Cancer Institute, Queen Mary University of London, London, UK
- Bingjie Chen
  - Centre for Evolution and Cancer, The Institute of Cancer Research, London, UK
- Luis Zapata
  - Centre for Evolution and Cancer, The Institute of Cancer Research, London, UK
- Chela James
  - Centre for Evolution and Cancer, The Institute of Cancer Research, London, UK
- Iros Barozzi
  - Department of Surgery and Cancer, Imperial College London, London, UK
  - Centre for Cancer Research, Medical University of Vienna, Vienna, Austria
- Ketevan Chkhaidze
  - Centre for Evolution and Cancer, The Institute of Cancer Research, London, UK
- Daniel Nichol
  - Centre for Evolution and Cancer, The Institute of Cancer Research, London, UK
- Vinaya Gunasri
  - Centre for Evolution and Cancer, The Institute of Cancer Research, London, UK
  - Evolution and Cancer Lab, Centre for Genomics and Computational Biology, Barts Cancer Institute, Queen Mary University of London, London, UK
- Alison Berner
  - Evolution and Cancer Lab, Centre for Genomics and Computational Biology, Barts Cancer Institute, Queen Mary University of London, London, UK
- Melissa Schmidt
  - Evolution and Cancer Lab, Centre for Genomics and Computational Biology, Barts Cancer Institute, Queen Mary University of London, London, UK
- Eszter Lakatos
  - Centre for Evolution and Cancer, The Institute of Cancer Research, London, UK
  - Evolution and Cancer Lab, Centre for Genomics and Computational Biology, Barts Cancer Institute, Queen Mary University of London, London, UK
- Ann-Marie Baker
  - Centre for Evolution and Cancer, The Institute of Cancer Research, London, UK
  - Evolution and Cancer Lab, Centre for Genomics and Computational Biology, Barts Cancer Institute, Queen Mary University of London, London, UK
- Helena Costa
  - Department of Pathology, UCL Cancer Institute, University College London, London, UK
- Miriam Mitchinson
  - Department of Pathology, UCL Cancer Institute, University College London, London, UK
- Rocco Piazza
  - Department of Medicine and Surgery, University of Milano-Bicocca, Milan, Italy
- Marnix Jansen
  - Department of Pathology, UCL Cancer Institute, University College London, London, UK
- Giulio Caravagna
  - Centre for Evolution and Cancer, The Institute of Cancer Research, London, UK
  - Department of Mathematics and Geosciences, University of Triest, Triest, Italy
- Daniele Ramazzotti
  - Department of Medicine and Surgery, University of Milano-Bicocca, Milan, Italy
- Darryl Shibata
  - Department of Pathology, University of Southern California Keck School of Medicine, Los Angeles, CA, USA
- Luca Magnani
  - Department of Surgery and Cancer, Imperial College London, London, UK
- Trevor A Graham
  - Centre for Evolution and Cancer, The Institute of Cancer Research, London, UK
  - Evolution and Cancer Lab, Centre for Genomics and Computational Biology, Barts Cancer Institute, Queen Mary University of London, London, UK
- Andrea Sottoriva
  - Centre for Evolution and Cancer, The Institute of Cancer Research, London, UK
  - Computational Biology Research Centre, Human Technopole, Milan, Italy
4
Fostier J. BLAMM: BLAS-based algorithm for finding position weight matrix occurrences in DNA sequences on CPUs and GPUs. BMC Bioinformatics 2020; 21:81. [PMID: 32164557 PMCID: PMC7068855 DOI: 10.1186/s12859-020-3348-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND: The identification of all matches of a large set of position weight matrices (PWMs) in long DNA sequences requires significant computational resources for which a number of efficient yet complex algorithms have been proposed. RESULTS: We propose BLAMM, a simple and efficient tool inspired by high performance computing techniques. The workload is expressed in terms of matrix-matrix products that are evaluated with high efficiency using optimized BLAS library implementations. The algorithm is easy to parallelize and implement on CPUs and GPUs and has a runtime that is independent of the selected p-value. In terms of single-core performance, it is competitive with state-of-the-art software for PWM matching while being much more efficient when using multithreading. Additionally, BLAMM requires negligible memory. For example, both strands of the entire human genome can be scanned for 1404 PWMs in the JASPAR database in 13 min with a p-value of 10^-4 using a 36-core machine. On a dual GPU system, the same task can be performed in under 5 min. CONCLUSIONS: BLAMM is an efficient tool for identifying PWM matches in large DNA sequences. Its C++ source code is available under the GNU General Public License Version 3 at https://github.com/biointec/blamm.
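To illustrate the idea of expressing PWM scanning as a matrix-matrix product, the following is a minimal sketch and not BLAMM's implementation. It assumes all PWMs share one length, ignores the reverse strand and p-value thresholds, and uses a plain-Python product where a real tool would call an optimized BLAS routine. The function names (`one_hot_windows`, `flatten_pwm`, `scan`) are hypothetical.

```python
# Hypothetical sketch: PWM scanning written as one matrix-matrix product.
BASES = "ACGT"

def one_hot_windows(seq, m):
    """One row per length-m window; each row is a flattened 4*m one-hot vector."""
    rows = []
    for start in range(len(seq) - m + 1):
        row = [0.0] * (4 * m)
        for offset, base in enumerate(seq[start:start + m]):
            row[4 * offset + BASES.index(base)] = 1.0
        rows.append(row)
    return rows

def flatten_pwm(pwm):
    """Flatten a PWM given as a list of per-position [A, C, G, T] scores."""
    return [score for position in pwm for score in position]

def matmul(a, b):
    """Plain matrix product; an optimized BLAS GEMM would replace this in practice."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def scan(seq, pwms):
    """Score every window of seq against every PWM (assumes equal PWM lengths)."""
    m = len(pwms[0])
    windows = one_hot_windows(seq, m)  # shape: (n_windows, 4*m)
    # Each column of pwm_matrix is one flattened PWM; shape: (4*m, n_pwms)
    pwm_matrix = [list(col) for col in zip(*(flatten_pwm(p) for p in pwms))]
    return matmul(windows, pwm_matrix)  # shape: (n_windows, n_pwms)
```

Each entry of the result is the score of one window against one PWM, so thresholding the product matrix yields the match positions. Because the amount of arithmetic does not depend on the threshold, this formulation is consistent with the abstract's remark that the runtime is independent of the selected p-value.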
Affiliation(s)
- Jan Fostier
  - Department of Information Technology - IDLab, Ghent University - imec, Technologiepark 126, Ghent (Zwijnaarde), B-9052, Belgium
5
Abstract
Background: Spaced seeds, i.e. patterns in which some fixed positions are allowed to be wild-cards, play a crucial role in several bioinformatics applications involving substring counting and indexing, often providing better sensitivity than k-mer-based approaches. K-mer-based approaches are usually fast, being based on efficient hashing and indexing that exploits the large overlap between consecutive k-mers. Spaced-seed hashing is not as straightforward, and it is usually computed from scratch for each position in the input sequence. Recently, the FSH (Fast Spaced seed Hashing) approach was proposed to improve the time required for computation of the spaced seed hashing of DNA sequences, with a speed-up of about 1.5 with respect to standard hashing computation. Results: In this work we propose a novel algorithm, Fast Indexing for Spaced seed Hashing (FISH), based on the indexing of small blocks that can be combined to obtain the hashing of spaced seeds of any length. The method exploits the fast computation of the hashing of runs of consecutive 1s in the spaced seeds, which basically correspond to k-mers of the length of the run. Conclusions: We ran several experiments, on NGS data from simulated and synthetic metagenomic experiments, to assess the time required for the computation of the hashing for each position in each read with respect to several spaced seeds. In our experiments, FISH can compute the hashing values of spaced seeds with a speedup, with respect to the traditional approach, between 1.9x and 6.03x, depending on the structure of the spaced seeds. Electronic supplementary material: The online version of this article (10.1186/s12859-018-2415-8) contains supplementary material, which is available to authorized users.
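The contrast drawn in this abstract, between k-mer hashes that can be updated from one position to the next and spaced-seed hashes recomputed from scratch, can be sketched as follows. This is an illustrative baseline only, not the FSH or FISH algorithm; the 2-bit encoding and the function names are assumptions made for the example.

```python
# Hypothetical sketch: baseline spaced-seed hashing versus rolling k-mer hashing.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit nucleotide encoding

def spaced_seed_hashes(seq, seed):
    """Hash each window from scratch, using only positions where seed has '1'."""
    care = [i for i, c in enumerate(seed) if c == "1"]
    hashes = []
    for start in range(len(seq) - len(seed) + 1):
        h = 0
        for i in care:
            h = (h << 2) | ENCODE[seq[start + i]]
        hashes.append(h)
    return hashes

def kmer_hashes(seq, k):
    """Rolling hash: each k-mer hash is derived from the previous one in O(1)."""
    mask = (1 << (2 * k)) - 1  # keep only the last k bases (2 bits each)
    h = 0
    hashes = []
    for i, base in enumerate(seq):
        h = ((h << 2) | ENCODE[base]) & mask
        if i >= k - 1:
            hashes.append(h)
    return hashes
```

A seed made only of 1s is an ordinary k-mer, so `spaced_seed_hashes(seq, "111")` and `kmer_hashes(seq, 3)` agree. That observation is the starting point for reusing the hashes of runs of consecutive 1s, as the abstract describes.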
Affiliation(s)
- Samuele Girotto
  - Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy
- Matteo Comin
  - Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy
- Cinzia Pizzi
  - Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy
6
A protein activity assay to measure global transcription factor activity reveals determinants of chromatin accessibility. Nat Biotechnol 2018; 36:521-529. [DOI: 10.1038/nbt.4138] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 12/08/2017] [Accepted: 03/23/2018] [Indexed: 12/29/2022]
7
Zhu L, Zhang HB, Huang DS. LMMO: A Large Margin Approach for Refining Regulatory Motifs. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:913-925. [PMID: 28391205 DOI: 10.1109/tcbb.2017.2691325] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 06/07/2023]
Abstract
Although discriminative motif discovery (DMD) methods are promising for eliciting motifs from high-throughput experimental data, they usually have to sacrifice accuracy and may fail to fully leverage the potential of large datasets. Recently, it has been demonstrated that the motifs identified by DMD methods can be significantly improved by maximizing the area under the receiver operating characteristic curve (AUC), a metric that has been widely used in the literature to rank the performance of elicited motifs. However, existing approaches for motif refinement choose to directly maximize the non-convex and discontinuous AUC itself, which is known to be difficult and may lead to suboptimal solutions. In this paper, we propose Large Margin Motif Optimizer (LMMO), a large-margin-type algorithm for refining regulatory motifs. By relaxing the AUC cost function with the surrogate convex hinge loss, we show that the resultant learning problem can be cast as an instance of difference-of-convex (DC) programs, and solve it iteratively using the constrained concave-convex procedure (CCCP). To further save computational time, we combine LMMO with existing techniques for improving the scalability of large-margin-type algorithms, such as the cutting plane method. Experimental evaluations on synthetic and real data illustrate the performance of the proposed approach. The code of LMMO is freely available at: https://github.com/ekffar/LMMO.
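As a rough illustration of why a hinge surrogate is easier to optimize than the AUC itself, the sketch below computes the empirical AUC over positive/negative score pairs alongside a pairwise hinge loss with unit margin. This is a generic textbook relaxation, assumed here for illustration; the exact objective, constraints, and CCCP solver used by LMMO are not reproduced, and both function names are hypothetical.

```python
# Hypothetical sketch: empirical AUC and a pairwise hinge surrogate.

def empirical_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos_scores) * len(neg_scores))

def pairwise_hinge_loss(pos_scores, neg_scores, margin=1.0):
    """Mean of max(0, margin - (p - n)) over all pairs.

    The step function inside the AUC is replaced by a convex, piecewise-linear
    penalty, so small score changes produce small loss changes.
    """
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            total += max(0.0, margin - (p - n))
    return total / (len(pos_scores) * len(neg_scores))
```

With positive scores `[2.0, 0.5]` and negative scores `[0.0, 1.0]`, three of the four pairs are ordered correctly, giving an AUC of 0.75, while the hinge loss of 0.5 also penalizes the correctly ordered pair whose score gap is smaller than the margin.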
8
Pizzi C, Ornamenti M, Spangaro S, Rombo SE, Parida L. Efficient Algorithms for Sequence Analysis with Entropic Profiles. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:117-128. [PMID: 28113780 DOI: 10.1109/tcbb.2016.2620143] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 06/06/2023]
Abstract
Entropy, being closely related to repetitiveness and compressibility, is a widely used information-related measure to assess the degree of predictability of a sequence. Entropic profiles are based on information theory principles, and can be used to study the under-/over-representation of subwords, while also providing information about the scale of conserved DNA regions. Here, we focus on the algorithmic aspects related to entropic profiles. In particular, we propose linear time algorithms for their computation that rely on suffix-based data structures, more specifically on the truncated suffix tree (TST) and on the enhanced suffix array (ESA). We performed an extensive experimental campaign showing that our algorithms, besides being faster than state-of-the-art algorithms, make it possible to analyze longer sequences, even at high degrees of resolution.
9
Korhonen JH, Palin K, Taipale J, Ukkonen E. Fast motif matching revisited: high-order PWMs, SNPs and indels. Bioinformatics 2017; 33:514-521. [PMID: 28011774 DOI: 10.1093/bioinformatics/btw683] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 06/22/2016] [Accepted: 10/27/2016] [Indexed: 01/09/2023] Open
Abstract
Motivation: While the position weight matrix (PWM) is the most popular model for sequence motifs, there is growing evidence of the usefulness of more advanced models such as first-order Markov representations, and such models are also becoming available in well-known motif databases. There has been much research on how to learn these models from training data, but the problem of predicting putative sites of the learned motifs by matching the model against new sequences has been given less attention. Moreover, motif site analysis is often concerned with how different variants in the sequence affect the sites. So far, though, the corresponding efficient software tools for motif matching have been lacking. Results: We develop fast motif matching algorithms for the aforementioned tasks. First, we formalize a framework based on high-order position weight matrices for generic representation of motif models with dinucleotide or general q-mer dependencies, and adapt fast PWM matching algorithms to the high-order PWM framework. Second, we show how to incorporate different types of sequence variants, such as SNPs and indels, and their combined effects into efficient PWM matching workflows. Benchmark results show that our algorithms perform well in practice on genome-sized sequence sets and are, for multiple motif search, much faster than the basic sliding window algorithm. Availability and Implementation: Implementations are available as a part of the MOODS software package under the GNU General Public License v3.0 and the Biopython license (http://www.cs.helsinki.fi/group/pssmfind). Contact: janne.h.korhonen@gmail.com.
Affiliation(s)
- Janne H Korhonen
  - School of Computer Science, Reykjavík University, Reykjavík, Iceland
  - Helsinki Institute for Information Technology HIIT, Helsinki, Finland
  - Department of Computer Science
- Kimmo Palin
  - Genome-Scale Biology Research Program, Research Programs Unit
- Jussi Taipale
  - Department of Biosciences and Nutrition, Karolinska Institutet, Genome Scale Biology Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland
- Esko Ukkonen
  - Helsinki Institute for Information Technology HIIT, Helsinki, Finland
  - Department of Computer Science
10
Colbran LL, Chen L, Capra JA. Short DNA sequence patterns accurately identify broadly active human enhancers. BMC Genomics 2017; 18:536. [PMID: 28716036 PMCID: PMC5512948 DOI: 10.1186/s12864-017-3934-9] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 12/05/2016] [Accepted: 07/09/2017] [Indexed: 12/25/2022] Open
Abstract
Background Enhancers are DNA regulatory elements that influence gene expression. There is substantial diversity in enhancers’ activity patterns: some enhancers drive expression in a single cellular context, while others are active across many. Sequence characteristics, such as transcription factor (TF) binding motifs, influence the activity patterns of regulatory sequences; however, the regulatory logic through which specific sequences drive enhancer activity patterns is poorly understood. Recent analysis of Drosophila enhancers suggested that short dinucleotide repeat motifs (DRMs) are general enhancer sequence features that drive broad regulatory activity. However, it is not known whether the regulatory role of DRMs is conserved across species. Results We performed a comprehensive analysis of the relationship between short DNA sequence patterns, including DRMs, and human enhancer activity in 38,538 enhancers across 411 different contexts. In a machine-learning framework, the occurrence patterns of short sequence motifs accurately predicted broadly active human enhancers. However, DRMs alone were weakly predictive of broad enhancer activity in humans and showed different enrichment patterns than in Drosophila. In general, GC-rich sequence motifs were significantly associated with broad enhancer activity, and consistent with this enrichment, broadly active human TFs recognize GC-rich motifs. Conclusions Our results reveal the importance of specific sequence motifs in broadly active human enhancers, demonstrate the lack of evolutionary conservation of the role of DRMs, and provide a computational framework for investigating the logic of enhancer sequences. Electronic supplementary material The online version of this article (doi:10.1186/s12864-017-3934-9) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Laura L Colbran
  - Vanderbilt Genetics Institute, Vanderbilt University, Nashville, TN, 37235, USA
- Ling Chen
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, 37235, USA
- John A Capra
  - Vanderbilt Genetics Institute, Vanderbilt University, Nashville, TN, 37235, USA
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, 37235, USA
  - Center for Structural Biology, Departments of Biomedical Informatics and Computer Science, Vanderbilt University, Nashville, TN, 37235, USA
11
Quorum Sensing Regulators Are Required for Metabolic Fitness in Vibrio parahaemolyticus. Infect Immun 2017; 85:IAI.00930-16. [PMID: 28069817 DOI: 10.1128/iai.00930-16] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 11/08/2016] [Accepted: 12/20/2016] [Indexed: 12/13/2022] Open
Abstract
Quorum sensing (QS) is a process by which bacteria alter gene expression in response to cell density changes. In Vibrio species, at low cell density, the sigma 54-dependent response regulator LuxO is active and regulates the two QS master regulators AphA, which is induced, and OpaR, which is repressed. At high cell density the opposite occurs: LuxO is inactive, and therefore OpaR is induced while AphA is repressed. In Vibrio parahaemolyticus, a significant enteric pathogen of humans, the roles of these regulators in pathogenesis are less known. We examined deletion mutants of luxO, opaR, and aphA for in vivo fitness using an adult mouse model. We found that the luxO and aphA mutants were defective in colonization compared to levels in the wild type. The opaR mutant did not show any defect in vivo. Colonization was restored to wild-type levels in a luxO opaR double mutant and was also increased in an opaR aphA double mutant. These data suggest that AphA is important and that overexpression of opaR is detrimental to in vivo fitness. Transcriptome sequencing (RNA-Seq) analysis of the wild type and luxO mutant grown in mouse intestinal mucus showed that 60% of the genes that were downregulated in the luxO mutant were involved in amino acid and sugar transport and metabolism. These data suggest that the luxO mutant has a metabolic disadvantage, which was confirmed by growth pattern analysis using phenotype microarrays. Bioinformatics analysis revealed OpaR binding sites in the regulatory region of 55 carbon transporter and metabolism genes. Biochemical analysis of five representatives of these regulatory regions demonstrated direct binding of OpaR in all five tested. These data demonstrate the role of OpaR in carbon utilization and metabolic fitness, an overlooked role in the QS regulon.
12
Jain S, Bader GD. Predicting physiologically relevant SH3 domain mediated protein-protein interactions in yeast. Bioinformatics 2016; 32:1865-72. [PMID: 26861823 PMCID: PMC4908317 DOI: 10.1093/bioinformatics/btw045] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 07/29/2015] [Revised: 12/05/2015] [Accepted: 01/20/2016] [Indexed: 12/02/2022] Open
Abstract
MOTIVATION: Many intracellular signaling processes are mediated by interactions involving peptide recognition modules such as SH3 domains. These domains bind to small, linear protein sequence motifs which can be identified using high-throughput experimental screens such as phage display. Binding motif patterns can then be used to computationally predict protein interactions mediated by these domains. While many protein-protein interaction prediction methods exist, most do not work with peptide recognition module mediated interactions or do not consider many of the known constraints governing physiologically relevant interactions between two proteins. RESULTS: A novel method for predicting physiologically relevant SH3 domain-peptide mediated protein-protein interactions in S. cerevisiae using phage display data is presented. Like some previous similar methods, this method uses position weight matrix models of protein linear motif preference for individual SH3 domains to scan the proteome for potential hits and then filters these hits using a range of evidence sources related to sequence-based and cellular constraints on protein interactions. The novelty of this approach is the large number of evidence sources used and the method of combination of sequence based and protein pair based evidence sources. By combining different peptide and protein features using multiple Bayesian models we are able to predict high confidence interactions with an overall accuracy of 0.97. AVAILABILITY AND IMPLEMENTATION: Domain-Motif Mediated Interaction Prediction (DoMo-Pred) command line tool and all relevant datasets are available under GNU LGPL license for download from http://www.baderlab.org/Software/DoMo-Pred. The DoMo-Pred command line tool is implemented using Python 2.7 and C++. CONTACT: gary.bader@utoronto.ca. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Shobhit Jain
  - Department of Computer Science and The Donnelly Centre, University of Toronto, Toronto, ON, Canada
- Gary D Bader
  - Department of Computer Science and The Donnelly Centre, University of Toronto, Toronto, ON, Canada
13
Kleftogiannis D, Kalnis P, Bajic VB. DEEP: a general computational framework for predicting enhancers. Nucleic Acids Res 2014; 43:e6. [PMID: 25378307 PMCID: PMC4288148 DOI: 10.1093/nar/gku1058] [Citation(s) in RCA: 102] [Impact Index Per Article: 9.3] [Indexed: 12/03/2022] Open
Abstract
Transcription regulation in multicellular eukaryotes is orchestrated by a number of DNA functional elements located at gene regulatory regions. Some regulatory regions (e.g. enhancers) are located far away from the gene they affect. Identification of distal regulatory elements is a challenge for bioinformatics research. Although existing methodologies have increased the number of computationally predicted enhancers, performance inconsistency of computational models across different cell lines, class imbalance within the learning sets, and ad hoc rules for selecting enhancer candidates for supervised learning are some key issues that require further examination. In this study we developed DEEP, a novel ensemble prediction framework. DEEP integrates three components with diverse characteristics that streamline the analysis of enhancers' properties in a great variety of cellular conditions. In our method we train many individual classification models that we combine to classify DNA regions as enhancers or non-enhancers. DEEP uses features derived from histone modification marks or attributes coming from sequence characteristics. Experimental results indicate that DEEP performs better than four state-of-the-art methods on the ENCODE data. We report the first computational enhancer prediction results on FANTOM5 data, where DEEP achieves 90.2% accuracy and 90% geometric mean (GM) of specificity and sensitivity across 36 different tissues. We further present results derived using in vivo-derived enhancer data from the VISTA database. DEEP-VISTA, when tested on an independent test set, achieved a GM of 80.1% and an accuracy of 89.64%. The DEEP framework is publicly available at http://cbrc.kaust.edu.sa/deep/.
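For readers unfamiliar with the geometric mean (GM) of specificity and sensitivity reported above, the following minimal sketch shows how it can be computed from confusion-matrix counts, alongside plain accuracy. The counts in the usage note are invented for illustration and are not taken from the paper; the function name is hypothetical.

```python
# Hypothetical sketch: accuracy and geometric mean (GM) from confusion counts.
import math

def accuracy_and_gm(tp, fn, tn, fp):
    """Return (accuracy, GM), where GM = sqrt(sensitivity * specificity)."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, math.sqrt(sensitivity * specificity)
```

Unlike accuracy, the GM collapses to zero when either class is entirely misclassified, which is one reason it is often reported for class-imbalanced problems such as the one the abstract mentions. For example, `accuracy_and_gm(tp=0, fn=10, tn=990, fp=0)` gives an accuracy of 0.99 but a GM of 0.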
Affiliation(s)
- Dimitrios Kleftogiannis
  - Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Panos Kalnis
  - Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Vladimir B Bajic
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
14
Giaquinta E, Grabowski S, Ukkonen E. Fast matching of transcription factor motifs using generalized position weight matrix models. J Comput Biol 2013; 20:621-30. [PMID: 23919388 DOI: 10.1089/cmb.2012.0289] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/12/2022] Open
Abstract
The problem of finding the locations in DNA sequences that match a given motif describing the binding specificities of a transcription factor (TF) has many applications in computational biology. This problem has been extensively studied when the position weight matrix (PWM) model is used to represent motifs. We investigate it under the feature motif model, a generalization of the PWM model that does not assume independence between positions in the pattern while being compatible with the original PWM. We present a new method for finding the binding sites of a transcription factor in a DNA sequence when the feature motif model is used to describe transcription factor binding specificities. The experimental results on random and real data show that the search algorithm is fast in practice.
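To illustrate the idea of a motif model that generalizes the PWM by allowing dependencies between positions, here is a minimal sketch. It assumes a simplified reading of the feature motif model: each feature is a set of (offset, nucleotide) constraints with a weight, and a window's score is the sum of the weights of the features it satisfies. Single-position features reproduce an ordinary PWM. The names `score_window` and `find_matches`, the weights, and the sequence are hypothetical. The scan is a naive one, not the fast search algorithm the authors present.

```python
from typing import Dict, List, Tuple

# A feature: constraints mapping window offset -> required nucleotide, plus a weight.
Feature = Tuple[Dict[int, str], float]


def score_window(window: str, features: List[Feature]) -> float:
    """Sum the weights of all features whose constraints hold in the window."""
    total = 0.0
    for constraints, weight in features:
        if all(window[pos] == nt for pos, nt in constraints.items()):
            total += weight
    return total


def find_matches(sequence: str, features: List[Feature],
                 width: int, threshold: float) -> List[Tuple[int, float]]:
    """Naively score every window and keep those at or above the threshold."""
    hits = []
    for start in range(len(sequence) - width + 1):
        score = score_window(sequence[start:start + width], features)
        if score >= threshold:
            hits.append((start, score))
    return hits


# Hypothetical width-3 motif.
features = [
    ({0: "A"}, 1.0),          # single-position features, as in a PWM
    ({1: "C"}, 1.0),
    ({2: "G"}, 1.0),
    ({0: "A", 2: "T"}, 1.5),  # a feature coupling positions 0 and 2
]
hits = find_matches("TTACGAATT", features, width=3, threshold=2.5)
print(hits)
```

The last feature rewards an A at offset 0 only in combination with a T at offset 2, which a PWM with independent columns cannot express.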
Affiliation(s)
- Emanuele Giaquinta
- Department of Computer Science, University of Helsinki, Helsinki, Finland.
|
15
|
Billings T, Parvanov ED, Baker CL, Walker M, Paigen K, Petkov PM. DNA binding specificities of the long zinc-finger recombination protein PRDM9. Genome Biol 2013; 14:R35. [PMID: 23618393 PMCID: PMC4053984 DOI: 10.1186/gb-2013-14-4-r35] [Citation(s) in RCA: 61] [Impact Index Per Article: 5.1] [Received: 02/25/2013] [Accepted: 04/24/2013] [Indexed: 12/13/2022]
Abstract
Background: Meiotic recombination ensures proper segregation of homologous chromosomes and creates genetic variation. In many organisms, recombination occurs at limited sites, termed 'hotspots', whose positions in mammals are determined by PR domain member 9 (PRDM9), a long-array zinc-finger and chromatin-modifier protein. Determining the rules governing the DNA binding of PRDM9 is a major issue in understanding how it functions. Results: Mouse PRDM9 protein variants bind to hotspot DNA sequences in a manner that is specific for both PRDM9 and DNA haplotypes, and in vitro binding parallels its in vivo biological activity. Examining four hotspots, three activated by Prdm9Cst and one activated by Prdm9Dom2, we found that all binding sites required the full array of 11 or 12 contiguous fingers, depending on the allele, and that there was little sequence similarity between the binding sites of the three Prdm9Cst-activated hotspots. The binding specificity of each position in the Hlx1 binding site, activated by Prdm9Cst, was tested by mutating each nucleotide to its three alternatives. The 31 positions along the binding site varied considerably in the ability of alternative bases to support binding, which also suggests a role for additional binding to the DNA phosphate backbone. Conclusions: These results, which provide the first detailed mapping of PRDM9 binding to DNA and, to our knowledge, the most detailed analysis yet of DNA binding by a long zinc-finger array, make clear that the binding specificities of PRDM9, and possibly other long-array zinc-finger proteins, are unusually complex.
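The mutational scan described in the Results (each nucleotide of a 31-position site replaced by its three alternatives) can be sketched as a simple enumeration. This is only an illustration of the design: the function name is hypothetical, and the sequence below is an arbitrary 31-nt placeholder, not the actual Hlx1 binding site.

```python
NUCLEOTIDES = "ACGT"


def single_substitutions(site):
    """List every single-nucleotide variant as (position, ref, alt, sequence)."""
    variants = []
    for pos, ref in enumerate(site):
        for alt in NUCLEOTIDES:
            if alt != ref:
                mutated = site[:pos] + alt + site[pos + 1:]
                variants.append((pos, ref, alt, mutated))
    return variants


# Placeholder 31-nt sequence (NOT the real Hlx1 site).
site = "ACGT" * 7 + "ACG"
variants = single_substitutions(site)
print(len(site), "positions,", len(variants), "single-nucleotide variants")
```

For a 31-position site this gives 31 × 3 = 93 variants, each differing from the reference at exactly one position.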
|
16
|
Korhonen J, Martinmäki P, Pizzi C, Rastas P, Ukkonen E. MOODS: fast search for position weight matrix matches in DNA sequences. Bioinformatics 2009; 25:3181-2. [PMID: 19773334 PMCID: PMC2778336 DOI: 10.1093/bioinformatics/btp554] [Citation(s) in RCA: 106] [Impact Index Per Article: 6.6] [Indexed: 11/13/2022]
Abstract
MOODS (MOtif Occurrence Detection Suite) is a software package for matching position weight matrices against DNA sequences. MOODS implements state-of-the-art online matching algorithms, achieving considerably faster scanning than a simple brute-force search. MOODS is written in C++, with bindings for the popular BioPerl and Biopython toolkits. It can easily be adapted for different purposes and integrated into existing workflows. It can also be used as a C++ library. Availability: The package, with documentation and usage examples, is available at http://www.cs.helsinki.fi/group/pssmfind. The source code is also available under the terms of the GNU General Public License (GPL).
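For context, the "simple brute-force search" that the abstract uses as a baseline can be sketched as follows. This is not the MOODS API or its algorithms. It is a plain-Python illustration that assumes a log-odds PWM built from hypothetical counts, with a uniform background and a pseudocount, and an input sequence containing only A, C, G and T.

```python
import math


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Turn per-position nucleotide counts into log2-odds scores."""
    matrix = []
    for column in counts:
        total = sum(column.get(nt, 0) for nt in "ACGT") + 4 * pseudocount
        matrix.append({
            nt: math.log2(((column.get(nt, 0) + pseudocount) / total) / background)
            for nt in "ACGT"
        })
    return matrix


def brute_force_scan(sequence, matrix, threshold):
    """Score every window of the sequence: O(len(sequence) * motif width)."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = sum(matrix[i][sequence[start + i]] for i in range(width))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Hypothetical counts for a width-3 motif that strongly prefers "ACG".
counts = [{"A": 8}, {"C": 8}, {"G": 8}]
matrix = log_odds_matrix(counts)
hits = brute_force_scan("TTACGTT", matrix, threshold=3.0)
print(hits)
```

Every window is scored in full, even when an early position already rules out a match. Avoiding that redundant work is the kind of saving that faster matching algorithms aim for.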
Affiliation(s)
- Janne Korhonen
- Department of Computer Science and Helsinki Institute for Information Technology, University of Helsinki, Helsinki, Finland.
|