1
Li Y, Wang Y, Wang C, Ma A, Ma Q, Liu B. A weighted two-stage sequence alignment framework to identify motifs from ChIP-exo data. Patterns (N Y) 2024; 5:100927. [PMID: 38487805 PMCID: PMC10935504 DOI: 10.1016/j.patter.2024.100927] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2023] [Revised: 08/18/2023] [Accepted: 01/10/2024] [Indexed: 03/17/2024]
Abstract
In this study, we introduce TESA (weighted two-stage alignment), an innovative motif prediction tool that refines the identification of DNA-binding protein motifs, essential for deciphering transcriptional regulatory mechanisms. Unlike traditional algorithms that rely solely on sequence data, TESA integrates the high-resolution chromatin immunoprecipitation (ChIP) signal, specifically from ChIP-exonuclease (ChIP-exo), by assigning weights to sequence positions, thereby enhancing motif discovery. TESA employs a nuanced approach combining a binomial distribution model with a graph model, further supported by a "bookend" model, to improve the accuracy of predicting motifs of varying lengths. Our evaluation, utilizing an extensive compilation of 90 prokaryotic ChIP-exo datasets from proChIPdb and 167 H. sapiens datasets, compared TESA's performance against seven established tools. The results indicate TESA's improved precision in motif identification, suggesting its valuable contribution to the field of genomic research.
Affiliation(s)
- Yang Li
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Yizhong Wang
  - School of Mathematics, Shandong University, Jinan, Shandong 250100, China
- Cankun Wang
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Anjun Ma
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Qin Ma
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
  - Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH 43210, USA
- Bingqiang Liu
  - School of Mathematics, Shandong University, Jinan, Shandong 250100, China
2
Wang Y, Li Y, Wang C, Lio CWJ, Ma Q, Liu B. CEMIG: prediction of the cis-regulatory motif using the de Bruijn graph from ATAC-seq. Brief Bioinform 2023; 25:bbad505. [PMID: 38189539 PMCID: PMC10772951 DOI: 10.1093/bib/bbad505] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2023] [Revised: 11/21/2023] [Accepted: 12/03/2023] [Indexed: 01/09/2024] Open
Abstract
Sequence motif discovery algorithms enhance the identification of novel deoxyribonucleic acid sequences with pivotal biological significance, especially transcription factor (TF)-binding motifs. The advent of assay for transposase-accessible chromatin using sequencing (ATAC-seq) has broadened the toolkit for motif characterization. Nonetheless, prevailing computational approaches have focused on delineating TF-binding footprints, with motif discovery receiving less attention. Herein, we present Cis rEgulatory Motif Influence using de Bruijn Graph (CEMIG), an algorithm leveraging de Bruijn and Hamming distance graph paradigms to predict and map motif sites. Assessment on 129 ATAC-seq datasets from the Cistrome Data Browser demonstrates CEMIG's exceptional performance, surpassing three established methodologies on four evaluative metrics. CEMIG accurately identifies both cell-type-specific and common TF motifs within GM12878 and K562 cell lines, demonstrating its comparative genomic capabilities in the identification of evolutionary conservation and cell-type specificity. In-depth transcriptional and functional genomic studies have validated the functional relevance of CEMIG-identified motifs across various cell types. CEMIG is available at https://github.com/OSU-BMBL/CEMIG, developed in C++ to ensure cross-platform compatibility with Linux, macOS and Windows operating systems.
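The two graph structures named in this abstract can be illustrated with a minimal sketch. This is not CEMIG's implementation: the function names, the toy sequences, and the choice of k and distance threshold are all hypothetical, chosen only to show what a de Bruijn graph over k-mers and a Hamming distance graph look like.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def de_bruijn_edges(sequences, k):
    """Count edges from each k-mer's (k-1)-prefix to its (k-1)-suffix."""
    edges = defaultdict(int)
    for seq in sequences:
        for kmer in kmers(seq, k):
            edges[(kmer[:-1], kmer[1:])] += 1
    return dict(edges)


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def hamming_graph(kmer_set, max_dist=1):
    """Connect every pair of k-mers that differ in at most max_dist positions."""
    return {(a, b) for a, b in combinations(sorted(kmer_set), 2)
            if hamming(a, b) <= max_dist}


# Hypothetical toy input, not data from the paper.
toy_sequences = ["ACGTAC", "ACGAAC"]
edges = de_bruijn_edges(toy_sequences, k=3)
neighbours = hamming_graph({"ACG", "ACT", "TTT"})
```

In this sketch, edge weights record how often a k-mer occurs across the input, and the Hamming graph groups k-mers that could be variants of one binding site.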
Affiliation(s)
- Yizhong Wang
  - School of Mathematics, Shandong University, Jinan, 250100, China
- Yang Li
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
- Cankun Wang
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
- Chan-Wang Jerry Lio
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
  - Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH, 43210, USA
- Qin Ma
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
  - Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH, 43210, USA
- Bingqiang Liu
  - School of Mathematics, Shandong University, Jinan, 250100, China
3
Mier P, Andrade-Navarro MA. MAGA: A Supervised Method to Detect Motifs From Annotated Groups in Alignments. Evol Bioinform Online 2020; 16:1176934320916199. [PMID: 32425492 PMCID: PMC7218316 DOI: 10.1177/1176934320916199] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/10/2020] [Accepted: 03/10/2020] [Indexed: 11/17/2022] Open
Abstract
Multiple sequence alignments are usually phylogenetically driven. They are studied in the framework of evolution. But sometimes, it is interesting to study residue conservation at positions unconstrained by evolutionary rules. We present a supervised method to access a layer of information difficult to appreciate visually when many protein sequences are aligned. This new tool (MAGA; http://cbdm-01.zdv.uni-mainz.de/~munoz/maga/) locates positions in multiple sequence alignments differentially conserved in manually defined groups of sequences.
Affiliation(s)
- Pablo Mier
  - Institute of Organismic and Molecular Evolution, Faculty of Biology, Johannes Gutenberg University Mainz, Hanns-Dieter-Hüsch-Weg 15, Mainz 55128, Germany
- Miguel A Andrade-Navarro
  - Institute of Organismic and Molecular Evolution, Faculty of Biology, Johannes Gutenberg University Mainz, Hanns-Dieter-Hüsch-Weg 15, Mainz 55128, Germany
4
Wang X, Wang S, Song T. A Spectral Rotation Method with Triplet Periodicity Property for Planted Motif Finding Problems. Comb Chem High Throughput Screen 2019; 22:683-693. [PMID: 31782356 DOI: 10.2174/1386207322666191129112433] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 05/28/2019] [Revised: 07/18/2019] [Accepted: 08/07/2019] [Indexed: 11/22/2022]
Abstract
BACKGROUND: Genes are known as functional patterns in the genome and are presumed to have biological significance. They can indicate binding sites for transcription factors and they encode certain proteins. Finding genes from biological sequences is a major task in computational biology for unraveling the mechanisms of gene expression.
OBJECTIVE: Planted motif finding problems are a class of mathematical models abstracted from the process of detecting genes from the genome, in which a specific gene with a number of mutations is planted into a randomly generated background sequence, and then gene finding algorithms can be tested to check if the planted gene can be found in feasible time.
METHODS: In this work, a spectral rotation method based on the triplet periodicity property is proposed to solve planted motif finding problems.
RESULTS: The proposed method gives significant tolerance of base mutations in genes. Specifically, genes having a number of substitutions can be detected from randomly generated background sequences. Experimental results on a genomic data set from Saccharomyces cerevisiae reveal that genes can be visually distinguished. It is proposed that genes with about 50% mutations can be detected from randomly generated background sequences.
CONCLUSION: It is found that with about 5 insertions or deletions, this method fails in finding the planted genes. For a particular case, if the deletion of bases is located at the beginning of the gene, that is, bases are not randomly deleted, then the tolerance of the method for base deletion is increased.
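The planted-pattern setup described under OBJECTIVE can be sketched as a small test-instance generator. This is an illustrative assumption of how such benchmarks are commonly built, not the authors' code: the function names, the uniform background, and the substitution-only mutation model are hypothetical, and insertions or deletions are deliberately not modeled.

```python
import random


def mutate(motif, d, rng, alphabet="ACGT"):
    """Return a copy of motif with exactly d substituted positions."""
    chars = list(motif)
    for p in rng.sample(range(len(motif)), d):
        # Always pick a different base so the position really changes.
        chars[p] = rng.choice([c for c in alphabet if c != chars[p]])
    return "".join(chars)


def plant(motif, n_seqs, seq_len, d, seed=0, alphabet="ACGT"):
    """Plant one mutated copy of motif into each random background sequence.

    Returns a list of (sequence, start, planted_variant) tuples.
    """
    rng = random.Random(seed)
    instances = []
    for _ in range(n_seqs):
        background = [rng.choice(alphabet) for _ in range(seq_len)]
        start = rng.randrange(seq_len - len(motif) + 1)
        variant = mutate(motif, d, rng, alphabet)
        background[start:start + len(motif)] = variant
        instances.append(("".join(background), start, variant))
    return instances


# Hypothetical parameters: five sequences of length 40, two substitutions each.
instances = plant("ACGTACGT", n_seqs=5, seq_len=40, d=2, seed=1)
```

A finder under test would receive only the sequences; the stored start positions and variants serve as ground truth for scoring.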
Affiliation(s)
- Xun Wang
  - School of Electrical Engineering and Automation, Tiangong University, Tianjin 300387, China
- Shudong Wang
  - School of Electrical Engineering and Automation, Tiangong University, Tianjin 300387, China
- Tao Song
  - School of Electrical Engineering and Automation, Tiangong University, Tianjin 300387, China
  - Department of Artificial Intelligence, Faculty of Computer Science, Polytechnical University of Madrid, Campus de Montegancedo, Boadilla del Monte 28660, Madrid, Spain
5
Abstract
Protein–DNA binding plays a central role in gene regulation and by that in all processes in the living cell. Novel experimental and computational approaches facilitate better understanding of protein–DNA binding preferences via high-throughput measurement of protein binding to a large number of DNA sequences and inference of binding models from them. Here we review the state of the art in measuring protein–DNA binding in vitro, emphasizing the advantages and limitations of different technologies. In addition, we describe models for representing protein–DNA binding preferences and key computational approaches to learn those from high-throughput data. Using large experimental data sets, we test the performance of different models based on different measuring techniques. We conclude with pertinent open problems.
6
Pereira MB, Wallroth M, Kristiansson E, Axelson-Fisk M. HattCI: Fast and Accurate attC site Identification Using Hidden Markov Models. J Comput Biol 2016; 23:891-902. [PMID: 27428829 DOI: 10.1089/cmb.2016.0024] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Indexed: 11/13/2022] Open
Abstract
Integrons are genetic elements that facilitate the horizontal gene transfer in bacteria and are known to harbor genes associated with antibiotic resistance. The gene mobility in the integrons is governed by the presence of attC sites, which are 55 to 141-nucleotide-long imperfect inverted repeats. Here we present HattCI, a new method for fast and accurate identification of attC sites in large DNA data sets. The method is based on a generalized hidden Markov model that describes each core component of an attC site individually. Using twofold cross-validation experiments on a manually curated reference data set of 231 attC sites from class 1 and 2 integrons, HattCI showed high sensitivities of up to 91.9% while maintaining satisfactory false-positive rates. When applied to a metagenomic data set of 35 microbial communities from different environments, HattCI found a substantially higher number of attC sites in the samples that are known to contain more horizontally transferred elements. HattCI will significantly increase the ability to identify attC sites and thus integron-mediated genes in genomic and metagenomic data. HattCI is implemented in C and is freely available at http://bioinformatics.math.chalmers.se/HattCI.
Affiliation(s)
- Mariana Buongermino Pereira
  - Department of Mathematical Sciences, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden
  - Centre for Antibiotic Resistance Research (CARe) at University of Gothenburg, Gothenburg, Sweden
- Mikael Wallroth
  - Department of Mathematical Sciences, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden
- Erik Kristiansson
  - Department of Mathematical Sciences, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden
  - Centre for Antibiotic Resistance Research (CARe) at University of Gothenburg, Gothenburg, Sweden
- Marina Axelson-Fisk
  - Department of Mathematical Sciences, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden
7
Abstract
Transcription factors (TFs) have to find their binding sites, which are distributed throughout the genome. Facilitated diffusion is currently the most widely accepted model for this search process. Based on this model the TF alternates between one-dimensional sliding along the DNA, and three-dimensional bulk diffusion. In this view, the non-specific associations between the proteins and the DNA play a major role in the search dynamics. However, little is known about how the DNA properties around the motif contribute to the search. Accumulating evidence showing that TF binding sites are embedded within a unique environment, specific to each TF, leads to the hypothesis that the search process is facilitated by favorable DNA features that help to improve the search efficiency. Here, we review the field and present the hypothesis that TF-DNA recognition is dictated not only by the motif, but is also influenced by the environment in which the motif resides.
Affiliation(s)
- Iris Dror
  - Department of Biology, Technion - Israel Institute of Technology, Technion City, Haifa, Israel
  - Departments of Biological Sciences, Chemistry, Physics, and Computer Science, Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA, USA
- Remo Rohs
  - Departments of Biological Sciences, Chemistry, Physics, and Computer Science, Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA, USA
- Yael Mandel-Gutfreund
  - Department of Biology, Technion - Israel Institute of Technology, Technion City, Haifa, Israel
8
Abstract
Finding short patterns with residue variation in a set of sequences is still an open problem in genetics, since motif-finding techniques on DNA and protein sequences are inconclusive on real data sets and their performance varies across species. Hence, finding new algorithms and evolving established methods are vital to further understanding of genome properties and the mechanisms of protein development. In this work, we present an approach to finding functional motifs in DNA sequences in connection with the Gibbs sampling method. Starting points in the search space are partly determined via a graphical representation of the input sequences, as opposed to the completely random initial points of standard Gibbs sampling. Our algorithm is evaluated on synthetic as well as real data sets using several statistics: sensitivity, positive predictive value, specificity, performance, and correlation coefficient. Additionally, a comparison between our algorithm and the basic standard Gibbs sampling algorithm is made to show improvement in accuracy, repeatability, and performance.
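The statistics listed in this abstract can be computed from a confusion matrix of predicted versus true site positions. The sketch below assumes the nucleotide-level definitions commonly used in motif-finder assessments, with "performance" taken to be the performance coefficient TP / (TP + FP + FN) and "correlation coefficient" taken to be the Matthews correlation; the abstract does not confirm these definitions, and the function names and toy positions are hypothetical.

```python
import math


def confusion_counts(true_sites, predicted_sites, total_positions):
    """Count TP, FP, FN, TN over nucleotide positions."""
    true_set, pred_set = set(true_sites), set(predicted_sites)
    tp = len(true_set & pred_set)
    fp = len(pred_set - true_set)
    fn = len(true_set - pred_set)
    tn = total_positions - tp - fp - fn
    return tp, fp, fn, tn


def motif_metrics(tp, fp, fn, tn):
    """Return the five statistics; undefined ratios are reported as NaN."""
    def ratio(num, den):
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "ppv": ratio(tp, tp + fp),
        "specificity": ratio(tn, tn + fp),
        "performance": ratio(tp, tp + fp + fn),
        "correlation": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical example: a true site at positions 10-19 and a prediction
# shifted by five bases, within a 100-base sequence.
counts = confusion_counts(range(10, 20), range(15, 25), total_positions=100)
metrics = motif_metrics(*counts)
```

Because true negatives dominate in long sequences, specificity tends to stay high even for poor predictions, which is one reason such assessments report several statistics together.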
9
Abstract
The novel high-throughput technology of protein-binding microarrays (PBMs) measures binding intensity of a transcription factor to thousands of DNA probe sequences. Several algorithms have been developed to extract binding-site motifs from these data. Such motifs are commonly represented by positional weight matrices. Previous studies have shown that the motifs produced by these algorithms are either accurate in predicting in vitro binding or similar to previously published motifs, but not both. In this work, we present a new simple algorithm to infer binding-site motifs from PBM data. It outperforms prior art both in predicting in vitro binding and in producing motifs similar to literature motifs. Our results challenge previous claims that motifs with lower information content are better models for transcription-factor binding specificity. Moreover, we tested the effect of motif length and side positions flanking the "core" motif in the binding site. We show that side positions have a significant effect and should not be removed, as commonly done. A large drop in the results quality of all methods is observed between in vitro and in vivo binding prediction. The software is available on acgt.cs.tau.ac.il/rap.
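The positional weight matrix representation and the notion of information content mentioned in this abstract can be made concrete with a short sketch. It is a generic textbook construction, not the algorithm the authors present: the function names, the aligned toy sites, the pseudocount, and the uniform background of 0.25 are all assumptions for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def position_frequency_matrix(sites, pseudocount=0.0):
    """Per-column base frequencies from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    matrix = []
    for i in range(length):
        counts = Counter(s[i] for s in sites)
        matrix.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return matrix


def information_content(matrix):
    """Bits per column: log2(4) minus the column's Shannon entropy."""
    result = []
    for column in matrix:
        entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
        result.append(math.log2(len(ALPHABET)) - entropy)
    return result


def log_odds_score(matrix, site, background=0.25):
    """Sum of log2(frequency / background); needs nonzero frequencies."""
    return sum(math.log2(matrix[i][b] / background) for i, b in enumerate(site))


# Hypothetical aligned binding sites, not data from the paper.
sites = ["ACGT", "ACGA", "ACGT", "ACGC"]
ic = information_content(position_frequency_matrix(sites))
pwm = position_frequency_matrix(sites, pseudocount=1.0)
score_match = log_odds_score(pwm, "ACGT")
score_other = log_odds_score(pwm, "TTTT")
```

A fully conserved column carries 2 bits and a uniform column carries 0, so the variable fourth column in this toy example contributes far less than the first three.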
Affiliation(s)
- Yaron Orenstein
  - Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
10
Li X, Zhong S, Wong WH. Reliable prediction of transcription factor binding sites by phylogenetic verification. Proc Natl Acad Sci U S A 2005; 102:16945-50. [PMID: 16286651 PMCID: PMC1283155 DOI: 10.1073/pnas.0504201102] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.0] [Received: 05/20/2005] [Accepted: 10/03/2005] [Indexed: 11/18/2022] Open
Abstract
We present a statistical methodology that substantially improves the accuracy of computational predictions of transcription factor (TF) binding sites in eukaryote genomes. This method models the cross-species conservation of binding sites without relying on accurate sequence alignment. It can be coupled with any motif-finding algorithm that searches for overrepresented sequence motifs in individual species and can increase the accuracy of the coupled motif-finding algorithm. Because this method is capable of accurately detecting TF binding sites, it also enhances our ability to predict the cis-regulatory modules. We applied this method to the published chromatin immunoprecipitation (ChIP)-chip data in Saccharomyces cerevisiae and found that its sensitivity and specificity are 9% and 14% higher than those of two recent methods. We also recovered almost all of the previously verified TF binding sites and made predictions on the cis-regulatory elements that govern the tight regulation of ribosomal protein genes in 13 eukaryote species (2 plants, 4 yeasts, 2 worms, 2 insects, and 3 mammals). These results give insights into the transcriptional regulation in eukaryotic organisms.
Affiliation(s)
- Xiaoman Li
  - Department of Statistics, Stanford University, Sequoia Hall, 390 Serra Mall, Stanford, CA 94305-4065, USA.