1
|
Rafi AM, Nogina D, Penzar D, Lee D, Lee D, Kim N, Kim S, Kim D, Shin Y, Kwak IY, Meshcheryakov G, Lando A, Zinkevich A, Kim BC, Lee J, Kang T, Vaishnav ED, Yadollahpour P, Kim S, Albrecht J, Regev A, Gong W, Kulakovskiy IV, Meyer P, de Boer CG. A community effort to optimize sequence-based deep learning models of gene regulation. Nat Biotechnol 2024:10.1038/s41587-024-02414-w. [PMID: 39394483 DOI: 10.1038/s41587-024-02414-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2024] [Accepted: 08/29/2024] [Indexed: 10/13/2024]
Abstract
A systematic evaluation of how model architectures and training strategies impact genomics model performance is needed. To address this gap, we held a DREAM Challenge where competitors trained models on a dataset of millions of random promoter DNA sequences and corresponding expression levels, experimentally determined in yeast. For a robust evaluation of the models, we designed a comprehensive suite of benchmarks encompassing various sequence types. All top-performing models used neural networks but diverged in architectures and training strategies. To dissect how architectural and training choices impact performance, we developed the Prix Fixe framework to divide models into modular building blocks. We tested all possible combinations for the top three models, further improving their performance. The DREAM Challenge models not only achieved state-of-the-art results on our comprehensive yeast dataset but also consistently surpassed existing benchmarks on Drosophila and human genomic datasets, demonstrating the progress that can be driven by gold-standard genomics datasets.
Collapse
Affiliation(s)
| | - Daria Nogina
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia
| | - Dmitry Penzar
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
- AIRI, Moscow, Russia
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, Russia
| | - Dohoon Lee
- Seoul National University, Seoul, South Korea
| | | | - Nayeon Kim
- Seoul National University, Seoul, South Korea
| | | | - Dohyeon Kim
- Seoul National University, Seoul, South Korea
| | - Yeojin Shin
- Seoul National University, Seoul, South Korea
| | | | | | | | - Arsenii Zinkevich
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
| | | | - Juhyun Lee
- Chung-Ang University, Seoul, South Korea
| | - Taein Kang
- Chung-Ang University, Seoul, South Korea
| | - Eeshit Dhaval Vaishnav
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Sequome, Inc., South San Francisco, CA, USA
| | | | - Sun Kim
- Seoul National University, Seoul, South Korea
| | | | - Aviv Regev
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Genentech, San Francisco, CA, USA
| | - Wuming Gong
- University of Minnesota, Minneapolis, MN, USA
| | - Ivan V Kulakovskiy
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, Russia
| | - Pablo Meyer
- Health Care and Life Sciences, IBM Research, New York, NY, USA
| | - Carl G de Boer
- University of British Columbia, Vancouver, British Columbia, Canada.
| |
Collapse
|
2
|
Wu Q, Li Y, Wang Q, Zhao X, Sun D, Liu B. Identification of DNA motif pairs on paired sequences based on composite heterogeneous graph. Front Genet 2024; 15:1424085. [PMID: 38952710 PMCID: PMC11215013 DOI: 10.3389/fgene.2024.1424085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/27/2024] [Accepted: 05/22/2024] [Indexed: 07/03/2024] Open
Abstract
Motivation The interaction between DNA motifs (DNA motif pairs) influences gene expression through partnership or competition in the process of gene regulation. Potential chromatin interactions between different DNA motifs have been implicated in various diseases. However, current methods for identifying DNA motif pairs rely on the recognition of single DNA motifs or probabilities, which may result in local optimal solutions and can be sensitive to the choice of initial values. A method for precisely identifying DNA motif pairs is still lacking. Results Here, we propose a novel computational method for predicting DNA Motif Pairs based on Composite Heterogeneous Graph (MPCHG). This approach leverages a composite heterogeneous graph model to identify DNA motif pairs on paired sequences. Compared with the existing methods, MPCHG has greatly improved the accuracy of motifs prediction. Furthermore, the predicted DNA motifs demonstrate heightened DNase accessibility than the background sequences. Notably, the two DNA motifs forming a pair exhibit functional consistency. Importantly, the interacting TF pairs obtained by predicted DNA motif pairs were significantly enriched with known interacting TF pairs, suggesting their potential contribution to chromatin interactions. Collectively, we believe that these identified DNA motif pairs held substantial implications for revealing gene transcriptional regulation under long-range chromatin interactions.
Collapse
Affiliation(s)
- Qiuqin Wu
- School of Mathematics, Shandong University, Jinan, China
| | - Yang Li
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, United States
| | - Qi Wang
- School of Mathematics, Shandong University, Jinan, China
| | - Xiaoyu Zhao
- School of Mathematics, Shandong University, Jinan, China
| | - Duanchen Sun
- School of Mathematics, Shandong University, Jinan, China
| | - Bingqiang Liu
- School of Mathematics, Shandong University, Jinan, China
| |
Collapse
|
3
|
Garbelini JMC, Sanches DS, Pozo ATR. BIOMAPP::CHIP: large-scale motif analysis. BMC Bioinformatics 2024; 25:128. [PMID: 38528492 DOI: 10.1186/s12859-024-05752-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2023] [Accepted: 03/18/2024] [Indexed: 03/27/2024] Open
Abstract
BACKGROUND Discovery biological motifs plays a fundamental role in understanding regulatory mechanisms. Computationally, they can be efficiently represented as kmers, making the counting of these elements a critical aspect for ensuring not only the accuracy but also the efficiency of the analytical process. This is particularly useful in scenarios involving large data volumes, such as those generated by the ChIP-seq protocol. Against this backdrop, we introduce BIOMAPP::CHIP, a tool specifically designed to optimize the discovery of biological motifs in large data volumes. RESULTS We conducted a comprehensive set of comparative tests with state-of-the-art algorithms. Our analyses revealed that BIOMAPP::CHIP outperforms existing approaches in various metrics, excelling both in terms of performance and accuracy. The tests demonstrated a higher detection rate of significant motifs and also greater agility in the execution of the algorithm. Furthermore, the SMT component played a vital role in the system's efficiency, proving to be both agile and accurate in kmer counting, which in turn improved the overall efficacy of our tool. CONCLUSION BIOMAPP::CHIP represent real advancements in the discovery of biological motifs, particularly in large data volume scenarios, offering a relevant alternative for the analysis of ChIP-seq data and have the potential to boost future research in the field. This software can be found at the following address: (https://github.com/jadermcg/biomapp-chip).
Collapse
Affiliation(s)
- Jader M Caldonazzo Garbelini
- Department of Informatics, Federal University of Parana, XV de Novembro Street, Curitiba, Parana, 80060000, Brazil.
| | - Danilo S Sanches
- Department of Informatics, Federal University of Technology, Alberto Carazzai Avenue, Cornelio Procopio, Parana, 86300000, Brazil
| | - Aurora T Ramirez Pozo
- Department of Informatics, Federal University of Parana, XV de Novembro Street, Curitiba, Parana, 80060000, Brazil
| |
Collapse
|
4
|
Li Y, Wang Y, Wang C, Ma A, Ma Q, Liu B. A weighted two-stage sequence alignment framework to identify motifs from ChIP-exo data. PATTERNS (NEW YORK, N.Y.) 2024; 5:100927. [PMID: 38487805 PMCID: PMC10935504 DOI: 10.1016/j.patter.2024.100927] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 06/14/2023] [Revised: 08/18/2023] [Accepted: 01/10/2024] [Indexed: 03/17/2024]
Abstract
In this study, we introduce TESA (weighted two-stage alignment), an innovative motif prediction tool that refines the identification of DNA-binding protein motifs, essential for deciphering transcriptional regulatory mechanisms. Unlike traditional algorithms that rely solely on sequence data, TESA integrates the high-resolution chromatin immunoprecipitation (ChIP) signal, specifically from ChIP-exonuclease (ChIP-exo), by assigning weights to sequence positions, thereby enhancing motif discovery. TESA employs a nuanced approach combining a binomial distribution model with a graph model, further supported by a "bookend" model, to improve the accuracy of predicting motifs of varying lengths. Our evaluation, utilizing an extensive compilation of 90 prokaryotic ChIP-exo datasets from proChIPdb and 167 H. sapiens datasets, compared TESA's performance against seven established tools. The results indicate TESA's improved precision in motif identification, suggesting its valuable contribution to the field of genomic research.
Collapse
Affiliation(s)
- Yang Li
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
| | - Yizhong Wang
- School of Mathematics, Shandong University, Jinan, Shandong 250100, China
| | - Cankun Wang
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
| | - Anjun Ma
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
| | - Qin Ma
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH 43210, USA
| | - Bingqiang Liu
- School of Mathematics, Shandong University, Jinan, Shandong 250100, China
| |
Collapse
|
5
|
Zhang HQ, Liu SH, Li R, Yu JW, Ye DX, Yuan SS, Lin H, Huang CB, Tang H. MIBPred: Ensemble Learning-Based Metal Ion-Binding Protein Classifier. ACS OMEGA 2024; 9:8439-8447. [PMID: 38405489 PMCID: PMC10882704 DOI: 10.1021/acsomega.3c09587] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/01/2023] [Revised: 01/16/2024] [Accepted: 01/22/2024] [Indexed: 02/27/2024]
Abstract
In biological organisms, metal ion-binding proteins participate in numerous metabolic activities and are closely associated with various diseases. To accurately predict whether a protein binds to metal ions and the type of metal ion-binding protein, this study proposed a classifier named MIBPred. The classifier incorporated advanced Word2Vec technology from the field of natural language processing to extract semantic features of the protein sequence language and combined them with position-specific score matrix (PSSM) features. Furthermore, an ensemble learning model was employed for the metal ion-binding protein classification task. In the model, we independently trained XGBoost, LightGBM, and CatBoost algorithms and integrated the output results through an SVM voting mechanism. This innovative combination has led to a significant breakthrough in the predictive performance of our model. As a result, we achieved accuracies of 95.13% and 85.19%, respectively, in predicting metal ion-binding proteins and their types. Our research not only confirms the effectiveness of Word2Vec technology in extracting semantic information from protein sequences but also highlights the outstanding performance of the MIBPred classifier in the problem of metal ion-binding protein types. This study provides a reliable tool and method for the in-depth exploration of the structure and function of metal ion-binding proteins.
Collapse
Affiliation(s)
- Hong-Qi Zhang
- School
of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of
China, Chengdu 610054, China
| | - Shang-Hua Liu
- School
of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of
China, Chengdu 610054, China
| | - Rui Li
- School
of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of
China, Chengdu 610054, China
| | - Jun-Wen Yu
- School
of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of
China, Chengdu 610054, China
| | - Dong-Xin Ye
- School
of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of
China, Chengdu 610054, China
| | - Shi-Shi Yuan
- School
of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of
China, Chengdu 610054, China
| | - Hao Lin
- School
of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of
China, Chengdu 610054, China
| | - Cheng-Bing Huang
- School
of Computer Science and Technology, Aba Teachers University, Aba 623002, China
| | - Hua Tang
- School
of Basic Medical Sciences, Southwest Medical
University, Luzhou 646000, China
- Central
Nervous System Drug Key Laboratory of Sichuan Province, Luzhou 646000, China
| |
Collapse
|
6
|
Wang Y, Li Y, Wang C, Lio CWJ, Ma Q, Liu B. CEMIG: prediction of the cis-regulatory motif using the de Bruijn graph from ATAC-seq. Brief Bioinform 2023; 25:bbad505. [PMID: 38189539 PMCID: PMC10772951 DOI: 10.1093/bib/bbad505] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2023] [Revised: 11/21/2023] [Accepted: 12/03/2023] [Indexed: 01/09/2024] Open
Abstract
Sequence motif discovery algorithms enhance the identification of novel deoxyribonucleic acid sequences with pivotal biological significance, especially transcription factor (TF)-binding motifs. The advent of assay for transposase-accessible chromatin using sequencing (ATAC-seq) has broadened the toolkit for motif characterization. Nonetheless, prevailing computational approaches have focused on delineating TF-binding footprints, with motif discovery receiving less attention. Herein, we present Cis rEgulatory Motif Influence using de Bruijn Graph (CEMIG), an algorithm leveraging de Bruijn and Hamming distance graph paradigms to predict and map motif sites. Assessment on 129 ATAC-seq datasets from the Cistrome Data Browser demonstrates CEMIG's exceptional performance, surpassing three established methodologies on four evaluative metrics. CEMIG accurately identifies both cell-type-specific and common TF motifs within GM12878 and K562 cell lines, demonstrating its comparative genomic capabilities in the identification of evolutionary conservation and cell-type specificity. In-depth transcriptional and functional genomic studies have validated the functional relevance of CEMIG-identified motifs across various cell types. CEMIG is available at https://github.com/OSU-BMBL/CEMIG, developed in C++ to ensure cross-platform compatibility with Linux, macOS and Windows operating systems.
Collapse
Affiliation(s)
- Yizhong Wang
- School of Mathematics, Shandong University, Jinan, 250100, China
| | - Yang Li
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
| | - Cankun Wang
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
| | - Chan-Wang Jerry Lio
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
- Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH, 43210, USA
| | - Qin Ma
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
- Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH, 43210, USA
| | - Bingqiang Liu
- School of Mathematics, Shandong University, Jinan, 250100, China
| |
Collapse
|
7
|
Tognon M, Giugno R, Pinello L. A survey on algorithms to characterize transcription factor binding sites. Brief Bioinform 2023; 24:bbad156. [PMID: 37099664 PMCID: PMC10422928 DOI: 10.1093/bib/bbad156] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/13/2023] [Revised: 03/27/2023] [Accepted: 04/01/2023] [Indexed: 04/28/2023] Open
Abstract
Transcription factors (TFs) are key regulatory proteins that control the transcriptional rate of cells by binding short DNA sequences called transcription factor binding sites (TFBS) or motifs. Identifying and characterizing TFBS is fundamental to understanding the regulatory mechanisms governing the transcriptional state of cells. During the last decades, several experimental methods have been developed to recover DNA sequences containing TFBS. In parallel, computational methods have been proposed to discover and identify TFBS motifs based on these DNA sequences. This is one of the most widely investigated problems in bioinformatics and is referred to as the motif discovery problem. In this manuscript, we review classical and novel experimental and computational methods developed to discover and characterize TFBS motifs in DNA sequences, highlighting their advantages and drawbacks. We also discuss open challenges and future perspectives that could fill the remaining gaps in the field.
Collapse
Affiliation(s)
- Manuel Tognon
- Computer Science Department, University of Verona, Verona, Italy
- Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | - Rosalba Giugno
- Computer Science Department, University of Verona, Verona, Italy
| | - Luca Pinello
- Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Department of Pathology, Harvard Medical School, Boston, Massachusetts, United States of America
| |
Collapse
|
8
|
Andrews G, Fan K, Pratt HE, Phalke N, Karlsson EK, Lindblad-Toh K, Gazal S, Moore JE, Weng Z, Andrews G, Armstrong JC, Bianchi M, Birren BW, Bredemeyer KR, Breit AM, Christmas MJ, Clawson H, Damas J, Di Palma F, Diekhans M, Dong MX, Eizirik E, Fan K, Fanter C, Foley NM, Forsberg-Nilsson K, Garcia CJ, Gatesy J, Gazal S, Genereux DP, Goodman L, Grimshaw J, Halsey MK, Harris AJ, Hickey G, Hiller M, Hindle AG, Hubley RM, Hughes GM, Johnson J, Juan D, Kaplow IM, Karlsson EK, Keough KC, Kirilenko B, Koepfli KP, Korstian JM, Kowalczyk A, Kozyrev SV, Lawler AJ, Lawless C, Lehmann T, Levesque DL, Lewin HA, Li X, Lind A, Lindblad-Toh K, Mackay-Smith A, Marinescu VD, Marques-Bonet T, Mason VC, Meadows JRS, Meyer WK, Moore JE, Moreira LR, Moreno-Santillan DD, Morrill KM, Muntané G, Murphy WJ, Navarro A, Nweeia M, Ortmann S, Osmanski A, Paten B, Paulat NS, Pfenning AR, Phan BN, Pollard KS, Pratt HE, Ray DA, Reilly SK, Rosen JR, Ruf I, Ryan L, Ryder OA, Sabeti PC, Schäffer DE, Serres A, Shapiro B, Smit AFA, Springer M, Srinivasan C, Steiner C, Storer JM, Sullivan KAM, Sullivan PF, Sundström E, Supple MA, Swofford R, Talbot JE, Teeling E, Turner-Maier J, Valenzuela A, Wagner F, Wallerman O, Wang C, Wang J, Weng Z, Wilder AP, Wirthlin ME, Xue JR, Zhang X. Mammalian evolution of human cis-regulatory elements and transcription factor binding sites. Science 2023; 380:eabn7930. [PMID: 37104580 DOI: 10.1126/science.abn7930] [Citation(s) in RCA: 31] [Impact Index Per Article: 15.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/29/2023]
Abstract
Understanding the regulatory landscape of the human genome is a long-standing objective of modern biology. Using the reference-free alignment across 241 mammalian genomes produced by the Zoonomia Consortium, we charted evolutionary trajectories for 0.92 million human candidate cis-regulatory elements (cCREs) and 15.6 million human transcription factor binding sites (TFBSs). We identified 439,461 cCREs and 2,024,062 TFBSs under evolutionary constraint. Genes near constrained elements perform fundamental cellular processes, whereas genes near primate-specific elements are involved in environmental interaction, including odor perception and immune response. About 20% of TFBSs are transposable element-derived and exhibit intricate patterns of gains and losses during primate evolution whereas sequence variants associated with complex traits are enriched in constrained TFBSs. Our annotations illuminate the regulatory functions of the human genome.
Collapse
Affiliation(s)
- Gregory Andrews
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, Worcester, MA, USA
| | - Kaili Fan
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, Worcester, MA, USA
| | - Henry E Pratt
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, Worcester, MA, USA
| | - Nishigandha Phalke
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, Worcester, MA, USA
| | - Elinor K Karlsson
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, Worcester, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA
- Program in Molecular Medicine, UMass Chan Medical School, Worcester, MA 01605, USA
| | - Kerstin Lindblad-Toh
- Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA
- Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, 75132 Uppsala, Sweden
| | - Steven Gazal
- Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
- Center for Genetic Epidemiology, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
| | - Jill E Moore
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, Worcester, MA, USA
| | - Zhiping Weng
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, Worcester, MA, USA
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
Collapse
|
9
|
Su Y, Xu C, Shea J, DeStephanis D, Su Z. Transcriptomic changes in single yeast cells under various stress conditions. BMC Genomics 2023; 24:88. [PMID: 36829151 PMCID: PMC9960639 DOI: 10.1186/s12864-023-09184-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2022] [Accepted: 02/13/2023] [Indexed: 02/26/2023] Open
Abstract
BACKGROUND The stress response of Saccharomyces cerevisiae has been extensively studied in the past decade. However, with the advent of recent technology in single-cell transcriptome profiling, there is a new opportunity to expand and further understanding of the yeast stress response with greater resolution on a system level. To understand transcriptomic changes in baker's yeast S. cerevisiae cells under stress conditions, we sequenced 117 yeast cells under three stress treatments (hypotonic condition, glucose starvation and amino acid starvation) using a full-length single-cell RNA-Seq method. RESULTS We found that though single cells from the same treatment showed varying degrees of uniformity, technical noise and batch effects can confound results significantly. However, upon careful selection of samples to reduce technical artifacts and account for batch-effects, we were able to capture distinct transcriptomic signatures for different stress conditions as well as putative regulatory relationships between transcription factors and target genes. CONCLUSION Our results show that a full-length single-cell based transcriptomic analysis of the yeast may help paint a clearer picture of how the model organism responds to stress than do bulk cell population-based methods.
Collapse
Affiliation(s)
- Yangqi Su
- Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, 28223, Charlotte, NC, USA
| | - Chen Xu
- Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, 28223, Charlotte, NC, USA
| | - Jonathan Shea
- Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, 28223, Charlotte, NC, USA
| | - Darla DeStephanis
- Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, 28223, Charlotte, NC, USA
| | - Zhengchang Su
- Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, 28223, Charlotte, NC, USA.
| |
Collapse
|
10
|
Yu Q, Zhang X, Hu Y, Chen S, Yang L. A Method for Predicting DNA Motif Length Based On Deep Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:61-73. [PMID: 35275822 DOI: 10.1109/tcbb.2022.3158471] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
A DNA motif is a sequence pattern shared by the DNA sequence segments that bind to a specific protein. Discovering motifs in a given DNA sequence dataset plays a vital role in studying gene expression regulation. As an important attribute of the DNA motif, the motif length directly affects the quality of the discovered motifs. How to determine the motif length more accurately remains a difficult challenge to be solved. We propose a new motif length prediction scheme named MotifLen by using supervised machine learning. First, a method of constructing sample data for predicting the motif length is proposed. Secondly, a deep learning model for motif length prediction is constructed based on the convolutional neural network. Then, the methods of applying the proposed prediction model based on a motif found by an existing motif discovery algorithm are given. The experimental results show that i) the prediction accuracy of MotifLen is more than 90% on the validation set and is significantly higher than that of the compared methods on real datasets, ii) MotifLen can successfully optimize the motifs found by the existing motif discovery algorithms, and iii) it can effectively improve the time performance of some existing motif discovery algorithms.
Collapse
|
11
|
Ni P, Wilson D, Su Z. A map of cis-regulatory modules and constituent transcription factor binding sites in 80% of the mouse genome. BMC Genomics 2022; 23:714. [PMID: 36261804 PMCID: PMC9583556 DOI: 10.1186/s12864-022-08933-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2022] [Accepted: 10/11/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Mouse is probably the most important model organism to study mammal biology and human diseases. A better understanding of the mouse genome will help understand the human genome, biology and diseases. However, despite the recent progress, the characterization of the regulatory sequences in the mouse genome is still far from complete, limiting its use to understand the regulatory sequences in the human genome. RESULTS Here, by integrating binding peaks in ~ 9,000 transcription factor (TF) ChIP-seq datasets that cover 79.9% of the mouse mappable genome using an efficient pipeline, we were able to partition these binding peak-covered genome regions into a cis-regulatory module (CRM) candidate (CRMC) set and a non-CRMC set. The CRMCs contain 912,197 putative CRMs and 38,554,729 TF binding sites (TFBSs) islands, covering 55.5% and 24.4% of the mappable genome, respectively. The CRMCs tend to be under strong evolutionary constraints, indicating that they are likely cis-regulatory; while the non-CRMCs are largely selectively neutral, indicating that they are unlikely cis-regulatory. Based on evolutionary profiles of the genome positions, we further estimated that 63.8% and 27.4% of the mouse genome might code for CRMs and TFBSs, respectively. CONCLUSIONS Validation using experimental data suggests that at least most of the CRMCs are authentic. Thus, this unprecedentedly comprehensive map of CRMs and TFBSs can be a good resource to guide experimental studies of regulatory genomes in mice and humans.
Collapse
Affiliation(s)
- Pengyu Ni
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
| | - David Wilson
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
| | - Zhengchang Su
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, Charlotte, NC, 28223, USA.
| |
Collapse
|
12
|
Ni P, Moe J, Su Z. Accurate prediction of functional states of cis-regulatory modules reveals common epigenetic rules in humans and mice. BMC Biol 2022; 20:221. [PMID: 36199141 PMCID: PMC9535988 DOI: 10.1186/s12915-022-01426-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2022] [Accepted: 09/29/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Predicting cis-regulatory modules (CRMs) in a genome and their functional states in various cell/tissue types of the organism are two related challenging computational tasks. Most current methods attempt to simultaneously achieve both using data of multiple epigenetic marks in a cell/tissue type. Though conceptually attractive, they suffer high false discovery rates and limited applications. To fill the gaps, we proposed a two-step strategy to first predict a map of CRMs in the genome, and then predict functional states of all the CRMs in various cell/tissue types of the organism. We have recently developed an algorithm for the first step that was able to more accurately and completely predict CRMs in a genome than existing methods by integrating numerous transcription factor ChIP-seq datasets in the organism. Here, we presented machine-learning methods for the second step. RESULTS We showed that functional states in a cell/tissue type of all the CRMs in the genome could be accurately predicted using data of only 1~4 epigenetic marks by a variety of machine-learning classifiers. Our predictions are substantially more accurate than the best achieved so far. Interestingly, a model trained on a cell/tissue type in humans can accurately predict functional states of CRMs in different cell/tissue types of humans as well as of mice, and vice versa. Therefore, epigenetic code that defines functional states of CRMs in various cell/tissue types is universal at least in humans and mice. Moreover, we found that from tens to hundreds of thousands of CRMs were active in a human and mouse cell/tissue type, and up to 99.98% of them were reutilized in different cell/tissue types, while as small as 0.02% of them were unique to a cell/tissue type that might define the cell/tissue type. CONCLUSIONS Our two-step approach can accurately predict functional states in any cell/tissue type of all the CRMs in the genome using data of only 1~4 epigenetic marks. Our approach is also more cost-effective than existing methods that typically use data of more epigenetic marks. Our results suggest common epigenetic rules for defining functional states of CRMs in various cell/tissue types in humans and mice.
Collapse
Affiliation(s)
- Pengyu Ni
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
| | - Joshua Moe
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, Charlotte, NC, 28223, USA
| | - Zhengchang Su
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, Charlotte, NC, 28223, USA.
| |
Collapse
|
13
|
Ni P, Su Z. PCRMS: a database of predicted cis-regulatory modules and constituent transcription factor binding sites in genomes. Database (Oxford) 2022; 2022:6572594. [PMID: 35452518 PMCID: PMC9216522 DOI: 10.1093/database/baac024] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2021] [Revised: 02/20/2022] [Accepted: 04/12/2022] [Indexed: 01/13/2023]
Abstract
More accurate and more complete predictions of cis-regulatory modules (CRMs) and constituent transcription factor (TF) binding sites (TFBSs) in genomes can facilitate characterizing functions of regulatory sequences. Here, we developed a database predicted cis-regulatory modules (PCRMS) (https://cci-bioinfo.uncc.edu) that stores highly accurate and unprecedentedly complete maps of predicted CRMs and TFBSs in the human and mouse genomes. The web interface allows the user to browse CRMs and TFBSs in an organism, find the closest CRMs to a gene, search CRMs around a gene and find all TFBSs of a TF. PCRMS can be a useful resource for the research community to characterize regulatory genomes. Database URL: https://cci-bioinfo.uncc.edu/.
Collapse
Affiliation(s)
- Pengyu Ni
- Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, 9201 University City Boulevard, Charlotte, NC 28223, USA
| | - Zhengchang Su
- Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, 9201 University City Boulevard, Charlotte, NC 28223, USA
| |
Collapse
|
14
|
Prosperi M, Marini S, Boucher C. Fast and exact quantification of motif occurrences in biological sequences. BMC Bioinformatics 2021; 22:445. [PMID: 34537012 PMCID: PMC8449872 DOI: 10.1186/s12859-021-04355-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2021] [Accepted: 09/06/2021] [Indexed: 12/03/2022] Open
Abstract
BACKGROUND Identification of motifs and quantification of their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high computational complexity and are impractical to be used on large motif sets. Approximated formulae, e.g. based on compound Poisson, are faster, but reliable p value calculation remains challenging. Here, we introduce 'motif_prob', a fast implementation of an exact formula for motif count distribution through progressive approximation with arbitrary precision. Our implementation speeds up the exact calculation, usually impractical, making it feasible and posit to substitute currently employed heuristics. RESULTS We implement motif_prob in both Perl and C+ + languages, using an efficient error-bound iterative process for the exact formula, providing comparison with state-of-the-art tools (e.g. MoSDi) in terms of precision, run time benchmarks, along with a real-world use case on bacterial motif characterization. Our software is able to process a million of motifs (13-31 bases) over genome lengths of 5 million bases within the minute on a regular laptop, and the run times for both the Perl and C+ + code are several orders of magnitude smaller (50-1000× faster) than MoSDi, even when using their fast compound Poisson approximation (60-120× faster). In the real-world use cases, we first show the consistency of motif_prob with MoSDi, and then how the p-value quantification is crucial for enrichment quantification when bacteria have different GC content, using motifs found in antimicrobial resistance genes. The software and the code sources are available under the MIT license at https://github.com/DataIntellSystLab/motif_prob . CONCLUSIONS The motif_prob software is a multi-platform and efficient open source solution for calculating exact frequency distributions of motifs. It can be integrated with motif discovery/characterization tools for quantifying enrichment and deviation from expected frequency ranges with exact p values, without loss in data processing efficiency.
Collapse
Affiliation(s)
- Mattia Prosperi
- Data Intelligence Systems Lab, Department of Epidemiology, College of Public Health and Health Professions and College of Medicine, University of Florida, Gainesville, FL, USA.
| | - Simone Marini
- Data Intelligence Systems Lab, Department of Epidemiology, College of Public Health and Health Professions and College of Medicine, University of Florida, Gainesville, FL, USA
| | - Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
| |
Collapse
|
15
|
Ni P, Su Z. Accurate prediction of cis-regulatory modules reveals a prevalent regulatory genome of humans. NAR Genom Bioinform 2021; 3:lqab052. [PMID: 34159315 PMCID: PMC8210889 DOI: 10.1093/nargab/lqab052] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2021] [Revised: 05/01/2021] [Accepted: 06/14/2021] [Indexed: 02/07/2023] Open
Abstract
cis-regulatory modules(CRMs) formed by clusters of transcription factor (TF) binding sites (TFBSs) are as important as coding sequences in specifying phenotypes of humans. It is essential to categorize all CRMs and constituent TFBSs in the genome. In contrast to most existing methods that predict CRMs in specific cell types using epigenetic marks, we predict a largely cell type agonistic but more comprehensive map of CRMs and constituent TFBSs in the gnome by integrating all available TF ChIP-seq datasets. Our method is able to partition 77.47% of genome regions covered by available 6092 datasets into a CRM candidate (CRMC) set (56.84%) and a non-CRMC set (43.16%). Intriguingly, the predicted CRMCs are under strong evolutionary constraints, while the non-CRMCs are largely selectively neutral, strongly suggesting that the CRMCs are likely cis-regulatory, while the non-CRMCs are not. Our predicted CRMs are under stronger evolutionary constraints than three state-of-the-art predictions (GeneHancer, EnhancerAtlas and ENCODE phase 3) and substantially outperform them for recalling VISTA enhancers and non-coding ClinVar variants. We estimated that the human genome might encode about 1.47M CRMs and 68M TFBSs, comprising about 55% and 22% of the genome, respectively; for both of which, we predicted 80%. Therefore, the cis-regulatory genome appears to be more prevalent than originally thought.
Collapse
Affiliation(s)
- Pengyu Ni
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, 9201 University City Boulevard, Charlotte, NC 28223, USA
| | - Zhengchang Su
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, 9201 University City Boulevard, Charlotte, NC 28223, USA
| |
Collapse
|
16
|
Bailey TL. STREME: accurate and versatile sequence motif discovery. Bioinformatics 2021; 37:2834-2840. [PMID: 33760053 PMCID: PMC8479671 DOI: 10.1093/bioinformatics/btab203] [Citation(s) in RCA: 261] [Impact Index Per Article: 65.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2020] [Revised: 02/21/2021] [Accepted: 03/23/2021] [Indexed: 02/02/2023] Open
Abstract
MOTIVATION Sequence motif discovery algorithms can identify novel sequence patterns that perform biological functions in DNA, RNA and protein sequences-for example, the binding site motifs of DNA- and RNA-binding proteins. RESULTS The STREME algorithm presented here advances the state-of-the-art in ab initio motif discovery in terms of both accuracy and versatility. Using in vivo DNA (ChIP-seq) and RNA (CLIP-seq) data, and validating motifs with reference motifs derived from in vitro data, we show that STREME is more accurate, sensitive and thorough than several widely used algorithms (DREME, HOMER, MEME, Peak-motifs) and two other representative algorithms (ProSampler and Weeder). STREME's capabilities include the ability to find motifs in datasets with hundreds of thousands of sequences, to find both short and long motifs (from 3 to 30 positions), to perform differential motif discovery in pairs of sequence datasets, and to find motifs in sequences over virtually any alphabet (DNA, RNA, protein and user-defined alphabets). Unlike most motif discovery algorithms, STREME reports a useful estimate of the statistical significance of each motif it discovers. STREME is easy to use individually via its web server or via the command line, and is completely integrated with the widely used MEME Suite of sequence analysis tools. The name STREME stands for 'Simple, Thorough, Rapid, Enriched Motif Elicitation'. AVAILABILITY AND IMPLEMENTATION The STREME web server and source code are provided freely for non-commercial use at http://meme-suite.org. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Timothy L Bailey
- Department of Pharmacology, University of Nevada, Reno, NV 89557, USA,
| |
Collapse
|
17
|
Ni P, Su Z. Deciphering epigenomic code for cell differentiation using deep learning. BMC Genomics 2019; 20:709. [PMID: 31510916 PMCID: PMC6739944 DOI: 10.1186/s12864-019-6072-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2019] [Accepted: 08/29/2019] [Indexed: 12/31/2022] Open
Abstract
Background Although DNA sequence plays a crucial role in establishing the unique epigenome of a cell type, little is known about the sequence determinants that lead to the unique epigenomes of different cell types produced during cell differentiation. To fill this gap, we employed two types of deep convolutional neural networks (CNNs) constructed for each of differentially related cell types and for each of histone marks measured in the cells, to learn the sequence determinants of various histone modification patterns in each cell type. Results We applied our models to four differentially related human CD4+ T cell types and six histone marks measured in each cell type. The cell models can accurately predict the histone marks in each cell type, while the mark models can also accurately predict the cell types based on a single mark. Sequence motifs learned by both the cell or mark models are highly similar to known binding motifs of transcription factors known to play important roles in CD4+ T cell differentiation. Both the unique histone mark patterns in each cell type and the different patterns of the same histone mark in different cell types are determined by a set of motifs with unique combinations. Interestingly, the level of sharing motifs learned in the different cell models reflects the lineage relationships of the cells, while the level of sharing motifs learned in the different histone mark models reflects their functional relationships. These models can also enable the prediction of the importance of learned motifs and their interactions in determining specific histone mark patterns in the cell types. Conclusion Sequence determinants of various histone modification patterns in different cell types can be revealed by comparative analysis of motifs learned in the CNN models for multiple cell types and histone marks. The learned motifs are interpretable and may provide insights into the underlying molecular mechanisms of establishing the unique epigenomes in different cell types. Thus, our results support the hypothesis that DNA sequences ultimately determine the unique epigenomes of different cell types through their interactions with transcriptional factors, epigenome remodeling system and extracellular cues during cell differentiation. Electronic supplementary material The online version of this article (10.1186/s12864-019-6072-8) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Pengyu Ni
- Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, 9201 University City Boulevard, Charlotte, NC, 28223, USA
| | - Zhengchang Su
- Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, 9201 University City Boulevard, Charlotte, NC, 28223, USA.
| |
Collapse
|