1
|
Luo R, Liu J, Guan L, Li M. HybProm: An attention-assisted hybrid CNN-BiLSTM model for the interpretable prediction of DNA promoter. Methods 2025; 235:71-80. [PMID: 39929298 DOI: 10.1016/j.ymeth.2025.02.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2024] [Revised: 01/18/2025] [Accepted: 02/03/2025] [Indexed: 02/13/2025] Open
Abstract
Promoter prediction is essential for analyzing gene structures, understanding regulatory networks, transcription mechanisms, and precisely controlling gene expression. Recently, computational and deep learning methods for promoter prediction have gained attention. However, there is still room to improve their accuracy. To address this, we propose the HybProm model, which uses DNA2Vec to transform DNA sequences into low-dimensional vectors, followed by a CNN-BiLSTM-Attention architecture to extract features and predict promoters across species, including E. coli, humans, mice, and plants. Experiments show that HybProm consistently achieves high accuracy (90%-99%) and offers good interpretability by identifying key sequence patterns and positions that drive predictions.
Collapse
Affiliation(s)
- Rentao Luo
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000 Jiangxi, China
| | - Jiawei Liu
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000 Jiangxi, China
| | - Lixin Guan
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000 Jiangxi, China
| | - Mengshan Li
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000 Jiangxi, China.
| |
Collapse
|
2
|
Wang JY, Xie ZX, Cui YZ, Li BZ, Yuan YJ. Artificial design of the genome: from sequences to the 3D structure of chromosomes. Trends Biotechnol 2025; 43:304-317. [PMID: 39299833 DOI: 10.1016/j.tibtech.2024.08.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2024] [Revised: 07/18/2024] [Accepted: 08/27/2024] [Indexed: 09/22/2024]
Abstract
Genome design is the foundation of genome synthesis, which provides a new platform for deepening our understanding of biological systems by exploring the fundamental components and structure of the genome. Artificial genome designs can endow unnatural genomes with desired functions. We provide a comprehensive overview of genome design principles ranging from DNA sequences to the 3D structure of chromosomes. Furthermore, we highlight applications of genome design in gene expression, genome structure, genome function, and biocontainment, and discuss the potential of artificial intelligence (AI) in genome design.
Collapse
Affiliation(s)
- Jun-Yi Wang
- Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China; Frontiers Research Institute for Synthetic Biology, Tianjin University, Tianjin 300072, China
| | - Ze-Xiong Xie
- Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China; Frontiers Research Institute for Synthetic Biology, Tianjin University, Tianjin 300072, China
| | - You-Zhi Cui
- Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China; Frontiers Research Institute for Synthetic Biology, Tianjin University, Tianjin 300072, China
| | - Bing-Zhi Li
- Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China; Frontiers Research Institute for Synthetic Biology, Tianjin University, Tianjin 300072, China.
| | - Ying-Jin Yuan
- Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China; Frontiers Research Institute for Synthetic Biology, Tianjin University, Tianjin 300072, China
| |
Collapse
|
3
|
Li S, Noroozizadeh S, Moayedpour S, Kogler-Anele L, Xue Z, Zheng D, Montoya FU, Agarwal V, Bar-Joseph Z, Jager S. mRNA-LM: full-length integrated SLM for mRNA analysis. Nucleic Acids Res 2025; 53:gkaf044. [PMID: 39898548 DOI: 10.1093/nar/gkaf044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2024] [Revised: 12/07/2024] [Accepted: 01/20/2025] [Indexed: 02/04/2025] Open
Abstract
The success of SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) messenger RNA (mRNA) vaccine has led to increased interest in the design and use of mRNA for vaccines and therapeutics. Still, selecting the most appropriate mRNA sequence for a protein remains a challenge. Several recent studies have shown that the specific mRNA sequence can have a significant impact on the translation efficiency, half-life, degradation rates, and other issues that play a major role in determining vaccine efficiency. To enable the selection of the most appropriate sequence, we developed mRNA-LM, an integrated small language model for modeling the entire mRNA sequence. mRNA-LM uses the contrastive language-image pretraining integration technology to combine three separate language models for the different mRNA segments. We trained mRNA-LM on millions of diverse mRNA sequences from several different species. The unsupervised model was able to learn meaningful biology related to evolution and host-pathogen interactions. Fine-tuning of mRNA-LM allowed us to use it in several mRNA property prediction tasks. As we show, using the full-length integrated model led to accurate predictions, improving on prior methods proposed for this task.
Collapse
Affiliation(s)
- Sizhen Li
- Digital R&D, Sanofi, Cambridge, MA 02141, United States
| | - Shahriar Noroozizadeh
- Digital R&D, Sanofi, Cambridge, MA 02141, United States
- Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, United States
- Heinz College, Carnegie Mellon University, Pittsburgh, PA 15213, United States
| | | | | | - Zexin Xue
- Digital R&D, Sanofi, Cambridge, MA 02141, United States
- Department of Computer Science, University of Toronto, Toronto, ON M5S 2E4, Canada
| | - Dinghai Zheng
- mRNA Center of Excellence, Sanofi, Waltham, MA 02451, United States
| | | | - Vikram Agarwal
- mRNA Center of Excellence, Sanofi, Waltham, MA 02451, United States
| | | | - Sven Jager
- Digital R&D, Sanofi, Cambridge, MA 02141, United States
| |
Collapse
|
4
|
Zeng R, Li Z, Li J, Zhang Q. DNA promoter task-oriented dictionary mining and prediction model based on natural language technology. Sci Rep 2025; 15:153. [PMID: 39747934 PMCID: PMC11697570 DOI: 10.1038/s41598-024-84105-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2024] [Accepted: 12/19/2024] [Indexed: 01/04/2025] Open
Abstract
Promoters are essential DNA sequences that initiate transcription and regulate gene expression. Precisely identifying promoter sites is crucial for deciphering gene expression patterns and the roles of gene regulatory networks. Recent advancements in bioinformatics have leveraged deep learning and natural language processing (NLP) to enhance promoter prediction accuracy. Techniques such as convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and BERT models have been particularly impactful. However, current approaches often rely on arbitrary DNA sequence segmentation during BERT pre-training, which may not yield optimal results. To overcome this limitation, this article introduces a novel DNA sequence segmentation method. This approach develops a more refined dictionary for DNA sequences, utilizes it for BERT pre-training, and employs an Inception neural network as the foundational model. This BERT-Inception architecture captures information across multiple granularities. Experimental results show that the model improves the performance of several downstream tasks and introduces deep learning interpretability, providing new perspectives for interpreting and understanding DNA sequence information. The detailed source code is available at https://github.com/katouMegumiH/Promoter_BERT .
Collapse
Affiliation(s)
- Ruolei Zeng
- Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, 55455, USA
| | - Zihan Li
- National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, No.11 Fucheng Road, Beijing, 100048, China.
| | - Jialu Li
- National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, No.11 Fucheng Road, Beijing, 100048, China
| | - Qingchuan Zhang
- National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, No.11 Fucheng Road, Beijing, 100048, China.
| |
Collapse
|
5
|
Du Q, Poon MN, Zeng X, Zhang P, Wei Z, Wang H, Wang Y, Wei L, Wang X. Synthetic promoter design in Escherichia coli based on multinomial diffusion model. iScience 2024; 27:111207. [PMID: 39524356 PMCID: PMC11550136 DOI: 10.1016/j.isci.2024.111207] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2024] [Revised: 10/03/2024] [Accepted: 10/15/2024] [Indexed: 11/16/2024] Open
Abstract
Generative design of promoters has enhanced the efficiency of de novo creation of functional sequences. Though several deep generative models have been employed in biological sequence generation, including variational autoencoder (VAE) or Wasserstein generative adversarial network (WGAN), these models might struggle with mode collapse and low sample diversity. In this study, we introduce the multinomial diffusion model (MDM) for promoter sequence design and propose a structured set of criteria for effectively comparing the performance of generative models. In silico experiments demonstrate that MDM outperforms existing generative AI approaches. MDM demonstrates superior performance in various computational evaluations, remains robust during the training process, and exhibits a strong ability in capturing weak signals. In addition, we experimentally validated that the majority of our model designed promoters have expression activities in vivo, indicating the practicality and potential of MDM for bioengineering.
Collapse
Affiliation(s)
- Qixiu Du
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - May Nee Poon
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Xiaocheng Zeng
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Pengcheng Zhang
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Zheng Wei
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Haochen Wang
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Ye Wang
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Lei Wei
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Xiaowo Wang
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China
| |
Collapse
|
6
|
Boretti A. The transformative potential of AI-driven CRISPR-Cas9 genome editing to enhance CAR T-cell therapy. Comput Biol Med 2024; 182:109137. [PMID: 39260044 DOI: 10.1016/j.compbiomed.2024.109137] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2024] [Revised: 08/31/2024] [Accepted: 09/08/2024] [Indexed: 09/13/2024]
Abstract
This narrative review examines the promising potential of integrating artificial intelligence (AI) with CRISPR-Cas9 genome editing to advance CAR T-cell therapy. AI algorithms offer unparalleled precision in identifying genetic targets, essential for enhancing the therapeutic efficacy of CAR T-cell treatments. This precision is critical for eliminating negative regulatory elements that undermine therapy effectiveness. Additionally, AI streamlines the manufacturing process, significantly reducing costs and increasing accessibility, thereby encouraging further research and development investment. A key benefit of AI integration is improved safety; by predicting and minimizing off-target effects, AI enhances the specificity of CRISPR-Cas9 edits, contributing to safer CAR T-cell therapy. This advancement is crucial for patient safety and broader clinical adoption. The convergence of AI and CRISPR-Cas9 has transformative potential, poised to revolutionize personalized immunotherapy. These innovations could expand the application of CAR T-cell therapy beyond hematologic malignancies to various solid tumors and other non-hematologic conditions, heralding a new era in cancer treatment that substantially improves patient outcomes.
Collapse
|
7
|
Zhang Q, Wei Y, Liu L. GraphPro: An interpretable graph neural network-based model for identifying promoters in multiple species. Comput Biol Med 2024; 180:108974. [PMID: 39096613 DOI: 10.1016/j.compbiomed.2024.108974] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/24/2024] [Revised: 07/29/2024] [Accepted: 07/30/2024] [Indexed: 08/05/2024]
Abstract
Promoters are DNA sequences that bind with RNA polymerase to initiate transcription, regulating this process through interactions with transcription factors. Accurate identification of promoters is crucial for understanding gene expression regulation mechanisms and developing therapeutic approaches for various diseases. However, experimental techniques for promoter identification are often expensive, time-consuming, and inefficient, necessitating the development of accurate and efficient computational models for this task. Enhancing the model's ability to recognize promoters across multiple species and improving its interpretability pose significant challenges. In this study, we introduce a novel interpretable model based on graph neural networks, named GraphPro, for multi-species promoter identification. Initially, we encode the sequences using k-tuple nucleotide frequency pattern, dinucleotide physicochemical properties, and dna2vec. Subsequently, we construct two feature extraction modules based on convolutional neural networks and graph neural networks. These modules aim to extract specific motifs from the promoters, learn their dependencies, and capture the underlying structural features of the promoters, providing a more comprehensive representation. Finally, a fully connected neural network predicts whether the input sequence is a promoter. We conducted extensive experiments on promoter datasets from eight species, including Human, Mouse, and Escherichia coli. The experimental results show that the average Sn, Sp, Acc and MCC values of GraphPro are 0.9123, 0.9482, 0.8840 and 0.7984, respectively. Compared with previous promoter identification methods, GraphPro not only achieves better recognition accuracy on multiple species, but also outperforms all previous methods in cross-species prediction ability. Furthermore, by visualizing GraphPro's decision process and analyzing the sequences matching the transcription factor binding motifs captured by the model, we validate its significant advantages in biological interpretability. The source code for GraphPro is available at https://github.com/liuliwei1980/GraphPro.
Collapse
Affiliation(s)
- Qi Zhang
- College of Science, Dalian Jiaotong University, Dalian, 116028, China
| | - Yuxiao Wei
- College of Software, Dalian Jiaotong University, Dalian, 116028, China
| | - Liwei Liu
- College of Science, Dalian Jiaotong University, Dalian, 116028, China.
| |
Collapse
|
8
|
Wang X, Li F, Zhang Y, Imoto S, Shen HH, Li S, Guo Y, Yang J, Song J. Deep learning approaches for non-coding genetic variant effect prediction: current progress and future prospects. Brief Bioinform 2024; 25:bbae446. [PMID: 39276327 PMCID: PMC11401448 DOI: 10.1093/bib/bbae446] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2024] [Revised: 08/08/2024] [Accepted: 08/27/2024] [Indexed: 09/16/2024] Open
Abstract
Recent advancements in high-throughput sequencing technologies have significantly enhanced our ability to unravel the intricacies of gene regulatory processes. A critical challenge in this endeavor is the identification of variant effects, a key factor in comprehending the mechanisms underlying gene regulation. Non-coding variants, constituting over 90% of all variants, have garnered increasing attention in recent years. The exploration of gene variant impacts and regulatory mechanisms has spurred the development of various deep learning approaches, providing new insights into the global regulatory landscape through the analysis of extensive genetic data. Here, we provide a comprehensive overview of the development of the non-coding variants models based on bulk and single-cell sequencing data and their model-based interpretation and downstream tasks. This review delineates the popular sequencing technologies for epigenetic profiling and deep learning approaches for discerning the effects of non-coding variants. Additionally, we summarize the limitations of current approaches in variant effect prediction research and outline opportunities for improvement. We anticipate that our study will offer a practical and useful guide for the bioinformatic community to further advance the unraveling of genetic variant effects.
Collapse
Affiliation(s)
- Xiaoyu Wang
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| | - Fuyi Li
- South Australian immunoGENomics Cancer Institute (SAiGENCI), Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, SA 5005, Australia
| | - Yiwen Zhang
- School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC 3004, Australia
| | - Seiya Imoto
- Genome Center, Institute of Medical Science, The University of Tokyo, Minato-ku, Tokyo 108-8639, Japan
- Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, Bunkyo-ku, Tokyo 113-8657, Japan
| | - Hsin-Hui Shen
- Department of Materials Science and Engineering, Faculty of Engineering, Monash University, Clayton, VIC 3800, Australia
| | - Shanshan Li
- School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC 3004, Australia
| | - Yuming Guo
- School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC 3004, Australia
| | - Jian Yang
- School of Life Sciences, Westlake University, Hangzhou, Zhejiang 310030, China
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang 310024, China
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| |
Collapse
|
9
|
Moeckel C, Mouratidis I, Chantzi N, Uzun Y, Georgakopoulos-Soares I. Advances in computational and experimental approaches for deciphering transcriptional regulatory networks: Understanding the roles of cis-regulatory elements is essential, and recent research utilizing MPRAs, STARR-seq, CRISPR-Cas9, and machine learning has yielded valuable insights. Bioessays 2024; 46:e2300210. [PMID: 38715516 PMCID: PMC11444527 DOI: 10.1002/bies.202300210] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2023] [Revised: 04/22/2024] [Accepted: 04/23/2024] [Indexed: 05/16/2024]
Abstract
Understanding the influence of cis-regulatory elements on gene regulation poses numerous challenges given complexities stemming from variations in transcription factor (TF) binding, chromatin accessibility, structural constraints, and cell-type differences. This review discusses the role of gene regulatory networks in enhancing understanding of transcriptional regulation and covers construction methods ranging from expression-based approaches to supervised machine learning. Additionally, key experimental methods, including MPRAs and CRISPR-Cas9-based screening, which have significantly contributed to understanding TF binding preferences and cis-regulatory element functions, are explored. Lastly, the potential of machine learning and artificial intelligence to unravel cis-regulatory logic is analyzed. These computational advances have far-reaching implications for precision medicine, therapeutic target discovery, and the study of genetic variations in health and disease.
Collapse
Affiliation(s)
- Camille Moeckel
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
| | - Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Yasin Uzun
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
- Department of Pediatrics, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
| |
Collapse
|
10
|
Abbas Q, Wilhelm M, Kuster B, Poppenberger B, Frishman D. Exploring crop genomes: assembly features, gene prediction accuracy, and implications for proteomics studies. BMC Genomics 2024; 25:619. [PMID: 38898442 PMCID: PMC11186247 DOI: 10.1186/s12864-024-10521-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2024] [Accepted: 06/13/2024] [Indexed: 06/21/2024] Open
Abstract
Plant genomics plays a pivotal role in enhancing global food security and sustainability by offering innovative solutions for improving crop yield, disease resistance, and stress tolerance. As the number of sequenced genomes grows and the accuracy and contiguity of genome assemblies improve, structural annotation of plant genomes continues to be a significant challenge due to their large size, polyploidy, and rich repeat content. In this paper, we present an overview of the current landscape in crop genomics research, highlighting the diversity of genomic characteristics across various crop species. We also assessed the accuracy of popular gene prediction tools in identifying genes within crop genomes and examined the factors that impact their performance. Our findings highlight the strengths and limitations of BRAKER2 and Helixer as leading structural genome annotation tools and underscore the impact of genome complexity, fragmentation, and repeat content on their performance. Furthermore, we evaluated the suitability of the predicted proteins as a reliable search space in proteomics studies using mass spectrometry data. Our results provide valuable insights for future efforts to refine and advance the field of structural genome annotation.
Collapse
Affiliation(s)
- Qussai Abbas
- Chair of Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
| | - Mathias Wilhelm
- Computational Mass Spectrometry, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
- Munich Data Science Institute, Technical University of Munich, Garching, Germany
| | - Bernhard Kuster
- Munich Data Science Institute, Technical University of Munich, Garching, Germany
- Chair of Proteomics and Bioanalytics, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
| | - Brigitte Poppenberger
- Biotechnology of Horticultural Crops, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
| | - Dmitrij Frishman
- Chair of Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Freising, Germany.
| |
Collapse
|
11
|
Tenekeci S, Tekir S. Identifying promoter and enhancer sequences by graph convolutional networks. Comput Biol Chem 2024; 110:108040. [PMID: 38430611 DOI: 10.1016/j.compbiolchem.2024.108040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2023] [Revised: 01/09/2024] [Accepted: 02/27/2024] [Indexed: 03/05/2024]
Abstract
Identification of promoters, enhancers, and their interactions helps understand genetic regulation. This study proposes a graph-based semi-supervised learning model (GCN4EPI) for the enhancer-promoter classification problem. We adopt a graph convolutional network (GCN) architecture to integrate interaction information with sequence features. Nodes of the constructed graph hold word embeddings of DNA sequences while edges hold the Enhancer-Promoter Interaction (EPI) information. By means of semi-supervised learning, much less data (16%) and time are needed in model training. Comparisons on a benchmark dataset of six human cell lines show that the proposed approach outperforms the state-of-the-art methods by a large margin (10% higher F1 score) and has the fastest training time (up to 3 times). Moreover, GCN4EPI's performance on cross-cell line data is also better than the baselines (3% higher F1 score). Our qualitative analyses with graph explainability models prove that GCN4EPI learns from both text and graph structure. The results suggest that integrating interaction information with sequence features improves predictive performance and compensates for the number of training instances.
Collapse
Affiliation(s)
- Samet Tenekeci
- Department of Computer Engineering, Izmir Institute of Technology, Izmir, 35430, Turkiye
| | - Selma Tekir
- Department of Computer Engineering, Izmir Institute of Technology, Izmir, 35430, Turkiye.
| |
Collapse
|
12
|
Kindel F, Triesch S, Schlüter U, Randarevitch LA, Reichel-Deland V, Weber APM, Denton AK. Predmoter-cross-species prediction of plant promoter and enhancer regions. BIOINFORMATICS ADVANCES 2024; 4:vbae074. [PMID: 38841126 PMCID: PMC11150885 DOI: 10.1093/bioadv/vbae074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 01/04/2024] [Revised: 04/10/2024] [Accepted: 05/22/2024] [Indexed: 06/07/2024]
Abstract
Motivation Identifying cis-regulatory elements (CREs) is crucial for analyzing gene regulatory networks. Next generation sequencing methods were developed to identify CREs but represent a considerable expenditure for targeted analysis of few genomic loci. Thus, predicting the outputs of these methods would significantly cut costs and time investment. Results We present Predmoter, a deep neural network that predicts base-wise Assay for Transposase Accessible Chromatin using sequencing (ATAC-seq) and histone Chromatin immunoprecipitation DNA-sequencing (ChIP-seq) read coverage for plant genomes. Predmoter uses only the DNA sequence as input. We trained our final model on 21 species for 13 of which ATAC-seq data and for 17 of which ChIP-seq data was publicly available. We evaluated our models on Arabidopsis thaliana and Oryza sativa. Our best models showed accurate predictions in peak position and pattern for ATAC- and histone ChIP-seq. Annotating putatively accessible chromatin regions provides valuable input for the identification of CREs. In conjunction with other in silico data, this can significantly reduce the search space for experimentally verifiable DNA-protein interaction pairs. Availability and implementation The source code for Predmoter is available at: https://github.com/weberlab-hhu/Predmoter. Predmoter takes a fasta file as input and outputs h5, and optionally bigWig and bedGraph files.
Collapse
Affiliation(s)
- Felicitas Kindel
- Institute of Plant Biochemistry, Math.-Nat. Faculty, Heinrich Heine University, Düsseldorf 40225, Germany
| | - Sebastian Triesch
- Institute of Plant Biochemistry, Math.-Nat. Faculty, Heinrich Heine University, Düsseldorf 40225, Germany
- Cluster of Excellence on Plant Sciences (CEPLAS), Germany
| | - Urte Schlüter
- Institute of Plant Biochemistry, Math.-Nat. Faculty, Heinrich Heine University, Düsseldorf 40225, Germany
| | - Laura Alexandra Randarevitch
- Cluster of Excellence on Plant Sciences (CEPLAS), Germany
- Institute of Population Genetics, Math.-Nat. Faculty, Heinrich Heine University, Düsseldorf 40225, Germany
| | - Vanessa Reichel-Deland
- Institute of Plant Biochemistry, Math.-Nat. Faculty, Heinrich Heine University, Düsseldorf 40225, Germany
| | - Andreas P M Weber
- Institute of Plant Biochemistry, Math.-Nat. Faculty, Heinrich Heine University, Düsseldorf 40225, Germany
- Cluster of Excellence on Plant Sciences (CEPLAS), Germany
| | - Alisandra K Denton
- Institute of Plant Biochemistry, Math.-Nat. Faculty, Heinrich Heine University, Düsseldorf 40225, Germany
- Cluster of Excellence on Plant Sciences (CEPLAS), Germany
- Valence Labs, Montréal, Québec H2S 3H1, Canada
| |
Collapse
|
13
|
Guo J, Lin LF, Oraskovich SV, Rivera de Jesús JA, Listgarten J, Schaffer DV. Computationally guided AAV engineering for enhanced gene delivery. Trends Biochem Sci 2024; 49:457-469. [PMID: 38531696 PMCID: PMC11456259 DOI: 10.1016/j.tibs.2024.03.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2023] [Revised: 02/22/2024] [Accepted: 03/01/2024] [Indexed: 03/28/2024]
Abstract
Gene delivery vehicles based on adeno-associated viruses (AAVs) are enabling increasing success in human clinical trials, and they offer the promise of treating a broad spectrum of both genetic and non-genetic disorders. However, delivery efficiency and targeting must be improved to enable safe and effective therapies. In recent years, considerable effort has been invested in creating AAV variants with improved delivery, and computational approaches have been increasingly harnessed for AAV engineering. In this review, we discuss how computationally designed AAV libraries are enabling directed evolution. Specifically, we highlight approaches that harness sequences outputted by next-generation sequencing (NGS) coupled with machine learning (ML) to generate new functional AAV capsids and related regulatory elements, pushing the frontier of what vector engineering and gene therapy may achieve.
Collapse
Affiliation(s)
- Jingxuan Guo
- California Institute for Quantitative Biosciences (QB3), University of California, Berkeley, CA 94720, USA
| | - Li F Lin
- Department of Chemical and Biomolecular Engineering, University of California, Berkeley, CA 94720, USA
| | - Sydney V Oraskovich
- Department of Bioengineering, University of California, Berkeley, CA 94720, USA; Graduate Program in Bioengineering, University of California, San Francisco and University of California, Berkeley, CA 94720, USA
| | - Julio A Rivera de Jesús
- Department of Bioengineering, University of California, Berkeley, CA 94720, USA; Graduate Program in Bioengineering, University of California, San Francisco and University of California, Berkeley, CA 94720, USA; Department of Neurological Surgery, University of California, San Francisco, CA 94143, USA
| | - Jennifer Listgarten
- Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720, USA
| | - David V Schaffer
- California Institute for Quantitative Biosciences (QB3), University of California, Berkeley, CA 94720, USA; Department of Chemical and Biomolecular Engineering, University of California, Berkeley, CA 94720, USA; Department of Bioengineering, University of California, Berkeley, CA 94720, USA; Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720, USA.
| |
Collapse
|
14
|
Lee H, Ozbulak U, Park H, Depuydt S, De Neve W, Vankerschaver J. Assessing the reliability of point mutation as data augmentation for deep learning with genomic data. BMC Bioinformatics 2024; 25:170. [PMID: 38689247 PMCID: PMC11059627 DOI: 10.1186/s12859-024-05787-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/03/2024] [Accepted: 04/15/2024] [Indexed: 05/02/2024] Open
Abstract
BACKGROUND Deep neural networks (DNNs) have the potential to revolutionize our understanding and treatment of genetic diseases. An inherent limitation of deep neural networks, however, is their high demand for data during training. To overcome this challenge, other fields, such as computer vision, use various data augmentation techniques to artificially increase the available training data for DNNs. Unfortunately, most data augmentation techniques used in other domains do not transfer well to genomic data. RESULTS Most genomic data possesses peculiar properties and data augmentations may significantly alter the intrinsic properties of the data. In this work, we propose a novel data augmentation technique for genomic data inspired by biology: point mutations. By employing point mutations as substitutes for codons, we demonstrate that our newly proposed data augmentation technique enhances the performance of DNNs across various genomic tasks that involve coding regions, such as translation initiation and splice site detection. CONCLUSION Silent and missense mutations are found to positively influence effectiveness, while nonsense mutations and random mutations in non-coding regions generally lead to degradation. Overall, point mutation-based augmentations in genomic datasets present valuable opportunities for improving the accuracy and reliability of predictive models for DNA sequences.
Collapse
Affiliation(s)
| | - Utku Ozbulak
- Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
| | - Homin Park
- Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
- IDLab, Department of Electronics and Information Systems, Ghent University, Ghent, Belgium
| | - Stephen Depuydt
- Erasmus Brussels University of Applied Sciences and Arts, Brussels, Belgium
| | - Wesley De Neve
- Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
- IDLab, Department of Electronics and Information Systems, Ghent University, Ghent, Belgium
| | - Joris Vankerschaver
- Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea.
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium.
| |
Collapse
|
15
|
Abbasi AF, Asim MN, Ahmed S, Dengel A. Long extrachromosomal circular DNA identification by fusing sequence-derived features of physicochemical properties and nucleotide distribution patterns. Sci Rep 2024; 14:9466. [PMID: 38658614 PMCID: PMC11043385 DOI: 10.1038/s41598-024-57457-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2023] [Accepted: 03/18/2024] [Indexed: 04/26/2024] Open
Abstract
Long extrachromosomal circular DNA (leccDNA) regulates several biological processes such as genomic instability, gene amplification, and oncogenesis. The identification of leccDNA holds significant importance to investigate its potential associations with cancer, autoimmune, cardiovascular, and neurological diseases. In addition, understanding these associations can provide valuable insights about disease mechanisms and potential therapeutic approaches. Conventionally, wet lab-based methods are utilized to identify leccDNA, which are hindered by the need for prior knowledge, and resource-intensive processes, potentially limiting their broader applicability. To empower the process of leccDNA identification across multiple species, the paper in hand presents the very first computational predictor. The proposed iLEC-DNA predictor makes use of SVM classifier along with sequence-derived nucleotide distribution patterns and physicochemical properties-based features. In addition, the study introduces a set of 12 benchmark leccDNA datasets related to three species, namely Homo sapiens (HM), Arabidopsis Thaliana (AT), and Saccharomyces cerevisiae (SC/YS). It performs large-scale experimentation across 12 benchmark datasets under different experimental settings using the proposed predictor, more than 140 baseline predictors, and 858 encoder ensembles. The proposed predictor outperforms baseline predictors and encoder ensembles across diverse leccDNA datasets by producing average performance values of 81.09%, 62.2% and 81.08% in terms of ACC, MCC and AUC-ROC across all the datasets. The source code of the proposed and baseline predictors is available at https://github.com/FAhtisham/Extrachrosmosomal-DNA-Prediction . To facilitate the scientific community, a web application for leccDNA identification is available at https://sds_genetic_analysis.opendfki.de/iLEC_DNA/.
Collapse
Affiliation(s)
- Ahtisham Fazeel Abbasi
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, 67663, Kaiserslautern, Germany.
- German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Germany.
| | - Muhammad Nabeel Asim
- German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Germany.
| | - Sheraz Ahmed
- German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Germany
| | - Andreas Dengel
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, 67663, Kaiserslautern, Germany
- German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Germany
| |
Collapse
|
16
|
Ledesma-Dominguez L, Carbajal-Degante E, Moreno-Hagelsieb G, Pérez-Rueda E. DeepReg: a deep learning hybrid model for predicting transcription factors in eukaryotic and prokaryotic genomes. Sci Rep 2024; 14:9155. [PMID: 38644393 PMCID: PMC11551149 DOI: 10.1038/s41598-024-59487-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2023] [Accepted: 04/11/2024] [Indexed: 04/23/2024] Open
Abstract
Deep learning models (DLMs) have gained importance in predicting, detecting, translating, and classifying a diversity of inputs. In bioinformatics, DLMs have been used to predict protein structures, transcription factor-binding sites, and promoters. In this work, we propose a hybrid model to identify transcription factors (TFs) among prokaryotic and eukaryotic protein sequences, named Deep Regulation (DeepReg) model. Two architectures were used in the DL model: a convolutional neural network (CNN), and a bidirectional long-short-term memory (BiLSTM). DeepReg reached a precision of 0.99, a recall of 0.97, and an F1-score of 0.98. The quality of our predictions, the bias-variance trade-off approach, and the characterization of new TF predictions were evaluated and compared against those produced by DeepTFactor, as well as against experimental data from three model organisms. Predictions based on our DLM tended to exhibit less variance and bias than those from DeepTFactor, thus increasing reliability and decreasing overfitting.
Collapse
Affiliation(s)
- Leonardo Ledesma-Dominguez
- Posgrado en Ciencia en Ingeniería de la Computación, Universidad Nacional Autónoma de México, 04510, Mexico City, Mexico.
- Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, UNAM, 04510, Mexico City, México.
| | - Erik Carbajal-Degante
- Coordinación de Universidad Abierta, Innovación Educativa y Educación a Distancia (CUAIEED), Universidad Nacional Autónoma de México, 04510, Mexico City, México
| | | | - Ernesto Pérez-Rueda
- Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Unidad Académica del Estado de Yucatán, Universidad Nacional Autónoma de México, Mérida, Yucatán, México.
| |
Collapse
|
17
|
Yang G, Li J, Hu J, Shi JY. Recognition of cyanobacteria promoters via Siamese network-based contrastive learning under novel non-promoter generation. Brief Bioinform 2024; 25:bbae193. [PMID: 38701419 PMCID: PMC11066903 DOI: 10.1093/bib/bbae193] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2024] [Revised: 03/08/2024] [Accepted: 04/05/2024] [Indexed: 05/05/2024] Open
Abstract
It is a vital step to recognize cyanobacteria promoters on a genome-wide scale. Computational methods are promising to assist in difficult biological identification. When building recognition models, these methods rely on non-promoter generation to cope with the lack of real non-promoters. Nevertheless, the factitious significant difference between promoters and non-promoters causes over-optimistic prediction. Moreover, designed for E. coli or B. subtilis, existing methods cannot uncover novel, distinct motifs among cyanobacterial promoters. To address these issues, this work first proposes a novel non-promoter generation strategy called phantom sampling, which can eliminate the factitious difference between promoters and generated non-promoters. Furthermore, it elaborates a novel promoter prediction model based on the Siamese network (SiamProm), which can amplify the hidden difference between promoters and non-promoters through a joint characterization of global associations, upstream and downstream contexts, and neighboring associations w.r.t. k-mer tokens. The comparison with state-of-the-art methods demonstrates the superiority of our phantom sampling and SiamProm. Both comprehensive ablation studies and feature space illustrations also validate the effectiveness of the Siamese network and its components. More importantly, SiamProm, upon our phantom sampling, finds a novel cyanobacterial promoter motif ('GCGATCGC'), which is palindrome-patterned, content-conserved, but position-shifted.
Collapse
Affiliation(s)
- Guang Yang
- School of Life Sciences, Northwestern Polytechnical University, Xi’an, Shaanxi, 710072, China
| | - Jianing Li
- School of Computer Science, Northwestern Polytechnical University, Xi’an, Shaanxi, 710072, China
| | - Jinlu Hu
- School of Life Sciences, Northwestern Polytechnical University, Xi’an, Shaanxi, 710072, China
| | - Jian-Yu Shi
- School of Life Sciences, Northwestern Polytechnical University, Xi’an, Shaanxi, 710072, China
| |
Collapse
|
18
|
Yan Y, Li W, Wang S, Huang T. Seq-RBPPred: Predicting RNA-Binding Proteins from Sequence. ACS OMEGA 2024; 9:12734-12742. [PMID: 38524500 PMCID: PMC10955590 DOI: 10.1021/acsomega.3c08381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/26/2023] [Revised: 12/18/2023] [Accepted: 12/28/2023] [Indexed: 03/26/2024]
Abstract
RNA-binding proteins (RBPs) can interact with RNAs to regulate RNA translation, modification, splicing, and other important biological processes. The accurate identification of RBPs is of paramount importance for gaining insights into the intricate mechanisms underlying organismal life activities. Traditional experimental methods to predict RBPs require a lot of time and money, so it is important to develop computational methods to predict RBPs. However, the existing approaches for RBP prediction still require further improvement due to unidentified RBPs in many species. In this study, we present Seq-RBPPred (predicting RBPs from sequence), a novel method that utilizes a comprehensive feature representation encompassing both biophysical properties and hidden-state features derived from protein sequences. In the results, comprehensive performance evaluations of Seq-RBPPred its superiority compare with state-of-the-art methods, yielding impressive performance including 0.922 for overall accuracy, 0.926 for sensitivity, 0.903 for specificity, and Matthew's correlation coefficient (MCC) of 0.757 as ascertained from the evaluation of the testing set. The data and code of Seq-RBPPred are available at https://github.com/yaoyao-11/Seq-RBPPred.
Collapse
Affiliation(s)
- Yuyao Yan
- CAS Key Laboratory of Computational
Biology, Shanghai Institute of Nutrition and Health, Chinese Academy
of Sciences, University of Chinese Academy
of Sciences, Shanghai 200021, China
| | - Wenran Li
- CAS Key Laboratory of Computational
Biology, Shanghai Institute of Nutrition and Health, Chinese Academy
of Sciences, University of Chinese Academy
of Sciences, Shanghai 200021, China
| | - Sijia Wang
- CAS Key Laboratory of Computational
Biology, Shanghai Institute of Nutrition and Health, Chinese Academy
of Sciences, University of Chinese Academy
of Sciences, Shanghai 200021, China
| | - Tao Huang
- CAS Key Laboratory of Computational
Biology, Shanghai Institute of Nutrition and Health, Chinese Academy
of Sciences, University of Chinese Academy
of Sciences, Shanghai 200021, China
| |
Collapse
|
19
|
Robson ES, Ioannidis NM. GUANinE v1.0: Benchmark Datasets for Genomic AI Sequence-to-Function Models. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.10.12.562113. [PMID: 37904945 PMCID: PMC10614795 DOI: 10.1101/2023.10.12.562113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/01/2023]
Abstract
Computational genomics increasingly relies on machine learning methods for genome interpretation, and the recent adoption of neural sequence-to-function models highlights the need for rigorous model specification and controlled evaluation, problems familiar to other fields of AI. Research strategies that have greatly benefited other fields - including benchmarking, auditing, and algorithmic fairness - are also needed to advance the field of genomic AI and to facilitate model development. Here we propose a genomic AI benchmark, GUANinE, for evaluating model generalization across a number of distinct genomic tasks. Compared to existing task formulations in computational genomics, GUANinE is large-scale, de-noised, and suitable for evaluating pretrained models. GUANinE v1.0 primarily focuses on functional genomics tasks such as functional element annotation and gene expression prediction, and it also draws upon connections to evolutionary biology through sequence conservation tasks. The current GUANinE tasks provide insight into the performance of existing genomic AI models and non-neural baselines, with opportunities to be refined, revisited, and broadened as the field matures. Finally, the GUANinE benchmark allows us to evaluate new self-supervised T5 models and explore the tradeoffs between tokenization and model performance, while showcasing the potential for self-supervision to complement existing pretraining procedures.
Collapse
Affiliation(s)
- Eyes S Robson
- Center for Computational Biology, UC Berkeley, Berkeley, CA 94720
| | - Nilah M Ioannidis
- Department of Electrical Engineering and Computer Sciences, UC Berkeley, Berkeley, CA 94720
| |
Collapse
|
20
|
Martinez GS, Perez-Rueda E, Kumar A, Dutt M, Maya CR, Ledesma-Dominguez L, Casa PL, Kumar A, de Avila e Silva S, Kelvin DJ. CDBProm: the Comprehensive Directory of Bacterial Promoters. NAR Genom Bioinform 2024; 6:lqae018. [PMID: 38385146 PMCID: PMC10880602 DOI: 10.1093/nargab/lqae018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2023] [Revised: 01/12/2024] [Accepted: 01/29/2024] [Indexed: 02/23/2024] Open
Abstract
The decreasing cost of whole genome sequencing has produced high volumes of genomic information that require annotation. The experimental identification of promoter sequences, pivotal for regulating gene expression, is a laborious and cost-prohibitive task. To expedite this, we introduce the Comprehensive Directory of Bacterial Promoters (CDBProm), a directory of in-silico predicted bacterial promoter sequences. We first identified that an Extreme Gradient Boosting (XGBoost) algorithm would distinguish promoters from random downstream regions with an accuracy of 87%. To capture distinctive promoter signals, we generated a second XGBoost classifier trained on the instances misclassified in our first classifier. The predictor of CDBProm is then fed with over 55 million upstream regions from more than 6000 bacterial genomes. Upon finding potential promoter sequences in upstream regions, each promoter is mapped to the genomic data of the organism, linking the predicted promoter with its coding DNA sequence, and identifying the function of the gene regulated by the promoter. The collection of bacterial promoters available in CDBProm enables the quantitative analysis of a plethora of bacterial promoters. Our collection with over 24 million promoters is publicly available at https://aw.iimas.unam.mx/cdbprom/.
Collapse
Affiliation(s)
- Gustavo Sganzerla Martinez
- Microbiology and Immunology, Dalhousie University, Halifax, Nova Scotia B3H 4H7, Canada
- Pediatrics, Izaak Walton Killam (IWK) Health Center. Canadian Center for Vaccinology (CCfV), Halifax, Nova Scotia B3H 4H7, Canada
- BioForge Canada Limited, Halifax, Nova Scotia B3N 3B9, Canada
| | - Ernesto Perez-Rueda
- Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autonóma de México, Unidad Académica del Estado de Yucatán, Mérida 97302, Yucatán, Mexico
| | - Anuj Kumar
- Microbiology and Immunology, Dalhousie University, Halifax, Nova Scotia B3H 4H7, Canada
- Pediatrics, Izaak Walton Killam (IWK) Health Center. Canadian Center for Vaccinology (CCfV), Halifax, Nova Scotia B3H 4H7, Canada
- BioForge Canada Limited, Halifax, Nova Scotia B3N 3B9, Canada
| | - Mansi Dutt
- Microbiology and Immunology, Dalhousie University, Halifax, Nova Scotia B3H 4H7, Canada
- Pediatrics, Izaak Walton Killam (IWK) Health Center. Canadian Center for Vaccinology (CCfV), Halifax, Nova Scotia B3H 4H7, Canada
- BioForge Canada Limited, Halifax, Nova Scotia B3N 3B9, Canada
| | - Cinthia Rodríguez Maya
- Facultad de Ciencias e Ingeniería, Universidad Nacional Autonoma de Mexico, Mexico City 04510, Mexico
| | - Leonardo Ledesma-Dominguez
- Instituto de Investigaciones en Matematicas Aplicadas y en Sistemas, Universidad Nacional Autonoma de Mexico, Mexico City 04510, Mexico
| | - Pedro Lenz Casa
- Biotechnology Institute, Universidade de Caxias do Sul, Caxias do Sul, Rio Grande do Sul 95070-560, Brazil
| | - Aditya Kumar
- Molecular Biology and Biotechnology, Tezpur University, Tezpur, Assam 784028, India
| | - Scheila de Avila e Silva
- Biotechnology Institute, Universidade de Caxias do Sul, Caxias do Sul, Rio Grande do Sul 95070-560, Brazil
| | - David J Kelvin
- Microbiology and Immunology, Dalhousie University, Halifax, Nova Scotia B3H 4H7, Canada
- Pediatrics, Izaak Walton Killam (IWK) Health Center. Canadian Center for Vaccinology (CCfV), Halifax, Nova Scotia B3H 4H7, Canada
- BioForge Canada Limited, Halifax, Nova Scotia B3N 3B9, Canada
| |
Collapse
|
21
|
Nagda BM, Nguyen VM, White RT. promSEMBLE: Hard Pattern Mining and Ensemble Learning for Detecting DNA Promoter Sequences. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:208-214. [PMID: 38051616 DOI: 10.1109/tcbb.2023.3339597] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/07/2023]
Abstract
Accurate identification of DNA promoter sequences is of crucial importance in unraveling the underlying mechanisms that regulate gene transcription. Initiation of transcription is controlled through regulatory transcription factors binding to promoter core regions in the DNA sequence. Detection of promoter regions is necessary if we are to build genetic regulatory networks for biomedical and clinical applications, and for identification of rarely expressed genes. We propose a novel ensemble learning technique using deep recurrent neural networks with convolutional feature extraction and hard negative pattern mining to detect several types of promoter sequences, including promoter sequences with the TATA-box and without the TATA-box, within DNA sequences of four different species. Using extensive independent tests and previously published results, we demonstrate that our method sets a new state-of-the-art of over 98% Matthews correlation coefficient in all eight organism categories for recognizing the stretch of base pairs that code for the promoter region within DNA sequences.
Collapse
|
22
|
Zulfiqar H, Ahmad RM, Raza A, Shahzad S, Lin H. Promoter Prediction in Agrobacterium tumefaciens Strain C58 by Using Artificial Intelligence Strategies. Methods Mol Biol 2024; 2844:33-44. [PMID: 39068330 DOI: 10.1007/978-1-0716-4063-0_2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/30/2024]
Abstract
Promoters are the genomic regions upstream of genes that RNA polymerase binds in order to initiate gene transcription. Understanding the regulation of gene expression depends on being able to identify promoters, because they are the most important component of gene expression. Agrobacterium tumefaciens (A. tumefaciens) strain C58 was the subject of this study with the goal of creating a machine learning-based model to predict promoters. In this study, nucleotide density (ND), k-mer, and one-hot were used to encode the promoter sequence. Support vector machine (SVM) on fivefold cross-validation with incremental feature selection (IFS) was used to optimize the generated features. These improved characteristics were then used to distinguish promoter sequences by feeding them into the random forest (RF) classifier. Tenfold cross-validation (CV) analysis revealed that the projected model has the ability to produce an accuracy of 84.22%.
Collapse
Affiliation(s)
- Hasan Zulfiqar
- Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, Zhejiang, China.
| | - Ramala Masood Ahmad
- Department of Plant Breeding and Genetics, University of Agriculture Faisalabad, Faisalabad, Pakistan
| | - Ali Raza
- Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, Zhejiang, China
| | - Sana Shahzad
- Institute of Horticultural Sciences, University of Agriculture Faisalabad, Faisalabad, Pakistan
| | - Hao Lin
- Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, Zhejiang, China.
| |
Collapse
|
23
|
Teufel F, Gíslason MH, Almagro Armenteros JJ, Johansen A, Winther O, Nielsen H. GraphPart: homology partitioning for biological sequence analysis. NAR Genom Bioinform 2023; 5:lqad088. [PMID: 37850036 PMCID: PMC10578201 DOI: 10.1093/nargab/lqad088] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2023] [Revised: 08/25/2023] [Accepted: 09/19/2023] [Indexed: 10/19/2023] Open
Abstract
When splitting biological sequence data for the development and testing of predictive models, it is necessary to avoid too-closely related pairs of sequences ending up in different partitions. If this is ignored, performance of prediction methods will tend to be overestimated. Several algorithms have been proposed for homology reduction, where sequences are removed until no too-closely related pairs remain. We present GraphPart, an algorithm for homology partitioning that divides the data such that closely related sequences always end up in the same partition, while keeping as many sequences as possible in the dataset. Evaluation of GraphPart on Protein, DNA and RNA datasets shows that it is capable of retaining a larger number of sequences per dataset, while providing homology separation on a par with reduction approaches.
Collapse
Affiliation(s)
- Felix Teufel
- Department of Biology, University of Copenhagen, 2200 Copenhagen, Denmark
- Digital Science & Innovation, Novo Nordisk A/S, 2760 Måløv, Denmark
| | - Magnús Halldór Gíslason
- Department of Genomic Medicine, Copenhagen University Hospital/Rigshospitalet, 2100 Copenhagen, Denmark
| | - José Juan Almagro Armenteros
- Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
| | | | - Ole Winther
- Department of Biology, University of Copenhagen, 2200 Copenhagen, Denmark
- Department of Genomic Medicine, Copenhagen University Hospital/Rigshospitalet, 2100 Copenhagen, Denmark
- Department of Applied Mathematics and Computer Science, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
| | - Henrik Nielsen
- Department of Health Technology, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
| |
Collapse
|
24
|
Pan JH, Du PF. SilenceREIN: seeking silencers on anchors of chromatin loops by deep graph neural networks. Brief Bioinform 2023; 25:bbad494. [PMID: 38168841 PMCID: PMC10782921 DOI: 10.1093/bib/bbad494] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2023] [Revised: 11/09/2023] [Accepted: 12/04/2023] [Indexed: 01/05/2024] Open
Abstract
Silencers are repressive cis-regulatory elements that play crucial roles in transcriptional regulation. Experimental methods for identifying silencers are always costly and time-consuming. Computational methods, which relies on genomic sequence features, have been introduced as alternative approaches. However, silencers do not have significant epigenomic signature. Therefore, we explore a new way to computationally identify silencers, by incorporating chromatin structural information. We propose the SilenceREIN method, which focuses on finding silencers on anchors of chromatin loops. By using graph neural networks, we extracted chromatin structural information from a regulatory element interaction network. SilenceREIN integrated the chromatin structural information with linear genomic signatures to find silencers. The predictive performance of SilenceREIN is comparable or better than other states-of-the-art methods. We performed a genome-wide scanning to systematically find silencers in human genome. Results suggest that silencers are widespread on anchors of chromatin loops. In addition, enrichment analysis of transcription factor binding motif support our prediction results. As far as we can tell, this is the first attempt to incorporate chromatin structural information in finding silencers. All datasets and source codes of SilenceREIN have been deposited in a GitHub repository (https://github.com/JianHPan/SilenceREIN).
Collapse
Affiliation(s)
- Jian-Hua Pan
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| | - Pu-Feng Du
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| |
Collapse
|
25
|
He S, Gao B, Sabnis R, Sun Q. Nucleic Transformer: Classifying DNA Sequences with Self-Attention and Convolutions. ACS Synth Biol 2023; 12:3205-3214. [PMID: 37916871 PMCID: PMC10863451 DOI: 10.1021/acssynbio.3c00154] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2023] [Revised: 10/04/2023] [Accepted: 10/06/2023] [Indexed: 11/03/2023]
Abstract
Much work has been done to apply machine learning and deep learning to genomics tasks, but these applications usually require extensive domain knowledge, and the resulting models provide very limited interpretability. Here, we present the Nucleic Transformer, a conceptually simple but effective and interpretable model architecture that excels in the classification of DNA sequences. The Nucleic Transformer employs self-attention and convolutions on nucleic acid sequences, leveraging two prominent deep learning strategies commonly used in computer vision and natural language analysis. We demonstrate that the Nucleic Transformer can be trained without much domain knowledge to achieve high performance in Escherichia coli promoter classification, viral genome identification, enhancer classification, and chromatin profile predictions.
Collapse
Affiliation(s)
- Shujun He
- Department of Chemical
Engineering, Texas A&M University, College Station, Texas 77840, United States
| | - Baizhen Gao
- Department of Chemical
Engineering, Texas A&M University, College Station, Texas 77840, United States
| | - Rushant Sabnis
- Department of Chemical
Engineering, Texas A&M University, College Station, Texas 77840, United States
| | - Qing Sun
- Department of Chemical
Engineering, Texas A&M University, College Station, Texas 77840, United States
| |
Collapse
|
26
|
Zhang YZ, Bai Z, Imoto S. Investigation of the BERT model on nucleotide sequences with non-standard pre-training and evaluation of different k-mer embeddings. Bioinformatics 2023; 39:btad617. [PMID: 37815839 PMCID: PMC10612406 DOI: 10.1093/bioinformatics/btad617] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2022] [Revised: 09/08/2023] [Accepted: 10/09/2023] [Indexed: 10/11/2023] Open
Abstract
MOTIVATION In recent years, pre-training with the transformer architecture has gained significant attention. While this approach has led to notable performance improvements across a variety of downstream tasks, the underlying mechanisms by which pre-training models influence these tasks, particularly in the context of biological data, are not yet fully elucidated. RESULTS In this study, focusing on the pre-training on nucleotide sequences, we decompose a pre-training model of Bidirectional Encoder Representations from Transformers (BERT) into its embedding and encoding modules to analyze what a pre-trained model learns from nucleotide sequences. Through a comparative study of non-standard pre-training at both the data and model levels, we find that a typical BERT model learns to capture overlapping-consistent k-mer embeddings for its token representation within its embedding module. Interestingly, using the k-mer embeddings pre-trained on random data can yield similar performance in downstream tasks, when compared with those using the k-mer embeddings pre-trained on real biological sequences. We further compare the learned k-mer embeddings with other established k-mer representations in downstream tasks of sequence-based functional prediction. Our experimental results demonstrate that the dense representation of k-mers learned from pre-training can be used as a viable alternative to one-hot encoding for representing nucleotide sequences. Furthermore, integrating the pre-trained k-mer embeddings with simpler models can achieve competitive performance in two typical downstream tasks. AVAILABILITY AND IMPLEMENTATION The source code and associated data can be accessed at https://github.com/yaozhong/bert_investigation.
Collapse
Affiliation(s)
- Yao-zhong Zhang
- Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Minato-ku, Tokyo 108-8639, Japan
| | - Zeheng Bai
- Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Minato-ku, Tokyo 108-8639, Japan
| | - Seiya Imoto
- Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Minato-ku, Tokyo 108-8639, Japan
| |
Collapse
|
27
|
Wang J, Zhang H, Chen N, Zeng T, Ai X, Wu K. PorcineAI-Enhancer: Prediction of Pig Enhancer Sequences Using Convolutional Neural Networks. Animals (Basel) 2023; 13:2935. [PMID: 37760334 PMCID: PMC10526013 DOI: 10.3390/ani13182935] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/20/2023] [Revised: 08/21/2023] [Accepted: 09/05/2023] [Indexed: 09/29/2023] Open
Abstract
Understanding the mechanisms of gene expression regulation is crucial in animal breeding. Cis-regulatory DNA sequences, such as enhancers, play a key role in regulating gene expression. Identifying enhancers is challenging, despite the use of experimental techniques and computational methods. Enhancer prediction in the pig genome is particularly significant due to the costliness of high-throughput experimental techniques. The study constructed a high-quality database of pig enhancers by integrating information from multiple sources. A deep learning prediction framework called PorcineAI-enhancer was developed for the prediction of pig enhancers. This framework employs convolutional neural networks for feature extraction and classification. PorcineAI-enhancer showed excellent performance in predicting pig enhancers, validated on an independent test dataset. The model demonstrated reliable prediction capability for unknown enhancer sequences and performed remarkably well on tissue-specific enhancer sequences.The study developed a deep learning prediction framework, PorcineAI-enhancer, for predicting pig enhancers. The model demonstrated significant predictive performance and potential for tissue-specific enhancers. This research provides valuable resources for future studies on gene expression regulation in pigs.
Collapse
Affiliation(s)
- Ji Wang
- College of Animal Science and Technology, China Agricultural University, Beijing 100193, China; (J.W.); (H.Z.); (T.Z.); (X.A.)
| | - Han Zhang
- College of Animal Science and Technology, China Agricultural University, Beijing 100193, China; (J.W.); (H.Z.); (T.Z.); (X.A.)
| | - Nanzhu Chen
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China;
| | - Tong Zeng
- College of Animal Science and Technology, China Agricultural University, Beijing 100193, China; (J.W.); (H.Z.); (T.Z.); (X.A.)
| | - Xiaohua Ai
- College of Animal Science and Technology, China Agricultural University, Beijing 100193, China; (J.W.); (H.Z.); (T.Z.); (X.A.)
| | - Keliang Wu
- College of Animal Science and Technology, China Agricultural University, Beijing 100193, China; (J.W.); (H.Z.); (T.Z.); (X.A.)
| |
Collapse
|
28
|
Zaytsev K, Fedorov A, Korotkov E. Classification of Promoter Sequences from Human Genome. Int J Mol Sci 2023; 24:12561. [PMID: 37628742 PMCID: PMC10454140 DOI: 10.3390/ijms241612561] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/02/2023] [Revised: 07/28/2023] [Accepted: 08/03/2023] [Indexed: 08/27/2023] Open
Abstract
We have developed a new method for promoter sequence classification based on a genetic algorithm and the MAHDS sequence alignment method. We have created four classes of human promoters, combining 17,310 sequences out of the 29,598 present in the EPD database. We searched the human genome for potential promoter sequences (PPSs) using dynamic programming and position weight matrices representing each of the promoter sequence classes. A total of 3,065,317 potential promoter sequences were found. Only 1,241,206 of them were located in unannotated parts of the human genome. Every other PPS found intersected with either true promoters, transposable elements, or interspersed repeats. We found a strong intersection between PPSs and Alu elements as well as transcript start sites. The number of false positive PPSs is estimated to be 3 × 10-8 per nucleotide, which is several orders of magnitude lower than for any other promoter prediction method. The developed method can be used to search for PPSs in various eukaryotic genomes.
Collapse
Affiliation(s)
- Konstantin Zaytsev
- Bach Institute of Biochemistry, Federal Research Center of Biotechnology of the Russian Academy of Sciences, 119071 Moscow, Russia
| | - Alexey Fedorov
- Bach Institute of Biochemistry, Federal Research Center of Biotechnology of the Russian Academy of Sciences, 119071 Moscow, Russia
| | - Eugene Korotkov
- Institute of Bioengineering, Federal Research Center of Biotechnology of the Russian Academy of Sciences, 119071 Moscow, Russia
| |
Collapse
|
29
|
Milito A, Aschern M, McQuillan JL, Yang JS. Challenges and advances towards the rational design of microalgal synthetic promoters in Chlamydomonas reinhardtii. JOURNAL OF EXPERIMENTAL BOTANY 2023; 74:3833-3850. [PMID: 37025006 DOI: 10.1093/jxb/erad100] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/09/2022] [Accepted: 03/24/2023] [Indexed: 06/19/2023]
Abstract
Microalgae hold enormous potential to provide a safe and sustainable source of high-value compounds, acting as carbon-fixing biofactories that could help to mitigate rapidly progressing climate change. Bioengineering microalgal strains will be key to optimizing and modifying their metabolic outputs, and to render them competitive with established industrial biotechnology hosts, such as bacteria or yeast. To achieve this, precise and tuneable control over transgene expression will be essential, which would require the development and rational design of synthetic promoters as a key strategy. Among green microalgae, Chlamydomonas reinhardtii represents the reference species for bioengineering and synthetic biology; however, the repertoire of functional synthetic promoters for this species, and for microalgae generally, is limited in comparison to other commercial chassis, emphasizing the need to expand the current microalgal gene expression toolbox. Here, we discuss state-of-the-art promoter analyses, and highlight areas of research required to advance synthetic promoter development in C. reinhardtii. In particular, we exemplify high-throughput studies performed in other model systems that could be applicable to microalgae, and propose novel approaches to interrogating algal promoters. We lastly outline the major limitations hindering microalgal promoter development, while providing novel suggestions and perspectives for how to overcome them.
Collapse
Affiliation(s)
- Alfonsina Milito
- Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB, Campus UAB, Bellaterra, Barcelona, Spain
| | - Moritz Aschern
- Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB, Campus UAB, Bellaterra, Barcelona, Spain
| | - Josie L McQuillan
- Department of Chemical and Biological Engineering, University of Sheffield, Mappin Street, Sheffield, S1 3JD, UK
| | - Jae-Seong Yang
- Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB, Campus UAB, Bellaterra, Barcelona, Spain
| |
Collapse
|
30
|
Wang Y, Tai S, Zhang S, Sheng N, Xie X. PromGER: Promoter Prediction Based on Graph Embedding and Ensemble Learning for Eukaryotic Sequence. Genes (Basel) 2023; 14:1441. [PMID: 37510345 PMCID: PMC10379012 DOI: 10.3390/genes14071441] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2023] [Revised: 07/04/2023] [Accepted: 07/10/2023] [Indexed: 07/30/2023] Open
Abstract
Promoters are DNA non-coding regions around the transcription start site and are responsible for regulating the gene transcription process. Due to their key role in gene function and transcriptional activity, the prediction of promoter sequences and their core elements accurately is a crucial research area in bioinformatics. At present, models based on machine learning and deep learning have been developed for promoter prediction. However, these models cannot mine the deeper biological information of promoter sequences and consider the complex relationship among promoter sequences. In this work, we propose a novel prediction model called PromGER to predict eukaryotic promoter sequences. For a promoter sequence, firstly, PromGER utilizes four types of feature-encoding methods to extract local information within promoter sequences. Secondly, according to the potential relationships among promoter sequences, the whole promoter sequences are constructed as a graph. Furthermore, three different scales of graph-embedding methods are applied for obtaining the global feature information more comprehensively in the graph. Finally, combining local features with global features of sequences, PromGER analyzes and predicts promoter sequences through a tree-based ensemble-learning framework. Compared with seven existing methods, PromGER improved the average specificity of 13%, accuracy of 10%, Matthew's correlation coefficient of 16%, precision of 4%, F1 score of 6%, and AUC of 9%. Specifically, this study interpreted the PromGER by the t-distributed stochastic neighbor embedding (t-SNE) method and SHAPley Additive exPlanations (SHAP) value analysis, which demonstrates the interpretability of the model.
Collapse
Affiliation(s)
- Yan Wang
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- School of Artificial Intelligence, Jilin University, Changchun 130012, China
| | - Shiwen Tai
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
| | - Shuangquan Zhang
- School of Cyber Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Nan Sheng
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
| | - Xuping Xie
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
| |
Collapse
|
31
|
Ni CE, Doan DP, Chiu YJ, Huang YH. TSSUNet-MB - ab initio identification of σ 70 promoter transcription start sites in Escherichia coli using deep multitask learning. Comput Biol Chem 2023; 105:107904. [PMID: 37327560 DOI: 10.1016/j.compbiolchem.2023.107904] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2022] [Revised: 03/22/2023] [Accepted: 06/09/2023] [Indexed: 06/18/2023]
Abstract
MOTIVATION Computational promoter prediction (CPP) tools designed to classify prokaryotic promoter regions usually assume that a transcription start site (TSS) is located at a predefined position within each promoter region. Such CPP tools are sensitive to any positional shifting of the TSS in a windowed region, and they are unsuitable for determining the boundaries of prokaryotic promoters. RESULTS TSSUNet-MB is a deep learning model developed to identify the TSSs of σ70 promoters. Mononucleotide and bendability were used to encode input sequences. TSSUNet-MB outperforms other CPP tools when assessed using the sequences obtained from the neighborhood of real promoters. TSSUNet-MB achieved a sensitivity of 0.839 and specificity of 0.768 on sliding sequences, while other CPP tool cannot maintain both sensitivities and specificities in a compatible range. Furthermore, TSSUNet-MB can precisely predict the TSS position of σ70 promoter-containing regions with a 10-base accuracy of 77.6%. By leveraging the sliding window scanning approach, we further computed the confidence score of each predicted TSS, which allows for more accurately identifying TSS locations. Our results suggest that TSSUNet-MB is a robust tool for finding σ70 promoters and identifying TSSs.
Collapse
Affiliation(s)
- Chung-En Ni
- Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei, Taiwan
| | - Duy-Phuong Doan
- Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei, Taiwan
| | - Yen-Jung Chiu
- Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei, Taiwan
| | - Yen-Hua Huang
- Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei, Taiwan; Center for Systems and Synthetic Biology, National Yang Ming Chiao Tung University, Taipei, Taiwan.
| |
Collapse
|
32
|
Grešová K, Martinek V, Čechák D, Šimeček P, Alexiou P. Genomic benchmarks: a collection of datasets for genomic sequence classification. BMC Genom Data 2023; 24:25. [PMID: 37127596 PMCID: PMC10150520 DOI: 10.1186/s12863-023-01123-8] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2022] [Accepted: 03/31/2023] [Indexed: 05/03/2023] Open
Abstract
BACKGROUND Recently, deep neural networks have been successfully applied in many biological fields. In 2020, a deep learning model AlphaFold won the protein folding competition with predicted structures within the error tolerance of experimental methods. However, this solution to the most prominent bioinformatic challenge of the past 50 years has been possible only thanks to a carefully curated benchmark of experimentally predicted protein structures. In Genomics, we have similar challenges (annotation of genomes and identification of functional elements) but currently, we lack benchmarks similar to protein folding competition. RESULTS Here we present a collection of curated and easily accessible sequence classification datasets in the field of genomics. The proposed collection is based on a combination of novel datasets constructed from the mining of publicly available databases and existing datasets obtained from published articles. The collection currently contains nine datasets that focus on regulatory elements (promoters, enhancers, open chromatin region) from three model organisms: human, mouse, and roundworm. A simple convolution neural network is also included in a repository and can be used as a baseline model. Benchmarks and the baseline model are distributed as the Python package 'genomic-benchmarks', and the code is available at https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks . CONCLUSIONS Deep learning techniques revolutionized many biological fields but mainly thanks to the carefully curated benchmarks. For the field of Genomics, we propose a collection of benchmark datasets for the classification of genomic sequences with an interface for the most commonly used deep learning libraries, implementation of the simple neural network and a training framework that can be used as a starting point for future research. The main aim of this effort is to create a repository for shared datasets that will make machine learning for genomics more comparable and reproducible while reducing the overhead of researchers who want to enter the field, leading to healthy competition and new discoveries.
Collapse
Affiliation(s)
- Katarína Grešová
- Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia
- National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czechia
| | - Vlastimil Martinek
- Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia
- National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czechia
| | - David Čechák
- Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia
- National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czechia
| | - Petr Šimeček
- Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia.
| | - Panagiotis Alexiou
- Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia
| |
Collapse
|
33
|
Barbero-Aparicio JA, Olivares-Gil A, Díez-Pastor JF, García-Osorio C. Deep learning and support vector machines for transcription start site identification. PeerJ Comput Sci 2023; 9:e1340. [PMID: 37346545 PMCID: PMC10280436 DOI: 10.7717/peerj-cs.1340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2022] [Accepted: 03/21/2023] [Indexed: 06/23/2023]
Abstract
Recognizing transcription start sites is key to gene identification. Several approaches have been employed in related problems such as detecting translation initiation sites or promoters, many of the most recent ones based on machine learning. Deep learning methods have been proven to be exceptionally effective for this task, but their use in transcription start site identification has not yet been explored in depth. Also, the very few existing works do not compare their methods to support vector machines (SVMs), the most established technique in this area of study, nor provide the curated dataset used in the study. The reduced amount of published papers in this specific problem could be explained by this lack of datasets. Given that both support vector machines and deep neural networks have been applied in related problems with remarkable results, we compared their performance in transcription start site predictions, concluding that SVMs are computationally much slower, and deep learning methods, specially long short-term memory neural networks (LSTMs), are best suited to work with sequences than SVMs. For such a purpose, we used the reference human genome GRCh38. Additionally, we studied two different aspects related to data processing: the proper way to generate training examples and the imbalanced nature of the data. Furthermore, the generalization performance of the models studied was also tested using the mouse genome, where the LSTM neural network stood out from the rest of the algorithms. To sum up, this article provides an analysis of the best architecture choices in transcription start site identification, as well as a method to generate transcription start site datasets including negative instances on any species available in Ensembl. We found that deep learning methods are better suited than SVMs to solve this problem, being more efficient and better adapted to long sequences and large amounts of data. We also create a transcription start site (TSS) dataset large enough to be used in deep learning experiments.
Collapse
Affiliation(s)
| | - Alicia Olivares-Gil
- Departamento de Ingeniería Informática, Universidad de Burgos, Burgos, Spain
| | - José F. Díez-Pastor
- Departamento de Ingeniería Informática, Universidad de Burgos, Burgos, Spain
| | - César García-Osorio
- Departamento de Ingeniería Informática, Universidad de Burgos, Burgos, Spain
| |
Collapse
|
34
|
Zulfiqar H, Ahmed Z, Kissanga Grace-Mercure B, Hassan F, Zhang ZY, Liu F. Computational prediction of promotors in Agrobacterium tumefaciens strain C58 by using the machine learning technique. Front Microbiol 2023; 14:1170785. [PMID: 37125199 PMCID: PMC10133480 DOI: 10.3389/fmicb.2023.1170785] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2023] [Accepted: 03/17/2023] [Indexed: 05/02/2023] Open
Abstract
Promotors are those genomic regions on the upstream of genes, which are bound by RNA polymerase for starting gene transcription. Because it is the most critical element of gene expression, the recognition of promoters is crucial to understand the regulation of gene expression. This study aimed to develop a machine learning-based model to predict promotors in Agrobacterium tumefaciens (A. tumefaciens) strain C58. In the model, promotor sequences were encoded by three different kinds of feature descriptors, namely, accumulated nucleotide frequency, k-mer nucleotide composition, and binary encodings. The obtained features were optimized by using correlation and the mRMR-based algorithm. These optimized features were inputted into a random forest (RF) classifier to discriminate promotor sequences from non-promotor sequences in A. tumefaciens strain C58. The examination of 10-fold cross-validation showed that the proposed model could yield an overall accuracy of 0.837. This model will provide help for the study of promoters in A. tumefaciens C58 strain.
Collapse
Affiliation(s)
- Hasan Zulfiqar
- Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, China
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Zahoor Ahmed
- Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, China
| | - Bakanina Kissanga Grace-Mercure
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Farwa Hassan
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Zhao-Yue Zhang
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Fen Liu
- Department of Radiation Oncology, Peking University Cancer Hospital (Inner Mongolia Campus), Affiliated Cancer Hospital of Inner Mongolia Medical University, Inner Mongolia Cancer Hospital, Hohhot, China
| |
Collapse
|
35
|
Shujaat M, Kim H, Tayara H, Chong KT. iProm-Sigma54: A CNN Base Prediction Tool for σ54 Promoters. Cells 2023; 12:cells12060829. [PMID: 36980170 PMCID: PMC10047130 DOI: 10.3390/cells12060829] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/09/2022] [Revised: 02/23/2023] [Accepted: 02/23/2023] [Indexed: 03/11/2023] Open
Abstract
The sigma (σ) factor of RNA holoenzymes is essential for identifying and binding to promoter regions during gene transcription in prokaryotes. σ54 promoters carried out various ancillary methods and environmentally responsive procedures; therefore, it is crucial to accurately identify σ54 promoter sequences to comprehend the underlying process of gene regulation. Herein, we come up with a convolutional neural network (CNN) based prediction tool named “iProm-Sigma54” for the prediction of σ54 promoters. The CNN consists of two one-dimensional convolutional layers, which are followed by max pooling layers and dropout layers. A one-hot encoding scheme was used to extract the input matrix. To determine the prediction performance of iProm-Sigma54, we employed four assessment metrics and five-fold cross-validation; performance was measured using a benchmark and test dataset. According to the findings of this comparison, iProm-Sigma54 outperformed existing methodologies for identifying σ54 promoters. Additionally, a publicly accessible web server was constructed.
Collapse
Affiliation(s)
- Muhammad Shujaat
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea
| | - Hoonjoo Kim
- School of Pharmacy, Jeonbuk National University, Jeonju 54896, Republic of Korea
- Correspondence: (H.K.); (K.T.C.)
| | - Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, Republic of Korea
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea
- Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, Republic of Korea
- Correspondence: (H.K.); (K.T.C.)
| |
Collapse
|
36
|
Bu Y, Zheng J, Jia C. An efficient deep learning based predictor for identifying miRNA-triggered phasiRNA loci in plant. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:6853-6865. [PMID: 37161131 DOI: 10.3934/mbe.2023295] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/11/2023]
Abstract
Phasic small interfering RNAs are plant secondary small interference RNAs that typically generated by the convergence of miRNAs and polyadenylated mRNAs. A growing number of studies have shown that miRNA-initiated phasiRNA plays crucial roles in regulating plant growth and stress responses. Experimental verification of miRNA-initiated phasiRNA loci may take considerable time, energy and labor. Therefore, computational methods capable of processing high throughput data have been proposed one by one. In this work, we proposed a predictor (DIGITAL) for identifying miRNA-initiated phasiRNAs in plant, which combined a multi-scale residual network with a bi-directional long-short term memory network. The negative dataset was constructed based on positive data, through replacing 60% of nucleotides randomly in each positive sample. Our predictor achieved the accuracy of 98.48% and 94.02% respectively on two independent test datasets with different sequence length. These independent testing results indicate the effectiveness of our model. Furthermore, DIGITAL is of robustness and generalization ability, and thus can be easily extended and applied for miRNA target recognition of other species. We provide the source code of DIGITAL, which is freely available at https://github.com/yuanyuanbu/DIGITAL.
Collapse
Affiliation(s)
- Yuanyuan Bu
- School of Science, Dalian Maritimr University, Dalian 116026, China
| | - Jia Zheng
- School of Science, Dalian Maritimr University, Dalian 116026, China
| | - Cangzhi Jia
- School of Science, Dalian Maritimr University, Dalian 116026, China
| |
Collapse
|
37
|
Implementing computational methods in tandem with synonymous gene recoding for therapeutic development. Trends Pharmacol Sci 2023; 44:73-84. [PMID: 36307252 DOI: 10.1016/j.tips.2022.09.008] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2022] [Revised: 09/26/2022] [Accepted: 09/27/2022] [Indexed: 12/24/2022]
Abstract
Synonymous gene recoding, the substitution of synonymous variants into the genetic sequence, has been used to overcome many production limitations in therapeutic development. However, the safety and efficacy of recoded therapeutics can be difficult to evaluate because synonymous codon substitutions can result in subtle, yet impactful changes in protein features and require sensitive methods for detection. Given that computational approaches have made significant leaps in recent years, we propose that machine-learning (ML) tools may be leveraged to assess gene-recoded therapeutics and foresee an opportunity to adapt codon contexts to enhance some powerful existing tools. Here, we examine how synonymous gene recoding has been used to address challenges in therapeutic development, explain the biological mechanisms underlying its effects, and explore the application of computational platforms to improve the surveillance of functional variants in therapeutic design.
Collapse
|
38
|
Explainable artificial intelligence as a reliable annotator of archaeal promoter regions. Sci Rep 2023; 13:1763. [PMID: 36720898 PMCID: PMC9889792 DOI: 10.1038/s41598-023-28571-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2022] [Accepted: 01/20/2023] [Indexed: 02/02/2023] Open
Abstract
Archaea are a vast and unexplored cellular domain that thrive in a high diversity of environments, having central roles in processes mediating global carbon and nutrient fluxes. For these organisms to balance their metabolism, the appropriate regulation of their gene expression is essential. A key momentum in regulating genes responsible for the life maintenance of archaea is when transcription factor proteins bind to the promoter element. This DNA segment is conserved, which enables its exploration by machine learning techniques. Here, we trained and tested a support vector machine with 3935 known archaeal promoter sequences. All promoter sequences were coded into DNA Duplex Stability. After, we performed a model interpretation task to map the decision pattern of the classification procedure. We also used a dataset of known-promoter sequences for validation. Our results showed that an AT rich region around position - 27 upstream (relative to the start TSS) is the most conserved in the analyzed organisms. In addition, we were able to identify the BRE element (- 33), the PPE (at - 10) and a position at + 3, that provides a more understandable picture of how promoters are organized in all the archaeal organisms. Finally, we used the interpreted model to identify potential promoter sequences of 135 unannotated organisms, delivering regulatory regions annotation of archaea in a scale never accomplished before ( https://pcyt.unam.mx/gene-regulation/ ). We consider that this approach will be useful to understand how gene regulation is achieved in other organisms apart from the already established transcription factor binding sites.
Collapse
|
39
|
He S, Gao B, Sabnis R, Sun Q. RNAdegformer: accurate prediction of mRNA degradation at nucleotide resolution with deep learning. Brief Bioinform 2023; 24:bbac581. [PMID: 36633966 PMCID: PMC9851316 DOI: 10.1093/bib/bbac581] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2022] [Revised: 11/14/2022] [Accepted: 11/28/2022] [Indexed: 01/13/2023] Open
Abstract
Messenger RNA-based therapeutics have shown tremendous potential, as demonstrated by the rapid development of messenger RNA based vaccines for COVID-19. Nevertheless, distribution of mRNA vaccines worldwide has been hampered by mRNA's inherent thermal instability due to in-line hydrolysis, a chemical degradation reaction. Therefore, predicting and understanding RNA degradation is a crucial and urgent task. Here we present RNAdegformer, an effective and interpretable model architecture that excels in predicting RNA degradation. RNAdegformer processes RNA sequences with self-attention and convolutions, two deep learning techniques that have proved dominant in the fields of computer vision and natural language processing, while utilizing biophysical features of RNA. We demonstrate that RNAdegformer outperforms previous best methods at predicting degradation properties at nucleotide resolution for COVID-19 mRNA vaccines. RNAdegformer predictions also exhibit improved correlation with RNA in vitro half-life compared with previous best methods. Additionally, we showcase how direct visualization of self-attention maps assists informed decision-making. Further, our model reveals important features in determining mRNA degradation rates via leave-one-feature-out analysis.
Collapse
Affiliation(s)
- Shujun He
- Department of Chemical Engineering, Texas A&M University, 100 Spence St, 77843, Texas, United States
| | - Baizhen Gao
- Department of Chemical Engineering, Texas A&M University, 100 Spence St, 77843, Texas, United States
| | - Rushant Sabnis
- Department of Chemical Engineering, Texas A&M University, 100 Spence St, 77843, Texas, United States
| | - Qing Sun
- Department of Chemical Engineering, Texas A&M University, 100 Spence St, 77843, Texas, United States
| |
Collapse
|
40
|
Kari H, Bandi SMS, Kumar A, Yella VR. DeePromClass: Delineator for Eukaryotic Core Promoters Employing Deep Neural Networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:802-807. [PMID: 35353704 DOI: 10.1109/tcbb.2022.3163418] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Computational promoter identification in eukaryotes is a classical biological problem that should be refurbished with the availability of an avalanche of experimental data and emerging deep learning technologies. The current knowledge indicates that eukaryotic core promoters display multifarious signals such as TATA-Box, Inr element, TCT, and Pause-button, etc., and structural motifs such as G-quadruplexes. In the present study, we combined the power of deep learning with a plethora of promoter motifs to delineate promoter and non-promoters gleaned from the statistical properties of DNA sequence arrangement. To this end, we implemented convolutional neural network (CNN) and long short-term memory (LSTM) recurrent neural network architecture for five model systems with [-100 to +50] segments relative to the transcription start site being the core promoter. Unlike previous state-of-the-art tools, which furnish a binary decision of promoter or non-promoter, we classify a chunk of 151mer sequence into a promoter along with the consensus signal type or a non-promoter. The combined CNN-LSTM model; we call "DeePromClass", achieved testing accuracy of 90.6%, 93.6%, 91.8%, 86.5%, and 84.0% for S. cerevisiae, C. elegans, D. melanogaster, Mus musculus, and Homo sapiens respectively. In total, our tool provides an insightful update on next-generation promoter prediction tools for promoter biologists.
Collapse
|
41
|
Yang TH, Yu YH, Wu SH, Zhang FY. CFA: An explainable deep learning model for annotating the transcriptional roles of cis-regulatory modules based on epigenetic codes. Comput Biol Med 2023; 152:106375. [PMID: 36502693 DOI: 10.1016/j.compbiomed.2022.106375] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2022] [Revised: 11/07/2022] [Accepted: 11/27/2022] [Indexed: 11/30/2022]
Abstract
Metazoa gene expression is controlled by modular DNA segments called cis-regulatory modules (CRMs). CRMs can convey promoter/enhancer/insulator roles, generating additional regulation layers in transcription. Experiments for understanding CRM roles are low-throughput and costly. Large-scale CRM function investigation still depends on computational methods. However, existing in silico tools only recognize enhancers or promoters exclusively, thus accumulating errors when considering CRM promoter/enhancer/insulator roles altogether. Currently, no algorithm can concurrently consider these CRM roles. In this research, we developed the CRM Function Annotator (CFA) model. CFA provides complete CRM transcriptional role labeling based on epigenetic profiling interpretation. We demonstrated that CFA achieves high performance (test macro auROC/auPRC = 94.1%/90.3%) and outperforms existing tools in promoter/enhancer/insulator identification. CFA is also inspected to recognize explainable epigenetic codes consistent with previous findings when labeling CRM roles. By considering the higher-order combinations of the epigenetic codes, CFA significantly reduces false-positive rates in CRM transcriptional role annotation. CFA is available at https://github.com/cobisLab/CFA/.
Collapse
Affiliation(s)
- Tzu-Hsien Yang
- Department of Biomedical Engineering, National Cheng Kung University, No. 1, University Road, Tainan 701, Taiwan.
| | - Yu-Huai Yu
- Department of Information Management, National University of Kaohsiung, Kaohsiung University Rd, 811 Kaohsiung, Taiwan.
| | - Sheng-Hang Wu
- Department of Information Management, National University of Kaohsiung, Kaohsiung University Rd, 811 Kaohsiung, Taiwan.
| | - Fang-Yuan Zhang
- Department of Information Management, National University of Kaohsiung, Kaohsiung University Rd, 811 Kaohsiung, Taiwan.
| |
Collapse
|
42
|
Barbero-Aparicio JA, Cuesta-Lopez S, García-Osorio CI, Pérez-Rodríguez J, García-Pedrajas N. Nonlinear physics opens a new paradigm for accurate transcription start site prediction. BMC Bioinformatics 2022; 23:565. [PMID: 36585618 PMCID: PMC9801560 DOI: 10.1186/s12859-022-05129-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2022] [Accepted: 12/27/2022] [Indexed: 12/31/2022] Open
Abstract
There is evidence that DNA breathing (spontaneous opening of the DNA strands) plays a relevant role in the interactions of DNA with other molecules, and in particular in the transcription process. Therefore, having physical models that can predict these openings is of interest. However, this source of information has not been used before either in transcription start sites (TSSs) or promoter prediction. In this article, one such model is used as an additional information source that, when used by a machine learning (ML) model, improves the results of current methods for the prediction of TSSs. In addition, we provide evidence on the validity of the physical model, as it is able by itself to predict TSSs with high accuracy. This opens an exciting avenue of research at the intersection of statistical mechanics and ML, where ML models in bioinformatics can be improved using physical models of DNA as feature extractors.
Collapse
Affiliation(s)
- José Antonio Barbero-Aparicio
- grid.23520.360000 0000 8569 1592Departamento de Informática, Universidad de Burgos, Avda. de Cantabria s/n, 09006 Burgos, Spain
| | - Santiago Cuesta-Lopez
- grid.23520.360000 0000 8569 1592Universidad de Burgos, Hospital del Rey, s/n, 09001 Burgos, Spain ,ICAMCyL Foundation, Internacional Center for Advanced Materials and Raw Materials of Castilla y León, León Technology Park, main building, first floor, offices 106-108, C/Julia Morros s/n, Armunia, 24009 León, Spain
| | - César Ignacio García-Osorio
- grid.23520.360000 0000 8569 1592Departamento de Informática, Universidad de Burgos, Avda. de Cantabria s/n, 09006 Burgos, Spain
| | - Javier Pérez-Rodríguez
- grid.449008.10000 0004 1795 4150Departamento de Métodos Cuantitativos, Universidad de Loyola Andalucía, Escritor Castilla Aguayo, 4, 14004 Córdoba, Spain
| | - Nicolás García-Pedrajas
- grid.411901.c0000 0001 2183 9102Department of Computing and Numerical Analysis, University of Córdoba, Edificio Albert Einstein, Campus de Rabanales, 14071 Córdoba, Spain
| |
Collapse
|
43
|
Mai DHA, Nguyen LT, Lee EY. TSSNote-CyaPromBERT: Development of an integrated platform for highly accurate promoter prediction and visualization of Synechococcus sp. and Synechocystis sp. through a state-of-the-art natural language processing model BERT. Front Genet 2022; 13:1067562. [PMID: 36523764 PMCID: PMC9745317 DOI: 10.3389/fgene.2022.1067562] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2022] [Accepted: 11/17/2022] [Indexed: 07/30/2023] Open
Abstract
Since the introduction of the first transformer model with a unique self-attention mechanism, natural language processing (NLP) models have attained state-of-the-art (SOTA) performance on various tasks. As DNA is the blueprint of life, it can be viewed as an unusual language, with its characteristic lexicon and grammar. Therefore, NLP models may provide insights into the meaning of the sequential structure of DNA. In the current study, we employed and compared the performance of popular SOTA NLP models (i.e., XLNET, BERT, and a variant DNABERT trained on the human genome) to predict and analyze the promoters in freshwater cyanobacterium Synechocystis sp. PCC 6803 and the fastest growing cyanobacterium Synechococcus elongatus sp. UTEX 2973. These freshwater cyanobacteria are promising hosts for phototrophically producing value-added compounds from CO2. Through a custom pipeline, promoters and non-promoters from Synechococcus elongatus sp. UTEX 2973 were used to train the model. The trained model achieved an AUROC score of 0.97 and F1 score of 0.92. During cross-validation with promoters from Synechocystis sp. PCC 6803, the model achieved an AUROC score of 0.96 and F1 score of 0.91. To increase accessibility, we developed an integrated platform (TSSNote-CyaPromBERT) to facilitate large dataset extraction, model training, and promoter prediction from public dRNA-seq datasets. Furthermore, various visualization tools have been incorporated to address the "black box" issue of deep learning and feature analysis. The learning transfer ability of large language models may help identify and analyze promoter regions for newly isolated strains with similar lineages.
Collapse
|
44
|
Zhang T, Tang Q, Nie F, Zhao Q, Chen W. DeepLncPro: an interpretable convolutional neural network model for identifying long non-coding RNA promoters. Brief Bioinform 2022; 23:6754194. [PMID: 36209437 DOI: 10.1093/bib/bbac447] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/19/2022] [Revised: 09/14/2022] [Accepted: 09/17/2022] [Indexed: 12/14/2022] Open
Abstract
Long non-coding RNA (lncRNA) plays important roles in a series of biological processes. The transcription of lncRNA is regulated by its promoter. Hence, accurate identification of lncRNA promoter will be helpful to understand its regulatory mechanisms. Since experimental techniques remain time consuming for gnome-wide promoter identification, developing computational tools to identify promoters are necessary. However, only few computational methods have been proposed for lncRNA promoter prediction and their performances still have room to be improved. In the present work, a convolutional neural network based model, called DeepLncPro, was proposed to identify lncRNA promoters in human and mouse. Comparative results demonstrated that DeepLncPro was superior to both state-of-the-art machine learning methods and existing models for identifying lncRNA promoters. Furthermore, DeepLncPro has the ability to extract and analyze transcription factor binding motifs from lncRNAs, which made it become an interpretable model. These results indicate that the DeepLncPro can server as a powerful tool for identifying lncRNA promoters. An open-source tool for DeepLncPro was provided at https://github.com/zhangtian-yang/DeepLncPro.
Collapse
Affiliation(s)
- Tianyang Zhang
- School of Life Sciences, North China University of Science and Technology
| | - Qiang Tang
- School of Basic Medical Sciences, Chengdu University of Traditional Chinese Medicine
| | - Fulei Nie
- School of Life Sciences, North China University of Science and Technology
| | - Qi Zhao
- School of Computer Science and Software Engineering, University of Science and Technology Liaoning
| | - Wei Chen
- Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine
| |
Collapse
|
45
|
Nguyen-Vo TH, Trinh QH, Nguyen L, Nguyen-Hoang PU, Rahardja S, Nguyen BP. iPromoter-Seqvec: identifying promoters using bidirectional long short-term memory and sequence-embedded features. BMC Genomics 2022; 23:681. [PMID: 36192696 PMCID: PMC9531353 DOI: 10.1186/s12864-022-08829-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/01/2022] [Accepted: 08/08/2022] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND Promoters, non-coding DNA sequences located at upstream regions of the transcription start site of genes/gene clusters, are essential regulatory elements for the initiation and regulation of transcriptional processes. Furthermore, identifying promoters in DNA sequences and genomes significantly contributes to discovering entire structures of genes of interest. Therefore, exploration of promoter regions is one of the most imperative topics in molecular genetics and biology. Besides experimental techniques, computational methods have been developed to predict promoters. In this study, we propose iPromoter-Seqvec - an efficient computational model to predict TATA and non-TATA promoters in human and mouse genomes using bidirectional long short-term memory neural networks in combination with sequence-embedded features extracted from input sequences. The promoter and non-promoter sequences were retrieved from the Eukaryotic Promoter database and then were refined to create four benchmark datasets. RESULTS The area under the receiver operating characteristic curve (AUCROC) and the area under the precision-recall curve (AUCPR) were used as two key metrics to evaluate model performance. Results on independent test sets showed that iPromoter-Seqvec outperformed other state-of-the-art methods with AUCROC values ranging from 0.85 to 0.99 and AUCPR values ranging from 0.86 to 0.99. Models predicting TATA promoters in both species had slightly higher predictive power compared to those predicting non-TATA promoters. With a novel idea of constructing artificial non-promoter sequences based on promoter sequences, our models were able to learn highly specific characteristics discriminating promoters from non-promoters to improve predictive efficiency. CONCLUSIONS iPromoter-Seqvec is a stable and robust model for predicting both TATA and non-TATA promoters in human and mouse genomes. Our proposed method was also deployed as an online web server with a user-friendly interface to support research communities. Links to our source codes and web server are available at https://github.com/mldlproject/2022-iPromoter-Seqvec .
Collapse
Affiliation(s)
- Thanh-Hoang Nguyen-Vo
- School of Mathematics and Statistics, Victoria University of Wellington, Gate 7, Kelburn Parade, 6140 Wellington, New Zealand
| | - Quang H. Trinh
- School of Information and Communication Technology, Hanoi University of Science and Technology, 1 Dai Co Viet, 100000 Hanoi, Vietnam
| | - Loc Nguyen
- School of Mathematics and Statistics, Victoria University of Wellington, Gate 7, Kelburn Parade, 6140 Wellington, New Zealand
| | - Phuong-Uyen Nguyen-Hoang
- Computational Biology Center, International University - VNU HCMC, Quarter 6, Linh Trung Ward, Thu Duc District, 700000 Ho Chi Minh City, Vietnam
| | - Susanto Rahardja
- School of Marine Science and Technology, Northwestern Polytechnical University, 127 West Youyi Road, 710072 Xi’an, China
- Infocomm Technology Cluster, Singapore Institute of Technology, 10 Dover Drive, 138683 Singapore, Singapore
| | - Binh P. Nguyen
- School of Mathematics and Statistics, Victoria University of Wellington, Gate 7, Kelburn Parade, 6140 Wellington, New Zealand
| |
Collapse
|
46
|
Zhang P, Zhang H, Wu H. iPro-WAEL: a comprehensive and robust framework for identifying promoters in multiple species. Nucleic Acids Res 2022; 50:10278-10289. [PMID: 36161334 PMCID: PMC9561371 DOI: 10.1093/nar/gkac824] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2022] [Revised: 08/24/2022] [Accepted: 09/14/2022] [Indexed: 11/27/2022] Open
Abstract
Promoters are consensus DNA sequences located near the transcription start sites and they play an important role in transcription initiation. Due to their importance in biological processes, the identification of promoters is significantly important for characterizing the expression of the genes. Numerous computational methods have been proposed to predict promoters. However, it is difficult for these methods to achieve satisfactory performance in multiple species. In this study, we propose a novel weighted average ensemble learning model, termed iPro-WAEL, for identifying promoters in multiple species, including Human, Mouse, E.coli, Arabidopsis, B.amyloliquefaciens, B.subtilis and R.capsulatus. Extensive benchmarking experiments illustrate that iPro-WAEL has optimal performance and is superior to the current methods in promoter prediction. The experimental results also demonstrate a satisfactory prediction ability of iPro-WAEL on cross-cell lines, promoters annotated by other methods and distinguishing between promoters and enhancers. Moreover, we identify the most important transcription factor binding site (TFBS) motif in promoter regions to facilitate the study of identifying important motifs in the promoter regions. The source code of iPro-WAEL is freely available at https://github.com/HaoWuLab-Bioinformatics/iPro-WAEL.
Collapse
Affiliation(s)
- Pengyu Zhang
- School of Software, Shandong University, Jinan, 250101, Shandong, China.,College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China
| | - Hongming Zhang
- College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China
| | - Hao Wu
- School of Software, Shandong University, Jinan, 250101, Shandong, China
| |
Collapse
|
47
|
Cross-species enhancer prediction using machine learning. Genomics 2022; 114:110454. [PMID: 36030022 DOI: 10.1016/j.ygeno.2022.110454] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2022] [Revised: 07/28/2022] [Accepted: 08/16/2022] [Indexed: 11/21/2022]
Abstract
Cis-regulatory elements (CREs) are non-coding parts of the genome that play a critical role in gene expression regulation. Enhancers, as an important example of CREs, interact with genes to influence complex traits like disease, heat tolerance and growth rate. Much of what is known about enhancers come from studies of humans and a few model organisms like mouse, with little known about other mammalian species. Previous studies have attempted to identify enhancers in less studied mammals using comparative genomics but with limited success. Recently, Machine Learning (ML) techniques have shown promising results to predict enhancer regions. Here, we investigated the ability of ML methods to identify enhancers in three non-model mammalian species (cattle, pig and dog) using human and mouse enhancer data from VISTA and publicly available ChIP-seq. We tested nine models, using four different representations of the DNA sequences in cross-species prediction using both the VISTA dataset and species-specific ChIP-seq data. We identified between 809,399 and 877,278 enhancer-like regions (ELRs) in the study species (11.6-13.7% of each genome). These predictions were close to the ~8% proportion of ELRs that covered the human genome. We propose that our ML methods have predictive ability for identifying enhancers in non-model mammalian species. We have provided a list of high confidence enhancers at https://github.com/DaviesCentreInformatics/Cross-species-enhancer-prediction and believe these enhancers will be of great use to the community.
Collapse
|
48
|
CapsProm: a capsule network for promoter prediction. Comput Biol Med 2022; 147:105627. [DOI: 10.1016/j.compbiomed.2022.105627] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2021] [Revised: 04/05/2022] [Accepted: 04/11/2022] [Indexed: 11/21/2022]
|
49
|
Database of Potential Promoter Sequences in the Capsicum annuum Genome. BIOLOGY 2022; 11:biology11081117. [PMID: 35892972 PMCID: PMC9332048 DOI: 10.3390/biology11081117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/01/2022] [Revised: 07/19/2022] [Accepted: 07/23/2022] [Indexed: 11/16/2022]
Abstract
In this study, we used a mathematical method for the multiple alignment of highly divergent sequences (MAHDS) to create a database of potential promoter sequences (PPSs) in the Capsicum annuum genome. To search for PPSs, 20 statistically significant classes of sequences located in the range from −499 to +100 nucleotides near the annotated genes were calculated. For each class, a position–weight matrix (PWM) was computed and then used to identify PPSs in the C. annuum genome. In total, 825,136 PPSs were detected, with a false positive rate of 0.13%. The PPSs obtained with the MAHDS method were tested using TSSFinder, which detects transcription start sites. The databank of the found PPSs provides their coordinates in chromosomes, the alignment of each PPS with the PWM, and the level of statistical significance as a normal distribution argument, and can be used in genetic engineering and biotechnology.
Collapse
|
50
|
Liu C, Yu H, Zhang B, Liu S, Liu CG, Li F, Song H. Engineering whole-cell microbial biosensors: Design principles and applications in monitoring and treatment of heavy metals and organic pollutants. Biotechnol Adv 2022; 60:108019. [PMID: 35853551 DOI: 10.1016/j.biotechadv.2022.108019] [Citation(s) in RCA: 44] [Impact Index Per Article: 14.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2022] [Revised: 07/13/2022] [Accepted: 07/13/2022] [Indexed: 01/18/2023]
Abstract
Biosensors have been widely used as cost-effective, rapid, in situ, and real-time analytical tools for monitoring environments. The development of synthetic biology has enabled emergence of genetically engineered whole-cell microbial biosensors. This review updates the design and optimization principles for a diverse array of whole-cell biosensors based on transcription factors (TF) including activators or repressors derived from heavy metal resistance systems, alkanes, and aromatics metabolic pathways of bacteria. By designing genetic circuits, the whole-cell biosensors could be engineered to intelligently sense heavy metals (Hg2+, Zn2+, Pb2+, Au3+, Cd2+, As3+, Ni2+, Cu2+, and UO22+) or organic compounds (alcohols, alkanes, phenols, and benzenes) through one-component or two-component system-based TFs, transduce signals through genetic amplifiers, and response as various outputs such as cell fluorescence and bioelectricity for monitoring heavy metals and organic pollutants in real conditions, synthetic curli and surface metal-binding peptides for in situ bio-sorption of heavy metals. We further review strategies that have been implemented to optimize the selectivity and correlation between ligand concentration and output signal of the TF-based biosensors, so as to meet requirements of practical applications. The optimization strategies include protein engineering to change specificities, promoter engineering to improve sensitivities, and genetic circuit-based amplification to enhance dynamic ranges via designing transcriptional amplifiers, logic gates, and feedback loops. At last, we outlook future trends in developing novel forms of biosensors.
Collapse
Affiliation(s)
- Changjiang Liu
- Frontiers Science Center for Synthetic Biology (Ministry of Education), Key Laboratory of Systems Bioengineering, Tianjin University, Tianjin 300072, China; Collaborative Innovation Center of Chemical Science and Engineering, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
| | - Huan Yu
- Frontiers Science Center for Synthetic Biology (Ministry of Education), Key Laboratory of Systems Bioengineering, Tianjin University, Tianjin 300072, China; Collaborative Innovation Center of Chemical Science and Engineering, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
| | - Baocai Zhang
- Frontiers Science Center for Synthetic Biology (Ministry of Education), Key Laboratory of Systems Bioengineering, Tianjin University, Tianjin 300072, China; Collaborative Innovation Center of Chemical Science and Engineering, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
| | - Shilin Liu
- Collaborative Innovation Center of Chemical Science and Engineering, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
| | - Chen-Guang Liu
- State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic & Developmental Sciences of Ministry of Education, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Feng Li
- Frontiers Science Center for Synthetic Biology (Ministry of Education), Key Laboratory of Systems Bioengineering, Tianjin University, Tianjin 300072, China; Collaborative Innovation Center of Chemical Science and Engineering, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China.
| | - Hao Song
- Frontiers Science Center for Synthetic Biology (Ministry of Education), Key Laboratory of Systems Bioengineering, Tianjin University, Tianjin 300072, China; Collaborative Innovation Center of Chemical Science and Engineering, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China.
| |
Collapse
|