1
|
Zhai J, Gokaslan A, Schiff Y, Berthel A, Liu ZY, Lai WY, Miller ZR, Scheben A, Stitzer MC, Romay MC, Buckler ES, Kuleshov V. Cross-species modeling of plant genomes at single-nucleotide resolution using a pretrained DNA language model. Proc Natl Acad Sci U S A 2025; 122:e2421738122. [PMID: 40489624 DOI: 10.1073/pnas.2421738122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2024] [Accepted: 04/17/2025] [Indexed: 06/11/2025] Open
Abstract
Interpreting function and fitness effects in diverse plant genomes requires transferable models. Language models (LMs) pretrained on large-scale biological sequences can capture evolutionary conservation and offer cross-species prediction better than supervised models through fine-tuning limited labeled data. We introduce PlantCaduceus, a plant DNA LM that learns evolutionary conservation patterns in 16 angiosperm genomes by modeling both DNA strands simultaneously. When fine-tuned on a small set of labeled Arabidopsis data for tasks such as predicting translation initiation/termination sites and splice donor/acceptor sites, PlantCaduceus demonstrated remarkable transferability to maize, which diverged 160 Mya. The model outperformed the best existing DNA language model by 1.45-fold in maize splice donor prediction and 7.23-fold in maize translation initiation site prediction. In variant effect prediction, PlantCaduceus showed performance comparative to state-of-the-art protein LMs. Mutations predicted to be deleterious by PlantCaduceus showed threefold lower average minor allele frequencies compared to those identified by multiple sequence alignment-based methods. Additionally, PlantCaduceus successfully identifies well-known causal variants in both Arabidopsis and maize. Overall, PlantCaduceus is a versatile DNA LM that can accelerate plant genomics and crop breeding applications.
Collapse
Affiliation(s)
- Jingjing Zhai
- Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853
| | - Aaron Gokaslan
- Department of Computer Science, Cornell University, Ithaca, NY 14853
| | - Yair Schiff
- Department of Computer Science, Cornell University, Ithaca, NY 14853
| | - Ana Berthel
- Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853
| | - Zong-Yan Liu
- Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY 14853
| | - Wei-Yun Lai
- Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853
| | - Zachary R Miller
- Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853
| | - Armin Scheben
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratoryx, Cold Spring Harbor, NY 11724
| | | | - M Cinta Romay
- Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853
| | - Edward S Buckler
- Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853
- Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY 14853
- Agricultural Research Service, US Department of Agriculture, Ithaca, NY 14853
| | | |
Collapse
|
2
|
Zhai J, Zhang Y, Zhang C, Yin X, Song M, Tang C, Ding P, Li Z, Ma C. deepTFBS: Improving within- and Cross-Species Prediction of Transcription Factor Binding Using Deep Multi-Task and Transfer Learning. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2025:e03135. [PMID: 40411397 DOI: 10.1002/advs.202503135] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/18/2025] [Revised: 04/24/2025] [Indexed: 05/26/2025]
Abstract
The precise prediction of transcription factor binding sites (TFBSs) is crucial in understanding gene regulation. In this study, deepTFBS, a comprehensive deep learning (DL) framework that builds a robust DNA language model of TF binding grammar for accurately predicting TFBSs within and across plant species is presented. Taking advantages of multi-task DL and transfer learning, deepTFBS is capable of leveraging the knowledge learned from large-scale TF binding profiles to enhance the prediction of TFBSs under small-sample training and cross-species prediction tasks. When tested using available information on 359 Arabidopsis TFs, deepTFBS outperformed previously described prediction strategies, including position weight matrix, deepSEA and DanQ, with a 244.49%, 49.15%, and 23.32% improvement of the area under the precision-recall curve (PRAUC), respectively. Further cross-species prediction of TFBS in wheat showed that deepTFBS yielded a significant PRAUC improvement of 30.6% over these three baseline models. deepTFBS can also utilize information from gene conservation and binding motifs, enabling efficient TFBS prediction in species where experimental data availability is limited. A case study, focusing on the WUSCHEL (WUS) transcription factor, illustrated the potential use of deepTFBS in cross-species applications, in our example between Arabidopsis and wheat. deepTFBS is publically available at https://github.com/cma2015/deepTFBS.
Collapse
Affiliation(s)
- Jingjing Zhai
- State Key Laboratory for Crop Stress Resistance and High-Efficiency Production, Center of Bioinformatics, College of Life Sciences, Northwest A&F University, Yangling, Shaanxi, 712100, China
- Key Laboratory of Biology and Genetics Improvement of Maize in Arid Area of Northwest Region, Ministry of Agriculture, Northwest A&F University, Yangling, Shaanxi, 712100, China
| | - Yuzhou Zhang
- State Key Laboratory for Crop Stress Resistance and High-Efficiency Production, Center of Bioinformatics, College of Life Sciences, Northwest A&F University, Yangling, Shaanxi, 712100, China
| | - Chujun Zhang
- State Key Laboratory for Crop Stress Resistance and High-Efficiency Production, Center of Bioinformatics, College of Life Sciences, Northwest A&F University, Yangling, Shaanxi, 712100, China
- Key Laboratory of Biology and Genetics Improvement of Maize in Arid Area of Northwest Region, Ministry of Agriculture, Northwest A&F University, Yangling, Shaanxi, 712100, China
| | - Xiaotong Yin
- College of Life Sciences, Northwest A&F University, Yangling, Shaanxi, 712100, China
| | - Minggui Song
- State Key Laboratory for Crop Stress Resistance and High-Efficiency Production, Center of Bioinformatics, College of Life Sciences, Northwest A&F University, Yangling, Shaanxi, 712100, China
- Key Laboratory of Biology and Genetics Improvement of Maize in Arid Area of Northwest Region, Ministry of Agriculture, Northwest A&F University, Yangling, Shaanxi, 712100, China
| | - Chenglong Tang
- College of Life Sciences, Northwest A&F University, Yangling, Shaanxi, 712100, China
| | - Pengjun Ding
- State Key Laboratory for Crop Stress Resistance and High-Efficiency Production, Center of Bioinformatics, College of Life Sciences, Northwest A&F University, Yangling, Shaanxi, 712100, China
- Key Laboratory of Biology and Genetics Improvement of Maize in Arid Area of Northwest Region, Ministry of Agriculture, Northwest A&F University, Yangling, Shaanxi, 712100, China
| | - Zenglin Li
- State Key Laboratory for Crop Stress Resistance and High-Efficiency Production, Center of Bioinformatics, College of Life Sciences, Northwest A&F University, Yangling, Shaanxi, 712100, China
- Key Laboratory of Biology and Genetics Improvement of Maize in Arid Area of Northwest Region, Ministry of Agriculture, Northwest A&F University, Yangling, Shaanxi, 712100, China
| | - Chuang Ma
- State Key Laboratory for Crop Stress Resistance and High-Efficiency Production, Center of Bioinformatics, College of Life Sciences, Northwest A&F University, Yangling, Shaanxi, 712100, China
- Key Laboratory of Biology and Genetics Improvement of Maize in Arid Area of Northwest Region, Ministry of Agriculture, Northwest A&F University, Yangling, Shaanxi, 712100, China
| |
Collapse
|
3
|
Cui X, Yin Q, Gao Z, Li Z, Chen X, Lv H, Chen S, Liu Q, Zeng W, Jiang R. CREATE: cell-type-specific cis-regulatory element identification via discrete embedding. Nat Commun 2025; 16:4607. [PMID: 40382355 PMCID: PMC12085597 DOI: 10.1038/s41467-025-59780-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2024] [Accepted: 05/02/2025] [Indexed: 05/20/2025] Open
Abstract
Cis-regulatory elements (CREs), including enhancers, silencers, promoters and insulators, play pivotal roles in orchestrating gene regulatory mechanisms that drive complex biological traits. However, current approaches for CRE identification are predominantly sequence-based and typically focus on individual CRE types, limiting insights into their cell-type-specific functions and regulatory dynamics. Here, we present CREATE, a multimodal deep learning framework based on Vector Quantized Variational AutoEncoder, tailored for comprehensive CRE identification and characterization. CREATE integrates genomic sequences, chromatin accessibility, and chromatin interaction data to generate discrete CRE embeddings, enabling accurate multi-class classification and robust characterization of CREs. CREATE excels in identifying cell-type-specific CREs, and provides quantitative and interpretable insights into CRE-specific features, uncovering the underlying regulatory codes. By facilitating large-scale prediction of CREs in specific cell types, CREATE enhances the recognition of disease- or phenotype-associated biological variabilities of CREs, thus advancing our understanding of gene regulatory landscapes and their roles in health and disease.
Collapse
Affiliation(s)
- Xuejian Cui
- Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
| | - Qijin Yin
- Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
| | - Zijing Gao
- Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
| | - Zhen Li
- Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
| | - Xiaoyang Chen
- Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
| | - Hairong Lv
- Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
| | - Shengquan Chen
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China
| | - Qiao Liu
- Department of Statistics, Stanford University, Stanford, CA, USA
| | - Wanwen Zeng
- Department of Statistics, Stanford University, Stanford, CA, USA.
| | - Rui Jiang
- Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China.
| |
Collapse
|
4
|
Mondo SJ, He G, Sharma A, Ciobanu D, Riley R, Andreopoulos WB, Lipzen A, Kuo A, LaButti K, Pangilinan J, Salamov A, Salamon H, Shu L, Gladden J, Magnuson J, Aime MC, O’Malley R, Grigoriev IV. Consecutive low-frequency shifts in A/T content denote nucleosome positions across microeukaryotes. iScience 2025; 28:112472. [PMID: 40491964 PMCID: PMC12146615 DOI: 10.1016/j.isci.2025.112472] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2024] [Revised: 12/26/2024] [Accepted: 04/15/2025] [Indexed: 06/11/2025] Open
Abstract
Nucleosomes are the basic repeating unit, each spanning ≈150bp, that structures DNA in the nucleus and their positions have major consequences on gene activity. Here, through analyzing DNA signatures across 1117 microeukaryote genomes, we discovered ≈150bp shifts in A/T content associated with nucleosome organization. Often consecutively arrayed across the genome, A/T peaks were enriched surrounding transcriptional start sites in specific clades. Most nucleosomes (both in vitro and in vivo) across eukaryotes aligned with A/T peaks, even in the presence of DNA modifications. Using artificial intelligence-based approaches, we describe DNA features associated with nucleosomes and construct a deep learning (DL) model for improved nucleosome occupancy prediction. Using this model, we found that ≈70% of "random" transfer DNA inserts from an in vivo fungal RB-TDNAseq library avoided DL predicted nucleosome-bound regions. This study reveals a eukaryote-wide strategy for generating cassettes of nucleosome-favorable DNAs that has a profound impact on nucleosome organization.
Collapse
Affiliation(s)
- Stephen J. Mondo
- US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Department of Agricultural Biology, Colorado State University, Fort Collins, CO 80523, USA
| | - Guifen He
- US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Aditi Sharma
- US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Doina Ciobanu
- US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Robert Riley
- US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - William B. Andreopoulos
- US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Anna Lipzen
- US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Alan Kuo
- US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Kurt LaButti
- US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Jasmyn Pangilinan
- US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Asaf Salamov
- US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Hugh Salamon
- US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Lili Shu
- College of Horticulture, Shenyang Agricultural University, Shenyang 110866, P.R. China
- Engineering Research Center of Chinese Ministry of Education for Edible and Medicinal Fungi, Jilin Agricultural University, Changchun 130118, P.R. China
| | - John Gladden
- Joint BioEnergy Institute, Emeryville, CA 94608, USA
| | - Jon Magnuson
- Pacific Northwest National Laboratory, Richland, WA 99352, USA
| | - M. Catherine Aime
- Department of Botany and Plant Pathology, Purdue University, West Lafayette, IN, USA
| | - Ronan O’Malley
- US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Igor V. Grigoriev
- US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Department of Plant and Microbial Biology, University of California, Berkeley, Berkeley, CA 94720, USA
| |
Collapse
|
5
|
He G, Ye J, Hao H, Chen W. A KAN-based hybrid deep neural networks for accurate identification of transcription factor binding sites. PLoS One 2025; 20:e0322978. [PMID: 40334196 PMCID: PMC12058130 DOI: 10.1371/journal.pone.0322978] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2025] [Accepted: 04/01/2025] [Indexed: 05/09/2025] Open
Abstract
BACKGROUND Predicting protein-DNA binding sites in vivo is a challenging but urgent task in many fields such as drug design and development. Most promoters contain many transcription factor (TF) binding sites, yet only a few have been identified through time-consuming biochemical experiments. To address this challenge, numerous computational approaches have been proposed to predict TF binding sites from DNA sequences. However, current deep learning methods often face issues such as gradient vanishing as the model depth increases, leading to suboptimal feature extraction. RESULTS We propose a model called CBR-KAN (where C represents Convolutional Neural Network (CNN), B represents Bidirectional Long Short Term Memory (BiLSTM), and R represents Residual Mechanism) to predict transcription factor binding sites. Specifically, we designed a multi-scale convolution module (ConvBlock1, 2, 3) combined with BiLSTM network, introduced KAN network to replace traditional multilayer perceptron, and promoted model optimization through residual connections. Testing on 50 common ChIP seq benchmark datasets shows that CBR-KAN outperforms other state-of-the-art methods such as DeepBind, DanQ, DeepD2V, and DeepSEA in predicting TF binding sites. CONCLUSIONS The CBR-KAN model significantly improves prediction accuracy for transcription factor binding sites by effectively integrating multiple neural network architectures and mechanisms. This approach not only enhances feature extraction but also stabilizes training and boosts generalization capabilities. The promising results on multiple key performance indicators demonstrate the potential of CBR-KAN in bioinformatics applications.
Collapse
Affiliation(s)
- Guodong He
- School of Information Engineering, Wenzhou Business College, Wenzhou, Zhejiang, PR China
| | - Jiahao Ye
- School of Information Engineering, Wenzhou Business College, Wenzhou, Zhejiang, PR China
| | - Huijun Hao
- School of Information Engineering, Wenzhou Business College, Wenzhou, Zhejiang, PR China
| | - Wei Chen
- School of Information Engineering, Wenzhou Business College, Wenzhou, Zhejiang, PR China
| |
Collapse
|
6
|
Yang J, Zhou F, Luo X, Fang Y, Wang X, Liu X, Xiao R, Jiang D, Tang Y, Yang G, You L, Zhao Y. Enhancer reprogramming: critical roles in cancer and promising therapeutic strategies. Cell Death Discov 2025; 11:84. [PMID: 40032852 DOI: 10.1038/s41420-025-02366-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2024] [Revised: 01/24/2025] [Accepted: 02/19/2025] [Indexed: 03/05/2025] Open
Abstract
Transcriptional dysregulation is a hallmark of cancer initiation and progression, driven by genetic and epigenetic alterations. Enhancer reprogramming has emerged as a pivotal driver of carcinogenesis, with cancer cells often relying on aberrant transcriptional programs. The advent of high-throughput sequencing technologies has provided critical insights into enhancer reprogramming events and their role in malignancy. While targeting enhancers presents a promising therapeutic strategy, significant challenges remain. These include the off-target effects of enhancer-targeting technologies, the complexity and redundancy of enhancer networks, and the dynamic nature of enhancer reprogramming, which may contribute to therapeutic resistance. This review comprehensively encapsulates the structural attributes of enhancers, delineates the mechanisms underlying their dysregulation in malignant transformation, and evaluates the therapeutic opportunities and limitations associated with targeting enhancers in cancer.
Collapse
Affiliation(s)
- Jinshou Yang
- Department of General Surgery, Peking Union Medical College Hospital, Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing, PR China
- Key Laboratory of Research in Pancreatic Tumor, Chinese Academy of Medical Sciences, Beijing, PR China
- National Science and Technology Key Infrastructure on Translational Medicine in Peking Union Medical College Hospital, Beijing, PR China
| | - Feihan Zhou
- Department of General Surgery, Peking Union Medical College Hospital, Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing, PR China
- Key Laboratory of Research in Pancreatic Tumor, Chinese Academy of Medical Sciences, Beijing, PR China
- National Science and Technology Key Infrastructure on Translational Medicine in Peking Union Medical College Hospital, Beijing, PR China
| | - Xiyuan Luo
- Department of General Surgery, Peking Union Medical College Hospital, Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing, PR China
- Key Laboratory of Research in Pancreatic Tumor, Chinese Academy of Medical Sciences, Beijing, PR China
- National Science and Technology Key Infrastructure on Translational Medicine in Peking Union Medical College Hospital, Beijing, PR China
| | - Yuan Fang
- Department of General Surgery, Peking Union Medical College Hospital, Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing, PR China
- Key Laboratory of Research in Pancreatic Tumor, Chinese Academy of Medical Sciences, Beijing, PR China
- National Science and Technology Key Infrastructure on Translational Medicine in Peking Union Medical College Hospital, Beijing, PR China
| | - Xing Wang
- Department of General Surgery, Peking Union Medical College Hospital, Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing, PR China
- Key Laboratory of Research in Pancreatic Tumor, Chinese Academy of Medical Sciences, Beijing, PR China
- National Science and Technology Key Infrastructure on Translational Medicine in Peking Union Medical College Hospital, Beijing, PR China
| | - Xiaohong Liu
- Department of General Surgery, Peking Union Medical College Hospital, Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing, PR China
- Key Laboratory of Research in Pancreatic Tumor, Chinese Academy of Medical Sciences, Beijing, PR China
- National Science and Technology Key Infrastructure on Translational Medicine in Peking Union Medical College Hospital, Beijing, PR China
| | - Ruiling Xiao
- Department of General Surgery, Peking Union Medical College Hospital, Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing, PR China
- Key Laboratory of Research in Pancreatic Tumor, Chinese Academy of Medical Sciences, Beijing, PR China
- National Science and Technology Key Infrastructure on Translational Medicine in Peking Union Medical College Hospital, Beijing, PR China
| | - Decheng Jiang
- Department of General Surgery, Peking Union Medical College Hospital, Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing, PR China
- Key Laboratory of Research in Pancreatic Tumor, Chinese Academy of Medical Sciences, Beijing, PR China
- National Science and Technology Key Infrastructure on Translational Medicine in Peking Union Medical College Hospital, Beijing, PR China
| | - Yuemeng Tang
- Department of General Surgery, Peking Union Medical College Hospital, Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing, PR China
- Key Laboratory of Research in Pancreatic Tumor, Chinese Academy of Medical Sciences, Beijing, PR China
- National Science and Technology Key Infrastructure on Translational Medicine in Peking Union Medical College Hospital, Beijing, PR China
| | - Gang Yang
- Department of General Surgery, Peking Union Medical College Hospital, Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing, PR China.
- Key Laboratory of Research in Pancreatic Tumor, Chinese Academy of Medical Sciences, Beijing, PR China.
- National Science and Technology Key Infrastructure on Translational Medicine in Peking Union Medical College Hospital, Beijing, PR China.
| | - Lei You
- Department of General Surgery, Peking Union Medical College Hospital, Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing, PR China.
- Key Laboratory of Research in Pancreatic Tumor, Chinese Academy of Medical Sciences, Beijing, PR China.
- National Science and Technology Key Infrastructure on Translational Medicine in Peking Union Medical College Hospital, Beijing, PR China.
| | - Yupei Zhao
- Department of General Surgery, Peking Union Medical College Hospital, Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing, PR China.
- Key Laboratory of Research in Pancreatic Tumor, Chinese Academy of Medical Sciences, Beijing, PR China.
- National Science and Technology Key Infrastructure on Translational Medicine in Peking Union Medical College Hospital, Beijing, PR China.
| |
Collapse
|
7
|
An Z, Jiang A, Chen J. Toward understanding the role of genomic repeat elements in neurodegenerative diseases. Neural Regen Res 2025; 20:646-659. [PMID: 38886931 PMCID: PMC11433896 DOI: 10.4103/nrr.nrr-d-23-01568] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2023] [Revised: 12/21/2023] [Accepted: 03/02/2024] [Indexed: 06/20/2024] Open
Abstract
Neurodegenerative diseases cause great medical and economic burdens for both patients and society; however, the complex molecular mechanisms thereof are not yet well understood. With the development of high-coverage sequencing technology, researchers have started to notice that genomic repeat regions, previously neglected in search of disease culprits, are active contributors to multiple neurodegenerative diseases. In this review, we describe the association between repeat element variants and multiple degenerative diseases through genome-wide association studies and targeted sequencing. We discuss the identification of disease-relevant repeat element variants, further powered by the advancement of long-read sequencing technologies and their related tools, and summarize recent findings in the molecular mechanisms of repeat element variants in brain degeneration, such as those causing transcriptional silencing or RNA-mediated gain of toxic function. Furthermore, we describe how in silico predictions using innovative computational models, such as deep learning language models, could enhance and accelerate our understanding of the functional impact of repeat element variants. Finally, we discuss future directions to advance current findings for a better understanding of neurodegenerative diseases and the clinical applications of genomic repeat elements.
Collapse
Affiliation(s)
- Zhengyu An
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
| | - Aidi Jiang
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
| | - Jingqi Chen
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
- MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
- MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Fudan University, Shanghai, China
- Zhangjiang Fudan International Innovation Center, Shanghai, China
| |
Collapse
|
8
|
Achterberg T, de Jong A. ProPr54 web server: predicting σ 54 promoters and regulon with a hybrid convolutional and recurrent deep neural network. NAR Genom Bioinform 2025; 7:lqae188. [PMID: 39781509 PMCID: PMC11704786 DOI: 10.1093/nargab/lqae188] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2024] [Revised: 11/19/2024] [Accepted: 12/23/2024] [Indexed: 01/12/2025] Open
Abstract
σ54 serves as an unconventional sigma factor with a distinct mechanism of transcription initiation, which depends on the involvement of a transcription activator. This unique sigma factor σ54 is indispensable for orchestrating the transcription of genes crucial to nitrogen regulation, flagella biosynthesis, motility, chemotaxis and various other essential cellular processes. Currently, no comprehensive tools are available to determine σ54 promoters and regulon in bacterial genomes. Here, we report a σ54 promoter prediction method ProPr54, based on a convolutional neural network trained on a set of 446 validated σ54 binding sites derived from 33 bacterial species. Model performance was tested and compared with respect to bacterial intergenic regions, demonstrating robust applicability. ProPr54 exhibits high performance when tested on various bacterial species, highly surpassing other available σ54 regulon identification methods. Furthermore, analysis on bacterial genomes, which have no experimentally validated σ54 binding sites, demonstrates the generalization of the model. ProPr54 is the first reliable in silico method for predicting σ54 binding sites, making it a valuable tool to support experimental studies on σ54. In conclusion, ProPr54 offers a reliable, broadly applicable tool for predicting σ54 promoters and regulon genes in bacterial genome sequences. A web server is freely accessible at http://propr54.molgenrug.nl.
Collapse
Affiliation(s)
- Tristan Achterberg
- Department of Molecular Genetics, Groningen, Biomolecular Sciences and Biotechnology Institute, University of Groningen, Nijenborgh 7, 9747 AG Groningen, the Netherlands
| | - Anne de Jong
- Department of Molecular Genetics, Groningen, Biomolecular Sciences and Biotechnology Institute, University of Groningen, Nijenborgh 7, 9747 AG Groningen, the Netherlands
| |
Collapse
|
9
|
Li J, Zhang P, Xi X, Liu L, Wei L, Wang X. Modeling and designing enhancers by introducing and harnessing transcription factor binding units. Nat Commun 2025; 16:1469. [PMID: 39922842 PMCID: PMC11807178 DOI: 10.1038/s41467-025-56749-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2024] [Accepted: 01/24/2025] [Indexed: 02/10/2025] Open
Abstract
Enhancers serve as pivotal regulators of gene expression throughout various biological processes by interacting with transcription factors (TFs). While transcription factor binding sites (TFBSs) are widely acknowledged as key determinants of TF binding and enhancer activity, the significant role of their surrounding context sequences remains to be quantitatively characterized. Here we propose the concept of transcription factor binding unit (TFBU) to modularly model enhancers by quantifying the impact of context sequences surrounding TFBSs using deep learning models. Based on this concept, we develop DeepTFBU, a comprehensive toolkit for enhancer design. We demonstrate that designing TFBS context sequences can significantly modulate enhancer activities and produce cell type-specific responses. DeepTFBU is also highly efficient in the de novo design of enhancers containing multiple TFBSs. Furthermore, DeepTFBU enables flexible decoupling and optimization of generalized enhancers. We prove that TFBU is a crucial concept, and DeepTFBU is highly effective for rational enhancer design.
Collapse
Affiliation(s)
- Jiaqi Li
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing, China
| | - Pengcheng Zhang
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing, China
| | - Xi Xi
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing, China
| | - Liyang Liu
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing, China
| | - Lei Wei
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing, China
| | - Xiaowo Wang
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing, China.
| |
Collapse
|
10
|
He J, Zhang Y, Liu Y, Zhou Z, Li T, Zhang Y, Xie B. BCDB: A dual-branch network based on transformer for predicting transcription factor binding sites. Methods 2025; 234:141-151. [PMID: 39701486 DOI: 10.1016/j.ymeth.2024.12.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2024] [Revised: 11/28/2024] [Accepted: 12/08/2024] [Indexed: 12/21/2024] Open
Abstract
Transcription factor binding sites (TFBSs) are critical in regulating gene expression. Precisely locating TFBSs can reveal the mechanisms of action of different transcription factors in gene transcription. Various deep learning methods have been proposed to predict TFBS; however, these models often need help demonstrating ideal performance under limited data conditions. Furthermore, these models typically have complex structures, which makes their decision-making processes difficult to transparentize. Addressing these issues, we have developed a framework named BCDB. This framework integrates multi-scale DNA information and employs a dual-branch output strategy. Integrating DNABERT, convolutional neural networks (CNN), and multi-head attention mechanisms enhances the feature extraction capabilities, significantly improving the accuracy of predictions. This innovative method aims to balance the extraction of global and local information, enhancing predictive performance while utilizing attention mechanisms to provide an intuitive way to explain the model's predictions, thus strengthening the overall interpretability of the model. Prediction results on 165 ChIP-seq datasets show that BCDB significantly outperforms other existing deep learning methods in terms of performance. Additionally, since the BCDB model utilizes transfer learning methods, it can transfer knowledge learned from many unlabeled data to specific cell line prediction tasks, allowing our model to achieve cross-cell line TFBS prediction. The source code for BCDB is available on https://github.com/ZhangLab312/BCDB.
Collapse
Affiliation(s)
- Jia He
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
| | - Yupeng Zhang
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
| | - Yuhang Liu
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
| | - Zhigan Zhou
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
| | - Tianhao Li
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
| | - Yongqing Zhang
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
| | - Boqia Xie
- Department of Cardiology, Cardiovascualr Imaging Center, Beijing Chaoyang Hospital, Capital Medical University, Beijing, China.
| |
Collapse
|
11
|
Cheng W, Song Z, Zhang Y, Wang S, Wang D, Yang M, Li L, Ma J. DNALongBench: A Benchmark Suite for Long-Range DNA Prediction Tasks. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.01.06.631595. [PMID: 39829833 PMCID: PMC11741265 DOI: 10.1101/2025.01.06.631595] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 01/22/2025]
Abstract
Modeling long-range DNA dependencies is crucial for understanding genome structure and function across a wide range of biological contexts. However, effectively capturing these extensive dependencies, which may span millions of base pairs in tasks such as three-dimensional (3D) chromatin folding prediction, remains a significant challenge. Furthermore, a comprehensive benchmark suite for evaluating tasks that rely on long-range dependencies is notably absent. To address this gap, we introduce DNALongBench, a benchmark dataset encompassing five important genomics tasks that consider long-range dependencies up to 1 million base pairs: enhancer-target gene interaction, expression quantitative trait loci, 3D genome organization, regulatory sequence activity, and transcription initiation signals. To comprehensively assess DNALongBench, we evaluate the performance of five methods: a task-specific expert model, a convolutional neural network (CNN)-based model, and three fine-tuned DNA foundation models - HyenaDNA, Caduceus-Ph, and Caduceus-PS. We envision DNALongBench as a standardized resource with the potential to facilitate comprehensive comparisons and rigorous evaluations of emerging DNA sequence-based deep learning models that account for long-range dependencies.
Collapse
Affiliation(s)
- Wenduo Cheng
- Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Zhenqiao Song
- Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Yang Zhang
- Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Shike Wang
- Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Danqing Wang
- Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Muyu Yang
- Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Lei Li
- Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Jian Ma
- Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| |
Collapse
|
12
|
Bréhélin L. Advancing Regulatory Genomics With Machine Learning. Bioinform Biol Insights 2024; 18:11779322241249562. [PMID: 39735654 PMCID: PMC11672376 DOI: 10.1177/11779322241249562] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2023] [Accepted: 04/09/2024] [Indexed: 12/31/2024] Open
Abstract
In recent years, several machine learning (ML) approaches have been proposed to predict gene expression signal and chromatin features from the DNA sequence alone. These models are often used to deduce and, to some extent, assess putative new biological insights about gene regulation, and they have led to very interesting advances in regulatory genomics. This article reviews a selection of these methods, ranging from linear models to random forests, kernel methods, and more advanced deep learning models. Specifically, we detail the different techniques and strategies that can be used to extract new gene-regulation hypotheses from these models. Furthermore, because these putative insights need to be validated with wet-lab experiments, we emphasize that it is important to have a measure of confidence associated with the extracted hypotheses. We review the procedures that have been proposed to measure this confidence for the different types of ML models, and we discuss the fact that they do not provide the same kind of information.
Collapse
|
13
|
Gage JL, Romay MC, Buckler ES. Maize inbreds show allelic variation for diel transcription patterns. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.12.16.628400. [PMID: 39763849 PMCID: PMC11702552 DOI: 10.1101/2024.12.16.628400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Indexed: 01/19/2025]
Abstract
Circadian entrainment and external cues can cause gene transcript abundance to oscillate throughout the day, and these patterns of diel transcript oscillation vary across genes and plant species. Less is known about within-species allelic variation for diel patterns of transcript oscillation, or about how regulatory sequence variation influences diel transcription patterns. In this study, we evaluated diel transcript abundance for 24 diverse maize inbred lines. We observed extensive natural variation in diel transcription patterns, with two-fold variation in the number of genes that oscillate over the course of the day. A convolutional neural network trained to predict oscillation from promoter sequence identified sequences previously reported as binding motifs for known circadian clock genes in other plant systems. Genes showing diel transcription patterns that cosegregate with promoter sequence haplotypes are enriched for associations with photoperiod sensitivity and may have been indirect targets of selection as maize was adapted to longer day lengths at higher latitudes. These findings support the idea that cis-regulatory sequence variation influences patterns of gene expression, which in turn can have effects on phenotypic plasticity and local adaptation.
Collapse
Affiliation(s)
- Joseph L. Gage
- Department of Crop and Soil Sciences, North Carolina State University, Raleigh, NC 27695
- NC Plant Sciences Initiative, North Carolina State University, Raleigh, NC, 27606
| | - M. Cinta Romay
- Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853
| | - Edward S. Buckler
- Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853
- USDA-ARS, Ithaca, NY 14850
- School of Integrative Plant Science, Plant Breeding and Genetics Section, Cornell University, Ithaca NY 14853
| |
Collapse
|
14
|
Tahir M, Norouzi M, Khan SS, Davie JR, Yamanaka S, Ashraf A. Artificial intelligence and deep learning algorithms for epigenetic sequence analysis: A review for epigeneticists and AI experts. Comput Biol Med 2024; 183:109302. [PMID: 39500240 DOI: 10.1016/j.compbiomed.2024.109302] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/20/2024] [Revised: 09/22/2024] [Accepted: 10/17/2024] [Indexed: 11/20/2024]
Abstract
Epigenetics encompasses mechanisms that can alter the expression of genes without changing the underlying genetic sequence. The epigenetic regulation of gene expression is initiated and sustained by several mechanisms such as DNA methylation, histone modifications, chromatin conformation, and non-coding RNA. The changes in gene regulation and expression can manifest in the form of various diseases and disorders such as cancer and congenital deformities. Over the last few decades, high-throughput experimental approaches have been used to identify and understand epigenetic changes, but these laboratory experimental approaches and biochemical processes are time-consuming and expensive. To overcome these challenges, machine learning and artificial intelligence (AI) approaches have been extensively used for mapping epigenetic modifications to their phenotypic manifestations. In this paper we provide a narrative review of published research on AI models trained on epigenomic data to address a variety of problems such as prediction of disease markers, gene expression, enhancer-promoter interaction, and chromatin states. The purpose of this review is twofold as it is addressed to both AI experts and epigeneticists. For AI researchers, we provided a taxonomy of epigenetics research problems that can benefit from an AI-based approach. For epigeneticists, given each of the above problems we provide a list of candidate AI solutions in the literature. We have also identified several gaps in the literature, research challenges, and recommendations to address these challenges.
Collapse
Affiliation(s)
- Muhammad Tahir
- Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, R3T 5V6, MB, Canada
| | - Mahboobeh Norouzi
- Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, R3T 5V6, MB, Canada
| | - Shehroz S Khan
- College of Engineering and Technology, American University of the Middle East, Kuwait
| | - James R Davie
- Department of Biochemistry and Medical Genetics, Max Rady College of Medicine, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, MB, Canada
| | - Soichiro Yamanaka
- Graduate School of Science, Department of Biophysics and Biochemistry, University of Tokyo, Japan
| | - Ahmed Ashraf
- Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, R3T 5V6, MB, Canada.
| |
Collapse
|
15
|
Li Z, Zhang Y, Peng B, Qin S, Zhang Q, Chen Y, Chen C, Bao Y, Zhu Y, Hong Y, Liu B, Liu Q, Xu L, Chen X, Ma X, Wang H, Xie L, Yao Y, Deng B, Li J, De B, Chen Y, Wang J, Li T, Liu R, Tang Z, Cao J, Zuo E, Mei C, Zhu F, Shao C, Wang G, Sun T, Wang N, Liu G, Ni JQ, Liu Y. A novel interpretable deep learning-based computational framework designed synthetic enhancers with broad cross-species activity. Nucleic Acids Res 2024; 52:13447-13468. [PMID: 39420601 PMCID: PMC11602155 DOI: 10.1093/nar/gkae912] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2024] [Revised: 09/25/2024] [Accepted: 10/03/2024] [Indexed: 10/19/2024] Open
Abstract
Enhancers play a critical role in dynamically regulating spatial-temporal gene expression and establishing cell identity, underscoring the significance of designing them with specific properties for applications in biosynthetic engineering and gene therapy. Despite numerous high-throughput methods facilitating genome-wide enhancer identification, deciphering the sequence determinants of their activity remains challenging. Here, we present the DREAM (DNA cis-Regulatory Elements with controllable Activity design platforM) framework, a novel deep learning-based approach for synthetic enhancer design. Proficient in uncovering subtle and intricate patterns within extensive enhancer screening data, DREAM achieves cutting-edge sequence-based enhancer activity prediction and highlights critical sequence features implicating strong enhancer activity. Leveraging DREAM, we have engineered enhancers that surpass the potency of the strongest enhancer within the Drosophila genome by approximately 3.6-fold. Remarkably, these synthetic enhancers exhibited conserved functionality across species that have diverged more than billion years, indicating that DREAM was able to learn highly conserved enhancer regulatory grammar. Additionally, we designed silencers and cell line-specific enhancers using DREAM, demonstrating its versatility. Overall, our study not only introduces an interpretable approach for enhancer design but also lays out a general framework applicable to the design of other types of cis-regulatory elements.
Collapse
Affiliation(s)
- Zhaohong Li
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Yuanyuan Zhang
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Bo Peng
- Gene Regulatory Lab, School of Basic Medical Sciences, Tsinghua University, NO. 30 Shuangqing road, Haidian district, Beijing 100084, China
- State Key Laboratory of Molecular Oncology, Tsinghua University, NO. 30 Shuangqing road, Haidian district, Beijing 100084, China
| | - Shenghua Qin
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Qian Zhang
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, NO.1 Beichen West Road, Chaoyang District, Beijing 100101, China
| | - Yun Chen
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Choulin Chen
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Yongzhou Bao
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Yuqi Zhu
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, NO. 7 Pengfei Road, Dapeng District, Shenzhen 518124, China
| | - Yi Hong
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, NO. 7 Pengfei Road, Dapeng District, Shenzhen 518124, China
| | - Binghua Liu
- State Key Laboratory of Maricultural Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, NO.106 Nanjing Road, Shinan District, Qingdao, Shandong 266071, China
| | - Qian Liu
- State Key Laboratory of Maricultural Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, NO.106 Nanjing Road, Shinan District, Qingdao, Shandong 266071, China
| | - Lingna Xu
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Xi Chen
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Xinhao Ma
- College of Grassland Agriculture, National Beef Cattle Improvement Center, College of Animal Science and Technology, Northwest A&F University, NO. 3 Taicheng Road, Yangling District, Yangling, Shaanxi 712100, China
| | - Hongyan Wang
- State Key Laboratory of Maricultural Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, NO.106 Nanjing Road, Shinan District, Qingdao, Shandong 266071, China
| | - Long Xie
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Yilong Yao
- Green Healthy Aquaculture Research Center, Kunpeng Institute of Modern Agriculture at Foshan, Chinese Academy of Agricultural Sciences, Building 26 Lihe Technology Park, Auxiliary Road of Xinxi Avenue South, Nanhai District, Foshan 528226, China
| | - Biao Deng
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Jiaying Li
- Department of Ophthalmology, Beijing Institute of Ophthalmology, Beijing Tongren Eye Center, Beijing Tongren Hospital, Capital Medical University, Dongjiaomin lane No1, Dongcheng District, Beijing 100101, China
| | - Baojun De
- College of Life Sciences, Inner Mongolia Autonomous Region Key Laboratory of Biomanufacturing, Inner Mongolia Agricultural University, NO. 306 Zhaowuda Road, Saihan District, Hohhot 010018, China
| | - Yuting Chen
- College of Life Sciences, Inner Mongolia Autonomous Region Key Laboratory of Biomanufacturing, Inner Mongolia Agricultural University, NO. 306 Zhaowuda Road, Saihan District, Hohhot 010018, China
| | - Jing Wang
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Tian Li
- College of JUNCAO Science and Ecology, Haixia Institute of Science and Technology, National Engineering Research Center of JUNCAO, Fujian Agriculture and Forestry University (FAFU), NO.15 Shangxiadian Road, Cangshan District, Fuzhou 0350002, China
| | - Ranran Liu
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Yuanmingyuan West Road NO. 2, Haidian District, Beijing 100193, China
| | - Zhonglin Tang
- Green Healthy Aquaculture Research Center, Kunpeng Institute of Modern Agriculture at Foshan, Chinese Academy of Agricultural Sciences, Building 26 Lihe Technology Park, Auxiliary Road of Xinxi Avenue South, Nanhai District, Foshan 528226, China
| | - Junwei Cao
- College of Life Sciences, Inner Mongolia Autonomous Region Key Laboratory of Biomanufacturing, Inner Mongolia Agricultural University, NO. 306 Zhaowuda Road, Saihan District, Hohhot 010018, China
| | - Erwei Zuo
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Chugang Mei
- College of Grassland Agriculture, National Beef Cattle Improvement Center, College of Animal Science and Technology, Northwest A&F University, NO. 3 Taicheng Road, Yangling District, Yangling, Shaanxi 712100, China
| | - Fangjie Zhu
- College of JUNCAO Science and Ecology, Haixia Institute of Science and Technology, National Engineering Research Center of JUNCAO, Fujian Agriculture and Forestry University (FAFU), NO.15 Shangxiadian Road, Cangshan District, Fuzhou 0350002, China
| | - Changwei Shao
- State Key Laboratory of Maricultural Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, NO.106 Nanjing Road, Shinan District, Qingdao, Shandong 266071, China
| | - Guirong Wang
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Tongjun Sun
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, NO. 7 Pengfei Road, Dapeng District, Shenzhen 518124, China
| | - Ningli Wang
- Department of Ophthalmology, Beijing Institute of Ophthalmology, Beijing Tongren Eye Center, Beijing Tongren Hospital, Capital Medical University, Dongjiaomin lane No1, Dongcheng District, Beijing 100101, China
| | - Gang Liu
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, NO.1 Beichen West Road, Chaoyang District, Beijing 100101, China
| | - Jian-Quan Ni
- Gene Regulatory Lab, School of Basic Medical Sciences, Tsinghua University, NO. 30 Shuangqing road, Haidian district, Beijing 100084, China
- State Key Laboratory of Molecular Oncology, Tsinghua University, NO. 30 Shuangqing road, Haidian district, Beijing 100084, China
- SXMU-Tsinghua Collaborative Innovation Center for Frontier Medicine, Shanxi Medical University, NO. 56 Xinjian South Road, Yingze District, Taiyuan 030001, China
| | - Yuwen Liu
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Green Healthy Aquaculture Research Center, Kunpeng Institute of Modern Agriculture at Foshan, Chinese Academy of Agricultural Sciences, Building 26 Lihe Technology Park, Auxiliary Road of Xinxi Avenue South, Nanhai District, Foshan 528226, China
| |
Collapse
|
16
|
Wang Y, Kong S, Zhou C, Wang Y, Zhang Y, Fang Y, Li G. A review of deep learning models for the prediction of chromatin interactions with DNA and epigenomic profiles. Brief Bioinform 2024; 26:bbae651. [PMID: 39708837 DOI: 10.1093/bib/bbae651] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2024] [Revised: 10/29/2024] [Accepted: 12/03/2024] [Indexed: 12/23/2024] Open
Abstract
Advances in three-dimensional (3D) genomics have revealed the spatial characteristics of chromatin interactions in gene expression regulation, which is crucial for understanding molecular mechanisms in biological processes. High-throughput technologies like ChIA-PET, Hi-C, and their derivatives methods have greatly enhanced our knowledge of 3D chromatin architecture. However, the chromatin interaction mechanisms remain largely unexplored. Deep learning, with its powerful feature extraction and pattern recognition capabilities, offers a promising approach for integrating multi-omics data, to build accurate predictive models of chromatin interaction matrices. This review systematically summarizes recent advances in chromatin interaction matrix prediction models. By integrating DNA sequences and epigenetic signals, we investigate the latest developments in these methods. This article details various models, focusing on how one-dimensional (1D) information transforms into the 3D structure chromatin interactions, and how the integration of different deep learning modules specifically affects model accuracy. Additionally, we discuss the critical role of DNA sequence information and epigenetic markers in shaping 3D genome interaction patterns. Finally, this review addresses the challenges in predicting chromatin interaction matrices, in order to improve the precise mapping of chromatin interaction matrices and DNA sequence, and supporting the transformation and theoretical development of 3D genomics across biological systems.
Collapse
Affiliation(s)
- Yunlong Wang
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No. 97 Buxin Road, Dapeng New District, Shenzhen 518120, China
| | - Siyuan Kong
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No. 97 Buxin Road, Dapeng New District, Shenzhen 518120, China
| | - Cong Zhou
- Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
- Hubei Engineering Technology Research Center of Agricultural Big Data, 3D Genomics Research Center, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
- College of Informatics, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
| | - Yanfang Wang
- State Key Laboratory of Animal Biotech Breeding, Institute of Animal Science, Chinese Academy of Agricultural Sciences (CAAS), No. 2 West Yuanmingyuan Rd, Haidian District, Beijing 100193, China
| | - Yubo Zhang
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No. 97 Buxin Road, Dapeng New District, Shenzhen 518120, China
- Sequencing Facility, Frederick National Laboratory for Cancer Research, 8560 Progress Drive, Frederick, MD 21701, United States
| | - Yaping Fang
- Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
- Hubei Engineering Technology Research Center of Agricultural Big Data, 3D Genomics Research Center, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
- College of Informatics, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
| | - Guoliang Li
- Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
- Hubei Engineering Technology Research Center of Agricultural Big Data, 3D Genomics Research Center, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
- College of Informatics, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
| |
Collapse
|
17
|
Kabir A, Bhattarai M, Peterson S, Najman-Licht Y, Rasmussen K, Shehu A, Bishop A, Alexandrov B, Usheva A. DNA breathing integration with deep learning foundational model advances genome-wide binding prediction of human transcription factors. Nucleic Acids Res 2024; 52:e91. [PMID: 39271116 PMCID: PMC11514457 DOI: 10.1093/nar/gkae783] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2024] [Revised: 08/21/2024] [Accepted: 08/29/2024] [Indexed: 09/15/2024] Open
Abstract
It was previously shown that DNA breathing, thermodynamic stability, as well as transcriptional activity and transcription factor (TF) bindings are functionally correlated. To ascertain the precise relationship between TF binding and DNA breathing, we developed the multi-modal deep learning model EPBDxDNABERT-2, which is based on the Extended Peyrard-Bishop-Dauxois (EPBD) nonlinear DNA dynamics model. To train our EPBDxDNABERT-2, we used chromatin immunoprecipitation sequencing (ChIP-Seq) data comprising 690 ChIP-seq experimental results encompassing 161 distinct TFs and 91 human cell types. EPBDxDNABERT-2 significantly improves the prediction of over 660 TF-DNA, with an increase in the area under the receiver operating characteristic (AUROC) metric of up to 9.6% when compared to the baseline model that does not leverage DNA biophysical properties. We expanded our analysis to in vitro high-throughput Systematic Evolution of Ligands by Exponential enrichment (HT-SELEX) dataset of 215 TFs from 27 families, comparing EPBD with established frameworks. The integration of the DNA breathing features with DNABERT-2 foundational model, greatly enhanced TF-binding predictions. Notably, EPBDxDNABERT-2, trained on a large-scale multi-species genomes, with a cross-attention mechanism, improved predictive power shedding light on the mechanisms underlying disease-related non-coding variants discovered in genome-wide association studies.
Collapse
Affiliation(s)
- Anowarul Kabir
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544 NM, USA
- Department of Computer Science, George Mason University, 4400 University Dr, 22030 VA, USA
| | - Manish Bhattarai
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544 NM, USA
| | - Selma Peterson
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544 NM, USA
| | | | - Kim Ø Rasmussen
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544 NM, USA
| | - Amarda Shehu
- Department of Computer Science, George Mason University, 4400 University Dr, 22030 VA, USA
| | - Alan R Bishop
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544 NM, USA
| | - Boian Alexandrov
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544 NM, USA
| | - Anny Usheva
- Department of Surgery, Brown University, 69 Brown St Box 1822, 02912 RI, USA
| |
Collapse
|
18
|
La Fleur A, Shi Y, Seelig G. Decoding biology with massively parallel reporter assays and machine learning. Genes Dev 2024; 38:843-865. [PMID: 39362779 PMCID: PMC11535156 DOI: 10.1101/gad.351800.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/05/2024]
Abstract
Massively parallel reporter assays (MPRAs) are powerful tools for quantifying the impacts of sequence variation on gene expression. Reading out molecular phenotypes with sequencing enables interrogating the impact of sequence variation beyond genome scale. Machine learning models integrate and codify information learned from MPRAs and enable generalization by predicting sequences outside the training data set. Models can provide a quantitative understanding of cis-regulatory codes controlling gene expression, enable variant stratification, and guide the design of synthetic regulatory elements for applications from synthetic biology to mRNA and gene therapy. This review focuses on cis-regulatory MPRAs, particularly those that interrogate cotranscriptional and post-transcriptional processes: alternative splicing, cleavage and polyadenylation, translation, and mRNA decay.
Collapse
Affiliation(s)
- Alyssa La Fleur
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, USA
| | - Yongsheng Shi
- Department of Microbiology and Molecular Genetics, School of Medicine, University of California, Irvine, Irvine, California 92697, USA;
| | - Georg Seelig
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, USA;
- Department of Electrical & Computer Engineering, University of Washington, Seattle, Washington 98195, USA
| |
Collapse
|
19
|
Awdeh A, Turcotte M, Perkins TJ. Identifying transcription factors with cell-type specific DNA binding signatures. BMC Genomics 2024; 25:957. [PMID: 39402535 PMCID: PMC11472444 DOI: 10.1186/s12864-024-10859-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/10/2024] [Accepted: 10/02/2024] [Indexed: 10/19/2024] Open
Abstract
BACKGROUND Transcription factors (TFs) bind to different parts of the genome in different types of cells, but it is usually assumed that the inherent DNA-binding preferences of a TF are invariant to cell type. Yet, there are several known examples of TFs that switch their DNA-binding preferences in different cell types, and yet more examples of other mechanisms, such as steric hindrance or cooperative binding, that may result in a "DNA signature" of differential binding. RESULTS To survey this phenomenon systematically, we developed a deep learning method we call SigTFB (Signatures of TF Binding) to detect and quantify cell-type specificity in a TF's known genomic binding sites. We used ENCODE ChIP-seq data to conduct a wide scale investigation of 169 distinct TFs in up to 14 distinct cell types. SigTFB detected statistically significant DNA binding signatures in approximately two-thirds of TFs, far more than might have been expected from the relatively sparse evidence in prior literature. We found that the presence or absence of a cell-type specific DNA binding signature is distinct from, and indeed largely uncorrelated to, the degree of overlap between ChIP-seq peaks in different cell types, and tended to arise by two mechanisms: using established motifs in different frequencies, and by selective inclusion of motifs for distint TFs. CONCLUSIONS While recent results have highlighted cell state features such as chromatin accessibility and gene expression in predicting TF binding, our results emphasize that, for some TFs, the DNA sequences of the binding sites contain substantial cell-type specific motifs.
Collapse
Affiliation(s)
- Aseel Awdeh
- School of Electrical Engineering and Compute Science, University of Ottawa, 800 King Edward Ave., Ottawa, K1N 6N5, Ontario, Canada
- Regenerative Medicine Program, Ottawa Hospital Research Institute, 501 Smyth Rd., Ottawa, K1H 8L6, Ontario, Canada
| | - Marcel Turcotte
- School of Electrical Engineering and Compute Science, University of Ottawa, 800 King Edward Ave., Ottawa, K1N 6N5, Ontario, Canada
| | - Theodore J Perkins
- School of Electrical Engineering and Compute Science, University of Ottawa, 800 King Edward Ave., Ottawa, K1N 6N5, Ontario, Canada.
- Regenerative Medicine Program, Ottawa Hospital Research Institute, 501 Smyth Rd., Ottawa, K1H 8L6, Ontario, Canada.
- Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology and Immunology, University of Ottawa, 451 Smyth Rd., Ottawa, K1H 8M5, Ontario, Canada.
| |
Collapse
|
20
|
Rafi AM, Nogina D, Penzar D, Lee D, Lee D, Kim N, Kim S, Kim D, Shin Y, Kwak IY, Meshcheryakov G, Lando A, Zinkevich A, Kim BC, Lee J, Kang T, Vaishnav ED, Yadollahpour P, Kim S, Albrecht J, Regev A, Gong W, Kulakovskiy IV, Meyer P, de Boer CG. A community effort to optimize sequence-based deep learning models of gene regulation. Nat Biotechnol 2024:10.1038/s41587-024-02414-w. [PMID: 39394483 DOI: 10.1038/s41587-024-02414-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2024] [Accepted: 08/29/2024] [Indexed: 10/13/2024]
Abstract
A systematic evaluation of how model architectures and training strategies impact genomics model performance is needed. To address this gap, we held a DREAM Challenge where competitors trained models on a dataset of millions of random promoter DNA sequences and corresponding expression levels, experimentally determined in yeast. For a robust evaluation of the models, we designed a comprehensive suite of benchmarks encompassing various sequence types. All top-performing models used neural networks but diverged in architectures and training strategies. To dissect how architectural and training choices impact performance, we developed the Prix Fixe framework to divide models into modular building blocks. We tested all possible combinations for the top three models, further improving their performance. The DREAM Challenge models not only achieved state-of-the-art results on our comprehensive yeast dataset but also consistently surpassed existing benchmarks on Drosophila and human genomic datasets, demonstrating the progress that can be driven by gold-standard genomics datasets.
Collapse
Affiliation(s)
| | - Daria Nogina
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia
| | - Dmitry Penzar
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
- AIRI, Moscow, Russia
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, Russia
| | - Dohoon Lee
- Seoul National University, Seoul, South Korea
| | | | - Nayeon Kim
- Seoul National University, Seoul, South Korea
| | | | - Dohyeon Kim
- Seoul National University, Seoul, South Korea
| | - Yeojin Shin
- Seoul National University, Seoul, South Korea
| | | | | | | | - Arsenii Zinkevich
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
| | | | - Juhyun Lee
- Chung-Ang University, Seoul, South Korea
| | - Taein Kang
- Chung-Ang University, Seoul, South Korea
| | - Eeshit Dhaval Vaishnav
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Sequome, Inc., South San Francisco, CA, USA
| | | | - Sun Kim
- Seoul National University, Seoul, South Korea
| | | | - Aviv Regev
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Genentech, San Francisco, CA, USA
| | - Wuming Gong
- University of Minnesota, Minneapolis, MN, USA
| | - Ivan V Kulakovskiy
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, Russia
| | - Pablo Meyer
- Health Care and Life Sciences, IBM Research, New York, NY, USA
| | - Carl G de Boer
- University of British Columbia, Vancouver, British Columbia, Canada.
| |
Collapse
|
21
|
John C, Sahoo J, Sajan IK, Madhavan M, Mathew OK. CNN-BLSTM based deep learning framework for eukaryotic kinome classification: An explainability based approach. Comput Biol Chem 2024; 112:108169. [PMID: 39137619 DOI: 10.1016/j.compbiolchem.2024.108169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2024] [Revised: 05/08/2024] [Accepted: 08/03/2024] [Indexed: 08/15/2024]
Abstract
Classification of protein families from their sequences is an enduring task in Proteomics and related studies. Numerous deep-learning models have been moulded to tackle this challenge, but due to the black-box character, they still fall short in reliability. Here, we present a novel explainability pipeline that explains the pivotal decisions of the deep learning model on the classification of the Eukaryotic kinome. Based on a comparative and experimental analysis of the most cutting-edge deep learning algorithms, the best deep learning model CNN-BLSTM was chosen to classify the eight eukaryotic kinase sequences to their corresponding families. As a substitution for the conventional class activation map-based interpretation of CNN-based models in the domain, we have cascaded the GRAD CAM and Integrated Gradient (IG) explainability modus operandi for improved and responsible results. To ensure the trustworthiness of the classifier, we have masked the kinase domain traces, identified from the explainability pipeline and observed a class-specific drop in F1-score from 0.96 to 0.76. In compliance with the Explainable AI paradigm, our results are promising and contribute to enhancing the trustworthiness of deep learning models for biological sequence-associated studies.
Collapse
Affiliation(s)
- Chinju John
- Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam, Kottayam, 686635, Kerala, India.
| | - Jayakrushna Sahoo
- Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam, Kottayam, 686635, Kerala, India
| | - Irish K Sajan
- Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam, Kottayam, 686635, Kerala, India
| | - Manu Madhavan
- Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam, Kottayam, 686635, Kerala, India
| | - Oommen K Mathew
- Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam, Kottayam, 686635, Kerala, India
| |
Collapse
|
22
|
Gosai SJ, Castro RI, Fuentes N, Butts JC, Mouri K, Alasoadura M, Kales S, Nguyen TTL, Noche RR, Rao AS, Joy MT, Sabeti PC, Reilly SK, Tewhey R. Machine-guided design of cell-type-targeting cis-regulatory elements. Nature 2024; 634:1211-1220. [PMID: 39443793 PMCID: PMC11525185 DOI: 10.1038/s41586-024-08070-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2023] [Accepted: 09/18/2024] [Indexed: 10/25/2024]
Abstract
Cis-regulatory elements (CREs) control gene expression, orchestrating tissue identity, developmental timing and stimulus responses, which collectively define the thousands of unique cell types in the body1-3. While there is great potential for strategically incorporating CREs in therapeutic or biotechnology applications that require tissue specificity, there is no guarantee that an optimal CRE for these intended purposes has arisen naturally. Here we present a platform to engineer and validate synthetic CREs capable of driving gene expression with programmed cell-type specificity. We take advantage of innovations in deep neural network modelling of CRE activity across three cell types, efficient in silico optimization and massively parallel reporter assays to design and empirically test thousands of CREs4-8. Through large-scale in vitro validation, we show that synthetic sequences are more effective at driving cell-type-specific expression in three cell lines compared with natural sequences from the human genome and achieve specificity in analogous tissues when tested in vivo. Synthetic sequences exhibit distinct motif vocabulary associated with activity in the on-target cell type and a simultaneous reduction in the activity of off-target cells. Together, we provide a generalizable framework to prospectively engineer CREs from massively parallel reporter assay models and demonstrate the required literacy to write fit-for-purpose regulatory code.
Collapse
Affiliation(s)
- Sager J Gosai
- Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Harvard Graduate Program in Biological and Biomedical Science, Boston, MA, USA.
- Department Of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA.
- Howard Hughes Medical Institute, Chevy Chase, MD, USA.
| | | | - Natalia Fuentes
- The Jackson Laboratory, Bar Harbor, ME, USA
- Harvard College, Harvard University, Cambridge, MA, USA
| | - John C Butts
- The Jackson Laboratory, Bar Harbor, ME, USA
- Graduate School of Biomedical Sciences and Engineering, University of Maine, Orono, ME, USA
| | | | | | | | | | - Ramil R Noche
- Department of Comparative Medicine, Yale School of Medicine, New Haven, CT, USA
- Yale Zebrafish Research Core, Yale School of Medicine, New Haven, CT, USA
| | - Arya S Rao
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Harvard Medical School, Boston, MA, USA
| | - Mary T Joy
- The Jackson Laboratory, Bar Harbor, ME, USA
| | - Pardis C Sabeti
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department Of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Department of Immunology and Infectious Diseases, Harvard T H Chan School of Public Health, Harvard University, Boston, MA, USA
| | - Steven K Reilly
- Department of Genetics, Yale School of Medicine, New Haven, CT, USA.
- Wu Tsai Institute, Yale University, New Haven, CT, USA.
| | - Ryan Tewhey
- The Jackson Laboratory, Bar Harbor, ME, USA.
- Graduate School of Biomedical Sciences and Engineering, University of Maine, Orono, ME, USA.
- Graduate School of Biomedical Sciences, Tufts University School of Medicine, Boston, MA, USA.
| |
Collapse
|
23
|
He C, Duan L, Zheng H, Wang X, Guan L, Xu J. A Representation Learning Approach for Predicting circRNA Back-Splicing Event via Sequence-Interaction-Aware Dual Encoder. IEEE Trans Nanobioscience 2024; 23:603-611. [PMID: 39226209 DOI: 10.1109/tnb.2024.3454079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/05/2024]
Abstract
Circular RNAs (circRNAs) play a crucial role in gene regulation and association with diseases because of their unique closed continuous loop structure, which is more stable and conserved than ordinary linear RNAs. As fundamental work to clarify their functions, a large number of computational approaches for identifying circRNA formation have been proposed. However, these methods fail to fully utilize the important characteristics of back-splicing events, i.e., the positional information of the splice sites and the interaction features of its flanking sequences, for predicting circRNAs. To this end, we hereby propose a novel approach called SIDE for predicting circRNA back-splicing events using only raw RNA sequences. Technically, SIDE employs a dual encoder to capture global and interactive features of the RNA sequence, and then a decoder designed by the contrastive learning to fuse out discriminative features improving the prediction of circRNAs formation. Empirical results on three real-world datasets show the effectiveness of SIDE. Further analysis also reveals that the effectiveness of SIDE.
Collapse
|
24
|
Koh HYK, Lam UTF, Ban KHK, Chen ES. Machine learning optimized DriverDetect software for high precision prediction of deleterious mutations in human cancers. Sci Rep 2024; 14:22618. [PMID: 39349509 PMCID: PMC11442673 DOI: 10.1038/s41598-024-71422-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2024] [Accepted: 08/28/2024] [Indexed: 10/02/2024] Open
Abstract
The detection of cancer-driving mutations is important for understanding cancer pathology and therapeutics development. Prediction tools have been created to streamline the computation process. However, most tools available have heterogeneous sensitivity or specificity. We built a machine learning-derived algorithm, DriverDetect that combines the outputs of seven pre-existing tools to improve the prediction of candidate driver cancer mutations. The algorithm was trained with cancer gene-specific mutation datasets of cancer patients to identify cancer drivers. DriverDetect performed better than the individual tools or their combinations in the validation test. It has the potential to incorporate future novel prediction algorithms and can be retrained with new datasets, offering an expanded application to pan-cancer analysis for cross-cancer study. (115 words).
Collapse
Affiliation(s)
- Herrick Yu Kan Koh
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
| | - Ulysses Tsz Fung Lam
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
| | - Kenneth Hon-Kim Ban
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore.
- National University Health System (NUHS), Singapore, Singapore.
- NUS Center for Cancer Research, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore.
| | - Ee Sin Chen
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore.
- National University Health System (NUHS), Singapore, Singapore.
- NUS Center for Cancer Research, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore.
- Integrative Sciences and Engineering Programme, National University of Singapore, Singapore, Singapore.
| |
Collapse
|
25
|
Zhang Y, Wang Z, Ge F, Wang X, Zhang Y, Li S, Guo Y, Song J, Yu DJ. MLSNet: a deep learning model for predicting transcription factor binding sites. Brief Bioinform 2024; 25:bbae489. [PMID: 39350338 PMCID: PMC11442149 DOI: 10.1093/bib/bbae489] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2024] [Revised: 09/05/2024] [Accepted: 09/16/2024] [Indexed: 10/04/2024] Open
Abstract
Accurate prediction of transcription factor binding sites (TFBSs) is essential for understanding gene regulation mechanisms and the etiology of diseases. Despite numerous advances in deep learning for predicting TFBSs, their performance can still be enhanced. In this study, we propose MLSNet, a novel deep learning architecture designed specifically to predict TFBSs. MLSNet innovatively integrates multisize convolutional fusion with long short-term memory (LSTM) networks to effectively capture DNA-sparse higher-order sequence features. Further, MLSNet incorporates super token attention and Bi-LSTM to systematically extract and integrate higher-order DNA shape features. Experimental results on 165 ChIP-seq (chromatin immunoprecipitation followed by sequencing) datasets indicate that MLSNet consistently outperforms several state-of-the-art algorithms in the prediction of TFBSs. Specifically, MLSNet reports average metrics: 0.8306 for ACC, 0.8992 for AUROC, and 0.9035 for AUPRC, surpassing the second-best methods by 1.82%, 1.68%, and 1.54%, respectively. This research delineates the effectiveness of combining multi-size convolutional layers with LSTM and DNA shape-based features in enhancing predictive accuracy. Moreover, this study comprehensively assesses the variability in model performance across different cell lines and transcription factors. The source code of MLSNet is available at https://github.com/minghaidea/MLSNet.
Collapse
Affiliation(s)
- Yuchuan Zhang
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| | - Zhikang Wang
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Wellington Rd, Clayton, Melbourne, VIC 3800, Australia
| | - Fang Ge
- State Key Laboratory of Organic Electronics and Information Displays & Institute of Advanced Materials (IAM), Nanjing University of Posts & Telecommunications, 9 Wenyuan, Nanjing, 210023, China
| | - Xiaoyu Wang
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Wellington Rd, Clayton, Melbourne, VIC 3800, Australia
| | - Yiwen Zhang
- Climate, Air Quality Research Unit, School of Public Health and Preventive Medicine, Monash University, 553 St Kilda Road, Melbourne, VIC 3004, Australia
| | - Shanshan Li
- Climate, Air Quality Research Unit, School of Public Health and Preventive Medicine, Monash University, 553 St Kilda Road, Melbourne, VIC 3004, Australia
| | - Yuming Guo
- Climate, Air Quality Research Unit, School of Public Health and Preventive Medicine, Monash University, 553 St Kilda Road, Melbourne, VIC 3004, Australia
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Wellington Rd, Clayton, Melbourne, VIC 3800, Australia
- Monash Data Futures Institute, Monash University, Wellington Rd, Clayton, Melbourne, VIC 3800, Australia
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| |
Collapse
|
26
|
Phan H, Brouard C, Mourad R. Semi-supervised learning with pseudo-labeling compares favorably with large language models for regulatory sequence prediction. Brief Bioinform 2024; 25:bbae560. [PMID: 39489607 PMCID: PMC11531863 DOI: 10.1093/bib/bbae560] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2024] [Revised: 09/13/2024] [Accepted: 10/17/2024] [Indexed: 11/05/2024] Open
Abstract
Predicting molecular processes using deep learning is a promising approach to provide biological insights for non-coding single nucleotide polymorphisms identified in genome-wide association studies. However, most deep learning methods rely on supervised learning, which requires DNA sequences associated with functional data, and whose amount is severely limited by the finite size of the human genome. Conversely, the amount of mammalian DNA sequences is growing exponentially due to ongoing large-scale sequencing projects, but in most cases without functional data. To alleviate the limitations of supervised learning, we propose a novel semi-supervised learning (SSL) based on pseudo-labeling, which allows to exploit unlabeled DNA sequences from numerous genomes during model pre-training. We further improved it incorporating principles from the Noisy Student algorithm to predict the confidence in pseudo-labeled data used for pre-training, which showed improvements for transcription factor with very few binding (very small training data). The approach is very flexible and can be used to train any neural architecture including state-of-the-art models, and shows in most cases strong predictive performance improvements compared to standard supervised learning. Moreover, small models trained by SSL showed similar or better performance than large language model DNABERT2.
Collapse
Affiliation(s)
- Han Phan
- INRAE, MIAT, 31326 Castanet-Tolosan, France
| | | | - Raphaël Mourad
- INRAE, MIAT, 31326 Castanet-Tolosan, France
- University of Toulouse, UPS, 31062 Toulouse, France
| |
Collapse
|
27
|
Borges Farias A, Sganzerla Martinez G, Galán-Vásquez E, Nicolás MF, Pérez-Rueda E. Predicting bacterial transcription factor binding sites through machine learning and structural characterization based on DNA duplex stability. Brief Bioinform 2024; 25:bbae581. [PMID: 39541188 PMCID: PMC11562833 DOI: 10.1093/bib/bbae581] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2024] [Revised: 10/02/2024] [Accepted: 11/01/2024] [Indexed: 11/16/2024] Open
Abstract
Transcriptional factors (TFs) in bacteria play a crucial role in gene regulation by binding to specific DNA sequences, thereby assisting in the activation or repression of genes. Despite their central role, deciphering shape recognition of bacterial TFs-DNA interactions remains an intricate challenge. A deeper understanding of DNA secondary structures could greatly enhance our knowledge of how TFs recognize and interact with DNA, thereby elucidating their biological function. In this study, we employed machine learning algorithms to predict transcription factor binding sites (TFBS) and classify them as directed-repeat (DR) or inverted-repeat (IR). To accomplish this, we divided the set of TFBS nucleotide sequences by size, ranging from 8 to 20 base pairs, and converted them into thermodynamic data known as DNA duplex stability (DDS). Our results demonstrate that the Random Forest algorithm accurately predicts TFBS with an average accuracy of over 82% and effectively distinguishes between IR and DR with an accuracy of 89%. Interestingly, upon converting the base pairs of several TFBS-IR into DDS values, we observed a symmetric profile typical of the palindromic structure associated with these architectures. This study presents a novel TFBS prediction model based on a DDS characteristic that may indicate how respective proteins interact with base pairs, thus providing insights into molecular mechanisms underlying bacterial TFs-DNA interaction.
Collapse
Affiliation(s)
- André Borges Farias
- Laboratório Nacional de Computação Científica - LNCC, Avenida Getúlio Vargas, Petrópolis, Rio de Janeiro 25651075, Brazil
- Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Unidad Académica del Estado de Yucatán, Carretera Sierra Papacal, Mérida 97302, Yucatán, México
| | - Gustavo Sganzerla Martinez
- Microbiology and Immunology, Dalhousie University, 5850 College Street, Halifax B3H 4H7, Nova Scotia, Canada
| | - Edgardo Galán-Vásquez
- Departamento de Ingeniería de Sistemas Computacionales y Automatización, Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Ciudad Universitaria, Circuito Escolar S/N, Mexico City 01000, México
| | - Marisa Fabiana Nicolás
- Laboratório Nacional de Computação Científica - LNCC, Avenida Getúlio Vargas, Petrópolis, Rio de Janeiro 25651075, Brazil
| | - Ernesto Pérez-Rueda
- Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Unidad Académica del Estado de Yucatán, Carretera Sierra Papacal, Mérida 97302, Yucatán, México
| |
Collapse
|
28
|
Schadron T, van den Beld M, Mughini-Gras L, Franz E. Use of whole genome sequencing for surveillance and control of foodborne diseases: status quo and quo vadis. Front Microbiol 2024; 15:1460335. [PMID: 39345263 PMCID: PMC11427404 DOI: 10.3389/fmicb.2024.1460335] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2024] [Accepted: 08/27/2024] [Indexed: 10/01/2024] Open
Abstract
Improvements in sequencing quality, availability, speed and costs results in an increased presence of genomics in infectious disease applications. Nevertheless, there are still hurdles in regard to the optimal use of WGS for public health purposes. Here, we discuss the current state ("status quo") and future directions ("quo vadis") based on literature regarding the use of genomics in surveillance, hazard characterization and source attribution of foodborne pathogens. The future directions include the application of new techniques, such as machine learning and network approaches that may overcome the current shortcomings. These include the use of fixed genomic distances in cluster delineation, disentangling similarity or lack thereof in source attribution, and difficulties ascertaining function in hazard characterization. Although, the aforementioned methods can relatively easily be applied technically, an overarching challenge is the inference and biological/epidemiological interpretation of these large amounts of high-resolution data. Understanding the context in terms of bacterial isolate and host diversity allows to assess the level of representativeness in regard to sources and isolates in the dataset, which in turn defines the level of certainty associated with defining clusters, sources and risks. This also marks the importance of metadata (clinical, epidemiological, and biological) when using genomics for public health purposes.
Collapse
Affiliation(s)
- Tristan Schadron
- Centre for Infectious Disease Control, National Institute for Public Health and the Environment (RIVM), Bilthoven, Netherlands
| | - Maaike van den Beld
- Centre for Infectious Disease Control, National Institute for Public Health and the Environment (RIVM), Bilthoven, Netherlands
| | - Lapo Mughini-Gras
- Centre for Infectious Disease Control, National Institute for Public Health and the Environment (RIVM), Bilthoven, Netherlands
- Institute for Risk Assessment Sciences, Utrecht University, Utrecht, Netherlands
| | - Eelco Franz
- Centre for Infectious Disease Control, National Institute for Public Health and the Environment (RIVM), Bilthoven, Netherlands
| |
Collapse
|
29
|
Aparo A, Bonnici V, Avesani S, Cascione L, Giugno R. DiGAS: Differential gene allele spectrum as a descriptor in genetic studies. Comput Biol Med 2024; 179:108924. [PMID: 39067286 DOI: 10.1016/j.compbiomed.2024.108924] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2024] [Revised: 07/16/2024] [Accepted: 07/17/2024] [Indexed: 07/30/2024]
Abstract
Diagnosing individuals with complex genetic diseases is a challenging task. Computational methodologies exploit information at the genotype level by taking into account single nucleotide polymorphisms (SNPs) leveraging the results of genome-wide association studies analysis to assign a statistical significance to each SNP. Recent methodologies extend such an approach by aggregating SNP significance at the genetic level to identify genes that are related to the condition under study. However, such methodologies still suffer from the initial SNP analysis limitations. Here, we present DiGAS, a tool for diagnosing genetic conditions by computing significance, by means of SNP information, directly at the complex level of genetic regions. Such an approach is based on a generalized notion of allele spectrum, which evaluates the complete genetic alterations of the SNP set belonging to a genetic region at the population level. The statistical significance of a region is then evaluated through a differential allele spectrum analysis between the conditions of individuals belonging to the population. Tests, performed on well-established datasets regarding Alzheimer's disease, show that DiGAS outperforms the state of the art in distinguishing between sick and healthy subjects.
Collapse
Affiliation(s)
- Antonino Aparo
- University of Verona, Strada le Grazie, 15, Verona, 37134, Italy; Research Center LURM (Interdepartmental Laboratory of Medical Research), University of Verona, Ple. L.A. Scuro 10, Verona, 37134, Italy
| | - Vincenzo Bonnici
- University of Parma, Parco Area delle Scienze, 53/A, Parma, 43124, Italy
| | - Simone Avesani
- University of Verona, Strada le Grazie, 15, Verona, 37134, Italy
| | - Luciano Cascione
- Institute of Oncology Research (IOR), Via Francesco Chiesa 5, Bellinzona, 6500, Switzerland
| | - Rosalba Giugno
- University of Verona, Strada le Grazie, 15, Verona, 37134, Italy.
| |
Collapse
|
30
|
Bakulin A, Teyssier NB, Kampmann M, Khoroshkin M, Goodarzi H. pyPAGE: A framework for Addressing biases in gene-set enrichment analysis-A case study on Alzheimer's disease. PLoS Comput Biol 2024; 20:e1012346. [PMID: 39236079 PMCID: PMC11421795 DOI: 10.1371/journal.pcbi.1012346] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2023] [Revised: 09/24/2024] [Accepted: 07/22/2024] [Indexed: 09/07/2024] Open
Abstract
Inferring the driving regulatory programs from comparative analysis of gene expression data is a cornerstone of systems biology. Many computational frameworks were developed to address this problem, including our iPAGE (information-theoretic Pathway Analysis of Gene Expression) toolset that uses information theory to detect non-random patterns of expression associated with given pathways or regulons. Our recent observations, however, indicate that existing approaches are susceptible to the technical biases that are inherent to most real world annotations. To address this, we have extended our information-theoretic framework to account for specific biases and artifacts in biological networks using the concept of conditional information. To showcase pyPAGE, we performed a comprehensive analysis of regulatory perturbations that underlie the molecular etiology of Alzheimer's disease (AD). pyPAGE successfully recapitulated several known AD-associated gene expression programs. We also discovered several additional regulons whose differential activity is significantly associated with AD. We further explored how these regulators relate to pathological processes in AD through cell-type specific analysis of single cell and spatial gene expression datasets. Our findings showcase the utility of pyPAGE as a precise and reliable biomarker discovery in complex diseases such as Alzheimer's disease.
Collapse
Affiliation(s)
- Artemy Bakulin
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia
| | - Noam B. Teyssier
- Institute for Neurodegenerative Diseases, University of California San Francisco, California, United States of America
| | - Martin Kampmann
- Institute for Neurodegenerative Diseases, University of California San Francisco, California, United States of America
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, California, United States of America
| | - Matvei Khoroshkin
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, California, United States of America
- Department of Urology, University of California San Francisco, San Francisco, California, United States of America
- Helen Diller Family Comprehensive Cancer Center, University of California San Francisco, San Francisco, California, United States of America
- Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, California, United States of America
| | - Hani Goodarzi
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, California, United States of America
- Department of Urology, University of California San Francisco, San Francisco, California, United States of America
- Helen Diller Family Comprehensive Cancer Center, University of California San Francisco, San Francisco, California, United States of America
- Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, California, United States of America
- Arc Institute, Palo Alto, California, United States of America
| |
Collapse
|
31
|
Kundu P, Beura S, Mondal S, Das AK, Ghosh A. Machine learning for the advancement of genome-scale metabolic modeling. Biotechnol Adv 2024; 74:108400. [PMID: 38944218 DOI: 10.1016/j.biotechadv.2024.108400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2023] [Revised: 05/13/2024] [Accepted: 06/23/2024] [Indexed: 07/01/2024]
Abstract
Constraint-based modeling (CBM) has evolved as the core systems biology tool to map the interrelations between genotype, phenotype, and external environment. The recent advancement of high-throughput experimental approaches and multi-omics strategies has generated a plethora of new and precise information from wide-ranging biological domains. On the other hand, the continuously growing field of machine learning (ML) and its specialized branch of deep learning (DL) provide essential computational architectures for decoding complex and heterogeneous biological data. In recent years, both multi-omics and ML have assisted in the escalation of CBM. Condition-specific omics data, such as transcriptomics and proteomics, helped contextualize the model prediction while analyzing a particular phenotypic signature. At the same time, the advanced ML tools have eased the model reconstruction and analysis to increase the accuracy and prediction power. However, the development of these multi-disciplinary methodological frameworks mainly occurs independently, which limits the concatenation of biological knowledge from different domains. Hence, we have reviewed the potential of integrating multi-disciplinary tools and strategies from various fields, such as synthetic biology, CBM, omics, and ML, to explore the biochemical phenomenon beyond the conventional biological dogma. How the integrative knowledge of these intersected domains has improved bioengineering and biomedical applications has also been highlighted. We categorically explained the conventional genome-scale metabolic model (GEM) reconstruction tools and their improvement strategies through ML paradigms. Further, the crucial role of ML and DL in omics data restructuring for GEM development has also been briefly discussed. Finally, the case-study-based assessment of the state-of-the-art method for improving biomedical and metabolic engineering strategies has been elaborated. Therefore, this review demonstrates how integrating experimental and in silico strategies can help map the ever-expanding knowledge of biological systems driven by condition-specific cellular information. This multiview approach will elevate the application of ML-based CBM in the biomedical and bioengineering fields for the betterment of society and the environment.
Collapse
Affiliation(s)
- Pritam Kundu
- School School of Energy Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal 721302, India
| | - Satyajit Beura
- Department of Bioscience and Biotechnology, Indian Institute of Technology, Kharagpur, West Bengal 721302, India
| | - Suman Mondal
- P.K. Sinha Centre for Bioenergy and Renewables, Indian Institute of Technology Kharagpur, West Bengal 721302, India
| | - Amit Kumar Das
- Department of Bioscience and Biotechnology, Indian Institute of Technology, Kharagpur, West Bengal 721302, India
| | - Amit Ghosh
- School School of Energy Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal 721302, India; P.K. Sinha Centre for Bioenergy and Renewables, Indian Institute of Technology Kharagpur, West Bengal 721302, India.
| |
Collapse
|
32
|
Zhang Q, Wang S, Li Z, Pan Y, Huang D. Cross-Species Prediction of Transcription Factor Binding by Adversarial Training of a Novel Nucleotide-Level Deep Neural Network. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2024; 11:e2405685. [PMID: 39076052 PMCID: PMC11423150 DOI: 10.1002/advs.202405685] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/23/2024] [Indexed: 07/31/2024]
Abstract
Cross-species prediction of TF binding remains a major challenge due to the rapid evolutionary turnover of individual TF binding sites, resulting in cross-species predictive performance being consistently worse than within-species performance. In this study, a novel Nucleotide-Level Deep Neural Network (NLDNN) is first proposed to predict TF binding within or across species. NLDNN regards the task of TF binding prediction as a nucleotide-level regression task, which takes DNA sequences as input and directly predicts experimental coverage values. Beyond predictive performance, it also assesses model performance by locating potential TF binding regions, discriminating TF-specific single-nucleotide polymorphisms (SNPs), and identifying causal disease-associated SNPs. The experimental results show that NLDNN outperforms the competing methods in these tasks. Then, a dual-path framework is designed for adversarial training of NLDNN to further improve the cross-species prediction performance by pulling the domain space of human and mouse species closer. Through comparison and analysis, it finds that adversarial training not only can improve the cross-species prediction performance between humans and mice but also enhance the ability to locate TF binding regions and discriminate TF-specific SNPs. By visualizing the predictions, it is figured out that the framework corrects some mispredictions by amplifying the coverage values of incorrectly predicted peaks.
Collapse
Affiliation(s)
- Qinhu Zhang
- Ningbo Institute of Digital TwinEastern Institute of TechnologyNingbo315201China
- Division of Life Sciences and MedicineUniversity of Science and Technology of ChinaHefei230021China
- Big Data and Intelligent Computing Research CenterGuangxi Academy of ScienceNanning530007China
| | - Siguo Wang
- Ningbo Institute of Digital TwinEastern Institute of TechnologyNingbo315201China
| | - Zhipeng Li
- Ningbo Institute of Digital TwinEastern Institute of TechnologyNingbo315201China
| | - Yijie Pan
- Ningbo Institute of Digital TwinEastern Institute of TechnologyNingbo315201China
| | - De‐Shuang Huang
- Ningbo Institute of Digital TwinEastern Institute of TechnologyNingbo315201China
- Institute for Regenerative MedicineShanghai East HospitalTongji UniversityShanghai200092China
| |
Collapse
|
33
|
Li RZ, Han CZ, Glass CK. TIANA: transcription factors cooperativity inference analysis with neural attention. BMC Bioinformatics 2024; 25:274. [PMID: 39174927 PMCID: PMC11342676 DOI: 10.1186/s12859-024-05852-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2023] [Accepted: 07/01/2024] [Indexed: 08/24/2024] Open
Abstract
BACKGROUND Growing evidence suggests that distal regulatory elements are essential for cellular function and states. The sequences within these distal elements, especially motifs for transcription factor binding, provide critical information about the underlying regulatory programs. However, cooperativities between transcription factors that recognize these motifs are nonlinear and multiplexed, rendering traditional modeling methods insufficient to capture the underlying mechanisms. Recent development of attention mechanism, which exhibit superior performance in capturing dependencies across input sequences, makes them well-suited to uncover and decipher intricate dependencies between regulatory elements. RESULT We present Transcription factors cooperativity Inference Analysis with Neural Attention (TIANA), a deep learning framework that focuses on interpretability. In this study, we demonstrated that TIANA could discover biologically relevant insights into co-occurring pairs of transcription factor motifs. Compared with existing tools, TIANA showed superior interpretability and robust performance in identifying putative transcription factor cooperativities from co-occurring motifs. CONCLUSION Our results suggest that TIANA can be an effective tool to decipher transcription factor cooperativities from distal sequence data. TIANA can be accessed through: https://github.com/rzzli/TIANA .
Collapse
Affiliation(s)
- Rick Z Li
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, 92093, USA
| | - Claudia Z Han
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, 92093, USA
| | - Christopher K Glass
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, 92093, USA.
| |
Collapse
|
34
|
Zhai J, Gokaslan A, Schiff Y, Berthel A, Liu ZY, Lai WY, Miller ZR, Scheben A, Stitzer MC, Romay MC, Buckler ES, Kuleshov V. Cross-species modeling of plant genomes at single nucleotide resolution using a pre-trained DNA language model. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.06.04.596709. [PMID: 38895432 PMCID: PMC11185591 DOI: 10.1101/2024.06.04.596709] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 06/21/2024]
Abstract
Interpreting function and fitness effects in diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation and offer cross-species prediction better than supervised models through fine-tuning limited labeled data. We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures, pre-trained on a curated dataset of 16 Angiosperm genomes. Fine-tuning PlantCaduceus on limited labeled Arabidopsis data for four tasks, including predicting translation initiation/termination sites and splice donor and acceptor sites, demonstrated high transferability to 160 million year diverged maize, outperforming the best existing DNA LM by 1.45 to 7.23-fold. PlantCaduceus is competitive to state-of-the-art protein LMs in terms of deleterious mutation identification, and is threefold better than PhyloP. Additionally, PlantCaduceus successfully identifies well-known causal variants in both Arabidopsis and maize. Overall, PlantCaduceus is a versatile DNA LM that can accelerate plant genomics and crop breeding applications.
Collapse
Affiliation(s)
- Jingjing Zhai
- Institute for Genomic Diversity, Cornell University, Ithaca, NY USA 14853
| | - Aaron Gokaslan
- Department of Computer Science, Cornell University, Ithaca, NY, USA 14853
| | - Yair Schiff
- Department of Computer Science, Cornell University, Ithaca, NY, USA 14853
| | - Ana Berthel
- Institute for Genomic Diversity, Cornell University, Ithaca, NY USA 14853
| | - Zong-Yan Liu
- Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY USA 14853
| | - Wei-Yun Lai
- Institute for Genomic Diversity, Cornell University, Ithaca, NY USA 14853
| | - Zachary R. Miller
- Institute for Genomic Diversity, Cornell University, Ithaca, NY USA 14853
| | - Armin Scheben
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY USA 11724
| | | | - M. Cinta Romay
- Institute for Genomic Diversity, Cornell University, Ithaca, NY USA 14853
| | - Edward S. Buckler
- Institute for Genomic Diversity, Cornell University, Ithaca, NY USA 14853
- Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY USA 14853
- USDA-ARS; Ithaca, NY, USA 14853
| | - Volodymyr Kuleshov
- Department of Computer Science, Cornell University, Ithaca, NY, USA 14853
| |
Collapse
|
35
|
Wang C, Liang C. CircCNNs, a convolutional neural network framework to better understand the biogenesis of exonic circRNAs. Sci Rep 2024; 14:18982. [PMID: 39152135 PMCID: PMC11329666 DOI: 10.1038/s41598-024-69262-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2024] [Accepted: 08/02/2024] [Indexed: 08/19/2024] Open
Abstract
Circular RNAs (circRNAs) as biomarkers for cancer detection have been extensively explored, however, the biogenesis mechanism is still elusive. In contrast to linear splicing (LS) involved in linear transcript formation, the so-called back splicing (BS) process has been proposed to explain circRNA formation. To investigate the potential mechanism of BS via the machine learning approach, we curated a high-quality BS and LS exon pairs dataset with evidence-based stringent filtering. Two convolutional neural networks (CNN) base models with different structures for processing splicing junction sequences including motif extraction were created and compared after extensive hyperparameter tuning. In contrast to the previous study, we are able to identify motifs corresponding to well-established BS-associated genes such as MBNL1, QKI, and ESPR2. Importantly, despite prevalent high false positive rates in existing circRNA detection pipelines and databases, our base models demonstrated a notable high specificity (greater than 90%). To further improve the model performance, a novo fast numerical method was proposed and implemented to calculate the reverse complementary matches (RCMs) crossing two flanking regions and within each flanking region of exon pairs. Our CircCNNs framework that incorporated RCM information into the optimal base models further reduced the false positive rates leading to 88% prediction accuracy.
Collapse
Affiliation(s)
- Chao Wang
- Department of Biology, Miami University, Oxford, OH, 45056, USA.
| | - Chun Liang
- Department of Biology, Miami University, Oxford, OH, 45056, USA.
| |
Collapse
|
36
|
Ramamurthy E, Agarwal S, Toong N, Sestili H, Kaplow IM, Chen Z, Phan B, Pfenning AR. Regression convolutional neural network models implicate peripheral immune regulatory variants in the predisposition to Alzheimer's disease. PLoS Comput Biol 2024; 20:e1012356. [PMID: 39186798 PMCID: PMC11389932 DOI: 10.1371/journal.pcbi.1012356] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/24/2023] [Revised: 09/11/2024] [Accepted: 07/23/2024] [Indexed: 08/28/2024] Open
Abstract
Alzheimer's disease (AD) involves aggregation of amyloid β and tau, neuron loss, cognitive decline, and neuroinflammatory responses. Both resident microglia and peripheral immune cells have been associated with the immune component of AD. However, the relative contribution of resident and peripheral immune cell types to AD predisposition has not been thoroughly explored due to their similarity in gene expression and function. To study the effects of AD-associated variants on cis-regulatory elements, we train convolutional neural network (CNN) regression models that link genome sequence to cell type-specific levels of open chromatin, a proxy for regulatory element activity. We then use in silico mutagenesis of regulatory sequences to predict the relative impact of candidate variants across these cell types. We develop and apply criteria for evaluating our models and refine our models using massively parallel reporter assay (MPRA) data. Our models identify multiple AD-associated variants with a greater predicted impact in peripheral cells relative to microglia or neurons. Our results support their use as models to study the effects of AD-associated variants and even suggest that peripheral immune cells themselves may mediate a component of AD predisposition. We make our library of CNN models and predictions available as a resource for the community to study immune and neurological disorders.
Collapse
Affiliation(s)
- Easwaran Ramamurthy
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
| | - Snigdha Agarwal
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
| | - Noelle Toong
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
| | - Heather Sestili
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
| | - Irene M. Kaplow
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
| | - Ziheng Chen
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
| | - BaDoi Phan
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
| | - Andreas R. Pfenning
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
| |
Collapse
|
37
|
Sokolova K, Chen KM, Hao Y, Zhou J, Troyanskaya OG. Deep Learning Sequence Models for Transcriptional Regulation. Annu Rev Genomics Hum Genet 2024; 25:105-122. [PMID: 38594933 DOI: 10.1146/annurev-genom-021623-024727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/11/2024]
Abstract
Deciphering the regulatory code of gene expression and interpreting the transcriptional effects of genome variation are critical challenges in human genetics. Modern experimental technologies have resulted in an abundance of data, enabling the development of sequence-based deep learning models that link patterns embedded in DNA to the biochemical and regulatory properties contributing to transcriptional regulation, including modeling epigenetic marks, 3D genome organization, and gene expression, with tissue and cell-type specificity. Such methods can predict the functional consequences of any noncoding variant in the human genome, even rare or never-before-observed variants, and systematically characterize their consequences beyond what is tractable from experiments or quantitative genetics studies alone. Recently, the development and application of interpretability approaches have led to the identification of key sequence patterns contributing to the predicted tasks, providing insights into the underlying biological mechanisms learned and revealing opportunities for improvement in future models.
Collapse
Affiliation(s)
- Ksenia Sokolova
- Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA; , ,
| | - Kathleen M Chen
- Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA; , ,
| | - Yun Hao
- Flatiron Institute, Simons Foundation, New York, NY, USA;
| | - Jian Zhou
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, Texas, USA;
| | - Olga G Troyanskaya
- Princeton Precision Health, Princeton University, Princeton, New Jersey, USA
- Flatiron Institute, Simons Foundation, New York, NY, USA;
- Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA; , ,
| |
Collapse
|
38
|
Lu Z, Xiao X, Zheng Q, Wang X, Xu L. Assessing next-generation sequencing-based computational methods for predicting transcriptional regulators with query gene sets. Brief Bioinform 2024; 25:bbae366. [PMID: 39082650 PMCID: PMC11289684 DOI: 10.1093/bib/bbae366] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2024] [Revised: 06/21/2024] [Accepted: 07/18/2024] [Indexed: 08/03/2024] Open
Abstract
This article provides an in-depth review of computational methods for predicting transcriptional regulators (TRs) with query gene sets. Identification of TRs is of utmost importance in many biological applications, including but not limited to elucidating biological development mechanisms, identifying key disease genes, and predicting therapeutic targets. Various computational methods based on next-generation sequencing (NGS) data have been developed in the past decade, yet no systematic evaluation of NGS-based methods has been offered. We classified these methods into two categories based on shared characteristics, namely library-based and region-based methods. We further conducted benchmark studies to evaluate the accuracy, sensitivity, coverage, and usability of NGS-based methods with molecular experimental datasets. Results show that BART, ChIP-Atlas, and Lisa have relatively better performance. Besides, we point out the limitations of NGS-based methods and explore potential directions for further improvement.
Collapse
Affiliation(s)
- Zeyu Lu
- Department of Statistics and Data Science, Moody School of Graduate and Advanced Studies, Southern Methodist University, 3225 Daniel Ave., P.O. Box 750332, Dallas, TX, United States
| | - Xue Xiao
- Quantitative Biomedical Research Center, Peter O’Donnell Jr. School of Public Health, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX, United States
| | - Qiang Zheng
- Division of Data Science, College of Science, University of Texas at Arlington, 501 S. Nedderman Dr., Arlington, TX 76019, United States
| | - Xinlei Wang
- Division of Data Science, College of Science, University of Texas at Arlington, 501 S. Nedderman Dr., Arlington, TX 76019, United States
- Department of Mathematics, University of Texas at Arlington, 411 S. Nedderman Dr., Arlington, TX 76019, United States
| | - Lin Xu
- Quantitative Biomedical Research Center, Peter O’Donnell Jr. School of Public Health, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX, United States
- Department of Pediatrics, Division of Hematology/Oncology, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd., Dallas, TX, United States
| |
Collapse
|
39
|
Wang X, Li F, Zhang Y, Imoto S, Shen HH, Li S, Guo Y, Yang J, Song J. Deep learning approaches for non-coding genetic variant effect prediction: current progress and future prospects. Brief Bioinform 2024; 25:bbae446. [PMID: 39276327 PMCID: PMC11401448 DOI: 10.1093/bib/bbae446] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2024] [Revised: 08/08/2024] [Accepted: 08/27/2024] [Indexed: 09/16/2024] Open
Abstract
Recent advancements in high-throughput sequencing technologies have significantly enhanced our ability to unravel the intricacies of gene regulatory processes. A critical challenge in this endeavor is the identification of variant effects, a key factor in comprehending the mechanisms underlying gene regulation. Non-coding variants, constituting over 90% of all variants, have garnered increasing attention in recent years. The exploration of gene variant impacts and regulatory mechanisms has spurred the development of various deep learning approaches, providing new insights into the global regulatory landscape through the analysis of extensive genetic data. Here, we provide a comprehensive overview of the development of the non-coding variants models based on bulk and single-cell sequencing data and their model-based interpretation and downstream tasks. This review delineates the popular sequencing technologies for epigenetic profiling and deep learning approaches for discerning the effects of non-coding variants. Additionally, we summarize the limitations of current approaches in variant effect prediction research and outline opportunities for improvement. We anticipate that our study will offer a practical and useful guide for the bioinformatic community to further advance the unraveling of genetic variant effects.
Collapse
Affiliation(s)
- Xiaoyu Wang
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| | - Fuyi Li
- South Australian immunoGENomics Cancer Institute (SAiGENCI), Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, SA 5005, Australia
| | - Yiwen Zhang
- School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC 3004, Australia
| | - Seiya Imoto
- Genome Center, Institute of Medical Science, The University of Tokyo, Minato-ku, Tokyo 108-8639, Japan
- Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, Bunkyo-ku, Tokyo 113-8657, Japan
| | - Hsin-Hui Shen
- Department of Materials Science and Engineering, Faculty of Engineering, Monash University, Clayton, VIC 3800, Australia
| | - Shanshan Li
- School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC 3004, Australia
| | - Yuming Guo
- School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC 3004, Australia
| | - Jian Yang
- School of Life Sciences, Westlake University, Hangzhou, Zhejiang 310030, China
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang 310024, China
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| |
Collapse
|
40
|
Alagarswamy K, Shi W, Boini A, Messaoudi N, Grasso V, Cattabiani T, Turner B, Croner R, Kahlert UD, Gumbs A. Should AI-Powered Whole-Genome Sequencing Be Used Routinely for Personalized Decision Support in Surgical Oncology—A Scoping Review. BIOMEDINFORMATICS 2024; 4:1757-1772. [DOI: 10.3390/biomedinformatics4030096] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/12/2025]
Abstract
In this scoping review, we delve into the transformative potential of artificial intelligence (AI) in addressing challenges inherent in whole-genome sequencing (WGS) analysis, with a specific focus on its implications in oncology. Unveiling the limitations of existing sequencing technologies, the review illuminates how AI-powered methods emerge as innovative solutions to surmount these obstacles. The evolution of DNA sequencing technologies, progressing from Sanger sequencing to next-generation sequencing, sets the backdrop for AI’s emergence as a potent ally in processing and analyzing the voluminous genomic data generated. Particularly, deep learning methods play a pivotal role in extracting knowledge and discerning patterns from the vast landscape of genomic information. In the context of oncology, AI-powered methods exhibit considerable potential across diverse facets of WGS analysis, including variant calling, structural variation identification, and pharmacogenomic analysis. This review underscores the significance of multimodal approaches in diagnoses and therapies, highlighting the importance of ongoing research and development in AI-powered WGS techniques. Integrating AI into the analytical framework empowers scientists and clinicians to unravel the intricate interplay of genomics within the realm of multi-omics research, paving the way for more successful personalized and targeted treatments.
Collapse
Affiliation(s)
| | - Wenjie Shi
- Department of General-, Visceral-, Vascular and Transplantation Surgery, University of Magdeburg, Haus 60a, Leipziger Str. 44, 39120 Magdeburg, Germany
| | - Aishwarya Boini
- Davao Medical School Foundation, Davao City 8000, Philippines
| | - Nouredin Messaoudi
- Department of Hepatopancreatobiliary Surgery, Vrije Universiteit Brussel (VUB), Universitair Ziekenhuis Brussel (UZ Brussel), Europe Hospitals, 1090 Brussels, Belgium
| | - Vincent Grasso
- Department of Electrical and Computer Engineering, University of New Mexico, Albuquerque, NM 87131, USA
| | | | | | - Roland Croner
- Department of General-, Visceral-, Vascular and Transplantation Surgery, University of Magdeburg, Haus 60a, Leipziger Str. 44, 39120 Magdeburg, Germany
| | - Ulf D. Kahlert
- Department of General-, Visceral-, Vascular and Transplantation Surgery, University of Magdeburg, Haus 60a, Leipziger Str. 44, 39120 Magdeburg, Germany
| | - Andrew Gumbs
- Department of General-, Visceral-, Vascular and Transplantation Surgery, University of Magdeburg, Haus 60a, Leipziger Str. 44, 39120 Magdeburg, Germany
- Talos Surgical, Inc., New Castle, DE 19720, USA
- Department of Surgery, American Hospital of Tbilisi, 0102 Tbilisi, Georgia
| |
Collapse
|
41
|
Diao B, Luo J, Guo Y. A comprehensive survey on deep learning-based identification and predicting the interaction mechanism of long non-coding RNAs. Brief Funct Genomics 2024; 23:314-324. [PMID: 38576205 DOI: 10.1093/bfgp/elae010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/06/2023] [Revised: 02/25/2024] [Accepted: 03/14/2024] [Indexed: 04/06/2024] Open
Abstract
Long noncoding RNAs (lncRNAs) have been discovered to be extensively involved in eukaryotic epigenetic, transcriptional, and post-transcriptional regulatory processes with the advancements in sequencing technology and genomics research. Therefore, they play crucial roles in the body's normal physiology and various disease outcomes. Presently, numerous unknown lncRNA sequencing data require exploration. Establishing deep learning-based prediction models for lncRNAs provides valuable insights for researchers, substantially reducing time and costs associated with trial and error and facilitating the disease-relevant lncRNA identification for prognosis analysis and targeted drug development as the era of artificial intelligence progresses. However, most lncRNA-related researchers lack awareness of the latest advancements in deep learning models and model selection and application in functional research on lncRNAs. Thus, we elucidate the concept of deep learning models, explore several prevalent deep learning algorithms and their data preferences, conduct a comprehensive review of recent literature studies with exemplary predictive performance over the past 5 years in conjunction with diverse prediction functions, critically analyze and discuss the merits and limitations of current deep learning models and solutions, while also proposing prospects based on cutting-edge advancements in lncRNA research.
Collapse
Affiliation(s)
- Biyu Diao
- Department of Breast Surgery, The First Affiliated Hospital of Ningbo University, No. 59, Liuting Street, Haishu District, Ningbo 315000, China
| | - Jin Luo
- Department of Breast Surgery, The First Affiliated Hospital of Ningbo University, No. 59, Liuting Street, Haishu District, Ningbo 315000, China
| | - Yu Guo
- Department of Breast Surgery, The First Affiliated Hospital of Ningbo University, No. 59, Liuting Street, Haishu District, Ningbo 315000, China
| |
Collapse
|
42
|
Romero R, Menichelli C, Vroland C, Marin JM, Lèbre S, Lecellier CH, Bréhélin L. TFscope: systematic analysis of the sequence features involved in the binding preferences of transcription factors. Genome Biol 2024; 25:187. [PMID: 38987807 PMCID: PMC11514967 DOI: 10.1186/s13059-024-03321-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2022] [Accepted: 06/24/2024] [Indexed: 07/12/2024] Open
Abstract
Characterizing the binding preferences of transcription factors (TFs) in different cell types and conditions is key to understand how they orchestrate gene expression. Here, we develop TFscope, a machine learning approach that identifies sequence features explaining the binding differences observed between two ChIP-seq experiments targeting either the same TF in two conditions or two TFs with similar motifs (paralogous TFs). TFscope systematically investigates differences in the core motif, nucleotide environment and co-factor motifs, and provides the contribution of each key feature in the two experiments. TFscope was applied to > 305 ChIP-seq pairs, and several examples are discussed.
Collapse
Affiliation(s)
- Raphaël Romero
- LIRMM, Univ Montpellier, CNRS, Montpellier, France
- IMAG, Univ Montpellier, CNRS, Montpellier, France
| | | | - Christophe Vroland
- LIRMM, Univ Montpellier, CNRS, Montpellier, France
- Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France
| | | | - Sophie Lèbre
- IMAG, Univ Montpellier, CNRS, Montpellier, France.
- AMIS, Université Paul-Valéry-Montpellier 3, Montpellier, France.
| | - Charles-Henri Lecellier
- LIRMM, Univ Montpellier, CNRS, Montpellier, France.
- Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France.
| | | |
Collapse
|
43
|
Mendoza-Revilla J, Trop E, Gonzalez L, Roller M, Dalla-Torre H, de Almeida BP, Richard G, Caton J, Lopez Carranza N, Skwark M, Laterre A, Beguir K, Pierrot T, Lopez M. A foundational large language model for edible plant genomes. Commun Biol 2024; 7:835. [PMID: 38982288 PMCID: PMC11233511 DOI: 10.1038/s42003-024-06465-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2023] [Accepted: 06/17/2024] [Indexed: 07/11/2024] Open
Abstract
Significant progress has been made in the field of plant genomics, as demonstrated by the increased use of high-throughput methodologies that enable the characterization of multiple genome-wide molecular phenotypes. These findings have provided valuable insights into plant traits and their underlying genetic mechanisms, particularly in model plant species. Nonetheless, effectively leveraging them to make accurate predictions represents a critical step in crop genomic improvement. We present AgroNT, a foundational large language model trained on genomes from 48 plant species with a predominant focus on crop species. We show that AgroNT can obtain state-of-the-art predictions for regulatory annotations, promoter/terminator strength, tissue-specific gene expression, and prioritize functional variants. We conduct a large-scale in silico saturation mutagenesis analysis on cassava to evaluate the regulatory impact of over 10 million mutations and provide their predicted effects as a resource for variant characterization. Finally, we propose the use of the diverse datasets compiled here as the Plants Genomic Benchmark (PGB), providing a comprehensive benchmark for deep learning-based methods in plant genomic research. The pre-trained AgroNT model is publicly available on HuggingFace at https://huggingface.co/InstaDeepAI/agro-nucleotide-transformer-1b for future research purposes.
Collapse
|
44
|
Moeckel C, Mouratidis I, Chantzi N, Uzun Y, Georgakopoulos-Soares I. Advances in computational and experimental approaches for deciphering transcriptional regulatory networks: Understanding the roles of cis-regulatory elements is essential, and recent research utilizing MPRAs, STARR-seq, CRISPR-Cas9, and machine learning has yielded valuable insights. Bioessays 2024; 46:e2300210. [PMID: 38715516 PMCID: PMC11444527 DOI: 10.1002/bies.202300210] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2023] [Revised: 04/22/2024] [Accepted: 04/23/2024] [Indexed: 05/16/2024]
Abstract
Understanding the influence of cis-regulatory elements on gene regulation poses numerous challenges given complexities stemming from variations in transcription factor (TF) binding, chromatin accessibility, structural constraints, and cell-type differences. This review discusses the role of gene regulatory networks in enhancing understanding of transcriptional regulation and covers construction methods ranging from expression-based approaches to supervised machine learning. Additionally, key experimental methods, including MPRAs and CRISPR-Cas9-based screening, which have significantly contributed to understanding TF binding preferences and cis-regulatory element functions, are explored. Lastly, the potential of machine learning and artificial intelligence to unravel cis-regulatory logic is analyzed. These computational advances have far-reaching implications for precision medicine, therapeutic target discovery, and the study of genetic variations in health and disease.
Collapse
Affiliation(s)
- Camille Moeckel
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
| | - Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Yasin Uzun
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
- Department of Pediatrics, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
| |
Collapse
|
45
|
Ahmed FS, Aly S, Liu X. EPI-Trans: an effective transformer-based deep learning model for enhancer promoter interaction prediction. BMC Bioinformatics 2024; 25:216. [PMID: 38890584 PMCID: PMC11184834 DOI: 10.1186/s12859-024-05784-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2023] [Accepted: 04/15/2024] [Indexed: 06/20/2024] Open
Abstract
BACKGROUND Recognition of enhancer-promoter Interactions (EPIs) is crucial for human development. EPIs in the genome play a key role in regulating transcription. However, experimental approaches for classifying EPIs are too expensive in terms of effort, time, and resources. Therefore, more and more studies are being done on developing computational techniques, particularly using deep learning and other machine learning techniques, to address such problems. Unfortunately, the majority of current computational methods are based on convolutional neural networks, recurrent neural networks, or a combination of them, which don't take into consideration contextual details and the long-range interactions between the enhancer and promoter sequences. A new transformer-based model called EPI-Trans is presented in this study to overcome the aforementioned limitations. The multi-head attention mechanism in the transformer model automatically learns features that represent the long interrelationships between enhancer and promoter sequences. Furthermore, a generic model is created with transferability that can be utilized as a pre-trained model for various cell lines. Moreover, the parameters of the generic model are fine-tuned using a particular cell line dataset to improve performance. RESULTS Based on the results obtained from six benchmark cell lines, the average AUROC for the specific, generic, and best models is 94.2%, 95%, and 95.7%, while the average AUPR is 80.5%, 66.1%, and 79.6% respectively. CONCLUSIONS This study proposed a transformer-based deep learning model for EPI prediction. The comparative results on certain cell lines show that EPI-Trans outperforms other cutting-edge techniques and can provide superior performance on the challenge of recognizing EPI.
Collapse
Affiliation(s)
- Fatma S Ahmed
- Department of Computer Science and Technology, Xiamen University, Xiamen, 361005, China.
- Department of Electrical Engineering, Aswan University, Aswan, 81542, Egypt.
| | - Saleh Aly
- Department of Electrical Engineering, Aswan University, Aswan, 81542, Egypt.
- Department of Information Technology, Majmaah University, 11952, Majmaah, Saudi Arabia.
| | - Xiangrong Liu
- Department of Computer Science and Technology, Xiamen University, Xiamen, 361005, China
| |
Collapse
|
46
|
Zhu S, Yuan S, Niu R, Zhou Y, Wang Z, Xu G. RNAirport: a deep neural network-based database characterizing representative gene models in plants. J Genet Genomics 2024; 51:652-664. [PMID: 38518981 DOI: 10.1016/j.jgg.2024.03.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2024] [Revised: 03/15/2024] [Accepted: 03/16/2024] [Indexed: 03/24/2024]
Abstract
A 5'-leader, known initially as the 5'-untranslated region, contains multiple isoforms due to alternative splicing (aS) and alternative transcription start site (aTSS). Therefore, a representative 5'-leader is demanded to examine the embedded RNA regulatory elements in controlling translation efficiency. Here, we develop a ranking algorithm and a deep-learning model to annotate representative 5'-leaders for five plant species. We rank the intra-sample and inter-sample frequency of aS-mediated transcript isoforms using the Kruskal-Wallis test-based algorithm and identify the representative aS-5'-leader. To further assign a representative 5'-end, we train the deep-learning model 5'leaderP to learn aTSS-mediated 5'-end distribution patterns from cap-analysis gene expression data. The model accurately predicts the 5'-end, confirmed experimentally in Arabidopsis and rice. The representative 5'-leader-contained gene models and 5'leaderP can be accessed at RNAirport (http://www.rnairport.com/leader5P/). The Stage 1 annotation of 5'-leader records 5'-leader diversity and will pave the way to Ribo-Seq open-reading frame annotation, identical to the project recently initiated by human GENCODE.
Collapse
Affiliation(s)
- Sitao Zhu
- State Key Laboratory of Hybrid Rice, Institute for Advanced Studies (IAS), Wuhan University, Wuhan, Hubei 430072, China
| | - Shu Yuan
- State Key Laboratory of Hybrid Rice, Institute for Advanced Studies (IAS), Wuhan University, Wuhan, Hubei 430072, China
| | - Ruixia Niu
- State Key Laboratory of Hybrid Rice, Institute for Advanced Studies (IAS), Wuhan University, Wuhan, Hubei 430072, China
| | - Yulu Zhou
- State Key Laboratory of Hybrid Rice, Institute for Advanced Studies (IAS), Wuhan University, Wuhan, Hubei 430072, China
| | - Zhao Wang
- State Key Laboratory of Hybrid Rice, Institute for Advanced Studies (IAS), Wuhan University, Wuhan, Hubei 430072, China
| | - Guoyong Xu
- State Key Laboratory of Hybrid Rice, Institute for Advanced Studies (IAS), Wuhan University, Wuhan, Hubei 430072, China; Hubei Hongshan Laboratory, Wuhan, Hubei 430070, China.
| |
Collapse
|
47
|
Zhao L, Hao R, Chai Z, Fu W, Yang W, Li C, Liu Q, Jiang Y. DeepOCR: A multi-species deep-learning framework for accurate identification of open chromatin regions in livestock. Comput Biol Chem 2024; 110:108077. [PMID: 38691895 DOI: 10.1016/j.compbiolchem.2024.108077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2024] [Revised: 03/27/2024] [Accepted: 04/16/2024] [Indexed: 05/03/2024]
Abstract
A wealth of experimental evidence has suggested that open chromatin regions (OCRs) are involved in many critical biological activities, such as DNA replication, enhancer activity, and gene transcription. Accurately identifying OCRs in livestock species can provide critical insights into the distribution and characteristics of OCRs for disease treatment in livestock, thereby improving animal welfare. However, most current machine-learning methods for OCR prediction were originally designed for a limited number of model organisms, such as humans and some model organisms, and thus their performance on non-model organisms, specifically livestock, is often unsatisfactory. To bridge this gap, we propose DeepOCR, a lightweight depth-separable residual network model for predicting OCRs in livestock, including chicken, cattle, and sheep. DeepOCR integrates a single convolution layer and two improved residue structure blocks to extract and learn important features from the input DNA sequences. A fully connected layer was also employed to further process the extracted features and improve the robustness of the entire network. Our benchmarking experiments demonstrated superior prediction performance of DeepOCR compared to state-of-the-art approaches on testing datasets of the three species. The source code of DeepOCR is freely available for academic purposes at https://github.com/jasonzhao371/DeepOCR/. We anticipate DeepOCR servers as a practical and reliable computational tool for OCR-related studies in livestock species.
Collapse
Affiliation(s)
- Liangwei Zhao
- College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Ran Hao
- College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Ziyi Chai
- College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Weiwei Fu
- College of Pastoral Agriculture Science and Technology, Lanzhou University, Lanzhou, Gansu 730020, China
| | - Wei Yang
- National Clinical Research Center for Infectious Diseases, Shenzhen Third People's Hospital, Shenzhen 518112, China
| | - Chen Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.
| | - Quanzhong Liu
- College of Information Engineering, Northwest A&F University, Yangling 712100, China.
| | - Yu Jiang
- Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University, Yangling 712100, China; Key Laboratory of Livestock Biology, Northwest A&F University, Yangling, Shaanxi 712100, China.
| |
Collapse
|
48
|
Nie Z, Gao M, Jin X, Rao Y, Zhang X. MFPINC: prediction of plant ncRNAs based on multi-source feature fusion. BMC Genomics 2024; 25:531. [PMID: 38816689 PMCID: PMC11137975 DOI: 10.1186/s12864-024-10439-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2023] [Accepted: 05/21/2024] [Indexed: 06/01/2024] Open
Abstract
Non-coding RNAs (ncRNAs) are recognized as pivotal players in the regulation of essential physiological processes such as nutrient homeostasis, development, and stress responses in plants. Common methods for predicting ncRNAs are susceptible to significant effects of experimental conditions and computational methods, resulting in the need for significant investment of time and resources. Therefore, we constructed an ncRNA predictor(MFPINC), to predict potential ncRNA in plants which is based on the PINC tool proposed by our previous studies. Specifically, sequence features were carefully refined using variance thresholding and F-test methods, while deep features were extracted and feature fusion were performed by applying the GRU model. The comprehensive evaluation of multiple standard datasets shows that MFPINC not only achieves more comprehensive and accurate identification of gene sequences, but also significantly improves the expressive and generalization performance of the model, and MFPINC significantly outperforms the existing competing methods in ncRNA identification. In addition, it is worth mentioning that our tool can also be found on Github ( https://github.com/Zhenj-Nie/MFPINC ) the data and source code can also be downloaded for free.
Collapse
Affiliation(s)
- Zhenjun Nie
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, 230036, China
| | - Mengqing Gao
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, 230036, China
| | - Xiu Jin
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, 230036, China
- Key Laboratory of Agricultural Sensors, Ministry of Agriculture and Rural Affairs, Hefei, 230036, China
| | - Yuan Rao
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, 230036, China
- Key Laboratory of Agricultural Sensors, Ministry of Agriculture and Rural Affairs, Hefei, 230036, China
| | - Xiaodan Zhang
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, 230036, China.
- Key Laboratory of Agricultural Sensors, Ministry of Agriculture and Rural Affairs, Hefei, 230036, China.
| |
Collapse
|
49
|
Zhuang J, Huang X, Liu S, Gao W, Su R, Feng K. MulTFBS: A Spatial-Temporal Network with Multichannels for Predicting Transcription Factor Binding Sites. J Chem Inf Model 2024; 64:4322-4333. [PMID: 38733561 DOI: 10.1021/acs.jcim.3c02088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/13/2024]
Abstract
Revealing the mechanisms that influence transcription factor binding specificity is the key to understanding gene regulation. In previous studies, DNA double helix structure and one-hot embedding have been used successfully to design computational methods for predicting transcription factor binding sites (TFBSs). However, DNA sequence as a kind of biological language, the method of word embedding representation in natural language processing, has not been considered properly in TFBS prediction models. In our work, we integrate different types of features of DNA sequence to design a multichanneled deep learning framework, namely MulTFBS, in which independent one-hot encoding, word embedding encoding, which can incorporate contextual information and extract the global features of the sequences, and double helix three-dimensional structural features have been trained in different channels. To extract sequence high-level information effectively, in our deep learning framework, we select the spatial-temporal network by combining convolutional neural networks and bidirectional long short-term memory networks with attention mechanism. Compared with six state-of-the-art methods on 66 universal protein-binding microarray data sets of different transcription factors, MulTFBS performs best on all data sets in the regression tasks, with the average R2 of 0.698 and the average PCC of 0.833, which are 5.4% and 3.2% higher, respectively, than the suboptimal method CRPTS. In addition, we evaluate the classification performance of MulTFBS for distinguishing bound or unbound regions on TF ChIP-seq data. The results show that our framework also performs well in the TFBS classification tasks.
Collapse
Affiliation(s)
- Jujuan Zhuang
- The School of Science, Dalian Maritime University, Dalian 116026, China
| | - Xinru Huang
- The School of Science, Dalian Maritime University, Dalian 116026, China
| | - Shuhan Liu
- The School of Science, Dalian Maritime University, Dalian 116026, China
| | - Wanquan Gao
- The School of Science, Dalian Maritime University, Dalian 116026, China
| | - Rui Su
- The School of Science, Dalian Maritime University, Dalian 116026, China
| | - Kexin Feng
- The School of Science, Dalian Maritime University, Dalian 116026, China
| |
Collapse
|
50
|
Kim J, Seok J. ctGAN: combined transformation of gene expression and survival data with generative adversarial network. Brief Bioinform 2024; 25:bbae325. [PMID: 38980369 PMCID: PMC11232285 DOI: 10.1093/bib/bbae325] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/19/2024] [Revised: 05/29/2024] [Accepted: 06/21/2024] [Indexed: 07/10/2024] Open
Abstract
Recent studies have extensively used deep learning algorithms to analyze gene expression to predict disease diagnosis, treatment effectiveness, and survival outcomes. Survival analysis studies on diseases with high mortality rates, such as cancer, are indispensable. However, deep learning models are plagued by overfitting owing to the limited sample size relative to the large number of genes. Consequently, the latest style-transfer deep generative models have been implemented to generate gene expression data. However, these models are limited in their applicability for clinical purposes because they generate only transcriptomic data. Therefore, this study proposes ctGAN, which enables the combined transformation of gene expression and survival data using a generative adversarial network (GAN). ctGAN improves survival analysis by augmenting data through style transformations between breast cancer and 11 other cancer types. We evaluated the concordance index (C-index) enhancements compared with previous models to demonstrate its superiority. Performance improvements were observed in nine of the 11 cancer types. Moreover, ctGAN outperformed previous models in seven out of the 11 cancer types, with colon adenocarcinoma (COAD) exhibiting the most significant improvement (median C-index increase of ~15.70%). Furthermore, integrating the generated COAD enhanced the log-rank p-value (0.041) compared with using only the real COAD (p-value = 0.797). Based on the data distribution, we demonstrated that the model generated highly plausible data. In clustering evaluation, ctGAN exhibited the highest performance in most cases (89.62%). These findings suggest that ctGAN can be meaningfully utilized to predict disease progression and select personalized treatments in the medical field.
Collapse
Affiliation(s)
- Jaeyoon Kim
- School of Electrical and Computer Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Korea
| | - Junhee Seok
- School of Electrical and Computer Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Korea
| |
Collapse
|