101
Xiao Y, Wang J, Li J, Zhang P, Li J, Zhou Y, Zhou Q, Chen M, Sheng X, Liu Z, Han X, Guo G. An analytical framework for decoding cell type-specific genetic variation of gene regulation. Nat Commun 2023; 14:3884. [PMID: 37391400 PMCID: PMC10313894 DOI: 10.1038/s41467-023-39538-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/17/2022] [Accepted: 06/16/2023] [Indexed: 07/02/2023]
Abstract
A deeper understanding of genetic regulation and functional mechanisms underlying genetic associations with complex traits and diseases is impeded by cellular heterogeneity and linkage disequilibrium. To address these limits, we introduce Huatuo, a framework to decode genetic variation of gene regulation at cell type and single-nucleotide resolutions by integrating deep-learning-based variant predictions with population-based association analyses. We apply Huatuo to generate a comprehensive cell type-specific genetic variation landscape across human tissues and further evaluate their potential roles in complex diseases and traits. Finally, we show that Huatuo's inferences permit prioritizations of driver cell types associated with complex traits and diseases and allow for systematic insights into the mechanisms of phenotype-causal genetic variation.
Affiliation(s)
- Yanyu Xiao
  - Center for Stem Cell and Regenerative Medicine, and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, 310000, China
  - Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, Zhejiang, 311121, China
- Jingjing Wang
  - Center for Stem Cell and Regenerative Medicine, and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, 310000, China
  - Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, Zhejiang, 311121, China
- Jiaqi Li
  - Center for Stem Cell and Regenerative Medicine, and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, 310000, China
- Peijing Zhang
  - Center for Stem Cell and Regenerative Medicine, and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, 310000, China
  - Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, Zhejiang, 311121, China
- Jingyu Li
  - Center for Stem Cell and Regenerative Medicine, and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, 310000, China
- Yincong Zhou
  - College of Life Sciences, Zhejiang University, Hangzhou, Zhejiang, 310003, China
- Qing Zhou
  - Life Sciences Institute, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Ming Chen
  - College of Life Sciences, Zhejiang University, Hangzhou, Zhejiang, 310003, China
- Xin Sheng
  - Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, Zhejiang, 311121, China
- Zhihong Liu
  - Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, Zhejiang, 311121, China
- Xiaoping Han
  - Center for Stem Cell and Regenerative Medicine, and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, 310000, China
  - Zhejiang Provincial Key Lab for Tissue Engineering and Regenerative Medicine, Dr. Li Dak Sum & Yip Yio Chin Center for Stem Cell and Regenerative Medicine, Hangzhou, Zhejiang, 310058, China
- Guoji Guo
  - Center for Stem Cell and Regenerative Medicine, and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, 310000, China
  - Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, Zhejiang, 311121, China
  - Zhejiang Provincial Key Lab for Tissue Engineering and Regenerative Medicine, Dr. Li Dak Sum & Yip Yio Chin Center for Stem Cell and Regenerative Medicine, Hangzhou, Zhejiang, 310058, China
  - Zhejiang University-University of Edinburgh Institute, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 314400, China
102
Balcı AT, Ebeid MM, Benos PV, Kostka D, Chikina M. An intrinsically interpretable neural network architecture for sequence-to-function learning. Bioinformatics 2023; 39:i413-i422. [PMID: 37387140 PMCID: PMC10311317 DOI: 10.1093/bioinformatics/btad271] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/01/2023]
Abstract
MOTIVATION: Sequence-based deep learning approaches have been shown to predict a multitude of functional genomic readouts, including regions of open chromatin and RNA expression of genes. However, a major limitation of current methods is that model interpretation relies on computationally demanding post hoc analyses, and even then, one often cannot explain the internal mechanics of highly parameterized models. Here, we introduce a deep learning architecture called totally interpretable sequence-to-function model (tiSFM). tiSFM improves upon the performance of standard multilayer convolutional models while using fewer parameters. Additionally, while tiSFM is itself technically a multilayer neural network, internal model parameters are intrinsically interpretable in terms of relevant sequence motifs. RESULTS: We analyze published open chromatin measurements across hematopoietic lineage cell types and demonstrate that tiSFM outperforms a state-of-the-art convolutional neural network model custom-tailored to this dataset. We also show that it correctly identifies context-specific activities of transcription factors with known roles in hematopoietic differentiation, including Pax5 and Ebf1 for B-cells, and Rorc for innate lymphoid cells. tiSFM's model parameters have biologically meaningful interpretations, and we show the utility of our approach on a complex task of predicting the change in epigenetic state as a function of developmental transition. AVAILABILITY AND IMPLEMENTATION: The source code, including scripts for the analysis of key findings, can be found at https://github.com/boooooogey/ATAConv, implemented in Python.
Affiliation(s)
- Ali Tuğrul Balcı
  - Joint Carnegie Mellon University-University of Pittsburgh Program in Computational Biology, Pittsburgh, PA 15213, United States
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15213, United States
- Mark Maher Ebeid
  - Joint Carnegie Mellon University-University of Pittsburgh Program in Computational Biology, Pittsburgh, PA 15213, United States
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15213, United States
- Panayiotis V Benos
  - Department of Epidemiology, University of Florida, Gainesville, FL 32610, United States
- Dennis Kostka
  - Joint Carnegie Mellon University-University of Pittsburgh Program in Computational Biology, Pittsburgh, PA 15213, United States
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15213, United States
  - Department of Developmental Biology, University of Pittsburgh, Pittsburgh, PA 15213, United States
- Maria Chikina
  - Joint Carnegie Mellon University-University of Pittsburgh Program in Computational Biology, Pittsburgh, PA 15213, United States
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15213, United States
103
Khodursky S, Zheng EB, Svetec N, Durkin SM, Benjamin S, Gadau A, Wu X, Zhao L. The evolution and mutational robustness of chromatin accessibility in Drosophila. bioRxiv 2023:2023.06.26.546587. [PMID: 37425760 PMCID: PMC10327059 DOI: 10.1101/2023.06.26.546587] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/11/2023]
Abstract
The evolution of regulatory regions in the genome plays a critical role in shaping the diversity of life. While this process is primarily sequence-dependent, the enormous complexity of biological systems has made it difficult to understand the factors underlying regulation and its evolution. Here, we apply deep neural networks as a tool to investigate the sequence determinants underlying chromatin accessibility in different tissues of Drosophila. We train hybrid convolution-attention neural networks to accurately predict ATAC-seq peaks using only local DNA sequences as input. We show that a model trained in one species has nearly identical performance when tested in another species, implying that the sequence determinants of accessibility are highly conserved. Indeed, model performance remains excellent even in distantly-related species. By using our model to examine species-specific gains in chromatin accessibility, we find that their orthologous inaccessible regions in other species have surprisingly similar model outputs, suggesting that these regions may be ancestrally poised for evolution. We then use in silico saturation mutagenesis to reveal evidence of selective constraint acting specifically on inaccessible chromatin regions. We further show that chromatin accessibility can be accurately predicted from short subsequences in each example. However, in silico knock-out of these sequences does not qualitatively impair classification, implying that chromatin accessibility is mutationally robust. Subsequently, we demonstrate that chromatin accessibility is predicted to be robust to large-scale random mutation even in the absence of selection. We also perform in silico evolution experiments under the regime of strong selection and weak mutation (SSWM) and show that chromatin accessibility can be extremely malleable despite its mutational robustness. However, selection acting in different directions in a tissue-specific manner can substantially slow adaptation. 
Finally, we identify motifs predictive of chromatin accessibility and recover motifs corresponding to known chromatin accessibility activators and repressors. These results demonstrate the conservation of the sequence determinants of accessibility and the general robustness of chromatin accessibility, as well as the power of deep neural networks as tools to answer fundamental questions in regulatory genomics and evolution.
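The in silico saturation mutagenesis mentioned in this abstract can be pictured with a minimal sketch. It assumes a hypothetical scoring function, `toy_score`, standing in for a trained accessibility model; the function and the "GATA" pattern it counts are illustrative choices, not taken from the paper.

```python
from typing import Callable, Dict, Tuple

BASES = "ACGT"

def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a trained model: counts 'GATA' occurrences."""
    return float(sum(1 for i in range(len(seq) - 3) if seq[i:i + 4] == "GATA"))

def saturation_mutagenesis(
    seq: str, score: Callable[[str], float]
) -> Dict[Tuple[int, str], float]:
    """Return score(mutant) - score(reference) for every single-base substitution."""
    ref = score(seq)
    effects: Dict[Tuple[int, str], float] = {}
    for pos, ref_base in enumerate(seq):
        for alt in BASES:
            if alt == ref_base:
                continue  # skip the reference base itself
            mutant = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, alt)] = score(mutant) - ref
    return effects

# Every position of an 8-base sequence is mutated to its 3 alternative bases.
effects = saturation_mutagenesis("CCGATACC", toy_score)
```

Positions whose substitutions shift the score strongly are the ones the model treats as important; with a real model, `toy_score` would be replaced by the model's prediction for the tissue of interest.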
Affiliation(s)
- Samuel Khodursky
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
  - These authors contributed equally
- Eric B Zheng
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
  - These authors contributed equally
- Nicolas Svetec
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
- Sylvia M Durkin
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
  - Current Address: Department of Integrative Biology and Museum of Vertebrate Zoology, University of California, Berkeley, Berkeley, CA, USA
- Sigi Benjamin
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
- Alice Gadau
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
- Xia Wu
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
- Li Zhao
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
104
Novakovsky G, Fornes O, Saraswat M, Mostafavi S, Wasserman WW. ExplaiNN: interpretable and transparent neural networks for genomics. Genome Biol 2023; 24:154. [PMID: 37370113 DOI: 10.1186/s13059-023-02985-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 11/20/2022] [Accepted: 06/12/2023] [Indexed: 06/29/2023]
Abstract
Deep learning models such as convolutional neural networks (CNNs) excel in genomic tasks but lack interpretability. We introduce ExplaiNN, which combines the expressiveness of CNNs with the interpretability of linear models. ExplaiNN can predict TF binding, chromatin accessibility, and de novo motifs, achieving performance comparable to state-of-the-art methods. Its predictions are transparent, providing global (cell state level) as well as local (individual sequence level) biological insights into the data. ExplaiNN can serve as a plug-and-play platform for pretrained models and annotated position weight matrices. ExplaiNN aims to accelerate the adoption of deep learning in genomic sequence analysis by domain experts.
Affiliation(s)
- Gherman Novakovsky
  - Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
- Oriol Fornes
  - Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
- Manu Saraswat
  - Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
  - Division of Computational Genomics and Systems Genetics, German Cancer Research Center (DKFZ), Heidelberg, Germany
  - European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
- Sara Mostafavi
  - Paul G. Allen School of Computer Science and Engineering, University of Washington (UW), Seattle, USA
- Wyeth W Wasserman
  - Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
105
Chu SK, Stormo GD. Finding motifs using DNA images derived from sparse representations. Bioinformatics 2023; 39:btad378. [PMID: 37294804 PMCID: PMC10290554 DOI: 10.1093/bioinformatics/btad378] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/07/2022] [Revised: 05/10/2023] [Accepted: 06/08/2023] [Indexed: 06/11/2023]
Abstract
MOTIVATION: Motifs play a crucial role in computational biology, as they provide valuable information about the binding specificity of proteins. However, conventional motif discovery methods typically rely on simple combinatoric or probabilistic approaches, which can be biased by heuristics such as substring-masking for multiple motif discovery. In recent years, deep neural networks have become increasingly popular for motif discovery, as they are capable of capturing complex patterns in data. Nonetheless, inferring motifs from neural networks remains a challenging problem, both from a modeling and computational standpoint, despite the success of these networks in supervised learning tasks. RESULTS: We present a principled representation learning approach based on a hierarchical sparse representation for motif discovery. Our method effectively discovers gapped, long, or overlapping motifs that we show to commonly exist in next-generation sequencing datasets, in addition to the short and enriched primary binding sites. Our model is fully interpretable, fast, and capable of capturing motifs in a large number of DNA strings. A key concept that emerged from our approach, enumerating at the image level, effectively overcomes the k-mers paradigm, enabling the use of modest computational resources for capturing the long and varied but conserved patterns, in addition to capturing the primary binding sites. AVAILABILITY AND IMPLEMENTATION: Our method is available as a Julia package under the MIT license at https://github.com/kchu25/MOTIFs.jl, and the results on experimental data can be found at https://zenodo.org/record/7783033.
Affiliation(s)
- Shane K Chu
  - Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO 63130, United States
- Gary D Stormo
  - Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, United States
106
Salvatore M, Horlacher M, Marsico A, Winther O, Andersson R. Transfer learning identifies sequence determinants of cell-type specific regulatory element accessibility. NAR Genom Bioinform 2023; 5:lqad026. [PMID: 37007588 PMCID: PMC10052367 DOI: 10.1093/nargab/lqad026] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/22/2022] [Revised: 03/01/2023] [Accepted: 03/07/2023] [Indexed: 04/03/2023]
Abstract
Dysfunction of regulatory elements through genetic variants is a central mechanism in the pathogenesis of disease. To better understand disease etiology, there is consequently a need to understand how DNA encodes regulatory activity. Deep learning methods show great promise for modeling of biomolecular data from DNA sequence but are limited to large input data for training. Here, we develop ChromTransfer, a transfer learning method that uses a pre-trained, cell-type agnostic model of open chromatin regions as a basis for fine-tuning on regulatory sequences. We demonstrate superior performances with ChromTransfer for learning cell-type specific chromatin accessibility from sequence compared to models not informed by a pre-trained model. Importantly, ChromTransfer enables fine-tuning on small input data with minimal decrease in accuracy. We show that ChromTransfer uses sequence features matching binding site sequences of key transcription factors for prediction. Together, these results demonstrate ChromTransfer as a promising tool for learning the regulatory code.
Affiliation(s)
- Marco Salvatore
  - Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, 2200, Copenhagen, Denmark
  - Abzu ApS, 2150, Copenhagen, Denmark
- Marc Horlacher
  - Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, 2200, Copenhagen, Denmark
  - Department of Computer Science, Technical University Munich, Munich, Germany
  - Computational Health Center, Helmholtz Center Munich, Munich, Germany
- Annalisa Marsico
  - Computational Health Center, Helmholtz Center Munich, Munich, Germany
- Ole Winther
  - Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, 2200, Copenhagen, Denmark
  - Section for Cognitive Systems, DTU Compute, Technical University of Denmark, 2800 Kongens Lyngby, Denmark
  - Department of Genomic medicine, Rigshospitalet, 2100 Copenhagen, Denmark
- Robin Andersson
  - Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, 2200, Copenhagen, Denmark
  - The Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
107
Shi FY, Wang Y, Huang D, Liang Y, Liang N, Chen XW, Gao G. Computational Assessment of the Expression-modulating Potential for Non-coding Variants. Genomics Proteomics Bioinformatics 2023; 21:662-673. [PMID: 34890839 PMCID: PMC10787178 DOI: 10.1016/j.gpb.2021.10.003] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/27/2021] [Revised: 10/13/2021] [Accepted: 11/01/2021] [Indexed: 06/13/2023]
Abstract
Large-scale genome-wide association studies (GWAS) and expression quantitative trait locus (eQTL) studies have identified multiple non-coding variants associated with genetic diseases by affecting gene expression. However, pinpointing causal variants effectively and efficiently remains a serious challenge. Here, we developed CARMEN, a novel algorithm to identify functional non-coding expression-modulating variants. Multiple evaluations demonstrated CARMEN's superior performance over state-of-the-art tools. Applying CARMEN to GWAS and eQTL datasets further pinpointed several causal variants other than the reported lead single-nucleotide polymorphisms (SNPs). CARMEN scales well with the massive datasets, and is available online as a web server at http://carmen.gao-lab.org.
Affiliation(s)
- Fang-Yuan Shi
  - State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Biomedical Pioneering Innovative Center (BIOPIC) & Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), Peking University, Beijing 100871, China
- Yu Wang
  - State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Biomedical Pioneering Innovative Center (BIOPIC) & Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), Peking University, Beijing 100871, China
- Dong Huang
  - State Key Laboratory of Membrane Biology, Institute of Molecular Medicine, Peking University, Beijing 100871, China
- Yu Liang
  - Human Aging Research Institute, School of Life Science, Nanchang University, Nanchang 330031, China
- Nan Liang
  - State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Biomedical Pioneering Innovative Center (BIOPIC) & Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), Peking University, Beijing 100871, China
- Xiao-Wei Chen
  - State Key Laboratory of Membrane Biology, Institute of Molecular Medicine, Peking University, Beijing 100871, China
  - Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
- Ge Gao
  - State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Biomedical Pioneering Innovative Center (BIOPIC) & Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), Peking University, Beijing 100871, China
108
Tognon M, Giugno R, Pinello L. A survey on algorithms to characterize transcription factor binding sites. Brief Bioinform 2023; 24:bbad156. [PMID: 37099664 PMCID: PMC10422928 DOI: 10.1093/bib/bbad156] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 01/13/2023] [Revised: 03/27/2023] [Accepted: 04/01/2023] [Indexed: 04/28/2023]
Abstract
Transcription factors (TFs) are key regulatory proteins that control the transcriptional rate of cells by binding short DNA sequences called transcription factor binding sites (TFBS) or motifs. Identifying and characterizing TFBS is fundamental to understanding the regulatory mechanisms governing the transcriptional state of cells. During the last decades, several experimental methods have been developed to recover DNA sequences containing TFBS. In parallel, computational methods have been proposed to discover and identify TFBS motifs based on these DNA sequences. This is one of the most widely investigated problems in bioinformatics and is referred to as the motif discovery problem. In this manuscript, we review classical and novel experimental and computational methods developed to discover and characterize TFBS motifs in DNA sequences, highlighting their advantages and drawbacks. We also discuss open challenges and future perspectives that could fill the remaining gaps in the field.
Affiliation(s)
- Manuel Tognon
  - Computer Science Department, University of Verona, Verona, Italy
  - Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Rosalba Giugno
  - Computer Science Department, University of Verona, Verona, Italy
- Luca Pinello
  - Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
  - Department of Pathology, Harvard Medical School, Boston, Massachusetts, United States of America
109
Farhadi F, Allahbakhsh M, Maghsoudi A, Armin N, Amintoosi H. DiMo: discovery of microRNA motifs using deep learning and motif embedding. Brief Bioinform 2023; 24:bbad182. [PMID: 37165972 DOI: 10.1093/bib/bbad182] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2022] [Revised: 04/17/2023] [Accepted: 04/21/2023] [Indexed: 05/12/2023]
Abstract
MicroRNAs are small regulatory RNAs that decrease gene expression after transcription in various biological disciplines. In bioinformatics, identifying microRNAs and predicting their functionalities is critical. Finding motifs is one of the most well-known and important methods for identifying the functionalities of microRNAs. Several motif discovery techniques have been proposed, some of which rely on artificial intelligence-based techniques. However, in the case of few or no training data, their accuracy is low. In this research, we propose a new computational approach, called DiMo, for identifying motifs in microRNAs and generally macromolecules of small length. We employ word embedding techniques and deep learning models to improve the accuracy of motif discovery results. Also, we rely on transfer learning models to pre-train a model and use it in cases of a lack of (enough) training data. We compare our approach with five state-of-the-art works using three real-world datasets. DiMo outperforms the selected related works in terms of precision, recall, accuracy and f1-score.
Affiliation(s)
- Fatemeh Farhadi
  - Department of Bioinformatics, University of Zabol, Zabol, Iran
- Ali Maghsoudi
  - Department of Bioinformatics, University of Zabol, Zabol, Iran
- Nadieh Armin
  - Computer Engineering Department, Ferdowsi University of Mashhad, Mashhad, Iran
- Haleh Amintoosi
  - Computer Engineering Department, Ferdowsi University of Mashhad, Mashhad, Iran
110
Smith GD, Ching WH, Cornejo-Páramo P, Wong ES. Decoding enhancer complexity with machine learning and high-throughput discovery. Genome Biol 2023; 24:116. [PMID: 37173718 PMCID: PMC10176946 DOI: 10.1186/s13059-023-02955-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 07/18/2022] [Accepted: 04/28/2023] [Indexed: 05/15/2023]
Abstract
Enhancers are genomic DNA elements controlling spatiotemporal gene expression. Their flexible organization and functional redundancies make deciphering their sequence-function relationships challenging. This article provides an overview of the current understanding of enhancer organization and evolution, with an emphasis on factors that influence these relationships. Technological advancements, particularly in machine learning and synthetic biology, are discussed in light of how they provide new ways to understand this complexity. Exciting opportunities lie ahead as we continue to unravel the intricacies of enhancer function.
Affiliation(s)
- Gabrielle D Smith
  - Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
  - School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia
- Wan Hern Ching
  - Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
- Paola Cornejo-Páramo
  - Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
  - School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia
- Emily S Wong
  - Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
  - School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia
111
Cheng H, Liu L, Zhou Y, Deng K, Ge Y, Hu X. TSPTFBS 2.0: trans-species prediction of transcription factor binding sites and identification of their core motifs in plants. Front Plant Sci 2023; 14:1175837. [PMID: 37229121 PMCID: PMC10203575 DOI: 10.3389/fpls.2023.1175837] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/28/2023] [Accepted: 04/13/2023] [Indexed: 05/27/2023]
Abstract
Introduction: Promoter tiling deletion via genome editing is an emerging approach that is becoming popular in plants. Identifying the precise positions of core motifs within plant gene promoters is in great demand, but these positions are still largely unknown. We previously developed TSPTFBS, a set of 265 Arabidopsis transcription factor binding site (TFBS) prediction models, which cannot meet this demand for identifying core motifs. Methods: Here, we additionally introduced 104 maize and 20 rice TFBS datasets and utilized DenseNet for model construction on a large-scale dataset covering a total of 389 plant TFs. More importantly, we combined three biological interpretability methods, DeepLIFT, in-silico tiling deletion, and in-silico mutagenesis, to identify the potential core motifs of any given genomic region. Results: DenseNet not only achieved greater predictability than baseline methods such as LS-GKM and MEME for the above 389 TFs from Arabidopsis, maize and rice, but also performed better on trans-species prediction of a total of 15 TFs from six other plant species. A motif analysis based on TF-MoDISco and global importance analysis (GIA) further provides the biological implications of the core motifs identified by the three interpretability methods. Finally, we developed the TSPTFBS 2.0 pipeline, which integrates the 389 DenseNet-based models of TF binding and the above three interpretability methods. Discussion: TSPTFBS 2.0 was implemented as a user-friendly web server (http://www.hzau-hulab.com/TSPTFBS/), which can provide important references for editing targets in any given plant promoter and has great potential to provide reliable editing targets for genetic screen experiments in plants.
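The in-silico tiling deletion named in this abstract can be illustrated with a small sketch. It is not the TSPTFBS 2.0 implementation: `toy_binding_score` is a hypothetical stand-in for a trained TF-binding model, and the "CACGTG" pattern, window size, and step are arbitrary values chosen for the example.

```python
from typing import Callable, List, Tuple

def toy_binding_score(seq: str) -> float:
    """Hypothetical stand-in for a trained model: 1.0 if 'CACGTG' is present."""
    return 1.0 if "CACGTG" in seq else 0.0

def tiling_deletion(
    seq: str, score: Callable[[str], float], window: int, step: int
) -> List[Tuple[int, float]]:
    """Delete each tile of `window` bases in turn and record the score change."""
    ref = score(seq)
    results: List[Tuple[int, float]] = []
    for start in range(0, len(seq) - window + 1, step):
        deleted = seq[:start] + seq[start + window:]  # remove one tile
        results.append((start, score(deleted) - ref))
    return results

promoter = "TTTTCACGTGTTTT"
scan = tiling_deletion(promoter, toy_binding_score, window=4, step=2)
```

Tiles whose deletion lowers the score overlap the pattern the scorer depends on, which is how such a scan narrows down a candidate core motif within a promoter.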
112
Majdandzic A, Rajesh C, Koo PK. Correcting gradient-based interpretations of deep neural networks for genomics. Genome Biol 2023; 24:109. [PMID: 37161475 PMCID: PMC10169356 DOI: 10.1186/s13059-023-02956-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/04/2022] [Accepted: 04/28/2023] [Indexed: 05/11/2023]
Abstract
Post hoc attribution methods can provide insights into the learned patterns from deep neural networks (DNNs) trained on high-throughput functional genomics data. However, in practice, their resultant attribution maps can be challenging to interpret due to spurious importance scores for seemingly arbitrary nucleotides. Here, we identify a previously overlooked attribution noise source that arises from how DNNs handle one-hot encoded DNA. We demonstrate this noise is pervasive across various genomic DNNs and introduce a statistical correction that effectively reduces it, leading to more reliable attribution maps. Our approach represents a promising step towards gaining meaningful insights from DNNs in regulatory genomics.
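The abstract does not spell out the correction, so the following is only a sketch of one simple correction of this kind, stated as an assumption rather than as the authors' exact procedure. One-hot encoded DNA always sums to one across the four nucleotide channels at each position, so the part of a gradient that is constant across channels carries no information about which base is preferred. Subtracting the per-position channel mean removes that part. The function name and the example values are hypothetical.

```python
from typing import List

def center_gradients(grads: List[List[float]]) -> List[List[float]]:
    """Subtract the mean over the four channels (A, C, G, T) at each position."""
    corrected: List[List[float]] = []
    for position in grads:
        mean = sum(position) / len(position)
        corrected.append([g - mean for g in position])
    return corrected

# Two positions, four channels each (made-up gradient values).
raw = [[0.9, 0.5, 0.5, 0.5], [1.0, 2.0, 3.0, 4.0]]
fixed = center_gradients(raw)
```

After centering, the values at each position sum to zero, so only the differences between nucleotides remain in the attribution map.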
Affiliation(s)
- Antonio Majdandzic
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
| | - Chandana Rajesh
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
| | - Peter K Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA.
| |
|
113
|
Lee NK, Tang Z, Toneyan S, Koo PK. EvoAug: improving generalization and interpretability of genomic deep neural networks with evolution-inspired data augmentations. Genome Biol 2023; 24:105. [PMID: 37143118 PMCID: PMC10161416 DOI: 10.1186/s13059-023-02941-w] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 12/12/2022] [Accepted: 04/17/2023] [Indexed: 05/06/2023] Open
Abstract
Deep neural networks (DNNs) hold promise for functional genomics prediction, but their generalization capability may be limited by the amount of available data. To address this, we propose EvoAug, a suite of evolution-inspired augmentations that enhance the training of genomic DNNs by increasing genetic variation. Random transformation of DNA sequences can potentially alter their function in unknown ways, so we employ a fine-tuning procedure using the original non-transformed data to preserve functional integrity. Our results demonstrate that EvoAug substantially improves the generalization and interpretability of established DNNs across prominent regulatory genomics prediction tasks, offering a robust solution for genomic DNNs.
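The idea of sequence augmentation described here can be illustrated with a short sketch. The abstract does not list the individual transformations, so random point mutation and reverse complementation are assumed as two plausible examples; the function names and default rates are hypothetical and this is not the EvoAug package API.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def random_mutation(seq: str, rate: float, rng: random.Random) -> str:
    """Replace each base, with probability `rate`, by a different base."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def augment(seq: str, rng: random.Random,
            mutate_rate: float = 0.05, rc_prob: float = 0.5) -> str:
    """Apply a random mutation pass, then reverse-complement with
    probability `rc_prob`. The training label is kept unchanged."""
    seq = random_mutation(seq, mutate_rate, rng)
    if rng.random() < rc_prob:
        seq = reverse_complement(seq)
    return seq
```

Because such transformations may change function in unknown ways, the abstract's second stage (fine-tuning on the original, non-transformed data) would follow training on the augmented sequences.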
Affiliation(s)
- Nicholas Keone Lee
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
| | - Ziqi Tang
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
| | - Shushan Toneyan
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
| | - Peter K Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA.
| |
|
114
|
Glunk V, Laber S, Sinnott-Armstrong N, Sobreira DR, Strobel SM, Batista TM, Kubitz P, Moud BN, Ebert H, Huang Y, Brandl B, Garbo G, Honecker J, Stirling DR, Abdennur N, Calabuig-Navarro V, Skurk T, Ocvirk S, Stemmer K, Cimini BA, Carpenter AE, Dankel SN, Lindgren CM, Hauner H, Nobrega MA, Claussnitzer M. A non-coding variant linked to metabolic obesity with normal weight affects actin remodelling in subcutaneous adipocytes. Nat Metab 2023; 5:861-879. [PMID: 37253881 DOI: 10.1038/s42255-023-00807-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 08/20/2021] [Accepted: 04/12/2023] [Indexed: 06/01/2023]
Abstract
Recent large-scale genomic association studies found evidence for a genetic link between increased risk of type 2 diabetes and decreased risk for adiposity-related traits, reminiscent of metabolically obese normal weight (MONW) association signatures. However, the target genes and cellular mechanisms driving such MONW associations remain to be identified. Here, we systematically identify the cellular programmes of one of the top-scoring MONW risk loci, the 2q24.3 risk locus, in subcutaneous adipocytes. We identify a causal genetic variant, rs6712203, an intronic single-nucleotide polymorphism in the COBLL1 gene, which changes the conserved transcription factor motif of POU domain, class 2, transcription factor 2, and leads to differential COBLL1 gene expression by altering the enhancer activity at the locus in subcutaneous adipocytes. We then establish the cellular programme under the genetic control of the 2q24.3 MONW risk locus and the effector gene COBLL1, which is characterized by impaired actin cytoskeleton remodelling in differentiating subcutaneous adipocytes and subsequent failure of these cells to accumulate lipids and develop into metabolically active and insulin-sensitive adipocytes. Finally, we show that perturbations of the effector gene Cobll1 in a mouse model result in organismal phenotypes matching the MONW association signature, including decreased subcutaneous body fat mass and body weight along with impaired glucose tolerance. Taken together, our results provide a mechanistic link between the genetic risk for insulin resistance and low adiposity, providing a potential therapeutic hypothesis and a framework for future identification of causal relationships between genome associations and cellular programmes in other disorders.
Affiliation(s)
- Viktoria Glunk
- Institute of Nutritional Medicine, School of Medicine, Technical University of Munich, Munich, Germany
- ZIEL Institute for Food & Health, Else Kröner-Fresenius-Center for Nutritional Medicine, School of Life Sciences, Technical University of Munich, Freising, Germany
| | - Samantha Laber
- Broad Institute of MIT and Harvard, Medical and Population Genetics Program & Type 2 Diabetes Systems Genomics Initiative, Cambridge, MA, USA
| | - Nasa Sinnott-Armstrong
- Broad Institute of MIT and Harvard, Medical and Population Genetics Program & Type 2 Diabetes Systems Genomics Initiative, Cambridge, MA, USA
- Department of Genetics, Stanford University, Stanford, CA, USA
- Herbold Computational Biology Program, Public Health Sciences Division, Fred Hutchinson Cancer Center, Seattle, WA, USA
| | - Debora R Sobreira
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
| | - Sophie M Strobel
- Institute of Nutritional Medicine, School of Medicine, Technical University of Munich, Munich, Germany
- ZIEL Institute for Food & Health, Else Kröner-Fresenius-Center for Nutritional Medicine, School of Life Sciences, Technical University of Munich, Freising, Germany
- Broad Institute of MIT and Harvard, Medical and Population Genetics Program & Type 2 Diabetes Systems Genomics Initiative, Cambridge, MA, USA
| | - Thiago M Batista
- Broad Institute of MIT and Harvard, Medical and Population Genetics Program & Type 2 Diabetes Systems Genomics Initiative, Cambridge, MA, USA
- Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Phil Kubitz
- Institute of Nutritional Medicine, School of Medicine, Technical University of Munich, Munich, Germany
- ZIEL Institute for Food & Health, Else Kröner-Fresenius-Center for Nutritional Medicine, School of Life Sciences, Technical University of Munich, Freising, Germany
- Broad Institute of MIT and Harvard, Medical and Population Genetics Program & Type 2 Diabetes Systems Genomics Initiative, Cambridge, MA, USA
- Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Bahareh Nemati Moud
- Institute of Nutritional Medicine, School of Medicine, Technical University of Munich, Munich, Germany
- ZIEL Institute for Food & Health, Else Kröner-Fresenius-Center for Nutritional Medicine, School of Life Sciences, Technical University of Munich, Freising, Germany
| | - Hannah Ebert
- Institute of Nutritional Sciences, University of Hohenheim, Stuttgart, Germany
| | - Yi Huang
- Broad Institute of MIT and Harvard, Medical and Population Genetics Program & Type 2 Diabetes Systems Genomics Initiative, Cambridge, MA, USA
- Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Beate Brandl
- Institute of Nutritional Medicine, School of Medicine, Technical University of Munich, Munich, Germany
- ZIEL Institute for Food & Health, Else Kröner-Fresenius-Center for Nutritional Medicine, School of Life Sciences, Technical University of Munich, Freising, Germany
| | - Garrett Garbo
- Broad Institute of MIT and Harvard, Medical and Population Genetics Program & Type 2 Diabetes Systems Genomics Initiative, Cambridge, MA, USA
| | - Julius Honecker
- Institute of Nutritional Medicine, School of Medicine, Technical University of Munich, Munich, Germany
- ZIEL Institute for Food & Health, Else Kröner-Fresenius-Center for Nutritional Medicine, School of Life Sciences, Technical University of Munich, Freising, Germany
| | - David R Stirling
- Imaging Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Nezar Abdennur
- Institute for Medical Engineering and Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Virtu Calabuig-Navarro
- Broad Institute of MIT and Harvard, Medical and Population Genetics Program & Type 2 Diabetes Systems Genomics Initiative, Cambridge, MA, USA
- Institute of Nutritional Sciences, University of Hohenheim, Stuttgart, Germany
| | - Thomas Skurk
- Institute of Nutritional Medicine, School of Medicine, Technical University of Munich, Munich, Germany
- ZIEL Institute for Food & Health, Else Kröner-Fresenius-Center for Nutritional Medicine, School of Life Sciences, Technical University of Munich, Freising, Germany
| | - Soeren Ocvirk
- Division of Gastroenterology, Hepatology and Nutrition, Department of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Intestinal Microbiology Research Group, Department of Molecular Toxicology, German Institute of Human Nutrition Potsdam-Rehbruecke, Nuthetal, Germany
| | - Kerstin Stemmer
- Molecular Cell Biology, Institute for Theoretical Medicine, University of Augsburg, Augsburg, Germany
- Institute for Diabetes and Obesity, Helmholtz Zentrum München, Neuherberg, Germany
- German Center for Diabetes Research, Neuherberg, Germany
| | - Beth A Cimini
- Imaging Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Anne E Carpenter
- Imaging Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Simon N Dankel
- Department of Clinical Science, University of Bergen, Bergen, Norway
| | - Cecilia M Lindgren
- Broad Institute of MIT and Harvard, Medical and Population Genetics Program & Type 2 Diabetes Systems Genomics Initiative, Cambridge, MA, USA
- Big Data Institute at the Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
| | - Hans Hauner
- Institute of Nutritional Medicine, School of Medicine, Technical University of Munich, Munich, Germany
- ZIEL Institute for Food & Health, Else Kröner-Fresenius-Center for Nutritional Medicine, School of Life Sciences, Technical University of Munich, Freising, Germany
| | - Marcelo A Nobrega
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
| | - Melina Claussnitzer
- Broad Institute of MIT and Harvard, Medical and Population Genetics Program & Type 2 Diabetes Systems Genomics Initiative, Cambridge, MA, USA.
- Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Diabetes Unit and Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA.
- Department of Medicine, Harvard Medical School, Boston, MA, USA.
| |
|
115
|
Vaisband M, Schubert M, Gassner FJ, Geisberger R, Greil R, Zaborsky N, Hasenauer J. Validation of genetic variants from NGS data using deep convolutional neural networks. BMC Bioinformatics 2023; 24:158. [PMID: 37081386 PMCID: PMC10116675 DOI: 10.1186/s12859-023-05255-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2022] [Accepted: 03/27/2023] [Indexed: 04/22/2023] Open
Abstract
Accurate somatic variant calling from next-generation sequencing data is one of the most important tasks in personalised cancer therapy. The sophistication of the available technologies is ever-increasing, yet manual candidate refinement is still a necessary step in state-of-the-art processing pipelines. This limits reproducibility and introduces a bottleneck with respect to scalability. We demonstrate that the validation of genetic variants can be improved using a machine learning approach resting on a Convolutional Neural Network, trained using existing human annotation. In contrast to existing approaches, we introduce a way in which contextual data from sequencing tracks can be included in the automated assessment. A rigorous evaluation shows that the resulting model is robust and performs on par with trained researchers following published standard operating procedures.
Affiliation(s)
- Marc Vaisband
- Department of Internal Medicine III with Haematology, Medical Oncology, Haemostaseology, Infectiology and Rheumatology, Oncologic Center; Salzburg Cancer Research Institute - Laboratory for Immunological and Molecular Cancer Research (SCRI-LIMCR); Cancer Cluster Salzburg, Paracelsus Medical University, Salzburg, Austria.
- Life and Medical Sciences Institute, University of Bonn, Bonn, Germany.
| | - Maria Schubert
- Department of Internal Medicine III with Haematology, Medical Oncology, Haemostaseology, Infectiology and Rheumatology, Oncologic Center; Salzburg Cancer Research Institute - Laboratory for Immunological and Molecular Cancer Research (SCRI-LIMCR); Cancer Cluster Salzburg, Paracelsus Medical University, Salzburg, Austria
| | - Franz Josef Gassner
- Department of Internal Medicine III with Haematology, Medical Oncology, Haemostaseology, Infectiology and Rheumatology, Oncologic Center; Salzburg Cancer Research Institute - Laboratory for Immunological and Molecular Cancer Research (SCRI-LIMCR); Cancer Cluster Salzburg, Paracelsus Medical University, Salzburg, Austria
| | - Roland Geisberger
- Department of Internal Medicine III with Haematology, Medical Oncology, Haemostaseology, Infectiology and Rheumatology, Oncologic Center; Salzburg Cancer Research Institute - Laboratory for Immunological and Molecular Cancer Research (SCRI-LIMCR); Cancer Cluster Salzburg, Paracelsus Medical University, Salzburg, Austria
| | - Richard Greil
- Department of Internal Medicine III with Haematology, Medical Oncology, Haemostaseology, Infectiology and Rheumatology, Oncologic Center; Salzburg Cancer Research Institute - Laboratory for Immunological and Molecular Cancer Research (SCRI-LIMCR); Cancer Cluster Salzburg, Paracelsus Medical University, Salzburg, Austria
| | - Nadja Zaborsky
- Department of Internal Medicine III with Haematology, Medical Oncology, Haemostaseology, Infectiology and Rheumatology, Oncologic Center; Salzburg Cancer Research Institute - Laboratory for Immunological and Molecular Cancer Research (SCRI-LIMCR); Cancer Cluster Salzburg, Paracelsus Medical University, Salzburg, Austria
| | - Jan Hasenauer
- Life and Medical Sciences Institute, University of Bonn, Bonn, Germany
| |
|
116
|
Valencia JD, Hendrix DA. Improving deep models of protein-coding potential with a Fourier-transform architecture and machine translation task. bioRxiv 2023:2023.04.03.535488. [PMID: 37066250 PMCID: PMC10104019 DOI: 10.1101/2023.04.03.535488] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/05/2023]
Abstract
Ribosomes are information-processing macromolecular machines that integrate complex sequence patterns in messenger RNA (mRNA) transcripts to synthesize proteins. Studies of the sequence features that distinguish mRNAs from long noncoding RNAs (lncRNAs) may yield insight into the information that directs and regulates translation. Computational methods for calculating protein-coding potential are important for distinguishing mRNAs from lncRNAs during genome annotation, but most machine learning methods for this task rely on previously known rules to define features. Sequence-to-sequence (seq2seq) models, particularly ones using transformer networks, have proven capable of learning complex grammatical relationships between words to perform natural language translation. Seeking to leverage these advancements in the biological domain, we present a seq2seq formulation for predicting protein-coding potential with deep neural networks and demonstrate that simultaneously learning translation from RNA to protein improves classification performance relative to a classification-only training objective. Inspired by classical signal processing methods for gene discovery and Fourier-based image-processing neural networks, we introduce LocalFilterNet (LFNet). LFNet is a network architecture with an inductive bias for modeling the three-nucleotide periodicity apparent in coding sequences. We incorporate LFNet within an encoder-decoder framework to test whether the translation task improves the classification of transcripts and the interpretation of their sequence features. We use the resulting model to compute nucleotide-resolution importance scores, revealing sequence patterns that could assist the cellular machinery in distinguishing mRNAs and lncRNAs. Finally, we develop a novel approach for estimating mutation effects from Integrated Gradients, a backpropagation-based feature attribution, and characterize the difficulty of efficient approximations in this setting.
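The three-nucleotide periodicity mentioned in this abstract can be measured with a classical Fourier statistic, sketched below. This is the signal-processing idea the abstract cites as inspiration, not the LFNet architecture itself; the function name `period3_power` and the normalisation by sequence length are assumptions for illustration.

```python
import cmath


def period3_power(seq: str) -> float:
    """Spectral power at frequency 1/3, summed over the four nucleotide
    indicator sequences and divided by the sequence length.

    For each base, the indicator sequence is 1 where that base occurs and
    0 elsewhere; its Fourier coefficient at period 3 is accumulated here.
    """
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    total = 0.0
    for base in "ACGT":
        coeff = sum(
            cmath.exp(-2j * cmath.pi * i / 3)
            for i, b in enumerate(seq)
            if b == base
        )
        total += abs(coeff) ** 2
    return total / n
```

A sequence with strong codon-position bias gives a large value, while a sequence whose base usage is independent of position modulo 3 gives a value near zero. For instance, `"ATG" * 10` scores far higher than `"A" * 30`.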
Affiliation(s)
- Joseph D. Valencia
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
| | - David A. Hendrix
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
- Department of Biochemistry and Biophysics, Oregon State University, Corvallis, OR, USA
| |
|
117
|
Dong C, Shen S, Keleş S. AdaLiftOver: high-resolution identification of orthologous regulatory elements with Adaptive liftOver. Bioinformatics 2023; 39:btad149. [PMID: 37004197 PMCID: PMC10085516 DOI: 10.1093/bioinformatics/btad149] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2022] [Revised: 03/02/2023] [Accepted: 03/20/2023] [Indexed: 04/03/2023] Open
Abstract
MOTIVATION Elucidating functionally similar orthologous regulatory regions for human and model organism genomes is critical for exploiting model organism research and advancing our understanding of results from genome-wide association studies (GWAS). Sequence conservation is the de facto approach for finding orthologous non-coding regions between human and model organism genomes. However, existing methods for mapping non-coding genomic regions across species are challenged by multi-mapping, low precision, and low mapping rates. RESULTS We develop Adaptive liftOver (AdaLiftOver), a large-scale computational tool for identifying functionally similar orthologous non-coding regions across species. AdaLiftOver builds on the UCSC liftOver framework to extend the query regions and prioritizes the resulting candidate target regions based on the conservation of epigenomic and sequence grammar features. Evaluations of AdaLiftOver with multiple case studies, spanning both genomic intervals from epigenome datasets across a wide range of model organisms and GWAS SNPs, establish AdaLiftOver as a versatile method for deriving hard-to-obtain human epigenome datasets as well as reliably identifying orthologous loci for GWAS SNPs. AVAILABILITY AND IMPLEMENTATION The R package and the data for AdaLiftOver are available from https://github.com/keleslab/AdaLiftOver.
Affiliation(s)
- Chenyang Dong
- Department of Statistics, University of Wisconsin-Madison, 1300 University Avenue, Madison, WI 53706, USA
| | - Siqi Shen
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, WARF Room 201, 610 Walnut Street, Madison, WI 53706, USA
| | - Sündüz Keleş
- Department of Statistics, University of Wisconsin-Madison, 1300 University Avenue, Madison, WI 53706, USA
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, WARF Room 201, 610 Walnut Street, Madison, WI 53706, USA
| |
|
118
|
Balcı AT, Ebeid MM, Benos PV, Kostka D, Chikina M. An intrinsically interpretable neural network architecture for sequence to function learning. bioRxiv 2023:2023.01.25.525572. [PMID: 36747873 PMCID: PMC9900791 DOI: 10.1101/2023.01.25.525572] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2023]
Abstract
MOTIVATION Sequence-based deep learning approaches have been shown to predict a multitude of functional genomic readouts, including regions of open chromatin and RNA expression of genes. However, a major limitation of current methods is that model interpretation relies on computationally demanding post hoc analyses, and even then, one can often not explain the internal mechanics of highly parameterized models. Here, we introduce a deep learning architecture called tiSFM (totally interpretable sequence to function model). tiSFM improves upon the performance of standard multi-layer convolutional models while using fewer parameters. Additionally, while tiSFM is itself technically a multi-layer neural network, internal model parameters are intrinsically interpretable in terms of relevant sequence motifs. RESULTS We analyze published open chromatin measurements across hematopoietic lineage cell-types and demonstrate that tiSFM outperforms a state-of-the-art convolutional neural network model custom-tailored to this dataset. We also show that it correctly identifies context specific activities of transcription factors with known roles in hematopoietic differentiation, including Pax5 and Ebf1 for B-cells, and Rorc for innate lymphoid cells. tiSFM's model parameters have biologically meaningful interpretations, and we show the utility of our approach on a complex task of predicting the change in epigenetic state as a function of developmental transition. AVAILABILITY AND IMPLEMENTATION The source code, including scripts for the analysis of key findings, can be found at https://github.com/boooooogey/ATAConv, implemented in Python.
Affiliation(s)
- Ali Tuğrul Balcı
- Joint Carnegie Mellon University-University of Pittsburgh Program in Computational Biology, Institution, Pittsburgh, 15213, United States
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, 15213, United States
| | - Mark Maher Ebeid
- Joint Carnegie Mellon University-University of Pittsburgh Program in Computational Biology, Institution, Pittsburgh, 15213, United States
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, 15213, United States
| | - Panayiotis V Benos
- Department of Epidemiology, University of Florida, Gainesville, 32610, United States
| | - Dennis Kostka
- Joint Carnegie Mellon University-University of Pittsburgh Program in Computational Biology, Institution, Pittsburgh, 15213, United States
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, 15213, United States
| | - Maria Chikina
- Joint Carnegie Mellon University-University of Pittsburgh Program in Computational Biology, Institution, Pittsburgh, 15213, United States
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, 15213, United States
| |
|
119
|
Ullah F, Jabeen S, Salton M, Reddy ASN, Ben-Hur A. Evidence for the role of transcription factors in the co-transcriptional regulation of intron retention. Genome Biol 2023; 24:53. [PMID: 36949544 PMCID: PMC10031921 DOI: 10.1186/s13059-023-02885-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 01/03/2022] [Accepted: 02/16/2023] [Indexed: 03/24/2023] Open
Abstract
BACKGROUND Alternative splicing is a widespread regulatory phenomenon that enables a single gene to produce multiple transcripts. Among the different types of alternative splicing, intron retention is one of the least explored despite its high prevalence in both plants and animals. The recent discovery that the majority of splicing is co-transcriptional has led to the finding that chromatin state affects alternative splicing. Therefore, it is plausible that transcription factors can regulate splicing outcomes. RESULTS We provide evidence for the hypothesis that transcription factors are involved in the regulation of intron retention by studying regions of open chromatin in retained and excised introns. Using deep learning models designed to distinguish between regions of open chromatin in retained introns and non-retained introns, we identified motifs enriched in IR events with significant hits to known human transcription factors. Our model predicts that the majority of transcription factors that affect intron retention come from the zinc finger family. We demonstrate the validity of these predictions using ChIP-seq data for multiple zinc finger transcription factors and find strong over-representation for their peaks in intron retention events. CONCLUSIONS This work opens up opportunities for further studies that elucidate the mechanisms by which transcription factors affect intron retention and other forms of splicing. AVAILABILITY Source code available at https://github.com/fahadahaf/chromir.
Affiliation(s)
- Fahad Ullah
- Department of Computer Science, Colorado State University, Fort Collins, CO, USA
| | - Saira Jabeen
- Department of Computer Science, Colorado State University, Fort Collins, CO, USA
| | - Maayan Salton
- Department of Biology, Colorado State University, Fort Collins, CO, USA
| | - Anireddy S N Reddy
- Biochemistry and Molecular Biology Department, The Hebrew University Faculty of Medicine, Jerusalem, Israel
| | - Asa Ben-Hur
- Department of Computer Science, Colorado State University, Fort Collins, CO, USA.
| |
|
120
|
Wang Z, Zhang Y, Yu Y, Zhang J, Liu Y, Zou Q. A Unified Deep Learning Framework for Single-Cell ATAC-Seq Analysis Based on ProdDep Transformer Encoder. Int J Mol Sci 2023; 24:ijms24054784. [PMID: 36902216 PMCID: PMC10003007 DOI: 10.3390/ijms24054784] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2022] [Revised: 01/02/2023] [Accepted: 02/22/2023] [Indexed: 03/06/2023] Open
Abstract
Recent advances in the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) have provided cell-specific chromatin accessibility landscapes of cis-regulatory elements, providing deeper insights into cellular states and dynamics. However, few research efforts have been dedicated to modeling the relationship between regulatory grammars and single-cell chromatin accessibility and to incorporating different analysis scenarios of scATAC-seq data into a general framework. To this end, we propose a unified deep learning framework based on the ProdDep Transformer Encoder, dubbed PROTRAIT, for scATAC-seq data analysis. Specifically motivated by the deep language model, PROTRAIT leverages the ProdDep Transformer Encoder to capture the syntax of transcription factor (TF)-DNA binding motifs from scATAC-seq peaks for predicting single-cell chromatin accessibility and learning single-cell embedding. Based on cell embedding, PROTRAIT annotates cell types using the Louvain algorithm. Furthermore, for values identified as likely noise in raw scATAC-seq data, PROTRAIT denoises them based on predicted chromatin accessibility. In addition, PROTRAIT employs differential accessibility analysis to infer TF activity at single-cell and single-nucleotide resolution. Extensive experiments based on the Buenrostro2018 dataset validate the effectiveness of PROTRAIT for chromatin accessibility prediction, cell type annotation, and scATAC-seq data denoising, outperforming current approaches in terms of different evaluation metrics. Besides, we confirm the consistency between the inferred TF activity and the literature. We also demonstrate the scalability of PROTRAIT to analyze datasets containing over one million cells.
Affiliation(s)
- Zixuan Wang
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
| | - Yongqing Zhang
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
| | - Yun Yu
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
| | - Junming Zhang
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
| | - Yuhang Liu
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
| |
|
121
|
Using Attribution Sequence Alignment to Interpret Deep Learning Models for miRNA Binding Site Prediction. Biology 2023; 12:biology12030369. [PMID: 36979061 PMCID: PMC10045089 DOI: 10.3390/biology12030369] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2022] [Revised: 02/21/2023] [Accepted: 02/24/2023] [Indexed: 03/03/2023]
Abstract
MicroRNAs (miRNAs) are small non-coding RNAs that play a central role in the post-transcriptional regulation of biological processes. miRNAs regulate transcripts through direct binding involving the Argonaute protein family. The exact rules of binding are not known, and several in silico miRNA target prediction methods have been developed to date. Deep learning has recently revolutionized miRNA target prediction. However, the higher predictive power comes with a decreased ability to interpret increasingly complex models. Here, we present a novel interpretation technique, called attribution sequence alignment, for miRNA target site prediction models that can interpret such deep learning models on a two-dimensional representation of miRNA and putative target sequence. Our method produces a human readable visual representation of miRNA:target interactions and can be used as a proxy for the further interpretation of biological concepts learned by the neural network. We demonstrate applications of this method in the clustering of experimental data into binding classes, as well as using the method to narrow down predicted miRNA binding sites on long transcript sequences. Importantly, the presented method works with any neural network model trained on a two-dimensional representation of interactions and can be easily extended to further domains such as protein–protein interactions.
|
122
|
Deng C, Whalen S, Steyert M, Ziffra R, Przytycki PF, Inoue F, Pereira DA, Capauto D, Norton S, Vaccarino FM, Pollen A, Nowakowski TJ, Ahituv N, Pollard KS. Massively parallel characterization of psychiatric disorder-associated and cell-type-specific regulatory elements in the developing human cortex. bioRxiv 2023:2023.02.15.528663. [PMID: 36824845 PMCID: PMC9949039 DOI: 10.1101/2023.02.15.528663] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Indexed: 02/17/2023]
Abstract
Nucleotide changes in gene regulatory elements are important determinants of neuronal development and disease. Using massively parallel reporter assays in primary human cells from mid-gestation cortex and cerebral organoids, we interrogated the cis-regulatory activity of 102,767 sequences, including differentially accessible cell-type specific regions in the developing cortex and single-nucleotide variants associated with psychiatric disorders. In primary cells, we identified 46,802 active enhancer sequences and 164 disorder-associated variants that significantly alter enhancer activity. Activity was comparable in organoids and primary cells, suggesting that organoids provide an adequate model for the developing cortex. Using deep learning, we decoded the sequence basis and upstream regulators of enhancer activity. This work establishes a comprehensive catalog of functional gene regulatory elements and variants in human neuronal development.
Affiliation(s)
- Chengyu Deng
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco; San Francisco, CA, USA
- Institute for Human Genetics, University of California, San Francisco; San Francisco, CA, USA
| | - Sean Whalen
- Gladstone Institutes; San Francisco, CA, USA
| | - Marilyn Steyert
- Department of Anatomy, University of California, San Francisco; San Francisco, CA, USA
- Department of Psychiatry, University of California, San Francisco; San Francisco, CA, USA
- Department of Neurological Surgery, University of California, San Francisco; San Francisco, CA, USA
| | - Ryan Ziffra
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco; San Francisco, CA, USA
| | | | - Fumitaka Inoue
- Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University; Kyoto, Japan
| | - Daniela A. Pereira
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco; San Francisco, CA, USA
- Institute for Human Genetics, University of California, San Francisco; San Francisco, CA, USA
- Graduate Program of Genetics, Institute of Biological Sciences, Federal University of Minas Gerais; Belo Horizonte, Minas Gerais, Brazil
| | | | - Scott Norton
- Child Study Center, Yale University; New Haven, CT, USA
| | - Flora M. Vaccarino
- Child Study Center, Yale University; New Haven, CT, USA
- Department of Neuroscience, Yale University; New Haven, CT, USA
| | - Alex Pollen
- Department of Neurology, University of California, San Francisco; San Francisco, CA, USA
- Eli and Edythe Broad Center for Regeneration Medicine and Stem Cell Research, University of California, San Francisco; San Francisco, CA, USA
| | - Tomasz J. Nowakowski
- Department of Anatomy, University of California, San Francisco; San Francisco, CA, USA
- Department of Psychiatry, University of California, San Francisco; San Francisco, CA, USA
- Department of Neurological Surgery, University of California, San Francisco; San Francisco, CA, USA
- Eli and Edythe Broad Center for Regeneration Medicine and Stem Cell Research, University of California, San Francisco; San Francisco, CA, USA
- Chan Zuckerberg Biohub, San Francisco; San Francisco, CA, USA
| | - Nadav Ahituv
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco; San Francisco, CA, USA
- Institute for Human Genetics, University of California, San Francisco; San Francisco, CA, USA
| | - Katherine S. Pollard
- Institute for Human Genetics, University of California, San Francisco; San Francisco, CA, USA
- Gladstone Institutes; San Francisco, CA, USA
- Chan Zuckerberg Biohub, San Francisco; San Francisco, CA, USA
- Department of Epidemiology and Biostatistics, University of California, San Francisco; San Francisco, CA, USA
|
123
|
Xu H, Jia J, Jeong HH, Zhao Z. Deep learning for detecting and elucidating human T-cell leukemia virus type 1 integration in the human genome. PATTERNS (NEW YORK, N.Y.) 2023; 4:100674. [PMID: 36873907 PMCID: PMC9982299 DOI: 10.1016/j.patter.2022.100674] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2022] [Revised: 11/02/2022] [Accepted: 12/13/2022] [Indexed: 02/12/2023]
Abstract
Human T-cell leukemia virus type 1 (HTLV-1), a retrovirus, is the causative agent for adult T cell leukemia/lymphoma and many other human diseases. Accurate and high throughput detection of HTLV-1 virus integration sites (VISs) across the host genomes plays a crucial role in the prevention and treatment of HTLV-1-associated diseases. Here, we developed DeepHTLV, the first deep learning framework for VIS prediction de novo from genome sequence, motif discovery, and cis-regulatory factor identification. We demonstrated the high accuracy of DeepHTLV with more efficient and interpretive feature representations. Decoding the informative features captured by DeepHTLV resulted in eight representative clusters with consensus motifs for potential HTLV-1 integration. Furthermore, DeepHTLV revealed interesting cis-regulatory elements in regulation of VISs that have significant association with the detected motifs. Literature evidence demonstrated nearly half (34) of the predicted transcription factors enriched with VISs were involved in HTLV-1-associated diseases. DeepHTLV is freely available at https://github.com/bsml320/DeepHTLV.
Affiliation(s)
- Haodong Xu
- Center for Precision Health, School of Biomedical Informatics, UTHealth Science Center at Houston, Houston, TX 77030, USA
| | - Johnathan Jia
- Center for Precision Health, School of Biomedical Informatics, UTHealth Science Center at Houston, Houston, TX 77030, USA
- MD Anderson UTHealth Graduate School of Biomedical Sciences, Houston, TX 77030, USA
| | - Hyun-Hwan Jeong
- Center for Precision Health, School of Biomedical Informatics, UTHealth Science Center at Houston, Houston, TX 77030, USA
| | - Zhongming Zhao
- Center for Precision Health, School of Biomedical Informatics, UTHealth Science Center at Houston, Houston, TX 77030, USA
- MD Anderson UTHealth Graduate School of Biomedical Sciences, Houston, TX 77030, USA
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
|
124
|
Novakovsky G, Dexter N, Libbrecht MW, Wasserman WW, Mostafavi S. Obtaining genetics insights from deep learning via explainable artificial intelligence. Nat Rev Genet 2023; 24:125-137. [PMID: 36192604 DOI: 10.1038/s41576-022-00532-2] [Citation(s) in RCA: 63] [Impact Index Per Article: 63.0] [Accepted: 08/31/2022] [Indexed: 01/24/2023]
Abstract
Artificial intelligence (AI) models based on deep learning now represent the state of the art for making functional predictions in genomics research. However, the underlying basis on which predictive models make such predictions is often unknown. For genomics researchers, this missing explanatory information would frequently be of greater value than the predictions themselves, as it can enable new insights into genetic processes. We review progress in the emerging area of explainable AI (xAI), a field with the potential to empower life science researchers to gain mechanistic insights into complex deep learning models. We discuss and categorize approaches for model interpretation, including an intuitive understanding of how each approach works and their underlying assumptions and limitations in the context of typical high-throughput biological datasets.
Affiliation(s)
- Gherman Novakovsky
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, British Columbia, Canada
- Bioinformatics Graduate Program, University of British Columbia, Vancouver, British Columbia, Canada
| | - Nick Dexter
- Department of Mathematics, Simon Fraser University, Burnaby, British Columbia, Canada
- School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada
| | - Maxwell W Libbrecht
- School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada.
| | - Wyeth W Wasserman
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, British Columbia, Canada.
| | - Sara Mostafavi
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
- Canadian Institute for Advanced Research, Toronto, Ontario, Canada
|
125
|
Deep learning in regulatory genomics: from identification to design. Curr Opin Biotechnol 2023; 79:102887. [PMID: 36640453 DOI: 10.1016/j.copbio.2022.102887] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2022] [Revised: 11/12/2022] [Accepted: 12/14/2022] [Indexed: 01/14/2023]
Abstract
Genomics and deep learning are a natural match since both are data-driven fields. Regulatory genomics is the study of functional noncoding DNA that regulates gene expression. In recent years, deep learning applications in regulatory genomics have achieved remarkable advances, so much so that they have changed how computational methods in this field are developed. Here, we review two emerging trends: (i) the modeling of very long input sequences (up to 200 kb), which requires self-matched modularization of model architecture; and (ii) the balance between model predictability and model interpretability, because the latter better meets biological demands. Finally, we discuss how to employ these two routes to design synthetic regulatory DNA, as a promising strategy for optimizing crop agronomic properties.
|
126
|
Koo PK, Ploenzke M, Anand P, Paul S, Majdandzic A. ResidualBind: Uncovering Sequence-Structure Preferences of RNA-Binding Proteins with Deep Neural Networks. Methods Mol Biol 2023; 2586:197-215. [PMID: 36705906 DOI: 10.1007/978-1-0716-2768-6_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2023]
Abstract
Deep neural networks have demonstrated improved performance at predicting sequence specificities of DNA- and RNA-binding proteins. However, it remains unclear why they perform better than previous methods that rely on k-mers and position weight matrices. Here, we highlight a recent deep learning-based software package, called ResidualBind, that analyzes RNA-protein interactions using only RNA sequence as an input feature and performs global importance analysis for model interpretability. We discuss practical considerations for model interpretability to uncover learned sequence motifs and their secondary structure preferences.
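The idea behind global importance analysis can be sketched as follows: embed a candidate pattern into many background sequences and measure the average change in the model's output. This is a minimal sketch under stated assumptions, not ResidualBind's implementation. `toy_binding_score` is a hypothetical stand-in for a trained network, the motif `UGCAUG` is used only as a toy pattern, and uniformly random backgrounds are a simplification.

```python
import random

ALPHABET = "ACGU"


def toy_binding_score(seq):
    """Hypothetical stand-in for a trained model: counts a toy motif."""
    motif = "UGCAUG"
    return float(sum(seq[i:i + len(motif)] == motif
                     for i in range(len(seq) - len(motif) + 1)))


def global_importance(pattern, position, score_fn,
                      n_samples=200, length=30, seed=0):
    """Mean change in score when `pattern` is embedded into random backgrounds."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    total = 0.0
    for _ in range(n_samples):
        background = "".join(rng.choice(ALPHABET) for _ in range(length))
        # Overwrite the background with the pattern at the chosen position.
        embedded = (background[:position] + pattern
                    + background[position + len(pattern):])
        total += score_fn(embedded) - score_fn(background)
    return total / n_samples


effect = global_importance("UGCAUG", position=12, score_fn=toy_binding_score)
control = global_importance("AAAAAA", position=12, score_fn=toy_binding_score)
print(effect, control)
```

Averaging over many backgrounds is the point of the design: it estimates the effect of the pattern itself, rather than its effect in one particular sequence context.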
Affiliation(s)
- Peter K Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA.
| | - Matt Ploenzke
- Department of Biostatistics, Harvard University, Cambridge, MA, USA
| | | | - Steffan Paul
- Bioinformatics Program, Harvard Medical School, Boston, MA, USA
| | - Antonio Majdandzic
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
|
127
|
Li Z, Gao E, Zhou J, Han W, Xu X, Gao X. Applications of deep learning in understanding gene regulation. CELL REPORTS METHODS 2023; 3:100384. [PMID: 36814848 PMCID: PMC9939384 DOI: 10.1016/j.crmeth.2022.100384] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Indexed: 05/22/2023]
Abstract
Gene regulation is a central topic in cell biology. Advances in omics technologies and the accumulation of omics data have provided better opportunities for gene regulation studies than ever before. For this reason deep learning, as a data-driven predictive modeling approach, has been successfully applied to this field during the past decade. In this article, we aim to give a brief yet comprehensive overview of representative deep-learning methods for gene regulation. Specifically, we discuss and compare the design principles and datasets used by each method, creating a reference for researchers who wish to replicate or improve existing methods. We also discuss the common problems of existing approaches and prospectively introduce the emerging deep-learning paradigms that will potentially alleviate them. We hope that this article will provide a rich and up-to-date resource and shed light on future research directions in this area.
Affiliation(s)
- Zhongxiao Li
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Elva Gao
- The KAUST School, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Juexiao Zhou
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Wenkai Han
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Xiaopeng Xu
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Xin Gao
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
|
128
|
Kawaguchi RK, Tang Z, Fischer S, Rajesh C, Tripathy R, Koo PK, Gillis J. Learning single-cell chromatin accessibility profiles using meta-analytic marker genes. Brief Bioinform 2023; 24:bbac541. [PMID: 36549922 PMCID: PMC9851328 DOI: 10.1093/bib/bbac541] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2022] [Revised: 09/29/2022] [Accepted: 11/08/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION: Single-cell assay for transposase accessible chromatin using sequencing (scATAC-seq) is a valuable resource to learn cis-regulatory elements such as cell-type specific enhancers and transcription factor binding sites. However, cell-type identification of scATAC-seq data is known to be challenging due to the heterogeneity derived from different protocols and the high dropout rate. RESULTS: In this study, we perform a systematic comparison of seven scATAC-seq datasets of mouse brain to benchmark the efficacy of neuronal cell-type annotation from gene sets. We find that redundant marker genes give a dramatic improvement for a sparse scATAC-seq annotation across the data collected from different studies. Interestingly, simple aggregation of such marker genes achieves performance comparable to or higher than that of machine-learning classifiers, suggesting its potential for downstream applications. Based on our results, we reannotated all scATAC-seq data for detailed cell types using robust marker genes. Their meta scATAC-seq profiles are publicly available at https://gillisweb.cshl.edu/Meta_scATAC. Furthermore, we trained a deep neural network to predict chromatin accessibility from only DNA sequence and identified key motifs enriched for each neuronal subtype. Those predicted profiles are visualized together in our database as a valuable resource to explore cell-type specific epigenetic regulation in a sequence-dependent and -independent manner.
Affiliation(s)
| | - Ziqi Tang
- Cold Spring Harbor Laboratory, Cold Spring Harbor 11724, USA
| | - Stephan Fischer
- Cold Spring Harbor Laboratory, Cold Spring Harbor 11724, USA
| | - Chandana Rajesh
- Cold Spring Harbor Laboratory, Cold Spring Harbor 11724, USA
| | - Rohit Tripathy
- Cold Spring Harbor Laboratory, Cold Spring Harbor 11724, USA
| | - Peter K Koo
- Cold Spring Harbor Laboratory, Cold Spring Harbor 11724, USA
| | - Jesse Gillis
- Cold Spring Harbor Laboratory, Cold Spring Harbor 11724, USA
- Department of Physiology and Donnelly Centre for Cellular & Biomolecular Research Department, University of Toronto, Ontario M5S 3E1, Canada
|
129
|
Zheng A, Shen Z, Glass CK, Gymrek M. Deep learning predicts the impact of regulatory variants on cell-type-specific enhancers in the brain. BIOINFORMATICS ADVANCES 2023; 3:vbad002. [PMID: 36726730 PMCID: PMC9887460 DOI: 10.1093/bioadv/vbad002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2022] [Revised: 11/11/2022] [Accepted: 01/11/2023] [Indexed: 01/13/2023]
Abstract
Motivation: Previous studies have shown that the heritability of multiple brain-related traits and disorders is highly enriched in transcriptional enhancer regions. However, these regions often contain many individual variants, while only a subset of them are likely to causally contribute to a trait. Statistical fine-mapping techniques can identify putative causal variants, but their resolution is often limited, especially in regions with multiple variants in high linkage disequilibrium. In these cases, alternative computational methods to estimate the impact of individual variants can aid in variant prioritization. Results: Here, we develop a deep learning pipeline to predict cell-type-specific enhancer activity directly from genomic sequences and quantify the impact of individual genetic variants in these regions. We show that the variants highlighted by our deep learning models are targeted by purifying selection in the human population, likely indicating a functional role. We integrate our deep learning predictions with statistical fine-mapping results for 8 brain-related traits, identifying 63 distinct candidate causal variants predicted to contribute to these traits by modulating enhancer activity, representing 6% of all genome-wide association study signals analyzed. Overall, our study provides a valuable computational method that can prioritize individual variants based on their estimated regulatory impact, but also highlights the limitations of existing methods for variant prioritization and fine-mapping. Availability and implementation: The data underlying this article, nucleotide-level importance scores, and code for running the deep learning pipeline are available at https://github.com/Pandaman-Ryan/AgentBind-brain. Contact: mgymrek@ucsd.edu. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
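One common way to quantify the impact of a single variant with a sequence model is to score the reference and alternate alleles and take the difference. The sketch below illustrates that idea only; the authors' pipeline may differ. `toy_enhancer_score`, the motif `GATA`, and the example sequence are hypothetical placeholders for a trained model and real genomic input.

```python
def toy_enhancer_score(seq):
    """Hypothetical stand-in for a trained model: counts a made-up motif."""
    motif = "GATA"
    return float(sum(seq[i:i + len(motif)] == motif
                     for i in range(len(seq) - len(motif) + 1)))


def variant_effect(ref_seq, pos, alt, score_fn):
    """Score(alt allele) minus score(ref allele) for a single-base substitution.

    `pos` is a 0-based index into `ref_seq`; `alt` is the substituted base.
    """
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    return score_fn(alt_seq) - score_fn(ref_seq)


ref = "CCGATACC"  # made-up sequence containing one copy of the toy motif
disrupting = variant_effect(ref, 3, "C", toy_enhancer_score)  # hits the motif
neutral = variant_effect(ref, 0, "T", toy_enhancer_score)     # outside the motif
print(disrupting, neutral)
```

Ranking variants by the magnitude of this difference gives a prioritization that does not depend on linkage disequilibrium, which is why such scores can complement statistical fine-mapping.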
Affiliation(s)
- An Zheng
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, USA
| | - Zeyang Shen
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA 92093, USA
- Department of Bioengineering, University of California San Diego, La Jolla, CA 92093, USA
| | - Christopher K Glass
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA 92093, USA
- Department of Medicine, University of California San Diego, La Jolla, CA 92093, USA
| | - Melissa Gymrek
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, USA
- Department of Medicine, University of California San Diego, La Jolla, CA 92093, USA
|
130
|
Predicting Chromatin Interactions from DNA Sequence Using DeepC. Methods Mol Biol 2023; 2624:19-42. [PMID: 36723807 DOI: 10.1007/978-1-0716-2962-8_3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/02/2023]
Abstract
The genome 3D structure is central to understanding how disease-associated genetic variants in the noncoding genome regulate their target genes. Genome architecture spans large-scale structures determined by fine-grained regulatory elements, making it challenging to predict the effects of sequence and structural variants. Experimental approaches for chromatin interaction mapping remain costly and time-consuming, limiting their use for interrogating changes of chromatin architecture associated with genomic variation at scale. Computational models to predict chromatin interactions have either interpreted chromatin at coarse resolution or failed to capture the long-range dependencies of larger sequence contexts. To bridge this gap, we previously developed deepC, a deep neural network approach to predict chromatin interactions from DNA sequence at megabase scale. deepC employs dilated convolutional layers to achieve simultaneously a large sequence context while interpreting the DNA sequence at single base pair resolution. Using transfer learning of convolutional weights trained to predict a compendium of chromatin features across cell types allows deepC to predict cell type-specific chromatin interactions from DNA sequence alone. Here, we present a detailed workflow to predict chromatin interactions with deepC. We detail the necessary data pre-processing steps, guide through deepC model training, and demonstrate how to employ trained models to predict chromatin interactions and the effect of sequence variations on genome architecture.
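To see why dilation helps a network cover a large sequence context while still reading single bases, the receptive field of a stack of stride-1 convolutions can be computed directly: each layer with kernel width k and dilation d adds (k - 1) * d positions. The layer settings below (ten layers, kernel width 3, dilation doubling per layer) are illustrative assumptions, not deepC's actual configuration.

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field, in bases, of stacked stride-1 (dilated) convolutions."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d  # each layer widens the context by (k - 1) * d
    return rf


n_layers = 10                  # hypothetical depth
kernels = [3] * n_layers       # hypothetical kernel width
doubling = [2 ** i for i in range(n_layers)]  # dilations 1, 2, 4, ..., 512
plain = [1] * n_layers         # same stack without dilation

print(receptive_field(kernels, doubling))  # context with doubling dilation
print(receptive_field(kernels, plain))     # context without dilation
```

With the same number of layers and parameters, the dilated stack sees roughly a hundred times more sequence than the undilated one, and no pooling is needed, so base-pair resolution is preserved.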
|
131
|
Shea A, Bartz J, Zhang L, Dong X. Predicting mutational function using machine learning. MUTATION RESEARCH. REVIEWS IN MUTATION RESEARCH 2023; 791:108457. [PMID: 36965820 PMCID: PMC10239318 DOI: 10.1016/j.mrrev.2023.108457] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2022] [Revised: 03/11/2023] [Accepted: 03/20/2023] [Indexed: 03/27/2023]
Abstract
Genetic variations are one of the major causes of phenotypic variations between human individuals. Although beneficial as being the substrate of evolution, germline mutations may cause diseases, including Mendelian diseases and complex diseases such as diabetes and heart diseases. Mutations occurring in somatic cells are a main cause of cancer and likely cause age-related phenotypes and other age-related diseases. Because of the high abundance of genetic variations in the human genome, i.e., millions of germline variations per human subject and thousands of additional somatic mutations per cell, it is technically challenging to experimentally verify the function of every possible mutation and their interactions. Significant progress has been made to solve this problem using computational approaches, especially machine learning (ML). Here, we review the progress and achievements made in recent years in this field of research. We classify the computational models in two ways: one according to their prediction goals including protein structural alterations, gene expression changes, and disease risks, and the other according to their methodologies, including non-machine learning methods, classical machine learning methods, and deep neural network methods. For models in each category, we discuss their architecture, prediction accuracy, and potential limitations. This review provides new insights into the applications and future directions of computational approaches in understanding the role of mutations in aging and disease.
Affiliation(s)
- Anthony Shea
- Institute on the Biology of Aging and Metabolism, University of Minnesota, Minneapolis, MN 55455, USA; Department of Genetics, Cell Biology and Development, University of Minnesota, Minneapolis, MN 55455, USA
| | - Josh Bartz
- Institute on the Biology of Aging and Metabolism, University of Minnesota, Minneapolis, MN 55455, USA; Department of Genetics, Cell Biology and Development, University of Minnesota, Minneapolis, MN 55455, USA; Bioinformatics and Computational Biology Program, University of Minnesota, Minneapolis, MN 55455, USA
| | - Lei Zhang
- Institute on the Biology of Aging and Metabolism, University of Minnesota, Minneapolis, MN 55455, USA; Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, Minneapolis, MN 55455, USA
| | - Xiao Dong
- Institute on the Biology of Aging and Metabolism, University of Minnesota, Minneapolis, MN 55455, USA; Department of Genetics, Cell Biology and Development, University of Minnesota, Minneapolis, MN 55455, USA.
|
132
|
Li Y, Kong F, Cui H, Wang F, Li C, Ma J. SENIES: DNA Shape Enhanced Two-Layer Deep Learning Predictor for the Identification of Enhancers and Their Strength. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:637-645. [PMID: 35015646 DOI: 10.1109/tcbb.2022.3142019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Identifying enhancers is a critical task in bioinformatics due to their primary role in regulating gene expression. For this reason, various computational algorithms devoted to enhancer identification have been put forward over the years. More features are extracted from the single DNA sequences to boost the performance. Nevertheless, DNA structural information is neglected, which is an essential factor affecting the binding preferences of transcription factors to regulatory elements like enhancers. Here, we propose SENIES, a DNA shape enhanced deep learning predictor, to identify enhancers and their strength. The predictor consists of two layers where the first layer is for enhancer and non-enhancer identification, and the second layer is for predicting the strength of enhancers. Apart from two common sequence-derived features (i.e., one-hot and k-mer), DNA shape is introduced to describe the 3D structures of DNA sequences. Performance comparison with state-of-the-art methods conducted on public datasets demonstrates the effectiveness and robustness of our predictor. The code implementation of SENIES is publicly available at https://github.com/hlju-liye/SENIES.
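The two sequence-derived features named in the abstract, one-hot and k-mer, can be sketched as below. This is a generic illustration rather than the SENIES code: the function names, the choice of k = 2, and the handling of non-ACGT bases are assumptions, and the DNA shape features are not reproduced here.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def one_hot(seq):
    """One row per base, one column per letter of ALPHABET.

    Bases outside ACGT (for example N) become all-zero rows.
    """
    return [[1 if base == a else 0 for a in ALPHABET] for base in seq.upper()]


def kmer_frequencies(seq, k=2):
    """Relative frequency of every possible k-mer, in lexicographic order."""
    seq = seq.upper()
    n_windows = max(len(seq) - k + 1, 0)
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    total = max(n_windows, 1)  # avoid division by zero for short input
    return [counts["".join(p)] / total for p in product(ALPHABET, repeat=k)]


print(one_hot("ACGT"))
print(kmer_frequencies("ACGT", k=2))
```

One-hot encoding keeps positional information and suits convolutional layers, while the k-mer vector is a fixed-length, position-free summary of composition, so the two are complementary inputs.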
|
133
|
Agarwal A, Chen L. DeepPHiC: predicting promoter-centered chromatin interactions using a novel deep learning approach. Bioinformatics 2023; 39:6887158. [PMID: 36495179 PMCID: PMC9825766 DOI: 10.1093/bioinformatics/btac801] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/24/2022] [Revised: 11/23/2022] [Accepted: 12/09/2022] [Indexed: 12/14/2022] Open
Abstract
MOTIVATION: Promoter-centered chromatin interactions, which include promoter-enhancer (PE) and promoter-promoter (PP) interactions, are important to decipher gene regulation and disease mechanisms. The development of next-generation sequencing technologies such as promoter capture Hi-C (pcHi-C) leads to the discovery of promoter-centered chromatin interactions. However, pcHi-C experiments are expensive and thus may be unavailable for tissues/cell types of interest. In addition, these experiments may be underpowered due to insufficient sequencing depth or various artifacts, which results in a limited finding of interactions. Most existing computational methods for predicting chromatin interactions are based on in situ Hi-C and can detect chromatin interactions across the entire genome. However, they may not be optimal for predicting promoter-centered chromatin interactions. RESULTS: We develop a supervised multi-modal deep learning model, which utilizes a comprehensive set of features such as genomic sequence, epigenetic signal, anchor distance, evolutionary features and DNA structural features to predict tissue/cell type-specific PE and PP interactions. We further extend the deep learning model in a multi-task learning and a transfer learning framework and demonstrate that the proposed approach outperforms state-of-the-art deep learning methods. Moreover, the proposed approach can achieve comparable prediction performance using predefined biologically relevant tissues/cell types compared to using all tissues/cell types in the pretraining especially for predicting PE interactions. The prediction performance can be further improved by using computationally inferred biologically relevant tissues/cell types in the pretraining, which are defined based on the common genes in the proximity of two anchors in the chromatin interactions. AVAILABILITY AND IMPLEMENTATION: https://github.com/lichen-lab/DeepPHiC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Aman Agarwal
- Department of Computer Science, Indiana University, Bloomington, IN 47405, USA
| | - Li Chen
- To whom correspondence should be addressed.
|
134
|
Cazares TA, Rizvi FW, Iyer B, Chen X, Kotliar M, Bejjani AT, Wayman JA, Donmez O, Wronowski B, Parameswaran S, Kottyan LC, Barski A, Weirauch MT, Prasath VBS, Miraldi ER. maxATAC: Genome-scale transcription-factor binding prediction from ATAC-seq with deep neural networks. PLoS Comput Biol 2023; 19:e1010863. [PMID: 36719906 PMCID: PMC9917285 DOI: 10.1371/journal.pcbi.1010863] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Received: 07/06/2022] [Revised: 02/10/2023] [Accepted: 01/10/2023] [Indexed: 02/01/2023] Open
Abstract
Transcription factors read the genome, fundamentally connecting DNA sequence to gene expression across diverse cell types. Determining how, where, and when TFs bind chromatin will advance our understanding of gene regulatory networks and cellular behavior. The 2017 ENCODE-DREAM in vivo Transcription-Factor Binding Site (TFBS) Prediction Challenge highlighted the value of chromatin accessibility data to TFBS prediction, establishing state-of-the-art methods for TFBS prediction from DNase-seq. However, the more recent Assay-for-Transposase-Accessible-Chromatin (ATAC)-seq has surpassed DNase-seq as the most widely-used chromatin accessibility profiling method. Furthermore, ATAC-seq is the only such technique available at single-cell resolution from standard commercial platforms. While ATAC-seq datasets grow exponentially, suboptimal motif scanning is unfortunately the most common method for TFBS prediction from ATAC-seq. To enable community access to state-of-the-art TFBS prediction from ATAC-seq, we (1) curated an extensive benchmark dataset (127 TFs) for ATAC-seq model training and (2) built "maxATAC", a suite of user-friendly, deep neural network models for genome-wide TFBS prediction from ATAC-seq in any cell type. With models available for 127 human TFs, maxATAC is the largest collection of high-performance TFBS prediction models for ATAC-seq. maxATAC performance extends to primary cells and single-cell ATAC-seq, enabling improved TFBS prediction in vivo. We demonstrate maxATAC's capabilities by identifying TFBS associated with allele-dependent chromatin accessibility at atopic dermatitis genetic risk loci.
Affiliation(s)
- Tareian A. Cazares
- Immunology Graduate Program, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States of America
| | - Faiz W. Rizvi
- Systems Biology and Physiology Graduate Program, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States of America
| | - Balaji Iyer
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
- Department of Electrical Engineering and Computer Science, University of Cincinnati, Cincinnati, Ohio, United States of America
| | - Xiaoting Chen
- The Center for Autoimmune Genetics and Etiology (CAGE), Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
| | - Michael Kotliar
- Division of Allergy and Immunology, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
| | - Anthony T. Bejjani
- Molecular and Developmental Biology Graduate Program, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States of America
| | - Joseph A. Wayman
- Division of Immunobiology, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
| | - Omer Donmez
- The Center for Autoimmune Genetics and Etiology (CAGE), Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
| | - Benjamin Wronowski
- Division of Allergy and Immunology, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
| | - Sreeja Parameswaran
- The Center for Autoimmune Genetics and Etiology (CAGE), Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
| | - Leah C. Kottyan
- The Center for Autoimmune Genetics and Etiology (CAGE), Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
- Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States of America
- Division of Human Genetics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
| | - Artem Barski
- Division of Allergy and Immunology, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
- Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States of America
- Division of Human Genetics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
| | - Matthew T. Weirauch
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
- The Center for Autoimmune Genetics and Etiology (CAGE), Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
- Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States of America
- Division of Human Genetics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
- Division of Developmental Biology, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
| | - V. B. Surya Prasath
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
- Department of Electrical Engineering and Computer Science, University of Cincinnati, Cincinnati, Ohio, United States of America
- Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States of America
| | - Emily R. Miraldi
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
- Department of Electrical Engineering and Computer Science, University of Cincinnati, Cincinnati, Ohio, United States of America
- Division of Immunobiology, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
- Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States of America
|
135
|
Accuracy and data efficiency in deep learning models of protein expression. Nat Commun 2022; 13:7755. [PMID: 36517468 PMCID: PMC9751117 DOI: 10.1038/s41467-022-34902-5] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 03/21/2022] [Accepted: 11/10/2022] [Indexed: 12/23/2022] Open
Abstract
Synthetic biology often involves engineering microbial strains to express high-value proteins. Thanks to progress in rapid DNA synthesis and sequencing, deep learning has emerged as a promising approach to build sequence-to-expression models for strain optimization. But such models need large and costly training data that create steep entry barriers for many laboratories. Here we study the relation between accuracy and data efficiency in an atlas of machine learning models trained on datasets of varied size and sequence diversity. We show that deep learning can achieve good prediction accuracy with much smaller datasets than previously thought. We demonstrate that controlled sequence diversity leads to substantial gains in data efficiency and employed Explainable AI to show that convolutional neural networks can finely discriminate between input DNA sequences. Our results provide guidelines for designing genotype-phenotype screens that balance cost and quality of training data, thus helping promote the wider adoption of deep learning in the biotechnology sector.
|
136
|
Grigorashvili EI, Chervontseva ZS, Gelfand MS. Predicting RNA secondary structure by a neural network: what features may be learned? PeerJ 2022; 10:e14335. [PMID: 36530406 PMCID: PMC9756865 DOI: 10.7717/peerj.14335] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2022] [Accepted: 10/12/2022] [Indexed: 12/14/2022] Open
Abstract
Deep learning is a class of machine learning techniques capable of creating internal representation of data without explicit preprogramming. Hence, in addition to practical applications, it is of interest to analyze what features of biological data may be learned by such models. Here, we describe PredPair, a deep learning neural network trained to predict base pairs in RNA structure from sequence alone, without any incorporated prior knowledge, such as the stacking energies or possible spatial structures. PredPair learned the Watson-Crick and wobble base-pairing rules and created an internal representation of the stacking energies and helices. Application to independent experimental (DMS-Seq) data on nucleotide accessibility in mRNA showed that the nucleotides predicted as paired indeed tend to be involved in the RNA structure. The performance of the constructed model was comparable with the state-of-the-art method based on the thermodynamic approach, but with a higher false positives rate. On the other hand, it successfully predicted pseudoknots. t-SNE clusters of embeddings of RNA sequences created by PredPair tend to contain embeddings from particular Rfam families, supporting the predictions of PredPair being in line with biological classification.
Affiliation(s)
| | | | - Mikhail S. Gelfand
- Center of Molecular and Cellular Biology, Skolkovo Institute of Science and Technology, Moscow, Russia; Institute of Information Transmission Problems, Moscow, Russia
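The Watson-Crick and wobble pairing rules that the PredPair abstract reports the network as having learned can also be written down explicitly. The following is a minimal Python sketch, not PredPair's code: the names `pair_type` and `invalid_pairs` are hypothetical, and dot-bracket notation for the structure is an assumption chosen for illustration (pseudoknots, which need additional bracket types, are not handled).

```python
# Illustrative only: hand-written base-pairing rules, not the PredPair model.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def pair_type(a, b):
    """Classify two RNA bases as 'watson-crick', 'wobble', or None."""
    pair = (a.upper(), b.upper())
    if pair in WATSON_CRICK:
        return "watson-crick"
    if pair in WOBBLE:
        return "wobble"
    return None


def invalid_pairs(seq, structure):
    """Return (i, j) index pairs of a dot-bracket structure that break the rules."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    stack, bad = [], []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)  # remember the opening position
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            i = stack.pop()  # partner of the closing bracket at j
            if pair_type(seq[i], seq[j]) is None:
                bad.append((i, j))
    if stack:
        raise ValueError("unbalanced structure")
    return bad
```

For a hairpin such as `invalid_pairs("GGGAAAUCC", "(((...)))")` the result is an empty list, because the stem uses only G-C and G-U pairs.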
|
137
|
Toneyan S, Tang Z, Koo PK. Evaluating deep learning for predicting epigenomic profiles. Nat Mach Intell 2022. [DOI: 10.1038/s42256-022-00570-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/12/2022]
|
138
|
Yuan GH, Wang Y, Wang GZ, Yang L. RNAlight: a machine learning model to identify nucleotide features determining RNA subcellular localization. Brief Bioinform 2022; 24:6868526. [PMID: 36464487 PMCID: PMC9851306 DOI: 10.1093/bib/bbac509] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 07/07/2022] [Revised: 10/13/2022] [Accepted: 10/25/2022] [Indexed: 12/12/2022] Open
Abstract
Different RNAs have distinct subcellular localizations. However, nucleotide features that determine these distinct distributions of lncRNAs and mRNAs have yet to be fully addressed. Here, we develop RNAlight, a machine learning model based on LightGBM, to identify nucleotide k-mers contributing to the subcellular localizations of mRNAs and lncRNAs. With the Tree SHAP algorithm, RNAlight extracts nucleotide features for cytoplasmic or nuclear localization of RNAs, indicating the sequence basis for distinct RNA subcellular localizations. By assembling k-mers to sequence features and subsequently mapping to known RBP-associated motifs, different types of sequence features and their associated RBPs were additionally uncovered for lncRNAs and mRNAs with distinct subcellular localizations. Finally, we extended RNAlight to precisely predict the subcellular localizations of other types of RNAs, including snRNAs, snoRNAs and different circular RNA transcripts, suggesting the generality of using RNAlight for RNA subcellular localization prediction.
Affiliation(s)
| | | | - Guang-Zhong Wang
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Li Yang
- Corresponding author. Li Yang, Center for Molecular Medicine, Children's Hospital, Fudan University and Shanghai Key Laboratory of Medical Epigenetics, International Laboratory of Medical Epigenetics and Metabolism, Ministry of Science and Technology, Institutes of Biomedical Sciences, Fudan University, Dong-An Road, 131, Shanghai, China. Tel: +86-021-54237325; E-mail:
|
139
|
Tabet D, Parikh V, Mali P, Roth FP, Claussnitzer M. Scalable Functional Assays for the Interpretation of Human Genetic Variation. Annu Rev Genet 2022; 56:441-465. [PMID: 36055970 DOI: 10.1146/annurev-genet-072920-032107] [Citation(s) in RCA: 29] [Impact Index Per Article: 14.5] [Indexed: 12/24/2022]
Abstract
Scalable sequence-function studies have enabled the systematic analysis and cataloging of hundreds of thousands of coding and noncoding genetic variants in the human genome. This has improved clinical variant interpretation and provided insights into the molecular, biophysical, and cellular effects of genetic variants at an astonishing scale and resolution across the spectrum of allele frequencies. In this review, we explore current applications and prospects for the field and outline the principles underlying scalable functional assay design, with a focus on the study of single-nucleotide coding and noncoding variants.
Affiliation(s)
- Daniel Tabet
- Donnelly Centre, Department of Molecular Genetics, and Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, Ontario, Canada
| | - Victoria Parikh
- Center for Inherited Cardiovascular Disease, Division of Cardiovascular Medicine, Stanford University School of Medicine, Stanford, California, USA
| | - Prashant Mali
- Department of Bioengineering, University of California, San Diego, California, USA
| | - Frederick P Roth
- Donnelly Centre, Department of Molecular Genetics, and Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, Ontario, Canada
| | - Melina Claussnitzer
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
- Center for Genomic Medicine and Endocrine Division, Massachusetts General Hospital, Boston, Massachusetts, USA
- Harvard Medical School, Harvard University, Boston, Massachusetts, USA
|
140
|
Agarwal V, Kelley DR. The genetic and biochemical determinants of mRNA degradation rates in mammals. Genome Biol 2022; 23:245. [PMID: 36419176 PMCID: PMC9684954 DOI: 10.1186/s13059-022-02811-x] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 03/18/2022] [Accepted: 11/02/2022] [Indexed: 11/26/2022] Open
Abstract
BACKGROUND Degradation rate is a fundamental aspect of mRNA metabolism, and the factors governing it remain poorly characterized. Understanding the genetic and biochemical determinants of mRNA half-life would enable more precise identification of variants that perturb gene expression through post-transcriptional gene regulatory mechanisms. RESULTS We establish a compendium of 39 human and 27 mouse transcriptome-wide mRNA decay rate datasets. A meta-analysis of these data identified a prevalence of technical noise and measurement bias, induced partially by the underlying experimental strategy. Correcting for these biases allowed us to derive more precise, consensus measurements of half-life which exhibit enhanced consistency between species. We trained substantially improved statistical models based upon genetic and biochemical features to better predict half-life and characterize the factors molding it. Our state-of-the-art model, Saluki, is a hybrid convolutional and recurrent deep neural network which relies only upon an mRNA sequence annotated with coding frame and splice sites to predict half-life (r=0.77). The key novel principle learned by Saluki is that the spatial positioning of splice sites, codons, and RNA-binding motifs within an mRNA is strongly associated with mRNA half-life. Saluki predicts the impact of RNA sequences and genetic mutations therein on mRNA stability, in agreement with functional measurements derived from massively parallel reporter assays. CONCLUSIONS Our work produces a more robust ground truth for transcriptome-wide mRNA half-lives in mammalian cells. Using these revised measurements, we trained Saluki, a model that is over 50% more accurate in predicting half-life from sequence than existing models. Saluki succinctly captures many of the known determinants of mRNA half-life and can be rapidly deployed to predict the functional consequences of arbitrary mutations in the transcriptome.
Affiliation(s)
- Vikram Agarwal
- Calico Life Sciences LLC, South San Francisco, CA, 94080, USA.
- Present Address: mRNA Center of Excellence, Sanofi Pasteur Inc., Waltham, MA, 02451, USA.
| | - David R Kelley
- Calico Life Sciences LLC, South San Francisco, CA, 94080, USA.
|
141
|
Lan AY, Corces MR. Deep learning approaches for noncoding variant prioritization in neurodegenerative diseases. Front Aging Neurosci 2022; 14:1027224. [PMID: 36466610 PMCID: PMC9716280 DOI: 10.3389/fnagi.2022.1027224] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2022] [Accepted: 10/24/2022] [Indexed: 11/19/2022] Open
Abstract
Determining how noncoding genetic variants contribute to neurodegenerative dementias is fundamental to understanding disease pathogenesis, improving patient prognostication, and developing new clinical treatments. Next generation sequencing technologies have produced vast amounts of genomic data on cell type-specific transcription factor binding, gene expression, and three-dimensional chromatin interactions, with the promise of providing key insights into the biological mechanisms underlying disease. However, this data is highly complex, making it challenging for researchers to interpret, assimilate, and dissect. To this end, deep learning has emerged as a powerful tool for genome analysis that can capture the intricate patterns and dependencies within these large datasets. In this review, we organize and discuss the many unique model architectures, development philosophies, and interpretation methods that have emerged in the last few years with a focus on using deep learning to predict the impact of genetic variants on disease pathogenesis. We highlight both broadly-applicable genomic deep learning methods that can be fine-tuned to disease-specific contexts as well as existing neurodegenerative disease research, with an emphasis on Alzheimer's-specific literature. We conclude with an overview of the future of the field at the intersection of neurodegeneration, genomics, and deep learning.
Affiliation(s)
- Alexander Y. Lan
- Gladstone Institute of Neurological Disease, San Francisco, CA, United States
- Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, United States
- Department of Neurology, University of California San Francisco, San Francisco, CA, United States
| | - M. Ryan Corces
- Gladstone Institute of Neurological Disease, San Francisco, CA, United States
- Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, United States
- Department of Neurology, University of California San Francisco, San Francisco, CA, United States
|
142
|
Donohue LK, Guo MG, Zhao Y, Jung N, Bussat RT, Kim DS, Neela PH, Kellman LN, Garcia OS, Meyers RM, Altman RB, Khavari PA. A cis-regulatory lexicon of DNA motif combinations mediating cell-type-specific gene regulation. Cell Genomics 2022; 2:100191. [PMID: 36742369 PMCID: PMC9894309 DOI: 10.1016/j.xgen.2022.100191] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
Abstract
Gene expression is controlled by transcription factors (TFs) that bind cognate DNA motif sequences in cis-regulatory elements (CREs). The combinations of DNA motifs acting within homeostasis and disease, however, are unclear. Gene expression, chromatin accessibility, TF footprinting, and H3K27ac-dependent DNA looping data were generated and a random-forest-based model was applied to identify 7,531 cell-type-specific cis-regulatory modules (CRMs) across 15 diploid human cell types. A co-enrichment framework within CRMs nominated 838 cell-type-specific, recurrent heterotypic DNA motif combinations (DMCs), which were functionally validated using massively parallel reporter assays. Cancer cells engaged DMCs linked to neoplasia-enabling processes operative in normal cells while also activating new DMCs only seen in the neoplastic state. This integrative approach identifies cell-type-specific cis-regulatory combinatorial DNA motifs in diverse normal and diseased human cells and represents a general framework for deciphering cis-regulatory sequence logic in gene regulation.
Affiliation(s)
- Laura K.H. Donohue
- Program in Epithelial Biology, Stanford University School of Medicine, Stanford, CA, USA; Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA; Synthego, Redwood City, CA, USA; These authors contributed equally
| | - Margaret G. Guo
- Program in Epithelial Biology, Stanford University School of Medicine, Stanford, CA, USA; Stanford Program in Biomedical Informatics, Stanford University, Stanford, CA, USA; These authors contributed equally
| | - Yang Zhao
- Program in Epithelial Biology, Stanford University School of Medicine, Stanford, CA, USA; Synthego, Redwood City, CA, USA
| | - Namyoung Jung
- Program in Epithelial Biology, Stanford University School of Medicine, Stanford, CA, USA; Department of Life Science, Pohang University of Science and Technology, Pohang, Korea
| | - Rose T. Bussat
- Program in Epithelial Biology, Stanford University School of Medicine, Stanford, CA, USA; 23andMe, Inc., Sunnyvale, CA, USA
| | - Daniel S. Kim
- Program in Epithelial Biology, Stanford University School of Medicine, Stanford, CA, USA; Stanford Program in Biomedical Informatics, Stanford University, Stanford, CA, USA
| | - Poornima H. Neela
- Program in Epithelial Biology, Stanford University School of Medicine, Stanford, CA, USA; Fauna Bio, Emeryville, CA, USA
| | - Laura N. Kellman
- Program in Epithelial Biology, Stanford University School of Medicine, Stanford, CA, USA; Stanford Program in Cancer Biology, Stanford University, Stanford, CA, USA
| | - Omar S. Garcia
- Program in Epithelial Biology, Stanford University School of Medicine, Stanford, CA, USA
| | - Robin M. Meyers
- Program in Epithelial Biology, Stanford University School of Medicine, Stanford, CA, USA; Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
| | - Russ B. Altman
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA; Stanford Program in Biomedical Informatics, Stanford University, Stanford, CA, USA; Department of Bioengineering, Stanford University, Stanford, CA, USA
| | - Paul A. Khavari
- Program in Epithelial Biology, Stanford University School of Medicine, Stanford, CA, USA; Stanford Program in Cancer Biology, Stanford University, Stanford, CA, USA; Veterans Affairs Palo Alto Healthcare System, Palo Alto, CA, USA; Lead contact; Correspondence:
|
143
|
Linder J, Koplik SE, Kundaje A, Seelig G. Deciphering the impact of genetic variation on human polyadenylation using APARENT2. Genome Biol 2022; 23:232. [PMID: 36335397 PMCID: PMC9636789 DOI: 10.1186/s13059-022-02799-4] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 05/17/2022] [Accepted: 10/19/2022] [Indexed: 11/08/2022] Open
Abstract
BACKGROUND 3'-end processing by cleavage and polyadenylation is an important and finely tuned regulatory process during mRNA maturation. Numerous genetic variants are known to cause or contribute to human disorders by disrupting the cis-regulatory code of polyadenylation signals. Yet, due to the complexity of this code, variant interpretation remains challenging. RESULTS We introduce a residual neural network model, APARENT2, that can infer 3'-cleavage and polyadenylation from DNA sequence more accurately than any previous model. This model generalizes to the case of alternative polyadenylation (APA) for a variable number of polyadenylation signals. We demonstrate APARENT2's performance on several variant datasets, including functional reporter data and human 3' aQTLs from GTEx. We apply neural network interpretation methods to gain insights into disrupted or protective higher-order features of polyadenylation. We fine-tune APARENT2 on human tissue-resolved transcriptomic data to elucidate tissue-specific variant effects. By combining APARENT2 with models of mRNA stability, we extend aQTL effect size predictions to the entire 3' untranslated region. Finally, we perform in silico saturation mutagenesis of all human polyadenylation signals and compare the predicted effects of [Formula: see text] million variants against gnomAD. While loss-of-function variants were generally selected against, we also find specific clinical conditions linked to gain-of-function mutations. For example, we detect an association between gain-of-function mutations in the 3'-end and autism spectrum disorder. To experimentally validate APARENT2's predictions, we assayed clinically relevant variants in multiple cell lines, including microglia-derived cells. 
CONCLUSIONS A sequence-to-function model based on deep residual learning enables accurate functional interpretation of genetic variants in polyadenylation signals and, when coupled with large human variation databases, elucidates the link between functional 3'-end mutations and human health.
Affiliation(s)
| | | | - Anshul Kundaje
- Department of Genetics, Stanford University, Stanford, USA
- Department of Computer Science, Stanford University, Stanford, USA
| | - Georg Seelig
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, USA
- Department of Electrical and Computer Engineering, University of Washington, Seattle, USA
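The in silico saturation mutagenesis mentioned in the APARENT2 abstract amounts to scoring every single-nucleotide substitution of a sequence against the reference score. A minimal Python sketch under stated assumptions: `saturation_mutagenesis` is an illustrative name, and `toy_pas_score` is a hypothetical stand-in that merely counts the canonical AATAAA hexamer; it is not the APARENT2 network, which would take the place of the `score` argument.

```python
def saturation_mutagenesis(seq, score):
    """Score every single-nucleotide substitution relative to the reference.

    `score` is any callable mapping a DNA string to a float (assumed here to
    be a trained model's prediction). Returns {(position, ref, alt): delta}.
    """
    seq = seq.upper()
    ref_score = score(seq)
    effects = {}
    for pos, ref in enumerate(seq):
        for alt in "ACGT":
            if alt == ref:
                continue  # skip the reference allele itself
            mutant = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, ref, alt)] = score(mutant) - ref_score
    return effects


def toy_pas_score(seq):
    """Hypothetical placeholder score: number of canonical AATAAA hexamers."""
    return float(seq.count("AATAAA"))


effects = saturation_mutagenesis("GGAATAAAGG", toy_pas_score)
# Substitutions inside the hexamer remove it; flanking ones leave it intact.
disruptive = sorted(key for key, delta in effects.items() if delta < 0)
```

A sequence of length n over ACGT yields 3n variants, so the cost of the scan is dominated by the number of model evaluations rather than by the enumeration itself.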
|
144
|
Li J, Wang J, Zhang P, Wang R, Mei Y, Sun Z, Fei L, Jiang M, Ma L, E W, Chen H, Wang X, Fu Y, Wu H, Liu D, Wang X, Li J, Guo Q, Liao Y, Yu C, Jia D, Wu J, He S, Liu H, Ma J, Lei K, Chen J, Han X, Guo G. Deep learning of cross-species single-cell landscapes identifies conserved regulatory programs underlying cell types. Nat Genet 2022; 54:1711-1720. [PMID: 36229673 DOI: 10.1038/s41588-022-01197-7] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 07/18/2021] [Accepted: 08/31/2022] [Indexed: 11/09/2022]
Abstract
Despite extensive efforts to generate and analyze reference genomes, genetic models to predict gene regulation and cell fate decisions are lacking for most species. Here, we generated whole-body single-cell transcriptomic landscapes of zebrafish, Drosophila and earthworm. We then integrated cell landscapes from eight representative metazoan species to study gene regulation across evolution. Using these uniformly constructed cross-species landscapes, we developed a deep-learning-based strategy, Nvwa, to predict gene expression and identify regulatory sequences at the single-cell level. We systematically compared cell-type-specific transcription factors to reveal conserved genetic regulation in vertebrates and invertebrates. Our work provides a valuable resource and offers a new strategy for studying regulatory grammar in diverse biological systems.
Affiliation(s)
- Jiaqi Li
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China; Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, China
| | - Jingjing Wang
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China; Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, China
| | - Peijing Zhang
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China; Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, China
| | - Renying Wang
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Yuqing Mei
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Zhongyi Sun
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Lijiang Fei
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Mengmeng Jiang
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China; Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, China
| | - Lifeng Ma
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Weigao E
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Haide Chen
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China; Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, China
| | - Xinru Wang
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Yuting Fu
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Hanyu Wu
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Daiyuan Liu
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Xueyi Wang
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Jingyu Li
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Qile Guo
- Zhejiang University-University of Edinburgh Institute, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, China
| | - Yuan Liao
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China; Zhejiang Provincial Key Laboratory for Tissue Engineering and Regenerative Medicine, Dr. Li Dak Sum & Yip Yio Chin Center for Stem Cell and Regenerative Medicine, Hangzhou, China
| | - Chengxuan Yu
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Danmei Jia
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Jian Wu
- Division of Hepatobiliary and Pancreatic Surgery, Department of Surgery, First Affiliated Hospital School of Medicine, Zhejiang University, Hangzhou, China
| | - Shibo He
- College of Control Science and Engineering, Zhejiang University, Hangzhou, China
| | - Huanju Liu
- Women's Hospital and Institute of Genetics, Zhejiang University School of Medicine, Hangzhou, China
| | - Jun Ma
- Women's Hospital and Institute of Genetics, Zhejiang University School of Medicine, Hangzhou, China
| | - Kai Lei
- Westlake Laboratory of Life Sciences and Biomedicine, Key Laboratory of Growth Regulation and Translational Research of Zhejiang Province, School of Life Sciences, Westlake University, Hangzhou, China
| | - Jiming Chen
- College of Control Science and Engineering, Zhejiang University, Hangzhou, China
| | - Xiaoping Han
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China; Zhejiang Provincial Key Laboratory for Tissue Engineering and Regenerative Medicine, Dr. Li Dak Sum & Yip Yio Chin Center for Stem Cell and Regenerative Medicine, Hangzhou, China
| | - Guoji Guo
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, China
- Zhejiang University-University of Edinburgh Institute, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, China
- Zhejiang Provincial Key Laboratory for Tissue Engineering and Regenerative Medicine, Dr. Li Dak Sum & Yip Yio Chin Center for Stem Cell and Regenerative Medicine, Hangzhou, China
- Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, China
| |
|
145
|
Zhou Y, Wu T, Jiang Y, Li Y, Li K, Quan L, Lyu Q. DeepNup: Prediction of Nucleosome Positioning from DNA Sequences Using Deep Neural Network. Genes (Basel) 2022; 13:1983. [PMID: 36360220 PMCID: PMC9689664 DOI: 10.3390/genes13111983] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/30/2022] [Revised: 10/25/2022] [Accepted: 10/26/2022] [Indexed: 10/29/2024] Open
Abstract
Nucleosome positioning plays a vital role in diverse cellular processes by regulating the accessibility of DNA sequences to DNA-binding proteins. Previous studies have shown that the intrinsic preference of nucleosomes for DNA sequences may play a dominant role in nucleosome positioning. Developing computational methods that accurately identify nucleosome positioning from DNA sequence information alone is therefore a nontrivial task, and such methods help verify the contribution of DNA sequences to nucleosome positioning. In this work, we propose a new deep learning-based method, named DeepNup, which improves the prediction of nucleosome positioning from DNA sequences alone. Specifically, we first use a hybrid feature encoding scheme that combines one-hot encoding and trinucleotide composition encoding to encode raw DNA sequences; we then employ multiscale convolutional neural network modules that consist of two parallel convolution kernels with different sizes and gated recurrent units to learn the local and global correlation feature representations; lastly, we use a fully connected layer and a sigmoid unit as a classifier to integrate these learned high-order feature representations and generate the final predictions. On two benchmark nucleosome positioning datasets, DeepNup achieves better evaluation metrics for nucleosome positioning prediction than several state-of-the-art methods. These results demonstrate that DeepNup is a powerful deep learning-based tool for accurately identifying potential nucleosome sequences.
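The hybrid encoding step described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`one_hot`, `trinucleotide_composition`, `hybrid_encode`) are hypothetical, trinucleotide composition is assumed to mean normalized overlapping 3-mer frequencies, and since the abstract does not say how the two encodings are combined, the sketch simply returns them side by side.

```python
from itertools import product

BASES = "ACGT"
# All 64 trinucleotides in lexicographic order (AAA, AAC, ..., TTT).
TRINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=3)]
TRI_INDEX = {tri: i for i, tri in enumerate(TRINUCLEOTIDES)}


def one_hot(seq):
    """Return one 4-element row per base; unknown bases (e.g. N) become all zeros."""
    rows = []
    for base in seq.upper():
        row = [0.0, 0.0, 0.0, 0.0]
        idx = BASES.find(base)
        if idx >= 0:
            row[idx] = 1.0
        rows.append(row)
    return rows


def trinucleotide_composition(seq):
    """Return a 64-element vector of overlapping 3-mer frequencies.

    3-mers containing a non-ACGT character are skipped (an assumption).
    """
    seq = seq.upper()
    counts = [0.0] * 64
    for i in range(len(seq) - 2):
        idx = TRI_INDEX.get(seq[i:i + 3])
        if idx is not None:
            counts[idx] += 1.0
    total = sum(counts)
    if total == 0:
        return counts
    return [c / total for c in counts]


def hybrid_encode(seq):
    """Return (one-hot matrix, trinucleotide frequency vector) for a DNA string."""
    return one_hot(seq), trinucleotide_composition(seq)
```

For example, `hybrid_encode("ACGT")` yields a 4 x 4 one-hot matrix together with a 64-element frequency vector in which only the `ACG` and `CGT` entries are nonzero. A downstream model would need to decide how to feed both representations to its convolutional layers; that choice is left open here.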
Affiliation(s)
- Yiting Zhou
- School of Computer Science and Technology, Soochow University, Suzhou Ganjiang East Street 333, Suzhou 215006, China
| | - Tingfang Wu
- School of Computer Science and Technology, Soochow University, Suzhou Ganjiang East Street 333, Suzhou 215006, China
- Key Lab for Information Processing Technologies, Soochow University, Suzhou Ganjiang East Street 333, Suzhou 215006, China
- Collaborative Innovation Center of Novel Software Technology and Industrialization, Organization, Nanjing 210000, China
| | - Yelu Jiang
- School of Computer Science and Technology, Soochow University, Suzhou Ganjiang East Street 333, Suzhou 215006, China
| | - Yan Li
- School of Computer Science and Technology, Soochow University, Suzhou Ganjiang East Street 333, Suzhou 215006, China
| | - Kailong Li
- School of Computer Science and Technology, Soochow University, Suzhou Ganjiang East Street 333, Suzhou 215006, China
| | - Lijun Quan
- School of Computer Science and Technology, Soochow University, Suzhou Ganjiang East Street 333, Suzhou 215006, China
- Key Lab for Information Processing Technologies, Soochow University, Suzhou Ganjiang East Street 333, Suzhou 215006, China
- Collaborative Innovation Center of Novel Software Technology and Industrialization, Organization, Nanjing 210000, China
| | - Qiang Lyu
- School of Computer Science and Technology, Soochow University, Suzhou Ganjiang East Street 333, Suzhou 215006, China
- Key Lab for Information Processing Technologies, Soochow University, Suzhou Ganjiang East Street 333, Suzhou 215006, China
- Collaborative Innovation Center of Novel Software Technology and Industrialization, Organization, Nanjing 210000, China
| |
|
146
|
Malina S, Cizin D, Knowles DA. Deep mendelian randomization: Investigating the causal knowledge of genomic deep learning models. PLoS Comput Biol 2022; 18:e1009880. [PMID: 36265006 PMCID: PMC9624391 DOI: 10.1371/journal.pcbi.1009880] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/01/2022] [Revised: 11/01/2022] [Accepted: 09/19/2022] [Indexed: 11/06/2022] Open
Abstract
Multi-task deep learning (DL) models can accurately predict diverse genomic marks from sequence, but whether these models learn the causal relationships between genomic marks is unknown. Here, we describe Deep Mendelian Randomization (DeepMR), a method for estimating causal relationships between genomic marks learned by genomic DL models. By combining Mendelian randomization with in silico mutagenesis, DeepMR obtains local (locus specific) and global estimates of (an assumed) linear causal relationship between marks. In a simulation designed to test recovery of pairwise causal relations between transcription factors (TFs), DeepMR gives accurate and unbiased estimates of the 'true' global causal effect, but its coverage decays in the presence of sequence-dependent confounding. We then apply DeepMR to examine the global relationships learned by a state-of-the-art DL model, BPNet, between TFs involved in reprogramming. DeepMR's causal effect estimates validate previously hypothesized relationships between TFs and suggest new relationships for future investigation.
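The idea of combining in silico mutagenesis with Mendelian randomization, as summarized in the abstract above, can be illustrated with a small sketch. It is an assumption-laden toy, not the DeepMR estimator itself: the function names are hypothetical, the local estimate is taken to be a through-origin least-squares slope of predicted outcome-mark changes on predicted exposure-mark changes across mutations at one locus, and the global estimate is taken to be an inverse-variance weighted mean of local estimates. The abstract does not specify which estimators the authors use.

```python
def local_effect(exposure_deltas, outcome_deltas):
    """Estimate a local (per-locus) linear effect of exposure on outcome.

    Each pair (dx, dy) is the model-predicted change in the exposure mark and
    the outcome mark caused by one in silico mutation at the locus. The slope
    of a regression through the origin is sum(dx * dy) / sum(dx ** 2).
    """
    if len(exposure_deltas) != len(outcome_deltas):
        raise ValueError("delta lists must have equal length")
    sxx = sum(dx * dx for dx in exposure_deltas)
    if sxx == 0:
        raise ValueError("mutations have no predicted effect on the exposure")
    sxy = sum(dx * dy for dx, dy in zip(exposure_deltas, outcome_deltas))
    return sxy / sxx


def global_effect(local_estimates, local_variances):
    """Combine local estimates with inverse-variance weights (an assumed choice)."""
    if len(local_estimates) != len(local_variances):
        raise ValueError("estimates and variances must have equal length")
    weights = [1.0 / v for v in local_variances]
    weighted_sum = sum(w * e for w, e in zip(weights, local_estimates))
    return weighted_sum / sum(weights)
```

In this sketch, loci whose local estimates have smaller variance contribute more to the global estimate. How those variances would be obtained (for instance from model uncertainty) is outside the scope of the example.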
Affiliation(s)
- Stephen Malina
- Department of Computer Science, Columbia University, New York, New York, United States of America
- Dyno Therapeutics, Watertown, Massachusetts, United States of America
| | - Daniel Cizin
- Department of Computer Science, Columbia University, New York, New York, United States of America
- Tri-Institutional Ph.D. Program in Computational Biology and Medicine, Weill Cornell Medicine, New York, New York, United States of America
| | - David A. Knowles
- Department of Computer Science, Columbia University, New York, New York, United States of America
- New York Genome Center, New York, New York, United States of America
- Department of Systems Biology, Columbia University, New York, New York, United States of America
- Data Science Institute, Columbia University, New York, New York, United States of America
| |
|
147
|
Zhao Y, Shao J, Asmann YW. Assessment and Optimization of Explainable Machine Learning Models Applied to Transcriptomic Data. Genomics, Proteomics & Bioinformatics 2022; 20:899-911. [PMID: 35931322 PMCID: PMC10025763 DOI: 10.1016/j.gpb.2022.07.003] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/10/2021] [Revised: 06/05/2022] [Accepted: 07/25/2022] [Indexed: 01/12/2023]
Abstract
Explainable artificial intelligence aims to interpret how machine learning models make decisions, and many model explainers have been developed in the computer vision field. However, understanding of the applicability of these model explainers to biological data is still lacking. In this study, we comprehensively evaluated multiple explainers by interpreting pre-trained models for predicting tissue types from transcriptomic data and by identifying the top contributing genes from each sample with the greatest impacts on model prediction. To improve the reproducibility and interpretability of results generated by model explainers, we proposed a series of optimization strategies for each explainer on two different model architectures of multilayer perceptron (MLP) and convolutional neural network (CNN). We observed three groups of explainer and model architecture combinations with high reproducibility. Group II, which contains three model explainers on aggregated MLP models, identified top contributing genes in different tissues that exhibited tissue-specific manifestation and were potential cancer biomarkers. In summary, our work provides novel insights and guidance for exploring biological mechanisms using explainable machine learning models.
Affiliation(s)
- Yongbing Zhao
- Department of Quantitative Health Sciences, Mayo Clinic, Jacksonville, FL 32224, USA.
| | - Jinfeng Shao
- The Laboratory of Malaria and Vector Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Rockville, MD 20852, USA
| | - Yan W Asmann
- Department of Quantitative Health Sciences, Mayo Clinic, Jacksonville, FL 32224, USA.
| |
|
148
|
Li L, Yu X, Sheng C, Jiang X, Zhang Q, Han Y, Jiang J. A review of brain imaging biomarker genomics in Alzheimer’s disease: implementation and perspectives. Transl Neurodegener 2022; 11:42. [PMID: 36109823 PMCID: PMC9476275 DOI: 10.1186/s40035-022-00315-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 05/17/2022] [Accepted: 08/24/2022] [Indexed: 11/25/2022] Open
Abstract
Alzheimer’s disease (AD) is a progressive neurodegenerative disease with phenotypic changes closely associated with both genetic variants and imaging pathology. Brain imaging biomarker genomics has been developed in recent years to reveal potential AD pathological mechanisms and provide early diagnoses. This technique integrates multimodal imaging phenotypes with genetic data in a noninvasive and high-throughput manner. In this review, we summarize the basic analytical framework of brain imaging biomarker genomics and elucidate two main implementation scenarios of this technique in AD studies: (1) exploring novel biomarkers and seeking mutual interpretability and (2) providing a diagnosis and prognosis for AD with combined use of machine learning methods and brain imaging biomarker genomics. Importantly, we highlight the necessity of brain imaging biomarker genomics, discuss the strengths and limitations of current methods, and propose directions for development of this research field.
|
149
|
Hentges LD, Sergeant MJ, Cole CB, Downes DJ, Hughes JR, Taylor S. LanceOtron: a deep learning peak caller for genome sequencing experiments. Bioinformatics 2022; 38:4255-4263. [PMID: 35866989 PMCID: PMC9477537 DOI: 10.1093/bioinformatics/btac525] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/14/2021] [Revised: 05/10/2022] [Accepted: 07/21/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION: Genome sequencing experiments have revolutionized molecular biology by allowing researchers to identify important DNA-encoded elements genome wide. Regions where these elements are found appear as peaks in the analog signal of an assay's coverage track, and despite the ease with which humans can visually categorize these patterns, the size of many genomes necessitates algorithmic implementations. Commonly used methods focus on statistical tests to classify peaks, discounting that the background signal does not completely follow any known probability distribution and reducing the information-dense peak shapes to simply maximum height. Deep learning has been shown to be highly accurate for many pattern recognition tasks, on par or even exceeding human capabilities, providing an opportunity to reimagine and improve peak calling.
RESULTS: We present the peak calling framework LanceOtron, which combines deep learning for recognizing peak shape with multifaceted enrichment calculations for assessing significance. In benchmarking ATAC-seq, ChIP-seq and DNase-seq, LanceOtron outperforms long-standing, gold-standard peak callers through its improved selectivity and near-perfect sensitivity.
AVAILABILITY AND IMPLEMENTATION: A fully featured web application is freely available from LanceOtron.molbiol.ox.ac.uk, command line interface via python is pip installable from PyPI at https://pypi.org/project/lanceotron/, and source code and benchmarking tests are available at https://github.com/LHentges/LanceOtron.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Lance D Hentges
- MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK
- MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK
| | - Martin J Sergeant
- MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK
- MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK
| | - Christopher B Cole
- MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK
| | - Damien J Downes
- MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK
| | - Jim R Hughes
- MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK
- MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK
| | - Stephen Taylor
- MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK
| |
|
150
|
Lal A. Deciphering the regulatory syntax of genomic DNA with deep learning. J Biosci 2022. [DOI: 10.1007/s12038-022-00291-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
|