1
Yan Y, Li W, Wang S, Huang T. Seq-RBPPred: Predicting RNA-Binding Proteins from Sequence. ACS Omega 2024; 9:12734-12742. [PMID: 38524500 PMCID: PMC10955590 DOI: 10.1021/acsomega.3c08381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Revised: 12/18/2023] [Accepted: 12/28/2023] [Indexed: 03/26/2024]
Abstract
RNA-binding proteins (RBPs) can interact with RNAs to regulate RNA translation, modification, splicing, and other important biological processes. The accurate identification of RBPs is of paramount importance for gaining insights into the intricate mechanisms underlying organismal life activities. Traditional experimental methods to identify RBPs require a lot of time and money, so it is important to develop computational methods to predict RBPs. However, the existing approaches for RBP prediction still require further improvement because RBPs remain unidentified in many species. In this study, we present Seq-RBPPred (predicting RBPs from sequence), a novel method that utilizes a comprehensive feature representation encompassing both biophysical properties and hidden-state features derived from protein sequences. Comprehensive performance evaluations show that Seq-RBPPred outperforms state-of-the-art methods, achieving an overall accuracy of 0.922, a sensitivity of 0.926, a specificity of 0.903, and a Matthews correlation coefficient (MCC) of 0.757 on the testing set. The data and code of Seq-RBPPred are available at https://github.com/yaoyao-11/Seq-RBPPred.
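For readers unfamiliar with the four metrics quoted in this abstract, the sketch below shows how accuracy, sensitivity, specificity, and MCC are conventionally derived from a binary confusion matrix. The function name `binary_metrics` and the counts are hypothetical and chosen only for illustration; they are not taken from the Seq-RBPPred paper or its code.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    # MCC stays informative when the two classes are imbalanced.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only (not results from the paper).
metrics = binary_metrics(tp=90, fp=5, tn=45, fn=10)
print(metrics)
```

Because MCC uses all four cells of the confusion matrix, it can be noticeably lower than accuracy when negatives are much rarer than positives, which is consistent with the gap between the accuracy and MCC values reported above.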
Affiliation(s)
- Yuyao Yan
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200021, China
- Wenran Li
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200021, China
- Sijia Wang
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200021, China
- Tao Huang
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200021, China

2
Tognon M, Giugno R, Pinello L. A survey on algorithms to characterize transcription factor binding sites. Brief Bioinform 2023; 24:bbad156. [PMID: 37099664 PMCID: PMC10422928 DOI: 10.1093/bib/bbad156] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 01/13/2023] [Revised: 03/27/2023] [Accepted: 04/01/2023] [Indexed: 04/28/2023]
Abstract
Transcription factors (TFs) are key regulatory proteins that control the transcriptional rate of cells by binding short DNA sequences called transcription factor binding sites (TFBS) or motifs. Identifying and characterizing TFBS is fundamental to understanding the regulatory mechanisms governing the transcriptional state of cells. During the last decades, several experimental methods have been developed to recover DNA sequences containing TFBS. In parallel, computational methods have been proposed to discover and identify TFBS motifs based on these DNA sequences. This is one of the most widely investigated problems in bioinformatics and is referred to as the motif discovery problem. In this manuscript, we review classical and novel experimental and computational methods developed to discover and characterize TFBS motifs in DNA sequences, highlighting their advantages and drawbacks. We also discuss open challenges and future perspectives that could fill the remaining gaps in the field.
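As a concrete illustration of how a TFBS motif is commonly represented once binding sites are known, the sketch below builds a log-odds position weight matrix (PWM) from aligned sites and scans a sequence for the best-scoring window. It is not a motif discovery algorithm and is not taken from the survey; the toy sites, the uniform background of 0.25, the pseudocount, and the function names are all assumptions.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length aligned binding sites."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append(
            {base: math.log2((counts[base] / total) / background) for base in BASES}
        )
    return pwm


def best_match(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    windows = [
        (sum(pwm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(windows)


# Toy aligned sites, hypothetical and for illustration only.
sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
pwm = build_pwm(sites)
score, position = best_match(pwm, "GGCGTATAATGCC")
print(score, position)
```

In this toy example the consensus TATAAT occurs at offset 4 of the scanned sequence, so that window receives the highest score.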
Affiliation(s)
- Manuel Tognon
  - Computer Science Department, University of Verona, Verona, Italy
  - Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Rosalba Giugno
  - Computer Science Department, University of Verona, Verona, Italy
- Luca Pinello
  - Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
  - Department of Pathology, Harvard Medical School, Boston, Massachusetts, United States of America

3
Deng C, Whalen S, Steyert M, Ziffra R, Przytycki PF, Inoue F, Pereira DA, Capauto D, Norton S, Vaccarino FM, Pollen A, Nowakowski TJ, Ahituv N, Pollard KS. Massively parallel characterization of psychiatric disorder-associated and cell-type-specific regulatory elements in the developing human cortex. bioRxiv 2023:2023.02.15.528663. [PMID: 36824845 PMCID: PMC9949039 DOI: 10.1101/2023.02.15.528663] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Indexed: 02/17/2023]
Abstract
Nucleotide changes in gene regulatory elements are important determinants of neuronal development and disease. Using massively parallel reporter assays in primary human cells from mid-gestation cortex and cerebral organoids, we interrogated the cis-regulatory activity of 102,767 sequences, including differentially accessible cell-type specific regions in the developing cortex and single-nucleotide variants associated with psychiatric disorders. In primary cells, we identified 46,802 active enhancer sequences and 164 disorder-associated variants that significantly alter enhancer activity. Activity was comparable in organoids and primary cells, suggesting that organoids provide an adequate model for the developing cortex. Using deep learning, we decoded the sequence basis and upstream regulators of enhancer activity. This work establishes a comprehensive catalog of functional gene regulatory elements and variants in human neuronal development.
Affiliation(s)
- Chengyu Deng
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco; San Francisco, CA, USA
  - Institute for Human Genetics, University of California, San Francisco; San Francisco, CA, USA
- Sean Whalen
  - Gladstone Institutes; San Francisco, CA, USA
- Marilyn Steyert
  - Department of Anatomy, University of California, San Francisco; San Francisco, CA, USA
  - Department of Psychiatry, University of California, San Francisco; San Francisco, CA, USA
  - Department of Neurological Surgery, University of California, San Francisco; San Francisco, CA, USA
- Ryan Ziffra
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco; San Francisco, CA, USA
- Fumitaka Inoue
  - Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University; Kyoto, Japan
- Daniela A. Pereira
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco; San Francisco, CA, USA
  - Institute for Human Genetics, University of California, San Francisco; San Francisco, CA, USA
  - Graduate Program of Genetics, Institute of Biological Sciences, Federal University of Minas Gerais; Belo Horizonte, Minas Gerais, Brazil
- Scott Norton
  - Child Study Center, Yale University; New Haven, CT, USA
- Flora M. Vaccarino
  - Child Study Center, Yale University; New Haven, CT, USA
  - Department of Neuroscience, Yale University; New Haven, CT, USA
- Alex Pollen
  - Department of Neurology, University of California, San Francisco; San Francisco, CA, USA
  - Eli and Edythe Broad Center for Regeneration Medicine and Stem Cell Research, University of California, San Francisco; San Francisco, CA, USA
- Tomasz J. Nowakowski
  - Department of Anatomy, University of California, San Francisco; San Francisco, CA, USA
  - Department of Psychiatry, University of California, San Francisco; San Francisco, CA, USA
  - Department of Neurological Surgery, University of California, San Francisco; San Francisco, CA, USA
  - Eli and Edythe Broad Center for Regeneration Medicine and Stem Cell Research, University of California, San Francisco; San Francisco, CA, USA
  - Chan Zuckerberg Biohub, San Francisco; San Francisco, CA, USA
- Nadav Ahituv
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco; San Francisco, CA, USA
  - Institute for Human Genetics, University of California, San Francisco; San Francisco, CA, USA
- Katherine S. Pollard
  - Institute for Human Genetics, University of California, San Francisco; San Francisco, CA, USA
  - Gladstone Institutes; San Francisco, CA, USA
  - Chan Zuckerberg Biohub, San Francisco; San Francisco, CA, USA
  - Department of Epidemiology and Biostatistics, University of California, San Francisco; San Francisco, CA, USA

4
Liu Q, Zeng W, Zhang W, Wang S, Chen H, Jiang R, Zhou M, Zhang S. Deep generative modeling and clustering of single cell Hi-C data. Brief Bioinform 2023; 24:6858951. [PMID: 36458445 DOI: 10.1093/bib/bbac494] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2022] [Revised: 09/28/2022] [Accepted: 10/18/2022] [Indexed: 12/05/2022]
Abstract
Deciphering 3D genome conformation is important for understanding gene regulation and cellular function at a spatial level. Recent advances in single cell Hi-C technologies have enabled the profiling of the 3D architecture of DNA within individual cells, which allows us to study the cell-to-cell variability of 3D chromatin organization. Computational approaches are urgently needed to comprehensively analyze the sparse and heterogeneous single cell Hi-C data. Here, we propose scDEC-Hi-C, a new framework for single cell Hi-C analysis with deep generative neural networks. scDEC-Hi-C outperforms existing methods in terms of single cell Hi-C data clustering and imputation. Moreover, the generative power of scDEC-Hi-C could help unveil the differences of chromatin architecture across cell types. We expect that scDEC-Hi-C could shed light on deepening our understanding of the complex mechanism underlying the formation of chromatin contacts.
Affiliation(s)
- Qiao Liu
  - Department of Statistics, Stanford University, Stanford, CA 94305, USA
- Wanwen Zeng
  - College of Software, Nankai University, Tianjin 300071, China
- Wei Zhang
  - Department of Biomedical Engineering, School of Control Science and Engineering, Shandong University, Jinan, Shandong 250061, China
- Sicheng Wang
  - Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, USA
- Hongyang Chen
  - The Research Center for Intelligent Network, Zhejiang Lab, Hangzhou 311121, China
- Rui Jiang
  - Ministry of Education Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Mu Zhou
  - SenseBrain Research, San Jose, CA 95131, USA
- Shaoting Zhang
  - Shanghai Artificial Intelligence Laboratory, Shanghai 200240, China

5
Beknazarov N, Poptsova M. DeepZ: A Deep Learning Approach for Z-DNA Prediction. Methods Mol Biol 2023; 2651:217-226. [PMID: 36892770 DOI: 10.1007/978-1-0716-3084-6_15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/10/2023]
Abstract
Here we describe an approach that uses deep learning neural networks such as CNN and RNN to aggregate information from DNA sequence; physical, chemical, and structural properties of nucleotides; and omics data on histone modifications, methylation, chromatin accessibility, and transcription factor binding sites and data from other available NGS experiments. We explain how with the trained model one can perform whole-genome annotation of Z-DNA regions and feature importance analysis in order to define key determinants for functional Z-DNA regions.
Affiliation(s)
- Nazar Beknazarov
  - Laboratory of Bioinformatics, Faculty of Computer Science, National Research University Higher School of Economics, Moscow, Russia
- Maria Poptsova
  - Laboratory of Bioinformatics, Faculty of Computer Science, National Research University Higher School of Economics, Moscow, Russia

6
Toneyan S, Tang Z, Koo PK. Evaluating deep learning for predicting epigenomic profiles. Nat Mach Intell 2022. [DOI: 10.1038/s42256-022-00570-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/12/2022]

7
Malina S, Cizin D, Knowles DA. Deep mendelian randomization: Investigating the causal knowledge of genomic deep learning models. PLoS Comput Biol 2022; 18:e1009880. [PMID: 36265006 PMCID: PMC9624391 DOI: 10.1371/journal.pcbi.1009880] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/01/2022] [Revised: 11/01/2022] [Accepted: 09/19/2022] [Indexed: 11/06/2022]
Abstract
Multi-task deep learning (DL) models can accurately predict diverse genomic marks from sequence, but whether these models learn the causal relationships between genomic marks is unknown. Here, we describe Deep Mendelian Randomization (DeepMR), a method for estimating causal relationships between genomic marks learned by genomic DL models. By combining Mendelian randomization with in silico mutagenesis, DeepMR obtains local (locus specific) and global estimates of (an assumed) linear causal relationship between marks. In a simulation designed to test recovery of pairwise causal relations between transcription factors (TFs), DeepMR gives accurate and unbiased estimates of the 'true' global causal effect, but its coverage decays in the presence of sequence-dependent confounding. We then apply DeepMR to examine the global relationships learned by a state-of-the-art DL model, BPNet, between TFs involved in reprogramming. DeepMR's causal effect estimates validate previously hypothesized relationships between TFs and suggest new relationships for future investigation.
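The idea of treating sequence variants as instruments can be sketched as follows. Assume in silico mutagenesis has already produced, for each variant, the model's predicted change in one mark (the exposure) and in another mark (the outcome); a regression through the origin then gives a simple estimate of an assumed linear effect of the first mark on the second. This is a generic Mendelian-randomization-style estimator shown for illustration only, not DeepMR's actual procedure; the function name and all numbers are hypothetical.

```python
def through_origin_slope(exposure_effects, outcome_effects):
    """Least-squares slope of outcome effects on exposure effects, with no intercept."""
    numerator = sum(x * y for x, y in zip(exposure_effects, outcome_effects))
    denominator = sum(x * x for x in exposure_effects)
    if denominator == 0:
        raise ValueError("exposure effects are all zero; the slope is undefined")
    return numerator / denominator


# Hypothetical per-variant predicted effects on mark A (exposure).
effect_on_a = [0.2, -0.1, 0.4, 0.3]
# Toy outcome effects constructed so that the true linear effect of A on B is 0.5.
effect_on_b = [0.5 * x for x in effect_on_a]

causal_estimate = through_origin_slope(effect_on_a, effect_on_b)
print(causal_estimate)
```

In this toy setting the estimate recovers the constructed effect of 0.5. With real model predictions, confounding through the sequence itself can bias such an estimate, which is the failure mode the abstract reports for coverage.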
Affiliation(s)
- Stephen Malina
  - Department of Computer Science, Columbia University, New York, New York, United States of America
  - Dyno Therapeutics, Watertown, Massachusetts, United States of America
- Daniel Cizin
  - Department of Computer Science, Columbia University, New York, New York, United States of America
  - Tri-Institutional Ph.D. Program in Computational Biology and Medicine, Weill Cornell Medicine, New York, New York, United States of America
- David A. Knowles
  - Department of Computer Science, Columbia University, New York, New York, United States of America
  - New York Genome Center, New York, New York, United States of America
  - Department of Systems Biology, Columbia University, New York, New York, United States of America
  - Data Science Institute, Columbia University, New York, New York, United States of America

8
Omics Data and Data Representations for Deep Learning-Based Predictive Modeling. Int J Mol Sci 2022; 23:ijms232012272. [PMID: 36293133 PMCID: PMC9603455 DOI: 10.3390/ijms232012272] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2022] [Revised: 10/03/2022] [Accepted: 10/12/2022] [Indexed: 11/25/2022]
Abstract
Medical discoveries mainly depend on the capability to process and analyze biological datasets, which inundate the scientific community and are still expanding as the cost of next-generation sequencing technologies decreases. Deep learning (DL) is a viable method to exploit this massive data stream, since it has advanced quickly through successive innovations. However, an obstacle to scientific progress emerges: DL is difficult to apply to biology because both fields are evolving at a breakneck pace, making it hard for an individual to occupy the front lines of both. This paper aims to bridge the gap and help computer scientists bring their valuable expertise into the life sciences. This work provides an overview of the most common types of biological data and data representations that are used to train DL models, with additional information on the models themselves and the various tasks that are being tackled. This is the essential information a DL expert with no background in biology needs in order to participate in DL-based research projects in biomedicine, biotechnology, and drug discovery. This study could also be useful to researchers in biology who want to understand and utilize the power of DL to gain better insights into, and extract important information from, omics data.
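One of the most common data representations for sequence-based DL models is one-hot encoding, where each nucleotide becomes a 4-element indicator vector. The sketch below is a minimal, dependency-free version; the function name `one_hot_encode` and the choice to map unknown symbols such as N to an all-zero row are assumptions made for illustration, not conventions stated in the abstract.

```python
def one_hot_encode(sequence, alphabet="ACGT"):
    """Encode a DNA sequence as a list of indicator rows, one row per base."""
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(alphabet)
        if base in index:  # unknown symbols (e.g. N) stay all-zero
            row[index[base]] = 1
        encoded.append(row)
    return encoded


encoded = one_hot_encode("acgtN")
for row in encoded:
    print(row)
```

The resulting length-by-4 matrix is the typical input shape for convolutional models over DNA; in practice it would usually be converted to an array type of the chosen DL framework.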

9
Li Y, Quan L, Zhou Y, Jiang Y, Li K, Wu T, Lyu Q. Identifying modifications on DNA-bound histones with joint deep learning of multiple binding sites in DNA sequence. Bioinformatics 2022; 38:4070-4077. [PMID: 35809058 DOI: 10.1093/bioinformatics/btac489] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/28/2022] [Revised: 06/15/2022] [Accepted: 07/07/2022] [Indexed: 12/24/2022]
Abstract
MOTIVATION: Histone modifications are epigenetic markers that impact gene expression by altering the chromatin structure or recruiting histone modifiers. Their accurate identification is key to unraveling the mechanisms by which they regulate gene expression. However, the solutions for this task can be improved by exploiting multiple relationships in the dataset and exploring designs of learning models, for example joint learning. RESULTS: This article proposes a deep learning-based multi-objective computational approach, iHMnBS, to identify which of the seven typical histone modifications a DNA sequence may choose to bind, and which parts of the DNA sequence bind to them. iHMnBS employs a customized dataset that allows the marking of modifications contained in histones that may bind to any position in the DNA sequence. iHMnBS tries to mine the information implicit in this richer data by means of deep neural networks. In comprehensive comparisons, iHMnBS outperforms a baseline method, and the probability of binding to modified histones assigned to a representative nucleotide of a DNA sequence can serve as a reference for biological experiments. Since the interaction between transcription factors and histone modifications has an important role in gene expression, we extracted a number of sequence patterns that may bind to transcription factors, and explored their possible impact on disease. AVAILABILITY AND IMPLEMENTATION: The source code is available at https://github.com/lennylv/iHMnBS. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yan Li
  - School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Lijun Quan
  - School of Computer Science and Technology, Soochow University, Suzhou 215006, China
  - Province Key Lab for Information Processing Technologies, Soochow University, Suzhou 215006, China
  - Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
- Yiting Zhou
  - School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Yelu Jiang
  - School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Kailong Li
  - School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Tingfang Wu
  - School of Computer Science and Technology, Soochow University, Suzhou 215006, China
  - Province Key Lab for Information Processing Technologies, Soochow University, Suzhou 215006, China
  - Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
- Qiang Lyu
  - School of Computer Science and Technology, Soochow University, Suzhou 215006, China
  - Province Key Lab for Information Processing Technologies, Soochow University, Suzhou 215006, China
  - Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China

10
Lal A. Deciphering the regulatory syntax of genomic DNA with deep learning. J Biosci 2022. [DOI: 10.1007/s12038-022-00291-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]

11
Cross-species enhancer prediction using machine learning. Genomics 2022; 114:110454. [PMID: 36030022 DOI: 10.1016/j.ygeno.2022.110454] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2022] [Revised: 07/28/2022] [Accepted: 08/16/2022] [Indexed: 11/21/2022]
Abstract
Cis-regulatory elements (CREs) are non-coding parts of the genome that play a critical role in gene expression regulation. Enhancers, as an important example of CREs, interact with genes to influence complex traits like disease, heat tolerance and growth rate. Much of what is known about enhancers comes from studies of humans and a few model organisms like mouse, with little known about other mammalian species. Previous studies have attempted to identify enhancers in less studied mammals using comparative genomics but with limited success. Recently, Machine Learning (ML) techniques have shown promising results in predicting enhancer regions. Here, we investigated the ability of ML methods to identify enhancers in three non-model mammalian species (cattle, pig and dog) using human and mouse enhancer data from VISTA and publicly available ChIP-seq. We tested nine models, using four different representations of the DNA sequences in cross-species prediction using both the VISTA dataset and species-specific ChIP-seq data. We identified between 809,399 and 877,278 enhancer-like regions (ELRs) in the study species (11.6-13.7% of each genome). These predictions were close to the ~8% proportion of ELRs that covered the human genome. We propose that our ML methods have predictive ability for identifying enhancers in non-model mammalian species. We have provided a list of high confidence enhancers at https://github.com/DaviesCentreInformatics/Cross-species-enhancer-prediction and believe these enhancers will be of great use to the community.

12
Alharbi WS, Rashid M. A review of deep learning applications in human genomics using next-generation sequencing data. Hum Genomics 2022; 16:26. [PMID: 35879805 PMCID: PMC9317091 DOI: 10.1186/s40246-022-00396-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 11/24/2021] [Accepted: 07/12/2022] [Indexed: 12/02/2022]
Abstract
Genomics is advancing towards data-driven science. With the advent of high-throughput data-generating technologies in human genomics, we are overwhelmed by the volume of genomic data. Artificial intelligence, especially deep learning, has been instrumental in extracting knowledge and patterns from these data. In the current review, we address the development and application of deep learning methods and models in different subareas of human genomics. We assess which areas of genomics are over- and under-charted by deep learning techniques. The deep learning algorithms underlying the genomic tools are discussed briefly in the later part of this review. Finally, we briefly discuss recent applications of deep learning tools in genomics. This review is timely for biotechnology and genomics scientists, guiding them on why, when and how to use deep learning methods to analyse human genomic data.
Affiliation(s)
- Wardah S Alharbi
  - Department of AI and Bioinformatics, King Abdullah International Medical Research Center (KAIMRC), King Saud Bin Abdulaziz University for Health Sciences (KSAU-HS), King Abdulaziz Medical City, Ministry of National Guard Health Affairs, P.O. Box 22490, Riyadh, 11426, Saudi Arabia
- Mamoon Rashid
  - Department of AI and Bioinformatics, King Abdullah International Medical Research Center (KAIMRC), King Saud Bin Abdulaziz University for Health Sciences (KSAU-HS), King Abdulaziz Medical City, Ministry of National Guard Health Affairs, P.O. Box 22490, Riyadh, 11426, Saudi Arabia

13
Asim MN, Ibrahim MA, Malik MI, Razzak I, Dengel A, Ahmed S. Histone-Net: a multi-paradigm computational framework for histone occupancy and modification prediction. Complex Intell Syst 2022. [DOI: 10.1007/s40747-022-00802-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
Abstract
Deep exploration of histone occupancy and covalent post-translational modifications (e.g., acetylation, methylation) is essential to decode gene expression regulation, chromosome packaging, DNA damage, and transcriptional activation. Existing computational approaches are unable to precisely predict histone occupancy and modifications mainly due to the use of sub-optimal statistical representation of histone sequences. For the establishment of an improved histone occupancy and modification landscape for multiple histone markers, the paper in hand presents an end-to-end computational multi-paradigm framework “Histone-Net”. To learn local and global residue context aware sequence representation, Histone-Net generates unsupervised higher order residue embeddings (DNA2Vec) and presents a different application of language modelling, where it encapsulates histone occupancy and modification information while generating higher order residue embeddings (SuperDNA2Vec) in a supervised manner. We perform an intrinsic and extrinsic evaluation of both presented distributed representation learning schemes. A comprehensive empirical evaluation of Histone-Net over ten benchmark histone markers data sets for three different histone sequence analysis tasks indicates that the approach based on SuperDNA2Vec sequence representation and a softmax classifier outperforms the state-of-the-art approach by an average accuracy of 7%. To eliminate the overhead of training separate binary classifiers for all ten histone markers, Histone-Net is evaluated in multi-label classification paradigm, where it produces decent performance for simultaneous prediction of histone occupancy, acetylation, and methylation.
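Higher order residue embeddings of the kind mentioned in this abstract are typically learned over overlapping k-mers rather than single nucleotides. The sketch below shows only that tokenization step and a token-to-index vocabulary; it is an assumption about the general preprocessing pattern, not Histone-Net's code, and the function names, k, and stride are illustrative choices.

```python
def kmer_tokens(sequence, k=3, stride=1):
    """Split a sequence into overlapping k-mers ("higher order residues")."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def build_vocabulary(token_lists):
    """Map each distinct k-mer to an integer index, in order of first appearance."""
    vocabulary = {}
    for tokens in token_lists:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


tokens = kmer_tokens("ATGCGT", k=3)
vocabulary = build_vocabulary([tokens])
print(tokens)
print(vocabulary)
```

The integer-indexed tokens could then be fed to any word-embedding trainer, treating each sequence as a sentence and each k-mer as a word.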

14
Abstract
The tremendous amount of biological sequence data available, combined with the recent methodological breakthrough in deep learning in domains such as computer vision or natural language processing, is leading today to the transformation of bioinformatics through the emergence of deep genomics, the application of deep learning to genomic sequences. We review here the new applications that the use of deep learning enables in the field, focusing on three aspects: the functional annotation of genomes, the sequence determinants of the genome functions and the possibility to write synthetic genomic sequences.

15
Liu Q, Hua K, Zhang X, Wong WH, Jiang R. DeepCAGE: Incorporating Transcription Factors in Genome-wide Prediction of Chromatin Accessibility. Genomics Proteomics Bioinformatics 2022; 20:496-507. [PMID: 35293310 PMCID: PMC9801045 DOI: 10.1016/j.gpb.2021.08.015] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 01/31/2021] [Revised: 05/31/2021] [Accepted: 09/27/2021] [Indexed: 01/26/2023]
Abstract
Although computational approaches have been complementing high-throughput biological experiments for the identification of functional regions in the human genome, it remains a great challenge to systematically decipher interactions between transcription factors (TFs) and regulatory elements to achieve interpretable annotations of chromatin accessibility across diverse cellular contexts. To solve this problem, we propose DeepCAGE, a deep learning framework that integrates sequence information and binding statuses of TFs, for the accurate prediction of chromatin accessible regions at a genome-wide scale in a variety of cell types. DeepCAGE takes advantage of a densely connected deep convolutional neural network architecture to automatically learn sequence signatures of known chromatin accessible regions and then incorporates such features with expression levels and binding activities of human core TFs to predict novel chromatin accessible regions. In a series of systematic comparisons with existing methods, DeepCAGE exhibits superior performance in not only the classification but also the regression of chromatin accessibility signals. In a detailed analysis of TF activities, DeepCAGE successfully extracts novel binding motifs and measures the contribution of a TF to the regulation with respect to a specific locus in a certain cell type. When applied to whole-genome sequencing data analysis, our method successfully prioritizes putative deleterious variants underlying a human complex trait and thus provides insights into the understanding of disease-associated genetic variants. DeepCAGE can be downloaded from https://github.com/kimmo1019/DeepCAGE.
Collapse
Affiliation(s)
- Qiao Liu
- Ministry of Education Key Laboratory of Bioinformatics; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China,Department of Statistics, Stanford University, Stanford, CA 94305, USA
| | - Kui Hua
- Ministry of Education Key Laboratory of Bioinformatics; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Xuegong Zhang
- Ministry of Education Key Laboratory of Bioinformatics; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Wing Hung Wong
- Department of Statistics, Stanford University, Stanford, CA 94305, USA,Corresponding authors.
| | - Rui Jiang
- Ministry of Education Key Laboratory of Bioinformatics; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China,Corresponding authors.
| |
|
16
|
Ray A. Machine learning in postgenomic biology and personalized medicine. WILEY INTERDISCIPLINARY REVIEWS. DATA MINING AND KNOWLEDGE DISCOVERY 2022; 12:e1451. [PMID: 35966173 PMCID: PMC9371441 DOI: 10.1002/widm.1451] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2020] [Accepted: 12/22/2021] [Indexed: 06/15/2023]
Abstract
In recent years, artificial intelligence in the form of machine learning has been revolutionizing biology, biomedical sciences, and gene-based agricultural technology capabilities. Massive data generated in biological sciences by rapid and deep gene sequencing and protein or other molecular structure determination, on the one hand, requires data analysis capabilities using machine learning that are distinctly different from classical statistical methods; on the other, these large datasets are enabling the adoption of novel data-intensive machine learning algorithms for the solution of biological problems that until recently had relied on mechanistic model-based approaches that are computationally expensive. This review provides a bird's eye view of the applications of machine learning in post-genomic biology. An attempt is also made to indicate, as far as possible, the areas of research that are poised to make further impacts in these areas, including the importance of explainable artificial intelligence (XAI) in human health. Further contributions of machine learning are expected to transform medicine, public health, and agricultural technology, as well as to provide invaluable gene-based guidance for the management of complex environments in this age of global warming.
Affiliation(s)
- Animesh Ray
- Riggs School of Applied Life Sciences, Keck Graduate Institute, 535 Watson Drive, Claremont, CA 91711, USA
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, USA
| |
|
17
|
Benning L, Peintner A, Peintner L. Advances in and the Applicability of Machine Learning-Based Screening and Early Detection Approaches for Cancer: A Primer. Cancers (Basel) 2022; 14:cancers14030623. [PMID: 35158890 PMCID: PMC8833439 DOI: 10.3390/cancers14030623] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 12/23/2021] [Revised: 01/22/2022] [Accepted: 01/25/2022] [Indexed: 02/07/2023]
Abstract
Simple Summary: Non-communicable diseases in general, and cancer in particular, contribute greatly to the global burden of disease. Although significant advances have been made to address this burden, cancer is still among the top drivers of mortality, second only to cardiovascular diseases. Consensus has been established that a key factor to reduce the burden of disease from cancer is to improve screening for and the early detection of such conditions. To date, however, most approaches in this field relied on established screening methods, such as a clinical examination, radiographic imaging, tissue staining or biochemical markers. Yet, with the advances of information technology, new data-driven screening and diagnostic tools have been developed. This article provides a brief overview of the theoretical foundations of these data-driven approaches, highlights the promising use cases and underscores the challenges and limitations that come with the introduction of these approaches to the clinical field.
Abstract: Despite the efforts of the past decades, cancer is still among the key drivers of global mortality. To increase the detection rates, screening programs and other efforts to improve early detection were initiated to cover the populations at a particular risk for developing a specific malignant condition. These diagnostic approaches have, so far, mostly relied on conventional diagnostic methods and have made little use of the vast amounts of clinical and diagnostic data that are routinely being collected along the diagnostic pathway. Practitioners have lacked the tools to handle this ever-increasing flood of data. Only recently, the clinical field has opened up more for the opportunities that come with the systematic utilisation of high-dimensional computational data analysis. We aim to introduce the reader to the theoretical background of machine learning (ML) and elaborate on the established and potential use cases of ML algorithms in screening and early detection. Furthermore, we assess and comment on the relevant challenges and misconceptions of the applicability of ML-based diagnostic approaches. Lastly, we emphasise the need for a clear regulatory framework to responsibly introduce ML-based diagnostics in clinical practice and routine care.
Affiliation(s)
- Leo Benning
- Health Care Supply Research and Data Mining Working Group, Emergency Department, University Medical Center Freiburg, 79106 Freiburg, Germany
| | - Andreas Peintner
- Databases and Information Systems, Department of Computer Science, Leopold-Franzens University of Innsbruck, 6020 Innsbruck, Austria
| | - Lukas Peintner
- Institute of Molecular Medicine and Cell Research, Albert Ludwigs University of Freiburg, 79085 Freiburg, Germany
- Correspondence: ; Tel.: +49-761-203-9618
| |
|
18
|
Zhang S, Ma A, Zhao J, Xu D, Ma Q, Wang Y. Assessing deep learning methods in cis-regulatory motif finding based on genomic sequencing data. Brief Bioinform 2022; 23:bbab374. [PMID: 34607350 PMCID: PMC8769700 DOI: 10.1093/bib/bbab374] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/19/2021] [Revised: 08/22/2021] [Accepted: 08/23/2021] [Indexed: 12/28/2022]
Abstract
Identifying cis-regulatory motifs from genomic sequencing data (e.g. ChIP-seq and CLIP-seq) is crucial in identifying transcription factor (TF) binding sites and inferring gene regulatory mechanisms for any organism. Since 2015, deep learning (DL) methods have been widely applied to identify TF binding sites and predict motif patterns, with the strengths of offering a scalable, flexible and unified computational approach for highly accurate predictions. As far as we know, 20 DL methods have been developed. However, without a clear and systematic assessment, users will struggle to choose the most appropriate tool for their specific studies. In this manuscript, we evaluated 20 DL methods for cis-regulatory motif prediction using 690 ENCODE ChIP-seq, 126 cancer ChIP-seq and 55 RNA CLIP-seq data. Four metrics were investigated, including the accuracy of motif finding, the performance of DNA/RNA sequence classification, algorithm scalability and tool usability. The assessment results demonstrated the high complementarity of the existing DL methods. It was determined that the most suitable model should primarily depend on the data size and type and the method's outputs.
Affiliation(s)
- Shuangquan Zhang
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
| | - Anjun Ma
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
| | - Jing Zhao
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
| | - Dong Xu
- Department of Electrical Engineering and Computer Science, and Christopher S. Bond Life Science Center, University of Missouri, MO, 65211, USA
| | - Qin Ma
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
| | - Yan Wang
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
- School of Artificial Intelligence, Jilin University, Changchun, 130012, China
| |
|
19
|
Morrow A, Hughes J, Singh J, Joseph A, Yosef N. Epitome: predicting epigenetic events in novel cell types with multi-cell deep ensemble learning. Nucleic Acids Res 2021; 49:e110. [PMID: 34379786 PMCID: PMC8565335 DOI: 10.1093/nar/gkab676] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/11/2021] [Revised: 07/19/2021] [Accepted: 07/25/2021] [Indexed: 01/04/2023]
Abstract
The accumulation of large epigenomics data consortiums provides us with the opportunity to extrapolate existing knowledge to new cell types and conditions. We propose Epitome, a deep neural network that learns similarities of chromatin accessibility between well characterized reference cell types and a query cellular context, and copies over signal of transcription factor binding and modification of histones from reference cell types when chromatin profiles are similar to the query. Epitome achieves state-of-the-art accuracy when predicting transcription factor binding sites on novel cellular contexts and can further improve predictions as more epigenetic signals are collected from both reference cell types and the query cellular context of interest.
Affiliation(s)
- Alyssa Kramer Morrow
- Electrical Engineering and Computer Science Department, University of California-Berkeley 465 Soda Hall, Berkeley, CA 94720-1776, USA
| | - John Weston Hughes
- Electrical Engineering and Computer Science Department, University of California-Berkeley 465 Soda Hall, Berkeley, CA 94720-1776, USA
- Computer Science Department, Stanford University, 353 Serra Mall, Stanford, CA 94305, USA
| | - Jahnavi Singh
- Electrical Engineering and Computer Science Department, University of California-Berkeley 465 Soda Hall, Berkeley, CA 94720-1776, USA
| | - Anthony Douglas Joseph
- Electrical Engineering and Computer Science Department, University of California-Berkeley 465 Soda Hall, Berkeley, CA 94720-1776, USA
- Center for Computational Biology, University of California-Berkeley 108 Stanley Hall, Berkeley, CA 94720-3220, USA
- Unite Genomics, Inc., 1301 Marina Village Pkwy, Suite 320, Alameda, CA 94501, USA
| | - Nir Yosef
- Electrical Engineering and Computer Science Department, University of California-Berkeley 465 Soda Hall, Berkeley, CA 94720-1776, USA
- Center for Computational Biology, University of California-Berkeley 108 Stanley Hall, Berkeley, CA 94720-3220, USA
- Ragon Institute of Massachusetts General Hospital, Massachusetts Institute of Technology, and Harvard University, Boston, MA, 02139, USA
- Chan Zuckerberg Biohub, San Francisco, CA, 94158, USA
| |
|
20
|
Cao F, Zhang Y, Cai Y, Animesh S, Zhang Y, Akincilar SC, Loh YP, Li X, Chng WJ, Tergaonkar V, Kwoh CK, Fullwood MJ. Chromatin interaction neural network (ChINN): a machine learning-based method for predicting chromatin interactions from DNA sequences. Genome Biol 2021; 22:226. [PMID: 34399797 PMCID: PMC8365954 DOI: 10.1186/s13059-021-02453-5] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 02/05/2021] [Accepted: 08/04/2021] [Indexed: 11/10/2022]
Abstract
Chromatin interactions play important roles in regulating gene expression. However, the availability of genome-wide chromatin interaction data is limited. We develop a computational method, chromatin interaction neural network (ChINN), to predict chromatin interactions between open chromatin regions using only DNA sequences. ChINN predicts CTCF- and RNA polymerase II-associated and Hi-C chromatin interactions. ChINN shows good across-sample performances and captures various sequence features for chromatin interaction prediction. We apply ChINN to 6 chronic lymphocytic leukemia (CLL) patient samples and a published cohort of 84 CLL open chromatin samples. Our results demonstrate extensive heterogeneity in chromatin interactions among CLL patient samples.
Affiliation(s)
- Fan Cao
- Cancer Science Institute of Singapore, National University of Singapore, 14 Medical Dr, Singapore, 117599 Singapore
| | - Yu Zhang
- School of Computer Science and Engineering, Nanyang Technological University, Block N4, 50 Nanyang Avenue, Singapore, 639798 Singapore
| | - Yichao Cai
- Cancer Science Institute of Singapore, National University of Singapore, 14 Medical Dr, Singapore, 117599 Singapore
| | - Sambhavi Animesh
- Cancer Science Institute of Singapore, National University of Singapore, 14 Medical Dr, Singapore, 117599 Singapore
| | - Ying Zhang
- Cancer Science Institute of Singapore, National University of Singapore, 14 Medical Dr, Singapore, 117599 Singapore
| | - Semih Can Akincilar
- Institute of Molecular and Cell Biology (IMCB), A*STAR (Agency for Science, Technology and Research), Singapore, 138673 Singapore
| | - Yan Ping Loh
- Cancer Science Institute of Singapore, National University of Singapore, 14 Medical Dr, Singapore, 117599 Singapore
| | - Xinya Li
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551 Singapore
| | - Wee Joo Chng
- Cancer Science Institute of Singapore, National University of Singapore, 14 Medical Dr, Singapore, 117599 Singapore
- Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore, 1E Kent Ridge Road, Singapore, 119228 Singapore
- Department of Haematology-Oncology, National University Cancer Institute, National University Health System, NUH Zone B, Medical Centre, Singapore, 119074 Singapore
| | - Vinay Tergaonkar
- Institute of Molecular and Cell Biology (IMCB), A*STAR (Agency for Science, Technology and Research), Singapore, 138673 Singapore
- Department of Pathology, Yong Loo Lin School of Medicine, National University of Singapore (NUS), Singapore, 117597 Singapore
| | - Chee Keong Kwoh
- School of Computer Science and Engineering, Nanyang Technological University, Block N4, 50 Nanyang Avenue, Singapore, 639798 Singapore
| | - Melissa J. Fullwood
- Cancer Science Institute of Singapore, National University of Singapore, 14 Medical Dr, Singapore, 117599 Singapore
- Institute of Molecular and Cell Biology (IMCB), A*STAR (Agency for Science, Technology and Research), Singapore, 138673 Singapore
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551 Singapore
| |
|
21
|
Hallal M, Awad M, Khoueiry P. TempoMAGE: a deep learning framework that exploits the causal dependency between time-series data to predict histone marks in open chromatin regions at time-points with missing ChIP-seq datasets. Bioinformatics 2021; 37:4336-4342. [PMID: 34255822 DOI: 10.1093/bioinformatics/btab513] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2021] [Revised: 07/05/2021] [Accepted: 07/09/2021] [Indexed: 11/13/2022]
Abstract
MOTIVATION: Identifying histone tail modifications using ChIP-seq is commonly used in time-series experiments in development and disease. These assays, however, cover specific time-points, leaving intermediate or early stages with missing information. Although several machine learning methods were developed to predict histone marks, none exploited the dependence that exists in time-series experiments between data generated at specific time-points to extrapolate these findings to time-points where data cannot be generated for lack or scarcity of materials (e.g., early developmental stages).
RESULTS: Here, we train a deep learning model named TempoMAGE to predict the presence or absence of H3K27ac in open chromatin regions by integrating information from sequence, gene expression, chromatin accessibility and the estimated change in H3K27ac state from a reference time-point. We show that adding reference time-point information systematically improves the model's overall performance. Additionally, sequence signatures extracted from our method were exclusive to the training dataset, indicating that our model learned data-specific features. As an application, TempoMAGE was able to predict the activity of enhancers from a pre-validated in vivo dataset, highlighting its ability to be used for functional annotation of putative enhancers.
AVAILABILITY: TempoMAGE is freely available through GitHub at https://github.com/pkhoueiry/TempoMAGE.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mohammad Hallal
- Department of Biochemistry and Molecular Genetics, Faculty of Medicine, American University of Beirut, Lebanon; Biomedical Engineering Program, American University of Beirut, Lebanon
| | - Mariette Awad
- Department of Electrical and Computer Engineering, American University of Beirut, Lebanon
| | - Pierre Khoueiry
- Department of Biochemistry and Molecular Genetics, Faculty of Medicine, American University of Beirut, Lebanon; Pillar Genomics Institute, Faculty of Medicine, American University of Beirut, Lebanon
| |
|
22
|
Li JY, Jin S, Tu XM, Ding Y, Gao G. Identifying complex motifs in massive omics data with a variable-convolutional layer in deep neural network. Brief Bioinform 2021; 22:6312656. [PMID: 34219140 DOI: 10.1093/bib/bbab233] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/25/2021] [Revised: 05/25/2021] [Accepted: 05/28/2021] [Indexed: 01/10/2023]
Abstract
Motif identification is among the most common and essential computational tasks in bioinformatics and genomics. Here we propose a novel convolutional layer for deep neural networks, named the variable convolutional (vConv) layer, for effective motif identification in high-throughput omics data by adaptively learning the kernel length from the data. Empirical evaluations on DNA-protein binding and DNase footprinting cases demonstrate that vConv-based networks outperform their convolutional counterparts regardless of model complexity. Moreover, vConv can be readily integrated into multi-layer neural networks as an 'in-place replacement' of the canonical convolutional layer. All source code is freely available on GitHub for academic usage.
Affiliation(s)
- Jing-Yi Li
- Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
| | - Shen Jin
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Xin-Ming Tu
- Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
| | - Yang Ding
- Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
| | - Ge Gao
- Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
| |
|
23
|
Talukder A, Barham C, Li X, Hu H. Interpretation of deep learning in genomics and epigenomics. Brief Bioinform 2021; 22:bbaa177. [PMID: 34020542 PMCID: PMC8138893 DOI: 10.1093/bib/bbaa177] [Citation(s) in RCA: 42] [Impact Index Per Article: 14.0] [Received: 05/25/2020] [Revised: 06/26/2020] [Accepted: 07/10/2020] [Indexed: 12/17/2022]
Abstract
Machine learning methods have been widely applied to big data analysis in genomics and epigenomics research. Although accuracy and efficiency are common goals in many modeling tasks, model interpretability is especially important in these studies for understanding the underlying molecular and cellular mechanisms. Deep neural networks (DNNs) have recently gained popularity in various types of genomic and epigenomic studies due to their capabilities in utilizing large-scale high-throughput bioinformatics data and achieving high accuracy in predictions and classifications. However, the ability of DNNs to explain their predictions is often questioned because of their black-box nature. In this review, we present current developments in the model interpretation of DNNs, focusing on their applications in genomics and epigenomics. We first describe state-of-the-art DNN interpretation methods in representative machine learning fields. We then summarize the DNN interpretation methods in recent studies on genomics and epigenomics, focusing on current data- and computing-intensive topics such as sequence motif identification, genetic variations, gene expression, chromatin interactions and non-coding RNAs. We also present the biological discoveries that resulted from these interpretation methods. We finally discuss the advantages and limitations of current interpretation approaches in the context of genomic and epigenomic studies. Contact: xiaoman@mail.ucf.edu, haihu@cs.ucf.edu.
Affiliation(s)
- Amlan Talukder
- Computer Science, University of Central Florida, Orlando, FL 32816, USA
| | - Clayton Barham
- Computer Science, University of Central Florida, Orlando, FL 32816, USA
| | - Xiaoman Li
- Burnett School of Biomedical Science, University of Central Florida, Orlando, FL 32816, USA
| | - Haiyan Hu
- Computer Science, University of Central Florida, Orlando, FL 32816, USA
| |
|
24
|
Salimi D, Moeini A. Incorporating K-mers Highly Correlated to Epigenetic Modifications for Bayesian Inference of Gene Interactions. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200728193621] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/22/2022]
Abstract
Objective:
A gene interaction network, along with its related biological features, has an important role in computational biology. A Bayesian network, as an efficient model based on probabilistic concepts, is able to exploit known and novel biological causal relationships between genes. The success of Bayesian networks in predicting these relationships greatly depends on the selection of priors.
Methods:
K-mers have been applied as prominent features to uncover the similarity between genes in a specific pathway, suggesting that this feature can be applied to study gene dependencies. In this study, we propose k-mers (4-, 5-, and 6-mers) highly correlated with epigenetic modifications, including 17 modifications, as a new prior for Bayesian inference in the gene interaction network.
Result:
Employing this model on a network of 23 human genes and on a network based on 27 genes related to yeast resulted in F-measure improvements in different biological networks.
Conclusion:
The improvements in the best case are 12%, 36%, and 10% in the pathway, coexpression, and physical interaction networks, respectively.
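As a rough illustration of the k-mer features mentioned in the abstract above, the following minimal Python sketch counts 4-, 5-, and 6-mers in a DNA sequence and normalizes them into per-length frequencies. The function names `kmer_counts` and `kmer_profile` are hypothetical and are not taken from the cited work; the step of correlating k-mers with epigenetic modifications and the construction of the Bayesian prior are not shown.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers in seq, skipping windows with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Drop k-mers containing ambiguous bases such as N.
    return {kmer: c for kmer, c in counts.items() if set(kmer) <= set("ACGT")}


def kmer_profile(seq, ks=(4, 5, 6)):
    """Return k-mer frequencies, normalized separately for each k (illustrative helper)."""
    profile = {}
    for k in ks:
        counts = kmer_counts(seq, k)
        total = sum(counts.values())
        for kmer, c in counts.items():
            profile[kmer] = c / total
    return profile


if __name__ == "__main__":
    print(kmer_counts("ACGTACGT", 4))
```

Profiles of this kind could then be compared between genes, for example with a correlation measure; which measure the authors used is not stated in the abstract.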
Affiliation(s)
- Dariush Salimi
- Department of Animal Science, Faculty of Agriculture, University of Zanjan, Zanjan, Iran
| | - Ali Moeini
- Department of Algorithms and Computation, Faculty of Engineering Science, College of Engineering, University of Tehran, Tehran, Iran
| |
|
25
|
Arora I, Tollefsbol TO. Computational methods and next-generation sequencing approaches to analyze epigenetics data: Profiling of methods and applications. Methods 2021; 187:92-103. [PMID: 32941995 PMCID: PMC7914156 DOI: 10.1016/j.ymeth.2020.09.008] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 07/22/2020] [Revised: 09/08/2020] [Accepted: 09/10/2020] [Indexed: 12/20/2022]
Abstract
Epigenetics is mainly comprised of features that regulate genomic interactions thereby playing a crucial role in a vast array of biological processes. Epigenetic mechanisms such as DNA methylation and histone modifications influence gene expression by modulating the packaging of DNA in the nucleus. A plethora of studies have emphasized the importance of analyzing epigenetics data through genome-wide studies and high-throughput approaches, thereby providing key insights towards epigenetics-based diseases such as cancer. Recent advancements have been made towards translating epigenetics research into a high throughput approach such as genome-scale profiling. Amongst all, bioinformatics plays a pivotal role in achieving epigenetics-related computational studies. Despite significant advancements towards epigenomic profiling, it is challenging to understand how various epigenetic modifications such as chromatin modifications and DNA methylation regulate gene expression. Next-generation sequencing (NGS) provides accurate and parallel sequencing thereby allowing researchers to comprehend epigenomic profiling. In this review, we summarize different computational methods such as machine learning and other bioinformatics tools, publicly available databases and resources to identify key modifications associated with epigenetic machinery. Additionally, the review also focuses on understanding recent methodologies related to epigenome profiling using NGS methods ranging from library preparation, different sequencing platforms and analytical techniques to evaluate various epigenetic modifications such as DNA methylation and histone modifications. We also provide detailed information on bioinformatics tools and computational strategies responsible for analyzing large scale data in epigenetics.
Affiliation(s)
- Itika Arora
- Department of Biology, University of Alabama at Birmingham, 1300 University Boulevard, Birmingham, AL 35294, USA.
| | - Trygve O Tollefsbol
- Department of Biology, University of Alabama at Birmingham, 1300 University Boulevard, Birmingham, AL 35294, USA; Comprehensive Center for Healthy Aging, University of Alabama Birmingham, 1530 3rd Avenue South, Birmingham, AL 35294, USA; Comprehensive Cancer Center, University of Alabama Birmingham, 1802 6th Avenue South, Birmingham, AL 35294, USA; Nutrition Obesity Research Center, University of Alabama Birmingham, 1675 University Boulevard, Birmingham, AL 35294, USA; Comprehensive Diabetes Center, University of Alabama Birmingham, 1825 University Boulevard, Birmingham, AL 35294, USA.
| |
|
26
|
Ohnuki H, Venzon DJ, Lobanov A, Tosato G. Iterative epigenomic analyses in the same single cell. Genome Res 2021; 31:1819-1830. [PMID: 33627472 DOI: 10.1101/gr.269068.120] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/20/2020] [Accepted: 01/14/2021] [Indexed: 11/24/2022]
Abstract
Gene expression in individual cells is epigenetically regulated by DNA modifications, histone modifications, transcription factors, and other DNA-binding proteins. It has been shown that multiple histone modifications can predict gene expression and reflect future responses of bulk cells to extracellular cues. However, the predictive ability of epigenomic analysis is still limited for mechanistic research at a single cell level. To overcome this limitation, it would be useful to acquire reliable signals from multiple epigenetic marks in the same single cell. Here, we propose a new approach and a new method for analysis of several components of the epigenome in the same single cell. The new method allows reanalysis of the same single cell. We found that reanalysis of the same single cell is feasible, provides confirmation of the epigenetic signals, and allows application of statistical analysis to identify reproduced reads using data sets generated only from the single cell. Reanalysis of the same single cell is also useful to acquire multiple epigenetic marks from the same single cells. The method can acquire at least five epigenetic marks: H3K27ac, H3K27me3, mediator complex subunit 1, a DNA modification, and a DNA-interacting protein. We can predict active signaling pathways in K562 single cells using the epigenetic data and confirm that the predicted results strongly correlate with actual active signaling pathways identified by RNA-seq results. These results suggest that the new method provides mechanistic insights for cellular phenotypes through multilayered epigenome analysis in the same single cells.
Affiliation(s)
- Hidetaka Ohnuki
- Laboratory of Cellular Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - David J Venzon
- Biostatistics and Data Management Section, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Rockville, Maryland 20850, USA
| | - Alexei Lobanov
- CCR Collaborative Bioinformatics Resource (CCBR), Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA; Advanced Biomedical Computational Science, Frederick National Laboratory for Cancer Research sponsored by the National Cancer Institute, Frederick, Maryland 21702, USA
| | - Giovanna Tosato
- Laboratory of Cellular Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
| |
|
27
|
Schmidt B, Hildebrandt A. Deep learning in next-generation sequencing. Drug Discov Today 2021; 26:173-180. [PMID: 33059075 PMCID: PMC7550123 DOI: 10.1016/j.drudis.2020.10.002] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 03/20/2020] [Revised: 09/16/2020] [Accepted: 10/07/2020] [Indexed: 12/22/2022]
Abstract
Next-generation sequencing (NGS) methods lie at the heart of large parts of biological and medical research. Their fundamental importance has created a continuously increasing demand for processing and analysis methods of the data sets produced, addressing questions such as variant calling, metagenomic classification and quantification, genomic feature detection, or downstream analysis in larger biological or medical contexts. In addition to classical algorithmic approaches, machine-learning (ML) techniques are often used for such tasks. In particular, deep learning (DL) methods that use multilayered artificial neural networks (ANNs) for supervised, semisupervised, and unsupervised learning have gained significant traction for such applications. Here, we highlight important network architectures, application areas, and DL frameworks in an NGS context.
Affiliation(s)
- Bertil Schmidt
- Institut für Informatik, Johannes Gutenberg University Mainz, Germany.
| | | |
|
28
|
Baisya DR, Lonardi S. Prediction of histone post-translational modifications using deep learning. Bioinformatics 2020; 36:5610-5617. [PMID: 33367499 DOI: 10.1093/bioinformatics/btaa1075] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/23/2020] [Revised: 11/27/2020] [Accepted: 12/16/2020] [Indexed: 01/02/2023]
Abstract
Motivation
Histone post-translational modifications (PTMs) are involved in a variety of essential regulatory processes in the cell, including transcription control. Recent studies have shown that histone PTMs can be accurately predicted from the knowledge of transcription factor binding or DNase hypersensitivity data. Similarly, it has been shown that one can predict PTMs from the underlying DNA primary sequence.
Results
In this study, we introduce a deep learning architecture called DeepPTM for predicting histone PTMs from transcription factor binding data and the primary DNA sequence. Extensive experimental results show that our deep learning model achieves higher prediction accuracy than the model proposed in Benveniste et al. (PNAS 2014) and DeepHistone (BMC Genomics 2019). The competitive advantage of our framework lies in the synergistic use of deep learning combined with an effective pre-processing step. Our classification framework has also enabled the discovery that the knowledge of a small subset of transcription factors (which are histone-PTM and cell-type-specific) can provide almost the same prediction accuracy as can be obtained using all the transcription factor data.
Availability and implementation
https://github.com/dDipankar/DeepPTM.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dipankar Ranjan Baisya
- Department of Computer Science and Engineering, University of California, Riverside, CA, 92521, USA
| | - Stefano Lonardi
- Department of Computer Science and Engineering, University of California, Riverside, CA, 92521, USA
| |
|
29
|
Beknazarov N, Jin S, Poptsova M. Deep learning approach for predicting functional Z-DNA regions using omics data. Sci Rep 2020; 10:19134. [PMID: 33154517 PMCID: PMC7644757 DOI: 10.1038/s41598-020-76203-1] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 05/22/2020] [Accepted: 10/20/2020] [Indexed: 12/18/2022] Open
Abstract
Computational methods to predict Z-DNA regions are in high demand to understand the functional role of Z-DNA. The previous state-of-the-art method, Z-Hunt, is based on statistical mechanical and energy considerations about the B- to Z-DNA transition using sequence information. Z-DNA ChIP-seq experiment results showed little overlap with Z-Hunt predictions, implying that sequence information alone is not sufficient to explain the emergence of Z-DNA at different genomic locations. Adding epigenetic and other functional genomic mark-ups to the DNA sequence level can help reveal the functional Z-DNA sites. Here we take advantage of the deep learning approach, which can analyze and extract information from large volumes of molecular biology data. We developed a machine learning approach, DeepZ, that aggregates information from genome-wide maps of epigenetic markers, transcription factor and RNA polymerase binding sites, and chromosome accessibility maps. With the developed model we not only verify the experimental Z-DNA predictions, but also generate a whole-genome annotation, introducing new possible Z-DNA regions that have not yet been found in experiments and can be of interest to researchers from various fields.
Affiliation(s)
- Nazar Beknazarov
- Laboratory of Bioinformatics, Faculty of Computer Science, National Research University Higher School of Economics, 11 Pokrovsky boulvar, Moscow, Russia, 101000
| | - Seungmin Jin
- Laboratory of Bioinformatics, Faculty of Computer Science, National Research University Higher School of Economics, 11 Pokrovsky boulvar, Moscow, Russia, 101000
| | - Maria Poptsova
- Laboratory of Bioinformatics, Faculty of Computer Science, National Research University Higher School of Economics, 11 Pokrovsky boulvar, Moscow, Russia, 101000.
| |
|
30
|
Abstract
Precision medicine is an emerging approach to clinical research and patient care that focuses on understanding and treating disease by integrating multi-modal or multi-omics data from an individual to make patient-tailored decisions. With the large and complex datasets generated using precision medicine diagnostic approaches, novel techniques to process and understand these complex data are needed. At the same time, computer science has progressed rapidly to develop techniques that enable the storage, processing, and analysis of these complex datasets, a feat that traditional statistics and early computing technologies could not accomplish. Machine learning, a branch of artificial intelligence, is a computer science methodology that aims to identify complex patterns in data that can be used to make predictions or classifications on new unseen data or for advanced exploratory data analysis. Machine learning analysis of precision medicine's multi-modal data allows for broad analysis of large datasets and ultimately a greater understanding of human health and disease. This review focuses on machine learning utilization for precision medicine's "big data", in the context of genetics, genomics, and beyond.
Affiliation(s)
- Sarah J MacEachern
- Department of Pediatrics, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada; Alberta Children's Hospital Research Institute, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
| | - Nils D Forkert
- Alberta Children's Hospital Research Institute, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada; Department of Radiology, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
| |
|
31
|
Hu S, Huo D, Yu Z, Chen Y, Liu J, Liu L, Wu X, Zhang Y. ncHMR detector: a computational framework to systematically reveal non-classical functions of histone modification regulators. Genome Biol 2020; 21:48. [PMID: 32093739 PMCID: PMC7038559 DOI: 10.1186/s13059-020-01953-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/03/2019] [Accepted: 02/06/2020] [Indexed: 01/02/2023] Open
Abstract
Recently, several non-classical functions of histone modification regulators (HMRs), independent of their known histone modification substrates and products, have been reported to be essential for specific cellular processes. However, there is no framework designed for identifying such functions systematically. Here, we develop ncHMR detector, the first computational framework to predict non-classical functions and cofactors of a given HMR, based on ChIP-seq data mining. We apply ncHMR detector in ChIP-seq data-rich cell types and predict non-classical functions of HMRs. Finally, we experimentally reveal that the predicted non-classical function of CBX7 is biologically significant for the maintenance of pluripotency.
Affiliation(s)
- Shengen Hu
- Institute for Regenerative Medicine, Shanghai East Hospital, Shanghai Key Laboratory of Signaling and Disease Research, Frontier Science Center for Stem Cell Research, School of Life Sciences and Technology, Tongji University, Shanghai, 200092 China
| | - Dawei Huo
- Department of Cell Biology, Tianjin Medical University, 2011 Collaborative Innovation Center of Tianjin for Medical Epigenetics, Tianjin Key Laboratory of Medical Epigenetics, Qixiangtai Road 22, Tianjin, China
- Department of Neurosurgery, Tianjin Medical University General Hospital, Tianjin, China
| | - Zhaowei Yu
- Institute for Regenerative Medicine, Shanghai East Hospital, Shanghai Key Laboratory of Signaling and Disease Research, Frontier Science Center for Stem Cell Research, School of Life Sciences and Technology, Tongji University, Shanghai, 200092 China
| | - Yujie Chen
- Institute for Regenerative Medicine, Shanghai East Hospital, Shanghai Key Laboratory of Signaling and Disease Research, Frontier Science Center for Stem Cell Research, School of Life Sciences and Technology, Tongji University, Shanghai, 200092 China
| | - Jing Liu
- Institute for Regenerative Medicine, Shanghai East Hospital, Shanghai Key Laboratory of Signaling and Disease Research, Frontier Science Center for Stem Cell Research, School of Life Sciences and Technology, Tongji University, Shanghai, 200092 China
- Present address: Key Laboratory of Forensic Genetics, National Engineering Laboratory for Forensic Science, Institute of Forensic Science, Beijing, China
| | - Lin Liu
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA USA
| | - Xudong Wu
- Department of Cell Biology, Tianjin Medical University, 2011 Collaborative Innovation Center of Tianjin for Medical Epigenetics, Tianjin Key Laboratory of Medical Epigenetics, Qixiangtai Road 22, Tianjin, China
- Department of Neurosurgery, Tianjin Medical University General Hospital, Tianjin, China
- State Key Laboratory of Experimental Hematology, Institute of Hematology and Blood Diseases Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Tianjin, 300020 China
| | - Yong Zhang
- Institute for Regenerative Medicine, Shanghai East Hospital, Shanghai Key Laboratory of Signaling and Disease Research, Frontier Science Center for Stem Cell Research, School of Life Sciences and Technology, Tongji University, Shanghai, 200092 China
| |
|
32
|
FADD in Cancer: Mechanisms of Altered Expression and Function, and Clinical Implications. Cancers (Basel) 2019; 11:cancers11101462. [PMID: 31569512 PMCID: PMC6826683 DOI: 10.3390/cancers11101462] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 08/12/2019] [Revised: 09/25/2019] [Accepted: 09/27/2019] [Indexed: 12/15/2022] Open
Abstract
FADD was initially described as an adaptor molecule for death receptor-mediated apoptosis, but it has subsequently been implicated in nonapoptotic cellular processes such as proliferation and cell cycle control. During the last decade, FADD has been shown to play a pivotal role in most of the signalosome complexes, such as the necroptosome and the inflammasome. Interestingly, various mechanisms involved in regulating FADD functions have been identified, essentially posttranslational modifications and secretion. All these aspects have been thoroughly addressed in previous reviews. However, the involvement of FADD in cancer is complex, due to pleiotropic effects. It has been reported as either anti- or protumorigenic, depending on the cell type. Regulation of FADD expression in cancer is a complex issue since both overexpression and downregulation have been reported, but the mechanisms underlying such alterations have not been fully unveiled. Posttranslational modifications also constitute a relevant mechanism controlling FADD levels and functions in tumor cells. In this review, we aim to provide detailed, updated information on alterations leading to changes in FADD expression and function in cancer. The participation of FADD in various biological processes is recapitulated, with a mention of interesting novel functions recently proposed for FADD, such as regulation of gene expression and control of metabolic pathways. Finally, we gather all the available evidence regarding the clinical implications of FADD alterations in cancer, especially as it has been proposed as a potential biomarker with prognostic value.
|