1
Morelli V, Heizelman RJ. Monitoring Social Determinants of Health Assessing Patients and Communities. Prim Care 2023; 50:527-547. [PMID: 37866829 DOI: 10.1016/j.pop.2023.04.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2023]
Abstract
Because of the devastating health effects of social determinants of health (SDoH), it is important for the primary care provider to assess and monitor these types of stressors. This can be done via surveys, geomapping, or various biomarkers. To date, however, each of these methods is fraught with obstacles. There are currently no validated "best" SDoH screening tools for use in clinical practice. Nor is geomapping a perfect solution. Although mapping can collect location-specific factors, it does not account for the fact that patients may live in one area, work in another and travel frequently to a third.
Affiliation(s)
- Vincent Morelli
- Department of Family and Community Medicine, Meharry Medical College, 3rd Floor, Old Hospital Building, 1005 Dr. D. B. Todd, Jr., Boulevard, Nashville, TN 37208-3599, USA.
- Robert Joseph Heizelman
- Department of Family Medicine, Medical Informatics, University of Michigan, 3rd Floor, Old Hospital Building, 1005 Dr. D. B. Todd, Jr., Boulevard, Nashville, TN 37208-3599, USA
2
Maciejewski E, Horvath S, Ernst J. Cross-species and tissue imputation of species-level DNA methylation samples across mammalian species. bioRxiv 2023:2023.11.26.568769. [PMID: 38076978 PMCID: PMC10705269 DOI: 10.1101/2023.11.26.568769] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2023]
Abstract
DNA methylation data offers valuable insights into various aspects of mammalian biology. The recent introduction and large-scale application of the mammalian methylation array has significantly expanded the availability of such data across conserved sites in many mammalian species. In our study, we consider 13,245 samples profiled on this array encompassing 348 species and 59 tissues from 746 species-tissue combinations. While having some coverage of many different species and tissue types, this data captures only 3.6% of potential species-tissue combinations. To address this gap, we developed CMImpute (Cross-species Methylation Imputation), a method based on a Conditional Variational Autoencoder, to impute DNA methylation for non-profiled species-tissue combinations. In cross-validation, we demonstrate that CMImpute achieves a strong correlation with actual observed values, surpassing several baseline methods. Using CMImpute we imputed methylation data for 19,786 new species-tissue combinations. We believe that both CMImpute and our imputed data resource will be useful for DNA methylation analyses across a wide range of mammalian species.
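The coverage figures quoted in this abstract can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the abstract (348 species, 59 tissues, 746 profiled combinations); the variable names are illustrative, and it assumes that "potential combinations" means every species paired with every tissue.

```python
# Figures quoted in the abstract above.
n_species = 348
n_tissues = 59
n_profiled = 746  # species-tissue combinations with observed data

# Assumption: the full grid pairs every species with every tissue.
n_possible = n_species * n_tissues
coverage = n_profiled / n_possible      # fraction of the grid that was profiled
n_missing = n_possible - n_profiled     # combinations left without data

print(f"possible combinations: {n_possible}")
print(f"profiled fraction: {coverage:.1%}")
print(f"unprofiled combinations: {n_missing}")
```

Under that assumption the unprofiled count agrees with the 19,786 imputed combinations reported in the abstract, and the profiled fraction rounds to the stated 3.6%.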
Affiliation(s)
- Emily Maciejewski
- Computer Science Department, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Department of Biological Chemistry, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Steve Horvath
- Dept. of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA 90095, USA
- Dept. of Biostatistics, Fielding School of Public Health, University of California Los Angeles, Los Angeles, CA 90095, USA
- Altos Labs, Cambridge, UK, WA14 2DT
- Jason Ernst
- Computer Science Department, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Department of Biological Chemistry, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Eli and Edythe Broad Center of Regenerative Medicine and Stem Cell Research at University of California, Los Angeles, Los Angeles, CA 90095, USA
- Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Jonsson Comprehensive Cancer Center, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Molecular Biology Institute, University of California, Los Angeles, Los Angeles, CA 90095, USA
3
Wang Z, Xiang S, Zhou C, Xu Q. DeepMethylation: a deep learning based framework with GloVe and Transformer encoder for DNA methylation prediction. PeerJ 2023; 11:e16125. [PMID: 37780374 PMCID: PMC10538282 DOI: 10.7717/peerj.16125] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2023] [Accepted: 08/27/2023] [Indexed: 10/03/2023] Open
Abstract
DNA methylation is a crucial topic in bioinformatics research. Traditional wet experiments are usually time-consuming and expensive. In contrast, machine learning offers an efficient and novel approach. In this study, we propose DeepMethylation, a novel methylation predictor with deep learning. Specifically, the DNA sequence is encoded with word embedding and GloVe in the first step. After that, dilated convolution and Transformer encoder are utilized to extract the features. Finally, full connection and softmax operators are applied to predict the methylation sites. The proposed model achieves an accuracy of 97.8% on the 5mC dataset, which outperforms state-of-the-art methods. Furthermore, our predictor exhibits good generalization ability as it achieves an accuracy of 95.8% on the m1A dataset. To ease access for other researchers, our code is publicly available at https://github.com/sb111169/tf-5mc.
Affiliation(s)
- Zhe Wang
- Wuhan University of Science and Technology, Wuhan, Hubei, China
- Sen Xiang
- Wuhan University of Science and Technology, Wuhan, Hubei, China
- Chao Zhou
- China Three Gorges University, Yichang, Hubei, China
- Qing Xu
- China Three Gorges University, Yichang, Hubei, China
4
Yassi M, Chatterjee A, Parry M. Application of deep learning in cancer epigenetics through DNA methylation analysis. Brief Bioinform 2023; 24:bbad411. [PMID: 37985455 PMCID: PMC10661960 DOI: 10.1093/bib/bbad411] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 08/22/2023] [Revised: 10/08/2023] [Accepted: 10/25/2023] [Indexed: 11/22/2023] Open
Abstract
DNA methylation is a fundamental epigenetic modification involved in various biological processes and diseases. Analysis of DNA methylation data at a genome-wide and high-throughput level can provide insights into diseases influenced by epigenetics, such as cancer. Recent technological advances have led to the development of high-throughput approaches, such as genome-scale profiling, that allow for computational analysis of epigenetics. Deep learning (DL) methods are essential in facilitating computational studies in epigenetics for DNA methylation analysis. In this systematic review, we assessed the various applications of DL applied to DNA methylation data or multi-omics data to discover cancer biomarkers, perform classification, imputation and survival analysis. The review first introduces state-of-the-art DL architectures and highlights their usefulness in addressing challenges related to cancer epigenetics. Finally, the review discusses potential limitations and future research directions in this field.
Affiliation(s)
- Maryam Yassi
- Department of Mathematics and Statistics, University of Otago, Dunedin, New Zealand
- Department of Pathology, Dunedin School of Medicine, University of Otago, Dunedin, New Zealand
- Aniruddha Chatterjee
- Department of Pathology, Dunedin School of Medicine, University of Otago, Dunedin, New Zealand
- Honorary Professor, UPES University, Dehradun, India
- Matthew Parry
- Department of Mathematics and Statistics, University of Otago, Dunedin, New Zealand
- Te Pūnaha Matatini Centre of Research Excellence, University of Auckland, Auckland, New Zealand
5
Sereshki S, Lee N, Omirou M, Fasoula D, Lonardi S. On the prediction of non-CG DNA methylation using machine learning. NAR Genom Bioinform 2023; 5:lqad045. [PMID: 37206627 PMCID: PMC10189801 DOI: 10.1093/nargab/lqad045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2023] [Revised: 04/06/2023] [Accepted: 05/05/2023] [Indexed: 05/21/2023] Open
Abstract
DNA methylation can be detected and measured using sequencing instruments after sodium bisulfite conversion, but experiments can be expensive for large eukaryotic genomes. Sequencing nonuniformity and mapping biases can leave parts of the genome with low or no coverage, thus hampering the ability of obtaining DNA methylation levels for all cytosines. To address these limitations, several computational methods have been proposed that can predict DNA methylation from the DNA sequence around the cytosine or from the methylation level of nearby cytosines. However, most of these methods are entirely focused on CG methylation in humans and other mammals. In this work, we study, for the first time, the problem of predicting cytosine methylation for CG, CHG and CHH contexts on six plant species, either from the DNA primary sequence around the cytosine or from the methylation levels of neighboring cytosines. In this framework, we also study the cross-species prediction problem and the cross-context prediction problem (within the same species). Finally, we show that providing gene and repeat annotations allows existing classifiers to significantly improve their prediction accuracy. We introduce a new classifier called AMPS (annotation-based methylation prediction from sequence) that takes advantage of genomic annotations to achieve higher accuracy.
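The three sequence contexts named in this abstract are defined by the bases immediately downstream of the cytosine, where H stands for A, C or T. The following is a minimal sketch of that labelling step only; the function name `cytosine_context` is hypothetical and is not taken from AMPS or any other tool mentioned here, and it handles the forward strand only.

```python
from typing import Optional


def cytosine_context(seq: str, pos: int) -> Optional[str]:
    """Label the cytosine at seq[pos] as 'CG', 'CHG' or 'CHH' (forward strand).

    Returns None if seq[pos] is not a cytosine, or if the downstream bases
    needed to decide the context are missing or ambiguous.
    """
    seq = seq.upper()
    if pos < 0 or pos >= len(seq) or seq[pos] != "C":
        return None
    downstream = seq[pos + 1 : pos + 3]
    # CG: the very next base is a guanine.
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CG"
    # The remaining contexts need two unambiguous downstream bases.
    if len(downstream) < 2 or any(base not in "ACGT" for base in downstream):
        return None
    # At this point the first downstream base is A, C or T (that is, H).
    return "CHG" if downstream[1] == "G" else "CHH"


if __name__ == "__main__":
    example = "ACCGATCAGCTT"
    for i, base in enumerate(example):
        if base == "C":
            print(i, cytosine_context(example, i))
```

A real pipeline would also need to handle cytosines on the reverse strand, which this sketch deliberately leaves out.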
Affiliation(s)
- Saleh Sereshki
- Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
- Nathan Lee
- Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
- Michalis Omirou
- Department of Agrobiotechnology, Agricultural Microbiology Laboratory, Agricultural Research Institute, Nicosia 1516, Cyprus
- Dionysia Fasoula
- Department of Plant Breeding, Agricultural Research Institute, Nicosia 1516, Cyprus
- Stefano Lonardi
- To whom correspondence should be addressed. Tel: +1 951 827 2203; Fax: +1 951 827 4643;
6
Fryett JJ, Morris AP, Cordell HJ. Investigating the prediction of CpG methylation levels from SNP genotype data to help elucidate relationships between methylation, gene expression and complex traits. Genet Epidemiol 2022; 46:629-643. [PMID: 35930604 PMCID: PMC9804820 DOI: 10.1002/gepi.22496] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2022] [Revised: 06/27/2022] [Accepted: 07/19/2022] [Indexed: 01/09/2023]
Abstract
As popularised by PrediXcan (and related methods), transcriptome-wide association studies (TWAS), in which gene expression is imputed from single-nucleotide polymorphism (SNP) genotypes and tested for association with a phenotype, are a popular approach for investigating the role of gene expression in complex traits. Like gene expression, DNA methylation is an important biological process and, being under genetic regulation, may be imputable from SNP genotypes. Here, we investigate prediction of CpG methylation levels from SNP genotype data to help elucidate relationships between methylation, gene expression and complex traits. We start by examining how well CpG methylation can be predicted from SNP genotypes, comparing three penalised regression approaches and examining whether changing the window size improves prediction accuracy. Although methylation at most CpG sites cannot be accurately predicted from SNP genotypes, for a subset it can be predicted well. We next apply our methylation prediction models (trained using the optimal method and window size) to carry out a methylome-wide association study (MWAS) of primary biliary cholangitis. We intersect the regions identified via MWAS with those identified via TWAS, providing insight into the interplay between CpG methylation, gene expression and disease status. We conclude that MWAS has the potential to improve understanding of biological mechanisms in complex traits.
Affiliation(s)
- James J. Fryett
- Population Health Sciences Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, UK
- Andrew P. Morris
- Centre for Genetics and Genomics Versus Arthritis, Centre for Musculoskeletal Research, University of Manchester, Manchester, UK
- Heather J. Cordell
- Population Health Sciences Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, UK
7
Application of Feature Selection and Deep Learning for Cancer Prediction Using DNA Methylation Markers. Genes (Basel) 2022; 13:genes13091557. [PMID: 36140725 PMCID: PMC9498757 DOI: 10.3390/genes13091557] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/10/2022] [Revised: 08/24/2022] [Accepted: 08/25/2022] [Indexed: 12/31/2022] Open
Abstract
DNA methylation is a process that can affect gene accessibility and therefore gene expression. In this study, a machine learning pipeline is proposed for the prediction of breast cancer and the identification of significant genes that contribute to the prediction. The current study utilized breast cancer methylation data from The Cancer Genome Atlas (TCGA), specifically the TCGA-BRCA dataset. Feature engineering techniques have been utilized to reduce data volume and make deep learning scalable. A comparative analysis of the proposed approach on Illumina 27K and 450K methylation data reveals that deep learning methodologies for cancer prediction can be coupled with feature selection models to enhance prediction accuracy. Prediction using 450K methylation markers can be accomplished in less than 13 s with an accuracy of 98.75%. Of the list of 685 genes in the feature selected 27K dataset, 578 were mapped to Ensembl Gene IDs. This reduced set was significantly (FDR < 0.05) enriched in five biological processes and one molecular function. Of the list of 1572 genes in the feature selected 450K data set, 1290 were mapped to Ensembl Gene IDs. This reduced set was significantly (FDR < 0.05) enriched in 95 biological processes and 17 molecular functions. Seven oncogene/tumor suppressor genes were common between the 27K and 450K feature selected gene sets. These genes were RTN4IP1, MYO18B, ANP32A, BRF1, SETBP1, NTRK1, and IGF2R. Our bioinformatics deep learning workflow, incorporating imputation and data balancing methods, is able to identify important methylation markers related to functionally important genes in breast cancer with high accuracy compared to deep learning or statistical models alone.
8
Zhou J, Chen Q, Braun PR, Perzel Mandell KA, Jaffe AE, Tan HY, Hyde TM, Kleinman JE, Potash JB, Shinozaki G, Weinberger DR, Han S. Deep learning predicts DNA methylation regulatory variants in the human brain and elucidates the genetics of psychiatric disorders. Proc Natl Acad Sci U S A 2022; 119:e2206069119. [PMID: 35969790 PMCID: PMC9407663 DOI: 10.1073/pnas.2206069119] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 04/11/2022] [Accepted: 07/18/2022] [Indexed: 11/18/2022] Open
Abstract
There is growing evidence for the role of DNA methylation (DNAm) quantitative trait loci (mQTLs) in the genetics of complex traits, including psychiatric disorders. However, due to extensive linkage disequilibrium (LD) of the genome, it is challenging to identify causal genetic variations that drive DNAm levels by population-based genetic association studies. This limits the utility of mQTLs for fine-mapping risk loci underlying psychiatric disorders identified by genome-wide association studies (GWAS). Here we present INTERACT, a deep learning model that integrates convolutional neural networks with transformer, to predict effects of genetic variations on DNAm levels at CpG sites in the human brain. We show that INTERACT-derived DNAm regulatory variants are not confounded by LD, are concentrated in regulatory genomic regions in the human brain, and are convergent with mQTL evidence from genetic association analysis. We further demonstrate that predicted DNAm regulatory variants are enriched for heritability of brain-related traits and improve polygenic risk prediction for schizophrenia across diverse ancestry samples. Finally, we applied predicted DNAm regulatory variants for fine-mapping schizophrenia GWAS risk loci to identify potential novel risk genes. Our study shows the power of a deep learning approach to identify functional regulatory variants that may elucidate the genetic basis of complex traits.
Affiliation(s)
- Jiyun Zhou
- Lieber Institute for Brain Development, The Johns Hopkins Medical Campus, Baltimore, MD 21287
- Department of Psychiatry and Behavioral Sciences, The Johns Hopkins University School of Medicine, Baltimore, MD 21287
- Qiang Chen
- Lieber Institute for Brain Development, The Johns Hopkins Medical Campus, Baltimore, MD 21287
- Patricia R. Braun
- Department of Psychiatry and Behavioral Sciences, The Johns Hopkins University School of Medicine, Baltimore, MD 21287
- Kira A. Perzel Mandell
- Lieber Institute for Brain Development, The Johns Hopkins Medical Campus, Baltimore, MD 21287
- Department of Genetic Medicine, The Johns Hopkins University School of Medicine, Baltimore, MD 21205
- Andrew E. Jaffe
- Lieber Institute for Brain Development, The Johns Hopkins Medical Campus, Baltimore, MD 21287
- Department of Psychiatry and Behavioral Sciences, The Johns Hopkins University School of Medicine, Baltimore, MD 21287
- Department of Genetic Medicine, The Johns Hopkins University School of Medicine, Baltimore, MD 21205
- Department of Neuroscience, The Johns Hopkins University School of Medicine, Baltimore, MD 21205
- Department of Mental Health, The Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205
- Department of Biostatistics, The Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205
- Hao Yang Tan
- Lieber Institute for Brain Development, The Johns Hopkins Medical Campus, Baltimore, MD 21287
- Department of Psychiatry and Behavioral Sciences, The Johns Hopkins University School of Medicine, Baltimore, MD 21287
- Thomas M. Hyde
- Lieber Institute for Brain Development, The Johns Hopkins Medical Campus, Baltimore, MD 21287
- Department of Psychiatry and Behavioral Sciences, The Johns Hopkins University School of Medicine, Baltimore, MD 21287
- Department of Neurology, The Johns Hopkins University School of Medicine, Baltimore, MD 21205
- Joel E. Kleinman
- Lieber Institute for Brain Development, The Johns Hopkins Medical Campus, Baltimore, MD 21287
- Department of Psychiatry and Behavioral Sciences, The Johns Hopkins University School of Medicine, Baltimore, MD 21287
- James B. Potash
- Department of Psychiatry and Behavioral Sciences, The Johns Hopkins University School of Medicine, Baltimore, MD 21287
- Gen Shinozaki
- Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, Palo Alto, CA 94305
- Daniel R. Weinberger
- Lieber Institute for Brain Development, The Johns Hopkins Medical Campus, Baltimore, MD 21287
- Department of Psychiatry and Behavioral Sciences, The Johns Hopkins University School of Medicine, Baltimore, MD 21287
- Department of Genetic Medicine, The Johns Hopkins University School of Medicine, Baltimore, MD 21205
- Department of Neuroscience, The Johns Hopkins University School of Medicine, Baltimore, MD 21205
- Department of Neurology, The Johns Hopkins University School of Medicine, Baltimore, MD 21205
- Shizhong Han
- Lieber Institute for Brain Development, The Johns Hopkins Medical Campus, Baltimore, MD 21287
- Department of Psychiatry and Behavioral Sciences, The Johns Hopkins University School of Medicine, Baltimore, MD 21287
- Department of Genetic Medicine, The Johns Hopkins University School of Medicine, Baltimore, MD 21205
9
Akagi T, Masuda K, Kuwada E, Takeshita K, Kawakatsu T, Ariizumi T, Kubo Y, Ushijima K, Uchida S. Genome-wide cis-decoding for expression design in tomato using cistrome data and explainable deep learning. Plant Cell 2022; 34:2174-2187. [PMID: 35258588 PMCID: PMC9134063 DOI: 10.1093/plcell/koac079] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/01/2021] [Accepted: 02/20/2022] [Indexed: 06/14/2023]
Abstract
In the evolutionary history of plants, variation in cis-regulatory elements (CREs) resulting in diversification of gene expression has played a central role in driving the evolution of lineage-specific traits. However, it is difficult to predict expression behaviors from CRE patterns to properly harness them, mainly because the biological processes are complex. In this study, we used cistrome datasets and explainable convolutional neural network (CNN) frameworks to predict genome-wide expression patterns in tomato (Solanum lycopersicum) fruit from the DNA sequences in gene regulatory regions. By fixing the effects of trans-acting factors using single cell-type spatiotemporal transcriptome data for the response variables, we developed a prediction model for crucial expression patterns in the initiation of tomato fruit ripening. Feature visualization of the CNNs identified nucleotide residues critical to the objective expression pattern in each gene, and their effects were validated experimentally in ripening tomato fruit. This cis-decoding framework will not only contribute to the understanding of the regulatory networks derived from CREs and transcription factor interactions, but also provides a flexible means of designing alleles for optimized expression.
Affiliation(s)
- Taiji Kawakatsu
- Institute of Agrobiological Sciences, National Agriculture and Food Research Organization, Tsukuba, Ibaraki 305-8602, Japan
- Tohru Ariizumi
- Faculty of Life and Environmental Sciences, University of Tsukuba, Tsukuba Plant Innovation Research Center, Tsukuba, Japan
- Yasutaka Kubo
- Graduate School of Environmental and Life Science, Okayama University, Okayama 700-8530, Japan
- Koichiro Ushijima
- Graduate School of Environmental and Life Science, Okayama University, Okayama 700-8530, Japan
- Seiichi Uchida
- Department of Advanced Information Technology, Kyushu University, Fukuoka 819-0395, Japan
10
Lee D, Kim S. Knowledge-guided artificial intelligence technologies for decoding complex multiomics interactions in cells. Clin Exp Pediatr 2022; 65:239-249. [PMID: 34844399 PMCID: PMC9082244 DOI: 10.3345/cep.2021.01438] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2021] [Revised: 10/19/2021] [Accepted: 10/21/2021] [Indexed: 11/27/2022] Open
Abstract
Cells survive and proliferate through complex interactions among diverse molecules across multiomics layers. Conventional experimental approaches for identifying these interactions have built a firm foundation for molecular biology, but their scalability is gradually becoming inadequate compared to the rapid accumulation of multiomics data measured by high-throughput technologies. Therefore, the need for data-driven computational modeling of interactions within cells has been highlighted in recent years. The complexity of multiomics interactions is primarily due to their nonlinearity. That is, their accurate modeling requires intricate conditional dependencies, synergies, or antagonisms between considered genes or proteins, which retard experimental validations. Artificial intelligence (AI) technologies, including deep learning models, are optimal choices for handling complex nonlinear relationships between features that are scalable and produce large amounts of data. Thus, they have great potential for modeling multiomics interactions. Although there exist many AI-driven models for computational biology applications, relatively few explicitly incorporate the prior knowledge within model architectures or training procedures. Such guidance of models by domain knowledge will greatly reduce the amount of data needed to train models and constrain their vast expressive powers to focus on the biologically relevant space. Therefore, it can enhance a model's interpretability, reduce spurious interactions, and prove its validity and utility. Thus, to facilitate further development of knowledge-guided AI technologies for the modeling of multiomics interactions, here we review representative bioinformatics applications of deep learning models for multiomics interactions developed to date by categorizing them by guidance mode.
Affiliation(s)
- Dohoon Lee
- Bioinformatics Institute, Seoul National University, Seoul, Korea
- Sun Kim
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea
- Department of Computer Science and Engineering, Seoul National University, Seoul, Korea
- Institute of Engineering Research, Seoul National University, Seoul, Korea
- AIGENDRUG Co., Ltd., Seoul, Korea
11
Arslan E, Schulz J, Rai K. Machine Learning in Epigenomics: Insights into Cancer Biology and Medicine. Biochim Biophys Acta Rev Cancer 2021; 1876:188588. [PMID: 34245839 PMCID: PMC8595561 DOI: 10.1016/j.bbcan.2021.188588] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 03/09/2021] [Revised: 05/29/2021] [Accepted: 07/02/2021] [Indexed: 02/01/2023]
Abstract
The recent deluge of genome-wide technologies for the mapping of the epigenome and resulting data in cancer samples has provided the opportunity for gaining insights into and understanding the roles of epigenetic processes in cancer. However, the complexity, high-dimensionality, sparsity, and noise associated with these data pose challenges for extensive integrative analyses. Machine Learning (ML) algorithms are particularly suited for epigenomic data analyses due to their flexibility and ability to learn underlying hidden structures. We will discuss four overlapping but distinct major categories under ML: dimensionality reduction, unsupervised methods, supervised methods, and deep learning (DL). We review the preferred use cases of these algorithms in analyses of cancer epigenomics data with the hope to provide an overview of how ML approaches can be used to explore fundamental questions on the roles of epigenome in cancer biology and medicine.
Affiliation(s)
- Emre Arslan
- Department of Genomic Medicine, MD Anderson Cancer Center, Houston, TX 77030, United States of America
- Jonathan Schulz
- Department of Genomic Medicine, MD Anderson Cancer Center, Houston, TX 77030, United States of America
- Kunal Rai
- Department of Genomic Medicine, MD Anderson Cancer Center, Houston, TX 77030, United States of America.
12
Chaudhari M, Thapa N, Roy K, Newman RH, Saigo H, B K C D. DeepRMethylSite: a deep learning based approach for prediction of arginine methylation sites in proteins. Mol Omics 2021; 16:448-454. [PMID: 32555810 DOI: 10.1039/d0mo00025f] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Indexed: 12/16/2022]
Abstract
Methylation, which is one of the most prominent post-translational modifications on proteins, regulates many important cellular functions. Though several model-based methylation site predictors have been reported, all existing methods employ machine learning strategies, such as support vector machines and random forest, to predict sites of methylation based on a set of "hand-selected" features. As a consequence, the subsequent models may be biased toward one set of features. Moreover, due to the large number of features, model development can often be computationally expensive. In this paper, we propose an alternative approach based on deep learning to predict arginine methylation sites. Our model, which we termed DeepRMethylSite, is computationally less expensive than traditional feature-based methods while eliminating potential biases that can arise through feature selection. Based on independent testing on our dataset, DeepRMethylSite achieved efficiency scores of 68%, 82% and 0.51 with respect to sensitivity (SN), specificity (SP) and Matthews correlation coefficient (MCC), respectively. Importantly, in side-by-side comparisons with other state-of-the-art methylation site predictors, our method performs on par or better in all scoring metrics tested.
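For readers unfamiliar with the three metrics reported in this abstract, the sketch below shows their standard definitions in terms of confusion-matrix counts. The function name `sn_sp_mcc` and the example counts are hypothetical; they are not the authors' evaluation code or data, and the output is not meant to reproduce the reported scores.

```python
import math


def sn_sp_mcc(tp: int, fp: int, tn: int, fn: int):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts.

    Assumes at least one positive and one negative example, so that the
    sensitivity and specificity denominators are non-zero.
    """
    sn = tp / (tp + fn)   # fraction of true sites that were recovered
    sp = tn / (tn + fp)   # fraction of non-sites that were correctly rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report MCC as 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, mcc


if __name__ == "__main__":
    # Made-up counts for a balanced test set of 100 sites and 100 non-sites.
    sn, sp, mcc = sn_sp_mcc(tp=68, fp=18, tn=82, fn=32)
    print(f"SN={sn:.2f} SP={sp:.2f} MCC={mcc:.3f}")
```

Unlike sensitivity and specificity, MCC depends on the class balance of the test set, so the same SN and SP can correspond to different MCC values.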
Affiliation(s)
- Meenal Chaudhari
- Department of Computational Science and Engineering, North Carolina Agricultural & Technical State University, Greensboro, NC 27411, USA
- Niraj Thapa
- Department of Computational Science and Engineering, North Carolina Agricultural & Technical State University, Greensboro, NC 27411, USA
- Kaushik Roy
- Department of Computer Science, North Carolina Agricultural & Technical State University, Greensboro, NC 27411, USA
- Robert H Newman
- Department of Biology, North Carolina Agricultural & Technical State University, Greensboro, NC 27411, USA
- Hiroto Saigo
- Department of Informatics, Kyushu University, Fukuoka 819-0395, Japan
- Dukka B K C
- Electrical Engineering and Computer Science Department, Wichita State University, Wichita, KS 67260, USA.
13
Au Yeung WK, Maruyama O, Sasaki H. A convolutional neural network-based regression model to infer the epigenetic crosstalk responsible for CG methylation patterns. BMC Bioinformatics 2021; 22:341. [PMID: 34162326 PMCID: PMC8220828 DOI: 10.1186/s12859-021-04272-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2020] [Accepted: 06/15/2021] [Indexed: 12/02/2022] Open
Abstract
Background: Epigenetic modifications, including CG methylation (a major form of DNA methylation) and histone modifications, interact with each other to shape their genomic distribution patterns. However, the entire picture of the epigenetic crosstalk regulating the CG methylation pattern is unknown especially in cells that are available only in a limited number, such as mammalian oocytes. Most machine learning approaches developed so far aim at finding DNA sequences responsible for the CG methylation patterns and were not tailored for studying the epigenetic crosstalk.
Results: We built a machine learning model named epiNet to predict CG methylation patterns based on other epigenetic features, such as histone modifications, but not DNA sequence. Using epiNet, we identified biologically relevant epigenetic crosstalk between histone H3K36me3, H3K4me3, and CG methylation in mouse oocytes. This model also predicted the altered CG methylation pattern of mutant oocytes having perturbed histone modification, was applicable to cross-species prediction of the CG methylation pattern of human oocytes, and identified the epigenetic crosstalk potentially important in other cell types.
Conclusions: Our findings provide insight into the epigenetic crosstalk regulating the CG methylation pattern in mammalian oocytes and other cells. The use of epiNet should help to design or complement biological experiments in epigenetics studies.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04272-8.
Affiliation(s)
- Wan Kin Au Yeung
- Division of Epigenomics and Development, Medical Institute of Bioregulation, Kyushu University, Fukuoka, 812-8582, Japan.
| | - Osamu Maruyama
- Faculty of Design, Kyushu University, Fukuoka, 815-0032, Japan
| | - Hiroyuki Sasaki
- Division of Epigenomics and Development, Medical Institute of Bioregulation, Kyushu University, Fukuoka, 812-8582, Japan.
| |
|
14
|
Zrimec J, Buric F, Kokina M, Garcia V, Zelezniak A. Learning the Regulatory Code of Gene Expression. Front Mol Biosci 2021; 8:673363. [PMID: 34179082 PMCID: PMC8223075 DOI: 10.3389/fmolb.2021.673363] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 02/27/2021] [Accepted: 05/24/2021] [Indexed: 11/13/2022] Open
Abstract
Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology.
Affiliation(s)
- Jan Zrimec
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
| | - Filip Buric
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
| | - Mariia Kokina
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Victor Garcia
- School of Life Sciences and Facility Management, Zurich University of Applied Sciences, Wädenswil, Switzerland
| | - Aleksej Zelezniak
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Science for Life Laboratory, Stockholm, Sweden
| |
|
15
|
Fischer MA, Vondriska TM. Clinical epigenomics for cardiovascular disease: Diagnostics and therapies. J Mol Cell Cardiol 2021; 154:97-105. [PMID: 33561434 PMCID: PMC8330446 DOI: 10.1016/j.yjmcc.2021.01.011] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/08/2020] [Revised: 01/05/2021] [Accepted: 01/10/2021] [Indexed: 12/28/2022]
Abstract
The study of epigenomics has advanced in recent years to span the regulation of a single genetic locus to the structure and orientation of entire chromosomes within the nucleus. In this review, we focus on the challenges and opportunities of clinical epigenomics in cardiovascular disease. As an integrator of genetic and environmental inputs, and because of advances in measurement techniques that are highly reproducible and provide sequence information, the epigenome is a rich source of potential biosignatures of cardiovascular health and disease. Most of the studies to date have focused on the latter, and herein we discuss observations on epigenomic changes in human cardiovascular disease, examining the role of protein modifiers of chromatin, noncoding RNAs and DNA modification. We provide an overview of cardiovascular epigenomics, discussing the challenges of data sovereignty, data analysis, doctor-patient ethics and innovations necessary to implement precision health.
Affiliation(s)
- Matthew A Fischer
- Department of Anesthesiology & Perioperative Medicine, David Geffen School of Medicine at UCLA, USA.
| | - Thomas M Vondriska
- Department of Anesthesiology & Perioperative Medicine, David Geffen School of Medicine at UCLA, USA
| |
|
16
|
The progress on the estimation of DNA methylation level and the detection of abnormal methylation. Quantitative Biology 2021. [DOI: 10.15302/j-qb-022-0289] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
|
17
|
Schmidt B, Hildebrandt A. Deep learning in next-generation sequencing. Drug Discov Today 2021; 26:173-180. [PMID: 33059075 PMCID: PMC7550123 DOI: 10.1016/j.drudis.2020.10.002] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 03/20/2020] [Revised: 09/16/2020] [Accepted: 10/07/2020] [Indexed: 12/22/2022]
Abstract
Next-generation sequencing (NGS) methods lie at the heart of large parts of biological and medical research. Their fundamental importance has created a continuously increasing demand for processing and analysis methods of the data sets produced, addressing questions such as variant calling, metagenomic classification and quantification, genomic feature detection, or downstream analysis in larger biological or medical contexts. In addition to classical algorithmic approaches, machine-learning (ML) techniques are often used for such tasks. In particular, deep learning (DL) methods that use multilayered artificial neural networks (ANNs) for supervised, semisupervised, and unsupervised learning have gained significant traction for such applications. Here, we highlight important network architectures, application areas, and DL frameworks in an NGS context.
Affiliation(s)
- Bertil Schmidt
- Institut für Informatik, Johannes Gutenberg University Mainz, Germany.
| |
|
18
|
Zhang L, Zou Y, He N, Chen Y, Chen Z, Li L. DeepKhib: A Deep-Learning Framework for Lysine 2-Hydroxyisobutyrylation Sites Prediction. Front Cell Dev Biol 2020; 8:580217. [PMID: 33015075 PMCID: PMC7509169 DOI: 10.3389/fcell.2020.580217] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 07/05/2020] [Accepted: 08/17/2020] [Indexed: 11/28/2022] Open
Abstract
As a novel type of post-translational modification, lysine 2-hydroxyisobutyrylation (Khib) plays an important role in gene transcription and signal transduction. To understand its regulatory mechanism, the essential step is the recognition of Khib sites. Thousands of Khib sites have been experimentally verified across five different species. However, only a couple of traditional machine-learning algorithms have been developed to predict Khib sites, and only for a limited set of species; a general prediction algorithm is lacking. We constructed a deep-learning algorithm based on a convolutional neural network with the one-hot encoding approach, dubbed CNN_OH. It compares favorably with the traditional machine-learning models and other deep-learning models across different species, in terms of cross-validation and independent tests. The area under the ROC curve (AUC) values for CNN_OH ranged from 0.82 to 0.87 for different organisms, which is superior to the currently available Khib predictors. Moreover, we developed a general model based on the integrated data from multiple species, and it showed great universality and effectiveness, with AUC values in the range of 0.79-0.87. Accordingly, we constructed an online prediction tool dubbed DeepKhib for easily identifying Khib sites, which includes both species-specific and general models. DeepKhib is available at http://www.bioinfogo.org/DeepKhib.
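The one-hot encoding mentioned in this abstract can be illustrated with a minimal sketch. This is not the DeepKhib implementation: the function name `one_hot_window`, the flank size, and the padding symbol are all hypothetical choices made for illustration, since the abstract does not state the window length or how sequence ends are handled.

```python
# Hypothetical sketch: one-hot encode the peptide window around a candidate lysine.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "-"  # assumed padding symbol for positions past the sequence ends


def one_hot_window(sequence, center, flank=3):
    """Return a (2*flank+1) x 20 matrix of 0/1 values for the window at `center`.

    Padding positions and non-standard residues become all-zero rows.
    """
    if sequence[center] != "K":
        raise ValueError("center residue must be a lysine (K)")
    rows = []
    for pos in range(center - flank, center + flank + 1):
        residue = sequence[pos] if 0 <= pos < len(sequence) else PAD
        rows.append([1 if residue == aa else 0 for aa in AMINO_ACIDS])
    return rows


# Toy peptide with a lysine at index 2; the window extends past both ends.
matrix = one_hot_window("MAKTQ", center=2, flank=3)
```

A matrix of this shape is the kind of input a one-dimensional convolutional network could consume, with one row per window position and one column per residue type.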
Affiliation(s)
- Luna Zhang
- School of Data Science and Software Engineering, Qingdao University, Qingdao, China
| | - Yang Zou
- School of Basic Medicine, Qingdao University, Qingdao, China
| | - Ningning He
- School of Basic Medicine, Qingdao University, Qingdao, China
| | - Yu Chen
- School of Data Science and Software Engineering, Qingdao University, Qingdao, China
| | - Zhen Chen
- Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou, China
- Key Laboratory of Rice Biology in Henan Province, Henan Agricultural University, Zhengzhou, China
| | - Lei Li
- School of Data Science and Software Engineering, Qingdao University, Qingdao, China
- School of Basic Medicine, Qingdao University, Qingdao, China
| |
|
19
|
Abstract
Deep neural networks have been revolutionizing the field of machine learning for the past several years. They have been applied with great success in many domains of the biomedical data sciences and are outperforming extant methods by a large margin. The ability of deep neural networks to pick up local image features and model the interactions between them makes them highly applicable to regulatory genomics. Instead of an image, the networks analyze DNA and RNA sequences and additional epigenomic data. In this review, we survey the successes of deep learning in the field of regulatory genomics. We first describe the fundamental building blocks of deep neural networks, popular architectures used in regulatory genomics, and their training process on molecular sequence data. We then review several key methods in different gene regulation domains. We start with the pioneering method DeepBind and its successors, which were developed to predict protein–DNA binding. We then review methods developed to predict and model epigenetic information, such as histone marks and nucleosome occupancy. Following epigenomics, we review methods to predict protein–RNA binding with its unique challenge of incorporating RNA structure information. Finally, we provide our overall view of the strengths and weaknesses of deep neural networks and prospects for future developments.
Affiliation(s)
- Mira Barshai
- School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel
| | - Eitamar Tripto
- Department of Biomedical Engineering, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel
| | - Yaron Orenstein
- School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel
| |
|
20
|
Tang J, Zou J, Zhang X, Fan M, Tian Q, Fu S, Gao S, Fan S. PretiMeth: precise prediction models for DNA methylation based on single methylation mark. BMC Genomics 2020; 21:364. [PMID: 32414326 PMCID: PMC7227319 DOI: 10.1186/s12864-020-6768-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/29/2019] [Accepted: 05/04/2020] [Indexed: 11/29/2022] Open
Abstract
Background: The computational prediction of methylation levels at single-CpG resolution is a promising way to explore the methylation levels of CpGs not covered by existing array techniques, especially for the large existing reserves of 450K BeadChip array data. General prediction models concentrate on improving the overall prediction accuracy for the bulk of CpG loci while neglecting whether each locus is precisely predicted. This limits the application of the prediction results, especially in downstream analyses with high precision requirements.
Results: Here we report PretiMeth, a method for constructing precise prediction models for each single CpG locus. PretiMeth used a logistic regression algorithm to build a prediction model for each locus of interest. Only one DNA methylation feature, the one sharing the most similar methylation pattern with the CpG locus to be predicted, was applied in the model. We found that PretiMeth outperformed other algorithms in prediction accuracy and remained robust across platforms and cell types. Furthermore, when PretiMeth was applied to The Cancer Genome Atlas (TCGA) data, intensive analysis based on the precise prediction results showed that several CpG loci and genes (differentially methylated between tumor and normal samples) merit further biological validation.
Conclusion: The precise prediction of single CpG loci is important for both methylation array data expansion and downstream analysis of prediction results. PretiMeth achieved precise modeling for each CpG locus by using only one significant feature, which also suggests that our precise prediction models could serve as a reference for probe set design when the DNA methylation BeadChip is updated. PretiMeth is provided as an open source tool via https://github.com/JxTang-bioinformatics/PretiMeth.
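The single-feature idea described in this abstract can be sketched roughly as follows. This is not the PretiMeth code: the choice of Pearson correlation as the similarity measure, the plain gradient-descent fit, the function names, and the toy beta values and locus labels (`cg_a`, `cg_b`) are all assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation, used here as an assumed similarity measure."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def pick_feature(target, candidates):
    """Choose the one candidate locus most correlated with the target locus."""
    return max(candidates, key=lambda name: pearson(candidates[name], target))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, lr=0.5, epochs=5000):
    """Fit sigmoid(w*x + b) to methylation levels in [0, 1] by gradient descent
    on the cross-entropy loss (a simple stand-in for a library solver)."""
    w, b, n = 0.0, 0.0, len(x)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * xi + b) - yi) * xi for xi, yi in zip(x, y)) / n
        grad_b = sum(sigmoid(w * xi + b) - yi for xi, yi in zip(x, y)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy beta values across five samples (made up for illustration).
target = [0.10, 0.25, 0.50, 0.75, 0.90]
candidates = {
    "cg_a": [0.12, 0.30, 0.48, 0.70, 0.88],
    "cg_b": [0.80, 0.20, 0.55, 0.30, 0.60],
}

best = pick_feature(target, candidates)
w, b = fit_logistic(candidates[best], target)
low, high = sigmoid(w * 0.1 + b), sigmoid(w * 0.9 + b)
```

Fitting one small model per target locus keeps each model trivially cheap, which is what makes a per-locus approach feasible across hundreds of thousands of CpGs.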
Affiliation(s)
- Jianxiong Tang
- School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
| | - Jianxiao Zou
- School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
| | - Xiaoran Zhang
- School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China; Department of Automation, Tsinghua University, Beijing, 100084, China
| | - Mei Fan
- Chengdu Women's and Children's Central Hospital, School of Medicine, University of Electronic Science and Technology of China, Chengdu, 611731, China
| | - Qi Tian
- School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
| | - Shuyao Fu
- School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
| | - Shihong Gao
- School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
| | - Shicai Fan
- School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China; Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 611731, China
| |
|
21
|
Crawford J, Greene CS. Incorporating biological structure into machine learning models in biomedicine. Curr Opin Biotechnol 2020; 63:126-134. [PMID: 31962244 PMCID: PMC7308204 DOI: 10.1016/j.copbio.2019.12.021] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 10/17/2019] [Revised: 12/17/2019] [Accepted: 12/19/2019] [Indexed: 12/19/2022]
Abstract
In biomedical applications of machine learning, relevant information often has a rich structure that is not easily encoded as real-valued predictors. Examples of such data include DNA or RNA sequences, gene sets or pathways, gene interaction or coexpression networks, ontologies, and phylogenetic trees. We highlight recent examples of machine learning models that use structure to constrain model architecture or incorporate structured data into model training. For machine learning in biomedicine, where sample size is limited and model interpretability is crucial, incorporating prior knowledge in the form of structured data can be particularly useful. The area of research would benefit from performant open source implementations and independent benchmarking efforts.
Affiliation(s)
- Jake Crawford
- Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States; Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States; Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Philadelphia, PA, United States.
| |
|
22
|
Huang J, Wang L. Cell-Free DNA Methylation Profiling Analysis-Technologies and Bioinformatics. Cancers (Basel) 2019; 11:1741. [PMID: 31698791 PMCID: PMC6896050 DOI: 10.3390/cancers11111741] [Citation(s) in RCA: 35] [Impact Index Per Article: 7.0] [Received: 10/11/2019] [Revised: 11/01/2019] [Accepted: 11/04/2019] [Indexed: 12/24/2022] Open
Abstract
Analysis of circulating nucleic acids in bodily fluids, referred to as “liquid biopsies”, is rapidly gaining prominence. Studies have shown that cell-free DNA (cfDNA) has great potential in characterizing tumor status and heterogeneity, as well as the response to therapy and tumor recurrence. DNA methylation is an epigenetic modification that plays an important role in a broad range of biological processes and diseases. It is well known that aberrant DNA methylation is generalizable across various samples and occurs early during the pathogenesis of cancer. Methylation patterns of cfDNA are also consistent with those of their cells or tissues of origin. Systemic analysis of cfDNA methylation profiles has emerged as a promising approach for cancer detection and origin determination. In this review, we summarize the technologies for DNA methylation analysis and discuss their feasibility for liquid biopsy applications. We also provide a brief overview of the bioinformatic approaches for analysis of DNA methylation sequencing data. Overall, this review provides informative guidance for the selection of experimental and computational methods in cfDNA methylation-based studies.
|