51
Zou LS, Erdos MR, Taylor DL, Chines PS, Varshney A, Parker SCJ, Collins FS, Didion JP. BoostMe accurately predicts DNA methylation values in whole-genome bisulfite sequencing of multiple human tissues. BMC Genomics 2018; 19:390. [PMID: 29792182 PMCID: PMC5966887 DOI: 10.1186/s12864-018-4766-y] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Received: 02/06/2018] [Accepted: 05/08/2018] [Indexed: 01/14/2023] Open
Abstract
Background: Bisulfite sequencing is widely employed to study the role of DNA methylation in disease; however, the data suffer from biases due to coverage depth variability. Imputation of methylation values at low-coverage sites may mitigate these biases while also identifying important genomic features associated with predictive power.
Results: Here we describe BoostMe, a method for imputing low-quality DNA methylation estimates within whole-genome bisulfite sequencing (WGBS) data. BoostMe uses a gradient boosting algorithm, XGBoost, and leverages information from multiple samples for prediction. We find that BoostMe outperforms existing algorithms in speed and accuracy when applied to WGBS of human tissues. Furthermore, we show that imputation improves concordance between WGBS and the MethylationEPIC array at low WGBS depth, suggesting improved WGBS accuracy after imputation.
Conclusions: Our findings support the use of BoostMe as a preprocessing step for WGBS analysis.
Electronic supplementary material: The online version of this article (10.1186/s12864-018-4766-y) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Luli S Zou
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20892, USA
- Michael R Erdos
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20892, USA
- D Leland Taylor
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20892, USA
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, UK
- Peter S Chines
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20892, USA
- Arushi Varshney
  - Department of Human Genetics, University of Michigan, Ann Arbor, MI, 48109, USA
- Stephen C J Parker
  - Department of Human Genetics, University of Michigan, Ann Arbor, MI, 48109, USA
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48109, USA
- Francis S Collins
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20892, USA
- John P Didion
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20892, USA
52
Kalinin AA, Higgins GA, Reamaroon N, Soroushmehr S, Allyn-Feuer A, Dinov ID, Najarian K, Athey BD. Deep learning in pharmacogenomics: from gene regulation to patient stratification. Pharmacogenomics 2018; 19:629-650. [PMID: 29697304 PMCID: PMC6022084 DOI: 10.2217/pgs-2018-0008] [Citation(s) in RCA: 74] [Impact Index Per Article: 12.3] [Received: 01/24/2018] [Accepted: 03/09/2018] [Indexed: 01/02/2023] Open
Abstract
This Perspective provides examples of current and future applications of deep learning in pharmacogenomics, including: identification of novel regulatory variants located in noncoding domains of the genome and their function as applied to pharmacoepigenomics; patient stratification from medical records; and the mechanistic prediction of drug response, targets and their interactions. Deep learning encapsulates a family of machine learning algorithms that has transformed many important subfields of artificial intelligence over the last decade, and has demonstrated breakthrough performance improvements on a wide range of tasks in biomedicine. We anticipate that in the future, deep learning will be widely used to predict personalized drug response and optimize medication selection and dosing, using knowledge extracted from large and complex molecular, epidemiological, clinical and demographic datasets.
Affiliation(s)
- Alexandr A Kalinin
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
  - Statistics Online Computational Resource (SOCR), University of Michigan School of Nursing, Ann Arbor, MI 48109, USA
- Gerald A Higgins
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
- Narathip Reamaroon
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
- Sayedmohammadreza Soroushmehr
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
- Ari Allyn-Feuer
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
- Ivo D Dinov
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
  - Statistics Online Computational Resource (SOCR), University of Michigan School of Nursing, Ann Arbor, MI 48109, USA
  - Michigan Institute for Data Science (MIDAS), University of Michigan, Ann Arbor, MI 48109, USA
- Kayvan Najarian
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
  - Department of Emergency Medicine, University of Michigan Medical School, Ann Arbor, MI 48109, USA
- Brian D Athey
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
  - Michigan Institute for Data Science (MIDAS), University of Michigan, Ann Arbor, MI 48109, USA
  - Department of Internal Medicine, University of Michigan Health System, Ann Arbor, MI 48109, USA
  - Department of Psychiatry, University of Michigan Medical School, Ann Arbor, MI 48109, USA
53
Widschwendter M, Jones A, Evans I, Reisel D, Dillner J, Sundström K, Steyerberg EW, Vergouwe Y, Wegwarth O, Rebitschek FG, Siebert U, Sroczynski G, de Beaufort ID, Bolt I, Cibula D, Zikan M, Bjørge L, Colombo N, Harbeck N, Dudbridge F, Tasse AM, Knoppers BM, Joly Y, Teschendorff AE, Pashayan N. Epigenome-based cancer risk prediction: rationale, opportunities and challenges. Nat Rev Clin Oncol 2018; 15:292-309. [PMID: 29485132 DOI: 10.1038/nrclinonc.2018.30] [Citation(s) in RCA: 103] [Impact Index Per Article: 17.2] [Indexed: 12/13/2022]
Abstract
The incidence of cancer is continuing to rise and risk-tailored early diagnostic and/or primary prevention strategies are urgently required. The ideal risk-predictive test should: integrate the effects of both genetic and nongenetic factors and aim to capture these effects using an approach that is both biologically stable and technically reproducible; derive a score from easily accessible biological samples that acts as a surrogate for the organ in question; and enable the effectiveness of risk-reducing measures to be monitored. Substantial evidence has accumulated suggesting that the epigenome and, in particular, DNA methylation-based tests meet all of these requirements. However, the development and implementation of DNA methylation-based risk-prediction tests poses considerable challenges. In particular, the cell type specificity of DNA methylation and the extensive cellular heterogeneity of the easily accessible surrogate cells that might contain information relevant to less accessible tissues necessitates the use of novel methods in order to account for these confounding issues. Furthermore, the engagement of the scientific community with health-care professionals, policymakers and the public is required in order to identify and address the organizational, ethical, legal, social and economic challenges associated with the routine use of epigenetic testing.
Affiliation(s)
- Martin Widschwendter
  - Department of Women's Cancer, Institute for Women's Health, University College London, London, UK
- Allison Jones
  - Department of Women's Cancer, Institute for Women's Health, University College London, London, UK
- Iona Evans
  - Department of Women's Cancer, Institute for Women's Health, University College London, London, UK
- Daniel Reisel
  - Department of Women's Cancer, Institute for Women's Health, University College London, London, UK
- Joakim Dillner
  - Department of Laboratory Medicine, Karolinska Institutet, Stockholm, Sweden
  - Karolinska University Laboratory, Karolinska University Hospital, Stockholm, Sweden
- Karin Sundström
  - Department of Laboratory Medicine, Karolinska Institutet, Stockholm, Sweden
  - Karolinska University Laboratory, Karolinska University Hospital, Stockholm, Sweden
- Ewout W Steyerberg
  - Center for Medical Decision Sciences, Department of Public Health, Erasmus MC, Rotterdam, Netherlands
  - Department of Biomedical Data Sciences, LUMC, Leiden, Netherlands
- Yvonne Vergouwe
  - Center for Medical Decision Sciences, Department of Public Health, Erasmus MC, Rotterdam, Netherlands
- Odette Wegwarth
  - Max Planck Institute for Human Development, Harding Center for Risk Literacy, Berlin, Germany
  - Max Planck Institute for Human Development, Center for Adaptive Rationality, Berlin, Germany
- Felix G Rebitschek
  - Max Planck Institute for Human Development, Harding Center for Risk Literacy, Berlin, Germany
- Uwe Siebert
  - Institute of Public Health, Medical Decision Making and Health Technology Assessment, Department of Public Health, Health Services Research, and HTA, UMIT-University for Health Sciences, Medical Informatics and Technology, Hall in Tirol, Austria
  - Harvard T.H. Chan School of Public Health, Center for Health Decision Science, Department of Health Policy and Management, Boston, MA, USA
  - Oncotyrol: Center for Personalized Medicine, Innsbruck, Austria
- Gaby Sroczynski
  - Institute of Public Health, Medical Decision Making and Health Technology Assessment, Department of Public Health, Health Services Research, and HTA, UMIT-University for Health Sciences, Medical Informatics and Technology, Hall in Tirol, Austria
- Inez D de Beaufort
  - Department of Medical Ethics and Philosophy of Medicine, Erasmus Medical Center, Rotterdam, Netherlands
- Ineke Bolt
  - Department of Medical Ethics and Philosophy of Medicine, Erasmus Medical Center, Rotterdam, Netherlands
- David Cibula
  - Department of Obstetrics and Gynaecology, First Medical Faculty of the Charles University and General Faculty Hospital, Prague, Czech Republic
- Michal Zikan
  - Department of Obstetrics and Gynaecology, First Medical Faculty of the Charles University and General Faculty Hospital, Prague, Czech Republic
- Line Bjørge
  - Department of Obstetrics and Gynecology, Haukeland University Hospital, and Centre for Cancer Biomarkers, Department of Clinical Science, University of Bergen, Bergen, Norway
- Nicoletta Colombo
  - European Institute of Oncology and University Milan-Bicocca, Milan, Italy
- Nadia Harbeck
  - Breast Center, Department of Gynaecology and Obstetrics, University of Munich (LMU), Munich, Germany
- Frank Dudbridge
  - Department of Non-communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London, UK
  - Department of Health Sciences, University of Leicester, Leicester, UK
- Anne-Marie Tasse
  - Public Population Project in Genomics and Society, McGill University and Genome Quebec Innovation Centre, Montreal, Canada
- Yann Joly
  - Centre of Genomics and Policy, McGill University, Montreal, Canada
- Andrew E Teschendorff
  - Department of Women's Cancer, Institute for Women's Health, University College London, London, UK
- Nora Pashayan
  - Department of Applied Health Research, Institute of Epidemiology and Healthcare, University College London, UK
55
Finnegan A, Song JS. Maximum entropy methods for extracting the learned features of deep neural networks. PLoS Comput Biol 2017; 13:e1005836. [PMID: 29084280 PMCID: PMC5679649 DOI: 10.1371/journal.pcbi.1005836] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Received: 04/18/2017] [Revised: 11/09/2017] [Accepted: 10/23/2017] [Indexed: 11/19/2022] Open
Abstract
New architectures of multilayer artificial neural networks and new methods for training them are rapidly revolutionizing the application of machine learning in diverse fields, including business, social science, physical sciences, and biology. Interpreting deep neural networks, however, currently remains elusive, and a critical challenge lies in understanding which meaningful features a network is actually learning. We present a general method for interpreting deep neural networks and extracting network-learned features from input data. We describe our algorithm in the context of biological sequence analysis. Our approach, based on ideas from statistical physics, samples from the maximum entropy distribution over possible sequences, anchored at an input sequence and subject to constraints implied by the empirical function learned by a network. Using our framework, we demonstrate that local transcription factor binding motifs can be identified from a network trained on ChIP-seq data and that nucleosome positioning signals are indeed learned by a network trained on chemical cleavage nucleosome maps. Imposing a further constraint on the maximum entropy distribution also allows us to probe whether a network is learning global sequence features, such as the high GC content in nucleosome-rich regions. This work thus provides valuable mathematical tools for interpreting and extracting learned features from feed-forward neural networks.
Deep learning is a state-of-the-art reformulation of artificial neural networks that have a long history of development. It can perform superbly well in diverse automated classification and prediction problems, including handwriting recognition, image identification, and biological pattern recognition. Its modern success can be attributed to improved training algorithms, clever network architecture, rapid explosion of available data, and advanced computing power, all of which have allowed the great expansion in the number of unknown parameters to be estimated by the model. These parameters, however, are so intricately connected through highly nonlinear functions that interpreting which essential features of given data are actually used by a deep neural network for its excellent performance has been difficult. We address this problem by using ideas from statistical physics to sample new unseen data that are likely to behave similarly to original data points when passed through the trained network. This synthetic data cloud around each original data point retains informative features while averaging out nonessential ones, ultimately allowing us to extract important network-learned features from the original data set and thus improving the human interpretability of deep learning methods. We demonstrate how our method can be applied to biological sequence analysis.
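The sampling idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the constraint is placed on the expected distance between the network output for a sampled sequence and for the anchor sequence, in which case the maximum entropy distribution takes the Gibbs form p(x) ∝ exp(-β·|F(x) - F(x0)|) and can be sampled with single-base Metropolis moves. The scoring function `toy_network`, the inverse temperature `beta`, and the helper names are all hypothetical stand-ins chosen for illustration.

```python
import math
import random

ALPHABET = "ACGT"


def toy_network(seq: str) -> float:
    """Hypothetical stand-in for a trained network: a logistic score
    driven by the number of 'TATA' occurrences in the sequence."""
    hits = sum(seq[i:i + 4] == "TATA" for i in range(len(seq) - 3))
    return 1.0 / (1.0 + math.exp(-(2.0 * hits - 1.0)))


def maxent_samples(x0, score, beta=20.0, n_steps=5000, seed=0):
    """Metropolis sampling from p(x) proportional to
    exp(-beta * |score(x) - score(x0)|), starting at the anchor x0."""
    rng = random.Random(seed)
    target = score(x0)
    current = list(x0)
    energy = 0.0  # the chain starts at the anchor, so the distance is zero
    samples = []
    for _ in range(n_steps):
        # Symmetric proposal: mutate one random position to a different base.
        pos = rng.randrange(len(current))
        old = current[pos]
        current[pos] = rng.choice([b for b in ALPHABET if b != old])
        new_energy = abs(score("".join(current)) - target)
        # Accept with probability min(1, exp(-beta * (E_new - E_old))).
        if rng.random() < math.exp(-beta * (new_energy - energy)):
            energy = new_energy
        else:
            current[pos] = old  # reject: restore the previous base
        samples.append("".join(current))
    return samples


def base_frequencies(samples):
    """Per-position base frequencies across the sampled sequences."""
    freqs = []
    for i in range(len(samples[0])):
        counts = {b: 0 for b in ALPHABET}
        for s in samples:
            counts[s[i]] += 1
        freqs.append({b: c / len(samples) for b, c in counts.items()})
    return freqs


if __name__ == "__main__":
    anchor = "GGCATATACGGC"  # illustrative anchor sequence
    cloud = maxent_samples(anchor, toy_network)
    for i, f in enumerate(base_frequencies(cloud)):
        print(i, anchor[i], {b: round(p, 2) for b, p in f.items()})
```

In a sketch like this, positions whose mutation changes the score tend to stay close to the anchor base in the sampled cloud, while positions the scoring function ignores drift toward uniform frequencies. That contrast is the sense in which the cloud "retains informative features while averaging out nonessential ones".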
Affiliation(s)
- Alex Finnegan
  - Department of Physics, University of Illinois, Urbana-Champaign, Urbana, Illinois, United States of America
  - Carl R. Woese Institute for Genomic Biology, University of Illinois, Urbana-Champaign, Urbana, Illinois, United States of America
- Jun S. Song
  - Department of Physics, University of Illinois, Urbana-Champaign, Urbana, Illinois, United States of America
  - Carl R. Woese Institute for Genomic Biology, University of Illinois, Urbana-Champaign, Urbana, Illinois, United States of America
  - * E-mail: