101
Ramchandran M, Patil P, Parmigiani G. Tree-Weighting for Multi-Study Ensemble Learners. Pacific Symposium on Biocomputing 2020; 25:451-462. [PMID: 31797618 PMCID: PMC6980320] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
Multi-study learning uses multiple training studies, separately trains classifiers on each, and forms an ensemble with weights rewarding members with better cross-study prediction ability. This article considers novel weighting approaches for constructing tree-based ensemble learners in this setting. Using Random Forests as a single-study learner, we compare weighting each forest to form the ensemble, to extracting the individual trees trained by each Random Forest and weighting them directly. We find that incorporating multiple layers of ensembling in the training process by weighting trees increases the robustness of the resulting predictor. Furthermore, we explore how ensembling weights correspond to tree structure, to shed light on the features that determine whether weighting trees directly is advantageous. Finally, we apply our approach to genomic datasets and show that weighting trees improves upon the basic multi-study learning paradigm. Code and supplementary material are available at https://github.com/m-ramchandran/tree-weighting.
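The contrast the abstract draws between weighting whole forests and weighting individual trees can be sketched with a toy example. This is only an illustration under stated assumptions: the inverse-error weighting rule, the helper names (`cross_study_mse`, `inverse_error_weights`, `ensemble_predict`), and the numbers are all hypothetical and are not taken from the paper or its repository.

```python
from typing import List, Sequence


def cross_study_mse(preds: Sequence[float], truth: Sequence[float]) -> float:
    """Mean squared error of one ensemble member on held-out-study samples."""
    return sum((p - t) ** 2 for p, t in zip(preds, truth)) / len(truth)


def inverse_error_weights(errors: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Weights proportional to 1/error, normalised to sum to 1.

    Assumption: this simple rule stands in for whatever weighting scheme
    a real multi-study learner would use.
    """
    inv = [1.0 / (e + eps) for e in errors]
    total = sum(inv)
    return [v / total for v in inv]


def ensemble_predict(member_preds: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[float]:
    """Weighted average of member predictions, sample by sample."""
    n = len(member_preds[0])
    return [sum(w * m[i] for w, m in zip(weights, member_preds)) for i in range(n)]


# Toy data (made up): four trees, two per forest, predicting three samples.
truth = [1.0, 2.0, 3.0]
tree_preds = [
    [1.0, 2.0, 3.5],  # forest A, tree 0
    [0.0, 1.0, 2.0],  # forest A, tree 1
    [1.5, 2.5, 3.5],  # forest B, tree 2
    [1.1, 2.0, 3.0],  # forest B, tree 3
]
forest_members = [[0, 1], [2, 3]]  # tree indices belonging to forests A and B

# Option 1: weight every tree directly.
tree_weights = inverse_error_weights(
    [cross_study_mse(p, truth) for p in tree_preds]
)
tree_level = ensemble_predict(tree_preds, tree_weights)

# Option 2: average the trees inside each forest, then weight the forests.
forest_preds = [
    [sum(tree_preds[t][i] for t in members) / len(members)
     for i in range(len(truth))]
    for members in forest_members
]
forest_weights = inverse_error_weights(
    [cross_study_mse(p, truth) for p in forest_preds]
)
forest_level = ensemble_predict(forest_preds, forest_weights)

print("tree weights:  ", [round(w, 3) for w in tree_weights])
print("forest weights:", [round(w, 3) for w in forest_weights])
```

In this sketch, tree-level weighting can down-weight a single poor tree inside an otherwise useful forest, whereas forest-level weighting can only reward or penalise the forest as a whole.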
Affiliation(s)
- Maya Ramchandran
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health
- Prasad Patil
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health; Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02115, USA
- Giovanni Parmigiani
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health; Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02115, USA
102
Pfister N, Bauer S, Peters J. Learning stable and predictive structures in kinetic systems. Proc Natl Acad Sci U S A 2019; 116:25405-25411. [PMID: 31776252 PMCID: PMC6925987 DOI: 10.1073/pnas.1905688116] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Indexed: 11/18/2022] Open
Abstract
Learning kinetic systems from data is one of the core challenges in many fields. Identifying stable models is essential for the generalization capabilities of data-driven inference. We introduce a computationally efficient framework, called CausalKinetiX, that identifies structure from discrete time, noisy observations, generated from heterogeneous experiments. The algorithm assumes the existence of an underlying, invariant kinetic model, a key criterion for reproducible research. Results on both simulated and real-world examples suggest that learning the structure of kinetic systems benefits from a causal perspective. The identified variables and models allow for a concise description of the dynamics across multiple experimental settings and can be used for prediction in unseen experiments. We observe significant improvements compared to well-established approaches focusing solely on predictive performance, especially for out-of-sample generalization.
Affiliation(s)
- Niklas Pfister
  - Seminar for Statistics, Eidgenössische Technische Hochschule Zürich, 8092 Zürich, Switzerland
- Stefan Bauer
  - Empirical Inference, Max-Planck-Institute for Intelligent Systems, 72076 Tübingen, Germany
- Jonas Peters
  - Department of Mathematical Sciences, University of Copenhagen, 2100 Copenhagen, Denmark
103
A High-Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks. Genes (Basel) 2019; 10:genes10120996. [PMID: 31810264 PMCID: PMC6947651 DOI: 10.3390/genes10120996] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 10/21/2019] [Revised: 11/23/2019] [Accepted: 11/26/2019] [Indexed: 12/12/2022] Open
Abstract
As time progresses and technology improves, biological data sets are continuously increasing in size. New methods and new implementations of existing methods are needed to keep pace with this increase. In this paper, we present a high-performance computing (HPC)-capable implementation of Iterative Random Forest (iRF). This new implementation enables the explainable-AI eQTL analysis of SNP sets with over a million SNPs. Using this implementation, we also present a new method, iRF Leave One Out Prediction (iRF-LOOP), for the creation of Predictive Expression Networks on the order of 40,000 genes or more. We compare the new implementation of iRF with the previous R version and analyze its time to completion on two of the world’s fastest supercomputers, Summit and Titan. We also show iRF-LOOP’s ability to capture biologically significant results when creating Predictive Expression Networks. This new implementation of iRF will enable the analysis of biological data sets at scales that were previously not possible.
104
Harfouche AL, Jacobson DA, Kainer D, Romero JC, Harfouche AH, Scarascia Mugnozza G, Moshelion M, Tuskan GA, Keurentjes JJ, Altman A. Accelerating Climate Resilient Plant Breeding by Applying Next-Generation Artificial Intelligence. Trends Biotechnol 2019; 37:1217-1235. [DOI: 10.1016/j.tibtech.2019.05.007] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Received: 02/11/2019] [Revised: 05/18/2019] [Accepted: 05/23/2019] [Indexed: 12/20/2022]
105
Murdoch WJ, Singh C, Kumbier K, Abbasi-Asl R, Yu B. Definitions, methods, and applications in interpretable machine learning. Proc Natl Acad Sci U S A 2019. [PMID: 31619572 DOI: 10.1073/pnas.1900654116] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 04/12/2023] Open
Abstract
Machine-learning models have demonstrated great success in learning complex patterns that enable them to make predictions about unobserved data. In addition to using models for prediction, the ability to interpret what a model has learned is receiving an increasing amount of attention. However, this increased focus has led to considerable confusion about the notion of interpretability. In particular, it is unclear how the wide array of proposed interpretation methods are related and what common concepts can be used to evaluate them. We aim to address these concerns by defining interpretability in the context of machine learning and introducing the predictive, descriptive, relevant (PDR) framework for discussing interpretations. The PDR framework provides 3 overarching desiderata for evaluation: predictive accuracy, descriptive accuracy, and relevancy, with relevancy judged relative to a human audience. Moreover, to help manage the deluge of interpretation methods, we introduce a categorization of existing techniques into model-based and post hoc categories, with subgroups including sparsity, modularity, and simulatability. To demonstrate how practitioners can use the PDR framework to evaluate and understand interpretations, we provide numerous real-world examples. These examples highlight the often underappreciated role played by human audiences in discussions of interpretability. Finally, based on our framework, we discuss limitations of existing methods and directions for future work. We hope that this work will provide a common vocabulary that will make it easier for both practitioners and researchers to discuss and choose from the full range of interpretation methods.
Affiliation(s)
- W James Murdoch
  - Statistics Department, University of California, Berkeley, CA 94720
- Chandan Singh
  - Electrical Engineering and Computer Science Department, University of California, Berkeley, CA 94720
- Karl Kumbier
  - Statistics Department, University of California, Berkeley, CA 94720
- Reza Abbasi-Asl
  - Electrical Engineering and Computer Science Department, University of California, Berkeley, CA 94720
  - Department of Neurology, University of California, San Francisco, CA 94158
  - Allen Institute for Brain Science, Seattle, WA 98109
- Bin Yu
  - Statistics Department, University of California, Berkeley, CA 94720
  - Electrical Engineering and Computer Science Department, University of California, Berkeley, CA 94720
106
Murdoch WJ, Singh C, Kumbier K, Abbasi-Asl R, Yu B. Definitions, methods, and applications in interpretable machine learning. Proc Natl Acad Sci U S A 2019; 116:22071-22080. [PMID: 31619572 PMCID: PMC6825274 DOI: 10.1073/pnas.1900654116] [Citation(s) in RCA: 373] [Impact Index Per Article: 74.6] [Indexed: 11/18/2022] Open
107
Vervier K, Michaelson JJ. TiSAn: estimating tissue-specific effects of coding and non-coding variants. Bioinformatics 2019; 34:3061-3068. [PMID: 29912365 PMCID: PMC6137979 DOI: 10.1093/bioinformatics/bty301] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 11/08/2017] [Accepted: 04/16/2018] [Indexed: 02/06/2023] Open
Abstract
Motivation: Model-based estimates of general deleteriousness, like CADD, DANN or PolyPhen, have become indispensable tools in the interpretation of genetic variants. However, these approaches say little about the tissues in which the effects of deleterious variants will be most meaningful. Tissue-specific annotations have been recently inferred for dozens of tissues/cell types from large collections of cross-tissue epigenomic data, and have demonstrated sensitivity in predicting affected tissues in complex traits. It remains unclear, however, whether including additional genome-scale data specific to the tissue of interest would appreciably improve functional annotations.
Results: Herein, we introduce TiSAn, a tool that integrates multiple genome-scale data sources, defined by expert knowledge. TiSAn uses machine learning to discriminate variants relevant to a tissue from those with no bearing on the function of that tissue. Predictions are made genome-wide, and can be used to contextualize and filter variants of interest in whole genome sequencing or genome-wide association studies. We demonstrate the accuracy and flexibility of TiSAn by producing predictive models for human heart and brain, and detecting tissue-relevant variations in large cohorts for autism spectrum disorder (TiSAn-brain) and coronary artery disease (TiSAn-heart). We find the multiomics TiSAn model is better able to prioritize genetic variants according to their tissue-specific action than the current state-of-the-art method, GenoSkyLine.
Availability and implementation: Software and vignettes are available at http://github.com/kevinVervier/TiSAn.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Kévin Vervier
  - Department of Psychiatry, Carver College of Medicine, University of Iowa, Iowa City, IA, USA
- Jacob J Michaelson
  - Department of Psychiatry, Carver College of Medicine, University of Iowa, Iowa City, IA, USA
108
Long GS, Hussen M, Dench J, Aris-Brosou S. Identifying genetic determinants of complex phenotypes from whole genome sequence data. BMC Genomics 2019; 20:470. [PMID: 31182025 PMCID: PMC6558885 DOI: 10.1186/s12864-019-5820-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 11/19/2018] [Accepted: 05/21/2019] [Indexed: 02/08/2023] Open
Abstract
BACKGROUND: A critical goal in biology is to relate the phenotype to the genotype, that is, to find the genetic determinants of various traits. However, while simple monofactorial determinants are relatively easy to identify, the underpinnings of complex phenotypes are harder to predict. While traditional approaches rely on genome-wide association studies based on Single Nucleotide Polymorphism data, the ability of machine learning algorithms to find these determinants in whole proteome data is still not well known.
RESULTS: To better understand the applicability of machine learning in this case, we implemented two such algorithms, adaptive boosting (AB) and repeated random forest (RRF), and developed a chunking layer that facilitates the analysis of whole proteome data. We first assessed the performance of these algorithms and tuned them on an influenza data set, for which the determinants of three complex phenotypes (infectivity, transmissibility, and pathogenicity) are known based on experimental evidence. This allowed us to show that chunking improves runtimes by an order of magnitude. Based on simulations, we showed that chunking also increases sensitivity of the predictions, reaching 100% with as few as 20 sequences in a small proteome as in the influenza case (5k sites), but may require at least 30 sequences to reach 90% on larger alignments (500k sites). While RRF has less specificity than random forest, it was never <50%, and RRF sensitivity was significantly higher at smaller chunk sizes. We then used these algorithms to predict the determinants of three types of drug resistance (to Ciprofloxacin, Ceftazidime, and Gentamicin) in a bacterium, Pseudomonas aeruginosa. While both algorithms performed well in the case of the influenza data, results were more nuanced in the bacterial case, with RRF making more sensible predictions, with smaller error rates, than AB.
CONCLUSIONS: Altogether, we demonstrated that ML algorithms can be used to identify genetic determinants in small proteomes (viruses), even when trained on small numbers of individuals. We further showed that our RRF algorithm may deserve more scrutiny, which should be facilitated by the decreasing costs of both sequencing and phenotyping of large cohorts of individuals.
Affiliation(s)
- George S Long
  - Department of Biology, University of Ottawa, Ottawa, Ontario, Canada
- Mohammed Hussen
  - Department of Biology, University of Ottawa, Ottawa, Ontario, Canada
- Jonathan Dench
  - Department of Biology, University of Ottawa, Ottawa, Ontario, Canada
- Stéphane Aris-Brosou
  - Department of Biology, University of Ottawa, Ottawa, Ontario, Canada; Department of Mathematics and Statistics, University of Ottawa, Ottawa, Ontario, Canada
109
A Brief Review of Random Forests for Water Scientists and Practitioners and Their Recent History in Water Resources. Water 2019. [DOI: 10.3390/w11050910] [Citation(s) in RCA: 93] [Impact Index Per Article: 18.6] [Indexed: 01/25/2023]
Abstract
Random forests (RF) is a supervised machine learning algorithm, which has recently started to gain prominence in water resources applications. However, existing applications are generally restricted to the implementation of Breiman’s original algorithm for regression and classification problems, while numerous developments could be also useful in solving diverse practical problems in the water sector. Here we popularize RF and their variants for the practicing water scientist, and discuss related concepts and techniques, which have received less attention from the water science and hydrologic communities. In doing so, we review RF applications in water resources, highlight the potential of the original algorithm and its variants, and assess the degree of RF exploitation in a diverse range of applications. Relevant implementations of random forests, as well as related concepts and techniques in the R programming language, are also covered.
110
Farbehi N, Patrick R, Dorison A, Xaymardan M, Janbandhu V, Wystub-Lis K, Ho JW, Nordon RE, Harvey RP. Single-cell expression profiling reveals dynamic flux of cardiac stromal, vascular and immune cells in health and injury. eLife 2019; 8:43882. [PMID: 30912746 PMCID: PMC6459677 DOI: 10.7554/elife.43882] [Citation(s) in RCA: 318] [Impact Index Per Article: 63.6] [Received: 11/26/2018] [Accepted: 03/25/2019] [Indexed: 12/11/2022] Open
Abstract
Besides cardiomyocytes (CM), the heart contains numerous interstitial cell types which play key roles in heart repair, regeneration and disease, including fibroblast, vascular and immune cells. However, a comprehensive understanding of this interactive cell community is lacking. We performed single-cell RNA-sequencing of the total non-CM fraction and enriched (Pdgfra-GFP+) fibroblast lineage cells from murine hearts at days 3 and 7 post-sham or myocardial infarction (MI) surgery. Clustering of >30,000 single cells identified >30 populations representing nine cell lineages, including a previously undescribed fibroblast lineage trajectory present in both sham and MI hearts leading to a uniquely activated cell state defined in part by a strong anti-WNT transcriptome signature. We also uncovered novel myofibroblast subtypes expressing either pro-fibrotic or anti-fibrotic signatures. Our data highlight non-linear dynamics in myeloid and fibroblast lineages after cardiac injury, and provide an entry point for deeper analysis of cardiac homeostasis, inflammation, fibrosis, repair and regeneration.
In our bodies, heart attacks lead to cell death and inflammation. This is then followed by a healing phase where the organ repairs itself. There are many types of heart cells, from muscle and pacemaker cells that help to create the beating motion, to so-called fibroblasts that act as a supporting network. Yet, it is still unclear how individual cells participate in the heart's response to injury. All cells possess the same genetic information, but they turn on or off different genes depending on the specific tasks that they need to perform. Spotting which genes are activated in individual cells can therefore provide clues about their exact roles in the body. Until recently, technological limitations meant that this information was difficult to access, because it was only possible to capture the global response of a group of cells in a sample.
A new method called single-cell RNA sequencing is now allowing researchers to study the activities of many genes in thousands of individual cells at the same time. Here, Farbehi, Patrick et al. performed single-cell RNA sequencing on over 30,000 individual cells from healthy and injured mouse hearts. Computational approaches were then used to cluster cells into groups according to the activities of their genes. The experiments identified over 30 distinct sub-types of cell, including several that were previously unknown. For example, a group of fibroblasts that express a gene called Wif1 was discovered. Previous genetic studies have shown that Wif1 is essential for the heart's response to injury. Further experiments by Farbehi, Patrick et al. indicated that this new sub-type of cells may control the timing of the different aspects of heart repair after damage. Tens of millions of people around the world suffer from heart attacks and other heart diseases. Knowing how different types of heart cells participate in repair mechanisms may help to find new targets for drugs and other treatments.
Affiliation(s)
- Nona Farbehi
  - Victor Chang Cardiac Research Institute, Darlinghurst, Australia; Stem Cells Australia, Melbourne Brain Centre, University of Melbourne, Victoria, Australia; Garvan Weizmann Centre for Cellular Genomics, Garvan Institute of Medical Research, Sydney, Australia; Graduate School of Biomedical Engineering, UNSW Sydney, Kensington, Australia
- Ralph Patrick
  - Victor Chang Cardiac Research Institute, Darlinghurst, Australia; Stem Cells Australia, Melbourne Brain Centre, University of Melbourne, Victoria, Australia; St. Vincent's Clinical School, UNSW Sydney, Kensington, Australia
- Aude Dorison
  - Victor Chang Cardiac Research Institute, Darlinghurst, Australia; Stem Cells Australia, Melbourne Brain Centre, University of Melbourne, Victoria, Australia
- Munira Xaymardan
  - Victor Chang Cardiac Research Institute, Darlinghurst, Australia; Stem Cells Australia, Melbourne Brain Centre, University of Melbourne, Victoria, Australia; School of Dentistry, Faculty of Medicine and Health, University of Sydney, Westmead Hospital, Westmead, Australia
- Vaibhao Janbandhu
  - Victor Chang Cardiac Research Institute, Darlinghurst, Australia; Stem Cells Australia, Melbourne Brain Centre, University of Melbourne, Victoria, Australia; St. Vincent's Clinical School, UNSW Sydney, Kensington, Australia
- Joshua Wk Ho
  - Victor Chang Cardiac Research Institute, Darlinghurst, Australia; St. Vincent's Clinical School, UNSW Sydney, Kensington, Australia
- Robert E Nordon
  - Stem Cells Australia, Melbourne Brain Centre, University of Melbourne, Victoria, Australia; Graduate School of Biomedical Engineering, UNSW Sydney, Kensington, Australia
- Richard P Harvey
  - Victor Chang Cardiac Research Institute, Darlinghurst, Australia; Stem Cells Australia, Melbourne Brain Centre, University of Melbourne, Victoria, Australia; School of Biotechnology and Biomolecular Science, UNSW Sydney, Kensington, Australia
111
Azuaje F. Artificial intelligence for precision oncology: beyond patient stratification. NPJ Precis Oncol 2019; 3:6. [PMID: 30820462 PMCID: PMC6389974 DOI: 10.1038/s41698-019-0078-1] [Citation(s) in RCA: 64] [Impact Index Per Article: 12.8] [Received: 11/05/2018] [Accepted: 01/22/2019] [Indexed: 12/18/2022] Open
Abstract
The data-driven identification of disease states and treatment options is a crucial challenge for precision oncology. Artificial intelligence (AI) offers unique opportunities for enhancing such predictive capabilities in the lab and the clinic. AI, including its best-known branch of research, machine learning, has significant potential to enable precision oncology well beyond relatively well-known pattern recognition applications, such as the supervised classification of single-source omics or imaging datasets. This perspective highlights key advances and challenges in that direction. Furthermore, it argues that AI's scope and depth of research need to be expanded to achieve ground-breaking progress in precision oncology.
Affiliation(s)
- Francisco Azuaje
  - Bioinformatics and Modelling Research Group, Department of Oncology, Luxembourg Institute of Health (LIH), L-1445 Strassen, Luxembourg
  - Present Address: Computational Biomedicine Research Group, Center for Quantitative Biology, Luxembourg Institute of Health (LIH), L-1445 Strassen, Luxembourg
112
Deshpande S, Shuttleworth J, Yang J, Taramonli S, England M. PLIT: An alignment-free computational tool for identification of long non-coding RNAs in plant transcriptomic datasets. Comput Biol Med 2019; 105:169-181. [DOI: 10.1016/j.compbiomed.2018.12.014] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 09/18/2018] [Revised: 12/27/2018] [Accepted: 12/29/2018] [Indexed: 02/05/2023]
113