51
Zhang F, Gou J. Machine learning assessment of risk factors for depression in later adulthood. THE LANCET REGIONAL HEALTH. EUROPE 2022; 18:100399. [PMID: 35586270 PMCID: PMC9109181 DOI: 10.1016/j.lanepe.2022.100399] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Affiliation(s)
- Fengqing Zhang
  - Department of Psychological and Brain Sciences, Drexel University, 3201 Chestnut Street, Philadelphia, PA 19104, USA
- Jiangtao Gou
  - Department of Mathematics and Statistics, Villanova University, 800 E. Lancaster Ave., Villanova, PA 19085, USA
52
Wang Z, Niu Y, Vashisth T, Li J, Madden R, Livingston TS, Wang Y. Nontargeted metabolomics-based multiple machine learning modeling boosts early accurate detection for citrus Huanglongbing. HORTICULTURE RESEARCH 2022; 9:uhac145. [PMID: 36061619 PMCID: PMC9433982 DOI: 10.1093/hr/uhac145] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 03/06/2022] [Accepted: 06/20/2022] [Indexed: 06/15/2023]
Abstract
Early accurate detection of crop disease is extremely important for timely disease management. Huanglongbing (HLB), one of the most destructive citrus diseases, has brought about severe economic losses for the global citrus industry. The direct strategies for HLB identification, such as quantitative real-time polymerase chain reaction (qPCR) and chemical staining, are robust for symptomatic plants but powerless for asymptomatic ones at the early stage of infection. Thus, a practical method for the early detection of HLB is needed. In this study, a novel method combining ultra-high performance liquid chromatography/mass spectrometry (UHPLC/MS)-based nontargeted metabolomics and machine learning (ML) was developed for the early detection of HLB for the first time. Six ML algorithms were selected to build the classifiers. Regularized logistic regression (LR-L2) and gradient-boosted decision tree (GBDT) outperformed the others, with the highest average accuracy of 95.83%, and not only classified healthy and infected plants but also identified significant features. The proposed method proved to be practical for early detection of HLB: it tackled the low sensitivity of the conventional methods and avoided problems such as lighting-condition interference in spectrum/image recognition-based ML methods. Additionally, the discovered biomarkers were verified by metabolic pathway analysis and content change analysis, and the results were remarkably consistent with previous reports.
Affiliation(s)
- Zhixin Wang
  - Citrus Research & Education Center, Institute of Food and Agricultural Sciences, University of Florida, Lake Alfred, Florida 33850-2299, U.S.A.
- Yue Niu
  - Department of Mathematics, University of Arizona, Tucson, Arizona 85721-0089, U.S.A.
- Tripti Vashisth
  - Citrus Research & Education Center, Institute of Food and Agricultural Sciences, University of Florida, Lake Alfred, Florida 33850-2299, U.S.A.
- Jingwen Li
  - Citrus Research & Education Center, Institute of Food and Agricultural Sciences, University of Florida, Lake Alfred, Florida 33850-2299, U.S.A.
- Robert Madden
  - Citrus Research & Education Center, Institute of Food and Agricultural Sciences, University of Florida, Lake Alfred, Florida 33850-2299, U.S.A.
- Taylor Shea Livingston
  - Citrus Research & Education Center, Institute of Food and Agricultural Sciences, University of Florida, Lake Alfred, Florida 33850-2299, U.S.A.
- Yu Wang
  - Corresponding author
53
Walker AM, Cliff A, Romero J, Shah MB, Jones P, Felipe Machado Gazolla JG, Jacobson DA, Kainer D. Evaluating the Performance of Random Forest and Iterative Random Forest Based Methods when Applied to Gene Expression Data. Comput Struct Biotechnol J 2022; 20:3372-3386. [PMID: 35832622 PMCID: PMC9260260 DOI: 10.1016/j.csbj.2022.06.037] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/04/2022] [Revised: 06/14/2022] [Accepted: 06/14/2022] [Indexed: 11/30/2022]
Abstract
Gene-to-gene networks, such as Gene Regulatory Networks (GRN) and Predictive Expression Networks (PEN), capture relationships between genes and are beneficial for use in downstream biological analyses. There exist multiple network inference tools to produce these gene-to-gene networks from matrices of gene expression data. Random Forest-Leave One Out Prediction (RF-LOOP), frequently known as GEne Network Inference with Ensemble of trees (GENIE3), is a method that has been shown to be efficient at producing these gene-to-gene networks. Random Forest can be replaced in this process by iterative Random Forest (iRF), which performs variable selection and boosting. Here we validate that iterative Random Forest-Leave One Out Prediction (iRF-LOOP) produces higher quality networks than GENIE3 (RF-LOOP). We use both synthetic and empirical networks from the Dialogue for Reverse Engineering Assessment and Methods (DREAM) Challenges by Sage Bionetworks, as well as two additional empirical networks created from Arabidopsis thaliana and Populus trichocarpa expression data.
Affiliation(s)
- Angelica M. Walker
  - The Bredesen Center for Interdisciplinary Research and Graduate Education, University of Tennessee Knoxville, 821 Volunteer Blvd, Knoxville 37996, TN, USA
- Ashley Cliff
  - The Bredesen Center for Interdisciplinary Research and Graduate Education, University of Tennessee Knoxville, 821 Volunteer Blvd, Knoxville 37996, TN, USA
- Jonathon Romero
  - The Bredesen Center for Interdisciplinary Research and Graduate Education, University of Tennessee Knoxville, 821 Volunteer Blvd, Knoxville 37996, TN, USA
- Manesh B. Shah
  - Computational and Predictive Biology, Oak Ridge National Laboratory, 1 Bethel Valley Rd, Oak Ridge 37830, TN, USA
- Piet Jones
  - The Bredesen Center for Interdisciplinary Research and Graduate Education, University of Tennessee Knoxville, 821 Volunteer Blvd, Knoxville 37996, TN, USA
- Daniel A Jacobson
  - Computational and Predictive Biology, Oak Ridge National Laboratory, 1 Bethel Valley Rd, Oak Ridge 37830, TN, USA
  - Corresponding authors.
- David Kainer
  - Computational and Predictive Biology, Oak Ridge National Laboratory, 1 Bethel Valley Rd, Oak Ridge 37830, TN, USA
  - Corresponding authors.
54
Provable Boolean interaction recovery from tree ensemble obtained via random forests. Proc Natl Acad Sci U S A 2022; 119:e2118636119. [PMID: 35609192 PMCID: PMC9295780 DOI: 10.1073/pnas.2118636119] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]
Abstract
Significance: Random Forests (RFs) are among the most successful machine-learning algorithms in terms of prediction accuracy. In many domain problems, however, the primary goal is not prediction, but to understand the data-generation process, in particular, finding important features and feature interactions. There exists strong empirical evidence that RF-based methods, in particular iterative RF (iRF), are very successful in terms of detecting feature interactions. In this work, we propose a biologically motivated, Boolean interaction model. Using this model, we complement the existing empirical evidence with theoretical evidence for the ability of iRF-type methods to select desirable interactions. Our theoretical analysis also yields deeper insights into the general interaction selection mechanism of decision-tree algorithms and the importance of feature subsampling.
55
Sadique Z, Grieve R, Diaz-Ordaz K, Mouncey P, Lamontagne F, O’Neill S. A Machine-Learning Approach for Estimating Subgroup- and Individual-Level Treatment Effects: An Illustration Using the 65 Trial. Med Decis Making 2022; 42:923-936. [PMID: 35607982 PMCID: PMC9459357 DOI: 10.1177/0272989x221100717] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Personalizing treatment recommendations or guidelines requires evidence about the heterogeneity of treatment effects (HTE). Machine-learning (ML) approaches can explore HTE by considering many covariates, including complex interactions between them. Causal ML approaches can avoid overfitting, which arises when the same dataset is used to select covariate-by-treatment interaction terms as to make inferences, and reduce reliance on the correct specification of fixed parametric models. We investigate causal forests (CF), an ML method based on modified decision trees that can estimate subgroup- and individual-level treatment effects, without requiring correct prespecification of the effect model. We consider CF alongside parametric approaches for estimating HTE, within the 65 Trial, which evaluates the effect of a permissive hypotension strategy versus usual care on 90-d mortality for critically ill patients aged 65 y or older with vasodilatory hypotension. Here, the CF approach provides similar estimates of treatment effectiveness for prespecified and post hoc subgroups to the parametric approach, and the results of a test for overall HTE show weak evidence of heterogeneity. The CF estimates of individual-level treatment effects, the expected effects of treatment for individuals in subpopulations defined by their covariates, suggest that the permissive hypotension strategy is expected to reduce 90-d mortality for 98.7% of patients but with 95% confidence intervals that include zero for 71.6% of patients. An ML approach is then used to assess the patient characteristics associated with these individual-level effects, and to help target future research that can identify those patient subgroups for whom the intervention is most effective.
Affiliation(s)
- Zia Sadique
  - Department of Health Services Research and Policy, London School of Hygiene & Tropical Medicine, London, UK
- Richard Grieve
  - R. Grieve, Department of Health Services Research and Policy, London School of Hygiene and Tropical Medicine, 15-17 Tavistock Place, WC1H 9SH, London
- Karla Diaz-Ordaz
  - Department of Medical Statistics, London School of Hygiene & Tropical Medicine, London, UK
- Paul Mouncey
  - Clinical Trials Unit, Intensive Care National Audit & Research Centre (ICNARC), London, UK
- Francois Lamontagne
  - Université de Sherbrooke, Quebec, Canada
  - Centre de Recherche du Centre Hospitalier Universitaire de Sherbrooke, Quebec, Canada
- Stephen O’Neill
  - Department of Health Services Research and Policy, London School of Hygiene & Tropical Medicine, London, UK
56
You X, Dadwal UC, Lenburg ME, Kacena MA, Charles JF. Murine Gut Microbiome Meta-analysis Reveals Alterations in Carbohydrate Metabolism in Response to Aging. mSystems 2022; 7:e0124821. [PMID: 35400171 PMCID: PMC9040766 DOI: 10.1128/msystems.01248-21] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2021] [Accepted: 03/28/2022] [Indexed: 11/23/2022]
Abstract
Compositional and functional alterations to the gut microbiota during aging are hypothesized to potentially impact our health. Thus, determining aging-specific gut microbiome alterations is critical for developing microbiome-based strategies to improve health and promote longevity in the elderly. In this study, we performed a meta-analysis of publicly available 16S rRNA gene sequencing data from studies investigating the effect of aging on the gut microbiome in mice. Aging reproducibly increased gut microbial alpha diversity and shifted the microbial community structure in mice. We applied the bioinformatic tool PICRUSt2 to predict microbial metagenome function and established a random forest classifier to differentiate between microbial communities from young and old hosts and to identify aging-specific metabolic features. In independent validation data sets, this classifier achieved an area under the receiver operating characteristic curve (AUC) of 0.75 to 0.97 in differentiating microbiomes from young and old hosts. We found that 50% of the most important predicted aging-specific metabolic features were involved in carbohydrate metabolism. Furthermore, fecal short-chain fatty acid (SCFA) concentrations were significantly decreased in old mice, and the expression of the SCFA receptor Gpr41 in the colon was significantly correlated with the relative abundances of gut microbes and microbial carbohydrate metabolic pathways. In conclusion, this study identified aging-specific alterations in the composition and function of the gut microbiome and revealed a potential relationship between aging, microbial carbohydrate metabolism, fecal SCFA, and colonic Gpr41 expression.
IMPORTANCE Aging-associated microbial alteration is hypothesized to play an important role in host health and longevity. However, investigations regarding specific gut microbes or microbial functional alterations associated with aging have had inconsistent results. We performed a meta-analysis across 5 independent studies to investigate the effect of aging on the gut microbiome in mice. Our analysis revealed that aging increased gut microbial alpha diversity and shifted the microbial community structure. To determine if we could reliably differentiate the gut microbiomes from young and old hosts, we established a random forest classifier based on predicted metagenome function and validated its performance against independent data sets. Alterations in microbial carbohydrate metabolism and decreased fecal short-chain fatty acid (SCFA) concentrations were key features of aging and correlated with host colonic expression of the SCFA receptor Gpr41. This study advances our understanding of the impact of aging on the gut microbiome and proposes a hypothesis that alterations in gut microbiota-derived SCFA-host GPR41 signaling are a feature of aging.
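As a reading aid for the AUC range quoted in this abstract, the following is a minimal sketch of how an area under the ROC curve can be computed from classifier scores, using the pairwise-ranking definition (the fraction of positive/negative pairs in which the positive sample scores higher, with ties counted as one half). The function name `roc_auc`, the labels, and the scores are hypothetical and are not taken from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class (e.g. "old host"), 0 for the negative class.
    scores: classifier scores, higher meaning "more likely positive".
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for six samples (three "old", three "young").
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported 0.75 to 0.97 should be read.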
Affiliation(s)
- Xiaomeng You
  - Department of Orthopaedic Surgery, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts, USA
- Ushashi C. Dadwal
  - Orthopaedic Surgery, Indiana University School of Medicine, Indianapolis, Indiana, USA
- Marc E. Lenburg
  - Department of Medicine, Boston University School of Medicine, Boston, Massachusetts, USA
- Melissa A. Kacena
  - Orthopaedic Surgery, Indiana University School of Medicine, Indianapolis, Indiana, USA
- Julia F. Charles
  - Department of Orthopaedic Surgery, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts, USA
  - Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts, USA
57
Machine Learning algorithm unveils glutamatergic alterations in the post-mortem schizophrenia brain. NPJ SCHIZOPHRENIA 2022; 8:8. [PMID: 35217646 PMCID: PMC8881508 DOI: 10.1038/s41537-022-00231-1] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 07/05/2021] [Accepted: 12/06/2021] [Indexed: 01/24/2023]
Abstract
Schizophrenia is a disorder of synaptic plasticity and aberrant connectivity in which a major dysfunction in the glutamate synapse has been suggested. However, a multi-level approach tackling diverse clusters of interacting molecules of the glutamate signaling in schizophrenia is still lacking. We investigated, in the post-mortem dorsolateral prefrontal cortex (DLPFC) and hippocampus of schizophrenia patients and non-psychiatric controls, the levels of neuroactive d- and l-amino acids (l-glutamate, d-serine, glycine, l-aspartate, d-aspartate) by HPLC. Moreover, by quantitative RT-PCR and western blotting we analyzed, respectively, the mRNA and protein levels of pre- and post-synaptic key molecules involved in the glutamatergic synapse functioning, including glutamate receptors (NMDA, AMPA, metabotropic), their interacting scaffolding proteins (PSD-95, Homer1b/c), plasma membrane and vesicular glutamate transporters (EAAT1, EAAT2, VGluT1, VGluT2), enzymes involved either in glutamate-dependent GABA neurotransmitter synthesis (GAD65 and 67) or in post-synaptic NMDA receptor-mediated signaling (CAMKIIα), and the pre-synaptic marker Synapsin-1. Univariable analyses revealed that none of the investigated molecules was differently represented in the post-mortem DLPFC and hippocampus of schizophrenia patients, compared with controls. Nonetheless, multivariable hypothesis-driven analyses revealed that the presence of schizophrenia was significantly affected by variations in neuroactive amino acid levels and glutamate-related synaptic elements. Furthermore, a hypothesis-free Machine Learning analysis unveiled other discriminative clusters of molecules, one in the DLPFC and another in the hippocampus. Overall, while confirming a key role of the glutamatergic synapse in the molecular pathophysiology of schizophrenia, we reported molecular signatures encompassing elements of the glutamate synapse able to discriminate patients with schizophrenia from normal individuals.
58
Minamikawa MF, Nonaka K, Hamada H, Shimizu T, Iwata H. Dissecting Breeders' Sense via Explainable Machine Learning Approach: Application to Fruit Peelability and Hardness in Citrus. FRONTIERS IN PLANT SCIENCE 2022; 13:832749. [PMID: 35222489 PMCID: PMC8867066 DOI: 10.3389/fpls.2022.832749] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/10/2021] [Accepted: 01/17/2022] [Indexed: 06/14/2023]
Abstract
"Genomics-assisted breeding", which utilizes genomics-based methods, e.g., genome-wide association study (GWAS) and genomic selection (GS), has been attracting attention, especially in the field of fruit breeding. Low-cost genotyping technologies that support genome-assisted breeding have already been established. However, efficient collection of large amounts of high-quality phenotypic data is essential for the success of such breeding. Most of the fruit quality traits have been sensorily and visually evaluated by professional breeders. However, the fruit morphological features that serve as the basis for such sensory and visual judgments are unclear. This makes it difficult to efficiently collect phenotypic data on fruit quality traits using image analysis. In this study, we developed a method to automatically measure the morphological features of citrus fruits by image analysis of cross-sectional images of the fruits. We applied explainable machine learning methods and Bayesian networks to determine the relationship between fruit morphological features and two sensorily evaluated fruit quality traits: easiness of peeling (Peeling) and fruit hardness (FruH). In all the methods applied in this study, the degradation area of the central core of the fruit was significantly and directly associated with both Peeling and FruH, while the seed area was significantly and directly related to FruH alone. The degradation area of albedo and the area of flavedo were also significantly and directly related to Peeling and FruH, respectively, except in one or two methods. These results suggest that an approach that combines explainable machine learning methods, Bayesian networks, and image analysis can be effective in dissecting the experienced sense of a breeder. In breeding programs, collecting fruit images and efficiently measuring and documenting fruit morphological features that are related to fruit quality traits may increase the size of data for the analysis and improve the accuracy of GWAS and GS on the quality traits of citrus fruits.
Affiliation(s)
- Mai F. Minamikawa
  - Laboratory of Biometry and Bioinformatics, Department of Agricultural and Environmental Biology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan
- Keisuke Nonaka
  - Institute of Fruit Tree and Tea Science, National Agriculture and Food Research Organization (NARO), Shizuoka, Japan
- Hiroko Hamada
  - Institute of Fruit Tree and Tea Science, National Agriculture and Food Research Organization (NARO), Shizuoka, Japan
- Tokurou Shimizu
  - Institute of Fruit Tree and Tea Science, National Agriculture and Food Research Organization (NARO), Shizuoka, Japan
- Hiroyoshi Iwata
  - Laboratory of Biometry and Bioinformatics, Department of Agricultural and Environmental Biology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan
59
Ji X, Lin L, Fan J, Li Y, Wei Y, Shen S, Su L, Shafer A, Bjaanæs MM, Karlsson A, Planck M, Staaf J, Helland Å, Esteller M, Zhang R, Chen F, Christiani DC. Epigenome-wide three-way interaction study identifies a complex pattern between TRIM27, KIAA0226, and smoking associated with overall survival of early-stage NSCLC. Mol Oncol 2022; 16:717-731. [PMID: 34932879 PMCID: PMC8807353 DOI: 10.1002/1878-0261.13167] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/03/2021] [Revised: 11/23/2021] [Accepted: 12/20/2021] [Indexed: 01/12/2023]
Abstract
The interaction between DNA methylation of tripartite motif containing 27 (cg05293407, TRIM27) and smoking has previously been identified to reveal histologically heterogeneous effects of TRIM27 DNA methylation on early-stage non-small-cell lung cancer (NSCLC) survival. However, to understand the complex mechanisms underlying NSCLC progression, we searched for three-way interactions. A two-phase study was adopted to identify three-way interactions in the form of pack-year of smoking (number of cigarettes smoked per day × number of years smoked) × cg05293407 (TRIM27) × epigenome-wide DNA methylation CpG probe. Two CpG probes were identified with FDR-q ≤ 0.05 in the discovery phase and P ≤ 0.05 in the validation phase: cg00060500 (KIAA0226) and cg17479956 (EXT2). Compared to a prediction model with only clinical information, a model that added 42 significant three-way interactions selected using a looser criterion (discovery: FDR-q ≤ 0.10, validation: P ≤ 0.05) substantially improved the area under the receiver operating characteristic curve (AUC) of the prognostic prediction model for both 3-year and 5-year survival. Our research identified complex interaction effects among multiple environmental and epigenetic factors, and provided a therapeutic target for NSCLC patients.
Affiliation(s)
- Xinyu Ji
  - Department of Biostatistics, Center for Global Health, School of Public Health, Nanjing Medical University, Nanjing, China
- Lijuan Lin
  - Department of Biostatistics, Center for Global Health, School of Public Health, Nanjing Medical University, Nanjing, China
- Juanjuan Fan
  - Department of Biostatistics, Center for Global Health, School of Public Health, Nanjing Medical University, Nanjing, China
- Yi Li
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Yongyue Wei
  - Department of Biostatistics, Center for Global Health, School of Public Health, Nanjing Medical University, Nanjing, China
  - Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - China International Cooperation Center for Environment and Human Health, Nanjing Medical University, Nanjing, China
- Sipeng Shen
  - Department of Biostatistics, Center for Global Health, School of Public Health, Nanjing Medical University, Nanjing, China
- Li Su
  - Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Andrea Shafer
  - Pulmonary and Critical Care Division, Department of Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Maria Moksnes Bjaanæs
  - Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital, Oslo, Norway
- Anna Karlsson
  - Division of Oncology, Department of Clinical Sciences Lund and CREATE Health Strategic Center for Translational Cancer Research, Lund University, Lund, Sweden
- Maria Planck
  - Division of Oncology, Department of Clinical Sciences Lund and CREATE Health Strategic Center for Translational Cancer Research, Lund University, Lund, Sweden
- Johan Staaf
  - Division of Oncology, Department of Clinical Sciences Lund and CREATE Health Strategic Center for Translational Cancer Research, Lund University, Lund, Sweden
- Åslaug Helland
  - Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital, Oslo, Norway
  - Institute of Clinical Medicine, University of Oslo, Oslo, Norway
- Manel Esteller
  - Josep Carreras Leukaemia Research Institute, Barcelona, Spain
  - Centro de Investigacion Biomedica en Red Cancer, Madrid, Spain
  - Institucio Catalana de Recerca i Estudis Avançats, Barcelona, Spain
  - Physiological Sciences Department, School of Medicine and Health Sciences, University of Barcelona, Barcelona, Spain
- Ruyang Zhang
  - Department of Biostatistics, Center for Global Health, School of Public Health, Nanjing Medical University, Nanjing, China
  - Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - China International Cooperation Center for Environment and Human Health, Nanjing Medical University, Nanjing, China
- Feng Chen
  - Department of Biostatistics, Center for Global Health, School of Public Health, Nanjing Medical University, Nanjing, China
  - China International Cooperation Center for Environment and Human Health, Nanjing Medical University, Nanjing, China
  - State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, China
  - Jiangsu Key Lab of Cancer Biomarkers, Prevention and Treatment, Cancer Center, Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, Nanjing, China
- David C. Christiani
  - Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Pulmonary and Critical Care Division, Department of Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
60
Prates ET, Garvin MR, Jones P, Miller JI, Sullivan KA, Cliff A, Gazolla JGFM, Shah MB, Walker AM, Lane M, Rentsch CT, Justice A, Pavicic M, Romero J, Jacobson D. Antiviral Strategies Against SARS-CoV-2: A Systems Biology Approach. Methods Mol Biol 2022; 2452:317-351. [PMID: 35554915 DOI: 10.1007/978-1-0716-2111-0_19] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
The unprecedented scientific achievements in combating the COVID-19 pandemic reflect a global response informed by unprecedented access to data. We now have the ability to rapidly generate a diversity of information on an emerging pathogen and, by using high-performance computing and a systems biology approach, we can mine this wealth of information to understand the complexities of viral pathogenesis and contagion like never before. These efforts will aid in the development of vaccines and antiviral medications, and will inform policymakers and clinicians. Here we detail computational protocols developed as SARS-CoV-2 began to spread across the globe. They include pathogen detection, comparative structural proteomics, evolutionary adaptation analysis via network and artificial intelligence methodologies, and multiomic integration. These protocols constitute a core framework on which to build a systems-level infrastructure that can be quickly brought to bear on future pathogens before they evolve into pandemic proportions.
Affiliation(s)
- Erica T Prates
  - Oak Ridge National Laboratory, Computational Systems Biology, Oak Ridge, TN, USA
  - National Virtual Biotechnology Laboratory, US Department of Energy, Washington, DC, USA
- Michael R Garvin
  - Oak Ridge National Laboratory, Computational Systems Biology, Oak Ridge, TN, USA
  - National Virtual Biotechnology Laboratory, US Department of Energy, Washington, DC, USA
- Piet Jones
  - The Bredesen Center for Interdisciplinary Research and Graduate Education, University of Tennessee Knoxville, Knoxville, TN, USA
- J Izaak Miller
  - Oak Ridge National Laboratory, Computational Systems Biology, Oak Ridge, TN, USA
  - National Virtual Biotechnology Laboratory, US Department of Energy, Washington, DC, USA
- Kyle A Sullivan
  - Oak Ridge National Laboratory, Computational Systems Biology, Oak Ridge, TN, USA
  - National Virtual Biotechnology Laboratory, US Department of Energy, Washington, DC, USA
- Ashley Cliff
  - The Bredesen Center for Interdisciplinary Research and Graduate Education, University of Tennessee Knoxville, Knoxville, TN, USA
- Joao Gabriel Felipe Machado Gazolla
  - Oak Ridge National Laboratory, Computational Systems Biology, Oak Ridge, TN, USA
  - National Virtual Biotechnology Laboratory, US Department of Energy, Washington, DC, USA
- Manesh B Shah
  - Genome Science and Technology, University of Tennessee Knoxville, Knoxville, TN, USA
- Angelica M Walker
  - The Bredesen Center for Interdisciplinary Research and Graduate Education, University of Tennessee Knoxville, Knoxville, TN, USA
- Matthew Lane
  - The Bredesen Center for Interdisciplinary Research and Graduate Education, University of Tennessee Knoxville, Knoxville, TN, USA
- Christopher T Rentsch
  - Faculty of Epidemiology and Population Health, London School of Hygiene and Tropical Medicine, London, UK
  - VA Connecticut Healthcare/General Internal Medicine, West Haven, CT, USA
- Amy Justice
  - VA Connecticut Healthcare/General Internal Medicine, West Haven, CT, USA
  - Yale University School of Medicine, New Haven, CT, USA
- Mirko Pavicic
  - Oak Ridge National Laboratory, Computational Systems Biology, Oak Ridge, TN, USA
  - National Virtual Biotechnology Laboratory, US Department of Energy, Washington, DC, USA
- Jonathon Romero
  - The Bredesen Center for Interdisciplinary Research and Graduate Education, University of Tennessee Knoxville, Knoxville, TN, USA
- Daniel Jacobson
  - Oak Ridge National Laboratory, Computational Systems Biology, Oak Ridge, TN, USA
  - National Virtual Biotechnology Laboratory, US Department of Energy, Washington, DC, USA
  - The Bredesen Center for Interdisciplinary Research and Graduate Education, University of Tennessee Knoxville, Knoxville, TN, USA
  - Genome Science and Technology, University of Tennessee Knoxville, Knoxville, TN, USA
  - Department of Psychology, NeuroNet Research Center, University of Tennessee Knoxville, Knoxville, TN, USA
61
Abstract
Interpretability is becoming an active research topic as machine learning (ML) models are more widely used to make critical decisions. Tabular data are one of the most commonly used modes of data in diverse applications such as healthcare and finance. Many of the existing interpretability methods used for tabular data only report feature-importance scores, either locally (per example) or globally (per model), but they do not provide interpretation or visualization of how the features interact. We address this limitation by introducing Feature Vectors, a new global interpretability method designed for tabular datasets. In addition to providing feature-importance, Feature Vectors discovers the inherent semantic relationship among features via an intuitive feature visualization technique. Our systematic experiments demonstrate the empirical utility of this new method by applying it to several real-world datasets. We further provide an easy-to-use Python package for Feature Vectors.
|
62
|
Sanchez CD, Brown JB, Gal-Oz O, Singer E. EcoPLOT: dynamic analysis of biogeochemical data. Bioinformatics 2021; 38:1480-1482. [PMID: 34927685 PMCID: PMC8825466 DOI: 10.1093/bioinformatics/btab842] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2021] [Revised: 12/02/2021] [Accepted: 12/14/2021] [Indexed: 01/05/2023] Open
Abstract
Motivation: We have created EcoPLOT (parameterized linkage of omics-driven technologies), a web-app for the dynamic, interactive analysis of biogeochemical datasets that combines state-of-the-art analysis tools to statistically and graphically explore environmental, geochemical and microbiome datasets. Using the iterative random forest, a machine learning algorithm, EcoPLOT allows for the de novo discovery of drivers which exhibit significant impact on plant, microbial or soil dynamics. Availability and implementation: EcoPLOT is built entirely within the R language. It can be accessed through any system where R is installed, including Windows, Mac and most Linux systems. EcoPLOT is free to use and can be accessed at https://github.com/cdsanchez18/EcoPLOT.
Affiliation(s)
- Christopher D Sanchez
- Lawrence Berkeley National Laboratory, Berkeley, CA 94710, USA. To whom correspondence should be addressed.
| | | | - Omree Gal-Oz
- Lawrence Berkeley National Laboratory, Berkeley, CA 94710, USA
| | - Esther Singer
- Lawrence Berkeley National Laboratory, Berkeley, CA 94710, USA
- DOE Joint Genome Institute, Berkeley, CA 94720, USA. To whom correspondence should be addressed.
| |
|
63
|
Branch CL, Semenov GA, Wagner DN, Sonnenberg BR, Pitera AM, Bridge ES, Taylor SA, Pravosudov VV. The genetic basis of spatial cognitive variation in a food-caching bird. Curr Biol 2021; 32:210-219.e4. [PMID: 34735793 DOI: 10.1016/j.cub.2021.10.036] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 07/08/2021] [Revised: 09/15/2021] [Accepted: 10/14/2021] [Indexed: 01/02/2023]
Abstract
Spatial cognition is used by most organisms to navigate their environment. Some species rely particularly heavily on specialized spatial cognition to survive, suggesting that a heritable component of cognition may be under natural selection. This idea remains largely untested outside of humans, perhaps because cognition in general is known to be strongly affected by learning and experience [1-4]. We investigated the genetic basis of individual variation in spatial cognition used by non-migratory food-caching birds to recover food stores and survive harsh montane winters. Comparing the genomes of wild, free-living birds ranging from best to worst in their performance on a spatial cognitive task revealed significant associations with genes involved in neuron growth and development and hippocampal function. These results identify candidate genes associated with differences in spatial cognition and provide a critical link connecting individual variation in spatial cognition with natural selection.
Affiliation(s)
- Carrie L Branch
- Cornell Lab of Ornithology, Cornell University, Ithaca, NY 14850, USA.
| | - Georgy A Semenov
- Department of Ecology and Evolutionary Biology, University of Colorado, Boulder, CO 80309, USA
| | - Dominique N Wagner
- Department of Ecology and Evolutionary Biology, University of Colorado, Boulder, CO 80309, USA
| | - Benjamin R Sonnenberg
- Ecology, Evolution, and Conservation Biology Graduate Program, University of Nevada, Reno, NV 89557, USA
| | - Angela M Pitera
- Ecology, Evolution, and Conservation Biology Graduate Program, University of Nevada, Reno, NV 89557, USA
| | - Eli S Bridge
- Ecology and Evolutionary Biology, University of Oklahoma, Norman, OK 73019, USA
| | - Scott A Taylor
- Department of Ecology and Evolutionary Biology, University of Colorado, Boulder, CO 80309, USA
| | - Vladimir V Pravosudov
- Ecology, Evolution, and Conservation Biology Graduate Program, University of Nevada, Reno, NV 89557, USA.
| |
|
64
|
A novel dimension reduction algorithm based on weighted kernel principal analysis for gene expression data. PLoS One 2021; 16:e0258326. [PMID: 34644329 PMCID: PMC8513872 DOI: 10.1371/journal.pone.0258326] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2020] [Accepted: 09/26/2021] [Indexed: 11/19/2022] Open
Abstract
Gene expression data are characterized by high dimensionality and a small sample size, and contain a large number of redundant genes unrelated to a disease. The direct application of machine learning to classify this type of data not only incurs a great time cost but also sometimes fails to improve classification performance. To counter this problem, this paper proposes a dimension-reduction algorithm based on weighted kernel principal component analysis (WKPCA), constructs kernel function weights according to kernel matrix eigenvalues, and combines multiple kernel functions to reduce the feature dimensions. To further improve the dimensional reduction efficiency of WKPCA, t-class kernel functions are constructed, and corresponding theoretical proofs are given. Moreover, the cumulative optimal performance rate is constructed to measure the overall performance of WKPCA combined with machine learning algorithms. Naive Bayes, K-nearest neighbour, random forest, iterative random forest and support vector machine classifiers are used to analyse 6 real gene expression datasets. Compared with the all-variable model, linear principal component dimension reduction and single kernel function dimension reduction, the results show that the classification performance of the 5 machine learning methods mentioned above can be improved effectively by WKPCA dimension reduction.
|
65
|
A novel random forest approach to revealing interactions and controls on chlorophyll concentration and bacterial communities during coastal phytoplankton blooms. Sci Rep 2021; 11:19944. [PMID: 34620921 PMCID: PMC8497483 DOI: 10.1038/s41598-021-98110-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/15/2020] [Accepted: 08/24/2021] [Indexed: 11/12/2022] Open
Abstract
Increasing occurrence of harmful algal blooms across the land–water interface poses significant risks to coastal ecosystem structure and human health. Defining significant drivers and their interactive impacts on blooms allows for more effective analysis and identification of specific conditions supporting phytoplankton growth. A novel iterative Random Forests (iRF) machine-learning model was developed and applied to two example cases along the California coast to identify key stable interactions: (1) phytoplankton abundance in response to various drivers due to coastal conditions and land-sea nutrient fluxes, (2) microbial community structure during algal blooms. In Example 1, watershed derived nutrients were identified as the least significant interacting variable associated with Monterey Bay phytoplankton abundance. In Example 2, through iRF analysis of field-based 16S OTU bacterial community and algae datasets, we independently found stable interactions of prokaryote abundance patterns associated with phytoplankton abundance that have been previously identified in laboratory-based studies. Our study represents the first iRF application to marine algal blooms that helps to identify ocean, microbial, and terrestrial conditions that are considered dominant causal factors on bloom dynamics.
|
66
|
Chen D, Sun Y, Shao G, Yu W, Zhang HT, Lin W. Coordinating directional switches in pigeon flocks: the role of nonlinear interactions. ROYAL SOCIETY OPEN SCIENCE 2021; 8:210649. [PMID: 34631121 PMCID: PMC8479334 DOI: 10.1098/rsos.210649] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2021] [Accepted: 09/03/2021] [Indexed: 06/13/2023]
Abstract
The mechanisms inducing unpredictably directional switches in collective and moving biological entities are largely unclear. Deeply understanding such mechanisms is beneficial to delicate design of biologically inspired devices with particular functions. Here, articulating a framework that integrates data-driven, analytical and numerical methods, we investigate the underlying mechanism governing the coordinated rotational flight of pigeon flocks with unpredictably directional switches. Particularly using the sparse Bayesian learning method, we extract the inter-agent interactional dynamics from the high-resolution GPS data of three pigeon flocks, which reveals that the decision-making process in rotational switching flight performs in a more nonlinear manner than in smooth coordinated flight. To elaborate the principle of this nonlinearity of interactions, we establish a data-driven particle model with two potential wells and estimate the mean switching time of rotational direction. Our model with its analytical and numerical results renders the directional switches of moving biological groups more interpretable and predictable. Actually, an appropriate combination of natures, including high density, stronger nonlinearity in interactions, and moderate strength of noise, can enhance such highly ordered, less frequent switches.
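As a rough illustration of what a particle model with two potential wells looks like, the sketch below simulates an overdamped particle in the textbook double-well potential V(x) = x^4/4 - x^2/2 with additive Gaussian noise, and counts how often the particle moves from one well to the other. The potential, the noise level, the step size and the function name `count_well_switches` are all assumptions chosen for illustration; they are not the model fitted in the paper.

```python
import math
import random


def count_well_switches(noise=0.7, dt=0.01, steps=20000, seed=0):
    """Count transitions between the wells at x = -1 and x = +1.

    Euler-Maruyama integration of dx = (x - x^3) dt + noise dW,
    i.e. gradient descent on the assumed potential V(x) = x^4/4 - x^2/2.
    """
    rng = random.Random(seed)
    x = 1.0               # start at the bottom of the right-hand well
    current_well = 1      # +1 for the right well, -1 for the left well
    switches = 0
    for _ in range(steps):
        drift = x - x ** 3
        x += drift * dt + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        # Thresholds at +/-0.5 avoid counting jitter around the barrier at x = 0.
        if current_well == 1 and x < -0.5:
            current_well = -1
            switches += 1
        elif current_well == -1 and x > 0.5:
            current_well = 1
            switches += 1
    return switches


if __name__ == "__main__":
    total_time = 20000 * 0.01
    n = count_well_switches()
    print("switches:", n)
    if n:
        print("rough mean time between switches:", total_time / n)
```

Dividing the simulated time by the number of switches gives a crude estimate of the mean switching time; with no noise the particle never leaves its starting well.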
Affiliation(s)
- Duxin Chen
- School of Mathematics, Southeast University, Nanjing 211096, People’s Republic of China
| | - Yongzheng Sun
- School of Mathematics, China University of Mining and Technology, Xuzhou 221116, People’s Republic of China
| | - Guanbo Shao
- School of Mathematics, Southeast University, Nanjing 211096, People’s Republic of China
| | - Wenwu Yu
- School of Mathematics, Southeast University, Nanjing 211096, People’s Republic of China
| | - Hai-Tao Zhang
- School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, People’s Republic of China
| | - Wei Lin
- Research Institute of Intelligent Complex Systems and MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, People’s Republic of China
- School of Mathematical Sciences, LMNS, and SCMS, Fudan University, Shanghai 200433, People’s Republic of China
| |
|
67
|
Zabeti H, Dexter N, Safari AH, Sedaghat N, Libbrecht M, Chindelevitch L. INGOT-DR: an interpretable classifier for predicting drug resistance in M. tuberculosis. Algorithms Mol Biol 2021; 16:17. [PMID: 34376217 PMCID: PMC8353837 DOI: 10.1186/s13015-021-00198-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 02/11/2021] [Accepted: 07/23/2021] [Indexed: 12/13/2022] Open
Abstract
Motivation: Prediction of drug resistance and identification of its mechanisms in bacteria such as Mycobacterium tuberculosis, the etiological agent of tuberculosis, is a challenging problem. Solving this problem requires a transparent, accurate, and flexible predictive model. The methods currently used for this purpose rarely satisfy all of these criteria. On the one hand, approaches based on testing strains against a catalogue of previously identified mutations often yield poor predictive performance; on the other hand, machine learning techniques typically have higher predictive accuracy, but often lack interpretability and may learn patterns that produce accurate predictions for the wrong reasons. Current interpretable methods may either exhibit a lower accuracy or lack the flexibility needed to generalize them to previously unseen data. Contribution: In this paper we propose a novel technique, inspired by group testing and Boolean compressed sensing, which yields highly accurate predictions, interpretable results, and is flexible enough to be optimized for various evaluation metrics at the same time. Results: We test the predictive accuracy of our approach on five first-line and seven second-line antibiotics used for treating tuberculosis. We find that it has a higher or comparable accuracy to that of commonly used machine learning models, and is able to identify variants in genes with previously reported association to drug resistance. Our method is intrinsically interpretable, and can be customized for different evaluation metrics. Our implementation is available at github.com/hoomanzabeti/INGOT_DR and can be installed via the Python Package Index (PyPI) under ingotdr. This package is also compatible with most of the tools in the Scikit-learn machine learning library.
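To make the group-testing idea concrete, the sketch below shows what a disjunctive (OR) rule over variants could look like: an isolate is predicted resistant if it carries at least one variant named in the rule. This is only an illustration of the rule form under that assumption. It does not use or reproduce the `ingotdr` package, and the function names and variant labels are hypothetical.

```python
def predict_resistant(isolate_variants, rule_variants):
    """Return True if the isolate carries at least one variant named in the rule."""
    return not set(rule_variants).isdisjoint(isolate_variants)


def accuracy(isolates, labels, rule_variants):
    """Fraction of isolates whose predicted status matches the known label."""
    correct = sum(
        predict_resistant(variants, rule_variants) == label
        for variants, label in zip(isolates, labels)
    )
    return correct / len(labels)


# Hypothetical toy data: variant names and labels are placeholders, not real mutations.
rule = {"variant_a", "variant_c"}
isolates = [{"variant_a", "variant_b"}, {"variant_b"}, {"variant_c"}, set()]
labels = [True, False, True, False]

if __name__ == "__main__":
    print("accuracy on toy data:", accuracy(isolates, labels, rule))
```

A rule of this form is easy to read because each predicted resistance call can be traced back to the specific variant that triggered it.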
|
68
|
Stell E, Warner D, Jian J, Bond-Lamberty B, Vargas R. Spatial biases of information influence global estimates of soil respiration: How can we improve global predictions? GLOBAL CHANGE BIOLOGY 2021; 27:3923-3938. [PMID: 33934461 DOI: 10.1111/gcb.15666] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/08/2021] [Accepted: 03/31/2021] [Indexed: 06/12/2023]
Abstract
Soil respiration (Rs), the efflux of CO2 from soils to the atmosphere, is a major component of the terrestrial carbon cycle, but is poorly constrained from regional to global scales. The global soil respiration database (SRDB) is a compilation of in situ Rs observations from around the globe that has been consistently updated with new measurements over the past decade. It is unclear whether the addition of data to new versions has produced better-constrained global Rs estimates. We compared two versions of the SRDB (v3.0 n = 5173 and v5.0 n = 10,366) to determine how additional data influenced global Rs annual sum, spatial patterns and associated uncertainty (1 km spatial resolution) using a machine learning approach. A quantile regression forest model parameterized using SRDBv3 yielded a global Rs sum of 88.6 Pg C year⁻¹, and associated uncertainty of 29.9 (mean absolute error) and 57.9 (standard deviation) Pg C year⁻¹, whereas parameterization using SRDBv5 yielded 96.5 Pg C year⁻¹ and associated uncertainty of 30.2 (mean average error) and 73.4 (standard deviation) Pg C year⁻¹. Empirically estimated global heterotrophic respiration (Rh) from v3 and v5 were 49.9-50.2 (mean 50.1) and 53.3-53.5 (mean 53.4) Pg C year⁻¹, respectively. SRDBv5's inclusion of new data from underrepresented regions (e.g., Asia, Africa, South America) resulted in overall higher model uncertainty. The largest differences between models parameterized with different SRDB versions were in arid/semi-arid regions. The SRDBv5 is still biased toward northern latitudes and temperate zones, so we tested an optimized global distribution of Rs measurements, which resulted in a global sum of 96.4 ± 21.4 Pg C year⁻¹ with an overall lower model uncertainty. These results support current global estimates of Rs but highlight spatial biases that influence model parameterization and interpretation and provide insights for design of environmental networks to improve global-scale Rs estimates.
Affiliation(s)
- Emma Stell
- Department of Geography and Spatial Sciences, University of Delaware, Newark, DE, USA
| | - Daniel Warner
- Delaware Geological Survey, University of Delaware, Newark, DE, USA
| | - Jinshi Jian
- Pacific Northwest National Laboratory, Joint Global Change Research Institute, College Park, MD, USA
| | - Ben Bond-Lamberty
- Pacific Northwest National Laboratory, Joint Global Change Research Institute, College Park, MD, USA
| | - Rodrigo Vargas
- Department of Geography and Spatial Sciences, University of Delaware, Newark, DE, USA
- Department of Plant and Soil Sciences, University of Delaware, Newark, DE, USA
| |
|
69
|
Tansey W, Veitch V, Zhang H, Rabadan R, Blei DM. The Holdout Randomization Test for Feature Selection in Black Box Models. J Comput Graph Stat 2021. [DOI: 10.1080/10618600.2021.1923520] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]
Affiliation(s)
- Wesley Tansey
- Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY
| | - Victor Veitch
- Department of Statistics, Columbia University, New York, NY
| | - Haoran Zhang
- Department of Computer Science, University of Texas at Austin, TX
| | - Raul Rabadan
- Department of Systems Biology, Columbia University Medical Center, New York, NY
| | - David M. Blei
- Departments of Computer Science and Statistics, Columbia University, New York, NY
| |
|
70
|
Mooney C, O'Boyle D, Finder M, Hallberg B, Walsh BH, Henshall DC, Boylan GB, Murray DM. Predictive modelling of hypoxic ischaemic encephalopathy risk following perinatal asphyxia. Heliyon 2021; 7:e07411. [PMID: 34278022 PMCID: PMC8261660 DOI: 10.1016/j.heliyon.2021.e07411] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/14/2021] [Revised: 05/29/2021] [Accepted: 06/23/2021] [Indexed: 01/03/2023] Open
Abstract
Hypoxic Ischemic Encephalopathy (HIE) remains a major cause of neurological disability. Early intervention with therapeutic hypothermia improves outcome, but prediction of HIE is difficult and no single clinical marker is reliable. Machine learning algorithms may allow identification of patterns in clinical data to improve prognostic power. Here we examine the use of a Random Forest machine learning algorithm and five-fold cross-validation to predict the occurrence of HIE in a prospective cohort of infants with perinatal asphyxia. Infants with perinatal asphyxia were recruited at birth and neonatal course was followed for the development of HIE. Clinical variables were recorded for each infant including maternal demographics, delivery details and infant's condition at birth. We found that the strongest predictors of HIE were the infant's condition at birth (as expressed by Apgar score), need for resuscitation, and the first postnatal measures of pH, lactate, and base deficit. Random Forest models combining features including Apgar score, most intensive resuscitation, maternal age and infant birth weight both with and without biochemical markers of pH, lactate, and base deficit resulted in a sensitivity of 56-100% and a specificity of 78-99%. This study presents a dynamic method of rapid classification that has the potential to be easily adapted and implemented in a clinical setting, with and without the availability of blood gas analysis. Our results demonstrate that applying machine learning algorithms to readily available clinical data may support clinicians in the early and accurate identification of infants who will develop HIE. We anticipate our models to be a starting point for the development of a more sophisticated clinical decision support system to help identify which infants will benefit from early therapeutic hypothermia.
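The sensitivity and specificity ranges quoted above come from evaluating a classifier on held-out folds. The sketch below shows, in plain Python, how the two metrics are computed from binary predictions and how a dataset can be partitioned into five folds. The function names and the toy labels are illustrative assumptions; the study's own Random Forest pipeline and data are not reproduced here.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels where 1 = positive case."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


if __name__ == "__main__":
    # Toy labels only: 1 = developed the condition, 0 = did not.
    sens, spec = sensitivity_specificity([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
    print("sensitivity:", sens, "specificity:", spec)
    for train_idx, test_idx in k_fold_indices(12, k=5):
        print("test fold:", test_idx)
```

In practice the samples would be shuffled (and usually stratified by outcome) before splitting, and a model would be refit on each training fold before scoring its test fold.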
Affiliation(s)
- Catherine Mooney
- School of Computer Science, University College Dublin, Dublin, Ireland
- FutureNeuro SFI Research Centre, RCSI University of Medicine and Health Sciences, Dublin, Ireland
- INFANT Research Centre, University College Cork, Cork, Ireland
| | - Daragh O'Boyle
- INFANT Research Centre, University College Cork, Cork, Ireland
- Department of Paediatrics and Child Health, University College Cork, Cork, Ireland
| | - Mikael Finder
- Neonatal Department, Karolinska University Hospital, Stockholm, Sweden
- Division of Paediatrics, CLINTEC, Karolinska Institute, Stockholm, Sweden
| | - Boubou Hallberg
- Neonatal Department, Karolinska University Hospital, Stockholm, Sweden
- Division of Paediatrics, CLINTEC, Karolinska Institute, Stockholm, Sweden
| | - Brian H Walsh
- INFANT Research Centre, University College Cork, Cork, Ireland
- Department of Paediatrics and Child Health, University College Cork, Cork, Ireland
- Department of Neonatology, Cork University Maternity Hospital, Cork, Ireland
| | - David C Henshall
- FutureNeuro SFI Research Centre, RCSI University of Medicine and Health Sciences, Dublin, Ireland
| | - Geraldine B Boylan
- INFANT Research Centre, University College Cork, Cork, Ireland
- Department of Paediatrics and Child Health, University College Cork, Cork, Ireland
| | - Deirdre M Murray
- INFANT Research Centre, University College Cork, Cork, Ireland
- Department of Paediatrics and Child Health, University College Cork, Cork, Ireland
| |
|
71
|
DiMucci D, Kon M, Segrè D. BowSaw: Inferring Higher-Order Trait Interactions Associated With Complex Biological Phenotypes. Front Mol Biosci 2021; 8:663532. [PMID: 34222331 PMCID: PMC8245782 DOI: 10.3389/fmolb.2021.663532] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2021] [Accepted: 05/24/2021] [Indexed: 11/15/2022] Open
Abstract
Machine learning is helping the interpretation of biological complexity by enabling the inference and classification of cellular, organismal and ecological phenotypes based on large datasets, e.g., from genomic, transcriptomic and metagenomic analyses. A number of available algorithms can help search these datasets to uncover patterns associated with specific traits, including disease-related attributes. While, in many instances, treating an algorithm as a black box is sufficient, it is interesting to pursue an enhanced understanding of how system variables end up contributing to a specific output, as an avenue toward new mechanistic insight. Here we address this challenge through a suite of algorithms, named BowSaw, which takes advantage of the structure of a trained random forest algorithm to identify combinations of variables (“rules”) frequently used for classification. We first apply BowSaw to a simulated dataset and show that the algorithm can accurately recover the sets of variables used to generate the phenotypes through complex Boolean rules, even under challenging noise levels. We next apply our method to data from the integrative Human Microbiome Project and find previously unreported high-order combinations of microbial taxa putatively associated with Crohn’s disease. By leveraging the structure of trees within a random forest, BowSaw provides a new way of using decision trees to generate testable biological hypotheses.
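The general idea of mining a trained forest for variable combinations can be illustrated with a small counting exercise: given the variables visited along each tree's decision path, tally which sets of variables appear together most often. The sketch below is a simplified stand-in written for illustration. It is not the BowSaw algorithm itself, and the function name and the toy paths are assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_variable_sets(paths, size=2, top=3):
    """Count how often each set of `size` variables co-occurs on a decision path.

    `paths` is a list of decision paths, each given as the list of variable
    names tested along that path (one path per tree, for example).
    """
    counts = Counter()
    for path in paths:
        unique_vars = sorted(set(path))  # ignore repeated splits on one variable
        for combo in combinations(unique_vars, size):
            counts[combo] += 1
    return counts.most_common(top)


if __name__ == "__main__":
    # Hypothetical paths; the letters stand in for variables such as taxa.
    toy_paths = [["A", "B", "C"], ["A", "B"], ["B", "C"], ["A", "B", "D"]]
    print(frequent_variable_sets(toy_paths, size=2, top=2))
```

Combinations that recur across many trees are candidate "rules" worth testing directly against the phenotype, which is the kind of hypothesis the abstract describes.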
Affiliation(s)
- Demetrius DiMucci
- Bioinformatics Graduate Program, Boston University, Boston, MA, United States
- Biological Design Center, Boston University, Boston, MA, United States
| | - Mark Kon
- Bioinformatics Graduate Program, Boston University, Boston, MA, United States
- Department of Mathematics and Statistics, Boston University, Boston, MA, United States
| | - Daniel Segrè
- Bioinformatics Graduate Program, Boston University, Boston, MA, United States
- Biological Design Center, Boston University, Boston, MA, United States
- Department of Biology, Boston University, Boston, MA, United States
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Department of Physics, Boston University, Boston, MA, United States
| |
|
72
|
Gao H, Yang C, Fan J, Lan L, Pang D. Hereditary and breastfeeding factors are positively associated with the aetiology of mammary gland hyperplasia: a case-control study. Int Health 2021; 13:240-247. [PMID: 32556322 PMCID: PMC8079319 DOI: 10.1093/inthealth/ihaa028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2020] [Revised: 04/10/2020] [Accepted: 05/18/2020] [Indexed: 11/30/2022] Open
Abstract
Background: Hyperplasia of mammary gland (HMG) has become a common disorder in women. A family history of breast cancer and female reproductive factors may work together to increase the risk of HMG. However, this specific relationship has not been fully characterized. Methods: A total of 1881 newly diagnosed HMG cases and 1900 controls were recruited from 2012 to 2017. Demographic characteristics including female reproductive factors and a family history of breast cancer were collected. A multi-analytic strategy combining unconditional logistic regression, multifactor dimensionality reduction (MDR) and crossover approaches was applied to systematically identify the interaction effect of family history of breast cancer and reproductive factors on HMG susceptibility. Results: In MDR analysis, high-order interactions among higher-level education, shorter breastfeeding duration and family history of breast cancer were identified (odds ratio [OR] 7.07 [95% confidence interval {CI} 6.08 to 8.22]). Similarly, in crossover analysis, HMG risk increased significantly for those with higher-level education (OR 36.39 [95% CI 11.47 to 115.45]), shorter duration of breastfeeding (OR 27.70 [95% CI 3.73 to 205.70]) and a family history of breast cancer. Conclusion: Higher-level education, shorter breastfeeding duration and a family history of breast cancer may synergistically increase the risk of HMG.
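For readers unfamiliar with how an odds ratio and its 95% confidence interval are obtained in a case-control design, the sketch below computes both from a 2x2 table using the standard Wald interval on the log scale. The counts in the example are invented for illustration and are not the study's data; the study's reported estimates come from its own models, which may adjust for covariates.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a: exposed cases      b: exposed controls
    c: unexposed cases    d: unexposed controls
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Hypothetical counts, chosen only to show the arithmetic.
    or_, lo, hi = odds_ratio_ci(20, 10, 10, 20)
    print(f"OR {or_:.2f} [95% CI {lo:.2f} to {hi:.2f}]")
```

Small cell counts inflate the standard error, which is why some of the intervals quoted in the abstract are very wide.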
Affiliation(s)
- Hanlu Gao
- Department of Preventive Health, The Affiliated Hospital of Medical School of Ningbo University, 247 Renmin Road, Ningbo, Zhejiang, P.R. China
- Division of Chronic and Non-communicable Diseases, Harbin Center for Diseases Control and Prevention, 30 Weixing Road, Harbin, Heilongjiang, P.R. China
- Department of Breast Surgery, Harbin Medical University Cancer Hospital, 150 Haping Road, Harbin, Heilongjiang, P.R. China
| | - Chao Yang
- Division of Chronic and Non-communicable Diseases, Harbin Center for Diseases Control and Prevention, 30 Weixing Road, Harbin, Heilongjiang, P.R. China
| | - Jinqing Fan
- Department of Dermatology, The Affiliated Hospital of Medical School of Ningbo University, 247 Renmin Road, Ningbo, Zhejiang, P.R. China
| | - Li Lan
- Division of Chronic and Non-communicable Diseases, Harbin Center for Diseases Control and Prevention, 30 Weixing Road, Harbin, Heilongjiang, P.R. China
| | - Da Pang
- Department of Breast Surgery, Harbin Medical University Cancer Hospital, 150 Haping Road, Harbin, Heilongjiang, P.R. China
| |
|
73
|
Armstrong AJS, Quinn K, Fouquier J, Li SX, Schneider JM, Nusbacher NM, Doenges KA, Fiorillo S, Marden TJ, Higgins J, Reisdorph N, Campbell TB, Palmer BE, Lozupone CA. Systems Analysis of Gut Microbiome Influence on Metabolic Disease in HIV-Positive and High-Risk Populations. mSystems 2021; 6:e01178-20. [PMID: 34006628 PMCID: PMC8269254 DOI: 10.1128/msystems.01178-20] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/09/2020] [Accepted: 04/15/2021] [Indexed: 12/20/2022] Open
Abstract
Poor metabolic health, characterized by insulin resistance and dyslipidemia, is higher in people living with HIV and has been linked with inflammation, antiretroviral therapy (ART) drugs, and ART-associated lipodystrophy (LD). Metabolic disease is associated with gut microbiome composition outside the context of HIV but has not been deeply explored in HIV infection or in high-risk men who have sex with men (HR-MSM), who have a highly altered gut microbiome composition. Furthermore, the contribution of increased bacterial translocation and associated systemic inflammation that has been described in HIV-positive and HR-MSM individuals has not been explored. We used a multiomic approach to explore relationships between impaired metabolic health, defined using fasting blood markers, gut microbes, immune phenotypes, and diet. Our cohort included ART-treated HIV-positive MSM with or without LD, untreated HIV-positive MSM, and HR-MSM. For HIV-positive MSM on ART, we further explored associations with the plasma metabolome. We found that elevated plasma lipopolysaccharide binding protein (LBP) was the most important predictor of impaired metabolic health and network analysis showed that LBP formed a hub joining correlated microbial and immune predictors of metabolic disease. Taken together, our results suggest the role of inflammatory processes linked with bacterial translocation and interaction with the gut microbiome in metabolic disease among HIV-positive and -negative MSM.
IMPORTANCE: The gut microbiome in people living with HIV (PLWH) is of interest since chronic infection often results in long-term comorbidities. Metabolic disease is prevalent in PLWH even in well-controlled infection and has been linked with the gut microbiome in previous studies, but little attention has been given to PLWH. Furthermore, integrated analyses that consider gut microbiome, together with diet, systemic immune activation, metabolites, and demographics, have been lacking. In a systems-level analysis of predictors of metabolic disease in PLWH and men who are at high risk of acquiring HIV, we found that increased lipopolysaccharide-binding protein, an inflammatory marker indicative of compromised intestinal barrier function, was associated with worse metabolic health. We also found impaired metabolic health associated with specific dietary components, gut microbes, and host and microbial metabolites. This study lays the framework for mechanistic studies aimed at targeting the microbiome to prevent or treat metabolic endotoxemia in HIV-infected individuals.
Affiliation(s)
- Abigail J S Armstrong
- Department of Medicine, University of Colorado Denver, Aurora, Colorado, USA
- Department of Immunology and Microbiology, University of Colorado Denver, Aurora, Colorado, USA
- Center for Advanced Biotechnology and Medicine, Rutgers the State University, Piscataway, New Jersey, USA
| | - Kevin Quinn
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of Colorado, Aurora, Colorado, USA
| | - Jennifer Fouquier
- Department of Medicine, University of Colorado Denver, Aurora, Colorado, USA
| | - Sam X Li
- Department of Medicine, University of Colorado Denver, Aurora, Colorado, USA
| | | | - Nichole M Nusbacher
- Department of Medicine, University of Colorado Denver, Aurora, Colorado, USA
| | - Katrina A Doenges
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of Colorado, Aurora, Colorado, USA
| | - Suzanne Fiorillo
- Department of Medicine, University of Colorado Denver, Aurora, Colorado, USA
| | - Tyson J Marden
- Colorado Clinical and Translational Sciences Institute, Aurora, Colorado, USA
| | - Janine Higgins
- Department of Pediatrics, Section of Endocrinology, University of Colorado, Aurora, Colorado, USA
| | - Nichole Reisdorph
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of Colorado, Aurora, Colorado, USA
| | - Thomas B Campbell
- Department of Medicine, University of Colorado Denver, Aurora, Colorado, USA
| | - Brent E Palmer
- Department of Medicine, University of Colorado Denver, Aurora, Colorado, USA
| | | |
|
74
|
Yu F, Wei C, Deng P, Peng T, Hu X. Deep exploration of random forest model boosts the interpretability of machine learning studies of complicated immune responses and lung burden of nanoparticles. SCIENCE ADVANCES 2021; 7:eabf4130. [PMID: 34039604 PMCID: PMC8153727 DOI: 10.1126/sciadv.abf4130] [Citation(s) in RCA: 49] [Impact Index Per Article: 16.3] [Received: 10/25/2020] [Accepted: 04/05/2021] [Indexed: 05/22/2023]
Abstract
The development of machine learning provides solutions for predicting the complicated immune responses and pharmacokinetics of nanoparticles (NPs) in vivo. However, highly heterogeneous data in NP studies remain challenging because of the low interpretability of machine learning. Here, we propose a tree-based random forest feature importance and feature interaction network analysis framework (TBRFA) and accurately predict the pulmonary immune responses and lung burden of NPs, with the correlation coefficient of all training sets >0.9 and half of the test sets >0.75. This framework overcomes the feature importance bias brought by small datasets through a multiway importance analysis. TBRFA also builds feature interaction networks, boosts model interpretability, and reveals hidden interactional factors (e.g., various NP properties and exposure conditions). TBRFA provides guidance for the design and application of ideal NPs and discovers the feature interaction networks that contribute to complex systems with small-size data in various fields.
Affiliation(s)
- Fubo Yu
- Key Laboratory of Pollution Processes and Environmental Criteria (Ministry of Education)/Tianjin Key Laboratory of Environmental Remediation and Pollution Control, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China
| | - Changhong Wei
- Key Laboratory of Pollution Processes and Environmental Criteria (Ministry of Education)/Tianjin Key Laboratory of Environmental Remediation and Pollution Control, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China
| | - Peng Deng
- Key Laboratory of Pollution Processes and Environmental Criteria (Ministry of Education)/Tianjin Key Laboratory of Environmental Remediation and Pollution Control, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China
| | - Ting Peng
- Key Laboratory of Pollution Processes and Environmental Criteria (Ministry of Education)/Tianjin Key Laboratory of Environmental Remediation and Pollution Control, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China
| | - Xiangang Hu
- Key Laboratory of Pollution Processes and Environmental Criteria (Ministry of Education)/Tianjin Key Laboratory of Environmental Remediation and Pollution Control, College of Environmental Science and Engineering, Nankai University, Tianjin 300350, China.
| |
|
75
|
Liu D, Zhang X, Zheng T, Shi Q, Cui Y, Wang Y, Liu L. Optimisation and evaluation of the random forest model in the efficacy prediction of chemoradiotherapy for advanced cervical cancer based on radiomics signature from high-resolution T2 weighted images. Arch Gynecol Obstet 2021; 303:811-820. [PMID: 33394142 PMCID: PMC7960581 DOI: 10.1007/s00404-020-05908-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 08/20/2020] [Accepted: 11/17/2020] [Indexed: 12/28/2022]
Abstract
PURPOSE Our objective was to establish a random forest model and to evaluate its predictive capability of the treatment effect of neoadjuvant chemotherapy-radiation therapy. METHODS This retrospective study included 82 patients with locally advanced cervical cancer who underwent scanning from March 2013 to May 2018. The random forest model was established and optimised based on the open source toolkit scikit-learn. By optimising the number of decision trees in the random forest, the criteria for selecting the final partition index and the minimum number of samples partitioned by each node, the performance of random forest in the prediction of the treatment effect of neoadjuvant chemotherapy-radiation therapy on advanced cervical cancer (> IIb) was evaluated. RESULTS The number of decision trees in the random forests influenced the model performance. When the number of decision trees was set to 10, 25, 40, 55, 70, 85 and 100, the performance of the random forest model first increased and then decreased. The criteria for the selection of final partition index showed significant effects on the generation of decision trees. The Gini index demonstrated a better effect compared with the information gain index. The area under the receiver operating curve for the Gini index attained a value of 0.917. CONCLUSION The random forest model showed potential in predicting the treatment effect of neoadjuvant chemotherapy-radiation therapy based on high-resolution T2WIs for advanced cervical cancer (> IIb).
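The abstract above compares two split criteria, the Gini index and information gain. As a rough illustration of how the two differ, the following sketch scores one candidate split under both criteria. It is not the authors' code: the function names and the toy labels are hypothetical, and only the Python standard library is used rather than scikit-learn.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits; information gain is the decrease in this value."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())


def impurity_decrease(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two child nodes."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Hypothetical toy labels (1 = responder, 0 = non-responder); not data from the study.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [1, 0, 0, 0]

gini_gain = impurity_decrease(parent, left, right, gini)
info_gain = impurity_decrease(parent, left, right, entropy)
print(f"Gini decrease: {gini_gain:.4f}, information gain: {info_gain:.4f}")
```

Both criteria reward the same kind of split (purer children), but they weight class proportions differently, which is one reason trees grown under each criterion can differ.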
Affiliation(s)
- Defeng Liu
- Department of Magnetic Resonance Imaging, Qinhuangdao Municipal No. 1 Hospital, Qinhuangdao, People's Republic of China
| | - Xiaohang Zhang
- State Grid Information & Telecommunication Group Co., Ltd., Beijing, People's Republic of China
| | - Tao Zheng
- Department of Magnetic Resonance Imaging, Qinhuangdao Municipal No. 1 Hospital, Qinhuangdao, People's Republic of China
| | - Qinglei Shi
- Scientific Clinical Specialist, Siemens Ltd., Beijing, People's Republic of China
| | - Yujie Cui
- Department of Magnetic Resonance Imaging, Qinhuangdao Municipal No. 1 Hospital, Qinhuangdao, People's Republic of China
| | - Yongji Wang
- Cooperative Innovation Center, Institute of Software, Chinese Academy of Sciences, Beijing, People's Republic of China
- University of Chinese Academy of Sciences, Beijing, People's Republic of China
- State Key Laboratory of Computer Science (Institute of Software, The Chinese Academy of Sciences), Beijing, People's Republic of China
| | - Lanxiang Liu
- Department of Magnetic Resonance Imaging, Qinhuangdao Municipal No. 1 Hospital, Qinhuangdao, People's Republic of China.
| |
|
76
|
Sarkar P, Malik S, Laha S, Das S, Bunk S, Ray JG, Chatterjee R, Saha A. Dysbiosis of Oral Microbiota During Oral Squamous Cell Carcinoma Development. Front Oncol 2021; 11:614448. [PMID: 33708627 PMCID: PMC7940518 DOI: 10.3389/fonc.2021.614448] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Received: 10/06/2020] [Accepted: 01/05/2021] [Indexed: 12/24/2022] Open
Abstract
Infection with specific pathogens and alterations in tissue commensal microbial composition are intricately associated with the development of many human cancers. Likewise, dysbiosis of the oral microbiome has been shown to play a critical role in the initiation as well as progression of oral cancer. However, there are no reports portraying changes in the oral microbial community in patients from the Indian subcontinent, which has the highest incidence of oral cancer per year globally. To establish the association of bacterial dysbiosis and oral squamous cell carcinoma (OSCC) among the Indian population, malignant lesions and anatomically matched adjacent normal tissues were obtained from fifty well-differentiated OSCC patients and analyzed using 16S rRNA V3-V4 amplicon-based sequencing on the MiSeq platform. Interestingly, in contrast to previous studies, a significantly lower bacterial diversity was observed in the malignant samples compared with their normal counterparts. Overall, our study identified Prevotella, Corynebacterium, Pseudomonas, Deinococcus and Noviherbaspirillum as significantly enriched genera, whereas genera including Actinomyces, Sutterella, Stenotrophomonas, Anoxybacillus, and Serratia were notably decreased in the OSCC lesions. Moreover, we demonstrated that HPV-16 but not HPV-18 was significantly associated with OSCC development. In future, with additional validation, this panel could be applied directly in clinical diagnostic and prognostic workflows for OSCC in the Indian context.
Affiliation(s)
- Purandar Sarkar
- School of Biotechnology, Presidency University, Kolkata, India
| | - Samaresh Malik
- School of Biotechnology, Presidency University, Kolkata, India
| | - Sayantan Laha
- Human Genetics Unit, Indian Statistical Institute, Kolkata, India
| | - Shantanab Das
- Human Genetics Unit, Indian Statistical Institute, Kolkata, India
| | - Soumya Bunk
- Department of Life Sciences, Presidency University, Kolkata, India
| | - Jay Gopal Ray
- Department of Oral Pathology, Dr. R Ahmed Dental College and Hospital, Kolkata, India
| | | | - Abhik Saha
- School of Biotechnology, Presidency University, Kolkata, India; Department of Life Sciences, Presidency University, Kolkata, India
| |
|
77
|
Jain R, Xu W. HDSI: High dimensional selection with interactions algorithm on feature selection and testing. PLoS One 2021; 16:e0246159. [PMID: 33592034 PMCID: PMC7886179 DOI: 10.1371/journal.pone.0246159] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 05/09/2020] [Accepted: 01/15/2021] [Indexed: 11/19/2022] Open
Abstract
Feature selection on high dimensional data along with the interaction effects is a critical challenge for classical statistical learning techniques. Existing feature selection algorithms such as random LASSO leverage LASSO's capability to handle high dimensional data. However, the technique has two main limitations, namely the inability to consider interaction terms and the lack of a statistical test for determining the significance of selected features. This study proposes a High Dimensional Selection with Interactions (HDSI) algorithm, a new feature selection method, which can handle high-dimensional data, incorporate interaction terms, provide the statistical inferences of selected features and leverage the capability of existing classical statistical techniques. The method allows the application of any statistical technique like LASSO and subset selection on multiple bootstrapped samples, each containing randomly selected features. Each bootstrap sample incorporates interaction terms for the randomly sampled features. The selected features from each model are pooled and their statistical significance is determined. The selected statistically significant features are used as the final output of the approach, whose final coefficients are estimated using appropriate statistical techniques. The performance of HDSI is evaluated using both simulated data and real studies. In general, HDSI outperforms the commonly used algorithms such as LASSO, subset selection, adaptive LASSO, random LASSO and group LASSO.
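The resampling step described in the abstract above (bootstrap the rows, draw a random subset of features, add interaction terms for that subset) can be sketched as follows. This is only an assumed reading of that one step, not the published HDSI implementation: the function name, the choice of pairwise products as interaction terms, and the simulated data are all hypothetical, and the model fitting and significance testing stages are omitted.

```python
import random
from itertools import combinations


def bootstrap_with_interactions(rows, n_features, rng):
    """Build one bootstrap design matrix with pairwise interaction columns.

    rows: list of equal-length numeric feature vectors.
    n_features: how many features to sample for this bootstrap replicate.
    rng: a random.Random instance, passed in for reproducibility.
    """
    n_cols = len(rows[0])
    # Randomly select a subset of feature indices (without replacement).
    chosen = sorted(rng.sample(range(n_cols), n_features))
    pairs = list(combinations(chosen, 2))
    # Resample observations with replacement.
    sample = [rng.choice(rows) for _ in rows]
    # Main effects first, then the product of every sampled feature pair.
    design = [
        [row[j] for j in chosen] + [row[a] * row[b] for a, b in pairs]
        for row in sample
    ]
    names = [f"x{j}" for j in chosen] + [f"x{a}*x{b}" for a, b in pairs]
    return names, design


# Simulated data for illustration only: 20 observations, 6 features.
rng = random.Random(0)
rows = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(20)]
names, design = bootstrap_with_interactions(rows, n_features=3, rng=rng)
print(names)
```

In a full pipeline, a selector such as LASSO would be fitted to each such design matrix and the selected columns pooled across replicates before any significance test.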
Affiliation(s)
- Rahi Jain
- Biostatistics Department, Princess Margaret Cancer Research Centre, Toronto, Ontario, Canada
| | - Wei Xu
- Biostatistics Department, Princess Margaret Cancer Research Centre, Toronto, Ontario, Canada
- Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
| |
|
78
|
A Data-Driven and Data-Based Framework for Online Voltage Stability Assessment Using Partial Mutual Information and Iterated Random Forest. ENERGIES 2021. [DOI: 10.3390/en14030715] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 11/16/2022]
Abstract
Due to the rapid development of phasor measurement units (PMUs) and the wide area of interconnection of modern power systems, the security of power systems is confronted with severe challenges. A novel framework based on data for static voltage stability margin (VSM) assessment of power systems is presented. The proposed framework can select the key operation variables as input features for the assessment based on partial mutual information (PMI). Before the feature selection procedure is completed by PMI, a feature preprocessing approach is applied to remove redundant and irrelevant features to improve computational efficiency. Using the selected key variables, a voltage stability assessment (VSA) model based on iterated random forest (IRF) can rapidly provide the relative VSM results. The proposed framework is examined on the IEEE 30-bus system and a practical 1648-bus system, and a desirable assessment performance is demonstrated. In addition, the robustness and computational speed of the proposed framework are also verified. Some impact factors for power system operation are studied in a robustness examination, such as topology change, variation of peak/minimum load, and variation of generator/load power distribution.
|
79
|
Affiliation(s)
- Bin Yu
- Statistics Department University of California Berkeley Berkeley CA
- EECS Department University of California Berkeley Berkeley CA
- Chan Zuckerberg Biohub San Francisco CA
| | - Rebecca Barter
- Statistics Department University of California Berkeley Berkeley CA
| |
|
80
|
Dwivedi R, Tan YS, Park B, Wei M, Horgan K, Madigan D, Yu B. Stable Discovery of Interpretable Subgroups via Calibration in Causal Studies. Int Stat Rev 2020. [DOI: 10.1111/insr.12427] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 12/20/2022]
Affiliation(s)
- Raaz Dwivedi
- Department of EECS University of California, Berkeley Berkeley CA USA
| | - Yan Shuo Tan
- Department of Statistics University of California, Berkeley Berkeley CA USA
| | - Briton Park
- Department of Statistics University of California, Berkeley Berkeley CA USA
| | - Mian Wei
- Department of Statistics University of California, Berkeley Berkeley CA USA
| | - Kevin Horgan
- Protypia Inc 111 10th Avenue South, Suite 102 Nashville TN 37023 USA
| | - David Madigan
- Khoury College of Computer Sciences Northeastern University Boston MA USA
| | - Bin Yu
- Department of EECS University of California, Berkeley Berkeley CA USA
- Department of Statistics University of California, Berkeley Berkeley CA USA
- Division of Biostatistics University of California, Berkeley Berkeley CA USA
- Center for Computational Biology University of California, Berkeley Berkeley CA USA
- Chan Zuckerberg Biohub San Francisco CA USA
| |
|
81
|
Khalili E, Kouchaki S, Ramazi S, Ghanati F. Machine Learning Techniques for Soybean Charcoal Rot Disease Prediction. FRONTIERS IN PLANT SCIENCE 2020; 11:590529. [PMID: 33381132 PMCID: PMC7767839 DOI: 10.3389/fpls.2020.590529] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 08/01/2020] [Accepted: 11/23/2020] [Indexed: 06/01/2023]
Abstract
Early prediction of pathogen infestation is a key factor in reducing the spread of disease in plants. Macrophomina phaseolina (Tassi) Goid, one of the main causes of charcoal rot disease, suppresses plant productivity significantly. Charcoal rot disease is one of the most severe threats to soybean productivity. Prediction of this disease in soybeans is tedious and impractical using traditional approaches. Machine learning (ML) techniques have recently gained substantial traction across numerous domains. ML methods can be applied to detect plant diseases prior to the full appearance of symptoms. In this paper, several ML techniques were developed and examined for prediction of charcoal rot disease in soybean for a cohort of 2,000 healthy and infected plants. A hybrid set of physiological and morphological features was suggested as inputs to the ML models. All developed ML models achieved an accuracy better than 90%. Gradient Tree Boosting (GBT) was the best performing classifier, obtaining 96.25% sensitivity and 97.33% specificity. Our findings supported the applicability of ML, especially GBT, for charcoal rot disease prediction in a real environment. Moreover, our analysis demonstrated the importance of including physiological features in the learning. The collected dataset and source code can be found at https://github.com/Elham-khalili/Soybean-Charcoal-Rot-Disease-Prediction-Dataset-code.
Affiliation(s)
- Elham Khalili
- Department of Plant Science, Faculty of Science, Tarbiat Modarres University, Tehran, Iran
| | - Samaneh Kouchaki
- Faculty of Engineering and Physical Sciences, Centre for Vision, Speech, and Signal Processing, University of Surrey, Guildford, United Kingdom
| | - Shahin Ramazi
- Department of Biophysics, Faculty of Biological Science, Tarbiat Modares University, Tehran, Iran
| | - Faezeh Ghanati
- Department of Plant Science, Faculty of Science, Tarbiat Modarres University, Tehran, Iran
| |
|
82
|
Lawson CE, Martí JM, Radivojevic T, Jonnalagadda SVR, Gentz R, Hillson NJ, Peisert S, Kim J, Simmons BA, Petzold CJ, Singer SW, Mukhopadhyay A, Tanjore D, Dunn JG, Garcia Martin H. Machine learning for metabolic engineering: A review. Metab Eng 2020; 63:34-60. [PMID: 33221420 DOI: 10.1016/j.ymben.2020.10.005] [Citation(s) in RCA: 86] [Impact Index Per Article: 21.5] [Received: 08/10/2020] [Revised: 10/22/2020] [Accepted: 10/31/2020] [Indexed: 12/14/2022]
Abstract
Machine learning provides researchers a unique opportunity to make metabolic engineering more predictable. In this review, we offer an introduction to this discipline in terms that are relatable to metabolic engineers, as well as providing in-depth illustrative examples leveraging omics data and improving production. We also include practical advice for the practitioner in terms of data management, algorithm libraries, computational resources, and important non-technical issues. A variety of applications ranging from pathway construction and optimization, to genetic editing optimization, cell factory testing, and production scale-up are discussed. Moreover, the promising relationship between machine learning and mechanistic models is thoroughly reviewed. Finally, the future perspectives and most promising directions for this combination of disciplines are examined.
Affiliation(s)
- Christopher E Lawson
- Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA; Joint BioEnergy Institute, Emeryville, CA, 94608, USA
| | - Jose Manuel Martí
- Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA; Joint BioEnergy Institute, Emeryville, CA, 94608, USA; DOE Agile BioFoundry, Emeryville, CA, 94608, USA
| | - Tijana Radivojevic
- Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA; Joint BioEnergy Institute, Emeryville, CA, 94608, USA; DOE Agile BioFoundry, Emeryville, CA, 94608, USA
| | - Sai Vamshi R Jonnalagadda
- Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA; Joint BioEnergy Institute, Emeryville, CA, 94608, USA; DOE Agile BioFoundry, Emeryville, CA, 94608, USA
| | - Reinhard Gentz
- Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA; Joint BioEnergy Institute, Emeryville, CA, 94608, USA; Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
| | - Nathan J Hillson
- Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA; Joint BioEnergy Institute, Emeryville, CA, 94608, USA; DOE Agile BioFoundry, Emeryville, CA, 94608, USA
| | - Sean Peisert
- Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA; Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA; University of California Davis, Davis, CA, 95616, USA
| | - Joonhoon Kim
- Joint BioEnergy Institute, Emeryville, CA, 94608, USA; Pacific Northwest National Laboratory, Richland, 99354, WA, USA
| | - Blake A Simmons
- Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA; Joint BioEnergy Institute, Emeryville, CA, 94608, USA; DOE Agile BioFoundry, Emeryville, CA, 94608, USA
| | - Christopher J Petzold
- Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA; Joint BioEnergy Institute, Emeryville, CA, 94608, USA; DOE Agile BioFoundry, Emeryville, CA, 94608, USA
| | - Steven W Singer
- Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA; Joint BioEnergy Institute, Emeryville, CA, 94608, USA
| | - Aindrila Mukhopadhyay
- Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA; Joint BioEnergy Institute, Emeryville, CA, 94608, USA; Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, USA
| | - Deepti Tanjore
- Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA; Advanced Biofuels and Bioproducts Process Development Unit, Emeryville, CA, 94608, USA
| | | | - Hector Garcia Martin
- Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA; Joint BioEnergy Institute, Emeryville, CA, 94608, USA; DOE Agile BioFoundry, Emeryville, CA, 94608, USA; Basque Center for Applied Mathematics, 48009, Bilbao, Spain; Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, USA.
| |
|
83
|
Affiliation(s)
- Rina Friedberg
- Department of Statistics, Stanford University , Stanford , CA
| | | | - Susan Athey
- Graduate School of Business, Stanford University , Stanford , CA
| | - Stefan Wager
- Graduate School of Business, Stanford University , Stanford , CA
| |
|
84
|
Affiliation(s)
- Tim C. D. Lucas
- Big Data Institute University of Oxford Old Road Campus Oxford OX3 7LF United Kingdom
| |
|
85
|
Parchure P, Joshi H, Dharmarajan K, Freeman R, Reich DL, Mazumdar M, Timsina P, Kia A. Development and validation of a machine learning-based prediction model for near-term in-hospital mortality among patients with COVID-19. BMJ Support Palliat Care 2020; 12:bmjspcare-2020-002602. [PMID: 32963059 PMCID: PMC8049537 DOI: 10.1136/bmjspcare-2020-002602] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Received: 07/28/2020] [Revised: 08/09/2020] [Accepted: 08/18/2020] [Indexed: 02/06/2023]
Abstract
OBJECTIVES To develop and validate a model for prediction of near-term in-hospital mortality among patients with COVID-19 by application of a machine learning (ML) algorithm on time-series inpatient data from electronic health records. METHODS The cohort comprised 567 patients with COVID-19 at a large acute care healthcare system between 10 February 2020 and 7 April 2020, observed until either death or discharge. A random forest (RF) model was developed on a randomly drawn 70% of the cohort (the training set) and its performance was evaluated on the remaining 30% (the test set). The outcome variable was in-hospital mortality within 20-84 hours from the time of prediction. Input features included patients' vital signs, laboratory data and ECG results. RESULTS Patients had a median age of 60.2 years (IQR 26.2 years); 54.1% were men. In-hospital mortality rate was 17.0% and overall median time to death was 6.5 days (range 1.3-23.0 days). In the test set, the RF classifier yielded a sensitivity of 87.8% (95% CI: 78.2% to 94.3%), specificity of 60.6% (95% CI: 55.2% to 65.8%), accuracy of 65.5% (95% CI: 60.7% to 70.0%), area under the receiver operating characteristic curve of 85.5% (95% CI: 80.8% to 90.2%) and area under the precision recall curve of 64.4% (95% CI: 53.5% to 75.3%). CONCLUSIONS Our ML-based approach can be used to analyse electronic health record data and reliably predict near-term mortality. Using such a model in hospitals could help improve care, thereby better aligning clinical decisions with prognosis in critically ill patients with COVID-19.
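The sensitivity, specificity and accuracy figures reported above are all derived from the four cells of a binary confusion matrix. A minimal sketch of those definitions follows. The function name and the counts are hypothetical placeholders, since the abstract does not report the underlying confusion matrix, so the printed values are not expected to reproduce the study's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classifier metrics from confusion-matrix counts."""
    return {
        # Share of actual positives (deaths) that were flagged.
        "sensitivity": tp / (tp + fn),
        # Share of actual negatives (survivors) that were not flagged.
        "specificity": tn / (tn + fp),
        # Share of all predictions that were correct.
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        # Share of flagged cases that were actual positives.
        "precision": tp / (tp + fp),
    }


# Hypothetical counts chosen for illustration; not taken from the study.
metrics = classification_metrics(tp=36, fp=52, tn=80, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With imbalanced outcomes, accuracy tracks specificity closely because negatives dominate the denominator, which is why a high-sensitivity model can still show modest accuracy.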
Affiliation(s)
- Prathamesh Parchure
- Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Himanshu Joshi
- Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, New York, United States
| | - Kavita Dharmarajan
- Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Geriatrics and Palliative Care, Icahn School of Medicine at Mount Sinai, New York, New York, United States
- Department of Radiation Oncology, Icahn School of Medicine at Mount Sinai, New York, New York, United States
| | - Robert Freeman
- Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Hospital Administration, Icahn School of Medicine at Mount Sinai, New York, New York, United States
| | - David L Reich
- Hospital Administration, Icahn School of Medicine at Mount Sinai, New York, New York, United States
- Department of Anesthesiology, Icahn School of Medicine at Mount Sinai, New York, New York, United States
| | - Madhu Mazumdar
- Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, New York, United States
| | - Prem Timsina
- Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Arash Kia
- Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| |
|
86
|
Hu Q, Greene CS, Heller EA. Specific histone modifications associate with alternative exon selection during mammalian development. Nucleic Acids Res 2020; 48:4709-4724. [PMID: 32319526 PMCID: PMC7229819 DOI: 10.1093/nar/gkaa248] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 12/03/2019] [Revised: 03/23/2020] [Accepted: 04/02/2020] [Indexed: 12/29/2022] Open
Abstract
Alternative splicing (AS) is frequent during early mouse embryonic development. Specific histone post-translational modifications (hPTMs) have been shown to regulate exon splicing by either directly recruiting splice machinery or indirectly modulating transcriptional elongation. In this study, we hypothesized that hPTMs regulate expression of alternatively spliced genes for specific processes during differentiation. To address this notion, we applied an innovative machine learning approach to relate global hPTM enrichment to AS regulation during mammalian tissue development. We found that specific hPTMs, H3K36me3 and H3K4me1, play a role in skipped exon selection among all the tissues and developmental time points examined. In addition, we used an iterative random forest model and found that interactions of multiple hPTMs most strongly predicted splicing when they included H3K36me3 and H3K4me1. Collectively, our data demonstrated a link between hPTMs and alternative splicing, which will drive further experimental studies on the functional relevance of these modifications to alternative splicing.
Affiliation(s)
- Qiwen Hu
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Elizabeth A Heller
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| |
|
87
|
Hao B, Zhang A, Cheng G. Sparse and Low-rank Tensor Estimation via Cubic Sketchings. IEEE TRANSACTIONS ON INFORMATION THEORY 2020; 66:5927-5964. [PMID: 33746244 PMCID: PMC7978041 DOI: 10.1109/tit.2020.2982499] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 06/12/2023]
Abstract
In this paper, we propose a general framework for sparse and low-rank tensor estimation from cubic sketchings. A two-stage non-convex implementation is developed based on sparse tensor decomposition and thresholded gradient descent, which ensures exact recovery in the noiseless case and stable recovery in the noisy case with high probability. The non-asymptotic analysis sheds light on an interplay between optimization error and statistical error. The proposed procedure is shown to be rate-optimal under certain conditions. As a technical by-product, novel high-order concentration inequalities are derived for studying high-moment sub-Gaussian tensors. An interesting tensor formulation illustrates the potential application to high-order interaction pursuit in high-dimensional linear regression.
Affiliation(s)
- Botao Hao
- Department of Electrical Engineering, Princeton University, Princeton, NJ 08540
| | - Anru Zhang
- Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706
| | - Guang Cheng
- Department of Statistics, Purdue University, West Lafayette, IN 47906
| |
|
88
|
Huang S, Blatti C, Sinha S, Parameswaran A. Uncovering Effective Explanations for Interactive Genomic Data Analysis. PATTERNS 2020; 1:100093. [PMID: 33205133 PMCID: PMC7660438 DOI: 10.1016/j.patter.2020.100093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2020] [Revised: 07/13/2020] [Accepted: 08/05/2020] [Indexed: 10/25/2022]
|
89
|
Schperberg AV, Boichard A, Tsigelny IF, Richard SB, Kurzrock R. Machine learning model to predict oncologic outcomes for drugs in randomized clinical trials. Int J Cancer 2020; 147:2537-2549. [PMID: 32745254 DOI: 10.1002/ijc.33240] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 03/31/2020] [Revised: 07/15/2020] [Accepted: 07/17/2020] [Indexed: 11/12/2022]
Abstract
Predicting oncologic outcome is challenging due to the diversity of cancer histologies and the complex network of underlying biological factors. In this study, we determine whether machine learning (ML) can extract meaningful associations between oncologic outcome and clinical trial, drug-related biomarker and molecular profile information. We analyzed therapeutic clinical trials corresponding to 1102 oncologic outcomes from 104 758 cancer patients with advanced colorectal adenocarcinoma, pancreatic adenocarcinoma, melanoma and nonsmall-cell lung cancer. For each intervention arm, a dataset with the following attributes was curated: line of treatment, the number of cytotoxic chemotherapies, small-molecule inhibitors, or monoclonal antibody agents, drug class, molecular alteration status of the clinical arm's population, cancer type, probability of drug sensitivity (PDS) (integrating the status of genomic, transcriptomic and proteomic biomarkers in the population of interest) and outcome. A total of 467 progression-free survival (PFS) and 369 overall survival (OS) data points were used as training sets to build our ML (random forest) model. Cross-validation sets were used for PFS and OS, obtaining correlation coefficients (r) of 0.82 and 0.70, respectively (outcome vs model's parameters). A total of 156 PFS and 110 OS data points were used as test sets. The Spearman correlation (rs ) between predicted and actual outcomes was statistically significant (PFS: rs = 0.879, OS: rs = 0.878, P < .0001). The better outcome arm was predicted in 81% (PFS: N = 59/73, z = 5.24, P < .0001) and 71% (OS: N = 37/52, z = 2.91, P = .004) of randomized trials. The success of our algorithm to predict clinical outcome may be exploitable as a model to optimize clinical trial design with pharmaceutical agents.
Affiliation(s)
- Alexander V Schperberg
- CureMatch, Inc., San Diego, California, USA.,Department of Mechanical and Aerospace Engineering, University of California Los Angeles, Los Angeles, California, USA
| | - Amélie Boichard
- Center for Personalized Cancer Therapy and Division of Hematology and Oncology, University of California San Diego Moores Cancer Center, La Jolla, California, USA
| | - Igor F Tsigelny
- CureMatch, Inc., San Diego, California, USA.,San Diego Supercomputer Center, University of California San Diego, La Jolla, California, USA.,Department of Neurosciences, University of California San Diego, La Jolla, California, USA
| | - Stéphane B Richard
- CureMatch, Inc., San Diego, California, USA.,Oncodesign, Inc., New York, New York, USA
| | - Razelle Kurzrock
- Center for Personalized Cancer Therapy and Division of Hematology and Oncology, University of California San Diego Moores Cancer Center, La Jolla, California, USA
| |
|
90
|
Ghazanfar S, Lin Y, Su X, Lin DM, Patrick E, Han ZG, Marioni JC, Yang JYH. Investigating higher-order interactions in single-cell data with scHOT. Nat Methods 2020; 17:799-806. [PMID: 32661426 PMCID: PMC7610653 DOI: 10.1038/s41592-020-0885-x] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 10/28/2019] [Accepted: 06/03/2020] [Indexed: 12/12/2022]
Abstract
Single-cell genomics has transformed our ability to examine cell fate choice. Examining cells along a computationally ordered 'pseudotime' offers the potential to unpick subtle changes in variability and covariation among key genes. We describe an approach, scHOT (single-cell higher-order testing), which provides a flexible and statistically robust framework for identifying changes in higher-order interactions among genes. scHOT can be applied for cells along a continuous trajectory or across space and accommodates various higher-order measurements including variability or correlation. We demonstrate the use of scHOT by studying coordinated changes in higher-order interactions during embryonic development of the mouse liver. Additionally, scHOT identifies subtle changes in gene-gene correlations across space using spatially resolved transcriptomics data from the mouse olfactory bulb. scHOT meaningfully adds to first-order differential expression testing and provides a framework for interrogating higher-order interactions using single-cell data.
Affiliation(s)
- Shila Ghazanfar
- Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK
| | - Yingxin Lin
- School of Mathematics and Statistics, The University of Sydney, Sydney, New South Wales, Australia
- Charles Perkins Centre, The University of Sydney, Sydney, New South Wales, Australia
| | - Xianbin Su
- Key Laboratory of Systems Biomedicine (Ministry of Education) and Collaborative Innovation Center of Systems Biomedicine, Shanghai Center for Systems Biomedicine, Shanghai Jiao Tong University, Shanghai, China
| | - David Ming Lin
- Department of Biomedical Sciences, Cornell University, Ithaca, NY, USA
| | - Ellis Patrick
- School of Mathematics and Statistics, The University of Sydney, Sydney, New South Wales, Australia
- Westmead Institute for Medical Research, Westmead, New South Wales, Australia
| | - Ze-Guang Han
- Key Laboratory of Systems Biomedicine (Ministry of Education) and Collaborative Innovation Center of Systems Biomedicine, Shanghai Center for Systems Biomedicine, Shanghai Jiao Tong University, Shanghai, China
| | - John C Marioni
- Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK.
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Cambridge, UK.
- Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, UK.
| | - Jean Yee Hwa Yang
- School of Mathematics and Statistics, The University of Sydney, Sydney, New South Wales, Australia.
- Charles Perkins Centre, The University of Sydney, Sydney, New South Wales, Australia.
| |
|
91
|
A mechanism-aware and multiomic machine-learning pipeline characterizes yeast cell growth. Proc Natl Acad Sci U S A 2020; 117:18869-18879. [PMID: 32675233 DOI: 10.1073/pnas.2002959117] [Citation(s) in RCA: 54] [Impact Index Per Article: 13.5] [Indexed: 11/18/2022] Open
Abstract
Metabolic modeling and machine learning are key components in the emerging next generation of systems and synthetic biology tools, targeting the genotype-phenotype-environment relationship. Rather than being used in isolation, it is becoming clear that their value is maximized when they are combined. However, the potential of integrating these two frameworks for omic data augmentation and integration is largely unexplored. We propose, rigorously assess, and compare machine-learning-based data integration techniques, combining gene expression profiles with computationally generated metabolic flux data to predict yeast cell growth. To this end, we create strain-specific metabolic models for 1,143 Saccharomyces cerevisiae mutants and we test 27 machine-learning methods, incorporating state-of-the-art feature selection and multiview learning approaches. We propose a multiview neural network using fluxomic and transcriptomic data, showing that the former increases the predictive accuracy of the latter and reveals functional patterns that are not directly deducible from gene expression alone. We test the proposed neural network on a further 86 strains generated in a different experiment, therefore verifying its robustness to an additional independent dataset. Finally, we show that introducing mechanistic flux features improves the predictions also for knockout strains whose genes were not modeled in the metabolic reconstruction. Our results thus demonstrate that fusing experimental cues with in silico models, based on known biochemistry, can contribute with disjoint information toward biologically informed and interpretable machine learning. Overall, this study provides tools for understanding and manipulating complex phenotypes, increasing both the prediction accuracy and the extent of discernible mechanistic biological insights.
|
92
|
Affiliation(s)
- Bin Yu
- Statistics Department, University of California Berkeley, Berkeley, CA
- EECS Department, University of California Berkeley, Berkeley, CA
- Chan Zuckerberg Biohub, San Francisco, CA
| | - Rebecca Barter
- Statistics Department, University of California Berkeley, Berkeley, CA
| |
|
93
|
Cheng FY, Joshi H, Tandon P, Freeman R, Reich DL, Mazumdar M, Kohli-Seth R, Levin MA, Timsina P, Kia A. Using Machine Learning to Predict ICU Transfer in Hospitalized COVID-19 Patients. J Clin Med 2020; 9:jcm9061668. [PMID: 32492874 PMCID: PMC7356638 DOI: 10.3390/jcm9061668] [Citation(s) in RCA: 96] [Impact Index Per Article: 24.0] [Received: 05/12/2020] [Revised: 05/27/2020] [Accepted: 05/28/2020] [Indexed: 12/13/2022] Open
Abstract
OBJECTIVES Approximately 20-30% of patients with COVID-19 require hospitalization, and 5-12% may require critical care in an intensive care unit (ICU). A rapid surge in cases of severe COVID-19 will lead to a corresponding surge in demand for ICU care. Because of constraints on resources, frontline healthcare workers may be unable to provide the frequent monitoring and assessment required for all patients at high risk of clinical deterioration. We developed a machine learning-based risk prioritization tool that predicts ICU transfer within 24 h, seeking to facilitate efficient use of care providers' efforts and help hospitals plan their flow of operations. METHODS A retrospective cohort was comprised of non-ICU COVID-19 admissions at a large acute care health system between 26 February and 18 April 2020. Time series data, including vital signs, nursing assessments, laboratory data, and electrocardiograms, were used as input variables for training a random forest (RF) model. The cohort was randomly split (70:30) into training and test sets. The RF model was trained using 10-fold cross-validation on the training set, and its predictive performance on the test set was then evaluated. RESULTS The cohort consisted of 1987 unique patients diagnosed with COVID-19 and admitted to non-ICU units of the hospital. The median time to ICU transfer was 2.45 days from the time of admission. Compared to actual admissions, the tool had 72.8% (95% CI: 63.2-81.1%) sensitivity, 76.3% (95% CI: 74.7-77.9%) specificity, 76.2% (95% CI: 74.6-77.7%) accuracy, and 79.9% (95% CI: 75.2-84.6%) area under the receiver operating characteristics curve. CONCLUSIONS A ML-based prediction model can be used as a screening tool to identify patients at risk of imminent ICU transfer within 24 h. This tool could improve the management of hospital resources and patient-throughput planning, thus delivering more effective care to patients hospitalized with COVID-19.
Affiliation(s)
- Fu-Yuan Cheng
- Institute for Healthcare Delivery Science; Icahn School of Medicine at Mount Sinai, 1425 Madison Avenue, New York, NY 10029, USA; (F.-Y.C.); (H.J.); (R.F.); (P.T.); (A.K.)
| | - Himanshu Joshi
- Institute for Healthcare Delivery Science; Icahn School of Medicine at Mount Sinai, 1425 Madison Avenue, New York, NY 10029, USA; (F.-Y.C.); (H.J.); (R.F.); (P.T.); (A.K.)
- Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, 1425 Madison Avenue, New York, NY 10029, USA
| | - Pranai Tandon
- Respiratory Institute, Icahn School of Medicine at Mount Sinai, 10 E 102nd St, New York, NY 10029, USA;
| | - Robert Freeman
- Institute for Healthcare Delivery Science; Icahn School of Medicine at Mount Sinai, 1425 Madison Avenue, New York, NY 10029, USA; (F.-Y.C.); (H.J.); (R.F.); (P.T.); (A.K.)
- Hospital Administration; The Mount Sinai Hospital, 1 Gustave L. Levy Place, New York, NY 10029, USA;
| | - David L Reich
- Hospital Administration; The Mount Sinai Hospital, 1 Gustave L. Levy Place, New York, NY 10029, USA;
- Department of Anesthesiology, Perioperative and Pain Medicine, 1 Gustave L. Levy Place, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA;
| | - Madhu Mazumdar
- Institute for Healthcare Delivery Science; Icahn School of Medicine at Mount Sinai, 1425 Madison Avenue, New York, NY 10029, USA; (F.-Y.C.); (H.J.); (R.F.); (P.T.); (A.K.)
- Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, 1425 Madison Avenue, New York, NY 10029, USA
- Correspondence: Tel.: +1-212-659-1470; Fax: +1-212-423-2998
| | - Roopa Kohli-Seth
- Institute for Critical Care Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA;
| | - Matthew A. Levin
- Department of Anesthesiology, Perioperative and Pain Medicine, 1 Gustave L. Levy Place, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA;
- Department of Genetics and Genomic Sciences, 1 Gustave L. Levy Place, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Prem Timsina
- Institute for Healthcare Delivery Science; Icahn School of Medicine at Mount Sinai, 1425 Madison Avenue, New York, NY 10029, USA; (F.-Y.C.); (H.J.); (R.F.); (P.T.); (A.K.)
| | - Arash Kia
- Institute for Healthcare Delivery Science; Icahn School of Medicine at Mount Sinai, 1425 Madison Avenue, New York, NY 10029, USA; (F.-Y.C.); (H.J.); (R.F.); (P.T.); (A.K.)
| |
|
94
|
Wang H, Sham P, Tong T, Pang H. Pathway-Based Single-Cell RNA-Seq Classification, Clustering, and Construction of Gene-Gene Interactions Networks Using Random Forests. IEEE J Biomed Health Inform 2020; 24:1814-1822. [DOI: 10.1109/jbhi.2019.2944865] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 11/09/2022]
|
95
|
Zhang X, Baer AG, Price JM, Jones PC, Garcia BJ, Romero J, Cliff AM, Mi W, Brown JB, Jacobson DA, Lydic R, Baghdoyan HA. Neurotransmitter networks in mouse prefrontal cortex are reconfigured by isoflurane anesthesia. J Neurophysiol 2020; 123:2285-2296. [PMID: 32347157 PMCID: PMC7311717 DOI: 10.1152/jn.00092.2020] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 12/17/2022] Open
Abstract
This study quantified eight small-molecule neurotransmitters collected simultaneously from prefrontal cortex of C57BL/6J mice (n = 23) during wakefulness and during isoflurane anesthesia (1.3%). Using isoflurane anesthesia as an independent variable enabled evaluation of the hypothesis that isoflurane anesthesia differentially alters concentrations of multiple neurotransmitters and their interactions. Machine learning was applied to reveal higher order interactions among neurotransmitters. Using a between-subjects design, microdialysis was performed during wakefulness and during anesthesia. Concentrations (nM) of acetylcholine, adenosine, dopamine, GABA, glutamate, histamine, norepinephrine, and serotonin in the dialysis samples are reported (means ± SD). Relative to wakefulness, acetylcholine concentration was lower during isoflurane anesthesia (1.254 ± 1.118 vs. 0.401 ± 0.134, P = 0.009), and concentrations of adenosine (29.456 ± 29.756 vs. 101.321 ± 38.603, P < 0.001), dopamine (0.0578 ± 0.0384 vs. 0.113 ± 0.084, P = 0.036), and norepinephrine (0.126 ± 0.080 vs. 0.219 ± 0.066, P = 0.010) were higher during anesthesia. Isoflurane reconfigured neurotransmitter interactions in prefrontal cortex, and the state of isoflurane anesthesia was reliably predicted by prefrontal cortex concentrations of adenosine, norepinephrine, and acetylcholine. A novel finding to emerge from machine learning analyses is that neurotransmitter concentration profiles in mouse prefrontal cortex undergo functional reconfiguration during isoflurane anesthesia. Adenosine, norepinephrine, and acetylcholine showed high feature importance, supporting the interpretation that interactions among these three transmitters may play a key role in modulating levels of cortical and behavioral arousal. NEW & NOTEWORTHY This study discovered that interactions between neurotransmitters in mouse prefrontal cortex were altered during isoflurane anesthesia relative to wakefulness. 
Machine learning further demonstrated that, relative to wakefulness, higher order interactions among neurotransmitters were disrupted during isoflurane administration. These findings extend to the neurochemical domain the concept that anesthetic-induced loss of wakefulness results from a disruption of neural network connectivity.
Affiliation(s)
- Xiaoying Zhang
- Department of Anesthesiology, University of Tennessee Medical Center, Knoxville, Tennessee.,Department of Psychology, University of Tennessee, Knoxville, Tennessee.,Anesthesia and Operation Center, Chinese PLA General Hospital, Beijing, China
| | - Aaron G Baer
- Department of Anesthesiology, University of Tennessee Medical Center, Knoxville, Tennessee
| | - Joshua M Price
- Office of Information Technology, University of Tennessee, Knoxville, Tennessee
| | - Piet C Jones
- Oak Ridge National Laboratory, Oak Ridge, Tennessee.,Bredesen Center for Interdisciplinary Research and Graduate Education, University of Tennessee, Knoxville, Tennessee
| | | | - Jonathon Romero
- Oak Ridge National Laboratory, Oak Ridge, Tennessee.,Bredesen Center for Interdisciplinary Research and Graduate Education, University of Tennessee, Knoxville, Tennessee
| | - Ashley M Cliff
- Oak Ridge National Laboratory, Oak Ridge, Tennessee.,Bredesen Center for Interdisciplinary Research and Graduate Education, University of Tennessee, Knoxville, Tennessee
| | - Weidong Mi
- Anesthesia and Operation Center, Chinese PLA General Hospital, Beijing, China
| | - James B Brown
- Molecular Ecosystems Biology Department, Lawrence Berkeley National Laboratory, Berkeley, California
| | | | - Ralph Lydic
- Department of Anesthesiology, University of Tennessee Medical Center, Knoxville, Tennessee.,Department of Psychology, University of Tennessee, Knoxville, Tennessee.,Oak Ridge National Laboratory, Oak Ridge, Tennessee
| | - Helen A Baghdoyan
- Department of Anesthesiology, University of Tennessee Medical Center, Knoxville, Tennessee.,Department of Psychology, University of Tennessee, Knoxville, Tennessee.,Oak Ridge National Laboratory, Oak Ridge, Tennessee
| |
|
96
|
Azodi CB, Tang J, Shiu SH. Opening the Black Box: Interpretable Machine Learning for Geneticists. Trends Genet 2020; 36:442-455. [PMID: 32396837 DOI: 10.1016/j.tig.2020.03.005] [Citation(s) in RCA: 104] [Impact Index Per Article: 26.0] [Received: 01/03/2020] [Revised: 03/12/2020] [Accepted: 03/16/2020] [Indexed: 01/16/2023]
Abstract
Because of its ability to find complex patterns in high dimensional and heterogeneous data, machine learning (ML) has emerged as a critical tool for making sense of the growing amount of genetic and genomic data available. While the complexity of ML models is what makes them powerful, it also makes them difficult to interpret. Fortunately, efforts to develop approaches that make the inner workings of ML models understandable to humans have improved our ability to make novel biological insights. Here, we discuss the importance of interpretable ML, different strategies for interpreting ML models, and examples of how these strategies have been applied. Finally, we identify challenges and promising future directions for interpretable ML in genetics and genomics.
Affiliation(s)
- Christina B Azodi
- Department of Plant Biology, Michigan State University, East Lansing, MI, USA; Bioinformatics and Cellular Genomics, St. Vincent's Institute of Medical Research, Fitzroy, Victoria, Australia.
| | - Jiliang Tang
- Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
| | - Shin-Han Shiu
- Department of Plant Biology, Michigan State University, East Lansing, MI, USA; Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI, USA.
| |
|
97
|
Rillig MC, Ryo M, Lehmann A, Aguilar-Trigueros CA, Buchert S, Wulf A, Iwasaki A, Roy J, Yang G. The role of multiple global change factors in driving soil functions and microbial biodiversity. Science 2020; 366:886-890. [PMID: 31727838 DOI: 10.1126/science.aay2832] [Citation(s) in RCA: 251] [Impact Index Per Article: 62.8] [Received: 06/05/2019] [Revised: 08/28/2019] [Accepted: 10/15/2019] [Indexed: 01/06/2023]
Abstract
Soils underpin terrestrial ecosystem functions, but they face numerous anthropogenic pressures. Despite their crucial ecological role, we know little about how soils react to more than two environmental factors at a time. Here, we show experimentally that increasing the number of simultaneous global change factors (up to 10) caused increasing directional changes in soil properties, soil processes, and microbial communities, though there was greater uncertainty in predicting the magnitude of change. Our study provides a blueprint for addressing multifactor change with an efficient, broadly applicable experimental design for studying the impacts of global environmental change.
Affiliation(s)
- Matthias C Rillig
- Institute of Biology, Freie Universität Berlin, 14195 Berlin, Germany.,Berlin-Brandenburg Institute of Advanced Biodiversity Research (BBIB), 14195 Berlin, Germany
| | - Masahiro Ryo
- Institute of Biology, Freie Universität Berlin, 14195 Berlin, Germany.,Berlin-Brandenburg Institute of Advanced Biodiversity Research (BBIB), 14195 Berlin, Germany
| | - Anika Lehmann
- Institute of Biology, Freie Universität Berlin, 14195 Berlin, Germany.,Berlin-Brandenburg Institute of Advanced Biodiversity Research (BBIB), 14195 Berlin, Germany
| | - Carlos A Aguilar-Trigueros
- Institute of Biology, Freie Universität Berlin, 14195 Berlin, Germany.,Berlin-Brandenburg Institute of Advanced Biodiversity Research (BBIB), 14195 Berlin, Germany
| | - Sabine Buchert
- Institute of Biology, Freie Universität Berlin, 14195 Berlin, Germany.,Berlin-Brandenburg Institute of Advanced Biodiversity Research (BBIB), 14195 Berlin, Germany
| | - Anja Wulf
- Institute of Biology, Freie Universität Berlin, 14195 Berlin, Germany.,Berlin-Brandenburg Institute of Advanced Biodiversity Research (BBIB), 14195 Berlin, Germany
| | - Aiko Iwasaki
- Institute of Biology, Freie Universität Berlin, 14195 Berlin, Germany.,Berlin-Brandenburg Institute of Advanced Biodiversity Research (BBIB), 14195 Berlin, Germany
| | - Julien Roy
- Institute of Biology, Freie Universität Berlin, 14195 Berlin, Germany.,Berlin-Brandenburg Institute of Advanced Biodiversity Research (BBIB), 14195 Berlin, Germany
| | - Gaowen Yang
- Institute of Biology, Freie Universität Berlin, 14195 Berlin, Germany.,Berlin-Brandenburg Institute of Advanced Biodiversity Research (BBIB), 14195 Berlin, Germany
| |
|
98
|
Abstract
Building and expanding on principles of statistics, machine learning, and scientific inquiry, we propose the predictability, computability, and stability (PCS) framework for veridical data science. Our framework, composed of both a workflow and documentation, aims to provide responsible, reliable, reproducible, and transparent results across the data science life cycle. The PCS workflow uses predictability as a reality check and considers the importance of computation in data collection/storage and algorithm design. It augments predictability and computability with an overarching stability principle. Stability expands on statistical uncertainty considerations to assess how human judgment calls impact data results through data and model/algorithm perturbations. As part of the PCS workflow, we develop PCS inference procedures, namely PCS perturbation intervals and PCS hypothesis testing, to investigate the stability of data results relative to problem formulation, data cleaning, modeling decisions, and interpretations. We illustrate PCS inference through neuroscience and genomics projects of our own and others. Moreover, we demonstrate its favorable performance over existing methods in terms of receiver operating characteristic (ROC) curves in high-dimensional, sparse linear model simulations, including a wide range of misspecified models. Finally, we propose PCS documentation based on R Markdown or Jupyter Notebook, with publicly available, reproducible codes and narratives to back up human choices made throughout an analysis. The PCS workflow and documentation are demonstrated in a genomics case study available on Zenodo.
|
99
|
King DM, Hong CKY, Shepherdson JL, Granas DM, Maricque BB, Cohen BA. Synthetic and genomic regulatory elements reveal aspects of cis-regulatory grammar in mouse embryonic stem cells. eLife 2020; 9:41279. [PMID: 32043966 PMCID: PMC7077988 DOI: 10.7554/elife.41279] [Citation(s) in RCA: 43] [Impact Index Per Article: 10.8] [Received: 08/20/2018] [Accepted: 02/07/2020] [Indexed: 01/08/2023] Open
Abstract
In embryonic stem cells (ESCs), a core transcription factor (TF) network establishes the gene expression program necessary for pluripotency. To address how interactions between four key TFs contribute to cis-regulation in mouse ESCs, we assayed two massively parallel reporter assay (MPRA) libraries composed of binding sites for SOX2, POU5F1 (OCT4), KLF4, and ESRRB. Comparisons between synthetic cis-regulatory elements and genomic sequences with comparable binding site configurations revealed some aspects of a regulatory grammar. The expression of synthetic elements is influenced by both the number and arrangement of binding sites. This grammar plays only a small role for genomic sequences, as the relative activities of genomic sequences are best explained by the predicted occupancy of binding sites, regardless of binding site identity and positioning. Our results suggest that the effects of transcription factor binding sites (TFBS) are influenced by the order and orientation of sites, but that in the genome the overall occupancy of TFs is the primary determinant of activity. Transcription factors are proteins that flip genetic switches; their role is to control when and where genes are active. They do this by binding to short stretches of DNA called cis-regulatory sequences. Each sequence can have several binding sites for different transcription factors, but it is largely unclear whether the transcription factors binding to the same regulatory sequence actually work together. It is possible that each transcription factor may work independently and there only needs to be critical mass of transcription factors bound to throw the genetic switch. If this is the case, the most important features of a cis-regulatory sequence should be the number of binding sites it contains, and how tightly the transcription factors bind to those sites. The more transcription factors and the more strongly they bind, the more active the gene should be. 
An alternative option is that certain transcription factors may work better together, enhancing each other's effects such that the total effect is more than the sum of its parts. If this is true, the order, orientation and spacing of the binding sites within a sequence should matter more than the number. One way to distinguish between these possibilities is to study mouse embryonic stem cells, which have a core set of four transcription factors. Looking directly at a real genome, however, can be confusing and it is difficult to measure the effects of different cis-regulatory sequences because genes differ in so many other ways. To tackle this problem, King et al. created a synthetic set of cis-regulatory sequences based on the four core transcription factors found in mouse stem cells. The synthetic set had every combination of two, three or four of the binding sites, with each site either facing forwards or backwards along the DNA strand. King et al. attached each of the synthetic cis-regulatory sequences to a reporter gene to find out how well each sequence performed. This revealed that the cis-regulatory sequences with the most binding sites and the tightest binding affinities work best, suggesting that transcription factors mainly work independently. There was evidence of some interaction between some transcription factors, because, of the synthetic sequences with four binding sites, some worked better than others, and there were patterns in the most effective binding site combinations. However, these effects were small and when King et al. went on to test sequences from the real mouse genome, the most important factor by far was the number of binding sites. Synthetic libraries of DNA sequences allow researchers to examine gene regulation more clearly than is possible in real genomes. Yet this approach does have its limitations and it is impossible to capture every type of cis-regulatory sequence in one library. 
The next step to extend this work is to combine the two approaches, taking sequences from the real genome and manipulating them one by one. This could help to unravel the rules that govern how cis-regulatory sequences work in real cells.
Affiliation(s)
- Dana M King
- Edison Center for Genome Sciences and Systems Biology, Washington University in St. Louis, St. Louis, United States.,Department of Genetics, Washington University in St. Louis, St. Louis, United States
| | - Clarice Kit Yee Hong
- Edison Center for Genome Sciences and Systems Biology, Washington University in St. Louis, St. Louis, United States.,Department of Genetics, Washington University in St. Louis, St. Louis, United States
| | - James L Shepherdson
- Edison Center for Genome Sciences and Systems Biology, Washington University in St. Louis, St. Louis, United States.,Department of Genetics, Washington University in St. Louis, St. Louis, United States
| | - David M Granas
- Edison Center for Genome Sciences and Systems Biology, Washington University in St. Louis, St. Louis, United States.,Department of Genetics, Washington University in St. Louis, St. Louis, United States
| | - Brett B Maricque
- Edison Center for Genome Sciences and Systems Biology, Washington University in St. Louis, St. Louis, United States.,Department of Genetics, Washington University in St. Louis, St. Louis, United States
| | - Barak A Cohen
- Edison Center for Genome Sciences and Systems Biology, Washington University in St. Louis, St. Louis, United States.,Department of Genetics, Washington University in St. Louis, St. Louis, United States
| |
|
100
|
Streich J, Romero J, Gazolla JGFM, Kainer D, Cliff A, Prates ET, Brown JB, Khoury S, Tuskan GA, Garvin M, Jacobson D, Harfouche AL. Can exascale computing and explainable artificial intelligence applied to plant biology deliver on the United Nations sustainable development goals? Curr Opin Biotechnol 2020; 61:217-225. [DOI: 10.1016/j.copbio.2020.01.010] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 10/25/2019] [Revised: 01/27/2020] [Accepted: 01/28/2020] [Indexed: 01/26/2023]
|