1
McDonough C, Li YC, Vangeepuram N, Liu B, Pandey G. A Comprehensive Youth Diabetes Epidemiological Data Set and Web Portal: Resource Development and Case Studies. JMIR Public Health Surveill 2024; 10:e53330. [PMID: 38666756 DOI: 10.2196/53330] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2023] [Revised: 02/06/2024] [Accepted: 04/26/2024] [Indexed: 05/03/2024] Open
Abstract
BACKGROUND The prevalence of type 2 diabetes mellitus (DM) and pre-diabetes mellitus (pre-DM) has been increasing among youth in recent decades in the United States, prompting an urgent need for understanding and identifying their associated risk factors. Such efforts, however, have been hindered by the lack of easily accessible youth pre-DM/DM data. OBJECTIVE We aimed to first build a high-quality, comprehensive epidemiological data set focused on youth pre-DM/DM. Subsequently, we aimed to make these data accessible by creating a user-friendly web portal to share them and the corresponding codes. Through this, we hope to address this significant gap and facilitate youth pre-DM/DM research. METHODS Building on data from the National Health and Nutrition Examination Survey (NHANES) from 1999 to 2018, we cleaned and harmonized hundreds of variables relevant to pre-DM/DM (fasting plasma glucose level ≥100 mg/dL or glycated hemoglobin ≥5.7%) for youth aged 12-19 years (N=15,149). We identified individual factors associated with pre-DM/DM risk using bivariate statistical analyses and predicted pre-DM/DM status using our Ensemble Integration (EI) framework for multidomain machine learning. We then developed a user-friendly web portal named Prediabetes/diabetes in youth Online Dashboard (POND) to share the data and codes. RESULTS We extracted 95 variables potentially relevant to pre-DM/DM risk organized into 4 domains (sociodemographic, health status, diet, and other lifestyle behaviors). The bivariate analyses identified 27 significant correlates of pre-DM/DM (P<.001, Bonferroni adjusted), including race or ethnicity, health insurance, BMI, added sugar intake, and screen time. Among these factors, 16 factors were also identified based on the EI methodology (Fisher P of overlap=7.06×10⁻⁶).
In addition to those, the EI approach identified 11 additional predictive variables, including some known (eg, meat and fruit intake and family income) and less recognized factors (eg, number of rooms in homes). The factors identified in both analyses spanned across all 4 of the domains mentioned. These data and results, as well as other exploratory tools, can be accessed on POND. CONCLUSIONS Using NHANES data, we built one of the largest public epidemiological data sets for studying youth pre-DM/DM and identified potential risk factors using complementary analytical approaches. Our results align with the multifactorial nature of pre-DM/DM with correlates across several domains. Also, our data-sharing platform, POND, facilitates a wide range of applications to inform future youth pre-DM/DM studies.
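The overlap P value quoted in the Results can be illustrated with a one-sided hypergeometric (Fisher exact) tail. The following is a minimal sketch, not the authors' code: it assumes the 95 extracted variables form the universe, that each analysis selected 27 of them (the EI count is inferred as 16 shared plus 11 additional), and that 16 were shared. The function name `overlap_p_value` is hypothetical, and the result need not reproduce the published figure if the authors used a different universe or test variant.

```python
from math import comb


def overlap_p_value(universe, set_a, set_b, shared):
    """P(overlap >= shared) if set_b items were drawn at random from the universe."""
    upper = min(set_a, set_b)
    # Count the ways to pick set_b items that hit set_a exactly k times, for k >= shared.
    favourable = sum(
        comb(set_a, k) * comb(universe - set_a, set_b - k)
        for k in range(shared, upper + 1)
    )
    return favourable / comb(universe, set_b)


# Assumed inputs read off the abstract: 95 variables, 27 per analysis, 16 shared.
p = overlap_p_value(universe=95, set_a=27, set_b=27, shared=16)
print(f"one-sided overlap P = {p:.3g}")
```

A small value here only says that sharing 16 variables would be unlikely if the two analyses picked variables independently at random.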
Affiliation(s)
- Catherine McDonough
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Yan Chak Li
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Nita Vangeepuram
  - Department of Pediatrics, Icahn School of Medicine at Mount Sinai, New York, NY, United States
  - Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Bian Liu
  - Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Gaurav Pandey
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, United States
2
Wright BM, Bodnar MS, Moore AD, Maseda MC, Kucharik MP, Diaz CC, Schmidt CM, Mir HR. Is ChatGPT a trusted source of information for total hip and knee arthroplasty patients? Bone Jt Open 2024; 5:139-146. [PMID: 38354748 PMCID: PMC10867788 DOI: 10.1302/2633-1462.52.bjo-2023-0113.r1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/16/2024] Open
Abstract
Aims While internet search engines have been the primary information source for patients' questions, artificial intelligence large language models like ChatGPT are trending towards becoming the new primary source. The purpose of this study was to determine if ChatGPT can answer patient questions about total hip (THA) and knee arthroplasty (TKA) with consistent accuracy, comprehensiveness, and easy readability. Methods We posed the 20 most Google-searched questions about THA and TKA, plus ten additional postoperative questions, to ChatGPT. Each question was asked twice to evaluate for consistency in quality. Following each response, we responded with, "Please explain so it is easier to understand," to evaluate ChatGPT's ability to reduce response reading grade level, measured as Flesch-Kincaid Grade Level (FKGL). Five resident physicians rated the 120 responses on 1 to 5 accuracy and comprehensiveness scales. Additionally, they answered a "yes" or "no" question regarding acceptability. Mean scores were calculated for each question, and responses were deemed acceptable if ≥ four raters answered "yes." Results The mean accuracy and comprehensiveness scores were 4.26 (95% confidence interval (CI) 4.19 to 4.33) and 3.79 (95% CI 3.69 to 3.89), respectively. Out of all the responses, 59.2% (71/120; 95% CI 50.0% to 67.7%) were acceptable. ChatGPT was consistent when asked the same question twice, giving no significant difference in accuracy (t = 0.821; p = 0.415), comprehensiveness (t = 1.387; p = 0.171), acceptability (χ² = 1.832; p = 0.176), and FKGL (t = 0.264; p = 0.793). There was a significantly lower FKGL (t = 2.204; p = 0.029) for easier responses (11.14; 95% CI 10.57 to 11.71) than original responses (12.15; 95% CI 11.45 to 12.85). Conclusion ChatGPT answered THA and TKA patient questions with accuracy comparable to previous reports of websites, with adequate comprehensiveness, but with limited acceptability as the sole information source.
ChatGPT has potential for answering patient questions about THA and TKA, but needs improvement.
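The Flesch-Kincaid Grade Level used in this study is a fixed formula over word, sentence, and syllable counts. The sketch below shows the standard formula; the vowel-group syllable counter and the function names are simplifying assumptions for illustration and will not match whatever readability tool the authors used.

```python
import re


def fkgl_from_counts(words, sentences, syllables):
    """Standard Flesch-Kincaid Grade Level formula."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def count_syllables(word):
    """Crude heuristic (assumption): one syllable per run of vowels, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fkgl(text):
    """Estimate FKGL for a non-empty English text using the heuristic above."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return fkgl_from_counts(len(words), len(sentences), syllables)


print(round(fkgl_from_counts(words=100, sentences=5, syllables=150), 2))
```

Longer sentences and longer words both raise the grade level, which is why a simplified re-explanation is expected to score lower.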
Affiliation(s)
- Benjamin M. Wright
  - Morsani College of Medicine, University of South Florida, Tampa, Florida, USA
- Michael S. Bodnar
  - Morsani College of Medicine, University of South Florida, Tampa, Florida, USA
- Andrew D. Moore
  - Department of Orthopaedic Surgery, University of South Florida, Tampa, Florida, USA
- Meghan C. Maseda
  - Department of Orthopaedic Surgery, University of South Florida, Tampa, Florida, USA
- Michael P. Kucharik
  - Department of Orthopaedic Surgery, University of South Florida, Tampa, Florida, USA
- Connor C. Diaz
  - Department of Orthopaedic Surgery, University of South Florida, Tampa, Florida, USA
- Christian M. Schmidt
  - Department of Orthopaedic Surgery, University of South Florida, Tampa, Florida, USA
- Hassan R. Mir
  - Orthopaedic Trauma Service, Florida Orthopedic Institute, Tampa, Florida, USA
3
McDonough C, Li YC, Vangeepuram N, Liu B, Pandey G. Facilitating youth diabetes studies with the most comprehensive epidemiological dataset available through a public web portal. medRxiv 2023:2023.08.02.23293517. [PMID: 37577465 PMCID: PMC10418570 DOI: 10.1101/2023.08.02.23293517] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/15/2023]
Abstract
The prevalence of type 2 diabetes mellitus (DM) and prediabetes (preDM) is rapidly increasing among youth, posing significant health and economic consequences. To address this growing concern, we created the most comprehensive youth-focused diabetes dataset to date derived from National Health and Nutrition Examination Survey (NHANES) data from 1999 to 2018. The dataset, consisting of 15,149 youth aged 12 to 19 years, encompasses preDM/DM relevant variables from sociodemographic, health status, diet, and other lifestyle behavior domains. An interactive web portal, POND (Prediabetes/diabetes in youth ONline Dashboard), was developed to provide public access to the dataset, allowing users to explore variables potentially associated with youth preDM/DM. Leveraging statistical and machine learning methods, we conducted two case studies, revealing established and lesser-known variables linked to youth preDM/DM. This dataset and portal can facilitate future studies to inform prevention and management strategies for youth prediabetes and diabetes.
Affiliation(s)
- Catherine McDonough
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Yan Chak Li
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Nita Vangeepuram
  - Department of Pediatrics, Icahn School of Medicine at Mount Sinai, New York, NY, USA
  - Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Bian Liu
  - Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Gaurav Pandey
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
4
Sieberts SK, Borzymowski H, Guan Y, Huang Y, Matzner A, Page A, Bar-Gad I, Beaulieu-Jones B, El-Hanani Y, Goschenhofer J, Javidnia M, Keller MS, Li YC, Saqib M, Smith G, Stanescu A, Venuto CS, Zielinski R, Jayaraman A, Evers LJW, Foschini L, Mariakakis A, Pandey G, Shawen N, Synder P, Omberg L. Developing better digital health measures of Parkinson's disease using free living data and a crowdsourced data analysis challenge. PLOS Digital Health 2023; 2:e0000208. [PMID: 36976789 PMCID: PMC10047543 DOI: 10.1371/journal.pdig.0000208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2022] [Accepted: 02/07/2023] [Indexed: 03/29/2023]
Abstract
One of the promising opportunities of digital health is its potential to lead to more holistic understandings of diseases by interacting with the daily life of patients and through the collection of large amounts of real-world data. Validating and benchmarking indicators of disease severity in the home setting is difficult, however, given the large number of confounders present in the real world and the challenges in collecting ground truth data in the home. Here we leverage two datasets collected from patients with Parkinson's disease, which couples continuous wrist-worn accelerometer data with frequent symptom reports in the home setting, to develop digital biomarkers of symptom severity. Using these data, we performed a public benchmarking challenge in which participants were asked to build measures of severity across 3 symptoms (on/off medication, dyskinesia, and tremor). 42 teams participated and performance was improved over baseline models for each subchallenge. Additional ensemble modeling across submissions further improved performance, and the top models validated in a subset of patients whose symptoms were observed and rated by trained clinicians.
Affiliation(s)
- Yuanfang Guan
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Yidi Huang
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, United States of America
- Ayala Matzner
  - Gonda Brain Research Center, Bar Ilan University, Ramat Gan, Israel
- Alex Page
  - Center for Health + Technology, University of Rochester Medical Center, Rochester, New York, United States of America
  - Cardiology Division, University of Rochester Medical Center, Rochester, New York, United States of America
- Izhar Bar-Gad
  - Gonda Brain Research Center, Bar Ilan University, Ramat Gan, Israel
- Brett Beaulieu-Jones
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, United States of America
  - Department of Neurology, Brigham and Women’s Hospital, Boston, Massachusetts, United States of America
- Yuval El-Hanani
  - Gonda Brain Research Center, Bar Ilan University, Ramat Gan, Israel
- Monica Javidnia
  - Center for Health + Technology, University of Rochester Medical Center, Rochester, New York, United States of America
  - Department of Neurology, University of Rochester, Rochester, New York, United States of America
- Mark S. Keller
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, United States of America
- Yan-chak Li
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Mohammed Saqib
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, United States of America
- Greta Smith
  - Center for Health + Technology, University of Rochester Medical Center, Rochester, New York, United States of America
  - Department of Neurology, University of Rochester, Rochester, New York, United States of America
- Ana Stanescu
  - Department of Computing and Mathematics, University of West Georgia, Carrollton, Georgia, United States of America
- Charles S. Venuto
  - Center for Health + Technology, University of Rochester Medical Center, Rochester, New York, United States of America
  - Department of Neurology, University of Rochester, Rochester, New York, United States of America
- Robert Zielinski
  - Center for Health + Technology, University of Rochester Medical Center, Rochester, New York, United States of America
  - Department of Neurology, University of Rochester, Rochester, New York, United States of America
- Arun Jayaraman
  - Center for Rehabilitation Technologies & Outcomes Research, Shirley Ryan AbilityLab, Chicago, Illinois, United States of America
- Luc J. W. Evers
  - Donders Institute for Brain, Cognition and Behaviour, Department of Neurology, Radboud University Medical Center, Nijmegen, the Netherlands
  - Institute for Computing and Information Sciences, Radboud University, Nijmegen, the Netherlands
- Luca Foschini
  - Evidation Health, Santa Barbara, California, United States of America
- Alex Mariakakis
  - Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Gaurav Pandey
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Nicholas Shawen
  - Center for Rehabilitation Technologies & Outcomes Research, Shirley Ryan AbilityLab, Chicago, Illinois, United States of America
  - Medical Scientist Training Program, Northwestern University Feinberg School of Medicine, Chicago, Illinois, United States of America
- Phil Synder
  - Sage Bionetworks, Seattle, Washington, United States of America
- Larsson Omberg
  - Sage Bionetworks, Seattle, Washington, United States of America
5
Li YC, Wang L, Law JN, Murali TM, Pandey G. Integrating multimodal data through interpretable heterogeneous ensembles. Bioinformatics Advances 2022; 2:vbac065. [PMID: 36158455 PMCID: PMC9495448 DOI: 10.1093/bioadv/vbac065] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/26/2022] [Revised: 09/01/2022] [Accepted: 09/10/2022] [Indexed: 01/27/2023]
Abstract
Motivation Integrating multimodal data represents an effective approach to predicting biomedical characteristics, such as protein functions and disease outcomes. However, existing data integration approaches do not sufficiently address the heterogeneous semantics of multimodal data. In particular, early and intermediate approaches that rely on a uniform integrated representation reinforce the consensus among the modalities but may lose exclusive local information. The alternative late integration approach that can address this challenge has not been systematically studied for biomedical problems. Results We propose Ensemble Integration (EI) as a novel systematic implementation of the late integration approach. EI infers local predictive models from the individual data modalities using appropriate algorithms and uses heterogeneous ensemble algorithms to integrate these local models into a global predictive model. We also propose a novel interpretation method for EI models. We tested EI on the problems of predicting protein function from multimodal STRING data and mortality due to coronavirus disease 2019 (COVID-19) from multimodal data in electronic health records. We found that EI accomplished its goal of producing significantly more accurate predictions than each individual modality. It also performed better than several established early integration methods for each of these problems. The interpretation of a representative EI model for COVID-19 mortality prediction identified several disease-relevant features, such as laboratory test (blood urea nitrogen and calcium) and vital sign measurements (minimum oxygen saturation) and demographics (age). These results demonstrated the effectiveness of the EI framework for biomedical data integration and predictive modeling. Availability and implementation Code and data are available at https://github.com/GauravPandeyLab/ensemble_integration. 
Supplementary information Supplementary data are available at Bioinformatics Advances online.
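The late-integration idea described in this abstract (fit one local model per modality, then combine their outputs into a global prediction) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation, which is available at the GitHub URL given in the abstract: the nearest-centroid local model, the plain mean aggregation standing in for the heterogeneous ensemble step, the function names, and the toy "labs" and "vitals" values are all hypothetical.

```python
import math


def fit_centroid_model(X, y):
    """Local model for one modality: the centroid of each class (toy stand-in)."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    pos = centroid([x for x, label in zip(X, y) if label == 1])
    neg = centroid([x for x, label in zip(X, y) if label == 0])
    return pos, neg


def predict_score(model, x):
    """Score in (0, 1); higher means x lies closer to the positive centroid."""
    pos, neg = model
    return 1.0 / (1.0 + math.exp(math.dist(x, pos) - math.dist(x, neg)))


def fit_late_integration(modalities, y):
    """Fit one local model per modality, each on its own feature space."""
    return [fit_centroid_model(X, y) for X in modalities]


def predict_late_integration(models, sample_by_modality):
    """Global prediction: mean of the local scores (assumed aggregation rule)."""
    scores = [predict_score(m, x) for m, x in zip(models, sample_by_modality)]
    return sum(scores) / len(scores)


# Hypothetical toy data: two modalities describing the same four patients.
labs = [[1.0, 2.0], [1.2, 1.8], [3.0, 4.0], [3.2, 4.1]]
vitals = [[90.0], [91.0], [80.0], [79.0]]
labels = [0, 0, 1, 1]

models = fit_late_integration([labs, vitals], labels)
high = predict_late_integration(models, [[3.1, 4.0], [80.0]])
low = predict_late_integration(models, [[1.1, 1.9], [90.5]])
print(f"high-risk-like sample: {high:.3f}, low-risk-like sample: {low:.3f}")
```

Because each modality keeps its own model, information that exists in only one modality is not averaged away before modelling, which is the motivation the abstract gives for late integration.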
Affiliation(s)
- Yan Chak Li
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Linhua Wang
  - Graduate Program in Quantitative and Computational Biosciences, Baylor College of Medicine, Houston, TX 77030, USA
- Jeffrey N Law
  - Biosciences Center, National Renewable Energy Laboratory, Golden, CO 80401, USA
- T M Murali
  - Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA
6
Li YC, Wang L, Law JN, Murali TM, Pandey G. Integrating multimodal data through interpretable heterogeneous ensembles. bioRxiv 2022:2020.05.29.123497. [PMID: 35923321 PMCID: PMC9347276 DOI: 10.1101/2020.05.29.123497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
Abstract
Motivation Integrating multimodal data represents an effective approach to predicting biomedical characteristics, such as protein functions and disease outcomes. However, existing data integration approaches do not sufficiently address the heterogeneous semantics of multimodal data. In particular, early and intermediate approaches that rely on a uniform integrated representation reinforce the consensus among the modalities, but may lose exclusive local information. The alternative late integration approach that can address this challenge has not been systematically studied for biomedical problems. Results We propose Ensemble Integration (EI) as a novel systematic implementation of the late integration approach. EI infers local predictive models from the individual data modalities using appropriate algorithms, and uses effective heterogeneous ensemble algorithms to integrate these local models into a global predictive model. We also propose a novel interpretation method for EI models. We tested EI on the problems of predicting protein function from multimodal STRING data, and mortality due to COVID-19 from multimodal data in electronic health records. We found that EI accomplished its goal of producing significantly more accurate predictions than each individual modality. It also performed better than several established early integration methods for each of these problems. The interpretation of a representative EI model for COVID-19 mortality prediction identified several disease-relevant features, such as laboratory test (blood urea nitrogen (BUN) and calcium) and vital sign measurements (minimum oxygen saturation) and demographics (age). These results demonstrated the effectiveness of the EI framework for biomedical data integration and predictive modeling. Availability Code and data are available at https://github.com/GauravPandeyLab/ensemble_integration . Contact gaurav.pandey@mssm.edu.
Affiliation(s)
- Yan Chak Li
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Linhua Wang
  - Graduate Program in Quantitative and Computational Biosciences, Baylor College of Medicine, Houston, Texas, USA
- Jeffrey N. Law
  - National Renewable Energy Laboratory, Golden, Colorado, USA
- T. M. Murali
  - Department of Computer Science, Virginia Tech, Blacksburg, Virginia, USA
- Gaurav Pandey
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
7
Yan Y, Schaffter T, Bergquist T, Yu T, Prosser J, Aydin Z, Jabeer A, Brugere I, Gao J, Chen G, Causey J, Yao Y, Bryson K, Long DR, Jarvik JG, Lee CI, Wilcox A, Guinney J, Mooney S. A Continuously Benchmarked and Crowdsourced Challenge for Rapid Development and Evaluation of Models to Predict COVID-19 Diagnosis and Hospitalization. JAMA Netw Open 2021; 4:e2124946. [PMID: 34633425 PMCID: PMC8506231 DOI: 10.1001/jamanetworkopen.2021.24946] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/21/2021] [Accepted: 07/08/2021] [Indexed: 01/28/2023] Open
Abstract
Importance Machine learning could be used to predict the likelihood of diagnosis and severity of illness. Lack of COVID-19 patient data has hindered the data science community in developing models to aid in the response to the pandemic. Objectives To describe the rapid development and evaluation of clinical algorithms to predict COVID-19 diagnosis and hospitalization using patient data by citizen scientists, provide an unbiased assessment of model performance, and benchmark model performance on subgroups. Design, Setting, and Participants This diagnostic and prognostic study operated a continuous, crowdsourced challenge using a model-to-data approach to securely enable the use of regularly updated COVID-19 patient data from the University of Washington by participants from May 6 to December 23, 2020. A postchallenge analysis was conducted from December 24, 2020, to April 7, 2021, to assess the generalizability of models on the cumulative data set as well as subgroups stratified by age, sex, race, and time of COVID-19 test. By December 23, 2020, this challenge engaged 482 participants from 90 teams and 7 countries. Main Outcomes and Measures Machine learning algorithms used patient data and output a score that represented the probability of patients receiving a positive COVID-19 test result or being hospitalized within 21 days after receiving a positive COVID-19 test result. Algorithms were evaluated using area under the receiver operating characteristic curve (AUROC) and area under the precision recall curve (AUPRC) scores. Ensemble models aggregating models from the top challenge teams were developed and evaluated. Results In the analysis using the cumulative data set, the best performance for COVID-19 diagnosis prediction was an AUROC of 0.776 (95% CI, 0.775-0.777) and an AUPRC of 0.297, and for hospitalization prediction, an AUROC of 0.796 (95% CI, 0.794-0.798) and an AUPRC of 0.188. 
Analysis on top models submitting to the challenge showed consistently better model performance on the female group than the male group. Among all age groups, the best performance was obtained for the 25- to 49-year age group, and the worst performance was obtained for the group aged 17 years or younger. Conclusions and Relevance In this diagnostic and prognostic study, models submitted by citizen scientists achieved high performance for the prediction of COVID-19 testing and hospitalization outcomes. Evaluation of challenge models on demographic subgroups and prospective data revealed performance discrepancies, providing insights into the potential bias and limitations in the models.
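The two evaluation metrics named in this abstract can be computed directly from labels and scores. The sketch below is illustrative only: AUROC is computed as the pairwise win rate of positives over negatives, and AUPRC is approximated by average precision, which is one common estimator and is an assumption here since the abstract does not say how the challenge computed it. The function names and the four-sample example are hypothetical.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values taken at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical example: two positives and two negatives.
y_true = [1, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.1]
print(auroc(y_true, y_score), average_precision(y_true, y_score))
```

AUPRC depends on how rare the positive class is, which is one reason a model can show a moderate AUROC alongside a much lower AUPRC, as in the figures reported above.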
Affiliation(s)
- Yao Yan
  - Sage Bionetworks, Seattle, Washington
  - Molecular Engineering and Sciences Institute, University of Washington, Seattle
- Timothy Bergquist
  - Sage Bionetworks, Seattle, Washington
  - Department of Biomedical Informatics and Medical Education, University of Washington, Seattle
- Thomas Yu
  - Sage Bionetworks, Seattle, Washington
- Justin Prosser
  - Institute of Translational Health Sciences, University of Washington, Seattle
- Zafer Aydin
  - Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, Turkey
- Amhar Jabeer
  - Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, Turkey
- Ivan Brugere
  - Department of Computer Science, University of Illinois at Chicago, Chicago
- Jifan Gao
  - Department of Biostatistics and Medical Informatics, University of Wisconsin–Madison, Madison
- Guanhua Chen
  - Department of Biostatistics and Medical Informatics, University of Wisconsin–Madison, Madison
- Jason Causey
  - Computer Science Department, College of Engineering and Computer Science, Arkansas State University, Jonesboro
  - Arkansas AI-Campus, Center for No-Boundary Thinking, Arkansas State University, Jonesboro
- Yuxin Yao
  - Department of Computer Science, University College London, London, United Kingdom
- Kevin Bryson
  - Department of Computer Science, University College London, London, United Kingdom
- Dustin R. Long
  - Division of Critical Care Medicine, Department of Anesthesiology and Pain Medicine, University of Washington, Seattle
- Jeffrey G. Jarvik
  - The University of Washington Clinical Learning, Evidence And Research Center for Musculoskeletal Disorders, Seattle
  - Department of Radiology, University of Washington School of Medicine, Seattle
- Christoph I. Lee
  - Department of Radiology, University of Washington School of Medicine, Seattle
- Adam Wilcox
  - Department of Biomedical Informatics and Medical Education, University of Washington, Seattle
- Sean Mooney
  - Department of Biomedical Informatics and Medical Education, University of Washington, Seattle
8
Moro G, Masseroli M. Gene function finding through cross-organism ensemble learning. BioData Min 2021; 14:14. [PMID: 33579334 PMCID: PMC7879670 DOI: 10.1186/s13040-021-00239-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/15/2020] [Accepted: 01/10/2021] [Indexed: 11/12/2022] Open
Abstract
Background Structured biological information about genes and proteins is a valuable resource to improve discovery and understanding of complex biological processes via machine learning algorithms. Gene Ontology (GO) controlled annotations describe, in a structured form, features and functions of genes and proteins of many organisms. However, such valuable annotations are not always reliable and sometimes are incomplete, especially for rarely studied organisms. Here, we present GeFF (Gene Function Finder), a novel cross-organism ensemble learning method able to reliably predict new GO annotations of a target organism from GO annotations of another source organism evolutionarily related and better studied. Results Using a supervised method, GeFF predicts unknown annotations from random perturbations of existing annotations. The perturbation consists in randomly deleting a fraction of known annotations in order to produce a reduced annotation set. The key idea is to train a supervised machine learning algorithm with the reduced annotation set to predict, namely to rebuild, the original annotations. The resulting prediction model, in addition to accurately rebuilding the original known annotations for an organism from their perturbed version, also effectively predicts new unknown annotations for the organism. Moreover, the prediction model is also able to discover new unknown annotations in different target organisms without retraining. We combined our novel method with different ensemble learning approaches and compared them to each other and to an equivalent single model technique. We tested the method with five different organisms using their GO annotations: Homo sapiens, Mus musculus, Bos taurus, Gallus gallus and Dictyostelium discoideum. The outcomes demonstrate the effectiveness of the cross-organism ensemble approach, which can be customized with a trade-off between the desired number of predicted new annotations and their precision. A Web application to browse both input annotations used and predicted ones, choosing the ensemble prediction method to use, is publicly available at http://tiny.cc/geff/. Conclusions Our novel cross-organism ensemble learning method provides reliable predicted novel gene annotations, i.e., functions, ranked according to an associated likelihood value. They are very valuable both to speed the annotation curation, focusing it on the prioritized new annotations predicted, and to complement known annotations available.
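The perturbation step described in the Results (randomly delete a fraction of known annotations, then train a model to rebuild the originals) can be sketched as follows. This is a minimal illustration of the data preparation only, not GeFF itself: the dictionary layout, the function name `perturb_annotations`, the seed handling, and the placeholder gene and term names are all assumptions.

```python
import random


def perturb_annotations(annotations, fraction, seed=0):
    """Return (reduced annotations, deleted pairs) after removing a random fraction.

    annotations maps a gene name to the set of terms annotated to it.
    """
    rng = random.Random(seed)
    pairs = sorted((gene, term) for gene, terms in annotations.items() for term in terms)
    n_delete = int(len(pairs) * fraction)
    deleted = set(rng.sample(pairs, n_delete))
    reduced = {
        gene: {term for term in terms if (gene, term) not in deleted}
        for gene, terms in annotations.items()
    }
    return reduced, deleted


# Placeholder annotations (hypothetical names, not real GO identifiers).
toy = {
    "gene_a": {"term_1", "term_2", "term_3", "term_4"},
    "gene_b": {"term_1", "term_5", "term_6"},
    "gene_c": {"term_2", "term_6", "term_7"},
}
reduced, deleted = perturb_annotations(toy, fraction=0.3)
print(f"deleted {len(deleted)} of 10 annotations")
```

In a supervised setup of this kind, `reduced` would serve as the model input and `toy` as the target to rebuild; pairs the trained model predicts that are absent from the original set are then candidate new annotations.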
Affiliation(s)
- Gianluca Moro
  - DISI - University of Bologna, Via dell'Università, Cesena (FC), Italy
- Marco Masseroli
  - DEIB, Politecnico di Milano, Piazza L. Da Vinci 32, Milan, 20133, Italy
9
Chetnik K, Petrick L, Pandey G. MetaClean: a machine learning-based classifier for reduced false positive peak detection in untargeted LC-MS metabolomics data. Metabolomics 2020; 16:117. [PMID: 33085002 PMCID: PMC7895495 DOI: 10.1007/s11306-020-01738-3] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 04/09/2020] [Accepted: 10/13/2020] [Indexed: 10/23/2022]
Abstract
INTRODUCTION Despite the availability of several pre-processing software, poor peak integration remains a prevalent problem in untargeted metabolomics data generated using liquid chromatography high-resolution mass spectrometry (LC-MS). As a result, the output of these pre-processing software may retain incorrectly calculated metabolite abundances that can perpetuate in downstream analyses. OBJECTIVES To address this problem, we propose a computational methodology that combines machine learning and peak quality metrics to filter out low quality peaks. METHODS Specifically, we comprehensively and systematically compared the performance of 24 different classifiers generated by combining eight classification algorithms and three sets of peak quality metrics on the task of distinguishing reliably integrated peaks from poorly integrated ones. These classifiers were compared to using a residual standard deviation (RSD) cut-off in pooled quality-control (QC) samples, which aims to remove peaks with analytical error. RESULTS The best performing classifier was found to be a combination of the AdaBoost algorithm and a set of 11 peak quality metrics previously explored in untargeted metabolomics and proteomics studies. As a complementary approach, applying our framework to peaks retained after filtering by 30% RSD across pooled QC samples was able to further distinguish poorly integrated peaks that were not removed from filtering alone. An R implementation of these classifiers and the overall computational approach is available as the MetaClean package at https://CRAN.R-project.org/package=MetaClean . CONCLUSION Our work represents an important step forward in developing an automated tool for filtering out unreliable peak integrations in untargeted LC-MS metabolomics data.
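The RSD cut-off used as the comparison baseline can be sketched in a few lines. This is a hedged illustration, not the MetaClean package API: it assumes RSD is computed as 100 × sample standard deviation / mean of a peak's intensities across pooled QC injections, and that peaks at or below the 30% cut-off are retained. The function names and intensity values are hypothetical.

```python
from statistics import mean, stdev


def rsd_percent(intensities):
    """RSD of one peak across pooled QC samples, as a percentage (assumed definition)."""
    return 100.0 * stdev(intensities) / mean(intensities)


def filter_by_rsd(qc_peaks, cutoff=30.0):
    """Keep peaks whose RSD across pooled QC samples does not exceed the cut-off."""
    return {peak: vals for peak, vals in qc_peaks.items() if rsd_percent(vals) <= cutoff}


# Hypothetical intensities for two peaks measured in three pooled QC injections.
qc_peaks = {
    "peak_stable": [90.0, 100.0, 110.0],
    "peak_noisy": [100.0, 200.0, 20.0],
}
kept = filter_by_rsd(qc_peaks)
print(sorted(kept))
```

A filter like this only measures reproducibility across QC injections, so a consistently mis-integrated peak can still pass it, which is the gap the abstract says the classifier helps close.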
Affiliation(s)
- Kelsey Chetnik
- Department of Genetics and Genomic Sciences and Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Lauren Petrick
- Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
- Institute for Exposomics Research, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
| | - Gaurav Pandey
- Department of Genetics and Genomic Sciences and Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
- Institute for Exposomics Research, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
| |
|
10
|
Schaffter T, Buist DSM, Lee CI, Nikulin Y, Ribli D, Guan Y, Lotter W, Jie Z, Du H, Wang S, Feng J, Feng M, Kim HE, Albiol F, Albiol A, Morrell S, Wojna Z, Ahsen ME, Asif U, Jimeno Yepes A, Yohanandan S, Rabinovici-Cohen S, Yi D, Hoff B, Yu T, Chaibub Neto E, Rubin DL, Lindholm P, Margolies LR, McBride RB, Rothstein JH, Sieh W, Ben-Ari R, Harrer S, Trister A, Friend S, Norman T, Sahiner B, Strand F, Guinney J, Stolovitzky G. Evaluation of Combined Artificial Intelligence and Radiologist Assessment to Interpret Screening Mammograms. JAMA Netw Open 2020; 3:e200265. [PMID: 32119094 PMCID: PMC7052735 DOI: 10.1001/jamanetworkopen.2020.0265] [Citation(s) in RCA: 180] [Impact Index Per Article: 45.0] [Received: 09/12/2019] [Accepted: 12/26/2019] [Indexed: 12/18/2022] Open
Abstract
Importance Mammography screening currently relies on subjective human interpretation. Artificial intelligence (AI) advances could be used to increase mammography screening accuracy by reducing missed cancers and false positives. Objective To evaluate whether AI can overcome human mammography interpretation limitations with a rigorous, unbiased evaluation of machine learning algorithms. Design, Setting, and Participants In this diagnostic accuracy study conducted between September 2016 and November 2017, an international, crowdsourced challenge was hosted to foster AI algorithm development focused on interpreting screening mammography. More than 1100 participants comprising 126 teams from 44 countries participated. Analysis began November 18, 2016. Main Outcomes and Measurements Algorithms used images alone (challenge 1) or combined images, previous examinations (if available), and clinical and demographic risk factor data (challenge 2) and output a score that translated to cancer yes/no within 12 months. Algorithm accuracy for breast cancer detection was evaluated using area under the curve and algorithm specificity compared with radiologists' specificity with radiologists' sensitivity set at 85.9% (United States) and 83.9% (Sweden). An ensemble method aggregating top-performing AI algorithms and radiologists' recall assessment was developed and evaluated. Results Overall, 144 231 screening mammograms from 85 580 US women (952 cancer positive ≤12 months from screening) were used for algorithm training and validation. A second independent validation cohort included 166 578 examinations from 68 008 Swedish women (780 cancer positive). The top-performing algorithm achieved an area under the curve of 0.858 (United States) and 0.903 (Sweden) and 66.2% (United States) and 81.2% (Sweden) specificity at the radiologists' sensitivity, lower than community-practice radiologists' specificity of 90.5% (United States) and 98.5% (Sweden). 
Combining top-performing algorithms and US radiologist assessments resulted in a higher area under the curve of 0.942 and achieved a significantly improved specificity (92.0%) at the same sensitivity. Conclusions and Relevance While no single AI algorithm outperformed radiologists, an ensemble of AI algorithms combined with radiologist assessment in a single-reader screening environment improved overall accuracy. This study underscores the potential of using machine learning methods for enhancing mammography screening interpretation.
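The headline comparison in this abstract is specificity measured at a fixed sensitivity. The sketch below shows one way such a figure can be computed from continuous scores; it is an assumption-laden illustration, not the challenge's evaluation code. The function name `specificity_at_sensitivity`, the rule "predict positive when score >= threshold", and the toy scores are all hypothetical.

```python
def specificity_at_sensitivity(scores, labels, target_sensitivity):
    """Return (threshold, sensitivity, specificity) at the highest threshold
    whose sensitivity reaches the target. A case is called positive when its
    score is greater than or equal to the threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")
    # Lowering the threshold can only raise sensitivity, so scan from the top.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        sensitivity = tp / positives
        if sensitivity >= target_sensitivity:
            tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
            return threshold, sensitivity, tn / negatives
    raise ValueError("target sensitivity must be at most 1.0")


# Toy data: 1 = cancer within the follow-up window, 0 = no cancer.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
result = specificity_at_sensitivity(scores, labels, target_sensitivity=0.75)
print(result)
```

Fixing the operating point at the radiologists' sensitivity, as the study does with 85.9% and 83.9%, makes the algorithm and reader specificities directly comparable.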
Affiliation(s)
| | - Diana S. M. Buist
- Kaiser Permanente Washington Health Research Institute, Seattle, Washington
| | | | | | - Dezső Ribli
- Department of Physics of Complex Systems, ELTE Eötvös Loránd University, Budapest, Hungary
| | - Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, Michigan Medicine, University of Michigan, Ann Arbor
| | | | | | - Hao Du
- National University of Singapore, Singapore
| | - Sijia Wang
- Integrated Health Information Systems Pte Ltd, Singapore
| | - Jiashi Feng
- Department of Electrical and Computer Engineering, National University of Singapore, Singapore
| | | | | | - Francisco Albiol
- Instituto de Física Corpuscular (IFIC), CSIC–Universitat de València, Valencia, Spain
| | - Alberto Albiol
- Universitat Politecnica de Valencia, Valencia, Valenciana, Spain
| | - Stephen Morrell
- Centre for Medical Image Computing, University College London, Bloomsbury, London, United Kingdom
| | | | | | - Umar Asif
- IBM Research Australia, Melbourne, Australia
| | | | | | | | - Darvin Yi
- Stanford University, Stanford, California
| | - Bruce Hoff
- Computational Oncology, Sage Bionetworks, Seattle, Washington
| | - Thomas Yu
- Computational Oncology, Sage Bionetworks, Seattle, Washington
| | | | - Daniel L. Rubin
- Department of Biomedical Data Science, Radiology, and Medicine (Biomedical Informatics), Stanford University, Stanford, California
| | - Peter Lindholm
- Department of Physiology and Pharmacology, Karolinska Institutet, Stockholm, Sweden
| | - Laurie R. Margolies
- Department of Diagnostic, Molecular and Interventional Radiology, Icahn School of Medicine at Mount Sinai, New York, New York
| | - Russell Bailey McBride
- Department of Pathology, Molecular and Cell-Based Medicine, Icahn School of Medicine at Mount Sinai, New York, New York
| | - Joseph H. Rothstein
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York
| | - Weiva Sieh
- Department of Population Health Science and Policy, Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York
| | - Rami Ben-Ari
- IBM Research Haifa, Haifa University Campus, Mount Carmel, Haifa, Israel
| | | | - Andrew Trister
- Fred Hutchinson Cancer Research Center, Seattle, Washington
| | - Stephen Friend
- Computational Oncology, Sage Bionetworks, Seattle, Washington
| | - Thea Norman
- Bill and Melinda Gates Foundation, Seattle, Washington
| | - Berkman Sahiner
- Center for Devices and Radiological Health, Food and Drug Administration, Silver Spring, Maryland
| | - Fredrik Strand
- Department of Oncology-Pathology, Karolinska Institutet, Stockholm, Sweden
- Breast Radiology, Karolinska University Hospital, Stockholm, Sweden
| | - Justin Guinney
- Computational Oncology, Sage Bionetworks, Seattle, Washington
| | - Gustavo Stolovitzky
- IBM Research, Translational Systems Biology and Nanobiotechnology, Thomas J. Watson Research Center, Yorktown Heights, New York
| | | |
|
11
|
Objective risk stratification of prostate cancer using machine learning and radiomics applied to multiparametric magnetic resonance images. Sci Rep 2019; 9:1570. [PMID: 30733585 PMCID: PMC6367324 DOI: 10.1038/s41598-018-38381-x] [Citation(s) in RCA: 52] [Impact Index Per Article: 10.4] [Received: 09/05/2018] [Accepted: 12/27/2018] [Indexed: 12/24/2022] Open
Abstract
Multiparametric magnetic resonance imaging (mpMRI) has become increasingly important for the clinical assessment of prostate cancer (PCa), but its interpretation is generally variable due to its relatively subjective nature. Radiomics and classification methods have shown potential for improving the accuracy and objectivity of mpMRI-based PCa assessment. However, these studies are limited to a small number of classification methods, evaluation using the AUC score only, and a non-rigorous assessment of all possible combinations of radiomics and classification methods. This paper presents a systematic and rigorous framework comprising classification, cross-validation and statistical analyses that was developed to identify the best performing classifier for PCa risk stratification based on mpMRI-derived radiomic features from a sizeable cohort. This classifier performed well in an independent validation set, including performing better than PI-RADS v2 in some aspects, indicating the value of objectively interpreting mpMRI images using radiomics and classification methods for PCa risk assessment.
|
12
|
Fourati S, Talla A, Mahmoudian M, Burkhart JG, Klén R, Henao R, Yu T, Aydın Z, Yeung KY, Ahsen ME, Almugbel R, Jahandideh S, Liang X, Nordling TEM, Shiga M, Stanescu A, Vogel R, Pandey G, Chiu C, McClain MT, Woods CW, Ginsburg GS, Elo LL, Tsalik EL, Mangravite LM, Sieberts SK. A crowdsourced analysis to identify ab initio molecular signatures predictive of susceptibility to viral infection. Nat Commun 2018; 9:4418. [PMID: 30356117 PMCID: PMC6200745 DOI: 10.1038/s41467-018-06735-8] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 04/30/2018] [Accepted: 09/12/2018] [Indexed: 01/17/2023] Open
Abstract
The response to respiratory viruses varies substantially between individuals, and there are currently no known molecular predictors from the early stages of infection. Here we conduct a community-based analysis to determine whether pre- or early post-exposure molecular factors could predict physiologic responses to viral exposure. Using peripheral blood gene expression profiles collected from healthy subjects prior to exposure to one of four respiratory viruses (H1N1, H3N2, Rhinovirus, and RSV), as well as up to 24 h following exposure, we find that it is possible to construct models predictive of symptomatic response using profiles even prior to viral exposure. Analysis of predictive gene features reveals little overlap among models; however, in aggregate, these genes are enriched for common pathways. Heme metabolism, the most significantly enriched pathway, is associated with a higher risk of developing symptoms following viral exposure. This study demonstrates that pre-exposure molecular predictors can be identified and improves our understanding of the mechanisms of response to respiratory viruses.
Affiliation(s)
- Slim Fourati
- Department of Pathology, School of Medicine, Case Western Reserve University, Cleveland, OH, 44106, USA
| | - Aarthi Talla
- Department of Pathology, School of Medicine, Case Western Reserve University, Cleveland, OH, 44106, USA
| | - Mehrad Mahmoudian
- Turku Centre for Biotechnology, University of Turku and Åbo Akademi University, FI-20520, Turku, Finland
- Department of Future Technologies, University of Turku, FI-20014 Turku, Finland
| | - Joshua G Burkhart
- Department of Medical Informatics and Clinical Epidemiology, School of Medicine, Oregon Health & Science University, Portland, OR, 97239, USA
- Laboratory of Evolutionary Genetics, Institute of Ecology and Evolution, University of Oregon, Eugene, OR, 97403, USA
| | - Riku Klén
- Turku Centre for Biotechnology, University of Turku and Åbo Akademi University, FI-20520, Turku, Finland
| | - Ricardo Henao
- Duke Center for Applied Genomics and Precision Medicine, Duke University School of Medicine, Durham, NC, 27710, USA
- Department of Electrical and Computer Engineering, Duke University, Durham, NC, 27708, USA
| | - Thomas Yu
- Sage Bionetworks, Seattle, WA, 98121, USA
| | - Zafer Aydın
- Department of Computer Engineering, Abdullah Gul University, Kayseri, 38080, Turkey
| | - Ka Yee Yeung
- School of Engineering and Technology, University of Washington Tacoma, Tacoma, WA, 98402, USA
| | - Mehmet Eren Ahsen
- Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
| | - Reem Almugbel
- School of Engineering and Technology, University of Washington Tacoma, Tacoma, WA, 98402, USA
| | | | - Xiao Liang
- School of Engineering and Technology, University of Washington Tacoma, Tacoma, WA, 98402, USA
| | - Torbjörn E M Nordling
- Department of Mechanical Engineering, National Cheng Kung University, Tainan, 70101, Taiwan
| | - Motoki Shiga
- Department of Electrical, Electronic and Computer Engineering, Faculty of Engineering, Gifu University, Gifu, 501-1193, Japan
| | - Ana Stanescu
- Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
- Department of Computer Science, University of West Georgia, Carrolton, GA, 30116, USA
| | - Robert Vogel
- Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
- IBM T.J. Watson Research Center, Yorktown Heights, NY, 10598, USA
| | - Gaurav Pandey
- Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
| | - Christopher Chiu
- Section of Infectious Diseases and Immunity, Imperial College London, London, W12 0NN, UK
| | - Micah T McClain
- Duke Center for Applied Genomics and Precision Medicine, Duke University School of Medicine, Durham, NC, 27710, USA
- Medical Service, Durham VA Health Care System, Durham, NC, 27705, USA
- Department of Medicine, Duke University School of Medicine, Durham, NC, 27710, USA
| | - Christopher W Woods
- Duke Center for Applied Genomics and Precision Medicine, Duke University School of Medicine, Durham, NC, 27710, USA
- Medical Service, Durham VA Health Care System, Durham, NC, 27705, USA
- Department of Medicine, Duke University School of Medicine, Durham, NC, 27710, USA
| | - Geoffrey S Ginsburg
- Duke Center for Applied Genomics and Precision Medicine, Duke University School of Medicine, Durham, NC, 27710, USA
- Department of Medicine, Duke University School of Medicine, Durham, NC, 27710, USA
| | - Laura L Elo
- Turku Centre for Biotechnology, University of Turku and Åbo Akademi University, FI-20520, Turku, Finland
| | - Ephraim L Tsalik
- Duke Center for Applied Genomics and Precision Medicine, Duke University School of Medicine, Durham, NC, 27710, USA
- Department of Medicine, Duke University School of Medicine, Durham, NC, 27710, USA
- Emergency Medicine Service, Durham VA Health Care System, Durham, NC, 27705, USA
| | | | | |
|
13
|
Wang L, Law J, Kale SD, Murali TM, Pandey G. Large-scale protein function prediction using heterogeneous ensembles. F1000Res 2018; 7. [PMID: 30450194 PMCID: PMC6221071 DOI: 10.12688/f1000research.16415.1] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Accepted: 09/26/2018] [Indexed: 12/24/2022] Open
Abstract
Heterogeneous ensembles are an effective approach in scenarios where the ideal data type and/or individual predictor are unclear for a given problem. These ensembles have shown promise for protein function prediction (PFP), but their ability to improve PFP at a large scale is unclear. The overall goal of this study is to critically assess this ability of a variety of heterogeneous ensemble methods across a multitude of functional terms, proteins and organisms. Our results show that these methods, especially Stacking using Logistic Regression, indeed produce more accurate predictions for a variety of Gene Ontology terms differing in size and specificity. To enable the application of these methods to other related problems, we have publicly shared the HPC-enabled code underlying this work as LargeGOPred (https://github.com/GauravPandeyLab/LargeGOPred).
Affiliation(s)
- Linhua Wang
- Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
| | - Jeffrey Law
- Genetics, Bioinformatics, and Computational Biology Ph.D. Program, Virginia Polytechnic Institute and State University, Blacksburg, VA, 24061, USA
| | - Shiv D Kale
- Biocomplexity Institute, Virginia Polytechnic Institute and State University, Blacksburg, VA, 24061, USA
| | - T M Murali
- Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, VA, 24061, USA
| | - Gaurav Pandey
- Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
| |
|
14
|
Letter to the Editor concerning the article "Machine learning for prediction of 30-day mortality after ST elevation myocardial infarction". Int J Cardiol 2018; 266:41. [PMID: 29887469 DOI: 10.1016/j.ijcard.2017.11.061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2017] [Accepted: 11/17/2017] [Indexed: 11/21/2022]
|
15
|
Pandey G, Pandey OP, Rogers AJ, Ahsen ME, Hoffman GE, Raby BA, Weiss ST, Schadt EE, Bunyavanich S. A Nasal Brush-based Classifier of Asthma Identified by Machine Learning Analysis of Nasal RNA Sequence Data. Sci Rep 2018; 8:8826. [PMID: 29891868 PMCID: PMC5995932 DOI: 10.1038/s41598-018-27189-4] [Citation(s) in RCA: 41] [Impact Index Per Article: 6.8] [Received: 01/24/2018] [Accepted: 05/25/2018] [Indexed: 12/31/2022] Open
Abstract
Asthma is a common, under-diagnosed disease affecting all ages. We sought to identify a nasal brush-based classifier of mild/moderate asthma. 190 subjects with mild/moderate asthma and controls underwent nasal brushing and RNA sequencing of nasal samples. A machine learning-based pipeline identified an asthma classifier consisting of 90 genes interpreted via an L2-regularized logistic regression classification model. This classifier performed with strong predictive value and sensitivity across eight test sets, including (1) a test set of independent asthmatic and control subjects profiled by RNA sequencing (positive and negative predictive values of 1.00 and 0.96, respectively; AUC of 0.994), (2) two independent case-control cohorts of asthma profiled by microarray, and (3) five cohorts with other respiratory conditions (allergic rhinitis, upper respiratory infection, cystic fibrosis, smoking), where the classifier had a low to zero misclassification rate. Following validation in large, prospective cohorts, this classifier could be developed into a nasal biomarker of asthma.
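The positive and negative predictive values quoted above follow directly from a confusion matrix. The sketch below shows the arithmetic on made-up labels; the function name `predictive_values` and the data are hypothetical and do not reproduce the study's test sets.

```python
def predictive_values(y_true, y_pred):
    """Return (PPV, NPV) for binary labels, where 1 = asthma and 0 = control."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fp == 0 or tn + fn == 0:
        raise ValueError("PPV or NPV is undefined when a predicted class is empty")
    ppv = tp / (tp + fp)  # fraction of predicted cases that are true cases
    npv = tn / (tn + fn)  # fraction of predicted controls that are true controls
    return ppv, npv


# Made-up labels for eight subjects.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 0, 1, 1]
ppv, npv = predictive_values(y_true, y_pred)
print(ppv, npv)
```

Both values depend on the case-to-control ratio of the test set, so they are not directly comparable across cohorts with different prevalence, unlike the AUC.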
Affiliation(s)
- Gaurav Pandey
- Icahn Institute for Genomics and Multiscale Biology and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Om P Pandey
- Icahn Institute for Genomics and Multiscale Biology and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Angela J Rogers
- Division of Pulmonary and Critical Care Medicine, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - Mehmet E Ahsen
- Icahn Institute for Genomics and Multiscale Biology and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Gabriel E Hoffman
- Icahn Institute for Genomics and Multiscale Biology and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Benjamin A Raby
- Channing Division of Network Medicine and Division of Pulmonary and Critical Care Medicine, Brigham & Women's Hospital, and Harvard Medical School, Boston, MA, USA
| | - Scott T Weiss
- Channing Division of Network Medicine and Division of Pulmonary and Critical Care Medicine, Brigham & Women's Hospital, and Harvard Medical School, Boston, MA, USA
| | - Eric E Schadt
- Icahn Institute for Genomics and Multiscale Biology and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Supinda Bunyavanich
- Icahn Institute for Genomics and Multiscale Biology and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
- Division of Allergy & Immunology, Department of Pediatrics, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
| |
|
16
|
Zhao Y, Fu G, Wang J, Guo M, Yu G. Gene function prediction based on Gene Ontology Hierarchy Preserving Hashing. Genomics 2018; 111:334-342. [PMID: 29477548 DOI: 10.1016/j.ygeno.2018.02.008] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 11/20/2017] [Revised: 02/02/2018] [Accepted: 02/16/2018] [Indexed: 12/27/2022]
Abstract
Gene Ontology (GO) uses structured vocabularies (or terms) to describe the molecular functions, biological roles, and cellular locations of gene products in a hierarchical ontology. GO annotations associate genes with GO terms and indicate that the given gene products carry out the biological functions described by the relevant terms. However, predicting correct GO annotations for genes from the massive set of terms defined by GO is a difficult challenge. To address this challenge, we introduce a Gene Ontology Hierarchy Preserving Hashing (HPHash) based semantic method for gene function prediction. HPHash first measures the taxonomic similarity between GO terms. It then uses a hierarchy preserving hashing technique to keep the hierarchical order between GO terms, and to optimize a series of hashing functions that encode massive GO terms via compact binary codes. After that, HPHash uses these hashing functions to project the gene-term association matrix into a low-dimensional one and performs semantic similarity based gene function prediction in the low-dimensional space. Experimental results on three model species (Homo sapiens, Mus musculus and Rattus norvegicus) for interspecies gene function prediction show that HPHash performs better than other related approaches and that it is robust to the number of hash functions. In addition, we also use HPHash as a plugin for BLAST based gene function prediction, where it again significantly improves the prediction performance. The code of HPHash is available at: http://mlda.swu.edu.cn/codes.php?name=HPHash.
Affiliation(s)
- Yingwen Zhao
- College of Computer and Information Science, Southwest University, Chongqing 400715, China
| | - Guangyuan Fu
- College of Computer and Information Science, Southwest University, Chongqing 400715, China
| | - Jun Wang
- College of Computer and Information Science, Southwest University, Chongqing 400715, China
| | - Maozu Guo
- School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, China; Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing 100044, China.
| | - Guoxian Yu
- College of Computer and Information Science, Southwest University, Chongqing 400715, China.
| |
|
17
|
Stanescu A, Pandey G. Learning Parsimonious Ensembles for Unbalanced Computational Genomics Problems. Pacific Symposium on Biocomputing 2017; 22:288-299. [PMID: 27896983 DOI: 10.1142/9789813207813_0028] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 12/17/2022]
Abstract
Prediction problems in biomedical sciences are generally quite difficult, partially due to incomplete knowledge of how the phenomenon of interest is influenced by the variables and measurements used for prediction, as well as a lack of consensus regarding the ideal predictor(s) for specific problems. In these situations, a powerful approach to improving prediction performance is to construct ensembles that combine the outputs of many individual base predictors, which have been successful for many biomedical prediction tasks. Moreover, selecting a parsimonious ensemble can be of even greater value for biomedical sciences, where it is not only important to learn an accurate predictor, but also to interpret what novel knowledge it can provide about the target problem. Ensemble selection is a promising approach for this task because of its ability to select a collectively predictive subset, often a relatively small one, of all input base predictors. One of the most well-known algorithms for ensemble selection, CES (Caruana et al.'s Ensemble Selection), generally performs well in practice, but faces several challenges due to the difficulty of choosing the right values of its various parameters. Since the choices made for these parameters are usually ad-hoc, good performance of CES is difficult to guarantee for a variety of problems or datasets. To address these challenges with CES and other such algorithms, we propose a novel heterogeneous ensemble selection approach based on the paradigm of reinforcement learning (RL), which offers a more systematic and mathematically sound methodology for exploring the many possible combinations of base predictors that can be selected into an ensemble. We develop three RL-based strategies for constructing ensembles and analyze their results on two unbalanced computational genomics problems, namely the prediction of protein function and splice sites in eukaryotic genomes. 
We show that the resultant ensembles are indeed substantially more parsimonious as compared to the full set of base predictors, yet still offer almost the same classification power, especially for larger datasets. The RL ensembles also yield a better combination of parsimony and predictive performance as compared to CES.
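For readers unfamiliar with ensemble selection, the greedy idea behind CES can be sketched as follows. This is a deliberately simplified illustration of forward selection with replacement, not the RL-based approach proposed in the paper and not a faithful CES implementation: it assumes accuracy at a 0.5 cut-off as the selection metric and stops at the first round without improvement. The names `ensemble_selection` and `accuracy`, the `max_rounds` parameter, and the toy predictions are hypothetical.

```python
def accuracy(avg_scores, labels):
    """Fraction of cases classified correctly when scores >= 0.5 are called positive."""
    hits = sum((s >= 0.5) == bool(y) for s, y in zip(avg_scores, labels))
    return hits / len(labels)


def ensemble_selection(base_predictions, labels, max_rounds=10):
    """Greedily add the base predictor (with replacement) whose inclusion most
    improves the accuracy of the averaged ensemble on validation data."""
    selected = []
    running_sum = [0.0] * len(labels)
    best_score = -1.0
    for _ in range(max_rounds):
        best_name, round_best = None, best_score
        size = len(selected) + 1
        for name, preds in base_predictions.items():
            candidate = [(r + p) / size for r, p in zip(running_sum, preds)]
            score = accuracy(candidate, labels)
            if score > round_best:  # require a strict improvement
                best_name, round_best = name, score
        if best_name is None:
            break  # no candidate improves the ensemble any further
        selected.append(best_name)
        chosen = base_predictions[best_name]
        running_sum = [r + p for r, p in zip(running_sum, chosen)]
        best_score = round_best
    return selected, best_score


# Toy validation scores from three hypothetical base predictors.
labels = [1, 0, 1, 0]
base = {
    "m1": [0.9, 0.6, 0.4, 0.2],
    "m2": [0.3, 0.1, 0.9, 0.3],
    "m3": [0.5, 0.5, 0.5, 0.5],
}
selected, score = ensemble_selection(base, labels)
print(selected, score)
```

Even in this toy case the selected ensemble uses two of the three base predictors, which illustrates the parsimony the abstract refers to. The stopping rule, the metric and the round limit are exactly the kind of ad hoc parameter choices the abstract identifies as a weakness of CES.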
Affiliation(s)
- Ana Stanescu
- Icahn Institute for Genomics and Multiscale Biology and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | | |
|
18
|
|
19
|
Madhukar NS, Elemento O, Pandey G. Prediction of Genetic Interactions Using Machine Learning and Network Properties. Front Bioeng Biotechnol 2015; 3:172. [PMID: 26579514 PMCID: PMC4620407 DOI: 10.3389/fbioe.2015.00172] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 09/01/2015] [Accepted: 10/12/2015] [Indexed: 12/04/2022] Open
Abstract
A genetic interaction (GI) is a type of interaction where the effect of one gene is modified by the effect of one or several other genes. These interactions are important for delineating functional relationships among genes and their corresponding proteins, as well as elucidating complex biological processes and diseases. An important type of GI - synthetic sickness or synthetic lethality - involves two or more genes, where the loss of either gene alone has little impact on cell viability, but the combined loss of all genes leads to a severe decrease in fitness (sickness) or cell death (lethality). The identification of GIs is an important problem for it can help delineate pathways, protein complexes, and regulatory dependencies. Synthetic lethal interactions have important clinical and biological significance, such as providing therapeutically exploitable weaknesses in tumors. While near systematic high-content screening for GIs is possible in single cell organisms such as yeast, the systematic discovery of GIs is extremely difficult in mammalian cells. Therefore, there is a great need for computational approaches to reliably predict GIs, including synthetic lethal interactions, in these organisms. Here, we review the state-of-the-art approaches, strategies, and rigorous evaluation methods for learning and predicting GIs, both under general (healthy/standard laboratory) conditions and under specific contexts, such as diseases.
Affiliation(s)
- Neel S Madhukar
- Department of Physiology and Biophysics, Meyer Cancer Center, Institute for Precision Medicine and Institute for Computational Biomedicine, Weill Cornell Medical College, New York, NY, USA; Tri-Institutional Training Program in Computational Biology and Medicine, New York, NY, USA
| | - Olivier Elemento
- Department of Physiology and Biophysics, Meyer Cancer Center, Institute for Precision Medicine and Institute for Computational Biomedicine, Weill Cornell Medical College, New York, NY, USA; Tri-Institutional Training Program in Computational Biology and Medicine, New York, NY, USA
| | - Gaurav Pandey
- Department of Genetics and Genomic Sciences and Graduate School of Biomedical Sciences, Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| |
|