Reference Citation Analysis

For: Whalen S, Pandey OP, Pandey G. Predicting protein function and other biomedical characteristics with heterogeneous ensembles. Methods 2015;93:92-102. [PMID: 26342255 DOI: 10.1016/j.ymeth.2015.08.016] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 05/21/2015] [Revised: 08/03/2015] [Accepted: 08/23/2015] [Indexed: 12/29/2022]

Cited by Other Article(s)

McDonough C, Li YC, Vangeepuram N, Liu B, Pandey G. A Comprehensive Youth Diabetes Epidemiological Data Set and Web Portal: Resource Development and Case Studies. JMIR Public Health Surveill 2024;10:e53330. [PMID: 38666756 DOI: 10.2196/53330] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2023] [Revised: 02/06/2024] [Accepted: 04/26/2024] [Indexed: 05/03/2024]

Abstract

BACKGROUND

The prevalence of type 2 diabetes mellitus (DM) and pre-diabetes mellitus (pre-DM) has been increasing among youth in recent decades in the United States, prompting an urgent need for understanding and identifying their associated risk factors. Such efforts, however, have been hindered by the lack of easily accessible youth pre-DM/DM data.

OBJECTIVE

We aimed to first build a high-quality, comprehensive epidemiological data set focused on youth pre-DM/DM. Subsequently, we aimed to make these data accessible by creating a user-friendly web portal to share them and the corresponding codes. Through this, we hope to address this significant gap and facilitate youth pre-DM/DM research.

METHODS

Building on data from the National Health and Nutrition Examination Survey (NHANES) from 1999 to 2018, we cleaned and harmonized hundreds of variables relevant to pre-DM/DM (fasting plasma glucose level ≥100 mg/dL or glycated hemoglobin ≥5.7%) for youth aged 12-19 years (N=15,149). We identified individual factors associated with pre-DM/DM risk using bivariate statistical analyses and predicted pre-DM/DM status using our Ensemble Integration (EI) framework for multidomain machine learning. We then developed a user-friendly web portal named Prediabetes/diabetes in youth Online Dashboard (POND) to share the data and codes.

RESULTS

We extracted 95 variables potentially relevant to pre-DM/DM risk organized into 4 domains (sociodemographic, health status, diet, and other lifestyle behaviors). The bivariate analyses identified 27 significant correlates of pre-DM/DM (P<.001, Bonferroni adjusted), including race or ethnicity, health insurance, BMI, added sugar intake, and screen time. Among these factors, 16 factors were also identified based on the EI methodology (Fisher P of overlap=7.06×10⁻⁶). In addition to those, the EI approach identified 11 additional predictive variables, including some known (eg, meat and fruit intake and family income) and less recognized factors (eg, number of rooms in homes). The factors identified in both analyses spanned across all 4 of the domains mentioned. These data and results, as well as other exploratory tools, can be accessed on POND.

CONCLUSIONS

Using NHANES data, we built one of the largest public epidemiological data sets for studying youth pre-DM/DM and identified potential risk factors using complementary analytical approaches. Our results align with the multifactorial nature of pre-DM/DM with correlates across several domains. Also, our data-sharing platform, POND, facilitates a wide range of applications to inform future youth pre-DM/DM studies.
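
The Bonferroni adjustment mentioned in the results of this abstract can be illustrated with a short sketch. It assumes the simplest form of the correction (each raw P value multiplied by the number of tests and capped at 1); the function names and the raw P values are hypothetical and are not taken from the study or its code.

```python
from typing import List


def bonferroni_adjust(p_values: List[float]) -> List[float]:
    """Multiply each P value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def significant(p_values: List[float], alpha: float = 0.001) -> List[bool]:
    """Flag tests whose Bonferroni-adjusted P value falls below alpha."""
    return [p_adj < alpha for p_adj in bonferroni_adjust(p_values)]


# Hypothetical raw P values for four variables (not results from the study).
raw = [2e-6, 4e-4, 0.03, 0.6]
print(significant(raw))
```

With four tests, a raw P value of 4e-4 becomes 1.6e-3 after adjustment and no longer passes a 0.001 threshold, which is why the adjusted criterion is stricter than it first appears.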

Wright BM, Bodnar MS, Moore AD, Maseda MC, Kucharik MP, Diaz CC, Schmidt CM, Mir HR. Is ChatGPT a trusted source of information for total hip and knee arthroplasty patients? Bone Jt Open 2024;5:139-146. [PMID: 38354748 PMCID: PMC10867788 DOI: 10.1302/2633-1462.52.bjo-2023-0113.r1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/16/2024]

Abstract

Aims

While internet search engines have been the primary information source for patients' questions, artificial intelligence large language models like ChatGPT are trending towards becoming the new primary source. The purpose of this study was to determine if ChatGPT can answer patient questions about total hip (THA) and knee arthroplasty (TKA) with consistent accuracy, comprehensiveness, and easy readability.

Methods

We posed the 20 most Google-searched questions about THA and TKA, plus ten additional postoperative questions, to ChatGPT. Each question was asked twice to evaluate for consistency in quality. Following each response, we responded with, "Please explain so it is easier to understand," to evaluate ChatGPT's ability to reduce response reading grade level, measured as Flesch-Kincaid Grade Level (FKGL). Five resident physicians rated the 120 responses on 1 to 5 accuracy and comprehensiveness scales. Additionally, they answered a "yes" or "no" question regarding acceptability. Mean scores were calculated for each question, and responses were deemed acceptable if ≥ four raters answered "yes."

Results

The mean accuracy and comprehensiveness scores were 4.26 (95% confidence interval (CI) 4.19 to 4.33) and 3.79 (95% CI 3.69 to 3.89), respectively. Out of all the responses, 59.2% (71/120; 95% CI 50.0% to 67.7%) were acceptable. ChatGPT was consistent when asked the same question twice, giving no significant difference in accuracy (t = 0.821; p = 0.415), comprehensiveness (t = 1.387; p = 0.171), acceptability (χ² = 1.832; p = 0.176), and FKGL (t = 0.264; p = 0.793). There was a significantly lower FKGL (t = 2.204; p = 0.029) for easier responses (11.14; 95% CI 10.57 to 11.71) than original responses (12.15; 95% CI 11.45 to 12.85).

Conclusion

ChatGPT answered THA and TKA patient questions with accuracy comparable to previous reports of websites, with adequate comprehensiveness, but with limited acceptability as the sole information source. ChatGPT has potential for answering patient questions about THA and TKA, but needs improvement.
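
The readability measure used in this study, the Flesch-Kincaid Grade Level, is computed from word, sentence, and syllable counts. The sketch below is a minimal illustration using the standard published formula. The function names and the example counts are hypothetical, syllable counting is left to the caller, and the acceptability rule only mirrors the "at least four of five raters" criterion described in the Methods; it is not the authors' code.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard FKGL: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def is_acceptable(votes: list[bool], min_yes: int = 4) -> bool:
    """A response is acceptable when at least `min_yes` raters answered yes."""
    return sum(votes) >= min_yes


# Hypothetical counts for one response (not data from the study).
grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
print(round(grade, 2))
print(is_acceptable([True, True, True, True, False]))
```

Shorter sentences and fewer syllables per word both lower the grade level, which is the effect the follow-up prompt in the study was probing.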

McDonough C, Li YC, Vangeepuram N, Liu B, Pandey G. Facilitating youth diabetes studies with the most comprehensive epidemiological dataset available through a public web portal. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2023:2023.08.02.23293517. [PMID: 37577465 PMCID: PMC10418570 DOI: 10.1101/2023.08.02.23293517] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/15/2023]

Sieberts SK, Borzymowski H, Guan Y, Huang Y, Matzner A, Page A, Bar-Gad I, Beaulieu-Jones B, El-Hanani Y, Goschenhofer J, Javidnia M, Keller MS, Li YC, Saqib M, Smith G, Stanescu A, Venuto CS, Zielinski R, Jayaraman A, Evers LJW, Foschini L, Mariakakis A, Pandey G, Shawen N, Synder P, Omberg L. Developing better digital health measures of Parkinson's disease using free living data and a crowdsourced data analysis challenge. PLOS DIGITAL HEALTH 2023;2:e0000208. [PMID: 36976789 PMCID: PMC10047543 DOI: 10.1371/journal.pdig.0000208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2022] [Accepted: 02/07/2023] [Indexed: 03/29/2023]

Affiliation(s)

Solveig K. Sieberts: Sage Bionetworks, Seattle, Washington, United States of America
Henryk Borzymowski: Independent researcher
Yuanfang Guan: Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
Yidi Huang: Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, United States of America
Ayala Matzner: Gonda Brain Research Center, Bar Ilan University, Ramat Gan, Israel
Alex Page: Center for Health + Technology, University of Rochester Medical Center, Rochester, New York, United States of America; Cardiology Division, University of Rochester Medical Center, Rochester, New York, United States of America
Izhar Bar-Gad: Gonda Brain Research Center, Bar Ilan University, Ramat Gan, Israel
Brett Beaulieu-Jones: Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, United States of America; Department of Neurology, Brigham and Women’s Hospital, Boston, Massachusetts, United States of America
Yuval El-Hanani: Gonda Brain Research Center, Bar Ilan University, Ramat Gan, Israel
Jann Goschenhofer: Independent researcher
Monica Javidnia: Center for Health + Technology, University of Rochester Medical Center, Rochester, New York, United States of America; Department of Neurology, University of Rochester, Rochester, New York, United States of America
Mark S. Keller: Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, United States of America
Yan-chak Li: Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
Mohammed Saqib: Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, United States of America
Greta Smith: Center for Health + Technology, University of Rochester Medical Center, Rochester, New York, United States of America; Department of Neurology, University of Rochester, Rochester, New York, United States of America
Ana Stanescu: Department of Computing and Mathematics, University of West Georgia, Carrollton, Georgia, United States of America
Charles S. Venuto: Center for Health + Technology, University of Rochester Medical Center, Rochester, New York, United States of America; Department of Neurology, University of Rochester, Rochester, New York, United States of America
Robert Zielinski: Center for Health + Technology, University of Rochester Medical Center, Rochester, New York, United States of America; Department of Neurology, University of Rochester, Rochester, New York, United States of America
the BEAT-PD DREAM Challenge Consortium
Arun Jayaraman: Center for Rehabilitation Technologies & Outcomes Research, Shirley Ryan AbilityLab, Chicago, Illinois, United States of America
Luc J. W. Evers: Donders Institute for Brain, Cognition and Behaviour, Department of Neurology, Radboud University Medical Center, Nijmegen, the Netherlands; Institute for Computing and Information Sciences, Radboud University, Nijmegen, the Netherlands
Luca Foschini: Evidation Health, Santa Barbara, California, United States of America
Alex Mariakakis: Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
Gaurav Pandey: Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
Nicholas Shawen: Center for Rehabilitation Technologies & Outcomes Research, Shirley Ryan AbilityLab, Chicago, Illinois, United States of America; Medical Scientist Training Program, Northwestern University Feinberg School of Medicine, Chicago, Illinois, United States of America
Phil Synder: Sage Bionetworks, Seattle, Washington, United States of America
Larsson Omberg: Sage Bionetworks, Seattle, Washington, United States of America

Li YC, Wang L, Law JN, Murali TM, Pandey G. Integrating multimodal data through interpretable heterogeneous ensembles. BIOINFORMATICS ADVANCES 2022;2:vbac065. [PMID: 36158455 PMCID: PMC9495448 DOI: 10.1093/bioadv/vbac065] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/26/2022] [Revised: 09/01/2022] [Accepted: 09/10/2022] [Indexed: 01/27/2023]

Abstract

Motivation

Integrating multimodal data represents an effective approach to predicting biomedical characteristics, such as protein functions and disease outcomes. However, existing data integration approaches do not sufficiently address the heterogeneous semantics of multimodal data. In particular, early and intermediate approaches that rely on a uniform integrated representation reinforce the consensus among the modalities but may lose exclusive local information. The alternative late integration approach that can address this challenge has not been systematically studied for biomedical problems.

Results

We propose Ensemble Integration (EI) as a novel systematic implementation of the late integration approach. EI infers local predictive models from the individual data modalities using appropriate algorithms and uses heterogeneous ensemble algorithms to integrate these local models into a global predictive model. We also propose a novel interpretation method for EI models. We tested EI on the problems of predicting protein function from multimodal STRING data and mortality due to coronavirus disease 2019 (COVID-19) from multimodal data in electronic health records. We found that EI accomplished its goal of producing significantly more accurate predictions than each individual modality. It also performed better than several established early integration methods for each of these problems. The interpretation of a representative EI model for COVID-19 mortality prediction identified several disease-relevant features, such as laboratory test (blood urea nitrogen and calcium) and vital sign measurements (minimum oxygen saturation) and demographics (age). These results demonstrated the effectiveness of the EI framework for biomedical data integration and predictive modeling.

Availability and implementation

Code and data are available at https://github.com/GauravPandeyLab/ensemble_integration.

Supplementary information

Supplementary data are available at Bioinformatics Advances online.
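
The late integration idea described in this abstract (one local model per modality, combined into a global prediction) can be sketched as follows. This is a minimal illustration only: it uses plain mean aggregation as a stand-in for the heterogeneous ensemble algorithms in EI, and the local models, modality names, and feature values are hypothetical placeholders rather than anything from the authors' repository.

```python
from statistics import mean
from typing import Callable, Dict, List

# A "local model" maps one modality's feature vector to a probability.
LocalModel = Callable[[List[float]], float]


def late_integration_predict(
    local_models: Dict[str, LocalModel],
    sample: Dict[str, List[float]],
) -> float:
    """Average the per-modality probabilities into one global score."""
    scores = [model(sample[name]) for name, model in local_models.items()]
    return mean(scores)


# Hypothetical local models, one per modality (placeholders, not trained models).
local_models = {
    "labs": lambda x: min(1.0, max(0.0, 0.02 * x[0])),
    "vitals": lambda x: min(1.0, max(0.0, 1.0 - x[0] / 100)),
}

sample = {"labs": [30.0], "vitals": [80.0]}
print(late_integration_predict(local_models, sample))
```

Because each modality keeps its own model, information that exists in only one modality is not diluted by a shared representation; only the final scores are combined.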

Li YC, Wang L, Law JN, Murali TM, Pandey G. Integrating multimodal data through interpretable heterogeneous ensembles. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2022:2020.05.29.123497. [PMID: 35923321 PMCID: PMC9347276 DOI: 10.1101/2020.05.29.123497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]

Abstract

Motivation

Integrating multimodal data represents an effective approach to predicting biomedical characteristics, such as protein functions and disease outcomes. However, existing data integration approaches do not sufficiently address the heterogeneous semantics of multimodal data. In particular, early and intermediate approaches that rely on a uniform integrated representation reinforce the consensus among the modalities, but may lose exclusive local information. The alternative late integration approach that can address this challenge has not been systematically studied for biomedical problems.

Results

We propose Ensemble Integration (EI) as a novel systematic implementation of the late integration approach. EI infers local predictive models from the individual data modalities using appropriate algorithms, and uses effective heterogeneous ensemble algorithms to integrate these local models into a global predictive model. We also propose a novel interpretation method for EI models. We tested EI on the problems of predicting protein function from multimodal STRING data, and mortality due to COVID-19 from multimodal data in electronic health records. We found that EI accomplished its goal of producing significantly more accurate predictions than each individual modality. It also performed better than several established early integration methods for each of these problems. The interpretation of a representative EI model for COVID-19 mortality prediction identified several disease-relevant features, such as laboratory test (blood urea nitrogen (BUN) and calcium) and vital sign measurements (minimum oxygen saturation) and demographics (age). These results demonstrated the effectiveness of the EI framework for biomedical data integration and predictive modeling.

Availability

Code and data are available at https://github.com/GauravPandeyLab/ensemble_integration.

Contact

gaurav.pandey@mssm.edu.

Yan Y, Schaffter T, Bergquist T, Yu T, Prosser J, Aydin Z, Jabeer A, Brugere I, Gao J, Chen G, Causey J, Yao Y, Bryson K, Long DR, Jarvik JG, Lee CI, Wilcox A, Guinney J, Mooney S. A Continuously Benchmarked and Crowdsourced Challenge for Rapid Development and Evaluation of Models to Predict COVID-19 Diagnosis and Hospitalization. JAMA Netw Open 2021;4:e2124946. [PMID: 34633425 PMCID: PMC8506231 DOI: 10.1001/jamanetworkopen.2021.24946] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/21/2021] [Accepted: 07/08/2021] [Indexed: 01/28/2023]

Abstract

Importance

Machine learning could be used to predict the likelihood of diagnosis and severity of illness. Lack of COVID-19 patient data has hindered the data science community in developing models to aid in the response to the pandemic.

Objectives

To describe the rapid development and evaluation of clinical algorithms to predict COVID-19 diagnosis and hospitalization using patient data by citizen scientists, provide an unbiased assessment of model performance, and benchmark model performance on subgroups.

Design, Setting, and Participants

This diagnostic and prognostic study operated a continuous, crowdsourced challenge using a model-to-data approach to securely enable the use of regularly updated COVID-19 patient data from the University of Washington by participants from May 6 to December 23, 2020. A postchallenge analysis was conducted from December 24, 2020, to April 7, 2021, to assess the generalizability of models on the cumulative data set as well as subgroups stratified by age, sex, race, and time of COVID-19 test. By December 23, 2020, this challenge engaged 482 participants from 90 teams and 7 countries.

Main Outcomes and Measures

Machine learning algorithms used patient data and output a score that represented the probability of patients receiving a positive COVID-19 test result or being hospitalized within 21 days after receiving a positive COVID-19 test result. Algorithms were evaluated using area under the receiver operating characteristic curve (AUROC) and area under the precision recall curve (AUPRC) scores. Ensemble models aggregating models from the top challenge teams were developed and evaluated.

Results

In the analysis using the cumulative data set, the best performance for COVID-19 diagnosis prediction was an AUROC of 0.776 (95% CI, 0.775-0.777) and an AUPRC of 0.297, and for hospitalization prediction, an AUROC of 0.796 (95% CI, 0.794-0.798) and an AUPRC of 0.188. Analysis on top models submitting to the challenge showed consistently better model performance on the female group than the male group. Among all age groups, the best performance was obtained for the 25- to 49-year age group, and the worst performance was obtained for the group aged 17 years or younger.

Conclusions and Relevance

In this diagnostic and prognostic study, models submitted by citizen scientists achieved high performance for the prediction of COVID-19 testing and hospitalization outcomes. Evaluation of challenge models on demographic subgroups and prospective data revealed performance discrepancies, providing insights into the potential bias and limitations in the models.
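
The AUROC used to evaluate the challenge models has a simple pairwise interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it directly from that definition. It is an illustration on toy values, the function name is hypothetical, and it is not the evaluation code used in the challenge.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and scores (hypothetical, not challenge data).
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
print(auroc(labels, scores))
```

The pairwise loop is quadratic in the number of cases, so it suits small examples; rank-based formulations give the same value more efficiently on large data sets.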

Affiliation(s)

Yao Yan: Sage Bionetworks, Seattle, Washington; Molecular Engineering and Sciences Institute, University of Washington, Seattle
Thomas Schaffter: Sage Bionetworks, Seattle, Washington
Timothy Bergquist: Sage Bionetworks, Seattle, Washington; Department of Biomedical Informatics and Medical Education, University of Washington, Seattle
Thomas Yu: Sage Bionetworks, Seattle, Washington
Justin Prosser: Institute of Translational Health Sciences, University of Washington, Seattle
Zafer Aydin: Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, Turkey
Amhar Jabeer: Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, Turkey
Ivan Brugere: Department of Computer Science, University of Illinois at Chicago, Chicago
Jifan Gao: Department of Biostatistics and Medical Informatics, University of Wisconsin–Madison, Madison
Guanhua Chen: Department of Biostatistics and Medical Informatics, University of Wisconsin–Madison, Madison
Jason Causey: Computer Science Department, College of Engineering and Computer Science, Arkansas State University, Jonesboro; Arkansas AI-Campus, Center for No-Boundary Thinking, Arkansas State University, Jonesboro
Yuxin Yao: Department of Computer Science, University College London, London, United Kingdom
Kevin Bryson: Department of Computer Science, University College London, London, United Kingdom
Dustin R. Long: Division of Critical Care Medicine, Department of Anesthesiology and Pain Medicine, University of Washington, Seattle
Jeffrey G. Jarvik: The University of Washington Clinical Learning, Evidence And Research Center for Musculoskeletal Disorders, Seattle; Department of Radiology, University of Washington School of Medicine, Seattle
Christoph I. Lee: Department of Radiology, University of Washington School of Medicine, Seattle
Adam Wilcox: Department of Biomedical Informatics and Medical Education, University of Washington, Seattle
Justin Guinney: Sage Bionetworks, Seattle, Washington
Sean Mooney: Department of Biomedical Informatics and Medical Education, University of Washington, Seattle

Moro G, Masseroli M. Gene function finding through cross-organism ensemble learning. BioData Min 2021;14:14. [PMID: 33579334 PMCID: PMC7879670 DOI: 10.1186/s13040-021-00239-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/15/2020] [Accepted: 01/10/2021] [Indexed: 11/12/2022]

Abstract

Background

Structured biological information about genes and proteins is a valuable resource to improve discovery and understanding of complex biological processes via machine learning algorithms. Gene Ontology (GO) controlled annotations describe, in a structured form, features and functions of genes and proteins of many organisms. However, such valuable annotations are not always reliable and sometimes are incomplete, especially for rarely studied organisms. Here, we present GeFF (Gene Function Finder), a novel cross-organism ensemble learning method able to reliably predict new GO annotations of a target organism from GO annotations of another source organism evolutionarily related and better studied.

Results

Using a supervised method, GeFF predicts unknown annotations from random perturbations of existing annotations. The perturbation consists in randomly deleting a fraction of known annotations in order to produce a reduced annotation set. The key idea is to train a supervised machine learning algorithm with the reduced annotation set to predict, namely to rebuild, the original annotations. The resulting prediction model, in addition to accurately rebuilding the original known annotations for an organism from their perturbed version, also effectively predicts new unknown annotations for the organism. Moreover, the prediction model is also able to discover new unknown annotations in different target organisms without retraining. We combined our novel method with different ensemble learning approaches and compared them to each other and to an equivalent single model technique. We tested the method with five different organisms using their GO annotations: Homo sapiens, Mus musculus, Bos taurus, Gallus gallus and Dictyostelium discoideum. The outcomes demonstrate the effectiveness of the cross-organism ensemble approach, which can be customized with a trade-off between the desired number of predicted new annotations and their precision. A Web application to browse both input annotations used and predicted ones, choosing the ensemble prediction method to use, is publicly available at http://tiny.cc/geff/.

Conclusions

Our novel cross-organism ensemble learning method provides reliable predicted novel gene annotations, i.e., functions, ranked according to an associated likelihood value. They are very valuable both to speed the annotation curation, focusing it on the prioritized new annotations predicted, and to complement known annotations available.
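
The perturbation step described in the Results of this abstract (randomly deleting a fraction of known annotations to obtain a reduced set that a model then learns to rebuild) can be sketched as below. This is a minimal illustration under stated assumptions: annotations are represented as a gene-to-term-set mapping, the function name, gene names, term identifiers, and deletion fraction are all hypothetical, and the supervised learner itself is left out. It is not the GeFF implementation.

```python
import random
from typing import Dict, Set


def perturb_annotations(
    annotations: Dict[str, Set[str]],
    fraction: float,
    seed: int = 0,
) -> Dict[str, Set[str]]:
    """Return a copy with about `fraction` of the (gene, term) pairs deleted."""
    rng = random.Random(seed)
    pairs = sorted((g, t) for g, terms in annotations.items() for t in terms)
    n_delete = int(len(pairs) * fraction)
    deleted = set(rng.sample(pairs, n_delete))
    return {
        g: {t for t in terms if (g, t) not in deleted}
        for g, terms in annotations.items()
    }


# Hypothetical gene -> GO term annotations (identifiers are placeholders).
original = {
    "geneA": {"GO:1", "GO:2", "GO:3"},
    "geneB": {"GO:2", "GO:4"},
    "geneC": {"GO:1", "GO:4", "GO:5"},
}
reduced = perturb_annotations(original, fraction=0.25)

# A supervised model would take `reduced` as input and `original` as the target.
print(sum(len(t) for t in original.values()), sum(len(t) for t in reduced.values()))
```

The reduced set never contains an annotation that was absent from the original, so any annotation the trained model adds beyond the original set is a candidate new prediction rather than an artifact of the perturbation.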

Chetnik K, Petrick L, Pandey G. MetaClean: a machine learning-based classifier for reduced false positive peak detection in untargeted LC-MS metabolomics data. Metabolomics 2020;16:117. [PMID: 33085002 PMCID: PMC7895495 DOI: 10.1007/s11306-020-01738-3] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 04/09/2020] [Accepted: 10/13/2020] [Indexed: 10/23/2022]

Schaffter T, Buist DSM, Lee CI, Nikulin Y, Ribli D, Guan Y, Lotter W, Jie Z, Du H, Wang S, Feng J, Feng M, Kim HE, Albiol F, Albiol A, Morrell S, Wojna Z, Ahsen ME, Asif U, Jimeno Yepes A, Yohanandan S, Rabinovici-Cohen S, Yi D, Hoff B, Yu T, Chaibub Neto E, Rubin DL, Lindholm P, Margolies LR, McBride RB, Rothstein JH, Sieh W, Ben-Ari R, Harrer S, Trister A, Friend S, Norman T, Sahiner B, Strand F, Guinney J, Stolovitzky G. Evaluation of Combined Artificial Intelligence and Radiologist Assessment to Interpret Screening Mammograms. JAMA Netw Open 2020;3:e200265. [PMID: 32119094 PMCID: PMC7052735 DOI: 10.1001/jamanetworkopen.2020.0265] [Citation(s) in RCA: 180] [Impact Index Per Article: 45.0] [Received: 09/12/2019] [Accepted: 12/26/2019] [Indexed: 12/18/2022]

Abstract

Importance

Mammography screening currently relies on subjective human interpretation. Artificial intelligence (AI) advances could be used to increase mammography screening accuracy by reducing missed cancers and false positives.

Objective

To evaluate whether AI can overcome human mammography interpretation limitations with a rigorous, unbiased evaluation of machine learning algorithms.

Design, Setting, and Participants

In this diagnostic accuracy study conducted between September 2016 and November 2017, an international, crowdsourced challenge was hosted to foster AI algorithm development focused on interpreting screening mammography. More than 1100 participants comprising 126 teams from 44 countries participated. Analysis began November 18, 2016.

Main Outcomes and Measurements

Algorithms used images alone (challenge 1) or combined images, previous examinations (if available), and clinical and demographic risk factor data (challenge 2) and output a score that translated to cancer yes/no within 12 months. Algorithm accuracy for breast cancer detection was evaluated using area under the curve and algorithm specificity compared with radiologists' specificity with radiologists' sensitivity set at 85.9% (United States) and 83.9% (Sweden). An ensemble method aggregating top-performing AI algorithms and radiologists' recall assessment was developed and evaluated.

Results

Overall, 144 231 screening mammograms from 85 580 US women (952 cancer positive ≤12 months from screening) were used for algorithm training and validation. A second independent validation cohort included 166 578 examinations from 68 008 Swedish women (780 cancer positive). The top-performing algorithm achieved an area under the curve of 0.858 (United States) and 0.903 (Sweden) and 66.2% (United States) and 81.2% (Sweden) specificity at the radiologists' sensitivity, lower than community-practice radiologists' specificity of 90.5% (United States) and 98.5% (Sweden). Combining top-performing algorithms and US radiologist assessments resulted in a higher area under the curve of 0.942 and achieved a significantly improved specificity (92.0%) at the same sensitivity.

Conclusions and Relevance

While no single AI algorithm outperformed radiologists, an ensemble of AI algorithms combined with radiologist assessment in a single-reader screening environment improved overall accuracy. This study underscores the potential of using machine learning methods for enhancing mammography screening interpretation.
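
This study compares algorithm specificity with radiologists' specificity at a fixed sensitivity. One way to read that comparison is sketched below: choose the highest score threshold at which the algorithm reaches the target sensitivity, then measure specificity at that threshold. The function name, the thresholding convention (score at or above the threshold counts as a positive call), and the toy values are assumptions for illustration, not the study's evaluation code.

```python
import math
from typing import Sequence


def specificity_at_sensitivity(
    labels: Sequence[int], scores: Sequence[float], target_sensitivity: float
) -> float:
    """Specificity at the highest threshold whose sensitivity reaches the target."""
    if not 0.0 < target_sensitivity <= 1.0:
        raise ValueError("target_sensitivity must be in (0, 1]")
    pos = sorted((s for y, s in zip(labels, scores) if y == 1), reverse=True)
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Number of positives that must be flagged to reach the target sensitivity
    # (the small epsilon guards against floating-point round-up).
    needed = max(1, math.ceil(target_sensitivity * len(pos) - 1e-9))
    threshold = pos[needed - 1]  # positive call when score >= threshold
    true_negatives = sum(1 for s in neg if s < threshold)
    return true_negatives / len(neg)


# Toy labels and scores (hypothetical, not study data).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.5, 0.2, 0.1]
print(specificity_at_sensitivity(labels, scores, target_sensitivity=0.75))
```

Fixing sensitivity first puts the algorithm and the human readers at the same cancer detection rate, so any difference in specificity reflects false positives alone.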

Affiliation(s)

Thomas Schaffter Computational Oncology, Sage Bionetworks, Seattle, Washington
Diana S. M. Buist Kaiser Permanente Washington Health Research Institute, Seattle, Washington
Christoph I. Lee University of Washington School of Medicine, Seattle
Yaroslav Nikulin Therapixel, Paris, France
Dezső Ribli Department of Physics of Complex Systems, ELTE Eötvös Loránd University, Budapest, Hungary
Yuanfang Guan Department of Computational Medicine and Bioinformatics, Michigan Medicine, University of Michigan, Ann Arbor
William Lotter DeepHealth Inc, Cambridge, Massachusetts
Zequn Jie Tencent AI Lab, Shenzhen, China
Hao Du National University of Singapore, Singapore
Sijia Wang Integrated Health Information Systems Pte Ltd, Singapore
Jiashi Feng Department of Electrical and Computer Engineering, National University of Singapore, Singapore
Mengling Feng National University Health System, Singapore
Hyo-Eun Kim Lunit Inc, Seoul, Korea
Francisco Albiol Instituto de Física Corpuscular (IFIC), CSIC–Universitat de València, Valencia, Spain
Alberto Albiol Universitat Politecnica de Valencia, Valencia, Valenciana, Spain
Stephen Morrell Centre for Medical Image Computing, University College London, Bloomsbury, London, United Kingdom
Zbigniew Wojna Tensorflight Inc, Mountain View, California
Mehmet Eren Ahsen University of Illinois at Urbana-Champaign, Urbana
Umar Asif IBM Research Australia, Melbourne, Australia
Antonio Jimeno Yepes IBM Research Australia, Melbourne, Australia
Shivanthan Yohanandan IBM Research Australia, Melbourne, Australia
Simona Rabinovici-Cohen IBM Research Haifa, Haifa University Campus, Mount Carmel, Haifa, Israel
Darvin Yi Stanford University, Stanford, California
Bruce Hoff Computational Oncology, Sage Bionetworks, Seattle, Washington
Thomas Yu Computational Oncology, Sage Bionetworks, Seattle, Washington
Elias Chaibub Neto Computational Oncology, Sage Bionetworks, Seattle, Washington
Daniel L. Rubin Department of Biomedical Data Science, Radiology, and Medicine (Biomedical Informatics), Stanford University, Stanford, California
Peter Lindholm Department of Physiology and Pharmacology, Karolinska Institutet, Stockholm, Sweden
Laurie R. Margolies Department of Diagnostic, Molecular and Interventional Radiology, Icahn School of Medicine at Mount Sinai, New York, New York
Russell Bailey McBride Department of Pathology, Molecular and Cell-Based Medicine, Icahn School of Medicine at Mount Sinai, New York, New York
Joseph H. Rothstein Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York
Weiva Sieh Department of Population Health Science and Policy, Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York
Rami Ben-Ari IBM Research Haifa, Haifa University Campus, Mount Carmel, Haifa, Israel
Stefan Harrer IBM Research Australia, Melbourne, Australia
Andrew Trister Fred Hutchinson Cancer Research Center, Seattle, Washington
Stephen Friend Computational Oncology, Sage Bionetworks, Seattle, Washington
Thea Norman Bill and Melinda Gates Foundation, Seattle, Washington
Berkman Sahiner Center for Devices and Radiological Health, Food and Drug Administration, Silver Spring, Maryland
Fredrik Strand Department of Oncology-Pathology, Karolinska Institutet, Stockholm, Sweden; Breast Radiology, Karolinska University Hospital, Stockholm, Sweden
Justin Guinney Computational Oncology, Sage Bionetworks, Seattle, Washington
Gustavo Stolovitzky IBM Research, Translational Systems Biology and Nanobiotechnology, Thomas J. Watson Research Center, Yorktown Heights, New York
and the DM DREAM Consortium

Objective risk stratification of prostate cancer using machine learning and radiomics applied to multiparametric magnetic resonance images. Sci Rep 2019;9:1570. [PMID: 30733585 PMCID: PMC6367324 DOI: 10.1038/s41598-018-38381-x] [Citation(s) in RCA: 52] [Impact Index Per Article: 10.4] [Received: 09/05/2018] [Accepted: 12/27/2018] [Indexed: 12/24/2022] Open

Fourati S, Talla A, Mahmoudian M, Burkhart JG, Klén R, Henao R, Yu T, Aydın Z, Yeung KY, Ahsen ME, Almugbel R, Jahandideh S, Liang X, Nordling TEM, Shiga M, Stanescu A, Vogel R, Pandey G, Chiu C, McClain MT, Woods CW, Ginsburg GS, Elo LL, Tsalik EL, Mangravite LM, Sieberts SK. A crowdsourced analysis to identify ab initio molecular signatures predictive of susceptibility to viral infection. Nat Commun 2018;9:4418. [PMID: 30356117 PMCID: PMC6200745 DOI: 10.1038/s41467-018-06735-8] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 04/30/2018] [Accepted: 09/12/2018] [Indexed: 01/17/2023] Open

Affiliation(s)

Slim Fourati Department of Pathology, School of Medicine, Case Western Reserve University, Cleveland, OH, 44106, USA
Aarthi Talla Department of Pathology, School of Medicine, Case Western Reserve University, Cleveland, OH, 44106, USA
Mehrad Mahmoudian Turku Centre for Biotechnology, University of Turku and Åbo Akademi University, FI-20520, Turku, Finland; Department of Future Technologies, University of Turku, FI-20014 Turku, Finland
Joshua G Burkhart Department of Medical Informatics and Clinical Epidemiology, School of Medicine, Oregon Health & Science University, Portland, OR, 97239, USA; Laboratory of Evolutionary Genetics, Institute of Ecology and Evolution, University of Oregon, Eugene, OR, 97403, USA
Riku Klén Turku Centre for Biotechnology, University of Turku and Åbo Akademi University, FI-20520, Turku, Finland
Ricardo Henao Duke Center for Applied Genomics and Precision Medicine, Duke University School of Medicine, Durham, NC, 27710, USA; Department of Electrical and Computer Engineering, Duke University, Durham, NC, 27708, USA
Thomas Yu Sage Bionetworks, Seattle, WA, 98121, USA
Zafer Aydın Department of Computer Engineering, Abdullah Gul University, Kayseri, 38080, Turkey
Ka Yee Yeung School of Engineering and Technology, University of Washington Tacoma, Tacoma, WA, 98402, USA
Mehmet Eren Ahsen Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
Reem Almugbel School of Engineering and Technology, University of Washington Tacoma, Tacoma, WA, 98402, USA
Samad Jahandideh Origent Data Sciences, Inc., Vienna, VA, 22182, USA
Xiao Liang School of Engineering and Technology, University of Washington Tacoma, Tacoma, WA, 98402, USA
Torbjörn E M Nordling Department of Mechanical Engineering, National Cheng Kung University, Tainan, 70101, Taiwan
Motoki Shiga Department of Electrical, Electronic and Computer Engineering, Faculty of Engineering, Gifu University, Gifu, 501-1193, Japan
Ana Stanescu Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA; Department of Computer Science, University of West Georgia, Carrollton, GA, 30116, USA
Robert Vogel Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA; IBM T.J. Watson Research Center, Yorktown Heights, NY, 10598, USA
Gaurav Pandey Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
Christopher Chiu Section of Infectious Diseases and Immunity, Imperial College London, London, W12 0NN, UK
Micah T McClain Duke Center for Applied Genomics and Precision Medicine, Duke University School of Medicine, Durham, NC, 27710, USA; Medical Service, Durham VA Health Care System, Durham, NC, 27705, USA; Department of Medicine, Duke University School of Medicine, Durham, NC, 27710, USA
Christopher W Woods Duke Center for Applied Genomics and Precision Medicine, Duke University School of Medicine, Durham, NC, 27710, USA; Medical Service, Durham VA Health Care System, Durham, NC, 27705, USA; Department of Medicine, Duke University School of Medicine, Durham, NC, 27710, USA
Geoffrey S Ginsburg Duke Center for Applied Genomics and Precision Medicine, Duke University School of Medicine, Durham, NC, 27710, USA; Department of Medicine, Duke University School of Medicine, Durham, NC, 27710, USA
Laura L Elo Turku Centre for Biotechnology, University of Turku and Åbo Akademi University, FI-20520, Turku, Finland
Ephraim L Tsalik Duke Center for Applied Genomics and Precision Medicine, Duke University School of Medicine, Durham, NC, 27710, USA; Department of Medicine, Duke University School of Medicine, Durham, NC, 27710, USA; Emergency Medicine Service, Durham VA Health Care System, Durham, NC, 27705, USA
Lara M Mangravite Sage Bionetworks, Seattle, WA, 98121, USA
Solveig K Sieberts Sage Bionetworks, Seattle, WA, 98121, USA

Wang L, Law J, Kale SD, Murali TM, Pandey G. Large-scale protein function prediction using heterogeneous ensembles. F1000Res 2018;7. [PMID: 30450194 PMCID: PMC6221071 DOI: 10.12688/f1000research.16415.1] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Accepted: 09/26/2018] [Indexed: 12/24/2022] Open

Letter to the Editor concerning the article "Machine learning for prediction of 30-day mortality after ST elevation myocardial infarction". Int J Cardiol 2018;266:41. [PMID: 29887469 DOI: 10.1016/j.ijcard.2017.11.061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2017] [Accepted: 11/17/2017] [Indexed: 11/21/2022]

Pandey G, Pandey OP, Rogers AJ, Ahsen ME, Hoffman GE, Raby BA, Weiss ST, Schadt EE, Bunyavanich S. A Nasal Brush-based Classifier of Asthma Identified by Machine Learning Analysis of Nasal RNA Sequence Data. Sci Rep 2018;8:8826. [PMID: 29891868 PMCID: PMC5995932 DOI: 10.1038/s41598-018-27189-4] [Citation(s) in RCA: 41] [Impact Index Per Article: 6.8] [Received: 01/24/2018] [Accepted: 05/25/2018] [Indexed: 12/31/2022] Open

Zhao Y, Fu G, Wang J, Guo M, Yu G. Gene function prediction based on Gene Ontology Hierarchy Preserving Hashing. Genomics 2018;111:334-342. [PMID: 29477548 DOI: 10.1016/j.ygeno.2018.02.008] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 11/20/2017] [Revised: 02/02/2018] [Accepted: 02/16/2018] [Indexed: 12/27/2022]

Stanescu A, Pandey G. Learning Parsimonious Ensembles for Unbalanced Computational Genomics Problems. Pacific Symposium on Biocomputing 2017;22:288-299. [PMID: 27896983 DOI: 10.1142/9789813207813_0028] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 12/17/2022]

Abstract

Prediction problems in biomedical sciences are generally quite difficult, partially due to incomplete knowledge of how the phenomenon of interest is influenced by the variables and measurements used for prediction, as well as a lack of consensus regarding the ideal predictor(s) for specific problems. In these situations, a powerful approach to improving prediction performance is to construct ensembles that combine the outputs of many individual base predictors, which have been successful for many biomedical prediction tasks. Moreover, selecting a parsimonious ensemble can be of even greater value for biomedical sciences, where it is not only important to learn an accurate predictor, but also to interpret what novel knowledge it can provide about the target problem. Ensemble selection is a promising approach for this task because of its ability to select a collectively predictive subset, often a relatively small one, of all input base predictors. One of the most well-known algorithms for ensemble selection, CES (Caruana et al.'s Ensemble Selection), generally performs well in practice, but faces several challenges due to the difficulty of choosing the right values of its various parameters. Since the choices made for these parameters are usually ad-hoc, good performance of CES is difficult to guarantee for a variety of problems or datasets. To address these challenges with CES and other such algorithms, we propose a novel heterogeneous ensemble selection approach based on the paradigm of reinforcement learning (RL), which offers a more systematic and mathematically sound methodology for exploring the many possible combinations of base predictors that can be selected into an ensemble. We develop three RL-based strategies for constructing ensembles and analyze their results on two unbalanced computational genomics problems, namely the prediction of protein function and splice sites in eukaryotic genomes. 
We show that the resultant ensembles are indeed substantially more parsimonious as compared to the full set of base predictors, yet still offer almost the same classification power, especially for larger datasets. The RL ensembles also yield a better combination of parsimony and predictive performance as compared to CES.
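
As a rough illustration of the ensemble selection idea this abstract refers to, the following is a minimal sketch of greedy forward selection with replacement, in the spirit of CES. It is not the RL-based method the abstract proposes, and it is not taken from the paper: the function name, the use of mean squared error as the selection metric, and the stopping rule are all assumptions made for the example.

```python
import numpy as np

def greedy_ensemble_selection(preds, y, max_size=10):
    """Greedy forward selection with replacement (simplified, CES-like sketch).

    preds: array of shape (n_models, n_samples) holding each base predictor's
           validation-set probabilities (hypothetical input layout).
    y:     array of 0/1 validation labels.
    Returns (selected model indices, validation loss of their average).
    """
    preds = np.asarray(preds, dtype=float)
    y = np.asarray(y, dtype=float)
    selected = []
    running_sum = np.zeros_like(y)
    best_loss = np.inf
    for _ in range(max_size):
        # Ensemble average that would result from adding each candidate model.
        candidate_avg = (running_sum + preds) / (len(selected) + 1)
        # Assumed metric: mean squared error against the labels.
        losses = ((candidate_avg - y) ** 2).mean(axis=1)
        k = int(np.argmin(losses))
        # Assumed stopping rule: stop once no candidate improves the loss.
        if losses[k] >= best_loss:
            break
        best_loss = float(losses[k])
        selected.append(k)
        running_sum += preds[k]
    return selected, best_loss

# Toy data: two complementary predictors and one uninformative predictor.
y_val = [0, 1, 1, 0]
base_preds = [
    [0.4, 0.9, 0.6, 0.1],
    [0.1, 0.6, 0.9, 0.4],
    [0.5, 0.5, 0.5, 0.5],
]
chosen, loss = greedy_ensemble_selection(base_preds, y_val)
print(sorted(chosen), loss)
```

On this toy input the two complementary predictors are selected and the uninformative one is left out, which is the kind of parsimony the abstract describes. The metric, ensemble size limit, and stopping rule here stand in for the CES parameters that the abstract says are hard to choose.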

Kihara D. Computational protein function predictions. Methods 2016;93:1-2. [DOI: 10.1016/j.ymeth.2016.01.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 12/16/2022] Open

Madhukar NS, Elemento O, Pandey G. Prediction of Genetic Interactions Using Machine Learning and Network Properties. Front Bioeng Biotechnol 2015;3:172. [PMID: 26579514 PMCID: PMC4620407 DOI: 10.3389/fbioe.2015.00172] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 09/01/2015] [Accepted: 10/12/2015] [Indexed: 12/04/2022] Open