1
Schaekermann M, Spitz T, Pyles M, Cole-Lewis H, Wulczyn E, Pfohl SR, Martin D, Jaroensri R, Keeling G, Liu Y, Farquhar S, Xue Q, Lester J, Hughes C, Strachan P, Tan F, Bui P, Mermel CH, Peng LH, Matias Y, Corrado GS, Webster DR, Virmani S, Semturs C, Liu Y, Horn I, Cameron Chen PH. Health equity assessment of machine learning performance (HEAL): a framework and dermatology AI model case study. EClinicalMedicine 2024; 70:102479. [PMID: 38685924] [PMCID: PMC11056401] [DOI: 10.1016/j.eclinm.2024.102479]
Abstract
Background Artificial intelligence (AI) has repeatedly been shown to encode historical inequities in healthcare. We aimed to develop a framework to quantitatively assess the performance equity of health AI technologies and to illustrate its utility via a case study. Methods Here, we propose a methodology, complementary to existing fairness metrics, to assess whether health AI technologies prioritise performance for patient populations experiencing worse outcomes. We developed the Health Equity Assessment of machine Learning performance (HEAL) framework, designed to quantitatively assess the performance equity of health AI technologies via a four-step interdisciplinary process to understand and quantify domain-specific criteria, and the resulting HEAL metric. As an illustrative case study (analysis conducted between October 2022 and January 2023), we applied the HEAL framework to a dermatology AI model. A set of 5420 teledermatology cases (store-and-forward cases from patients aged 20 years or older, submitted from primary care providers in the USA and skin cancer clinics in Australia), enriched for diversity in age, sex and race/ethnicity, was used to retrospectively evaluate the AI model's HEAL metric, defined as the likelihood that the AI model performs better for subpopulations with worse average health outcomes as compared to others. The likelihood that AI performance was anticorrelated with pre-existing health outcomes was estimated using bootstrap methods as the probability that the negated Spearman's rank correlation coefficient (i.e., "R") was greater than zero. Positive values of R suggest that subpopulations with poorer health outcomes have better AI model performance. Thus, the HEAL metric, defined as p(R > 0), measures how likely the AI technology is to prioritise performance for subpopulations with worse average health outcomes as compared to others (presented as a percentage below).
Health outcomes were quantified as disability-adjusted life years (DALYs) when grouping by sex and age, and years of life lost (YLLs) when grouping by race/ethnicity. AI performance was measured as top-3 agreement with the reference diagnosis from a panel of 3 dermatologists per case. Findings Across all dermatologic conditions, the HEAL metric was 80.5% for prioritising AI performance of racial/ethnic subpopulations based on YLLs, and 92.1% and 0.0% respectively for prioritising AI performance of sex and age subpopulations based on DALYs. Certain dermatologic conditions were significantly associated with greater AI model performance compared to a reference category of less common conditions. For skin cancer conditions, the HEAL metric was 73.8% for prioritising AI performance of age subpopulations based on DALYs. Interpretation Analysis using the proposed HEAL framework showed that the dermatology AI model prioritised performance for race/ethnicity, sex (all conditions) and age (cancer conditions) subpopulations with respect to pre-existing health disparities. More work is needed to investigate ways of promoting equitable AI performance across age for non-cancer conditions and to better understand how AI models can contribute towards improving equity in health outcomes. Funding Google LLC.
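The bootstrap construction of the HEAL metric can be illustrated with a small numerical sketch. All numbers below are synthetic (subgroup burdens, correctness rates and group sizes are invented, not taken from the study), and the sign convention is simplified: R is written directly as the rank correlation between subgroup burden and subgroup accuracy, so R > 0 means better performance for worse-off subgroups, matching the abstract's verbal definition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data for four subgroups (all values invented for illustration):
# per-subgroup health burden (e.g., YLLs; higher = worse pre-existing outcomes)
# and per-case AI correctness (1 = top-3 agreement with the reference diagnosis).
burden = np.array([1200.0, 950.0, 800.0, 600.0])
cases = [rng.binomial(1, p, size=400) for p in (0.88, 0.84, 0.83, 0.80)]

def spearman_r(x, y):
    """Spearman rank correlation (no tie correction; fine for this sketch)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return np.corrcoef(rx, ry)[0, 1]

def heal_metric(burden, cases, n_boot=2000, rng=rng):
    """HEAL = p(R > 0): bootstrap probability that subgroup accuracy is
    rank-correlated with subgroup burden, i.e., that the model performs
    better where pre-existing outcomes are worse."""
    r = np.empty(n_boot)
    for b in range(n_boot):
        # Resample cases within each subgroup, recompute subgroup accuracies.
        acc = np.array([rng.choice(c, size=len(c)).mean() for c in cases])
        r[b] = spearman_r(burden, acc)
    return (r > 0).mean()

print(f"HEAL metric: {heal_metric(burden, cases):.1%}")
```

Because the synthetic accuracies decrease with burden, most bootstrap replicates yield R > 0 and the sketch reports a HEAL metric near 100%.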
Affiliation(s)
- Malcolm Pyles: Advanced Clinical, Deerfield, IL, USA; Department of Dermatology, Cleveland Clinic, Cleveland, OH, USA
- Yuan Liu: Google Health, Mountain View, CA, USA
- Jenna Lester: Advanced Clinical, Deerfield, IL, USA; Department of Dermatology, University of California, San Francisco, CA, USA
- Peggy Bui: Google Health, Mountain View, CA, USA
- Yun Liu: Google Health, Mountain View, CA, USA
- Ivor Horn: Google Health, Mountain View, CA, USA
2
Azizi S, Culp L, Freyberg J, Mustafa B, Baur S, Kornblith S, Chen T, Tomasev N, Mitrović J, Strachan P, Mahdavi SS, Wulczyn E, Babenko B, Walker M, Loh A, Chen PHC, Liu Y, Bavishi P, McKinney SM, Winkens J, Roy AG, Beaver Z, Ryan F, Krogue J, Etemadi M, Telang U, Liu Y, Peng L, Corrado GS, Webster DR, Fleet D, Hinton G, Houlsby N, Karthikesalingam A, Norouzi M, Natarajan V. Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging. Nat Biomed Eng 2023. [PMID: 37291435] [DOI: 10.1038/s41551-023-01049-7]
Abstract
Machine-learning models for medical tasks can match or surpass the performance of clinical experts. However, in settings differing from those of the training dataset, the performance of a model can deteriorate substantially. Here we report a representation-learning strategy for machine-learning models applied to medical-imaging tasks that mitigates such 'out-of-distribution' performance problems and that improves model robustness and training efficiency. The strategy, which we named REMEDIS (for 'Robust and Efficient Medical Imaging with Self-supervision'), combines large-scale supervised transfer learning on natural images with intermediate contrastive self-supervised learning on medical images and requires minimal task-specific customization. We show the utility of REMEDIS in a range of diagnostic-imaging tasks covering six imaging domains and 15 test datasets, and by simulating three realistic out-of-distribution scenarios. REMEDIS improved in-distribution diagnostic accuracies by up to 11.5% with respect to strong supervised baseline models, and in out-of-distribution settings required only 1-33% of the data for retraining to match the performance of supervised models retrained using all available data. REMEDIS may accelerate the development lifecycle of machine-learning models for medical imaging.
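The intermediate contrastive step is SimCLR-style self-supervision. As a rough illustration of that family of objectives (not the authors' implementation; all embeddings below are synthetic), the normalized temperature-scaled cross-entropy (NT-Xent) loss over two augmented views of a batch can be sketched as:

```python
import numpy as np

def nt_xent_loss(z1, z2, tau=0.5):
    """SimCLR-style NT-Xent loss: for each embedding, the positive is the
    other augmented view of the same image; every other embedding in the
    batch serves as a negative."""
    z = np.concatenate([z1, z2])                      # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine-similarity space
    sim = z @ z.T / tau                               # temperature-scaled logits
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # positive indices
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
views1 = rng.normal(size=(8, 32))                  # embeddings of 8 images, view 1
views2 = views1 + 0.1 * rng.normal(size=(8, 32))   # mildly perturbed view 2
print(nt_xent_loss(views1, views2))
```

The loss is lowest when the two views of each image embed close together relative to the rest of the batch, which is what pulls representations of augmented medical images together during the intermediate pretraining stage.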
Affiliation(s)
- Ting Chen: Google Research, Mountain View, CA, USA
- Aaron Loh: Google Research, Mountain View, CA, USA
- Yuan Liu: Google Research, Mountain View, CA, USA
- Fiona Ryan: Georgia Institute of Technology, Computer Science, Atlanta, GA, USA
- Mozziyar Etemadi: School of Medicine/School of Engineering, Northwestern University, Chicago, IL, USA
- Yun Liu: Google Research, Mountain View, CA, USA
- Lily Peng: Google Research, Mountain View, CA, USA
3
Krogue JD, Azizi S, Tan F, Flament-Auvigne I, Brown T, Plass M, Reihs R, Müller H, Zatloukal K, Richeson P, Corrado GS, Peng LH, Mermel CH, Liu Y, Chen PHC, Gombar S, Montine T, Shen J, Steiner DF, Wulczyn E. Predicting lymph node metastasis from primary tumor histology and clinicopathologic factors in colorectal cancer using deep learning. Commun Med (Lond) 2023; 3:59. [PMID: 37095223] [PMCID: PMC10125969] [DOI: 10.1038/s43856-023-00282-0]
Abstract
BACKGROUND The presence of lymph node metastasis (LNM) influences prognosis and clinical decision-making in colorectal cancer. However, detection of LNM is variable and depends on a number of external factors. Deep learning has shown success in computational pathology, but has struggled to boost performance when combined with known predictors. METHODS Machine-learned features are created by clustering deep learning embeddings of small patches of tumor in colorectal cancer via k-means, and then selecting the top clusters that add predictive value to a logistic regression model when combined with known baseline clinicopathologic variables. We then analyze the performance of logistic regression models trained with and without these machine-learned features in combination with the baseline variables. RESULTS The machine-learned features provide independent signal for the presence of LNM (AUROC: 0.638, 95% CI: [0.590, 0.683]). Furthermore, the machine-learned features add predictive value to the set of 6 clinicopathologic variables in an external validation set (likelihood ratio test, p < 0.00032; AUROC: 0.740, 95% CI: [0.701, 0.780]). A model incorporating these features can also further risk-stratify patients with and without identified metastasis (p < 0.001 for both stage II and stage III). CONCLUSION This work demonstrates an effective approach to combining deep learning with established clinicopathologic factors to identify independently informative features associated with LNM. Further work building on these specific results may have important impact on prognostication and therapeutic decision-making for LNM. Additionally, this general computational approach may prove useful in other contexts.
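A minimal sketch of the pipeline the abstract describes (cluster patch embeddings with k-means, summarize each case as a cluster-frequency histogram, then compare logistic regression models with and without those features). All shapes, labels and values are synthetic placeholders, and scikit-learn is assumed:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins (all invented): per-case patch embeddings from a deep
# model, baseline clinicopathologic variables, and a binary LNM label.
n_cases, n_patches, dim, k = 200, 50, 16, 8
embeddings = rng.normal(size=(n_cases, n_patches, dim))
baseline = rng.normal(size=(n_cases, 6))
y = rng.binomial(1, 0.3, size=n_cases)

# Step 1: cluster all patch embeddings with k-means.
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings.reshape(-1, dim))

# Step 2: summarize each case by its patch-cluster frequency histogram.
labels = km.labels_.reshape(n_cases, n_patches)
hist = np.stack([np.bincount(row, minlength=k) / n_patches for row in labels])

# Step 3: compare baseline-only vs. baseline + machine-learned features.
base_model = LogisticRegression(max_iter=1000).fit(baseline, y)
full_model = LogisticRegression(max_iter=1000).fit(np.hstack([baseline, hist]), y)
print("baseline AUROC:", roc_auc_score(y, base_model.predict_proba(baseline)[:, 1]))
print("combined AUROC:",
      roc_auc_score(y, full_model.predict_proba(np.hstack([baseline, hist]))[:, 1]))
```

With random synthetic data neither model carries real signal; the point of the sketch is the feature construction, with cluster selection and external validation omitted.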
Affiliation(s)
- Fraser Tan: Google Health, Palo Alto, California, USA
- Pema Richeson: Department of Pathology, Stanford University School of Medicine, Stanford, California, USA
- Yun Liu: Google Health, Palo Alto, California, USA
- Saurabh Gombar: Department of Pathology, Stanford University School of Medicine, Stanford, California, USA
- Thomas Montine: Department of Pathology, Stanford University School of Medicine, Stanford, California, USA
- Jeanne Shen: Department of Pathology, Stanford University School of Medicine, Stanford, California, USA
4
L’Imperio V, Wulczyn E, Plass M, Müller H, Tamini N, Gianotti L, Zucchini N, Reihs R, Corrado GS, Webster DR, Peng LH, Chen PHC, Lavitrano M, Liu Y, Steiner DF, Zatloukal K, Pagni F. Pathologist Validation of a Machine Learning-Derived Feature for Colon Cancer Risk Stratification. JAMA Netw Open 2023; 6:e2254891. [PMID: 36917112] [PMCID: PMC10015309] [DOI: 10.1001/jamanetworkopen.2022.54891]
Abstract
IMPORTANCE Identifying new prognostic features in colon cancer has the potential to refine histopathologic review and inform patient care. Although prognostic artificial intelligence systems have recently demonstrated significant risk stratification for several cancer types, studies have not yet shown that the machine learning-derived features associated with these prognostic artificial intelligence systems are both interpretable and usable by pathologists. OBJECTIVE To evaluate whether pathologist scoring of a histopathologic feature previously identified by machine learning is associated with survival among patients with colon cancer. DESIGN, SETTING, AND PARTICIPANTS This prognostic study used deidentified, archived colorectal cancer cases from January 2013 to December 2015 from the University of Milano-Bicocca. All available histologic slides from 258 consecutive colon adenocarcinoma cases were reviewed from December 2021 to February 2022 by 2 pathologists, who conducted semiquantitative scoring for tumor adipose feature (TAF), which was previously identified via a prognostic deep learning model developed with an independent colorectal cancer cohort. MAIN OUTCOMES AND MEASURES Prognostic value of TAF for overall survival and disease-specific survival as measured by univariable and multivariable regression analyses. Interpathologist agreement in TAF scoring was also evaluated. RESULTS A total of 258 colon adenocarcinoma histopathologic cases from 258 patients (138 men [53%]; median age, 67 years [IQR, 65-81 years]) with stage II (n = 119) or stage III (n = 139) cancer were included. Tumor adipose feature was identified in 120 cases (widespread in 63 cases, multifocal in 31, and unifocal in 26). 
For overall survival analysis after adjustment for tumor stage, TAF was independently prognostic in 2 ways: TAF as a binary feature (presence vs absence: hazard ratio [HR] for presence of TAF, 1.55 [95% CI, 1.07-2.25]; P = .02) and TAF as a semiquantitative categorical feature (HR for widespread TAF, 1.87 [95% CI, 1.23-2.85]; P = .004). Interpathologist agreement for widespread TAF vs lower categories (absent, unifocal, or multifocal) was 90%, corresponding to a κ metric at this threshold of 0.69 (95% CI, 0.58-0.80). CONCLUSIONS AND RELEVANCE In this prognostic study, pathologists were able to learn and reproducibly score for TAF, providing significant risk stratification on this independent data set. Although additional work is warranted to understand the biological significance of this feature and to establish broadly reproducible TAF scoring, this work represents the first validation to date of human expert learning from machine learning in pathology. Specifically, this validation demonstrates that a computationally identified histologic feature can represent a human-identifiable, prognostic feature with the potential for integration into pathology practice.
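The interpathologist agreement reported above (90% raw agreement, κ = 0.69 at the widespread-TAF threshold) combines observed agreement with a chance correction. A minimal sketch of Cohen's kappa for two raters' binary calls, on toy scores invented for illustration:

```python
import numpy as np

def cohens_kappa(a, b):
    """Cohen's kappa for two raters' binary calls: observed agreement
    corrected for the agreement expected by chance alone."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    p_obs = (a == b).mean()
    p_chance = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())
    return (p_obs - p_chance) / (1 - p_chance)

# Toy scores: two raters calling "widespread TAF" (1) vs lower categories (0).
rater1 = np.array([1, 1, 0, 0, 1])
rater2 = np.array([1, 0, 0, 0, 1])
print(cohens_kappa(rater1, rater2))  # ≈ 0.615
```

Kappa is deliberately lower than raw agreement whenever the raters' marginal rates make chance agreement likely, which is why the study reports both numbers.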
Affiliation(s)
- Vincenzo L’Imperio: Department of Medicine and Surgery, Pathology, University of Milano-Bicocca, IRCCS (Scientific Institute for Research, Hospitalization and Healthcare) Fondazione San Gerardo dei Tintori, Monza, Italy
- Markus Plass: Medical University of Graz, Diagnostic and Research Institute of Pathology, Graz, Austria
- Heimo Müller: Medical University of Graz, Diagnostic and Research Institute of Pathology, Graz, Austria
- Nicolò Tamini: Department of Surgery, San Gerardo Hospital, Monza, Italy
- Luca Gianotti: Department of Surgery, San Gerardo Hospital, Monza, Italy
- Nicola Zucchini: Department of Medicine and Surgery, Pathology, University of Milano-Bicocca, IRCCS (Scientific Institute for Research, Hospitalization and Healthcare) Fondazione San Gerardo dei Tintori, Monza, Italy
- Robert Reihs: Medical University of Graz, Diagnostic and Research Institute of Pathology, Graz, Austria
- Lily H. Peng: Google Health, Google LLC, Palo Alto, California
- Marialuisa Lavitrano: Department of Medicine and Surgery, Pathology, University of Milano-Bicocca, IRCCS (Scientific Institute for Research, Hospitalization and Healthcare) Fondazione San Gerardo dei Tintori, Monza, Italy
- Yun Liu: Google Health, Google LLC, Palo Alto, California
- Kurt Zatloukal: Medical University of Graz, Diagnostic and Research Institute of Pathology, Graz, Austria
- Fabio Pagni: Department of Medicine and Surgery, Pathology, University of Milano-Bicocca, IRCCS (Scientific Institute for Research, Hospitalization and Healthcare) Fondazione San Gerardo dei Tintori, Monza, Italy
5
Sadhwani A, Chang HW, Behrooz A, Brown T, Auvigne-Flament I, Patel H, Findlater R, Velez V, Tan F, Tekiela K, Wulczyn E, Yi ES, Mermel CH, Hanks D, Chen PHC, Kulig K, Batenchuk C, Steiner DF, Cimermancic P. Comparative analysis of machine learning approaches to classify tumor mutation burden in lung adenocarcinoma using histopathology images. Sci Rep 2021; 11:16605. [PMID: 34400666] [PMCID: PMC8368039] [DOI: 10.1038/s41598-021-95747-4]
Abstract
Both histologic subtypes and tumor mutation burden (TMB) represent important biomarkers in lung cancer, with implications for patient prognosis and treatment decisions. Typically, TMB is evaluated by comprehensive genomic profiling, but this requires the use of finite tissue specimens and costly, time-consuming laboratory processes. Histologic subtype classification represents an established component of lung adenocarcinoma histopathology, but can be challenging and is associated with substantial inter-pathologist variability. Here we developed a deep learning system to both classify histologic patterns in lung adenocarcinoma and predict TMB status using de-identified hematoxylin and eosin (H&E)-stained whole slide images. We first trained a convolutional neural network to map histologic features across whole slide images of lung cancer resection specimens. On evaluation using an external data source, this model achieved a patch-level area under the receiver operating characteristic curve (AUC) of 0.78–0.98 across nine histologic features. We then integrated the output of this model with clinico-demographic data to develop an interpretable model for TMB classification. The resulting end-to-end system was evaluated on 172 held-out cases from TCGA, achieving an AUC of 0.71 (95% CI 0.63–0.80). The benefit of using histologic features in predicting TMB is highlighted by the significant improvement this approach offers over using the clinical features alone (AUC of 0.63 [95% CI 0.53–0.72], p = 0.002). Furthermore, we found that our histologic subtype-based approach achieved performance similar to that of a weakly supervised approach (AUC of 0.72 [95% CI 0.64–0.80]). Together these results underscore that incorporating histologic patterns in biomarker prediction for lung cancer provides informative signals, and that interpretable approaches utilizing these patterns perform comparably with less interpretable, weakly supervised approaches.
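The patch-level AUCs reported above can be computed without any curve plotting: AUC equals the Mann-Whitney probability that a randomly chosen positive scores above a randomly chosen negative. A small self-contained sketch on invented scores:

```python
import numpy as np

def auc_mann_whitney(y_true, scores):
    """AUC via the Mann-Whitney statistic: the probability that a randomly
    chosen positive case scores higher than a randomly chosen negative
    case, with score ties counted as half."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    wins = (pos[:, None] > neg[None, :]).sum()   # pairwise comparisons
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Toy patch scores for one histologic feature (values invented).
print(auc_mann_whitney([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]))  # 0.75
```

The pairwise form is O(n²) and fine for a sketch; rank-based implementations scale better on large patch sets.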
Affiliation(s)
- Ali Behrooz: Verily Life Sciences, South San Francisco, CA, USA
- Hardik Patel: Verily Life Sciences, South San Francisco, CA, USA
- Eunhee S Yi: Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, USA
- Debra Hanks: Verily Life Sciences, South San Francisco, CA, USA
- Kimary Kulig: Verily Life Sciences, South San Francisco, CA, USA; PathPresenter Corp., New York, NY, USA
6
Wilson M, Chopra R, Wilson MZ, Cooper C, MacWilliams P, Liu Y, Wulczyn E, Florea D, Hughes CO, Karthikesalingam A, Khalid H, Vermeirsch S, Nicholson L, Keane PA, Balaskas K, Kelly CJ. Validation and Clinical Applicability of Whole-Volume Automated Segmentation of Optical Coherence Tomography in Retinal Disease Using Deep Learning. JAMA Ophthalmol 2021; 139:964-973. [PMID: 34236406] [PMCID: PMC8444027] [DOI: 10.1001/jamaophthalmol.2021.2273]
Abstract
Question Is deep learning-based segmentation of macular disease in optical coherence tomography (OCT) suitable for clinical use? Findings In this diagnostic study of OCT data from 173 patients with age-related macular degeneration or diabetic macular edema, a panel of 3 retinal specialists ranked model segmentations as qualitatively better than or comparable to 1 or more expert grader segmentations for clinical applicability in 127 scans (73%). Scans with high quantitative accuracy scores were not reliably associated with higher rankings. Meaning These findings suggest that qualitative evaluation adds to quantitative approaches when assessing clinical applicability of segmentation tools and clinician satisfaction in practice. Importance Quantitative volumetric measures of retinal disease in optical coherence tomography (OCT) scans are infeasible to perform owing to the time required for manual grading. Expert-level deep learning systems for automatic OCT segmentation have recently been developed. However, the potential clinical applicability of these systems is largely unknown. Objective To evaluate a deep learning model for whole-volume segmentation of 4 clinically important pathological features and assess clinical applicability. Design, Setting, and Participants This diagnostic study used OCT data from 173 patients with a total of 15 558 B-scans, treated at Moorfields Eye Hospital. The data set included 2 common OCT devices and 2 macular conditions: wet age-related macular degeneration (107 scans) and diabetic macular edema (66 scans), covering the full range of severity, and from 3 points during treatment. Two expert graders performed pixel-level segmentations of intraretinal fluid, subretinal fluid, subretinal hyperreflective material, and pigment epithelial detachment, including all B-scans in each OCT volume, taking as long as 50 hours per scan. Quantitative evaluation of whole-volume model segmentations was performed.
Qualitative evaluation of clinical applicability by 3 retinal experts was also conducted. Data were collected from June 1, 2012, to January 31, 2017, for set 1 and from January 1 to December 31, 2017, for set 2; graded between November 2018 and January 2020; and analyzed from February 2020 to November 2020. Main Outcomes and Measures Rating and stack ranking for clinical applicability by retinal specialists, model-grader agreement for voxelwise segmentations, and total volume evaluated using Dice similarity coefficients, Bland-Altman plots, and intraclass correlation coefficients. Results Among the 173 patients included in the analysis (92 [53%] women), qualitative assessment found that automated whole-volume segmentation ranked better than or comparable to at least 1 expert grader in 127 scans (73%; 95% CI, 66%-79%). A neutral or positive rating was given to 135 model segmentations (78%; 95% CI, 71%-84%) and 309 expert gradings (2 per scan) (89%; 95% CI, 86%-92%). The model was rated neutrally or positively in 86% to 92% of diabetic macular edema scans and 53% to 87% of age-related macular degeneration scans. Intraclass correlations ranged from 0.33 (95% CI, 0.08-0.96) to 0.96 (95% CI, 0.90-0.99). Dice similarity coefficients ranged from 0.43 (95% CI, 0.29-0.66) to 0.78 (95% CI, 0.57-0.85). Conclusions and Relevance This deep learning-based segmentation tool provided clinically useful measures of retinal disease that would otherwise be infeasible to obtain. Qualitative evaluation was additionally important to reveal clinical applicability for both care management and research.
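The Dice similarity coefficients used above for voxelwise model-grader agreement reduce to a one-line overlap ratio on binary masks. A minimal sketch on toy masks (values invented for illustration):

```python
import numpy as np

def dice(a, b):
    """Dice similarity coefficient between two binary masks:
    2|A∩B| / (|A| + |B|), taken as 1.0 when both masks are empty."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(a, b).sum() / denom

model_mask = np.array([[1, 1, 0], [0, 1, 0]])   # e.g., model's fluid segmentation
grader_mask = np.array([[1, 0, 0], [0, 1, 1]])  # e.g., expert grader's segmentation
print(dice(model_mask, grader_mask))  # 2*2 / (3+3) ≈ 0.667
```

The empty-mask convention matters in practice: many B-scans contain none of a given pathological feature, and an undefined 0/0 would otherwise skew per-volume averages.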
Affiliation(s)
- Reena Chopra: Google Health, London, United Kingdom; National Institute for Health Research Biomedical Research Centre for Ophthalmology, Moorfields Eye Hospital NHS (National Health Service) Foundation Trust, London, United Kingdom; University College London Institute of Ophthalmology, London, United Kingdom
- Yun Liu: Google Health, Palo Alto, California
- Daniela Florea: National Institute for Health Research Biomedical Research Centre for Ophthalmology, Moorfields Eye Hospital NHS (National Health Service) Foundation Trust, London, United Kingdom; University College London Institute of Ophthalmology, London, United Kingdom
- Hagar Khalid: National Institute for Health Research Biomedical Research Centre for Ophthalmology, Moorfields Eye Hospital NHS (National Health Service) Foundation Trust, London, United Kingdom; University College London Institute of Ophthalmology, London, United Kingdom
- Sandra Vermeirsch: National Institute for Health Research Biomedical Research Centre for Ophthalmology, Moorfields Eye Hospital NHS (National Health Service) Foundation Trust, London, United Kingdom; University College London Institute of Ophthalmology, London, United Kingdom
- Luke Nicholson: National Institute for Health Research Biomedical Research Centre for Ophthalmology, Moorfields Eye Hospital NHS (National Health Service) Foundation Trust, London, United Kingdom; University College London Institute of Ophthalmology, London, United Kingdom
- Pearse A Keane: National Institute for Health Research Biomedical Research Centre for Ophthalmology, Moorfields Eye Hospital NHS (National Health Service) Foundation Trust, London, United Kingdom; University College London Institute of Ophthalmology, London, United Kingdom
- Konstantinos Balaskas: National Institute for Health Research Biomedical Research Centre for Ophthalmology, Moorfields Eye Hospital NHS (National Health Service) Foundation Trust, London, United Kingdom; University College London Institute of Ophthalmology, London, United Kingdom
7
Wulczyn E, Steiner DF, Moran M, Plass M, Reihs R, Mueller H, Sadhwani A, Cai Y, Flament I, Chen PHC, Liu Y, Stumpe MC, Xu Z, Zatloukal K, Mermel CH. Abstract 2096: A deep learning system to predict disease-specific survival in stage II and stage III colorectal cancer. Cancer Res 2020. [DOI: 10.1158/1538-7445.am2020-2096]
Abstract
Accurate prognosis in colorectal cancer can have important implications for clinical management. Here, we develop a deep learning system (DLS) to first identify invasive cancer and then directly predict disease specific survival (DSS) for stage II and stage III colorectal cancer using only digitized histopathology whole-slide images. The DLS was trained using slides from 1173 stage II and 1266 stage III cases (18,304 total slides) and was evaluated on a held-out test set of 601 stage II and 638 stage III cases (9,340 total slides). The area under the receiver operating characteristic curve (AUC) for 5-year DSS prediction was 68.0 for stage II (95% CI 62.2-73.1) and 65.5 for stage III (95% CI 61.1-70.0). For stage II, 5-year DSS was 64% for DLS-predicted high-risk cases versus 89% for DLS-predicted low-risk cases (upper and lower risk quartiles; p<0.001, log rank test). For stage III, 5-year DSS was 35% for DLS-predicted high-risk cases versus 66% for DLS-predicted low-risk cases (upper and lower risk quartiles; p<0.001, log rank test). In a multivariable Cox model, the DLS prediction remained significantly associated with DSS after adjusting for T-category, N-category, age, gender, tumor grade, and lymphovascular invasion (stage II: adjusted hazard ratio 1.55, 95% CI 1.33-1.81, p<0.0001; stage III: adjusted hazard ratio 1.35, 95% CI (1.21-1.51), p<0.0001). Finally, a combined proportional-hazards model using the DLS along with baseline clinicopathologic information provided better risk prediction than the DLS or baseline information alone, increasing 5-year AUC over the baseline-only model by 8.9 points (95% CI 3.9-13.6) and 5.3 points (95% CI 2.3-8.4) for stages II and III, respectively. Taken together, these findings demonstrate that the DLS provides significant prognostic value and risk stratification in both stage II and stage III colorectal cancer, and can be combined with known risk features to further improve prognostic accuracy. 
This represents novel work to train a DLS to directly predict patient outcomes using whole-slide images and weakly supervised learning. The ability to use non-annotated slides as input has important implications for possible clinical applications and the features learned by the model may also help to identify new prognosis-associated morphologic factors in colorectal cancer. Additional work is ongoing to confirm the utility of these findings, such as validation in additional datasets and interpretability experiments to better understand the features learned by the DLS for these predictions.
Citation Format: Ellery Wulczyn, David F. Steiner, Melissa Moran, Markus Plass, Robert Reihs, Heimo Mueller, Apaar Sadhwani, Yuannan Cai, Isabelle Flament, Po-Hsuan Cameron Chen, Yun Liu, Martin C. Stumpe, Zhaoyang Xu, Kurt Zatloukal, Craig H. Mermel. A deep learning system to predict disease-specific survival in stage II and stage III colorectal cancer [abstract]. In: Proceedings of the Annual Meeting of the American Association for Cancer Research 2020; 2020 Apr 27-28 and Jun 22-24. Philadelphia (PA): AACR; Cancer Res 2020;80(16 Suppl):Abstract nr 2096.
8
Wulczyn E, Steiner DF, Xu Z, Sadhwani A, Wang H, Flament-Auvigne I, Mermel CH, Chen PHC, Liu Y, Stumpe MC. Deep learning-based survival prediction for multiple cancer types using histopathology images. PLoS One 2020; 15:e0233678. [PMID: 32555646] [PMCID: PMC7299324] [DOI: 10.1371/journal.pone.0233678]
Abstract
Providing prognostic information at the time of cancer diagnosis has important implications for treatment and monitoring. Although cancer staging, histopathological assessment, molecular features, and clinical variables can provide useful prognostic insights, improving risk stratification remains an active research area. We developed a deep learning system (DLS) to predict disease specific survival across 10 cancer types from The Cancer Genome Atlas (TCGA). We used a weakly-supervised approach without pixel-level annotations, and tested three different survival loss functions. The DLS was developed using 9,086 slides from 3,664 cases and evaluated using 3,009 slides from 1,216 cases. In multivariable Cox regression analysis of the combined cohort including all 10 cancers, the DLS was significantly associated with disease specific survival (hazard ratio of 1.58, 95% CI 1.28–1.70, p<0.0001) after adjusting for cancer type, stage, age, and sex. In a per-cancer adjusted subanalysis, the DLS remained a significant predictor of survival in 5 of 10 cancer types. Compared to a baseline model including stage, age, and sex, the c-index of the model demonstrated an absolute 3.7% improvement (95% CI 1.0–6.5) in the combined cohort. Additionally, our models stratified patients within individual cancer stages, particularly stage II (p = 0.025) and stage III (p<0.001). By developing and evaluating prognostic models across multiple cancer types, this work represents one of the most comprehensive studies exploring the direct prediction of clinical outcomes using deep learning and histopathology images. Our analysis demonstrates the potential for this approach to provide significant prognostic information in multiple cancer types, and even within specific pathologic stages. 
However, given the relatively small number of cases and observed clinical events for a deep learning task of this type, we observed wide confidence intervals for model performance, thus highlighting that future work will benefit from larger datasets assembled for the purposes of survival modeling.
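The c-index improvements reported above refer to Harrell's concordance index, which generalizes AUC to censored survival data: only pairs in which the earlier time is an observed event are comparable. A minimal sketch on toy data (an O(n²) loop, fine for illustration):

```python
import numpy as np

def c_index(time, event, risk):
    """Harrell's concordance index: among comparable pairs (the earlier
    time is an observed event), the fraction where the higher predicted
    risk belongs to the shorter survival time; risk ties count half."""
    time, event, risk = map(np.asarray, (time, event, risk))
    conc = comp = 0.0
    for i in range(len(time)):
        for j in range(len(time)):
            if event[i] and time[i] < time[j]:  # i fails first: comparable pair
                comp += 1
                if risk[i] > risk[j]:
                    conc += 1.0
                elif risk[i] == risk[j]:
                    conc += 0.5
    return conc / comp

# Toy data: predicted risks perfectly anti-ordered with survival time;
# the third case is censored (event = 0) and never anchors a pair.
print(c_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.7, 0.5, 0.1]))  # 1.0
```

A c-index of 0.5 corresponds to random risk ordering and 1.0 to perfect concordance, so the reported absolute gains of a few points are measured on this 0.5-1.0 scale.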
Affiliation(s)
- Ellery Wulczyn, David F. Steiner, Zhaoyang Xu, Apaar Sadhwani, Hongwu Wang, Craig H. Mermel, Yun Liu, Martin C. Stumpe: Google Health, Google LLC, Palo Alto, California, United States of America

9
Nagpal K, Foote D, Liu Y, Chen PHC, Wulczyn E, Tan F, Olson N, Smith JL, Mohtashamian A, Wren JH, Corrado GS, MacDonald R, Peng LH, Amin MB, Evans AJ, Sangoi AR, Mermel CH, Hipp JD, Stumpe MC. Erratum: Publisher Correction: Development and validation of a deep learning algorithm for improving Gleason scoring of prostate cancer. NPJ Digit Med 2019; 2:113. [PMID: 31754638 PMCID: PMC6864046 DOI: 10.1038/s41746-019-0196-8] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Affiliation(s)
- Kunal Nagpal, Davis Foote, Yun Liu, Fraser Tan, Lily H Peng, Jason D Hipp: Google AI Healthcare, Google, Mountain View, CA, USA
- Niels Olson, Jenny L Smith, Arash Mohtashamian: Laboratory Department, Naval Medical Center San Diego, San Diego, CA, USA
- Mahul B Amin: Department of Pathology and Laboratory Medicine, University of Tennessee Health Science Center, Memphis, TN, USA
- Andrew J Evans: Department of Pathology, Laboratory Medicine and Pathology, University Health Network and University of Toronto, Toronto, ON, Canada
- Ankur R Sangoi: Department of Pathology, El Camino Hospital, Mountain View, CA, USA
- Martin C Stumpe: Google AI Healthcare, Google, Mountain View, CA, USA; present address: AI and Data Science, Tempus Labs Inc, Chicago, United States

10
Nagpal K, Foote D, Liu Y, Chen PHC, Wulczyn E, Tan F, Olson N, Smith JL, Mohtashamian A, Wren JH, Corrado GS, MacDonald R, Peng LH, Amin MB, Evans AJ, Sangoi AR, Mermel CH, Hipp JD, Stumpe MC. Development and validation of a deep learning algorithm for improving Gleason scoring of prostate cancer. NPJ Digit Med 2019; 2:48. [PMID: 31304394 PMCID: PMC6555810 DOI: 10.1038/s41746-019-0112-2] [Citation(s) in RCA: 167] [Impact Index Per Article: 33.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2019] [Accepted: 04/15/2019] [Indexed: 12/20/2022] Open
Abstract
For prostate cancer patients, the Gleason score is one of the most important prognostic factors, potentially determining treatment independent of the stage. However, Gleason scoring is based on subjective microscopic examination of tumor morphology and suffers from poor reproducibility. Here we present a deep learning system (DLS) for Gleason scoring whole-slide images of prostatectomies. Our system was developed using 112 million pathologist-annotated image patches from 1226 slides, and evaluated on an independent validation dataset of 331 slides. Compared to a reference standard provided by genitourinary pathology experts, the mean accuracy among 29 general pathologists was 0.61 on the validation set. The DLS achieved a significantly higher diagnostic accuracy of 0.70 (p = 0.002) and trended towards better patient risk stratification in correlation with clinical follow-up data. Our approach could improve the accuracy of Gleason scoring and subsequent therapy decisions, particularly where specialist expertise is unavailable. The DLS also goes beyond the current Gleason system to more finely characterize and quantitate tumor morphology, providing opportunities for refinement of the Gleason system itself.
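A paired permutation test is one standard way to attach a p-value to an accuracy difference like the 0.70-vs-0.61 comparison in this abstract. The sketch below is a generic illustration under that assumption, not the study's actual statistical procedure:

```python
import random

def paired_permutation_pvalue(correct_a, correct_b, n_perm=10000, seed=0):
    """Two-sided paired permutation test for a difference in accuracy.
    correct_a / correct_b are per-case 0/1 correctness indicators for two
    raters (e.g. a model and a pathologist) graded on the same cases."""
    rng = random.Random(seed)
    observed = abs(sum(correct_a) - sum(correct_b))
    hits = 0
    for _ in range(n_perm):
        diff = 0
        for a, b in zip(correct_a, correct_b):
            if rng.random() < 0.5:
                a, b = b, a  # under H0, labels are exchangeable within a pair
            diff += a - b
        if abs(diff) >= observed:
            hits += 1
    return hits / n_perm

# identical correctness patterns: no evidence of a difference
print(paired_permutation_pvalue([1, 0, 1, 1], [1, 0, 1, 1]))  # -> 1.0
```

Pairing by case matters here because both the DLS and the pathologists graded the same validation slides, so per-case correctness indicators are correlated.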
Affiliation(s)
- Kunal Nagpal, Davis Foote, Yun Liu, Fraser Tan, Lily H. Peng: Google AI Healthcare, Google, Mountain View, CA, USA
- Niels Olson, Jenny L. Smith, Arash Mohtashamian: Laboratory Department, Naval Medical Center San Diego, San Diego, CA, USA
- Mahul B. Amin: Department of Pathology and Laboratory Medicine, University of Tennessee Health Science Center, Memphis, TN, USA
- Andrew J. Evans: Department of Pathology, Laboratory Medicine and Pathology, University Health Network and University of Toronto, Toronto, ON, Canada
- Ankur R. Sangoi: Department of Pathology, El Camino Hospital, Mountain View, CA, USA
- Martin C. Stumpe: Google AI Healthcare, Google, Mountain View, CA, USA; present address: AI and Data Science, Tempus Labs Inc, Chicago, United States

11
Abstract
The different Wikipedia language editions vary dramatically in how comprehensive they are. As a result, most language editions contain only a small fraction of the sum of information that exists across all Wikipedias. In this paper, we present an approach to filling gaps in article coverage across different Wikipedia editions. Our main contribution is an end-to-end system for recommending articles for creation that exist in one language but are missing in another. The system involves identifying missing articles, ranking the missing articles according to their importance, and recommending important missing articles to editors based on their interests. We empirically validate our models in a controlled experiment involving 12,000 French Wikipedia editors. We find that personalizing recommendations increases editor engagement by a factor of two. Moreover, recommending articles increases their chance of being created by a factor of 3.2. Finally, articles created as a result of our recommendations are of comparable quality to organically created articles. Overall, our system leads to more engaged editors and faster growth of Wikipedia with no effect on its quality.
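The pipeline this abstract describes (identify missing articles, rank by importance, match to editor interests) can be sketched in a few lines. Every title, score, and topic below is hypothetical; the paper's actual importance and interest models are more sophisticated:

```python
def recommend_missing_articles(candidates, interests, top_k=3):
    """candidates: list of (title, importance_score, topic) tuples for
    articles present in a source Wikipedia but missing in the target.
    Keep those matching the editor's interests, ranked by importance."""
    matching = [c for c in candidates if c[2] in interests]
    matching.sort(key=lambda c: c[1], reverse=True)
    return [title for title, _, _ in matching[:top_k]]

# hypothetical missing-article candidates for one editor
candidates = [
    ("Kolmogorov complexity", 0.91, "mathematics"),
    ("Baguette", 0.88, "food"),
    ("Hash table", 0.75, "computer science"),
    ("Sourdough", 0.60, "food"),
]
print(recommend_missing_articles(candidates, {"food"}))
# -> ['Baguette', 'Sourdough']
```

The paper's controlled experiment suggests the personalization step (the interest filter here) is what doubles editor engagement relative to ranking by importance alone.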