1
Gupta S, Tran T, Luo W, Phung D, Kennedy RL, Broad A, Campbell D, Kipp D, Singh M, Khasraw M, Matheson L, Ashley DM, Venkatesh S. Machine-learning prediction of cancer survival: a retrospective study using electronic administrative records and a cancer registry. BMJ Open 2014; 4:e004007. [PMID: 24643167 PMCID: PMC3963101 DOI: 10.1136/bmjopen-2013-004007] [Citation(s) in RCA: 57] [Impact Index Per Article: 5.7] [Indexed: 01/24/2023]
Abstract
OBJECTIVES: Using the prediction of cancer outcome as a model, we have tested the hypothesis that through analysing routinely collected digital data contained in an electronic administrative record (EAR), using machine-learning techniques, we could enhance conventional methods in predicting clinical outcomes.
SETTING: A regional cancer centre in Australia.
PARTICIPANTS: Disease-specific data from a purpose-built cancer registry (Evaluation of Cancer Outcomes (ECO)) from 869 patients were used to predict survival at 6, 12 and 24 months. The model was validated with data from a further 94 patients, and results compared to the assessment of five specialist oncologists. Machine-learning prediction using ECO data was compared with that using EAR and a model combining ECO and EAR data.
PRIMARY AND SECONDARY OUTCOME MEASURES: Survival prediction accuracy in terms of the area under the receiver operating characteristic curve (AUC).
RESULTS: The ECO model yielded AUCs of 0.87 (95% CI 0.848 to 0.890) at 6 months, 0.796 (95% CI 0.774 to 0.823) at 12 months and 0.764 (95% CI 0.737 to 0.789) at 24 months. Each was slightly better than the performance of the clinician panel. The model performed consistently across a range of cancers, including rare cancers. Combining ECO and EAR data yielded better prediction than the ECO-based model (AUCs ranging from 0.757 to 0.997 for 6 months, AUCs from 0.689 to 0.988 for 12 months and AUCs from 0.713 to 0.973 for 24 months). The best prediction was for genitourinary, head and neck, lung, skin, and upper gastrointestinal tumours.
CONCLUSIONS: Machine learning applied to information from a disease-specific (cancer) database and the EAR can be used to predict clinical outcomes. Importantly, the approach described made use of digital data that is already routinely collected but underexploited by clinical health systems.
Affiliation(s)
- Sunil Gupta
  - Centre for Pattern Recognition and Data Analytics, Deakin University, Geelong, Victoria, Australia
- Truyen Tran
  - Centre for Pattern Recognition and Data Analytics, Deakin University, Geelong, Victoria, Australia
  - Department of Computing, Curtin University, Perth, Western Australia, Australia
- Wei Luo
  - Centre for Pattern Recognition and Data Analytics, Deakin University, Geelong, Victoria, Australia
- Dinh Phung
  - Centre for Pattern Recognition and Data Analytics, Deakin University, Geelong, Victoria, Australia
- Adam Broad
  - Andrew Love Cancer Centre, Barwon Health, Geelong, Victoria, Australia
- David Campbell
  - Andrew Love Cancer Centre, Barwon Health, Geelong, Victoria, Australia
- David Kipp
  - Andrew Love Cancer Centre, Barwon Health, Geelong, Victoria, Australia
- Madhu Singh
  - Andrew Love Cancer Centre, Barwon Health, Geelong, Victoria, Australia
- Mustafa Khasraw
  - School of Medicine, Deakin University, Geelong, Victoria, Australia
  - Andrew Love Cancer Centre, Barwon Health, Geelong, Victoria, Australia
- Leigh Matheson
  - Barwon Southwest Integrated Cancer Service, Geelong, Victoria, Australia
- David M Ashley
  - School of Medicine, Deakin University, Geelong, Victoria, Australia
  - Andrew Love Cancer Centre, Barwon Health, Geelong, Victoria, Australia
  - Barwon Southwest Integrated Cancer Service, Geelong, Victoria, Australia
- Svetha Venkatesh
  - Centre for Pattern Recognition and Data Analytics, Deakin University, Geelong, Victoria, Australia
2
Sariyar M, Hoffmann I, Binder H. Combining techniques for screening and evaluating interaction terms on high-dimensional time-to-event data. BMC Bioinformatics 2014; 15:58. [PMID: 24571520 PMCID: PMC3945780 DOI: 10.1186/1471-2105-15-58] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 10/22/2013] [Accepted: 01/28/2014] [Indexed: 11/23/2022]
Abstract
BACKGROUND: Molecular data, e.g. arising from microarray technology, is often used for predicting survival probabilities of patients. For multivariate risk prediction models on such high-dimensional data, there are established techniques that combine parameter estimation and variable selection. One big challenge is to incorporate interactions into such prediction models. In this feasibility study, we present building blocks for evaluating and incorporating interaction terms in high-dimensional time-to-event settings, especially for settings in which it is computationally too expensive to check all possible interactions.
RESULTS: We use a boosting technique for estimation of effects and the following building blocks for pre-selecting interactions: (1) resampling, (2) random forests and (3) orthogonalization as a data pre-processing step. In a simulation study, the strategy that uses all building blocks is able to detect true main effects and interactions with high sensitivity in different kinds of scenarios. The main challenge is posed by interactions composed of variables that do not represent main effects, but our findings are also promising in this regard. Results on real-world data illustrate that effect sizes of interactions frequently may not be large enough to improve prediction performance, even though the interactions are potentially of biological relevance.
CONCLUSION: Screening interactions through random forests is feasible and useful when one is interested in finding relevant two-way interactions. The other building blocks also contribute considerably to an enhanced pre-selection of interactions. We determined the limits of interaction detection in terms of necessary effect sizes. Our study emphasizes the importance of making full use of existing methods in addition to establishing new ones.
Affiliation(s)
- Murat Sariyar
  - Institute of Medical Biostatistics, Epidemiology and Informatics, Medical Center of the Johannes Gutenberg University, Mainz 55131, Germany.