1
Li J, Wang Z, Wu L, Qiu S, Zhao H, Lin F, Zhang K. Method for Incomplete and Imbalanced Data Based on Multivariate Imputation by Chained Equations and Ensemble Learning. IEEE J Biomed Health Inform 2024;28:3102-3113. [PMID: 38483807] [DOI: 10.1109/jbhi.2024.3376428]
Abstract
Classification of incomplete and imbalanced data remains a challenging task, since both issues can degrade classifier training; we encountered both in our study of the physical fitness assessments of patients. In fields such as healthcare, there are also stricter requirements on the accuracy of generated imputation values. To train a high-performance classifier while pursuing high accuracy, we propose a novel algorithmic approach that combines multivariate imputation by chained equations with an ensemble learning method (MICEEN) and addresses both problems simultaneously. We use multivariate imputation by chained equations to generate more accurate imputation values for the training set, which is then passed to ensemble learning to build a predictor. In addition, we introduce missing values into minority-class samples and use them to generate new minority-class samples, thereby balancing the class distribution. We perform extensive experiments on real-world datasets to assess our method and compare it with other state-of-the-art approaches. Experimental results on benchmark datasets and on self-collected physical fitness assessment datasets of tumor patients with varying missing rates demonstrate the advantages of the proposed method.
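The pipeline the abstract describes — chained-equation imputation of the training set followed by an ensemble classifier — can be sketched in general terms with scikit-learn. `IterativeImputer` (a MICE-style imputer) and `RandomForestClassifier` are illustrative stand-ins, not the authors' MICEEN implementation, and the data below is synthetic:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic complete data: 200 samples, 5 features, label from first two features.
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Introduce ~20% missingness at random to mimic an incomplete dataset.
mask = rng.random(X.shape) < 0.2
X_missing = X.copy()
X_missing[mask] = np.nan

# MICE-style imputation: each feature is modeled as a function of the others,
# cycling through the features for a fixed number of rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X_missing)

# Ensemble classifier trained on the imputed training set.
clf = RandomForestClassifier(random_state=0).fit(X_imputed, y)
```

The minority-class oversampling step of MICEEN (injecting missingness into minority samples and re-imputing them to synthesize new ones) is not shown here; this sketch covers only the impute-then-ensemble backbone.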
2
De Angeli K, Gao S, Blanchard A, Durbin EB, Wu XC, Stroup A, Doherty J, Schwartz SM, Wiggins C, Coyle L, Penberthy L, Tourassi G, Yoon HJ. Using ensembles and distillation to optimize the deployment of deep learning models for the classification of electronic cancer pathology reports. JAMIA Open 2022;5:ooac075. [PMID: 36110150] [PMCID: PMC9469924] [DOI: 10.1093/jamiaopen/ooac075]
Abstract
Objective: We aim to reduce overfitting and model overconfidence by distilling the knowledge of an ensemble of deep learning models into a single model for the classification of cancer pathology reports.
Materials and Methods: We consider a text classification problem comprising 5 individual tasks. The baseline model is a multitask convolutional neural network (MtCNN), and the implemented ensemble (teacher) consists of 1000 MtCNNs. We performed knowledge transfer by training a single model (student) with soft labels derived by aggregating the ensemble predictions. We evaluate performance in terms of accuracy and abstention rates using softmax thresholding.
Results: The student model outperforms the baseline MtCNN in both abstention rate and accuracy, allowing the model to be used on a larger volume of documents when deployed. The largest gains were observed for subsite and histology, for which the student model classified an additional 1.81% of reports for subsite and 3.33% for histology.
Discussion: Ensemble predictions provide a useful strategy for quantifying the uncertainty inherent in labeled data, enabling the construction of soft labels with estimated probabilities over multiple classes for a given document. Training models with these soft labels reduces model confidence on difficult-to-classify documents, leading to fewer highly confident wrong predictions.
Conclusions: Ensemble model distillation is a simple tool for reducing model overconfidence in problems with extreme class imbalance and noisy datasets. These methods can facilitate the deployment of deep learning models in high-risk domains with low computational resources, where minimizing inference time is required.
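The two mechanisms the abstract relies on — soft labels built by averaging ensemble predictions, and abstention via softmax thresholding — can be sketched with NumPy. The ensemble size, logits, and thresholds below are illustrative assumptions, not the paper's configuration or model:

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax over the class axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
n_models, n_docs, n_classes = 10, 4, 3  # small stand-in for the 1000-MtCNN teacher

# Per-model logits for each document (here random, for illustration only).
logits = rng.normal(size=(n_models, n_docs, n_classes))

# Soft labels: mean of the per-model softmax predictions for each document.
soft_labels = softmax(logits).mean(axis=0)

def soft_cross_entropy(student_logits, targets):
    # Distillation loss a student would minimize against the soft-label targets.
    logp = student_logits - np.log(np.exp(student_logits).sum(axis=-1, keepdims=True))
    return -(targets * logp).sum(axis=-1).mean()

# Untrained student logits, just to exercise the loss.
student_logits = rng.normal(size=(n_docs, n_classes))
loss = soft_cross_entropy(student_logits, soft_labels)

def predict_with_abstention(probs, threshold=0.5):
    # Emit a class label only when the top softmax probability clears the
    # threshold; otherwise abstain (encoded as -1).
    conf = probs.max(axis=-1)
    labels = probs.argmax(axis=-1)
    return np.where(conf >= threshold, labels, -1)
```

Raising the threshold trades coverage for accuracy: more documents are abstained on, but the retained predictions are more confident, which is the accuracy/abstention-rate tradeoff the paper evaluates.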
Affiliation(s)
- Kevin De Angeli
- Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
- University of Tennessee, Knoxville, Tennessee, USA
- Shang Gao
- Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
- Eric B Durbin
- College of Medicine, University of Kentucky, Lexington, Kentucky, USA
- Xiao-Cheng Wu
- Louisiana Tumor Registry, Louisiana State University Health Sciences Center School of Public Health, New Orleans, Louisiana, USA
- Antoinette Stroup
- Rutgers Cancer Institute of New Jersey, New Brunswick, New Jersey, USA
- Jennifer Doherty
- Utah Cancer Registry, Huntsman Cancer Institute, University of Utah, Salt Lake City, Utah, USA
- Stephen M Schwartz
- Fred Hutchinson Cancer Center, Epidemiology Program, Seattle, Washington, USA
- Linda Coyle
- Information Management Services Inc., Calverton, Maryland, USA
- Hong-Jun Yoon
- Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA