1
Li Y, Hsu W. A classification for complex imbalanced data in disease screening and early diagnosis. Stat Med 2022; 41:3679-3695. [PMID: 35603639 PMCID: PMC9541048 DOI: 10.1002/sim.9442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2021] [Revised: 04/11/2022] [Accepted: 05/10/2022] [Indexed: 11/09/2022]
Abstract
Imbalanced classification has drawn considerable attention in the statistics and machine learning literature. Traditional classification methods often perform poorly when the class distribution is severely skewed, and even more so under a high-dimensional longitudinal data structure. Given the ubiquity of big data in modern health research, imbalanced classification in disease diagnosis is expected to face an additional level of difficulty imposed by such a complex data structure. In this article, we propose a nonparametric classification approach for imbalanced data in longitudinal and high-dimensional settings. Technically, functional principal component analysis is first applied for feature extraction under the longitudinal structure. The univariate exponential loss function coupled with a group LASSO penalty is then adopted in the classification procedure for high-dimensional settings. Along with improved imbalanced classification, our approach provides meaningful feature selection for interpretation at a markedly lower computational complexity. The proposed method is illustrated with a real data application to early detection of Alzheimer's disease, and its finite-sample performance is extensively evaluated by simulations.
Affiliation(s)
- Yiming Li
- Department of Statistics, Kansas State University, Manhattan, Kansas, USA
- Wei‐Wen Hsu
- Division of Biostatistics and Bioinformatics, Department of Environmental and Public Health Sciences, University of Cincinnati, Cincinnati, Ohio, USA
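As a rough illustration of the group LASSO penalty mentioned in the abstract above, the following Python sketch shows the penalty value and its standard proximal (group soft-thresholding) step. The function names, the grouping of coefficients, and the value of `lam` are assumptions made for illustration; this is a generic textbook operator, not the authors' fitting procedure.

```python
import math

def group_lasso_penalty(groups, lam):
    """lam * sum over groups of sqrt(group size) * Euclidean norm of the group."""
    return lam * sum(
        math.sqrt(len(g)) * math.sqrt(sum(b * b for b in g)) for g in groups
    )

def group_soft_threshold(group, lam):
    """Proximal step for lam * ||group||_2.

    The whole group is shrunk toward zero, and it is set exactly to zero
    when its norm is at most lam, which is what drives group-wise selection.
    """
    norm = math.sqrt(sum(b * b for b in group))
    if norm <= lam:
        return [0.0] * len(group)
    scale = 1.0 - lam / norm
    return [scale * b for b in group]

# Hypothetical coefficients: each list could hold the coefficients of the
# functional principal component scores extracted from one longitudinal feature.
strong = [3.0, 4.0]  # norm 5: kept, but shrunk
weak = [0.3, 0.4]    # norm 0.5: dropped as a group
print(group_soft_threshold(strong, 1.0))
print(group_soft_threshold(weak, 1.0))
```

Because the threshold acts on the norm of a whole group, all coefficients belonging to one feature enter or leave the model together, which is the property that makes the selected features interpretable.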
2
Chiang C, Lai YH, Huang BH, Guo WJ, Wu YJ, Chang LC, Hsiao CF. Use of a tolerance interval approach as a statistical quality control tool for traditional Chinese medicine. J Biopharm Stat 2020; 30:873-881. [PMID: 32394789 DOI: 10.1080/10543406.2020.1757693] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
Abstract
Raw materials for traditional Chinese medicine (TCM) often come from different sources, and the final product may also be made at different sites. Therefore, variability from different sources, such as site-to-site or within-site component-to-component variability, may be expected. Consequently, testing for consistency in raw materials, in-process materials, and/or the final product has become an important issue in the quality control (QC) process in TCM development. In this paper, a statistical QC process for raw materials and/or the final product of TCM is proposed based on a two-sided [Formula: see text]-content, [Formula: see text]-confidence tolerance interval. More specifically, we construct the tolerance interval for a random-effects model to assess the QC of TCM products from different regions and possibly different product batches. The products can be claimed to be consistent when the constructed tolerance interval is within the permitted range. Given the region and batch effects, sample sizes can also be calculated to ensure the desired measure of goodness. An example is presented to illustrate the proposed approach.
Affiliation(s)
- Chieh Chiang
- Institute of Population Health Sciences, National Health Research Institutes, Zhunan, Taiwan
- Yi-Hsuan Lai
- Institute of Population Health Sciences, National Health Research Institutes, Zhunan, Taiwan
- Bo-Han Huang
- Division of Biometry, Department of Agronomy, National Taiwan University, Taipei, Taiwan
- Wen-Jin Guo
- Institute of Population Health Sciences, National Health Research Institutes, Zhunan, Taiwan
- Yuh-Jenn Wu
- Department of Applied Mathematics, Chung Yuan Christian University, Chungli, Taiwan
- Lien-Cheng Chang
- Food and Drug Administration, Ministry of Health and Welfare, Taipei, Taiwan
- Chin-Fu Hsiao
- Institute of Population Health Sciences, National Health Research Institutes, Zhunan, Taiwan
3
Li H, Gatsonis C. Combining biomarker trajectories to improve diagnostic accuracy in prospective cohort studies with verification bias. Stat Med 2019; 38:1968-1990. [PMID: 30590870 DOI: 10.1002/sim.8079] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 07/28/2017] [Revised: 09/20/2018] [Accepted: 12/04/2018] [Indexed: 11/10/2022]
Abstract
In this paper, we develop methods to combine multiple biomarker trajectories into a composite diagnostic marker using functional data analysis (FDA) to achieve better diagnostic accuracy in monitoring disease recurrence in the setting of a prospective cohort study. In such studies, the disease status is usually verified only for patients with a positive test result in any biomarker and is missing in patients with negative test results in all biomarkers. Thus, the test result will affect disease verification, which leads to verification bias if the analysis is restricted only to the verified cases. We treat verification bias as a missing data problem. Under both missing at random (MAR) and missing not at random (MNAR) assumptions, we derive the optimal classification rules using the Neyman-Pearson lemma based on the composite diagnostic marker. We estimate thresholds adjusted for verification bias to dichotomize patients as test positive or test negative, and we evaluate the diagnostic accuracy using the verification bias corrected area under the ROC curves (AUCs). We evaluate the performance and robustness of the FDA combination approach and assess the consistency of the approach through simulation studies. In addition, we perform a sensitivity analysis of the dependency between the verification process and disease status for the approach under the MNAR assumption. We apply the proposed method on data from the Religious Orders Study and from a non-small cell lung cancer trial.
Affiliation(s)
- Hong Li
- Department of Public Health Science, Medical University of South Carolina, Charleston, South Carolina
4
Fu GH, Yi LZ, Pan J. Tuning model parameters in class-imbalanced learning with precision-recall curve. Biom J 2018; 61:652-664. [PMID: 30548291 DOI: 10.1002/bimj.201800148] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 05/25/2018] [Revised: 10/18/2018] [Accepted: 10/23/2018] [Indexed: 11/08/2022]
Abstract
An open issue in class-imbalanced learning is which assessment metric should be employed. So far, the precision-recall curve (PRC) is rarely used in practice compared with its alternative, the receiver operating characteristic (ROC) curve. This study investigates the performance of PRC as the evaluation criterion for class-imbalanced data and focuses on the comparison of PRC with ROC. The advantages of PRC over ROC in assessing class-imbalanced data are also investigated and tested on our proposed algorithm by tuning all model parameters in simulation studies and real data examples. The results show that PRC is competitive with ROC as a performance measure for tuning model parameters with class-imbalanced data. PRC can be considered an effective alternative assessment for preprocessing skewed data (such as variable selection) and for building a classifier in class-imbalanced learning.
Affiliation(s)
- Guang-Hui Fu
- School of Science, Kunming University of Science and Technology, Kunming, P. R. China
- Lun-Zhao Yi
- Yunnan Food Safety Research Institute, Kunming University of Science and Technology, Kunming, P. R. China
- Jianxin Pan
- School of Mathematics, The University of Manchester, Manchester, UK
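To make the PRC versus ROC comparison in the abstract above concrete, here is a minimal Python sketch that computes the empirical ROC AUC and the average precision (a common step-wise summary of the precision-recall curve) on a small imbalanced sample. The data and function names are hypothetical and chosen only for illustration; the sketch assumes no tied scores in the average-precision computation and does not reproduce the authors' algorithm.

```python
def roc_auc(scores, labels):
    """Empirical AUC: share of (positive, negative) pairs ranked correctly.

    Tied pairs count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive (no ties assumed)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos

# Hypothetical sample: 2 positives among 20 cases, ranked 3rd and 5th by score.
scores = list(range(20, 0, -1))
labels = [0, 0, 1, 0, 1] + [0] * 15

print(round(roc_auc(scores, labels), 4))            # 31/36
print(round(average_precision(scores, labels), 4))  # 11/30
```

In this toy sample the ROC AUC is about 0.86 while the average precision is about 0.37: the few false positives ranked above the positives barely move the ROC summary but substantially lower precision, which is one reason a precision-recall criterion can be informative under class imbalance.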
5
Identifying risk factors for bone mass transition states for postmenopausal osteoporosis. Eur J Integr Med 2017. [DOI: 10.1016/j.eujim.2017.08.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/21/2022]
6
Fong Y, Yin S, Huang Y. Combining biomarkers linearly and nonlinearly for classification using the area under the ROC curve. Stat Med 2016; 35:3792-809. [PMID: 27058981 DOI: 10.1002/sim.6956] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 07/31/2015] [Revised: 01/13/2016] [Accepted: 03/08/2016] [Indexed: 11/05/2022]
Abstract
In biomedical studies, it is often of interest to classify/predict a subject's disease status based on a variety of biomarker measurements. A commonly used classification criterion is based on area under the receiver operating characteristic curve (AUC). Many methods have been proposed to optimize approximated empirical AUC criteria, but there are two limitations to the existing methods. First, most methods are only designed to find the best linear combination of biomarkers, which may not perform well when there is strong nonlinearity in the data. Second, many existing linear combination methods use gradient-based algorithms to find the best marker combination, which often result in suboptimal local solutions. In this paper, we address these two problems by proposing a new kernel-based AUC optimization method called ramp AUC (RAUC). This method approximates the empirical AUC loss function with a ramp function and finds the best combination by a difference of convex functions algorithm. We show that as a linear combination method, RAUC leads to a consistent and asymptotically normal estimator of the linear marker combination when the data are generated from a semiparametric generalized linear model, just as the smoothed AUC method. Through simulation studies and real data examples, we demonstrate that RAUC outperforms smooth AUC in finding the best linear marker combinations, and can successfully capture nonlinear pattern in the data to achieve better classification performance. We illustrate our method with a dataset from a recent HIV vaccine trial. Copyright © 2016 John Wiley & Sons, Ltd.
Affiliation(s)
- Youyi Fong
- Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue N., M2-B500, Seattle, WA 98109, U.S.A.; Department of Biostatistics, University of Washington, Seattle, WA 98195, U.S.A.
- Shuxin Yin
- Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue N., M2-B500, Seattle, WA 98109, U.S.A.
- Ying Huang
- Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue N., M2-B500, Seattle, WA 98109, U.S.A.; Department of Biostatistics, University of Washington, Seattle, WA 98195, U.S.A.
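The idea in the abstract above of approximating the empirical AUC loss with a ramp function, and of writing that ramp as a difference of convex functions, can be sketched in Python as follows. The specific ramp `min(1, max(0, 1 - u))`, the linear score, and all names and data are assumptions for illustration; the paper's exact parameterization and its kernel-based optimizer are not reproduced here.

```python
def zero_one(u):
    """Pairwise AUC loss: 1 when a positive does not outscore a negative."""
    return 1.0 if u <= 0 else 0.0

def hinge(u):
    """Convex hinge function max(0, 1 - u)."""
    return max(0.0, 1.0 - u)

def ramp(u):
    """Ramp loss: the hinge capped at 1, a bounded upper bound on zero_one."""
    return min(1.0, hinge(u))

def ramp_as_dc(u):
    """Same ramp written as a difference of two convex functions."""
    return hinge(u) - max(0.0, -u)

def pairwise_loss(w, x_pos, x_neg, loss):
    """Average loss over all (positive, negative) pairs for a linear score w . x."""
    total = 0.0
    for xp in x_pos:
        for xn in x_neg:
            u = sum(wi * (a - b) for wi, a, b in zip(w, xp, xn))
            total += loss(u)
    return total / (len(x_pos) * len(x_neg))

# Hypothetical two-marker data; only the first marker is used by w.
x_pos = [[2.0, 0.0], [0.5, 0.0]]
x_neg = [[0.0, 0.0], [1.0, 0.0]]
w = [1.0, 0.0]

print(pairwise_loss(w, x_pos, x_neg, zero_one))  # 1 - empirical AUC
print(pairwise_loss(w, x_pos, x_neg, ramp))      # ramp surrogate
```

Writing the ramp as `hinge(u) - max(0, -u)` is what allows a difference-of-convex-functions algorithm to be applied: each iteration linearizes the subtracted convex part and solves a convex subproblem.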
7
Shu B, Shi Q, Wang YJ. Shen (Kidney)-tonifying principle for primary osteoporosis: to treat both the disease and the Chinese medicine syndrome. Chin J Integr Med 2015; 21:656-61. [DOI: 10.1007/s11655-015-2306-z] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 07/20/2015] [Indexed: 10/23/2022]
8
Guo P, Zeng F, Hu X, Zhang D, Zhu S, Deng Y, Hao Y. Improved Variable Selection Algorithm Using a LASSO-Type Penalty, with an Application to Assessing Hepatitis B Infection Relevant Factors in Community Residents. PLoS One 2015. [PMID: 26214802 PMCID: PMC4516242 DOI: 10.1371/journal.pone.0134151] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Indexed: 11/18/2022]
Abstract
OBJECTIVES: In epidemiological studies, it is important to identify independent associations between collective exposures and a health outcome. The current stepwise selection technique ignores stochastic errors and suffers from a lack of stability. The alternative LASSO-penalized regression model can be applied to detect significant predictors from a pool of candidate variables. However, this technique is prone to false positives and tends to create excessive biases. It remains challenging to develop robust variable selection methods and enhance predictability.
MATERIAL AND METHODS: Two improved algorithms denoted the two-stage hybrid and bootstrap ranking procedures, both using a LASSO-type penalty, were developed for epidemiological association analysis. The performance of the proposed procedures and other methods including conventional LASSO, Bolasso, stepwise and stability selection models were evaluated using intensive simulation. In addition, methods were compared by using an empirical analysis based on large-scale survey data of hepatitis B infection-relevant factors among Guangdong residents.
RESULTS: The proposed procedures produced comparable or less biased selection results when compared to conventional variable selection models. In total, the two newly proposed procedures were stable with respect to various scenarios of simulation, demonstrating a higher power and a lower false positive rate during variable selection than the compared methods. In empirical analysis, the proposed procedures yielding a sparse set of hepatitis B infection-relevant factors gave the best predictive performance and showed that the procedures were able to select a more stringent set of factors. The individual history of hepatitis B vaccination, family and individual history of hepatitis B infection were associated with hepatitis B infection in the studied residents according to the proposed procedures.
CONCLUSIONS: The newly proposed procedures improve the identification of significant variables and enable us to derive a new insight into epidemiological association analysis.
Affiliation(s)
- Pi Guo
- Department of Medical Statistics and Epidemiology and Health Information Research Center, School of Public Health, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
- Laboratory of Health Informatics, Guangdong Key Laboratory of Medicine, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
- Fangfang Zeng
- Department of Medical Statistics and Epidemiology and Health Information Research Center, School of Public Health, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
- Laboratory of Health Informatics, Guangdong Key Laboratory of Medicine, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
- Xiaomin Hu
- Department of Medical Statistics and Epidemiology and Health Information Research Center, School of Public Health, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
- Laboratory of Health Informatics, Guangdong Key Laboratory of Medicine, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
- Dingmei Zhang
- Department of Medical Statistics and Epidemiology and Health Information Research Center, School of Public Health, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
- Laboratory of Health Informatics, Guangdong Key Laboratory of Medicine, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
- Shuming Zhu
- Department of Medical Statistics and Epidemiology and Health Information Research Center, School of Public Health, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
- Laboratory of Health Informatics, Guangdong Key Laboratory of Medicine, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
- Yu Deng
- Department of Medical Statistics and Epidemiology and Health Information Research Center, School of Public Health, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
- Laboratory of Health Informatics, Guangdong Key Laboratory of Medicine, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
- Yuantao Hao
- Department of Medical Statistics and Epidemiology and Health Information Research Center, School of Public Health, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
- Laboratory of Health Informatics, Guangdong Key Laboratory of Medicine, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
9
Li Y, Qin Y, Wang L, Chen J, Ma S. Grouped Variable Selection Using Area under the ROC with Imbalanced Data. Commun Stat Simul Comput 2014. [DOI: 10.1080/03610918.2013.818691] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 10/24/2022]
10
Hsu MJ, Chang YCI, Hsueh HM. Biomarker selection for medical diagnosis using the partial area under the ROC curve. BMC Res Notes 2014; 7:25. [PMID: 24410929 PMCID: PMC3923449 DOI: 10.1186/1756-0500-7-25] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 09/25/2013] [Accepted: 12/23/2013] [Indexed: 11/17/2022]
Abstract
Background: A biomarker is usually used as a diagnostic or assessment tool in medical research. Finding an ideal biomarker is not easy, and combining multiple biomarkers provides a promising alternative. Moreover, some biomarkers in the optimal linear combination do not have enough discriminatory power. The aim of this study was therefore to find the significant biomarkers based on the optimal linear combination maximizing the pAUC. Methods: Under the binormality assumption, we obtain the optimal linear combination of biomarkers maximizing the partial area under the receiver operating characteristic curve (pAUC). Related statistical tests are developed for assessment of a biomarker set and of an individual biomarker. Stepwise biomarker selection procedures are introduced to identify the statistically significant biomarkers. Results: A simulation study and three real examples (Duchenne muscular dystrophy, heart disease, and breast tissue) show that our methods are most suitable for biomarker selection in data sets with a moderate number of biomarkers. Conclusions: Our proposed biomarker selection approaches can be used to find the significant biomarkers based on hypothesis testing.
Affiliation(s)
- Man-Jen Hsu
- Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan.
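For readers unfamiliar with the pAUC used in the abstract above, the following Python sketch computes a nonparametric partial AUC over the false positive rate range [0, t0] for a single marker or combined score. It is a hedged illustration only: the function name and data are hypothetical, scores are assumed to have no ties, `t0 * len(neg)` is assumed to be a whole number, and the binormal-model estimator and tests developed in the paper are not reproduced.

```python
def partial_auc(pos, neg, t0):
    """Empirical partial AUC over false positive rates in [0, t0].

    Only the k = t0 * len(neg) highest-scoring negatives contribute, since
    they are the ones that become false positives first as the threshold drops.
    """
    k = round(t0 * len(neg))
    top_neg = sorted(neg, reverse=True)[:k]
    wins = sum(1 for p in pos for n in top_neg if p > n)
    return wins / (len(pos) * len(neg))

# Hypothetical scores for 3 diseased and 10 non-diseased subjects.
pos = [0.9, 0.7, 0.4]
neg = [0.8, 0.6, 0.5, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01, 0.0]

pauc = partial_auc(pos, neg, 0.2)
print(pauc)        # raw pAUC, at most t0
print(pauc / 0.2)  # rescaled to [0, 1] by dividing by t0
```

Setting `t0 = 1.0` recovers the ordinary empirical AUC, so the partial version can be read as the same pairwise comparison restricted to the clinically relevant low false positive region.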
11
Zhao XG, Dai W, Li Y, Tian L. AUC-based biomarker ensemble with an application on gene scores predicting low bone mineral density. Bioinformatics 2011; 27:3050-5. [PMID: 21908541 DOI: 10.1093/bioinformatics/btr516] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Indexed: 12/11/2022]
Abstract
MOTIVATION: The area under the receiver operating characteristic (ROC) curve (AUC), long regarded as a 'golden' measure for the predictiveness of a continuous score, has propelled the need to develop AUC-based predictors. However, AUC-based ensemble methods are rather scant, largely because the associated objective function is neither continuous nor concave. Indeed, there is no reliable numerical algorithm for identifying the optimal combination of a set of biomarkers to maximize the AUC, especially when the number of biomarkers is large. RESULTS: We have proposed a novel AUC-based statistical ensemble method for combining multiple biomarkers to differentiate a binary response of interest. Specifically, we propose to replace the non-continuous and non-convex AUC objective function by a convex surrogate loss function, whose minimizer can be efficiently identified. With the established framework, the lasso and other regularization techniques enable feature selection. Extensive simulations have demonstrated the superiority of the new methods to the existing methods. The proposal has been applied to a gene expression dataset to construct gene expression scores that differentiate elderly women with low bone mineral density (BMD) from those with normal BMD. The AUCs of the resulting scores in the independent test dataset have been satisfactory. CONCLUSION: Aiming to maximize the AUC directly, the proposed AUC-based ensemble method provides an efficient means of generating a stable combination of multiple biomarkers, which is especially useful in high-dimensional settings. CONTACT: lutian@stanford.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- X G Zhao
- Department of Bone and Joint Surgery, The First Affiliated Hospital of Xi'an Medical University, Xi'an 710077, Shaanxi Province, PR China