1
Huang X, Tian L, Sun Y, Chatterjee S, Devanarayan V. Predictive signature development based on maximizing the area between receiver operating characteristic curves. Stat Med 2022; 41:5242-5257. [PMID: 36053782 PMCID: PMC10681287 DOI: 10.1002/sim.9565] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/02/2021] [Revised: 08/16/2022] [Accepted: 08/18/2022] [Indexed: 11/09/2022]
Abstract
Development of marker signatures to predict treatment benefits for a new therapeutic is an important scientific component in advancing drug discovery and an important first step toward the goal of precision medicine. In this article, we focus on developing an algorithm to search for the optimal linear combination of markers that maximizes the area between the two receiver operating characteristic curves of the new therapeutic and control groups, without assuming any parametric model. We further generalize the proposed algorithm for predictive signature development to maximize the difference in Harrell's C-index between the new therapeutic and control groups when the outcome of interest is time-to-event. The performance of the proposed method is evaluated and compared with existing methods using simulations and real clinical trial data.
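The objective described in this abstract can be illustrated with a minimal sketch. It assumes that the signed area between the two ROC curves is taken as the difference of the empirical (Mann-Whitney) AUCs of the two arms for a linear score; the function names, the toy data, and the 0/1 coding of outcome and treatment are hypothetical and are not taken from the paper.

```python
def empirical_auc(scores, labels):
    """Mann-Whitney AUC: P(score_pos > score_neg) + 0.5 * P(tie)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def area_between_roc_curves(weights, markers, outcome, treated):
    """AUC in the treated arm minus AUC in the control arm (assumed objective)."""
    # Linear combination of markers for every subject.
    scores = [sum(w * x for w, x in zip(weights, row)) for row in markers]

    def arm_auc(flag):
        idx = [i for i, t in enumerate(treated) if t == flag]
        return empirical_auc([scores[i] for i in idx], [outcome[i] for i in idx])

    return arm_auc(1) - arm_auc(0)


# Toy data: the first marker ranks responders well only in the treated arm.
markers = [[1, 0], [2, 0], [3, 0], [4, 0], [1, 0], [2, 0], [3, 0], [4, 0]]
outcome = [0, 0, 1, 1, 1, 0, 1, 0]
treated = [1, 1, 1, 1, 0, 0, 0, 0]
abc = area_between_roc_curves([1.0, 0.0], markers, outcome, treated)
```

A search over `weights` (for example a grid or a derivative-free optimizer) would then look for the combination with the largest value; the search strategy used in the paper is not described in the abstract.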
Affiliation(s)
- Xin Huang
  - Data and Statistical Sciences, AbbVie Inc, North Chicago, Illinois, USA
- Lu Tian
  - School of Medicine, Stanford University, Stanford, California, USA
- Yan Sun
  - Data and Statistical Sciences, AbbVie Inc, North Chicago, Illinois, USA
2
Alakus TB, Turkoglu I. Prediction of viral-host interactions of COVID-19 by computational methods. Chemometrics and Intelligent Laboratory Systems: An International Journal Sponsored by the Chemometrics Society 2022; 228:104622. [PMID: 35879939 PMCID: PMC9301933 DOI: 10.1016/j.chemolab.2022.104622] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/07/2022] [Revised: 06/20/2022] [Accepted: 07/10/2022] [Indexed: 06/15/2023]
Abstract
Experimental approaches are currently used to determine viral-host interactions, but they are both time-consuming and costly, so computational approaches are recommended. In this study, computational approaches were used to predict viral-host interactions between SARS-CoV-2 and human proteins. The study consists of four stages. In the first stage, viral and host protein sequences were obtained. In the second stage, the protein sequences were converted into numerical representations by various protein mapping methods: entropy-based, AVL-tree, FIBHASH, binary encoding, CPNR, PAM250, BLOSUM62, Atchley factors, Meiler parameters, EIIP, AESNN1, Miyazawa energies, Micheletti potentials, Z-scale, and hydrophobicity. In the third stage, a deep learning model based on BiLSTM was designed. In the last stage, the protein sequences were classified and the viral-host interactions were predicted. The performance of the protein mapping methods was measured by accuracy, F1-score, specificity, sensitivity, and AUC. The best classification result was obtained with the entropy-based method, which reached 94.74% accuracy and an AUC of 0.95. The next best result was obtained with Z-scale, which reached 91.23% accuracy and an AUC of 0.96. Although the other protein mapping methods were not as effective as the Z-scale and entropy-based methods, they still classified successfully. The AVL-tree, FIBHASH, binary encoding, CPNR, PAM250, BLOSUM62, Atchley factors, Meiler parameters, and AESNN1 methods exceeded 80% in accuracy, F1-score, and AUC. The accuracy of the EIIP, Miyazawa energies, Micheletti potentials, and hydrophobicity methods remained below 80%. Overall, the results indicate that computational approaches were successful in predicting viral-host interactions between SARS-CoV-2 and human proteins.
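The second stage, mapping a protein sequence to numbers, can be illustrated with a small sketch. The abstract does not define its "binary encoding", so the one-hot scheme below is an assumption chosen for illustration; the residue ordering and the function name are hypothetical.

```python
# The 20 standard amino acids in one-letter code (ordering is arbitrary here).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_encode(sequence):
    """Map each residue to a 20-dimensional 0/1 vector.

    Residues outside the standard alphabet (for example 'X') are kept as
    all-zero rows so that the sequence length is preserved.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        if residue in index:
            row[index[residue]] = 1
        encoded.append(row)
    return encoded


encoded = one_hot_encode("ACX")
```

The resulting list of per-residue vectors is the kind of numerical sequence that a recurrent model such as a BiLSTM could consume after padding to a common length.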
Affiliation(s)
- Talha Burak Alakus
  - Kirklareli University, Department of Software Engineering, Kirklareli, 39000, Turkey
- Ibrahim Turkoglu
  - Firat University, Department of Software Engineering, Elazig, 23119, Turkey
3
Li Y, Hsu W. A classification for complex imbalanced data in disease screening and early diagnosis. Stat Med 2022; 41:3679-3695. [PMID: 35603639 PMCID: PMC9541048 DOI: 10.1002/sim.9442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2021] [Revised: 04/11/2022] [Accepted: 05/10/2022] [Indexed: 11/09/2022]
Abstract
Imbalanced classification has drawn considerable attention in the statistics and machine learning literature. Traditional classification methods often perform poorly when the class distribution is severely skewed, and more so under a high-dimensional longitudinal data structure. Given the ubiquity of big data in modern health research, imbalanced classification in disease diagnosis is expected to encounter an additional level of difficulty imposed by such a complex data structure. In this article, we propose a nonparametric classification approach for imbalanced data in longitudinal and high-dimensional settings. Technically, functional principal component analysis is first applied for feature extraction under the longitudinal structure. The univariate exponential loss function coupled with a group LASSO penalty is then adopted into the classification procedure in high-dimensional settings. Along with a good improvement in imbalanced classification, our approach provides meaningful feature selection for interpretation while enjoying a remarkably lower computational complexity. The proposed method is illustrated with a real-data application to early detection of Alzheimer's disease, and its finite-sample empirical performance is extensively evaluated by simulations.
Affiliation(s)
- Yiming Li
  - Department of Statistics, Kansas State University, Manhattan, Kansas, USA
- Wei-Wen Hsu
  - Division of Biostatistics and Bioinformatics, Department of Environmental and Public Health Sciences, University of Cincinnati, Cincinnati, Ohio, USA
4
Li XY, Xiang J, Wu FX, Li M. NetAUC: A network-based multi-biomarker identification method by AUC optimization. Methods 2021; 198:56-64. [PMID: 34364986 DOI: 10.1016/j.ymeth.2021.08.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/15/2021] [Revised: 07/08/2021] [Accepted: 08/03/2021] [Indexed: 10/20/2022]
Abstract
Complex diseases are caused by a variety of factors, and their diagnosis, treatment, and prognosis are usually difficult. Proteins play an indispensable role in living organisms and perform specific biological functions by interacting with other proteins or biomolecules. Because their dysfunction may lead to disease, it is natural to mine disease-related biomarkers from the protein-protein interaction network. AUC, the area under the receiver operating characteristic (ROC) curve, is regarded as a gold standard for evaluating the effectiveness of a binary classifier, since it measures the classification ability of an algorithm under an arbitrary class distribution or misclassification cost. In this study, we propose a network-based multi-biomarker identification method by AUC optimization (NetAUC), which integrates gene expression and network information to identify biomarkers for complex disease analysis. The main purpose is to optimize two objectives simultaneously: maximizing AUC and minimizing the number of selected features. We applied NetAUC to two types of disease analysis: 1) prognosis of breast cancer and 2) classification of similar diseases. The results show that NetAUC can identify a small panel of disease-related biomarkers with strong classification ability and functional interpretability.
Affiliation(s)
- Xing-Yi Li
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Ju Xiang
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China; Neuroscience Research Center & Department of Basic Medical Sciences, Changsha Medical University, Changsha, 410219, Hunan, China
- Fang-Xiang Wu
  - Department of Mechanical Engineering and Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada
- Min Li
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
5
Sun P, Lu Q, Li Z, Qin N, Jiang Y, Ma H, Jin G, Yu H, Dai J. Assessment of prognostic prediction models for gastric cancer using genomic and transcriptomic profiles. Meta Gene 2021. [DOI: 10.1016/j.mgene.2021.100890] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
6
Li H, Gatsonis C. Combining biomarker trajectories to improve diagnostic accuracy in prospective cohort studies with verification bias. Stat Med 2019; 38:1968-1990. [PMID: 30590870 DOI: 10.1002/sim.8079] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 07/28/2017] [Revised: 09/20/2018] [Accepted: 12/04/2018] [Indexed: 11/10/2022]
Abstract
In this paper, we develop methods to combine multiple biomarker trajectories into a composite diagnostic marker using functional data analysis (FDA) to achieve better diagnostic accuracy in monitoring disease recurrence in the setting of a prospective cohort study. In such studies, the disease status is usually verified only for patients with a positive test result in any biomarker and is missing in patients with negative test results in all biomarkers. Thus, the test result will affect disease verification, which leads to verification bias if the analysis is restricted only to the verified cases. We treat verification bias as a missing data problem. Under both missing at random (MAR) and missing not at random (MNAR) assumptions, we derive the optimal classification rules using the Neyman-Pearson lemma based on the composite diagnostic marker. We estimate thresholds adjusted for verification bias to dichotomize patients as test positive or test negative, and we evaluate the diagnostic accuracy using the verification bias corrected area under the ROC curves (AUCs). We evaluate the performance and robustness of the FDA combination approach and assess the consistency of the approach through simulation studies. In addition, we perform a sensitivity analysis of the dependency between the verification process and disease status for the approach under the MNAR assumption. We apply the proposed method on data from the Religious Orders Study and from a non-small cell lung cancer trial.
Affiliation(s)
- Hong Li
  - Department of Public Health Science, Medical University of South Carolina, Charleston, South Carolina
7
Zhu L, Zhang HB, Huang DS. LMMO: A Large Margin Approach for Refining Regulatory Motifs. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:913-925. [PMID: 28391205 DOI: 10.1109/tcbb.2017.2691325] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 06/07/2023]
Abstract
Although discriminative motif discovery (DMD) methods are promising for eliciting motifs from high-throughput experimental data, they usually have to sacrifice accuracy and may fail to fully leverage the potential of large datasets. Recently, it has been demonstrated that the motifs identified by DMDs can be significantly improved by maximizing the area under the receiver-operating characteristic curve (AUC), a metric that has been widely used in the literature to rank the performance of elicited motifs. However, existing approaches for motif refinement choose to directly maximize the non-convex and discontinuous AUC itself, which is known to be difficult and may lead to suboptimal solutions. In this paper, we propose the Large Margin Motif Optimizer (LMMO), a large-margin-type algorithm for refining regulatory motifs. By relaxing the AUC cost function with the surrogate convex hinge loss, we show that the resultant learning problem can be cast as an instance of difference-of-convex (DC) programs, and we solve it iteratively using the constrained concave-convex procedure (CCCP). To further save computational time, we combine LMMO with existing techniques for improving the scalability of large-margin-type algorithms, such as the cutting plane method. Experimental evaluations on synthetic and real data illustrate the performance of the proposed approach. The code of LMMO is freely available at: https://github.com/ekffar/LMMO.
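The idea of replacing the discontinuous AUC with a hinge surrogate can be sketched generically. The code below is a plain pairwise hinge loss over positive-negative score pairs, shown next to the empirical AUC error it upper-bounds when the margin is 1; it is not the LMMO formulation (no motif model, DC program, or CCCP step), and the function names are hypothetical.

```python
def pairwise_hinge_loss(pos_scores, neg_scores, margin=1.0):
    """Mean hinge penalty over all (positive, negative) score pairs.

    A pair contributes zero only when the positive outscores the negative
    by at least `margin`, which makes the loss convex in the scores.
    """
    total = sum(max(0.0, margin - (p - n)) for p in pos_scores for n in neg_scores)
    return total / (len(pos_scores) * len(neg_scores))


def auc_error(pos_scores, neg_scores):
    """1 - AUC: fraction of misordered pairs, counting ties as one half."""
    errors = sum((p < n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return errors / (len(pos_scores) * len(neg_scores))


pos = [2.0, 0.5]
neg = [1.0, 0.0]
hinge = pairwise_hinge_loss(pos, neg)
error = auc_error(pos, neg)
```

With the default margin of 1, every misordered or tied pair incurs a hinge penalty of at least 1, so the surrogate never falls below the AUC error; minimizing it pushes positives above negatives without having to optimize a step function.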
8
Zou M, Liu Z, Zhang XS, Wang Y. NCC-AUC: an AUC optimization method to identify multi-biomarker panel for cancer prognosis from genomic and clinical data. Bioinformatics 2015; 31:3330-8. [PMID: 26092859 DOI: 10.1093/bioinformatics/btv374] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 03/09/2015] [Accepted: 06/14/2015] [Indexed: 12/26/2022]
Abstract
MOTIVATION: In prognosis and survival studies, an important goal is to identify multi-biomarker panels with predictive power using molecular characteristics or clinical observations. Such analysis is often challenged by censored, small-sample-size, but high-dimensional genomic profiles or clinical data. Therefore, sophisticated models and algorithms are urgently needed.
RESULTS: In this study, we propose a novel Area Under Curve (AUC) optimization method for multi-biomarker panel identification named Nearest Centroid Classifier for AUC optimization (NCC-AUC). Our method is motivated by the connection between the AUC score used to evaluate classification accuracy and Harrell's concordance index in survival analysis. This connection allows us to convert the survival time regression problem into a binary classification problem. An optimization model is then formulated to directly maximize AUC while minimizing the number of selected features, constructing a predictor in the nearest centroid classifier framework. NCC-AUC performs well in validation on both genomic data of breast cancer and clinical data of stage IB non-small-cell lung cancer (NSCLC). For the genomic data, NCC-AUC outperforms Support Vector Machine (SVM) and Support Vector Machine-based Recursive Feature Elimination (SVM-RFE) in classification accuracy. It tends to select a multi-biomarker panel with low average redundancy and enriched biological meaning. NCC-AUC also separates low- and high-risk cohorts more significantly than the widely used Cox model (Cox proportional-hazards regression model) and the L1-Cox model (L1-penalized Cox model). These performance gains of NCC-AUC are quite robust across 5 subtypes of breast cancer. Furthermore, on an independent clinical dataset, NCC-AUC outperforms SVM and SVM-RFE in predictive accuracy and is consistently better than the Cox model and the L1-Cox model in grouping patients into high- and low-risk categories.
CONCLUSION: In summary, NCC-AUC provides a rigorous optimization framework to systematically reveal multi-biomarker panels from genomic and clinical data. It can serve as a useful tool to identify prognostic biomarkers for survival analysis.
AVAILABILITY AND IMPLEMENTATION: NCC-AUC is available at http://doc.aporc.org/wiki/NCC-AUC.
CONTACT: ywang@amss.ac.cn
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
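The connection between AUC and Harrell's concordance index mentioned in the abstract can be made concrete with a short sketch of the standard pairwise definition of the C-index. This is not the NCC-AUC implementation; the function name, the 1 = event / 0 = censored coding, and the toy data are assumptions for illustration.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable pairs ordered correctly by the risk score.

    A pair (i, j) is comparable when subject i has an observed event
    (events[i] == 1) strictly before subject j's follow-up time. The pair is
    concordant when subject i also has the higher risk score; tied scores
    count one half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


times = [1, 2, 3, 4]
events = [1, 1, 0, 1]  # the third subject is censored
c_perfect = harrell_c_index(times, events, [4, 3, 2, 1])
c_mixed = harrell_c_index(times, events, [4, 1, 2, 3])
```

Each comparable pair behaves like a (positive, negative) pair in a binary classification problem, with the earlier event as the positive, which is the sense in which the C-index generalizes AUC to censored survival data.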
Affiliation(s)
- Meng Zou
  - Academy of Mathematics and Systems Science, National Center for Mathematics and Interdisciplinary Sciences, Chinese Academy of Sciences, Beijing 10080, China
- Zhaoqi Liu
  - Academy of Mathematics and Systems Science, National Center for Mathematics and Interdisciplinary Sciences, Chinese Academy of Sciences, Beijing 10080, China
- Xiang-Sun Zhang
  - Academy of Mathematics and Systems Science, National Center for Mathematics and Interdisciplinary Sciences, Chinese Academy of Sciences, Beijing 10080, China
- Yong Wang
  - Academy of Mathematics and Systems Science, National Center for Mathematics and Interdisciplinary Sciences, Chinese Academy of Sciences, Beijing 10080, China
9
Choi S, Park J. Nonparametric additive model with grouped lasso and maximizing area under the ROC curve. Comput Stat Data Anal 2014. [DOI: 10.1016/j.csda.2014.03.010] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
10
Kang C, Janes H, Huang Y. Combining biomarkers to optimize patient treatment recommendations. Biometrics 2014; 70:695-707. [PMID: 24889663 DOI: 10.1111/biom.12191] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.8] [Received: 03/01/2013] [Revised: 05/01/2013] [Accepted: 12/01/2013] [Indexed: 12/12/2022]
Abstract
Markers that predict treatment effect have the potential to improve patient outcomes. For example, the OncotypeDX® RecurrenceScore® has some ability to predict the benefit of adjuvant chemotherapy over and above hormone therapy for the treatment of estrogen-receptor-positive breast cancer, facilitating the provision of chemotherapy to women most likely to benefit from it. Given that the score was originally developed for predicting outcome given hormone therapy alone, it is of interest to develop alternative combinations of the genes comprising the score that are optimized for treatment selection. However, most methodology for combining markers is useful when predicting outcome under a single treatment. We propose a method for combining markers for treatment selection which requires modeling the treatment effect as a function of markers. Multiple models of treatment effect are fit iteratively by upweighting or "boosting" subjects potentially misclassified according to treatment benefit at the previous stage. The boosting approach is compared to existing methods in a simulation study based on the change in expected outcome under marker-based treatment. The approach improves upon methods in some settings and has comparable performance in others. Our simulation study also provides insights as to the relative merits of the existing methods. Application of the boosting approach to the breast cancer data, using scaled versions of the original markers, produces marker combinations that may have improved performance for treatment selection.
Affiliation(s)
- Chaeryon Kang
  - Vaccine and Infectious Disease Division and Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, U.S.A
- Holly Janes
  - Vaccine and Infectious Disease Division and Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, U.S.A
- Ying Huang
  - Vaccine and Infectious Disease Division and Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, U.S.A
11
Parast L, Cai B, Bedayat A, Kumamaru KK, George E, Dill KE, Rybicki FJ. Statistical methods for predicting mortality in patients diagnosed with acute pulmonary embolism. Acad Radiol 2012; 19:1465-73. [PMID: 23122566 DOI: 10.1016/j.acra.2012.09.008] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 09/04/2012] [Revised: 09/17/2012] [Accepted: 09/18/2012] [Indexed: 12/20/2022]
Abstract
RATIONALE AND OBJECTIVES: Risk stratification in pulmonary embolism (PE) guides patient management. The purpose of this study was to develop and test novel mortality risk prediction models for subjects with acute PE diagnosed using computed tomographic pulmonary angiography in a large cohort with comprehensive clinical data.
MATERIALS AND METHODS: Retrospective analyses of 1596 consecutive subjects diagnosed with acute PE from a single, large, urban teaching hospital included two modern statistical methods to predict survival in patients with acute PE. Landmark analysis was used for 90-day mortality. Adaptive least absolute shrinkage and selection operator (aLASSO), a penalization method, was used to select variables important for prediction and to estimate model coefficients. Receiver-operating characteristic analysis was used to evaluate the resulting prediction rules.
RESULTS: Using 30-day all-cause mortality outcome, three of the 16 clinical risk factors (the presence of a known malignancy, coronary artery disease, and increased age) were associated with high risk, while subjects treated with anticoagulation had lower risk. For 90-day landmark mortality, subjects with recent operations had a lower risk for death. Both prediction rules developed using aLASSO performed well compared to standard logistic regression.
CONCLUSIONS: The aLASSO regression approach combined with landmark analysis provides a novel tool for large patient populations and can be applied for clinical risk stratification among subjects diagnosed with acute PE. After positive results on computed tomographic pulmonary angiography, the presence of a known malignancy, coronary artery disease, and advanced age increase 30-day mortality. Additional risk stratification can be simplified with these methods, and future work will place imaging-based prediction of mortality in perspective with other clinical data.
Affiliation(s)
- Layla Parast
  - Department of Biostatistics, Harvard University, Cambridge, MA, USA