1
de Jong VMT, Hoogland J, Moons KGM, Riley RD, Nguyen TL, Debray TPA. Propensity-based standardization to enhance the validation and interpretation of prediction model discrimination for a target population. Stat Med 2023; 42:3508-3528. PMID: 37311563; DOI: 10.1002/sim.9817.
Abstract
External validation of the discriminative ability of prediction models is of key importance. However, the interpretation of such evaluations is challenging, as the ability to discriminate depends on both the sample characteristics (i.e., case-mix) and the generalizability of predictor coefficients, but most discrimination indices do not provide any insight into their respective contributions. To disentangle differences in discriminative ability across external validation samples due to a lack of model generalizability from differences in sample characteristics, we propose propensity-weighted measures of discrimination. These weighted metrics, which are derived from propensity scores for sample membership, are standardized for case-mix differences between the model development and validation samples, allowing for a fair comparison of discriminative ability in terms of model characteristics in a target population of interest. We illustrate our methods with the validation of eight prediction models for deep vein thrombosis in 12 external validation data sets and assess our methods in a simulation study. In the illustrative example, propensity score standardization reduced between-study heterogeneity of discrimination, indicating that between-study variability was partially attributable to case-mix. The simulation study showed that only flexible propensity-score methods (allowing for non-linear effects) produced unbiased estimates of model discrimination in the target population, and only when the positivity assumption was met. Propensity score-based standardization may facilitate the interpretation of (heterogeneity in) the discriminative ability of a prediction model as observed across multiple studies, and may guide model updating strategies for a particular target population. Careful propensity score modeling with attention to non-linear relations is recommended.
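The core quantity in this abstract, a discrimination index standardized by propensity-based weights, can be sketched in a few lines. The function below is our illustration, not the authors' code: it assumes membership weights have already been derived from a propensity model for sample membership, and computes a weight-standardized concordance (c) statistic over all (case, non-case) pairs.

```python
def weighted_cstat(risk, outcome, weight):
    """Weight-standardized concordance: probability that a randomly chosen
    case outranks a randomly chosen non-case, with each (case, non-case)
    pair weighted by the product of the two membership weights.
    With all weights equal to 1 this is the ordinary c-statistic."""
    num = den = 0.0
    for ri, yi, wi in zip(risk, outcome, weight):
        if yi != 1:
            continue
        for rj, yj, wj in zip(risk, outcome, weight):
            if yj != 0:
                continue
            w = wi * wj
            den += w
            if ri > rj:
                num += w
            elif ri == rj:
                num += 0.5 * w  # ties count half, as usual
    return num / den
```

Reweighting pairs rather than individuals is one simple way to standardize a rank-based index; the paper's exact weighting scheme may differ.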
Affiliation(s)
- Valentijn M T de Jong
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- Data Analytics and Methods Task Force, European Medicines Agency, Amsterdam, The Netherlands
- Jeroen Hoogland
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam, The Netherlands
- Karel G M Moons
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- Richard D Riley
- Institute of Applied Health Research, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
- Tri-Long Nguyen
- Section of Epidemiology, Department of Public Health, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Thomas P A Debray
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- Smart Data Analysis and Statistics, Utrecht, The Netherlands
2
Wang J, Zhao Y, Tang LL. Estimating the AUC with a graphical lasso method for high-dimensional biomarkers with LOD. Biostatistics & Epidemiology 2021; 5:189-206. PMID: 35415380; PMCID: PMC9000202; DOI: 10.1080/24709360.2021.1898731.
Abstract
In this manuscript, we estimate the area under the receiver operating characteristic curve (AUC) of combined biomarkers in a high-dimensional setting. We propose a penalization approach to the inference of precision matrices in the presence of a limit of detection. A new version of the expectation-maximization algorithm is then proposed for the penalized likelihood, using numerical integration and the graphical lasso method. The estimated precision matrix is then applied to the inference of AUCs. The proposed method outperforms existing methods in numerical studies. We apply it to a data set from a brain tumor study; the results show higher accuracy in AUC estimation than existing methods.
Affiliation(s)
- Jirui Wang
- Department of Statistics, George Mason University
3
Chen YM, Zu XP, Li D. Identification of proteins of tobacco mosaic virus by using a method of feature extraction. Front Genet 2020; 11:569100. PMID: 33193664; PMCID: PMC7581905; DOI: 10.3389/fgene.2020.569100.
Abstract
Tobacco mosaic virus (TMV) is widely distributed in the global tobacco industry and has a significant impact on tobacco production: it can reduce yields by 50–70%. In this study, we aimed to distinguish tobacco mosaic virus proteins from healthy tobacco leaf proteins using machine learning approaches. Experimental results showed that the support vector machine (SVM) algorithm achieved high accuracy across different feature extraction methods, and the 188-dimensional feature extraction method further improved classification accuracy; accordingly, the SVM algorithm and the 188-dimensional features were selected for the final experiments. Under 10-fold cross-validation, the SVM combined with the 188-dimensional features achieved 93.5% accuracy on the training set and 92.7% accuracy on the independent validation set. The evaluation metrics indicate that the proposed method is valid and robust.
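The 10-fold cross-validation used above is a generic procedure; as a reminder of the mechanics, here is a minimal fold-construction sketch (ours, not the authors' pipeline), which shuffles sample indices and deals them into near-equal folds:

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Split sample indices 0..n-1 into k shuffled, near-equal folds.
    Each fold serves once as the held-out set while the rest train."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed for reproducibility
    return [idx[i::k] for i in range(k)]
```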
Affiliation(s)
- Yu-Miao Chen
- Information and Computer Engineering College, Northeast Forestry University, Harbin, China
- Xin-Ping Zu
- Information and Computer Engineering College, Northeast Forestry University, Harbin, China
- Dan Li
- Information and Computer Engineering College, Northeast Forestry University, Harbin, China
4
Abstract
Risk prediction models have been developed in many contexts to classify individuals according to a single outcome, such as risk of a disease. Emerging “-omic” biomarkers provide panels of features that can simultaneously predict multiple outcomes from a single biological sample, creating issues of multiplicity reminiscent of exploratory hypothesis testing. Here I propose definitions of some basic criteria for evaluating prediction models of multiple outcomes. I define calibration in the multivariate setting and then distinguish between outcome-wise and individual-wise prediction, and within the latter between joint and panel-wise prediction. I give examples such as screening and early detection in which different senses of prediction may be more appropriate. In each case I propose definitions of sensitivity, specificity, concordance, positive and negative predictive value and relative utility. I link the definitions through a multivariate probit model, showing that the accuracy of a multivariate prediction model can be summarised by its covariance with a liability vector. I illustrate the concepts on a biomarker panel for early detection of eight cancers, and on polygenic risk scores for six common diseases.
Affiliation(s)
- Frank Dudbridge
- Department of Health Sciences, University of Leicester, Leicester LE1 7RH, UK
5
Cui Z, Gao YL, Liu JX, Dai LY, Yuan SS. L2,1-GRMF: an improved graph regularized matrix factorization method to predict drug-target interactions. BMC Bioinformatics 2019; 20:287. PMID: 31182006; PMCID: PMC6557743; DOI: 10.1186/s12859-019-2768-7.
Abstract
Background: Experimentally determining drug-target interactions is time-consuming and expensive, so accurate computational prediction methods are important. Many algorithms predict interactions from drug-target networks (i.e., a bipartite graph of drugs and the targets they are known to bind). Although these algorithms can predict some drug-target interactions, they perform poorly for new drugs or targets that have no known interactions. Results: Since such datasets usually lie on or near low-dimensional nonlinear manifolds, we propose an improved GRMF (graph regularized matrix factorization) method that learns these manifold structures, building on the previous matrix factorization method. In addition, we use a previously proposed pre-processing step to improve prediction accuracy. Conclusions: Cross-validation is used to evaluate our method, and simulation experiments are used to predict new interactions. In most cases, our method is superior to the alternatives, and examples of new drugs and new targets are predicted in simulation experiments. The improved GRMF method can better predict the remaining drug-target interactions.
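A toy gradient-descent sketch conveys what "graph regularized matrix factorization" means: factor the interaction matrix while penalising roughness of the factors over drug-similarity and target-similarity graphs. Everything below (function names, hyper-parameters, the plain squared-error loss) is our illustration; the paper's L2,1 variant uses different penalties and update rules.

```python
import random

def matmul(X, Z):
    """Plain-Python matrix product."""
    cols = list(zip(*Z))
    return [[sum(x * z for x, z in zip(row, col)) for col in cols] for row in X]

def grmf(Y, Ld, Lt, k=2, lam=0.1, lam_d=0.05, lam_t=0.05, lr=0.05, iters=1000, seed=0):
    """Factor Y (drugs x targets) as A @ B^T by gradient descent on
    ||Y - A B^T||^2 + lam(||A||^2 + ||B||^2)
      + lam_d tr(A^T Ld A) + lam_t tr(B^T Lt B),
    where Ld, Lt are graph Laplacians over drugs and targets."""
    rng = random.Random(seed)
    nd, nt = len(Y), len(Y[0])
    A = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(nd)]
    B = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(nt)]
    for _ in range(iters):
        # residual R = Y - A B^T
        R = [[Y[i][j] - sum(A[i][f] * B[j][f] for f in range(k))
              for j in range(nt)] for i in range(nd)]
        gA, gB = matmul(R, B), matmul([list(r) for r in zip(*R)], A)
        LdA, LtB = matmul(Ld, A), matmul(Lt, B)
        for i in range(nd):
            for f in range(k):
                A[i][f] += lr * (2 * gA[i][f] - 2 * lam * A[i][f] - 2 * lam_d * LdA[i][f])
        for j in range(nt):
            for f in range(k):
                B[j][f] += lr * (2 * gB[j][f] - 2 * lam * B[j][f] - 2 * lam_t * LtB[j][f])
    return A, B
```

With zero Laplacians this reduces to ordinary regularized matrix factorization; nonzero Laplacians pull connected drugs (or targets) toward similar latent vectors, which is what helps new drugs or targets with few known interactions.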
Affiliation(s)
- Zhen Cui
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
- Ying-Lian Gao
- Library of Qufu Normal University, Qufu Normal University, Rizhao, China
- Jin-Xing Liu
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China; Co-Innovation Center for Information Supply & Assurance Technology, Anhui University, Hefei, China
- Ling-Yun Dai
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
- Sha-Sha Yuan
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
6
7
Forecasting influenza epidemics from multi-stream surveillance data in a subtropical city of China. PLoS One 2014; 9:e92945. PMID: 24676091; PMCID: PMC3968046; DOI: 10.1371/journal.pone.0092945.
Abstract
Background: Influenza has been associated with a heavy burden of mortality and morbidity in subtropical regions. However, timely forecasting of influenza epidemics in these regions has been hindered by the unclear seasonality of influenza viruses. In this study, we developed a forecasting model that integrates multiple sentinel surveillance data streams to predict influenza epidemics in Shenzhen, a subtropical city in China. Methods: Dynamic linear models with predictors drawn from single or multiple surveillance data streams for influenza-like illness (ILI) were adopted to forecast influenza epidemics from 2006 to 2012 in Shenzhen. Temporal coherence of these surveillance data with laboratory-confirmed influenza cases was evaluated by wavelet analysis, and only the coherent data streams were entered into the model. Timeliness, sensitivity and specificity of these models were also evaluated to compare their performance. Results: Both influenza virology data and ILI consultation rates in Shenzhen demonstrated a significant annual seasonal cycle (p<0.05) during the entire study period, with occasional deviations observed in some data streams. The forecasting models that combined multi-stream ILI surveillance data generally outperformed the models with single-stream ILI data, providing more timely, sensitive and specific alerts. Conclusions: Forecasting models that combine multiple sentinel surveillance data streams can generate timely alerts for influenza epidemics in subtropical regions like Shenzhen.
8
Zhao Q, Shi X, Xie Y, Huang J, Shia B, Ma S. Combining multidimensional genomic measurements for predicting cancer prognosis: observations from TCGA. Brief Bioinform 2014; 16:291-303. PMID: 24632304; DOI: 10.1093/bib/bbu003.
Abstract
With accumulating research on the interconnections among different types of genomic regulations, researchers have found that multidimensional genomic studies outperform one-dimensional studies in multiple aspects. Among many sources of multidimensional genomic data, The Cancer Genome Atlas (TCGA) provides the public with comprehensive profiling data on >30 cancer types, making it an ideal test bed for conducting and comparing different analyses. In this article, the analysis goal is to apply several existing methods to associate multidimensional genomic measurements with cancer outcomes, in particular prognosis, with special focus on the predictive power of genomic signatures. We exploit clinical data and four types of genomic measurement, including mRNA gene expression, DNA methylation, microRNA and copy number alterations, for breast invasive carcinoma, glioblastoma multiforme, acute myeloid leukemia and lung squamous cell carcinoma collected by TCGA. To accommodate the high dimensionality, we extract important features using Principal Component Analysis, Partial Least Squares and the Least Absolute Shrinkage and Selection Operator (Lasso), which are representative of dimension reduction and variable selection techniques and have been extensively adopted, and fit Cox survival models with the combined important features. We calibrate the predictive power of each type of genomic measurement for the prognosis of the four cancer types and find that the results vary across cancers. Our analysis also suggests that for most of the cancers in our study and the adopted methods, there is no substantial improvement in prediction when adding other genomic measurements after gene expression and clinical covariates have been included in the model. This is consistent with the findings that molecular features measured at the transcription level affect clinical outcomes more directly than those measured at the DNA/epigenetic level.
9
Li J, Jiang B, Fine JP. Multicategory reclassification statistics for assessing improvements in diagnostic accuracy. Biostatistics 2013; 14:382-94. PMID: 23197381; PMCID: PMC3695653; DOI: 10.1093/biostatistics/kxs047.
Abstract
In this paper, we extend the definitions of the net reclassification improvement (NRI) and the integrated discrimination improvement (IDI) to the context of multicategory classification. Both measures were proposed in Pencina and others (2008. Evaluating the added predictive ability of a new marker: from area under the receiver operating characteristic (ROC) curve to reclassification and beyond. Statistics in Medicine 27, 157-172) as numeric characterizations of accuracy improvement for binary diagnostic tests and were shown to have certain advantages over analyses based on ROC curves or other regression approaches. Estimation and inference procedures for the multiclass NRI and IDI are provided in this paper along with the necessary asymptotic distributional results. Simulations are conducted to study the finite-sample properties of the proposed estimators. Two medical examples are considered to illustrate our methodology.
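For the binary case that Pencina and others originally treated, the category-free NRI and the IDI can be computed directly from old and new risk estimates. The sketch below is our illustration of those two binary formulas only, not the paper's multicategory extension:

```python
def nri_idi(old_risk, new_risk, outcome):
    """Category-free NRI and IDI for a binary outcome.

    NRI rewards upward risk movement for events and downward movement for
    non-events; IDI is the change, new model minus old, in the mean-risk
    separation between events and non-events."""
    events = [i for i, y in enumerate(outcome) if y == 1]
    nonevents = [i for i, y in enumerate(outcome) if y == 0]

    def mean(xs):
        return sum(xs) / len(xs)

    up_e = sum(new_risk[i] > old_risk[i] for i in events)
    down_e = sum(new_risk[i] < old_risk[i] for i in events)
    up_n = sum(new_risk[i] > old_risk[i] for i in nonevents)
    down_n = sum(new_risk[i] < old_risk[i] for i in nonevents)
    nri = (up_e - down_e) / len(events) + (down_n - up_n) / len(nonevents)

    idi = (mean([new_risk[i] for i in events]) - mean([old_risk[i] for i in events])) - (
        mean([new_risk[i] for i in nonevents]) - mean([old_risk[i] for i in nonevents]))
    return nri, idi
```

The multicategory versions replace these event/non-event sums with sums over reclassification among all outcome categories.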
Affiliation(s)
- Jialiang Li
- Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546, Singapore
10
Pfeiffer RM. Extensions of criteria for evaluating risk prediction models for public health applications. Biostatistics 2012; 14:366-81. PMID: 23087412; DOI: 10.1093/biostatistics/kxs037.
Abstract
We recently proposed two novel criteria to assess the usefulness of risk prediction models for public health applications. The proportion of cases followed, PCF(p), is the proportion of individuals who will develop disease who are included in the proportion p of individuals in the population at highest risk. The proportion needed to follow-up, PNF(q), is the proportion of the general population at highest risk that one needs to follow in order that a proportion q of those destined to become cases will be followed (Pfeiffer, R.M. and Gail, M.H., 2011. Two criteria for evaluating risk prediction models. Biometrics 67, 1057-1065). Here, we extend these criteria in two ways. First, we introduce two new criteria by integrating PCF and PNF over a range of values of q or p to obtain iPCF, the integrated PCF, and iPNF, the integrated PNF. A key assumption in the previous work was that the risk model is well calibrated. This assumption also underlies novel estimates of iPCF and iPNF based on observed risks in a population alone. The second extension is to propose and study estimates of PCF, PNF, iPCF, and iPNF that are consistent even if the risk models are not well calibrated. These new estimates are obtained from case-control data when the outcome prevalence in the population is known, and from cohort data, with baseline covariates and observed health outcomes. We study the efficiency of the various estimates and propose and compare tests for comparing two risk models, both of which were evaluated in the same validation data.
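PCF(p) and PNF(q) are simple functionals of the empirical risk distribution. Here is an illustrative pure-Python sketch (our naming and implementation, not the authors'), assuming individual risk estimates and observed outcomes are available:

```python
def pcf(risk, outcome, p):
    """PCF(p): fraction of all cases captured within the top-p
    highest-risk fraction of the population."""
    order = sorted(range(len(risk)), key=lambda i: -risk[i])
    n_top = int(round(p * len(risk)))
    return sum(outcome[i] for i in order[:n_top]) / sum(outcome)

def pnf(risk, outcome, q):
    """PNF(q): smallest fraction of the population, taken from the highest
    risk downward, needed to capture at least a fraction q of the cases."""
    order = sorted(range(len(risk)), key=lambda i: -risk[i])
    target = q * sum(outcome)
    captured = 0
    for k, i in enumerate(order, start=1):
        captured += outcome[i]
        if captured >= target:
            return k / len(risk)
    return 1.0
```

Integrating these over p or q (e.g. by a simple trapezoidal rule) would give the iPCF and iPNF criteria the abstract introduces.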
Affiliation(s)
- Ruth M Pfeiffer
- Biostatistics Branch, National Cancer Institute, Bethesda, MD 20892-7244, USA.
11
Tang LL, Liu A, Chen Z, Schisterman EF, Zhang B, Miao Z. Nonparametric ROC summary statistics for correlated diagnostic marker data. Stat Med 2012; 32:2209-20. PMID: 23055248; DOI: 10.1002/sim.5654.
Abstract
We propose efficient nonparametric statistics to compare medical imaging modalities in multi-reader multi-test data and to compare markers in longitudinal ROC data. The proposed methods are based on the weighted area under the ROC curve, which includes the area under the curve and the partial area under the curve as special cases. The methods maximize the local power for detecting the difference between imaging modalities. We develop the asymptotic results of the proposed methods under a complex correlation structure. Our simulation studies show that the proposed statistics result in much better powers than existing statistics. We apply the proposed statistics to an endometriosis diagnosis study.
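With an indicator weight over a false-positive-rate window, the weighted AUC reduces to the familiar partial AUC, which can be computed nonparametrically from the empirical ROC curve. The sketch below is our illustration of that special case, not the authors' estimator or its correlated-data inference:

```python
def roc_points(cases, controls):
    """Empirical ROC points (FPR, TPR), classifying positive when score >= t."""
    thresholds = sorted(set(cases) | set(controls), reverse=True)
    pts = [(0.0, 0.0)]
    for t in thresholds:
        fpr = sum(s >= t for s in controls) / len(controls)
        tpr = sum(s >= t for s in cases) / len(cases)
        pts.append((fpr, tpr))
    if pts[-1] != (1.0, 1.0):
        pts.append((1.0, 1.0))
    return pts

def weighted_auc(cases, controls, fpr_lo=0.0, fpr_hi=1.0):
    """Trapezoidal area under the empirical ROC curve, restricted to the
    FPR window [fpr_lo, fpr_hi]; the default window gives the full AUC."""
    area = 0.0
    pts = roc_points(cases, controls)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        lo, hi = max(x0, fpr_lo), min(x1, fpr_hi)
        if hi <= lo:
            continue  # segment outside the window (or vertical)
        slope = (y1 - y0) / (x1 - x0)
        y_lo = y0 + slope * (lo - x0)
        y_hi = y0 + slope * (hi - x0)
        area += 0.5 * (y_lo + y_hi) * (hi - lo)
    return area
```

A general weight function w(t) would replace the indicator window with a weighted integral over the same trapezoids.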
12
Yu T, Li J, Ma S. Adjusting confounders in ranking biomarkers: a model-based ROC approach. Brief Bioinform 2012; 13:513-23. PMID: 22396461; DOI: 10.1093/bib/bbs008.
Abstract
High-throughput studies have been extensively conducted in the research of complex human diseases. As a representative example, consider gene-expression studies where thousands of genes are profiled at the same time. An important objective of such studies is to rank the diagnostic accuracy of biomarkers (e.g. gene expressions) for predicting outcome variables while properly adjusting for confounding effects from low-dimensional clinical risk factors and environmental exposures. Existing approaches are often fully based on parametric or semi-parametric models and target evaluating estimation significance as opposed to diagnostic accuracy. Receiver operating characteristic (ROC) approaches can be employed to tackle this problem. However, existing ROC ranking methods focus on biomarkers only and ignore effects of confounders. In this article, we propose a model-based approach which ranks the diagnostic accuracy of biomarkers using ROC measures with a proper adjustment of confounding effects. To this end, three different methods for constructing the underlying regression models are investigated. Simulation study shows that the proposed methods can accurately identify biomarkers with additional diagnostic power beyond confounders. Analysis of two cancer gene-expression studies demonstrates that adjusting for confounders can lead to substantially different rankings of genes.
Affiliation(s)
- Tao Yu
- University of Wisconsin, Madison, USA
13
Ma S, Dai Y. Principal component analysis based methods in bioinformatics studies. Brief Bioinform 2011; 12:714-22. PMID: 21242203; DOI: 10.1093/bib/bbq090.
Abstract
In analysis of bioinformatics data, a unique challenge arises from the high dimensionality of measurements. Without loss of generality, we use genomic study with gene expression measurements as a representative example but note that the analysis techniques discussed in this article are also applicable to other types of bioinformatics studies. Principal component analysis (PCA) is a classic dimension reduction approach. It constructs linear combinations of gene expressions, called principal components (PCs). The PCs are orthogonal to each other, can effectively explain variation of gene expressions, and may have a much lower dimensionality. PCA is computationally simple and can be realized using many existing software packages. This article consists of the following parts. First, we review the standard PCA technique and its applications in bioinformatics data analysis. Second, we describe recent 'non-standard' applications of PCA, including accommodating interactions among genes, pathways and network modules and conducting PCA with estimating equations as opposed to gene expressions. Third, we introduce several recently proposed PCA-based techniques, including the supervised PCA, sparse PCA and functional PCA. The supervised PCA and sparse PCA have been shown to have better empirical performance than the standard PCA. The functional PCA can analyze time-course gene expression data. Last, we raise awareness of several critical but unsolved problems related to PCA. The goal of this article is to make bioinformatics researchers aware of the PCA technique and, more importantly, its most recent developments, so that this simple yet effective dimension reduction technique can be better employed in bioinformatics data analysis.
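As a concrete reminder of what standard PCA computes, the leading principal component of a data matrix can be obtained by power iteration on the sample covariance of the centred data. This stdlib-only sketch is our illustration, not code from the review, and it recovers only the first PC; full PCA takes all eigenvectors in decreasing eigenvalue order.

```python
import random

def first_pc(data, iters=200, seed=0):
    """First principal component (unit vector) of row-wise observations,
    via power iteration on the sample covariance matrix."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    x = [[row[j] - means[j] for j in range(d)] for row in data]  # centre
    # sample covariance matrix (d x d)
    cov = [[sum(x[i][a] * x[i][b] for i in range(n)) / (n - 1)
            for b in range(d)] for a in range(d)]
    rng = random.Random(seed)
    v = [rng.random() for _ in range(d)]
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]  # renormalise each iteration
    return v
```

Subsequent PCs can be obtained by deflating the covariance matrix and repeating; in practice one would use an off-the-shelf eigendecomposition instead.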
Affiliation(s)
- Shuangge Ma
- 60 College ST, LEPH 209, School of Public Health, Yale University, New Haven, CT 06520, USA.