1
|
Shan G. Monte Carlo cross-validation for a study with binary outcome and limited sample size. BMC Med Inform Decis Mak 2022; 22:270. [PMID: 36253749 PMCID: PMC9578204 DOI: 10.1186/s12911-022-02016-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/03/2021] [Accepted: 10/10/2022] [Indexed: 11/26/2022] Open
Abstract
Cross-validation (CV) is a resampling approach to evaluate machine learning models when sample size is limited. The number of all possible combinations of folds for the training data, known as CV rounds, is often very small in leave-one-out CV. Alternatively, Monte Carlo cross-validation (MCCV) can be performed with a flexible number of simulations, computational resources permitting, for a study with limited sample size. We conduct extensive simulation studies to compare accuracy between MCCV and CV with the same number of simulations for a study with binary outcome (e.g., disease progression or not). Accuracy of MCCV is generally higher than that of CV, although the gain is small. The two approaches perform similarly when sample size is large. Meanwhile, MCCV provides increasingly reliable performance metrics as the number of simulations increases. Two real examples are used to illustrate the comparison between MCCV and CV.
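The MCCV idea described in this abstract, repeated random train/test splits with a freely chosen number of rounds, can be sketched as follows. This is a minimal illustration and not the author's simulation code: the function names, the 20% test fraction, the majority-class placeholder classifier, and the toy outcome vector are all assumptions made for the example.

```python
import random
from statistics import mean


def mccv_splits(n_samples, test_fraction=0.2, n_rounds=100, seed=0):
    """Yield (train_idx, test_idx) pairs from repeated random splits (Monte Carlo CV)."""
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    indices = list(range(n_samples))
    for _ in range(n_rounds):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # The first n_test shuffled indices form the test set, the rest the training set.
        yield sorted(shuffled[n_test:]), sorted(shuffled[:n_test])


def majority_class_accuracy(y, train_idx, test_idx):
    """Accuracy of a placeholder classifier that predicts the majority training label."""
    train_labels = [y[i] for i in train_idx]
    majority = 1 if 2 * sum(train_labels) >= len(train_labels) else 0
    return mean([1.0 if y[i] == majority else 0.0 for i in test_idx])


def mccv_accuracy(y, test_fraction=0.2, n_rounds=100, seed=0):
    """Average test accuracy over all Monte Carlo rounds."""
    scores = [
        majority_class_accuracy(y, train_idx, test_idx)
        for train_idx, test_idx in mccv_splits(len(y), test_fraction, n_rounds, seed)
    ]
    return mean(scores)


# Hypothetical binary outcomes for a small study (e.g., progression yes/no).
y = [0] * 12 + [1] * 8
print(mccv_accuracy(y, n_rounds=200))
```

Unlike k-fold or leave-one-out CV, the number of rounds here is not tied to the sample size, so it can be increased until the averaged metric stabilizes.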
Affiliation(s)
- Guogen Shan
- Department of Biostatistics, University of Florida, Gainesville, FL, 32610, USA.
|
2
|
Semella S, Hutengs C, Seidel M, Ulrich M, Schneider B, Ortner M, Thiele-Bruhn S, Ludwig B, Vohland M. Accuracy and Reproducibility of Laboratory Diffuse Reflectance Measurements with Portable VNIR and MIR Spectrometers for Predictive Soil Organic Carbon Modeling. Sensors (Basel) 2022; 22:2749. [PMID: 35408363 PMCID: PMC9003508 DOI: 10.3390/s22072749] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/03/2022] [Revised: 03/18/2022] [Accepted: 03/31/2022] [Indexed: 06/14/2023]
Abstract
Soil spectroscopy in the visible-to-near infrared (VNIR) and mid-infrared (MIR) is a cost-effective method to determine the soil organic carbon content (SOC) based on predictive spectral models calibrated to analytically determined SOC reference data. The degree to which uncertainty in reference data and spectral measurements contributes to the estimated accuracy of VNIR and MIR predictions, however, is rarely addressed and remains unclear, in particular for current handheld MIR spectrometers. We thus evaluated the reproducibility of both the spectral reflectance measurements with portable VNIR and MIR spectrometers and the analytical dry combustion SOC reference method, with the aim of assessing how varying spectral inputs and reference values impact the calibration and validation of predictive VNIR and MIR models. Soil reflectance spectra and SOC were measured in triplicate, the latter by different laboratories, for a set of 75 finely ground soil samples covering a wide range of parent materials and SOC contents. Predictive partial least-squares regression (PLSR) models were evaluated in a repeated, nested cross-validation approach with systematically varied spectral inputs and reference data, respectively. We found that SOC predictions from both VNIR and MIR spectra were equally highly reproducible on average and similar to the dry combustion method, but MIR spectra were more robust to calibration sample variation. The contributions of spectral variation (ΔRMSE < 0.4 g·kg⁻¹) and reference SOC uncertainty (ΔRMSE < 0.3 g·kg⁻¹) to spectral modeling errors were small compared to the difference between the VNIR and MIR spectral ranges (ΔRMSE ~1.4 g·kg⁻¹ in favor of MIR). For reference SOC, uncertainty was limited to the case of biased reference data appearing in either the calibration or validation.
Given better predictive accuracy, comparable spectral reproducibility and greater robustness against calibration sample selection, the portable MIR spectrometer was considered overall superior to the VNIR instrument for SOC analysis. Our results further indicate that random errors in SOC reference values are effectively compensated for during model calibration, while biased SOC calibration data propagates errors into model predictions. Reference data uncertainty is thus more likely to negatively impact the estimated validation accuracy in soil spectroscopy studies where archived data, e.g., from soil spectral libraries, are used for model building, but it should be negligible otherwise.
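The repeated, nested cross-validation scheme mentioned in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: a one-dimensional k-nearest-neighbour regressor stands in for PLSR (its neighbour count k plays the role of a tuned hyperparameter such as the number of latent variables), the data are synthetic, and the fold counts, repeat count, and all function names are hypothetical.

```python
import math
import random


def kfold_indices(indices, k, rng):
    """Shuffle the given indices and split them into k roughly equal folds."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def knn_predict(x_train, y_train, x_new, k):
    """Predict by averaging the targets of the k nearest training points (1-D)."""
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x_new))
    nearest = order[:k]
    return sum(y_train[i] for i in nearest) / len(nearest)


def rmse(x, y, fit_idx, eval_idx, k):
    """RMSE on eval_idx of a k-NN model fitted on fit_idx."""
    x_fit = [x[i] for i in fit_idx]
    y_fit = [y[i] for i in fit_idx]
    squared = [(knn_predict(x_fit, y_fit, x[i], k) - y[i]) ** 2 for i in eval_idx]
    return math.sqrt(sum(squared) / len(squared))


def nested_cv_rmse(x, y, k_grid=(1, 3, 5), outer_k=5, inner_k=4, repeats=3, seed=0):
    """Repeated nested CV: the inner loop tunes k, the outer loop estimates error."""
    rng = random.Random(seed)
    outer_scores = []
    for _ in range(repeats):
        folds = kfold_indices(list(range(len(x))), outer_k, rng)
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            # Inner folds are built from the outer training data only,
            # so the held-out outer fold never influences tuning.
            inner_folds = kfold_indices(train_idx, inner_k, rng)

            def inner_score(k):
                scores = []
                for h, val_idx in enumerate(inner_folds):
                    fit_idx = [i for g, fold in enumerate(inner_folds) if g != h for i in fold]
                    scores.append(rmse(x, y, fit_idx, val_idx, k))
                return sum(scores) / len(scores)

            best_k = min(k_grid, key=inner_score)
            outer_scores.append(rmse(x, y, train_idx, test_idx, best_k))
    return sum(outer_scores) / len(outer_scores), outer_scores


# Synthetic stand-in data: 75 samples, one predictor, noisy linear response.
data_rng = random.Random(42)
x = [i / 10 for i in range(75)]
y = [2.0 * xi + data_rng.gauss(0.0, 0.3) for xi in x]
mean_rmse, fold_rmses = nested_cv_rmse(x, y)
print(round(mean_rmse, 3), len(fold_rmses))
```

Keeping hyperparameter selection inside the inner loop is what prevents the outer error estimate from being optimistically biased; repeating the whole procedure with fresh fold assignments shows how sensitive that estimate is to the choice of calibration samples.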
Affiliation(s)
- Sebastian Semella
- Geoinformatics and Remote Sensing, Institute for Geography, Leipzig University, 04103 Leipzig, Germany; (S.S.); (M.S.); (M.U.)
- Christopher Hutengs
- Geoinformatics and Remote Sensing, Institute for Geography, Leipzig University, 04103 Leipzig, Germany; (S.S.); (M.S.); (M.U.)
- Remote Sensing Centre for Earth System Research, Leipzig University, 04103 Leipzig, Germany
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, 04103 Leipzig, Germany
- Michael Seidel
- Geoinformatics and Remote Sensing, Institute for Geography, Leipzig University, 04103 Leipzig, Germany; (S.S.); (M.S.); (M.U.)
- Remote Sensing Centre for Earth System Research, Leipzig University, 04103 Leipzig, Germany
- Mathias Ulrich
- Geoinformatics and Remote Sensing, Institute for Geography, Leipzig University, 04103 Leipzig, Germany; (S.S.); (M.S.); (M.U.)
- Birgit Schneider
- Physical Geography, Institute for Geography, Leipzig University, 04103 Leipzig, Germany;
- Malte Ortner
- Soil Science, Faculty of Spatial and Environmental Sciences, University of Trier, 54286 Trier, Germany; (M.O.); (S.T.-B.)
- Sören Thiele-Bruhn
- Soil Science, Faculty of Spatial and Environmental Sciences, University of Trier, 54286 Trier, Germany; (M.O.); (S.T.-B.)
- Bernard Ludwig
- Department of Environmental Chemistry, University of Kassel, 37213 Witzenhausen, Germany;
- Michael Vohland
- Geoinformatics and Remote Sensing, Institute for Geography, Leipzig University, 04103 Leipzig, Germany; (S.S.); (M.S.); (M.U.)
- Remote Sensing Centre for Earth System Research, Leipzig University, 04103 Leipzig, Germany
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, 04103 Leipzig, Germany
|
3
|
Cava C, Bertoli G, Castiglioni I. In silico identification of drug target pathways in breast cancer subtypes using pathway cross-talk inhibition. J Transl Med 2018; 16:154. [PMID: 29871693 PMCID: PMC5989433 DOI: 10.1186/s12967-018-1535-2] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 01/24/2018] [Accepted: 06/01/2018] [Indexed: 12/11/2022] Open
Abstract
Background: Despite great development in genome and proteome high-throughput methods, treatment failure is a critical point in the management of most solid cancers, including breast cancer (BC). Multiple alternative mechanisms are activated upon drug treatment to offset therapeutic effects, eventually causing drug resistance or treatment failure. Methods: Here, we optimized a computational method to discover novel drug target pathways in cancer subtypes using pathway cross-talk inhibition (PCI). The in silico method is based on the detection and quantification of the pathway cross-talk for distinct cancer subtypes. From a BC data set of The Cancer Genome Atlas, we have identified different networks of cross-talking pathways for different BC subtypes, validated using an independent BC dataset from Gene Expression Omnibus. Then, we predicted in silico the effects of new or approved drugs on different BC subtypes by silencing individual or combined subtype-derived pathways with the aim of finding new potential drugs or more effective synergistic combinations of drugs. Results: Overall, we identified a set of new potential drug target pathways for distinct BC subtypes on which therapeutic agents could act synergistically, showing antitumour effects and impacting cross-talk inhibition. Conclusions: We believe that in silico methods based on PCI could offer valuable approaches to identifying more tailored and effective treatments, particularly in heterogeneous cancer diseases. Electronic supplementary material: The online version of this article (10.1186/s12967-018-1535-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Claudia Cava
- Institute of Molecular Bioimaging and Physiology, National Research Council (IBFM-CNR), Via F.Cervi 93, Segrate, 20090, Milan, Italy
- Gloria Bertoli
- Institute of Molecular Bioimaging and Physiology, National Research Council (IBFM-CNR), Via F.Cervi 93, Segrate, 20090, Milan, Italy
- Isabella Castiglioni
- Institute of Molecular Bioimaging and Physiology, National Research Council (IBFM-CNR), Via F.Cervi 93, Segrate, 20090, Milan, Italy.
|
4
|
Posma JM, Garcia-Perez I, Ebbels TMD, Lindon JC, Stamler J, Elliott P, Holmes E, Nicholson JK. Optimized Phenotypic Biomarker Discovery and Confounder Elimination via Covariate-Adjusted Projection to Latent Structures from Metabolic Spectroscopy Data. J Proteome Res 2018; 17:1586-1595. [PMID: 29457906 PMCID: PMC5891819 DOI: 10.1021/acs.jproteome.7b00879] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Indexed: 02/06/2023]
Abstract
Metabolism is altered by genetics, diet, disease status, environment, and many other factors. Modeling any one of these is often done without considering the effects of the other covariates. Attributing differences in metabolic profile to one of these factors needs to be done while controlling for the metabolic influence of the rest. We describe here a data analysis framework and novel confounder-adjustment algorithm for multivariate analysis of metabolic profiling data. Using simulated data, we show that similar numbers of true associations and significantly fewer false positives are found compared to other commonly used methods. Covariate-adjusted projections to latent structures (CA-PLS) are exemplified here using a large-scale metabolic phenotyping study of two Chinese populations at different risks for cardiovascular disease. Using CA-PLS, we find that some previously reported differences are actually associated with external factors and discover a number of previously unreported biomarkers linked to different metabolic pathways. CA-PLS can be applied to any multivariate data where confounding may be an issue, and the confounder-adjustment procedure is translatable to other multivariate regression techniques.
Affiliation(s)
- Isabel Garcia-Perez
- Investigative Medicine, Department of Medicine, Faculty of Medicine , Imperial College London , W12 0NN London , United Kingdom
- Jeremiah Stamler
- Department of Preventive Medicine, Feinberg School of Medicine , Northwestern University , Chicago , Illinois 60611 , United States
|
5
|
Liu C, Gu X, Jiang Z. Identification of novel targets for multiple myeloma through integrative approach with Monte Carlo cross-validation analysis. J Bone Oncol 2017; 8:8-12. [PMID: 28856086 PMCID: PMC5565744 DOI: 10.1016/j.jbo.2017.08.001] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 04/28/2017] [Revised: 08/01/2017] [Accepted: 08/10/2017] [Indexed: 11/20/2022] Open
Abstract
More than one pathway is involved in disease development and progression, and two or more pathways may be interconnected to further affect the disease onset, as functional proteins participate in multiple pathways. Thus, identifying cross-talk among pathways is necessary to understand the molecular mechanisms of multiple myeloma (MM). This paper therefore sought to extract potential pathway cross-talk in MM through an integrative approach using Monte Carlo cross-validation analysis. The gene expression library of MM (accession number: GSE6477) was downloaded from the Gene Expression Omnibus (GEO) database. The integrative approach was then used to identify potential pathway cross-talk, and included four steps: Firstly, differential expression analysis was conducted to identify differentially expressed genes (DEGs). Secondly, the DEGs obtained were mapped to the pathways downloaded from Ingenuity Pathway Analysis (IPA), to reveal the underlying relationship between the DEGs and pathways enriched by these DEGs. A subset of pathways enriched by the DEGs was then obtained. Thirdly, a discriminating score (DS) value for each paired pathway was computed. Lastly, random forest (RF) classification was used to identify the paired pathways based on area under the curve (AUC) and Monte Carlo cross-validation, which was repeated 50 times to explore the best paired pathways. These paired pathways were tested with another independently published MM microarray dataset (GSE85837), using in silico validation. Overall, 60 DEGs and 19 differential pathways enriched by DEGs were extracted. The pathways were sorted by their AUC values. The pathway pair of inhibition of matrix metalloproteases and EIF2 signaling achieved the best AUC value of 1.000. The pair consisting of IL-8 signaling and EIF2 signaling, with a high AUC of 0.975, appeared in 7 runs. Furthermore, it was validated consistently in a separate microarray dataset (GSE85837).
Paired pathways (inhibition of matrix metalloproteases and EIF2 signaling, IL-8 signaling and EIF2 signaling) exhibited the best AUC values and higher frequency of validation. Two paired pathways (inhibition of matrix metalloproteases and EIF2 signaling, IL-8 signaling and EIF2 signaling) were used to accurately classify MM and control samples. These paired pathways may be potential bio-signatures for diagnosis and management of MM.
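The ranking step described in this abstract, scoring candidate pathway pairs by AUC over 50 Monte Carlo rounds and counting how often each pair comes out on top, can be sketched as follows. This is a simplified illustration and not the authors' method: the random forest classifier is omitted and each pair's per-sample score is used directly as the ranking variable, the scores and labels are synthetic, and the names `auc`, `mccv_rank_pairs`, `pair_A`, and `pair_B` are hypothetical.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mccv_rank_pairs(pair_scores, labels, n_rounds=50, test_fraction=0.3, seed=0):
    """Return (mean AUC per pair, number of rounds in which each pair ranked best)."""
    if len(set(labels)) < 2:
        raise ValueError("labels must contain both classes")
    rng = random.Random(seed)
    n = len(labels)
    n_test = max(2, round(n * test_fraction))
    total_auc = {name: 0.0 for name in pair_scores}
    times_best = {name: 0 for name in pair_scores}
    completed = 0
    while completed < n_rounds:
        test_idx = rng.sample(range(n), n_test)
        test_labels = [labels[i] for i in test_idx]
        if len(set(test_labels)) < 2:
            continue  # AUC is undefined without both classes; redraw the split
        round_auc = {
            name: auc([scores[i] for i in test_idx], test_labels)
            for name, scores in pair_scores.items()
        }
        for name, value in round_auc.items():
            total_auc[name] += value
        # On ties, max() keeps the first pair in insertion order.
        times_best[max(round_auc, key=round_auc.get)] += 1
        completed += 1
    mean_auc = {name: total_auc[name] / n_rounds for name in pair_scores}
    return mean_auc, times_best


# Synthetic example: 10 cases (label 1) and 10 controls (label 0).
labels = [1] * 10 + [0] * 10
noise = random.Random(7)
pair_scores = {
    "pair_A": [2.0 + 0.1 * i for i in range(10)] + [0.1 * i for i in range(10)],
    "pair_B": [noise.gauss(0.5, 1.0) for _ in range(10)]
    + [noise.gauss(0.0, 1.0) for _ in range(10)],
}
mean_auc, times_best = mccv_rank_pairs(pair_scores, labels)
print(mean_auc, times_best)
```

Reporting both the mean AUC and the frequency of being ranked best mirrors the two quantities the abstract mentions (best AUC values and frequency across runs), though how the original study defined "involved in 7 runs" is not specified here.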
|