Reference Citation Analysis

For: Harrington PDB. Multiple Versus Single Set Validation of Multivariate Models to Avoid Mistakes. Crit Rev Anal Chem 2017;48:33-46. [DOI: 10.1080/10408347.2017.1361314] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Indexed: 01/22/2023]

Cited by Other Article(s)

Saunders A, Harrington PDB. Advances in Activity/Property Prediction from Chemical Structures. Crit Rev Anal Chem 2024;54:135-147. [PMID: 35482792 DOI: 10.1080/10408347.2022.2066461] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]

Tabashum T, Snyder RC, O'Brien MK, Albert MV. Machine Learning Models for Parkinson Disease: Systematic Review. JMIR Med Inform 2024;12:e50117. [PMID: 38771237 PMCID: PMC11112052 DOI: 10.2196/50117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Revised: 02/12/2024] [Accepted: 04/01/2024] [Indexed: 05/22/2024] Open

Abstract

Background

With the increasing availability of data, computing resources, and easier-to-use software libraries, machine learning (ML) is increasingly used in disease detection and prediction, including for Parkinson disease (PD). Despite the large number of studies published every year, very few ML systems have been adopted for real-world use. In particular, a lack of external validity may result in poor performance of these systems in clinical practice. Additional methodological issues in ML design and reporting can also hinder clinical adoption, even for applications that would benefit from such data-driven systems.

Objective

To sample the current ML practices in PD applications, we conducted a systematic review of studies published in 2020 and 2021 that used ML models to diagnose PD or track PD progression.

Methods

We conducted a systematic literature review in accordance with PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines in PubMed between January 2020 and April 2021, using the following exact string: "Parkinson's" AND ("ML" OR "prediction" OR "classification" OR "detection" or "artificial intelligence" OR "AI"). The search resulted in 1085 publications. After a search query and review, we found 113 publications that used ML for the classification or regression-based prediction of PD or PD-related symptoms.

Results

Only 65.5% (74/113) of studies used a holdout test set to avoid potentially inflated accuracies, and approximately half (25/46, 54%) of the studies without a holdout test set did not state this as a potential concern. Surprisingly, 38.9% (44/113) of studies did not report on how or if models were tuned, and an additional 27.4% (31/113) used ad hoc model tuning, which is generally frowned upon in ML model optimization. Only 15% (17/113) of studies performed direct comparisons of results with other models, severely limiting the interpretation of results.

Conclusions

This review highlights the notable limitations of current ML systems and techniques that may contribute to a gap between reported performance in research and the real-life applicability of ML models aiming to detect and predict diseases such as PD.

Majd E, Xing L, Zhang X. Segmentation of patients with small cell lung cancer into responders and non-responders using the optimal cross-validation technique. BMC Med Res Methodol 2024;24:83. [PMID: 38589775 PMCID: PMC11000309 DOI: 10.1186/s12874-024-02185-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/04/2022] [Accepted: 02/20/2024] [Indexed: 04/10/2024] Open

Abstract

BACKGROUND

The timing of treatment is an essential factor in the efficacy of cancer therapy, so patients who will not respond to their current therapy should receive a different treatment as early as possible. Machine learning models can be built to classify responders and non-responders. Such classification models predict the probability of a patient being a responder. Most methods use a probability threshold of 0.5 to convert the probabilities into binary group membership. However, the cutoff of 0.5 is not always the optimal choice.

METHODS

In this study, we propose a novel data-driven approach to select a better cutoff value based on the optimal cross-validation technique. To illustrate our novel method, we applied it to three clinical trial datasets of small-cell lung cancer patients. We used two different datasets to build a scoring system to segment patients. The models were then applied to segment patients in the test data.

RESULTS

We found that, in test data, the predicted responders and non-responders had significantly different long-term survival outcomes. Our proposed novel method segments patients better than the standard approach using a cutoff of 0.5. Comparing clinical outcomes of responders versus non-responders, our novel method had a p-value of 0.009 with a hazard ratio of 0.668 for grouping patients using the Cox proportional hazards model and a p-value of 0.011 using the accelerated failure time model, which confirmed a significant difference between responders and non-responders. In contrast, the standard approach had a p-value of 0.194 with a hazard ratio of 0.823 using the Cox proportional hazards model and a p-value of 0.240 using the accelerated failure time model, indicating that the responders and non-responders do not differ significantly in survival.

CONCLUSION

In summary, our novel prediction method can successfully segment new patients into responders and non-responders. Clinicians can use our prediction to decide if a patient should receive a different treatment or stay with the current treatment.
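
The abstract above does not say how the cross-validated cutoff is chosen, so the following is only a minimal sketch under an assumed criterion: Youden's J (sensitivity + specificity - 1) is maximized on the training folds and then scored on the held-out fold. The function names, the grid of candidate cutoffs, and the toy probabilities are all hypothetical and are not the authors' method.

```python
import random

def youden_j(probs, labels, cutoff):
    """Youden's J (sensitivity + specificity - 1) at a probability cutoff."""
    tp = fn = tn = fp = 0
    for p, y in zip(probs, labels):
        predicted = 1 if p >= cutoff else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens + spec - 1.0

def cross_validated_cutoff(probs, labels, k=5, seed=0):
    """Choose the best cutoff on k-1 folds and score it on the held-out fold.

    Returns (mean chosen cutoff, mean held-out J) over the k folds.
    """
    idx = list(range(len(probs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    grid = [c / 100 for c in range(5, 96)]  # assumed candidate cutoffs 0.05..0.95
    cutoffs, scores = [], []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        train_p = [probs[i] for i in train]
        train_y = [labels[i] for i in train]
        best = max(grid, key=lambda c: youden_j(train_p, train_y, c))
        cutoffs.append(best)
        test_p = [probs[i] for i in held_out]
        test_y = [labels[i] for i in held_out]
        scores.append(youden_j(test_p, test_y, best))
    return sum(cutoffs) / k, sum(scores) / k

# Toy predicted probabilities (invented): responders score higher than
# non-responders, but both groups sit below 0.5.
rng = random.Random(1)
labels = [1] * 100 + [0] * 100
probs = [min(1.0, max(0.0, rng.gauss(0.4 if y else 0.2, 0.08))) for y in labels]
cutoff, held_out_j = cross_validated_cutoff(probs, labels)
```

On data like this, where the predicted probabilities of both groups fall below 0.5, a fixed 0.5 cutoff labels almost everyone a non-responder, which is the situation a data-driven cutoff is meant to address.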

Cao R, Hu W, Wei P, Ding Y, Bin Y, Zheng C. FFMAVP: a new classifier based on feature fusion and multitask learning for identifying antiviral peptides and their subclasses. Brief Bioinform 2023;24:bbad353. [PMID: 37861174 DOI: 10.1093/bib/bbad353] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2023] [Revised: 08/25/2023] [Accepted: 09/06/2023] [Indexed: 10/21/2023] Open

Zhou W, Liu Y, Li Y, Kong S, Wang W, Ding B, Han J, Mou C, Gao X, Liu J. TriNet: A tri-fusion neural network for the prediction of anticancer and antimicrobial peptides. Patterns (New York, N.Y.) 2023;4:100702. [PMID: 36960450 PMCID: PMC10028424 DOI: 10.1016/j.patter.2023.100702] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2022] [Revised: 12/20/2022] [Accepted: 02/03/2023] [Indexed: 03/04/2023]

Navani V, Meyers DE, Ruan Y, Boyne DJ, O'Sullivan DE, Dolter S, Grosjean HA, Stukalin I, Heng DYC, Morris DG, Brenner DR, Sangha R, Cheung WY, Pabani A. Lung Immune Therapy Evaluation (LITE) Risk, a Novel Prognostic Model for Patients With Advanced Non-Small Cell Lung Cancer Treated With Immune Checkpoint Blockade. Clin Lung Cancer 2023;24:e152-e159. [PMID: 36774234 DOI: 10.1016/j.cllc.2022.12.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2022] [Revised: 10/28/2022] [Accepted: 12/26/2022] [Indexed: 01/21/2023]

Affiliation(s)

Vishal Navani: Department of Medical Oncology, Tom Baker Cancer Centre, Calgary, Alberta, Canada; Department of Oncology, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada
Daniel E Meyers: Department of Medical Oncology, Tom Baker Cancer Centre, Calgary, Alberta, Canada
Yibing Ruan: Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Cancer Epidemiology and Prevention Research, Alberta Health Services, Calgary, Alberta, Canada; Forzani & MacPhail Colon Cancer Screening Centre, University of Calgary, Calgary, Alberta, Canada
Devon J Boyne: Department of Oncology, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Cancer Epidemiology and Prevention Research, Alberta Health Services, Calgary, Alberta, Canada
Dylan E O'Sullivan: Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Cancer Epidemiology and Prevention Research, Alberta Health Services, Calgary, Alberta, Canada; Forzani & MacPhail Colon Cancer Screening Centre, University of Calgary, Calgary, Alberta, Canada
Samantha Dolter: Department of Medical Oncology, Tom Baker Cancer Centre, Calgary, Alberta, Canada
Heidi Ai Grosjean: Department of Medical Oncology, Tom Baker Cancer Centre, Calgary, Alberta, Canada
Igor Stukalin: Department of Medical Oncology, Tom Baker Cancer Centre, Calgary, Alberta, Canada
Daniel Y C Heng: Department of Medical Oncology, Tom Baker Cancer Centre, Calgary, Alberta, Canada; Department of Oncology, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada
Don G Morris: Department of Medical Oncology, Tom Baker Cancer Centre, Calgary, Alberta, Canada; Department of Oncology, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada
Darren R Brenner: Department of Oncology, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Cancer Epidemiology and Prevention Research, Alberta Health Services, Calgary, Alberta, Canada
Randeep Sangha: Department of Medical Oncology, Cross Cancer Institute, Edmonton, Alberta, Canada
Winson Y Cheung: Department of Medical Oncology, Tom Baker Cancer Centre, Calgary, Alberta, Canada; Department of Oncology, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada
Aliyah Pabani: Department of Medical Oncology, Tom Baker Cancer Centre, Calgary, Alberta, Canada; Department of Oncology, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada

Suchana S, Passeport E. Implications of polar organic chemical integrative sampler for high membrane sorption and suitability of polyethersulfone as a single-phase sampler. The Science of the Total Environment 2022;850:157898. [PMID: 35952872 DOI: 10.1016/j.scitotenv.2022.157898] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/13/2021] [Revised: 08/03/2022] [Accepted: 08/04/2022] [Indexed: 06/15/2023]

Abstract

The polar organic chemical integrative sampler (POCIS) contains a sorbent, which is typically enclosed between two polyethersulfone (PES) membranes. Significant PES uptake is reported for many contaminants, yet aqueous concentration is mainly correlated with the sorbent uptake using first-order kinetics. Under high PES sorption, first-order kinetics often provide an erroneous sampling rate for the sorbent phase due to increased membrane resistance. This work used laboratory experiments to evaluate the uptake of four high PES-sorbing chemicals (three Cl- and CH₃-substituted nitrobenzenes and one chlorinated aniline) by POCIS, as well as the potential of a single-phase PES sampler. POCIS calibration results demonstrated that both sorbent and membrane had similar affinity for the target compounds. Rapid PES sorption occurred in the first days (<7 days), followed by a gradual increase in the PES phase concentration (equilibrium not achieved after 60 days). In particular, the membrane was the primary sink for 3,4-dichloroaniline and 3,4-dichloronitrobenzene for up to 14 and 31 days, respectively. On the other hand, the single-phase PES sampler showed similar mass uptake to POCIS and reached equilibrium within 19 days under static conditions, indicating its potential suitability in the equilibrium regime. The PES-water partition coefficients of the target compounds were between 1.2 and 6.5 L/g. Finally, we present a poly-parameter linear free energy relationship (pp-LFER) using published data to predict the PES-water partition coefficients. The pp-LFER models showed moderate predictability, as indicated by R²_adj values between 0.7 and 0.9 for both internal and external data sets consisting of a wide range of hydrophobic and hydrophilic compounds (-0.1 ≤ logK_OW ≤ 7.4). The proposed pp-LFER model can be used to screen high PES-sorbing chemicals to increase the reliability and accuracy of aqueous concentration prediction from POCIS sampling and to select the most appropriate sampling approach for new compounds.

Feasibility of Application of Near Infrared Reflectance (NIR) Spectroscopy for the Prediction of the Chemical Composition of Traditional Sausages. Applied Sciences-Basel 2021. [DOI: 10.3390/app112311282] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/16/2022]

Chen Z, de Boves Harrington P, Griffin V, Griffin T. In Situ Determination of Cannabidiol in Hemp Oil by Near-Infrared Spectroscopy. Journal of Natural Products 2021;84:2851-2857. [PMID: 34784219 DOI: 10.1021/acs.jnatprod.1c00557] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]

Wang L, Vendrell-Dones MO, Deriu C, Doğruer S, de B Harrington P, McCord B. Multivariate Analysis Aided Surface-Enhanced Raman Spectroscopy (MVA-SERS) Multiplex Quantitative Detection of Trace Fentanyl in Illicit Drug Mixtures Using a Handheld Raman Spectrometer. Applied Spectroscopy 2021;75:1225-1236. [PMID: 34318708 DOI: 10.1177/00037028211032930] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 06/13/2023]

Radiomics machine learning study with a small sample size: Single random training-test set split may lead to unreliable results. PLoS One 2021;16:e0256152. [PMID: 34383858 PMCID: PMC8360533 DOI: 10.1371/journal.pone.0256152] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 02/28/2021] [Accepted: 08/01/2021] [Indexed: 12/23/2022] Open

Abstract

This study aims to determine how randomly splitting a dataset into training and test sets affects the estimated performance of a machine learning model and its gap from the test performance under different conditions, using real-world brain tumor radiomics data. We conducted two classification tasks of different difficulty levels with magnetic resonance imaging (MRI) radiomics features: (1) a "simple" task, glioblastomas [n = 109] vs. brain metastasis [n = 58], and (2) a "difficult" task, low- [n = 163] vs. high-grade [n = 95] meningiomas. Additionally, two undersampled datasets were created by randomly sampling 50% from these datasets. We performed random training-test set splitting for each dataset repeatedly to create 1,000 different training-test set pairs. For each dataset pair, the least absolute shrinkage and selection operator model was trained and evaluated using various validation methods in the training set, and tested in the test set, using the area under the curve (AUC) as an evaluation metric. The AUCs in training and testing varied among different training-test set pairs, especially with the undersampled datasets and the difficult task. The mean (±standard deviation) AUC difference between training and testing was 0.039 (±0.032) for the simple task without undersampling and 0.092 (±0.071) for the difficult task with undersampling. In one training-test set pair with the difficult task without undersampling, for example, the AUC was high in training but much lower in testing (0.882 and 0.667, respectively); in another dataset pair with the same task, however, the AUC was low in training but much higher in testing (0.709 and 0.911, respectively). When the AUC discrepancy between training and testing (the generalization gap) was large, none of the validation methods sufficiently reduced it. Our results suggest that machine learning after a single random training-test set split may lead to unreliable results in radiomics studies, especially with small sample sizes.
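
The effect described above can be illustrated with a toy simulation. This is a hedged sketch, not the study's pipeline: it assumes invented one-dimensional Gaussian data, a nearest-class-mean rule in place of the LASSO model, and accuracy in place of AUC. All function names are hypothetical.

```python
import random
import statistics

def make_data(n_per_class, rng):
    """Toy 1-D data: class 0 ~ N(0, 1), class 1 ~ N(1, 1), heavily overlapping."""
    data = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_per_class)]
    data += [(rng.gauss(1.0, 1.0), 1) for _ in range(n_per_class)]
    return data

def split_accuracy(data, test_fraction, rng):
    """One random train/test split; return test accuracy of a nearest-mean rule."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    # "Training" is just estimating one mean per class from the training part.
    mean0 = statistics.mean(x for x, y in train if y == 0)
    mean1 = statistics.mean(x for x, y in train if y == 1)
    correct = 0
    for x, y in test:
        predicted = 1 if abs(x - mean1) < abs(x - mean0) else 0
        correct += predicted == y
    return correct / len(test)

def repeated_splits(data, n_splits, test_fraction, seed):
    """Repeat the random split many times and collect the test accuracies."""
    rng = random.Random(seed)
    return [split_accuracy(data, test_fraction, rng) for _ in range(n_splits)]

# Small sample (20 per class), 200 different random splits of the same data.
data = make_data(20, random.Random(0))
accuracies = repeated_splits(data, n_splits=200, test_fraction=0.3, seed=1)
spread = max(accuracies) - min(accuracies)
mean_accuracy = statistics.mean(accuracies)
```

Because the same data and the same model are used throughout, any spread in `accuracies` comes only from where the split happened to fall, which is why reporting the distribution over many splits is more informative than a single number.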

Pang Y, Yao L, Jhong JH, Wang Z, Lee TY. AVPIden: a new scheme for identification and functional prediction of antiviral peptides based on machine learning approaches. Brief Bioinform 2021;22:6323205. [PMID: 34279599 DOI: 10.1093/bib/bbab263] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 03/23/2021] [Revised: 06/07/2021] [Accepted: 06/21/2021] [Indexed: 02/06/2023] Open

Sajid MR, Muhammad N, Zakaria R, Shahbaz A, Bukhari SAC, Kadry S, Suresh A. Nonclinical Features in Predictive Modeling of Cardiovascular Diseases: A Machine Learning Approach. Interdiscip Sci 2021;13:201-211. [PMID: 33675528 DOI: 10.1007/s12539-021-00423-w] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 09/17/2020] [Revised: 02/08/2021] [Accepted: 02/20/2021] [Indexed: 12/23/2022]

Ede JM. Deep learning in electron microscopy. Machine Learning: Science and Technology 2021. [DOI: 10.1088/2632-2153/abd614] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Indexed: 12/12/2022] Open

Chemometric applications in metabolomic studies using chromatography-mass spectrometry. Trends Analyt Chem 2021. [DOI: 10.1016/j.trac.2020.116165] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Indexed: 12/14/2022]

Solovyev PA, Fauhl-Hassek C, Riedl J, Esslinger S, Bontempo L, Camin F. NMR spectroscopy in wine authentication: An official control perspective. Compr Rev Food Sci Food Saf 2021;20:2040-2062. [PMID: 33506593 DOI: 10.1111/1541-4337.12700] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 09/06/2020] [Revised: 11/30/2020] [Accepted: 12/23/2020] [Indexed: 12/14/2022]

Chen Z, Boves Harrington PD. Self-Optimizing Support Vector Elastic Net. Anal Chem 2020;92:15306-15316. [PMID: 33166108 DOI: 10.1021/acs.analchem.0c01506] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/29/2022]

Abstract

Chemometrics is widely used to solve various quantitative and qualitative problems in analytical chemistry. A self-optimizing chemometrics method helps scientists exploit the advantages of chemometrics. In this report, a parameter-free support vector elastic net that self-optimizes two key regularization constants, i.e., λ for L2 regularization and t for L1 regularization, is developed and referred to as the self-optimizing support vector elastic net (SOSVEN). Response surface modeling (RSM) and bootstrapped Latin partitions (BLPs) are incorporated for the optimization. Responses at a set of design points over the ranges of the two factors are evaluated with an internal BLP validation using a calibration set. A 2-dimensional interpolation with a cubic spline fits a response surface to determine the condition that gives the best estimated response. The SOSVEN with RSM performed comparably to the one tuned by grid search, while the RSM was more efficient. The developed SOSVEN was compared with two parameter-free chemometrics methods: super partial least-squares regression (sPLSR) and super support vector regression (sSVR) for calibration, and sPLS-discriminant analysis (sPLS-DA) and support vector classification (SVC) for classification. For calibration, the SOSVEN with RSM worked as well as or better than the other two self-optimizing methods in evaluations using meat and hemp oil data sets. For classification, a reference wine data set and mass spectra of different marijuana extracts were used. The three classifiers performed similarly in identifying the cultivars of wines, with nearly 98% accuracy. The SOSVEN significantly outperformed sPLS-DA and SVC in classifying the mass spectra of marijuana extracts, with an overall accuracy of 97%. These results demonstrated the excellent abilities of SOSVEN for classification and calibration.
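
The abstract names bootstrapped Latin partitions without describing them. The sketch below reflects one common reading of the idea, stated here as an assumption: within each bootstrap the objects are dealt into partitions that preserve class proportions, every object is predicted exactly once, and the whole procedure is repeated with fresh random partitions. The function names are hypothetical and the published algorithm may differ in detail.

```python
import random
from collections import defaultdict

def latin_partitions(labels, n_partitions, rng):
    """Deal indices into n_partitions test sets, class by class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    partitions = [[] for _ in range(n_partitions)]
    for members in by_class.values():
        rng.shuffle(members)
        # Round-robin dealing keeps each class spread evenly over partitions.
        for position, index in enumerate(members):
            partitions[position % n_partitions].append(index)
    return partitions

def bootstrapped_latin_partitions(labels, n_partitions=4, n_bootstraps=10, seed=0):
    """Yield (bootstrap, train_indices, test_indices) for every partition."""
    rng = random.Random(seed)
    for bootstrap in range(n_bootstraps):
        for test in latin_partitions(labels, n_partitions, rng):
            held_out = set(test)
            train = [i for i in range(len(labels)) if i not in held_out]
            yield bootstrap, train, sorted(test)

# Toy labels (invented): 12 objects of class "a" and 8 of class "b".
toy_labels = ["a"] * 12 + ["b"] * 8
splits = list(bootstrapped_latin_partitions(toy_labels, n_partitions=4, n_bootstraps=3))
```

Pooling the predictions within a bootstrap gives one performance estimate per bootstrap, so the spread across bootstraps can be reported alongside the average.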

Wang Y, Harrington PDB, Chen P. Metabolomic profiling and comparison of major cinnamon species using UHPLC-HRMS. Anal Bioanal Chem 2020;412:7669-7681. [PMID: 32875369 DOI: 10.1007/s00216-020-02904-1] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 07/13/2020] [Revised: 08/11/2020] [Accepted: 08/19/2020] [Indexed: 01/08/2023]

Kucheryavskiy S, Zhilin S, Rodionova O, Pomerantsev A. Procrustes Cross-Validation—A Bridge between Cross-Validation and Independent Validation Sets. Anal Chem 2020;92:11842-11850. [DOI: 10.1021/acs.analchem.0c02175] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 01/06/2023]

Al-Hetlani E, Halámková L, Amin MO, Lednev IK. Differentiating smokers and nonsmokers based on Raman spectroscopy of oral fluid and advanced statistics for forensic applications. Journal of Biophotonics 2020;13:e201960123. [PMID: 31702875 DOI: 10.1002/jbio.201960123] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 09/19/2019] [Revised: 10/26/2019] [Accepted: 11/06/2019] [Indexed: 06/10/2023]

Chen Z, de Boves Harrington P. Automatic soft independent modeling for class analogies. Anal Chim Acta 2019;1090:47-56. [PMID: 31655645 DOI: 10.1016/j.aca.2019.09.035] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 06/15/2019] [Revised: 09/10/2019] [Accepted: 09/11/2019] [Indexed: 01/19/2023]

Chen Z, Harrington PDB. Pipeline for High-Throughput Modeling of Marijuana and Hemp Extracts. Anal Chem 2019;91:14489-14497. [PMID: 31660729 DOI: 10.1021/acs.analchem.9b03290] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 01/10/2023]

Abstract

Authentication of Cannabis products is important for assuring the quality of manufacturing, given increasing consumption and regulation. In this report, a two-stage pipeline was developed for high-throughput screening and chemotyping of spectra from two sets of botanical extracts from the Cannabis genus. The first set contains different marijuana samples with higher concentrations of tetrahydrocannabinol (THC). The other set includes samples from hemp, a variety of Cannabis sativa with a THC concentration below 0.3%. The first stage applies class modeling to determine whether spectra belong to marijuana or hemp and to reject novel spectra that may be neither. An automatic soft independent modeling of class analogy (aSIMCA) that self-optimizes the number of principal components and the decision threshold is used in the first stage to achieve excellent efficiency and efficacy. Once spectra are recognized by aSIMCA as marijuana or hemp, they are routed to the appropriate classifiers in the second stage for chemotyping, i.e., assigning the spectra to different chemotypes so that the pharmacological properties and cultivars of the samples can be recognized. Three multivariate classifiers, a fuzzy rule building expert system (FuRES), super partial least-squares-discriminant analysis (sPLS-DA), and support vector machine tree type entropy (SVMtreeH), are employed for chemotyping. The discriminant ability of the pipeline was evaluated with different spectral data sets of these two groups of botanical samples, including proton nuclear magnetic resonance, mass, and ultraviolet spectra. All evaluations gave good results with accuracies greater than 95%, which demonstrated the promise of the pipeline for automated high-throughput screening and chemotyping of marijuana and hemp, as well as other botanical products.

Tang Y, Harrington PB. Noninteger Root Transformations for Preprocessing Nanoelectrospray Ionization High-Resolution Mass Spectra for the Classification of Cannabis. Anal Chem 2019;91:1328-1334. [PMID: 30565911 DOI: 10.1021/acs.analchem.8b03145] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]

Chen Z, de Boves Harrington P, Baugh SF. High-Throughput Chemotyping of Cannabis and Hemp Extracts Using an Ultraviolet Microplate Reader and Multivariate Classifiers. Journal of Analysis and Testing 2018. [DOI: 10.1007/s41664-018-0075-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 12/17/2022]

On Splitting Training and Validation Set: A Comparative Study of Cross-Validation, Bootstrap and Systematic Sampling for Estimating the Generalization Performance of Supervised Learning. Journal of Analysis and Testing 2018;2:249-262. [PMID: 30842888 PMCID: PMC6373628 DOI: 10.1007/s41664-018-0068-2] [Citation(s) in RCA: 194] [Impact Index Per Article: 32.3] [Received: 06/04/2018] [Revised: 10/08/2018] [Accepted: 10/12/2018] [Indexed: 11/15/2022]

Abstract

Model validation is the most important part of building a supervised model. Building a model with good generalization performance requires a sensible data splitting strategy, which is crucial for model validation. In this study, we conducted a comparative study of various reported data splitting methods. The MixSim model was employed to generate nine simulated datasets with different probabilities of misclassification and variable sample sizes. Then partial least squares for discriminant analysis and support vector machines for classification were applied to these datasets. Data splitting methods tested included variants of cross-validation, bootstrapping, bootstrapped Latin partition, the Kennard-Stone algorithm (K-S) and the sample set partitioning based on joint X–Y distances algorithm (SPXY). These methods were employed to split the data into training and validation sets. The estimated generalization performances from the validation sets were then compared with the ones obtained from the blind test sets, which were generated from the same distribution but were unseen by the training/validation procedure used in model construction. The results showed that the size of the data is the deciding factor for the quality of the generalization performance estimated from the validation set. We found that there was a significant gap between the performance estimated from the validation set and the one from the test set for all the data splitting methods employed on small datasets. Such disparity decreased when more samples were available for training/validation, because the models were then moving towards approximations of the central limit theorem for the simulated datasets used. We also found that having too many or too few samples in the training set had a negative effect on the estimated model performance, suggesting that a good balance between the sizes of the training set and validation set is necessary for a reliable estimation of model performance. We also found that systematic sampling methods such as K-S and SPXY generally gave very poor estimates of model performance, most likely because they are designed to take the most representative samples first and thus leave a rather poorly representative sample set for model performance estimation.
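
To make the last point concrete, here is a minimal sketch of the Kennard-Stone selection rule as it is usually described: start from the two most distant samples, then repeatedly add the sample whose nearest already-selected neighbour is farthest away. The function name and the toy points are hypothetical, and Euclidean distance is an assumption.

```python
import math

def kennard_stone(points, n_select):
    """Return indices of n_select points chosen by the Kennard-Stone rule."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of points")
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Seed the selection with the two most distant points.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in (first, second)]
    while len(selected) < n_select:
        # Pick the point whose closest selected point is farthest away.
        nxt = max(remaining, key=lambda i: min(dist[i][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected

# Toy points on a line (invented for illustration).
toy_points = [(0.0, 0.0), (10.0, 0.0), (5.0, 0.0), (1.0, 0.0), (9.0, 0.0)]
training_indices = kennard_stone(toy_points, 3)
validation_indices = [i for i in range(len(toy_points)) if i not in training_indices]
```

In this toy case the rule takes the two extremes and the midpoint for training, so the leftover validation points sit right next to already-selected samples. That is one way to see why a validation set built from the remainder can misjudge model performance.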

Classification of samples from NMR-based metabolomics using principal components analysis and partial least squares with uncertainty estimation. Anal Bioanal Chem 2018;410:6305-6319. [PMID: 30043113 DOI: 10.1007/s00216-018-1240-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 02/08/2018] [Revised: 06/14/2018] [Accepted: 07/02/2018] [Indexed: 12/18/2022]

Rodríguez-Pérez R, Fernández L, Marco S. Overoptimism in cross-validation when using partial least squares-discriminant analysis for omics data: a systematic study. Anal Bioanal Chem 2018;410:5981-5992. [PMID: 29959482 DOI: 10.1007/s00216-018-1217-1] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 04/11/2018] [Revised: 06/13/2018] [Accepted: 06/21/2018] [Indexed: 01/29/2023]

Abstract

Advances in analytical instrumentation have provided the possibility of examining thousands of genes, peptides, or metabolites in parallel. However, the costly and time-consuming data acquisition process causes a general lack of samples. From a data analysis perspective, omics data are characterized by high dimensionality and small sample counts. In many scenarios, the analytical aim is to differentiate between two conditions or classes by combining an analytical method with a tailored qualitative predictive model built from available examples collected in a dataset. For this purpose, partial least squares-discriminant analysis (PLS-DA) is frequently employed in omics research. Recently, there has been growing concern about the uncritical use of this method, since it is prone to overfitting and may aggravate problems of false discoveries. In many applications involving a small number of subjects or samples, predictive model performance estimation is based only on cross-validation (CV) results, with a strong preference for reporting results using leave one out (LOO). The combination of PLS-DA for high-dimensionality data and small sample conditions, together with a weak validation methodology, is a recipe for unreliable estimations of model performance. In this work, we present a systematic study of the impact of the dataset size, the dimensionality, and the CV technique used on PLS-DA overoptimism when performance estimation is done in cross-validation. Firstly, by using synthetic data generated from the same probability distribution and with randomly assigned binary labels, we obtained a dataset where the true classification rate (CR) is 50%. As expected, our results confirm that internal validation provides overoptimistic estimations of the classification accuracy (i.e., overfitting). We have characterized the CR estimator in terms of bias and variance depending on the internal CV technique used and the sample-to-dimensionality ratio. In small sample conditions, due to the large bias and variance of the estimator, the occurrence of extremely good CRs is common. We have found that overfitting peaks when the sample size in the training subset approaches the feature vector dimensionality minus one. In these conditions, the models are neither under- nor overdetermined, with a unique solution. This effect is particularly intense for LOO and peaks higher in small sample conditions. Overoptimism decreases beyond this point, where the abundance of noise produces a regularization effect leading to less complex models. In terms of overfitting, our study ranks CV methods as follows: bootstrap produces the most accurate estimator of the CR, followed by bootstrapped Latin partitions, random subsampling, and K-fold; finally, the very popular LOO provides the worst results. Simulation results are further confirmed in real datasets from mass spectrometry and microarrays.
