1
Saunders A, Harrington PDB. Advances in Activity/Property Prediction from Chemical Structures. Crit Rev Anal Chem 2024; 54:135-147. [PMID: 35482792 DOI: 10.1080/10408347.2022.2066461] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Abstract
Recent technological advances in AI modeling of molecular property databases have significantly expanded the opportunities for drug design and development. Quantitative structure-activity relationships (QSARs) are shown to provide more accurate predictions with regard to biological activity as well as toxicological assessment. By using a combination of in-silico models or by combining disparate structure-activity databases, researchers have been able to improve accuracy for a variety of drug discovery and analysis methods, generating viable compounds which, in certain cases, can be synthesized and further studied in vitro to find candidates for potential development. Additionally, the development of compounds found to be toxic can be discontinued earlier, allowing alternative routes to be evaluated and preventing wasted time and resources. Although the progress that has been made is tremendous, expert review is still necessary for most in-silico generated predictions. Regardless, the scientific community continues to move ever closer to completely automated drug discovery and evaluation.
Affiliation(s)
- Arianne Saunders
- Department of Chemistry and Biochemistry, Ohio University, Athens, Ohio, USA
2
Tabashum T, Snyder RC, O'Brien MK, Albert MV. Machine Learning Models for Parkinson Disease: Systematic Review. JMIR Med Inform 2024; 12:e50117. [PMID: 38771237 PMCID: PMC11112052 DOI: 10.2196/50117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Revised: 02/12/2024] [Accepted: 04/01/2024] [Indexed: 05/22/2024] Open
Abstract
Background With the increasing availability of data, computing resources, and easier-to-use software libraries, machine learning (ML) is increasingly used in disease detection and prediction, including for Parkinson disease (PD). Despite the large number of studies published every year, very few ML systems have been adopted for real-world use. In particular, a lack of external validity may result in poor performance of these systems in clinical practice. Additional methodological issues in ML design and reporting can also hinder clinical adoption, even for applications that would benefit from such data-driven systems. Objective To sample the current ML practices in PD applications, we conducted a systematic review of studies published in 2020 and 2021 that used ML models to diagnose PD or track PD progression. Methods We conducted a systematic literature review in accordance with PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines in PubMed between January 2020 and April 2021, using the following exact string: "Parkinson's" AND ("ML" OR "prediction" OR "classification" OR "detection" or "artificial intelligence" OR "AI"). The search resulted in 1085 publications. After a search query and review, we found 113 publications that used ML for the classification or regression-based prediction of PD or PD-related symptoms. Results Only 65.5% (74/113) of studies used a holdout test set to avoid potentially inflated accuracies, and approximately half (25/46, 54%) of the studies without a holdout test set did not state this as a potential concern. Surprisingly, 38.9% (44/113) of studies did not report on how or if models were tuned, and an additional 27.4% (31/113) used ad hoc model tuning, which is generally frowned upon in ML model optimization. Only 15% (17/113) of studies performed direct comparisons of results with other models, severely limiting the interpretation of results. 
Conclusions This review highlights the notable limitations of current ML systems and techniques that may contribute to a gap between reported performance in research and the real-life applicability of ML models aiming to detect and predict diseases such as PD.
Affiliation(s)
- Thasina Tabashum
- Department of Computer Science and Engineering, University of North Texas, Denton, TX, United States
| | - Robert Cooper Snyder
- Department of Computer Science and Engineering, University of North Texas, Denton, TX, United States
| | - Megan K O'Brien
- Technology and Innovation Hub, Shirley Ryan AbilityLab, Chicago, IL, United States
- Department of Physical Medicine & Rehabilitation, Northwestern University, Chicago, IL, United States
| | - Mark V Albert
- Department of Computer Science and Engineering, University of North Texas, Denton, TX, United States
- Department of Biomedical Engineering, University of North Texas, Denton, TX, United States
3
Majd E, Xing L, Zhang X. Segmentation of patients with small cell lung cancer into responders and non-responders using the optimal cross-validation technique. BMC Med Res Methodol 2024; 24:83. [PMID: 38589775 PMCID: PMC11000309 DOI: 10.1186/s12874-024-02185-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/04/2022] [Accepted: 02/20/2024] [Indexed: 04/10/2024] Open
Abstract
BACKGROUND The timing of treating cancer patients is an essential factor in the efficacy of treatment. Patients who will not respond to current therapy should therefore receive a different treatment as early as possible. Machine learning models can be built to classify responders and non-responders. Such classification models predict the probability of a patient being a responder. Most methods use a probability threshold of 0.5 to convert the probabilities into binary group membership. However, the cutoff of 0.5 is not always the optimal choice. METHODS In this study, we propose a novel data-driven approach to select a better cutoff value based on the optimal cross-validation technique. To illustrate our novel method, we applied it to three clinical trial datasets of small-cell lung cancer patients. We used two different datasets to build a scoring system to segment patients. Then the models were applied to segment patients in the test data. RESULTS We found that, in test data, the predicted responders and non-responders had significantly different long-term survival outcomes. Our proposed novel method segments patients better than the standard approach using a cutoff of 0.5. Comparing clinical outcomes of responders versus non-responders, our novel method had a p-value of 0.009 with a hazard ratio of 0.668 for grouping patients using the Cox proportional hazards model and a p-value of 0.011 using the accelerated failure time model, which confirmed a significant difference between responders and non-responders. In contrast, the standard approach had a p-value of 0.194 with a hazard ratio of 0.823 using the Cox proportional hazards model and a p-value of 0.240 using the accelerated failure time model, indicating that the responders and non-responders do not differ significantly in survival. CONCLUSION In summary, our novel prediction method can successfully segment new patients into responders and non-responders. Clinicians can use our prediction to decide if a patient should receive a different treatment or stay with the current treatment.
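The idea of replacing the default 0.5 cutoff with a data-driven one can be sketched as follows. This is a minimal illustration, not the authors' optimal cross-validation procedure: it assumes Youden's J (sensitivity + specificity - 1) as the selection criterion, picks the best cutoff on each training fold, and averages the per-fold choices. All function names and the toy data are hypothetical.

```python
from statistics import mean

def youden_j(probs, labels, cutoff):
    """Sensitivity + specificity - 1 when 'responder' means prob >= cutoff."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 1)
    fn = sum(1 for p, y in zip(probs, labels) if p < cutoff and y == 1)
    tn = sum(1 for p, y in zip(probs, labels) if p < cutoff and y == 0)
    fp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 0)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens + spec - 1.0

def best_cutoff(probs, labels, grid):
    """Grid value with the highest Youden's J (ties go to the smallest cutoff)."""
    return max(grid, key=lambda c: youden_j(probs, labels, c))

def cv_cutoff(probs, labels, k=5):
    """Average of the cutoffs selected on each of k training folds."""
    grid = [i / 100 for i in range(5, 100, 5)]  # candidate cutoffs 0.05 .. 0.95
    n = len(probs)
    chosen = []
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]  # hold out every k-th sample
        chosen.append(best_cutoff([probs[i] for i in train],
                                  [labels[i] for i in train], grid))
    return mean(chosen)

# Hypothetical predicted responder probabilities and true labels (1 = responder).
toy_probs = [0.10, 0.35, 0.15, 0.40, 0.20, 0.30, 0.05, 0.45, 0.12, 0.33]
toy_labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
selected = cv_cutoff(toy_probs, toy_labels)
```

On data like this, where predicted probabilities cluster well below 0.5, a fixed 0.5 cutoff would label every patient a non-responder, which is the situation a data-driven cutoff is meant to avoid.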
Affiliation(s)
- Elham Majd
- Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada
| | - Li Xing
- Department of Mathematics and Statistics, University of Saskatchewan, Saskatoon, SK, Canada
| | - Xuekui Zhang
- Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada.
4
Cao R, Hu W, Wei P, Ding Y, Bin Y, Zheng C. FFMAVP: a new classifier based on feature fusion and multitask learning for identifying antiviral peptides and their subclasses. Brief Bioinform 2023; 24:bbad353. [PMID: 37861174 DOI: 10.1093/bib/bbad353] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2023] [Revised: 08/25/2023] [Accepted: 09/06/2023] [Indexed: 10/21/2023] Open
Abstract
Antiviral peptides (AVPs) are widely found in animals and plants, with high specificity and strong sensitivity to drug-resistant viruses. However, due to the great heterogeneity of different viruses, most of the AVPs have specific antiviral activities. Therefore, it is necessary to identify the specific activities of AVPs on virus types. Most existing studies only identify AVPs, with only a few studies identifying subclasses by training multiple binary classifiers. We develop a two-stage prediction tool named FFMAVP that can simultaneously predict AVPs and their subclasses. In the first stage, we identify whether a peptide is AVP or not. In the second stage, we predict the six virus families and eight species specifically targeted by AVPs based on two multiclass tasks. Specifically, the feature extraction module in the two-stage task of FFMAVP adopts the same neural network structure, in which one branch extracts features based on amino acid feature descriptors and the other branch extracts sequence features. Then, the two types of features are fused for the following task. Considering the correlation between the two tasks of the second stage, a multitask learning model is constructed to improve the effectiveness of the two multiclass tasks. In addition, to improve the effectiveness of the second stage, the network parameters trained through the first-stage data are used to initialize the network parameters in the second stage. As a demonstration, the cross-validation results, independent test results and visualization results show that FFMAVP achieves great advantages in both stages.
Affiliation(s)
- Ruifen Cao
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Computer Science and Technology, Anhui University
| | - Weiling Hu
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Computer Science and Technology, Anhui University
| | - Pijing Wei
- Institutes of Physical Science and Information Technology, Anhui University
| | - Yun Ding
- Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Artificial Intelligence, Anhui University
| | - Yannan Bin
- Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education and Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University
| | - Chunhou Zheng
- Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Artificial Intelligence, Anhui University
5
Zhou W, Liu Y, Li Y, Kong S, Wang W, Ding B, Han J, Mou C, Gao X, Liu J. TriNet: A tri-fusion neural network for the prediction of anticancer and antimicrobial peptides. PATTERNS (NEW YORK, N.Y.) 2023; 4:100702. [PMID: 36960450 PMCID: PMC10028424 DOI: 10.1016/j.patter.2023.100702] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2022] [Revised: 12/20/2022] [Accepted: 02/03/2023] [Indexed: 03/04/2023]
Abstract
The accurate identification of anticancer peptides (ACPs) and antimicrobial peptides (AMPs) remains a computational challenge. We propose a tri-fusion neural network termed TriNet for the accurate prediction of both ACPs and AMPs. The framework first defines three kinds of features to capture the peptide information contained in serial fingerprints, sequence evolutions, and physicochemical properties, which are then fed into three parallel modules: a convolutional neural network module enhanced by channel attention, a bidirectional long short-term memory module, and an encoder module for training and final classification. To achieve a better training effect, TriNet is trained via a training approach using iterative interactions between the samples in the training and validation datasets. TriNet is tested on multiple challenging ACP and AMP datasets and exhibits significant improvements over various state-of-the-art methods. The web server and source code of TriNet are respectively available at http://liulab.top/TriNet/server and https://github.com/wanyunzh/TriNet.
Affiliation(s)
- Wanyun Zhou
- SDU-ANU Joint Science College, Shandong University (Weihai), Weihai 264209, China
| | - Yufei Liu
- SDU-ANU Joint Science College, Shandong University (Weihai), Weihai 264209, China
| | - Yingxin Li
- School of Mechanical, Electrical & Information Engineering, Shandong University (Weihai), Weihai 264209, China
| | - Siqi Kong
- SDU-ANU Joint Science College, Shandong University (Weihai), Weihai 264209, China
| | - Weilin Wang
- SDU-ANU Joint Science College, Shandong University (Weihai), Weihai 264209, China
| | - Boyun Ding
- SDU-ANU Joint Science College, Shandong University (Weihai), Weihai 264209, China
| | - Jiyun Han
- School of Mathematics and Statistics, Shandong University (Weihai), Weihai 264209, China
| | - Chaozhou Mou
- School of Mathematics and Statistics, Shandong University (Weihai), Weihai 264209, China
| | - Xin Gao
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
| | - Juntao Liu
- School of Mathematics and Statistics, Shandong University (Weihai), Weihai 264209, China
6
Navani V, Meyers DE, Ruan Y, Boyne DJ, O'Sullivan DE, Dolter S, Grosjean HA, Stukalin I, Heng DYC, Morris DG, Brenner DR, Sangha R, Cheung WY, Pabani A. Lung Immune Therapy Evaluation (LITE) Risk, a Novel Prognostic Model for Patients With Advanced Non-Small Cell Lung Cancer Treated With Immune Checkpoint Blockade. Clin Lung Cancer 2023; 24:e152-e159. [PMID: 36774234 DOI: 10.1016/j.cllc.2022.12.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2022] [Revised: 10/28/2022] [Accepted: 12/26/2022] [Indexed: 01/21/2023]
Abstract
INTRODUCTION/BACKGROUND Immune checkpoint inhibitors (ICI) have revolutionized non-small cell lung cancer (NSCLC). We aimed to identify baseline characteristics, that are prognostic factors for overall survival (OS) in patients with NSCLC treated with ICI monotherapy, in order to derive the Lung Immune Therapy Evaluation (LITE) risk, a prognostic model. MATERIALS AND METHODS Multi-center observational cohort study of patients with advanced NSCLC that received ≥1 dose of ICI monotherapy. The training set (n=342) consisted of patients with NSCLC who received first line ICI. The test set (n=153) used for external validation was a discrete cohort of patients who received second line ICI. 20 candidate prognostic factors were examined. Penalized Cox regression was used for variable selection. Multiple imputation was used to address missingness. RESULTS Three baseline characteristics populated the final model: ECOG (0, 1 or ≥2), lactate dehydrogenase>upper limit of normal, and derived neutrophil to lymphocyte ratio ≥3. Patients were parsed into 3 risk groups; favorable (n=146, risk score 0-1), intermediate (n=101, risk score 2) and poor (n=95, risk score ≥3). The c-statistic of the training cohort was 0.702 and 0.694 after bootstrapping. The test cohort c-statistic was 0.664. The median OS for favorable, intermediate and poor LITE risk were; 28.3 months, 9.1 months and 2.1 months respectively. Improving LITE risk group was associated with improved OS, intermediate vs favorable HR 2.08 (95%CI 1.46-2.97, P < .001); poor vs favorable HR 5.21 (95%CI 3.69-7.34, P < .001). CONCLUSION A simple prognostic model, utilizing accessible clinical data, can discriminate survival outcomes in patients with advanced NSCLC.
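The three-factor score and its risk groups can be sketched as below. The abstract names the factors and the group boundaries (0-1, 2, ≥3) but not the points assigned to each factor, so the weights here (ECOG 0/1/≥2 scoring 0/1/2, and one point each for elevated LDH and dNLR ≥3) are an assumption made only for illustration; the function names are hypothetical.

```python
def lite_risk_score(ecog, ldh, ldh_upper_limit, dnlr):
    """Sum of points over the three baseline factors.

    ASSUMPTION: the point weights are not stated in the abstract. This sketch
    uses ECOG 0 -> 0, 1 -> 1, >=2 -> 2, plus 1 point for LDH above the upper
    limit of normal and 1 point for derived neutrophil-to-lymphocyte ratio >= 3.
    """
    score = min(ecog, 2)            # ECOG performance status, capped at 2
    if ldh > ldh_upper_limit:       # lactate dehydrogenase > upper limit of normal
        score += 1
    if dnlr >= 3:                   # derived neutrophil-to-lymphocyte ratio
        score += 1
    return score

def lite_risk_group(score):
    """Map a score to the risk groups named in the abstract."""
    if score <= 1:
        return "favorable"
    if score == 2:
        return "intermediate"
    return "poor"

# Hypothetical patient: ECOG 1, LDH 300 U/L against a limit of 250 U/L, dNLR 1.8.
example_group = lite_risk_group(lite_risk_score(1, 300.0, 250.0, 1.8))
```

Any real use would need the published weights and units; the sketch only shows how a small additive score maps onto the three groups.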
Affiliation(s)
- Vishal Navani
- Department of Medical Oncology, Tom Baker Cancer Centre, Calgary, Alberta, Canada; Department of Oncology, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada.
| | - Daniel E Meyers
- Department of Medical Oncology, Tom Baker Cancer Centre, Calgary, Alberta, Canada
| | - Yibing Ruan
- Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Cancer Epidemiology and Prevention Research, Alberta Health Services, Calgary, Alberta, Canada; Forzani & MacPhail Colon Cancer Screening Centre, University of Calgary, Calgary, Alberta, Canada
| | - Devon J Boyne
- Department of Oncology, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Cancer Epidemiology and Prevention Research, Alberta Health Services, Calgary, Alberta, Canada
| | - Dylan E O'Sullivan
- Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Cancer Epidemiology and Prevention Research, Alberta Health Services, Calgary, Alberta, Canada; Forzani & MacPhail Colon Cancer Screening Centre, University of Calgary, Calgary, Alberta, Canada
| | - Samantha Dolter
- Department of Medical Oncology, Tom Baker Cancer Centre, Calgary, Alberta, Canada
| | - Heidi Ai Grosjean
- Department of Medical Oncology, Tom Baker Cancer Centre, Calgary, Alberta, Canada
| | - Igor Stukalin
- Department of Medical Oncology, Tom Baker Cancer Centre, Calgary, Alberta, Canada
| | - Daniel Y C Heng
- Department of Medical Oncology, Tom Baker Cancer Centre, Calgary, Alberta, Canada; Department of Oncology, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada
| | - Don G Morris
- Department of Medical Oncology, Tom Baker Cancer Centre, Calgary, Alberta, Canada; Department of Oncology, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada
| | - Darren R Brenner
- Department of Oncology, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Cancer Epidemiology and Prevention Research, Alberta Health Services, Calgary, Alberta, Canada
| | - Randeep Sangha
- Department of Medical Oncology, Cross Cancer Institute, Edmonton, Alberta, Canada
| | - Winson Y Cheung
- Department of Medical Oncology, Tom Baker Cancer Centre, Calgary, Alberta, Canada; Department of Oncology, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada
| | - Aliyah Pabani
- Department of Medical Oncology, Tom Baker Cancer Centre, Calgary, Alberta, Canada; Department of Oncology, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada
7
Suchana S, Passeport E. Implications of polar organic chemical integrative sampler for high membrane sorption and suitability of polyethersulfone as a single-phase sampler. THE SCIENCE OF THE TOTAL ENVIRONMENT 2022; 850:157898. [PMID: 35952872 DOI: 10.1016/j.scitotenv.2022.157898] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/13/2021] [Revised: 08/03/2022] [Accepted: 08/04/2022] [Indexed: 06/15/2023]
Abstract
Polar organic chemical integrative sampler (POCIS) contains sorbent, which is typically enclosed between two polyethersulfone (PES) membranes. A significant PES uptake is reported for many contaminants, yet aqueous concentration is mainly correlated with the sorbent uptake using first-order kinetics. Under high PES sorption, the first-order kinetics often provide an erroneous sampling rate for the sorbent phase due to increased membrane resistance. This work evaluated the uptake of four high PES sorbing chemicals, i.e., three Cl- and CH3-substituted nitrobenzenes and one chlorinated aniline, using POCIS, and the potential of a single-phase PES sampler, using laboratory experiments. POCIS calibration results demonstrated that both sorbent and membrane had similar affinity for the target compounds. A rapid PES sorption occurred in the earlier days (<7 days) followed by a gradual increase in the PES phase concentration (equilibrium not achieved after 60 days). In particular, the membrane was the primary sink for 3,4-dichloroaniline and 3,4-dichloronitrobenzene for up to 14 and 31 days, respectively. On the other hand, the single-phase PES sampler showed similar mass uptake as POCIS and reached equilibrium within 19 days under static conditions, indicating its potential suitability in the equilibrium regime. The PES-water partition coefficients of the target compounds were between 1.2 and 6.5 L/g. Finally, we present a poly-parameter linear free energy relationship (pp-LFER) using published data to predict the PES-water partition coefficients. The pp-LFER models showed moderate predictability, as indicated by adjusted R² values between 0.7 and 0.9 for both internal and external data sets consisting of a wide range of hydrophobic and hydrophilic compounds (-0.1 ≤ log KOW ≤ 7.4). 
The proposed pp-LFER model can be used to screen high PES-sorbing chemicals to increase the reliability and accuracy of aqueous concentration prediction from POCIS sampling and to select the most appropriate sampling approach for new compounds.
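The two regimes mentioned above lead to two different back-calculations of the aqueous concentration, sketched here under common passive-sampling assumptions rather than as the authors' exact procedure. In the linear (first-order, integrative) regime the time-weighted average concentration is the accumulated mass divided by sampling rate times deployment time; at equilibrium it is the sampler-phase concentration divided by the partition coefficient. Function names, units, and the numeric inputs are hypothetical.

```python
def twa_concentration(mass_ng, sampling_rate_l_per_day, days):
    """Time-weighted average water concentration (ng/L) in the linear uptake regime.

    Assumes C_w = m_s / (R_s * t), the usual integrative-sampler relation.
    """
    return mass_ng / (sampling_rate_l_per_day * days)

def equilibrium_concentration(c_pes_ng_per_g, k_pes_water_l_per_g):
    """Water concentration (ng/L) from a PES sampler that has reached equilibrium.

    Assumes C_w = C_PES / K_PES-water, with K expressed in L/g.
    """
    return c_pes_ng_per_g / k_pes_water_l_per_g

# Hypothetical inputs: 50 ng accumulated over 14 days at 0.2 L/day,
# and 130 ng/g in PES with a partition coefficient of 6.5 L/g.
c_linear = twa_concentration(50.0, 0.2, 14)
c_equilibrium = equilibrium_concentration(130.0, 6.5)
```

If the membrane retains a large share of the compound, the sorbent mass used in the first relation understates uptake, which is the source of the erroneous sampling rates described in the abstract.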
Affiliation(s)
- Shamsunnahar Suchana
- Department of Civil & Mineral Engineering, University of Toronto, 35 St. George Street, Toronto, Ontario, M5S 1A4, Canada
| | - Elodie Passeport
- Department of Civil & Mineral Engineering, University of Toronto, 35 St. George Street, Toronto, Ontario, M5S 1A4, Canada; Department of Chemical Engineering & Applied Chemistry, University of Toronto, 200 College Street, Toronto, Ontario M5S 3E5, Canada.
8
Feasibility of Application of Near Infrared Reflectance (NIR) Spectroscopy for the Prediction of the Chemical Composition of Traditional Sausages. APPLIED SCIENCES-BASEL 2021. [DOI: 10.3390/app112311282] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/16/2022]
Abstract
In the present study, the potential of near infrared reflectance (NIR) spectroscopy for estimating the chemical composition of traditional (village-style) sausages was examined. The chemical composition (moisture, ash, protein, and fat content) was determined by standard reference methods. For the development of the calibration model, 39 samples of traditional fresh sausages were used, while for external validation, 10 samples of sausages were used. The correlation coefficients of calibration (RMSEC) and standard errors (SEC) were 0.92 and 1.58 (moisture), 0.77 and 0.18 (ash), 0.87 and 0.89 (protein), and 0.93 and 1.73 (fat). The cross-validation correlation coefficients (RMSECV) and standard errors (SECV) were 0.86 and 2.13 (moisture), 0.56 and 0.26 (ash), 0.78 and 1.17 (protein), and 0.88 and 2.17 (fat). The results of the calibration model showed that NIR spectroscopy can be applied to estimate with very good precision the fat content of traditional village-style sausages, whereas moisture and protein content can be estimated with good accuracy. The external validation confirmed the ability of NIR spectroscopy to predict the chemical composition of sausages.
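Calibration statistics of this kind compare reference-method values with model predictions. The sketch below shows, under generic definitions, how a correlation coefficient and a root-mean-square error are computed; the two are distinct quantities, and the exact error definition used in the study (for example, any bias or degrees-of-freedom correction in SEC) is not given in the abstract. Function names and the toy values are hypothetical.

```python
from math import sqrt

def rmse(reference, predicted):
    """Root-mean-square error between reference values and predictions."""
    n = len(reference)
    return sqrt(sum((r - p) ** 2 for r, p in zip(reference, predicted)) / n)

def pearson_r(reference, predicted):
    """Pearson correlation coefficient between reference values and predictions."""
    n = len(reference)
    mean_r = sum(reference) / n
    mean_p = sum(predicted) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(reference, predicted))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / sqrt(var_r * var_p)

# Hypothetical fat contents (%): reference method vs. NIR prediction.
fat_reference = [18.2, 22.5, 25.1, 30.4, 27.8]
fat_predicted = [19.0, 21.7, 26.0, 29.1, 28.5]
fat_r = pearson_r(fat_reference, fat_predicted)
fat_rmse = rmse(fat_reference, fat_predicted)
```

Applying the same two functions to predictions made on held-out samples gives the cross-validation or external-validation counterparts of the calibration statistics.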
9
Chen Z, de Boves Harrington P, Griffin V, Griffin T. In Situ Determination of Cannabidiol in Hemp Oil by Near-Infrared Spectroscopy. JOURNAL OF NATURAL PRODUCTS 2021; 84:2851-2857. [PMID: 34784219 DOI: 10.1021/acs.jnatprod.1c00557] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
Cannabidiol (CBD, 1) is an active component of hemp oil and many other products that offers diverse health benefits. Near-infrared spectroscopy (NIRS) coupled with chemometrics was utilized to quantify the CBD (1) concentration in the hemp oil through the containing glass vial. NIRS provided a fast and cost-effective tool to measure chemical profiles for the hemp oil samples with various concentrations of CBD (1) and its acid precursor, i.e., cannabidiolic acid (CBDA, 2). The measured NIR spectra were transformed by using a Savitzky-Golay first-derivative filter to remove baseline drift. Two self-optimizing chemometric methods, super partial least-squares regression (sPLSR) and self-optimizing support vector elastic net (SOSVEN), were applied to construct automatically multivariate models that predict the concentrations of CBD (1) and total CBD (sum of 1 and 2 concentrations) of the hemp oil samples. The SOSVEN had validation errors of 6.4 mg/mL for the prediction of CBD (1) concentration and 6.6 mg/mL for the prediction of total CBD concentration, which are significantly lower than the errors given by sPLSR. Other than the lower validation errors, SOSVEN has another advantage over sPLSR in that it builds a multivariate model while selecting spectral features at the same time. These results demonstrated that NIR spectroscopy combined with chemometrics can be used as a rapid and cost-effective approach to determine the CBD (1) and total CBD concentrations in hemp oil. Manufacturers would benefit from the fast and reliable approach in quality assurance.
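Why a Savitzky-Golay first-derivative filter removes baseline drift can be seen in a small sketch. The abstract does not state the window length or polynomial order, so a 5-point window with a quadratic fit is assumed here; for that case the derivative at the window center is the weighted sum with coefficients (-2, -1, 0, 1, 2) divided by 10 times the point spacing. Because the coefficients sum to zero, any constant offset added to the spectrum cancels. The function name and the toy spectrum are hypothetical.

```python
# 5-point, quadratic Savitzky-Golay first-derivative coefficients (assumed window).
SG5_D1 = (-2, -1, 0, 1, 2)

def sg_first_derivative(y, spacing=1.0):
    """First derivative at each interior point (two points are lost at each end)."""
    out = []
    for i in range(2, len(y) - 2):
        acc = sum(c * y[i + k] for c, k in zip(SG5_D1, range(-2, 3)))
        out.append(acc / (10.0 * spacing))
    return out

# Hypothetical spectrum and the same spectrum with a constant baseline offset.
spectrum = [float(x * x) for x in range(7)]
shifted = [v + 5.0 for v in spectrum]
d_spectrum = sg_first_derivative(spectrum)
d_shifted = sg_first_derivative(shifted)  # identical to d_spectrum
```

A sloping baseline becomes a constant after this step rather than vanishing, which is one reason derivative preprocessing is often followed by a multivariate model rather than used alone.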
Affiliation(s)
- Zewei Chen
- Clippinger Laboratories, Department of Chemistry and Biochemistry, Ohio University, Athens, Ohio 45701, United States
| | - Peter de Boves Harrington
- Clippinger Laboratories, Department of Chemistry and Biochemistry, Ohio University, Athens, Ohio 45701, United States
| | - Veronica Griffin
- G2 Analytical, PO Box 851, Wingate, North Carolina 28174, United States
| | - Todd Griffin
- G2 Analytical, PO Box 851, Wingate, North Carolina 28174, United States
10
Wang L, Vendrell-Dones MO, Deriu C, Doğruer S, de B Harrington P, McCord B. Multivariate Analysis Aided Surface-Enhanced Raman Spectroscopy (MVA-SERS) Multiplex Quantitative Detection of Trace Fentanyl in Illicit Drug Mixtures Using a Handheld Raman Spectrometer. APPLIED SPECTROSCOPY 2021; 75:1225-1236. [PMID: 34318708 DOI: 10.1177/00037028211032930] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 06/13/2023]
Abstract
Recently there has been an upsurge in reports that illicit seizures of cocaine and heroin have been adulterated with fentanyl. Surface-enhanced Raman spectroscopy (SERS) provides a useful alternative to current screening procedures that permits detection of trace levels of fentanyl in mixtures. Samples are solubilized and allowed to interact with aggregated colloidal nanostars to produce a rapid and sensitive assay. In this study, we present the quantitative determination of fentanyl in heroin and cocaine by SERS, using a point-and-shoot handheld Raman system. Our protocol is optimized to detect pure fentanyl down to 0.20 ± 0.06 ng/mL and can also distinguish pure cocaine and heroin at ng/mL levels. Multiplex analysis of mixtures is enabled by combining SERS detection with principal component analysis and super partial least squares regression discriminant analysis (SPLS-DA), which allow for the determination of fentanyl as low as 0.05% in simulated seized heroin and 0.10% in simulated seized cocaine samples.
Affiliation(s)
- Ling Wang
- Department of Chemistry and Biochemistry, Florida International University, Miami, FL USA
| | - Mario O Vendrell-Dones
- Department of Chemistry and Biochemistry, Florida International University, Miami, FL USA
| | - Chiara Deriu
- Department of Chemistry and Biochemistry, Florida International University, Miami, FL USA
| | - Sevde Doğruer
- Department of Chemistry and Biochemistry, Florida International University, Miami, FL USA
| | | | - Bruce McCord
- Department of Chemistry and Biochemistry, Florida International University, Miami, FL USA
11
Radiomics machine learning study with a small sample size: Single random training-test set split may lead to unreliable results. PLoS One 2021; 16:e0256152. [PMID: 34383858 PMCID: PMC8360533 DOI: 10.1371/journal.pone.0256152] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 02/28/2021] [Accepted: 08/01/2021] [Indexed: 12/23/2022] Open
Abstract
This study aims to determine how randomly splitting a dataset into training and test sets affects the estimated performance of a machine learning model and its gap from the test performance under different conditions, using real-world brain tumor radiomics data. We conducted two classification tasks of different difficulty levels with magnetic resonance imaging (MRI) radiomics features: (1) "Simple" task, glioblastomas [n = 109] vs. brain metastasis [n = 58] and (2) "difficult" task, low- [n = 163] vs. high-grade [n = 95] meningiomas. Additionally, two undersampled datasets were created by randomly sampling 50% from these datasets. We performed random training-test set splitting for each dataset repeatedly to create 1,000 different training-test set pairs. For each dataset pair, the least absolute shrinkage and selection operator model was trained and evaluated using various validation methods in the training set, and tested in the test set, using the area under the curve (AUC) as an evaluation metric. The AUCs in training and testing varied among different training-test set pairs, especially with the undersampled datasets and the difficult task. The mean (±standard deviation) AUC difference between training and testing was 0.039 (±0.032) for the simple task without undersampling and 0.092 (±0.071) for the difficult task with undersampling. In a training-test set pair with the difficult task without undersampling, for example, the AUC was high in training but much lower in testing (0.882 and 0.667, respectively); in another dataset pair with the same task, however, the AUC was low in training but much higher in testing (0.709 and 0.911, respectively). When the AUC discrepancy between training and test, or generalization gap, was large, none of the validation methods helped sufficiently reduce the generalization gap. 
Our results suggest that machine learning after a single random training-test set split may lead to unreliable results in radiomics studies especially with small sample sizes.
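How much a test-set AUC can move with the choice of random split can be illustrated with a simplified sketch. It does not reproduce the study design: no LASSO model is trained, and a fixed, precomputed score stands in for model output, so only the sampling variability of the test AUC is shown. The function names, the synthetic data, and the 30% test fraction are assumptions.

```python
import random
from statistics import mean, pstdev

def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def split_aucs(scores, labels, n_splits=200, test_frac=0.3, seed=0):
    """Test-set AUC for each of n_splits random test subsets."""
    rng = random.Random(seed)
    idx = list(range(len(scores)))
    n_test = max(2, round(test_frac * len(idx)))
    results = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        test = idx[:n_test]
        y = [labels[i] for i in test]
        if 0 < sum(y) < len(y):  # AUC needs both classes in the test subset
            results.append(auc([scores[i] for i in test], y))
    return results

# Synthetic small dataset: 20 positives and 20 negatives with overlapping scores.
data_rng = random.Random(1)
labels = [1] * 20 + [0] * 20
scores = [data_rng.gauss(0.8, 1.0) for _ in range(20)] + \
         [data_rng.gauss(0.0, 1.0) for _ in range(20)]
aucs = split_aucs(scores, labels)
spread = (min(aucs), mean(aucs), max(aucs), pstdev(aucs))
```

Inspecting `spread` for a small dataset typically shows a wide range between the luckiest and unluckiest split, which is the motivation for reporting results over repeated splits rather than a single one.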
12
Pang Y, Yao L, Jhong JH, Wang Z, Lee TY. AVPIden: a new scheme for identification and functional prediction of antiviral peptides based on machine learning approaches. Brief Bioinform 2021; 22:6323205. [PMID: 34279599 DOI: 10.1093/bib/bbab263] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 03/23/2021] [Revised: 06/07/2021] [Accepted: 06/21/2021] [Indexed: 02/06/2023] Open
Abstract
Antiviral peptide (AVP) is a kind of antimicrobial peptide (AMP) that has the potential ability to fight against virus infection. Machine learning-based prediction with a computational biology approach can facilitate the development of the novel therapeutic agents. In this study, we proposed a double-stage classification scheme, named AVPIden, for predicting the AVPs and their functional activities against different viruses. The first stage is to distinguish the AVP from a broad-spectrum peptide collection, including not only the regular peptides (non-AMP) but also the AMPs without antiviral functions (non-AVP). The second stage is responsible for characterizing one or more virus families or species that the AVP targets. Imbalanced learning is utilized to improve the performance of prediction. The AVPIden uses multiple descriptors to precisely demonstrate the peptide properties and adopts explainable machine learning strategies based on Shapley value to exploit how the descriptors impact the antiviral activities. Finally, the evaluation performance of the proposed model suggests its ability to predict the antivirus activities and their potential functions against six virus families (Coronaviridae, Retroviridae, Herpesviridae, Paramyxoviridae, Orthomyxoviridae, Flaviviridae) and eight kinds of virus (FIV, HCV, HIV, HPIV3, HSV1, INFVA, RSV, SARS-CoV). The AVPIden gives an option for reinforcing the development of AVPs with the computer-aided method and has been deployed at http://awi.cuhk.edu.cn/AVPIden/.
Affiliation(s)
- Yuxuan Pang
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, PR China
- Lantian Yao
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, PR China
- Jhih-Hua Jhong
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, PR China
- Zhuo Wang
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, PR China
- Tzong-Yi Lee
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, PR China
|
13
|
Sajid MR, Muhammad N, Zakaria R, Shahbaz A, Bukhari SAC, Kadry S, Suresh A. Nonclinical Features in Predictive Modeling of Cardiovascular Diseases: A Machine Learning Approach. Interdiscip Sci 2021; 13:201-211. [PMID: 33675528 DOI: 10.1007/s12539-021-00423-w] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2020] [Revised: 02/08/2021] [Accepted: 02/20/2021] [Indexed: 12/23/2022]
Abstract
BACKGROUND In the broader healthcare domain, prediction bears more value than explanation, considering the cost of delays in services. Various risk prediction models for cardiovascular diseases (CVDs) exist in the literature for early risk assessment. However, the substantial increase in CVD-related mortality is challenging global health systems, especially in developing countries. This situation motivates researchers to improve CVD prediction models using new features and risk computing methods. This study aims to assess nonclinical features, which are readily available in any healthcare system, for predicting CVDs using advanced and flexible machine learning (ML) algorithms. METHODS A gender-matched case-control study was conducted in the largest public sector cardiac hospital of Pakistan, and data from 460 subjects were collected. The dataset comprised eight nonclinical features. Four supervised ML algorithms were used to train and test models predicting CVD status, with traditional logistic regression (LR) as the baseline model. The models were validated through a train-test split (70:30) and tenfold cross-validation. RESULTS Random forest (RF), a nonlinear ML algorithm, performed better than the other ML algorithms and LR. The area under the curve (AUC) of RF was 0.851 and 0.853 in the train-test split and tenfold cross-validation approaches, respectively. The nonclinical features yielded an admissible accuracy (minimum 71%) through the LR and ML models, exhibiting their predictive capability in risk estimation. CONCLUSION The satisfactory performance of nonclinical features reveals that these features and flexible computational methodologies can reinforce existing risk prediction models for better healthcare services.
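The AUC values quoted in this abstract can be read as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control. The sketch below shows that rank-based calculation in plain Python. The function name `auc_from_scores` and the toy labels and scores are illustrative assumptions, not data or code from the study.

```python
def auc_from_scores(labels, scores):
    """Rank-based AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical predicted risks for six subjects (1 = case, 0 = control).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
toy_auc = auc_from_scores(toy_labels, toy_scores)
```

In this toy set, eight of the nine case-control pairs are ordered correctly, so the AUC is 8/9. The pairwise loop is quadratic in the number of subjects, which is acceptable for a dataset of a few hundred rows but would normally be replaced by a sort-based formula for larger data.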
Affiliation(s)
- Mirza Rizwan Sajid
- Centre for Mathematical Sciences, College of Computing and Applied Sciences, Universiti Malaysia Pahang, 26300, Gambang, Kuantan, Pahang Darul Makmur, Malaysia
- Noryanti Muhammad
- Centre for Mathematical Sciences, College of Computing and Applied Sciences, Universiti Malaysia Pahang, 26300, Gambang, Kuantan, Pahang Darul Makmur, Malaysia
- Roslinazairimah Zakaria
- Centre for Mathematical Sciences, College of Computing and Applied Sciences, Universiti Malaysia Pahang, 26300, Gambang, Kuantan, Pahang Darul Makmur, Malaysia
- Ahmad Shahbaz
- Punjab Institute of Cardiology, Lahore, 54000, Pakistan
- Syed Ahmad Chan Bukhari
- Division of Computer Science, Mathematics and Science, Collins College of Professional Studies, St. John's University, New York, NY, 11439, USA
- Seifedine Kadry
- Faculty of Applied Computing and Technology, Noroff University College, Kristiansand, Norway
- A Suresh
- Department of Computer Science and Engineering, SRM Institute of Science & Technology, Kattankulathur, Chengalpattu (D.t), 603 203, Tamilnadu, India
|
14
|
Abstract
Deep learning is transforming most areas of science and technology, including electron microscopy. This review paper offers a practical perspective aimed at developers with limited familiarity. For context, we review popular applications of deep learning in electron microscopy. Next, we discuss the hardware and software needed to get started with deep learning and to interface with electron microscopes. We then review neural network components, popular architectures, and their optimization. Finally, we discuss future directions of deep learning in electron microscopy.
|
15
|
Chemometric applications in metabolomic studies using chromatography-mass spectrometry. Trends Analyt Chem 2021. [DOI: 10.1016/j.trac.2020.116165] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
|
16
|
Solovyev PA, Fauhl-Hassek C, Riedl J, Esslinger S, Bontempo L, Camin F. NMR spectroscopy in wine authentication: An official control perspective. Compr Rev Food Sci Food Saf 2021; 20:2040-2062. [PMID: 33506593 DOI: 10.1111/1541-4337.12700] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2020] [Revised: 11/30/2020] [Accepted: 12/23/2020] [Indexed: 12/14/2022]
Abstract
Wine authentication is vital in identifying malpractice and fraud, and various physical and chemical analytical techniques have been employed for this purpose. Besides wet chemistry, these include chromatography, isotopic ratio mass spectrometry, optical spectroscopy, and nuclear magnetic resonance (NMR) spectroscopy, which have been applied in recent years in combination with chemometric approaches. For many years, ²H NMR spectroscopy was the method of choice and achieved official recognition in the detection of sugar addition to grape products. Recently, ¹H NMR spectroscopy, a simpler and faster method (in terms of sample preparation), has gathered more and more attention in wine analysis, even if it still lacks official recognition. This technique makes targeted quantitative determination of wine ingredients and nontargeted detection of the metabolomic fingerprint of a wine sample possible. This review summarizes the possibilities and limitations of ¹H NMR spectroscopy in analytical wine authentication, by reviewing its applications as reported in the literature. Examples of commercial and open-source solutions combining NMR spectroscopy and chemometrics are also examined herein, together with its opportunities of becoming an official method.
Affiliation(s)
- Pavel A Solovyev
- Department of Food Quality and Nutrition, Research and Innovation Center, Fondazione Edmund Mach (FEM), via E. Mach 1, San Michele all'Adige, 38010, Italy
- Carsten Fauhl-Hassek
- German Federal Institute for Risk Assessment, Department Safety in the Food Chain, Unit Product Identity, Supply Chains and Traceability, Max-Dohrn Strasse, 8-10, Berlin, 10589, Germany
- Janet Riedl
- German Federal Institute for Risk Assessment, Department Safety in the Food Chain, Unit Product Identity, Supply Chains and Traceability, Max-Dohrn Strasse, 8-10, Berlin, 10589, Germany
- Susanne Esslinger
- German Federal Institute for Risk Assessment, Department Safety in the Food Chain, Unit Product Identity, Supply Chains and Traceability, Max-Dohrn Strasse, 8-10, Berlin, 10589, Germany
- Luana Bontempo
- Department of Food Quality and Nutrition, Research and Innovation Center, Fondazione Edmund Mach (FEM), via E. Mach 1, San Michele all'Adige, 38010, Italy
- Federica Camin
- Department of Food Quality and Nutrition, Research and Innovation Center, Fondazione Edmund Mach (FEM), via E. Mach 1, San Michele all'Adige, 38010, Italy; Center Agriculture Food Environment (C3A), University of Trento, via Mach 1, San Michele all'Adige, TN, 38010, Italy
|
17
|
Abstract
Chemometrics is widely used to solve various quantitative and qualitative problems in analytical chemistry. A self-optimizing chemometrics method helps scientists exploit the advantages of chemometrics. In this report, a parameter-free support vector elastic net that self-optimizes two key regularization constants, i.e., λ for L2 regularization and t for L1 regularization, is developed and referred to as self-optimizing support vector elastic net (SOSVEN). Response surface modeling (RSM) and bootstrapped Latin partitions (BLPs) are incorporated for the optimization. Responses at a set of design points over the ranges of the two factors are evaluated with an internal BLP validation using a calibration set. A 2-dimensional interpolation with a cubic spline fits a response surface to determine the condition that gives the best-estimated response. The SOSVEN with RSM performed comparably to the one tuned by grid search, while the RSM was more efficient. The developed SOSVEN was compared with two parameter-free chemometrics methods, super partial least-squares regression (sPLSR) and super support vector regression (sSVR) for calibration, and sPLS-discriminant analysis (sPLS-DA) and support vector classification (SVC) for classification. For calibration, the SOSVEN with RSM worked as well as or better than the other two self-optimizing methods in evaluations using meat and hemp oil data sets. For classification, a reference wine data set and mass spectra of different marijuana extracts were used. The three classifiers performed similarly in identifying the cultivars of wines, with nearly 98% accuracy. The SOSVEN significantly outperformed sPLS-DA and SVC in classifying the mass spectra of marijuana extracts, with an overall accuracy of 97%. These results demonstrate the excellent abilities of SOSVEN for classification and calibration.
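The internal validation mentioned in this abstract relies on bootstrapped Latin partitions. The following is a minimal sketch of the general idea as commonly described: each bootstrap splits the samples into class-stratified partitions, each partition serves as the prediction set exactly once, and the whole procedure is repeated with fresh random splits. The function names, the round-robin dealing scheme, and the toy labels are assumptions made for illustration and are not taken from the cited work.

```python
import random
from collections import defaultdict


def latin_partitions(labels, n_partitions, rng):
    """Split sample indices into n_partitions test sets, stratified by class.

    Every index appears in exactly one test set, so pooling the predictions
    from all partitions covers each sample once.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    partitions = [[] for _ in range(n_partitions)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal the shuffled members round-robin so class proportions are kept;
        # the running offset keeps partition sizes balanced across classes.
        for i, idx in enumerate(members):
            partitions[(i + offset) % n_partitions].append(idx)
        offset += len(members)
    return partitions


def bootstrapped_latin_partitions(labels, n_bootstraps=3, n_partitions=4, seed=0):
    """Yield (bootstrap, train_indices, test_indices) for each split."""
    rng = random.Random(seed)
    all_indices = set(range(len(labels)))
    for b in range(n_bootstraps):
        for test in latin_partitions(labels, n_partitions, rng):
            train = sorted(all_indices - set(test))
            yield b, train, sorted(test)


# Hypothetical class labels for 12 samples from 3 classes.
toy_labels = ["a"] * 4 + ["b"] * 4 + ["c"] * 4
splits = list(bootstrapped_latin_partitions(toy_labels))
```

A model would be fitted on each `train` list and scored on the matching `test` list. Pooling the scores within a bootstrap gives one estimate per bootstrap, and the spread across bootstraps gives a rough measure of its precision.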
Affiliation(s)
- Zewei Chen
- Clippinger Laboratories, Department of Chemistry and Biochemistry, Ohio University, Athens, Ohio 45701, United States
- Peter de Boves Harrington
- Clippinger Laboratories, Department of Chemistry and Biochemistry, Ohio University, Athens, Ohio 45701, United States
|
18
|
Wang Y, Harrington PDB, Chen P. Metabolomic profiling and comparison of major cinnamon species using UHPLC-HRMS. Anal Bioanal Chem 2020; 412:7669-7681. [PMID: 32875369 DOI: 10.1007/s00216-020-02904-1] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/13/2020] [Revised: 08/11/2020] [Accepted: 08/19/2020] [Indexed: 01/08/2023]
Abstract
The metabolomic profiles of four major species of cinnamon (Cinnamomum verum, C. burmannii, C. loureiroi, and C. cassia) were investigated by ultra-high-performance liquid chromatography-high-resolution mass spectrometry (UHPLC-HRMS). Thirty-six metabolites were tentatively characterized, belonging to various compound groups such as phenolic glycosides, flavan-3-ols, phenolic acids, terpenes, alkaloids, and aldehydes. Principal component analysis (PCA) and partial least squares-discriminant analysis (PLS-DA) on the HRMS data matrix resulted in a clear separation of the four cinnamon species. Coumarin, cinnamaldehyde, methoxycinnamaldehyde, cinnamoyl-methoxyphenyl acetate, proanthocyanidins, and other components varied among the four species. Such variations were used to develop a step-by-step strategy for differentiating the four cinnamon species based on their levels of pre-selected components. This study suggests a significant variation in the phytochemical compositions of different cinnamon species, which have a direct influence on cinnamon's health benefit potentials.
Affiliation(s)
- Yifei Wang
- Methods and Application of Food Composition Laboratory, U.S. Department of Agriculture, Agricultural Research Service, Beltsville Human Nutrition Research Center, Beltsville, MD, 20705, USA
- Department of Chemistry & Biochemistry, College of Arts and Sciences, Ohio University, Athens, OH, 45701, USA
- Peter de B Harrington
- Department of Chemistry & Biochemistry, College of Arts and Sciences, Ohio University, Athens, OH, 45701, USA
- Pei Chen
- Methods and Application of Food Composition Laboratory, U.S. Department of Agriculture, Agricultural Research Service, Beltsville Human Nutrition Research Center, Beltsville, MD, 20705, USA
|
19
|
Kucheryavskiy S, Zhilin S, Rodionova O, Pomerantsev A. Procrustes Cross-Validation—A Bridge between Cross-Validation and Independent Validation Sets. Anal Chem 2020; 92:11842-11850. [DOI: 10.1021/acs.analchem.0c02175] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]
Affiliation(s)
- Sergey Kucheryavskiy
- Department of Chemistry and Bioscience, Aalborg University, Niels Bohrs vej 8, Esbjerg 6700, Denmark
- Sergei Zhilin
- CSort Ltd., Germana Titova St. 7, Barnaul 656023, Russia
- Oxana Rodionova
- Semenov Federal Research Center for Chemical Physics, RAS, Kosygin St. 4, Moscow 119991, Russia
- Alexey Pomerantsev
- Semenov Federal Research Center for Chemical Physics, RAS, Kosygin St. 4, Moscow 119991, Russia
|
20
|
Al-Hetlani E, Halámková L, Amin MO, Lednev IK. Differentiating smokers and nonsmokers based on Raman spectroscopy of oral fluid and advanced statistics for forensic applications. J Biophotonics 2020; 13:e201960123. [PMID: 31702875 DOI: 10.1002/jbio.201960123] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/19/2019] [Revised: 10/26/2019] [Accepted: 11/06/2019] [Indexed: 06/10/2023]
Abstract
Raman spectroscopy has proven to be a valuable tool for analyzing various types of forensic evidence such as traces of body fluids. In this work, Raman spectroscopy was employed as a nondestructive technique for the analysis of dry traces of oral fluid to differentiate between smoker and nonsmoker donors with the aid of advanced statistical tools. A total of 32 oral fluid samples were collected from donors of differing gender, age and race and were subjected to Raman spectroscopic analysis. A genetic algorithm was used to determine eight spectral regions that contribute the most to the differentiation of smokers and nonsmokers. Thereafter, a classification model was developed based on the artificial neural network that showed 100% accuracy after external validation. The developed approach demonstrates great potential for the differentiation of smokers and nonsmokers based on the analysis of dry traces of oral fluid.
Affiliation(s)
- Entesar Al-Hetlani
- Department of Chemistry, Faculty of Science, Kuwait University, Safat, Kuwait
- Lenka Halámková
- Department of Chemistry, University at Albany, SUNY, Albany, New York
- Mohamed O Amin
- Department of Chemistry, Faculty of Science, Kuwait University, Safat, Kuwait
- Igor K Lednev
- Department of Chemistry, University at Albany, SUNY, Albany, New York
|
21
|
Chen Z, de Boves Harrington P. Automatic soft independent modeling for class analogies. Anal Chim Acta 2019; 1090:47-56. [PMID: 31655645 DOI: 10.1016/j.aca.2019.09.035] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2019] [Revised: 09/10/2019] [Accepted: 09/11/2019] [Indexed: 01/19/2023]
Abstract
Soft independent modeling of class analogy (SIMCA) is an important method for authentication. The key parameters for SIMCA, the number of principal components and the decision threshold, determine the model's performance. In this report, a self-optimizing SIMCA that automatically determines these two parameters is devised and referred to as automatic SIMCA (aSIMCA). An efficient optimization is obtained by incorporating response surface modeling (RSM) and bootstrapped Latin partitions with the model-building dataset. A set of design points over the ranges of the two parameters is evaluated with respect to sensitivity and specificity by using the model-building data from target and non-target classes. Averages of the sensitivity and specificity are used as responses for the design points. A 2-dimensional interpolation and a bivariate cubic polynomial were used to model the response surface. As a control method, a grid search that evaluates all combinations of the two parameters over the same ranges was performed in parallel to determine the best conditions for SIMCA, and the modeling performance was compared to aSIMCA with RSM. The developed aSIMCA methods were evaluated by authenticating two sets of botanical extracts, i.e., marijuana and hemp, with spectral datasets collected from various spectroscopic techniques, including nuclear magnetic resonance, high-resolution mass, and ultraviolet spectrometry. Results of a paired t-test indicated that the aSIMCA with the RSM performed similarly to the one optimized by the grid search for modeling marijuana and hemp, while the RSM was more computationally efficient. The 2-dimensional interpolation is preferred because it is more efficient and fits the response surface more precisely.
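The response described in this abstract, the average of sensitivity and specificity at each design point, and the control grid search can be sketched as follows. This is only an illustration under stated assumptions: the model-to-sample distances are assumed to be precomputed for each candidate number of principal components, a sample is accepted when its distance is at or below the threshold, and all names and numbers (`balanced_response`, `grid_search`, `toy_distances`) are hypothetical.

```python
def balanced_response(target_dist, nontarget_dist, threshold):
    """Average of sensitivity and specificity for an accept-if-close rule."""
    sensitivity = sum(d <= threshold for d in target_dist) / len(target_dist)
    specificity = sum(d > threshold for d in nontarget_dist) / len(nontarget_dist)
    return (sensitivity + specificity) / 2.0


def grid_search(distances_by_components, thresholds):
    """Evaluate every (components, threshold) pair and return the best one."""
    best = None
    for n_components, (target, nontarget) in sorted(distances_by_components.items()):
        for threshold in thresholds:
            response = balanced_response(target, nontarget, threshold)
            if best is None or response > best[0]:
                best = (response, n_components, threshold)
    return best


# Hypothetical model-to-sample distances, keyed by number of principal components.
# Each entry holds (distances of target samples, distances of non-target samples).
toy_distances = {
    1: ([0.8, 1.1, 1.4, 2.2], [1.0, 1.9, 2.5, 3.0]),
    2: ([0.5, 0.7, 0.9, 1.2], [1.6, 2.1, 2.4, 3.3]),
}
best_response, best_components, best_threshold = grid_search(
    toy_distances, [1.0, 1.5, 2.0]
)
```

The grid search evaluates every combination, which is why the abstract treats it as the slower control. A response surface approach would instead evaluate a smaller set of design points and interpolate between them to locate the optimum.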
Affiliation(s)
- Zewei Chen
- Center for Intelligent Chemical Instrumentation, Clippinger Laboratories, Department of Chemistry and Biochemistry, Ohio University, Athens, OH, 45701, USA
- Peter de Boves Harrington
- Center for Intelligent Chemical Instrumentation, Clippinger Laboratories, Department of Chemistry and Biochemistry, Ohio University, Athens, OH, 45701, USA
|
22
|
Chen Z, Harrington PDB. Pipeline for High-Throughput Modeling of Marijuana and Hemp Extracts. Anal Chem 2019; 91:14489-14497. [PMID: 31660729 DOI: 10.1021/acs.analchem.9b03290] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/10/2023]
Abstract
Authentication of Cannabis products is important for assuring manufacturing quality, given increasing consumption and regulation. In this report, a two-stage pipeline was developed for high-throughput screening and chemotyping of spectra from two sets of botanical extracts from the Cannabis genus. The first set contains different marijuana samples with higher concentrations of tetrahydrocannabinol (THC). The other set includes samples from hemp, a variety of Cannabis sativa with a THC concentration below 0.3%. The first stage applies class modeling to determine whether spectra belong to marijuana or hemp and to reject novel spectra that may be neither marijuana nor hemp. An automatic soft independent modeling of class analogy (aSIMCA) that self-optimizes the number of principal components and the decision threshold is utilized in the first pipeline process to achieve excellent efficiency and efficacy. Once spectra are recognized by aSIMCA as marijuana or hemp, they are routed to the appropriate classifiers in the second stage for chemotyping, i.e., assigning the spectra to different chemotypes so that the pharmacological properties and cultivars of the samples can be recognized. Three multivariate classifiers, a fuzzy rule building expert system (FuRES), super partial least-squares-discriminant analysis (sPLS-DA), and support vector machine tree type entropy (SVMtreeH), are employed for chemotyping. The discriminant ability of the pipeline was evaluated with different spectral data sets of these two groups of botanical samples, including proton nuclear magnetic resonance, mass, and ultraviolet spectra. All evaluations gave good results with accuracies greater than 95%, which demonstrates the promise of the pipeline for automated high-throughput screening and chemotyping of marijuana and hemp, as well as other botanical products.
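The routing logic of such a two-stage pipeline can be sketched in a few lines. The example below is a hypothetical illustration only: the class models and chemotypers are stand-in functions with made-up rules, not aSIMCA, FuRES, sPLS-DA, or SVMtreeH, and treating a spectrum accepted by more than one class model as unrecognized is an assumption rather than something the abstract states.

```python
def two_stage_pipeline(spectrum, class_models, chemotypers):
    """Route a spectrum through class modeling, then a class-specific classifier.

    class_models maps a group name to a function that returns True when the
    spectrum is accepted by that group's model; chemotypers maps the same
    names to functions that return a chemotype label.
    """
    accepted = [name for name, accepts in class_models.items() if accepts(spectrum)]
    if len(accepted) != 1:
        # Rejected by every model, or ambiguous: flag instead of forcing a label.
        return ("unrecognized", None)
    group = accepted[0]
    return (group, chemotypers[group](spectrum))


# Hypothetical stand-ins: a "spectrum" is a short list of channel intensities.
def accepts_marijuana(spectrum):
    return spectrum[0] >= 0.5


def accepts_hemp(spectrum):
    return 0.05 <= spectrum[0] < 0.5


def chemotype_by_ratio(spectrum):
    return "chemotype-1" if spectrum[1] > spectrum[2] else "chemotype-2"


toy_class_models = {"marijuana": accepts_marijuana, "hemp": accepts_hemp}
toy_chemotypers = {"marijuana": chemotype_by_ratio, "hemp": chemotype_by_ratio}

result_known = two_stage_pipeline([0.8, 0.3, 0.1], toy_class_models, toy_chemotypers)
result_novel = two_stage_pipeline([0.0, 0.3, 0.1], toy_class_models, toy_chemotypers)
```

Keeping the stages as interchangeable callables mirrors the structure in the abstract, where the first stage can reject a spectrum before any second-stage classifier is asked to label it.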
Affiliation(s)
- Zewei Chen
- Center for Intelligent Chemical Instrumentation, Clippinger Laboratories, Department of Chemistry and Biochemistry, Ohio University, Athens, Ohio 45701, United States
- Peter de Boves Harrington
- Center for Intelligent Chemical Instrumentation, Clippinger Laboratories, Department of Chemistry and Biochemistry, Ohio University, Athens, Ohio 45701, United States
|
23
|
Tang Y, Harrington PB. Noninteger Root Transformations for Preprocessing Nanoelectrospray Ionization High-Resolution Mass Spectra for the Classification of Cannabis. Anal Chem 2019; 91:1328-1334. [PMID: 30565911 DOI: 10.1021/acs.analchem.8b03145] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
Typically, for measurements with a high dynamic range, the range is reduced by using the square root transform. By using noninteger roots coupled with systematic experimental design, improvements to the measurements may be obtained. The effect of using noninteger root transformation was evaluated using high-resolution mass spectrometry (HRMS) combined with nanoelectrospray ionization (Nano-ESI) to differentiate 23 samples of Cannabis. The mass spectra were evaluated and classified using different mass resolving powers and noninteger root transformations. Classification was achieved by super partial least-squares discriminant analysis (sPLS-DA), support vector machine (SVM), and SVM classification tree type entropy (SVMTreeH). The 2.5 root transformation gave the best overall performance at different resolving powers for chemical profiling from a multilevel factorial experimental design using 2 factors and more than 4 levels. Response surface modeling using a cubic polynomial model of the bootstrapped sPLS-DA average prediction accuracies yielded optima at 0.005 for resolving power and 2.3 for the root transformation. Root transformation is an important spectral preprocessing tool for decreasing the dynamic range so that the relative variance of smaller but more important features may be inflated. For the classification of Cannabis using Nano-ESI, the optimal ranges of root and resolution were broad. The chasing-the-optimum method has been introduced for refining the polynomial response surface model.
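A root transformation of the kind described here raises each intensity to the power 1/r, where r need not be an integer. The short sketch below compares the square root with the 2.5 root on a made-up spectrum. The function names and the intensities are illustrative assumptions, not the authors' code or data.

```python
def root_transform(intensities, root=2.5):
    """Raise each non-negative intensity to the power 1/root."""
    if root <= 0:
        raise ValueError("root must be positive")
    if any(x < 0 for x in intensities):
        raise ValueError("intensities must be non-negative")
    return [x ** (1.0 / root) for x in intensities]


def dynamic_range(values):
    """Ratio of the largest to the smallest non-zero value."""
    nonzero = [v for v in values if v > 0]
    return max(nonzero) / min(nonzero)


# Hypothetical peak intensities spanning five orders of magnitude.
toy_spectrum = [1.0, 32.0, 1024.0, 100000.0]
sqrt_spectrum = root_transform(toy_spectrum, root=2.0)
root25_spectrum = root_transform(toy_spectrum, root=2.5)
```

For this toy spectrum the dynamic range falls from 100000 to about 316 with the square root and to about 100 with the 2.5 root, so smaller peaks carry relatively more weight in any variance-based model fitted afterwards.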
Affiliation(s)
- Yue Tang
- Ohio University Center for Intelligent Chemical Instrumentation, Department of Chemistry and Biochemistry, Clippinger Laboratories, Athens, Ohio 45701-2979, United States
- Peter B Harrington
- Ohio University Center for Intelligent Chemical Instrumentation, Department of Chemistry and Biochemistry, Clippinger Laboratories, Athens, Ohio 45701-2979, United States
|
24
|
Chen Z, de Boves Harrington P, Baugh SF. High-Throughput Chemotyping of Cannabis and Hemp Extracts Using an Ultraviolet Microplate Reader and Multivariate Classifiers. J Anal Test 2018. [DOI: 10.1007/s41664-018-0075-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022]
|
25
|
On Splitting Training and Validation Set: A Comparative Study of Cross-Validation, Bootstrap and Systematic Sampling for Estimating the Generalization Performance of Supervised Learning. J Anal Test 2018; 2:249-262. [PMID: 30842888 PMCID: PMC6373628 DOI: 10.1007/s41664-018-0068-2] [Citation(s) in RCA: 194] [Impact Index Per Article: 32.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2018] [Revised: 10/08/2018] [Accepted: 10/12/2018] [Indexed: 11/15/2022]
Abstract
Model validation is the most important part of building a supervised model. Building a model with good generalization performance requires a sensible data splitting strategy, which is crucial for model validation. In this study, we conducted a comparative study of various reported data splitting methods. The MixSim model was employed to generate nine simulated datasets with different probabilities of misclassification and variable sample sizes. Partial least squares for discriminant analysis and support vector machines for classification were then applied to these datasets. The data splitting methods tested included variants of cross-validation, bootstrapping, bootstrapped Latin partition, the Kennard-Stone algorithm (K-S) and the sample set partitioning based on joint X–Y distances algorithm (SPXY). These methods were employed to split the data into training and validation sets. The generalization performances estimated from the validation sets were then compared with those obtained from blind test sets, which were generated from the same distribution but were unseen by the training/validation procedure used in model construction. The results showed that the size of the data is the deciding factor for the quality of the generalization performance estimated from the validation set. We found a significant gap between the performance estimated from the validation set and that from the test set for all the data splitting methods employed on small datasets. This disparity decreased when more samples were available for training/validation, because the models were then moving towards approximations of the central limit theorem for the simulated datasets used.
We also found that having too many or too few samples in the training set had a negative effect on the estimated model performance, suggesting that a good balance between the sizes of the training set and validation set is necessary for a reliable estimate of model performance. We also found that systematic sampling methods such as K-S and SPXY generally gave very poor estimates of model performance, most likely because they are designed to take the most representative samples first and thus leave a rather poorly representative sample set for model performance estimation.
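The behaviour attributed to Kennard-Stone selection is easier to see in code. The following is a minimal sketch of the usual form of the algorithm (start with the two most distant samples, then repeatedly add the sample whose nearest selected neighbour is farthest away), written with the standard library only. The toy samples and the use of Euclidean distance are assumptions for illustration and are not taken from the study.

```python
import math


def kennard_stone(samples, n_select):
    """Return indices of n_select samples chosen by the Kennard-Stone rule."""
    n = len(samples)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")

    dist = [[math.dist(a, b) for b in samples] for a in samples]

    # Start with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]

    # Repeatedly add the sample whose nearest selected neighbour is farthest away.
    while len(selected) < n_select:
        next_idx = max(remaining, key=lambda i: min(dist[i][s] for s in selected))
        selected.append(next_idx)
        remaining.remove(next_idx)
    return selected


# Hypothetical one-dimensional samples: two extremes and a tight cluster near 5.
toy_samples = [(0.0,), (4.9,), (5.0,), (5.1,), (10.0,)]
training_idx = kennard_stone(toy_samples, 3)
validation_idx = [i for i in range(len(toy_samples)) if i not in training_idx]
```

In this toy case the training set takes both extremes and the centre, while the validation set is left with two near-duplicates of a point already in training. That is consistent with the abstract's explanation of why such systematic splits can give unreliable performance estimates.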
|
26
|
Classification of samples from NMR-based metabolomics using principal components analysis and partial least squares with uncertainty estimation. Anal Bioanal Chem 2018; 410:6305-6319. [PMID: 30043113 DOI: 10.1007/s00216-018-1240-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2018] [Revised: 06/14/2018] [Accepted: 07/02/2018] [Indexed: 12/18/2022]
Abstract
Recent progress in metabolomics has been aided by the development of analysis techniques such as gas and liquid chromatography coupled with mass spectrometry (GC-MS and LC-MS) and nuclear magnetic resonance (NMR) spectroscopy. The vast quantities of data produced by these techniques have resulted in an increase in the use of machine algorithms that can aid in the interpretation of these data, such as principal components analysis (PCA) and partial least squares (PLS). Techniques such as these can be applied to biomarker discovery, interlaboratory comparison, and clinical diagnoses. However, there is a lingering question whether the results of these studies can be applied to broader sets of clinical data, usually taken from different data sources. In this work, we address this question by creating a metabolomics workflow that combines a previously published consensus analysis procedure (https://doi.org/10.1016/j.chemolab.2016.12.010) with PCA and PLS models using uncertainty analysis based on bootstrapping. This workflow is applied to NMR data that come from an interlaboratory comparison study using synthetic and biologically obtained metabolite mixtures. The consensus analysis identifies trusted laboratories, whose data are used to create classification models that are more reliable than without. With uncertainty analysis, the reliability of the classification can be rigorously quantified, both for data from the original set and from new data that the model is analyzing.
|
27
|
Rodríguez-Pérez R, Fernández L, Marco S. Overoptimism in cross-validation when using partial least squares-discriminant analysis for omics data: a systematic study. Anal Bioanal Chem 2018; 410:5981-5992. [PMID: 29959482 DOI: 10.1007/s00216-018-1217-1] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2018] [Revised: 06/13/2018] [Accepted: 06/21/2018] [Indexed: 01/29/2023]
Abstract
Advances in analytical instrumentation have provided the possibility of examining thousands of genes, peptides, or metabolites in parallel. However, the costly and time-consuming data acquisition process causes a generalized lack of samples. From a data analysis perspective, omics data are characterized by high dimensionality and small sample counts. In many scenarios, the analytical aim is to differentiate between two different conditions or classes by combining an analytical method with a tailored qualitative predictive model built from available examples collected in a dataset. For this purpose, partial least squares-discriminant analysis (PLS-DA) is frequently employed in omics research. Recently, there has been growing concern about the uncritical use of this method, since it is prone to overfitting and may aggravate problems of false discoveries. In many applications involving a small number of subjects or samples, predictive model performance estimation is based only on cross-validation (CV) results, with a strong preference for reporting results using leave one out (LOO). The combination of PLS-DA for high dimensionality data and small sample conditions, together with a weak validation methodology, is a recipe for unreliable estimations of model performance. In this work, we present a systematic study of the impact of the dataset size, the dimensionality, and the CV technique used on PLS-DA overoptimism when performance estimation is done in cross-validation. Firstly, by using synthetic data generated from the same probability distribution and with assigned random binary labels, we have obtained a dataset where the true classification rate (CR) is 50%. As expected, our results confirm that internal validation provides overoptimistic estimations of the classification accuracy (i.e., overfitting). We have characterized the CR estimator in terms of bias and variance depending on the internal CV technique used and the sample to dimensionality ratio.
In small sample conditions, due to the large bias and variance of the estimator, the occurrence of extremely good CRs is common. We have found that overfitting peaks when the sample size in the training subset approaches the feature vector dimensionality minus one. In these conditions, the models are neither under- nor overdetermined, with a unique solution. This effect is particularly intense for LOO and peaks higher in small sample conditions. Overoptimism decreases beyond this point, where the abundance of noise produces a regularization effect leading to less complex models. In terms of overfitting, our study ranks CV methods as follows: bootstrap produces the most accurate estimator of the CR, followed by bootstrapped Latin partitions, random subsampling, K-Fold, and finally, the very popular LOO provides the worst results. Simulation results are further confirmed in real datasets from mass spectrometry and microarrays.
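The random-label experiment described in this abstract can be imitated on a small scale. The sketch below is a simplified stand-in, not the authors' protocol: it swaps PLS-DA for a nearest-centroid classifier so that it runs with the standard library alone, and the sample sizes, dimensionality, trial count, and function names are all assumptions. It only illustrates how widely a leave-one-out estimate can scatter when the features are pure noise and samples are few. It does not reproduce the bias results reported for PLS-DA.

```python
import random
import statistics


def nearest_centroid_loo(features, labels):
    """Leave-one-out classification rate of a nearest-centroid classifier."""
    n = len(features)
    correct = 0
    for held_out in range(n):
        centroids = {}
        for cls in (0, 1):
            rows = [features[i] for i in range(n) if i != held_out and labels[i] == cls]
            centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
        x = features[held_out]
        sq_dist = {
            cls: sum((a - b) ** 2 for a, b in zip(x, c)) for cls, c in centroids.items()
        }
        predicted = min(sq_dist, key=sq_dist.get)
        correct += predicted == labels[held_out]
    return correct / n


def simulate_rates(n_samples, n_features, n_trials, seed=0):
    """Estimated classification rates on pure-noise data with random labels."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_trials):
        features = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
        labels = [i % 2 for i in range(n_samples)]  # balanced classes
        rng.shuffle(labels)  # labels carry no information about the features
        rates.append(nearest_centroid_loo(features, labels))
    return rates


# Hypothetical settings: 10 noise features, 100 repeated trials per sample size.
small = simulate_rates(n_samples=10, n_features=10, n_trials=100)
large = simulate_rates(n_samples=40, n_features=10, n_trials=100)
spread_small = statistics.pstdev(small)
spread_large = statistics.pstdev(large)
```

Comparing `max(small)` and `spread_small` with their `large` counterparts shows why a single favourable cross-validation result on a small dataset is weak evidence: the labels are random, yet some trials will still report classification rates well above chance.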
Affiliation(s)
- Raquel Rodríguez-Pérez
- Signal and Information Processing for Sensing Systems, Institute for Bioengineering of Catalonia, The Barcelona Institute for Science and Technology, Baldiri Reixac 4-8, 08028, Barcelona, Spain
- Luis Fernández
- Signal and Information Processing for Sensing Systems, Institute for Bioengineering of Catalonia, The Barcelona Institute for Science and Technology, Baldiri Reixac 4-8, 08028, Barcelona, Spain; Department of Electronics and Biomedical Engineering, University of Barcelona, Martí i Franqués 1, 08028, Barcelona, Spain
- Santiago Marco
- Signal and Information Processing for Sensing Systems, Institute for Bioengineering of Catalonia, The Barcelona Institute for Science and Technology, Baldiri Reixac 4-8, 08028, Barcelona, Spain; Department of Electronics and Biomedical Engineering, University of Barcelona, Martí i Franqués 1, 08028, Barcelona, Spain
|