1
Jiang S, Liang Y, Shi S, Wu C, Shi Z. Improving predictions and understanding of primary and ultimate biodegradation rates with machine learning models. The Science of the Total Environment 2023; 904:166623. [PMID: 37652371 DOI: 10.1016/j.scitotenv.2023.166623] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2023] [Revised: 08/08/2023] [Accepted: 08/25/2023] [Indexed: 09/02/2023]
Abstract
This study aimed to develop machine learning-based quantitative structure biodegradability relationship (QSBR) models for predicting primary and ultimate biodegradation rates of organic chemicals, which are essential parameters for environmental risk assessment. For this purpose, experimental primary and ultimate biodegradation rates of high consistency were compiled for 173 organic compounds. A significant number of descriptors were calculated with a collection of quantum/computational chemistry software and tools to achieve comprehensive representation and interpretability. Following a pre-screening process, multiple QSBR models were developed for both primary and ultimate endpoints using three algorithms: extreme gradient boosting (XGBoost), support vector machine (SVM), and multiple linear regression (MLR). Furthermore, a unified QSBR model was constructed using the knowledge transfer technique and XGBoost. Results demonstrated that all QSBR models developed in this study had good performance. In particular, SVM models exhibited a high level of goodness of fit (coefficient of determination on the training set of 0.973 for primary and 0.980 for ultimate), robustness (leave-one-out cross-validated coefficient of 0.953 for primary and 0.967 for ultimate), and external predictive ability (external explained variance of 0.947 for primary and 0.958 for ultimate). The knowledge transfer technique enhanced model performance by learning from properties of two biodegradation endpoints. Williams plots were used to visualize the application domains of the models. Through SHapley Additive exPlanations (SHAP) analysis, this study identified key features affecting biodegradation rates. Notably, MDEO-12, APC2D1_C_O, and other features contributed to primary biodegradation, while AATS0v, AATS2v, and others inhibited it. For ultimate biodegradation, features like No. of Rotatable Bonds, APC2D1_C_O, and minHBa were contributors, while C1SP3, Halogen Ratio, GGI4, and others hindered the process. The study also quantified the contributions of each feature in predictions for individual chemicals. This research provides valuable tools for predicting both primary and ultimate biodegradation rates while offering insights into the mechanisms.
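The leave-one-out cross-validated coefficient mentioned in this abstract can be illustrated with a minimal sketch. It assumes the common definition Q² = 1 − PRESS/SS_tot and a least-squares line on a single descriptor; the function names and toy data are hypothetical, and this is not the authors' implementation.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x for one descriptor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def q2_loo(xs, ys):
    """Leave-one-out Q^2 = 1 - PRESS / SS_tot (assumed definition)."""
    mean_y = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        # Refit without compound i, then predict the held-out compound.
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (a + b * xs[i])) ** 2
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_tot


# Hypothetical toy data: one descriptor vs. a log-scale rate.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
rate = [2.1, 3.9, 6.2, 7.8, 10.1]
print(round(q2_loo(descriptor, rate), 3))
```

A Q² close to the training R² suggests the fit does not hinge on any single compound, which is the sense of "robustness" used in the abstract.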
Affiliation(s)
- Shan Jiang
  - School of Environment and Energy, South China University of Technology, Guangzhou, Guangdong 510006, People's Republic of China; The Key Lab of Pollution Control and Ecosystem Restoration in Industry Clusters, Ministry of Education, South China University of Technology, Guangzhou, Guangdong 510006, People's Republic of China
- Yuzhen Liang
  - School of Environment and Energy, South China University of Technology, Guangzhou, Guangdong 510006, People's Republic of China; The Key Lab of Pollution Control and Ecosystem Restoration in Industry Clusters, Ministry of Education, South China University of Technology, Guangzhou, Guangdong 510006, People's Republic of China
- Songlin Shi
  - School of Environment and Energy, South China University of Technology, Guangzhou, Guangdong 510006, People's Republic of China; The Key Lab of Pollution Control and Ecosystem Restoration in Industry Clusters, Ministry of Education, South China University of Technology, Guangzhou, Guangdong 510006, People's Republic of China
- Chunya Wu
  - School of Environment and Energy, South China University of Technology, Guangzhou, Guangdong 510006, People's Republic of China; The Key Lab of Pollution Control and Ecosystem Restoration in Industry Clusters, Ministry of Education, South China University of Technology, Guangzhou, Guangdong 510006, People's Republic of China
- Zhenqing Shi
  - School of Environment and Energy, South China University of Technology, Guangzhou, Guangdong 510006, People's Republic of China; The Key Lab of Pollution Control and Ecosystem Restoration in Industry Clusters, Ministry of Education, South China University of Technology, Guangzhou, Guangdong 510006, People's Republic of China
2
Ngara TR, Zeng P, Zhang H. mibPOPdb: An online database for microbial biodegradation of persistent organic pollutants. iMeta 2022; 1:e45. [PMID: 38867901 PMCID: PMC10989864 DOI: 10.1002/imt2.45] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 05/27/2022] [Revised: 07/04/2022] [Accepted: 07/11/2022] [Indexed: 06/14/2024]
Abstract
Microbial biodegradation of persistent organic pollutants (POPs) is an attractive, ecofriendly, and cost-efficient clean-up technique for reclaiming POP-contaminated environments. In the last few decades, the number of publications documenting POP-degrading microbes, enzymes, and experimental data sets has continuously increased, necessitating the development of a dedicated web resource that catalogs consolidated information on POP-degrading microbes and tools to facilitate integrative analysis of POP degradation data sets. To address this knowledge gap, we developed the Microbial Biodegradation of Persistent Organic Pollutants Database (mibPOPdb) by accumulating microbial POP degradation information from the public domain and manually curating published scientific literature. Currently, in mibPOPdb, there are 9215 microbial strain entries, including 184 gene (sub)families, 100 enzymes, 48 biodegradation pathways, and 593 intermediate compounds identified in POP-biodegradation processes, and information on 32 toxic compounds listed under the Stockholm Convention environmental treaty. Besides the standard database functionalities, which include data searching, browsing, and retrieval of database entries, we provide a suite of bioinformatics services to facilitate comparative analysis of users' own data sets against mibPOPdb entries. Additionally, we built a Graph Neural Network-based prediction model for the biodegradability classification of chemicals. The predictive model exhibited a good biodegradability classification performance and high prediction accuracy. mibPOPdb is a free data-sharing platform designated to promote research in microbial-based biodegradation of POPs and fills a long-standing gap in environmental protection research. Database URL: http://mibpop.genome-mining.cn/.
Affiliation(s)
- Tanyaradzwa R. Ngara
  - Department of Biotechnology, College of Life Science and Technology, MOE Key Laboratory of Molecular Biophysics, Huazhong University of Science and Technology, Wuhan, China
- Peiji Zeng
  - Department of Biotechnology, College of Life Science and Technology, MOE Key Laboratory of Molecular Biophysics, Huazhong University of Science and Technology, Wuhan, China
- Houjin Zhang
  - Department of Biotechnology, College of Life Science and Technology, MOE Key Laboratory of Molecular Biophysics, Huazhong University of Science and Technology, Wuhan, China
3
Al-Fakih AM, Algamal ZY, Qasim MK. An improved opposition-based crow search algorithm for biodegradable material classification. SAR and QSAR in Environmental Research 2022; 33:403-415. [PMID: 35469528 DOI: 10.1080/1062936x.2022.2064546] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/04/2022] [Accepted: 04/05/2022] [Indexed: 06/14/2023]
Abstract
The development of a reliable quantitative structure-activity relationship (QSAR) classification model with a small number of molecular descriptors is a crucial step in chemometrics. In this study, an improvement of the crow search algorithm (CSA) is proposed by adapting the opposition-based learning (OBL) approach, named OBL-CSA, to improve the exploration and exploitation capability of the CSA in quantitative structure-biodegradation relationship (QSBR) modelling for classifying biodegradable materials. The results reveal that OBL-CSA not only improves classification performance but also reduces the computational time required to complete the process when compared to the standard CSA and four other optimization algorithms tested: the particle swarm algorithm (PSO), black hole algorithm (BHA), grey wolf algorithm (GWA), and whale optimization algorithm (WOA). In conclusion, the OBL-CSA could be a valuable resource in the classification of biodegradable materials.
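The opposition-based learning idea named in this abstract can be sketched generically: for a candidate x in the box [lower, upper], the opposite point is lower + upper − x, and the better of the two is kept. The helper names and the toy fitness function below are hypothetical, and the sketch shows only a generic OBL step on continuous positions, not the authors' OBL-CSA.

```python
def opposite(position, lower, upper):
    """Opposite point of `position` inside the box [lower, upper]."""
    return [lo + hi - x for x, lo, hi in zip(position, lower, upper)]


def obl_select(population, lower, upper, fitness):
    """Merge candidates with their opposites and keep the best half.

    Lower fitness is treated as better (an assumption of this sketch).
    """
    merged = population + [opposite(p, lower, upper) for p in population]
    merged.sort(key=fitness)
    return merged[:len(population)]


# Hypothetical toy problem: minimise the squared distance to (8, 8).
def toy_fitness(p):
    return sum((x - 8.0) ** 2 for x in p)


lower, upper = [0.0, 0.0], [10.0, 10.0]
population = [[1.0, 2.0], [9.0, 7.0], [3.0, 3.0]]
print(obl_select(population, lower, upper, toy_fitness))
```

Evaluating opposites doubles the fitness calls in the step where it is applied, so it is typically used at initialisation or only in some iterations.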
Affiliation(s)
- A M Al-Fakih
  - Department of Chemistry, Faculty of Science, Universiti Teknologi Malaysia, Johor, Malaysia and Department of Chemistry, Faculty of Science, Sana'a University, Sana'a, Yemen
- Z Y Algamal
  - Department of Statistics and Informatics, University of Mosul, Mosul, Iraq
- M K Qasim
  - Department of General Science, University of Mosul, Mosul, Iraq
4
Lee M, Min K. A Comparative Study of the Performance for Predicting Biodegradability Classification: The Quantitative Structure-Activity Relationship Model vs the Graph Convolutional Network. ACS Omega 2022; 7:3649-3655. [PMID: 35128273 PMCID: PMC8811760 DOI: 10.1021/acsomega.1c06274] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2021] [Accepted: 12/28/2021] [Indexed: 06/14/2023]
Abstract
The prediction and evaluation of the biodegradability of molecules with computational methods are becoming increasingly important. Among the various methods, quantitative structure-activity relationship (QSAR) models have been demonstrated to predict the ready biodegradation of chemicals but have limited functionality owing to their complex implementation. In this study, we employ the graph convolutional network (GCN) method to overcome these issues. A biodegradability dataset from previous studies was used to train prediction models with (i) QSAR models using the Mordred molecular descriptor calculator and MACCS molecular fingerprint and (ii) the GCN model using molecular graphs. The performance comparison of the methods confirms that the GCN model is more straightforward to implement and more stable; the specificity and sensitivity values are almost identical without specific descriptors or fingerprints. In addition, the performance of the models was further verified by randomly dividing the dataset into 100 different cases of training and test sets and by varying the test set ratio from 20 to 80%. The results of the current study clearly suggest the promise of the GCN model, which can be implemented straightforwardly and can replace conventional QSAR prediction models for various types and properties of molecules.
Affiliation(s)
- Myeonghun Lee
  - School of Systems Biomedical Science, Soongsil University, 369 Sangdo-ro, Dongjak-gu, Seoul 06978, Republic of Korea
- Kyoungmin Min
  - School of Mechanical Engineering, Soongsil University, 369 Sangdo-ro, Dongjak-gu, Seoul 06978, Republic of Korea
5
Sharma SR, Singh B, Kaur M. Hybrid SFO and TLBO optimization for biodegradable classification. Soft Comput 2021. [DOI: 10.1007/s00500-021-06196-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/13/2022]
6
Singh AK, Bilal M, Iqbal HMN, Raj A. Trends in predictive biodegradation for sustainable mitigation of environmental pollutants: Recent progress and future outlook. The Science of the Total Environment 2021; 770:144561. [PMID: 33736422 DOI: 10.1016/j.scitotenv.2020.144561] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 07/27/2020] [Revised: 12/13/2020] [Accepted: 12/13/2020] [Indexed: 02/05/2023]
Abstract
In-silico techniques and computational frameworks have been applied to predictive bioremediation, with the aims of cleaning up contaminants, evaluating toxicity, and exploring possibilities for the degradation of complex recalcitrant compounds. Emerging contaminants from different industries pose a significant hazard to the environment and public health. Current bioremediation strategies often fail or prove inadequate for sustainable mitigation of hazardous pollutants. Moreover, clear-cut information about biodegradation is quite incomplete from the perspective of conventional remediation techniques. The lack of complete information on bio-transformed compounds leads to a search for alternative methods. Only scarce information about the transformed products and their toxicity profiles is available in the published literature. To fill this literature gap, various computational or in-silico technologies have emerged as alternative techniques, which are being recognized as in-silico approaches for bioremediation. Molecular docking, molecular dynamics simulation, and biodegradation pathway prediction are vital parts of predictive biodegradation, as are Quantitative Structure-Activity Relationship (QSAR) and Quantitative Structure-Biodegradation Relationship (QSBR) model systems. Furthermore, programs based on machine learning (ML), artificial neural networks (ANN), and genetic algorithms (GA) offer simultaneous biodegradation prediction along with toxicity and environmental fate prediction. Herein, we spotlight the feasibility of in-silico remediation approaches for various persistent, recalcitrant contaminants that traditional bioremediation fails to mitigate. Such contaminants could be addressed by exploiting the described model systems and algorithm-based programs. Furthermore, recent advances in QSAR modeling, algorithms, and dedicated biodegradation prediction systems are summarized along with their unique attributes.
Affiliation(s)
- Anil Kumar Singh
  - Environmental Microbiology Laboratory, Environmental Toxicology Group, CSIR-Indian Institute of Toxicology Research (CSIR-IITR), Vishvigyan Bhawan, 31, Mahatma Gandhi Marg, Lucknow 226001, Uttar Pradesh, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad 201002, India
- Muhammad Bilal
  - School of Life Science and Food Engineering, Huaiyin Institute of Technology, Huaian 223003, China
- Hafiz M N Iqbal
  - Tecnologico de Monterrey, School of Engineering and Sciences, Monterrey 64849, Mexico
- Abhay Raj
  - Environmental Microbiology Laboratory, Environmental Toxicology Group, CSIR-Indian Institute of Toxicology Research (CSIR-IITR), Vishvigyan Bhawan, 31, Mahatma Gandhi Marg, Lucknow 226001, Uttar Pradesh, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad 201002, India
7
Classification of Biodegradable Substances Using Balanced Random Trees and Boosted C5.0 Decision Trees. International Journal of Environmental Research and Public Health 2020; 17:ijerph17249322. [PMID: 33322123 PMCID: PMC7763457 DOI: 10.3390/ijerph17249322] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 10/18/2020] [Revised: 11/28/2020] [Accepted: 12/11/2020] [Indexed: 12/12/2022]
Abstract
Substances that do not degrade over time have proven to be harmful to the environment and are dangerous to living organisms. Being able to predict the biodegradability of substances without costly experiments is useful. Recently, the quantitative structure-activity relationship (QSAR) models have proposed effective solutions to this problem. However, the molecular descriptor datasets usually suffer from the problems of unbalanced class distribution, which adversely affects the efficiency and generalization of the derived models. Accordingly, this study aims at validating the performances of balanced random trees (RTs) and boosted C5.0 decision trees (DTs) to construct QSAR models to classify the ready biodegradation of substances and their abilities to deal with unbalanced data. The balanced RTs model algorithm builds individual trees using balanced bootstrap samples, while the boosted C5.0 DT is modeled using cost-sensitive learning. We employed the two-dimensional molecular descriptor dataset, which is publicly available through the University of California, Irvine (UCI) machine learning repository. The molecular descriptors were ranked according to their contributions to the balanced RTs classification process. The performance of the proposed models was compared with previously reported results. Based on the statistical measures, the experimental results showed that the proposed models outperform the classification results of the support vector machine (SVM), K-nearest neighbors (KNN), and discrimination analysis (DA). Classification measures were analyzed in terms of accuracy, sensitivity, specificity, precision, false positive rate, false negative rate, F1 score, receiver operating characteristic (ROC) curve, and area under the ROC curve (AUROC).
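The balanced bootstrap sampling described in this abstract can be sketched as follows. The sketch assumes that "balanced" means drawing, with replacement, as many samples from every class as the smallest class contains; the function name and toy labels are hypothetical, and the paper's exact sampling scheme may differ.

```python
import random
from collections import Counter, defaultdict


def balanced_bootstrap(labels, rng):
    """Return indices of a class-balanced bootstrap sample.

    Each class contributes n_min draws with replacement, where n_min is
    the size of the smallest class (an assumption of this sketch).
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    n_min = min(len(indices) for indices in by_class.values())
    sample = []
    for indices in by_class.values():
        sample.extend(rng.choice(indices) for _ in range(n_min))
    return sample


# Hypothetical unbalanced labels: 8 "not ready" vs 2 "ready" biodegradable.
labels = ["NRB"] * 8 + ["RB"] * 2
sample = balanced_bootstrap(labels, random.Random(0))
print(Counter(labels[i] for i in sample))
```

Each tree in an ensemble would be grown on its own balanced sample, so the majority class cannot dominate any single tree.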
8
Tutorial: multivariate classification for vibrational spectroscopy in biological samples. Nat Protoc 2020; 15:2143-2162. [PMID: 32555465 DOI: 10.1038/s41596-020-0322-8] [Citation(s) in RCA: 127] [Impact Index Per Article: 31.8] [Received: 10/01/2019] [Accepted: 03/20/2020] [Indexed: 12/26/2022]
Abstract
Vibrational spectroscopy techniques, such as Fourier-transform infrared (FTIR) and Raman spectroscopy, have been successful methods for studying the interaction of light with biological materials and facilitating novel cell biology analysis. Spectrochemical analysis is very attractive in disease screening and diagnosis, microbiological studies and forensic and environmental investigations because of its low cost, minimal sample preparation, non-destructive nature and substantially accurate results. However, there is now an urgent need for multivariate classification protocols allowing one to analyze biologically derived spectrochemical data to obtain accurate and reliable results. Multivariate classification comprises discriminant analysis and class-modeling techniques where multiple spectral variables are analyzed in conjunction to distinguish and assign unknown samples to pre-defined groups. The requirement for such protocols is demonstrated by the fact that applications of deep-learning algorithms of complex datasets are being increasingly recognized as critical for extracting important information and visualizing it in a readily interpretable form. Hereby, we have provided a tutorial for multivariate classification analysis of vibrational spectroscopy data (FTIR, Raman and near-IR) highlighting a series of critical steps, such as preprocessing, data selection, feature extraction, classification and model validation. This is an essential aspect toward the construction of a practical spectrochemical analysis model for biological analysis in real-world applications, where fast, accurate and reliable classification models are fundamental.
9
Nunes KM, Andrade MVO, Almeida MR, Fantini C, Sena MM. Raman spectroscopy and discriminant analysis applied to the detection of frauds in bovine meat by the addition of salts and carrageenan. Microchem J 2019. [DOI: 10.1016/j.microc.2019.03.076] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 12/18/2022]
10
de Carvalho Rocha WF, Sheen DA. Determination of physicochemical properties of petroleum derivatives and biodiesel using GC/MS and chemometric methods with uncertainty estimation. Fuel (London, England) 2019; 243:413-422. [PMID: 38516536 PMCID: PMC10956500 DOI: 10.1016/j.fuel.2018.12.126] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 03/23/2024]
Abstract
The physicochemical properties of a substance, such as a fuel, can vary significantly with composition. Determining these properties with ASTM standard methods is both expensive and time-consuming, which has led to a desire to use chemometric modeling as an alternative. In this study, we compare the accuracy and robustness of two chemometric models, partial least squares (PLS) regression and support vector machine (SVM) with uncertainty estimation to determine how the physicochemical properties depend on the composition. A set of hydrocarbon mixtures, including crude oil, oil, gasoline, and biofuel/biodiesel, were collected. GC-MS data were taken, and physicochemical properties were measured for these mixtures using ASTM standard methods. PLS and SVM were used to develop predictive models of the physicochemical properties. Uncertainty in the estimated property values was estimated using a bootstrapping technique. With this uncertainty estimate, it is possible to assess the trustworthiness of any prediction, which ensures that the chemometric models can be applied for general purposes. SVM was found to be generally better for predicting the physicochemical properties, although we expect that with a more comprehensive data set the performance of the PLS models can be improved. We show in this work that PLS and SVM can be used to generate a predictive model of physicochemical properties based on GC-MS data. Combined with uncertainty analysis, these models provide robust predictions that can be used for regulatory, economic, and safety purposes.
Affiliation(s)
- David A Sheen
  - Chemical Sciences Division, National Institute of Standards and Technology, Gaithersburg, MD 20899, USA
11
Mine landslide susceptibility assessment using IVM, ANN and SVM models considering the contribution of affecting factors. PLoS One 2019; 14:e0215134. [PMID: 30973936 PMCID: PMC6459520 DOI: 10.1371/journal.pone.0215134] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Received: 08/20/2018] [Accepted: 03/27/2019] [Indexed: 11/29/2022]
Abstract
The fragile ecological environment near mines provides favorable conditions for the development of landslides. Mine landslide susceptibility mapping is of great importance for mine geo-environment control and restoration planning. In this paper, a total of 493 landslides in Shangli County, China were collected from a historical landslide inventory. Sixteen spectral, geomorphic and hydrological predictive factors, mainly derived from Landsat 8 imagery and the Global Digital Elevation Model (ASTER GDEM), were prepared initially for landslide susceptibility assessment. The predictive capability of these factors was evaluated using the variance inflation factor and information gain ratio. Three models, namely artificial neural network (ANN), support vector machine (SVM) and information value model (IVM), were applied to assess the mine landslide sensitivity. The receiver operating characteristic (ROC) curve and rank probability score were used to validate and compare the comprehensive predictive capabilities of the three models involving uncertainty. Results showed that the ANN model achieved higher prediction capability, demonstrating its advantage in solving nonlinear and complex problems. Comparing the estimated landslide susceptibility map with the ground-truth one, the high-prone areas tend to be located in the middle area with multiple fault distributions and on steeply sloped hills.
12
Morais CLM, Lima KMG, Martin FL. Uncertainty estimation and misclassification probability for classification models based on discriminant analysis and support vector machines. Anal Chim Acta 2018; 1063:40-46. [PMID: 30967184 DOI: 10.1016/j.aca.2018.09.022] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 07/19/2018] [Revised: 09/05/2018] [Accepted: 09/11/2018] [Indexed: 10/28/2022]
Abstract
Uncertainty estimation provides a quantitative value of the predictive performance of a classification model based on its misclassification probability. Low misclassification probabilities are associated with a low degree of uncertainty, indicating high trustworthiness, while high misclassification probabilities are associated with a high degree of uncertainty, indicating a high susceptibility to generating incorrect classifications. Herein, misclassification probability estimations based on uncertainty estimation by bootstrap were developed for classification models using discriminant analysis [linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA)] and support vector machines (SVM). Principal component analysis (PCA) was used as a variable reduction technique prior to classification. Four spectral datasets were tested (1 simulated and 3 real applications) for binary and ternary classifications. Models with lower misclassification probabilities were more stable when the spectra were perturbed with white Gaussian noise, indicating better robustness. Thus, misclassification probability can be used as an additional figure of merit to assess model robustness, providing a reliable metric to evaluate the predictive performance of a classifier.
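A bootstrap estimate of misclassification probability, as outlined in this abstract, might look like the sketch below. It assumes the probability for a sample is the fraction of bootstrap-trained models that misclassify it, and it swaps in a one-dimensional nearest-centroid classifier in place of LDA, QDA, or SVM to stay dependency-free. All names and data are hypothetical, and the paper's exact estimator may differ.

```python
import random


def fit_centroids(values, labels):
    """Nearest-centroid 'model': mean feature value per class."""
    sums, counts = {}, {}
    for value, label in zip(values, labels):
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, value):
    """Assign the class whose centroid is closest to `value`."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


def misclassification_probability(values, labels, n_boot=200, seed=0):
    """Fraction of bootstrap models that misclassify each sample."""
    rng = random.Random(seed)
    n = len(values)
    wrong = [0] * n
    for _ in range(n_boot):
        # Resample the training set with replacement and refit.
        idx = [rng.randrange(n) for _ in range(n)]
        model = fit_centroids([values[i] for i in idx],
                              [labels[i] for i in idx])
        for i in range(n):
            if predict(model, values[i]) != labels[i]:
                wrong[i] += 1
    return [count / n_boot for count in wrong]


# Hypothetical scores (e.g. one PCA component) with a borderline sample.
scores = [0.0, 0.1, 0.2, 0.55, 1.0, 1.1, 1.2]
classes = [0, 0, 0, 0, 1, 1, 1]
print(misclassification_probability(scores, classes))
```

Samples near the class boundary should receive higher values than samples deep inside a class, which is what makes the quantity usable as a per-sample trust indicator.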
Affiliation(s)
- Camilo L M Morais
  - School of Pharmacy and Biomedical Sciences, University of Central Lancashire, Preston PR1 2HE, United Kingdom
- Kássio M G Lima
  - Biological Chemistry and Chemometrics, Institute of Chemistry, Federal University of Rio Grande do Norte, Natal, 59072-970, Brazil
- Francis L Martin
  - School of Pharmacy and Biomedical Sciences, University of Central Lancashire, Preston PR1 2HE, United Kingdom
13
Classification of samples from NMR-based metabolomics using principal components analysis and partial least squares with uncertainty estimation. Anal Bioanal Chem 2018; 410:6305-6319. [PMID: 30043113 DOI: 10.1007/s00216-018-1240-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 02/08/2018] [Revised: 06/14/2018] [Accepted: 07/02/2018] [Indexed: 12/18/2022]
Abstract
Recent progress in metabolomics has been aided by the development of analysis techniques such as gas and liquid chromatography coupled with mass spectrometry (GC-MS and LC-MS) and nuclear magnetic resonance (NMR) spectroscopy. The vast quantities of data produced by these techniques have resulted in an increase in the use of machine algorithms that can aid in the interpretation of these data, such as principal components analysis (PCA) and partial least squares (PLS). Techniques such as these can be applied to biomarker discovery, interlaboratory comparison, and clinical diagnoses. However, there is a lingering question whether the results of these studies can be applied to broader sets of clinical data, usually taken from different data sources. In this work, we address this question by creating a metabolomics workflow that combines a previously published consensus analysis procedure (https://doi.org/10.1016/j.chemolab.2016.12.010) with PCA and PLS models using uncertainty analysis based on bootstrapping. This workflow is applied to NMR data that come from an interlaboratory comparison study using synthetic and biologically obtained metabolite mixtures. The consensus analysis identifies trusted laboratories, whose data are used to create classification models that are more reliable than without. With uncertainty analysis, the reliability of the classification can be rigorously quantified, both for data from the original set and for new data that the model is analyzing.