1
Gjelsvik EL, Tøndel K. Increased interpretation of deep learning models using hierarchical cluster-based modelling. PLoS One 2023; 18:e0295251. [PMID: 38060472 PMCID: PMC10703235 DOI: 10.1371/journal.pone.0295251] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2023] [Accepted: 11/20/2023] [Indexed: 12/18/2023] Open
Abstract
Linear prediction models based on data with large inhomogeneity or abrupt non-linearities often perform poorly because relationships between groups in the data dominate the model. Given that the data is locally linear, this can be overcome by splitting the data into smaller clusters and creating a local model within each cluster. In this study, the previously published Hierarchical Cluster-based Partial Least Squares Regression (HC-PLSR) procedure was extended to deep learning, in order to increase the interpretability of the deep learning models through local modelling. Hierarchical Cluster-based Convolutional Neural Networks (HC-CNNs), Hierarchical Cluster-based Recurrent Neural Networks (HC-RNNs) and Hierarchical Cluster-based Support Vector Regression models (HC-SVRs) were implemented and tested on spectroscopic data consisting of Fourier Transform Infrared (FT-IR) measurements of raw material dry films, for prediction of average molecular weight during hydrolysis and a simulated data set constructed to contain three clusters of observations with different non-linear relationships between the independent variables and the response. HC-CNN, HC-RNN and HC-SVR outperformed HC-PLSR for the simulated data set, showing the disadvantage of PLSR for highly non-linear data, but for the FT-IR data set there was little to gain in prediction ability from using more complex models than HC-PLSR. Local modelling can ease the interpretation of deep learning models through highlighting differences in feature importance between different regions of the input or output space. Our results showed clear differences between the feature importance for the various local models, which demonstrate the advantages of a local modelling approach with regards to interpretation of deep learning models.
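The local-modelling idea in this abstract (split the data into clusters, then fit one model per cluster) can be illustrated with a minimal sketch. It is not the HC-PLSR or HC-CNN procedure of the paper: as a simplifying assumption, plain k-means stands in for the clustering step and ordinary least squares stands in for the local PLSR or deep learning models. The toy data and all function names are hypothetical.

```python
import numpy as np

def assign(X, centroids):
    """Index of the nearest centroid for every row of X."""
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)

def kmeans(X, k, n_iter=20):
    """Plain k-means with a deterministic start (evenly spaced rows of X)."""
    centroids = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(n_iter):
        labels = assign(X, centroids)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Evaluate a fitted linear model on the rows of X."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

# Toy data (assumption): two groups, each exactly linear but with different slopes.
x = np.concatenate([np.linspace(-3, -1, 50), np.linspace(1, 3, 50)])
X = x[:, None]
y = np.where(x < 0, 2 * x + 1, -3 * x + 4)

# One global linear model over all observations.
global_coef = fit_linear(X, y)
rmse_global = np.sqrt(np.mean((predict_linear(global_coef, X) - y) ** 2))

# Cluster first, then fit one local linear model per cluster.
k = 2
centroids = kmeans(X, k)
labels = assign(X, centroids)
local_coefs = [fit_linear(X[labels == j], y[labels == j]) for j in range(k)]
y_local = np.empty_like(y)
for j in range(k):
    y_local[labels == j] = predict_linear(local_coefs[j], X[labels == j])
rmse_local = np.sqrt(np.mean((y_local - y) ** 2))

print("global RMSE:", rmse_global)
print("local RMSE:", rmse_local)
print("local slopes:", [c[1] for c in local_coefs])
```

Comparing the local coefficient vectors with each other is the interpretation step the abstract refers to: each cluster gets its own, directly readable set of regression coefficients.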
Affiliation(s)
- Elise Lunde Gjelsvik
  - Faculty of Science and Technology, Norwegian University of Life Sciences, Aas, Norway
- Kristin Tøndel
  - Faculty of Science and Technology, Norwegian University of Life Sciences, Aas, Norway
2
Shan P, Bi Y, Li Z, Wang Q, He Z, Zhao Y, Peng S. Unsupervised model adaptation for multivariate calibration by domain adaptation-regularization based kernel partial least square. Spectrochimica Acta. Part A, Molecular and Biomolecular Spectroscopy 2023; 292:122418. [PMID: 36736045 DOI: 10.1016/j.saa.2023.122418] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2022] [Revised: 01/24/2023] [Accepted: 01/25/2023] [Indexed: 06/18/2023]
Abstract
In chemometrics, calibration model adaptation is needed when training and test samples come from different distributions. Learning a domain-invariant feature representation is currently a successful strategy for model adaptation and has received wide attention. The paper presents a nonlinear unsupervised model adaptation method termed domain adaptation regularization-based kernel partial least squares regression (DarKPLS). DarKPLS aims to minimize the discrepancy between the source and target distributions in a low-dimensional latent space projected from the reproducing kernel Hilbert space (RKHS) generated with the labeled source data and unlabeled target data. Specifically, the distributional means and variances of the source and target latent variables are aligned in the RKHS. By extending the existing domain-invariant partial least squares regression (di-PLS) with the projected maximum mean discrepancy (PMMD) to further reduce the mean discrepancy in the RKHS, DarKPLS achieves fine-grained domain alignment that further improves adaptation performance. DarKPLS is applied to the γ-polyglutamic acid fermentation dataset, the tobacco dataset and the corn dataset, and it demonstrates improved prediction results in comparison with partial least squares without adaptation (PLS), null augmented regression (NAR), extended linear joint trained framework (ExtJT), scatter component analysis (SCA) and domain-invariant iterative partial least squares (DIPALS).
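The abstract relies on a mean discrepancy between source and target data measured in an RKHS. As a rough illustration only, the sketch below computes the generic (biased) squared maximum mean discrepancy with a Gaussian kernel; it is not the projected variant (PMMD) or the DarKPLS algorithm of the paper. The kernel width `gamma`, the synthetic data and the function names are assumptions.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(Xs, Xt, gamma=0.5):
    """Biased estimate of the squared maximum mean discrepancy between two samples."""
    return (rbf_kernel(Xs, Xs, gamma).mean()
            + rbf_kernel(Xt, Xt, gamma).mean()
            - 2.0 * rbf_kernel(Xs, Xt, gamma).mean())

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(200, 3))          # stand-in for source-domain samples
target_shifted = rng.normal(2.0, 1.0, size=(200, 3))  # stand-in for a shifted target domain

mmd_same = mmd2(source, source)
mmd_shifted = mmd2(source, target_shifted)
print("MMD^2, identical samples:", mmd_same)
print("MMD^2, shifted target:", mmd_shifted)
```

A domain adaptation method of this family would add a term of this kind as a penalty, so that the learned latent variables keep the discrepancy between domains small.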
Affiliation(s)
- Peng Shan
  - College of Information Science and Engineering, Northeastern University, Shenyang 110819, Liaoning Province, China
- Yiming Bi
  - Technology Center, China Tobacco Zhejiang Industrial Co., Ltd, Hangzhou 310008, Zhejiang Province, China
- Zhigang Li
  - College of Information Science and Engineering, Northeastern University, Shenyang 110819, Liaoning Province, China
- Qiaoyun Wang
  - College of Information Science and Engineering, Northeastern University, Shenyang 110819, Liaoning Province, China
- Zhonghai He
  - College of Information Science and Engineering, Northeastern University, Shenyang 110819, Liaoning Province, China
- Yuhui Zhao
  - School of Computer Science and Engineering, Northeastern University, Shenyang 110819, Liaoning Province, China
- Silong Peng
  - Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
3
The Correlation Analysis between Air Quality and Construction Sites: Evaluation in the Urban Environment during the COVID-19 Pandemic. Sustainability 2022. [DOI: 10.3390/su14127075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2023]
Abstract
This research studies data on air quality and construction activities from 29 January 2020 to 30 April 2020. The analysis focuses on three sample districts of Hangzhou: Xiacheng, Gongshu, and Xiaoshan. The samples represent low-level, mid-level, and high-level districts, respectively, in terms of the scale of construction projects. The correlations are investigated separately for the periods of ‘pandemic lockdown (29 January 2020–20 February 2020)’ and ‘after pandemic lockdown (21 February 2020–30 April 2020)’, and the corresponding correlation equations are obtained. Based on the guideline values of air parameters provided by the Chinese criteria and standards, the recommended maximum scales of construction projects are defined. The numbers of construction sites are 16, 118, and 311 for the Xiacheng, Gongshu, and Xiaoshan districts, respectively, during the imposed lockdown period, and 19, 88, and 234, respectively, after the lockdown period. Because construction sites are only one factor influencing air quality, and the database is not large enough, the mathematical model and the management plan have some limitations. Possible problem-solving techniques and future studies are introduced at the end of the research study.
4
Stavropoulos G, van Vorstenbosch R, Jonkers DMAE, Penders J, Hill JE, van Schooten FJ, Smolinska A. Advanced data fusion: Random forest proximities and pseudo-sample principle towards increased prediction accuracy and variable interpretation. Anal Chim Acta 2021; 1183:339001. [PMID: 34627524 DOI: 10.1016/j.aca.2021.339001] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 11/05/2020] [Revised: 08/24/2021] [Accepted: 08/25/2021] [Indexed: 11/26/2022]
Abstract
Data fusion has gained much attention in the field of life sciences, and this is because analysis of biological samples may require the use of data coming from multiple complementary sources to express the samples fully. Data fusion lies in the idea that different data platforms detect different biological entities. Therefore, if these different biological compounds are then combined, they can provide comprehensive profiling and understanding of the research question in hand. Data fusion can be performed in three different traditional ways: low-level, mid-level, and high-level data fusion. However, the increasing complexity and amount of generated data require the development of more sophisticated fusion approaches. In that regard, the current study presents an advanced data fusion approach (i.e. proximities stacking) based on random forest proximities coupled with the pseudo-sample principle. Four different data platforms of 130 samples each (faecal microbiome, blood, blood headspace, and exhaled breath samples of patients who have Crohn's disease) were used to demonstrate the classification performance of this new approach. More specifically, 104 samples were used to train and validate the models, whereas the remaining 26 samples were used to validate the models externally. Mid-level, high-level, as well as individual platform classification predictions, were made and compared against the proximities stacking approach. The performance of each approach was assessed by calculating the sensitivity and specificity of each model for the external test set, and visualized by performing principal component analysis on the proximity matrices of the training samples to then, subsequently, project the test samples onto that space. The implementation of pseudo-samples allowed for the identification of the most important variables per platform, finding relations among variables of the different data platforms, and the examination of how variables behave in the samples. 
The proximities stacking approach outperforms both mid-level and high-level fusion approaches, as well as all individual platform predictions. Concurrently, it tackles significant bottlenecks of the traditional ways of fusion and of another advanced fusion way discussed in the paper, and finally, it contradicts the general belief that the more data, the merrier the result, and therefore, considerations have to be taken into account before any data fusion analysis is conducted.
Affiliation(s)
- Georgios Stavropoulos
  - Department of Pharmacology and Toxicology, NUTRIM School of Nutrition and Translational Research, Maastricht University, Maastricht, the Netherlands
- Robert van Vorstenbosch
  - Department of Pharmacology and Toxicology, NUTRIM School of Nutrition and Translational Research, Maastricht University, Maastricht, the Netherlands
- Daisy M A E Jonkers
  - Division of Gastroenterology and Hepatology, NUTRIM School of Nutrition and Translational Research, Maastricht University, Maastricht, the Netherlands
- John Penders
  - Department of Medical Microbiology, NUTRIM School of Nutrition and Translational Research, Maastricht University, Maastricht, the Netherlands
- Jane E Hill
  - Department of Chemical and Biological Engineering, School of Biomedical Engineering, The University of British Columbia, Vancouver, Canada
- Frederik-Jan van Schooten
  - Department of Pharmacology and Toxicology, NUTRIM School of Nutrition and Translational Research, Maastricht University, Maastricht, the Netherlands
- Agnieszka Smolinska
  - Department of Pharmacology and Toxicology, NUTRIM School of Nutrition and Translational Research, Maastricht University, Maastricht, the Netherlands
5
Guo HN, Wu SB, Tian YJ, Zhang J, Liu HT. Application of machine learning methods for the prediction of organic solid waste treatment and recycling processes: A review. Bioresource Technology 2021; 319:124114. [PMID: 32942236 DOI: 10.1016/j.biortech.2020.124114] [Citation(s) in RCA: 89] [Impact Index Per Article: 29.7] [Received: 07/22/2020] [Revised: 09/04/2020] [Accepted: 09/07/2020] [Indexed: 05/23/2023]
Abstract
Conventional treatment and recycling methods of organic solid waste contain inherent flaws, such as low efficiency, low accuracy, high cost, and potential environmental risks. In the past decade, machine learning has gradually attracted increasing attention in solving the complex problems of organic solid waste treatment. Although significant research has been carried out, there is a lack of a systematic review of the research findings in this field. This study sorts the research studies published between 2003 and 2020, summarizes the specific application fields, characteristics, and suitability of different machine learning models, and discusses the relevant application limitations and future prospects. It can be concluded that studies mostly focused on municipal solid waste management, followed by anaerobic digestion, thermal treatment, composting, and landfill. The most widely used model is the artificial neural network, which has been successfully applied to various complicated non-linear organic solid waste related problems.
Affiliation(s)
- Hao-Nan Guo
  - Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China; College of Resources and Environment, University of Chinese Academy of Sciences, Beijing 100049, China
- Shu-Biao Wu
  - Aarhus Institute of Advanced Studies, Aarhus University, DK-8000 Aarhus C, Denmark
- Ying-Jie Tian
  - CAS Research Center on Fictitious Economy & Data Science, Beijing 100190, China
- Jun Zhang
  - Guangxi Key Laboratory of Environmental Pollution Control Theory and Technology, Guilin University of Technology, Guilin 541004, China
- Hong-Tao Liu
  - Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China; Engineering Laboratory for Yellow River Delta Modern Agriculture, Chinese Academy of Sciences, Beijing 100101, China
6
Chemometric Strategies for Spectroscopy-Based Food Authentication. Applied Sciences-Basel 2020. [DOI: 10.3390/app10186544] [Citation(s) in RCA: 41] [Impact Index Per Article: 10.3] [Indexed: 01/05/2023]
Abstract
In the last decades, spectroscopic techniques have played an increasingly crucial role in analytical chemistry, due to the numerous advantages they offer. Several of these techniques (e.g., Near-InfraRed—NIR—or Fourier Transform InfraRed—FT-IR—spectroscopy) are considered particularly valuable because, by means of suitable equipment, they enable a fast and non-destructive sample characterization. This aspect, together with the possibility of easily developing devices for on- and in-line applications, has recently favored the diffusion of such approaches especially in the context of foodstuff quality control. Nevertheless, the complex nature of the signal yielded by spectroscopy instrumentation (regardless of the spectral range investigated) inevitably calls for the use of multivariate chemometric strategies for its accurate assessment and interpretation. This review aims at providing a comprehensive overview of some of the chemometric tools most commonly exploited for spectroscopy-based foodstuff analysis and authentication. More in detail, three different scenarios will be surveyed here: data exploration, calibration and classification. The main methodologies suited to addressing each one of these different tasks will be outlined and examples illustrating their use will be provided alongside their description.
7
Zhang H, Deng X, Zhang Y, Hou C, Li C. Dynamic nonlinear batch process fault detection and identification based on two‐directional dynamic kernel slow feature analysis. Can J Chem Eng 2020. [DOI: 10.1002/cjce.23832] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 11/08/2022]
Affiliation(s)
- Hanyuan Zhang
  - School of Information and Electrical Engineering, Shandong Jianzhu University, Jinan, China
- Xiaogang Deng
  - College of Control Science and Engineering, China University of Petroleum (East China), Qingdao, China
- Yunchu Zhang
  - School of Information and Electrical Engineering, Shandong Jianzhu University, Jinan, China
- Chuanjing Hou
  - School of Information and Electrical Engineering, Shandong Jianzhu University, Jinan, China
- Chengdong Li
  - School of Information and Electrical Engineering, Shandong Jianzhu University, Jinan, China
8
Constructing bi-plots for random forest: Tutorial. Anal Chim Acta 2020; 1131:146-155. [PMID: 32928475 DOI: 10.1016/j.aca.2020.06.043] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 02/16/2020] [Revised: 06/15/2020] [Accepted: 06/16/2020] [Indexed: 01/29/2023]
Abstract
Current technological developments have allowed for a significant increase in the amount and availability of data. Consequently, this has opened enormous opportunities for the machine learning and data science field, translating into the development of new algorithms in a wide range of applications in medical, biomedical, daily-life, and national security areas. Ensemble techniques are among the pillars of the machine learning field, and they can be defined as approaches in which multiple, complex, independent/uncorrelated, predictive models are subsequently combined by either averaging or voting to yield a higher model performance. Random forest (RF), a popular ensemble method, has been successfully applied in various domains due to its ability to build predictive models with high certainty and little necessity of model optimization. RF provides both a predictive model and an estimation of the variable importance. However, the estimation of the variable importance is based on thousands of trees, and therefore, it does not specify which variable is important for which sample group. The present study demonstrates an approach based on the pseudo-sample principle that allows for construction of bi-plots (i.e. spin plots) associated with RF models. The pseudo-sample principle for RF is explained and demonstrated using two simulated datasets and three different types of real data, from political science, food chemistry and the human microbiome. The pseudo-sample bi-plots, associated with RF and its unsupervised version, allow for a versatile visualization of multivariate models, of the variable importance, and of the relations among variables.
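The pseudo-sample principle mentioned here can be sketched under simplifying assumptions. One variable at a time is swept over its observed range while all other variables are held at their mean, and the resulting pseudo-samples are passed through a fitted model. In this sketch a hypothetical closed-form function `black_box_predict` stands in for a trained random forest, and only the prediction trajectories are inspected. Projecting pseudo-samples into the model space to draw bi-plots, as the tutorial describes, is not reproduced.

```python
import numpy as np

def pseudo_samples(X, var, n_points=11):
    """Pseudo-sample matrix for one variable: `var` spans its observed range,
    every other variable is fixed at its column mean."""
    grid = np.linspace(X[:, var].min(), X[:, var].max(), n_points)
    P = np.tile(X.mean(axis=0), (n_points, 1))
    P[:, var] = grid
    return grid, P

def black_box_predict(X):
    """Hypothetical stand-in for a trained model's predict(); variable 2 is irrelevant."""
    return 3.0 * X[:, 0] - X[:, 1] ** 2

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(100, 3))

trajectories = {}
for var in range(X.shape[1]):
    grid, P = pseudo_samples(X, var)
    trajectories[var] = black_box_predict(P)

# The spread of each trajectory is a crude per-variable importance measure,
# and its shape shows the direction of the effect.
spread = {var: float(t.max() - t.min()) for var, t in trajectories.items()}
print(spread)
```

Because each trajectory belongs to a single variable, this kind of analysis can be repeated per sample group, which is the gap in global RF variable importance that the abstract points out.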
9
Narayanan H, Sokolov M, Butté A, Morbidelli M. Decision Tree-PLS (DT-PLS) algorithm for the development of process: Specific local prediction models. Biotechnol Prog 2019; 35:e2818. [PMID: 30969466 DOI: 10.1002/btpr.2818] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 11/10/2018] [Revised: 03/15/2019] [Accepted: 03/25/2019] [Indexed: 12/26/2022]
Abstract
This work presents a novel multivariate statistical algorithm, Decision Tree-PLS (DT-PLS), to improve the prediction and understanding of dynamic processes. It builds local partial least squares regression (PLSR) models for characteristic process groups defined by Decision Tree (DT) analysis. The DT-PLS algorithm is successfully applied to two different cell culture data sets, one obtained from bioreactors at 3.5 L lab scale and the other from the 15 mL ambr microbioreactor system. Compared with the classical PLSR approach implemented in commercially available packages, the localization substantially improves the predictive capabilities of the model. Additionally, the differences in the model parameters of the local models suggest that the governing process variables vary between process regimes, indicating different states of the cell under different process conditions.
Affiliation(s)
- Harini Narayanan
  - Institute of Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zürich, Switzerland
- Michael Sokolov
  - Institute of Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zürich, Switzerland; DataHow AG, Zurich, Switzerland
- Alessandro Butté
  - Institute of Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zürich, Switzerland; DataHow AG, Zurich, Switzerland
- Massimo Morbidelli
  - Institute of Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zürich, Switzerland; DataHow AG, Zurich, Switzerland
10
Sanz H, Valim C, Vegas E, Oller JM, Reverter F. SVM-RFE: selection and visualization of the most relevant features through non-linear kernels. BMC Bioinformatics 2018; 19:432. [PMID: 30453885 PMCID: PMC6245920 DOI: 10.1186/s12859-018-2451-4] [Citation(s) in RCA: 233] [Impact Index Per Article: 38.8] [Received: 05/07/2018] [Accepted: 10/30/2018] [Indexed: 02/02/2023] Open
Abstract
Background: Support vector machines (SVM) are a powerful tool to analyze data in which the number of predictors is approximately equal to or larger than the number of observations. However, the application of SVM to biomedical data was originally limited because SVM was not designed to evaluate the importance of predictor variables. Creating predictive models based on only the most relevant variables is essential in biomedical research. Substantial work has been done to allow assessment of variable importance in SVM models, but this work has focused on SVM implemented with linear kernels. The power of SVM as a prediction model is associated with the flexibility generated by the use of non-linear kernels. Moreover, SVM has been extended to model survival outcomes. This paper extends the Recursive Feature Elimination (RFE) algorithm by proposing three approaches to rank variables based on non-linear SVM and SVM for survival analysis. Results: The proposed algorithms allow visualization of each of the RFE iterations and, hence, identification of the most relevant predictors of the response variable. Using simulation studies based on time-to-event outcomes and three real datasets, we evaluate the three methods, based on pseudo-samples and kernel principal component analysis, and compare them with the original SVM-RFE algorithm for non-linear kernels. The three algorithms we proposed generally performed better than the gold standard RFE for non-linear kernels, when comparing the truly most relevant variables with the variable ranks produced by each algorithm in simulation studies. Generally, the RFE-pseudo-samples approach outperformed the other three methods, even when variables were assumed to be correlated, in all tested scenarios. Conclusions: The proposed approaches can be implemented with accuracy to select variables and assess the direction and strength of associations in the analysis of biomedical data using SVM for categorical or time-to-event responses.
Variable selection and interpretation of the direction and strength of associations between predictors and outcomes with the proposed approaches, particularly the RFE-pseudo-samples approach, can be carried out accurately when analyzing biomedical data. These approaches perform better than the classical RFE of Guyon for realistic scenarios about the structure of biomedical data. Electronic supplementary material: The online version of this article (10.1186/s12859-018-2451-4) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Hector Sanz
  - Department of Genetics, Microbiology and Statistics, Faculty of Biology, Universitat de Barcelona, Diagonal, 643, 08028, Barcelona, Catalonia, Spain
- Clarissa Valim
  - Department of Osteopathic Medical Specialties, Michigan State University, 909 Fee Road, Room B 309 West Fee Hall, East Lansing, MI, 48824, USA; Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, 675 Huntington Ave, Boston, MA, 02115, USA
- Esteban Vegas
  - Department of Genetics, Microbiology and Statistics, Faculty of Biology, Universitat de Barcelona, Diagonal, 643, 08028, Barcelona, Catalonia, Spain
- Josep M Oller
  - Department of Genetics, Microbiology and Statistics, Faculty of Biology, Universitat de Barcelona, Diagonal, 643, 08028, Barcelona, Catalonia, Spain
- Ferran Reverter
  - Department of Genetics, Microbiology and Statistics, Faculty of Biology, Universitat de Barcelona, Diagonal, 643, 08028, Barcelona, Catalonia, Spain; Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Dr. Aiguader 88, 08003, Barcelona, Spain
11
Zhang H, Tian X, Deng X, Cao Y. Batch process fault detection and identification based on discriminant global preserving kernel slow feature analysis. ISA Transactions 2018; 79:108-126. [PMID: 29776590 DOI: 10.1016/j.isatra.2018.05.005] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 10/29/2017] [Revised: 05/01/2018] [Accepted: 05/08/2018] [Indexed: 06/08/2023]
Abstract
As an attractive nonlinear dynamic data analysis tool, global preserving kernel slow feature analysis (GKSFA) has achieved great success in extracting the high nonlinearity and inherently time-varying dynamics of batch processes. However, GKSFA is an unsupervised feature extraction method and cannot utilize batch process class label information, so it may not offer the most effective means of batch process monitoring. To overcome this problem, we propose a novel batch process monitoring method based on a modified GKSFA, referred to as discriminant global preserving kernel slow feature analysis (DGKSFA), which closely integrates discriminant analysis and GKSFA. The proposed DGKSFA method can extract discriminant features of a batch process as well as preserve the global and local geometrical structure information of the observed data. For the purpose of fault detection, a monitoring statistic is constructed based on the distance between the optimal kernel feature vectors of test data and normal data. To tackle the challenging issue of nonlinear fault variable identification, a new nonlinear contribution plot method is also developed to help identify the fault variable after a fault is detected; it is derived from the idea of variable pseudo-sample trajectory projection in the DGKSFA nonlinear biplot. Simulation results on a numerical nonlinear dynamic system and the benchmark fed-batch penicillin fermentation process demonstrate that the proposed process monitoring and fault diagnosis approach can effectively detect faults and distinguish fault variables from normal variables.
Affiliation(s)
- Hanyuan Zhang
  - School of Information and Electrical Engineering, Shandong Jianzhu University, Jinan 250101, Shandong, China
- Xuemin Tian
  - College of Information and Control Engineering, China University of Petroleum (East China), Qingdao 266580, Shandong, China
- Xiaogang Deng
  - College of Information and Control Engineering, China University of Petroleum (East China), Qingdao 266580, Shandong, China
- Yuping Cao
  - College of Information and Control Engineering, China University of Petroleum (East China), Qingdao 266580, Shandong, China
12
Song W, Wang H, Maguire P, Nibouche O. Nearest clusters based partial least squares discriminant analysis for the classification of spectral data. Anal Chim Acta 2018; 1009:27-38. [DOI: 10.1016/j.aca.2018.01.023] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 07/19/2017] [Revised: 12/18/2017] [Accepted: 01/15/2018] [Indexed: 11/29/2022]
13
Differentiation Between Organic and Non-Organic Apples Using Diffraction Grating and Image Processing-A Cost-Effective Approach. Sensors 2018; 18:s18061667. [PMID: 29789501 PMCID: PMC6021810 DOI: 10.3390/s18061667] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 04/15/2018] [Revised: 05/15/2018] [Accepted: 05/20/2018] [Indexed: 11/17/2022]
Abstract
As the expectation for higher quality of life increases, consumers have higher demands for quality food. Food authentication is the technical means of ensuring food is what it says it is. A popular approach to food authentication is based on spectroscopy, which has been widely used for identifying and quantifying the chemical components of an object. This approach is non-destructive and effective but expensive. This paper presents a computer vision-based sensor system for food authentication, i.e., differentiating organic from non-organic apples. This sensor system consists of low-cost hardware and pattern recognition software. We use a flashlight to illuminate apples and capture their images through a diffraction grating. These diffraction images are then converted into a data matrix for classification by pattern recognition algorithms, including k-nearest neighbors (k-NN), support vector machine (SVM) and three partial least squares discriminant analysis (PLS-DA)- based methods. We carry out experiments on a reasonable collection of apple samples and employ a proper pre-processing, resulting in a highest classification accuracy of 94%. Our studies conclude that this sensor system has the potential to provide a viable solution to empower consumers in food authentication.
14
Chemometric Methods for Classification and Feature Selection. Comprehensive Analytical Chemistry 2018. [DOI: 10.1016/bs.coac.2018.08.006] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Indexed: 12/05/2022]
15
Wongsaipun S, Krongchai C, Jakmunee J, Kittiwachana S. Rice Grain Freshness Measurement Using Rapid Visco Analyzer and Chemometrics. Food Anal Method 2017. [DOI: 10.1007/s12161-017-1031-y] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 11/25/2022]
16
Bian X, Li S, Lin L, Tan X, Fan Q, Li M. High and low frequency unfolded partial least squares regression based on empirical mode decomposition for quantitative analysis of fuel oil samples. Anal Chim Acta 2016; 925:16-22. [DOI: 10.1016/j.aca.2016.04.029] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 08/26/2015] [Revised: 03/31/2016] [Accepted: 04/21/2016] [Indexed: 12/26/2022]
17
Tan C, Chen H, Lin Z, Wu T, Wang L, Zhang K. Classification of Liquor Using Near-Infrared Spectroscopy and Chemometrics. Anal Lett 2014. [DOI: 10.1080/00032719.2014.938343] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 10/24/2022]
18
Chen H, Tan C, Wu H, Lin Z, Wu T. Feasibility of Rapid Diagnosis of Colorectal Cancer by Near-Infrared Spectroscopy and Support Vector Machine. Anal Lett 2014. [DOI: 10.1080/00032719.2014.915410] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 10/25/2022]
19
Singh KP, Gupta S, Rai P. Predicting dissolved oxygen concentration using kernel regression modeling approaches with nonlinear hydro-chemical data. Environmental Monitoring and Assessment 2014; 186:2749-2765. [PMID: 24338099 DOI: 10.1007/s10661-013-3576-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/10/2013] [Accepted: 11/28/2013] [Indexed: 06/03/2023]
Abstract
Kernel function-based regression models were constructed and applied to a nonlinear hydro-chemical dataset pertaining to surface water for predicting the dissolved oxygen levels. Initial features were selected using nonlinear approach. Nonlinearity in the data was tested using BDS statistics, which revealed the data with nonlinear structure. Kernel ridge regression, kernel principal component regression, kernel partial least squares regression, and support vector regression models were developed using the Gaussian kernel function and their generalization and predictive abilities were compared in terms of several statistical parameters. Model parameters were optimized using the cross-validation procedure. The proposed kernel regression methods successfully captured the nonlinear features of the original data by transforming it to a high dimensional feature space using the kernel function. Performance of all the kernel-based modeling methods used here were comparable both in terms of predictive and generalization abilities. Values of the performance criteria parameters suggested for the adequacy of the constructed models to fit the nonlinear data and their good predictive capabilities.
Affiliation(s)
- Kunwar P Singh
- Academy of Scientific and Innovative Research, Anusandhan Bhawan, Rafi Marg, New Delhi, 110001, India
|
20
|
A quantitative structure-activity relationship study of anti-HIV activity of substituted HEPT using nonlinear models. Med Chem Res 2013; 22:5442-5452. [PMID: 24098069 PMCID: PMC3785711 DOI: 10.1007/s00044-013-0525-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 01/31/2012] [Accepted: 01/31/2013] [Indexed: 11/27/2022]
Abstract
We performed studies on extended series of 79 HEPT ligands (1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine), inhibitors of HIV reverse-transcriptase with anti-HIV biological activity, using quantitative structure–activity relationship (QSAR) methods that imply analysis of correlations and representation of models. A suitable set of molecular descriptors was calculated, and the genetic algorithm was employed to select those descriptors which resulted in the best-fit models. The kernel partial least square and Levenberg–Marquardt artificial neural network were utilized to construct the nonlinear QSAR models. The proposed methods will be of great significance in this research, and would be expected to apply to other similar research fields.
|
21
|
Platikanov S, Martín J, Tauler R. Linear and non-linear chemometric modeling of THM formation in Barcelona's water treatment plant. THE SCIENCE OF THE TOTAL ENVIRONMENT 2012; 432:365-374. [PMID: 22750183 DOI: 10.1016/j.scitotenv.2012.05.097] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 03/12/2012] [Revised: 05/22/2012] [Accepted: 05/31/2012] [Indexed: 06/01/2023]
Abstract
The complex behavior observed for the dependence of trihalomethane formation on forty one water treatment plant (WTP) operational variables is investigated by means of linear and non-linear regression methods, including kernel-partial least squares (K-PLS), and support vector machine regression (SVR). Lower prediction errors of total trihalomethane concentrations (lower than 14% for external validation samples) were obtained when these two methods were applied in comparison to when linear regression methods were applied. A new visualization technique revealed the complex nonlinear relationships among the operational variables and displayed the existing correlations between input variables and the kernel matrix on one side and the support vectors on the other side. Whereas some water treatment plant variables like river water TOC and chloride concentrations, and breakpoint chlorination were not considered to be significant due to the multi-collinear effect in straight linear regression modeling methods, they were now confirmed to be significant using K-PLS and SVR non-linear modeling regression methods, proving the better performance of these methods for the prediction of complex formation of trihalomethanes in water disinfection plants.
Affiliation(s)
- Stefan Platikanov
- Department of Environmental Chemistry, IDAEA-CSIC, Jordi Girona, 18-26, Barcelona 08026, Spain
|
22
|
Interpretation and visualization of non-linear data fusion in kernel space: study on metabolomic characterization of progression of multiple sclerosis. PLoS One 2012; 7:e38163. [PMID: 22715376 PMCID: PMC3371049 DOI: 10.1371/journal.pone.0038163] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.8] [Received: 01/03/2012] [Accepted: 05/01/2012] [Indexed: 11/22/2022] Open
Abstract
Background: In the last decade data fusion has become widespread in the field of metabolomics. Linear data fusion is performed most commonly. However, many data display non-linear parameter dependences. The linear methods are bound to fail in such situations. We used proton Nuclear Magnetic Resonance and Gas Chromatography-Mass Spectrometry, two well established techniques, to generate metabolic profiles of Cerebrospinal fluid of Multiple Sclerosis (MScl) individuals. These datasets represent non-linearly separable groups. Thus, to extract relevant information and to combine them a special framework for data fusion is required.
Methodology: The main aim is to demonstrate a novel approach for data fusion for classification; the approach is applied to metabolomics datasets coming from patients suffering from MScl at a different stage of the disease. The approach involves data fusion in kernel space and consists of four main steps. The first one is to extract the significant information per data source using Support Vector Machine Recursive Feature Elimination. This method allows one to select a set of relevant variables. In the next step the optimized kernel matrices are merged by linear combination. In step 3 the merged datasets are analyzed with a classification technique, namely Kernel Partial Least Square Discriminant Analysis. In the final step, the variables in kernel space are visualized and their significance established.
Conclusions: We find that fusion in kernel space allows for efficient and reliable discrimination of classes (MScl and early stage). This data fusion approach achieves better class prediction accuracy than analysis of individual datasets and the commonly used mid-level fusion. The prediction accuracy on an independent test set (8 samples) reaches 100%. Additionally, the classification model obtained on fused kernels is simpler in terms of complexity, i.e. just one latent variable was sufficient. Finally, visualization of variables importance in kernel space was achieved.
|
23
|
Cristescu SM, Gietema HA, Blanchet L, Kruitwagen CLJJ, Munnik P, van Klaveren RJ, Lammers JWJ, Buydens L, Harren FJM, Zanen P. Screening for emphysema via exhaled volatile organic compounds. J Breath Res 2011; 5:046009. [PMID: 22071870 DOI: 10.1088/1752-7155/5/4/046009] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.9] [Indexed: 11/12/2022]
Abstract
Chronic obstructive pulmonary disease (COPD)/emphysema risk groups are well defined and screening allows for early identification of disease. The capability of exhaled volatile organic compounds (VOCs) to detect emphysema, as found by computed tomography (CT) in current and former heavy smokers participating in a lung cancer screening trial, was investigated. CT scans, pulmonary function tests and breath sample collections were obtained from 204 subjects. Breath samples were analyzed with a proton-transfer reaction mass spectrometer (PTR-MS) to obtain VOC profiles listed as ions at various mass-to-charge ratios (m/z). Using bootstrapped stepwise forward logistic regression, we identified specific breath profiles as a potential tool for the diagnosis of emphysema, of airflow limitation or gas-exchange impairment. A marker for emphysema was found at m/z 87 (tentatively attributed to 2-methylbutanal). The area under the receiver operating characteristic curve (ROC) of this marker to diagnose emphysema was 0.588 (95% CI 0.453-0.662). Mass-to-charge ratios m/z 52 (most likely chloramine) and m/z 135 (alkyl benzene) were linked to obstructive disease and m/z 122 (most probably alkyl homologs) to an impaired diffusion capacity. ROC areas were 0.646 (95% CI 0.562-0.730) and 0.671 (95% CI 0.524-0.710), respectively. In the screening setting, exhaled VOCs measured by PTR-MS constitute weak markers for emphysema, pulmonary obstruction and impaired diffusion capacity.
Affiliation(s)
- S M Cristescu
- Life Science Trace Gas Facility, Molecular and Laser Physics, Institute for Molecules and Materials, Radboud University, Nijmegen, the Netherlands.
|
24
|
Noorizadeh H, Farmany A, Noorizadeh M. Application of GA–KPLS and L–M ANN calculations for the prediction of the capacity factor of hazardous psychoactive designer drugs. Med Chem Res 2011. [DOI: 10.1007/s00044-011-9794-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 12/01/2022]
|