1. Chen LP. Classification and prediction for multi-cancer data with ultrahigh-dimensional gene expressions. PLoS One 2022; 17:e0274440. [PMID: 36107929 PMCID: PMC9477337 DOI: 10.1371/journal.pone.0274440] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2021] [Accepted: 08/28/2022] [Indexed: 11/29/2022]
Abstract
Analysis of gene expression data is an attractive topic in the field of bioinformatics, and a typical application is to classify and predict individuals' diseases or tumors by treating gene expression values as predictors. A primary challenge of this study comes from ultrahigh-dimensionality, which makes that (i) many predictors in the dataset might be non-informative, (ii) pairwise dependence structures possibly exist among high-dimensional predictors, yielding the network structure. While many supervised learning methods have been developed, it is expected that the prediction performance would be affected if impacts of ultrahigh-dimensionality were not carefully addressed. In this paper, we propose a new statistical learning algorithm to deal with multi-classification subject to ultrahigh-dimensional gene expressions. In the proposed algorithm, we employ the model-free feature screening method to retain informative gene expression values from ultrahigh-dimensional data, and then construct predictive models with network structures of selected gene expression accommodated. Different from existing supervised learning methods that build predictive models based on entire dataset, our approach is able to identify informative predictors and dependence structures for gene expression. Throughout analysis of a real dataset, we find that the proposed algorithm gives precise classification as well as accurate prediction, and outperforms some commonly used supervised learning methods.
Affiliation(s)
- Li-Pang Chen
- Department of Statistics, National Chengchi University, Taipei, Taiwan, ROC
2. Fop M, Mattei PA, Bouveyron C, Murphy TB. Unobserved classes and extra variables in high-dimensional discriminant analysis. ADV DATA ANAL CLASSI 2022; 16:55-92. [PMID: 35308632 PMCID: PMC8924148 DOI: 10.1007/s11634-021-00474-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/29/2021] [Revised: 07/15/2021] [Accepted: 10/03/2021] [Indexed: 11/30/2022]
Abstract
In supervised classification problems, the test set may contain data points belonging to classes not observed in the learning phase. Moreover, the same units in the test data may be measured on a set of additional variables recorded at a subsequent stage with respect to when the learning sample was collected. In this situation, the classifier built in the learning phase needs to adapt to handle potential unknown classes and the extra dimensions. We introduce a model-based discriminant approach, Dimension-Adaptive Mixture Discriminant Analysis (D-AMDA), which can detect unobserved classes and adapt to the increasing dimensionality. Model estimation is carried out via a full inductive approach based on an EM algorithm. The method is then embedded in a more general framework for adaptive variable selection and classification suitable for data of large dimensions. A simulation study and an artificial experiment related to classification of adulterated honey samples are used to validate the ability of the proposed framework to deal with complex situations.
Affiliation(s)
- Michael Fop
- School of Mathematics & Statistics, University College Dublin, Dublin, Ireland
- Charles Bouveyron
- Université Côte d'Azur, Inria, CNRS, Laboratoire J.A. Dieudonné, Maasai team, Nice, France
- Thomas Brendan Murphy
- Université Côte d'Azur, Inria, CNRS, Laboratoire J.A. Dieudonné, Maasai team, Nice, France
3. de Almeida VE, de Sousa Fernandes DD, Diniz PHGD, de Araújo Gomes A, Véras G, Galvão RKH, Araujo MCU. Scores selection via Fisher's discriminant power in PCA-LDA to improve the classification of food data. Food Chem 2021; 363:130296. [PMID: 34144419 DOI: 10.1016/j.foodchem.2021.130296] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 03/04/2021] [Revised: 05/31/2021] [Accepted: 06/01/2021] [Indexed: 11/29/2022]
Abstract
This paper proposes an adaptation of the Fisher's discriminability criterion (named here as discriminant power, DP) for choosing principal components (obtained from Principal Component Analysis, PCA), which will be used to construct supervised Linear Discriminant Analysis (LDA) models for solving classification problems of food data. The proposed PCA-DP-LDA algorithm was then applied to (i) simulated data, (ii) classify soybean oils with respect to expiration date, and (iii) identify cachaça adulteration with wood extracts that simulated aging. For comparison, PCA-DP-LDA was evaluated against conventional PCA-LDA (based on explained variance) and Partial Least Squares-Discriminant Analysis (PLS-DA). Among them, PCA-DP-LDA achieved the most parsimonious and interpretable results, with similar or better classification performance. Therefore, the new algorithm can be considered a good alternative to the already well-established discriminant methods, being potentially applied where the discriminability of the principal components may not follow the same behavior of the explained variance.
Affiliation(s)
- Valber Elias de Almeida
- Universidade Federal de Paraíba, Departamento de Química, P.O.Box 5093, CEP 58051-970 João Pessoa, PB, Brazil
- Adriano de Araújo Gomes
- Universidade Federal do Rio Grande do Sul, Departamento de Química Inorgânica, CEP 91501-970 Porto Alegre, RS, Brazil.
- Germano Véras
- Universidade Estadual da Paraíba, Centro de Ciência e Tecnologia, Departamento de Química, CEP 58429-500 Campina Grande, PB, Brazil
- Mario Cesar Ugulino Araujo
- Universidade Federal de Paraíba, Departamento de Química, P.O.Box 5093, CEP 58051-970 João Pessoa, PB, Brazil.
4. Robust variable selection for model-based learning in presence of adulteration. Comput Stat Data Anal 2021. [DOI: 10.1016/j.csda.2021.107186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
5. Cappozzo A, Duponchel L, Greselin F, Murphy TB. Robust variable selection in the framework of classification with label noise and outliers: Applications to spectroscopic data in agri-food. Anal Chim Acta 2021; 1153:338245. [PMID: 33714445 DOI: 10.1016/j.aca.2021.338245] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/05/2020] [Revised: 12/23/2020] [Accepted: 01/20/2021] [Indexed: 11/28/2022]
Abstract
Classification of high-dimensional spectroscopic data is a common task in analytical chemistry. Well-established procedures like support vector machines (SVMs) and partial least squares discriminant analysis (PLS-DA) are the most common methods for tackling this supervised learning problem. Nonetheless, interpretation of these models remains sometimes difficult, and solutions based on feature selection are often adopted as they lead to the automatic identification of the most informative wavelengths. Unfortunately, for some delicate applications like food authenticity, mislabeled and adulterated spectra occur both in the calibration and/or validation sets, with dramatic effects on the model development, its prediction accuracy and robustness. Motivated by these issues, the present paper proposes a robust model-based method that simultaneously performs variable selection, outliers and label noise detection. We demonstrate the effectiveness of our proposal in dealing with three agri-food spectroscopic studies, where several forms of perturbations are considered. Our approach succeeds in diminishing problem complexity, identifying anomalous spectra and attaining competitive predictive accuracy considering a very low number of selected wavelengths.
Affiliation(s)
- Andrea Cappozzo
- Department of Statistics and Quantitative Methods, University of Milano-Bicocca, Milan, Italy.
- Ludovic Duponchel
- Univ. Lille, CNRS, UMR 8516, LASIRE-Laboratoire avancé de spectroscopie pour les interactions, la réactivité et l'environnement, F-59000, Lille, France.
- Francesca Greselin
- Department of Statistics and Quantitative Methods, University of Milano-Bicocca, Milan, Italy.
- Thomas Brendan Murphy
- School of Mathematics & Statistics and Insight Research Centre, University College Dublin, Dublin, Ireland.
6. Nategh NA, Dalvand MJ, Anvar A. Detection of toxic and non-toxic sweet cherries at different degrees of maturity using an electronic nose. Journal of Food Measurement and Characterization 2021. [DOI: 10.1007/s11694-020-00724-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 12/01/2022]
7. Yang DH, Zhou X, Wang XY, Huang JP. Mirco-earthquake source depth detection using machine learning techniques. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2020.07.045] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 10/23/2022]
8. Aghilinategh N, Dalvand MJ, Anvar A. Detection of ripeness grades of berries using an electronic nose. Food Sci Nutr 2020; 8:4919-4928. [PMID: 32994953 PMCID: PMC7500766 DOI: 10.1002/fsn3.1788] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 02/14/2020] [Revised: 06/29/2020] [Accepted: 06/30/2020] [Indexed: 01/02/2023]
Abstract
The estimation of ripeness is a significant section of quality determination since maturity at harvest can affect sensory and storage properties of fruits. A possible tactic for defining the grade of ripeness is sensing the aromatic volatiles released by fruit using electronic nose (e-nose). For detection of the five ripeness grades of berries (whiteberry and blackberry), the e-nose machine was designed and fabricated. Artificial neural networks (ANN), principal components analysis (PCA), and linear discriminant analysis (LDA) were applied for pattern recognition of array sensors. The best structure (10-11-5) can classify the samples in five classes in ANN analysis with a precision of 100% and 88.3% for blackberry and whiteberry, respectively. Also, PCA analysis characterized 97% and 93% variance in the blackberry and whiteberry, respectively. The least correct classification for whiteberry was observed in the LDA method.
Affiliation(s)
- Nahid Aghilinategh
- Department of Agricultural Machinery Engineering, Sonqor Agriculture Faculty, Razi University, Kermanshah, Iran
- Adieh Anvar
- Agricultural Science and Natural Resources University of Khuzestan, Iran
9. Leclercq M, Vittrant B, Martin-Magniette ML, Scott Boyer MP, Perin O, Bergeron A, Fradet Y, Droit A. Large-Scale Automatic Feature Selection for Biomarker Discovery in High-Dimensional OMICs Data. Front Genet 2019; 10:452. [PMID: 31156708 PMCID: PMC6532608 DOI: 10.3389/fgene.2019.00452] [Citation(s) in RCA: 58] [Impact Index Per Article: 11.6] [Received: 01/23/2019] [Accepted: 04/30/2019] [Indexed: 12/11/2022]
Abstract
The identification of biomarker signatures in omics molecular profiling is usually performed to predict outcomes in a precision medicine context, such as patient disease susceptibility, diagnosis, prognosis, and treatment response. To identify these signatures, we have developed a biomarker discovery tool, called BioDiscML. From a collection of samples and their associated characteristics, i.e., the biomarkers (e.g., gene expression, protein levels, clinico-pathological data), BioDiscML exploits various feature selection procedures to produce signatures associated to machine learning models that will predict efficiently a specified outcome. To this purpose, BioDiscML uses a large variety of machine learning algorithms to select the best combination of biomarkers for predicting categorical or continuous outcomes from highly unbalanced datasets. The software has been implemented to automate all machine learning steps, including data pre-processing, feature selection, model selection, and performance evaluation. BioDiscML is delivered as a stand-alone program and is available for download at https://github.com/mickaelleclercq/BioDiscML.
Affiliation(s)
- Mickael Leclercq
- Centre de Recherche du CHU de Québec-Université Laval, Québec City, QC, Canada; Département de Médecine Moléculaire, Université Laval, Québec City, QC, Canada
- Benjamin Vittrant
- Centre de Recherche du CHU de Québec-Université Laval, Québec City, QC, Canada; Département de Médecine Moléculaire, Université Laval, Québec City, QC, Canada
- Marie Laure Martin-Magniette
- Institute of Plant Sciences Paris Saclay IPS2, CNRS, INRA, Université Paris-Sud, Université Evry, Université Paris-Saclay, Paris Diderot, Sorbonne Paris-Cité, Orsay, France; UMR MIA-Paris, AgroParisTech, INRA, Université Paris-Saclay, Paris, France
- Marie Pier Scott Boyer
- Centre de Recherche du CHU de Québec-Université Laval, Québec City, QC, Canada; Département de Médecine Moléculaire, Université Laval, Québec City, QC, Canada
- Olivier Perin
- Digital Sciences Department, L'Oréal Advanced Research, Aulnay-sous-bois, France
- Alain Bergeron
- Centre de Recherche du CHU de Québec-Université Laval, Québec City, QC, Canada; Département de Chirurgie, Oncology Axis, Université Laval, Québec City, QC, Canada
- Yves Fradet
- Centre de Recherche du CHU de Québec-Université Laval, Québec City, QC, Canada; Département de Chirurgie, Oncology Axis, Université Laval, Québec City, QC, Canada
- Arnaud Droit
- Centre de Recherche du CHU de Québec-Université Laval, Québec City, QC, Canada; Département de Médecine Moléculaire, Université Laval, Québec City, QC, Canada
10. Li Y, Liu JS. Robust Variable and Interaction Selection for Logistic Regression and General Index Models. J Am Stat Assoc 2018; 114:271-286. [PMID: 32863479 DOI: 10.1080/01621459.2017.1401541] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 10/18/2022]
Abstract
Under the logistic regression framework, we propose a forward-backward method, SODA, for variable selection with both main and quadratic interaction terms. In the forward stage, SODA adds in predictors that have significant overall effects, whereas in the backward stage SODA removes unimportant terms to optimize the extended Bayesian Information Criterion (EBIC). Compared with existing methods for variable selection in quadratic discriminant analysis, SODA can deal with high-dimensional data in which the number of predictors is much larger than the sample size and does not require the joint normality assumption on predictors, leading to much enhanced robustness. We further extend SODA to conduct variable selection and model fitting for general index models. Compared with existing variable selection methods based on the Sliced Inverse Regression (SIR) (Li, 1991), SODA requires neither linearity nor constant variance condition and is thus more robust. Our theoretical analysis establishes the variable-selection consistency of SODA under high-dimensional settings, and our simulation studies as well as real-data applications demonstrate superior performances of SODA in dealing with non-Gaussian design matrices in both logistic and general index models.
Affiliation(s)
- Yang Li
- Yang Li is Sr. Market Scientist, Vatic Labs LLC, New York, NY 10036. Jun S Liu is Professor, Department of Statistics, Harvard University, Cambridge, MA 02138; and is also co-Director for the Center for Statistical Science, Department of Industrial Engineering, Tsinghua University, Beijing, China
- Jun S Liu
- Yang Li is Sr. Market Scientist, Vatic Labs LLC, New York, NY 10036. Jun S Liu is Professor, Department of Statistics, Harvard University, Cambridge, MA 02138; and is also co-Director for the Center for Statistical Science, Department of Industrial Engineering, Tsinghua University, Beijing, China
11. Celeux G, Maugis-Rabusseau C, Sedki M. Variable selection in model-based clustering and discriminant analysis with a regularization approach. ADV DATA ANAL CLASSI 2018. [DOI: 10.1007/s11634-018-0322-5] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Indexed: 11/28/2022]
12. Pattarapon P, Zhang M, Bhandari B, Gao Z. Effect of vacuum storage on the freshness of grass carp (Ctenopharyngodon idella) fillet based on normal and electronic sensory measurement. J FOOD PROCESS PRES 2017. [DOI: 10.1111/jfpp.13418] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Indexed: 11/27/2022]
Affiliation(s)
- Phuhongsung Pattarapon
- State Key Laboratory of Food Science and Technology, Jiangnan University, Wuxi, Jiangsu, China
- Min Zhang
- State Key Laboratory of Food Science and Technology, Jiangnan University, Wuxi, Jiangsu, China
- Jiangnan University (Yangzhou) Food Biotechnology Institute, Yangzhou, China
- Bhesh Bhandari
- School of Agriculture and Food Sciences, University of Queensland, Brisbane, Queensland, Australia
13. Moser VC, Stewart N, Freeborn DL, Crooks J, MacMillan DK, Hedge JM, Wood CE, McMahen RL, Strynar MJ, Herr DW. Assessment of serum biomarkers in rats after exposure to pesticides of different chemical classes. Toxicol Appl Pharmacol 2014; 282:161-74. [PMID: 25497286 DOI: 10.1016/j.taap.2014.11.016] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Received: 09/10/2014] [Revised: 11/03/2014] [Accepted: 11/26/2014] [Indexed: 11/25/2022]
Abstract
There is increasing emphasis on the use of biomarkers of adverse outcomes in safety assessment and translational research. We evaluated serum biomarkers and targeted metabolite profiles after exposure to pesticides (permethrin, deltamethrin, imidacloprid, carbaryl, triadimefon, fipronil) with different neurotoxic actions. Adult male Long-Evans rats were evaluated after single exposure to vehicle or one of two doses of each pesticide at the time of peak effect. The doses were selected to produce similar magnitude of behavioral effects across chemicals. Serum or plasma was analyzed using commercial cytokine/protein panels and targeted metabolomics. Additional studies of fipronil used lower doses (lacking behavioral effects), singly or for 14 days, and included additional markers of exposure and biological activity. Biomarker profiles varied in the number of altered analytes and patterns of change across pesticide classes, and discriminant analysis could separate treatment groups from control. Low doses of fipronil produced greater effects when given for 14 days compared to a single dose. Changes in thyroid hormones and relative amounts of fipronil and its sulfone metabolite also differed between the dosing regimens. Most cytokine changes reflected alterations in inflammatory responses, hormone levels, and products of phospholipid, fatty acid, and amino acid metabolism. These findings demonstrate distinct blood-based analyte profiles across pesticide classes, dose levels, and exposure duration. These results show promise for detailed analyses of these biomarkers and their linkages to biological pathways.
Affiliation(s)
- Virginia C Moser
- Neurotoxicology Branch/Toxicity Assessment Division, National Health and Environmental Effects Research Laboratory, Office of Research and Development, US Environmental Protection Agency, Research Triangle Park, NC 27711, USA.
- Nicholas Stewart
- Neurotoxicology Branch/Toxicity Assessment Division, National Health and Environmental Effects Research Laboratory, Office of Research and Development, US Environmental Protection Agency, Research Triangle Park, NC 27711, USA
- Danielle L Freeborn
- Neurotoxicology Branch/Toxicity Assessment Division, National Health and Environmental Effects Research Laboratory, Office of Research and Development, US Environmental Protection Agency, Research Triangle Park, NC 27711, USA
- James Crooks
- Analytical Chemistry Research Core/Research Cores Unit, National Health and Environmental Effects Research Laboratory, Office of Research and Development, US Environmental Protection Agency, Research Triangle Park, NC 27711, USA
- Denise K MacMillan
- Analytical Chemistry Research Core/Research Cores Unit, National Health and Environmental Effects Research Laboratory, Office of Research and Development, US Environmental Protection Agency, Research Triangle Park, NC 27711, USA
- Joan M Hedge
- Integrated Systems Toxicology Division, National Health and Environmental Effects Research Laboratory, Office of Research and Development, US Environmental Protection Agency, Research Triangle Park, NC 27711, USA
- Charles E Wood
- Integrated Systems Toxicology Division, National Health and Environmental Effects Research Laboratory, Office of Research and Development, US Environmental Protection Agency, Research Triangle Park, NC 27711, USA
- Rebecca L McMahen
- ORISE fellow, Human Exposure and Atmospheric Sciences Division, National Exposure Research Laboratory, Office of Research and Development, US Environmental Protection Agency, Research Triangle Park, NC 27711, USA
- Mark J Strynar
- Human Exposure and Atmospheric Sciences Division, National Exposure Research Laboratory, Office of Research and Development, US Environmental Protection Agency, Research Triangle Park, NC 27711, USA
- David W Herr
- Neurotoxicology Branch/Toxicity Assessment Division, National Health and Environmental Effects Research Laboratory, Office of Research and Development, US Environmental Protection Agency, Research Triangle Park, NC 27711, USA