1
|
Casa A, O’Callaghan TF, Murphy TB. Parsimonious Bayesian factor analysis for modelling latent structures in spectroscopy data. Ann Appl Stat 2022. [DOI: 10.1214/21-aoas1597] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Alessandro Casa
- School of Mathematics & Statistics, University College Dublin
|
2
|
Yu W, Wade S, Bondell HD, Azizi L. Non-stationary Gaussian process discriminant analysis with variable selection for high-dimensional functional data. J Comput Graph Stat 2022. [DOI: 10.1080/10618600.2022.2098136] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
Affiliation(s)
- Weichang Yu
- Melbourne Centre for Data Science, University of Melbourne
- Sara Wade
- School of Mathematics, University of Edinburgh
- Lamiae Azizi
- School of Mathematics and Statistics, University of Sydney
|
3
|
Fop M, Mattei PA, Bouveyron C, Murphy TB. Unobserved classes and extra variables in high-dimensional discriminant analysis. Adv Data Anal Classif 2022; 16:55-92. [PMID: 35308632 PMCID: PMC8924148 DOI: 10.1007/s11634-021-00474-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/29/2021] [Revised: 07/15/2021] [Accepted: 10/03/2021] [Indexed: 11/30/2022]
Abstract
In supervised classification problems, the test set may contain data points belonging to classes not observed in the learning phase. Moreover, the same units in the test data may be measured on a set of additional variables recorded at a subsequent stage with respect to when the learning sample was collected. In this situation, the classifier built in the learning phase needs to adapt to handle potential unknown classes and the extra dimensions. We introduce a model-based discriminant approach, Dimension-Adaptive Mixture Discriminant Analysis (D-AMDA), which can detect unobserved classes and adapt to the increasing dimensionality. Model estimation is carried out via a full inductive approach based on an EM algorithm. The method is then embedded in a more general framework for adaptive variable selection and classification suitable for data of large dimensions. A simulation study and an artificial experiment related to classification of adulterated honey samples are used to validate the ability of the proposed framework to deal with complex situations.
Affiliation(s)
- Michael Fop
- School of Mathematics & Statistics, University College Dublin, Dublin, Ireland
- Charles Bouveyron
- Université Côte d'Azur, Inria, CNRS, Laboratoire J.A. Dieudonné, Maasai team, Nice, France
- Thomas Brendan Murphy
- Université Côte d'Azur, Inria, CNRS, Laboratoire J.A. Dieudonné, Maasai team, Nice, France
|
4
|
Frizzarin M, O'Callaghan TF, Murphy TB, Hennessy D, Casa A. Application of machine-learning methods to milk mid-infrared spectra for discrimination of cow milk from pasture or total mixed ration diets. J Dairy Sci 2021; 104:12394-12402. [PMID: 34593222 DOI: 10.3168/jds.2021-20812] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/01/2021] [Accepted: 08/10/2021] [Indexed: 11/19/2022]
Abstract
The prevalence of "grass-fed" labeled food products on the market has increased in recent years, often commanding a premium price. To date, the majority of methods used for the authentication of grass-fed source products are driven by auditing and inspection of farm records. As such, the ability to verify grass-fed source claims to ensure consumer confidence will be important in the future. Mid-infrared (MIR) spectroscopy is widely used in the dairy industry as a rapid method for the routine monitoring of individual herd milk composition and quality. Further harnessing the data from individual spectra offers a promising and readily implementable strategy to authenticate the milk source at both farm and processor levels. Herein, the robustness, specificity, and accuracy of 11 machine-learning statistical analysis methods were compared for the discrimination of grass-fed versus non-grass-fed milks based on the MIR spectra of 4,320 milk samples collected from cows on pasture or indoor total mixed ration-based feeding systems over a 3-yr period. Linear discriminant analysis and partial least squares discriminant analysis (PLS-DA) were demonstrated to offer the greatest level of accuracy for the prediction of cow diet from MIR spectra. Parsimonious strategies for the selection of the most discriminating wavelengths within the spectra are also highlighted.
Affiliation(s)
- M Frizzarin
- School of Mathematics and Statistics, University College Dublin, Belfield, Dublin 4, Ireland D04 V1W8; Teagasc, Animal & Grassland Research and Innovation Centre, Moorepark, Fermoy, Co. Cork, Ireland P61 P302
- T F O'Callaghan
- VistaMilk SFI Research Center, Moorepark, Fermoy, Ireland P61 P302; School of Food and Nutritional Sciences, University College Cork, Cork, Ireland T12 Y337
- T B Murphy
- School of Mathematics and Statistics, University College Dublin, Belfield, Dublin 4, Ireland D04 V1W8; VistaMilk SFI Research Center, Moorepark, Fermoy, Ireland P61 P302
- D Hennessy
- Teagasc, Animal & Grassland Research and Innovation Centre, Moorepark, Fermoy, Co. Cork, Ireland P61 P302; VistaMilk SFI Research Center, Moorepark, Fermoy, Ireland P61 P302
- A Casa
- School of Mathematics and Statistics, University College Dublin, Belfield, Dublin 4, Ireland D04 V1W8; VistaMilk SFI Research Center, Moorepark, Fermoy, Ireland P61 P302.
|
5
|
Robust variable selection for model-based learning in presence of adulteration. Comput Stat Data Anal 2021. [DOI: 10.1016/j.csda.2021.107186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
|
6
|
Cappozzo A, Duponchel L, Greselin F, Murphy TB. Robust variable selection in the framework of classification with label noise and outliers: Applications to spectroscopic data in agri-food. Anal Chim Acta 2021; 1153:338245. [PMID: 33714445 DOI: 10.1016/j.aca.2021.338245] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/05/2020] [Revised: 12/23/2020] [Accepted: 01/20/2021] [Indexed: 11/28/2022]
Abstract
Classification of high-dimensional spectroscopic data is a common task in analytical chemistry. Well-established procedures like support vector machines (SVMs) and partial least squares discriminant analysis (PLS-DA) are the most common methods for tackling this supervised learning problem. Nonetheless, interpretation of these models remains sometimes difficult, and solutions based on feature selection are often adopted as they lead to the automatic identification of the most informative wavelengths. Unfortunately, for some delicate applications like food authenticity, mislabeled and adulterated spectra occur both in the calibration and/or validation sets, with dramatic effects on the model development, its prediction accuracy and robustness. Motivated by these issues, the present paper proposes a robust model-based method that simultaneously performs variable selection, outliers and label noise detection. We demonstrate the effectiveness of our proposal in dealing with three agri-food spectroscopic studies, where several forms of perturbations are considered. Our approach succeeds in diminishing problem complexity, identifying anomalous spectra and attaining competitive predictive accuracy considering a very low number of selected wavelengths.
Affiliation(s)
- Andrea Cappozzo
- Department of Statistics and Quantitative Methods, University of Milano-Bicocca, Milan, Italy.
- Ludovic Duponchel
- Univ. Lille, CNRS, UMR 8516, LASIRE-Laboratoire avancé de spectroscopie pour les interactions, la réactivité et l'environnement, F-59000, Lille, France.
- Francesca Greselin
- Department of Statistics and Quantitative Methods, University of Milano-Bicocca, Milan, Italy.
- Thomas Brendan Murphy
- School of Mathematics & Statistics and Insight Research Centre, University College Dublin, Dublin, Ireland.
|
7
|
Fortunato F, Anderlucci L, Montanari A. One-class classification with application to forensic analysis. J R Stat Soc Ser C Appl Stat 2020. [DOI: 10.1111/rssc.12438] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
|
8
|
Chen J, Zhou G, Xie J, Wang M, Ding Y, Chen S, Xia S, Deng X, Chen Q, Niu B. Dairy Safety Prediction Based on Machine Learning Combined with Chemicals. Med Chem 2020; 16:664-676. [DOI: 10.2174/1573406415666191004142810] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2019] [Revised: 07/16/2019] [Accepted: 08/23/2019] [Indexed: 11/22/2022]
Abstract
Background: Dairy safety has caused widespread concern in society. Unsafe dairy products have threatened people's health and lives. In order to improve the safety of dairy products and effectively prevent the occurrence of dairy insecurity, countries have established different prevention and control measures and safety warnings.
Objective: The purpose of this study is to establish a dairy safety prediction model based on machine learning to determine whether the dairy products are qualified.
Methods: The 34 common items in the dairy sampling inspection were used as features in this study. Feature selection was performed on the data to obtain a better subset of features, and different algorithms were applied to construct the classification model.
Results: The results show that the prediction model constructed by using a subset of features including “total plate”, “water” and “nitrate” is superior. The SN, SP and ACC of the model were 62.50%, 91.67% and 72.22%, respectively. It was found that the accuracy of the model established by the integrated algorithm is higher than that by the non-integrated algorithm.
Conclusion: This study provides a new method for assessing dairy safety. It helps to improve the quality of dairy products, ensure the safety of dairy products, and reduce the risk of dairy safety.
Affiliation(s)
- Jiahui Chen
- School of Life Sciences, Shanghai University, Shanghai 200444, China
- Guangya Zhou
- School of Life Sciences, Shanghai University, Shanghai 200444, China
- Jiayang Xie
- School of Life Sciences, Shanghai University, Shanghai 200444, China
- Minjia Wang
- School of Life Sciences, Shanghai University, Shanghai 200444, China
- Yanting Ding
- School of Life Sciences, Shanghai University, Shanghai 200444, China
- Shuxian Chen
- Guang Xi Institute for Food and Drug Control, Nanning, 530021, China
- Sijing Xia
- School of Life Sciences, Shanghai University, Shanghai 200444, China
- Xiaojun Deng
- Tech Ctr Anim Plant & Food Inspect & Quarantine, Shanghai Entry-Exit Inspect & Quarantine Bur, Shanghai 200135, China
- Qin Chen
- School of Life Sciences, Shanghai University, Shanghai 200444, China
- Bing Niu
- School of Life Sciences, Shanghai University, Shanghai 200444, China
|
9
|
Kuras MJ, Zielińska-Pisklak M, Duszyńska J, Jabłońska J. Determination of the elemental composition and antioxidant properties of dates (Phoenix dactyliferia) originated from different regions. Journal of Food Science and Technology 2020; 57:2828-2839. [PMID: 32616962 PMCID: PMC7316905 DOI: 10.1007/s13197-020-04314-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Revised: 02/06/2020] [Accepted: 02/24/2020] [Indexed: 12/31/2022]
Abstract
Due to the growing interest in leading a healthy life, including diet, special interest has been placed in searching for products that are rich in nutrients, macro- and micronutrients and vitamins. Dates are fruits that meet these requirements and show multidirectional pro-health effects. These fruits are a source of potassium and other macro- and micronutrients. They have antioxidant properties thanks to the content of flavonoids and polyphenols. The elemental composition (Al, Ca, Cu, Fe, K, Mg, Mn, P, Sr and Zn) and antioxidant properties (total equivalent antioxidant capacity, total polyphenol content, total flavonoid content) of various dates from different regions of the world were determined. The results have shown that the peel and flesh of dates differ significantly in chemical composition. The peel is significantly richer in chemical components of biological importance. Discriminant analysis of the results obtained for dates originating from various regions indicated that the main factor determining the tested chemical composition is the place of cultivation, not the variety.
Affiliation(s)
- Marzena Joanna Kuras
- Department of Biomaterials Chemistry, Chair of Analytical and Biomaterials Chemistry, Faculty of Pharmacy, Medical University of Warsaw, 1 Banacha St., 02-097 Warsaw, Poland
- Monika Zielińska-Pisklak
- Department of Biomaterials Chemistry, Chair of Analytical and Biomaterials Chemistry, Faculty of Pharmacy, Medical University of Warsaw, 1 Banacha St., 02-097 Warsaw, Poland
- Justyna Duszyńska
- Department of Biomaterials Chemistry, Chair of Analytical and Biomaterials Chemistry, Faculty of Pharmacy, Medical University of Warsaw, 1 Banacha St., 02-097 Warsaw, Poland
- Joanna Jabłońska
- Department of Biomaterials Chemistry, Chair of Analytical and Biomaterials Chemistry, Faculty of Pharmacy, Medical University of Warsaw, 1 Banacha St., 02-097 Warsaw, Poland
|
10
|
Affiliation(s)
- Yong Wang
- Department of Statistics, University of Auckland, Auckland, New Zealand
- Xuxu Wang
- Department of Statistics, University of Auckland, Auckland, New Zealand
|
11
|
Hlongwane GN, Dodoo-Arhin D, Wamwangi D, Daramola MO, Moothi K, Iyuke SE. DNA hybridisation sensors for product authentication and tracing: State of the art and challenges. South African Journal of Chemical Engineering 2019. [DOI: 10.1016/j.sajce.2018.11.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022] Open
|
12
|
A novel feature selection method to predict protein structural class. Comput Biol Chem 2018; 76:118-129. [DOI: 10.1016/j.compbiolchem.2018.06.007] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 01/05/2018] [Revised: 05/14/2018] [Accepted: 06/30/2018] [Indexed: 01/05/2023]
|
13
|
Li Y, Liu JS. Robust Variable and Interaction Selection for Logistic Regression and General Index Models. J Am Stat Assoc 2018; 114:271-286. [PMID: 32863479 DOI: 10.1080/01621459.2017.1401541] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 10/18/2022]
Abstract
Under the logistic regression framework, we propose a forward-backward method, SODA, for variable selection with both main and quadratic interaction terms. In the forward stage, SODA adds in predictors that have significant overall effects, whereas in the backward stage SODA removes unimportant terms to optimize the extended Bayesian Information Criterion (EBIC). Compared with existing methods for variable selection in quadratic discriminant analysis, SODA can deal with high-dimensional data in which the number of predictors is much larger than the sample size and does not require the joint normality assumption on predictors, leading to much enhanced robustness. We further extend SODA to conduct variable selection and model fitting for general index models. Compared with existing variable selection methods based on the Sliced Inverse Regression (SIR) (Li, 1991), SODA requires neither linearity nor constant variance condition and is thus more robust. Our theoretical analysis establishes the variable-selection consistency of SODA under high-dimensional settings, and our simulation studies as well as real-data applications demonstrate superior performances of SODA in dealing with non-Gaussian design matrices in both logistic and general index models.
Affiliation(s)
- Yang Li
- Yang Li is Sr. Market Scientist, Vatic Labs LLC, New York, NY 10036. Jun S Liu is Professor, Department of Statistics, Harvard University, Cambridge, MA 02138; and is also co-Director for the Center for Statistical Science, Department of Industrial Engineering, Tsinghua University, Beijing, China
- Jun S Liu
- Yang Li is Sr. Market Scientist, Vatic Labs LLC, New York, NY 10036. Jun S Liu is Professor, Department of Statistics, Harvard University, Cambridge, MA 02138; and is also co-Director for the Center for Statistical Science, Department of Industrial Engineering, Tsinghua University, Beijing, China
|
14
|
Celeux G, Maugis-Rabusseau C, Sedki M. Variable selection in model-based clustering and discriminant analysis with a regularization approach. Adv Data Anal Classif 2018. [DOI: 10.1007/s11634-018-0322-5] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Indexed: 11/28/2022]
|
15
|
Fop M, Smart KM, Murphy TB. Variable selection for latent class analysis with application to low back pain diagnosis. Ann Appl Stat 2017. [DOI: 10.1214/17-aoas1061] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Indexed: 11/19/2022]
|
16
|
Multivariate classification of the geographic origin of Chinese cabbage using an electronic nose-mass spectrometry. Food Sci Biotechnol 2017; 26:603-609. [PMID: 30263584 DOI: 10.1007/s10068-017-0102-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 11/29/2016] [Revised: 01/26/2017] [Accepted: 02/10/2017] [Indexed: 10/19/2022] Open
Abstract
An electronic nose-mass spectrometry (EN-MS) that profiles volatile compounds is a candidate device for identifying the geographic origin of cultivation of agricultural products when an adequate algorithm is derived. The objectives of this study were to apply two types of multivariate analysis, discriminant function analysis (DFA) and principal component analysis (PCA), to the volatile compounds detected by an EN-MS for the geographic classification of Chinese cabbage cultivated in Korea (42 samples) or in China (29 samples). DFA showed that Chinese cabbage from Korea were completely separable from those originating in China with 12 volatile compounds among the 151 detected. PCA revealed that Chinese cabbage data fell into two completely separable origins of Korea and China. This is the first study involving EN-MS data of volatile compounds with multivariate statistics to discriminate the geographical origin of Chinese cabbage, with further applications for other agricultural products.
|
17
|
Shalabi A, Inoue M, Watkins J, De Rinaldis E, Coolen AC. Bayesian clinical classification from high-dimensional data: Signatures versus variability. Stat Methods Med Res 2016; 27:336-351. [PMID: 26984907 DOI: 10.1177/0962280216628901] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 11/16/2022]
Abstract
When data exhibit imbalance between a large number d of covariates and a small number n of samples, clinical outcome prediction is impaired by overfitting and prohibitive computation demands. Here we study two simple Bayesian prediction protocols that can be applied to data of any dimension and any number of outcome classes. Calculating Bayesian integrals and optimal hyperparameters analytically leaves only a small number of numerical integrations, and CPU demands scale as O(nd). We compare their performance on synthetic and genomic data to the mclustDA method of Fraley and Raftery. For small d they perform as well as mclustDA or better. For d = 10,000 or more mclustDA breaks down computationally, while the Bayesian methods remain efficient. This allows us to explore phenomena typical of classification in high-dimensional spaces, such as overfitting and the reduced discriminative effectiveness of signatures compared to intra-class variability.
Affiliation(s)
- Akram Shalabi
- Institute for Mathematical and Molecular Biomedicine, King's College London, London, UK
- Masato Inoue
- Department of Electrical Engineering and Bioscience, School of Advanced Science and Engineering, Waseda University, Tokyo, Japan
- Johnathan Watkins
- Breakthrough Breast Cancer Research Unit, Department of Research Oncology, Guy's Hospital, London, UK
- Anthony Cc Coolen
- Institute for Mathematical and Molecular Biomedicine, King's College London, London, UK
|
18
|
Flynt A, Daepp MIG. Diet-related chronic disease in the northeastern United States: a model-based clustering approach. Int J Health Geogr 2015; 14:25. [PMID: 26338084 PMCID: PMC4559302 DOI: 10.1186/s12942-015-0017-5] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 04/20/2015] [Accepted: 08/14/2015] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND Obesity and diabetes are global public health concerns. Studies indicate a relationship between socioeconomic, demographic and environmental variables and the spatial patterns of diet-related chronic disease. In this paper, we propose a methodology using model-based clustering and variable selection to predict rates of obesity and diabetes. We test this method through an application in the northeastern United States. METHODS We use model-based clustering, an unsupervised learning approach, to find latent clusters of similar US counties based on a set of socioeconomic, demographic, and environmental variables chosen through the process of variable selection. We then use Analysis of Variance and Post-hoc Tukey comparisons to examine differences in rates of obesity and diabetes for the clusters from the resulting clustering solution. RESULTS We find access to supermarkets, median household income, population density and socioeconomic status to be important in clustering the counties of two northeastern states. The results of the cluster analysis can be used to identify two sets of counties with significantly lower rates of diet-related chronic disease than those observed in the other identified clusters. These relatively healthy clusters are distinguished by the large central and large fringe metropolitan areas contained in their component counties. However, the relationship of socio-demographic factors and diet-related chronic disease is more complicated than previous research would suggest. Additionally, we find evidence of low food access in two clusters of counties adjacent to large central and fringe metropolitan areas. While food access has previously been seen as a problem of inner-city or remote rural areas, this study offers preliminary evidence of declining food access in suburban areas. 
CONCLUSIONS Model-based clustering with variable selection offers a new approach to the analysis of socioeconomic, demographic, and environmental data for diet-related chronic disease prediction. In a test application to two northeastern states, this method allows us to identify two sets of metropolitan counties with significantly lower diet-related chronic disease rates than those observed in most rural and suburban areas. Our method could be applied to larger geographic areas or other countries with comparable data sets, offering a promising method for researchers interested in the global increase in diet-related chronic disease.
Affiliation(s)
- Abby Flynt
- Department of Mathematics, Bucknell University, 701 Moore Ave, 17837, Lewisburg, PA, USA.
- Madeleine I G Daepp
- Integrated Studies in Land and Food Systems, The University of British Columbia Vancouver, 2329 West Mall, V6T 1Z4, Vancouver, BC, Canada.
|
19
|
Salter-Townshend M, Murphy TB. Role Analysis in Networks using Mixtures of Exponential Random Graph Models. J Comput Graph Stat 2015; 24:520-538. [PMID: 26101465 DOI: 10.1080/10618600.2014.923777] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 10/25/2022]
Abstract
A novel and flexible framework for investigating the roles of actors within a network is introduced. Particular interest is in roles as defined by local network connectivity patterns, identified using the ego-networks extracted from the network. A mixture of Exponential-family Random Graph Models is developed for these ego-networks in order to cluster the nodes into roles. We refer to this model as the ego-ERGM. An Expectation-Maximization algorithm is developed to infer the unobserved cluster assignments and to estimate the mixture model parameters using a maximum pseudo-likelihood approximation. The flexibility and utility of the method are demonstrated on examples of simulated and real networks.
|
20
|
Jiang B, Liu JS. Variable selection for general index models via sliced inverse regression. Ann Stat 2014. [DOI: 10.1214/14-aos1233] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Indexed: 11/19/2022]
|
21
|
|
22
|
Galligan MC, Saldova R, Campbell MP, Rudd PM, Murphy TB. Greedy feature selection for glycan chromatography data with the generalized Dirichlet distribution. BMC Bioinformatics 2013; 14:155. [PMID: 23651459 PMCID: PMC3703279 DOI: 10.1186/1471-2105-14-155] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 03/05/2012] [Accepted: 03/20/2013] [Indexed: 11/25/2022] Open
Abstract
Background Glycoproteins are involved in a diverse range of biochemical and biological processes. Changes in protein glycosylation are believed to occur in many diseases, particularly during cancer initiation and progression. The identification of biomarkers for human disease states is becoming increasingly important, as early detection is key to improving survival and recovery rates. To this end, the serum glycome has been proposed as a potential source of biomarkers for different types of cancers. High-throughput hydrophilic interaction liquid chromatography (HILIC) technology for glycan analysis allows for the detailed quantification of the glycan content in human serum. However, the experimental data from this analysis is compositional by nature. Compositional data are subject to a constant-sum constraint, which restricts the sample space to a simplex. Statistical analysis of glycan chromatography datasets should account for their unusual mathematical properties. As the volume of glycan HILIC data being produced increases, there is a considerable need for a framework to support appropriate statistical analysis. Proposed here is a methodology for feature selection in compositional data. The principal objective is to provide a template for the analysis of glycan chromatography data that may be used to identify potential glycan biomarkers. Results A greedy search algorithm, based on the generalized Dirichlet distribution, is carried out over the feature space to search for the set of “grouping variables” that best discriminate between known group structures in the data, modelling the compositional variables using beta distributions. The algorithm is applied to two glycan chromatography datasets. Statistical classification methods are used to test the ability of the selected features to differentiate between known groups in the data. Two well-known methods are used for comparison: correlation-based feature selection (CFS) and recursive partitioning (rpart). 
CFS is a feature selection method, while recursive partitioning is a learning tree algorithm that has been used for feature selection in the past. Conclusions The proposed feature selection method performs well for both glycan chromatography datasets. It is computationally slower, but results in a lower misclassification rate and a higher sensitivity rate than both correlation-based feature selection and the classification tree method.
Affiliation(s)
- Marie C Galligan
- School of Mathematical Sciences, University College Dublin, Belfield, Dublin 4, Ireland.
|
23
|
Duraipandian S, Sylvest Bergholt M, Zheng W, Yu Ho K, Teh M, Guan Yeoh K, Bok Yan So J, Shabbir A, Huang Z. Real-time Raman spectroscopy for in vivo, online gastric cancer diagnosis during clinical endoscopic examination. Journal of Biomedical Optics 2012; 17:081418. [PMID: 23224179 DOI: 10.1117/1.jbo.17.8.081418] [Citation(s) in RCA: 78] [Impact Index Per Article: 6.5] [Indexed: 05/21/2023]
Abstract
Optical spectroscopic techniques, including reflectance, fluorescence and Raman spectroscopy, have shown promising potential for in vivo precancer and cancer diagnostics in a variety of organs. However, data analysis has mostly been limited to post-processing and off-line algorithm development. In this work, we develop a fully automated on-line Raman spectral diagnostics framework integrated with a multimodal image-guided Raman technique for real-time in vivo cancer detection at endoscopy. A total of 2748 in vivo gastric tissue spectra (2465 normal and 283 cancer) were acquired from 305 patients recruited to construct a spectral database for diagnostic algorithm development. The novel diagnostic scheme implements on-line preprocessing and outlier detection based on principal component analysis statistics (i.e., Hotelling's T² and Q-residuals) for tissue Raman spectra verification, as well as organ-specific probabilistic diagnostics using different diagnostic algorithms. Free-running optical diagnosis with a processing time of < 0.5 s can be achieved, which is critical to realizing real-time in vivo tissue diagnostics during clinical endoscopic examination. The optimized partial least squares-discriminant analysis (PLS-DA) models based on the randomly resampled training database (80% for learning and 20% for testing) provide a diagnostic accuracy of 85.6% [95% confidence interval (CI): 82.9% to 88.2%] [sensitivity of 80.5% (95% CI: 71.4% to 89.6%) and specificity of 86.2% (95% CI: 83.6% to 88.7%)] for the detection of gastric cancer. The PLS-DA algorithms are further applied prospectively on 10 gastric patients at gastroscopy, achieving a predictive accuracy of 80.0% (60/75) [sensitivity of 90.0% (27/30) and specificity of 73.3% (33/45)] for in vivo diagnosis of gastric cancer. The receiver operating characteristic curves further confirmed the efficacy of Raman endoscopy together with PLS-DA algorithms for in vivo prospective diagnosis of gastric cancer.
This work successfully moves the biomedical Raman spectroscopic technique into real-time, on-line clinical cancer diagnosis, especially in routine endoscopic diagnostic applications.
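The prospective figures quoted in this abstract follow directly from the reported counts. As a minimal sketch, the arithmetic can be checked as shown here; the function name `diagnostic_metrics` is hypothetical, and the false-negative and false-positive counts (3 and 12) are not stated in the abstract but are inferred from 27/30 and 33/45.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                # correctly detected cancer / all cancer
    specificity = tn / (tn + fp)                # correctly detected normal / all normal
    accuracy = (tp + tn) / (tp + fn + tn + fp)  # all correct / all spectra
    return sensitivity, specificity, accuracy


# Reported prospective counts: 27 of 30 cancer spectra and 33 of 45 normal
# spectra classified correctly. fn=3 and fp=12 are inferred, not quoted.
sens, spec, acc = diagnostic_metrics(tp=27, fn=3, tn=33, fp=12)
print(f"sensitivity={sens:.1%} specificity={spec:.1%} accuracy={acc:.1%}")
# prints: sensitivity=90.0% specificity=73.3% accuracy=80.0%
```

The overall accuracy of 60/75 is the sum of the two correctly classified groups (27 + 33) over all 75 prospective spectra.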
Affiliation(s)
- Shiyamala Duraipandian
- National University of Singapore, Department of Bioengineering, Faculty of Engineering, Optical Bioimaging Laboratory, Singapore 117576, Singapore
|
24
|
Stingo FC, Vannucci M, Downey G. Bayesian wavelet-based curve classification via discriminant analysis with Markov random tree priors. Stat Sin 2012; 22:465-488. [PMID: 24761126 DOI: 10.5705/ss.2010.141] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Indexed: 11/06/2022]
Abstract
Discriminant analysis is an effective tool for the classification of experimental units into groups. When the number of variables is much larger than the number of observations, it is necessary to incorporate a dimension reduction procedure into the inferential process. Here we present a typical example from chemometrics that deals with the classification of different types of food into species via near infrared spectroscopy. We take a nonparametric approach by modeling the functional predictors via wavelet transforms and then apply discriminant analysis in the wavelet domain. We consider a Bayesian conjugate normal discriminant model, either linear or quadratic, that avoids independence assumptions among the wavelet coefficients. We introduce latent binary indicators for the selection of the discriminatory wavelet coefficients and propose prior formulations that use Markov random tree (MRT) priors to map scale-location connections among wavelet coefficients. We conduct posterior inference via MCMC methods, show the performance of the approach in our case study on food authenticity, and compare the results with those of several other procedures.
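To illustrate what moving a curve into the wavelet domain involves, here is a minimal sketch using the Haar wavelet. This is an assumption made for simplicity: the abstract does not say which wavelet family the authors use, and the function name `haar_dwt` and the toy spectrum are hypothetical.

```python
import math


def haar_dwt(signal):
    """Full orthonormal Haar decomposition of a sequence of length 2^J.

    Returns (approx, details), where approx holds the single coarsest
    scaling coefficient and details[0] is the coarsest detail scale.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = list(signal)
    details = []
    r = math.sqrt(2.0)
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Detail coefficients capture local differences at this scale.
        details.insert(0, [(a - b) / r for a, b in pairs])
        # Scaling coefficients carry the smoothed signal to the next scale.
        approx = [(a + b) / r for a, b in pairs]
    return approx, details


# Toy "spectrum" with 8 sampling points (illustrative values only).
spectrum = [4.0, 2.0, 5.0, 5.0, 1.0, 1.0, 3.0, 7.0]
approx, details = haar_dwt(spectrum)
print([len(d) for d in details])  # coefficients per scale, coarse to fine
```

In this layout the detail coefficient at position k of one scale sits above positions 2k and 2k+1 of the next finer scale, which is the kind of scale-location connection a tree-structured prior can exploit.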
Affiliation(s)
- Marina Vannucci
- Department of Statistics, Rice University, Houston, TX 77251, U.S.A.
- Gerard Downey
- Ashtown Food Research Centre, Teagasc, Ashtown, Dublin 15, Ireland.
|
25
|
Maugis C, Celeux G, Martin-Magniette ML. Variable selection in model-based discriminant analysis. J Multivariate Anal 2011. [DOI: 10.1016/j.jmva.2011.05.004] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Indexed: 10/18/2022]
|
26
|
Presi P, Reist M. Review of methodologies applicable to the validation of animal based indicators of welfare. EFSA Supporting Publications 2011. [DOI: 10.2903/sp.efsa.2011.en-171] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/11/2022]
|
27
|
Stingo FC, Vannucci M. Variable selection for discriminant analysis with Markov random field priors for the analysis of microarray data. Bioinformatics 2010; 27:495-501. [PMID: 21159623 DOI: 10.1093/bioinformatics/btq690] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.0] [Indexed: 01/13/2023]
Abstract
MOTIVATION: Discriminant analysis is an effective tool for the classification of experimental units into groups. Here, we consider the typical problem of classifying subjects according to phenotypes via gene expression data and propose a method that incorporates variable selection into the inferential procedure, for the identification of the important biomarkers. To achieve this goal, we build upon a conjugate normal discriminant model, both linear and quadratic, and include a stochastic search variable selection procedure via an MCMC algorithm. Furthermore, we incorporate into the model prior information on the relationships among the genes as described by a gene-gene network. We use a Markov random field (MRF) prior to map the network connections among genes. Our prior model assumes that neighboring genes in the network are more likely to have a joint effect on the relevant biological processes. RESULTS: We use simulated data to assess the performance of our method. In particular, we compare the MRF prior to a situation where independent Bernoulli priors are chosen for the individual predictors. We also illustrate the method on benchmark datasets for gene expression. Our simulation studies show that employing the MRF prior improves selection accuracy. In real data applications, in addition to identifying markers and improving prediction accuracy, we show how the integration of existing biological knowledge into the prior model results in an increased ability to identify genes with strong discriminatory power and also aids the interpretation of the results.
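To make the contrast between an MRF prior and independent Bernoulli priors concrete, here is a minimal sketch. It assumes a common Ising-type parametrization in which the log-odds of selecting gene j equal `mu + eta * (number of selected neighbours)`. The abstract does not state the exact form of the prior, so the parameter names `mu` and `eta`, the function name `inclusion_prob` and the toy network are all hypothetical.

```python
import math


def inclusion_prob(j, gamma, neighbors, mu, eta):
    """Conditional prior probability that gene j is selected, given the rest.

    Assumed Ising-type form (not taken from the abstract):
        logit P(gamma_j = 1 | others) = mu + eta * sum(gamma_i for i in N_j)
    """
    selected_neighbors = sum(gamma[i] for i in neighbors[j])
    return 1.0 / (1.0 + math.exp(-(mu + eta * selected_neighbors)))


# Toy network: gene 0 is linked to genes 1 and 2; gene 3 is isolated.
neighbors = {0: [1, 2], 1: [0], 2: [0], 3: []}
gamma = [0, 1, 1, 0]  # genes 1 and 2 are currently selected

p_mrf = inclusion_prob(0, gamma, neighbors, mu=-2.0, eta=1.0)
p_indep = inclusion_prob(0, gamma, neighbors, mu=-2.0, eta=0.0)  # eta=0: Bernoulli
p_isolated = inclusion_prob(3, gamma, neighbors, mu=-2.0, eta=1.0)
print(round(p_mrf, 3), round(p_indep, 3), round(p_isolated, 3))
```

Under this assumed form, setting `eta = 0` recovers an independent Bernoulli prior, while a positive `eta` raises the prior inclusion probability of a gene whose network neighbours are already selected.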
|
28
|
Murphy TB, Dean N, Raftery AE. Variable Selection and Updating In Model-Based Discriminant Analysis for High Dimensional Data with Food Authenticity Applications. Ann Appl Stat 2010; 4:396-421. [PMID: 20936055 DOI: 10.1214/09-aoas279] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.0] [Indexed: 11/19/2022]
Abstract
Food authenticity studies are concerned with determining whether food samples have been correctly labeled. Discriminant analysis methods are an integral part of the methodology for food authentication. Motivated by food authenticity applications, a model-based discriminant analysis method that includes variable selection is presented. The discriminant analysis model is fitted in a semi-supervised manner using both labeled and unlabeled data. The method is shown to give excellent classification performance on several high-dimensional multiclass food authenticity datasets with more variables than observations. The variables selected by the proposed method provide information about which variables are meaningful for classification purposes. A headlong search strategy for variable selection is shown to be computationally efficient and to achieve excellent classification performance. In applications to several food authenticity datasets, our proposed method outperformed default implementations of Random Forests, AdaBoost, transductive SVMs and Bayesian Multinomial Regression by substantial margins.
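The idea behind a headlong search can be sketched as follows: instead of evaluating every candidate variable and taking the best one, the search accepts the first candidate whose improvement clears a threshold. This is a simplified forward-only sketch with a generic scoring function. The paper's actual criterion and any removal steps are not described in the abstract, so the function name `headlong_forward`, the `min_gain` threshold and the toy score are hypothetical.

```python
def headlong_forward(candidates, score, min_gain=0.0):
    """Forward headlong search over variables.

    `score(subset)` returns a number to maximise (for example, a model
    selection criterion). The first variable whose gain exceeds `min_gain`
    is accepted immediately, without comparing it to the remaining ones.
    """
    selected = []
    current = score(selected)
    improved = True
    while improved:
        improved = False
        for v in candidates:
            if v in selected:
                continue
            trial = score(selected + [v])
            if trial - current > min_gain:
                selected.append(v)
                current = trial
                improved = True
                break  # headlong: take the first acceptable variable
    return selected


# Toy additive score (illustrative only): each variable has a fixed contribution.
weights = {"a": 0.5, "b": -0.2, "c": 2.0, "d": 0.0}


def toy_score(subset):
    return sum(weights[v] for v in subset)


chosen = headlong_forward(["a", "b", "c", "d"], toy_score)
print(chosen)  # prints: ['a', 'c']
```

In the toy run, variable `a` is accepted before `c` even though `c` offers the larger gain, which shows how the headlong rule trades optimality of each step for fewer score evaluations.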
|