1
Preud'homme G, Duarte K, Dalleau K, Lacomblez C, Bresso E, Smaïl-Tabbone M, Couceiro M, Devignes MD, Kobayashi M, Huttin O, Ferreira JP, Zannad F, Rossignol P, Girerd N. Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark. Sci Rep 2021; 11:4202. [PMID: 33603019 PMCID: PMC7892576 DOI: 10.1038/s41598-021-83340-8] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 09/28/2020] [Accepted: 02/02/2021] [Indexed: 11/22/2022]
Abstract
The choice of the most appropriate unsupervised machine-learning method for “heterogeneous” or “mixed” data, i.e. with both continuous and categorical variables, can be challenging. Our aim was to examine the performance of various clustering strategies for mixed data using both simulated and real-life data. We conducted a benchmark analysis of “ready-to-use” tools in R comparing 4 model-based (Kamila algorithm, Latent Class Analysis, Latent Class Model [LCM] and Clustering by Mixture Modeling) and 5 distance/dissimilarity-based (Gower distance or Unsupervised Extra Trees dissimilarity followed by hierarchical clustering or Partitioning Around Medoids, K-prototypes) clustering methods. Clustering performance was assessed by Adjusted Rand Index (ARI) on 1000 generated virtual populations consisting of mixed variables using 7 scenarios with varying population sizes, number of clusters, number of continuous and categorical variables, proportions of relevant (non-noisy) variables and degree of variable relevance (low, mild, high). Clustering methods were then applied to the EPHESUS randomized clinical trial data (a heart failure trial evaluating the effect of eplerenone) to illustrate the differences between clustering techniques. The simulations revealed the dominance of K-prototypes, Kamila and LCM models over all other methods. Overall, methods using dissimilarity matrices in classical algorithms such as Partitioning Around Medoids and Hierarchical Clustering had a lower ARI compared to model-based methods in all scenarios. When applying clustering methods to a real-life clinical dataset, LCM showed promising results with regard to differences in (1) clinical profiles across clusters, (2) prognostic performance (highest C-index) and (3) identification of patient subgroups with substantial treatment benefit.
The present findings suggest key differences in clustering performance between the tested algorithms (limited to tools readily available in R). In most of the tested scenarios, model-based methods (in particular the Kamila and LCM packages) and K-prototypes typically performed best in the setting of heterogeneous data.
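The abstract scores each method by the Adjusted Rand Index between the recovered clusters and the simulated ground truth. As an illustration only, the standard pair-counting ARI can be sketched in plain Python as below; the function name `adjusted_rand_index` and the toy label vectors are assumptions made for this example and are not taken from the article, whose benchmark was run with R tools.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting Adjusted Rand Index between two partitions.

    Illustrative sketch (not the authors' code): 1.0 means identical
    partitions up to relabeling, values near 0 mean chance-level agreement.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        # Fewer than two observations: the partitions trivially agree.
        return 1.0

    # Contingency counts: pairs of (true cluster, predicted cluster).
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of observation pairs grouped together in both / each partition.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    truth = [0, 0, 1, 2]      # hypothetical simulated cluster labels
    recovered = [0, 0, 1, 1]  # hypothetical output of a clustering method
    print(round(adjusted_rand_index(truth, recovered), 4))
```

Because the index depends only on which observations are grouped together, swapping cluster label names leaves the score unchanged, which is what makes it usable for comparing methods that number their clusters arbitrarily.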
Affiliation(s)
- Gregoire Preud'homme
- Centre d'Investigations Cliniques Plurithématique 1433, INSERM 1116, CHRU de Nancy, Université de Lorraine, Nancy, France; F-CRIN INI-CRCT Cardiovascular and Renal Clinical Trialists Network, Nancy, France.
- Kevin Duarte
- Centre d'Investigations Cliniques Plurithématique 1433, INSERM 1116, CHRU de Nancy, Université de Lorraine, Nancy, France.
- Kevin Dalleau
- CNRS, Inria Nancy Grand-Est, LORIA, UMR 7503, Université de Lorraine, Vandoeuvre-lès-Nancy, France.
- Claire Lacomblez
- Centre d'Investigations Cliniques Plurithématique 1433, INSERM 1116, CHRU de Nancy, Université de Lorraine, Nancy, France.
- Emmanuel Bresso
- CNRS, Inria Nancy Grand-Est, LORIA, UMR 7503, Université de Lorraine, Vandoeuvre-lès-Nancy, France.
- Malika Smaïl-Tabbone
- F-CRIN INI-CRCT Cardiovascular and Renal Clinical Trialists Network, Nancy, France; CNRS, Inria Nancy Grand-Est, LORIA, UMR 7503, Université de Lorraine, Vandoeuvre-lès-Nancy, France.
- Miguel Couceiro
- CNRS, Inria Nancy Grand-Est, LORIA, UMR 7503, Université de Lorraine, Vandoeuvre-lès-Nancy, France.
- Marie-Dominique Devignes
- F-CRIN INI-CRCT Cardiovascular and Renal Clinical Trialists Network, Nancy, France; CNRS, Inria Nancy Grand-Est, LORIA, UMR 7503, Université de Lorraine, Vandoeuvre-lès-Nancy, France.
- Masatake Kobayashi
- Centre d'Investigations Cliniques Plurithématique 1433, INSERM 1116, CHRU de Nancy, Université de Lorraine, Nancy, France; F-CRIN INI-CRCT Cardiovascular and Renal Clinical Trialists Network, Nancy, France.
- Olivier Huttin
- Centre d'Investigations Cliniques Plurithématique 1433, INSERM 1116, CHRU de Nancy, Université de Lorraine, Nancy, France; F-CRIN INI-CRCT Cardiovascular and Renal Clinical Trialists Network, Nancy, France.
- João Pedro Ferreira
- Centre d'Investigations Cliniques Plurithématique 1433, INSERM 1116, CHRU de Nancy, Université de Lorraine, Nancy, France; F-CRIN INI-CRCT Cardiovascular and Renal Clinical Trialists Network, Nancy, France.
- Faiez Zannad
- Centre d'Investigations Cliniques Plurithématique 1433, INSERM 1116, CHRU de Nancy, Université de Lorraine, Nancy, France; F-CRIN INI-CRCT Cardiovascular and Renal Clinical Trialists Network, Nancy, France.
- Patrick Rossignol
- Centre d'Investigations Cliniques Plurithématique 1433, INSERM 1116, CHRU de Nancy, Université de Lorraine, Nancy, France; F-CRIN INI-CRCT Cardiovascular and Renal Clinical Trialists Network, Nancy, France.
- Nicolas Girerd
- Centre d'Investigations Cliniques Plurithématique 1433, INSERM 1116, CHRU de Nancy, Université de Lorraine, Nancy, France; F-CRIN INI-CRCT Cardiovascular and Renal Clinical Trialists Network, Nancy, France; Centre d'Investigation Clinique Pierre Drouin - INSERM - CHRU de Nancy, Institut Lorrain du Cœur et des Vaisseaux Louis Mathieu, 4, Rue du Morvan, 54500, Vandœuvre-lès-Nancy, France.
2
Mongrand S, Badoc A, Patouille B, Lacomblez C, Chavent M, Cassagne C, Bessoule JJ. Taxonomy of gymnospermae: multivariate analyses of leaf fatty acid composition. Phytochemistry 2001; 58:101-115. [PMID: 11524119 DOI: 10.1016/s0031-9422(01)00139-x] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.4] [Indexed: 05/23/2023]
Abstract
The fatty acid composition of photosynthetic tissues from 137 species of gymnosperms belonging to 14 families was determined by gas chromatography. Statistical analysis clearly discriminated four groups. Ginkgoaceae, Cycadaceae, Stangeriaceae, Zamiaceae, Sciadopityaceae, Podocarpaceae, Cephalotaxaceae, Taxaceae, Ephedraceae and Welwitschiaceae are in the first group, while Cupressaceae and Araucariaceae are mainly in the second one. The third and fourth groups, composed of Pinaceae species, are characterized by the genus Larix and by the genera Abies and Cedrus, respectively. Principal component and discriminant analyses and divisive hierarchical clustering analysis of the 43 Pinaceae species were also performed. A clear-cut separation of the genera Abies, Larix, and Cedrus from the other Pinaceae was observed. In addition, a mass analysis of the two main chloroplastic lipids from 14 gymnosperms was performed. The results point to a distinctive feature of gymnosperms: in several species, and contrary to the angiosperms, the amount of digalactosyldiacylglycerol exceeds that of monogalactosyldiacylglycerol.
Affiliation(s)
- S Mongrand
- Laboratoire de Biogenèse Membranaire-CNRS-UMR 5544, Université Victor Segalen-Bordeaux II, Bordeaux, France.