1
|
Battistella E, Ghiassian D, Barabási AL. Improving the performance and interpretability on medical datasets using graphical ensemble feature selection. Bioinformatics 2024; 40:btae341. [PMID: 38837347 PMCID: PMC11187494 DOI: 10.1093/bioinformatics/btae341] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2023] [Revised: 04/19/2024] [Accepted: 05/24/2024] [Indexed: 06/07/2024] Open
Abstract
MOTIVATION: A major hindrance towards using Machine Learning (ML) on medical datasets is the discrepancy between a large number of variables and small sample sizes. While multiple feature selection techniques have been proposed to avoid the resulting overfitting, overall ensemble techniques offer the best selection robustness. Yet, current methods designed to combine different algorithms generally fail to leverage the dependencies identified by their components. Here, we propose Graphical Ensembling (GE), a graph-theory-based ensemble feature selection technique designed to improve the stability and relevance of the selected features. RESULTS: Relying on four datasets, we show that GE increases classification performance with fewer selected features. For example, on rheumatoid arthritis patient stratification, GE outperforms the baseline methods by 9% Balanced Accuracy while relying on fewer features. We use data on sub-cellular networks to show that the selected features (proteins) are closer to the known disease genes, and the uncovered biological mechanisms are more diversified. By successfully tackling the complex correlations between biological variables, we anticipate that GE will improve the medical applications of ML. AVAILABILITY AND IMPLEMENTATION: https://github.com/ebattistella/auto_machine_learning.
Affiliation(s)
- Enzo Battistella
- Network Science Institute, Northeastern University, Boston, MA 02115, United States
- Albert-László Barabási
- Network Science Institute, Northeastern University, Boston, MA 02115, United States
- Department of Data and Network Science, Central European University, Budapest 1051, Hungary
- Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, United States
2
|
Maiorino E, De Marzio M, Xu Z, Yun JH, Chase RP, Hersh CP, Weiss ST, Silverman EK, Castaldi PJ, Glass K. Joint clinical and molecular subtyping of COPD with variational autoencoders. medRxiv 2024:2023.08.19.23294298. [PMID: 38260473 PMCID: PMC10802661 DOI: 10.1101/2023.08.19.23294298] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/24/2024]
Abstract
Chronic Obstructive Pulmonary Disease (COPD) is a complex, heterogeneous disease. Traditional subtyping methods generally focus on either the clinical manifestations or the molecular endotypes of the disease, resulting in classifications that do not fully capture the disease's complexity. Here, we bridge this gap by introducing a subtyping pipeline that integrates clinical and gene expression data with variational autoencoders. We apply this methodology to the COPDGene study, a large study of current and former smoking individuals with and without COPD. Our approach generates a set of vector embeddings, called Personalized Integrated Profiles (PIPs), that recapitulate the joint clinical and molecular state of the subjects in the study. Prediction experiments show that the PIPs have a predictive accuracy comparable to or better than other embedding approaches. Using trajectory learning approaches, we analyze the main trajectories of variation in the PIP space and identify five well-separated subtypes with distinct clinical phenotypes, expression signatures, and disease outcomes. Notably, these subtypes are more robust to data resampling compared to those identified using traditional clustering approaches. Overall, our findings provide new avenues to establish fine-grained associations between the clinical characteristics, molecular processes, and disease outcomes of COPD.
Affiliation(s)
- Enrico Maiorino
- Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School
- Margherita De Marzio
- Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School
- Zhonghui Xu
- Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School
- Jeong H. Yun
- Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School
- Robert P. Chase
- Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School
- Craig P. Hersh
- Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School
- Scott T. Weiss
- Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School
- Edwin K. Silverman
- Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School
3
|
Ghafari R, Azar AS, Ghafari A, Aghdam FM, Valizadeh M, Khalili N, Hatamkhani S. Prediction of the Fatal Acute Complications of Myocardial Infarction via Machine Learning Algorithms. J Tehran Heart Cent 2023; 18:278-287. [PMID: 38680646 PMCID: PMC11053239 DOI: 10.18502/jthc.v18i4.14827] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2023] [Accepted: 06/05/2023] [Indexed: 05/01/2024] Open
Abstract
Background: Myocardial infarction (MI) is a major cause of death, particularly during the first year. The avoidance of potentially fatal outcomes requires expeditious preventative steps. Machine learning (ML) is a subfield of artificial intelligence that detects underlying patterns in large datasets in order to model them. This study aimed to establish an ML model with numerous features to predict the fatal complications of MI during the first 72 hours of hospital admission. Methods: We used an MI complications database that contains the demographic and clinical records of patients during the 3 days of admission, with 2 output classes: dead due to the known complications of MI, and alive. We used recursive feature elimination (RFE) for feature selection, reducing the number of features to 50. The performance of 4 common ML classifier algorithms, namely logistic regression, support vector machine, random forest, and extreme gradient boosting (XGBoost), was evaluated using 8 classification metrics (sensitivity, specificity, precision, false-positive rate, false-negative rate, accuracy, F1-score, and AUC). Results: In this study of 1699 patients with confirmed MI, 15.94% experienced fatal complications, and the rest remained alive. The XGBoost model achieved more desirable results based on the accuracy and F1-score metrics and distinguished patients with fatal complications from surviving ones (AUC=78.65%, sensitivity=94.35%, accuracy=91.47%, and F1-score=95.14%). Cardiogenic shock was the most significant feature influencing the prediction of the XGBoost algorithm. Conclusion: XGBoost can be a promising model for predicting fatal complications following MI.
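Seven of the eight metrics named in this abstract follow directly from the four confusion-matrix counts of a binary classifier; AUC needs ranked prediction scores and is left out here. The sketch below is only an illustration: the function name and the counts are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Confusion-matrix metrics for a binary classifier.

    Assumes every denominator is non-zero (both classes are present
    and at least one positive prediction was made).
    """
    sensitivity = tp / (tp + fn)   # true-positive rate (recall)
    specificity = tn / (tn + fp)   # true-negative rate
    precision = tp / (tp + fp)     # positive predictive value
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

With a class imbalance like the one reported above (about 16% fatal cases), accuracy alone can look high even when the minority class is poorly detected, which is one reason to report sensitivity and specificity side by side.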
Affiliation(s)
- Reza Ghafari
- Pharmacy Faculty, Urmia University of Medical Sciences, Urmia, Iran
- Ali Ghafari
- Medical Physics and Biomedical Engineering Department, School of Medicine, Tehran University of Medical Sciences, Tehran, Iran
- Research Center for Evidence-Based Medicine, Tabriz University of Medical Sciences, Tabriz, Iran
- Morteza Valizadeh
- Faculty of Electrical and Computer Engineering, Urmia University, Urmia, Iran
- Naser Khalili
- Department of Cardiology, School of Medicine, Urmia University of Medical Sciences, Urmia, Iran
- Shima Hatamkhani
- Experimental and Applied Pharmaceutical Sciences Research Center, Urmia University of Medical Sciences, Urmia, Iran
- Department of Clinical Pharmacy, Urmia University of Medical Sciences, Urmia, Iran
4
|
Shkunnikova S, Mijakovac A, Sironic L, Hanic M, Lauc G, Kavur MM. IgG glycans in health and disease: Prediction, intervention, prognosis, and therapy. Biotechnol Adv 2023; 67:108169. [PMID: 37207876 DOI: 10.1016/j.biotechadv.2023.108169] [Citation(s) in RCA: 18] [Impact Index Per Article: 18.0] [Received: 02/17/2023] [Revised: 05/01/2023] [Accepted: 05/02/2023] [Indexed: 05/21/2023]
Abstract
Immunoglobulin G (IgG) glycosylation is a complex enzymatically controlled process, essential for the structure and function of IgG. The IgG glycome is relatively stable in the state of homeostasis, yet its alterations have been associated with aging, pollution and toxic exposure, as well as various diseases, including autoimmune and inflammatory diseases, cardiometabolic diseases, infectious diseases and cancer. IgG is also an effector molecule directly involved in the inflammatory processes that form part of the pathogenesis of many diseases. Numerous recently published studies support the idea that IgG N-glycosylation fine-tunes the immune response and plays a significant role in chronic inflammation. This makes it a promising novel biomarker of biological age, and a prognostic, diagnostic and treatment evaluation tool. Here we provide an overview of the current state of knowledge regarding IgG glycosylation in health and disease, and its potential applications in pro-active prevention and monitoring of various health interventions.
Affiliation(s)
- Sofia Shkunnikova
- Genos Glycoscience Research Laboratory, Borongajska cesta 83H, Zagreb, Croatia
- Anika Mijakovac
- University of Zagreb, Faculty of Science, Department of Biology, Horvatovac 102a, Zagreb, Croatia
- Lucija Sironic
- Genos Glycoscience Research Laboratory, Borongajska cesta 83H, Zagreb, Croatia
- Maja Hanic
- Genos Glycoscience Research Laboratory, Borongajska cesta 83H, Zagreb, Croatia
- Gordan Lauc
- Genos Glycoscience Research Laboratory, Borongajska cesta 83H, Zagreb, Croatia; University of Zagreb, Faculty of Pharmacy and Biochemistry, Ulica Ante Kovačića 1, Zagreb, Croatia
5
|
Kirdin A, Sidorov S, Zolotykh N. Rosenblatt's First Theorem and Frugality of Deep Learning. Entropy (Basel, Switzerland) 2022; 24:1635. [PMID: 36359726 PMCID: PMC9689667 DOI: 10.3390/e24111635] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2022] [Revised: 11/02/2022] [Accepted: 11/06/2022] [Indexed: 06/16/2023]
Abstract
Rosenblatt's first theorem about the omnipotence of shallow networks states that elementary perceptrons can solve any classification problem if there are no discrepancies in the training set. Minsky and Papert considered elementary perceptrons with restrictions on the neural inputs: a bounded number of connections or a relatively small diameter of the receptive field for each neuron at the hidden layer. They proved that under these constraints, an elementary perceptron cannot solve some problems, such as the connectivity of input images or the parity of pixels in them. In this note, we demonstrated Rosenblatt's first theorem at work, showed how an elementary perceptron can solve a version of the travel maze problem, and analysed the complexity of that solution. We also constructed a deep network algorithm for the same problem, which is much more efficient. The shallow network uses an exponentially large number of neurons in the hidden layer (Rosenblatt's A-elements), whereas for the deep network, second-order polynomial complexity is sufficient. We demonstrated that for the same complex problem, the deep network can be much smaller, and revealed a heuristic behind this effect.
Affiliation(s)
- Alexander Kirdin
- Institute of Information Technologies, Mathematics and Mechanics, Lobachevsky State University, 603022 Nizhni Novgorod, Russia
- Institute for Computational Modelling, Russian Academy of Sciences, Siberian Branch, 660036 Krasnoyarsk, Russia
- Sergey Sidorov
- Institute of Information Technologies, Mathematics and Mechanics, Lobachevsky State University, 603022 Nizhni Novgorod, Russia
- Nikolai Zolotykh
- Institute of Information Technologies, Mathematics and Mechanics, Lobachevsky State University, 603022 Nizhni Novgorod, Russia
6
|
A Genetically-optimised Artificial Life Algorithm for Complexity-based Synthetic Dataset Generation. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.11.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
7
|
Barkalov K, Shtanyuk A, Sysoyev A. A Fast kNN Algorithm Using Multiple Space-Filling Curves. Entropy 2022; 24:e24060767. [PMID: 35741488 PMCID: PMC9223091 DOI: 10.3390/e24060767] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2022] [Revised: 05/23/2022] [Accepted: 05/25/2022] [Indexed: 12/10/2022]
Abstract
The paper considers a time-efficient implementation of the k nearest neighbours (kNN) algorithm. A well-known approach for accelerating the kNN algorithm is to utilise dimensionality reduction methods based on the use of space-filling curves. In this paper, we take this approach further and propose an algorithm that employs multiple space-filling curves and is faster (with comparable quality) compared with the kNN algorithm, which uses kd-trees to determine the nearest neighbours. A specific method for constructing multiple Peano curves is outlined, and statements are given about the preservation of object proximity information in the course of dimensionality reduction. An experimental comparison with known kNN implementations using kd-trees was performed using test and real-life data.
8
|
Zinovyev A, Sadovsky M, Calzone L, Fouché A, Groeneveld CS, Chervov A, Barillot E, Gorban AN. Modeling Progression of Single Cell Populations Through the Cell Cycle as a Sequence of Switches. Front Mol Biosci 2022; 8:793912. [PMID: 35178429 PMCID: PMC8846220 DOI: 10.3389/fmolb.2021.793912] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/12/2021] [Accepted: 12/15/2021] [Indexed: 11/13/2022] Open
Abstract
The cell cycle is a biological process underlying the existence and propagation of life in time and space. It has long been an object of mathematical modeling, with several alternative mechanistic modeling principles suggested, describing the known molecular mechanisms in more or less detail. Recently, the cell cycle has been investigated at the single-cell level in snapshots of unsynchronized cell populations, exploiting new methods for transcriptomic and proteomic molecular profiling. This raises a need for simplified semi-phenomenological cell cycle models, in order to formalize the processes underlying the cell cycle at a higher level of abstraction. Here we suggest a modeling framework recapitulating the most important properties of the cell cycle as a limit trajectory of a dynamical process characterized by several internal states with switches between them. In the simplest form, this leads to a limit cycle trajectory composed of linear segments in logarithmic coordinates describing some extensive (depending on system size) cell properties. We prove a theorem connecting the effective embedding dimensionality of the cell cycle trajectory with the number of its linear segments. We also develop a simplified kinetic model with piecewise-constant kinetic rates describing the dynamics of lumps of genes involved in the S and G2/M phases. We show how the developed cell cycle models can be applied to analyze the available single cell datasets and simulate certain properties of the observed cell cycle trajectories. Based on our model, we can predict with good accuracy the cell line doubling time from the length of the cell cycle trajectory.
Affiliation(s)
- Andrei Zinovyev
- Institut Curie, PSL Research University, Paris, France
- INSERM, Paris, France
- MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, Paris, France
- *Correspondence: Andrei Zinovyev
- Michail Sadovsky
- Institute of Computational Modeling (RAS), Krasnoyarsk, Russia
- Laboratory of Medical Cybernetics, V.F.Voino-Yasenetsky Krasnoyarsk State Medical University, Krasnoyarsk, Russia
- Federal Research and Clinic Center of FMBA of Russia, Krasnoyarsk, Russia
- Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky University, Nizhniy Novgorod, Russia
- Laurence Calzone
- Institut Curie, PSL Research University, Paris, France
- INSERM, Paris, France
- MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, Paris, France
- Aziz Fouché
- Institut Curie, PSL Research University, Paris, France
- INSERM, Paris, France
- MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, Paris, France
- Clarice S. Groeneveld
- Cartes d’Identité des Tumeurs (CIT) Program, Ligue Nationale Contre le Cancer, Paris, France
- Oncologie Moleculaire, UMR144, Institut Curie, Paris, France
- Alexander Chervov
- Institut Curie, PSL Research University, Paris, France
- INSERM, Paris, France
- MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, Paris, France
- Emmanuel Barillot
- Institut Curie, PSL Research University, Paris, France
- INSERM, Paris, France
- MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, Paris, France
- Alexander N. Gorban
- Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky University, Nizhniy Novgorod, Russia
- Department of Mathematics, University of Leicester, Leicester, United Kingdom
9
|
Gautier T, Ziegler LB, Gerber MS, Campos-Náñez E, Patek SD. Artificial intelligence and diabetes technology: A review. Metabolism 2021; 124:154872. [PMID: 34480920 DOI: 10.1016/j.metabol.2021.154872] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 05/18/2021] [Revised: 07/27/2021] [Accepted: 08/28/2021] [Indexed: 12/15/2022]
Abstract
Artificial intelligence (AI) is widely discussed in the popular literature and is portrayed as impacting many aspects of human life, both in and out of the workplace. The potential for revolutionizing healthcare is significant because of the availability of increasingly powerful computational platforms and methods, along with increasingly informative sources of patient data, both in and out of clinical settings. This review aims to provide a realistic assessment of the potential for AI in understanding and managing diabetes, accounting for the state of the art in the methodology and medical devices that collect data, process data, and act accordingly. Acknowledging that many conflicting definitions of AI have been put forth, this article attempts to characterize the main elements of the field as they relate to diabetes, identifying the main perspectives and methods that can (i) affect basic understanding of the disease, (ii) affect understanding of risk factors (genetic, clinical, and behavioral) of diabetes development, (iii) improve diagnosis, (iv) improve understanding of the arc of disease (progression and personal/societal impact), and finally (v) improve treatment.
Affiliation(s)
- Thibault Gautier
- Dexcom/TypeZero, 946 Grady Avenue, Suite 203, Charlottesville, VA 22903, United States of America
- Leah B Ziegler
- Dexcom/TypeZero, 946 Grady Avenue, Suite 203, Charlottesville, VA 22903, United States of America
- Matthew S Gerber
- Dexcom/TypeZero, 946 Grady Avenue, Suite 203, Charlottesville, VA 22903, United States of America
- Enrique Campos-Náñez
- Dexcom/TypeZero, 946 Grady Avenue, Suite 203, Charlottesville, VA 22903, United States of America
- Stephen D Patek
- Dexcom/TypeZero, 946 Grady Avenue, Suite 203, Charlottesville, VA 22903, United States of America
10
|
Bac J, Mirkes EM, Gorban AN, Tyukin I, Zinovyev A. Scikit-Dimension: A Python Package for Intrinsic Dimension Estimation. Entropy (Basel, Switzerland) 2021; 23:1368. [PMID: 34682092 PMCID: PMC8534554 DOI: 10.3390/e23101368] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 09/17/2021] [Revised: 10/10/2021] [Accepted: 10/16/2021] [Indexed: 02/07/2023]
Abstract
Dealing with uncertainty in applications of machine learning to real-life data critically depends on the knowledge of intrinsic dimensionality (ID). A number of methods have been suggested for the purpose of estimating ID, but no standard package to easily apply them one by one or all at once has been implemented in Python. This technical note introduces scikit-dimension, an open-source Python package for intrinsic dimension estimation. The scikit-dimension package provides a uniform implementation of most of the known ID estimators based on the scikit-learn application programming interface to evaluate the global and local intrinsic dimension, as well as generators of synthetic toy and benchmark datasets widespread in the literature. The package is developed with tools assessing the code quality, coverage, unit testing and continuous integration. We briefly describe the package and demonstrate its use in a large-scale (more than 500 datasets) benchmarking of methods for ID estimation for real-life and synthetic data.
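To make the notion of intrinsic dimension concrete, the sketch below estimates it from the ratio of each point's second to first nearest-neighbour distance, using the maximum-likelihood form d = N / Σ log(r2/r1). This is a standard-library illustration in the spirit of two-nearest-neighbour estimators, not the package's implementation; the function name and the toy data are assumptions made for the example.

```python
import math
import random


def two_nn_dimension(points):
    """Global intrinsic dimension from nearest-neighbour distance ratios.

    For each point, r1 and r2 are the distances to its first and second
    nearest neighbours; the estimate is N / sum(log(r2 / r1)).
    Brute-force O(N^2) search, so suitable for small samples only.
    """
    log_ratios = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0:  # skip exact duplicates, where the ratio is undefined
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)


# Hypothetical toy data: a 2-D uniform square embedded in 5-D space.
rng = random.Random(0)
data = [(rng.random(), rng.random(), 0.0, 0.0, 0.0) for _ in range(400)]
estimate = two_nn_dimension(data)  # expected to be close to 2, not 5
```

The point of the toy data is that the ambient dimension (5) and the intrinsic dimension (2) differ, and the estimator should recover the latter up to sampling noise.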
Affiliation(s)
- Jonathan Bac
- Institut Curie, PSL Research University, 75248 Paris, France
- INSERM, U900, 75248 Paris, France
- CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75272 Paris, France
- Evgeny M. Mirkes
- Department of Mathematics, University of Leicester, Leicester LE1 7RH, UK; (E.M.M.); (A.N.G.); (I.T.)
- Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky University, 603105 Nizhniy Novgorod, Russia
- Alexander N. Gorban
- Department of Mathematics, University of Leicester, Leicester LE1 7RH, UK; (E.M.M.); (A.N.G.); (I.T.)
- Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky University, 603105 Nizhniy Novgorod, Russia
- Ivan Tyukin
- Department of Mathematics, University of Leicester, Leicester LE1 7RH, UK; (E.M.M.); (A.N.G.); (I.T.)
- Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky University, 603105 Nizhniy Novgorod, Russia
- Andrei Zinovyev
- Institut Curie, PSL Research University, 75248 Paris, France
- INSERM, U900, 75248 Paris, France
- CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75272 Paris, France
- Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky University, 603105 Nizhniy Novgorod, Russia
11
|
Acceleration of Global Optimization Algorithm by Detecting Local Extrema Based on Machine Learning. Entropy 2021; 23:e23101272. [PMID: 34681996 PMCID: PMC8534649 DOI: 10.3390/e23101272] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/28/2021] [Revised: 09/25/2021] [Accepted: 09/26/2021] [Indexed: 11/17/2022]
Abstract
This paper features the study of global optimization problems and numerical methods of their solution. Such problems are computationally expensive since the objective function can be multi-extremal, nondifferentiable, and, as a rule, given in the form of a “black box”. This study used a deterministic algorithm for finding the global extremum. This algorithm is based neither on the concept of multistart, nor nature-inspired algorithms. The article provides computational rules of the one-dimensional algorithm and the nested optimization scheme which could be applied for solving multidimensional problems. Please note that the solution complexity of global optimization problems essentially depends on the presence of multiple local extrema. In this paper, we apply machine learning methods to identify regions of attraction of local minima. The use of local optimization algorithms in the selected regions can significantly accelerate the convergence of global search as it could reduce the number of search trials in the vicinity of local minima. The results of computational experiments carried out on several hundred global optimization problems of different dimensionalities presented in the paper confirm the effect of accelerated convergence (in terms of the number of search trials required to solve a problem with a given accuracy).
12
|
Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets. Entropy 2020; 22:e22111274. [PMID: 33287042 PMCID: PMC7711596 DOI: 10.3390/e22111274] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 10/16/2020] [Revised: 11/06/2020] [Accepted: 11/07/2020] [Indexed: 12/27/2022]
Abstract
Construction of graph-based approximations for multi-dimensional data point clouds is widely used in a variety of areas. Notable examples of applications of such approximators are cellular trajectory inference in single-cell data analysis, analysis of clinical trajectories from synchronic datasets, and skeletonization of images. Several methods have been proposed to construct such approximating graphs, with some based on computation of minimum spanning trees and some based on principal graphs generalizing principal curves. In this article we propose a methodology to compare and benchmark these two graph-based data approximation approaches, as well as to define their hyperparameters. The main idea is to avoid comparing graphs directly, but at first to induce clustering of the data point cloud from the graph approximation and, secondly, to use well-established methods to compare and score the data cloud partitioning induced by the graphs. In particular, mutual information-based approaches prove to be useful in this context. The induced clustering is based on decomposing a graph into non-branching segments, and then clustering the data point cloud by the nearest segment. Such a method allows efficient comparison of graph-based data approximations of arbitrary topology and complexity. The method is implemented in Python using the standard scikit-learn library which provides high speed and efficiency. As a demonstration of the methodology we analyse and compare graph-based data approximation methods using synthetic as well as real-life single cell datasets.
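The comparison step described in this abstract (label each data point by its nearest graph segment, then score two such labelings against each other) can be illustrated with a mutual-information score. The sketch below computes normalized mutual information between two partitions of the same points. Normalizing by the arithmetic mean of the two entropies is one common convention and is an assumption here, as are the function name and the toy labels; the paper's exact scoring choices are not given in the abstract.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_a, labels_b):
    """NMI between two partitions of the same points (natural logarithm).

    Normalized by the arithmetic mean of the two partition entropies,
    so identical partitions score 1 and independent ones score 0.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    mutual_info = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )
    mean_entropy = (entropy(count_a) + entropy(count_b)) / 2
    # Two single-cluster partitions are trivially identical.
    return mutual_info / mean_entropy if mean_entropy > 0 else 1.0


# Hypothetical segment labels induced by two different approximating graphs.
same_up_to_renaming = normalized_mutual_information(
    [0, 0, 1, 1, 2, 2], [5, 5, 7, 7, 9, 9]
)
unrelated = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the score depends only on how points are grouped, not on the label values or on the graphs themselves, it can compare approximations with different topologies and numbers of segments.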