1. Amézquita EJ, Nasrin F, Storey KM, Yoshizawa M. Genomics data analysis via spectral shape and topology. PLoS One 2023; 18:e0284820. [PMID: 37099525 PMCID: PMC10132553 DOI: 10.1371/journal.pone.0284820] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/29/2022] [Accepted: 04/09/2023] [Indexed: 04/27/2023] [Open Access]
Abstract
Mapper, a topological algorithm, is frequently used as an exploratory tool to build a graphical representation of data. This representation can help to gain a better understanding of the intrinsic shape of high-dimensional genomic data and to retain information that may be lost using standard dimension-reduction algorithms. We propose a novel workflow to process and analyze RNA-seq data from tumor and healthy subjects integrating Mapper, differential gene expression, and spectral shape analysis. Precisely, we show that a Gaussian mixture approximation method can be used to produce graphical structures that successfully separate tumor and healthy subjects, and produce two subgroups of tumor subjects. A further analysis using DESeq2, a popular tool for the detection of differentially expressed genes, shows that these two subgroups of tumor cells bear two distinct gene regulations, suggesting two discrete paths for forming lung cancer, which could not be highlighted by other popular clustering methods, including t-distributed stochastic neighbor embedding (t-SNE). Although Mapper shows promise in analyzing high-dimensional data, tools to statistically analyze Mapper graphical structures are limited in the existing literature. In this paper, we develop a scoring method using heat kernel signatures that provides an empirical setting for statistical inferences such as hypothesis testing, sensitivity analysis, and correlation analysis.
Affiliation(s)
- Erik J. Amézquita: Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI, United States of America
- Farzana Nasrin: Department of Mathematics, University of Hawaii at Manoa, Honolulu, HI, United States of America
- Kathleen M. Storey: Department of Mathematics, Lafayette College, Easton, PA, United States of America
- Masato Yoshizawa: School of Life Sciences, University of Hawaii at Manoa, Honolulu, HI, United States of America
2. Owada T. Convergence of persistence diagram in the sparse regime. Ann Appl Probab 2022. [DOI: 10.1214/22-aap1800] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022]
3. Aslam J, Ardanza-Trevijano S, Xiong J, Arsuaga J, Sazdanovic R. TAaCGH Suite for Detecting Cancer-Specific Copy Number Changes Using Topological Signatures. Entropy 2022; 24:e24070896. [PMID: 35885119 PMCID: PMC9318413 DOI: 10.3390/e24070896] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2022] [Revised: 06/13/2022] [Accepted: 06/23/2022] [Indexed: 11/25/2022]
Abstract
Copy number changes play an important role in the development of cancer and are commonly associated with changes in gene expression. Persistence curves, such as Betti curves, have been used to detect copy number changes; however, it is known these curves are unstable with respect to small perturbations in the data. We address the stability of lifespan and Betti curves by providing bounds on the distance between persistence curves of Vietoris–Rips filtrations built on data and slightly perturbed data in terms of the bottleneck distance. Next, we perform simulations to compare the predictive ability of Betti curves, lifespan curves (conditionally stable) and stable persistent landscapes to detect copy number aberrations. We use these methods to identify significant chromosome regions associated with the four major molecular subtypes of breast cancer: Luminal A, Luminal B, Basal and HER2 positive. Identified segments are then used as predictor variables to build machine learning models which classify patients as one of the four subtypes. We find that no single persistence curve outperforms the others and instead suggest a complementary approach using a suite of persistence curves. In this study, we identified new cytobands associated with three of the subtypes: 1q21.1-q25.2, 2p23.2-p16.3, 23q26.2-q28 with the Basal subtype, 8p22-p11.1 with Luminal B and 2q12.1-q21.1 and 5p14.3-p12 with Luminal A. These segments are validated by the TCGA BRCA cohort dataset except for those found for Luminal A.
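As an illustration of the Betti curves named in this abstract: a Betti curve records, at each filtration value, how many persistence intervals are alive. The following is a minimal sketch rather than the TAaCGH implementation. It assumes intervals are given as (birth, death) pairs and are treated as half-open, `[birth, death)`. The function name `betti_curve` and the sample intervals are hypothetical.

```python
from typing import List, Sequence, Tuple


def betti_curve(intervals: Sequence[Tuple[float, float]],
                thresholds: Sequence[float]) -> List[int]:
    """For each threshold t, count the intervals with birth <= t < death."""
    return [
        sum(1 for birth, death in intervals if birth <= t < death)
        for t in thresholds
    ]


# Hypothetical persistence intervals in a single homological dimension.
example_intervals = [(0.0, 1.0), (0.0, 0.4), (0.2, 0.9)]
example_thresholds = [0.0, 0.3, 0.5, 0.95, 1.0]

print(betti_curve(example_intervals, example_thresholds))  # [2, 3, 2, 1, 0]
```

Because the count jumps whenever a threshold crosses a birth or death value, a small shift in an interval endpoint can change the curve at nearby thresholds, which is the kind of sensitivity the abstract's stability bounds address.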
Affiliation(s)
- Jai Aslam: Department of Mathematics, NC State University, Raleigh, NC 27695, USA
- Sergio Ardanza-Trevijano: Department of Physics and Applied Mathematics, University of Navarra, 31008 Pamplona, Spain; Institute for Data Science and Artificial Intelligence, University of Navarra, 31009 Pamplona, Spain
- Jingwei Xiong: Graduate Group in Biostatistics, University of California Davis, Davis, CA 95616, USA
- Javier Arsuaga (corresponding author): Department of Molecular and Cellular Biology, University of California Davis, Davis, CA 95616, USA; Department of Mathematics, University of California Davis, Davis, CA 95616, USA
- Radmila Sazdanovic (corresponding author): Department of Mathematics, NC State University, Raleigh, NC 27695, USA
4. Dey TK, Mandal S, Mukherjee S. Gene expression data classification using topology and machine learning models. BMC Bioinformatics 2021; 22:627. [PMID: 35596135 PMCID: PMC9121583 DOI: 10.1186/s12859-022-04704-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/28/2022] [Accepted: 04/28/2022] [Indexed: 12/02/2022] [Open Access]
Abstract
Background: Interpretation of high-throughput gene expression data continues to require mathematical tools in data analysis that recognize the shape of the data in high dimensions. Topological data analysis (TDA) has recently been successful in extracting robust features in several applications dealing with high dimensional constructs. In this work, we utilize some recent developments in TDA to curate gene expression data. Our work differs from its predecessors in two aspects: (1) Traditional TDA pipelines use topological signatures called barcodes to enhance feature vectors which are used for classification. In contrast, this work involves curating relevant features to obtain somewhat better representatives with the help of TDA. These representatives of the entire data facilitate better comprehension of the phenotype labels. (2) Most of the earlier works employ barcodes obtained using topological summaries as fingerprints for the data. Even though they are stable signatures, there exists no direct mapping between the data and said barcodes. Results: The topology-relevant curated data that we obtain provide an improvement in shallow learning as well as deep learning based supervised classifications. We further show that the representative cycles we compute have an unsupervised inclination towards phenotype labels. This work thus shows that topological signatures are able to comprehend gene expression levels and classify cohorts accordingly. Conclusions: In this work, we engender representative persistent cycles to discern the gene expression data. These cycles allow us to directly procure genes entailed in similar processes.
5. Prediction in Cancer Genomics Using Topological Signatures and Machine Learning. Topological Data Analysis 2020. [DOI: 10.1007/978-3-030-43408-3_10] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 02/07/2023]
6. Martín-Vide C, Vega-Rodríguez MA, Wheeler T. A Topological Data Analysis Approach on Predicting Phenotypes from Gene Expression Data. Algorithms for Computational Biology 2020. [PMCID: PMC7197058 DOI: 10.1007/978-3-030-42266-0_14] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/02/2022]
Abstract
The goal of this study was to investigate if gene expression measured from RNA sequencing contains enough signal to separate healthy and afflicted individuals in the context of phenotype prediction. We observed that standard machine learning methods alone performed somewhat poorly on the disease phenotype prediction task; therefore we devised an approach augmenting machine learning with topological data analysis. We describe a framework for predicting phenotype values by utilizing gene expression data transformed into sample-specific topological signatures by employing feature subsampling and persistent homology. The topological data analysis approach developed in this work yielded improved results on Parkinson’s disease phenotype prediction when measured against standard machine learning methods. This study confirms that gene expression can be a useful indicator of the presence or absence of a condition, and the subtle signal contained in this high dimensional data reveals itself when considering the intricate topological connections between expressed genes.
7. Gene Coexpression Network Comparison via Persistent Homology. Int J Genomics 2018; 2018:7329576. [PMID: 30327773 PMCID: PMC6169238 DOI: 10.1155/2018/7329576] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 05/02/2018] [Revised: 07/21/2018] [Accepted: 07/26/2018] [Indexed: 11/17/2022] [Open Access]
Abstract
Persistent homology, a topological data analysis (TDA) method, is applied to microarray data sets. Although a few papers refer to TDA methods in microarray analysis, to the best of our knowledge persistent homology has not previously been used to compare several weighted gene coexpression networks (WGCN). We calculate the persistent homology of weighted networks constructed from 38 Arabidopsis microarray data sets to test the relevance and the success of this approach in distinguishing the stress factors. We quantify multiscale topological features of each network using persistent homology and apply a hierarchical clustering algorithm to the distance matrix whose entries are the pairwise bottleneck distances between the networks. The immunoresponses to different stress factors are distinguishable by our method. The networks of similar immunoresponses are found to be close with respect to bottleneck distance, indicating similar topological features of the WGCNs. This computationally efficient technique for analyzing networks provides a quick test for advanced studies.
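The clustering step described in this abstract, hierarchical clustering applied to a precomputed pairwise distance matrix, can be sketched as follows. This is an illustration only: the abstract does not state which linkage the authors used, so single linkage is an assumption here, and the matrix `example_dist` holds made-up values standing in for bottleneck distances between four networks. The function name `single_linkage` is hypothetical.

```python
from typing import List, Sequence


def single_linkage(dist: Sequence[Sequence[float]], k: int) -> List[List[int]]:
    """Agglomerative clustering on a symmetric distance matrix.

    Repeatedly merges the two closest clusters until k clusters remain.
    Single linkage is an assumption, not the cited paper's stated choice.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Hypothetical pairwise distances between four networks (not real data).
example_dist = [
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.90, 0.20, 0.00],
]

print(single_linkage(example_dist, 2))  # [[0, 1], [2, 3]]
```

In this toy matrix, networks 0 and 1 are close to each other and far from networks 2 and 3, so cutting the hierarchy at two clusters recovers those two groups.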
8. DeGregory KW, Kuiper P, DeSilvio T, Pleuss JD, Miller R, Roginski JW, Fisher CB, Harness D, Viswanath S, Heymsfield SB, Dungan I, Thomas DM. A review of machine learning in obesity. Obes Rev 2018; 19:668-685. [PMID: 29426065 PMCID: PMC8176949 DOI: 10.1111/obr.12667] [Citation(s) in RCA: 101] [Impact Index Per Article: 16.8] [Received: 10/22/2017] [Revised: 11/18/2017] [Accepted: 11/28/2017] [Indexed: 12/15/2022]
Abstract
Rich sources of obesity-related data arising from sensors, smartphone apps, electronic medical health records and insurance data can bring new insights for understanding, preventing and treating obesity. For such large datasets, machine learning provides sophisticated and elegant tools to describe, classify and predict obesity-related risks and outcomes. Here, we review machine learning methods that predict and/or classify such as linear and logistic regression, artificial neural networks, deep learning and decision tree analysis. We also review methods that describe and characterize data such as cluster analysis, principal component analysis, network science and topological data analysis. We introduce each method with a high-level overview followed by examples of successful applications. The algorithms were then applied to National Health and Nutrition Examination Survey to demonstrate methodology, utility and outcomes. The strengths and limitations of each method were also evaluated. This summary of machine learning algorithms provides a unique overview of the state of data analysis applied specifically to obesity.
Affiliation(s)
- K W DeGregory: Department of Mathematical Sciences, United States Military Academy, West Point, NY, USA
- P Kuiper: Department of Mathematical Sciences, United States Military Academy, West Point, NY, USA
- T DeSilvio: Case Western Reserve University, Cleveland, OH, USA
- J D Pleuss: Department of Mathematical Sciences, United States Military Academy, West Point, NY, USA
- R Miller: Department of Mathematical Sciences, United States Military Academy, West Point, NY, USA
- J W Roginski: Department of Mathematical Sciences, United States Military Academy, West Point, NY, USA
- C B Fisher: Department of Mathematical Sciences, United States Military Academy, West Point, NY, USA
- D Harness: Department of Mathematical Sciences, United States Military Academy, West Point, NY, USA
- S Viswanath: Case Western Reserve University, Cleveland, OH, USA
- S B Heymsfield: Pennington Biomedical Research Center, Baton Rouge, LA, USA
- I Dungan: Department of Mathematical Sciences, United States Military Academy, West Point, NY, USA
- D M Thomas: Department of Mathematical Sciences, United States Military Academy, West Point, NY, USA
9.
Abstract
Topological methods are emerging as a new set of tools for the analysis of large genomic datasets. They are mathematically grounded methods that extract information from the geometric structure of data. In the last few years, applications to evolutionary biology, cancer genomics, and the analysis of complex diseases have uncovered significant biological results, highlighting their utility for fulfilling some of the current analytic needs of genomics. In this review, the state of the art in the application of topological methods to genomics is summarized, and some of the present limitations and possible future developments are reviewed.
10. High DRC Levels Are Associated with Let-7b Overexpression in Women with Breast Cancer. Int J Mol Sci 2016; 17:ijms17060865. [PMID: 27271599 PMCID: PMC4926399 DOI: 10.3390/ijms17060865] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 04/07/2016] [Revised: 05/05/2016] [Accepted: 05/16/2016] [Indexed: 12/28/2022] [Open Access]
Abstract
Nucleotide Excision Repair (NER) is a critical pathway involved in breast cancer (BC). We have previously published that a low DNA repair capacity (DRC) is associated with a higher risk of BC in Puerto Rican women. Let-7b belongs to a miRNA family with tumor suppressor activity that targets oncogenes. We isolated miRNAs from plasma of 153 Puerto Rican women with and without BC. DRC was measured in lymphocytes by means of a host cell reactivation assay. These women were divided into four groups according to their DRC level: high (>3.8%) and low (<3.8%). The four groups consisted of BC patients with high (n = 35) and low (n = 43) DRC and controls with high (n = 39) and low (n = 36) DRC. Epidemiologic data were collected at initial BC diagnosis and almost five years after diagnosis. A significant difference in Let-7b expression was found in BC patients with high DRC versus the remaining groups (p < 0.001). Thus, our data reveal a possible role of Let-7b in DRC during breast carcinogenesis. Our study is innovative because it provides the first evidence that Let-7b may play a role in DRC regulation (through the NER repair pathway) in BC.