1. Jeon H, Xie J, Jeon Y, Jung KJ, Gupta A, Chang W, Chung D. Statistical Power Analysis for Designing Bulk, Single-Cell, and Spatial Transcriptomics Experiments: Review, Tutorial, and Perspectives. Biomolecules 2023; 13:221. [PMID: 36830591 PMCID: PMC9952882 DOI: 10.3390/biom13020221] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 12/26/2022] [Revised: 01/20/2023] [Accepted: 01/21/2023] [Indexed: 01/26/2023] [Open access]
Abstract
Gene expression profiling technologies have been used in various applications such as cancer biology. The development of gene expression profiling has expanded the scope of target discovery in transcriptomic studies, and each technology produces data with distinct characteristics. In order to guarantee biologically meaningful findings using transcriptomic experiments, it is important to consider various experimental factors in a systematic way through statistical power analysis. In this paper, we review and discuss the power analysis for three types of gene expression profiling technologies from a practical standpoint, including bulk RNA-seq, single-cell RNA-seq, and high-throughput spatial transcriptomics. Specifically, we describe the existing power analysis tools for each research objective for each of the bulk RNA-seq and scRNA-seq experiments, along with recommendations. On the other hand, since there are no power analysis tools for high-throughput spatial transcriptomics at this point, we instead investigate the factors that can influence power analysis.
Affiliation(s)
- Hyeongseon Jeon
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
  - Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH 43210, USA
- Juan Xie
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
  - Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH 43210, USA
  - The Interdisciplinary Ph.D. Program in Biostatistics, The Ohio State University, Columbus, OH 43210, USA
- Yeseul Jeon
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
  - Department of Statistics and Data Science, Yonsei University, Seoul 03722, Republic of Korea
  - Department of Applied Statistics, Yonsei University, Seoul 03722, Republic of Korea
- Kyeong Joo Jung
  - Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA
- Arkobrato Gupta
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
  - Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH 43210, USA
  - The Interdisciplinary Ph.D. Program in Biostatistics, The Ohio State University, Columbus, OH 43210, USA
- Won Chang
  - Division of Statistics and Data Science, University of Cincinnati, Cincinnati, OH 45221, USA
- Dongjun Chung
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
  - Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH 43210, USA
  - The Interdisciplinary Ph.D. Program in Biostatistics, The Ohio State University, Columbus, OH 43210, USA
2. Su M, Pan T, Chen QZ, Zhou WW, Gong Y, Xu G, Yan HY, Li S, Shi QZ, Zhang Y, He X, Jiang CJ, Fan SC, Li X, Cairns MJ, Wang X, Li YS. Data analysis guidelines for single-cell RNA-seq in biomedical studies and clinical applications. Mil Med Res 2022; 9:68. [PMID: 36461064 PMCID: PMC9716519 DOI: 10.1186/s40779-022-00434-8] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 09/27/2022] [Accepted: 11/18/2022] [Indexed: 12/03/2022] [Open access]
Abstract
The application of single-cell RNA sequencing (scRNA-seq) in biomedical research has advanced our understanding of the pathogenesis of disease and provided valuable insights into new diagnostic and therapeutic strategies. With the expansion of capacity for high-throughput scRNA-seq, including clinical samples, the analysis of these huge volumes of data has become a daunting prospect for researchers entering this field. Here, we review the workflow for typical scRNA-seq data analysis, covering raw data processing and quality control, basic data analysis applicable for almost all scRNA-seq data sets, and advanced data analysis that should be tailored to specific scientific questions. While summarizing the current methods for each analysis step, we also provide an online repository of software and wrapped-up scripts to support the implementation. Recommendations and caveats are pointed out for some specific analysis tasks and approaches. We hope this resource will be helpful to researchers engaging with scRNA-seq, in particular for emerging clinical applications.
Affiliation(s)
- Min Su
  - State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
- Tao Pan
  - College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199 Hainan China
- Qiu-Zhen Chen
  - State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
- Wei-Wei Zhou
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081 Heilongjiang China
- Yi Gong
  - State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
  - Department of Immunology, Nanjing Medical University, Nanjing, 211166 China
- Gang Xu
  - College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199 Hainan China
- Huan-Yu Yan
  - State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
- Si Li
  - College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199 Hainan China
- Qiao-Zhen Shi
  - State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
- Ya Zhang
  - College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199 Hainan China
- Xiao He
  - Department of Laboratory Medicine, Women and Children’s Hospital of Chongqing Medical University, Chongqing, 401174 China
- Shi-Cai Fan
  - Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, 518110 Guangdong China
- Xia Li
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081 Heilongjiang China
- Murray J. Cairns
  - School of Biomedical Sciences and Pharmacy, Faculty of Health and Medicine, the University of Newcastle, University Drive, Callaghan, NSW 2308 Australia
  - Precision Medicine Research Program, Hunter Medical Research Institute, New Lambton Heights, NSW 2305 Australia
- Xi Wang
  - State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
- Yong-Sheng Li
  - College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199 Hainan China
3. Zhang LX, Yan H, Liu Y, Xu J, Song J, Yu DJ. Enhancing Characteristic Gene Selection and Tumor Classification by the Robust Laplacian Supervised Discriminative Sparse PCA. J Chem Inf Model 2022; 62:1794-1807. [PMID: 35353532 DOI: 10.1021/acs.jcim.1c01403] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 02/08/2023]
Abstract
Characteristic gene selection and tumor classification of gene expression data play major roles in genomic research. Due to the small sample size and high dimensionality of gene expression data, it is common practice to perform dimensionality reduction prior to the use of machine learning-based methods to analyze the expression data. In this context, classical principal component analysis (PCA) and its improved versions have been widely used. Recently, methods based on supervised discriminative sparse PCA have been developed to improve the performance of data dimensionality reduction. However, such methods still have limitations: most of them do not jointly consider robustness to outliers and noise, label information, sparsity, and the capture of intrinsic geometrical structures in one objective function. To address this drawback, in this study, we propose a novel PCA-based method, known as the robust Laplacian supervised discriminative sparse PCA, termed RLSDSPCA, which enforces the L2,1 norm on the error function and incorporates the graph Laplacian into supervised discriminative sparse PCA. To evaluate the efficacy of the proposed RLSDSPCA, we applied it to the problems of characteristic gene selection and tumor classification using gene expression data. The results demonstrate that the proposed RLSDSPCA method, when used in combination with other related methods, can effectively identify new pathogenic genes associated with diseases. In addition, RLSDSPCA has also achieved the best performance compared with the state-of-the-art methods on tumor classification in terms of major performance metrics. The codes and data sets used in the study are freely available at http://csbio.njust.edu.cn/bioinf/rlsdspca/.
Affiliation(s)
- Lu-Xing Zhang
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
- He Yan
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
- Yan Liu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
- Jian Xu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
- Jiangning Song
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, Victoria 3800, Australia
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
4. Fratello M, Cattelani L, Federico A, Pavel A, Scala G, Serra A, Greco D. Unsupervised Algorithms for Microarray Sample Stratification. Methods Mol Biol 2022; 2401:121-146. [PMID: 34902126 DOI: 10.1007/978-1-0716-1839-4_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
The amount of data made available by microarrays gives researchers the opportunity to delve into the complexity of biological systems. However, the noisy and extremely high-dimensional nature of this kind of data poses significant challenges. Microarrays allow for the parallel measurement of thousands of molecular objects spanning different layers of interactions. In order to be able to discover hidden patterns, the most disparate analytical techniques have been proposed. Here, we describe the basic methodologies to approach the analysis of microarray datasets that focus on the task of (sub)group discovery.
Affiliation(s)
- Michele Fratello
  - Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
  - BioMediTech Institute, Tampere University, Tampere, Finland
  - Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere University, Tampere, Finland
- Luca Cattelani
  - Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
  - BioMediTech Institute, Tampere University, Tampere, Finland
  - Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere University, Tampere, Finland
- Antonio Federico
  - Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
  - BioMediTech Institute, Tampere University, Tampere, Finland
  - Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere University, Tampere, Finland
- Alisa Pavel
  - Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
  - BioMediTech Institute, Tampere University, Tampere, Finland
  - Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere University, Tampere, Finland
- Giovanni Scala
  - Department of Biology, University of Naples Federico II, Naples, Italy
- Angela Serra
  - Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
  - BioMediTech Institute, Tampere University, Tampere, Finland
  - Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere University, Tampere, Finland
- Dario Greco
  - Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
  - BioMediTech Institute, Tampere University, Tampere, Finland
  - Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere University, Tampere, Finland
  - Institute of Biotechnology, University of Helsinki, Helsinki, Finland
5. Way GP, Zietz M, Rubinetti V, Himmelstein DS, Greene CS. Compressing gene expression data using multiple latent space dimensionalities learns complementary biological representations. Genome Biol 2020; 21:109. [PMID: 32393369 PMCID: PMC7212571 DOI: 10.1186/s13059-020-02021-3] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 05/07/2019] [Accepted: 04/16/2020] [Indexed: 12/27/2022] [Open access]
Abstract
BACKGROUND: Unsupervised compression algorithms applied to gene expression data extract latent or hidden signals representing technical and biological sources of variation. However, these algorithms require a user to select a biologically appropriate latent space dimensionality. In practice, most researchers fit a single algorithm and latent dimensionality. We sought to determine the extent to which selecting only one fit limits the biological features captured in the latent representations and, consequently, limits what can be discovered with subsequent analyses. RESULTS: We compress gene expression data from three large datasets consisting of adult normal tissue, adult cancer tissue, and pediatric cancer tissue. We train many different models across a large range of latent space dimensionalities and observe various performance differences. We identify more curated pathway gene sets significantly associated with individual dimensions in denoising autoencoder and variational autoencoder models trained using an intermediate number of latent dimensionalities. Combining compressed features across algorithms and dimensionalities captures the most pathway-associated representations. When trained with different latent dimensionalities, models learn strongly associated and generalizable biological representations including sex, neuroblastoma MYCN amplification, and cell types. Stronger signals, such as tumor type, are best captured in models trained at lower dimensionalities, while more subtle signals such as pathway activity are best identified in models trained with more latent dimensionalities. CONCLUSIONS: There is no single best latent dimensionality or compression algorithm for analyzing gene expression data. Instead, using features derived from different compression models across multiple latent space dimensionalities enhances biological representations.
Affiliation(s)
- Gregory P Way
  - Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
  - Imaging Platform, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Michael Zietz
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
- Vincent Rubinetti
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
- Daniel S Himmelstein
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
- Casey S Greene
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
  - Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Philadelphia, PA, 19102, USA
6. A topological approach for cancer subtyping from gene expression data. J Biomed Inform 2020; 102:103357. [PMID: 31893527 DOI: 10.1016/j.jbi.2019.103357] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 05/29/2019] [Revised: 11/27/2019] [Accepted: 12/12/2019] [Indexed: 12/27/2022]
Abstract
BACKGROUND: Gene expression data contain key information that can be used for subtyping cancer patients. However, computational methods suffer from the 'curse of dimensionality' due to the very high dimensionality of omics data and are therefore unable to clearly distinguish between the discovered subtypes in terms of separation of survival plots. METHODS: To address this, we propose a framework based on the Topological Mapper algorithm. The novelty of this work is that we suggest a method for defining the filter function on which the mapper algorithm heavily depends. Survival analysis of the discovered cancer subtypes is carried out and evaluated in terms of minimum pairwise separation between the Kaplan-Meier plots. Furthermore, we present a method to measure the separation between the discovered subtypes based on hazard ratios. RESULTS: Five cancer genomics datasets obtained from The Cancer Genome Atlas portal have been used for comparisons with the Robust Sparse Correlation-Otrimle (RSC-Otrimle) algorithm and Similarity Network Fusion (SNF). Comparisons show that the minimum pairwise life expectancy difference (in days) between the discovered subtypes for lung, colon, breast, glioblastoma and kidney cancers is 107, 204, 20, 88 and 425 days, respectively, for the proposed methodology, whereas it is only 69, 43, 6, 61 and 282 days for RSC-Otrimle and 9, 95, 18, 60 and 148 days for SNF. Hazard ratio analysis also shows that the proposed methodology performs better in four of the five datasets. A visual inspection of Kaplan-Meier plots reveals that the proposed methodology achieves less overlap in Kaplan-Meier plots, especially for the lung, breast and kidney cases. Furthermore, relevant genetic pathways for each subtype have been obtained, and pathways that are possible targets for treatment are discussed. CONCLUSION: The significance of this work lies in an individualized understanding of cancer from patient to patient, which is the backbone of Precision Medicine.
7. Pamukcu E. Choosing the optimal hybrid covariance estimators in adaptive elastic net regression models using information complexity. J Stat Comput Sim 2019. [DOI: 10.1080/00949655.2019.1647431] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/26/2022]
Affiliation(s)
- Esra Pamukcu
  - Department of Statistics, Fırat University, Elazig, Turkey
8. Wilk G, Braun R. Integrative analysis reveals disrupted pathways regulated by microRNAs in cancer. Nucleic Acids Res 2019; 46:1089-1101. [PMID: 29294105 PMCID: PMC5814839 DOI: 10.1093/nar/gkx1250] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 08/31/2017] [Accepted: 12/01/2017] [Indexed: 02/06/2023] [Open access]
Abstract
MicroRNAs (miRNAs) are small endogenous regulatory molecules that modulate gene expression post-transcriptionally. Although differential expression of miRNAs has been implicated in many diseases (including cancers), the underlying mechanisms of action remain unclear. Because each miRNA can target multiple genes, miRNAs may potentially have functional implications for the overall behavior of entire pathways. Here, we investigate the functional consequences of miRNA dysregulation through an integrative analysis of miRNA and mRNA expression data using a novel approach that incorporates pathway information a priori. By searching for miRNA-pathway associations that differ between healthy and tumor tissue, we identify specific relationships at the systems level which are disrupted in cancer. Our approach is motivated by the hypothesis that if an miRNA and pathway are associated, then the expression of the miRNA and the collective behavior of the genes in a pathway will be correlated. As such, we first obtain an expression-based summary of pathway activity using Isomap, a dimension reduction method which can articulate non-linear structure in high-dimensional data. We then search for miRNAs that exhibit differential correlations with the pathway summary between phenotypes as a means of finding aberrant miRNA-pathway coregulation in tumors. We apply our method to cancer data using gene and miRNA expression datasets from The Cancer Genome Atlas and compare ∼10^5 miRNA-pathway relationships between healthy and tumor samples from four tissues (breast, prostate, lung and liver). Many of the flagged pairs we identify have a biological basis for disruption in cancer.
Affiliation(s)
- Gary Wilk
  - Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL 60208, USA
- Rosemary Braun
  - Biostatistics Division, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
  - Department of Engineering Sciences and Applied Mathematics, Northwestern University, Evanston, IL 60208, USA
9. Bhola A, Singh S. Visualisation and Modelling of High-Dimensional Cancerous Gene Expression Dataset. Journal of Information & Knowledge Management 2019. [DOI: 10.1142/s0219649219500011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
The increase in the number of dimensions of cancerous gene expression datasets increases their complexity and the risk of misinterpretation, and makes the datasets harder to visualise for further analysis. Therefore, dimensionality reduction, visualisation and modelling of these datasets become challenging. In this paper, a framework is developed which helps to understand, visualise and model high-dimensional cancerous gene expression datasets in lower dimensions, which may be helpful in revealing cancer mechanisms and in diagnosis. Initially, cancerous gene expression datasets are preprocessed to make them complete, precise and efficient, and principal component analysis is applied for dimensionality reduction and visualisation. Regression is used to model the cancerous gene expression datasets so that the type of association (linear or nonlinear) and the directions between gene profiles may be estimated. To assess the performance of the developed framework, three publicly available cancerous gene expression datasets are used, namely breast (GEO Acc. No. GDS5076), lung (GEO Acc. No. GDS5040) and prostate (GEO Acc. No. GDS5072). Cross-validation is used to validate the regression results. The results revealed that a linear approach should be used for the prostate cancer dataset and a nonlinear approach for the breast and lung cancer datasets in finding an association between gene pairs.
Affiliation(s)
- Abhishek Bhola
  - Department of Computer Science and Engineering, Punjab Engineering College (Deemed to be University), Sector 12, Chandigarh 160012, India
- Shailendra Singh
  - Department of Computer Science and Engineering, Punjab Engineering College (Deemed to be University), Sector 12, Chandigarh 160012, India
10. Gupta Y, Saini A. A new swarm-based efficient data clustering approach using KHM and fuzzy logic. Soft Comput 2018. [DOI: 10.1007/s00500-018-3514-1] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 10/28/2022]
11. Lefèvre T, Chariot P, Chauvin P. Multivariate methods for the analysis of complex and big data in forensic sciences. Application to age estimation in living persons. Forensic Sci Int 2016; 266:581.e1-581.e9. [DOI: 10.1016/j.forsciint.2016.05.014] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 01/14/2015] [Revised: 03/14/2016] [Accepted: 05/16/2016] [Indexed: 10/21/2022]
12. Ranked k-medoids: A fast and accurate rank-based partitioning algorithm for clustering large datasets. Knowl Based Syst 2013. [DOI: 10.1016/j.knosys.2012.10.012] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.4] [Indexed: 11/19/2022]
13. Ginsburg S, Ali S, Lee G, Basavanhally A, Madabhushi A. Variable importance in nonlinear kernels (VINK): classification of digitized histopathology. MEDICAL IMAGE COMPUTING AND COMPUTER-ASSISTED INTERVENTION : MICCAI ... INTERNATIONAL CONFERENCE ON MEDICAL IMAGE COMPUTING AND COMPUTER-ASSISTED INTERVENTION 2013; 16:238-45. [PMID: 24579146 DOI: 10.1007/978-3-642-40763-5_30] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 12/24/2022]
Abstract
Quantitative histomorphometry is the process of modeling appearance of disease morphology on digitized histopathology images via image-based features (e.g., texture, graphs). Due to the curse of dimensionality, building classifiers with large numbers of features requires feature selection (which may require a large training set) or dimensionality reduction (DR). DR methods map the original high-dimensional features in terms of eigenvectors and eigenvalues, which limits the potential for feature transparency or interpretability. Although methods exist for variable selection and ranking on embeddings obtained via linear DR schemes (e.g., principal components analysis (PCA)), similar methods do not yet exist for nonlinear DR (NLDR) methods. In this work we present a simple yet elegant method for approximating the mapping between the data in the original feature space and the transformed data in the kernel PCA (KPCA) embedding space; this mapping provides the basis for quantification of variable importance in nonlinear kernels (VINK). We show how VINK can be implemented in conjunction with the popular Isomap and Laplacian eigenmap algorithms. VINK is evaluated in the contexts of three different problems in digital pathology: (1) predicting five year PSA failure following radical prostatectomy, (2) predicting Oncotype DX recurrence risk scores for ER+ breast cancers, and (3) distinguishing good and poor outcome p16+ oropharyngeal tumors. We demonstrate that subsets of features identified by VINK provide similar or better classification or regression performance compared to the original high dimensional feature sets.
Affiliation(s)
- Shoshana Ginsburg
  - Department of Biomedical Engineering, Case Western Reserve University, USA
- Sahirzeeshan Ali
  - Department of Biomedical Engineering, Case Western Reserve University, USA
- George Lee
  - Department of Biomedical Engineering, Rutgers University, USA
- Anant Madabhushi
  - Department of Biomedical Engineering, Case Western Reserve University, USA
14. Dinger SC, Van Wyk MA, Carmona S, Rubin DM. Clustering gene expression data using a diffraction-inspired framework. Biomed Eng Online 2012; 11:85. [PMID: 23164195 PMCID: PMC3549897 DOI: 10.1186/1475-925x-11-85] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 08/13/2012] [Accepted: 11/12/2012] [Indexed: 11/17/2022] [Open access]
Abstract
Background: Recent developments in microarray technology have allowed for the simultaneous measurement of gene expression levels. The large amount of captured data challenges conventional statistical tools for analysing and finding inherent correlations between genes and samples. The unsupervised clustering approach is often used, resulting in the development of a wide variety of algorithms. Typical clustering algorithms require selecting certain parameters to operate, for instance the number of expected clusters, as well as defining a similarity measure to quantify the distance between data points. The diffraction‐based clustering algorithm, however, is designed to overcome this necessity for user‐defined parameters, as it is able to automatically search the data for any underlying structure. Methods: The diffraction‐based clustering algorithm presented in this paper is tested using five well‐known expression datasets pertaining to cancerous tissue samples. The clustering results are then compared to those obtained from conventional algorithms such as k‐means, fuzzy c‐means, the self‐organising map, hierarchical clustering, the Gaussian mixture model and density‐based spatial clustering of applications with noise (DBSCAN). The performance of each algorithm is measured using an average external criterion and an average validity index. Results: The diffraction‐based clustering algorithm is shown to be independent of the number of clusters, as the algorithm searches the feature space and requires no form of parameter selection. The results show that the diffraction‐based clustering algorithm performs significantly better on the real biological datasets compared to the other existing algorithms. Conclusion: The results of the diffraction‐based clustering algorithm presented in this paper suggest that the method can provide researchers with a new tool for successfully analysing microarray data.
Affiliation(s)
- Steven C Dinger
  - Biomedical Engineering Research Group, School of Electrical & Information Engineering, University of the Witwatersrand, Johannesburg, South Africa
15. Mahapatra R, Majhi B, Rout M. Reduced Feature Based Efficient Cancer Classification Using Single Layer Neural Network. Procedia Technology 2012. [DOI: 10.1016/j.protcy.2012.10.022] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/16/2022]
16. Pamplona R, Costantini D. Molecular and structural antioxidant defenses against oxidative stress in animals. Am J Physiol Regul Integr Comp Physiol 2011; 301:R843-63. [PMID: 21775650 DOI: 10.1152/ajpregu.00034.2011] [Citation(s) in RCA: 204] [Impact Index Per Article: 14.6] [Indexed: 01/12/2023]
Abstract
In this review, it is our aim 1) to describe the high diversity in molecular and structural antioxidant defenses against oxidative stress in animals, 2) to extend the traditional concept of antioxidant to other structural and functional factors affecting the "whole" organism, 3) to incorporate, when supportable by evidence, mechanisms into models of life-history trade-offs and maternal/epigenetic inheritance, 4) to highlight the importance of studying the biochemical integration of redox systems, and 5) to discuss the link between maximum life span and antioxidant defenses. The traditional concept of antioxidant defenses emphasizes the importance of the chemical nature of molecules with antioxidant properties. Research in the past 20 years shows that animals have also evolved a high diversity in structural defenses that should be incorporated in research on antioxidant responses to reactive species. Although there is a high diversity in antioxidant defenses, many of them are evolutionarily conserved across animal taxa. In particular, enzymatic defenses and the heat shock response mediated by proteins show a low degree of variation. Importantly, activation of an antioxidant response may also be demanding in energy and nutrients. Knowledge of antioxidant mechanisms could therefore allow us to identify and quantify any underlying costs, which can help explain life-history trade-offs. Moreover, the study of how antioxidant mechanisms are inherited has clear potential to evaluate the contribution of epigenetic mechanisms to stress response phenotype variation.
Affiliation(s)
- Reinald Pamplona
  - Department of Experimental Medicine, University of Lleida Biomedical Research Institute of Lleida, Lleida, Spain