1
Hwangbo S, Lee S, Hosain MM, Goo T, Lee S, Kim I, Park T. Kernel-based hierarchical structural component models for pathway analysis on survival phenotype. Genes Genomics 2024:10.1007/s13258-024-01569-9. [PMID: 39327384 DOI: 10.1007/s13258-024-01569-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2024] [Accepted: 09/07/2024] [Indexed: 09/28/2024]
Abstract
BACKGROUND High-throughput sequencing, particularly RNA-sequencing (RNA-seq), has advanced differential gene expression analysis, revealing pathways involved in various biological conditions. Traditional pathway-based methods generally consider pathways independently, overlooking the correlations among them and ignoring the many biomarkers that overlap between pathways. In addition, most pathway-based approaches assume that biomarkers have linear effects on the phenotype of interest. OBJECTIVE This study aims to develop the HisCoM-KernelS model to identify survival phenotype-related pathways by accommodating complex, nonlinear relationships between genes and survival outcomes, while accounting for inter-pathway correlations. METHODS We applied the HisCoM-KernelS model to the TCGA pancreatic ductal adenocarcinoma (PDAC) RNA-seq dataset, comprising 4,498 protein-coding genes mapped to 186 KEGG pathways from 148 PDAC samples. Kernel machine regression was used to model pathway effects on survival outcomes, incorporating hierarchical gene-pathway structures. Model parameters were estimated using the alternating least squares algorithm, and the significance of pathways was assessed through a permutation test. RESULTS HisCoM-KernelS identified several pathways significantly associated with pancreatic cancer survival, including those corroborated by previous studies. HisCoM-KernelS, especially with the Gaussian kernel, showed a better balance of detection rate and number of significant pathways compared to four other existing pathway-based methods: HisCoM-PAGE, Global Test, GSEA, and CoxKM. CONCLUSION HisCoM-KernelS successfully extends pathway-based analysis to survival outcomes, capturing complex nonlinear gene effects and inter-pathway correlations. Its application to the TCGA PDAC dataset emphasizes its utility in identifying biologically relevant pathways, offering a robust tool for survival phenotype research in high-throughput sequencing data.
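The permutation test named in the Methods of this abstract can be illustrated with a generic sketch. The Python below is not the HisCoM-KernelS implementation: the names `permutation_p_value` and `mean_difference`, the toy two-group statistic, and the add-one correction are all assumptions made for illustration, since the abstract does not give the test statistic used for survival outcomes.

```python
import random


def permutation_p_value(statistic, scores, labels, n_perm=1000, seed=0):
    """Empirical two-sided p-value from random label permutations.

    `statistic` is any callable taking (scores, labels) and returning a number.
    This is a generic sketch, not the statistic used by HisCoM-KernelS.
    """
    rng = random.Random(seed)
    observed = abs(statistic(scores, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        # Shuffling the labels breaks any real link between scores and outcome.
        rng.shuffle(shuffled)
        if abs(statistic(scores, shuffled)) >= observed:
            hits += 1
    # Add-one correction (an assumption here) keeps the estimate above zero.
    return (hits + 1) / (n_perm + 1)


def mean_difference(scores, labels):
    """Toy statistic (hypothetical): difference in group means for 0/1 labels."""
    group1 = [s for s, l in zip(scores, labels) if l == 1]
    group0 = [s for s, l in zip(scores, labels) if l == 0]
    return sum(group1) / len(group1) - sum(group0) / len(group0)
```

Calling `permutation_p_value(mean_difference, scores, labels)` on a per-sample pathway score and a binary grouping returns the fraction of permutations at least as extreme as the observed statistic. A survival analysis would replace `mean_difference` with a statistic suited to censored outcomes.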
Affiliation(s)
- Suhyun Hwangbo: Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 151-747, Korea; Department of Genomic Medicine, Seoul National University Hospital, Seoul, 03080, Korea
- Sungyoung Lee: Department of Genomic Medicine, Seoul National University Hospital, Seoul, 03080, Korea
- Md Mozaffar Hosain: Department of Statistics, Seoul National University, Seoul, 151-747, Korea
- Taewan Goo: Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 151-747, Korea
- Seungyeoun Lee: Department of Mathematics and Statistics, Sejong University, Sejong, 05006, Korea
- Inyoung Kim: Department of Statistics, Virginia Tech, Blacksburg, VA, 24060, USA
- Taesung Park: Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 151-747, Korea; Department of Statistics, Seoul National University, Seoul, 151-747, Korea
2
Atkins S, Einarsson G, Clemmensen L, Ames B. Proximal methods for sparse optimal scoring and discriminant analysis. ADV DATA ANAL CLASSI 2022. [DOI: 10.1007/s11634-022-00530-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2022]
3
Hwangbo S, Lee S, Lee S, Hwang H, Kim I, Park T. Kernel-based hierarchical structural component models for pathway analysis. Bioinformatics 2022; 38:3078-3086. [PMID: 35460238 DOI: 10.1093/bioinformatics/btac276] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2021] [Revised: 04/08/2022] [Indexed: 11/14/2022]
Abstract
MOTIVATION Pathway analyses have led to more insight into the underlying biological functions related to the phenotype of interest in various types of omics data. Pathway-based statistical approaches have been actively developed, but most of them do not consider correlations among pathways. Because it is well known that quite a few biomarkers overlap between pathways, these approaches may provide misleading results. In addition, most pathway-based approaches tend to assume that biomarkers within a pathway have linear associations with the phenotype of interest, even though the relationships are more complex. RESULTS To model complex effects including nonlinear effects, we propose a new approach, Hierarchical structural CoMponent analysis using Kernel (HisCoM-Kernel). The proposed method models nonlinear associations between biomarkers and phenotype by extending the kernel machine regression and analyzes entire pathways simultaneously by using the biomarker-pathway hierarchical structure. HisCoM-Kernel is a flexible model that can be applied to various omics data. It was successfully applied to three omics datasets generated by different technologies. Our simulation studies showed that HisCoM-Kernel provided higher statistical power than other existing pathway-based methods in all datasets. The application of HisCoM-Kernel to three types of omics datasets showed its superior performance compared to existing methods in identifying more biologically meaningful pathways, including those reported in previous studies. AVAILABILITY AND IMPLEMENTATION Freely available at http://statgen.snu.ac.kr/software/HisCom-Kernel/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Suhyun Hwangbo: Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 151-747, Korea; Department of Genomic Medicine, Seoul National University Hospital, Seoul, 03080, Korea
- Sungyoung Lee: Department of Genomic Medicine, Seoul National University Hospital, Seoul, 03080, Korea
- Seungyeoun Lee: Department of Mathematics and Statistics, Sejong University, Sejong, 05006, Korea
- Heungsun Hwang: Department of Psychology, McGill University, Montreal, QC, H3A 1B1, Canada
- Inyoung Kim: Department of Statistics, Virginia Tech, Blacksburg, Virginia, 24060, U.S.A
- Taesung Park: Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 151-747, Korea; Department of Statistics, Seoul National University, Seoul, 151-747, Korea
4
Bao Y, Liu Y. Varying coefficient linear discriminant analysis for dynamic data. Electron J Stat 2022. [DOI: 10.1214/22-ejs2066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Yajie Bao: School of Mathematical Sciences, Shanghai Jiao Tong University, 200240 Shanghai, China
- Yuyang Liu: School of Mathematical Sciences, Shanghai Jiao Tong University, 200240 Shanghai, China
5
Broś-Konopielko M, Białek A, Oleszczuk-Modzelewska L, Zaleśkiewicz B, Różańska-Walędziak A, Czajkowski K. Nutritional, Anthropometric and Sociodemographic Factors Affecting Fatty Acids Profile of Pregnant Women's Serum at Labour-Chemometric Studies. Nutrients 2021; 13:2948. [PMID: 34578833 PMCID: PMC8470577 DOI: 10.3390/nu13092948] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2021] [Revised: 08/18/2021] [Accepted: 08/23/2021] [Indexed: 11/16/2022]
Abstract
Diet influences the health of pregnant women and their children in prenatal, postnatal and adult periods. GC-FID fatty acids profile analysis in maternal serum and a survey of dietary habits were performed in 161 pregnant patients from the II Faculty and Clinic of Obstetrics and Gynaecology of the Medical University of Warsaw. Their diet did not fulfil all nutritional recommendations regarding dietary fat sources. Olive and rapeseed oil were the most popular edible oils. High usage of sunflower oil as well as high consumption of butter were also observed, whereas fish and fish oil intake by pregnant women was low. A chemometric approach for nutritional data, connected with anthropometric, sociodemographic and biochemical parameters regarding mothers and newborns, was conducted for diet and its impact estimation. It revealed four clusters of patients with differing fatty acids profile, which resulted from differences in their dietary habits. Multiparous women followed dietary recommendations to a lesser extent, which resulted in deterioration of fatty acids profile and higher frequency of complications. Observed high usage of sunflower oil is disquieting due to its lower oxidative stability, whereas high butter consumption is beneficial due to conjugated linoleic acids supply. Pregnant women should also be encouraged to introduce fish and fish oil into their diet, as these products are rich sources of long chain polyunsaturated fatty acids (LC PUFA). Multiparous women should be given special medical care by medical providers (physicians, midwives and dietitians) and growing attention from the government to diminish the risk of possible adverse effects affecting mother and child.
Affiliation(s)
- Magdalena Broś-Konopielko: II Faculty and Clinic of Obstetrics and Gynaecology, Medical University of Warsaw, 02-091 Warsaw, Poland
- Agnieszka Białek: Department of Biotechnology and Nutrigenomics, Institute of Genetics and Animal Biotechnology of the Polish Academy of Sciences, 05-552 Magdalenka, Poland
- Barbara Zaleśkiewicz: II Faculty and Clinic of Obstetrics and Gynaecology, Medical University of Warsaw, 02-091 Warsaw, Poland
- Anna Różańska-Walędziak: II Faculty and Clinic of Obstetrics and Gynaecology, Medical University of Warsaw, 02-091 Warsaw, Poland
- Krzysztof Czajkowski: II Faculty and Clinic of Obstetrics and Gynaecology, Medical University of Warsaw, 02-091 Warsaw, Poland
6
Li G, Duan X, Wu Z, Wu C. Generalized elastic net optimal scoring problem for feature selection. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.03.018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
7
High-dimensional linear discriminant analysis with moderately clipped LASSO. COMMUNICATIONS FOR STATISTICAL APPLICATIONS AND METHODS 2021. [DOI: 10.29220/csam.2021.28.1.021] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/17/2022]
8
Zhang L, Kim I. Finite mixtures of semiparametric Bayesian survival kernel machine regressions: Application to breast cancer gene pathway subgroup analysis. J R Stat Soc Ser C Appl Stat 2020. [DOI: 10.1111/rssc.12457] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Affiliation(s)
- Lin Zhang: Department of Statistics, Virginia Tech, Blacksburg, VA, USA
- Inyoung Kim: Department of Statistics, Virginia Tech, Blacksburg, VA, USA
9
Luo S, Chen Z. A procedure of linear discrimination analysis with detected sparsity structure for high-dimensional multi-class classification. J MULTIVARIATE ANAL 2020. [DOI: 10.1016/j.jmva.2020.104641] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
10
Chen H, He Y, Ji J, Shi Y. The sparse group lasso for high-dimensional integrative linear discriminant analysis with application to Alzheimer's disease prediction. J STAT COMPUT SIM 2020. [DOI: 10.1080/00949655.2020.1800011] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/23/2022]
Affiliation(s)
- Hao Chen: School of Statistics, Shandong University of Finance and Economics, Jinan, People's Republic of China
- Yong He: Institute for Financial Studies, Shandong University, Jinan, People's Republic of China
- Jiadong Ji: School of Statistics, Shandong University of Finance and Economics, Jinan, People's Republic of China
- Yufeng Shi: School of Statistics, Shandong University of Finance and Economics, Jinan, People's Republic of China; Institute for Financial Studies, Shandong University, Jinan, People's Republic of China
11
Yang X, Tian L, Chen Y, Yang L, Xu S, Wu W. Inverse Projection Representation and Category Contribution Rate for Robust Tumor Recognition. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1262-1275. [PMID: 30575544 DOI: 10.1109/tcbb.2018.2886334] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Sparse representation based classification (SRC) methods have achieved remarkable results. SRC, however, still suffers from requiring enough training samples, insufficient use of test samples, and instability of representation. In this paper, a stable inverse projection representation based classification (IPRC) is presented to tackle these problems by effectively using test samples. An IPR is first proposed and its feasibility and stability are analyzed. A classification criterion named category contribution rate is constructed to match the IPR and complete classification. Moreover, a statistical measure is introduced to quantify the stability of representation-based classification methods. Based on the IPRC technique, a robust tumor recognition framework is presented by interpreting microarray gene expression data, where a two-stage hybrid gene selection method is introduced to select informative genes. Finally, the functional analysis of candidate pathogenicity-related genes is given. Extensive experiments on six public tumor microarray gene expression datasets demonstrate the proposed technique is competitive with state-of-the-art methods.
12
Lippmann C, Ultsch A, Lötsch J. Computational functional genomics-based reduction of disease-related gene sets to their key components. Bioinformatics 2020; 35:2362-2370. [PMID: 30500872 DOI: 10.1093/bioinformatics/bty986] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 05/30/2018] [Revised: 09/05/2018] [Accepted: 11/29/2018] [Indexed: 01/21/2023]
Abstract
MOTIVATION The genetic architecture of diseases is becoming increasingly well known. This raises difficulties in picking suitable targets for further research among an increasing number of candidates. Although expression based methods of gene set reduction are applied to laboratory-derived genetic data, the analysis of topical sets of genes gathered from knowledge bases requires a modified approach as no quantitative information about gene expression is available. RESULTS We propose a computational functional genomics-based approach to reducing sets of genes to the most relevant items based on the importance of the gene within the polyhierarchy of biological processes characterizing the disease. Knowledge bases about the biological roles of genes can provide a valid description of traits or diseases represented as a directed acyclic graph (DAG) picturing the polyhierarchy of disease relevant biological processes. The proposed method uses a gene importance score derived from the location of the gene-related biological processes in the DAG. It attempts to recreate the DAG and thereby, the roles of the original gene set, with the least number of genes in descending order of importance. This obtained precision and recall of over 70% to recreate the components of the DAG characterizing the biological functions of n=540 genes relevant to pain with a subset of only the k=29 best-scoring genes. CONCLUSIONS A new method for reduction of gene sets is shown that is able to reproduce the biological processes in which the full gene set is involved by over 70% while using only ∼5% of the original genes. AVAILABILITY AND IMPLEMENTATION The necessary numerical parameters for the calculation of gene importance are implemented in the R package dbtORA at https://github.com/IME-TMP-FFM/dbtORA. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Catharina Lippmann: Fraunhofer Institute of Molecular Biology and Applied Ecology - Project Group Translational Medicine and Pharmacology (IME-TMP), Frankfurt am Main, Germany
- Alfred Ultsch: DataBionics Research Group, University of Marburg, Marburg, Germany
- Jörn Lötsch: Fraunhofer Institute of Molecular Biology and Applied Ecology - Project Group Translational Medicine and Pharmacology (IME-TMP), Frankfurt am Main, Germany; Goethe-University, Institute of Clinical Pharmacology, Frankfurt am Main, Germany
13
Islam SJ, Kim JH, Topel M, Liu C, Ko YA, Mujahid MS, Sims M, Mubasher M, Ejaz K, Morgan-Billingslea J, Jones K, Waller EK, Jones D, Uppal K, Dunbar SB, Pemu P, Vaccarino V, Searles CD, Baltrus P, Lewis TT, Quyyumi AA, Taylor H. Cardiovascular Risk and Resilience Among Black Adults: Rationale and Design of the MECA Study. J Am Heart Assoc 2020; 9:e015247. [PMID: 32340530 PMCID: PMC7428584 DOI: 10.1161/jaha.119.015247] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Indexed: 12/19/2022]
Abstract
Background Cardiovascular disease incidence, prevalence, morbidity, and mortality have declined in the past several decades; however, disparities persist among subsets of the population. Notably, blacks have not experienced the same improvements on the whole as whites. Furthermore, frequent reports of relatively poorer health statistics among the black population have led to a broad assumption that black race reliably predicts relatively poorer health outcomes. However, substantial intraethnic and intraracial heterogeneity exists; moreover, individuals with similar risk factors and environmental exposures are often known to experience vastly different cardiovascular health outcomes. Thus, some individuals have good outcomes even in the presence of cardiovascular risk factors, a concept known as resilience. Methods and Results The MECA (Morehouse‐Emory Center for Health Equity) Study was designed to investigate the multilevel exposures that contribute to “resilience” in the face of risk for poor cardiovascular health among blacks in the greater Atlanta, GA, metropolitan area. We used census tract data to determine “at‐risk” and “resilient” neighborhoods with high or low prevalence of cardiovascular morbidity and mortality, based on cardiovascular death, hospitalization, and emergency department visits for blacks. More than 1400 individuals from these census tracts assented to demographic, health, and psychosocial questionnaires administered through telephone surveys. Afterwards, ≈500 individuals were recruited to enroll in a clinical study, where risk biomarkers, such as oxidative stress, and inflammatory markers, endothelial progenitor cells, metabolomic and microRNA profiles, and subclinical vascular dysfunction were measured. In addition, comprehensive behavioral questionnaires were collected and ideal cardiovascular health metrics were assessed using the American Heart Association's Life Simple 7 measure. 
Last, 150 individuals with low Life Simple 7 were recruited and randomized to a behavioral mobile health (eHealth) plus health coach or eHealth only intervention and followed up for improvement. Conclusions The MECA Study is investigating socioenvironmental and individual behavioral measures that promote resilience to cardiovascular disease in blacks by assessing biological, functional, and molecular mechanisms. REGISTRATION URL: https://www.clinicaltrials.gov. Unique identifier: NCT03308812.
Affiliation(s)
- Shabatun J Islam: Division of Cardiology, Department of Medicine, Emory University School of Medicine, Atlanta, GA
- Jeong Hwan Kim: Division of Cardiology, Department of Medicine, Emory University School of Medicine, Atlanta, GA
- Matthew Topel: Division of Cardiology, Department of Medicine, Emory University School of Medicine, Atlanta, GA
- Chang Liu: Division of Cardiology, Department of Medicine, Emory University School of Medicine, Atlanta, GA; Department of Epidemiology, Rollins School of Public Health, Emory University, Atlanta, GA
- Yi-An Ko: Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA
- Mahasin S Mujahid: Division of Epidemiology, School of Public Health, University of California, Berkeley, CA
- Mario Sims: Department of Medicine, University of Mississippi Medical Center, Jackson, MS
- Mohamed Mubasher: Department of Community Health and Preventive Medicine, Morehouse School of Medicine, Atlanta, GA
- Kiran Ejaz: Division of Cardiology, Department of Medicine, Emory University School of Medicine, Atlanta, GA
- Jan Morgan-Billingslea: Department of Community Health and Preventive Medicine, Morehouse School of Medicine, Atlanta, GA
- Kia Jones: Division of Cardiology, Department of Medicine, Emory University School of Medicine, Atlanta, GA
- Edmund K Waller: Department of Hematology and Oncology, Winship Cancer Institute, Emory University School of Medicine, Atlanta, GA
- Dean Jones: Division of Pulmonary, Allergy, Critical Care and Sleep Medicine, Department of Medicine, Emory University School of Medicine, Atlanta, GA
- Karan Uppal: Division of Pulmonary, Allergy, Critical Care and Sleep Medicine, Department of Medicine, Emory University School of Medicine, Atlanta, GA
- Sandra B Dunbar: Nell Hodgson Woodruff School of Nursing, Emory University, Atlanta, GA
- Priscilla Pemu: Department of Medicine, Morehouse School of Medicine, Atlanta, GA
- Viola Vaccarino: Division of Cardiology, Department of Medicine, Emory University School of Medicine, Atlanta, GA; Department of Epidemiology, Rollins School of Public Health, Emory University, Atlanta, GA
- Charles D Searles: Division of Cardiology, Department of Medicine, Emory University School of Medicine, Atlanta, GA
- Peter Baltrus: Department of Community Health and Preventive Medicine, Morehouse School of Medicine, Atlanta, GA; National Center for Primary Care, Morehouse School of Medicine, Atlanta, GA
- Tené T Lewis: Department of Epidemiology, Rollins School of Public Health, Emory University, Atlanta, GA
- Arshed A Quyyumi: Division of Cardiology, Department of Medicine, Emory University School of Medicine, Atlanta, GA
- Herman Taylor: Department of Medicine, Morehouse School of Medicine, Atlanta, GA
14
Tony Cai T, Zhang L. High dimensional linear discriminant analysis: optimality, adaptive algorithm and missing data. J R Stat Soc Series B Stat Methodol 2019. [DOI: 10.1111/rssb.12326] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 12/01/2022]
Affiliation(s)
- T. Tony Cai: University of Pennsylvania, Philadelphia, USA
15
Jung S, Ahn J, Jeon Y. Penalized Orthogonal Iteration for Sparse Estimation of Generalized Eigenvalue Problem. J Comput Graph Stat 2019. [DOI: 10.1080/10618600.2019.1568014] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 10/27/2022]
Affiliation(s)
- Sungkyu Jung: Department of Statistics, Seoul National University, Seoul, South Korea
- Jeongyoun Ahn: Department of Statistics, University of Georgia, Athens, GA
- Yongho Jeon: Department of Applied Statistics, Yonsei University, Seoul, South Korea
16
Gaynanova I, Wang T. Sparse quadratic classification rules via linear dimension reduction. J MULTIVARIATE ANAL 2019; 169:278-299. [PMID: 31105355 PMCID: PMC6516858 DOI: 10.1016/j.jmva.2018.09.011] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 12/16/2022]
Abstract
We consider the problem of high-dimensional classification between two groups with unequal covariance matrices. Rather than estimating the full quadratic discriminant rule, we propose to perform simultaneous variable selection and linear dimension reduction on the original data, with the subsequent application of quadratic discriminant analysis on the reduced space. In contrast to quadratic discriminant analysis, the proposed framework doesn't require the estimation of precision matrices; it scales linearly with the number of measurements, making it especially attractive for the use on high-dimensional datasets. We support the methodology with theoretical guarantees on variable selection consistency, and empirical comparisons with competing approaches. We apply the method to gene expression data of breast cancer patients, and confirm the crucial importance of the ESR1 gene in differentiating estrogen receptor status.
Affiliation(s)
- Irina Gaynanova: Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX 77843, USA
- Tianying Wang: Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX 77843, USA
17
Liu J, Yu G, Liu Y. Graph-based sparse linear discriminant analysis for high-dimensional classification. J MULTIVARIATE ANAL 2018; 171:250-269. [PMID: 31983784 DOI: 10.1016/j.jmva.2018.12.007] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 11/25/2022]
Abstract
Linear discriminant analysis (LDA) is a well-known classification technique that enjoyed great success in practical applications. Despite its effectiveness for traditional low-dimensional problems, extensions of LDA are necessary in order to classify high-dimensional data. Many variants of LDA have been proposed in the literature. However, most of these methods do not fully incorporate the structure information among predictors when such information is available. In this paper, we introduce a new high-dimensional LDA technique, namely graph-based sparse LDA (GSLDA), that utilizes the graph structure among the features. In particular, we use the regularized regression formulation for penalized LDA techniques, and propose to impose a structure-based sparse penalty on the discriminant vector β. The graph structure can be either given or estimated from the training data. Moreover, we explore the relationship between the within-class feature structure and the overall feature structure. Based on this relationship, we further propose a variant of our proposed GSLDA to utilize effectively unlabeled data, which can be abundant in the semi-supervised learning setting. With the new regularization, we can obtain a sparse estimate of β and more accurate and interpretable classifiers than many existing methods. Both the selection consistency of β estimation and the convergence rate of the classifier are established, and the resulting classifier has an asymptotic Bayes error rate. Finally, we demonstrate the competitive performance of the proposed GSLDA on both simulated and real data studies.
Affiliation(s)
- Jianyu Liu: Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC 27599, USA
- Guan Yu: Department of Biostatistics, University at Buffalo, Buffalo, NY 14214, USA
- Yufeng Liu: Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC 27599, USA; Department of Genetics, Department of Biostatistics, and Carolina Center for Genome Sciences, University of North Carolina, Chapel Hill, NC 27599, USA
18
Li Q, Li L. Integrative linear discriminant analysis with guaranteed error rate improvement. Biometrika 2018; 105:917-930. [PMID: 31762476 DOI: 10.1093/biomet/asy047] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 01/18/2023]
Abstract
Multiple types of data measured on a common set of subjects arise in many areas. Numerous empirical studies have found that integrative analysis of such data can result in better statistical performance in terms of prediction and feature selection. However, the advantages of integrative analysis have mostly been demonstrated empirically. In the context of two-class classification, we propose an integrative linear discriminant analysis method and establish a theoretical guarantee that it achieves a smaller classification error than running linear discriminant analysis on each data type individually. We address the issues of outliers and missing values, frequently encountered in integrative analysis, and illustrate our method through simulations and a neuroimaging study of Alzheimer's disease.
Affiliation(s)
- Quefeng Li: Department of Biostatistics, University of North Carolina at Chapel Hill, 3105D McGavran-Greenberg Hall, Chapel Hill, North Carolina 27599, U.S.A
- Lexin Li: Division of Biostatistics, University of California at Berkeley, 50 University Hall 7360, Berkeley, California 94720, U.S.A
19
Zhang L, Kim I. Semiparametric Bayesian kernel survival model for evaluating pathway effects. Stat Methods Med Res 2018; 28:3301-3317. [PMID: 30289021 DOI: 10.1177/0962280218797360] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 01/07/2023]
Abstract
Massive amounts of high-dimensional data have been accumulated over the past two decades, which has cultured increasing interests in identifying gene pathways related to certain biological processes. In particular, since pathway-based analysis has the ability to detect subtle changes of differentially expressed genes that could be missed when using gene-based analysis, detecting the gene pathways that regulate certain diseases can provide new strategies for medical procedures and new targets for drug discovery. Limited work has been carried out, primarily in regression settings, to study the effects of pathways on survival outcomes. Motivated by a breast cancer gene-pathway data set, which exhibits the "small n, large p" characteristics, we propose a semiparametric Bayesian kernel survival model (s-BKSurv) to study the effects of both clinical covariates and gene expression levels within a pathway on survival time. We model the unknown high-dimensional functions of pathways via Gaussian kernel machine to consider the possibility that genes within the same pathway interact with each other. To address the multiple comparisons problem under a full Bayesian setting, we propose a similarity-dependent procedure based on Bayes factor to control the family-wise error rate. We demonstrate the outperformance of our approach under various simulation settings and pathways data.
Affiliation(s)
- Lin Zhang
- Department of Statistics, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA
| | - Inyoung Kim
- Department of Statistics, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA
| |
|
20
|
|
21
|
Abstract
The Illumina Infinium BeadChips are a powerful array-based platform for genome-wide DNA methylation profiling at approximately 485,000 (450K) and 850,000 (EPIC) CpG sites across the genome. The platform is used in many large-scale population-based epigenetic studies of complex diseases, environmental exposures, or other experimental conditions. This chapter provides an overview of the key steps in analyzing Illumina BeadChip data. We describe key preprocessing steps including data extraction and quality control as well as normalization strategies. We further present principles and guidelines for conducting association analysis at the individual CpG level as well as more sophisticated pathway-based association tests.
Affiliation(s)
- Michael C Wu
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, M3-C102, Seattle, WA, 98109, USA.
| | - Pei-Fen Kuan
- Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA
| |
|
22
|
Lu Q, Qiao X. Sparse Fisher's linear discriminant analysis for partially labeled data. Stat Anal Data Min 2017. [DOI: 10.1002/sam.11367] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Affiliation(s)
- Qiyi Lu
- Department of Mathematical Sciences, Binghamton University, State University of New York, Binghamton, New York 13902-6000
| | - Xingye Qiao
- Department of Mathematical Sciences, Binghamton University, State University of New York, Binghamton, New York 13902-6000
| |
|
23
|
|
24
|
Richardson JB, Lee KY, Mireji P, Enyaru J, Sistrom M, Aksoy S, Zhao H, Caccone A. Genomic analyses of African Trypanozoon strains to assess evolutionary relationships and identify markers for strain identification. PLoS Negl Trop Dis 2017; 11:e0005949. [PMID: 28961238 PMCID: PMC5636163 DOI: 10.1371/journal.pntd.0005949] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2017] [Revised: 10/11/2017] [Accepted: 09/11/2017] [Indexed: 11/27/2022] Open
Abstract
African trypanosomes of the sub-genus Trypanozoon are eukaryotic parasites that cause disease in either humans or livestock. The development of genomic resources can be of great use to those interested in studying and controlling the spread of these trypanosomes. Here we present a large comparative analysis of Trypanozoon whole genomes, 83 in total, including human and animal infective African trypanosomes: 21 T. brucei brucei, 22 T. b. gambiense, 35 T. b. rhodesiense and 4 T. evansi strains, of which 21 were from Uganda. We constructed a maximum likelihood phylogeny based on 162,210 single nucleotide polymorphisms (SNPs). The three Trypanosoma brucei sub-species and Trypanosoma evansi are not monophyletic, confirming earlier studies that indicated high similarity among Trypanosoma “sub-species”. We also used discriminant analysis of principal components (DAPC) on the same set of SNPs, identifying seven genetic clusters. These clusters do not correspond well with existing taxonomic classifications, in agreement with the phylogenetic analysis. Geographic origin is reflected in both the phylogeny and clustering analysis. Finally, we used sparse linear discriminant analysis to rank SNPs by their informativeness in differentiating the strains in our data set. As few as 84 SNPs can completely distinguish the strains used in our study, and discriminant analysis was still able to detect genetic structure using as few as 10 SNPs. Our results reinforce earlier results of high genetic similarity between the African Trypanozoon. Despite this, a small subset of SNPs can be used to identify genetic markers that can be used for strain identification or other epidemiological investigations.
Trypanosomes are a major health threat to the people and livestock of Sub-Saharan Africa. Building genomic resources and understanding the genetic structure of these parasites will aid researchers trying to control their spread. To this end, we compared the genomes from 83 trypanosome strains, identifying 162,210 single nucleotide polymorphisms (SNPs) between them. Our analysis shows high genetic similarity between the trypanosomes, and confirms earlier results indicating that the traditional taxonomic classifications do not correspond well with genetic data. Further, we demonstrate that, despite the high genetic similarity, each strain in the study can be distinguished using as few as 84 SNPs, suggesting that a small number of SNPs can be useful for tracking and classifying populations of African trypanosomes.
Affiliation(s)
- Joshua Brian Richardson
- Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, United States of America
| | - Kuang-Yao Lee
- Yale School of Public Health, Yale University, New Haven, CT, United States of America
| | - Paul Mireji
- Biotechnology Research Institute, Kenya Agricultural and Livestock Research Organization, Kikuyu, Kenya
| | - John Enyaru
- School of Biological Sciences, Makerere University, Kampala, Uganda
| | - Mark Sistrom
- School of Natural Sciences, UC Merced, Merced, CA, United States of America
| | - Serap Aksoy
- Yale School of Public Health, Yale University, New Haven, CT, United States of America
| | - Hongyu Zhao
- Yale School of Public Health, Yale University, New Haven, CT, United States of America
| | - Adalgisa Caccone
- Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, United States of America
- Yale School of Public Health, Yale University, New Haven, CT, United States of America
| |
|
25
|
Tharwat A, Gaber T, Ibrahim A, Hassanien AE. Linear discriminant analysis: A detailed tutorial. AI COMMUN 2017. [DOI: 10.3233/aic-170729] [Citation(s) in RCA: 343] [Impact Index Per Article: 49.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022]
Affiliation(s)
- Alaa Tharwat
- Department of Computer Science and Engineering, Frankfurt University of Applied Sciences, Frankfurt am Main, Germany
- Faculty of Engineering, Suez Canal University, Egypt
| | - Tarek Gaber
- Faculty of Computers and Informatics, Suez Canal University, Egypt
| | | | | |
|
26
|
Inferring Genes and Biological Functions That Are Sensitive to the Severity of Toxicity Symptoms. Int J Mol Sci 2017; 18:ijms18040755. [PMID: 28368331 PMCID: PMC5412340 DOI: 10.3390/ijms18040755] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/26/2016] [Revised: 03/23/2017] [Accepted: 03/30/2017] [Indexed: 11/16/2022] Open
Abstract
The effective development of new drugs relies on the identification of genes that are related to the symptoms of toxicity. Although many researchers have inferred toxicity markers, most have focused on discovering toxicity occurrence markers rather than toxicity severity markers. In this study, we aimed to identify gene markers that are relevant to both the occurrence and severity of toxicity symptoms. To identify gene markers for each of four targeted liver toxicity symptoms, we used microarray expression profiles and pathology data from 14,143 in vivo rat samples. The gene markers were found using sparse linear discriminant analysis (sLDA) in which symptom severity is used as a class label. To evaluate the inferred gene markers, we constructed regression models that predicted the severity of toxicity symptoms from gene expression profiles. Our cross-validated results revealed that our approach was more successful at finding gene markers sensitive to the aggravation of toxicity symptoms than conventional methods. Moreover, these markers were closely involved in some of the biological functions significantly related to toxicity severity in the four targeted symptoms.
|
27
|
|
28
|
Gaynanova I, Booth JG, Wells MT. Simultaneous Sparse Estimation of Canonical Vectors in the p ≫ N Setting. J Am Stat Assoc 2016. [DOI: 10.1080/01621459.2015.1034318] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
|
29
|
He Q, Cai T, Liu Y, Zhao N, Harmon QE, Almli LM, Binder EB, Engel SM, Ressler KJ, Conneely KN, Lin X, Wu MC. Prioritizing individual genetic variants after kernel machine testing using variable selection. Genet Epidemiol 2016; 40:722-731. [PMID: 27488097 DOI: 10.1002/gepi.21993] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2015] [Revised: 05/28/2016] [Accepted: 06/20/2016] [Indexed: 01/06/2023]
Abstract
Kernel machine learning methods, such as the SNP-set kernel association test (SKAT), have been widely used to test associations between traits and genetic polymorphisms. In contrast to traditional single-SNP analysis methods, these methods are designed to examine the joint effect of a set of related SNPs (such as a group of SNPs within a gene or a pathway) and are able to identify sets of SNPs that are associated with the trait of interest. However, as with many multi-SNP testing approaches, kernel machine testing can draw conclusions only at the SNP-set level, and does not directly indicate which SNP(s) in an identified set actually drive the association. A recently proposed procedure, KerNel Iterative Feature Extraction (KNIFE), provides a general framework for incorporating variable selection into kernel machine methods. In this article, we focus on quantitative traits and relatively common SNPs, adapt the KNIFE procedure to genetic association studies, and propose an approach to identify driver SNPs after the application of SKAT to gene set analysis. Our approach accommodates several kernels that are widely used in SNP analysis, such as the linear kernel and the Identity by State (IBS) kernel. The proposed approach provides practically useful utilities to prioritize SNPs, and fills the gap between SNP set analysis and biological functional studies. Both simulation studies and real data application are used to demonstrate the proposed approach.
Affiliation(s)
- Qianchuan He
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
| | - Tianxi Cai
- Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, United States of America
| | - Yang Liu
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
| | - Ni Zhao
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
| | - Quaker E Harmon
- Epidemiology Branch, NIEHS, Research Triangle Park, North Carolina, United States of America
| | - Lynn M Almli
- Department of Psychiatry and Behavioral Sciences, Emory University School of Medicine, Atlanta, Georgia, United States of America
| | - Elisabeth B Binder
- Department of Translational Research in Psychiatry, Max-Planck Institute of Psychiatry, Munich, Germany
| | - Stephanie M Engel
- Department of Epidemiology, University of North Carolina, Chapel Hill, North Carolina, United States of America
| | - Kerry J Ressler
- Division of Depression & Anxiety Disorders, McLean Hospital, Belmont, Massachusetts, United States of America
| | - Karen N Conneely
- Department of Human Genetics, Emory University School of Medicine, Atlanta, Georgia, United States of America
| | - Xihong Lin
- Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, United States of America
| | - Michael C Wu
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
| |
|
30
|
Fan J, Feng Y, Jiang J, Tong X. Feature Augmentation via Nonparametrics and Selection (FANS) in High-Dimensional Classification. J Am Stat Assoc 2016; 111:275-287. [PMID: 27185970 DOI: 10.1080/01621459.2015.1005212] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/24/2022]
Abstract
We propose a high dimensional classification method that involves nonparametric feature augmentation. Knowing that marginal density ratios are the most powerful univariate classifiers, we use the ratio estimates to transform the original feature measurements. Subsequently, penalized logistic regression is invoked, taking as input the newly transformed or augmented features. This procedure trains models equipped with local complexity and global simplicity, thereby avoiding the curse of dimensionality while creating a flexible nonlinear decision boundary. The resulting method is called Feature Augmentation via Nonparametrics and Selection (FANS). We motivate FANS by generalizing the Naive Bayes model, writing the log ratio of joint densities as a linear combination of those of marginal densities. It is related to generalized additive models, but has better interpretability and computability. Risk bounds are developed for FANS. In numerical analysis, FANS is compared with competing methods, so as to provide a guideline on its best application domain. Real data analysis demonstrates that FANS performs very competitively on benchmark email spam and gene expression data sets. Moreover, FANS is implemented by an extremely fast algorithm through parallel computing.
Affiliation(s)
- Jianqing Fan
- Jianqing Fan is Frederick L. Moore Professor of Finance, Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ, 08544
| | - Yang Feng
- Yang Feng is Assistant Professor, Department of Statistics, Columbia University, New York, NY, 10027
| | - Jiancheng Jiang
- Jiancheng Jiang is Associate Professor, Department of Mathematics and Statistics, University of North Carolina at Charlotte, Charlotte, NC, 28223
| | - Xin Tong
- Xin Tong is Assistant Professor, Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA, 90089
| |
|
31
|
|
32
|
|
33
|
Nguyen T, Khosravi A, Creighton D, Nahavandi S. A novel aggregate gene selection method for microarray data classification. Pattern Recognit Lett 2015. [DOI: 10.1016/j.patrec.2015.03.018] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
|
34
|
Zheng Z, Huang X, Chen Z, He X, Liu H, Yang J. Regression analysis of locality preserving projections via sparse penalty. Inf Sci (N Y) 2015. [DOI: 10.1016/j.ins.2015.01.004] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
|
35
|
|
36
|
Kolar M, Liu H. Optimal Feature Selection in High-Dimensional Discriminant Analysis. IEEE TRANSACTIONS ON INFORMATION THEORY 2015; 61:1063-1083. [PMID: 25620807 PMCID: PMC4302965 DOI: 10.1109/tit.2014.2381241] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/04/2023]
Abstract
We consider the high-dimensional discriminant analysis problem. For this problem, different methods have been proposed and justified by establishing exact convergence rates for the classification risk, as well as the ℓ2 convergence results to the discriminative rule. However, sharp theoretical analysis for the variable selection performance of these procedures has not been established, even though model interpretation is of fundamental importance in scientific data analysis. This paper bridges the gap by providing sharp sufficient conditions for consistent variable selection using the sparse discriminant analysis (Mai et al., 2012). Through careful analysis, we establish rates of convergence that are significantly faster than the best known results and admit an optimal scaling of the sample size n, dimensionality p, and sparsity level s in the high-dimensional setting. Sufficient conditions are complemented by the necessary information theoretic limits on the variable selection problem in the context of high-dimensional discriminant analysis. Exploiting a numerical equivalence result, our method also establishes the optimal results for the ROAD estimator (Fan et al., 2012) and the sparse optimal scaling estimator (Clemmensen et al., 2011). Furthermore, we analyze an exhaustive search procedure, whose performance serves as a benchmark, and show that it is variable selection consistent under weaker conditions. Extensive simulations demonstrating the sharpness of the bounds are also provided.
Affiliation(s)
- Mladen Kolar
- Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15217, USA
| | - Han Liu
- Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; Research supported by NSF Grant IIS–1116730
| |
|
37
|
Gaynanova I, Kolar M. Optimal variable selection in multi-group sparse discriminant analysis. Electron J Stat 2015. [DOI: 10.1214/15-ejs1064] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
38
|
Hino H, Fujiki J. ADHERENTLY PENALIZED LINEAR DISCRIMINANT ANALYSIS. JOURNAL JAPANESE SOCIETY OF COMPUTATIONAL STATISTICS 2015. [DOI: 10.5183/jjscs.1412001_219] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Affiliation(s)
- Hideitsu Hino
- Graduate School of Systems and Information Engineering, University of Tsukuba
| | - Jun Fujiki
- Department of Applied Mathematics, Fukuoka University
| |
|
39
|
Zhang Y, Huo L, Lin L, Zeng Y. The Dantzig Discriminant Analysis with High Dimensional Data. COMMUN STAT-THEOR M 2014. [DOI: 10.1080/03610926.2013.878359] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]
|
40
|
Hao N, Dong B, Fan J. Sparsifying the Fisher Linear Discriminant by Rotation. J R Stat Soc Series B Stat Methodol 2014; 77:827-851. [PMID: 26512210 DOI: 10.1111/rssb.12092] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
Many high dimensional classification techniques have been proposed in the literature based on sparse linear discriminant analysis (LDA). To efficiently use them, sparsity of linear classifiers is a prerequisite. However, this might not be readily available in many applications, and rotations of data are required to create the needed sparsity. In this paper, we propose a family of rotations to create the required sparsity. The basic idea is to use the principal components of the sample covariance matrix of the pooled samples and its variants to rotate the data first and to then apply an existing high dimensional classifier. This rotate-and-solve procedure can be combined with any existing classifiers, and is robust against the sparsity level of the true model. We show that these rotations do create the sparsity needed for high dimensional classifications and provide theoretical understanding why such a rotation works empirically. The effectiveness of the proposed method is demonstrated by a number of simulated and real data examples, and the improvements of our method over some popular high dimensional classification rules are clearly shown.
Affiliation(s)
- Ning Hao
- University of Arizona
| | - Bin Dong
- University of Arizona
| | - Jianqing Fan
- Princeton University
| |
|
41
|
Huang H, Huang Y. Improved discriminant sparsity neighborhood preserving embedding for hyperspectral image classification. Neurocomputing 2014. [DOI: 10.1016/j.neucom.2014.01.010] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
|
42
|
Zhan X, Epstein MP, Ghosh D. An Adaptive Genetic Association Test Using Double Kernel Machines. STATISTICS IN BIOSCIENCES 2014; 7:262-281. [PMID: 26640602 DOI: 10.1007/s12561-014-9116-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]
Abstract
Recently, gene set-based approaches have become very popular in gene expression profiling studies for assessing how genetic variants are related to disease outcomes. Since most genes are not differentially expressed, existing pathway tests considering all genes within a pathway suffer from considerable noise and power loss. Moreover, for a differentially expressed pathway, it is of interest to select important genes that drive the effect of the pathway. In this article, we propose an adaptive association test using double kernel machines (DKM), which can both select important genes within the pathway as well as test for the overall genetic pathway effect. This DKM procedure first uses the garrote kernel machines (GKM) test for the purposes of subset selection and then the least squares kernel machine (LSKM) test for testing the effect of the subset of genes. An appealing feature of the kernel machine framework is that it can provide a flexible and unified method for multi-dimensional modeling of the genetic pathway effect allowing for both parametric and nonparametric components. This DKM approach is illustrated with application to simulated data as well as to data from a neuroimaging genetics study.
Affiliation(s)
- Xiang Zhan
- Department of Statistics, Pennsylvania State University, University Park, PA 16802, U.S.A. Tel.: +1-8143213493
| | - Michael P Epstein
- Department of Human Genetics, Emory University, Atlanta, GA 30322, U.S.A
| | - Debashis Ghosh
- Department of Statistics, Department of Public Health Sciences, Pennsylvania State University, University Park, PA 16802, U.S.A
| |
|
43
|
Pepe D, Grassi M. Investigating perturbed pathway modules from gene expression data via structural equation models. BMC Bioinformatics 2014; 15:132. [PMID: 24885496 PMCID: PMC4052286 DOI: 10.1186/1471-2105-15-132] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2013] [Accepted: 04/25/2014] [Indexed: 01/18/2023] Open
Abstract
Background It is currently accepted that the perturbation of complex intracellular networks, rather than the dysregulation of a single gene, is the basis for phenotypical diversity. High-throughput gene expression data allow the investigation of changes in gene expression profiles among different conditions. Recently, many efforts have been made to identify which biological pathways are perturbed, given a list of differentially expressed genes (DEGs). In order to understand these mechanisms, it is necessary to unveil the variation of genes in relation to each other, considering the different phenotypes. In this paper, we illustrate a pipeline, based on Structural Equation Modeling (SEM), that allows the investigation of pathway modules, considering not only deregulated genes but also the connections between the perturbed ones. Results The procedure was tested on microarray experiments relative to two neurological diseases: frontotemporal lobar degeneration with ubiquitinated inclusions (FTLD-U) and multiple sclerosis (MS). Starting from DEGs and dysregulated biological pathways, a model for each pathway was generated using information from biological databases, in order to design how DEGs were connected in a causal structure. Subsequently, SEM analysis tested whether pathways differ globally, between groups, and for specific path relationships. The results confirmed the importance of certain genes in the analyzed diseases, and unveiled which connections are modified among them. Conclusions We propose a framework to perform differential gene expression analysis on microarray data based on SEM, which is able to: 1) find relevant genes and perturbed biological pathways, investigating putative sub-pathway models based on the concept of disease module; 2) test and improve the generated models; 3) detect a differential expression level of one gene, and differential connection between two genes. 
This could shed light, not only on the mechanisms affecting variations in gene expression, but also on the causes of gene-gene relationship modifications in diseased phenotypes.
Affiliation(s)
- Daniele Pepe
- Department of Brain and Behavioural Sciences, Medical and Genomic Statistics Unit, University of Pavia, Pavia, Italy.
| | | |
|
44
|
Cai H, Ruan P, Ng M, Akutsu T. Feature weight estimation for gene selection: a local hyperlinear learning approach. BMC Bioinformatics 2014; 15:70. [PMID: 24625071 PMCID: PMC4007530 DOI: 10.1186/1471-2105-15-70] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2013] [Accepted: 03/06/2014] [Indexed: 11/10/2022] Open
Abstract
Background Modeling high-dimensional data involving thousands of variables is particularly important for gene expression profiling experiments; nevertheless, it remains a challenging task. One of the challenges is to implement an effective method for selecting a small set of relevant genes, buried in high-dimensional irrelevant noises. RELIEF is a popular and widely used approach for feature selection owing to its low computational cost and high accuracy. However, RELIEF based methods suffer from instability, especially in the presence of noisy and/or high-dimensional outliers. Results We propose an innovative feature weighting algorithm, called LHR, to select informative genes from highly noisy data. LHR is based on RELIEF for feature weighting using classical margin maximization. The key idea of LHR is to estimate the feature weights through local approximation rather than global measurement, which is typically used in existing methods. The weights obtained by our method are very robust in terms of degradation of noisy features, even those with vast dimensions. To demonstrate the performance of our method, extensive experiments involving classification tests have been carried out on both synthetic and real microarray benchmark datasets by combining the proposed technique with standard classifiers, including the support vector machine (SVM), k-nearest neighbor (KNN), hyperplane k-nearest neighbor (HKNN), linear discriminant analysis (LDA) and naive Bayes (NB). Conclusion Experiments on both synthetic and real-world datasets demonstrate the superior performance of the proposed feature selection method combined with supervised learning in three aspects: 1) high classification accuracy, 2) excellent robustness to noise and 3) good stability across various classification algorithms.
Affiliation(s)
- Hongmin Cai
- School of Computer Science and Engineering, South China University of Technology, Guangdong, China.
| | | | | | | |
|
45
|
Chi YY, Gribbin MJ, Johnson JL, Muller KE. Power calculation for overall hypothesis testing with high-dimensional commensurate outcomes. Stat Med 2014; 33:812-27. [PMID: 24122945 PMCID: PMC4072336 DOI: 10.1002/sim.5986] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2012] [Revised: 08/19/2013] [Accepted: 08/21/2013] [Indexed: 11/07/2022]
Abstract
The complexity of system biology means that any metabolic, genetic, or proteomic pathway typically includes so many components (e.g., molecules) that statistical methods specialized for overall testing of high-dimensional and commensurate outcomes are required. While many overall tests have been proposed, very few have power and sample size methods. We develop accurate power and sample size methods and software to facilitate study planning for high-dimensional pathway analysis. With an account of any complex correlation structure between high-dimensional outcomes, the new methods allow power calculation even when the sample size is less than the number of variables. We derive the exact (finite-sample) and approximate non-null distributions of the 'univariate' approach to repeated measures test statistic, as well as power-equivalent scenarios useful to generalize our numerical evaluations. Extensive simulations of group comparisons support the accuracy of the approximations even when the ratio of number of variables to sample size is large. We derive a minimum set of constants and parameters sufficient and practical for power calculation. Using the new methods and specifying the minimum set to determine power for a study of metabolic consequences of vitamin B6 deficiency helps illustrate the practical value of the new results. Free software implementing the power and sample size methods applies to a wide range of designs, including one group pre-intervention and post-intervention comparisons, multiple parallel group comparisons with one-way or factorial designs, and the adjustment and evaluation of covariate effects.
Affiliation(s)
- Yueh-Yun Chi
- Department of Biostatistics, University of Florida, Gainesville, FL, U.S.A
| | | | | | | |
|
46
|
An J, Pan Y, Yan Z, Li W, Cui J, Yuan J, Tian L, Xing R, Lu Y. MiR-23a in amplified 19p13.13 loci targets metallothionein 2A and promotes growth in gastric cancer cells. J Cell Biochem 2013; 114:2160-9. [PMID: 23553990 DOI: 10.1002/jcb.24565] [Citation(s) in RCA: 47] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2012] [Accepted: 03/28/2013] [Indexed: 12/19/2022]
Abstract
Copy number variation (CNV) and abnormal expression of microRNAs (miRNAs) always lead to deregulation of genes in cancer, including gastric cancer (GC). However, little is known about how CNVs affect the expression of miRNAs. By integrating CNV and miRNA profiles in the same samples, we identified eight miRNAs (miR-1274a, miR-196b, miR-4298, miR-181c, miR-181d, miR-23a, miR-27a and miR-24-2) that were located in the amplified regions and were upregulated in GC. In particular, amplification of the miR-23a-27a-24-2 cluster and the miR-181c-181d cluster frequently occurred at 19p13.13 and was confirmed by genomic real-time PCR in another 25 paired GC samples. Moreover, in situ hybridization (ISH) experiments showed that mature miR-23a was increased in GCs (75.5%, 40/53) compared with matched normal tissues (28.6%, 14/49, P = 0.001). Knocking down miR-23a expression inhibited BGC823 cell growth in vitro and in vivo. In addition, the potential target genes of miR-23a were investigated by integration of mRNA profile and miRNA TargetScan predictions; we found that upregulation of miR-23a and downregulation of metallothionein 2A (MT2A) were detected simultaneously in 70% (7/10) of the miRNA and mRNA profiles. Furthermore, an inverse correlation between miR-23a and MT2A expression was detected in GCs and normal tissues. Using a luciferase assay, we confirmed that MT2A is a potential target of miR-23a. In conclusion, these results suggest that integration of CNV-miRNA-mRNA profiling is a powerful tool for identifying molecular signatures, and that miR-23a might play a role in regulating MT2A expression in GC.
Affiliation(s)
- Juan An
- Laboratory of Molecular Oncology, Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education), Peking University Cancer Hospital/Institute, Beijing 100142, P.R. China
|
47
|
Wang C, Cao L, Miao B. Optimal feature selection for sparse linear discriminant analysis and its applications in gene expression data. Comput Stat Data Anal 2013. [DOI: 10.1016/j.csda.2013.04.003] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 11/26/2022]
|
48
|
Mai Q, Zou H. A Note On the Connection and Equivalence of Three Sparse Linear Discriminant Analysis Methods. Technometrics 2013. [DOI: 10.1080/00401706.2012.746208] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 10/27/2022]
|
49
|
|
50
|
Yu T, Bai Y. Analyzing LC/MS metabolic profiling data in the context of existing metabolic networks. Curr Metabolomics 2012; 1:83-91. [PMID: 24010053 DOI: 10.2174/2213235x11301010084] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 11/22/2022]
Abstract
Metabolic profiling is the unbiased detection and quantification of low-molecular-weight metabolites in a living system. It is rapidly developing in biological and translational research, contributing to disease mechanism elucidation, environmental chemical surveillance, biomarker detection, and health outcome prediction. Recent developments in experimental and computational technology allow more and more known metabolites to be detected and quantified from complex samples. As the coverage of the metabolic network improves, it has become feasible to examine metabolic profiling data from a systems perspective, i.e. interpreting the data and performing statistical inference in the context of pathways and genome-scale metabolic networks. A number of methods have recently been developed in this area, but much improvement in algorithms and databases is still needed. In this review, we survey some methods for the analysis of metabolic profiling data based on metabolic networks.
Affiliation(s)
- Tianwei Yu
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA
|