1
Matsui Y, Togayachi A, Sakamoto K, Angata K, Kadomatsu K, Nishihara S. Integrated Systems Analysis Deciphers Transcriptome and Glycoproteome Links in Alzheimer's Disease. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.12.25.573290. [PMID: 38234803 PMCID: PMC10793412 DOI: 10.1101/2023.12.25.573290] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2024]
Abstract
Glycosylation is increasingly recognized as a potential therapeutic target in Alzheimer's disease. In recent years, evidence of Alzheimer's disease-specific glycoproteins has been established. However, the mechanisms underlying their dysregulation, including tissue- and cell-type specificity, are not fully understood. We aimed to explore the upstream regulators of aberrant glycosylation by integrating multiple data sources using a glycogenomics approach. We identified dysregulation of the glycosyltransferase PLOD3 in oligodendrocytes as an upstream regulator of cerebral vessels and found that it is involved in COL4A5 synthesis, which is strongly correlated with amyloid fiber formation. Furthermore, COL4A5 has been suggested to interact with astrocytes via extracellular matrix receptors as a ligand. This study suggests directions for new therapeutic strategies for Alzheimer's disease targeting glycosyltransferases.
Affiliation(s)
- Yusuke Matsui
- Institute for Glyco-core Research (iGCORE), Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8601, Japan
- Biomedical and Health Informatics Unit, Department of Integrated Health Science, Nagoya University Graduate School of Medicine, Daiko-minami, Higashi-ku, Nagoya, 461-8673, Japan
- Akira Togayachi
- Glycan and Life Systems Integration Center (GaLSIC), Soka University, 1-236 Tangi-machi, Hachioji, Tokyo 192-8577, Japan
- Kazuma Sakamoto
- Institute for Glyco-core Research (iGCORE), Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8601, Japan
- Department of Biochemistry, Nagoya University Graduate School of Medicine, Tsurumai-cho, Showa-ku, Nagoya, 466-8550, Japan
- Kiyohiko Angata
- Glycan and Life Systems Integration Center (GaLSIC), Soka University, 1-236 Tangi-machi, Hachioji, Tokyo 192-8577, Japan
- Kenji Kadomatsu
- Institute for Glyco-core Research (iGCORE), Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8601, Japan
- Department of Biochemistry, Nagoya University Graduate School of Medicine, Tsurumai-cho, Showa-ku, Nagoya, 466-8550, Japan
- Shoko Nishihara
- Glycan and Life Systems Integration Center (GaLSIC), Soka University, 1-236 Tangi-machi, Hachioji, Tokyo 192-8577, Japan
2
Treaster S, Deelen J, Daane JM, Murabito J, Karasik D, Harris MP. Convergent genomics of longevity in rockfishes highlights the genetics of human life span variation. SCIENCE ADVANCES 2023; 9:eadd2743. [PMID: 36630509 PMCID: PMC9833670 DOI: 10.1126/sciadv.add2743] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/02/2022] [Accepted: 12/09/2022] [Indexed: 05/16/2023]
Abstract
Longevity is a defining, heritable trait that varies dramatically between species. To resolve the genetic regulation of this trait, we have mined genomic variation in rockfishes, which range in longevity from 11 to over 205 years. Multiple shifts in rockfish longevity have occurred independently and in a short evolutionary time frame, thus empowering convergence analyses. Our analyses reveal a common network of genes under convergent evolution, encompassing established aging regulators such as insulin signaling, yet also identify flavonoid (aryl-hydrocarbon) metabolism as a pathway modulating longevity. The selective pressures on these pathways indicate the ancestral state of rockfishes was long lived and that the changes in short-lived lineages are adaptive. These pathways were also used to explore genome-wide association studies of human longevity, identifying the aryl-hydrocarbon metabolism pathway to be significantly associated with human survival to the 99th percentile. This evolutionary intersection defines and cross-validates a previously unappreciated genetic architecture that associates with the evolution of longevity across vertebrates.
Affiliation(s)
- Stephen Treaster
- Department of Orthopaedic Surgery, Boston Children’s Hospital, Boston, MA, USA
- Department of Genetics, Harvard Medical School, Boston, MA, USA
- Joris Deelen
- Max Planck Institute for Biology of Ageing, Joseph-Stelzmann-Str. 9b, D-50931 Köln, Germany
- Molecular Epidemiology, Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, Netherlands
- Cologne Excellence Cluster on Cellular Stress Responses in Aging-Associated Diseases (CECAD), University of Cologne, Cologne, Germany
- Jacob M. Daane
- Department of Biology and Biochemistry, University of Houston, Houston TX, USA
- Joanne Murabito
- Section of General Internal Medicine, Department of Medicine, Boston University School of Medicine, Boston, MA, USA
- Framingham Heart Study, Framingham, MA, USA
- David Karasik
- Azrieli Faculty of Medicine, Bar-Ilan University, Safed, Israel
- Marcus Institute for Aging Research, Hebrew Senior Life, Boston, MA, USA
- Matthew P. Harris
- Department of Orthopaedic Surgery, Boston Children’s Hospital, Boston, MA, USA
- Department of Genetics, Harvard Medical School, Boston, MA, USA
3
Treaster S, Daane JM, Harris MP. Refining Convergent Rate Analysis with Topology in Mammalian Longevity and Marine Transitions. Mol Biol Evol 2021; 38:5190-5203. [PMID: 34324001 PMCID: PMC8557430 DOI: 10.1093/molbev/msab226] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/13/2022]
Abstract
The quest to map the genetic foundations of phenotypes has been empowered by the modern diversity, quality, and availability of genomic resources. Despite these expanding resources, the abundance of variation within lineages makes it challenging to associate genetic change with specific phenotypes without an a priori means of isolating the changes from background genomic variation. Evolution provides this means through convergence, i.e., the shared variation that may result from replicate evolutionary experiments across independent trait occurrences. To leverage these opportunities, we developed TRACCER: Topologically Ranked Analysis of Convergence via Comparative Evolutionary Rates. Compared to current methods, this software empowers rate convergence analysis by factoring in topological relationships, because genetic variation between phylogenetically proximate trait changes is more likely to be facilitating the trait. Comparisons are performed not with singular branches, but with the complete paths to the most recent common ancestor for each pair of lineages. This ensures that comparisons represent a single context diverging over the same timeframe while obviating the problematic requirement of assigning ancestral states. We applied TRACCER to two case studies: mammalian transitions to marine environments, an unambiguous collection of traits which have independently evolved three times; and the evolution of mammalian longevity, a less delineated trait but with more instances to compare. By factoring in topology, TRACCER identifies highly significant, convergent genetic signals, with important incongruities and statistical resolution when compared to existing approaches. These improvements in sensitivity and specificity of convergence analysis generate refined targets for downstream validation and identification of genotype-phenotype relationships.
Affiliation(s)
- Stephen Treaster
- Department of Orthopaedic Research, Boston Children's Hospital, Boston, MA, 02124, USA
- Department of Genetics, Harvard Medical School, Boston, MA, 02124, USA
- Jacob M Daane
- Department of Orthopaedic Research, Boston Children's Hospital, Boston, MA, 02124, USA
- Department of Genetics, Harvard Medical School, Boston, MA, 02124, USA
- Department of Marine and Environmental Sciences, Northeastern University Marine Science Center, Nahant, MA, 01908, USA
- Matthew P Harris
- Department of Orthopaedic Research, Boston Children's Hospital, Boston, MA, 02124, USA
- Department of Genetics, Harvard Medical School, Boston, MA, 02124, USA
4
Deng Y, Wu S, Fan H. Genome-wide pathway-based quantitative multiple phenotypes analysis. PLoS One 2020; 15:e0240910. [PMID: 33175855 PMCID: PMC7657528 DOI: 10.1371/journal.pone.0240910] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2020] [Accepted: 10/06/2020] [Indexed: 11/18/2022]
Abstract
For complex diseases, genome-wide pathway association studies have become increasingly promising. Currently, however, pathway-based association analysis mainly focuses on a single phenotype, which may be insufficient to describe complex diseases and physiological processes. This work proposes a combination model to evaluate the association between a pathway and multiple phenotypes and to reduce the run time based on asymptotic results. For a single phenotype, we propose a semi-supervised maximum kernel-based U-statistics (mSKU) method to assess the pathway-based association. For multiple phenotypes, we propose the Fisher combination function with dependent phenotypes (FC) to transform the p-values between the pathway and each marginal phenotype individually to achieve pathway-based multiple phenotypes analysis. With real data from the Alzheimer Disease Neuroimaging Initiative (ADNI) study and Human Liver Cohort (HLC) study, the FC-mSKU method allows us to specify which pathways are specific to a single phenotype or contribute to common genetic constructions of multiple phenotypes. If we only focus on single-phenotype tests, we may miss some findings for etiology studies. Through extensive simulation studies, the FC-mSKU method demonstrates its advantages compared with its counterparts.
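As a rough illustration of the p-value combination step mentioned in this abstract, the sketch below implements the classical Fisher combination in Python. It assumes independent p-values, which is a simplification: the FC method in the abstract is stated to handle dependent phenotypes, and that adjustment is not reproduced here. The function name `fisher_combine` is hypothetical.

```python
import math


def fisher_combine(p_values):
    """Classical Fisher combination of k p-values (assumes independence).

    Under the null, X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom. This is NOT the dependent-phenotype FC
    variant from the abstract; it is the textbook independent case.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # this is X / 2
    # Closed-form chi-square survival function for even degrees of freedom 2k:
    # P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half_x / j
        total += term
    return math.exp(-half_x) * total
```

For example, `fisher_combine([0.05, 0.05])` gives a combined p-value of about 0.0175, smaller than either input because two moderately small p-values reinforce each other.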
Affiliation(s)
- Yamin Deng
- Statistics Center, First Hospital of Shanxi Medical University, Taiyuan, China
- Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Shiman Wu
- Statistics Center, First Hospital of Shanxi Medical University, Taiyuan, China
- Huifang Fan
- Statistics Center, First Hospital of Shanxi Medical University, Taiyuan, China
5
Odom GJ, Ban Y, Colaprico A, Liu L, Silva TC, Sun X, Pico AR, Zhang B, Wang L, Chen X. PathwayPCA: an R/Bioconductor Package for Pathway Based Integrative Analysis of Multi-Omics Data. Proteomics 2020; 20:e1900409. [PMID: 32430990 PMCID: PMC7677175 DOI: 10.1002/pmic.201900409] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 01/31/2020] [Revised: 05/01/2020] [Indexed: 01/01/2023]
Abstract
The authors present pathwayPCA, an R/Bioconductor package for integrative pathway analysis that utilizes modern statistical methodology, including supervised and adaptive, elastic-net, sparse principal component analysis. pathwayPCA can be applied to continuous, binary, and survival outcomes in studies with multiple covariates and/or interaction effects. It outperforms several alternative methods at identifying disease-associated pathways in integrative analysis using both simulated and real datasets. In addition, several case studies are provided to illustrate pathwayPCA analysis with gene selection, estimating, and visualizing sample-specific pathway activities, identifying sex-specific pathway effects in kidney cancer, and building integrative models for predicting patient prognosis. pathwayPCA is an open-source R package, freely available through the Bioconductor repository. pathwayPCA is expected to be a useful tool for empowering the wider scientific community to analyze and interpret the wealth of available proteomics data, along with other types of molecular data recently made available by Clinical Proteomic Tumor Analysis Consortium and other large consortiums.
Affiliation(s)
- Gabriel J. Odom
- Department of Biostatistics, Florida International University, Stempel College of Public Health, Miami, FL 33199, USA
- Division of Biostatistics, Department of Public Health Sciences, University of Miami, Miller School of Medicine, Miami, FL 33136, USA
- Yuguang Ban
- Sylvester Comprehensive Cancer Center, University of Miami, Miller School of Medicine, Miami, FL 33136, USA
- Antonio Colaprico
- Division of Biostatistics, Department of Public Health Sciences, University of Miami, Miller School of Medicine, Miami, FL 33136, USA
- Lizhong Liu
- Division of Biostatistics, Department of Public Health Sciences, University of Miami, Miller School of Medicine, Miami, FL 33136, USA
- Tiago Chedraoui Silva
- Division of Biostatistics, Department of Public Health Sciences, University of Miami, Miller School of Medicine, Miami, FL 33136, USA
- Xiaodian Sun
- Sylvester Comprehensive Cancer Center, University of Miami, Miller School of Medicine, Miami, FL 33136, USA
- Alexander R. Pico
- Institute for Data Science and Biotechnology, Gladstone Institutes, San Francisco, CA 94158, USA
- Bing Zhang
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston TX 77030, USA
- Lily Wang
- Division of Biostatistics, Department of Public Health Sciences, University of Miami, Miller School of Medicine, Miami, FL 33136, USA
- Sylvester Comprehensive Cancer Center, University of Miami, Miller School of Medicine, Miami, FL 33136, USA
- Dr. John T Macdonald Foundation Department of Human Genetics, University of Miami, Miller School of Medicine, Miami, FL 33136, USA
- John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL 33136, USA
- Xi Chen
- Division of Biostatistics, Department of Public Health Sciences, University of Miami, Miller School of Medicine, Miami, FL 33136, USA
- Sylvester Comprehensive Cancer Center, University of Miami, Miller School of Medicine, Miami, FL 33136, USA
6
Ghulam A, Lei X, Guo M, Bian C. A Review of Pathway Databases and Related Methods Analysis. Curr Bioinform 2020. [DOI: 10.2174/1574893614666191018162505] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
Pathway analysis integrates most of the computational tools for the investigation of high-level and complex human diseases. In the field of bioinformatics research, biological pathways analysis is an important part of systems biology. The molecular complexities of biological pathways are difficult to understand in human diseases, which can be explored through pathway analysis. In this review, we describe essential information related to pathway databases and their mechanisms, algorithms and methods. In the pathway database analysis, we present a brief introduction on how to gain knowledge from fundamental pathway data in regard to specific human pathways and how to use pathway databases and pathway analysis to predict diseases during an experiment. We also provide detailed information related to computational tools that are used in complex pathway data analysis, the roles of these tools in the bioinformatics field and how to store the pathway data. We illustrate various methodological difficulties that are faced during pathway analysis. The main ideas and techniques for the pathway-based examination approaches are presented. We provide the list of pathway databases and analytical tools. This review will serve as a helpful manual for pathway analysis databases.
Affiliation(s)
- Ali Ghulam
- School of Computer Science, Shaanxi Normal University, Xian, China
- Xiujuan Lei
- School of Computer Science, Shaanxi Normal University, Xian, China
- Min Guo
- School of Computer Science, Shaanxi Normal University, Xian, China
- Chen Bian
- School of Computer Science, Shaanxi Normal University, Xian, China
7
Fifteen Years of Gene Set Analysis for High-Throughput Genomic Data: A Review of Statistical Approaches and Future Challenges. ENTROPY 2020; 22:e22040427. [PMID: 33286201 PMCID: PMC7516904 DOI: 10.3390/e22040427] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 02/24/2020] [Revised: 03/18/2020] [Accepted: 04/03/2020] [Indexed: 12/22/2022]
Abstract
Over the last decade, gene set analysis has become the first choice for gaining insights into underlying complex biology of diseases through gene expression and gene association studies. It also reduces the complexity of statistical analysis and enhances the explanatory power of the obtained results. Although gene set analysis approaches are extensively used in gene expression and genome wide association data analysis, the statistical structure and steps common to these approaches have not yet been comprehensively discussed, which limits their utility. In this article, we provide a comprehensive overview, statistical structure and steps of gene set analysis approaches used for microarrays, RNA-sequencing and genome wide association data analysis. Further, we also classify the gene set analysis approaches and tools by the type of genomic study, null hypothesis, sampling model and nature of the test statistic, etc. Rather than reviewing the gene set analysis approaches individually, we provide the generation-wise evolution of such approaches for microarrays, RNA-sequencing and genome wide association studies and discuss their relative merits and limitations. Here, we identify the key biological and statistical challenges in current gene set analysis, which will be addressed by statisticians and biologists collectively in order to develop the next generation of gene set analysis approaches. Further, this study will serve as a catalog and provide guidelines to genome researchers and experimental biologists for choosing the proper gene set analysis approach based on several factors.
8
Chimusa ER, Dalvie S, Dandara C, Wonkam A, Mazandu GK. Post genome-wide association analysis: dissecting computational pathway/network-based approaches. Brief Bioinform 2020; 20:690-700. [PMID: 29701762 DOI: 10.1093/bib/bby035] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 01/02/2018] [Revised: 04/04/2018] [Indexed: 02/02/2023]
Abstract
Thousands of genetic associations with diseases have been identified by genome-wide association studies (GWASs), which are conceptually a single-marker-based approach. There are potentially many uses of these identified variants, including a better understanding of the pathogenesis of diseases, new leads for studying underlying risk prediction and clinical prediction of treatment. However, because of inadequate power, GWAS might miss disease genes and/or pathways with weak genetic or strong epistatic effects. Driven by the need to extract useful information from GWAS summary statistics, post-GWAS approaches (PGAs) were introduced. Here, we dissect and discuss advances made in pathway/network-based PGAs, with a particular focus on protein-protein interaction networks that leverage GWAS summary statistics by combining effects of multiple loci, subnetworks or pathways to detect genetic signals associated with complex diseases. We conclude with a discussion of research areas where further work on summary statistic-based methods is needed.
Affiliation(s)
- Emile R Chimusa
- Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Level 3, Wernher and Beit North, Private Bag, Rondebosch, 7700, Anzio road, Observatory Cape Town, South Africa
- Shareefa Dalvie
- Department of Psychiatry and Mental Health, University of Cape Town, Observatory, 7925, Cape Town, South Africa
- Collet Dandara
- Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Private Bag, Rondebosch, 7700, Cape Town, South Africa
- Ambroise Wonkam
- Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Private Bag, Rondebosch, 7700, Cape Town, South Africa
- Gaston K Mazandu
- Division of Human Genetics, Department of Pathology, Institute of Infectious Disease and Molecular Medicine, Faculty of Health Sciences, University of Cape Town, Private Bag, Rondebosch, 7700, Cape Town, South Africa; African Institute for Mathematical Sciences, 7945 Muizenberg, Cape Town, South Africa and Computational Biology Division, Department of Integrative Biomedical Sciences, Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Medical School, Anzio Road, Observatory, 7925, Cape Town, South Africa
9
Mora A. Gene set analysis methods for the functional interpretation of non-mRNA data—Genomic range and ncRNA data. Brief Bioinform 2019; 21:1495-1508. [DOI: 10.1093/bib/bbz090] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 03/20/2019] [Revised: 05/30/2019] [Accepted: 06/28/2019] [Indexed: 12/31/2022]
Abstract
Gene set analysis (GSA) is one of the methods of choice for analyzing the results of current omics studies; however, it has been mainly developed to analyze mRNA (microarray, RNA-Seq) data. The following review includes an update regarding general methods and resources for GSA and then emphasizes GSA methods and tools for non-mRNA omics datasets, specifically genomic range data (ChIP-Seq, SNP and methylation) and ncRNA data (miRNAs, lncRNAs and others). In the end, the state of the GSA field for non-mRNA datasets is discussed, and some current challenges and trends are highlighted, especially the use of network approaches to face complexity issues.
Affiliation(s)
- Antonio Mora
- Joint School of Life Sciences, Guangzhou Medical University and Guangzhou Institutes of Biomedicine and Health - Chinese Academy of Sciences
10
Hüls A, Ickstadt K, Schikowski T, Krämer U. Detection of gene-environment interactions in the presence of linkage disequilibrium and noise by using genetic risk scores with internal weights from elastic net regression. BMC Genet 2017; 18:55. [PMID: 28606108 PMCID: PMC5469185 DOI: 10.1186/s12863-017-0519-1] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 01/20/2017] [Accepted: 05/23/2017] [Indexed: 12/17/2022]
Abstract
BACKGROUND Analyses of gene-environment (GxE) interactions commonly use single nucleotide polymorphisms (SNPs) to characterize genetic susceptibility, an approach that mostly lacks power and has poor reproducibility. One promising approach to overcome this problem might be the use of weighted genetic risk scores (GRS), which are defined as weighted sums of risk alleles of gene variants. The gold standard is to use external weights from published meta-analyses.
METHODS In this study, we used internal weights from the marginal genetic effects of the SNPs estimated by a multivariate elastic net regression and thereby provided a method that can be used if there are no external weights available. We conducted a simulation study for the detection of GxE interactions and compared power and type I error of single-SNP analyses with Bonferroni correction and corresponding analysis with unweighted and our weighted GRS approach in scenarios with six risk SNPs and an increasing number of highly correlated (up to 210) and noise SNPs (up to 840).
RESULTS Applying weighted GRS increased the power enormously in comparison to the common single-SNP approach (e.g. 94.2% vs. 35.4%, respectively, to detect a weak interaction with an OR ≈ 1.04 for six uncorrelated risk SNPs and n = 700 with a well-controlled type I error). Furthermore, weighted GRS outperformed the unweighted GRS, in particular in the presence of SNPs without any effect on the phenotype (e.g. 90.1% vs. 43.9%, respectively, when 20 noise SNPs were added to the six risk SNPs). This advantage of the weighted GRS was confirmed in a real data application on lung inflammation in the SALIA cohort (n = 402). However, in scenarios with a high number of noise SNPs (>200 vs. 6 risk SNPs), larger sample sizes are needed to avoid an increased type I error, whereas a high number of correlated SNPs can be handled even in small samples (e.g. n = 400).
CONCLUSION In conclusion, weighted GRS with weights from the marginal genetic effects of the SNPs estimated by a multivariate elastic net regression were shown to be a powerful tool to detect gene-environment interactions in scenarios of high linkage disequilibrium and noise.
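The weighted genetic risk score defined in the BACKGROUND section can be written as GRS = Σ w_i · g_i, where g_i is the risk-allele count (0, 1 or 2) at SNP i and w_i is its weight. A minimal Python sketch under that definition follows; the function names and toy values are illustrative assumptions, and the elastic net estimation of the weights used in the study is not shown.

```python
def weighted_grs(allele_counts, weights):
    """Weighted GRS: sum over SNPs of weight * risk-allele count."""
    if len(allele_counts) != len(weights):
        raise ValueError("one weight is required per SNP")
    if any(g not in (0, 1, 2) for g in allele_counts):
        raise ValueError("risk-allele counts must be 0, 1 or 2")
    return sum(w * g for w, g in zip(weights, allele_counts))


def unweighted_grs(allele_counts):
    """Unweighted GRS: every SNP gets weight 1, so this is a plain count."""
    return weighted_grs(allele_counts, [1] * len(allele_counts))
```

With every weight equal to 1, a noise SNP contributes as much to the score as a true risk SNP, which is one way to read the abstract's finding that the unweighted score loses power as noise SNPs are added.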
Affiliation(s)
- Anke Hüls
- IUF-Leibniz Research Institute for Environmental Medicine, Auf'm Hennekamp 50, 40225, Düsseldorf, Germany
- Faculty of Statistics, TU Dortmund University, Dortmund, Germany
- Katja Ickstadt
- Faculty of Statistics, TU Dortmund University, Dortmund, Germany
- Tamara Schikowski
- IUF-Leibniz Research Institute for Environmental Medicine, Auf'm Hennekamp 50, 40225, Düsseldorf, Germany
- Ursula Krämer
- IUF-Leibniz Research Institute for Environmental Medicine, Auf'm Hennekamp 50, 40225, Düsseldorf, Germany
11
Anbunathan H, Bowcock AM. The Molecular Revolution in Cutaneous Biology: The Era of Genome-Wide Association Studies and Statistical, Big Data, and Computational Topics. J Invest Dermatol 2017; 137:e113-e118. [PMID: 28411841 DOI: 10.1016/j.jid.2016.03.047] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 09/21/2015] [Revised: 02/10/2016] [Accepted: 03/02/2016] [Indexed: 01/04/2023]
Abstract
The investigation of biological systems involving all organs of the body, including the skin, is in the era of big data. This requires heavy-duty computational tools and novel statistical methods. Microarrays have allowed the interrogation of thousands of common genetic markers in thousands of individuals from the same population (termed genome wide association studies or GWAS) to reveal common variation associated with disease or phenotype. These markers are usually single nucleotide polymorphisms (SNPs) that are relatively common in the population. In the case of dermatological diseases such as alopecia areata, vitiligo, psoriasis and atopic dermatitis, common variants have been identified that are associated with disease, and these provide insights into biological pathways and reveal possible novel drug targets. Other skin phenotypes such as acne, color and skin cancers are also being investigated with GWAS. Analyses of such large GWAS datasets require a consideration of a number of statistical issues including the testing of multiple markers, population substructure, and ultimately a requirement for replication. There are also issues regarding the missing heritability of disease that cannot be entirely explained with current GWAS approaches. Next generation sequencing technologies such as exome and genome sequencing of similar patient cohorts will reveal additional variants contributing to disease susceptibility. However, the data generated with these approaches will be orders of magnitude greater than those generated with arrays, with concomitant challenges in the identification of disease-causing variants.
Affiliation(s)
- Hima Anbunathan
- National Heart and Lung Institute, Imperial College, London, UK
- Anne M Bowcock
- National Heart and Lung Institute, Imperial College, London, UK
12
Ran S, Zhang L, Liu L, Feng AP, Pei YF, Zhang L, Han YY, Lin Y, Li X, Kong WW, You XY, Zhao W, Tian Q, Shen H, Zhang YH, Deng HW. Gene-based genome-wide association study identified 19p13.3 for lean body mass. Sci Rep 2017; 7:45025. [PMID: 28322352 PMCID: PMC5359571 DOI: 10.1038/srep45025] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 10/14/2016] [Accepted: 02/17/2017] [Indexed: 12/15/2022]
Abstract
Lean body mass (LBM) is a complex trait for human health. To identify genomic loci underlying LBM, we performed a gene-based genome-wide association study of lean mass index (LMI) in 1000 unrelated Caucasian subjects, and replicated in 2283 unrelated Caucasian subjects. Gene-based association analyses highlighted the significant associations of three genes UQCR, TCF3 and MBD3 in one single locus 19p13.3 (discovery p = 6.10 × 10⁻⁵, 1.65 × 10⁻⁴ and 1.10 × 10⁻⁴; replication p = 2.21 × 10⁻³, 1.84 × 10⁻³ and 6.95 × 10⁻³; combined p = 2.26 × 10⁻⁶, 4.86 × 10⁻⁶ and 1.15 × 10⁻⁵, respectively). These results, together with the known functional relevance of the three genes to LMI, suggested that the 19p13.3 region containing UQCR, TCF3 and MBD3 genes was a novel locus underlying lean mass variation.
Affiliation(s)
- Shu Ran
- Center of System Biomedical Sciences, University of Shanghai for Science and Technology, Shanghai, PR China
- Lei Zhang
- Center for Genetic Epidemiology and Genomics, School of Public Health, Soochow University, Jiangsu, PR China
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Soochow University, Jiangsu, PR China
- Lu Liu
- Center for Genetic Epidemiology and Genomics, School of Public Health, Soochow University, Jiangsu, PR China
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Soochow University, Jiangsu, PR China
- An-Ping Feng
- Center for Genetic Epidemiology and Genomics, School of Public Health, Soochow University, Jiangsu, PR China
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Soochow University, Jiangsu, PR China
- Yu-Fang Pei
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Soochow University, Jiangsu, PR China
- Department of Epidemiology and Statistics, School of Public Health, Soochow University, Jiangsu, PR China
- Lei Zhang
- Center of System Biomedical Sciences, University of Shanghai for Science and Technology, Shanghai, PR China
- Ying-Ying Han
- Center of System Biomedical Sciences, University of Shanghai for Science and Technology, Shanghai, PR China
- Yong Lin
- Center of System Biomedical Sciences, University of Shanghai for Science and Technology, Shanghai, PR China
- Xiao Li
- Center for Genetic Epidemiology and Genomics, School of Public Health, Soochow University, Jiangsu, PR China
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Soochow University, Jiangsu, PR China
| | - Wei-Wen Kong
- Center for Genetic Epidemiology and Genomics, School of Public Health, Soochow University, Jiangsu, PR China
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Soochow University, Jiangsu, PR China
| | - Xin-Yi You
- Center for Genetic Epidemiology and Genomics, School of Public Health, Soochow University, Jiangsu, PR China
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Soochow University, Jiangsu, PR China
| | - Wen Zhao
- Center for Genetic Epidemiology and Genomics, School of Public Health, Soochow University, Jiangsu, PR China
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Soochow University, Jiangsu, PR China
| | - Qing Tian
- Department of Biostatistics, Tulane University, New Orleans, Louisiana, USA
| | - Hui Shen
- Department of Biostatistics, Tulane University, New Orleans, Louisiana, USA
| | - Yong-Hong Zhang
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Soochow University, Jiangsu, PR China
- Department of Epidemiology and Statistics, School of Public Health, Soochow University, Jiangsu, PR China
| | - Hong-Wen Deng
- Center of System Biomedical Sciences, University of Shanghai for Science and Technology, Shanghai, PR China
- Department of Biostatistics, Tulane University, New Orleans, Louisiana, USA
|
13
|
Wang L, Liu H, Liu L, Wang Q, Li S, Li Q. Prediction of peanut protein solubility based on the evaluation model established by supervised principal component regression. Food Chem 2017; 218:553-560. [DOI: 10.1016/j.foodchem.2016.09.091] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 05/04/2016] [Revised: 08/18/2016] [Accepted: 09/14/2016] [Indexed: 10/21/2022]
|
14
|
Liu W, Wang W, Tian G, Xie W, Lei L, Liu J, Huang W, Xu L, Li E. Topologically inferring pathway activity for precise survival outcome prediction: breast cancer as a case. Mol Biosyst 2017; 13:537-548. [PMID: 28098303 DOI: 10.1039/c6mb00757k] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 02/05/2023]
Abstract
Accurately predicting the survival outcome of patients is of great importance in clinical cancer research. In the past decade, building survival prediction models based on gene expression data has received increasing interest. However, the existing methods are mainly based on individual gene signatures, which are known to have limited prediction accuracy on independent datasets and unclear biological relevance. Here, we propose a novel pathway-based survival prediction method called DRWPSurv to accurately predict survival outcome. DRWPSurv integrates gene expression profiles and prior gene interaction information to topologically infer survival-associated pathway activities, and uses the pathway activities as features to construct a Lasso-Cox model. It uses the topological importance of genes, evaluated by directed random walk, to enhance the robustness of pathway activities and thereby improve predictive performance. We applied DRWPSurv to three independent breast cancer datasets and compared its predictive performance with that of a traditional gene-based method and four pathway-based methods. Results showed that pathway-based methods obtained comparable or better predictive performance than the gene-based method, and that DRWPSurv predicted survival outcome with better accuracy and robustness than the other pathway-based methods. In addition, the risk pathways identified by DRWPSurv provide biologically informative models for breast cancer prognosis and treatment.
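The abstract's idea of scoring genes by topological importance with a directed random walk can be illustrated with a generic random walk with restart on a small directed graph. This is a sketch, not the DRWPSurv implementation: the function name, the toy graph, the uniform seed weights (standing in for per-gene association statistics) and the restart probability of 0.7 are all assumptions made for illustration.

```python
def directed_random_walk(edges, seed_weights, restart=0.7, tol=1e-10, max_iter=1000):
    """Random walk with restart on a directed graph (illustrative only).

    edges        : list of (source, target) gene pairs
    seed_weights : dict gene -> non-negative initial weight (hypothetical statistic)
    restart      : probability of jumping back to the initial distribution
    Returns a dict gene -> stationary weight; the weights sum to 1.
    """
    nodes = sorted({n for edge in edges for n in edge} | set(seed_weights))
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)

    total = sum(seed_weights.values())
    w0 = {n: seed_weights.get(n, 0.0) / total for n in nodes}
    w = dict(w0)

    for _ in range(max_iter):
        nxt = {n: restart * w0[n] for n in nodes}
        for src in nodes:
            mass = (1.0 - restart) * w[src]
            if out[src]:
                # Split the walking mass evenly over outgoing edges.
                for dst in out[src]:
                    nxt[dst] += mass / len(out[src])
            else:
                # Dangling node: return its mass to the initial distribution.
                for n in nodes:
                    nxt[n] += mass * w0[n]
        diff = sum(abs(nxt[n] - w[n]) for n in nodes)
        w = nxt
        if diff < tol:
            break
    return w

# Toy pathway: gene C is downstream of both A and B.
edges = [("A", "B"), ("B", "C"), ("A", "C")]
weights = directed_random_walk(edges, {"A": 1.0, "B": 1.0, "C": 1.0})
for gene, weight in weights.items():
    print(gene, round(weight, 4))
```

In this toy graph the downstream gene C accumulates the most weight. Which direction of flow a given method treats as "important" is a modelling choice the abstract does not specify.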
Affiliation(s)
- Wei Liu
- The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, 515041, China, and Department of Mathematics, Heilongjiang Institute of Technology, Harbin, 150050, China
| | - Wei Wang
- Department of Mathematics, Heilongjiang Institute of Technology, Harbin, 150050, China
| | - Guohua Tian
- Department of Mathematics, Heilongjiang Institute of Technology, Harbin, 150050, China
| | - Wenming Xie
- Network Information Center, Shantou University Medical College, Shantou, 515041, China
| | - Li Lei
- Network Information Center, Shantou University Medical College, Shantou, 515041, China
| | - Jiujin Liu
- Network Information Center, Shantou University Medical College, Shantou, 515041, China
| | - Wanxun Huang
- Network Information Center, Shantou University Medical College, Shantou, 515041, China
| | - Liyan Xu
- The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, 515041, China, and Institute of Oncologic Pathology, Shantou University Medical College, Shantou, 515041, China
| | - Enmin Li
- The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Shantou University Medical College, Shantou, 515041, China, and Department of Biochemistry and Molecular Biology, Shantou University Medical College, Shantou 515041, China
|
15
|
Kao PYP, Leung KH, Chan LWC, Yip SP, Yap MKH. Pathway analysis of complex diseases for GWAS, extending to consider rare variants, multi-omics and interactions. Biochim Biophys Acta Gen Subj 2016; 1861:335-353. [PMID: 27888147 DOI: 10.1016/j.bbagen.2016.11.030] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.0] [Received: 06/22/2016] [Revised: 10/17/2016] [Accepted: 11/19/2016] [Indexed: 12/20/2022]
Abstract
BACKGROUND: Genome-wide association studies (GWAS) are a major method for studying the genetics of complex diseases. Finding all sequence variants to explain fully the aetiology of a disease is difficult because of their small effect sizes. To better explain disease mechanisms, pathway analysis is used to consolidate the effects of multiple variants, and hence increase the power of the study. While pathway analysis has previously been performed within GWAS only, it can now be extended to examining rare variants, other "-omics" and interaction data. SCOPE OF REVIEW: (1) Factors to consider in the choice of software for GWAS pathway analysis. (2) Examples of how pathway analysis is used to analyse rare variants, other "-omics" and interaction data. MAJOR CONCLUSIONS: To choose appropriate software tools, factors for consideration include covariate compatibility, null hypothesis, whether one- or two-step analysis is required, curation method of gene sets, size of pathways, and size of flanking regions to define gene boundaries. For rare variants, analysis performance depends on consistency between the assumed and actual effect distribution of variants. Integration of other "-omics" data and interaction data can better explain gene functions. GENERAL SIGNIFICANCE: Pathway analysis methods will be more readily used for integration of multiple sources of data, and enable more accurate prediction of phenotypes.
Affiliation(s)
- Patrick Y P Kao
- Centre for Myopia Research, School of Optometry, The Hong Kong Polytechnic University, Hong Kong SAR, China
| | - Kim Hung Leung
- Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hong Kong SAR, China
| | - Lawrence W C Chan
- Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hong Kong SAR, China
| | - Shea Ping Yip
- Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hong Kong SAR, China.
| | - Maurice K H Yap
- Centre for Myopia Research, School of Optometry, The Hong Kong Polytechnic University, Hong Kong SAR, China
|
16
|
Brodie A, Azaria JR, Ofran Y. How far from the SNP may the causative genes be? Nucleic Acids Res 2016; 44:6046-54. [PMID: 27269582 PMCID: PMC5291268 DOI: 10.1093/nar/gkw500] [Citation(s) in RCA: 112] [Impact Index Per Article: 12.4] [Received: 09/03/2015] [Revised: 05/20/2016] [Accepted: 05/22/2016] [Indexed: 02/03/2023]
Abstract
While GWAS identify many disease-associated SNPs, using them to decipher disease mechanisms is hindered by the difficulty in mapping SNPs to genes. Most SNPs are in non-coding regions and it is often hard to identify the genes they implicate. To explore how far a SNP may be from the affected genes, we used a pathway-based approach. We found that affected genes are often up to 2 Mbp away from the associated SNP, and are not necessarily the closest genes to the SNP. Existing approaches for mapping SNPs to genes leave many SNPs unmapped and reveal only 86 significant phenotype-pathway associations for all known GWAS hits combined. The pathway-based approach we propose here allows mapping of virtually all SNPs to genes and reveals 435 statistically significant phenotype-pathway associations. In a search for mechanisms that may explain the relationships between SNPs and distant genes, we found that SNPs that are mapped to distant genes have significantly more large insertions/deletions around them than other SNPs, suggesting that these SNPs may sometimes be markers for large insertions/deletions that may affect large genomic regions.
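The first step implied by the abstract, collecting every gene within a distance window of a SNP rather than only the nearest gene, can be sketched as follows. The gene names, coordinates and function names are hypothetical, and the pathway-based selection among candidates that the paper describes is not reproduced here. Only the 2 Mbp window size comes from the abstract.

```python
from typing import List, NamedTuple, Tuple

class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # first base of the gene
    end: int    # last base of the gene

def distance_to_gene(pos: int, gene: Gene) -> int:
    """Distance in bp from a SNP position to a gene; 0 if the SNP lies inside it."""
    if gene.start <= pos <= gene.end:
        return 0
    return min(abs(pos - gene.start), abs(pos - gene.end))

def genes_within_window(chrom: str, pos: int, genes: List[Gene],
                        window: int = 2_000_000) -> List[Tuple[int, str]]:
    """Return (distance, gene name) for same-chromosome genes within the window, nearest first."""
    hits = [(distance_to_gene(pos, g), g.name) for g in genes if g.chrom == chrom]
    return sorted((d, name) for d, name in hits if d <= window)

# Hypothetical annotation; real coordinates would come from a gene annotation file.
toy_genes = [
    Gene("GENE_A", "chr1", 1_000_000, 1_050_000),
    Gene("GENE_B", "chr1", 2_500_000, 2_600_000),
    Gene("GENE_C", "chr1", 5_000_000, 5_100_000),
    Gene("GENE_D", "chr2", 1_000_000, 1_100_000),
]

candidates = genes_within_window("chr1", 1_200_000, toy_genes)
print(candidates)
```

Here GENE_A is nearest, but GENE_B also remains a candidate at 1.3 Mbp. That is the situation the abstract highlights, where the affected gene need not be the closest one.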
Affiliation(s)
- Aharon Brodie
- The Goodman faculty of life sciences, Nanotechnology building, Bar Ilan University, Ramat Gan 52900, Israel
| | - Johnathan Roy Azaria
- The Goodman faculty of life sciences, Nanotechnology building, Bar Ilan University, Ramat Gan 52900, Israel
| | - Yanay Ofran
- The Goodman faculty of life sciences, Nanotechnology building, Bar Ilan University, Ramat Gan 52900, Israel
|
17
|
Zhang Q, Zhao Y, Zhang R, Wei Y, Yi H, Shao F, Chen F. A Comparative Study of Five Association Tests Based on CpG Set for Epigenome-Wide Association Studies. PLoS One 2016; 11:e0156895. [PMID: 27258058 PMCID: PMC4892473 DOI: 10.1371/journal.pone.0156895] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 02/13/2016] [Accepted: 05/20/2016] [Indexed: 11/19/2022]
Abstract
An epigenome-wide association study (EWAS) is a large-scale study of human disease-associated epigenetic variation, specifically variation in DNA methylation. High-throughput technologies enable simultaneous epigenetic profiling of DNA methylation at hundreds of thousands of CpGs across the genome. The clustering of correlated DNA methylation at CpGs is reportedly similar to that of linkage-disequilibrium (LD) correlation in genetic single nucleotide polymorphism (SNP) variation. However, current analysis methods, such as the t-test and rank-sum test, may be underpowered to detect differentially methylated markers. We propose to test the association between the outcome (e.g., case or control) and a set of CpG sites jointly. Here, we compared the performance of five CpG set analysis approaches: principal component analysis (PCA), supervised principal component analysis (SPCA), kernel principal component analysis (KPCA), sequence kernel association test (SKAT), and sliced inverse regression (SIR) with Hotelling's T² test and t-test using Bonferroni correction. The simulation results revealed that the first six methods can control the type I error at the significance level, while the t-test is conservative. SPCA and SKAT performed better than other approaches when the correlation among CpG sites was strong. For illustration, these methods were also applied to a real methylation dataset.
Affiliation(s)
- Qiuyi Zhang
- Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China, 211166
| | - Yang Zhao
- Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China, 211166
| | - Ruyang Zhang
- Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China, 211166
| | - Yongyue Wei
- Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China, 211166
| | - Honggang Yi
- Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China, 211166
| | - Fang Shao
- Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China, 211166
| | - Feng Chen
- Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, China, 211166
|
18
|
Mooney MA, Wilmot B. Gene set analysis: A step-by-step guide. Am J Med Genet B Neuropsychiatr Genet 2015; 168:517-27. [PMID: 26059482 PMCID: PMC4638147 DOI: 10.1002/ajmg.b.32328] [Citation(s) in RCA: 58] [Impact Index Per Article: 5.8] [Received: 03/16/2015] [Accepted: 05/20/2015] [Indexed: 12/21/2022]
Abstract
To maximize the potential of genome-wide association studies, many researchers are performing secondary analyses to identify sets of genes jointly associated with the trait of interest. Although methods for gene-set analyses (GSA), also called pathway analyses, have been around for more than a decade, the field is still evolving. There are numerous algorithms available for testing the cumulative effect of multiple SNPs, yet no real consensus in the field about the best way to perform a GSA. This paper provides an overview of the factors that can affect the results of a GSA, the lessons learned from past studies, and suggestions for how to make analysis choices that are most appropriate for different types of data. © 2015 Wiley Periodicals, Inc.
Affiliation(s)
- Michael A. Mooney
- Department of Medical Informatics & Clinical Epidemiology, Division of Bioinformatics & Computational Biology, Oregon Health & Science University, Portland, Oregon; OHSU Knight Cancer Institute, Portland, Oregon
| | - Beth Wilmot
- Department of Medical Informatics & Clinical Epidemiology, Division of Bioinformatics & Computational Biology, Oregon Health & Science University, Portland, Oregon; OHSU Knight Cancer Institute, Portland, Oregon; Oregon Clinical and Translational Research Institute, Portland, Oregon. Correspondence to: Beth Wilmot, Department of Medical Informatics & Clinical Epidemiology, Division of Bioinformatics & Computational Biology, Oregon Health & Science University, Portland, OR 97239.
|
19
|
Yi H, Wo H, Zhao Y, Zhang R, Dai J, Jin G, Ma H, Wu T, Hu Z, Lin D, Shen H, Chen F. Comparison of dimension reduction-based logistic regression models for case-control genome-wide association study: principal components analysis vs. partial least squares. J Biomed Res 2015; 29:298-307. [PMID: 26243516 PMCID: PMC4547378 DOI: 10.7555/jbr.29.20140043] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 03/11/2014] [Revised: 09/29/2014] [Accepted: 01/15/2015] [Indexed: 12/18/2022]
Abstract
With recent advances in biotechnology, genome-wide association studies (GWAS) have been widely used to identify genetic variants that underlie human complex diseases and traits. In case-control GWAS, the typical statistical strategy is traditional logistic regression (LR) based on single-locus analysis. However, such a single-locus analysis leads to the well-known multiplicity problem, with a risk of inflating type I error and reducing power. Dimension reduction-based techniques, such as principal component-based logistic regression (PC-LR) and partial least squares-based logistic regression (PLS-LR), have recently gained much attention in the analysis of high-dimensional genomic data. However, the performance of these methods is still not clear, especially in GWAS. We conducted simulations and a real data application to compare the type I error and power of PC-LR, PLS-LR and LR applied to GWAS within a defined single nucleotide polymorphism (SNP) set region. We found that PC-LR and PLS-LR can reasonably control type I error under the null hypothesis. In contrast, LR corrected by the Bonferroni method was more conservative in all simulation settings. In particular, we found that PC-LR and PLS-LR had comparable power and both outperformed LR, especially when the causal SNP was in high linkage disequilibrium with genotyped ones and had a small effect size in simulation. Based on SNP set analysis, we applied all three methods to analyze non-small cell lung cancer GWAS data.
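The Bonferroni correction applied to the single-locus baseline in this abstract tests each SNP at a threshold of alpha divided by the number of tests. The sketch below shows only that thresholding step. The p-values are made up for illustration and the function name is an assumption.

```python
from typing import List

def bonferroni_significant(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Flag each test as significant at the Bonferroni-adjusted level alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# Hypothetical single-SNP p-values for one SNP set of five markers.
snp_p_values = [0.001, 0.02, 0.04, 0.0004, 0.3]
flags = bonferroni_significant(snp_p_values)
print(flags)
```

Because the adjustment treats the five tests as independent, it becomes stricter than necessary when SNPs in the set are correlated through linkage disequilibrium. This is consistent with the abstract's finding that Bonferroni-corrected LR was conservative.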
Affiliation(s)
- Honggang Yi
- Department of Epidemiology and Biostatistics, School of Public Health
| | - Hongmei Wo
- Department of Public Service Management, School of KangDa
| | - Yang Zhao
- Department of Epidemiology and Biostatistics, School of Public Health
| | - Ruyang Zhang
- Department of Epidemiology and Biostatistics, School of Public Health
| | - Junchen Dai
- Department of Epidemiology and Biostatistics, School of Public Health
| | - Guangfu Jin
- Department of Epidemiology and Biostatistics, School of Public Health; Section of Clinical Epidemiology, Jiangsu Key Laboratory of Cancer Biomarkers, Prevention and Treatment, Cancer Center
| | - Hongxia Ma
- Department of Epidemiology and Biostatistics, School of Public Health; Section of Clinical Epidemiology, Jiangsu Key Laboratory of Cancer Biomarkers, Prevention and Treatment, Cancer Center
| | - Tangchun Wu
- Institute of Occupational Medicine and Ministry of Education, Key Laboratory for Environment and Health, School of Public Health, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, China
| | - Zhibin Hu
- Department of Epidemiology and Biostatistics, School of Public Health; Section of Clinical Epidemiology, Jiangsu Key Laboratory of Cancer Biomarkers, Prevention and Treatment, Cancer Center; State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, Jiangsu 211166, China
| | - Dongxin Lin
- State Key Laboratory of Molecular Oncology and Department of Etiology and Carcinogenesis, Cancer Institute and Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100021, China
| | - Hongbing Shen
- Department of Epidemiology and Biostatistics, School of Public Health; Section of Clinical Epidemiology, Jiangsu Key Laboratory of Cancer Biomarkers, Prevention and Treatment, Cancer Center; State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, Jiangsu 211166, China
| | - Feng Chen
- Department of Epidemiology and Biostatistics, School of Public Health.
|
20
|
Pan W, Kwak IY, Wei P. A Powerful Pathway-Based Adaptive Test for Genetic Association with Common or Rare Variants. Am J Hum Genet 2015; 97:86-98. [PMID: 26119817 DOI: 10.1016/j.ajhg.2015.05.018] [Citation(s) in RCA: 54] [Impact Index Per Article: 5.4] [Received: 02/14/2015] [Accepted: 05/21/2015] [Indexed: 12/11/2022]
Abstract
In spite of the success of genome-wide association studies (GWASs), only a small proportion of heritability for each complex trait has been explained by identified genetic variants, mainly SNPs. Likely reasons include genetic heterogeneity (i.e., multiple causal genetic variants) and small effect sizes of causal variants, for which pathway analysis has been proposed as a promising alternative to the standard single-SNP-based analysis. A pathway contains a set of functionally related genes, each of which includes multiple SNPs. Here we propose a pathway-based test that is adaptive at both the gene and SNP levels, thus maintaining high power across a wide range of situations with varying numbers of the genes and SNPs associated with a trait. The proposed method is applicable to both common variants and rare variants and can incorporate biological knowledge on SNPs and genes to boost statistical power. We use extensively simulated data and a WTCCC GWAS dataset to compare our proposal with several existing pathway-based and SNP-set-based tests, demonstrating its promising performance and its potential use in practice.
|
21
|
Stingo FC, Swartz MD, Vannucci M. A Bayesian approach to identify genes and gene-level SNP aggregates in a genetic analysis of cancer data. Stat Interface 2015; 8:137-151. [PMID: 28989562 PMCID: PMC5630184 DOI: 10.4310/sii.2015.v8.n2.a2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/07/2023]
Abstract
Complex diseases, such as cancer, arise from complex etiologies consisting of multiple single-nucleotide polymorphisms (SNPs), each contributing a small amount to the overall risk of disease. Thus, many researchers have gone beyond single-SNP analysis methods, focusing instead on groups of SNPs, for example by analysing haplotypes. More recently, pathway-based methods have been proposed that use prior biological knowledge on gene function to achieve a more powerful analysis of genome-wide association study (GWAS) data. In this paper we propose a novel Bayesian modeling framework to identify molecular biomarkers for disease prediction. Our method combines pathway-based approaches with multiple-SNP analyses of a specified region of interest. The model's development is motivated by SNP data from a lung cancer study. In our approach we define gene-level scores based on SNP allele frequencies and use a linear modeling setting to study the scores' association with the observed phenotype. The basic idea behind the definition of gene-level scores is to weigh the SNPs within the gene according to their rarity, based on genotype frequencies expected under the Hardy-Weinberg equilibrium law. This results in scores giving more importance to unusually low frequencies, i.e., to SNPs that might indicate peculiar genetic differences between subjects belonging to different groups. An additional feature of our approach is that we incorporate information on SNP-to-SNP associations into the model. In particular, we use network priors that model the linkage disequilibrium between SNPs. For posterior inference, we design a stochastic search method that identifies significant biomarkers (genes and SNPs) for disease prediction. We assess performance on simulated data and compare results to existing approaches. We then show the ability of the proposed methodology to detect relevant genes and associated SNPs in a lung cancer dataset.
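The abstract describes gene-level scores that weigh SNPs by rarity using genotype frequencies expected under Hardy-Weinberg equilibrium, but it does not give the formula. The sketch below is therefore a hypothetical stand-in, not the authors' score: it computes the standard Hardy-Weinberg genotype frequencies and sums the negative log of the expected frequency of each observed genotype, so that rarer genotypes contribute more. Function names and example values are assumptions.

```python
import math
from typing import Sequence, Tuple

def hwe_genotype_freqs(maf: float) -> Tuple[float, float, float]:
    """Expected frequencies of genotypes with 0, 1 and 2 minor alleles under HWE."""
    major = 1.0 - maf
    return (major * major, 2.0 * major * maf, maf * maf)

def rarity_score(genotypes: Sequence[int], mafs: Sequence[float]) -> float:
    """Hypothetical gene-level score: sum of -log(expected genotype frequency).

    genotypes : minor-allele counts (0, 1 or 2) for one subject, one per SNP in the gene
    mafs      : minor allele frequency of each SNP, in the same order
    """
    return sum(-math.log(hwe_genotype_freqs(maf)[g]) for g, maf in zip(genotypes, mafs))

# Two SNPs in one gene with hypothetical minor allele frequencies 0.1 and 0.2.
mafs = [0.1, 0.2]
common_subject = rarity_score([0, 0], mafs)  # homozygous major at both SNPs
rare_subject = rarity_score([2, 0], mafs)    # homozygous minor at the first SNP
print(common_subject, rare_subject)
```

A subject carrying the rare homozygous genotype receives the larger score, which mirrors the abstract's statement that unusually low frequencies are given more importance.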
Affiliation(s)
- Francesco C Stingo
- Department of Biostatistics, MD Anderson Cancer Center, 1400 Pressler St. Houston, TX 77030, USA
| | - Michael D Swartz
- Department of Biostatistics, UT School of Public Health, 1200 Pressler St. Houston, TX 77030, USA
| | - Marina Vannucci
- Department of Statistics, MS 138, Rice University, 6100 Main St. Houston, TX 77251-1892 USA
|
22
|
Saez I, Set E, Hsu M. From genes to behavior: placing cognitive models in the context of biological pathways. Front Neurosci 2014; 8:336. [PMID: 25414628 PMCID: PMC4220121 DOI: 10.3389/fnins.2014.00336] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 06/30/2014] [Accepted: 10/05/2014] [Indexed: 01/16/2023]
Abstract
Connecting neural mechanisms of behavior to their underlying molecular and genetic substrates has important scientific and clinical implications. However, despite rapid growth in our knowledge of the functions and computational properties of neural circuitry underlying behavior in a number of important domains, there has been much less progress in extending this understanding to their molecular and genetic substrates, even in an age marked by exploding availability of genomic data. Here we describe recent advances in analytical strategies that aim to overcome two important challenges associated with studying the complex relationship between genes and behavior: (i) reducing distal behavioral phenotypes to a set of molecular, physiological, and neural processes that render them closer to the actions of genetic forces, and (ii) striking a balance between the competing demands of discovery and interpretability when dealing with genomic data containing up to millions of markers. Our proposed approach involves linking, on one hand, models of neural computations and circuits hypothesized to underlie behavior, and on the other hand, the set of the genes carrying out biochemical processes related to the functioning of these neural systems. In particular, we focus on the specific example of value-based decision-making, and discuss how such a combination allows researchers to leverage existing biological knowledge at both neural and genetic levels to advance our understanding of the neurogenetic mechanisms underlying behavior.
Affiliation(s)
- Ignacio Saez
- Helen Wills Neuroscience Program, Haas School of Business, University of California, Berkeley Berkeley, CA, USA
| | - Eric Set
- Helen Wills Neuroscience Program, Haas School of Business, University of California, Berkeley Berkeley, CA, USA; Department of Economics, University of Illinois at Urbana-Champaign Urbana, IL, USA
| | - Ming Hsu
- Helen Wills Neuroscience Program, Haas School of Business, University of California, Berkeley Berkeley, CA, USA
|
23
|
Jin L, Zuo XY, Su WY, Zhao XL, Yuan MQ, Han LZ, Zhao X, Chen YD, Rao SQ. Pathway-based analysis tools for complex diseases: a review. Genomics Proteomics Bioinformatics 2014; 12:210-20. [PMID: 25462153 PMCID: PMC4411419 DOI: 10.1016/j.gpb.2014.10.002] [Citation(s) in RCA: 93] [Impact Index Per Article: 8.5] [Received: 06/21/2014] [Revised: 08/30/2014] [Accepted: 09/04/2014] [Indexed: 11/23/2022]
Abstract
Genetic studies are traditionally based on single-gene analysis, which poses tremendous challenges for elucidating the complicated genetic interplay involved in complex human diseases. Modern pathway-based analysis provides a technique that allows a comprehensive understanding of the molecular mechanisms underlying complex diseases. Extensive studies of the methods and applications of pathway-based analysis have significantly advanced our capacity to explore the large-scale omics data that have rapidly accumulated in biomedical fields. This article is a comprehensive review of pathway-based analysis methods, which have the potential to uncover the biological depths of complex diseases. The general concepts and procedures of pathway-based analysis are introduced, followed by a comprehensive review of the major approaches. In addition, a list of available pathway-based analysis software and databases is provided. Finally, future directions and challenges for the methodological development and application of pathway-based analysis techniques are discussed. This review will provide a useful guide to dissecting complex diseases.
Affiliation(s)
- Lv Jin
- Institute for Medical Systems Biology, and Department of Medical Statistics and Epidemiology, School of Public Health, Guangdong Medical College, Dongguan 523808, China
| | - Xiao-Yu Zuo
- Department of Medical Statistics and Epidemiology, School of Public Health, Sun Yat-Sen University, Guangzhou 510080, China
| | - Wei-Yang Su
- Community Health Service Management Center of Panyu District, Guangzhou 511400, China
| | - Xiao-Lei Zhao
- Institute for Medical Systems Biology, and Department of Medical Statistics and Epidemiology, School of Public Health, Guangdong Medical College, Dongguan 523808, China
| | - Man-Qiong Yuan
- Department of Statistical Sciences, School of Mathematics and Computational Science, Sun Yat-Sen University, Guangzhou 510275, China
| | - Li-Zhen Han
- Department of Medical Statistics and Epidemiology, School of Public Health, Sun Yat-Sen University, Guangzhou 510080, China
| | - Xiang Zhao
- Institute for Medical Systems Biology, and Department of Medical Statistics and Epidemiology, School of Public Health, Guangdong Medical College, Dongguan 523808, China
| | - Ye-Da Chen
- Institute for Medical Systems Biology, and Department of Medical Statistics and Epidemiology, School of Public Health, Guangdong Medical College, Dongguan 523808, China
| | - Shao-Qi Rao
- Institute for Medical Systems Biology, and Department of Medical Statistics and Epidemiology, School of Public Health, Guangdong Medical College, Dongguan 523808, China; Department of Medical Statistics and Epidemiology, School of Public Health, Sun Yat-Sen University, Guangzhou 510080, China; Department of Statistical Sciences, School of Mathematics and Computational Science, Sun Yat-Sen University, Guangzhou 510275, China.
|
24
|
Mooney MA, Nigg JT, McWeeney SK, Wilmot B. Functional and genomic context in pathway analysis of GWAS data. Trends Genet 2014; 30:390-400. [PMID: 25154796 DOI: 10.1016/j.tig.2014.07.004] [Citation(s) in RCA: 86] [Impact Index Per Article: 7.8] [Received: 06/12/2014] [Revised: 07/18/2014] [Accepted: 07/18/2014] [Indexed: 02/07/2023]
Abstract
Gene set analysis (GSA) is a promising tool for uncovering the polygenic effects associated with complex diseases. However, the available techniques reflect a wide variety of hypotheses about how genetic effects interact to contribute to disease susceptibility. The lack of consensus about the best way to perform GSA has led to confusion in the field and has made it difficult to compare results across methods. A clear understanding of the various choices made during GSA - such as how gene sets are defined, how single-nucleotide polymorphisms (SNPs) are assigned to genes, and how individual SNP-level effects are aggregated to produce gene- or pathway-level effects - will improve the interpretability and comparability of results across methods and studies. In this review we provide an overview of the various data sources used to construct gene sets and the statistical methods used to test for gene set association, as well as provide guidelines for ensuring the comparability of results.
Affiliation(s)
- Michael A Mooney
- Division of Bioinformatics and Computational Biology, Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, USA; OHSU Knight Cancer Institute, Portland, OR, USA
| | - Joel T Nigg
- Division of Psychology, Department of Psychiatry, Oregon Health & Science University, Portland, OR, USA; Department of Behavioral Neuroscience, Oregon Health & Science University, Portland, OR, USA
| | - Shannon K McWeeney
- Division of Bioinformatics and Computational Biology, Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, USA; Oregon Clinical and Translational Research Institute, Portland, OR, USA; OHSU Knight Cancer Institute, Portland, OR, USA.
| | - Beth Wilmot
- Division of Bioinformatics and Computational Biology, Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, USA; Oregon Clinical and Translational Research Institute, Portland, OR, USA; OHSU Knight Cancer Institute, Portland, OR, USA
| |
Collapse
|
25
|
Hicks C, Koganti T, Giri S, Tekere M, Ramani R, Sitthi-Amorn J, Vijayakumar S. Integrative genomic analysis for the discovery of biomarkers in prostate cancer. Biomark Insights 2014; 9:39-51. [PMID: 25057237 PMCID: PMC4085106 DOI: 10.4137/bmi.s13729] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 11/27/2013] [Revised: 04/03/2014] [Accepted: 04/06/2014] [Indexed: 12/18/2022] Open
Abstract
Genome-wide association studies (GWAS) have achieved great success in identifying single nucleotide polymorphisms (SNPs, herein called genetic variants) and genes associated with risk of developing prostate cancer. However, GWAS do not typically link the genetic variants to the disease state or inform the broader context in which the genetic variants operate. Here, we present a novel integrative genomics approach that combines GWAS information with gene expression data to infer the causal association between gene expression and the disease and to identify the network states and biological pathways enriched for genetic variants. We identified gene regulatory networks and biological pathways enriched for genetic variants, including the prostate cancer, IGF-1, JAK2, androgen, and prolactin signaling pathways. The integration of GWAS information with gene expression data provides insights about the broader context in which genetic variants associated with an increased risk of developing prostate cancer operate.
Affiliation(s)
- Chindo Hicks
- Cancer Institute, University of Mississippi Medical Center, Jackson, MS, USA; Department of Medicine, University of Mississippi Medical Center, Jackson, MS, USA; Department of Radiation Oncology, University of Mississippi Medical Center, Jackson, MS, USA; Department of Public Health Sciences, University of Lusaka, Lusaka, Zambia
- Tejaswi Koganti
- Cancer Institute, University of Mississippi Medical Center, Jackson, MS, USA
- Shankar Giri
- Cancer Institute, University of Mississippi Medical Center, Jackson, MS, USA
- Memory Tekere
- Department of Environmental Sciences, University of South Africa, UNISA Florida Campus, Florida, South Africa
- Ritika Ramani
- Cancer Institute, University of Mississippi Medical Center, Jackson, MS, USA
- Srinivasan Vijayakumar
- Department of Radiation Oncology, University of Mississippi Medical Center, Jackson, MS, USA
|
26
|
Lu M, Lee HS, Hadley D, Huang JZ, Qian X. Supervised categorical principal component analysis for genome-wide association analyses. BMC Genomics 2014; 15 Suppl 1:S10. [PMID: 24564304 PMCID: PMC4046680 DOI: 10.1186/1471-2164-15-s1-s10] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 02/22/2023] Open
Abstract
To better understand the unexplained heritability of complex diseases in conventional Genome-Wide Association Studies (GWAS), aggregated association analyses based on predefined functional regions, such as genes and pathways, have recently become popular because they enable evaluation of the joint effect of multiple Single-Nucleotide Polymorphisms (SNPs), which helps increase detection power, especially when investigating genetic variants with weak individual effects. In this paper, we focus on aggregated analysis methods based on the idea of Principal Component Analysis (PCA). Past approaches using PCA mostly make inherent assumptions about the genotype data and/or the risk effect model, which may hinder the accurate detection of potential disease SNPs that influence disease phenotypes. We derive a general Supervised Categorical Principal Component Analysis (SCPCA), which explicitly models categorical SNP data without imposing any risk effect model assumption. We have evaluated the efficacy of SCPCA in comparison with a traditional Supervised PCA (SPCA) and a previously developed Supervised Logistic Principal Component Analysis (SLPCA), using both genotype data simulated by HAPGEN2 and the genotype data of Crohn's Disease (CD) from the Wellcome Trust Case Control Consortium (WTCCC). Our preliminary results demonstrate the superiority of SCPCA over both SPCA and SLPCA, owing to its modeling explicitly designed for categorical SNP data as well as its flexibility regarding the risk effect model assumption.
|
27
|
Carbonetto P, Stephens M. Integrated enrichment analysis of variants and pathways in genome-wide association studies indicates central role for IL-2 signaling genes in type 1 diabetes, and cytokine signaling genes in Crohn's disease. PLoS Genet 2013; 9:e1003770. [PMID: 24098138 PMCID: PMC3789883 DOI: 10.1371/journal.pgen.1003770] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.7] [Received: 08/21/2012] [Accepted: 07/22/2013] [Indexed: 12/17/2022] Open
Abstract
Pathway analyses of genome-wide association studies aggregate information over sets of related genes, such as genes in common pathways, to identify gene sets that are enriched for variants associated with disease. We develop a model-based approach to pathway analysis, and apply this approach to data from the Wellcome Trust Case Control Consortium (WTCCC) studies. Our method offers several benefits over existing approaches. First, our method not only interrogates pathways for enrichment of disease associations, but also estimates the level of enrichment, which yields a coherent way to promote variants in enriched pathways, enhancing discovery of genes underlying disease. Second, our approach allows for multiple enriched pathways, a feature that leads to novel findings in two diseases where the major histocompatibility complex (MHC) is a major determinant of disease susceptibility. Third, by modeling disease as the combined effect of multiple markers, our method automatically accounts for linkage disequilibrium among variants. Interrogation of pathways from eight pathway databases yields strong support for enriched pathways, indicating links between Crohn's disease (CD) and cytokine-driven networks that modulate immune responses; between rheumatoid arthritis (RA) and "Measles" pathway genes involved in immune responses triggered by measles infection; and between type 1 diabetes (T1D) and IL2-mediated signaling genes. Prioritizing variants in these enriched pathways yields many additional putative disease associations compared to analyses without enrichment. For CD and RA, 7 of 8 additional non-MHC associations are corroborated by other studies, providing validation for our approach. For T1D, prioritization of IL-2 signaling genes yields strong evidence for 7 additional non-MHC candidate disease loci, as well as suggestive evidence for several more. 
Of the 7 strongest associations, 4 are validated by other studies, and 3 (near IL-2 signaling genes RAF1, MAPK14, and FYN) constitute novel putative T1D loci for further study.
Affiliation(s)
- Peter Carbonetto
- Dept. of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
- Matthew Stephens
- Dept. of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
- Dept. of Statistics, University of Chicago, Chicago, Illinois, United States of America
|
28
|
Association signals unveiled by a comprehensive gene set enrichment analysis of dental caries genome-wide association studies. PLoS One 2013; 8:e72653. [PMID: 23967329 PMCID: PMC3743773 DOI: 10.1371/journal.pone.0072653] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Received: 04/30/2013] [Accepted: 07/11/2013] [Indexed: 11/19/2022] Open
Abstract
Gene set-based analysis of genome-wide association study (GWAS) data has recently emerged as a useful approach to examine the joint effects of multiple risk loci in complex human diseases or phenotypes. Dental caries is a common, chronic, and complex disease leading to a decrease in quality of life worldwide. In this study, we applied the approaches of gene set enrichment analysis to a major dental caries GWAS dataset, which consists of 537 cases and 605 controls. Using four complementary gene set analysis methods, we analyzed 1331 Gene Ontology (GO) terms collected from the Molecular Signatures Database (MSigDB). Setting false discovery rate (FDR) threshold as 0.05, we identified 13 significantly associated GO terms. Additionally, 17 terms were further included as marginally associated because they were top ranked by each method, although their FDR is higher than 0.05. In total, we identified 30 promising GO terms, including ‘Sphingoid metabolic process,’ ‘Ubiquitin protein ligase activity,’ ‘Regulation of cytokine secretion,’ and ‘Ceramide metabolic process.’ These GO terms encompass broad functions that potentially interact and contribute to the oral immune response related to caries development, which have not been reported in the standard single marker based analysis. Collectively, our gene set enrichment analysis provided complementary insights into the molecular mechanisms and polygenic interactions in dental caries, revealing promising association signals that could not be detected through single marker analysis of GWAS data.
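The abstract above reports gene sets passing an FDR threshold of 0.05 but does not say which FDR procedure was used. As an illustration only, the following minimal sketch assumes the Benjamini-Hochberg step-up procedure; the function name and the p-values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the indices of hypotheses rejected by the BH step-up procedure."""
    m = len(pvalues)
    # Indices sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest 1-based rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Every hypothesis ranked at or below the cutoff is rejected.
    return sorted(order[:cutoff_rank])


# Hypothetical gene-set p-values (illustrative, not from the caries GWAS).
pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
rejected = benjamini_hochberg(pvals, alpha=0.05)
```

Because the procedure is step-up, a gene set with a nominal p-value below 0.05 (such as 0.039 here) can still fail the FDR threshold once the number of tested sets is taken into account.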
|
29
|
SNP set association analysis for genome-wide association studies. PLoS One 2013; 8:e62495. [PMID: 23658731 PMCID: PMC3643925 DOI: 10.1371/journal.pone.0062495] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 11/23/2012] [Accepted: 03/22/2013] [Indexed: 11/29/2022] Open
Abstract
Genome-wide association study (GWAS) is a promising approach for identifying common genetic variants of diseases on the basis of millions of single nucleotide polymorphisms (SNPs). To avoid the low power caused by heavy correction for multiple comparisons in single-locus association studies, some methods group SNPs together into a SNP set based on genomic features and then test the joint effect of the SNP set. We compare the performances of principal component analysis (PCA), supervised principal component analysis (SPCA), kernel principal component analysis (KPCA), and sliced inverse regression (SIR). Simulated SNP sets are generated under scenarios of 0, 1 and ≥2 causal SNPs. Our simulation results show that all of these methods can control the type I error at the nominal significance level. SPCA is always more powerful than the other methods across the different settings of linkage disequilibrium structure and minor allele frequency in the simulated datasets. We also apply these four methods to a real GWAS of non-small cell lung cancer (NSCLC) in a Han Chinese population.
|
30
|
Zhao Y, Chen F, Zhai R, Lin X, Diao N, Christiani DC. Association test based on SNP set: logistic kernel machine based test vs. principal component analysis. PLoS One 2012; 7:e44978. [PMID: 23028716 PMCID: PMC3441747 DOI: 10.1371/journal.pone.0044978] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 05/08/2012] [Accepted: 08/16/2012] [Indexed: 01/04/2023] Open
Abstract
GWAS has facilitated greatly the discovery of risk SNPs associated with complex diseases. Traditional methods analyze SNP individually and are limited by low power and reproducibility since correction for multiple comparisons is necessary. Several methods have been proposed based on grouping SNPs into SNP sets using biological knowledge and/or genomic features. In this article, we compare the linear kernel machine based test (LKM) and principal components analysis based approach (PCA) using simulated datasets under the scenarios of 0 to 3 causal SNPs, as well as simple and complex linkage disequilibrium (LD) structures of the simulated regions. Our simulation study demonstrates that both LKM and PCA can control the type I error at the significance level of 0.05. If the causal SNP is in strong LD with the genotyped SNPs, both the PCA with a small number of principal components (PCs) and the LKM with kernel of linear or identical-by-state function are valid tests. However, if the LD structure is complex, such as several LD blocks in the SNP set, or when the causal SNP is not in the LD block in which most of the genotyped SNPs reside, more PCs should be included to capture the information of the causal SNP. Simulation studies also demonstrate the ability of LKM and PCA to combine information from multiple causal SNPs and to provide increased power over individual SNP analysis. We also apply LKM and PCA to analyze two SNP sets extracted from an actual GWAS dataset on non-small cell lung cancer.
Affiliation(s)
- Yang Zhao
- Department of Environmental Health, Harvard School of Public Health, Harvard University, Boston, Massachusetts, United States of America
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu, China
- Feng Chen
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu, China
- Rihong Zhai
- Department of Environmental Health, Harvard School of Public Health, Harvard University, Boston, Massachusetts, United States of America
- Xihong Lin
- Department of Biostatistics, Harvard School of Public Health, Harvard University, Boston, Massachusetts, United States of America
- Nancy Diao
- Department of Environmental Health, Harvard School of Public Health, Harvard University, Boston, Massachusetts, United States of America
- David C. Christiani
- Department of Environmental Health, Harvard School of Public Health, Harvard University, Boston, Massachusetts, United States of America
|
31
|
Pathway analysis of genomic data: concepts, methods, and prospects for future development. Trends Genet 2012; 28:323-32. [PMID: 22480918 DOI: 10.1016/j.tig.2012.03.004] [Citation(s) in RCA: 215] [Impact Index Per Article: 16.5] [Received: 01/04/2012] [Revised: 03/02/2012] [Accepted: 03/07/2012] [Indexed: 12/31/2022]
Abstract
Genome-wide data sets are increasingly being used to identify biological pathways and networks underlying complex diseases. In particular, analyzing genomic data through sets defined by functional pathways offers the potential of greater power for discovery and natural connections to biological mechanisms. With the burgeoning availability of next-generation sequencing, this is an opportune moment to revisit strategies for pathway-based analysis of genomic data. Here, we synthesize relevant concepts and extant methodologies to guide investigators in study design and execution. We also highlight ongoing challenges and proposed solutions. As relevant analytical strategies mature, pathways and networks will be ideally placed to integrate data from diverse -omics sources to harness the extensive, rich information related to disease and treatment mechanisms.
|
32
|
Shahbaba B, Shachaf CM, Yu Z. A pathway analysis method for genome-wide association studies. Stat Med 2012; 31:988-1000. [PMID: 22302470 DOI: 10.1002/sim.4477] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Received: 03/22/2011] [Revised: 10/20/2011] [Accepted: 11/02/2011] [Indexed: 12/20/2022]
Abstract
For genome-wide association studies, we propose a new method for identifying significant biological pathways. In this approach, we aggregate data across single-nucleotide polymorphisms to obtain summary measures at the gene level. We then use a hierarchical Bayesian model, which takes the gene-level summary measures as data, in order to evaluate the relevance of each pathway to an outcome of interest (e.g., disease status). Although shifting the focus of analysis from individual genes to pathways has proven to improve the statistical power and provide more robust results, such methods tend to eliminate a large number of genes whose pathways are unknown. For these genes, we propose to use a Bayesian multinomial logit model to predict the associated pathways by using the genes with known pathways as the training data. Our hierarchical Bayesian model takes the uncertainty regarding the pathway predictions into account while assessing the significance of pathways. We apply our method to two independent studies on type 2 diabetes and show that the overlap between the results from the two studies is statistically significant. We also evaluate our approach on the basis of simulated data.
Affiliation(s)
- Babak Shahbaba
- Department of Statistics, University of California, Irvine, CA, USA
|
33
|
Menashe I, Figueroa JD, Garcia-Closas M, Chatterjee N, Malats N, Picornell A, Maeder D, Yang Q, Prokunina-Olsson L, Wang Z, Real FX, Jacobs KB, Baris D, Thun M, Albanes D, Purdue MP, Kogevinas M, Hutchinson A, Fu YP, Tang W, Burdette L, Tardón A, Serra C, Carrato A, García-Closas R, Lloreta J, Johnson A, Schwenn M, Schned A, Andriole G, Black A, Jacobs EJ, Diver RW, Gapstur SM, Weinstein SJ, Virtamo J, Caporaso NE, Landi MT, Fraumeni JF, Chanock SJ, Silverman DT, Rothman N. Large-scale pathway-based analysis of bladder cancer genome-wide association data from five studies of European background. PLoS One 2012; 7:e29396. [PMID: 22238607 PMCID: PMC3251580 DOI: 10.1371/journal.pone.0029396] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.5] [Received: 09/14/2011] [Accepted: 11/28/2011] [Indexed: 12/14/2022] Open
Abstract
Pathway analysis of genome-wide association studies (GWAS) offers a unique opportunity to collectively evaluate genetic variants with effects that are too small to be detected individually. We applied a pathway analysis to a bladder cancer GWAS containing data from 3,532 cases and 5,120 controls of European background (n = 5 studies). Thirteen hundred and ninety-nine pathways were drawn from five publicly available resources (BioCarta, KEGG, NCI-PID, HumanCyc, and Reactome), and we constructed 22 additional candidate pathways previously hypothesized to be related to bladder cancer. In total, 1421 pathways, 5647 genes and ∼90,000 SNPs were included in our study. A logistic regression model adjusting for age, sex, study, DNA source, and smoking status was used to assess the marginal trend effect of SNPs on bladder cancer risk. Two complementary pathway-based methods (gene-set enrichment analysis [GSEA], and adapted rank-truncated product [ARTP]) were used to assess the enrichment of association signals within each pathway. Eighteen pathways were detected by either GSEA or ARTP at P≤0.01. To minimize false positives, we used the I² statistic to identify SNPs displaying heterogeneous effects across the five studies. After removing these SNPs, seven pathways ('Aromatic amine metabolism' [P(GSEA) = 0.0100, P(ARTP) = 0.0020], 'NAD biosynthesis' [P(GSEA) = 0.0018, P(ARTP) = 0.0086], 'NAD salvage' [P(ARTP) = 0.0068], 'Clathrin derived vesicle budding' [P(ARTP) = 0.0018], 'Lysosome vesicle biogenesis' [P(GSEA) = 0.0023, P(ARTP)<0.00012], 'Retrograde neurotrophin signaling' [P(GSEA) = 0.00840], and 'Mitotic metaphase/anaphase transition' [P(GSEA) = 0.0040]) remained. These pathways seem to belong to three fundamental cellular processes (metabolic detoxification, mitosis, and clathrin-mediated vesicles). 
Identification of the aromatic amine metabolism pathway provides support for the ability of this approach to identify pathways with established relevance to bladder carcinogenesis.
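The heterogeneity filter in this abstract relies on the I² statistic, which is conventionally derived from Cochran's Q as I² = max(0, (Q − df) / Q) × 100%, with df equal to the number of studies minus one. The sketch below is a minimal illustration of that standard formula with inverse-variance weights; the function name and the per-study effect sizes are hypothetical, and the abstract does not state the cutoff the authors applied.

```python
def i_squared(betas, standard_errors):
    """I^2 heterogeneity (in percent) from per-study effects and standard errors."""
    # Inverse-variance weights, as in a fixed-effect meta-analysis.
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled effect.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    if q <= 0.0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical per-study log odds ratios for one SNP across five studies.
consistent = i_squared([0.2, 0.2, 0.2, 0.2, 0.2], [0.1] * 5)
discordant = i_squared([0.5, -0.5, 0.5, -0.5, 0.5], [0.1] * 5)
```

A SNP whose effect flips sign between studies, as in the second call, yields a large I² and would be a candidate for removal under a filter of this kind.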
Affiliation(s)
- Idan Menashe
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, United States of America.
|
34
|
Schaid DJ, Sinnwell JP, Jenkins GD, McDonnell SK, Ingle JN, Kubo M, Goss PE, Costantino JP, Wickerham DL, Weinshilboum RM. Using the gene ontology to scan multilevel gene sets for associations in genome wide association studies. Genet Epidemiol 2011; 36:3-16. [PMID: 22161999 DOI: 10.1002/gepi.20632] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.1] [Received: 05/17/2011] [Revised: 07/22/2011] [Accepted: 08/02/2011] [Indexed: 11/07/2022]
Abstract
Gene-set analyses have been widely used in gene expression studies, and some of the developed methods have been extended to genome wide association studies (GWAS). Yet, complications due to linkage disequilibrium (LD) among single nucleotide polymorphisms (SNPs), and variable numbers of SNPs per gene and genes per gene-set, have plagued current approaches, often leading to ad hoc "fixes." To overcome some of the current limitations, we developed a general approach to scan GWAS SNP data for both gene-level and gene-set analyses, building on score statistics for generalized linear models, and taking advantage of the directed acyclic graph structure of the gene ontology when creating gene-sets. However, other types of gene-set structures can be used, such as the popular Kyoto Encyclopedia of Genes and Genomes (KEGG). Our approach combines SNPs into genes, and genes into gene-sets, but ensures that positive and negative effects of genes on a trait do not cancel. To control for multiple testing of many gene-sets, we use an efficient computational strategy that accounts for LD and provides accurate step-down adjusted P-values for each gene-set. Application of our methods to two different GWAS provides guidance on the potential strengths and weaknesses of our proposed gene-set analyses.
Affiliation(s)
- Daniel J Schaid
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota 55905, USA.
|
35
|
Adaptive elastic-net sparse principal component analysis for pathway association testing. Stat Appl Genet Mol Biol 2011; 10(1). [PMID: 23089825 DOI: 10.2202/1544-6115.1697] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 01/05/2023]
Abstract
Pathway or gene set analysis has become an increasingly popular approach for analyzing high-throughput biological experiments such as microarray gene expression studies. The purpose of pathway analysis is to identify differentially expressed pathways associated with outcomes. Important challenges in pathway analysis are selecting a subset of genes contributing most to association with clinical phenotypes and conducting statistical tests of association for the pathways efficiently. We propose a two-stage analysis strategy: (1) extract latent variables representing activities within each pathway using a dimension reduction approach based on adaptive elastic-net sparse principal component analysis; (2) integrate the latent variables with the regression modeling framework to analyze studies with different types of outcomes such as binary, continuous or survival outcomes. Our proposed approach is computationally efficient. For each pathway, because the latent variables are estimated in an unsupervised fashion without using disease outcome information, in the sample label permutation testing procedure, the latent variables only need to be calculated once rather than for each permutation resample. Using both simulated and real datasets, we show our approach performed favorably when compared with five other currently available pathway testing methods.
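The computational point of the two-stage strategy above is that an unsupervised latent variable is computed once per pathway and then reused for every label permutation. The sketch below illustrates only that reuse pattern. It makes two loud simplifications: ordinary first-principal-component scores (via power iteration) stand in for the adaptive elastic-net sparse PCA of the paper, and an absolute correlation with a binary outcome stands in for the regression model. All function names and the simulated data are hypothetical.

```python
import random


def first_pc_scores(X, n_iter=200):
    """Sample scores on the first principal component (plain PCA, power iteration)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    v = [1.0] * p  # arbitrary starting direction
    for _ in range(n_iter):
        scores = [sum(r[j] * v[j] for j in range(p)) for r in Xc]
        # Multiply by Xc^T Xc without forming the matrix explicitly.
        w = [sum(Xc[i][j] * scores[i] for i in range(n)) for j in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return [sum(r[j] * v[j] for j in range(p)) for r in Xc]


def abs_corr(a, b):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return abs(cov) / (va * vb) ** 0.5


def permutation_pvalue(scores, outcome, n_perm=999, seed=0):
    """Permutation p-value; the latent scores are never recomputed."""
    rng = random.Random(seed)
    observed = abs_corr(scores, outcome)
    shuffled = list(outcome)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute sample labels only
        if abs_corr(scores, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Simulated pathway: 40 samples, 5 genes sharing one latent activity (illustrative).
rng = random.Random(1)
outcome = [0] * 20 + [1] * 20
X = []
for y in outcome:
    activity = rng.gauss(1.5 * y, 1.0)  # shifted upward in cases
    X.append([activity + rng.gauss(0.0, 0.5) for _ in range(5)])

scores = first_pc_scores(X)                    # stage 1: computed once
p_value = permutation_pvalue(scores, outcome)  # stage 2: reused for all permutations
```

A supervised dimension reduction would not allow this shortcut, because the latent variable would depend on the outcome labels and would have to be re-estimated inside every permutation.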
|
36
|
Wang L, Jia P, Wolfinger RD, Chen X, Zhao Z. Gene set analysis of genome-wide association studies: methodological issues and perspectives. Genomics 2011; 98:1-8. [PMID: 21565265 PMCID: PMC3852939 DOI: 10.1016/j.ygeno.2011.04.006] [Citation(s) in RCA: 164] [Impact Index Per Article: 11.7] [Received: 12/06/2010] [Revised: 03/02/2011] [Accepted: 04/15/2011] [Indexed: 12/25/2022]
Abstract
Recent studies have demonstrated that gene set analysis, which tests disease association with genetic variants in a group of functionally related genes, is a promising approach for analyzing and interpreting genome-wide association studies (GWAS) data. These approaches aim to increase power by combining association signals from multiple genes in the same gene set. In addition, gene set analysis can also shed more light on the biological processes underlying complex diseases. However, current approaches for gene set analysis are still in an early stage of development in that analysis results are often prone to sources of bias, including gene set size and gene length, linkage disequilibrium patterns and the presence of overlapping genes. In this paper, we provide an in-depth review of the gene set analysis procedures, along with parameter choices and the particular methodology challenges at each stage. In addition to providing a survey of recently developed tools, we also classify the analysis methods into larger categories and discuss their strengths and limitations. In the last section, we outline several important areas for improving the analytical strategies in gene set analysis.
Affiliation(s)
- Lily Wang
- Department of Biostatistics, Vanderbilt University, Nashville, TN 37232, USA
- Peilin Jia
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
- Department of Psychiatry, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
- Xi Chen
- Division of Cancer Biostatistics, Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
- Zhongming Zhao
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
- Department of Psychiatry, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
- Department of Cancer Biology, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
|
37
|
Fridley BL, Biernacka JM. Gene set analysis of SNP data: benefits, challenges, and future directions. Eur J Hum Genet 2011; 19:837-43. [PMID: 21487444 DOI: 10.1038/ejhg.2011.57] [Citation(s) in RCA: 108] [Impact Index Per Article: 7.7] [Indexed: 01/30/2023] Open
Abstract
The last decade of human genetic research witnessed the completion of hundreds of genome-wide association studies (GWASs). However, the genetic variants discovered through these efforts account for only a small proportion of the heritability of complex traits. One explanation for the missing heritability is that the common analysis approach, assessing the effect of each single-nucleotide polymorphism (SNP) individually, is not well suited to the detection of small effects of multiple SNPs. Gene set analysis (GSA) is one of several approaches that may contribute to the discovery of additional genetic risk factors for complex traits. Complex phenotypes are thought to be controlled by networks of interacting biochemical and physiological pathways influenced by the products of sets of genes. By assessing the overall evidence of association of a phenotype with all measured variation in a set of genes, GSA may identify functionally relevant sets of genes corresponding to relevant biomolecular pathways, which will enable more focused studies of genetic risk factors. This approach may thus contribute to the discovery of genetic variants responsible for some of the missing heritability. With the increased use of these approaches for the secondary analysis of data from GWAS, it is important to understand the different GSA methods and their strengths and weaknesses, and consider challenges inherent in these types of analyses. This paper provides an overview of GSA, highlighting the key challenges, potential solutions, and directions for ongoing research.
Affiliation(s)
- Brooke L Fridley
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55905, USA.
|
38
|
Wang L, Jia P, Wolfinger RD, Chen X, Grayson BL, Aune TM, Zhao Z. An efficient hierarchical generalized linear mixed model for pathway analysis of genome-wide association studies. Bioinformatics 2011; 27:686-92. [PMID: 21266443 DOI: 10.1093/bioinformatics/btq728] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.6] [Indexed: 11/12/2022]
Abstract
MOTIVATION: In genome-wide association studies (GWAS) of complex diseases, genetic variants having real but weak associations often fail to be detected at the stringent genome-wide significance level. Pathway analysis, which tests disease association with combined association signals from a group of variants in the same pathway, has become increasingly popular. However, because of the complexities in genetic data and the large sample sizes in typical GWAS, pathway analysis remains challenging. We propose a new statistical model for pathway analysis of GWAS. This model includes a fixed effects component that models mean disease association for a group of genes, and a random effects component that models how each gene's association with disease varies about the gene group mean, and thus belongs to the class of mixed effects models. RESULTS: The proposed model is computationally efficient and uses only summary statistics. In addition, it corrects for the presence of overlapping genes and linkage disequilibrium (LD). Via simulated and real GWAS data, we showed that our model improved power over currently available pathway analysis methods while preserving the type I error rate. Furthermore, using the WTCCC Type 1 Diabetes (T1D) dataset, we demonstrated that mixed model analysis identified meaningful biological processes that agreed well with previous reports on T1D. Therefore, the proposed methodology provides an efficient statistical modeling framework for systems analysis of GWAS. AVAILABILITY: The software code for mixed models analysis is freely available at http://biostat.mc.vanderbilt.edu/LilyWang.
Collapse
Affiliation(s)
- Lily Wang
- Department of Biostatistics, Vanderbilt University, Nashville, TN 37232, USA.
|
39
|
Ma S, Dai Y. Principal component analysis based methods in bioinformatics studies. Brief Bioinform 2011; 12:714-22. [PMID: 21242203 DOI: 10.1093/bib/bbq090] [Citation(s) in RCA: 125] [Impact Index Per Article: 8.9] [Indexed: 11/12/2022]
Abstract
In the analysis of bioinformatics data, a unique challenge arises from the high dimensionality of measurements. Without loss of generality, we use a genomic study with gene expression measurements as a representative example, but note that the analysis techniques discussed in this article are also applicable to other types of bioinformatics studies. Principal component analysis (PCA) is a classic dimension reduction approach. It constructs linear combinations of gene expressions, called principal components (PCs). The PCs are orthogonal to each other, can effectively explain variation of gene expressions, and may have a much lower dimensionality. PCA is computationally simple and can be realized using many existing software packages. This article consists of the following parts. First, we review the standard PCA technique and its applications in bioinformatics data analysis. Second, we describe recent 'non-standard' applications of PCA, including accommodating interactions among genes, pathways and network modules and conducting PCA with estimating equations as opposed to gene expressions. Third, we introduce several recently proposed PCA-based techniques, including supervised PCA, sparse PCA and functional PCA. Supervised PCA and sparse PCA have been shown to have better empirical performance than standard PCA. Functional PCA can analyze time-course gene expression data. Last, we raise awareness of several critical but unsolved problems related to PCA. The goal of this article is to make bioinformatics researchers aware of the PCA technique and, more importantly, its most recent developments, so that this simple yet effective dimension reduction technique can be better employed in bioinformatics data analysis.
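As a rough illustration of the standard PCA described in this abstract, the sketch below computes principal components of a samples-by-genes matrix with a singular value decomposition. It is not the authors' code: the function name `pca`, the synthetic expression matrix, and the choice of three components are all assumptions made for the example.

```python
import numpy as np

def pca(X, n_components):
    """Standard PCA via SVD (illustrative sketch).

    X is a (samples x genes) expression matrix. Returns the PC scores,
    the loadings (one row per PC), and the fraction of variance
    explained by each retained PC.
    """
    # Center each gene (column) so PCs describe variation about the mean.
    Xc = X - X.mean(axis=0)
    # Rows of Vt are orthonormal directions in gene space.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components]
    # Each PC score is a linear combination of centered gene expressions.
    scores = Xc @ loadings.T
    explained = S**2 / np.sum(S**2)
    return scores, loadings, explained[:n_components]

# Synthetic data (assumption): 20 samples, 100 genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))
scores, loadings, ratio = pca(X, n_components=3)
```

Here 100 gene measurements per sample are reduced to 3 scores per sample, and the loadings are mutually orthogonal, which matches the properties of PCs listed in the abstract. The supervised, sparse, and functional variants mentioned there are not covered by this sketch.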
Affiliation(s)
- Shuangge Ma
- 60 College ST, LEPH 209, School of Public Health, Yale University, New Haven, CT 06520, USA.
|