1. Hajiaghabozorgi M, Fischbach M, Albrecht M, Wang W, Myers CL. BridGE: a pathway-based analysis tool for detecting genetic interactions from GWAS. Nat Protoc 2024; 19:1400-1435. [PMID: 38514837 DOI: 10.1038/s41596-024-00954-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2022] [Accepted: 11/22/2023] [Indexed: 03/23/2024]
Abstract
Genetic interactions have the potential to modulate phenotypes, including human disease. In principle, genome-wide association studies (GWAS) provide a platform for detecting genetic interactions; however, traditional methods for identifying them, which tend to focus on testing individual variant pairs, lack statistical power. In this protocol, we describe a novel computational approach, called Bridging Gene sets with Epistasis (BridGE), for discovering genetic interactions between biological pathways from GWAS data. We present a Python-based implementation of BridGE along with instructions for its application to a typical human GWAS cohort. The major stages include initial data processing and quality control, construction of a variant-level genetic interaction network, measurement of pathway-level genetic interactions, evaluation of statistical significance using sample permutations and generation of results in a standardized output format. The BridGE software pipeline includes options for running the analysis on multiple cores and multiple nodes for users who have access to computing clusters or a cloud computing environment. In a cluster computing environment with 10 nodes and 100 GB of memory per node, the method can be run in less than 24 h for typical human GWAS cohorts. Using BridGE requires knowledge of running Python programs and basic shell script programming experience.
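The significance stage described above relies on sample permutations. The following is a minimal sketch of that general idea only, not BridGE's implementation: the function names, the toy carrier vector and the rate-difference statistic are all hypothetical, and the real pipeline computes pathway-level statistics on a variant interaction network rather than on a single carrier indicator.

```python
import random

def permutation_p_value(statistic, labels, n_permutations=999, seed=0):
    """Empirical p-value: how often a label-shuffled statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = statistic(labels)
    shuffled = list(labels)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break the genotype-phenotype link
        if statistic(shuffled) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

def carrier_rate_difference(carriers):
    """Build a statistic: carrier rate in cases minus carrier rate in controls."""
    def statistic(labels):
        cases = [c for c, y in zip(carriers, labels) if y == 1]
        controls = [c for c, y in zip(carriers, labels) if y == 0]
        return sum(cases) / len(cases) - sum(controls) / len(controls)
    return statistic

# Hypothetical toy cohort: 10 cases (label 1) followed by 10 controls (label 0).
labels = [1] * 10 + [0] * 10
associated = [1] * 10 + [0] * 10   # every case carries the joint genotype
unrelated = [0] * 20               # nobody carries it

p_associated = permutation_p_value(carrier_rate_difference(associated), labels)
p_unrelated = permutation_p_value(carrier_rate_difference(unrelated), labels)
```

Shuffling the case-control labels while leaving genotypes fixed preserves the genotype correlation structure, which is why permutation is a common choice when the null distribution of a network-derived statistic has no closed form.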
Affiliation(s)
- Mehrad Hajiaghabozorgi
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA
- Mathew Fischbach
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA
  - Graduate Program in Bioinformatics and Computational Biology (BICB), University of Minnesota, Minneapolis, MN, USA
- Michael Albrecht
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA
- Wen Wang
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA
- Chad L Myers
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA
  - Graduate Program in Bioinformatics and Computational Biology (BICB), University of Minnesota, Minneapolis, MN, USA
2. Chakraborty S, Kahali B. Exome-wide analysis reveals role of LRP1 and additional novel loci in cognition. HGG Advances 2023; 4:100208. [PMID: 37305557 PMCID: PMC10248556 DOI: 10.1016/j.xhgg.2023.100208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2022] [Accepted: 05/16/2023] [Indexed: 06/13/2023]
Abstract
Cognitive functioning is heritable, with metabolic risk factors known to accelerate age-associated cognitive decline. Identifying genetic underpinnings of cognition is thus crucial. Here, we undertake single-variant and gene-based association analyses upon 6 neurocognitive phenotypes across 6 cognition domains in whole-exome sequencing data from 157,160 individuals of the UK Biobank cohort to expound the genetic architecture of human cognition. We report 20 independent loci associated with 5 cognitive domains while controlling for APOE isoform-carrier status and metabolic risk factors; 18 of which were not previously reported, and implicated genes relating to oxidative stress, synaptic plasticity and connectivity, and neuroinflammation. A subset of significant hits for cognition indicates mediating effects via metabolic traits. Some of these variants also exhibit pleiotropic effects on metabolic traits. We further identify previously unknown interactions of APOE variants with LRP1 (rs34949484 and others, suggestively significant), AMIGO1 (rs146766120; pAla25Thr, significant), and ITPR3 (rs111522866, significant), controlling for lipid and glycemic risks. Our gene-based analysis also suggests that APOC1 and LRP1 have plausible roles along shared pathways of amyloid beta (Aβ) and lipid and/or glucose metabolism in affecting complex processing speed and visual attention. In addition, we report pairwise suggestive interactions of variants harbored in these genes with APOE affecting visual attention. Our report based on this large-scale exome-wide study highlights the effects of neuronal genes, such as LRP1, AMIGO1, and other genomic loci, thus providing further evidence of the genetic underpinnings for cognition during aging.
Affiliation(s)
- Shreya Chakraborty
  - Centre for Brain Research, Indian Institute of Science, Bangalore, Karnataka 560012, India
  - Interdisciplinary Mathematical Sciences, Indian Institute of Science, Bangalore, Karnataka 560012, India
- Bratati Kahali
  - Centre for Brain Research, Indian Institute of Science, Bangalore, Karnataka 560012, India
3. Discovering genetic interactions bridging pathways in genome-wide association studies. Nat Commun 2019; 10:4274. [PMID: 31537791 PMCID: PMC6753138 DOI: 10.1038/s41467-019-12131-7] [Citation(s) in RCA: 38] [Impact Index Per Article: 7.6] [Received: 03/05/2019] [Accepted: 08/20/2019] [Indexed: 12/20/2022]
Abstract
Genetic interactions have been reported to underlie phenotypes in a variety of systems, but the extent to which they contribute to complex disease in humans remains unclear. In principle, genome-wide association studies (GWAS) provide a platform for detecting genetic interactions, but existing methods for identifying them from GWAS data tend to focus on testing individual locus pairs, which undermines statistical power. Importantly, a global genetic network mapped for a model eukaryotic organism revealed that genetic interactions often connect genes between compensatory functional modules in a highly coherent manner. Taking advantage of this expected structure, we developed a computational approach called BridGE that identifies pathways connected by genetic interactions from GWAS data. Applying BridGE broadly, we discover significant interactions in Parkinson's disease, schizophrenia, hypertension, prostate cancer, breast cancer, and type 2 diabetes. Our novel approach provides a general framework for mapping complex genetic networks underlying human disease from genome-wide genotype data.
4. Kim SA, Cho CS, Kim SR, Bull SB, Yoo YJ. A new haplotype block detection method for dense genome sequencing data based on interval graph modeling of clusters of highly correlated SNPs. Bioinformatics 2018; 34:388-397. [PMID: 29028986 PMCID: PMC5860363 DOI: 10.1093/bioinformatics/btx609] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 12/15/2016] [Revised: 09/11/2017] [Accepted: 09/28/2017] [Indexed: 11/13/2022]
Abstract
Motivation: Linkage disequilibrium (LD) block construction is required for research in population genetics and genetic epidemiology, including specification of sets of single nucleotide polymorphisms (SNPs) for analysis of multi-SNP based association and identification of haplotype blocks in high density sequencing data. Existing methods based on a narrow sense definition do not allow intermediate regions of low LD between strongly associated SNP pairs and tend to split high density SNP data into small blocks having high between-block correlation. Results: We present Big-LD, a block partition method based on interval graph modeling of LD bins, which are clusters of SNPs in strong pairwise LD that are not necessarily physically consecutive. Big-LD uses an agglomerative approach that starts by identifying small communities of SNPs, i.e. the SNPs in each LD bin region, and proceeds by merging these communities. We determine the number of blocks using a method that finds a maximum-weight independent set. Big-LD produces larger LD blocks than existing methods such as MATILDE, Haploview, MIG++, or S-MIG++, and the LD blocks agree better with recombination hotspot locations determined by sperm-typing experiments. The observed average runtime of Big-LD for 13,288,240 non-monomorphic SNPs from 1000 Genomes Project autosome data (286 East Asians) is about 5.83 h, which is a significant improvement over the existing methods. Availability and implementation: Source code and documentation are available for download at http://github.com/sunnyeesl/BigLD. Contact: yyoo@snu.ac.kr. Supplementary information: Supplementary data are available at Bioinformatics online.
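On an interval graph, a maximum-weight independent set can be found exactly with the classic weighted interval scheduling dynamic program. The sketch below illustrates that generic technique under stated assumptions; it is not taken from the Big-LD source code, and the function name, the closed-interval convention and the toy weights are hypothetical.

```python
from bisect import bisect_left

def max_weight_independent_intervals(intervals):
    """Pick non-overlapping closed intervals (start, end, weight) with maximum total weight.

    Assumes start <= end and non-negative weights; two intervals conflict
    if they share any position.
    """
    ordered = sorted(intervals, key=lambda iv: iv[1])
    ends = [iv[1] for iv in ordered]
    # compatible[i]: number of intervals that end strictly before interval i starts
    compatible = [bisect_left(ends, iv[0]) for iv in ordered]

    # best[i]: best total weight achievable using only the first i intervals
    best = [0.0] * (len(ordered) + 1)
    for i, (_, _, weight) in enumerate(ordered, start=1):
        best[i] = max(best[i - 1], weight + best[compatible[i - 1]])

    # Walk back through the table to recover one optimal selection.
    chosen = []
    i = len(ordered)
    while i > 0:
        weight = ordered[i - 1][2]
        if weight + best[compatible[i - 1]] > best[i - 1]:
            chosen.append(ordered[i - 1])
            i = compatible[i - 1]
        else:
            i -= 1
    chosen.reverse()
    return best[-1], chosen

# Hypothetical candidate blocks as (first SNP index, last SNP index, weight).
candidate_blocks = [(1, 4, 3.0), (3, 6, 5.0), (5, 9, 4.0), (7, 10, 2.0)]
total_weight, blocks = max_weight_independent_intervals(candidate_blocks)
```

Sorting by right endpoint lets each interval look up its last compatible predecessor with a binary search, so the whole selection runs in O(n log n) time.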
Affiliation(s)
- Sun Ah Kim
  - Department of Statistics, Seoul National University, Seoul, South Korea
- Chang-Sung Cho
  - Department of Mathematics Education, Seoul National University, Seoul, South Korea
- Suh-Ryung Kim
  - Department of Mathematics Education, Seoul National University, Seoul, South Korea
- Shelley B Bull
  - Prosserman Centre for Health Research, The Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, Canada
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
- Yun Joo Yoo
  - Department of Mathematics Education, Seoul National University, Seoul, South Korea
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea
5. Malhotra J, Malvezzi M, Negri E, La Vecchia C, Boffetta P. Risk factors for lung cancer worldwide. Eur Respir J 2016; 48:889-902. [PMID: 27174888 DOI: 10.1183/13993003.00359-2016] [Citation(s) in RCA: 438] [Impact Index Per Article: 54.8] [Received: 02/17/2016] [Accepted: 04/04/2016] [Indexed: 02/06/2023]
Abstract
Lung cancer is the most frequent malignant neoplasm in most countries, and the main cancer-related cause of mortality worldwide in both sexes combined. The geographic and temporal patterns of lung cancer incidence, as well as lung cancer mortality, on a population level are chiefly determined by tobacco consumption, the main aetiological factor in lung carcinogenesis. Other factors such as genetic susceptibility, poor diet, occupational exposures and air pollution may act independently or in concert with tobacco smoking in shaping the descriptive epidemiology of lung cancer. Moreover, novel approaches in the classification of lung cancer based on molecular techniques have started to bring new insights to its aetiology, in particular among nonsmokers. Despite the success in delineation of tobacco smoking as the major risk factor for lung cancer, this highly preventable disease remains among the most common and most lethal cancers globally. Future preventive efforts and research need to focus on non-cigarette tobacco smoking products, as well as better understanding of risk factors underlying lung carcinogenesis in never-smokers.
Affiliation(s)
- Jyoti Malhotra
  - Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA
  - Rutgers Cancer Institute of New Jersey, Robert Wood Johnson Medical School, New Brunswick, NJ, USA
- Matteo Malvezzi
  - Dept of Clinical Sciences and Community Health, University of Milan, Milan, Italy
  - Dept of Epidemiology, IRCCS - Istituto di Ricerche Farmacologiche Mario Negri, Milan, Italy
- Eva Negri
  - Dept of Epidemiology, IRCCS - Istituto di Ricerche Farmacologiche Mario Negri, Milan, Italy
- Carlo La Vecchia
  - Dept of Clinical Sciences and Community Health, University of Milan, Milan, Italy
- Paolo Boffetta
  - Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA
6. Neykov M, Liu JS, Cai T. L1-Regularized Least Squares for Support Recovery of High Dimensional Single Index Models with Gaussian Designs. Journal of Machine Learning Research 2016; 17:2976-3012. [PMID: 28503101 PMCID: PMC5426818] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
It is known that for a certain class of single index models (SIMs) [Formula: see text], support recovery is impossible when X ~ 𝒩(0, 𝕀 p×p ) and a model complexity adjusted sample size is below a critical threshold. Recently, optimal algorithms based on Sliced Inverse Regression (SIR) were suggested. These algorithms work provably under the assumption that the design X comes from an i.i.d. Gaussian distribution. In the present paper we analyze algorithms based on covariance screening and least squares with L1 penalization (i.e. LASSO) and demonstrate that they can also enjoy optimal (up to a scalar) rescaled sample size in terms of support recovery, albeit under slightly different assumptions on f and ε compared to the SIR based algorithms. Furthermore, we show more generally, that LASSO succeeds in recovering the signed support of β0 if X ~ 𝒩 (0, Σ), and the covariance Σ satisfies the irrepresentable condition. Our work extends existing results on the support recovery of LASSO for the linear model, to a more general class of SIMs.
Affiliation(s)
- Matey Neykov
  - Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA
- Jun S Liu
  - Department of Statistics, Harvard University, Cambridge, MA 02138, USA
- Tianxi Cai
  - Department of Biostatistics, Harvard University, Boston, MA 02115, USA
7. Pathway-Based Genome-Wide Association Studies for Two Meat Production Traits in Simmental Cattle. Sci Rep 2015; 5:18389. [PMID: 26672757 PMCID: PMC4682090 DOI: 10.1038/srep18389] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Received: 01/15/2015] [Accepted: 11/17/2015] [Indexed: 01/15/2023]
Abstract
Most single nucleotide polymorphisms (SNPs) detected by genome-wide association studies (GWAS) explain only a small fraction of phenotypic variation. Pathway-based GWAS were proposed to improve the proportion of genes for some human complex traits that could be explained by enriching a mass of SNPs within genetic groups. However, few attempts have been made to describe the quantitative traits in domestic animals. In this study, we used a dataset with approximately 7,700,000 SNPs from 807 Simmental cattle and analyzed live weight and longissimus muscle area using a modified pathway-based GWAS method that orthogonalises the highly linked SNPs within each gene using principal component analysis (PCA). Of the 262 biological pathways of cattle collected from the KEGG database, the gamma aminobutyric acid (GABA)ergic synapse pathway and the non-alcoholic fatty liver disease (NAFLD) pathway were significantly associated with the two traits analyzed. The GABAergic synapse pathway was biologically applicable to the traits analyzed because of its roles in feed intake and weight gain. The proposed method had high statistical power and a low false discovery rate compared to those of the smallest P-value and SNP set enrichment analysis methods.
8. Komatsu S, Sakata K, Nanjo Y. ‘Omics’ techniques and their use to identify how soybean responds to flooding. J Anal Sci Technol 2015. [DOI: 10.1186/s40543-015-0052-7] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Indexed: 12/21/2022]
9. Malhotra J, Sartori S, Brennan P, Zaridze D, Szeszenia-Dabrowska N, Świątkowska B, Rudnai P, Lissowska J, Fabianova E, Mates D, Bencko V, Gaborieau V, Stücker I, Foretova L, Janout V, Boffetta P. Effect of occupational exposures on lung cancer susceptibility: a study of gene-environment interaction analysis. Cancer Epidemiol Biomarkers Prev 2015; 24:570-9. [PMID: 25583949 DOI: 10.1158/1055-9965.epi-14-1143-t] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 11/16/2022]
Abstract
BACKGROUND: Occupational exposures are known risk factors for lung cancer. The role of genetically determined host factors in occupational exposure-related lung cancer is unclear. METHODS: We used genome-wide association (GWA) data from a case-control study conducted in 6 European countries from 1998 to 2002 to identify gene-occupation interactions and related pathways for lung cancer risk. GWA analysis was performed for each exposure using logistic regression, with an interaction term for genotype and exposure included in the model. Both SNP-based and gene-based interaction P values were calculated. Pathway analysis was performed using three complementary methods, and analyses were adjusted for multiple comparisons. We analyzed 312,605 SNPs and occupational exposure to 70 agents from 1,802 lung cancer cases and 1,725 cancer-free controls. RESULTS: Mean age of study participants was 60.1 ± 9.1 years and 75% were male. The largest number of significant associations (P ≤ 1 × 10⁻⁵) at the SNP level was demonstrated for nickel, brick dust, concrete dust, and cement dust, and for brick dust and cement dust at the gene level (P ≤ 1 × 10⁻⁴). Approximately 14 occupational exposures showed significant gene-occupation interactions with pathways related to response to environmental information processing via signal transduction (P < 0.001 and FDR < 0.05). Other pathways that showed significant enrichment were related to immune processes and xenobiotic metabolism. CONCLUSION: Our findings suggest that pathways related to signal transduction, immune process, and xenobiotic metabolism may be involved in occupational exposure-related lung carcinogenesis. IMPACT: Our study exemplifies an integrative approach using pathway-based analysis to demonstrate the role of genetic variants in occupational exposure-related lung cancer susceptibility.
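A logistic model with a genotype-by-exposure interaction term can be illustrated numerically. The sketch below is a generic textbook form, not the study's fitted model: the coefficient values, the 0/1 genotype coding and the absence of covariates are assumptions made only for illustration.

```python
import math

def case_probability(genotype, exposed, b0, b_g, b_e, b_ge):
    """P(case) under logit = b0 + b_g*G + b_e*E + b_ge*G*E."""
    logit = b0 + b_g * genotype + b_e * exposed + b_ge * genotype * exposed
    return 1.0 / (1.0 + math.exp(-logit))

def odds(probability):
    return probability / (1.0 - probability)

def genotype_odds_ratio(exposed, **coef):
    """Odds ratio for carrying the variant (G=1 vs G=0) at a fixed exposure level."""
    return odds(case_probability(1, exposed, **coef)) / odds(case_probability(0, exposed, **coef))

# Hypothetical coefficients on the log-odds scale.
coef = dict(b0=-1.0, b_g=0.1, b_e=0.4, b_ge=0.5)
or_unexposed = genotype_odds_ratio(0, **coef)   # equals exp(b_g)
or_exposed = genotype_odds_ratio(1, **coef)     # equals exp(b_g + b_ge)
interaction_ratio = or_exposed / or_unexposed   # equals exp(b_ge)
```

The ratio of the two genotype odds ratios depends only on the interaction coefficient, which is why a test of that coefficient against zero is the usual test for gene-environment interaction on the multiplicative scale.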
Affiliation(s)
- Jyoti Malhotra
  - Icahn School of Medicine at Mount Sinai, New York, New York
- Paul Brennan
  - International Agency for Research on Cancer, Lyon, France
- Beata Świątkowska
  - Department of Epidemiology, The Nofer Institute of Occupational Medicine, Lodz, Poland
- Peter Rudnai
  - National Institute of Environmental Health, Budapest, Hungary
- Jolanta Lissowska
  - M. Sklodowska-Curie Memorial Cancer Center and Institute of Oncology, Warsaw, Poland
- Eleonora Fabianova
  - Department of Occupational Health, Specialized State Health Institute, Banska Bystrica, Slovakia
- Dana Mates
  - National Institute of Public Health, Bucharest, Romania
- Vladimir Bencko
  - Institute of Hygiene and Epidemiology, Charles University, First Faculty of Medicine, Prague, Czech Republic
- Isabelle Stücker
  - Centre for Research in Epidemiology and Population Health, INSERM, Villejuif, France
- Lenka Foretova
  - Department of Cancer Epidemiology and Genetics, Masaryk Memorial Cancer Institute and MF MU Brno, Brno, Czech Republic
- Vladimir Janout
  - Department of Preventive Medicine, Faculty of Medicine, Palacky University, Olomouc, Czech Republic
- Paolo Boffetta
  - Icahn School of Medicine at Mount Sinai, New York, New York
10. Zeng P, Zhao Y, Qian C, Zhang L, Zhang R, Gou J, Liu J, Liu L, Chen F. Statistical analysis for genome-wide association study. J Biomed Res 2014; 29:285-97. [PMID: 26243515 PMCID: PMC4547377 DOI: 10.7555/jbr.29.20140007] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.7] [Received: 01/15/2014] [Revised: 06/07/2014] [Accepted: 09/27/2014] [Indexed: 12/19/2022]
Abstract
In the past few years, genome-wide association studies (GWAS) have had great success in identifying genetic susceptibility loci underlying many complex diseases and traits. The findings provide important genetic insights into the pathogenesis of diseases. In this paper, we present an overview of widely used approaches and strategies for the analysis of GWAS and offer general considerations for dealing with GWAS data. Issues regarding data quality control, population structure, association analysis, multiple comparison and visual presentation of GWAS results are discussed; other advanced topics, including missing heritability, meta-analysis, set-based association analysis, copy number variation analysis and GWAS cohort analysis, are also briefly introduced.
Affiliation(s)
- Ping Zeng
  - Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
  - Department of Epidemiology and Biostatistics, School of Public Health, Xuzhou Medical College, Xuzhou, Jiangsu 221004, China
- Yang Zhao
  - Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Cheng Qian
  - Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Liwei Zhang
  - Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Ruyang Zhang
  - Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Jianwei Gou
  - Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Jin Liu
  - Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Liya Liu
  - Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Feng Chen
  - Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
11. Huang A, Martin ER, Vance JM, Cai X. Detecting genetic interactions in pathway-based genome-wide association studies. Genet Epidemiol 2014; 38:300-9. [PMID: 24719383 DOI: 10.1002/gepi.21803] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 10/16/2013] [Revised: 01/06/2014] [Accepted: 02/28/2014] [Indexed: 12/13/2022]
Abstract
Pathway-based genome-wide association studies (GWAS) can exploit collective effects of causal variants in a pathway to increase power of detection. However, current methods for pathway-based GWAS do not consider epistatic effects of genetic variants, although interactions between genetic variants may play an important role in influencing complex traits. In this paper, we employed a Bayesian Lasso logistic regression model for pathway-based GWAS to include all possible main effects and a large number of pairwise interactions of single nucleotide polymorphisms (SNPs) in a pathway, and then inferred the model with an efficient group empirical Bayesian Lasso (EBLasso) method. Using the inferred model, the statistical significance of a pathway was tested with the Wald statistic. Reliable effects in a significant pathway were selected using the stability selection technique. Extensive computer simulations demonstrated that our group EBLasso method significantly outperformed two competing methods in most simulation setups and offered similar performance in the others. When applied to a GWAS dataset for Parkinson disease, EBLasso identified three significant pathways: the primary bile acid biosynthesis pathway, the neuroactive ligand-receptor interaction pathway, and the MAPK signaling pathway. All effects identified in the primary bile acid biosynthesis pathway and many of the effects in the other two pathways were epistatic effects. The group EBLasso method provides a valuable tool for pathway-based GWAS to identify main and epistatic effects of genetic variants.
Affiliation(s)
- Anhui Huang
  - Department of Electrical and Computer Engineering, University of Miami, Coral Gables, Florida, United States of America
12. Silver M, Chen P, Li R, Cheng CY, Wong TY, Tai ES, Teo YY, Montana G. Pathways-driven sparse regression identifies pathways and genes associated with high-density lipoprotein cholesterol in two Asian cohorts. PLoS Genet 2013; 9:e1003939. [PMID: 24278029 PMCID: PMC3836716 DOI: 10.1371/journal.pgen.1003939] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Received: 03/05/2013] [Accepted: 09/11/2013] [Indexed: 01/11/2023]
Abstract
Standard approaches to data analysis in genome-wide association studies (GWAS) ignore any potential functional relationships between gene variants. In contrast gene pathways analysis uses prior information on functional structure within the genome to identify pathways associated with a trait of interest. In a second step, important single nucleotide polymorphisms (SNPs) or genes may be identified within associated pathways. The pathways approach is motivated by the fact that genes do not act alone, but instead have effects that are likely to be mediated through their interaction in gene pathways. Where this is the case, pathways approaches may reveal aspects of a trait's genetic architecture that would otherwise be missed when considering SNPs in isolation. Most pathways methods begin by testing SNPs one at a time, and so fail to capitalise on the potential advantages inherent in a multi-SNP, joint modelling approach. Here, we describe a dual-level, sparse regression model for the simultaneous identification of pathways and genes associated with a quantitative trait. Our method takes account of various factors specific to the joint modelling of pathways with genome-wide data, including widespread correlation between genetic predictors, and the fact that variants may overlap multiple pathways. We use a resampling strategy that exploits finite sample variability to provide robust rankings for pathways and genes. We test our method through simulation, and use it to perform pathways-driven gene selection in a search for pathways and genes associated with variation in serum high-density lipoprotein cholesterol levels in two separate GWAS cohorts of Asian adults. By comparing results from both cohorts we identify a number of candidate pathways including those associated with cardiomyopathy, and T cell receptor and PPAR signalling. 
Highlighted genes include those associated with the L-type calcium channel, adenylate cyclase, integrin, laminin, MAPK signalling and immune function.
Affiliation(s)
- Matt Silver
  - Statistics Section, Department of Mathematics, Imperial College, London, United Kingdom
  - MRC International Nutrition Group, London School of Hygiene and Tropical Medicine, London, United Kingdom
- Peng Chen
  - Saw Swee Hock School of Public Health, National University of Singapore, Singapore
- Ruoying Li
  - Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Ching-Yu Cheng
  - Saw Swee Hock School of Public Health, National University of Singapore, Singapore
  - Department of Ophthalmology, National University of Singapore, Singapore
  - Singapore Eye Research Institute, Singapore National Eye Center, Singapore
- Tien-Yin Wong
  - Department of Ophthalmology, National University of Singapore, Singapore
  - Singapore Eye Research Institute, Singapore National Eye Center, Singapore
- E-Shyong Tai
  - Saw Swee Hock School of Public Health, National University of Singapore, Singapore
  - Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Yik-Ying Teo
  - Saw Swee Hock School of Public Health, National University of Singapore, Singapore
  - NUS Graduate School for Integrative Science and Engineering, National University of Singapore, Singapore
  - Life Sciences Institute, National University of Singapore, Singapore
  - Genome Institute of Singapore, Agency for Science, Technology and Research, Singapore
  - Department of Statistics and Applied Probability, National University of Singapore, Singapore
- Giovanni Montana
  - Statistics Section, Department of Mathematics, Imperial College, London, United Kingdom
13. Combined genotype and haplotype tests for region-based association studies. BMC Genomics 2013; 14:569. [PMID: 23964661 PMCID: PMC3852120 DOI: 10.1186/1471-2164-14-569] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 04/09/2013] [Accepted: 08/13/2013] [Indexed: 12/13/2022]
Abstract
Background: Although single-SNP analysis has proven to be useful in identifying many disease-associated loci, region-based analysis has several advantages. Empirically, it has been shown that region-based genotype and haplotype approaches may possess much higher power than single-SNP statistical tests. Both high quality haplotypes and genotypes may be available for analysis given the development of next generation sequencing technologies and haplotype assembly algorithms. Results: As it is generally unknown whether genotypes or haplotypes are more relevant for identifying an association, we propose to use both of them with the purpose of preserving high power under both genotype and haplotype disease scenarios. We suggest two approaches for a combined association test and investigate the performance of these two approaches based on a theoretical model, population genetics simulations and analysis of a real data set. Conclusions: Based on a theoretical model, population genetics simulations and analysis of a central corneal thickness (CCT) genome-wide association study (GWAS) data set, we have shown that a combined genotype and haplotype approach has high potential utility for applications in association studies.
14. Kang C, Yu H, Yi GS. Finding type 2 diabetes causal single nucleotide polymorphism combinations and functional modules from genome-wide association data. BMC Med Inform Decis Mak 2013; 13 Suppl 1:S3. [PMID: 23566118 PMCID: PMC3618247 DOI: 10.1186/1472-6947-13-s1-s3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 02/07/2023]
Abstract
Background: Due to the low statistical power of individual markers from a genome-wide association study (GWAS), detecting causal single nucleotide polymorphisms (SNPs) for complex diseases is a challenge. SNP combinations are suggested to compensate for the low statistical power of individual markers, but SNP combinations from GWAS generate high computational complexity. Methods: We aim to detect type 2 diabetes (T2D) causal SNP combinations from a GWAS dataset with optimal filtration and to discover the biological meaning of the detected SNP combinations. Optimal filtration can enhance the statistical power of SNP combinations by comparing the error rates of SNP combinations from various Bonferroni thresholds and p-value range-based thresholds combined with linkage disequilibrium (LD) pruning. T2D causal SNP combinations are selected using random forests with variable selection from an optimal SNP dataset. T2D causal SNP combinations and genome-wide SNPs are mapped into functional modules using expanded gene set enrichment analysis (GSEA) considering pathway, transcription factor (TF)-target, miRNA-target, gene ontology, and protein complex functional modules. The prediction error rates are measured for SNP sets from functional module-based filtration, which selects SNPs within functional modules from genome-wide SNPs based on expanded GSEA. Results: A T2D causal SNP combination containing 101 SNPs from the Wellcome Trust Case Control Consortium (WTCCC) GWAS dataset is selected using optimal filtration criteria, with an error rate of 10.25%. Matching the 101 SNPs with known T2D genes and functional modules reveals the relationships between T2D and SNP combinations. The prediction error rates of SNP sets from functional module-based filtration are not significantly different from the prediction error rates of randomly selected SNP sets and T2D causal SNP combinations from optimal filtration. Conclusions: We propose a detection method for complex disease causal SNP combinations from an optimal SNP dataset by using random forests with variable selection. Mapping the biological meanings of detected SNP combinations can help uncover complex disease mechanisms.
Affiliation(s)
- Chiyong Kang
- Department of Bio and Brain Engineering, KAIST, Daejeon 305-701, South Korea
15
Abstract
Life science technologies generate a deluge of data that hold the keys to unlocking the secrets of important biological functions and disease mechanisms. We present DEAP, Differential Expression Analysis for Pathways, which capitalizes on information about biological pathways to identify important regulatory patterns from differential expression data. DEAP makes significant improvements over existing approaches by including information about pathway structure and discovering the most differentially expressed portion of the pathway. On simulated data, DEAP significantly outperformed traditional methods: with high differential expression, DEAP increased power by two orders of magnitude; with very low differential expression, DEAP doubled the power. DEAP performance was illustrated on two different gene and protein expression studies. DEAP discovered fourteen important pathways related to chronic obstructive pulmonary disease and interferon treatment that existing approaches omitted. On the interferon study, DEAP guided focus towards a four-protein path within the 26-protein Notch signalling pathway.
The data deluge represents a growing challenge for life sciences. Within this sea of data surely lie many secrets to understanding important biological and medical systems. To quantify important patterns in these data, we present DEAP (Differential Expression Analysis for Pathways). DEAP amalgamates information about biological pathway structure and differential expression to identify important patterns of regulation. On both simulated and biological data, we show that DEAP is able to identify key mechanisms while making significant improvements over existing methodologies. For example, on the interferon study, DEAP uniquely identified both the interferon gamma signalling pathway and the JAK STAT signalling pathway.
16
Silver M, Janousova E, Hua X, Thompson PM, Montana G. Identification of gene pathways implicated in Alzheimer's disease using longitudinal imaging phenotypes with sparse regression. Neuroimage 2012; 63:1681-94. [PMID: 22982105 PMCID: PMC3549495 DOI: 10.1016/j.neuroimage.2012.08.002] [Citation(s) in RCA: 59] [Impact Index Per Article: 4.9] [Received: 04/10/2012] [Revised: 08/01/2012] [Accepted: 08/03/2012] [Indexed: 02/04/2023]
Abstract
We present a new method for the detection of gene pathways associated with a multivariate quantitative trait, and use it to identify causal pathways associated with an imaging endophenotype characteristic of longitudinal structural change in the brains of patients with Alzheimer's disease (AD). Our method, known as pathways sparse reduced-rank regression (PsRRR), uses group lasso penalised regression to jointly model the effects of genome-wide single nucleotide polymorphisms (SNPs), grouped into functional pathways using prior knowledge of gene-gene interactions. Pathways are ranked in order of importance using a resampling strategy that exploits finite sample variability. Our application study uses whole genome scans and MR images from 99 probable AD patients and 164 healthy elderly controls in the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. 66,182 SNPs are mapped to 185 gene pathways from the KEGG pathway database. Voxel-wise imaging signatures characteristic of AD are obtained by analysing 3D patterns of structural change at 6, 12 and 24 months relative to baseline. High-ranking, AD endophenotype-associated pathways in our study include those describing insulin signalling, vascular smooth muscle contraction and focal adhesion. All of these have been previously implicated in AD biology. In a secondary analysis, we investigate SNPs and genes that may be driving pathway selection. High-ranking genes include a number previously linked in gene expression studies to β-amyloid plaque formation in the AD brain (PIK3R3, PIK3CG, PRKCA and PRKCB), and to AD-related changes in hippocampal gene expression (ADCY2, ACTN1, ACACA, and GNAI1). Other high-ranking, previously validated AD endophenotype-related genes include CR1, TOMM40 and APOE.
Affiliation(s)
- Matt Silver
- Statistics Section, Department of Mathematics, Imperial College London, UK
- Eva Janousova
- Statistics Section, Department of Mathematics, Imperial College London, UK
- Institute of Biostatistics and Analyses, Masaryk University, Brno, Czech Republic
- Xue Hua
- Laboratory of Neuro Imaging, Department of Neurology, UCLA School of Medicine, Los Angeles, CA, USA
- Paul M. Thompson
- Laboratory of Neuro Imaging, Department of Neurology, UCLA School of Medicine, Los Angeles, CA, USA
- Giovanni Montana
- Statistics Section, Department of Mathematics, Imperial College London, UK
- Corresponding author.
17
Pathway analysis of genomic data: concepts, methods, and prospects for future development. Trends Genet 2012; 28:323-32. [PMID: 22480918 DOI: 10.1016/j.tig.2012.03.004] [Citation(s) in RCA: 215] [Impact Index Per Article: 17.9] [Received: 01/04/2012] [Revised: 03/02/2012] [Accepted: 03/07/2012] [Indexed: 12/31/2022]
Abstract
Genome-wide data sets are increasingly being used to identify biological pathways and networks underlying complex diseases. In particular, analyzing genomic data through sets defined by functional pathways offers the potential of greater power for discovery and natural connections to biological mechanisms. With the burgeoning availability of next-generation sequencing, this is an opportune moment to revisit strategies for pathway-based analysis of genomic data. Here, we synthesize relevant concepts and extant methodologies to guide investigators in study design and execution. We also highlight ongoing challenges and proposed solutions. As relevant analytical strategies mature, pathways and networks will be ideally placed to integrate data from diverse -omics sources to harness the extensive, rich information related to disease and treatment mechanisms.
18
Comparison of pathway analysis approaches using lung cancer GWAS data sets. PLoS One 2012; 7:e31816. [PMID: 22363742 PMCID: PMC3283683 DOI: 10.1371/journal.pone.0031816] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.8] [Received: 07/27/2011] [Accepted: 01/13/2012] [Indexed: 11/25/2022]
Abstract
Pathway analysis has been proposed as a complement to single SNP analyses in GWAS. This study compared pathway analysis methods using two lung cancer GWAS data sets based on four studies: one a combined data set from Central Europe and Toronto (CETO); the other a combined data set from Germany and MD Anderson (GRMD). We searched the literature for pathway analysis methods that were widely used, representative of other methods, and had available software for performing analysis. We selected the programs EASE, which uses a modified Fisher's exact calculation to test for pathway associations, GenGen (a version of Gene Set Enrichment Analysis (GSEA)), which uses a Kolmogorov-Smirnov-like running sum statistic as the test statistic, and SLAT, which uses a p-value combination approach. We also included a modified version of the SUMSTAT method (mSUMSTAT), which tests for association by averaging χ2 statistics from genotype association tests. There were nearly 18,000 genes available for analysis, following mapping of more than 300,000 SNPs from each data set. These were mapped to 421 GO level 4 gene sets for pathway analysis. Among the methods designed to be robust to biases related to gene size and pathway SNP correlation (GenGen, mSUMSTAT and SLAT), the mSUMSTAT approach identified the most significant pathways (8 in CETO and 1 in GRMD). This included a highly plausible association for the acetylcholine receptor activity pathway in both CETO (FDR ≤ 0.001) and GRMD (FDR = 0.009), although two strong association signals at a single gene cluster (CHRNA3-CHRNA5-CHRNB4) drive this result, complicating its interpretation. Few other replicated associations were found using any of these methods. Difficulty in replicating associations hindered our comparison, but results suggest mSUMSTAT has advantages over the other approaches, and may be a useful pathway analysis tool to use alongside other methods such as the commonly used GSEA (GenGen) approach.
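As a rough illustration of the "averaging χ2 statistics" idea mentioned in this abstract, the following Python sketch scores a pathway by the mean of per-gene χ2 values. It is not the published mSUMSTAT implementation: the function names, the per-gene input dictionary, and the random gene-set null used for the empirical p-value are all assumptions made for this example, since the abstract does not describe the permutation scheme.

```python
import random


def pathway_mean_chi2(gene_chi2, pathway_genes):
    """Average the per-gene chi-square statistics over the genes in a pathway.

    gene_chi2 is a hypothetical mapping {gene name: chi-square statistic}.
    Genes without a statistic are skipped.
    """
    values = [gene_chi2[g] for g in pathway_genes if g in gene_chi2]
    if not values:
        raise ValueError("no pathway gene has a statistic")
    return sum(values) / len(values)


def empirical_p_value(gene_chi2, pathway_genes, n_perm=1000, seed=0):
    """Empirical p-value against random gene sets of the same size.

    Assumption: a random gene-set null stands in for whatever permutation
    scheme the original method uses.
    """
    rng = random.Random(seed)
    present = sorted({g for g in pathway_genes if g in gene_chi2})
    observed = pathway_mean_chi2(gene_chi2, present)
    all_genes = sorted(gene_chi2)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(all_genes, len(present))
        if pathway_mean_chi2(gene_chi2, sample) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy statistics, invented for illustration only.
    toy = {"A": 10.0, "B": 8.0, "C": 1.0, "D": 0.5, "E": 0.2}
    print(pathway_mean_chi2(toy, ["A", "B"]))
    print(empirical_p_value(toy, ["A", "B"]))
```

A gene-set null of this kind ignores correlation between SNPs and differences in gene size, which the abstract names as sources of bias; a phenotype-label permutation would be the usual way to account for them.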
19
Silver M, Montana G. Fast identification of biological pathways associated with a quantitative trait using group lasso with overlaps. Stat Appl Genet Mol Biol 2012; 11:Article 7. [PMID: 22499682 PMCID: PMC3491888 DOI: 10.2202/1544-6115.1755] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Indexed: 11/18/2022]
Abstract
Where causal SNPs (single nucleotide polymorphisms) tend to accumulate within biological pathways, the incorporation of prior pathways information into a statistical model is expected to increase the power to detect true associations in a genetic association study. Most existing pathways-based methods rely on marginal SNP statistics and do not fully exploit the dependence patterns among SNPs within pathways. We use a sparse regression model, with SNPs grouped into pathways, to identify causal pathways associated with a quantitative trait. Notable features of our "pathways group lasso with adaptive weights" (P-GLAW) algorithm include the incorporation of all pathways in a single regression model, an adaptive pathway weighting procedure that accounts for factors biasing pathway selection, and the use of a bootstrap sampling procedure for the ranking of important pathways. P-GLAW takes account of the presence of overlapping pathways and uses a novel combination of techniques to optimise model estimation, making it fast to run, even on whole genome datasets. In a comparison study with an alternative pathways method based on univariate SNP statistics, our method demonstrates high sensitivity and specificity for the detection of important pathways, showing the greatest relative gains in performance where marginal SNP effect sizes are small.