1
Chen W, Coombes BJ, Larson NB. Recent advances and challenges of rare variant association analysis in the biobank sequencing era. Front Genet 2022; 13:1014947. [PMID: 36276986 PMCID: PMC9582646 DOI: 10.3389/fgene.2022.1014947] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 08/09/2022] [Accepted: 09/22/2022] [Indexed: 12/04/2022]
Abstract
Causal variants for rare genetic diseases are often rare in the general population. Rare variants may also contribute to common complex traits and can have much larger per-allele effect sizes than common variants, although power to detect these associations can be limited. Sequencing costs have steadily declined with technological advancements, making it feasible to adopt whole-exome and whole-genome profiling for large biobank-scale sample sizes. These large amounts of sequencing data provide both opportunities and challenges for rare-variant association analysis. Herein, we review the basic concepts of rare-variant analysis methods, the current state-of-the-art methods for utilizing variant annotations or external controls to improve statistical power, and particular challenges facing rare-variant analysis, such as accounting for population structure and extremely unbalanced case-control designs. We also review recent advances and challenges in rare-variant analysis for familial sequencing data and for more complex phenotypes such as survival data. Finally, we discuss other potential directions for further methodological investigation.
Affiliation(s)
- Wenan Chen
- Center for Applied Bioinformatics, St. Jude Children’s Research Hospital, Memphis, TN, United States
- Brandon J. Coombes
- Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, United States
- Nicholas B. Larson
- Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, United States
- *Correspondence: Wenan Chen; Brandon J. Coombes; Nicholas B. Larson
2
Hwangbo S, Lee S, Lee S, Hwang H, Kim I, Park T. Kernel-based hierarchical structural component models for pathway analysis. Bioinformatics 2022; 38:3078-3086. [PMID: 35460238 DOI: 10.1093/bioinformatics/btac276] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2021] [Revised: 04/08/2022] [Indexed: 11/14/2022]
Abstract
MOTIVATION Pathway analyses have led to more insight into the underlying biological functions related to the phenotype of interest in various types of omics data. Pathway-based statistical approaches have been actively developed, but most of them do not consider correlations among pathways. Because it is well known that there are quite a few biomarkers that overlap between pathways, these approaches may provide misleading results. In addition, most pathway-based approaches tend to assume that biomarkers within a pathway have linear associations with the phenotype of interest, even though the relationships are more complex. RESULTS To model complex effects including nonlinear effects, we propose a new approach, Hierarchical structural CoMponent analysis using Kernel (HisCoM-Kernel). The proposed method models nonlinear associations between biomarkers and phenotype by extending the kernel machine regression and analyzes entire pathways simultaneously by using the biomarker-pathway hierarchical structure. HisCoM-Kernel is a flexible model that can be applied to various omics data. It was successfully applied to three omics datasets generated by different technologies. Our simulation studies showed that HisCoM-Kernel provided higher statistical power than other existing pathway-based methods in all datasets. The application of HisCoM-Kernel to three types of omics dataset showed its superior performance compared to existing methods in identifying more biologically meaningful pathways, including those reported in previous studies. AVAILABILITY AND IMPLEMENTATION Freely available at http://statgen.snu.ac.kr/software/HisCom-Kernel/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Suhyun Hwangbo
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 151-747, Korea; Department of Genomic Medicine, Seoul National University Hospital, Seoul, 03080, Korea
- Sungyoung Lee
- Department of Genomic Medicine, Seoul National University Hospital, Seoul, 03080, Korea
- Seungyeoun Lee
- Department of Mathematics and Statistics, Sejong University, Sejong, 05006, Korea
- Heungsun Hwang
- Department of Psychology, McGill University, Montreal, QC, H3A 1B1, Canada
- Inyoung Kim
- Department of Statistics, Virginia Tech, Blacksburg, Virginia, 24060, U.S.A
- Taesung Park
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 151-747, Korea; Department of Statistics, Seoul National University, Seoul, 151-747, Korea
3
Fore R, Boehme J, Li K, Westra J, Tintle N. Multi-Set Testing Strategies Show Good Behavior When Applied to Very Large Sets of Rare Variants. Front Genet 2020; 11:591606. [PMID: 33240333 PMCID: PMC7680887 DOI: 10.3389/fgene.2020.591606] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/05/2020] [Accepted: 10/05/2020] [Indexed: 12/22/2022]
Abstract
Gene-based tests of association (e.g., variance components and burden tests) are now common practice for analyses attempting to elucidate the contribution of rare genetic variants to common disease. As sequencing datasets continue to grow in size, the number of variants within each set (e.g., gene) being tested is also continuing to grow. Pathway-based methods have been used to allow for the initial aggregation of gene-based statistical evidence and then the subsequent aggregation of evidence across the pathway. This “multi-set” approach (first gene-based test, followed by pathway-based) lacks thorough exploration in regard to evaluating genotype–phenotype associations in the age of large, sequenced datasets. In particular, we ask whether there are statistical and biological characteristics that make the multi-set approach preferable to simply doing all gene-based tests. In this paper, we provide an intuitive framework for evaluating these questions and use simulated data to affirm this intuition. A real data application is provided demonstrating how our insights manifest themselves in practice. Ultimately, we find that when initial subsets are biologically informative (e.g., tending to aggregate causal genetic variants within one or more subsets, often genes), multi-set strategies can improve statistical power, with particular gains in cases where causal variants are aggregated in subsets with fewer variants overall (high proportion of causal variants in the subset). However, we find that there is little advantage when the sets are non-informative (similar proportion of causal variants in the subsets). Our application to real data further demonstrates this intuition. In practice, we recommend wider use of pathway-based methods and further exploration of optimal ways of aggregating variants into subsets based on emerging biological evidence of the genetic architecture of complex disease.
Affiliation(s)
- Ruby Fore
- Department of Biostatistics, Brown University, Providence, RI, United States
- Jaden Boehme
- Department of Mathematics, Oregon State University, Corvallis, OR, United States
- Kevin Li
- Department of Mathematics, School of Arts and Sciences, Columbia University, New York, NY, United States
- Jason Westra
- Department of Mathematics and Statistics, Dordt University, Sioux Center, IA, United States
- Nathan Tintle
- Department of Mathematics and Statistics, Dordt University, Sioux Center, IA, United States
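The two-stage "multi-set" idea in the Fore et al. abstract above (gene-level evidence first, then aggregation across the pathway) can be illustrated with a minimal sketch. The sketch assumes Fisher's method as the combining rule and assumes the gene-level p-values are independent; the abstract does not say which combining rule the authors use, and the gene names and p-values below are hypothetical.

```python
import math

def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    Under the null, -2 * sum(ln p) is chi-square with 2k degrees of freedom.
    For even degrees of freedom the upper tail has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!, so no stats library is needed.
    """
    p_values = list(p_values)
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must be non-empty and lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in p_values)  # this is x / 2
    term, tail = 1.0, 0.0
    for i in range(len(p_values)):
        tail += term                   # adds (x/2)^i / i!
        term *= half_stat / (i + 1)
    return math.exp(-half_stat) * tail

# Hypothetical gene-level p-values from a first-stage gene-based test.
gene_p = {"GENE_A": 0.004, "GENE_B": 0.03, "GENE_C": 0.6}

# Second stage: aggregate the gene-level evidence across the pathway.
pathway_p = fisher_combine(gene_p.values())
print(f"pathway-level p-value: {pathway_p:.4g}")
```

Overlapping genes or variants in linkage disequilibrium would violate the independence assumption, so a real analysis would likely need a combining rule that accounts for correlation.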
4
Dapas M, Dunaif A. The contribution of rare genetic variants to the pathogenesis of polycystic ovary syndrome. Curr Opin Endocr Metab Res 2020; 12:26-32. [PMID: 32440573 DOI: 10.1016/j.coemr.2020.02.011] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 12/15/2022]
Abstract
Polycystic ovary syndrome (PCOS) is a highly heritable disorder, but only a small proportion of the heritability can be accounted for by common genetic risk variants identified to date. It is possible that variants with lower allele frequencies that cannot be detected using genome-wide association study arrays contribute to PCOS. Here, we discuss the challenges inherent to studying rare genetic variants in complex disease and review several recent studies that have used DNA sequencing techniques to investigate whether rare variants play a role in PCOS pathogenesis. We evaluate these findings in the context of the latest literature in PCOS and complex disease genetics.
5
González-Castro TB, Martínez-Magaña JJ, Tovilla-Zárate CA, Juárez-Rojop IE, Sarmiento E, Genis-Mendoza AD, Nicolini H. Gene-level genome-wide association analysis of suicide attempt, a preliminary study in a psychiatric Mexican population. Mol Genet Genomic Med 2019; 7:e983. [PMID: 31578828 PMCID: PMC6900393 DOI: 10.1002/mgg3.983] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 05/07/2019] [Revised: 08/20/2019] [Accepted: 09/03/2019] [Indexed: 12/11/2022]
Abstract
Background: Evidence suggests that liability for suicide behavior is heritable; additionally, suicide has been partly related to other psychiatric disorders. Nevertheless, most of the information reported so far addresses Caucasian and Asian individuals. Hence, our aim was to conduct a gene‐level association study in Mexican psychiatric individuals diagnosed with suicide attempt. Methods: We recruited 192 individuals from two clinical centers in Mexico. All participants were born in Mexico and had Mexican parents and grandparents. Direct genotyping was performed using the commercial platform Infinium PsychArray BeadChip. A p‐value lower than 1e‐05 was considered as gene‐level significant and a p‐value lower than 1e‐04 was considered as gene‐level nominal significant. Results: Our analyses showed that SCARA5 was associated with suicide intent at a gene‐level with statistical significance (p‐value = 1.12e‐6). Other genes were nominally associated with suicide attempt: GHSR (p‐value = 0.0004), RGS10 (p‐value = 5.13e‐5), and STK33 (p‐value = 3.62e‐5). Regarding gene variant analyses, the SNPs with a statistical association (p > .05) were rs561361616, rs1537577, rs11198999 for RGS10, and rs11041981, rs11041993, rs11041994, rs11041995, rs11041997, rs10840083, rs10769918 for STK33. For these genes, previous studies have associated SCARA5 with depression, GHSR with alcohol dependence and depression, and RGS10 with schizophrenia and depression. To date, STK33 has not been associated with any psychiatric disorder. Conclusion: Our outcomes revealed that SCARA5, GHSR, RGS10 and STK33 could be considered as risk biomarkers for suicide attempt behavior in our Mexican psychiatric sample. We recommend performing larger-scale analyses to obtain conclusive results.
Affiliation(s)
- Thelma Beatriz González-Castro
- División Académica Multidisciplinaria de Jalpa de Méndez, Universidad Juárez Autónoma de Tabasco, Mexico City, Mexico; División Académica Multidisciplinaria de Ciencias de la Salud, Universidad Juárez Autónoma de Tabasco, Villahermosa, Mexico
- José Jaime Martínez-Magaña
- División Académica Multidisciplinaria de Ciencias de la Salud, Universidad Juárez Autónoma de Tabasco, Villahermosa, Mexico; Instituto Nacional de Medicina Genómica (INMEGEN), Secretaria de Salud, Mexico City, Mexico
- Isela Esther Juárez-Rojop
- División Académica Multidisciplinaria de Ciencias de la Salud, Universidad Juárez Autónoma de Tabasco, Villahermosa, Mexico
- Emmanuel Sarmiento
- Hospital Psiquiátrico Infantil "Dr. Juan N. Navarro", Mexico City, Mexico
- Alma Delia Genis-Mendoza
- Instituto Nacional de Medicina Genómica (INMEGEN), Secretaria de Salud, Mexico City, Mexico; Hospital Psiquiátrico Infantil "Dr. Juan N. Navarro", Mexico City, Mexico
- Humberto Nicolini
- Instituto Nacional de Medicina Genómica (INMEGEN), Secretaria de Salud, Mexico City, Mexico
6
Abstract
While genome-wide association studies have been very successful in identifying associations of common genetic variants with many different traits, the rarer frequency spectrum of the genome has not yet been comprehensively explored. Technological developments increasingly lift restrictions to access rare genetic variation. Dense reference panels enable improved genotype imputation for rarer variants in studies using DNA microarrays. Moreover, the decreasing cost of next generation sequencing makes whole exome and genome sequencing increasingly affordable for large samples. Large-scale efforts based on sequencing, such as ExAC, 100,000 Genomes, and TopMed, are likely to significantly advance this field. The main challenge in evaluating complex trait associations of rare variants is statistical power. The choice of population should be considered carefully because allele frequencies and linkage disequilibrium structure differ between populations. Genetically isolated populations can have favorable genomic characteristics for the study of rare variants. One strategy to increase power is to assess the combined effect of multiple rare variants within a region, known as aggregate testing. A range of methods have been developed for this. Model performance depends on the genetic architecture of the region of interest.
Affiliation(s)
- Karoline Kuchenbaecker
- Wellcome Trust Sanger Institute, Cambridge, UK; University College London, London, UK
- Emil Vincent Rosenbaum Appel
- Novo Nordisk Foundation Center for Basic Metabolic Research, Section for Metabolic Genetics, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
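Aggregate testing, as described in the abstract by Kuchenbaecker and Appel above, can be sketched in its simplest form as a burden test: the rare alleles in a region are collapsed into one score per individual, and that score is compared between cases and controls. The following is a minimal illustration only, assuming an unweighted burden score and a permutation p-value; the function names and simulated data are hypothetical and do not come from the cited work.

```python
import numpy as np

def burden_scores(genotypes, weights=None):
    """Collapse an (n_samples, n_variants) minor-allele-count matrix
    into one weighted burden score per sample."""
    genotypes = np.asarray(genotypes, dtype=float)
    if weights is None:
        weights = np.ones(genotypes.shape[1])
    return genotypes @ np.asarray(weights, dtype=float)

def burden_permutation_test(genotypes, is_case, n_perm=2000, seed=0):
    """Permutation p-value for the absolute difference in mean burden
    between cases and controls."""
    rng = np.random.default_rng(seed)
    scores = burden_scores(genotypes)
    is_case = np.asarray(is_case, dtype=bool)
    observed = abs(scores[is_case].mean() - scores[~is_case].mean())
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(is_case)  # shuffle case/control labels
        stat = abs(scores[perm].mean() - scores[~perm].mean())
        if stat >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0

# Simulated example: 10 rare variants, cases enriched for rare alleles.
rng = np.random.default_rng(42)
cases = rng.binomial(2, 0.03, size=(100, 10))
controls = rng.binomial(2, 0.01, size=(100, 10))
G = np.vstack([cases, controls])
labels = np.array([True] * 100 + [False] * 100)
print("burden test p-value:", burden_permutation_test(G, labels))
```

A burden score assumes that the aggregated variants act in the same direction; when effects are mixed, variance-component tests tend to be preferred, which is one reason performance depends on the genetic architecture of the region.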
7
Cirillo E, Parnell LD, Evelo CT. A Review of Pathway-Based Analysis Tools That Visualize Genetic Variants. Front Genet 2017; 8:174. [PMID: 29163640 PMCID: PMC5681904 DOI: 10.3389/fgene.2017.00174] [Citation(s) in RCA: 42] [Impact Index Per Article: 6.0] [Received: 08/02/2017] [Accepted: 10/24/2017] [Indexed: 01/04/2023]
Abstract
Pathway analysis is a powerful method for data analysis in genomics, most often applied to gene expression analysis. It is also promising for single-nucleotide polymorphism (SNP) data analysis, such as genome-wide association study data, because it allows the interpretation of variants with respect to the biological processes in which the affected genes and proteins are involved. Such analyses support an interactive evaluation of the possible effects of variations on function, regulation or interaction of gene products. Current pathway analysis software often does not support data visualization of variants in pathways as an alternate method to interpret genetic association results, and specific statistical methods for pathway analysis of SNP data are not combined with these visualization features. In this review, we first describe the visualization options of the tools that were identified by a literature review, in order to provide insight for improvements in this developing field. Tool evaluation was performed using a computational epistatic dataset of gene–gene interactions for obesity risk. Next, we report the necessity to include in these tools statistical methods designed for the pathway-based analysis with SNP data, expressly aiming to define features for more comprehensive pathway-based analysis tools. We conclude by recognizing that pathway analysis of genetic variations data requires a sophisticated combination of the most useful and informative visual aspects of the various tools evaluated.
Affiliation(s)
- Elisa Cirillo
- Department of Bioinformatics - BiGCaT, Maastricht University, Maastricht, Netherlands
- Laurence D Parnell
- Jean Mayer-USDA Human Nutrition Research Center on Aging at Tufts University, Agricultural Research Service, USDA, Boston, MA, United States
- Chris T Evelo
- Department of Bioinformatics - BiGCaT, Maastricht University, Maastricht, Netherlands
8
Larson NB, McDonnell S, Cannon Albright L, Teerlink C, Stanford J, Ostrander EA, Isaacs WB, Xu J, Cooney KA, Lange E, Schleutker J, Carpten JD, Powell I, Bailey-Wilson JE, Cussenot O, Cancel-Tassin G, Giles GG, MacInnis RJ, Maier C, Whittemore AS, Hsieh CL, Wiklund F, Catalona WJ, Foulkes W, Mandal D, Eeles R, Kote-Jarai Z, Ackerman MJ, Olson TM, Klein CJ, Thibodeau SN, Schaid DJ. gsSKAT: Rapid gene set analysis and multiple testing correction for rare-variant association studies using weighted linear kernels. Genet Epidemiol 2017; 41:297-308. [PMID: 28211093 DOI: 10.1002/gepi.22036] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 09/16/2016] [Revised: 11/16/2016] [Accepted: 12/09/2016] [Indexed: 01/28/2023]
Abstract
Next-generation sequencing technologies have afforded unprecedented characterization of low-frequency and rare genetic variation. Due to low power for single-variant testing, aggregative methods are commonly used to combine observed rare variation within a single gene. Causal variation may also aggregate across multiple genes within relevant biomolecular pathways. Kernel-machine regression and adaptive testing methods for aggregative rare-variant association testing have been demonstrated to be powerful approaches for pathway-level analysis, although these methods tend to be computationally intensive at high-variant dimensionality and require access to complete data. An additional analytical issue in scans of large pathway definition sets is multiple testing correction. Gene set definitions may exhibit substantial genic overlap, and the impact of the resultant correlation in test statistics on Type I error rate control for large agnostic gene set scans has not been fully explored. Herein, we first outline a statistical strategy for aggregative rare-variant analysis using component gene-level linear kernel score test summary statistics as well as derive simple estimators of the effective number of tests for family-wise error rate control. We then conduct extensive simulation studies to characterize the behavior of our approach relative to direct application of kernel and adaptive methods under a variety of conditions. We also apply our method to two case-control studies, respectively, evaluating rare variation in hereditary prostate cancer and schizophrenia. Finally, we provide open-source R code for public use to facilitate easy application of our methods to existing rare-variant analysis results.
Affiliation(s)
- Nicholas B Larson
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America
- Shannon McDonnell
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America
- Lisa Cannon Albright
- Department of Internal Medicine, University of Utah School of Medicine, Salt Lake City, Utah, United States of America
- Craig Teerlink
- Department of Internal Medicine, University of Utah School of Medicine, Salt Lake City, Utah, United States of America
- Janet Stanford
- Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Elaine A Ostrander
- National Human Genome Research Institute, Bethesda, Maryland, United States of America
- William B Isaacs
- Brady Urological Institute, School of Medicine, Johns Hopkins University, Baltimore, Maryland, United States of America
- Jianfeng Xu
- NorthShore University HealthSystem Research Institute, Chicago, Illinois, United States of America
- Kathleen A Cooney
- Department of Internal Medicine, University of Utah School of Medicine, Salt Lake City, Utah, United States of America; Department of Internal Medicine, University of Michigan Medical School, Ann Arbor, Michigan, United States of America; Department of Urology, University of Michigan Medical School, Ann Arbor, Michigan, United States of America
- Ethan Lange
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, United States of America
- Johanna Schleutker
- Department of Medical Biochemistry and Genetics, Institute of Biomedicine, University of Turku, Turku, Finland
- John D Carpten
- Department of Translational Genomics, University of Southern California, Los Angeles, California, United States of America
- Isaac Powell
- Department of Urology, Wayne State University, Detroit, Michigan, United States of America
- Joan E Bailey-Wilson
- Statistical Genetics Section, National Human Genome Research Institute, Bethesda, Maryland, United States of America
- Graham G Giles
- Cancer Epidemiology Centre, Cancer Council Victoria, Melbourne, Australia; Centre for Epidemiology and Biostatistics, School of Population and Global Health, University of Melbourne, Melbourne, Australia
- Robert J MacInnis
- Cancer Epidemiology Centre, Cancer Council Victoria, Melbourne, Australia; Centre for Epidemiology and Biostatistics, School of Population and Global Health, University of Melbourne, Melbourne, Australia
- Alice S Whittemore
- Department of Health Research and Policy, Stanford University, Stanford, California, United States of America
- Chih-Lin Hsieh
- Department of Urology, University of Southern California, Los Angeles, California, United States of America
- Fredrik Wiklund
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
- William J Catalona
- Department of Urology, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, United States of America
- William Foulkes
- Department of Oncology, Montreal General Hospital, Montreal, Quebec, Canada; Department of Human Genetics, Montreal General Hospital, Montreal, Quebec, Canada
- Diptasri Mandal
- Department of Genetics, LSU Health Sciences Center, New Orleans, Louisiana, United States of America
- Zsofia Kote-Jarai
- The Institute of Cancer Research, London, UK; The Institute of Cancer Research and Royal Marsden NHS Foundation Trust, London
- Michael J Ackerman
- Department of Pediatric and Adolescent Medicine, Mayo Clinic, Rochester, Minnesota, United States of America
- Timothy M Olson
- Department of Pediatric and Adolescent Medicine, Mayo Clinic, Rochester, Minnesota, United States of America
- Christopher J Klein
- Department of Neurology, Mayo Clinic, Rochester, Minnesota, United States of America
- Stephen N Thibodeau
- Department of Laboratory Medicine/Pathology, Mayo Clinic, Rochester, Minnesota, United States of America
- Daniel J Schaid
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America
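The Larson et al. abstract above builds gene-set tests from gene-level linear kernel score statistics. One property that makes this feasible is that a weighted linear kernel statistic is a sum of squared, weighted per-variant scores, so the statistic for a set of non-overlapping genes equals the sum of the gene-level statistics. The sketch below only demonstrates that additivity on simulated data. It is not the gsSKAT implementation, it omits the p-value calculation and the effective-number-of-tests estimators, and all names in it are hypothetical.

```python
import numpy as np

def linear_kernel_q(genotypes, residuals, weights=None):
    """Weighted linear kernel score statistic Q = sum_j (w_j * S_j)^2,
    where S_j = sum_i G_ij * r_i is the score for variant j."""
    G = np.asarray(genotypes, dtype=float)
    r = np.asarray(residuals, dtype=float)
    w = np.ones(G.shape[1]) if weights is None else np.asarray(weights, dtype=float)
    scores = G.T @ r  # one score statistic per variant
    return float(np.sum((w * scores) ** 2))

# Simulated null data: a quantitative trait and two non-overlapping genes.
rng = np.random.default_rng(1)
n = 200
y = rng.normal(size=n)
residuals = y - y.mean()  # residuals from an intercept-only null model
gene_a = rng.binomial(2, 0.01, size=(n, 8))
gene_b = rng.binomial(2, 0.02, size=(n, 5))

q_a = linear_kernel_q(gene_a, residuals)
q_b = linear_kernel_q(gene_b, residuals)

# Gene-set statistic computed directly from the pooled variants.
q_set = linear_kernel_q(np.hstack([gene_a, gene_b]), residuals)
print(np.isclose(q_set, q_a + q_b))
```

Because only the gene-level summaries are needed to form the set-level statistic, this kind of decomposition avoids going back to individual-level genotypes for every gene set, which matches the motivation stated in the abstract.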
9
Taylor BD, Zheng X, Darville T, Zhong W, Konganti K, Abiodun-Ojo O, Ness RB, O'Connell CM, Haggerty CL. Whole-Exome Sequencing to Identify Novel Biological Pathways Associated With Infertility After Pelvic Inflammatory Disease. Sex Transm Dis 2017; 44:35-41. [PMID: 27898568 PMCID: PMC5145761 DOI: 10.1097/olq.0000000000000533] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 12/30/2022]
Abstract
BACKGROUND Ideal management of sexually transmitted infections (STI) may require risk markers for pathology or vaccine development. Previously, we identified common genetic variants associated with chlamydial pelvic inflammatory disease (PID) and reduced fecundity. As this explains only a proportion of the long-term morbidity risk, we used whole-exome sequencing to identify biological pathways that may be associated with STI-related infertility. METHODS We obtained stored DNA from 43 non-Hispanic black women with PID from the PID Evaluation and Clinical Health Study. Infertility was assessed at a mean of 84 months. Principal component analysis revealed no population stratification. Potential covariates did not significantly differ between groups. Sequencing kernel association test was used to examine associations between aggregates of variants on a single gene and infertility. The results from the sequencing kernel association test were used to choose "focus genes" (P < 0.01; n = 150) for subsequent Ingenuity Pathway Analysis to identify "gene sets" that are enriched in biologically relevant pathways. RESULTS Pathway analysis revealed that focus genes were enriched in canonical pathways including, IL-1 signaling, P2Y purinergic receptor signaling, and bone morphogenic protein signaling. CONCLUSIONS Focus genes were enriched in pathways that impact innate and adaptive immunity, protein kinase A activity, cellular growth, and DNA repair. These may alter host resistance or immunopathology after infection. Targeted sequencing of biological pathways identified in this study may provide insight into STI-related infertility.
Affiliation(s)
- Brandie D Taylor
- From the *Department of Epidemiology and Biostatistics, Texas A&M University, College Station, TX; †Department of Pediatrics, ‡Department of Biostatistics, University of North Carolina Chapel Hill, Chapel Hill, NC; §Institute for Genome Sciences and Society, Texas A&M University, College Station, TX; ¶University of Texas School of Public Health, Houston, TX; and ∥Department of Epidemiology, University of Pittsburgh, Pittsburgh, PA
10
Marini S, Limongelli I, Rizzo E, Malovini A, Errichiello E, Vetro A, Da T, Zuffardi O, Bellazzi R. A Data Fusion Approach to Enhance Association Study in Epilepsy. PLoS One 2016; 11:e0164940. [PMID: 27984588 PMCID: PMC5161322 DOI: 10.1371/journal.pone.0164940] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 05/21/2016] [Accepted: 10/04/2016] [Indexed: 11/25/2022]
Abstract
Among the scientific challenges posed by complex diseases with a strong genetic component, two stand out. One is unveiling the role of rare and common genetic variants; the other is the design of classification models to improve clinical diagnosis and predictive models for prognosis and personalized therapies. In this paper, we present a data fusion framework merging gene, domain, pathway and protein-protein interaction data related to a next generation sequencing epilepsy gene panel. Our method allows integrating association information from multiple genomic sources and aims at highlighting the set of common and rare variants that are capable of triggering the occurrence of a complex disease. When compared to other approaches, our method shows better performance in classifying patients affected by epilepsy.
Affiliation(s)
- Simone Marini
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto, Japan
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy
- Ivan Limongelli
- Genomic Core Center, IRCCS Fondazione San Matteo, Pavia, Italy
- enGenome S.r.l., Via Ferrata 5, Pavia, Italy
- Centre for Health Technologies, University of Pavia, Pavia, Italy
- Ettore Rizzo
- enGenome S.r.l., Via Ferrata 5, Pavia, Italy
- Centre for Health Technologies, University of Pavia, Pavia, Italy
- Annalisa Vetro
- Genomic Core Center, IRCCS Fondazione San Matteo, Pavia, Italy
- Department of Molecular Medicine, University of Pavia, Pavia, Italy
- Tan Da
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy
- Orsetta Zuffardi
- Genomic Core Center, IRCCS Fondazione San Matteo, Pavia, Italy
- Department of Molecular Medicine, University of Pavia, Pavia, Italy
- Riccardo Bellazzi
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy
- Centre for Health Technologies, University of Pavia, Pavia, Italy
- IRCCS Fondazione S. Maugeri, Pavia, Italy
11
Kao PYP, Leung KH, Chan LWC, Yip SP, Yap MKH. Pathway analysis of complex diseases for GWAS, extending to consider rare variants, multi-omics and interactions. Biochim Biophys Acta Gen Subj 2016; 1861:335-353. [PMID: 27888147 DOI: 10.1016/j.bbagen.2016.11.030] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Received: 06/22/2016] [Revised: 10/17/2016] [Accepted: 11/19/2016] [Indexed: 12/20/2022]
Abstract
BACKGROUND Genome-wide association studies (GWAS) are a major method for studying the genetics of complex diseases. Finding all sequence variants to explain fully the aetiology of a disease is difficult because of their small effect sizes. To better explain disease mechanisms, pathway analysis is used to consolidate the effects of multiple variants, and hence increase the power of the study. While pathway analysis has previously been performed within GWAS only, it can now be extended to examining rare variants, other "-omics" and interaction data. SCOPE OF REVIEW 1. Factors to consider in the choice of software for GWAS pathway analysis. 2. Examples of how pathway analysis is used to analyse rare variants, other "-omics" and interaction data. MAJOR CONCLUSIONS To choose appropriate software tools, factors for consideration include covariate compatibility, null hypothesis, whether one- or two-step analysis is required, curation method of gene sets, size of pathways, and size of flanking regions to define gene boundaries. For rare variants, analysis performance depends on consistency between the assumed and actual effect distributions of variants. Integration of other "-omics" data and interaction data can better explain gene functions. GENERAL SIGNIFICANCE Pathway analysis methods will be more readily used for integration of multiple sources of data, and enable more accurate prediction of phenotypes.
Collapse
Affiliation(s)
- Patrick Y P Kao
- Centre for Myopia Research, School of Optometry, The Hong Kong Polytechnic University, Hong Kong SAR, China
| | - Kim Hung Leung
- Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hong Kong SAR, China
| | - Lawrence W C Chan
- Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hong Kong SAR, China
| | - Shea Ping Yip
- Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hong Kong SAR, China.
| | - Maurice K H Yap
- Centre for Myopia Research, School of Optometry, The Hong Kong Polytechnic University, Hong Kong SAR, China
12
Lee S, Choi S, Kim YJ, Kim BJ, Hwang H, Park T. Pathway-based approach using hierarchical components of collapsed rare variants. Bioinformatics 2016; 32:i586-i594. [PMID: 27587678 PMCID: PMC5013912 DOI: 10.1093/bioinformatics/btw425] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Indexed: 02/06/2023]
Abstract
MOTIVATION To address the 'missing heritability' issue, many statistical methods for pathway-based analyses using rare variants have been proposed to analyze pathways individually. However, neglecting correlations between multiple pathways can result in misleading solutions, and pathway-based analyses of large-scale genetic datasets impose a massive computational burden. We propose a Pathway-based approach using HierArchical components of collapsed RAre variants Of High-throughput sequencing data (PHARAOH) for the analysis of rare variants. The method constructs a single hierarchical model that consists of collapsed gene-level summaries and pathways, and analyzes entire pathways simultaneously by imposing ridge-type penalties on both gene and pathway coefficient estimates; hence our method considers the correlation of pathways without being constrained by a multiple testing problem. RESULTS Through simulation studies, the proposed method was shown to have higher statistical power than the existing pathway-based methods. In addition, our method was applied to large-scale whole-exome sequencing data with levels of a liver enzyme, using two well-known pathway databases, Biocarta and KEGG. This application demonstrated that our method not only identified associated pathways but also successfully detected biologically plausible pathways for a phenotype of interest. These findings were successfully replicated by an independent large-scale exome chip study. AVAILABILITY AND IMPLEMENTATION An implementation of PHARAOH is available at http://statgen.snu.ac.kr/software/pharaoh/ CONTACT tspark@stats.snu.ac.kr SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sungyoung Lee
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 151-747, Korea
- Sungkyoung Choi
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 151-747, Korea
- Young Jin Kim
  - Center for Genome Science, National Institute of Health, Osong Health Technology Administration Complex, Chungcheongbuk-Do 363-951, Korea
- Bong-Jo Kim
  - Center for Genome Science, National Institute of Health, Osong Health Technology Administration Complex, Chungcheongbuk-Do 363-951, Korea
- Heungsun Hwang
  - Department of Psychology, McGill University, Montreal, QC H3A 1B1, Canada
- Taesung Park
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 151-747, Korea
  - Department of Statistics, Seoul National University, Seoul 151-747, Korea
13
Larson NB, McDonnell S, Albright LC, Teerlink C, Stanford J, Ostrander EA, Isaacs WB, Xu J, Cooney KA, Lange E, Schleutker J, Carpten JD, Powell I, Bailey-Wilson J, Cussenot O, Cancel-Tassin G, Giles G, MacInnis R, Maier C, Whittemore AS, Hsieh CL, Wiklund F, Catolona WJ, Foulkes W, Mandal D, Eeles R, Kote-Jarai Z, Ackerman MJ, Olson TM, Klein CJ, Thibodeau SN, Schaid DJ. Post hoc Analysis for Detecting Individual Rare Variant Risk Associations Using Probit Regression Bayesian Variable Selection Methods in Case-Control Sequencing Studies. Genet Epidemiol 2016; 40:461-9. [PMID: 27312771 PMCID: PMC5063501 DOI: 10.1002/gepi.21983] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 12/16/2015] [Revised: 04/22/2016] [Accepted: 04/27/2016] [Indexed: 12/27/2022]
Abstract
Rare variants (RVs) have been shown to be significant contributors to complex disease risk. By definition, these variants have very low minor allele frequencies and traditional single-marker methods for statistical analysis are underpowered for typical sequencing study sample sizes. Multimarker burden-type approaches attempt to identify aggregation of RVs across case-control status by analyzing relatively small partitions of the genome, such as genes. However, it is generally the case that the aggregative measure would be a mixture of causal and neutral variants, and these omnibus tests do not directly provide any indication of which RVs may be driving a given association. Recently, Bayesian variable selection approaches have been proposed to identify RV associations from a large set of RVs under consideration. Although these approaches have been shown to be powerful at detecting associations at the RV level, there are often computational limitations on the total quantity of RVs under consideration and compromises are necessary for large-scale application. Here, we propose a computationally efficient alternative formulation of this method using a probit regression approach specifically capable of simultaneously analyzing hundreds to thousands of RVs. We evaluate our approach to detect causal variation on simulated data and examine sensitivity and specificity in instances of high RV dimensionality as well as apply it to pathway-level RV analysis results from a prostate cancer (PC) risk case-control sequencing study. Finally, we discuss potential extensions and future directions of this work.
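The burden-type aggregation mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' probit variable-selection method; it only shows the collapsing step such tests start from. The genotype coding (minor-allele counts 0/1/2), the function names, and the default frequency threshold are assumptions made for illustration.

```python
from typing import List, Sequence


def minor_allele_frequency(column: Sequence[int]) -> float:
    """MAF of one variant, assuming entries are minor-allele counts (0, 1 or 2)."""
    return sum(column) / (2 * len(column))


def burden_scores(genotypes: Sequence[Sequence[int]],
                  maf_threshold: float = 0.01) -> List[int]:
    """Collapse rare variants in one region (e.g. a gene) into a per-person count.

    genotypes[i][j] is the minor-allele count of individual i at variant j.
    Variants with sample MAF above the threshold are ignored (hypothetical rule).
    """
    if not genotypes:
        return []
    n_variants = len(genotypes[0])
    rare = [
        j for j in range(n_variants)
        if minor_allele_frequency([row[j] for row in genotypes]) <= maf_threshold
    ]
    # The burden score is the total number of rare minor alleles carried.
    return [sum(row[j] for j in rare) for row in genotypes]


# Toy data: 4 individuals x 3 variants; the middle variant is common.
toy = [[1, 1, 0], [0, 1, 0], [0, 2, 1], [0, 0, 0]]
scores = burden_scores(toy, maf_threshold=0.25)
```

Because the score sums causal and neutral variants alike, it cannot indicate which variant drives a signal, which is the limitation the abstract's post hoc approach is meant to address.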
Affiliation(s)
- Nicholas B. Larson
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN
- Shannon McDonnell
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN
- Lisa Cannon Albright
  - Dept. Internal Medicine, University of Utah School of Medicine, Salt Lake City, UT
- Craig Teerlink
  - Dept. Internal Medicine, University of Utah School of Medicine, Salt Lake City, UT
- Jianfeng Xu
  - NorthShore University Health System Research Institute, Chicago, IL
- Kathleen A. Cooney
  - Depts. of Internal Medicine and Urology, University of Michigan Medical School, Ann Arbor, MI
- Ethan Lange
  - Dept. of Genetics, University of North Carolina, Chapel Hill, NC
- Johanna Schleutker
  - Dept. of Medical Biochemistry and Genetics, Institute of Biomedicine, University of Turku, Finland
- John D. Carpten
  - Integrated Cancer Genomics Division, The Translational Genomics Research Institute, Phoenix, AZ
- Joan Bailey-Wilson
  - Statistical Genetics Section, National Human Genome Research Institute, Bethesda, MD
- Graham Giles
  - Cancer Epidemiology Centre, Cancer Council Victoria, and Centre for Epidemiology and Biostatistics, School of Population and Global Health, University of Melbourne, Melbourne, Australia
- Robert MacInnis
  - Cancer Epidemiology Centre, Cancer Council Victoria, and Centre for Epidemiology and Biostatistics, School of Population and Global Health, University of Melbourne, Melbourne, Australia
- Chih-Lin Hsieh
  - Dept. of Urology, University of Southern California, Los Angeles, CA
- Fredrik Wiklund
  - Dept. of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
- William Foulkes
  - Depts. of Oncology and Human Genetics, Montreal General Hospital, Montreal QC, Canada
- Diptasri Mandal
  - Dept. of Genetics, LSU Health Sciences Center, New Orleans, LA
- Rosalind Eeles
  - Genetics and Epidemiology, Institute of Cancer Research, Sutton Surrey, UK
- Zsofia Kote-Jarai
  - Genetics and Epidemiology, Institute of Cancer Research, Sutton Surrey, UK
- Timothy M. Olson
  - Dept. of Pediatric and Adolescent Medicine, Mayo Clinic, Rochester, MN
- Daniel J. Schaid
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN
14
Sardell RJ, Bailey JNC, Courtenay MD, Whitehead P, Laux RA, Adams LD, Fortun JA, Brantley MA, Kovach JL, Schwartz SG, Agarwal A, Scott WK, Haines JL, Pericak-Vance MA. Whole exome sequencing of extreme age-related macular degeneration phenotypes. Mol Vis 2016; 22:1062-76. [PMID: 27625572 PMCID: PMC5007100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2016] [Accepted: 08/27/2016] [Indexed: 11/29/2022]
Abstract
PURPOSE Demographic, environmental, and genetic risk factors for age-related macular degeneration (AMD) have been identified; however, a substantial portion of the variance in AMD disease risk and heritability remains unexplained. To identify AMD risk variants and generate hypotheses for future studies, we performed whole exome sequencing for 75 individuals whose phenotype was not well predicted by their genotype at known risk loci. We hypothesized that these phenotypically extreme individuals were more likely to carry rare risk or protective variants with large effect sizes. METHODS A genetic risk score was calculated in a case-control set of 864 individuals (467 AMD cases, 397 controls) based on 19 common (≥1% minor allele frequency, MAF) single nucleotide variants previously associated with the risk of advanced AMD in a large meta-analysis of advanced cases and controls. We then selected for sequencing 39 cases with bilateral choroidal neovascularization with the lowest genetic risk scores to detect risk variants and 36 unaffected controls with the highest genetic risk score to detect protective variants. After minimizing the influence of 19 common genetic risk loci on case-control status, we targeted single variants of large effect and the aggregate effect of weaker variants within genes and pathways. Single variant tests were conducted on all variants, while gene-based and pathway analyses were conducted on three subsets of data: 1) rare (≤1% MAF in the European population) stop, splice, or damaging missense variants, 2) all rare variants, and 3) all variants. All analyses controlled for the effects of age and sex. RESULTS No variant, gene, or pathway outside regions known to be associated with risk for advanced AMD reached genome-wide significance. However, we identified several variants with substantial differences in allele frequency between cases and controls with strong additive effects on affection status after controlling for age and sex. 
Protective effects trending toward significance were detected at two loci identified in single-variant analyses: an intronic variant in FBLN7 (the gene encoding fibulin 7) and at three variants near pyridoxal (pyridoxine, vitamin B6) kinase (PDXK). Aggregate rare-variant analyses suggested evidence for association at ASRGL1, a gene previously linked to photoreceptor cell death, and at BSDC1. In known AMD loci we also identified 29 novel or rare damaging missense or stop/splice variants in our sample of cases and controls. CONCLUSIONS Identified variants and genes may highlight regions important in the pathogenesis of AMD and are key targets for replication.
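The sample-selection design described in this abstract (score individuals on known common risk variants, then sequence the cases with the lowest scores and the controls with the highest) can be sketched as follows. The abstract does not state how the genetic risk score was weighted; the log-odds-ratio weighting, the function names, and the toy values below are assumptions made for illustration only.

```python
from typing import Dict, List, Sequence, Tuple


def genetic_risk_score(dosages: Sequence[int], log_odds: Sequence[float]) -> float:
    """Weighted sum of risk-allele counts (0/1/2); weights are assumed log odds ratios."""
    if len(dosages) != len(log_odds):
        raise ValueError("one weight is required per variant")
    return sum(d * w for d, w in zip(dosages, log_odds))


def select_extremes(scores: Dict[str, float], is_case: Dict[str, bool],
                    n_cases: int, n_controls: int) -> Tuple[List[str], List[str]]:
    """Pick the cases with the lowest scores and the controls with the highest."""
    cases = sorted((s for s in scores if is_case[s]), key=lambda s: scores[s])
    controls = sorted((s for s in scores if not is_case[s]),
                      key=lambda s: scores[s], reverse=True)
    return cases[:n_cases], controls[:n_controls]


# Toy example with two variants and four hypothetical individuals.
weights = [0.5, 0.2]
dosages = {"A": [0, 0], "B": [2, 1], "C": [2, 2], "D": [0, 1]}
status = {"A": True, "B": True, "C": False, "D": False}
grs = {name: genetic_risk_score(d, weights) for name, d in dosages.items()}
low_risk_cases, high_risk_controls = select_extremes(grs, status, 1, 1)
```

In the study itself the corresponding numbers were 19 variants, 39 selected cases, and 36 selected controls; the sketch only mirrors the ranking logic.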
Affiliation(s)
- Rebecca J. Sardell
  - John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL
- Jessica N Cooke Bailey
  - Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH
- Monique D. Courtenay
  - John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL
- Patrice Whitehead
  - John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL
- Reneé A. Laux
  - Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH
- Larry D. Adams
  - John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL
- Jorge A. Fortun
  - Bascom Palmer Eye Institute, University of Miami Miller School of Medicine, Miami, FL
- Milam A. Brantley
  - Department of Ophthalmology and Visual Sciences, Vanderbilt University School of Medicine, Nashville, TN
- Jaclyn L. Kovach
  - Bascom Palmer Eye Institute, University of Miami Miller School of Medicine, Miami, FL
- Stephen G. Schwartz
  - Bascom Palmer Eye Institute, University of Miami Miller School of Medicine, Miami, FL
- Anita Agarwal
  - Department of Ophthalmology and Visual Sciences, Vanderbilt University School of Medicine, Nashville, TN
- William K. Scott
  - John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL
- Jonathan L. Haines
  - Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH
- Margaret A. Pericak-Vance
  - John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL
15
Dai H, Wu G, Wu M, Zhi D. An Optimal Bahadur-Efficient Method in Detection of Sparse Signals with Applications to Pathway Analysis in Sequencing Association Studies. PLoS One 2016; 11:e0152667. [PMID: 27380176 PMCID: PMC4933358 DOI: 10.1371/journal.pone.0152667] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 11/12/2015] [Accepted: 03/17/2016] [Indexed: 11/18/2022]
Abstract
Next-generation sequencing data pose a severe curse of dimensionality, complicating traditional "single marker-single trait" analysis. We propose a two-stage combined p-value method for pathway analysis. The first stage is at the gene level, where we integrate effects within a gene using the Sequence Kernel Association Test (SKAT). The second stage is at the pathway level, where we perform a correlated Lancaster procedure to detect joint effects from multiple genes within a pathway. We show that the Lancaster procedure is optimal in Bahadur efficiency among all combined p-value methods. The Bahadur efficiency,[Formula: see text], compares sample sizes among different statistical tests when signals become sparse in sequencing data, i.e. ε →0. The optimal Bahadur efficiency ensures that the Lancaster procedure asymptotically requires a minimal sample size to detect sparse signals ([Formula: see text]). The Lancaster procedure can also be applied to meta-analysis. Extensive empirical assessments of exome sequencing data show that the proposed method outperforms Gene Set Enrichment Analysis (GSEA). We applied the competitive Lancaster procedure to meta-analysis data generated by the Global Lipids Genetics Consortium to identify pathways significantly associated with high-density lipoprotein cholesterol, low-density lipoprotein cholesterol, triglycerides, and total cholesterol.
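The second-stage combination of gene-level p-values can be sketched in its simplest form. The version below is the classical Lancaster procedure under an independence assumption, not the correlated variant the abstract describes, and it restricts weights to even degrees of freedom so that the chi-square tail has a closed form and no external library is needed. The function names and the choice of weights are assumptions for illustration.

```python
import math
from typing import Sequence


def chi2_sf_even(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with even df (closed form)."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def chi2_isf_even(p: float, df: int) -> float:
    """Inverse of chi2_sf_even by bisection; the tail is decreasing in x."""
    if p >= 1.0:
        return 0.0
    lo, hi = 0.0, 1.0
    while chi2_sf_even(hi, df) > p:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_sf_even(mid, df) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def lancaster(pvalues: Sequence[float], dfs: Sequence[int]) -> float:
    """Combine independent p-values; each df acts as the weight of its p-value."""
    if len(pvalues) != len(dfs):
        raise ValueError("one weight is required per p-value")
    if any(not 0.0 < p <= 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    if any(df <= 0 or df % 2 for df in dfs):
        raise ValueError("this sketch supports positive even weights only")
    stat = sum(chi2_isf_even(p, df) for p, df in zip(pvalues, dfs))
    return chi2_sf_even(stat, sum(dfs))


# Hypothetical gene-level p-values (e.g. from SKAT) with equal weights.
combined = lancaster([0.01, 0.2, 0.5], [2, 2, 2])
```

With every weight set to 2 the procedure reduces to Fisher's method, which gives a convenient check on the sketch; unequal weights let larger genes contribute more to the pathway statistic.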
Affiliation(s)
- Hongying Dai
  - Health Services and Outcomes Research, Children’s Mercy Hospital, Kansas City, MO, United States of America
  - Department of Biomedical & Health Informatics, University of Missouri-Kansas City, Kansas City, MO, United States of America
- Guodong Wu
  - Lovelace Respiratory Research Institute, Albuquerque, New Mexico, United States of America
- Michael Wu
  - Biostatistics and Biomathematics Program, Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA, United States of America
- Degui Zhi
  - Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, United States of America
16
Alzheimer disease: modeling an Aβ-centered biological network. Mol Psychiatry 2016; 21:861-71. [PMID: 27021818 DOI: 10.1038/mp.2016.38] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.8] [Received: 11/06/2015] [Revised: 02/16/2016] [Accepted: 02/18/2016] [Indexed: 01/15/2023]
Abstract
In genetically complex diseases, the search for missing heritability is focusing on rare variants with large effect. Thanks to next generation sequencing technologies, genome-wide characterization of these variants is now feasible in every individual. However, a lesson from current studies is that collapsing rare variants at the gene level is often insufficient to obtain a statistically significant signal in case-control studies, and that network-based analyses are an attractive complement to classical approaches. In Alzheimer disease (AD), according to the prevalent amyloid cascade hypothesis, the pathology is driven by the amyloid beta (Aβ) peptide. In past years, based on experimental studies, several hundreds of proteins have been shown to interfere with Aβ production, clearance, aggregation or toxicity. Thanks to a manual curation of the literature, we identified 335 genes/proteins involved in this biological network and classified them according to their cellular function. The complete list of genes, or its subcomponents, will be of interest in ongoing AD genetic studies.
17
Lee HS, Bae SC. Recent advances in systemic lupus erythematosus genetics in an Asian population. Int J Rheum Dis 2014; 18:192-9. [DOI: 10.1111/1756-185x.12498] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Indexed: 11/29/2022]
Affiliation(s)
- Hye-Soon Lee
  - Hanyang University Hospital for Rheumatic Diseases, Seoul, Korea
- Sang Cheol Bae
  - Hanyang University Hospital for Rheumatic Diseases, Seoul, Korea
18
Krupp DR, Soldano KL, Garrett ME, Cope H, Ashley-Koch AE, Gregory SG. Missing genetic risk in neural tube defects: can exome sequencing yield an insight? Birth Defects Res A Clin Mol Teratol 2014; 100:642-6. [PMID: 25044326 DOI: 10.1002/bdra.23276] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 03/25/2014] [Revised: 05/30/2014] [Accepted: 05/31/2014] [Indexed: 01/12/2023]
Abstract
BACKGROUND Neural tube defects (NTDs) have a strong genetic component, with up to 70% of variance in human prevalence determined by heritable factors. Although the identification of causal DNA variants by sequencing candidate genes from functionally relevant pathways and model organisms has provided some success, alternative approaches are needed. METHODS Next generation sequencing platforms are facilitating the production of massive amounts of sequencing data, primarily from the protein coding regions of the genome, at a faster rate and lower cost than has previously been possible. These platforms are permitting the identification of variants (de novo, rare, and common) that are drivers of NTD etiology, and the cost of the approach allows for the screening of increased numbers of affected and unaffected individuals from NTD families and in simplex cases. CONCLUSION The next generation sequencing platforms represent a powerful tool in the armory of the genetics researcher to identify the causal genetic basis of NTDs.
Affiliation(s)
- Deidre R Krupp
- Duke Molecular Physiology Institute, DUMC, 300 North Duke Street, Durham, NC, 27701
19